ΕΚΠ 413 / ΕΚΠ 606 Autonomous (Robotic) Agents

Reinforcement Learning
Department of Electronic and Computer Engineering
Technical University of Crete

Review
Decision making: sequential decision problems, Markov decision processes
Optimal policies: value iteration, policy iteration
M. G. Lagoudakis, ECE Department, Technical University of Crete

Today
Reinforcement Learning (RL): problems and approaches
Prediction: temporal difference learning, least-squares temporal difference (LSTD) learning
Control: Q-learning, least-squares policy iteration (LSPI)

Reinforcement Learning Learning from Mistakes!

Machine Learning
Unsupervised learning: learning without a teacher; information: none; identify structure in the data; clustering, self-organization; k-means, Kohonen maps
Supervised learning: learning with a teacher; information: correct examples; generalize from examples; classification, approximation; SVMs, neural networks
Reinforcement learning: learning with a critic; information: trial and error; reinforce good choices; value function, control policy; TD-learning, Q-learning

Reinforcement Learning
[Diagram: agent-process loop with Action, Reward, State]
Learn how to take actions in each state of the process so as to maximize the cumulative reward!
The reward signal reinforces good decision making.
Learn from experience: (state, action, reward, next state) samples.
Samples are taken from the process or from a generative model.

Reinforcement Learning Setup
Known: states, rewards
Unknown: transition model, reward model
Significance: learning without knowing what you are learning; generic approach for agent design; very hard problem

Learning Problems
Prediction [Passive Reinforcement Learning]: learn to predict the expected total reward for a fixed action policy π.
Control [Active Reinforcement Learning]: learn to control the process to maximize the expected total reward.

Learning Methodology
[Diagrams: agent-process loop (Action, Reward, State) for each case]
model-based learning
model-free learning

Learning Environment
cooperative, competitive
single-agent, multi-agent

Process Modeling Markov Decision Processes

Markov Decision Process (MDP)
MDP (S, A, P, R, γ, D)
S: state space of the process
A: action space of the process
P: transition model, P(s' | s, a)
R: reward function, R(s, a)
γ: discount factor, 0 < γ ≤ 1
D: initial state distribution
Markov property: the next state and reward are independent of the history, given the current state and action.

Value Functions
State Value Function V
State-Action Value Function Q

Value Function Representation
Exact: use a table to represent the value function
  V: one entry for each s, O(|S|) space
  Q: one entry for each (s, a), O(|S||A|) space
  infeasible for realistic problems
Approximate: approximate the value function with a function approximator
  e.g. neural networks, polynomials, radial basis functions, ...
  need only enough space to store the approximator parameters
  equations and algorithms become harder to deal with
  convergence properties are compromised

Linear Value Function Approximation

Prediction Passive Reinforcement Learning

The Prediction Problem
Given: a fixed deterministic (or stochastic) policy π
Goal: to predict the performance of policy π; to evaluate policy π; to learn the value function $V^{\pi}(s)$ of policy π
State value function: $V^{\pi}(s) = E\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t) \mid \pi,\ s_0 = s\right]$

Grid Domain Trials
Trial 1: (1,1) -0.04 → (1,2) -0.04 → (1,3) -0.04 → (1,2) -0.04 → (1,3) -0.04 → (2,3) -0.04 → (3,3) -0.04 → (4,3) +1
Trial 2: (1,1) -0.04 → (1,2) -0.04 → (1,3) -0.04 → (2,3) -0.04 → (3,3) -0.04 → (3,2) -0.04 → (3,3) -0.04 → (4,3) +1
Trial 3: (1,1) -0.04 → (2,1) -0.04 → (3,1) -0.04 → (3,2) -0.04 → (4,2) -1

Direct Utility Estimation
direct utility estimation in adaptive control theory (1950-1960)
Idea: utility = expected total reward from state s; each trial gives one sample for each state it visits
Example:
  states:  (1,1) -0.04, (1,2) -0.04, (1,3) -0.04, (1,2) -0.04, (1,3) -0.04, (2,3) -0.04, (3,3) -0.04, (4,3) +1
  samples: 0.72, 0.76, 0.80, 0.84, 0.88, 0.92, 0.96, 1
Estimate: the average of all samples for each state
Characteristics: ignores the dependencies between utilities (Bellman equation); searches a larger space and converges slowly
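A minimal Python sketch of direct utility estimation on the example trial above. It assumes γ = 1 (no discounting) and that the listed reward is received in each visited state; the function names are illustrative and do not come from the lecture.

```python
from collections import defaultdict

def returns_to_go(trial, gamma=1.0):
    """Return (state, sampled utility) pairs for one trial.

    trial is a list of (state, reward) pairs in visiting order.
    """
    samples = []
    total = 0.0
    # Walk the trial backwards so each return reuses the one after it.
    for state, reward in reversed(trial):
        total = reward + gamma * total
        samples.append((state, total))
    samples.reverse()
    return samples

def direct_utility_estimate(trials, gamma=1.0):
    """Average all sampled returns per state."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for trial in trials:
        for state, g in returns_to_go(trial, gamma):
            sums[state] += g
            counts[state] += 1
    return {s: sums[s] / counts[s] for s in sums}

# The trial from the slide (assumed undiscounted).
trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((1, 2), -0.04),
         ((1, 3), -0.04), ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), 1.0)]
samples = returns_to_go(trial)            # 0.72, 0.76, ..., 0.96, 1.0
estimate = direct_utility_estimate([trial])
```

States visited twice, such as (1,2) and (1,3), receive the average of their two samples, which is the point of the "average of all samples" step.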

Adaptive Dynamic Programming (ADP)
learns the transition model, T(s, π(s), s')
learns the reward model, R(s)
substitutes them into the Bellman equation: $V^{\pi}(s) = R(s) + \gamma \sum_{s'} T(s, \pi(s), s')\, V^{\pi}(s')$
solves the linear system for the utilities
Characteristics: model estimation by counting; high space complexity; variant: modified policy iteration

Adaptive Dynamic Programming
function Passive-ADP-Agent(percept) returns an action
  inputs: percept, a percept indicating the current state s' and the reward signal r'
  static: π, a fixed policy
          mdp, an MDP with model T, rewards R, discount γ
          V, a table of utilities, initially empty
          N_sa, a table of frequencies for pairs (s, a), initially zero
          N_sas', a table of frequencies for triples (s, a, s'), initially zero
          s, a, the previous state and action, initially null
  if s' is new then do V[s'] ← r'; R[s'] ← r'
  if s is not null then do
      increment N_sa[s, a] and N_sas'[s, a, s']
      for each t such that N_sas'[s, a, t] is nonzero do
          T[s, a, t] ← N_sas'[s, a, t] / N_sa[s, a]
  V ← Value-Determination(π, V, mdp)
  if Terminal?[s'] then s, a ← null else s, a ← s', π[s']
  return a
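The two ADP steps (estimate the model by counting, then evaluate the policy) can be sketched in Python on the three grid trials listed earlier. This is a sketch under stated assumptions: γ = 1, the action is left implicit because the policy is fixed, and repeated Bellman sweeps stand in for the exact linear solve (closer to the modified policy iteration variant). All names are illustrative.

```python
from collections import defaultdict

def estimate_model(trials):
    """Count transitions and return (T, R) for a fixed policy."""
    counts = defaultdict(lambda: defaultdict(int))
    rewards = {}
    for trial in trials:
        for s, r in trial:
            rewards[s] = r
        for (s, _), (s_next, _) in zip(trial, trial[1:]):
            counts[s][s_next] += 1
    model = {}
    for s, successors in counts.items():
        total = sum(successors.values())
        model[s] = {t: n / total for t, n in successors.items()}
    return model, rewards

def evaluate_policy(model, rewards, gamma=1.0, sweeps=200):
    """Iterate V(s) = R(s) + gamma * sum_t T(s, t) V(t) in place."""
    V = dict(rewards)  # terminal states have no successors, so V = R there
    for _ in range(sweeps):
        for s, successors in model.items():
            V[s] = rewards[s] + gamma * sum(p * V[t] for t, p in successors.items())
    return V

def make_trial(states):
    """Attach the grid rewards (assumed: +1 at (4,3), -1 at (4,2), else -0.04)."""
    terminal_reward = {(4, 3): 1.0, (4, 2): -1.0}
    return [(s, terminal_reward.get(s, -0.04)) for s in states]

trials = [
    make_trial([(1, 1), (1, 2), (1, 3), (1, 2), (1, 3), (2, 3), (3, 3), (4, 3)]),
    make_trial([(1, 1), (1, 2), (1, 3), (2, 3), (3, 3), (3, 2), (3, 3), (4, 3)]),
    make_trial([(1, 1), (2, 1), (3, 1), (3, 2), (4, 2)]),
]
model, rewards = estimate_model(trials)
V = evaluate_policy(model, rewards)
```

With these three trials the estimated probability of moving from (1,3) to (2,3) is 2/3, since that transition was observed twice out of three departures from (1,3).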

ADP Performance
the "fastest" algorithm in terms of convergence rate
each trial takes a considerable amount of time

Temporal Difference (TD) Learning
Idea: local update of the utilities according to the dependencies
$V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha\,\bigl(R(s) + \gamma V^{\pi}(s') - V^{\pi}(s)\bigr)$
α = learning rate (reminiscent of neural networks); it decays over time to avoid "oscillations"
Example: suppose that after the first trial $V^{\pi}(1,3) = 0.84$ and $V^{\pi}(2,3) = 0.92$, and that the second trial contains the transition from (1,3) to (2,3); the difference equation then requires $V^{\pi}(1,3)$ to increase.
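The example can be checked by hand. Assuming γ = 1 and R(1,3) = -0.04 (the per-step reward in the grid trials; neither value is stated on this slide), one update for the transition from (1,3) to (2,3) gives:

```latex
\begin{aligned}
V^{\pi}(1,3) &\leftarrow V^{\pi}(1,3) + \alpha\,\bigl(R(1,3) + \gamma V^{\pi}(2,3) - V^{\pi}(1,3)\bigr) \\
             &= 0.84 + \alpha\,(-0.04 + 0.92 - 0.84) \\
             &= 0.84 + 0.04\,\alpha
\end{aligned}
```

For any α > 0 the estimate moves upward, by an amount proportional to the learning rate.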

Temporal Difference Learning

TD Algorithm
function Passive-TD-Agent(percept) returns an action
  inputs: percept, a percept indicating the current state s' and the reward signal r'
  static: π, a fixed policy
          V, a table of utilities, initially empty
          N_s, a table of frequencies for states, initially zero
          s, a, r, the previous state, action, and reward, initially null
  if s' is new then do V[s'] ← r'
  if s is not null then do
      increment N_s[s]
      V[s] ← V[s] + α(N_s[s]) (r + γ V[s'] − V[s])
  if Terminal?[s'] then s, a, r ← null else s, a, r ← s', π[s'], r'
  return a
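A compact Python rendering of the passive TD agent, run on the first grid trial. It is a sketch with assumptions: the action is omitted because the policy is fixed, γ = 1, and the learning-rate schedule α(n) = 1/n is chosen here only for illustration. The class and method names are hypothetical.

```python
from collections import defaultdict

class PassiveTDAgent:
    def __init__(self, gamma=1.0, alpha=lambda n: 1.0 / n):
        self.gamma = gamma
        self.alpha = alpha               # learning rate as a function of visit count
        self.V = {}                      # utility table, initially empty
        self.visits = defaultdict(int)   # N_s in the pseudocode
        self.prev = None                 # (state, reward) of the previous step

    def observe(self, state, reward, terminal=False):
        # A new state starts with its own reward as utility estimate.
        if state not in self.V:
            self.V[state] = reward
        # Update the previous state from the observed transition.
        if self.prev is not None:
            s, r = self.prev
            self.visits[s] += 1
            td_error = r + self.gamma * self.V[state] - self.V[s]
            self.V[s] += self.alpha(self.visits[s]) * td_error
        self.prev = None if terminal else (state, reward)

trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((1, 2), -0.04),
         ((1, 3), -0.04), ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), 1.0)]
agent = PassiveTDAgent()
for i, (state, reward) in enumerate(trial):
    agent.observe(state, reward, terminal=(i == len(trial) - 1))
```

After a single trial only the states next to the goal carry useful information, for example V[(3,3)] = 0.96; earlier states catch up over further trials, which is why TD needs more trials than ADP while doing far less work per step.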

TD Performance
"slower" and less stable than ADP
each trial takes very little time

TD with Approximation
Generic approximation: can only update the parameters of the approximator; update the parameters according to the temporal difference; use the gradient to determine the appropriate change
Linear approximation: linear combination of basis functions; update equation
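For a linear architecture the gradient of the approximate value with respect to the weights is simply the feature vector, so the TD update takes a particularly simple form. The sketch below shows this standard update; the feature vectors and step size are made-up illustrative values, and the function names are not from the lecture.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def td_linear_update(w, phi_s, reward, phi_next, alpha, gamma):
    """One TD(0) step for V(s) ~ phi(s)^T w.

    The weights move along phi(s), scaled by the temporal difference.
    Pass an all-zero phi_next for a terminal next state.
    """
    td_error = reward + gamma * dot(phi_next, w) - dot(phi_s, w)
    return [wi + alpha * td_error * fi for wi, fi in zip(w, phi_s)]

# Two illustrative updates with features phi(s) = [1, s].
w1 = td_linear_update([0.0, 0.0], [1.0, 0.0], -1.0, [1.0, 1.0], alpha=0.5, gamma=1.0)
w2 = td_linear_update(w1, [1.0, 1.0], -1.0, [0.0, 0.0], alpha=0.5, gamma=1.0)
```

Unlike the tabular case, each update changes the value estimate of every state whose features overlap with phi(s).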

TD with Linear Approximation

Least-Squares Temporal Difference
TD is trying to solve a linear system incrementally.
Idea: collect all data and solve the Bellman equation at once; the solution (true value function) satisfies the fixed point property
Linear architectures: trying to find the best point in the space of approximator parameters; enforce the fixed point property under orthogonal projection; the solution is a fixed-point approximation to the true value function
Properties: efficient use of all samples at once; elimination of learning rate, schedules, oscillations, ...
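A small Python sketch of the least-squares idea for a two-parameter linear architecture. It accumulates A = Σ φ(s)(φ(s) − γ φ(s'))ᵀ and b = Σ φ(s) r over all samples and solves A w = b once, with no learning rate. The chain domain, the features φ(s) = [1, s], and the closed-form 2×2 solve are assumptions made to keep the example short.

```python
def lstd_2d(samples, phi, gamma):
    """LSTD for two basis functions.

    samples: list of (s, r, s_next); s_next is None for a terminal transition.
    Returns the weights w with V(s) ~ phi(s)^T w.
    """
    A = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for s, r, s_next in samples:
        f = phi(s)
        f_next = phi(s_next) if s_next is not None else [0.0, 0.0]
        diff = [f[j] - gamma * f_next[j] for j in range(2)]
        for i in range(2):
            b[i] += f[i] * r
            for j in range(2):
                A[i][j] += f[i] * diff[j]
    # Solve the 2x2 system A w = b with Cramer's rule.
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    w0 = (b[0] * A[1][1] - A[0][1] * b[1]) / det
    w1 = (A[0][0] * b[1] - A[1][0] * b[0]) / det
    return [w0, w1]

def phi(s):
    return [1.0, float(s)]

# Hypothetical chain 0 -> 1 -> 2 -> terminal, reward -1 per step, gamma = 1.
samples = [(0, -1.0, 1), (1, -1.0, 2), (2, -1.0, None)]
w = lstd_2d(samples, phi, gamma=1.0)
```

In this chain the true value function is V(s) = s − 3, which the features can represent exactly, so the three samples are enough for LSTD to recover the weights (−3, 1) in one shot.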

LSTD Algorithm

LSTD Performance
from [Boyan, 2000]

Control Active Reinforcement Learning

The Control Problem
Given: experience samples (s, a, r, s')
Goal: to learn a good policy
Idea: a better policy can be retrieved from a state-action value function
State-action value function
Policy

Greedy Policy
Greedy (improved) policy over V
Greedy (improved) policy over Q
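The practical difference between the two greedy policies is that the one over V needs a model for a one-step lookahead, while the one over Q is a plain maximization. A minimal Python sketch, with dictionary-based tables and toy numbers that are purely illustrative:

```python
def greedy_from_q(Q, state, actions):
    """Greedy action over Q: no model required."""
    return max(actions, key=lambda a: Q[(state, a)])

def greedy_from_v(V, state, actions, R, P, gamma):
    """Greedy action over V: one-step lookahead through the model (R, P)."""
    def backup(a):
        expected = sum(p * V[t] for t, p in P[(state, a)].items())
        return R[(state, a)] + gamma * expected
    return max(actions, key=backup)

actions = ["left", "right"]

# Toy state-action values for a single state "s".
Q = {("s", "left"): 0.2, ("s", "right"): 0.7}
choice_q = greedy_from_q(Q, "s", actions)

# Toy model: "left" always leads to x, "right" reaches y half of the time.
V = {"x": 0.0, "y": 1.0}
R = {("s", "left"): 0.0, ("s", "right"): 0.0}
P = {("s", "left"): {"x": 1.0}, ("s", "right"): {"x": 0.5, "y": 0.5}}
choice_v = greedy_from_v(V, "s", actions, R, P, gamma=0.9)
```

This is why model-free control methods work with Q rather than V.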

Using a Greedy Policy
always executes the best action with respect to the estimated value function
suboptimal initial trials mislead the search
leaves regions of the state space unexplored!

Exploitation and Exploration
Exploitation: use of the greedy policy for short-term maximization of the reward
Exploration: selection of random moves to improve/extend the estimate of the value function, aiming at long-term benefits
Exploration vs. Exploitation Dilemma: explore or exploit?
Optimal scheme: a policy that is greedy in the limit of infinite exploration, Greedy in the Limit of Infinite Exploration (GLIE)
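One common way to trade exploration against exploitation is ε-greedy action selection with a decaying ε. The sketch below is one possible scheme and is not taken from the slide: with probability ε it picks a random action, otherwise the greedy one, and it uses ε = 1/t as an assumed decay schedule in the spirit of GLIE.

```python
import random

def glie_epsilon(t):
    """Exploration probability at time step t >= 1 (assumed schedule)."""
    return 1.0 / t

def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Random action with probability epsilon, greedy action otherwise."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

actions = ["left", "right"]
Q = {("s", "left"): 0.2, ("s", "right"): 0.7}
rng = random.Random(0)

exploit = epsilon_greedy(Q, "s", actions, epsilon=0.0, rng=rng)  # always greedy
explore = epsilon_greedy(Q, "s", actions, epsilon=1.0, rng=rng)  # always random
eps_at_4 = glie_epsilon(4)
```

Early on the agent acts almost randomly; as t grows the policy approaches the greedy one while every action keeps a nonzero chance of being tried.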

Optimistic Initialization
Value iteration with "optimistic" utilities
$V^{+}(s)$: optimistic utility estimate
$N(a, s)$: number of times action a has been tried in state s
$V^{+}(s) \leftarrow R(s) + \gamma \max_{a} f\Bigl(\sum_{s'} T(s, a, s')\, V^{+}(s'),\ N(a, s)\Bigr)$
$f(u, n)$: exploration function, with $f(u, n) = R^{+}$ if $n < N_e$ and $f(u, n) = u$ otherwise
$R^{+}$: upper bound on the utilities
ensures that every pair (a, s) will be tried at least $N_e$ times
The $V^{+}$ values can be initialized to $R^{+}$ for all unknown utilities.
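The exploration function and a single optimistic backup can be written in a few lines of Python. The dictionaries, the toy numbers (R⁺ = 1, N_e = 5), and the convention that an untried action has an empty successor distribution are assumptions for illustration only.

```python
def exploration_f(u, n, r_plus, n_e):
    """Return the optimistic bound until the pair has been tried n_e times."""
    return r_plus if n < n_e else u

def optimistic_backup(s, actions, R, T, V_plus, N, gamma, r_plus, n_e):
    """One backup of V+(s) using the exploration function f(u, n)."""
    def value(a):
        # Untried actions have no estimated successors yet (empty dict).
        expected = sum(p * V_plus[t] for t, p in T.get((s, a), {}).items())
        return exploration_f(expected, N.get((a, s), 0), r_plus, n_e)
    return R[s] + gamma * max(value(a) for a in actions)

# Toy setup: a1 is well explored, a2 has never been tried.
R = {"s": -0.04}
T = {("s", "a1"): {"t": 1.0}}
V_plus = {"t": 0.5}
N = {("a1", "s"): 10}

v = optimistic_backup("s", ["a1", "a2"], R, T, V_plus, N,
                      gamma=1.0, r_plus=1.0, n_e=5)
```

Because a2 has been tried fewer than N_e times it is valued at R⁺, so the backup returns -0.04 + 1.0 and the agent is drawn toward the untried action.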

Example
Observations: (a) some utilities are slow to converge; (b) fast convergence to the optimal policy

Bellman Optimality Equation
a non-linear system with unknowns Q
can be solved iteratively
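For reference, the Bellman optimality equation in the notation of the MDP definition (S, A, P, R, γ), written here in its standard textbook form rather than copied from the slide, together with the fixed-point iteration that solves it:

```latex
Q^{*}(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q^{*}(s',a')

Q_{k+1}(s,a) \leftarrow R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q_{k}(s',a')
```

The max over a' is what makes the system non-linear in the unknowns Q.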

Q-learning
Properties: requires a huge amount of samples; requires appropriate settings for the learning rate; makes minimal use of each sample
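A sketch of the standard tabular Q-learning update, which uses each sample (s, a, r, s') once and then discards it. The table layout, state names, and parameter values below are illustrative assumptions.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma, terminal=False):
    """Move Q(s, a) toward r + gamma * max_b Q(s_next, b)."""
    best_next = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)   # all values start at zero
actions = [0, 1]

# Sample 1: taking action 1 in s1 ends the episode with reward 1.
q_learning_update(Q, "s1", 1, 1.0, "end", actions, alpha=0.5, gamma=0.9, terminal=True)
# Sample 2: action 0 in s0 leads to s1 with no reward.
q_learning_update(Q, "s0", 0, 0.0, "s1", actions, alpha=0.5, gamma=0.9)
```

The second update only benefits from the first because the samples happened to arrive in that order, which illustrates why many samples and a well-tuned learning rate are needed.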

Policy Iteration
[Diagram: loop between Policy Evaluation (Critic) and Policy Improvement (Actor), with a Model box]
Policy Evaluation (Critic): O(|S|^3 |A|^3) time
Value Function Q^π: Θ(|S||A|) space
Policy Improvement (Actor): O(|S||A|) time
Policy π: Θ(|S|) space

Approximate Policy Iteration
[Diagram: the policy iteration loop with Policy Evaluation (Critic), Value Function Projection, Approximate Value Function $\hat{Q}^{\hat{\pi}}$ (error ε), Policy Improvement (Actor), Policy Projection, Approximate Policy $\hat{\pi}$ (error δ), and Model]

The Main Idea
[Diagram: the approximate policy iteration loop with Policy Evaluation (Critic), Value Function Projection, Approximate Value Function $\hat{Q}^{\hat{\pi}}$ (error ε), Policy Improvement (Actor), Policy Projection, Approximate Policy $\hat{\pi}$ (error δ), Model, and Samples]

Fixed Point Approximation

Orthogonal Projection

The LSTDQ Algorithm

Least-Squares Policy Iteration
Approximate Value Function: linear architecture, $\hat{Q}^{\pi} = \phi^{T} w$
Policy Evaluation and Projection: LSTDQ
Policy Improvement: maximization
Policy: greedy policy over $\hat{Q}^{\pi}$
Input: samples
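The loop above can be sketched in plain Python: LSTDQ evaluates the current greedy policy from a fixed batch of samples, and the next policy is simply the greedy policy over the new weights. This is a sketch under assumptions, not the lecture's listing: the two-state toy MDP, the one-hot features, and the small Gaussian-elimination helper are all illustrative choices.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    A singular A (e.g. from poorly distributed samples) raises ZeroDivisionError.
    """
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def q_value(w, phi, s, a):
    return sum(wi * fi for wi, fi in zip(w, phi(s, a)))

def greedy(w, phi, s, actions):
    return max(actions, key=lambda a: q_value(w, phi, s, a))

def lstdq(samples, phi, k, gamma, w_policy, actions):
    """Evaluate the greedy policy over w_policy from (s, a, r, s') samples."""
    A = [[0.0] * k for _ in range(k)]
    b = [0.0] * k
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, greedy(w_policy, phi, s_next, actions))
        for i in range(k):
            b[i] += f[i] * r
            for j in range(k):
                A[i][j] += f[i] * (f[j] - gamma * f_next[j])
    return solve(A, b)

def lspi(samples, phi, k, gamma, actions, tol=1e-9, max_iter=20):
    """Repeat LSTDQ until the weights stop changing."""
    w = [0.0] * k
    for _ in range(max_iter):
        w_new = lstdq(samples, phi, k, gamma, w, actions)
        if max(abs(x - y) for x, y in zip(w, w_new)) < tol:
            return w_new
        w = w_new
    return w

def phi(s, a):
    """One-hot features over the four (state, action) pairs."""
    f = [0.0] * 4
    f[2 * s + a] = 1.0
    return f

# Toy MDP: action 0 stays, action 1 switches state; reward 1 for landing in state 1.
samples = [(0, 0, 0.0, 0), (0, 1, 1.0, 1), (1, 0, 1.0, 1), (1, 1, 0.0, 0)]
actions = [0, 1]
w = lspi(samples, phi, k=4, gamma=0.9, actions=actions)
policy = {s: greedy(w, phi, s, actions) for s in (0, 1)}
```

The same four samples are reused in every iteration, which is the sample efficiency the method is known for. With one-hot features the approximation is exact, so the loop behaves like ordinary policy iteration and settles on "switch in state 0, stay in state 1".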

The LSPI Algorithm

LSPI Properties
Properties
  Quality: learns policies of bounded quality
  Stability: is stable; does not diverge
  Efficiency: makes efficient use and reuse of training samples
  Scalability: successfully handles large-scale problems
Advantages
  allows great flexibility in choosing/using basis functions
  poses no restrictions on sample collection
  is simple and easy to implement
Limitations
  cannot guarantee convergence to the optimal solution
  with badly distributed samples, the iteration may oscillate
  with insufficient basis functions, LSPI may converge to a poor policy

Experimentation Put RL to work!

Inverted Pendulum

Pendulum: Learning Parameters

Pendulum: Results
[Figure panels: LSPI, Q-learning]

Bicycle Balancing and Riding

Bicycle Learning Parameters
Features: k = 100 (20 basis functions for each action)
Samples: collected from random episodes; starting at a random state around the initial position; following a purely random policy for only 20 steps; only 20 minutes worth of operating time!

Bicycle Learning Results

Bicycle Learning Performance

Tetris

Tetris: Learning Parameters

Tetris: Results

Robot Learning Video Clips
tetris-before, tetris-after
07-initial.mpg, 08-finished.mpg
09-ers7-slow.avi, 10-ers7-fast.avi
11-SwingUp.avi, 12-PoleBalancing.mov, 13-pole-balance.mov
14-airhockey.avi, 15-maze.avi

Reading
Textbook: Chapter 21
Articles:
L. Kaelbling, M. Littman, A. Moore, Reinforcement Learning: A Survey, Journal of Artificial Intelligence Research 4, 237-285, 1996.
S. Bradtke, A. Barto, Linear Least-Squares Algorithms for Temporal Difference Learning, Machine Learning 22(1-3), 33-57, 1996.
M. Lagoudakis, R. Parr, Least-Squares Policy Iteration, Journal of Machine Learning Research 4, 1107-1149, 2003.