Reinforcement Learning


Reinforcement Learning. Michèle Sebag; TP: Diviyan Kalainathan. TAO, CNRS, INRIA, Université Paris-Sud. Nov. 24th, 2016. Credit for slides: R. Sutton, F. Stulp.

Types of Machine Learning problems. The world provides data to the user: with observations only, the goal is to understand/code the data (unsupervised learning); with observations plus a target, the goal is to predict (classification/regression, supervised learning); with observations plus rewards, the goal is to decide an action (policy/strategy, reinforcement learning).

Main specificities of Reinforcement Learning: delayed rewards; the learner can evaluate itself in an autonomous way (evaluate what the learner did, not what it should/could have done); the exploration/exploitation dilemma.

Reinforcement Learning Generalities. An agent, spatially and temporally situated, in a stochastic and uncertain environment. Goal: select an action at each time step, in order to maximize the expected cumulative reward over a time horizon. What is learned? A policy = strategy = { state → action }.
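As a minimal sketch of "policy = { state → action }": over a finite state space, a deterministic policy can literally be stored as a lookup table. The state and action names below are made up for illustration, not taken from the course.

```python
# A deterministic policy for a tiny, hypothetical world: state -> action.
policy = {
    "start":    "east",
    "corridor": "east",
    "door":     "north",
    "treasure": "stay",   # absorbing goal state
}

def act(state):
    """Return the action the policy prescribes in the given state."""
    return policy[state]

print(act("corridor"))  # -> "east"
```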

Reinforcement Learning Context. An unknown world. Some actions, in some states, bear rewards, with some delay [with some probability]. Find a policy (state → action). Goal: maximize the expected reward.

The Reinforcement Learning AIC Master Module. Contents: 1. Position of the problem, vocabulary: Markov Decision Process, policy, expected returns. 2. Finite case: dynamic programming; value estimation, Monte Carlo, TD. 3. Infinite case: function approximation, direct policy search. 4. More: inverse RL, multi-armed bandits, deep RL. Evaluation: project, exam, voluntary contributions (5-minute talks, adding resources).

Pointers. The book: Reinforcement Learning: An Introduction, R. Sutton and A. Barto, on-line at http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html. Slides on-line at https://tao.lri.fr/tiki-index.php?page=courses and http://perso.ensta-paristech.fr/~stulp/aic/; video tutorial at https://www.microsoft.com/en-us/research/video/tutorial-introduction-to-reinforcement-learning-with-function-approximation/. Suggested: see the videos *before* the course. Let's discuss during the course what is unclear and what could be done otherwise.

Overview. Introduction; position of the problem; applications; formalisation: the value function; values.

Facets. Disciplines: machine learning, economics, psychology, neurosciences, control, mathematics. Means: rewards, incentives, conditioning; bounded rationality.

Reinforcement Learning. "Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connection with the situation weakened, so that when it recurs, they will be less likely to recur; the greater the satisfaction or discomfort, the greater the strengthening or weakening of the link." Thorndike, 1911.

Formal background. Notations: state space S; action space A; transition model p(s, a, s') ∈ [0, 1]; reward r(s). Goal: find a policy π : S → A maximizing E[π] = expected cumulative reward (detail later).
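To make the notation concrete, here is a sketch of a toy MDP with the ingredients above (state space S, action space A, transition probabilities p(s, a, s'), rewards r(s)). The states, actions and numbers are invented for illustration only.

```python
# Toy MDP, purely illustrative: two states, two actions.
S = ["s0", "s1"]
A = ["left", "right"]

# Transition model p(s, a, s'): probability of landing in s' after doing a in s.
p = {
    ("s0", "left"):  {"s0": 1.0},
    ("s0", "right"): {"s0": 0.2, "s1": 0.8},
    ("s1", "left"):  {"s0": 0.9, "s1": 0.1},
    ("s1", "right"): {"s1": 1.0},
}

# Reward attached to states, as on the slide: r(s).
r = {"s0": 0.0, "s1": 1.0}

# Each row of p must be a probability distribution over next states.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in p.values())
```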

Applications. Robotics: navigation, football, walking, ... Control: helicopters, elevators, telecom, smart grids, manufacturing, ... Operations research: transport, scheduling, ... Games: Backgammon, Othello, Tetris, Go, ... Other: computer-human interfaces, ML (feature selection, active learning, natural language processing, ...).

Myths. 1. Pandora (the box). 2. Golem (Prague). 3. The chess player (The Turk), Edgar Allan Poe. 4. Robota. 5. Movies: Metropolis, 2001: A Space Odyssey, AI, Her, ...

Types of robots: 1. Manufacturing: closed world, target behavior known; the task is decomposed into subtasks; a subtask is a sequence of actions; no surprise.

Types of robots: 2. Autonomous vehicles: open world; the task is to navigate; actions are subject to preconditions.

Types of robots: 3. Home robots: open world; a sequence of tasks; each task requires navigation and planning.

Robot: https://www.microsoft.com/en-us/research/video/tutorial-introduction-to-reinforcement-learning-with-function-approximation/ 5:4

Vocabulary 1/3. State of the robot: set of states S. A state: all information related to the robot (sensor information; memory). Discrete? Continuous? Dimension? Action of the robot: set of actions A; values of the robot motors/actuators, e.g. a robotic arm with 39 degrees of freedom (possible restriction: not every action is usable in every state). Transition model: how the state changes depending on the action; deterministically, tr : S × A → S, or probabilistically, p : S × A × S → [0, 1]. Simulator; forward model; deterministic or probabilistic transition.
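The two forms of transition model on this slide can be sketched as follows; the grid dynamics and the 0.8 success probability are invented for illustration. A deterministic forward model returns the next state directly, a probabilistic one is sampled from.

```python
import random

# Deterministic forward model: tr(s, a) -> s'
def tr(state, action):
    x, y = state
    dx, dy = {"east": (1, 0), "west": (-1, 0), "north": (0, 1), "south": (0, -1)}[action]
    return (x + dx, y + dy)

# Probabilistic model: with probability 0.8 the action succeeds,
# with probability 0.2 the robot stays put (e.g. wheel slip).
def step(state, action, p_success=0.8):
    return tr(state, action) if random.random() < p_success else state

print(tr((0, 0), "east"))    # (1, 0), always
print(step((0, 0), "east"))  # (1, 0) most of the time, (0, 0) otherwise
```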

Vocabulary 2/3. Rewards: any guidance available, r : S × A → ℝ. How to provide rewards in simulation? In real life? What about the robot's safety? Policy: mapping from states to actions, deterministic π : S → A or stochastic π : S × A → [0, 1]. This is the goal: finding a good policy; good means reaching the goal, i.e. receiving as many rewards as possible, as early as possible.
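A policy can thus be deterministic (one action per state, as in the table sketched earlier) or stochastic (one distribution over actions per state). A small illustrative sketch, with made-up state names:

```python
import random

# Deterministic policy pi : S -> A (one action per state).
def pi_det(state):
    return {"far_from_goal": "east", "near_goal": "north"}[state]

# Stochastic policy pi : S x A -> [0, 1], stored as one distribution per state.
pi_stoch = {
    "far_from_goal": {"east": 0.7, "north": 0.3},
    "near_goal":     {"north": 1.0},
}

def sample_action(state):
    dist = pi_stoch[state]
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs, k=1)[0]

print(pi_det("far_from_goal"), sample_action("far_from_goal"))
```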

Vocabulary 3/3. Episodic task: reaching a goal (playing a game, painting a car, putting something in the dishwasher); do it as soon as possible; the time horizon is finite. Continual task: reaching and keeping a state (pole balancing, car driving); do it as long as you can; the time horizon is (in principle) infinite.

Case 1. Optimal control 1/2

Case 1. Optimal control 2/2. Known dynamics and target behavior: state u, action a → new state u'; wanted: a sequence of states. Approaches: inverse problem; optimal control. Challenges: model errors, uncertainties; stability.

Case 2. Reactive behaviors. The terrain; the 2005 DARPA Challenge; the sensors.

Case 3. Planning. An instance of a reinforcement learning / planning problem: 1. a solution = a sequence of (state, action); 2. in each state, decide the appropriate action; 3. ...such that in the end, you reach the goal.

Games. Backgammon: TD-Gammon, 1992. Atari: 2015. Go: AlphaGo, 2016.

Overview. Introduction; position of the problem; applications; formalisation: the value function; values.

Formalisation. Notations: state space S; action space A; transition model, deterministic s' = t(s, a) or probabilistic p(s, a, s') ∈ [0, 1]; reward r(s), bounded. Goal: given a time horizon H (finite or infinite), find a policy (strategy) π : S → A which maximizes the cumulative reward from now to timestep H, ∑_{t=0}^{H} r(s_t); or the discounted cumulative reward ∑_{t=0}^{H} γ^t r(s_t), with 0 ≤ γ < 1; or, in the stochastic case, the expected discounted cumulative reward E_{s,π} [ ∑_{t=0}^{∞} γ^t r(s_t) ].
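As a worked illustration of the discounted objective, here is the return of a short, made-up reward sequence with γ = 0.9 (both the rewards and γ are chosen arbitrarily):

```python
# Discounted cumulative reward: sum_{t=0}^{H} gamma^t * r(s_t)
gamma = 0.9
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]   # r(s_0), ..., r(s_4), made-up values

ret = sum(gamma ** t * r_t for t, r_t in enumerate(rewards))
print(ret)  # 0.9**2 * 1 + 0.9**4 * 5 = 0.81 + 3.2805 = 4.0905
```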

Markov Decision Process. But can we define P^a_{ss'} and r(s)? YES, if all necessary information is in s; NO otherwise: if the state is partially observable, or if the environment (reward and transition distribution) is changing, e.g. a reward for the *first* photo of an object by the satellite. The Markov assumption: P(s_{h+1} | s_0 a_0 s_1 a_1 ... s_h a_h) = P(s_{h+1} | s_h a_h). Everything you need to know is the current (state, action).

Find the treasure Single reward: on the treasure.

Wandering robot Nothing happens...

The robot finds it

The robot updates its value function: V(s, a) = distance to the treasure along the trajectory.

Reinforcement learning. * The robot most often selects a = arg max_a V(s, a), * and sometimes explores (selects another action). * Lucky exploration: it finds the treasure again.
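The "mostly greedy, sometimes explore" behaviour described here is commonly implemented as an ε-greedy rule; a sketch, where the value table V and the value of ε are illustrative choices, not prescribed by the course:

```python
import random

epsilon = 0.1           # exploration rate, illustrative choice
actions = ["north", "south", "east", "west"]
V = {}                  # value table V[(state, action)], filled in as the robot learns

def select_action(state):
    """Mostly greedy w.r.t. V, sometimes a random exploratory action."""
    if random.random() < epsilon or all((state, a) not in V for a in actions):
        return random.choice(actions)                                      # explore
    return max(actions, key=lambda a: V.get((state, a), float("-inf")))    # exploit
```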

Updates the value function * Value function tells how far you are from the treasure given the known trajectories.
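One simple way to realise "how far you are from the treasure, given the known trajectories", as the slide describes informally, is to walk each successful trajectory backwards and record the number of remaining steps to the goal (negated, so that "higher is better" and arg max V works). The trajectory below is made up.

```python
# A successful trajectory: list of (state, action) pairs ending at the treasure.
trajectory = [((0, 0), "east"), ((1, 0), "east"), ((2, 0), "north"), ((2, 1), "north")]

V = {}  # V[(state, action)] ~ negated number of steps to the treasure
steps_to_goal = 0
for state, action in reversed(trajectory):
    steps_to_goal += 1
    # Keep the best (shortest) distance seen so far for this (state, action).
    V[(state, action)] = max(V.get((state, action), float("-inf")), -steps_to_goal)

print(V[((0, 0), "east")])  # -4: four steps from the treasure along this trajectory
```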

Finally * Value function tells how far you are from the treasure

Finally, let's be greedy: select the action maximizing the value function.

You are the learner. Rich Sutton: https://www.microsoft.com/en-us/research/video/tutorial-introduction-to-reinforcement-learning-with-function-approximation/ 5:3

Exercise: uniform policy. States: the squares of the grid. Actions: north, south, east, west. Rewards: -1 if you would get outside; 10 in A; 5 in B. Transitions: as expected, except A → A' and B → B' (A → A', reward 10; B → B', reward 5). Compute the value function.
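This is the classic 5×5 gridworld used in Sutton and Barto. A sketch of iterative policy evaluation for the uniform random policy follows; the discount factor γ = 0.9 and the grid positions of A, A', B, B' are not given on the slide and are assumptions taken from the standard textbook example.

```python
import numpy as np

# 5x5 gridworld sketch; gamma and the cell positions below are assumptions.
N, gamma = 5, 0.9
A_pos, A_prime, A_reward = (0, 1), (4, 1), 10.0
B_pos, B_prime, B_reward = (0, 3), (2, 3), 5.0
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # north, south, west, east

def step(s, move):
    """Return (next state, reward) for taking `move` in state `s`."""
    if s == A_pos:
        return A_prime, A_reward
    if s == B_pos:
        return B_prime, B_reward
    r, c = s[0] + move[0], s[1] + move[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return s, -1.0                            # would leave the grid: stay, reward -1

# Iterative policy evaluation under the uniform random policy (prob 1/4 per action).
V = np.zeros((N, N))
for _ in range(1000):
    V_new = np.zeros_like(V)
    for r in range(N):
        for c in range(N):
            for move in moves:
                (nr, nc), reward = step((r, c), move)
                V_new[r, c] += 0.25 * (reward + gamma * V[nr, nc])
    V = V_new

print(np.round(V, 1))   # value function of the uniform policy
```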