Deterministic Policy Gradient Algorithms: Supplementary Material

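A. Regularity Conditions

Within the text we have referred to regularity conditions on the MDP:

Regularity conditions A.1: $p(s'|s,a)$, $\nabla_a p(s'|s,a)$, $\mu_\theta(s)$, $\nabla_\theta \mu_\theta(s)$, $r(s,a)$, $\nabla_a r(s,a)$, $p_1(s)$ are continuous in all parameters and variables $s$, $a$, $s'$ and $x$.

Regularity conditions A.2: there exist $b$ and $L$ such that $\sup_s p_1(s) < b$, $\sup_{a,s,s'} p(s'|s,a) < b$, $\sup_{a,s} r(s,a) < b$, $\sup_{a,s,s'} \|\nabla_a p(s'|s,a)\| < L$, and $\sup_{a,s} \|\nabla_a r(s,a)\| < L$.

B. Proof of Theorem 1

Proof of Theorem 1. The proof follows the same lines as the standard stochastic policy gradient theorem in Sutton et al. (1999). Note that the regularity conditions A.1 imply that $V^{\mu_\theta}(s)$ and $\nabla_\theta V^{\mu_\theta}(s)$ are continuous functions of $\theta$ and $s$, and the compactness of $\mathcal{S}$ further implies that for any $\theta$, $\|\nabla_\theta V^{\mu_\theta}(s)\|$, $\|\nabla_a Q^{\mu_\theta}(s,a)|_{a=\mu_\theta(s)}\|$ and $\|\nabla_\theta \mu_\theta(s)\|$ are bounded functions of $s$. These conditions are needed to exchange derivatives and integrals, and the order of integration, in the following proof. We have,

$$
\begin{aligned}
\nabla_\theta V^{\mu_\theta}(s)
&= \nabla_\theta Q^{\mu_\theta}(s, \mu_\theta(s)) \\
&= \nabla_\theta \left( r(s, \mu_\theta(s)) + \int_{\mathcal{S}} \gamma\, p(s'|s, \mu_\theta(s))\, V^{\mu_\theta}(s')\, ds' \right) \\
&= \nabla_\theta \mu_\theta(s)\, \nabla_a r(s,a)|_{a=\mu_\theta(s)} + \nabla_\theta \int_{\mathcal{S}} \gamma\, p(s'|s, \mu_\theta(s))\, V^{\mu_\theta}(s')\, ds' \\
&= \nabla_\theta \mu_\theta(s)\, \nabla_a r(s,a)|_{a=\mu_\theta(s)} \\
&\quad + \int_{\mathcal{S}} \gamma \left( p(s'|s, \mu_\theta(s))\, \nabla_\theta V^{\mu_\theta}(s') + \nabla_\theta \mu_\theta(s)\, \nabla_a p(s'|s,a)|_{a=\mu_\theta(s)}\, V^{\mu_\theta}(s') \right) ds' \qquad (1) \\
&= \nabla_\theta \mu_\theta(s)\, \nabla_a \left( r(s,a) + \int_{\mathcal{S}} \gamma\, p(s'|s,a)\, V^{\mu_\theta}(s')\, ds' \right)\Big|_{a=\mu_\theta(s)} + \int_{\mathcal{S}} \gamma\, p(s'|s, \mu_\theta(s))\, \nabla_\theta V^{\mu_\theta}(s')\, ds' \\
&= \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)|_{a=\mu_\theta(s)} + \int_{\mathcal{S}} \gamma\, p(s \to s', 1, \mu_\theta)\, \nabla_\theta V^{\mu_\theta}(s')\, ds'.
\end{aligned}
$$

In (1) we used the Leibniz integral rule to exchange the order of derivative and integration, which requires the regularity conditions, specifically continuity of $p(s'|s,a)$, $\mu_\theta(s)$, $V^{\mu_\theta}(s)$ and their derivatives w.r.t. $\theta$. Now, iterating this formula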

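we have,

$$
\begin{aligned}
\nabla_\theta V^{\mu_\theta}(s)
&= \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)|_{a=\mu_\theta(s)} \\
&\quad + \int_{\mathcal{S}} \gamma\, p(s \to s', 1, \mu_\theta)\, \nabla_\theta \mu_\theta(s')\, \nabla_a Q^{\mu_\theta}(s',a)|_{a=\mu_\theta(s')}\, ds' \\
&\quad + \int_{\mathcal{S}} \gamma\, p(s \to s', 1, \mu_\theta) \int_{\mathcal{S}} \gamma\, p(s' \to s'', 1, \mu_\theta)\, \nabla_\theta V^{\mu_\theta}(s'')\, ds''\, ds' \\
&= \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)|_{a=\mu_\theta(s)} \\
&\quad + \int_{\mathcal{S}} \gamma\, p(s \to s', 1, \mu_\theta)\, \nabla_\theta \mu_\theta(s')\, \nabla_a Q^{\mu_\theta}(s',a)|_{a=\mu_\theta(s')}\, ds' \\
&\quad + \int_{\mathcal{S}} \gamma^2\, p(s \to s'', 2, \mu_\theta)\, \nabla_\theta V^{\mu_\theta}(s'')\, ds'' \qquad (2) \\
&\;\;\vdots \\
&= \int_{\mathcal{S}} \sum_{t=0}^{\infty} \gamma^t\, p(s \to s', t, \mu_\theta)\, \nabla_\theta \mu_\theta(s')\, \nabla_a Q^{\mu_\theta}(s',a)|_{a=\mu_\theta(s')}\, ds'.
\end{aligned}
$$

In (2) we used Fubini's theorem to exchange the order of integration, which requires the regularity conditions so that $\|\nabla_\theta V^{\mu_\theta}(s)\|$ is bounded. Now taking the expectation over $S_1$ we have,

$$
\begin{aligned}
\nabla_\theta J(\mu_\theta)
&= \nabla_\theta \int_{\mathcal{S}} p_1(s)\, V^{\mu_\theta}(s)\, ds \\
&= \int_{\mathcal{S}} p_1(s)\, \nabla_\theta V^{\mu_\theta}(s)\, ds \qquad (3) \\
&= \int_{\mathcal{S}} \int_{\mathcal{S}} \sum_{t=0}^{\infty} \gamma^t\, p_1(s)\, p(s \to s', t, \mu_\theta)\, \nabla_\theta \mu_\theta(s')\, \nabla_a Q^{\mu_\theta}(s',a)|_{a=\mu_\theta(s')}\, ds'\, ds \\
&= \int_{\mathcal{S}} \rho^{\mu_\theta}(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s,a)|_{a=\mu_\theta(s)}\, ds,
\end{aligned}
$$

where in (3) we used the Leibniz integral rule to exchange derivative and integral, which requires the regularity conditions, specifically that $p_1(s)$ and $V^{\mu_\theta}(s)$ and their derivatives w.r.t. $\theta$ are continuous. In the final line we again used Fubini's theorem to exchange the order of integration, which requires the boundedness of the integrand as implied by the regularity conditions.

C. Proof of Theorem 2

We first restate Theorem 2 in detail, with discussion, and then prove the theorem. We first make a preliminary definition:

Conditions B1: Functions $\nu_\sigma$ parametrized by $\sigma$ are said to be a regular delta-approximation on $\mathcal{R} \subseteq \mathcal{A}$ if they satisfy the following conditions:

1. The distributions $\nu_\sigma$ converge to a delta distribution: $\lim_{\sigma \downarrow 0} \int_{\mathcal{A}} \nu_\sigma(a', a) f(a)\, da = f(a')$ for $a' \in \mathcal{R}$ and suitably smooth $f$. Specifically we require that this convergence is uniform in $a'$ and over any class $\mathcal{F}$ of $L$-Lipschitz and bounded functions, $\|\nabla_a f(a)\| < L < \infty$, $\sup_a f(a) < b < \infty$, i.e.:
$$
\lim_{\sigma \downarrow 0} \sup_{f \in \mathcal{F},\, a' \in \mathcal{R}} \left| \int_{\mathcal{A}} \nu_\sigma(a', a) f(a)\, da - f(a') \right| = 0.
$$
2. For each $a' \in \mathcal{R}$, $\nu_\sigma(a', \cdot)$ is supported on some compact $C_{a'} \subseteq \mathcal{A}$ with Lipschitz boundary $\mathrm{bd}(C_{a'})$, vanishes on the boundary and is continuously differentiable on $C_{a'}$.
3. For each $a' \in \mathcal{R}$ and each $a \in \mathcal{A}$, the gradient $\nabla_{a'} \nu_\sigma(a', a)$ exists.
4. Translation invariance: for all $a \in \mathcal{A}$, $a' \in \mathcal{R}$, and any $\delta \in \mathbb{R}^n$ such that $a + \delta \in \mathcal{A}$, $a' + \delta \in \mathcal{A}$, $\nu_\sigma(a', a) = \nu_\sigma(a' + \delta, a + \delta)$.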
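The first condition (convergence to a delta distribution) can be probed numerically for a concrete kernel. The following is only an illustrative sketch under stated assumptions, not part of the original material: it assumes a one-dimensional action space, a kernel $\nu_\sigma(a', a) = \sigma^{-1} \xi((a - a')/\sigma)$ built from the bump function $\xi(u) \propto e^{-1/(1-u^2)}$ on $|u| < 1$ (normalised numerically so that it integrates to 1), and the test function $f = \sin$. The helper names `bump`, `trapezoid`, `smoothed` and `delta_error` are hypothetical.

```python
import math

def bump(u):
    """Unnormalised bump: exp(-1/(1-u^2)) for |u| < 1, zero otherwise."""
    return math.exp(-1.0 / (1.0 - u * u)) if abs(u) < 1.0 else 0.0

def trapezoid(g, lo, hi, n=4000):
    """Composite trapezoid rule for g on [lo, hi] with n subintervals."""
    h = (hi - lo) / n
    total = 0.5 * (g(lo) + g(hi))
    for i in range(1, n):
        total += g(lo + i * h)
    return total * h

# Normalising constant (an assumption of this sketch) so bump(u) / Z integrates to 1.
Z = trapezoid(bump, -1.0, 1.0)

def smoothed(f, sigma, a_prime):
    """Integral of nu_sigma(a', a) f(a) da via the substitution a = a' + sigma * u."""
    return trapezoid(lambda u: bump(u) / Z * f(a_prime + sigma * u), -1.0, 1.0)

def delta_error(f, sigma, a_prime):
    """|integral of nu_sigma(a', a) f(a) da - f(a')|, the quantity in condition 1."""
    return abs(smoothed(f, sigma, a_prime) - f(a_prime))

if __name__ == "__main__":
    # The error should shrink as sigma decreases.
    for sigma in (0.5, 0.1, 0.01):
        print(sigma, delta_error(math.sin, sigma, a_prime=0.5))
```

This checks the limit only at a single point $a'$ and for a single $f$; the condition itself asks for uniformity over $a' \in \mathcal{R}$ and over the whole function class.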

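We restate the theorem:

Theorem. Let $\mu_\theta : \mathcal{S} \to \mathcal{A}$. Denote the range of $\mu_\theta$ by $\mathcal{R}_\theta := \mathrm{range}(\mu_\theta)$, and $\mathcal{R} = \cup_\theta \mathcal{R}_\theta$. For each $\theta$, consider a stochastic policy $\pi_{\mu_\theta,\sigma}$ such that $\pi_{\mu_\theta,\sigma}(a|s) = \nu_\sigma(\mu_\theta(s), a)$, where $\nu_\sigma$ satisfy Conditions B1 on $\mathcal{R}$ above. Suppose further that the regularity conditions A.1 and A.2 (see Section A) on the MDP hold. Then,

$$
\lim_{\sigma \downarrow 0} \nabla_\theta J(\pi_{\mu_\theta,\sigma}) = \nabla_\theta J(\mu_\theta) \qquad (4)
$$

where on the l.h.s. the gradient is the standard stochastic policy gradient and on the r.h.s. the gradient is the deterministic policy gradient.

Theorem 2 holds for a very wide class of policies when $\mathcal{A} = \mathbb{R}^n$: any continuously differentiable, compactly supported $\xi : \mathbb{R}^n \to \mathbb{R}$ with total integral 1 can be used to construct $\nu_\sigma(a', a) = \sigma^{-n}\, \xi((a - a')/\sigma)$, which satisfies our conditions. The space of such functions is large: given any compact support, such a function can be constructed. It is easy to check that any $\nu_\sigma(a', a)$ constructed in this way on a compact support with Lipschitz boundary satisfies Conditions B1. A simple example is any bump function such as, in one dimension (up to the constant that normalises the integral to 1),

$$
\xi(a) = \begin{cases} e^{-\frac{1}{1 - a^2}} & |a| < 1, \\ 0 & |a| \ge 1, \end{cases}
$$

or multidimensional versions.

We now prove the theorem. Throughout the proof we denote the time-$t$ marginal density at state $s$ following policy $\pi$ by $p^\pi_t(s)$. We begin with preliminary lemmas:

Lemma 1. Let $U \times V \subseteq \mathbb{R}^n \times \mathbb{R}^n$. Let $\nu : U \times V \to \mathbb{R}$ be differentiable on $U \times V$. Then (A) $\iff$ (B) $\implies$ (C) where,

(A) Translation invariance: for all $u \in U$, $v \in V$, and any $\delta \in \mathbb{R}^n$ such that $u + \delta \in U$, $v + \delta \in V$, $\nu(u, v) = \nu(u + \delta, v + \delta)$.
(B) There exists some function $\chi : \mathbb{R}^n \to \mathbb{R}$ such that $\nu(u, v) = \chi(u - v)$.
(C) $\nabla_u \nu(u, v) = -\nabla_v \nu(u, v)$, wherever the gradients exist.

If furthermore $U \times V$ is convex then (C) $\implies$ (A), i.e. all properties are equivalent.

Proof of Lemma 1.

(A) $\implies$ (B): For any $c \in U - V$ define $\chi : \mathbb{R}^n \to \mathbb{R}$ by $\chi : c \mapsto \nu(w, w - c)$ for any $w \in U$ such that $c = w - v$ for some $v \in V$. By translation invariance the value $\nu(w, w - c)$ does not depend on the choice of $w$, so this defines $\chi$ uniquely on all of $U - V$. Thus given any $u \in U$, $v \in V$ we can choose $w = u$ and we have
$$
\chi(u - v) = \nu(u, u - (u - v)) = \nu(u, v).
$$

(B) $\implies$ (A): Trivial.

(B) $\implies$ (C): Let $h(u, v) = u - v$; then by the chain rule
$$
\nabla_u \nu(u, v) = \nabla_h \chi(h)|_{h(u,v)}\, \nabla_u h(u, v) = \nabla_h \chi(h)|_{h(u,v)} = -\nabla_h \chi(h)|_{h(u,v)}\, \nabla_v h(u, v) = -\nabla_v \nu(u, v).
$$

((C) and convexity) $\implies$ (A): Suppose $U \times V$ is convex. Consider any $(u, v) \in U \times V$ and any $\delta \in \mathbb{R}^n$. We have
$$
\langle \nabla_{(u,v)} \nu(u, v), (\delta, \delta) \rangle = \langle \nabla_u \nu(u, v), \delta \rangle + \langle \nabla_v \nu(u, v), \delta \rangle = \langle \nabla_u \nu(u, v), \delta \rangle - \langle \nabla_u \nu(u, v), \delta \rangle = 0,
$$
hence the directional derivative of $\nu$ along $(\delta, \delta)$ vanishes. Since $(u, v)$ and $\delta$ were arbitrary, $\nu$ is constant in the direction $(\delta, \delta)$ for all $\delta \in \mathbb{R}^n$. Now since $U \times V$ is convex, for any $A = (u, v) \in U \times V$ and $B = (u + \delta, v + \delta) \in U \times V$ the straight line connecting $A$ and $B$ is entirely contained in $U \times V$. Thus, since $\nu$ is constant along this path, $\nu(A) = \nu(B)$.

We now note that the regularity conditions and the properties of $\nu$ imply the following lemmas, which we will need to prove Theorem 2.

Lemma 2.

1. For any stochastic policy $\pi$ and any $t$, $\sup_s p^\pi_t(s) < b$, and similarly for deterministic policies.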

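2. For any stochastic policy $\pi$, $\sup_s \rho^\pi(s) < b/(1 - \gamma)$, and similarly for deterministic policies.
3. For any stochastic policy $\pi$, $\sup_{a,s} \{ \|\nabla_a Q^\pi(s, a)\| \} < c < \infty$, and similarly for deterministic policies.

Proof.

1. The claim is true for $t = 1$ by the regularity conditions A.2. Then for $t \ge 1$,
$$
\sup_{s'} p^\pi_{t+1}(s') = \sup_{s'} \int_{\mathcal{S}} p^\pi_t(s) \int_{\mathcal{A}} \pi(a|s)\, p(s'|s, a)\, da\, ds \le \sup_{a,s,s'} p(s'|s, a) < b.
$$
2. $\sup_s \rho^\pi(s) \le \sum_{t=1}^{\infty} \gamma^{t-1} \sup_s p^\pi_t(s) \le b/(1 - \gamma)$.
3. We have that
$$
\begin{aligned}
\sup_{s,a} \|\nabla_a Q^\pi(s, a)\|
&\le \sup_{s,a} \|\nabla_a r(s, a)\| + \gamma \sup_{s,a} \int_{\mathcal{S}} \|\nabla_a p(s'|s, a)\|\, V^\pi(s')\, ds' \\
&\le L + \gamma \int_{\mathcal{S}} L\, b/(1 - \gamma)\, ds' < \infty,
\end{aligned}
$$
where the final line follows since $\mathcal{S}$ is compact and the integral over $\mathcal{S}$ is finite.

Lemma 3. $\lim_{\sigma \downarrow 0} \rho^{\pi_{\mu_\theta,\sigma}}(s) = \rho^{\pi_{\mu_\theta,0}}(s)$ and the convergence is uniform w.r.t. $s$, i.e.

$$
\lim_{\sigma \downarrow 0} \sup_s \left| \rho^{\pi_{\mu_\theta,\sigma}}(s) - \rho^{\pi_{\mu_\theta,0}}(s) \right| = 0 \qquad (5)
$$

Proof. We have that $\rho^\pi(s) = \sum_{t=1}^{\infty} \gamma^{t-1} p^\pi_t(s)$. Clearly $p_1^{\pi_{\mu_\theta,\sigma}}(s) = p_1(s) = p_1^{\pi_{\mu_\theta,0}}(s)$. Note that by the definition of $\nu_\sigma$, given any $\epsilon_1 > 0$ we can choose $\sigma'$ such that for all $\sigma < \sigma'$,

$$
\sup_{s,s'} \left| \int_{\mathcal{A}} \pi_{\mu_\theta,\sigma}(a|s)\, p(s'|s, a)\, da - \int_{\mathcal{A}} \pi_{\mu_\theta,0}(a|s)\, p(s'|s, a)\, da \right| \le \epsilon_1.
$$

Now suppose (for induction) that for some $t \ge 1$ we have $\sup_s | p_t^{\pi_{\mu_\theta,\sigma}}(s) - p_t^{\pi_{\mu_\theta,0}}(s) | \le \epsilon_2(t)$. Then,

$$
\begin{aligned}
\sup_{s'} \left| p_{t+1}^{\pi_{\mu_\theta,\sigma}}(s') - p_{t+1}^{\pi_{\mu_\theta,0}}(s') \right|
&\le \sup_{s'} \int_{\mathcal{S}} \left| p_t^{\pi_{\mu_\theta,\sigma}}(s) - p_t^{\pi_{\mu_\theta,0}}(s) \right| \int_{\mathcal{A}} \pi_{\mu_\theta,\sigma}(a|s)\, p(s'|s, a)\, da\, ds \\
&\quad + \sup_{s'} \int_{\mathcal{S}} p_t^{\pi_{\mu_\theta,0}}(s) \left| \int_{\mathcal{A}} \pi_{\mu_\theta,\sigma}(a|s)\, p(s'|s, a)\, da - \int_{\mathcal{A}} \pi_{\mu_\theta,0}(a|s)\, p(s'|s, a)\, da \right| ds \\
&\le \int_{\mathcal{S}} \epsilon_2(t)\, b\, ds + \epsilon_1 = \epsilon_2(t)\, b \zeta + \epsilon_1,
\end{aligned}
$$

where $\zeta = \int_{\mathcal{S}} 1\, ds < \infty$. Since $\epsilon_2(1) = 0$ we therefore have that

$$
\sup_s \left| p_t^{\pi_{\mu_\theta,\sigma}}(s) - p_t^{\pi_{\mu_\theta,0}}(s) \right| \le \epsilon_1 (b\zeta + 1)^{t-1}.
$$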
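The closed-form bound in the induction step of Lemma 3 follows because the recurrence $\epsilon_2(t+1) = \epsilon_2(t)\, b\zeta + \epsilon_1$ with $\epsilon_2(1) = 0$ unrolls to $\epsilon_1 \sum_{k=0}^{t-2} (b\zeta)^k$, and every term of that sum also appears in the binomial expansion of $(b\zeta + 1)^{t-1}$. The sketch below simply iterates the recurrence and compares it with the closed form. The numerical values of $\epsilon_1$ and $b\zeta$ are arbitrary assumptions, and the function names are hypothetical.

```python
def recursive_bound(eps1, b_zeta, t_max):
    """Iterate eps2(t+1) = eps2(t) * b_zeta + eps1 starting from eps2(1) = 0.

    Returns a list whose entry at index t - 1 is eps2(t), for t = 1..t_max.
    """
    eps2 = [0.0]
    for _ in range(t_max - 1):
        eps2.append(eps2[-1] * b_zeta + eps1)
    return eps2

def closed_form_bound(eps1, b_zeta, t):
    """The bound eps1 * (b_zeta + 1)^(t - 1) used in the proof of Lemma 3."""
    return eps1 * (b_zeta + 1.0) ** (t - 1)

if __name__ == "__main__":
    eps1, b_zeta = 1e-3, 2.5  # arbitrary illustrative values
    for t, e in enumerate(recursive_bound(eps1, b_zeta, 10), start=1):
        print(t, e, closed_form_bound(eps1, b_zeta, t))
```

The closed form is loose (it grows geometrically in $t$), which is why the proof truncates the discounted sum at a finite horizon $T$ before choosing $\epsilon_1$.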

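Now, given any $\epsilon > 0$, if we choose $T$ sufficiently large that $\sum_{t=T+1}^{\infty} \gamma^{t-1} b < \epsilon/2$, and then choose $\epsilon_1$ and the corresponding $\sigma'$ sufficiently small that $\sum_{t=1}^{T} \gamma^{t-1} \epsilon_1 (b\zeta + 1)^{t-1} < \epsilon/2$, then we ensure that for any $\sigma < \sigma'$,

$$
\begin{aligned}
\sup_s \left| \rho^{\pi_{\mu_\theta,\sigma}}(s) - \rho^{\pi_{\mu_\theta,0}}(s) \right|
&= \sup_s \left| \sum_{t=1}^{\infty} \gamma^{t-1} p_t^{\pi_{\mu_\theta,\sigma}}(s) - \sum_{t=1}^{\infty} \gamma^{t-1} p_t^{\pi_{\mu_\theta,0}}(s) \right| \\
&\le \sum_{t=1}^{T} \gamma^{t-1} \sup_s \left| p_t^{\pi_{\mu_\theta,\sigma}}(s) - p_t^{\pi_{\mu_\theta,0}}(s) \right| + \sum_{t=T+1}^{\infty} \gamma^{t-1} \sup_s \left| p_t^{\pi_{\mu_\theta,\sigma}}(s) - p_t^{\pi_{\mu_\theta,0}}(s) \right| \\
&\le \sum_{t=1}^{T} \gamma^{t-1} \epsilon_1 (b\zeta + 1)^{t-1} + \sum_{t=T+1}^{\infty} \gamma^{t-1} b \\
&\le \epsilon,
\end{aligned}
$$

as required.

Lemma 4. For all $s$, $\theta$, the convergence $\nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a) \to \nabla_a Q^{\pi_{\mu_\theta,0}}(s, a)$ as $\sigma \to 0$ is uniform in $(s, a)$, i.e.

$$
\lim_{\sigma \downarrow 0} \sup_{(s,a)} \left\| \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a) - \nabla_a Q^{\pi_{\mu_\theta,0}}(s, a) \right\| = 0.
$$

Proof. $\nabla_a Q^\pi(s, a) = \nabla_a \left( r(s, a) + \gamma \int_{\mathcal{S}} p(s'|s, a)\, V^\pi(s')\, ds' \right)$, so

$$
\begin{aligned}
\sup_{(s,a)} \left\| \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a) - \nabla_a Q^{\pi_{\mu_\theta,0}}(s, a) \right\|
&\le \gamma \sup_{(s,a)} \int_{\mathcal{S}} \|\nabla_a p(s'|s, a)\| \left| V^{\pi_{\mu_\theta,\sigma}}(s') - V^{\pi_{\mu_\theta,0}}(s') \right| ds' \\
&\le \gamma \zeta L \sup_{s'} \left| V^{\pi_{\mu_\theta,\sigma}}(s') - V^{\pi_{\mu_\theta,0}}(s') \right|,
\end{aligned}
$$

where $\zeta = \int_{\mathcal{S}} 1\, ds < \infty$. Now, given any $\epsilon_1, \epsilon_2$ there exists $\sigma'$ such that for all $\sigma < \sigma'$ we have

$$
\sup_s \left| \int_{\mathcal{A}} r(s, a) \left( \pi_{\mu_\theta,\sigma}(a|s) - \pi_{\mu_\theta,0}(a|s) \right) da \right| < \epsilon_1
\quad \text{and} \quad
\sup_{s,\hat{s}} \left| \rho_{\hat{s}}^{\pi_{\mu_\theta,\sigma}}(s) - \rho_{\hat{s}}^{\pi_{\mu_\theta,0}}(s) \right| < \epsilon_2 \qquad (6)
$$

where $\rho^\pi_{\hat{s}}(s)$ is analogous to $\rho^\pi(s)$, but conditioned on starting in distribution $\int_{\mathcal{A}} p(s|a, \hat{s})\, \pi(a|\hat{s})\, da$ at $t = 1$ rather than in distribution $p_1$. (The result (6) can be proved in an identical fashion to Lemma 3, noting that the result does not depend upon $p_1$ other than through its boundedness.) Then,

$$
\begin{aligned}
\sup_{\hat{s}} \left| V^{\pi_{\mu_\theta,\sigma}}(\hat{s}) - V^{\pi_{\mu_\theta,0}}(\hat{s}) \right|
&\le \sup_{\hat{s}} \left| \int_{\mathcal{A}} r(\hat{s}, a) \left( \pi_{\mu_\theta,\sigma}(a|\hat{s}) - \pi_{\mu_\theta,0}(a|\hat{s}) \right) da \right| \\
&\quad + \gamma \sup_{\hat{s}} \left| \int_{\mathcal{S}} \int_{\mathcal{A}} \rho_{\hat{s}}^{\pi_{\mu_\theta,\sigma}}(s)\, \pi_{\mu_\theta,\sigma}(a|s)\, r(s, a)\, da\, ds - \int_{\mathcal{S}} \int_{\mathcal{A}} \rho_{\hat{s}}^{\pi_{\mu_\theta,0}}(s)\, \pi_{\mu_\theta,0}(a|s)\, r(s, a)\, da\, ds \right| \\
&\le \epsilon_1 + \sup_{\hat{s}} \int_{\mathcal{S}} \left| \rho_{\hat{s}}^{\pi_{\mu_\theta,\sigma}}(s) - \rho_{\hat{s}}^{\pi_{\mu_\theta,0}}(s) \right| \left| \int_{\mathcal{A}} r(s, a)\, \pi_{\mu_\theta,0}(a|s)\, da \right| ds \\
&\quad + \sup_{\hat{s}} \int_{\mathcal{S}} \rho_{\hat{s}}^{\pi_{\mu_\theta,\sigma}}(s) \left| \int_{\mathcal{A}} r(s, a) \left( \pi_{\mu_\theta,\sigma}(a|s) - \pi_{\mu_\theta,0}(a|s) \right) da \right| ds \\
&\le \epsilon_1 + \epsilon_2 \zeta b + \epsilon_1/(1 - \gamma),
\end{aligned}
$$

where the second inequality uses $\gamma \le 1$ and the decomposition $\rho^{\sigma} \pi_\sigma - \rho^{0} \pi_0 = (\rho^{\sigma} - \rho^{0})\, \pi_0 + \rho^{\sigma} (\pi_\sigma - \pi_0)$, and the last uses the bound $b$ on the reward together with the fact that $\rho_{\hat{s}}^{\pi}$ has total mass $1/(1 - \gamma)$. This bound can thus be made arbitrarily small by choosing $\sigma$ sufficiently small.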

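Proof of Theorem 2. Translation invariance and Lemma 1 imply that $\nabla_{a'} \nu_\sigma(a', a)|_{a'=\mu_\theta(s)} = -\nabla_a \nu_\sigma(\mu_\theta(s), a)$. Then integration by parts implies that

$$
\begin{aligned}
\int_{\mathcal{A}} Q^{\pi_{\mu_\theta,\sigma}}(s, a)\, \nabla_{a'} \nu_\sigma(a', a)|_{a'=\mu_\theta(s)}\, da
&= -\int_{\mathcal{A}} Q^{\pi_{\mu_\theta,\sigma}}(s, a)\, \nabla_a \nu_\sigma(\mu_\theta(s), a)\, da \\
&= \int_{C_{\mu_\theta(s)}} \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a)\, \nu_\sigma(\mu_\theta(s), a)\, da + \text{boundary terms} \\
&= \int_{C_{\mu_\theta(s)}} \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a)\, \nu_\sigma(\mu_\theta(s), a)\, da,
\end{aligned}
$$

where the boundary terms are zero since $\nu_\sigma$ vanishes on the boundary. We have, from the stochastic policy gradient theorem,

$$
\begin{aligned}
\lim_{\sigma \downarrow 0} \nabla_\theta J(\pi_{\mu_\theta,\sigma})
&= \lim_{\sigma \downarrow 0} \int_{\mathcal{S}} \rho^{\pi_{\mu_\theta,\sigma}}(s) \int_{\mathcal{A}} Q^{\pi_{\mu_\theta,\sigma}}(s, a)\, \nabla_\theta \pi_{\mu_\theta,\sigma}(a|s)\, da\, ds \\
&= \lim_{\sigma \downarrow 0} \int_{\mathcal{S}} \rho^{\pi_{\mu_\theta,\sigma}}(s) \int_{\mathcal{A}} Q^{\pi_{\mu_\theta,\sigma}}(s, a)\, \nabla_\theta \mu_\theta(s)\, \nabla_{a'} \nu_\sigma(a', a)|_{a'=\mu_\theta(s)}\, da\, ds \\
&= \lim_{\sigma \downarrow 0} \int_{\mathcal{S}} \rho^{\pi_{\mu_\theta,\sigma}}(s)\, \nabla_\theta \mu_\theta(s) \int_{C_{\mu_\theta(s)}} \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a)\, \nu_\sigma(\mu_\theta(s), a)\, da\, ds \\
&= \int_{\mathcal{S}} \lim_{\sigma \downarrow 0} \rho^{\pi_{\mu_\theta,\sigma}}(s)\, \nabla_\theta \mu_\theta(s) \int_{C_{\mu_\theta(s)}} \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a)\, \nu_\sigma(\mu_\theta(s), a)\, da\, ds, \qquad (7)
\end{aligned}
$$

where the exchange of limit and integral in (7) follows by dominated convergence (in Banach space), where we can take the dominating function (which is bounded by Lemma 2)

$$
\begin{aligned}
g_\theta(s) &= \sup_\sigma \left\{ \rho^{\pi_{\mu_\theta,\sigma}}(s) \right\} \sup_{a \in C_{\mu_\theta(s)},\, \sigma} \left\{ \left\| \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a) \right\| \right\} \left\| \nabla_\theta \mu_\theta(s) \right\|_{op} \\
&\ge \left\| \rho^{\pi_{\mu_\theta,\sigma}}(s) \int_{C_{\mu_\theta(s)}} \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a)\, \nu_\sigma(\mu_\theta(s), a)\, da\; \nabla_\theta \mu_\theta(s) \right\|. \qquad (8)
\end{aligned}
$$

Here $\|\cdot\|_{op}$ denotes the operator norm, or largest singular value. Now note that by uniform convergence of $\nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a)$ (Lemma 4), given any $\epsilon_1, \epsilon_2$ there exists $\sigma'$ such that for all $\sigma < \sigma'$ we have

$$
\left\| \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a) - \nabla_a Q^{\pi_{\mu_\theta,0}}(s, a) \right\| < \epsilon_1
$$

so that

$$
\left\| \int_{C_{\mu_\theta(s)}} \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a)\, \nu_\sigma(\mu_\theta(s), a)\, da - \int_{C_{\mu_\theta(s)}} \nabla_a Q^{\pi_{\mu_\theta,0}}(s, a)\, \nu_\sigma(\mu_\theta(s), a)\, da \right\| < \epsilon_1,
$$

and also that

$$
\left\| \int_{C_{\mu_\theta(s)}} \nabla_a Q^{\pi_{\mu_\theta,0}}(s, a)\, \nu_\sigma(\mu_\theta(s), a)\, da - \nabla_a Q^{\pi_{\mu_\theta,0}}(s, a)|_{a=\mu_\theta(s)} \right\| < \epsilon_2.
$$

Hence,

$$
\left\| \int_{C_{\mu_\theta(s)}} \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a)\, \nu_\sigma(\mu_\theta(s), a)\, da - \nabla_a Q^{\pi_{\mu_\theta,0}}(s, a)|_{a=\mu_\theta(s)} \right\| < \epsilon_1 + \epsilon_2,
$$

and from this and Lemma 3 we have

$$
\begin{aligned}
(7) &= \int_{\mathcal{S}} \rho^{\pi_{\mu_\theta,0}}(s)\, \nabla_\theta \mu_\theta(s) \lim_{\sigma \downarrow 0} \int_{C_{\mu_\theta(s)}} \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a)\, \nu_\sigma(\mu_\theta(s), a)\, da\, ds \\
&= \int_{\mathcal{S}} \rho^{\pi_{\mu_\theta,0}}(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\pi_{\mu_\theta,0}}(s, a)|_{a=\mu_\theta(s)}\, ds \\
&= \int_{\mathcal{S}} \rho^{\mu_\theta}(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu_\theta}(s, a)|_{a=\mu_\theta(s)}\, ds.
\end{aligned}
$$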
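The integration-by-parts step and the limit in (4) can be sanity-checked numerically in a deliberately tiny setting. The sketch below is an illustration under assumptions that are not from the paper: a single state and a one-step problem, so that $Q(a) = r(a) = \sin(a)$ does not depend on the policy and $\rho \equiv 1$; a scalar policy $\mu_\theta = \theta$; and the numerically normalised one-dimensional bump kernel. In this setting the deterministic policy gradient is $\cos\theta$. All function names are hypothetical.

```python
import math

def bump(u):
    """Unnormalised bump: exp(-1/(1-u^2)) for |u| < 1, zero otherwise."""
    return math.exp(-1.0 / (1.0 - u * u)) if abs(u) < 1.0 else 0.0

def bump_grad(u):
    """Derivative of the unnormalised bump: xi'(u) = xi(u) * (-2u) / (1-u^2)^2."""
    if abs(u) >= 1.0:
        return 0.0
    d = 1.0 - u * u
    return bump(u) * (-2.0 * u) / (d * d)

def trapezoid(g, lo, hi, n=4000):
    """Composite trapezoid rule for g on [lo, hi] with n subintervals."""
    h = (hi - lo) / n
    total = 0.5 * (g(lo) + g(hi))
    for i in range(1, n):
        total += g(lo + i * h)
    return total * h

Z = trapezoid(bump, -1.0, 1.0)  # normalising constant of the kernel

def score_form(q, theta, sigma):
    """Stochastic form: integral of Q(a) * d/dtheta nu_sigma(theta, a) da.

    With a = theta + sigma * u, d/dtheta nu_sigma(theta, a) da
    equals -(1 / sigma) * xi'(u) / Z du.
    """
    return trapezoid(
        lambda u: -bump_grad(u) / (Z * sigma) * q(theta + sigma * u), -1.0, 1.0
    )

def smoothed_grad_form(dq, theta, sigma):
    """After integration by parts: integral of Q'(a) * nu_sigma(theta, a) da."""
    return trapezoid(lambda u: bump(u) / Z * dq(theta + sigma * u), -1.0, 1.0)

if __name__ == "__main__":
    theta = 0.3
    for sigma in (0.5, 0.05):
        lhs = score_form(math.sin, theta, sigma)
        rhs = smoothed_grad_form(math.cos, theta, sigma)
        print(sigma, lhs, rhs, math.cos(theta))
```

The two forms should agree for every $\sigma$ (the boundary terms vanish because the kernel is zero at $|u| = 1$), and both should approach $\cos\theta$ as $\sigma$ shrinks. The multi-step case additionally needs Lemmas 3 and 4, which this toy setting does not exercise.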