Deterministic Policy Gradient Algorithms: Supplementary Material

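A. Regularity Conditions

Within the text we have referred to regularity conditions on the MDP:

Regularity conditions A.1: $p(s'|s,a)$, $\nabla_a p(s'|s,a)$, $\mu_\theta(s)$, $\nabla_\theta \mu_\theta(s)$, $r(s,a)$, $\nabla_a r(s,a)$, $p_1(s)$ are continuous in all parameters and variables $s$, $a$, $s'$ and $x$.

Regularity conditions A.2: there exist a $b$ and $L$ such that $\sup_s p_1(s) < b$, $\sup_{a,s,s'} p(s'|s,a) < b$, $\sup_{a,s} r(s,a) < b$, $\sup_{a,s,s'} \|\nabla_a p(s'|s,a)\| < L$, and $\sup_{a,s} \|\nabla_a r(s,a)\| < L$.

B. Proof of Theorem 1

Proof of Theorem 1. The proof follows along the same lines as the standard stochastic policy gradient theorem in Sutton et al. (1999). Note that the regularity conditions A.1 imply that $V^{\mu_\theta}(s)$ and $\nabla_\theta V^{\mu_\theta}(s)$ are continuous functions of $\theta$ and $s$, and the compactness of $\mathcal{S}$ further implies that for any $\theta$, $\|\nabla_\theta V^{\mu_\theta}(s)\|$, $\|\nabla_a Q^{\mu_\theta}(s,a)|_{a=\mu_\theta(s)}\|$ and $\|\nabla_\theta \mu_\theta(s)\|$ are bounded functions of $s$. These conditions will be necessary to exchange derivatives and integrals, and the order of integration, whenever necessary in the following proof. We have,

\begin{align*}
\nabla_\theta V^{\mu_\theta}(s) &= \nabla_\theta Q^{\mu_\theta}(s, \mu_\theta(s)) \\
&= \nabla_\theta \left( r(s, \mu_\theta(s)) + \int_{\mathcal{S}} \gamma p(s'|s, \mu_\theta(s)) V^{\mu_\theta}(s') \, ds' \right) \\
&= \nabla_\theta \mu_\theta(s) \nabla_a r(s,a)|_{a=\mu_\theta(s)} + \nabla_\theta \int_{\mathcal{S}} \gamma p(s'|s, \mu_\theta(s)) V^{\mu_\theta}(s') \, ds' \\
&= \nabla_\theta \mu_\theta(s) \nabla_a r(s,a)|_{a=\mu_\theta(s)} + \int_{\mathcal{S}} \gamma \left( p(s'|s, \mu_\theta(s)) \nabla_\theta V^{\mu_\theta}(s') + \nabla_\theta \mu_\theta(s) \nabla_a p(s'|s,a)|_{a=\mu_\theta(s)} V^{\mu_\theta}(s') \right) ds' \tag{1} \\
&= \nabla_\theta \mu_\theta(s) \nabla_a \left( r(s,a) + \int_{\mathcal{S}} \gamma p(s'|s,a) V^{\mu_\theta}(s') \, ds' \right) \Big|_{a=\mu_\theta(s)} + \int_{\mathcal{S}} \gamma p(s'|s, \mu_\theta(s)) \nabla_\theta V^{\mu_\theta}(s') \, ds' \\
&= \nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s,a)|_{a=\mu_\theta(s)} + \int_{\mathcal{S}} \gamma p(s \to s', 1, \mu_\theta) \nabla_\theta V^{\mu_\theta}(s') \, ds'.
\end{align*}

Where in (1) we used the Leibniz integral rule to exchange the order of derivative and integration, requiring the regularity conditions, specifically continuity of $p(s'|s,a)$, $\mu_\theta(s)$, $V^{\mu_\theta}(s)$ and their derivatives w.r.t. $\theta$. Now iterating this formula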
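we have,

\begin{align*}
\nabla_\theta V^{\mu_\theta}(s) &= \nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s,a)|_{a=\mu_\theta(s)} \\
&\quad + \int_{\mathcal{S}} \gamma p(s \to s', 1, \mu_\theta) \nabla_\theta \mu_\theta(s') \nabla_a Q^{\mu_\theta}(s',a)|_{a=\mu_\theta(s')} \, ds' \\
&\quad + \int_{\mathcal{S}} \gamma p(s \to s', 1, \mu_\theta) \int_{\mathcal{S}} \gamma p(s' \to s'', 1, \mu_\theta) \nabla_\theta V^{\mu_\theta}(s'') \, ds'' \, ds' \\
&= \nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s,a)|_{a=\mu_\theta(s)} \\
&\quad + \int_{\mathcal{S}} \gamma p(s \to s', 1, \mu_\theta) \nabla_\theta \mu_\theta(s') \nabla_a Q^{\mu_\theta}(s',a)|_{a=\mu_\theta(s')} \, ds' \\
&\quad + \int_{\mathcal{S}} \gamma^2 p(s \to s'', 2, \mu_\theta) \nabla_\theta V^{\mu_\theta}(s'') \, ds'' \tag{2} \\
&\;\;\vdots \\
&= \int_{\mathcal{S}} \sum_{t=0}^{\infty} \gamma^t p(s \to s', t, \mu_\theta) \nabla_\theta \mu_\theta(s') \nabla_a Q^{\mu_\theta}(s',a)|_{a=\mu_\theta(s')} \, ds'.
\end{align*}

Where in (2) we have used Fubini's theorem to exchange the order of integration, requiring the regularity conditions so that $\|\nabla_\theta V^{\mu_\theta}(s)\|$ is bounded. Now taking the expectation over $S_1$ we have,

\begin{align*}
\nabla_\theta J(\mu_\theta) &= \nabla_\theta \int_{\mathcal{S}} p_1(s) V^{\mu_\theta}(s) \, ds \\
&= \int_{\mathcal{S}} p_1(s) \nabla_\theta V^{\mu_\theta}(s) \, ds \tag{3} \\
&= \int_{\mathcal{S}} \int_{\mathcal{S}} \sum_{t=0}^{\infty} \gamma^t p_1(s) p(s \to s', t, \mu_\theta) \nabla_\theta \mu_\theta(s') \nabla_a Q^{\mu_\theta}(s',a)|_{a=\mu_\theta(s')} \, ds' \, ds \\
&= \int_{\mathcal{S}} \rho^{\mu_\theta}(s) \nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s,a)|_{a=\mu_\theta(s)} \, ds,
\end{align*}

where in (3) we used the Leibniz integral rule to exchange derivative and integral, requiring the regularity conditions, specifically so that $p_1(s)$ and $V^{\mu_\theta}(s)$ and derivatives w.r.t. $\theta$ are continuous. In the final line we again used Fubini's theorem to exchange the order of integration, requiring the boundedness of the integrand as implied by the regularity conditions.

C. Proof of Theorem 2

We first restate Theorem 2 in detail, with discussion, and then prove the theorem. We first make a preliminary definition:

Conditions B1: Functions $\nu_\sigma$ parametrized by $\sigma$ are said to be a regular delta-approximation on $\mathcal{R} \subseteq \mathcal{A}$ if they satisfy the following conditions:

1. The distributions $\nu_\sigma$ converge to a delta distribution: $\lim_{\sigma \downarrow 0} \int_{\mathcal{A}} \nu_\sigma(a', a) f(a) \, da = f(a')$ for $a' \in \mathcal{R}$ and suitably smooth $f$. Specifically we require that this convergence is uniform in $a'$ and over any class $\mathcal{F}$ of $L$-Lipschitz and bounded functions, $\|\nabla_a f(a)\| < L < \infty$, $\sup_a f(a) < b < \infty$, i.e.:
$$\lim_{\sigma \downarrow 0} \sup_{f \in \mathcal{F},\, a' \in \mathcal{R}} \left| \int_{\mathcal{A}} \nu_\sigma(a', a) f(a) \, da - f(a') \right| = 0$$
2. For each $a' \in \mathcal{R}$, $\nu_\sigma(a', \cdot)$ is supported on some compact $C_{a'} \subseteq \mathcal{A}$ with Lipschitz boundary $\mathrm{bd}(C_{a'})$, vanishes on the boundary and is continuously differentiable on $C_{a'}$.
3. For each $a' \in \mathcal{R}$, for each $a \in \mathcal{A}$, the gradient $\nabla_{a'} \nu_\sigma(a', a)$ exists.
4. Translation invariance: for all $a \in \mathcal{A}$, $a' \in \mathcal{R}$, and any $\delta \in \mathbb{R}^n$ such that $a + \delta \in \mathcal{A}$, $a' + \delta \in \mathcal{A}$, $\nu(a', a) = \nu(a' + \delta, a + \delta)$.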
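Condition 1 can be checked numerically for a concrete family. The sketch below assumes, purely for illustration, the one-dimensional kernel $\nu_\sigma(a', a) = \xi((a - a')/\sigma)/\sigma$ built from a normalized bump function $\xi$ supported on $[-1, 1]$, with $\mathcal{A} = \mathbb{R}$. The names `bump`, `smoothed` and `delta_error`, the midpoint quadrature, the grid size and the test function are choices made here and are not part of the original material.

```python
import math

N = 20_000                       # number of midpoint quadrature nodes on (-1, 1)
DU = 2.0 / N
U = [-1.0 + (i + 0.5) * DU for i in range(N)]


def bump(u):
    """Unnormalized bump: exp(-1 / (1 - u^2)) for |u| < 1, and 0 otherwise."""
    return math.exp(-1.0 / (1.0 - u * u)) if abs(u) < 1.0 else 0.0


Z = sum(bump(u) for u in U) * DU          # normalizing constant of the bump
XI = [bump(u) / Z for u in U]             # xi(u) on the grid; integrates to 1


def smoothed(f, a_prime, sigma):
    """Approximate the integral of nu_sigma(a', a) f(a) over a.

    The substitution a = a' + sigma * u turns it into the
    integral of xi(u) f(a' + sigma * u) over u in (-1, 1).
    """
    return sum(x * f(a_prime + sigma * u) for x, u in zip(XI, U)) * DU


def delta_error(f, a_prime, sigma):
    """Distance between the smoothed value and the point value f(a')."""
    return abs(smoothed(f, a_prime, sigma) - f(a_prime))


if __name__ == "__main__":
    # tanh is bounded and 1-Lipschitz; it is an arbitrary test function.
    for sigma in (1.0, 0.5, 0.1, 0.01):
        print(sigma, delta_error(math.tanh, 0.3, sigma))
```

For an $L$-Lipschitz $f$ the error in this construction is at most $L\sigma$, because $\xi$ is supported on $[-1, 1]$; this bound does not depend on $a'$ or on the particular $f$, which is the kind of uniformity condition 1 asks for.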

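We restate the theorem:

Theorem. Let $\mu_\theta : \mathcal{S} \to \mathcal{A}$. Denote the range of $\mu_\theta$ by $\mathcal{R}_\theta := \mathrm{range}(\mu_\theta)$, and $\mathcal{R} = \cup_\theta \mathcal{R}_\theta$. For each $\theta$, consider a stochastic policy $\pi_{\mu_\theta,\sigma}$ such that $\pi_{\mu_\theta,\sigma}(a|s) = \nu_\sigma(\mu_\theta(s), a)$, where $\nu_\sigma$ satisfy Conditions B1 on $\mathcal{R}$ above. Suppose further that the regularity conditions A.1 and A.2 (see section A) on the MDP hold. Then,
$$\lim_{\sigma \downarrow 0} \nabla_\theta J(\pi_{\mu_\theta,\sigma}) = \nabla_\theta J(\mu_\theta) \tag{4}$$
where on the l.h.s. the gradient is the standard stochastic policy gradient and on the r.h.s. the gradient is the deterministic policy gradient.

Theorem 2 holds for a very wide class of policies when $\mathcal{A} = \mathbb{R}^n$: any continuously differentiable, compactly supported $\xi : \mathbb{R}^n \to \mathbb{R}$ with total integral 1 can be used to construct $\nu_\sigma(a', a) = \frac{1}{\sigma^n} \xi\left(\frac{a - a'}{\sigma}\right)$, which satisfies our conditions, and the space of such functions is large: given any compact support such a function can be constructed. It is easy to check that any $\nu_\sigma(a', a)$ constructed on compact support with Lipschitz boundary in this way will satisfy Conditions B1. A simple example is any bump function (rescaled by a constant so that it integrates to 1) such as, in 1 dimension,
$$\xi(a) = \begin{cases} e^{-\frac{1}{1 - a^2}} & |a| < 1 \\ 0 & |a| \ge 1 \end{cases}$$
or multidimensional versions.

We now prove the theorem. Throughout the proof we denote the time $t$ marginal density at state $s$ following policy $\pi$ by $p_t^\pi(s)$. We begin with preliminary lemmas:

Lemma 1. Let $U \times V \subseteq \mathbb{R}^n \times \mathbb{R}^n$. Let $\nu : U \times V \to \mathbb{R}$ be differentiable on $U \times V$. Then (A) $\Leftrightarrow$ (B) $\Rightarrow$ (C) where,

(A) Translation invariance: for all $u \in U$, $v \in V$, and any $\delta \in \mathbb{R}^n$ such that $u + \delta \in U$, $v + \delta \in V$, $\nu(u, v) = \nu(u + \delta, v + \delta)$.
(B) There exists some function $\chi : \mathbb{R}^n \to \mathbb{R}$ such that $\nu(u, v) = \chi(u - v)$.
(C) $\nabla_u \nu(u, v) = -\nabla_v \nu(u, v)$, wherever the gradients exist.

If furthermore $U \times V$ is convex then (C) $\Rightarrow$ (A), i.e. all properties are equivalent.

Proof of Lemma 1. (A) $\Rightarrow$ (B): For any $c \in U - V$ define $\chi : \mathbb{R}^n \to \mathbb{R}$ by $\chi : c \mapsto \nu(w, w - c)$ for any $w \in U$ such that $c = w - v$ for some $v \in V$. Observe that this defines $\chi$ uniquely on all of $U - V$, since by translation invariance the value does not depend on the choice of $w$.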
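Thus given any $u \in U$, $v \in V$ we can choose $w = u$ and we have,
$$\chi(u - v) = \nu(u, u - (u - v)) = \nu(u, v).$$

(B) $\Rightarrow$ (A): Trivial.

(B) $\Rightarrow$ (C): Let $h(u, v) = u - v$; then by the chain rule
$$\nabla_u \nu(u, v) = \nabla_h \chi(h)|_{h(u,v)} \nabla_u h(u, v) = \nabla_h \chi(h)|_{h(u,v)} = -\nabla_h \chi(h)|_{h(u,v)} \nabla_v h(u, v) = -\nabla_v \nu(u, v).$$

((C) and convexity) $\Rightarrow$ (A): Suppose $U \times V$ is convex. Consider any $(u, v) \in U \times V$ and any $\delta \in \mathbb{R}^n$; we have
$$\langle \nabla_{(u,v)} \nu(u, v), (\delta, \delta) \rangle = \langle \nabla_u \nu(u, v), \delta \rangle + \langle \nabla_v \nu(u, v), \delta \rangle = \langle \nabla_u \nu(u, v), \delta \rangle - \langle \nabla_u \nu(u, v), \delta \rangle = 0,$$
hence $\nu$ is constant in the direction $(\delta, \delta)$. Since $(u, v)$ and $\delta$ were arbitrary, $\nu$ is constant in the direction $(\delta, \delta)$ for all $\delta \in \mathbb{R}^n$. Now since $U \times V$ is convex, for any $A = (u, v) \in U \times V$ and $B = (u + \delta, v + \delta) \in U \times V$ we have that the straight line connecting $A$ and $B$ is entirely contained in $U \times V$. Thus, since $\nu$ is constant along the path, $\nu(A) = \nu(B)$.

We now note that the regularity conditions and properties of $\nu$ imply the following lemmas, which we will need to prove Theorem 2.

Lemma 2.

1. For any stochastic policy $\pi$ and any $t$, $\sup_s p_t^\pi(s) < b$, and similarly for deterministic policies.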

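2. For any stochastic policy $\pi$, $\sup_s \rho^\pi(s) < b/(1 - \gamma)$, and similarly for deterministic policies.
3. For any stochastic policy $\pi$, $\sup_{a,s} \{ \|\nabla_a Q^\pi(s, a)\| \} < c < \infty$, and similarly for deterministic policies.

Proof.

1. The claim is true for $t = 1$ by the regularity conditions A.2; then for $t \ge 1$,
$$\sup_{s'} p_{t+1}^\pi(s') = \sup_{s'} \int_{\mathcal{S}} \int_{\mathcal{A}} p_t^\pi(s) \pi(a|s) p(s'|s, a) \, da \, ds \le \sup_{s,a,s'} p(s'|s, a) < b.$$
2. $\sup_s \rho^\pi(s) \le \sum_{t=1}^{\infty} \gamma^{t-1} \sup_s p_t^\pi(s) < b/(1 - \gamma)$.
3. We have that,
\begin{align*}
\sup_{s,a} \|\nabla_a Q^\pi(s, a)\| &\le \sup_{s,a} \|\nabla_a r(s, a)\| + \gamma \sup_{s,a} \int_{\mathcal{S}} \|\nabla_a p(s'|s, a)\| V^\pi(s') \, ds' \\
&\le L + \gamma \int_{\mathcal{S}} L b/(1 - \gamma) \, ds' < \infty
\end{align*}
where the final line follows since $\mathcal{S}$ is compact and the integral over $\mathcal{S}$ is finite.

Lemma 3. $\lim_{\sigma \downarrow 0} \rho^{\pi_{\mu_\theta,\sigma}}(s) = \rho^{\pi_{\mu_\theta,0}}(s)$ and the convergence is uniform w.r.t. $s$, i.e.
$$\lim_{\sigma \downarrow 0} \sup_s \left| \rho^{\pi_{\mu_\theta,\sigma}}(s) - \rho^{\pi_{\mu_\theta,0}}(s) \right| = 0 \tag{5}$$

Proof. We have that $\rho^\pi(s) = \sum_{t=1}^{\infty} \gamma^{t-1} p_t^\pi(s)$. Clearly $p_1^{\pi_{\mu_\theta,\sigma}}(s) = p_1(s) = p_1^{\pi_{\mu_\theta,0}}(s)$. Note that by the definition of $\nu_\sigma$, given any $\epsilon_1 > 0$ we can choose $\sigma'$ such that for all $\sigma < \sigma'$,
$$\sup_{s,s'} \left| \int_{\mathcal{A}} \pi_{\mu_\theta,\sigma}(a|s) p(s'|s, a) \, da - \int_{\mathcal{A}} \pi_{\mu_\theta,0}(a|s) p(s'|s, a) \, da \right| \le \epsilon_1.$$
Now suppose (for induction) that for some $t \ge 1$ we have that
$$\sup_s \left| p_t^{\pi_{\mu_\theta,\sigma}}(s) - p_t^{\pi_{\mu_\theta,0}}(s) \right| \le \epsilon_2(t),$$
then,
\begin{align*}
\sup_{s'} \left| p_{t+1}^{\pi_{\mu_\theta,\sigma}}(s') - p_{t+1}^{\pi_{\mu_\theta,0}}(s') \right| &\le \sup_{s'} \int_{\mathcal{S}} \left| p_t^{\pi_{\mu_\theta,\sigma}}(s) - p_t^{\pi_{\mu_\theta,0}}(s) \right| \int_{\mathcal{A}} \pi_{\mu_\theta,\sigma}(a|s) p(s'|s, a) \, da \, ds \\
&\quad + \sup_{s'} \int_{\mathcal{S}} p_t^{\pi_{\mu_\theta,0}}(s) \left| \int_{\mathcal{A}} \pi_{\mu_\theta,\sigma}(a|s) p(s'|s, a) \, da - \int_{\mathcal{A}} \pi_{\mu_\theta,0}(a|s) p(s'|s, a) \, da \right| ds \\
&\le \int_{\mathcal{S}} \epsilon_2(t) b \, ds + \epsilon_1 \\
&= \epsilon_2(t) b \zeta + \epsilon_1,
\end{align*}
where $\zeta = \int_{\mathcal{S}} 1 \, ds < \infty$. Since $\epsilon_2(1) = 0$ we therefore have that
$$\sup_s \left| p_t^{\pi_{\mu_\theta,\sigma}}(s) - p_t^{\pi_{\mu_\theta,0}}(s) \right| \le \epsilon_1 (b\zeta + 1)^{t-1}.$$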

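And now given any $\epsilon > 0$, if we choose $T$ sufficiently large such that $\sum_{t=T+1}^{\infty} \gamma^{t-1} b < \epsilon/2$, and then we choose $\epsilon_1$ and the corresponding $\sigma'$ sufficiently small so that $\sum_{t=1}^{T} \gamma^{t-1} \epsilon_1 (b\zeta + 1)^{t-1} < \epsilon/2$, then we ensure that for any $\sigma < \sigma'$,
\begin{align*}
\sup_s \left| \rho^{\pi_{\mu_\theta,\sigma}}(s) - \rho^{\pi_{\mu_\theta,0}}(s) \right| &= \sup_s \left| \sum_{t=1}^{\infty} \gamma^{t-1} p_t^{\pi_{\mu_\theta,\sigma}}(s) - \sum_{t=1}^{\infty} \gamma^{t-1} p_t^{\pi_{\mu_\theta,0}}(s) \right| \\
&\le \sum_{t=1}^{T} \gamma^{t-1} \sup_s \left| p_t^{\pi_{\mu_\theta,\sigma}}(s) - p_t^{\pi_{\mu_\theta,0}}(s) \right| + \sum_{t=T+1}^{\infty} \gamma^{t-1} \sup_s \left| p_t^{\pi_{\mu_\theta,\sigma}}(s) - p_t^{\pi_{\mu_\theta,0}}(s) \right| \\
&\le \sum_{t=1}^{T} \gamma^{t-1} \epsilon_1 (b\zeta + 1)^{t-1} + \sum_{t=T+1}^{\infty} \gamma^{t-1} b \\
&\le \epsilon
\end{align*}
as required.

Lemma 4. For all $s, \theta$, the convergence $\nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a) \to \nabla_a Q^{\pi_{\mu_\theta,0}}(s, a)$, as $\sigma \downarrow 0$, is uniform in $(s, a)$, i.e.
$$\lim_{\sigma \downarrow 0} \sup_{(s,a)} \left\| \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a) - \nabla_a Q^{\pi_{\mu_\theta,0}}(s, a) \right\| = 0$$

Proof. $\nabla_a Q^\pi(s, a) = \nabla_a \left( r(s, a) + \gamma \int_{\mathcal{S}} p(s'|s, a) V^\pi(s') \, ds' \right)$, so
\begin{align*}
\sup_{(s,a)} \left\| \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a) - \nabla_a Q^{\pi_{\mu_\theta,0}}(s, a) \right\| &\le \gamma \sup_{(s,s',a)} \|\nabla_a p(s'|s, a)\| \int_{\mathcal{S}} \left| V^{\pi_{\mu_\theta,\sigma}}(s') - V^{\pi_{\mu_\theta,0}}(s') \right| ds' \\
&\le \gamma \zeta L \sup_{s'} \left| V^{\pi_{\mu_\theta,\sigma}}(s') - V^{\pi_{\mu_\theta,0}}(s') \right|
\end{align*}
where $\zeta = \int_{\mathcal{S}} 1 \, ds < \infty$. Now, given any $\epsilon_1, \epsilon_2$ there exists $\sigma'$ such that for all $\sigma < \sigma'$ we have that,
$$\sup_s \left| \int_{\mathcal{A}} r(s, a) \left( \pi_{\mu_\theta,\sigma}(a|s) - \pi_{\mu_\theta,0}(a|s) \right) da \right| < \epsilon_1 \quad \text{and} \quad \sup_{s,\hat{s}} \left| \rho_{\hat{s}}^{\pi_{\mu_\theta,\sigma}}(s) - \rho_{\hat{s}}^{\pi_{\mu_\theta,0}}(s) \right| < \epsilon_2 \tag{6}$$
where $\rho_{\hat{s}}^\pi(s)$ is analogous to $\rho^\pi(s)$, but conditioned on starting in distribution $\int_{\mathcal{A}} p(\cdot|\hat{s}, a) \pi(a|\hat{s}) \, da$ at $t = 1$ rather than in distribution $p_1$ (the result (6) can be proved in an identical fashion to Lemma 3, noting that the result does not depend upon $p_1$ other than through its boundedness). Then,
\begin{align*}
\sup_{s'} \left| V^{\pi_{\mu_\theta,\sigma}}(s') - V^{\pi_{\mu_\theta,0}}(s') \right| &\le \sup_{s'} \left| \int_{\mathcal{A}} r(s', a) \left( \pi_{\mu_\theta,\sigma}(a|s') - \pi_{\mu_\theta,0}(a|s') \right) da \right| \\
&\quad + \gamma \sup_{s'} \left| \int_{\mathcal{S}} \int_{\mathcal{A}} \rho_{s'}^{\pi_{\mu_\theta,\sigma}}(s) \pi_{\mu_\theta,\sigma}(a|s) r(s, a) \, da \, ds - \int_{\mathcal{S}} \int_{\mathcal{A}} \rho_{s'}^{\pi_{\mu_\theta,0}}(s) \pi_{\mu_\theta,0}(a|s) r(s, a) \, da \, ds \right| \\
&\le \epsilon_1 + \sup_{s'} \int_{\mathcal{S}} \int_{\mathcal{A}} \left| \rho_{s'}^{\pi_{\mu_\theta,\sigma}}(s) - \rho_{s'}^{\pi_{\mu_\theta,0}}(s) \right| |r(s, a)| \, \pi_{\mu_\theta,0}(a|s) \, da \, ds \\
&\quad + \sup_{s'} \int_{\mathcal{S}} \rho_{s'}^{\pi_{\mu_\theta,\sigma}}(s) \left| \int_{\mathcal{A}} r(s, a) \left( \pi_{\mu_\theta,\sigma}(a|s) - \pi_{\mu_\theta,0}(a|s) \right) da \right| ds \\
&\le \epsilon_1 + \epsilon_2 \zeta b + \epsilon_1/(1 - \gamma)
\end{align*}
which can thus be made arbitrarily small by choosing $\sigma$ sufficiently small.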

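Proof of Theorem 2. Translation invariance and Lemma 1 imply that $\nabla_{a'} \nu_\sigma(a', a)|_{a'=\mu_\theta(s)} = -\nabla_a \nu_\sigma(\mu_\theta(s), a)$. Then integration by parts implies that,
\begin{align*}
\int_{\mathcal{A}} Q^{\pi_{\mu_\theta,\sigma}}(s, a) \nabla_{a'} \nu_\sigma(a', a)|_{a'=\mu_\theta(s)} \, da &= -\int_{\mathcal{A}} Q^{\pi_{\mu_\theta,\sigma}}(s, a) \nabla_a \nu_\sigma(\mu_\theta(s), a) \, da \\
&= \int_{C_{\mu_\theta(s)}} \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a) \nu_\sigma(\mu_\theta(s), a) \, da + \text{boundary terms} \\
&= \int_{C_{\mu_\theta(s)}} \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a) \nu_\sigma(\mu_\theta(s), a) \, da
\end{align*}
where the boundary terms are zero since $\nu_\sigma$ vanishes on the boundary. We have, from the stochastic policy gradient theorem,
\begin{align*}
\lim_{\sigma \downarrow 0} \nabla_\theta J(\pi_{\mu_\theta,\sigma}) &= \lim_{\sigma \downarrow 0} \int_{\mathcal{S}} \rho^{\pi_{\mu_\theta,\sigma}}(s) \int_{\mathcal{A}} Q^{\pi_{\mu_\theta,\sigma}}(s, a) \nabla_\theta \pi_{\mu_\theta,\sigma}(a|s) \, da \, ds \\
&= \lim_{\sigma \downarrow 0} \int_{\mathcal{S}} \rho^{\pi_{\mu_\theta,\sigma}}(s) \int_{\mathcal{A}} Q^{\pi_{\mu_\theta,\sigma}}(s, a) \nabla_\theta \mu_\theta(s) \nabla_{a'} \nu_\sigma(a', a)|_{a'=\mu_\theta(s)} \, da \, ds \\
&= \lim_{\sigma \downarrow 0} \int_{\mathcal{S}} \rho^{\pi_{\mu_\theta,\sigma}}(s) \nabla_\theta \mu_\theta(s) \int_{C_{\mu_\theta(s)}} \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a) \nu_\sigma(\mu_\theta(s), a) \, da \, ds \\
&= \int_{\mathcal{S}} \lim_{\sigma \downarrow 0} \rho^{\pi_{\mu_\theta,\sigma}}(s) \nabla_\theta \mu_\theta(s) \int_{C_{\mu_\theta(s)}} \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a) \nu_\sigma(\mu_\theta(s), a) \, da \, ds, \tag{7}
\end{align*}
where the exchange of limit and integral in (7) follows by dominated convergence (in Banach space), where we can take the dominating function (which is bounded by Lemma 2),
\begin{align*}
g_\theta(s) &= \sup_\sigma \left\{ \rho^{\pi_{\mu_\theta,\sigma}}(s) \right\} \sup_{a \in C_{\mu_\theta(s)},\, \sigma} \left\{ \left\| \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a) \right\| \right\} \left\| \nabla_\theta \mu_\theta(s) \right\|_{op} \\
&\ge \left\| \rho^{\pi_{\mu_\theta,\sigma}}(s) \nabla_\theta \mu_\theta(s) \int_{C_{\mu_\theta(s)}} \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a) \nu_\sigma(\mu_\theta(s), a) \, da \right\|. \tag{8}
\end{align*}
Here $\|\cdot\|_{op}$ denotes the operator norm, or largest singular value. Now note that by uniform convergence of $\nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a)$, Lemma 4, given any $\epsilon_1, \epsilon_2$ there exists $\sigma'$ such that for all $\sigma < \sigma'$ we have
$$\left\| \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a) - \nabla_a Q^{\pi_{\mu_\theta,0}}(s, a) \right\| < \epsilon_1$$
so that
$$\left\| \int_{C_{\mu_\theta(s)}} \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a) \nu_\sigma(\mu_\theta(s), a) \, da - \int_{C_{\mu_\theta(s)}} \nabla_a Q^{\pi_{\mu_\theta,0}}(s, a) \nu_\sigma(\mu_\theta(s), a) \, da \right\| < \epsilon_1,$$
and also that,
$$\left\| \int_{C_{\mu_\theta(s)}} \nabla_a Q^{\pi_{\mu_\theta,0}}(s, a) \nu_\sigma(\mu_\theta(s), a) \, da - \nabla_a Q^{\pi_{\mu_\theta,0}}(s, a)|_{a=\mu_\theta(s)} \right\| < \epsilon_2.$$
Hence,
$$\left\| \int_{C_{\mu_\theta(s)}} \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a) \nu_\sigma(\mu_\theta(s), a) \, da - \nabla_a Q^{\pi_{\mu_\theta,0}}(s, a)|_{a=\mu_\theta(s)} \right\| < \epsilon_1 + \epsilon_2$$
and from this and Lemma 3 we have,
\begin{align*}
(7) &= \int_{\mathcal{S}} \rho^{\pi_{\mu_\theta,0}}(s) \nabla_\theta \mu_\theta(s) \lim_{\sigma \downarrow 0} \int_{C_{\mu_\theta(s)}} \nabla_a Q^{\pi_{\mu_\theta,\sigma}}(s, a) \nu_\sigma(\mu_\theta(s), a) \, da \, ds \\
&= \int_{\mathcal{S}} \rho^{\pi_{\mu_\theta,0}}(s) \nabla_\theta \mu_\theta(s) \nabla_a Q^{\pi_{\mu_\theta,0}}(s, a)|_{a=\mu_\theta(s)} \, ds \\
&= \int_{\mathcal{S}} \rho^{\mu_\theta}(s) \nabla_\theta \mu_\theta(s) \nabla_a Q^{\mu_\theta}(s, a)|_{a=\mu_\theta(s)} \, ds,
\end{align*}
which equals $\nabla_\theta J(\mu_\theta)$ by Theorem 1.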
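The limit in (4) can be observed numerically in a deliberately degenerate setting. The sketch below assumes a single state, $\gamma = 0$ (so $Q = r$), the scalar policy $\mu_\theta = \theta$, the reward $r(a) = \sin(a)$, and the bump-based kernel $\nu_\sigma(\theta, a) = \xi((a - \theta)/\sigma)/\sigma$. All of these, along with the function names and the midpoint quadrature, are illustrative assumptions rather than anything taken from the paper, and a one-step problem says nothing about the multi-step lemmas.

```python
import math

N = 20_000                       # midpoint quadrature nodes on (-1, 1)
DU = 2.0 / N
U = [-1.0 + (i + 0.5) * DU for i in range(N)]

RAW = [math.exp(-1.0 / (1.0 - u * u)) for u in U]   # unnormalized bump
Z = sum(RAW) * DU
XI = [x / Z for x in RAW]                           # xi(u), integrates to 1
# xi'(u) = xi(u) * (-2u) / (1 - u^2)^2 by the chain rule.
DXI = [x * (-2.0 * u) / (1.0 - u * u) ** 2 for x, u in zip(XI, U)]


def stochastic_pg(reward, theta, sigma):
    """Integral of reward(a) * d/dtheta nu_sigma(theta, a) over a.

    Since d/dtheta nu_sigma(theta, a) = -xi'((a - theta)/sigma) / sigma^2,
    the substitution a = theta + sigma * u gives
    -(1/sigma) * integral of reward(theta + sigma * u) * xi'(u) over u.
    """
    total = sum(reward(theta + sigma * u) * d for u, d in zip(U, DXI))
    return -total * DU / sigma


def deterministic_pg(reward_grad, theta):
    """grad_theta mu_theta * grad_a r(a) at a = mu_theta, with mu_theta = theta."""
    return reward_grad(theta)


def gap(theta, sigma):
    """Absolute difference between the two gradients for r = sin."""
    return abs(stochastic_pg(math.sin, theta, sigma)
               - deterministic_pg(math.cos, theta))


if __name__ == "__main__":
    for sigma in (0.5, 0.1, 0.01):
        print(sigma, gap(0.7, sigma))
```

In this toy case integration by parts, the same step used at the start of the proof, shows that the stochastic gradient equals $\cos(\theta) \int \xi(u) \cos(\sigma u) \, du$, so the gap shrinks to zero as $\sigma \downarrow 0$.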