Vol. 52 No. 12  3300–3308 (Dec. 2011)

Compound Reinforcement Learning: Framework and Application

Tohgoroh Matsui,†1  Takashi Goto,†2  Kiyoshi Izumi†3,†4  and  Yu Chen†3

This paper describes an extended framework of reinforcement learning, called compound reinforcement learning, which maximizes the compound return, and shows its application to financial tasks. Compound reinforcement learning is designed for return-based MDPs, in which an agent observes the return instead of the reward. We introduce double exponential discounting and a betting fraction into the framework; the logarithm of the double-exponentially discounted compound return can then be written as the sum of polynomially discounted logarithms of the simple gross returns. In this paper, we present two compound reinforcement learning algorithms: compound Q-learning and compound OnPS. We also show experimental results on a 3-armed bandit problem and an application to a financial task: Japanese government bond trading.

†1 Chubu University
†2 The Bank of Tokyo-Mitsubishi UFJ, Ltd.
†3 The University of Tokyo
†4 JST PRESTO

1. Introduction

In the standard reinforcement learning framework 16), an agent learns to maximize the expected sum of discounted rewards. This paper instead considers tasks in which the agent repeatedly reinvests what it has gained, so that the quantity to be maximized is the compound return. Figure 1 illustrates a 2-armed bandit problem with two slot machines, A and B; Fig. 2 shows examples of betting $1 on each play, and Fig. 3 shows examples of betting all of one's money on each play.

Fig. 1  2-armed bandit problem.
Fig. 2  Examples of betting $1 each.

Fig. 3  Examples of betting all money.

This paper builds on compound reinforcement learning 8). Section 2 formulates the compound return and the return-based MDP, Section 3 presents two learning algorithms, compound Q-learning and compound OnPS, and Sections 4 and 5 report experiments on a 3-armed bandit problem and on Japanese government bond trading.

2. Compound Reinforcement Learning

2.1 Return-Based MDP and Compound Return

Let P_t denote the price of an asset at time t. The simple net return between time t and t+1 is

    R_{t+1} = (P_{t+1} - P_t) / P_t = P_{t+1} / P_t - 1,                          (1)

and 1 + R_{t+1} is called the simple gross return 3). The compound return from time t to the final time T is the product of the simple gross returns,

    ρ_t = (1 + R_{t+1})(1 + R_{t+2}) ··· (1 + R_T).                               (2)

For a continuing task that has no final time T, the compound return is the infinite product

    ρ_t = (1 + R_{t+1})(1 + R_{t+2})(1 + R_{t+3}) ···
        = ∏_{k=0}^{∞} (1 + R_{t+k+1}).                                            (3)

We call an MDP in which the agent observes the simple net return R_{t+k+1} instead of a reward a return-based MDP; compound reinforcement learning aims to maximize the compound return in such an MDP.
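As a small numerical illustration of Eqs. (1) and (2), the following Python sketch computes simple net returns from a price series and their compound return; the price values are hypothetical and only illustrate the arithmetic.

    # Minimal sketch of Eqs. (1)-(2): simple net returns and the compound return.
    # The price series below is hypothetical.

    def simple_returns(prices):
        """Simple net returns R_{t+1} = P_{t+1} / P_t - 1 for a price series."""
        return [p_next / p - 1.0 for p, p_next in zip(prices, prices[1:])]

    def compound_return(returns):
        """Compound return rho = prod_k (1 + R_k) over the whole period."""
        rho = 1.0
        for r in returns:
            rho *= 1.0 + r
        return rho

    prices = [100.0, 103.0, 101.0, 106.0]       # hypothetical prices P_0 .. P_3
    rets = simple_returns(prices)               # [0.03, -0.0194..., 0.0495...]
    print(rets)
    print(compound_return(rets))                # equals P_3 / P_0 = 1.06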
2.2 Double Exponential Discounting

In ordinary reinforcement learning, a reward r_{t+k+1} received k steps in the future is discounted by the factor γ^k. In compound reinforcement learning, we instead discount the simple gross return 1 + R_{t+k+1} received k steps in the future by raising it to the power γ^k. Because the resulting weight has the form of a double exponential function f(x) = a^{b^x}, we call this double exponential discounting. The discounted compound return is

    ρ_t = (1 + R_{t+1})(1 + R_{t+2})^γ (1 + R_{t+3})^{γ^2} ···
        = ∏_{k=0}^{∞} (1 + R_{t+k+1})^{γ^k}.                                      (4)

When the simple gross returns are positive, taking the logarithm of Eq. (4) gives

    log ρ_t = log ∏_{k=0}^{∞} (1 + R_{t+k+1})^{γ^k}
            = Σ_{k=0}^{∞} log (1 + R_{t+k+1})^{γ^k}
            = Σ_{k=0}^{∞} γ^k log(1 + R_{t+k+1})
            = log(1 + R_{t+1}) + γ Σ_{k=0}^{∞} γ^k log(1 + R_{t+k+2})
            = log(1 + R_{t+1}) + γ log ρ_{t+1}.                                   (5)

2.3 Betting Fraction

The logarithm log(1 + R_{t+1}) is not defined when R_{t+1} = -1, that is, when the agent loses everything it has bet. We therefore introduce a betting fraction f, the fraction of the agent's money that is bet on each play. The simple net return on the agent's money then becomes R_{t+1} f, the simple gross return becomes 1 + R_{t+1} f, and the logarithm is defined as long as R_{t+1} f > -1. With the betting fraction, the double-exponentially discounted compound return is

    ρ_t = ∏_{k=0}^{∞} (1 + R_{t+k+1} f)^{γ^k},                                    (6)

and its logarithm is

    log ρ_t = Σ_{k=0}^{∞} γ^k log(1 + R_{t+k+1} f).                               (7)

Equation (7) has the same form as the ordinary discounted sum of rewards with the reward r_{t+1} replaced by log(1 + R_{t+1} f); maximizing the expected logarithm of the compound return is therefore an ordinary reinforcement learning problem with this transformed reinforcement.
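The identity behind Eqs. (5) and (7) can be checked numerically: the logarithm of the double-exponentially discounted compound return with betting fraction f equals the γ-discounted sum of log(1 + Rf). The return sequence, γ, and f in the following sketch are hypothetical.

    import math

    # Numerical check of Eq. (7): log of the double-exponentially discounted
    # compound return equals the polynomially discounted sum of log gross returns.
    # The return sequence, gamma, and f below are hypothetical.

    gamma, f = 0.9, 0.5
    returns = [0.10, -0.05, 0.20, -0.10, 0.03]  # hypothetical simple net returns

    # Left-hand side: rho_t = prod_k (1 + R_{t+k+1} f)^(gamma^k), then take the log.
    rho = 1.0
    for k, R in enumerate(returns):
        rho *= (1.0 + R * f) ** (gamma ** k)
    lhs = math.log(rho)

    # Right-hand side: sum_k gamma^k * log(1 + R_{t+k+1} f).
    rhs = sum((gamma ** k) * math.log(1.0 + R * f) for k, R in enumerate(returns))

    print(lhs, rhs)                             # identical up to floating-point error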
2.4 Value Functions

The state-value function of a policy π in compound reinforcement learning is the expected logarithm of the discounted compound return,

    V^π(s) = E_π[ log ρ_t | s_t = s ]
           = E_π[ Σ_{k=0}^{∞} γ^k log(1 + R_{t+k+1} f) | s_t = s ]
           = E_π[ log(1 + R_{t+1} f) + γ Σ_{k=0}^{∞} γ^k log(1 + R_{t+k+2} f) | s_t = s ]
           = Σ_{a∈A(s)} π(s, a) Σ_{s'∈S} P^a_{ss'} ( R̄^a_{ss'}
               + γ E_π[ Σ_{k=0}^{∞} γ^k log(1 + R_{t+k+2} f) | s_{t+1} = s' ] )
           = Σ_{a∈A(s)} π(s, a) Σ_{s'∈S} P^a_{ss'} ( R̄^a_{ss'} + γ V^π(s') ),     (8)

where π(s, a) is the probability of taking action a in state s, P^a_{ss'} is the state-transition probability, and R̄^a_{ss'} is the expected reinforcement

    R̄^a_{ss'} = E[ log(1 + R_{t+1} f) | s_t = s, a_t = a, s_{t+1} = s' ].          (9)

Similarly, the action-value function of π is

    Q^π(s, a) = E_π[ log ρ_t | s_t = s, a_t = a ]
              = Σ_{s'∈S} P^a_{ss'} ( R̄^a_{ss'} + γ V^π(s') ),                      (10)

and the optimal action-value function satisfies the Bellman optimality equation

    Q*(s, a) = max_{π∈Π} Q^π(s, a)
             = E[ log(1 + R_{t+1} f) + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a ]
             = Σ_{s'∈S} P^a_{ss'} ( R̄^a_{ss'} + γ max_{a'} Q*(s', a') ).           (11)

This has the same form as the ordinary Bellman optimality equation for Q, except that the expected reward is replaced by the expected reinforcement R̄^a_{ss'} of Eq. (9), which depends on the betting fraction f.

3. Algorithms

3.1 Compound Q-learning

Compound Q-learning is derived from the Bellman optimality equation (11) in the same way that ordinary Q-learning is derived from the ordinary Bellman optimality equation: the reward r_{t+1} is replaced by log(1 + R_{t+1} f). After observing s_t, a_t, s_{t+1}, and R_{t+1}, the agent updates

    Q(s_t, a_t) ← Q(s_t, a_t) + α ( log(1 + R_{t+1} f) + γ max_{a∈A} Q(s_{t+1}, a) - Q(s_t, a_t) ),   (12)

where α is the learning rate, γ is the discount factor, and f is the betting fraction. Figure 4 shows the compound Q-learning algorithm. It differs from ordinary Q-learning in two respects: (i) the agent observes the simple net return R instead of the reward r, and (ii) the update uses log(1 + Rf) in place of the reward.

    Initialize γ, α, f, and Q(s, a) arbitrarily
    loop (for each episode):
        Initialize s
        repeat (for each step of the episode):
            Choose a from s using a policy derived from Q
            Take action a; observe R and s'
            Q(s, a) ← Q(s, a) + α ( log(1 + Rf) + γ max_{a'} Q(s', a') - Q(s, a) )
            s ← s'
        until s is terminal
    end loop

    Fig. 4  Compound Q-learning algorithm.
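As a concrete illustration of the update rule (12) and Fig. 4, the following Python sketch implements tabular compound Q-learning. The environment interface (env.actions, env.reset, and env.step returning the next state, the simple net return R, and a termination flag) and the ε-greedy action selection are assumptions made for illustration, not part of the paper's specification.

    import math
    import random
    from collections import defaultdict

    def compound_q_learning(env, n_episodes, alpha=0.001, gamma=0.9, f=1.0, eps=0.2):
        """Tabular compound Q-learning (Fig. 4): the TD target uses log(1 + R*f).

        `env` is a hypothetical interface: env.actions is a list of actions,
        env.reset() returns an initial state, and env.step(a) returns
        (next_state, R, done), where R is the simple net return.
        """
        Q = defaultdict(float)                  # Q[(s, a)], initialized to 0

        def greedy(s):
            return max(env.actions, key=lambda a: Q[(s, a)])

        for _ in range(n_episodes):
            s = env.reset()
            done = False
            while not done:
                a = random.choice(env.actions) if random.random() < eps else greedy(s)
                s_next, R, done = env.step(a)
                target = math.log(1.0 + R * f)  # reinforcement computed from the return
                if not done:
                    target += gamma * max(Q[(s_next, b)] for b in env.actions)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s_next
        return Q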
Because compound Q-learning applied to a return-based MDP is identical to ordinary Q-learning applied to an MDP whose reward is r_{t+1} = log(1 + R_{t+1} f), the convergence result of Watkins and Dayan 19) carries over, provided that R_t f is bounded away from -1.

Theorem 1. Consider a return-based MDP with bounded returns -1 < R_min ≤ R_t f ≤ R_max and learning rates 0 ≤ α_t < 1 satisfying

    Σ_{i=1}^{∞} α_{t_i(s,a)} = ∞,    Σ_{i=1}^{∞} [α_{t_i(s,a)}]^2 < ∞               (13)

for all s and a, where t_i(s, a) is the time of the i-th update of the pair (s, a). Then Q_t(s, a) converges to Q*(s, a) with probability 1 for all s and a.

Proof. With r_{t+1} = log(1 + R_{t+1} f), the update (12) is exactly the ordinary Q-learning update. Since -1 < R_min ≤ R_t f ≤ R_max, the reward log(1 + R_t f) is bounded, so the MDP with reward log(1 + R_t f) satisfies the conditions of Watkins and Dayan's convergence theorem 19), and the claim follows. □

3.2 Compound OnPS

Profit sharing 11) reinforces the preference P(s, a) of every state-action pair visited in an episode when a reward is obtained at its end:

    P(s_t, a_t) ← P(s_t, a_t) + g(t, r_T, T),                                      (14)

where r_T is the reward obtained at time T and g is the credit-assignment function

    g(t, r_T, T) = γ^{T-t-1} r_T.                                                  (15)

OnPS (on-line profit sharing) 10) is an on-line version of profit sharing that uses an eligibility count c(s, a). At each step, the count of the visited pair is incremented,

    c(s_t, a_t) ← c(s_t, a_t) + 1,                                                 (16)

every preference is reinforced in proportion to its count,

    P(s, a) ← P(s, a) + α r_{t+1} c(s, a)    for all s, a,                         (17)

and the counts are then decayed,

    c(s, a) ← γ c(s, a)    for all s, a.                                           (18)

As in compound Q-learning, compound OnPS replaces the reward r_{t+1} by log(1 + R_{t+1} f), so the update (17) becomes

    P(s, a) ← P(s, a) + α log(1 + R_{t+1} f) c(s, a)    for all s, a.              (19)

Figure 5 shows the compound OnPS algorithm. It differs from ordinary OnPS in the same two respects as compound Q-learning: (i) the agent observes the simple net return R instead of the reward r, and (ii) the preferences are updated with log(1 + Rf). Since OnPS is equivalent to profit sharing 10), the analysis of profit sharing 11) applies; it assumes a positive reward r_T > 0, which in compound OnPS corresponds to log(1 + R_T f) > 0.

    Initialize γ, α, f, and c
    for all s, a do
        P(s, a) ← c
    end for
    loop (for each episode):
        Initialize s
        for all s, a do c(s, a) ← 0 end for
        repeat (for each step of the episode):
            Choose a from s using a policy derived from P
            c(s, a) ← c(s, a) + 1
            Take action a; observe R and s'
            for all s, a do
                P(s, a) ← P(s, a) + α log(1 + Rf) c(s, a)
                c(s, a) ← γ c(s, a)
            end for
            s ← s'
        until s is terminal
    end loop

    Fig. 5  Compound OnPS algorithm.
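A corresponding sketch of compound OnPS as given in Fig. 5 follows. The same hypothetical environment interface as above is assumed, and the Gibbs (softmax) action selection over the preferences P is one possible choice rather than part of the algorithm itself.

    import math
    import random
    from collections import defaultdict

    def compound_onps(env, n_episodes, alpha=0.001, gamma=0.9, f=1.0, tau=0.2, p0=0.0):
        """Compound OnPS (Fig. 5): the eligibility count of the visited pair is
        incremented, preferences are reinforced by alpha*log(1 + R*f)*c(s, a),
        and the counts decay by gamma after each step (Eqs. (16)-(19))."""
        P = defaultdict(lambda: p0)             # preferences P(s, a), initial value p0

        def select(s):
            # Gibbs (softmax) selection with temperature tau; an illustrative choice.
            prefs = [P[(s, a)] / tau for a in env.actions]
            m = max(prefs)
            weights = [math.exp(p - m) for p in prefs]
            return random.choices(env.actions, weights=weights)[0]

        for _ in range(n_episodes):
            s = env.reset()
            c = defaultdict(float)              # eligibility counts, reset each episode
            done = False
            while not done:
                a = select(s)
                c[(s, a)] += 1.0                # Eq. (16)
                s_next, R, done = env.step(a)
                g = alpha * math.log(1.0 + R * f)
                # Updating only pairs with nonzero counts is equivalent to the
                # "for all s, a" updates, since c is zero elsewhere.
                for key in list(c):
                    P[key] += g * c[key]        # Eq. (19)
                    c[key] *= gamma             # Eq. (18)
                s = s_next
        return P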
4. Experiment: 3-Armed Bandit

We compared compound Q-learning with ordinary ("simple") Q-learning and with variance-penalized Q-learning 15) on the 3-armed bandit problem shown in Fig. 6. The parameters were γ = 0.9, f = 1, α = 0.001, ε = 0.2 for ε-greedy action selection, and κ = 1 for variance-penalized Q-learning, and the results were averaged over 100 runs of 100 plays each. Figure 7 shows the experimental results.

    Fig. 6  3-armed bandit problem.

    Fig. 7  Experimental results for 3-armed bandit.

5. Application: Japanese Government Bond Trading

Following Matsui et al. 9), we applied compound OnPS to a 10-year Japanese government bond (JGB) trading task and compared it with ordinary ("simple") OnPS. The state was represented by the yield and its volatility, (y_t, σ_t), and the preferences were approximated with a 15 × 15 grid of radial basis functions (RBFs). The discount factor was γ = 0.9, and actions were selected from a Gibbs (softmax) distribution with temperatures τ = 0.2 and τ = 0.1 for learning and evaluation, respectively. The agents were trained on data from January 1, 2004 to December 31, 2006 and evaluated on data from January 1, 2007 onward.
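A minimal sketch of such an RBF state representation over (y_t, σ_t) is shown below; the grid size follows the 15 × 15 setting above, but the value ranges, centers, and kernel width are assumptions made only for illustration.

    import math

    # Minimal sketch of Gaussian RBF features for a two-dimensional state (y, sigma),
    # one way to feed a continuous state into OnPS-style preferences. The ranges,
    # centers, and width below are hypothetical, not the paper's actual setup.

    def rbf_features(y, sigma, y_range=(0.5, 2.5), s_range=(0.0, 1.0), n=15, width=0.1):
        """Gaussian RBF activations on an n-by-n grid of centers over the state space."""
        y_centers = [y_range[0] + i * (y_range[1] - y_range[0]) / (n - 1) for i in range(n)]
        s_centers = [s_range[0] + j * (s_range[1] - s_range[0]) / (n - 1) for j in range(n)]
        feats = []
        for yc in y_centers:
            for sc in s_centers:
                d2 = (y - yc) ** 2 + (sigma - sc) ** 2
                feats.append(math.exp(-d2 / (2.0 * width ** 2)))
        return feats

    phi = rbf_features(1.4, 0.3)                # 225 features for a hypothetical state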
Figure 8 shows the yields of the 10-year JGB from 2004 to 2009 together with their ±3σ range, and Fig. 9 shows the experimental results. The mean and standard deviation of the obtained returns were μ = 0.217 and σ = 0.510 for compound OnPS, and μ = 0.150 and σ = 0.578 for ordinary OnPS; compound OnPS achieved a higher mean return with a smaller standard deviation.

    Fig. 8  Yields of 10-year Japanese government bond (JGB).

    Fig. 9  Experimental results for 10-year JGB.

6. Discussion

Figure 10 shows the relation between the return R and the reinforcement log(1 + Rf) for f = 1 and f = 0.5, together with the simple reward R. The logarithmic reinforcement is concave in R, so compound reinforcement learning evaluates returns in the manner of a risk-averse utility function 14),18), and a smaller betting fraction f makes the reinforcement less sensitive to the return.

    Fig. 10  Relations between returns and reinforcements.
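The relation in Fig. 10 can be reproduced numerically: the compound reinforcement log(1 + Rf) is concave in R and is flattened further by a smaller betting fraction, while the simple reward is linear. The range of returns in the following sketch is hypothetical.

    import math

    # Compare the simple reward R with the compound reinforcement log(1 + R*f)
    # for f = 1 and f = 0.5, over a hypothetical range of returns (cf. Fig. 10).

    for R in [x / 10.0 for x in range(-5, 6)]:  # R from -0.5 to 0.5
        print("R=%+.2f  simple=%+.3f  log(1+R*1.0)=%+.3f  log(1+R*0.5)=%+.3f"
              % (R, R, math.log(1.0 + R), math.log(1.0 + 0.5 * R)))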
The betting fraction f plays a role similar to that of the Kelly criterion 6),13) and of optimal f 17), which determine how much of one's capital to bet. Risk in reinforcement learning has also been addressed by worst-case and variance-penalized criteria 5),15), by risk-sensitive formulations 1),2),4), and by reward adjustment for risk-averse asset allocation 7); reinforcement learning has likewise been applied to trade execution 12). Compound reinforcement learning differs from these approaches in that it directly maximizes the compound return.

7. Conclusion

This paper described compound reinforcement learning, an extended framework of reinforcement learning that maximizes the compound return in return-based MDPs, in which the agent observes the return instead of the reward. Introducing double exponential discounting and a betting fraction allows the logarithm of the discounted compound return to be written as a polynomially discounted sum of log gross returns, so that standard algorithms can be adapted; we presented compound Q-learning and compound OnPS, and demonstrated them on a 3-armed bandit problem and on a Japanese government bond trading task.

Acknowledgments  This work was supported in part by grants 21013049 and 23700182.

References

1) Basu, A., Bhattacharyya, T. and Borkar, V.S.: A Learning Algorithm for Risk-Sensitive Cost, Mathematics of Operations Research, Vol.33, No.4, pp.880–898 (2008).
2) Borkar, V.S.: Q-Learning for Risk-Sensitive Control, Mathematics of Operations Research, Vol.27, No.2, pp.294–311 (2002).
3) Campbell, J.Y., Lo, A.W. and MacKinlay, A.C.: The Econometrics of Financial Markets, Princeton University Press (1997).
4) Geibel, P. and Wysotzki, F.: Risk-Sensitive Reinforcement Learning Applied to Control under Constraints, Journal of Artificial Intelligence Research, Vol.24, pp.81–108 (2005).
5) Heger, M.: Consideration of Risk in Reinforcement Learning, Proc. 11th International Conference on Machine Learning (ICML 1994), pp.105–111 (1994).
6) Kelly, Jr., J.L.: A New Interpretation of Information Rate, Bell System Technical Journal, Vol.35, pp.917–926 (1956).
7) Li, J. and Chan, L.: Reward Adjustment Reinforcement Learning for Risk-averse Asset Allocation, Proc. International Joint Conference on Neural Networks (IJCNN 2006), pp.534–541 (2006).
8) (In Japanese), Vol.26, No.2, pp.330–334 (2011).
9) Matsui, T., Goto, T. and Izumi, K.: Acquiring a Government Bond Trading Strategy Using Reinforcement Learning, Journal of Advanced Computational Intelligence and Intelligent Informatics, Vol.13, No.6, pp.691–696 (2009).
10) Matsui, T., Inuzuka, N. and Seki, H.: On-Line Profit Sharing Works Efficiently, Proc. 7th International Conference on Knowledge-Based Intelligent Information & Engineering Systems (KES 2003), pp.317–324 (2003).
11) (In Japanese), Vol.9, No.4, pp.580–587 (1994).
12) Nevmyvaka, Y., Feng, Y. and Kearns, M.: Reinforcement Learning for Optimized Trade Execution, Proc. 23rd International Conference on Machine Learning (ICML 2006), pp.673–680 (2006).
13) Poundstone, W.: Fortune's Formula: The Untold Story of the Scientific Betting System That Beat the Casinos and Wall Street, Hill and Wang (2005).
14) Russell, S. and Norvig, P.: Artificial Intelligence: A Modern Approach, 2nd edition, Prentice Hall (2003).
15) Sato, M. and Kobayashi, S.: Average-Reward Reinforcement Learning for Variance Penalized Markov Decision Problems, Proc. 18th International Conference on Machine Learning (ICML 2001), pp.473–480 (2001).
16) Sutton, R.S. and Barto, A.G.: Reinforcement Learning: An Introduction, The MIT Press (1998).
17) Vince, R.: Find Your Optimal f, Technical Analysis of Stocks & Commodities, Vol.8, No.12, pp.476–477 (1990).
18) Von Neumann, J. and Morgenstern, O.: Theory of Games and Economic Behavior, Princeton University Press (1944).
19) Watkins, C.J.C.H. and Dayan, P.: Technical Note: Q-Learning, Machine Learning, Vol.8, No.3/4, pp.279–292 (1992).

(Received April 11, 2011)
(Accepted September 12, 2011)