Compound Reinforcement Learning: Framework and Application


Vol. 52, No. 12, pp. 3300-3308 (Dec. 2011)

Compound Reinforcement Learning: Framework and Application

Tohgoroh Matsui,*1 Takashi Goto,*2 Kiyoshi Izumi*3,*4 and Yu Chen*3

This paper describes an extended framework of reinforcement learning, called compound reinforcement learning, which maximizes the compound return, and shows its application to financial tasks. Compound reinforcement learning is designed for return-based MDPs, in which an agent observes the return instead of the reward. We introduce double exponential discounting and a betting fraction into the framework, which lets us write the logarithm of the double-exponentially discounted compound return as a polynomially discounted sum of logarithms of simple gross returns. In this paper, we present two compound reinforcement learning algorithms: compound Q-learning and compound OnPS. We also show experimental results on a 3-armed bandit problem and an application to a financial task: Japanese government bond trading.

*1 Chubu University
*2 The Bank of Tokyo-Mitsubishi UFJ, Ltd.
*3 The University of Tokyo
*4 JST PRESTO

1. Introduction

[Japanese introduction: a motivating 2-armed bandit example (Fig. 1) in which an agent repeatedly bets $100 on one of two arms, A or B, with gross-return factors of about 1.5 and 1.25; because the bets compound, the quantity to maximize is the multiplicative growth of the asset rather than the sum of one-shot rewards.]

Fig. 1  2-armed bandit problem.

(c) 2011 Information Processing Society of Japan

Fig. 2  Examples of betting $1 each.
Fig. 3  Examples of betting all money.

2. Compound Reinforcement Learning

2.1 Compound return and return-based MDP
Let P_t denote the agent's asset (price) at time t. The simple return obtained between t and t+1 is

    R_{t+1} = (P_{t+1} - P_t) / P_t = P_{t+1} / P_t - 1,                                   (1)

and 1 + R_{t+1} is called the simple gross return.3) The compound return from time t up to a horizon T is the product of the simple gross returns,

    \rho_t = (1 + R_{t+1})(1 + R_{t+2}) \cdots (1 + R_T),                                  (2)

and, letting T go to infinity,

    \rho_t = (1 + R_{t+1})(1 + R_{t+2})(1 + R_{t+3}) \cdots = \prod_{k=0}^{\infty} (1 + R_{t+k+1}).   (3)

Compound reinforcement learning assumes a return-based MDP: at each step the agent observes the return R_{t+k+1} instead of a reward, and its goal is to maximize the compound return rather than the sum of rewards.

2.2 Double exponential discounting
In ordinary reinforcement learning, a reward r_{t+k+1} received k steps in the future is discounted by the factor \gamma^k.
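As a small, self-contained illustration of Eqs. (1)-(3), the following Python sketch (the helper names and price values are ours, not from the paper) computes the simple returns of a price series and multiplies the gross returns into a compound return:

    from math import prod

    def simple_returns(prices):
        """Eq. (1): R_{t+1} = P_{t+1}/P_t - 1 for each consecutive pair of prices."""
        return [p1 / p0 - 1.0 for p0, p1 in zip(prices, prices[1:])]

    def compound_return(returns):
        """Eq. (2): rho_t = (1 + R_{t+1})(1 + R_{t+2}) ... (1 + R_T)."""
        return prod(1.0 + r for r in returns)

    prices = [100.0, 110.0, 99.0, 108.9]
    rs = simple_returns(prices)      # [0.10, -0.10, 0.10]
    print(compound_return(rs))       # 1.089, i.e. 108.9 / 100.0: the asset compounds multiplicatively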

In compound reinforcement learning, the simple gross return 1 + R_{t+k+1} observed k steps in the future is instead discounted by raising it to the power \gamma^k, that is, (1 + R_{t+k+1})^{\gamma^k}. Since this discounting has the form of a double exponential function f(x) = a^{b^x}, we call it double exponential discounting. The double-exponentially discounted compound return is

    \rho_t = (1 + R_{t+1})(1 + R_{t+2})^{\gamma} (1 + R_{t+3})^{\gamma^2} \cdots = \prod_{k=0}^{\infty} (1 + R_{t+k+1})^{\gamma^k}.   (4)

Because \rho_t > 0, its logarithm can be written as

    \log \rho_t = \log \prod_{k=0}^{\infty} (1 + R_{t+k+1})^{\gamma^k}
                = \sum_{k=0}^{\infty} \log (1 + R_{t+k+1})^{\gamma^k}
                = \sum_{k=0}^{\infty} \gamma^k \log (1 + R_{t+k+1})
                = \log (1 + R_{t+1}) + \gamma \sum_{k=0}^{\infty} \gamma^k \log (1 + R_{t+k+2})
                = \log (1 + R_{t+1}) + \gamma \log \rho_{t+1}.                             (5)

2.3 Betting fraction
The term \log(1 + R_{t+1}) diverges to negative infinity when R_{t+1} = -1, that is, when the entire asset is lost. We therefore introduce a betting fraction f: the agent bets only the fraction f of its asset, so its gross return becomes 1 + R_{t+1} f, which stays positive as long as R_{t+1} f > -1. With the betting fraction f, the double-exponentially discounted compound return becomes

    \rho_t = \prod_{k=0}^{\infty} (1 + R_{t+k+1} f)^{\gamma^k},                            (6)

and its logarithm is

    \log \rho_t = \sum_{k=0}^{\infty} \gamma^k \log (1 + R_{t+k+1} f).                     (7)

Equation (7) shows that maximizing the double-exponentially discounted compound return is equivalent to ordinary, polynomially discounted reinforcement learning in which the reward r_{t+1} is replaced by \log(1 + R_{t+1} f).

2.4 Value functions
For a policy \pi, the state-value function V^{\pi}(s) is defined as the expected log compound return:

    V^{\pi}(s) = E_{\pi}[ \log \rho_t \mid s_t = s ]
               = E_{\pi}[ \sum_{k=0}^{\infty} \gamma^k \log (1 + R_{t+k+1} f) \mid s_t = s ]
               = E_{\pi}[ \log (1 + R_{t+1} f) + \gamma \sum_{k=0}^{\infty} \gamma^k \log (1 + R_{t+k+2} f) \mid s_t = s ]
               = \sum_{a \in A(s)} \pi(s, a) \sum_{s' \in S} P^a_{ss'} ( R^a_{ss'} + \gamma E_{\pi}[ \sum_{k=0}^{\infty} \gamma^k \log (1 + R_{t+k+2} f) \mid s_{t+1} = s' ] )
               = \sum_{a \in A(s)} \pi(s, a) \sum_{s' \in S} P^a_{ss'} ( R^a_{ss'} + \gamma V^{\pi}(s') ),                              (8)

where \pi(s, a) is the probability of selecting action a in state s, P^a_{ss'} is the state-transition probability, and R^a_{ss'} is the expected value of \log(1 + R_{t+1} f), with betting fraction f and discount rate \gamma, for the transition:
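The identity behind Eqs. (5) and (7) is easy to check numerically. The sketch below (function names and return values are illustrative assumptions, not from the paper) compares the logarithm of the double-exponentially discounted compound return of Eq. (6) with the polynomially discounted sum of log gross returns of Eq. (7):

    import math

    def log_compound_return(returns, gamma=0.9, f=1.0):
        """Eq. (7): sum_k gamma^k * log(1 + R_{t+k+1} * f)."""
        return sum((gamma ** k) * math.log(1.0 + r * f) for k, r in enumerate(returns))

    returns = [0.5, -0.2, 0.1]        # a short, arbitrary sequence of simple returns
    gamma, f = 0.9, 1.0

    # Eq. (6): the double-exponentially discounted compound return itself ...
    rho = math.prod((1.0 + r * f) ** (gamma ** k) for k, r in enumerate(returns))
    # ... whose logarithm matches the polynomially discounted sum of Eq. (7).
    print(abs(math.log(rho) - log_compound_return(returns, gamma, f)) < 1e-9)   # True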

    R^a_{ss'} = E[ \log(1 + R_{t+1} f) \mid s_t = s, a_t = a, s_{t+1} = s' ].              (9)

Similarly, the action-value function of policy \pi for taking action a in state s is

    Q^{\pi}(s, a) = E_{\pi}[ \log \rho_t \mid s_t = s, a_t = a ]
                  = \sum_{s' \in S} P^a_{ss'} ( R^a_{ss'} + \gamma V^{\pi}(s') ),          (10)

and the optimal action-value function is

    Q^*(s, a) = \max_{\pi \in \Pi} Q^{\pi}(s, a)
              = E[ \log(1 + R_{t+1} f) + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a ]
              = \sum_{s'} P^a_{ss'} ( R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') ).          (11)

Equation (11) is the Bellman optimality equation of compound reinforcement learning. It has the same form as the ordinary Bellman optimality equation, with the expected reward replaced by R^a_{ss'}, the expected logarithm of the simple gross return under the betting fraction f.

3. Compound Reinforcement Learning Algorithms

3.1 Compound Q-learning
Because the Bellman optimality equation (11) has the same form as in ordinary reinforcement learning, Q-learning can be extended to compound reinforcement learning by replacing the reward r_{t+1} with \log(1 + R_{t+1} f). After taking action a_t in state s_t and observing the return R_{t+1} and the next state s_{t+1}, compound Q-learning updates the action value by

    Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha ( \log(1 + R_{t+1} f) + \gamma \max_{a \in A} Q(s_{t+1}, a) - Q(s_t, a_t) ),   (12)

where \alpha is the learning rate, \gamma is the discount rate, and f is the betting fraction. Figure 4 shows the compound Q-learning algorithm. It differs from ordinary Q-learning in two points: (i) the agent observes the return R instead of the reward r, and (ii) the update uses \log(1 + Rf) in place of r.

    Input: \gamma, \alpha, f
    Initialize Q(s, a) arbitrarily
    loop (for each episode)
        Initialize s
        repeat (for each step of the episode)
            Choose a from s using a policy derived from Q
            Take action a, observe the return R and the next state s'
            Q(s, a) \leftarrow Q(s, a) + \alpha ( \log(1 + Rf) + \gamma \max_{a'} Q(s', a') - Q(s, a) )
            s \leftarrow s'
        until s is terminal
    end loop

Fig. 4  Compound Q-learning algorithm.

Compound Q-learning is ordinary Q-learning applied to an MDP whose reward is \log(1 + R_{t+1} f) rather than R_{t+1}, so the convergence result of Watkins and Dayan19) carries over.

Theorem 1. Suppose the returns are bounded so that -1 < R_{min} \le R_t f \le R_{max}, and that the learning rates satisfy

    0 \le \alpha_t < 1,  \sum_{i=1}^{\infty} \alpha_{t^i} = \infty,  \sum_{i=1}^{\infty} [\alpha_{t^i}]^2 < \infty,   (13)

where t^i denotes the step at which the pair (s, a) is updated for the i-th time. Then Q_t(s, a) \to Q^*(s, a) with probability 1 for all s and a.
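A minimal tabular sketch of the Fig. 4 procedure, assuming a hypothetical environment interface (reset, actions, and step returning the simple return R), is given below; the interface and parameter values are illustrative assumptions, not an API defined in the paper.

    import math
    import random
    from collections import defaultdict

    def compound_q_learning(env, gamma=0.9, alpha=0.001, f=1.0, epsilon=0.2, episodes=100):
        """Tabular compound Q-learning following Fig. 4: ordinary Q-learning with the
        reward replaced by log(1 + R*f), where R is the observed simple return."""
        Q = defaultdict(float)
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                acts = env.actions(s)
                if random.random() < epsilon:                      # epsilon-greedy exploration
                    a = random.choice(acts)
                else:
                    a = max(acts, key=lambda b: Q[(s, b)])
                R, s_next, done = env.step(s, a)                   # observe the return, not a reward
                target = math.log(1.0 + R * f)                     # reinforcement log(1 + R f)
                if not done:
                    target += gamma * max(Q[(s_next, b)] for b in env.actions(s_next))
                Q[(s, a)] += alpha * (target - Q[(s, a)])          # update rule (12)
                s = s_next
        return Q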

Proof. Set r_{t+1} = \log(1 + R_{t+1} f). Then the update (12) is exactly the ordinary Q-learning update applied to an MDP whose reward is \log(1 + R_t f). Since -1 < R_{min} \le R_t f \le R_{max}, the rewards \log(1 + R_t f) are bounded, and the convergence theorem of Watkins and Dayan19) applies, which proves Theorem 1.

3.2 Compound OnPS
Profit sharing11) maintains a preference P(s, a) for each state-action pair and, when an episode ends at time T with reward r_T, reinforces every pair (s_t, a_t) visited during the episode:

    P(s_t, a_t) \leftarrow P(s_t, a_t) + g(t, r_T, T),                                     (14)

where g is the credit-assignment function, for example

    g(t, r_T, T) = \gamma^{T-t-1} r_T.                                                     (15)

OnPS10) is an online version of profit sharing that uses an eligibility counter c. At each step it increments the counter of the current pair,

    c(s_t, a_t) \leftarrow c(s_t, a_t) + 1,                                                (16)

and then updates and decays the preferences of all pairs:

    P(s, a) \leftarrow P(s, a) + \alpha r_{t+1} c(s, a)   for all s, a,                    (17)
    c(s, a) \leftarrow \gamma c(s, a)                     for all s, a.                    (18)

Compound OnPS, like compound Q-learning, replaces the reward r_{t+1} with \log(1 + R_{t+1} f); the update (17) becomes

    P(s, a) \leftarrow P(s, a) + \alpha \log(1 + R_{t+1} f) c(s, a).                       (19)

Figure 5 shows the compound OnPS algorithm. As with compound Q-learning, it differs from ordinary OnPS in two points: (i) the agent observes the return R instead of the reward r, and (ii) P is updated with \log(1 + Rf) instead of r.

    Input: \gamma, \alpha, f, and the initial preference c
    for all s, a do P(s, a) \leftarrow c end for
    loop (for each episode)
        Initialize s
        for all s, a do c(s, a) \leftarrow 0 end for
        repeat (for each step of the episode)
            Choose a from s using a policy derived from P
            c(s, a) \leftarrow c(s, a) + 1
            Take action a, observe the return R and the next state s'
            for all s, a do
                P(s, a) \leftarrow P(s, a) + \alpha \log(1 + Rf) c(s, a)
                c(s, a) \leftarrow \gamma c(s, a)
            end for
            s \leftarrow s'
        until s is terminal
    end loop

Fig. 5  Compound OnPS algorithm.

Note that the condition r_T > 0 for positive reinforcement in profit sharing corresponds to \log(1 + R_T f) > 0 in compound OnPS.

4. Experiment: 3-armed Bandit
We compared compound Q-learning ("Compound") with ordinary Q-learning ("Simple") and with variance-penalized Q-learning15) ("Variance-Penalized") on the 3-armed bandit problem shown in Fig. 6, as described next; a code sketch of the compound OnPS episode loop follows below.
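Below is a corresponding sketch of one episode of the Fig. 5 procedure. The Gibbs selection helper and the environment interface are illustrative assumptions, and the preference table P is expected to be a defaultdict(float) pre-initialized by the caller.

    import math
    import random
    from collections import defaultdict

    def gibbs_select(P, s, actions, tau=0.2):
        """Softmax (Gibbs) selection over the preferences P(s, a) with temperature tau."""
        weights = [math.exp(P[(s, a)] / tau) for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]

    def compound_onps_episode(env, P, gamma=0.9, alpha=0.001, f=1.0, tau=0.2):
        """One episode of compound OnPS following Fig. 5, updating P in place."""
        c = defaultdict(float)                      # eligibility counters, cleared every episode
        s = env.reset()
        done = False
        while not done:
            a = gibbs_select(P, s, env.actions(s), tau)
            c[(s, a)] += 1.0                        # Eq. (16): count the visited pair
            R, s_next, done = env.step(s, a)        # observe the simple return R
            g = alpha * math.log(1.0 + R * f)       # compound reinforcement of Eq. (19)
            for key in list(c):
                P[key] += g * c[key]                # reinforce every recorded pair
                c[key] *= gamma                     # Eq. (18): decay the eligibilities
            s = s_next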

Fig. 6  3-armed bandit problem.

The arms A, B, and C differ in the returns they pay. The parameters were \gamma = 0.9 and f = 1, action selection was \epsilon-greedy with \epsilon = 0.2, the learning rate was \alpha = 0.001, and the penalty coefficient of variance-penalized Q-learning was \kappa = 1; the agents were run for 100 episodes. Figure 7 shows the experimental results for the three learners.

Fig. 7  Experimental results for 3-armed bandit.

5. Application: Japanese Government Bond Trading

Following Matsui et al.9), we applied compound OnPS ("Compound OnPS") to a 10-year Japanese government bond (JGB) trading task and compared it with ordinary OnPS ("Simple"). The state is described by 14 inputs (y_t, \sigma_t, ...) based on the yield y_t and its volatility \sigma_t, approximated with an RBF network, and actions were selected by Gibbs (softmax) selection with temperatures \tau = 0.2 and \tau = 0.1. The parameters were \gamma = 0.9 and f = 1. The training data cover January 1, 2004 through December 31, 2006, and the evaluation starts on January 1, 2007.

Figure 8 shows the yields of the 10-year JGB from 2004 to 2009 together with ±3\sigma bands. The evaluation begins with the period from January 1 to March 31, 2007 and is rolled forward over 12 successive three-month periods.

Fig. 8  Yields of 10-year Japanese government bond (JGB), 2004-2009, with ±3\sigma bands.

Figure 9 shows the experimental results for the 10-year JGB task; the evaluation periods include the sharp yield movements around February 2008 and September 15, 2008.

Fig. 9  Experimental results for 10-year JGB.

6. Discussion
Figure 10 compares the reinforcement signal as a function of the observed return R: the ordinary (simple) reward r = R and the compound reinforcement \log(1 + Rf) for f = 1 and f = 0.5. The logarithmic reinforcement penalizes losses more heavily than it rewards equally sized gains, which connects compound reinforcement learning to concave utility functions14),18).

Fig. 10  Relations between returns and reinforcements.

In terms of the mean \mu and standard deviation \sigma of the obtained returns, compound OnPS achieved \mu = 0.217 and \sigma = 0.510, whereas ordinary OnPS achieved \mu = 0.150 and \sigma = 0.578; compound OnPS thus obtained a higher mean return with a lower variance.
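The shape of the curves in Fig. 10 can be reproduced with a few lines; the sample return values below are arbitrary and only illustrate how the logarithmic reinforcement treats gains and losses asymmetrically and how a smaller f damps both.

    import math

    # Reinforcement as a function of the simple return R (cf. Fig. 10): the simple reward r = R
    # versus the compound reinforcement log(1 + R f) for f = 1 and f = 0.5.
    for R in (-0.5, -0.2, 0.0, 0.2, 0.5):
        print(f"R = {R:+.1f}   r = {R:+.3f}   "
              f"log(1 + R*1.0) = {math.log(1.0 + R * 1.0):+.3f}   "
              f"log(1 + R*0.5) = {math.log(1.0 + R * 0.5):+.3f}")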

The betting fraction f corresponds to the fraction wagered in the Kelly criterion6) and to Vince's optimal f17), both of which lie between 0 and 1. Compound reinforcement learning is also related to risk-sensitive and risk-averse reinforcement learning1),2),4),5),15) and to reinforcement learning applied to trading tasks7),12).

7. Conclusion
This paper presented compound reinforcement learning, an extended reinforcement-learning framework for return-based MDPs that maximizes the double-exponentially discounted compound return, together with two algorithms, compound Q-learning and compound OnPS, and showed experimental results on a 3-armed bandit problem and on a Japanese government bond trading task.

Acknowledgments  This work was supported in part by Grants-in-Aid for Scientific Research Nos. 21013049 and 23700182.

References
1) Basu, A., Bhattacharyya, T. and Borkar, V.S.: A Learning Algorithm for Risk-Sensitive Cost, Mathematics of Operations Research, Vol.33, No.4, pp.880-898 (2008).
2) Borkar, V.S.: Q-Learning for Risk-Sensitive Control, Mathematics of Operations Research, Vol.27, No.2, pp.294-311 (2002).
3) Campbell, J.Y., Lo, A.W. and MacKinlay, A.C.: The Econometrics of Financial Markets, Princeton University Press (1997); Japanese translation (2003).
4) Geibel, P. and Wysotzki, F.: Risk-Sensitive Reinforcement Learning Applied to Control under Constraints, Journal of Artificial Intelligence Research, Vol.24, pp.81-108 (2005).
5) Heger, M.: Consideration of Risk in Reinforcement Learning, Proc. 11th International Conference on Machine Learning (ICML 1994), pp.105-111 (1994).
6) Kelly, Jr., J.L.: A New Interpretation of Information Rate, Bell System Technical Journal, Vol.35, pp.917-926 (1956).
7) Li, J. and Chan, L.: Reward Adjustment Reinforcement Learning for Risk-averse Asset Allocation, Proc. International Joint Conference on Neural Networks (IJCNN 2006), pp.534-541 (2006).
8) Vol.26, No.2, pp.330-334 (2011).
9) Matsui, T., Goto, T. and Izumi, K.: Acquiring a Government Bond Trading Strategy Using Reinforcement Learning, Journal of Advanced Computational Intelligence and Intelligent Informatics, Vol.13, No.6, pp.691-696 (2009).
10) Matsui, T., Inuzuka, N. and Seki, H.: On-Line Profit Sharing Works Efficiently, Proc. 7th International Conference on Knowledge-Based Intelligent Information & Engineering Systems (KES 2003), pp.317-324 (2003).

11) Vol.9, No.4, pp.580-587 (1994).
12) Nevmyvaka, Y., Feng, Y. and Kearns, M.: Reinforcement Learning for Optimized Trade Execution, Proc. 23rd International Conference on Machine Learning (ICML 2006), pp.673-680 (2006).
13) Poundstone, W.: Fortune's Formula: The Untold Story of the Scientific Betting System That Beat the Casinos and Wall Street, Hill and Wang (2005); Japanese translation (2006).
14) Russell, S. and Norvig, P.: Artificial Intelligence: A Modern Approach, 2nd edition, Prentice Hall (2003); Japanese translation (2008).
15) Sato, M. and Kobayashi, S.: Average-Reward Reinforcement Learning for Variance Penalized Markov Decision Problems, Proc. 18th International Conference on Machine Learning (ICML 2001), pp.473-480 (2001).
16) Sutton, R.S. and Barto, A.G.: Reinforcement Learning: An Introduction, The MIT Press (1998); Japanese translation (2000).
17) Vince, R.: Find Your Optimal f, Technical Analysis of Stocks & Commodities, Vol.8, No.12, pp.476-477 (1990).
18) Von Neumann, J. and Morgenstern, O.: Theory of Games and Economic Behavior, Princeton University Press (1944).
19) Watkins, C.J.C.H. and Dayan, P.: Technical Note: Q-Learning, Machine Learning, Vol.8, No.3/4, pp.279-292 (1992).

(Received April 11, 2011)
(Accepted September 12, 2011)