2006 Workshop on Information-Based Induction Sciences (IBIS2006), Osaka, Japan, October 31 - November 2, 2006.

Introduction to statistical models for populational optimization

Shotaro Akaho
National Institute of Advanced Industrial Science and Technology (AIST), 1-1-1 Central 2, Umezono, Tsukuba, Ibaraki 305-8568. e-mail: akaho@m.ieice.org

Abstract: This paper provides an introductory review of populational search methods based on statistical modelling. We start with a random search algorithm called Markov chain Monte Carlo (MCMC). To solve an optimization problem, it is crucial for performance to use specific knowledge about the problem. We discuss how the optimizer can acquire that knowledge during the optimization, a topic closely related to various fields of statistical learning theory.

1 Introduction

This paper reviews stochastic optimization by population search from the viewpoint of statistical modelling, starting from Markov chain Monte Carlo and proceeding to methods that learn a model of the problem while the search runs.

2 Stochastic optimization and population search

2.1 Problem setting

Let X be a search space and let f(x) \in \mathbb{R}, x \in X, be an objective function. The goal of optimization is to find a minimizer x^*:

    x^* = \arg\min_{x \in X} f(x).    (1)

When X is finite but very large, or when f(x) has many local minima, solving (1) exactly is generally intractable, which motivates stochastic search over X.
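As a concrete instance of (1), a brute-force minimization over a small finite search space; the objective and the bit-string encoding are my own toy choices, and the point is that exhaustive enumeration costs |X| = 2^n evaluations, which immediately becomes infeasible as n grows:

```python
# Search space X: all bit strings of length n = 10, encoded as integers 0..1023.
n = 10

def f(x):
    """Toy objective: number of bits in which x differs from a fixed target pattern."""
    return bin(x ^ 0b1010101010).count("1")

# Exhaustive search realizes x* = arg min f(x), at the cost of |X| = 2**n evaluations.
x_star = min(range(2 ** n), key=f)
print(x_star, f(x_star))   # the target pattern itself (682), with f(x*) = 0
```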
When only noisy evaluations of the objective are available, the observation model becomes

    y = f(x) + \epsilon,

where \epsilon denotes observation noise.

2.2 The Gibbs distribution

For x \in X, define the Gibbs distribution at temperature T > 0:

    p_T(x) = \frac{1}{Z_T} \exp(-f(x)/T),    (2)

where the normalizing constant (partition function) is

    Z_T = \sum_{x \in X} \exp(-f(x)/T).    (3)

As T \to 0, p_T(x) concentrates its mass on the minimizers of f(x). Finding the most probable state of p_T(x) is exactly MAP (maximum a posteriori) estimation, so minimizing f(x) can be reformulated as sampling from p_T(x) at a low temperature.

2.3 MCMC (Metropolis-Hastings)

Markov chain Monte Carlo (MCMC) draws samples from p_T(x) by simulating a Markov chain with transition probability P(x_{t+1} | x_t). The Metropolis-Hastings algorithm requires only X and f, not the normalizing constant Z:

[Metropolis-Hastings]
1. Given the current state x_t, draw a candidate x_{t+1} from a proposal distribution q(x_{t+1} | x_t).
2. Accept the candidate with probability \min\{\alpha(x_t, x_{t+1}), 1\} (on rejection, set x_{t+1} = x_t), where

    \alpha(x_t, x_{t+1}) = \frac{p_T(x_{t+1}) \, q(x_t | x_{t+1})}{p_T(x_t) \, q(x_{t+1} | x_t)}.    (4)

Well-known special cases, corresponding to particular choices of the proposal q, include the Gibbs sampler, the Metropolis method (symmetric proposals), and the independence sampler.

2.4 Why MCMC works

For details see [8, 12, 15]. Started from an arbitrary initial state x_0, the chain converges to the stationary distribution p_T(x) when the transition probability P(x_{t+1} | x_t) is ergodic and satisfies the detailed balance condition

    p_T(x) \, P(y | x) = p_T(y) \, P(x | y)    (5)

for all x, y. One can verify that the Metropolis-Hastings rule (4) satisfies (5) with respect to p_T(x). Note that (4) depends on the target only through the ratio p_T(x_{t+1}) / p_T(x_t), so the intractable partition function Z_T in (2) cancels and never has to be computed when moving from x_t to x_{t+1}.
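As an illustration, a minimal Metropolis sampler targeting the Gibbs distribution (2) on a one-dimensional discrete space; the objective f, the temperature, and the run lengths are my own toy choices:

```python
import math
import random

random.seed(0)

def f(x):
    """Toy objective on the integers 0..99 with a global minimum at x = 30."""
    return (x - 30) ** 2 / 100.0

T = 1.0          # temperature of the Gibbs distribution p_T(x) ∝ exp(-f(x)/T)
x = 0            # initial state x_0
samples = []

for t in range(20000):
    # symmetric random-walk proposal (Metropolis method): a +/-1 step
    cand = x + random.choice([-1, 1])
    if 0 <= cand <= 99:
        # acceptance ratio (4); the symmetric proposal q cancels,
        # and the partition function Z_T cancels as well
        alpha = math.exp(-(f(cand) - f(x)) / T)
        if random.random() < alpha:
            x = cand
    samples.append(x)

# after burn-in, the samples concentrate around the minimizer x* = 30
burned = samples[5000:]
print(sum(burned) / len(burned))
```

The sample mean after burn-in sits near the minimizer, even though the chain never evaluates Z_T.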
3 Exploration and exploitation

3.1 The exploration-exploitation trade-off

Plain MCMC spends most of its effort exploring the search space X (exploration). For optimization, however, what matters is the best value f(x_t) found so far, and the information about f(x) gathered along the way should be used to concentrate the search on promising regions (exploitation). The two goals conflict: exploiting too aggressively risks trapping the search near a local minimum, while exploring too broadly wastes evaluations of f. The balance is controlled by the temperature and by the proposal distribution q(x_{t+1} | x_t); methods such as EDA, discussed later, make the exploitation side explicit.

3.2 Simulated annealing

By (2), lowering the temperature T concentrates p_T(x) on the minimizers of f(x), but MCMC at a very low temperature gets stuck easily. Simulated annealing (SA) [5, 6] therefore runs MCMC while gradually decreasing a temperature schedule T_t. Hajek [6] showed that the chain \{x_t\} converges to the set of global minima if and only if

    \sum_{t=1}^{\infty} \exp(-D/T_t) = \infty,    (6)

where D is a constant determined by the depth of the deepest non-global local minimum. The logarithmic schedule T_t = D / \log t satisfies (6) but is far too slow for practical use; in practice an exponential schedule T_t = \alpha^t T_0 (0 < \alpha < 1) is common, even though it violates (6) and thus gives up the convergence guarantee.

4 Population search by parallel MCMC

Instead of a single chain, one can run MCMC on a population of K states x^{(1)}, \ldots, x^{(K)} in parallel.
The simplest choice is to give every chain the same target, i.e. the joint distribution

    p(x^{(1)}, \ldots, x^{(K)}) = \prod_{k=1}^{K} p_T(x^{(k)}).    (7)

Running an independent Metropolis-Hastings chain with proposal q(x^{(k)}_{t+1} | x^{(k)}_t) for each k = 1, \ldots, K leaves (7) invariant, but by itself this is nothing more than K independent copies of single-chain MCMC. Two extensions make the population genuinely useful:

(1) Different temperatures. Assign each chain its own temperature T_k:

    p(x^{(1)}, \ldots, x^{(K)}) = \prod_{k=1}^{K} p_{T_k}(x^{(k)}).    (8)

(2) Interaction between chains. Introduce moves that exchange information between chains while preserving the joint distribution (8). In parallel tempering (replica exchange), one proposes to swap the states x^{(j)} and x^{(k)} of chains j and k, and accepts the swap with probability \min\{\beta, 1\}, where

    \beta(x^{(j)}, x^{(k)}) = \frac{p_{T_k}(x^{(j)}) \, p_{T_j}(x^{(k)})}{p_{T_j}(x^{(j)}) \, p_{T_k}(x^{(k)})}.    (9)

The swap is itself a Metropolis-Hastings move on (8), so the stationary distribution is preserved. In practice swaps are proposed between adjacent temperatures (x^{(k)} and x^{(k+1)}), and performance depends on a careful choice of the temperature ladder T_1, \ldots, T_K [8]: high-temperature chains explore widely, and the states they find propagate down to the low-temperature chains, which refine them.

5 From sampling to optimization by resampling

In the zero-temperature limit, sampling and optimization coincide: for a continuous search space X \subset \mathbb{R} with a unique minimizer x^*, the mean of the Gibbs distribution converges to the minimizer,

    \lim_{T \to +0} \frac{\int_X x \exp(-f(x)/T) \, dx}{\int_X \exp(-f(x)/T) \, dx} = x^*    (10)

[14]. In the discrete or multimodal case one can instead ask, for given \epsilon and \delta, for a procedure that with probability at least 1 - \epsilon returns a sample satisfying

    f(x) < f_0 + \delta,    (11)

a PAC-style guarantee of near-optimality (f_0 being the optimal value).

5.1 Importance sampling

To estimate expectations under a target distribution p(x) that is hard to sample directly, importance sampling draws N samples x_1, \ldots, x_N from a tractable proposal distribution r(x) and corrects for the mismatch by weighting each sample x_i.
Each sample receives the weight w_i = p(x_i)/r(x_i), and the weighted average

    \langle x \rangle_w = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i}    (12)

converges to the mean of p(x) as N \to \infty. Since only weight ratios matter in (12), p(x) need be known only up to its normalizing constant Z, as with MCMC. Combining importance weighting with sequential propagation of the sample set leads to sequential Monte Carlo methods, also known as particle filters [3, 7].

5.2 Resampling

Given samples x_1, \ldots, x_N drawn from r(x), normalize the weights,

    \tilde{w}_i = \frac{w_i}{\sum_{i=1}^{N} w_i},    (13)

and draw N new samples x'_1, \ldots, x'_N from the set \{x_1, \ldots, x_N\} with probabilities \tilde{w}_i (resampling). The resampled population is approximately distributed according to p(x). When r(x) is a poor match for p(x), as easily happens in high-dimensional spaces (the curse of dimensionality), the weights degenerate: a few w_i dominate and the effective sample size collapses.

5.3 Genetic algorithms

The genetic algorithm (GA), inspired by the evolution of DNA, maintains a population of N individuals and repeats three operations:

1. Selection: resample individuals with probability proportional to their fitness, e.g. proportional to p_T(x_i).
2. Mutation: perturb each individual randomly, corresponding to a proposal q(x_{t+1} | x_t) that moves x_t to x_{t+1}.
3. Crossover: combine pairs of individuals, r(x^{(j)}_{t+1}, x^{(k)}_{t+1} | x^{(j)}_t, x^{(k)}_t), producing two offspring x^{(j)}_{t+1}, x^{(k)}_{t+1} from parents x^{(j)}_t, x^{(k)}_t.

Selection has a clean annealing interpretation (take r(x) = 1, i.e. no mutation or crossover): if the current population is distributed as p_{T_t}(x) and individuals are selected with probability proportional to p_T(x), the new population follows

    p_{T_{t+1}}(x) \propto p_{T_t}(x) \, p_T(x) = \exp(-f(x)/T_{t+1}) / Z_{t+1},    (14)

where

    T_{t+1} = \frac{1}{1/T_t + 1/T} = \frac{T_t}{1 + T_t/T} < T_t.    (15)

Repeated selection therefore lowers the effective temperature automatically. With a finite population, however, diversity is progressively lost (genetic drift) [1], so mutation and crossover are needed to keep the search alive; these moves play the same role as the per-chain MCMC transitions of Section 4.
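The effective-temperature recursion (15) can be checked numerically: reweighting a sample of the Gibbs distribution by \exp(-f(x)/T) and resampling as in 5.2 should reproduce the Gibbs distribution at the lower temperature T_{t+1}. A small sketch with a quadratic objective, for which p_T is a Gaussian with variance T/2 (all parameter values are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = x^2, so p_T(x) ∝ exp(-x^2/T) is a zero-mean Gaussian with variance T/2
T0, T = 2.0, 2.0
N = 200_000

x = rng.normal(0.0, np.sqrt(T0 / 2), N)      # population distributed as p_{T0}

# selection step: resample with weights proportional to p_T(x) ∝ exp(-f(x)/T)
w = np.exp(-x ** 2 / T)
w /= w.sum()
x_new = rng.choice(x, size=N, replace=True, p=w)

# predicted effective temperature from eq. (15): T1 = 1/(1/T0 + 1/T)
T1 = 1.0 / (1.0 / T0 + 1.0 / T)
print(x_new.var(), T1 / 2)   # empirical variance matches the predicted T1/2
```

With T0 = T = 2 the recursion predicts T1 = 1, i.e. variance 0.5, which the resampled population reproduces up to Monte Carlo error.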
6 Learning a model during the search: EDA

The exploitation side of the trade-off in 3.1 can be made explicit by learning a statistical model of good solutions while the search proceeds. The representative family of such methods is the Estimation of Distribution Algorithm (EDA).

6.1 Density estimation of promising solutions

The idea is to approximate the Gibbs distribution by a parametric model,

    p_T(x) \approx p(x; \theta),    (16)

and to generate new candidates from the fitted model rather than by generic local moves. Since p_T(x) reflects the shape of f(x), a good fit captures problem-specific structure, such as dependencies among the components of x. A generic EDA iterates:

[Generic EDA]
1. Evaluate f(x) on the current population x^{(1)}_t, \ldots, x^{(N)}_t and select the promising individuals.
2. Fit the model p(x; \theta) to the selected individuals.
3. Generate the next population by sampling from p(x; \theta).

In contrast to GA and MCMC, whose moves are local and problem-agnostic, EDA concentrates its proposals where the learned model p(x; \theta) places mass.

6.2 Active learning

How to choose the evaluation points x_1, \ldots, x_N so as to learn the model most efficiently is itself a classical problem, known as active learning [4]. For example, let x \in X = [0, 1] and suppose y = ax + b + \epsilon (\epsilon observation noise). The coefficients a, b are estimated most accurately by placing the observations at the two endpoints x = 0, 1; the general theory of such choices is optimal experimental design. In optimization, the same question arises when deciding where in X to evaluate f next.
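The generic EDA loop of 6.1 can be sketched with the simplest possible model, a single Gaussian p(x; \theta) = N(\mu, \sigma^2) refitted to the best half of each population; the objective and all parameter values are my own toy choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    """Toy objective with its minimum at x = 3."""
    return (x - 3.0) ** 2

N = 100                  # population size
mu, sigma = 0.0, 5.0     # parameters theta of the model N(mu, sigma^2)

for t in range(30):
    # 1. generate a population from the current model and evaluate f
    pop = rng.normal(mu, sigma, N)
    # 2. select the promising half (smallest f values)
    best = pop[np.argsort(f(pop))[: N // 2]]
    # 3. refit the model to the selected individuals
    mu, sigma = best.mean(), best.std() + 1e-12  # small floor avoids sigma = 0

print(mu)   # close to the minimizer 3
```

The model mean homes in on the minimizer while sigma shrinks; with a richer model p(x; \theta) the same loop can also learn dependencies between variables.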
6.3 Bandit problems and reinforcement learning

The sequential trade-off between evaluating apparently good actions and trying uncertain ones is formalized as the bandit problem, a basic setting of reinforcement learning (RL) [16]. Its extension with state transitions is the Markov decision process (MDP), solved in principle by dynamic programming (DP) and, when the model is unknown, by algorithms such as Q-learning and TD-learning. The exploration-exploitation trade-off of 3.1 appears here in its purest form: exploitation means choosing the action currently believed best, exploration means trying others to reduce uncertainty.

6.4 Mean field approximation

When x is high-dimensional and p_T(x) is defined through a graphical model such as a Markov random field (MRF) or a Bayesian network (BN), exact computation is possible on singly connected graphs (e.g. by Viterbi-type algorithms) but intractable on loopy graphs. The mean field approximation (MFA) [13, 10] approximates the joint distribution of x = (x_1, \ldots, x_K) by a fully factorized model,

    p(x) \approx \prod_{i=1}^{K} p_i(x_i; \theta_i),    (17)

and optimizes each factor p_i(x_i; \theta_i), typically by minimizing the KL divergence to p(x). More refined approximations include the Bethe approximation and belief propagation; applied to p_T(x), i.e. to f(x), these yield deterministic alternatives to sampling-based search [18].

7 Exploiting the structure of the model

7.1 Dimension reduction and latent variables

When x \in X is high-dimensional, dimension-reduction techniques such as principal component analysis (PCA) and independent component analysis (ICA) can restrict the model to the relevant subspace; information-geometric variants of PCA (the e-PCA and m-PCA) extend this beyond the Gaussian case [2], and related reweighting ideas appear in the sampling literature as multicanonical methods. Another route is to introduce a latent variable z and model the target as a marginal,

    p(x) = \sum_{z} p(x, z),    (18)

where the joint model p(x, z) is simpler than the marginal p(x); such latent-variable models are typically fitted by the EM algorithm [10].
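As an illustration of the latent-variable decomposition (18), a minimal EM fit of a two-component Gaussian mixture, where z is the unobserved component label; the synthetic data and the initialization are my own toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic 1-D data from two Gaussian components (latent z = component label)
x = np.concatenate([rng.normal(-2.0, 0.5, 500), rng.normal(3.0, 0.8, 500)])

# initial model parameters (arbitrary starting point)
pi = np.array([0.5, 0.5])        # mixing weights p(z)
mu = np.array([-1.0, 1.0])       # component means
var = np.array([1.0, 1.0])       # component variances

def gauss(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for _ in range(100):
    # E-step: responsibilities r[i, k] = p(z = k | x_i) under the current model
    r = pi * gauss(x[:, None], mu, var)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete-data log-likelihood of p(x, z)
    nk = r.sum(axis=0)
    pi = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(np.sort(mu))   # recovers the generating means, roughly [-2, 3]
```

The E-step uses only the joint p(x, z); the intractable-looking marginal p(x) never has to be handled directly, which is exactly the appeal of (18).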
7.2 Structured search spaces

The search space itself may have structure: X may be a union of subspaces of different dimensions, X = \bigcup_{i=1}^{\infty} Y_i, as when the number of components of a model is itself unknown. Moves between subspaces of different dimension can be handled by combining the resampling ideas of 5.2 with MCMC; jump diffusion is an MCMC method of this kind, mixing continuous diffusion within a subspace with jumps between subspaces.

8 Concluding remarks

Many of the topics touched upon here are treated in depth in textbooks and in resources available on the web; see [9, 11] for Monte Carlo methods and, for the deterministic theory of approximation algorithms with performance guarantees, [17].

References

[1] S. Akaho. Statistical learning in optimization: Gaussian modeling for population search. In Proc. of Int. Conf. on Neural Information Processing (ICONIP'98), 1998.
[2] S. Akaho. The e-PCA and m-PCA: dimension reduction by information geometry. In Proc. of Int. Joint Conf. on Neural Networks (IJCNN2004), pp. 129-134, 2004.
[3] A. Doucet, N. de Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer-Verlag, 2001.
[4] (In Japanese.) 2005.
[5] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. on PAMI, Vol. 6, pp. 721-741, 1984.
[6] B. Hajek. Cooling schedules for optimal annealing. Math. Oper. Research, Vol. 13, pp. 311-329, 1988.
[7] (In Japanese.) Vol. 88, No. 12, pp. 989-994, 2005.
[8] (In Japanese.) 2005.
[9] (In Japanese.) 1993.
[10] (In Japanese; on the EM algorithm.) 2003.
[11] (In Japanese.) 2005.
[12] J. S. Liu. Monte Carlo Strategies in Scientific Computing. Springer-Verlag, 2001.
[13] M. Opper and D. Saad, editors. Advanced Mean Field Methods: Theory and Practice. MIT Press, 2001.
[14] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing, 2nd edition. Cambridge University Press, 1992.
[15] C. P. Robert. Monte Carlo Statistical Methods. Springer-Verlag, 2004.
[16] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[17] V. V. Vazirani. Approximation Algorithms. Springer-Verlag, 2001 (Japanese translation, 2002).
[18] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Trans. on Information Theory, 2003.