Lecture 10: The EM algorithm (contd.)
STAT 545: Intro. to Computational Statistics
Vinayak Rao, Purdue University
September 24, 2018
Exponential family models

Consider a space X, e.g. R, R^d or N.
ϕ(x) = [ϕ_1(x), ..., ϕ_D(x)]: (feature) vector of sufficient statistics
θ = [θ_1, ..., θ_D]: vector of natural parameters

Exponential family distribution:
$$ p(x \mid \theta) = \frac{1}{Z(\theta)}\, h(x)\, \exp\!\left(\theta^\top \phi(x)\right) $$

h(x) is the base measure or base distribution.
Z(θ) = ∫ h(x) exp(θᵀϕ(x)) dx is the normalization constant.
Examples

The normal distribution:
$$ p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right) = \frac{\exp\!\left(-\frac{\mu^2}{2\sigma^2}\right)}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{1}{2\sigma^2}x^2 + \frac{\mu}{\sigma^2}x\right) $$

The Poisson distribution:
$$ p(x \mid \lambda) = \frac{\lambda^x \exp(-\lambda)}{x!} = \exp(-\lambda)\, \frac{1}{x!}\, \exp(\log(\lambda)\, x) $$
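As a quick numeric sanity check (not from the slides), the following Python sketch verifies the Poisson factorization above, with h(x) = 1/x!, ϕ(x) = x, θ = log λ and Z(θ) = exp(λ); it assumes numpy and scipy are available.

```python
import numpy as np
from scipy.stats import poisson
from scipy.special import factorial

lam = 3.5
theta = np.log(lam)                 # natural parameter
x = np.arange(0, 20)

h = 1.0 / factorial(x)              # base measure h(x) = 1/x!
Z = np.exp(lam)                     # normalization constant
p_expfam = h * np.exp(theta * x) / Z

# Matches the usual Poisson pmf
assert np.allclose(p_expfam, poisson.pmf(x, lam))
```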
Minimal exponential family

Sufficient statistics are linearly independent.

Consider a K-component discrete distribution π = (π_1, ..., π_K):
$$ p(x) = \prod_{c=1}^K \pi_c^{\delta(x=c)} = \exp\!\left(\sum_{c=1}^K \delta(x=c)\log\pi_c\right) $$

Is it minimal?
$$ p(x) = \pi_K \exp\!\left(\sum_{c=1}^{K-1} \delta(x=c)\log(\pi_c/\pi_K)\right) = \frac{1}{Z}\exp\!\left(\sum_{c=1}^{K-1}\delta(x=c)\,\theta_c\right) $$
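A small sketch (my own illustration) of moving between the K probabilities and the K−1 minimal natural parameters θ_c = log(π_c/π_K):

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])           # K = 3 component probabilities
theta = np.log(pi[:-1] / pi[-1])         # K-1 minimal natural parameters

# Recover the probabilities: append theta_K = 0 and renormalize.
unnorm = np.exp(np.append(theta, 0.0))
pi_recovered = unnorm / unnorm.sum()
assert np.allclose(pi, pi_recovered)
```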
Maximum-likelihood estimation

Given N i.i.d. observations X = {x_1, ..., x_N}, the likelihood is
$$ L(X \mid \theta) = \prod_{i=1}^N \frac{1}{Z(\theta)}\, h(x_i)\, \exp\!\left(\theta^\top\phi(x_i)\right) = \left(\frac{1}{Z(\theta)}\right)^{N} \left(\prod_{i=1}^N h(x_i)\right) \exp\!\left(\theta^\top \sum_{i=1}^N \phi(x_i)\right) $$

The log-likelihood ℓ(X|θ) = log L(X|θ) is
$$ \ell(X \mid \theta) = \theta^\top\left(\sum_{i=1}^N \phi(x_i)\right) - N\log Z(\theta) + \sum_{i=1}^N \log h(x_i) $$

To calculate a maximum-likelihood estimate, we only need the sum of the sufficient statistics.
MLE for exponential families

$$ \ell(X \mid \theta) = \theta^\top\left(\sum_{i=1}^N \phi(x_i)\right) - N\log Z(\theta) + \sum_{i=1}^N \log h(x_i) $$

At the MLE, for each component θ_d of θ, ∂ℓ(X|θ)/∂θ_d = 0:
$$ \sum_{i=1}^N \phi_d(x_i) = N\, \frac{\partial \log Z(\theta)}{\partial \theta_d} $$
$$ \frac{1}{N}\sum_{i=1}^N \phi_d(x_i) = \frac{1}{Z(\theta)}\frac{\partial Z(\theta)}{\partial \theta_d} = \frac{1}{Z(\theta)}\frac{\partial}{\partial \theta_d}\int h(x)\exp\!\left(\theta^\top\phi(x)\right) dx = \int \frac{1}{Z(\theta)}\, h(x)\exp\!\left(\theta^\top\phi(x)\right)\phi_d(x)\, dx $$
MLE for exponential families

Match empirical and population averages of ϕ(x):
$$ \frac{1}{N}\sum_{i=1}^N \phi_d(x_i) = E_{\theta_{MLE}}[\phi_d(x)] $$

RHS: the moment parameters of the exponential family distribution. Thus θ_MLE are the natural parameters corresponding to the empirical moment parameters ("moment matching").

Is this a maximum, i.e. is the second derivative (Hessian) negative (negative definite)? We can show
$$ \frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log Z(\theta) = \mathrm{Cov}\!\left(\phi_i, \phi_j\right), $$
so the Hessian of ℓ(X|θ) is −N times the feature covariance matrix (negative semi-definite).
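A quick numeric check of this identity (my own example, assuming numpy): for the Poisson in exponential-family form, log Z(θ) = e^θ, so ∂² log Z/∂θ² = e^θ = λ, which should equal the variance of the single sufficient statistic ϕ(x) = x.

```python
import numpy as np

lam = 4.2
theta = np.log(lam)

# Second derivative of log Z(theta) = exp(theta) for the Poisson:
hessian_logZ = np.exp(theta)            # equals lambda

# Compare with the empirical variance of the sufficient statistic phi(x) = x.
rng = np.random.default_rng(0)
x = rng.poisson(lam, size=100_000)
print(hessian_logZ, x.var())            # both close to 4.2
```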
Example

The 1-d Gaussian: ϕ(x) = [x, x²]
The moment parameters are the mean and the mean square.
Easy to find the corresponding natural parameters.

Quite often this is not the case; however, we will restrict ourselves to cases where it is.

Can you do it for the Poisson? p(x|λ) = (1/x!) λ^x exp(−λ)
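A sketch of the Poisson answer (my own worked example): ϕ(x) = x, so the moment parameter is E[x] = λ; moment matching gives λ_MLE equal to the sample mean, and θ_MLE equal to the log of the sample mean.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.poisson(lam=4.2, size=10_000)

lam_mle = x.mean()               # match the single moment E[x] = lambda
theta_mle = np.log(lam_mle)      # corresponding natural parameter log(lambda)
print(lam_mle, theta_mle)
```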
Missing data in exponential family distributions

Let samples from the exponential family have two parts: [x, y]. Feature vector ϕ([x, y]) := ϕ(x, y).
$$ P(x, y \mid \theta) := P([x\ y] \mid \theta) = \frac{h(x, y)}{Z(\theta)}\exp\!\left(\theta^\top\phi(x, y)\right) $$

We observe only x. What is the posterior over y?
$$ P(y \mid x, \theta) = \frac{P(x, y \mid \theta)}{P(x \mid \theta)} = \frac{h(x, y)}{P(x \mid \theta)\, Z(\theta)}\exp\!\left(\theta^\top\phi(x, y)\right) $$

This is an exponential family distribution over y (remember x is fixed) with:
feature vector ϕ_x(y) = ϕ(x, y)
base distribution h_x(y) = h(x, y)

Not necessarily easy to work with, but we will restrict ourselves to this case.
y_i with different x_i belong to different exponential family distributions.
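A minimal sketch (my own toy setup, not from the slides) of this posterior when y is discrete: P(y|x, θ) is just the joint P(x, y|θ) renormalized over y, with x held fixed.

```python
import numpy as np

def log_joint(x, y, means=(0.0, 3.0), weights=(0.7, 0.3)):
    # Hypothetical joint over (x, y): y picks one of two unit-variance Gaussians.
    return np.log(weights[y]) - 0.5 * (x - means[y]) ** 2

def posterior_y(x):
    logp = np.array([log_joint(x, y) for y in (0, 1)])
    logp -= logp.max()            # for numerical stability
    p = np.exp(logp)
    return p / p.sum()            # normalize over y only; x stays fixed

print(posterior_y(1.2))
```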
MLE with missing data

What about the marginal likelihood P(x|θ) = ∫ P(x, y|θ) dy?
Not an exponential family distribution!
Calculating derivatives is messy.
No nice closed-form expression for the MLE.

One approach: the EM algorithm, an algorithm for MLE in exponential family distributions with missing data.

Problem: Given N i.i.d. observations X = {x_1, ..., x_N} from P(x|θ), where P(X, Y|θ) is exponential family, maximize with respect to θ
$$ \ell(X \mid \theta) = \log P(X \mid \theta) = \sum_{i=1}^N \log P(x_i \mid \theta) $$
MLE in latent variable models

$$
\begin{aligned}
\ell(X \mid \theta) &= \sum_{i=1}^N \log P(x_i \mid \theta) = \sum_{i=1}^N \log \int P(x_i, y_i \mid \theta)\, dy_i \\
&= \sum_{i=1}^N \log \int q_i(y_i)\, \frac{P(x_i, y_i \mid \theta)}{q_i(y_i)}\, dy_i && \text{(for arbitrary densities } q_i(y_i)\text{)} \\
&\ge \sum_{i=1}^N \int q_i(y_i)\, \log \frac{P(x_i, y_i \mid \theta)}{q_i(y_i)}\, dy_i && \text{(Jensen's inequality)} \\
&= \sum_{i=1}^N \int q_i(y_i)\, \log P(x_i, y_i \mid \theta)\, dy_i - \sum_{i=1}^N \int q_i(y_i)\, \log q_i(y_i)\, dy_i \\
&= \sum_{i=1}^N \int q_i(y_i)\, \log P(x_i \mid \theta)\, dy_i + \sum_{i=1}^N \int q_i(y_i)\, \log \frac{P(y_i \mid x_i, \theta)}{q_i(y_i)}\, dy_i \\
&= \ell(X \mid \theta) - \sum_{i=1}^N \mathrm{KL}\!\left(q_i(y_i)\,\|\,P(y_i \mid x_i, \theta)\right)
\end{aligned}
$$
Jensen's inequality

Let f(x) be a concave real-valued function defined on X.
Concave: non-positive second derivative (non-increasing derivative).
A chord always lies below the function.
E.g. the logarithm (defined on R_+).

Jensen: for any probability vector p = (p_1, ..., p_K) and any set of points (x_1, ..., x_K),
$$ f\!\left(\sum_{i=1}^K p_i x_i\right) \ge \sum_{i=1}^K p_i f(x_i) $$

In fact, for a probability density p(x),
$$ f\!\left(\int_{X} x\, p(x)\, dx\right) \ge \int_{X} f(x)\, p(x)\, dx $$
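A quick numeric illustration (my own, assuming numpy) with the concave function f = log: f(Σ p_i x_i) ≥ Σ p_i f(x_i).

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.dirichlet(np.ones(5))          # a probability vector
x = rng.uniform(0.5, 10.0, size=5)     # points in R_+

lhs = np.log(np.dot(p, x))             # f(sum_i p_i x_i)
rhs = np.dot(p, np.log(x))             # sum_i p_i f(x_i)
assert lhs >= rhs
print(lhs, rhs)
```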
MLE in latent variable models

Defining Q(Y) = ∏_{i=1}^N q_i(y_i),
$$
\begin{aligned}
\ell(X \mid \theta) &\ge \sum_{i=1}^N \int q_i(y_i)\, \log P(x_i, y_i \mid \theta)\, dy_i + \sum_{i=1}^N H(q_i) \\
&= \sum_{i=1}^N E_{q_i}\!\left[\log P(x_i, y_i \mid \theta)\right] + \sum_{i=1}^N H(q_i) \\
&= \ell(X \mid \theta) - \sum_{i=1}^N \mathrm{KL}\!\left(q_i(y_i)\,\|\,P(y_i \mid x_i, \theta)\right) \;=:\; F_X(\theta, Q(\cdot))
\end{aligned}
$$

F_X(θ, Q(·)) is a lower bound on the log-likelihood ℓ(X|θ). It is sometimes called the variational free energy, and is a function of θ and the variational distribution Q(Y) (X is fixed).
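A small numeric sketch (my own toy numbers) of the bound for a single observation with a binary latent y: F = E_q[log P(x, y|θ)] + H(q) never exceeds log P(x|θ), with equality when q is the posterior.

```python
import numpy as np

log_joint = np.log(np.array([0.15, 0.05]))   # log P(x, y | theta) for y = 0, 1 (x fixed)
log_lik = np.logaddexp.reduce(log_joint)     # log P(x | theta) = log sum_y P(x, y | theta)

def free_energy(q):
    q = np.asarray(q, dtype=float)
    return np.sum(q * log_joint) - np.sum(q * np.log(q))   # E_q[log P] + H(q)

posterior = np.exp(log_joint - log_lik)
for q in ([0.5, 0.5], [0.9, 0.1], posterior):
    assert free_energy(q) <= log_lik + 1e-12
print(free_energy(posterior), log_lik)       # equal: the bound is tight at the posterior
```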
Optimizing the variational lower bound

Our original goal was to maximize the log-likelihood: θ_MLE = argmax_θ ℓ(X|θ).
EM algorithm: maximize the lower bound instead: (θ*, Q*) = argmax_{θ, Q} F_X(θ, Q(·)).
Hopefully easier, since all summations are outside the logarithms.

Strategy: coordinate ascent. Alternately maximize w.r.t. Q and θ:
First find the best lower bound given the current θ^s.
Then optimize this lower bound to find θ^{s+1}.
The EM algorithm

Maximizing F_X(θ, Q) with θ fixed:
Recall F_X(θ, Q) = ℓ(X|θ) − Σ_{i=1}^N KL(q_i(y_i) || P(y_i | x_i, θ)).
Solution: set q_i(y_i) = P(y_i | x_i, θ) for i = 1, ..., N.
Recall: P(· | x_i, θ) is an exponential family distribution with natural parameters θ and feature vector ϕ(x_i, ·).
The EM algorithm

Maximizing F_X(θ, Q) with Q fixed:
$$ F_X(\theta, Q) = \sum_{i=1}^N E_{q_i}\!\left[\log P(x_i, y_i \mid \theta)\right] + \sum_{i=1}^N H(q_i) $$
The entropy terms H(q_i) don't depend on θ, so ignore them.
$$ \log P(x_i, y_i \mid \theta) = \theta^\top \phi(x_i, y_i) + \log h(x_i, y_i) - \log Z(\theta) $$
$$ F_X(\theta, Q) = \theta^\top \sum_{i=1}^N E_{q_i}\!\left[\phi(x_i, y_i)\right] - N \log Z(\theta) + \text{const} $$
The Expectation-Maximization algorithm

$$ F_X(\theta, Q) = \theta^\top \sum_{i=1}^N E_{q_i}\!\left[\phi(x_i, y_i)\right] - N \log Z(\theta) + \text{const} $$

To maximize F_X w.r.t. θ_d, solve ∂F_X(θ, Q)/∂θ_d = 0.
Solution: set θ to match moments (compare with the fully observed case):
$$ E_{\theta}\!\left[\phi_d(x, y)\right] = \frac{1}{N}\sum_{i=1}^N E_{q_i}\!\left[\phi_d(x_i, y_i)\right] $$

Here q_i(y_i) = P(y_i | x_i, θ^old), an exponential family distribution whose moment parameters can be calculated (by assumption).
Step s of the EM algorithm

Current parameters: θ^s, Q^s(Y) = ∏_{i=1}^N q_i^s(y_i).

E-step (Expectation): for i = 1, ..., N:
Set q_i^{s+1}(y_i) = P(y_i | x_i, θ^s), an exponential family distribution with sufficient statistics ϕ(x_i, ·) and natural parameters θ^s.
Calculate E_{q_i^{s+1}}[ϕ(x_i, y_i)].

M-step (Maximization):
Set θ^{s+1} so that E_{θ^{s+1}}[ϕ(x, y)] = (1/N) Σ_{i=1}^N E_{q_i^{s+1}}[ϕ(x_i, y_i)].
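A minimal end-to-end sketch of these two steps (my own illustration, not code from the course): EM for a two-component 1-d Gaussian mixture with unknown means and mixing weights and known unit variances. The E-step computes posterior responsibilities, and the M-step matches the expected sufficient statistics.

```python
import numpy as np

def em_gmm(x, n_iters=100):
    means = np.array([x.min(), x.max()])         # crude initialization
    weights = np.array([0.5, 0.5])
    for _ in range(n_iters):
        # E-step: responsibilities r[i, c] = P(y_i = c | x_i, theta^s)
        logp = np.log(weights) - 0.5 * (x[:, None] - means[None, :]) ** 2
        logp -= logp.max(axis=1, keepdims=True)  # numerical stability
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: match the expected sufficient statistics
        weights = r.mean(axis=0)
        means = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
    return means, weights

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0.0, 1.0, 600), rng.normal(4.0, 1.0, 400)])
print(em_gmm(x))
```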
EM algorithm never decreases the log-likelihood

Recall F_X(θ^{s−1}, Q) = ℓ(X|θ^{s−1}) − KL(Q(Y) || P(Y | X, θ^{s−1})).
After the E-step, Q^s(Y) = P(Y | X, θ^{s−1}), so
$$ \ell(X \mid \theta^{s-1}) = F_X(\theta^{s-1}, Q^s) \le F_X(\theta^s, Q^s) \le F_X(\theta^s, Q^{s+1}) = \ell(X \mid \theta^s) $$

Can also show that local maxima of F_X are local maxima of ℓ.

Variants of the E- and M-steps:
Partial M-step: update θ to increase (rather than maximize) F_X.
Partial E-step: update Q to decrease KL(Q(Y) || P(Y | X, θ)) (rather than reduce it to 0), e.g. update just one or a few q_i.
A toy example

A mixture of two Gaussians, N(x|m, 1) and N(x|5 − m, 2). The first has probability 0.6, the second 0.4.
We observe a single data point x = 1. What is the ML estimate of m?

What is the hidden variable?
Is the overall model exponential family?
What is the posterior distribution over the hidden variable?
If we knew the hidden variable, what would the MLE be?
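As a sketch of where the questions lead (my own derivation; I write z ∈ {1, 2} for the hidden component indicator, q for the posterior probability of the first component, and σ₂² for the second component's variance, since the "2" above could be read as a variance or a standard deviation), the two EM updates for m are:

$$ q^{s+1} = P(z = 1 \mid x, m^s) = \frac{0.6\,\mathcal{N}(x \mid m^s, 1)}{0.6\,\mathcal{N}(x \mid m^s, 1) + 0.4\,\mathcal{N}(x \mid 5 - m^s, \sigma_2^2)} $$

$$ m^{s+1} = \arg\max_m \left\{ -\,q^{s+1}\,\frac{(x - m)^2}{2} \;-\; (1 - q^{s+1})\,\frac{(x - 5 + m)^2}{2\sigma_2^2} \right\} = \frac{q^{s+1}\, x \;-\; (1 - q^{s+1})\,(x - 5)/\sigma_2^2}{q^{s+1} \;+\; (1 - q^{s+1})/\sigma_2^2} $$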
The EM algorithm

[Figure: two panels tracing the coordinate-ascent updates on the lower bound, f_q as a function of q (left, q in [0, 1]) and f_m as a function of m (right, m roughly in [−2.5, 7.5]).]

Initialize m = 2.9.
Set q = 0.37.
Set m = 1.88.
Set q = 0.775.
Set m = 1.265.
Repeat till convergence.
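A small sketch of this alternation (my own code, not the course's). It assumes the second component's "2" is its variance; under a different reading of that parameter the printed trace will differ from the one above, so treat the numbers as illustrative only.

```python
import numpy as np
from scipy.stats import norm

x, var2 = 1.0, 2.0        # observation, and the assumed variance of the second component
m = 2.9                   # initialization, as on the slide

for step in range(10):
    # E-step: posterior probability q that x came from the first component
    p1 = 0.6 * norm.pdf(x, loc=m, scale=1.0)
    p2 = 0.4 * norm.pdf(x, loc=5.0 - m, scale=np.sqrt(var2))
    q = p1 / (p1 + p2)
    # M-step: maximize the expected complete-data log-likelihood over m
    m = (q * x - (1 - q) * (x - 5.0) / var2) / (q + (1 - q) / var2)
    print(f"step {step}: q = {q:.3f}, m = {m:.3f}")
```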