lecture 10: the em algorithm (contd)


lecture 10: the em algorithm (contd)
STAT 545: Intro. to Computational Statistics
Vinayak Rao, Purdue University
September 24, 2018

Exponential family models

Consider a space X, e.g. R, R^d or N.

ϕ(x) = [ϕ_1(x), ..., ϕ_D(x)]: (feature) vector of sufficient statistics
θ = [θ_1, ..., θ_D]: vector of natural parameters

Exponential family distribution:

  p(x | θ) = (1/Z(θ)) h(x) exp(θᵀϕ(x))

h(x) is the base measure or base distribution.
Z(θ) = ∫ h(x) exp(θᵀϕ(x)) dx is the normalization constant.
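
To make the definition concrete, here is a minimal sketch (my own, not from the lecture) that evaluates an exponential-family density by computing Z(θ) with numerical integration, using the 1-d Gaussian (ϕ(x) = [x, x²], h(x) = 1) as a sanity check; the function names are mine.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# 1-d Gaussian in natural-parameter form: phi(x) = [x, x^2], h(x) = 1,
# theta = [mu/sigma^2, -1/(2*sigma^2)].  Z(theta) is computed numerically.
def phi(x):
    return np.array([x, x**2])

def unnormalized(x, theta):
    return np.exp(theta @ phi(x))                 # h(x) = 1 here

def Z(theta):
    # normalization constant by numerical integration (assumed finite)
    val, _ = quad(lambda x: unnormalized(x, theta), -np.inf, np.inf)
    return val

mu, sigma2 = 1.5, 0.7
theta = np.array([mu / sigma2, -1.0 / (2 * sigma2)])
x = 0.3
print(unnormalized(x, theta) / Z(theta))           # exponential-family density
print(norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))  # same value
```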

Examples

The normal distribution:

  p(x | µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))
              = (exp(−µ²/(2σ²))/√(2πσ²)) exp(−x²/(2σ²) + (µ/σ²) x)

The Poisson distribution:

  p(x | λ) = λ^x exp(−λ)/x!
           = exp(−λ) (1/x!) exp(x log λ)
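
As a quick check of the Poisson rewriting above, this sketch (my own) evaluates the exponential-family form h(x) exp(θx)/Z(θ) with θ = log λ, h(x) = 1/x!, Z(θ) = exp(λ), and compares it to scipy.stats.poisson.

```python
import numpy as np
from scipy.stats import poisson
from scipy.special import factorial

# Poisson in exponential-family form:
#   phi(x) = x, theta = log(lambda), h(x) = 1/x!, Z(theta) = exp(lambda)
lam = 3.2
theta = np.log(lam)
x = np.arange(10)

h = 1.0 / factorial(x)
p_expfam = h * np.exp(theta * x) / np.exp(lam)      # h(x) exp(theta*x) / Z(theta)
print(np.allclose(p_expfam, poisson.pmf(x, lam)))   # True
```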

Minimal exponential family

Sufficient statistics are linearly independent.

Consider a K-component discrete distribution π = (π_1, ..., π_K):

  p(x) = ∏_{c=1}^K π_c^{δ(x=c)} = exp( ∑_{c=1}^K δ(x = c) log π_c )

Is it minimal? No: the K statistics δ(x = 1), ..., δ(x = K) always sum to 1, so they are linearly dependent. Rewriting,

  p(x) = π_K exp( ∑_{c=1}^{K−1} δ(x = c) log(π_c/π_K) )
       = (1/Z) exp( ∑_{c=1}^{K−1} δ(x = c) θ_c )

with θ_c = log(π_c/π_K), which is a minimal representation.
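
A small illustrative sketch (names are my own) of the minimal parameterization: with θ_c = log(π_c/π_K) for c = 1, ..., K−1, the probabilities are recovered by a softmax with the last logit pinned to 0.

```python
import numpy as np

# A 3-component discrete distribution in minimal exponential-family form.
pi = np.array([0.5, 0.3, 0.2])
theta = np.log(pi[:-1] / pi[-1])           # K-1 minimal natural parameters

logits = np.append(theta, 0.0)             # pin theta_K = 0
pi_back = np.exp(logits) / np.exp(logits).sum()
print(theta, pi_back)                      # pi_back == pi

# In the overcomplete (K-dim) representation, adding the same constant to every
# log pi_c leaves the distribution unchanged: the K statistics delta(x=c) are
# linearly dependent, so that representation is not minimal.
```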

Maximum-likelihood estimation

Given N i.i.d. observations X = {x_1, ..., x_N}, the likelihood is

  L(X | θ) = ∏_{i=1}^N (1/Z(θ)) h(x_i) exp(θᵀϕ(x_i))
           = (1/Z(θ))^N ( ∏_{i=1}^N h(x_i) ) exp( θᵀ ∑_{i=1}^N ϕ(x_i) )

The log-likelihood l(X | θ) = log L(X | θ) is

  l(X | θ) = θᵀ( ∑_{i=1}^N ϕ(x_i) ) − N log Z(θ) + ∑_{i=1}^N log h(x_i)

To calculate a maximum-likelihood estimate, we only need the sum of the sufficient statistics.
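
To illustrate that the sum of sufficient statistics is all we need, here is a sketch (my own; Gaussian case with h(x) = 1) that computes the log-likelihood from [∑ x_i, ∑ x_i²] alone and checks it against summing individual log-densities.

```python
import numpy as np
from scipy.stats import norm

# Log-likelihood of i.i.d. Gaussian data computed from the sufficient-statistic
# sums alone: sum_phi = [sum x_i, sum x_i^2] and N are all we need.
def gaussian_loglik_from_suffstats(sum_phi, N, mu, sigma2):
    theta = np.array([mu / sigma2, -1.0 / (2 * sigma2)])          # natural parameters
    log_Z = 0.5 * np.log(2 * np.pi * sigma2) + mu**2 / (2 * sigma2)
    return theta @ sum_phi - N * log_Z                            # h(x) = 1, so no h term

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=1000)
sum_phi = np.array([x.sum(), (x**2).sum()])

# Agrees with summing the individual log-densities:
print(gaussian_loglik_from_suffstats(sum_phi, len(x), 2.0, 1.5**2))
print(norm.logpdf(x, loc=2.0, scale=1.5).sum())
```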

MLE for exponential families

  l(X | θ) = θᵀ( ∑_{i=1}^N ϕ(x_i) ) − N log Z(θ) + ∑_{i=1}^N log h(x_i)

At the MLE, for each component θ_d of θ, ∂l(X | θ)/∂θ_d = 0, giving

  ∑_{i=1}^N ϕ_d(x_i) = N ∂ log Z(θ)/∂θ_d

Now,

  (1/N) ∑_{i=1}^N ϕ_d(x_i) = ∂ log Z(θ)/∂θ_d = (1/Z(θ)) ∂Z(θ)/∂θ_d
                           = (1/Z(θ)) ∂/∂θ_d ∫ h(x) exp(θᵀϕ(x)) dx
                           = (1/Z(θ)) ∫ h(x) exp(θᵀϕ(x)) ϕ_d(x) dx
                           = E_θ[ϕ_d(x)]

MLE for exponential families

Match empirical and population averages of ϕ(x):

  (1/N) ∑_{i=1}^N ϕ_d(x_i) = E_{θ_MLE}[ϕ_d(x)]

The RHS gives the moment parameters of the exponential family distribution. Thus θ_MLE are the natural parameters corresponding to the empirical moment parameters ("moment matching").

Is this a maximum? Is the second derivative (Hessian) negative (negative definite)? We can show

  ∂² log Z(θ) / (∂θ_i ∂θ_j) = Cov(ϕ_i, ϕ_j),

so the Hessian of l(X | θ) is −N times the feature covariance matrix, which is negative semi-definite.
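
A sketch of moment matching for the 1-d Gaussian (my own illustration): the MLE equates the model's first two moments with the empirical ones, and a numerical maximization of the log-likelihood lands on the same values.

```python
import numpy as np
from scipy.optimize import minimize

# Moment matching for the 1-d Gaussian (phi = [x, x^2]):
# mu_hat = mean(x), sigma2_hat = mean(x^2) - mean(x)^2.
rng = np.random.default_rng(1)
x = rng.normal(-0.5, 2.0, size=5000)

mu_hat = x.mean()
sigma2_hat = (x**2).mean() - mu_hat**2

# Sanity check: numerically maximizing the log-likelihood gives the same answer.
def negloglik(params):
    mu, log_s2 = params
    s2 = np.exp(log_s2)
    return 0.5 * len(x) * np.log(2 * np.pi * s2) + ((x - mu)**2).sum() / (2 * s2)

res = minimize(negloglik, x0=[0.0, 0.0])
print(mu_hat, sigma2_hat, res.x[0], np.exp(res.x[1]))
```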

Example

The 1-d Gaussian: ϕ(x) = [x, x²].

The moment parameters are the mean and the mean square.
It is easy to find the corresponding natural parameters.

Quite often, recovering the natural parameters from the moment parameters is not this easy; however, we will restrict ourselves to cases where it is.

Can you do it for the Poisson?  p(x | λ) = (1/x!) λ^x exp(−λ)

Missing data in exponential family distributions

Let samples from the exponential family have two parts: [x y], with feature vector ϕ([x y]) := ϕ(x, y).

  P(x, y | θ) := P([x y] | θ) = (h(x, y)/Z(θ)) exp(θᵀϕ(x, y))

We observe only x. What is the posterior over y?

  P(y | x, θ) = P(x, y | θ)/P(x | θ) = (h(x, y)/(P(x | θ) Z(θ))) exp(θᵀϕ(x, y))

This is an exponential family distribution over y (remember x is fixed) with:
feature vector ϕ_x(y) = ϕ(x, y)
base distribution h_x(y) = h(x, y)

Not necessarily easy to work with, but we will restrict to this case.
The y_i associated with different x_i belong to different exp. fam. distributions.

MLE with missing data

What about the marginal likelihood P(x | θ) = ∫ P(x, y | θ) dy?

Not an exponential family distribution!
Calculating derivatives is messy.
No nice closed-form expression for the MLE.

One approach: the EM algorithm, an algorithm for MLE in exp. fam. distributions with missing data.

Problem: Given i.i.d. observations X = {x_1, ..., x_N} from P(x | θ), where P(X, Y | θ) is exponential family, maximize w.r.t. θ

  l(X | θ) = log P(X | θ) = ∑_{i=1}^N log P(x_i | θ)

MLE in latent variable models

  l(X | θ) = ∑_{i=1}^N log P(x_i | θ) = ∑_{i=1}^N log ∫ P(x_i, y_i | θ) dy_i
           = ∑_{i=1}^N log ∫ q_i(y_i) (P(x_i, y_i | θ)/q_i(y_i)) dy_i        (for arbitrary densities q_i(y_i))
           ≥ ∑_{i=1}^N ∫ q_i(y_i) log (P(x_i, y_i | θ)/q_i(y_i)) dy_i        (Jensen's inequality)
           = ∑_{i=1}^N ∫ q_i(y_i) log P(x_i, y_i | θ) dy_i − ∑_{i=1}^N ∫ q_i(y_i) log q_i(y_i) dy_i
           = ∑_{i=1}^N ∫ q_i(y_i) log P(x_i | θ) dy_i + ∑_{i=1}^N ∫ q_i(y_i) log (P(y_i | x_i, θ)/q_i(y_i)) dy_i
           = l(X | θ) − ∑_{i=1}^N KL(q_i(y_i) ∥ P(y_i | x_i, θ))

Jensen's inequality

Let f(x) be a concave real-valued function defined on X.
Concave: non-positive second derivative (non-increasing derivative).
A chord always lies below the function. E.g. the logarithm (defined on R_+).

Jensen: for any probability vector p = (p_1, ..., p_K) and any set of points (x_1, ..., x_K),

  f( ∑_{i=1}^K p_i x_i ) ≥ ∑_{i=1}^K p_i f(x_i)

In fact, for a probability density p(x),

  f( ∫_X x p(x) dx ) ≥ ∫_X f(x) p(x) dx
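
A quick numerical illustration (mine) of Jensen's inequality for the concave logarithm: the log of an average is at least the average of the logs.

```python
import numpy as np

# Jensen's inequality for f = log: log(sum_i p_i x_i) >= sum_i p_i log(x_i).
rng = np.random.default_rng(2)
x = rng.uniform(0.1, 5.0, size=6)            # arbitrary positive points
p = rng.dirichlet(np.ones(6))                # arbitrary probability vector

lhs = np.log(p @ x)                          # f(sum_i p_i x_i)
rhs = p @ np.log(x)                          # sum_i p_i f(x_i)
print(lhs, rhs, lhs >= rhs)                  # lhs >= rhs always holds
```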

MLE in latent variable models

Defining Q(Y) = ∏_{i=1}^N q_i(y_i),

  l(X | θ) ≥ ∑_{i=1}^N ∫ q_i(y_i) log P(x_i, y_i | θ) dy_i + ∑_{i=1}^N H(q_i)
           = ∑_{i=1}^N E_{q_i}[log P(x_i, y_i | θ)] + ∑_{i=1}^N H(q_i)
           = l(X | θ) − ∑_{i=1}^N KL(q_i(y_i) ∥ P(y_i | x_i, θ))
           := F_X(θ, Q(·))

F_X(θ, Q(·)) is a lower bound on the log-likelihood l(X | θ). It is sometimes called the variational free energy, and it is a function of θ and the variational distribution Q(Y) (X is fixed).
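
To see the bound in action, here is a sketch (my own toy mixture, not the lecture's) that computes F_X(θ, q) for a single observation with a two-component discrete latent variable and compares it to log P(x | θ); the bound is tight exactly when q is the posterior P(y | x, θ).

```python
import numpy as np
from scipy.stats import norm

# Lower bound F_X(theta, q) vs. the log-likelihood for one observation x
# from a two-component Gaussian mixture (discrete latent y in {0, 1}).
w = np.array([0.6, 0.4])                     # mixing weights (illustrative choice)
means = np.array([0.0, 4.0])
sds = np.array([1.0, 2.0])
x = 1.0

joint = w * norm.pdf(x, means, sds)          # P(x, y) for y = 0, 1
loglik = np.log(joint.sum())                 # log P(x)
posterior = joint / joint.sum()              # P(y | x)

def F(q1):
    # F = E_q[log P(x, y)] + H(q), with q = (q1, 1-q1) over the two components
    q = np.array([q1, 1.0 - q1])
    return q @ np.log(joint) - q @ np.log(q)

for q1 in [0.1, 0.5, posterior[0], 0.9]:
    print(q1, F(q1), "<=", loglik)           # equality when q is the true posterior
```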

Optimizing the variational lower bound

Our original goal was to maximize the log-likelihood:

  θ_MLE = argmax_θ l(X | θ)

EM algorithm: maximize the lower bound instead:

  (θ*, Q*) = argmax_{θ, Q} F_X(θ, Q(·))

Hopefully easier, since all summations are outside the logarithms.

Strategy: coordinate ascent. Alternately maximize w.r.t. Q and θ:
First find the best lower bound given the current θ^s.
Then optimize this lower bound to find θ^{s+1}.

The EM algorithm

Maximizing F_X(θ, Q) with θ fixed:

Recall F_X(θ, Q) = l(X | θ) − ∑_{i=1}^N KL(q_i(y_i) ∥ P(y_i | x_i, θ)).

Solution: set q_i(y_i) = P(y_i | x_i, θ) for i = 1, ..., N.

Recall: P(· | x_i, θ) is an exponential family distribution with natural parameters θ and feature vector ϕ(x_i, ·).

The EM algorithm

Maximizing F_X(θ, Q) with Q fixed:

  F_X(θ, Q) = ∑_{i=1}^N E_{q_i}[log P(x_i, y_i | θ)] + ∑_{i=1}^N H(q_i)

The entropy terms H(q_i) don't depend on θ, so ignore them.

  log P(x_i, y_i | θ) = θᵀϕ(x_i, y_i) + log h(x_i, y_i) − log Z(θ)

  F_X(θ, Q) = θᵀ ∑_{i=1}^N E_{q_i}[ϕ(x_i, y_i)] − N log Z(θ) + const

The Expectation-Maximization algorithm

  F_X(θ, Q) = θᵀ ∑_{i=1}^N E_{q_i}[ϕ(x_i, y_i)] − N log Z(θ) + const

To maximize F_X w.r.t. θ_d, solve ∂F_X(θ, Q)/∂θ_d = 0.

Solution: set θ to match moments (compare with the fully observed case):

  E_θ[ϕ_d(x, y)] = (1/N) ∑_{i=1}^N E_{q_i}[ϕ_d(x_i, y_i)]

Here q_i(y_i) = P(y_i | x_i, θ_old), an exponential family distribution whose moment parameters can be calculated (by assumption).

Step s of the EM algorithm

Current parameters: θ^s, Q^s(Y) = ∏_{i=1}^N q_i^s(y_i).

E-step (Expectation): For i = 1, ..., N:
  Set q_i^{s+1}(y_i) = P(y_i | x_i, θ^s), an exp. fam. distribution with suff. stats ϕ(x_i, ·) and natural params θ^s.
  Calculate E_{q_i^{s+1}}[ϕ(x_i, y_i)].

M-step (Maximization):
  Set θ^{s+1} so that E_{θ^{s+1}}[ϕ(x, y)] = (1/N) ∑_{i=1}^N E_{q_i^{s+1}}[ϕ(x_i, y_i)].
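
The E- and M-steps above instantiate directly for a Gaussian mixture, where the E-step computes responsibilities and the M-step matches moments in closed form. The sketch below (my own; a standard two-component 1-d mixture with all parameters free, not the course's code) shows one way this looks.

```python
import numpy as np
from scipy.stats import norm

# Minimal EM for a 1-d mixture of two Gaussians.
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 200)])

w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for s in range(100):
    # E-step: q_i(y_i) = P(y_i | x_i, theta^s)  (responsibilities)
    dens = w * norm.pdf(x[:, None], mu, np.sqrt(var))       # N x 2
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: match moments of the complete-data sufficient statistics
    Nk = r.sum(axis=0)
    w = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    var = (r * (x[:, None] - mu)**2).sum(axis=0) / Nk
print(w, mu, var)
```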

EM algorithm never decreases the log-likelihood

Recall F_X(θ^{s−1}, Q^{s−1}) = l(X | θ^{s−1}) − KL(Q^{s−1}(Y) ∥ P(Y | X, θ^{s−1})).

After the E-step, Q^s(Y) = P(Y | X, θ^{s−1}), so the KL term vanishes and

  l(X | θ^{s−1}) = F_X(θ^{s−1}, Q^s) ≤ F_X(θ^s, Q^s) ≤ F_X(θ^s, Q^{s+1}) = l(X | θ^s)

We can also show that local maxima of F_X are local maxima of l.

Variants of the E- and M-steps:

Partial M-step: update θ to increase (rather than maximize) F_X.
Partial E-step: update Q to decrease KL(Q(Y) ∥ P(Y | X, θ)) (rather than reduce it to 0), e.g. update just one or a few of the q_i.

A toy example

A mixture of two Gaussians, N(x | m, 1) and N(x | 5 − m, 2). The first has probability 0.6, the second 0.4.

We observe a single data point x = 1. What is the ML estimate of m?

What is the hidden variable?
Is the overall model exponential family?
What is the posterior distribution over the hidden variable?
If we knew the hidden variable, what would the MLE be?

The EM algorithm

[Figure: two panels tracing the coordinate ascent, the lower bound f_q as a function of the variational parameter q ∈ [0, 1], and f_m as a function of m (roughly −2.5 to 7.5).]

Initialize m = 2.9.
Set q = 0.37.
Set m = 1.88.
Set q = 0.775.
Set m = 1.265.
Repeat till convergence.
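
To close the loop on the toy example, here is a sketch of the two coordinate-ascent updates in code. The lecture does not pin down whether the second argument of N(· | 5 − m, 2) is a variance or a standard deviation; the sketch assumes standard deviations of 1 and 2, which makes the iterates land close to, though not exactly on, the trace above.

```python
import numpy as np
from scipy.stats import norm

# EM for the toy problem: one observation x = 1, unknown location m,
# mixture 0.6 * N(x | m, s1) + 0.4 * N(x | 5 - m, s2).
# Reading s1 = 1 and s2 = 2 as standard deviations is an assumption on my part.
x, w1, w2, s1, s2 = 1.0, 0.6, 0.4, 1.0, 2.0

m = 2.9                                            # initialization from the slides
for step in range(20):
    # E-step: q = P(component 1 | x, m)
    p1 = w1 * norm.pdf(x, m, s1)
    p2 = w2 * norm.pdf(x, 5 - m, s2)
    q = p1 / (p1 + p2)
    # M-step: maximize q*log N(x|m,s1) + (1-q)*log N(x|5-m,s2) over m.
    # Setting the derivative to zero gives a linear equation in m:
    #   q*(x - m)/s1^2 - (1 - q)*(x - 5 + m)/s2^2 = 0
    m = (q * x / s1**2 - (1 - q) * (x - 5) / s2**2) / (q / s1**2 + (1 - q) / s2**2)
    print(f"step {step}: q = {q:.3f}, m = {m:.3f}")
```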