Probabilistic & Unsupervised Learning


Probabilistic & Unsupervised Learning Week 3: The EM algorithm Yee Whye Teh ywteh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London Term 1, Autumn 2009

Mixtures of Gaussians

Data: X = {x_1 ... x_N}
Latent process: s_i ~ iid Disc[π]
Component distributions: x_i | (s_i = m) ~ P_m[θ_m] = N(µ_m, Σ_m)
Marginal distribution: P(x_i) = ∑_{m=1}^k π_m P_m(x; θ_m)
Log-likelihood:
log p(X | {µ_m}, {Σ_m}, π) = ∑_{i=1}^N log ∑_{m=1}^k π_m |2πΣ_m|^{-1/2} exp[ -(1/2) (x_i - µ_m)^T Σ_m^{-1} (x_i - µ_m) ]

EM for MoGs

Evaluate responsibilities:
r_im = π_m P_m(x_i) / ∑_{m'} π_{m'} P_{m'}(x_i)

Update parameters:
µ_m ← ∑_i r_im x_i / ∑_i r_im
Σ_m ← ∑_i r_im (x_i - µ_m)(x_i - µ_m)^T / ∑_i r_im
π_m ← ∑_i r_im / N
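The responsibility and update equations above can be sketched in NumPy (a minimal illustration, not the course code; the function names and toy data are made up):

```python
import numpy as np

def e_step(X, pi, mu, Sigma):
    """Responsibilities r[i, m] proportional to pi_m N(x_i | mu_m, Sigma_m)."""
    N, D = X.shape
    k = len(pi)
    log_r = np.zeros((N, k))
    for m in range(k):
        diff = X - mu[m]                                   # (N, D)
        Sinv = np.linalg.inv(Sigma[m])
        quad = np.einsum('nd,de,ne->n', diff, Sinv, diff)  # Mahalanobis terms
        _, logdet = np.linalg.slogdet(2 * np.pi * Sigma[m])
        log_r[:, m] = np.log(pi[m]) - 0.5 * logdet - 0.5 * quad
    log_r -= log_r.max(axis=1, keepdims=True)              # stabilise the exp
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)                # rows sum to 1

def m_step(X, r):
    """Weighted-average updates for pi_m, mu_m, Sigma_m."""
    N, D = X.shape
    Nm = r.sum(axis=0)                                     # effective counts
    mu = (r.T @ X) / Nm[:, None]
    Sigma = []
    for m in range(r.shape[1]):
        diff = X - mu[m]
        Sigma.append((r[:, m, None] * diff).T @ diff / Nm[m])
    return Nm / N, mu, np.array(Sigma)

# toy two-cluster data and one EM iteration from a rough initialisation
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
pi, mu, Sigma = m_step(X, e_step(X, np.full(2, 0.5),
                                 np.array([[-1.0, 0.0], [1.0, 0.0]]),
                                 np.array([np.eye(2), np.eye(2)])))
```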

The Expectation Maximisation (EM) algorithm

The EM algorithm finds a (local) maximum of a latent-variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps:

E step: Fill in values of the latent variables according to the posterior given the data.
M step: Maximise the likelihood as if the latent variables were not hidden.

Useful in models where learning would be easy if the hidden variables were, in fact, observed (e.g. MoGs).
Decomposes difficult problems into a series of tractable steps.
No learning rate.
The framework lends itself to principled approximations.

Jensen's Inequality

Since log is concave, for 0 ≤ α ≤ 1 and x_1, x_2 > 0:
log(α x_1 + (1 - α) x_2) ≥ α log(x_1) + (1 - α) log(x_2)

More generally, for α_i ≥ 0 with ∑_i α_i = 1 and any {x_i > 0}:
log( ∑_i α_i x_i ) ≥ ∑_i α_i log(x_i)

Equality holds iff all the x_i with α_i > 0 are equal; in particular, if α_i = 1 for some i (and therefore all others are 0).
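A quick numeric sanity check of the inequality (illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = rng.dirichlet(np.ones(5))      # weights: non-negative, sum to 1
x = rng.uniform(0.1, 10.0, size=5)     # positive points

lhs = np.log(alpha @ x)                # log of the weighted mean
rhs = alpha @ np.log(x)                # weighted mean of the logs
assert lhs >= rhs - 1e-12              # Jensen: log is concave

# equality when all the mass sits on a single point
alpha_deg = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
assert np.isclose(np.log(alpha_deg @ x), alpha_deg @ np.log(x))
```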

The Free Energy for a Latent Variable Model

Observed data X = {x_i}; latent variables Y = {y_i}; parameters θ.

Goal: Maximize the log likelihood (i.e. ML learning) wrt θ:
l(θ) = log P(X|θ) = log ∫ P(Y, X|θ) dY.

Any distribution q(Y) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:
l(θ) = log ∫ q(Y) [P(Y, X|θ) / q(Y)] dY ≥ ∫ q(Y) log [P(Y, X|θ) / q(Y)] dY =: F(q, θ).

Now,
∫ q(Y) log [P(Y, X|θ) / q(Y)] dY = ∫ q(Y) log P(Y, X|θ) dY - ∫ q(Y) log q(Y) dY
= ⟨log P(Y, X|θ)⟩_{q(Y)} + H[q],
where H[q] is the entropy of q(Y). So:
F(q, θ) = ⟨log P(Y, X|θ)⟩_{q(Y)} + H[q]


The E and M steps of EM

The lower bound on the log likelihood is given by:
F(q, θ) = ⟨log P(Y, X|θ)⟩_{q(Y)} + H[q].

EM alternates between:

E step: optimize F(q, θ) wrt the distribution over the hidden variables, holding the parameters fixed:
q^(k)(Y) := argmax_{q(Y)} F(q(Y), θ^(k-1)).

M step: maximize F(q, θ) wrt the parameters, holding the hidden distribution fixed:
θ^(k) := argmax_θ F(q^(k)(Y), θ) = argmax_θ ⟨log P(Y, X|θ)⟩_{q^(k)(Y)}.

The second equality comes from the fact that the entropy of q(Y) does not depend on θ.


EM as Coordinate Ascent in F

The E Step

The free energy can be re-written:
F(q, θ) = ∫ q(Y) log [P(Y, X|θ) / q(Y)] dY
= ∫ q(Y) log [P(Y|X, θ) P(X|θ) / q(Y)] dY
= ∫ q(Y) log P(X|θ) dY + ∫ q(Y) log [P(Y|X, θ) / q(Y)] dY
= l(θ) - KL[q(Y) ‖ P(Y|X, θ)].

The second term is (the negative of) the Kullback-Leibler divergence.

This means that, for fixed θ, F is bounded above by l, and achieves that bound when KL[q(Y) ‖ P(Y|X, θ)] = 0. But KL[q ‖ p] is zero if and only if q = p. So the E step simply sets
q^(k)(Y) = P(Y|X, θ^(k-1)),
and, after an E step, the free energy equals the likelihood.
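The decomposition F(q, θ) = l(θ) - KL[q ‖ p] can be verified numerically on a toy discrete model (all the numbers below are made up for illustration):

```python
import numpy as np

# Toy latent-variable model: binary latent y, one binary observation x = 1.
p_y = np.array([0.3, 0.7])             # prior P(y)
p_x_given_y = np.array([0.9, 0.2])     # P(x=1 | y)
joint = p_y * p_x_given_y              # P(y, x=1)
lik = joint.sum()                      # P(x=1)
post = joint / lik                     # P(y | x=1)

q = np.array([0.5, 0.5])               # an arbitrary q(y)
F = np.sum(q * np.log(joint / q))      # free energy F(q, theta)
kl = np.sum(q * np.log(q / post))      # KL[q || posterior]

assert np.isclose(F, np.log(lik) - kl)            # F = l - KL
assert F <= np.log(lik) + 1e-12                   # F lower-bounds l
# setting q to the posterior (the E step) makes the bound tight
assert np.isclose(np.sum(post * np.log(joint / post)), np.log(lik))
```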


KL[q(x) ‖ p(x)] is non-negative, and zero iff ∀x: p(x) = q(x)

First let's consider discrete distributions; the Kullback-Leibler divergence is:
KL[q ‖ p] = ∑_i q_i log(q_i / p_i).

To find the distribution q that minimizes KL[q ‖ p], we add a Lagrange multiplier to enforce the normalization constraint:
E := KL[q ‖ p] + λ(1 - ∑_i q_i) = ∑_i q_i log(q_i / p_i) + λ(1 - ∑_i q_i).

We then take partial derivatives and set them to zero:
∂E/∂q_i = log q_i - log p_i + 1 - λ = 0  ⇒  q_i = p_i exp(λ - 1),
∂E/∂λ = 1 - ∑_i q_i = 0  ⇒  ∑_i q_i = 1,
⇒  q_i = p_i.
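A small numeric check of the non-negativity claim (the distributions below are arbitrary illustrative choices):

```python
import numpy as np

def kl(q, p):
    """Discrete KL divergence; assumes strictly positive entries."""
    return np.sum(q * np.log(q / p))

p = np.array([0.1, 0.2, 0.3, 0.15, 0.15, 0.1])
q = np.full(6, 1.0 / 6.0)

assert kl(q, p) >= 0                   # non-negative for any q, p
assert np.isclose(kl(p, p), 0.0)       # and zero when q = p

# perturbing q away from p (keeping it normalised) gives strictly positive KL
q2 = p.copy()
q2[0] += 1e-3
q2[1] -= 1e-3
assert kl(q2, p) > 0
```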

KL[q(x) ‖ p(x)] is non-negative, and zero iff ∀x: p(x) = q(x)

Check that the curvature (Hessian) is positive (definite), corresponding to a minimum:
∂²E/∂q_i² = 1/q_i > 0,    ∂²E/∂q_i ∂q_j = 0 (i ≠ j),
showing that q_i = p_i is a genuine minimum. At the minimum it is easily verified that KL[p ‖ p] = 0.

A similar proof holds for the KL divergence between continuous densities, with the derivatives replaced by functional derivatives.

Coordinate Ascent in F (Demo)

One-parameter mixture:
s ~ Bernoulli[π]
x | (s = 0) ~ N[-1, 1]
x | (s = 1) ~ N[1, 1]
and one data point x_1 = 0.3. q(s) is a distribution on a single binary latent, and so is represented by r_1 ∈ [0, 1].

[Figure: the two unit-variance component densities centred at -1 and 1, with the data point x_1 marked.]
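The demo can be reproduced in a few lines: for fixed π, F(r, π) is maximised at the posterior responsibility, where it touches the log likelihood. (A sketch; the helper names are made up, and x_1 = 0.3 as transcribed above.)

```python
import numpy as np

def log_norm(x, mu):                       # unit-variance Gaussian log density
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2

x1 = 0.3                                   # the single data point

def free_energy(r, pi):
    """F(q, pi) with q(s=1) = r for the one-parameter mixture."""
    ell = r * (np.log(pi) + log_norm(x1, 1.0)) \
        + (1 - r) * (np.log(1 - pi) + log_norm(x1, -1.0))
    entropy = -r * np.log(r) - (1 - r) * np.log(1 - r)
    return ell + entropy

pi = 0.4
# exact E step: posterior responsibility, and the log likelihood at this pi
p1 = pi * np.exp(log_norm(x1, 1.0))
p0 = (1 - pi) * np.exp(log_norm(x1, -1.0))
r_post = p1 / (p0 + p1)
loglik = np.log(p0 + p1)

rs = np.linspace(1e-6, 1 - 1e-6, 2001)
Fs = free_energy(rs, pi)
assert abs(rs[np.argmax(Fs)] - r_post) < 1e-3     # F peaks at the posterior
assert np.isclose(free_energy(r_post, pi), loglik)  # where it equals l(pi)
```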

Coordinate Ascent in F (Demo)

[Figure: contours of F over the mixing proportion π (horizontal axis, 0 to 1) and the responsibility r (vertical axis, 0 to 1).]

EM Never Decreases the Likelihood

The E and M steps together never decrease the log likelihood:
l(θ^(k-1))  =[E step]  F(q^(k), θ^(k-1))  ≤[M step]  F(q^(k), θ^(k))  ≤[Jensen]  l(θ^(k)).

The E step brings the free energy to the likelihood.
The M step maximises the free energy wrt θ.
F ≤ l by Jensen, or, equivalently, from the non-negativity of KL.

If the M step is executed so that θ^(k) ≠ θ^(k-1) iff F increases, then the overall EM iteration will step to a new value of θ iff the likelihood increases.
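The monotonicity argument can be checked empirically by tracking the log likelihood across EM iterations for a univariate mixture (a toy sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2, 0.7, 80), rng.normal(2, 0.7, 120)])

def log_lik(x, pi, mu, var):
    comp = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.log(comp.sum(axis=1)).sum()

pi = np.array([0.5, 0.5])
mu = np.array([-0.5, 0.5])
var = np.array([1.0, 1.0])

lls = []
for _ in range(30):
    lls.append(log_lik(x, pi, mu, var))
    # E step: responsibilities
    comp = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    r = comp / comp.sum(axis=1, keepdims=True)
    # M step: weighted-average updates
    Nm = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / Nm
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nm
    pi = Nm / len(x)

# the sequence of log likelihoods never decreases
assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
```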


Fixed Points of EM are Stationary Points in l

Let a fixed point of EM occur with parameter θ*. Then:
∂/∂θ ⟨log P(Y, X|θ)⟩_{P(Y|X,θ*)} |_{θ*} = 0.

Now,
l(θ) = log P(X|θ) = ⟨log P(X|θ)⟩_{P(Y|X,θ*)}
= ⟨log [P(Y, X|θ) / P(Y|X, θ)]⟩_{P(Y|X,θ*)}
= ⟨log P(Y, X|θ)⟩_{P(Y|X,θ*)} - ⟨log P(Y|X, θ)⟩_{P(Y|X,θ*)}

so,
d/dθ l(θ) = d/dθ ⟨log P(Y, X|θ)⟩_{P(Y|X,θ*)} - d/dθ ⟨log P(Y|X, θ)⟩_{P(Y|X,θ*)}.

The second term is 0 at θ* if the derivative exists (it is the derivative of a KL divergence at its minimum), and thus:
d/dθ l(θ) |_{θ*} = d/dθ ⟨log P(Y, X|θ)⟩_{P(Y|X,θ*)} |_{θ*} = 0.

So EM converges to a stationary point of l(θ).


Maxima in F correspond to maxima in l

Let θ* now be the parameter value at a local maximum of F (and thus at a fixed point).

Differentiating the previous expression wrt θ again, we find:
d²/dθ² l(θ) = d²/dθ² ⟨log P(Y, X|θ)⟩_{P(Y|X,θ*)} - d²/dθ² ⟨log P(Y|X, θ)⟩_{P(Y|X,θ*)}.

The first term on the right is negative at θ* (a maximum), and the second term is positive (a minimum of the KL divergence). Thus the curvature of the likelihood is negative, and θ* is a maximum of l.

[... as long as the derivatives exist. They sometimes don't (e.g. zero-noise ICA).]


Partial M steps and Partial E steps

Partial M steps: The proof holds even if we just increase F wrt θ rather than maximizing it. (Dempster, Laird and Rubin (1977) call this the generalized EM, or GEM, algorithm.)

Partial E steps: We can also just increase F wrt some of the qs. For example, sparse or online versions of the EM algorithm would compute the posterior for a subset of the data points, or as the data arrives, respectively. You can also update the posterior over a subset of the hidden variables, while holding the others fixed...

The Gaussian mixture model (E-step)

In a univariate Gaussian mixture model, the density of a data point x is:
p(x|θ) = ∑_{m=1}^k p(s = m|θ) p(x|s = m, θ) = ∑_{m=1}^k [π_m / √(2πσ_m²)] exp{ -(x - µ_m)² / (2σ_m²) },
where θ is the collection of parameters: means µ_m, variances σ_m², and mixing proportions π_m = p(s = m|θ).

The hidden variable s_i indicates which component observation x_i belongs to.

The E-step computes the posterior for s_i given the current parameters:
q(s_i) = p(s_i|x_i, θ) ∝ p(x_i|s_i, θ) p(s_i|θ)
r_im := q(s_i = m) ∝ (π_m / σ_m) exp{ -(x_i - µ_m)² / (2σ_m²) }    (responsibilities)
with the normalization such that ∑_m r_im = 1.


The Gaussian mixture model (M-step)

In the M-step we optimize the sum (since s is discrete):
E = ⟨log p(x, s|θ)⟩_{q(s)} = ∑_s q(s) log[p(s|θ) p(x|s, θ)]
= ∑_{i,m} r_im [ log π_m - log σ_m - (x_i - µ_m)² / (2σ_m²) ] + const.

Optimization is done by setting the partial derivatives of E to zero:
∂E/∂µ_m = ∑_i r_im (x_i - µ_m) / σ_m² = 0  ⇒  µ_m = ∑_i r_im x_i / ∑_i r_im,
∂E/∂σ_m = ∑_i r_im [ -1/σ_m + (x_i - µ_m)² / σ_m³ ] = 0  ⇒  σ_m² = ∑_i r_im (x_i - µ_m)² / ∑_i r_im,
∂E/∂π_m = ∑_i r_im / π_m + λ = 0  ⇒  π_m = (1/n) ∑_i r_im,
where λ is a Lagrange multiplier ensuring that the mixing proportions sum to unity.
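The closed-form updates above can be checked by confirming they sit at a maximum of E (a sketch; the data and responsibilities below are random, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=50)
r = rng.dirichlet(np.ones(3), size=50)      # responsibilities, rows sum to 1

# closed-form M-step updates from the zero-gradient conditions above
Nm = r.sum(axis=0)
mu = (r * x[:, None]).sum(axis=0) / Nm
var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nm
pi = Nm / len(x)

def E(mu_, var_, pi_):
    """Expected complete-data log likelihood, up to a constant."""
    return np.sum(r * (np.log(pi_) - 0.5 * np.log(var_)
                       - (x[:, None] - mu_) ** 2 / (2 * var_)))

base = E(mu, var, pi)
# perturbing any single mean away from its closed form can only lower E
for m in range(3):
    for d in (-0.05, 0.05):
        mu_p = mu.copy()
        mu_p[m] += d
        assert E(mu_p, var, pi) <= base
assert np.isclose(pi.sum(), 1.0)
```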

Factor Analysis

[Figure: graphical model with latent factors y_1 ... y_K above observed variables x_1, x_2, ..., x_D.]

Linear generative model: x_d = ∑_{k=1}^K Λ_dk y_k + ε_d
y_k are independent N(0, 1) Gaussian factors
ε_d are independent N(0, Ψ_dd) Gaussian noise
K < D

So x is Gaussian with:
p(x) = ∫ p(y) p(x|y) dy = N(0, ΛΛ^T + Ψ),
where Λ is a D × K matrix, and Ψ is diagonal.

Dimensionality reduction: finds a low-dimensional projection of high-dimensional data that captures the correlation structure of the data.
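The marginal covariance ΛΛ^T + Ψ can be checked by simulating the generative model (a sketch with arbitrary parameter values):

```python
import numpy as np

rng = np.random.default_rng(5)
D, K, N = 5, 2, 200_000
Lam = rng.normal(size=(D, K))               # factor loadings (D x K)
Psi = np.diag(rng.uniform(0.1, 0.5, D))     # diagonal sensor-noise covariance

y = rng.normal(size=(N, K))                 # y ~ N(0, I)
eps = rng.normal(size=(N, D)) @ np.sqrt(Psi)  # eps ~ N(0, Psi), Psi diagonal
X = y @ Lam.T + eps                         # x = Lambda y + eps

C_model = Lam @ Lam.T + Psi                 # predicted marginal covariance
C_sample = X.T @ X / N                      # empirical second moment (zero mean)
assert np.allclose(C_sample, C_model, atol=0.1)
```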

EM for Factor Analysis

The model for x:
p(x|θ) = ∫ p(y|θ) p(x|y, θ) dy = N(0, ΛΛ^T + Ψ).
Model parameters: θ = {Λ, Ψ}.

E step: For each data point x_n, compute the posterior distribution of the hidden factors given the observed data: q_n(y) = p(y|x_n, θ_t).

M step: Find the θ_{t+1} that maximises F(q, θ):
F(q, θ) = ∑_n ∫ q_n(y) [ log p(y|θ) + log p(x_n|y, θ) - log q_n(y) ] dy
= ∑_n ∫ q_n(y) [ log p(y|θ) + log p(x_n|y, θ) ] dy + c.

The E step for Factor Analysis

E step: For each data point x_n, compute the posterior distribution of hidden factors given the observed data: q_n(y) = p(y|x_n, θ) = p(y, x_n|θ) / p(x_n|θ).

Tactic: write p(y, x_n|θ), considering x_n to be fixed. What is this as a function of y?

p(y, x_n) = p(y) p(x_n|y)
= (2π)^{-K/2} exp{ -(1/2) y^T y } · |2πΨ|^{-1/2} exp{ -(1/2) (x_n - Λy)^T Ψ^{-1} (x_n - Λy) }
= c × exp{ -(1/2) [ y^T y + (x_n - Λy)^T Ψ^{-1} (x_n - Λy) ] }
= c' × exp{ -(1/2) [ y^T (I + Λ^T Ψ^{-1} Λ) y - 2 y^T Λ^T Ψ^{-1} x_n ] }
= c'' × exp{ -(1/2) [ y^T Σ^{-1} y - 2 y^T Σ^{-1} µ + µ^T Σ^{-1} µ ] }

So Σ = (I + Λ^T Ψ^{-1} Λ)^{-1} = I - βΛ and µ = ΣΛ^T Ψ^{-1} x_n = βx_n, where β = ΣΛ^T Ψ^{-1}. Note that µ is a linear function of x_n, and Σ does not depend on x_n.
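The posterior formulas can be cross-checked against standard Gaussian conditioning on the joint of (y, x), which must agree by the matrix inversion lemma (a sketch with arbitrary parameters):

```python
import numpy as np

rng = np.random.default_rng(6)
D, K = 4, 2
Lam = rng.normal(size=(D, K))
Psi = np.diag(rng.uniform(0.2, 1.0, D))
x_n = rng.normal(size=D)

# posterior from the completed square above
Sigma = np.linalg.inv(np.eye(K) + Lam.T @ np.linalg.inv(Psi) @ Lam)
beta = Sigma @ Lam.T @ np.linalg.inv(Psi)
mu = beta @ x_n

# the same posterior via Gaussian conditioning on the joint:
# Cov(y) = I, Cov(y, x) = Lam^T, Cov(x) = Lam Lam^T + Psi
C = Lam @ Lam.T + Psi
beta2 = Lam.T @ np.linalg.inv(C)            # Cov(y,x) Cov(x)^{-1}
Sigma2 = np.eye(K) - beta2 @ Lam            # I - Cov(y,x) Cov(x)^{-1} Cov(x,y)
mu2 = beta2 @ x_n

assert np.allclose(Sigma, Sigma2)
assert np.allclose(mu, mu2)
assert np.allclose(Sigma, np.eye(K) - beta @ Lam)   # Sigma = I - beta Lam
```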

The M step for Factor Analysis

M step: Find θ_{t+1} maximising F = ∑_n ∫ q_n(y) [ log p(y|θ) + log p(x_n|y, θ) ] dy + c.

log p(y|θ) + log p(x_n|y, θ) = c - (1/2) y^T y - (1/2) log|Ψ| - (1/2) (x_n - Λy)^T Ψ^{-1} (x_n - Λy)

Taking expectations over q_n(y)...
= c' - (1/2) log|Ψ| - (1/2) [ x_n^T Ψ^{-1} x_n - 2 x_n^T Ψ^{-1} Λ⟨y⟩ + ⟨y^T Λ^T Ψ^{-1} Λ y⟩ ]
= c' - (1/2) log|Ψ| - (1/2) [ x_n^T Ψ^{-1} x_n - 2 x_n^T Ψ^{-1} Λ⟨y⟩ + Tr[ Λ^T Ψ^{-1} Λ ⟨yy^T⟩ ] ]
= c' - (1/2) log|Ψ| - (1/2) [ x_n^T Ψ^{-1} x_n - 2 x_n^T Ψ^{-1} Λµ_n + Tr[ Λ^T Ψ^{-1} Λ (µ_n µ_n^T + Σ) ] ]

Note that we don't need to know everything about q, just the expectations of y and yy^T under q (i.e. the expected sufficient statistics).

The M step for Factor Analysis (cont.)

F = c - (N/2) log|Ψ| - (1/2) ∑_n [ x_n^T Ψ^{-1} x_n - 2 x_n^T Ψ^{-1} Λµ_n + Tr[ Λ^T Ψ^{-1} Λ (µ_n µ_n^T + Σ) ] ]

Taking derivatives w.r.t. Λ and Ψ^{-1}, using ∂Tr[AB]/∂B = A^T and ∂log|A|/∂A = A^{-T}:

∂F/∂Λ = Ψ^{-1} ∑_n x_n µ_n^T - Ψ^{-1} Λ (NΣ + ∑_n µ_n µ_n^T) = 0
⇒ Λ̂ = ( ∑_n x_n µ_n^T ) ( NΣ + ∑_n µ_n µ_n^T )^{-1}

∂F/∂Ψ^{-1} = (N/2) Ψ - (1/2) ∑_n [ x_n x_n^T - Λµ_n x_n^T - x_n µ_n^T Λ^T + Λ(µ_n µ_n^T + Σ)Λ^T ]
⇒ Ψ̂ = (1/N) ∑_n [ x_n x_n^T - Λµ_n x_n^T - x_n µ_n^T Λ^T + Λ(µ_n µ_n^T + Σ)Λ^T ]
⇒ Ψ̂ = ΛΣΛ^T + (1/N) ∑_n (x_n - Λµ_n)(x_n - Λµ_n)^T    (squared residuals)

Note: we should actually only take derivatives w.r.t. Ψ_dd, since Ψ is diagonal.
When Σ → 0 these become the equations for linear regression!

Mixtures of Factor Analysers

Simultaneous clustering and dimensionality reduction.
p(x|θ) = ∑_k π_k N(µ_k, Λ_k Λ_k^T + Ψ),
where π_k is the mixing proportion for FA k, µ_k is its centre, Λ_k is its factor loading matrix, and Ψ is a common sensor noise model. θ = {{π_k, µ_k, Λ_k}_{k=1...K}, Ψ}.

We can think of this model as having two sets of hidden latent variables:
A discrete indicator variable s_n ∈ {1, ..., K}.
For each factor analyzer, a continuous factor vector y_{n,k} ∈ R^{D_k}.

p(x_n|θ) = ∑_{s_n=1}^K p(s_n|θ) ∫ p(y|s_n, θ) p(x_n|y, s_n, θ) dy

As before, an EM algorithm can be derived for this model:
E step: Infer the joint distribution of the latent variables, p(y_n, s_n|x_n, θ).
M step: Maximize F with respect to θ.

EM for exponential families

Defn: p is in the exponential family for z = (y, x) if it can be written:
p(z|θ) = b(z) exp{θ^T s(z)} / α(θ),
where α(θ) = ∫ b(z) exp{θ^T s(z)} dz.

E step: q(y) = p(y|x, θ).
M step: θ^(k) := argmax_θ F(q, θ), where
F(q, θ) = ∫ q(y) log p(y, x|θ) dy + H(q)
= ∫ q(y) [ θ^T s(z) - log α(θ) ] dy + const.

It is easy to verify that:
∂log α(θ)/∂θ = E[s(z)|θ].
Therefore, the M step solves:
∂F/∂θ = E_{q(y)}[s(z)] - E[s(z)|θ] = 0.
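For a concrete instance, take the Bernoulli family with s(z) = z: the M step just moment-matches the expected sufficient statistic (the numbers below are illustrative):

```python
import numpy as np

# Bernoulli as an exponential family: p(z|theta) = exp(theta*z) / (1 + e^theta),
# so s(z) = z and E[s(z)|theta] = sigmoid(theta).
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# suppose the E step produced a q with expected sufficient statistic:
Eq_s = 0.73
# moment matching: pick theta with E[s|theta] = Eq_s, i.e. theta = logit(Eq_s)
theta = np.log(Eq_s / (1 - Eq_s))

assert np.isclose(sigmoid(theta), Eq_s)          # E[s(z)|theta] = E_q[s(z)]
assert np.isclose(Eq_s - sigmoid(theta), 0.0)    # dF/dtheta = 0 at the solution
```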

References

A. P. Dempster, N. M. Laird and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38. http://www.jstor.org/stable/2984875

R. M. Neal and G. E. Hinton (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (ed.), Learning in Graphical Models, pp. 355-368. Dordrecht: Kluwer Academic Publishers. http://www.cs.utoronto.ca/~radford/ftp/emk.pdf

Z. Ghahramani and G. E. Hinton (1996). The EM algorithm for mixtures of factor analyzers. University of Toronto Technical Report CRG-TR-96-1. http://learning.eng.cam.ac.uk/zoubin/papers/tr-96-1.pdf

Failure Modes of EM

EM can fail in a number of degenerate situations:
EM may converge to a bad local maximum.
The likelihood function may not be bounded above. E.g. a cluster responsible for a single data item can give arbitrarily large likelihood as its variance σ_m → 0.
The free energy may not be well defined (or may be infinite).

Proof of the Matrix Inversion Lemma

(A + XBX^T)^{-1} = A^{-1} - A^{-1}X(B^{-1} + X^T A^{-1} X)^{-1} X^T A^{-1}

Need to prove:
( A^{-1} - A^{-1}X(B^{-1} + X^T A^{-1}X)^{-1}X^T A^{-1} )(A + XBX^T) = I

Expand:
I + A^{-1}XBX^T - A^{-1}X(B^{-1} + X^T A^{-1}X)^{-1}X^T - A^{-1}X(B^{-1} + X^T A^{-1}X)^{-1}X^T A^{-1}XBX^T

Regroup:
= I + A^{-1}X ( BX^T - (B^{-1} + X^T A^{-1}X)^{-1}X^T - (B^{-1} + X^T A^{-1}X)^{-1}X^T A^{-1}XBX^T )
= I + A^{-1}X ( BX^T - (B^{-1} + X^T A^{-1}X)^{-1}B^{-1}BX^T - (B^{-1} + X^T A^{-1}X)^{-1}X^T A^{-1}XBX^T )
= I + A^{-1}X ( BX^T - (B^{-1} + X^T A^{-1}X)^{-1}(B^{-1} + X^T A^{-1}X)BX^T )
= I + A^{-1}X ( BX^T - BX^T ) = I
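The lemma is also easy to verify numerically for random matrices (illustrative sizes only):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 5, 2
A = np.diag(rng.uniform(1.0, 2.0, n))      # any invertible A (diagonal here)
X = rng.normal(size=(n, k))
B = np.eye(k) * rng.uniform(0.5, 1.5)      # invertible B

lhs = np.linalg.inv(A + X @ B @ X.T)
Ainv = np.linalg.inv(A)
rhs = Ainv - Ainv @ X @ np.linalg.inv(np.linalg.inv(B) + X.T @ Ainv @ X) @ X.T @ Ainv
assert np.allclose(lhs, rhs)
```

This is the identity used in the factor-analysis E step, where it relates (I + Λ^T Ψ^{-1} Λ)^{-1} to the marginal covariance ΛΛ^T + Ψ.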
