Probabilistic and Bayesian Machine Learning


Probabilistic and Bayesian Machine Learning Day 3: The EM Algorithm Yee Whye Teh ywteh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London http://www.gatsby.ucl.ac.uk/~ywteh/teaching/probmodels

The Expectation Maximisation (EM) algorithm
The EM algorithm finds a (local) maximum of the likelihood P(X|θ) of a latent variable model. It starts from arbitrary values of the parameters, and iterates two steps:
E step: Fill in values of latent variables according to the posterior given the data.
M step: Maximise the likelihood as if the latent variables were not hidden.
Useful in models where learning would be easy if the unobserved variables were, in fact, observed (e.g. MoGs). Decomposes difficult problems into a series of tractable steps. Requires no gradients or learning rates. The framework lends itself to principled approximations.
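For the MoG case mentioned above, the two steps are straightforward to write down. Below is a minimal, illustrative sketch of EM for a 1-D mixture of Gaussians (not the lecture's code; the function and variable names are my own):

```python
import numpy as np

def em_mog_1d(x, K=2, iters=50, seed=0):
    """Minimal EM for a 1-D mixture of Gaussians (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    pi = np.full(K, 1.0 / K)                 # mixing weights
    mu = rng.choice(x, K, replace=False)     # initial means drawn from the data
    var = np.full(K, np.var(x))              # initial variances
    ll_trace = []
    for _ in range(iters):
        # E step: responsibilities r[n, k] = P(y_n = k | x_n, theta)
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        ll = np.logaddexp.reduce(log_p, axis=1)   # per-point log likelihood
        r = np.exp(log_p - ll[:, None])
        ll_trace.append(ll.sum())
        # M step: maximise the expected complete-data log likelihood
        Nk = r.sum(axis=0)
        pi = Nk / n
        mu = r.T @ x / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var, ll_trace
```

Because each iteration performs an exact E step and M step, the recorded log likelihood trace is non-decreasing, which is a useful sanity check on any EM implementation.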

Jensen's Inequality
For the concave function log:
log(α x_1 + (1−α) x_2) ≥ α log(x_1) + (1−α) log(x_2), for α ∈ [0, 1] and x_1, x_2 > 0.
More generally, for α_i ≥ 0 with Σ_i α_i = 1 and any {x_i > 0}:
log(Σ_i α_i x_i) ≥ Σ_i α_i log(x_i)
Equality if and only if α_i = 1 for some i (and therefore all others are 0).
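The multi-point form can be checked numerically; a quick illustrative test (my own, not from the slides):

```python
import math
import random

# Numerical check of Jensen's inequality for the concave log:
# log(sum_i a_i x_i) >= sum_i a_i log(x_i) for a convex combination a.
random.seed(0)
for _ in range(1000):
    x = [random.uniform(0.1, 10.0) for _ in range(5)]
    a = [random.random() for _ in range(5)]
    s = sum(a)
    a = [ai / s for ai in a]                      # alpha_i >= 0, summing to 1
    lhs = math.log(sum(ai * xi for ai, xi in zip(a, x)))
    rhs = sum(ai * math.log(xi) for ai, xi in zip(a, x))
    assert lhs >= rhs - 1e-12
```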

The Free Energy for a Latent Variable Model
Observed data X = {X_i}; Latent variables Y = {Y_i}; Parameters θ.
Goal: Maximize the log likelihood (i.e. ML learning) wrt θ:
l(θ) = log P(X|θ) = log ∫ P(X, Y|θ) dY
Any distribution q(Y) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:
l(θ) = log ∫ q(Y) [P(X, Y|θ) / q(Y)] dY ≥ ∫ q(Y) log [P(X, Y|θ) / q(Y)] dY =: F(q, θ)
Now,
∫ q(Y) log [P(X, Y|θ) / q(Y)] dY = ∫ q(Y) log P(X, Y|θ) dY − ∫ q(Y) log q(Y) dY = ∫ q(Y) log P(X, Y|θ) dY + H[q],
where H[q] is the entropy of q(Y). So:
F(q, θ) = ⟨log P(X, Y|θ)⟩_q(Y) + H[q]


The E and M steps of EM
The lower bound on the log likelihood is given by:
F(q, θ) = ⟨log P(X, Y|θ)⟩_q(Y) + H[q]
EM alternates between:
E step: optimize F(q, θ) wrt the distribution over hidden variables, holding the parameters fixed:
q^(k)(Y) := argmax_q F(q(Y), θ^(k−1))
M step: maximize F(q, θ) wrt the parameters, holding the hidden distribution fixed:
θ^(k) := argmax_θ F(q^(k)(Y), θ) = argmax_θ ⟨log P(X, Y|θ)⟩_{q^(k)(Y)}
The second equality comes from the fact that the entropy of q(Y) does not depend directly on θ.


EM as Coordinate Ascent in F
[Figure: contours of the free energy F over the responsibility r (vertical axis, 0 to 1) and the mixing weight π (horizontal axis, 0 to 1).]

The E Step
The free energy can be re-written:
F(q, θ) = ∫ q(Y) log [P(X, Y|θ) / q(Y)] dY
= ∫ q(Y) log [P(Y|X, θ) P(X|θ) / q(Y)] dY
= log P(X|θ) ∫ q(Y) dY + ∫ q(Y) log [P(Y|X, θ) / q(Y)] dY
= l(θ) − KL[q(Y) ‖ P(Y|X, θ)]
The second term is the Kullback-Leibler divergence.
This means that, for fixed θ, F is bounded above by l, and achieves that bound when KL[q(Y) ‖ P(Y|X, θ)] = 0. But KL[q ‖ p] is zero if and only if q = p. So, the E step simply sets
q^(k)(Y) = P(Y|X, θ^(k−1))
and, after an E step, the free energy equals the likelihood.
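The decomposition F(q, θ) = l(θ) − KL[q(Y) ‖ P(Y|X, θ)] is easy to verify numerically for a binary latent variable. The following toy check (my own construction, using an arbitrary joint table) confirms the identity, the bound, and that the posterior attains it:

```python
import math

# Toy model with a binary latent y and a fixed observed x.
P_joint = {0: 0.12, 1: 0.28}                    # P(Y = y, X = x_obs | theta)
l = math.log(sum(P_joint.values()))             # l = log P(X | theta)
post = {y: p / sum(P_joint.values()) for y, p in P_joint.items()}

def free_energy(q):
    # F(q) = sum_y q(y) log [P(X, y) / q(y)]
    return sum(q[y] * math.log(P_joint[y] / q[y]) for y in q)

def kl(q, p):
    return sum(q[y] * math.log(q[y] / p[y]) for y in q)

for r in (0.1, 0.3, 0.7, post[1]):              # several q's, incl. the posterior
    q = {0: 1 - r, 1: r}
    assert abs(free_energy(q) - (l - kl(q, post))) < 1e-12   # F = l - KL
    assert free_energy(q) <= l + 1e-12                       # F lower-bounds l
assert abs(free_energy(post) - l) < 1e-12       # the E step attains the bound
```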


Coordinate Ascent in F (Demo)
One parameter mixture: s ~ Bernoulli[π], x | s = 0 ~ N[−1, 1], x | s = 1 ~ N[1, 1], and one data point x_1 = 0.3. q(s) is a distribution on a single binary latent variable, and so is represented by r = q(s = 1) ∈ [0, 1].
[Figure: the two component densities N[−1, 1] and N[1, 1], with the data point x_1 = 0.3 marked.]
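The demo can be reproduced in a few lines. This is an illustrative re-implementation (not the original demo code): the E step sets r to the posterior responsibility, and with a single data point the M step sets π to r. Since x_1 = 0.3 has higher density under the s = 1 component, the iterates drive π toward 1 here.

```python
import math

# One-parameter, one-data-point mixture from the demo:
# s ~ Bernoulli(pi), x|s=0 ~ N(-1, 1), x|s=1 ~ N(1, 1), x_1 = 0.3.
x1 = 0.3

def gauss(x, mu):
    # Unit-variance Gaussian density
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

pi = 0.05                                        # arbitrary starting point
for _ in range(100):
    # E step: r = q(s = 1) is the posterior responsibility
    p1, p0 = pi * gauss(x1, 1.0), (1 - pi) * gauss(x1, -1.0)
    r = p1 / (p1 + p0)
    # M step: with a single data point, the optimal mixing weight is r itself
    pi = r

loglik = math.log(pi * gauss(x1, 1.0) + (1 - pi) * gauss(x1, -1.0))
```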

Coordinate Ascent in F (Demo)
[Figure: EM's coordinate-ascent path over the responsibility r (vertical axis, 0 to 1) and the mixing weight π (horizontal axis, 0 to 1).]

EM for Learning HMMs
[Graphical model: a Markov chain of latent states Y_1 → Y_2 → ⋯ → Y_τ, each Y_t emitting an observation X_t.]
Parameters: θ = {π, T, A}
Free energy: F(q, θ) = Σ_{Y_1:τ} q(Y_1:τ) (log P(X_1:τ, Y_1:τ|θ) − log q(Y_1:τ))
E-step: Maximise F wrt q with θ fixed: q*(Y_1:τ) = P(Y_1:τ|X_1:τ, θ). We will only need the marginal probabilities q*(Y_t) and q*(Y_t, Y_{t+1}), which can be obtained from the forward-backward algorithm.
M-step: Maximise F wrt θ with q fixed. We can re-estimate the parameters by computing the expected number of times the HMM was in state i, emitted symbol k, and transitioned to state j.
This is the Baum-Welch algorithm, and it predates the (more general) EM algorithm.

M step: Parameter updates are given by ratios of expected counts
We can derive the following updates by taking derivatives of F wrt θ. Let the posterior marginals be:
γ_t(i) = P(Y_t = i | X_1:τ) ∝ α_t(i) β_t(i)
ξ_t(ij) = P(Y_t = i, Y_{t+1} = j | X_1:τ) ∝ α_t(i) P(Y_{t+1} = j | Y_t = i) P(X_{t+1} | Y_{t+1} = j) β_{t+1}(j)
The initial state distribution is the expected number of times in state i at t = 1:
π̂_i = γ_1(i)
The estimated transition probabilities are:
T̂_ij = Σ_{t=1}^{τ−1} ξ_t(ij) / Σ_{t=1}^{τ−1} γ_t(i)
The output distributions are the expected number of times we observe a particular symbol in a particular state:
Â_ik = Σ_{t: X_t = k} γ_t(i) / Σ_{t=1}^{τ} γ_t(i)
(or the state-probability-weighted sufficient statistics for exponential family observation models).
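Putting the E step (forward-backward) and these count-ratio M-step updates together gives one Baum-Welch iteration. The following is an illustrative sketch for a discrete-output HMM (my own code, using standard per-step normalisation; the names are not from the lecture):

```python
import numpy as np

def baum_welch_step(X, pi, T, A):
    """One EM (Baum-Welch) iteration for a discrete-output HMM.
    X: observation sequence (ints); pi: initial distribution (K,);
    T: transitions (K, K), T[i, j] = P(Y_{t+1}=j | Y_t=i);
    A: emissions (K, M), A[i, k] = P(X_t=k | Y_t=i)."""
    tau, K = len(X), len(pi)
    # Forward-backward with per-step normalisation constants c[t]
    alpha = np.zeros((tau, K)); beta = np.zeros((tau, K)); c = np.zeros(tau)
    alpha[0] = pi * A[:, X[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, tau):
        alpha[t] = (alpha[t - 1] @ T) * A[:, X[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(tau - 2, -1, -1):
        beta[t] = T @ (A[:, X[t + 1]] * beta[t + 1]) / c[t + 1]
    gamma = alpha * beta                       # gamma[t, i] = P(Y_t=i | X)
    xi = np.zeros((tau - 1, K, K))             # xi[t, i, j] = P(Y_t=i, Y_{t+1}=j | X)
    for t in range(tau - 1):
        xi[t] = alpha[t][:, None] * T * (A[:, X[t + 1]] * beta[t + 1])[None, :] / c[t + 1]
    # M step: ratios of expected counts
    pi_new = gamma[0]
    T_new = xi.sum(0) / gamma[:-1].sum(0)[:, None]
    A_new = np.zeros_like(A)
    for k in range(A.shape[1]):
        A_new[:, k] = gamma[np.array(X) == k].sum(0)
    A_new /= gamma.sum(0)[:, None]
    return pi_new, T_new, A_new, np.log(c).sum()   # last value = log P(X | theta)
```

Iterating the function also returns log P(X|θ) for the input parameters at each step, which should never decrease.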

EM Never Decreases the Likelihood
The E and M steps together never decrease the log likelihood:
l(θ^(k−1)) = F(q^(k), θ^(k−1)) ≤ F(q^(k), θ^(k)) ≤ l(θ^(k))
(the equality holds by the E step, the first inequality by the M step, and the second by Jensen)
The E step brings the free energy to the likelihood.
The M-step maximises the free energy wrt θ.
F ≤ l by Jensen, or, equivalently, from the non-negativity of KL.
If the M-step is executed so that θ^(k) ≠ θ^(k−1) iff F increases, then the overall EM iteration will step to a new value of θ iff the likelihood increases.


Fixed Points of EM are Stationary Points in l
Let a fixed point of EM occur with parameter θ*. Then:
∂/∂θ ⟨log P(Y, X|θ)⟩_{P(Y|X,θ*)} |_{θ*} = 0
Now,
l(θ) = log P(X|θ) = ⟨log P(X|θ)⟩_{P(Y|X,θ*)} = ⟨log P(Y, X|θ) − log P(Y|X, θ)⟩_{P(Y|X,θ*)} = ⟨log P(Y, X|θ)⟩_{P(Y|X,θ*)} − ⟨log P(Y|X, θ)⟩_{P(Y|X,θ*)}
so,
d/dθ l(θ) = d/dθ ⟨log P(Y, X|θ)⟩_{P(Y|X,θ*)} − d/dθ ⟨log P(Y|X, θ)⟩_{P(Y|X,θ*)}
The second term is 0 at θ* if the derivative exists (θ* is a minimum of KL[P(Y|X, θ*) ‖ P(Y|X, θ)]), and thus:
d/dθ l(θ) |_{θ*} = d/dθ ⟨log P(Y, X|θ)⟩_{P(Y|X,θ*)} |_{θ*} = 0
So, EM converges to a stationary point of l(θ).


Maxima in F correspond to maxima in l
Let θ* now be the parameter value at a local maximum of F (and thus at a fixed point).
Differentiating the previous expression wrt θ again we find:
d²/dθ² l(θ) = d²/dθ² ⟨log P(Y, X|θ)⟩_{P(Y|X,θ*)} − d²/dθ² ⟨log P(Y|X, θ)⟩_{P(Y|X,θ*)}
The first term on the right is negative (θ* is a maximum) and the second term is positive (θ* is a minimum of the KL divergence). Thus the curvature of the likelihood is negative, and θ* is a maximum of l.
[... as long as the derivatives exist. They sometimes don't (zero-noise ICA).]


Partial M steps and Partial E steps
Partial M steps: The proof holds even if we just increase F wrt θ rather than maximising it. (Dempster, Laird and Rubin (1977) call this the generalised EM, or GEM, algorithm.)
Partial E steps: We can also just increase F wrt some of the q's. For example, sparse or online versions of the EM algorithm would compute the posterior for a subset of the data points, or as the data arrives, respectively. We can also update the posterior over a subset of the hidden variables, while holding the others fixed.

Failure Modes of EM
EM can fail under a number of degenerate situations:
EM may converge to a bad local maximum.
The likelihood function may not be bounded above. E.g. a cluster responsible for a single data item can give arbitrarily large likelihood as its variance σ_m → 0.
The free energy may not be well defined (or may be infinite).

EM for Exponential Families
Defn: P is in the exponential family for Y, X if it can be written:
P(Y, X|θ) = h(Y, X) exp{θᵀ T(Y, X)} / Z(θ)
where Z(θ) = ∫ h(Y, X) exp{θᵀ T(Y, X)} d(Y, X)
E step: q(Y) = P(Y|X, θ)
M step: θ^(k) := argmax_θ F(q, θ), where
F(q, θ) = ∫ q(Y) log P(Y, X|θ) dY + H[q] = ∫ q(Y) [θᵀ T(Y, X) − log Z(θ)] dY + const
It is easy to verify that:
∂ log Z(θ)/∂θ = E_{P(Y,X|θ)}[T(Y, X)]
Therefore, the M step solves:
∂F/∂θ = E_{q(Y)}[T(Y, X)] − E_{P(Y,X|θ)}[T(Y, X)] = 0

The Central Role of the Partition Function
The partition function Z(θ) of exponential families plays an important role in inference and learning in such models.
Undirected graphical models are exponential families if each factor in the model has an exponential family form:
f_i(Y_Ci, X_Ci) = h_i(Y_Ci, X_Ci) exp{θ_iᵀ T_i(Y_Ci, X_Ci)}
P(Y, X|θ) = (1/Z(θ)) ∏_i h_i(Y_Ci, X_Ci) exp{θ_iᵀ T_i(Y_Ci, X_Ci)}
Likelihoods P(X|θ) are basically partition functions of undirected graphical models.
Derivatives give the sufficient statistics of the model:
∂ log Z(θ)/∂θ = μ = E_{P(Y,X|θ)}[T(Y, X)]
Second derivatives give the covariance of the sufficient statistics:
∂² log Z(θ)/∂θ∂θᵀ = E_{P(Y,X|θ)}[(T(Y, X) − μ)(T(Y, X) − μ)ᵀ]
Higher order derivatives give all cumulants, so log Z(θ) is the cumulant generating function of the exponential family distribution.
Many approximate inference techniques are based on approximating log Z(θ).
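For a Bernoulli variable written in natural form, P(x|θ) = exp(θx)/Z(θ) with T(x) = x and Z(θ) = 1 + e^θ, the first two cumulant relations can be checked by finite differences (an illustrative case of my own, not from the slides):

```python
import math

# Bernoulli in natural-parameter form: Z(theta) = 1 + exp(theta), T(x) = x.
def log_z(theta):
    return math.log(1.0 + math.exp(theta))

theta, eps = 0.7, 1e-4
mu = math.exp(theta) / (1.0 + math.exp(theta))   # E[T(x)], the mean
var = mu * (1.0 - mu)                            # Var[T(x)]
d1 = (log_z(theta + eps) - log_z(theta - eps)) / (2 * eps)
d2 = (log_z(theta + eps) - 2 * log_z(theta) + log_z(theta - eps)) / eps ** 2
assert abs(d1 - mu) < 1e-7     # d log Z / d theta = E[T]
assert abs(d2 - var) < 1e-6    # d^2 log Z / d theta^2 = Var[T]
```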

End Notes
A. P. Dempster, N. M. Laird and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38. http://www.jstor.org/stable/2984875
R. M. Neal and G. E. Hinton (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (ed.), Learning in Graphical Models, pp. 355-368. Dordrecht: Kluwer Academic Publishers. http://www.cs.utoronto.ca/~radford/ftp/emk.pdf
Z. Ghahramani and G. E. Hinton (1996). The EM algorithm for mixtures of factor analyzers. University of Toronto Technical Report CRG-TR-96-1. http://learning.eng.cam.ac.uk/zoubin/papers/tr-96-1.pdf
