Machine Learning. Lecture 5: Gaussian Discriminant Analysis, Naive Bayes and EM Algorithm. Feng Li.


Machine Learning
Lecture 5: Gaussian Discriminant Analysis, Naive Bayes and EM Algorithm
Feng Li
fli@sdu.edu.cn
https://funglee.github.io
School of Computer Science and Technology, Shandong University
Fall 2018

Probability Theory Review
- Sample space, events and probability
- Conditional probability
- Random variables and probability distributions
- Joint probability distribution
- Independence
- Conditional probability distribution
- Bayes' Theorem

Sample Space, Events and Probability
- A sample space $S$ is the set of all possible outcomes of a (conceptual or physical) random experiment
- An event $A$ is a subset of the sample space $S$
- $P(A)$ is the probability that event $A$ happens; it is a function that maps the event $A$ onto the interval $[0, 1]$. $P(A)$ is also called the probability measure of $A$
- Kolmogorov axioms:
  - Non-negativity: $P(A) \ge 0$ for each event $A$
  - $P(S) = 1$
  - $\sigma$-additivity: for disjoint events $\{A_i\}_i$ such that $A_i \cap A_j = \emptyset$ for $i \ne j$, $P(\bigcup_i A_i) = \sum_i P(A_i)$

Sample Space, Events and Probability (Contd.)
Some consequences:
- $P(\emptyset) = 0$
- $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
- $P(A') = 1 - P(A)$

Conditional Probability
- Definition of conditional probability: the fraction of worlds in which event $A$ is true given that event $B$ is true
  $P(A \mid B) = \frac{P(A, B)}{P(B)}, \quad P(A, B) = P(A \mid B)\, P(B)$
- Corollary: the chain rule
  $P(A_1, A_2, \ldots, A_n) = \prod_{k=1}^{n} P(A_k \mid A_1, A_2, \ldots, A_{k-1})$
- Example: $P(A_4, A_3, A_2, A_1) = P(A_4 \mid A_3, A_2, A_1)\, P(A_3 \mid A_2, A_1)\, P(A_2 \mid A_1)\, P(A_1)$

Conditional Probability (Contd.)
- A real-valued random variable is a function of the outcome of a randomized experiment, $X : S \rightarrow \mathbb{R}$
- Examples of discrete random variables ($S$ is discrete):
  - $X(s) = \text{True}$ if a randomly drawn person $s$ from our class $S$ is female
  - $X(s) =$ the hometown of a randomly drawn person $s$ from $S$
- Example of a continuous random variable ($S$ is continuous):
  - $X(s) = r$, the heart rate of a randomly drawn person $s$ in our class $S$

Random Variables
- A real-valued random variable is a function of the outcome of a randomized experiment, $X : S \rightarrow \mathbb{R}$
- For a continuous random variable $X$: $P(a < X < b) = P(\{s \in S : a < X(s) < b\})$
- For a discrete random variable $X$: $P(X = x) = P(\{s \in S : X(s) = x\})$

Probability Distribution
Probability distribution for discrete random variables:
- Suppose $X$ is a discrete random variable, $X : S \rightarrow A$
- Probability mass function (PMF) of $X$: the probability of $X = x$, i.e., $p_X(x) = P(X = x)$
- Since $\sum_{x \in A} P(X = x) = 1$, we have $\sum_{x \in A} p_X(x) = 1$

Probability Distribution (Contd.)
Probability distribution for continuous random variables:
- Suppose $X$ is a continuous random variable, $X : S \rightarrow A$
- The probability density function (PDF) of $X$ is a function $f_X(x)$ such that for $a, b \in A$ with $a \le b$,
  $P(a \le X \le b) = \int_a^b f_X(x)\,dx$

Joint Probability Distribution
- Suppose both $X$ and $Y$ are discrete random variables
- Joint probability mass function (PMF): $p_{X,Y}(x, y) = P(X = x, Y = y)$
- Marginal probability mass functions for discrete random variables:
  $p_X(x) = \sum_y P(X = x, Y = y) = \sum_y P(X = x \mid Y = y)\, P(Y = y)$
  $p_Y(y) = \sum_x P(X = x, Y = y) = \sum_x P(Y = y \mid X = x)\, P(X = x)$
- Extension to multiple random variables $X_1, X_2, X_3, \ldots, X_n$:
  $p(x_1, x_2, \ldots, x_n) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)$

Joint Probability Distribution (Contd.)
- Suppose both $X$ and $Y$ are continuous random variables with joint probability density function (PDF) $f(x, y)$:
  $P(a_1 \le X \le b_1, a_2 \le Y \le b_2) = \int_{a_1}^{b_1} \int_{a_2}^{b_2} f(x, y)\,dy\,dx$
- Marginal probability density functions:
  $f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$ for $-\infty < x < \infty$
  $f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx$ for $-\infty < y < \infty$
- Extension to more than two random variables:
  $P(a_1 \le X_1 \le b_1, \ldots, a_n \le X_n \le b_n) = \int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n$

Independent Random Variables
- Two discrete random variables $X$ and $Y$ are independent if for every pair $(x, y)$: $p_{X,Y}(x, y) = p_X(x)\, p_Y(y)$
- Two continuous random variables $X$ and $Y$ are independent if for every pair $(x, y)$: $f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$
- If the above equations do not hold for all $(x, y)$, then $X$ and $Y$ are said to be dependent

Conditional Probability Distribution
Discrete random variables $X$ and $Y$:
- Joint PMF $p(x, y)$; marginal PMF $p_X(x) = \sum_y p(x, y)$
- The conditional probability mass function of $Y$ given $X = x$:
  $p_{Y \mid X}(y \mid x) = p_{Y \mid X = x}(y) = \frac{p(x, y)}{p_X(x)}$
- Conditional probability of $Y = y$ given $X = x$: $P(Y = y \mid X = x) = p_{Y \mid X}(y \mid x)$

Conditional Probability Distribution (Contd.)
Continuous random variables $X$ and $Y$:
- Joint PDF $f(x, y)$; marginal PDF $f_X(x) = \int_y f(x, y)\,dy$
- The conditional probability density function of $Y$ given $X = x$:
  $f_{Y \mid X}(y \mid x) = f_{Y \mid X = x}(y) = \frac{f(x, y)}{f_X(x)}$
- Probability of $a \le X \le b$ given $Y = y$:
  $P(a \le X \le b \mid Y = y) = \int_a^b f_{X \mid Y = y}(x)\,dx$

Conditional Probability Distribution (Contd.)
Continuous random variable $X$ and discrete random variable $Y$:
- Joint probability distribution: $P(a \le X \le b, Y = y) = P(a \le X \le b \mid Y = y)\, P(Y = y)$, where
  $P(a \le X \le b \mid Y = y) = \int_a^b f_{X \mid Y = y}(x)\,dx$ and $P(Y = y) = p_Y(y)$

Bayes Theorem
- Bayes' theorem (or Bayes' rule) describes the probability of an event, based on prior knowledge of conditions that might be related to the event:
  $P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$
- In the Bayesian interpretation, probability measures a degree of belief, and Bayes' theorem links the degree of belief in a proposition before and after accounting for evidence. For proposition $A$ and evidence $B$:
  - $P(A)$, the prior, is the initial degree of belief in $A$
  - $P(A \mid B)$, the posterior, is the degree of belief having accounted for $B$
- Another form:
  $P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B \mid A)\, P(A)}{P(B \mid A)\, P(A) + P(B \mid A')\, P(A')}$
  with $A'$ being the complement of $A$

Bayes Theorem (Contd.)
- $A$: you have the flu; $B$: you just coughed
- Assume: $P(A) = 0.05$, $P(B \mid A) = 0.8$, and $P(B \mid A') = 0.2$ ($A'$ denotes the complement of $A$)
- Question: $P(\text{flu} \mid \text{cough}) = P(A \mid B)$?
  $P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} = \frac{P(B \mid A)\, P(A)}{P(B \mid A)\, P(A) + P(B \mid A')\, P(A')} = \frac{0.8 \times 0.05}{0.8 \times 0.05 + 0.2 \times 0.95} \approx 0.17$
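A quick numerical check of the example above (a minimal sketch; the variable names are ours, the probabilities are the ones assumed on the slide):

```python
# Numerical check of the flu/cough example.
p_flu = 0.05                 # P(A): prior probability of having the flu
p_cough_given_flu = 0.8      # P(B | A)
p_cough_given_no_flu = 0.2   # P(B | A')

# Law of total probability for P(B), then Bayes' theorem for P(A | B)
p_cough = p_cough_given_flu * p_flu + p_cough_given_no_flu * (1 - p_flu)
p_flu_given_cough = p_cough_given_flu * p_flu / p_cough
print(round(p_flu_given_cough, 4))  # ~0.1739
```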

Bayes Theorem (Contd.)
Random variables $X$ and $Y$, both of which are discrete:
  $P(X = x \mid Y = y) = \frac{P(Y = y \mid X = x)\, P(X = x)}{P(Y = y)}$
Conditional PMF of $X$ given $Y = y$:
  $p_{X \mid Y = y}(x) = \frac{p_{Y \mid X = x}(y)\, p_X(x)}{p_Y(y)}$

Bayes Theorem (Contd.)
Continuous random variable $X$ and discrete random variable $Y$:
  $f_{X \mid Y = y}(x) = \frac{P(Y = y \mid X = x)\, f_X(x)}{P(Y = y)} = \frac{p_{Y \mid X = x}(y)\, f_X(x)}{p_Y(y)}$

Bayes Theorem (Contd.)
Discrete random variable $X$ and continuous random variable $Y$:
  $P(X = x \mid Y = y) = \frac{f_{Y \mid X = x}(y)\, P(X = x)}{f_Y(y)}$
In another form:
  $p_{X \mid Y = y}(x) = \frac{f_{Y \mid X = x}(y)\, p_X(x)}{f_Y(y)}$

Bayes Theorem (Contd.)
$X$ and $Y$ are both continuous:
  $f_{X \mid Y = y}(x) = \frac{f_{Y \mid X = x}(y)\, f_X(x)}{f_Y(y)}$

Prediction Based on Bayes Theorem
- $X$ is a random variable indicating the feature vector
- $Y$ is a random variable indicating the label
- We perform a trial to obtain a sample $x$ for test; what is $P(Y = y \mid X = x) = p(y \mid x)$?

Prediction Based on Bayes Theorem (Contd.)
- We compute $p(y \mid x)$ based on Bayes' Theorem:
  $p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}, \quad \forall y$
- We calculate $p(x \mid y)$ for all $x, y$ and $p(y)$ for all $y$ according to the given training data
- Fortunately, we do not have to calculate $p(x)$, because
  $\arg\max_y p(y \mid x) = \arg\max_y \frac{p(x \mid y)\, p(y)}{p(x)} = \arg\max_y p(x \mid y)\, p(y)$

Gaussian Distribution
- Gaussian distribution (normal distribution):
  $p(x; \mu, \sigma) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
  where $\mu$ is the mean and $\sigma^2$ is the variance
- Gaussian distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known
- Central limit theorem: the averages of samples of observations of random variables independently drawn from independent distributions converge in distribution to the normal, that is, they become normally distributed when the number of observations is sufficiently large
- Physical quantities that are expected to be the sum of many independent processes (such as measurement errors) often have distributions that are nearly normal

Multivariate Gaussian Distribution
- Multivariate normal distribution in $n$ dimensions, $\mathcal{N}(\mu, \Sigma)$:
  $p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$
  with mean vector $\mu \in \mathbb{R}^n$ and covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$
- Mahalanobis distance: $r^2 = (x - \mu)^T \Sigma^{-1} (x - \mu)$
- $\Sigma$ is symmetric and positive semidefinite, so $\Sigma = \Phi \Lambda \Phi^T$, where $\Phi$ is an orthonormal matrix whose columns are eigenvectors of $\Sigma$, and $\Lambda$ is a diagonal matrix whose diagonal elements are the corresponding eigenvalues
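As a small illustration, a direct NumPy evaluation of this density (a sketch; the function name `gaussian_pdf` and the example values are ours, not from the slides):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) evaluated at x, following the formula above."""
    n = mu.shape[0]
    diff = x - mu
    # Mahalanobis distance r^2 = (x - mu)^T Sigma^{-1} (x - mu)
    r2 = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * r2) / norm_const

# Example: standard 2-D Gaussian evaluated at its mean
print(gaussian_pdf(np.zeros(2), np.zeros(2), np.eye(2)))  # 1/(2*pi) ~ 0.159
```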

Multivariate Gaussian Distribution: A 2D Example
[Figures] From left to right: $\Sigma = I$, $\Sigma = 0.6I$, $\Sigma = 2I$
[Figures] From left to right: $\Sigma = I$, $\Sigma = \begin{bmatrix} 1 & 0.5 \\ 0.5 & 1 \end{bmatrix}$, $\Sigma = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1 \end{bmatrix}$

Multivariate Gaussian Distribution: A 2D Example
[Figures] From left to right: $\mu = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$, $\mu = \begin{bmatrix} -0.5 \\ 0 \end{bmatrix}$, $\mu = \begin{bmatrix} -1 \\ -1.5 \end{bmatrix}$

Gaussian Discriminant Analysis (Contd.)
- $Y \sim \text{Bernoulli}(\psi)$: $P(Y = 1) = \psi$, $P(Y = 0) = 1 - \psi$
- Probability mass function: $p(y) = \psi^y (1 - \psi)^{1 - y}$, $y = 0, 1$

Gaussian Discriminant Analysis (Contd.)
- $X \mid Y = 0 \sim \mathcal{N}(\mu_0, \Sigma)$
- Conditional probability density function of $X$ given $Y = 0$:
  $p_{X \mid Y = 0}(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0)\right)$
  or
  $p(x \mid y = 0) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0)\right)$

Gaussian Discriminant Analysis (Contd.)
- $X \mid Y = 1 \sim \mathcal{N}(\mu_1, \Sigma)$
- Conditional probability density function of $X$ given $Y = 1$:
  $p_{X \mid Y = 1}(x) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right)$
  or
  $p(x \mid y = 1) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right)$

Gaussian Discriminant Analysis (Contd.)
In summary, for $y = 0, 1$:
  $p(y) = \psi^y (1 - \psi)^{1 - y}$
  $p(x \mid y) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_y)^T \Sigma^{-1} (x - \mu_y)\right)$

Gaussian Discriminant Analysis (Contd.)
Given $m$ sample data, the log-likelihood is
  $\ell(\psi, \mu_0, \mu_1, \Sigma) = \sum_{i=1}^m \log p(x^{(i)}, y^{(i)}; \psi, \mu_0, \mu_1, \Sigma)$
  $= \sum_{i=1}^m \log \left[ p(x^{(i)} \mid y^{(i)}; \mu_0, \mu_1, \Sigma)\, p(y^{(i)}; \psi) \right]$
  $= \sum_{i=1}^m \left[ \log p(x^{(i)} \mid y^{(i)}; \mu_0, \mu_1, \Sigma) + \log p(y^{(i)}; \psi) \right]$

Gaussian Discriminant Analysis (Contd.)
The log-likelihood function is
  $\ell(\psi, \mu_0, \mu_1, \Sigma) = \sum_{i=1}^m \left[ \log p(x^{(i)} \mid y^{(i)}; \mu_0, \mu_1, \Sigma) + \log p(y^{(i)}; \psi) \right]$
For each training sample $(x^{(i)}, y^{(i)})$:
- If $y^{(i)} = 0$:
  $p(x^{(i)} \mid y^{(i)}; \mu_0, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x^{(i)} - \mu_0)^T \Sigma^{-1} (x^{(i)} - \mu_0)\right)$
- If $y^{(i)} = 1$:
  $p(x^{(i)} \mid y^{(i)}; \mu_1, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x^{(i)} - \mu_1)^T \Sigma^{-1} (x^{(i)} - \mu_1)\right)$

Gaussian Discriminant Analysis (Contd.)
The log-likelihood function is
  $\ell(\psi, \mu_0, \mu_1, \Sigma) = \sum_{i=1}^m \left[ \log p(x^{(i)} \mid y^{(i)}; \mu_0, \mu_1, \Sigma) + \log p(y^{(i)}; \psi) \right]$
For each training sample $(x^{(i)}, y^{(i)})$:
  $p(y^{(i)}; \psi) = \psi^{y^{(i)}} (1 - \psi)^{1 - y^{(i)}}$

Gaussian Discriminant Analysis (Contd.)
Maximize $\ell(\psi, \mu_0, \mu_1, \Sigma)$ by setting the derivatives to zero:
  $\frac{\partial}{\partial \psi} \ell(\psi, \mu_0, \mu_1, \Sigma) = 0$
  $\nabla_{\mu_0} \ell(\psi, \mu_0, \mu_1, \Sigma) = 0$
  $\nabla_{\mu_1} \ell(\psi, \mu_0, \mu_1, \Sigma) = 0$
  $\nabla_{\Sigma} \ell(\psi, \mu_0, \mu_1, \Sigma) = 0$

Gaussian Discriminant Analysis (Contd.)
Solutions:
  $\psi = \frac{1}{m} \sum_{i=1}^m 1\{y^{(i)} = 1\}$
  $\mu_0 = \frac{\sum_{i=1}^m 1\{y^{(i)} = 0\}\, x^{(i)}}{\sum_{i=1}^m 1\{y^{(i)} = 0\}}$
  $\mu_1 = \frac{\sum_{i=1}^m 1\{y^{(i)} = 1\}\, x^{(i)}}{\sum_{i=1}^m 1\{y^{(i)} = 1\}}$
  $\Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$
Proof (see Problem Set 2)
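A compact NumPy sketch of these closed-form estimates (the function name `fit_gda` and the toy data are ours, for illustration only):

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLE for GDA with shared covariance, as in the formulas above.
    X: (m, n) feature matrix; y: (m,) array of 0/1 labels."""
    m, n = X.shape
    psi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    # Shared covariance: average outer products of samples centered at their class mean
    mu_y = np.where((y == 1)[:, None], mu1, mu0)
    diff = X - mu_y
    Sigma = diff.T @ diff / m
    return psi, mu0, mu1, Sigma

# Tiny synthetic example
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
psi, mu0, mu1, Sigma = fit_gda(X, y)
print(psi, mu0, mu1)
```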

Gaussian Discriminant Analysis (Contd.)
Given a test data sample $x$, we can calculate
  $p(y = 1 \mid x) = \frac{p(x \mid y = 1)\, p(y = 1)}{p(x)} = \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 1)\, p(y = 1) + p(x \mid y = 0)\, p(y = 0)} = \frac{1}{1 + \frac{p(x \mid y = 0)\, p(y = 0)}{p(x \mid y = 1)\, p(y = 1)}}$

$\frac{p(x \mid y = 0)\, p(y = 0)}{p(x \mid y = 1)\, p(y = 1)} = \exp\left(-\frac{1}{2}(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) + \frac{1}{2}(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\right) \cdot \frac{1 - \psi}{\psi}$
$= \exp\left((\mu_0 - \mu_1)^T \Sigma^{-1} x + \frac{1}{2}\left(\mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0\right)\right) \cdot \exp\left(\log\left(\frac{1 - \psi}{\psi}\right)\right)$
$= \exp\left((\mu_0 - \mu_1)^T \Sigma^{-1} x + \frac{1}{2}\left(\mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0\right) + \log\left(\frac{1 - \psi}{\psi}\right)\right)$

Assume
  $x := \begin{bmatrix} x \\ 1 \end{bmatrix}, \quad \theta = \begin{bmatrix} \Sigma^{-1}(\mu_0 - \mu_1) \\ \frac{1}{2}\left(\mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0\right) + \log\left(\frac{1 - \psi}{\psi}\right) \end{bmatrix}$
We have
  $\frac{p(x \mid y = 0)\, p(y = 0)}{p(x \mid y = 1)\, p(y = 1)} = \exp\left((\mu_0 - \mu_1)^T \Sigma^{-1} x + \frac{1}{2}\left(\mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0\right) + \log\left(\frac{1 - \psi}{\psi}\right)\right) = \exp(\theta^T x)$
Then
  $p(y = 1 \mid x) = \frac{1}{1 + \frac{p(x \mid y = 0)\, p(y = 0)}{p(x \mid y = 1)\, p(y = 1)}} = \frac{1}{1 + \exp(\theta^T x)}$
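A small numeric check of this reformulation; the parameter values below are arbitrary illustrative choices, not from the slides:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    n = mu.shape[0]
    d = x - mu
    return np.exp(-0.5 * d @ np.linalg.solve(Sigma, d)) / (
        (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))

# Arbitrary GDA parameters (illustrative only)
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
psi = 0.4
x = np.array([0.5, 1.0])

# Direct Bayes computation of p(y = 1 | x)
p1 = gaussian_pdf(x, mu1, Sigma) * psi
p0 = gaussian_pdf(x, mu0, Sigma) * (1 - psi)
posterior = p1 / (p0 + p1)

# Logistic form 1 / (1 + exp(theta^T [x; 1])) from the slide
Sinv = np.linalg.inv(Sigma)
theta = np.concatenate([
    Sinv @ (mu0 - mu1),
    [0.5 * (mu1 @ Sinv @ mu1 - mu0 @ Sinv @ mu0) + np.log((1 - psi) / psi)],
])
sigmoid_form = 1.0 / (1.0 + np.exp(theta @ np.append(x, 1.0)))
print(np.isclose(posterior, sigmoid_form))  # True: both forms agree
```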

Similarly, we have
  $p(y = 0 \mid x) = \frac{p(x \mid y = 0)\, p(y = 0)}{p(x)} = \frac{p(x \mid y = 0)\, p(y = 0)}{p(x \mid y = 1)\, p(y = 1) + p(x \mid y = 0)\, p(y = 0)} = \frac{1}{1 + \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 0)\, p(y = 0)}}$
  $= \frac{1}{1 + \exp\left((\mu_1 - \mu_0)^T \Sigma^{-1} x + \frac{\mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1}{2} + \log\frac{\psi}{1 - \psi}\right)}$

GDA and Logistic Regression
- GDA can be reformulated as logistic regression
- Which one is better?
  - GDA makes stronger modeling assumptions and is more data-efficient (i.e., requires less training data to learn "well") when the modeling assumptions are correct or at least approximately correct
  - Logistic regression makes weaker assumptions and is significantly more robust to deviations from the modeling assumptions
  - When the data is indeed non-Gaussian, then in the limit of large datasets, logistic regression will almost always do better than GDA
  - In practice, logistic regression is used more often than GDA

Gaussian Discriminant Analysis (Contd.)
[Figure]

Naive Bayes
- Training data $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, m$
- $x^{(i)}$ is an $n$-dimensional vector; each feature $x_j^{(i)} \in \{0, 1\}$ ($j = 1, \ldots, n$)
- $y^{(i)} \in \{1, \ldots, k\}$
- The features and labels can be represented by random variables $\{X_j\}_{j = 1, \ldots, n}$ and $Y$, respectively

Naive Bayes (Contd.)
For $j \ne j'$, Naive Bayes assumes $X_j$ and $X_{j'}$ are conditionally independent given $Y$:
  $P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \mid Y = y)$
  $= \prod_{j=1}^n P(X_j = x_j \mid X_1 = x_1, X_2 = x_2, \ldots, X_{j-1} = x_{j-1}, Y = y)$
  $= \prod_{j=1}^n P(X_j = x_j \mid Y = y)$

Naive Bayes (Contd.)
The key assumption in the NB model:
  $P(Y = y, X_1 = x_1, \ldots, X_n = x_n) = P(X_1 = x_1, \ldots, X_n = x_n \mid Y = y)\, P(Y = y) = P(Y = y) \prod_{j=1}^n P(X_j = x_j \mid Y = y)$

The key assumption in the NB model:
  $P(Y = y, X_1 = x_1, \ldots, X_n = x_n) = P(Y = y) \prod_{j=1}^n P(X_j = x_j \mid Y = y) = p(y) \prod_{j=1}^n p_j(x_j \mid y)$
Two sets of parameters (denoted by $\Omega$):
- Probability mass function of $Y$: $p(y) = P(Y = y)$, where $y \in \{1, 2, \ldots, k\}$
- Conditional probability mass function of $X_j$ ($j \in \{1, 2, \ldots, n\}$) given $Y = y$ ($y \in \{1, \ldots, k\}$): $p_j(x \mid y) = P(X_j = x \mid Y = y)$, where $x \in \{0, 1\}$

Maximum-Likelihood Estimates for Naive Bayes
The log-likelihood function is
  $\ell(\Omega) = \log \prod_{i=1}^m p(x^{(i)}, y^{(i)}) = \sum_{i=1}^m \log p(x^{(i)}, y^{(i)})$
  $= \sum_{i=1}^m \log \left[ p(y^{(i)}) \prod_{j=1}^n p_j(x_j^{(i)} \mid y^{(i)}) \right]$
  $= \sum_{i=1}^m \left[ \log p(y^{(i)}) + \sum_{j=1}^n \log p_j(x_j^{(i)} \mid y^{(i)}) \right]$

MLE for Naive Bayes (Contd.)
Notation:
- The number of training samples whose label is $y$:
  $\text{count}(y) = \sum_{i=1}^m 1(y^{(i)} = y), \quad y = 1, 2, \ldots, k$
- The number of training samples with the $j$-th feature being $x$ and the label being $y$:
  $\text{count}_j(x \mid y) = \sum_{i=1}^m 1(y^{(i)} = y \wedge x_j^{(i)} = x), \quad y = 1, 2, \ldots, k,\; x = 0, 1$

$\sum_{i=1}^m \log p(y^{(i)}) = \sum_{i: y^{(i)} = 1} \log p(y = 1) + \cdots + \sum_{i: y^{(i)} = k} \log p(y = k)$
$= \text{count}(y = 1) \log p(y = 1) + \cdots + \text{count}(y = k) \log p(y = k)$
$= \sum_{y=1}^k \text{count}(y) \log p(y)$

$\sum_{i=1}^m \sum_{j=1}^n \log p_j(x_j^{(i)} \mid y^{(i)}) = \sum_{j=1}^n \sum_{i: y^{(i)} = 1} \log p_j(x_j^{(i)} \mid y = 1) + \cdots + \sum_{j=1}^n \sum_{i: y^{(i)} = k} \log p_j(x_j^{(i)} \mid y = k)$

Define $D_{x \mid y}^{(j)} = \{i : x_j^{(i)} = x \wedge y^{(i)} = y\}$. Then, we have
  $\sum_{i=1}^m \sum_{j=1}^n \log p_j(x_j^{(i)} \mid y^{(i)}) = \sum_{j=1}^n \sum_{i: y^{(i)} = 1} \log p_j(x_j^{(i)} \mid y = 1) + \cdots + \sum_{j=1}^n \sum_{i: y^{(i)} = k} \log p_j(x_j^{(i)} \mid y = k)$
  $= \sum_{j=1}^n \left[ \sum_{i \in D_{0 \mid 1}^{(j)}} \log p_j(x = 0 \mid y = 1) + \sum_{i \in D_{1 \mid 1}^{(j)}} \log p_j(x = 1 \mid y = 1) \right] + \cdots + \sum_{j=1}^n \left[ \sum_{i \in D_{0 \mid k}^{(j)}} \log p_j(x = 0 \mid y = k) + \sum_{i \in D_{1 \mid k}^{(j)}} \log p_j(x = 1 \mid y = k) \right]$

Define $D_{x \mid y}^{(j)} = \{i : x_j^{(i)} = x \wedge y^{(i)} = y\}$. Then, we have
  $\sum_{i=1}^m \sum_{j=1}^n \log p_j(x_j^{(i)} \mid y^{(i)}) = \sum_{j=1}^n \left[ \sum_{i \in D_{0 \mid 1}^{(j)}} \log p_j(x = 0 \mid y = 1) + \sum_{i \in D_{1 \mid 1}^{(j)}} \log p_j(x = 1 \mid y = 1) \right] + \cdots + \sum_{j=1}^n \left[ \sum_{i \in D_{0 \mid k}^{(j)}} \log p_j(x = 0 \mid y = k) + \sum_{i \in D_{1 \mid k}^{(j)}} \log p_j(x = 1 \mid y = k) \right]$
  $= \sum_{j=1}^n \sum_{y=1}^k \sum_{x \in \{0, 1\}} \text{count}_j(x \mid y) \log p_j(x \mid y)$

MLE for Naive Bayes (Contd.)
Revisiting the log-likelihood function:
  $\ell(\Omega) = \sum_{i=1}^m \left[ \log p(y^{(i)}) + \sum_{j=1}^n \log p_j(x_j^{(i)} \mid y^{(i)}) \right] = \sum_{y=1}^k \text{count}(y) \log p(y) + \sum_{j=1}^n \sum_{y=1}^k \sum_{x \in \{0, 1\}} \text{count}_j(x \mid y) \log p_j(x \mid y)$

MLE for Naive Bayes (Contd.)
Assume a training set $\{(x^{(i)}, y^{(i)})\}_{i = 1, \ldots, m}$. The maximum-likelihood estimates are then the parameter values $p(y)$ and $p_j(x_j \mid y)$ that maximize
  $\ell(\Omega) = \sum_{y=1}^k \text{count}(y) \log p(y) + \sum_{j=1}^n \sum_{y=1}^k \sum_{x \in \{0, 1\}} \text{count}_j(x \mid y) \log p_j(x \mid y)$
subject to the following constraints:
- $p(y) \ge 0$ for $y \in \{1, 2, \ldots, k\}$, and $\sum_{y=1}^k p(y) = 1$
- For $y \in \{1, \ldots, k\}$, $j \in \{1, \ldots, n\}$, $x \in \{0, 1\}$: $p_j(x \mid y) \ge 0$
- For $y \in \{1, \ldots, k\}$, $j \in \{1, \ldots, n\}$: $\sum_{x \in \{0, 1\}} p_j(x \mid y) = 1$

MLE for Naive Bayes (Contd.)
Theorem 1. The maximum-likelihood estimates for the Naive Bayes model are as follows:
  $p(y) = \frac{\text{count}(y)}{m} = \frac{\sum_{i=1}^m 1(y^{(i)} = y)}{m}$
and
  $p_j(x \mid y) = \frac{\text{count}_j(x \mid y)}{\text{count}(y)} = \frac{\sum_{i=1}^m 1(y^{(i)} = y \wedge x_j^{(i)} = x)}{\sum_{i=1}^m 1(y^{(i)} = y)}$
Proof (see Problem Set 2)
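A sketch of these counting estimates in NumPy (binary features, labels in {1,...,k}; the function and array names, and the toy data, are ours):

```python
import numpy as np

def fit_naive_bayes(X, y, k):
    """MLE by counting for the Bernoulli Naive Bayes model above.
    X: (m, n) binary matrix; y: (m,) labels in {1,...,k}.
    Returns p_y (class priors) and p_x1[c-1, j] = p_j(x=1 | y=c);
    note p_j(x=0 | y) = 1 - p_j(x=1 | y)."""
    m, n = X.shape
    p_y = np.zeros(k)
    p_x1 = np.zeros((k, n))
    for c in range(1, k + 1):
        idx = (y == c)
        p_y[c - 1] = idx.sum() / m                     # count(y) / m
        p_x1[c - 1] = X[idx].sum(axis=0) / idx.sum()   # count_j(x=1|y) / count(y)
    return p_y, p_x1

# Toy data: 5 samples, 3 binary features, 2 classes
X = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1], [1, 1, 1]])
y = np.array([1, 1, 2, 2, 2])
print(fit_naive_bayes(X, y, k=2))
```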

MLE for Naive Bayes (Contd.)
What if $x \in \{1, 2, \ldots, u\}$ instead of $x \in \{0, 1\}$? Can we get the same results? Check it yourself!

Laplace Smoothing
- There may exist some feature, e.g., $X_j$, such that $X_j = x$ never happens in the training data; then $p_j(x \mid y) = 0$, which forces $p(y \mid x_1, \ldots, x_n) = 0$
- This is unreasonable! How can we resolve this problem?
- Laplace smoothing:
  $p(y) = \frac{\sum_{i=1}^m 1(y^{(i)} = y) + 1}{m + k}$
  $p_j(x \mid y) = \frac{\sum_{i=1}^m 1(y^{(i)} = y \wedge x_j^{(i)} = x) + 1}{\sum_{i=1}^m 1(y^{(i)} = y) + 2}$
  (the denominator adds one count for each of the two possible values of $x_j$)

Classification by Naive Bayes
Given a test sample $x = [x_1, x_2, \ldots, x_n]^T$, we have
  $P(Y = y \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{P(X_1 = x_1, \ldots, X_n = x_n \mid Y = y)\, P(Y = y)}{P(X_1 = x_1, \ldots, X_n = x_n)} = \frac{P(Y = y) \prod_{j=1}^n P(X_j = x_j \mid Y = y)}{P(X_1 = x_1, \ldots, X_n = x_n)} = \frac{p(y) \prod_{j=1}^n p_j(x_j \mid y)}{p(x_1, \ldots, x_n)}$
Therefore, the output of the Naive Bayes model is
  $\arg\max_{y \in \{1, \ldots, k\}} p(y) \prod_{j=1}^n p_j(x_j \mid y)$
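A minimal prediction sketch given already-estimated parameters; the parameter arrays follow the counting sketch above and the numeric values are purely illustrative. Working in log space avoids underflow when multiplying many probabilities:

```python
import numpy as np

def predict_naive_bayes(x, p_y, p_x1):
    """Return arg max_y p(y) * prod_j p_j(x_j | y) for a binary feature vector x."""
    # log p(y) + sum_j log p_j(x_j | y), computed for every class at once
    log_post = np.log(p_y) + (x * np.log(p_x1) + (1 - x) * np.log(1 - p_x1)).sum(axis=1)
    return np.argmax(log_post) + 1  # labels are 1,...,k

# Illustrative parameters for k=2 classes and n=3 binary features
p_y = np.array([0.4, 0.6])
p_x1 = np.array([[0.9, 0.5, 0.3],
                 [0.2, 0.6, 0.8]])
print(predict_naive_bayes(np.array([1, 0, 1]), p_y, p_x1))
```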

Classification by Naive Bayes
Example: $y = 0, 1, 2, 3$
  $P(Y = 0 \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{p(y = 0) \prod_{j=1}^n p_j(x_j \mid y = 0)}{p(x_1, \ldots, x_n)}$
  $P(Y = 1 \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{p(y = 1) \prod_{j=1}^n p_j(x_j \mid y = 1)}{p(x_1, \ldots, x_n)}$
  $P(Y = 2 \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{p(y = 2) \prod_{j=1}^n p_j(x_j \mid y = 2)}{p(x_1, \ldots, x_n)}$
  $P(Y = 3 \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{p(y = 3) \prod_{j=1}^n p_j(x_j \mid y = 3)}{p(x_1, \ldots, x_n)}$

Naive Bayes for Multinomial Distribution
Assumptions:
- Each training sample involves a (possibly) different number of features: $x^{(i)} = [x_1^{(i)}, x_2^{(i)}, \ldots, x_{n_i}^{(i)}]^T$
- The $j$-th feature of $x^{(i)}$ takes values in a finite set: $x_j^{(i)} \in \{1, 2, \ldots, v\}$, for $j = 1, \ldots, n$
- For each training sample, the features are i.i.d.: $P(X_j = t \mid Y = y) = p(t \mid y)$, for $j = 1, \ldots, n$
  - $p(t \mid y) \ge 0$ is the conditional probability mass function
  - $\sum_{t=1}^v p(t \mid y) = 1$

Naive Bayes for Multinomial Distribution (Contd.)
Assumptions:
- Each training sample involves a (possibly) different number of features: $x^{(i)} = [x_1^{(i)}, x_2^{(i)}, \ldots, x_{n_i}^{(i)}]^T$
- The $j$-th feature of $x^{(i)}$ takes values in a finite set: $x_j^{(i)} \in \{1, 2, \ldots, v\}$
- For each training sample $(x, y)$, where $x$ is an $n$-dimensional vector:
  $P(Y = y) = p(y)$
  $P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \mid Y = y) = \prod_{j=1}^n P(X_j = x_j \mid Y = y)$

Naive Bayes for Multinomial Distribution (Contd.)
  $P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \mid Y = y) = \prod_{j=1}^n P(X_j = x_j \mid Y = y)$
  $= \prod_{t=1}^v \prod_{j: x_j = t} P(X_j = t \mid Y = y) = \prod_{t=1}^v \prod_{j: x_j = t} p(t \mid y) = \prod_{t=1}^v p(t \mid y)^{\text{count}(t)}$
where $\text{count}(t) = \sum_{j=1}^n 1(x_j = t)$.

Naive Bayes for Multinomial Distribution (Contd.)
For each training sample $(x, y)$:
  $P(Y = y) = p(y)$
  $P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \mid Y = y) = \prod_{t=1}^v p(t \mid y)^{\text{count}(t)}$
Parameters $\Omega$:
- $p(y)$, $y = 1, 2, \ldots, k$
- $p(t \mid y)$, $t = 1, 2, \ldots, v$, $y = 1, 2, \ldots, k$

Naive Bayes for Multinomial Distribution (Contd.)
Log-likelihood function:
  $\ell(\Omega) = \log \prod_{i=1}^m p(x^{(i)}, y^{(i)}) = \log \prod_{i=1}^m \sum_{y=1}^k 1(y^{(i)} = y)\, p(x^{(i)} \mid y^{(i)} = y)\, p(y^{(i)} = y)$
  $= \sum_{i=1}^m \sum_{y=1}^k 1(y^{(i)} = y) \log\left( p(x^{(i)} \mid y^{(i)} = y)\, p(y^{(i)} = y) \right)$

Naive Bayes for Multinomial Distribution (Contd.) l(ω) = = = = where k y=1 y=1 y=1 y=1 ( ) 1(y (i) = y) log p(x (i) y (i) = y)p(y (i) = y) k n i 1(y (i) = y) log p(y (i) = y) j=1 p(x (i) j y (i) = y) ( ) k v 1(y (i) = y) log p(y) p(t y) count(i) (t) t=1 ( k 1(y (i) = y) log p(y) + n i count (i) (t) = 1(x (i) j = t) j=1 ) v count (i) (t)p(t y) t=1 65 / 108

Naive Bayes for Multinomial Distribution (Contd.)
Problem formulation:
  $\max_\Omega \; \ell(\Omega) = \sum_{i=1}^m \sum_{y=1}^k 1(y^{(i)} = y) \left[ \log p(y) + \sum_{t=1}^v \text{count}^{(i)}(t) \log p(t \mid y) \right]$
  s.t. $p(y) \ge 0$, $y = 1, \ldots, k$
       $p(t \mid y) \ge 0$, $t = 1, \ldots, v$, $y = 1, \ldots, k$
       $\sum_{y=1}^k p(y) = 1$
       $\sum_{t=1}^v p(t \mid y) = 1$, $y = 1, 2, \ldots, k$

Naive Bayes for Multinomial Distribution (Contd.)
Solution:
  $p(t \mid y) = \frac{\sum_{i=1}^m 1(y^{(i)} = y)\, \text{count}^{(i)}(t)}{\sum_{i=1}^m 1(y^{(i)} = y) \sum_{t'=1}^v \text{count}^{(i)}(t')}$
  $p(y) = \frac{\sum_{i=1}^m 1(y^{(i)} = y)}{m}$
Check them by yourselves!

Naive Bayes for Multinomial Distribution (Contd.)
Laplace smoothing:
  $\psi(t \mid y) = \frac{\sum_{i=1}^m 1(y^{(i)} = y)\, \text{count}^{(i)}(t) + 1}{\sum_{i=1}^m 1(y^{(i)} = y) \sum_{t'=1}^v \text{count}^{(i)}(t') + v}$
  $\psi(y) = \frac{\sum_{i=1}^m 1(y^{(i)} = y) + 1}{m + k}$
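A sketch of the smoothed multinomial estimates, representing each sample directly by its length-v count vector (add-one smoothing in both numerator and denominator; the function name and toy data are ours):

```python
import numpy as np

def fit_multinomial_nb(counts, y, k, v):
    """counts: (m, v) matrix with counts[i, t-1] = count^(i)(t); y: labels in {1,...,k}."""
    m = counts.shape[0]
    psi_y = np.zeros(k)
    psi_t = np.zeros((k, v))
    for c in range(1, k + 1):
        idx = (y == c)
        psi_y[c - 1] = (idx.sum() + 1) / (m + k)             # smoothed class prior
        class_counts = counts[idx].sum(axis=0)               # sum_i 1(y^(i)=c) count^(i)(t)
        psi_t[c - 1] = (class_counts + 1) / (class_counts.sum() + v)  # smoothed p(t|y)
    return psi_y, psi_t

# Toy example: m=3 "documents" over a vocabulary of v=4 values, k=2 classes
counts = np.array([[2, 0, 1, 0], [1, 1, 0, 0], [0, 0, 3, 1]])
y = np.array([1, 1, 2])
print(fit_multinomial_nb(counts, y, k=2, v=4))
```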

Convex Functions
- A set $C$ is convex if the line segment between any two points in $C$ lies in $C$, i.e., for any $x_1, x_2 \in C$ and any $\theta$ with $0 \le \theta \le 1$, we have $\theta x_1 + (1 - \theta) x_2 \in C$
- A function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ is convex if $\mathrm{dom}\, f$ is a convex set and if for all $x, y \in \mathrm{dom}\, f$ and $\lambda$ with $0 \le \lambda \le 1$, we have
  $f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$

Convex Functions (Contd.)
First-order condition: Suppose $f$ is differentiable (i.e., its gradient $\nabla f$ exists at each point in $\mathrm{dom}\, f$, which is open). Then $f$ is convex if and only if $\mathrm{dom}\, f$ is convex and
  $f(y) \ge f(x) + \nabla f(x)^T (y - x)$
holds for all $x, y \in \mathrm{dom}\, f$.

Convex Functions (Contd.)
Second-order condition: Assume $f$ is twice differentiable (i.e., its Hessian matrix or second derivative $\nabla^2 f$ exists at each point in $\mathrm{dom}\, f$, which is open). Then $f$ is convex if and only if $\mathrm{dom}\, f$ is convex and its Hessian is positive semidefinite: for all $x \in \mathrm{dom}\, f$, $\nabla^2 f(x) \succeq 0$.

Jensen's Inequality
- Let $f$ be a convex function; then
  $f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$
  where $\lambda \in [0, 1]$
- Jensen's inequality: Let $f(x)$ be a convex function defined on an interval $I$. If $x_1, x_2, \ldots, x_N \in I$ and $\lambda_1, \lambda_2, \ldots, \lambda_N \ge 0$ with $\sum_{i=1}^N \lambda_i = 1$, then
  $f\left(\sum_{i=1}^N \lambda_i x_i\right) \le \sum_{i=1}^N \lambda_i f(x_i)$
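A quick numerical illustration of the inequality for one convex function (here $f(x) = x^2$ and the weights are arbitrary illustrative choices):

```python
import numpy as np

f = lambda x: x ** 2                      # a convex function
x = np.array([1.0, 3.0, 5.0, 7.0])
lam = np.array([0.1, 0.2, 0.3, 0.4])      # nonnegative weights summing to 1

lhs = f(np.dot(lam, x))                   # f(sum_i lambda_i x_i)
rhs = np.dot(lam, f(x))                   # sum_i lambda_i f(x_i)
print(lhs <= rhs, lhs, rhs)               # True 25.0 29.0
```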

The Proof of Jensen's Inequality
- When $N = 1$, the result is trivial
- When $N = 2$, $f(\lambda_1 x_1 + \lambda_2 x_2) \le \lambda_1 f(x_1) + \lambda_2 f(x_2)$ due to the convexity of $f(x)$
- When $N \ge 3$, the proof is by induction: we assume that Jensen's inequality holds when $N = k - 1$, i.e.,
  $f\left(\sum_{i=1}^{k-1} \lambda_i x_i\right) \le \sum_{i=1}^{k-1} \lambda_i f(x_i)$
  and then prove that Jensen's inequality still holds for $N = k$

When $N = k$,
  $f\left(\sum_{i=1}^k \lambda_i x_i\right) = f\left(\sum_{i=1}^{k-1} \lambda_i x_i + \lambda_k x_k\right) = f\left((1 - \lambda_k) \sum_{i=1}^{k-1} \frac{\lambda_i}{1 - \lambda_k} x_i + \lambda_k x_k\right)$
  $\le (1 - \lambda_k)\, f\left(\sum_{i=1}^{k-1} \frac{\lambda_i}{1 - \lambda_k} x_i\right) + \lambda_k f(x_k)$
  $\le (1 - \lambda_k) \sum_{i=1}^{k-1} \frac{\lambda_i}{1 - \lambda_k} f(x_i) + \lambda_k f(x_k)$
  $= \sum_{i=1}^{k-1} \lambda_i f(x_i) + \lambda_k f(x_k) = \sum_{i=1}^k \lambda_i f(x_i)$

The Probabilistic Form of Jensen's Inequality
- The inequality can be extended to infinite sums, integrals, and expected values
- If $p(x) \ge 0$ on $S \subseteq \mathrm{dom}\, f$ and $\int_S p(x)\,dx = 1$, we have
  $f\left(\int_S p(x)\, x\,dx\right) \le \int_S p(x) f(x)\,dx$
- Assuming $X$ is a random variable and $P$ is a probability distribution on sample space $S$, we have
  $f[E(X)] \le E[f(X)]$
- The equality holds if $X$ is a constant

Jensen's Inequality for Concave Functions
- Assume $f$ is a concave function; then
  $f\left(\sum_{i=1}^N \lambda_i x_i\right) \ge \sum_{i=1}^N \lambda_i f(x_i)$
- The probabilistic form: $f(E[X]) \ge E[f(X)]$
- Example: $f(x) = \log x$
  $\log\left(\sum_{i=1}^N \lambda_i x_i\right) \ge \sum_{i=1}^N \lambda_i \log(x_i)$
  $\log(E[X]) \ge E[\log(X)]$

The Expectation-Maximization (EM) Algorithm
- A training set $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$ (without labels)
- The log-likelihood function
  $\ell(\theta) = \sum_{i=1}^m \log p(x^{(i)}; \theta) = \sum_{i=1}^m \log \sum_{z^{(i)} \in \Omega} p(x^{(i)}, z^{(i)}; \theta)$
- $\theta$ denotes the full set of unknown parameters in the model
- $z^{(i)} \in \Omega$ is a so-called latent variable

The EM Algorithm (Contd.)
- Our goal is to maximize the log-likelihood function
  $\ell(\theta) = \sum_{i=1}^m \log \sum_{z^{(i)} \in \Omega} p(x^{(i)}, z^{(i)}; \theta)$
- The basic idea of the EM algorithm:
  - Repeatedly construct a lower bound on $\ell$ (E-step)
  - Then optimize that lower bound (M-step)

The EM Algorithm (Contd.)
- A training set $\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$ (without labels)
- The log-likelihood function
  $\ell(\theta) = \sum_{i=1}^m \log p(x^{(i)}; \theta) = \sum_{i=1}^m \log \sum_{z^{(i)} \in \Omega} p(x^{(i)}, z^{(i)}; \theta)$
- $\theta$ denotes the full set of unknown parameters in the model; the $z^{(i)} \in \Omega$ are latent variables
- Our goal is to maximize the above log-likelihood function
- The basic idea of the EM algorithm: repeatedly construct a lower bound on $\ell$ (E-step), and then optimize that lower bound (M-step)

For each $i \in \{1, 2, \ldots, m\}$, let $Z_i \in \Omega$ be a random variable indicating the label of the $i$-th training sample.
- The probability that the label of the $i$-th training sample is $z$: $P(Z_i = z) = Q_i(z)$
- We say $Q_i$ is some distribution over the $z$'s: $\sum_{z \in \Omega} Q_i(z) = 1$, $Q_i(z) \ge 0$
- We have
  $\ell(\theta) = \sum_{i=1}^m \log \sum_{z^{(i)} \in \Omega} p(x^{(i)}, z^{(i)}; \theta) = \sum_{i=1}^m \log \sum_{z^{(i)} \in \Omega} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$

Assume $\phi : \Omega \rightarrow \mathbb{R}$ with
  $\phi(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
Then
  $E[\phi(Z_i)] = \sum_{z^{(i)} \in \Omega} P(Z_i = z^{(i)})\, \phi(z^{(i)}) = \sum_{z^{(i)} \in \Omega} Q_i(z^{(i)})\, \phi(z^{(i)})$
Re-visiting the log-likelihood function:
  $\ell(\theta) = \sum_{i=1}^m \log \sum_{z^{(i)} \in \Omega} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = \sum_{i=1}^m \log E[\phi(Z_i)]$

Since $\log(\cdot)$ is a concave function, according to Jensen's inequality we have
  $\log(E[\phi(Z_i)]) \ge E[\log(\phi(Z_i))]$
Then, the log-likelihood function
  $\ell(\theta) = \sum_{i=1}^m \log E[\phi(Z_i)] \ge \sum_{i=1}^m E[\log(\phi(Z_i))] = \sum_{i=1}^m \sum_{z^{(i)} \in \Omega} Q_i(z^{(i)}) \log \phi(z^{(i)}) = \sum_{i=1}^m \sum_{z^{(i)} \in \Omega} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$

For any set of distributions $Q_i$, $\ell(\theta)$ has a lower bound:
  $\ell(\theta) \ge \sum_{i=1}^m \sum_{z^{(i)} \in \Omega} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$
Tighten the lower bound (i.e., let the equality hold): the equality in Jensen's inequality holds if
  $\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$
where $c$ is a constant.

Tighten the lower bound (i.e., let the equality hold): the equality in Jensen's inequality holds if
  $\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c$
where $c$ is a constant. Since $\sum_{z^{(i)} \in \Omega} Q_i(z^{(i)}) = 1$, we have
  $\sum_{z^{(i)} \in \Omega} p(x^{(i)}, z^{(i)}; \theta) = c \sum_{z^{(i)} \in \Omega} Q_i(z^{(i)}) = c$

Tighten the lower bound (i.e., let the equality hold): we have $p(x^{(i)}, z^{(i)}; \theta) / Q_i(z^{(i)}) = c$, $\sum_{z^{(i)} \in \Omega} Q_i(z^{(i)}) = 1$, and hence $\sum_{z^{(i)} \in \Omega} p(x^{(i)}, z^{(i)}; \theta) = c$. Therefore,
  $Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{c} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z^{(i)} \in \Omega} p(x^{(i)}, z^{(i)}; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta)$

The EM Algorithm (Contd.)
Repeat the following steps until convergence:
- (E-step) For each $i$, set
  $Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta)$
- (M-step) Set
  $\theta := \arg\max_\theta \sum_i \sum_{z^{(i)} \in \Omega} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}$

Convergence
At the end of the $t$-th iteration,
  $\ell(\theta^{[t]}) = \sum_i \sum_{z^{(i)}} Q_i^{[t]}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{[t]})}{Q_i^{[t]}(z^{(i)})}$
$\theta^{[t+1]}$ is then obtained by maximizing the right-hand side of the above equation, so
  $\ell(\theta^{[t+1]}) \ge \sum_i \sum_{z^{(i)} \in \Omega} Q_i^{[t]}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{[t+1]})}{Q_i^{[t]}(z^{(i)})} \ge \sum_i \sum_{z^{(i)} \in \Omega} Q_i^{[t]}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{[t]})}{Q_i^{[t]}(z^{(i)})} = \ell(\theta^{[t]})$
EM causes the likelihood to converge monotonically.

Reviewing Mixtures of Gaussians
- A training set $\{x^{(1)}, \ldots, x^{(m)}\}$
- Mixture of Gaussians model: $p(x, z) = p(x \mid z)\, p(z)$
  - $Z \in \{1, \ldots, k\}$, $Z \sim \text{Multinomial}(\{\phi_j\}_{j=1,\ldots,k})$, with $\phi_j = P(Z = j)$ ($\phi_j \ge 0$ and $\sum_{j=1}^k \phi_j = 1$)
  - $x \mid z = j \sim \mathcal{N}(\mu_j, \Sigma_j)$
- The $z$'s are so-called latent random variables, since they are hidden/unobserved

Reviewing Mixtures of Gaussians (Contd.)
The log-likelihood function:
  $\ell(\phi, \mu, \Sigma) = \sum_{i=1}^m \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^m \log \sum_{z^{(i)} = 1}^k p(x^{(i)} \mid z^{(i)}; \mu, \Sigma)\, p(z^{(i)}; \phi)$

Applying EM Algorithm to Mixtures of Gaussians
Repeat the following steps until convergence:
- (E-step) For each $i, j$, set
  $w_j^{(i)} = p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma) = \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi)}{\sum_{l=1}^k p(x^{(i)} \mid z^{(i)} = l; \mu, \Sigma)\, p(z^{(i)} = l; \phi)}$
- (M-step) Update the parameters:
  $\phi_j = \frac{1}{m} \sum_{i=1}^m w_j^{(i)}$
  $\mu_j = \frac{\sum_{i=1}^m w_j^{(i)} x^{(i)}}{\sum_{i=1}^m w_j^{(i)}}$
  $\Sigma_j = \frac{\sum_{i=1}^m w_j^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^T}{\sum_{i=1}^m w_j^{(i)}}$

Applying EM Algorithm to MG (Contd.)
(E-step) For each $i, j$, set
  $w_j^{(i)} = Q_i(z^{(i)} = j) = p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma) = \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi)}{\sum_{l=1}^k p(x^{(i)} \mid z^{(i)} = l; \mu, \Sigma)\, p(z^{(i)} = l; \phi)}$
where
  $p(x^{(i)} \mid z^{(i)} = j; \mu_j, \Sigma_j) = \frac{1}{(2\pi)^{n/2} |\Sigma_j|^{1/2}} \exp\left(-\frac{1}{2}(x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j)\right)$
  $p(z^{(i)} = j; \phi) = \phi_j$
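Putting this E-step together with the M-step updates from the previous slide, a minimal NumPy sketch of EM for a mixture of Gaussians (the initialization, the small ridge term, and all names are our illustrative choices, not part of the slides):

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Row-wise density of N(mu, Sigma) for a data matrix X of shape (m, n)."""
    n = X.shape[1]
    diff = X - mu
    r2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    return np.exp(-0.5 * r2) / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))

def em_gmm(X, k, iters=100):
    m, n = X.shape
    rng = np.random.default_rng(0)
    phi = np.full(k, 1.0 / k)
    mu = X[rng.choice(m, k, replace=False)]          # random data points as initial means
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(n) for _ in range(k)])
    for _ in range(iters):
        # E-step: w[i, j] = p(z^(i) = j | x^(i)) (responsibilities)
        w = np.stack([phi[j] * gaussian_pdf(X, mu[j], Sigma[j]) for j in range(k)], axis=1)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted updates of phi, mu, Sigma
        Nj = w.sum(axis=0)
        phi = Nj / m
        mu = (w.T @ X) / Nj[:, None]
        for j in range(k):
            d = X - mu[j]
            Sigma[j] = (w[:, j, None] * d).T @ d / Nj[j] + 1e-6 * np.eye(n)
    return phi, mu, Sigma

# Toy example: two well-separated clusters in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
print(em_gmm(X, k=2)[1])   # estimated means, roughly [0, 0] and [4, 4]
```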

Applying EM Algorithm to MG (Contd.)
(M-step) Maximizing
  $\sum_{i=1}^m \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \phi, \mu, \Sigma)}{Q_i(z^{(i)})}$
  $= \sum_{i=1}^m \sum_{j=1}^k Q_i(z^{(i)} = j) \log \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi)}{Q_i(z^{(i)} = j)}$
  $= \sum_{i=1}^m \sum_{j=1}^k \omega_j^{(i)} \log \frac{\frac{1}{(2\pi)^{n/2} |\Sigma_j|^{1/2}} \exp\left(-\frac{1}{2}(x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j)\right) \phi_j}{\omega_j^{(i)}}$
  $= -\sum_{i=1}^m \sum_{j=1}^k \omega_j^{(i)} \left[ \log\left((2\pi)^{\frac{n}{2}} |\Sigma_j|^{\frac{1}{2}}\right) + \frac{1}{2}(x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j) \right] + \sum_{i=1}^m \sum_{j=1}^k \omega_j^{(i)} \log \phi_j - \sum_{i=1}^m \sum_{j=1}^k \omega_j^{(i)} \log \omega_j^{(i)}$

Applying EM Algorithm to MG: Calculating $\mu$
  $\nabla_{\mu_l} \sum_{i=1}^m \sum_{j=1}^k \omega_j^{(i)} \log \frac{\frac{1}{(2\pi)^{n/2} |\Sigma_j|^{1/2}} \exp\left(-\frac{1}{2}(x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j)\right) \phi_j}{\omega_j^{(i)}}$
  $= -\nabla_{\mu_l} \sum_{i=1}^m \sum_{j=1}^k \omega_j^{(i)} \frac{1}{2}(x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j)$
  $= \frac{1}{2} \sum_{i=1}^m \omega_l^{(i)} \nabla_{\mu_l} \left( 2 \mu_l^T \Sigma_l^{-1} x^{(i)} - \mu_l^T \Sigma_l^{-1} \mu_l \right)$
  $= \sum_{i=1}^m \omega_l^{(i)} \left( \Sigma_l^{-1} x^{(i)} - \Sigma_l^{-1} \mu_l \right)$
Let $\sum_{i=1}^m \omega_l^{(i)} \left( \Sigma_l^{-1} x^{(i)} - \Sigma_l^{-1} \mu_l \right) = 0$; we have
  $\mu_l = \frac{\sum_{i=1}^m \omega_l^{(i)} x^{(i)}}{\sum_{i=1}^m \omega_l^{(i)}}$

Applying EM Algorithm to MG: Calculating $\phi$
Our problem becomes
  $\max \; \sum_{i=1}^m \sum_{j=1}^k \omega_j^{(i)} \log \phi_j \quad \text{s.t.} \quad \sum_{j=1}^k \phi_j = 1$
Using the theory of Lagrange multipliers,
  $\phi_j = \frac{1}{m} \sum_{i=1}^m \omega_j^{(i)}$

Applying EM Algorithm to MG: Calculating $\Sigma$
Minimizing
  $\sum_{i=1}^m \sum_{j=1}^k \omega_j^{(i)} \left[ \log\left((2\pi)^{\frac{n}{2}} |\Sigma_j|^{\frac{1}{2}}\right) + \frac{1}{2}(x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j) \right]$
  $= \frac{n \log 2\pi}{2} \sum_{i=1}^m \sum_{j=1}^k \omega_j^{(i)} + \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^k \omega_j^{(i)} \log |\Sigma_j| + \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^k \omega_j^{(i)} (x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j)$

Minimizing
  $\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^k \omega_j^{(i)} \log |\Sigma_j| + \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^k \omega_j^{(i)} (x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j)$
Therefore, we have
  $\nabla_{\Sigma_j} \left[ \sum_{i=1}^m \omega_j^{(i)} \log |\Sigma_j| + \sum_{i=1}^m \omega_j^{(i)} (x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j) \right] = 0$
for $j \in \{1, \ldots, k\}$.

By applying X tr(ax 1 B) = (X 1 BAX 1 ) T A A = A (A 1 ) T We get a solution Σ j = where j = 1,, k Check the derivations by yourself! m ω(i) j (x (i) µ j )(x (i) µ j ) T m ω(i) j 97 / 108

Naive Bayes with Missing Labels
- For any $x$, we have
  $p(x) = \sum_{y=1}^k p(x, y) = \sum_{y=1}^k p(y) \prod_{j=1}^n p_j(x_j \mid y)$
- The log-likelihood function is then defined as
  $\ell(\theta) = \sum_{i=1}^m \log p(x^{(i)}) = \sum_{i=1}^m \log \sum_{y=1}^k p(y) \prod_{j=1}^n p_j(x_j^{(i)} \mid y)$
- Maximize $\ell(\theta)$ subject to the following constraints:
  - $p(y) \ge 0$ for $y \in \{1, \ldots, k\}$, and $\sum_{y=1}^k p(y) = 1$
  - For $y \in \{1, \ldots, k\}$, $j \in \{1, \ldots, n\}$, $x_j \in \{0, 1\}$: $p_j(x_j \mid y) \ge 0$
  - For $y \in \{1, \ldots, k\}$ and $j \in \{1, \ldots, n\}$: $\sum_{x_j \in \{0, 1\}} p_j(x_j \mid y) = 1$

Naive Bayes with Missing Labels (Contd.)
- When labels are given:
  $\ell(\theta) = \sum_{i=1}^m \log \left[ p(y^{(i)}) \prod_{j=1}^n p_j(x_j^{(i)} \mid y^{(i)}) \right]$
- When labels are missing:
  $\ell(\theta) = \sum_{i=1}^m \log \sum_{y=1}^k p(y) \prod_{j=1}^n p_j(x_j^{(i)} \mid y)$

Applying EM Algorithm to Naive Bayes
Repeat the following steps until convergence:
- (E-step) For each $i = 1, \ldots, m$ and $y = 1, \ldots, k$, set
  $\delta(y \mid i) = p(y \mid x^{(i)}) = \frac{p(y) \prod_{j=1}^n p_j(x_j^{(i)} \mid y)}{\sum_{y'=1}^k p(y') \prod_{j=1}^n p_j(x_j^{(i)} \mid y')}$
- (M-step) Update the parameters:
  $p(y) = \frac{1}{m} \sum_{i=1}^m \delta(y \mid i)$
  $p_j(x \mid y) = \frac{\sum_{i: x_j^{(i)} = x} \delta(y \mid i)}{\sum_{i=1}^m \delta(y \mid i)}$
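A minimal sketch of this E/M loop for binary features (the uniform-random initialization, the clipping to avoid log(0), and all names are ours, chosen for illustration):

```python
import numpy as np

def em_naive_bayes(X, k, iters=50, seed=0):
    """Naive Bayes with missing labels via EM. X: (m, n) binary matrix, k classes."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    p_y = np.full(k, 1.0 / k)
    p_x1 = rng.uniform(0.25, 0.75, size=(k, n))     # p_j(x=1 | y), random start
    for _ in range(iters):
        # E-step: delta[i, y] = p(y | x^(i)) under the current parameters
        log_lik = (X @ np.log(p_x1.T) + (1 - X) @ np.log(1 - p_x1.T)) + np.log(p_y)
        delta = np.exp(log_lik - log_lik.max(axis=1, keepdims=True))
        delta /= delta.sum(axis=1, keepdims=True)
        # M-step: closed-form updates from the slide above
        p_y = delta.mean(axis=0)                    # (1/m) sum_i delta(y|i)
        p_x1 = (delta.T @ X) / delta.sum(axis=0)[:, None]
        p_x1 = np.clip(p_x1, 1e-6, 1 - 1e-6)        # keep the next E-step well-defined
    return p_y, p_x1

# Toy data: two latent clusters of binary vectors
rng = np.random.default_rng(1)
X = np.vstack([rng.binomial(1, 0.9, (100, 5)), rng.binomial(1, 0.1, (100, 5))])
print(em_naive_bayes(X, k=2)[0])
```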

Applying EM Algorithm to NB: E-Step
  $\delta(y \mid i) = p(y \mid x^{(i)}) = \frac{p(x^{(i)} \mid y)\, p(y)}{p(x^{(i)})} = \frac{p(y) \prod_{j=1}^n p_j(x_j^{(i)} \mid y)}{\sum_{y'=1}^k p(x^{(i)}, y')} = \frac{p(y) \prod_{j=1}^n p_j(x_j^{(i)} \mid y)}{\sum_{y'=1}^k p(x^{(i)} \mid y')\, p(y')} = \frac{p(y) \prod_{j=1}^n p_j(x_j^{(i)} \mid y)}{\sum_{y'=1}^k p(y') \prod_{j=1}^n p_j(x_j^{(i)} \mid y')}$

Applying EM Algorithm to NB: M-Step
  $\sum_{i=1}^m \sum_{y=1}^k \delta(y \mid i) \log \frac{p(x^{(i)}, z^{(i)} = y)}{\delta(y \mid i)}$
  $= \sum_{i=1}^m \sum_{y=1}^k \delta(y \mid i) \log \frac{p(x^{(i)} \mid z^{(i)} = y)\, p(z^{(i)} = y)}{\delta(y \mid i)}$
  $= \sum_{i=1}^m \sum_{y=1}^k \delta(y \mid i) \log \frac{p(y) \prod_{j=1}^n p_j(x_j^{(i)} \mid y)}{\delta(y \mid i)}$
  $= \sum_{i=1}^m \sum_{y=1}^k \delta(y \mid i) \left[ \log p(y) + \sum_{j=1}^n \log p_j(x_j^{(i)} \mid y) - \log \delta(y \mid i) \right]$

Applying EM Algorithm to NB: M-Step (Contd.)
  $\sum_{i=1}^m \sum_{y=1}^k \delta(y \mid i) \log \frac{p(x^{(i)}, z^{(i)} = y)}{\delta(y \mid i)} = \sum_{i=1}^m \sum_{y=1}^k \delta(y \mid i) \left[ \log p(y) + \sum_{j=1}^n \log p_j(x_j^{(i)} \mid y) - \log \delta(y \mid i) \right]$
  $= \sum_{i=1}^m \sum_{y=1}^k \delta(y \mid i) \log p(y) + \sum_{i=1}^m \sum_{y=1}^k \sum_{j=1}^n \delta(y \mid i) \log p_j(x_j^{(i)} \mid y) - \sum_{i=1}^m \sum_{y=1}^k \delta(y \mid i) \log \delta(y \mid i)$

Applying EM Algorithm to NB: M-Step (Contd.)
  $\sum_{i=1}^m \sum_{y=1}^k \sum_{j=1}^n \delta(y \mid i) \log p_j(x_j^{(i)} \mid y)$
  $= \sum_{y=1}^k \sum_{j=1}^n \left[ \sum_{i: x_j^{(i)} = 0} \delta(y \mid i) \log p_j(x = 0 \mid y) + \sum_{i: x_j^{(i)} = 1} \delta(y \mid i) \log p_j(x = 1 \mid y) \right]$
  $= \sum_{y=1}^k \sum_{j=1}^n \sum_{x \in \{0, 1\}} \sum_{i: x_j^{(i)} = x} \delta(y \mid i) \log p_j(x \mid y)$

Applying EM Algorithm to NB: M-Step (Contd.)
  $\max \quad \sum_{i=1}^m \sum_{y=1}^k \delta(y \mid i) \log p(y) + \sum_{y=1}^k \sum_{j=1}^n \sum_{x \in \{0, 1\}} \sum_{i: x_j^{(i)} = x} \delta(y \mid i) \log p_j(x \mid y)$
  $\text{s.t.} \quad \sum_{y=1}^k p(y) = 1$
  $\qquad\; \sum_{x \in \{0, 1\}} p_j(x \mid y) = 1, \quad y = 1, \ldots, k,\; j = 1, \ldots, n$
  $\qquad\; p(y) \ge 0, \quad y = 1, \ldots, k$
  $\qquad\; p_j(x \mid y) \ge 0, \quad j = 1, \ldots, n,\; x \in \{0, 1\},\; y = 1, \ldots, k$

Applying EM Algorithm to NB: M-Step (Contd.)
Problem I:
  $\max \quad \sum_{i=1}^m \sum_{y=1}^k \delta(y \mid i) \log p(y)$
  $\text{s.t.} \quad \sum_{y=1}^k p(y) = 1, \quad p(y) \ge 0,\; y = 1, \ldots, k$
Solution (by Lagrange multipliers):
  $p(y) = \frac{\sum_{i=1}^m \delta(y \mid i)}{\sum_{y'=1}^k \sum_{i=1}^m \delta(y' \mid i)} = \frac{1}{m} \sum_{i=1}^m \delta(y \mid i)$

Applying EM Algorithm to NB: M-Step (Contd.)
Problem II:
  $\max \quad \sum_{y=1}^k \sum_{j=1}^n \sum_{x \in \{0, 1\}} \sum_{i: x_j^{(i)} = x} \delta(y \mid i) \log p_j(x \mid y)$
  $\text{s.t.} \quad \sum_{x \in \{0, 1\}} p_j(x \mid y) = 1, \quad y = 1, \ldots, k,\; j = 1, \ldots, n$
  $\qquad\; p_j(x \mid y) \ge 0, \quad j = 1, \ldots, n,\; x \in \{0, 1\},\; y = 1, \ldots, k$
Solution (by Lagrange multipliers):
  $p_j(x \mid y) = \frac{\sum_{i: x_j^{(i)} = x} \delta(y \mid i)}{\sum_{x' \in \{0, 1\}} \sum_{i: x_j^{(i)} = x'} \delta(y \mid i)} = \frac{\sum_{i: x_j^{(i)} = x} \delta(y \mid i)}{\sum_{i=1}^m \delta(y \mid i)}$

Thanks! Q & A 108 / 108