Probabilistic & Unsupervised Learning
Transcript
1 Probabilistic & Unsupervised Learning Week 3: The EM algorithm Yee Whye Teh Gatsby Computational Neuroscience Unit University College London Term 1, Autumn 2009
2 Mixtures of Gaussians
Data: X = {x_1, ..., x_N}.
Latent process: s_i ~ iid Disc[π].
Component distributions: x_i | (s_i = m) ~ P_m[θ_m] = N(μ_m, Σ_m).
Marginal distribution: P(x_i) = Σ_{m=1}^k π_m P_m(x; θ_m).
Log-likelihood:
log p(X | {μ_m}, {Σ_m}, π) = Σ_{i=1}^N log Σ_{m=1}^k π_m |2πΣ_m|^{-1/2} exp[ -(1/2) (x_i - μ_m)^T Σ_m^{-1} (x_i - μ_m) ]
3 EM for MoGs
Evaluate responsibilities: r_{im} = π_m P_m(x_i) / Σ_{m'} π_{m'} P_{m'}(x_i).
Update parameters:
μ_m ← Σ_i r_{im} x_i / Σ_i r_{im}
Σ_m ← Σ_i r_{im} (x_i - μ_m)(x_i - μ_m)^T / Σ_i r_{im}
π_m ← Σ_i r_{im} / N
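To make the responsibilities and updates above concrete, here is a minimal NumPy sketch of EM for a mixture of Gaussians. It assumes the data sit in an N × D array X; the function name em_mog, the random initialisation, and the small ridge added to the covariances for numerical stability are illustrative choices, not part of the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_mog(X, k, n_iter=100, seed=0):
    """Minimal EM for a mixture of Gaussians (sketch of the slide's updates)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mus = X[rng.choice(N, size=k, replace=False)]         # initialise means at random data points
    Sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * k)
    pis = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E step: responsibilities r_im proportional to pi_m N(x_i; mu_m, Sigma_m)
        resp = np.stack([pis[m] * multivariate_normal.pdf(X, mus[m], Sigmas[m])
                         for m in range(k)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: responsibility-weighted means, covariances and mixing proportions
        Nm = resp.sum(axis=0)
        mus = (resp.T @ X) / Nm[:, None]
        for m in range(k):
            d = X - mus[m]
            Sigmas[m] = (resp[:, m, None] * d).T @ d / Nm[m] + 1e-6 * np.eye(D)
        pis = Nm / N
    return mus, Sigmas, pis
```

Calling mus, Sigmas, pis = em_mog(X, k=3) runs a fixed number of iterations; a practical version would also monitor the log likelihood for convergence.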
4 The Expectation Maximisation (EM) algorithm
The EM algorithm finds a (local) maximum of a latent variable model likelihood. It starts from arbitrary values of the parameters and iterates two steps:
E step: fill in values of the latent variables according to their posterior given the data.
M step: maximise the likelihood as if the latent variables were not hidden.
Useful in models where learning would be easy if the hidden variables were, in fact, observed (e.g. MoGs).
Decomposes difficult problems into a series of tractable steps.
No learning rate.
The framework lends itself to principled approximations.
5 Jensen's Inequality
log is concave, so log(α x_1 + (1 - α) x_2) ≥ α log(x_1) + (1 - α) log(x_2) for α ∈ [0, 1].
More generally, for α_i ≥ 0 with Σ_i α_i = 1 and any {x_i > 0}:
log( Σ_i α_i x_i ) ≥ Σ_i α_i log(x_i).
Equality holds iff all the x_i with α_i > 0 are equal; in particular, when α_i = 1 for some i (and therefore all others are 0).
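A quick numerical sanity check of the inequality (a sketch; the Dirichlet draw is just a convenient way to generate valid weights α_i):

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = rng.dirichlet(np.ones(5))        # alpha_i >= 0, sum alpha_i = 1
x = rng.uniform(0.1, 10.0, size=5)       # any positive x_i

lhs = np.log(np.sum(alpha * x))          # log of the weighted average
rhs = np.sum(alpha * np.log(x))          # weighted average of the logs
assert lhs >= rhs                        # Jensen: log(sum_i alpha_i x_i) >= sum_i alpha_i log(x_i)
print(lhs, rhs)
```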
6 The Free Energy for a Latent Variable Model
Observed data X = {x_i}; latent variables Y = {y_i}; parameters θ.
Goal: maximise the log likelihood (i.e. ML learning) wrt θ:
ℓ(θ) = log P(X|θ) = log ∫ P(Y, X|θ) dY.
Any distribution q(Y) over the hidden variables can be used to obtain a lower bound on the log likelihood using Jensen's inequality:
ℓ(θ) = log ∫ q(Y) [P(Y, X|θ) / q(Y)] dY ≥ ∫ q(Y) log [P(Y, X|θ) / q(Y)] dY =: F(q, θ).
Now,
∫ q(Y) log [P(Y, X|θ) / q(Y)] dY = ∫ q(Y) log P(Y, X|θ) dY - ∫ q(Y) log q(Y) dY = ⟨log P(Y, X|θ)⟩_{q(Y)} + H[q],
where H[q] is the entropy of q(Y). So:
F(q, θ) = ⟨log P(Y, X|θ)⟩_{q(Y)} + H[q]
11 The E and M steps of EM
The lower bound on the log likelihood is given by:
F(q, θ) = ⟨log P(Y, X|θ)⟩_{q(Y)} + H[q].
EM alternates between:
E step: optimise F(q, θ) wrt the distribution over hidden variables, holding the parameters fixed:
q^{(k)}(Y) := argmax_{q(Y)} F(q(Y), θ^{(k-1)}).
M step: maximise F(q, θ) wrt the parameters, holding the hidden distribution fixed:
θ^{(k)} := argmax_θ F(q^{(k)}(Y), θ) = argmax_θ ⟨log P(Y, X|θ)⟩_{q^{(k)}(Y)}.
The second equality comes from the fact that the entropy of q(Y) does not depend on θ.
14 EM as Coordinate Ascent in F
15 The E Step
The free energy can be re-written:
F(q, θ) = ∫ q(Y) log [P(Y, X|θ) / q(Y)] dY
= ∫ q(Y) log [P(Y|X, θ) P(X|θ) / q(Y)] dY
= ∫ q(Y) log P(X|θ) dY + ∫ q(Y) log [P(Y|X, θ) / q(Y)] dY
= ℓ(θ) - KL[q(Y) ‖ P(Y|X, θ)].
The second term is the Kullback-Leibler divergence.
This means that, for fixed θ, F is bounded above by ℓ, and achieves that bound when KL[q(Y) ‖ P(Y|X, θ)] = 0. But KL[q ‖ p] is zero if and only if q = p. So the E step simply sets
q^{(k)}(Y) = P(Y|X, θ^{(k-1)}),
and, after an E step, the free energy equals the likelihood.
21 KL[q(x) ‖ p(x)] is non-negative, and zero iff p(x) = q(x) for all x
First let's consider discrete distributions; the Kullback-Leibler divergence is:
KL[q ‖ p] = Σ_i q_i log(q_i / p_i).
To find the distribution q which minimises KL[q ‖ p] we add a Lagrange multiplier to enforce the normalisation constraint:
E := KL[q ‖ p] + λ(1 - Σ_i q_i) = Σ_i q_i log(q_i / p_i) + λ(1 - Σ_i q_i).
We then take partial derivatives and set them to zero:
∂E/∂q_i = log q_i - log p_i + 1 - λ = 0 ⇒ q_i = p_i exp(λ - 1);
∂E/∂λ = 1 - Σ_i q_i = 0 ⇒ Σ_i q_i = 1;
together, q_i = p_i.
22 KL[q(x) ‖ p(x)] is non-negative, and zero iff p(x) = q(x) for all x
Check that the curvature (Hessian) is positive (definite), corresponding to a minimum:
∂²E/∂q_i² = 1/q_i > 0, ∂²E/∂q_i ∂q_j = 0 (i ≠ j),
showing that q_i = p_i is a genuine minimum. At the minimum it is easily verified that KL[p ‖ p] = 0.
A similar proof holds for KL[· ‖ ·] between continuous densities, with the derivatives replaced by functional derivatives.
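The same conclusion can be checked numerically for discrete distributions (a sketch; the 10-state distributions drawn from a Dirichlet are arbitrary choices):

```python
import numpy as np

def kl(q, p):
    """Discrete KL[q || p] = sum_i q_i log(q_i / p_i), with 0 log 0 taken as 0."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(10))
for _ in range(1000):
    q = rng.dirichlet(np.ones(10))
    assert kl(q, p) >= 0.0               # non-negative for every q
assert abs(kl(p, p)) < 1e-12             # and exactly zero when q = p
```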
23 Coordinate Ascent in F (Demo)
One-parameter mixture:
s ~ Bernoulli[π], x | (s = 0) ~ N[-1, 1], x | (s = 1) ~ N[1, 1],
and one data point x_1 = 0.3.
q(s) is a distribution on a single binary latent variable, and so is represented by r_1 ∈ [0, 1].
24 Coordinate Ascent in F (Demo) [figure: F plotted as a function of the responsibility r and the mixing proportion π]
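The demo is easy to reproduce numerically. The sketch below uses the slide's assumptions (the one-parameter mixture and the single data point x_1 = 0.3) and alternates E and M steps, checking that after each E step the free energy touches the log likelihood and that the likelihood never decreases across iterations:

```python
import numpy as np
from scipy.stats import norm

x = 0.3                                       # the single data point from the slide
logp1 = norm.logpdf(x, loc=1.0, scale=1.0)    # log N(x; +1, 1)
logp0 = norm.logpdf(x, loc=-1.0, scale=1.0)   # log N(x; -1, 1)

def free_energy(r, pi):
    """F(q, pi) = <log p(x, s | pi)>_q + H[q], with q(s = 1) = r."""
    ent = -(r * np.log(r) + (1 - r) * np.log(1 - r))
    return r * (np.log(pi) + logp1) + (1 - r) * (np.log(1 - pi) + logp0) + ent

def loglik(pi):
    return np.log(pi * np.exp(logp1) + (1 - pi) * np.exp(logp0))

r, pi, ll_prev = 0.5, 0.5, -np.inf
for _ in range(20):
    # E step: set q(s = 1) to the posterior responsibility; F then equals the log likelihood
    r = pi * np.exp(logp1) / (pi * np.exp(logp1) + (1 - pi) * np.exp(logp0))
    assert abs(free_energy(r, pi) - loglik(pi)) < 1e-9
    # M step: with a single data point the optimal mixing proportion is simply pi = r
    pi = r
    assert loglik(pi) >= ll_prev - 1e-12      # the likelihood is non-decreasing
    ll_prev = loglik(pi)
print(pi, loglik(pi))
```

The second assertion previews the guarantee discussed on the next slide: full EM iterations never decrease the likelihood.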
25 EM Never Decreases the Likelihood
The E and M steps together never decrease the log likelihood:
ℓ(θ^{(k-1)}) = F(q^{(k)}, θ^{(k-1)})   [E step]
≤ F(q^{(k)}, θ^{(k)})   [M step]
≤ ℓ(θ^{(k)})   [Jensen].
The E step brings the free energy to the likelihood.
The M step maximises the free energy wrt θ.
F ≤ ℓ by Jensen's inequality or, equivalently, from the non-negativity of KL.
If the M step is executed so that θ^{(k)} ≠ θ^{(k-1)} iff F increases, then the overall EM iteration will step to a new value of θ iff the likelihood increases.
30 Fixed Points of EM are Stationary Points in ℓ
Let a fixed point of EM occur with parameter θ*. Then:
∂/∂θ ⟨log P(Y, X|θ)⟩_{P(Y|X, θ*)} |_{θ = θ*} = 0.
Now,
ℓ(θ) = log P(X|θ) = ⟨log P(X|θ)⟩_{P(Y|X, θ*)} = ⟨log [P(Y, X|θ) / P(Y|X, θ)]⟩_{P(Y|X, θ*)}
= ⟨log P(Y, X|θ)⟩_{P(Y|X, θ*)} - ⟨log P(Y|X, θ)⟩_{P(Y|X, θ*)},
so
d/dθ ℓ(θ) = d/dθ ⟨log P(Y, X|θ)⟩_{P(Y|X, θ*)} - d/dθ ⟨log P(Y|X, θ)⟩_{P(Y|X, θ*)}.
The second term is 0 at θ* if the derivative exists (it is the minimum of KL[P(Y|X, θ*) ‖ P(Y|X, θ)]), and thus:
d/dθ ℓ(θ) |_{θ = θ*} = d/dθ ⟨log P(Y, X|θ)⟩_{P(Y|X, θ*)} |_{θ = θ*} = 0.
So, EM converges to a stationary point of ℓ(θ).
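The key identity behind this argument, that dℓ/dθ at θ* equals the derivative of ⟨log P(Y, X|θ)⟩ taken under P(Y|X, θ*), can be checked by finite differences on the same toy mixture used in the demo (a sketch; x = 0.3 is the demo's data point and π* = 0.37 is an arbitrary test point):

```python
import numpy as np
from scipy.stats import norm

x = 0.3
a, b = norm.pdf(x, 1.0, 1.0), norm.pdf(x, -1.0, 1.0)     # component likelihoods

def loglik(pi):                          # l(pi) = log p(x | pi)
    return np.log(pi * a + (1 - pi) * b)

def expected_cll(pi, pi_star):           # <log p(x, s | pi)> under p(s | x, pi_star)
    r = pi_star * a / (pi_star * a + (1 - pi_star) * b)
    return r * (np.log(pi) + np.log(a)) + (1 - r) * (np.log(1 - pi) + np.log(b))

pi_star, eps = 0.37, 1e-6
dl   = (loglik(pi_star + eps) - loglik(pi_star - eps)) / (2 * eps)
dcll = (expected_cll(pi_star + eps, pi_star) - expected_cll(pi_star - eps, pi_star)) / (2 * eps)
assert abs(dl - dcll) < 1e-6             # the two gradients agree at pi = pi_star
print(dl, dcll)
```

Because the two gradients coincide at θ*, a zero EM gradient implies a zero likelihood gradient, which is exactly the statement above.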
38 Maxima in F correspond to maxima in ℓ
Let θ* now be the parameter value at a local maximum of F (and thus at a fixed point).
Differentiating the previous expression wrt θ again, we find:
d²/dθ² ℓ(θ) = d²/dθ² ⟨log P(Y, X|θ)⟩_{P(Y|X, θ*)} - d²/dθ² ⟨log P(Y|X, θ)⟩_{P(Y|X, θ*)}.
The first term on the right is negative (a maximum) and the second term is positive (a minimum). Thus the curvature of the likelihood is negative and θ* is a maximum of ℓ.
[... as long as the derivatives exist. They sometimes don't (e.g. zero-noise ICA).]
43 Partial M steps and Partial E steps
Partial M steps: the proof holds even if we merely increase F wrt θ rather than maximise it. (Dempster, Laird and Rubin (1977) call this the generalised EM, or GEM, algorithm.)
Partial E steps: we can also just increase F wrt some of the q's. For example, sparse or online versions of the EM algorithm would compute the posterior for a subset of the data points, or as the data arrive, respectively. You can also update the posterior over a subset of the hidden variables, while holding others fixed...
44 The Gaussian mixture model (E-step)
In a univariate Gaussian mixture model, the density of a data point x is:
p(x|θ) = Σ_{m=1}^k p(s = m|θ) p(x|s = m, θ) = Σ_{m=1}^k [π_m / (√(2π) σ_m)] exp{ -(x - μ_m)² / (2σ_m²) },
where θ is the collection of parameters: means μ_m, variances σ_m², and mixing proportions π_m = p(s = m|θ).
The hidden variable s_i indicates which component observation x_i belongs to. The E-step computes the posterior for s_i given the current parameters:
q(s_i) = p(s_i|x_i, θ) ∝ p(x_i|s_i, θ) p(s_i|θ),
r_{im} := q(s_i = m) ∝ (π_m / σ_m) exp{ -(x_i - μ_m)² / (2σ_m²) }   (responsibilities),
with the normalisation such that Σ_m r_{im} = 1.
47 The Gaussian mixture model (M-step)
In the M-step we optimise the sum (since s is discrete):
E = ⟨log p(x, s|θ)⟩_{q(s)} = Σ_s q(s) log[p(s|θ) p(x|s, θ)] = Σ_{i,m} r_{im} [ log π_m - log σ_m - (x_i - μ_m)² / (2σ_m²) ].
Optimisation is done by setting the partial derivatives of E to zero:
∂E/∂μ_m = Σ_i r_{im} (x_i - μ_m) / σ_m² = 0 ⇒ μ_m = Σ_i r_{im} x_i / Σ_i r_{im},
∂E/∂σ_m = Σ_i r_{im} [ -1/σ_m + (x_i - μ_m)² / σ_m³ ] = 0 ⇒ σ_m² = Σ_i r_{im} (x_i - μ_m)² / Σ_i r_{im},
∂E/∂π_m = (1/π_m) Σ_i r_{im} + λ = 0 ⇒ π_m = (1/n) Σ_i r_{im},
where λ is a Lagrange multiplier ensuring that the mixing proportions sum to unity.
48 Factor Analysis
[graphical model: latent factors y_1 ... y_K, each connected to observations x_1, x_2, ..., x_D]
Linear generative model: x_d = Σ_{k=1}^K Λ_{dk} y_k + ε_d, where
the y_k are independent N(0, 1) Gaussian factors,
the ε_d are independent N(0, Ψ_dd) Gaussian noise, and
K < D.
So, x is Gaussian with:
p(x) = ∫ p(y) p(x|y) dy = N(0, ΛΛᵀ + Ψ),
where Λ is a D × K matrix, and Ψ is diagonal.
Dimensionality reduction: finds a low-dimensional projection of high-dimensional data that captures the correlation structure of the data.
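A quick check of the stated marginal: sample from the generative model and compare the empirical covariance with ΛΛᵀ + Ψ (a sketch; the sizes D = 5, K = 2 and the particular random Λ and Ψ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 5, 2, 200_000
Lambda = rng.normal(size=(D, K))                 # factor loadings
Psi = np.diag(rng.uniform(0.1, 1.0, size=D))     # diagonal sensor noise

# Generate from the model: y ~ N(0, I_K), eps ~ N(0, Psi), x = Lambda y + eps
Y = rng.normal(size=(N, K))
X = Y @ Lambda.T + rng.normal(size=(N, D)) * np.sqrt(np.diag(Psi))

# The sample covariance should match the marginal covariance Lambda Lambda^T + Psi
print(np.max(np.abs(np.cov(X.T) - (Lambda @ Lambda.T + Psi))))   # small, up to Monte Carlo error
```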
49 EM for Factor Analysis
The model for x: p(x|θ) = ∫ p(y|θ) p(x|y, θ) dy = N(0, ΛΛᵀ + Ψ). Model parameters: θ = {Λ, Ψ}.
E step: for each data point x_n, compute the posterior distribution of hidden factors given the observed data: q_n(y) = p(y|x_n, θ_t).
M step: find the θ_{t+1} that maximises F(q, θ):
F(q, θ) = Σ_n ∫ q_n(y) [log p(y|θ) + log p(x_n|y, θ) - log q_n(y)] dy
= Σ_n ∫ q_n(y) [log p(y|θ) + log p(x_n|y, θ)] dy + c.
50 The E step for Factor Analysis
E step: for each data point x_n, compute the posterior distribution of hidden factors given the observed data: q_n(y) = p(y|x_n, θ) = p(y, x_n|θ) / p(x_n|θ).
Tactic: write p(y, x_n|θ), considering x_n to be fixed. What is this as a function of y?
p(y, x_n) = p(y) p(x_n|y)
= (2π)^{-K/2} exp{ -(1/2) yᵀy } |2πΨ|^{-1/2} exp{ -(1/2) (x_n - Λy)ᵀ Ψ^{-1} (x_n - Λy) }
= c exp{ -(1/2) [yᵀy + (x_n - Λy)ᵀ Ψ^{-1} (x_n - Λy)] }
= c exp{ -(1/2) [yᵀ(I + ΛᵀΨ^{-1}Λ)y - 2yᵀΛᵀΨ^{-1}x_n] }   (c absorbing factors independent of y)
= c exp{ -(1/2) [yᵀΣ^{-1}y - 2yᵀΣ^{-1}μ + μᵀΣ^{-1}μ] }.
So Σ = (I + ΛᵀΨ^{-1}Λ)^{-1} = I - βΛ and μ = ΣΛᵀΨ^{-1}x_n = βx_n, where β = ΣΛᵀΨ^{-1}.
Note that μ is a linear function of x_n and Σ does not depend on x_n.
51 The M step for Factor Analysis
M step: find the θ_{t+1} maximising F = Σ_n ∫ q_n(y) [log p(y|θ) + log p(x_n|y, θ)] dy + c.
log p(y|θ) + log p(x_n|y, θ) = c - (1/2) yᵀy - (1/2) log|Ψ| - (1/2) (x_n - Λy)ᵀ Ψ^{-1} (x_n - Λy).
Taking expectations over q_n(y) (and absorbing the θ-independent -(1/2)⟨yᵀy⟩ term into c):
= c - (1/2) log|Ψ| - (1/2) [x_nᵀΨ^{-1}x_n - 2x_nᵀΨ^{-1}Λ⟨y⟩ + ⟨yᵀΛᵀΨ^{-1}Λy⟩]
= c - (1/2) log|Ψ| - (1/2) [x_nᵀΨ^{-1}x_n - 2x_nᵀΨ^{-1}Λ⟨y⟩ + Tr[ΛᵀΨ^{-1}Λ⟨yyᵀ⟩]]
= c - (1/2) log|Ψ| - (1/2) [x_nᵀΨ^{-1}x_n - 2x_nᵀΨ^{-1}Λμ_n + Tr[ΛᵀΨ^{-1}Λ(μ_nμ_nᵀ + Σ)]].
Note that we don't need to know everything about q, just the expectations of y and yyᵀ under q (i.e. the expected sufficient statistics).
52 The M step for Factor Analysis (cont.)
F = c - (N/2) log|Ψ| - (1/2) Σ_n [x_nᵀΨ^{-1}x_n - 2x_nᵀΨ^{-1}Λμ_n + Tr[ΛᵀΨ^{-1}Λ(μ_nμ_nᵀ + Σ)]].
Taking derivatives wrt Λ and Ψ^{-1}, using ∂Tr[AB]/∂B = Aᵀ and ∂log|A|/∂A = A^{-ᵀ}:
∂F/∂Λ = Ψ^{-1} Σ_n x_nμ_nᵀ - Ψ^{-1} Λ (NΣ + Σ_n μ_nμ_nᵀ) = 0
⇒ Λ̂ = (Σ_n x_nμ_nᵀ)(NΣ + Σ_n μ_nμ_nᵀ)^{-1}
∂F/∂Ψ^{-1} = (N/2) Ψ - (1/2) Σ_n [x_nx_nᵀ - Λμ_nx_nᵀ - x_nμ_nᵀΛᵀ + Λ(μ_nμ_nᵀ + Σ)Λᵀ]
⇒ Ψ̂ = (1/N) Σ_n [x_nx_nᵀ - Λμ_nx_nᵀ - x_nμ_nᵀΛᵀ + Λ(μ_nμ_nᵀ + Σ)Λᵀ]
⇒ Ψ̂ = ΛΣΛᵀ + (1/N) Σ_n (x_n - Λμ_n)(x_n - Λμ_n)ᵀ   (squared residuals).
Note: we should actually only take derivatives wrt Ψ_dd, since Ψ is diagonal.
When Σ → 0 these become the equations for linear regression!
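Putting the E and M steps of the last three slides together gives a compact EM routine for factor analysis. This is a sketch, not the official course code: it assumes zero-mean data in an N × D array, and the initialisation and the tiny jitter added to Ψ are illustrative choices.

```python
import numpy as np

def em_fa(X, K, n_iter=100, seed=0):
    """EM for factor analysis, following the slides' E and M steps (zero-mean data assumed)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    Lam = rng.normal(size=(D, K))                        # Lambda: D x K factor loadings
    Psi = np.diag(np.var(X, axis=0))                     # diagonal noise covariance
    for _ in range(n_iter):
        # E step: Sigma = (I + Lam' Psi^-1 Lam)^-1,  mu_n = Sigma Lam' Psi^-1 x_n  (= beta x_n)
        PsiInvLam = np.linalg.solve(Psi, Lam)
        Sigma = np.linalg.inv(np.eye(K) + Lam.T @ PsiInvLam)
        Mu = X @ PsiInvLam @ Sigma                       # N x K matrix of posterior means
        # M step: Lam_hat = (sum_n x_n mu_n') (N Sigma + sum_n mu_n mu_n')^-1
        Syy = N * Sigma + Mu.T @ Mu
        Lam = np.linalg.solve(Syy, (X.T @ Mu).T).T       # (X' Mu) Syy^-1, using the symmetry of Syy
        # Psi_hat = diag[ Lam Sigma Lam' + (1/N) sum_n (x_n - Lam mu_n)(x_n - Lam mu_n)' ]
        R = X - Mu @ Lam.T
        Psi = np.diag(np.diag(Lam @ Sigma @ Lam.T + (R.T @ R) / N) + 1e-8)
    return Lam, Psi
```

Run on data drawn as in the snippet after slide 48, the learnt ΛΛᵀ + Ψ should approach the generating covariance; Λ itself is only identified up to an orthogonal rotation of the factors.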
53 Mixtures of Factor Analysers
Simultaneous clustering and dimensionality reduction.
p(x|θ) = Σ_k π_k N(μ_k, Λ_kΛ_kᵀ + Ψ),
where π_k is the mixing proportion for FA k, μ_k is its centre, Λ_k is its factor loading matrix, and Ψ is a common sensor noise model. θ = {{π_k, μ_k, Λ_k}_{k=1...K}, Ψ}.
We can think of this model as having two sets of hidden latent variables:
a discrete indicator variable s_n ∈ {1, ..., K};
for each factor analyser, a continuous factor vector y_{n,k} ∈ R^{D_k}.
p(x|θ) = Σ_{s_n=1}^K p(s_n|θ) ∫ p(y|s_n, θ) p(x_n|y, s_n, θ) dy.
As before, an EM algorithm can be derived for this model:
E step: infer the joint distribution of the latent variables, p(y_n, s_n|x_n, θ).
M step: maximise F with respect to θ.
54 EM for exponential families
Defn: p is in the exponential family for z = (y, x) if it can be written:
p(z|θ) = b(z) exp{θᵀ s(z)} / α(θ), where α(θ) = ∫ b(z) exp{θᵀ s(z)} dz.
E step: q(y) = p(y|x, θ).
M step: θ^{(k)} := argmax_θ F(q, θ), where
F(q, θ) = ∫ q(y) log p(y, x|θ) dy + H(q) = ∫ q(y) [θᵀ s(z) - log α(θ)] dy + const.
It is easy to verify that ∂ log α(θ)/∂θ = E[s(z)|θ]. Therefore, the M step solves:
∂F/∂θ = E_{q(y)}[s(z)] - E[s(z)|θ] = 0.
55 References
A. P. Dempster, N. M. Laird and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38.
R. M. Neal and G. E. Hinton (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (editor), Learning in Graphical Models, pp. 355-368. Dordrecht: Kluwer Academic Publishers. radford/ftp/emk.pdf
Z. Ghahramani and G. E. Hinton (1996). The EM algorithm for mixtures of factor analyzers. University of Toronto Technical Report CRG-TR-96-1.
56 Failure Modes of EM
EM can fail under a number of degenerate situations:
EM may converge to a bad local maximum.
The likelihood function may not be bounded above, e.g. a cluster responsible for a single data item can give arbitrarily large likelihood as its variance σ_m → 0.
The free energy may not be well defined (or may be infinite).
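The unbounded-likelihood failure is easy to see numerically: let one component's mean sit exactly on an isolated data point and shrink its variance (a sketch; the four-point toy data set and the fixed second component are made up purely to illustrate the collapse):

```python
import numpy as np
from scipy.stats import norm

x = np.array([0.0, 2.5, 3.1, 2.9])            # toy data with one isolated point at 0.0
for sigma in [1.0, 1e-1, 1e-3, 1e-6]:
    # component 1 collapses onto the isolated point; component 2 covers the rest
    lik = 0.5 * norm.pdf(x, loc=0.0, scale=sigma) + 0.5 * norm.pdf(x, loc=2.8, scale=1.0)
    print(sigma, np.sum(np.log(lik)))         # the log likelihood grows without bound as sigma -> 0
```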
57 Proof of the Matrix Inversion Lemma
(A + XBXᵀ)^{-1} = A^{-1} - A^{-1}X(B^{-1} + XᵀA^{-1}X)^{-1}XᵀA^{-1}.
Need to prove: (A^{-1} - A^{-1}X(B^{-1} + XᵀA^{-1}X)^{-1}XᵀA^{-1})(A + XBXᵀ) = I.
Expand:
I + A^{-1}XBXᵀ - A^{-1}X(B^{-1} + XᵀA^{-1}X)^{-1}Xᵀ - A^{-1}X(B^{-1} + XᵀA^{-1}X)^{-1}XᵀA^{-1}XBXᵀ.
Regroup:
= I + A^{-1}X( BXᵀ - (B^{-1} + XᵀA^{-1}X)^{-1}Xᵀ - (B^{-1} + XᵀA^{-1}X)^{-1}XᵀA^{-1}XBXᵀ )
= I + A^{-1}X( BXᵀ - (B^{-1} + XᵀA^{-1}X)^{-1}B^{-1}BXᵀ - (B^{-1} + XᵀA^{-1}X)^{-1}XᵀA^{-1}XBXᵀ )
= I + A^{-1}X( BXᵀ - (B^{-1} + XᵀA^{-1}X)^{-1}(B^{-1} + XᵀA^{-1}X)BXᵀ )
= I + A^{-1}X( BXᵀ - BXᵀ ) = I.
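The identity is also easy to verify numerically on random matrices (a sketch; the sizes n = 6, k = 2 and the diagonal choice for A are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2
A = np.diag(rng.uniform(1.0, 2.0, size=n))    # invertible A (diagonal, like Psi in factor analysis)
B = np.eye(k)                                 # invertible B
X = rng.normal(size=(n, k))

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + X @ B @ X.T)
rhs = Ainv - Ainv @ X @ np.linalg.inv(np.linalg.inv(B) + X.T @ Ainv @ X) @ X.T @ Ainv
print(np.max(np.abs(lhs - rhs)))              # ~1e-16: the two sides agree
```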