
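1 Gradient of Kullback-Leibler divergence

Let $\lambda$ and $\lambda'$ be two sets of natural parameters of an exponential family, that is,
\begin{align}
q(\beta; \lambda) = h(\beta) \exp\left( \lambda^\top t(\beta) - a(\lambda) \right). \tag{1}
\end{align}
The partial derivatives of their Kullback-Leibler divergence are given by
\begin{align}
\nabla_\lambda D_{\mathrm{KL}}(\lambda, \lambda')
&= \nabla_\lambda \mathbb{E}_\lambda\left[ \log \frac{q(\beta; \lambda)}{q(\beta; \lambda')} \right] \tag{2} \\
&= \nabla_\lambda \mathbb{E}_\lambda\left[ (\lambda - \lambda')^\top t(\beta) - a(\lambda) + a(\lambda') \right] \tag{3} \\
&= (\lambda - \lambda')^\top \nabla_\lambda \mathbb{E}_\lambda[t(\beta)] + \mathbb{E}_\lambda[t(\beta)] - \nabla_\lambda a(\lambda) \tag{4} \\
&= (\lambda - \lambda')^\top \nabla_\lambda^2 a(\lambda) + \nabla_\lambda a(\lambda) - \nabla_\lambda a(\lambda) \tag{5} \\
&= (\lambda - \lambda')^\top I(\lambda), \tag{6}
\end{align}
where we have used the exponential family identities
\begin{align}
\nabla_\lambda a(\lambda) = \mathbb{E}_\lambda[t(\beta)], \qquad \nabla_\lambda^2 a(\lambda) = I(\lambda). \tag{7}
\end{align}

2 Asymptotic behavior of trust-region method

Here we show that the trust-region update given below converges to a natural gradient step,
\begin{align}
d\lambda^* = \operatorname*{argmax}_{d\lambda} \left\{ \mathcal{L}(\lambda + d\lambda) - \xi D_{\mathrm{KL}}(\lambda + d\lambda, \lambda) \right\}. \tag{8}
\end{align}
For large $\xi$, $d\lambda^*$ will be close to zero, so that we can focus on the target function's lowest-order terms. For exponential families in canonical form, we have
\begin{align}
D_{\mathrm{KL}}(\lambda + d\lambda, \lambda)
&= \mathbb{E}_{\lambda + d\lambda}\left[ \log \frac{q(\beta; \lambda + d\lambda)}{q(\beta; \lambda)} \right] \tag{9} \\
&= \mathbb{E}_{\lambda + d\lambda}\left[ d\lambda^\top t(\beta) - a(\lambda + d\lambda) + a(\lambda) \right] \tag{10} \\
&= d\lambda^\top \nabla a(\lambda + d\lambda) - a(\lambda + d\lambda) + a(\lambda). \tag{11}
\end{align}
The gradient of the KL divergence in $d\lambda$ is thus given by
\begin{align}
\nabla_{d\lambda} D_{\mathrm{KL}}(\lambda + d\lambda, \lambda)
&= \nabla a(\lambda + d\lambda) + \nabla^2 a(\lambda + d\lambda)\, d\lambda - \nabla a(\lambda + d\lambda) \tag{12} \\
&= \nabla^2 a(\lambda + d\lambda)\, d\lambda. \tag{13}
\end{align}
Approximating the target function around $\lambda$ yields, up to a constant,
\begin{align}
d\lambda^\top \nabla \mathcal{L}(\lambda) - \frac{\xi}{2}\, d\lambda^\top \nabla^2 a(\lambda)\, d\lambda, \tag{14}
\end{align}
which when maximized gives an update proportional to the natural gradient direction,
\begin{align}
d\lambda^* = \frac{1}{\xi} \left( \nabla^2 a(\lambda) \right)^{-1} \nabla \mathcal{L}(\lambda). \tag{15}
\end{align}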
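The limit in Equation 15 can be checked numerically on a toy problem. The following is a minimal sketch, not part of the original derivation: it assumes a one-dimensional Bernoulli family with log-normalizer $a(\lambda) = \log(1 + e^\lambda)$, a hypothetical target `objective` chosen only for illustration, and a plain golden-section search for the maximization in Equation 8. All function names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def log_normalizer(lam):
    # a(lambda) for a Bernoulli distribution in canonical form.
    return math.log1p(math.exp(lam))


def kl(lam, step):
    # D_KL(lam + step, lam), following Equation 11 with grad a = sigmoid.
    return (step * sigmoid(lam + step)
            - log_normalizer(lam + step) + log_normalizer(lam))


def objective(lam):
    # Hypothetical target L(lambda): pull the Bernoulli mean towards 0.9.
    return -(sigmoid(lam) - 0.9) ** 2


def natural_gradient(lam):
    s = sigmoid(lam)
    grad = -2.0 * (s - 0.9) * s * (1.0 - s)  # dL/dlambda
    fisher = s * (1.0 - s)                   # second derivative of a
    return grad / fisher


def trust_region_step(lam, xi, lo=-1.0, hi=1.0, iters=100):
    # Golden-section search for argmax_d L(lam + d) - xi * KL(lam + d, lam).
    def target(step):
        return objective(lam + step) - xi * kl(lam, step)

    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        x1 = hi - ratio * (hi - lo)
        x2 = lo + ratio * (hi - lo)
        if target(x1) < target(x2):
            lo = x1  # the maximum lies to the right of x1
        else:
            hi = x2  # the maximum lies to the left of x2
    return 0.5 * (lo + hi)
```

Under these assumptions, `xi * trust_region_step(lam, xi)` should approach `natural_gradient(lam)` as `xi` grows, which is the scalar version of Equation 15.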

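3 Latent Dirichlet Allocation

In LDA we have global parameters $\beta$ consisting of distributions over words $\beta_k$ and local parameters $\theta_n, z_n$ with distributions
\begin{align}
p(\beta_k) &= \mathrm{Dir}(\beta_k; \eta), \tag{16} \\
p(\theta_n) &= \mathrm{Dir}(\theta_n; \alpha), \tag{17} \\
p(z_{nm} \mid \theta_n) &= \theta_{n z_{nm}}, \tag{18} \\
p(x_{nm} \mid z_{nm}, \beta) &= \beta_{z_{nm} x_{nm}}. \tag{19}
\end{align}
We approximate the posterior distribution over $\beta$ with
\begin{align}
q(\beta_k) = \mathrm{Dir}(\beta_k; \lambda_k). \tag{20}
\end{align}
Writing the likelihood in the exponential-family form of the main paper gives
\begin{align}
p(x_n, z_n, \theta_n \mid \beta)
&= \mathrm{Dir}(\theta_n; \alpha) \prod_m \theta_{n z_{nm}} \beta_{z_{nm} x_{nm}} \tag{21} \\
&= h(\theta_n, z_n) \exp\left( \sum_m \log \beta_{z_{nm} x_{nm}} \right) \tag{22} \\
&= h(\theta_n, z_n) \exp\left( \langle t(\beta), f(z_n, x_n) \rangle \right), \tag{23}
\end{align}
where $h$ encompasses all terms which do not depend on $\beta$ and
\begin{align}
t(\beta) &= \log \beta, \tag{24} \\
f(z_n, x_n) &= \sum_m I_{z_{nm} x_{nm}}, \tag{25}
\end{align}
where $I_{ij}$ is a matrix with entry $(i, j)$ set to 1 and all other entries set to 0. We here assume that $\beta$ is a matrix whose rows are the topics $\beta_k$ and
\begin{align}
\langle A, B \rangle = \mathrm{vec}(A)^\top \mathrm{vec}(B) = \mathrm{tr}(A^\top B) \tag{26}
\end{align}
for matrices $A$ and $B$. Since the standard parametrization of the Dirichlet distribution is already in canonical form, we can immediately apply Equation 10 of the main paper to get the steps of the inner loop of the trust-region update,
\begin{align}
\lambda = (1 - \rho_t) \lambda_t + \rho_t \left( \eta + N\, \mathbb{E}[f(z_n, x_n)] \right). \tag{27}
\end{align}
The beliefs over $z_n$, i.e., the distribution with respect to which the expectation in Equation 27 is taken, are computed in the usual manner [Blei et al., 2003].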
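As an illustration of Equation 27, the sketch below computes the expected sufficient statistics of a single document and applies one inner-loop step. It is a minimal sketch under stated assumptions: the beliefs over the topic assignments are taken as given (a list `phi` with one probability vector per word), the prior parameter `eta` is assumed to be a scalar shared by all entries, and the function and argument names are hypothetical.

```python
def expected_counts(words, phi, num_topics, vocab_size):
    # E[f(z_n, x_n)]: a K x V matrix that accumulates phi[m][k] at
    # entry (k, x_nm) for every word position m of the document.
    counts = [[0.0] * vocab_size for _ in range(num_topics)]
    for word, resp in zip(words, phi):
        for k in range(num_topics):
            counts[k][word] += resp[k]
    return counts


def inner_loop_step(lam_t, eta, rho, num_docs, counts):
    # Equation 27: lam = (1 - rho) * lam_t + rho * (eta + N * E[f]).
    return [
        [(1.0 - rho) * lam_t[k][v] + rho * (eta + num_docs * counts[k][v])
         for v in range(len(lam_t[k]))]
        for k in range(len(lam_t))
    ]
```

For a document with word indices `[0, 2, 2]` and two topics, `expected_counts` returns a 2 x 3 matrix whose columns sum to the number of occurrences of each word, and `inner_loop_step` interpolates between the previous iterate and the scaled single-document estimate.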

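4 Mixture models

Consider a mixture model where the local parameters are the cluster assignments $k_n$ and the global parameters are the prior weights $\pi$ and the components' parameters $\beta$. We assume the following more specific form for the model,
\begin{align}
p(\pi) &= \mathrm{Dir}(\pi; \alpha), \tag{28} \\
p(k_n \mid \pi) &= \pi_{k_n}, \tag{29} \\
p(\beta_k) &\propto h(\beta_k) \exp\left( \eta^\top t(\beta_k) \right), \tag{30} \\
p(x_n \mid k_n, \beta) &= g(x_n) \exp\left( t(\beta_{k_n})^\top f(x_n) \right) \tag{31}
\end{align}
for suitable functions $t, f, g, h$. The factors of a mean-field approximation to the posterior are given by
\begin{align}
q(\pi) &= \mathrm{Dir}(\pi; \gamma), \tag{32} \\
q(k_n) &= \phi_{n k_n}, \tag{33} \\
q(\beta_k) &\propto h(\beta_k) \exp\left( \lambda_k^\top t(\beta_k) \right). \tag{34}
\end{align}
In each iteration, the trust-region method alternates between computing the optimal $\phi_n$ and updating $\gamma$ and $\lambda$. The approximate posterior over mixture components, slightly abusing notation, is given by
\begin{align}
\phi_n^* &= \operatorname*{argmax}_{\phi_n} \mathcal{L}(\lambda, \gamma, \phi_n), \tag{35} \\
\phi_{nk}^* &\propto \exp\left( \mathbb{E}_q[\log p(x_n \mid \beta_k)\, p(k \mid \pi)] \right) \tag{36} \\
&= \exp\left( \mathbb{E}_q[\log p(x_n \mid \beta_k)] \right) \exp\left( \psi(\gamma_k) - \psi\Big(\sum_j \gamma_j\Big) \right), \tag{37}
\end{align}
where for the expected log-likelihood we have
\begin{align}
\mathbb{E}_q[\log p(x_n \mid \beta_k)] = \mathbb{E}_q[t(\beta_k)]^\top f(x_n) + \log g(x_n). \tag{38}
\end{align}
Once $\phi_n^*$ is computed, we update $\gamma$ and $\lambda$ via
\begin{align}
\gamma_k &= (1 - \rho_t) \gamma_{tk} + \rho_t \left( \alpha_k + N \phi_{nk}^* \right), \tag{39} \\
\lambda_k &= (1 - \rho_t) \lambda_{tk} + \rho_t \left( \eta + N \phi_{nk}^* f(x_n) \right). \tag{40}
\end{align}
For mini-batches of size $B$, these updates become
\begin{align}
\gamma_k &= (1 - \rho_t) \gamma_{tk} + \rho_t \left( \alpha_k + \frac{N}{B} \sum_n \phi_{nk}^* \right), \tag{41} \\
\lambda_k &= (1 - \rho_t) \lambda_{tk} + \rho_t \left( \eta + \frac{N}{B} \sum_n \phi_{nk}^* f(x_n) \right). \tag{42}
\end{align}
To use these results with any concrete mixture model, we have to write down the prior over $\beta$ in canonical form and implement the expected log-likelihood (Equation 38) and the update in Equation 42.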
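The model-independent part of this recipe can be sketched as follows. This is an illustrative sketch rather than a reference implementation: the expected log-likelihoods of Equation 38 are assumed to be supplied by the concrete model as a list `exp_loglik`, the Dirichlet prior is assumed symmetric with scalar `alpha`, and `digamma` is a small hand-written approximation (recurrence plus asymptotic series) used only to keep the example free of dependencies.

```python
import math


def digamma(x):
    # Digamma for x > 0: shift the argument upwards with
    # psi(x) = psi(x + 1) - 1 / x, then apply the asymptotic series.
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def responsibilities(exp_loglik, gamma):
    # Equation 37, normalized over components. The maximum is subtracted
    # before exponentiating for numerical stability.
    total = digamma(sum(gamma))
    logits = [ell + digamma(g) - total for ell, g in zip(exp_loglik, gamma)]
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    norm = sum(weights)
    return [w / norm for w in weights]


def update_gamma(gamma_t, alpha, rho, num_points, batch_phi):
    # Equation 41 with a symmetric prior alpha; batch_phi holds one
    # responsibility vector per data point of the mini-batch.
    scale = num_points / len(batch_phi)
    return [
        (1.0 - rho) * g
        + rho * (alpha + scale * sum(phi[k] for phi in batch_phi))
        for k, g in enumerate(gamma_t)
    ]
```

The update of $\lambda_k$ in Equation 42 has the same structure, with each responsibility additionally multiplied by the model-specific statistics $f(x_n)$.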

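4.1 Mixture of multivariate Bernoullis

For the multivariate Bernoulli model,
\begin{align}
p(x_n \mid k_n = k, \beta) = \prod_i \beta_{ki}^{x_{ni}} (1 - \beta_{ki})^{1 - x_{ni}}, \tag{43}
\end{align}
we assume beta distributions for the prior and approximate posterior,
\begin{align}
p(\beta_k) &\propto \prod_i \beta_{ki}^{a} (1 - \beta_{ki})^{b}, \tag{44} \\
q(\beta_k) &\propto \prod_i \beta_{ki}^{a_{ki}} (1 - \beta_{ki})^{b_{ki}}. \tag{45}
\end{align}
$\lambda_k = (a_k, b_k)$ are already the natural parameters of the beta distribution, so that the updates (Equation 42) become
\begin{align}
a_k &= (1 - \rho_t) a_{tk} + \rho_t \left( a + \frac{N}{B} \sum_n \phi_{nk} x_n \right), \tag{46} \\
b_k &= (1 - \rho_t) b_{tk} + \rho_t \left( b + \frac{N}{B} \sum_n \phi_{nk} (1 - x_n) \right). \tag{47}
\end{align}
The expected log-likelihood needed for the computation of $\phi_n$ is given by
\begin{align}
\mathbb{E}_q[\log p(x_n \mid \beta_k)] = x_n^\top \psi(a_k) + (1 - x_n)^\top \psi(b_k) - 1^\top \psi(a_k + b_k), \tag{48}
\end{align}
where the digamma function $\psi$ is applied point-wise and $1$ is a vector of ones. If $a_{ki}$ and $b_{ki}$ are read as the exponents in Equation 45, the corresponding Beta shape parameters are $a_{ki} + 1$ and $b_{ki} + 1$, and the arguments of $\psi$ shift accordingly.

4.2 Mixture of Gaussians

We assume a normal-inverse-Wishart distribution (NIW) for the parameters $\beta_k = (\mu_k, \Sigma_k)$ of each Gaussian distribution. The NIW for a single component is given by
\begin{align}
p(\mu, \Sigma)
&\propto \exp\left( -\frac{s}{2} (\mu - m)^\top \Sigma^{-1} (\mu - m) \right) \exp\left( -\frac{1}{2} \mathrm{tr}(\Psi \Sigma^{-1}) \right) |\Sigma|^{-\frac{\nu + D + 1}{2}} \tag{49} \\
&= \exp\left( -\frac{s}{2} \mu^\top \Sigma^{-1} \mu + s m^\top \Sigma^{-1} \mu - \frac{1}{2} \mathrm{tr}(s m m^\top \Sigma^{-1}) - \frac{1}{2} \mathrm{tr}(\Psi \Sigma^{-1}) - \frac{\nu}{2} \log |\Sigma| - \frac{D + 1}{2} \log |\Sigma| \right) \tag{50} \\
&= h(\Sigma) \exp\left( \eta(m, s, \Psi, \nu)^\top t(\mu, \Sigma) \right), \tag{51}
\end{align}
where $h(\Sigma) = |\Sigma|^{-\frac{D + 1}{2}}$ collects the factor that does not depend on the parameters and
\begin{align}
\eta(m, s, \Psi, \nu) &= \left( s,\; s m,\; \mathrm{vec}(s m m^\top + \Psi),\; \nu \right) \tag{52} \\
&= \left( s,\; b,\; \mathrm{vec}\, C,\; \nu \right) = \eta, \tag{53} \\
t(\mu, \Sigma) &= \left( -\frac{1}{2} \mu^\top \Sigma^{-1} \mu,\; \Sigma^{-1} \mu,\; -\frac{1}{2} \mathrm{vec}\, \Sigma^{-1},\; -\frac{1}{2} \log |\Sigma| \right) \tag{54}
\end{align}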

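are the natural parameters and sufficient statistics of the distribution, respectively. The likelihood for a single data point $x_n$ is given by
\begin{align}
p(x_n \mid \mu, \Sigma)
&\propto \exp\left( -\frac{1}{2} (x_n - \mu)^\top \Sigma^{-1} (x_n - \mu) \right) / |\Sigma|^{\frac{1}{2}} \tag{55} \\
&= \exp\left( -\frac{1}{2} \mathrm{tr}(\Sigma^{-1} x_n x_n^\top) + x_n^\top \Sigma^{-1} \mu - \frac{1}{2} \mu^\top \Sigma^{-1} \mu - \frac{1}{2} \log |\Sigma| \right) \tag{56} \\
&= \exp\left( t(\mu, \Sigma)^\top f(x_n) \right), \tag{57}
\end{align}
where
\begin{align}
f(x_n) = \left( 1,\; x_n,\; \mathrm{vec}\, x_n x_n^\top,\; 1 \right). \tag{58}
\end{align}
Hence, assuming the natural parameters of the prior of all components are given by
\begin{align}
\eta = \eta(m, s, \Psi, \nu) = \left( s,\; b,\; \mathrm{vec}\, C,\; \nu \right), \tag{59}
\end{align}
and writing the corresponding parameters of $q(\beta_k)$ as $(s_k, b_k, \mathrm{vec}\, C_k, \nu_k)$, an update of the inner loop of the trust-region method (Equation 42) is given by
\begin{align}
s_k &= (1 - \rho_t) s_{tk} + \rho_t \left( s + \frac{N}{B} \sum_n \phi_{nk} \right), \tag{60} \\
b_k &= (1 - \rho_t) b_{tk} + \rho_t \left( b + \frac{N}{B} \sum_n \phi_{nk} x_n \right), \tag{61} \\
C_k &= (1 - \rho_t) C_{tk} + \rho_t \left( C + \frac{N}{B} \sum_n \phi_{nk} x_n x_n^\top \right), \tag{62} \\
\nu_k &= (1 - \rho_t) \nu_{tk} + \rho_t \left( \nu + \frac{N}{B} \sum_n \phi_{nk} \right). \tag{63}
\end{align}
Or, in terms of the more traditional parametrization,
\begin{align}
m_k &= (1 - \rho_t) \frac{s_{tk}}{s_k} m_{tk} + \frac{\rho_t}{s_k} \left( s m + \frac{N}{B} \sum_n \phi_{nk} x_n \right), \tag{64} \\
\Psi_k &= (1 - \rho_t) \left( \Psi_{tk} + s_{tk} m_{tk} m_{tk}^\top \right) + \rho_t \left( \Psi + s m m^\top + \frac{N}{B} \sum_n \phi_{nk} x_n x_n^\top \right) - s_k m_k m_k^\top. \tag{65}
\end{align}
A proof that these updates leave $\Psi_k$ positive definite is given below.

Positive definiteness of $\Psi_k$

We show positive definiteness of $\Psi_k$ in two steps. First, we show that the constraint is fulfilled for $\rho_t = 0$ and $\rho_t = 1$. Second, we show that the set of valid natural parameters induced by the constraint is convex, implying that the constraint must be fulfilled for any linear interpolation of two natural parameters.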

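For $\rho_t = 0$, we have $\Psi_k = \Psi_{tk}$ since none of the natural parameters has changed. Thus, $\Psi_k$ is positive definite if $\Psi_{tk}$ is positive definite. For $\rho_t = 1$, we have
\begin{align}
\Psi_k &= \Psi + s m m^\top + \frac{N}{B} \sum_n \phi_{nk} x_n x_n^\top - s_k m_k m_k^\top, \tag{66} \\
s_k m_k m_k^\top &= \frac{1}{s_k} \left( s m + \frac{N}{B} \sum_n \phi_{nk} x_n \right) \left( s m + \frac{N}{B} \sum_n \phi_{nk} x_n \right)^\top, \tag{67} \\
s_k &= s + \frac{N}{B} \sum_n \phi_{nk}. \tag{68}
\end{align}
Note that $p_0 = s / s_k$ and $p_n = \frac{N}{B} \phi_{nk} / s_k$ are positive and sum to one and therefore can be considered probabilities. Let $X$ be a random variable which takes on value $m$ with probability $p_0$ and value $x_n$ with probability $p_n$. Then we can rewrite Equation 66 as
\begin{align}
\Psi_k = \Psi + s_k \mathbb{E}_p[X X^\top] - s_k \mathbb{E}_p[X] \mathbb{E}_p[X]^\top = \Psi + s_k \mathbb{V}[X]. \tag{69}
\end{align}
Since $\Psi$ is positive definite and the covariance matrix $\mathbb{V}[X]$ is at least positive semi-definite, $\Psi_k$ must be positive definite.

We next show that the set of valid natural parameters,
\begin{align}
\left\{ (s, b, C, \nu) : s > 0,\; \nu > D - 1,\; C - \tfrac{1}{s} b b^\top \text{ is p.d.} \right\}, \tag{70}
\end{align}
is convex. Note that this set is convex iff
\begin{align}
\left\{ (s, b, C) : s > 0,\; C - \tfrac{1}{s} b b^\top \text{ is p.d.} \right\} \tag{71}
\end{align}
is convex. This set is convex iff
\begin{align}
\left\{ (s, b, C) : s > 0,\; C - b b^\top \text{ is p.d.} \right\} \tag{72}
\end{align}
is convex, since perspective functions preserve convexity and $P(s, b, C) = (s, b/s, C/s)$, which maps the set in Equation 71 onto the set in Equation 72, acts as a perspective function on $(b, C)$. Finally, this set is convex iff the following set is convex,
\begin{align}
\Omega = \left\{ (b, C) : C - b b^\top \text{ is p.d.} \right\}. \tag{73}
\end{align}
Assume $(b_1, C_1), (b_2, C_2) \in \Omega$ and let $\rho \in [0, 1]$. Then
\begin{align}
&\rho C_1 + (1 - \rho) C_2 - \left( \rho b_1 + (1 - \rho) b_2 \right) \left( \rho b_1 + (1 - \rho) b_2 \right)^\top \tag{74} \\
&\quad = \rho C_1 + (1 - \rho) C_2 - \left( \rho (b_1 - b_2) + b_2 \right) \left( b_1 + (1 - \rho)(b_2 - b_1) \right)^\top \tag{75} \\
&\quad = \rho \left( C_1 - b_1 b_1^\top \right) + (1 - \rho) \left( C_2 - b_2 b_2^\top \right) + \rho (1 - \rho) (b_1 - b_2)(b_1 - b_2)^\top, \tag{76}
\end{align}
which is a sum of positive definite and positive semi-definite matrices and therefore positive definite. Hence,
\begin{align}
\left( \rho b_1 + (1 - \rho) b_2,\; \rho C_1 + (1 - \rho) C_2 \right) \in \Omega \tag{77}
\end{align}
and $\Omega$ is convex.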
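The argument above can be illustrated numerically. The sketch below is a minimal example under stated assumptions: it implements the natural-parameter updates of Equations 60 to 63 for a single component with plain Python lists, recovers $\Psi_k = C_k - b_k b_k^\top / s_k$, and tests positive definiteness for the two-dimensional case only. The function names, and the argument `scale` standing for $N / B$, are hypothetical.

```python
def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]


def psi_from_natural(s, b, C):
    # Psi = C - b b^T / s, the inverse of the map in Equations 52 and 53.
    bb = outer(b, b)
    dim = len(b)
    return [[C[i][j] - bb[i][j] / s for j in range(dim)] for i in range(dim)]


def is_positive_definite_2x2(M):
    # Sylvester's criterion for a symmetric 2 x 2 matrix.
    return M[0][0] > 0.0 and M[0][0] * M[1][1] - M[0][1] * M[1][0] > 0.0


def mix(rho, old, new):
    return (1.0 - rho) * old + rho * new


def niw_update(rho, old, prior, weights, points, scale):
    # Equations 60-63 for one component; `old` and `prior` are tuples
    # (s, b, C, nu) and `weights` are the responsibilities phi_nk.
    s0, b0, C0, nu0 = prior
    s_t, b_t, C_t, nu_t = old
    dim = len(b0)
    pairs = list(zip(weights, points))
    mass = scale * sum(weights)
    first = [scale * sum(w * x[i] for w, x in pairs) for i in range(dim)]
    second = [[scale * sum(w * x[i] * x[j] for w, x in pairs)
               for j in range(dim)] for i in range(dim)]
    s = mix(rho, s_t, s0 + mass)
    b = [mix(rho, b_t[i], b0[i] + first[i]) for i in range(dim)]
    C = [[mix(rho, C_t[i][j], C0[i][j] + second[i][j])
          for j in range(dim)] for i in range(dim)]
    nu = mix(rho, nu_t, nu0 + mass)
    return s, b, C, nu
```

Starting from valid prior and previous parameters, `psi_from_natural` applied to the output of `niw_update` should pass `is_positive_definite_2x2` for every `rho` in $[0, 1]$, in line with the convexity argument.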

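Expected log-likelihood

To compute the expected log-likelihood (Equation 38), we need
\begin{align}
\mathbb{E}_q[\mu^\top \Sigma^{-1} \mu] &= \mathbb{E}_q\left[ \mathbb{E}_q[\mu^\top \Sigma^{-1} \mu \mid \Sigma] \right] \tag{78} \\
&= \mathbb{E}_q\left[ \mathrm{tr}(s_k^{-1} \Sigma \Sigma^{-1}) + m_k^\top \Sigma^{-1} m_k \right] \tag{79} \\
&= D / s_k + \nu_k m_k^\top \Psi_k^{-1} m_k, \tag{80} \\
\mathbb{E}_q[\Sigma^{-1} \mu]^\top x_n &= \nu_k m_k^\top \Psi_k^{-1} x_n, \tag{81} \\
\mathbb{E}_q[x_n^\top \Sigma^{-1} x_n] &= \nu_k x_n^\top \Psi_k^{-1} x_n, \tag{82} \\
\mathbb{E}_q[\log |\Sigma^{-1}|] &= \sum_{i=1}^{D} \psi\left( \frac{\nu_k + 1 - i}{2} \right) + D \log 2 - \log |\Psi_k|. \tag{83}
\end{align}
For Equation 79, see Petersen and Pedersen [2008]. For Equation 83, see Bishop [2006]. The expected log-likelihood is thus given by
\begin{align}
\mathbb{E}_q[\log p(x_n \mid \beta_k)]
&= \mathbb{E}_q\left[ -\frac{1}{2} x_n^\top \Sigma^{-1} x_n + x_n^\top \Sigma^{-1} \mu - \frac{1}{2} \mu^\top \Sigma^{-1} \mu + \frac{1}{2} \log |\Sigma^{-1}| - \frac{D}{2} \log(2\pi) \right] \tag{84} \\
&= -\frac{\nu_k}{2} x_n^\top \Psi_k^{-1} x_n + \nu_k x_n^\top \Psi_k^{-1} m_k - \frac{D}{2 s_k} - \frac{\nu_k}{2} m_k^\top \Psi_k^{-1} m_k \tag{85} \\
&\quad + \frac{1}{2} \left( \sum_{i=1}^{D} \psi\left( \frac{\nu_k + 1 - i}{2} \right) + D \log 2 - \log |\Psi_k| \right) - \frac{D}{2} \log(2\pi) \tag{86} \\
&= -\frac{\nu_k}{2} (x_n - m_k)^\top \Psi_k^{-1} (x_n - m_k) - \frac{D}{2 s_k} + \frac{1}{2} \sum_{i=1}^{D} \psi\left( \frac{\nu_k + 1 - i}{2} \right) - \frac{1}{2} \log |\Psi_k| - \frac{D}{2} \log \pi. \tag{87}
\end{align}

References

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

K. Petersen and M. Pedersen. The Matrix Cookbook, 2008.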