Sytactic Topic Models Supplemet Jorda Boyd-Graber Departmet of Computer Sciece 35 Olde Street Priceto Uiversity Priceto, NJ 08540 jbg@cs.priceto.edu David Blei Departmet of Computer Sciece 35 Olde Street Priceto Uiversity Priceto, NJ 08540 blei@cs.priceto.edu I this documet, we derive mea field variatioal iferece updates for the sytactic topic model. After recapitulatig the model, we expad the expectatios for each of the terms ad the derive updates for each of the variatioal parameters that maximie the likelihood boud. 1 Model As a referece, we first summarie the primary variables i our model. π p,i θ d,i How much we should wat to go ito topic i give that the paret topic was p. It is parameteried i the variatioal distributio by ν. How much we should wat to go ito topic i give that the documet is d. It is parameteried i the variatioal distributio by γ i the variatioal distributio. The topic of the th word. It is parameteried by φ i the variatioal distributio. β Top level proportios from DP. I the variatioal distributio, qβ β oly has support o β, which is ero for all idices greater tha K. τ i,j How much we should wat to geerate word j give that the word s topic is i. These variables are geerated through the followig process: 1. Choose global topic weights β GEMα 2. For each topic idex k = {1,... }: a Choose topic trasitio distributio π k DPα T, β. b Choose topic τ k Dirσ 3. For each documet d = {1,... M}: a Choose topic weights θ d DPα D, β. b For each setece i the documet: i. Choose topic assigmet 0 θ d π start ii. Choose root word w 0 mult1, τ 0 iii. For each additioal word w ad paret p, {1,... d } Choose topic assigmet θ d π p Choose word w mult1, τ 0 2 Variatioal Distributio Our variatioal distributio is factored ito qβ,, θ, π, τ β, φ, γ, ν = qβ β d qθ d γ d qπ ν q φ, 1 1
where qβ β is ot a full distributio but a degeerate poit estimate, γ d ad ν are variatioal Dirichlet distributios, ad φ is a topic multiomial for the th word. Our lower boud o the likelihood is Lγ, ν, φ; β, θ, π, τ E q [log pβ α log pθ α D, β log pπ α P, β log p θ, π log pw, τ log pτ σ E q [log qθ log qπ log q. 2 3 Documet-specific terms I our implemetatio, documet-specific variatioal parameters ad global variatioal parameters are computed separately because the documet-specific parameters ca be computed i parallel, greatly speedig the iferece process. We preset the derivatio of the variatioal updates i a similar maer, focusig o documet specific terms before global oes because it more closely follows the flow of the implemeted algorithm ad removes ueeded subscripts ad summatios. 3.1 Expadig expectatios We first expad the expectatio of the per-word topic term, both because it is the most difficult ad because it is the crux of this work. Rather tha drawig the topic of a word directly from a multiomial, it is chose from the reormalied poit-wise product of two multiomial distributios. I order to hadle the expectatio of the log sum itroduced by the reormaliatio, we itroduce a additioal variatioal parameter ω for each word via a Taylor approximatio of the logarithm to fid that N θ π p, E q [log p θ, π = E q [log K i θ i π p,i [ N N = E q log θ π p, log θ i π p,i = = N [ N [ E q log θ π p, E q θ i π p,i log ω 1 N [ N E q [log θ E q log πp, N N φ,i [ E q θi π p, log ω 1 N Ψγ i Ψ γ j φ,i φ p,j Ψν j,i Ψ ν j,k φ p,j i K γ K k ν log ω 1. 3 j,k Because the words come from a multiomial distributio, the probability of a word comig from a topic is simply its weight uder the multiomial. So takig the expectatio over its possible assigmets, we have [ N N E q [log pw, τ, ρ, t = E q log τ,w = φ i log τ i,w. 4 Fially, we take the expectatio of the documet topic multiomial; this is easier because the variatioal distributio oly has support at β ; this gives us E q [pθ d α D, β = log Γ α D,j β log Γα D,i β α D,i β 1 Ψγ i Ψ γ j. 5 2
Computig the etropy for the variatioal distributio terms which are specific to a sigle documet is relatively straightforward: E q [log q = N φ,i log φ,i E q [log qθ = log Γ γ j log Γγ i γ i 1 Ψγ i Ψ γ j We ow have all the terms ecessary to express the likelihood boud cotributio of a documet L d = log Γ α D,j β log Γα D,i β N α D,i β 1Ψγ i Ψ γ j N N φ,i N Ψγ i Ψ γ j φ,i φ p,j Ψν j,i Ψ ν j,k φ i log τ i,w φ p,j K γ K k ν log ω 1 j,k log Γ γ j log Γγ i γ i 1 Ψγ i Ψ γ j N 3.2 Variatioal updates φ,i log φ,i. 6 Because we seek to maximie Equatio 6, we compute the gradiet of the likelihood boud with respect to each variable. Whe possible, we ll explicitly solve for a value of the parameter that maximies the likelihood with respect to that coordiate. Otherwise, we ll fully express the likelihood fuctio ad gradiet for that variatioal parameter so that we ca use these expressios i a umerical optimiatio package. The variatioal parameters are preseted i a order correspodig roughly to the order i which they are used by the algorithm. 3.2.1 Topic ormalier slack variable First, we ll explicitly solve for the value of ω, the slack variable itroduced i the Taylor approximatio of Equatio 3. L = ω 2 φ p,j ω K γ K k ν 7 j,k ω = φ p,j K γ K k ν j,k 3
3.2.2 Variatioal topic multiomial Because φ is a variatioal multiomial, it must sum to oe. So we wat to maximie the likelihood boud augmeted by that costrait ad its correspodig Lagrage multiplier. L φi = L d λ φ,j 1 8 c c L φi = Ψγ i Ψ γ j φ p,j Ψν j,i Ψ ν j,k φ i φ c,j Ψν i,j Ψ ν i,k c c c j k γ k γ j ν i,j k ν i,k φ i log τ i,w log φ,i λ. 9 This the gives us, up to a costat, the value of φ that maximies our variatioal boud φ i exp Ψγ i Ψ γ j φ p,j Ψν j,i Ψ ν j,k c c c c c j k γ k γ j ν i,j k ν i,k φ c,j Ψν i,j Ψ ν i,k log τ i,w }. 10 3.2.3 Variatioal documet Dirichlet Because we couple π ad θ, the iteractio betwee these terms i the ormalier prevets us from solvig the optimiatio explicitly. But we still eed to derive the fuctioal form ad the derivative for the cojugate gradiet algorithm [1. I doig so, the followig derivative is useful: N x i f = α i N i=j x j f = α N i j i x j N j i α jx j x i N 2. 11 x i 4
First, we write the terms i L d that ivolve γ: L γ = α D,i β 1 Ψγ i Ψ γ j N φ,i Ψγ i Ψ γ j N ω 1 φ p,j K γ K k ν log ω 1 j,k log Γ γ j log Γγ i γ i 1 Ψγ i Ψ γ j. Now, we use Equatio 11 to compute the partial derivative with respect to γ i to be [ L N N N = Ψ γ i α D,i β φ,i γ i Ψ γ j α D,j β φ,j γ j γ i N N ω 1 ν j,i k j φ γ k N k j ν j,kγ k p,j N 2 γ. 13 N k ν j,k 4 Global variatioal terms I additio to the terms that are specific to the words of a sigle documet, we also have parameters that deal with the word probabilities withi topics, the trasitios betwee topics, ad the top-level topic weights. We retur to these terms ad, after derivig the rest of the likelihood boud, describe the updates for each of these global parameters. 4.1 Expadig expectatios Let s first cosider the terms associated with the variatioal Dirichlet associated with the topic trasitios from paret to child i the parse tree, which gives us E q [pπ d α P, β = log Γ α T,j β log Γα T,i β 12 α T,i β 1Ψν i Ψ ν j 14 E q [log qπ = log Γ ν j log Γν i ν i 1 Ψν i Ψ ν j 15 Aother global term that we have ot cosidered is for the top-level weights associated with β. The distributio is defied i terms of stick breakig proportios [2, so, followig the derivatio of Liag et al [3, we must trasform β ito these stick breakig proportios; this is accomplished by dividig each weight by the tail sum 1 T 1 β i, 16 5 i
the sum of the remaiig weights. This also itroduces a extra term for the chage of variables. E q [pβ α = E q [log GEMβ; α = log Betaβ /T ; 1, α = =1 =1 α 1 log T1 T log T K 1 log α log T = α 1 log T K log T C. 17 Lastly, the multiomial distributio over vocabulary terms per topic is straightforward V E q [log pτ σ = σ log τ i,v. 18 v=1 The total likelihood is, after icludig Equatio 6, the M L = d L d α 1 log T K log T log Γ α T,j β log Γα T,i β α T,i β 1Ψν i Ψ ν j log Γ ν j log Γν i ν i 1 Ψν i Ψ ν j v=1 4.2 Variatioal updates V σ log τ i,v. 19 These are the updates which require iput from all documets ad caot be parallelied. However, each of the documet-specific updates ca compute sufficiet statistics for these computatios which are summed together for the fial updates. 4.2.1 Variatioal trasitio Dirichlet Like our update for γ, the iteractio betwee π ad θ i the ormalier prevets us from solvig the optimiatio explicitly. First, cosider the variatioal Dirichlet for trasitios from topic i L νi = α P,j 1Ψν i,j Ψ ν i,k N c c φ,i φ c,j Ψν i,j Ψ ν i,k N φ,i ωc 1 γ j ν i,j K c c γ K k ν log ω 1 i,k log Γ ν i,j log Γν i,j ν i,j 1 Ψν i,j Ψ ν i,k. 20 6
Differetiatig this expressio, keepig i mid Equatio 11, gives L N = Ψ ν i,j α P,j φ,i φ c,j ν i,j ν i,j c c N Ψ ν i,k α P,k N φ,i ωc 1 c c which is sufficiet for cojugate gradiet optimiatio [1. 4.2.2 DP Process c c N γ j k j ν i,k N N ν j,k φ,i φ c,k ν i,k k j ν i,kγ k 2 N γ k, 21 The last variatioal parameter is β, which is the variatioal estimate of the top-level weights β. Note that we separate the fial term β K ad the previous weights; this is because β K is implicitly defied as 1 i=0 β i. L β = α 1 log T K log T [ M log Γα D d=1 i log Γα D β i log Γα D β K α D βi 1 Ψγ d,i Ψ γ d,j [ log Γα T i log Γα T β i log Γα T β K α T βi 1 Ψν k,i Ψ ν k,j. 22 Takig the derivative of this expressio, ad implicitly differetiatig β K gives us L β 1 βk = α 1 23 T T K =k1 M M α D Ψγ d,k Ψ γ d,j α D Ψγ d,k Ψ γ d,j d K α T Ψν,k Ψ ν,j α T d K Ψν,K Ψ ν,j K [α T Ψα T βk α T Ψα T βk M [α D Ψα D βk α D Ψα D βk 24 which we optimie through the barrier costraied optomiatio algorithm [4. Refereces [1 Galassi, M., J. Davies, J. Theiler, et al. Gu Scietific Library: Referece Maual. Network Theory Ltd., 2003. 7
[2 Pitma, J. Poisso-Dirichlet ad GEM ivariat distributios for split-ad-merge trasformatios of a iterval partitio. Combiatorics, Probability ad Computig, 11:501 514, 2002. [3 Liag, P., S. Petrov, M. Jorda, et al. The ifiite PCFG usig hierarchical Dirichlet processes. I Proceedigs of Emperical Methods i Natural Laguage Processig, pages 688 697. 2007. [4 Boyd, S., L. Vadeberghe. Covex Optimiatio. Cambridge Uiversity Press, 2004. 8