Syntactic Topic Models Supplement

Σχετικά έγγραφα
Diane Hu LDA for Audio Music April 12, 2010

1. For each of the following power series, find the interval of convergence and the radius of convergence:

Homework for 1/27 Due 2/5

Last Lecture. Biostatistics Statistical Inference Lecture 19 Likelihood Ratio Test. Example of Hypothesis Testing.

INTEGRATION OF THE NORMAL DISTRIBUTION CURVE

The Heisenberg Uncertainty Principle

derivation of the Laplacian from rectangular to spherical coordinates

FREE VIBRATION OF A SINGLE-DEGREE-OF-FREEDOM SYSTEM Revision B

Degenerate Perturbation Theory

Srednicki Chapter 55

Ψηφιακή Επεξεργασία Εικόνας

HOMEWORK 4 = G. In order to plot the stress versus the stretch we define a normalized stretch:


Introduction of Numerical Analysis #03 TAGAMI, Daisuke (IMI, Kyushu University)

The Simply Typed Lambda Calculus

Solve the difference equation

Bessel function for complex variable

4.6 Autoregressive Moving Average Model ARMA(1,1)

Solutions: Homework 3

SUPERPOSITION, MEASUREMENT, NORMALIZATION, EXPECTATION VALUES. Reading: QM course packet Ch 5 up to 5.6

Outline. M/M/1 Queue (infinite buffer) M/M/1/N (finite buffer) Networks of M/M/1 Queues M/G/1 Priority Queue

DESIGN OF MACHINERY SOLUTION MANUAL h in h 4 0.

p n r

Presentation of complex number in Cartesian and polar coordinate system

On Generating Relations of Some Triple. Hypergeometric Functions

Homework 4.1 Solutions Math 5110/6830

Bayesian statistics. DS GA 1002 Probability and Statistics for Data Science.

The partial derivatives of their Kullback-Leibler divergence are given by

α β

6.1. Dirac Equation. Hamiltonian. Dirac Eq.

Inertial Navigation Mechanization and Error Equations

Notes on the Open Economy

A study on generalized absolute summability factors for a triangular matrix

n r f ( n-r ) () x g () r () x (1.1) = Σ g() x = Σ n f < -n+ r> g () r -n + r dx r dx n + ( -n,m) dx -n n+1 1 -n -1 + ( -n,n+1)

department listing department name αχχουντσ ϕανε βαλικτ δδσϕηασδδη σδηφγ ασκϕηλκ τεχηνιχαλ αλαν ϕουν διξ τεχηνιχαλ ϕοην µαριανι

MATH 38061/MATH48061/MATH68061: MULTIVARIATE STATISTICS Solutions to Problems on Matrix Algebra

Tired Waiting in Queues? Then get in line now to learn more about Queuing!

Solution Series 9. i=1 x i and i=1 x i.

Lecture 17: Minimum Variance Unbiased (MVUB) Estimators

b. Use the parametrization from (a) to compute the area of S a as S a ds. Be sure to substitute for ds!

Example of the Baum-Welch Algorithm

Finite Field Problems: Solutions

3.4 SUM AND DIFFERENCE FORMULAS. NOTE: cos(α+β) cos α + cos β cos(α-β) cos α -cos β

ST5224: Advanced Statistical Theory II

Phys460.nb Solution for the t-dependent Schrodinger s equation How did we find the solution? (not required)

Section 8.3 Trigonometric Equations

DERIVATION OF MILES EQUATION Revision D

Other Test Constructions: Likelihood Ratio & Bayes Tests

C.S. 430 Assignment 6, Sample Solutions

Variational Wavefunction for the Helium Atom

IIT JEE (2013) (Trigonomtery 1) Solutions

Areas and Lengths in Polar Coordinates

Exercises 10. Find a fundamental matrix of the given system of equations. Also find the fundamental matrix Φ(t) satisfying Φ(0) = I. 1.

CHAPTER 103 EVEN AND ODD FUNCTIONS AND HALF-RANGE FOURIER SERIES

Homework 3 Solutions

Statistical Inference I Locally most powerful tests

Tutorial on Multinomial Logistic Regression

Areas and Lengths in Polar Coordinates

Lifting Entry (continued)

Tridiagonal matrices. Gérard MEURANT. October, 2008

Concrete Mathematics Exercises from 30 September 2016

ΚΥΠΡΙΑΚΗ ΕΤΑΙΡΕΙΑ ΠΛΗΡΟΦΟΡΙΚΗΣ CYPRUS COMPUTER SOCIETY ΠΑΓΚΥΠΡΙΟΣ ΜΑΘΗΤΙΚΟΣ ΔΙΑΓΩΝΙΣΜΟΣ ΠΛΗΡΟΦΟΡΙΚΗΣ 6/5/2006

Partial Trace and Partial Transpose

SCHOOL OF MATHEMATICAL SCIENCES G11LMA Linear Mathematics Examination Solutions

: Monte Carlo EM 313, Louis (1982) EM, EM Newton-Raphson, /. EM, 2 Monte Carlo EM Newton-Raphson, Monte Carlo EM, Monte Carlo EM, /. 3, Monte Carlo EM

Parametrized Surfaces

ΚΥΠΡΙΑΚΗ ΕΤΑΙΡΕΙΑ ΠΛΗΡΟΦΟΡΙΚΗΣ CYPRUS COMPUTER SOCIETY ΠΑΓΚΥΠΡΙΟΣ ΜΑΘΗΤΙΚΟΣ ΔΙΑΓΩΝΙΣΜΟΣ ΠΛΗΡΟΦΟΡΙΚΗΣ 19/5/2007

Matrices and vectors. Matrix and vector. a 11 a 12 a 1n a 21 a 22 a 2n A = b 1 b 2. b m. R m n, b = = ( a ij. a m1 a m2 a mn. def

Math221: HW# 1 solutions

Econ 2110: Fall 2008 Suggested Solutions to Problem Set 8 questions or comments to Dan Fetter 1

( ) 2 and compare to M.

Lecture 2: Dirac notation and a review of linear algebra Read Sakurai chapter 1, Baym chatper 3

Mean bond enthalpy Standard enthalpy of formation Bond N H N N N N H O O O

Spherical shell model

CHAPTER 25 SOLVING EQUATIONS BY ITERATIVE METHODS

Durbin-Levinson recursive method

Every set of first-order formulas is equivalent to an independent set

6.3 Forecasting ARMA processes

Solutions to Exercise Sheet 5

PARTIAL NOTES for 6.1 Trigonometric Identities

ω ω ω ω ω ω+2 ω ω+2 + ω ω ω ω+2 + ω ω+1 ω ω+2 2 ω ω ω ω ω ω ω ω+1 ω ω2 ω ω2 + ω ω ω2 + ω ω ω ω2 + ω ω+1 ω ω2 + ω ω+1 + ω ω ω ω2 + ω

2 Composition. Invertible Mappings

2. THEORY OF EQUATIONS. PREVIOUS EAMCET Bits.

5.4 The Poisson Distribution.

Nowhere-zero flows Let be a digraph, Abelian group. A Γ-circulation in is a mapping : such that, where, and : tail in X, head in

CRASH COURSE IN PRECALCULUS

Approximation of distance between locations on earth given by latitude and longitude

Lecture 13 - Root Space Decomposition II

[1] P Q. Fig. 3.1

Capacitors - Capacitance, Charge and Potential Difference

Inverse trigonometric functions & General Solution of Trigonometric Equations

Empirical best prediction under area-level Poisson mixed models

Partial Differential Equations in Biology The boundary element method. March 26, 2013

Η αλληλεπίδραση ανάμεσα στην καθημερινή γλώσσα και την επιστημονική ορολογία: παράδειγμα από το πεδίο της Κοσμολογίας

Fourier Series. MATH 211, Calculus II. J. Robert Buchanan. Spring Department of Mathematics

Fractional Colorings and Zykov Products of graphs

Parameter Estimation Fitting Probability Distributions Bayesian Approach

Lecture 3: Asymptotic Normality of M-estimators

Section 7.6 Double and Half Angle Formulas

Transcript:

Sytactic Topic Models Supplemet Jorda Boyd-Graber Departmet of Computer Sciece 35 Olde Street Priceto Uiversity Priceto, NJ 08540 jbg@cs.priceto.edu David Blei Departmet of Computer Sciece 35 Olde Street Priceto Uiversity Priceto, NJ 08540 blei@cs.priceto.edu I this documet, we derive mea field variatioal iferece updates for the sytactic topic model. After recapitulatig the model, we expad the expectatios for each of the terms ad the derive updates for each of the variatioal parameters that maximie the likelihood boud. 1 Model As a referece, we first summarie the primary variables i our model. π p,i θ d,i How much we should wat to go ito topic i give that the paret topic was p. It is parameteried i the variatioal distributio by ν. How much we should wat to go ito topic i give that the documet is d. It is parameteried i the variatioal distributio by γ i the variatioal distributio. The topic of the th word. It is parameteried by φ i the variatioal distributio. β Top level proportios from DP. I the variatioal distributio, qβ β oly has support o β, which is ero for all idices greater tha K. τ i,j How much we should wat to geerate word j give that the word s topic is i. These variables are geerated through the followig process: 1. Choose global topic weights β GEMα 2. For each topic idex k = {1,... }: a Choose topic trasitio distributio π k DPα T, β. b Choose topic τ k Dirσ 3. For each documet d = {1,... M}: a Choose topic weights θ d DPα D, β. b For each setece i the documet: i. Choose topic assigmet 0 θ d π start ii. Choose root word w 0 mult1, τ 0 iii. For each additioal word w ad paret p, {1,... d } Choose topic assigmet θ d π p Choose word w mult1, τ 0 2 Variatioal Distributio Our variatioal distributio is factored ito qβ,, θ, π, τ β, φ, γ, ν = qβ β d qθ d γ d qπ ν q φ, 1 1

where qβ β is ot a full distributio but a degeerate poit estimate, γ d ad ν are variatioal Dirichlet distributios, ad φ is a topic multiomial for the th word. Our lower boud o the likelihood is Lγ, ν, φ; β, θ, π, τ E q [log pβ α log pθ α D, β log pπ α P, β log p θ, π log pw, τ log pτ σ E q [log qθ log qπ log q. 2 3 Documet-specific terms I our implemetatio, documet-specific variatioal parameters ad global variatioal parameters are computed separately because the documet-specific parameters ca be computed i parallel, greatly speedig the iferece process. We preset the derivatio of the variatioal updates i a similar maer, focusig o documet specific terms before global oes because it more closely follows the flow of the implemeted algorithm ad removes ueeded subscripts ad summatios. 3.1 Expadig expectatios We first expad the expectatio of the per-word topic term, both because it is the most difficult ad because it is the crux of this work. Rather tha drawig the topic of a word directly from a multiomial, it is chose from the reormalied poit-wise product of two multiomial distributios. I order to hadle the expectatio of the log sum itroduced by the reormaliatio, we itroduce a additioal variatioal parameter ω for each word via a Taylor approximatio of the logarithm to fid that N θ π p, E q [log p θ, π = E q [log K i θ i π p,i [ N N = E q log θ π p, log θ i π p,i = = N [ N [ E q log θ π p, E q θ i π p,i log ω 1 N [ N E q [log θ E q log πp, N N φ,i [ E q θi π p, log ω 1 N Ψγ i Ψ γ j φ,i φ p,j Ψν j,i Ψ ν j,k φ p,j i K γ K k ν log ω 1. 3 j,k Because the words come from a multiomial distributio, the probability of a word comig from a topic is simply its weight uder the multiomial. So takig the expectatio over its possible assigmets, we have [ N N E q [log pw, τ, ρ, t = E q log τ,w = φ i log τ i,w. 4 Fially, we take the expectatio of the documet topic multiomial; this is easier because the variatioal distributio oly has support at β ; this gives us E q [pθ d α D, β = log Γ α D,j β log Γα D,i β α D,i β 1 Ψγ i Ψ γ j. 5 2

Computig the etropy for the variatioal distributio terms which are specific to a sigle documet is relatively straightforward: E q [log q = N φ,i log φ,i E q [log qθ = log Γ γ j log Γγ i γ i 1 Ψγ i Ψ γ j We ow have all the terms ecessary to express the likelihood boud cotributio of a documet L d = log Γ α D,j β log Γα D,i β N α D,i β 1Ψγ i Ψ γ j N N φ,i N Ψγ i Ψ γ j φ,i φ p,j Ψν j,i Ψ ν j,k φ i log τ i,w φ p,j K γ K k ν log ω 1 j,k log Γ γ j log Γγ i γ i 1 Ψγ i Ψ γ j N 3.2 Variatioal updates φ,i log φ,i. 6 Because we seek to maximie Equatio 6, we compute the gradiet of the likelihood boud with respect to each variable. Whe possible, we ll explicitly solve for a value of the parameter that maximies the likelihood with respect to that coordiate. Otherwise, we ll fully express the likelihood fuctio ad gradiet for that variatioal parameter so that we ca use these expressios i a umerical optimiatio package. The variatioal parameters are preseted i a order correspodig roughly to the order i which they are used by the algorithm. 3.2.1 Topic ormalier slack variable First, we ll explicitly solve for the value of ω, the slack variable itroduced i the Taylor approximatio of Equatio 3. L = ω 2 φ p,j ω K γ K k ν 7 j,k ω = φ p,j K γ K k ν j,k 3

3.2.2 Variatioal topic multiomial Because φ is a variatioal multiomial, it must sum to oe. So we wat to maximie the likelihood boud augmeted by that costrait ad its correspodig Lagrage multiplier. L φi = L d λ φ,j 1 8 c c L φi = Ψγ i Ψ γ j φ p,j Ψν j,i Ψ ν j,k φ i φ c,j Ψν i,j Ψ ν i,k c c c j k γ k γ j ν i,j k ν i,k φ i log τ i,w log φ,i λ. 9 This the gives us, up to a costat, the value of φ that maximies our variatioal boud φ i exp Ψγ i Ψ γ j φ p,j Ψν j,i Ψ ν j,k c c c c c j k γ k γ j ν i,j k ν i,k φ c,j Ψν i,j Ψ ν i,k log τ i,w }. 10 3.2.3 Variatioal documet Dirichlet Because we couple π ad θ, the iteractio betwee these terms i the ormalier prevets us from solvig the optimiatio explicitly. But we still eed to derive the fuctioal form ad the derivative for the cojugate gradiet algorithm [1. I doig so, the followig derivative is useful: N x i f = α i N i=j x j f = α N i j i x j N j i α jx j x i N 2. 11 x i 4

First, we write the terms i L d that ivolve γ: L γ = α D,i β 1 Ψγ i Ψ γ j N φ,i Ψγ i Ψ γ j N ω 1 φ p,j K γ K k ν log ω 1 j,k log Γ γ j log Γγ i γ i 1 Ψγ i Ψ γ j. Now, we use Equatio 11 to compute the partial derivative with respect to γ i to be [ L N N N = Ψ γ i α D,i β φ,i γ i Ψ γ j α D,j β φ,j γ j γ i N N ω 1 ν j,i k j φ γ k N k j ν j,kγ k p,j N 2 γ. 13 N k ν j,k 4 Global variatioal terms I additio to the terms that are specific to the words of a sigle documet, we also have parameters that deal with the word probabilities withi topics, the trasitios betwee topics, ad the top-level topic weights. We retur to these terms ad, after derivig the rest of the likelihood boud, describe the updates for each of these global parameters. 4.1 Expadig expectatios Let s first cosider the terms associated with the variatioal Dirichlet associated with the topic trasitios from paret to child i the parse tree, which gives us E q [pπ d α P, β = log Γ α T,j β log Γα T,i β 12 α T,i β 1Ψν i Ψ ν j 14 E q [log qπ = log Γ ν j log Γν i ν i 1 Ψν i Ψ ν j 15 Aother global term that we have ot cosidered is for the top-level weights associated with β. The distributio is defied i terms of stick breakig proportios [2, so, followig the derivatio of Liag et al [3, we must trasform β ito these stick breakig proportios; this is accomplished by dividig each weight by the tail sum 1 T 1 β i, 16 5 i

the sum of the remaiig weights. This also itroduces a extra term for the chage of variables. E q [pβ α = E q [log GEMβ; α = log Betaβ /T ; 1, α = =1 =1 α 1 log T1 T log T K 1 log α log T = α 1 log T K log T C. 17 Lastly, the multiomial distributio over vocabulary terms per topic is straightforward V E q [log pτ σ = σ log τ i,v. 18 v=1 The total likelihood is, after icludig Equatio 6, the M L = d L d α 1 log T K log T log Γ α T,j β log Γα T,i β α T,i β 1Ψν i Ψ ν j log Γ ν j log Γν i ν i 1 Ψν i Ψ ν j v=1 4.2 Variatioal updates V σ log τ i,v. 19 These are the updates which require iput from all documets ad caot be parallelied. However, each of the documet-specific updates ca compute sufficiet statistics for these computatios which are summed together for the fial updates. 4.2.1 Variatioal trasitio Dirichlet Like our update for γ, the iteractio betwee π ad θ i the ormalier prevets us from solvig the optimiatio explicitly. First, cosider the variatioal Dirichlet for trasitios from topic i L νi = α P,j 1Ψν i,j Ψ ν i,k N c c φ,i φ c,j Ψν i,j Ψ ν i,k N φ,i ωc 1 γ j ν i,j K c c γ K k ν log ω 1 i,k log Γ ν i,j log Γν i,j ν i,j 1 Ψν i,j Ψ ν i,k. 20 6

Differetiatig this expressio, keepig i mid Equatio 11, gives L N = Ψ ν i,j α P,j φ,i φ c,j ν i,j ν i,j c c N Ψ ν i,k α P,k N φ,i ωc 1 c c which is sufficiet for cojugate gradiet optimiatio [1. 4.2.2 DP Process c c N γ j k j ν i,k N N ν j,k φ,i φ c,k ν i,k k j ν i,kγ k 2 N γ k, 21 The last variatioal parameter is β, which is the variatioal estimate of the top-level weights β. Note that we separate the fial term β K ad the previous weights; this is because β K is implicitly defied as 1 i=0 β i. L β = α 1 log T K log T [ M log Γα D d=1 i log Γα D β i log Γα D β K α D βi 1 Ψγ d,i Ψ γ d,j [ log Γα T i log Γα T β i log Γα T β K α T βi 1 Ψν k,i Ψ ν k,j. 22 Takig the derivative of this expressio, ad implicitly differetiatig β K gives us L β 1 βk = α 1 23 T T K =k1 M M α D Ψγ d,k Ψ γ d,j α D Ψγ d,k Ψ γ d,j d K α T Ψν,k Ψ ν,j α T d K Ψν,K Ψ ν,j K [α T Ψα T βk α T Ψα T βk M [α D Ψα D βk α D Ψα D βk 24 which we optimie through the barrier costraied optomiatio algorithm [4. Refereces [1 Galassi, M., J. Davies, J. Theiler, et al. Gu Scietific Library: Referece Maual. Network Theory Ltd., 2003. 7

[2 Pitma, J. Poisso-Dirichlet ad GEM ivariat distributios for split-ad-merge trasformatios of a iterval partitio. Combiatorics, Probability ad Computig, 11:501 514, 2002. [3 Liag, P., S. Petrov, M. Jorda, et al. The ifiite PCFG usig hierarchical Dirichlet processes. I Proceedigs of Emperical Methods i Natural Laguage Processig, pages 688 697. 2007. [4 Boyd, S., L. Vadeberghe. Covex Optimiatio. Cambridge Uiversity Press, 2004. 8