Topic Modeling with Latent Dirichlet Allocation
Vineet Mehta
University of Massachusetts - Lowell
Contents
1. Introduction
2. Preliminaries
3. Modeling Text with Latent Dirichlet Allocation
4. Parameter Estimation
Section 1: Introduction
Background
- A number of techniques for text analysis and information retrieval have been developed over the past decades.
- This presentation focuses on one such technique, known as Latent Dirichlet Allocation.
- Latent Dirichlet Allocation (LDA) was introduced by David Blei, Andrew Ng, and Michael Jordan in a 2003 paper in the Journal of Machine Learning Research.
- Since its introduction, LDA has been employed for applications beyond text analysis.
- LDA has also seen a number of extensions.
Topic Modeling
- LDA aims to classify large collections of documents through statistical relationships amongst words, known as topics.
- Topics are distributions over a vocabulary of words.
- LDA employs Bayesian inference techniques to estimate these topic distributions.
- The LDA approach to text analysis does not assume any knowledge of language structure.
- Documents in a text collection are treated as bags of words.
Applications: A Small Sampling
- Exploring scientific, political, and Wikipedia articles
- Audio information retrieval using acoustic features
- Image segmentation using visual features
- Identifying surprising events in video data
- Analysis of stock categories using financial topic models
- Development of user recommendation systems in social media
- Topic models for gene expression analysis
- Analysis of Twitter data for public health status and trends
Finding Out More: A really small sampling, so just Google it!

Workshops
- Topic Models: Computation, Application, and Evaluation (NIPS 2013)
- Applications for Topic Models: Text and Beyond (NIPS 2009)
- Workshop on Topic Models: Structure, Applications, Evaluation, and Extensions (ICML 2010)
- Topic Modeling for Humanities Research (MITH 2012)

Topic Modeling Software
- lda-c: C code by David Blei
- lda: R-language package at CRAN
- gensim: Python package that includes Latent Dirichlet Allocation (a usage sketch follows below)
- mallet: Java machine learning package, including topic modeling
- topictoolbox: Matlab toolbox from UCI

Data
- UCI machine learning repository: http://archive.ics.uci.edu/ml/datasets.html
- infochimps: http://www.infochimps.com
- Enron dataset: https://www.cs.cmu.edu/~enron/
- LDC (not free): http://catalog.ldc.upenn.edu
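As a concrete starting point, here is a minimal sketch of fitting an LDA model with the gensim package listed above. The toy documents, topic count, and training options are assumptions for illustration only, not part of the original slides.

```python
from gensim import corpora, models

# Hypothetical toy corpus: each document is a list of tokens.
texts = [["topic", "model", "text"],
         ["bayesian", "inference", "text"],
         ["stock", "market", "model"]]

dictionary = corpora.Dictionary(texts)                # word <-> integer id mapping
corpus = [dictionary.doc2bow(doc) for doc in texts]   # bag-of-words vectors

# Fit LDA with an assumed number of topics and training passes.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())                             # words with highest weight per topic
```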
Section 2: Preliminaries
Bayes Theorem
Consider the dataset $X^{(K)}_N = \{x_1, \ldots, x_N\}$, consisting of $N$ samples, where $x_i = [x_{1,i}, \ldots, x_{K,i}]^T$ is a sample from the random vector $X^{(K)}_i$. The random vector $X^{(K)}_i$ has a distribution $p(x^{(K)}_i \mid \theta)$ parameterized by $\theta$, and the $\{X^{(K)}_i\}$ are independent and identically distributed.

Bayes theorem:
\[ \underbrace{p(\theta \mid X^{(K)}_N)}_{\text{posterior}} = \frac{\overbrace{p(X^{(K)}_N \mid \theta)}^{\text{likelihood}}\;\overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(X^{(K)}_N)}_{\text{evidence}}} \]
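The sketch below is a minimal numerical illustration of Bayes' theorem: the posterior over a discrete grid of candidate $\theta$ values for a handful of Bernoulli observations. The data and grid are assumed for illustration.

```python
import numpy as np

# Posterior over a grid of theta values for assumed 0/1 observations,
# computed directly as likelihood * prior / evidence.
x = np.array([1, 0, 1, 1, 0, 1])           # hypothetical observations
theta_grid = np.linspace(0.01, 0.99, 99)   # candidate parameter values

prior = np.ones_like(theta_grid) / theta_grid.size                       # uniform prior
likelihood = theta_grid**x.sum() * (1 - theta_grid)**(len(x) - x.sum())  # Bernoulli likelihood
evidence = np.sum(likelihood * prior)                                    # normalizer
posterior = likelihood * prior / evidence

print(theta_grid[np.argmax(posterior)])    # posterior mode, near 4/6
```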
Key Distributions: Univariate Case

Binomial distribution $\mathrm{Bin}(x \mid \theta, N)$:
\[ p(x \mid \theta, N) = \binom{N}{x}\,\theta^{x}(1-\theta)^{N-x}, \qquad x \in \{0, 1, \ldots, N\} \]

Bernoulli distribution $\mathrm{Bern}(x \mid \theta)$:
\[ p(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}, \qquad x \in \{0, 1\} \]

Likelihood of $N$ Bernoulli observations:
\[ p(X^{(1)}_N \mid \theta) = \prod_{i=1}^{N} \theta^{I(x_i=1)}(1-\theta)^{I(x_i=0)} = \theta^{n_1}(1-\theta)^{n_0}, \qquad N = n_0 + n_1 \]
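A small sketch of these quantities, using hypothetical parameter values and data; the scipy distribution objects simply evaluate the formulas above.

```python
import numpy as np
from scipy.stats import binom, bernoulli

theta, N = 0.3, 10                       # assumed parameter and sample size
x = 4                                    # assumed number of successes

print(binom.pmf(x, N, theta))            # C(N,x) theta^x (1-theta)^(N-x)
print(bernoulli.pmf(1, theta))           # Bernoulli pmf at x = 1, i.e. theta

# The likelihood of N Bernoulli draws reduces to theta^n1 * (1-theta)^n0.
draws = bernoulli.rvs(theta, size=N, random_state=0)
n1, n0 = draws.sum(), N - draws.sum()
print(theta**n1 * (1 - theta)**n0)
```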
Key Distributions: Multivariate Case

Multinomial distribution $\mathrm{Mult}(x \mid \theta, N)$:
\[ p(x^{(K)} = x \mid \theta, N) = \binom{N}{x_1 \cdots x_K} \prod_{k=1}^{K} \theta_k^{x_k}, \qquad \sum_{k=1}^{K} x_k = N, \qquad \sum_{k=1}^{K} \theta_k = 1 \]

Categorical distribution $\mathrm{Cat}(x \mid \theta)$:
\[ p(x^{(K)} = x \mid \theta, 1) = \prod_{k=1}^{K} \theta_k^{x_k} = \prod_{k=1}^{K} \theta_k^{I(x_k=1)}, \qquad x \in \{0, 1\}^K \]
Key Distributions: Multivariate Case (continued)

Likelihood of $N$ categorical observations:
\[ p(X^{(K)}_N \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta) = \prod_{i=1}^{N} \prod_{k=1}^{K} \theta_k^{I(x_{i,k}=1)} = \prod_{k=1}^{K} \theta_k^{n_k}, \qquad \text{where } \sum_{k=1}^{K} n_k = N \]
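A minimal sketch of this reduction, with assumed category probabilities and observations: the likelihood of the sample depends on the data only through the category counts $n_k$.

```python
import numpy as np

theta = np.array([0.5, 0.3, 0.2])             # hypothetical category probabilities
draws = np.array([0, 2, 0, 1, 0, 1, 0])       # hypothetical observed category indices

n = np.bincount(draws, minlength=theta.size)  # n_k: how often each category occurs
likelihood = np.prod(theta**n)                # prod_k theta_k^{n_k}
print(n, likelihood)
```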
Parameterized Priors: Hyperparameters
Generalization: the parameter $\theta$ depends on the hyperparameter $\vartheta$, so $p(\theta) \to p(\theta \mid \vartheta)$.

Bayes theorem:
\[ p(\theta \mid X^{(K)}_N, \vartheta) = \frac{p(X^{(K)}_N \mid \theta)\, p(\theta \mid \vartheta)}{\int p(X^{(K)}_N \mid \theta)\, p(\theta \mid \vartheta)\, d\theta} \]
Conjugate Priors: Univariate Case, $p(\theta \mid \vartheta)$

Beta distribution, $\vartheta = (\alpha, \beta)$:
\[ \mathrm{Beta}(\theta \mid \alpha, \beta) \equiv p(\theta \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}, \qquad B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)} \]

Posterior distribution $p(\theta \mid X^{(1)}_N, \vartheta)$:
\[ p(\theta \mid X^{(1)}_N, \vartheta) = \frac{\frac{1}{B(\alpha,\beta)}\,\theta^{n_1+\alpha-1}(1-\theta)^{n_0+\beta-1}}{\int \frac{1}{B(\alpha,\beta)}\,\theta^{n_1+\alpha-1}(1-\theta)^{n_0+\beta-1}\,d\theta} = \frac{1}{B(n_1+\alpha,\, n_0+\beta)}\,\theta^{n_1+\alpha-1}(1-\theta)^{n_0+\beta-1} = \mathrm{Beta}(\theta \mid n_1+\alpha,\, n_0+\beta) \]
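A short sketch of the Beta-Bernoulli conjugate update, with assumed prior hyperparameters and data: the posterior is again a Beta whose parameters are the prior hyperparameters plus the observed counts.

```python
import numpy as np
from scipy.stats import beta

alpha0, beta0 = 2.0, 2.0                 # assumed prior hyperparameters
x = np.array([1, 1, 0, 1, 1, 0, 1])      # assumed 0/1 observations
n1, n0 = x.sum(), len(x) - x.sum()

posterior = beta(alpha0 + n1, beta0 + n0)    # Beta(n1 + alpha, n0 + beta)
print(posterior.mean())                      # (n1 + alpha) / (n1 + n0 + alpha + beta)
```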
Prediction: Univariate Case

Marginalizing out the likelihood parameter:
\[ p(X^{(1)}_N \mid \alpha, \beta) = \int p(X^{(1)}_N \mid \theta)\, p(\theta \mid \alpha, \beta)\, d\theta = \frac{1}{B(\alpha,\beta)} \int \theta^{n_1+\alpha-1}(1-\theta)^{n_0+\beta-1}\, d\theta = \frac{B(n_1+\alpha,\, n_0+\beta)}{B(\alpha, \beta)} = \frac{\Gamma(n_1+\alpha)\Gamma(n_0+\beta)\Gamma(\alpha+\beta)}{\Gamma(n_1+n_0+\alpha+\beta)\Gamma(\alpha)\Gamma(\beta)} \]

New sample likelihood:
\[ p(\tilde{x} = 1 \mid X^{(1)}_N, \alpha, \beta) = \frac{p(\tilde{x} = 1,\, X^{(1)}_N \mid \alpha, \beta)}{p(X^{(1)}_N \mid \alpha, \beta)} = \frac{n_1 + \alpha}{n_1 + n_0 + \alpha + \beta} \]
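A numerical sketch of both quantities on this slide, with assumed counts and hyperparameters; the marginal likelihood is evaluated through log-Beta functions to avoid overflow.

```python
import numpy as np
from scipy.special import gammaln

alpha, beta_ = 1.0, 1.0          # assumed hyperparameters (uniform prior)
n1, n0 = 7, 3                    # assumed counts of ones and zeros

def log_B(a, b):
    """Log of the Beta function B(a, b)."""
    return gammaln(a) + gammaln(b) - gammaln(a + b)

# Marginal likelihood B(n1 + alpha, n0 + beta) / B(alpha, beta).
log_evidence = log_B(n1 + alpha, n0 + beta_) - log_B(alpha, beta_)
print(np.exp(log_evidence))

# Posterior predictive probability that the next observation is 1.
print((n1 + alpha) / (n1 + n0 + alpha + beta_))   # (7+1)/(10+2) = 2/3
```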
Conjugate Priors: Multivariate Case, $p(\theta \mid \vartheta)$

Dirichlet distribution:
\[ \mathrm{Dir}(\theta \mid \vartheta) \equiv p(\theta \mid \vartheta) = \frac{1}{\Delta(\vartheta)} \prod_{k=1}^{K} \theta_k^{\vartheta_k - 1}, \qquad \Delta(\vartheta) = \frac{\prod_{k=1}^{K} \Gamma(\vartheta_k)}{\Gamma\!\big(\sum_{k=1}^{K} \vartheta_k\big)}, \qquad \sum_{k=1}^{K} \theta_k = 1 \]

Posterior distribution $p(\theta \mid X^{(K)}_N, \vartheta)$:
\[ p(\theta \mid X^{(K)}_N, \vartheta) = \frac{\frac{1}{\Delta(\vartheta)} \prod_{k=1}^{K} \theta_k^{n_k+\vartheta_k-1}}{\int \frac{1}{\Delta(\vartheta)} \prod_{k=1}^{K} \theta_k^{n_k+\vartheta_k-1}\, d\theta} = \frac{1}{\Delta(n + \vartheta)} \prod_{k=1}^{K} \theta_k^{n_k+\vartheta_k-1} = \mathrm{Dir}(\theta \mid n + \vartheta) \]
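A minimal sketch of the Dirichlet-categorical conjugate update, with assumed hyperparameters and counts: the posterior is $\mathrm{Dir}(n + \vartheta)$, so its mean is simply the normalized updated parameter vector.

```python
import numpy as np

vartheta = np.array([1.0, 1.0, 1.0])       # assumed prior hyperparameters
n = np.array([5, 2, 3])                    # assumed category counts

posterior_params = n + vartheta
print(posterior_params / posterior_params.sum())   # posterior mean of theta

# Draw a few samples from the posterior to see the remaining uncertainty.
rng = np.random.default_rng(0)
print(rng.dirichlet(posterior_params, size=5))
```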
Section 3: Modeling Text with Latent Dirichlet Allocation
Modeling Text: Notation
- $x_{i,m}$: $i$-th word in the $m$-th document
- $z_{i,m}$: topic from which the $i$-th word in the $m$-th document is drawn
- $\omega_j$: value taken by $x_{i,m}$, where $j \in [1, V]$ and $V$ is the vocabulary size
- $\xi_k$: value taken by $z_{i,m}$, where $k \in [1, K]$ and $K$ is the topic count
- $\theta_m$: topic distribution for the $m$-th document
- $\phi_k$: word distribution for the $k$-th topic
- $\alpha$: hyperparameters for the document topic distributions
- $\beta$: hyperparameters for the topic word distributions
- $X^{(V)}_{N_m}$: words in document $m$
- $X^{(V)}_N$: words in all documents (the corpus)
- $Z^{(K)}_{N_m}$: topics associated with the words in document $m$
- $Z^{(K)}_N$: topics associated with the words in the corpus
Modeling Text: Latent Dirichlet Allocation Generative Model

for all topics $k \in [1, K]$ do
    sample word distribution $\phi_k \sim \mathrm{Dir}(\phi_k \mid \beta)$
for all documents $m \in [1, M]$ do
    sample topic distribution $\theta_m \sim \mathrm{Dir}(\theta_m \mid \alpha)$
    sample document length $N_m \sim \mathrm{Pois}(N_m \mid \xi)$
    for all words $i \in [1, N_m]$ in document $m$ do
        sample topic index $z_{i,m} \sim \mathrm{Mult}(z_{i,m} \mid \theta_m, 1)$
        sample word $x_{i,m} \sim \mathrm{Mult}(x_{i,m} \mid \phi_{\{k : I(z_{i,m} = \xi_k)\}}, 1)$
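The following is a minimal numpy sketch of this generative process. The corpus sizes, vocabulary size, and hyperparameter values are assumptions chosen only to make the sketch run.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, V = 3, 5, 20            # assumed topics, documents, vocabulary size
alpha, beta_, xi = 0.5, 0.1, 50

# phi[k] is the word distribution of topic k, drawn from Dir(beta).
phi = rng.dirichlet(np.full(V, beta_), size=K)

docs = []
for m in range(M):
    theta_m = rng.dirichlet(np.full(K, alpha))      # topic distribution for doc m
    N_m = rng.poisson(xi)                           # document length
    z = rng.choice(K, size=N_m, p=theta_m)          # topic index per word
    x = np.array([rng.choice(V, p=phi[k]) for k in z])  # word drawn from its topic
    docs.append((x, z))

print(docs[0][0][:10])   # first ten word indices of document 0
```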
Section 4: Parameter Estimation
Latent Dirichlet Allocation: Joint Distribution of Known and Hidden Variables

For the $i$-th word in the $m$-th document:
\[ p(x_{i,m}, z_{i,m}, \theta_m, \Phi \mid \alpha, \beta) = p(x_{i,m} \mid z_{i,m}, \Phi)\, p(z_{i,m} \mid \theta_m)\, p(\theta_m \mid \alpha)\, p(\Phi \mid \beta) \]

For all words in the $m$-th document:
\[ p(X^{(V)}_{N_m}, Z^{(K)}_{N_m}, \theta_m, \Phi \mid \alpha, \beta) = p(\theta_m \mid \alpha)\, p(\Phi \mid \beta) \prod_{i=1}^{N_m} p(x_{i,m} \mid z_{i,m}, \Phi)\, p(z_{i,m} \mid \theta_m) \]

For all words in the corpus:
\[ p(X^{(V)}_N, Z^{(K)}_N, \Theta, \Phi \mid \alpha, \beta) = p(\Phi \mid \beta) \prod_{m=1}^{M} p(\theta_m \mid \alpha) \prod_{i=1}^{N_m} p(x_{i,m} \mid z_{i,m}, \Phi)\, p(z_{i,m} \mid \theta_m) \]
Latent Dirichlet Allocation: Conditional Distributions - Word Likelihoods

Word in a document:
\[ p(x_{i,m} \mid z_{i,m}, \Phi) = \prod_{j=1}^{V} \prod_{k=1}^{K} \phi_{k,j}^{I(x_{i,m}=\omega_j,\; z_{i,m}=\xi_k)} \]

All words in a document:
\[ \prod_{i=1}^{N_m} p(x_{i,m} \mid z_{i,m}, \Phi) = \prod_{i=1}^{N_m} \prod_{j=1}^{V} \prod_{k=1}^{K} \phi_{k,j}^{I(x_{i,m}=\omega_j,\; z_{i,m}=\xi_k)} = \prod_{j=1}^{V} \prod_{k=1}^{K} \phi_{k,j}^{\rho_{k,j}} \]

where $\rho_{k,j}$ is the count of word $j$ assigned to topic $k$.
Latent Dirichlet Allocation: Conditional Distributions - Topic Likelihoods

Topic likelihood for a single word:
\[ p(z_{i,m} \mid \theta_m) = \prod_{k=1}^{K} \theta_{m,k}^{I(z_{i,m}=\xi_k)} \]

Topic likelihood for all words in a document:
\[ \prod_{i=1}^{N_m} p(z_{i,m} \mid \theta_m) = \prod_{i=1}^{N_m} \prod_{k=1}^{K} \theta_{m,k}^{I(z_{i,m}=\xi_k)} = \prod_{k=1}^{K} \theta_{m,k}^{\upsilon_{m,k}} \]

where $\upsilon_{m,k}$ is the count of words in document $m$ assigned to topic $k$.
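A small sketch of the two count statistics used above, computed from word and topic assignments. The corpus format and the single toy document are assumptions; any corpus of (word index, topic index) pairs would do.

```python
import numpy as np

K, V, M = 3, 20, 5                       # assumed topics, vocabulary, documents
rho = np.zeros((K, V), dtype=int)        # rho[k, j]: word j assigned to topic k
upsilon = np.zeros((M, K), dtype=int)    # upsilon[m, k]: words in doc m with topic k

# docs[m] = (word indices, topic indices); one fabricated document for illustration.
docs = [(np.array([4, 7, 4, 0]), np.array([1, 2, 1, 0]))]
for m, (x, z) in enumerate(docs):
    for word, topic in zip(x, z):
        rho[topic, word] += 1
        upsilon[m, topic] += 1

print(rho.sum(), upsilon.sum())          # both equal the total number of words
```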
Latent Dirichlet Allocation: Priors

Topic distribution for a document:
\[ p(\theta_m \mid \alpha) = \frac{1}{\Delta(\alpha)} \prod_{k=1}^{K} \theta_{m,k}^{\alpha_k - 1} \]

Word distributions over topics:
\[ p(\Phi \mid \beta) = \frac{1}{\Delta(\beta)^K} \prod_{k=1}^{K} \prod_{j=1}^{V} \phi_{k,j}^{\beta_j - 1} \]
Latent Dirichlet Allocation: Full Joint Distribution

\[ p(X^{(V)}_N, Z^{(K)}_N, \Theta, \Phi \mid \alpha, \beta) = p(\Phi \mid \beta) \prod_{m=1}^{M} p(\theta_m \mid \alpha) \prod_{i=1}^{N_m} p(x_{i,m} \mid z_{i,m}, \Phi)\, p(z_{i,m} \mid \theta_m) \]
\[ = \frac{1}{\Delta(\alpha)^M\,\Delta(\beta)^K} \prod_{k=1}^{K} \prod_{j=1}^{V} \phi_{k,j}^{\rho_{k,j}+\beta_j-1} \; \prod_{m=1}^{M} \prod_{k=1}^{K} \theta_{m,k}^{\upsilon_{m,k}+\alpha_k-1} \]

where $\rho_{k,j}$ and $\upsilon_{m,k}$ are now counted over the whole corpus.
Latent Dirichlet Allocation: Integrating out $\Theta$ and $\Phi$

\[ p(X^{(V)}_N, Z^{(K)}_N \mid \alpha, \beta) = \frac{1}{\Delta(\alpha)^M\,\Delta(\beta)^K} \int\!\!\int \prod_{k=1}^{K} \prod_{j=1}^{V} \phi_{k,j}^{\rho_{k,j}+\beta_j-1} \; \prod_{m=1}^{M} \prod_{k=1}^{K} \theta_{m,k}^{\upsilon_{m,k}+\alpha_k-1} \, d\Theta\, d\Phi \]
\[ = \prod_{m=1}^{M} \frac{\Delta(\upsilon_m + \alpha)}{\Delta(\alpha)} \; \prod_{k=1}^{K} \frac{\Delta(\rho_k + \beta)}{\Delta(\beta)} \]
Gibbs Sampling: Sampling the Posterior $p(Z \mid X, \alpha, \beta)$

Sampling algorithm:
initialize $Z$ to $Z^{(0)} = \{z^{(0)}_1, \ldots, z^{(0)}_N\}$ at iteration $l = 0$
for $l \in [0, L]$ do
    for $n \in [1, N]$ do
        sample $z^{(l+1)}_n \sim p\big(z^{(l+1)}_n \mid \{z^{(l+1)}_1, \ldots, z^{(l+1)}_{n-1}, z^{(l)}_{n+1}, \ldots, z^{(l)}_N\}, X, \alpha, \beta\big)$

After sufficient iterations the sampler converges, and the samples $z^{(l)}_n$ are instances of $p(Z \mid X, \alpha, \beta)$.
Constructing the Posterior for the Gibbs Sampler

\[ p(z_n \mid Z_{\neg n}, X, \alpha, \beta) = \frac{p(X, Z \mid \alpha, \beta)}{p(X, Z_{\neg n} \mid \alpha, \beta)} = \frac{p(X, Z \mid \alpha, \beta)}{p(X_{\neg n}, Z_{\neg n} \mid \alpha, \beta)\, p(x_n \mid \alpha, \beta)} \propto \frac{p(X, Z \mid \alpha, \beta)}{p(X_{\neg n}, Z_{\neg n} \mid \alpha, \beta)} \]

\[ p(X_{\neg n}, Z_{\neg n} \mid \alpha, \beta) = \int\!\!\int p(X_{\neg n} \mid Z_{\neg n}, \Phi)\, p(Z_{\neg n} \mid \Theta)\, p(\Theta \mid \alpha)\, p(\Phi \mid \beta)\, d\Theta\, d\Phi \]

Write the excluded token as $n = (q, r, s, t)$: position $r$ in document $q$, with word value $\omega_s$ and topic $\xi_t$. Then

\[ p(X_{\neg n} \mid Z_{\neg n}, \Phi) = \prod_{\substack{(m,i,j,k) \\ \neq\, (q,r,s,t)}} \phi_{k,j}^{I(x_{i,m}=\omega_j,\; z_{i,m}=\xi_k)} = \prod_{j=1}^{V} \prod_{k=1}^{K} \phi_{k,j}^{\rho^{(\neg n)}_{k,j}} \]

\[ p(Z_{\neg n} \mid \Theta) = \prod_{\substack{(m,i,k) \\ \neq\, (q,r,t)}} \theta_{m,k}^{I(z_{i,m}=\xi_k)} = \prod_{m=1}^{M} \prod_{k=1}^{K} \theta_{m,k}^{\upsilon^{(\neg n)}_{m,k}} \]
Defining Counts for the Posterior in the Gibbs Sampler

Counts of words assigned to topics, $\rho_k$ and $\rho^{(\neg n)}_k$:
\[ \rho^{(\neg n)}_{k,j} = \begin{cases} \rho_{k,j} & (j,k) \neq (s,t) \\ \rho_{k,j} - 1 & (j,k) = (s,t) \end{cases} \]

Counts of words in document $m$ assigned to topics, $\upsilon_m$ and $\upsilon^{(\neg n)}_m$:
\[ \upsilon^{(\neg n)}_{m,k} = \begin{cases} \upsilon_{m,k} & (m,k) \neq (q,t) \\ \upsilon_{m,k} - 1 & (m,k) = (q,t) \end{cases} \]
Joint Distributions and Posterior

\[ p(X, Z \mid \alpha, \beta) = \prod_{m=1}^{M} \frac{\Delta(\upsilon_m + \alpha)}{\Delta(\alpha)} \; \prod_{k=1}^{K} \frac{\Delta(\rho_k + \beta)}{\Delta(\beta)} \]

\[ p(X_{\neg n}, Z_{\neg n} \mid \alpha, \beta) = \prod_{m=1}^{M} \frac{\Delta(\upsilon^{(\neg n)}_m + \alpha)}{\Delta(\alpha)} \; \prod_{k=1}^{K} \frac{\Delta(\rho^{(\neg n)}_k + \beta)}{\Delta(\beta)} \]

\[ p(z_n \mid Z_{\neg n}, X, \alpha, \beta) \propto \frac{p(X, Z \mid \alpha, \beta)}{p(X_{\neg n}, Z_{\neg n} \mid \alpha, \beta)} = \frac{\Delta(\upsilon_q + \alpha)\,\Delta(\rho_t + \beta)}{\Delta(\upsilon^{(\neg n)}_q + \alpha)\,\Delta(\rho^{(\neg n)}_t + \beta)} \]

with
\[ \Delta(y) = \frac{\prod_{k=1}^{K} \Gamma(y_k)}{\Gamma\!\big(\sum_{k=1}^{K} y_k\big)}, \qquad \Gamma(y+1) = y\,\Gamma(y) \]
Simplifying the Expression for the Posterior

\[ p(z_n \mid Z_{\neg n}, X, \alpha, \beta) \propto \frac{\dfrac{\prod_{k=1}^{K}\Gamma(\upsilon_{q,k}+\alpha_k)}{\Gamma\!\big(\sum_{k=1}^{K}\upsilon_{q,k}+\alpha_k\big)}\;\dfrac{\prod_{j=1}^{V}\Gamma(\rho_{t,j}+\beta_j)}{\Gamma\!\big(\sum_{j=1}^{V}\rho_{t,j}+\beta_j\big)}}{\dfrac{\prod_{k=1}^{K}\Gamma(\upsilon^{(\neg n)}_{q,k}+\alpha_k)}{\Gamma\!\big(\sum_{k=1}^{K}\upsilon^{(\neg n)}_{q,k}+\alpha_k\big)}\;\dfrac{\prod_{j=1}^{V}\Gamma(\rho^{(\neg n)}_{t,j}+\beta_j)}{\Gamma\!\big(\sum_{j=1}^{V}\rho^{(\neg n)}_{t,j}+\beta_j\big)}} \]

All factors with $(j,k) \neq (s,t)$ and $(m,k) \neq (q,t)$ cancel, leaving

\[ \propto \frac{\Gamma(\upsilon_{q,t}+\alpha_t)}{\Gamma\!\big(\sum_{k=1}^{K}\upsilon_{q,k}+\alpha_k\big)}\;\frac{\Gamma\!\big(\sum_{k=1}^{K}\upsilon_{q,k}+\alpha_k-1\big)}{\Gamma(\upsilon_{q,t}+\alpha_t-1)}\;\frac{\Gamma(\rho_{t,s}+\beta_s)}{\Gamma\!\big(\sum_{j=1}^{V}\rho_{t,j}+\beta_j\big)}\;\frac{\Gamma\!\big(\sum_{j=1}^{V}\rho_{t,j}+\beta_j-1\big)}{\Gamma(\rho_{t,s}+\beta_s-1)} \]

Using $\Gamma(y+1) = y\,\Gamma(y)$,

\[ p(z_n = \xi_t \mid Z_{\neg n}, X, \alpha, \beta) \propto \frac{\upsilon_{q,t}+\alpha_t-1}{\sum_{k=1}^{K}(\upsilon_{q,k}+\alpha_k)-1} \;\cdot\; \frac{\rho_{t,s}+\beta_s-1}{\sum_{j=1}^{V}(\rho_{t,j}+\beta_j)-1} \]

Note that the counts $\upsilon_{m,k}$ and $\rho_{k,j}$ are updated over the Gibbs sampling iterations.
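Below is a minimal collapsed Gibbs sampler sketch built on this conditional. The toy corpus, topic count, vocabulary size, symmetric hyperparameters, and iteration count are all assumptions. The counts used in the conditional are the $\neg n$ counts obtained by removing the current token, which is equivalent to the $-1$ terms above, and the topic-side denominator is dropped because it does not depend on the candidate topic.

```python
import numpy as np

rng = np.random.default_rng(0)
docs = [np.array([0, 2, 2, 5]), np.array([1, 3, 3, 4, 5])]   # hypothetical word indices
K, V = 2, 6
alpha, beta_ = 0.5, 0.1

# Count matrices and random initial topic assignments.
upsilon = np.zeros((len(docs), K))          # upsilon[m, k]
rho = np.zeros((K, V))                      # rho[k, j]
z = [rng.integers(K, size=len(d)) for d in docs]
for m, d in enumerate(docs):
    for i, w in enumerate(d):
        upsilon[m, z[m][i]] += 1
        rho[z[m][i], w] += 1

for it in range(200):                       # Gibbs sweeps
    for m, d in enumerate(docs):
        for i, w in enumerate(d):
            k_old = z[m][i]
            upsilon[m, k_old] -= 1          # remove current token: the "not n" counts
            rho[k_old, w] -= 1
            # Conditional over topics: (upsilon_qk + alpha) * (rho_kw + beta) / (sum_j rho_kj + V*beta)
            p = (upsilon[m] + alpha) * (rho[:, w] + beta_) / (rho.sum(axis=1) + V * beta_)
            k_new = rng.choice(K, p=p / p.sum())
            z[m][i] = k_new                 # add the token back under its new topic
            upsilon[m, k_new] += 1
            rho[k_new, w] += 1

print(z)                                    # final topic assignments per word
```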
Estimating Topic Model Parameters

Distribution of topics in documents:
\[ p(\theta_m \mid Z_{N_m}, \alpha) = \frac{1}{C_{\theta_m}}\, p(Z_{N_m} \mid \theta_m)\, p(\theta_m \mid \alpha) = \frac{1}{C_{\theta_m}\,\Delta(\alpha)} \left( \prod_{i=1}^{N_m} \prod_{k=1}^{K} \theta_{m,k}^{I(z_{i,m}=\xi_k)} \right) \prod_{k=1}^{K} \theta_{m,k}^{\alpha_k-1} \]
\[ = \frac{1}{C_{\theta_m}\,\Delta(\alpha)} \prod_{k=1}^{K} \theta_{m,k}^{\upsilon_{m,k}+\alpha_k-1} = \mathrm{Dir}(\theta_m \mid \upsilon_m + \alpha) \]
Estimating Topic Model Parameters (continued)

Distribution of words in topics. Let $N(\xi_k) = \{(i,m) : z_{i,m} = \xi_k\}$ be the set of word positions assigned to topic $k$. Then
\[ p(\phi_k \mid X_{N(\xi_k)}, Z_{N(\xi_k)}, \beta) = \frac{1}{C_{\phi_k}}\, p(X_{N(\xi_k)} \mid \phi_k)\, p(\phi_k \mid \beta) = \frac{1}{C_{\phi_k}\,\Delta(\beta)} \left( \prod_{(i,m) \in N(\xi_k)} \prod_{j=1}^{V} \phi_{k,j}^{I(x_{i,m}=\omega_j)} \right) \prod_{j=1}^{V} \phi_{k,j}^{\beta_j-1} \]
\[ = \frac{1}{C_{\phi_k}\,\Delta(\beta)} \prod_{j=1}^{V} \phi_{k,j}^{\rho_{k,j}+\beta_j-1} = \mathrm{Dir}(\phi_k \mid \rho_k + \beta) \]
Estimating Topic Model Parameters (continued)

Given $x = (x_1, \ldots, x_K) \sim \mathrm{Dir}(x \mid \alpha)$:
\[ E[x_i] = \frac{\alpha_i}{\bar{\alpha}}, \qquad \mathrm{Var}[x_i] = \frac{\alpha_i(\bar{\alpha}-\alpha_i)}{\bar{\alpha}^2(\bar{\alpha}+1)}, \qquad \bar{\alpha} = \sum_{i=1}^{K} \alpha_i = \alpha^T \mathbf{1} \]

Estimate for the distribution of topics in documents, with $a_m = (\upsilon_m + \alpha)^T \mathbf{1}$:
\[ E[\theta_{m,k}] = \frac{\upsilon_{m,k} + \alpha_k}{a_m}, \qquad \mathrm{Var}[\theta_{m,k}] = \frac{(\upsilon_{m,k}+\alpha_k)\,[a_m - (\upsilon_{m,k}+\alpha_k)]}{a_m^2\,(a_m + 1)} \]

Estimate for the distribution of words in topics, with $b_k = (\rho_k + \beta)^T \mathbf{1}$:
\[ E[\phi_{k,j}] = \frac{\rho_{k,j} + \beta_j}{b_k}, \qquad \mathrm{Var}[\phi_{k,j}] = \frac{(\rho_{k,j}+\beta_j)\,[b_k - (\rho_{k,j}+\beta_j)]}{b_k^2\,(b_k + 1)} \]
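A closing sketch of these point estimates: given count matrices such as those produced by the Gibbs sampler sketch earlier, the posterior means of $\theta$ and $\phi$ are just the smoothed, row-normalized counts. The count matrices and symmetric hyperparameters below are assumed for illustration.

```python
import numpy as np

alpha, beta_ = 0.5, 0.1
upsilon = np.array([[3, 1], [1, 4]])         # hypothetical upsilon[m, k]
rho = np.array([[5, 0, 2, 0, 1, 1],          # hypothetical rho[k, j]
                [0, 3, 0, 2, 0, 1]])

theta_hat = (upsilon + alpha) / (upsilon + alpha).sum(axis=1, keepdims=True)
phi_hat = (rho + beta_) / (rho + beta_).sum(axis=1, keepdims=True)
print(theta_hat)                             # E[theta_{m,k}]
print(phi_hat)                               # E[phi_{k,j}]
```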