Bayesian Variable Order n-gram Language Model
based on Hierarchical Pitman-Yor Processes

Daichi Mochihashi† and Eiichiro Sumita†

This paper proposes a variable order n-gram language model by extending a recently proposed model based on the hierarchical Pitman-Yor processes. Introducing a stochastic process on an infinite depth prediction suffix tree, we can infer the hidden n-gram context from which each word originated. Experiments on standard large corpora showed the validity and efficiency of the proposed model. Our architecture is also applicable to general Markov models to estimate their variable orders of generation.

1. Introduction

Since Shannon 1), n-gram language models have predicted each word from its preceding (n-1) words, that is, as an (n-1)-th order Markov model over words. Because the number of parameters grows as V^(n-1) for a vocabulary of V words, n is usually fixed to a small value such as n = 3 (trigram), or at most n = 4, 5. A fixed n is, however, clearly a compromise: an expression such as "The United States of America" behaves as a unit that is longer than n words, while for many other words a much shorter context is sufficient and the remaining context is simply wasted. This motivates a language model in which the n-gram order is not fixed in advance but is estimated from the data, possibly differing from word to word; such variable order n-gram models have been studied from several directions 2).

† ATR Spoken Language Communication Research Laboratories / National Institute of Information and Communications Technology. The first author is presently with NTT Communication Science Laboratories.
Siu and Ostendorf 3) and Stolcke 4) construct variable n-grams by pruning contexts according to statistical criteria such as entropy or MDL, and Pereira et al. 5) model text with a mixture of prediction suffix trees. Variable length Markov models have also been used outside language modeling, for example for protein sequences 6), and their statistical properties are analyzed in 7). These approaches, however, amount to heuristic prunings or maximum likelihood estimates within a fixed model family, and they do not give a generative probabilistic model of the n-gram order itself.

Recently, a hierarchical Bayesian n-gram language model based on the Pitman-Yor process has been proposed 10) 11). In this model, an n-gram distribution is smoothed recursively by the (n-1)-gram, ..., 2-gram, 1-gram and finally the 0-gram distribution, giving a Bayesian interpretation of state-of-the-art smoothing methods. The order n, however, must still be fixed in advance. In this paper we extend this hierarchical Pitman-Yor language model so that the order itself becomes a hidden variable: each word is assumed to have been generated from an n-gram context of unknown length, and that length is inferred from the data.

The rest of this paper is organized as follows. Section 2 reviews the n-gram language model based on the hierarchical Pitman-Yor process. Section 3 introduces a stochastic process on a suffix tree of unbounded depth and derives the proposed variable order model and its inference. Section 4 describes an extension that combines the model with LDA. Sections 5 and 6 report experiments and discussion, and Section 7 concludes the paper.

2. n-gram Language Models based on the Pitman-Yor Process

The hierarchical Pitman-Yor language model (HPYLM) 10) 11) is a Bayesian n-gram model in which word distributions of different orders are smoothed hierarchically; related hierarchical Bayesian smoothing schemes were proposed earlier in 8) 9). HPYLM gives a Bayesian interpretation of Kneser-Ney smoothing 12): it is built on the Pitman-Yor process 13), a two-parameter generalization of the Dirichlet process, arranged hierarchically in the same way as the hierarchical Dirichlet process 14). In practice the model is represented by a hierarchy of Chinese Restaurant Processes (CRP) 15): each context corresponds to a "restaurant", and each n-gram count to a "customer" seated at one of its tables.

Fig. 1  Suffix tree representation of a hierarchical Chinese Restaurant Process. Each customer corresponds to a count in the model.

The restaurants are arranged on a prediction suffix tree 25), as shown in Figure 1, in which each node represents a context read backwards from the word to be predicted. For example, to compute p(sing | she will) for the sentence "she will sing", we start from the root ɛ and descend along "will" and then "she", that is, ɛ -> "will" -> "she will", and use the restaurant at the node "she will". Whenever a customer is added to a restaurant and sits at a new table, a proxy customer is sent to the parent restaurant; in this way the 3-gram is smoothed by the 2-gram, the 2-gram by the 1-gram, and so on. In what follows we refer to this prediction suffix tree simply as the suffix tree.
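To make the reversed-context lookup concrete, the following is a minimal sketch in C of descending a prediction suffix tree from the root ɛ along w_{t-1}, w_{t-2}, .... It is an illustration under simplifying assumptions, not the authors' implementation; the type and function names (pst_node, get_child, descend) are hypothetical, and children are kept in a plain linked list.

/* A minimal sketch of reversed-context lookup in a prediction suffix tree.
   All names are illustrative; this is not the paper's implementation. */
#include <stdio.h>
#include <stdlib.h>

typedef struct pst_node {
    int word;                   /* word id on the incoming edge (-1 for the root) */
    struct pst_node *child;     /* first child  */
    struct pst_node *sibling;   /* next sibling */
} pst_node;

/* Return the child of n labeled `word`, creating it if it does not exist yet. */
static pst_node *get_child(pst_node *n, int word)
{
    pst_node *c;
    for (c = n->child; c != NULL; c = c->sibling)
        if (c->word == word) return c;
    c = calloc(1, sizeof(pst_node));
    c->word = word;
    c->sibling = n->child;
    n->child = c;
    return c;
}

/* Descend from the root along w[t-1], w[t-2], ..., i.e. the context read
   backwards from the word to be predicted, down to the given depth. */
static pst_node *descend(pst_node *root, const int *w, int t, int depth)
{
    pst_node *n = root;
    int i;
    for (i = 1; i <= depth && t - i >= 0; i++)
        n = get_child(n, w[t - i]);
    return n;
}

int main(void)
{
    int w[] = { 0, 1, 2 };      /* toy word ids: she=0, will=1, sing=2 */
    pst_node root = { -1, NULL, NULL };
    /* node of the context "she will", used to predict w[2] = "sing" */
    pst_node *ctx = descend(&root, w, 2, 2);
    printf("deepest edge label of the context node: %d\n", ctx->word);  /* 0 = "she" */
    return 0;
}

In the model itself every node additionally holds a restaurant, and the children are kept in balanced search structures rather than a list; the full node layout appears in Figure 5.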
For example, suppose that "she will like" and "he will like" have been observed in the training data and we wish to compute p(like | she will). Even if the 3-gram count of "she will like" is zero, the node "she will" backs off to its parent "will", so that the prediction is smoothed by the 2-gram p(like | will), which is in turn smoothed by the 1-gram p(like), and so on.

Formally, writing the context as h = w_{t-n} ... w_{t-1}, HPYLM computes the predictive probability of a word w as

    p(w | h) = \frac{c(w|h) - d\, t_{hw}}{\theta + c(h)} + \frac{\theta + d\, t_h}{\theta + c(h)}\, p(w | h'),    (1)

where c(w|h) is the number of times w followed h in the training data, c(h) = \sum_w c(w|h), h' = w_{t-n+1} ... w_{t-1} is the context shortened by its earliest word, t_{hw} is the number of tables serving w in the restaurant of h, t_h = \sum_w t_{hw}, and d and \theta are the discount and strength parameters of the Pitman-Yor process. The second term passes probability mass to the shorter context h', so that (1) is applied recursively down to the 1-gram distribution ★1. The seating arrangements on which t_{hw} depends are not observed and are inferred by MCMC (Gibbs sampling) 16). When every observed word type occupies exactly one table, that is t_{hw} = 1, (1) coincides with interpolated Kneser-Ney smoothing; HPYLM is thus a Bayesian generalization of Kneser-Ney 10).

In HPYLM, however, the order n must be fixed in advance: every word is predicted from a context of exactly (n-1) words, that is, from a node at depth (n-1) of the suffix tree in Figure 1, whether or not such a long context is actually needed. In the next section we remove this restriction by introducing a stochastic process on a suffix tree of unbounded depth.

★1 In (1), the 1-gram distribution is further smoothed by the 0-gram distribution, namely the uniform distribution 1/V over the vocabulary.
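As a numerical illustration of the recursion (1), the sketch below evaluates the HPYLM predictive probability along one fixed context path. The per-depth count arrays and the constant discount and strength are toy values chosen only for the example, and the function name is hypothetical; it is not the authors' implementation.

/* Sketch of the HPYLM predictive probability (1) along one context path.
   cw[l], ch[l], tw[l], th[l] stand for c(w|h), c(h), t_hw, t_h at the
   depth-l node of the path (l = 0 is the root, i.e. the unigram restaurant).
   d and theta are the Pitman-Yor discount and strength, taken as constants
   here for simplicity. */
#include <stdio.h>

static double hpylm_prob(int l, const int *cw, const int *ch,
                         const int *tw, const int *th,
                         double d, double theta, int V)
{
    double parent;
    if (l < 0)                    /* below the unigram: the 0-gram, uniform 1/V */
        return 1.0 / V;
    parent = hpylm_prob(l - 1, cw, ch, tw, th, d, theta, V);
    return (cw[l] - d * tw[l]) / (theta + ch[l])
         + (theta + d * th[l]) / (theta + ch[l]) * parent;
}

int main(void)
{
    /* toy counts for a path of depth 2: unigram -> bigram -> trigram node */
    int cw[] = { 50, 10, 3 };     /* c(w|h) */
    int ch[] = { 1000, 40, 5 };   /* c(h)   */
    int tw[] = { 5, 2, 1 };       /* t_hw   */
    int th[] = { 200, 12, 2 };    /* t_h    */
    printf("p(w|h) = %f\n", hpylm_prob(2, cw, ch, tw, th, 0.75, 1.0, 10000));
    return 0;
}

The base case is exactly the footnote above: below the unigram restaurant the recursion bottoms out at the uniform 0-gram distribution 1/V.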
Fig. 2  A stochastic process to add a customer to the suffix tree. (1 - q_i) is the penetration probability of node i.

3. The Variable Order Pitman-Yor Language Model

3.1 A Stochastic Process on the Suffix Tree

We attach to each node i of the suffix tree a latent probability q_i that a customer stops at that node; with probability (1 - q_i) the customer penetrates the node and proceeds one level deeper, as illustrated in Figure 2. Since q_i is unknown, we place a Beta prior on it:

    q_i ~ \mathrm{Be}(\alpha, \beta),    (2)

where \mathrm{Be}(\alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, q^{\alpha-1}(1-q)^{\beta-1} and E[q] = \alpha/(\alpha+\beta). Figure 3 shows Beta densities for several hyperparameters ★1.

Fig. 3  Examples of Beta distributions with different hyperparameters: (0.5, 0.5) Jeffreys prior, (1, 1) uniform, (4, 1), and (2, 4).

Now consider a word w_t whose context is h = ... w_{t-2} w_{t-1}. Starting from the root ɛ of the suffix tree, we descend along w_{t-1}, then w_{t-2}, and so on; at the depth-i node on this path the descent stops with probability q_i and continues with probability (1 - q_i). The probability that w_t is generated from a context of length l is therefore

    p(n = l | h) = q_l \prod_{i=0}^{l-1} (1 - q_i).    (3)

Because every node has its own q_i, the distribution (3) differs from context to context: a context that is frequently required to be long will acquire small stop probabilities near the root, whereas a rare context will tend to stop early. The order n is thus no longer a single global constant but a hidden variable attached to each word occurrence.

★1 Instead of the Beta prior (2), a two-parameter stick-breaking process 15) over the depths could also be used.

3.2 Inference of the n-gram Orders

Neither the stop probabilities nor the orders are observed. Given a training corpus \mathbf{w} = w_1 w_2 ... w_T, let \mathbf{n} = n_1 n_2 ... n_T denote the hidden n-gram orders and \mathbf{s} = s_1 s_2 ... s_T the hidden seating arrangements, where s_t is the seating of w_t in the hierarchical CRP as in HPYLM 10). The probability of the corpus is

    p(\mathbf{w}) = \sum_{\mathbf{n}} \sum_{\mathbf{s}} p(\mathbf{w}, \mathbf{n}, \mathbf{s}),    (4)

and we infer \mathbf{n} and \mathbf{s} by Gibbs sampling. For each word w_t we remove its customer from the model and resample its order from

    n_t ~ p(n_t | \mathbf{w}, \mathbf{n}_{-t}, \mathbf{s}_{-t}),    (5)

where the subscript -t denotes all variables except those of w_t; the customer is then added back at depth n_t, which determines s_t. Integrating out q_i under the prior (2), the expected stop probability of node i is

    E[q_i] = \frac{a_i + \alpha}{a_i + b_i + \alpha + \beta},    (6)

where a_i and b_i are the numbers of times the other customers stopped at, and passed through, node i, respectively. The sampling distribution (5) factorizes as

    p(n_t | \mathbf{w}, \mathbf{n}_{-t}, \mathbf{s}_{-t}) \propto p(w_t | \mathbf{w}_{-t}, \mathbf{n}, \mathbf{s}_{-t})\, p(n_t | \mathbf{w}_{-t}, \mathbf{n}_{-t}, \mathbf{s}_{-t}),    (7)

where the first factor is the probability of w_t under the context of length n_t, computed by (1), and the second factor follows from (3) and (6) as

    p(n_t = l | \mathbf{w}_{-t}, \mathbf{n}_{-t}, \mathbf{s}_{-t}) = \frac{a_l + \alpha}{a_l + b_l + \alpha + \beta} \prod_{i=0}^{l-1} \frac{b_i + \beta}{a_i + b_i + \alpha + \beta}.    (8)

We call the resulting model the variable order Pitman-Yor language model (VPYLM). Its Gibbs sampler is summarized in Figure 4: for each word w_t, taken in random order, the customer added in the previous iteration is removed together with its recorded order order[t], a new order is sampled from (7), and the customer is added back at that depth; the counts a_i and b_i along the path are decremented and incremented accordingly. Note that an HPYLM of order n is recovered as the special case in which every n_t is fixed to (n-1). In practice we may also set a maximum order n_max and sample n_t from (8) restricted to l <= n_max, although the model itself places no limit on the depth.
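As a concrete illustration of the Gibbs step (7)-(8), the sketch below computes the sampling distribution over the order n_t of a single word from the stop and pass-through counts (a_i, b_i) along its context path, combines it with per-depth word probabilities such as those produced by (1), and then draws n_t. All numbers are toy values and the function names are hypothetical; this is not the authors' code.

/* Sketch of the sampling distribution (7)-(8) over the order n_t of one word.
   a[i], b[i] are the stop / pass-through counts of the depth-i node on the
   context path (excluding the current customer), alpha and beta are the Beta
   hyperparameters, and pw[l] stands for the word probability at depth l,
   e.g. computed by the HPYLM recursion (1). */
#include <stdio.h>
#include <stdlib.h>

static void order_posterior(const double *pw, const int *a, const int *b,
                            double alpha, double beta, int nmax, double *post)
{
    double pass = 1.0, z = 0.0;    /* running product of pass-through probabilities */
    int l;
    for (l = 0; l <= nmax; l++) {
        double stop = (a[l] + alpha) / (a[l] + b[l] + alpha + beta);   /* (6) */
        post[l] = pw[l] * stop * pass;                                 /* (7), (8) */
        pass   *= (b[l] + beta) / (a[l] + b[l] + alpha + beta);
        z      += post[l];
    }
    for (l = 0; l <= nmax; l++) post[l] /= z;                          /* normalize */
}

int main(void)
{
    /* toy values for a path of depth 3, i.e. candidate orders n_t = 0,1,2,3 */
    double pw[] = { 0.001, 0.02, 0.15, 0.20 };
    int    a[]  = { 40, 25, 6, 1 };
    int    b[]  = { 200, 60, 4, 0 };
    double post[4], r, c = 0.0;
    int l;
    order_posterior(pw, a, b, 4.0, 1.0, 3, post);
    for (l = 0; l <= 3; l++) printf("p(n_t = %d | ...) = %.4f\n", l, post[l]);
    r = (double)rand() / RAND_MAX;          /* draw n_t by inversion */
    for (l = 0; l <= 3; l++) { c += post[l]; if (r <= c) break; }
    printf("sampled n_t = %d\n", l > 3 ? 3 : l);
    return 0;
}

Omitting the word likelihood pw[l] leaves exactly the context length distribution (3) evaluated with the expected stop probabilities (6), which is the quantity used for prediction in Section 3.3.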
For j = 1 ... N {
    For t = randperm(1 ... T) {
        if (j > 1) then remove_customer(order[t], w_t, w_{1:t-1});
        order[t] = add_customer(w_t, w_{1:t-1});
    }
}

Fig. 4  The Gibbs sampler of VPYLM.

3.3 Prediction

To predict a new word w after a context h = ... w_2 w_1 (written backwards from the most recent word), we marginalize over the hidden order n:

    p(w | h) = \sum_n p(w, n | h)    (9)
             = \sum_n p(w | n, h)\, p(n | h),    (10)

where p(w | n, h) is the prediction from the length-n context given by (1), and p(n | h) is the context length distribution (3) evaluated with the expected stop probabilities (6) of the nodes on the path of h. Whereas HPYLM interpolates a fixed n-gram recursively through (1), VPYLM additionally mixes over all context lengths n = 0, 1, 2, ..., with weights learned separately for each context. The hidden variables \mathbf{n} and \mathbf{s} used here are those obtained after the N Gibbs iterations of Figure 4.

3.4 Implementation

The suffix tree is built incrementally: a node is created only when a customer actually reaches it, so the tree stays far sparser than a complete tree of depth (n-1) (cf. 5)). The children of each node are stored in splay trees 18), which give amortized O(log n) access and automatically keep frequently accessed children, whose access pattern follows a power law, near the top. Figure 5 shows the data structure of a node.

struct ngram {              /* n-gram node */
    ngram *parent;          /* parent node                         */
    splay *children;        /* child nodes       (= ngram **)      */
    splay *words;           /* word restaurants  (= restaurant **) */
    int    stop;            /* a_h : stop count of this node       */
    int    through;         /* b_h : pass-through count            */
    int    ncounts;         /* c(h): total number of customers     */
    int    ntables;         /* t_h : total number of tables        */
    int    id;              /* word id */
};

Fig. 5  Data structure of an n-gram node.

4. Combining the Variable Order n-gram Model with Latent Topics

The model of Section 3 captures how far each word depends on its local context. In this section we combine it with a document-level latent topic model, so that the n-gram statistics themselves can depend on the topic of the document.

4.1 Latent Dirichlet Allocation (LDA)

LDA 19) models each document as a mixture of M latent topics. The topic proportions of a document d are \theta_d = (\theta_1, \theta_2, ..., \theta_M), drawn from a symmetric Dirichlet prior

    \theta ~ \mathrm{Dir}(\gamma) = \frac{\Gamma(M\gamma)}{\Gamma(\gamma)^M} \prod_{t=1}^{M} \theta_t^{\gamma-1},    (11)

where \gamma is a hyperparameter. A document d = w_1 w_2 ... w_N is then generated as follows:

(1) Draw \theta_d ~ \mathrm{Dir}(\gamma).
(2) For n = 1 ... N,
    a. draw a topic t ~ \mathrm{Mult}(\theta_d);
    b. draw a word w_n ~ p(w | t).

In LDA the topic-wise word distribution p(w | t) is a unigram distribution 19), and the topic assignments can be inferred by Gibbs sampling (see also 17) for sampling in related CRP-based mixture models).
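The following toy sketch simulates the generative process of steps (1) and (2) above. The topic-word table phi is made up for the example, and gamma = 1 is assumed so that the Dirichlet draw reduces to normalized Exponential(1) variables; none of this corresponds to the authors' code.

/* Toy simulation of the LDA generative process of Sec. 4.1:
   theta_d ~ Dir(gamma); for each word, a topic t ~ Mult(theta_d) and
   a word w ~ p(w|t).  gamma = 1 is assumed so that the Gamma draws
   needed for the Dirichlet reduce to Exponential(1) variables. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define M 3    /* number of topics      */
#define V 4    /* vocabulary size       */
#define N 8    /* words in the document */

static double uniform(void) { return (rand() + 1.0) / (RAND_MAX + 2.0); }

static int multinomial(const double *p, int k)
{
    double r = uniform(), c = 0.0;
    int i;
    for (i = 0; i < k; i++) { c += p[i]; if (r <= c) return i; }
    return k - 1;
}

int main(void)
{
    /* p(w|t): one multinomial over V words per topic (illustrative values) */
    static const double phi[M][V] = {
        { 0.70, 0.10, 0.10, 0.10 },
        { 0.10, 0.70, 0.10, 0.10 },
        { 0.10, 0.10, 0.40, 0.40 },
    };
    double theta[M], z = 0.0;
    int t, n;
    /* theta_d ~ Dir(1,...,1): normalize i.i.d. Exponential(1) draws */
    for (t = 0; t < M; t++) { theta[t] = -log(uniform()); z += theta[t]; }
    for (t = 0; t < M; t++) theta[t] /= z;
    for (n = 0; n < N; n++) {
        int topic = multinomial(theta, M);
        int word  = multinomial(phi[topic], V);
        printf("w_%d : topic %d, word %d\n", n + 1, topic, word);
    }
    return 0;
}

The extension in the next subsection replaces the table phi, that is the unigram p(w | t), by a variable order n-gram model per topic.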
4.2 VPYLDA

We now replace the unigram distribution p(w | t) of LDA by a topic-wise VPYLM: each topic k has its own suffix tree, that is, its own hierarchical Pitman-Yor model of variable order, while the topic proportions \theta_d of each document are drawn from the Dirichlet prior (11) as in LDA. Each word w_t of a document is generated by first choosing a topic from \theta_d and then drawing w_t from that topic's variable order n-gram model given the history. We call the resulting model VPYLDA.

Inference again uses Gibbs sampling, shown in Figure 6. For each word w_t we remove its customer from the model of its current topic and resample the topic k by

    p(k | w_t, \mathbf{w}_{1:t-1}) \propto p(w_t | \mathbf{w}_{1:t-1}, k)\, p(k | \theta_d)    (12)
                                   \propto p(w_t | \mathbf{w}_{1:t-1}, k)\, (n^d_{-t,k} + \gamma),    (13)

where p(w_t | \mathbf{w}_{1:t-1}, k) is computed by the VPYLM of topic k through (10), n^d_{-t,k} is the number of words of document d currently assigned to topic k excluding w_t, and (13) is obtained from (12) by integrating out \theta_d. The customer of w_t is then added to the suffix tree of the sampled topic; its position in that suffix tree, including its order, is recorded in table[t], and its topic in topic[t].

For j = 1 ... N {
    For t = randperm(1 ... T) {
        d = document(w_t);
        k = topic[t];
        if (j > 1) then remove_customer(table[t]);
        k = draw_topic(w_t, w_{1:t-1}, d);
        table[t] = add_customer(w_t, w_{1:t-1}, k);
        topic[t] = k;
    }
}

Fig. 6  The Gibbs sampler of VPYLDA.
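As an illustration of draw_topic in Figure 6, the sketch below resamples the topic of one word according to (13). The per-topic word probabilities p(w_t | w_{1:t-1}, k), which in the model come from each topic's suffix tree through (10), are simply passed in as a precomputed array; all names and numbers are illustrative, not the authors' implementation.

/* Sketch of draw_topic in Fig. 6, following (13):
   the topic of w_t is resampled in proportion to its probability under each
   topic's VPYLM times the collapsed topic count of the document. */
#include <stdio.h>
#include <stdlib.h>

static int draw_topic(const double *pw_k, const int *ndk, double gam, int M)
{
    double p[64], z = 0.0, r, c = 0.0;   /* assumes M <= 64 for this sketch */
    int k;
    for (k = 0; k < M; k++) {
        p[k] = pw_k[k] * (ndk[k] + gam); /* (13) */
        z   += p[k];
    }
    r = z * rand() / (double)RAND_MAX;
    for (k = 0; k < M; k++) { c += p[k]; if (r <= c) return k; }
    return M - 1;
}

int main(void)
{
    double pw_k[] = { 0.004, 0.020, 0.001 };  /* p(w_t | w_{1:t-1}, k) for M = 3 topics */
    int    ndk[]  = { 12, 3, 40 };            /* n^d_{-t,k}: topic counts in document d */
    int counts[3] = { 0, 0, 0 }, i;
    for (i = 0; i < 10000; i++) counts[draw_topic(pw_k, ndk, 0.5, 3)]++;
    printf("sampled topic frequencies: %d %d %d\n", counts[0], counts[1], counts[2]);
    return 0;
}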
5. Experiments

5.1 VPYLM on Text Corpora

Data.  For English, we used the Wall Street Journal part of the NAB (North American Business News) corpus in the setting of 21) 22): 409,246 sentences (10,007,108 words) for training and 10,000 sentences for testing ★1. For Japanese, we used 520,000 sentences (10,079,410 words) of newspaper articles from the year 2000 for training and 10,000 sentences for testing; the text was segmented into words by MeCab 23), yielding a vocabulary of 32,783 words.

Settings.  We compared HPYLM with the proposed VPYLM for several orders n. The Gibbs sampler of Figure 4 was run for N = 200 iterations. Unless otherwise noted, the hyperparameters of the suffix tree prior were set to (\alpha, \beta) = (4, 1); the sensitivity to (\alpha, \beta), including the uniform prior (1, 1), is examined in Section 6. All experiments were run on a 3.2 GHz Xeon Linux machine with 4 GB of memory.

Results.  Table 1 shows the test-set perplexities together with the numbers of nodes in the resulting models. VPYLM attains perplexities comparable to HPYLM with substantially fewer nodes, about 40% fewer at n = 5 in English. Moreover, HPYLM could not be trained for n = 7, 8 because of memory overflow, whereas VPYLM could; the row marked ∞ in each table is a VPYLM trained with no limit on the order. For reference, modified Kneser-Ney smoothing as implemented in SRILM 24) gave perplexities of 118.91, 107.99, 107.24 and 107.21 for n = 3, 5, 7, 8 respectively, which are worse than both HPYLM and VPYLM. The Japanese results show the same tendency. Figure 7 shows the distribution of the n-gram orders inferred by the 8-gram VPYLM over the whole training data: the inferred orders concentrate around n = 3, 4, while a small fraction of words are generated from much longer contexts.

Table 1  Test-set perplexities with the number of nodes in the model in English and Japanese corpora. N/A means a memory overflow.

(a) English (NAB corpus)
  n    HPYLM    VPYLM    Nodes(H)    Nodes(V)
  3    113.60   113.74    1,417K      1,344K
  5    101.08   101.69   12,699K      7,466K
  7      N/A    100.68       N/A     10,182K
  8      N/A    100.58       N/A     10,434K
  ∞       -     117.65        -      10,392K

(b) Japanese
  n    HPYLM    VPYLM    Nodes(H)    Nodes(V)
  3     78.06    78.22    1,341K      1,243K
  5     68.36    69.35   12,140K      6,705K
  7      N/A     68.63       N/A      9,134K
  8      N/A     68.60       N/A      9,490K
  ∞       -      80.38        -       9,561K

Fig. 7  Global n-gram context length distribution inferred by an 8-gram VPYLM. n = 0 is a unigram, n = 1 is a bigram, and so on. Each n-gram is further interpolated by lower-order n-grams recursively through equation (1).

★1 We also ran the model on the Chinese Sinica corpus 20); the test-set perplexity of the 5-gram VPYLM was 43.39.

5.2 Stochastic Phrases

The term p(w, n | h) in (9) is the joint probability that w is generated from exactly the n preceding words of h. For example, the probability that "america" is generated from exactly the context "the united states of" can be read as the probability that "the united states of america" behaves as a single phrase. By computing p(w, n | h) = p(w | n, h) p(n | h) over the nodes of the suffix tree, we can therefore extract "stochastic phrases" together with their probabilities. Figure 8 shows high-probability examples obtained from the suffix tree of the 8-gram VPYLM trained on the NAB corpus.

Fig. 8  Stochastic phrases computed from the suffix tree of the 8-gram VPYLM that is trained on the NAB corpus.

  p(w, n)   Stochastic phrase
  0.9784    primary new issues
  0.9726    ˆ at the same time
  0.9556    american telephone &
  0.9512    is a unit of
  0.9394    to # % from # %
  0.9026    from # % in # to # %
  0.8896    in a number of
  0.8831    in new york stock exchange composite trading
  0.8696    a merrill lynch & co.
  0.7566    mechanism of the european monetary
  0.7134    increase as a result of
  0.6617    tiffany & co. :

5.3 VPYLDA on 20news-18828

Data and settings.  To evaluate the topic extension of Section 4, we used the 20news-18828 dataset 26), which consists of newsgroup documents in 20 categories. We used 4,000 documents (801,580 words, with a vocabulary of 10,477 words) for training and 400 documents for testing. The Gibbs sampler of Figure 6 was run for N = 250 iterations. For each word position n of a test document, writing the history as h = w_1 ... w_{n-1}, the predictive probability is the mixture

    p(w_n | h) = \sum_t p(w_n | h, t)\, p(t | h),    (14)

where p(w_n | h, t) is given by the VPYLM of topic t through (10) and p(t | h) is the topic distribution inferred from the history h, which we compute with the variational EM inference of LDA 19).
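The mixture (14) itself is a simple weighted sum over topics; the sketch below computes it from per-topic predictions and a topic posterior for the history, both passed in as precomputed arrays with illustrative values only.

/* Sketch of the test-time mixture (14): p(w_n|h) = sum_t p(w_n|h,t) p(t|h). */
#include <stdio.h>

static double predict(const double *pw_t, const double *pt_h, int M)
{
    double p = 0.0;
    int t;
    for (t = 0; t < M; t++)
        p += pw_t[t] * pt_h[t];
    return p;
}

int main(void)
{
    double pw_t[] = { 0.012, 0.002, 0.030 };  /* p(w_n | h, t) from each topic's VPYLM    */
    double pt_h[] = { 0.60, 0.25, 0.15 };     /* p(t | h): topic posterior of the history */
    printf("p(w_n | h) = %f\n", predict(pw_t, pt_h, 3));
    return 0;
}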
Results.  Table 2 shows the test-set perplexities of VPYLDA for different numbers of topics M; the case M = 1 is equivalent to a single VPYLM. In VPYLDA each topic maintains its own hierarchical CRP (suffix tree), and every word is assigned to one of them; mixing the topic-wise variable order n-gram models yields a small but consistent improvement over the single VPYLM, and the improvement grows with M.

Table 2  VPYLDA test-set perplexities on the 20news-18828 dataset.

  Model             PPL
  VPYLDA (M = 5)    104.69
  VPYLDA (M = 10)   103.57
  VPYLDA (M = 20)   103.28
  VPYLM             105.30

6. Discussion

Models with contexts of variable or unbounded length have also been studied in other fields. The Context Tree Weighting method 27) in data compression and the mixture of prediction suffix trees of Pereira et al. 5) combine the predictions of contexts of every length; however, the predictive distribution at each node is estimated with simple smoothing such as Witten-Bell discounting. The PPM* algorithm in text compression likewise exploits contexts of unbounded length, with escape probabilities that correspond to Witten-Bell smoothing. In contrast, our model performs hierarchical Pitman-Yor smoothing at every node and treats the n-gram order as a hidden variable with an explicit prior and posterior, so that the context length distribution is itself learned from the data.

In the experiments of Section 5 the suffix tree hyperparameters were fixed to (\alpha, \beta) = (4, 1). They can also be estimated from the data, since each q_i in (6) has a Beta posterior: estimating (\alpha, \beta) by a Newton iteration following 28) on a 1M-word subset of the NAB corpus gave (\alpha, \beta) = (0.85, 0.57). Figure 9 plots the test-set perplexity when (\alpha, \beta) are varied over the range (0.1-10) x (0.1-10).

Fig. 9  Relationship between the VPYLM hyperparameters (\alpha, \beta) and test-set perplexities.

Finally, Figure 10 shows text generated by a random walk from a 5-gram VPYLM trained on the novel "Maihime": at each step an order n is sampled according to (8) and the next word is drawn from the corresponding context node. Which continuation is appropriate after a phrase such as "mixture of", for example "Gaussians" or "flour", depends on information beyond the local context, such as the topic of the text; this is what the combination with latent topics in Section 4 is intended to capture.

Fig. 10  Random walk generation from the 5-gram VPYLM trained on Maihime.
7. Conclusion

We have proposed a variable order n-gram language model based on the hierarchical Pitman-Yor process. By introducing a stochastic process on a suffix tree of unbounded depth, the n-gram order from which each word was generated is treated as a hidden variable and inferred by Gibbs sampling. Experiments on standard English and Japanese corpora showed that the proposed model attains accuracy comparable to HPYLM with a substantially more compact model, and that it remains applicable at orders where a fixed HPYLM no longer fits in memory. The stochastic process on the suffix tree is not specific to words: it can be applied to general Markov models to estimate variable orders of generation, and exploring its connections with combinatorial stochastic processes 29) is left for future work.

References

1) Shannon, C. E.: A Mathematical Theory of Communication, Bell System Technical Journal, Vol. 27, pp. 379-423, 623-656 (1948).
2) Proc. of NLP-2005 (2005).
3) Siu, M. and Ostendorf, M.: Variable N-grams and Extensions for Conversational Speech Language Modeling, IEEE Transactions on Speech and Audio Processing, Vol. 8, pp. 63-75 (2000).
4) Stolcke, A.: Entropy-based Pruning of Backoff Language Models, Proc. of the DARPA Broadcast News Transcription and Understanding Workshop, pp. 270-274 (1998).
5) Pereira, F., Singer, Y. and Tishby, N.: Beyond Word N-grams, Proc. of the Third Workshop on Very Large Corpora, pp. 95-106 (1995).
6) Leonardi, F. G.: A Generalization of the PST Algorithm: Modeling the Sparse Nature of Protein Sequences, Bioinformatics, Vol. 22, No. 11, pp. 1302-1307 (2006).
7) Bühlmann, P. and Wyner, A. J.: Variable Length Markov Chains, The Annals of Statistics, Vol. 27, No. 2, pp. 480-513 (1999).
8) Kawabata, T. and Tamoto, M.: Back-off Method for N-gram Smoothing Based on Binomial Posteriori Distribution, Proc. of ICASSP-96, Vol. I, pp. 192-195 (1996).
9) Cowans, P.: Probabilistic Document Modelling, PhD Thesis, University of Cambridge (2006). http://www.inference.phy.cam.ac.uk/pjc51/thesis/index.html
10) Teh, Y. W.: A Bayesian Interpretation of Interpolated Kneser-Ney, Technical Report TRA2/06, School of Computing, National University of Singapore (2006).
11) Teh, Y. W.: A Hierarchical Bayesian Language Model Based on Pitman-Yor Processes, Proc. of COLING/ACL 2006, pp. 985-992 (2006).
12) Kneser, R. and Ney, H.: Improved Backing-off for m-gram Language Modeling, Proc. of ICASSP, Vol. 1, pp. 181-184 (1995).
13) Pitman, J. and Yor, M.: The Two-Parameter Poisson-Dirichlet Distribution Derived from a Stable Subordinator, Annals of Probability, Vol. 25, No. 2, pp. 855-900 (1997).
14) Teh, Y. W., Jordan, M. I., Beal, M. J. and Blei, D. M.: Hierarchical Dirichlet Processes, Technical Report 653, Department of Statistics, University of California at Berkeley (2004).
15) Ghahramani, Z.: Non-parametric Bayesian Methods, UAI 2005 Tutorial (2005). http://learning.eng.cam.ac.uk/zoubin/talks/uai05tutorial-b.pdf
16) Gilks, W. R., Richardson, S. and Spiegelhalter, D. J.: Markov Chain Monte Carlo in Practice, Chapman & Hall / CRC (1996).
17) Daumé III, H.: Fast Search for Dirichlet Process Mixture Models, Proc. of AISTATS 2007 (2007).
18) Sleator, D. and Tarjan, R.: Self-Adjusting Binary Search Trees, Journal of the ACM, Vol. 32, No. 3, pp. 652-686 (1985).
19) Blei, D. M., Ng, A. Y. and Jordan, M. I.: Latent Dirichlet Allocation, Journal of Machine Learning Research, Vol. 3, pp. 993-1022 (2003).
20) Huang, C.-R. and Chen, K.-J.: A Chinese Corpus for Linguistics Research, Proc. of COLING 1992, pp. 1214-1217 (1992).
21) Chen, S. F. and Goodman, J.: An Empirical Study of Smoothing Techniques for Language Modeling, Proc. of ACL 1996, pp. 310-318 (1996).
22) Goodman, J. T.: A Bit of Progress in Language Modeling, Extended Version, Technical Report MSR-TR-2001-72, Microsoft Research (2001).
23) Kudo, T.: MeCab: Yet Another Part-of-Speech and Morphological Analyzer. http://mecab.sourceforge.net/
24) Stolcke, A.: SRILM - An Extensible Language Modeling Toolkit, Proc. of ICSLP, Vol. 2, pp. 901-904 (2002).
25) Ron, D., Singer, Y. and Tishby, N.: The Power of Amnesia, Advances in Neural Information Processing Systems, Vol. 6, pp. 176-183 (1994).
26) Lang, K.: NewsWeeder: Learning to Filter Netnews, Proc. of the Twelfth International Conference on Machine Learning, pp. 331-339 (1995).
27) Willems, F., Shtarkov, Y. and Tjalkens, T.: The Context-Tree Weighting Method: Basic Properties, IEEE Transactions on Information Theory, Vol. 41, pp. 653-664 (1995).
28) Minka, T. P.: Estimating a Dirichlet Distribution (2000). http://research.microsoft.com/~minka/papers/dirichlet/
29) Pitman, J.: Combinatorial Stochastic Processes, Technical Report 621, Department of Statistics, University of California, Berkeley (2002).

Daichi Mochihashi was born in 1973 and graduated in 1998. He received his doctorate in 2005, joined ATR Spoken Language Communication Research Laboratories in 2003, and has been with NTT Communication Science Laboratories since 2007.

Eiichiro Sumita graduated in 1982 and received his doctorate in 1999. He has been engaged in research at ATR, NICT, and ATR-Langue. He is a member of the IEICE, the Association for Natural Language Processing, the Acoustical Society of Japan, the ACL and the IEEE, and serves as an Associate Editor of ACM Transactions on Speech and Language Processing.