DEIM Forum 2011 C1-4

Determination of Topic Description Terms in Topic Model

Kota KAMURA, Hung-Hsuan HUANG, and Kyoji KAWAGOE

Graduate School of Science and Engineering, Ritsumeikan University
Noji-higashi 1-1-1, Kusatsu, Shiga, 525-8577 Japan
College of Information Science and Engineering, Ritsumeikan University
Noji-higashi 1-1-1, Kusatsu, Shiga, 525-8577 Japan
E-mail: kamura@coms.ics.ritsumei.ac.jp, huang@fc.ritsumei.ac.jp, kawagoe@is.ritsumei.ac.jp

Abstract  Recently, the topic model has attracted much attention and has been the subject of many studies. Although a topic in the model is defined as a latent factor that governs the fluctuation of word frequencies in documents, it is necessary, yet difficult, for a user to interpret the semantic meaning of the obtained topics. We therefore propose novel methods for determining topic description terms, i.e. terms that describe a topic. In this paper, we first propose an ideal determination method, in which the topic description terms are obtained so as to minimize the total topic description cost over all combinations of terms. We also propose several heuristic determination methods that drastically reduce the computational cost.

Key words  topic model, text mining, Latent Dirichlet Allocation

1. Introduction

(The introduction surveys topic models and their applications, citing [1], [2], [3], [7], [8] and [9], and motivates the problem with an example topic, Topic1, whose meaning is hard to grasp from its raw term distribution.)
(Figure: an example topic, Topic1, whose most probable terms have probabilities 0.0794, 0.0700 and 0.0657, contrasted with Topic2.)

2. Problem Definition

2.1 Topic model

Let D = {d_l} be a set of documents, for example Web documents. Applying Latent Dirichlet Allocation (LDA) to D yields a set of topics P = {p_k}. Each topic p_k is characterized by a term distribution r_k: each term t_i has a weight w_ki in p_k, so that r_k = {(t_i, w_ki)}. Each document d is characterized by a topic distribution q_d: each topic p_k has a weight r_dk in d, so that q_d = {(p_k, r_dk)}. T_o denotes a fixed ordering of all terms t_i appearing in D.

Table 1  Term probability distributions of Topic1 and Topic2

  rank    Topic1    Topic2
   1      0.0762    0.0728
   2      0.0754    0.0672
   3      0.0557    0.0601
   4      0.0526    0.0563
   5      0.0505    0.0462
   6      0.0453    0.0461
   7      0.0428    0.0438
   8      0.0367    0.0363

As a running example, consider the two topics Topic1 (p_1) and Topic2 (p_2) of Table 1 and two documents d_1 and d_2:

  p_1 = {(t_1, 0.0762), (t_2, 0.0754), (t_3, 0.0557), (t_4, 0.0526), (t_5, 0.0505), (t_6, 0.0453), (t_7, 0.0428), (t_8, 0.0367)}
  p_2 = {(t_1, 0.0601), (t_2, 0.0363), (t_3, 0.0462), (t_4, 0.0728), (t_5, 0.0438), (t_6, 0.0563), (t_7, 0.0461), (t_8, 0.0672)}
  d_1 = {(p_1, 0.65), (p_2, 0.35)}
  d_2 = {(p_1, 0.24), (p_2, 0.76)}
  T_o = < t_1, t_2, t_3, t_4, t_5, t_6, t_7, t_8 >

where t_1, ..., t_8 denote the eight terms of the example.

2.2 Description terms as bit vectors

For each topic p_k in P, a bit vector b_k over the term order T_o is defined: the s-th bit b_ks is 1 if the s-th term of T_o is selected as a description term of p_k, and 0 otherwise. For example, if p_1 is described by t_1 and t_3, then b_1 = 10100000; if p_2 is described by t_1, t_6 and t_7, then b_2 = 10000110. The per-topic vectors are concatenated into a single vector b = b_1 & b_2 & ... & b_{|P|}, where & denotes concatenation; in the example, b = b_1 & b_2 = 1010000010000110.

A description term assignment b is evaluated by the function f(b):

  f(b) = \frac{\sum_{k} \prod_{s} g_{ks}}{\sum_{k} \sum_{s} b_{ks}}    (1)

  g_{ks} = \begin{cases} w_{ks} / W_s & \text{if } b_{ks} = 1 \\ 1 - w_{ks} / W_s & \text{otherwise} \end{cases}    (2)

  W_s = \sum_{k} w_{ks}    (3)

where w_ks is the weight of the s-th term of T_o in topic p_k, k ranges over the topics in P, and s ranges over the term positions of T_o.
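To make the evaluation concrete, the following is a minimal sketch, in Java (the implementation language mentioned in Section 4), of the computation of f(b) for the two-topic, eight-term example of Table 1. The class and method names are illustrative, the normalization by the number of selected terms follows the reconstruction of equation (1) above, and the sketch is not the implementation used in the experiments.

```java
// Minimal sketch of the evaluation function f(b) of equations (1)-(3).
// The topic-term weights are those of Table 1; names are illustrative only.
public class DescriptionTermScore {

    // W[k][s]: weight of the s-th term of T_o in topic p_k (Table 1 example).
    static final double[][] W = {
        {0.0762, 0.0754, 0.0557, 0.0526, 0.0505, 0.0453, 0.0428, 0.0367}, // Topic1 (p_1)
        {0.0601, 0.0363, 0.0462, 0.0728, 0.0438, 0.0563, 0.0461, 0.0672}  // Topic2 (p_2)
    };

    // f(b): b[k][s] = true if the s-th term is a description term of topic p_k.
    static double f(boolean[][] b, double[][] w) {
        int K = w.length, S = w[0].length;
        double[] colSum = new double[S];                 // W_s = sum over topics of w_ks (eq. 3)
        for (int s = 0; s < S; s++)
            for (int k = 0; k < K; k++) colSum[s] += w[k][s];

        double numerator = 0.0;
        int selected = 0;                                // total number of description terms
        for (int k = 0; k < K; k++) {
            double prod = 1.0;
            for (int s = 0; s < S; s++) {
                double share = w[k][s] / colSum[s];      // relative weight of term s in topic k
                prod *= b[k][s] ? share : 1.0 - share;   // g_ks (eq. 2)
                if (b[k][s]) selected++;
            }
            numerator += prod;                           // sum over topics of the products
        }
        return selected == 0 ? 0.0 : numerator / selected;  // eq. (1) as reconstructed
    }

    public static void main(String[] args) {
        // b = b_1 & b_2 = 10100000 10000110 (t_1, t_3 for Topic1; t_1, t_6, t_7 for Topic2).
        boolean[][] b = {
            {true, false, true, false, false, false, false, false},
            {true, false, false, false, false, true, true, false}
        };
        System.out.printf("f(b) = %.6f%n", f(b, W));
    }
}
```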
In the running example, for b = b_1 & b_2 = 1010000010000110, substituting the weights of Table 1 into equations (1)-(3) gives

  \prod_s g_{1s} + \prod_s g_{2s}
    = (w_11/W_1)(1 - w_12/W_2)(w_13/W_3)(1 - w_14/W_4)(1 - w_15/W_5)(1 - w_16/W_6)(1 - w_17/W_7)(1 - w_18/W_8)
    + (w_21/W_1)(1 - w_22/W_2)(1 - w_23/W_3)(1 - w_24/W_4)(1 - w_25/W_5)(w_26/W_6)(w_27/W_7)(1 - w_28/W_8)
    = (0.0762/0.1117)(1 - 0.0754/0.1363)(0.0557/0.1019)(1 - 0.0526/0.1039)(1 - 0.0505/0.0943)(1 - 0.0453/0.1254)(1 - 0.0428/0.1016)(1 - 0.0367/0.0889)
    + (0.0601/0.1117)(1 - 0.0363/0.1363)(1 - 0.0462/0.1019)(1 - 0.0728/0.1039)(1 - 0.0438/0.0943)(0.0563/0.1254)(0.0461/0.1016)(1 - 0.0672/0.0889)

and f(b) = 0.005046, where W_1 = 0.1117, W_2 = 0.1363, W_3 = 0.1019, W_4 = 0.1039, W_5 = 0.0943, W_6 = 0.1254, W_7 = 0.1016 and W_8 = 0.0889.

Intuitively, g_ks in equation (2) is large when a term selected as a description term of p_k has a large weight w_ks in p_k relative to the other topics, and when a term that is not selected has only a small relative weight; w_ks/W_s is the share of topic p_k in the total weight W_s of the s-th term defined in equation (3). A description term of p_k should therefore be a term that is characteristic of p_k rather than merely frequent in it.

2.3 Optimal determination of description terms

The ideal determination selects the bit vector b_opt that maximizes f(b):

  b_{opt} = \arg\max_{b} f(b)    (4)

In the running example, b has 2 x 8 = 16 bits, so f(b) is evaluated for every combination from 0000000000000000 to 1111111111111111, and the combination with the largest value is taken as b_opt. For instance, for b = 1010000010000110 we obtained f(b) = 0.005046 above, whereas for b~ = 1000000000010000 (t_1 describing Topic1 and t_4 describing Topic2) we obtain f(b~) = 0.01319; since f(b~) > f(b), b is discarded. The vector b^ for which no other assignment attains a larger value of f is returned as b_opt, and its bits give the description terms of Topic1 and Topic2.
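The exhaustive search of equation (4) can be sketched as follows; as above, this is an illustrative Java sketch of the reconstructed procedure, enumerating all 2^(K x S) bit assignments for K topics and S terms, and not the implementation evaluated in Section 4.

```java
// Minimal sketch of the exhaustive search of equation (4): enumerate all
// 2^(K*S) assignments b and keep the one with the largest f(b).
// The example of Table 1 has K = 2 topics and S = 8 terms, i.e. 2^16 candidates.
public class ExhaustiveDescriptionSearch {

    static final double[][] W = {
        {0.0762, 0.0754, 0.0557, 0.0526, 0.0505, 0.0453, 0.0428, 0.0367}, // Topic1
        {0.0601, 0.0363, 0.0462, 0.0728, 0.0438, 0.0563, 0.0461, 0.0672}  // Topic2
    };

    // Same evaluation function as in the previous sketch (equations (1)-(3)),
    // here taking the assignment as a bit mask.
    static double f(int mask, double[][] w) {
        int K = w.length, S = w[0].length;
        double numerator = 0.0;
        int selected = Integer.bitCount(mask);                       // total description terms
        for (int k = 0; k < K; k++) {
            double prod = 1.0;
            for (int s = 0; s < S; s++) {
                double colSum = 0.0;
                for (int kk = 0; kk < K; kk++) colSum += w[kk][s];   // W_s (eq. 3)
                boolean on = ((mask >> (k * S + s)) & 1) == 1;       // b_ks
                double share = w[k][s] / colSum;
                prod *= on ? share : 1.0 - share;                    // g_ks (eq. 2)
            }
            numerator += prod;
        }
        return selected == 0 ? 0.0 : numerator / selected;           // eq. (1) as reconstructed
    }

    public static void main(String[] args) {
        int K = W.length, S = W[0].length;
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int mask = 1; mask < (1 << (K * S)); mask++) {          // all non-empty assignments
            double score = f(mask, W);
            if (score > bestScore) { bestScore = score; best = mask; }
        }
        // Print b_opt as the concatenation b_1 & b_2 (t_1 leftmost within each topic).
        StringBuilder bits = new StringBuilder();
        for (int k = 0; k < K; k++)
            for (int s = 0; s < S; s++)
                bits.append(((best >> (k * S + s)) & 1) == 1 ? '1' : '0');
        System.out.println("b_opt = " + bits + ", f(b_opt) = " + bestScore);
    }
}
```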
3. Heuristic Determination Methods

Because the exhaustive search of Section 2.3 quickly becomes infeasible as the numbers of topics and terms grow, we propose the following three heuristic determination methods:

  1. Naïve
  2. TDDA (Topic Label Determination using Document Analysis)
  3. TDTP (Topic Label Determination using Term Probabilistic Distributions)

Section 3.1 describes the Naïve method, Section 3.2 describes TDDA, and Section 3.3 describes TDTP.

3.1 Naïve method

For each topic p_k, the Naïve method simply selects as description terms the Top-k terms t_i with the largest weights w_ki in p_k (a sketch of this baseline is given at the end of this section). In the introductory example, the top term of Topic1 (probability 0.0794) and the top term of Topic2 (probability 0.0638) are selected; however, nothing prevents the same term from being selected for both Topic1 and Topic2.

3.2 TDDA

TDDA determines description terms by analyzing the documents associated with each topic. It has two variants, a tf-idf based method (TDDA-1) and a term occurrence based method (TDDA-2), described in Sections 3.2.1 and 3.2.2.

3.2.1 tf-idf based method (TDDA-1)

For each topic p_k, TDDA-1 collects the documents d in which p_k has a large weight, computes tf-idf values for the terms of those documents, and selects the terms with the highest tf-idf values in place of the Naïve Top-k terms.

3.2.2 Term occurrence based method (TDDA-2)

For the candidate terms t_ki1, ..., t_kin of topic p_k, TDDA-2 uses the occurrence counts of the candidates in the associated documents together with the Naïve Top-k ranking. For Topic1, for instance, the counts 43, 21, 19, 36, 119, 134, 68, 72, 98 and 134 are obtained.

3.3 TDTP

TDTP determines description terms directly from the term probability distributions of the topics. It also has two variants, TDTP-1 and TDTP-2, described in Sections 3.3.1 and 3.3.2.

3.3.1 TDTP-1

For each topic p_k, TDTP-1 compares the probability of each term t_i in p_k with its probabilities in the other topics and selects t_i as a description term when the difference exceeds a threshold δ. In the example of Table 1, with δ = 0.4, the value 0.5832 is obtained for Topic1.

3.3.2 TDTP-2

TDTP-2 combines the distribution comparison of TDTP-1 with the Naïve Top-k ranking. Example values computed for Topic1 include 0.5590, 0.6750, 0.2931, 0.5466, 0.2901, 0.4194, 0.2143, 0.5355, 0.2023, 0.4458, 0.4814, 0.3532 and 0.6075 (Tables 2 and 3).
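The following is a minimal Java sketch of the Naïve baseline of Section 3.1: for each topic, the terms with the largest weights are taken as description terms. The term strings and the value of k are illustrative placeholders, since the concrete terms of the running example were not preserved.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal sketch of the Naïve method of Section 3.1: for each topic p_k,
// take the Top-k terms by weight w_ki as its description terms.
public class NaiveTopK {

    static List<String> topK(double[] weights, String[] terms, int topK) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < weights.length; i++) idx.add(i);
        // Sort term indices by weight, descending.
        idx.sort(Comparator.comparingDouble((Integer i) -> weights[i]).reversed());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < topK && i < idx.size(); i++) result.add(terms[idx.get(i)]);
        return result;
    }

    public static void main(String[] args) {
        String[] terms = {"t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8"};  // placeholder term names
        double[] topic1 = {0.0762, 0.0754, 0.0557, 0.0526, 0.0505, 0.0453, 0.0428, 0.0367};
        double[] topic2 = {0.0601, 0.0363, 0.0462, 0.0728, 0.0438, 0.0563, 0.0461, 0.0672};
        System.out.println("Topic1: " + topK(topic1, terms, 3));
        System.out.println("Topic2: " + topK(topic2, terms, 3));
        // Note: the same term (here t1, among the three heaviest terms of both topics)
        // can be chosen for several topics, which is the weakness the TDDA and TDTP
        // methods try to address.
    }
}
```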
4. Experiments

4.1 Experimental setting

We used the 20 Newsgroups data set [16]. Fourteen newsgroups, including misc.forsale, comp.os.ms-windows.misc, talk.religion.misc, talk.politics.misc, comp.sys.mac.hardware, comp.sys.ibm.pc.hardware and rec.autos, were selected, giving 5,383 documents and 61,188 distinct terms, and the number of topics was set to 14. The proposed methods were implemented in Java.

4.2 Evaluation measures

Following the formulation of Section 2.3, the results are evaluated with three measures: the Jaccard coefficient J(C_i, A_h), the Simpson coefficient S(C_i, A_h), and the purity P_i, defined in equations (5), (6) and (7) (a sketch of these measures is given at the end of this section):

  J(C_i, A_h) = \frac{|C_i \cap A_h|}{|C_i \cup A_h|}    (5)

  S(C_i, A_h) = \frac{|C_i \cap A_h|}{\min\{|C_i|, |A_h|\}}    (6)

  P_i = \frac{1}{|C_i|} \max_{h} |C_i \cap A_h|    (7)

where C_i is the i-th obtained cluster and A_h is the h-th correct answer class.

4.3 Results

Tables 4, 5 and 6 and Figure 7 show the results. Table 4 compares the Jaccard coefficient, and Figure 7 compares the five methods of Section 3 (Naïve, TDDA-1, TDDA-2, TDTP-1 and TDTP-2) on the 20 Newsgroups data: for example, the topic corresponding to comp.graphics is described by terms such as image, bit, graphics and jpeg, and the topic corresponding to sci.med by terms such as medical, cancer and disease. Tables 4, 5 and 6 show that TDTP-2 performs well; in Table 5, TDTP-2 reaches 1.0. In Figure 7, the topic corresponding to talk.politics.guns is described by gun, the topic for rec.sport.baseball by baseball, and the topic for comp.graphics by graphics and image.
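As a reference, the three measures of equations (5)-(7) can be computed from document-ID sets as in the following minimal Java sketch; the class name and the set contents are illustrative, not the evaluation code used above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of the evaluation measures of equations (5)-(7):
// Jaccard coefficient, Simpson coefficient, and purity over document-ID sets.
public class ClusterMeasures {

    static <T> int intersectionSize(Set<T> a, Set<T> b) {
        int n = 0;
        for (T x : a) if (b.contains(x)) n++;
        return n;
    }

    // J(C_i, A_h) = |C_i ∩ A_h| / |C_i ∪ A_h|          (eq. 5)
    static <T> double jaccard(Set<T> c, Set<T> a) {
        int inter = intersectionSize(c, a);
        return (double) inter / (c.size() + a.size() - inter);
    }

    // S(C_i, A_h) = |C_i ∩ A_h| / min(|C_i|, |A_h|)     (eq. 6)
    static <T> double simpson(Set<T> c, Set<T> a) {
        return (double) intersectionSize(c, a) / Math.min(c.size(), a.size());
    }

    // P_i = (1 / |C_i|) * max_h |C_i ∩ A_h|             (eq. 7)
    static <T> double purity(Set<T> c, List<Set<T>> answerClasses) {
        int best = 0;
        for (Set<T> a : answerClasses) best = Math.max(best, intersectionSize(c, a));
        return (double) best / c.size();
    }

    public static void main(String[] args) {
        // Illustrative document-ID sets: one obtained cluster C and two answer classes A1, A2.
        Set<Integer> c  = new HashSet<>(List.of(1, 2, 3, 4, 5));
        Set<Integer> a1 = new HashSet<>(List.of(1, 2, 3, 9));
        Set<Integer> a2 = new HashSet<>(List.of(4, 10, 11));
        System.out.printf("Jaccard = %.3f, Simpson = %.3f, Purity = %.3f%n",
                jaccard(c, a1), simpson(c, a1), purity(c, List.of(a1, a2)));
    }
}
```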
5. Related Work

5.1 Topic models

Latent Semantic Indexing (LSI) [4], probabilistic LSI (pLSI) [5] and Latent Dirichlet Allocation (LDA) [6] have been proposed for extracting latent topics from documents; LDA is a generative extension of pLSI. LDA has also been applied, for example, to personalized Web content recommendation [13] and to dimensionality reduction for document clustering [14].

5.2 Topic and cluster labeling

Other studies determine labels using external knowledge resources such as Wikipedia and WordNet [11], [12]. Fukumoto et al. proposed cluster labeling based on concepts in a machine-readable dictionary [10], using graph-based unsupervised clustering together with measures such as TFIDF, PMI, χ² statistics and information gain.

6. Conclusion

(The conclusion refers to tagging, to the Simpson coefficient, and to the results of Tables 5 and 6.)

References
[1] D. Blei, L. Carin and D. Dunson: Probabilistic Topic Models, IEEE Signal Processing Magazine, 27, pp.55-65 (2010).
[2] T. Iwata, T. Yamada and N. Ueda: Modeling Social Annotation Data with Content Relevance using a Topic Model, Advances in Neural Information Processing Systems, pp.835-843 (2009).
[3] D. Blei and J. McAuliffe: Supervised Topic Models, Neural Information Processing Systems 21 (2007).
[4] S. Deerwester: Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, 41, pp.391-407 (1990).
[5] T. Hofmann: Probabilistic Latent Semantic Indexing, SIGIR, pp.50-57 (1999).
[6] D. Blei: Latent Dirichlet Allocation, Journal of Machine Learning Research, 3, pp.993-1022 (2003).
[7] Koji Eguchi and Hitohiro Shiozaki: Wikipedia Retrieval using Multitype Topic Models, SIG-SWO, pp.73-80 (2008).
[8] Liangliang Cao and Li Fei-Fei: Spatially Coherent Latent Topic Model for Concurrent Segmentation and Classification of Objects and Scenes, IEEE 11th International Conference on Computer Vision, pp.1-8 (2007).
[9] Tomoharu Iwata, Shinji Watanabe, Takeshi Yamada and Naonori Ueda: Topic Tracking Model for Analyzing Consumer Purchase Behavior, International Joint Conference on Artificial Intelligence, vol.9, pp.2-4 (2009).
[10] Fumiyo Fukumoto and Yoshimi Suzuki: Cluster Labelling based on Concepts in a Machine-Readable Dictionary, Proceedings of the 5th International Joint Conference on Natural Language Processing, pp.1371-1375 (2011).
[11] O. S. Chin, N. Kulathuramaiyer and A. W. Yeo: Automatic Discovery of Concepts from Text, Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, pp.1046-1049 (2006).
[12] E. Gabrilovich and S. Markovitch: Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis, Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp.1606-1611 (2007).
[13] Hiroshi Fujimoto, Minoru Etoh, Akira Kinno and Yoshikazu Akinaga: Personalized Web Content Recommendation based on LDA Profile, Data Engineering and Information Management (2011).
[14] Tomonari Masada, Senya Kiyasu and Sueharu Miyahara: Dimensionality Reduction via Latent Dirichlet Allocation for Document Clustering, IPSJ SIG Technical Reports, pp.381-386 (2007).
[15] T. Iwata, T. Yamada and N. Ueda: Probabilistic Latent Semantic Visualization: Topic Model for Visualizing Documents, Knowledge Discovery and Data Mining (2008).
[16] 20 Newsgroups Data Set: http://people.csail.mit.edu/jrennie/20newsgroups/