Text Mining using Linguistic Information

630-0101 8916-5 {taku-kukaoru-yayuuta-tmatsu}@isaist-naraacjp PrefixSpan : PrefixSpan Text Mining using Linguistic Information Taku Kudo Kaoru Yamamoto Yuta Tsuboi Yuji Matsumoto Graduate School of Information Science Nara Institute Science and Technology 8916-5 Takayama Ikoma Nara 630-0101 Japan {taku-kukaoru-yayuuta-tmatsu}@isaist-naraacjp Text mining has gained the focus of attention recently in particular the success in word clustering have been reported However many of these bag-of-word approaches ignore the implicit depency relations between words completely which are critical to understanding of the original text In this paper we apply NLP tools such as chunking and depency analysis to convert a raw text into a semi-structured text from which useful patterns are extracted We ext the PrefixSpan algorithm one of an efficient algorithms for sequential pattern mining to efficiently extract sub-structures from a linguistic data annotated by NLP tools Keywords : Text Mining Sequential Pattern Mining Semi-structured Data PrefixSpan

1 {( { }) ( { })} ( ) ( ) 3 [6] 2 Text Chunking ( ) Agrawal[2] 1 I = {i 1 i 2 i n } s 1 s 2 s n s k α β β α ( ) α β S id sid s sid s S = { sid 1 s 1 sid 2 s 2 sid 2 s n } α S S α support S (α) = { sid s ( sid s S) (α s)} S ( (minimum support) ξ) support S (α) ξ PrefixSpan α [5] 11 ξ [8] ( ) 3 PrefixSpan - Apriori 4 1

[1] 1 c d a:1 2 c c:1 minsup = 2 2 b c b:2 project 4 a b c:2 sid sequence projected database 1 d d:1 1 a c d a:4 2 2 a b c b:3 2 c a:1 3 c b a c:3 3 a c:1 a:4 4 a a b d:1 b:2 c:2 sequence database a:1 a b:2 Pei[5] Apriori - 1 d b:1 a c:2 3 b a d:1 count sup of 1 item results PrefixSpan 1: PrefixSpan PrefixSpan s = a 1 a 2 a m a a 1 a a 2 a a j 1 a a j = a j(1 call P refixspan(ε S) j m) a 1 a 2 a j s procedure P refixspan (α S α ) a prefix (prefix(s a)) B {b (s S α b s) (support S α ( b ) ξ)} a j+1 a m s a postfix foreach b B (postfix(s a)) j prefixpostfix (ε) # print αb support S α ( b ) S a # (project) S a (S α) b { sid s ( sid s S α) S s postfix(s a) (s = postfix(s b)) (s ε)} call P refixspan (αb (S α) b ) S a = { sid s ( sid s S) (s = postfix(s a)) (s ε)} 2: PrefixSpan S = { sid 3 ξ = 2 PrefixSpan 1 s 1 sid 2 s 2 sid n s n } P = { sid 1 1 sid 2 1 sid n 1 } 1 a b c call P refixspan(ε P ) d 1 procedure P refixspan (α P α) 3 a S a # B # S a ( 3 ) foreach sid l P α b c # C C = {} for i l + 1 to seq(s sid) PrefixSpan 2 b item(s sid i) if (C / b) B[b] B[b] sid i C C b postfix foreach b keys of B S 3 # ξ if ( B[b] < ξ) continue S sid k # seq(s k) seq(s k) i print αb B[b] item(s k i) seq(s k) call P refixspan (αb B[b]) seq(s k) 3: PrefixSpan 2

( ) 1 (abc)(ac)d(cf) + minsup = 2 2 (_d)c(bc)(ae) project sid sequence 3 (_b)(df)cb 4 1 a(abc)(ac)d(cf) a:4 (_f)cbc 2 (ad)c(bc)(ae) _d_b_f b:4 item S id sid 3 (ef)(ab)(df)cb c:4 i j(i < j) Relation(S sid i j) 4 e(af)cbc d:2 a a : 2 (a) (a) a _b c : 2 (a b) (c) seq(s sid) sequence database item(s sid i) item(s sid j) results 4: PrefixSpan + PrefixSpan 5 PrefixSpan 31 PrefixSpan ε ( ) PrefixSpan 3 1 4 N-gram () 4 PrefixSpan 1 a b c d e f a a prefix 42 PrefixSpan 6 prefix 4 (IN/OUT) 2 4 PrefixSpan 41 43 N-gram PrefixSpan N-gram N-gram N-gram 7 i j Window size 3 (7 ) N-gram N

S = { sid 1 s 1 sid 2 s 2 sid n s n } P = { sid 1 1 sid 2 1 sid n 1 } function Relation(S sid i j) call P refixspan(ε P ) procedure P refixspan (α P α) # B # foreach sid l P α # C C {} foreach i l + 1 to seq(s sid) b item(s sid i) r Relation(S sid l i) if (r ε and C / b r ) B[ b r ] B[ b r ] sid i C C b r foreach b r keys of B # ξ if ( B[ b r ] < ξ) continue # print α b r B[ b r ] call P refixspan (α b r B[ b r ]) 5: PrefixSpan function ElementRelation(S sid i j) x item(s sid i) y item(s sid j) if (x y element ) return IN else return OUT 8 9 2 6: Element Relation 8 (type) type(a) = NT (a ) type(a) = T ( ) 10 ( # C ) function NGramRelation(S sid i j) N-gram if (j i < C) return NGRAM else return ε # function NGramRelation(S sid i j) if (i + 1 = j) return NGRAM else return ε 7: N-gram Relation A a b c y C a e c z B b d e <a b c y A a e c z C b d e x B> 8:! " # x $&%('&% ) *+ -/10 9: function ChunkRelation(S sid i j) x item(s sid i) y item(s sid j) if (type(x) = NT) return CHUNK if (type(x) = T and type(y) = T and x y ) return CHUNK if (type(x) = T and type(y) = NT and y x ) return CHUNK return ε 10: Chunk Relation 44

b e d c a c d b e a 11: a b c d e f 2 2 1 12: a 0 45 5 ( / ) ChaSen 6 CaboCha 7 3 N-gram 4 (2 ) (1) head (2) head (3) head (2) [8] (4) 11 2 13 item(s sid i) item(s sid j) k 1 (k 1) Linux (XEON 22 GHz k 35GB ) WS (ε) 12 a Perl C++ 8 52 function DepRelation(S sid i j) x = item(s sid i) y = item(s sid j) if (x < j) return ε else r 0 while (x y) r r + 1 y y return r 13: DepRelation 5 51 2 30 38000 95 1 1 12 (minsup) ( ) N-gram 2 3 5 http://wwwaozoragrjp/ 6 http://chasenaist-naraacjp/ 7 http://claist-naraacjp/ taku-ku/software/cabocha/ 4 8 http://claist-naraacjp/ taku-ku/software/prefixspan/

1: ( ) minsup ( ) N-gram 2 22 N/A 3556 5 20 32 267 ( ) 10 19 30 240 15 19 28 229 20 19 29 221 2: ( ) y i x i minsup ( ) y i x i N-gram x 2 046 074 78 i 5 043 060 58 x i 10 041 057 52 15 041 055 48 20 041 055 46 [7] 62 N-gram ( : : ) PrefixSpan 3 ( : : ) ( { }) ( { }) ( { }) (1) (2) dice ( : : ) (3) (2) (( ) ( )) ( ( ( ))) PrefixSpan (1) 6 Chunker 2 1 [3] 2 ( 14) PrefixSpan 61 dice 9

E9F+GH7I < J1 J2 J3 Jn E1 E2 E3 Em > "!$#%'&( )+*-'"/"02143 57698+:";2<>=>?A@"BDC3 J1 J3 E1 E4 14: 9268 2 PrefixSpan N-gram [1] Rakesh Agrawal and Ramakrishnan Srikant Fast algorithms for mining association rules In Jorge B N Bocca Matthias Jarke and Carlo Zaniolo editors Proc 20th Int Conf Very Large Data N Bases VLDB pp 487 499 Morgan Kaufmann 12 [4] 15 1994 2 PrefixSpan [2] Rakesh Agrawal and Ramakrishnan Srikant Mining 52 sequential patterns In Philip S Yu and Arbee L P N-gram 7 Chen editors Proc 11th Int Conf Data Engineering ICDE pp 3 14 IEEE Press 6 10 1995 [3] Helen Pinto and Jiawei Han and Jian Pei and Ke Wang and Qiming and Chen and Umeshwar Dayal Multi-Dimensional Sequential Pattern Mining In Proc 10th ACM International Conference on Information and Knowledge Management (CIKM 01) earliest convenience November 2001 let know [4] Mihoko Kitamura and Yuji Matsumoto Automatic thank letter Extraction of Word Sequence Correspondences in [4] Parallel Corpora In Proc 4th Workshop in Very N 10 Large Corpora pp 79 87 1996 N-gram PrefixSpan [5] Jian Pei Jiawei Han and et al Prefixspan: Mining sequential patterns by prefix-projected growth N In Proc of International Conference of Data Engineering pp 215 224 2001 7 [6] 7 SIG-FA/KBS-J22 pp 133 139 2001 [7] Authorship identification for heterogeneous documents 148 NL-148 PrefixSpan [8] Vol 42 No 8 2001