Mining Syntactic Structures from Text Database

Taku Kudo, Kaoru Yamamoto, Yuta Tsuboi, Yuji Matsumoto

Graduate School of Information Science, Nara Institute of Science and Technology
RIKEN Genomic Sciences Center
IBM Research, Tokyo Research Laboratory, IBM Japan, Ltd.
{taku-ku,matsu}@is.aist-nara.ac.jp, kaorux@gsc.riken.go.jp, yutat@jp.ibm.com

Text mining has recently gained the focus of attention; in particular, successes in word clustering have been reported. However, many of these bag-of-words or sequence-of-words approaches ignore the implicit dependency relations between words, which are critical to understanding the original text. In this paper, we apply a syntactic parser to convert raw text into semi-structured text, from which useful patterns are extracted. We extend the PrefixSpan algorithm, an efficient algorithm for sequential pattern mining, to efficiently extract sub-structures from text data annotated by a syntactic parser.

Keywords: Text Mining, Sequential Pattern Mining, Semi-structured Data, PrefixSpan
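Sequential pattern mining was introduced by Agrawal and Srikant [1]. Let $I = \{i_1, i_2, \ldots, i_n\}$ be the set of items. A sequence is an ordered list of items $\langle u_1 u_2 \ldots u_n \rangle$ with $u_k \in I$. A sequence $\alpha$ is contained in a sequence $\beta$, written $\alpha \sqsubseteq \beta$, when $\alpha$ is a subsequence of $\beta$; $\beta$ is then a super-sequence of $\alpha$.

A sequence database $S$ is a set of tuples $\langle sid, s \rangle$, where $sid$ identifies the sequence $s$: $S = \{\langle sid_1, s_1 \rangle, \langle sid_2, s_2 \rangle, \ldots, \langle sid_n, s_n \rangle\}$. The support of $\alpha$ in $S$, $support_S(\alpha)$, is the number of tuples in $S$ whose sequence contains $\alpha$. Given $S$ and a minimum support threshold $\xi$, sequential pattern mining finds every sequence $\alpha$ with $support_S(\alpha) \ge \xi$.

PrefixSpan, proposed by Pei et al. [4], solves this problem by projecting the database. For a sequence $s = \langle a_1 a_2 \ldots a_m \rangle$ and an item $a$, let $j$ ($1 \le j \le m$) be the first position with $a_j = a$. Then $\langle a_1 \ldots a_j \rangle$ is the prefix of $s$ with respect to $a$, written $prefix(s, a)$, and $\langle a_{j+1} \ldots a_m \rangle$ is the postfix of $s$ with respect to $a$, written $postfix(s, a)$ [5]. When $a$ does not occur in $s$, both prefix and postfix are the empty sequence $\varepsilon$. The database obtained by projecting $S$ by $a$ is written $S|_a$ [6] and collects the non-empty postfixes:

$$S|_a = \{\langle sid, s' \rangle \mid (\langle sid, s \rangle \in S) \wedge (s' = postfix(s, a)) \wedge (s' \ne \varepsilon)\}$$

PrefixSpan counts the support of every item in $S$, projects $S$ by each frequent item, and applies the same step recursively to each projected database.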
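The prefix, postfix, and projection definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes sequences of single items, and the function names (`postfix`, `project`, `support_counts`, `prefixspan`) and the toy database are hypothetical.

```python
from collections import Counter


def postfix(seq, item):
    """Items after the first occurrence of `item` in `seq`; () if `item` is absent."""
    for j, x in enumerate(seq):
        if x == item:
            return tuple(seq[j + 1:])
    return ()


def project(db, item):
    """Projected database S|_item: keep only the non-empty postfixes."""
    projected = []
    for sid, s in db:
        rest = postfix(s, item)
        if rest:
            projected.append((sid, rest))
    return projected


def support_counts(db):
    """Number of sequences in `db` that contain each item (once per sequence)."""
    counts = Counter()
    for _sid, s in db:
        counts.update(set(s))
    return counts


def prefixspan(db, min_support, prefix=()):
    """Return {pattern: support} for every pattern with support >= min_support."""
    results = {}
    for item, count in sorted(support_counts(db).items()):
        if count < min_support:
            continue
        pattern = prefix + (item,)
        results[pattern] = count
        # Recurse on the database projected by the newly appended item.
        results.update(prefixspan(project(db, item), min_support, pattern))
    return results


# Hypothetical toy database of (sid, sequence) pairs.
db = [(1, "acd"), (2, "abc"), (3, "cba"), (4, "aab")]
patterns = prefixspan(db, min_support=2)
```

Because each projected database holds only postfixes, the support of an item inside $S|_\alpha$ equals the support of $\alpha$ extended by that item in the original database, so no candidate generation is needed.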
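Figure: PrefixSpan example (panels: sequence database with columns sid and sequence; count support of each item; project; projected databases; results). Mining starts with call PrefixSpan($\varepsilon$, $S$).

procedure PrefixSpan($\alpha$, $S|_\alpha$)
begin
    $B \leftarrow \{b \mid (s \in S|_\alpha, b \in s) \wedge (support_{S|_\alpha}(\langle b \rangle) \ge \xi)\}$
    foreach $b \in B$
    begin
        $(S|_\alpha)|_b \leftarrow \{\langle sid, s' \rangle \mid (\langle sid, s \rangle \in S|_\alpha) \wedge (s' = postfix(s, b)) \wedge (s' \ne \varepsilon)\}$
        call PrefixSpan($\alpha b$, $(S|_\alpha)|_b$)
    end
end

Figure: The PrefixSpan procedure.

Figure: An example tree and its S-expression.

For two nodes $i$ and $j$ of a tree $t$, the relation $\psi$ between $i$ and $j$ takes one of three values:

(1) $j$ is a child of $i$: $\psi = 0$
(2) $j$ is reached through the $k$-th ($k \ge 1$) ancestor of $i$: $\psi = k$
(3) otherwise: $\psi = \varepsilon$

Figure: The relation $\psi$ between nodes $i$ and $j$.

A tree $\alpha$ is contained in a tree $\beta$, written $\alpha \sqsubseteq \beta$, when there is a mapping $\phi : \alpha \to \beta$ between their nodes; with this mapping, PrefixSpan can be applied to a tree $t$.

A tree database $T$ is a set of tuples $\langle d, t_d \rangle$, $T = \{\langle 1, t_1 \rangle, \langle 2, t_2 \rangle, \ldots, \langle n, t_n \rangle\}$, in which each tree is stored as the sequence of its nodes in pre-order traversal. The support of a tree $\alpha$ in $T$ is

$$support_T(\alpha) = |\{\langle d, t \rangle \mid (\langle d, t \rangle \in T) \wedge (\alpha \sqsubseteq t)\}|$$

As with the sequence database $S$, given $T$ and a minimum support threshold $\xi$, the task is to find every $\alpha$ with $support_T(\alpha) \ge \xi$.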
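Storing a tree as its pre-order node sequence can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the `(label, children)` tuple representation, the depth annotation, 1-based positions in `node`, and the `parents` helper are not taken from the paper. The helper only covers the case $\psi = 0$ (node $j$ is a child of node $i$); the other cases of $\psi$ are not modelled.

```python
def seq(tree, depth=0):
    """Pre-order list of (label, depth) pairs for a (label, children) tree."""
    label, children = tree
    items = [(label, depth)]
    for child in children:
        items.extend(seq(child, depth + 1))
    return items


def node(tree, p):
    """The p-th item (1-based, an assumption) of the pre-order sequence."""
    return seq(tree)[p - 1]


def parents(items):
    """Index of each node's parent in the pre-order list (None for the root)."""
    stack = []   # indices of the nodes on the path from the root
    result = []
    for idx, (_label, depth) in enumerate(items):
        while len(stack) > depth:
            stack.pop()
        result.append(stack[-1] if stack else None)
        stack.append(idx)
    return result


# Hypothetical example tree: a(b(c, d), e)
tree = ("a", [("b", [("c", []), ("d", [])]), ("e", [])])
items = seq(tree)
parent_of = parents(items)

# Node at index 3 ("d") is a child of the node at index 1 ("b"),
# which corresponds to the case psi = 0 under the assumptions above.
d_is_child_of_b = parent_of[3] == 1
```

The depth annotation is enough to rebuild the tree from the flat sequence, which is why a sequence miner can operate on it at all.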
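fun seq($t_d$) := the pre-order node sequence of the tree with id $d$ in $T$
fun node($t_d$, $p$) := the $p$-th item of seq($t_d$)
fun $\psi$($t_d$, $p$, $q$) := the relation $\psi$ between node($t_d$, $p$) and node($t_d$, $q$), with $\psi$($t_d$, 0, $q$) := 0

# $T = \{\langle 1, t_1 \rangle, \ldots, \langle n, t_n \rangle\}$: tree database
# $P = \{\langle 1, 0 \rangle, \langle 2, 0 \rangle, \ldots, \langle n, 0 \rangle\}$: initial (tree id, position) pairs
call PrefixSpan($\varepsilon$, $P$)

proc PrefixSpan($\alpha$, $P|_\alpha$)
begin
    # B: projected positions, keyed by an item $b$ and its relation $r$ given by $\psi$
    $B \leftarrow \{\}$
    foreach $\langle d, l \rangle \in P|_\alpha$
    begin
        foreach $k \leftarrow l + 1$ to $|$seq($t_d$)$|$
        begin
            $b \leftarrow$ node($t_d$, $k$)
            $r \leftarrow \psi$($t_d$, $l$, $k$)
            $B[\langle b, r \rangle] \leftarrow B[\langle b, r \rangle] \cup \{\langle d, k \rangle\}$
        end
    end
    foreach $\langle b, r \rangle \in$ keys of $B$
    begin
        if ($support_{P|_\alpha}(\langle b, r \rangle) < \xi$) continue
        call PrefixSpan($\alpha \langle b, r \rangle$, $B[\langle b, r \rangle]$)
    end
end

Figure 5: PrefixSpan extended to trees.

The text was analyzed with the morphological analyzer ChaSen (http://chasen.aist-nara.ac.jp/) and the dependency parser CaboCha (http://cl.aist-nara.ac.jp/~taku-ku/software/cabocha/); see also http://www.aozora.gr.jp/. The experiments ran on a PC (XEON GHz, RAM 5GB, Linux) with an implementation in Perl.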
Figure 6 (panels: Initial Database; Count Supports; Project; Frequent Sequential Patterns; Frequent Sub-Tree Patterns).
Figure 7: CPU time (sec) against # of transactions, and against minsup.

[1] Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential patterns. In Philip S. Yu and Arbee L. P. Chen, editors, Proc. 11th Int. Conf. Data Engineering, ICDE, pp. 3-14. IEEE Press, 1995.
[2] Robert Dale, Hermann Moisl, and Harold Somers. Handbook of Natural Language Processing. Marcel Dekker, 2000.
[3] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999.
[4] Jian Pei, Jiawei Han, et al. PrefixSpan: Mining sequential patterns by prefix-projected growth. In Proc. of International Conference of Data Engineering, pp. 215-224, 2001.
[5] SIG-FA/KBS-J pp 9 00
[6] 6 Vol No 8 00