A Fast Mining Algorithm for Frequent Essential Itemsets

Computer Engineering, Vol.40 No.6, June 2014
Article ID: 1000-3428(2014)06-0120-05  Document code: A  CLC number: TP18
DOI: 10.3969/j.issn.1000-3428.2014.06.026

TIAN Wei-dong, JI Yun
(School of Computer and Information, Hefei University of Technology, Hefei 230009, China)

Abstract: Traditional mining of frequent essential itemsets generates candidate itemsets and scans the database many times, which makes it inefficient. To address this, a fast algorithm for mining frequent essential itemsets, FMEP, is proposed. The algorithm traverses the search space with a Rymon enumeration tree in a divide-and-conquer manner and prunes selected paths. It uses properties specific to frequent essential itemsets to decide quickly whether a candidate is a frequent essential itemset, without comparing its disjunctive support with that of all its direct subsets. Experimental results show that the algorithm correctly obtains all elements of the frequent essential itemset concise representation and substantially reduces running time compared with the MEP algorithm: about 2-fold on dense datasets and by at least 30% on sparse datasets.

Key words: data mining; frequent itemsets; concise representation; frequent essential itemsets; Rymon enumeration tree

Foundation item: grant No. 60603068. Author biography: TIAN Wei-dong (1970-). Received: 2013-01-28. Revised: 2013-05-17. E-mail: jiyun1988@126.com

1 Introduction

Frequent itemset mining is a fundamental data mining task [1-2], but the number of frequent itemsets can be very large, which makes mining expensive in CPU time and I/O. Concise representations of frequent itemsets reduce this cost. Known representations include generator-based representations [3], non-derivable itemsets [4], frequent closed itemsets [5-6], disjunction-free sets [7] and disjunction-free generators [8-9].

Two notational conventions are used in this paper. In particular, a set of transactions $\{tr_1, tr_2, \ldots, tr_n\}$ is abbreviated as $tr_1 tr_2 \cdots tr_n$, and an itemset $\{i_1, i_2, \ldots, i_n\}$ is abbreviated as $i_1 i_2 \cdots i_n$.

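Casali et al. [10] proposed essential itemsets, a concise representation built on disjunctive support and the inclusion-exclusion identities [11], together with the level-wise mining algorithm MEP [10]. This paper proposes FMEP, a fast mining algorithm based on the Rymon enumeration tree [12].

2 Preliminaries

2.1 Basic Definitions

Definition 1 (three kinds of support). Let $IS = \{i_1, i_2, \ldots, i_m\}$ be a set of $m$ items. A set $I \subseteq IS$ is called an itemset, and an itemset with $k$ items is called a $k$-itemset. The database $D$ consists of objects $o$, each identified by a TID. The formal context is a triple $K = (O, I, R)$, where $O$ is the set of objects, $I$ is the set of items and $R \subseteq O \times I$; $(o, i) \in R$ means that object $o$ contains item $i$. With slight abuse of notation, $I$ below also denotes an arbitrary itemset. The three kinds of support are the conjunctive support $supp(I)$, the disjunctive support $supp(\vee I)$ and the negative support $supp(\neg I)$:

$$supp(I) = |\{o \in O \mid \forall i \in I,\ (o, i) \in R\}| \qquad (1)$$

$$supp(\vee I) = |\{o \in O \mid \exists i \in I,\ (o, i) \in R\}| \qquad (2)$$

$$supp(\neg I) = |\{o \in O \mid \forall i \in I,\ (o, i) \notin R\}| \qquad (3)$$

In general $supp(I) \le supp(\vee I)$, with equality when $I$ is a 1-itemset.

Definition 2 (frequent itemset). Let $X$ be an itemset with support $supp(X)$. $X$ is a frequent itemset if $supp(X) \ge minsup$, and an infrequent itemset if $supp(X) < minsup$.

Definition 3 (relations among the three kinds of support). For an itemset $I$, the inclusion-exclusion identities [11] give

$$supp(I) = \sum_{\emptyset \ne I_1 \subseteq I} (-1)^{|I_1| - 1}\, supp(\vee I_1) \qquad (4)$$

$$supp(\vee I) = \sum_{\emptyset \ne I_1 \subseteq I} (-1)^{|I_1| - 1}\, supp(I_1) \qquad (5)$$

$$supp(\neg I) = |O| - supp(\vee I) \qquad (6)$$

Definition 4 (essential itemset). An itemset $I$ is an essential itemset if

$$supp(\vee I) \ne \max\{supp(\vee (I \setminus i)) \mid i \in I\} \qquad (7)$$

If in addition $supp(I) \ge minsup$, then $I$ is a frequent essential itemset.

2.2 Essential Itemset Representation and the MEP Algorithm

Let $E$ denote the set of essential itemsets and $BD^+$ the positive border, that is, the set of maximal frequent itemsets. From Eq. (7), for any itemset $X$,

$$supp(\vee X) = \max(\{supp(\vee Y) \mid Y \subseteq X\}) \qquad (8)$$

where $Y \in E$. Given $E$ and $BD^+$, the conjunctive support of any frequent itemset $X$ (an itemset contained in some element of $BD^+$) can therefore be derived:

$$supp(X) = \sum_{\emptyset \ne X' \subseteq X} (-1)^{|X'| - 1}\, supp(\vee Y), \quad Y \in \arg\max(\{supp(\vee X'') \mid X'' \subseteq X',\ X'' \in E\}) \qquad (9)$$

which can be written by cases as

$$supp(X) = \sum_{\emptyset \ne X' \subseteq X} (-1)^{|X'| - 1} \begin{cases} supp(\vee X'), & X' \in E \\ supp(\vee Y),\ Y \in \arg\max(\{supp(\vee X'') \mid X'' \subseteq X',\ X'' \in E\}), & X' \notin E \end{cases} \qquad (10)$$

The core loop of the MEP algorithm [10] is:

    C_{i+1} = Gen_Apriori(L_i);
    C_{i+1} = {X ∈ C_{i+1} | ∃Y ∈ BD+(F): X ⊆ Y};
    Scan the database for mining the disjunctive frequency ∀X ∈ C_{i+1};
    L_{i+1} = {X ∈ C_{i+1} | ∄x ∈ X: Freq(∨X) = Freq(∨(X\x))};

This procedure has three shortcomings:
(1) it must generate the candidate set $C_{i+1}$ at every level;
(2) it must scan the database to obtain the disjunctive support of every itemset in $C_{i+1}$, so MEP scans the database many times;
(3) every itemset in $C_{i+1}$ must be compared with the disjunctive support of all its direct subsets.
To address these three shortcomings of the algorithm in [10], the FMEP algorithm is proposed.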
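Definitions 1 to 4 can be checked numerically on a small database. The following is a minimal Python sketch, not code from the paper: it assumes the five-transaction database of the Section 3.3 example (abcd, a, bc, cd, abc with TIDs 1 to 5), and the helper names `supp_conj`, `supp_disj`, `supp_neg`, `conj_from_disj` and `is_essential` are hypothetical names chosen for illustration.

```python
from itertools import combinations

# Assumed example database: TID -> set of items.
D = {1: set("abcd"), 2: set("a"), 3: set("bc"), 4: set("cd"), 5: set("abc")}


def supp_conj(I, db=D):
    """Eq. (1): number of objects containing every item of I."""
    return sum(1 for t in db.values() if all(i in t for i in I))


def supp_disj(I, db=D):
    """Eq. (2): number of objects containing at least one item of I."""
    return sum(1 for t in db.values() if any(i in t for i in I))


def supp_neg(I, db=D):
    """Eq. (3): number of objects containing no item of I."""
    return sum(1 for t in db.values() if not any(i in t for i in I))


def nonempty_subsets(I):
    """All non-empty subsets of I, as tuples."""
    items = sorted(I)
    for k in range(1, len(items) + 1):
        yield from combinations(items, k)


def conj_from_disj(I, db=D):
    """Eq. (4): conjunctive support recovered from disjunctive supports."""
    return sum((-1) ** (len(s) - 1) * supp_disj(s, db) for s in nonempty_subsets(I))


def is_essential(I, db=D):
    """Eq. (7): disjunctive support differs from that of every direct subset.

    I must be non-empty. For a 1-itemset the only direct subset is the empty
    set, whose disjunctive support is 0.
    """
    I = set(I)
    return supp_disj(I, db) != max(supp_disj(I - {i}, db) for i in I)


print(supp_conj("abc"), supp_disj("ab"), supp_neg("ab"))  # 2 4 1
print(conj_from_disj("abc"))                               # 2
print(is_essential("ab"), is_essential("abc"))             # True False
```

Here `abc` is not essential because its disjunctive support (5) equals that of its direct subset `ac`, which is the situation Eq. (7) excludes.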

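3 FMEP Algorithm

FMEP replaces the level-wise candidate generation of MEP [10] with a traversal of the Rymon enumeration tree [12]. The search space is explored in a divide-and-conquer manner, and a branch is abandoned as soon as the itemset at its root is found not to be essential.

3.1 Function g and Pruning Property

Definition 5 (function g). The function $g$ maps an itemset $I$ to the set of transactions $t$ of the database $D$ that contain at least one item of $I$:

$$g(I) = \{t \in D \mid \exists i \in I,\ i \in t\}$$

Hence $g(I) = \bigcup_{i \in I} g(i)$ and $|g(I)| = supp(\vee I)$.

Property. Let $I$ be a frequent essential itemset and $i \notin I$ an item. If there exists $i_1 \in I$ such that $g(i_1) \subseteq g(i)$ or $g(i) \subseteq g(i_1)$, then $I \cup i$ is not an essential itemset. Otherwise, $I \cup i$ is taken to be an essential itemset, and it is a frequent essential itemset if it is contained in an element of $BD^+$.

Proof. If $g(i) \subseteq g(i_1)$, then $g(i_1) \cup g(i) = g(i_1)$, so $g(I \cup i) = g(I)$ and $supp(\vee (I \cup i)) = supp(\vee I)$. If $g(i_1) \subseteq g(i)$, then $g(i_1) \cup g(i) = g(i)$, so $g(I \cup i) = g((I \setminus i_1) \cup i)$ and $supp(\vee (I \cup i)) = supp(\vee ((I \setminus i_1) \cup i))$. In both cases the disjunctive support of $I \cup i$ equals that of one of its direct subsets, so by Definition 4 $I \cup i$ is not an essential itemset.

3.2 Algorithm Description

The FMEP (Fast Mining Essential Pattern) algorithm relies on the function $g$ and on a set POST of items that may still extend the current itemset.

    Input: database D, minsup
    Output: EP, BD+

    BD+(F) = Max_Set_Algorithm(D, minfreq)
    EP = NULL
    procedure FMEP(EP, gen, POST)
        while POST ≠ NULL do
            i = min<(POST)
            POST = POST \ i
            newgen = gen ∪ i
            if ¬exist(i, gen) and newgen ∈ BD+ then
                EP = EP ∪ newgen
                g(newgen) = g(gen) ∪ g(i)
                NEWPOST = POST
                FMEP(EP, newgen, NEWPOST)
            endif
        endwhile
        return EP, BD+

    function exist(i, gen)
        for all j ∈ gen do
            if g(j) ⊆ g(i) or g(i) ⊆ g(j) then
                return true
            endif
        endfor
        return false
    end function

Here the test newgen ∈ BD+ means that newgen is contained in some element of $BD^+$, that is, newgen is frequent.

The database $D$ is scanned once to obtain the frequent 1-itemsets $F_1$ and their $g$ values, which initialize POST. The procedure starts with an empty EP and an empty gen. While POST is not empty, the smallest item $i$ is taken from POST and removed from it; this follows the order of the Rymon enumeration tree. Item $i$ is joined with gen to form newgen, and exist is called on $i$ and gen to decide, by the property of Section 3.1, whether newgen is an essential itemset. If it is, newgen is added to EP, $g(newgen)$ is computed from $g(gen)$ and $g(i)$, NEWPOST is set to the current POST, and FMEP is called recursively. The loop ends when POST is empty. The function exist checks the two containment cases: for item $i$ and each item $j$ of gen, whether $g(j) \subseteq g(i)$ or $g(i) \subseteq g(j)$. If either holds, newgen is not an essential itemset.

3.3 Example

Let the database $D$ contain the transactions abcd, a, bc, cd and abc, with TIDs 1, 2, 3, 4 and 5. Then $BD^+ = \{abcd\}$, $g(a) = \{1, 2, 5\}$, $g(b) = \{1, 3, 5\}$, $g(c) = \{1, 3, 4, 5\}$ and $g(d) = \{1, 4\}$. Initially EP is empty and POST = {abcd}.

(1) Item a: POST_a = {bcd}, and a is a frequent essential itemset. Joining a with b gives ab, with POST_ab = {cd}; ab is a frequent essential itemset. Joining ab with c gives abc, with POST_abc = {d}; since $g(b) \subseteq g(c)$, abc is not an essential itemset, and the search returns to ab. Joining ab with d gives abd, with POST_abd = {}; abd is a frequent essential itemset. Back at a, POST is {cd}. Joining a with c gives ac, with POST_ac = {d}; ac is a frequent essential itemset. Joining ac with d gives acd, with POST_acd = {}; since $g(d) \subseteq g(c)$, acd is not an essential itemset. Back at a, POST is {d}. Joining a with d gives ad, with POST_ad = {}; ad is a frequent essential itemset.
(2) Item b: POST_b = {cd}, and b is a frequent essential itemset. Joining b with c gives bc, with POST_bc = {d}; since $g(b) \subseteq g(c)$, bc is not an essential itemset. Joining b with d gives bd, with POST_bd = {}; bd is a frequent essential itemset.
(3) Item c: POST_c = {d}, and c is a frequent essential itemset. Joining c with d gives cd; since $g(d) \subseteq g(c)$, cd is not an essential itemset.
(4) Item d: POST_d = {}, and d is a frequent essential itemset.

The resulting set of frequent essential itemsets is EP = {a, ab, abd, ac, ad, b, bd, c, d}.

4 Experimental Results

FMEP is compared with the MEP algorithm of [10]. FMEP was implemented in C++, and the experiments were run on a PC with Windows 7 and 4 GB of memory. Because the comparison is with [10], the datasets of [10] are used: connect, pumsb, chess, pumsb_star, T10I4D100K and T40I10D100K, all available from http://fimi.cs.helsinki.fi/data/. The results are shown in Figures 1 to 6.

[Figure 1: connect. Figure 2: chess. Figure 3: pumsb. Figure 4: pumsb_star. Figure 5: T10I4D100K. Figure 6: T40I10D100K. Each figure plots the running time of FMEP and MEP.]

Figures 1 to 3 show that FMEP is faster than MEP on the dense datasets. Figure 4 reports the comparison on pumsb_star, which Bayardo derived from pumsb by removing the items whose support is 80% or more. On the sparse datasets T10I4D100K and T40I10D100K (Figures 5 and 6), FMEP also takes less time than MEP.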
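The recursion of Section 3.2 can be replayed on the example of Section 3.3. The following is a minimal Python sketch under stated assumptions, not the authors' C++ implementation: $BD^+$ is passed in as given rather than computed by a maximal-itemset miner, the test "newgen ∈ BD+" is read as "newgen is a subset of some element of $BD^+$", and the names `fmep`, `exist`, `frequent` and `essential_by_definition` are hypothetical. The sketch implements the pairwise containment test exactly as stated in Section 3.1; the brute-force check against Definition 4 only confirms agreement on this small example.

```python
from itertools import combinations

# Assumed example data from Section 3.3: TID -> set of items, and BD+ = {abcd}.
D = {1: set("abcd"), 2: set("a"), 3: set("bc"), 4: set("cd"), 5: set("abc")}
BD_PLUS = [frozenset("abcd")]


def fmep(db, bd_plus):
    """Return the itemsets accepted by the FMEP recursion, in visiting order."""
    items = sorted({i for t in db.values() for i in t})
    # g(i): TIDs of the transactions containing item i (Definition 5, 1-itemsets).
    g = {i: frozenset(tid for tid, t in db.items() if i in t) for i in items}

    def exist(i, gen):
        # Pairwise containment test of Section 3.2.
        return any(g[j] <= g[i] or g[i] <= g[j] for j in gen)

    def frequent(itemset):
        # Assumption: membership in BD+ means being a subset of a maximal itemset.
        return any(itemset <= y for y in bd_plus)

    ep = []

    def recurse(gen, post):
        post = list(post)              # NEWPOST: private copy for this branch
        while post:
            i = post.pop(0)            # smallest remaining item (Rymon order)
            newgen = gen | {i}
            if not exist(i, gen) and frequent(newgen):
                ep.append("".join(sorted(newgen)))
                recurse(newgen, post)  # only items after i may extend newgen

    recurse(frozenset(), items)
    return ep


def essential_by_definition(db):
    """Brute-force check of Eq. (7) over all non-empty itemsets."""
    items = sorted({i for t in db.values() for i in t})

    def disj(itemset):
        return sum(1 for t in db.values() if t & set(itemset))

    return ["".join(I)
            for k in range(1, len(items) + 1)
            for I in combinations(items, k)
            if disj(I) != max(disj(set(I) - {i}) for i in I)]


print(fmep(D, BD_PLUS))
# ['a', 'ab', 'abd', 'ac', 'ad', 'b', 'bd', 'c', 'd']
```

The visiting order matches the walk-through of Section 3.3: abc, acd, bc and cd are rejected by the containment test, so their supersets are never generated.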

5 Conclusion

This paper proposes FMEP, a fast algorithm for mining frequent essential itemsets that searches the itemset space with a Rymon enumeration tree. Compared with MEP, FMEP reduces running time about 2-fold on dense datasets and by at least 30% on sparse datasets.

References

[1] Han Jiawei, Kamber M. Data Mining: Concepts and Techniques[M]. 2004.
[2] ...[J]. ..., 2012, 38(5): 44-46.
[3] Liu Guimei, Li J, Wong L. Positive Borders or Negative Borders: How to Make Lossless Generator Based Representations Concise[C]//Proc. of the 6th SIAM International Conference on Data Mining. [S. l.]: IEEE Press, 2006: 469-473.
[4] Calders T, Goethals B. Non-derivable Itemset Mining[J]. Data Mining and Knowledge Discovery, 2007, 14(1): 171-206.
[5] Pasquier N, Bastide Y, Taouil R. Discovering Frequent Closed Itemsets for Association Rules[C]//Proc. of ICDT'99. [S. l.]: IEEE Press, 1999: 398-416.
[6] ...[J]. ..., 2008, 34(16): 50-52.
[7] Bykowski A, Rigotti C. A Condensed Representation to Find Frequent Patterns[C]//Proc. of PODS'01. [S. l.]: IEEE Press, 2001: 56-63.
[8] Kryszkiewicz M. Concise Representation of Frequent Patterns Based on Disjunction-free Generators[C]//Proc. of ICDM'01. [S. l.]: IEEE Press, 2001: 305-312.
[9] Kryszkiewicz M, Gajek M. Concise Representation of Frequent Patterns Based on Generalized Disjunction-free Generators[C]//Proc. of PAKDD'02. [S. l.]: IEEE Press, 2002: 159-171.
[10] Casali A, Cicchetti R, Lakhal L. Essential Patterns: A Perfect Cover of Frequent Patterns[C]//Proc. of the 7th International Conference on Data Warehousing and Knowledge Discovery. Copenhagen, Denmark: Springer-Verlag, 2005: 428-437.
[11] Galambos J, Simonelli I. Bonferroni-type Inequalities with Applications[M]. New York, USA: Springer, 2000.
[12] Rymon R. Search Through Systematic Set Enumeration[C]//Proc. of the 3rd International Conference on Principles of Knowledge Representation and Reasoning. [S. l.]: IEEE Press, 1992: 539-550.