A data structure based on grammatical compression to detect long pattern

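Naoya Kishiue (1), Masaya Nakahara (1), Shirou Maruyama and Hiroshi Sakamoto

In this research, we propose a method for finding long patterns using a compressed index based on a context-free grammar (CFG). The proposed method finds a pattern $P$ in $O\left(\frac{1}{\varepsilon}\,(m \log n + occ_c (\log m \log u))\right)$ time using $(1+\varepsilon)\, n \log n + [\ldots]\, n + o(n)$ bits, where $n$ is the number of variables in the grammar produced by compressing a text of original length $u$, $m = |P|$, and $0 < \varepsilon < 1$. In experiments, the proposed method was faster than existing methods (e.g., the LZ-index 8) and the FM-index 3)) on long patterns.

(1) Kyushu Institute of Technology, Graduate School of Computer Science and Systems Engineering
Kyushu University, Graduate School of Information Science and Electrical Engineering
Kyushu Institute of Technology, Faculty of Computer Science and Systems Engineering

Related compressed indexes include the FM-index 3) and the compressed suffix array 9) [...]. For a grammar with $n$ variables and any $0 < \varepsilon < 1$, the proposed index uses $(1+\varepsilon)\, n \log n + [\ldots]\, n + o(n)$ bits and finds a pattern $P$ of length $m$ in $O\left(\frac{1}{\varepsilon}\,(m \log n + occ_c (\log m \log u))\right)$ time [...] 3), 8), 9).

A CFG production $X \to ab$ replaces the pair of symbols $ab$ with the variable $X$. The grammar used here is built by edit-sensitive parsing (ESP) 1).

Let $\Sigma$ be the alphabet [...]. A pair of adjacent symbols is called a digram. For symbols $a_i$ and $a_j$, lca$(i, j)$ is determined by the lowest common ancestor of the leaves $a_i$ and $a_j$ in the alphabet tree (Fig. 1). For a string $w$, $w[i]$ denotes its $i$-th symbol. Alphabet reduction 1) replaces each $w[i]$ with a label, label$(w[i])$, computed from $w[i-1]$ and $w[i]$.

© 2011 Information Processing Society of Japan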

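[Fig. 1: Alphabet tree and lca. A binary tree whose leaves are the symbols a1, a2, ...; the worked lca values are not legible.]

[Fig. 2: Alphabet reduction and landmarks. Rows: w (with a dummy symbol), 1st labels, 2nd labels, final labels, landmarks.]

$$
\mathrm{label}(w[i]) =
\begin{cases}
2\,\mathrm{lca}(w[i-1], w[i]) & (w[i-1] < w[i]) \\
2\,\mathrm{lca}(w[i-1], w[i]) + 1 & (\text{otherwise})
\end{cases}
$$

One round of alphabet reduction maps symbols from $\{1, 2, \ldots, n\}$ to $O(\log n)$ labels. Repeating the reduction about $\log^* n$ times, where $\log^* n$ is the iterated logarithm, shrinks the label alphabet further [...]. Landmarks are then chosen from the final labels (Fig. 2).

Edit-sensitive parsing 1) partitions a string $s$ of length $n$ into substrings of three types, Types 1 to 3 [...], and parses each substring $s_i$ separately.

For a repetition such as $s_i = aaaab$, equal symbols are paired: $s_i$ becomes $XXb$ with $X \to aa$.

For a substring without repetitions such as $s_i = abacabcda$, the landmarks $s_i[\ldots], s_i[\ldots]$ are paired first, giving $s_i = aXcaYda$ with $X \to ba$, $Y \to bc$. The remaining symbols are then paired, giving $s_i = aXZYW$ with $X \to ba$, $Y \to bc$, $Z \to ca$, $W \to da$.

Applying this to all substrings $s_1, s_2, \ldots, s_k$ of $s$ and repeating on the resulting string until $|s| = 1$ yields a parsing tree of height $O(\log |s|)$.

Let $G$ be the CFG obtained in this way and $D$ its dictionary [...]. A pattern $P$ is handled through substrings $P_1, P_2, \ldots, P_k$ [...]. When a substring occurs twice in $w$, as $w[i,j] = w[k,l] = \alpha\beta\gamma$, ESP assigns the same variables to the middle part $\beta$ in both occurrences, while the parsing of $\alpha$ and $\gamma$ may depend on the surrounding context [...].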
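The label rule of alphabet reduction can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes symbols are encoded as integers 0, 1, 2, ... placed left to right on the leaves of a complete binary alphabet tree, so that the height of the lowest common ancestor equals the bit length of the XOR of the two symbols. The function names `lca_height` and `reduce_labels`, and the choice to leave the first symbol unlabeled instead of using a dummy predecessor, are assumptions made for this sketch.

```python
def lca_height(a, b):
    """Height of the lowest common ancestor of leaves a and b in a complete
    binary tree whose leaves are numbered 0, 1, 2, ... from left to right
    (assumed encoding of the alphabet tree)."""
    return (a ^ b).bit_length()


def reduce_labels(w):
    """One round of alphabet reduction.

    Returns the labels of w[1], ..., w[-1]; w[0] has no left neighbour and
    is skipped here. Adjacent symbols of w must differ."""
    labels = []
    for prev, cur in zip(w, w[1:]):
        if prev == cur:
            raise ValueError("adjacent symbols must differ")
        h = lca_height(prev, cur)
        # Even label if the left neighbour is smaller, odd label otherwise.
        labels.append(2 * h if prev < cur else 2 * h + 1)
    return labels


# a b a c a b c d a, encoded with a=0, b=1, c=2, d=3
w = [0, 1, 0, 2, 0, 1, 2, 3, 0]
labels = reduce_labels(w)
# labels == [2, 3, 4, 5, 2, 4, 2, 5]
```

The parity bit is what keeps neighbouring labels distinct: two consecutive pairs cannot share both the same lca height and the same orientation, because the middle symbol would have to lie in both the left and the right subtree of the same ancestor.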

[Figure: State of compression. Substrings of Types 1 to 3 and their landmarks.]

[Figure: Extracting the core. The parsing tree of the original text and the compressed pattern with its core.]

[Figure: Adjacent relation of subtrees.]

[Figure: Parsing tree and DAG.]

Suppose the same substring over $\Sigma$ occurs twice in $w$, as $w[i,j] = w[k,l]$. It can then be written as $w[i,j] = w[k,l] = x\alpha y$, where $\alpha$ is parsed into the same variables in both occurrences [...]. For a pattern $P$ this common part is the core of $P$. Consequently $P$ is represented by a sequence $P_1, P_2, \ldots, P_k$ of variables of $G$ [...], where $k = O(\log |P|)$.

Two subtrees are adjacent [...] for a production $X \to ab$ [...]: (1) [...] (2) [...]. Because the parsing tree of a text of length $u$ has height $O(\log u)$, whether two subtrees are adjacent can be decided in $O(\log u)$ time.

The parsing tree is represented as a DAG. The DAG $G$ is decomposed into two parts, $G_{left}$ and $G_{right}$ [...].

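[Figure: Decomposition of DAG. The DAG representation and its decomposition into $G_{left}$ and $G_{right}$.]

Using $G_{left}$ and $G_{right}$, whether variables $X$ and $Y$ are adjacent is decided on the DAG in $O(\log u)$ time [...]. Checking $P$ around one occurrence of its core therefore takes $O(\log |P| \log u)$ time, and checking all $occ_c$ occurrences of the core takes $O(occ_c (\log |P| \log u))$ time.

The DAG is stored using the following succinct data structures. For a sequence $S$ over $\Sigma$ with positions $[1, n]$:

(1) $rank_c(S, i)$: the number of occurrences of $c \in \Sigma$ in $S[1, i]$.
(2) $select_c(S, i)$: the position of the $i$-th occurrence of $c$ in $S$.
(3) $access(S, i)$: the symbol $S[i]$.

With $|\Sigma| = \sigma$, these queries take $O(\log \sigma)$ time using $n \log \sigma + o(n \log \sigma)$ bits 4). For $\sigma = 2$ they take $O(1)$ time using $n + o(n)$ bits 5).

The balanced parenthesis representation $P$ 7) of an ordered tree $T$ whose root has subtrees $T_1, T_2, \ldots, T_d$ is defined recursively:

$$
P(T) =
\begin{cases}
() & (T \text{ is a single node}) \\
(\,P(T_1)\,P(T_2) \cdots P(T_d)\,) & (\text{otherwise})
\end{cases}
$$

Each parenthesis is one bit, so a tree with $n$ nodes takes $2n$ bits. The following operations are supported on $P$ 7):

(1) findclose(i): the position of the close parenthesis matching the open parenthesis $P[i]$.
(2) findopen(i): the position of the open parenthesis matching the close parenthesis $P[i]$.
(3) enclose(i): the position of the open parenthesis of the pair that most tightly encloses $P[i]$.

A node is identified with the position $x$ of its open parenthesis, and the tree is navigated by:

(1) parent(x): enclose(x).
(2) firstchild(x): $x + 1$, when $x$ is not a leaf.
(3) nextsibling(x): findclose(x) + 1, when $x$ has a next sibling.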
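The parenthesis operations and the tree navigation built on them can be illustrated with a deliberately naive sketch. It is not succinct: every function scans the string in linear time, whereas the structures cited above answer the same queries in constant time. Positions are 0-based here (the text uses 1-based positions), `None` marks a missing node, and the example tree is hypothetical.

```python
def findclose(P, i):
    """Position of the ')' matching the '(' at P[i]."""
    depth = 0
    for j in range(i, len(P)):
        depth += 1 if P[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")


def findopen(P, i):
    """Position of the '(' matching the ')' at P[i]."""
    depth = 0
    for j in range(i, -1, -1):
        depth += 1 if P[j] == ")" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")


def enclose(P, i):
    """Open parenthesis of the tightest pair enclosing the pair opened at P[i]."""
    depth = 0
    for j in range(i - 1, -1, -1):
        depth += 1 if P[j] == "(" else -1
        if depth == 1:          # first unmatched '(' to the left
            return j
    return None                 # P[i] opens the root


def parent(P, x):
    return enclose(P, x)


def firstchild(P, x):
    return x + 1 if P[x + 1] == "(" else None


def nextsibling(P, x):
    j = findclose(P, x) + 1
    return j if j < len(P) and P[j] == "(" else None


def rank_open(P, i):
    """Number of '(' in P[0..i]; this is the preorder number of node i."""
    return P[: i + 1].count("(")


def select_open(P, p):
    """Position of the p-th '(' (p counted from 1)."""
    seen = 0
    for j, ch in enumerate(P):
        seen += ch == "("
        if seen == p:
            return j
    raise ValueError("fewer than p open parentheses")


# Hypothetical tree: a root with an inner child (two leaves) and a leaf child.
P = "((()())())"
```

For this example, `parent(P, 4)` is 1, `nextsibling(P, 1)` is 7, and `rank_open(P, 7)` is 5, so the node opened at position 7 is the fifth node in preorder.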

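[Figure: Reverse dictionary representation by binary search. Panel labels: "The in-degree edges in the left tree $T_L$"; "The in-degree edges to a node $x$ in the left tree $T_L$", with children $z_1, z_2, \ldots$ and variables $y_1, y_2, \ldots$; "The in-branching children of $x$ in $T_L$ sorted by the original variables of the parents in $T_R$"; "The original variables of $y_i$ accessible by the succinct permutation".]

The position $i$ of a node in $P$ and its preorder number $p$ are converted by $p = preorder(i) = rank_{(}(P, i)$ and $i = select_{(}(P, p)$. All of these operations take $O(1)$ time using $2n + o(n)$ bits 7).

For a permutation $\pi$ of $[1, n]$, $\pi[i]$ denotes the image of $i$ and $\pi^{-1}[i]$ the element that $\pi$ maps to $i$ [...]. Using $(1+\varepsilon)\, n \log n + O(n)$ bits, $\pi[i]$ is computed in $O(1)$ time and $\pi^{-1}[i]$ in $O(\frac{1}{\varepsilon})$ time 6).

The reverse dictionary returns the variable $z$ of a production $z \to xy$ given the pair $xy$. The productions are stored as two trees, the left tree $T_L$ and the right tree $T_R$: for $z \to xy$, $z$ is a child of $x$ in $T_L$ and a child of $y$ in $T_R$. The children $T_L(z_1), T_L(z_2), \ldots, T_L(z_k)$ of $x$ in $T_L$ are sorted by their parents $y_i$ in $T_R$, and the succinct permutation maps between the labels used in $T_L$ and in $T_R$. Given $xy$, a binary search over the children of $x$ in $T_L$ finds the child whose parent in $T_R$ is $y$ in $O(\log n)$ time.

[Fig. 9: Left/right tree and succinct representation. The DAG representation with the label in $T_L$, the label in $T_R$ and the original label of each variable, together with the two balanced parenthesis sequences ((((()))()((())))(())) and ((((())()(()))((())))).]

Each tree with $n$ nodes takes $2n + o(n)$ bits (Fig. 9). Using the reverse dictionary of the CFG, a pattern $P$ of length $m$ is compressed in $O(m \log n)$ time [...].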
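The trade-off between space and the time for $\pi^{-1}$ can be illustrated with the cycle-shortcut idea behind the succinct permutation of 6). The sketch below is an assumption-laden illustration, not the cited structure itself: it keeps one backward pointer for every $t$-th element of each cycle (with $t$ playing the role of $1/\varepsilon$), stores them in an ordinary Python dict rather than in a compact bit layout, and uses 0-based indices. The names `build_shortcuts` and `inverse` and the example permutation are hypothetical.

```python
def build_shortcuts(pi, t):
    """For every cycle longer than t, give every t-th element a pointer
    to the element t steps behind it on the cycle."""
    n = len(pi)
    back = {}
    seen = [False] * n
    for start in range(n):
        if seen[start]:
            continue
        cycle = []
        x = start
        while not seen[x]:
            seen[x] = True
            cycle.append(x)
            x = pi[x]
        if len(cycle) > t:
            for k in range(0, len(cycle), t):
                back[cycle[k]] = cycle[(k - t) % len(cycle)]
    return back


def inverse(pi, back, i):
    """Return x with pi[x] == i using O(t) applications of pi."""
    x = i
    jumped = False
    while pi[x] != i:
        if not jumped and x in back:
            x = back[x]       # jump t steps backwards, once
            jumped = True
        else:
            x = pi[x]         # otherwise walk forwards along the cycle
    return x


# Hypothetical permutation with cycles (0 2 3 1), (4 5) and (6).
pi = [2, 0, 3, 1, 5, 4, 6]
back = build_shortcuts(pi, t=2)
inv = [inverse(pi, back, i) for i in range(len(pi))]
# inv == [1, 3, 0, 2, 5, 4, 6]
```

A larger `t` stores fewer pointers and makes `inverse` slower, which mirrors the roles of $\varepsilon$ in the space bound and $1/\varepsilon$ in the time bound.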

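With the inverse permutation computed in $O(\frac{1}{\varepsilon})$ time, the CFG index occupies $(1+\varepsilon)\, n \log n + [\ldots]\, n + o(n)$ bits and finds a pattern $P$ in $O\left(\frac{1}{\varepsilon}\,(m \log n + occ_c (\log m \log u))\right)$ time.

We compared the proposed method with the LZ-index 8), the compressed suffix array (CSArray) 9) and the FM-index 3). The environment was: CPU: Intel Xeon (Quad Core, HT @ [...] GHz), Memory: [...] GB, CentOS [...] ([...]-bit), gcc [...]. The texts were taken from the Pizza & Chili corpus [...].

[Figure: Time to construct index.]

[Figure: Time to count occurrences.]

[Figure: Index size.]

[...] edit-sensitive parsing [...] productions $z \to xy$ [...] LOUDS 2) [...].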

1) Cormode, G. and Muthukrishnan, S.: The string edit distance matching problem with moves, ACM Trans. [...].
2) Delpratt, O., Rahman, N. and Raman, R.: Engineering the LOUDS Succinct Tree Representation, In WEA.
3) Ferragina, P. and Manzini, G.: Opportunistic data structures with applications, In FOCS.
4) Grossi, R., Gupta, A. and Vitter, J.: High-order entropy-compressed text indexes, In SODA.
5) Munro, J.: Tables, In FSTTCS.
6) Munro, J., Raman, R., Raman, V. and Rao, S.: Succinct representations of permutations, In ICALP.
7) Munro, J. and Raman, V.: Succinct representation of balanced parentheses and static trees, SIAM Journal on Computing.
8) Navarro, G.: Indexing text using the Ziv-Lempel trie, Journal of Discrete Algorithms.
9) Sadakane, K.: Compressed text databases with efficient query algorithms based on the compressed suffix array, In ISAAC.