Practical Implementation of Compressed Suffix Array on Modern Processors

DEIM Forum 2012 F11-2

Takeshi YAMAMURO, Makoto ONIZUKA, Toshio HITAKA, and Masashi YAMAMURO
NTT Cyber Space Laboratories, 1-1 Hikarino-o-ka, Yokosuka, Kanagawa, 239-0847 Japan
E-mail: {yamamuro.takeshi,onizuka.makoto,hitaka.toshio,yamamuro.masashi}@lab.ntt.co.jp

Abstract: This paper considers searching a text T of length N for a pattern P of length M, where N >> M, and focuses on the CPU cost of the binary search for P.

1.

Indexes for text search on the Web include q-gram indexes, the Suffix Array (SA), and the Compressed Suffix Array (CSA).

Let the text be $S[0 \ldots N-1] = s_0 s_1 s_2 \ldots s_{N-1}$ and the pattern be $P[0 \ldots M-1] = p_0 p_1 p_2 \ldots p_{M-1}$, where $s_i \in \Sigma$, $p_i \in \Sigma$, and $N \gg M$. Each character of $\Sigma$ is 1 byte (ASCII).

SA is used in applications such as [1] [2]. It locates P by binary search in $\Theta(M \log N)$ time. CSAs [3] [4] [5] compress the SA while keeping a search time of $\Theta(M \log N)$ (word RAM model).

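Fig. 1: In-memory access cost on a Xeon 5670 by stride size (4-byte elements); a difference of x228 is reported.

Fig. 2: In-memory search with SA (footnote 2) and CSA (footnote 3) on the TREC Terabyte Track .gov2 collection (a 2004 crawl of .gov), using 40,000 patterns P from the TREC 2009 Million Query Track (footnote 4). Best and worst cases are shown for CSA and SA.

Both SA and CSA locate P with $\Theta(\log N)$ binary search steps, but each step costs more CPU time in CSA than in SA. The contributions of this paper concern (1) the binary search and (2) the CPU cost of searching P in CSA.

Section 2 describes SA and its compression. Section 3 analyzes the CPU cost of CSA. Section 4 presents a binary search that reduces this CPU cost. Section 5 reports experiments, and Sections 6, 7, and 8 cover related work, discussion, and conclusions.

2.

2.1

Let the text be $T[0 \ldots N-1] = t_0 t_1 t_2 \ldots t_{N-1}$ with a terminator $T[N] = \$$. The suffixes of T are $S[i] = T[i \ldots N]$ $(i = 0, 1, 2, \ldots, N)$. SA is the array of suffix positions ordered so that

$S[SA[i]] < S[SA[j]] \iff i < j$    (1)

A pattern $P[0 \ldots M-1] = p_0 p_1 p_2 \ldots p_{M-1}$ is located by binary search over SA. Comparing P with a suffix S of T takes $\Theta(M)$ time and the binary search takes $\Theta(\log N)$ steps, so a search over SA takes $\Theta(M \log N)$ time.

Footnote 1: SA requires $N \log N$ bits.
Footnote 2: http://code.google.com/p/libdivsufsort/
Footnote 3: http://code.google.com/p/csalib/
Footnote 4: http://trec.nist.gov/data/million.query09.html

2.2 rank/select

Let $B[0 \ldots N-1] = b_0 b_1 b_2 \ldots b_{N-1}$ with $b_i \in \{0, 1\}$ be a bit array. $rank_1(B, i)$ returns the number of 1s in $B[0 \ldots i]$, and $select_1(B, i)$ returns the position of the $(i+1)$-th 1 in B. $rank_0$ and $select_0$ are defined in the same way for 0s.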
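The definitions of Sections 2.1 and 2.2 can be sketched in C++ as follows. This is a minimal, naive illustration and not the implementation evaluated in this paper: the names build_suffix_array, sa_search, rank1, and select1 are hypothetical, the suffix array is built by plain sorting without an explicit $ terminator, and rank/select scan the bit array linearly instead of using auxiliary structures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Naive suffix array: sort the suffix start positions lexicographically.
// A shorter suffix that is a prefix of a longer one sorts first, which
// matches the effect of a terminator smaller than every other character.
std::vector<std::size_t> build_suffix_array(const std::string& t) {
    std::vector<std::size_t> sa(t.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&t](std::size_t a, std::size_t b) {
        return t.compare(a, std::string::npos, t, b, std::string::npos) < 0;
    });
    return sa;
}

// Binary search for p over sa: Theta(log N) steps, each comparing at most
// M = p.size() characters. Returns the half-open range [first, second) of
// SA entries whose suffix starts with p (empty if p does not occur).
std::pair<std::size_t, std::size_t> sa_search(const std::string& t,
                                              const std::vector<std::size_t>& sa,
                                              const std::string& p) {
    auto lo = std::lower_bound(sa.begin(), sa.end(), p,
        [&t](std::size_t pos, const std::string& pat) {
            return t.compare(pos, pat.size(), pat) < 0;
        });
    auto hi = std::upper_bound(lo, sa.end(), p,
        [&t](const std::string& pat, std::size_t pos) {
            return t.compare(pos, pat.size(), pat) > 0;
        });
    return {static_cast<std::size_t>(lo - sa.begin()),
            static_cast<std::size_t>(hi - sa.begin())};
}

// rank1(B, i): number of 1s in B[0..i]. Linear scan for clarity.
std::size_t rank1(const std::vector<bool>& b, std::size_t i) {
    std::size_t count = 0;
    for (std::size_t k = 0; k <= i && k < b.size(); ++k) count += b[k];
    return count;
}

// select1(B, i): position of the (i+1)-th 1 in B, or b.size() if absent.
std::size_t select1(const std::vector<bool>& b, std::size_t i) {
    std::size_t seen = 0;
    for (std::size_t k = 0; k < b.size(); ++k)
        if (b[k] && seen++ == i) return k;
    return b.size();
}
```

For the text "abracadabra", sa_search with the pattern "abra" returns the range [1, 3), which covers the suffixes starting at positions 7 and 0. For the bit array 1 0 1 1 0, rank1(B, 2) is 2 and select1(B, 2) is 3.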

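rank/select over a bit array of length N can be answered in O(1) time with o(N) bits of auxiliary data; slower variants, for example O(log N) time, and larger ones, up to O(N) bits, also exist [7].

rank/select extends from a bit array B to a text $T[0 \ldots N-1] = t_0 t_1 t_2 \ldots t_{N-1}$ with $t_i \in \Sigma$: $rank_x(T, i)$ returns the number of occurrences of $x \in \Sigma$ in $T[0 \ldots i]$, and $select_x(T, i)$ returns the position of the $(i+1)$-th x. The Wavelet tree [4] reduces these operations to rank/select on bit arrays B. rank/select is also the basis of the LOUDS [8], BP [9], and DFUDS [10] tree representations and of [11].

2.3 Compressing SA with rank/select

If one SA entry takes 4 bytes, SA occupies 4N bytes (footnote 5). CSA represents SA through the function ψ:

$\psi[i] = SA^{-1}[SA[i] + 1]$    (2)

where $SA^{-1}$ is the inverse of SA, that is, $SA^{-1}[i] = j$ when $SA[j] = i$. ψ can be obtained from the BWT [12] of T instead of SA. The BWT is defined as

$BWT[i] = T[SA[i] - 1]$    (3)

Using the BWT, ψ is computed without SA as

$SA^{-1}[SA[i] + 1] = select_x(BWT, i - cum[x])$    (4)

where x is the first character of the suffix $S[SA[i]]$ and $cum[x]$ is the number of characters in T that are smaller than x. Searching for P with a CSA built on ψ is described in [3] [4] [5].

Footnote 5: sizes of 0.20N bytes to N bytes are reported in [6].

3.

In CSA, each step of the binary search invokes rank/select [3] [4] [5], which costs CPU time. To examine the CPU cost of CSA, the search was profiled with Valgrind. Fig. 3 shows the valgrind results for SA and CSA: Ins.-refs and Data-refs measure the CPU work, and L1-misses and L2-misses measure the behavior of the L1/L2 caches. The comparison of SA and CSA in these terms motivates reducing the CPU cost of the binary search in CSA.

4.

4.1

Fig. 4 outlines the proposed CSA. The ψ of an SA of length N is divided into K blocks $BL_i$ $(i = 1 \ldots K)$, such as $BL_1$ and $BL_2$, and an auxiliary array F is placed in front of ψ. F is searched first, so that accesses to ψ through rank/select are confined to a single block BL. The search takes two steps:

1. Binary search over F. In CSA and SA, each binary search step compares a suffix S with P; over F this comparison needs no rank/select, which reduces the CPU cost.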

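   F holds the first L bytes of each selected suffix S of SA, which keeps the CPU cost of a comparison low.
2. Binary search within the block BL selected in step 1 (Fig. 4). The division into blocks is described in Section 4.2, and the comparison against the suffixes S of SA in Section 4.3.

4.2

The blocks of Fig. 4 are defined through reference probabilities. Let $p_i = Pr(\psi[i])$ for ψ and $P_j = Pr(F[j])$ for F, with $i = 0 \ldots N-1$ and $j = 1 \ldots K$, where $Pr(A)$ is the probability that A is referenced when searching for P. ψ is divided into K blocks BL such that

$|BL_i| > |BL_j| \iff P_j < P_i$    (5)

A binary search within $BL_i$ takes $\log(|BL_i|)$ steps, where $|BL_i| = N \cdot P_i$. The number of binary search steps over ψ therefore ranges from $\log(|F|) + \log(\min(|BL_i|))$ to $\log(|F|) + \log(\max(|BL_i|))$, compared with $\log(N)$ for a plain binary search. The generation of F is described in Section 4.4.

4.3

Fig. 5 illustrates the search with prefix length L. As outlined in Section 4.1, the binary search of CSA and SA compares a suffix S with P. In F, each suffix S is represented by its first L bytes, so the binary search over F compares the first L bytes of P with the L-byte entries of F. Only when the first L bytes of P and S are equal is the suffix S itself compared. The encoding of the first L bytes of S is described in Section 4.5.

4.4 Generation of F

Algorithm 1  Pseudo code to generate F
     1: /*
     2:   suffix:  Ordered suffixes
     3:   nref:    Array of counters
     4:   chunksz: Number of ψ sharing a counter
     5:   pr:      Array of reference probabilities in CHUNK
     6:   ratio:   Array of sizes allocated in F
     7:   B:       Bit array to map F with suffix
     8: */
     9: pr = calc_probabilities(nref);
    10: ratio = allocate_F(pr, sizeof(F));
    11: for i = 1 to sizeof(ratio) do
    12:   for j = 1 to ratio[i] do
    13:     set_bit(B, i * chunksz + j * (chunksz / ratio[i]));
    14:   end for
    15: end for
    16: for i = 1 to sizeof(F) do
    17:   push_back(F, TRANSLATE(suffix[select_1(B, i)]));
    18: end for

Algorithm 1 generates F. nref holds the reference counters of ψ; one counter is shared by chunksz consecutive entries of ψ, and such a group of entries is called a CHUNK. Lines 9-10 decide how many entries of F each CHUNK receives: line 9 computes the reference probability of each CHUNK from nref, and line 10 divides the size of F among the CHUNKs according to these probabilities.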
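Algorithm 1 can be sketched in C++ as follows. This is an illustration under stated assumptions, not the authors' implementation: the bodies of calc_probabilities and allocate_F are not given in the text, so they are assumed here to normalise the counters and to allocate a proportional share rounded down (at most chunksz entries per CHUNK). The translate helper is assumed to pack the first 4 bytes of a suffix into a big-endian integer, indices are 0-based, and the FIndex struct and generate_F name are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Assumed TRANSLATE: pack the first 4 bytes (zero padded) big-endian, so
// that integer order agrees with lexicographic byte order.
std::uint32_t translate(const std::string& s) {
    std::uint32_t val = 0;
    for (std::size_t k = 0; k < 4; ++k) {
        std::uint32_t c = k < s.size() ? static_cast<unsigned char>(s[k]) : 0u;
        val = (val << 8) | c;
    }
    return val;
}

// Assumed calc_probabilities: normalise the per-CHUNK reference counters.
std::vector<double> calc_probabilities(const std::vector<std::uint64_t>& nref) {
    const double total = static_cast<double>(
        std::accumulate(nref.begin(), nref.end(), std::uint64_t{0}));
    std::vector<double> pr(nref.size(), 0.0);
    if (total > 0.0)
        for (std::size_t i = 0; i < nref.size(); ++i)
            pr[i] = static_cast<double>(nref[i]) / total;
    return pr;
}

// Assumed allocate_F: proportional share of f_size, rounded down and capped
// at chunksz so that every selected position stays inside its CHUNK.
std::vector<std::size_t> allocate_F(const std::vector<double>& pr,
                                    std::size_t f_size, std::size_t chunksz) {
    std::vector<std::size_t> ratio(pr.size());
    for (std::size_t i = 0; i < pr.size(); ++i)
        ratio[i] = std::min(chunksz, static_cast<std::size_t>(pr[i] * f_size));
    return ratio;
}

struct FIndex {
    std::vector<bool> B;           // marks the suffixes represented in F
    std::vector<std::uint32_t> F;  // translated prefixes, in suffix order
};

FIndex generate_F(const std::vector<std::string>& suffix,
                  const std::vector<std::uint64_t>& nref,
                  std::size_t chunksz, std::size_t f_size) {
    const auto pr = calc_probabilities(nref);               // line 9
    const auto ratio = allocate_F(pr, f_size, chunksz);     // line 10
    FIndex out;
    out.B.assign(suffix.size(), false);
    for (std::size_t i = 0; i < ratio.size(); ++i)          // lines 11-15
        for (std::size_t j = 0; j < ratio[i]; ++j) {
            const std::size_t pos = i * chunksz + j * (chunksz / ratio[i]);
            if (pos < out.B.size()) out.B[pos] = true;
        }
    // Lines 16-18: visiting the set bits in order is equivalent to calling
    // select_1(B, i) for increasing i.
    for (std::size_t pos = 0; pos < out.B.size(); ++pos)
        if (out.B[pos]) out.F.push_back(translate(suffix[pos]));
    return out;
}
```

Because suffix is sorted and translate preserves byte order, the resulting F is sorted and can be binary searched with integer comparisons. With eight sorted suffixes, chunksz = 4, counters {3, 1}, and four entries in F, the first CHUNK contributes three entries and the second one.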

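As in Fig. 4, B maps the entries of F to positions in suffix. Line 13 sets the bits of B at the positions that delimit the blocks BL, and line 17 fills F from the suffixes marked in B. As described in Section 4.2, the suffixes S of SA are stored in F as their first L bytes; the TRANSLATE function in line 17 performs this conversion, and its effect is evaluated in Section 5.2.

4.5 Translation of the first L bytes

Algorithm 2 translates the first L bytes of a suffix for L = 4. Each character is 1 byte, so with L = 4 the value val is a 4-byte integer: the four 1-byte characters are packed into one 4-byte value, one byte at a time (lines 3-6). The result is what is stored in F.

Algorithm 2  Pseudo code to translate characters (L=4)
    1: /* c_0 ... c_3: input characters from the 0-th to the 3-rd ones */
    2: val = 0;
    3: val |= c_0 << 24;
    4: val |= c_1 << 16;
    5: val |= c_2 << 8;
    6: val |= c_3;
    7: return val;

Algorithm 3 is the binary search over F for P. The pattern p is translated by Algorithm 2 into an L-byte integer (line 8), and F is binary searched with pint (lines 11-16). As described in Section 4.3, when the first L bytes are equal (line 17), the suffix S itself is compared with p (line 18). The binary search over F (lines 11-16) uses integer comparisons only; it needs no rank/select, which reduces the CPU cost.

Algorithm 3  Pseudo code to traverse F
     1: /*
     2:   p:      Input pattern
     3:   pint:   Translated pattern
     4:   cpos:   Position of current searches
     5:   suffix: Ordered suffixes
     6:   B:      Bit array generated in Algorithm 1
     7: */
     8: pint = TRANSLATE(p);
     9: len = sizeof(F) / 2;
    10: while len != 0 do
    11:   len = len / 2;
    12:   if pint < F[cpos] then
    13:     cpos -= len;
    14:   else if pint > F[cpos] then
    15:     cpos += len;
    16:   end if
    17:   if pint == F[cpos] then
    18:     if p < suffix[select_1(B, cpos)] then
    19:       cpos -= len;
    20:     else
    21:       cpos += len;
    22:     end if
    23:   end if
    24: end while
    25: return cpos;

5.

This section compares the proposed method with the SA and CSA of Section 1 (Fig. 2). The proposed implementation is called Practical CSA (pcsa). In pcsa, ψ is compressed with γ codes and stored in dag_vector (footnote 6), and T is compressed with LZ-End [17], a variant of LZ77. ψ is accessed through the rank/select operations of Section 2.2. For the structure of Section 4.2, L = 4 is used, so each entry of F is 4 bytes.

The data is 2GiB of the TREC Terabyte Track .gov2 collection, and the patterns are taken from the TREC 2009 Million Query Track, as in Section 1. The machines are:

- Xeon 5670: 16GiB memory / 1 CPU / Intel Hyper-Threading / 6 cores / 31.8GiB/s memory bandwidth.
- Xeon 5260, used for CPU profiling with oprofile v0.9.6 instead of the Xeon 5670: 16GiB memory / 1 CPU / 2 cores / 21.2GiB/s memory bandwidth.

All code is written in C/C++ and compiled with GNU Compiler Collection v4.1.2 at -O2.

Footnote 6: https://github.com/pfi/dag_vector

5.1 Cost of γ and LZ-End

This section measures the cost that the γ-coded ψ and the LZ-End-compressed T add to pcsa.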

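Accessing ψ through rank/select on dag_vector was measured at 254.4µs. Table 1 shows the access time of LZ-End and its deterioration ratio against uncompressed text. Extracting a pattern P of length M from LZ-End takes O(M) time [17], so the ratio grows with the pattern length.

Table 1: Access time of LZ-End (µs)

    Pattern Length         1       4       8       12      16
    LZ-End                 1.20    3.15    6.15    8.34    9.60
    Uncompressed           0.455   0.461   0.654   0.695   0.702
    Deterioration Ratio    x2.64   x6.84   x9.40   x11.99  x13.75

In pcsa, the first $\log(|F|)$ steps of the binary search compare P against F and do not access the compressed T, so the cost of LZ-End is avoided for these $\log(|F|)$ steps.

5.2

Fig. 6 shows the search time, measured with the 64-bit rdtsc counter of the Intel x86 CPU, for F of 2MiB, 8MiB, and 32MiB. pcsa occupies 1.17GiB, and the overhead of F is about 3%. With |F| = 32MiB, a search for P takes 13.07µs in pcsa, because the binary search over F avoids the CPU cost of rank/select; SA takes 3.37µs.

5.3

The CPU cost was profiled with oprofile. Fig. 7 breaks the oprofile result down into three components, branch penalties / stall time / complete instructions, together with the number of instructions (# of instructions).

Fig. 8 (axes: Memory Consumed and Throughput) compares CSA, pcsa with |F| = 32MiB, and SA; a ratio of 1/5 between SA and pcsa is reported. Trends in CPU architecture are discussed in [22] [20] [21].

6.

Related work is discussed along two lines: (1) compression techniques used in DB and Web search, and (2) CSA.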

(1) Compression methods designed for CPU efficiency include PForDelta [13], its variant OPTPForDelta [14], Simple9/16 [15] [14], and VSEncoding [16].

(2) Existing work on CSA does not address the CPU cost of the search, which is the subject of this paper.

7.

Two issues remain. (1) The layout of F: F is built from the reference counts of ψ per CHUNK, and the binary search over F could itself be made more CPU-efficient. (2) Techniques from in-memory DB indexes that exploit the L1/L2 caches [20] [19], such as FAST [20] by Intel, could be applied to the search.

8.

For a text T of length N and a pattern P of length M (N >> M), this paper examined search with CSA and showed that the rank/select operations of CSA dominate its CPU cost. The proposed method adds a structure of about 3% and was compared with SA and CSA in terms of memory consumption and throughput.

References

[1] Daisuke Okanohara and Jun-ichi Tsujii, Text Categorization with All Substring Features, Proc. of SIAM '09, pp. 838-846, 2009.
[2] Choon Hui Teo and S. V. N. Vishwanathan, Fast and space efficient string kernels using suffix arrays, International Conference on Machine Learning, pp. 929-936, 2006.
[3] Kunihiko Sadakane, New text indexing functionalities of the compressed suffix arrays, Journal of Algorithms, Vol. 48, Issue 2, pp. 294-313, 2003.
[4] Roberto Grossi and Ankur Gupta, High-order entropy-compressed text indexes, Proc. of SODA '03, pp. 841-850, 2003.
[5] Roberto Grossi and Jeffrey Scott Vitter, Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching, SIAM Journal on Computing, Vol. 35, Issue 2, pp. 378-407, 2005.
[6] R. Baeza-Yates and B. Ribeiro, Modern Information Retrieval, Addison-Wesley, 1999.
[7] Daisuke Okanohara and Kunihiko Sadakane, Practical Entropy-Compressed Rank/Select Dictionary, Proc. of ALENEX '07, pp. 60-70, 2007.
[8] O'Neil Delpratt, Naila Rahman, and Rajeev Raman, Engineering the LOUDS Succinct Tree Representation, Proc. of WEA '06, pp. 134-145, 2006.
[9] Richard F. Geary et al., A simple optimal representation for balanced parentheses, Journal of Theoretical Computer Science, Vol. 368, Issue 3, pp. 231-246, 2006.
[10] David Benoit et al., Representing Trees of Higher Degree, Journal of Algorithmica, Vol. 43, Issue 4, pp. 275-292, 2005.
[11] Arash Farzan and Johannes Fischer, Compact Representation of Posets, Proc. of ISAAC '11, pp. 302-311, 2011.
[12] Michael Burrows and David Wheeler, A block-sorting lossless data compression algorithm, Technical Report 124, 1994.
[13] Marcin Zukowski et al., Super-Scalar RAM-CPU Cache Compression, Proc. of ICDE '06, pp. 59-71, 2006.
[14] Hao Yan, Shuai Ding, and Torsten Suel, Inverted index compression and query processing with optimized document ordering, Proc. of WWW '09, pp. 401-410, 2009.
[15] Vo Ngoc Anh and Alistair Moffat, Inverted Index Compression Using Word-Aligned Binary Codes, Journal of Information Retrieval, Vol. 8, Issue 1, pp. 151-166, 2005.
[16] Fabrizio Silvestri and Rossano Venturini, VSEncoding: efficient coding and fast decoding of integer lists via dynamic programming, Proc. of CIKM '10, pp. 1219-1228, 2010.
[17] Sebastian Kreft and Gonzalo Navarro, LZ77-Like Compression with Fast Random Access, Proc. of DCC '10, pp. 239-248, 2010.
[18] Pawel Gawrychowski, Pattern matching in Lempel-Ziv compressed strings: fast, simple, and deterministic, Proc. of ESA '11, pp. 421-432, 2011.
[19] Jason Sewall et al., PALM: Parallel Architecture-Friendly Latch-Free Modification to B+Trees on Many-Core Processors, Proc. of VLDB '11, 2011.
[20] Changkyu Kim et al., Designing Fast Architecture Sensitive Tree Search on Modern Multi-Core/Many-Core Processors, ACM Transactions on Database Systems, 9(4), 2011.
[21] Nadathur Satish et al., Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort, Proc. of SIGMOD '10, 2010.
[22] Reilly Matthew, When Multicore Isn't Enough: Trends and the Future for Multi-Multicore Systems, In HPEC, 2008.