A data structure based on grammatical compression to detect long pattern

Transcript:

Naoya Kishiue (1), Masaya Nakahara (1), Shirou Maruyama (2) and Hiroshi Sakamoto (3)

We propose a method for searching for long patterns in a compressed index based on a context-free grammar (CFG). The method finds a pattern $P$ in $O\left(\frac{1}{\varepsilon}\,(m\log n + occ_c(\log m\log u))\right)$ time using $(1+\varepsilon)\,n\log n + [\ldots]\,n + o(n)$ bits, where $n$ is the number of variables generated for the compressed text (of original size $u$), $m = |P|$, and $0 < \varepsilon < 1$. In experiments, the proposed method was faster than existing methods (e.g., LZ-index 8), FM-index 3)) on long patterns.

(1) Kyushu Institute of Technology, Graduate School of Computer Science and Systems Engineering
(2) Kyushu University, Graduate School of Information Science and Electrical Engineering
(3) Kyushu Institute of Technology, Faculty of Computer Science and Systems Engineering

1. […] FM-index 3) […] 9) […] The index occupies $(1+\varepsilon)\,n\log n + [\ldots]\,n + o(n)$ bits for $0 < \varepsilon < 1$, and a pattern $P$ of length $m$ is found in $O\left(\frac{1}{\varepsilon}\,(m\log n + occ_c(\log m\log u))\right)$ time, where $occ_c$ […] $P$ […] 9). […] CFG […] production rules of the form $X \to ab$ […] edit sensitive parsing 1) […]

[…] Over the alphabet $\Sigma$, a pair of adjacent symbols $a_i a_j$ is called a digram, and $lca(i, j)$ is defined for $a_i$ and $a_j$ […]. For a string $w$, $w[i]$ denotes its $i$-th symbol, and each $w[i]$ is assigned a label $label(w[i])$ 1).

© 2011 Information Processing Society of Japan

[Fig. 1: Alphabet tree and lca]

[Fig. 2: Alphabet reduction and landmarks (rows: dummy, $w$, 1st labels, 2nd labels, final labels, landmarks)]

$$label(w[i]) = \begin{cases} 2\,lca(w[i-1], w[i]) & (w[i-1] < w[i]) \\ 2\,lca(w[i-1], w[i]) + 1 & (\text{otherwise}) \end{cases}$$

[…] For symbols $a_i$ with $i \in \{1, 2, \ldots, n\}$, the labels lie in $\{2, 3, \ldots, 2\log n + 1\}$ […] $\log n$ […] edit sensitive parsing […]

Edit sensitive parsing […] For example, $s_i = aaaab$ is transformed into $s_i = XXb$ with $X \to aa$. […] For $s_i = abacabcda$ with two landmarks […], edit sensitive parsing first gives $s_i = aXcaYda$ with $X \to ba$, $Y \to bc$, and then $s_i = aXZYW$ with $X \to ba$, $Y \to bc$, $Z \to ca$, $W \to da$. […] $s_1, s_2, \ldots, s_k$ […]

[…] CFG $G$ […] pattern $P$ […] $P_1, P_2, \ldots, P_k$ […] $w[i,j] = w[k,l] = \alpha\beta\gamma$ […] $\beta$ […]
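The label rule and the two parsing examples above can be checked with a short sketch. It rests on assumptions that the transcript does not confirm: symbols $a_1, \ldots, a_n$ are taken as the integers 1..n placed left to right on the leaves of a complete binary alphabet tree, `lca(i, j)` is taken to be the height of the lowest common ancestor of leaves $a_i$ and $a_j$ (leaves at height 0), and the label of $w[i]$ is $2\,lca(w[i-1], w[i])$, plus 1 when $w[i-1] > w[i]$. The helper names `lca`, `labels` and `expand` are hypothetical. Landmark selection is not shown.

```python
def lca(i: int, j: int) -> int:
    """Assumed definition: height of the lowest common ancestor of leaves
    a_i and a_j (1-based) in a complete binary tree with leaves at height 0."""
    return ((i - 1) ^ (j - 1)).bit_length()


def labels(w: list[int]) -> list[int]:
    """Return label(w[i]) for every position that has a left neighbour.

    Adjacent symbols must differ; runs of equal symbols are handled
    separately by the parsing and are rejected here."""
    out = []
    for prev, cur in zip(w, w[1:]):
        if prev == cur:
            raise ValueError("adjacent symbols must differ")
        h = lca(prev, cur)
        out.append(2 * h if prev < cur else 2 * h + 1)
    return out


def expand(s: str, rules: dict[str, str]) -> str:
    """Expand variables (upper-case letters) back to terminals."""
    return "".join(expand(rules[c], rules) if c in rules else c for c in s)


if __name__ == "__main__":
    # Symbols drawn from an alphabet of size n = 8, so labels fall in 2..7.
    print(labels([1, 2, 5, 3, 4, 8]))
    # The two parsing examples from the text.
    print(expand("XXb", {"X": "aa"}))
    print(expand("aXZYW", {"X": "ba", "Y": "bc", "Z": "ca", "W": "da"}))
```

Under these assumptions two neighbouring labels never coincide when neighbouring symbols differ, which is the property alphabet reduction relies on before landmarks are chosen.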

[Fig.: State of compression]

[Fig.: Extracting the Core (compressed pattern, core)]

[Fig.: Adjacent relation of subtrees]

[Fig.: Parsing Tree and DAG (parsing tree of the original text and its DAG)]

[…] $w[i,j] = w[k,l]$ […] $\Sigma$ […] $w[i,j] = w[k,l] = x\alpha y$ […] $\alpha$ […] For a pattern $P$ […] $P_1, P_2, \ldots, P_k$ […] $G$ […] $P_1 \cdots P_i \cdots P_k$ […] $O(\log |P|)$.

[…] $X \to ab$ […] $u$ […] $\log u$ […] $O(\log u)$.

[…] DAG […] $G$ […] $G_{left}$, $G_{right}$ […]

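[Fig.: Decomposition of DAG (DAG representation, $G_{left}$, $G_{right}$)]

[…] The DAG $G$ is decomposed into $G_{left}$ and $G_{right}$ […] $O(\log u)$ […] $O(\log |P| \log u)$ […] $O(occ_c(\log |P| \log u))$ […]

[…] For a sequence $S$ over $\Sigma$ with positions $[1, n]$, the following operations are used:
(1) $rank_c(S, i)$: the number of occurrences of $c \in \Sigma$ in $S[1, i]$
(2) $select_c(S, i)$: the position of the $i$-th occurrence of $c$ in $S$
(3) $access(S, i)$: $S[i]$
With $|\Sigma| = \sigma$, these operations are supported in $O(\log\sigma)$ time using $n\log\sigma + o(n\log\sigma)$ bits […]; for a bit sequence they take $O(1)$ time using $n + o(n)$ bits […].

The balanced parenthesis representation $P$ […] of a tree $T$ whose root has subtrees $T_1, T_2, \ldots, T_d$ is defined by

$$P(T) = \begin{cases} () & (T \text{ is a single node}) \\ (\,P(T_1)\,P(T_2) \cdots P(T_d)\,) & (\text{otherwise}) \end{cases}$$

Writing the two parentheses as bits, a tree with $n$ nodes is represented by a bit string $P$ of length $2n$. The following operations on $P$ are used […]:
(1) $findclose(i)$: the position of the closing parenthesis that matches the opening parenthesis $P[i]$
(2) $findopen(i)$: the position of the opening parenthesis that matches the closing parenthesis $P[i]$
(3) $enclose(i)$: the position of the opening parenthesis of the pair that most tightly encloses the pair of $P[i]$

With these, tree navigation from a node $x$ (identified by the position of its opening parenthesis) is:
(1) $parent(x)$: $enclose(x)$
(2) $firstchild(x)$: $x + 1$
(3) $nextsibling(x)$: $findclose(x) + 1$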
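The parenthesis operations listed above can be illustrated with a minimal sketch. It is a linear-scan illustration, not the constant-time succinct structure the text cites: positions are 0-based (the text uses 1-based positions), trees are given as nested Python lists, and all function names except `findclose`, `enclose`, `parent`, `firstchild` and `nextsibling` are hypothetical.

```python
def encode(tree: list) -> str:
    """Balanced-parenthesis string of a tree given as nested child lists;
    a leaf is the empty list []."""
    return "(" + "".join(encode(child) for child in tree) + ")"


def findclose(P: str, i: int) -> int:
    """Position of the ')' matching the '(' at position i (linear scan)."""
    depth = 0
    for j in range(i, len(P)):
        depth += 1 if P[j] == "(" else -1
        if depth == 0:
            return j
    raise ValueError("unbalanced sequence")


def enclose(P: str, i: int):
    """Position of the '(' of the tightest pair enclosing the pair opened
    at position i, or None when i is the root."""
    skipped = 0  # closed sibling subtrees still to be skipped over
    for j in range(i - 1, -1, -1):
        if P[j] == ")":
            skipped += 1
        elif skipped == 0:
            return j
        else:
            skipped -= 1
    return None


def parent(P: str, x: int):
    return enclose(P, x)


def firstchild(P: str, x: int):
    """x + 1 if that position opens a child, else None (x is a leaf)."""
    return x + 1 if P[x + 1] == "(" else None


def nextsibling(P: str, x: int):
    """findclose(x) + 1 if that position opens a node, else None."""
    j = findclose(P, x) + 1
    return j if j < len(P) and P[j] == "(" else None


def preorder(P: str, i: int) -> int:
    """1-based preorder rank of the node opened at i: rank of '(' up to i."""
    return P[: i + 1].count("(")


if __name__ == "__main__":
    P = encode([[[], []], []])  # root with an internal child and a leaf child
    print(P, findclose(P, 1), parent(P, 4), nextsibling(P, 1), preorder(P, 7))
```

Replacing the scans with rank/select directories and the matching-parenthesis structures cited in the text would give the stated constant-time bounds; the navigation identities themselves stay the same.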

[Fig.: reverse dictionary representation by binary search (the in-branching children of $x$ in $T_L$ are sorted by the original variables of their parents in $T_R$; the original variables of the $y_i$ are accessible through the succinct permutation)]

[…] For a node at position $i$ of $P$, $p = preorder(i) = rank_{(}(P, i)$ and $i = select_{(}(P, p)$ […] in $O(1)$ time using $2n + o(n)$ bits […].

[…] For a permutation $\pi$, $\pi[i]$ […] and $\pi^{-1}[i]$ […] The permutation is stored in $(1+\varepsilon)\,n\log n + O(n)$ bits so that $\pi[i]$ is computed in $O(1)$ time and $\pi^{-1}[i]$ in $O(\frac{1}{\varepsilon})$ time […].

[…] For a production rule $z \to xy$, the reverse dictionary returns $z$ for a given $xy$ […] $T_L(x)$, $T_R(x)$ […] $T_L(z_1), T_L(z_2), \ldots, T_L(z_k)$ […] $T_R(z_i)$ […]

[Fig. 9: Left/right tree and succinct representation (DAG representation, $T_L$ and $T_R$, with parenthesis sequences ((((()))()((())))(())) and ((((())()(()))((())))) )]

[…] $y_i$ […] $xy$ […] $T_L(x)$ […] $O(\log n)$ […] CFG […] for a pattern $P$ of length $m$ […] $O(m\log n)$ […]
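The time/space trade-off for $\pi^{-1}$ can be illustrated by the usual shortcut idea for succinct permutations: on every cycle longer than $t$, every $t$-th element stores a pointer $t$ steps backwards, so an inverse query walks at most about $2t$ steps. The sketch below is an illustration under assumptions, not the structure used in the paper: it keeps the pointers in a Python dict rather than a bit-packed array, uses 0-based indices, and the names `build_shortcuts` and `inverse` are hypothetical. Choosing $t$ proportional to $1/\varepsilon$ corresponds to the $O(1/\varepsilon)$ query time stated above.

```python
def build_shortcuts(pi: list[int], t: int) -> dict[int, int]:
    """For each cycle of pi longer than t, mark every t-th element and store
    a pointer to the element t steps earlier on the same cycle."""
    n = len(pi)
    seen = [False] * n
    back: dict[int, int] = {}
    for start in range(n):
        if seen[start]:
            continue
        cycle = []
        j = start
        while not seen[j]:
            seen[j] = True
            cycle.append(j)
            j = pi[j]
        if len(cycle) > t:
            for k in range(0, len(cycle), t):
                # For k = 0 the negative index wraps around the cycle,
                # which is still exactly t steps backwards.
                back[cycle[k]] = cycle[k - t]
    return back


def inverse(pi: list[int], back: dict[int, int], i: int) -> int:
    """Return j with pi[j] == i, following pi forwards and using at most
    one backward shortcut."""
    j = i
    jumped = False
    while pi[j] != i:
        if not jumped and j in back:
            j = back[j]      # jump t steps backwards, landing before i
            jumped = True
        else:
            j = pi[j]        # ordinary forward step
    return j


if __name__ == "__main__":
    pi = [2, 0, 3, 1, 4]     # one cycle of length 4 and one fixed point
    back = build_shortcuts(pi, t=2)
    print([inverse(pi, back, i) for i in range(len(pi))])
```

With about $n/t$ pointers of $\log n$ bits each, the extra space is roughly $\varepsilon\,n\log n$ bits on top of the $n\log n$ bits for $\pi$ itself, which matches the shape of the bound quoted in the text.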

[…] $O(\frac{1}{\varepsilon})$ […] The CFG-based index occupies $(1+\varepsilon)\,n\log n + [\ldots]\,n + o(n)$ bits and finds $P$ in $O\left(\frac{1}{\varepsilon}\,(m\log n + occ_c(\log m\log u))\right)$ time.

[…] The proposed method is compared with LZ-index 8), CSArray 9) and FM-index 3). Environment: CPU: Intel Xeon […] (Quad Core, HT, […] GHz), Memory: […] GB, CentOS […] ([…]-bit), gcc […]. The data are taken from the Pizza & Chili corpus […]

[Fig.: Time to construct index]

[Fig.: Time to count occurrences]

[Fig. 11: Index size]

[…] edit sensitive parsing […] $z \to xy$ […] LOUDS 2) […]

1) Cormode, G. and Muthukrishnan, S.: The string edit distance matching problem with moves, ACM Trans. […].
2) Delpratt, O., Rahman, N. and Raman, R.: Engineering the LOUDS Succinct Tree Representation, In WEA.
3) Ferragina, P. and Manzini, G.: Opportunistic data structures with applications, In FOCS.
4) Grossi, R., Gupta, A. and Vitter, J.: High-order entropy-compressed text indexes, In SODA.
5) Munro, J.: Tables, In FSTTCS.
6) Munro, J., Raman, R., Raman, V. and Rao, S.: Succinct representations of permutations, In ICALP.
7) Munro, J. and Raman, V.: Succinct representation of balanced parentheses and static trees, SIAM Journal on Computing.
8) Navarro, G.: Indexing text using the Ziv-Lempel trie, Journal of Discrete Algorithms.
9) Sadakane, K.: Compressed text databases with efficient query algorithms based on the compressed suffix array, In ISAAC.