Text Mining using Linguistic Information

Σχετικά έγγραφα

Mining Syntactic Structures from Text Database

The Algorithm to Extract Characteristic Chord Progression Extended the Sequential Pattern Mining

Ανακάλυψη κανόνων συσχέτισης από εκπαιδευτικά δεδομένα

A Method for Creating Shortcut Links by Considering Popularity of Contents in Structured P2P Networks

Newman Modularity Newman [4], [5] Newman Q Q Q greedy algorithm[6] Newman Newman Q 1 Tabu Search[7] Newman Newman Newman Q Newman 1 2 Newman 3

2016 IEEE/ACM International Conference on Mobile Software Engineering and Systems

SocialDict. A reading support tool with prediction capability and its extension to readability measurement

GPGPU. Grover. On Large Scale Simulation of Grover s Algorithm by Using GPGPU

HOSVD. Higher Order Data Classification Method with Autocorrelation Matrix Correcting on HOSVD. Junichi MORIGAKI and Kaoru KATAYAMA

Simplex Crossover for Real-coded Genetic Algolithms

ER-Tree (Extended R*-Tree)

Vol. 31,No JOURNAL OF CHINA UNIVERSITY OF SCIENCE AND TECHNOLOGY Feb

Re-Pair n. Re-Pair. Re-Pair. Re-Pair. Re-Pair. (Re-Merge) Re-Merge. Sekine [4, 5, 8] (highly repetitive text) [2] Re-Pair. Blocked-Repair-VF [7]

Buried Markov Model Pairwise

Research of Han Character Internal Codes Recognition Algorithm in the Multi2lingual Environment

: Monte Carlo EM 313, Louis (1982) EM, EM Newton-Raphson, /. EM, 2 Monte Carlo EM Newton-Raphson, Monte Carlo EM, Monte Carlo EM, /. 3, Monte Carlo EM

Quick algorithm f or computing core attribute

ΙΠΛΩΜΑΤΙΚΗ ΕΡΓΑΣΙΑ. ΘΕΜΑ: «ιερεύνηση της σχέσης µεταξύ φωνηµικής επίγνωσης και ορθογραφικής δεξιότητας σε παιδιά προσχολικής ηλικίας»

substructure similarity search using features in graph databases

Wiki. Wiki. Analysis of user activity of closed Wiki used by small groups

Resurvey of Possible Seismic Fissures in the Old-Edo River in Tokyo

Retrieval of Seismic Data Recorded on Open-reel-type Magnetic Tapes (MT) by Using Existing Devices

GPU. CUDA GPU GeForce GTX 580 GPU 2.67GHz Intel Core 2 Duo CPU E7300 CUDA. Parallelizing the Number Partitioning Problem for GPUs

Congruence Classes of Invertible Matrices of Order 3 over F 2

Homomorphism in Intuitionistic Fuzzy Automata

An Automatic Modulation Classifier using a Frequency Discriminator for Intelligent Software Defined Radio

Ordinal Arithmetic: Addition, Multiplication, Exponentiation and Limit

Numerical Analysis FMN011

c Key words: cultivation of blood, two-sets blood culture, detection rate of germ Vol. 18 No

Anomaly Detection with Neighborhood Preservation Principle

3: A convolution-pooling layer in PS-CNN 1: Partially Shared Deep Neural Network 2.2 Partially Shared Convolutional Neural Network 2: A hidden layer o

Τ.Ε.Ι. ΔΥΤΙΚΗΣ ΜΑΚΕΔΟΝΙΑΣ ΠΑΡΑΡΤΗΜΑ ΚΑΣΤΟΡΙΑΣ ΤΜΗΜΑ ΔΗΜΟΣΙΩΝ ΣΧΕΣΕΩΝ & ΕΠΙΚΟΙΝΩΝΙΑΣ

Automatic extraction of bibliography with machine learning

Indexing Methods for Encrypted Vector Databases

[4] 1.2 [5] Bayesian Approach min-max min-max [6] UCB(Upper Confidence Bound ) UCT [7] [1] ( ) Amazons[8] Lines of Action(LOA)[4] Winands [4] 1

Research on Economics and Management

IMES DISCUSSION PAPER SERIES

Detection and Recognition of Traffic Signal Using Machine Learning

ΑΝΙΧΝΕΥΣΗ ΓΕΓΟΝΟΤΩΝ ΒΗΜΑΤΙΣΜΟΥ ΜΕ ΧΡΗΣΗ ΕΠΙΤΑΧΥΝΣΙΟΜΕΤΡΩΝ ΔΙΠΛΩΜΑΤΙΚΗ ΕΡΓΑΣΙΑ

Modern Greek Extension

Development of a Seismic Data Analysis System for a Short-term Training for Researchers from Developing Countries

C.S. 430 Assignment 6, Sample Solutions

An Efficient Calculation of Set Expansion using Zero-Suppressed Binary Decision Diagrams

Η αλληλεπίδραση ανάμεσα στην καθημερινή γλώσσα και την επιστημονική ορολογία: παράδειγμα από το πεδίο της Κοσμολογίας

A Survey of Recent Clustering Methods for Data Mining (part 2)

Approximation of distance between locations on earth given by latitude and longitude

Nov Journal of Zhengzhou University Engineering Science Vol. 36 No FCM. A doi /j. issn

Reaction of a Platinum Electrode for the Measurement of Redox Potential of Paddy Soil

Commutative Monoids in Intuitionistic Fuzzy Sets

Πανεπιστήμιο Κρήτης, Τμήμα Επιστήμης Υπολογιστών Άνοιξη HΥ463 - Συστήματα Ανάκτησης Πληροφοριών Information Retrieval (IR) Systems

HIV HIV HIV HIV AIDS 3 :.1 /-,**1 +332

ΤΕΧΝΟΛΟΓΙΚΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΚΥΠΡΟΥ ΤΜΗΜΑ ΝΟΣΗΛΕΥΤΙΚΗΣ


1 n-gram n-gram n-gram [11], [15] n-best [16] n-gram. n-gram. 1,a) Graham Neubig 1,b) Sakriani Sakti 1,c) 1,d) 1,e)

Reading Order Detection for Text Layout Excluded by Image

Gemini, FastMap, Applications. Εαρινό Εξάμηνο Τμήμα Μηχανικών Η/Υ και Πληροϕορικής Πολυτεχνική Σχολή, Πανεπιστήμιο Πατρών

Πρόβλημα 1: Αναζήτηση Ελάχιστης/Μέγιστης Τιμής

Μιχαήλ Νικητάκης 1, Ανέστης Σίτας 2, Γιώργος Παπαδουράκης Ph.D 1, Θοδωρής Πιτηκάρης 3


Optimization, PSO) DE [1, 2, 3, 4] PSO [5, 6, 7, 8, 9, 10, 11] (P)

Finite Field Problems: Solutions

Overview. Transition Semantics. Configurations and the transition relation. Executions and computation

n 1 n 3 choice node (shelf) choice node (rough group) choice node (representative candidate)

[15], [16], [17] [6] [2] [5] Jiang [6] 2.1 [6], [10] Score(x, y) y ( 1) ( 1 ) b e ( 1 ) b e. O(n 2 ) Jiang [6] (word lattice reranking)

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ

Πανεπιστήμιο Πειραιώς Τμήμα Πληροφορικής Πρόγραμμα Μεταπτυχιακών Σπουδών «Πληροφορική»

Phys460.nb Solution for the t-dependent Schrodinger s equation How did we find the solution? (not required)

Kenta OKU and Fumio HATTORI

Εργαστήριο Ανάπτυξης Εφαρμογών Βάσεων Δεδομένων. Εξάμηνο 7 ο

Ηλεκτρονικά σώματα κειμένων και γλωσσική διδασκαλία: Διεθνείς αναζητήσεις και διαφαινόμενες προοπτικές για την ελληνική γλώσσα

,,, Learning to Identif y Chinese Comparative Sentences

Splice site recognition between different organisms

Test Data Management in Practice

Faruqui [7] WordNet [15] FrameNet [2] PPDB [8]

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ ΣΧΟΛΗ ΗΛΕΚΤΡΟΛΟΓΩΝ ΜΗΧΑΝΙΚΩΝ ΚΑΙ ΜΗΧΑΝΙΚΩΝ ΥΠΟΛΟΓΙΣΤΩΝ

IPSJ SIG Technical Report Vol.2014-CE-127 No /12/6 CS Activity 1,a) CS Computer Science Activity Activity Actvity Activity Dining Eight-He

Α/Α Υποέργου: Ε1 07 Τίτλος: ConServ: Δίκτυα Υπηρεσιών με Βάση τα Συμφραζόμενα: Διαχείριση, Δυναμική Προσαρμοστικότητα και Επεξεργασία Ερωτήσεων

Data mining Εξόρυξη εδοµένων. o Association rules mining o Classification o Clustering o Text Mining o Web Mining

ΚΒΑΝΤΙΚΟΙ ΥΠΟΛΟΓΙΣΤΕΣ

Toward a SPARQL Query Execution Mechanism using Dynamic Mapping Adaptation -A Preliminary Report- Takuya Adachi 1 Naoki Fukuta 2.

Scrub Nurse Robot: SNR. C++ SNR Uppaal TA SNR SNR. Vain SNR. Uppaal TA. TA state Uppaal TA location. Uppaal

«Συμπεριφορά μαθητών δευτεροβάθμιας εκπαίδευσης ως προς την κατανάλωση τροφίμων στο σχολείο»

Ανάκτηση Εικόνας βάσει Υφής με χρήση Eye Tracker

J. of Math. (PRC) 6 n (nt ) + n V = 0, (1.1) n t + div. div(n T ) = n τ (T L(x) T ), (1.2) n)xx (nt ) x + nv x = J 0, (1.4) n. 6 n

ΖητΗματα υποδοχης και προσληψης του Franz Kafka στην ΕλλΑδα

Τεχνικές Εξόρυξης Δεδομένων

ΣΤΥΛΙΑΝΟΥ ΣΟΦΙΑ

Dynamic types, Lambda calculus machines Section and Practice Problems Apr 21 22, 2016

Language Resources for Information Extraction:

MIDI [8] MIDI. [9] Hsu [1], [2] [10] Salamon [11] [5] Song [6] Sony, Minato, Tokyo , Japan a) b)

Practice Exam 2. Conceptual Questions. 1. State a Basic identity and then verify it. (a) Identity: Solution: One identity is csc(θ) = 1

Durbin-Levinson recursive method

«ΑΓΡΟΤΟΥΡΙΣΜΟΣ ΚΑΙ ΤΟΠΙΚΗ ΑΝΑΠΤΥΞΗ: Ο ΡΟΛΟΣ ΤΩΝ ΝΕΩΝ ΤΕΧΝΟΛΟΓΙΩΝ ΣΤΗΝ ΠΡΟΩΘΗΣΗ ΤΩΝ ΓΥΝΑΙΚΕΙΩΝ ΣΥΝΕΤΑΙΡΙΣΜΩΝ»


, -.

Bundle Adjustment for 3-D Reconstruction: Implementation and Evaluation

þÿÿ ÁÌ» Â Ä Å ¹µÅ Å½Ä ÃÄ

CHAPTER 25 SOLVING EQUATIONS BY ITERATIVE METHODS

Maxima SCORM. Algebraic Manipulations and Visualizing Graphs in SCORM contents by Maxima and Mashup Approach. Jia Yunpeng, 1 Takayuki Nagai, 2, 1

Transcript:

630-0101 8916-5 {taku-kukaoru-yayuuta-tmatsu}@isaist-naraacjp PrefixSpan : PrefixSpan Text Mining using Linguistic Information Taku Kudo Kaoru Yamamoto Yuta Tsuboi Yuji Matsumoto Graduate School of Information Science Nara Institute Science and Technology 8916-5 Takayama Ikoma Nara 630-0101 Japan {taku-kukaoru-yayuuta-tmatsu}@isaist-naraacjp Text mining has gained the focus of attention recently in particular the success in word clustering have been reported However many of these bag-of-word approaches ignore the implicit depency relations between words completely which are critical to understanding of the original text In this paper we apply NLP tools such as chunking and depency analysis to convert a raw text into a semi-structured text from which useful patterns are extracted We ext the PrefixSpan algorithm one of an efficient algorithms for sequential pattern mining to efficiently extract sub-structures from a linguistic data annotated by NLP tools Keywords : Text Mining Sequential Pattern Mining Semi-structured Data PrefixSpan

1 {( { }) ( { })} ( ) ( ) 3 [6] 2 Text Chunking ( ) Agrawal[2] 1 I = {i 1 i 2 i n } s 1 s 2 s n s k α β β α ( ) α β S id sid s sid s S = { sid 1 s 1 sid 2 s 2 sid 2 s n } α S S α support S (α) = { sid s ( sid s S) (α s)} S ( (minimum support) ξ) support S (α) ξ PrefixSpan α [5] 11 ξ [8] ( ) 3 PrefixSpan - Apriori 4 1

[1] 1 c d a:1 2 c c:1 minsup = 2 2 b c b:2 project 4 a b c:2 sid sequence projected database 1 d d:1 1 a c d a:4 2 2 a b c b:3 2 c a:1 3 c b a c:3 3 a c:1 a:4 4 a a b d:1 b:2 c:2 sequence database a:1 a b:2 Pei[5] Apriori - 1 d b:1 a c:2 3 b a d:1 count sup of 1 item results PrefixSpan 1: PrefixSpan PrefixSpan s = a 1 a 2 a m a a 1 a a 2 a a j 1 a a j = a j(1 call P refixspan(ε S) j m) a 1 a 2 a j s procedure P refixspan (α S α ) a prefix (prefix(s a)) B {b (s S α b s) (support S α ( b ) ξ)} a j+1 a m s a postfix foreach b B (postfix(s a)) j prefixpostfix (ε) # print αb support S α ( b ) S a # (project) S a (S α) b { sid s ( sid s S α) S s postfix(s a) (s = postfix(s b)) (s ε)} call P refixspan (αb (S α) b ) S a = { sid s ( sid s S) (s = postfix(s a)) (s ε)} 2: PrefixSpan S = { sid 3 ξ = 2 PrefixSpan 1 s 1 sid 2 s 2 sid n s n } P = { sid 1 1 sid 2 1 sid n 1 } 1 a b c call P refixspan(ε P ) d 1 procedure P refixspan (α P α) 3 a S a # B # S a ( 3 ) foreach sid l P α b c # C C = {} for i l + 1 to seq(s sid) PrefixSpan 2 b item(s sid i) if (C / b) B[b] B[b] sid i C C b postfix foreach b keys of B S 3 # ξ if ( B[b] < ξ) continue S sid k # seq(s k) seq(s k) i print αb B[b] item(s k i) seq(s k) call P refixspan (αb B[b]) seq(s k) 3: PrefixSpan 2

( ) 1 (abc)(ac)d(cf) + minsup = 2 2 (_d)c(bc)(ae) project sid sequence 3 (_b)(df)cb 4 1 a(abc)(ac)d(cf) a:4 (_f)cbc 2 (ad)c(bc)(ae) _d_b_f b:4 item S id sid 3 (ef)(ab)(df)cb c:4 i j(i < j) Relation(S sid i j) 4 e(af)cbc d:2 a a : 2 (a) (a) a _b c : 2 (a b) (c) seq(s sid) sequence database item(s sid i) item(s sid j) results 4: PrefixSpan + PrefixSpan 5 PrefixSpan 31 PrefixSpan ε ( ) PrefixSpan 3 1 4 N-gram () 4 PrefixSpan 1 a b c d e f a a prefix 42 PrefixSpan 6 prefix 4 (IN/OUT) 2 4 PrefixSpan 41 43 N-gram PrefixSpan N-gram N-gram N-gram 7 i j Window size 3 (7 ) N-gram N

S = { sid 1 s 1 sid 2 s 2 sid n s n } P = { sid 1 1 sid 2 1 sid n 1 } function Relation(S sid i j) call P refixspan(ε P ) procedure P refixspan (α P α) # B # foreach sid l P α # C C {} foreach i l + 1 to seq(s sid) b item(s sid i) r Relation(S sid l i) if (r ε and C / b r ) B[ b r ] B[ b r ] sid i C C b r foreach b r keys of B # ξ if ( B[ b r ] < ξ) continue # print α b r B[ b r ] call P refixspan (α b r B[ b r ]) 5: PrefixSpan function ElementRelation(S sid i j) x item(s sid i) y item(s sid j) if (x y element ) return IN else return OUT 8 9 2 6: Element Relation 8 (type) type(a) = NT (a ) type(a) = T ( ) 10 ( # C ) function NGramRelation(S sid i j) N-gram if (j i < C) return NGRAM else return ε # function NGramRelation(S sid i j) if (i + 1 = j) return NGRAM else return ε 7: N-gram Relation A a b c y C a e c z B b d e <a b c y A a e c z C b d e x B> 8:! " # x $&%('&% ) *+ -/10 9: function ChunkRelation(S sid i j) x item(s sid i) y item(s sid j) if (type(x) = NT) return CHUNK if (type(x) = T and type(y) = T and x y ) return CHUNK if (type(x) = T and type(y) = NT and y x ) return CHUNK return ε 10: Chunk Relation 44

b e d c a c d b e a 11: a b c d e f 2 2 1 12: a 0 45 5 ( / ) ChaSen 6 CaboCha 7 3 N-gram 4 (2 ) (1) head (2) head (3) head (2) [8] (4) 11 2 13 item(s sid i) item(s sid j) k 1 (k 1) Linux (XEON 22 GHz k 35GB ) WS (ε) 12 a Perl C++ 8 52 function DepRelation(S sid i j) x = item(s sid i) y = item(s sid j) if (x < j) return ε else r 0 while (x y) r r + 1 y y return r 13: DepRelation 5 51 2 30 38000 95 1 1 12 (minsup) ( ) N-gram 2 3 5 http://wwwaozoragrjp/ 6 http://chasenaist-naraacjp/ 7 http://claist-naraacjp/ taku-ku/software/cabocha/ 4 8 http://claist-naraacjp/ taku-ku/software/prefixspan/

1: ( ) minsup ( ) N-gram 2 22 N/A 3556 5 20 32 267 ( ) 10 19 30 240 15 19 28 229 20 19 29 221 2: ( ) y i x i minsup ( ) y i x i N-gram x 2 046 074 78 i 5 043 060 58 x i 10 041 057 52 15 041 055 48 20 041 055 46 [7] 62 N-gram ( : : ) PrefixSpan 3 ( : : ) ( { }) ( { }) ( { }) (1) (2) dice ( : : ) (3) (2) (( ) ( )) ( ( ( ))) PrefixSpan (1) 6 Chunker 2 1 [3] 2 ( 14) PrefixSpan 61 dice 9

E9F+GH7I < J1 J2 J3 Jn E1 E2 E3 Em > "!$#%'&( )+*-'"/"02143 57698+:";2<>=>?A@"BDC3 J1 J3 E1 E4 14: 9268 2 PrefixSpan N-gram [1] Rakesh Agrawal and Ramakrishnan Srikant Fast algorithms for mining association rules In Jorge B N Bocca Matthias Jarke and Carlo Zaniolo editors Proc 20th Int Conf Very Large Data N Bases VLDB pp 487 499 Morgan Kaufmann 12 [4] 15 1994 2 PrefixSpan [2] Rakesh Agrawal and Ramakrishnan Srikant Mining 52 sequential patterns In Philip S Yu and Arbee L P N-gram 7 Chen editors Proc 11th Int Conf Data Engineering ICDE pp 3 14 IEEE Press 6 10 1995 [3] Helen Pinto and Jiawei Han and Jian Pei and Ke Wang and Qiming and Chen and Umeshwar Dayal Multi-Dimensional Sequential Pattern Mining In Proc 10th ACM International Conference on Information and Knowledge Management (CIKM 01) earliest convenience November 2001 let know [4] Mihoko Kitamura and Yuji Matsumoto Automatic thank letter Extraction of Word Sequence Correspondences in [4] Parallel Corpora In Proc 4th Workshop in Very N 10 Large Corpora pp 79 87 1996 N-gram PrefixSpan [5] Jian Pei Jiawei Han and et al Prefixspan: Mining sequential patterns by prefix-projected growth N In Proc of International Conference of Data Engineering pp 215 224 2001 7 [6] 7 SIG-FA/KBS-J22 pp 133 139 2001 [7] Authorship identification for heterogeneous documents 148 NL-148 PrefixSpan [8] Vol 42 No 8 2001