

DEIM Forum 2016 A1-1

N-gram IDF
Graduate School of Information Science and Technology, Osaka University
1-5 Yamadaoka, Suita, Osaka 565-0871, Japan
E-mail: {hirakawa.masumi,hara}@ist.osaka-u.ac.jp

1. Introduction

Inverse Document Frequency (IDF) [9] is one of the most widely used term weights in information retrieval, underlying TF-IDF [16] and Okapi BM25 [14], and its theoretical basis has been studied extensively [2], [10], [12], [13]. IDF, however, is defined only for single words. Association measures such as PMI can score multiword units, but their values are not comparable with single-word IDF. N-gram IDF [17], [18] extends IDF to N-grams of arbitrary length on a single unified scale.

Computing N-gram IDF requires, for every N-gram, the document frequency of the intersection (AND) query over its constituent words. The existing method [17], [18] answers these AND queries with wavelet trees [6] using adaptive intersection in the style of [3]; over a large collection such as Wikipedia, this computation dominates the total cost. This paper proposes a method that approximates N-gram IDF by estimating document frequencies on a random subset of the collection, derives confidence bounds on the resulting error, and evaluates the trade-off between accuracy and cost on Wikipedia and on Web search queries.

2. N-gram IDF

N-gram IDF [17], [18] combines IDF with the multiword expression distance (MED) [4]. For an N-gram g = w_1 ... w_N over a document set D, two weights are defined:

    NIDF_d(g) = log( |D| / df(w_1, ..., w_N) )                (1)

    NIDF_i(g) = log( |D| * df(g) / df(w_1, ..., w_N)^2 )      (2)
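With the two document frequencies in hand, definition (2) is a one-liner. A minimal sketch in Python (the function name is illustrative; base-2 logarithms are assumed, which is consistent with the numeric error bounds quoted in the experiments, e.g. 2 log(100/76.12) ≈ 0.79):

```python
import math

def ngram_idf(num_docs, df_phrase, df_and):
    """N-gram IDF of eq. (2): log(|D| * df(g) / df(w_1,...,w_N)^2).

    num_docs  : |D|, total number of documents
    df_phrase : df(g), documents containing the N-gram g itself
    df_and    : df(w_1,...,w_N), documents containing all its words
    """
    return math.log2(num_docs * df_phrase / df_and ** 2)

# For a 1-gram the two frequencies coincide, and the weight reduces
# to the plain IDF of eq. (1): log(|D| / df).
```

For instance, with |D| = 1000 and a single word occurring in 10 documents, ngram_idf(1000, 10, 10) equals log2(100), the ordinary IDF.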

In (1) and (2), D is the document set, df(g) is the number of documents in D that contain the N-gram g, and df(w_1, ..., w_N) is the number of documents that contain all of the words w_1, ..., w_N, i.e. the document frequency of the AND query. For a 1-gram both frequencies coincide and (2) reduces to (1). Following [17], [18], this paper uses (2).

Fig. 1: word-level suffix array over the example documents "to be or not to be" and "to live or to die"; every N-gram corresponds to a contiguous interval of the suffix array.

3. Data Structures of the Existing Method

The method of [17], [18] enumerates the N-grams of the collection as maximal substrings in the sense of [11], built over an enhanced suffix array [1], and computes df(g) and df(w_1, ..., w_N) with a wavelet tree. A wavelet tree over an alphabet S = {0, 1, ..., |S|-1} and a sequence L[0, n-1] consists of ⌈log |S|⌉ bit arrays B_k[0, n-1] (k = 1, ..., ⌈log |S|⌉): B_1 holds the most significant bit of every symbol, and B_{k+1} stores the (k+1)-th bits after stably partitioning the symbols of every level-k node into its 0-side and 1-side. With o(n)-bit auxiliary indexes, rank_b(B, i) — the number of occurrences of bit b in B[0, i-1], with rank_b(B, 0) = 0 and 0 < i <= n — is answered in O(1) time. Algorithm 1 recovers s = L[i] by reading one bit per level while maintaining the interval [p_s, p_e) of the current node in B_1, ..., B_{⌈log |S|⌉}.

Algorithm 1: Access
Input: i
Output: s = L[i]
    j = i, s = B_1[j], p_s = 0, p_e = n;
    for k = 1 to ⌈log |S|⌉ - 1 do
        p_b = p_s + rank_0(B_k, p_e) - rank_0(B_k, p_s);
        if B_k[p_s + j] == 0 then
            j = rank_0(B_k, p_s + j) - rank_0(B_k, p_s);
            p_e = p_b;
        else
            j = rank_1(B_k, p_s + j) - rank_1(B_k, p_s);
            p_s = p_b;
        end
        s = s << 1;
        s = s + B_{k+1}[p_s + j];
    end

Fig. 2: example of Access on the wavelet tree of Fig. 1: since B_1[8] = 0 and B_1[0, 7] contains five 0-bits, the query descends to the 0-side with j = rank_0(B_1, 8) = 5, where B_2[5] = 1 supplies the next bit of L[8].
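The levelwise layout that Algorithm 1 assumes can be exercised with a small Python sketch. The class name and API are illustrative, and rank uses precomputed prefix sums instead of the O(1) succinct bit-vector indexes of the paper:

```python
# Minimal levelwise wavelet tree following Algorithm 1.
class WaveletTree:
    def __init__(self, seq, sigma):
        self.n = len(seq)
        self.levels = max(1, (sigma - 1).bit_length())  # ceil(log2 sigma)
        self.B = []                      # bit array B_k per level
        cur = list(seq)
        for k in range(self.levels):
            shift = self.levels - 1 - k
            self.B.append([(x >> shift) & 1 for x in cur])
            if shift:                    # next level's order: stable
                cur = sorted(cur, key=lambda x: x >> shift)  # per-node split
        # prefix counts of 1-bits per level, for rank
        self.R = [[0] * (self.n + 1) for _ in range(self.levels)]
        for k, bits in enumerate(self.B):
            for i, b in enumerate(bits):
                self.R[k][i + 1] = self.R[k][i] + b

    def rank(self, b, k, i):
        """occurrences of bit b in B_k[0, i-1] (k is 1-based, as in the paper)."""
        ones = self.R[k - 1][i]
        return ones if b else i - ones

    def access(self, i):
        """Algorithm 1: recover L[i], one bit per level."""
        j, p_s, p_e = i, 0, self.n
        s = self.B[0][j]
        for k in range(1, self.levels):
            p_b = p_s + self.rank(0, k, p_e) - self.rank(0, k, p_s)
            if self.B[k - 1][p_s + j] == 0:
                j = self.rank(0, k, p_s + j) - self.rank(0, k, p_s)
                p_e = p_b
            else:
                j = self.rank(1, k, p_s + j) - self.rank(1, k, p_s)
                p_s = p_b
            s = (s << 1) + self.B[k][p_s + j]
        return s
```

Building WaveletTree([2, 1, 0, 3, 1, 2], 4) and calling access(i) for each i returns the original symbols, one bit per level exactly as in the listing above.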

Fig. 3: the word sequence L_W ("to be, or not to be, to live, or to die" with document boundaries), its suffix array L_SA, and the document array L_D.

Descending from level k to level k+1 works the same way in every wavelet-tree operation: a boundary p_b with p_s <= p_b <= p_e splits the node interval so that B_{k+1}[p_s, p_b - 1] corresponds to the 0-side and B_{k+1}[p_b, p_e - 1] to the 1-side of B_k[p_s, p_e - 1], and positions inside the interval are projected with rank: if B_k[p_s + j] = 0, the number of 0-bits in B_k[p_s, p_s + j - 1] gives the new offset j, and symmetrically for a 1-bit.

To count document frequencies [6], let L_W[0, n-1] be the word sequence of the whole collection, L_SA[0, n-1] its word-level suffix array, and L_D[i] the id of the document containing position L_SA[i]. Every N-gram g corresponds to a contiguous interval of L_SA, hence of L_D, so both df(g) and AND queries can be answered on a wavelet tree built over L_D. For an AND query over m words, each word j contributes its interval [i_s[j], i_e[j]) of L_D together with a required occurrence count o[j]; o[j] is 1 unless the word occurs several times in the N-gram (in the context "to be to be", the word "to" must occur twice — Fig. 4). Algorithm 2 (CountDF) descends the wavelet tree of L_D with all m intervals at once: at each level both children are tried, a child is pruned as soon as some word's projected interval holds fewer than o[j] positions (flags f_0, f_1), and every leaf that survives corresponds to one document matching the whole query, contributing 1 to df.

Algorithm 2: CountDF
Input: i_s[0, m-1], i_e[0, m-1], o[0, m-1], k = 1, p_s = 0, p_e = n
Output: df
    if k > ⌈log |D|⌉ then  // reached a leaf (one document)
        return 1;
    end
    df = 0, f_0 = true, f_1 = true;
    p_b = p_s + rank_0(B_k, p_e) - rank_0(B_k, p_s);
    init i_s0[0, m-1], i_e0[0, m-1], i_s1[0, m-1], i_e1[0, m-1];
    for j = 0 to m-1 do
        i_s0[j] = rank_0(B_k, p_s + i_s[j]) - rank_0(B_k, p_s);
        i_e0[j] = rank_0(B_k, p_s + i_e[j]) - rank_0(B_k, p_s);
        i_s1[j] = rank_1(B_k, p_s + i_s[j]) - rank_1(B_k, p_s);
        i_e1[j] = rank_1(B_k, p_s + i_e[j]) - rank_1(B_k, p_s);
        if i_e0[j] - i_s0[j] < o[j] then
            f_0 = false;
        end
        if i_e1[j] - i_s1[j] < o[j] then
            f_1 = false;
        end
    end
    if f_0 then  // descend to the 0-side
        df = CountDF(i_s0, i_e0, o, k + 1, p_s, p_b);
    end
    if f_1 then  // descend to the 1-side
        df = df + CountDF(i_s1, i_e1, o, k + 1, p_b, p_e);
    end
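Algorithm 2 can be exercised end to end with a self-contained sketch (function names and the linear-scan rank are illustrative): the document array is turned into the same levelwise bit arrays as above, and the recursion prunes a child as soon as any word's projected interval falls below its required count.

```python
def build_levels(seq, sigma):
    """Levelwise bit arrays of a wavelet tree over seq (alphabet size sigma)."""
    levels = max(1, (sigma - 1).bit_length())
    B, cur = [], list(seq)
    for k in range(levels):
        shift = levels - 1 - k
        B.append([(x >> shift) & 1 for x in cur])
        if shift:
            cur = sorted(cur, key=lambda x: x >> shift)  # stable per-node split
    return B

def rank(bits, b, lo, hi):
    """occurrences of bit b in bits[lo:hi] (linear scan for brevity)."""
    return sum(1 for x in bits[lo:hi] if x == b)

def count_df(B, iv, o, k=0, p_s=0, p_e=None):
    """Algorithm 2 over the wavelet tree of the document array.

    iv: per-word intervals [(i_s, i_e), ...], relative to the current node
    o : required occurrence count per word
    (k is 0-based here; the paper's listing is 1-based.)
    """
    if p_e is None:
        p_e = len(B[0])
    if k == len(B):              # leaf: one document satisfying every word
        return 1
    bits = B[k]
    p_b = p_s + rank(bits, 0, p_s, p_e)
    child = {0: [], 1: []}
    ok = {0: True, 1: True}
    for (i_s, i_e), need in zip(iv, o):
        for b in (0, 1):
            s = rank(bits, b, p_s, p_s + i_s)
            e = rank(bits, b, p_s, p_s + i_e)
            child[b].append((s, e))
            if e - s < need:     # some word can no longer meet o[j]
                ok[b] = False
    df = 0
    if ok[0]:
        df += count_df(B, child[0], o, k + 1, p_s, p_b)
    if ok[1]:
        df += count_df(B, child[1], o, k + 1, p_b, p_e)
    return df
```

As a toy document array, take L_D = [0, 1, 2, 1, 2, 3] where positions 0-2 form the suffix-array interval of one word (documents 0, 1, 2) and positions 3-5 that of another (documents 1, 2, 3); the AND query then reaches exactly the two leaves for documents 1 and 2.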

4. Proposed Method

With the structures of Section 3, a single AND query costs O(N α log(|D|/α)) time by adaptive intersection [3], where α is the alternation complexity, and O(N |D|) in the worst case. Since the number of N-grams grows with the collection, computing N-gram IDF for all N-grams of a Wikipedia-scale corpus approaches O(|D|^2) and becomes impractical. The proposed method therefore estimates the expensive frequencies on a random document subset and bounds the resulting error.

4.1 Estimating IDF on a Document Subset

Let D' ⊂ D be a random subset, and df'(g) the document frequency of g within D'. Assuming frequencies scale with the subset size,

    df(g) = (|D| / |D'|) * df'(g),

the IDF computed on the subset is unchanged:

    IDF(g) = log( |D| / df(g) ) = log( |D'| / df'(g) )        (3)

The observed count df'(g), however, fluctuates with the sampled subset. The fluctuation is modeled as a Poisson random variable [8],

    P_λ(X = x) = λ^x e^(-λ) / x!,   x = 0, 1, 2, ...,

evaluated in practice through the log-gamma function as

    P_λ(X = x) = exp( x log λ - λ - log Γ(x + 1) ).

Interpreting the observed count k = df'(g) as a Poisson sample, the mean λ lies in a confidence interval [λ^(L), λ^(U)] [20]; e.g. for df'(g) = 20, λ lies in [10.35, 34.67] with 99% confidence.

Table 1: two-sided confidence intervals for the Poisson mean λ given an observed count k.

    k        95%                 99%
    1        0.03 -   5.57       0.01 -   7.43
    5        1.62 -  11.67       1.08 -  14.15
    10       4.80 -  18.39       3.72 -  21.40
    20      12.22 -  30.89      10.35 -  34.67
    50      37.11 -  65.92      33.66 -  71.27
    100     81.36 - 121.63      76.12 - 128.76

The bounds on λ translate into bounds on IDF:

    IDF^(L)(g) = log( |D'| / λ^(U) )                          (4)
    IDF^(U)(g) = log( |D'| / λ^(L) )                          (5)

whose distances from (3) are

    IDF^(U)(g) - IDF(g) = log( df'(g) / λ^(L) ),
    IDF(g) - IDF^(L)(g) = log( λ^(U) / df'(g) ).

For N-gram IDF, the phrase frequency df(g) is kept exact and only the AND-query frequency df(w_1, ..., w_N) is estimated on the subset:

    NIDF_i(g) = log( |D| df(g) / df(w_1, ..., w_N)^2 )
              = log( |D| df(g) / ( (|D|/|D'|) df'(w_1, ..., w_N) )^2 )
              = log( |D'|^2 df(g) / ( |D| df'(w_1, ..., w_N)^2 ) )
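The intervals of Table 1 can be reproduced from the Poisson CDF alone, using the standard exact (Garwood-style) two-sided construction. A sketch with stdlib bisection (function names are illustrative):

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summing the log-gamma form
    P(X = x) = exp(x log lam - lam - log Gamma(x + 1))."""
    return sum(math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))
               for x in range(k + 1))

def poisson_ci(k, conf=0.99):
    """Exact two-sided confidence interval [lam_L, lam_U] for the
    Poisson mean, given an observed count k."""
    a = (1.0 - conf) / 2.0

    def solve(f, lo, hi):            # bisection; f flips from True to False
        for _ in range(100):
            mid = (lo + hi) / 2.0
            if f(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    # lam_L solves P(X >= k | lam) = a,  i.e.  1 - CDF(k-1, lam) = a
    lam_l = 0.0 if k == 0 else solve(
        lambda l: 1.0 - poisson_cdf(k - 1, l) < a, 1e-9, 10.0 * k + 10.0)
    # lam_U solves P(X <= k | lam) = a
    lam_u = solve(lambda l: poisson_cdf(k, l) > a, 1e-9, 10.0 * k + 100.0)
    return lam_l, lam_u
```

poisson_ci(20, 0.99) recovers the 99% row for k = 20 in Table 1, and poisson_ci(10, 0.95) the 95% row for k = 10.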

The subset count df'(w_1, ..., w_N) is likewise a Poisson sample with mean λ, bounded by [λ^(L), λ^(U)], which yields

    NIDF_i^(L)(g) = log( |D'|^2 df(g) / ( |D| (λ^(U))^2 ) )
    NIDF_i^(U)(g) = log( |D'|^2 df(g) / ( |D| (λ^(L))^2 ) )

with errors

    NIDF_i^(U)(g) - NIDF_i(g) = 2 log( df'(w_1, ..., w_N) / λ^(L) ),
    NIDF_i(g) - NIDF_i^(L)(g) = 2 log( λ^(U) / df'(w_1, ..., w_N) ),

i.e. the confidence interval of NIDF_i(g) is exactly twice as wide as the single-count interval of (4) and (5); NIDF_d(g) behaves like plain IDF.

4.2 Early-Terminating Document Counting

The subset estimate of 4.1 is unreliable exactly where exact counting is cheap: a rare N-gram has a small subset count with wide relative Poisson bounds, but its exact df can be computed quickly, while a frequent N-gram is expensive to count exactly but is estimated accurately. The two regimes are combined through a threshold df_p: Algorithm 3 (CountSubsetDF) extends CountDF so that the traversal counts exactly only while the accumulated count df_o stays within df_p, and prunes the remaining subtrees otherwise, bringing the per-N-gram cost down from O(N α log(|D|/α)) (worst case O(N |D|), i.e. O(|D|^2) over all N-grams) toward O(N df_p log(|D|/df_p)).

Algorithm 3: CountSubsetDF
Input: i_s[0, m-1], i_e[0, m-1], o[0, m-1], df_p, df_o = 0, k = 1, p_s = 0, p_e = n
Output: df, df_s
    if k > ⌈log |D|⌉ then  // reached a leaf (one document)
        return 1;
    end
    df = 0, df_s = 0, f_0 = true, f_1 = true;
    p_b = p_s + rank_0(B_k, p_e) - rank_0(B_k, p_s);
    init i_s0[0, m-1], i_e0[0, m-1], i_s1[0, m-1], i_e1[0, m-1];
    for j = 0 to m-1 do
        i_s0[j] = rank_0(B_k, p_s + i_s[j]) - rank_0(B_k, p_s);
        i_e0[j] = rank_0(B_k, p_s + i_e[j]) - rank_0(B_k, p_s);
        i_s1[j] = rank_1(B_k, p_s + i_s[j]) - rank_1(B_k, p_s);
        i_e1[j] = rank_1(B_k, p_s + i_e[j]) - rank_1(B_k, p_s);
        if i_e0[j] - i_s0[j] < o[j] then f_0 = false; end
        if i_e1[j] - i_s1[j] < o[j] then f_1 = false; end
    end
    if f_0 then  // descend to the 0-side
        (df, df_s) = CountSubsetDF(i_s0, i_e0, o, df_p, df_o, k + 1, p_s, p_b);
    else  // 0-side pruned
        if df_o == df_p then df_s = 2^(⌈log |D|⌉ - k + 1);
        else df_s = 2^(⌈log |D|⌉ - k);
        end
    end
    if df_o + df <= df_p then
        if f_1 then  // descend to the 1-side
            (df, df_s) = (df, df_s) + CountSubsetDF(i_s1, i_e1, o, df_p, df_o + df, k + 1, p_b, p_e);
        else  // 1-side pruned
            if df_o + df == df_p then df_s = df_s + 2^(⌈log |D|⌉ - k + 1);
            else df_s = df_s + 2^(⌈log |D|⌉ - k);
            end
        end
    end
    if k == 1 then  // top level: finalize
        if df_s <= df_p then df = |D| - df_s;
        else df = df - 1;
        end
    end
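Given the interval [λ^(L), λ^(U)] of Table 1 for the observed AND-query count, the NIDF_i bounds above are simple arithmetic. A sketch (the helper name is hypothetical; base-2 logarithms assumed, as elsewhere):

```python
import math

def nidf_bounds(n_sub, n_all, df_g, lam_l, lam_u):
    """Bounds on NIDF_i(g) = log(|D'|^2 df(g) / (|D| lambda^2)) when only
    the AND-query frequency is estimated on a subset D' of D.

    n_sub : |D'|   n_all : |D|   df_g : exact df(g)
    lam_l, lam_u : confidence bounds on the subset AND-query count
    """
    lo = math.log2(n_sub ** 2 * df_g / (n_all * lam_u ** 2))
    hi = math.log2(n_sub ** 2 * df_g / (n_all * lam_l ** 2))
    return lo, hi

# The interval width is 2 * log2(lam_u / lam_l): twice the width of the
# corresponding single-count IDF interval of eqs. (4)-(5).
```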

In Algorithm 3, df_o is the count accumulated before entering the current subtree and f_0, f_1 are the same pruning flags as in Algorithm 2. When a subtree is pruned, df_s accumulates the maximum number of documents it could still cover: in the tree of height ⌈log |D|⌉ + 1, a node at level k spans at most 2^(⌈log |D|⌉ - k + 1) leaves and one of its children at most 2^(⌈log |D|⌉ - k). The pair (df, df_s) thus separates exactly counted documents from pruned ones; for an N-gram whose document frequency is at most df_p, the returned count is exact, while more frequent N-grams only need the fact that df exceeds df_p, since they fall back to the subset estimate of Section 4.1.

The wavelet tree is built once over the full collection, so the subset D' must be addressable inside it. Document ids are remapped through an array L_R[0, |D|-1] holding a random permutation of {0, 1, ..., |D|-1} (0 <= L_R[i] < |D|), generated with the Fisher-Yates shuffle [5]; replacing every id i in L_D by L_R[i] makes D' exactly the contiguous id range [0, |D'|).

Fig. 5: remapping the document array L_D through the random permutation L_R.

5. Evaluation

The exact computation (Exact) is compared against the approximations Approx5, Approx10, Approx20, Approx50, and Approx100, where ApproxX runs Algorithm 3 with the threshold df_p = X.

5.1 Computation Time

The corpus is a 2013 English Wikipedia dump (4,379,810 articles), from which random subsets of one tenth, one hundredth, and one thousandth of the articles (Subset1/10, Subset1/100, Subset1/1000) are drawn as D'. N-grams are enumerated as maximal substrings with δ = 5, yielding tens of millions of distinct N-grams at full-Wikipedia scale. All measurements were taken on a machine with two Intel Xeon E5-2643 v2 CPUs @ 3.50 GHz (12 cores in total). On the full collection the exact O(|D|^2) computation of [17], [18] is impractical, so full-Wikipedia N-gram IDF was computed only with Approx100 (df_p = 100).
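The id remapping is a one-liner in practice; random.shuffle implements exactly the Fisher-Yates (Durstenfeld) algorithm of [5]. A sketch with an illustrative function name and a fixed seed for reproducibility:

```python
import random

def subset_remap(num_docs, subset_size, seed=0):
    """Build L_R, a random permutation of {0, ..., |D|-1}, so that the
    random subset D' becomes the contiguous id range [0, |D'|)."""
    perm = list(range(num_docs))          # L_R[0, |D|-1]
    rng = random.Random(seed)
    rng.shuffle(perm)                     # Fisher-Yates shuffle [5]
    # document i belongs to D' iff its remapped id falls below |D'|
    in_subset = [perm[i] < subset_size for i in range(num_docs)]
    return perm, in_subset
```

Every id maps to a unique new id, and exactly |D'| documents land in the subset range, so the wavelet tree built over the remapped L_D can address D' as its leftmost leaves.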

Fig. 6: distributions of the difference between approximate and exact N-gram IDF (horizontal axis from -4.0 to 4.0) for Approx20 (top row) and Approx100 (bottom row) on Subset1/10, Subset1/100, and Subset1/1000.

Most N-grams are handled at the cheap end of the bound: an N-gram whose document frequency df stays below df_p is counted exactly in O(N df log(|D|/df)) time, and such rare N-grams dominate the subsets (about 72% of subset N-grams have df < 10, against 11% on full Wikipedia), while every other N-gram is cut off after O(N df_p log(|D|/df_p)) work.

5.2 Accuracy of the Approximation

Fig. 6 plots the error of the approximate N-gram IDF against Exact for Approx20 (df_p = 20) and Approx100 (df_p = 100) on the three subsets; N-grams with df <= df_p incur no error at all. The observed errors stay within the 99% bounds of Section 4.1: for Approx20 the error is at most 2 log(20 / 10.35) ≈ 1.90 and at least -2 log(34.67 / 20) ≈ -1.58, and for Approx100 at most 2 log(100 / 76.12) ≈ 0.79 and at least -2 log(128.76 / 100) ≈ -0.73 (base-2 logarithms, cf. Table 1).

5.3 Applications

N-gram IDF has been applied to key term extraction and to Web search query segmentation [17], [18]; both tasks are repeated here with the approximate weights.

5.3.1 Key Term Extraction from Wikipedia Articles

Key terms are extracted from 1,678 Wikipedia articles with gold-standard annotations [19] by ranking N-grams with N-gram IDF, and quality is measured by R-Prec (precision at R, where R is the number of gold-standard terms per article). Table 2 shows the results.

Table 2: R-Prec of key term extraction.

    Approx5     0.358
    Approx10    0.367
    Approx20    0.376
    Approx50    0.382
    Approx100   0.384
    Exact       0.386

R-Prec increases monotonically with df_p and approaches Exact (each approximation was compared with Exact by a t-test at the 99% level); rare N-grams such as "anglo-american playing cards" are the ones affected by small thresholds.

5.3.2 Web Search Query Segmentation

Query segmentation is evaluated with the IR-based framework of Roy et al. [15]: a segmentation is scored by the retrieval quality it achieves against relevance judgments (qrels), reported as nDCG, MAP, and MRR at cutoffs 5 and 10.

Table 3: Web search query segmentation (each approximation compared with Exact by a t-test at the 95% level).

               nDCG@5  nDCG@10  MAP@5  MAP@10  MRR@5  MRR@10
    Approx5    0.725   0.739    0.899  0.892   0.580  0.590
    Approx10   0.728   0.741    0.899  0.892   0.577  0.587
    Approx20   0.730   0.742    0.901  0.894   0.583  0.592
    Approx50   0.730   0.743    0.900  0.893   0.587  0.596
    Approx100  0.729   0.741    0.899  0.893   0.580  0.589
    Exact      0.730   0.742    0.900  0.893   0.582  0.593

All settings, including the smallest threshold Approx5, perform essentially the same as Exact. Queries are segmented into relatively frequent N-grams — e.g. "larry the lawnmower tv show" into "larry the lawnmower" and "tv show" — whose document frequencies exceed df_p, so the Poisson-based subset estimates already rank them reliably; as reported in [17], [18], N-gram IDF segmentation also compares favorably with the TF-IDF-based scheme evaluated by Roy et al. [15] under their nDCG/MAP/MRR framework.

6. Conclusion

This paper addressed the cost of computing N-gram IDF on large corpora. The document frequency of the AND query behind each N-gram is estimated on a random subset of the collection with Poisson confidence bounds on the error, and is counted exactly only up to a threshold df_p by the early-terminating wavelet-tree traversal of Algorithm 3. On English Wikipedia the approximation keeps the N-gram IDF error within the predicted bounds while making full-Wikipedia computation feasible, and key term extraction and Web search query segmentation with the approximate weights perform on par with the exact ones.

Acknowledgments: This research was supported in part by a Grant-in-Aid for Scientific Research (A) (262413) and by JST.

References
[1] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch. Replacing Suffix Trees with Enhanced Suffix Arrays. Journal of Discrete Algorithms, 2(1):53-86, 2004.
[2] A. Aizawa. An Information-Theoretic Perspective of TF-IDF Measures. Information Processing and Management, 39(1):45-65, 2003.
[3] J. Barbay and C. Kenyon. Adaptive Intersection and t-Threshold Problems. In SODA, pages 390-399, 2002.
[4] F. Bu, X. Zhu, and M. Li. Measuring the Non-compositionality of Multiword Expressions. In COLING, pages 116-124, 2010.
[5] R. Durstenfeld. Algorithm 235: Random Permutation. Communications of the ACM, 7(7):420, 1964.
[6] T. Gagie, G. Navarro, and S. J. Puglisi. New Algorithms on Wavelet Trees and Applications to Information Retrieval. Theoretical Computer Science, 426-427:25-41, 2012.
[7] R. Grossi, A. Gupta, and J. S. Vitter. High-Order Entropy-Compressed Text Indexes. In SODA, pages 841-850, 2003.
[8] F. A. Haight. Handbook of the Poisson Distribution. Wiley, New York, 1967.
[9] K. S. Jones. A Statistical Interpretation of Term Specificity and its Application in Retrieval. Journal of Documentation, 28:11-21, 1972.
[10] D. Metzler. Generalized Inverse Document Frequency. In CIKM, pages 399-408, 2008.
[11] D. Okanohara and J. Tsujii. Text Categorization with All Substring Features. In SDM, pages 838-846, 2009.
[12] K. Papineni. Why Inverse Document Frequency? In NAACL, pages 1-8, 2001.
[13] S. Robertson. Understanding Inverse Document Frequency: On Theoretical Arguments for IDF. Journal of Documentation, 60(5):503-520, 2004.
[14] S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In TREC, pages 109-126, 1994.
[15] R. S. Roy, N. Ganguly, M. Choudhury, and S. Laxman. An IR-based Evaluation Framework for Web Search Query Segmentation. In SIGIR, pages 881-890, 2012.
[16] G. Salton, A. Wong, and C.-S. Yang. A Vector Space Model for Automatic Indexing. Communications of the ACM, 18(11):613-620, 1975.
[17] M. Shirakawa, T. Hara, and S. Nishio. N-gram IDF based on information distance (in Japanese). DEIM Forum, 2015.
[18] M. Shirakawa, T. Hara, and S. Nishio. N-gram IDF: A Global Term Weighting Scheme Based on Information Distance. In WWW, pages 960-970, 2015.
[19] M. Timonen. Term Weighting in Short Documents for Document Categorization, Keyword Extraction and Query Expansion. PhD thesis, University of Helsinki, 2013.
[20] G. van Belle, L. D. Fisher, P. J. Heagerty, and T. Lumley. Biostatistics: A Methodology For the Health Sciences. Wiley, New York, 2nd edition, 2004.