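DEIM Forum 2016 A1-1

N-gram IDF

565-0871, 1-5
E-mail: {shirakawa.masumi,hara}@ist.osaka-u.ac.jp

Abstract  N-gram IDF is a weighting scheme that extends IDF to N-grams. This paper computes N-gram IDF approximately from a subset of the documents and evaluates the approximation on Wikipedia and on Web search queries.

1. Introduction

Inverse Document Frequency (IDF) [9] is the term weighting scheme behind TF-IDF [16] and Okapi BM25 [14], and its theoretical grounds have been studied repeatedly [2], [10], [12], [13]. IDF is defined for single words and does not carry over directly to N-grams; measures such as PMI are used for N-grams instead. N-gram IDF [17], [18] extends IDF so that N-grams of any length can be weighted.

Computing N-gram IDF requires, for every N-gram, the number of documents that contain all of its words. This is an intersection (AND) query [17], [18], which is answered with the wavelet-tree algorithm of [6]. Because this query dominates the cost, this paper approximates N-gram IDF and evaluates the approximation on Wikipedia and on Web search queries.

2. N-gram IDF

N-gram IDF [17], [18] is derived from IDF and MED [4]. For an N-gram $g = w_1 \cdots w_N$, two variants are used in this paper:

\[ NIDF_d(g) = \log \frac{|D|}{df(w_1, \dots, w_N)} \tag{1} \]

\[ NIDF_i(g) = \log \frac{|D| \cdot df(g)}{df(w_1, \dots, w_N)^2} \tag{2} \]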
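[Figure: example on the texts "to be or not to be" and "to live or to die"; the figure body was lost in extraction]

In Equations (1) and (2), $|D|$ is the number of documents in the document set $D$, $df(g)$ is the number of documents in $D$ that contain the N-gram $g$, and $df(w_1, \dots, w_N)$ is the number of documents in $D$ that contain all of the words $w_1, \dots, w_N$ of $g$ [17], [18].

3. Existing Method

The method of [17], [18] obtains $df(w_1, \dots, w_N)$ and $df(g)$ for every N-gram $g$ in $D$. N-grams are enumerated following [11] with a threshold $\delta$, using an enhanced suffix array [1], which also yields $df(g)$. $df(w_1, \dots, w_N)$ is obtained by an AND query on a wavelet tree [7] with the algorithm of [6]. For $N$ words and $|D|$ documents this query takes $O(N \alpha \log(|D|/\alpha))$ time [3], where $\alpha$ is the alternation complexity [6].

A wavelet tree represents a sequence $L[0, n-1]$ over the alphabet $S = \{0, 1, 2, \dots, |S|-1\}$ with $\lceil \log |S| \rceil$ bit arrays $B_k[0, n-1]$ ($k = 1, \dots, \lceil \log |S| \rceil$). $B_1$ stores the first bit of every element of $L$, and $B_2$ stores the second bit after the elements have been stably partitioned by their first bit; deeper levels are built the same way. Each bit array supports rank in $O(1)$ time, where

\[ rank_b(B, i) = \begin{cases} 0 & (i = 0) \\ \text{number of bits } b \text{ in } B[0, i-1] & (0 < i \le n) \end{cases} \]

In the example of Figure 2, $L[8]$ is recovered by reading $B_1[8]$, counting the equal bits in $B_1[0, 7]$ (5 in the example), and then reading $B_2[5]$.

Algorithm 1: Access
Input: i
Output: s = L[i]
  j = i, s = B_1[j], p_s = 0, p_e = n;
  for k = 1 to ceil(log|S|) - 1 do
    p_b = p_s + rank_0(B_k, p_e) - rank_0(B_k, p_s);
    if B_k[p_s + j] == 0 then
      j = rank_0(B_k, p_s + j) - rank_0(B_k, p_s);
      p_e = p_b;
    else
      j = rank_1(B_k, p_s + j) - rank_1(B_k, p_s);
      p_s = p_b;
    end
    s = s << 1;
    s = s + B_{k+1}[p_s + j];
  end

Algorithm 1 takes a position $i$ and returns $L[i]$ by walking through $B_1, \dots, B_{\lceil \log |S| \rceil}$. At level $k$, the current node occupies $B_k[p_s, p_e - 1]$.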
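[Figure 3: example on the texts "to be, or not to be" and "to live, or to die"; the figure body was lost in extraction]

At level $k+1$, the node is split at $p_b$ ($p_s \le p_b \le p_e$): $B_{k+1}[p_s, p_b - 1]$ holds the elements whose bit in $B_k[p_s, p_e - 1]$ is 0, and $B_{k+1}[p_b, p_e - 1]$ holds those whose bit is 1. $j$ is the offset of position $i$ inside the current node. It is updated with rank at every level from $B_1$ to $B_{\lceil \log |S| \rceil}$: if $B_k[p_s + j]$ is 0, the new offset is the number of 0s in $B_k[p_s, p_s + j - 1]$, otherwise it is the number of 1s. The bits read along the way form $L[i]$ [6].

The documents in $D$ are concatenated into a word sequence $L_W[0, n-1]$, from which the set $G_v$ of N-grams is taken. $L_{SA}[0, n-1]$ is the suffix array of $L_W$, and $L_D[0, n-1]$ stores in $L_D[i]$ the ID of the document that contains $L_W[L_{SA}[i]]$. The occurrences of each N-gram $g_v \in G_v$ form a contiguous range of $L_{SA}$, and therefore of $L_D$. Figure 3 shows $L_W$, $L_{SA}$, and $L_D$; the wavelet tree is built over $L_D$.

[Figure 4: AND query on the wavelet tree for the words "to" and "be"; the figure body was lost in extraction]

An AND query is given $m$ ranges of $L_D$. For the $j$-th range, $i_s[j]$ and $i_e[j]$ are its start and end (end exclusive), and $o[j]$ is the minimum number of elements the range must contain in a document; $o[j] = 1$ when one occurrence is enough.

Algorithm 2: CountDF
Input: i_s[0, m-1], i_e[0, m-1], o[0, m-1], k = 1, p_s = 0, p_e = n
Output: df
  if k > ceil(log|D|) then            // reached a single document
    return 1;
  end
  df = 0, f_0 = true, f_1 = true;
  p_b = p_s + rank_0(B_k, p_e) - rank_0(B_k, p_s);
  init i_s0[0, m-1], i_e0[0, m-1], i_s1[0, m-1], i_e1[0, m-1];
  for j = 0 to m - 1 do
    i_s0[j] = rank_0(B_k, p_s + i_s[j]) - rank_0(B_k, p_s);
    i_e0[j] = rank_0(B_k, p_s + i_e[j]) - rank_0(B_k, p_s);
    i_s1[j] = rank_1(B_k, p_s + i_s[j]) - rank_1(B_k, p_s);
    i_e1[j] = rank_1(B_k, p_s + i_e[j]) - rank_1(B_k, p_s);
    if i_e0[j] - i_s0[j] < o[j] then
      f_0 = false;
    end
    if i_e1[j] - i_s1[j] < o[j] then
      f_1 = false;
    end
  end
  if f_0 then                          // descend into child 0
    df = CountDF(i_s0, i_e0, o, k + 1, p_s, p_b);
  end
  if f_1 then                          // descend into child 1
    df = df + CountDF(i_s1, i_e1, o, k + 1, p_b, p_e);
  end
  return df;

Algorithm 2 starts at $k = 1$ with $p_s = 0$ and $p_e = n$, that is, on the whole of $B_1[0, n-1]$. For the $(j+1)$-th range, $i_{s0}[j]$ and $i_{e0}[j]$ give the range at level $k+1$ inside child 0, and $i_{s1}[j]$ and $i_{e1}[j]$ give the range inside child 1; both are computed with rank. A child is visited only if every range still holds at least $o[j]$ elements there.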
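The data structures of this section can be tried out on a toy collection. The following is a minimal sketch, not the authors' implementation: the names `WaveletTree`, `build_index`, and `word_range` are hypothetical, suffixes are sorted naively instead of with an enhanced suffix array, and rank is answered from plain prefix sums rather than a succinct O(1) structure. `access` follows Algorithm 1 and `count_df` follows Algorithm 2.

```python
from math import ceil, log2


class WaveletTree:
    """Level-wise wavelet tree over integers in [0, sigma) (illustrative only)."""

    def __init__(self, seq, sigma):
        self.n = len(seq)
        self.levels = max(1, ceil(log2(sigma)))
        self.bits = []   # bits[k - 1] is the bit array B_k
        self.ones = []   # ones[k - 1][i] = number of 1s in B_k[0:i]
        nodes = [list(seq)]  # node contents of the current level, left to right
        for k in range(self.levels):
            shift = self.levels - 1 - k
            level, children = [], []
            for node in nodes:
                level.extend((x >> shift) & 1 for x in node)
                children.append([x for x in node if not (x >> shift) & 1])
                children.append([x for x in node if (x >> shift) & 1])
            prefix = [0]
            for b in level:
                prefix.append(prefix[-1] + b)
            self.bits.append(level)
            self.ones.append(prefix)
            nodes = children

    def rank(self, b, k, i):
        """Number of bits equal to b in B_k[0:i]; k is 1-based as in the text."""
        ones = self.ones[k - 1][i]
        return ones if b else i - ones

    def access(self, i):
        """Algorithm 1: recover L[i] from the bit arrays."""
        j, ps, pe = i, 0, self.n
        s = self.bits[0][j]
        for k in range(1, self.levels):
            pb = ps + self.rank(0, k, pe) - self.rank(0, k, ps)
            if self.bits[k - 1][ps + j] == 0:
                j = self.rank(0, k, ps + j) - self.rank(0, k, ps)
                pe = pb
            else:
                j = self.rank(1, k, ps + j) - self.rank(1, k, ps)
                ps = pb
            s = (s << 1) + self.bits[k][ps + j]
        return s

    def count_df(self, starts, ends, need, k=1, ps=0, pe=None):
        """Algorithm 2: count documents in which every range keeps >= need[j] elements."""
        if pe is None:
            pe = self.n
        if k > self.levels:
            return 1
        pb = ps + self.rank(0, k, pe) - self.rank(0, k, ps)
        df = 0
        for b, lo, hi in ((0, ps, pb), (1, pb, pe)):
            s = [self.rank(b, k, ps + x) - self.rank(b, k, ps) for x in starts]
            e = [self.rank(b, k, ps + x) - self.rank(b, k, ps) for x in ends]
            if all(ej - sj >= oj for sj, ej, oj in zip(s, e, need)):
                df += self.count_df(s, e, need, k + 1, lo, hi)
        return df


def build_index(docs):
    """Build L_W, a naively sorted suffix array L_SA, and the document array L_D."""
    words, doc_of = [], []
    for d, text in enumerate(docs):
        for w in text.split():
            words.append(w)
            doc_of.append(d)
    sa = sorted(range(len(words)), key=lambda p: words[p:])
    return words, sa, [doc_of[p] for p in sa]


def word_range(words, sa, w):
    """Half-open range of suffix-array rows whose suffix starts with the word w."""
    rows = [r for r, p in enumerate(sa) if words[p] == w]
    return rows[0], rows[-1] + 1


if __name__ == "__main__":
    docs = ["to be or not to be", "to live or to die", "not to be"]
    words, sa, l_d = build_index(docs)
    wt = WaveletTree(l_d, len(docs))
    (s1, e1), (s2, e2) = word_range(words, sa, "to"), word_range(words, sa, "be")
    print(wt.count_df([s1, s2], [e1, e2], [1, 1]))
```

The `need` list plays the role of $o[j]$; every entry is assumed to be at least 1, since a requirement of 0 would let empty nodes be counted.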
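4. Proposed Method

The method of Section 3 is costly on a collection as large as Wikipedia. For an N-gram, computing $df(w_1, \dots, w_N)$ over $D$ takes $O(N \alpha \log(|D|/\alpha))$ time, where $\alpha$ is the alternation complexity [3], and $O(N |D|)$ in the worst case. Since the number of N-grams grows as $O(|D|)$, computing N-gram IDF for all N-grams takes $O(|D|^2)$ in the worst case, against $O(|D| \log |D|)$ when $\alpha$ stays small.

4. 1 Estimation from a Subset of Documents

Let $D'$ be a subset of $D$ and let $df'(g)$ be the number of documents in $D'$ that contain the N-gram $g$. If $D'$ is drawn at random, $df'(g)$ is expected to be

\[ df'(g) = \frac{|D'|}{|D|}\, df(g), \]

so $df(g)$ can be estimated from $|D'|$ and $df'(g)$ as

\[ df(g) = \frac{|D|}{|D'|}\, df'(g). \]

The IDF estimated from $D'$ is therefore

\[ IDF'(g) = \log \frac{|D|}{df(g)} = \log \frac{|D'|}{df'(g)} \tag{3} \]

The error between $IDF'(g)$ and $IDF(g)$ is analyzed with the Poisson distribution [8]. A random variable $X$ that follows the Poisson distribution with mean $\lambda$ takes the value $x = 0, 1, 2, \dots$ with probability

\[ P_\lambda(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}, \]

which is computed with the log-gamma function as

\[ P_\lambda(X = x) = \exp\bigl( x \log \lambda - \lambda - \log \Gamma(x + 1) \bigr). \]

The number of documents in $D'$ that contain $g$ is treated as Poisson-distributed with mean $\lambda$, of which $df'(g)$ is one observation. Table 1 [20] gives the confidence interval of $\lambda$ when $k$ occurrences are observed. For example, when $df'(g) = 20$ is observed in $D'$, $\lambda$ lies between 10.35 and 34.67 with 99% confidence.

Table 1: Confidence intervals of $\lambda$ for an observed count $k$

  k      95% lower   95% upper   99% lower   99% upper
  1         0.03        5.57        0.01        7.43
  5         1.62       11.67        1.08       14.15
  10        4.80       18.39        3.72       21.40
  20       12.22       30.89       10.35       34.67
  50       37.11       65.92       33.66       71.27
  100      81.36      121.63       76.12      128.76

Let $\lambda^{(L)}$ and $\lambda^{(U)}$ be the lower and upper limits of $\lambda$ for the observed $df'(g)$. The lower and upper limits of the IDF of $g$ follow from Equation (3):

\[ IDF'^{(L)}(g) = \log \frac{|D'|}{\lambda^{(U)}} \tag{4} \]

\[ IDF'^{(U)}(g) = \log \frac{|D'|}{\lambda^{(L)}} \tag{5} \]

The differences between the estimate and its limits are

\[ IDF'(g) - IDF'^{(U)}(g) = \log \frac{|D'|}{df'(g)} - \log \frac{|D'|}{\lambda^{(L)}} = \log \frac{\lambda^{(L)}}{df'(g)} \]

\[ IDF'(g) - IDF'^{(L)}(g) = \log \frac{|D'|}{df'(g)} - \log \frac{|D'|}{\lambda^{(U)}} = \log \frac{\lambda^{(U)}}{df'(g)} \]

Since $\lambda^{(L)}$ and $\lambda^{(U)}$ are determined by $df'(g)$, the error of IDF depends only on $df'(g)$ and not on $|D'|$.

The same analysis applies to $NIDF_d(g)$ and $NIDF_i(g)$. In Equation (2), $df(g)$ is available exactly, so only $df(w_1, \dots, w_N)$ is estimated from $D'$. The estimate of $NIDF_i(g)$ is

\[ NIDF'_i(g) = \log \frac{|D| \cdot df(g)}{df(w_1, \dots, w_N)^2} = \log \frac{|D| \cdot df(g)}{\left( \frac{|D|}{|D'|}\, df'(w_1, \dots, w_N) \right)^2} = \log \frac{|D'|^2 \cdot df(g)}{|D| \cdot df'(w_1, \dots, w_N)^2} \]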
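The interval argument for $IDF'$ can be checked numerically. The sketch below assumes that the limits in Table 1 are exact two-sided Poisson limits and finds them by bisection on the Poisson CDF, using the log-gamma form of the probability given above. The function names are hypothetical, and the base-2 logarithm is an assumption made for the error values.

```python
from math import exp, lgamma, log, log2


def poisson_pmf(x, lam):
    """P_lambda(X = x) = exp(x log(lambda) - lambda - log Gamma(x + 1))."""
    return exp(x * log(lam) - lam - lgamma(x + 1))


def poisson_cdf(k, lam):
    """P_lambda(X <= k)."""
    return sum(poisson_pmf(x, lam) for x in range(k + 1))


def solve_decreasing(f, target, lo, hi, iters=200):
    """Find lam in [lo, hi] with f(lam) = target, for f decreasing in lam."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def poisson_interval(k, level):
    """Two-sided limits (lambda_L, lambda_U) of lambda for an observed count k >= 1."""
    alpha = 1 - level
    hi = 2 * k + 50  # comfortably above the upper limit for the counts used here
    lam_l = solve_decreasing(lambda lam: poisson_cdf(k - 1, lam), 1 - alpha / 2, 1e-12, hi)
    lam_u = solve_decreasing(lambda lam: poisson_cdf(k, lam), alpha / 2, 1e-12, hi)
    return lam_l, lam_u


def idf_error_bounds(df_sub, level=0.99):
    """(IDF' - IDF'^(U), IDF' - IDF'^(L)) for an observed df'(g) = df_sub."""
    lam_l, lam_u = poisson_interval(df_sub, level)
    return log2(lam_l / df_sub), log2(lam_u / df_sub)


if __name__ == "__main__":
    for k in (1, 5, 10, 20, 50, 100):
        print(k, poisson_interval(k, 0.95), poisson_interval(k, 0.99))
    print(idf_error_bounds(20))
```

Neither bound takes $|D'|$ as an argument, which mirrors the observation that the error depends only on $df'(g)$.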
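Here $df'(w_1, \dots, w_N)$ is the number of documents in $D'$ that contain all of the words $w_1, \dots, w_N$ of the N-gram $g$. As with $df'(g)$, $df'(w_1, \dots, w_N)$ is treated as an observation from a Poisson distribution with mean $\lambda$, and $\lambda^{(L)}$ and $\lambda^{(U)}$ are the limits of $\lambda$ for the observed $df'(w_1, \dots, w_N)$. The limits of $NIDF'_i(g)$ are

\[ NIDF'^{(L)}_i(g) = \log \frac{|D'|^2 \cdot df(g)}{|D| \cdot (\lambda^{(U)})^2}, \qquad NIDF'^{(U)}_i(g) = \log \frac{|D'|^2 \cdot df(g)}{|D| \cdot (\lambda^{(L)})^2} \]

and the differences between the estimate and its limits are

\[ NIDF'_i(g) - NIDF'^{(U)}_i(g) = \log \frac{(\lambda^{(L)})^2}{df'(w_1, \dots, w_N)^2} = 2 \log \frac{\lambda^{(L)}}{df'(w_1, \dots, w_N)} \]

\[ NIDF'_i(g) - NIDF'^{(L)}_i(g) = \log \frac{(\lambda^{(U)})^2}{df'(w_1, \dots, w_N)^2} = 2 \log \frac{\lambda^{(U)}}{df'(w_1, \dots, w_N)} \]

The error of $NIDF'_i(g)$ is thus twice that of $IDF'(g)$ and $NIDF'_d(g)$.

4. 2 Approximate Computation of N-gram IDF

Rather than fixing one subset for all N-grams, the subset is chosen per N-gram: counting stops once $df_p$ matching documents have been found. The counting returns $df'$, the number of matching documents, and $ndf'$, the number of non-matching documents, so that $|D'| = df' + ndf'$ (the original symbol for the second count was lost in extraction; $ndf'$ is used here). Document IDs are $D = \{0, 1, 2, \dots, |D|-1\}$. The exact query costs $O(N \alpha \log(|D|/\alpha))$ per N-gram. With early termination $\alpha$ is bounded by $df_p$, so the total cost drops from $O(|D|^2)$ to $O(|D|\, df_p \log(|D|/df_p))$. Algorithm 3 extends Algorithm 2 with $df_p$.

Algorithm 3: CountSubsetDF
Input: i_s[0, m-1], i_e[0, m-1], o[0, m-1], df_p, df_o = 0, k = 1, p_s = 0, p_e = n
Output: df', ndf'
  if k > ceil(log|D|) then             // reached a single matching document
    return (1, 0);
  end
  df' = 0, ndf' = 0, f_0 = true, f_1 = true;
  p_b = p_s + rank_0(B_k, p_e) - rank_0(B_k, p_s);
  init i_s0[0, m-1], i_e0[0, m-1], i_s1[0, m-1], i_e1[0, m-1];
  for j = 0 to m - 1 do
    i_s0[j] = rank_0(B_k, p_s + i_s[j]) - rank_0(B_k, p_s);
    i_e0[j] = rank_0(B_k, p_s + i_e[j]) - rank_0(B_k, p_s);
    i_s1[j] = rank_1(B_k, p_s + i_s[j]) - rank_1(B_k, p_s);
    i_e1[j] = rank_1(B_k, p_s + i_e[j]) - rank_1(B_k, p_s);
    if i_e0[j] - i_s0[j] < o[j] then
      f_0 = false;
    end
    if i_e1[j] - i_s1[j] < o[j] then
      f_1 = false;
    end
  end
  if f_0 then                           // descend into child 0
    (df', ndf') = CountSubsetDF(i_s0, i_e0, o, df_p, df_o, k + 1, p_s, p_b);
  else                                  // child 0 is pruned
    if df_o == df_p then
      ndf' = 2^(ceil(log|D|) - k - 1);
    else
      ndf' = 2^(ceil(log|D|) - k);
    end
  end
  if df_o + df' <= df_p then
    if f_1 then                         // descend into child 1
      (df', ndf') = (df', ndf') + CountSubsetDF(i_s1, i_e1, o, df_p, df_o + df', k + 1, p_b, p_e);
    else                                // child 1 is pruned
      if df_o + df' == df_p then
        ndf' = ndf' + 2^(ceil(log|D|) - k - 1);
      else
        ndf' = ndf' + 2^(ceil(log|D|) - k);
      end
    end
  end
  if k == 1 then
    if df' <= df_p then
      ndf' = |D| - df';
    else
      df' = df' - 1;
    end
  end
  return (df', ndf');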
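Leaving the wavelet tree aside, the idea behind Algorithm 3 (scan documents in random ID order, stop once $df_p$ matches are found, and plug $df'$ and $|D'|$ into the estimate of $NIDF_i$) can be illustrated with a linear scan. This is a deliberately simplified sketch under stated assumptions. Documents are modeled as sets of words, so the occurrence counts $o[j]$ are ignored. The scan stops exactly at the $df_p$-th match, whereas Algorithm 3 fixes the boundary of $D'$ on the tree. The function names and the base-2 logarithm are hypothetical choices.

```python
import random
from math import log2


def estimate_subset_df(doc_word_sets, words, df_p, seed=0):
    """Scan documents in random order until df_p of them contain all words.

    Returns (df_sub, d_sub): the number of matching documents found and the
    number of documents scanned, i.e. stand-ins for df' and |D'|.
    """
    order = list(range(len(doc_word_sets)))
    random.Random(seed).shuffle(order)  # random renumbering of document IDs
    matched = 0
    for scanned, d in enumerate(order, start=1):
        if all(w in doc_word_sets[d] for w in words):
            matched += 1
            if matched == df_p:
                return matched, scanned
    # Fewer than df_p matches exist: every document was scanned, the count is exact.
    return matched, len(order)


def approx_nidf_i(df_g, df_sub, d_sub, d_all):
    """NIDF'_i(g) = log(|D'|^2 * df(g) / (|D| * df'(w_1..w_N)^2))."""
    return log2(d_sub ** 2 * df_g / (d_all * df_sub ** 2))


if __name__ == "__main__":
    docs = [{"to", "be", "or", "not"}, {"to", "live", "or", "die"}, {"not", "to", "be"}]
    df_sub, d_sub = estimate_subset_df(docs, ["to", "be"], df_p=1)
    print(df_sub, d_sub, approx_nidf_i(df_g=2, df_sub=df_sub, d_sub=d_sub, d_all=len(docs)))
```

When every document is scanned, `d_sub` equals `d_all` and `approx_nidf_i` reduces to Equation (2).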
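In Algorithm 3, $df_p$ is the number of matching documents to collect and $df_o$ is the number of matching documents already found before the current subtree. As in Algorithm 2, child 0 and child 1 are pruned when $f_0$ or $f_1$ is false. The tree over $D = \{0, 1, 2, \dots, |D|-1\}$ has $\lceil \log |D| \rceil + 1$ levels, so a child of a node at level $k$, which lies at level $k+1$, covers $2^{\lceil \log |D| \rceil - k}$ document IDs. A pruned child adds this number to $ndf'$, or half of it ($2^{\lceil \log |D| \rceil - k - 1}$) when $df_o + df' = df_p$, that is, when $df_p$ matches have already been found. The search continues while $df_o + df' \le df_p$. At the root, if no more than $df_p$ matching documents exist, all of $D$ has been examined and $ndf' = |D| - df'$. Otherwise the extra match is discarded and $df' = df_p$.

Algorithm 3 assumes that document IDs are assigned in random order. For $D = \{0, 1, 2, \dots, |D|-1\}$, a random permutation $L_R[0, |D|-1]$ with $0 \le i < |D|$ and $0 \le L_R[i] < |D|$ is generated with the Fisher-Yates shuffle [5], and every document ID $i$ in $L_D$ is replaced with $L_R[i]$.

[Figure 5: axis ticks and caption lost in extraction]

5. Evaluation

The exact computation (Exact) is compared with the approximations Approx100, Approx50, Approx20, Approx10, and Approx5, where the number is $df_p$.

5. 1 Computation Time of N-gram IDF

Numeric values in this subsection are reproduced as extracted. The digit 0 was dropped during extraction, so some of them are incomplete.

The data is a 2013 Wikipedia dump (date extracted as "213 1 1") with 4,379,81 articles. Three random subsets of Wikipedia, Subset1/1000, Subset1/100, and Subset1/10, are used in addition to the full collection. $df_p$ is set to 5, 10, 20, 50, and 100, giving Approx5, Approx10, Approx20, Approx50, and Approx100, which are compared with Exact. The N-gram threshold is $\delta = 5$. The numbers of N-grams for which N-gram IDF is computed are extracted as 18,261, 92,378, 8,694,915, and 87,491,762 for the three subsets and the full Wikipedia.

Because the exact method is $O(|D|^2)$ in $|D|$, it is expensive on the full Wikipedia [17], [18]. The machine has 6GB of memory and two Intel(R) Xeon(R) E5-2643 v2 @ 3.5GHz CPUs (12 cores in total); a count of 39,61,779 N-grams is reported in this context. Figure 5 shows the computation time, including that of Approx100 ($df_p = 100$) on the full Wikipedia.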
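[Figure 6: six histograms of the error of the approximated N-gram IDF, one each for Approx20 and Approx100 on Subset1/1000, Subset1/100, and Subset1/10; horizontal axis from -4.0 to 4.0, vertical axis counts]

The exact computation of N-gram IDF takes $O(|D| \alpha \log(|D|/\alpha))$, and the approximation bounds $\alpha$ by $df_p$ (Figure 5). The gain depends on how many N-grams in $D$ reach $df_p$: for an N-gram with $df < df_p$ the result is exact and the cost is $O(N\, df \log(|D|/df))$ rather than $O(N\, df_p \log(|D|/df_p))$. The extracted text reports that 72% of the N-grams in one subset and 11% of those in the full Wikipedia fall below the threshold (extracted as "df < 1", digits possibly lost). The average $N$ of the N-grams is extracted as 1.87, 2.39, 2.93, and 3.5 for the three subsets and the full Wikipedia, and the cost $O(N\, df_p \log(|D|/df_p))$ per N-gram grows with $N$.

5. 2 Error of the Approximation

Using the setting of Section 5. 1, Figure 6 shows the distribution of the error of N-gram IDF for Approx20 ($df_p = 20$) and Approx100 ($df_p = 100$) on Subset1/1000, Subset1/100, and Subset1/10. N-grams with $df \le df_p$ are computed exactly. The spread in Figure 6 agrees with the bound of Section 4. 1. For Approx20, Table 1 gives a 99% interval from $2 \log(10.35/20) \approx -1.90$ to $2 \log(34.67/20) \approx 1.58$. For Approx100, the interval runs from $2 \log(76.12/100) \approx -0.79$ to $2 \log(128.76/100) \approx 0.73$.

5. 3 Applications of N-gram IDF

Following [17], [18], the approximated N-gram IDF is evaluated on two tasks, one on Wikipedia and one on Web search queries.

5. 3. 1 Key Phrase Extraction on Wikipedia

Key phrases are extracted from Wikipedia articles by weighting N-grams with N-gram IDF, with reference to [19]. Dataset statistics are extracted as 1,678, 6.2, 291, and 86.7; their labels were lost. The measure is R-Prec, the precision of the top $R$ extracted N-grams, where $R$ is the number of correct key phrases. Table 2 shows R-Prec for each $df_p$. An example N-gram is "anglo american playing card". R-Prec rises with $df_p$ and approaches that of Exact.

Table 2: R-Prec of key phrase extraction

  Method      R-Prec
  Approx5     .358
  Approx10    .367
  Approx20    .376
  Approx50    .382
  Approx100   .384
  Exact       .386

The original table marks values that differ significantly from Exact under a t-test at the 99% level; the markers did not survive extraction.

5. 3. 2 Web Query Segmentation

Web search queries are segmented with N-gram IDF on the dataset of Roy et al. [15], which is extracted as containing 13,959 Web documents. Table 3 shows the results; the setting follows [17], [18].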
Table 3: Results of Web query segmentation

  Method      nDCG@5   nDCG@10   MAP@5   MAP@10   MRR@5   MRR@10
  Approx5     .725     .739      .899    .892     .58     .59
  Approx10    .728     .741      .899    .892     .577    .587
  Approx20    .73      .742      .91     .894     .583    .592
  Approx50    .73      .743      .9      .893     .587    .596
  Approx100   .729     .741      .899    .893     .58     .589
  Exact       .73      .742      .9      .893     .582    .593

Values are reproduced as extracted; entries with fewer than three digits lost a digit 0 in extraction. The original table marks values that differ significantly from Exact under a t-test at the 95% level; the markers did not survive extraction.

Relevance judgments (Qrels) are graded. A segmentation example is the Web query "larry the lawnmower tv show", which is segmented into "larry the lawnmower" and "tv show". Following the framework of Roy et al. [15], documents are retrieved for the segmented Web queries with TF-IDF, and the rankings are evaluated with nDCG, MAP, and MRR at ranks 5 and 10; MAP and MRR use thresholds on the Qrel grades. On Web queries, even $df_p = 5$ gives results close to Exact, so the choice of $df_p$ has little effect on this task.

6. Conclusion

This paper addressed the cost of computing N-gram IDF and computed it approximately from a subset of the documents, chosen per N-gram by the threshold $df_p$. Experiments on Wikipedia measured the computation time and the error of the approximated N-gram IDF, and key phrase extraction on Wikipedia and Web query segmentation were used to assess its effect on applications.

Acknowledgments  (A)(262413), JST.

References

[1] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch. Replacing Suffix Trees with Enhanced Suffix Arrays. Journal of Discrete Algorithms, 2(1):53-86, 2004.
[2] A. Aizawa. An Information-Theoretic Perspective of TF-IDF Measures. Information Processing and Management, 39(1):45-65, 2003.
[3] J. Barbay and C. Kenyon. Adaptive Intersection and t-Threshold Problems. In SODA, pages 390-399, 2002.
[4] F. Bu, X. Zhu, and M. Li. Measuring the Non-compositionality of Multiword Expressions. In COLING, pages 116-124, 2010.
[5] R. Durstenfeld. Algorithm 235: Random Permutation. Communications of the ACM, 7(7):420, 1964.
[6] T. Gagie, G. Navarro, and S. J. Puglisi. New Algorithms on Wavelet Trees and Applications to Information Retrieval. Theoretical Computer Science, 426-427:25-41, 2012.
[7] R. Grossi, A. Gupta, and J. S. Vitter. High-Order Entropy-Compressed Text Indexes. In SODA, pages 841-850, 2003.
[8] F. A. Haight. Handbook of the Poisson Distribution. Wiley, New York, 1967.
[9] K. S. Jones. A Statistical Interpretation of Term Specificity and its Application in Retrieval. Journal of Documentation, 28:11-21, 1972.
[10] D. Metzler. Generalized Inverse Document Frequency. In CIKM, pages 399-408, 2008.
[11] D. Okanohara and J. Tsujii. Text Categorization with All Substring Features. In SDM, pages 838-846, 2009.
[12] K. Papineni. Why Inverse Document Frequency? In NAACL, pages 1-8, 2001.
[13] S. Robertson. Understanding Inverse Document Frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5):503-520, 2004.
[14] S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In TREC, pages 109-126, 1994.
[15] R. S. Roy, N. Ganguly, M. Choudhury, and S. Laxman. An IR-based Evaluation Framework for Web Search Query Segmentation. In SIGIR, pages 881-890, 2012.
[16] G. Salton, A. Wong, and C.-S. Yang. A Vector Space Model for Automatic Indexing. Communications of the ACM, 18(11):613-620, 1975.
[17] (Authors and title in Japanese, lost in extraction; surviving fragments: "IDF", "N-gram", "7".) 2015.
[18] M. Shirakawa, T. Hara, and S. Nishio. N-gram IDF: A Global Term Weighting Scheme Based on Information Distance. In WWW, pages 960-970, 2015.
[19] M. Timonen. Term Weighting in Short Documents for Document Categorization, Keyword Extraction and Query Expansion. PhD thesis, University of Helsinki, 2013.
[20] G. van Belle, L. D. Fisher, P. J. Heagerty, and T. Lumley. Biostatistics: A Methodology For the Health Sciences. Wiley, New York, 2nd edition, 2004.