1,a) 1 2 10 1. [6] [1], [6], [8], [10], [11] 2 n n+1 C 2 $O(n^2)$
1 153-8505 4-6-1
a) kaji@tkl.iis.u-tokyo.ac.jp
[10] [19], [23] [6] [6] (3 ) 10 (1) (2) 3
© 2012 Information Processing Society of Japan
[Figure 1: lattice from node b through characters C_1 ... C_6 to node e]

[10] [15], [16], [17] [6] 10

2. Jiang [6]

2.1 [6], [10] (Figure 1) ( 1 ) b e ( 1 ) b e

2.2 Jiang [6] (word lattice reranking) 2

 1: T ← set of POS tags
 2: E ← ∅
 3: for i = 1 ... n do
 4:   C ← ∅
 5:   for l = 1 ... min(i, K) do
 6:     w ← c_{i-l+1} c_{i-l+2} ... c_i
 7:     for t ∈ T do
 8:       C ← C ∪ {(w, t)}
 9:     end for
10:   end for
11:   E ← E ∪ (top k candidates in C)
12: end for
13: return E
Figure 2

[2] [5]

$$\hat{y} = \operatorname*{argmax}_{y \in L(x)} \mathrm{Score}(x, y), \tag{1}$$

$L(x)$ $x$ $\hat{y}$ $\mathrm{Score}(x, y)$ $y \in L(x)$ $L(x)$ $O(n^2)$ $\hat{y}$

2.3 Jiang [6] [6] Jiang ( E)
(Figure 2) i $c_i$ $c_i$ (i.e., $w = c_{i-l+1} c_{i-l+2} \ldots c_i$ t ) C (lines 5-10) l K (line 5) C k E $c_i$ E K $O(n^2)$ (line 5) K K Jiang [6] K

3. (Figure 3) x W (line 2) w W (line 4) E W [15], [16], [20] $O(n^2)$ *1 3.1 3.2 3.3

1: E ← ∅
2: W ← WordGenerator(x)
3: for w ∈ W do
4:   T ← PosTagGenerator(x, w)
5:   for t ∈ T do
6:     E ← E ∪ {(w, t)}
7:   end for
8: end for
9: return E
Figure 3

3.1 x W (Figure 3, line 2) [16] :

$$\hat{b} = \operatorname*{argmax}_{b} \Lambda_w \cdot F_w(x, b)$$

$b = b_1 \ldots b_n$ $b_i$ B I 2 $c_i$ $\Lambda_w$ $F_w(x, b)$ $\Lambda_w$ [3] 1 2-gram 2 $c_i$ n-gram n-gram Neubig [15] α W :

$$W = \bigcup_{i = 1 \ldots \alpha} W_i$$

$W_i$ i α

3.2 w T (Figure 3, line 4) β T x w t [15]: $\Lambda_t \cdot F_t(x, w, t)$ $F_t(x, w, t)$ () ( 2) w t 2 $\Lambda_t$ β α

*1
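As an illustration of how the two lattice constructions differ, the following Python sketch implements a window-based loop in the style of Figure 2 and a generator-based loop in the style of Figure 3. It is a minimal sketch under stated assumptions, not the authors' implementation: the `score` callback, the `(start, end, tag)` edge encoding, and the generator callbacks are hypothetical names introduced here, and selecting the top k candidates per end position by a local score is assumed from the pseudocode.

```python
from typing import Callable, Iterable, List, Set, Tuple

# An edge (start, end, tag) stands for the word x[start:end] with POS tag `tag`.
Edge = Tuple[int, int, str]


def window_lattice(x: str, tags: List[str], K: int, k: int,
                   score: Callable[[str, int, int, str], float]) -> Set[Edge]:
    """Window-based construction in the style of Figure 2.

    For every end position, all substrings of length <= K are paired with
    every tag, and the k best-scoring pairs are added to the lattice.
    `score` is a hypothetical local scoring function (an assumption).
    """
    edges: Set[Edge] = set()
    for i in range(1, len(x) + 1):           # i: end position, 1-based
        cands: List[Edge] = []
        for l in range(1, min(i, K) + 1):    # l: word length, capped at K
            for t in tags:
                cands.append((i - l, i, t))
        cands.sort(key=lambda e: score(x, *e), reverse=True)
        edges.update(cands[:k])              # keep the k best ending at i
    return edges


def generator_lattice(
        x: str,
        word_generator: Callable[[str], Iterable[Tuple[int, int]]],
        pos_tag_generator: Callable[[str, int, int], Iterable[str]],
) -> Set[Edge]:
    """Generator-based construction in the style of Figure 3.

    `word_generator` proposes word spans for x and `pos_tag_generator`
    proposes tags for each span; both are hypothetical stand-ins for the
    WordGenerator and PosTagGenerator steps of the pseudocode.
    """
    edges: Set[Edge] = set()
    for start, end in word_generator(x):
        for t in pos_tag_generator(x, start, end):
            edges.add((start, end, t))
    return edges
```

In the window-based loop the number of scored candidates grows with K and with the size of the tag set, whereas the generator-based loop only touches the spans and tags that the two generators propose.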
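n-gram: $\langle c_{i-1} \rangle$, $\langle c_i \rangle$, $\langle c_{i+1} \rangle$, $\langle c_{i-2}, c_{i-1} \rangle$, $\langle c_{i-1}, c_i \rangle$, $\langle c_i, c_{i+1} \rangle$, $\langle c_{i+1}, c_{i+2} \rangle$, $\langle c_{i-3}, c_{i-2}, c_{i-1} \rangle$, $\langle c_{i-2}, c_{i-1}, c_i \rangle$, $\langle c_{i-1}, c_i, c_{i+1} \rangle$, $\langle c_i, c_{i+1}, c_{i+2} \rangle$, $\langle c_{i+1}, c_{i+2}, c_{i+3} \rangle$
n-gram: $\langle c_{i-1} \rangle$, $\langle c_i \rangle$, $\langle c_{i+1} \rangle$, $\langle c_{i-2}, c_{i-1} \rangle$, $\langle c_{i-1}, c_i \rangle$, $\langle c_i, c_{i+1} \rangle$, $\langle c_{i+1}, c_{i+2} \rangle$, $\langle c_{i-3}, c_{i-2}, c_{i-1} \rangle$, $\langle c_{i-2}, c_{i-1}, c_i \rangle$, $\langle c_{i-1}, c_i, c_{i+1} \rangle$, $\langle c_i, c_{i+1}, c_{i+2} \rangle$, $\langle c_{i+1}, c_{i+2}, c_{i+3} \rangle$
l, r, i, l, s, r, s, i, s
Table 1 $c_i$ $c_i$ (1) (2) (3) (4) (5) (6) 6 $c_{i-1}$ $c_{i+1}$ r( l) ( ) i s (1 2 3 4 5 )

w, t
$\langle \mathrm{length}(w), t \rangle$
$\langle c_i, t \rangle$, $\langle c_i, c_{i+1}, t \rangle$, $\langle c_{j-1}, t \rangle$, $\langle c_{j-2}, c_{j-1}, t \rangle$
$\langle c_{i-1}, t \rangle$, $\langle c_{i-2}, c_{i-1}, t \rangle$, $\langle c_{i-3}, c_{i-2}, c_{i-1}, t \rangle$, $\langle c_j, t \rangle$, $\langle c_j, c_{j+1}, t \rangle$, $\langle c_j, c_{j+1}, c_{j+2}, t \rangle$
d, d, t
Table 2 $w = c_i c_{i+1} \ldots c_{j-1}$ t length(w) w 1 2 3 4 5 d w t

3.3 $O(n)$ W $O(n)$ $W = \bigcup_{i = 1 \ldots \alpha} W_i$ $W_i$ $\alpha n$ $i = 1 \ldots \alpha$ $O(|W|) = O(n)$ Figure 3 (line 3) $O(n)$ (line 5) n $O(n)$ ( $O(|E|) = O(n)$ ) α n $O(n)$

4. Huang [5] $\mathrm{Score}(x, y)$ $\hat{y}$

$$\hat{y} = \operatorname*{argmax}_{y \in L(x)} \mathrm{Score}(x, y) = \operatorname*{argmax}_{y \in L(x)} \Lambda \cdot F(x, y)$$

$F(x, y)$ Λ

4.1 1 (Nakagawa [13] BIES 4 ) c i,c i+1,b c i,i c i 2-gram c i c i+1 c i 1-gram c i 2 2-gram

4.2 Λ 2 $L(x)$ $L(x)$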
KC       30,608   4,028   3,764
KNBC      3,453     385     348
BCCWJ    47,547   6,144   5,741
Table 3

3 Λ (3.1 3.2 ) 10 9 1

5. 5.1 5.2 5.3 5.4 5.5

5.1 version 4.0 [11] NTT version 1.0 [4] [12] 3 KC KNBC BCCWJ (Table 3) JUMAN version 7.0 *2 UniDic version 1.3.12 *3 KC KNBC JUMAN BCCWJ UniDic

5.2 2 4 Jiang [6] K 20 K=5, 10, 20 3

5.3 $k \in \{2, 4, 8, 16, \ldots, 256\}$ θ% $\{(\alpha, \beta) \mid \alpha, \beta \in \{2, 4, 8, \ldots, 256\}\}$ (α, β) θ KC BCCWJ 99 KNBC 97 KNBC 99%

5.4 4 Time #Cand 1 ( ) $F_1$ Size 1 *4 2 6 w W (Figure 3, line 2) $F_1$ (p < 0.01) bootstrap resampling 1000 $F_1$ $F_1$ 4 [18], [22] 10 30

*2 http://nlp.ist.i.kyoto-u.ac.jp/en/index.php?nlpresources
*3 http://www.tokuteicorpus.jp/dist
*4
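The significance test mentioned above (bootstrap resampling with 1000 samples, p < 0.01) can be sketched as follows. The exact procedure used in the experiments is not recoverable from the text, so this is only one common variant, paired bootstrap resampling over test sentences. The function names, the representation of an analysis as a set of `(start, end, tag)` items, and the micro-averaged F_1 are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Set, Tuple

Item = Tuple[int, int, str]                    # (start, end, tag), assumed encoding
Sentence = Tuple[Set[Item], Set[Item], Set[Item]]  # (gold, system A, system B)


def f1(pairs: Sequence[Tuple[Set[Item], Set[Item]]]) -> float:
    """Micro-averaged F_1 over (gold, predicted) item sets."""
    tp = sum(len(g & p) for g, p in pairs)
    if tp == 0:
        return 0.0
    n_gold = sum(len(g) for g, _ in pairs)
    n_pred = sum(len(p) for _, p in pairs)
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def paired_bootstrap(data: List[Sentence], samples: int = 1000,
                     seed: int = 0) -> float:
    """Estimate how often system A fails to beat system B.

    Each round draws len(data) sentences with replacement and compares the
    F_1 of both systems on the same draw. The returned fraction of rounds
    in which A is not strictly better serves as an approximate p-value.
    """
    rng = random.Random(seed)
    not_better = 0
    for _ in range(samples):
        draw = [rng.choice(data) for _ in data]
        f1_a = f1([(g, a) for g, a, _ in draw])
        f1_b = f1([(g, b) for g, _, b in draw])
        if f1_a <= f1_b:
            not_better += 1
    return not_better / samples
```

Under this sketch, a difference would be reported as significant at p < 0.01 when `paired_bootstrap(data)` returns a value below 0.01.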
Table 4
Corpus            Time           #Cand.   F_1     Size
KC       (K=5)    20     22      212      97.25   356
KC       (K=10)   31     32      400      97.92   88.9
KC       (K=20)   62     63      702      97.94   88.9
KC                1.8    2.6     30.4     97.94   60.8
KNBC     (K=5)    1.2    1.3     137      92.72   235
KNBC     (K=10)   2.0    2.1     250      93.48   235
KNBC     (K=20)   3.6    3.8     413      93.42   235
KNBC              0.12   0.18    24.8     93.92   99.2
BCCWJ    (K=5)    25     26      163      97.33   276
BCCWJ    (K=10)   43     44      301      98.08   69.0
BCCWJ    (K=20)   88     90      516      98.18   69.0
BCCWJ             2.3    3.1     23.3     98.10   46.6

[Figure 4: three panels (KC, KNBC, BCCWJ), vertical axis in %, curves labeled (K=5) and (K=20)]

(#Cand) $F_1$ K K $F_1$ (K=5) K=5 $F_1$ $F_1$ K $F_1$ (K=20) n n+1 *5 ( 4) $k \in \{1, 2, 4, 8, 16\}$ K 10 20 K=10 2 $\alpha = 32$ $\beta \in \{1, 2, 4, 8, 16\}$ 2 4 3.3 W 5

5.5 JUMAN *6 MeCab [10] *7 KyTea [15] *8 5 3 $F_1$

*5
*6 http://nlp.ist.i.kyoto-u.ac.jp/en/index.php?juman
*7 http://code.google.com/p/mecab
*8 http://www.phontron.com/kytea
[Figure 5: three panels (KC, KNBC, BCCWJ)]

Table 5 F_1
         KC      KNBC    BCCWJ
JUMAN    95.37   93.85   N/A
MeCab    95.45   91.60   96.31
KyTea    96.95   90.91   97.10
         97.94   93.92   98.10

$F_1$ bootstrap resampling $F_1$ (p < 0.01) 4 KNBC JUMAN $F_1$ 1400 / JUMAN MeCab KyTea 2100 / 29000 / 3200 / $F_1$

6. Nakagawa [14] Kruengkrai [9] Zhang [21], [22] [23] (= )

7. 10 [7], [24]

[1] Asahara, M. and Matsumoto, Y.: Extended Models and Tools for High-performance Part-of-speech Tagger, Proceedings of COLING, pp. 21-27 (2000).
[2] Collins, M.: Discriminative Reranking for Natural Language Parsing, Proceedings of ICML, pp. 175-182 (2000).
[3] Collins, M.: Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, Proceedings of EMNLP, pp. 1-8 (2002).
[4] Hashimoto, C., Kurohashi, S., Kawahara, D., Shinzato, K. and Nagata, M.: Construction of a Blog Corpus with Syntactic, Anaphoric, and Semantic Annotations (In Japanese), Journal of Natural Language Processing, Vol. 18, No. 2, pp. 175-201 (2011).
[5] Huang, L.: Forest Reranking: Discriminative Parsing with Non-local Features, Proceedings of ACL, pp. 586-594 (2008).
[6] Jiang, W., Mi, H. and Liu, Q.: Word Lattice Reranking for Chinese Word Segmentation and Part-of-Speech Tagging, Proceedings of COLING, pp. 385-392 (2008).
[7] Kaji, N. and Kitsuregawa, M.: Splitting Noun Compounds via Monolingual and Bilingual Paraphrasing: A Study on Japanese Katakana Words, Proceedings of EMNLP, pp. 959-969 (2011).
[8] Kruengkrai, C., Sornlertlamvanich, V. and Isahara, H.: A Conditional Random Field Framework for Thai Morphological Analysis, Proceedings of LREC, pp. 2419-2424 (2006).
[9] Kruengkrai, C., Uchimoto, K., Kazama, J., Wang, Y., Torisawa, K. and Isahara, H.: An Error-Driven Word-Character Hybrid Model for Joint Chinese Word Segmentation and POS Tagging, Proceedings of ACL, pp. 513-521 (2009).
[10] Kudo, T., Yamamoto, K. and Matsumoto, Y.: Applying Conditional Random Fields to Japanese Morphological Analysis, Proceedings of EMNLP, pp. 230-237 (2004).
[11] Kurohashi, S. and Nagao, M.: Building a Japanese Parsed Corpus while Improving the Parsing System, Proceedings of LREC, pp. 719-724 (1998).
[12] Maekawa, K.: Balanced corpus of contemporary written Japanese, Proceedings of the 6th Workshop on Asian Language Resources, pp. 101-102 (2008).
[13] Nakagawa, T.: Chinese and Japanese Word Segmentation Using Word-Level and Character-Level Information, Proceedings of COLING, pp. 466-472 (2004).
[14] Nakagawa, T. and Uchimoto, K.: A Hybrid Approach to Word Segmentation and POS Tagging, Proceedings of ACL, Demo and Poster Sessions, pp. 217-220 (2007).
[15] Neubig, G., Nakata, Y. and Mori, S.: Pointwise Prediction for Robust Adaptable Japanese Morphological Analysis, Proceedings of ACL, pp. 529-533 (2011).
[16] Peng, F., Feng, F. and McCallum, A.: Chinese Segmentation and New Word Detection using Conditional Random Fields, Proceedings of COLING, pp. 562-568 (2004).
[17] Shi, Y. and Wang, M.: A Dual-layer CRFs Based Joint Decoding Method for Cascaded Segmentation and Labeling Tasks, Proceedings of IJCAI, pp. 1707-1712 (2007).
[18] Sun, W.: A Stacked Sub-Word Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging, Proceedings of ACL, pp. 1385-1394 (2011).
[19] Uchimoto, K., Sekine, S. and Isahara, H.: The Unknown Word Problem: a Morphological Analysis of Japanese Using Maximum Entropy Aided by a Dictionary, Proceedings of EMNLP, pp. 91-99 (2001).
[20] Xue, N.: Chinese Word Segmentation as Character Tagging, Computational Linguistics and Chinese Language Processing, Vol. 8, No. 1, pp. 29-48 (2003).
[21] Zhang, Y. and Clark, S.: Joint Word Segmentation and POS Tagging Using a Single Perceptron, Proceedings of ACL, pp. 888-896 (2008).
[22] Zhang, Y. and Clark, S.: A Fast Decoder for Joint Word Segmentation and POS Tagging Using a Single Discriminative Model, Proceedings of EMNLP, pp. 843-852 (2010).
[23] Shift-Reduce 14 (2008).
[24] 19 pp. 908-911 (2013).