Applying Conditional Random Fields to Japanese Morphological Analysis

Taku Kudo*, Kaoru Yamamoto**, Yuji Matsumoto*
* Graduate School of Information Science, Nara Institute of Science and Technology, {taku-ku,matsu}@is.naist.jp
** CREST JST, Tokyo Institute of Technology, kaoru@lr.pi.titech.ac.jp
(Since April 2004 the first author has been at NTT Communication Science Laboratories, taku@cslab.kecl.ntt.co.jp.)

Abstract

This paper presents Japanese morphological analysis based on Conditional Random Fields (CRF). Previous work on CRF assumed that observation sequence (word) boundaries were fixed. However, word boundaries are not clear in Japanese, and hence a straightforward application of CRF is not possible. We show how CRF can be applied to situations where word boundary ambiguity exists. CRF offer an elegant solution to the long-standing problems of Japanese morphological analysis based on HMM or MEMM. First, flexible feature designs for hierarchical tagsets become possible. Second, the influences of label bias and length bias are minimized. The former compensates for a weakness of HMM, while the latter overcomes known problems of MEMM. We also compare two regularizations, a Gaussian prior and a Laplacian prior (L2-CRF / L1-CRF). We experiment with CRF, HMM, and MEMM on two annotated Japanese corpora, and CRF outperform the other approaches.

1 Introduction

Japanese morphological analysis, the joint task of segmenting an input sentence into words and assigning a part-of-speech tag to each word, is the first step of virtually all Japanese text processing. Conditional Random Fields (CRF) [8] are a discriminative model for segmenting and labeling sequence data, and have been applied successfully to shallow parsing [18], named entity recognition [12], table extraction [15], and information extraction from research papers [13]. For Japanese morphological analysis, the dominant approaches have been generative models such as Hidden Markov Models (HMM) (e.g., [4]) and locally normalized conditional models such as Maximum Entropy Markov Models (MEMM) (e.g., [21]). HMM, being generative, make it difficult to use flexible, overlapping features such as those needed for hierarchical tagsets; MEMM suffer from label bias and length bias, described in Section 2. This paper shows how CRF can be applied to a setting where word boundary ambiguity exists, and demonstrates that they solve both families of problems.
The rest of the paper is organized as follows. Section 2 describes Japanese morphological analysis and the problems of HMM- and MEMM-based approaches. Section 3 presents CRF over word lattices, together with two regularizations, a Gaussian prior (L2) and a Laplacian prior (L1). Section 4 reports experiments, and Section 5 concludes.

2 Japanese Morphological Analysis

2.1 Word Boundary Ambiguity

Unlike English part-of-speech tagging on the Penn Treebank, where the input is given as a fixed sequence of tokens, Japanese text has no spaces between words, so word boundaries must be identified jointly with the tags. Given an input sentence x, morphological analysis outputs

  y = ( \langle w_1, t_1 \rangle, \ldots, \langle w_{\#y}, t_{\#y} \rangle ),

a sequence of word/tag pairs, where \#y denotes the number of tokens in y and varies across candidates. The set of candidate outputs Y(x) is compactly represented as a lattice built by looking up a lexicon (Figure 1). A further difficulty is that the tagsets are hierarchical rather than flat: the IPA tagset used by ChaSen (http://chasen.naist.jp/) with IPADIC (http://chasen.naist.jp/stable/ipadic/) and the JUMAN tagset (http://www.kc.t.u-tokyo.ac.jp/nl-resource/juman.html) both annotate several POS layers plus conjugation type and conjugation form. HMM-based analyzers [4] must resort to ad hoc, hand-tuned smoothing and grouping of tags to exploit such hierarchical information.

2.2 Label Bias and Length Bias

MEMM [11, 6, 16] and related locally normalized discriminative models (including sequential classification with, e.g., SVMs) factor the conditional probability of a path as

  P(y|x) = \prod_{i=1}^{\#y} p( \langle w_i, t_i \rangle \mid \langle w_{i-1}, t_{i-1} \rangle ),

where each factor is a locally normalized maximum entropy model, and decoding selects \hat{y} = argmax_{y \in Y(x)} P(y|x). Because every transition distribution must sum to one, two systematic problems arise: 1) label bias [8]: transitions leaving a state with few outgoing candidates receive high probability mass regardless of the observation; 2) length bias: candidates consisting of fewer tokens multiply fewer probabilities and are therefore systematically preferred. Length bias is especially harmful in Japanese morphological analysis, where \#y differs among the candidates in the lattice. Figure 2 illustrates both biases; a toy numeric sketch of the length-bias calculation follows below.
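To make the length-bias arithmetic concrete, here is a minimal Python sketch. The transition probabilities are illustrative toy numbers mirroring Figure 2(b), not values from any trained model:

```python
# Length bias in a per-step normalized (MEMM-style) model.
# Each factor p(next | prev) is locally normalized, so a path's score
# is a product with one factor per token it contains.

# Path BOS -> A -> D -> EOS: two tokens covering the input span.
p_AD = 0.6 * 0.6 * 1.0        # p(A|BOS) * p(D|A) * p(EOS|D)

# Path BOS -> B -> EOS: one longer token covering the same span.
p_B = 0.4 * 1.0               # p(B|BOS) * p(EOS|B)

# 0.36 < 0.40: the one-token path wins even though its first step (0.4)
# was the locally less probable choice at the fork.
print(p_AD, p_B, p_AD < p_B)
```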
Figure 1: Lattice for the input 東京都に住む "(I live in the Tokyo Metropolis.)". Candidate nodes between BOS and EOS include 東 (east) [Noun], 京 (capital) [Noun], 東京 (Tokyo) [Noun], 京都 (Kyoto) [Noun], 都 (Metro.) [Suffix], に (in) [Particle], 似 (resemble) [Verb], and 住む (live) [Verb].

Figure 2: Label bias and length bias. All transition probabilities are locally normalized maximum entropy models.
(a) Label bias: BOS→A (0.6), BOS→B (0.4), A→C (0.4), A→D (0.6), D→EOS (1.0), B→E (1.0), E→EOS (1.0). P(A,D|x) = 0.6 × 0.6 × 1.0 = 0.36 and P(B,E|x) = 0.4 × 1.0 × 1.0 = 0.4, so P(A,D|x) < P(B,E|x): MEMM selects B-E because B and E each have a single outgoing transition whose probability is 1.0 irrespective of the observation.
(b) Length bias: as in (a), but B connects directly to EOS. P(A,D|x) = 0.6 × 0.6 × 1.0 = 0.36 and P(B|x) = 0.4 × 1.0 = 0.4, so P(A,D|x) < P(B|x): the shorter path is preferred simply because it contains fewer tokens.

3 Conditional Random Fields

CRF [8] eliminate both biases by replacing the per-token normalized models of MEMM [21, 20, 19] with a single exponential model over the entire path. Whereas the original CRF formulation assumes a fixed observation sequence, we define the model directly over paths y in the lattice Y(x), with BOS and EOS added as virtual nodes:

  P(y|x) = \frac{1}{Z_x} \exp\Big( \sum_{i=1}^{\#y} \sum_k \lambda_k f_k( \langle w_{i-1}, t_{i-1} \rangle, \langle w_i, t_i \rangle ) \Big)

  Z_x = \sum_{y' \in Y(x)} \exp\Big( \sum_{i=1}^{\#y'} \sum_k \lambda_k f_k( \langle w'_{i-1}, t'_{i-1} \rangle, \langle w'_i, t'_i \rangle ) \Big),

where f_k( \langle w_{i-1}, t_{i-1} \rangle, \langle w_i, t_i \rangle ) is a binary feature function on the i-th token and its left neighbor, and \lambda_k \in \Lambda = \{\lambda_1, \ldots, \lambda_K\} \in R^K is its weight. Because the normalization Z_x runs over all candidates in the lattice rather than over each local decision, paths of different lengths are compared on an equal footing, which removes both label bias and length bias; and because the model is discriminative, arbitrary overlapping features can be used, unlike in HMM.
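The following minimal Python sketch illustrates the formula above on a toy lattice; the features and weights are hypothetical, and it is not the paper's implementation. Each path is scored by \Lambda \cdot F(y,x) and normalized by brute-force enumeration of Y(x), which is exactly what Z_x expresses:

```python
import math

# Candidate segmentations of a toy input, as (word, tag) paths.
BOS, EOS = ("<bos>", "BOS"), ("<eos>", "EOS")
paths = [
    [BOS, ("東京", "Noun"), ("都", "Suffix"), ("に", "Particle"), ("住む", "Verb"), EOS],
    [BOS, ("東", "Noun"), ("京都", "Noun"), ("に", "Particle"), ("住む", "Verb"), EOS],
]

def features(prev, cur):
    """Binary features on adjacent (word, tag) pairs (bi-gram features)."""
    return {f"tag:{prev[1]}->{cur[1]}", f"word:{cur[0]}", f"wordtag:{cur[0]}/{cur[1]}"}

# Hypothetical weights; unlisted features weigh 0.
weights = {"tag:Noun->Suffix": 1.2, "word:東京": 0.8, "tag:Noun->Noun": -0.3}

def score(path):
    """Lambda . F(y, x): summed weights of features fired along the path."""
    return sum(weights.get(f, 0.0)
               for prev, cur in zip(path, path[1:]) for f in features(prev, cur))

Z = sum(math.exp(score(y)) for y in paths)      # brute-force Z_x over Y(x)
for y in paths:
    print([w for w, t in y[1:-1]], math.exp(score(y)) / Z)
```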
3.1 Parameter Estimation

Writing the global feature vector as F(y, x) = \{F_1(y, x), \ldots, F_K(y, x)\} with F_k(y, x) = \sum_{i=1}^{\#y} f_k( \langle w_{i-1}, t_{i-1} \rangle, \langle w_i, t_i \rangle ), the model becomes P(y|x) = \frac{1}{Z_x} \exp( \Lambda \cdot F(y, x) ), and decoding reduces to

  \hat{y} = argmax_{y \in Y(x)} P(y|x) = argmax_{y \in Y(x)} \Lambda \cdot F(y, x),   (1)

which is solved exactly with the Viterbi algorithm over the lattice. (The bi-gram features used here extend straightforwardly to tri-gram or general n-gram features, e.g., f_k( \langle w_{i-n+1}, t_{i-n+1} \rangle, \ldots, \langle w_i, t_i \rangle ).)

Given training data T = \{ \langle x_j, y_j \rangle \}_{j=1}^{N}, maximum likelihood estimation maximizes

  L_\Lambda = \sum_j \log P(y_j | x_j)
            = -\sum_j \log \Big[ \sum_{y' \in Y(x_j)} \exp\big( -\Lambda \cdot [ F(y_j, x_j) - F(y', x_j) ] \big) \Big]
            = \sum_j \big[ \Lambda \cdot F(y_j, x_j) - \log Z_{x_j} \big],

whose gradient is

  \frac{\partial L_\Lambda}{\partial \lambda_k} = \sum_j \big( F_k(y_j, x_j) - E_{P(y|x_j)}[ F_k(y, x_j) ] \big),

i.e., at the optimum the observed feature counts match the model expectations. The expectations are computed with a variant of the forward-backward algorithm adapted to the lattice:

  \alpha_{\langle w,t \rangle} = \sum_{\langle w',t' \rangle \in LT(\langle w,t \rangle)} \alpha_{\langle w',t' \rangle} \exp\Big( \sum_k \lambda_k f_k( \langle w',t' \rangle, \langle w,t \rangle ) \Big)
  \beta_{\langle w,t \rangle} = \sum_{\langle w',t' \rangle \in RT(\langle w,t \rangle)} \beta_{\langle w',t' \rangle} \exp\Big( \sum_k \lambda_k f_k( \langle w,t \rangle, \langle w',t' \rangle ) \Big),

where LT(\langle w,t \rangle) and RT(\langle w,t \rangle) are the sets of nodes left- and right-adjacent to \langle w,t \rangle, and \alpha_{\langle w_{bos}, t_{bos} \rangle} = \beta_{\langle w_{eos}, t_{eos} \rangle} = 1. Then Z_x = \alpha_{\langle w_{eos}, t_{eos} \rangle} = \beta_{\langle w_{bos}, t_{bos} \rangle} and

  E_{P(y|x)}[ F_k(y, x) ] = \sum_{( \langle w',t' \rangle, \langle w,t \rangle ) \in B(x)} \alpha_{\langle w',t' \rangle} \cdot f_k( \langle w',t' \rangle, \langle w,t \rangle ) \cdot \exp\Big( \sum_{k'} \lambda_{k'} f_{k'}( \langle w',t' \rangle, \langle w,t \rangle ) \Big) \cdot \beta_{\langle w,t \rangle} \,/\, Z_x,

where B(x) is the set of all adjacent node pairs (bi-grams) in the lattice of x. As a concrete instance, a feature might be defined as f_1234( \langle w',t' \rangle, \langle w,t \rangle ) = 1 if t' = … and t = …; 0 otherwise.

Maximum likelihood estimation overfits, so we use MAP estimation with either a Gaussian prior (L2-norm) [6] or a Laplacian prior (L1-norm) [7]:

  L_\Lambda = C \sum_j \log P(y_j | x_j) - \sum_k |\lambda_k| / 2    (L1-norm)
  L_\Lambda = C \sum_j \log P(y_j | x_j) - \sum_k \lambda_k^2 / 2    (L2-norm),

where C \in R^+ is a hyperparameter balancing data fit against the penalty. We call the two variants L1-CRF and L2-CRF. The L1 penalty is not differentiable at zero; following the inequality-constraints formulation of [7], we rewrite each weight as \lambda_k = \lambda_k^+ - \lambda_k^- with \lambda_k^+ \geq 0, \lambda_k^- \geq 0, and maximize

  C \sum_j \log P(y_j | x_j) - \sum_k ( \lambda_k^+ + \lambda_k^- ) / 2.

Let O_k = \sum_j F_k(y_j, x_j) be the observed count of feature k in T, and E_k = \sum_j E_{P(y|x_j)}[ F_k(y, x_j) ] its expectation under the model.
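As a sanity check on the forward recursion, here is a minimal sketch on a hand-built toy lattice. The node names and edge scores are hypothetical; each edge score stands for \sum_k \lambda_k f_k on that edge. The forward value at EOS equals Z_x, as the brute-force sum over the three paths confirms:

```python
import math

# Lattice as: node -> list of (left_neighbor, edge_score).
left = {
    "A":   [("BOS", 0.5)],
    "B":   [("BOS", 0.2)],
    "C":   [("A", 1.0), ("B", 0.3)],
    "EOS": [("C", 0.0), ("B", 0.7)],
}
order = ["BOS", "A", "B", "C", "EOS"]   # topological order of the lattice

# alpha accumulates exp(path score) over all paths reaching each node.
alpha = {"BOS": 1.0}
for node in order[1:]:
    alpha[node] = sum(alpha[l] * math.exp(s) for l, s in left[node])

print("Z_x   =", alpha["EOS"])
# Brute force over the paths BOS-A-C-EOS, BOS-B-C-EOS, BOS-B-EOS:
print("brute =", math.exp(0.5 + 1.0) + math.exp(0.2 + 0.3) + math.exp(0.2 + 0.7))
```

The backward values \beta are computed symmetrically from EOS, and the edge marginals \alpha \cdot e^{score} \cdot \beta / Z_x give the feature expectations E_k used in the gradient.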
The Karush-Kuhn-Tucker conditions of this constrained problem, where \lambda_k = \lambda_k^+ - \lambda_k^- s.t. \lambda_k^+ \geq 0, \lambda_k^- \geq 0, are

  \lambda_k^+ \big[ C (O_k - E_k) - 1/2 \big] = 0,    \lambda_k^- \big[ C (E_k - O_k) - 1/2 \big] = 0.

Hence \lambda_k can move away from zero only when C |O_k - E_k| = 1/2; whenever C |O_k - E_k| < 1/2 at the optimum, \lambda_k is exactly 0. L1-CRF therefore yields a sparse solution in which most weights vanish: feature selection is embedded in the estimation itself, producing compact models that are cheap in memory and CPU time at decoding. For L2-CRF, the optimality condition is

  \frac{\partial L_\Lambda}{\partial \lambda_k} = C (O_k - E_k) - \lambda_k = 0,

so \lambda_k = C (O_k - E_k) and virtually all weights become non-zero: L2-CRF keeps every feature with a small weight. (This contrast between the L1 and L2 norms parallels the relation between Boosting and Support Vector Machines, which optimize L1- and L2-style margins respectively [17].) The accuracy/compactness trade-off between L1-CRF and L2-CRF is examined experimentally in Section 4.

For optimization, iterative scaling algorithms (e.g., IIS and GIS [14]) are the traditional choice for CRF, but quasi-Newton methods converge faster in practice: we use L-BFGS [9] for L2-CRF and its bound-constrained variant L-BFGS-B [5] for L1-CRF, to honor the constraints \lambda_k^+, \lambda_k^- \geq 0.

It is instructive to see that an HMM is a special case of this CRF. The HMM path score can be rewritten as

  \log \prod_{i=1}^{\#y} p(w_i | t_i) \, p(t_i | t_{i-1})
  = \sum_{i=1}^{\#y} \big[ \log p(w_i | t_i) + \log p(t_i | t_{i-1}) \big]
  = \sum_{\langle w,t \rangle} F_{\langle w,t \rangle}(y, x) \log p(w|t) + \sum_{\langle t',t \rangle} F_{\langle t',t \rangle}(y, x) \log p(t|t')
  = \Lambda \cdot F(y, x),

where F_{\langle w,t \rangle}(y, x) = \sum_i I(w = w_i \wedge t = t_i) and F_{\langle t',t \rangle}(y, x) = \sum_i I(t = t_i \wedge t' = t_{i-1}). An HMM is thus a CRF restricted to these two feature types whose weights are fixed to log-probabilities (\lambda_{\langle w,t \rangle} = \log p(w|t), \lambda_{\langle t',t \rangle} = \log p(t|t')); CRF drop the normalization constraints and admit arbitrary overlapping features.

3.2 Relation to SVM and Boosting

CRF, SVM, and Boosting can all be viewed as minimizing a loss defined on the margins \Lambda \cdot [ F(y_j, x_j) - F(y', x_j) ] between the correct path y_j and the incorrect candidates y' \in Y(x_j):

  E_{CRF} = \sum_j \log \Big[ \sum_{y' \in Y(x_j)} \exp\big( -\Lambda \cdot [ F(y_j, x_j) - F(y', x_j) ] \big) \Big]
  E_{SVM} = \sum_j \max_{y' \in Y(x_j)} \max\big( 0, \, 1 - \Lambda \cdot [ F(y_j, x_j) - F(y', x_j) ] \big)
  E_{Bo}  = \sum_j \sum_{y' \in Y(x_j)} \exp\big( -\Lambda \cdot [ F(y_j, x_j) - F(y', x_j) ] \big).

E_{CRF} is the negative log-likelihood (a soft-max over negative margins), E_{SVM} is the hinge loss against the best competing path, as in the Hidden Markov SVM (HM-SVM) [3], and E_{Bo} is the exponential loss studied for label sequence learning by Altun et al. [2, 1]. All three encourage large margins and differ only in how sharply violations are penalized.
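A small sketch of the three losses above, evaluated on hypothetical margins m(y') = \Lambda \cdot [F(y,x) - F(y',x)] to three incorrect candidate paths of a single training sentence (the numbers are assumptions for illustration):

```python
import math

# Hypothetical margins from the correct path to three wrong lattice paths.
margins = [2.0, 0.5, -0.3]

# CRF: negative log-likelihood = log-sum-exp over Y(x); the correct path
# itself contributes exp(0) = 1 to the inner sum.
e_crf = math.log(1.0 + sum(math.exp(-m) for m in margins))

# SVM: hinge loss against the closest (worst-margin) competitor.
e_svm = max(0.0, 1.0 - min(margins))

# Boosting: exponential loss summed over all wrong candidates.
e_bo = sum(math.exp(-m) for m in margins)

print(e_crf, e_svm, e_bo)   # all three penalize the violated margin -0.3 most
```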
4 Experiments

4.1 Experimental Settings

We use two annotated corpora with different tagsets and lexicons: the Kyoto University Corpus ver. 2.0 (KC) and the RWCP text corpus (RWCP). Table 1 summarizes them.

Table 1: Corpus statistics.

                           KC                              RWCP
  source                   Mainichi News '95               Mainichi News '94
  lexicon (# entries)      JUMAN ver. 3.61 (1,983,173)     IPADIC ver. 2.7.0 (379,010)
  POS hierarchy (levels)   2                               4
  training sentences       7,958 (articles, Jan. 1-8)      10,000
  training tokens          198,514                         265,631
  test sentences           1,246 (articles, Jan. 9)        -
  test tokens              25,743                          31,302

Table 2: Feature templates f_k( \langle w',t' \rangle, \langle w,t \rangle ). Each lattice node exposes the surface form w, the base form bw, the top-level POS p1, the sub-POS p2, the conjugation form cf, and the conjugation type ct. Uni-gram templates fire on attributes of the right node alone (w, bw, p1, p2, p2/cf, p2/ct, p2/cf/ct, w/p2/cf/ct, bw/p1, and the like); bi-gram templates conjoin attributes of the left node (p1', p2', cf', ct', bw') with attributes of the right node at various levels of granularity (p1'/p1, p2'/p2, p2'/p2/cf, p2'/p2/ct, p2'/p2/cf/ct, and lexicalized conjunctions adding bw on either side, up to bw'/p1'/p2'/cf'/ct' paired with bw/p1/p2/cf/ct). For example, one instantiated feature is

  f_1234( \langle w',t' \rangle, \langle w,t \rangle ) = 1 if bw' = … and p1 = …; 0 otherwise.

We evaluate with the standard token-level metrics:

  F_{\beta=1} = \frac{2 \cdot Recall \cdot Precision}{Recall + Precision}
  Recall = # of correct tokens / # of tokens in the test corpus
  Precision = # of correct tokens / # of tokens in the system output.

A token is counted as correct at three levels: 1) seg: the word segmentation is correct; 2) top: the segmentation and the top-level POS are correct; 3) all: all annotations (full POS hierarchy, conjugation type/form, and base form) are correct.

As baselines we use a bi-gram HMM; on KC, the MEMM of Uchimoto et al. [21]; and on RWCP, the Extended HMM (E-HMM) of Asahara and Matsumoto [4] as implemented in ChaSen. Our CRF are implemented in C++, using L-BFGS for L2-CRF and L-BFGS-B for L1-CRF; all experiments ran on a Linux machine with an Intel XEON 2.8 GHz CPU and 4.0 GB of memory. The hyperparameter C of L1-CRF/L2-CRF was tuned for each setting; the selected values are shown in Tables 3 and 4.

4.2 Results

Tables 3 and 4 show F_{\beta=1} at the three levels (seg/top/all) on KC and RWCP.
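A minimal sketch of the token-level scoring (the spans below are hypothetical). A token is represented by its character span plus whatever tag fields are being evaluated, so the same code covers seg, top, and all by varying what goes into the tuples:

```python
# Gold and system outputs as sets of (start, end, tag) tokens.
gold   = {(0, 2, "Noun"), (2, 3, "Suffix"), (3, 4, "Particle"), (4, 6, "Verb")}
system = {(0, 2, "Noun"), (2, 4, "Particle"), (4, 6, "Verb")}

correct   = len(gold & system)            # tokens matching span and tag
recall    = correct / len(gold)           # correct / tokens in test corpus
precision = correct / len(system)         # correct / tokens in system output
f1 = 2 * recall * precision / (recall + precision)

print(f"R={recall:.3f} P={precision:.3f} F={f1:.3f}")
```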
Table 3: Results on KC.

  system                  F_{\beta=1} (seg / top / all)
  L1-CRF (C=3.0)          98.80 / 98.14 / 96.55
  L2-CRF (C=1.2)          98.96 / 98.31 / 96.75
  HMM-bigram              96.22 / 94.99 / 91.85
  MEMM (Uchimoto 01)      96.44 / 95.81 / 94.27
  JUMAN (rule-based)      98.70 / 98.09 / 94.35

Table 4: Results on RWCP.

  system                  F_{\beta=1} (seg / top / all)
  L1-CRF (C=3.0)          99.00 / 98.58 / 97.30
  L2-CRF (C=2.4)          99.11 / 98.72 / 97.65
  HMM-bigram              98.82 / 98.10 / 95.90
  E-HMM (Asahara 00)      98.86 / 98.38 / 97.00

4.3 Discussion

4.3.1 CRF and MEMM

On KC, both L1-CRF and L2-CRF outperform the MEMM of [21] at every level, and also surpass the hand-crafted rule-based analyzer JUMAN on the full-annotation (all) criterion. Inspection of the errors shows that many MEMM mistakes stem from length bias: MEMM prefers candidates containing fewer tokens. Figure 3 shows two test-set examples.

Figure 3: MEMM errors caused by length bias. In "The romance on the sea they bet is ...", MEMM analyzes the noun "romance" and the following particle as a single longer word glossed "romanticist"; in "A heart which beats rough waves is ...", MEMM similarly merges "heart" and the following material into one longer word glossed "not one's heart". In both cases the analysis with fewer tokens receives the higher MEMM probability, while CRF select the correct segmentation.

4.3.2 CRF and HMM

CRF also clearly outperform the plain bi-gram HMM. The Extended HMM (E-HMM) [4] narrows the gap on RWCP by augmenting the HMM with 1) position-wise grouping of POS tags, 2) word-level statistics, and 3) smoothing of word- and POS-level statistics, but these extensions must be selected and tuned in an ad hoc fashion. CRF achieve higher accuracy while incorporating the same kinds of information uniformly, simply as features.

4.3.3 L1-CRF and L2-CRF

L2-CRF performs slightly better than L1-CRF on both corpora, reflecting the Gaussian prior's tendency to keep every feature with a small non-zero weight (\lambda_k \neq 0), which suits dense, highly correlated feature sets. L1-CRF, in contrast, produces extremely sparse solutions: the number of non-zero weights is roughly 1/8 to 1/6 that of L2-CRF (L1-CRF: 90,163 (KC) / 101,757 (RWCP) vs. L2-CRF: 791,798 (KC) / 580,032 (RWCP)); on KC the model file shrinks from about 47 MB (L2-norm) to about 7 MB (L1-norm). L1-CRF is therefore attractive when memory or CPU time at decoding is constrained, while L2-CRF is preferable when accuracy has priority. The sparsity follows directly from the optimality condition derived in Section 3.1, sketched below.
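A one-function sketch of that condition (C and the observed/expected counts below are illustrative numbers, not measured values): a weight can leave zero only once C |O_k - E_k| reaches 1/2, which is what drives the non-zero feature counts above:

```python
def l1_weight_may_be_nonzero(C, o_k, e_k):
    """KKT condition from Section 3.1: lambda_k != 0 requires
    C * |O_k - E_k| = 1/2 at the optimum; below 1/2 the weight is 0."""
    return C * abs(o_k - e_k) >= 0.5

# Weak feature: observed and expected counts nearly agree.
print(l1_weight_may_be_nonzero(3.0, 10.0, 9.9))   # False -> weight pinned at 0
# Informative feature: counts disagree strongly.
print(l1_weight_may_be_nonzero(3.0, 10.0, 7.0))   # True  -> weight may be non-zero
```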
5 Conclusion

We presented a Japanese morphological analyzer based on Conditional Random Fields (CRF). Unlike previous applications of CRF, which assume fixed observation (word) boundaries, our model is defined over paths in a word lattice and thus handles the word boundary ambiguity inherent to Japanese. CRF resolve the two long-standing problems of HMM- and MEMM-based analyzers: they admit flexible feature designs for hierarchical tagsets, and they are free from label bias and length bias. Experiments on two annotated corpora (KC and RWCP) show that both L1-CRF and L2-CRF outperform HMM- and MEMM-based analyzers, with L2-CRF slightly more accurate and L1-CRF far more compact.

Future work includes extending the current bi-gram features to tri-gram and general n-gram features; since this greatly increases the number of candidate features, we plan to combine it with efficient feature induction for CRF as proposed by McCallum [10].

Acknowledgments: We thank CRL and NAIST for the MEMM and E-HMM results used in our comparison.

References

[1] Yasemin Altun and Thomas Hofmann. Large margin methods for label sequence learning. In Proc. of EuroSpeech, 2003.
[2] Yasemin Altun, Mark Johnson, and Thomas Hofmann. Investigating loss functions and optimization methods for discriminative learning of label sequences. In Proc. of EMNLP, pages 145-152, 2003.
[3] Yasemin Altun, Ioannis Tsochantaridis, and Thomas Hofmann. Hidden Markov support vector machines. In Proc. of ICML, pages 3-10, 2003.
[4] Masayuki Asahara and Yuji Matsumoto. Extended models and tools for high-performance part-of-speech tagger. In Proc. of COLING, pages 21-27, 2000.
[5] Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ci You Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(6):1190-1208, 1995.
[6] Stanley F. Chen and Ronald Rosenfeld. A Gaussian prior for smoothing maximum entropy models. Technical report, Carnegie Mellon University, 1999.
[7] Jun'ichi Kazama and Jun'ichi Tsujii. Evaluation and extension of maximum entropy models with inequality constraints. In Proc. of EMNLP, pages 137-144, 2003.
[8] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML, pages 282-289, 2001.
[9] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Math. Programming, 45(3, Ser. B):503-528, 1989.
[10] Andrew McCallum. Efficiently inducing features of conditional random fields. In Proc. of UAI, 2003.
[11] Andrew McCallum, Dayne Freitag, and Fernando Pereira. Maximum entropy Markov models for information extraction and segmentation. In Proc. of ICML, pages 591-598, 2000.
[12] Andrew McCallum and Wei Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proc. of CoNLL, 2003.
[13] Fuchun Peng and Andrew McCallum. Accurate information extraction from research papers using conditional random fields. In Proc. of HLT-NAACL, 2004.
[14] Stephen Della Pietra, Vincent J. Della Pietra, and John D. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380-393, 1997.
[15] David Pinto, Andrew McCallum, Xing Wei, and W. Bruce Croft. Table extraction using conditional random fields. In Proc. of SIGIR, pages 235-242, 2003.
[16] Adwait Ratnaparkhi. A maximum entropy model for part-of-speech tagging. In Proc. of EMNLP, pages 133-142, 1996.
[17] Gunnar Rätsch. Robust Boosting via Convex Optimization. PhD thesis, Department of Computer Science, University of Potsdam, 2001.
[18] Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In Proc. of HLT-NAACL, pages 213-220, 2003.
[19] Kiyotaka Uchimoto, Chikashi Nobata, Atsushi Yamada, Satoshi Sekine, and Hitoshi Isahara. Morphological analysis of a large spontaneous speech corpus in Japanese. In Proc. of ACL, pages 479-488, 2003.
[20] Kiyotaka Uchimoto, Chikashi Nobata, Atsushi Yamada, Satoshi Sekine, and Hitoshi Isahara. Morphological analysis of the spontaneous speech corpus. In Proc. of COLING, pages 1298-1302, 2002.
[21] Kiyotaka Uchimoto, Satoshi Sekine, and Hitoshi Isahara. The unknown word problem: a morphological analysis of Japanese using maximum entropy aided by a dictionary. In Proc. of EMNLP, pages 91-99, 2001.