Bayes Discriminative Language Modeling Based on Risk Minimization Training

Kobayashi Akio^{1,a)}  Oku Takahiro^{1}  Fujita Yuya^{1}  Sato Shoei^{1}  Nakagawa Seiichi^{2}

Abstract: This paper describes discriminative language models (LMs) that reflect information about word errors in automatic speech recognition (ASR). The discriminative LMs are implemented as a set of penalty scores employing linguistic features and their weighting factors. The models are estimated on the basis of minimization of expected risks that are closely associated with word errors. In transcribing Japanese broadcast programs, the semi-supervised discriminative LM achieved the best results in word error rates compared with the supervised and unsupervised LMs and conventional discriminative LMs based on maximization of conditional log-likelihoods.

Keywords: Bayes risk, discriminative training, multi-objective programming, semi-supervised training

1. Introduction

1 NHK Science and Technology Research Laboratories, Setagaya, Tokyo 157-8510, Japan
2 Toyohashi University of Technology, Toyohashi, Aichi 441-8580, Japan
a) kobayashi.a-fs@nhk.or.jp

© 2012 Information Processing Society of Japan
2. Risk-Based Discriminative Language Modeling

Discriminative LMs [3], [4] re-rank recognition hypotheses with linguistic features. Here they are trained by minimizing Bayes risk [8], and the supervised [4], unsupervised [5], and semi-supervised objectives are combined by multi-objective optimization programming (MOP) [9], following [6], [7].

2.1 Log-Linear Model

For input speech x and word sequence w, the posterior probability is modeled log-linearly with the parameter set Λ:

    P(w|x; Λ) ∝ exp{ f_am(x|w) + λ_lm f_lm(w) + Σ_i λ_i f_i(w) }    (1)

where f_am(x|w) is the acoustic score, f_lm(w) is the baseline LM score, and the f_i(w) are discriminative feature functions whose weights λ_i, together with λ_lm, form Λ.

2.2 Bayes Risk Decoding

Decoding selects from the N-best list the hypothesis with the minimum expected risk [8]:

    ŵ = arg min_w Σ_{w'} R(w, w') P(w'|x)    (2)

where P(w'|x) is the posterior probability of hypothesis w' in the N-best list for input x, and R(w, w') is the Levenshtein (edit) distance between word sequences w and w'. The training criteria below are matched to this decoding rule, with Λ in Eq. (1) as the parameters to be estimated.

2.3 Risk Minimization Training

For unlabeled utterances x^(u)_m (m = 1, ..., M) with N-best hypotheses w_{m,k}, the unsupervised objective is the expected risk

    U(Λ) = (1/M) Σ_m Σ_k P(w_{m,k}|x^(u)_m; Λ) χ(w_{m,k})    (3)

where P(w_{m,k}|x^(u)_m; Λ) is the posterior of w_{m,k} under Eq. (1), and χ(w_{m,k}) is the expected risk of w_{m,k} measured against the competing hypotheses as in Section 2.2:

    χ(w_{m,k}) = Σ_{k'} R(w_{m,k}, w_{m,k'}) P(w_{m,k'}|x^(u)_m; Λ)    (4)

For labeled utterances x^(l)_n (n = 1, ..., N) with reference transcripts w^ref_n, the supervised objective is

    L(Λ) = (1/N) Σ_n Σ_k P(w_{n,k}|x^(l)_n; Λ) R(w^ref_n, w_{n,k})    (5)

Both Eqs. (3)–(4) and Eq. (5) are minimized with respect to Λ.
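As a minimal illustration of the expected-risk objectives of Eqs. (3)–(5), the following sketch computes them over a toy N-best list (the function names are hypothetical, and this is a per-utterance N-best sketch, not the paper's lattice-based implementation):

```python
import math

def levenshtein(a, b):
    """Word-level edit distance, the risk R(w, w')."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[m][n]

def posteriors(scores):
    """Softmax of log-linear scores: P(w_k | x; Λ) as in Eq. (1)."""
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def unsupervised_risk(nbest, scores):
    """Eqs. (3)-(4): posterior-weighted risk of each hypothesis
    against all competing hypotheses in the same N-best list."""
    p = posteriors(scores)
    return sum(p[k] * sum(p[j] * levenshtein(wk, wj)
                          for j, wj in enumerate(nbest))
               for k, wk in enumerate(nbest))

def supervised_risk(nbest, scores, ref):
    """Eq. (5): posterior-weighted risk against the reference."""
    p = posteriors(scores)
    return sum(p[k] * levenshtein(w, ref) for k, w in enumerate(nbest))
```

Averaging these per-utterance terms over the M unlabeled and N labeled utterances gives U(Λ) and L(Λ).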
The expected risks are computed on word lattices in the manner of minimum phone error (MPE) training [10]. For the lattice L_m of utterance x_m, the local loss between two edges e and e' is l(e, e') = 0 if label(e) = label(e') and 1 otherwise [5], and the Levenshtein-type risk of edge e is accumulated over the edges that overlap it in time:

    ζ(e) = Σ_{e' ∈ overlap(e)} l(e, e') p(e')    (6)

where overlap(e) is the set of edges overlapping e and p(e') is the posterior probability of e'. The edge posterior is obtained by the forward-backward algorithm:

    p(e) = (1/ᾱ) α(σ(e)) s(e) β(τ(e))    (7)

where σ(e) and τ(e) are the start and end nodes of e, α(σ(e)) and β(τ(e)) are the forward and backward scores of those nodes, ᾱ is the total score of the lattice, and s(e) is the score of edge e:

    s(e) = exp{ λ_am φ_am(e) + λ_lm φ_lm(e) + Σ_i λ_i φ_i(e) }    (8)

Here φ_am(e) and φ_lm(e) are the acoustic and LM scores of e with weights λ_am and λ_lm, and φ_i(e) is 1 if feature f_i fires on edge e and 0 otherwise.

From ζ(e) and p(e), the expected risk γ(e) of the paths passing through e and the average risk γ_m of lattice m are computed, and the statistic for edge e is [4]

    δ^(i)_{m,e} = p(e) (γ(e) − γ_m) φ_i(e)    (9)

so that the derivative of the unsupervised objective with respect to λ_i becomes

    ∂U(Λ)/∂λ_i = (1/M) Σ_m Σ_{e ∈ L_m} δ^(i)_{m,e}    (10)

The derivative of the supervised objective of Eq. (5) is obtained in the same way on the labeled lattices, with the risk in Eq. (6) measured against the reference transcript.

2.4 Conventional Training Criteria

The conventional discriminative LM [3] minimizes the negative conditional log-likelihood of the references:

    L(Λ) = −(1/N) Σ_n log P(w^ref_n|x^(l)_n; Λ)    (11)

For unlabeled data, entropy minimization [11] can be used instead:

    U(Λ) = −(1/M) Σ_m Σ_k P(w_{m,k}|x^(u)_m; Λ) log P(w_{m,k}|x^(u)_m; Λ)    (12)

Minimizing the entropy of Eq. (12) sharpens the posterior distribution over the N-best hypotheses.

2.5 Multi-Objective Optimization

The supervised and unsupervised criteria are combined by treating them as the two objectives of a multi-objective optimization problem [9].
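The edge posterior of Eq. (7) can be sketched with a plain forward-backward pass. The lattice format below is a hypothetical minimal one: edges are (src, dst, score) tuples with s(e) already exponentiated as in Eq. (8), and integer node ids are assumed to be numbered in topological order:

```python
from collections import defaultdict

def edge_posteriors(edges, start, end):
    """p(e) = α(σ(e)) s(e) β(τ(e)) / ᾱ as in Eq. (7).
    edges: list of (src, dst, score); node ids are assumed to be
    numbered in topological order (an assumption of this sketch)."""
    alpha = defaultdict(float)
    beta = defaultdict(float)
    alpha[start] = 1.0
    beta[end] = 1.0
    for src, dst, s in sorted(edges, key=lambda e: e[0]):   # forward pass
        alpha[dst] += alpha[src] * s
    for src, dst, s in sorted(edges, key=lambda e: -e[1]):  # backward pass
        beta[src] += beta[dst] * s
    total = alpha[end]  # ᾱ, the total lattice score
    return [alpha[src] * s * beta[dst] / total for src, dst, s in edges]
```

On a diamond-shaped lattice the posteriors of the edges crossing each cut sum to one, which is what the risk statistics of Eqs. (6) and (9) rely on; a real implementation would work in log space to avoid underflow.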
Following [6], [7], the parameters Λ are estimated by the ε-constraint method [12], which minimizes one of the two objectives [9] subject to an upper bound on the other:

    Λ̂ = arg min_Λ L(Λ)  subject to  U(Λ) ≤ Ū    (13)

The bound Ū is set relative to the unsupervised risk of the baseline model (Λ = 0):

    Ū = α U(0)    (14)

where α (< 1.0) demands a 5–20% relative reduction of U(Λ) from the baseline. The constrained problem of Eq. (13) is converted into an unconstrained one with an exterior penalty function [13] with parameters κ and ρ:

    F(Λ) = L(Λ) + ρ ⟨ κ/(2ρ) + U(Λ) − Ū ⟩²    (15)

where ⟨x⟩ = max{x, 0}. The gradient used for optimizing Eq. (15) is

    ∂F(Λ)/∂λ_i = ∂L(Λ)/∂λ_i + 2ρ ⟨ κ/(2ρ) + U(Λ) − Ū ⟩ ∂U(Λ)/∂λ_i    (16)

where ∂L/∂λ_i is computed on the labeled data and ∂U/∂λ_i on the unlabeled data as in Eq. (10). The parameters κ and ρ of Eq. (15) [13] and the factor α control the balance between the two objectives.

2.6 Feature Functions

The discriminative features are n-gram counts; for a word trigram (u1, u2, u3),

    f_i = h_{(u1,u2,u3)}(w) = c_{u1,u2,u3}(w)    (17)

where c_{u1,u2,u3}(w) is the number of times the trigram (u1, u2, u3) occurs in word sequence w [4]. Part-of-speech n-gram features are defined analogously [14].

3. Experiments

3.1 Experimental Setup

The acoustic models were MPE-trained HMMs over 39-dimensional features (12 MFCCs with first- and second-order derivatives). First-pass decoding with a bigram LM produced lattices, from which 200-best lists were extracted and rescored with a trigram LM trained on 239M words with a 100k-word vocabulary. The evaluation data were taken from NHK broadcast programs; Table 1 lists their perplexity (PP) under the trigram LM, out-of-vocabulary rate (OOV), and baseline word error rate (WER).
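The exterior-penalty optimization of Eqs. (15)–(16) can be sketched on a toy problem as follows (the quadratics stand in for L(Λ) and U(Λ), plain gradient descent stands in for the L-BFGS used in the experiments, and all names are hypothetical):

```python
def penalty_minimize(L, dL, U, dU, U_bar, lam0,
                     kappa=1.0, rho=10.0, lr=0.01, steps=2000):
    """Minimize F(λ) = L(λ) + ρ⟨κ/(2ρ) + U(λ) − Ū⟩² by gradient
    descent, using the gradient of Eq. (16); ⟨x⟩ = max{x, 0}."""
    lam = lam0
    for _ in range(steps):
        bracket = max(kappa / (2 * rho) + U(lam) - U_bar, 0.0)
        grad = dL(lam) + 2 * rho * bracket * dU(lam)
        lam -= lr * grad
    return lam

# Toy problem: L(λ) = (λ − 2)², U(λ) = λ², constraint U(λ) ≤ Ū = 1.
# The unconstrained minimum λ = 2 violates the constraint, so the
# penalty pushes the solution back toward λ ≈ 1.
lam = penalty_minimize(lambda l: (l - 2) ** 2, lambda l: 2 * (l - 2),
                       lambda l: l ** 2, lambda l: 2 * l,
                       U_bar=1.0, lam0=0.0)
```

While the constraint is inactive (⟨·⟩ = 0) only L drives the update; once U(λ) approaches Ū the penalty term takes over, which is exactly the trade-off the ε-constraint method encodes.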
Table 1  Evaluation data for discriminative language modeling

    Set      Utterances  Words  PP     OOV(%)  WER(%)
    Set 1    245         3.5k   125.7  1.5     23.0
    Set 2    551         7.0k   139.4  1.3     22.3

Table 2  Training data for discriminative language modeling

    Data       Hours  Utterances  Words
    Labeled    58.6   26k         697.5k
    Unlabeled  344.1  218.6k      2.84M

Table 3  Perplexities and word error rates for training data

    Data       PP     OOV(%)  WER(%)  GER(%)
    Labeled    64.0   2.03    22.3    13.2
    Unlabeled  163.2  3.07    30.0    16.9

where GER denotes the graph (lattice) error rate.

Table 4  Feature functions for discriminative language modeling

    Feature      Count
    POS 2-gram   1.3k
    POS 3-gram   12.9k
    Word 2-gram  731.9k
    Word 3-gram  1859.6k

The feature weights were optimized with L-BFGS [15], and the constraint factor α of Eq. (14) was set between 0.80 and 0.95. The features were word and POS 2-grams and 3-grams (Table 4).

3.2 Results

Table 5 shows the WERs of the discriminative LMs [7]. The risk-based unsupervised model reduced the WER to 21.5% (a 3.6% relative reduction from the 22.3% baseline), and the risk-based semi-supervised model achieved the best WER of 20.9% (a 6.3% relative reduction); its 2.8% relative gain over the risk-based unsupervised model was statistically significant at the 5% level.

4. Discussion

4.1 Amount of Unlabeled Training Data

The semi-supervised models (risk-based and CML + entropy) were also trained with 10, 30, and 50% of the unlabeled data (Table 6).
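The n-gram count features of Eq. (17) and Table 4 reduce to a simple counting step; a minimal sketch for word n-grams (POS n-grams would be analogous; the helper name is hypothetical):

```python
from collections import Counter

def ngram_features(words, n):
    """Map each n-gram (u1, ..., un) in the hypothesis to its
    count c_{u1,...,un}(w), the feature value of Eq. (17)."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
```

For a hypothesis w = [a, b, a, b, c], ngram_features(w, 2) counts (a, b) twice; in training, only the n-grams observed in the lattices would actually receive weights λ_i.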
Table 5  Experimental results for discriminative language modeling (WER, %)

    Model                            Set 1  Set 2
    Baseline (trigram)               23.0   22.3
    Supervised (CML)                 22.9   22.1
    Supervised (risk)                22.8   22.3
    Unsupervised (entropy)           22.7   22.2
    Unsupervised (risk)              22.3   21.5
    Semi-supervised (CML + entropy)  22.5   22.0
    Semi-supervised (risk)           21.9   20.9

Table 6  Semi-supervised discriminative language modeling with various amounts of unlabeled training data (WER, %)

    Unlabeled data  Risk  CML + entropy
    10%             21.3  22.3
    30%             21.2  22.3
    50%             21.1  22.3
    100%            20.9  22.0

Table 7  Semi-supervised discriminative language modeling with various amounts of unlabeled training data (risk minimization, %)

    Unlabeled data   Studio  VTR
    10%              15.8    27.8
    30%              15.9    27.4
    50%              15.8    27.2
    100%             15.7    27.0
    Baseline         16.4    29.2
    Supervised       16.0    29.6
    General (100%)   16.2    27.7

Even with only 10% of the unlabeled data, the risk-based semi-supervised model outperformed the CML + entropy model, and its WER improved steadily as more unlabeled data were added (Table 6).

4.2 Matched Unlabeled Data

The evaluation data were divided into studio speech (3.8k words) and VTR (field-report) speech (3.3k words), and the WERs of the risk-based semi-supervised models were measured separately (Table 7). With VTR-matched unlabeled training data, the VTR WER fell from the 29.2% baseline to 27.0% (a 7.5% relative reduction), compared with 27.7% (5.1% relative) when general unlabeled data were used; the studio WER also improved from 16.4% to 15.7% (4.3% relative).
Table 8  Comparison of feature functions (%)

    Training                Word   POS    Word + POS
    Supervised (risk)       22.2   22.6   22.3
    Unsupervised (risk)     21.5   22.4   21.5
    Semi-supervised (risk)  21.5   21.8   20.9

4.3 Feature Functions

Table 8 compares the word n-gram, POS n-gram, and combined feature sets. For the unsupervised model, word features alone reached 21.5% WER while POS features alone gave 22.4%, and combining them did not improve on 21.5% (the supervised model with the combined set gave 22.3%). Only the semi-supervised model benefited clearly from the combination, achieving the best WER of 20.9%.

5. Conclusion

A remaining direction is the discriminative joint estimation of acoustic and language models [16].

References

[1] (in Japanese), Vol. 63, No. 3, pp. 331–338 (2008).
[2] (in Japanese), IEICE Trans., Vol. J93-D, No. 10, pp. 2085–2095 (2010).
[3] Roark, B., Saraclar, M. and Collins, M.: Discriminative n-gram language modeling, Computer Speech and Language, Vol. 21, pp. 373–392 (2007).
[4] (in Japanese), IEICE Trans., Vol. J93-D, No. 5, pp. 598–609 (2010).
[5] Kobayashi, A., Oku, T., Homma, S., Imai, T. and Nakagawa, S.: Lattice-based risk minimization training for unsupervised language model adaptation, Proc. Interspeech, pp. 1453–1456 (2011).
[6] Kobayashi, A., Oku, T., Imai, T. and Nakagawa, S.: Multi-objective optimization for semi-supervised discriminative language modeling, Proc. IEEE ICASSP, pp. 4997–5000 (2012).
[7] Kobayashi, A., Oku, T., Imai, T. and Nakagawa, S.: Risk-based semi-supervised discriminative language modeling for broadcast transcription, IEICE Trans. Inf. & Syst., Vol. E95-D, No. 11 (2012, in press).
[8] Goel, V. and Byrne, W.: Minimum Bayes-risk automatic speech recognition, Computer Speech and Language, Vol. 14, pp. 115–135 (2000).
[9] Marler, R. T. and Arora, J. S.: Survey of multi-objective optimization methods for engineering, Structural and Multidisciplinary Optimization, Vol. 26, pp. 369–395 (2004).
[10] Povey, D. and Woodland, P. C.: Minimum phone error and I-smoothing for improved discriminative training, Proc. ICASSP, pp. I-105–108 (2002).
[11] Grandvalet, Y. and Bengio, Y.: Semi-supervised learning by entropy minimization, Advances in Neural Information Processing Systems, pp. 529–536 (2005).
[12] Miettinen, K.: Nonlinear Multiobjective Optimization, Springer (1999).
[13] Snyman, J.: Practical Mathematical Optimization, Springer (2005).
[14] (in Japanese), No. 2-P-35(a) (2011).
[15] Liu, D. and Nocedal, J.: On the limited memory BFGS method for large scale optimization, Mathematical Programming, Vol. 45, No. 3, pp. 503–528 (1989).
[16] Lehr, M. and Shafran, I.: Discriminatively estimated joint acoustic, duration and language model for speech recognition, Proc. ICASSP, pp. 5542–5545 (2010).