FUNDAMENTALS OF SPEECH SYNTHESIS BASED ON HMM

Keiichi Tokuda
Department of Computer Science, Nagoya Institute of Technology
Gokiso-cho, Showa-ku, Nagoya, 466-8555 Japan

Abstract: The increasing availability of large speech databases makes it possible to construct speech synthesis systems by applying unit selection and statistical learning algorithms, an approach referred to as corpus-based synthesis. In constructing such systems, hidden Markov models (HMMs) have come into wide use. This paper surveys these uses of HMMs, with emphasis on an approach in which synthetic speech is generated from the HMMs themselves.

Keywords: speech synthesis, text-to-speech conversion, hidden Markov model (HMM), corpus-based approach
1. Introduction

In speech recognition, hidden Markov models (HMMs) have become the dominant acoustic model, and a rich set of techniques has grown up around them: context-dependent modeling [1], dynamic (delta) features [2], mixture output densities [3], parameter tying and tree-based clustering [4], and speaker adaptation and normalization [5], [6]. These techniques make HMM-based systems data-driven and trainable, and the same machinery can be exploited for corpus-based speech synthesis. HMMs have been applied to speech synthesis in roughly four ways:

(1) automatic segmentation of a speech database for building a concatenation inventory (e.g., [7]);
(2) trainable construction of the synthesis unit inventory itself (e.g., [8], [9]);
(3) selection of unit instances from a large database based on HMM clustering (e.g., [10], [11]);
(4) generation of speech parameters directly from the HMMs (e.g., [12]-[14]).

In approaches (1)-(3), the selected waveform segments are concatenated, typically with prosodic modification such as PSOLA [15]; in approach (4), the output is vocoded speech synthesized from the HMM parameters. This paper reviews these approaches, with emphasis on (4). Section 2 reviews the HMM itself, Section 3 describes segmentation and unit selection based on HMMs, Section 4 describes speech parameter generation from HMMs, and Section 5 concludes the paper.

2. Hidden Markov models (HMMs)

2.1 Definition

An HMM generates an observation sequence through transitions between hidden states: state i emits an observation vector o_t according to an output probability distribution b_i(o_t), and state transitions occur with probabilities a_ij = P(q_t = j | q_{t-1} = i), where q_t denotes the state occupied at frame t. In speech processing, o_t is typically a spectral feature vector such as mel-frequency cepstral coefficients (MFCC) [16] or LPC-derived parameters. A common choice of output distribution is a single Gaussian,

  b_i(o) = N(o | μ_i, U_i) = (2π)^{-N/2} |U_i|^{-1/2} exp{ -(1/2)(o - μ_i)' U_i^{-1} (o - μ_i) }   (1)

where N is the dimensionality of o, and μ_i and U_i are the mean vector and covariance matrix of state i.

[Fig. 1: An example left-to-right HMM, with self-transition probabilities a_11, a_22, a_33, transition probabilities a_12, a_23, initial probability π_1, and output distributions b_1(o_t), b_2(o_t), b_3(o_t) over the observation sequence o_1, o_2, ..., o_T.]
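As a concrete illustration of Eq. (1), the following minimal sketch (not from the paper; the function name and the diagonal-covariance restriction are my assumptions) computes the log output probability of one state:

```python
# Log of Eq. (1) for one HMM state, assuming a diagonal covariance matrix,
# so that U_i is represented by its variance vector `var`.
import numpy as np

def log_output_prob(o, mu, var):
    """log b_i(o) = log N(o | mu, diag(var))."""
    N = o.shape[0]                           # dimensionality of the observation
    diff = o - mu
    return -0.5 * (N * np.log(2.0 * np.pi)
                   + np.sum(np.log(var))     # log |U_i| for diagonal U_i
                   + np.sum(diff * diff / var))
```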
[Fig. 2: Trellis of state sequences for a 4-state left-to-right HMM over an observation sequence of length T = 8; each path through the trellis is one state sequence Q, contributing one term a_{q_{t-1} q_t} b_{q_t}(o_t) per frame.]

An HMM is thus specified by the parameter set λ = (A, B, π), where π = {π_i} are the initial state probabilities, A = {a_ij} the state-transition probabilities, and B = {b_i(·)} the output distributions. For a state sequence Q = {q_1, q_2, ..., q_T} and an observation sequence O = (o_1, o_2, ..., o_T), the joint probability is

  P(O, Q | λ) = ∏_{t=1}^{T} a_{q_{t-1} q_t} b_{q_t}(o_t)   (2)

where a_{q_0 i} = π_i. The likelihood of O given λ is obtained by summing over all possible state sequences:

  P(O | λ) = Σ_{all Q} P(O, Q | λ) = Σ_{all Q} ∏_{t=1}^{T} a_{q_{t-1} q_t} b_{q_t}(o_t)   (3)

which can be evaluated efficiently by the forward algorithm over the trellis of Fig. 2.

2.2 Training

Given training data O, the model parameters are estimated so as to maximize the likelihood (3):

  λ_max = arg max_λ P(O | λ)   (4)

No closed-form solution exists; λ is re-estimated iteratively with the EM (Baum-Welch) algorithm, typically initialized by a segmental k-means procedure. In practice λ is trained from a set of utterances {O^(1), O^(2), ..., O^(m)}, and for continuous speech the HMMs corresponding to the transcription of each utterance are concatenated and trained jointly (embedded training).

2.3 Recognition

Given I HMMs λ_1, λ_2, ..., λ_I (for example, one per word) and an input observation sequence O, the recognizer selects

  i_max = arg max_i P(λ_i | O) = arg max_i P(O | λ_i) P(λ_i) / P(O) = arg max_i P(O | λ_i)   (5)

where the last equality assumes equal priors P(λ_i). In practice the sum (3) is approximated by its largest term,

  P(O | λ_i) = Σ_Q P(O, Q | λ_i) ≈ max_Q P(O, Q | λ_i)   (6)

which is computed efficiently by the Viterbi algorithm; the maximizing Q also gives the most likely alignment of frames to states.
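The Viterbi approximation of Eq. (6) can be sketched as follows (a minimal illustration, assuming log-domain inputs; the array names are mine, not the paper's):

```python
# Viterbi search: max over state sequences instead of the full sum of Eq. (3).
# log_b[t, i] holds log b_i(o_t); log_A and log_pi are log transition and
# initial probabilities.
import numpy as np

def viterbi(log_pi, log_A, log_b):
    T, N = log_b.shape
    delta = log_pi + log_b[0]                # best log-score ending in each state
    psi = np.zeros((T, N), dtype=int)        # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A      # scores[i, j]: come from i, go to j
        psi[t] = np.argmax(scores, axis=0)
        delta = scores[psi[t], np.arange(N)] + log_b[t]
    q = np.zeros(T, dtype=int)
    q[-1] = int(np.argmax(delta))
    for t in range(T - 2, -1, -1):           # trace back the best path
        q[t] = psi[t + 1, q[t + 1]]
    return q, float(np.max(delta))           # alignment and max_Q log P(O, Q | λ)
```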
2.4 Context-dependent modeling

For synthesis and large-vocabulary recognition, phoneme-sized HMMs are used, and the spectral realization of a phoneme depends heavily on its phonetic context. Context-dependent models such as triphones are therefore trained: for example, the phoneme sequence g, e, N, j, i, ts, u embedded in an utterance gives rise to the triphone labels u-g+e, g-e+N, e-N+j, N-j+i, j-i+ts, i-ts+u, ts-u+o, where each center phoneme is annotated with its left (-) and right (+) neighbors. Since the number of distinct contexts is combinatorially large, most of them never occur in a finite training set; model parameters are therefore shared (tied) across contexts, typically by decision-tree clustering of states (Fig. 3) [4], which also yields models for contexts unseen in training.

[Fig. 3: Decision-tree state clustering; triphones such as k-a+n, t-a+n, i-a+t descend the tree according to yes/no (Y/N) answers to phonetic context questions and share the distribution at the leaf they reach.]

For further details of HMMs, see, e.g., the textbooks [17]-[23]; freely available toolkits such as HTK [24] implement the techniques above.

3. Use of HMMs in concatenative synthesis

3.1 Automatic segmentation

When trained HMMs are aligned against an utterance whose transcription is known, the Viterbi alignment of Section 2.3, i.e., the state sequence Q maximizing P(O, Q | λ), yields phoneme and state boundaries. Such forced alignment replaces laborious manual labeling when a large database must be segmented into synthesis units [7], and has been used to build inventories of diphone and phone units [25], [26] as well as half-phone or HMM-state-sized units [11].
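As a toy illustration (mine, not the paper's) of the context-dependent labels that both the clustering of Section 2.4 and the alignment of Section 3.1 operate on, the following expands a phoneme string into triphone labels; padding with "sil" at the utterance edges is an assumption:

```python
# Expand a phoneme sequence into triphone labels of the form left-center+right.
def to_triphones(phones, edge="sil"):
    labels = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else edge
        right = phones[i + 1] if i + 1 < len(phones) else edge
        labels.append(f"{left}-{p}+{right}")
    return labels

print(to_triphones(["g", "e", "N", "j", "i", "ts", "u"]))
# ['sil-g+e', 'g-e+N', 'e-N+j', 'N-j+i', 'j-i+ts', 'i-ts+u', 'ts-u+sil']
```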
3.2 Unit selection

When the database is large, many instances of each unit type are available, while some target contexts remain unseen ([27]; see also [28]). Candidate instances are evaluated with two kinds of cost: (a) a target cost (context matching score), which measures how well the context of a stored instance matches the context required by the input text, and (b) a concatenation cost (also called discontinuity or continuity cost), which measures the acoustic mismatch at the join between consecutive instances ([29], [28]). Two broad schemes are used, as sketched after this paragraph: (i) both costs (a) and (b) are evaluated at synthesis time, and the sequence of instances minimizing the total cost is found by dynamic programming (DP) over the lattice of candidates ([30], [28], [26]); (ii) the instances are clustered in advance according to their contexts, e.g., with decision trees, so that cost (a) reduces to selecting the cluster matching the target context, and DP over cost (b) chooses one instance from each cluster ([9], [11], [31], [32]). Scheme (ii) copes naturally with unseen contexts, since the trees generalize to contexts absent from the database, and it also prunes the DP search space. Furthermore, the decision-tree clustered context-dependent HMMs of Section 2.4 can serve directly as the clusters in scheme (ii), with instances aligned to HMM states and selected at the state level ([9], [11]). Broader reviews of corpus-based synthesis can be found in [33], [34].
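A minimal sketch of the DP search of scheme (i) (illustrative only; `candidates`, `target_cost`, and `concat_cost` are hypothetical stand-ins for a real system's data structures):

```python
# Select one unit instance per target position, minimizing the sum of
# target costs (a) and concatenation costs (b) by dynamic programming.
def select_units(candidates, target_cost, concat_cost):
    """candidates: list (per target position) of lists of unit instances."""
    # best: list of (cumulative cost, unit path) pairs, one per candidate.
    best = [(target_cost(0, u), [u]) for u in candidates[0]]
    for t in range(1, len(candidates)):
        new_best = []
        for u in candidates[t]:
            # Cheapest way to reach u from any path ending at position t-1.
            c, p = min(((c + concat_cost(p[-1], u), p) for c, p in best),
                       key=lambda x: x[0])
            new_best.append((c + target_cost(t, u), p + [u]))
        best = new_best
    return min(best, key=lambda x: x[0])     # (total cost, unit sequence)
```

With K candidates per position the search is O(T K^2), which is one motivation for the pre-clustering of scheme (ii).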
4. Speech parameter generation from HMMs

4.1 Problem formulation

In the approach of [12]-[14], speech parameters are generated directly from the HMMs. Given a model λ (e.g., a sentence HMM composed of context-dependent phoneme models) and a number of frames T, we seek the observation sequence

  O_max = arg max_O P(O | λ, T)   (7)

As in (6), the sum over state sequences is approximated by its dominant term:

  O_max = arg max_O Σ_Q P(O, Q | λ, T) ≈ arg max_O max_Q P(O, Q | λ, T)   (8)

Factoring

  P(O, Q | λ, T) = P(O | Q, λ, T) P(Q | λ, T)   (9)

and noting that P(Q | λ, T) does not depend on O, the maximization is carried out in two steps: first the state sequence

  Q_max = arg max_Q P(Q | λ, T)   (10)

is determined, and then

  O_max = arg max_O P(O | Q_max, λ)   (11)

4.2 Solution of (11)

From (1) and (2), for a fixed state sequence Q,

  log P(O | Q, λ) = log ∏_{t=1}^{T} b_{q_t}(o_t) = -(1/2)(O - M)' U^{-1} (O - M) - (1/2) log |U| + Const   (12)

where

  O = [o_1', o_2', ..., o_T']'   (13)
  M = [μ_{q_1}', μ_{q_2}', ..., μ_{q_T}']'   (14)
  U = diag[U_{q_1}, U_{q_2}, ..., U_{q_T}]   (15)

and μ_{q_t} and U_{q_t} are the mean vector and covariance matrix of state q_t. Maximized without further constraints, (12) simply gives O = M: the generated parameters follow the state means and jump discontinuously at state transitions [12].

Smooth trajectories are obtained by including dynamic features in the observation vector, o_t = [c_t', Δc_t', Δ²c_t']', where c_t is the static feature vector and the delta and delta-delta features are windowed combinations of neighboring static vectors:

  Δc_t  = Σ_{τ=-L_-^{(1)}}^{L_+^{(1)}} w^{(1)}(τ) c_{t+τ}   (16)
  Δ²c_t = Σ_{τ=-L_-^{(2)}}^{L_+^{(2)}} w^{(2)}(τ) c_{t+τ}   (17)

with window coefficients w^{(1)}(τ), w^{(2)}(τ). Relations (16) and (17) can be collected into the linear constraint

  O = W C   (18)

where

  C = [c_1', c_2', ..., c_T']'   (19)

If c_t is M-dimensional, then C is a TM-dimensional vector, O is 3TM-dimensional, and W is a 3TM × TM matrix whose elements are 0, 1, and the window coefficients w^{(1)}(τ), w^{(2)}(τ). Maximizing P(O | Q, λ) under the constraint (18) with respect to C, i.e., setting

  ∂ log P(WC | Q, λ) / ∂C = 0   (20)

yields the system of linear equations

  W' U^{-1} W C = W' U^{-1} M   (21)

Since W' U^{-1} W is a TM × TM matrix, solving (21) naively requires O(T³M³) operations (O(T³M) when the U_{q_t} are diagonal). Because W' U^{-1} W is banded, however, (21) can be solved by Cholesky or QR decomposition in O(TM³L²) operations (O(TML²) for diagonal covariances), where L = max{L_-^{(1)}, L_+^{(1)}, L_-^{(2)}, L_+^{(2)}}; in the special case L_-^{(1)} = 1, L_+^{(1)} = 0, w^{(2)}(τ) ≡ 0, the solution can be obtained in O(TM) operations [35]. Algorithms for solving (21) are given in [36], [37] (see also the application to vector quantization in [38]); algorithms addressing (8) and (7), including the case of mixture output densities, are given in [39].

[Fig. 4: Spectra generated from HMMs for the phoneme sequence sil, a, i, sil (0-5 kHz) [36]: (a) without dynamic features, the spectra stay constant within each state; (b) with dynamic features, they vary smoothly across state and phoneme boundaries.]
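The following sketch illustrates (16)-(21) for a one-dimensional static feature with diagonal covariances; the delta window w^{(1)} = (-0.5, 0, 0.5), the omission of delta-delta terms, and the dense solver are simplifying assumptions of mine (the fast algorithms of [36], [37] exploit the band structure instead):

```python
# Solve W' U^{-1} W C = W' U^{-1} M  (Eq. (21)) for a scalar static feature.
# mu and var are (T, 2) arrays holding [static, delta] means and variances
# read off the state sequence Q_max.
import numpy as np

def generate_trajectory(mu, var):
    T = mu.shape[0]
    W = np.zeros((2 * T, T))                  # rows: static, delta per frame
    for t in range(T):
        W[2 * t, t] = 1.0                     # static row: c_t itself
        W[2 * t + 1, max(t - 1, 0)] += -0.5   # delta row: 0.5 (c_{t+1} - c_{t-1})
        W[2 * t + 1, min(t + 1, T - 1)] += 0.5
    M = mu.reshape(-1)                        # stacked mean supervector, Eq. (14)
    Uinv = 1.0 / var.reshape(-1)              # diagonal U^{-1}
    A = W.T @ (Uinv[:, None] * W)             # W' U^{-1} W  (banded, T x T)
    b = W.T @ (Uinv * M)                      # W' U^{-1} M
    return np.linalg.solve(A, b)              # smoothed static trajectory C
```

Here np.linalg.solve costs O(T³); exploiting the band structure of W' U^{-1} W, as in the paper, reduces the cost to time linear in T.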
4.3 Solution of (10)

Consider next the state sequence Q_max of (10). Suppose the state sequence stays in state i for d_i frames (i = 1, 2, ..., K, where K is the number of states visited), so that Σ_{i=1}^{K} d_i = T. If the duration of each state is modeled by an explicit probability density p_i(d_i), then

  log P(Q | λ, T) = Σ_{i=1}^{K} log p_i(d_i)   (22)

In an ordinary HMM, the self-transitions implicitly define a geometric duration density,

  p_i(d) = a_ii^{d-1} (1 - a_ii)   (23)

which decreases monotonically in d and is a poor model of actual segment durations; maximizing (22) with the densities (23) concentrates all remaining frames in the state with the largest self-transition probability. If instead each p_i(d) is modeled by a Gaussian with mean m_i and variance σ_i², the durations {d_i} maximizing (22) under the constraint Σ_{i=1}^{K} d_i = T are obtained in closed form [40]:

  d_i = m_i + ρ σ_i²   (24)

  ρ = ( T - Σ_{k=1}^{K} m_k ) / Σ_{k=1}^{K} σ_k²   (25)

where m_i and σ_i² are the mean and variance of the duration density of state i. When the total length T is given, ρ follows from (25); conversely, T need not be specified at all, since ρ can serve directly as a speaking-rate control parameter, with ρ = 0 yielding the average speaking rate (d_i = m_i). A sketch of this rule is given after Section 4.5.

4.4 Modeling of F0

The F0 (pitch) contour takes continuous values in voiced regions but is undefined in unvoiced regions, so it cannot be modeled directly by conventional discrete or continuous HMMs. The HMM based on multi-space probability distributions (MSD-HMM) [41] handles such observation sequences, including discrete and continuous HMMs as special cases, and allows F0 patterns with voiced/unvoiced decisions to be modeled in the same framework as the spectrum; spectrum, F0, and state duration can then be modeled simultaneously in a unified set of context-dependent HMMs [42].

4.5 HMM-based speech synthesis system

Combining Sections 4.1-4.4 yields an HMM-based speech synthesis system [27]. In training, context-dependent HMMs with decision-tree clustered states are estimated from a labeled database. In synthesis, the input text is converted into a context-dependent label sequence, a sentence HMM is composed by concatenating the corresponding models, state durations are set by (24) and (25), spectral and F0 parameter sequences are generated as in Section 4.2, and the waveform is synthesized by exciting a synthesis filter with the generated parameters. Segment durations can also be predicted by separate statistical models, e.g., [43], tree-based duration modeling [44], or constrained tree regression (CTR) [45]. The approach is closely related to the type-(ii) unit-selection systems of Section 3.2 ([9], [11]), which likewise rest on decision-tree clustered context-dependent HMMs; the essential difference is that those systems concatenate stored waveform instances, whereas here the parameter sequence is generated from the HMM distributions themselves and vocoded (on the relation between the two approaches, see also [46]).
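A small sketch of the duration rule (24)-(25) referenced above (the rounding to integer frame counts is a practical detail I have added; it can make the rounded durations sum to slightly more or less than T):

```python
# Distribute T frames over K states whose durations follow N(m_i, sigma_i^2).
import numpy as np

def state_durations(m, sigma2, T):
    rho = (T - np.sum(m)) / np.sum(sigma2)        # Eq. (25)
    d = m + rho * sigma2                          # Eq. (24)
    return np.maximum(1, np.rint(d)).astype(int)  # at least one frame per state

# Three states with mean durations 5, 8, 6 frames stretched to T = 25:
print(state_durations(np.array([5.0, 8.0, 6.0]), np.array([1.0, 4.0, 2.0]), 25))
# [ 6 11  8]
```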
5. Conclusion

This paper has reviewed how HMMs are used in corpus-based speech synthesis: for automatic segmentation of speech databases, for trainable construction of unit inventories, for clustering and selection of unit instances, and for generating speech parameters directly from the models. The generation-based approach makes the entire system statistically trainable from a corpus and is closely related to HMM-based unit selection, differing mainly in whether stored waveforms or model-generated parameters are output.

References

[1] R. Schwartz, Y.-L. Chow, O. Kimball, S. Roucos, M. Krasner, and J. Makhoul, "Context-dependent modeling for acoustic-phonetic recognition of continuous speech," Proc. ICASSP, pp. 1205-1208, 1985.
[2] S. Furui, "Speaker-independent isolated word recognition using dynamic features of speech spectrum," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-34, pp. 52-59, 1986.
[3] B.-H. Juang, "Maximum-likelihood estimation for mixture multivariate stochastic observations of Markov chains," AT&T Technical Journal, vol. 64, no. 6, pp. 1235-1249, 1985.
[4] J. J. Odell, The Use of Context in Large Vocabulary Speech Recognition, PhD dissertation, Cambridge University, 1995.
[5] C.-H. Lee, C.-H. Lin, and B.-H. Juang, "A study on speaker adaptation of the parameters of continuous density hidden Markov models," IEEE Trans. Signal Processing, vol. 39, no. 4, pp. 806-814, Apr. 1991.
[6] M. J. F. Gales and P. C. Woodland, "Mean and variance adaptation within the MLLR framework," Computer Speech and Language, vol. 10, no. 4, pp. 249-264, 1996.
[7] A. Ljolje, J. Hirschberg, and J. P. H. van Santen, "Automatic speech segmentation for concatenative inventory selection," in Progress in Speech Synthesis, ed. J. P. H. van Santen, R. W. Sproat, J. P. Olive, and J. Hirschberg, Springer-Verlag, New York, 1997.
[8] R. E. Donovan and P. C. Woodland, "Automatic speech synthesiser parameter estimation using HMMs," Proc. ICASSP, pp. 640-643, 1995.
[9] X. Huang, A. Acero, H. Hon, Y. Ju, J. Liu, S. Meredith, and M. Plumpe, "Recent improvements on Microsoft's trainable text-to-speech system - Whistler," Proc. ICASSP, 1997.
[10] H. Hon, A. Acero, X. Huang, J. Liu, and M. Plumpe, "Automatic generation of synthesis units for trainable text-to-speech synthesis," Proc. ICASSP, 1998.
[11] R. E. Donovan and E. M. Eide, "The IBM trainable speech synthesis system," Proc. ICSLP, vol. 5, pp. 1703-1706, 1998.
[12] A. Falaschi, M. Giustiniani, and M. Verola, "A hidden Markov model approach to speech synthesis," Proc. EUROSPEECH, pp. 87-90, 1989.
[13] M. Giustiniani and P. Pierucci, "Phonetic ergodic HMM for speech synthesis," Proc. EUROSPEECH, pp. 349-352, 1991.
[14] T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai, "Speech synthesis based on HMMs using dynamic features" (in Japanese), IEICE Trans. (D-II), vol. J79-D-II, no. 12, pp. 2184-2190, Dec. 1996.
[15] E. Moulines and F. Charpentier, "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Communication, vol. 9, pp. 453-467, 1990.
[16] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-28, no. 4, pp. 357-366, Aug. 1980.
[17] (in Japanese), 1988.
[18] (in Japanese), 1995.
[19] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition (Japanese translation), NTT Advanced Technology, 1995.
[20] (in Japanese), 1996.
[21] (in Japanese), 1997.
[22] (in Japanese), 1998.
[23] X. D. Huang, Y. Ariki, and M. A. Jack, Hidden Markov Models for Speech Recognition, Edinburgh University Press, Edinburgh, 1990.
[24] The HTK toolkit, http://htk.eng.cam.ac.uk/
[25] Y. Sagisaka, N. Kaiki, N. Iwahashi, and K. Mimura, "ATR ν-talk speech synthesis system," Proc. ICSLP, pp. 483-486, 1992.
[26] M. Beutnagel, A. Conkie, J. Schroeter, Y. Stylianou, and A. Syrdal, "The AT&T Next-Gen TTS system," Proc. Joint Meeting of ASA, EAA, and DAGA, pp. 5-9, Mar. 1999.
[27] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and state duration in HMM-based speech synthesis" (in Japanese), IEICE Trans. (D-II), vol. J83-D-II, no. 11, Nov. 2000.
[28] A. W. Black and N. Campbell, "Optimising selection of units from speech databases for concatenative synthesis," Proc. EUROSPEECH, pp. 581-584, Sep. 1995.
[29] (in Japanese), IEICE Tech. Rep., SP91-5, 1991.
[30] A. G. Hauptmann, "SpeakEZ: A first experiment in concatenation synthesis from a large corpus," Proc. EUROSPEECH, pp. 1701-1704, 1993.
[31] A. W. Black and P. Taylor, "Automatically clustering similar units for unit selection in speech synthesis," Proc. EUROSPEECH, pp. 601-604, Sep. 1997.
[32] M. W. Macon, A. E. Cronk, and J. Wouters, "Generalization and discrimination in tree-structured unit selection," Proc. ESCA/COCOSDA Workshop on Speech Synthesis, Nov. 1998.
[33] (in Japanese), Journal of Signal Processing, vol. 2, no. 6, Nov. 1998.
[34] (in Japanese), IPSJ Magazine, vol. 41, no. 3, Mar. 2000.
[35] A. Acero, "Formant analysis and synthesis using hidden Markov models," Proc. EUROSPEECH, Budapest, Hungary, pp. 1047-1050, 1999.
[36] K. Tokuda, T. Kobayashi, and S. Imai, "Speech parameter generation from HMM using dynamic features," Proc. ICASSP, pp. 660-663, 1995.
[37] K. Tokuda, T. Masuko, T. Yamada, T. Kobayashi, and S. Imai, "An algorithm for speech parameter generation from continuous mixture HMMs with dynamic features," Proc. EUROSPEECH, pp. 757-760, 1995.
[38] K. Koishida, K. Tokuda, T. Masuko, and T. Kobayashi, "Vector quantization of speech spectral parameters using statistics of dynamic features," Proc. ICSP, vol. 1, pp. 247-252, Aug. 1997.
[39] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," Proc. ICASSP, Istanbul, Turkey, June 2000.
[40] (in Japanese), J. Acoust. Soc. Jpn., vol. 53, no. 3, pp. 192-200, Mar. 1997.
[41] (in Japanese), IEICE Tech. Rep., SP98-1, pp. 19-26, Apr. 1998.
[42] (in Japanese), IEICE Tech. Rep., SP98-2, pp. 27-34, Apr. 1998.
[43] (in Japanese), J. Acoust. Soc. Jpn., vol. 49, no. 10, pp. 682-690, Oct. 1993.
[44] M. Riley, "Tree-based modelling of segmental duration," in Talking Machines: Theories, Models, and Designs, Elsevier Science Publishers, pp. 265-273, 1992.
[45] N. Iwahashi and Y. Sagisaka, "Statistical modelling of speech segment duration by constrained tree regression," IEICE Trans. Inf. & Syst., vol. E83-D, no. 7, pp. 1550-1559, July 2000.
[46] (in Japanese), IEICE Tech. Rep., SP99-6, pp. 48-54, Aug. 1999.