A Study on HMM-Based Speech Synthesis Using Rich Context Models

Shinnosuke Takamichi 1,a)  Tomoki Toda 1,b)  Yoshinori Shiga 2  Hisashi Kawai 3  Sakriani Sakti 1  Graham Neubig 1  Satoshi Nakamura 1

Abstract: In this paper, we propose parameter generation methods using rich context models in HMM-based speech synthesis as yet another hybrid method combining HMM-based speech synthesis and unit selection synthesis. In traditional HMM-based speech synthesis, the generated speech parameters tend to be excessively smoothed, which causes muffled sounds in synthetic speech. To alleviate this problem, several hybrid methods have been proposed. Although they significantly improve the quality of synthetic speech by directly using natural waveform segments, they usually lose flexibility in converting synthetic voice characteristics. In the proposed methods, rich context models representing individual acoustic parameter segments are reformed as GMMs, and a speech parameter sequence is generated from them using the parameter generation algorithm based on the maximum likelihood criterion. Since the basic framework of the proposed methods is the same as the traditional framework, the capability of flexibly modeling acoustic features remains. We conduct several experimental evaluations of the proposed methods from various perspectives. The experimental results demonstrate that the proposed methods yield significant improvements in the quality of synthetic speech.

Keywords: HMM-based speech synthesis, over-smoothing, rich context model, parameter generation

1 Nara Institute of Science and Technology (NAIST)
2 NICT
3 KDDI R&D Laboratories
a) shinnosuke-t@is.naist.jp
b) tomoki@is.naist.jp

© 2012 Information Processing Society of Japan

1. Introduction

Text-To-Speech (TTS) synthesis converts arbitrary input text into speech.
Unit selection synthesis [1], [2] generates speech by concatenating natural waveform segments and can achieve high segmental quality. Statistical parametric synthesis based on the Hidden Markov Model (HMM) [3], which jointly models spectral and F0 features, instead offers compact models and flexible control of voice characteristics, e.g., speaker interpolation [4], speaker adaptation [5], and style control [6]. However, the speech parameters generated from HMMs tend to be excessively smoothed, and the resulting synthetic speech sounds muffled. Hybrid methods combining the two frameworks, such as the system in [7], improve quality by directly using natural segments, but they usually sacrifice the flexibility of the statistical framework. Rich context models [8], which represent the individual acoustic parameter segments in the training data, have been proposed to alleviate over-smoothing within the statistical framework. In this paper, we propose parameter generation methods that reform rich context models as Gaussian Mixture Models (GMMs) and generate a speech parameter sequence from them with the maximum-likelihood parameter generation algorithm. Section 2 reviews HMM-based speech synthesis and Section 3 reviews rich context models; Section 4 describes the proposed methods, and Section 5 presents experimental evaluations.

2. HMM-Based Speech Synthesis

In conventional HMM-based speech synthesis [9], the output probability distribution of the context-dependent model for context c is a single Gaussian:

  b_c(o_t) = N(o_t; μ_c, Σ_c)  (1)

where the observation vector at frame t, o_t = [c_t⊤, Δc_t⊤, Δ²c_t⊤]⊤, consists of the static feature vector c_t and its dynamic (delta and delta-delta) features, and N(·; μ_c, Σ_c) denotes a Gaussian distribution with mean vector μ_c and covariance matrix Σ_c. At synthesis time, a speech parameter sequence is generated from the trained models under the constraint o = Wc between the observation sequence o and the static feature sequence c.
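As a concrete illustration, the maximum-likelihood generation step under the constraint o = Wc can be sketched as follows for a single one-dimensional stream. This is a minimal sketch, not the paper's implementation; the window coefficients ([1], 0.5·[-1, 0, 1], [1, -2, 1]) and the diagonal covariance are assumptions for illustration.

```python
import numpy as np

def build_window_matrix(T):
    """Window matrix W mapping a 1-D static sequence c (length T) to
    o = W c, stacking [static, delta, delta-delta] rows per frame.
    Window coefficients are common choices, assumed for illustration."""
    W = np.zeros((3 * T, T))
    for t in range(T):
        W[3 * t, t] = 1.0                      # static window
        if 0 < t < T - 1:                      # skip boundary frames
            W[3 * t + 1, t - 1] = -0.5         # delta window
            W[3 * t + 1, t + 1] = 0.5
            W[3 * t + 2, t - 1] = 1.0          # delta-delta window
            W[3 * t + 2, t] = -2.0
            W[3 * t + 2, t + 1] = 1.0
    return W

def mlpg(mu, var):
    """Maximum-likelihood parameter generation for one stream:
    maximize N(W c; mu, diag(var)) over c by solving the normal
    equations (W' P W) c = W' P mu, where P = diag(var)^-1."""
    T = len(mu) // 3
    W = build_window_matrix(T)
    P = np.diag(1.0 / np.asarray(var, dtype=float))
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ mu)
```

Here mu stacks the per-frame mean vectors in [static, delta, delta-delta] order and var the corresponding diagonal variances; because the static window contributes an identity block, W has full column rank and the normal equations have a unique solution.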
Here c = [c_1⊤, …, c_T⊤]⊤ is the static feature sequence over T frames, and W is the window matrix appending the dynamic features [10]. The maximum-likelihood parameter generation algorithm determines c so that the likelihood of o = Wc given the sentence HMM is maximized.

3. Rich Context Models

Rich context models [8] represent the acoustic parameter segments of individual phoneme instances in the training data. For context c, the output distribution of the m-th rich context model is

  b_{c,m}(o_t) = N(o_t; μ_{c,m}, Σ_c)  (2)

where the mean vector μ_{c,m} is re-estimated for each individual segment with the forward-backward algorithm, while the covariance matrix Σ_c is tied to that of the conventional context-dependent model. The Kullback-Leibler divergence (KLD) between the resulting Gaussians can be used to measure the similarity of rich context models.

4. Proposed Parameter Generation Methods

Figure 1 shows the training and synthesis processes of the proposed methods.

Fig. 1 Training and synthesis processes in proposed methods

4.1 Reforming Rich Context Models as a GMM
In synthesis, the rich context models of context c are reformed as a GMM:

  b_c(o_t) = Σ_{m=1}^{M_c} ω_m N(o_t; μ_{c,m}, Σ_c)  (3)

where ω_m is the weight of the m-th mixture component and M_c is the number of rich context models for context c. Although the weights could be estimated with the forward-backward algorithm, we simply set them to be uniform, ω_m = 1/M_c.

4.2 Parameter Generation
Given the state sequence q = [q_1, …, q_T] determined by the HMM state duration models, the output likelihood is written as the marginal over mixture component sequences m = [m_1, …, m_T]:

  P(o | q, λ) = Σ_m P(o, m | q, λ)  (4)

where o = [o_1⊤, …, o_T⊤]⊤ and λ denotes the set of model parameters. Under the constraint o = Wc, the speech parameter sequence is generated, following the parameter generation algorithm [10], as

  ĉ = argmax_c Σ_m P(Wc, m | q, λ)  (5)

which is solved iteratively with the EM algorithm.
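The GMM formed in Eq. (3), with uniform weights and a single shared diagonal covariance, gives the following frame likelihood; this is a minimal sketch under those assumptions, using a log-sum-exp for numerical stability.

```python
import numpy as np

def gmm_frame_loglik(o_t, mus, var):
    """Log-likelihood of one observation frame under the GMM formed
    from the M_c rich context models of a context (Eq. (3)): uniform
    weights 1/M_c, per-component means mus (M x d), and one shared
    diagonal covariance diag(var)."""
    mus = np.atleast_2d(mus)                  # (M, d)
    M, d = mus.shape
    diff = o_t[None, :] - mus
    # per-component Gaussian log-densities with the shared covariance
    comp = -0.5 * (d * np.log(2.0 * np.pi) + np.sum(np.log(var))
                   + np.sum(diff ** 2 / var, axis=1))
    # log( (1/M) * sum_m exp(comp_m) ), computed stably
    mmax = comp.max()
    return mmax + np.log(np.exp(comp - mmax).sum()) - np.log(M)
```

With a single component (M_c = 1) this reduces to the plain Gaussian log density of Eq. (2), as expected.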
4.2.1 Generation Based on the EM Algorithm
Equation (5) can be maximized with the Expectation-Maximization (EM) algorithm. In the E-step, the posterior probability P(m | Wc^(i), q, λ) of each mixture component sequence is computed for the current parameter sequence c^(i). In the M-step, the updated sequence c^(i+1) is obtained by maximizing the auxiliary function

  Q(c^(i), c^(i+1)) = Σ_m P(m | Wc^(i), q, λ) ln P(Wc^(i+1), m | q, λ).  (6)

4.2.2 Generation Based on the Best Mixture Sequence
Instead of marginalizing over all mixture component sequences, the output likelihood can be approximated by the single best sequence,

  P(o | q, λ) ≈ max_m P(o, m | q, λ),  (7)

and the mixture component sequence and the parameter sequence are then updated alternately:

  m̂^(i+1) = argmax_m P(m | Wc^(i), q, λ),  (8)
  ĉ^(i+1) = argmax_c P(Wc, m̂^(i+1) | q, λ).  (9)

4.2.3 Initialization
Both iterative procedures are sensitive to the initial parameter sequence. For example, it can be initialized with the sequence generated from the conventional decision-tree-tied models [11], or with segments obtained in the manner of plural unit selection and fusion [12].

5. Experimental Evaluations

5.1 Experimental Conditions
We used the ATR phonetically balanced sentence set [13]: subsets A–I (450 sentences) for training and subset J (53 sentences) for evaluation. Speech signals were sampled at 16 kHz, and spectral envelopes and F0 were extracted by STRAIGHT analysis [14] with a 5 ms frame shift. The acoustic models were 5-state left-to-right Hidden Semi-Markov Models (HSMMs), and parameter generation considering the Global Variance (GV) [15] was also examined. Context clustering of the conventional models used the Minimum Description Length (MDL) criterion [16], and state alignments were obtained with the Viterbi algorithm.

5.2 Evaluation of the Proposed Generation Methods
5.2.1 Effect of the Initial Parameter Sequence
We compared three initializations of the iterative generation: Randomized, Conventional (the sequence generated from the conventional tied models), and Target (the natural target features). Figure 2 plots the log likelihood of the generated features before and after the iteration for each initial feature.

Fig. 2 Effect of the initial parameter sequence (panels (a) and (b): log likelihood before and after iteration for each initial feature)
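As a rough illustration, the alternating procedure of Sect. 4.2.2 (select a mixture component sequence, then regenerate the trajectory from the selected means) might be sketched as follows. The window coefficients, uniform weights, shared diagonal covariance, and the simplified per-frame component selection are all assumptions of this sketch, not the exact formulation.

```python
import numpy as np

def build_window_matrix(T):
    """Static/delta/delta-delta window matrix (coefficients assumed
    for illustration: [1], 0.5*[-1, 0, 1], [1, -2, 1])."""
    W = np.zeros((3 * T, T))
    for t in range(T):
        W[3 * t, t] = 1.0
        if 0 < t < T - 1:
            W[3 * t + 1, [t - 1, t + 1]] = [-0.5, 0.5]
            W[3 * t + 2, [t - 1, t, t + 1]] = [1.0, -2.0, 1.0]
    return W

def generate_best_mixture(cand_mu, var, n_iter=3):
    """Hard-assignment generation sketch: alternate (i) picking, per
    frame, the mixture component (rich context model) that best fits
    the current trajectory, and (ii) regenerating the static sequence
    by maximum-likelihood parameter generation from the selected means.

    cand_mu: (T, M, 3) candidate [static, delta, delta-delta] means
    var:     (3,)      shared diagonal variances
    """
    T, M, _ = cand_mu.shape
    W = build_window_matrix(T)
    P = np.diag(np.tile(1.0 / np.asarray(var, dtype=float), T))
    sel = np.zeros(T, dtype=int)               # initial component sequence
    for _ in range(n_iter):
        mu = cand_mu[np.arange(T), sel].reshape(-1)
        c = np.linalg.solve(W.T @ P @ W, W.T @ P @ mu)   # MLPG step
        o = (W @ c).reshape(T, 3)
        # with uniform weights and a shared covariance, the maximum-
        # posterior component per frame is the one with the smallest
        # Mahalanobis distance to the current observation o_t
        d2 = np.sum((cand_mu - o[:, None, :]) ** 2 / var, axis=2)
        sel = np.argmin(d2, axis=1)
    return c, sel
```

In the paper's formulation the selection uses the posterior given the state sequence and model parameters; the per-frame nearest-component rule above is a simplification of that step.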
5.2.2 Subjective Evaluation of Speech Quality
We compared Conventional (generation from the conventional tied models), Proposed (GMM) (EM-based generation, Sect. 4.2.1), Proposed (single) (generation based on the best mixture sequence, Sect. 4.2.2), and Proposed (target) (best-mixture generation initialized with the target features) in an AB preference test on speech quality. The preference scores are shown in Fig. 3(a).

5.2.3 Mixture Component Selection
We further compared selecting mixture components frame by frame (Frame-based) with constraining the selection to be constant within each HMM state (State-based). Figure 4 shows an example of the selected mixture components, and the corresponding preference scores are shown in Fig. 3(b).

Fig. 3 Preference scores for speech quality
Fig. 4 An example of selected mixture components

5.3 Comparison as a TTS System
Finally, Conventional, Proposed (single), and Proposed (target) were compared in a full TTS setting.
Fig. 5 Spectrograms of Conventional, Proposed (single), Proposed (target), and natural speech (from top)

The preference scores are shown in Fig. 3(c), and Fig. 5 compares the spectrograms of the three methods with that of natural speech. As shown in Figs. 3(a) and 3(c), Proposed (single) and Proposed (target) yielded significant improvements in the quality of synthetic speech over Conventional.

6. Conclusions

In this paper, we proposed parameter generation methods using rich context models in HMM-based speech synthesis, in which the rich context models are reformed as GMMs and a speech parameter sequence is generated from them under the maximum likelihood criterion. Experimental results demonstrated that the proposed methods yield significant improvements in the quality of synthetic speech.

References
[1] Y. Sagisaka: Speech synthesis by rule using an optimal selection of non-uniform synthesis units, Proc. ICASSP, pp. 679–682, 1988.
[2] N. Iwahashi, N. Kaiki, Y. Sagisaka: Speech segment selection for concatenative synthesis based on spectral distortion minimization, IEICE Trans. Fundamentals, Vol. E76-A, No. 11, pp. 1942–1948, 1993.
[3] H. Zen, K. Tokuda, A. Black: Statistical parametric speech synthesis, Speech Commun., Vol. 51, No. 11, pp. 1039–1064, 2009.
[4] T. Yoshimura, T. Masuko, K. Tokuda, T. Kobayashi, T. Kitamura: Speaker interpolation for HMM-based speech synthesis system, J. Acoust. Soc. Jpn. (E), Vol. 21, No. 4, pp. 199–206, 2000.
[5] J. Yamagishi, T. Kobayashi: Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training, IEICE Trans. Inf. and Syst., Vol. E90-D, No. 2, pp. 533–543, 2007.
[6] T. Nose, J. Yamagishi, T. Masuko, T. Kobayashi: A style control technique for HMM-based expressive speech synthesis, IEICE Trans. Inf. and Syst., Vol. E90-D, No. 9, pp. 1406–1413, 2007.
[7] Z. Ling, L. Qin, H. Lu, Y. Gao, L. Dai, R. Wang, Y. Jiang, Z. Zhao, J. Yang, J. Chen, G. Hu: The USTC and iFlytek speech synthesis systems for Blizzard Challenge 2007, Proc. Blizzard Challenge Workshop, 2007.
[8] Z.-J. Yan, Y. Qian, F. K. Soong: Rich context modeling for high quality HMM-based TTS, Proc. INTERSPEECH, pp. 1755–1758, 2009.
[9] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, T. Kitamura: Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis, IEICE Trans. D-II, Vol. J83-D-II, No. 11, pp. 2099–2107, 2000 (in Japanese).
[10] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, T. Kitamura: Speech parameter generation algorithms for HMM-based speech synthesis, Proc. ICASSP, pp. 1315–1318, 2000.
[11] S. Kataoka, N. Mizutani, K. Tokuda, T. Kitamura: Decision-tree backing-off in HMM-based speech synthesis, Proc. INTERSPEECH, 2004.
[12] T. Mizutani, T. Kagoshima: Concatenative speech synthesis based on the plural unit selection and fusion method, IEICE Trans. Inf. and Syst., Vol. E88-D, No. 11, pp. 2565–2574, 2005.
[13] M. Abe, Y. Sagisaka, T. Umeda, H. Kuwabara: Speech database user's manual, ATR Technical Report TR-I-0166, 1990 (in Japanese).
[14] H. Kawahara, I. Masuda-Katsuse, A. de Cheveigné: Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds, Speech Commun., Vol. 27, No. 3-4, pp. 187–207, 1999.
[15] T. Toda, K. Tokuda: A speech parameter generation algorithm considering global variance for HMM-based speech synthesis, IEICE Trans. Inf. and Syst., Vol. E90-D, No. 5, pp. 816–824, 2007.
[16] K. Shinoda, T. Watanabe: MDL-based context-dependent subword modeling for speech recognition, J. Acoust. Soc. Jpn. (E), Vol. 21, No. 2, pp. 79–86, 2000.