1,a) 1,b) 2 3 Sakriani Sakti 1 Graham Neubig 1 1. A Study on HMM-Based Speech Synthesis Using Rich Context Models

Σχετικά έγγραφα
(hidden Markov model: HMM) FUNDAMENTALS OF SPEECH SYNTHESIS BASED ON HMM. Keiichi Tokuda. Department of Computer Science

Vol.4-DCC-8 No.8 Vol.4-MUS-5 No.8 4// 3 3 Hanning (T ) 3 Hanning 3T (y(t)w(t)) dt =.5 T y (t)dt. () STRAIGHT F 3 TANDEM-STRAIGHT[] 3 F F 3 [] F []. :

Sinsy: HMM. Sinsy An HMM-based singing voice synthesis system which can realize your wish I want this person to sing my song

EM Baum-Welch. Step by Step the Baum-Welch Algorithm and its Application 2. HMM Baum-Welch. Baum-Welch. Baum-Welch Baum-Welch.

Buried Markov Model Pairwise

CSJ. Speaker clustering based on non-negative matrix factorization using i-vector-based speaker similarity

Voice Conversion based on Non-negative Matrix Factorization with Segment Features in Noisy Environments

Estimation, Evaluation and Guarantee of the Reverberant Speech Recognition Performance based on Room Acoustic Parameters

SNR F0 [2], [3], [4] F0 F0 F0 F0 F0 TUSK F0 TUSK F0 6 TUSK 6 F0 2. F0 F0 [5] [6] [7] p[8] Cepstrum [9], [10] [11] [12] [13] F0 [14] F0 [15] DIO[16] [1

1 n-gram n-gram n-gram [11], [15] n-best [16] n-gram. n-gram. 1,a) Graham Neubig 1,b) Sakriani Sakti 1,c) 1,d) 1,e)

VOCODER VOCODER Vocal

Fourier transform, STFT 5. Continuous wavelet transform, CWT STFT STFT STFT STFT [1] CWT CWT CWT STFT [2 5] CWT STFT STFT CWT CWT. Griffin [8] CWT CWT

Study on Re-adhesion control by monitoring excessive angular momentum in electric railway traction

MIDI [8] MIDI. [9] Hsu [1], [2] [10] Salamon [11] [5] Song [6] Sony, Minato, Tokyo , Japan a) b)

[5] F 16.1% MFCC NMF D-CASE 17 [5] NMF NMF 3. [5] 1 NMF Deep Neural Network(DNN) FUSION 3.1 NMF NMF [12] S W H 1 Fig. 1 Our aoustic event detect

Ψηφιακή Επεξεργασία Φωνής

Study of In-vehicle Sound Field Creation by Simultaneous Equation Method

A study of geometric dependency of cepstrum on vocal tract length

Διπλωματική Εργασία. του φοιτητή του Τμήματος Ηλεκτρολόγων Μηχανικών και Τεχνολογίας Υπολογιστών της Πολυτεχνικής Σχολής του Πανεπιστημίου Πατρών

: Monte Carlo EM 313, Louis (1982) EM, EM Newton-Raphson, /. EM, 2 Monte Carlo EM Newton-Raphson, Monte Carlo EM, Monte Carlo EM, /. 3, Monte Carlo EM

Schedulability Analysis Algorithm for Timing Constraint Workflow Models

Nov Journal of Zhengzhou University Engineering Science Vol. 36 No FCM. A doi /j. issn

Applying Markov Decision Processes to Role-playing Game

Συνδυασμένη Οπτική-Ακουστική Ανάλυση Ομιλίας

An Automatic Modulation Classifier using a Frequency Discriminator for Intelligent Software Defined Radio

IPSJ SIG Technical Report Vol.2014-CE-127 No /12/6 CS Activity 1,a) CS Computer Science Activity Activity Actvity Activity Dining Eight-He

[2] REVERB 8 [3], [4] [5] [20] [6], [7], [8], [9], [10] [11] REVERB 8 *1 [9] LDA *2 MLLT (SAT) [8] (basis fmllr) [12] (DNN) [10] DNN [11] [13] [14] Ka

Bayesian statistics. DS GA 1002 Probability and Statistics for Data Science.

[15], [16], [17] [6] [2] [5] Jiang [6] 2.1 [6], [10] Score(x, y) y ( 1) ( 1 ) b e ( 1 ) b e. O(n 2 ) Jiang [6] (word lattice reranking)

Quick algorithm f or computing core attribute

Stabilization of stock price prediction by cross entropy optimization

AΡΙΣΤΟΤΕΛΕΙΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΘΕΣΣΑΛΟΝΙΚΗΣ ΠΟΛΥΤΕΧΝΙΚΗ ΣΧΟΛΗ ΤΜΗΜΑ ΠΟΛΙΤΙΚΩΝ ΜΗΧΑΝΙΚΩΝ

Signal processing for handling singing voice texture

Summary of the model specified

Chapter 1 Introduction to Observational Studies Part 2 Cross-Sectional Selection Bias Adjustment

Homework 8 Model Solution Section

ER-Tree (Extended R*-Tree)

«ΑΝΑΠΣΤΞΖ ΓΠ ΚΑΗ ΥΩΡΗΚΖ ΑΝΑΛΤΖ ΜΔΣΔΩΡΟΛΟΓΗΚΩΝ ΓΔΓΟΜΔΝΩΝ ΣΟΝ ΔΛΛΑΓΗΚΟ ΥΩΡΟ»

1 (forward modeling) 2 (data-driven modeling) e- Quest EnergyPlus DeST 1.1. {X t } ARMA. S.Sp. Pappas [4]

Reaction of a Platinum Electrode for the Measurement of Redox Potential of Paddy Soil

Supplementary Appendix

ΒΙΟΠΛΗΡΟΦΟΡΙΚΗ ΙΙ. Δυναμικός Προγραμματισμός. Παντελής Μπάγκος

Διερεύνηση ακουστικών ιδιοτήτων Νεκρομαντείου Αχέροντα

(Statistical Machine Translation: SMT[1]) [2]

Speech Recognition using Phase Information based on Long-Term Analysis

Elements of Information Theory

ΠΤΥΧΙΑΚΗ ΕΡΓΑΣΙΑ "ΠΟΛΥΚΡΙΤΗΡΙΑ ΣΥΣΤΗΜΑΤΑ ΛΗΨΗΣ ΑΠΟΦΑΣΕΩΝ. Η ΠΕΡΙΠΤΩΣΗ ΤΗΣ ΕΠΙΛΟΓΗΣ ΑΣΦΑΛΙΣΤΗΡΙΟΥ ΣΥΜΒΟΛΑΙΟΥ ΥΓΕΙΑΣ "


ΚΥΠΡΙΑΚΗ ΕΤΑΙΡΕΙΑ ΠΛΗΡΟΦΟΡΙΚΗΣ CYPRUS COMPUTER SOCIETY ΠΑΓΚΥΠΡΙΟΣ ΜΑΘΗΤΙΚΟΣ ΔΙΑΓΩΝΙΣΜΟΣ ΠΛΗΡΟΦΟΡΙΚΗΣ 19/5/2007

Q L -BFGS. Method of Q through full waveform inversion based on L -BFGS algorithm. SUN Hui-qiu HAN Li-guo XU Yang-yang GAO Han ZHOU Yan ZHANG Pan

ΠΤΥΧΙΑΚΗ/ ΙΠΛΩΜΑΤΙΚΗ ΕΡΓΑΣΙΑ

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ ΣΧΟΛΗ ΗΛΕΚΤΡΟΛΟΓΩΝ ΜΗΧΑΝΙΚΩΝ ΚΑΙ ΜΗΧΑΝΙΚΩΝ ΥΠΟΛΟΓΙΣΤΩΝ ΤΟΜΕΑΣ ΗΛΕΚΤΡΙΚΗΣ ΙΣΧΥΟΣ

X-Y COUPLING GENERATION WITH AC/PULSED SKEW QUADRUPOLE AND ITS APPLICATION

Yoshifumi Moriyama 1,a) Ichiro Iimura 2,b) Tomotsugu Ohno 1,c) Shigeru Nakayama 3,d)

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ

ΠΩΣ ΕΠΗΡΕΑΖΕΙ Η ΜΕΡΑ ΤΗΣ ΕΒΔΟΜΑΔΑΣ ΤΙΣ ΑΠΟΔΟΣΕΙΣ ΤΩΝ ΜΕΤΟΧΩΝ ΠΡΙΝ ΚΑΙ ΜΕΤΑ ΤΗΝ ΟΙΚΟΝΟΜΙΚΗ ΚΡΙΣΗ

Ο νοσηλευτικός ρόλος στην πρόληψη του μελανώματος

A Method for Creating Shortcut Links by Considering Popularity of Contents in Structured P2P Networks

Detection and Recognition of Traffic Signal Using Machine Learning

Extraction of Basic Patterns of Household Energy Consumption

Supporting information. An unusual bifunctional Tb-MOF for highly sensing of Ba 2+ ions and remarkable selectivities of CO 2 /N 2 and CO 2 /CH 4

Echo path identification for stereophonic acoustic echo cancellation without pre-processing

ΤΕΧΝΟΛΟΓΙΚΟ ΕΚΠΑΙΔΕΥΤΙΚΟ ΙΔΡΥΜΑ ΑΓΡΟΤΙΚΕΣ ΣΤΑΤΙΣΤΙΚΕΣ ΜΕ ΕΡΓΑΛΕΙΑ ΓΕΩΠΛΗΡΟΦΟΡΙΚΗΣ

Correction of chromatic aberration for human eyes with diffractive-refractive hybrid elements

: TANDEM-STRAIGHT. Make singing voice tangible: TANDEM-STRAIGHT and temporally variable morphing as substrate. Hideki Kawahara 1 and Masanori Morise 2

ΑΠΟΔΟΤΙΚΗ ΑΠΟΤΙΜΗΣΗ ΕΡΩΤΗΣΕΩΝ OLAP Η ΜΕΤΑΠΤΥΧΙΑΚΗ ΕΡΓΑΣΙΑ ΕΞΕΙΔΙΚΕΥΣΗΣ. Υποβάλλεται στην

Μειέηε, θαηαζθεπή θαη πξνζνκνίσζε ηεο ιεηηνπξγίαο κηθξήο αλεκνγελλήηξηαο αμνληθήο ξνήο ΓΗΠΛΩΜΑΣΗΚΖ ΔΡΓΑΗΑ

ΠΑΝΕΠΙΣΤΗΜΙΟ ΠΑΤΡΩΝ ΣΧΟΛΗ ΑΝΘΡΩΠΙΣΤΙΚΩΝ ΚΑΙ ΚΟΙΝΩΝΙΚΩΝ ΕΠΙΣΤΗΜΩΝ ΤΜΗΜΑ ΦΙΛΟΛΟΓΙΑΣ

ΣΥΓΚΡΙΣΗ ΑΝΑΛΥΤΙΚΩΝ ΚΑΙ ΑΡΙΘΜΗΤΙΚΩΝ ΜΕΘΟ ΩΝ ΓΙΑ ΤΗ

CHAPTER 48 APPLICATIONS OF MATRICES AND DETERMINANTS

HIV HIV HIV HIV AIDS 3 :.1 /-,**1 +332

CHAPTER 25 SOLVING EQUATIONS BY ITERATIVE METHODS

Μειέηε θαη αλάιπζε επίδνζεο πξσηνθόιισλ δξνκνιόγεζεο ζε θηλεηά ad hoc δίθηπα κε βάζε ελεξγεηαθά θξηηήξηα ΓΗΠΛΩΜΑΣΗΚΖ ΔΡΓΑΗΑ

Πανεπιστήµιο Πειραιώς Τµήµα Πληροφορικής

Research on Economics and Management

MnZn. MnZn Ferrites with Low Loss and High Flux Density for Power Supply Transformer. Abstract:

GPGPU. Grover. On Large Scale Simulation of Grover s Algorithm by Using GPGPU

ΕΦΑΡΜΟΓΗ ΕΥΤΕΡΟΒΑΘΜΙΑ ΕΠΕΞΕΡΓΑΣΜΕΝΩΝ ΥΓΡΩΝ ΑΠΟΒΛΗΤΩΝ ΣΕ ΦΥΣΙΚΑ ΣΥΣΤΗΜΑΤΑ ΚΛΙΝΗΣ ΚΑΛΑΜΙΩΝ

( ) , ) , ; kg 1) 80 % kg. Vol. 28,No. 1 Jan.,2006 RESOURCES SCIENCE : (2006) ,2 ,,,, ; ;

ΓΕΩΠΟΝΙΚΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΑΘΗΝΩΝ ΤΜΗΜΑ ΕΠΙΣΤΗΜΗΣ ΤΡΟΦΙΜΩΝ ΚΑΙ ΔΙΑΤΡΟΦΗΣ ΤΟΥ ΑΝΘΡΩΠΟΥ

Feasible Regions Defined by Stability Constraints Based on the Argument Principle

Analysis of prosodic features in native and non-native Japanese using generation process model of fundamental frequency contours

Partial Differential Equations in Biology The boundary element method. March 26, 2013

Bayesian modeling of inseparable space-time variation in disease risk

Other Test Constructions: Likelihood Ratio & Bayes Tests

상대론적고에너지중이온충돌에서 제트입자와관련된제동복사 박가영 인하대학교 윤진희교수님, 권민정교수님

Ανάκτηση Εικόνας βάσει Υφής με χρήση Eye Tracker

Transient Voltage Suppressor

n 1 n 3 choice node (shelf) choice node (rough group) choice node (representative candidate)

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ ΣΧΟΛΗ ΠΟΛΙΤΙΚΩΝ ΜΗΧΑΝΙΚΩΝ. «Θεσμικό Πλαίσιο Φωτοβολταïκών Συστημάτων- Βέλτιστη Απόδοση Μέσω Τρόπων Στήριξης»

(Statistical Machine Translation: SMT [1])

{takasu, Conditional Random Field

476,,. : 4. 7, MML. 4 6,.,. : ; Wishart ; MML Wishart ; CEM 2 ; ;,. 2. EM 2.1 Y = Y 1,, Y d T d, y = y 1,, y d T Y. k : p(y θ) = k α m p(y θ m ), (2.1

*,* + -+ on Bedrock Bath. Hideyuki O, Shoichi O, Takao O, Kumiko Y, Yoshinao K and Tsuneaki G

GPU. CUDA GPU GeForce GTX 580 GPU 2.67GHz Intel Core 2 Duo CPU E7300 CUDA. Parallelizing the Number Partitioning Problem for GPUs

ΠΟΛΥΤΕΧΝΕΙΟ ΚΡΗΤΗΣ ΣΧΟΛΗ ΜΗΧΑΝΙΚΩΝ ΠΕΡΙΒΑΛΛΟΝΤΟΣ

RMT Tick RMT. Application of the RMT-test on Real Data: Hash Function and Tick Data of Stock Prices

Acoustic Signal Adjustment by Considering Musical Expressive Intention Using a Performance Intension Function

Gain self-tuning of PI controller and parameter optimum for PMSM drives

Γονεϊκό στυλ διαπαιδαγώγησης και επιδόσεις μαθητών σχολικής ηλικίας: θεωρητικές και πρακτικές όψεις

Transcript:

HMM 1,a 1,b 3 Sakriani Sakti 1 Graham Neubig 1 1 Hidden Markov Model HMM HMM HMM HMM HMM A Study on HMM-Based Speech Synthesis Using Rich Context Models Shinnosuke Takamichi 1,a Toda Tomoki 1,b Shiga Yoshinori Kawai Hisashi 3 Sakriani Sakti 1 Graham Neubig 1 Nakamura Satoshi 1 Abstract: In this paper, we propose parameter generation methods using rich context models in HMM-based speech synthesis as yet another hybrid method combining HMM-based speech synthesis and unit selection synthesis. In the traditional HMM-based speech synthesis, generated speech parameters tend to be excessively smoothed and they cause muffled sounds in synthetic speech. To alleviate this problem, several hybrid methods have been proposed. Although they significantly improve quality of synthetic speech by directly using natural waveform segments, they usually lose flexibility in converting synthetic voice characteristics. In the proposed methods, rich context models representing individual acoustic parameter segments are reformed as GMMs and a speech parameter sequence is generated from them using the parameter generation algorithm based on the maximum likelihood criterion. Since a basic framework of the proposed methods is still the same as the traditional framework, the capability of flexibly modeling acoustic features remains. We conduct several experimental evaluations of the proposed methods from various perspectives. The experimental results demonstrate that the proposed methods yield significant improvements in quality of synthetic speech. Keywords: HMM-based speech synthesis, over-smoothing, rich context model, parameter generation 1 NICT 3 KDDI a shinnosuke-t@is.naist.jp b tomoki@is.naist.jp 1. Text-To-Speech TTS c 1 Information Processing Society of Japan 1

[1] TTS TTS [] Hidden Markov Model HMM [3] F HMM HMM [] [5] [] HMM HMM HMM [7] HMM HMM F [] HMM HMM HMM HMM Gaussian Mixture Model GMM HMM HMM 3 5. HMM HMM [9] c b c b c o t = N o t ; µ c, Σ c 1 o t = [ ] c t, c t, c t t c t c t c t N ; µ c, Σ c µ c Σ c HMM HMM HMM o = W c HMM c 1 Information Processing Society of Japan

c = [ c 1,, c T ] [1]W 3. [] c m b c,m µ c,m Σ c b c,m o t = N o t ; µ c,m, Σ c forward-backward Kullback-Leibler Divergence KLD. HMM 3. Fig. 1 1 Training and synthesis processes in proposed methods 1.1 GMM GMM b c o t = M c m=1 ω m N o t ; µ c,m, Σ c 3 ω m m M c forwardbackward ω m = 1/M c. HMM HMM q = [q 1,, q T ] HMM P o q, λ = P o, m q, λ, m = [m 1,, m T ] o = [ ] o HMM 1,, o T λ o = W c HMM HMM [1] ĉ = argmax P o, m q, λ 5 c EM c 1 Information Processing Society of Japan 3

..1 EM c EM Expectation- Maximization algorithm c E c i P m W c i, q, λ M c i+1 EM Q c i, c i+1 = P m W c i, q, λ ln P.. W c i+1, m q, λ HMM P o, m q, λ P o, m q, λ 7 c ˆm i+1 = argmax m P m W c i, q, λ, ĉ i+1 = argmax P W c ˆm i+1, q, λ 9 c..3 HMM [11] EM [1] [] HMM. HMM g Likelihood Log 93 9 91 9 9 7 Before iteration After iteration Initial Feature a Fig. g Likelihood Log 3 1 59 57 55 53 51 9 7 5 Before iteration After iteration Initial Feature b Effect of the initial parameter sequence 5. 5.1 ATR [13] A-I 5 J 53 1 khz 5 ms STRAIGHT [1] F 5 5 left-to-right Hidden Semi-Markov Model HSMM Global Variance: GV [15] Minimum Description Length MDL [1] Viterbi 5. 5..1 3 1 c 1 Information Processing Society of Japan

Randomized Conventional 3 Target Before iteration After iteration a b 5.. Conventional EM Proposed GMM Proposed single Proposed target AB 7 3a EM 5..1 Proposedtarget 5..3 Conventional Frame-based HMM State-based Fig. Fig. 3 3 Preference scores for speech quality An example of selected mixture component Framebased, State-based 3b 5.3 TTS Conventional Proposedsingle Proposedtarget c 1 Information Processing Society of Japan 5

Frequency khz Frequency khz Frequency khz Frequency khz 5 Conventional, Proposedsingle, Proposedtarget, Fig. 5 Spectrogramrepresenting Conventional, Proposedsingle, Proposedtarget, and natural speech from above. 5.. 3c Conventional, Proposedsingle, Proposedtarget, 5 Proposedsingle Proposedtarget 3a 3c. HMM GMM [1] Y. Sagisaka, Speech synthesis by rule using an optimal selection of non-uniform synthesis units, Proc.ICASSP, pp.79, 19. [] N. Iwahashi, N. Kaiki, Y. Sagisaka, Speech segment selection for concatenative synthesis based on spectral distortion minimization, IEICE Trans, Fundamentals, Vol.E7-A, No.11, pp.19 19, 1993. [3] H. Zen, K. Tokuda, A. Black, Statistical parametric speech synthesis, Speech Commun., Vol.51, No.11, pp.139 1, 9. [] T. Yoshimura, T. Masuko, K. Tokuda, T. Kobayashi, T. Kitamura, Speaker interpolation for HMM-based speech synthesis system, J. Acoust. Soc. Jpn. E, Vol.1, No., pp.199,. [5] J. Yamagishi, T. Kobayashi, Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training, IEICE Trans. on Inf. and Syst., Vol.E9-D, No., pp.533 53, 7. [] T. Nose, J. Yamagishi, T. Masuko, T. Kobayashi, A style control technique for HMM-based expressive speech synthesis, IEICE Trans. Inf. and Syst., Vol.E9-D, No.9, pp.1 113, 7. [7] Z. Ling, L. Qin, H. Lu, Y. Gao, L. Dai, R. Wang, Y. Jiang, Z. Zhao, J. Yang, J. Chen, G. Hu, The USTC and iflytek speech synthesis systems for Blizzard Challenge 7, Proc. of Blizzard Challenge workshop, 7. [] Z. Yan, Q. Yao, S.K. Frank, Rich Context Modeling for High Quality HMM-Based TTS, INTERSPEECH 9, pp.1755 175, 9. [9],,,,, HMM, D-, Vol.J3-D-, No.11, pp.99 17,. [1] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, T. Kitamura, Speech parameter generation algorithms for HMM-based speech synthesis, Proc.ICASSP, pp.1315 131,. [11] S. Kataoka, N. Mizutani, K. Tokuda, T. Kitamura, Decision tree backing-off in HMM-based speech synthesis, INTERSPEECH, WeB13p.1,. [1] T. Mizutani, T. Kagoshima, Concatenative Speech Synthesis Based on the Plural Unit Selection and Fusion Method, IEICE Trans. on Inf. and Syst., Vol. E-D, No.11, pp.55 57, 5. [13],,,,, ATR, TR=1-1, 199. [1] H. Kawahara, I. Masuda-Katsuse, A.D. Cheveigne, Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneousfrequency-based F extraction: Possible role of a repetitive structure in souds, Speech Commun., Vol.7, No.3, pp.17 7, 1999. [15] T. Toda, K. Tokuda, A speech parameter generation algorithm considering global variance for HMMbased speech synthesis, IEICE Trans, Vol.E9-D, No.5, pp.1, 7. [1] K. Shinoda and T. Watanabe, MDL-based contextdependent subword modeling for speech recognition, J.Acoust.Soc.Jpn.E, Vol.1, No., pp.79,. c 1 Information Processing Society of Japan