An HMM-based Singing Voice Synthesis System Which Can Realize Your Wish
"I Want This Person to Sing My Song"

Keiichiro Oura, Ayami Mase, Tomohiko Yamada, Keiichi Tokuda*1 and Masataka Goto*2

A statistical parametric approach to singing voice synthesis based on hidden Markov models (HMMs) has grown over the last few years. In this approach, the spectrum, excitation, and duration of singing voices are simultaneously modeled by context-dependent HMMs, and waveforms are generated from the HMMs themselves. In December 2009, we started a free on-line service named Sinsy. By uploading musical scores represented in MusicXML to the Sinsy website, users can obtain synthesized singing voices. However, training a new singer's model may require a high recording cost, because the speaker-dependent model used in Sinsy is trained on 70 songs. The present paper describes the recent developments of Sinsy and a speaker adaptation technique that can generate waveforms from a small amount of adaptation data.

1. Introduction

In HMM-based speech synthesis 1), the spectrum, excitation, and duration of speech are simultaneously modeled by context-dependent hidden Markov models (HMMs), and extensions such as average-voice models 2), speaker interpolation 3), and eigenvoices 4) have been proposed on top of this framework. The same framework has also been applied to singing voice synthesis 5). In December 2009, we started Sinsy 6), a free on-line HMM-based singing voice synthesis service. Sinsy is built on HTS 7), the hts_engine API 8), SPTK 9), STRAIGHT 10), the CrestMuseXML Toolkit 11), and MusicXML 12). Because the speaker-dependent model used in Sinsy is trained on 70 songs, building a model of a new singer is costly; we therefore adopt speaker adaptation based on constrained maximum-likelihood linear regression (CMLLR) 13). The rest of this paper is organized as follows: Section 2 gives an overview of the HMM-based singing voice synthesis system Sinsy, Section 3 describes its recent developments, Section 4 presents the speaker adaptation technique, Section 5 reports experimental results, and Section 6 concludes the paper.

2. Overview of Sinsy

2.1 HMM-based singing voice synthesis

Existing commercial singing synthesizers include the sample-concatenation system Vocaloid 14).

*1 Nagoya Institute of Technology (NIT)
*2 National Institute of Advanced Industrial Science and Technology (AIST)

© 2010 Information Processing Society of Japan
VocaListener 15) synthesizes natural singing voices by estimating the parameters of a sample-concatenation synthesizer such as Vocaloid 14) so as to mimic a user's singing. Sinsy, in contrast, takes a statistical parametric approach: context-dependent HMMs are trained from a database of singing voices, and singing voices for an arbitrary musical score are generated from the HMMs themselves. Figure 1 shows the overview of the system. In the training part, spectral parameters (mel-cepstral coefficients 16)) and excitation parameters (F0) are extracted from the singing voice database, and context-dependent HMMs and duration models are trained. In the synthesis part, a musical score is converted into context-dependent labels, a parameter sequence is generated from the HMMs 17), and the waveform is synthesized by driving a mel log spectrum approximation (MLSA) filter 18) with the generated excitation.

Fig. 1 The overview of the HMM-based singing voice synthesis system. (Training part: spectral parameter extraction (mel-cepstrum) and excitation parameter extraction (F0) from a singing voice database, followed by training of context-dependent HMMs and duration models. Synthesis part: conversion of the musical score into labels, parameter generation from the HMMs, excitation generation, and waveform synthesis through the MLSA filter.)

2.2 Tools used in Sinsy

Speech Signal Processing Toolkit (SPTK): SPTK-3.3 9) is a suite of speech signal processing tools for UNIX environments, distributed under a modified BSD license 19). Sinsy 6) uses SPTK for signal processing such as mel-cepstral analysis.

STRAIGHT: STRAIGHT V40 10),20) is a speech analysis, modification, and synthesis system; it is used for feature extraction in Sinsy.

HMM-based Speech Synthesis System (HTS): the acoustic models of Sinsy are trained with HTS-2.1.1 7),21). HTS is released as a patch to the HMM Toolkit (HTK) 22); since the HTK license does not permit redistribution of HTK itself, the trained models are used at run time through a separate synthesis engine.
CrestMuseXML Toolkit (CMX): CMX 11) is a toolkit for processing music description languages such as MusicXML 2.0 12). Sinsy uses CMX-0.50 23) as its front end to read musical scores written in MusicXML (Fig. 2) and convert them into context-dependent labels.

HMM-based Speech Synthesis Engine (hts_engine API): the hts_engine API 8) is a synthesis engine that generates waveforms from HMMs trained by HTS, and it depends on neither HTK nor HTS at run time. Sinsy uses hts_engine API-1.03, which is hosted on SourceForge and distributed under a modified BSD license 19).

2.3 The Sinsy on-line service

Sinsy has been available as a free on-line service since December 2009. By uploading a musical score in MusicXML to the Sinsy website, users can obtain a synthesized singing voice of the uploaded song.

3. Recent developments of Sinsy

3.1 Contexts

As in HMM-based speech synthesis, context-dependent models are used to capture factors that affect the acoustics 24). For singing voice synthesis, Sinsy takes into account not only phonetic contexts but also musical contexts obtained from the score, such as the pitch and length of notes 25).

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE score-partwise PUBLIC "-//Recordare//DTD MusicXML 2.0 Partwise//EN"
    "http://www.musicxml.org/dtds/partwise.dtd">
...
    <measure number="1">
      <attributes>
        <divisions>1</divisions>
        <key>
          <fifths>0</fifths>
          <mode>major</mode>
        </key>
        <time>
          <beats>4</beats>
          <beat-type>4</beat-type>
        </time>
        ...
      </attributes>
      <sound tempo="120"/>
      <note>
        <pitch>
          <step>E</step>
          <octave>5</octave>
        </pitch>
        <duration>2</duration>
        <type>half</type>
        <lyric>
          <text>...</text>
        </lyric>
      </note>
      <note>
        <pitch>
          <step>C</step>
          <octave>5</octave>
        </pitch>
        <duration>2</duration>
        <type>half</type>
        <lyric>
          <text>...</text>
        </lyric>
      </note>
    </measure>
  </part>
</score-partwise>

Fig. 2 An example of MusicXML.
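To make the score conversion behind Fig. 2 concrete, the sketch below extracts each note's pitch, duration in seconds, and lyric from a MusicXML fragment. It only illustrates the kind of information a front end needs; the actual conversion in Sinsy is done by the CrestMuseXML Toolkit, and the score string, lyrics, and function name here are hypothetical.

```python
import xml.etree.ElementTree as ET

# A tiny score shaped like the example in Fig. 2 (hypothetical "la" lyrics,
# since the original lyric text is not recoverable here).
SCORE = """<score-partwise>
 <part id="P1">
  <measure number="1">
   <attributes>
    <divisions>1</divisions>
    <time><beats>4</beats><beat-type>4</beat-type></time>
   </attributes>
   <sound tempo="120"/>
   <note>
    <pitch><step>E</step><octave>5</octave></pitch>
    <duration>2</duration><type>half</type>
    <lyric><text>la</text></lyric>
   </note>
   <note>
    <pitch><step>C</step><octave>5</octave></pitch>
    <duration>2</duration><type>half</type>
    <lyric><text>la</text></lyric>
   </note>
  </measure>
 </part>
</score-partwise>"""

def extract_notes(xml_text):
    """Return a (pitch, duration_in_seconds, lyric) tuple per note.

    <divisions> gives score units per quarter note; together with the
    tempo (quarter notes per minute) it converts <duration> to seconds.
    """
    root = ET.fromstring(xml_text)
    divisions = int(root.findtext(".//divisions"))
    tempo = float(root.find(".//sound").get("tempo"))
    sec_per_unit = 60.0 / (tempo * divisions)
    return [(note.findtext("pitch/step") + note.findtext("pitch/octave"),
             int(note.findtext("duration")) * sec_per_unit,
             note.findtext("lyric/text"))
            for note in root.iter("note")]

print(extract_notes(SCORE))
```

With divisions = 1 and tempo = 120, each half note (duration 2) lasts exactly one second; the pitch, timing, and lyric values recovered this way are what the context-dependent labels encode.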
Fig. 3 The HMM-based singing voice synthesis system Sinsy.

Fig. 4 An example of vibrato parts in a log F0 sequence. (The waveform and the log F0 contour are shown, with the vibrato sections marked.)

3.2 Vibrato modeling

Vibrato, a periodic fluctuation of pitch, is an important characteristic of singing voices 26),27). Sinsy models vibrato explicitly 28). For frames t = 0, 1, ..., T + 1 and vibrato sections i = 0, 1, ..., M, the vibrato component is modeled as

    ν(m_a(t), m_f(t), i) = m_a(t) sin( 2π m_f(t) (t − t_0^(i)) / f_s ),    (1)

where f_s is the frame rate, t_0^(i) is the first frame of the i-th vibrato section, and m_a(t) and m_f(t) are the vibrato amplitude (in cent) and the vibrato frequency (in Hz) at frame t 27). An amplitude of one cent corresponds to c = log 2 / 1200 in the log F0 domain, so the peak-to-peak depth of the vibrato in log F0 is 2c·m_a(t) (Fig. 5).

The observation vector o_t consists of three streams: the spectrum part o_t^(spec), the log F0 part o_t^(F0), and the vibrato part o_t^(vib). The output probability of state s is

    b_s(o_t) = p_s(o_t^(spec))^γ_spec · p_s(o_t^(F0))^γ_F0 · p_s(o_t^(vib))^γ_vib,    (2)

where γ_spec, γ_F0, and γ_vib are the weights of the three streams.
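Equation (1) can be made concrete with a short sketch that adds a sinusoidal vibrato to a flat log-F0 contour. The frame rate and the amplitude and frequency values below are illustrative assumptions (chosen inside the typical 30-50 cent, 5-8 Hz range for singing vibrato), not values taken from the system.

```python
import math

def vibrato(m_a, m_f, t, t0, fs):
    """Eq. (1): vibrato value at frame t of one vibrato section.

    m_a: amplitude in cent at frame t; m_f: frequency in Hz at frame t;
    t0: first frame of the section; fs: frame rate in frames per second.
    """
    return m_a * math.sin(2.0 * math.pi * m_f * (t - t0) / fs)

# Add a 50-cent, 6-Hz vibrato to a flat log-F0 contour at A4 = 440 Hz.
fs = 200.0                    # assumed 5-ms frame shift -> 200 frames/s
c = math.log(2.0) / 1200.0    # one cent in the log-F0 domain
base = math.log(440.0)
contour = [base + c * vibrato(50.0, 6.0, t, 0, fs) for t in range(200)]
```

The resulting contour oscillates around log 440 with a peak deviation of 50·c, i.e., a peak-to-peak depth of 2c·m_a(t) as in Fig. 5.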
Fig. 5 The vibrato parameter analysis. (The vibrato amplitudes 2c·m_a(t_0), ..., 2c·m_a(t_4) and frequencies m_f(t_0), ..., m_f(t_4) are estimated along the frame indices t_0, ..., t_6 of a log F0 sequence; the F0 axis spans roughly 600-800 Hz.)

3.3 Context clustering

The context-dependent models are clustered by decision trees, as in conventional HMM-based speech synthesis 24). Because the pitches in the training data cover the singer's range unevenly (Fig. 6), Sinsy also uses pitch-shifted pseudo training data 30). The decision trees are built with the minimum description length (MDL) criterion 31). When a node S_q is split into S_q+ and S_q− by question q, the change in description length is

    δ_q = L(S_q) − { L(S_q+) + L(S_q−) } + α (N/2) log Γ(S_0),    (3)

where L(·) denotes the log likelihood of the training frames assigned to a node, S_0 is the root node, Γ(·) is the state occupancy of a node, N is the number of model parameters added by a split, and α is a weighting factor. A split is accepted only while it decreases the description length.

Fig. 6 The distribution of pitch in training data (10 songs). (The number of notes, from 0 to 500, is shown for each pitch from G3 to F5, before and after pitch-shift.)

3.4 Pruning of the search space

In HMM training, the forward and backward probabilities α_i(t) and β_i(t) are computed by the EM algorithm over all frames t = 0, ..., T + 1 and model indices i = 1, ..., M, i.e., over a search space of M × (T + 1) cells. Since the musical score gives approximate timings, Sinsy prunes this search space using note boundaries 32) (Fig. 7). For the models belonging to the k-th note, with margins Margin_start(k) and Margin_end(k) around the note boundaries Note_start(k) and Note_end(k), the first frame of the search space is

    t = Note_start(k) − Margin_start(k).    (4)
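The margins around Note_start(k) and Note_end(k) define, for each note, a window of frames outside which the forward-backward computation can be skipped. A small sketch of that bookkeeping, treating each note as one row of the trellis for simplicity; the note boundaries and margin sizes are made-up values, not ones from the system.

```python
def search_window(note_start, note_end, margin_start, margin_end, T):
    """Frames that the models of note k may occupy in the trellis:
    [Note_start(k) - Margin_start(k), Note_end(k) + Margin_end(k)],
    clipped to the valid frame range [0, T]."""
    first = max(0, note_start - margin_start)
    last = min(T, note_end + margin_end)
    return first, last

def kept_fraction(notes, margin_start, margin_end, T):
    """Fraction of the full trellis cells that survive pruning.
    `notes` is a list of (Note_start, Note_end) frame pairs."""
    kept = 0
    for start, end in notes:
        first, last = search_window(start, end, margin_start, margin_end, T)
        kept += last - first + 1
    return kept / (len(notes) * (T + 1))

# Two 100-frame notes with a 10-frame margin on each side: only 55%
# of the full search space has to be evaluated.
frac = kept_fraction([(0, 99), (100, 199)], 10, 10, 199)
```

The tighter the margins, the smaller the surviving fraction, at the risk of excluding the true alignment when the singer deviates strongly from the score timing.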
Fig. 7 The pruning approach using note boundaries. (The search space of model index versus frame number is shown before and after pruning, together with the margins Margin_start(k) and Margin_end(k).)

Fig. 8 An example of timings. (The note timing "sa i ta" and the actual voice timing of the phoneme sequence "sil y a n e y o r i" are compared against the spectrogram and waveform of a natural singing voice.)

Similarly, the last frame of the search space of the k-th note is

    t = Note_end(k) + Margin_end(k),    (5)

and the forward-backward computation is carried out only inside these boundaries.

3.5 Timing modeling

The actual timing of a singing voice does not exactly follow the note timing of the score (Fig. 8): singers tend to start singing slightly before or after the note boundaries 5). Sinsy uses hidden semi-Markov models (HSMMs) 34) with explicit duration distributions, implemented in a weighted finite-state transducer (WFST) framework 33), and models the time-lag of each note. The duration of the k-th note is modified as

    T̂_k = T_k − g_k + g_{k+1},    (6)

where T_k is the note length given by the score and g_k is the time-lag at the start of the k-th note.

4. Speaker adaptation

The speaker-dependent model used in Sinsy is trained on 70 songs, so building a model of a new singer from scratch requires a large amount of recordings. To reduce this cost, we adopt constrained maximum-likelihood linear regression (CMLLR) 13). In CMLLR, the mean vector µ_i and covariance matrix U_i of the i-th Gaussian are transformed into the adapted mean µ̂_i and covariance Û_i as

    µ̂_i = H_i µ_i + b_i,    (7)
    Û_i = H_i U_i H_i^T,    (8)

where the transformation matrix H_i and the bias vector b_i are shared among the Gaussians of a regression class and are estimated from the adaptation data by the EM algorithm. The regression classes can be defined by a context clustering decision tree 35).

5. Experiments

5.1 Experimental conditions

Songs recorded by a female singer (f001) were used for training: 70 songs sampled at 48 kHz and quantized to 16 bits. Spectral and excitation parameters were extracted by STRAIGHT 20) and mel-cepstral analysis 16).
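Equations (7) and (8) amount to an affine transform of each Gaussian's parameters. The numpy sketch below applies such a transform; the H and b values are hypothetical (in the real system they are estimated from the adaptation data by EM), so this shows only the algebra, not the estimation.

```python
import numpy as np

def adapt_gaussian(mu, U, H, b):
    """Eqs. (7)-(8): affine transform of one Gaussian's parameters.

    mu: (d,) mean; U: (d, d) covariance; H: (d, d) transform; b: (d,) bias.
    H and b are shared by all Gaussians in one regression class.
    """
    mu_hat = H @ mu + b        # Eq. (7)
    U_hat = H @ U @ H.T        # Eq. (8)
    return mu_hat, U_hat

# With mu = 0 and U = I, Eq. (7) returns b and Eq. (8) returns H H^T.
H = np.array([[1.2, 0.1],
              [0.0, 0.9]])
b = np.array([0.5, -0.3])
mu_hat, U_hat = adapt_gaussian(np.zeros(2), np.eye(2), H, b)
```

Because one (H_i, b_i) pair is shared across all Gaussians of a regression class, only a few matrices need to be estimated, which is why a few minutes of adaptation data can retarget a model trained on 70 songs.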
The vibrato parameters consist of the amplitude (in cent) and the frequency (in Hz); typical singing vibrato has an amplitude of 30-50 cent and a frequency of 5-8 Hz 36),37). Each HMM was a 5-state left-to-right HSMM 34) with a 5-ms frame shift. Decision trees were built with the MDL criterion 31); the weighting factor α in Eq. (3) was set to 5.0. The total footprint of the programs and the HMM set is about 2.6 MBytes (Table 1).

Table 1 The total file sizes of Sinsy (KBytes).
    Front-end program (CMX)              456
    Phoneme table                          3
    Back-end program (hts_engine API)    677
    Acoustic model (f001)                652
    Total file size of Sinsy            2588

5.2 Subjective evaluation

Three models were compared: the speaker-dependent model of f001 used in Sinsy, and speaker-adapted models of Miku Hatsune (CV01, Vocaloid2 14)) and Rin (RWC Music Database, RWC-MDB-P-2001 38)), each adapted from the f001 model with a small amount of data (Table 2). Listeners judged which singer each synthesized sample sounded like, and the singer recognition rates are shown in Fig. 9.

Table 2 Adaptation data.
    Singer                 Model                    Total song length for adaptation
    f001 (Sinsy)           Speaker-dependent model  -
    Miku Hatsune (CV01)    Speaker-adapted model    9m28s (5 songs)
    Rin (RWC-MDB-P-2001)   Speaker-adapted model    7m59s (7 songs)

Fig. 9 Subjective evaluation results. (Singer recognition rate (%), from 0 to 100, for f001 (Sinsy), Miku Hatsune (CV01), and Rin (RWC-MDB-P-2001).)

6. Conclusion

This paper described the recent developments of the HMM-based singing voice synthesis system Sinsy and a speaker adaptation technique for building a new singer's model from a small amount of data. By uploading a musical score in MusicXML to the Sinsy website, users can obtain a synthesized singing voice; we expect such systems to support consumer-generated media (CGM). Part of this work was supported by the SCOPE program.

References

1) T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous Modeling of Spectrum, Pitch and Duration in HMM-based Speech Synthesis," Proc. Eurospeech, pp.2347-2350, 1999.
2) J. Yamagishi, "Average-Voice-based Speech Synthesis," Ph.D. thesis, Tokyo Institute of Technology, 2006.
3) T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Speaker Interpolation in HMM-based Speech Synthesis System," Proc. Eurospeech, pp.2523-2526, 1997.
4) K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Eigenvoices for HMM-based Speech Synthesis," Proc. ICSLP, pp.1269-1272, 2002.
5) K. Saino, H. Zen, Y. Nankaku, A. Lee, and K. Tokuda, "An HMM-based Singing Voice Synthesis System," Proc. ICSLP, pp.1141-1144, 2006.
6) HMM-based singing voice synthesis system: Sinsy, http://www.sinsy.jp/
7) HMM-based Speech Synthesis System (HTS), http://hts.sp.nitech.ac.jp/
8) HMM-based Speech Synthesis Engine (hts_engine API), http://hts-engine.sourceforge.net/
9) Speech Signal Processing Toolkit (SPTK), http://sp-tk.sourceforge.net/
10) A Speech Analysis, Modification and Synthesis System (STRAIGHT), http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/index_e.html
11) CrestMuseXML Toolkit (CMX), http://cmx.sourceforge.jp/
12) MusicXML Definition, http://musicxml.org/
13) M.J.F. Gales, "Maximum Likelihood Linear Transformations for HMM-based Speech Recognition," Computer Speech & Language, vol.12, no.2, pp.75-98, 1998.
14) H. Kenmochi and H. Ohshita, "VOCALOID - Commercial Singing Synthesizer based on Sample Concatenation," Proc. Interspeech, 2007.
15) T. Nakano and M. Goto, "VocaListener: A Singing Synthesis System by Mimicking Pitch and Dynamics of User's Singing," IPSJ SIG Technical Report, vol.2008-MUS-75, no.50, pp.49-56, 2008 (in Japanese).
16) K. Tokuda, T. Kobayashi, T. Chiba, and S. Imai, "Spectral Estimation of Speech by Mel-Generalized Cepstral Analysis," IEICE Trans., vol.J75-A, no.7, pp.1124-1134, 1992 (in Japanese).
17) K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech Parameter Generation Algorithms for HMM-based Speech Synthesis," Proc. ICASSP, pp.1315-1318, 2000.
18) S. Imai, "Cepstral Analysis Synthesis on the Mel Frequency Scale," Proc. ICASSP, pp.93-96, 1983.
19) A New and Simplified BSD License, http://www.opensource.org/licenses/bsd-license.php
20) H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring Speech Representations using a Pitch-Adaptive Time-Frequency Smoothing and an Instantaneous-Frequency-based F0 Extraction: Possible Role of a Repetitive Structure in Sounds," Speech Communication, vol.27, pp.187-207, 1999.
21) H. Zen, K. Oura, T. Nose, J. Yamagishi, S. Sako, T. Toda, T. Masuko, A.W. Black, and K. Tokuda, "Recent Development of the HMM-based Speech Synthesis System (HTS)," Proc. APSIPA, pp.121-130, 2009.
22) The Hidden Markov Model Toolkit (HTK), http://htk.eng.cam.ac.uk/
23) CrestMuseXML (CMX) Toolkit ver.0.40, IPSJ SIG Technical Report, vol.2008-MUS-75, no.17, pp.95-100, 2009 (in Japanese).
24) J. Odell, "The Use of Context in Large Vocabulary Speech Recognition," Ph.D. thesis, Cambridge University, 1995.
25) K. Oura, A. Mase, T. Yamada, S. Muto, Y. Nankaku, and K. Tokuda, "Recent Development of the HMM-based Singing Voice Synthesis System (Sinsy)," Proc. SSW7, 2010 (to be published).
26) ..., vol.64, no.5, pp.267-277, 2008 (in Japanese).
27) T. Nakano, M. Goto, and Y. Hiraga, "An Automatic Singing Skill Evaluation Method for Unknown Melodies Using Pitch Interval Accuracy and Vibrato Features," Proc. Interspeech, pp.1706-1709, 2006.
28) ..., "HMM ...," IPSJ SIG Technical Report, vol.2009-MUS-80, no.5, pp.309-312, 2009 (in Japanese).
29) A. Kurematsu, K. Takeda, Y. Sagisaka, S. Katagiri, H. Kuwabara, and K. Shikano, "ATR Japanese Speech Database as a Tool of Speech Recognition and Synthesis," Speech Communication, vol.9, pp.357-363, 1990.
30) A. Mase, K. Oura, Y. Nankaku, and K. Tokuda, "HMM-based Singing Voice Synthesis System using Pitch-Shifted Pseudo Training Data," Proc. Interspeech, 2010 (to be published).
31) K. Shinoda and T. Watanabe, "MDL-based Context-Dependent Subword Modeling for Speech Recognition," J. Acoust. Soc. Jpn. (E), vol.21, no.2, pp.79-86, 2000.
32) ..., "HMM ...," vol.I, 2-7-8, pp.347-348, 2010 (in Japanese).
33) K. Oura, H. Zen, Y. Nankaku, A. Lee, and K. Tokuda, "A Fully Consistent Hidden Semi-Markov Model-based Speech Recognition System," IEICE Trans. Inf. & Syst., vol.E91-D, no.11, pp.2693-2700, 2008.
34) H. Zen, T. Masuko, K. Tokuda, T. Kobayashi, and T. Kitamura, "A Hidden Semi-Markov Model-based Speech Synthesis System," IEICE Trans. Inf. & Syst., vol.E90-D, no.5, pp.825-834, 2007.
35) J. Yamagishi, M. Tachibana, T. Masuko, and T. Kobayashi, "Speaking Style Adaptation Using Context Clustering Decision Tree for HMM-based Speech Synthesis," Proc. ICASSP, pp.5-8, 2004.
36) J. Sundberg, The Science of the Singing Voice, Northern Illinois University Press, 1987.
37) C.E. Seashore, "A Musical Ornament, the Vibrato," Psychology of Music, McGraw-Hill, pp.33-52, 1938.
38) M. Goto et al., "RWC Music Database," IPSJ Journal, vol.45, no.3, pp.728-738, 2004 (in Japanese).