Vol.4-DCC-8 No.8 Vol.4-MUS-5 No.8 4//,a) Vocoder (F) F F. PSOLA [] sinusoidal model [] phase vocoder Vocoder [3] (F) F 3 [4], [5], [6], [7], [8], [9] [], [], [], [3], [4] [5], [6] [7], [8], University of Yamanashi, Yamanashi 4 85, Japan a) mmorise@yamanashi.ac.jp [9] STRAIGHT[] [] TANDEM-STRAIGHT[] WORLD[3] * [] [] F F [] SNR F F. * http://ml.cs.yamanashi.ac.jp/world/ c 4 Information Processing Society of Japan
Vol.4-DCC-8 No.8 Vol.4-MUS-5 No.8 4// 3 3 Hanning (T ) 3 Hanning 3T (y(t)w(t)) dt =.5 T y (t)dt. () STRAIGHT F 3 TANDEM-STRAIGHT[] 3 F F 3 [] F []. : F [3] Hanning nω (ω = π/t n ) nω Hz nω Hz. : F 3 3 3 P s (ω) = 3 ω 3 P (ω + λ)dλ, () ω ω 3 P (ω) ω /3 ω.3 3: c 4 Information Processing Society of Japan
3 [] nt nt sinc sinc 3 sinc consistent sampling [4] fs Hz fs / Hz * consistent sampling DA/AD DA/AD nω Hz AD nω Hz Hanning -3 db F ±ω Hz q P l (ω) = exp ( q log(p (ω)) + q log (P (ω + ω )P (ω ω ))), Hz ±ω * (3) Hz P l (ω) P l (ω) = exp (F [l s (τ)l q (τ)p s (τ)]), (4) l s (τ) = sin(πf τ), (5) πf τ ( ) πτ l q (τ) = q + q cos, (6) T p s (τ) = F [log (P s (ω))], (7) l s (τ) l q (τ) q q F[] F [] q.8 q.9 3. TANDEM-STRAIGHT F 3. E f σ f (n) = = N K N n= K k= σ f (n), (8) ( P e (k, n) K Vol.4-DCC-8 No.8 Vol.4-MUS-5 No.8 4// K l= P e (l, n)), (9) P e (k, n) = log (P l (k, n)) log (P t (k)), () P l (k, n) k n P t (k) c 4 Information Processing Society of Japan 3
K FFT N E f E t = K σ t (k) = N K k= N n= σ t (k), () ( P e (k, n) N N m= P e (k, m)). () E t E f E t 3. TANDEM-STRAIGHT [3] 3 F (Standard F) P t (k) 48 khz s ms N, FFT TANDEM-STRAIGHT F 4 Hz 3 4,96 K,48 F Standard F Hz Hz 3.3 : SNR db 6 db. db SNR Standard F: Hz 3 3 3 4 5 6 SNR (db) SNR Standard F Hz 3 SNR E f E t SNR Standard F TANDEM-STRAIGHT SNR 45 db TANDEM-STRAIGHT 3.4 : F Vol.4-DCC-8 No.8 Vol.4-MUS-5 No.8 4// F %.% SNR 6 db 4 5 STRAIGHT F ±3% TANDEM- TANDEM- STRAIGHT F TANDEM-STRAIGHT c 4 Information Processing Society of Japan 4
Vol.4-DCC-8 No.8 Vol.4-MUS-5 No.8 4// Standard F: Hz.5.5 Standard F: Hz 3 3 3 4 5 6 SNR (db) 3 SNR Standard F Hz.5.5 F estimation error (%) 4 F Standard F Hz *3 3.5 SNR 45 db TANDEM-STRAIGHT F SNR 45 db TANDEM-STRAIGHT 4. Cheap- *3 TANDEM-STRAIGHT F Trick F F TANDEM-STRAIGHT F STRAIGHT [] JSPS 4373 c 4 Information Processing Society of Japan 5
.5.5.5.5 Standard F: Hz F estimation error (%) 5 F Standard F Hz 65487 H5/A8 [] Moulines, E. and Charpentier, F.: Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones, Speech Communication, Vol. 9, No. 5-6, pp. 453 467 (99). [] McAulay, R. and Quatieri, T.: Speech analysis/synthesis based on a sinusoidal representation, IEEE Trans. Acoustics, Speech, and Signal Processing), Vol. 34, No. 4, pp. 744 755 (986). [3] Dudley, H.: Remaking speech, J. Acoust. Soc. Am., Vol., No., pp. 69 77 (939). [4] Noll, A. M.: Short-time spectrum and cepstrum techniques for vocal pitch detection, J. Acoust. Soc. Am., Vol. 36, No., pp. 96 3 (964). [5] Oppenheim, A. V.: Speech analysis-synthesis system based on homomorphic filtering, J. Acoust. Soc. Am., Vol. 45, No., pp. 458 465 (969). [6] Atal, B. S. and Hanauer, S. L.: Speech analysis and synthesis by linear prediction of the speech wave, J. Acoust. Soc. Am., Vol. 5, No. B, pp. 637 655 (969). [7] El-Jaroudi, A. and Makhoul, J.: Discrete all-pole modeling, IEEE Trans. on Signal Processing, Vol. 39, No., pp. 4 43 (99). [8] Badeau, R. and David, B.: Weighted maximum likelihood autoregressive and moving average spectrum modeling, in Proc. ICASSP 8, pp. 376 3764 (9). Vol.4-DCC-8 No.8 Vol.4-MUS-5 No.8 4// [9] Campedel-Oudot, M., Cappé, O. and Moulines, E.: Estimation of the spectral envelope of voiced sounds using a penalized likelihood approach, IEEE Trans. on Speech and Audio Processing, Vol. 9, No. 5, pp. 469 48 (). [] Kawahara, H., Masuda-Katsuse, I. and de Cheveigné, A.: Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F extraction, Speech Communication, Speech Communication, Vol. 7, pp. 87 7 (999). [] Kawahara, H., Morise, M., Takahashi, T., Nisimura, R., Irino, T. and Banno, H.: TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, f, and aperiodicity estimation, in Proc. ICASSP 8, pp. 3933 3936 (8). [] Kawahara, H. and Morise, M.: Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework, SADHANA - Academy Proceedings in Engineering Sciences, Vol. 36, pp. 73 78 (). [3] Morise, M.: An attempt to develop a singing synthesizer by collaborative creation, in Proc. SMAC 3, pp. 87 9 (3). [4] Nakano, T. and Goto, M.: A spectral envelope estimation method based on F-adaptive multi-frame integration analysis, in Proc. SAPA-SCALE, pp. 6 (). [5] Banno, H., Hata, H., Morise, M., Takahashi, T., Irino, T. and Kawahara, H.: Implementation of realtime STRAIGHT speech manipulation system, Acoust. Sci. & Tech., Vol. 8, No. 3, pp. 4 46 (7). [6] Morise, M., Onishi, M., Kawahara, H. and Katayose, H.: v.morish 9: A morphing-based singing design interface for vocal melodies, Lecture Notes in Computer Science, Vol. LNCS 579, pp. 85 9 (9). [7] Zen, H., Tokuda, K. and Black, A. W.: Statistical parametric speech synthesis, Speech Communication, Vol. 5, No., pp. 39 64 (9). [8] Zen, H., Senior, A. and Schuster, M.: Statistical parametric speech synthesis using deep neural networks, in Proc. ICASSP 3, pp. 796 7966 (3). [9] Toda, T. and Tokuda, K.: Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM, in Proc. ICASSP 8, pp. 395 398 (8). [] Kawahara, H., Nisimura, R., Irino, T., Morise, M., Takahashi, T. and Banno, H.: Temporally variable multi-aspect auditory morphing enabling extrapolation without objective and perceptual breakdown, in Proc. ICASSP9, pp. 395 398 (9). [] Morise, M.:, a spectral envelope estimator for high-quality speech synthesis, Speech Communication (4). [] Kawahara, H., Morise, M., Banno, H., Nisimura, R. and Irino, T.: Excitation source analysis for high-quality speech manipulation systems based on an interferencefree representation of group delay with minimum phase response compensation, pp. xxx xxx (4). [3] Mathews, M. V., Miller, J. E. and David, E. E.: Pitch synchronous analysis of voiced sounds, J. Acoust. Soc. Am., Vol. 33, pp. 79 86 (96). [4] Unser, M.: Sampling 5 years after Shannon, Proc. of the IEEE, Vol. 88, pp. 569 587 (). c 4 Information Processing Society of Japan 6