Make singing voice tangible: TANDEM-STRAIGHT and temporally variable morphing as substrate

Hideki Kawahara†1 and Masanori Morise†2

Vol.2010-MUS-86 No.6 2010/7/28

†1 Wakayama University
†2 Ritsumeikan University

Abstract

Algorithms and implementation details are introduced for the latest TANDEM-STRAIGHT and for temporally variable multi-aspect speech morphing, starting from the motivations behind legacy STRAIGHT and the developments that followed. STRAIGHT and TANDEM-STRAIGHT intentionally destroy the phase information of the original input speech. This destruction yields an extremely poor SNR (3 dB) when they are evaluated as waveform coding methods. This article illustrates the prospective merits that this destruction provides in return. The authors present these views in the hope that readers will find useful hints for their own applications.

1. Introduction

STRAIGHT 1),2) is a vocoder-type system. Work built on it includes auditory morphing 3) and F0 extraction 4),5). TANDEM-STRAIGHT 6) is its reformulation, and temporally variable multi-aspect morphing 7) and exploratory research tools 8) are based on TANDEM-STRAIGHT.

Related background includes the sound spectrogram 10), pattern playback 11), the Voder and the channel vocoder 12),13), and LPC 15),16). Further context is given in 9) and, regarding CAPTCHA, in 14).

© 2010 Information Processing Society of Japan

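The poor SNR comes from discarding phase, to which hearing has been described as insensitive ("phase deaf") 15),16),17),18). STRAIGHT is a vocoder in this sense.

2. STRAIGHT and TANDEM-STRAIGHT

2.1 TANDEM

A power spectrum obtained by short-time Fourier analysis of a periodic signal varies with the position of the analysis window. TANDEM 19) removes this variation. Consider a signal x(t) that consists of two adjacent harmonics:

\[ x(t) = e^{jk\omega_0 t} + \alpha e^{j((k+1)\omega_0 t + \beta)} \tag{1} \]

where \alpha and \beta are real constants, \omega_0 = 2\pi f_0 = 2\pi/T_0, and f_0 is the fundamental frequency. Let W(\omega) be the Fourier transform of the time window. For k = 0 the power spectrum P(\omega, t) is

\[ P(\omega, t) = W^2(\omega) + \alpha^2 W^2(\omega - \omega_0) + 2\alpha W(\omega) W(\omega - \omega_0) \cos(\omega_0 t + \beta) \tag{2} \]

The third term varies in time with period T_0. Averaging two power spectra taken T_0/2 apart cancels it:

\[ P_T(\omega, t) = \frac{1}{2}\left[ P\left(\omega, t - \frac{T_0}{4}\right) + P\left(\omega, t + \frac{T_0}{4}\right) \right] \tag{3} \]

P_T(\omega, t) is called the TANDEM spectrum.

2.1.1 Temporal variation of P_T(\omega, t)

The residual temporal variation of P_T(\omega, t), computed with a Blackman window, is measured in dB by the index \eta_{dBt}:

\[ \eta_{dBt}^2 = \frac{1}{2\pi T_0} \int_{-\pi}^{\pi} \int_{0}^{T_0} \left| L(\omega, t) - \bar{L}(\omega) \right|^2 dt \, d\omega \tag{4} \]

\[ \bar{L}(\omega) = \frac{1}{T_0} \int_{0}^{T_0} L(\omega, t) \, dt, \qquad L(\omega, t) = 10 \log_{10} P(\omega, t) \tag{5} \]

The index is examined as a function of SNR and of the window duration \sigma_t for the Blackman, Hanning and Kaiser (\beta = 9) windows 21). Matlab code is referred to in 20).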
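The cancellation claimed by Eqs. (1) to (3) can be checked numerically. The following is a minimal sketch, not the authors' Matlab implementation: the Gaussian-shaped `window_spectrum`, the value f0 = 100 Hz, and the constants alpha = 0.8 and beta = 0.3 are all assumptions chosen only for illustration. It evaluates Eq. (2) over one fundamental period and compares its spread with that of the TANDEM average of Eq. (3).

```python
import math

def window_spectrum(omega, omega0):
    # Hypothetical W(omega): a real, even Gaussian bump (not a window from the paper).
    return math.exp(-0.5 * (omega / (0.6 * omega0)) ** 2)

def power_spectrum(omega, t, omega0, alpha, beta):
    # Eq. (2) with k = 0: two harmonics located at 0 and omega0.
    w_a = window_spectrum(omega, omega0)
    w_b = window_spectrum(omega - omega0, omega0)
    interference = 2.0 * alpha * w_a * w_b * math.cos(omega0 * t + beta)
    return w_a ** 2 + (alpha * w_b) ** 2 + interference

def tandem_spectrum(omega, t, omega0, alpha, beta):
    # Eq. (3): average of two power spectra taken T0/2 apart.
    t0 = 2.0 * math.pi / omega0
    early = power_spectrum(omega, t - t0 / 4.0, omega0, alpha, beta)
    late = power_spectrum(omega, t + t0 / 4.0, omega0, alpha, beta)
    return 0.5 * (early + late)

def temporal_spread(spectrum, omega, omega0, alpha, beta, steps=64):
    # Maximum minus minimum of the spectrum value over one fundamental period.
    t0 = 2.0 * math.pi / omega0
    values = [spectrum(omega, t0 * n / steps, omega0, alpha, beta)
              for n in range(steps)]
    return max(values) - min(values)

omega0 = 2.0 * math.pi * 100.0   # assumed f0 = 100 Hz
omega = 0.5 * omega0             # midway between the two harmonics
plain_spread = temporal_spread(power_spectrum, omega, omega0, 0.8, 0.3)
tandem_spread = temporal_spread(tandem_spectrum, omega, omega0, 0.8, 0.3)
print(plain_spread, tandem_spread)
```

Under these assumptions the plain spectrum fluctuates visibly over one period, while the TANDEM spectrum is constant up to rounding error, because the two cosine terms are half a cycle apart.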

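Fig. 1 Temporal variation of logarithmic power spectra under different SNR. (left) original time windows. (right) TANDEM windows.

The SNR is 3 dB. Further comparison involves the Nuttall window 22) and the Blackman window, with \sigma_t = 0.388. Averaging power spectra taken at different times also relates TANDEM to Welch's method 23).

2.2 Consistent sampling

For a signal with fundamental frequency f_0 = 1/T_0, the TANDEM spectrum still has a periodic structure along the frequency axis with period \omega_0. It is removed by smoothing with a rectangle of width \omega_0, which gives the smoothed spectrum P_S(\omega, t):

\[ P_S(\omega, t) = \frac{1}{\omega_0} \int_{-\omega_0/2}^{\omega_0/2} P_T(\omega - \lambda, t) \, d\lambda \tag{6} \]

This smoothing plays the role of the anti-aliasing filter in A/D conversion and smooths P_S(\omega, t) more than necessary. Following consistent sampling 24), the counterpart of the D/A side is a set of compensating coefficients q_k that turn P_S(\omega, t) into P_{ST}(\omega, t):

\[ P_{ST}(\omega, t) = \sum_{k=-\infty}^{\infty} q_k P_S(\omega - k\omega_0, t) \tag{7} \]

The coefficients q_k are determined by the smoothing function h(\omega) and the window spectrum W(\omega):

\[ Q(z) = \frac{1}{R(z)} = \frac{1}{\sum_{k=-\infty}^{\infty} r_k z^{-k}} = \sum_{k=-\infty}^{\infty} q_k z^{-k}, \qquad r_k = \int h(\omega - k\omega_0) W^2(-\omega) \, d\omega \tag{8} \]

With the Blackman window, r_k and q_k decay quickly with |k|, so only q_0 and q_{\pm 1} are retained. To keep P_{ST}(\omega, t) positive, the compensation is carried out on the logarithmic spectrum. This is justified by \log(1 + x) \approx x for |x| \ll 1.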
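The relation Q(z) = 1/R(z) in Eq. (8) can be made concrete with a small numerical sketch. It assumes a symmetric three-tap R(z); the values r_0 = 1.0 and r_{±1} = 0.1 are hypothetical and are not taken from the paper. The helper names `inverse_filter` and `convolve_at` are likewise introduced only for this illustration. Q(z) is evaluated on the unit circle and its coefficients are read off with an inverse DFT.

```python
import cmath
import math

def inverse_filter(r, k_max=3, n_points=256):
    # r maps lag k to r_k. Returns q_k for |k| <= k_max such that Q(z) = 1/R(z),
    # using R(e^{j theta}) = sum_m r_m e^{-j m theta} sampled at n_points angles.
    q = {}
    for k in range(-k_max, k_max + 1):
        total = 0.0
        for n in range(n_points):
            theta = 2.0 * math.pi * n / n_points
            r_on_circle = sum(r_m * cmath.exp(-1j * m * theta)
                              for m, r_m in r.items())
            total += cmath.exp(1j * k * theta) / r_on_circle
        q[k] = (total / n_points).real
    return q

def convolve_at(r, q, lag):
    # One output sample of the convolution of r and q; ideally a unit impulse.
    return sum(r_m * q.get(lag - m, 0.0) for m, r_m in r.items())

r = {-1: 0.1, 0: 1.0, 1: 0.1}   # hypothetical r_k, for illustration only
q = inverse_filter(r)
print({k: round(v, 5) for k, v in q.items()})
print(convolve_at(r, q, 0), convolve_at(r, q, 1))
```

For this hypothetical R(z) the side coefficients q_{±1} come out negative and symmetric, and higher-order coefficients shrink geometrically, which illustrates why a truncated compensation with q_0 and q_{±1} can be adequate and why a linear-domain correction could in principle drive a small spectral value below zero.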

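Equations (6) and (7) are therefore replaced by

\[ L_S(\omega, t) = \frac{1}{\omega_0} \int_{-\omega_0/2}^{\omega_0/2} \log\left( P_T(\omega - \lambda, t) \right) d\lambda \tag{9} \]

\[ P_{ST}(\omega, t) = \exp\left( q_0 L_S(\omega, t) + q_1 \left( L_S(\omega - \omega_0, t) + L_S(\omega + \omega_0, t) \right) \right) \tag{10} \]

where q_{-1} = q_1. Equations (9) and (10) can be implemented as liftering in the cepstrum domain. P_{ST}(\omega, t) is the STRAIGHT spectrum.

2.3 STRAIGHT and TANDEM-STRAIGHT

See 20),25).

3. Applications of STRAIGHT

Applications of STRAIGHT include 26),27),30) and v.morish 31). Demonstrations are listed in 28), a Flash demonstration in 29), and a v.morish video in 32).

4. Morphing

Morphing 3) is extended as follows.

4.1 Morphing of the time axis

Let A and B be the two examples to be morphed, with time axes x_A and x_B, and let T_{AB}(x_A) map a time on A to the corresponding time on B. The morphed time axis T_{Am}(x_A), where m denotes the morphed example, is

\[ T_{Am}(x_A) = \int_0^{x_A} \exp\left( r_{AB} \log\left( \frac{dT_{AB}(\lambda)}{d\lambda} \right) \right) d\lambda = \int_0^{x_A} \left( \frac{dT_{AB}(\lambda)}{d\lambda} \right)^{r_{AB}} d\lambda \tag{11} \]

where r_{AB} is the morphing rate: r_{AB} = 0 reproduces A and r_{AB} = 1 reproduces B.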
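Equation (11) interpolates the slope of the time-axis mapping on a logarithmic scale and then integrates it. A minimal sketch follows, assuming the mapping is sampled on a grid and its slope is treated as piecewise constant between grid points. The function name `morphed_time_axis` and the quadratic example mapping T_AB(x) = x + 0.2x^2 are hypothetical and serve only as illustration.

```python
import math

def morphed_time_axis(x_grid, t_ab, rate):
    # Eq. (11): integrate exp(rate * log(dT_AB/dlambda)) from 0 to each grid point.
    # The slope of T_AB is assumed positive and constant within each grid cell.
    t_am = [0.0]
    for i in range(len(x_grid) - 1):
        step = x_grid[i + 1] - x_grid[i]
        slope = (t_ab[i + 1] - t_ab[i]) / step
        t_am.append(t_am[-1] + math.exp(rate * math.log(slope)) * step)
    return t_am

x_grid = [n / 100.0 for n in range(101)]          # time axis of A, 0 to 1
t_ab = [x + 0.2 * x * x for x in x_grid]          # hypothetical mapping from A to B

same_as_a = morphed_time_axis(x_grid, t_ab, 0.0)  # rate 0: time axis of A
same_as_b = morphed_time_axis(x_grid, t_ab, 1.0)  # rate 1: time axis of B
halfway = morphed_time_axis(x_grid, t_ab, 0.5)    # interpolation
beyond_b = morphed_time_axis(x_grid, t_ab, 1.5)   # extrapolation past B
print(halfway[-1], beyond_b[-1])
```

Because the integrand is an exponential, it stays positive for any rate, so the morphed time axis remains monotonic even when the rate is extrapolated outside the range from 0 to 1.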

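4.2 v.morish

In v.morish the morphing rate r_{AB}(t_s) varies along the morphed time axis t_s. Let T_{sA}(t_s) and T_{sB}(t_s) be the times on A and on B that correspond to t_s. They satisfy

\[ t_s = \int_0^{t_s} d\lambda \tag{12} \]

\[ T_{sA}(t_s) = \int_0^{t_s} \left( \left. \frac{dT_{AB}(x)}{dx} \right|_{x = T_{sA}(\lambda)} \right)^{-r^{(t)}_{AB}(\lambda)} d\lambda \tag{13} \]

\[ T_{sB}(t_s) = \int_0^{t_s} \left( \left. \frac{dT_{BA}(x)}{dx} \right|_{x = T_{sB}(\lambda)} \right)^{\left( r^{(t)}_{AB}(\lambda) - 1 \right)} d\lambda \tag{14} \]

where r^{(t)}_{AB} is the morphing rate of the time axis. With t_s, T_{sA}(t_s) and T_{sB}(t_s), a morphed parameter is

\[ \Theta_m(t_s) = \left( 1 - r_{AB}(t_s) \right) \Theta_A(T_{sA}(t_s)) + r_{AB}(t_s) \, \Theta_B(T_{sB}(t_s)) \tag{15} \]

where \Theta(t) is the value of the parameter at time t and r(t) is its morphing rate.

4.3 Reference time axis

When the morphing rates are specified on a reference time axis t_r, as r^{(t)}_{AB}(t_r) and r_{AB}(t_r), the mappings T_{rA}(t_r) and T_{rB}(t_r) are obtained in the same way, and T_{rs}(t_r) converts t_r into t_s 7).

5. GUI

GUI tools for TANDEM-STRAIGHT are provided in Matlab. See 33) for STRAIGHT and 8) for the GUI.

6. Conclusion

This article described TANDEM-STRAIGHT and temporally variable morphing as a substrate (substratum) for making singing voice tangible, together with their Matlab implementation 34). The poor SNR of STRAIGHT and TANDEM-STRAIGHT as waveform coders is the price of this flexibility.

Acknowledgments

This work was supported in part by the CrestMuse project and by a category (A) research grant.

References

1) Kawahara, H., Masuda-Katsuse, I. and de Cheveigné, A.: Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction, Speech Communication, Vol.27, No.3-4, pp.187-207 (1999).
2) Vocoder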

STRAIGHT, Vol.63, No.8, pp.442-449 (2007).
3) Kawahara, H. and Matsui, H.: Auditory morphing based on an elastic perceptual distance metric in an interference-free time-frequency representation, ICASSP 2003, Vol.I, pp.256-259 (2003).
4) Kawahara, H., Katayose, H., de Cheveigné, A. and Patterson, R.D.: Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of F0 and periodicity, EUROSPEECH'99, Vol.6, pp.2781-2784 (1999).
5) Kawahara, H., de Cheveigné, A., Banno, H., Takahashi, T. and Irino, T.: Nearly defect-free F0 trajectory extraction for expressive speech modifications based on STRAIGHT, Interspeech 2005, pp.537-540 (2005).
6) Kawahara, H., Morise, M., Takahashi, T., Nisimura, R., Irino, T. and Banno, H.: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0 and aperiodicity estimation, ICASSP 2008, pp.3933-3936 (2008).
7) Kawahara, H., Nisimura, R., Irino, T., Morise, M., Takahashi, T. and Banno, H.: Temporally variable multi-aspect auditory morphing enabling extrapolation without objective and perceptual breakdown, ICASSP 2009, pp.3905-3908 (2009).
8) Kawahara, H., Takahashi, T., Morise, M. and Banno, H.: Development of exploratory research tools based on TANDEM-STRAIGHT, APSIPA 2009, pp.111-120 (2009).
9) Vol.H-87-1 (1987).
10) Koenig, W., Dunn, H.K. and Lacy, L.Y.: The sound spectrograph, J. Acoust. Soc. Am., Vol.18, No.1, pp.19-49 (1946).
11) Liberman, A.M., Delattre, P.C. and Cooper, F.S.: The rôle of selected stimulus-variables in the perception of the unvoiced stop consonants, American Journal of Psychology, Vol.65, pp.497-516 (1952).
12) Dudley, H.: Remaking Speech, J. Acoust. Soc. Am., Vol.11, No.2, pp.169-177 (1939).
13) Vol.61, No.5, pp. 63 68 (2005).
14) CAPTCHA, No.3-4-3, p.11 (2010).
15) Vol.53-A, No.1, pp.35-42 (1970).
16) Atal, B.S. and Hanauer, S.L.: Speech analysis and synthesis by linear prediction of the speech wave, J. Acoust. Soc. Am., Vol.50, No.2B, pp.637-655 (1971).
17) Plomp, R. and Steeneken, H.J.M.: Effect of Phase on the Timbre of Complex Tones, J. Acoust. Soc. Am., Vol.46, No.2B, pp.409-421 (1969).
18) Patterson, R.D.: The sound of a sinusoid: Spectral models, J. Acoust. Soc. Am., Vol.96, No.3, pp.1409-1418 (1994).
19) Vol.J90-D, No.12, pp.3265-3267 (2007).
20) (2010). 2010.7.17.
21) Harris, F.J.: On the use of windows for harmonic analysis with the discrete Fourier transform, Proceedings of the IEEE, Vol.66, No.1, pp.51-83 (1978).
22) Nuttall, A.H.: Some windows with very good sidelobe behavior, IEEE Trans. Acoustics, Speech, and Signal Processing, Vol.29, No.1, pp.84-91 (1981).
23) Welch, P.: The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms, IEEE Trans. Audio and Electroacoustics, Vol.15, No.2, pp.70-73 (1967).
24) Unser, M.: Sampling 50 Years After Shannon, Proceedings of the IEEE, Vol.88, No.4, pp.569-587 (2000).
25) H-1-44 Vol.4, No.3, pp.31 36 (2010).
26) Schweinberger, S.R., Casper, C., Hauthal, N., Kaufmann, J.M., Kawahara, H., Kloth, N., Robertson, D.M., Simpson, A.P. and Zaeske, R.: Auditory Adaptation in Voice Perception, Current Biology, Vol.18, No.9, pp.684-688 (2008).
27) Yonezawa, T., Suzuki, N., Abe, S., Mase, K. and Kogure, K.: Perceptual continuity and naturalness of expressive strength in singing voices based on speech morphing, EURASIP Journal on Audio, Speech, and Music Processing, No.3 (2007).
28) : 5.4.15 5.8.15.
29) : http://www.wakayama-u.ac.jp/%7ekawahara/miraikandemo/straightmorph.swf.
30) Vol.48, No.12, pp.3637-3648 (2007).
31) Morise, M., Onishi, M., Kawahara, H. and Katayose, H.: v.morish'09: A morphing-based singing design interface for vocal melodies, Lecture Notes in Computer Science, No.LNCS 5709, pp.185-190 (2009).
32) : http://www.nicovideo.jp/watch/sm47471.
33) : http://www.wakayama-u.ac.jp/%7ekawahara/straightadv/index_j.html.
34) No.MUS86-6 (2010).