
Prototype of a framework for overviewing the performance of F0 estimators

Morise Masanori 1,a)  Kawahara Hideki 2

1 University of Yamanashi, 4-3-11, Takeda, Kofu, 400-8511, Japan
2 Wakayama University, 930, Sakaedani, Wakayama, 640-8510, Japan
a) mmorise@yamanashi.ac.jp

Abstract: This article presents a framework for overviewing the performance of fundamental frequency (F0) estimators and evaluates its effectiveness. Many F0 estimators have been proposed, and their effectiveness has been evaluated with speech databases. However, because the result of an evaluation depends on the speech database used, it is difficult to compare estimators fairly. The framework, named TUSK, does not rank the estimators but instead gives an overview of their characteristics. In this article, we introduce the concept of TUSK and its evaluation criteria, and examine its effectiveness with several modern F0 estimators.

Keywords: Speech analysis, fundamental frequency, temporal variation, noise robustness

1. [...]

[...] (F0) [...] F0 [...] Vocoder [1] [...]

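[...] SNR [...] F0 [...] [2], [3], [4] [...] TUSK [...] 6 [...]

2. [...]

[...] [5] [...] [6] [...] YIN [7] [...] pYIN [8] [...] Cepstrum [9], [10] [...] [11] [...] [12] [...] [13] [...] F0 [14] [...] F0 [15] [...] DIO [16] [...] [17], [18] [...] 14 [...] [12] [...] CMU *1 [...] Bagshaw *2 [...] Electroglottograph (EGG) [...] EGG [...] F0 [...] Fine pitch error (FPE) [...] Gross pitch error (GPE) [19] [...] Gross error [7] [...] F0 [...] Gross error [...] FPE [...]

3. TUSK

[...] TUSK [...] F0 [...] 6 [...]

3.1 [...]

$$x(t) = n(t) + h(t) * \sum_{k=1}^{K} a_k \cos\left( 2\pi k \int_0^t f_0(\tau)\, d\tau + \theta_k \right),$$

where $*$ denotes convolution, $n(t)$ is the additive noise, $h(t)$ is an impulse response, $f_0(t)$ is the F0 trajectory, $a_k$ and $\theta_k$ are the amplitude and phase of the $k$-th harmonic, and $K$ is the number of harmonics [...] $K f_0(t)$ [...] TUSK [...] F0 [...] $f_0(t)$ [...] $f_c$ [...] F0 [...] Klatt [...]

*1 http://festvox.org/cmu_arctic/index.html
*2 http://www.cstr.ed.ac.uk/research/projects/fda/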
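The signal model of Section 3.1 can be made concrete with a short synthesis routine. The following is a minimal sketch in Python, not the authors' implementation. The function names `synthesize_harmonic` and `max_harmonics` are hypothetical, the integral of f_0 is approximated by a running sum, n(t) = 0 and h(t) = δ(t) are assumed so that only the harmonic sum remains, and the unit amplitudes and zero phases in the demo are illustrative values. Choosing K so that K f_0 stays below the Nyquist frequency is also an assumption.

```python
import math


def synthesize_harmonic(f0_track, fs, amplitudes, phases):
    """Return samples of sum_k a_k * cos(2*pi*k*Phi(t) + theta_k).

    f0_track   -- F0 in Hz for every output sample
    fs         -- sampling frequency in Hz
    amplitudes -- a_k for k = 1..K (linear scale)
    phases     -- theta_k for k = 1..K (radians)
    Phi(t) is the integral of f0 from 0 to t, approximated by a running sum.
    Noise is taken as zero and h(t) as a unit impulse (assumptions).
    """
    samples = []
    phi = 0.0  # running integral of f0, in cycles
    for f0 in f0_track:
        value = 0.0
        for k, (a, theta) in enumerate(zip(amplitudes, phases), start=1):
            value += a * math.cos(2.0 * math.pi * k * phi + theta)
        samples.append(value)
        phi += f0 / fs  # rectangle-rule integration step
    return samples


def max_harmonics(f0_max, fs):
    """Largest K with K * f0_max strictly below Nyquist (assumed rule)."""
    return int(math.ceil(fs / 2.0 / f0_max)) - 1


if __name__ == "__main__":
    fs = 48000           # sampling frequency in Hz
    f_c = 440.0          # constant F0 in Hz
    K = max_harmonics(f_c, fs)
    # 10 ms of signal with unit amplitudes and zero phases (illustrative)
    x = synthesize_harmonic([f_c] * 480, fs, [1.0] * K, [0.0] * K)
    print(K, len(x))
```

Accumulating the phase sample by sample, rather than evaluating cos(2πk f_0 t) directly, keeps the waveform continuous when f_0(t) varies over time, which is what the integral in the model expresses.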

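[...] [20] [...]

$$f_0(t) = \frac{F_L}{50} \frac{f_c}{100} \left( \sin(2\pi 12.7 t) + \sin(2\pi 7.1 t) + \sin(2\pi 4.7 t) \right), \tag{1}$$

[...] $F_L$ [...] [20] [...] 25 [...] F0 [...] $f_0(t) = f_c$ [...] F0 [...] $n(t) = 0$, $h(t) = \delta(t)$, $a_k = 0$, $\theta_k = 1$, $f_c = 440$ [...] 1 [...] F0 [...] TUSK [...] Gross error [...] ±20% [...] 20% [...] F0 [...] FPE [...] GPE [...] F0 [...] Gross error [...] TUSK [...] F0 [...]

3.2 ACT1: [...]

[...] F0 [...] Klatt [...] F0 [...] $f_c$ [...] F0 [...]

3.3 ACT2: [...]

[...] F0 [...] ACT2 [...] F0 [...]

$$f_v(t) = \alpha f_c \cos\left( \alpha f_c\, t \right), \tag{2}$$

[...] $\alpha$ [...] F0 [...] $\alpha f_c$ [...] F0 [...] $f_c$ [...] $f_0(t)$ [...] $f_v(t)$ [...] ACT2 [...] $\alpha$ [...] F0 [...]

3.4 ACT3: [...]

[...] $a_k$ [...] ACT3 [...] (1) [...] $a_k$ [...] F0 [...] TUSK [...] $a_k$ [...]

3.5 ACT4: [...]

[...] ACT3 [...] ACT4 [...] (1) [...] $\theta_k$ [...] ACT4 [...] $2\pi$ [...]

3.6 ACT5: [...]

[...] SNR [...] ACT5 [...] $n(t)$ [...] SNR [...] 2 [...] SNR [...]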
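The ±20% gross-error criterion mentioned above can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not the evaluation code used in the study: the function name `gross_error_rate` is hypothetical, the reference F0 is taken to be known exactly for every frame (as it is for a synthetic signal), and a frame is counted as a gross error when the estimate deviates from the reference by more than 20% of the reference value.

```python
def gross_error_rate(estimated, reference, threshold=0.20):
    """Fraction of frames whose F0 estimate is a gross error.

    A frame is a gross error when |estimate - reference| exceeds
    threshold * reference (relative deviation, 20% by default).
    Both sequences hold one F0 value in Hz per frame.
    """
    if len(estimated) != len(reference):
        raise ValueError("estimated and reference must have equal length")
    if not reference:
        raise ValueError("at least one frame is required")
    errors = 0
    for est, ref in zip(estimated, reference):
        if abs(est - ref) > threshold * ref:
            errors += 1
    return errors / len(reference)


if __name__ == "__main__":
    reference = [440.0] * 5
    # 220 Hz and 880 Hz are octave errors; 445 Hz and 500 Hz are within 20%
    estimated = [440.0, 445.0, 220.0, 880.0, 500.0]
    print(gross_error_rate(estimated, reference))
```

Because the test signal is synthesized from a known f_0(t), the reference sequence needs no EGG recording or manual labelling.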

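3.7 ACT6: [...]

[...] ACT6 [...] $h(t)$ [...]

$$h_e(t) = \begin{cases} 0 & (t < 0) \\ 1 & (t = 0) \\ \exp\left( \dfrac{t \log(0.001)}{r} \right) & (\text{otherwise}) \end{cases} \tag{3}$$

[...] 10 [...] $r$ [...] $t$ [...] 0 [...] $h_e(t)$ [...] $h(t)$ [...] TUSK [...] 1 [...] 1 [...] EDT (Early Decay Time) [...] 10 dB [...] EDT [...] 0 [...] $r$ [...]

4. [...]

[...] TUSK [...] F0 [...] YIN [7] [...] SWIPE [11] [...] DIO [16] [...] DIO [...] STRAIGHT [21] [...] TANDEM-STRAIGHT [22], [23] [...] Matlab [...] DIO [...] StoneMask [...] Web *3 [...] F0 [...] 40 [...] 1000 Hz [...] 1.2 s [...] $f_0(t)$ [...] 1 ms [...] 0.1 [...] 1.1 s [...] 1000 [...] F0 [...] 48 kHz [...] ACT1 [...] ACT6 [...] 1 [...] ACT5 [...] 2 [...] ACT3 [...] ACT6 [...] 100 [...] ACT1 [...] 1 [...]

*3 http://ml.cs.yamanashi.ac.jp/world/

Table 1 [...]

Test                  Parameter and range
ACT1                  $f_c$: 2100...6900 cent (A1...A5)
ACT2                  $\alpha$: 0...25
ACT3                  $a_k$: 0...40 dB
ACT4                  $\theta_k$: 0...$2\pi$
ACT5 (white noise)    SNR: 0...60 dB
ACT5 (pink noise)     SNR: 0...60 dB
ACT6                  $r$: 10...1000 ms

Fig. 1 TUSK ACT1 [...] (horizontal axis: F0 (cent))

4.1 ACT1

[...] F0 [...] 1 cent [...] F0 [...] 2100 cent (55 Hz) [...] 6900 cent (880 Hz) [...] F0 [...]

4.2 ACT2

[...] 2 [...] F0 [...] $\alpha$ [...] $\alpha$ [...] 20 [...] F0 [...]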
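The role of the parameter r in Eq. (3) becomes clearer when the envelope is evaluated numerically. The sketch below is only an illustration under stated assumptions: the function names `decay_envelope` and `level_db` are hypothetical, log is taken to be the natural logarithm, and t and r are both expressed in milliseconds. Under these assumptions the envelope equals 0.001 (a 60 dB drop in amplitude) at t = r, and it has dropped by 10 dB at t = r / 6.

```python
import math


def decay_envelope(t_ms, r_ms):
    """Envelope h_e(t) of Eq. (3); t and r share the same unit (ms here).

    Assumes log in Eq. (3) is the natural logarithm.
    The t = 0 case needs no special branch because exp(0) = 1.
    """
    if t_ms < 0:
        return 0.0
    return math.exp(t_ms * math.log(0.001) / r_ms)


def level_db(t_ms, r_ms):
    """Envelope level in dB relative to its value at t = 0 (for t >= 0)."""
    return 20.0 * math.log10(decay_envelope(t_ms, r_ms))


if __name__ == "__main__":
    r = 300.0  # illustrative reverberation parameter in ms
    for t in (0.0, r / 6.0, r / 2.0, r):
        print(t, decay_envelope(t, r), level_db(t, r))
```

Since the level in dB falls linearly at 60 / r dB per millisecond, any decay time defined on a fixed dB drop is proportional to r.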

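The cent values quoted for ACT1 can be related to frequencies in Hz from the two anchor points given in the text, 2100 cent (55 Hz) and 6900 cent (880 Hz). The following sketch derives the reference frequency from the first anchor; the reference value itself is not stated in the source, and the names `REF_HZ`, `cent_to_hz` and `hz_to_cent` are hypothetical.

```python
import math

# Reference frequency (0 cent) implied by 2100 cent <-> 55 Hz.
# Derived here from the anchor point; not stated in the source.
REF_HZ = 55.0 / 2.0 ** (2100.0 / 1200.0)


def cent_to_hz(cent):
    """Convert a value on the cent scale to Hz (1200 cent per octave)."""
    return REF_HZ * 2.0 ** (cent / 1200.0)


def hz_to_cent(hz):
    """Convert a frequency in Hz to the cent scale."""
    return 1200.0 * math.log2(hz / REF_HZ)


if __name__ == "__main__":
    print(REF_HZ)
    for cent in (2100, 4500, 5700, 6900):
        print(cent, cent_to_hz(cent))
```

The second anchor serves as a consistency check: 6900 cent lies four octaves (4800 cent) above 2100 cent, and 55 Hz multiplied by 16 is 880 Hz.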
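Fig. 2 TUSK ACT2 [...] (horizontal axis: Vibrato intensity α)

Fig. 3 TUSK ACT3 [...] (horizontal axis: Amplitude randomization (dB))

Fig. 4 TUSK ACT4 [...] (horizontal axis: Phase randomization (rad))

Fig. 5 TUSK ACT5 (white noise) [...] (horizontal axis: SNR (dB))

Fig. 6 TUSK ACT5 (pink noise) [...] (horizontal axis: SNR (dB))

4.3 ACT3, ACT4

[...] ACT3, ACT4 [...] Figs. 3, 4 [...] DIO [...] 0.1 Hz [...]

4.4 ACT5

[...] ACT5 [...] Figs. 5, 6 [...] SNR [...]

4.5 ACT6

[...] Fig. 7 [...] 10 dB [...]

5. [...]

[...] DIO [...] SNR [...] DIO [...] SNR [...] ACT1 [...] DIO [...]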

Fig. 7 TUSK ACT6 [...] (horizontal axis: Reverberation parameter (ms))

[...] STRAIGHT [...] F0 [...]

6. [...]

[...] F0 [...] TUSK [...] F0 [...]

[...] 15H02726 [...] 26540087 [...] H25/A08 [...]

References

[1] Dudley, H.: Remaking speech, J. Acoust. Soc. Am., Vol. 11, No. 2, pp. 169-177 (1939).
[2] Morise, M.: CheapTrick, a spectral envelope estimator for high-quality speech synthesis, Speech Communication, Vol. 67, pp. 1-7 (2015).
[3] Morise, M.: Error evaluation of an F0-adaptive spectral envelope estimator in robustness against the additive noise and F0 error, IEICE Trans. Inf. & Syst., Vol. E98-D, No. 7, pp. 1405-1408 (2015).
[4] Nakano, T. and Goto, M.: A spectral envelope estimation method based on F0-adaptive multi-frame integration analysis, in Proc. SAPA-SCALE 2012, pp. 11-16 (2012).
[5] Hess, W.: Pitch determination of speech signals, Springer-Verlag (1983).
[6] Ross, M., Shaffer, H., Cohen, A., Freudberg, R. and Manley, H.: Average magnitude difference function pitch extractor, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-22, No. 5, pp. 353-362 (1974).
[7] Cheveigné, A. and Kawahara, H.: YIN, a fundamental frequency estimator for speech and music, J. Acoust. Soc. Am., Vol. 111, No. 4, pp. 1917-1930 (2002).
[8] Mauch, M. and Dixon, S.: pYIN: A fundamental frequency estimator using probabilistic threshold distributions, in Proc. ICASSP2014, pp. 659-663 (2014).
[9] Noll, A.: Short-time spectrum and cepstrum techniques for vocal pitch detection, J. Acoust. Soc. Am., Vol. 36, No. 2, pp. 296-302 (1964).
[10] Noll, A.: Cepstrum pitch determination, J. Acoust. Soc. Am., Vol. 41, No. 2, pp. 293-309 (1967).
[11] Camacho, A. and Harris, J. G.: A sawtooth waveform inspired pitch estimator for speech and music, J. Acoust. Soc. Am., Vol. 124, No. 3, pp. 1638-1652 (2008).
[12] [...] D [...] Vol. J83-DII, No. 11, pp. 2077-2086 (2000).
[13] [...] A [...] Vol. J80-A, No. 11, pp. 1848-1856 (1997).
[14] Yegnanarayana, B. and Murty, K.: Event-based instantaneous fundamental frequency estimation from speech signals, IEEE Transactions on Audio, Speech, and Language Processing, Vol. 17, No. 4, pp. 614-624 (2009).
[15] [...] Vol. 51, No. 7, pp. 509-518 (1995).
[16] [...] SNR [...] F0 [...] D [...] Vol. J93-D, No. 2, pp. 109-117 (2010).
[17] Shimamura, T. and Kobayashi, H.: Weighted autocorrelation for pitch extraction of noisy speech, IEEE Transactions on Speech and Audio Processing, Vol. 9, No. 7, pp. 727-730 (2001).
[18] Nakatani, T. and Irino, T.: Robust and accurate fundamental frequency estimation based on dominant harmonic components, J. Acoust. Soc. Am., Vol. 116, No. 6, pp. 3690-3700 (2004).
[19] Rabiner, L., Cheng, M., Rosenberg, A. and McGonegal, C.: A comparative performance study of several pitch detection algorithms, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-24, No. 5, pp. 399-418 (1976).
[20] Klatt, D. and Klatt, L.: Analysis, synthesis, and perception of voice quality variations among female and male talkers, J. Acoust. Soc. Am., Vol. 87, No. 2, pp. 820-857 (1990).
[21] Kawahara, H., Cheveigné, A., Banno, H., Takahashi, T. and Irino, T.: Nearly defect-free F0 trajectory extraction for expressive speech modifications based on STRAIGHT, in Proc. Interspeech2005, pp. 537-540 (2005).
[22] Kawahara, H., Morise, M., Takahashi, T., Nisimura, R., Irino, T. and Banno, H.: TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation, in Proc. ICASSP2008, pp. 3933-3936 (2008).
[23] Kawahara, H. and Morise, M.: Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework, SADHANA - Academy Proceedings in Engineering Sciences, Vol. 36, No. 5, pp. 713-728 (2011).