The Institute of Electronics, Information and Communication Engineers (IEICE)
IEICE Technical Report EA2017-4 (2017-07)

[Invited] Revisiting VOCODER: Why I intentionally discard the original phase of the original speech?

Hideki KAWAHARA
Wakayama University, 930 Sakae-dani, Wakayama, Wakayama, 640-8510 Japan
E-mail: kawahara@sys.wakayama-u.ac.jp

Abstract VOCODER is a framework invented about 80 years ago for narrow-band communication. It has provided a productive basis for speech research and applications. It will also play new productive roles in the age of deep learning, a rapidly expanding framework for research and deployment. I would like to introduce a perspective on these new roles of VOCODER, based on a review of the research tools that I have developed and am currently developing.

Key words speech, phase, spectrum, instantaneous frequency, group delay, sampling, deep learning

This article is a technical report without peer review, and its polished and/or extended version may be published elsewhere. Copyright 2017 by IEICE

1. 1939 VOCODER [1] VOCODER 80 [2] VOCODER [3]-[5] [6] VOCODER VOCODER
2. VOCODER VOCODER
2. 1 VOCODER [7], [8] [9] [10] pattern playback [11], [12] [13], [14]
2. 2 [15] VOCODER VOCODER LPC (Linear Predictive Coding) [16], PARCOR (PARtial autoCORrelation) [14], [17], CSM (Composite Sinusoidal Modeling) [18],
LSP (Line Spectrum Pair, LSF: Line Spectrum Frequencies) [19] [20] LSP CODE [21], [22] VOCODER [23]
2. 3 STRAIGHT VOCODER STRAIGHT [24] STRAIGHT [25] [26], [27] STRAIGHT VOCODER VOCODER STRAIGHT [28], [29] STRAIGHT TANDEM-STRAIGHT [30] [31]-[33] [34], [35] STRAIGHT( ) [5], [36] STRAIGHT [37] [38] Google Scholar STRAIGHT 2017 6 2,000 20 STRAIGHT WORLD [39] STRAIGHT [4] Mel cepstrum [40]
2. 4 WaveNet [2] WaveNet WaveNet μ-law [41] 256 VOCODER VOCODER [27]
Fig. 1 Demonstration movie for phase perception
[6], [42], [43] VOCODER Google UK [44], [45] [46], [47]
3. SparkNG SparkNG [48], [49] 30 [50] SparkNG GUI
3. 1 [51] [52] [52] MATLAB 1 Schroeder [53] 1) cos 2) sin 3) sin cos 4) Schroeder 5) 0 2π 1 1), 2), 3) [52] 50 Hz 400 Hz 20 dB [54]
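The five phase conditions listed in Section 3.1 (cos, sin, sin/cos, Schroeder, and 0 to 2π) can be illustrated with a small sketch. It is not the SparkNG MATLAB implementation. The following are assumptions made only for illustration: the fundamental frequency (100 Hz), the number of harmonics (40), the sampling rate (16 kHz), the reading of "sin cos" as sine phase on odd harmonics and cosine phase on even harmonics, the reading of "0 2π" as phases drawn uniformly from 0 to 2π, and the Schroeder phase written in one commonly used form, -πk(k-1)/N. All function names are hypothetical. The sketch shows that every condition has the same power spectrum, and therefore the same RMS level, while the waveform shape, summarized here by the crest factor, depends strongly on phase.

```python
import math
import random


def harmonic_phases(kind, n_harmonics, seed=0):
    """Return the phase in radians for harmonics 1..n_harmonics.

    Each harmonic is later synthesized as cos(2*pi*k*f0*t + phase).
    """
    ks = range(1, n_harmonics + 1)
    if kind == "cos":
        return [0.0 for _ in ks]
    if kind == "sin":
        # cos(x - pi/2) equals sin(x)
        return [-math.pi / 2 for _ in ks]
    if kind == "alt":
        # Assumption: odd harmonics in sine phase, even harmonics in cosine phase.
        return [-math.pi / 2 if k % 2 else 0.0 for k in ks]
    if kind == "schroeder":
        # Assumption: one commonly used form of the Schroeder phase.
        return [-math.pi * k * (k - 1) / n_harmonics for k in ks]
    if kind == "random":
        # Assumption: phases drawn uniformly from [0, 2*pi).
        rng = random.Random(seed)
        return [rng.uniform(0.0, 2.0 * math.pi) for _ in ks]
    raise ValueError("unknown phase condition: " + kind)


def synthesize(kind, f0=100.0, n_harmonics=40, fs=16000, duration=0.1):
    """Sum equal-amplitude harmonics of f0 with the requested phase condition."""
    phases = harmonic_phases(kind, n_harmonics)
    n_samples = int(fs * duration)
    return [
        sum(
            math.cos(2.0 * math.pi * k * f0 * n / fs + phase)
            for k, phase in enumerate(phases, start=1)
        )
        for n in range(n_samples)
    ]


def rms(signal):
    """Root-mean-square level of the signal."""
    return math.sqrt(sum(v * v for v in signal) / len(signal))


def crest_factor(signal):
    """Ratio of the peak absolute value to the RMS level."""
    return max(abs(v) for v in signal) / rms(signal)


if __name__ == "__main__":
    for kind in ("cos", "sin", "alt", "schroeder", "random"):
        wave = synthesize(kind)
        print(kind, "rms =", round(rms(wave), 3),
              "crest factor =", round(crest_factor(wave), 3))
```

With these settings the signal spans an integer number of periods, so the RMS level is identical across the five conditions. The cosine-phase signal concentrates its energy into one pulse per period, whereas the Schroeder-phase signal spreads it over the whole period.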
Fig. 3 Realtime visualization of the vocal tract shape.
Fig. 2 Time-frequency representation using non-linear frequency resolution. Upper image shows ERB_N number-based representation. Lower image shows 1/3 octave-based representation.
3. 2 ERB_N number [55] [56] 2 ERB_N number 1/3 /aiueo/ ERB_N number 1/3 FFT (Fast Fourier Transform) Bark [57]
3. 3 PARCOR SparkNG 3 PARCOR 3 MacBook Pro (Retina, 13-inch, 2.9 GHz Intel Core i5) MATLAB (R2017a) 20 fps MATLAB 3
3. 4 SparkNG GUI
3. 4. 1 4 GUI GUI 4 GUI
Fig. 4 Filter manipulation GUI of the speech production simulator.
44,100 Hz LSP 3 3
3. 4. 2 Fant L-F model [58] L-F model L-F model t_p t_p t_a t_c 4 5 GUI L-F model 3 L-F model (t_p, t_e, t_a, t_c) 5 t_a +6 dB/oct modal, breathy, vocal fry [59]
Fig. 5 Glottal source manipulation GUI of the speech production simulator.
3. 4. 3 L-F model L-F model [60] Fujisaki-Ljungqvist model [61]
[62] cos [46] cos 80 dB [47] [44] L-F model VOCODER [44] [63] SparkNG
4. [42], [43] VOCODER VOCODER [3], [5] WaveNet VOCODER
16K12464 (B)15H02726 VOCODER ATR STRAIGHT

References
[1] H. Dudley, Remaking Speech, The Journal of the Acoustical Society of America, vol.11, no.2, pp.169-177, 1939.
[2] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, WaveNet: A generative model for raw audio, arXiv preprint arXiv:1609.03499, pp.1-15, 2016.
[3] Y.C. Eldar and T. Michaeli, Beyond bandlimited sampling, IEEE Signal Processing Magazine, vol.26, no.3, pp.48-68, May 2009.
[4] vol.73 no.9 p. 2017 [ ]
[5] 1 5 3 3 1, ( 15-May-2017) http://www.ieicehbkb.org/portal/
[6] M. Blaauw and J. Bonada, A neural parametric singing synthesizer, arXiv preprint arXiv:1704.03809, pp.1-9, Apr. 2017. http://arxiv.org/abs/1704.03809
[7] T. Chiba and M. Kajiyama, The Vowel, Its Nature and Structure, Tokyo-Kaiseikan, 1941.
[8] vol.5 no.2 pp.15-30 2001
[9] G. Fant, Acoustic theory of speech production: with calculations based on X-ray studies of Russian articulations, vol.2, Walter de Gruyter, 1971. [originally, 1960, Mouton].
[10] [ ] Sona-Graph vol.11 no.1 pp.57-64 1955
[11] F.S. Cooper, A.M. Liberman, and J.M. Borst, The interconversion of audible and visible patterns as a basis for research in the perception of speech, Proc. N. A. S., vol.37, pp.318-325, 1951.
[12] pp.1 Q 28,429 430 2005
[13] C.G. Bell, H. Fujisaki, J.M. Heinz, K.N. Stevens, and A.S. House, Reduction of Speech Spectra by Analysis-by-Synthesis Techniques, The Journal of the Acoustical Society of America, vol.33, no.12, pp.1725-1736, 1961.
[14] [ ] vol.19 no.7 pp.644-656 1978
[15] vol.53a no.1 pp.35-42 1970
[16] B.S. Atal and S.L. Hanauer, Speech analysis and synthesis by linear prediction of the speech wave, The Journal of the Acoustical Society of America, vol.50, no.2B, pp.637-655, 1971.
[17] pp.2 2 6 1969
[18] vol.j64-a no.2 pp.105-112 1981
[19] (LSP) A vol.64 no.8 pp.599-606 1981
[20] vol.j83-a no.11 pp.1244-1255 2000
[21] M. Schroeder and B.S. Atal, Code-excited linear prediction (CELP): High-quality speech at very low bit rates, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '85), vol.10, IEEE, pp.937-940, 1985.
[22] ITU-T, G.729: Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction (CS-ACELP), 2012. [started 1996, in force 2012].
[23] A.S. Spanias, Speech coding: A tutorial review, Proceedings of the IEEE, vol.82, no.10, pp.1541-1582, 1994.
[24] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction, Speech Communication, vol.27, no.3-4, pp.187-207, 1999.
[25] A.S. Bregman, et al., Auditory scene analysis, vol.10, Cambridge, MA: MIT Press, 1990.
[26] vocoder: STRAIGHT (< > ), vol.54 no.7 pp.521-526 1998
[27] Vocoder: STRAIGHT, vol.63 no.8 pp.442-449 2007
[28] C. Liu and D. Kewley-Port, Vowel formant discrimination for high-fidelity speech, The Journal of the Acoustical Society of America, vol.116, no.2, pp.1224-1233, 2004.
[29] P.F. Assmann and W.F. Katz, Synthesis fidelity and time-varying spectral change in vowels, The Journal of the Acoustical Society of America, vol.117, no.2, pp.886-895, 2005.
[30] H. Kawahara, M. Morise, T. Takahashi, R. Nisimura, T. Irino, and H. Banno, TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0 and aperiodicity estimation, ICASSP 2008, pp.3933-3936, Las Vegas, 2008.
[31] H. Kawahara and H. Matsui, Auditory morphing based on an elastic perceptual distance metric in an interference-free time-frequency representation, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol.1, pp.256-259, 2003.
[32] H. Kawahara, R. Nisimura, T. Irino, M. Morise, T. Takahashi, and H. Banno, Temporally variable multi-aspect auditory morphing enabling extrapolation without objective and perceptual breakdown, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, pp.3905-3908, 2009.
[33] H. Kawahara, M. Morise, H. Banno, and V.G. Skuk, Temporally variable multi-aspect N-way morphing based on interference-free speech representations, APSIPA ASC 2013, p.0s28.02, 2013.
[34] L. Bruckert, P. Bestelmeyer, M. Latinus, J. Rouger, I. Charest, G.A. Rousselet, H. Kawahara, and P. Belin, Vocal Attractiveness Increases by Averaging, Current Biology, vol.20, no.2, pp.116-120, 2010.
[35] S.R. Schweinberger, C. Casper, N. Hauthal, J.M. Kaufmann, H. Kawahara, N. Kloth, D.M.C. Robertson, A.P. Simpson, and R. Zäske, Auditory Adaptation in Voice Perception, Current Biology, vol.18, pp.684-688, 2008.
[36] M. Unser, Sampling-50 years after Shannon, Proceedings of the IEEE, vol.88, no.4, pp.569-587, Apr. 2000.
[37] H. Zen, K. Tokuda, and A.W. Black, Statistical parametric speech synthesis, Speech Communication, vol.51, no.11, pp.1039-1064, Nov. 2009.
[38] H. Zen, T. Toda, M. Nakamura, and K. Tokuda, Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005, IEICE Transactions on Information and Systems, vol.90, no.1, pp.325-333, 2007.
[39] M. Morise, F. Yokomori, and K. Ozawa, WORLD: A vocoder-based high-quality speech synthesis system for real-time applications, IEICE Transactions on Information and Systems, vol.99, no.7, pp.1877-1884, 2016.
[40] S. Imai, Cepstral analysis synthesis on the mel frequency scale, ICASSP 83. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol.8, pp.93-96, Institute of Electrical and Electronics Engineers, Boston, Apr. 1983.
[41] ITU-T, G.711: Pulse code modulation (PCM) of voice frequencies, 1988.
[42] Y. Saito, S. Takamichi, and H. Saruwatari, Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp.4900-4904, 2017.
[43] T. Kaneko, H. Kameoka, N. Hojo, Y. Ijima, K. Hiramatsu, and K. Kashino, Generative adversarial network-based postfilter for statistical parametric speech synthesis, Proc. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2017), pp.4910-4914, 2017.
[44] H. Kawahara, Y. Agiomyrgiannakis, and H. Zen, Using instantaneous frequency and aperiodicity detection to estimate F0 for high-quality speech synthesis, arXiv preprint arXiv:1605.07809, pp.1-10, 2016. http://arxiv.org/abs/1605.07809
[45] H. Kawahara, Y. Agiomyrgiannakis, and H. Zen, YANG VOCODER: Yet-ANother-Generalized VOCODER. (2017-06-13). https://github.com/google/yang_vocoder
[46] H. Kawahara, K. Sakakibara, H. Banno, M. Morise, T. Toda, and T. Irino, A new cosine series antialiasing function and its application to aliasing-free glottal source models for speech and singing synthesis, Proc. Interspeech 2017, p., 2017. (Accepted: Extended draft: arXiv preprint arXiv:1702.06724).
[47] H. Kawahara, K. Sakakibara, H. Banno, M. Morise, and T. Toda, A modulation property of time-frequency derivatives of filtered phase and its application to aperiodicity and fo estimation, Proc. Interspeech 2017, p., 2017. (Accepted: Extended draft: arXiv preprint arXiv:1706.02964).
[48] vol.18 no.3 pp.43-52 2014
[49] H. Kawahara, MATLAB realtime speech tools and voice production tools, (20-Feb.-2017). http://www.wakayama-u.ac.jp/%7ekawahara/sparkng/
[50],, H-87-21 1987 ( NTT (1989) )
[51] R. Plomp and H. Steeneken, Effect of phase on the timbre of complex tones, The Journal of the Acoustical Society of America, vol.46, no.2B, pp.409-421, 1969.
[52] R.D. Patterson, A pulse ribbon model of monaural phase perception, The Journal of the Acoustical Society of America, vol.82, no.5, pp.1560-1586, 1987.
[53] M. Schroeder, Synthesis of low-peak-factor signals and binary sequences with low autocorrelation (corresp.), IEEE Transactions on Information Theory, vol.16, no.1, pp.85-89, Jan. 1970.
[54] J. Skoglund and W.B. Kleijn, On time-frequency masking in voiced speech, IEEE Transactions on Speech and Audio Processing, vol.8, no.4, pp.361-369, Jul. 2000.
[55] B.C.J. Moore, An introduction to the psychology of hearing: sixth edition, Emerald, 2012.
[56] D.D. Greenwood, A cochlear frequency-position function for several species - 29 years later, The Journal of the Acoustical Society of America, vol.87, no.6, pp.2592-2605, 1990.
[57] E. Zwicker and E. Terhardt, Analytical expressions for critical-band rate and critical bandwidth as a function of frequency, The Journal of the Acoustical Society of America, vol.68, no.5, pp.1523-1525, 1980.
[58] G. Fant, J. Liljencrants, and Q.-G. Lin, A four-parameter model of glottal flow, STL-QPSR, vol.4, no.1985, pp.1-13, 1985.
[59] D.G. Childers and C. Ahn, Modeling the glottal volume velocity waveform for three voice types, The Journal of the Acoustical Society of America, vol.97, no.1, pp.505-519, 1995.
[60] H. Kawahara, K.-I. Sakakibara, H. Banno, M. Morise, T. Toda, and T. Irino, Aliasing-free implementation of discrete-time glottal source models and their applications to speech synthesis and F0 extractor evaluation, 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp.520-529, IEEE, Hong Kong, Dec. 2015.
[61] H. Fujisaki and M. Ljungqvist, Proposal and evaluation of models for the glottal source waveform, ICASSP 1986, pp.1605-1608, Tokyo, 1986.
[62] P.H. Milenkovic, Voice source model for continuous control of pitch period, The Journal of the Acoustical Society of America, vol.93, no.2, pp.1087-1096, 1993.
[63] I.R. Titze, Nonlinear source-filter coupling in phonation: Theory, The Journal of the Acoustical Society of America, vol.123, no.5, pp.2733-2749, May 2008.