THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS  TECHNICAL REPORT OF IEICE.

Speech Recognition using Phase Information based on Long-Term Analysis

Kazumasa YAMAMOTO, Eiichi SUEYOSHI, and Seiichi NAKAGAWA
Department of Computer Science and Engineering, Toyohashi University of Technology
1-1 Hibarigaoka, Tempaku-cho, Toyohashi, 441-8580 Japan
E-mail: {kyama,sueyoshi,nakagawa}@slp.cs.tut.ac.jp

Abstract  Current speech recognition systems mainly use amplitude spectrum-based features such as MFCC as acoustic feature parameters, while discarding phase spectral information. The results of perceptual experiments, however, suggest that phase spectral information based on long-term analysis includes certain linguistic information. In this paper, we propose the use of phase features based on long-term analysis for speech recognition. We use two types of parameters: a delta phase parameter used as a group delay, and analytic group delay features. Isolated word and continuous digit recognition experiments were performed, resulting in greater than 90% word or digit accuracy in each of the experiments. The experimental results confirm that the long-term phase spectrum includes sufficient information for recognizing speech. Furthermore, combining the likelihoods of MFCC and the long-term group delay cepstrum reduced the error rate by about 20% relative to the MFCC baseline for clean speech.

Key words  speech recognition, phase information, long-term analysis, group delay spectrum

1. Introduction

Current speech recognition systems mainly use amplitude spectrum-based features such as Mel-frequency cepstral coefficients (MFCC), obtained by short-term analysis with a window of about 25 ms and a frame shift of about 10 ms, and the phase spectrum is discarded. Perceptual studies, however, suggest that the phase spectrum obtained by long-term analysis carries linguistic information [1]–[4]. Oppenheim and Lim [5] demonstrated the general importance of phase in signals.
Liu et al. [1] examined the effect of phase on the perception of intervocalic stop consonants and reported that, with long analysis windows (around 192 ms), the phase spectrum strongly affects perception. Paliwal and Alsteris [2] showed that the phase spectrum is useful for human speech perception when long analysis windows are used, and automatic recognition of speech reconstructed only from short-time Fourier phase spectra has also been examined [6]. Kazama et al. [4] reported that the phase spectrum contributes to speech intelligibility for very long (around 256 ms) and very short (around 2 ms) analysis frames, and Shi et al. [3] investigated the importance of phase in human speech recognition. For automatic recognition, several studies have tried to exploit phase information directly [7], [8], frequency-modulation features derived from the phase have been compared with amplitude-based features [9], and phase information combined with MFCC has been used for speaker identification [10]. In this paper, we propose phase-based features obtained by long-term analysis (100 ms or longer) for speech recognition, namely features based on the group delay: 1) a delta phase parameter computed as the slope of the unwrapped phase spectrum, and 2) an analytic group delay feature. Both are described in Section 2.

2. Phase and Group Delay Features

2.1 Phase spectrum

Let x(t) be a speech signal and w(τ) an analysis window. The short-time Fourier spectrum at time t and frequency f is

    S(f, t) = \int x(t + \tau) w(\tau) e^{-jf\tau} d\tau = |S(f, t)| e^{j\theta(f, t)},        (1)

where |S(f, t)| is the amplitude spectrum of x(t) and θ(f, t) is its phase spectrum,

    \theta(f, t) = \arg S(f, t).        (2)

We have previously shown that phase information combined with MFCC is effective for speaker identification in noisy environments [10], [11].

Fig. 1  Phase unwrapping (phase spectrum vs. frequency, before and after unwrapping).

2.2 Group delay

The group delay is defined as the negative derivative of the phase spectrum with respect to frequency,

    G(f) = -\frac{d\theta(f)}{df},        (3)

where θ(f) is the phase spectrum of x(t) given by Eq. (2) (the frame index t is omitted). The phase obtained by Eq. (2) is wrapped into (-π, +π], so it must be unwrapped before differentiation, as illustrated in Fig. 1. Since the spectrum is computed by a DFT, the derivative in Eq. (3) is approximated by a difference between neighbouring frequency bins. From the long-term phase spectrum we derive two group-delay-based features: the delta phase (lf) parameters (Section 2.2.1) and the analytic group delay and its cepstrum (Section 2.2.2).
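The following is a minimal numpy sketch, not the implementation used for the experiments, of how Eqs. (1)-(3) can be evaluated for one long-term analysis frame; the Hamming window and the conversion of the finite difference to seconds are assumptions made here for illustration.

```python
import numpy as np

def long_term_phase_and_group_delay(frame, fs=16000, n_fft=4096):
    """Sketch of Eqs. (1)-(3) for one long-term analysis frame.

    frame : speech samples of one analysis frame (e.g. 256 ms = 4096 samples at 16 kHz)
    fs    : sampling frequency [Hz]
    n_fft : DFT length (4096 for the 256 ms condition in the paper)
    The Hamming window is an assumption; the paper does not specify the window shape.
    """
    windowed = frame * np.hamming(len(frame))
    spec = np.fft.rfft(windowed, n=n_fft)               # S(f) of Eq. (1)
    theta = np.unwrap(np.angle(spec))                   # theta(f) of Eq. (2), unwrapped as in Fig. 1
    df = fs / n_fft                                     # width of one DFT bin [Hz]
    group_delay = -np.diff(theta) / (2 * np.pi * df)    # Eq. (3) as a finite difference, in seconds
    return theta, group_delay
```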
2.2.1 Delta phase (lf) parameters

The first feature is the slope of the unwrapped long-term phase spectrum, evaluated at equally spaced DFT bins l_f, as illustrated in Fig. 2. Around each bin l_f, the slope is estimated from the 2K neighbouring bins as

    \Delta\theta(l_f) = \frac{\sum_{k=1}^{K} k\,(\theta(l_f + k) - \theta(l_f - k))}{2 \sum_{k=1}^{K} k^2},        (4)

where l_f is the DFT bin at which the slope is evaluated and K determines how many neighbouring bins are used. Speech sampled at 16 kHz (0 to 8000 Hz) and analyzed with a 256 ms window gives a 4096-point DFT (bins 0 to 2047); for 30-dimensional features, the bins l_f are placed at intervals of 66 bins (about 2048/31), and K = 33 (= 66/2).

Fig. 2  Calculation of the lf parameters (the slope of the unwrapped phase spectrum is calculated as a function of frequency).

2.2.2 Analytic group delay and long-term group delay cepstrum

The group delay can also be computed analytically, without phase unwrapping [12]:

    G_{ana}(f) = \frac{X_R(f) Y_R(f) + X_I(f) Y_I(f)}{|X(f)|^2},        (5)

where X_R(f) and X_I(f) are the real and imaginary parts of the DFT of x(t), and Y_R(f) and Y_I(f) are those of t x(t). Since Eq. (5) is obtained directly from these spectra, no unwrapping is required. Applying the discrete cosine transform (DCT) to the analytic group delay spectrum computed from a long-term analysis frame yields our second feature, the Long-Term Group Delay Cepstrum (LTGDC). Hegde et al. obtained cepstral features by applying the DCT to a modified group delay spectrum [12], and Zhu and Paliwal proposed a DCT-based feature combining the power spectrum and the group delay function (MFMGDCC) [13]; these features, however, are based on conventional short-term analysis (about 30 ms), whereas the LTGDC is based on long-term analysis.
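The sketch below illustrates how the lf parameters of Eq. (4) and the LTGDC of Eq. (5) could be computed for one long-term frame; it is not the authors' implementation. The placement of the l_f bins, the small constant added to the denominator, and the DCT normalization are assumptions. The input theta is an unwrapped phase spectrum such as the one returned by the sketch after Eq. (3).

```python
import numpy as np
from scipy.fft import dct

def lf_parameters(theta, num_dims=30, K=33):
    """Sketch of Eq. (4): slope of the unwrapped phase spectrum at selected DFT bins.

    theta    : unwrapped phase spectrum of one long-term frame (about 2048 bins for a 4096-pt DFT)
    num_dims : number of lf parameters (10/20/30 in the experiments)
    K        : half-width of the regression window in bins (33 in the paper for 30 dimensions)
    Placing the l_f bins with np.linspace is an assumption about the exact bin positions.
    """
    n_bins = len(theta)
    centres = np.linspace(K, n_bins - 1 - K, num_dims).astype(int)   # keep l_f +/- K in range
    k = np.arange(1, K + 1)
    denom = 2.0 * np.sum(k ** 2)
    feats = [np.sum(k * (theta[lf + k] - theta[lf - k])) / denom for lf in centres]
    return np.array(feats)

def ltgdc(frame, n_fft=4096, num_ceps=10):
    """Sketch of Eq. (5) and the LTGDC: analytic group delay of one frame followed by a DCT."""
    t = np.arange(len(frame))
    X = np.fft.rfft(frame, n=n_fft)                       # DFT of x(t)
    Y = np.fft.rfft(t * frame, n=n_fft)                   # DFT of t * x(t)
    g_ana = (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-12)   # Eq. (5)
    return dct(g_ana, type=2, norm='ortho')[:num_ceps]    # long-term group delay cepstrum
```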
3. Recognition Experiments

3.1 Isolated word recognition experiments

3.1.1 Experimental conditions

Isolated word recognition experiments were carried out on the CIAIR-VCV children's speech corpus [14], with speech sampled at 16 kHz. The phase-based features were computed with analysis window lengths of 25 ms, 100 ms, and 256 ms and a frame shift of 10 ms. Three feature sets were compared: the lf parameters of Section 2.2.1 with 10, 20, and 30 dimensions (Table 1); the group delay smoothed by a 4-point or 9-point moving average, with 30 dimensions (Table 2); and the LTGDC of Section 2.2.2 with 10 and 20 dimensions (Table 3). The baseline was a 38-dimensional MFCC feature (12 MFCC + 12 delta MFCC + 12 delta-delta MFCC + delta power + delta-delta power) computed with a 25 ms window and a 10 ms shift. HMMs were trained and evaluated with HTK-3.4.1 (http://htk.eng.cam.ac.uk/).

3.1.2 Results

Tables 1, 2, and 3 show the word recognition rates for the lf parameters, the moving-averaged group delay features, and the LTGDC, respectively. With the conventional 25 ms window the phase-based features perform poorly, whereas with long-term analysis (100 ms and 256 ms windows) the lf parameters and the LTGDC exceed 90% word accuracy, reaching about 96 to 97%; the moving-averaged group delay also improves greatly with window length but remains below the other two. Although none of the phase features alone reach the MFCC baseline of 99.5%, these results confirm that the long-term phase spectrum carries sufficient information for recognizing speech. Table 4 shows the effect of the frame shift on the 30-dimensional lf parameters: with long-term windows, enlarging the shift from 10 ms to 20 ms slightly improved the accuracy, while a 40 ms shift degraded it.

Table 1  Word recognition rates using the lf parameters [%].
  Dimension of parameters    25 ms   100 ms   256 ms
  10                          47.3     95.1     96.3
  20                          48.3     96.1     97.1
  30                          51.5     96.3     97.1

Table 2  Word recognition rates using the 30-dimensional moving-averaged group delay features [%].
  Points of moving average    25 ms   100 ms   256 ms
  4 pts                         7.6     68.6     84.3
  9 pts                        13.7     83.1     91.2

Table 3  Word recognition rates using the LTGDC [%].
  Dimension of parameters    25 ms   100 ms   256 ms
  10                          17.9     96.3     95.8
  20                          11.2     93.9     97.0

Table 4  Recognition results altering the frame shift for the 30-dimensional lf parameters [%].
  Frame shift [ms]    25 ms   100 ms   256 ms
  10                   51.5     96.3     97.1
  20                   31.8     97.0     98.1
  40                   21.7     93.2     96.9

3.2 Connected digit recognition experiments

3.2.1 Experimental conditions

Connected digit recognition experiments were carried out on the CENSREC-1 (AURORA-2J) database [15]. The training set consists of 8,440 utterances from 110 speakers (55 male and 55 female). Test set A contains speech to which Subway, Babble, Car, and Exhibition noise is added and which is passed through a G.712 filter; test set B contains Restaurant, Street, Airport, and Station noise, also with the G.712 filter; test set C uses the MIRS filter characteristic. Each noise is added at seven conditions (clean and SNRs of 20, 15, 10, 5, 0, and -5 dB), with 1,001 utterances per condition. As phase features we used the 30-dimensional lf parameters and the LTGDC, computed with analysis windows of 25, 64, 128, and 256 ms and a 10 ms frame shift; the MFCC baseline follows the standard 39-dimensional CENSREC-1 configuration and was also computed with each of the four window lengths for comparison.

3.2.2 Results

Table 5 compares the digit accuracies for the different analysis window lengths, averaged over the four noise environments of each test set and over the 20 dB to 0 dB SNR conditions. The MFCC accuracy is highest with the conventional 25 ms window and drops sharply when the window is lengthened to 256 ms. In contrast, the phase-based features improve as the window is lengthened: the lf parameters perform best with the 256 ms window, and the LTGDC performs best with the 128 ms or 256 ms window.

Table 5  Comparison of connected digit recognition results for the CENSREC-1 database with various analysis window lengths (digit accuracy [%], averaged over the 20 dB to 0 dB SNR conditions).
  Test set    Parameter    25 ms    64 ms    128 ms   256 ms
  A           MFCC         48.21    45.43    44.06    30.43
  A           lf            8.82    26.15    39.77    46.14
  A           LTGDC        28.90     9.02    35.05    32.87
  B           MFCC         42.64    42.55    44.80    38.61
  B           lf            6.52    31.94    43.24    51.57
  B           LTGDC        13.84    26.05    47.05    47.53
Table 6 shows the detailed digit accuracies for each noise environment and SNR: (a) the MFCC baseline (25 ms window), (b) the lf parameters alone, and (c) the LTGDC alone, the latter two computed with the 128 ms window; the "Ave." rows are averages over the 20 dB to 0 dB conditions. For clean speech the phase-based features alone achieve digit accuracies above 90%, although they do not reach the MFCC baseline of more than 99.6%. Under noise, however, the LTGDC outperforms the MFCC baseline especially in the Babble, Restaurant, and Airport environments, that is, in noises that consist largely of human speech.

We therefore combined the two kinds of features. The 50-best hypotheses output by the CENSREC-1 MFCC baseline (25 ms window) were rescored with the LTGDC (128 ms window): for each hypothesis, the MFCC log-likelihood L_MFCC and the LTGDC log-likelihood L_LTGDC were combined as

    L_{comb} = \alpha L_{MFCC} + (1 - \alpha) L_{LTGDC},        (6)

with the weight set to α = 0.4, and the hypothesis with the largest combined likelihood was selected as the recognition result. Table 6(d) shows the results. The combination outperforms the MFCC baseline in almost all noise environments and SNR conditions, and for clean speech the error rate is reduced by about 20% relative. This indicates that the long-term group delay cepstrum carries information complementary to the amplitude-based MFCC features.
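A minimal sketch of the rescoring step with Eq. (6) is given below. The N-best data structure and the way the two log-likelihoods are attached to each hypothesis (e.g. by rescoring the hypothesis with the LTGDC-based models) are assumptions, not details given in the paper.

```python
def rescore_nbest(nbest, alpha=0.4):
    """Pick the best hypothesis from an N-best list (50-best in the experiments)
    by combining MFCC and LTGDC log-likelihoods as in Eq. (6).

    nbest : list of (hypothesis, l_mfcc, l_ltgdc) tuples -- a hypothetical structure;
            l_mfcc and l_ltgdc are the log-likelihoods of the hypothesis under the
            MFCC-based and LTGDC-based acoustic models.
    alpha : interpolation weight (0.4 in the experiments).
    """
    def combined(item):
        _, l_mfcc, l_ltgdc = item
        return alpha * l_mfcc + (1.0 - alpha) * l_ltgdc   # Eq. (6)

    return max(nbest, key=combined)[0]
```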
4. Conclusion

In this report we proposed phase-based features obtained by long-term analysis, the delta phase (lf) parameters and the long-term group delay cepstrum (LTGDC), and evaluated them on isolated word and connected digit recognition. The phase-based features alone achieved word and digit accuracies above 90%, confirming that the long-term phase spectrum contains sufficient information for recognizing speech, and at low SNRs the LTGDC was more robust than MFCC in noise environments containing human speech. Combining the likelihoods of MFCC and the LTGDC outperformed the MFCC baseline. As future work, we plan to apply the modified group delay (MGD) of Hegde et al. [12],

    G_{mod}(f) = \mathrm{sign} \cdot \left( \frac{|X_R(f) Y_R(f) + X_I(f) Y_I(f)|}{|S_c(f)|^{2\gamma}} \right)^{\beta},        (7)

where "sign" denotes the sign of X_R(f) Y_R(f) + X_I(f) Y_I(f), S_c(f) is the cepstrally smoothed spectrum, and β and γ are control parameters, to long-term analysis and to investigate its effect on the LTGDC.

Table 6  Digit accuracy for the CENSREC-1 database [%]. The columns are the four noise environments of test set A (Subway, Babble, Car, Exhibition) followed by the four of test set B (Restaurant, Street, Airport, Station); "Ave." is the average over the 20 dB to 0 dB conditions.

(a) Digit accuracy using MFCC (baseline) [%]
            Subway   Babble   Car      Exhib.   Restau.  Street   Airport  Station
  Clean      99.75    99.67    99.88    99.69    99.75    99.67    99.88    99.69
  20 dB      97.88    72.16    84.97    97.78    80.99    83.52    80.11    82.78
  15 dB      84.77    53.51    55.56    88.58    57.54    60.85    57.26    55.29
  10 dB      57.02    37.58    37.22    60.78    38.66    39.78    42.38    38.26
   5 dB      28.40    21.19    21.12    27.15    18.94    23.22    27.17    22.12
   0 dB      10.53     6.17    10.53    11.32     7.46    12.52    13.72    10.27
  -5 dB       7.03     1.66     7.84     7.99     0.74     7.77     7.46     7.28
  Ave.       55.72    38.12    41.88    57.12    40.72    43.98    44.13    41.74

(b) Digit accuracy using only the lf parameters [%]
            Subway   Babble   Car      Exhib.   Restau.  Street   Airport  Station
  Clean      92.82    92.53    92.48    92.07    92.82    92.53    92.48    92.07
  20 dB      70.71    81.32    76.29    67.91    80.81    77.36    83.03    76.61
  15 dB      55.70    69.07    54.91    53.47    70.06    62.79    66.90    52.64
  10 dB      34.76    49.21    25.26    33.82    49.89    39.75    40.41    28.17
   5 dB      19.13    28.42    13.33    17.37    30.40    19.98    21.50    15.30
   0 dB      10.81    14.72     9.22     9.97    15.23    11.91    12.14     9.94
  -5 dB       8.50     8.98     8.29     8.45     9.30     8.22     8.53     8.45
  Ave.       38.22    48.55    35.80    36.51    49.28    42.36    44.80    36.53

(c) Digit accuracy using only the LTGDC [%]
            Subway   Babble   Car      Exhib.   Restau.  Street   Airport  Station
  Clean      90.70    90.60    92.31    92.38    90.70    90.60    92.31    92.38
  20 dB      61.47    83.65    81.96    42.36    82.38    72.61    84.70    81.06
  15 dB      42.55    75.39    67.79    22.43    73.17    56.32    73.43    67.20
  10 dB      23.67    59.04    43.63     6.54    57.54    37.88    51.18    42.61
   5 dB      10.93    36.70    18.76     2.16    36.94    20.34    28.93    21.32
   0 dB       4.30    16.51    11.06     5.52    18.08     8.74    14.91    11.72
  -5 dB       4.21     9.70     8.68     2.44    10.72     6.95    10.20     9.16
  Ave.       28.58    54.26    44.64    12.73    53.62    39.18    50.63    44.78

(d) Digit accuracy with rescoring and likelihood combination of MFCC and LTGDC [%]
            Subway   Babble   Car      Exhib.   Restau.  Street   Airport  Station
  Clean      99.79    99.67    99.94    99.75    99.79    99.67    99.94    99.75
  20 dB      98.22    75.79    87.56    97.84    84.19    85.85    83.36    85.53
  15 dB      87.14    56.05    59.14    88.71    60.79    63.30    60.63    59.06
  10 dB      59.78    39.39    39.34    61.15    41.14    41.87    44.65    40.51
   5 dB      29.63    22.61    21.86    26.57    20.57    24.88    28.78    23.73
   0 dB      12.47     7.80    10.17    10.74     8.41    12.91    15.87    11.45
  -5 dB       8.11     3.20     8.05     8.15     2.49     7.92     9.19     7.93
  Ave.       57.45    40.33    43.72    57.00    43.02    45.76    46.66    44.06

References

[1] Liu, L., He, J., and Palm, G., Effects of phase on the perception of intervocalic stop consonants, Speech Communication, Vol.22, No.4, pp.403-417, 1997.
[2] Paliwal, K. K. and Alsteris, L. D., Usefulness of phase spectrum in human speech perception, Proc. Eurospeech 2003, pp.2117-2120, 2003.
[3] Shi, G., Shanechi, M. M., and Aarabi, P., On the importance of phase in human speech recognition, IEEE Transactions on Audio, Speech, and Language Processing, Vol.14, No.5, pp.1867-1874, 2006.
[4] Kazama, M., Gotoh, S., Tohyama, M., and Houtgast, T., On the significance of phase in the short term Fourier spectrum for speech intelligibility, The Journal of the Acoustical Society of America, Vol.127, No.3, pp.1432-1439, 2010.
[5] Oppenheim, A. V. and Lim, J. S., The importance of phase in signals, Proc. of the IEEE, Vol.69, No.5, pp.529-541, 1981.
[6] Alsteris, L. D. and Paliwal, K. K., ASR on speech reconstructed from short-time Fourier phase spectra, Proc. ICSLP 2004, 2004.
[7] Schlüter, R. and Ney, H., Using phase spectrum information for improved speech recognition performance, Proc. ICASSP 2001, Vol.1, pp.133-136, 2001.
[8] Aarabi, P., Shi, G., Shanechi, M. M., and Rabi, S. A., Phase-Based Speech Processing, World Scientific, 2005.
[9] Kubo, Y., Okawa, S., Kurematsu, A., and Shirai, K., A comparative study on AM and FM features, Proc. INTERSPEECH 2008, pp.642-645, 2008.
[10] Wang, L., Minami, K., Yamamoto, K., and Nakagawa, S., Speaker identification by combining MFCC and phase information in noisy environments, Proc. ICASSP 2010, pp.4502-4505, 2010.
[11] (in Japanese), 1-P-17, pp.161-164, 2009.
[12] Hegde, R. M., Murthy, H. A., and Gadde, V. R. R., Significance of the modified group delay feature in speech recognition, IEEE Transactions on Audio, Speech, and Language Processing, Vol.15, No.1, pp.190-202, 2007.
[13] Zhu, D. and Paliwal, K. K., Product of power spectrum and group delay function for speech recognition, Proc. ICASSP 2004, Vol.I, pp.I-125-I-128, 2004.
[14] CIAIR Children Voice Speech Corpus (CIAIR-VCV). Online: http://db.ciair.coe.nagoya-u.ac.jp/eng/dbciair/dbciair2/kodomo.htm, accessed on 23 Apr. 2010.
[15] Nakamura, S. et al., AURORA-2J: An evaluation framework for Japanese noisy speech recognition, IEICE Transactions on Information and Systems, Vol.E88-D, No.3, pp.535-544, 2005.