Estimation, Evaluation and Guarantee of the Reverberant Speech Recognition Performance based on Room Acoustic Parameters

Vol.21-SLP-83 No.9 21/1/29 1 Estimation, Evaluation and Guarantee of the Reverberant Speech Recognition Performance based on Room Acoustic Parameters Takanobu Nishiura 1 We study on estimation, evaluation and guarantee of the reverberant speech recognition performance based on room acoustic parameters in this paper. We first designed the suitable reverberation criteria with the relation between room acoustic parameters and speech recognition performance. We then estimated, evaluated and guaranteed the speech recognition performance based on our designed reverberation criteria. As a result of evaluation experiments, we could confirm that the recognition performance could be accurately and robustly estimated, evaluated and guaranteed with proposed criteria. 1 College of Information Science and Engineering, Ritsumeikan University 1. 1965 M. R. Schroeder 1) 2) 2) RSR-D n (Reverberant Speech Recognition criteria with D n ) 2. 2.1 (T 6 ) (T 6 ) 3) 1 1-6 db 1 c 21 Information Processing Society of Japan

3. RSR-D n 3.1 4) ISO3382 Annex A 3.1.1 Definition(D ) C (Clarity), D (Definition) Ts(Centre time) 3 C D D D D (1) D n = n h 2 (t)dt/ h 2 (t)dt. (1) h(t) n D 3.2 RSR-D n D D RSR-D n (Reverberant Speech Recognition criteria with D n ) 3.2.1 RSR-D n RSR-D n 1 Step.1 1 1 A Design of Reverberation Criteria Impulse Responses in Training Impulse Response in Testing Reverberation Time D Value Recognition Performance Recognition Performance Estimation Reverberation Time D Value New Reverberation Criterion R.Time[ xxx ms ] R.Rate = function( D Value ) R.Time[ yyy ms ] R.Rate = function( D Value ) R.Time[ zzz ms ] R.Rate = function( D Value ) Estimated Recogniton Performance Vol.21-SLP-83 No.9 21/1/29 1 Fig. 1 Overview of the proposed method. Step.2 D Step.1 (1) D n D n 4.1.1 D n Step.3 Step.1 Step.4 Step.2 Step.3 D 1 2 1 1 2 3.3 RSR-D n 1 2 c 21 Information Processing Society of Japan

Fig. 2 Table 1 1 Regression curve and parameters to estimate y = ax + b y = ax 2 + b Correlation Coefficients.965.96.955.95.945.94.935.93 a b.925 Linear Quadratic.92 1 2 3 4 5 6 7 8 9 Border Time [ms] 2 n The relation between correlation coefficient in each regression curve and border time n D RSR-D n D 4. D RSR-D n RSR-D n 4.1 2(A) 9 732 RIRs Room Impulse Responses 2 4.1.1 RSR-D n (1) n D 2 Table 2 Experimental conditions (A) Soundproof room (T 6 =1 ms 72 RIRs) Japanese style room (T 6 =4 ms 72 RIRs) Laboratory (T 6 =45 ms 72 RIRs) Environments Conference room (T 6 =6 ms 12 RIRs) in training Living room (T 6 =6 ms 72 RIRs) Corridor (T 6 =6 ms 12 RIRs) Bath room (T 6 =65 ms 28 RIRs) Elevator hall (T 6 =85 ms 12 RIRs) Standard stairs (T 6 =85 ms 56 RIRs) (B) Japanese style room (T 6 =4 ms 72 RIRs) Environments to Conference room (T 6 =6 ms 12 RIRs) calculate a suitable n Standard stairs (T 6 =85 ms 56 RIRs) (C) Japanese style room (T 6 =4 ms 72 RIRs) Environments Conference room (T 6 =6 ms 12 RIRs) to design RSR-D n Standard stairs (T 6 =85 ms 56 RIRs) (D) Laboratory (T 6 =45 ms 72 RIRs) Environments Bath room (T 6 =65 ms 28 RIRs) in testing Elevator hall (T 6 =85 ms 12 RIRs) Measured distance 1 5, mm ATR phoneme balance Speech 216 words 5) 7 female and 7 male speakers Decoder Julius 6) HMM IPA monophone model (Gender-dependent) Feature vectors 12 MFCC + 12 ΔMFCC + 1 ΔPower Frame length 25 ms (Hamming window) Frame interval 1 ms Vol.21-SLP-83 No.9 21/1/29 D n 2(B) 3 3.2.1 n 1 9 ms 1 ms D n D 3 n 2 1 2 n 2 ms 2(B) 3 RSR-D n n 2 ms 3 c 21 Information Processing Society of Japan

Vol.21-SLP-83 No.9 21/1/29 Table 3 3 Correlation coefficients RSR-D 2 L (Linear) RSR-D 2 Q (Quadratic) T 6 =4 ms.937.939 T 6 =6 ms.966.963 T 6 =85 ms.977.972 Average Estimation Error [%] 16 14 12 1 8 6 4 2 2.6 2.5 Conventional.9 3.47 Linear.92 3.45 Quadratic Average Estimation Error [%] 16 14 12 1 8 6 4 2 5.36 6.13 Conventional 1.98 2.9 2.1 2.62 Linear Quadratic Average Estimation Error [%] 16 14 12 1 8 6 4 2 7.34 15.32 Conventional 1.92 4.85 Linear 2.22 4.58 Quadratic 1.2 Sound Proof Room (1ms) Elevator Hall (85ms) 1.2 (a) (b) (c) 1.8 Japanese Style Room (4ms) Laboratory (45ms) Conference Room (6ms) Living Room (6ms) Corridor(6ms) Bath Room (65ms) Standard stairs (85ms) 1.8 5 Fig. 5 Average error D2 D2.6.4.2 4 5 6 7 8 9 1 D2.6.4.2 Sound Proof Room (1ms) Corridor(6ms) Japanese Style Room (4ms) Bath Room (65ms) Laboratory (45ms) Elevator Hall (85ms) Conference Room (6ms) Standard stairs (85ms) Living Room (6ms) 7 75 8 85 9 (a) (Overall) (b) (Closeup) 3 D 2 Fig. 3 The relation between D 2 and speech recognition performance RSR-D2L (Linear) RSR-D2Q (Quadratic) 1..8.6.4.2 4 5 6 7 8 9 1 D2 RSR-D2L (Linear) 1 RSR-D2Q (Quadratic).8.6.4.2 4 5 6 7 8 9 1 D2 RSR-D2L (Linear) 1 RSR-D2Q (Quadratic).8.6.4.2 4 5 6 7 8 9 1 (a) (T 6 =4 ms) (b) (T 6 =6 ms) (c) (T 6 =85 ms) 4 RSR-D 2 Fig. 4 The relation between RSR-D 2 and speech recognition performance n=2 ms D (D 2 ) RSR-D 2 4.2 RSR-D 2 2(A) 9 D 2 3(a) 3(b) 9 2(C) 3 4 Table 4 Standard deviation Conventional RSR-D 2 L RSR-D 2 Q Method (Linear) (Quadratic) Close Open Close Open Close Open T 6 =45 ms 3.1 3.26 1.1 3.62 1.13 3.6 T 6 =65 ms 6.92 7.18 2.46 3.49 2.59 3.14 T 6 =85 ms 8.8 17.64 2.41 5.35 2.81 5.23 4 3 3 D 2 1 RSR-D 2 L(Linear) 2 RSR-D 2 Q(Quadratic) (T 6 =6 ms) (T 6 =85 ms).96 (T 6 =4 ms).93 D 2 1 2 RSR-D 2 L RSR-D 2 Q 4.3 2(D) 3 4 c 21 Information Processing Society of Japan

2(D) 3 5 4 RSR-D n RSR-D n Q RSR-D n L D 2 2 RSR-D 2 Q 4.4 4.4.1 RSR-D 2 RSR-D 2 2(A) 9 D 2 3(b) 6 ms ( ) 4 45 ms 85 ms RSR-D 2 5. RSR-D n, RSR-D n. RSR-D n. 5 D 2 Table 5 D 2 of measured impulse response in each environment T 6 =45ms.48.63 T 6 =85ms.72.67 Vol.21-SLP-83 No.9 21/1/29 5.1 2(A) T 6 =45 ms T 6 =85 ms ( 5 ) 25 132 25 3 5 D 2 D 2 5.2 6 7 5.3 T 6 =45 ms T 6 =85 ms 5.1 D 2 D 2 5 c 21 Information Processing Society of Japan

Window Recognition performance [ % ] 1 9 8 7 132 TV 25 355 5 163 : Speaker : Microphone Door (T 6 =45 ms) Fig. 6 175 6 1 2 3 1 12 2 Distance between microphone and speaker [ ] 264 Door 7 EV EV EV : Speaker 6 Placement of microphone and speaker Distance between Mic-Speaker and Wall [ ] :25 :132 (T 6 =45 ms) Fig. 7 Recognition performance [ % ] 1 9 8 7 : Microphone 823 3 25 (T 6 =85 ms) 581 Distance between Mic-Speaker and Wall [ ] :25 :3 6 1 2 3 112 2 27 3 4 5 Distance between microphone and speaker [ ] (T 6 =85 ms) 7 Reverberant speech recognition performance RSR-D n, 6. 8) Vol.21-SLP-83 No.9 21/1/29 1) M. R. Schroeder, New Method of Measuring Reverberation Time, JASA, Vol. 37, pp. 49-412, 1965. 2) Rico Petrick, Xugang Lu, Masashi Unoki, Masato Akagi, and Ruediger Hoffmann, Robust Front End Processing for Speech Recognition in Reverberant Environments: Utilization of Speech Characteristics, Proc. Interspeech28, pp. 658-661, Brisbane, Australia, Sept. 28. 3),,, 23. 4) ISO3382:Acoustics-Measurement of the reverberation time of rooms with reference to other accoustical parameters. Internatinal Organization for Standardization, 1997. 5) K. Takeda, Y. Sagisaka, and S. Katagiri, Acoustic-Phonetic Labels in a Japanese Speech Database, Proc. European Conference on Speech Technology, vol. 2, pp. 13-16, Oct. 1987. 6) A. Lee, T. Kawahara, and K. Shikano, Julius an open source real-time large vocabulary recognition engine, In Proc. European Conf. on Speech Communication and Technology, pp. 1691-1694, 21. 7) T. Houtgast, H. J. M. Steeneken, and R. Plomp, Predicting speech intelligibility in room acoustics, Acustica, vol. 46, pp. 6-72, 198. 8) T. Yamada, M. Kumakura, N. Kitawaki, Performance estimation of speech recognition system under noise conditions using objective quality measures and artificial voice, IEEE Trans. on ASLP, Vol. 14, No. 6, pp. 26-213, Nov. 26. RSR-D n MTF(Modulation Transfer Function) 7) PESQ 6 c 21 Information Processing Society of Japan