THE INSTITUTE OF ELECTRONICS, INFORMTION ND COMMUNICTION ENGINEERS TECHNICL REPORT OF IEICE 277 882 5 5 3 33 7 3 E-mail: {dsk saito,matsuura,asakawa,mine,hirose}gavotu-tokyoacjp n (VTLN) VTLN ĉ = c n 8 cm 2 cm,,,, study of geometric dependency of cepstrum on vocal tract length Daisuke SITO, Ryo MTSUUR, Satoshi SKW, Nobuaki MINEMTSU, and bstract Keikichi HIROSE Graduate School of Frontier Sciences, The University of Tokyo 5 5 Kashiwano-ha, Kashiwa-shi, Chiba 277 882, Japan Graduate School of Information Science and Technology, The University of Tokyo 7 3 Hongo, Bunkyo-ku, Tokyo 3 33, Japan E-mail: {dsk saito,matsuura,asakawa,mine,hirose}gavotu-tokyoacjp In this paper, we theoretically and experimentally prove that the direction of cepstrum vectors strongly depends on vocal tract length and that this dependency is represented as rotation in the n dimensional cepstrum space In speech recognition studies, vocal tract length normalization (VTLN) techniques are widely used to cancel age- and gender-differences In VTLN, a frequency warping is often carried out and it can be implemented as a linear transformation in a cepstrum space; ĉ = c However, the geometric properties of this transformation matrix have not been well discussed In this study, its properties are made clear using n dimensional geometry and it is shown that the matrix rotates any cepstrum vector similarly and apparently Experimental results using resynthesized speech demonstrate that cepstrum vectors extracted from a speaker of 8 [cm] in height and those from another speaker of 2 [cm] in height are reasonably orthogonal This result clarifies one of the reasons why children s speech is very difficult for conventional speech recognizers to deal with adequately Key words frequency warping, cepstrum, geometric property, rotation matrix, vocal tract length
(Speaker Independent: SI) [] (CMN) (VTLN) CMN [2]CMN VTLN [3] VTLN 2 2 ω, ˆω ( < = ω, ˆω < = π) z = e jω, ẑ = e j ˆω ẑ = z α αz () α α < α < α > α () 2 2 [4], [5] c, ĉ Fig Examples of frequency warping functions for different values of α ĉ = c (2) ĉ = (ĉ ĉ 2 ĉ 3 ĉ 4 ) t α 2 2α 2α 3 α+α 3 4α 2 +3α 4 = B C c = (c c 2 c 3 c 4 ) t Pitz (3) a ij α [6] jx j a ij = (j ) m m=max(,j i) (3) (m + i ) (m + i j) ( )(m+i j) α (2m+i j) (4) j = m 8 < jc m (j > = m) : (j < m) 3 3 2 (5) (3) 2 2 n 2 (2) c ĉ ĉ α 2 2α 2α 3 = α+α 3 4α 2 +3α 4 ĉ 2 (6) T T c c 2 (6) T =R + O (7) 2
Transformed by R Transformed by T Transformed by O 2 α = 2 T RO Fig 2 Effects of transformations of T, R, and O for α = 2 2α 2 2α( 2 R = α2 ) (8) 2α( 2 α2 ) 2α 2 α 2 α 3 O = (9) α 2α 2 +3α 4 R (+t) k + kt 2α 2 2α α 2 R 2α () α 2 2α 2 cos 2θ sin 2θ = (α = sin θ) () sin 2θ cos 2θ () R R 2θ O T α < O 3 2 2 T R T 2 α = 2 2 T RO O 2 T R O T O T 3 2 y = (T I)c = ĉ c (2) I 2 (T I) y 2 T 3 (2) T 3 (2) Fig 3 Vector filed given by Equation (2) for α = 2 3 2 n n 2 n (3) R R R t R = RR t = I (3) det R = + (4) (3) (4) (3) α 2 2α α 3α n = B 2α 4α C n a ij 8 (i = j) >< a ij = sgn(j i) jα ( i j = ) >: (otherwise) (5) (6) sgn(j i) j i > + j i < t n n n t n +α 2 α 3α 2 α +8α 2 α 8α 2 t n n = 3α 2 α +8α 2 α B 8α 2 α + 32α 2 C (7) (7) k R +kα 2 i j = α i j = 2 m R mα 2 n t n 3
Table coustic conditions 6 khz / 6 bit Hamming window 25 ms 5 ms MFCC ( 2) [ a ] [ i ] [ a ] [ i ] Fig 4 4 Rotation of two cepstrum vectors and their vector +4α 2 α 4α 2 α +α 2 α 9α 2 n t n = 4α 2 α +2α 2 α (8) B 9α 2 α + 34α 2 C (7)(8) α α 2 i j = α t n n n t n (3) n [7]n det n = a nn det n a n(n ) a (n )n det n 2 (9) (5) a nn = a n(n ), a (n )n α 2 α det n det n det n (4) (3) n 2 3 n 3 T α 4 t t + c t, c t+ c t, c t+ c = c t+ c t Fig 5 5 Spectrogram of the original speech (left) and its warped version (right) 4 4 5 /aiueo/ STRIGHT [8] MFCC MFCC a, b θ (2) θ = arccos a b a b (2) a b a, b () (3) 8 < ˆω = ω ( < m = ω < m π) +m : m(ω π) + π ( m π < +m = ω < = π) (2) (2) m 4
Vocal tract length ration [m] 4 35 3 25 2 5 5-8 -6-4 -2 2 4 6 8 Warping parameter [α] 6 α m Fig 6 Relation between warping parameter and vocal tract length ratio (2) 5 5 (2) m α α (2) 2 m α m 6 6 8 cm α = 4 9 cm α = 4 36 cm 4 2 7 /a/ /i//i/ /u//u/ /e//e/ /o/ (2) (a) (c) (d) (f) 7 MFCC MFCC MFCC 7 MFCC MFCC MFCC 5 7(b) 8 cm 2 cm MFCC α α 3 2 α α 8 cm 2 cm [9] [] [], [2] 6 [9] 5
5 5-5 - original height -5 5 5 2 25 3 35 5 5-5 (a):mfcc (male) - original height -5 5 5 2 25 3 35 (d):mfcc (female) 5 5-5 - original height -5 5 5 2 25 3 35 5 5-5 (b): MFCC (male) - original height -5 5 5 2 25 3 35 (e): MFCC (female) 5 5-5 - original height -5 5 5 2 25 3 35 5 5-5 (c): MFCC (male) - original height -5 5 5 2 25 3 35 (f): MFCC (female) 7 : (a) (c): 8 cm ;(d) (f): 63 cm Fig 7 Relation between the rotation angle and the estimated body height (a) to (c) are from a male speaker of 8 cm in height and (d) to (f) are from a female speaker of 63 cm in height [3] [] M Russel and S D rcy: Challenges for computer recognition of children s speech, CD-ROM of SLaTE27, 27 [2] B tal: Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification, JcoustSocmerica, vol 55, pp 34 32, 974 [3] E Eide and H Gish: parametric approach to vocal tract length normalization, ICSSP96, vol, pp 346 348, 996 [4], :, D-II, vol J83-D-II, no, pp28 27, 2 [5] T Emori and K Shinoda: Rapid Vocal Tract Length Normalization usgin Maximum Likelihood Estimation, Eurospeech2, pp 649 652, 2 [6] M Pitz and H Ney: Vocal tract normalization equals linear transformation in cepstral space, IEEE Trans Speech and udio Processing, vol 3, pp 93 944, 25 [7] R Horn and CR Johnson: Matrix nalysis, Cambridge University Press, 985 [8] H Kawahara et al: Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F extraction: Possible role of a repetitive structure in sounds, Speech Communication,, vol 27, pp 87 27, 999 [9],, :,, pp 43 5, 24 [] N Minematsu: Mathematical evidence of the acoustic universal structure in speech, ICSSP25, pp889 892, 25 [] S sakawa et al : utomatic Recognition of Connected Vowels Only Using Speaker-invariant Representation of Speech Dynamics, INTERSPEECH27, pp89 893, 27 [2] :,, 3-Q-, 27 [3] :, Visual Computing / CD, 27 6