CSJ. Speaker clustering based on non-negative matrix factorization using i-vector-based speaker similarity

i-vector 1 1 1 1 i-vector CSJ i-vector Speaker clustering based on non-negative matrix factorization using i-vector-based speaker similarity Fukuchi Yusuke 1 Tawara Naohiro 1 Ogawa Tetsuji 1 Kobayashi Tetsunori 1 Abstract: We have developed a novel speaker clustering method by integrating highly accurate speaker representation and a clustering algorithm. The conventional method caused significant degradation in clustering accuracy when the number of utterances increased. High-accuracy speaker representation and high-performance clustering method are required to realize robust speaker clustering system against such a condition. For this purpose, we used i-vectors for the speaker representation, which contributes to the realization of high-accuracy speaker verification systems, and efficient non-negative matrix factorization for the clustering algorithm. Experimental results show that the proposed method outperforms the conventional methods, irrespective of the amount of data. Keywords: speaker clustering, i-vector, non-negative matrix factorization 1. Web [1] 1 Waseda University, Shinjuku, Tokyo 164 0042, Japan (Gaussian mixture model; GMM) [2], [3], [4], [] [6] c 2012 Information Processing Society of Japan 1

Bayesian information criteria; BIC) (agglomerative hierarchical clustering; AHC) k-means BIC (BIC-AHC) [4] (non-negative matrix factorization; NMF) [] (Gaussian mixture model: GMM) GMM cross likelihood ratio(clr) GMM BIC GMM i-vector k-means [6] i-vector BIC [4] i-vector k-means [7] GMM NMF [] 2 3 4 CSJ 6 2. 2.1 GMM Cross likelihood ratio (CLR) 2.2 i-vector 2.1 GMM X i GMM p(x i λ i ) = K K m i,k N (x µ i,k, Σ i,k ), ( m i,k = 1)(1) k=1 k=1 N ( ) 1 N (X i µ i,k, Σ i,k )= (2π) D 2 Σ i,k 1 2 exp ( 1 2 (x µ i,k) T Σ 1 i,k (x µ i,k) ). (2) λ i = { m i,k, µ i,k, Σ i,k } K k=1 i X i GMM µ i,k Σ i,k m i,k k 2 GMM CLR [] i j CLR Dij CLR = log p(x i λ i ) p(x i λ j ) + log p(x j λ j ) p(x j λ i ) (3) p(x i λ j ) λ j = { m j,k, µ j,k, Σ j,k } K k=1 GMM X i T i p(x i λ j ) = p(x i,t λ j ) (4) t=1 2.2 i-vector 2.2.1 i-vector u GMM M u (Total variability; TV) M u = m + T w u () m universal background model (UBM) GMM T TV c 2012 Information Processing Society of Japan 2

T Eigen voice [6] w u u i-vector u i-vector TV T u w u = (I + T t ΣN(u)T) 1 T t Σ 1 F(u) (6) N(u) F(u) 0 1 N c = F c = L P (c x t, Ω) (7) t=1 L P (c x t, Ω) (x c m c ) (8) t=1 C UBM Ω F c = 1,, C UBM m c c N(u) N c I(I R F F ) CF CF F(u) F c CF 1 Σ TV CF CF [8] i-vector (Fisher discriminant analysis; FDA) (within-class covariance normalization; WCCN) 2.2.2 i-vector FDA WCCN [6], [] FDA w A t w (w R d, A R d d, d < d) A S b S w S b v = λs w v (9) WCCN w B t w (w R d, B d d ) WCCN ( ) B s i w s i s w s S W = 1 S 1 n s n s=1 s i=1 (w s i w s )(w s i w s ) t () W 1 W 1 = BB t 2.2.3 i j i-vector Di,j cos z T i = z j z i z j (11) z u i-vector w u FDA WCCN u 3. BIC NMF 3.1 BIC BIC BIC 0 i,j BIC i,j X i X j BIC BIC 0 i,j = N i + N j 2 + α 2 log Σ 0 d(d + 1) (d + ) log(n i + N j ) (12) 2 BIC i,j = N i 2 log Σ i + N j 2 log Σ j + α 2 d(d + 1) (d + ) log(n i + N j ). (13) 2 Σ 0 2 Σ i Σ j N i N j d α i j BIC i,j = BIC 0 i,j BIC i,j (14) BIC i,j BIC i,j c 2012 Information Processing Society of Japan 3

BIC i,j (14) 3.2 U V R U U (Non-negative Matrix Factorization; NMF) V W R U S H R S U V WH (1) U S W H u s u ŝ u = arg max H s,u (16) s u = 1,, U (16) ŝ u 2.1 GMM CLR i-vector W H V [11] D(V WH) = ( V ) i,j V i,j log V ij + (WH) i,j (WH) i,j i,j (17) W H D(V WH) u W i,a W H a,uv i,u /(WH) i,u i,a u H (18) a,u s H a,j H H s,av s,j /(WH) s,j a,j s H (19) s,a V i,j W i,j H i,j W W H i j 4. 1 Method 1 Speaker representation Clustering algorithm BIC [4] single Gaussian BIC-AHC GMM-NMF [] GMM NMF IV-kM [7] i-vector k-means IV-NMF (proposed) i-vector NMF BIC [4] 3.1 GMM-NMF 2.1 GMM CLR 3.2 NMF BIC [] CLR NMF IV-kM i-vector k-means k-means [7] i-vector NMF i-vector (20) { 0, v < 0 f(v) = v 3, otherwise (20). 1.1.1.1 (CSJ) [12] 0 0 46 4 00 ms s s 0 c 2012 Information Processing Society of Japan 4

.1.2 BIC GMM-NMF 12 MFCC GMM- NMF GMM 8 8 GMM CLR 0 1. IV-kM IV-NMF UBM i-vector T FDA WCCN 183 28,171 UBM MFCC 12 MFCC 12 MFCC 12 36 12 i-vector 30 FDA 0 WCCN.1.3 NMF k-means ( ) BIC (12) (13) BIC α K ( 0 α = 6.8 α = 13.8) NMF H W 0.0001.1.4 (average speaker purity; ASP) (average cluster purity; ACP) K [13] p i q j p i = q j = n 2 ij n 2 j=0 i i=0 n 2 ij n 2 j (21) (22) S U n j j n i i n i j j i V ASP V ACP V ACP = 1 U p i n i (23) i=0 2 K BIC IV-kM GMM-NMF IV-NMF V ASP = 1 U # speakers # utterances K value 0.989 0 0.6 0.993 0 0.92 0.922 0 0.914 0.920 0 0.896 0.977 0 0.931 0.940 0 0.896 0.984 0 0.987 0.966 0 0.944 q j n j (24) j=0 K K = V ACP V ASP (2) K 1.2 2 BIC IV-kM GMM-NMF (IV-NMF) IV-NMF K IV-NMF GMM-NMF IV-kM BIC-AHC BIC-AHC 0 BIC-AHC BIC IV-kM GMM-NMF c 2012 Information Processing Society of Japan

BIC IV-kM IV-kM IV-NMF BIC 3 IV-kM IV- NMF i-vector BIC GMM-NMF i-vector NMF (GMM-NMF IV- NMF) 2 non-negative matrix factorization, Proc. Interspeech, pp.949-92, Aug. 2011. [6] N. Dehak et al., Front-end factor analysis for speaker verification, IEEE trans. Speech Audio Process., vol.19, no.4, pp.788-798, May 2011. [7] S. Shum et al., Exploiting intra-conversation variability for speaker diarization, Proc. Interspeech, pp.94-948, Aug. 2011. [8] P.Kenny, et al., Eigenvice modeling with sparse training data, IEEE trans. Speech Audio Process., vol.13, no. 3, pp.34-34 May 200. [9] P. Kenny et al., A study of interspeaker variability in speaker verification, IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no., pp. 980-988, Jul. 2008. [] A. Hatch et al., Within-class covariance normalization for SVM-based speaker recognition, Proc. ICSLP, pp.1471-1474, Sept. 2006. [11] D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, Proc. NIPS, pp.6-62, Nov. 2000. [12] K. Maekawa, Corpus of spontaneous Japanese: its design and evaluation, Proc. SSPR, pp.7-12, Apr. 2003. [13] A. Solomonoff et al., Clustering speakers by their voices, Proc. ICASSP, vol.2, pp.77-760, May 1998. 6. i-vector NMF [1] T.J. Reynolds, et al., Clustering via the Baysian information criterion with applications in speech recognition, Proc. ICASSP, vol.2, pp.64-648, May 1998. [2] D.A. Reynolds, et al., Blind Clustering of Speech Utterances based on Speaker and Language Characteristics, Proc. ISCLP, vol.7, pp.3193-3196, Nov. 1998. [3] D.A. Reynolds, et al., Speaker Verification Using Adapted Gaussian Mixture Models, Proc. Digital Signal Processing, vol., pp.19-41, Jan. 2000. [4] S. Chen and P. Gopalakrishan, Speaker, environment and channel change detection and clustering via the Bayesian information criterion, Proc. DARPA Broadcast News Transcription and Understanding Workshop, pp.127-132, May 1998. [] M. Nishida et al., Speaker clustering based on c 2012 Information Processing Society of Japan 6