THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS
TECHNICAL REPORT OF IEICE.

Voice Conversion based on Non-negative Matrix Factorization with Segment Features in Noisy Environments

Takao FUJII, Ryo AIHARA, Tetsuya TAKIGUCHI, and Yasuo ARIKI

Organization of Advanced Science and Technology, Kobe University
Rokkodai-cho 1-1, Nada-ku, Kobe-shi, 657-8501 Japan
Graduate School of System Informatics, Kobe University
Rokkodai-cho 1-1, Nada-ku, Kobe-shi, 657-8501 Japan
E-mail: {fujii,aihara}@me.cs.scitec.kobe-u.ac.jp, {takigu,ariki}@kobe-u.ac.jp

Abstract  This paper presents a voice conversion method based on NMF for noisy environments. We prepared parallel exemplars that consist of source and target exemplars, which have the same texts uttered by the source and target speakers. The input source signal is decomposed into the source exemplars, noise exemplars obtained from the input signal, and their weights. Then, the converted signal is obtained by calculating the linear combination of the target exemplars and the weights estimated using the source exemplars. In the proposed method, segment features are used in the NMF-based voice conversion technique in order to improve the accuracy of the weight estimation. The effectiveness of this method was confirmed by comparing it with a conventional method.

Key words  NMF, Voice Conversion, Speaker Conversion, Noise Reduction

1. Introduction

Voice conversion (VC) is a technique for converting the voice of a source speaker into that of a target speaker while keeping the linguistic content, and it has been applied to emotional speech synthesis [1], [2] and to speaking-aid systems for electrolaryngeal speech [3]. A widely used statistical approach is based on the Gaussian Mixture Model (GMM) [4], and many conversion methods have been studied, including vector quantization [5] and the PSOLA technique [6]. Toda et al. [7] introduced maximum-likelihood estimation of the spectral parameter trajectory with a Global Variance term to reduce the over-smoothing of GMM-based conversion. Helander et al. [8] proposed conversion based on Partial Least Squares (PLS) regression to avoid the over-fitting of conventional GMM-based methods. Approaches that relax the need for parallel training data have also been proposed, such as MAP-based adaptation [9], Eigen-Voice GMM (EV-GMM) conversion [10], and one-to-many conversion based on a tensor representation of speaker space [11].

This article is a technical report without peer review, and its polished and/or extended version may be published elsewhere. Copyright 2013 by IEICE
Recently, exemplar-based approaches using sparse representations, such as sparse coding and Non-negative Matrix Factorization (NMF) [12], have attracted attention and have been applied to source separation [13], [14]. Gemmeke et al. [15] proposed exemplar-based sparse representations for noise-robust automatic speech recognition in place of the conventional Hidden Markov Model (HMM) approach. In this paper, we apply NMF to voice conversion in noisy environments and introduce segment features to improve the accuracy of the weight estimation.

2. Voice Conversion Based on GMM

2.1 Gaussian Mixture Model

Let x_t and y_t denote the spectral features of the source and target speakers at frame t. A GMM with parameter set λ consists of mixture weights α_m, mean vectors μ_m, and covariance matrices Σ_m:

  λ = {α_m, μ_m, Σ_m | m = 1, 2, ..., M}   (1)

where M is the number of mixtures and the weights satisfy

  Σ_{m=1}^{M} α_m = 1,  α_m ≥ 0.   (2)

Each component is a D-dimensional Gaussian:

  N(x_t; μ_m, Σ_m) = 1 / ((2π)^{D/2} |Σ_m|^{1/2}) exp[ -(1/2)(x_t - μ_m)^T Σ_m^{-1} (x_t - μ_m) ]   (3)

2.2 Mapping Function

Given aligned feature sequences x = [x_0, x_1, ..., x_T]^T of the source speaker and y = [y_0, y_1, ..., y_T]^T of the target speaker, the conversion function is the conditional expectation [16]:

  F(x) = E[y|x] = Σ_{m=1}^{M} h_m(x) E_m[y|x]   (4)

  E_m[y|x] = μ_m^{(y)} + Σ_m^{(yx)} (Σ_m^{(xx)})^{-1} (x - μ_m^{(x)})
  h_m(x) = α_m N(x; μ_m^{(x)}, Σ_m^{(xx)}) / Σ_{i=1}^{M} α_i N(x; μ_i^{(x)}, Σ_i^{(xx)})   (5)

where μ_m^{(x)} and μ_m^{(y)} are the source and target mean vectors, and Σ_m^{(xx)} and Σ_m^{(yx)} are the source covariance and the cross-covariance of the m-th mixture.

2.3 Parameter Estimation

The parameters α_m, μ_m^{(x)}, μ_m^{(y)}, Σ_m^{(xx)}, and Σ_m^{(yx)} are estimated from a GMM trained on the joint vectors

  z_t = [x_t^T, y_t^T]^T   (6)

  p(z) = Σ_{m=1}^{M} α_m N(z; μ_m^{(z)}, Σ_m^{(z)})   (7)

where the joint covariance and mean decompose as

  Σ_m^{(z)} = [ Σ_m^{(xx)}  Σ_m^{(xy)} ; Σ_m^{(yx)}  Σ_m^{(yy)} ],   μ_m^{(z)} = [ μ_m^{(x)} ; μ_m^{(y)} ].   (8)

The joint GMM is trained with the EM algorithm, and Σ_m^{(xx)} and Σ_m^{(yx)} are read off from Σ_m^{(z)}.

3. Voice Conversion Based on NMF

3.1 Exemplar-Based Representation

In the exemplar-based approach using NMF [17], a spectral vector is approximated by a sparse non-negative linear combination of exemplars:

  x_l ≈ Σ_{j=1}^{J} a_j h_{j,l}   (9)
      = A h_l   (10)

where x_l is the spectrum of the l-th frame, a_j is the j-th exemplar (basis), and h_{j,l} is its non-negative weight (activity).
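As a concrete illustration, the GMM-based mapping of Eqs. (4) and (5) can be sketched as follows. This is a minimal sketch with hypothetical variable names, assuming full covariance matrices and that all GMM parameters have already been estimated:

```python
import numpy as np

def gmm_convert(x, alpha, mu_x, mu_y, cov_xx, cov_yx):
    """Convert one source frame x with a trained joint GMM.

    Implements F(x) = sum_m h_m(x) (mu_y_m + cov_yx_m cov_xx_m^{-1} (x - mu_x_m)),
    with posterior weights h_m(x) as in Eq. (5).
    alpha: (M,), mu_x/mu_y: (M, D), cov_xx/cov_yx: (M, D, D).
    """
    M, D = mu_x.shape
    # Mixture likelihoods alpha_m N(x; mu_x_m, cov_xx_m)
    w = np.empty(M)
    for m in range(M):
        diff = x - mu_x[m]
        inv = np.linalg.inv(cov_xx[m])
        det = np.linalg.det(cov_xx[m])
        w[m] = alpha[m] * np.exp(-0.5 * diff @ inv @ diff) \
               / np.sqrt((2.0 * np.pi) ** D * det)
    h = w / w.sum()  # posterior h_m(x), Eq. (5)
    # Weighted sum of conditional means E_m[y|x], Eq. (4)
    y = np.zeros(D)
    for m in range(M):
        diff = x - mu_x[m]
        y += h[m] * (mu_y[m] + cov_yx[m] @ np.linalg.solve(cov_xx[m], diff))
    return y
```

With a single mixture, identity covariances, and zero source mean, the function reduces to a simple mean shift, which makes the structure of Eq. (4) easy to verify.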
Fig. 1 Activity matrix of the source signal
Fig. 2 Activity matrix of the target signal
Fig. 3 Basic approach of voice conversion based on NMF

Here A = [a_1 ... a_J] is the dictionary and h_l = [h_{1,l} ... h_{J,l}]^T is the activity vector of frame l. The source and target dictionaries are built from parallel utterances of the source and target speakers, aligned frame by frame with dynamic programming (DP) matching, so that the j-th source exemplar and the j-th target exemplar correspond to each other. Figures 1 and 2 show the activity matrices estimated for the source and target signals; because the utterances are parallel, the two activity matrices have similar structure. The basic flow of voice conversion based on NMF is shown in Fig. 3: the activities estimated from the source spectra, extracted with STRAIGHT [18], are combined with the target dictionary to produce the converted spectra.

3.2 Construction of the Dictionaries

The clean parallel utterances are analyzed both with STRAIGHT [18] and with the short-time Fourier transform (STFT), and the two analyses are aligned with DP matching. The source exemplars are taken from the STFT spectra so that they lie in the same domain as the noise exemplars, which are extracted from the noisy input signal itself, while the target exemplars are taken from the STRAIGHT spectra so that the converted speech can be synthesized with STRAIGHT. Figure 4 shows the construction of the source and target dictionaries.

3.3 Estimation of the Activities

For each frame l of the noisy input, the activities of the speech and noise exemplars are estimated jointly, with a sparsity penalty imposed on the speech activities.
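The DP matching used to pair source and target frames can be sketched as a standard dynamic time warping over Euclidean frame distances. This is a minimal sketch, not the exact alignment procedure of the report, and the symmetric step pattern is an assumption:

```python
import numpy as np

def dp_align(X, Y):
    """Align two parallel utterances X (D x Lx) and Y (D x Ly) by DP
    matching on Euclidean frame distances; returns the optimal list of
    (source frame, target frame) index pairs used to pair exemplars."""
    Lx, Ly = X.shape[1], Y.shape[1]
    # Local distance between every source/target frame pair
    d = np.linalg.norm(X[:, :, None] - Y[:, None, :], axis=0)
    # Accumulated cost with steps (1,0), (0,1), (1,1)
    g = np.full((Lx, Ly), np.inf)
    g[0, 0] = d[0, 0]
    for i in range(Lx):
        for j in range(Ly):
            if i == 0 and j == 0:
                continue
            prev = []
            if i > 0 and j > 0:
                prev.append(g[i - 1, j - 1])
            if i > 0:
                prev.append(g[i - 1, j])
            if j > 0:
                prev.append(g[i, j - 1])
            g[i, j] = d[i, j] + min(prev)
    # Backtrack from the end of both utterances
    path, (i, j) = [(Lx - 1, Ly - 1)], (Lx - 1, Ly - 1)
    while (i, j) != (0, 0):
        cands = []
        if i > 0 and j > 0:
            cands.append((g[i - 1, j - 1], (i - 1, j - 1)))
        if i > 0:
            cands.append((g[i - 1, j], (i - 1, j)))
        if j > 0:
            cands.append((g[i, j - 1], (i, j - 1)))
        i, j = min(cands, key=lambda c: c[0])[1]
        path.append((i, j))
    return path[::-1]
```

For identical sequences the path is the diagonal, i.e. each source frame is paired with the target frame of the same index.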
The sparsity penalty is set per exemplar: with λ^T = [λ_1 ... λ_J ... λ_{J+K}], the weights [λ_1 ... λ_J] for the J speech exemplars are set to 0.2 and the weights [λ_{J+1} ... λ_{J+K}] for the K noise exemplars are set to 0, so that sparsity is imposed only on the speech activities.

3.4 Weight Estimation and Conversion

Fig. 4 Construction of source and target dictionaries

The spectrum x_l of frame l of the noisy input is the sum of a speech part x_l^s and a noise part x_l^n, and it is approximated by the two dictionaries:

  x_l = x_l^s + x_l^n ≈ Σ_{j=1}^{J} a_j^s h_{j,l}^s + Σ_{k=1}^{K} a_k^n h_{k,l}^n
      = [A^s A^n] [h_l^s ; h_l^n]  s.t. h_l^s, h_l^n ≥ 0
      = A h_l  s.t. h_l ≥ 0   (11)

where A^s and A^n are the speech and noise dictionaries and h_l^s and h_l^n the corresponding activities. Writing all L frames together,

  X ≈ [A^s A^n] [H^s ; H^n] = A H  s.t. H ≥ 0.   (12)

Following the exemplar-based NMF of [15], the observation and the dictionary are normalized so that each column sums to one:

  M = 1^{(D×D)} X,  X ← X ./ M,  A ← A ./ (1^{(D×D)} A)   (13)

where 1^{(D×D)} denotes the D × D all-ones matrix and ./ element-wise division. The activity matrix H is estimated by minimizing the cost function

  d(X, AH) + ‖(λ 1^{(1×L)}) .* H‖_1  s.t. H ≥ 0   (14)

where d(X, AH) is the Kullback-Leibler divergence between X and AH, and the second term is the weighted L1 sparsity penalty on H. H is obtained iteratively with the multiplicative update

  H ← H .* (A^T (X ./ (AH))) ./ (1^{((J+K)×L)} + λ 1^{(1×L)})   (15)

The target dictionary is normalized in the same way:

  A^t ← A^t ./ (1^{(D×D)} A^t)   (16)

and the converted spectra are obtained from the speech activities H^s estimated in Eq. (15):

  X̂^t = (A^t H^s) .* M   (17)

where .* denotes element-wise multiplication and M restores the spectral gain removed in Eq. (13). Because the target dictionary consists of STRAIGHT spectra, the converted speech can be synthesized with STRAIGHT.

3.5 Segment Features

In the proposed method, segment features, in which several consecutive frames are concatenated into a single exemplar, are used instead of single-frame spectra in the NMF-based conversion. Because neighboring frames constrain which exemplars can be active, segment features improve the accuracy of the activity estimation in noisy conditions.

4. Experiments

4.1 Experimental Conditions

The proposed NMF-based method was compared with the conventional GMM-based method. Speech data from the ATR Japanese speech database, downsampled to 8 kHz, were used; 216 words (57,033 frames in total) were used for dictionary construction and GMM training.
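The sparse weight estimation of Eqs. (13)-(15) and the conversion of Eq. (17) can be sketched as follows. This is a minimal sketch with hypothetical names; the dictionary A is assumed to be already column-normalized as in Eq. (13):

```python
import numpy as np

def estimate_activities(X, A, lam, n_iter=200):
    """Sparse NMF weight estimation with a fixed dictionary A.

    Implements the multiplicative update of Eq. (15),
        H <- H .* (A^T (X ./ (A H))) ./ (1 + lam),
    which minimizes the KL divergence d(X, AH) plus the weighted L1
    penalty of Eq. (14).  lam is a per-exemplar sparsity weight
    (e.g. 0.2 for speech exemplars, 0 for noise exemplars).
    X: (D, L) column-normalized spectra, A: (D, J+K), lam: (J+K,)."""
    J, L = A.shape[1], X.shape[1]
    H = np.ones((J, L))                          # non-negative init
    denom = 1.0 + np.asarray(lam).reshape(J, 1)  # broadcast over frames
    eps = 1e-12                                  # avoid division by zero
    for _ in range(n_iter):
        H *= (A.T @ (X / (A @ H + eps))) / denom
    return H

# Conversion (Eq. (17)): keep only the speech activities H_s and
# multiply them by the target dictionary:
#   X_hat = (A_t @ H_s) * M   # M restores the normalized-out gain
```

With the identity matrix as dictionary and no sparsity penalty, the update recovers the observation exactly, which is a quick sanity check that the multiplicative rule is implemented correctly.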
Fig. 5 Cepstrum distortion for each method in the case of using 50 words
Fig. 6 Cepstrum distortion for each method in the case of using 25 sentences
Fig. 7 Converted spectrum envelope of source speaker based on GMM
Fig. 8 Converted spectrum envelope of source speaker with NMF

The spectral envelope was extracted with STRAIGHT and converted to 40-dimensional cepstral features for the GMM-based method, and the number of GMM mixtures was 64. For evaluation, 50 words and 25 sentences were used. Noise from the CENSREC-1-C database was added at an SNR of 10 dB, and the STFT used a frame length of 256 samples with a 512-point FFT. Segment features concatenating two consecutive frames (seg(2)) and three consecutive frames (seg(3)) were evaluated.

4.2 Results

Figures 5 and 6 show the cepstral distortion of each method for the 50-word and 25-sentence evaluation sets, respectively. Figures 7 to 9 show examples of spectral envelopes converted with the GMM-based method, the conventional NMF-based method, and the proposed method, and Figs. 10 and 11 show the corresponding spectral envelopes of the source and target speakers.

Fig. 9 Converted spectrum envelope of source speaker with proposed method
Fig. 10 Spectrum envelope of source speaker
Fig. 11 Spectrum envelope of target speaker
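The segment features seg(2) and seg(3) above stack consecutive frames into one exemplar. A minimal sketch of such frame stacking, assuming edge frames are padded by repetition (the report does not state its padding scheme):

```python
import numpy as np

def segment_features(X, seg):
    """Stack `seg` consecutive frames of X (D x L) into segment
    features of shape (seg*D, L); column l holds frames l..l+seg-1,
    with the last frame repeated at the right edge (an assumption)."""
    D, L = X.shape
    # Pad on the right so every frame has seg-1 successors
    Xp = np.concatenate([X, np.repeat(X[:, -1:], seg - 1, axis=1)], axis=1)
    return np.concatenate([Xp[:, i:i + L] for i in range(seg)], axis=0)
```

For example, with a 1-dimensional sequence [1, 2, 3] and seg=2, column 0 of the result holds frames 1 and 2, and the last column repeats the final frame.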
As shown in Figs. 5 and 6, the proposed NMF-based method with segment features achieved lower cepstral distortion than the conventional NMF-based and GMM-based methods.

5. Conclusion

This paper presented a voice conversion method based on NMF for noisy environments, in which segment features are used to improve the accuracy of the weight estimation. The input signal is decomposed into source exemplars, noise exemplars, and their activities, and the converted signal is obtained as a linear combination of the target exemplars with the activities of the source exemplars. Experiments confirmed that the proposed method is more effective than conventional NMF- and GMM-based methods.

References
[1] Y. Iwai, T. Toda, H. Saruwatari, and K. Shikano, GMM-based voice conversion applied to emotional speech synthesis, IEEE Trans. Speech and Audio Proc., Vol. 7, pp. 2401-2404, 1999.
[2] C. Veaux and X. Robet, Intonation conversion from neutral to expressive speech, in Proc. INTERSPEECH, pp. 2765-2768, 2011.
[3] K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech, Speech Communication, Vol. 54, No. 1, pp. 134-146, 2012.
[4] Y. Stylianou, O. Cappe, and E. Moulines, Continuous probabilistic transform for voice conversion, IEEE Trans. Speech and Audio Processing, Vol. 6, No. 2, pp. 131-142, 1998.
[5] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, Voice conversion through vector quantization, in Proc. ICASSP, pp. 655-658, 1988.
[6] H. Valbret, E. Moulines, and J. P. Tubach, Voice transformation using PSOLA technique, Speech Communication, Vol. 11, No. 2-3, pp. 175-187, 1992.
[7] T. Toda, A. Black, and K. Tokuda, Voice conversion based on maximum likelihood estimation of spectral parameter trajectory, IEEE Trans. Audio, Speech, Lang. Process., Vol. 15, No. 8, pp. 2222-2235, 2007.
[8] E. Helander, T. Virtanen, J. Nurminen, and M. Gabbouj, Voice conversion using partial least squares regression, IEEE Trans. Audio, Speech, Lang. Process., Vol. 18, No. 5, pp. 912-921, 2010.
[9] C. H. Lee and C. H. Wu, Map-based adaptation for speech conversion using adaptation data selection and non-parallel training, in Proc. INTERSPEECH, pp. 2254-2257, 2006.
[10] T. Toda, Y. Ohtani, and K. Shikano, Eigenvoice conversion based on Gaussian mixture model, in Proc. INTERSPEECH, pp. 2446-2449, 2006.
[11] D. Saito, K. Yamamoto, N. Minematsu, and K. Hirose, One-to-many voice conversion based on tensor representation of speaker space, in Proc. INTERSPEECH, pp. 653-656, 2011.
[12] D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, in Proc. Neural Information Processing Systems, pp.
556-562, 2001.
[13] T. Virtanen, Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria, IEEE Trans. Audio, Speech, Lang. Process., Vol. 15, No. 3, pp. 1066-1074, 2007.
[14] M. N. Schmidt and R. K. Olsson, Single-channel speech separation using sparse non-negative matrix factorization, in Proc. INTERSPEECH, pp. 2614-2617, 2006.
[15] J. F. Gemmeke, T. Virtanen, and A. Hurmalainen, Exemplar-based sparse representations for noise robust automatic speech recognition, IEEE Trans. Audio, Speech, Lang. Process., Vol. 19, Issue 7, pp. 2067-2080, 2011.
[16] Y. Stylianou, O. Cappe, and E. Moulines, Statistical methods for voice quality transformation, in Proc. EUROSPEECH, 1995.
[17] R. Takashima, T. Takiguchi, and Y. Ariki, Exemplar-based voice conversion in noisy environment, IEEE Workshop on Spoken Language Technology (SLT2012), pp. 313-317, 2012.
[18] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigne, Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds, Speech Communication, Vol. 27, pp. 187-207, 1999.