Voice Conversion based on Non-negative Matrix Factorization with Segment Features in Noisy Environments

THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS
TECHNICAL REPORT OF IEICE

Voice Conversion based on Non-negative Matrix Factorization with Segment Features in Noisy Environments

Takao FUJII, Ryo AIHARA, Tetsuya TAKIGUCHI, and Yasuo ARIKI

Organization of Advanced Science and Technology, Kobe University, Rokkodai-cho 1-1, Nada-ku, Kobe-shi, 657-8501 Japan
Graduate School of System Informatics, Kobe University, Rokkodai-cho 1-1, Nada-ku, Kobe-shi, 657-8501 Japan
E-mail: {fujii,aihara}@me.cs.scitec.kobe-u.ac.jp, {takigu,ariki}@kobe-u.ac.jp

Abstract This paper presents a voice conversion method based on NMF for noisy environments. We prepare parallel exemplars consisting of source and target exemplars, which contain the same texts uttered by the source and target speakers. The input source signal is decomposed into the source exemplars, noise exemplars obtained from the input signal, and their weights. The converted signal is then obtained as the linear combination of the target exemplars and the weights calculated using the source exemplars. In the proposed method, segment features are used in the NMF-based voice conversion in order to improve the accuracy of the weight estimation. The effectiveness of the method was confirmed by comparison with a conventional method.

Key words NMF, Voice Conversion, Speaker Conversion, Noise Reduction

This article is a technical report without peer review, and its polished and/or extended version may be published elsewhere. Copyright 2013 by IEICE

1. [1], [2] [3] Gaussian Mixture Model (GMM) [4] [5] [6] [7] GMM Global Variance Helander [8] Partial Least Squares (PLS) GMM [9] Eigen-Voice GMM (EV-GMM) [10] [11]

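Sparse Coding Non-negative Matrix Factorization (NMF) [12] [13] [14] Sparse Coding Gemmeke [15] Hidden Markov Model (HMM) NMF NMF

2. GMM

2.1 GMM

For the source feature x_t and the target feature y_t at frame t, the GMM parameter set λ consists of the mixture weights α_m, mean vectors μ_m, and covariance matrices Σ_m:

$$\lambda = \{\alpha_m, \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m \mid m = 1, 2, \ldots, M\} \tag{1}$$

$$\sum_{m=1}^{M} \alpha_m = 1, \quad \alpha_m \ge 0 \tag{2}$$

$$\mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m) = \frac{1}{\sqrt{(2\pi)^D |\boldsymbol{\Sigma}_m|}} \exp\left[-\frac{1}{2}(\mathbf{x}_t - \boldsymbol{\mu}_m)^{\mathsf{T}} \boldsymbol{\Sigma}_m^{-1} (\mathbf{x}_t - \boldsymbol{\mu}_m)\right] \tag{3}$$

2.2

Given the source sequence x = [x_0, x_1, ..., x_T]^T and the target sequence y = [y_0, y_1, ..., y_T]^T, the conversion function [16] is

$$F(\mathbf{x}) = E[\mathbf{y} \mid \mathbf{x}] = \sum_{m=1}^{M} h_m(\mathbf{x})\, E_m[\mathbf{y} \mid \mathbf{x}] \tag{4}$$

$$E_m[\mathbf{y} \mid \mathbf{x}] = \boldsymbol{\mu}_m^{(y)} + \boldsymbol{\Sigma}_m^{(yx)} \left(\boldsymbol{\Sigma}_m^{(xx)}\right)^{-1} \left(\mathbf{x} - \boldsymbol{\mu}_m^{(x)}\right) \tag{5}$$

$$h_m(\mathbf{x}) = \frac{\alpha_m\, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_m^{(x)}, \boldsymbol{\Sigma}_m^{(xx)})}{\sum_{i=1}^{M} \alpha_i\, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_i^{(x)}, \boldsymbol{\Sigma}_i^{(xx)})} \tag{6}$$

where μ_m^{(x)} and μ_m^{(y)} are the mean vectors of the source and the target, Σ_m^{(xx)} is the covariance matrix of the source, and Σ_m^{(yx)} is the cross-covariance matrix.

2.3

With the joint vector z_t = [x_t^T, y_t^T]^T, the joint density is modeled as

$$p(\mathbf{z}) = \sum_{m=1}^{M} \alpha_m\, \mathcal{N}(\mathbf{z}; \boldsymbol{\mu}_m^{(z)}, \boldsymbol{\Sigma}_m^{(z)}) \tag{7}$$

The parameters α_m, μ_m^{(x)}, μ_m^{(y)}, Σ_m^{(xx)}, and Σ_m^{(yx)} are taken from the joint GMM, whose covariance Σ_m^{(z)} and mean μ_m^{(z)} are

$$\boldsymbol{\Sigma}_m^{(z)} = \begin{bmatrix} \boldsymbol{\Sigma}_m^{(xx)} & \boldsymbol{\Sigma}_m^{(xy)} \\ \boldsymbol{\Sigma}_m^{(yx)} & \boldsymbol{\Sigma}_m^{(yy)} \end{bmatrix} \tag{8}$$

$$\boldsymbol{\mu}_m^{(z)} = \begin{bmatrix} \boldsymbol{\mu}_m^{(x)} \\ \boldsymbol{\mu}_m^{(y)} \end{bmatrix} \tag{9}$$

The joint GMM parameters, including Σ_m^{(xx)} and Σ_m^{(yx)}, are estimated with the EM algorithm.

3.

3.1 NMF

In the NMF-based approach [17], the l-th frame of the observation x_l is approximated by a non-negative linear combination of J bases:

$$\mathbf{x}_l \approx \sum_{j=1}^{J} \mathbf{a}_j h_{j,l} = \mathbf{A}\mathbf{h}_l \tag{10}$$

where a_j is the j-th basis and h_{j,l} is the weight (activity) of a_j at frame l.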
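The GMM conversion function of Eqs. (4) to (6) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `gaussian_pdf` and `gmm_convert` and the argument layout (one array entry per mixture component) are assumptions made here, and the parameters are taken as already trained.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density N(x; mean, cov), as in Eq. (3)."""
    d = x.shape[0]
    diff = x - mean
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm

def gmm_convert(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Conversion function F(x) = E[y|x] of Eq. (4) for one source frame x.

    Hypothetical layout: weights[m], mu_x[m], mu_y[m], cov_xx[m], cov_yx[m]
    hold the parameters of mixture component m.
    """
    n_mix = len(weights)
    # Posterior probability h_m(x) of each component, Eq. (6).
    lik = np.array([weights[m] * gaussian_pdf(x, mu_x[m], cov_xx[m])
                    for m in range(n_mix)])
    h = lik / lik.sum()
    # Component-wise conditional mean E_m[y|x], Eq. (5).
    cond = np.array([mu_y[m] + cov_yx[m] @ np.linalg.solve(cov_xx[m], x - mu_x[m])
                     for m in range(n_mix)])
    # Posterior-weighted sum over components, Eq. (4).
    return h @ cond
```

With a single component, zero source mean, and identity covariances, the function reduces to shifting the input by the target mean, which is a convenient sanity check.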

Here A = [a_1 ... a_J] is the dictionary and h_l = [h_{1,l} ... h_{J,l}]^T is the activity vector at frame l.

Fig. 1 Activity matrix of the source signal
Fig. 2 Activity matrix of the target signal
Fig. 3 Basic approach of voice conversion based on NMF

(DP) 1 2 NMF STRAIGHT [18] (STRAIGHT ) 3 NMF

3.2 STRAIGHT [18] (STFT) STRAIGHT 4 STFT STRAIGHT STRAIGHT DP STFT STRAIGHT STFT STRAIGHT

3.3 NMF

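Fig. 4 Construction of source and target dictionaries

The l-th frame of the noisy input is modeled as the sum of a source component and a noise component:

$$\mathbf{x}_l = \mathbf{x}_l^s + \mathbf{x}_l^n \approx \sum_{j=1}^{J} \mathbf{a}_j^s h_{j,l}^s + \sum_{k=1}^{K} \mathbf{a}_k^n h_{k,l}^n = [\mathbf{A}^s\ \mathbf{A}^n] \begin{bmatrix} \mathbf{h}_l^s \\ \mathbf{h}_l^n \end{bmatrix} \ \text{s.t.}\ \mathbf{h}_l^s, \mathbf{h}_l^n \ge 0 \;=\; \mathbf{A}\mathbf{h}_l \ \text{s.t.}\ \mathbf{h}_l \ge 0 \tag{11}$$

where x_l^s and x_l^n are the source speech and the noise, and A^s, A^n, h_l^s, and h_l^n are the source dictionary, the noise dictionary, and their activities at frame l. Collecting the frames, Eq. (11) is written in matrix form as

$$\mathbf{X} \approx [\mathbf{A}^s\ \mathbf{A}^n] \begin{bmatrix} \mathbf{H}^s \\ \mathbf{H}^n \end{bmatrix} \ \text{s.t.}\ \mathbf{H}^s, \mathbf{H}^n \ge 0 \;=\; \mathbf{A}\mathbf{H} \ \text{s.t.}\ \mathbf{H} \ge 0 \tag{12}$$

Given X, A^s, and A^n, the activity matrix H is estimated by NMF [15]. X and A are first normalized so that each frame and each exemplar sums to 1:

$$\mathbf{M} = \mathbf{1}^{(D \times D)} \mathbf{X}, \qquad \mathbf{X} \leftarrow \mathbf{X} ./ \mathbf{M}, \qquad \mathbf{A} \leftarrow \mathbf{A} ./ (\mathbf{1}^{(D \times D)} \mathbf{A}) \tag{13}$$

where 1 denotes an all-ones matrix. H is obtained by minimizing the following cost function:

$$d(\mathbf{X}, \mathbf{A}\mathbf{H}) + \left\| (\boldsymbol{\lambda} \mathbf{1}^{(1 \times L)}) .* \mathbf{H} \right\|_1 \quad \text{s.t.}\ \mathbf{H} \ge 0 \tag{14}$$

which is minimized by iterating the update rule

$$\mathbf{H}_{n+1} = \mathbf{H}_n .* \left( \mathbf{A}^{\mathsf{T}} (\mathbf{X} ./ (\mathbf{A}\mathbf{H}_n)) \right) ./ \left( \mathbf{1}^{((J+K) \times L)} + \boldsymbol{\lambda} \mathbf{1}^{(1 \times L)} \right) \tag{15}$$

In Eq. (14), the first term is the Kullback-Leibler divergence between X and AH, and the second term is the L1-norm sparsity constraint on H. The sparsity weights are λ^T = [λ_1 ... λ_J ... λ_{J+K}], where [λ_1 ... λ_J] are set to 0.2 and [λ_{J+1} ... λ_{J+K}] are set to 0.

3.4 H H^s

The target dictionary A^t is normalized in the same way:

$$\mathbf{A}^t \leftarrow \mathbf{A}^t ./ (\mathbf{1}^{(D \times D)} \mathbf{A}^t) \tag{16}$$

Using H^s and the normalization factor M of Eq. (13), the converted spectrum is

$$\hat{\mathbf{X}}^t = (\mathbf{A}^t \mathbf{H}^s) .* \mathbf{M} \tag{17}$$

STRAIGHT NMF STRAIGHT STRAIGHT STRAIGHT

3.5 NMF NMF NMF

4.

4.1 GMM NMF ATR 8kHz 216 57,033 GMM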
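The activity estimation and conversion of Eqs. (13) to (17) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' code: the function names, the iteration count, the all-ones initialization of H, and the small constant `eps` that guards against division by zero are all choices made here, and the source and target dictionaries are assumed to share the same feature dimension D so that the scale M can be reapplied in Eq. (17).

```python
import numpy as np

def estimate_activities(X, A, lam, n_iter=200, eps=1e-12):
    """Estimate H >= 0 minimizing KL(X || AH) + ||(lam 1) .* H||_1, Eq. (14)."""
    # Eq. (13): every row of M holds the per-frame sums of X; normalize
    # each frame of X and each exemplar (column of A) to unit sum.
    M = np.ones((X.shape[0], 1)) @ X.sum(axis=0, keepdims=True)
    Xn = X / (M + eps)
    An = A / (A.sum(axis=0, keepdims=True) + eps)
    H = np.ones((An.shape[1], X.shape[1]))   # assumed initialization
    penalty = lam[:, None]                   # lam 1^(1 x L) via broadcasting
    for _ in range(n_iter):
        # Eq. (15): multiplicative update; the "1" in the denominator is
        # An.T @ ones, which equals one because the columns of An sum to 1.
        H *= (An.T @ (Xn / (An @ H + eps))) / (1.0 + penalty)
    return H, M

def convert_noisy_source(X, A_src, A_noise, A_tgt, lam_src=0.2, n_iter=200):
    """Decompose noisy X over [A_src A_noise] and rebuild it from A_tgt."""
    J, K = A_src.shape[1], A_noise.shape[1]
    A = np.hstack([A_src, A_noise])
    # Sparsity weights: 0.2 for source exemplars, 0 for noise exemplars.
    lam = np.concatenate([np.full(J, lam_src), np.zeros(K)])
    H, M = estimate_activities(X, A, lam, n_iter)
    H_s = H[:J]                                         # source activities only
    A_t = A_tgt / A_tgt.sum(axis=0, keepdims=True)      # Eq. (16)
    return (A_t @ H_s) * M                              # Eq. (17)
```

Only the source part H^s of the activities is carried over to the target dictionary, so the contribution explained by the noise exemplars is left out of the converted spectrum.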

STRAIGHT 40 GMM 64 50 25 CENSREC-1-C SNR 10dB 104 256 512 STRAIGHT 500 1 (seg(2)) 1 (seg(3))

4.2 50 25 5 6 7 9 GMM NMF 10 11

Fig. 5 Cepstrum distortion for each method in the case of using 50 words
Fig. 6 Cepstrum distortion for each method in the case of using 25 sentences
Fig. 7 Converted spectrum envelope of source speaker based on GMM
Fig. 8 Converted spectrum envelope of source speaker with NMF
Fig. 9 Converted spectrum envelope of source speaker with proposed method
Fig. 10 Spectrum envelope of source speaker
Fig. 11 Spectrum envelope of target speaker
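The abstract states that segment features are used to improve the weight estimation, but this transcript does not preserve how the segments are built or what exactly the seg(2) and seg(3) conditions denote. One common construction, shown here purely as an assumption, stacks a fixed number of consecutive spectral frames into a single longer vector, so that each exemplar and each observation carries some temporal context. The helper name `stack_segments` and its `width` parameter are hypothetical.

```python
import numpy as np

def stack_segments(S, width):
    """Stack `width` consecutive frames (columns of S) into segment vectors.

    S has shape (D, L); the result has shape (D * width, L - width + 1),
    and column l holds frames l, l+1, ..., l+width-1 concatenated.
    """
    D, L = S.shape
    if not 1 <= width <= L:
        raise ValueError("width must be between 1 and the number of frames")
    return np.vstack([S[:, i:L - width + 1 + i] for i in range(width)])
```

Under this assumption, the same stacking would be applied to the input spectrogram and to the columns of the source and noise dictionaries before estimating the activities, since non-negativity is preserved by concatenation.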

5 6 NMF GMM

5. NMF NMF NMF

[1] Y. Iwami, T. Toda, H. Saruwatari, and K. Shikano, "GMM-based Voice Conversion Applied to Emotional Speech Synthesis," IEEE Trans. Speech and Audio Proc., Vol. 7, pp. 2401-2404, 1999.
[2] C. Veaux and X. Rodet, "Intonation conversion from neutral to expressive speech," in Proc. INTERSPEECH, pp. 2765-2768, 2011.
[3] K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, "Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech," Speech Communication, Vol. 54, No. 1, pp. 134-146, 2012.
[4] Y. Stylianou, O. Cappe, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech and Audio Processing, Vol. 6, No. 2, pp. 131-142, 1998.
[5] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," in Proc. ICASSP, pp. 655-658, 1988.
[6] H. Valbret, E. Moulines, and J. P. Tubach, "Voice transformation using PSOLA technique," Speech Communication, Vol. 11, No. 2-3, pp. 175-187, 1992.
[7] T. Toda, A. Black, and K. Tokuda, "Voice conversion based on maximum likelihood estimation of spectral parameter trajectory," IEEE Trans. Audio, Speech, Lang. Process., Vol. 15, No. 8, pp. 2222-2235, 2007.
[8] E. Helander, T. Virtanen, J. Nurminen, and M. Gabbouj, "Voice conversion using partial least squares regression," IEEE Trans. Audio, Speech, Lang. Process., Vol. 18, No. 5, pp. 912-921, 2010.
[9] C. H. Lee and C. H. Wu, "MAP-based adaptation for speech conversion using adaptation data selection and non-parallel training," in Proc. INTERSPEECH, pp. 2254-2257, 2006.
[10] T. Toda, Y. Ohtani, and K. Shikano, "Eigenvoice conversion based on Gaussian mixture model," in Proc. INTERSPEECH, pp. 2446-2449, 2006.
[11] D. Saito, K. Yamamoto, N. Minematsu, and K. Hirose, "One-to-many voice conversion based on tensor representation of speaker space," in Proc. INTERSPEECH, pp. 653-656, 2011.
[12] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Proc. Neural Information Processing Systems, pp. 556-562, 2001.
[13] T. Virtanen, "Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria," IEEE Trans. Audio, Speech, Lang. Process., Vol. 15, No. 3, pp. 1066-1074, 2007.
[14] M. N. Schmidt and R. K. Olsson, "Single-channel speech separation using sparse non-negative matrix factorization," in Proc. INTERSPEECH, pp. 2614-2617, 2006.
[15] J. F. Gemmeke, T. Virtanen, and A. Hurmalainen, "Exemplar-Based Sparse Representations for Noise Robust Automatic Speech Recognition," IEEE Trans. Audio, Speech, Lang. Process., Vol. 19, Issue 7, pp. 2067-2080, 2011.
[16] Y. Stylianou, O. Cappe, and E. Moulines, "Statistical methods for voice quality transformation," in Proc. European Speech Association, 1995.
[17] Ryoichi Takashima, Tetsuya Takiguchi, and Yasuo Ariki, "Exemplar-Based Voice Conversion in Noisy Environment," IEEE Workshop on Spoken Language Technology (SLT2012), pp. 313-317, 2012.
[18] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigne, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, Vol. 27, pp. 187-207, 1999.