SNR F0 [2], [3], [4] F0 F0 F0 F0 F0 TUSK F0 TUSK F0 6 TUSK 6 F0 2. F0 F0 [5] [6] [7] p[8] Cepstrum [9], [10] [11] [12] [13] F0 [14] F0 [15] DIO[16] [1

Σχετικά έγγραφα
Vol.4-DCC-8 No.8 Vol.4-MUS-5 No.8 4// 3 3 Hanning (T ) 3 Hanning 3T (y(t)w(t)) dt =.5 T y (t)dt. () STRAIGHT F 3 TANDEM-STRAIGHT[] 3 F F 3 [] F []. :

VOCODER VOCODER Vocal

Signal processing for handling singing voice texture

: TANDEM-STRAIGHT. Make singing voice tangible: TANDEM-STRAIGHT and temporally variable morphing as substrate. Hideki Kawahara 1 and Masanori Morise 2

Fourier transform, STFT 5. Continuous wavelet transform, CWT STFT STFT STFT STFT [1] CWT CWT CWT STFT [2 5] CWT STFT STFT CWT CWT. Griffin [8] CWT CWT

Ψηφιακή Επεξεργασία Φωνής

MIDI [8] MIDI. [9] Hsu [1], [2] [10] Salamon [11] [5] Song [6] Sony, Minato, Tokyo , Japan a) b)

GUI

1,a) 1,b) 2 3 Sakriani Sakti 1 Graham Neubig 1 1. A Study on HMM-Based Speech Synthesis Using Rich Context Models

Estimation, Evaluation and Guarantee of the Reverberant Speech Recognition Performance based on Room Acoustic Parameters

LSP (Line Spectrum Pair, LSF: Line Spectrum Frequencies) [19] [20] LSP CODE [21], [22] VOCODER [23] 2. 3 STRAIGHT VOCODER STRAIGHT [24] STRAIGHT [25]

An Automatic Modulation Classifier using a Frequency Discriminator for Intelligent Software Defined Radio

F0 Estimation of Melody and Bass Lines in Real-world Musical Audio Signals

Detection and Recognition of Traffic Signal Using Machine Learning

CSJ. Speaker clustering based on non-negative matrix factorization using i-vector-based speaker similarity

Feasible Regions Defined by Stability Constraints Based on the Argument Principle

Πτυχιακή Εργασι α «Εκτι μήσή τής ποιο τήτας εικο νων με τήν χρή σή τεχνήτων νευρωνικων δικτυ ων»

6.003: Signals and Systems. Modulation

3: A convolution-pooling layer in PS-CNN 1: Partially Shared Deep Neural Network 2.2 Partially Shared Convolutional Neural Network 2: A hidden layer o

Buried Markov Model Pairwise

Speech Recognition using Phase Information based on Long-Term Analysis

[5] F 16.1% MFCC NMF D-CASE 17 [5] NMF NMF 3. [5] 1 NMF Deep Neural Network(DNN) FUSION 3.1 NMF NMF [12] S W H 1 Fig. 1 Our aoustic event detect

Acoustic Signal Adjustment by Considering Musical Expressive Intention Using a Performance Intension Function

Analysis of prosodic features in native and non-native Japanese using generation process model of fundamental frequency contours

Nov Journal of Zhengzhou University Engineering Science Vol. 36 No FCM. A doi /j. issn

Διπλωματική Εργασία. του φοιτητή του Τμήματος Ηλεκτρολόγων Μηχανικών και Τεχνολογίας Υπολογιστών της Πολυτεχνικής Σχολής του Πανεπιστημίου Πατρών

Retrieval of Seismic Data Recorded on Open-reel-type Magnetic Tapes (MT) by Using Existing Devices

A study of geometric dependency of cepstrum on vocal tract length

Assignment 1 Solutions Complex Sinusoids

Research on mode-locked optical fiber laser

Main source: "Discrete-time systems and computer control" by Α. ΣΚΟΔΡΑΣ ΨΗΦΙΑΚΟΣ ΕΛΕΓΧΟΣ ΔΙΑΛΕΞΗ 4 ΔΙΑΦΑΝΕΙΑ 1

Ψηφιακή Επεξεργασία Φωνής

Ψηφιακή Επεξεργασία Φωνής

1181 (real-timespeechdriven) 1 1 ( ) D FAP FAP (voiceactivationdetectionvad) D FaceGen 3- D XfaceEd MPEG-4 1 FAP 66 FAP ( ) FAP 84

Ψηφιακή Επεξεργασία Φωνής

[2] REVERB 8 [3], [4] [5] [20] [6], [7], [8], [9], [10] [11] REVERB 8 *1 [9] LDA *2 MLLT (SAT) [8] (basis fmllr) [12] (DNN) [10] DNN [11] [13] [14] Ka

University of Illinois at Urbana-Champaign ECE 310: Digital Signal Processing

Voice Conversion based on Non-negative Matrix Factorization with Segment Features in Noisy Environments

Stabilization of stock price prediction by cross entropy optimization

ΚΑΝΟΝΙΣΜΟΣ ΕΚΠΟΝΗΣΗΣ ΕΡΓΑΣΙΩΝ ΓΙΑ ΤΟ ΜΑΘΗΜΑ «ΕΠΕΞΕΡΓΑΣΙΑ ΨΗΦΙΑΚΟΥ ΣΗΜΑΤΟΣ ΚΑΙ ΣΧΕΔΙΑΣΜΟΣ ΥΛΙΚΟΥ»

Περιεχόµενα. ΕΠΛ 422: Συστήµατα Πολυµέσων. Μέθοδοι συµπίεσης ηχητικών. Βιβλιογραφία. Κωδικοποίηση µε βάση την αντίληψη.

The Algorithm to Extract Characteristic Chord Progression Extended the Sequential Pattern Mining

A Method for Creating Shortcut Links by Considering Popularity of Contents in Structured P2P Networks

Probability and Random Processes (Part II)

JOURNAL OF APPLIED SCIENCES Electronics and Information Engineering. Cyclic MUSIC DOA TN (2012)

Higher-Order Correlation Analysis of Pitch Fluctuations in Sustained Normal Vowels by the Method of Surrogate Data

Simplex Crossover for Real-coded Genetic Algolithms

Query by Phrase (QBP) (Music Information Retrieval, MIR) QBH QBP / [1, 2] [3, 4] Query-by-Humming (QBH) QBP MIDI [5, 6] [8 10] [7]

Non-negative Matrix Factorization, NMF [5] NMF. [1 3] Bregman [4] Harmonic-Temporal Clustering, HTC [2,3] 1,2,b) NTT

Sampling Basics (1B) Young Won Lim 9/21/13

Study of In-vehicle Sound Field Creation by Simultaneous Equation Method

Study on Re-adhesion control by monitoring excessive angular momentum in electric railway traction

Fundamentals of Array Antennas

Sinsy: HMM. Sinsy An HMM-based singing voice synthesis system which can realize your wish I want this person to sing my song

Echo path identification for stereophonic acoustic echo cancellation without pre-processing

Singing Information Processing: Music Information Processing for Singing Voices

Yoshifumi Moriyama 1,a) Ichiro Iimura 2,b) Tomotsugu Ohno 1,c) Shigeru Nakayama 3,d)

Robust Feature Extraction Method Based on Run-Length Compensation for Degraded Character Recognition

Web 論 文. Performance Evaluation and Renewal of Department s Official Web Site. Akira TAKAHASHI and Kenji KAMIMURA

Επιστημολογικές τάσεις της Εκπαιδευτικής Έρευνας Δράσης στην Ελλάδα κατά την τελευταία εικοσαετία ( )

ST5224: Advanced Statistical Theory II

/MAC DoS. A Coding Scheme Using Matched Filter Resistant against DoS Attack to PHY/MAC Layer in Wireless Communications

Statistical analysis of extreme events in a nonstationary context via a Bayesian framework. Case study with peak-over-threshold data

CT Correlation (2B) Young Won Lim 8/15/14

ΜΕΛΕΤΗ ΚΑΙ ΠΡΟΣΟΜΟΙΩΣΗ ΙΑΜΟΡΦΩΣΕΩΝ ΣΕ ΨΗΦΙΑΚΑ ΣΥΣΤΗΜΑΤΑ ΕΠΙΚΟΙΝΩΝΙΩΝ.

1 (forward modeling) 2 (data-driven modeling) e- Quest EnergyPlus DeST 1.1. {X t } ARMA. S.Sp. Pappas [4]

ΔΙΠΛΩΜΑΤΙΚΗ ΕΡΓΑΣΙΑ. του Φοιτητή του Τμήματος Ηλεκτρολόγων Μηχανικών και Τεχνολογίας Υπολογιστών της Πολυτεχνικής Σχολής του Πανεπιστημίου Πατρών

Fourier Series. MATH 211, Calculus II. J. Robert Buchanan. Spring Department of Mathematics

Dynamic types, Lambda calculus machines Section and Practice Problems Apr 21 22, 2016

«ΑΝΑΠΣΤΞΖ ΓΠ ΚΑΗ ΥΩΡΗΚΖ ΑΝΑΛΤΖ ΜΔΣΔΩΡΟΛΟΓΗΚΩΝ ΓΔΓΟΜΔΝΩΝ ΣΟΝ ΔΛΛΑΓΗΚΟ ΥΩΡΟ»

Development of a Tiltmeter with a XY Magnetic Detector (Part +)

Introduction to Time Series Analysis. Lecture 16.

: Monte Carlo EM 313, Louis (1982) EM, EM Newton-Raphson, /. EM, 2 Monte Carlo EM Newton-Raphson, Monte Carlo EM, Monte Carlo EM, /. 3, Monte Carlo EM

Διπλωματική Εργασία της φοιτήτριας του Τμήματος Ηλεκτρολόγων Μηχανικών και Τεχνολογίας Υπολογιστών της Πολυτεχνικής Σχολής του Πανεπιστημίου Πατρών

BandPass (4A) Young Won Lim 1/11/14

Contents Preliminary on Watermarking Technology Part I Signal Watermarking 2 Audio Watermarking

DERIVATION OF MILES EQUATION FOR AN APPLIED FORCE Revision C

J. of Math. (PRC) Banach, , X = N(T ) R(T + ), Y = R(T ) N(T + ). Vol. 37 ( 2017 ) No. 5

Assalamu `alaikum wr. wb.


ΤΕΧΝΟΛΟΓΙΚΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΚΥΠΡΟΥ ΣΧΟΛΗ ΕΠΙΣΤΗΜΩΝ ΥΓΕΙΑΣ

40 3 Journal of South China University of Technology Vol. 40 No Natural Science Edition March

IPSJ SIG Technical Report Vol.2014-CE-127 No /12/6 CS Activity 1,a) CS Computer Science Activity Activity Actvity Activity Dining Eight-He

Spectrum Representation (5A) Young Won Lim 11/3/16

Outline Analog Communications. Lecture 05 Angle Modulation. Instantaneous Frequency and Frequency Deviation. Angle Modulation. Pierluigi SALVO ROSSI

ΤΕΧΝΟΛΟΓΙΚΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΚΥΠΡΟΥ ΣΧΟΛΗ ΓΕΩΤΕΧΝΙΚΩΝ ΕΠΙΣΤΗΜΩΝ ΚΑΙ ΔΙΑΧΕΙΡΙΣΗΣ ΠΕΡΙΒΑΛΛΟΝΤΟΣ. Πτυχιακή εργασία

Ψηφιακή Επεξεργασία Φωνής

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ ΣΧΟΛΗ ΗΛΕΚΤΡΟΛΟΓΩΝ ΜΗΧΑΝΙΚΩΝ ΚΑΙ ΜΗΧΑΝΙΚΩΝ ΥΠΟΛΟΓΙΣΤΩΝ

VBA Microsoft Excel. J. Comput. Chem. Jpn., Vol. 5, No. 1, pp (2006)

Finite Field Problems: Solutions

A Determination Method of Diffusion-Parameter Values in the Ion-Exchange Optical Waveguides in Soda-Lime glass Made by Diluted AgNO 3 with NaNO 3

Χρηματοοικονομική Ανάπτυξη, Θεσμοί και

[1] DNA ATM [2] c 2013 Information Processing Society of Japan. Gait motion descriptors. Osaka University 2. Drexel University a)

A Method of Trajectory Tracking Control for Nonminimum Phase Continuous Time Systems

Applying Markov Decision Processes to Role-playing Game

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ

*,* + -+ on Bedrock Bath. Hideyuki O, Shoichi O, Takao O, Kumiko Y, Yoshinao K and Tsuneaki G

What happens when two or more waves overlap in a certain region of space at the same time?


Transcript:

1,a) 2 F0 TUSK F0 F0 F0 F0 TUSK TUSK F0 Prototype of a framework for overviewing the performance of F0 estimators Morise Masanori 1,a) Kawahara Hideki 2 Abstract: This article represents a framework for overviewing the performance of fundamental frequency (F0) estimator and evaluates its effectiveness. Many F0 estimators have been proposed, and their effectivenesses have been evaluated by speech databases. On the other hand, since the evaluation result depends on the speech database used for the evaluation, it is difficult to fairly evaluate the estimators. The framework, named TUSK, does not rank the estimators but attempts to overview them. In this article, we introduce the concept of TUSK and evaluation criteria, and its effectiveness is examined by several modern F0 estimators. Keywords: Speech analysis, fundamental frequency, temporal variation, noise robustness 1. (F0) 1 1 University of Yamanashi, 4-3-11, Takeda, Kofu, 400 8511, Japan 2 Wakayama University, 930, Sakaedani, Wakayama, 640 8510, Japan. a) mmorise@yamanashi.ac.jp F0 F0 F0 F0 F0 Vocoder [1] 1

SNR F0 [2], [3], [4] F0 F0 F0 F0 F0 TUSK F0 TUSK F0 6 TUSK 6 F0 2. F0 F0 [5] [6] [7] p[8] Cepstrum [9], [10] [11] [12] [13] F0 [14] F0 [15] DIO[16] [17], [18] 14 [12] CMU *1 Bagshaw *2 Electrogastrogram (EGG) EGG F0 F0 Fine pitch error (FPA) Gross pitch error (GPA) [19] Gross error [7] F0 Gross Error FPE F0 F0 F0 F0 3. TUSK TUSK F0 6 F0 F0 3.1 x(t) = n(t) + h(t) K k=1 ( a k cos 2πk t 0 f 0 (τ)dτ + θ k ), n(t) h(t) f 0 (t) F0 a k θ k k K Kf 0 (t) TUSK F0 f 0 (t) f c F0 Klatt *1 http://festvox.org/cmu arctic/index.html *2 http://www.cstr.ed.ac.uk/research/projects/fda/ 2

[20] f 0 (t) = F L f c (sin(2π12.7t) + sin(2π7.1t) 50 100 + sin(2π4.7t)), (1) F L [20] 25 F0 f 0 (t) = f c F0 n(t) = 0 h(t) = δ(t) a k = 0 θ k = 1 f c = 440 1 F0 TUSK Gross error ±20% 20%F0 FPE GPE F0 Gross error TUSK F0 3.2 ACT1: F0 F0 F0 Klatt F0 f c F0 f v (t) = ( ) αf c cos αfc t, (2) α F0 αf c F0 f c f 0 (t) f v (t) ACT2 α F0 3.4 ACT3: a k ACT3 (1) a k F0 F0 TUSK a k 3.5 ACT4: ACT3 ACT4 (1) θ k ACT4 2π 3.3 ACT2: F0 F0 F0 F0 F0 F0 F0 ACT2 F0 F0 3.6 ACT5: SNR ACT5 n(t) SNR SNR SNR 2 SNR 3

3.7 ACT6: ACT6 h(t) 0 (t < 0) 1 (t = 0) h e (t) = ( ) (3) t log(0.001) exp r (otherwise) 10 r t 0 h e (t) h(t) TUSK 1 1 EDT (Early Decay Time) 10 db EDT 0 r 4. TUSK F0 F0 [7] [11] DIO[16] DIO STRAIGHT [21] TANDEM-STRAIGHT [22], [23] Matlab DIO StoneMask Web *3 F0 40 1000 Hz 1.2 s f 0 (t) 1 ms 0.1 1.1 s 1000 F0 48 khz ACT1 ACT6 1 *3 http://ml.cs.yamanashi.ac.jp/world/ 2500 3000 3500 4000 4500 5000 5500 6000 6500 F0 (cent) 1 TUSK ACT1 ACT5 2 ACT3 ACT6 100 ACT1 1 ACT2 α: 0...25 ACT3 ACT4 ACT5 (white noise) ACT5 (pink noise) ACT6 4.1 ACT1 f c : 2100...6900 cent (A1...A5) a k : 0...40 db θ k : 0...2π SNR: 0...60 db SNR: 0...60 db r: 10...1000 ms F0 1 cent F0 2100 cent (55 Hz) 6900 cent (880 Hz) F0 F0 F0 F0 4.2 ACT2 2 F0 α α 20 F0 4

10 3 DIO 0 5 5 20 25 Vibrato intensity α 0 0 30 40 50 60 SNR (db) 2 TUSK ACT2 5 TUSK ACT5 (whilte noise) 0 0 30 40 50 60 Amplitude randomization (db) 3 TUSK ACT3 0 0 30 40 50 60 SNR (db) 6 TUSK ACT5 (pink noise) DIO SNR ACT1 DIO 0 1 2 3 4 5 6 Phase randomization (rad) 4 TUSK ACT4 4.3 ACT3, ACT4 ACT3, ACT4 3,4 DIO 0.1 Hz 4.4 ACT5 ACT5 5,6 SNR 4.5 ACT6 7 10 db 5. DIO SNR 5

100 200 300 400 500 600 700 800 900 1000 Reverberation parameter (ms) 7 TUSK ACT6 STRAIGHT F0 6. F0 TUSK F0 F0 15H02726 26540087 H25/A08 [1] Dudley, H.: Remaking speech, J. Acoust. Soc. Am., Vol. 11, No. 2, pp. 169 177 (1939). [2] Morise, M.: CheapTrick, a spectral envelope estimator for high-quality speech synthesis, Speech Communication, Vol. 67, pp. 1 7 (2015). [3] Morise, M.: Error evaluation of an F0-adaptive spectral envelope estimator in robustness against the additive noise and F0 error, IEICE Trans. Inf. & Syst., Vol. E98- D, No. 7, pp. 1405 1408 (2015). [4] Nakano, T. and Goto, M.: A spectral envelope estimation method based on F0-adaptive multi-frame integration analysis, in Proc. SAPA-SCALE 2012, pp. 11 16 (2012). [5] Hess, W.: Pitch determination of speech signals, Springer-Verlag (1983). [6] Ross, M., Shaffer, H., Cohen, A., Freudberg, R. and Manley, H.: Average magnitude difference function pitch extractor, IEEE Transactions on acoustic, speech, and signal processing, Vol. ASSP-22, No. 5, pp. 353 362 (1974). [7] Cheveigné, A. and Kawahara, H.:, a fundamental frequency estimator for speech and music, J. Acoust. Soc. Am., Vol. 111, No. 4, pp. 1917 1930 (2002). [8] Mauch, M. and Dixon, S.: P: A fundamental frequency estimator using probabilistic threshold distributions, in Proc. ICASSP2014, pp. 659 663 (2014). [9] Noll, A.: Short-time spectrum and cepstrum techniques for vocal pitch detection, J. Acoust. Soc. Am., Vol. 36, No. 2, pp. 269 302 (1964). [10] Noll, A.: Cepstrum pitch determination, J. Acoust. Soc. Am., Vol. 41, No. 2, pp. 293 309 (1967). [11] Camacho, A. and Harris, J. G.: A sawtooth waveform inspired pitch estimator for speech and music, J. Acoust. Soc. Am., Vol. 124, No. 3, pp. 1638 1652 (2008). [12] D Vol. J83-DII, No. 11, pp. 2077 2086 (2000). [13] A Vol. J80-A, No. 11, pp. 1848 1856 (1997). [14] Yegnanarayana, B. and Murty, K.: Event-based instantaneous fundamental frequency estimation from speech signals, IEEE Transactions on Audio, Speech, and Language Processing, Vol. 17, No. 4, pp. 614 624 (2009). [15] Vol. 51, No. 7, pp. 509 518 (1995). [16] SNR F0 D Vol. J93-D, No. 2, pp. 109 117 (2010). [17] Shimamura, T. and Kobayashi, H.: Weighted autocorrelation for pitch extraction of noisy speech, IEEE Transactions on speech and audio processing, Vol. 9, No. 7, pp. 727 730 (2001). [18] Nakatani, T. and Irino, T.: Robust and accurate fundamental frequency estimation based on dominant harmonic components, J. Acoust. Soc. Am., Vol. 116, No. 6, pp. 3690 3700 (2004). [19] Rabiner, L., Cheng, M., Rosenberg, A. and McGonegal, C.: A comparative performance study of several pitch detection algorithms, IEEE Transactions on acoustic, speech, and signal processing, Vol. ASSP-24, No. 5, pp. 399 418 (1976). [20] Klatt, D. and Klatt, L.: Analysis, synthesis, and perception of voice quality variations among female and male talkers, J. Acoust. Soc. Am., Vol. 82, No. 2, pp. 820 857 (1990). [21] Kawahara, H., Cheveigné, A., Banno, H., Takahashi, T. and Irino, T.: Nearly defect-free F0 trajectory extraction for expressive speech modifications based on STRAIGHT, in Proc. Interspeech2005, pp. 537 540 (2005). [22] Kawahara, H., Morise, M., Takahashi, T., Nisimura, R., Irino, T. and Banno, H.: TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, f0, and aperiodicity estimation, in Proc. ICASSP2008, pp. 3933 3936 (2008). [23] Kawahara, H. and Morise, M.: Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework, SADHANA - Academy Proceedings in Engineering Sciences, Vol. 36, No. 5, pp. 713 728 (2011). 6