Signal processing for handling singing voice texture


Hideki Kawahara *1

Singers explore vocal expression to its limits. Conventional speech processing algorithms were designed for ordinary conversational and read speech, in which voiced sounds are usually quasi-periodic, and they cannot manipulate extreme expressions such as shouting in a flexible and versatile way. This article discusses signal processing approaches for handling singing voice texture, using recent investigations into extending TANDEM-STRAIGHT as an example testbed.

*1 Wakayama University

© 2012 Information Processing Society of Japan

1. [...] 1),2) [...] 3) [...] 4)-6) [...] TANDEM-STRAIGHT 7) [...] 8) [...]

2. [...] 9),10) [...] 11) [...] TANDEM-STRAIGHT [...]

3. TANDEM-STRAIGHT [...] TANDEM-STRAIGHT [...] STRAIGHT 12),13) [...] 14) [...] 15) [...] $x \ll 1$ [...]

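$\log(1 + x) \approx x$ [...] cepstrum [...] liftering 16),17) [...]

$$P_{ST}(\omega) = \exp\left(\mathcal{F}\left[g_1(q)\,g_2(q)\,C_T(q)\right]\right), \tag{1}$$

where $P_{ST}(\omega)$ [...], $C_T(q)$ is the cepstrum of $P_T(\omega)$, and $q$ is quefrency. $\mathcal{F}[\cdot]$ denotes the Fourier transform, and $g_1(q)$ and $g_2(q)$ are lifters:

$$g_1(q) = \alpha_0 + 2\alpha_1 \cos(2\pi q f_0), \tag{2}$$

$$g_2(q) = \frac{\sin(\pi f_0 q)}{\pi f_0 q}, \tag{3}$$

[...] $f_0$ [...] $g_2(q)$ [...] $f_0$ [...] $g_1(q)$ [...] consistent sampling 15) [...] $\alpha_0 = 1.18$, $\alpha_1 = -0.09$ 16).

3.1 TANDEM-STRAIGHT [...] VOCODER [...] 18) [...] VOCODER 19) [...] $P_{ST}(\omega, t)$ [...] 20) [...] $x_S(t)$ [...]

$$x_S(t) = \sum_{k=1}^{N} a_k(t) \sin\left(\int_0^t \omega_k(\lambda)\,d\lambda + \varphi_k(t)\right), \tag{4}$$

where $\omega_k(t) = 2\pi f_k(t)$ [...] $k$ [...] *1 [...] $f_0(t)$ [...] $k$ [...], $\varphi_k(t)$ [...] $k$ [...], and $a_k(t)$ [...] $k$ [...] *2:

$$a_k(t) = P_{ST}(k\omega_0(t), t)/\omega_0. \tag{5}$$

*1 [...] (DVD [...]) 21)
*2 [...] 22) [...] TANDEM-STRAIGHT [...]

4. [...]

4.1 [...] 23) *3 [...] 23) [...] 26),27) [...] 28) [...]

*3 Titze [...] NCVS (National Center for Voice & Speech) 24) [...] vocologist [...] 1992 [...] 25)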
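Equations (1) to (3) can be tried out numerically. The following is a minimal sketch, not the TANDEM-STRAIGHT implementation: the function name `lifter_spectrum`, the full-length FFT grid, and the demo parameters (`fs`, `f0`, `n_fft`) are assumptions made for illustration. It also assumes that $\alpha_1$ is negative ($-0.09$), so that $g_1(0) = \alpha_0 + 2\alpha_1 = 1$ and a flat spectrum passes through unchanged.

```python
import numpy as np


def lifter_spectrum(p_t, f0, fs, alpha0=1.18, alpha1=-0.09):
    """Apply Eqs. (1)-(3) to a power spectrum sampled on a full FFT grid.

    p_t : strictly positive power spectrum of length n_fft (bins 0 .. n_fft-1),
          assumed symmetric as for a real signal.
    f0  : fundamental frequency in Hz.
    fs  : sampling frequency in Hz.
    alpha1 defaults to -0.09, which is an assumption about the sign.
    """
    p_t = np.asarray(p_t, dtype=float)
    n_fft = len(p_t)
    # Quefrency axis in seconds, covering positive and negative quefrencies.
    q = np.fft.fftfreq(n_fft) * n_fft / fs
    # Eq. (2): compensation lifter.
    g1 = alpha0 + 2.0 * alpha1 * np.cos(2.0 * np.pi * q * f0)
    # Eq. (3): np.sinc(x) is sin(pi x) / (pi x), so this is sin(pi f0 q) / (pi f0 q).
    g2 = np.sinc(f0 * q)
    # Cepstrum of the input spectrum (real because log(p_t) is symmetric).
    c_t = np.fft.ifft(np.log(p_t)).real
    # Eq. (1): back to the frequency domain and undo the logarithm.
    return np.exp(np.fft.fft(g1 * g2 * c_t).real)


# Demo: a spectrum whose log has a ripple with period f0 along frequency.
fs, f0, n_fft = 16000, 200.0, 1024
freqs = np.arange(n_fft) * fs / n_fft
rippled = np.exp(0.5 * np.cos(2.0 * np.pi * freqs / f0))
smoothed = lifter_spectrum(rippled, f0, fs)
flat_out = lifter_spectrum(np.full(n_fft, 2.0), f0, fs)
print(np.max(np.abs(smoothed - 1.0)), np.max(np.abs(flat_out - 2.0)))
```

In this demo the ripple sits at quefrency $1/f_0$, where $g_2$ is zero, so the lifter removes it; the flat input is returned unchanged because $g_1(0)\,g_2(0) = 1$ under the assumed sign.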

[...] 29) *1 [...] 28) *2 [...]

4.2 [...] 30),31) [...] 4 [...] 0 [...] 90 [...] Schroeder [...] 32) *3 [...] 20 dB [...] 33) [...] SNR [...] 20 dB [...]

*1 [...] DVD [...] 21)
*2 [...] 50 ms [...] 100 ms [...]
*3 http://www.wakayama-u.ac.jp/~kawahara/phaseeffect/

5. [...] STRAIGHT [...] TANDEM-STRAIGHT [...] 1) [...] 2) [...] 3) [...] 4) [...] 5) [...] 4 [...]

5.1 [...] TANDEM 7) [...] XSX (excitation Structure extractor) [...] XSX [...] YIN 34) [...] SWIPE 35) [...] 36) [...]
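The sinusoidal model of Eq. (4) and the phase conditions named in the surviving fragments of Section 4.2 (0, 90, and Schroeder 32)) can be illustrated together. The following is a sketch under stated assumptions rather than the procedure used in the article: it reads 0 and 90 as phase offsets in degrees, places the k-th component at $k f_0$, holds the amplitudes constant, and uses one common form of the Schroeder phase for equal-amplitude harmonics, $\varphi_k = -\pi k (k-1)/N$. All function names are hypothetical.

```python
import numpy as np


def synthesize_harmonics(f0, amps, phases, fs):
    """Sum-of-sinusoids synthesis following Eq. (4).

    f0     : per-sample fundamental frequency in Hz, shape (n_samples,).
    amps   : harmonic amplitudes a_k, shape (n_harmonics,), constant in time here.
    phases : phase offsets phi_k in radians, shape (n_harmonics,).
    fs     : sampling frequency in Hz.
    Assumes the k-th component is located at k * f0 (purely harmonic signal).
    """
    f0 = np.asarray(f0, dtype=float)
    amps = np.asarray(amps, dtype=float)
    phases = np.asarray(phases, dtype=float)
    # Discrete running integral of 2*pi*f0 from time 0; the first sample has zero phase.
    theta = 2.0 * np.pi * np.concatenate(([0.0], np.cumsum(f0[:-1]))) / fs
    k = np.arange(1, len(amps) + 1)[:, None]
    components = amps[:, None] * np.sin(k * theta[None, :] + phases[:, None])
    return components.sum(axis=0)


def schroeder_phases(n_harmonics):
    """One common Schroeder phase rule for equal-amplitude harmonics (assumed form)."""
    k = np.arange(1, n_harmonics + 1)
    return -np.pi * k * (k - 1) / n_harmonics


def crest_factor(x):
    """Peak absolute value divided by RMS."""
    return np.max(np.abs(x)) / np.sqrt(np.mean(x ** 2))


fs, n_harmonics = 16000, 20
f0 = np.full(fs, 100.0)          # one second at a constant 100 Hz
amps = np.ones(n_harmonics)

sine = synthesize_harmonics(f0, amps, np.zeros(n_harmonics), fs)
cosine = synthesize_harmonics(f0, amps, np.full(n_harmonics, np.pi / 2), fs)
schroeder = synthesize_harmonics(f0, amps, schroeder_phases(n_harmonics), fs)

for name, x in (("0 deg", sine), ("90 deg", cosine), ("Schroeder", schroeder)):
    print(name, crest_factor(x))
```

All three signals share the same amplitude spectrum and differ only in phase, so the crest factor isolates the effect of phase on the waveform: the 90-degree condition is the most pulse-like and the Schroeder phase the flattest.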

[...] 37) [...] 38) [...] 60 dB [...] 60 dB [...] -90 dB [...] -18 dB/oct [...] cos [...] 39) [...] 40) [...] TANDEM [...]

5.2 [...] XSX [...] *1 [...] XSX [...] XSX [...] 41) [...] 36) [...]

*1 http://www.wakayama-u.ac.jp/%7ekawahara/howtandemstraightworks/

5.3 [...] 18),19),42),43) [...] 29) [...] TANDEM-STRAIGHT [...] sigmoid [...] 44) [...] TANDEM-STRAIGHT [...]

5.4 [...] 45) [...] Hilbert [...] TANDEM-STRAIGHT [...]

5.5 [...] ?),46) [...]

6. [...]

[...] TANDEM-STRAIGHT [...] 22650042 [...]

1) [...] Audio CD [...] (2002).
2) [...] RWC [...]: [...] Vol.45, No.3, pp.728-738 (2004).
3) [...]: [...] TANDEM-STRAIGHT [...]. [...] Vol.2010-MUS-86, No.6, pp.1-6 (2010).
4) [...]: [...] Vol.67, No.1, pp.46-50 (2011).
5) [...] Sinsy: HMM [...]. [...] Vol.2010-MUS-86, No.1, pp.1-8 (2010).
6) Villavicencio, F., Röbel, A. and Rodet, X.: Applying improved spectral modeling for high quality voice conversion, ICASSP 2009, pp.4285-4288 (2009).
7) Kawahara, H., Morise, M., Takahashi, T., Nisimura, R., Irino, T. and Banno, H.: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0 and aperiodicity estimation, ICASSP 2008, pp.3933-3936 (2008).
8) Kawahara, H., Nisimura, R., Irino, T., Morise, M., Takahashi, T. and Banno, H.: Temporally variable multi-aspect auditory morphing enabling extrapolation without objective and perceptual breakdown, ICASSP 2009, pp.3905-3908 (2009).
9) McDermott, J.H., Oxenham, A.J. and Simoncelli, E.P.: Sound texture synthesis via filter statistics, Proc. IEEE WASPAA, pp.297-300 (2009).
10) Ellis, D., Zeng, X. and McDermott, J.: Classifying soundtracks with audio texture features, ICASSP 2011, pp.5880-5883 (online), DOI:10.1109/ICASSP.2011.5947699 (2011).
11) [...]: Texture Synthesis Examples - Page 1 (McDermott and Simoncelli), New York University (online), available from http://www.cns.nyu.edu/ jhm/texture examples/ (accessed 2012-01-09).
12) Kawahara, H., Masuda-Katsuse, I. and de Cheveigné, A.: Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction, Speech Communication, Vol.27, No.3-4, pp.187-207 (1999).
13) [...] Vocoder [...] STRAIGHT [...] Vol.63, No.8, pp.442-449 (2007).
14) [...] D, Vol.J90-D, No.12, pp.3265-3267 (2007).
15) Unser, M.: Sampling 50 Years After Shannon, Proceedings of the IEEE, Vol.88, No.4, pp.569-587 (2000).
16) [...] F0 [...] A, Vol.J94-A, No.8, pp.557-567 (2011).
17) Kawahara, H. and Morise, M.: Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework, SADHANA - Academy Proceedings in Engineering Sciences, Vol.36, No.5, pp.713-722 (2011).
18) [...]: [...] Vol.110, No.297, pp.23-28 (2010).
19) Laroche, J., Stylianou, Y. and Moulines, E.: HNS: Speech modification based on a harmonic+noise model, Proc. ICASSP-93, Vol.2, pp.550-553 (online), DOI:10.1109/ICASSP.1993.319365 (1993).
20) [...] 2011 [...] No. 2CR3-7, pp. [...] CD ROM & web, http://www.interaction-ipsj.org/archives/paper2011/ (2011).
21) [...]: STRAIGHT introduction (DVD [...]), http://www.wakayama-u.ac.jp/raight 2001 OCT DEMO.zip (accessed 2012-01-09).
22) [...] 2010 [...] No.1-7-16, pp.315-316 (2010).
23) Titze, I.R.: Principles of voice production, Prentice Hall (1994). [...] 2003.
24) [...]: NCVS: Giving Voice to America, NCVS (online), available from http://www.ncvs.org/index.html (accessed 2012-01-09).
25) [...]: Ingo Titze and Pavarobotti singing Nessun Dorma, YouTube (online), available from http://www.youtube.com/watch?v=uqw03txzsha (accessed 2012-01-09).
26) Childers, D.G. and Ahn, C.: Modeling the glottal volume-velocity waveform for three voice types, J. Acoust. Soc. Am., Vol.97, No.1, pp.505-519 (1995).
27) Kawahara, H., Atake, Y. and Zolfaghari, P.: Accurate vocal event detection method based on a fixed-point analysis of mapping from time to weighted average group delay, Proc. ICSLP 2000, Beijing, China, pp.664-667 (2000).
28) Murty, K. and Yegnanarayana, B.: Epoch Extraction From Speech Signals, IEEE Transactions on Audio, Speech, and Language Processing, Vol.16, No.8, pp.1602-1613 (online), DOI:10.1109/TASL.2008.2004526 (2008).
29) Kawahara, H., Katayose, H., de Cheveigné, A. and Patterson, R.D.: Fixed Point Analysis of Frequency to Instantaneous Frequency Mapping for Accurate Estimation of F0 and Periodicity, Proc. EUROSPEECH 99, Vol.6, ESCA, pp.2781-2784 (1999).
30) Plomp, R. and Steeneken, H.J.M.: Effect of Phase on the Timbre of Complex Tones, J. Acoust. Soc. Am., Vol.46, No.2B, pp.409-421 (1969).
31) Patterson, R.D.: A pulse ribbon model of monaural phase perception, J. Acoust. Soc. Am., Vol.82, No.5, pp.1560-1586 (1987).
32) Schroeder, M.R.: Synthesis of low-peak-factor signals and binary sequences with low autocorrelation, IEEE Trans. Information Theory, Vol.16, No.1, pp.85-89 (1970).
33) Skoglund, J. and Kleijn, W.: On time-frequency masking in voiced speech, IEEE Transactions on Speech and Audio Processing, Vol.8, No.4, pp.361-369 (online), DOI:10.1109/89.848218 (2000).
34) de Cheveigné, A. and Kawahara, H.: YIN, a fundamental frequency estimator for speech and music, J. Acoust. Soc. Am., Vol.111, No.4, pp.1917-1930 (2002).
35) Camacho, A. and Harris, J.G.: A sawtooth waveform inspired pitch estimator for speech and music, J. Acoust. Soc. Am., Vol.124, No.3, pp.1638-1652 (2008).
36) [...] Vol.111, No.175, pp.81-86 (2011).
37) [...] SNR [...] F0 [...] D, Vol.J93-D, No.2, pp.109-117 (2010).
38) [...] FM [...] AM [...] H-2011-121, Vol.41, No.9, pp.679-684 (2011).
39) Nuttall, A.H.: Some windows with very good sidelobe behavior, IEEE Trans. Acoustics, Speech and Signal Processing, Vol.29, No.1, pp.84-91 (1981).
40) Kawahara, H., Irino, T. and Morise, M.: An interference-free representation of instantaneous frequency of periodic signals and its application to F0 extraction, ICASSP 2011, pp.5420-5423 (2011).
41) Fujimura, O., Honda, K., Kawahara, H., Konparu, Y., Morise, M. and Williams, J.: Noh Voice Quality, J. Logopedics Phoniatrics Vocology, Vol.34, No.4, pp.157-170 (2009).
42) Yegnanarayana, B., d'Alessandro, C. and Darsinos, V.: An iterative algorithm for decomposition of speech signals into periodic and aperiodic components, IEEE Transactions on Speech and Audio Processing, Vol.6, No.1, pp.1-11 (online), DOI:10.1109/89.650304 (1998).
43) Hermus, K., Van hamme, H. and Irhimeh, S.: Estimation of the Voicing Cut-Off Frequency Contour Based on a Cumulative Harmonicity Score, IEEE Signal Processing Letters, Vol.14, No.11, pp.820-823 (online), DOI:10.1109/LSP.2007.898854 (2007).
44) Kawahara, H., Morise, M., Takahashi, T., Banno, H., Nisimura, R. and Irino, T.: Simplification and extension of non-periodic excitation source representations for high-quality speech manipulation systems, Proc. Interspeech 2010, ISCA, pp.38-41 (2010).
45) Pantazis, Y. and Stylianou, Y.: Improving the modeling of the noise part in the harmonic plus noise model of speech, ICASSP 2008, pp.4609-4612 (2008).
46) [...] Vol.52, No.5, pp.1910-1922 (2011).