Vocal Dynamics Controller: A note-by-note editing and synthesizing interface for F0 dynamics in singing voices

Yasunori Ohishi, Hirokazu Kameoka, Daichi Mochihashi, Hidehisa Nagano and Kunio Kashino
(NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation)

We present a novel statistical model for the dynamics of various singing behaviors, such as vibrato and overshoot, in a fundamental frequency (F0) sequence, and develop a note-by-note editing and synthesizing interface for F0 dynamics, the Vocal Dynamics Controller. We develop a complete stochastic representation of the F0 dynamics based on a second-order linear system and propose an efficient scheme for parameter estimation using the Expectation-Maximization (EM) algorithm. Finally, we synthesize the singing voice using the F0 sequence generated by manipulating the model parameters individually, which control the oscillation of the second-order system and the pitch of each note.

1. Introduction

The F0 dynamics of singing voices, such as vibrato and overshoot, convey a singer's individuality and skill, and have been studied for singing-voice synthesis, skill evaluation and singing-style identification 1)-13). Following the view that an F0 contour is generated by a control mechanism 14), we model the F0 contour of each note as the response of a second-order linear system with the transfer function

    H(s) = Ω^2 / (s^2 + 2ζΩs + Ω^2),   (1)

where ζ is the damping coefficient and Ω is the natural frequency of the system. The system is overdamped when ζ > 1, underdamped when 0 < ζ < 1, critically damped when ζ = 1, and oscillates without damping when ζ = 0. Overshoot and vibrato appear in an observed F0 contour as underdamped and undamped responses of this system, so the pair (ζ, Ω) characterizes the F0 dynamics of each note.

© 2010 Information Processing Society of Japan
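To make the behavior of (1) concrete, the following Python sketch (not part of the original paper; the value of Ω and the time axis are arbitrary assumptions) computes the step response of H(s) in each of the four damping regimes. With ζ = 0 the output oscillates around the target at frequency Ω, a vibrato-like behavior, while 0 < ζ < 1 produces the overshoot mentioned above.

    # Step responses of H(s) = Ω^2 / (s^2 + 2ζΩs + Ω^2) for the four
    # damping regimes; a minimal illustrative sketch, not the authors' code.
    import numpy as np
    from scipy import signal

    Omega = 2 * np.pi * 6.0            # assumed natural frequency [rad/s]
    t = np.linspace(0.0, 2.0, 1000)    # time axis [s]
    for zeta in (0.0, 0.5, 1.0, 2.0):  # undamped, under-, critically, overdamped
        sys = signal.lti([Omega**2], [1.0, 2.0 * zeta * Omega, Omega**2])
        _, y = signal.step(sys, T=t)
        print(f"zeta = {zeta}: peak response {y.max():.3f} (target 1.0)")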
Fig. 1  Generative process of an observed F0 sequence (vertical axis: F0 [cent], horizontal axis: time [sec]). A step signal f(t), the melody component representing the pitch difference between notes, drives a second-order linear system with impulse response h(t), the singing dynamic-fluctuation component; the system output y(t) plus Gaussian white noise ɛ(t), the residual signal, yields the observed signal o(t).

2. Modeling the F0 dynamics of singing voices

As shown in Fig. 1, we model an observed F0 sequence o(t) as the sum of a residual signal ɛ(t) and the output y(t) of the second-order system (1) driven by a step signal f(t) that represents the target pitch of each note. The system first segments the observed sequence into notes with an HMM (Section 5) and then fits this model to each note. The impulse response of (1) is

    h(t) = Ω e^{−ζΩt} (e^{√(ζ^2−1) Ωt} − e^{−√(ζ^2−1) Ωt}) / (2√(ζ^2−1))   (ζ > 1)
           Ω e^{−ζΩt} sin(√(1−ζ^2) Ωt) / √(1−ζ^2)                          (0 < ζ < 1)
           Ω^2 t e^{−Ωt}                                                    (ζ = 1)
           Ω sin(Ωt)                                                        (ζ = 0).

Following 15), we write the convolution y(t) = h(t) * f(t) in discrete time as

    y = Φf,

where y = [y_1, y_2, ..., y_N]^T and f = [f_1, f_2, ..., f_N]^T are the sampled versions of y(t) and f(t), N is the number of samples, and Φ is an N×N lower-triangular Toeplitz matrix whose entries are samples of h(t). For example, for ζ = 1,

          | Ω^2 e^{−Ω}           0          ⋯          0          |
          | 2Ω^2 e^{−2Ω}    Ω^2 e^{−Ω}      ⋱          ⋮          |
    Φ  =  |     ⋮                ⋱          ⋱          0          |.   (2)
          | NΩ^2 e^{−NΩ}         ⋯     2Ω^2 e^{−2Ω}  Ω^2 e^{−Ω}   |

Since ζ and Ω differ from note to note and are not known in advance, the matrix Φ in (2) cannot be fixed beforehand.
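A minimal sketch (assumed frame-based time units and Ω; not the authors' code) of how Φ in (2) can be built and applied: the first column holds samples of h(t) for the critically damped case, and y = Φf turns a step input into a smooth pitch transition.

    # Build the lower-triangular Toeplitz matrix Φ of Eq. (2) for ζ = 1
    # and apply it to a step input, y = Φ f.
    import numpy as np
    from scipy.linalg import toeplitz

    def impulse_matrix(Omega, N):
        # First column: h sampled at n = 1..N (time in frames) for ζ = 1,
        # i.e., h[n] = Ω^2 n e^{−nΩ}, as in Eq. (2).
        n = np.arange(1, N + 1, dtype=float)
        h = Omega**2 * n * np.exp(-n * Omega)
        return toeplitz(h, np.concatenate(([h[0]], np.zeros(N - 1))))

    N = 400
    Phi = impulse_matrix(Omega=0.1, N=N)   # assumed Ω per 5-ms frame
    f = np.full(N, 700.0)                  # step input: target pitch [cent]
    y = Phi @ f                            # melody component converging to ~700 cents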
3. A stochastic model of the F0 sequence

3.1 Dictionary representation of the inverse system

Because ζ and Ω are unknown and vary from note to note, we prepare a dictionary of matrices {Φ^{(1)}, Φ^{(2)}, ..., Φ^{(I)}}, each built from (2) with a different pair (ζ, Ω), define Υ^{(i)} := (Φ^{(i)})^{−1}, and approximate the inverse system by a weighted sum

    Φ^{−1} ≈ w_1 Υ^{(1)} + w_2 Υ^{(2)} + ... + w_I Υ^{(I)}.   (3)

Applying (3) to y = Φf gives

    (w_1 Υ^{(1)} + w_2 Υ^{(2)} + ... + w_I Υ^{(I)}) y = f,   (4)

and we write Ψ := w_1 Υ^{(1)} + w_2 Υ^{(2)} + ... + w_I Υ^{(I)}, so that y = Ψ^{−1} f.

We model the melody component f as a Gaussian step signal, f ~ N(u, αI_N), where u := [u, u, ..., u]^T is a constant vector representing the target pitch u of the note and I_N is the N×N identity matrix. Then

    E[y] = Ψ^{−1} E[f] = Ψ^{−1} u,   (5)
    cov[y] = Ψ^{−1} E[f f^T] (Ψ^{−1})^T − Ψ^{−1} E[f] E[f]^T (Ψ^{−1})^T = α Ψ^{−1} (Ψ^{−1})^T,   (6)

so that

    y ~ N(Ψ^{−1} u, α Ψ^{−1} (Ψ^{−1})^T).   (7)

The residual signal ɛ = [ɛ_1, ɛ_2, ..., ɛ_N]^T is Gaussian white noise, ɛ ~ N(0, βI_N), and the observed F0 sequence o = [o_1, o_2, ..., o_N]^T is

    o = y + ɛ.   (8)

Writing Θ := {w, u, β} with w := {w_1, w_2, ..., w_I}, the likelihood of o is

    P(o|Θ) = (2π)^{−N/2} |Σ|^{−1/2} exp( −(1/2)(o − µ)^T Σ^{−1} (o − µ) ),   (9)

where µ = Ψ^{−1}u and Σ = α Ψ^{−1}(Ψ^{−1})^T + β I_N.

3.2 Prior distributions

We place a prior P(Θ) = P(w) P(u) P(β) on the parameters. For the weights we use a generalized Gaussian distribution that promotes sparsity:

    P(w) = ∏_{i=1}^{I} ( p λ^{1/p} / (2Γ(1/p)) ) exp(−λ|w_i|^p),   (10)

where p and λ are hyperparameters with 0 < p < 2. With a small p, only a few of the weights w_i remain active, so that each note is explained by a small number of dictionary entries, i.e., by an interpretable pair (ζ, Ω).

4. Parameter estimation based on the EM algorithm

Given an observed F0 sequence o, we estimate Θ by maximizing the posterior P(Θ|o) ∝ P(o|Θ)P(Θ), i.e., by MAP estimation. Since direct maximization is intractable, we derive an EM algorithm 16): the E-step computes the conditional expectations of the unobserved signals y and ɛ given o, and the M-step maximizes the resulting Q function.
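The generative model (3)-(9) can be summarized in a short sketch (assumed variable names and a pre-built dictionary; not the authors' code): draw the melody component f, pass it through the inverse of Ψ, and add the residual noise.

    # Sample an observed F0 sequence o from the model of Eqs. (3)-(8).
    import numpy as np

    def sample_o(Upsilons, w, u, alpha, beta, seed=0):
        rng = np.random.default_rng(seed)
        N = Upsilons[0].shape[0]
        Psi = sum(wi * Ui for wi, Ui in zip(w, Upsilons))  # Ψ of Eq. (4)
        f = u + np.sqrt(alpha) * rng.standard_normal(N)    # f ~ N(u, αI); u: target pitch [cent]
        y = np.linalg.solve(Psi, f)                        # y = Ψ^{-1} f, Eq. (7)
        eps = np.sqrt(beta) * rng.standard_normal(N)       # residual ɛ ~ N(0, βI)
        return y + eps                                     # observed sequence o, Eq. (8)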
4.1 MAP estimation based on the EM algorithm

We regard the stacked vector x of y and ɛ as the complete data. The observation is the linear projection

    o = Hx,   H := [I_N  I_N],   x := [y; ɛ],   (11)

and, from (7) and the noise model, x ~ N(m, Λ) with

    m := [Ψ^{−1}u; 0],   Λ := [α(Ψ^T Ψ)^{−1}  0; 0  βI_N].

The Q function to be maximized in the M-step is

    Q(Θ, Θ′) = c − (1/2){ log|Λ| + tr(Λ^{−1} E[xx^T|o; Θ′]) − 2 m^T Λ^{−1} E[x|o; Θ′] + m^T Λ^{−1} m } + log P(Θ),   (12)

where c is a constant and Θ′ is the current parameter estimate. The conditional expectations follow from Gaussian conditioning:

    E[x|o; Θ′] = m + ΛH^T (HΛH^T)^{−1} (o − Hm),   (13)
    E[xx^T|o; Θ′] = Λ − ΛH^T (HΛH^T)^{−1} HΛ + E[x|o; Θ′] E[x|o; Θ′]^T.   (14)

We write the blocks of these moments as

    E[x|o; Θ′] = [x̄_y; x̄_ɛ],   E[xx^T|o; Θ′] = [R_y  *; *  R_ɛ],   (15)

where x̄_y, x̄_ɛ ∈ R^N and R_y, R_ɛ ∈ R^{N×N}.

4.2 M-step

Substituting (15) into (12), the terms of Q that depend on Θ are

    f(w, u, β) := −(N/2) log αβ + Σ_{n=1}^{N} log( Σ_{i=1}^{I} w_i Υ_{n,n}^{(i)} )
                  − (1/(2β)) tr(R_ɛ) − (1/(2α)) u^T u + (1/α) u^T Ψ x̄_y
                  − (1/(2α)) tr(Ψ^T Ψ R_y) − λ Σ_{i=1}^{I} |w_i|^p,   (16)

where Υ_{n,n}^{(i)} denotes the n-th diagonal element of Υ^{(i)}; the log-determinant of Ψ reduces to the sum of logarithms because Ψ is lower triangular. Since the logarithm and the term −λΣ|w_i|^p couple the weights, we maximize f through an auxiliary function, as in 17). By Jensen's inequality, for any γ_{i,n} ≥ 0 with Σ_i γ_{i,n} = 1,

    Σ_{n=1}^{N} log( Σ_{i=1}^{I} w_i Υ_{n,n}^{(i)} ) ≥ Σ_{n=1}^{N} Σ_{i=1}^{I} γ_{i,n} log( w_i Υ_{n,n}^{(i)} / γ_{i,n} ),   (17)

and, since |w|^p is concave for 0 < p ≤ 1, for any w̄_i

    |w_i|^p ≤ p|w̄_i|^{p−1} |w_i| + |w̄_i|^p − p|w̄_i|^p,   (0 < p ≤ 1),   (18)

with equality when w̄_i = w_i. Replacing the two corresponding terms in (16) by these bounds yields the auxiliary function

    f⁺(w, u, β, w̄, γ) := −(N/2) log αβ + Σ_{n=1}^{N} Σ_{i=1}^{I} γ_{i,n} log( w_i Υ_{n,n}^{(i)} / γ_{i,n} )
                  − (1/(2β)) tr(R_ɛ) − (1/(2α)) u^T u + (1/α) u^T Ψ x̄_y
                  − (1/(2α)) tr(Ψ^T Ψ R_y) − λ Σ_{i=1}^{I} ( p|w̄_i|^{p−1}|w_i| + |w̄_i|^p − p|w̄_i|^p ),   (19)

where w̄ := {w̄_1, w̄_2, ..., w̄_I} and γ := {γ_{1,1}, ..., γ_{I,N}}. By (17) and (18), f(w, u, β) ≥ f⁺(w, u, β, w̄, γ), with equality when w̄ = w and

    γ_{i,n} = w̄_i Υ_{n,n}^{(i)} / Σ_{i′=1}^{I} w̄_{i′} Υ_{n,n}^{(i′)},   (i = 1, 2, ..., I,  n = 1, 2, ..., N).   (20)

In the E-step we therefore set w̄_i = w_i and update γ_{i,n} by (20); in the M-step we maximize f⁺ with respect to Θ. Setting ∂f⁺/∂w_{i′} = 0 gives

    −(1/α) Σ_{i=1}^{I} tr( R_y^T Υ^{(i)T} Υ^{(i′)} ) w_i + (1/α) u^T Υ^{(i′)} x̄_y − λp|w̄_{i′}|^{p−1} + (1/w_{i′}) Σ_{n=1}^{N} γ_{i′,n} = 0,   (i′ = 1, 2, ..., I),   (21)

which couples the weights w_1, w_2, ..., w_I.
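The E-step (13)-(15) is plain Gaussian conditioning. A minimal sketch under assumed names (u_vec is the constant vector u; the dictionary has already been combined into Ψ; not the authors' code):

    # E-step of Eqs. (11)-(15): posterior mean and second moment of
    # x = [y; ɛ] given the observation o = Hx.
    import numpy as np

    def e_step(Psi, u_vec, alpha, beta, o):
        N = len(o)
        H = np.hstack([np.eye(N), np.eye(N)])                 # Eq. (11)
        Psi_inv = np.linalg.inv(Psi)
        m = np.concatenate([Psi_inv @ u_vec, np.zeros(N)])    # prior mean of x
        Lam = np.zeros((2 * N, 2 * N))
        Lam[:N, :N] = alpha * Psi_inv @ Psi_inv.T             # cov of y, Eq. (6)
        Lam[N:, N:] = beta * np.eye(N)                        # cov of ɛ
        K = Lam @ H.T @ np.linalg.inv(H @ Lam @ H.T)          # conditioning gain
        x_mean = m + K @ (o - H @ m)                          # Eq. (13)
        x_mom2 = Lam - K @ H @ Lam + np.outer(x_mean, x_mean) # Eq. (14)
        # Blocks of Eq. (15): x̄_y, x̄_ɛ, R_y, R_ɛ
        return x_mean[:N], x_mean[N:], x_mom2[:N, :N], x_mom2[N:, N:]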
Fig. 2  The EM algorithm.
    Initialize: Θ = {w, u, β}.
    E-step: compute E[x|o; Θ′] and E[xx^T|o; Θ′] by (13) and (14); update w̄ and γ by (20).
    M-step: update Θ = {w, u, β} by (22) and (23).
    Termination: if (19) has converged, stop; otherwise set Θ′ = Θ and return to the E-step.

Since (21) has no closed-form joint solution, we solve it by coordinate descent 18) over w_1, w_2, ..., w_I. Fixing w_i for all i ≠ i′, (21) becomes a quadratic equation in w_{i′}, whose positive root is

    w_{i′} = ( −Y + √(Y^2 + 4XZ) ) / (2X),

    X = tr( R_y^T Υ^{(i′)T} Υ^{(i′)} ),
    Y = Σ_{i≠i′} tr( R_y^T Υ^{(i)T} Υ^{(i′)} ) w_i − u^T Υ^{(i′)} x̄_y + αλp|w̄_{i′}|^{p−1},
    Z = α Σ_{n=1}^{N} γ_{i′,n},   (i′ = 1, 2, ..., I).   (22)

Maximizing f⁺(w, u, β, w̄, γ) with respect to u and β gives

    u = (1/N) 1^T Ψ x̄_y,   β = (1/N) tr(R_ɛ).   (23)

The updates (22) and (23) constitute the M-step; the whole procedure is summarized in Fig. 2.

5. Vocal Dynamics Controller

The Vocal Dynamics Controller decomposes an observed F0 sequence o into the components of (8) using the estimated parameters w, u, β, and lets the user edit and resynthesize it note by note. Fig. 3 shows the GUI, which consists of panels A-H.

Fig. 3  The GUI of the Vocal Dynamics Controller.

Panel A: The observed F0 sequence is displayed. F0 is estimated from the input singing voice by YIN 19) with a 5-ms frame shift; each value o_Hz in Hz is converted into cents by

    o_cent = 1200 log_2 ( o_Hz / (440 × 2^{3/12−5}) ).   (24)

Panel B: The F0 sequence is segmented into notes by a Viterbi search on an HMM over quantized pitch levels with a self-transition probability of 0.999999, and the model parameters are estimated for each segment (Fig. 4).
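The weight update (22) amounts to taking the positive root of a quadratic. A minimal sketch with assumed names (gamma[i2] holds γ_{i′,1..N}, Upsilons is the dictionary; not the authors' code):

    # Coordinate-descent update of Eq. (22) for one weight w_{i'},
    # holding the other weights fixed.
    import numpy as np

    def update_weight(i2, w, Upsilons, R_y, x_y, u_vec, gamma, alpha, lam, p, w_bar):
        U2 = Upsilons[i2]
        X = np.trace(R_y.T @ U2.T @ U2)
        Y = sum(np.trace(R_y.T @ Upsilons[i].T @ U2) * w[i]
                for i in range(len(w)) if i != i2)
        Y += -u_vec @ (U2 @ x_y) + alpha * lam * p * abs(w_bar[i2]) ** (p - 1)
        Z = alpha * gamma[i2].sum()
        return (-Y + np.sqrt(Y**2 + 4 * X * Z)) / (2 * X)  # positive root of Eq. (22)

Because X > 0 and Z > 0, the discriminant exceeds Y^2, so the returned root is always positive and the sparsity penalty never drives a weight negative.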
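The Hz-to-cent mapping (24) is a one-liner; the reference frequency 440 × 2^{3/12−5} ≈ 16.35 Hz corresponds to C0 (three semitones above A4, shifted down five octaves):

    # Eq. (24): convert F0 in Hz to cents relative to C0.
    import numpy as np

    def hz_to_cent(o_hz):
        return 1200.0 * np.log2(o_hz / (440.0 * 2.0 ** (3.0 / 12.0 - 5.0)))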
Fig. 4  Panel B: segmentation of the F0 sequence (Viterbi search with the HMM) and model-parameter estimation for each segment (the pink line shows the estimated contour; the F0 value at the head of each segment is also shown).

The estimation behind panels A and B proceeds as follows:
(1) The observed F0 sequence is segmented into notes by the Viterbi search with the HMM.
(2) The dictionary {Υ^{(1)}, Υ^{(2)}, ..., Υ^{(I)}} is generated from (2) by varying ζ (up to 2.2) and Ω (from 0.5 to 3.0 in steps of 0.5), giving I = 30 matrices.
(3) The weights are initialized uniformly as w = {w_1, w_2, ..., w_I} with w_i = 1/I, and u and β are initialized from the observed F0 sequence o of the segment.
(4) The EM algorithm of Fig. 2 is run for each segment.
(5) The hyperparameters α, λ and p appearing in the prior (10) and the objective (19) are fixed in advance (p = 0.8).

Panel C: For the note selected in panel B, the damping coefficient ζ and the natural frequency Ω can be edited; the melody component of the note is then regenerated as Φu, i.e., the step signal u passed through the second-order system h(t) with the edited (ζ, Ω), and the resulting F0 contour ȳ = Ψ^{−1}u is displayed (Fig. 5). An "all" button applies the operation to all notes at once.

Panel D: The residual component x̄_ɛ extracted by the model is displayed and can be edited in the same way.

Fig. 5  Panels C and D: note-by-note editing of the model parameters.
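What panels C and E do can be sketched conceptually as follows (assumed helper names and frame-based time units; not the authors' code): rebuild Φ from the edited (ζ, Ω) using the four-case impulse response of Section 2, then regenerate the note contour as Φu.

    # Regenerate the melody component Φu of a note after editing ζ and Ω.
    import numpy as np
    from scipy.linalg import toeplitz

    def h_samples(zeta, Omega, N):
        # Impulse response of Eq. (1), sampled at t = 1..N frames,
        # for the four damping regimes of Section 2.
        t = np.arange(1, N + 1, dtype=float)
        if zeta > 1.0:
            r = np.sqrt(zeta**2 - 1.0)
            return Omega * np.exp(-zeta * Omega * t) \
                   * (np.exp(r * Omega * t) - np.exp(-r * Omega * t)) / (2.0 * r)
        if zeta == 1.0:                       # exact comparison: fine for a sketch
            return Omega**2 * t * np.exp(-Omega * t)
        if zeta > 0.0:
            r = np.sqrt(1.0 - zeta**2)
            return Omega * np.exp(-zeta * Omega * t) * np.sin(r * Omega * t) / r
        return Omega * np.sin(Omega * t)      # ζ = 0: undamped, vibrato-like

    def melody_component(zeta, Omega, u_step):
        N = len(u_step)
        h = h_samples(zeta, Omega, N)
        Phi = toeplitz(h, np.concatenate(([h[0]], np.zeros(N - 1))))
        return Phi @ u_step                   # regenerated F0 contour of the note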
When ζ = 0, the regenerated contour of a note oscillates at the frequency Ω, so vibrato can be edited directly: its rate through the Frequency [Hz] control and its extent through the Depth [cent] control.

Panel E: The target pitch u of each note can be edited; the F0 contour of the note is regenerated as Φu with the current ζ and Ω.

Panel F: The edits made in panels C, D and E are reflected back into the full F0 sequence of panel B.

Panel G: The singing voice is resynthesized from the edited F0 sequence with Griffin-Lim phase reconstruction 20):
(1) The model parameters w and u are estimated from the observed F0 sequence o by maximizing (19).
(2) The STFT Y = (Y_{f,t}) of the input singing voice, with F frequency bins and T frames, is computed with a Hanning window and a 5-ms frame shift.
(3) The spectral envelope of each frame is estimated by LPC analysis 21).
(4) An F0 contour is generated from the manipulated parameters w and u.
(5) A real-valued target amplitude spectrogram {X_{ω,t}} is constructed from the generated F0 contour and the spectral envelope of (3), and a complex spectrogram V_{ω,t} ∈ C is initialized from it.
(6) The inverse STFT of {V_{f,t}}, f ∈ {1,...,F}, t ∈ {1,...,T}, gives a time signal (v[m]), m = 1, ..., M.
(7) The STFT of (v[m]) is computed, which gives a consistent spectrogram {V_{f,t}}.
(8) For all f, t, the magnitude of V_{f,t} is replaced while its phase is kept: V_{f,t} ← X_{f,t} V_{f,t} / |V_{f,t}|.
Steps (6)-(8) are iterated until convergence; this is the Griffin-Lim algorithm 20), whose interpretation as an explicit consistency constraint on STFT spectrograms is given by Le Roux et al. 22).

Panel H: The synthesized result is displayed so that it can be compared with the original F0 sequence in panel A.

6. Examples

Fig. 6 shows observed F0 sequences (vertical axis: F0 [cent], horizontal axis: time [sec]) of a professional vocalist (female) and an amateur singer (male), each singing the theme of the fourth movement of Beethoven's Symphony No. 9.

Fig. 6  Observed F0 sequences of a professional vocalist (female) and an amateur singer (male).
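Steps (6)-(8) can be sketched compactly (using scipy's STFT in place of the paper's own implementation; the sampling rate and window length are assumptions chosen so that the hop equals the 5-ms shift at 16 kHz, and X is assumed to have a shape producible by scipy.signal.stft with the same parameters):

    # Griffin-Lim iteration of steps (6)-(8): alternate inverse STFT,
    # STFT, and magnitude replacement. X: target magnitude spectrogram.
    import numpy as np
    from scipy.signal import stft, istft

    def griffin_lim(X, fs=16000, nperseg=320, noverlap=240, n_iter=50):
        rng = np.random.default_rng(0)
        V = X * np.exp(2j * np.pi * rng.random(X.shape))  # random initial phase
        for _ in range(n_iter):
            _, v = istft(V, fs=fs, nperseg=nperseg, noverlap=noverlap)    # step (6)
            _, _, V = stft(v, fs=fs, nperseg=nperseg, noverlap=noverlap)  # step (7)
            V = X * V / np.maximum(np.abs(V), 1e-12)      # step (8): keep phase only
        _, v = istft(V, fs=fs, nperseg=nperseg, noverlap=noverlap)
        return v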
Estimating the weights w in (3) is closely related to multiple kernel learning 24),25), and the stochastic model itself can be viewed from the perspective of Gaussian processes 23).

7. Conclusion

We presented a stochastic model of the F0 dynamics of singing voices based on a second-order linear system, an EM algorithm for estimating its parameters w, u and β, and the Vocal Dynamics Controller, a note-by-note editing and synthesizing interface. Future work includes extending the framework to spectral features such as MFCCs.

Acknowledgments  The authors thank Jonathan Le Roux of NTT Communication Science Laboratories.

References
1) Saitou, T. et al.: Speech-To-Singing Synthesis: Converting Speaking Voices to Singing Voices by Controlling Acoustic Features Unique to Singing Voices, Proc. WASPAA 2007, pp.215-218 (2007).
2) Saitou, T. et al.: Acoustic and Perceptual Effects of Vocal Training in Amateur Male Singing, Proc. INTERSPEECH 2009, pp.832-835 (2009).
3) Nakano, T. et al.: An Automatic Singing Skill Evaluation Method for Unknown Melodies Using Pitch Interval Accuracy and Vibrato Features, Proc. ICSLP 2006, pp.1706-1709 (2006).
4) Kako, T. et al.: Automatic Identification for Singing Style Based on Sung Melodic Contour Characterized in Phase Plane, Proc. ISMIR 2009, pp.393-397 (2009).
5) Proutskova, P. and Casey, M.: You Call That Singing? Ensemble Classification for Multi-Cultural Collections of Music Recordings, Proc. ISMIR 2009, pp.759-764 (2009).
6) Sundberg, J.: The KTH synthesis of singing, Advances in Cognitive Psychology, Special issue on Music Performance, Vol.2, No.2-3, pp.131-143 (2006).
7) Bonada, J. and Loscos, A.: Sample-based singing voice synthesizer by spectral concatenation, Proc. SMAC 2003 (2003).
8) Nakano, T. et al.: VocaListener: A Singing-to-Singing Synthesis System Based on Iterative Parameter Estimation, Proc. SMC 2009, pp.343-348 (2009).
9) Fukayama, S. et al.: Orpheus: Automatic Composition System Considering Prosody of Japanese Lyrics, Proc. ICEC 2009, pp.309-310 (2009).
10) Ohishi, Y. et al.: A Stochastic Representation of the Dynamics of Sung Melody, Proc. ISMIR 2007 (2007).
11) Ohishi, Y. et al.: Parameter Estimation Method of F0 Control Model for Singing Voices, Proc. ICSLP 2008, pp.139-142 (2008).
12) (In Japanese.) 2-9-, pp.625-626 (1998).
13) Minematsu, N. et al.: Prosodic Modeling of Nagauta Singing and Its Evaluation, Proc. Speech Prosody 2004, pp.487-490 (2004).
14) Fujisaki, H.: A note on the physiological and physical basis for the phrase and accent components in the voice fundamental frequency contour, Vocal Physiology: Voice Production, Mechanisms and Functions (O. Fujimura, ed.), Raven Press, pp.347-355 (1988).
15) (In Japanese.) Vol.2008, No.76, pp.89-96 (2008).
16) Feder, M. and Weinstein, E.: Parameter estimation of superimposed signals using the EM algorithm, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol.36, No.4, pp.477-489 (1988).
17) Kameoka, H. et al.: Complex NMF: A New Sparse Representation for Acoustic Signals, Proc. ICASSP 2009, pp.3437-3440 (2009).
18) Meng, X.L. and Rubin, D.B.: Maximum Likelihood Estimation via the ECM Algorithm: A General Framework, Biometrika, Vol.80, pp.267-278 (1993).
19) de Cheveigné, A. and Kawahara, H.: YIN, a fundamental frequency estimator for speech and music, JASA, Vol.111, No.4, pp.1917-1930 (2002).
20) Griffin, D.W. and Lim, J.S.: Signal estimation from modified short-time Fourier transform, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol.32, No.2, pp.236-243 (1984).
21) Itakura, F. and Saito, S.: Digital filtering techniques for speech analysis and synthesis, Proc. 7th ICA, Vol.25-C-1, pp.261-264 (1971).
22) Le Roux, J. et al.: Explicit consistency constraints for STFT spectrograms and their application to phase reconstruction, Proc. SAPA 2008 (2008).
23) Rasmussen, C.E. and Williams, C.K.I.: Gaussian Processes for Machine Learning, MIT Press, Cambridge, Mass., USA (2006).
24) Bach, F. et al.: Multiple kernel learning, conic duality, and the SMO algorithm, Proc. ICML 2004, pp.6-13 (2004).
25) (In Japanese.) 2-Q-24, pp.499-502 (2010).