v.connect 1 1 1 VOCALOID UTAU WORLD Vorbis v.connect 2 1.7 2.2 v.connect : A Singing Synthesis System Enabling Users to Control Vocal Tones Makoto Ogawa, 1 Syunji Yazaki 1 and Kôki Abe 1 Since the release of Hatsune Miku, interets in singing synthesis increase. For example, a singing synthesis system, UTAU, has been developed as a freeware. Most of these systems, however, lack of the function that users can mix vocal tones at any times. Controling tonal changes in singing requires a large amount of time and data for synthesis. We have developed a singing synthesis system, v.connect, which connects corresponding phonemes with a time-stretching function to enable users to control tonal changes in singing by specifying the rate of voice morphing. The system processes voice signals with WORLD, a voice synthesis and analysis system, and uses corpora of various tonal voices consisting of Mel cepstra and excitation signals compressed by Vorbis. We constructed a corpus, Namine Ritsu Connect, using the proposed method. It was found that the size of the corpus was two times larger than that of raw waves, and that synthesis from the corpus was 1.7 to 2.2 times faster than that from raw waves. Degradation caused by compression was not sensed subjectively. 1. VOCALOID 1) VOCALOID Append UTAU 2) 3) 4) VocalListener2 VocalListener2 VocalListener2 v.connect v.connect 2 WORLD 5) Project 1 1 The University of Electro-Communications 1 http://ritsu73.net/ 1 c 2012 Information Processing Society of Japan
2 v.connect 3 4 5 6 2. v.connect Cadencii 6) Cadencii Cadencii 1 v.connect Cadencii Cadencii v.connect 2.1 Cadencii Cadencii VOCALOID2 VOCALOID, VOCALOID2, Sinsy 2 MusicXML, UTAU, AquesTone 3 v.connect GUI Cadencii 1 1 Cadencii Fig. 1 Screen shot of Cadencii. 1 Cadencii http://sourceforge.jp/projects/cadencii/ 2 Sinsy - HMM-based Singing Voice Synthesis System http://www.sinsy.jp/ 3 AquesTone - VSTi http://www.a-quest.com/products/aquestone.html 2 v.connect Fig. 2 Block diagram of v.connect. 2.2 v.connect v.connect 2 v.connect Input VOCALOID2 v.connect Cadencii v.connect v.connect WORLD Output v.connect OggVorbis 7) 2 c 2012 Information Processing Society of Japan
3. v.connect VCV 3.1 WORLD WORLD WORLD Vocoder WORLD DIO 8) STAR 9) PLATINUM 10) PLATINUM x(t) h(t) R(ω) R(ω) = X(ω) (X(ω) = FFT[x(t)w(t)], H(ω) = FFT[h(t)]) (1) H(ω) w(t) WORLD Vocoder 3.2 WORLD WORLD Vorbis 3 3 WORLD 3 Fig. 3 Block diagram of v.connect s analysis. F0 Vorbis F0 WORLD 3.3 A, B t E(t) x(t) m E(t) = (x(t + i )) 2 (2) f s i= m f s m 3 c 2012 Information Processing Society of Japan
A, B E A(t), E B(t) t=la t=0 E A (t) E B (T (t)) d t 2 + T 2 (t) s.t. dt (t) dt > 0 (3) T (t) l A, l B A, B T (l A ) = l B T (t) DP DP DP h DP [i, j] [i + p, j + q], (1 p l A h i, 1 q l B j) d h d ij < p, q >= h p 2 + q 2 t2 E A (t) E B (T ij < p, q > (t)) dt t 2 t 1 t 1 (4) t 1 = hi, t 2 = h(i + p) DP T ij < p, q > (t) d { q (t t1) + hj (t1 < t t2) T ij < p, q > (t) = p (5) hp i = l A, hq i = l B C = (p i, q i ) 0 i n p 0 = q 0 = 0 C E A, E B D D = d (i) i=1 d (i) = d ui v i < p i, q i > i 1 i 1 u i = p j, v i = q j j=1 j=1 D C min T (i) min = T u i v i < p i, q i > T = T (i) min DP i=1 (6) 1 < N < 1 h max(l A, l B ) N 1 p, q N 4. F0 F0 4.1 Cadencii BRI p(t) 4.2 F0 F0 F0 Ω F0 2 H(s) = s 2 + 2ζΩs + Ω 2 H(s) F0 ζ, Ω F0 f 0 (t) n t (n), l (n), f (n), l (n) por, d (n) por, l (n) nnote f0(t) note log f 0(t) = log f (t) + F flu (t) + F por(t) (i) note log f (t) = log f note + F (i) (t) { f (n) (t (n) < t t (n) + l (n) ) log f note (t) = F por (n) (t), F (n) (t) n F flu (t) (7) 4 c 2012 Information Processing Society of Japan
F por (n) (t) = (log f (n+1) log f (t))( 1 (1 cos θ(n) por(t))) 2 d (n) por(log f (n+1) log f (n) )(sin θ pre(t)) (n) F (n) (t) = d(t) sin θ(n) F flu (t) = θ (n) por(t) = (t) 1 100 (sin 12.7πt + sin 7.1πt + 1 sin 4.7πt) 3 θ por(t), (n) θ (n) (t) t t p t (n) + l (n) t p π (t p < t < t (n) + l (n) ) θ pre(t) (n) = θ por(t) (n) + θ por(2(t (n) (n) + l (n) ) t) t θ (n) (t) = s(τ)dτ (t v < t < t (n) + l (n) ) t v t p = t (n) + l (n) l (n) por, t v = t (n) + l (n) l (n) (8) (9) d(t), s(t) Wen-Hsing Lai Mandarin Singing Synthesis 11) 4.3 BRI BRI 2 BRI 4 t = t t (n) 2 A, B t A, t B A B BRI b A, b B (b A > b B ) p A, p B b(t) BRI p bri (t) p bri (t) b B b A b B (b B p bri (t) b A ) 0 (p bri (t) < b B ) b(t) = 1 (p bri (t) < b A ) (10) Fig. 4 4 Block diagram of v.connect s synthesis. A B T AB(t) T 1 AB (t) A, B { t A = b(t)(t + p A ) + (1 b(t))t 1 AB (t + p B ) (11) t B = (1 b(t))(t + p B ) + b(t)t AB (t + p A ) p A, p B A, B t A, t B S A (ω), S B (ω) r A(τ), r B(τ) S(ω) r(τ) log S(ω) = b(t) log S A + (1 b(t)) log S B (12) r(τ) = b(t)r A(τ) + (1 b(t))r B(τ) (13) S(ω, t), r(τ, t) 5 c 2012 Information Processing Society of Japan
Table 1 1 Recording environment and contents of Namine Ritsu Connect. 2 2ms byte Table 2 Comparision of data amount per 2ms (in bytes). VCV 955 Audio-Technica AT-4040 Audio I/F Roland UA-25EX SG 2000 RT 32 OggVorbis 64kbits / 44100 samples B4(493.9Hz) F4(349.2Hz) WORLD v.connect 176.4 - - - 4096 128-8192 200 176.4 12288 328 3 Table 3 Comparision of synthesis time with WORLD and v.connect. CPU WORLD(sec) (sec) 4.4 l pre = min (p A, p B ) n t (n) begin = t(n) l pre t (n 1) + l (n 1) F0 WORLD 5. 1 5.1 1 OggVorbis 44.1kHz 64kbps 44100 64kbits 6 7 150 3 2mora/sec 60 48kHz 24bit 44.1kHz 16bit Celeron 1.73Ghz 89.1 40.4 1 Core2Quad 2.80Ghz 39.6 20.7 1 Core2Quad 2.80Ghz 22.3 10.5 2 Corei7 3.50Ghz 22.9 13.1 1 Corei7 3.50Ghz 11.6 6.6 2 250ms 500ms m = 1024 N = 8 5.2 210MB 430MB 2 WORLD 14.2GB 2ms WORLD v.connect 2 byte WORLD v.connect 2ms 5.3 32 3 WORLD 1.7 2.2 1 http://hal-the-cat.music.coocan.jp/ritsu.html 6 c 2012 Information Processing Society of Japan
5.4 20% 8%BRI 2 BRI 6. v.connect WORLD OggVorbis 2 1.7 2.2 Cadencii kbinani WORLD 1) VOCALOID Vol.2008, No.50, pp.57 60 (2008). 2) http://utau2008.web.fc2.com/ 2011-11-28. 3) STRAIGHT Vol.2006-MUS-64, No.19, pp.59 64 (2006). 4) VocaListener2: Vol.2010-MUS-86, No.3, pp. 1 10 (2010). 5) WORLD Vol.41, No.7, pp.555 560 (2011). 6) kbinani: cadencii jp @ wiki - Cadencii, Cadencii Project (online), available from http://www9.atwiki.jp/boare/pages/18.html (accessed 2011-11-28). 7) XIPH.ORG: Vorbis.com, XIPH.ORG (online), available from http://www.vorbis.com/ (accessed 2011-11-28). 8) SNR F0 D Vol.93-D, No.2, pp.109 117 (2010). 9) (D) Vol.94-D, No.7, pp.1079 1087 (2011). 10) 2011 pp.325 326 (2011). 11) Lai, W.-H.: F0 Control Model for Mandarin Singing Voice Synthesis, Second International Conference on Digital Telecommunications, Vol.ICDT2, p.12 (2007). 7 c 2012 Information Processing Society of Japan