F0 Estimation of Melody and Bass Lines in Real-world Musical Audio Signals

99-MUS-31-16, Vol.99, No.68, August 1999. 31 16 goto@etl.go.jp CD EM F0 Estimation o Melody and Bass Lines in Real-world Musical Audio Signals Masataka Goto Electrotechnical Laboratory 1-1-4 Umezono, Tsukuba, Ibaraki 305-8568 Japan Abstract This paper describes a method or estimating the undamental requency (F0) o melody and bass lines in monaural audio signals containing sounds o various instruments. Most previous methods premised mixtures o a ew sounds and had great diiculty dealing with audio signals sampled rom compact discs. Our method does not rely on the unreliable F0 s component and obtains the most predominant F0 supported by harmonics within an intentionally limited requency range. It estimates relative dominance o every possible F0 by using the Expectation- Maximization algorithm and considers F0 s temporal continuity by introducing a multi-agent model. Experimental results show that our real-time system can detect the melody and bass lines in real-world audio signals. 1 1) 5) 4),5) CD (compact disc) ( ) CD 91

6) 10) 11) 18) CD CD 19) ( ) DP () EM (Expectation-Maximization) 20) CD 10 2 D m (t) D b (t) t(f0) F i (t) (i =m, b) A i (t) D m (t) ={F m (t),a m (t)} (1) D b (t) ={F b (t),a b (t)} (2) () ( ) (missing undamental) 1 (1) (2) (3) ( ) 1 () 92

(1) (2) () EM (Expectation-Maximization) 20) EM (3) (singing ormant) 2.8 2 450 Hz 21) 22) 2 Interaction 1: 23) 1.4 2 3 1 2 () ( ) 3.1 24),25) Flanagan 24) 93

16 2 8 4 = (0.45 s) + FFT FFT FFT FFT 2 1 FFT 2: (STFT) x(t) h(t) STFT X(ω, t) = x(τ)h(τ t)e jωτ dτ (3) = a + jb (4) 3.6-7.2 1.8-3.6 0.9-1.8 0.45-0.9 0-0.45 λ(ω, t) 24) λ(ω,t) =ω + a b t b a t a 2 + b 2 (5) h(t) 2 B- 10) 10) STFT h(t) STFT 26) 2 (FIR ) 1/2 (decimator) 0.45 s ( s ) 16 16 bit A/D 1 STFT 512 (FFT) FFT 16 160 (1 ) 10 msec 1 3.2 8) 10) STFT ω λ(ω, t) ψ ψ 10) (t) 27) (t) = { ψ λ(ψ, t) ψ =0, (λ(ψ, t) ψ) < 0} (6) ψ (t) STFT (t) p (ω) { (t) X(ω, t) i ω (t) p (ω) = (7) 0 otherwise 3.3 (BPF) BPF BPF BPF 3 cent ( ( ))Hz Hz cent cent cent = 1200 log Hz 2 REF Hz (8) REF Hz = 440 2 3 12 5 (9) 100 cent 1 1200 cent 0 cent 1200 cent 2400 cent 3600 cent 4800 cent 6000 cent 7200 cent 8400 cent 9600 cent 16.35Hz 32.70Hz 65.41Hz 130.8Hz 261.6Hz 523.3Hz 1047 Hz 2093 Hz 4186 Hz 3: (BPF) 94

x cent BPF BPF i (x) (i = m, b) (t) p (x) BPF BPF i (x) (t) p (x) (t) p (x) cent (t) p (ω) BPF (x) (x) =BPF i(x) (t) p (x) (10) Pow (t) Pow (t) BPF Pow (t) = BPF i (x) (t) p (x) dx (11) 3.4 BPF (x) ( ) () F p(x F ) p(x; θ (t) ) Fhi p(x; θ (t) )= w (t) (F ) p(x F ) df (12) Fl i θ (t) = {w (t) (F ) Fl i F Fh i } (13) Fh i Fl i w (t) (F ) p(x F ) Fhi w (t) (F ) df = 1 (14) Fl i CD (x) p(x; θ (t) ) θ (t) (x) w (t) (F ) F 0 (F ) F 0 (F )=w(t) (F ) (Fl i F Fh i ) (15) p(x F ) (w (t) (F ) ) F 0 (F ) F (x) p(x; θ (t) ) θ (t) θ (t) (x) log p(x; θ(t) ) dx (16) EM (Expectation-Maximization) 20) θ (t) EM E (expectation step) M (maximization step) ( (x)) θ (t) θ (t) ( ) θ (t) θ (t) t 1 x ( ) F EM 1. (E ) Q(θ (t) θ (t) ) Q(θ (t) θ (t) ) = (x) E F [log p(x, F ; θ (t) ) x; θ (t) ] dx (17) E F [a b] b F a 2. (M ) Q(θ (t) θ (t) ) θ (t) θ (t) θ (t) = argmax Q(θ (t) θ (t) ) (18) θ (t) E (17) Q(θ (t) θ (t) ) Fhi = (x)p(f x; θ (t) ) log p(x, F ; θ (t) )df dx Fl i (19) log p(x, F ; θ (t) ) = log(w (t) (F ) p(x F )) = log w (t) (F ) + log p(x F ) (20) M (18) (14) Lagrange λ Euler-Lagrange 95

( w (t) (x) p(f x; θ (t) ) (log w (t) (F )+ ) log p(x F )) dx λ(w (t) 1 (F ) Fh i Fl i ) =0 (21) w (t) (F )= 1 λ (x) p(f x; θ (t) ) dx (22) λ (14) λ =1 p(f x; θ (t) ) p(f x; θ (t) w (t) (F ) p(x F ) )= Fhi w (t) (η) p(x η) dη Fl i (23) w (t) (F ) (θ (t) = {w (t) (F )}) w (t) (F ) w (t) (F )= (x) w (t) (F ) p(x F ) Fhi Fl i w (t) (η) p(x η) dη dx (24) (24) p(x F ) F (i = m) (i = b) N i p(x F )=α c(h) G(x; F + 1200 log 2 h, W i ) (25) h=1 G(x; m, σ) = 1 (x m)2 e 2σ 2 (26) 2πσ 2 α N i () W i 2 G(x; m, σ) c(h) h c(h) =G(h;1, H i ) (H i ) F i (t) F 0 (F )( (15) (24) ) F i (t) = argmax F 0 (F ) (27) F Interaction 4: 3.5 3 28) 16) ( 4) (1) ( ) Pow (t) (2) 3 96

(3) (4) (5) (6) (7) t F i (t) A i (t) F i (t) (t) p (ω) 4 ( 1) () 1: Fh m = 9600 cent (4186 Hz) Fh b = 4800 cent (261.6 Hz) Fl m = 3600 cent (130.8 Hz) Fl b = 1000 cent (29.14 Hz) N m =16 N b =6 W m = 17 cent W b = 17 cent H m = 5.5 H b = 2.7 2: CD My Heart Will GoOn (Celine Dion) Vision o Love (Mariah Carey) Always (Bon Jovi) Time Goes By (Every Little Thing) Spirit o Love (Sing Like Talking) (Misia) Scarborough Fair (Herbie Hancock) jazz Autumn Leaves (Julian Cannonball Adderley) jazz On Green Dolphin Street (Miles Davis) jazz Violin Concerto in D, Op. 35 (Tchaikovsky) classical 5: : ( ) ( ) ( 5) D i (t) 2 AR 29) 3 LAN (Ethernet) RACP (Remote Audio Control Protocol) RACP RMCP (Remote Music Control Protocol) 30) (Pentium II 450 MHzCPU x 2, Linux 2.2) (SGI Octane R10000 250 MHz CPU, Irix 6.4) 2 10 CD 97

5 () EM CD [1], : AP1000,, Vol. 37, No. 7, pp. 1460 1468 (1996). [2], :, D-II, Vol. J81-D-II, No. 2, pp. 227 237 (1998). [3] Goto, M. and Muraoka, Y.: Music Understanding At The Beat Level Real-time Beat Tracking For AudioSignals, Computational Auditory Scene Analysis, Lawrence Erlbaum Associates, Publishers, pp. 157 176 (1998). [4] Goto, M. and Muraoka, Y.: Real-time Beat Tracking or Drumless Audio Signals: Chord Change Detection or Musical Decisions, Speech Communication, Vol. 27, No. 3 4, pp. 311 335 (1999). [5] :,, (1998). [6] Rabiner, L. R., Cheng, M. J., Rosenberg, A. E. and McGonegal, C. A.: A Comparative Perormance Study o Several Pitch Detection Algorithms, IEEE Trans. on ASSP, Vol. ASSP-24, No. 5, pp. 399 418 (1976). [7] Nehorai, A. and Porat, B.: Adaptive Comb Filtering or Harmonic Signal Enhancement, IEEE Trans. on ASSP, Vol. ASSP-34, No. 5, pp. 1124 1138 (1986). [8] Charpentier, F. J.: Pitch detection using the shortterm phase spectrum, Proc. o ICASSP 86, pp. 113 116 (1986). [9],, :, D-II, Vol. J79-D-II, No. 11, pp. 1771 1781 (1996). [10],, Patterson, R. D., de Cheveigné, A.:, H-98-116, pp. 31 38 (1998). [11] Chae, C. and Jae, D.: Source separation and note identiication in polyphonic music, Proc. o ICASSP 86, pp. 1289 1292 (1986). [12] :,, (1991). [13] Brown, G. J. and Cooke, M.: Perceptual Grouping o Musical Sounds: A Computational Model, J. o New Music Research, Vol. 23, pp. 107 132 (1994). [14] :,, (1994). [15], :,, Vol. 38, No. 1, pp. 146 157 (1997). [16],,, :,, Vol. 12, No. 1, pp. 111 120 (1997). [17], :, 98-MUS-24-5, pp. 33 40 (1998). [18] :,, Vol. 54, No. 10, pp. 715 719 (1998). [19], :, 2-9-1 (1998). [20] Dempster, A. P., Laird, N. M. and Rubin, D. B.: Maximum likelihood rom incomplete data via the EM algorithm, J. Roy. Stat. Soc. B, Vol. 39, No. 1, pp. 1 38 (1977). [21] Richards, W.(ed.): Natural Computation, The MIT Press (1988). [22] Ritsma, R. J.: Frequencies Dominant in the Perception o the Pitch o Complex Sounds, J. Acoust. Soc. Am., Vol. 42, No. 1, pp. 191 198 (1967). [23] Plomp, R.: Pitch o Complex Tones, J. Acoust. Soc. Am., Vol. 41, No. 6, pp. 1526 1533 (1967). [24] Flanagan, J. L. and Golden, R. M.: Phase Vocoder, The Bell System Technical J., Vol. 45, pp. 1493 1509 (1966). [25] Boashash, B.: Estimating and Interpreting The Instantaneous Frequency o a Signal, Proc. o the IEEE, Vol. 80, No. 4, pp. 520 568 (1992). [26] Vetterli, M.: A Theory o Multirate Filter Banks, IEEE Trans. on ASSP, Vol. ASSP-35, No. 3, pp. 356 372 (1987). [27] Abe, T., Kobayashi, T. and Imai, S.: The IF spectrogram: a new spectral representation, Proc.oASVA97, pp. 423 430 (1997). [28] Goto, M. and Muraoka, Y.: Beat Tracking based on Multiple-agent Architecture A Real-time Beat Tracking System or Audio Signals, Proc. o the Second Intl. Con. on Multiagent Systems, pp. 103 110 (1996). [29] Aikawa, K., Kawahara, H. and Tsuzaki, M.: A neural matrix model or active tracking o requency-modulated tones, Proc. o ICSLP 96, pp. 578 581 (1996). [30],, : RMCP:,, Vol. 40, No. 3, pp. 1335 1345 (1999). 98