Sound Source Identification based on Deep Learning with Partially-Shared Architecture

Takayuki MORITO 1, Osamu SUGIYAMA 2, Ryosuke KOJIMA 1, Kazuhiro NAKADAI 1,3
1 Tokyo Institute of Technology, 2 Kyoto University, 3 Honda Research Institute Japan Co., Ltd.
morito@cyb.mei.titech.ac.jp, sugiyama@kuhp.kyoto-u.ac.jp, kojima@cyb.mei.titech.ac.jp, nakadai@jp.honda-ri.com

Abstract: This paper addresses sound source identification for sounds recorded with a microphone array embedded in an Unmanned Aerial Vehicle (UAV), where loud rotor noise makes the signal-to-noise ratio extremely low. We extend the Partially Shared Deep Neural Network (PS-DNN), which learns sound source identification and sound source separation jointly through partially shared layers, to a convolutional architecture: the Partially Shared Convolutional Neural Network (PS-CNN). Experiments show that the PS-CNN attains the highest average identification accuracy among DNN, PS-DNN, CNN, and PS-CNN, even when only part of the training data is annotated.

1 Introduction

A microphone array embedded in an Unmanned Aerial Vehicle (UAV) makes it possible to detect sound sources from the air, for example the voices of people who need help in a disaster area. The rotor and wind noise of the UAV is, however, extremely loud, so the signal-to-noise (SN) ratio of the recorded sound is very low. For sound source detection under such low-SN-ratio conditions, MUltiple SIgnal Classification based on incremental Generalized Singular Value Decomposition (iGSVD-MUSIC) [Ohata 14] has been proposed, and Geometric High-order Decorrelation-based Source Separation (GHDSS) [Nakajima 10] has been used to separate the detected sources. Even after separation, the separated signals remain distorted and noisy at low SN ratios, which makes identifying what each source is difficult.

Deep learning is a promising approach to this identification problem. Convolutional Neural Networks (CNN) [Lawrence 97] perform well in image recognition, and deep learning has been applied to outdoor acoustic event identification with a quadrotor-embedded microphone array [Uemura 15]; DNN-based methods have also scaled up end-to-end speech recognition [Hannun 14]. A practical obstacle is that training a DNN or CNN requires a large amount of annotated data, and annotation is costly.

To mitigate this, the Partially Shared Deep Neural Network (PS-DNN), based on Multi-Task Learning (MTL) [Caruana 97], learns sound source identification and sound source separation (noise reduction) jointly with partially shared layers, so that training data without identification labels can still be exploited through the separation task. In this paper we carry this partially shared architecture over to CNNs and propose the Partially Shared Convolutional Neural Network (PS-CNN).
2 Proposed Method

[Fig. 1: Partially Shared Deep Neural Network]
[Fig. 2: A hidden layer of PS-DNN]

2.1 Partially Shared Deep Neural Network

The PS-DNN, shown in Fig. 1, combines two sub-networks, one for sound source identification and one for sound source separation, and lets them share part of their hidden units. Let $y_1^i$, $y_2^i$, and $y_3^i$ denote the activations of the identification-specific, shared, and separation-specific units of the $i$-th hidden layer, respectively (Fig. 2). The activations $y_1^{i+1}$, $y_2^{i+1}$, and $y_3^{i+1}$ of layer $i+1$ are obtained by Eq. (1):

$$
\begin{bmatrix} y_1^{i+1} \\ y_2^{i+1} \\ y_3^{i+1} \end{bmatrix}
= \sigma\!\left(
\begin{bmatrix} W_{11}^{i} & W_{12}^{i} & 0 \\ 0 & W_{22}^{i} & 0 \\ 0 & W_{32}^{i} & W_{33}^{i} \end{bmatrix}
\begin{bmatrix} y_1^{i} \\ y_2^{i} \\ y_3^{i} \end{bmatrix}
+
\begin{bmatrix} b_1^{i} \\ b_2^{i} \\ b_3^{i} \end{bmatrix}
\right)
\tag{1}
$$

where $W_{jk}^{i}$ is the weight matrix connecting $y_k^{i}$ to $y_j^{i+1}$, $b_j^{i}$ is a bias vector, and $\sigma(\cdot)$ is an activation function. Because of the zero blocks, the identification-specific and separation-specific units are never connected to each other directly; the two tasks interact only through the shared units $y_2$.
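To make the block structure of Eq. (1) concrete, the following sketch implements one partially shared hidden layer in plain NumPy. This is an illustrative sketch only, not the authors' implementation (which used TensorFlow 0.8.0, see Sec. 3); the function name, the choice of tanh for $\sigma(\cdot)$, and the toy sizes borrowed from Table 2 are all assumptions.

```python
import numpy as np

def ps_hidden_layer(y1, y2, y3, W11, W12, W22, W32, W33, b1, b2, b3, sigma=np.tanh):
    """One partially shared hidden layer, following Eq. (1).

    y1: identification-specific units, y2: shared units, y3: separation-specific
    units. The zero blocks of the block weight matrix in Eq. (1) are realized by
    simply omitting the corresponding terms.
    """
    y1_next = sigma(W11 @ y1 + W12 @ y2 + b1)  # identification: own + shared input
    y2_next = sigma(W22 @ y2 + b2)             # shared: shared input only
    y3_next = sigma(W32 @ y2 + W33 @ y3 + b3)  # separation: shared + own input
    return y1_next, y2_next, y3_next

# Toy shapes following layer 1 -> layer 2 of Table 2
# (identify 1500 -> 800, shared 1500 -> 400, separate 1500 -> 800).
rng = np.random.default_rng(0)
y1, y2, y3 = (rng.standard_normal(1500) for _ in range(3))
W11, W12 = rng.standard_normal((800, 1500)), rng.standard_normal((800, 1500))
W22 = rng.standard_normal((400, 1500))
W32, W33 = rng.standard_normal((800, 1500)), rng.standard_normal((800, 1500))
out1, out2, out3 = ps_hidden_layer(y1, y2, y3, W11, W12, W22, W32, W33,
                                   np.zeros(800), np.zeros(400), np.zeros(800))
```

Writing the three groups as separate matrix products, rather than building the full block matrix with explicit zeros, keeps the structural constraint of Eq. (1) satisfied by construction.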
2.2 Partially Shared Convolutional Neural Network

[Fig. 3: A convolution-pooling layer in PS-CNN]

The PS-CNN carries the partially shared structure of the PS-DNN over to a CNN. Just as a CNN treats, for example, the three RGB channels of a color image as separate input maps, each layer of the PS-CNN holds multiple feature maps; these maps are divided into three groups, namely identification-specific, shared, and separation-specific maps (Fig. 3). Let $K_{i,1}$, $K_{i,2}$, and $K_{i,3}$ be the numbers of maps in the three groups in layer $i$, and write the input maps as $[X_1^{(i,1)}, \ldots, X_3^{(i,K_{i,3})}]$, where, for example, $X_1^{(i,1)} = [x_{1,1,1}^{(i,1)}, \ldots, x_{1,V,H}^{(i,1)}]$ is the first identification-specific map, of size $V \times H$. The $j$-th output map of layer $i+1$, $C^{(i+1,j)} = [c_{1,1}^{(i+1,j)}, \ldots, c_{V,H}^{(i+1,j)}]$, is computed by Eq. (2), (3), or (4), depending on the group it belongs to:

$$
c_{1,v,h}^{(i+1,j)} = \sigma\!\left(\sum_{m=1}^{K_{i,1}} \sum_{s,t} w_{m,s,t}^{(i,j)}\, x_{1,v+s,h+t}^{(i,m)} + \sum_{n=1}^{K_{i,2}} \sum_{s,t} w_{n,s,t}^{(i,j)}\, x_{2,v+s,h+t}^{(i,n)} + b^{(i,j)}\right)
\tag{2}
$$

$$
c_{2,v,h}^{(i+1,j)} = \sigma\!\left(\sum_{n=1}^{K_{i,2}} \sum_{s,t} w_{n,s,t}^{(i,j)}\, x_{2,v+s,h+t}^{(i,n)} + b^{(i,j)}\right)
\tag{3}
$$

$$
c_{3,v,h}^{(i+1,j)} = \sigma\!\left(\sum_{n=1}^{K_{i,2}} \sum_{s,t} w_{n,s,t}^{(i,j)}\, x_{2,v+s,h+t}^{(i,n)} + \sum_{m=1}^{K_{i,3}} \sum_{s,t} w_{m,s,t}^{(i,j)}\, x_{3,v+s,h+t}^{(i,m)} + b^{(i,j)}\right)
\tag{4}
$$

where $m$ and $n$ index input feature maps, $w_{k,s,t}^{(i,j)}$ is the kernel weight at offset $(s,t)$ connecting the $k$-th input map to the $j$-th output map, $b^{(i,j)}$ is a bias, and $\sigma(\cdot)$ is an activation function. As in the PS-DNN, the identification-specific and separation-specific maps are never convolved with each other directly; information flows between the two tasks only through the shared maps.
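As a concrete reading of Eqs. (2)–(4), the sketch below computes one output map of each group with plain NumPy loops. It is illustrative only: it produces a single output channel per group, uses a "valid" correlation instead of the zero-padded $5 \times 5$ convolution described in Sec. 3.3, and all names and toy sizes (taken from layer 1 of Table 4) are assumptions rather than the authors' code.

```python
import numpy as np

def conv_stack(maps, kernels, bias, sigma=np.tanh):
    """sigma( sum_k sum_{s,t} w[k,s,t] * x[k, v+s, h+t] + b ), i.e. a 'valid'
    multi-channel 2-D correlation as in Eqs. (2)-(4).
    maps: (K, V, H), kernels: (K, S, T)."""
    K, V, H = maps.shape
    _, S, T = kernels.shape
    out = np.full((V - S + 1, H - T + 1), float(bias))
    for v in range(out.shape[0]):
        for h in range(out.shape[1]):
            out[v, h] += np.sum(maps[:, v:v + S, h:h + T] * kernels)
    return sigma(out)

def ps_cnn_layer(x1, x2, x3, w1, w12, w2, w32, w3, b1, b2, b3):
    """One output map per group. x1/x2/x3 are the identification-specific,
    shared, and separation-specific input maps of layer i."""
    # Eq. (2): an identification map sees its own maps plus the shared maps.
    c1 = conv_stack(np.concatenate([x1, x2]), np.concatenate([w1, w12]), b1)
    # Eq. (3): a shared map sees the shared maps only.
    c2 = conv_stack(x2, w2, b2)
    # Eq. (4): a separation map sees the shared maps plus its own maps.
    c3 = conv_stack(np.concatenate([x2, x3]), np.concatenate([w32, w3]), b3)
    return c1, c2, c3

# Toy shapes from layer 1 of Table 4: 40 maps per group, 20x20 maps, 5x5 kernels.
rng = np.random.default_rng(0)
x1, x2, x3 = (rng.standard_normal((40, 20, 20)) for _ in range(3))
w1, w12, w2, w32, w3 = (0.01 * rng.standard_normal((40, 5, 5)) for _ in range(5))
c1, c2, c3 = ps_cnn_layer(x1, x2, x3, w1, w12, w2, w32, w3, 0.0, 0.0, 0.0)
print(c1.shape, c2.shape, c3.shape)  # (16, 16) each under 'valid' correlation
```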
3 Experiments

All networks were implemented in Python with TensorFlow version 0.8.0 [Abadi 15]. Four networks were compared: DNN, PS-DNN, CNN, and PS-CNN. The DNN and CNN are identification-only baselines, while the PS-DNN and PS-CNN each consist of the two sub-networks described in Sec. 2. As source sounds, we used the Acoustic scene classification task of DCASE2016 [Mesaros 16], which contains 15 scene classes in 1170 WAVE files (35,100 s of audio in total); 465 of these files were used for the experiments described in Secs. 3.1–3.2.

[Fig. 4: The layout of the microphones and the sound source]
[Fig. 5: UAV (Parrot Bebop Drone)]

3.1 Recording Conditions

An 8-channel microphone array was mounted on the UAV (Parrot Bebop Drone, Fig. 5); the layout of the microphones and the sound source is shown in Fig. 4. The DCASE2016 sounds were mixed with the recorded rotor noise so that the SN ratio was approximately 0 dB, where the SN ratio is defined as

$$
\mathrm{SNR} = 20 \log \left( S_p / N_p \right)
\tag{5}
$$

with $S_p$ and $N_p$ denoting the peak amplitudes of the target signal and the noise, respectively.

3.2 Feature Extraction

The audio was sampled at 16 kHz and processed in frames of 512 samples (32 ms) with a 120-sample (7.5 ms) shift, and the frequency range from 63 Hz to 8 kHz was reduced to 20 channels with a filter bank. Sound source separation was carried out with HARK (Honda Research Institute Japan Audition for Robots with Kyoto University) [Nakadai 10]. The features of consecutive frames were assembled into a 3200-dimensional input vector, which the CNN-based networks treat as a stack of 20 x 20 feature maps; for the PS-DNN and PS-CNN, the separation sub-network is trained to output the corresponding 400-dimensional feature of the clean source.

3.3 Network Configurations and Training

The configurations of the four networks are listed in Tables 1–4.

Table 1: Dimensions for the DNN
  Hidden layer   Units
  1              2000
  2              1000
  3              400

Table 2: Dimensions for the PS-DNN
  Hidden layer   Identify   Shared   Separate
  1              1500       1500     1500
  2              800        400      800
  3              400        0        800

Table 3: Dimensions for the CNN
  Hidden layer   Type   Channels   Size
  1              Conv   40         20 x 20
  2              Pool   40         10 x 10
  3              Conv   80         10 x 10
  4              Pool   80         5 x 5
  5              Full   400        1 x 1

Table 4: Dimensions for the PS-CNN
  Hidden layer   Type   Identify   Shared   Separate   Size
  1              Conv   40         40       40         20 x 20
  2              Pool   40         40       40         10 x 10
  3              Conv   80         40       80         10 x 10
  4              Pool   80         40       80         5 x 5
  5              Full   400        0        800        1 x 1

Each network receives the 3200-dimensional feature of Sec. 3.2 as input and outputs posterior probabilities over the 15 classes; the PS-DNN and PS-CNN additionally output the 400-dimensional separation feature. No pre-training was applied; all networks were trained with Adam, and Dropout with drop rates between 0.2 and 0.4 was used. The convolution layers use 5 x 5 kernels with zero padding, so the map size is preserved, and the pooling layers use 2 x 2 pooling. Training used mini-batches of 100 samples for 10 epochs.
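How partially annotated data enters training is not spelled out beyond the multi-task formulation above, but a natural reading is that every training sample provides a regression target for the separation sub-network, while only the annotated fraction provides a class label. The sketch below shows such a joint objective under that assumption; the function name, the convention of marking unlabeled samples with -1, and the weighting factor `alpha` are all hypothetical, and this is not a reproduction of the authors' TensorFlow code.

```python
import numpy as np

def joint_loss(class_logits, labels, sep_out, sep_target, alpha=1.0):
    """Hypothetical PS-network objective: cross-entropy on the identification
    branch plus mean-squared error on the separation branch. labels == -1
    marks samples without a class annotation; those samples contribute only
    through the separation term."""
    # Separation term: every sample has a clean-feature regression target.
    mse = np.mean((sep_out - sep_target) ** 2)
    # Identification term: restricted to the annotated samples.
    annotated = labels >= 0
    if not np.any(annotated):
        return alpha * mse
    logits = class_logits[annotated]
    logits -= logits.max(axis=1, keepdims=True)  # numerically stable softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ce = -np.mean(log_probs[np.arange(len(logits)), labels[annotated]])
    return ce + alpha * mse

# Toy batch: 4 samples, 15 classes (Sec. 3.3), 400-dim separation targets;
# two of the samples lack identification labels.
rng = np.random.default_rng(0)
loss = joint_loss(rng.standard_normal((4, 15)),
                  np.array([3, -1, 7, -1]),
                  rng.standard_normal((4, 400)),
                  rng.standard_normal((4, 400)))
```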
3.4 Results

Table 5 and Figures 6–9 show the identification accuracy when the proportion of annotated training data is 100%, 75%, 50%, and 25%, where Avg. and S.E. denote the average accuracy and its standard error.

Table 5: Accuracy of Sound Source Identification
                 DNN      PS-DNN   CNN      PS-CNN
  100%   Avg.    55.88    56.27    55.79    56.75
         S.E.    0.6756   0.5742   0.5348   0.5275
  75%    Avg.    54.07    54.57    54.36    55.09
         S.E.    0.6599   0.6584   0.5055   0.3268
  50%    Avg.    51.63    51.91    51.95    52.71
         S.E.    0.5786   0.6525   0.5332   0.5357
  25%    Avg.    47.84    48.04    48.35    48.74
         S.E.    0.6477   0.6526   0.5566   0.5898

[Fig. 6: Trained with 100% annotated data]
[Fig. 7: Trained with 75% annotated data]
[Fig. 8: Trained with 50% annotated data]
[Fig. 9: Trained with 25% annotated data]

Under every annotation ratio, the average accuracy satisfies DNN < PS-DNN and CNN < PS-CNN, and the PS-CNN achieved the highest average accuracy overall. A t-test on these differences, however, yielded p > 0.05, so the improvements are not statistically significant.

4 Conclusion

We extended the partially shared architecture of the PS-DNN to a convolutional network and proposed the PS-CNN for sound source identification under UAV noise. In experiments using the DCASE2016 Acoustic scene classification data mixed with UAV rotor noise at low SN ratio, the partially shared networks achieved higher average accuracy than the plain DNN and CNN for all four annotation ratios, although the differences were not statistically significant. Future work includes evaluation under a wider range of SN-ratio conditions.

Acknowledgments: This work was supported by JSPS KAKENHI Grant Numbers 24220006, 16H02884, and 16K00294, and by the JST ImPACT program.

References

[Abadi 15] Abadi, M., et al.: TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, http://tensorflow.org/ (2015)

[Caruana 97] Caruana, R.: Multitask learning, Machine Learning, vol. 28, no. 1, pp. 41-75 (1997)

[Mesaros 16] Mesaros, A., et al.: TUT database for acoustic scene classification and sound event detection, 24th European Signal Processing Conference (EUSIPCO) (2016)

[Hannun 14] Hannun, A., et al.: Deep Speech: Scaling up end-to-end speech recognition, arXiv preprint arXiv:1412.5567 (2014)

[Lawrence 97] Lawrence, S., et al.: Face recognition: A convolutional neural-network approach, IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98-113 (1997)
[Nakadai 10] Nakadai, K., et al.: Design and Implementation of Robot Audition System HARK, Advanced Robotics, vol. 24, pp. 739-761 (2010)

[Nakajima 10] Nakajima, H., et al.: Correlation matrix estimation by an optimally controlled recursive average method and its application to blind source separation, Acoustical Science and Technology, vol. 31, no. 3, pp. 205-212 (2010)

[Ohata 14] Ohata, T., et al.: Improvement in outdoor sound source detection using a quadrotor-embedded microphone array, IEEE/RSJ International Conference on Intelligent Robots and Systems (2014)

[Uemura 15] Uemura, S., et al.: Outdoor acoustic event identification using sound source separation and deep learning with a quadrotor-embedded microphone array, The 6th International Conference on Advanced Mechatronics (2015)