[2] REVERB 8 [3], [4] [5] [20] [6], [7], [8], [9], [10] [11] REVERB 8 *1 [9] LDA *2 MLLT (SAT) [8] (basis fmllr) [12] (DNN) [10] DNN [11] [13] [14] Ka

Σχετικά έγγραφα
[5] F 16.1% MFCC NMF D-CASE 17 [5] NMF NMF 3. [5] 1 NMF Deep Neural Network(DNN) FUSION 3.1 NMF NMF [12] S W H 1 Fig. 1 Our aoustic event detect

3: A convolution-pooling layer in PS-CNN 1: Partially Shared Deep Neural Network 2.2 Partially Shared Convolutional Neural Network 2: A hidden layer o

Buried Markov Model Pairwise

CSJ. Speaker clustering based on non-negative matrix factorization using i-vector-based speaker similarity

Estimation, Evaluation and Guarantee of the Reverberant Speech Recognition Performance based on Room Acoustic Parameters

Vol.4-DCC-8 No.8 Vol.4-MUS-5 No.8 4// 3 3 Hanning (T ) 3 Hanning 3T (y(t)w(t)) dt =.5 T y (t)dt. () STRAIGHT F 3 TANDEM-STRAIGHT[] 3 F F 3 [] F []. :

Διερεύνηση ακουστικών ιδιοτήτων Νεκρομαντείου Αχέροντα

An Automatic Modulation Classifier using a Frequency Discriminator for Intelligent Software Defined Radio

Πτυχιακή Εργασι α «Εκτι μήσή τής ποιο τήτας εικο νων με τήν χρή σή τεχνήτων νευρωνικων δικτυ ων»

MIDI [8] MIDI. [9] Hsu [1], [2] [10] Salamon [11] [5] Song [6] Sony, Minato, Tokyo , Japan a) b)

Fourier transform, STFT 5. Continuous wavelet transform, CWT STFT STFT STFT STFT [1] CWT CWT CWT STFT [2 5] CWT STFT STFT CWT CWT. Griffin [8] CWT CWT

«ΑΝΑΠΣΤΞΖ ΓΠ ΚΑΗ ΥΩΡΗΚΖ ΑΝΑΛΤΖ ΜΔΣΔΩΡΟΛΟΓΗΚΩΝ ΓΔΓΟΜΔΝΩΝ ΣΟΝ ΔΛΛΑΓΗΚΟ ΥΩΡΟ»

Διπλωματική Εργασία. του φοιτητή του Τμήματος Ηλεκτρολόγων Μηχανικών και Τεχνολογίας Υπολογιστών της Πολυτεχνικής Σχολής του Πανεπιστημίου Πατρών

1,a) 1,b) 2 3 Sakriani Sakti 1 Graham Neubig 1 1. A Study on HMM-Based Speech Synthesis Using Rich Context Models

Bayesian statistics. DS GA 1002 Probability and Statistics for Data Science.

SNR F0 [2], [3], [4] F0 F0 F0 F0 F0 TUSK F0 TUSK F0 6 TUSK 6 F0 2. F0 F0 [5] [6] [7] p[8] Cepstrum [9], [10] [11] [12] [13] F0 [14] F0 [15] DIO[16] [1

1 n-gram n-gram n-gram [11], [15] n-best [16] n-gram. n-gram. 1,a) Graham Neubig 1,b) Sakriani Sakti 1,c) 1,d) 1,e)

: Monte Carlo EM 313, Louis (1982) EM, EM Newton-Raphson, /. EM, 2 Monte Carlo EM Newton-Raphson, Monte Carlo EM, Monte Carlo EM, /. 3, Monte Carlo EM

Study on Re-adhesion control by monitoring excessive angular momentum in electric railway traction

Chapter 1 Introduction to Observational Studies Part 2 Cross-Sectional Selection Bias Adjustment

Discriminative Language Modeling Based on Risk Minimization Training

Optimizing Microwave-assisted Extraction Process for Paprika Red Pigments Using Response Surface Methodology

Schedulability Analysis Algorithm for Timing Constraint Workflow Models


Feasible Regions Defined by Stability Constraints Based on the Argument Principle

Anomaly Detection with Neighborhood Preservation Principle

Vol. 31,No JOURNAL OF CHINA UNIVERSITY OF SCIENCE AND TECHNOLOGY Feb

Πανεπιστήµιο Πειραιώς Τµήµα Πληροφορικής

Ψηφιακή Επεξεργασία Φωνής

Nov Journal of Zhengzhou University Engineering Science Vol. 36 No FCM. A doi /j. issn

ΜΑΘΗΜΑΤΙΚΗ ΠΡΟΣΟΜΟΙΩΣΗ ΤΗΣ ΔΥΝΑΜΙΚΗΣ ΤΟΥ ΕΔΑΦΙΚΟΥ ΝΕΡΟΥ ΣΤΗΝ ΠΕΡΙΠΤΩΣΗ ΑΡΔΕΥΣΗΣ ΜΕ ΥΠΟΓΕΙΟΥΣ ΣΤΑΛΑΚΤΗΦΟΡΟΥΣ ΣΩΛΗΝΕΣ ΣΕ ΔΙΑΣΤΡΩΜΕΝΑ ΕΔΑΦΗ

ER-Tree (Extended R*-Tree)

Spectrum Representation (5A) Young Won Lim 11/3/16

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ ΣΧΟΛΗ ΗΛΕΚΤΡΟΛΟΓΩΝ ΜΗΧΑΝΙΚΩΝ ΚΑΙ ΜΗΧΑΝΙΚΩΝ ΥΠΟΛΟΓΙΣΤΩΝ

1 (forward modeling) 2 (data-driven modeling) e- Quest EnergyPlus DeST 1.1. {X t } ARMA. S.Sp. Pappas [4]

Test Data Management in Practice

ΤΕΧΝΟΛΟΓΙΚΟ ΕΚΠΑΙΔΕΥΤΙΚΟ ΙΔΡΥΜΑ ΑΓΡΟΤΙΚΕΣ ΣΤΑΤΙΣΤΙΚΕΣ ΜΕ ΕΡΓΑΛΕΙΑ ΓΕΩΠΛΗΡΟΦΟΡΙΚΗΣ

Συνδυασμένη Οπτική-Ακουστική Ανάλυση Ομιλίας

Speech Recognition using Phase Information based on Long-Term Analysis

Stabilization of stock price prediction by cross entropy optimization

Detection and Recognition of Traffic Signal Using Machine Learning


, -.

Ψηφιακή Επεξεργασία Φωνής

1181 (real-timespeechdriven) 1 1 ( ) D FAP FAP (voiceactivationdetectionvad) D FaceGen 3- D XfaceEd MPEG-4 1 FAP 66 FAP ( ) FAP 84

ΣΙΣΛΟ ΓΙΑΣΡΙΒΗ ΔΞΑΓΧΓΗ ΥΑΡΑΚΣΗΡΙΣΙΚΧΝ ΔΙΚΟΝΟΠΛΑΙΙΧΝ ΑΠΟ ΑΚΟΛΟΤΘΙΔ ΒΙΝΣΔΟ ΜΔ ΥΡΗΗ ΟΜΑΓΟΠΟΙΗΗ ΠΟΛΛΑΠΛΧΝ ΟΦΔΧΝ ΜΔΣΑΠΣΤΥΙΑΚΗ ΔΡΓΑΙΑ ΔΞΔΙΓΙΚΔΤΗ

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ

HIV HIV HIV HIV AIDS 3 :.1 /-,**1 +332

No. 7 Modular Machine Tool & Automatic Manufacturing Technique. Jul TH166 TG659 A

w o = R 1 p. (1) R = p =. = 1

IPSJ SIG Technical Report Vol.2014-CE-127 No /12/6 CS Activity 1,a) CS Computer Science Activity Activity Actvity Activity Dining Eight-He

Θωμάς ΣΑΛΟΝΙΚΙΟΣ 1, Χρήστος ΚΑΡΑΚΩΣΤΑΣ 2, Βασίλειος ΛΕΚΙΔΗΣ 2, Μίλτων ΔΗΜΟΣΘΕΝΟΥΣ 1, Τριαντάφυλλος ΜΑΚΑΡΙΟΣ 3,

476,,. : 4. 7, MML. 4 6,.,. : ; Wishart ; MML Wishart ; CEM 2 ; ;,. 2. EM 2.1 Y = Y 1,, Y d T d, y = y 1,, y d T Y. k : p(y θ) = k α m p(y θ m ), (2.1

Simplex Crossover for Real-coded Genetic Algolithms

ΦΩΤΟΓΡΑΜΜΕΤΡΙΚΕΣ ΚΑΙ ΤΗΛΕΠΙΣΚΟΠΙΚΕΣ ΜΕΘΟΔΟΙ ΣΤΗ ΜΕΛΕΤΗ ΘΕΜΑΤΩΝ ΔΑΣΙΚΟΥ ΠΕΡΙΒΑΛΛΟΝΤΟΣ

1530 ( ) 2014,54(12),, E (, 1, X ) [4],,, α, T α, β,, T β, c, P(T β 1 T α,α, β,c) 1 1,,X X F, X E F X E X F X F E X E 1 [1-2] , 2 : X X 1 X 2 ;

Other Test Constructions: Likelihood Ratio & Bayes Tests

Ι ΑΚΤΟΡΙΚΗ ΙΑΤΡΙΒΗ. Χρήστος Αθ. Χριστοδούλου. Επιβλέπων: Καθηγητής Ιωάννης Αθ. Σταθόπουλος

Elements of Information Theory

Study of In-vehicle Sound Field Creation by Simultaneous Equation Method

Numerical Analysis FMN011

1 h, , CaCl 2. pelamis) 58.1%, (Headspace solid -phase microextraction and gas chromatography -mass spectrometry,hs -SPME - Vol. 15 No.

Research on Economics and Management

ΤΕΧΝΟΛΟΓΙΚΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΚΥΠΡΟΥ ΣΧΟΛΗ ΜΗΧΑΝΙΚΗΣ ΚΑΙ ΤΕΧΝΟΛΟΓΙΑΣ. Πτυχιακή εργασία

Company. Patras, Greece

Sampling Basics (1B) Young Won Lim 9/21/13

6.1. Dirac Equation. Hamiltonian. Dirac Eq.

The Algorithm to Extract Characteristic Chord Progression Extended the Sequential Pattern Mining

VSC STEADY2STATE MOD EL AND ITS NONL INEAR CONTROL OF VSC2HVDC SYSTEM VSC (1. , ; 2. , )

BCI On Feature Extraction from Multi-Channel Brain Waves Used for Brain Computer Interface

Supplementary Materials for Evolutionary Multiobjective Optimization Based Multimodal Optimization: Fitness Landscape Approximation and Peak Detection

«ΧΩΡΙΚΗ ΜΟΝΤΕΛΟΠΟΙΗΣΗ ΤΗΣ ΚΑΤΑΝΟΜΗΣ ΤΟΥ ΠΛΗΘΥΣΜΟΥ ΤΗΣ ΠΕΡΔΙΚΑΣ (ALECTORIS GRAECA) ΣΤΗ ΣΤΕΡΕΑ ΕΛΛΑΔΑ»

* ** *** *** Jun S HIMADA*, Kyoko O HSUMI**, Kazuhiko O HBA*** and Atsushi M ARUYAMA***

Development of a Seismic Data Analysis System for a Short-term Training for Researchers from Developing Countries

ΙΠΛΩΜΑΤΙΚΗ ΕΡΓΑΣΙΑ. ΘΕΜΑ: «ιερεύνηση της σχέσης µεταξύ φωνηµικής επίγνωσης και ορθογραφικής δεξιότητας σε παιδιά προσχολικής ηλικίας»

ΖΩΝΟΠΟΙΗΣΗ ΤΗΣ ΚΑΤΟΛΙΣΘΗΤΙΚΗΣ ΕΠΙΚΙΝΔΥΝΟΤΗΤΑΣ ΣΤΟ ΟΡΟΣ ΠΗΛΙΟ ΜΕ ΤΗ ΣΥΜΒΟΛΗ ΔΕΔΟΜΕΝΩΝ ΣΥΜΒΟΛΟΜΕΤΡΙΑΣ ΜΟΝΙΜΩΝ ΣΚΕΔΑΣΤΩΝ

substructure similarity search using features in graph databases

PHOS π 0 analysis, for production, R AA, and Flow analysis, LHC11h

AΡΙΣΤΟΤΕΛΕΙΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΘΕΣΣΑΛΟΝΙΚΗΣ ΠΟΛΥΤΕΧΝΙΚΗ ΣΧΟΛΗ ΤΜΗΜΑ ΠΟΛΙΤΙΚΩΝ ΜΗΧΑΝΙΚΩΝ

JOURNAL OF APPLIED SCIENCES Electronics and Information Engineering. Cyclic MUSIC DOA TN (2012)

Statistical Inference I Locally most powerful tests

{takasu, Conditional Random Field

2016 IEEE/ACM International Conference on Mobile Software Engineering and Systems

Statistical analysis of extreme events in a nonstationary context via a Bayesian framework. Case study with peak-over-threshold data

Development of the Nursing Program for Rehabilitation of Woman Diagnosed with Breast Cancer

ΧΩΡΙΚΑ ΟΙΚΟΝΟΜΕΤΡΙΚΑ ΥΠΟΔΕΙΓΜΑΤΑ ΣΤΗΝ ΕΚΤΙΜΗΣΗ ΤΩΝ ΤΙΜΩΝ ΤΩΝ ΑΚΙΝΗΤΩΝ SPATIAL ECONOMETRIC MODELS FOR VALUATION OF THE PROPERTY PRICES

Ανάκτηση Εικόνας βάσει Υφής με χρήση Eye Tracker

Supplementary Appendix

[1] DNA ATM [2] c 2013 Information Processing Society of Japan. Gait motion descriptors. Osaka University 2. Drexel University a)

Μεταπτυχιακή διατριβή. Ανδρέας Παπαευσταθίου

Retrieval of Seismic Data Recorded on Open-reel-type Magnetic Tapes (MT) by Using Existing Devices

ΔΙΠΛΩΜΑΤΙΚΗ ΕΡΓΑΣΙΑ. του φοιτητή του Τμήματος Ηλεκτρολόγων Μηχανικών και. Τεχνολογίας Υπολογιστών της Πολυτεχνικής Σχολής του. Πανεπιστημίου Πατρών

Resurvey of Possible Seismic Fissures in the Old-Edo River in Tokyo

Approximation of distance between locations on earth given by latitude and longitude

ΠΟΛΥΤΕΧΝΕΙΟ ΚΡΗΤΗΣ ΣΧΟΛΗ ΜΗΧΑΝΙΚΩΝ ΠΕΡΙΒΑΛΛΟΝΤΟΣ

Echo path identification for stereophonic acoustic echo cancellation without pre-processing

Queensland University of Technology Transport Data Analysis and Modeling Methodologies

Applying Markov Decision Processes to Role-playing Game

ΓΕΩΠΟΝΙΚΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΑΘΗΝΩΝ ΤΜΗΜΑ ΑΓΡΟΤΙΚΗΣ ΟΙΚΟΝΟΜΙΑΣ & ΑΝΑΠΤΥΞΗΣ

Σύστημα επεξεργασίας, ανάλυσης και ταξινόμησης εικόνων δισδιάστατης ηλεκτροφόρησης με τεχνικές αναγνώρισης προτύπων

Transcript:

: REVERB 1,a) 1 2 REVERB 8 REVERB REVERB 6.76% 18.60% 68.8% 61.5% REVERB Effectiveness of dereverberation techniques and system combination approach for various reverberant environments: REVERB challenge Tachioka Yuuki 1,a) Narita Tomohiro 1 Watanabe Shinji 2 Abstract: The recently released REVERB challenge includes a reverberant speech recognition task. This paper focuses on state-of-the-art ASR techniques such as discriminative training of acoustic models including Gaussian mixture model, sub-space Gaussian mixture model, and deep neural networks, and various feature transformations after the proposed single channel dereverberation method with reverberation time estimation and multi-channel beamforming that enhances direct sound compared with the reflected sound. In addition, because it is necessary to handle these various environments in the challenge and the best performing system is different from environment to environment, we perform a system combination approach using different feature and different types of systems. Moreover, we use our discriminative training technique for system combination that improves system combination by making systems complementary. Experiments show the effectiveness of these approaches, reaching 6.76% and 18.60% word error rate on the REVERB simulated and real test sets, which are 68.8% and 61.5% relative improvements over the baseline. Keywords: Reverberation, Dereverberation, Discriminative training, Feature transformation, System combination, REVERB challenge 1 Information Technology R&D Center, Mitsubishi Electric Corporation, Kamakura, Kanagawa, 247-8501, Japan 2 Mitsubishi Electric Research Laboratories a) Tachioka.Yuki@eb.MitsubishiElectric.co.jp 1. REVERB [1] 1

[2] REVERB 8 [3], [4] [5] [20] [6], [7], [8], [9], [10] [11] REVERB 8 *1 [9] LDA *2 MLLT (SAT) [8] (basis fmllr) [12] (DNN) [10] DNN [11] [13] [14] Kaldi [15] 2. 1 3 1 3 1) 2) *1 (LDA) [6] (MLLT) [7] *2 9 3) 2 (NLMS) 2 4.2 (LDA MLLT fmllr) (MFCC) (PLP) 2 2 3 ( (boosted MMI), cf. 4.1 4.3 ) 3 (GMM SGMM DNN) 4.4 ROVER 3. 3.1 CSP [5] ỹ t m (STFT) x t (m) ỹ t = m x t (m) exp( ȷωτ 1,m ) (1) t ω 1 m τ 1,m 2 (CSP) [3] [ τ 1,m = arg max S 1 xt (1) x t (m) ] x t (1) x t (m) (2) S STFT * CSP [16] [4] 3 CSP [17] 3.2 [2] T r x 2 ŷ 2 n 2 x t 2 = t w µ ŷ t µ 2 + n 2 (3) µ=0 2

<PLP> <MFCC> GMM (f-bmmi, w/o SAT) τ CSP Speech DS derev. NLMS {1,8} 1 1 1 Feature extraction Feature Transform LDA,MLLT basis fmllr f-bmmi GMM (f-bmmi, w/ SAT) SGMM (bmmi, w/ SAT) ROVER Results Speech enhancement part (3.) DNN (bmmi, w/ SAT) for 8ch data ASR part (4.) 1 Schematic diagram of the proposed system. (CSP: cross spectrum phase analysis, DS: delay-and-sum beamformer, derev.: proposed dereverberation method, NLMS: normalized least-mean-squares algorithm, gray blocks are complementary systems for each system type) µ w µ ŷ 2 x 2 ŷ t µ 2 = η(t r ) x t µ 2 n 2 (4) η T r T r w 0 1 Eq. (5) t ŷ t 2 = x t 2 [ w µ η(tr ) x t µ 2 n 2] n 2 (5) µ=1 2 D( ) Polack [18] w µ D < µ µ < D w µ 0 w µ = α s /η(t r )e 2 φµ (6) φ α s η Eq. (5) (SS)[19] SS ŷ 2 β x 2 β x 2 ( ) β 2 T r [2] 4. 4.1 MMI MMI MMI (bmmi) [20] b( 0) F bmmi (λ) = r p λ (x r H sr ) κ p L (s r ) log s p λ (x r H s ) κ p L (s)e ba(s,sr) (7) x r r λ H sr H s s r s HMM p λ κ p L A(s, s r ) s s r 4.2 MMI (f-bmmi) h t I J M [9] y t = x t + Mh t (8) x t t I y t I h t J( I) M F f-bmmi (M) Eq. (7) x r r y r F f-bmmi (M) = r p λ (y r H sr ) κ p L (s r ) log s p λ (y r H s ) κ p L (s)e ba(s,sr) (9) 3

4.3 DNN DNN-HMM (CE) bmmi (7) [21] DNN HMM j p θ p θ (x r j) = p θ (j x r ) p 0 (j) (10) p 0 (j) j F bmmi (θ) Eq. (7) λ θ 4.4 [14] Q F F c (φ) = (1 + α c )F(φ) α c Q Q F(φ) (11) q=1 s r q 1 ( )s q,1 φ ( λ M θ) α c F bmmi f-bmmi α c F Eq. (11) 1 2 5. REVERB [1] 1 2 8 1 8 (5k) (WSJCAM0) SimData 0.25 0.5 0.75 3 (Room 1 3) 0.5 m (near) 2 m (far) 6 SNR 20dB RealData 1 (Room 1) 2 8 0.1 m (tr) 92 7,861 (eva) SimData 28 2,176 RealData 10 372 (dev) SimData 20 1,484 RealData 5 179 tr dev (WER) tri-gram 1 *3 8 200 NLMS 0 12 MFCC PLP 1 2 9 MFCC 117 LDA 40 LDA HMM MLLT [11] fmllr [12] SAT [8] bmmi f-bmmi 0.1 Eq. (11) 2 0.3 α c 0.75 *4 tri-phone (ML) ML DNN Kaldi [15] Povey 2 2M 0.02 0.004 GMM SGMM [22] DNN 3 GMM f-bmmi SGMM DNN[21] bmmi MFCC PLP 16 6. 1 (dev) WER 4 2 Kaldi baseline WER derev. 8 derev. *3 (D = 9, α = 5, β = 0.05, a = 0.005, b = 0.6) *4 mono-phone ( sil ) 44 triphone 2,500 15,000 4

1 WER [%] by room and microphone distance on the REVERB Challenge (dev). SimData RealData Room 1 Room 2 Room 3 Avg Room 1 Avg Feature Type near far near far near far near far 1ch Kaldi baseline MFCC ML 10.96 12.56 15.70 34.21 19.61 39.24 22.05 48.53 47.37 47.95 derev. 12.41 14.68 14.03 27.16 16.39 33.85 19.75 47.04 44.57 45.81 GMM +LDA+MLLT ML 9.46 11.01 11.51 22.04 13.08 28.09 15.87 39.99 40.67 40.33 +basis fmllr 7.77 10.00 9.76 19.28 11.05 24.90 13.79 33.00 35.54 34.27 bmmi 7.13 9.61 9.12 16.19 10.46 21.98 12.42 30.69 35.20 32.95 f-bmmi 6.27 8.73 8.28 14.89 9.37 19.54 11.18 28.32 31.31 29.82 f-bmmi c 7.06 9.05 8.58 14.96 10.16 20.43 11.71 29.01 31.72 30.37 +SAT ML 8.87 11.21 9.71 19.89 10.95 24.04 14.11 36.06 36.23 36.15 bmmi 6.56 8.51 7.76 16.24 9.03 19.88 11.33 34.19 37.53 35.86 f-bmmi 5.88 7.60 7.25 14.59 8.09 17.51 10.15 31.63 34.72 33.18 f-bmmi c 6.07 7.82 7.22 14.89 8.43 17.51 10.32 32.38 35.27 33.83 SGMM ML 6.47 9.07 8.18 17.11 9.55 20.40 11.80 33.13 34.93 34.03 bmmi 5.53 7.23 7.00 14.44 7.76 17.48 9.91 31.50 33.36 32.43 bmmi c 5.68 7.28 7.02 14.44 7.94 17.68 10.01 30.94 33.08 32.01 DNN CE 6.71 8.85 8.70 15.58 9.15 19.07 11.34 30.88 35.82 33.35 bmmi 5.29 7.06 6.95 13.09 7.57 15.53 9.25 28.45 32.67 30.56 bmmi c 5.14 6.74 6.51 12.37 7.27 15.50 8.92 28.32 33.49 30.91 ROVER 4.67 5.88 6.31 11.93 6.63 14.89 8.39 26.58 28.91 27.75 8ch CSP+BF+derev. MFCC ML 10.79 12.19 11.02 16.71 11.47 20.43 13.77 40.36 42.83 41.60 +NLMS 11.11 12.27 11.81 17.40 12.34 21.46 14.40 38.37 40.74 39.56 GMM +LDA+MLLT ML 8.38 10.30 9.91 14.94 10.19 17.28 11.83 34.06 37.18 35.62 +basis fmllr 7.74 9.22 8.80 13.33 9.05 15.28 10.57 27.39 30.14 28.77 bmmi 6.64 8.21 7.25 11.39 7.10 11.50 8.68 24.89 27.96 26.43 f-bmmi 6.19 7.40 7.39 10.13 6.58 10.24 7.99 22.58 26.25 24.42 f-bmmi c 6.39 7.33 7.44 9.86 6.70 10.44 8.03 22.71 27.41 25.06 +SAT ML 7.25 9.32 8.70 12.79 8.33 13.80 10.03 28.88 32.88 30.88 bmmi 5.24 7.10 6.56 9.93 5.98 10.98 7.63 26.58 30.83 28.71 f-bmmi 5.01 6.76 5.96 9.07 5.84 9.40 7.01 24.27 29.60 26.94 f-bmmi c 5.16 6.93 6.11 9.49 5.96 9.67 7.22 24.27 29.73 27.00 SGMM ML 5.65 7.62 7.47 10.97 7.00 11.45 8.36 25.27 30.35 27.81 bmmi 4.57 6.05 6.19 9.27 6.01 9.89 7.00 24.70 30.01 27.36 bmmi c 4.72 6.10 6.09 9.56 6.18 10.01 7.11 24.39 30.01 27.20 DNN CE 6.49 7.45 7.84 11.44 7.25 11.97 8.74 25.27 29.32 27.30 bmmi 5.56 6.27 6.24 9.29 5.71 10.44 7.25 23.27 28.84 26.06 bmmi c 5.26 6.05 6.21 9.10 5.61 10.06 7.05 22.65 28.50 25.58 ROVER 4.18 5.11 5.50 7.74 4.85 8.23 5.94 21.90 26.52 24.21 NLMS RealData WER 2.04% SimData 0.63% NLMS LDA MLLT WER 1 f-bmmi bmmi SGMM SimData GMM RealData GMM DNN SimData RealData SAT GMM DNN 2 SimData Real- Data DNN DNN 16 *5 *5 PLP MFCC 5

2 WER [%] on the REVERB Challenge (eva). MFCC feature was used for single system and MFCC and PLP features were used for ROVER). SimData RealData Room 1 Room 2 Room 3 Avg Room 1 Avg near far near far near far near far 1ch Kaldi baseline 13.23 14.13 15.54 29.69 20.06 37.44 21.68 50.62 45.98 48.30 derev. 12.50 13.43 14.61 24.71 17.09 32.62 19.16 44.75 43.32 44.04 f-bmmi 7.27 8.17 8.82 14.11 10.54 18.76 11.28 28.65 29.54 29.10 SAT+f-bMMI 6.44 7.22 7.57 13.97 9.52 18.44 10.53 28.87 29.78 29.33 SGMM+bMMI 5.81 6.54 7.22 13.84 8.70 18.17 10.05 27.75 28.36 28.06 DNN+bMMI 5.90 6.84 7.35 12.57 9.40 16.55 9.77 25.97 25.69 25.83 ROVER 5.30 5.61 6.30 11.16 7.76 14.95 8.51 23.79 23.60 23.70 8ch CSP+BF+derev. 10.94 11.69 10.98 16.33 12.79 21.39 14.02 34.33 36.93 35.63 +NLMS 10.94 12.32 11.38 17.59 13.46 22.96 14.78 35.32 35.28 35.30 f-bmmi 6.57 6.93 6.80 9.93 7.47 12.76 8.41 20.22 23.19 21.71 SAT+f-bMMI 6.17 6.64 6.51 10.13 7.40 13.15 8.33 20.63 23.67 22.15 SGMM+bMMI 5.86 6.44 6.29 9.23 6.96 12.83 7.94 20.66 23.50 22.08 DNN+bMMI 5.64 6.18 6.16 9.29 7.08 12.40 7.79 19.35 22.28 20.82 ROVER 4.96 5.62 5.58 8.18 5.73 10.47 6.76 16.90 20.29 18.60 2 (eva) DNN DNN (ROVER 5) WER SimData RealData 1ch 1.26% 2.13% 8ch 1.03% 2.22% 7. [1] Kinoshita, K. et al.: The REVERB Challenge: A Common Evaluation Framework for Dereverberation and Recognition of Reverberant Speech, Proc. of WASPAA (2013). [2] Tachioka, Y. et al.: Dereverberation Method with Reverberation Time Estimation Using Floored Ratio of Spectral Subtraction, Acoust. Sci. & Tech., Vol. 34, No. 3, pp. 212 215 (2013). [3] Knapp, C. and Carter, G.: The Generalized Correlation Method for Estimation of Time Delay, IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. 24, pp. 320 327 (1976). [4] Tachioka, Y. et al.: Direction of Arrival Estimation by Cross-power Spectrum Phase Analysis Using Prior Distributions and Voice Activity Detection Information, Acous. Sci. & Tech., Vol. 33, pp. 68 71 (2012). [5] Johnson, D. and Dudgeon, D.: Array Signal Processing, Prentice-Hall (1993). [6] Haeb-Umbach, R. and Ney, H.: Linear Discriminant Analysis for Improved Large Vocabulary Continuous Speech Recognition, ICASSP, pp. 13 16 (1992). [7] Gopinath, R.: Maximum Likelihood Modeling with Gaussian Distributions for Classification, ICASSP, pp. 661 664 (1998). [8] Anastasakos, T. et al.: A Compact Model for Speakeradaptive Training, ICSLP, pp. 1137 1140 (1996). [9] Povey, D. et al.: fmpe: Discriminatively Trained Features for Speech Recognition, ICASSP, pp. 961 964 (2005). [10] Hinton, G. et al.: Deep Neural Networks for Acoustic Modeling in Speech Recognition, IEEE Signal Processing Mag., Vol. 28, pp. 82 97 (2012). [11] Tachioka, Y. et al.: Discriminative Methods for Noise Robust Speech Recognition: A CHiME Challenge Benchmark, the 2nd CHiME Workshop on Machine Listening in Multisource Environments, pp. 19 24 (2013). [12] Povey, D. and Yao, K.: A Basis Representation of Constrained MLLR Transforms for Robust Adaptation, Computer Speech and Language, Vol. 26, pp. 35 51 (2012). [13] Fiscus, J.: A Post-processing System to Yield Reduced Error Word Rates: Recognizer Output Voting Error Reduction (ROVER), Proc. of ASRU, pp. 347 354 (1997). [14] Tachioka, Y. et al.: A Generalized Framework of Discriminative Training for System Combination, Proc. of ASRU, pp. 43 48 (2013). [15] Povey, D. et al.: The Kaldi Speech Recognition Toolkit, ASRU, pp. 1 4 (2011). [16] Suzuki, T. and Kaneda, Y.: Sound Source Direction Estimation Based on Subband Peak-hold Processing, The Journal of the Acoust. Soc. of Jpn, Vol. 65, pp. 513 522 (2009). [17] Nishiura, T. et al.: Localization of Multiple Sound Sources Based on a CSP Analysis with a Microphone Array, ICASSP, pp. 1053 1056 (2000). [18] Habets, E.: Speech Dereverberation Using Statistical Reverberation Models, Speech Dereverberation (Naylor, P. and Gaubitch, N., eds.), Springer (2010). [19] Boll, S.: Suppression of Acoustic Noise in Speech Using Spectral Subtraction, IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. 27, No. 2, pp. 113 120 (1979). [20] Povey, D. et al.: Boosted MMI for Model and Feature-space Discriminative Training, ICASSP, pp. 4057 4060 (2008). [21] Veselý, K. et al.: Sequence-discriminative Training of Deep Neural Networks, INTERSPEECH (2013). [22] Povey, D. et al.: The Subspace Gaussian Mixture Model A Structured Model for Speech Recognition, Computer Speech and Language, Vol. 25, No. 2, pp. 404 439 (2011). 6