Ψηφιακή Επεξεργασία Φωνής

ΕΛΛΗΝΙΚΗ ΔΗΜΟΚΡΑΤΙΑ ΠΑΝΕΠΙΣΤΗΜΙΟ ΚΡΗΤΗΣ Ψηφιακή Επεξεργασία Φωνής Ενότητα 7η: Βελτίωση Σήματος Φωνής Στυλιανού Ιωάννης Τμήμα Επιστήμης Υπολογιστών

CS578- Speech Signal Processing Lecture 8: Speech Enhancement Yannis Stylianou University of Crete, Computer Science Dept., Multimedia Informatics Lab yannis@csd.uoc.gr Univ. of Crete

Outline 1 Introduction 2 Preliminaries Problem Formulation Spectral Subtraction Cepstral Mean Subtraction 3 Wiener Filtering Estimating the Object Spectrum Adaptive smoothing Application to Speech Optimal Spectral Magnitude Estimation Binaural Representation 4 Model-Based Processing 5 Auditory Masking Frequency-Domain Masking Principles Calculation of the Masking Threshold Exploiting Frequency Masking in Noise Reduction 6 Acknowledgments

Introduction Types of noise: Additive and Convolutional Speech distortion Enhancement foundations: Spectral Subtraction, Cepstral Mean Subtraction Wiener Filter Enhanced speech judgements: by humans, by machines

Additive Noise A discrete-time noisy sequence: with power spectra: Working with STFT: in the frequency domain: Our target: y[n] = x[n] + b[n] S y (ω) = S x (ω) + S b (ω) y pl [n] = w[pl n](x[n] + b[n]) Y (pl, ω) = X (pl, ω) + B(pL, ω) j Y (pl,ω) ˆX (pl, ω) = X (pl, ω) e

Convolutional Distortion A discrete-time convolutional distorted sequence: y[n] = x[n] g[n] where g[n] is the impulse response of a linear time-invariant distortion filter. Working with a frame-by-frame analysis: y pl [n] = w[pl n](x[n] g[n]) In the frequency domain, we can show that: Y (pl, ω) = X (pl, ω)g(ω)

Standard Spectral Subtraction Assuming that noise and target (object) signal are uncorrelated: Estimate of object s short-time squared spectral magnitude ˆX (pl, ω) 2 = Y (pl, ω) 2 Ŝ b (ω) if Y (pl, ω) 2 Ŝ b (ω) 0 = 0 otherwise STFT estimate: j Y (pl,ω) ˆX (pl, ω) = ˆX (pl, ω) e

Spectral Subtraction as a filtering operation We can show: where ˆX (pl, ω) 2 = Y (pl, ω) 2 Ŝ b (ω) Y (pl, ω) 2 [1 + X (pl, ω) 2 R(pL, ω) = Ŝ b (ω) Suppression filter frequency response H s (pl, ω) = 1 R(pL, ω) [ ] 1 1/2 1 + R(pL, ω) ] 1

The role of the analysis window Let x[n] = A cos (ω 0 n) be in a stationary white noise b[n] of variance σ 2 and w[n] be a short-time window. Then: Average short-time signal power at ω 0 : Ŝ x (pl, ω 0 ) = E[ X (pl, ω 0 ) 2 ] A2 2 w[n] 4 Average power of the windowed noise Ratio at ω 0 : where Ŝ b (pl, ω) = E[ B(pL, ω) 2 ] = σ 2 E[ Y (pl, ω) 2 ] Ŝ b (pl, ω 0 ) n= n= = 1 + A2 /4 [σ 2 w ] n= w = w 2 [n] n= w[n] 2 w 2 [n]

Cepstral Mean Subtraction Let y[n] = x[n] g[n]. Then: Logarithm of the STFT of y[n]: Cepstrum: Y (pl, ω) log [X (pl, ω)] + log [G(ω)] ŷ[n, ω] Fp 1 (log [X (pl, ω)]) + Fp 1 (log [G(ω)]) = ˆx[n, ω] + ĝ[0, ω]δ[n] Cepstral filter: where l[n] = u[n 1] ˆx[n, ω] l[n]ŷ[n, ω]

Wiener Filtering Stochastic optimization: if y[n] = x[n] + b[n], find h[n] such that ˆx[n] = y[n] h[n] minimizes e = E[ ˆx[n] x[n] 2 ] Frequency domain solution (Wiener filter): H w = Time-varying Wiener filter: Or where S x (ω) S x (ω) + S b (ω) Ŝ x (pl, ω) H w (pl, ω) = Ŝ x (pl, ω) + Ŝ b (ω) H w (pl, ω) = [ ] 1 1 1 + R(pL, ω) R(pL, ω) = Ŝx(pL, ω) Ŝ b (ω)

Comparing the two suppression filters Solid line: Spectral Subtraction. Dashed-line: Wiener filter

A basic approach We assume that the Wiener filter of p 1 frame is known, then: ˆX (pl, ω) = Y (pl, ω)h w ((p 1)L, ω) Updating the Wiener filter: H w (pl, ω) = ˆX (pl, ω) 2 ˆX (pl, ω) 2 + Ŝ b (ω) Smooth power spectrum: S x (pl, ω) = τ S x ((p 1)L, ω) + (1 τ)ŝ x (pl, ω) where Ŝ x (pl, ω) = ˆX (pl, ω) 2 Initialization: spectral subtraction

Adaptive smoothing Wiener filter estimator adapts to the degree of stationarity of the measured signal. A measure of the degree of stationarity [ 1 π Y (pl) = h [p] Y (pl, ω) Y ((p 1)L, ω) 2 dω π Time varying smoothing constant: where Smooth object spectrum: 0 τ(p) = Q[1 2( Y (pl) Y )] x, 0 x 1 Q(x) = 0, x < 0 1, x > 1 S x (pl, ω) = τ(p) S x ((p 1)L, ω) + [1 τ(p)]ŝ x (pl, ω) ] 1/2

Example of enhancement

Application to Speech Satisfying enhanced speech quality with Wiener filter is obtained if: Window: triangular Frame length: 4ms Frame interval (rate): 1ms OLA synthesis

Example of enhancement in speech

Minimum mean-square Error If y[n] = x[n] + b[n] compute the expected value of: E{ X (pl, ω) y[n]} (Ephraim and Malah, 1984)

Suppression Filter Suppression Filter of Ephraim and Malah H s (pl, ω) = ( ) ( π 1 γpr (pl,ω) 2 1+γ po(pl,ω) ] G where a priori SNR: a posteriori SNR: 1+γ pr (pl,ω) [ γpr (pl,ω)+γ po(pl,ω)γ pr (pl,ω) 1+γ pr (pl,ω) G(x) = e x/2 [(1 + x)i 0 (x/2) + xi 1 (x/2)] γ po (pl, ω) = P[ Y (pl, ω) 2 Ŝ b (ω)] Ŝ b (ω) γ pr (pl, ω) = (1 a)p[γ po (pl, ω)]+a H s((p 1)L, ω)y ((p 1)L, ω) 2 ) Ŝ b (ω)

Binaural Representation Compute the enhanced signal (object) through H s (pl, ω) Compute its complement: 1 H s (pl, ω) Play a stereo signal: i.e., left channel for the object and right channel it complement Illusion: object and its complement come from different directions, and thus there is further enhancement!!!

Model-Based Processing Model-based Wiener Filter: H(ω) = Ŝ x (ω) Ŝ x (ω) + Ŝb(ω) Power spectrum estimate of speech: Ŝ x (ω) = A 2 1 p k=1 âke jωk 2

Stochastic Estimation methods Maximum Likelihood, ML Maximum a posteriori, (MAP) max p Y A (y a) a max p A Y (a y) a knowing the a priori probability p A (a) Minimum-Mean-Squared Error, (MMSE) mean of p A Y (a y)

Example of (L)MAP estimation for Enhancement Solution to the MAP problem requires solving a set of nonlinear equations. Instead we use a linearized approach of MAP: Initial estimation of â 0 Estimate speech spectrum E[x â 0, y] Having a speech estimate, estimate a new parameter vector â 1 Estimate speech spectrum: Ŝ 1 x (ω) = Estimate suppression filter: make iterations A 2 1 p k=1 â1 k e jωk 2 H 1 Ŝx 1 (ω) (ω) = Ŝx 1 (ω) + Ŝb(ω)

Linearized MAP

Auditory Masking Auditory masking: one sound component is concealed by the presence of another sound component. Frequency masking Temporal masking Critical band Masking threshold Maskee Masker

Masking Threshold Curve

Frequency-Domain Masking Principles Physiologically-based/Psychoacoustically-based filters Critical Bands: Bandwidth of Psychoacoustically-based filters Quantized critical bands (Bark Scale): z = 13 arctan (0.76f ) + 3.5 arctan (f /7500) Quantized critical bands (Mel Scale): m = 2595 log 1 0(1 + f /700)

Masking Threshold Calculation Compute energy E k in each kth bark filter in the estimated speech spectrum (after spectral subtraction) Convolve each E k with a spreading function h k : T k = E k h k Subtract a threshold offset depending if the masker is noise-like or tone-like. Map T k to linear frequency scale to obtain T (pl, ω)

Auditory Masking Threshold Curves

Approach 1 Suppression filter: H s (pl, ω) = [1 aq(pl, ω) γ 1 ] γ 2, if Q(pL, ω) γ 1 < 1 a+b = [bq(pl, ω) γ 1 ] γ 2, otherwise where [ Ŝ b (ω) Q(pL, ω) = Y (pl, ω) 2 ] 1/2 Requirements: (a) Estimation of Ŝ b (ω), and (b) a masking threshold curve on each frame T (pl, ω).

Approach 2 From y[n] = x[n] + b[n] go to d[n] = x[n] + ab[n] If h s [n] is the impulse response of the suppression filter, then the noise error is: ab[n] h s [n] b[n] with short-time power spectrum: Constraint: Ŝ e (pl, ω) = H s (pl, ω) a 2 Ŝ b (ω) H s (pl, ω) a 2 Ŝ b (ω) < T (pl, ω) or: a T (pl, ω) Ŝ b (ω) < H s (pl, ω) < a + T (pl, ω) Ŝ b (ω)

Acknowledgments Most, if not all, figures in this lecture are coming from the book: T. F. Quatieri: Discrete-Time Speech Signal Processing, principles and practice 2002, Prentice Hall and have been used after permission from Prentice Hall

Τέλος Ενότητας

Χρηματοδότηση Το παρόν εκπαιδευτικό υλικό έχει αναπτυχθεί στα πλαίσια του εκπαιδευτικού έργου του διδάσκοντα. Το έργο «Ανοικτά Ακαδημαϊκά Μαθήματα στο Πανεπιστήμιο Κρήτης» έχει χρηματοδοτήσει μόνο τη αναδιαμόρφωση του εκπαιδευτικού υλικού. Το έργο υλοποιείται στο πλαίσιο του Επιχειρησιακού Προγράμματος «Εκπαίδευση και Δια Βίου Μάθηση» και συγχρηματοδοτείται από την Ευρωπαϊκή Ένωση (Ευρωπαϊκό Κοινωνικό Ταμείο) και από εθνικούς πόρους. 3

Σημειώματα

Σημείωμα αδειοδότησης Το παρόν υλικό διατίθεται με τους όρους της άδειας χρήσης Creative Commons Αναφορά, Μη Εμπορική Χρήση, Όχι Παράγωγο Έργο 4.0 [1] ή μεταγενέστερη, Διεθνής Έκδοση. Εξαιρούνται τα αυτοτελή έργα τρίτων π.χ. φωτογραφίες, διαγράμματα κ.λ.π., τα οποία εμπεριέχονται σε αυτό και τα οποία αναφέρονται μαζί με τους όρους χρήσης τους στο «Σημείωμα Χρήσης Έργων Τρίτων». [1] http://creativecommons.org/licenses/by-nc-nd/4.0/ Ως Μη Εμπορική ορίζεται η χρήση: που δεν περιλαμβάνει άμεσο ή έμμεσο οικονομικό όφελος από την χρήση του έργου, για το διανομέα του έργου και αδειοδόχο που δεν περιλαμβάνει οικονομική συναλλαγή ως προϋπόθεση για τη χρήση ή πρόσβαση στο έργο που δεν προσπορίζει στο διανομέα του έργου και αδειοδόχο έμμεσο οικονομικό όφελος (π.χ. διαφημίσεις) από την προβολή του έργου σε διαδικτυακό τόπο Ο δικαιούχος μπορεί να παρέχει στον αδειοδόχο ξεχωριστή άδεια να χρησιμοποιεί το έργο για εμπορική χρήση, εφόσον αυτό του ζητηθεί.. 5

Σημείωμα Αναφοράς Copyright Πανεπιστήμιο Κρήτης, Στυλιανού Ιωάννης. «Ψηφιακή Επεξεργασία Φωνής. Βελτίωση Σήματος Φωνής». Έκδοση: 1.0. Ηράκλειο/Ρέθυμνο 2015. Διαθέσιμο από τη δικτυακή διεύθυνση: http://www.csd.uoc.gr/~hy578

Διατήρηση Σημειωμάτων Οποιαδήποτε αναπαραγωγή ή διασκευή του υλικού θα πρέπει να συμπεριλαμβάνει: το Σημείωμα Αναφοράς το Σημείωμα Αδειοδότησης τη δήλωση Διατήρησης Σημειωμάτων το Σημείωμα Χρήσης Έργων Τρίτων (εφόσον υπάρχει) μαζί με τους συνοδευόμενους υπερσυνδέσμους.

Σημείωμα Χρήσης Έργων Τρίτων Το Έργο αυτό κάνει χρήση των ακόλουθων έργων: Εικόνες/Σχήματα/Διαγράμματα/Φωτογραφίες Εικόνες/σχήματα/διαγράμματα/φωτογραφίες που περιέχονται σε αυτό το αρχείο προέρχονται από το βιβλίο: Τίτλος: Discrete-time Speech Signal Processing: Principles and Practice Prentice-Hall signal processing series, ISSN 1050-2769 Συγγραφέας: Thomas F. Quatieri Εκδότης: Prentice Hall PTR, 2002 ISBN: 013242942X, 9780132429429 Μέγεθος: 781 σελίδες και αναπαράγονται μετά από άδεια του εκδότη.