Single-Channel Online Enhancement of Speech Corrupted by Reverberation and Noise

This paper proposes an online single-channel speech enhancement method designed to improve the quality of speech degraded by reverberation and noise. Based on an autoregressive model for the reverberation power and on a hidden Markov model for clean speech production, a Bayesian filtering formulation of the problem is derived and online joint estimation of the acoustic parameters and mean speech, reverberation, and noise powers is obtained in mel-frequency bands. From these estimates, a real-valued spectral gain is derived and spectral enhancement is applied in the short-time Fourier transform (STFT) domain. The method yields state-of-the-art performance and greatly reduces the effects of reverberation and noise while improving speech quality and preserving speech intelligibility in challenging acoustic environments.


I. INTRODUCTION
S PEECH signals captured using a distant microphone within a confined acoustic space are often corrupted by reverberation.The detrimental impact of reverberation on the quality and intelligibility of the speech and on the performance of speech recognition systems is made worse when it is combined with acoustic noise [1]- [4].Combating the damaging effects of reverberation has been a key research topic in recent years driven by an increasing demand for effective methods of speech communication in challenging environments [5].While some progress has been made in both single-and multi-channel processing [4], [6]- [9], the task of providing a blind single-channel dereverberation method robust to noise and suitable for real-time processing remains a challenge.
Most single-channel speech dereverberation techniques can be classified into inverse filtering [10], [11], nonlinear mapping [12], spectral enhancement [6], [13], [14] and probabilistic model-based methods [15]- [17].Inverse filtering methods typically try to reconstruct the original signal by designing an inverse filter for the Room Impulse Response (RIR).Based on the observation that the Linear Prediction (LP) residual of clean speech has a higher kurtosis (fourth-order moment) than that of reverberant speech, the inverse filter of the impulse response is estimated in [10] by maximizing the kurtosis of the LP residual of the inverse-filtered speech.In [11], a similar principle is applied, in which the inverse filter is chosen to maximize the normalized skewness (third-order moment) of the LP residual.These techniques, however, compensate only for the coloration effect caused by the early reflections and must be used in conjunction with other late reverberation suppression methods in order to achieve good dereverberation performance [10], [11].If the RIR is known, or can be estimated, inverse filtering can also be applied using methods in the time or frequency domain [18] or using homomorphic approaches [19], [20].
Nonlinear mapping methods do not assume any explicit model for the reverberation, and instead use parallel training data in order to learn a nonlinear mapping function from the reverberant speech spectrogram to its clean speech equivalent.This can be done using a fully connected Deep Neural Network (DNN) as in [12] where the mean squared error between the output of the DNN and the clean speech log-power spectrum is minimized.Even though results can be improved by also considering first and second-order time derivatives of the input features, speech enhanced by this method can lead to a decrease in overall speech quality [21].
In spectral enhancement methods, a time-frequency gain is applied to the noisy reverberant spectral coefficients in order to estimate those of the clean speech.This gain is based on the estimated power spectral densities (PSDs) of the noise and late reverberation [6], [13].The estimation of the late reverberant PSD is often based on a simple statistical model of the room impulse response such as [6], [22].Spectral enhancement methods are able to reduce both the background noise and reverberation while being computationally efficient, but usually suffer from artifacts introduced by the nonlinear filtering operation, though efforts have been made to alleviate this problem, e.g. by using temporal cepstrum smoothing [14].
In the probabilistic model-based approaches to blind dereverberation, the parameters of the acoustic channel and clean This work is licensed under a Creative Commons Attribution 3.0 License.For more information, see http://creativecommons.org/licenses/by/3.0/speech models are estimated from the observed data and used to reconstruct the original source signal.The reverberation model is typically an FIR or IIR filter in the time domain [15], the complex short-time Fourier transform (STFT) domain [23], [24] or the STFT power domain [16].In [15], the acoustic channel is modeled as a time-varying linear combination of entries from a codebook of all-pole filters, and the speech signal is modeled using a block-based time-varying autoregressive (AR) model.Bayesian inference is used to estimate the joint probability density function (pdf) of the channel and source parameters.The method has been applied successfully on simulated data within a limited frequency range, but difficulties arise when the data does not follow the assumed channel and source models.Bayesian variational inference is used in [16] where an extension of the Multi-Channel Linear Prediction (MCLP) model [25] to power spectrograms in the single-channel case is used.The order of this non-negative auto-regressive reverberation model is determined in a data-driven manner using a Dirichlet process [26].However, the method assumes a noise-free environment, which is unrealistic in practice.In [17], a Non-negative Convolutive Transfer Function (N-CTF) model [8] is used for the RIR and the speech spectrogram is modeled using Non-negative Matrix Factorization (NMF) so as to capture the spectral structure of the speech signal.The two models are then combined to form an optimization problem in which the clean speech spectrogram and RIR parameters are simultaneously estimated through iterative update rules.In [24], the reverberation model is an FIR filter in the complex STFT domain.Processing each subband independently, a recursive expectation-maximization (EM) procedure is used in which the E step estimates the clean-speech coefficients with a Kalman filter and the M step updates a parameter vector comprising the reverberation filter coefficients and the variances of the speech and noise.
In this paper, we present an online method for enhancing reverberant and noisy speech recordings using a combination of spectral enhancement and probabilistic estimation.Enhancement is performed by applying a time-frequency gain to the degraded speech complex STFT coefficients as in spectral enhancement.The estimation of the quantities needed to compute this gain is formulated as a Bayesian filtering problem in which they are jointly estimated along with the parameters of the acoustic channel.The latter is modeled using a nonnegative first-order autoregressive moving-average (ARMA) process parametrized by the reverberation time (T 60 ) and the Direct-to-Reverberant energy Ratio (DRR).The clean speech is modeled by a Hidden Markov Model (HMM) in which each state captures the spectral characteristics of a possible prior distribution of the multivariate speech log-power.At each time frame, the possible clean speech prior distributions are tested through a swarm of nonlinear Kalman filter-like updates.The distribution leading to the highest likelihood for the observed power is kept, leading to a-posteriori estimates of the speech, reverberation and noise mean powers.The performance of the proposed method is evaluated on simulated data through six different objective measures and on live recordings through the Word Error Rate (WER) of a speech recognizer.A listening test was conducted to assess the subjective reverberation reduction and overall quality improvement.The idea of using an HMM whose states represent broad speech sound classes with distinct acoustic spectra has been applied previously to speech enhancement [27]- [30].In these papers, a state-dependent spectral shape was multiplied by a time-varying speech gain to obtain prior distributions for the speech spectral coefficients; these priors were then used to determine an MMSE or MAP estimate of the clean speech spectrum in an appropriate domain.In the current work, this approach is extended to include an explicit model of reverberation and to track the time-variation of both the reverberation model parameters and the speech gain.
The paper is organized as follows.The non-negative ARMA reverberation model and HMM clean speech model are described in Section II and an overview of the overall enhancement system is given in Section III.In Section IV, the Bayesian filtering formulation of the problem is detailed as well as the computation of the posterior densities and the online estimation of the reverberation parameters.Results are presented in Section V and conclusions drawn in Section VI.

II. SIGNAL MODEL AND NOTATION
In the system block diagram shown in Fig. 1, the enhancement of the noisy and reverberant speech is performed in the STFT domain, while the estimation of the system parameters and signal powers is performed in Mel-spaced subbands.A filterbank comprising triangular filters [31], [32] is used to transform the power spectrum of each frame from K STFT bins to a reduced number, K, of Mel-spaced subbands.The use of these broad subbands has two benefits: it reduces the dimension of the state vector, x l , in (14) below and it reduces the number of states required in the speech model described in Section II-B.This is because the filterbank removes narrowband features such as pitch harmonics whose variability would otherwise need to be included in the model.

A. Reverberation Model
Let y(n) denote the observed reverberant noisy speech signal at discrete-time n.The additive background noise signal is denoted by ν(n) and the reverberant speech signal is obtained by convolving the clean speech source s(n) and the J-tap RIR between the source and microphone, ρ(n), as ( The complex STFT coefficients of the observed signal are then computed according to where l is the time-frame index, k is the STFT frequency bin, w(n) a time-domain window, and T the frame increment.A power-domain filterbank is applied to compute the power in K Mel-spaced subbands as where the b k, k implement the triangular filters from [31], [32].
Analogous to (3), N (l, k) denotes the subband noise power.For the speech signal, however, we divide by the band-independent active speech level [33], G(l), to obtain the level-normalized subband speech signal The decomposition of the speech power into the product of a time-varying active level, G(l), and a level-normalized spectral shape, S(l, k), is similar to that in [28], [30] and allows the prior distribution of S(l, k) to be trained offline using levelnormalized training data.
Based on an approximation of (1) in the STFT domain in which cross-band filters are neglected, the N-CTF model was proposed in [8] to approximate the power spectrogram of a reverberant signal.In this paper, we assume this model to apply in each Mel-frequency band, k, and also assume that the reverberant speech and noise are additive in the power domain, resulting in where L h is the RIR length in the STFT domain.The errors introduced by assuming that the signals add in the power domain are discussed further in Appendix B.
Polack proposed a time-domain statistical model [22] of the RIR as scaled exponentially-decaying white Gaussian noise parametrized by the broadband reverberation time T 60 .Noting that the latter is normally frequency-dependent [34], this model was extended in [6] to each subband and split into two statistical sub-models: one containing the direct path, the other comprising all later reflections.In this paper, we assume the exponentially-decaying model is valid in each Mel-frequency band, model the direct path deterministically and only consider the energy envelope of the impulse response so that where δ(l) is the Kronecker delta function and u(l) is the unit step function.The decay constant, α k , in Mel-frequency band k, is related to T 60,k through where T is the STFT frame hop.The drop in energy after the direct path, d k , is related to the frequency-dependent DRR by the equation Substituting the drop and decay reverberation model of (6) into the observed power of (5), we obtain where the reverberation power in time-frame l and frequency band k is The model of ( 10) can be written recursively as Equations ( 9) and ( 11) correspond to a first-order ARMA model for the acoustic channel in the spectral power domain having the system function z −α +d z −α .This parsimonious model contrasts with the higher order moving average or autoregressive models used by [23] and [24] respectively in the complex STFT domain.By writing the frequency-dependent quantities in (9) and (11) as column vectors of length K, we can write the system's dynamic equations as Y l = Gl Sl + Ȓl + N l (13) where is the Hadamard product.In the following, uppercase letters represent random variables, the corresponding lower case letters their realizations, and estimates are denoted by ˆ.Means and covariances are denoted by μ, Σ with the random variable as a suffix.Unadorned signal variables are in the log-power domain and the corresponding power domain quantities indicated by a ˘; thus y l = log(y l ).A sequence of consecutive frames is represented using a colon; thus y 1:l denotes {y 1 , . . ., y l }.We assume below that the log-power spectra, S l , R l and N l , follow multivariate Gaussian distributions [35].

B. Clean Speech Model
The log-power, S l , of the level-normalized clean speech is modeled by an HMM with N states in which the state at time frame l is denoted by c l .Associated with each state is a prior distribution for the multivariate clean speech log-power, so that p(S l |c l ) ∼ N (μ S c l , Σ S c l ) where the μ S c l and Σ S c l are trained offline using the training procedure discussed in Section V-A1.
We denote by c l the path, {c 1 , c 2 , . . ., c l } ending in c l .For each possible state, c l , at time frame l, we consider the N possible paths {c l−1 , c l } and select the one with highest likelihood Fig. 2. Bayesian gain computation system described in Section IV.
as c l (see (25) below).Thus, for each time frame, we end up with N hypothesized paths, c l , one for each of the N states.

III. SYSTEM OVERVIEW
To perform enhancement, the reverberant noisy speech signal, y(n), is processed by applying a real-valued magnitude gain to its complex STFT coefficients in order to obtain the estimated clean speech signal ŝ(n).This gain is first computed in each Mel-frequency band at each time-frame and then interpolated to cover the full STFT frequency range, as illustrated in Fig. 1.

A. Clean Speech HMM
As we want to track the system parameters over time, the computation of the spectral gain in the Mel-frequency bands, shown as the upper block in Fig. 1, uses a Bayesian filtering formulation that is illustrated in Fig. 2.This includes the clean speech HMM which encapsulates prior speech knowledge in the form of state transition probabilities and state-dependent log-power spectral distributions.
We define of size 2K + 1, to be the state representation of our system at frame l.Note that x l includes the reverberation and noise parameters for all subbands in a single state vector in contrast to algorithms such as [23], [24] in which each subband is processed independently.The inclusion of all subbands in a single state vector enables our algorithm to take account of inter-band correlations of the reverberation and noise parameters.
For each of the N best paths, c l−1 , the "Prediction" block in Fig. 2 estimates the prior distribution of x l from the pathdependent posterior distributions of x l−1 and S l−1 .To do so, it uses the current estimate of the reverberation parameters contained in the vector π l−1 (defined fully in (53) below).For each of these paths, N new possibilities arise, corresponding to the possible prior distributions for the clean speech log-power associated with each HMM state c l .This gives N 2 possible likelihood functions for the observed log-power y l , corresponding to the N 2 possible choices {c l−1 , c l }.The "Likelihood Computation & Pruning" block then computes the likelihood of each of the N 2 paths.Only the path arriving at each c l with the highest likelihood is kept, and new path-dependent posterior distributions for x l and S l are computed, as described in Section IV-B2.

B. Gain Computation
For each time-frame l, we obtain the Gaussian posterior densities of the state vector x l and clean speech log-power S l conditional on the HMM path, c l as described in Section IV-B.From these, an updated estimate for the reverberation parameters π l is computed as described in Section IV-C.The path probabilities, p(c l |c l−1 , y 1:l ), are normally extremely sparse in practice; in the final block of Fig. 2, we therefore compute the speech enhancement gain, W l , from the posterior pdfs of the clean speech, reverberation and noise log-powers associated with the most probable path.From the mean and covariance of the distribution in the log-power domain, we obtain the mean of the corresponding distribution in the power domain using the formulae relating the moments of a normal distribution in the log-power domain to those of a log-normal distribution in the power domain [36]: where diag(Σ x l ) is the vector composed of the diagonal elements of Σ x l .Similarly, we can obtain μ Sl from the mean and covariance matrix of its log-domain distribution.We can then directly extract the estimated means of Ȓl , N l , Gl and Sl .According to (13) we have Y l = Gl Sl + Ȓl + N l , and we wish to compute an estimate of the clean speech power Gl Sl as where W l is a magnitude gain.This is a form of spectral subtraction [37], [38], and a general form for the gain W l is where the division and power operations act elementwise on the vectors.η is the oversubtraction factor, and controls how aggressively the processing is applied.Depending on the value of the exponent β, several forms of spectral enhancement can be obtained.The value of β determines the sharpness of the transition from W l (k) = 1 to W l (k) = 0 [39], with β = 1 (corresponding to Wiener-Filtering) achieving more aggressive processing than β = 1 2 .Since the estimation of the posterior density of Sl is based on a discrete choice of priors at each time-frame, the resulting estimated μ Sl is highly time varying.Accordingly, we perform smoothing of the gain in the time domain according to where λ s is the smoothing constant.Finally, as indicated in Fig. 1, we use linear interpolation to map the gain, W l , from K Mel-spaced bands onto the full STFT resolution.The effect of this interpolation is to smooth the gain function in frequency, which helps to reduce artifacts such as musical noise.

IV. BAYESIAN ESTIMATION
In this section, we are concerned with the computation of the posterior densities of the state vector x l and clean speech log-power S l in order to be able to perform the gain computation described in Section III-B.The general structure of the proposed algorithm is illustrated in Fig. 2 and detailed in Section IV-A.Section IV-B describes the computation of the means and covariance matrices of the Gaussian pdfs involved, while Section IV-C details how to update the reverberation parameters estimate.
We denote by μ x l and Σ x l the mean and covariance matrix of the probability density function of x l .Given c l , the HMM state at time l, we have available from the training data and as detailed in Section II-B the corresponding mean μ S c l and covariance matrix Σ S c l of the prior distribution p(S l |c l ).
We can describe our system dynamics with the following nonlinear prediction and observation equations: where α l−1 and d l−1 are assumed to be system parameters, fixed for time-frame l.From (21), we see that the speech gain, G l , and the noise log-power, N l , follow a Gaussian random walk.The function h : R 3K +1 → R K in (20) implements (13) as h(x l , S l ) = log (exp(G l + S l ) + exp(R l ) + exp(N l )).The nonlinear functions f and h are both differentiable as required for implementing the extended Kalman filter update described in Section IV-B below.The covariance of l is Q l and represents the variance of the errors in the prediction model, (21).Similarly, M l , the observation noise covariance, represents both the errors inherent to the statistical properties of the input data and those introduced by assuming that uncorrelated signals add in the power domain; expressions for the two components of M l are derived in Appendices A and B respectively.
From our system equations, we can derive several conditional independencies.Given x l and c l , p(y l |x l , c l , y 1:l−1 ) = p(y l |x l , c l ).We also have P (c l |c l−1 , x l−1 , y 1:l−1 ) = P (c l |c l−1 ) using pre-trained transition probabilities.

A. State Sequence Estimation
We want to maximize the joint likelihood of the path through the HMM and the sequence of observations, marginalizing over the system state x l .Assume we know p(y 1:l−1 , c l−1 ), the probability of a path up until time l − 1, as well as the posterior density functions p(x l−1 |y l−1 , c l−1 ) and p(S l−1 |y l−1 , c l−1 ).
We can compute: where For each of the N possible c l−1 , we use the posterior densities p(x l−1 |c l−1 , y l−1 ) and p(S l−1 |c l−1 , y l−1 ) to compute the prediction stage (24) as described in Section IV-B1.For each of these paths, there are N possible clean speech prior distributions corresponding to each c l , creating N 2 possible paths {c l−1 , c l } for which the likelihood of the observation ( 23) is computed.Only the best path arriving at each c l is kept, so that ∀c l , ĉl = arg max For each of the N retained paths, the posterior densities of x l and S l are computed as described in Section IV-B2.

B. Posterior Densities Computation 1) Model Prediction Step:
The "Prediction" block of Fig. 2 computes the path-dependent Gaussian prior densities p(x l |c l−1 , y l−1 ).We define F l−1 to be the Jacobian matrix of f from the prediction equation ( 19) evaluated at μ x l −1 and μ S l −1 .It can be written as (26) with Let us now define the augmented state Keeping only the first two terms from the Taylor series for f [40] gives the following linear approximation: (28) Computing the expected value gives us : which in turn gives the following covariance matrix: If we now introduce f ( Cov Therefore, by writing (19) as x l = f (x l−1 ) + l , we can approximate the conditional joint probability of x l−1 and x l by a Gaussian distribution with the following moments: where where the means and covariance matrices of x l−1 and S l−1 are the moments of the posterior distributions p(x l−1 |c l−1 , y l−1 ) and p(S l−1 |c l−1 , y l−1 ).We therefore have for the marginal probability density of x l : with giving us the solution to (24).

2) Observation Update
Step: This section describes the "Clean Speech HMM" and "Likelihood Computation" blocks of Fig. 2.These compute the likelihood of the observation p(y l |c l , c l−1 , y l−1 ) for each of the N 2 possible paths {c l−1 , c l } as well as the posterior densities of the state vector and clean speech log-power, p(x l |c l , c l−1 , y l , y l−1 ) and p(S l |c l , c l−1 , y l , y l−1 ).
The assumption in (9) that speech, reverberation and noise powers add to form the observed power imposes a nonlinear constraint in the log-power domain.Similar to the derivations in Section IV-B1, we can use a first order Taylor series approximation of h in the observation equation (20) to obtain mean and covariance for the approximately Gaussian joint distribution of y l and x l .We define H l as the Jacobian matrix of h(x l , S l ) evaluated at (μ x l |c l −1 , μ S c l ) so that The mean, μ x l |c l −1 , and covariance matrix, Σ x l |c l −1 , of the predicted pdf of x l for the path originating at c l−1 are given in (37), (38).The mean, μ S c l , and covariance matrix, Σ S c l , of the prior pdf associated with state c l are learned during training.
Using similar derivations to ( 28)-( 32), it follows that for the path defined by {c l−1 , c l } we have: where The observation noise covariance matrix, M l , in (42) represents the uncertainty between the model of ( 13) and the actual observations.It is the sum of a fixed component that is a function of the filterbank parameters b k, k in (3) and another that depends on the estimated mean and variance of the observation, Y (l, k).Detailed expressions for these two components are given in Appendices A and B respectively.
We therefore have the likelihood of the observation with and the posterior pdf of x l [40], [41] p( with which uses a similar approach to the implementation of an Extended Kalman Filter (EKF).Equations ( 43)-( 45) can then be used to compute the joint likelihood of the observations and sequence of states in (22).Using a similar method to (40)-( 42), we can approximate the joint distribution of the observation and clean speech log-power as a Gaussian distribution to obtain with The N -best paths can then be pruned according to (25), and the associated posterior densities can then be used in order to update the reverberation parameters estimate π l and compute the gain W l .Numerical errors can arise when computing Σ y l using (45) that may lead to the estimated covariance matrix being non-positive definite and preventing the computation of the likelihood of the observation.This can especially happen when the observation noise is very low.Though not described in detail here, this problem can be solved by implementing the Square Root version of the Extended Kalman Filter-type update (SR-EKF).By factorizing Σ x l and Σ S l in a U DU T form where U is a unit upper triangular matrix and D is a diagonal matrix, we can carry the updates on both these matrices and ensure that the covariance matrices of p(y l |c l , c l−1 , y l−1 ), p(x l |c l , c l−1 , y l , y l−1 ) and p(S l |c l , c l−1 , y l , y l−1 ) remain positive-definite.This is achieved by using the Bierman-Thornton SR-EKF, which is a combination of the Square-Root implementations proposed in [42], [43].
3) On the Approximation of Transformed Distributions: In this section we look at how well the Taylor series approximation of h allows us to approximate the transformed pdfs.To do so, for clarity we consider the 2-dimensional case with random variables A and B, in which we assume no observation noise is present.We assume that A and B are jointly Gaussian as in Fig. 3 (a) where the log-probability density values have been scaled to match the displayed colormap.On the plot, the mean is marked by a cross and the unit standard deviation contour by an ellipse.The dotted line indicates the constraint log (exp(A) + exp(B)) = 0 analogous to (9).We can approximate the constrained distribution (i.e. the posterior distribution) by computing the empirical mean and covariance of the points lying on the contour log (exp(A) + exp(B)) = 0.The Gaussian distribution with the empirical mean and covariance is shown in Fig. 3 (b).
The constrained distribution computed using a first order Taylor series approximation of the nonlinear constraint is shown in

Fig. 3 (c).
There is a large underestimate of the variance in the direction orthogonal to the tangent of the non-linear constraint.This can be explained by the first order linearization of the constraint, which forces the constrained distribution to lie on the tangent.If the original unconstrained distribution is very close to one of the extremes of the constraint, corresponding to a highly positive or highly negative SNR, this approximation is accurate.However, the approximated covariance is too small at the maximum curvature point of the constraint.
Although not used in our implementation, these approximation inaccuracies can be reduced by using a second-order Taylor series approximation of our constraint which gives the approximated constrained distribution shown in Fig. 3 (d).The result is closer to the empirically computed distribution, suggesting that better results could be achieved using a second-order Taylor series approximation in Section IV-B2.This adds an additional term to the covariance matrix of the marginal distribution of the observation, of the form i,j e i e T j tr H (i)  xs Σ x l H (j )  xs Σ x l (52) with xs the Hessian of h at output dimension i, tr[.] indicating the trace of the matrix, and e i = [0, 0, . . ., 1, . . ., 0, 0] T where the 1 is at position i.As this requires substantial additional computation, we instead use a first-order approximation with an additional observation noise term compensating for the underestimated covariance while remaining computationally efficient.A detailed derivation of this additional noise term is given in Appendix B.

C. Reverberation Parameters Estimation
In Sections IV-A & IV-B, the reverberation parameters α and d are assumed fixed in order to compute the moments of the probability distributions in the prediction step.However, as we do not assume a perfect initialization for these parameters, and as the DRR can change dynamically due to movement of the speaker or changes in the acoustic environment, we need to update our reverberation parameters estimates adaptively.
We define to be the vector of transformed reverberation parameters, where we map the range (0, 1) to (−∞, +∞) to avoid the need for range constraints on the elements of π l .In the following, we identify global random variables that take into account all paths in the HMM with an overbar, ¯.We define the following dynamic equations describing the evolution of the reverberation parameters: where μR l , the mean of the global posterior density of R l , acts as observation, μR l −1 and μS l −1 act as fixed system parameters, ω l ∼ N (0, U l ) and ψ l ∼ N (0, V l ).U l controls how much the reverberation parameters are allowed to change from one frame to the next, while V l represents errors in the model of ( 12), of which g is a direct implementation.Assuming we have for each of the N paths c l the posterior pdfs of x l and S l , we can compute the global posterior densities as with the normalized path probabilities defined as p(c l |c l−1 , y 1:l ) = p(c l , y l , c l−1 , y 1:l−1 ) and similarly for p( Sl |c l−1 , y 1:l ).The means of these global pdfs are then directly calculated as the weighted sum of the means of each individual path mixture.The mean of the global posterior distribution of the reverberation log-power, μR l , is directly extracted from that of x l .From ( 54)-( 55) we can therefore obtain the first and secondorder moments of the posterior distribution for π l using: where e Rl = μR l − g(μ π l −1 , μR l −1 , μS l −1 ) is the error in the predicted mean reverberation power, is the covariance matrix of the predicted π l of ( 54), is the Jacobian matrix of g and The resulting algorithm is therefore a two-stage approach.First we fix the reverberation parameters in order to compute the likelihood of each path in the HMM, so as to get the posterior probability densities of x l and S l for the best path arriving at each possible state in the HMM.Then, the means of the global posterior densities are computed and fixed in order to update the reverberation parameters using ( 58)- (59).

V. PERFORMANCE EVALUATION
The evaluation of the proposed algorithm on actual reverberant noisy data is divided into two parts.First, because most objective metrics for speech quality and intelligibility are intrusive, we generate simulated reverberant data by convolving anechoic speech with measured room impulse responses so that we can have access to the original clean speech.Second, the algorithm is tested on real data, i.e. actual reverberant and noisy recordings for which no target clean signal is available.
We compare our method with the single-channel scheme of Cauchi et al. [14] as it was the only single-channel method participating in the REVERB Challenge [5] which managed to reduce the perceived amount of reverberation appreciably while significantly improving the overall speech quality [21].We therefore consider this competing method to be state-ofthe-art.The parameters of this competing method correspond to those described in [14] and the implementation was generously provided by the author.A difference between the two algorithms is that [14], although a spectral enhancement algorithm suitable for real-time implementation, requires an external estimate of the broadband T 60 which is obtained using the utterance-based algorithm presented in [44].The proposed method, in contrast, does not require prior knowledge of the reverberation parameters and is implemented in an online manner computing the spectral gain at each time frame.On a laptop equipped with an Intel Core i5 processor, the average real-time factors of the two methods were measured to be 0.17 for [14] and 3.65 for the proposed method.An implementation in MATLAB of the proposed method is available as spendred.m in the VOICEBOX toolbox [32].

A. Implementation Details 1) HMM States Learning:
To train the mean and covariance matrices of each state in the HMM, we use a purely data-driven technique, which gives us the ability to work with any clean speech dataset with the minimum amount of adaptation effort.This can also help to make the set of states less languagedependent and to provide better generalization.To determine a representative set of states, we used the k-means [45], [46] feature-learning technique, as it remains a method of choice in many practical scenarios thanks to its scalability [47].Viewing the k-means algorithm from a Bayesian perspective, minimizing the Euclidean distance is equivalent to maximizing the likelihood of the clusters according to Gaussian distributions with identity covariance matrices.This fits well with the assumptions of our model, and we can perform the clustering directly on the Mel-frequency log-spectral powers.We used the k-means implementation available in [32], and computed 15 separate  instances with random initialization for N , the number of clusters, varying from 2 to 14.Such a low number of states may seem surprising, as a much higher number of dictionary elements has been reported to be necessary in speech enhancement applications using NMF-based techniques [48].Here, however, we look at log-power spectral frames on a Mel-frequency scale having broad frequency bands and the learned states are used only to provide prior probabilities in a Bayesian inference context rather than used directly in a Wiener filter as in [28], [30].We used the training set of the TIMIT database [49], normalized the input speech signals to 0 dB active level [32], [33], obtained STFT frames of 30 ms with 5 ms frame increment and computed the log-power in each Mel-frequency band for each frame.
The Bayesian Information Criterion (BIC) [50] was computed for each value of N according to and is plotted in Fig. 4. L is the likelihood of the observed data and n is the number of data points in the observed data.From a clustering point of view, the BIC gives an idea of how well the clusters can explain the whole dataset.It appears from Fig. 4 that the BIC does not improve significantly for 10 clusters or more.However, from an inference point of view, the HMM states are only used as possible prior density functions for the clean speech, reducing even further the need for a set of states able to perfectly represent any clean speech signal directly.This allows us to use a low number of states, and in our experiments we have chosen N = 4 or N = 6.The state means obtained for N = 4 are shown in Fig. 5 (a); these states correspond approximately to a silence state, a voiced state, an unvoiced state, and a voiced/unvoiced combination.The state means obtained for N = 6 are shown in Fig. 5 (b); the first four are similar to those of Fig. 5 (a) while the remaining two correspond to additional voiced spectra.The results for simulated data are presented for an implementation with 4 states in Section V-C, while both 4 and 6-state implementations are used to evaluate the performance on live recordings in Section V-D.
2) Algorithm Parameters: In order to obtain better dereverberation and denoising performance, we used β = 1, i.e. a Wiener gain, η = 2 and λ s = 0.95 in (18).We have found that the proposed algorithm is not very sensitive to the initial values used for the reverberation parameters and the same initial values were used for all reported experiments.The initial values for the frequency-dependent α were chosen to correspond to the subband T 60 values averaged over all RIRs in [51] and [52].We initialized d to correspond to linearly spaced subband DRR values ranging from −2 dB in the lowest Mel-frequency band to 8 dB in the highest band according to (8).The first 100 ms of each recording were assumed to be noise and were used to initialize the mean and covariance of the noise log-power in x 0 .Reverberation log-power was initialized at 10 dB below the noise and the clean speech global gain was initialized to −5 dB.The STFT analysis used 30 ms Hann-windowed frames with a frame increment of 5 ms.The number of Mel-frequency bands was set to K = 25.

B. Evaluation Metrics
Six different metrics were used in order to evaluate the algorithms: the Cepstrum Distance (CD) [53], the Frequencyweighted Segmental SNR (FWSegSNR) [54], the Reverberation Decay Tail (R DT ) [55], the normalized Speech-to-Reverberation Modulation energy Ratio (SRMR norm ) [56] (available at [57]), the Short-Time Objective Intelligibility score (STOI) [58] (available at [59]) and the Perceptual Evaluation of Speech Quality (PESQ) [60].The STOI scores were mapped to a percentage of words correctly recognized using the mapping function provided in [58] in order to make results easier to read and interpret.The implementations of CD and FWSegSNR were taken from [5], while we used a direct implementation of [55] for R DT .
CD has been reported to be well correlated with the overall quality of processed noisy speech as well as the perceived level of reverberation [21], [61], [62].However, conflicting results have been found regarding its correlation with the overall quality of enhanced reverberant speech [21], [62], and it has been found to correlate poorly with speech intelligibility [63].R DT and SRMR norm have been found to correlate well with the perceived level of reverberation [62], [64].The FWSegSNR and PESQ measures have generally been reported to correlate well with overall quality and intelligibility [58], [61]- [63].Finally, STOI has been found to be highly correlated with intelligibility for time-frequency weighted noisy speech [58].

C. Simulated Data
In order to test the performance of our algorithm in challenging scenarios, we use the Acoustic Characterisation of Environments (ACE) Challenge Corpus [65], which was developed to evaluate algorithms for blind estimation of acoustic parameters in the presence of noise.The corpus provides multi-channel RIRs as well as noises recorded in-situ for various acoustic spaces (lecture rooms, offices, meeting rooms, lobby).For the single channel case, measured impulse responses and corresponding noises are provided for two different source-receiver positions within each room.There are three noise types: fan noise, ambient noise and babble noise.All noise types were recorded in situ using identical microphone configurations and are therefore consistent with the measured impulse responses.The babble noise was recorded using actual talkers in each room, and the RIRs were measured with the talkers still present inside the room.
From the ACE challenge clean speech corpus, we selected sound files from 14 speakers in total (5 females and 9 males), each uttering a free-speech sentence approximately 10 seconds long describing where they live.The anechoic speech files were convolved with one of 8 RIRs corresponding to 4 different rooms and 2 source-microphone positions within each room.Table I gives the broadband T 60 and DRR values measured from the impulse responses using [66] and [11].
For each measured impulse response, the corresponding ambient, fan and babble noises were used and random portions of these recordings were added to the reverberant speech at 0, 10 and 20 dB SNR.This makes a total of 1008 noisy and reverberant speech files.
First, in order to assess the dereverberation performance of our algorithm, we show in Fig. 6 the average score for each metric in the case of 20 dB SNR only, averaging the results over the three noise types.An SNR of 20 dB is still a realistic environment, but the noise has a limited degradation effect and therefore we expect the results to be dominated by the dereverberation performance of the methods.
The proposed method leads to the lowest Cepstral Distance (plot a), highest Frequency-weighted Segmental SNR (plot b) and lowest reverberation decay tail (plot c) for all acoustic conditions, suggesting the proposed method achieves better dereverberation performance than [14].Both algorithms yield very similar PESQ (plot f) and STOI (plot e) results, with a slight improvement of predicted intelligibility in the most reverberant case (D), and a near-constant PESQ improvement of about 0.2 over unprocessed speech.This seems to suggest the proposed method improves speech quality as much as the competing Fig. 6. Results comparing the two speech enhancement methods on simulated data (a) -Cepstrum Distance (dB), the lower the better (b) Frequency-Weighted Segmental SNR, the higher the better (c) -Reverberation Decay Tail, the lower the better (d) -Normalized version of Speech to Reverberation Modulation Energy Ratio, the higher the better (e) -STOI scores mapped to words correctly recognized in %, the higher the better (f) -PESQ scores, the higher the better.
one without degrading intelligibility.The proposed method achieves better results than unprocessed speech with respect to the SRMR norm metric, but does less well than [14].This contradicts the R DT result as it suggests a higher perceived reverberation than [14], however the validity of the SRMR norm metric for use with processed speech signals has not been studied.
In order to study the robustness of both methods to noise, we show box plots of the differential (Δ) scores obtained for each metric, separated for each SNR and each noise type and averaged across all acoustic conditions.On the box plots, the interquartile range is shown by a coloured box, the median of the distribution is shown by a horizontal line, and the mean of the distribution is shown by a circle.For each result, a 0 score indicates no change in the metric compared to unprocessed speech, and a positive result indicates a higher metric score.Fig. 7 indicates that, apart from the babble noise case, the proposed algorithm achieves lower Cepstral Distance than [14], especially at low SNRs, indicating that it is better able to deal with heavy noise.Furthermore, as can be seen in Fig. 8, the  higher FWSegSNR scores achieved by the proposed method in all cases seem to suggest better dereverberation as well as better noise reduction properties.
Fig. 9 shows that even when the SNR is low, the proposed algorithm achieves lower R DT scores than [14].This indicates that even in the presence of heavy noise, it is able to reduce the decay tail of the reverberation significantly.Unsurprisingly, both methods achieve very low R DT scores in babble noise.Indeed, with the ACE challenge corpus the babble noise was recorded using talkers in situ, giving much more information about the acoustic properties of the whole recording.Fig. 10 confirms the earlier observation that the proposed method achieves lower SRMR norm scores compared to [14], although they are almost always greater than those of the unprocessed speech.
As can be seen in Fig. 11, the predicted intelligibility is slightly worse with the proposed method than with [14].However, as was seen in Fig. 6 (e), the predicted intelligibility of the test signals was well above 90% in all cases so these small differences will have little effect.The PESQ scores, shown in Fig. 12, show a consistent improvement for both algorithms relative to unprocessed speech with the proposed method having marginally higher scores than [14].
Overall, it seems the proposed method achieves better dereverberation and denoising performance while improving speech quality and preserving speech intelligibility.It also seems that [14] deals with babble noise slightly better, which is unsurprising since our clean speech model cannot distinguish babble noise from wanted speech.Tests using 6 states in the HMM were also carried out, but the results were almost identical to those using 4 states and are not presented here.

D. Real Data
We used the real data section of the evaluation set of the RE-VERB Challenge [21], which corresponds to the Multi-Channel Wall Street Journal Audio Visual Corpus [67].The data was recorded in a room using real talkers and at two different sourcemicrophone positions, i.e. near and far.Because no reference signal is available, and in order to gain some insight into how well the dereverberation methods worked on this dataset, we used the baseline ASR systems from the REVERB challenge  to obtain WER scores.The proposed algorithm was evaluated using both 4 states and 6 states in the HMM.
All methods were tested on two baseline speech recognition engines from [21].The baseline systems were both based on HTK, using a triphone GMM-HMM recognizer that has been trained on clean speech data only.One version of the engine used Constrained Maximum Likelihood Linear Regression (CMLLR) speaker adaptation while the other did not.Fig. 13 shows the reduction in WER achieved by [14] and by the proposed method using a 4 or 6-state HMM.
The proposed method achieves lower WER than unprocessed speech, with significantly better results obtained when using a 6-state HMM for the clean speech model, but still higher WER than the competing method.Although the audible quality of the recordings has been substantially improved, we believe that our method may introduce more artifacts detrimental to such ASR systems than [14].Audio recordings processed by the 6-state implementation as well as the listening test results presented below are available from http://www.commsp.ee.ic.ac.uk/˜sap/sicenspeech/.

E. Listening Test
Although objective metrics are a good indication of an algorithm's performance, it has been hypothesized that no instrumental measure can capture the subjective sense of overall speech quality [21].Therefore a listening test similar to the multi-stimuli with hidden reference and anchor (MUSHRA) [68] test was used in order to assess the overall quality and amount of perceived reverberation before and after processing.The ambient noise level and the headphones used in the experiment were not controlled and varied between participants.
The 13 self-reported normal-hearing participants, all experts in acoustic signal processing, each performed 8 tests: 4 tests rating the perceived level of reverberation on a scale ranging from 0 (not reverberant) to 100 (very reverberant), and 4 tests rating the overall speech quality on a scale going from 0 (bad) to 100 (excellent).Post-screening was performed after the test in order to remove results where participants failed to identify the hidden reference.
For each test, the participants were asked to compare four randomly-ordered unmarked samples: (i) a hidden reference, (ii) a noisy reverberant anchor signal, (iii) the anchor signal processed by [14] and (iv) the anchor signal processed by the proposed method with a 6-state HMM.The hidden reference was a clean speech utterance convolved with an RIR from [52] with very low T 60 (0.18 s) and high DRR (5 dB), as in MUSHRA for reverberant speech or MUSHRAR proposed in [64].To form the anchor signals, clean speech utterances from the ACE challenge corpus [65] were first convolved with RIRs B, D, E and H from Table I to create reverberant signals with 0.38 s ≤ T 60 ≤ 1.29 s and −2.27 dB ≤ DRR ≤ 5 dB.For the tests that evaluated speech quality, these reverberant signals were then degraded by adding "ambient noise" from [65] at 0 dB or 10 dB SNR.For the tests that evaluated reverberation reduction they were degraded by adding "babble noise" from [65] at 30 dB SNR.The results are shown in Fig. 14 which presents differential MUSHRA scores between the unprocessed reverberant and noisy speech and the two processed versions.These differential scores can be viewed as measuring the overall quality improvement and reverberation reduction provided by each enhancement method.To assess the significance of the observed differences in mean MUSHRA scores, a two-sample t-test was used with Satterthwaite's approximation for unequal variances [69].
The proposed method always has lower perceived reverberation than the unprocessed speech.It consistently achieves higher reverberation reduction than [14] and the difference in mean performance was statistically significant (P < 5%).In most cases, the proposed method also improves on the quality of the unprocessed speech although, in a minority of cases, it appears that the strong reverberation and noise reduction applied by the algorithm leads to a small degradation in perceived quality.In most cases, the proposed method gave higher quality than [14] although the difference in mean improvement was not statistically significant at the 5% level.
From these results, the proposed algorithm is especially suited to situations with high levels of reverberation and/or noise.We believe that the algorithm is able to achieve large reductions in both noise and reverberation because it estimates them jointly rather than independently and also because its use of a speech model allows it to take advantage of correlations between frequency bands.In applications with lower levels of reverberation and noise, the method of [14] may be preferred since it has lower computational requirements and almost never degrades the perceived quality.

VI. CONCLUSION
In this paper, we have presented a novel blind single-channel approach to the online dereverberation problem.Using an ARMA model for the reverberation power and a Hidden Markov Model for the clean speech log-power, a spectral gain is computed in order to achieve good dereverberation performance.This real-valued gain is computed for each frame after jointly estimating posterior distributions of the acoustic parameters and speech, reverberation, and noise log-powers.The algorithm achieves very good dereverberation and denoising performance while improving speech quality and preserving speech intelligibility.Listening tests showed excellent audible quality of the speech signals processed by the proposed method.

APPENDIX A OBSERVATION NOISE
The complex STFT coefficients of the degraded speech observation can be modeled as zero-mean complex Gaussians in each time-frequency bin using the central limit theorem.Using Y • (l, k) to denote the complex STFT coefficient of the observed speech at time frame l and at STFT frequency bin k, Y • (l, k) ∼ N 0, σ(l, k) 2 .We therefore have where {Y • (l, k)} 2 and {Y • (l, k)} 2 are independent zeromean Gaussians with variance σ (l, k ) 2

2
. It follows that As we formulated the problem in Mel-frequency bands, the power in STFT frequency bins of each time frame are then weighted and summed according to our filterbank.We assume the resulting weighted sum of Gamma distributed random variables is also approximately Gamma distributed, so that with mean E[ Y l (k)] = σ(l, k) 2 and variance Var[ Y l (k)] = κ k σ(l, k) 4 .The values κ k were determined empirically.As we are assuming normally distributed log-powers, we can use the formula relating the moments of a normal distribution in the log-domain to the moments of a log-normal distribution in the power domain [70], and approximate the variance of y l (k) as follows: = log(1 + κ k ) This means we have for the observation noise ν l ∼ N (0, M l ) with M l = diag (log(1 + κ)) in (42).

APPENDIX B MODEL NOISE
As well as the observation noise that is a consequence of the statistical properties of the input data, we can model the noise due to the inaccuracies introduced when we assumed the powers are exactly additive.The total power in Mel-frequency band k is therefore assumed to be The observation noise covariance matrix M l in ( 42) is therefore augmented by a diagonal matrix T l whose diagonal elements are defined by (69), so that M l = diag (log(1 + κ)) + T l .This extra noise term is small when one of the powers is much greater than the others and maximum when all signal powers are equal (i.e. the point of maximum curvature of h).

Fig. 3 .
Fig. 3. Two-dimensional case : A and B are jointly Gaussian distributed.Unconstrained prior (a), empirically computed constrained posterior (b) distributions.Using Taylor series approximation of the nonlinear constraint, the first-order (c) and second-order (d) approximation of the constrained distribution are shown.

Fig. 4 .
Fig. 4. Bayesian Information Criterion (BIC) computed for different values of N the number of clusters used in the k-means algorithm.

Fig. 5 .
Fig. 5. Means of the log-power clean speech HMM states obtained through k-means with (a) 4 clusters and (b) 6 clusters.

Fig. 7 .
Fig. 7. Differential Cepstral Distance for different noise conditions, averaged across all acoustic scenarios.

Fig. 9 .
Fig. 9. Differential R D T scores obtained for different noise conditions, averaged across all acoustic scenarios.

Fig. 12 .
Fig. 12. Differential PESQ scores for different noise conditions, averaged across all acoustic scenarios.

Fig. 13 .
Fig. 13.Average WER reduction for the different acoustic conditions of the REVERB challenge real data.

Fig. 14 .
Fig. 14.Listening test results.MUSHRA differential scores corresponding to the overall speech quality improvement and perceived reverberation reduction.

TABLE I TABLE
DETAILING INFORMATION ABOUT RIRS FROM THE ACE CORPUS USED TO CREATE THE SIMULATED DATA |Y • (l, k )| 2