Signal-Based Performance Evaluation of Dereverberation Algorithms

We address the measurement of reverberation in terms of the Direct-to-Reverberation Ratio (DRR) in the context of the assessment of dereverberation algorithms, for which we wish to quantify the level of reverberation before and after processing. The DRR is normally calculated from the impulse response of the reverberating system. However, several important dereverberation algorithms involve nonlinear and/or time-varying processing, and therefore their effect cannot conveniently be represented in terms of modifications to the impulse response of the reverberating system. In such cases, we show that a good estimate of DRR can be obtained from the input/output signals alone using the Signal-to-Reverberation Ratio (SRR) only if the source signal is spectrally white and correctly normalized. We study alternative normalization schemes and conclude by showing a least squares optimal normalization procedure for estimating DRR using signal-based SRR measurement. Simulation results illustrate the accuracy of DRR estimation using SRR.


Introduction
When a speech signal is acquired in an enclosed space by one or more microphones positioned at some distance from the talker, each observed signal consists of a superposition of many delayed and attenuated copies of the speech signal due to multiple reflections from the surrounding walls and other objects. These multiple reflections can number several thousands and give rise to the effect known as reverberation. The reverberation time of an enclosed space is usually measured as the time, $T_{60}$, taken for the free decay of reverberation to reduce by 60 dB and is affected by the volume of the enclosed space and the acoustic properties of the reflecting surfaces [1]. Efficient schemes for modeling reverberation are widely used, for example, the source-image method [2,3]. A general scenario comprises a source speech signal $s(n)$ which propagates through $M$ acoustic channels, assumed Linear Time Invariant (LTI), with impulse responses $h_1(n), \ldots, h_M(n)$ and is acquired by $M$ microphones with output signals $x_1(n), \ldots, x_M(n)$. The microphone signals therefore contain reverberated versions of the source signal $s(n)$.
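Dereverberation algorithms operate on the microphone signals and output $N$ estimates $\hat{s}_1(n), \ldots, \hat{s}_N(n)$ of the source signal $s(n)$. We will assume that $M = N = 1$ for the purposes of this paper, with
\[ x(n) = \mathbf{h}^T \mathbf{s}(n), \]
where $\mathbf{h} = [h(0), h(1), \ldots, h(L_h - 1)]^T$, $\mathbf{s}(n) = [s(n), s(n-1), \ldots, s(n - L_h + 1)]^T$, $[\cdot]^T$ represents the transpose operator, and $L_h$ is the number of taps in the impulse response.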
The development of dereverberation algorithms [4] to reduce the reverberation effects in an audio signal is a slowly maturing topic in signal processing. Early work [5] introduced a speech enhancement approach operating on the linear prediction residual, and several microphone array-based approaches [6,7] have been proposed. Blind system identification techniques have been applied [8] involving subspace decomposition [9] and adaptive filters [10]. Techniques to evaluate dereverberation algorithms are as yet not consistently defined, and research is underway to address this issue. A common measure of dereverberation performance will be summarized in Section 2, where the difference between the channel-based Direct-to-Reverberation Ratio (DRR) and the signal-based Signal-to-Reverberation Ratio (SRR) measures will be highlighted. The remainder of the paper will focus on signal-based measures, for which normalization is not straightforward. We will justify the need for correct normalization and then briefly study alternative schemes in Section 3.

Measures of Reverberation
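We here define the direct path as an $L_h$-tap impulse response $\mathbf{h}_d = [h_d(0), h_d(1), \ldots, h_d(L_h - 1)]^T$ representing propagation from the talker to a microphone without reflections. We assume $\mathbf{h}_d$ is known. We also define the reverberant component $\mathbf{h}_r = [h_r(0), h_r(1), \ldots, h_r(L_h - 1)]^T$ as an impulse response representing all nondirect propagation paths from talker to microphone. We therefore write
\[ \mathbf{h} = \mathbf{h}_d + \mathbf{h}_r, \qquad x(n) = s_d(n) + x_r(n), \]
with $s_d(n) = \sum_{l=0}^{L_h-1} h_d(l)\, s(n-l)$ and $x_r(n) = \sum_{l=0}^{L_h-1} h_r(l)\, s(n-l)$, where $s_d(n)$ is a delayed and scaled version of $s(n)$.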
In general, the measurement of the level of reverberation in a signal requires a comparison of the energy due to the direct path propagation and the energy due to the reverberant paths. This may be characterized as the DRR, which will be discussed below. Evaluation of the performance of a dereverberation algorithm can be classified into two approaches: channel based and signal based.

Channel-Based Measure.
Channel-based measures are appropriate when the effect of the dereverberation algorithm on the reverberating system impulse response $\mathbf{h}$ is known or can be deduced. The DRR can be found straightforwardly from the corresponding impulse response coefficients [1] as
\[ \mathrm{DRR} = 10 \log_{10} \left( \frac{\sum_{n=0}^{L_h-1} h_d^2(n)}{\sum_{n=0}^{L_h-1} h_r^2(n)} \right) \ \mathrm{dB}. \tag{2} \]
If the direct path propagation time corresponds to an integer number of sampling periods then $\mathbf{h}_d$ may be an impulse; otherwise it has the form of a sinc function [3]. Comparison of DRR before and after processing leads to a measure of improvement in DRR. We note that, in contrast to the evaluation of dereverberation using improvement in DRR, evaluation of system identification performance is usually done in terms of the Normalized Projection Misalignment [11].
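As an illustration of the channel-based calculation, the following minimal sketch computes the DRR of a toy impulse response in plain Python. The function name `drr_db`, the `half_width` parameter (a hypothetical way of capturing a sinc-shaped direct path spread over several taps), and the tap values are assumptions made for this example and are not taken from the paper.

```python
import math

def drr_db(h, direct_index, half_width=0):
    """DRR in dB: energy of the direct-path taps over the energy of all other taps.

    direct_index and half_width are illustrative assumptions: taps within
    half_width of direct_index are treated as the direct path h_d, and the
    remaining taps as the reverberant component h_r.
    """
    lo = max(0, direct_index - half_width)
    hi = min(len(h), direct_index + half_width + 1)
    e_direct = sum(c * c for c in h[lo:hi])
    e_reverb = sum(c * c for c in h[:lo]) + sum(c * c for c in h[hi:])
    return 10.0 * math.log10(e_direct / e_reverb)

# Toy impulse response: direct path at tap 2, followed by a few reflections.
h = [0.0, 0.0, 1.0, 0.0, 0.5, 0.0, -0.3, 0.2, 0.1]
print("DRR = %.2f dB" % drr_db(h, direct_index=2))
```

Applying the same function to the equivalent impulse response before and after processing would give the improvement in DRR, which is only possible when such an impulse response exists.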

Signal-Based Measure.
Signal-based measures are needed when the effect of a dereverberation algorithm cannot be characterized in terms of an impulse response, such as [5,6,12], where the processing is not LTI. In such cases it is necessary to determine the SRR only from the signals before and after processing. The SRR can be written
\[ \mathrm{SRR} = 20 \log_{10} \left( \frac{\| \mathbf{s}_d \|_2}{\| \hat{\mathbf{s}} - \mathbf{s}_d \|_2} \right) \ \mathrm{dB}, \tag{3} \]
where $\mathbf{s}_d = [s_d(0), s_d(1), \ldots, s_d(L_s - 1)]^T$, and $\hat{\mathbf{s}} = (\hat{\mathbf{s}}_d + \hat{\mathbf{x}}_r)$ is the reverberant signal to be measured, of length $L_s$ samples, for example, at the input and the output of a dereverberation algorithm in order to measure the improvement in DRR achieved. The SRR is an intrusive measure that requires both the original and the processed speech signals. In addition, knowledge of the direct path component of the true impulse response is assumed in our approach such that the speech signals can be time-aligned correctly.

Relationship between DRR and SRR.
Subject to correct level normalization, as will be discussed below, the SRR is equivalent to the DRR when the source $s(n)$ is spectrally white. In the case when $\hat{\mathbf{s}}_d = \mathbf{s}_d$ and invoking Parseval's theorem, in the frequency domain we have
\[ \mathrm{SRR} = 10 \log_{10} \left( \frac{\sum_{k} |H_d(k) S(k)|^2}{\sum_{k} |H_r(k) S(k)|^2} \right) \ \mathrm{dB}, \]
where $H_d(k)$, $H_r(k)$, and $S(k)$ are the discrete Fourier transforms of $\mathbf{h}_d$, $\mathbf{h}_r$, and $s(n)$, respectively. When $S(k) = S$, independent of $k$, $|S|^2$ can be taken outside the summation in both numerator and denominator and cancelled. An illustrative example is when $s(n) = \delta(n)$, so that $S(k) = 1$ for all $k$, in which case (3) reduces directly to the formulation of the DRR in (2). In practice, when speech signals are considered, a prewhitening filter can be employed [13] as will be shown below.
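The dependence of SRR on the source spectrum can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's experiment: the toy impulse response, the first-order recursive filter used to colour the noise, and all function names are hypothetical. It computes the SRR for an impulse, for white noise, and for coloured noise passed through the same channel, and compares each with the DRR.

```python
import math
import random

def convolve(h, s):
    """Full linear convolution of two sequences (pure Python)."""
    y = [0.0] * (len(h) + len(s) - 1)
    for i, hi in enumerate(h):
        if hi == 0.0:
            continue  # skip zero taps
        for j, sj in enumerate(s):
            y[i + j] += hi * sj
    return y

def energy(x):
    return sum(v * v for v in x)

def srr_db(s_d, s_hat):
    """Signal-based SRR in dB: energy of s_d over energy of (s_hat - s_d)."""
    residual = [v - d for v, d in zip(s_hat, s_d)]
    return 10.0 * math.log10(energy(s_d) / energy(residual))

# Toy channel (assumed values): direct path at tap 2 plus a few reflections.
h_d = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
h_r = [0.0, 0.0, 0.0, 0.0, 0.5, 0.0, -0.3, 0.2, 0.1]
h = [d + r for d, r in zip(h_d, h_r)]
drr = 10.0 * math.log10(energy(h_d) / energy(h_r))

def srr_for_source(s):
    """SRR of the unprocessed reverberant signal for a given source s."""
    return srr_db(convolve(h_d, s), convolve(h, s))

rng = random.Random(0)
white = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# Coloured source: white noise through a first-order recursive lowpass filter.
coloured, prev = [], 0.0
for w in white:
    prev = 0.9 * prev + w
    coloured.append(prev)

srr_impulse = srr_for_source([1.0] + [0.0] * 15)
srr_white = srr_for_source(white)
srr_coloured = srr_for_source(coloured)

print("DRR            %.2f dB" % drr)
print("SRR (impulse)  %.2f dB" % srr_impulse)
print("SRR (white)    %.2f dB" % srr_white)
print("SRR (coloured) %.2f dB" % srr_coloured)
```

With an impulse as the source the two measures coincide, with white noise they agree up to estimation error, and with the coloured source the SRR departs from the DRR because the source spectrum no longer cancels.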
These effects are illustrated in Figure 1, which shows a comparison of DRR and SRR for a room of dimensions 6 × 5 × 4 m simulated using the source-image method [2,3] (left) and for real measured room impulse responses from MARDY [14] (right). The SRR calculated for a white noise input is shown in curve (a) and is seen to correspond almost exactly to DRR. Curve (b) shows SRR calculated for five sentences of male speech, sampled at 20 kHz, from the APLAWD database [15]. Lastly, the results with prewhitened speech are shown in curve (c). The prewhitening filters were computed over all five sentences using a 10th-order linear predictor; separate filters were obtained for $\mathbf{s}_d$ and $\hat{\mathbf{s}}$ and were applied to each of the signals, respectively. It is clear that whitening the speech signal has a significant effect on the SRR.
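Prewhitening with a linear predictor can be sketched as follows. This is a minimal illustration assuming the autocorrelation method with the Levinson-Durbin recursion, which the paper does not specify; the coloured-noise test signal stands in for speech, and the names `lpc`, `fir`, and `lag1_correlation` are hypothetical.

```python
import random

def autocorr(x, max_lag):
    """Biased autocorrelation estimates r[0..max_lag] (unnormalized sums)."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]

def lpc(x, order):
    """Prediction-error filter [1, a1, ..., ap] via the Levinson-Durbin recursion."""
    r = autocorr(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # updated prediction-error energy
    return a

def fir(a, x):
    """Apply the FIR filter a to x with zero initial conditions."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]

def lag1_correlation(x):
    """Normalized lag-1 autocorrelation, a crude indicator of spectral colouring."""
    r = autocorr(x, 1)
    return r[1] / r[0]

# Coloured test signal standing in for speech (an assumption of this example).
rng = random.Random(0)
x, prev = [], 0.0
for _ in range(5000):
    prev = 0.9 * prev + rng.gauss(0.0, 1.0)
    x.append(prev)

a = lpc(x, order=10)        # 10th-order predictor, as used in the text
whitened = fir(a, x)
print("lag-1 correlation before: %.3f" % lag1_correlation(x))
print("lag-1 correlation after:  %.3f" % lag1_correlation(whitened))
```

Following the text, a separate filter would be estimated for each of the two signals being compared and applied to that signal before the SRR is computed.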

Level Normalization
A dereverberation algorithm aims to attenuate the level of reverberation and may affect either or both of the direct path signal $s_d(n)$ and the reverberant component $x_r(n)$ in order to improve the SRR. Therefore we can write that
\[ \hat{s}(n) = \alpha\, s_d(n) + \hat{x}_r(n), \tag{4} \]
where $\hat{x}_r(n)$ is the reverberant component remaining after dereverberation processing and $\alpha$ is a scalar assumed stationary over the duration of the measurement. We also assume that any processing delay has been appropriately compensated, as is generally assumed in other measurements such as the SNR. We propose that the measurement of the reverberant component's energy and the assessment of its impact on the speech signal must be done relative to the energy of the direct path component. This can be conveniently accomplished by normalization in order to match the level of the direct path component before and after processing. The aim of this normalization is to adjust the magnitude of $\hat{\mathbf{s}}$ such that the direct path signal energy is unchanged by the dereverberation algorithm. This can be achieved by determining an estimate $\hat{\alpha}$ of $\alpha$. Our motivation comes from the observation that signal-based measures are not, in general, scale independent, as can be seen in the case of (3), and therefore misleading results can be obtained unless the scaling is correctly normalized.
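We formulate this problem as a search for a scalar $\hat{\alpha}$ such that the Normalized Signal-to-Reverberation Ratio (NSRR)
\[ \mathrm{NSRR} = 20 \log_{10} \left( \frac{\| \mathbf{s}_d \|_2}{\| (1/\hat{\alpha})\, \hat{\mathbf{s}} - \mathbf{s}_d \|_2} \right) \ \mathrm{dB} \tag{5} \]
is a good estimate of DRR.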
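The scale dependence that motivates the normalization can be seen in a small numerical sketch. The example below is illustrative only: the reverberant component is replaced by independent noise (an assumption made for brevity), the scale factor is taken as known, and the function and variable names are hypothetical. It compares the unnormalized SRR with the NSRR evaluated at the true scale factor.

```python
import math
import random

def energy(x):
    return sum(v * v for v in x)

def nsrr_db(s_d, s_hat, alpha_hat):
    """NSRR in dB: energy of s_d over energy of (s_hat / alpha_hat - s_d)."""
    residual = [v / alpha_hat - d for v, d in zip(s_hat, s_d)]
    return 10.0 * math.log10(energy(s_d) / energy(residual))

rng = random.Random(1)
s_d = [rng.gauss(0.0, 1.0) for _ in range(2000)]
# Stand-in for the residual reverberant component (assumed independent of s_d).
x_r = [0.5 * rng.gauss(0.0, 1.0) for _ in range(2000)]

results = {}
for alpha in (0.5, 1.0, 2.0):
    s_hat = [alpha * d + r for d, r in zip(s_d, x_r)]
    # Ratio of direct to reverberant energy actually present in s_hat.
    target = 10.0 * math.log10(alpha ** 2 * energy(s_d) / energy(x_r))
    unnormalized = nsrr_db(s_d, s_hat, 1.0)   # plain SRR, scaling ignored
    normalized = nsrr_db(s_d, s_hat, alpha)   # NSRR with the true scale factor
    results[alpha] = (unnormalized, normalized, target)
    print("alpha=%.1f  SRR=%6.2f dB  NSRR=%6.2f dB  target=%6.2f dB"
          % (alpha, unnormalized, normalized, target))
```

When the scale factor is known, the NSRR recovers the ratio of direct to reverberant energy in the processed signal, while the unnormalized SRR is only correct when the scale factor equals one. In practice the scale factor has to be estimated from the signals.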

RMS and Peak Normalization.
It is necessary to estimate $\alpha$ from the available signals and, for baseline comparison purposes, we have initially considered straightforward approaches to determining $\hat{\alpha}$ using
\[ \hat{\alpha} = \frac{\| W\{\hat{\mathbf{s}}\} \|_{\mathrm{norm}}}{\| W\{\mathbf{s}_d\} \|_{\mathrm{norm}}}, \tag{6} \]
corresponding to RMS and peak matching for norm = 2 and norm = ∞, respectively, and employing uniform and A-weighting [1], with $W\{\cdot\}$ representing the corresponding weighting filter. These approaches lead to incorrect calculation of SRR, as will be shown below.
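A minimal sketch of the RMS and peak matching estimates with uniform weighting is given below; A-weighting is omitted, the residual reverberation is again replaced by independent noise, and the function name `alpha_norm` is hypothetical. It illustrates why RMS matching is biased: the norm of the processed signal includes the energy of the residual reverberation, so the estimate exceeds the true scale factor.

```python
import math
import random

def energy(x):
    return sum(v * v for v in x)

def alpha_norm(s_d, s_hat, norm):
    """Scale estimate as a ratio of norms (uniform weighting only).

    norm=2 gives RMS matching and norm=math.inf gives peak matching.
    """
    if norm == 2:
        return math.sqrt(energy(s_hat) / energy(s_d))
    if norm == math.inf:
        return max(abs(v) for v in s_hat) / max(abs(v) for v in s_d)
    raise ValueError("only norm=2 and norm=math.inf are sketched here")

rng = random.Random(2)
alpha = 0.7                                       # true scale factor (assumed)
s_d = [rng.gauss(0.0, 1.0) for _ in range(5000)]
x_r = [0.5 * rng.gauss(0.0, 1.0) for _ in range(5000)]  # stand-in reverberation
s_hat = [alpha * d + r for d, r in zip(s_d, x_r)]

a_rms = alpha_norm(s_d, s_hat, 2)
a_peak = alpha_norm(s_d, s_hat, math.inf)
# For uncorrelated components, RMS matching tends to this biased value.
a_rms_expected = math.sqrt(alpha ** 2 + energy(x_r) / energy(s_d))

print("true alpha       %.3f" % alpha)
print("RMS estimate     %.3f (biased value %.3f)" % (a_rms, a_rms_expected))
print("peak estimate    %.3f" % a_peak)
```

The bias grows as the reverberant energy grows relative to the direct path energy, which is consistent with larger discrepancies at low DRR.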

Least Squares Optimal Normalization.
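We propose that a good solution to the normalization problem can be obtained using $\hat{\alpha}_{\mathrm{ls}}$ from the least squares minimization
\[ \hat{\alpha}_{\mathrm{ls}} = \arg\min_{\hat{\alpha}} E\left\{ \| \hat{\mathbf{s}} - \hat{\alpha}\, \mathbf{s}_d \|_2^2 \right\}. \tag{7} \]
The solution for $\hat{\alpha}_{\mathrm{ls}}$ is found by minimizing $J = E\{ \| \hat{\mathbf{s}} - \hat{\alpha}\, \mathbf{s}_d \|_2^2 \}$ arising from (7), where $E\{\cdot\}$ denotes mathematical expectation.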
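To minimize $J$, we differentiate it with respect to $\hat{\alpha}$ and set the result to zero, which gives
\[ \frac{\partial J}{\partial \hat{\alpha}} = -2 E\{ \mathbf{s}_d^T \hat{\mathbf{s}} \} + 2 \hat{\alpha}\, E\{ \mathbf{s}_d^T \mathbf{s}_d \} = 0. \tag{8} \]
The final step is to approximate expectations with sample averages, giving $\hat{\alpha}_{\mathrm{ls}}$ to be the value of $\hat{\alpha}$ satisfying (8) as
\[ \hat{\alpha}_{\mathrm{ls}} = \frac{\mathbf{s}_d^T \hat{\mathbf{s}}}{\mathbf{s}_d^T \mathbf{s}_d}, \tag{9} \]
which is a projection of $\hat{\mathbf{s}}$ onto the direct component $\mathbf{s}_d$.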
The effect of $\hat{\alpha}$ is seen by substituting (4) into $J$ to obtain
\[ J = E\left\{ \| (\alpha - \hat{\alpha})\, \mathbf{s}_d + \hat{\mathbf{x}}_r \|_2^2 \right\}. \tag{10} \]
Provided that $\hat{\mathbf{x}}_r$ is uncorrelated with $\mathbf{s}_d$, so that the cross term vanishes, $J$ is minimized when $\hat{\alpha} = \alpha$. Although the normalization constant has been considered stationary, it could also be applied in a frame-based manner as, for example, in Segmental SNR. Figure 2 shows a comparison of DRR with NSRR computed from (5), obtained for the same experimental setup as for Figure 1. The test signal $\hat{\mathbf{s}}$ was generated as in (4) with $\alpha$ chosen arbitrarily and $\hat{x}_r(n) = x_r(n)$. The speech signals were prewhitened with prewhitening filters computed from $\mathbf{s}_d$ and $(1/\hat{\alpha})\, \hat{\mathbf{s}}$ and applied, after the level normalization, to each of the signals, respectively. Curves (a), (b), and (c) show SRR with the normalization factor $\hat{\alpha}$ from (6) with peak normalization, RMS normalization, and A-weighted RMS normalization, respectively. Curve (d) shows SRR with least squares optimal normalization. It can be seen that the mismatch between DRR and least squares optimal normalized SRR is much smaller over a wide range of DRRs, whereas the other normalization schemes substantially overestimate the DRR and offer little discrimination between different values of DRR. These discrepancies are more severe at lower DRR values.
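The least squares normalization can be sketched end to end as follows. This is an illustration under assumptions rather than a reproduction of the experiment behind Figure 2: the source is white noise, the impulse response is a toy example, the processed signal is generated with an arbitrarily chosen scale factor and an unmodified reverberant component, and all names are hypothetical. The NSRR with least squares and with RMS normalization is compared with the DRR of the equivalent channel.

```python
import math
import random

def convolve(h, s):
    """Full linear convolution of two sequences (pure Python)."""
    y = [0.0] * (len(h) + len(s) - 1)
    for i, hi in enumerate(h):
        if hi == 0.0:
            continue  # skip zero taps
        for j, sj in enumerate(s):
            y[i + j] += hi * sj
    return y

def energy(x):
    return sum(v * v for v in x)

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def nsrr_db(s_d, s_hat, alpha_hat):
    """NSRR in dB: energy of s_d over energy of (s_hat / alpha_hat - s_d)."""
    residual = [v / alpha_hat - d for v, d in zip(s_hat, s_d)]
    return 10.0 * math.log10(energy(s_d) / energy(residual))

# Toy channel (assumed values) and white source.
h_d = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
h_r = [0.0, 0.0, 0.0, 0.0, 0.5, 0.0, -0.3, 0.2, 0.1]
rng = random.Random(3)
s = [rng.gauss(0.0, 1.0) for _ in range(20000)]

alpha = 0.7                                  # arbitrary true scale factor
s_d = convolve(h_d, s)
x_r = convolve(h_r, s)
s_hat = [alpha * d + r for d, r in zip(s_d, x_r)]

# DRR of the equivalent channel alpha * h_d + h_r.
drr_after = 10.0 * math.log10(alpha ** 2 * energy(h_d) / energy(h_r))

alpha_ls = dot(s_d, s_hat) / dot(s_d, s_d)   # projection of s_hat onto s_d
alpha_rms = math.sqrt(energy(s_hat) / energy(s_d))

nsrr_ls = nsrr_db(s_d, s_hat, alpha_ls)
nsrr_rms = nsrr_db(s_d, s_hat, alpha_rms)

print("DRR of equivalent channel %.2f dB" % drr_after)
print("alpha_ls=%.3f   NSRR=%.2f dB" % (alpha_ls, nsrr_ls))
print("alpha_rms=%.3f  NSRR=%.2f dB" % (alpha_rms, nsrr_rms))
```

In this toy setting the least squares estimate stays close to the true scale factor and the resulting NSRR tracks the DRR, while RMS matching overestimates both. For speech, the signals would additionally be prewhitened after the level normalization, as described in the text.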

Discussion and Conclusions
An important class of dereverberation algorithms employs nonlinear and/or time-varying processing such that the effect of the processing on the reverberation cannot be characterized in terms of an impulse response. In such cases, the improvement in DRR cannot be measured directly. Accordingly, it is necessary to estimate the DRR values at the input and output of the dereverberation algorithm using SRR.
We have shown that two effects require consideration. First, the signal characteristics affect the SRR calculation such that good estimates of DRR are obtained when the signal is white. Prewhitening of speech with a 10th-order predictor has been seen to be sufficient for the cases studied here. Second, the level of the signals must be correctly normalized. We have shown that level normalization using RMS, A-weighted RMS, or peak matching is not appropriate. We have formulated a least squares optimal normalization scheme and shown that this can be expressed as a projection of the signal onto the direct path component. Simulation results confirm that the least squares optimal level normalization and prewhitening enable DRR to be estimated without the requirement for impulse response measurements.