Inverse Filtering and Principal Component Analysis Techniques for Speech Dereverberation

In this work, we present a single-channel approach for early and late reverberation suppression. The approach can be decomposed into two stages. The first stage employs inverse filtering to increase the signal-to-reverberant energy ratio. The second stage uses the kernel PCA algorithm to enhance the dereverberated signal by extracting the main non-linear features from the speech signal after inverse filtering. Our approach proves efficient mainly in far-field conditions and in highly reverberant environments.


Introduction
It is well known that the intelligibility of speech in natural environments is influenced by room reverberation [1], [2]. The presence of reverberation alone can degrade speech perception. Several factors contribute to the influence of reverberation on intelligibility, including source distance, room reverberation time, and the characteristics of both late reverberation and early reflections. Speech dereverberation is an active area of research designed to mitigate the influence of reverberation on speech intelligibility under monaural presentation.
Several speech dereverberation approaches have been proposed in the literature. As the linear prediction residual signal includes the effects of the reverberant channel, Brandstein et al. [3] propose to use the residual and the all-pole filter resulting from prediction analysis of the reverberant speech. Allen [4] aims to suppress the effects of reverberation by identifying the Linear Predictive Coding (LPC) parameters.
Yegnanarayana et al. [5] apply Hilbert envelopes to indicate the strength of the peaks in the Linear Prediction (LP) residuals. Gillespie et al. [6] propose to maximize the kurtosis of the residual by using an adaptive filter for multiple microphones. In [7], [8], the authors use a spatial averaging approach; the prediction residual is then improved by applying a temporal averaging of larynx cycles. In [9], the authors identify the magnitudes and phases of a sinusoidal speech model by estimating the pitch; speech dereverberation is then applied by deducing an equivalent equalization filter. Habets et al. [10] and Lebart et al. [11] have proposed using the spectral subtraction technique for dereverberation. They use a statistical model of the room impulse response consisting of Gaussian noise modulated by an exponentially decaying envelope, so that the power spectral density of the impulse response can be suppressed by spectral subtraction. A similar method proposed in [12] estimates the late reflections by using a multistep LP. In [13], the authors use a subspace method to determine the inverse filter independently of the source characteristics. Nakatani et al. [14] proposed to formulate the dereverberation problem as a maximum likelihood problem solved with a hill-climbing technique. In [15], the authors optimize the scaling factors by using the minimum mean square error criterion to eliminate the late reflection components.
Principal Component Analysis (PCA) is a subcategory of subspace approaches. It is an important tool for data analysis designed to reduce the dimensionality of multidimensional signals. The essential objective is to obtain a set of orthogonal factors that describe the variance of the observations and to track the new factors considered to determine the necessary features. PCA [16] is extensively applied as a classical multi-variate speech processing tool, and it has been used in different fields of science to extract relevant information from complex matrix data [17]. For speech separation, a robust extension of classical PCA based on the generalized eigenvalue decomposition of a pair of covariance matrices has been proposed in [18]. PCA may also be used in speaker identification [19], speech recognition [20], speech enhancement [21], [22], [23], and speech dereverberation [24], [25], [26].
In this work, we propose a dereverberation approach in the field of single-channel speech enhancement that is appropriate for real-world applications. To suppress the late and early reverberations, we apply a single-microphone two-step approach. In the first stage, the inverse filter is applied to enhance the speech-to-reverberation ratio and reduce the coloration caused by the early reflections and late reverberation. The second stage consists in using the kernel principal component analysis algorithm to enhance the dereverberated signal.
Our approach is tested on a corpus of speech utterances from thirty-two speakers. The evaluation shows that our approach is able to reduce early reflections and late reverberation in highly reverberant environments.
The rest of the paper is organized as follows. In Section 2, we present the first stage, based on the inverse filter, and then integrate the kernel PCA into our approach. In Section 3, we provide the experimental results of our approach.

Proposed Approach
We propose an approach to speech dereverberation employing monaural recordings. It can be decomposed into two essential stages, as shown in Fig. 1.
In the first stage of our approach, we apply a pre-processing procedure to reduce the short-term correlation of the speech and identify the delayed late reverberations. Then, we determine the inverse filter by maximization of the kurtosis. The second stage consists of using the kernel principal component analysis algorithm to reduce the effects of the long-term reverberation.

1) Pre-Processing Procedure
The pre-processing procedure consists in adjusting the reverberant speech signal through the right choice of the short-order linear prediction.
Fig. 1: Block diagram of our proposed approach.
We consider the reverberant speech signal x_r(n) as the result of the convolution between the clean speech and the room impulse response:

x_r(n) = g(n) * s(n),

where g(n) is the impulse response filter obtained by combining the effects of human speech production and the room impulse response, and s(n) is the clean speech signal.
The finite impulse response filter is described as follows:

G(z) = Σ_k g(k) z^{−k}.

Such a filter would produce the speech signal from the white noise u(n). The reverberant speech signal can also be written in vector form as the product of a convolution matrix built from g and the input vector, whose dimensions are the lengths of the vector g and of x_r(n), respectively. A bias caused by the transfer function of human speech production exists in the estimated late elements of G(z). We therefore implement a linear prediction with a small order. In [27], the authors proposed using a pre-whitening order of 20 taps. In our approach, however, the order is set according to the length of the RIR. This pre-processing compensates for the bias caused by the transfer function of human speech production, bearing in mind its convolution with the RIR. Therefore, the resulting pre-processing is better adapted to the reverberant signal. In our experiments, for room impulse responses with reverberation times of 0.5 s and 0.7 s, we chose short linear prediction orders of 6 and 20 taps, respectively.
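The short-order pre-whitening step can be sketched as follows. This is a minimal illustration assuming the autocorrelation (Levinson-Durbin) method for the LPC analysis; the function names and the default order of 6 taps (the value used above for T60 = 0.5 s) are ours.

```python
import numpy as np

def lpc_coefficients(x, order):
    """LPC via the autocorrelation method and the Levinson-Durbin recursion.
    Returns a = [1, a1, ..., a_order] so that A(z) = sum_k a_k z^-k."""
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def prewhiten(x, order=6):
    """Short-order pre-whitening: filter x with A(z) to reduce
    the short-term correlation of the speech."""
    a = lpc_coefficients(x, order)
    return np.convolve(x, a)[: len(x)]
```

For the T60 = 0.7 s condition, the order would be raised to 20 taps, following the text.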
After applying the short-order linear prediction, we reduce the effects of the long-term reverberation.
The long-term linear prediction is described by the following equation:

e_r(n) = x_r(n) − Σ_{l=0}^{L−1} f(l) x_r(n − D − l),

where x_r(n) is the reverberant speech signal, f(l) are the coefficients of the filter, L is the number of coefficients, D is the delay, and e_r(n) is the prediction error of the speech signal.
In this step, we apply the Levinson-Durbin method to minimize the energy of the error e_r(n).
Based on the method proposed by Kinoshita et al. [27], we apply the Wiener-Hopf equations specialized for delayed LP,

R f = r,

where R is the covariance matrix of the delayed samples x_r(n − D), ..., x_r(n − D − L + 1), r is the cross-correlation vector between these samples and x_r(n), and v_x^2 is the variance of the white noise. As a result, the filter coefficients are f = R^{−1} r, and the estimated power of the late reverberation is that of the predicted component Σ_{l=0}^{L−1} f(l) x_r(n − D − l). The linear prediction filter order L is large. Consequently, each residual sample is computed from the L + D preceding samples. Hence, the linear prediction residual is capable of capturing the long-term reverberation of the signal.
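A hedged sketch of the delayed long-term LP: here the Wiener-Hopf normal equations are solved directly with a least-squares solver rather than the Levinson-Durbin recursion used in the paper, and the values of L and D are illustrative.

```python
import numpy as np

def delayed_lp_residual(x, L=120, D=30):
    """Delayed long-term LP: predict x(n) from x(n-D), ..., x(n-D-L+1)
    and return the filter f and the prediction residual e_r(n)."""
    N = len(x)
    rows, targets = [], []
    for n in range(D + L - 1, N):
        # Delayed sample vector [x(n-D), x(n-D-1), ..., x(n-D-L+1)]
        rows.append(x[n - D - L + 1: n - D + 1][::-1])
        targets.append(x[n])
    X = np.array(rows)
    t = np.array(targets)
    # Wiener-Hopf normal equations (X^T X) f = X^T t, with a small
    # diagonal loading term for numerical stability
    f = np.linalg.solve(X.T @ X + 1e-8 * np.eye(L), X.T @ t)
    residual = t - X @ f
    return f, residual
```

The delay D prevents the predictor from cancelling the short-term speech structure; only components beyond D samples are removed.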

2) Inverse Filtering
In order to suppress the long reverberations, we apply an inverse filter obtained by maximizing the kurtosis of the linear prediction residual. As the LP residual of reverberant speech has a lower kurtosis than that of clean speech, we estimate the inverse filter by kurtosis maximization.
Figure 2 shows the diagram for inverse filtering. We approximate the LP residual of the processed speech by the inverse-filtered LP residual of the reverberant speech.
The inverse-filtered speech signal is:

y(t) = h^T r(t),

where r(t) = [r(t − K + 1), ..., r(t)]^T is the residual signal of the reverberant speech after the application of the pre-processing procedure, h is the inverse filter, and y(t) is the inverse-filtered speech signal.
The kurtosis of y(t) is described as follows:

J(y) = E{y^4(t)} / E^2{y^2(t)} − 3.

According to the work of Gillespie et al. [6], the gradient of the kurtosis is given by:

∂J/∂h = 4 (E{y^2} E{y^3(t) r(t)} − E{y^4} E{y(t) r(t)}) / E^3{y^2}.

Then, we approximate the gradient by the following instantaneous estimate:

f(t) = 4 (E{y^2} y^3(t) − E{y^4} y(t)) / E^3{y^2},

so that ∂J/∂h ≈ E{f(t) r(t)}. We then compute E{y^4(t)} and E{y^2(t)} to obtain the adaptive inverse filter.
In the time domain, the adaptive inverse filter is updated by the following equation:

h(t + 1) = h(t) + β f(t) r(t),

where β adapts the learning rate, and f(t) is the function of the kurtosis.
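The time-domain kurtosis-maximization update can be sketched as below. This is an illustrative block-adaptive variant, not the paper's exact sample-by-sample implementation; the filter length K, step size beta, iteration count, gradient normalization, and the best-filter bookkeeping are our choices.

```python
import numpy as np

def kurtosis(y):
    """Sample kurtosis J(y) = E{y^4} / E{y^2}^2 - 3."""
    m2 = np.mean(y ** 2)
    return np.mean(y ** 4) / m2 ** 2 - 3.0

def adapt_inverse_filter(r, K=32, beta=0.05, n_iter=60):
    """Gradient ascent on the kurtosis of y(t) = h^T r(t).
    Keeps the best filter seen, starting from the identity filter."""
    h = np.zeros(K)
    h[0] = 1.0
    best_h, best_J = h.copy(), -np.inf
    for _ in range(n_iter):
        y = np.convolve(r, h)[: len(r)]
        m2, m4 = np.mean(y ** 2), np.mean(y ** 4)
        J = m4 / m2 ** 2 - 3.0
        if J > best_J:
            best_J, best_h = J, h.copy()
        # f(t) = 4 (E{y^2} y^3(t) - E{y^4} y(t)) / E{y^2}^3
        f = 4.0 * (m2 * y ** 3 - m4 * y) / m2 ** 3
        # Gradient estimate: grad_k ~= E{f(t) r(t - k)}
        grad = np.array([np.mean(f[k:] * r[: len(r) - k]) for k in range(K)])
        h = h + beta * grad / (np.linalg.norm(grad) + 1e-12)
        h /= np.linalg.norm(h)  # keep the filter norm bounded
    return best_h
```

On a super-Gaussian source smeared by a short smoothing filter, the adapted filter raises the kurtosis of the output back toward that of the source.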
However, because the RIR has a large number of taps, the signal is processed frame by frame in the frequency domain using the Fast Fourier Transform (FFT), consistent with the work of Wu and Wang [28].
In the frequency domain, Eq. (12) becomes:

H(n + 1) = H(n) + β Σ_{p=1}^{P} F(p) R*(p),

where H(n) is the n-th iteration of h; F(p) and R*(p) denote, respectively, the FFT of f(t) and the complex conjugate of the FFT of r(t) in the p-th frame; and P is the number of frames.
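The frequency-domain form of the update can be sketched as a single accumulation over frames; frame segmentation, overlap handling, and the choice of beta are left out as assumptions.

```python
import numpy as np

def frequency_domain_update(h_fft, f_frames, r_frames, beta=1e-3):
    """One iteration of H(n+1) = H(n) + beta * sum_p F(p) R*(p),
    where f_frames and r_frames hold the per-frame FFTs of f(t) and r(t)."""
    grad = np.sum(f_frames * np.conj(r_frames), axis=0)
    return h_fft + beta * grad
```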
Then, we obtain the inverse-filtered speech by convolving the inverse filter with the reverberant signal. In the next sub-section, we add the kernel PCA technique to enhance the processed signal.

Kernel PCA Stage
In Fig. 3, we illustrate the kernel PCA algorithm for speech dereverberation. In the KPCA technique, the speech components will be projected onto the low-order eigenvectors, while the reverberant components will be projected onto the high-order ones.
We compute the Fast Fourier Transform (FFT) of the speech signal obtained from the first stage, y(t). We obtain the short-time spectra of the clean speech and of the transfer function at frequency f in the p-th frame. Our goal is to determine the estimated clean speech.
The FFT of y(t) can be expressed as:

Y_p(f) = H_p(f) X_p(f).

Then, we compute the log-spectrum of Y_p(f) and obtain the following equation:

log Y_p(f) = log H_p(f) + log X_p(f),

where log Y_p(f), log H_p(f), and log X_p(f) are the logarithm spectra of the speech signal obtained by inverse filtering in the first stage, the estimated clean speech signal, and the transfer function, respectively.
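Computing the frame-wise log-magnitude spectra can be sketched as follows; the frame length, hop size, Hann window, and the small floor added before the logarithm are illustrative choices.

```python
import numpy as np

def log_spectra(y, frame_len=256, hop=128):
    """Frame-wise log-magnitude spectra of a signal."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    spectra = np.empty((n_frames, frame_len // 2 + 1))
    for p in range(n_frames):
        frame = y[p * hop: p * hop + frame_len] * window
        mag = np.abs(np.fft.rfft(frame))
        spectra[p] = np.log(mag + 1e-12)  # floor avoids log(0)
    return spectra
```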
Thus, we extract the clean speech based on KPCA. The eigenvector matrix V is derived by eigenvalue decomposition, where V = [v(1), ..., v(p)] consists of the eigenvectors corresponding to the Q dominant eigenvalues. The components of the convolutive reverberant signal, [v(p+1), ..., v(M)], are eliminated by the filtering operation in Eq. (16).
According to Eq. (16), the signal y(t) is represented under the assumption that the reverberant signal and the clean speech are uncorrelated. Then, we apply a non-linear PCA, called kernel principal component analysis (KPCA), to extract the clean speech. We detail the KPCA technique in the following subsection.

1) KPCA Technique
KPCA transforms the input logarithm spectra, which have a nonlinear structure, into a higher-dimensional feature space with linear structure, and then computes linear PCA on the mapped data. We assume that the reverberant components will be eliminated in the high-dimensional space.
Consider the output of the mel-scale filter bank, ym_p, at the p-th frame. We define the covariance matrix in feature space as follows:

C = (1/P) Σ_{p=1}^{P} Ψ(ym_p) Ψ(ym_p)^T,

where Ψ is the nonlinear map and P denotes the number of samples. We must then solve for the eigenvectors v and eigenvalues α.
We formulate the equation above in terms of the kernel matrix K, with entries K_{ij} = Ψ(ym_i)^T Ψ(ym_j). So we obtain the equivalent eigenvalue problem

P α u = K u,

with K as the kernel matrix and u the vector of expansion coefficients of the eigenvector v on the mapped samples.
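The kernel matrix construction and eigendecomposition can be sketched as below. Centring the kernel matrix in feature space is a standard KPCA step that the text does not spell out; the kernel width c and the number of retained components are illustrative.

```python
import numpy as np

def kpca_fit(X, n_components, c=1.0):
    """Kernel PCA with the Gaussian kernel k(x, y) = exp(-||x - y||^2 / c).
    Returns the kernel matrix and the normalised expansion coefficients
    for the n_components dominant eigenvalues."""
    P = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / c)
    # Centre the kernel matrix in feature space
    one = np.ones((P, P)) / P
    Kc = K - one @ K - K @ one + one @ K @ one
    w, v = np.linalg.eigh(Kc)
    idx = np.argsort(w)[::-1][:n_components]
    w, v = w[idx], v[:, idx]
    # Normalise so the feature-space eigenvectors have unit length
    alphas = v / np.sqrt(np.maximum(w, 1e-12))
    return K, alphas
```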
For dereverberation of a test sample Ψ(ỹm), it is projected onto the eigenvectors V. The projection is expressed in terms of kernel functions based on Eq. (19). Therefore, the projected sample in feature space is equal to O_n Ψ(ỹm), where O_n denotes the projection operator.
For dereverberation, we try to find a sample y that satisfies Ψ(y) = O_n Ψ(ỹm).
We consider the Gaussian kernel function:

k(x, y) = exp(−||x − y||^2 / c),

where c denotes the variance.
The dereverberated speech is determined by an iterative update equation of y:

y ← Σ_j γ_j k(y, ym_j) ym_j / Σ_j γ_j k(y, ym_j),

where the coefficients γ_j encode the projection O_n. This indicates that the pre-image y is a linear combination of the input speech samples ym_j.
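The fixed-point pre-image iteration for the Gaussian kernel can be sketched as follows (a Mika-style update; the initialization from the sample mean and the iteration count are our choices, and gamma stands for the expansion coefficients of the projected point).

```python
import numpy as np

def preimage(gamma, X, c=1.0, n_iter=50):
    """Fixed-point pre-image iteration for the Gaussian kernel:
    y <- sum_j gamma_j k(y, x_j) x_j / sum_j gamma_j k(y, x_j)."""
    y = X.mean(axis=0)  # simple initial guess
    for _ in range(n_iter):
        k = np.exp(-np.sum((X - y) ** 2, axis=1) / c)
        w = gamma * k
        denom = np.sum(w)
        if abs(denom) < 1e-12:
            break  # degenerate weights: keep the current iterate
        y = (w @ X) / denom
    return y
```

With positive coefficients, each iterate is a convex combination of the input samples, which makes the update numerically well behaved.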
The KPCA matrix is estimated only from the reverberant speech recordings. To form the matrix K, we choose P = 1500 frames from the training data and use the Gaussian kernel function. The KPCA uses a 16-dimensional mel-scale filter bank output.

Experimental Results
We evaluated the proposed speech dereverberation approach on the TIMIT database [29]. As we compare our approach to state-of-the-art algorithms whose signals are sampled at 8 kHz [28], [31], we down-sample the speech signals from 16 kHz to 8 kHz. We use 32 English speakers from this database.
We apply standard objective measures to determine the quality of the speech enhanced by our proposed approach. The reverberant speech signal is generated by convolving the room impulse response with the clean speech for T_60 = 0.3, 0.5, and 0.7 s. We use the image model proposed in [30] to generate the RIR function. The RIR was created for a room with dimensions 6 × 4 × 3 m, with the microphone installed at (4, 1, 2) m and the speaker placed at (2, 3, 1.5) m; see Fig. 4.
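Generating a reverberant test signal can be sketched with a simplified synthetic RIR: exponentially decaying white noise in place of the image-model RIR of [30] (the decay rate follows the usual −60 dB-at-T60 convention; the seed and peak normalization are illustrative choices).

```python
import numpy as np

def synthetic_rir(t60, fs=8000, seed=0):
    """Synthetic RIR: white noise shaped by an exponential decay,
    with the decay rate set so the envelope drops 60 dB at t60."""
    n = int(t60 * fs)
    rng = np.random.default_rng(seed)
    decay = np.exp(-3.0 * np.log(10.0) * np.arange(n) / (t60 * fs))
    g = rng.standard_normal(n) * decay
    return g / np.max(np.abs(g))

# A reverberant signal is then the convolution with the clean speech s:
# x_r = np.convolve(s, synthetic_rir(0.5))[: len(s)]
```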
To evaluate the performance of our approach, we compare it with two existing methods: the Wu method [28], and the MAP and variational deconvolution method (MA-VD) [31]. In [28], the authors proposed a two-step single-channel dereverberation method whose first step applies an adaptive inverse filtering scheme by kurtosis maximization; they then introduce spectral subtraction in the second stage to improve the dereverberation of long reflections. In [31], the authors used cost functions obtained from variational estimation and maximum a posteriori estimation to formulate dereverberation in each frequency bin. We report the Perceptual Evaluation of Speech Quality (PESQ), which measures the overall quality of the speech, and the Speech to Reverberation Ratio (SRR), for the three reverberation times.
Table 1 presents the global SNR results in dB obtained with the Wu method, MA-VD, and the proposed approach (Pro-App) for speech dereverberation of monaural recordings from the TIMIT database. We refer to the degraded reverberant speech as "Rev". It can be seen that our approach gives better SNR results than the other methods, with SNR levels ranging from −2.87 to −4.39 dB.
Table 2 gives the results for our approach and those of Wu and Wang [28] and MA-VD [31]. In Tab. 2, the PESQ scores show the intelligibility of our approach compared to that produced by the state-of-the-art methods. However, for the two longer reverberation times (T_60 = 0.5 s and 0.7 s), the reverberant speech obtains the higher PESQ scores. These results show that the PESQ measure is not well correlated with dereverberation. We also compare the proposed approach with the other methods in terms of SRR measures. The Speech to Reverberation Ratio (SRR) is a speech-based measure of reverberation that can be computed even when the effect of the dereverberation method cannot be described by an impulse response. The SRR can be defined as:

SRR = 10 log_10 ( Σ_n s^2(n) / Σ_n (s(n) − ŝ(n))^2 ),

where s(n) is the clean speech signal and ŝ(n) is the estimated output of our dereverberation approach. The results in Tab. 3 indicate that our approach outperforms the two state-of-the-art methods on the SRR measure.
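A global SRR along these lines can be sketched as below; this assumes the energy-ratio form in dB, whereas segmental variants of the SRR also exist.

```python
import numpy as np

def srr_db(s, s_hat):
    """Global SRR in dB: 10 log10( sum s^2 / sum (s - s_hat)^2 )."""
    err = s - s_hat
    return 10.0 * np.log10(np.sum(s ** 2) / (np.sum(err ** 2) + 1e-12))
```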
We observe that, for all three reverberation times, the proposed approach demonstrates a higher SRR score than the compared methods. The SRR level of the reverberant speech is much lower than that of our approach. Moreover, the two methods [28] and [31] fail to maintain their performance at T_60 = 0.7 s. It can be seen that our approach outperforms the compared methods once the KPCA technique is added to the first stage.
We can conclude that the proposed approach suppresses the long reflections in the first stage by applying a pre-processing procedure followed by a delayed long-term LP. The KPCA stage then enhances the signal obtained from the first stage. The proposed approach is thus more effective in suppressing the long reflections, whereas the two state-of-the-art algorithms yield satisfactory results only when T_60 is short.
In order to show the enhancement obtained by the proposed approach, spectrograms of the original speech, the reverberant speech, the signal enhanced by the inverse filter, and the signal dereverberated by our approach after applying the kernel PCA are presented in Fig. 5.
The spectrograms show that the inverse filter reconstructs the harmonic structure of the original speech, and that after the application of the kernel PCA algorithm the reverberation is further reduced.

Conclusion
In this paper, we present a speech dereverberation approach for monaural recordings. The proposed approach can be summarized in two stages. First, we apply an inverse filter to the reverberant speech after introducing a pre-processing procedure. Second, we use the kernel principal component analysis algorithm to enhance the speech obtained by the inverse filter.
The experimental results show that our approach achieves significantly better speech quality than the state-of-the-art algorithms, with a significant improvement in terms of the objective measures and minimal residual distortion. Moreover, the existing methods do not obtain good results for rooms with a T_60 of more than 0.5 s.
Future work consists in processing reverberant and noisy speech in real time while keeping reasonable distortion levels.

Fig. 5: Spectrograms of speech signals. (a) Spectrogram of the clean speech signal. (b) Spectrogram of the reverberant speech signal. (c) Spectrogram of the speech signal dereverberated by the inverse filter. (d) Spectrogram of the speech signal enhanced by our approach.
Fig. 3: Block diagram of the first stage.
Tab. 1: Performance comparison of global SNR, in the presence of reverberant speech.