A New Soft Masking Method for Speech Enhancement in the Frequency Domain

Abstract: Recently, the ideal binary mask (IdBM) method has attracted keen interest because of its superiority in improving speech intelligibility. This method processes noisy speech on a time-frequency (T-F) unit basis. If the local signal-to-noise ratio (SNR) of a T-F unit is higher than a threshold, the unit is retained; otherwise, it is removed. The method works well in the field of computational auditory scene analysis (CASA). However, because the threshold is usually low, much residual noise remains. In addition, an accurate local SNR is difficult to obtain in practice. In this paper, we propose a new method to improve speech quality and intelligibility. Instead of finding a new way to estimate the local SNR, we compute the probability that the local SNR is higher than the threshold. We then multiply each T-F unit by a suitable gain value to suppress the residual noise. Experimental results show that the proposed method performs well.


I. INTRODUCTION
Because speech signals are easily corrupted by various kinds of noise, speech enhancement has become an important part of speech signal processing. For applications such as automatic speech recognition (ASR) systems, whether speech enhancement is applied can make a big difference.
Many classic and effective speech enhancement methods have been proposed in the past. Algorithms such as Wiener filtering [1] and minimum mean-square error (MMSE) estimation [2] can greatly improve speech quality, but they yield little gain in speech intelligibility. In [3], Loizou revealed that amplification distortions exceeding 6 dB are responsible for the damage to speech intelligibility. Meanwhile, most existing algorithms allow this kind of amplification distortion.
Recently, many researchers have begun to focus on the IdBM method, which is commonly used in CASA [4]. According to [5], [6], processing noisy speech with this method improves speech intelligibility markedly. The IdBM method can be regarded as binary masking. Speech signals are first transformed to the frequency domain and divided into many T-F units. Then, a proper threshold is selected. When the local SNR of a T-F unit is higher than the selected threshold, the unit is kept; otherwise, the unit is discarded, which means its gain value is zero. We notice that the threshold is usually low, which means much noise is retained. This is harmful to speech quality and intelligibility. In this paper, we modify the binary gain function to suppress this noise. More specifically, we choose a new form of gain value to replace the unity gain that is applied when the local SNR is higher than the threshold.
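The binary masking rule described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the clean and noise power of every T-F unit are known (which is exactly what makes the mask "ideal"); the function name `ideal_binary_mask`, the toy values, and the small constant `eps` are hypothetical and not taken from the paper.

```python
import numpy as np

def ideal_binary_mask(clean_power, noise_power, thr_db=0.0, eps=1e-12):
    """Return 1.0 for T-F units whose local SNR (in dB) exceeds thr_db, else 0.0."""
    # eps only guards against division by zero and log of zero (an assumption).
    local_snr_db = 10.0 * np.log10((clean_power + eps) / (noise_power + eps))
    return (local_snr_db > thr_db).astype(float)

# Toy grid of T-F units (rows: frequency bins, columns: frames).
clean_power = np.array([[4.0, 0.1, 1.0],
                        [0.5, 9.0, 0.01]])
noise_power = np.ones_like(clean_power)

mask = ideal_binary_mask(clean_power, noise_power, thr_db=-5.0)
# Retained units keep their noisy power unchanged; discarded units are zeroed.
masked_power = mask * (clean_power + noise_power)
```

With a low threshold such as -5 dB, units whose noise power exceeds the clean power are still passed with unity gain, which is the residual-noise problem the proposed soft gain is meant to address.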
There is still a problem to consider. As can be seen from its definition, the IdBM method needs an accurate local SNR to decide the gain value, which is very difficult to obtain in practice. In [4], several advanced and useful algorithms were proposed; of particular interest is the SMPO method. Instead of trying to calculate the accurate local SNR, this method calculates the probability that the local SNR is greater than the threshold.
The remainder of this paper is organized as follows. Section II introduces the background knowledge and assumptions. Section III describes the details of the proposed method. Section IV presents the experimental setup and results. Section V draws the conclusions.

II. ASSUMPTIONS AND HYPOTHESIS MODEL
For speech enhancement, the first step is to establish a proper analysis model. Many studies have used the linear additive model. Under this model, the degraded speech signal is the sum of the clean speech signal and the noise signal, where $Y(k, \tau)$ denotes the short-time spectrum of the degraded speech at frequency bin $k$ and frame $\tau$. We then use the magnitude-squared spectrum to approximate the power spectrum. This approximation is common in spectral subtraction algorithms [7]-[12]. With it, (2) can be rewritten as (3), which we simplify for convenience to $Y_k^2 = X_k^2 + N_k^2$ in (4).
As is well known, the short-time Fourier transform coefficients can be divided into a real part and an imaginary part. Here we assume that both parts are Gaussian distributed with equal variance and that the two distributions are independent [13], [14]. Under this condition, $X_k^2$ and $N_k^2$ follow exponential distributions according to [15]; the corresponding probability density functions (PDFs) are given by (5) and (6), where $\sigma_x^2(k)$ and $\sigma_n^2(k)$ denote the clean speech and noise variances, respectively. According to the properties of the exponential distribution, the expectation of $X_k^2$ is related to $\sigma_x^2(k)$ as described in (7), and the expectation of $N_k^2$ is related to $\sigma_n^2(k)$ in a similar way in (8). According to Bayes' rule, the posterior PDF of $X_k^2$ can be calculated as in (9). Before inserting (5), (10), and (11) into (9), we define two intermediate variables in (12) and (13); we then obtain (14). Finally, note that we adopt assumption (15), which is reasonable.
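The chain of assumptions in this section can be summarized compactly. The following is a sketch derived only from the stated assumptions (additive model, magnitude-squared approximation, independent exponentially distributed $X_k^2$ and $N_k^2$); it is not a reproduction of the paper's numbered equations, and the density notation $f(\cdot)$ is an assumption made here for illustration.

```latex
\begin{align*}
Y_k^2 &\approx X_k^2 + N_k^2, \\
f(X_k^2) &= \frac{1}{\sigma_x^2(k)} \exp\!\left(-\frac{X_k^2}{\sigma_x^2(k)}\right), \qquad
f(N_k^2) = \frac{1}{\sigma_n^2(k)} \exp\!\left(-\frac{N_k^2}{\sigma_n^2(k)}\right), \\
E\!\left[X_k^2\right] &= \sigma_x^2(k), \qquad E\!\left[N_k^2\right] = \sigma_n^2(k), \\
f(Y_k^2 \mid X_k^2) &= \frac{1}{\sigma_n^2(k)} \exp\!\left(-\frac{Y_k^2 - X_k^2}{\sigma_n^2(k)}\right), \qquad 0 \le X_k^2 \le Y_k^2, \\
f(X_k^2 \mid Y_k^2) &= \frac{f(Y_k^2 \mid X_k^2)\, f(X_k^2)}{f(Y_k^2)}
\;\propto\; \exp\!\left[-X_k^2 \left(\frac{1}{\sigma_x^2(k)} - \frac{1}{\sigma_n^2(k)}\right)\right].
\end{align*}
```

Under these assumptions the posterior of $X_k^2$ is an exponential truncated to the interval $[0, Y_k^2]$, which is why probabilities of the form "local SNR above a threshold" can be evaluated in closed form.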

III. PROPOSED SPEECH ENHANCEMENT METHOD
As mentioned above, the IdBM can yield high intelligibility, but it requires an accurate local SNR. However, the true local SNR is difficult to estimate in practice because we do not have enough information about the clean speech. The local SNR is defined as the ratio of the clean speech power to the noise power in each T-F unit, $\xi_{L,k} = X_k^2 / N_k^2$.
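Following the approach in [16], the IdBM can be formulated as a binary hypothesis model: under hypothesis $H_1$ the local SNR does not exceed the threshold and the T-F unit is discarded, whereas under hypothesis $H_2$ the local SNR exceeds the threshold and the T-F unit is retained.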
Here thr represents the threshold of the local SNR. According to (18), estimating $X_k^2$ requires an accurate local SNR, which can hardly be achieved without the clean speech. There is another approach that avoids this problem. Taking the expectation of $X_k^2$, we get (20), where $P(H_x)$ represents the probability that hypothesis $H_x$ is true and $E[G_k \mid H_x]$ denotes the gain function when hypothesis $H_x$ is true. $P(H_2)$ is the key to this equation; it is defined in (21) as the posterior probability that $\xi_{L,k} > \theta$. Inserting (4) and (17) into (21) gives (22), and inserting (14) into (22) gives (23), where $\lambda_k$ is an intermediate variable defined in (24). Inserting (15) into (21) gives (25).
Notice that $\sigma_x^2/\sigma_n^2$ is exactly the definition of the a priori SNR $\xi_k$, so we can use the "decision-directed" approach [2] to calculate it.
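The decision-directed estimate of the a priori SNR can be sketched as follows. This is a generic version of the recursion from [2], assuming a known noise variance per bin, a smoothing factor `alpha = 0.98`, a floor `xi_min`, and a Wiener-type gain for the previous-frame estimate; these values and the function names are illustrative assumptions, not settings reported in this paper.

```python
import numpy as np

def wiener_gain(xi):
    """Wiener-type gain as a function of the a priori SNR (linear scale)."""
    return xi / (1.0 + xi)

def decision_directed_xi(noisy_power, noise_var, gain_fn=wiener_gain,
                         alpha=0.98, xi_min=1e-3):
    """Estimate the a priori SNR frame by frame.

    noisy_power: array of shape (frames, bins) holding Y_k^2.
    noise_var:   array of shape (bins,) holding sigma_n^2(k).
    """
    noisy_power = np.asarray(noisy_power, dtype=float)
    noise_var = np.asarray(noise_var, dtype=float)
    xi = np.empty_like(noisy_power)
    prev_clean_power = np.zeros(noisy_power.shape[1])  # no history before frame 0
    for t in range(noisy_power.shape[0]):
        gamma = noisy_power[t] / noise_var              # a posteriori SNR
        xi_t = (alpha * prev_clean_power / noise_var
                + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0))
        xi_t = np.maximum(xi_t, xi_min)                 # floor against musical noise
        xi[t] = xi_t
        # Clean power estimate of this frame, reused in the next frame.
        prev_clean_power = gain_fn(xi_t) ** 2 * noisy_power[t]
    return xi

noisy_power = np.array([[3.0, 0.5], [4.0, 1.0], [2.0, 6.0]])
noise_var = np.array([1.0, 1.0])
xi = decision_directed_xi(noisy_power, noise_var)
```

The heavy weight on the previous frame smooths the estimate over time, while the floor keeps the estimate strictly positive so that later ratios such as $1/\xi_k$ remain finite.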
Next, we focus on $E(G_k \mid H_1)$ and $E(G_k \mid H_2)$. For the former, $E(G_k \mid H_1) = 0$ is reasonable because it is consistent with the IdBM method. As for $E(G_k \mid H_2)$, to suppress residual noise it should be less than one and should vary consistently with $\xi_k$. Following this principle, we choose two classical forms: the Wiener gain and the minimum mean-square error (MMSE) gain based on the magnitude-squared spectrum [4], given in (26) and (27), respectively.
Inserting (26) and (27) into (20), respectively, we get (28) and (29). We denote (28) as SMPO-Wiener and (29) as SMPO-MMSE. To compare these methods intuitively, we fix thr at 0 dB and let the a priori SNR $\xi$ range from -10 dB to 20 dB; the corresponding gain functions are plotted in Fig. 1. As can be seen from the figure, the gain functions of SMPO-Wiener and SMPO-MMSE are more aggressive than that of the SMPO when the a priori SNR $\xi$ is low, which indicates stronger noise reduction. When $\xi$ is high, the SMPO-Wiener and SMPO gains are almost the same, while the SMPO-MMSE gain is noticeably lower.
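The idea of weighting a gain by the probability of $H_2$ can be illustrated with a short sketch. It assumes the exponential model of Section II, under which the posterior of $X_k^2$ is an exponential truncated to $[0, Y_k^2]$, and it uses the textbook Wiener gain $\xi/(1+\xi)$ for $E(G_k \mid H_2)$. The closed form below is derived from those assumptions and may differ in form from (25) and (28); the function names, the a posteriori SNR argument `gamma`, and the choice `gamma = 1 + xi` in the example are all hypothetical.

```python
import math

def prob_h2(xi, gamma, thr_db=0.0):
    """P(local SNR > threshold | Y^2) under the exponential model.

    xi:    a priori SNR sigma_x^2 / sigma_n^2 (linear scale).
    gamma: a posteriori SNR Y^2 / sigma_n^2 (linear scale).
    """
    theta = 10.0 ** (thr_db / 10.0)
    # X^2 > theta * (Y^2 - X^2)  is equivalent to  X^2 > frac * Y^2.
    frac = theta / (1.0 + theta)
    # nu = Y^2 * (1/sigma_x^2 - 1/sigma_n^2), the rate of the truncated exponential.
    nu = gamma * (1.0 - xi) / xi
    if abs(nu) < 1e-9:
        return 1.0 - frac  # flat posterior when sigma_x^2 equals sigma_n^2
    return (math.exp(-nu * frac) - math.exp(-nu)) / (1.0 - math.exp(-nu))

def wiener_gain(xi):
    """Textbook Wiener gain, used here as one possible E(G | H2)."""
    return xi / (1.0 + xi)

def soft_wiener_gain(xi, gamma, thr_db=0.0):
    """Soft mask: P(H1) * 0 + P(H2) * E(G | H2)."""
    return prob_h2(xi, gamma, thr_db) * wiener_gain(xi)

# Gain versus a priori SNR, with gamma set to its expected value 1 + xi
# purely for illustration.
for xi_db in (-10, 0, 10, 20):
    xi = 10.0 ** (xi_db / 10.0)
    print(xi_db, round(soft_wiener_gain(xi, 1.0 + xi), 4))
```

Because the Wiener factor is always below one, the soft gain is never larger than the probability weight alone, which matches the qualitative observation that the modified gains are more aggressive at low a priori SNR.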

IV. EXPERIMENTS AND RESULTS
We chose the NOIZEUS [17] database as our experimental corpus. This database contains 30 clean English sentences, each corrupted by eight types of noise at four SNR levels, so 960 noisy sentences are processed in total. The noise types are car, street, babble, exhibition, restaurant, station, train, and airport. The four SNR levels are 0 dB, 5 dB, 10 dB, and 15 dB. After processing, we use two objective measures, the segmental SNR and the Perceptual Evaluation of Speech Quality (PESQ) [18], to assess the methods. Both measures are widely used in speech enhancement; in addition, PESQ has been shown to correlate highly with speech intelligibility. For both measures, higher values indicate better performance.
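For reference, the segmental SNR can be computed along the following lines. This is a minimal sketch of the common frame-averaged definition, assuming non-overlapping frames of 256 samples and per-frame values clamped to [-10, 35] dB; the paper does not state its frame length or clamping limits, so these parameters and the function name are assumptions.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, lo_db=-10.0, hi_db=35.0,
                  eps=1e-12):
    """Average of per-frame SNRs (dB) between a clean and an enhanced signal."""
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    n_frames = min(len(clean), len(enhanced)) // frame_len
    if n_frames == 0:
        raise ValueError("signals are shorter than one frame")
    frame_snrs = []
    for i in range(n_frames):
        seg = slice(i * frame_len, (i + 1) * frame_len)
        signal_energy = np.sum(clean[seg] ** 2)
        error_energy = np.sum((clean[seg] - enhanced[seg]) ** 2)
        snr_db = 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))
        # Clamp so silent or perfectly matched frames do not dominate the mean.
        frame_snrs.append(np.clip(snr_db, lo_db, hi_db))
    return float(np.mean(frame_snrs))

# Toy check with a 440 Hz tone sampled at 8 kHz.
clean = np.sin(2.0 * np.pi * 440.0 * np.arange(1024) / 8000.0)
perfect = segmental_snr(clean, clean)
silent = segmental_snr(clean, np.zeros_like(clean))
```

A perfect reconstruction saturates at the upper clamp, while an all-zero output yields 0 dB in every frame because the error energy then equals the signal energy.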

A. Best SNR Threshold for Methods
The threshold is important for the methods mentioned above, so we first find the best thresholds for SMPO-Wiener and SMPO-MMSE. For the SMPO, the best threshold was shown to be 0 dB in [4].
The research in [5] showed that the appropriate range of the SNR threshold is [-12, 5] dB. In our experiments, the threshold ranges from -10 dB to 5 dB in steps of 5 dB. Here, we choose four kinds of noise: babble, car, street, and airport. Each kind of noisy speech has four SNR levels, i.e., 0 dB, 5 dB, 10 dB, and 15 dB. We process these degraded sentences with the SMPO-Wiener and SMPO-MMSE methods, respectively. We then compute the PESQ and segmental SNR values for each processed sentence and average them. The results are given in Table I and Table II, where Table I shows the segmental SNR results and Table II shows the PESQ results.
From Tables I and II, we find that both the SMPO-Wiener and SMPO-MMSE methods perform best, in terms of PESQ, when the threshold is -5 dB. Further experiments showed that this holds for all types of noise. Because PESQ correlates more highly with speech intelligibility than segmental SNR does, we take -5 dB as the best threshold for the SMPO-Wiener and SMPO-MMSE methods and use it in the remaining experiments.

B. Results and Comparison of Methods
In the following experiments, the above-mentioned methods, i.e., SMPO, SMPO-Wiener, and SMPO-MMSE, are applied to the noisy sentences. We then calculate the PESQ and segmental SNR values of the processed sentences. From Table III and Table IV, we see that SMPO-Wiener performs best, and its segmental SNR improves significantly. Overall, both proposed methods, SMPO-Wiener and SMPO-MMSE, perform better than the original SMPO method.
Figure 2 shows the PESQ improvement relative to the unprocessed speech. Fig. 3 and Fig. 4 show the time-domain waveforms and spectrograms of the speech signals. Among the three methods, SMPO-Wiener performed best. The reason why SMPO-MMSE is not as good as SMPO-Wiener can be seen from Fig. 1. When the a priori SNR $\xi$ is high, the gain of SMPO-MMSE is noticeably lower than that of SMPO-Wiener, which means more speech distortion.

V. CONCLUSIONS
In this paper, a new soft masking method incorporating SNR uncertainty was derived to enhance noisy speech. Compared with the conventional SMPO method, the proposed SMPO-Wiener and SMPO-MMSE methods yielded better performance because they suppress residual noise. Comparing SMPO-Wiener and SMPO-MMSE, we analysed why SMPO-Wiener is more suitable than SMPO-MMSE in this condition. The difference between the two also suggests that performance could be improved further by finding more suitable forms of $E(G_k \mid H_2)$. In addition, the binary masking model could be replaced by other masking forms, because noise is completely masked by the auditory masking effect when the local SNR is high enough. In that case, further noise suppression would be useless or even harmful to speech intelligibility and quality. We will address these issues in future research.

APPENDIX A
In this section, the PDF of $Y_k^2$ presented in (11) is derived. Given $Y_k^2 = X_k^2 + N_k^2$, we obtain the convolution integral (30), where $\tau$ is the integration variable. Inserting (5) and (6) into (30) and evaluating the definite integral yields the PDF in (11).
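The result of this derivation can be checked numerically. The sketch below assumes the exponential densities of Section II with means $\sigma_x^2$ and $\sigma_n^2$ (with $\sigma_x^2 \neq \sigma_n^2$) and compares the closed form obtained by evaluating the convolution integral by hand with a midpoint-rule approximation of the same integral; the function names and test values are illustrative only.

```python
import numpy as np

def pdf_y2_closed_form(y2, sigma_x2, sigma_n2):
    """PDF of Y^2 = X^2 + N^2 for independent exponential X^2 and N^2.

    Valid only for sigma_x2 != sigma_n2.
    """
    return (np.exp(-y2 / sigma_x2) - np.exp(-y2 / sigma_n2)) / (sigma_x2 - sigma_n2)

def pdf_y2_numeric(y2, sigma_x2, sigma_n2, n_steps=20000):
    """Midpoint-rule evaluation of the convolution integral over tau in [0, y2]."""
    step = y2 / n_steps
    tau = (np.arange(n_steps) + 0.5) * step
    f_x = np.exp(-tau / sigma_x2) / sigma_x2          # density of X^2 at tau
    f_n = np.exp(-(y2 - tau) / sigma_n2) / sigma_n2   # density of N^2 at y2 - tau
    return float(np.sum(f_x * f_n) * step)

closed = float(pdf_y2_closed_form(1.5, 2.0, 0.5))
numeric = pdf_y2_numeric(1.5, 2.0, 0.5)
```

The two values should agree to many decimal places, which gives a quick sanity check on the integration limits and on the sign of the denominator.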

Fig. 1. Gain functions of the SMPO, SMPO-Wiener, and SMPO-MMSE as a function of the a priori SNR $\xi$. The threshold was fixed at thr = 0 dB.

TABLE I. EXPERIMENTAL RESULTS OF SMPO-WIENER AND SMPO-MMSE AS A FUNCTION OF THRESHOLD, THR, IN TERMS OF PESQ.

Table III and Table IV show our statistical experiment results.

TABLE III. PERFORMANCE OF THE MENTIONED METHODS IN TERMS OF PESQ.