Fast Noise Compensation and Adaptive Enhancement for Speech Separation

We propose a novel approach to improve adaptive decorrelation filtering (ADF)-based speech source separation in diffuse noise. The effects of noise on system adaptation and separation outputs are handled separately. First, fast noise compensation (NC) is developed for adaptation of separation filters, forcing ADF to focus on source separation; next, output noises are suppressed by speech enhancement. By tracking noise components in output cross-correlation functions, the bias effect of noise on the system adaptation objective function is compensated, and by adaptively estimating output noise autocorrelations, the speech separation output is enhanced. For fast noise compensation, a blockwise fast ADF (FADF) is implemented. Experiments were conducted on real and simulated diffuse noises. Speech mixtures were generated by convolving TIMIT speech sources with acoustic path impulse responses measured in a real room with reverberation time $T_{60} = 0.3$ second. The proposed techniques significantly improved separation performance and phone recognition accuracy of ADF outputs.


INTRODUCTION
Interfering speech and diffuse noise pose a twofold challenge for hands-free automatic speech recognition (ASR) and speech communication. For practical applications of blind source separation (BSS), it is important to address the effects of noise on speech separation: (1) noise may degrade the conditions of BSS and hence hurt separation performance; (2) BSS aims at source separation and has limited ability to suppress diffuse noise. Although "bias removal" has been identified as a general approach for improving speech separation in noise [1], the performance depends largely on the specific separation algorithm. Some noise compensation (NC) methods, for example [2], were proposed for a natural gradient-based separation algorithm. Other reported studies either focused primarily on theoretical issues, for example [3], or handled only restricted conditions such as uncorrelated noises, for example [4], or simplified mixing models such as anechoic mixing [5]. The limitations of BSS in noise suppression have been reported previously. Araki et al. [6,7] established the similarity in mechanism between BSS and the adaptive null beamformer. Asano et al. [8] grouped the two approaches under "spatial inverse" processing and pointed out that they are able to suppress only directional interferences, not omnidirectional ambient noises. Therefore, when both interfering speech and diffuse noise are present, output noise suppression is needed in addition to separation processing. On the other hand, speech enhancement algorithms that are formulated for stationary noises cannot be applied directly in this scenario, because the adaptation of the separation filters makes the output noise statistics time varying. Such variation may happen frequently when the mixing acoustic paths change, for example, when a speaker moves.
In our previous works [9,10], the separation model of adaptive decorrelation filtering (ADF) [11,12] was significantly improved for noise-free speech mixtures in terms of both convergence rate and steady-state filter estimation accuracy. A noise-compensated ADF [4] was proposed for speech mixtures contaminated by white uncorrelated noises. However, in real sound fields, diffuse noises are colored and spatially correlated at low frequencies, and they deteriorate ADF performance more severely than uncorrelated noises [13]. It might appear that noise can simply be removed from the speech inputs prior to ADF separation. However, such noise prefiltering deteriorates the conditions for subsequent source separation, due to nonlinear distortions introduced by speech enhancement [13].
EURASIP Journal on Audio, Speech, and Music Processing
In the current work, we propose to address the challenge of speech separation and diffuse noise suppression by a two-step strategy. First, a noise compensation (NC) [14] algorithm is developed to improve speech separation performance; efficient blockwise implementations of the compensation processing and the ADF filtering are derived using the FFT. As the separation filters change over time, the output noise cross-correlation statistics are tracked so that the filter adaptation bias can be removed. Second, the output noise autocorrelations are estimated and used to enhance the speech signals separated in the first step [15], so as to improve speech quality. Speech separation, enhancement, and phone recognition experiments were conducted, and the results are presented to show the performance of the proposed separation and enhancement techniques.

ADF MODEL IN NOISE
In the following, we use variables in bold lower case for vectors, bold upper case for matrices, superscript $T$ for transposition, $\mathbf{I}$ for the identity matrix, "$*$" for convolution, and $E\{\cdot\}$ for expectation. The correlation matrix formed by vectors $\mathbf{a}$ and $\mathbf{b}$ is defined as $\mathbf{R}_{ab} = E\{\mathbf{a}\mathbf{b}^T\}$, and the correlation vector between a scalar $a$ and a vector $\mathbf{b}$ as $\mathbf{r}_{ab} = E\{a\mathbf{b}\}$. $N$ and $K$ denote the filter and block lengths, respectively. Speech and noise signal vectors contain $N$ consecutive samples up to the current time $t$, and their counterparts with $2N - 1$ samples up to time $t$ are marked with a tilde.
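The noisy speech mixing and ADF separation systems are shown in Figure 1, where $\mathbf{g}_{ij} = [g_{ij}(0), \ldots, g_{ij}(N-1)]^T$, $i, j = 1, 2$, $i \neq j$, are the separation filters. We formulate the I/O relations of ADF as in [4]:

$$\mathbf{v}_n = \mathbf{G}(\mathbf{y} + \mathbf{n}), \qquad (1)$$

where $\mathbf{y} = [\mathbf{y}_1^T(t), \mathbf{y}_2^T(t)]^T$ and $\mathbf{n} = [\mathbf{n}_1^T(t), \mathbf{n}_2^T(t)]^T$ are vectors of the clean speech mixture and the noise, respectively, and $\mathbf{G}$ is the separation system matrix formed from the filters $\mathbf{g}_{ij}$. For the noisy ADF output $\mathbf{v}_n$, the speech-only output is denoted by $\mathbf{v}$ and the noise output component by $\boldsymbol{\eta}$. Then, the effect of noise on the system output correlation matrix is described by $\mathbf{R}_{v_nv_n} = \mathbf{R}_{vv} + \mathbf{R}_{\eta\eta}$. The I/O relations in correlation vectors of speech are

$$\mathbf{r}_{v_iv_j} = \mathbf{r}_{y_iy_j} - \mathbf{G}_{ji}\mathbf{r}_{y_iy_i} - \mathbf{R}_{y_jy_j}\mathbf{g}_{ij} + \mathbf{G}_{ji}\mathbf{R}_{y_iy_j}\mathbf{g}_{ij},$$

and the output noise cross-correlation (4) and autocorrelation (5) are expressed analogously in terms of the input noise correlations. In the absence of noise, the basic ADF adaptation algorithm (6) is given in [11]. It has been shown in [4] that the same adaptation equation can be obtained by taking the decorrelation objective functions as $J_{ij} = \frac{1}{2}\mathbf{r}_{v_iv_j}^T\mathbf{r}_{v_iv_j}$ (7) and approximating $\mathbf{r}_{v_iv_j}$ by the instantaneous correlations $v_i(t)\mathbf{v}_j(t)$. For the step-size $\mu(t)$, [12] proposed an input-normalized technique based on a convergence analysis, which was combined in [9] with variable step-size (VSS) techniques to accelerate convergence and reduce the ADF estimation error.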
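To make the structure concrete, the following minimal NumPy sketch runs a samplewise two-channel ADF. It assumes the cross-coupled form $v_i(t) = y_i(t) - \mathbf{g}_{ij}^T\mathbf{y}_j(t)$ implied by the correlation relation above, and an instantaneous-correlation update $\mathbf{g}_{ij} \leftarrow \mathbf{g}_{ij} + \mu\, v_i(t)\mathbf{v}_j(t)$ consistent with the description of (6) and (7). The function name `adf_separate` and the fixed step-size `mu` are illustrative assumptions; the step-size rules of [9,12] are not reproduced.

```python
import numpy as np

def adf_separate(y1, y2, N, mu):
    """Samplewise two-channel ADF sketch with a fixed step-size (assumed form)."""
    T = len(y1)
    g12 = np.zeros(N)  # filter applied to y2 when forming v1
    g21 = np.zeros(N)  # filter applied to y1 when forming v2
    v1 = np.zeros(T)
    v2 = np.zeros(T)
    # Prepend N-1 zeros so that every time index has N samples of history.
    y1p = np.concatenate([np.zeros(N - 1), y1])
    y2p = np.concatenate([np.zeros(N - 1), y2])
    v1p = np.zeros(T + N - 1)
    v2p = np.zeros(T + N - 1)
    for t in range(T):
        # Most recent N samples, newest first: [x(t), x(t-1), ..., x(t-N+1)].
        y1_vec = y1p[t:t + N][::-1]
        y2_vec = y2p[t:t + N][::-1]
        # Cross-coupled filtering: subtract the filtered opposite channel.
        v1[t] = y1[t] - g12 @ y2_vec
        v2[t] = y2[t] - g21 @ y1_vec
        v1p[t + N - 1] = v1[t]
        v2p[t + N - 1] = v2[t]
        v1_vec = v1p[t:t + N][::-1]
        v2_vec = v2p[t:t + N][::-1]
        # Decorrelation update driven by instantaneous output cross-correlations.
        g12 = g12 + mu * v1[t] * v2_vec
        g21 = g21 + mu * v2[t] * v1_vec
    return v1, v2, g12, g21
```

With `mu = 0` the filters stay at their zero initialization and the outputs equal the inputs, which corresponds to the totally blind starting condition used in the experiments.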
The proposed system for improving ADF in noise works in two steps, as shown in Figure 2. In the NC step, the noise effects on the adaptation procedure (6), including the step-size computation, are reduced to improve speech separation. In the adaptive enhancement step, the ADF speech outputs are enhanced by noise reduction. The techniques for these two processing steps are detailed in Sections 3 and 4, respectively.

NOISE COMPENSATION FOR ADF
Since the objective function in the form of (7) becomes $J_{nij} = \frac{1}{2}\mathbf{r}_{v_{ni}v_{nj}}^T\mathbf{r}_{v_{ni}v_{nj}}$ in noise, the adaptation (6) contains a bias caused by the output noise cross-correlations, which deteriorates its performance. As shown in (4), the noise component in the output cross-correlation varies as the filters $\mathbf{g}_{ij}$ adapt. The time-varying noise effect can be reduced by using an estimate of the speech cross-correlation, $\mathbf{r}_{v_iv_j} = \mathbf{r}_{v_{ni}v_{nj}} - \mathbf{r}_{\eta_i\eta_j}$, as formulated in (8). Based on (8), the noise-compensated ADF (NC-ADF) is obtained as (9), where $\mathbf{r}_{\eta_i\eta_j}(t)$ is the estimate of the output noise cross-correlation, and $0 < \alpha \le 1$ is the discount factor that prevents overcompensation. In the following, $\alpha = 0.9$ is used. In the current work, for the computation of step-sizes, the VSS technique of [9] is extended to include a compensation of output noise powers. Unequal source energies affect the filter estimation errors: the lower the relative strength of the $j$th source, the higher the estimation error of the filter $\mathbf{g}_{ij}$ [9]. To reduce the ADF estimation error caused by unbalanced source energies, the step-sizes are scaled by the relative short-term powers of the ADF outputs as in (10), where the normalizing gain factor $\mu(t)$ in (11) was given by [12], with $\sigma^2_{y_{ni}}(t)$ the short-term power of the $i$th input and $\gamma$ ($0 < \gamma < 1$) the constant gain factor that controls convergence speed. The estimated average speech output power $\sigma^2_{av}(t)$ is given by (12). The noise compensation of the output power is made by subtracting the noise power from the power of the noisy ADF output, as in (13), and the output noise power is obtained from (5), as in (14).
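The compensation idea can be sketched in a few lines. The update below is an assumed form of (9), inferred from the description: the instantaneous noisy cross-correlation is corrected by the discounted noise cross-correlation estimate before it drives the filter update. The helper names `nc_adf_step` and `compensated_power` are hypothetical, and the step-size `mu_ij` is passed in directly because the expressions (10)-(12) are not reproduced here.

```python
import numpy as np

def nc_adf_step(g_ij, v_ni_t, v_nj_vec, r_eta_ij, mu_ij, alpha=0.9):
    """One noise-compensated filter update (assumed form of the NC-ADF step).

    g_ij     : current separation filter, length N
    v_ni_t   : current sample of the noisy output i
    v_nj_vec : last N samples of the noisy output j, newest first
    r_eta_ij : estimate of the output noise cross-correlation vector
    alpha    : discount factor in (0, 1] guarding against overcompensation
    """
    inst_corr = v_ni_t * np.asarray(v_nj_vec, dtype=float)
    bias = alpha * np.asarray(r_eta_ij, dtype=float)
    return np.asarray(g_ij, dtype=float) + mu_ij * (inst_corr - bias)

def compensated_power(p_noisy, p_noise):
    """Noise-compensated output power, floored at zero."""
    return max(p_noisy - p_noise, 0.0)
```

If the noise estimate exactly matches the instantaneous correlation and `alpha = 1`, the update vanishes, which is the sense in which the noise bias no longer drives the adaptation.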

Fast update of compensation terms
Direct computation of the noise cross-correlation vectors in the NC-ADF adaptation (9) is not feasible for real-time applications, since the terms in (4) require matrix-vector multiplications at every time sample. For fixed speaker locations, the changes of the ADF filters are in general small within short time intervals (e.g., around 30 milliseconds). The slow change of the ADF parameters and the short-term stationarity of the input noise make it possible to update the compensation terms in a blockwise fashion, reducing the update rate by a factor of $K$ (the block length). To speed up NC-ADF, we first reduce the update rate for the compensation terms and then utilize the Toeplitz structures of both the system and the correlation matrices to derive an FFT-based estimation of (4). The estimate of the output bias (4) can be rewritten as (15), in terms of the vectors $\mathbf{a}_{ij}$, $\mathbf{b}_{ij}$, $\mathbf{c}_{ij}$, and $\mathbf{d}_{ij}$. The computations of $\mathbf{a}_{ij}$ and $\mathbf{c}_{ij}$ share the same structure. The components of the vector $\mathbf{a}_{ij}$, that is, $a_{ij}(k)$, $k = 0, \ldots, N-1$, can be expressed as the last $N$ samples, in reversed order, of the convolution $g_{ji}(n) * \xi^a_{ij}(n)$. Based on such convolutive expressions, the $N$-point sequences $a_{ij}(k)$, $b_{ij}(k)$, and $c_{ij}(k)$ can be computed by $N_F$-point FFTs ($N_F > 2N - 1$). For modularity, the $(2N-1)$-point sequence $d_{ij}(k)$ can be decomposed into two $N$-point subsequences and computed with two $N_F$-point FFT-IFFT modules. In this way, all the sequences above only need to be zero-padded to length $N_F$, because only $N$-point results are required in each module. The remaining points, which are affected by aliasing, are irrelevant and are discarded.
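The Toeplitz shortcut can be checked numerically. The sketch below assumes an $N \times (2N-1)$ Toeplitz matrix whose $k$th row holds the filter shifted by $k$ positions, so that the product with a $(2N-1)$-point vector is $c(k) = \sum_l g(l)\, r(k+l)$. The function names and the row convention are illustrative assumptions; the exact sequences $\xi^a_{ij}(n)$ of the paper are not reproduced.

```python
import numpy as np

def toeplitz_matvec_fft(g, r, n_fft=None):
    """Compute c[k] = sum_l g[l] * r[k + l], k = 0..N-1, with one FFT-IFFT pass."""
    N = len(g)
    P = len(r)
    assert P == 2 * N - 1
    if n_fft is None:
        n_fft = 1 << (P - 1).bit_length()  # next power of two >= 2N-1
    assert n_fft >= P
    # Circular convolution of g with the time-reversed r.
    spec = np.fft.rfft(g, n_fft) * np.fft.rfft(r[::-1], n_fft)
    circ = np.fft.irfft(spec, n_fft)
    # Samples N-1 .. 2N-2 are free of aliasing; read them in reversed order.
    return circ[N - 1:2 * N - 1][::-1]

def toeplitz_matvec_direct(g, r):
    """Reference O(N^2) product with the explicit N x (2N-1) Toeplitz matrix."""
    N = len(g)
    G = np.zeros((N, 2 * N - 1))
    for k in range(N):
        G[k, k:k + N] = g
    return G @ r
```

In this sketch the aliased points are the first $N-1$ outputs of the circular convolution, which are discarded, so an FFT length of at least $2N-1$ is sufficient.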
From (13)-(15), the noise-free ADF output powers used in the VSS computation are estimated as in (17).

Fast ADF and NC-FADF
The samplewise procedures of filtering (1) and adaptation (6) of ADF are also modified for a blockwise implementation to enable fast noise compensation. The fast computation of (1) can use the standard overlap-add fast convolution [16] under the approximation that the filters are constant within each block. By using a constant step-size in each block, a block-adaptive procedure for the filter update can be obtained. For noise-free ADF, consider the $m$th block covering samples from $t_m$ to $t_m + K - 1$, and let $\mathbf{g}_{ij}^m = \mathbf{g}_{ij}(t_m)$ be the filters of the current block. After the ADF outputs of the $m$th block are obtained by fast convolution filtering, the step-size $\mu^m$ can be estimated and used to update the filters for the entire current block. By summing up both sides of (6) for $t = t_m, \ldots, t_m + K - 1$, the new filters for the next block, $\mathbf{g}_{ij}^{m+1} = \mathbf{g}_{ij}(t_m + K)$, can be estimated as in (18), where the block cross-correlation estimate (19) can be computed by an FFT-based fast implementation [16]. Similarly, the blockwise NC-FADF (20) is obtained from (9), where $\mathbf{r}^m_{v_{ni}v_{nj}}$ is defined by replacing $v_i$ and $v_j$ with their noisy counterparts in (19), $\mathbf{r}^m_{\eta_i\eta_j}$ is from (15), and the block step-size $\mu^m_{ij}$ is computed by (10). The normalization gain factor $\mu^m$ in (11) uses ADF input powers that are estimated from the samples of both the current and the previous blocks. To prevent overcompensation in NC-FADF, $\sigma^2_{v_j}$ in (17) is set to zero when negative values occur. A small positive number is also added to the denominator in (10) to avoid division by zero. Triangular windows $w(n) = (N - n)/N$, $n = 0, \ldots, N-1$, are applied to both the correlation estimate $\mathbf{r}^m_{\eta_i\eta_j}$ and the ADF adaptation vectors to prevent instability.
The overlap-add method requires $N \le K \le 2N$. When $K = N$ and the FFT length is $N_F = 2N$, the computation of the $2N$-point FFTs is distributed over the block of length $N$, resulting in a complexity of $O(\log N)$ per time sample for NC-FADF, in contrast to $O(N^2)$ for a direct estimation of the NC terms, which requires matrix-vector multiplications.
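A blockwise update of this kind can be sketched as follows. The block cross-correlation is assumed to be the unnormalized sum $\sum_t v_i(t)\, v_j(t-k)$ over the $K$ samples of the block, for lags $k = 0, \ldots, N-1$, which is what summing the samplewise update over a block suggests; the function names and the exact form of the windowed update are illustrative assumptions rather than the expressions (18)-(20).

```python
import numpy as np

def block_xcorr_fft(v_i_blk, v_j_ext, N):
    """Block cross-correlation c[k] = sum_t v_i(t) v_j(t-k), k = 0..N-1.

    v_i_blk : the K samples of output i in the current block
    v_j_ext : N-1 previous samples of output j followed by its K block samples
    """
    K = len(v_i_blk)
    assert len(v_j_ext) == K + N - 1
    n_fft = 1 << (K + N - 2).bit_length()  # power of two >= K+N-1
    A = np.fft.rfft(v_i_blk, n_fft)
    B = np.fft.rfft(v_j_ext, n_fft)
    # corr[m] = sum_t a[t] * b[t + m]; no wrap-around occurs for m = 0..N-1.
    corr = np.fft.irfft(np.conj(A) * B, n_fft)
    return corr[:N][::-1]

def triangular_window(N):
    """w(n) = (N - n) / N for n = 0..N-1."""
    return (N - np.arange(N)) / N

def fadf_block_update(g, xcorr, mu):
    """Windowed block update with a constant step-size within the block."""
    return g + mu * triangular_window(len(g)) * xcorr
```

For $K = N = 400$ this choice of `n_fft` evaluates to 1024, which matches the FFT length reported in the experimental setup.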

Tracking of ADF output noise autocorrelations
Although NC-FADF improves the speech separation performance in noise, the separation outputs $v_{ni}$ are still contaminated by noise. Thus, speech enhancement postprocessing should be integrated with ADF to reduce the noise in each output. To do so, we need to track the time-varying output noise statistics as the filters evolve from block to block, by a fast computation of (5). Similar to the derivation of (15), the autocorrelation of the ADF output noise for the $m$th block is obtained as in (21), where $\mathbf{a}_{ii}^m = \mathbf{G}_{ij}^m\mathbf{r}_{n_in_j}$, $\mathbf{b}_{ii}^m = \mathbf{R}_{n_in_j}\mathbf{g}_{ij}^m$, $\mathbf{c}_{ii}^m = \mathbf{G}_{ij}^m\mathbf{d}_{ii}^m$, and $\mathbf{d}_{ii}^m = \mathbf{R}_{n_jn_j}\mathbf{g}_{ij}^m$. Since the input noise is stationary, its auto- and cross-correlations can be measured a priori during a speech-inactive period. The fast mappings from the input noise correlations to the output noise autocorrelation, which depend only on the current system parameters $\mathbf{g}_{ij}^m$ and $\mathbf{G}_{ji}^m$, are implemented as fast convolutions of the corresponding signal sequences.
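The mapping from input noise correlations to the output noise autocorrelation can be illustrated in direct (non-FFT) form. The sketch assumes the output noise $\eta_i(t) = n_i(t) - \sum_l g_{ij}(l)\, n_j(t-l)$ and stationary input noise with correlation sequences stored over lags $-(2N-2), \ldots, 2N-2$. The function name, the array layout, and the convention $r_{ij}(\tau) = E\{n_i(t)\, n_j(t-\tau)\}$ are assumptions made for this example; the four terms correspond to the input autocorrelation and the roles of $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{c}$ above.

```python
import numpy as np

def output_noise_autocorr(g, r_ii, r_jj, r_ij):
    """Autocorrelation of eta_i(t) = n_i(t) - sum_l g[l] n_j(t-l), lags 0..N-1.

    Each r_* array holds lags -(2N-2)..(2N-2); lag tau is stored at index tau + 2N-2.
    r_ij[tau] is E{n_i(t) n_j(t - tau)}.
    """
    N = len(g)
    L = 2 * N - 2
    out = np.zeros(N)
    for k in range(N):
        t1 = r_ii[L + k]                                     # input noise autocorrelation
        t2 = sum(g[l] * r_ij[L + k + l] for l in range(N))   # filtered cross term
        t3 = sum(g[l] * r_ij[L + l - k] for l in range(N))   # mirrored cross term
        t4 = sum(g[l] * g[m] * r_jj[L + k + m - l]           # doubly filtered term
                 for l in range(N) for m in range(N))
        out[k] = t1 - t2 - t3 + t4
    return out
```

For white, mutually uncorrelated unit-variance inputs the result reduces to a unit impulse plus the deterministic autocorrelation of the filter, which gives a quick sanity check. The sums over `l` are the kind of Toeplitz products that the text evaluates with fast convolutions.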

Enhancement of separated speech
Utilizing the adaptively estimated noise statistics $\mathbf{r}^m_{\eta_i\eta_i}$, many algorithms can be considered for postenhancement of the ADF outputs. The time-domain constrained (TDC) type of the generalized subspace (GSub) method [17] is tested because of its ability to handle colored noise. The TDC-GSub processing is applied to every block of ADF outputs. For the $m$th block, it requires the noise autocorrelation matrix $\mathbf{R}^m_{\eta_i\eta_i}$, which can be constructed by forming a symmetric Toeplitz matrix from the output noise autocorrelation vector in (21). Specifically, $\mathbf{r}^m_{\eta_i\eta_i}$ constitutes the first column and the first row of $\mathbf{R}^m_{\eta_i\eta_i}$. The TDC-GSub algorithm also takes the autocorrelation matrix of the noisy ADF output, $\mathbf{R}_{v_{ni}v_{ni}}$, which is estimated from the ADF outputs of the current block. The TDC-GSub processing is performed on each nonoverlapping subframe of length $L = 40$, and the major steps are the same as in [17].
Step 2. Compute the optimal estimator $\mathbf{H} = \mathbf{U}^{-T}\,\mathrm{diag}[\alpha_1, \ldots, \alpha_M, 0, \ldots, 0]\,\mathbf{U}^T$, where the eigendomain filtering gains are obtained by $\alpha_k = \lambda_k/(\lambda_k + \beta)$, $k = 1, \ldots, M$, and $\beta$ is determined from $\mathrm{SNR}_{\mathrm{dB}} = 10\log_{10}\left(\sum_{k=1}^{M}\lambda_k/L\right)$.
Step 3. Enhance the $i$th ADF output by $\hat{\mathbf{v}}_i^m = \mathbf{H}\mathbf{v}_{ni}^m$.
The computations of matrix inversion, multiplication, and eigendecomposition become acceptable when a small value is used for $L$ (2.5 milliseconds). In addition, a measure is taken to speed up TDC-GSub by utilizing the short-term stationarity of the separated speech signals $v_{ni}$. Within 20 milliseconds, the variations of the matrices $\mathbf{R}_{v_{ni}v_{ni}}$ are relatively small, obviating the need to update their eigendecompositions in every subframe. In practice, the computation rate for both steps 1 and 2 is thus reduced to once every 12.5 milliseconds, without introducing significant degradation.
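The estimator of Step 2 can be sketched with NumPy alone. The code assumes, following the generalized subspace approach of [17], that $\mathbf{U}$ and $\lambda_k$ are the eigenvectors and eigenvalues of $\mathbf{R}_\eta^{-1}(\mathbf{R}_{v_n} - \mathbf{R}_\eta)$ and that only positive eigenvalues receive a nonzero gain. It obtains them through a Cholesky whitening of the noise matrix, an algebraically equivalent route chosen here for numerical convenience. The function names and the tolerance `tol` are hypothetical, and $\beta$ is taken as an input because its SNR-dependent rule is not reproduced.

```python
import numpy as np

def toeplitz_from_autocorr(r):
    """Symmetric Toeplitz matrix whose first row and first column are r."""
    idx = np.arange(len(r))
    return np.asarray(r, dtype=float)[np.abs(np.subtract.outer(idx, idx))]

def gsub_estimator(R_noisy, R_noise, beta, tol=1e-10):
    """H = U^{-T} diag(alpha) U^T with alpha_k = lambda_k / (lambda_k + beta)."""
    L_chol = np.linalg.cholesky(R_noise)       # R_noise = L L^T
    L_inv = np.linalg.inv(L_chol)
    # Symmetric matrix with the same eigenvalues as R_noise^{-1} (R_noisy - R_noise).
    S = L_inv @ (R_noisy - R_noise) @ L_inv.T
    S = 0.5 * (S + S.T)
    lam, W = np.linalg.eigh(S)                 # U = L^{-T} W, so U^{-T} = L W
    gains = np.zeros_like(lam)
    pos = lam > tol                            # keep only the positive eigenvalues
    gains[pos] = lam[pos] / (lam[pos] + beta)
    return L_chol @ (W * gains) @ W.T @ L_inv
```

A subframe of noisy output would then be enhanced as `H @ v_sub`. When all eigenvalues are positive, the result coincides with $\mathbf{R}_x(\mathbf{R}_x + \beta\mathbf{R}_\eta)^{-1}$, where $\mathbf{R}_x = \mathbf{R}_{v_n} - \mathbf{R}_\eta$ is the speech correlation estimate, so a larger $\beta$ trades more noise suppression for more speech distortion.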

COMPLEXITY ANALYSIS
The complexities of the major computation steps, in terms of the average number of real multiplications per time sample, are listed in Table 1. Trivial computation overheads are ignored. The gains of the fast implementations over the direct ones are evaluated for $N = K$ and $N_F = 2N$. The counts for the FFT are based on the regular radix-2 method. It is possible to reduce the computational complexities further. In Table 1, only a coarse complexity estimate is made for TDC-GSub, based on direct implementations of the matrix operations. Faster computation techniques for TDC-GSub and their complexity analyses are beyond the scope of this paper.

Experimental data and setup
Speech mixtures were generated by convolving clean speech sources from the TIMIT database with real acoustic impulse responses measured in a room with reverberation time $T_{60} = 0.3$ second [18]. The speakers were approximately 2 m away from two microphones that were mounted 21 cm apart on a circular array of radius 15 cm, and the distance between the two speakers was 2.6 m. The target speech was sampled at 16 kHz and consisted of 40 sentences from 4 speakers (faks0, felc0, mdab0, mreb0). The competing speech contained randomly selected TIMIT sentences. Both simulated and real diffuse noise conditions were tested. The simulated noise is speech-shaped and was generated by a procedure based on TIMIT data. Real diffuse noises were recorded in a computer lab with a pair of omnidirectional microphones placed in the center of the lab, where the microphones were the same distance apart as the array microphone pair. Ventilation and air-conditioning systems and 8 desktop workstations were running simultaneously, generating diffuse noises that fit the stationarity assumption. As a default setting, a 2-second speech-inactive segment immediately preceding the speech was used to estimate the input noise statistics. Figure 3 illustrates the cross-power spectra for both types of noises.
The basic setup for ADF was $N = 400$ and $\gamma = 0.01$, and the separation filters were initialized with zeros, representing a totally blind condition (if prior knowledge of the acoustic paths can be incorporated into the initial separation filters, then ADF separation performance can be improved, especially in severe noise). In all cases, a pre-emphasis $(1 - z^{-1})$ was applied to the speech mixtures to remove the 6-dB/octave tilt of the long-term speech spectrum and to reduce eigenvalue dispersion for faster convergence [10]. Pre-emphasis enhances perceptually important speech components, and it also alters the input noise properties as well as the relative strengths of noise and speech measured by the signal-to-noise ratio (SNR): $\mathrm{SNR} = 10\log_{10}(P_S/P_N)$, where $P_S$ is the power of the clean speech mixture signal and $P_N$ is the power of the noise component. In fact, the simulated speech-shaped noise spectrum was flattened by pre-emphasis, resulting in an SNR loss of approximately 3 dB. On the other hand, the recorded diffuse noise retained a significant amount of coloration and spatial correlation after pre-emphasis, which increased the SNR by 12 dB by suppressing strongly correlated low-frequency noise components (see Figure 3). In subsequent discussions, SNR and target-to-interference ratio (TIR) refer to those evaluated on pre-emphasized input and output components, where TIR is defined as $10\log_{10}(P_T/P_I)$, with $P_T$ the power of the target speech and $P_I$ the power of the interference speech component. For FADF and NC-FADF, the block length was $K = 400$ and the FFT length was $N_F = 1024$. Since VSS without NC would corrupt the adaptation at high noise levels, it was not applied to ADF (6) and FADF. The appendix provides more details on the definitions of SNR and TIR.
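The pre-emphasis filter and its recursive counterpart are simple enough to sketch directly. The pre-emphasis coefficient of 1 follows the $(1 - z^{-1})$ filter stated above, and the default de-emphasis coefficient of 0.98 follows the filter $1/(1 - 0.98z^{-1})$ that the paper applies after enhancement. The function names and the zero initial state are assumptions of this example.

```python
import numpy as np

def pre_emphasis(x, a=1.0):
    """First-order FIR pre-emphasis y(t) = x(t) - a * x(t-1), zero initial state."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] -= a * x[:-1]
    return y

def de_emphasis(y, a=0.98):
    """First-order IIR de-emphasis x(t) = y(t) + a * x(t-1), zero initial state."""
    y = np.asarray(y, dtype=float)
    x = np.empty_like(y)
    prev = 0.0
    for t, val in enumerate(y):
        prev = val + a * prev
        x[t] = prev
    return x
```

When both functions use the same coefficient, de-emphasis exactly inverts pre-emphasis. With the paper's values (1 and 0.98) the pair is only approximately inverse, which keeps the recursive filter strictly stable.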

Speech separation performance
The separation performance was evaluated by the system gain in TIR, defined as $\mathrm{TIR}_{\text{output}} - \mathrm{TIR}_{\text{input}}$. In Tables 2 and 4, the TIR gains of NC-FADF exceed those of the baseline for both types of noises, at the cost of a slightly decreased SNR, as shown in Tables 3 and 5. Since FADF is a fast, approximate implementation of the baseline ADF, it suffered a slight degradation from the baseline and showed occasional instability in the iterative estimation of the separation filters. The TIR gain values in Tables 2 and 4 are computed from the noise-free components in the noisy outputs $v_{n1}$ and $v_{n2}$. It is interesting to observe that under severe noise conditions, for example SNR = $-12$ dB (original), the baseline ADF actually increased the SNR. This is consistent with the analysis in [13] that, in correlated noises, the baseline ADF tends to divert from speech separation to noise cancellation. Tables 3 and 5 show that the NC algorithm can force ADF to focus on speech separation rather than noise cancellation.

Speech enhancement and phone recognition
Experiments were conducted to compare the cases of using NC-FADF or FADF, with and without adaptive speech enhancement. Since SNR was altered by pre-emphasis differently for the simulated and the real diffuse noises, the ranges of initial SNRs were chosen differently for these two cases so that the input target speech had the same SNRs after pre-emphasis. After adaptive online speech enhancement, a de-emphasis $1/(1 - 0.98z^{-1})$ was applied to the enhanced speech. The overall enhancement of the target speech against the effects of both the interfering jammer and the noise is shown by the target-to-interference-and-noise ratio (TINR) in Figures 4 and 5, where the TINRs are defined in the appendix for the input, the separation output, and the separation output with noise reduction. It is seen that NC-FADF outperformed FADF in both types of noises under almost all SNR conditions. At high SNRs, the TINR improvements come mainly from the separation processing of NC-FADF or FADF, as the speech jammer is the dominant problem. The larger TINR gains obtained by NC-FADF over FADF are also attributed to its use of the variable step-size adaptation defined in (10) and (12)-(14), whereas without noise compensation the VSS was unavailable to FADF. This advantage of a variable step-size over a fixed step-size in ADF adaptation is consistent with the findings in [9]. At low SNRs, the TINR improvement comes mainly from the suppression of the noise components, and in the real diffuse noise, the separation processing had a stronger effect on the TINR improvement than in the simulated noise. When the SNR is very low, so that the energy of the speech mixture is dominated by the noise, the TIR improvement (between target and jammer speech) by NC-FADF contributed less to the overall TINR gains, and here the enhancement processing by TDC-GSub improved TINR greatly in both types of noises.
Phone recognition was performed using the HTK toolkit [19] for the noisy mixture, the noisy separated speech, and the enhanced separated speech of the target. The speech signals were represented by sequences of feature vectors obtained from 50% overlapped short-time analysis windows of 20 milliseconds. Each feature vector consisted of 13 cepstral coefficients and their first- and second-order time derivatives. Both the training and the test data from the TIMIT database were processed with spectral mean subtraction. Hidden Markov modeling (HMM) was used for 39 context-independent phone units, defined by the phone grouping scheme of [20]. Each phone unit had 3 emission states, with state observation probabilities modeled by size-8 Gaussian mixture densities. A phone bigram was used as the "language model." The phone accuracy results in the simulated and real diffuse noise cases are shown in Figures 6 and 7, respectively. The upper limit of phone accuracy was 46.5%, which was obtained from the target speech separated from the clean speech mixtures by ADF. It is observed that when the SNR is low or moderate, the adaptive enhancement techniques significantly improved the phone recognition accuracy of the separation outputs. Similar to the TINR results, at high SNRs the improvement in phone accuracy comes mainly from speech separation, where NC-FADF is significantly better than FADF. Comparative experimental results were also generated for the proposed approach of applying TDC-GSub as a postprocessor after FADF (FADF enhanced by TDC-GSub postprocessing) and the apparent alternative of using TDC-GSub as a preprocessor prior to FADF (FADF after TDC-GSub preprocessing). It is seen that the former performed better than the latter, especially in real diffuse noise. In general, the combination of NC-FADF with TDC-GSub postprocessing achieved the highest accuracy.

Sensitivity to noise estimation
In real applications, there are scenarios where the speech inactive periods are short, which would reduce the reliability of noise statistic estimation. It is therefore of interest to evaluate the feasibility of the proposed NC-FADF algorithm when the input noise statistics are estimated from short data segments. For this purpose, an experiment was performed to vary the speech inactive period from 0.5 second through 2.5 seconds, and the noise statistics computed from the different periods were used by NC-FADF followed by TDC-GSub to perform speech separation and enhancement. The test results confirmed that for the two types of noises investigated in the current work, there is no significant difference in the overall system performance over this range of speech-inactive intervals. Figure 8 illustrates the phone recognition performance versus the speech inactive interval lengths in real diffuse noise. It is seen that except for a performance drop when the speech inactive length was 0.5 second, phone accuracy remained essentially the same for all other speech inactive lengths. In simulated noise, the accuracy performance remained essentially the same for all of the speech inactive lengths, including the 0.5 second case. In general, in an online system a voice activity detection module is needed to identify speech inactive periods, and for fast-varying nonstationary input noises, robust algorithms are needed to estimate time-varying noise properties with adaptive memory lengths. Although this issue is practically important, it is out of the scope of the current work.

CONCLUSIONS AND FUTURE WORK
In this paper, we have presented methods of noise compensation and adaptive speech enhancement to improve the performance of ADF speech separation in diffuse noise. Fast implementations of ADF and noise compensation have been made that warrant real-time online applications. FADF has achieved performance comparable to that of ADF at a much faster speed. NC-FADF significantly improved the separation performance for speech mixtures in diffuse noise, and the integration of NC-FADF with speech enhancement significantly improved phone recognition accuracies in separated speech. Future investigations may include other enhancement algorithms and noise-reduction implementations for a more streamlined integration with the NC-FADF procedure.

DEFINITIONS OF SNR, TIR, AND TINR
Since the ADF filtering model (1) is linear, the superposition principle holds; that is, its output components of target, interference, and noise can be computed separately from the respective input components. Unlike the linear model of ADF, the speech enhancement module is nonlinear, and its output components cannot be separately estimated from its individual input components. Therefore, the separate computation of output TIR and SNR is not feasible for the speech enhancement module. Instead, TINRs can be estimated by taking the signal energies other than the original target as the sum of noise and interference signals. The computations of SNR, TIR, and TINR are defined below with respect to channel 1 (the definitions are similar for channel 2):

$$\mathrm{SNR}_{y_1} = 10\log_{10}\frac{P_{y_1}}{P_{n_1}}, \qquad \mathrm{SNR}_{v_1} = 10\log_{10}\frac{P_{v_1}}{P_{\eta_1}},$$

$$\mathrm{TIR}_{y_1} = 10\log_{10}\frac{P_{y_{s1}}}{P_{y_{s2}}}, \qquad \mathrm{TIR}_{v_1} = 10\log_{10}\frac{P_{v_{s1}}}{P_{v_{s2}}},$$

$$\mathrm{TINR}_{y_1} = 10\log_{10}\frac{P_{y_{s1}}}{P_{(y_{s2}+n_1)}}, \qquad \mathrm{TINR}_{v_1} = 10\log_{10}\frac{P_{v_{s1}}}{P_{(v_{s2}+\eta_1)}}, \qquad \mathrm{TINR}_{\hat{v}_1} = 10\log_{10}\frac{P_{v_{s1}}}{P_{(\hat{v}_1 - v_{s1})}}.$$
At the ADF input, $P_{y_1}$ and $P_{n_1}$ are the powers of the clean mixture and the noise components, respectively; $P_{y_{s1}}$ and $P_{y_{s2}}$ are the powers of the target and the interference speech signals, respectively; and $y_{s2} + n_1 = y_{n1} - y_{s1}$ is the sum of the interference speech and the noise. At the ADF output, $P_{v_1}$ and $P_{\eta_1}$, $P_{v_{s1}}$ and $P_{v_{s2}}$, and $v_{s2} + \eta_1 = v_{n1} - v_{s1}$ are the counterparts of the above components at the ADF input. The component $\hat{v}_1$ is the output speech after enhancement processing.
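These ratios are straightforward to compute once the signal components are available separately, as they are in a simulation. The sketch below uses mean-square power and hypothetical function names; it mirrors the definitions above, including the TINR after enhancement, where everything other than the original target component is counted as interference plus noise.

```python
import numpy as np

def power(x):
    """Mean-square power of a signal."""
    return float(np.mean(np.square(np.asarray(x, dtype=float))))

def ratio_db(num, den):
    """10 log10 of the power ratio; covers SNR (speech/noise) and TIR (target/interference)."""
    return 10.0 * np.log10(power(num) / power(den))

def tinr_separated_db(v_s1, v_s2, eta_1):
    """TINR at the ADF input or output: target over (interference + noise)."""
    return ratio_db(v_s1, np.asarray(v_s2) + np.asarray(eta_1))

def tinr_enhanced_db(v_s1, v_hat_1):
    """TINR after enhancement: target over (enhanced output - target)."""
    return ratio_db(v_s1, np.asarray(v_hat_1) - np.asarray(v_s1))
```

A TIR gain, as used in the separation experiments, is then the difference between `ratio_db` evaluated on the output target and interference components and on the corresponding input components.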