Dual-Channel Cosine Function Based ITD Estimation for Robust Speech Separation

In speech separation, many methods require the microphones to be closely spaced and therefore fail when phase wrap-around occurs. In this paper, we present a two-microphone speech separation scheme that does not have this restriction. The technique uses estimates of interaural time difference (ITD) statistics and a binary time-frequency mask to separate mixed speech sources. The novelties of the paper are: (1) the extended application of delay-and-sum beamforming (DSB) and a cosine function for ITD calculation; and (2) the clarification of the connection between the ideal binary mask and the DSB amplitude ratio. Objective quality evaluation experiments demonstrate the effectiveness of the proposed method.


Introduction
A common example of the well-known 'cocktail party' problem is the situation in which the voices of two speakers overlap. How to solve this problem so that a machine can recover an enhanced voice of a particular speaker has attracted considerable attention from researchers.
For single-channel speech separation, independent component analysis (ICA) [1] and nonnegative matrix factorization (NMF) [2] are the conventional methods. However, the assumption in ICA that the signals are statistically independent and the assumption in NMF that the model is linear limit their applications. Moreover, NMF generally requires a large amount of computation to determine the speaker-independent basis. Recently, in [3], the authors proposed an online adaptive process that is independent of parameter initialization and uses noise reduction as a pre-processing step. Using adaptive parameters computed frame by frame, that method constructs a time-frequency (TF) mask for the separation process. In [4], the authors proposed a pseudo-stereo mixture model by reformulating the binaural blind speech separation algorithm for the monaural speech separation problem. The algorithm estimates the source characteristics and constructs the masks with parameters estimated through a weighted complex 2D histogram.
Multichannel sources are normally separated by measuring the differences in arrival time and sound intensity between microphones [5,6], referred to as the interaural time difference (ITD) and the interaural intensity difference (IID), respectively. Interaural phase differences (IPD) have been used in [7,8], where the authors proposed a speech enhancement algorithm using phase-error based filters that depend only on the phase of the signals. The performance of these systems depends on how the ITD (or IPD) threshold is selected. Instead of a fixed threshold, the authors of [9] employed statistical modeling of angle distributions together with channel weighting to determine which signal components belong to the target signal and which belong to the background. In [10], the authors proposed a method that predicts the coherence function and then estimates the signal-to-noise ratio (SNR) to generate a Wiener filter. In [11], the author presented a method based on ICA and binary time-frequency masking. In [12], the authors proposed that a rough estimate of the channel level difference (CLD) threshold yielding the best signal-to-distortion ratio (SDR) could be obtained by cross-correlating the separated sounds. In addition, a combination of nonnegative matrix factorization (NMF) with spatial localization via the generalized cross correlation (GCC) is applied to two-channel speech separation in [13]. For two-channel convolutive source separation, as the number of parameters in the NMF2D grows exponentially and the number of frequency bases increases linearly, the issues of model-order fitness, initialization and parameter estimation become even more critical.
In [14], the authors proposed a Gaussian Expectation Maximization and Multiplicative Update (GEM-MU) algorithm to compute the NMF2D with an adaptive sparsity model, and used a Gamma-Exponential process to estimate the number of components and the number of convolutive parameters in the NMF2D.
The goal of this paper is to handle competing-talker scenarios using dual-channel mixtures. We use DSB to generate a cosine function that evaluates the ITD over several frames of the short-time Fourier transform (STFT) and makes the target and competing signals share the same characteristics. We then apply a binary time-frequency mask to obtain the target source. This paper makes two contributions: (1) we extend delay-and-sum beamforming (DSB) [15] to estimate the ITD; and (2) for the first time, we clarify the connection between the ideal binary mask and the DSB amplitude ratio.
The framework of our approach is illustrated in Figure 1. Moreover, our proposed algorithm can handle the problem of phase wrap-around. The remainder of this paper is organized as follows: Section 2 provides an overview of the time difference model. Section 3 discusses our proposed approach, including the system overview and the algorithm. Section 4 introduces source separation. Section 5 presents our evaluation of the system. Finally, Section 6 gives the main conclusions of the work.

Time Difference Model
We suppose that there are I (I = 2) sources in the acoustic environment, with subscript 1 denoting the target and subscript 2 denoting the noise. The signals from the two microphones are defined, respectively, as: where a_i^L and a_i^R denote the weighting coefficients of the i-th source in the recordings of the left and right microphones, respectively, and τ_i is the time delay of arrival (TDOA) of the i-th source between the two microphones. Equation (1) can be simplified as: where b_i is the ratio of a_i^L and a_i^R. By the short-time Fourier transform (STFT), the signals can be expressed as: where m is the frame index, ω_k = 2πk/K, and k and K are the frequency index and the total window length, respectively. Under the assumption of W-disjoint orthogonality [16], Equation (3) can be rewritten as: Thus, once the TDOA is obtained, we can make a simple binary decision on whether the time-frequency bin [m, k] is likely to belong to the target speaker or not.
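One signal model consistent with the definitions above can be written out as a sketch. It rests on assumptions: the left microphone is taken as the reference, the left gain is absorbed into the source spectrum S_i[m, k], and b_i is read as the right-to-left gain ratio a_i^R / a_i^L. Under W-disjoint orthogonality each bin keeps only its dominant source i.

```latex
% Assumed anechoic two-microphone mixing model (time domain)
x_L(t) = \sum_{i=1}^{I} a_i^{L}\, s_i(t), \qquad
x_R(t) = \sum_{i=1}^{I} a_i^{R}\, s_i(t - \tau_i)

% STFT domain, with b_i = a_i^{R} / a_i^{L} (assumed direction of the ratio)
X_L[m,k] = \sum_{i=1}^{I} S_i[m,k], \qquad
X_R[m,k] = \sum_{i=1}^{I} b_i\, S_i[m,k]\, e^{-j\omega_k \tau_i}, \qquad
\omega_k = \frac{2\pi k}{K}

% W-disjoint orthogonality: a single dominant source i in bin [m,k]
X_L[m,k] \approx S_i[m,k], \qquad
X_R[m,k] \approx b_i\, S_i[m,k]\, e^{-j\omega_k \tau_i}
```

In this form the inter-channel phase of a bin is ω_k τ_i, which is what ties each bin to one source once the delays are known.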

Proposed Approach
Delay-and-sum beamforming (DSB) is an effective means of speech enhancement. Our method is based on DSB under anechoic conditions in the time-frequency domain. In DSB, the enhanced speech signals in the time-frequency domain are modeled as: where the two outputs are the enhanced speech of the target and the interferer, respectively. Theoretically, once correct estimates of τ_1 and τ_2 are obtained, Equation (5) is written as: We define g[k] as: where sgn(x) = 1, According to Equations (6) and (7), we treat g_The[k] as the theoretical result of g[k]. Under the far-field assumption (b_1 ≈ b_2), we may obtain: where g_The[k] is the cosine function. In particular, if b_1 equals 1, we have: Obviously, the maximum of g_The[k] is 1. Furthermore, we let g_real[k] be the value of g[k] computed from real data according to Equation (6). To ensure that the maximum of g_real[k] is 1, we rectify g_real[k] as: We define the minimum of g_real[k] as g_min[k]. When τ_1 and τ_2 are estimated correctly, g_real[k] approximately equals g_The[k]. According to Equation (10), b_1 can be estimated as: Figure 2 demonstrates the process of ITD estimation. Figure 3 gives an example of the cosine functions obtained with different ITD estimates.
We define the criterion function as: Because trigonometric functions are periodic, we restrict |ω_k(τ_1 − τ_2)| < π. We sum over all frequency bands to avoid the phase wrap-around problem. Then, we have:
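The cosine behaviour of the DSB amplitude ratio can be checked numerically for a single time-frequency bin. The sketch below is illustrative only and rests on assumptions that are not taken from the equations of this section: the left channel is the reference, the right channel is b times the source delayed by τ samples, the DSB output is the average of the left channel and the re-aligned right channel, and the names `dsb`, `amplitude_ratio` and `estimate_b` are hypothetical. Under that model, steering to the wrong delay scales the amplitude by |cos(ω_k(τ_1 − τ_2)/2)| when b = 1, and the minimum ratio (1 − b)/(1 + b) can be inverted to recover b.

```python
import cmath
import math

def dsb(x_left, x_right, omega, tau):
    """Assumed DSB for one bin: re-align the right channel by tau samples, then average."""
    return 0.5 * (x_left + x_right * cmath.exp(1j * omega * tau))

def amplitude_ratio(omega, tau_src, tau_steer, b=1.0, s=1.0 + 0.5j):
    """|DSB steered to tau_steer| divided by |DSB steered to the true delay tau_src|."""
    x_left = s                                           # reference channel
    x_right = b * s * cmath.exp(-1j * omega * tau_src)   # attenuated, delayed copy
    matched = dsb(x_left, x_right, omega, tau_src)
    mismatched = dsb(x_left, x_right, omega, tau_steer)
    return abs(mismatched) / abs(matched)

def estimate_b(g_min):
    """Invert g_min = (1 - b) / (1 + b); valid for 0 < b <= 1 in this assumed model."""
    return (1.0 - g_min) / (1.0 + g_min)

K = 1024                    # window length in samples
tau_1, tau_2 = 2.0, -3.0    # hypothetical delays in samples
ratios = {}
for k in (10, 40, 100):     # bins with |omega_k * (tau_1 - tau_2)| < pi
    omega_k = 2.0 * math.pi * k / K
    ratios[k] = amplitude_ratio(omega_k, tau_1, tau_2)
    print(k, ratios[k], math.cos(omega_k * (tau_1 - tau_2) / 2.0))

# With b = 0.6 and a phase mismatch of exactly pi, the ratio reaches its minimum.
g_min = amplitude_ratio(math.pi, 1.0, 0.0, b=0.6)
print(estimate_b(g_min))
```

The two printed columns agree for each bin, which is the cosine shape that a correct pair of delay estimates should reproduce across frequency.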
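Why summing over frequency bands sidesteps phase wrap-around can be illustrated with a generic steered-response search. This is a sketch of a standard delay-and-sum power search for a single synthetic source, not the criterion function defined above; the delay of 12 samples, the candidate grid and the synthetic spectrum are all hypothetical. A single bin recovers the delay only while ω_k τ stays below π, whereas the power summed over all bins peaks at the true delay.

```python
import cmath
import math

K = 1024            # window length in samples
TRUE_DELAY = 12     # hypothetical source delay in samples

# Synthetic single-source spectrum: the right channel is the left one delayed.
bins = range(1, K // 2)
x_left = {k: cmath.rect(1.0 + 0.5 * math.sin(k), 0.37 * k) for k in bins}
x_right = {k: x_left[k] * cmath.exp(-2j * math.pi * k * TRUE_DELAY / K) for k in bins}

def single_bin_delay(k):
    """Delay implied by the wrapped inter-channel phase of one bin (ambiguous)."""
    omega_k = 2.0 * math.pi * k / K
    return cmath.phase(x_left[k] * x_right[k].conjugate()) / omega_k

def summed_power(candidate):
    """Delay-and-sum output power for one candidate delay, summed over all bins."""
    total = 0.0
    for k in bins:
        omega_k = 2.0 * math.pi * k / K
        y = 0.5 * (x_left[k] + x_right[k] * cmath.exp(1j * omega_k * candidate))
        total += abs(y) ** 2
    return total

best_delay = max(range(-20, 21), key=summed_power)
print(single_bin_delay(10))    # low bin: phase below pi, delay recovered
print(single_bin_delay(200))   # high bin: wrapped phase gives a wrong delay
print(best_delay)              # summed search over all bins
```

At bin 200 the true phase ω_k τ exceeds π, so the per-bin estimate is off by a multiple of 2π/ω_k, while the summed search is unaffected.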

Source Separation
After obtaining the ITD and the attenuation coefficients (namely b_1 and b_2), we adopt the masking method to separate the target and competing sources. First, we illustrate the effects of the attenuation coefficients. Then, we apply the time-frequency mask based on the DSB ratio.

The Effects of Weighted Coefficients
In Equation (10), we assume b_1 ≈ b_2, but experimental settings sometimes cannot meet this hypothesis strictly. In this section, we set different values of b_1 and b_2 artificially to demonstrate the effectiveness of the criterion function in Equation (14). We verify the effects of b_1 and b_2 with a simple example. Assume that: The details are shown in Figure 4. We can observe that even when the experimental settings do not strictly meet the assumption b_1 ≈ b_2, the ITD can still be estimated accurately. Moreover, although the estimated values of b_1 and b_2 are rough, the binary mask is unaffected by the attenuation coefficients, since the DSB-based mask relies only on ITD information.
The ITD estimation is valid for all of the settings.

Mask Based on DSB Ratio
Under the assumption of W-disjoint orthogonality, the ideal ratio mask is defined using the a priori energy ratio R_SNR[m, k] [17]: In addition, the ideal binary mask is of the form: where λ is set to a value between 0.2 and 0.8.
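The role of the threshold λ can be made concrete with a small sketch. It assumes that the a priori energy ratio is the target energy divided by the total energy in the bin, which is one common reading and is stated here as an assumption; the function names and the sample values are hypothetical. With λ = 0.5 the binary decision reduces to asking which source has the larger amplitude in the bin.

```python
def energy_ratio(s_target, s_interf):
    """Assumed a priori energy ratio of the target in one bin (needs nonzero total energy)."""
    e_target = abs(s_target) ** 2
    e_interf = abs(s_interf) ** 2
    return e_target / (e_target + e_interf)

def ideal_binary_mask(s_target, s_interf, lam=0.5):
    """1 if the target's share of the bin energy exceeds lam, else 0."""
    return 1 if energy_ratio(s_target, s_interf) > lam else 0

# Hypothetical (target, interferer) spectral values for three bins.
examples = [(1.0 + 0.2j, 0.3j), (0.1, 0.9 - 0.4j), (0.5j, 0.6)]
masks = [ideal_binary_mask(t, i) for t, i in examples]
dominant = [1 if abs(t) > abs(i) else 0 for t, i in examples]
print(masks, dominant)
```

Both lists agree because a share above one half is the same condition as the target amplitude exceeding the interferer amplitude.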
In our theoretical framework, according to Equation (6), the magnitude of (1 + b_1)/(1 + b_1 e^{jω_k(τ_2 − τ_1)}) is greater than 1, and its counterpart for the competing source is always less than 1. Then, the DSB ratio is of the form: Comparing R_DSB[m, k] to 1, the binary time-frequency mask is obtained as: It is easy to see that when λ is set to 0.5, B[m, k] is equivalent to M[m, k]. Equations (6) and (20) show why λ = 0.5 provides the best performance under the assumption of W-disjoint orthogonality. Then, the speech can be separated as: where X[m, k] is defined as: Finally, we obtain the separated speech waveforms using the inverse fast Fourier transform (IFFT) and overlap-add (OLA).
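The DSB-ratio mask can be sketched for individual bins as follows. The code is a minimal illustration under assumptions: the DSB output averages the left channel with the right channel re-aligned by the candidate delay, the names `y_1` and `y_2` stand for the outputs steered to τ_1 and τ_2, and the mask is applied to the left channel, which is a choice made here for illustration. Comparing the two amplitudes directly is equivalent to comparing their ratio to 1 and avoids dividing by zero.

```python
import cmath
import math

def dsb(x_left, x_right, omega, tau):
    """Assumed DSB for one bin: re-align the right channel by tau samples, then average."""
    return 0.5 * (x_left + x_right * cmath.exp(1j * omega * tau))

def dsb_mask(x_left, x_right, omega, tau_1, tau_2):
    """1 if the output steered to the target delay is the larger one, else 0."""
    y_1 = dsb(x_left, x_right, omega, tau_1)
    y_2 = dsb(x_left, x_right, omega, tau_2)
    return 1 if abs(y_1) > abs(y_2) else 0

K = 1024
omega_k = 2.0 * math.pi * 40 / K
tau_1, tau_2 = 2.0, -3.0          # hypothetical delays in samples
b_1, b_2 = 1.0, 0.8               # hypothetical attenuation coefficients
s = 0.7 - 0.2j                    # hypothetical source value in this bin

# One bin dominated by the target and one dominated by the interferer.
target_bin = (s, b_1 * s * cmath.exp(-1j * omega_k * tau_1))
interf_bin = (s, b_2 * s * cmath.exp(-1j * omega_k * tau_2))

mask_target = dsb_mask(*target_bin, omega_k, tau_1, tau_2)
mask_interf = dsb_mask(*interf_bin, omega_k, tau_1, tau_2)
separated = [mask_target * target_bin[0], mask_interf * interf_bin[0]]
print(mask_target, mask_interf, separated)
```

The decision uses only the delays: changing `b_1` or `b_2` rescales both steered outputs of a bin without changing which one is larger.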

Experimental Evaluations
In this section, we first describe the experimental data and evaluation criteria that we used, and then present the experimental results. Figure 5 depicts the simulated experimental set-up. The sources are selected from the TIMIT database [18]. The sample rate of these audio files is 16,000 Hz. For simulated data, we evaluate the target speech separation performance using Perceptual Evaluation of Speech Quality (PESQ), C_sig, C_bak and C_ovl [19]. These composite measures show moderate advantages over the existing objective measures [19]. To meet the evaluation criteria of the SiSEC 2010 campaign, we adopt the standard Source-to-Interference Ratio (SIR) [20] for the SiSEC 2010 test data. For all of these objective measures, higher values indicate better performance.

Experimental Setup
The window length is 1024 samples with an overlap of 75%. Voiced frames can be selected with a voice activity detector (VAD) [21] to avoid the situation in which Y_2[m, k] = 0. In practice, Y_2[m, k] = 0 hardly ever occurs, so we omit this step in our experiments. Whenever the amplitude of Y_2[m, k] is nonzero, we treat Y_2[m, k] as one of the speakers.
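The framing implied by these settings can be worked out in a few lines. The sketch assumes no padding at the signal edges and uses a 2 s signal at 16,000 Hz as the example length; both the padding choice and the variable names are assumptions made for illustration.

```python
WINDOW = 1024        # window length in samples
OVERLAP = 0.75       # fraction of the window shared by consecutive frames
FS = 16000           # sample rate in Hz

hop = int(WINDOW * (1 - OVERLAP))            # samples between frame starts
n_samples = 2 * FS                           # example: a 2 s signal
n_frames = 1 + (n_samples - WINDOW) // hop   # full frames only, no padding
bin_hz = FS / WINDOW                         # frequency spacing of the STFT bins
frame_starts = [m * hop for m in range(n_frames)]

print(hop, n_frames, bin_hz, frame_starts[:3])
```

With a 75% overlap each sample is covered by four frames, which is what overlap-add resynthesis relies on after masking.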

Simulated Data
We generate data for the setup in Figure 5 with source signals of 2 s duration. Reverberation is simulated using the Room Impulse Response (RIR) open source software package [22], which is based on the image method. We generate 100 mixed sentences for each experimental set. Tables 1 and 2 show the ITD estimation results in terms of mean square error. In our experiments, the ITD is expressed in units of τ × f_s, i.e., in samples. We compare our approach with the existing DUET [23], Messl [24], and Izumi [25] methods. Unlike the algorithms based on coherence, our method consolidates the estimation of τ_1 and τ_2 into one cosine function, and it achieves more accurate ITD estimation. Table 3 shows the relation between microphone distance and the ITD estimation results. The real ITD is proportional to the distance, and the ITDs estimated by our method follow this rule. For all of the distances in our experiment, the proposed method provides better ITD estimates, which in turn influence the separation results. Figure 6 shows the ITD estimation in detail. Although our method does not take reverberation into consideration, the results demonstrate that it is effective under low-reverberation conditions (RT60 = 150 ms). Figure 7 shows the target source separation performance and illustrates that our method has comparable performance. Figure 8 shows the target source separation performance for different microphone distances: separation is effective at every distance tested, and the proposed method yields better results than the other methods for all of the microphone distances.
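The proportionality between ITD and microphone distance follows from far-field geometry and can be sketched as below. The speed of sound (343 m/s), the distances and the arrival angle are assumed values chosen for illustration and are not the settings of the experiments; the function names are hypothetical. The second function gives the frequency above which the inter-channel phase of a source exceeds π, which is where single-bin phase estimates start to wrap.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s, assumed

def itd_samples(mic_distance_m, angle_deg, fs=16000):
    """Far-field ITD in samples (tau * fs) for a source at angle_deg from broadside."""
    tau_seconds = mic_distance_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    return tau_seconds * fs

def wrap_frequency_hz(itd, fs=16000):
    """Frequency above which the phase 2*pi*f*tau exceeds pi for this ITD in samples."""
    if itd == 0:
        return math.inf
    return fs / (2.0 * abs(itd))

for distance in (0.1, 0.2, 0.3):          # hypothetical microphone spacings in metres
    itd = itd_samples(distance, 30.0)     # hypothetical arrival angle
    print(distance, itd, wrap_frequency_hz(itd))
```

Doubling the spacing doubles the ITD and halves the wrap frequency, which is why widely spaced microphones make per-bin phase methods ambiguous over more of the speech band.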

SiSEC 2010 Test Data
The D2-2 sets of the Signal Separation Evaluation Campaign (SiSEC) [26] consist of two-microphone real-world recordings. We applied the proposed method to set1 for both room1 and room2. We compare our method only with the classical Fast-ICA [27], since the results of other methods can be found online. Figure 9 shows the ITD estimation details. Tables 1 and 2 illustrate that our method can achieve competitive results. In Figure 10, we show the trend of the mean SIR as a function of λ for room1 and room2. The mean SIR is symmetric about λ = 0.5, where it achieves its best value. These characteristics are consistent with our method. Table 4 shows the separation performance for both room1 and room2.

Figure 10. Average Signal-to-Interference Ratio (SIR) with different λ. We calculate the mean of SIR for each λ. The result demonstrates that λ = 0.5 provides the best performance, which agrees with our theoretical analysis. Furthermore, the separation results are symmetric in λ when we adopt the signal-to-noise ratio based on Y_1[m, k] and Y_2[m, k] to generate the ideal binary mask.

Conclusions
In this paper, we have proposed a novel DSB-based method for dual-channel source separation. For the first time, our method extends DSB to estimate the interaural time difference (ITD) and illustrates the connection between the ideal binary mask and the DSB amplitude ratio. Our method remains valid under phase wrap-around. Although it is based on the assumption of an anechoic environment, the results show that it is effective in a low-reverberation environment (RT60 = 150 ms). Objective evaluations demonstrate the effectiveness of the proposed method.
In this paper, we focus on the estimation of the interaural time differences (ITD). In fact, the construction of an effective masking model is also very critical. We could attempt to replace our Time-Frequency Masking with an NMF2D model as proposed in [14], and adopt the GEM-MU and Gamma-Exponential process to separate sound sources. Moreover, in the presence of background noise, the idea of noise reduction in [3] is also valuable for our dual-channel speech separation.
Author Contributions: Xuliang Li performed the experiments and analyzed the data; Zhaogui Ding designed the experiments and analyzed the data; Weifeng Li and Qingmin Liao helped to discuss the results and revise the paper. All authors have read and approved the submission of the manuscript.

Conflicts of Interest: The authors declare no conflict of interest.