Binaural Speech Separation Algorithm Based on Long Short-Term Memory Networks

Abstract: Speaker separation in complex acoustic environments is one of the challenging tasks in speech separation. In practice, speakers very often remain still or move slowly during normal communication. In this case, the spatial features of consecutive speech frames become highly correlated, which aids speaker separation by providing additional spatial information. To fully exploit this information, we design a separation system on a Recurrent Neural Network (RNN) with long short-term memory (LSTM), which effectively learns the temporal dynamics of spatial features. In detail, an LSTM-based speaker separation algorithm is proposed that extracts the spatial features in each time-frequency (TF) unit and forms the corresponding feature vector. Then, we treat speaker separation as a supervised learning problem, where a modified ideal ratio mask (IRM) is defined as the training target during LSTM learning. Simulations show that the proposed system achieves attractive separation performance in noisy and reverberant environments. Specifically, in untrained acoustic tests with limited priors, e.g., unmatched signal-to-noise ratio (SNR) and reverberation, the proposed LSTM-based algorithm still outperforms the existing DNN-based method in the measures of PESQ and STOI, which indicates that our method is more robust in untrained conditions.


Introduction
Speech separation focuses on separating target speech from interference, i.e., background noise, reverberation and interfering speech. As a front end of speech signal processing systems, it is widely used in various scenarios, e.g., smart homes, hearing aids, and speech interaction systems. In terms of the number of microphones used, speech separation methods can be divided into two categories: monaural and array-based ones. Monaural methods extract discriminative features from a single channel; in the DNN-SVM system, for instance, a DNN learns discriminative features and an SVM then estimates the sub-band ideal binary mask (IBM), which significantly improved speech intelligibility. Jiang and Wang [Jiang and Wang (2014)] first applied DNNs to binaural speech separation. In each TF unit, the spatial features ITD and ILD, together with the monaural feature GFCC, are extracted as input features to train the DNN. That study shows that a trained DNN generalizes well to untrained spatial configurations of sound sources, that is, to specific placements of sound sources and sensors in an acoustic environment. Also, when the target and the interfering source are co-located or close to each other, the monaural features improve separation performance. The spectral monaural feature was extended [Zhang and Wang (2017)] to a complementary monaural feature set including the amplitude modulation spectrum (AMS), relative spectral transform with perceptual linear prediction (RASTA-PLP), and mel-frequency cepstral coefficients (MFCC), with the DNN trained to estimate the ideal ratio mask (IRM). Xiao et al. [Xiao, Zhao, Nguyen et al. (2016)] used a DNN to predict static parameters, differential parameters and cross-correlation features, which improves speech separation performance for reverberant and noisy speech.
High-level features [Yu, Wang and Han (2016)] are extracted from low-level features, such as the mixing vector (MV), ILD and IPD, by unsupervised learning, and supervised learning is then used to find the nonlinear mapping between the high-level features and the orientations of the dominant source. Based on the trained networks, the probability that each TF unit belongs to the different sources (target and interferers) can be estimated from the localization cues, which is further used to generate a soft mask for source separation. A two-stage DNN structure is proposed in Zhao et al. [Zhao, Wang and Wang (2017)]: the mask from the first DNN is used for noise reduction, and the second DNN performs spectral mapping for dereverberation. The results show that the performance of the two-stage DNN is greatly improved compared to the single-stage DNN. Bi-directional long short-term memory (BLSTM) is also utilized to determine whether or not a TF unit is dominated by target speech; the TF units containing clean phase are then used for DOA estimation. In addition, TF-unit classification determined by deep clustering with permutation-based training integrates spectral and inter-channel phase patterns for multichannel speech separation. We note that, in Wang et al. [Wang and Chen (2018); Chen and Wang (2017); Ding, Li, Han et al. (2019)], LSTM shows a powerful ability to capture long-term speech contexts for speaker- and noise-independent speech enhancement. Inspired by these studies, we use LSTM for binaural speaker separation, designing an LSTM framework that combines spatial features with a modified IRM. Hereafter, we treat the speech separation problem as a speaker separation problem, since we pursue the speech of the target speaker by using the spatial information of the speakers. In our scheme, since the spatial features of consecutive frames are correlated for unmoving or slow-moving speakers, the binaural spatial features are modeled by a BLSTM.
In detail, cross-correlation functions (CCF), together with ITD and ILD, are calculated at the TF-unit level as spatial features. Then, a BLSTM classifier is trained at each frequency channel, since the binaural features vary with frequency. Moreover, assuming the magnitudes of the individual speakers sum to that of the original mixture, we force the BLSTM outputs to be the proportions of each speaker in the corresponding mixture, where the training labels are provided by the modified IRM. The remainder of the paper is organized as follows. Section 2 presents an overview of our BLSTM-based binaural speech separation system and the extraction of spatial features. Section 3 describes the structure and training of the BLSTM networks. Simulation results and analysis are provided in Section 4. The conclusion is drawn in Section 5.

System overview and feature extraction
The proposed speaker separation system is illustrated in Fig. 1. The binaural signals are first decomposed into TF units by a 33-channel Gammatone filterbank applied to each ear independently. CCF, ITD and ILD are extracted in each TF unit and regarded as spatial features. The BLSTM is trained to estimate the IRM from these spatial features, and the target speech is reconstructed from the estimated IRM and the mixture. The physical model for binaural speech signals in reverberant and noisy environments can be formulated as:

xL(t) = s1(t) * hL,1(t) + s2(t) * hL,2(t) + nL(t)
xR(t) = s1(t) * hR,1(t) + s2(t) * hR,2(t) + nR(t)      (1)

where * denotes convolution, xL(t) and xR(t) are the binaural mixture signals, s1(t) and s2(t) represent the two speech sources, and hL,j(t) and hR,j(t) are the Binaural Room Impulse Responses (BRIR) of source j for the left and right ears, respectively. The symbols nL(t) and nR(t) are additive noises at each ear, which are uncorrelated with each other. Both the left-ear and right-ear signals, xL(t) and xR(t), are decomposed into cochleagrams. The central frequencies of the Gammatone filters range from 50 Hz to 8000 Hz on the equivalent rectangular bandwidth (ERB) scale. The output of each channel is divided into 32-ms frames with a 16-ms frame shift, so that the binaural signals are converted into TF units. In each unit, the CCF, ITD and ILD between the left-ear and right-ear signals are extracted. The normalized CCF of a TF unit pair is defined as:

CCF(i,k,d) = [ Σ_m xL(i,k,m) xR(i,k,m−d) ] / sqrt( Σ_m xL^2(i,k,m) · Σ_m xR^2(i,k,m) )      (2)

where xL(i,k,m) and xR(i,k,m) are the binaural signals of the TF unit at the i-th channel and k-th frame, m is the sample index within a TF unit, N is the frame length, and d denotes the delay between the binaural signals, restricted to the range [-1, 1] ms. For the 16 kHz sampling rate, the maximum lag L is set to 16 samples, so the CCF has dimension 33. The ITD of each TF unit is the delay corresponding to the maximum of the CCF:

ITD(i,k) = argmax_d CCF(i,k,d)      (3)

And the ILD is defined as the energy ratio (in dB) between the left and right ears in each TF unit pair:

ILD(i,k) = 10 log10( Σ_m xL^2(i,k,m) / Σ_m xR^2(i,k,m) )      (4)

The spatial feature vector extracted in each TF unit pair is then:

F(i,k) = [CCF(i,k,−L), …, CCF(i,k,L), ITD(i,k), ILD(i,k)]      (5)

As the main spatial feature, the CCF for two sources with different azimuths is described in Fig. 2.
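The per-unit feature extraction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the small numerical epsilons, and the sign convention of the lag are our choices; only the 33-dim CCF (L = 16 at 16 kHz), the argmax-based ITD and the log-energy-ratio ILD come from the text.

```python
import numpy as np

def tf_unit_features(xl, xr, max_lag=16):
    """Spatial features for one TF-unit pair (illustrative sketch).

    xl, xr : 1-D arrays, left-/right-ear samples of one TF unit.
    max_lag: L = 16 samples (i.e., +/-1 ms at 16 kHz), giving a 33-dim CCF.
    Returns a 35-dim vector [CCF(-L..L), ITD, ILD].
    """
    n = len(xl)
    # Shared normalization term of the CCF (small epsilon avoids 0/0).
    norm = np.sqrt(np.sum(xl**2) * np.sum(xr**2)) + 1e-12
    lags = np.arange(-max_lag, max_lag + 1)
    ccf = np.empty(len(lags))
    for j, d in enumerate(lags):
        if d >= 0:
            # sum_m xl(m) * xr(m - d), valid overlap only
            ccf[j] = np.sum(xl[d:] * xr[:n - d]) / norm
        else:
            ccf[j] = np.sum(xl[:n + d] * xr[-d:]) / norm
    itd = lags[np.argmax(ccf)]                      # delay at the CCF peak, in samples
    ild = 10.0 * np.log10((np.sum(xl**2) + 1e-12) /
                          (np.sum(xr**2) + 1e-12))  # energy ratio in dB
    return np.concatenate([ccf, [itd, ild]])
```

For identical left and right signals the CCF peaks at zero lag, so ITD and ILD are both zero, as expected.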
Using head-related impulse responses (HRIR) and TIMIT data, the two sources are located at -30° and 60°, respectively. The upper half of the figure shows the CCF curve of a single TF unit, while the lower half shows the CCF curves of all channels. In both views, the CCFs in the clean condition share a similar peak across different TF units, which corresponds to the source location; in high-frequency channels, the CCF has several peaks due to phase wrapping. As shown in Figs. 2(b) and 2(d), owing to noise and reverberation, the CCF peak is no longer clearly aligned with the azimuth; in particular, the CCFs of the TF units in a reverberant room are not clearly discriminable. For unmoving or slow-moving speakers, since the spatial features of consecutive frames are highly correlated, an LSTM can be used to model the temporal dynamics of the spatial features. The time step is an important parameter of the LSTM, and it is related to the inter-frame correlation of the spatial features. Fig. 3 shows the inter-frame correlation coefficients of the spatial features in different acoustic environments.

Figure 3: The inter-frame correlation coefficients of spatial features

In Fig. 3, the abscissa indicates the frame interval of the spatial features, and the ordinate is the inter-frame correlation coefficient. The selected environments include a clean environment, a noisy environment (SNR = 15 dB), and noisy reverberant environments (SNRs of 15 dB and 5 dB, with reverberation times of 0.2 s and 0.6 s). The spatial features show significant inter-frame correlation, whether in a clean environment or in a reverberant and noisy one, although noise and reverberation reduce the correlation. When the frame interval exceeds 6, the correlation coefficient of the spatial features falls below 0.1, and the inter-frame correlation is small enough to be neglected. Thus the time step of the LSTM is set to 11; that is, the IRM is estimated by using the spatial features of 11 consecutive frames (5 before and 5 after the current TF unit).
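One plausible way to measure this inter-frame correlation is sketched below. The paper does not spell out its exact estimator, so this averaged Pearson correlation between feature vectors a fixed number of frames apart is our reading of Fig. 3's ordinate, not the authors' code.

```python
import numpy as np

def interframe_corr(features, interval):
    """Mean correlation coefficient between feature vectors `interval` frames apart.

    features: (num_frames, feat_dim) array of per-frame spatial features.
    Returns the Pearson correlation of each frame pair, averaged over frames.
    """
    a = features[:-interval]
    b = features[interval:]
    # Remove each vector's mean, then correlate pairwise.
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    return float(np.mean(num / den))
```

With such an estimator, static features (e.g., an unmoving speaker in a clean room) give values near 1, and the curve decays as the interval grows, matching the behaviour described for Fig. 3.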

Training targets
The IRM is defined as in Wang et al. [Wang, Narayanan and Wang (2014)]:

IRM(i,k) = ( S^2(i,k) / ( S^2(i,k) + N^2(i,k) ) )^β      (6)

where S^2(i,k) and N^2(i,k) denote the speech energy and noise energy within a TF unit, respectively. The tunable parameter β is commonly set to 0.5.
In this paper, we separate the different speakers through spatial information, and the IRM is redefined as:

IRM1(i,k) = S1^2(i,k) / ( S1^2(i,k) + S2^2(i,k) + N^2(i,k) )      (7)

where S1(i,k) and S2(i,k) represent the two speakers' signals, and N(i,k) is the additive noise. In Eq. (7), the numerator represents the energy of the target speech, while the denominator is the total energy of the mixture. Since, in a given TF unit, the two sources and the noise are regarded as uncorrelated, the IRMs for the two speakers and the noise can be written as:

IRM1(i,k) = S1^2(i,k) / ( S1^2(i,k) + S2^2(i,k) + N^2(i,k) )
IRM2(i,k) = S2^2(i,k) / ( S1^2(i,k) + S2^2(i,k) + N^2(i,k) )      (8)
IRMn(i,k) = N^2(i,k) / ( S1^2(i,k) + S2^2(i,k) + N^2(i,k) )

The LSTM outputs these IRMs, which indicate the magnitude ratio of each source to the mixture, and Eq. (8) guarantees that the IRM values of all LSTM output neurons for each channel sum to 1.
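As a quick illustration of the modified mask, the three energy ratios can be computed per TF unit as below. The function name and the stabilizing epsilon are ours; the ratios themselves follow Eq. (8), so the three masks always sum to 1.

```python
import numpy as np

def modified_irms(s1, s2, n):
    """Per-TF-unit masks for two speakers plus noise, as in Eq. (8).

    s1, s2, n: magnitude spectrograms of the two speakers and the
    additive noise on a common TF grid. Assuming the sources are
    mutually uncorrelated, the returned masks sum to 1 everywhere.
    """
    e1, e2, en = s1**2, s2**2, n**2
    total = e1 + e2 + en + 1e-12   # epsilon guards empty TF units
    return e1 / total, e2 / total, en / total
```

This sum-to-one property is what later lets the output layer of the network be interpreted as a distribution over sources.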

Speech separation and reconstruction
For speakers in different directions, the LSTM outputs the IRM corresponding to each speaker at its azimuth. In binaural speaker separation, the sound sources are located only in the front half of the horizontal plane. With the MIT HRIR, the azimuth is uniformly sampled in steps of 10°, so the front half-plane has 19 directions, corresponding to 19 output neurons of the LSTM network. In addition, to handle ambient noise, the LSTM is designed with one more output neuron corresponding to the IRM of the noise in the mixture; therefore, the number of LSTM output neurons is 20. The training target of the LSTM for each channel is thus the IRM vector consistent with Eq. (8), that is:

T(i,k) = [0, …, IRM1(i,k), …, IRM2(i,k), …, 0, IRMn(i,k)]      (9)

where IRM1(i,k) and IRM2(i,k) occupy the output neurons of the two speakers' azimuths, IRMn(i,k) occupies the noise neuron, and the remaining entries are zero.
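Assembling such a 20-dim target can be sketched as follows. Note the slot layout (azimuths in ascending order, noise last) is our assumption for illustration; the paper only fixes the neuron count (19 azimuths + 1 noise).

```python
import numpy as np

# 19 directions in the front half-plane, sampled every 10 degrees.
AZIMUTHS = list(range(-90, 91, 10))

def target_vector(az1, az2, irm1, irm2, irm_noise):
    """20-dim training target for one TF unit (19 azimuth slots + 1 noise slot).

    az1, az2 : source azimuths in degrees (must be on the 10-degree grid).
    irm1, irm2, irm_noise : the per-unit mask values from Eq. (8).
    """
    t = np.zeros(len(AZIMUTHS) + 1)
    t[AZIMUTHS.index(az1)] = irm1   # speaker 1's direction slot
    t[AZIMUTHS.index(az2)] = irm2   # speaker 2's direction slot
    t[-1] = irm_noise               # dedicated noise slot
    return t
```

Because the three mask values sum to 1 per TF unit, the full target vector also sums to 1, matching the constraint of Eq. (8).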

The architecture of BLSTM
The network consists of an input layer, a BLSTM layer and an output layer. The BLSTM layer is composed of bidirectional long short-term memory (BiLSTM) units, each of which is bidirectionally connected with the preceding and following memory units. The dimension of the input layer equals the dimension of the spatial feature vector. The time step in this paper is set to 11 (5 frames before and 5 after the current TF unit). The BLSTM network of the proposed system has two layers: a hidden layer of 256 neurons and an output layer of 20 neurons. The structure of the network is shown in Fig. 4.
The network is trained to minimize the mean squared error between the estimated and desired IRM vectors:

L = E[ || IRM_vector_est − IRM_vector ||_2^2 ]

where E[·] denotes the expectation operation and ||·||_2 represents the L2 norm; IRM_vector_est is the output of the LSTM, and IRM_vector is the desired output. In this paper, the total number of training epochs is 20, the learning rate is 0.003, and the Adam optimizer is used to adapt the learning rate.

Simulation and result analysis

Simulation setup
For both training and test, mono source signals taken from the CHAINS Speech Corpus [Cummins, Grimaldi, Leonard et al. (2006)] are convolved with the MIT HRIR to generate binaural signals. The CHAINS speech corpus contains 33 sentences spoken by 36 speakers; 9 sentences are selected from the CSLU Speaker Identification corpus and 24 sentences are from the TIMIT corpus. Binaural signals of two sources with different azimuths are mixed to generate the mixture speech, with one male source and one female source. The speakers and speech content for training differ from those for test. Gaussian white noise, which is uncorrelated with the binaural signals and independent between the two ears, is added to the binaural mixture as ambient noise. The SNR of the mixture signals for training and test is set to 0, 5, 10, 15 and 20 dB; for SNR generalization, the test set also includes -3, 3, 6, 9 and 12 dB. The binaural room impulse responses (BRIR) are obtained with the ROOMSIM software [Campbell, Palomaki and Brown (2005)], which simulates room acoustics. The reverberation time (RT60) of the BRIRs is 0.2 s or 0.6 s, and the reverberant signals are used only for test, to verify the generalization of the proposed algorithm to reverberation. The two speech sources are located in the front half of the horizontal plane with different azimuths, giving 171 combinations of source spatial configuration, all of which are used for training and test. The placement of the speech sources and the receiver is depicted in Fig. 5.
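Mixing a speech signal with noise at a prescribed SNR can be done with the common scaling recipe below. The paper does not spell out its scaling procedure, so this is a standard sketch rather than the authors' exact setup.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech-to-noise power ratio equals `snr_db`, then mix.

    speech, noise : 1-D arrays of equal length.
    snr_db        : desired SNR of the returned mixture, in dB.
    """
    p_s = np.mean(speech**2)
    p_n = np.mean(noise**2) + 1e-12
    # gain^2 * p_n must equal p_s / 10^(snr_db / 10)
    gain = np.sqrt(p_s / (p_n * 10.0**(snr_db / 10.0)))
    return speech + gain * noise
```

For binaural data, the same recipe would be applied with independent noise realizations for the left and right ears, matching the uncorrelated-noise assumption above.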

Figure 5: The spatial configuration of sources and receiver

Perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI) are used to evaluate the performance of speaker separation. STOI is an intelligibility metric whose value typically lies between 0 and 1; PESQ is correlated with speech quality and takes values in the range -0.5 to 4.5. We compare the proposed method, LSTM-based separation with the modified IRM, with several related binaural speech separation methods. The three algorithms involved in the comparison are: the DUET-based speech separation algorithm, the DNN-based method with IBM, and the DNN-based method with the traditional IRM.

Evaluation and analysis
Firstly, we evaluate the performance of the proposed algorithm in the matched noisy environment, that is, the test mixtures and the training mixtures have the same SNR. The PESQ scores of the different algorithms are shown in Tab. 1. Based on the results in Tabs. 4-7, we find that both methods achieve reliable speech quality and intelligibility under reverberant conditions. Compared with the non-reverberant results in Tabs. 1 and 2, PESQ and STOI are slightly reduced, but the performance remains stable. At the same time, IRM-LSTM outperforms IRM-DNN, which means that IRM-LSTM generalizes better to reverberation.

Conclusion
In this work, we present an LSTM-based binaural speech separation framework. By considering the temporal correlation of spatial features, the LSTM model estimates the IRM for each sound source in each TF unit more accurately. The LSTM-based separation improves both speech quality and intelligibility, and the proposed algorithm shows consistent results in unmatched reverberant and noisy conditions; this generalization ability is attributed to the use of the LSTM model.