Binaural speaker identification using the equalization-cancelation technique

In real applications, environmental effects such as additive noise and room reverberation lead to a mismatch between training and testing signals that substantially reduces the performance of far-field speaker identification. As a solution to this mismatch problem, in this paper, a new binaural speaker identification system is proposed which employs the well-known equalization-cancelation technique in its structure. The equalization-cancelation algorithm is employed to enhance the input test speech and alleviate the detrimental effects of noise and reverberation in the speaker identification system. The performance of the proposed speaker identification system is compared with unprocessed identification systems and a traditional binaural speaker identification system from the literature. The proposed system is evaluated in both anechoic and reverberant conditions using different types of noise at various azimuthal positions. Simulation results show the superiority of the proposed method in all experimental conditions.


Introduction
Speaker identification (SI) systems aim to extract the embedded speaker information from the speech signal. A typical SI system involves three main stages of feature extraction, speaker modeling, and scoring [1][2][3][4].
As the first stage, feature extraction transforms the incoming speech signal into a convenient representation for the later speaker identification stages. Three widely used features in SI systems are the Mel-frequency cepstral coefficients (MFCCs) [5], Gammatone frequency cepstral coefficients (GFCCs) [6], and perceptual linear predictive (PLP) coefficients [7]. The goal of the speaker modeling stage is to train the models that describe the feature distributions of individual speakers. For this purpose, in early studies, Gaussian mixture models (GMMs) were used for speaker modeling in SI systems. In those systems, the GMM parameters are trained by the expectation-maximization (EM) algorithm [8]. In later works, for efficient training of speaker-related GMM parameters, the combination of GMMs with a universal background model (UBM), known as the GMM-UBM model, was considered [9]. Finally, in the scoring stage of the SI system, identification scores are obtained by calculating the likelihoods of observing the feature frames given the speaker model.
While speaker models are constructed using clean speech signals, in real applications, speaker recognition (SR) is performed on mismatched test signals. Mismatches can arise from different causes, including channel and inter-speaker variabilities, additive noise, and reverberation. Robust SI systems have the task of improving the recognition performance in mismatched conditions. For this purpose, many robust systems have been introduced to deal with the effect of different mismatches [10,11]. Ways of tackling the mismatches generated by channel distortion have been frequently studied. In this regard, the state-of-the-art SR systems incorporate joint factor analysis (JFA) [12] and the i-vector [13][14][15], one of its important variants, in their implementations. In recent years, deep neural networks (DNNs) have also been used in the structure of i-vector-based systems [16,17]. The initial attempts with DNNs for SR were made in the context of i-vector speaker modeling, where the DNN computes the phonetic posteriors [18,19]. To solve the channel mismatch problem, in later studies, a new feature, called the DNN bottleneck feature, was extracted as a substitute for traditional features such as MFCCs [20][21][22][23]. To improve the performance of DNNs in speaker recognition, embeddings trained with data augmentation are used. In this regard, a DNN trained to discriminate between speakers converts variable-length speech into fixed-dimensional embeddings called x-vectors [24,25]. In the model called generative x-vector, the complementary information of the i-vector and x-vector is combined [26].
The aim of the aforementioned methods has been to reduce the mismatches created by telephony systems. However, little attention has been paid to the mismatch effects that arise when SI systems are employed in far-field conditions, where the sensor is far away from the target speaker. In these conditions, additive noise and room reverberation are the main sources of mismatch. Compensation methods have been introduced to deal with the effect of this kind of mismatch [10]. Depending on the stage at which they are applied, these methods can be divided into three categories, namely, methods operating on the feature extraction, speaker modeling, or scoring stages. In the methods of the first category, the noise is removed directly from the speaker characteristic information. Cepstral mean normalization (CMN) [27], relative spectra (RASTA) processing [28], multi-taper windows [29,30], and warping methods [31] are examples of the methods in this group. In the second category of compensation methods, such as parallel model combination [32], the aim is to make the SI system more robust by altering the learned speaker models using distortion characteristics. The third category aims at achieving robustness of the SI system by changing the classifier score at the utterance or frame level. However, all the above-mentioned methods share the drawback that they require strong assumptions about the environmental effects, such as stationarity, and an explicit description of the noise, which leads to poor performance in far-field speaker identification.
Human listeners perform speaker identification robustly without relying on any assumptions about the distortion characteristics [33]. This has inspired many researchers to introduce robust SI methods based on models of the human auditory system to deal with the mismatch problem. The ability of the human auditory system to separate voices in an environment with multiple sources is referred to as auditory scene analysis (ASA) [34]. Computational auditory scene analysis (CASA) employs methods inspired by ASA to separate speech in multi-source environments [35]. The techniques of CASA have motivated some researchers in the area of robust speaker identification [36][37][38][39][40]. The auditory system also exploits the signals from the right and left ears, which gives it the ability to spatially separate the target speaker signal from interfering sounds [41]. Here, ASA uses the information about the spatial location of sound sources, principally encoded by the interaural time difference (ITD) and the interaural level difference (ILD) cues [34]. As one of the aspects of CASA, the binaural cues can be used to estimate ideal masks to segregate target speech from background noise [42][43][44][45][46][47][48]. In one class of binaural speech segregation methods, the mask is estimated by a deep neural network (DNN) classifier [47,48]. The training process of the DNNs is based on features extracted from predefined mixture signals, with the ideal masks defined as the target. A limitation of such supervised learning methods is that the efficiency of segregation is highly dependent on the quality of training and the amount of training data from various sources. In the speaker identification framework, the work of May et al. [39] suggests the use of binaural scene analysis to deal explicitly with the mismatch problem.
In this system, binaural processing is used to simultaneously localize, detect, and identify a predefined number of speakers in the presence of reverberation and interfering noise sources placed at different spatial locations. An important drawback of these separation and identification systems is that they are based on a supervised learning strategy and, therefore, depend on prior knowledge of source characteristics, which is a strong limitation for practical applications.
The spatial separation between target speaker and maskers often causes large improvements in speech intelligibility in those environments. The amount of intelligibility gain achieved by binaural hearing is called the binaural masking level difference (BMLD). The equalization-cancelation (EC) model is considered one of the important and simple computational models of the binaural auditory system. The EC model was originally developed by Durlach [49] and further improved by Culling and Summerfield [50] to predict the BMLD. The original EC model is based on the idea that the auditory system transforms the signals arriving at the two ears so that the masker components are "equalized" (the E process) in both ears, and then the signal in one ear is "canceled" (i.e., subtracted) from that in the other ear (the C process). Culling et al. used the EC model to interpret intelligibility performance in two experiments in a simulated anechoic environment involving multiple speech-shaped noise (SSN) maskers [51]. Beutelmann and Brand applied an extended EC model to predict performance in speech intelligibility tasks in several environments, ranging from anechoic space to a cafeteria hall [52]. The EC idea was further developed by incorporating short-time strategies to predict cases involving nonstationary interferers [53].
An extended version of the EC model was also described and applied to speech intelligibility tasks in the presence of multiple maskers in [54]. Furthermore, Wan et al. developed a short-time version of the extended EC in speech intelligibility experiments in the presence of different maskers, including multiple speech maskers [55]. In another study and inspired by the EC theory, a two-stage binaural speech enhancement with the Wiener filter approach was introduced [56]. Later, an EC-based approach was introduced in the field of speech separation that shows performance superiority to the classical localization-based binaural speech separation systems [57].
The EC model was examined in many binaural processing fields because of its conceptual simplicity and its ability to describe the binaural phenomena. However, so far, it has not been considered as a solution to the mismatch problem in far-field speaker identification systems. In this paper, a new speaker identification system based on the short-time extended EC model is proposed to deal with the mismatch problem imposed by environmental effects in the far-field speaker identification. The backbone of the proposed SI system is the well-known GMM-UBM in which the EC process is applied to the auditory representations of both ears at the testing phase to remove the environmental effects, including noise and reverberation. Then, the output of the EC modeling is given as input to the decision module. The performance of the proposed system is compared with identification systems based on MFCC and GFCC features extracted from unprocessed signals and the traditional binaural speaker identification system of May et al. [39] in different simulated acoustic scenarios.
The structure of the paper is as follows: Section 1 gives a background on the monaural SI system, the auditory feature extraction, and the traditional binaural SI system. Section 2 outlines the main contribution of the paper. Here, the proposed speaker identification is presented along with a detailed explanation of the EC binaural model. In Section 3, speaker identification experiments are conducted to analyze the benefit of using the new binaural model in the SI system. Section 4 summarizes the main findings and concludes the paper.

Monaural speaker identification system
In speaker identification, human speech from an individual is used to identify who that individual is. There are two distinct processing stages. In the first stage, called training (or enrollment), the speech from each known speaker is taken to build (i.e., train) the model for that speaker. In the second stage, called testing, comparison of an unknown source of speech against each of the trained individual speaker models is carried out. In closed-set form identification, the unknown individual belongs to a pre-existing pool or database of speakers (speaker models) and the problem then becomes that of choosing which speaker from the pool the unknown speech is derived from.
As mentioned earlier, in this paper, the SI system based on GMM-UBM is used. Figure 1 shows the core structure of a typical SI system based on GMM-UBM. As illustrated, the feature extraction block generates features that are used in training the UBM, in adapting the GMMs, and in the testing phase. In the training phase, a universal background model (UBM) is generated from a large collection of speech utterances using the expectation-maximization (EM) algorithm. The EM algorithm iteratively refines the model parameters to maximize the likelihood of the training data under the UBM [58]. The speaker-dependent GMMs are obtained by adapting the trained UBM parameters to the speaker-dependent speech material.

Fig. 1 The block diagram of a typical speaker identification system [10]

Geravanchizadeh and Ghalamiosgouei EURASIP Journal on Audio, Speech, and Music Processing (2020) 2020:20

After building the speaker models offline, in the testing phase, the log-likelihoods of the test features under the speaker GMMs and the UBM are calculated. The difference between the log-likelihoods of each GMM and the UBM gives the score. Finally, the speaker is identified as the one whose model yields the maximum score.
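The scoring rule described above can be sketched as follows. This is a minimal illustration, assuming diagonal-covariance GMMs and frame log-likelihoods averaged over the utterance; the function names `gmm_loglik` and `identify` and the `(weights, means, variances)` tuple layout are hypothetical and are not taken from any particular toolkit.

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM (assumed form).

    X: (T, D) feature frames; weights: (K,); means, variances: (K, D).
    """
    diff = X[:, None, :] - means[None, :, :]                            # (T, K, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)   # (K,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)   # (T, K)
    log_comp = log_comp + np.log(weights)
    # Log-sum-exp over mixture components for numerical stability.
    m = log_comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True)))[:, 0]

def identify(X, speaker_gmms, ubm):
    """Return the index of the best-scoring speaker and all scores.

    Score = mean log-likelihood under the speaker GMM minus that under the UBM.
    """
    ubm_ll = gmm_loglik(X, *ubm).mean()
    scores = np.array([gmm_loglik(X, *g).mean() - ubm_ll for g in speaker_gmms])
    return int(np.argmax(scores)), scores
```

Because the UBM term is the same for every speaker, it does not change the arg max in closed-set identification; it is kept here only to mirror the score definition in the text.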

Auditory features
The auditory perception of the sound frequency contents for speech signals could be described by nonlinear scales such as Mel and equivalent rectangular bandwidth (ERB) which lead to two feature extraction approaches of MFCC [5] and GFCC [6].
The Mel filterbank is composed of triangular filters where their center frequencies and bandwidths are calculated in the Mel scale. To extract the MFCC features, first, the input signal is decomposed into a time-frequency representation using the Mel filterbank. Then, the representations are compressed by logarithmic function and fed into the discrete cosine transform (DCT) to decorrelate the final MFCC coefficients.
The Gammatone filterbank inspired by psychoacoustical and physiological experiments is one of the standard models of cochlear filtering, which uses ERB-rate scaling to describe center frequencies and bandwidths of the filters. A bank of 32 filters is used with center frequencies ranging from 50 to 4000 Hz or 8000 Hz, depending on the sampling frequency of speech data. In this work, as in [39], a Gammatone filterbank with 32 filters in the range of (50, 8000) Hz was used for the sampling frequency of 16000 Hz.
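As an illustration of ERB-rate scaling, the sketch below spaces 32 center frequencies uniformly on the ERB-rate scale between 50 and 8000 Hz. It assumes the commonly used Glasberg and Moore ERB-rate formula; the text does not state which formula or which spacing the filterbank implementation actually uses, and the function names are hypothetical.

```python
import numpy as np

def hz_to_erb_rate(f_hz):
    """ERB-rate (number of ERBs below f), Glasberg & Moore form (assumed)."""
    return 21.4 * np.log10(0.00437 * np.asarray(f_hz, dtype=float) + 1.0)

def erb_rate_to_hz(e):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (np.asarray(e, dtype=float) / 21.4) - 1.0) / 0.00437

def center_frequencies(n_filters=32, f_low=50.0, f_high=8000.0):
    """Center frequencies equally spaced on the ERB-rate scale."""
    e = np.linspace(hz_to_erb_rate(f_low), hz_to_erb_rate(f_high), n_filters)
    return erb_rate_to_hz(e)
```

With this spacing the filters are dense at low frequencies and sparse at high frequencies, which is the intended effect of a nonlinear auditory scale.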
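The impulse response of the Gammatone filter is given below [6]:

$$g(t) = a\, t^{\,n-1} \exp\!\left(-2\pi b\, \mathrm{ERB}_N(f_r)\, t\right) \cos\!\left(2\pi f_r t + \phi\right), \quad t \ge 0 \qquad (1)$$

where $a$ is the amplitude, $n$ and $b$ are the parameters defining the envelope of the Gamma distribution, $f_r$ is an asymptotic frequency, $\mathrm{ERB}_N(f_r)$ is the equivalent rectangular bandwidth at $f_r$, and $\phi$ is the initial phase.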
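A sampled Gammatone impulse response with the parameters named above (amplitude a, envelope parameters n and b, frequency f_r, phase ϕ) can be sketched as follows. The default values n = 4 and b = 1.019 and the Glasberg and Moore expression for ERB_N are common choices and are assumptions here, as is the 64-ms duration; none of them is specified in the text.

```python
import numpy as np

def erb_n(f_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg & Moore form, assumed)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def gammatone_ir(f_r, fs=16000, n=4, b=1.019, a=1.0, phi=0.0, duration=0.064):
    """Sampled impulse response a * t^(n-1) * exp(-2*pi*b*ERB_N(f_r)*t) * cos(2*pi*f_r*t + phi)."""
    t = np.arange(int(round(duration * fs))) / fs
    envelope = a * t ** (n - 1) * np.exp(-2.0 * np.pi * b * erb_n(f_r) * t)
    return envelope * np.cos(2.0 * np.pi * f_r * t + phi)
```

Filtering a signal with one channel then amounts to convolving it with `gammatone_ir(f_r)`, for example with `np.convolve(x, gammatone_ir(1000.0))[:len(x)]`.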
Since the filter output retains the original sampling frequency of the input signal, the fully rectified 32-channel filter responses are down-sampled to 100 Hz along the time dimension. This yields a frame shift of 10 ms, which is used in many short-time speech feature extraction methods. The resulting responses, called the cochleagram, form a matrix representing a time-frequency (T-F) decomposition of the input signal [59]. A time frame of the cochleagram representation is called a Gammatone feature (GF). The dimension of a GF vector (here 32) is larger than that of typical feature vectors (e.g., GFCCs) used in an SI system. Additionally, because of the overlap among neighboring filter channels, the GF components are strongly correlated with each other. To reduce the GF dimensionality and decorrelate its components, the DCT operation [60] is applied to produce GFCCs [6] with a dimension of 12.
To take into account the speaking rate of speakers in the SI system, the first and second derivatives of MFCC and GFCC features are computed and used along with the original feature values in the task of speaker identification.
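The sketch below appends first and second time derivatives to a feature matrix. It approximates the derivatives with central differences via `np.gradient`; the text does not say how the derivatives are computed (regression-based deltas over several frames are another common choice), so this is an assumption, and `add_deltas` is a hypothetical name.

```python
import numpy as np

def add_deltas(features):
    """Stack static features with first and second time derivatives.

    features: (T, D) matrix, one row per frame. Returns a (T, 3*D) matrix.
    """
    delta = np.gradient(features, axis=0)    # first derivative along time
    delta2 = np.gradient(delta, axis=0)      # second derivative along time
    return np.hstack([features, delta, delta2])
```

For 12-dimensional GFCCs this yields 36-dimensional feature vectors per frame.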

Traditional binaural speaker identification system
The structure of the traditional binaural SI system introduced by May et al. [39] is illustrated in Fig. 2. This system comprises three important processing stages. In the first stage, the test speech signal is localized using binaural cues, and azimuths related to active sources are determined. In the second stage, called speech detection, the natures of active sources (i.e., speech or non-speech) are identified. The result of the first and second processing stages is a binary mask that is used in the final missing data (MD) SI system.

Proposed method
In this paper, a new SI system based on the short-time extended equalization-cancelation method is proposed. The EC processing aims to reduce the mismatches imposed by environmental conditions on the test features. Figure 3 shows the proposed binaural SI system based on the EC process. First, in the training phase, the UBM and GMMs are computed. Then, in the testing stage, the received left and right ear signals are decomposed by the auditory filtering model. For this purpose, the Gammatone filterbank is used. Then, the EC process operates on the Gammatone-filtered signals of the left and right ears. The output of the EC-based model is similar to that of the simulated auditory nerve response and can be converted to an acoustic feature used in the pattern matching unit.

Fig. 3 The structure of the proposed binaural SI system using the EC-based model

Figure 4 depicts the details of the binaural EC-based model employed in the proposed SI system. The received left and right ear signals from the auditory filtering, $X^{L}_{i}(t)$ and $X^{R}_{i}(t)$, are split into time frames of 20 ms with a 10-ms overlap, yielding $X^{L}_{i,j}(t)$ and $X^{R}_{i,j}(t)$, where $i$ and $j$ represent the channel and frame indices, respectively. With $X^{L}_{i}(t)$ and $X^{R}_{i}(t)$ as input signals to the EC unit, the output $Y_{i,j}(t)$ is computed as in [61] (Eq. (2)), using a time window $W_j(t)$ applied at time frame $j$, where $K = 20$ represents the length of the window (in ms) for the sampling frequency of $f_s = 16$ kHz, and $t$ is the time index (in ms). $\tau_0(i, j)$ is the value that maximizes the cross-correlation function $\rho_{i,j}(\tau)$ defined in Eq. (5), and $\alpha_0(i, j)$ is given by Eq. (6) in terms of $E_{X^{L}_{i,j}}$ and $E_{X^{R}_{i,j}}$, the energies of the monaural left and right ear signals.
It is noteworthy that in some applications of the EC algorithm (e.g., [50][51][52][53][54][55]), the definitions of the parameters $\rho_{i,j}(\tau)$ and $\alpha_0(i, j)$ are such that the noise signal is canceled at the output of EC. However, similar to the works in [57,61,62], the EC model presented here is based on the modified definitions of $\rho_{i,j}(\tau)$ and $\alpha_0(i, j)$ (see Eqs. (5) and (6)) to produce the target-canceled signal at the output (see Eq. (2)).
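Two steps of the short-time processing described above can be illustrated in code: splitting a channel signal into 20-ms frames with a 10-ms overlap at 16 kHz, and finding the lag that maximizes a cross-correlation between the left and right frames. The sketch uses a plain normalized cross-correlation as a stand-in; it does not reproduce the modified definitions of Eqs. (5) and (6), and the function names and the lag search range are hypothetical.

```python
import numpy as np

def frame_signal(x, fs=16000, frame_ms=20, hop_ms=10):
    """Split a 1-D signal into overlapping frames (20 ms long, 10 ms hop by default)."""
    x = np.asarray(x, dtype=float)
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def best_lag(left, right, max_lag):
    """Lag (in samples) maximizing sum_t left[t + lag] * right[t], normalized.

    Stand-in for the cross-correlation search; not the paper's Eq. (5).
    """
    norm = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2)) + 1e-12
    lags = np.arange(-max_lag, max_lag + 1)
    rho = np.empty(len(lags))
    for k, lag in enumerate(lags):
        if lag >= 0:
            c = np.sum(left[lag:] * right[:len(right) - lag])
        else:
            c = np.sum(left[:lag] * right[-lag:])
        rho[k] = c / norm
    k_best = int(np.argmax(rho))
    return int(lags[k_best]), float(rho[k_best])
```

At 16 kHz, a search range of about ±1 ms (`max_lag=16`) covers typical interaural delays for a human-sized head; this range is an illustrative assumption.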
The energies of the framed left and right monaural signals (i.e., $E_{X^{L}_{i,j}}$ and $E_{X^{R}_{i,j}}$) and of the output of the binaural EC processing unit (i.e., $E_{Y_{i,j}}$) at each T-F unit $(i, j)$ are used in the decision module to select the final output of the EC-based model, $X^{EC}_{i,j}(t)$ (Eq. (7)). Referring to Eq. (7), some points are worth mentioning. The output of the EC-based model is determined by evaluating two energy ratios: the ratio of the energies of the left and right monaural signals (i.e., $E_{X^{L}_{i,j}}$ and $E_{X^{R}_{i,j}}$), and the ratio of the energy of the monaural signal (i.e., $E_{X^{L}_{i,j}}$ or $E_{X^{R}_{i,j}}$) to that of the estimated interference ($E_{Y_{i,j}}$). The following example shows how Eq. (7) works. Suppose that the noise source is located on the right side. Since the target signal comes from the frontal azimuthal position, the energy of the right signal is greater than that of the left signal. In this case, $X^{L}_{i,j}(t)$ is considered as a candidate for the output of the EC-based model. However, we cannot be sure that the candidate signal is an estimate of the target signal, because $X^{L}_{i,j}(t)$ and $X^{R}_{i,j}(t)$ could both represent the noise signal at the specified T-F unit as well. This follows from the fact that, in the T-F representation, speech generally has more concentrated energy than noise. Therefore, as a second criterion, the ratio of the energies of the left and residual noise signals (i.e., $E_{X^{L}_{i,j}} / E_{Y_{i,j}}$) is calculated. If this signal-to-noise ratio (SNR) value is larger than 0.5, then the selected signal (i.e., $X^{L}_{i,j}(t)$) is taken as the output of the model. The same argument applies to the justification for selecting $X^{R}_{i,j}(t)$ as the output of the EC-based model in the decision module.
If none of the above conditions are fulfilled, $Y_{i,j}(t)$ is selected as the final output of the model, which is an estimate of the noise signal in that T-F unit. Selecting $Y_{i,j}(t)$ has the effect of flooring the output of the EC-based model to the residual signal, which has been shown to enhance the quality of the source separation system. Figure 5 shows the cochleagrams of the left monaural, right monaural, and binaural EC-processed signals, and the EC-based model output. With the clean target and the point Babble noise [63] located at the azimuths of 0° and −60°, respectively, the left and right ear mixture signals are obtained by convolving the clean and noise signals with their corresponding BRIRs and adding them at SNR = 0 dB. As is evident from the figure, the output of the model is very similar to the cochleagram of the right ear. This can be explained by the fact that, here, the right ear, called the better ear (BE), has a larger SNR than the left ear, and the binaural EC-based model selects the ear signal that is highly correlated with the target.
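The decision rule described in the text can be sketched for a single T-F unit as follows. The sketch follows the worded description only: the lower-energy ear is the candidate, it is accepted if its energy relative to the EC output exceeds 0.5, and otherwise the EC output is returned. The handling of equal left and right energies and the small constant `eps` are assumptions, and `ec_decision` is a hypothetical name.

```python
import numpy as np

def ec_decision(x_left, x_right, y, snr_threshold=0.5, eps=1e-12):
    """Select the output of the EC-based model for one T-F unit.

    x_left, x_right: framed monaural signals; y: EC (target-canceled) output.
    """
    e_l = np.sum(x_left ** 2)
    e_r = np.sum(x_right ** 2)
    e_y = np.sum(y ** 2) + eps   # eps avoids division by zero (assumption)

    # The ear with less energy is assumed to be less affected by the noise.
    if e_l < e_r and e_l / e_y > snr_threshold:
        return x_left
    if e_r < e_l and e_r / e_y > snr_threshold:
        return x_right
    # Otherwise floor the output to the residual (interference) estimate.
    return y
```

In this form, equal left and right energies fall through to the residual signal; a real implementation would need to settle that case explicitly.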
To obtain the features that serve as input to the SI system, cube-root compression (i.e., $\sqrt[3]{(\cdot)}$) and DCT operations are applied to the GFs to obtain the resulting GFCC features.
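The conversion from Gammatone features to GFCCs can be sketched as cube-root compression followed by a DCT that keeps 12 coefficients. The sketch assumes an orthonormal DCT-II and keeps coefficients 0 to 11; the text does not state the DCT normalization or whether the zeroth coefficient is retained, and the function names are hypothetical.

```python
import numpy as np

def dct_ii_matrix(n_out, n_in):
    """First n_out rows of the orthonormal DCT-II matrix for length-n_in inputs."""
    k = np.arange(n_out)[:, None]
    m = np.arange(n_in)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n_in))
    scale = np.full((n_out, 1), np.sqrt(2.0 / n_in))
    scale[0, 0] = np.sqrt(1.0 / n_in)
    return scale * basis

def gfcc_from_gf(gf, n_coeffs=12):
    """Cube-root compress (T, n_channels) Gammatone features and apply the DCT."""
    compressed = np.cbrt(gf)
    return compressed @ dct_ii_matrix(n_coeffs, gf.shape[1]).T
```

For a cochleagram of shape (T, 32), `gfcc_from_gf` returns a (T, 12) matrix, matching the GFCC dimension stated in the text.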

Experiments
The performance of the proposed SI system is assessed in different environmental conditions. For this purpose, speech signals are selected from the Grid database [64]. The Grid database consists of 17000 clean utterances spoken by 34 speakers (18 males, 16 females, 500 utterances per speaker). To ensure that there is no overlap between the speech material used for training and testing, the Grid database was randomly split into two sets. The first set, consisting of 8500 utterances (250 sentences per speaker), was used to train two gender-dependent UBMs. From the second set (250 sentences per speaker), 175 sentences per speaker are used to generate the GMMs, and the remaining 75 are used for the testing stage.
To build the GMM-UBM model, at the first step, the gender-dependent UBMs are constructed using the EM algorithm. Then, the GMM of each speaker is generated by adapting the parameters of the UBM with a relevance factor of 16 [9]. Each of the UBMs is modeled by a GMM with 128 components, and the model of each speaker is obtained by a GMM of 128 components. The GMM-UBMs are implemented by the MSR toolkit [65]. To prevent the underestimation of speech energy due to silent parts, an energy-based voice activity detector (VAD) is employed in the training phase to take into account only signal segments with relevant speech activity [38]. Here, the speech-active segments are defined as those segments which have an energy level within 40 dB of the global maximum.
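The energy criterion of the VAD can be sketched on a frame basis: a frame counts as speech-active if its energy lies within 40 dB of the most energetic frame. The frame-level formulation, the `eps` constant, and the name `energy_vad` are assumptions; the text only specifies the 40-dB range relative to the global maximum.

```python
import numpy as np

def energy_vad(frames, range_db=40.0, eps=1e-12):
    """Boolean mask of frames whose energy is within range_db of the global maximum.

    frames: (n_frames, frame_len) matrix of signal frames.
    """
    energy_db = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + eps)
    return energy_db >= energy_db.max() - range_db
```

Features from frames where the mask is `False` would then be dropped before training, for example with `features[energy_vad(frames)]`.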
To reduce the dependency of the SI system on the database, the simulation is repeated 10 times, where in each run the training and test utterances are randomly selected. Then, the simulation results are averaged over all runs.
In the testing phase, the experiments are conducted in the presence of various additive noises, including White, Factory, and Babble noises selected from the Noisex-92 database [63] and Speech-Shaped Noise (SSN) taken from the Oldenburg University webpage [66]. The left and right ear signals are generated by convolving clean and noise test signals with binaural room impulse responses (BRIRs) and mixing them additively. The BRIRs are generated using the Roomsim simulation toolkit [67] with KEMAR selected as the artificial head [68]. The KEMAR is placed 1.75 m above the ground in a simulated room of dimensions 6.6 × 8.6 × 3 m³. The noisy binaural test signals are generated by adding the noises to the left and right target signals at SNRs of 0, 5, and 10 dB. The SNR of the mixtures is adjusted as the average value at the two ears. For evaluation purposes, the target signal is positioned at 0° azimuth. The noise source position is changed in steps of 10° from 0° to 90° at a radial distance of 1.5 m around the listener. The simulated listener is within the critical distance [69] of the target and noise sources. To systematically evaluate the impact of reverberation, an echoic room with T60 = 0.29 s is selected for all room boundaries within the room simulation software [67].
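The mixing step can be sketched as scaling the binaural noise by a single gain so that the SNR averaged over the two ears matches the target value. The sketch assumes that "average" means the arithmetic mean of the per-ear SNRs in dB, which is one possible reading of the text, and it takes signals that have already been convolved with the BRIRs; the function names are hypothetical.

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in dB from signal and noise energies."""
    return 10.0 * np.log10(np.sum(signal ** 2) / np.sum(noise ** 2))

def mix_binaural(target_lr, noise_lr, target_snr_db):
    """Mix (2, N) target and noise so the mean per-ear SNR (in dB) hits the target.

    One common gain is applied to both noise channels, which preserves the
    interaural level difference of the noise.
    """
    current = np.mean([snr_db(target_lr[c], noise_lr[c]) for c in range(2)])
    gain = 10.0 ** ((current - target_snr_db) / 20.0)
    scaled_noise = gain * noise_lr
    return target_lr + scaled_noise, scaled_noise
```

Applying the same gain to both ears matters here: scaling each ear separately would distort the binaural cues that the EC process relies on.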

Evaluation criterion
To investigate the performance of the SI system, the recognition accuracy is employed as the performance criterion. The recognition accuracy is defined as the ratio of the number of test utterances whose speaker is identified correctly to the overall number of test utterances.
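The criterion can be written as a one-line computation over per-utterance decisions. The function name and the use of integer speaker indices are illustrative assumptions.

```python
import numpy as np

def recognition_accuracy(predicted, actual):
    """Fraction of test utterances whose predicted speaker matches the true one."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    return float(np.mean(predicted == actual))
```

For example, three correct decisions out of four test utterances give an accuracy of 0.75.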

Results and discussions
The evaluation results of the Here, "Unprocessed" means that there is no binaural model that simulates the interaction between the left and right ears. In this case, the test feature is obtained by averaging the auditory representations of the left and right ear signals and subsequently applying the auditory compression and DCT operations. For better modeling of the speaking rate in the SI systems, the first and second derivatives of MFCC and GFCC are included in the "Proposed" and the "Unprocessed" systems.
The simulation results of different SI systems in anechoic and reverberant conditions are illustrated in Figs. 6, 7, 8, 9, 10, and 11. Figures 6, 7, and 8 show the performance evaluation for the anechoic environments at the SNRs of 0, 5, and 10 dB for different noise types. In general, it is seen that the SI systems using binaural processing techniques for the enhancement of the input mixture perform better than those based on unprocessed methods. However, as the level of noise decreases, the performance of the proposed SI method degrades relative to the system of "May et al." This can be explained by the fact that, in contrast to "May et al.," the proposed EC-based SI model depends highly on the input noise energy to perform the equalization-cancelation procedure satisfactorily. As the value of SNR increases, the contribution of noise at the input of the EC processing unit is lowered, which decreases the performance of the proposed model. The performance comparisons of different SI approaches for the noisy (SNR = 0, 5, 10 dB) and reverberant conditions (T60 = 0.29 s) for various types of noises are depicted in Figs. 9, 10, and 11. Once again, it is observed that the SI systems based on unprocessed input signals have the lowest performance as compared with the binaurally processed SI systems (i.e., "Proposed" and "May et al."). Also, it is seen that the proposed SI model outperforms the SI system of "May et al." in terms of recognition accuracy. The lower performance of the "May et al." SI system in the presence of reverberation can be explained by the operation of the speech detection module (refer to Fig. 2). The SI method of "May et al." depends on determining the active source characteristics. Accordingly, in reverberant conditions, late reflections make the detected active sources unreliable, which poses a challenge for the speech detection module and consequently reduces the identification performance of the system.
Figure 12 shows the averaged accuracies of the SI systems over different noise positions and noise types. The average results also show that the binaural methods achieve superior performance over the unprocessed SI systems.
The results in this diagram confirm those obtained in Figs. 6, 7, 8, 9, 10, and 11. For the anechoic noisy conditions, as the SNR level increases, the performance of "Proposed" gradually decreases in comparison with "May

Conclusions
It is known that the performance of far-field speaker identification is reduced in real environmental conditions due to the mismatch between training and testing features. In this paper, a new binaural speaker identification system is proposed which employs a short-time extended EC model to tackle the mismatch problem by removing the detrimental effects of noise and reverberation from the input mixture signal. The proposed speaker identification system uses the GMM-UBM structure for speaker modeling and a binaural EC-based model as a speech separation system that processes the auditory representations of both ears to remove noise and reverberation from the input signal. The binaural EC-based model incorporates an EC processing unit and a decision module. First, in the EC processing unit, an estimate of the residual signal (i.e., interference) is computed by canceling the target signal from the mixture. Then, in the decision module, an estimate of the target signal from the input mixture signal is determined using the energies of the monaural left ear, monaural right ear, and the estimated residual signals. The advantage of the EC-based method is its simplicity, which makes it easy to employ the spatial information for identifying the target speaker in complex auditory scenes.
To assess the efficiency of the proposed binaural SI system, the performance of the model is compared with those of the unprocessed systems and a baseline SI system from the literature. The experiments are conducted in anechoic and reverberant conditions using different types of noises. The simulation results show that the proposed binaural EC-based SI system outperforms its unprocessed counterpart in both experimental conditions. Moreover, in reverberant and low-SNR scenarios, the proposed system has superior performance in comparison with the mask-based binaural SI system of "May et al." used as the baseline. It is known that human listeners identify the target speaker robustly in different environmental conditions. In this paper, an auditory model was proposed to remove the undesired environmental effects from the input mixture signal. In dealing with the mismatch problem, it remains to explore the benefits of other binaural auditory models in the proposed SI system for more realistic situations such as cocktail party environments. Moreover, simulating the speaker identification performance of human listeners is a way to introduce new auditory-based speaker modeling that improves the overall performance of traditional SI systems in real environmental conditions. Therefore, as future work, the authors plan to design modern auditory-based speaker identification systems and evaluate their performance by conducting listening tests. As a common evaluation procedure for CASA systems, such listening tests are also important in exploring the limitations of the new SI models and thereby in approaching human auditory SI performance.