Independent Component Analysis and Time-Frequency Masking for Speech Recognition in Multitalker Conditions

When a number of speakers are simultaneously active, for example in meetings or noisy public places, the sources of interest need to be separated from interfering speakers and from each other in order to be robustly recognized. Independent component analysis (ICA) has proven a valuable tool for this purpose. However, ICA outputs can still contain strong residual components of the interfering speakers whenever noise or reverberation is high. In such cases, nonlinear postprocessing can be applied to the ICA outputs, for the purpose of reducing remaining interferences. In order to improve robustness to the artefacts and loss of information caused by this process, recognition can be greatly enhanced by considering the processed speech feature vector as a random variable with time-varying uncertainty, rather than as deterministic. The aim of this paper is to show the potential to improve recognition of multiple overlapping speech signals through nonlinear postprocessing together with uncertainty-based decoding techniques.


Introduction
When speech recognition is to be used in arbitrary, noisy environments, interfering speech poses significant problems due to its overlapping spectra and nonstationarity. If automatic speech recognition (ASR) is nonetheless required, for example for robust voice control in public spaces or for meeting transcription, independent component analysis (ICA) can be used to segregate all involved speech sources for subsequent recognition. In order to attain the best results, it is often helpful to apply an additional nonlinear gain function to the ICA output to suppress residual speech and noise. After a short introduction to ICA in Section 2, this paper shows in Section 3 how such nonlinear gain functions can be obtained based on three different principal approaches.
However, while source separation itself is greatly improved by nonlinear postprocessing, speech recognition results often suffer from artefacts and loss in information due to such masks. In order to compensate for these losses and to obtain results exceeding those of ICA alone, we suggest the use of uncertainty-of-observation techniques for the subsequent speech recognition. This allows for the utilization of a feature uncertainty estimate, which can be derived considering both artefacts and incorrectly suppressed components of target speech, and will be described in more detail in Section 4. From such an uncertain description of the speech signal in the spectrum domain, uncertainties need to be made available also in the feature domain, in order to be used for recognition. This can be achieved by the so-called "uncertainty propagation," which converts an uncertain description of speech from the spectrum domain, where ICA takes place, to the feature domain of speech recognition. After this uncertainty propagation, detailed in Section 5, recognition can take place under observation uncertainty, as shown in Section 6.
The entire process is vitally dependent on the appropriate estimation of uncertainties. Results given in Section 8 show that when the exact uncertainty in the spectrum domain is known, recognition results with the suggested approach are far in excess of those achievable by ICA alone. Also, a realistically computable uncertainty estimate is introduced, and experiments and results given in Sections 7 and 8 show that with this practical uncertainty measure, significant improvements of recognition performance can be attained for noisy, reverberant room recordings.
The presented method is closely related to other works that consider observation vectors as uncertain for decoding purposes, most often for noisy speech recognition [1][2][3][4], but in some cases also for speech recognition in multitalker conditions, as, for example, [5,6], or [7] in conjunction with speech segregation via binary masking (see, e.g. [8,9]).
The main novelty in comparison with the above techniques is the use of independent component analysis in conjunction with uncertainty estimation and with a piecewise approach of transforming uncertainties to the feature domain of interest. This allows for the suggested approach to utilize the combined strengths of independent component analysis and soft time-frequency masking, and to still be used with a wide range of feature parameterizations, often without the need for recomputing the uncertainty mapping function to the desired ASR-domain. Corresponding results are shown here for both MFCC and RASTA-PLP coefficients, but the discussed uncertainty transformation approach also generalizes well to the ETSI advanced front end, as shown in [10].

Independent Component Analysis for Reverberant Speech
Independent component analysis has been successfully employed for the separation of speech mixtures in both clean and noisy environments [11,12]. Alternative methods include adaptive beamforming, which is closely related to independent component analysis when information-theoretic cost functions are applied [13], sparsity-based methods that utilize amplitude-delay histograms [6,8,14], or grouping cues typical of human stream segregation [15].
Here, independent component analysis has been chosen due to its inherent robustness to noise and its ability to handle strong reverberation by frequency-by-frequency optimization of the cost function.
In order to separate a number N of simultaneously active speech signals from M recordings, with M ≥ N, the reverberant, noisy mixing process is modelled as

x_j(t) = Σ_{k=1}^{N} Σ_{t'} h_jk(t') s_k(t − t') + d_j(t),    (1)

where the room impulse response h_jk(t) from source k to sensor j is considered time-invariant. Since convolutions are easily separable in the frequency domain, this expression is transformed by a short-time Fourier transform (STFT). Then, (1) becomes

X(Ω, τ) = H(Ω) S(Ω, τ) + D(Ω, τ),    (2)

where H(Ω) is composed of the room transfer functions H_jk(Ω) from all sources k to the sensors j, and D is the sensor noise. Here, Ω and τ denote the integer-valued frequency bin index and frame index, respectively.
In order to extract the original sources from the mixtures, ICA finds an unmixing matrix W(Ω) for each frequency bin Ω, which by principle can only be known up to an arbitrary scaling and permutation, described by the diagonal scaling matrix Δ and the permutation matrix P, so that ideally

W(Ω) H(Ω) = Δ P.    (3)

The unmixing matrix W is found by maximizing the statistical independence of the unmixed signals S, and unmixing is carried out separately in each frequency bin according to

Ŝ(Ω, τ) = W(Ω) X(Ω, τ).    (4)

To learn the matrix W, the adaptive algorithm described in [16] is used, which updates W iteratively via

W_{i+1}(Ω) = W_i(Ω) + μ [Λ(Ω) − ⟨Φ(Ŝ) Ŝ^H⟩] W_i(Ω),    (5)

where Λ is a diagonal matrix with Λ_kk = ⟨Φ(Ŝ_k) Ŝ_k^*⟩. Here, ⟨·⟩ denotes the mean value and Φ is a suitable score-function nonlinearity. Ideally, this optimization will result in independent output signals in each frequency bin. To obtain a complete spectrum of unmixed sources, it is additionally necessary to correctly sort the outputs, since their ordering after ICA is arbitrary and may vary from frequency bin to frequency bin. This so-called permutation problem can be solved in a number of ways; see, for example, [17,18]. In all following work, permutations have been corrected by sorting outputs in accordance with the distance criterion described in [19], which compares the unmixed signals at the frequency bin Ω_s, at which the permutation problem has to be solved, with those at a reference bin Ω_r, weighted by a constant p. For this strategy, ordering permutations first at higher frequencies and proceeding downward has proven beneficial; therefore, the ordering at the maximum frequency bin was chosen as reference, and sorting took place binwise in descending order.
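As a rough illustration of the bin-wise unmixing described above, the following Python sketch runs a natural-gradient update of the general form W ← W + μ(Λ − ⟨Φ(Ŝ)Ŝ^H⟩)W on one frequency bin. The unit-modulus nonlinearity Φ(S) = S/|S|, the fixed step size, and the iteration count are illustrative assumptions, not the exact algorithm of [16]:

```python
import numpy as np

def ica_bin(X, n_iter=1000, mu=0.05):
    """Bin-wise natural-gradient ICA sketch: W <- W + mu*(Lam - <Phi(S)S^H>)W.
    X: (M, T) complex observations of one frequency bin."""
    M, T = X.shape
    W = np.eye(M, dtype=complex)
    for _ in range(n_iter):
        S = W @ X
        Phi = S / (np.abs(S) + 1e-9)       # unit-modulus score function (assumption)
        C = Phi @ S.conj().T / T           # sample estimate of <Phi(S) S^H>
        Lam = np.diag(np.diag(C))          # diagonal Lam keeps the output scaling fixed
        W = W + mu * (Lam - C) @ W
    return W

# Toy demo: two super-Gaussian complex sources, instantaneous complex mixing
rng = np.random.default_rng(0)
S = rng.laplace(size=(2, 4000)) + 1j * rng.laplace(size=(2, 4000))
H = np.array([[1.0, 0.5 + 0.2j], [0.4 - 0.3j, 1.0]])
W = ica_bin(H @ S)
G = np.abs(W @ H)   # global system; ideally close to a scaled permutation matrix
```

After convergence, each row of the global matrix G = |W H| should have a single dominant entry, reflecting the scaling and permutation ambiguity of (3).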
Figure 1: Performance of an ideal binary mask, tested on 12 pairs of same- and mixed-gender speakers. Performance is shown for frame lengths (NFFT) of 256, 512, 1024, and 2048 samples in terms of SDR and SIR-improvement. When the SNR-threshold is increased, the SDR-curves decrease monotonically, while a more pronounced monotonic increase can be observed for the SIR-improvement.

Time-Frequency Masking for ICA
However, some residual noise and interference are to be expected even after applying source separation, especially in reverberant environments. For removing these, postmasking of ICA results has often been employed [14,17,20,21], applying a time-frequency mask M_k(Ω, τ) to each ICA output via

Y_k(Ω, τ) = M_k(Ω, τ) S_k(Ω, τ).    (10)

This is motivated by the potential gains in Signal-to-Interference Ratio (SIR), which can already be attained by simple binary masking with an ideal mask. With such an, albeit practically elusive, mask, which is given by evaluating the true knowledge about the signal spectra via

M_k(Ω, τ) = 1 if |S_k(Ω, τ)| ≥ |S_j(Ω, τ)| ∀ j ≠ k, and 0 otherwise,    (11)

it is possible to obtain more than 40 dB SIR improvement on two-speaker mixtures even without ICA, while remaining above 20 dB of Signal-to-Distortion Ratio (SDR) [22]. The results of one such exemplary experiment are shown in Figure 1. For this figure, an additional masking threshold T was introduced, and the mask in (11) was only set to 1 if the source of interest was greater than all other sources by at least T dB, that is, if

20 log_10 |S_k(Ω, τ)| ≥ 20 log_10 |S_j(Ω, τ)| + T  ∀ j ≠ k.    (12)

However, an ideal mask is impossible to obtain realistically; thus, approximations to it are required. For obtaining such an approximation, mask estimation based on ICA results has been proposed and shown to be successful, both for binary and soft masks; see, for example, [17,18,20]. The motivation for this procedure lies both in the noise-robustness of ICA, which can therefore unmix signals even when large interferences make the estimation of a time-frequency mask extremely difficult, and also in the fact that ICA will unmix signals even in those time-frequency regions where two or more of them are simultaneously active to a significant extent. The architecture of such systems is shown in Figure 2 for the exemplary case of two sources and microphones.
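The thresholded ideal mask of (12) can be sketched directly from the true source STFTs. The array layout (sources × bins × frames) is an illustrative assumption:

```python
import numpy as np

def ideal_binary_mask(S, k, T=0.0):
    """Ideal binary mask for source k, given the true source STFTs
    S: (N, n_bins, n_frames). The mask is 1 where source k exceeds every
    competing source by at least T dB, cf. (12)."""
    mag_db = 20.0 * np.log10(np.abs(S) + 1e-12)      # magnitudes in dB
    others = np.delete(mag_db, k, axis=0).max(axis=0)  # strongest competitor
    return (mag_db[k] >= others + T).astype(float)
```

Raising T makes the mask more selective, which is what produces the SIR gain (and SDR loss) visible in Figure 1.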
In the following, four types of masks are considered: (i) amplitude-based masks, (ii) phase-based masks, (iii) two types of interference-based masks, which will be described in the subsequent sections.

Amplitude-Based Masks.
One of the simplest postmasks suitable for postprocessing of ICA results is based on comparing the magnitude of all ICA outputs [20]. Due to the sparsity of sources in an appropriate spectral representation [8], only one should be dominant; therefore, all others are discarded.

In order for the strategy to be independent of the source signal energies, all ICA output signals need to be normalized to equal variance via

S̃_k(Ω, τ) = S_k(Ω, τ) / σ_k,    (13)

where σ_k denotes the standard deviation of ICA output k, before the mask is computed. Then, a hard amplitude mask can be obtained by comparing a local dominance ratio to an acceptance threshold T via

M_k(Ω, τ) = 1 if Ψ ≥ T, and 0 otherwise,    (14)

with Ψ defined by the dominance of the target over the strongest competing output,

Ψ = 20 log_10 |S̃_k(Ω, τ)| − max_{j≠k} 20 log_10 |S̃_j(Ω, τ)|.    (15)

This is a rather simple approach, which has been enhanced in the following by applying a sigmoid nonlinearity to reduce artefacts. This can be easily achieved by redefining the mask via a logistic function of Ψ,

M_k(Ω, τ) = 1 / (1 + exp(−g(Ψ − T))),    (16)

where g is the mask gain controlling its steepness.
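The normalization and soft amplitude mask described above can be sketched as follows. Since the exact definition of the dominance measure Ψ was lost in extraction, the log-magnitude ratio against the strongest competitor used here is an illustrative assumption:

```python
import numpy as np

def amplitude_soft_mask(S, k, T=0.0, g=1.0):
    """Soft amplitude mask for ICA output k. S: (N, n_bins, n_frames).
    Normalizes outputs to unit variance, then applies a sigmoid of the
    local dominance in dB (illustrative form of Psi)."""
    # normalize each output to equal variance so energies are comparable
    Sn = S / (S.reshape(S.shape[0], -1).std(axis=1)[:, None, None] + 1e-12)
    mag = np.abs(Sn)
    others = np.delete(mag, k, axis=0).max(axis=0)
    psi = 20.0 * np.log10((mag[k] + 1e-12) / (others + 1e-12))  # local dominance, dB
    return 1.0 / (1.0 + np.exp(-g * (psi - T)))                 # sigmoid mask
```

With g = 1 and T = 0 (the settings used in the experiments below), strongly dominant bins receive a gain near 1 and strongly masked bins a gain near 0, with a smooth transition in between.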

Phase-Based Masks.
The source separation performance of ICA can also be seen from a beamforming perspective. When the unmixing filters learned by ICA are viewed as frequency-variant beamformers, it can be shown that successful ICA effectively places zeros in the directions of all interfering sources [23]. Therefore, the zero directions of the unmixing filters should be indicative of all source directions. Thus, when the local direction of arrival (DOA) is estimated from the phase of any one given time-frequency bin, this should give an indication of the dominant source in this bin. This is the principle underlying phase-based time-frequency masking strategies. Phase-based postmasking of ICA outputs was introduced in [17]. In this method, the angle θ k (Ω, τ) between the k'th target basis vector of the unmixing matrix and the microphone signal vector is used in order to determine whether and to what degree a given channel should be masked.
According to (2), when noise is not considered, the mixing system can be modeled by

X(Ω, τ) = Σ_{k=1}^{N} h_k(Ω) S_k(Ω, τ).    (17)

Here, h_k(Ω) denotes the k'th column of the mixing matrix, and S_k(Ω, τ) is the value of source k in frequency bin Ω at frame τ.
ICA results in an unmixing matrix W, which is used to obtain M estimated source signals according to (4). This corresponds to

X(Ω, τ) = W^{−1}(Ω) Ŝ(Ω, τ) = Σ_{i=1}^{M} a_i(Ω) Ŝ_i(Ω, τ),    (18)

where the estimated mixing matrix W^{−1} is given in terms of its constituent column vectors, [a_1, a_2, . . . , a_M]. When comparing (18) and (2), and considering (3), it can be seen that the columns of W^{−1} correspond to the columns of H(Ω), the matrix containing the values of the room transfer function for each frequency, up to an arbitrary scaling of column vectors and a reordering of sources, which is constant over frequencies after the permutation correction. Thus, in those time-frequency bins where source k is dominant, the associated basis vector a_i(Ω) should correspond to the column of the mixing matrix H(Ω) associated with source k. In general, the index i may be different from the index k, due to possible permutations. However, as this change of indices will be consistent over frequency, it is disregarded in the following.
Thus, after appropriate normalization, in frames with dominant source k, the associated basis vector a would also be equal to X(Ω, τ) of the current frame. If an anechoic model is appropriate for the mixing process at hand, the basis vectors should form clusters, one for each of the sources. For this purpose, the basis vectors need to be normalized regarding both their phases and amplitudes, as detailed in [17]. For phase normalization, they are first normalized with respect to a reference sensor J and secondly frequency-normalized, which gives

ā_jk(Ω) = |a_jk(Ω)| exp( i arg[a_jk(Ω) a_Jk^*(Ω)] / (4 f(Ω) c^{−1} d_max) )    (19)

as a normalized vector. Here, f(Ω) stands for the center frequency in Hz of frequency bin Ω, c is the velocity of sound, and d_max stands for the distance between the reference sensor J and the farthest of all other microphones j = 1 . . . M. For this vector, the phase varies only between −π/4 and π/4, which is important for computing a distance measure between vectors. Finally, amplitude normalization is carried out by

ā_k(Ω) ← ā_k(Ω) / ‖ā_k(Ω)‖.    (21)

After the normalized basis vectors ā_k(Ω) are thus available, masking is carried out based on the angle θ_k(Ω, τ) between the observed vector X(Ω, τ) and the basis vector ā_k(Ω). This angle is computed in a whitened space, where X(Ω, τ) and ā_k(Ω) are premultiplied by the whitening matrix V, which is the inverse square root of the sensor autocorrelation matrix, V(Ω) = R_xx^{−1/2}. The mask is a soft mask, which is determined from θ_k(Ω, τ) by the logistic function

M_k(Ω, τ) = 1 / (1 + exp(g(θ_k(Ω, τ) − θ_T))).    (22)

The parameter g describes the steepness of the mask and θ_T is the transition point, where the mask takes on the value 1/2. More details on the mask computation can be found in [17].
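The core of the phase-based mask, the logistic function of the angle between the whitened observation and the whitened basis vector, can be sketched per time-frequency bin as follows. The whitening matrix V and the normalized basis vector a_k are assumed to be given:

```python
import numpy as np

def phase_mask(X, a_k, V, g=20.0, theta_T=0.2 * np.pi):
    """Soft phase mask for one bin and frame: logistic in the angle between
    the whitened observation X (M,) and whitened basis vector a_k (M,)."""
    xw, aw = V @ X, V @ a_k
    # angle via the normalized inner product in the whitened space
    c = np.abs(np.vdot(aw, xw)) / (np.linalg.norm(aw) * np.linalg.norm(xw) + 1e-12)
    theta = np.arccos(np.clip(c, 0.0, 1.0))
    return 1.0 / (1.0 + np.exp(g * (theta - theta_T)))   # logistic mask, cf. (22)
```

Observations aligned with the basis vector (small θ) pass nearly unattenuated, while near-orthogonal observations are suppressed; g and θ_T control the steepness and transition point, as in the parameter settings discussed later.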

Interference-Based Masks.
As an alternative criterion for masking, residual interference in the signal may be estimated and the mask may be computed as an MMSE estimator of the clean signal. This can be achieved with a number of approaches, two of which will be presented here in more detail.

Ephraim-Malah Filter-Based Post-Filtering.
The remaining noise components in the separated signals can be minimized based on the Ephraim-Malah filter technique. For this purpose, the following signal model is assumed:

S_k(Ω, τ) = S(Ω, τ) + D(Ω, τ),    (23)

where the clean signal S(Ω, τ) is corrupted by a noise component D(Ω, τ), the remaining sum of the interfering signals and the background noise. The estimated clean signals are obtained by

Ŝ(Ω, τ) = M_SE(Ω, τ) S_k(Ω, τ),    (24)

where M_SE(Ω, τ) is the amplitude estimator gain. For the calculation of the gain M_SE(Ω, τ), different speech enhancement algorithms can be used. In the following, we are using the log spectral amplitude estimator (LSA) as proposed by Ephraim and Malah [24]. For the algorithm, the a posteriori SNR γ_k(Ω, τ) and a priori SNR ξ_k(Ω, τ) are defined by

γ_k(Ω, τ) = |S_k(Ω, τ)|² / λ_D(Ω, τ),
ξ_k(Ω, τ) = α |Ŝ(Ω, τ−1)|² / λ_D(Ω, τ−1) + (1 − α) max(γ_k(Ω, τ) − 1, 0).    (25)

Here, α is a smoothing parameter, S_k(Ω, τ) is the kth ICA output, and λ_D(Ω, τ) is the noise power,

λ_D(Ω, τ) = E[ |D_k(Ω, τ)|² ],    (26)

with the noise estimate |D_k(Ω, τ)| given by the sum of the estimated nontarget signals, |D_k(Ω, τ)| = |Σ_{m≠k} S_m(Ω, τ)|. With these parameters, the log spectral amplitude estimator is given by

M_SE(Ω, τ) = ξ(Ω, τ) / (1 + ξ(Ω, τ)) · exp( (1/2) ∫_{v(Ω,τ)}^{∞} (e^{−t}/t) dt ),  v(Ω, τ) = ξ(Ω, τ) γ(Ω, τ) / (1 + ξ(Ω, τ)),    (28)

with ξ(Ω, τ) denoting the local a priori SNR.
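The LSA gain and the decision-directed SNR estimates can be sketched as below; the exponential integral in the gain is SciPy's `exp1`. Function names and the scalar interface are illustrative:

```python
import numpy as np
from scipy.special import exp1

def lsa_gain(gamma, xi):
    """Ephraim-Malah log-spectral-amplitude gain:
    G = xi/(1+xi) * exp(0.5 * E1(v)), with v = xi*gamma/(1+xi)."""
    v = xi * gamma / (1.0 + xi)
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(np.maximum(v, 1e-10)))

def decision_directed_xi(Y_mag2, lam_d, S_prev_mag2, alpha=0.98):
    """A posteriori SNR and decision-directed a priori SNR for one bin.
    Y_mag2: |S_k|^2 of the current frame; S_prev_mag2: |S_hat|^2 of the
    previous frame; lam_d: noise power estimate."""
    gamma = Y_mag2 / np.maximum(lam_d, 1e-12)
    xi = alpha * S_prev_mag2 / np.maximum(lam_d, 1e-12) \
         + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0)
    return gamma, xi
```

At high a priori SNR the gain approaches ξ/(1+ξ), that is, little attenuation; at low SNR the gain becomes small, suppressing the residual interference.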

Inclusion of Speech Presence Probabilities.
According to [25], the previous approach can be expanded using additional information for the calculation of speech presence probabilities. The gain function of the Ephraim-Malah filter becomes

M(Ω, τ) = { M_SE(Ω, τ) }^{p(Ω,τ)} · G_min^{1 − p(Ω,τ)},    (29)

where G_min is a spectral attenuation floor, M_SE the gain of the speech enhancement method, and p(Ω, τ) the speech presence probability [26,27]. The information needed for the speech presence probability calculation is gained from a binwise noise dominance estimate f_{D,k}, which can be computed in the spectrum domain [18]. A similar measure of speech dominance f_{S,k} is needed in addition. Both measures utilize the difference between the estimated target spectrogram S_k(Ω, τ) and the sum of estimated nontarget signals Σ_{m≠k} S_m(Ω, τ). The Euclidean norm operator ‖·‖ is applied to two-dimensional windowed spectrograms here by taking the sum over their squared entries, and uses a two-dimensional window function W of size R_Ω × R_τ, usually a two-dimensional Hanning window. The speech presence probability is then approximated by a soft mask, a logistic function of the dominance measures. Here, λ_s, λ_n, and g are parameters specifying the two threshold points and the mask gain, respectively.
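The combination of the enhancement gain with a spectral floor, and a logistic speech-presence approximation, can be sketched as follows. Since the exact equations for the dominance statistics and the soft mask were lost in extraction, the parameterization of the logistic via the midpoint of λ_s and λ_n is an assumption:

```python
import numpy as np

def spp_gain(M_se, p, G_min=0.05):
    """Speech-presence-weighted gain, cf. (29): full enhancement gain where
    speech is likely, attenuation floor G_min where it is not."""
    return (M_se ** p) * (G_min ** (1.0 - p))

def soft_mask(f, lam_s, lam_n, g=1.0):
    """Logistic approximation of the speech presence probability from a
    dominance statistic f; lam_s/lam_n set the transition region
    (illustrative parameterization)."""
    mid = 0.5 * (lam_s + lam_n)
    return 1.0 / (1.0 + np.exp(-g * (f - mid)))
```

For p = 1 the full LSA gain is applied; for p = 0 the output is floored at G_min, so residual noise is attenuated but not set to exactly zero, which limits musical-noise artefacts.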

Estimation of Uncertainties
Due to the use of time-frequency masking, part of the information of the original signal might be eliminated along with the interfering sources. To compensate for this lack of information, each masked estimated source is considered as uncertain and described in the form of a posterior distribution of each Fourier coefficient of the clean signal S k (Ω, τ) given the available information.
Estimating the uncertainty in the spectrum domain has clear advantages, when contrasted with uncertainty estimation in the domain of speech recognition, since much intermediate information about the signal and noise process as well as the mask is known in this phase of signal processing, but is generally not available in the further steps of feature extraction. This has motivated a number of studies on spectrum domain uncertainty estimation, most recently for example [7,10]. In contrast to other methods, the suggested strategy possesses two advantages: it does not need a detailed spectrum domain speech prior, which may require a large number of components or may incur the need for adaptation to the speaker and environment; and it gives a computationally very inexpensive approximation that is applicable for both binary and soft masks.
The model used here for this purpose is the complex Gaussian uncertainty model [28],

p( S_k(Ω, τ) | Y_k(Ω, τ) ) = N_C( S_k(Ω, τ); Y_k(Ω, τ), σ²(Ω, τ) ),    (35)

where the mean is set equal to the Fourier coefficient Y_k(Ω, τ) obtained from post-masking and the variance σ² represents the lack of information, or uncertainty. In order to determine σ², two alternative procedures were used.

Ideal Uncertainties.
Ideal uncertainties describe the squared difference between the true and the estimated signal magnitude. They are computed by

σ²(Ω, τ) = ( |S_k(Ω, τ)| − |Y_k(Ω, τ)| )²,

where S_k is the reference signal. However, these ideal uncertainties are available only in experiments where a reference signal has been recorded. Thus, the ideal results may only serve as a perspective of what the suggested method would be capable of if a very high quality error estimate were already available.

Masking Error Estimate.
In practice, it is necessary to approximate the ideal uncertainty estimate using values that are actually available. Since much of the estimation error is due to the time-frequency mask, in further experiments such a masking error was used as the single basis of the uncertainty measure. This uncertainty due to masking can be computed from the signal magnitude removed by the mask,

σ²(Ω, τ) = ( α ( |S_k(Ω, τ)| − |Y_k(Ω, τ)| ) )²,

where S_k(Ω, τ) is the unmasked ICA output. If α = 1, this error estimate would assume that the time-frequency mask leads to missing signal information with 100% certainty. The value should be lower to reflect the fact that some of the masked time-frequency bins contain no target speech information at all. To obtain the most suitable value for α, the squared deviation between this estimate and the ideal uncertainty was minimized over α. In order to avoid adapting parameters to each of the test signals and masks, this minimization was carried out only once and only for a mixture not used in testing. After averaging over all mask types, the same value of α was used in all experiments and for all datasets. This optimal value was α = 0.71.
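The masking-based uncertainty can be sketched in a few lines. Since the exact equation was lost in extraction, the form below (the scaled squared magnitude difference between the unmasked ICA output and the masked signal) is a reconstruction consistent with the surrounding description:

```python
import numpy as np

def masking_uncertainty(S_ica, Y, alpha=0.71):
    """Variance attributed to the time-frequency mask: scaled squared
    magnitude removed by masking. S_ica: unmasked ICA output, Y: masked
    signal (both complex STFT coefficients). Illustrative reconstruction."""
    return (alpha * (np.abs(S_ica) - np.abs(Y))) ** 2
```

Unmasked bins (Y = S_ica) are treated as certain, while fully masked bins receive an uncertainty of α²|S_ica|², with α = 0.71 as fitted above.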

Propagation of Uncertainties
When uncertain features are available in the STFT domain, they could in principle be used for spectrum domain speech recognition. However, as shown in [29], due to the less robust spectrum domain models, this does not provide for optimum results. Instead, a more successful approach is to transform the uncertain description of speech from the spectrum domain to the domain of speech recognition. This can in principle be achieved by two approaches, data-driven as in [7] or model-driven as in [5]. In the following, we only consider the model-driven approach, which can achieve very low propagation errors with small memory requirements and without the need for a training phase [10]. However, a detailed comparison of both principal methods remains an interesting target for future work.
In order to carry out the propagation through the feature extraction process, the uncertain spectrum domain description is considered as specifying speech as a random variable according to (35). If such an uncertain description of the STFT is used, the corresponding posterior distribution p(S k | Y k ) has to be propagated into the feature domain. For this purpose, the effect of all transformations in the feature extraction process on this probability distribution needs to be considered, which will result in an estimated feature domain random variable, describing both the mean of the speech features as well as the associated degree of uncertainty. Since this computation takes place for each feature and in each bin, subsequent recognition will have a maximally precise description of all uncertainties, allowing the algorithm to focus most on those features that are most reliable, and, if desired, to replace the uncertain ones by better estimates under simultaneous consideration of the recognizer speech model.
In conventional automatic speech recognition, only the STFT of each estimated source Y_k must be transformed into the feature domain of automatic speech recognition. Feature extractions involve multiple transformations, some of them nonlinear, which are performed jointly on multiple features of the same frame or by combining features from different time frames. Propagating an uncertain description of the STFT of each estimated source is therefore a complicated task that can be simplified by propagating only first- and second-order information. This section shows how this propagation can be attained by a piecewise approach, in which the feature extraction is divided into different steps and the optimal method is chosen to perform uncertainty propagation in each step. Uncertainty propagation is used with two of the more robust speech recognition features, namely the Mel-cepstrum coefficients (MFCCs) [30] and the cepstral coefficients obtained from the RelAtive SpecTrAl Perceptual Linear Prediction (RASTA-PLP) feature extraction [31], here denoted as RASTA-LPCCs.

Mel-Cepstral Feature Extraction.
The conventional Mel-cepstral feature extraction consists of the following steps.
(1) Extract the short-time spectral amplitude (STSA) features from the STFT.
(2) Compute each filter output of a Mel-filterbank as a weighted sum of the STSA features of each frame.
(3) Apply the logarithm to each filter output.
(4) Compute the discrete cosine transform (DCT) from each frame of log-filterbank features.
In order to propagate random variables rather than deterministic signals, these steps were modified as follows.
Step (1) can be solved if we take into account that, if a Fourier coefficient S_k(Ω, τ) is complex Gaussian distributed as given by (35), its amplitude |S_k(Ω, τ)| is Rice distributed. From the first raw moment of the Rice distribution, it is possible to compute the mean of the uncertain STSA features as [28]

μ_k^STSA(Ω, τ) = sqrt( π σ²(Ω, τ) / 4 ) · e^{−x/2} [ (1 + x) I_0(x/2) + x I_1(x/2) ],  with x = |Y_k(Ω, τ)|² / σ²(Ω, τ),

where I_0 and I_1 correspond to the modified Bessel functions of order zero and one, respectively. The variance of the uncertain STSA features can be computed from the first and second raw moments as

Σ_k^STSA(Ω, τ) = |Y_k(Ω, τ)|² + σ²(Ω, τ) − ( μ_k^STSA(Ω, τ) )².

Step (2) in the Mel-cepstral feature extraction corresponds to the Mel-filterbank, which is a linear transformation and bears no additional difficulty for the propagation of mean and covariance. In general, given a random vector variable x and a linear transformation defined by the matrix T, the transformed mean and covariance correspond to

μ_Tx = T μ_x,  Σ_Tx = T Σ_x T^T.    (41)

Step (3) corresponds to the computation of the logarithm. Since the distribution of the Mel-STSA uncertain features has a relatively low skewness and the dimensionality of the features has been reduced by approximately one order of magnitude through the application of the Mel-filterbank, the use of the pseudo-Monte Carlo method termed the unscented transform [32] provides an acceptable trade-off between accuracy and computational cost. Details regarding the use of the unscented transform for uncertainty propagation can be found in [28].
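The Rice-moment propagation of Step (1) can be sketched numerically; the exponentially scaled Bessel functions `i0e`/`i1e` from SciPy are used here for numerical stability at high local SNR (an implementation choice, not part of the original derivation):

```python
import numpy as np
from scipy.special import i0e, i1e

def stsa_moments(Y, sigma2):
    """Mean and variance of |S| when S ~ CN(Y, sigma2), i.e. a Rician
    magnitude. Uses scaled Bessels: e^{-t} I_n(t) = i_ne(t)."""
    nu2 = np.abs(Y) ** 2
    x = nu2 / np.maximum(sigma2, 1e-12)
    # E|S| = sqrt(pi*sigma2/4) * e^{-x/2} [(1+x) I0(x/2) + x I1(x/2)]
    mean = np.sqrt(np.pi * sigma2 / 4.0) * ((1.0 + x) * i0e(x / 2.0)
                                            + x * i1e(x / 2.0))
    var = nu2 + sigma2 - mean ** 2           # from first and second raw moments
    return mean, var
```

For vanishing uncertainty the mean approaches |Y| and the variance approaches zero; for Y = 0 the magnitude is Rayleigh distributed with mean sqrt(π σ²/4).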
Step (4), the DCT transform, completes the computation of the MFCC coefficients. Since this is a linear transformation like the Mel-filterbank, it can be computed according to (41).

Relative Spectral Perceptual Linear Prediction Feature Extraction.
The computation of the RASTA-Linear Prediction Cepstral Coefficients (RASTA-LPCCs) corresponds to the following steps.
(1) Extract the power spectral density (PSD) from the STFT.
(2) Compute each filter output of a Bark-filterbank as a weighted sum of the PSD features of each frame.
(3) Apply the logarithm to each filter output.
(4) Filter the resulting frames with the RASTA IIR filter.
(5) Add the equal loudness curve and multiply by 0.33 to simulate the power law of hearing.
(6) Apply the exponential to invert the effect of the logarithm.
(7) Compute an all-pole model of each frame to obtain the linear prediction coefficients (LPCs).

(8) Compute cepstral coefficients from each LPC frame.
This feature extraction also requires a set of modifications and approximations in order to be applicable for uncertain features. An overview of these is shown in Figure 3 and the necessary computational steps are given in detail below.
Step (1) can be solved similarly to the case of the STSA. The propagated mean and covariance can be computed from the second and fourth raw moments of the Rice distribution as [33]

μ_k^PSD(Ω, τ) = |Y_k(Ω, τ)|² + σ²(Ω, τ),
Σ_k^PSD(Ω, τ) = σ⁴(Ω, τ) + 2 σ²(Ω, τ) |Y_k(Ω, τ)|².

Figure 3: Overview of the modified RASTA-PLP feature extraction (Bark-filterbank, logarithm, RASTA filter, pre-emphasis, power law, exponential, all-pole model, cepstral coefficients).

Step (2), which corresponds to the Bark-filterbank, can be resolved identically to the case of the Mel-filterbank of the MFCCs by using (41).
Step (3) of the RASTA-PLP transformation consists of the computation of the logarithm, as in the case of the Mel-cepstral feature extraction. However, the distribution of the Bark-PSD uncertain features presents a much higher skewness compared to the case of the Mel-STSA features. Consequently, the propagation through this step is more accurately computed using the assumption of log-normality of the Bark-PSD features, also used in other propagation approaches like [3,5,34]. The covariance under this assumption can be approximated by [34, equation 5.47], yielding

Σ_k^LOG(i, j, τ) = log( Σ_k^BARK(i, j, τ) / ( μ_k^BARK(i, τ) μ_k^BARK(j, τ) ) + 1 ),

where i, j are the filterbank indices and μ_k^BARK(i, τ) and Σ_k^BARK(i, j, τ) correspond to the mean and covariance after the Bark-filterbank transformation. The mean can be approximated by [34, equation 5.46]

μ_k^LOG(i, τ) = log( μ_k^BARK(i, τ) ) − (1/2) Σ_k^LOG(i, i, τ).

Step (4) corresponds to the RASTA filter. The RASTA filter is an IIR filter that imitates the preference of humans for sounds with a certain rate of change. It realizes the transfer function

H(z) = 0.1 z⁴ · (2 + z^{−1} − z^{−3} − 2 z^{−4}) / (1 − 0.98 z^{−1}).

This can also be expressed by the following difference equation

y(τ) = b_0 x(τ) + b_1 x(τ−1) + b_2 x(τ−2) + b_3 x(τ−3) + b_4 x(τ−4) − a_1 y(τ−1),    (46)

where y(τ) is a column vector containing the τth frame of RASTA-filtered features, and x(τ) · · · x(τ−4) and y(τ−1) correspond to previous logarithm-domain input and RASTA-domain output frames, respectively. The scalars b_0 · · · b_4 and a_1 are the normalized feedforward and feedback coefficients.
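The frame-recursive RASTA filtering can be sketched as below. The coefficients b = 0.1·[2, 1, 0, −1, −2] and the pole at 0.98 follow the classical RASTA transfer function (the pure z⁴ advance is ignored, as it only shifts time); the paper's exact normalized coefficients are not given in the extracted text:

```python
import numpy as np

def rasta_filter(x, b=(0.2, 0.1, 0.0, -0.1, -0.2), a1=-0.98):
    """Per-feature RASTA filtering via the difference equation
    y(t) = sum_i b_i x(t-i) - a1 y(t-1). x: 1-D trajectory of one
    log-domain feature over frames."""
    x = np.asarray(x, float)
    y = np.zeros_like(x)
    for t in range(x.shape[0]):
        acc = sum(b[i] * x[t - i] for i in range(5) if t - i >= 0)
        if t > 0:
            acc -= a1 * y[t - 1]   # feedback term (a1 is negative)
        y[t] = acc
    return y
```

Because the feedforward coefficients sum to zero, a constant trajectory is eventually suppressed: the filter is a bandpass on the modulation spectrum, removing slowly varying (convolutive) components.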
Computing the propagation of the mean μ_k^RASTA(τ) through this transformation is identical to the case of the Mel or Bark filterbanks. The computation of the covariance is, however, more complex due to the created time correlation between inputs and outputs. The correlation matrix for the τth filter output y(τ) can be computed from (46) as

E[ y(τ) y(τ)^T ] = Σ_{i=0}^{4} Σ_{j=0}^{4} b_i b_j E[ x(τ−i) x(τ−j)^T ] + a_1² E[ y(τ−1) y(τ−1)^T ] − a_1 Σ_{i=0}^{4} b_i ( E[ x(τ−i) y(τ−1)^T ] + E[ y(τ−1) x(τ−i)^T ] ),

where the last summand accounts for the input-output correlation. The corresponding covariance of the RASTA features can be obtained as

Σ_k^RASTA(τ) = E[ y(τ) y(τ)^T ] − μ_k^RASTA(τ) ( μ_k^RASTA(τ) )^T.

Steps (5a) and (5b), the equal-loudness pre-emphasis and the power-law scaling, correspond to conventional linear transformations in the logarithm domain, and therefore the propagation through them can be solved by applying (41) to obtain the means μ_k^POW and covariances Σ_k^POW. Furthermore, since the assumption of log-normality in the Bark-PSD domain implies that the log-domain features are normally distributed, the RASTA, pre-emphasis, and power-law transformations do not alter this condition.
Step (6) corresponds to the transformation through the exponential. Since this transformation is the inverse of the logarithm, the corresponding features are log-normally distributed, with mean and covariance computable from [34, equations 5.44, 5.45]:

μ_k^EXP(i, τ) = exp( μ_k^POW(i, τ) + (1/2) Σ_k^POW(i, i, τ) ),
Σ_k^EXP(i, j, τ) = μ_k^EXP(i, τ) μ_k^EXP(j, τ) ( exp( Σ_k^POW(i, j, τ) ) − 1 ).

The final steps of the RASTA-LPCC feature extraction, Steps (7) and (8), correspond to the computation of the all-pole model to obtain the LPC coefficients, described in the conventional PLP technique [35], and the computation of the cepstral coefficients from the LPCs using the standard recursion [30, equation 3]. Due to the complex nature of these transformations and the low skewness of the uncertain features after the exponential transformation, the propagation is computed using the unscented transform, similarly to the case of the logarithm transformation for the Mel-cepstral features.
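The log-normal moment propagation through the exponential in Step (6) can be sketched as follows, applying the standard log-normal mean and covariance formulas elementwise to a Gaussian vector:

```python
import numpy as np

def exp_moments(mu, Sigma):
    """Mean and covariance after elementwise exp of a Gaussian vector
    (log-normal moments, cf. [34], eqs. 5.44-5.45).
    mu: (D,), Sigma: (D, D)."""
    d = np.diag(Sigma)
    m = np.exp(mu + 0.5 * d)                  # elementwise log-normal mean
    C = np.outer(m, m) * (np.exp(Sigma) - 1.0)  # log-normal covariance
    return m, C
```

This is the exact inverse of the log-normal approximation used after the Bark-filterbank, so the two steps are consistent with each other.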

Recognition of Uncertain Features
When features for speech recognition are given not as point estimates, but rather in the form of a posterior distribution p(o k |Y k ) with estimated mean μ CEPS k and covariance Σ CEPS k , the speech decoder must be modified in order to take this additional information into account. A number of approaches exist, both for binary and for continuous-valued uncertainties, for example, [2,36,37].
Here, two missing feature approaches were applied, which are capable of considering real-valued uncertainties. These methods, modified imputation [5] and HMM variance compensation [2], have been implemented for the Hidden Markov Model Toolkit (HTK) [38] and were used in the tests.
Both methods are appropriate for HMM-based systems, where recognition takes place by finding the optimum HMM state sequence [q 1 , . . . , q E ], which gives the best match to the feature vector sequence [o(1), . . . , o(E)] when each HMM state has an associated output probability distribution p(o | q).

HMM Variance Compensation.
In HMM variance compensation, the computation of state output probabilities is modified to incorporate frame-by-frame and feature-by-feature uncertainties [2]. This is formulated as an averaging of the output probability distribution p(o_k(τ) | q) over all possible unseen cepstra, defined by the posterior p(o_k(τ) | Y_k), which leads to

p(o_k(τ) | q) = N( μ_k^CEPS(τ); μ_q, Σ_q + Σ_k^CEPS(τ) ).    (51)

Here, q denotes the HMM state, with mean μ_q and covariance Σ_q. For Gaussian mixture models, the same procedure can be applied to each mixture component. This yields

p(o_k(τ) | q) = Σ_{m=1}^{M} w_m N( μ_k^CEPS(τ); μ_{q,m}, Σ_{q,m} + Σ_k^CEPS(τ) )    (52)

for an M-component mixture model with weights w_m.
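For diagonal covariances, as commonly used in HMM-based recognizers, the variance-compensated mixture likelihood can be sketched as follows; the array layout is an illustrative assumption:

```python
import numpy as np

def compensated_loglik(mu_o, Sigma_o_diag, means, variances, weights):
    """Mixture log-likelihood with the feature uncertainty added to the
    model variances: sum_m w_m N(mu_o; mu_m, Sigma_m + Sigma_o).
    mu_o, Sigma_o_diag: (D,); means, variances: (M, D); weights: (M,)."""
    var = variances + Sigma_o_diag                     # broadcast over components
    ll = -0.5 * (np.log(2 * np.pi * var)
                 + (mu_o - means) ** 2 / var).sum(axis=1)
    return np.log(np.sum(weights * np.exp(ll)))
```

With zero uncertainty this reduces to the ordinary Gaussian mixture likelihood; with large uncertainty the densities flatten, so unreliable features contribute less to the state discrimination.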

Modified Imputation.
In modified imputation, the idea is to replace the imputation equation, originally proposed for completely missing features in [36], with an alternative formulation, which also allows for real-valued degrees of uncertainty. Thus, whereas missing parts of feature vectors are replaced by the corresponding components of the HMM model mean μ_q in classical imputation, modified imputation finds the maximum a posteriori estimate

ô_{k,q}(τ) = argmax_{o_k(τ)} p( o_k(τ) | Y_k, q ).    (53)

Assuming a flat prior for o_k(τ), as shown in [5], (53) leads to

ô_{k,q}(τ) = ( Σ_q^{−1} + ( Σ_k^CEPS(τ) )^{−1} )^{−1} ( Σ_q^{−1} μ_q + ( Σ_k^CEPS(τ) )^{−1} μ_k^CEPS(τ) ).    (55)

Thus, the modified imputation estimate ô_{k,q}(τ) of the feature vector o_k in state q can be obtained. This estimate is used to evaluate the pdf of the HMM state q at time τ, as in conventional recognition or classical imputation. For mixture-of-Gaussian (MOG) models, (55) is evaluated separately for each mixture component m to obtain separate estimates ô_{k,q,m}(τ), and all mixture component probabilities p( ô_{k,q,m}(τ) | μ_{q,m}, Σ_{q,m} ) are finally added to obtain the feature likelihood for state q via

p( o_k(τ) | q ) = Σ_{m=1}^{M} w_m p( ô_{k,q,m}(τ) | μ_{q,m}, Σ_{q,m} ),    (56)

where w_m stands for the mixture weight of component m. This, again, is analogous to the process in conventional recognition or classical imputation.
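For diagonal covariances, the modified imputation estimate reduces to an elementwise precision-weighted average of the observed feature mean and the state model mean, which can be sketched as:

```python
import numpy as np

def modified_imputation(mu_o, var_o, mu_q, var_q):
    """MAP feature estimate combining the uncertain observation
    (mu_o, var_o) and the HMM state prior (mu_q, var_q), elementwise
    for diagonal covariances (scalars or arrays)."""
    return (var_q * mu_o + var_o * mu_q) / (var_q + var_o)
```

A certain observation (var_o = 0) is kept unchanged, while a completely uncertain one (var_o → ∞) is replaced by the state mean μ_q, recovering classical imputation as a limiting case.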

Room Recordings.
For the evaluation of the proposed approaches, recordings were made in a noisy lab room with a reverberation time of T60 ≈ 160 ms. In these recordings, audio files from the TIDigits database [39] were used, and mixtures with two and three speakers were recorded at f_s = 11 kHz. The distance L_i between the loudspeakers and the center of the microphone array was varied between 0.9 and 3 m. The experimental setup is shown schematically in Figure 4. The distance d between two sensors was 3 cm, and a linear array of four microphones was used in all experiments. The recording conditions for all mixtures are summarized in Tables 1 and 2.

Model Training.
The HMM speech recognizer was trained with the HTK toolkit [38]. The trained HMMs were phoneme-level models with 6-component MOG emission probabilities and a conventional left-right structure. The training data was mixed, comprising the 114 speakers of the TIDigits clean-speech database along with the room recordings of speakers sa and rk used for adaptation; speakers used for adaptation were removed from the test set. The feature extraction schemes presented in Section 5 were also complemented with cepstral mean subtraction (CMS) for further reduction of convolutive effects. Since CMS is a linear operation, it poses no additional difficulty for uncertainty propagation.
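To illustrate why the linearity of CMS matters for uncertainty propagation, the following sketch applies per-utterance CMS and propagates diagonal uncertainties under the simplifying assumption of independent frame errors (the exact propagation used in the paper is not reproduced here):

```python
import numpy as np

def cms_with_uncertainty(cepstra, variances):
    """Cepstral mean subtraction on a (T, D) cepstral sequence.
    Since CMS is linear, y_t = x_t - (1/T) * sum_s x_s, the propagated
    variance under a frame-independence assumption is
    var(y_t) = (1 - 2/T) var(x_t) + (1/T^2) sum_s var(x_s)."""
    T = cepstra.shape[0]
    mean = cepstra.mean(axis=0, keepdims=True)
    out = cepstra - mean
    out_var = (1 - 2.0 / T) * variances + variances.sum(axis=0, keepdims=True) / T**2
    return out, out_var
```

For long utterances the correction terms vanish, so in practice the uncertainties pass through CMS nearly unchanged.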

Parameter Settings of Time-Frequency Masks.
Parameters of all masks were set manually for good performance on all datasets, and were kept consistent throughout all experiments.

Amplitude-Based Masking.
For amplitude-based masking, a soft mask according to (14) and (16) was used. Thus, there are two parameters: the mask threshold T and the gain g, which were set to T = 0 and g = 1, respectively.
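The exact expressions (14) and (16) are not reproduced here; an illustrative soft mask with the same two parameters, a threshold T on the log amplitude ratio of the ICA outputs and a sigmoid gain g, could look as follows:

```python
import numpy as np

def amplitude_soft_mask(target_spec, interferer_spec, T=0.0, g=1.0):
    """Sigmoid soft mask on the log amplitude ratio of the target ICA
    output to the strongest interfering output; T shifts the decision
    threshold and g controls the mask steepness.  This is an
    illustrative functional form, not the paper's exact equation."""
    ratio = (np.log(np.abs(target_spec) + 1e-12)
             - np.log(np.abs(interferer_spec) + 1e-12))
    return 1.0 / (1.0 + np.exp(-g * (ratio - T)))
```

With T = 0 and g = 1, bins where the target dominates receive a gain above 0.5 and interferer-dominated bins are smoothly attenuated.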
Phase-Based Masking.
In phase-based masking according to (22), there are likewise two free parameters: a mask gain g and a mask threshold, the angle threshold θ_T. However, optimum performance was reached at different parameter values depending on the recognizer parameterization. For optimal performance on MFCC features, they were set to g = 20 and θ_T = 0.2π, which will be referred to as Phase1 in the results. In contrast, for RASTA-PLP-based recognition, better results were generally achieved with g = 15 and θ_T = 0.2π (Phase2), that is, the same threshold but a less steep mask gain.
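To illustrate the roles of these two parameters in (22), a soft mask that attenuates time-frequency bins whose interchannel phase error exceeds the angle threshold θ_T, with falloff steepness set by g, might be sketched as (an assumed functional form, not the paper's exact expression):

```python
import numpy as np

def phase_mask(phase_error, theta_T=0.2 * np.pi, g=20.0):
    """Soft mask passing bins whose phase error (deviation of the
    observed interchannel phase difference from the one expected for
    the target direction) lies below theta_T; g sets how sharply the
    mask falls off around the threshold."""
    return 1.0 / (1.0 + np.exp(g * (np.abs(phase_error) - theta_T)))
```

Bins aligned with the target direction (zero phase error) are passed almost unchanged, the mask equals 0.5 exactly at the threshold, and strongly deviating bins are suppressed.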

Interference-Based Masking.
For the first interference-based mask, defined in Section 3.3.1, the two smoothing parameters defining the algorithm are set to α = 0.1 and α_D = 0.9. This algorithm will be denoted by IB in the following.
The second interference-based algorithm additionally includes the speech probability estimate defined in Section 3.3.2. Thus, in addition to the parameters α = 0.9 and α_D = 0.9, there are further parameters in the weighting function (34): λ_s, λ_n, and g, which specify the two threshold points and the mask gain. They are set so that λ_s corresponds to the mean absolute value of the estimated signal Fourier coefficients and λ_n to the mean absolute value of the noise-estimate Fourier coefficients; the mask gain is set to g = 10. For the windowing in (33), a Hanning window of size 3 × 3 is used. For this algorithm, the abbreviation IBPE will be used.
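The recursive smoothing and the two-threshold weighting can be illustrated as follows; these are sketches with assumed functional forms, since (33) and (34) are not reproduced here:

```python
import numpy as np

def smooth(frames, alpha):
    """First-order recursive (exponential) smoothing along time, of
    the kind controlled by the parameters alpha and alpha_D."""
    out = np.empty_like(frames)
    acc = frames[0]
    for t, f in enumerate(frames):
        acc = alpha * acc + (1 - alpha) * f
        out[t] = acc
    return out

def weighting(x, lam_n, lam_s, g=10.0):
    """Two-threshold weighting: magnitudes below lam_n are attenuated
    and magnitudes above lam_s are passed, with a sigmoid transition
    whose steepness is set by the mask gain g."""
    mid = 0.5 * (lam_n + lam_s)
    scale = max(lam_s - lam_n, 1e-12)
    return 1.0 / (1.0 + np.exp(-g * (x - mid) / scale))
```

The two thresholds thus define a transition region, centered between λ_n and λ_s, in which the mask moves smoothly from suppression to passthrough.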

Recognition Performance Measurement.
To evaluate recognition performance, the percent accuracy
$$\mathrm{PA} = \frac{N - D - S - I}{N} \cdot 100\%$$
is used, where $N$ is the number of reference labels and $D$, $S$, and $I$ are the numbers of deletion, substitution, and insertion errors, respectively. The value of PA, output by the HTK scoring tool, corresponds to 100 − WER, where WER is the word error rate that is also commonly used in the evaluation of speech recognition performance.
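The HTK-style accuracy measure can be stated compactly in code:

```python
def percent_accuracy(N, D, S, I):
    """HTK-style percent accuracy: PA = (N - D - S - I) / N * 100,
    with N reference labels and D, S, I deletion, substitution, and
    insertion errors; the word error rate is WER = 100 - PA."""
    return 100.0 * (N - D - S - I) / N
```

Note that insertions can drive PA negative, which is why PA (rather than percent correct) is the stricter and more common figure of merit.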

Multispeaker Recognition Results.
First, results are given for the estimated uncertainty values, for RASTA-PLP features in Table 3 and for MFCC features in Table 4. Especially for RASTA-PLP features, results are improved notably by masking and missing-feature recognition with modified imputation, averaging an absolute improvement of more than 10% over all tested masks and experiments. For MFCCs, significant improvements can also be achieved by the suggested strategy. This is true especially for the two strategies of phase masking and interference-based filtering with speech probability estimation; in both cases, an absolute improvement of about 6% is achieved. It is also clearly visible that here, uncertainty decoding performs better on average. When true rather than estimated uncertainties are used, results are again improved greatly, both for RASTA-PLP and for MFCC features, as shown in Tables 5 and 6. Compared to the use of ICA alone, a relative error rate reduction of 59% for uncertainty decoding and of 69% for modified imputation is achieved in the case of RASTA features.
Similar performance gains can be observed for MFCC features, where word error rates are reduced by 64% and 62% for uncertainty decoding and modified imputation, respectively. Comparing the uncertain recognition strategies, modified imputation is again on average the better performer for RASTA-PLPs, whereas uncertainty decoding yields larger gains for MFCCs. Concerning the masking strategies, the IB mask, which has fairly aggressive parameter settings and an extremely low recognition rate without missing-feature approaches, is clearly the best in this case of ideal uncertainties.

Conclusion
An overview of the use of independent component analysis for speech recognition under multitalker conditions has been given. As the presented results show, the conventional strategy of purely linear source separation can be improved by post-masking in the time-frequency domain, provided that it is accompanied by missing-feature speech recognition. Especially for three-speaker scenarios, this improves the recognition rate notably. Interestingly, the optimal decoding strategy apparently depends on the features used for recognition. Whereas modified imputation was clearly superior for RASTA features, better results for MFCC features were almost consistently achieved by uncertainty decoding, even though uncertainties were estimated in the spectrum domain for both feature types and propagated to the recognition domain of interest. Further work will be necessary to determine how these results relate to the degree of model mismatch in the two domains, with the aim of determining an optimal decoding strategy for specific application scenarios.

A vital aspect of missing-feature recognition is still the estimation of the feature uncertainty. An ideal uncertainty estimate results in superior recognition performance for all considered test cases and all applied post-masks. Since such an ideal uncertainty is not available in practice, its value needs to be estimated from the available data. In the presented cases, this measure has been derived from the ICA output signal and the applied nonlinear gain function. The resulting uncertainty estimate has a correlation coefficient of 0.45 with the true uncertainties, leading to superior and consistent performance among all tested uncertainty estimates.
However, uncertainty estimation for the ICA output signals should be improved further, in order to approximate more closely the ideally achievable performance of this strategy. For this purpose, it will be interesting to compare the proposed uncertainty estimation to other approaches. Specifically, the uncertainty estimation described in [7] is of interest for use with any type of recognition feature and preprocessing method, but it requires learning a regression tree for the given specific feature set and environment. In contrast, feature-specific methods, described for example in [2, 3], are applicable only to the feature domain for which they have been derived, but can be used without additional training stages.
Since none of the above methods is designed specifically for use with ICA, another direction of research is a better use of the statistical information gathered during source separation. Further research can thus focus on an optimal use of this intermediate data, and on its combination with more detailed prior models in the spectrum domain, such as those in [29], to arrive at more accurate uncertainty estimates that utilize all available data from multiple microphones.