Recognition of Noisy Speech: A Comparative Survey of Robust Model Architecture and Feature Enhancement

The performance of speech recognition systems degrades strongly in the presence of background noise, such as the driving noise inside a car. In contrast to existing works, we aim to improve noise robustness by addressing all major levels of speech recognition: feature extraction, feature enhancement, speech modelling, and training. We give an overview of promising auditory modelling concepts, speech enhancement techniques, training strategies, and model architectures, which are implemented in an in-car digit and spelling recognition task considering noises produced by various car types and driving conditions. We show that joint speech and noise modelling with a Switching Linear Dynamic Model (SLDM) outperforms speech enhancement techniques like Histogram Equalisation (HEQ) with a mean relative error reduction of 52.7% over various noise types and levels. Embedding a Switching Linear Dynamical System (SLDS) into a Switching Autoregressive Hidden Markov Model (SAR-HMM) prevails for speech disturbed by additive white Gaussian noise.


Introduction
The automatic recognition of speech, enabling a natural and easy-to-use method of communication between human and machine, is an active area of research, as it still suffers from limitations such as restricted applicability whenever human speech is superposed with background noise [1][2][3]. Since the interior of a car is a popular field of application for speech recognisers, allowing hands-free operation of the centre console or text messaging, the car noises produced during driving are of great interest when designing a noise-robust speech recognition system [4,5].
To enhance recognition performance in noisy surroundings, different stages of the recognition process have to be optimised. As a first step, filtering or spectral subtraction can be applied to improve the signal before speech features are extracted. Well-known examples of such approaches are applied in the advanced front-end feature extraction (AFE) or Unsupervised Spectral Subtraction (USS). Then, suitable patterns for auditory modelling have to be extracted from the speech signal to allow a reliable distinction between the phonemes or word classes in the vocabulary of the recogniser. Apart from widely used features like Mel-frequency cepstral coefficients (MFCCs), the extraction of Perceptual Linear Prediction (PLP) coefficients is an effective method of speech representation [6].
The third stage is the enhancement of the obtained features to remove the effects of noise. Normalisation methods like Cepstral Mean Subtraction (CMS) [7], Mean and Variance Normalisation (MVN) [8], or Histogram Equalisation (HEQ) [9] are techniques to reduce distortions of the frequency domain representation of speech. Alternatively, model-based feature enhancement approaches can be applied to compensate for the effects of background noise. The joint speech and noise modelling concept in [10] uses a Switching Linear Dynamic Model (SLDM) to capture the dynamic behaviour of speech and a Linear Dynamic Model (LDM) to describe additive noise, and aims to estimate the clean speech features from the noisy signal.
The derivation of speech models can be considered as the next stage in the design of a speech recogniser. Hidden Markov Models (HMMs) [11] are commonly used for speech modelling, although numerous alternatives, like Hidden Conditional Random Fields (HCRFs) [12], Switching Autoregressive Hidden Markov Models (SAR-HMMs) [13], or other more general Dynamic Bayesian Network structures, have been developed in recent years. Extending the SAR-HMM to an Autoregressive Switching Linear Dynamical System (AR-SLDS), as in [14], includes an explicit noise model and leads to increased noise robustness compared to the SAR-HMM.
Speech models can be adapted to noisy conditions when the training of the recogniser is conducted using noisy training material. Since the noise conditions during the test phase of the recogniser are not known a priori, equal properties of the noises for training and testing hardly occur in reality. However, if the recogniser is designed for a certain field of application, such as an in-car speech recogniser, the approximate noise conditions are known to a certain extent, for example, when using information about the current speed of the car. Therefore, the speech models can be trained using speech sequences corrupted by noise which has properties similar to those of the noise during testing.
In this article, the most promising approaches to increase recognition performance in noisy surroundings are implemented in an isolated digit and spelling recognition task. All denoising techniques applied in the experimental section are introduced in Sections 3 to 5. They represent a selection of methods as simple and efficient as CMS, MVN, and HEQ, but also more complex approaches like AFE, USS, and SLDM feature enhancement, as well as novel noise-robust model architectures such as the HCRF or the AR-SLDS. While it is impossible to take into account and implement all noise compensation techniques that were developed in recent years, the selection of methods in this work covers many of the different concepts that are conceivable for in-car, but also for babble and white noise scenarios, with all their specific advantages and disadvantages. Since we aim to focus on in-car speech recognition, noises produced by four different cars and three different road surfaces and velocities have been recorded and superposed with the speech sequences to simulate the noise conditions during driving. However, the findings may be transferred to many similar stationary noise situations.
Section 2 briefly outlines possible approaches to enhance the noise robustness of speech recognisers. In Section 3, an explanation of the different speech signal preprocessing techniques applied in this article is given, while Section 4 focuses on the feature enhancement strategies we used. Section 5 describes the speech model architectures which are used as alternatives to Hidden Markov Models in some of the experiments of Section 6.

Concepts for Noise Robust Speech Recognition
Aiming to counter the performance degradation of speech recognition systems in noisy surroundings, a variety of different concepts have been developed in recent years. The common goal of all noise compensation strategies is to minimise the mismatch between training and recognition conditions, which occurs whenever the speech signal is distorted by noise. Consequently, two main methods can be distinguished. One is to reduce the mismatch by adapting the acoustic models to noisy conditions in order to enable a proper representation of speech even if the signal is corrupted by noise. This can be achieved either by using noisy training data [15] or by joint speech and noise modelling [14]. The other method is to determine the clean features from the noisy speech sequence while using clean training data [9,16,17]. For that purpose, it is necessary to extract noise-robust features and to find appropriate means of signal or feature preprocessing for speech enhancement. This section summarises selected methods for speech signal preprocessing, auditory modelling, feature enhancement, speech modelling, and model adaptation.

Speech Signal Preprocessing.
Preprocessing techniques for speech enhancement aim to compensate for the effects of noise before the signal, or rather the feature-based speech representation, is classified by the recogniser which has been trained on clean data [18][19][20].
A state-of-the-art speech signal preprocessing method that is used as a baseline feature extraction algorithm for noisy speech recognition problems like the Aurora2 task [21] is the advanced front-end feature extraction introduced in [22]. It uses a two-step Wiener filtering technique before the features are extracted, where filtering is done in the time domain.
As shown in [23,24], methods based on spectral subtraction like Unsupervised Spectral Subtraction [17] reach similar performance while requiring less computational cost than Wiener filtering. Like the two-step Wiener filtering method included in the AFE, Unsupervised Spectral Subtraction can be considered a speech signal preprocessing step; however, USS is carried out in the magnitude spectrogram domain.

Auditory Modelling and Feature Extraction.
The two major effects that noise has on speech representation are a distortion in the feature space and a loss of information caused by its random behaviour. This loss has to be considered as irreversible, whereas the distortion of the features can be compensated depending on the suitability of the speech representation in noisy environments [1,4].
Widely used speech features for auditory modelling are cepstral coefficients obtained through Linear Predictive Coding (LPC). The principle is based on the assumption that the speech signal can be regarded as the output of an all-pole linear filter that simulates the human vocal tract. However, speech recognition systems which process the cepstrum calculated via LPC tend to have low performance in the presence of noise [2]. For enhanced noise robustness, the Perceptual Linear Prediction analysis method is a popular approach to extract spectral patterns [6,25]. The technique is based on a transformation of the speech spectrum to the auditory spectrum that considers multiple perceptual relationships prior to performing linear prediction analysis. Another well-known speech representation is given by Mel-frequency cepstral coefficients, which provide a basis for several speech signal analysis applications [17,26,27,28]. They are calculated from the logarithm of filterbank amplitudes using the Discrete Cosine Transform.
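The last step can be illustrated with a minimal sketch. It assumes a DCT-II with a $\sqrt{2/N}$ scaling, which is one common convention and not necessarily the one used in the experiments; the function name and the eight filterbank amplitudes are hypothetical.

```python
import math

def cepstral_coefficients(filterbank_amplitudes, num_coeffs):
    """DCT-II of the log filterbank amplitudes, scaled by sqrt(2/N)."""
    n_bands = len(filterbank_amplitudes)
    log_amps = [math.log(a) for a in filterbank_amplitudes]
    scale = math.sqrt(2.0 / n_bands)
    return [
        scale * sum(
            log_amps[j] * math.cos(math.pi * i * (j + 0.5) / n_bands)
            for j in range(n_bands)
        )
        for i in range(num_coeffs)
    ]

# Hypothetical amplitudes of an 8-band Mel filterbank for one frame.
frame = [2.0, 3.5, 5.0, 4.0, 2.5, 1.5, 1.2, 1.0]
mfcc = cepstral_coefficients(frame, num_coeffs=4)

# A constant gain adds the same offset to every log amplitude,
# so it only changes coefficient 0 and leaves the others untouched.
louder = cepstral_coefficients([3.0 * a for a in frame], num_coeffs=4)
```

The comparison of `mfcc` and `louder` shows why a multiplicative gain turns into an additive offset of the zeroth coefficient in the cepstral domain.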
In [29], the TRAP-TANDEM features were introduced. They describe the likelihood of subword classes at a time instant by evaluating temporal trajectories of band-limited spectral densities in the vicinity of the regarded time instant. Here, TRAP refers to the way the linguistic information is obtained from speech, while TANDEM refers to the technique that converts the evidence of subword classes into features for HMM-based speech recognition systems. Unlike conventional feature extraction techniques, which consider time windows of about 25 milliseconds to derive spectral features, TRAP also includes relatively long time spans of up to one second to extract information for the recogniser. The strategy is motivated by the finding that information about a phoneme spreads over about 300 milliseconds [30,31]. Furthermore, this method is able to remove slowly varying noise [32].
Another approach to suppress slow variations in the short-term spectrum is the RASTA-PLP concept [33,34] that makes PLP features more robust to linear spectral distortions. The filtering of time trajectories of critical-band filter outputs enables the removal of constant spectral components caused by convolutive factors in the speech signal.

Feature Enhancement.
Further attempts to reduce the mismatch between test and training conditions are Cepstral Mean Subtraction [7], Mean and Variance Normalisation [8], or the Vector Taylor Series approach [35] which is able to deal with the nonlinear effects of noise. Nonlinear distortions can also be compensated by Histogram Equalisation [9], a technique which is often used in digital image processing [36] to improve the contrast of pictures. In speech processing, HEQ is a powerful means of improving the temporal dynamics of feature vector components distorted by noise. A cepstrum-domain feature compensation algorithm aiming to decompose speech and noise has also been presented in [37].
Another preprocessing approach to enhance noisy MFCC features is proposed in [10]: here a Switching Linear Dynamic Model is used to describe the dynamics of speech while another Linear Dynamic Model captures the dynamics of additive noise. Both models serve to derive an observation model describing how speech and noise produce the noisy observations and to reconstruct the features of clean speech. This concept has been extended in [38] where time dependencies among the discrete state variables of the SLDM are included. To improve the accuracy of the noise model for nonstationary noise sources, [39] employs a state model for the dynamics of noise.
An enhancement of speech features can also be attained by incremental online adaptation of the feature space as in the feature space maximum likelihood linear regression (FMLLR) approach outlined in [40]. There, an FMLLR transform is integrated into a stack decoder by collecting adaptation data during recognition in real time.

2.4. Architecture for Speech Modelling.
The most popular model architecture to represent speech characteristics in automatic speech recognition is the Hidden Markov Model [11]. Apart from optimising the principle of auditory modelling and the methods for speech enhancement, finding alternative model architectures that apply Dynamic Bayesian Network structures which differ from the statistical assumptions of HMM modelling is an active area of research and a promising approach to improve noise robustness [12,14,41].
Generative models like the Hidden Markov Model are restricted in that they assume that the speech feature observations are conditionally independent. This can be considered a drawback, as the restriction ignores long-range dependencies between observations. In contrast, the Conditional Random Fields (CRFs) introduced in [42] use an exponential distribution to model a sequence, given the observation sequence. In order to estimate the conditional probability of a class for an entire sequence, the Hidden Conditional Random Field [12] incorporates hidden state sequences.
Other model architectures like Long Short-Term Memory Recurrent Neural Networks [43] which, in contrast to conventional Recurrent Neural Networks, consider long-range dependencies between the observations were recently shown to be well suited for speech recognition [44]. Even static classifiers like Support Vector Machines have been successfully applied in isolated word recognition tasks [45], where a warping of the observation sequence is less essential than in continuous speech recognition.
An alternative to the feature-based HMM has been proposed in [13] where the raw speech signal is modelled in the time domain. In clean conditions, methods based on raw signal modelling like the Switching Autoregressive HMM [13] work well; however, the performance quickly degrades whenever the technique is used in noisy surroundings. To improve noise robustness, [14] extended the SAR-HMM to a Switching Linear Dynamical System (SLDS) which includes an explicit noise model by modelling the dynamics of both the raw speech signal and the noise.

Model Adaptation.
Not only joint speech and noise modelling but also training with noisy data can incorporate information about potential signal distortion in the recognition process. Experiments such as those in [46] show that recognition results are highly dependent on how much the training material reveals about the characteristics of possible background noise during a test phase. Depending on how similar the noise conditions for training and testing are, we can distinguish between low, medium, and highly matched conditions training. Multicondition training refers to using training material with different noise types. In real-world applications, matching the conditions of the training and testing phases is only possible if information about the noise conditions in which the recogniser will be used is available, for example, during the design of an in-car speech recogniser as shown herein.
Apart from adapting models by using noisy training material, the research area of model adaptation also covers widely used techniques such as maximum a posteriori (MAP) estimation [47], maximum likelihood linear regression (MLLR) [48], and minimum classification error linear regression (MCELR) [49].

Speech Signal Preprocessing
3.1. Advanced Front-End Feature Extraction.
In the advanced front-end feature extraction (AFE) algorithm outlined in [22], noise reduction is performed before the cepstral features are calculated. The main steps of the algorithm can be seen in Figure 1. After noise reduction, the denoised waveforms are processed, and the cepstral features are calculated. Finally, blind equalisation is applied to the features.
The preprocessing algorithm for noise reduction is based on a two-stage Wiener filtering concept. The denoised output signal of the first stage enters a second stage where an additional dynamic noise reduction is performed. In contrast to the first filtering stage, a gain factorisation unit is incorporated in the second stage to control the intensity of filtering depending on the signal-to-noise ratio (SNR) of the signal. The components of the two noise reduction cycles are illustrated in Figure 2. First, the input signal is divided into frames. After estimating the linear spectrum of each frame, the power spectral density (PSD) is smoothed along the time axis in the PSD Mean block. A voice activity detector (VAD) determines whether a frame contains speech or background noise, so that both the estimated spectrum of the speech frames and the estimated noise spectrum can be used to calculate the frequency domain Wiener filter coefficients. To get a Mel-warped frequency domain Wiener filter, the linear Wiener filter coefficients are smoothed along the frequency axis using a Mel-filterbank. The Mel-warped Inverse Discrete Cosine Transform (Mel IDCT) unit calculates the impulse response of the Wiener filter before the input signal is filtered and passes through a second noise reduction cycle. Finally, the constant component of the filtered signal is removed in the "OFF" block.
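How a speech and a noise spectrum estimate turn into frequency domain filter coefficients can be sketched with the textbook Wiener gain $H = \xi / (1 + \xi)$, where $\xi$ is the estimated SNR per frequency bin. This is only an illustration under that assumption: the AFE standard in [22] defines its own estimation, smoothing, and gain factorisation steps, and the function name and the gain floor below are hypothetical.

```python
def wiener_gains(noisy_psd, noise_psd, gain_floor=0.1):
    """Textbook Wiener gain per frequency bin from two PSD estimates."""
    gains = []
    for p_noisy, p_noise in zip(noisy_psd, noise_psd):
        # Estimate the clean speech PSD by subtracting the noise PSD.
        speech_psd = max(p_noisy - p_noise, 0.0)
        snr = speech_psd / p_noise  # assumes a strictly positive noise PSD
        # The floor (an assumption here) avoids zeroing a bin completely.
        gains.append(max(snr / (1.0 + snr), gain_floor))
    return gains

# Three bins: high SNR, no speech energy, and equal speech and noise power.
gains = wiener_gains([10.0, 1.0, 4.0], [1.0, 1.0, 2.0])
```

Bins dominated by speech keep a gain close to one, while bins that contain only noise are attenuated down to the floor.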
Focusing on the Wiener filter approach as part of the advanced front-end feature extraction algorithm, a great advantage with respect to other preprocessing techniques for enhanced noise robustness is that noise reduction is performed on a frame-by-frame basis. The Wiener filter parameters can be adapted to the current SNR, which makes the approach applicable to nonstationary noise. However, a critical issue of the AFE technique is that it relies on exact voice activity detection, a precondition that can be difficult to fulfil, especially if the SNR level is negative as in our in-car speech recognition problem (cf. Section 6). Further, compared with other noise compensation strategies, the AFE is a rather complex mechanism and sensitive to errors and inaccuracies within the individual estimation and transformation steps.

Unsupervised Spectral Subtraction.
Another technique of speech enhancement known as Unsupervised Spectral Subtraction was developed in [17]. This spectral subtraction scheme relies on a two-mixture model of noisy speech and aims to distinguish speech and background noise at the magnitude spectrogram level.

Mixture Model.
To derive a probabilistic model for speech distorted by noise, a probability distribution for both speech and noise is needed. When modelling background noise on silent parts of the time-frequency plane, it is common to assume white Gaussian behaviour for real and imaginary parts [50,51]. In the magnitude domain, this corresponds to a Rayleigh probability density function $f_N(m)$ for noise:
\[ f_N(m) = \frac{m}{\sigma_N^2} \exp\left(-\frac{m^2}{2\sigma_N^2}\right). \]
Apart from the Rayleigh silence model, a speech model for "activity" that models large magnitudes only has to be derived to obtain the two-mixture model. For the speech probability density function $f_S(m)$, a threshold $\delta_S$ is defined with respect to the noise distribution $f_N(m)$, so that only magnitudes $m > \delta_S$ are modelled. In [17], a threshold $\delta_S = \sigma_N$ is used, where $\sigma_N$ is the mode of the Rayleigh PDF. Consequently, we assume that magnitudes below $\sigma_N$ are background noise. Two further constraints are necessary for $f_S(m)$.
(i) The derivative $f_S'(m)$ of the "activity" PDF may not be zero when $m$ is just above $\delta_S$; otherwise, the threshold $\delta_S$ has no meaning since it can be set to an arbitrarily low value.
(ii) As $m$ goes towards infinity, the decay of $f_S(m)$ should be slower than the decay of the Rayleigh PDF to ensure that $f_S(m)$ models large amplitudes.
The "shifted Erlang" PDF with $h = 2$ [52] fulfils these two criteria and, therefore, can be used to model large amplitudes which are assumed to be speech:
\[ f_S(m) = 1_{m>\sigma_N} \cdot \lambda_S^2 \, (m - \sigma_N) \exp\left(-\lambda_S (m - \sigma_N)\right), \]
with $1_{m>\sigma_N} = 1$ if $m > \sigma_N$ and $1_{m>\sigma_N} = 0$ otherwise. The overall probability density function for the spectral magnitudes of the noisy speech signal is given as follows:
\[ f(m) = P_N \, f_N(m) + P_S \, f_S(m). \]
$P_N$ is the prior for "silence" and background noise, respectively, whereas $P_S$ is the prior for "activity" and speech, respectively. All the parameters of the derived PDF $f(m)$, summarised in the parameter set $\Lambda = \{P_N, \sigma_N, P_S, \lambda_S\}$, are independent of time and frequency.

EM Training of Mixture Parameters.
In the "Expectation" step, the posteriors are estimated as follows:
\[ p(\text{silence} \mid m, \Lambda) = \frac{P_N \, f_N(m)}{P_N \, f_N(m) + P_S \, f_S(m)}, \qquad p(\text{activity} \mid m, \Lambda) = 1 - p(\text{silence} \mid m, \Lambda). \]
For the "Maximisation" step, the moment method is applied: all data is used to update $\sigma_N$ before all data with values above the new $\sigma_N$ is used to update $\lambda_S$. The method can be described by two update equations, one for $\sigma_N$ and one for $\lambda_S$.

3.2.3. Spectral Subtraction.
After the training of all mixture parameters $\Lambda = \{P_N, \sigma_N, P_S, \lambda_S\}$, Unsupervised Spectral Subtraction is applied using the parameter $\sigma_N$ as floor value. Flooring to a nonzero value is necessary whenever MFCC features are used, since zero magnitude values after spectral subtraction would lead to unfavourable dynamics in the cepstral coefficients.
Overall, USS is a simple and computationally efficient preprocessing strategy, allowing unsupervised EM fitting on observed data. A weakness of the approach is that it relies on appropriately estimating a speech magnitude PDF, which is a difficult task. Since the PDFs do not depend on frequency and time, the applicability of USS is restricted to stationary noises. Moreover, USS only models large magnitudes of speech, so that low speech magnitudes cannot be distinguished from background noise.
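The unsupervised fitting of a Rayleigh "silence" component and a shifted Erlang "activity" component can be sketched as follows. The update rules in this sketch are an assumption: they are derived from the moment identities $E[m^2] = 2\sigma_N^2$ for the Rayleigh PDF and $E[m - \sigma_N] = 2/\lambda_S$ for the shifted Erlang PDF with $h = 2$, and may differ in detail from the update equations of [17]. The initialisation, the function names, and the synthetic data are hypothetical as well.

```python
import math
import random

def rayleigh_pdf(m, sigma_n):
    return m / sigma_n ** 2 * math.exp(-m * m / (2.0 * sigma_n ** 2))

def shifted_erlang_pdf(m, sigma_n, lam_s):
    if m <= sigma_n:
        return 0.0
    return lam_s ** 2 * (m - sigma_n) * math.exp(-lam_s * (m - sigma_n))

def fit_uss_mixture(mags, iterations=30):
    """Fit (P_N, sigma_N, lambda_S); P_S is 1 - P_N."""
    mean_mag = sum(mags) / len(mags)
    # Crude starting point (an assumption, not taken from the source).
    p_n, sigma_n, lam_s = 0.5, 0.5 * mean_mag, 1.0 / mean_mag
    for _ in range(iterations):
        # Expectation: posterior probability of "silence" per magnitude.
        post_sil = []
        for m in mags:
            noise = p_n * rayleigh_pdf(m, sigma_n)
            speech = (1.0 - p_n) * shifted_erlang_pdf(m, sigma_n, lam_s)
            total = noise + speech
            post_sil.append(noise / total if total > 0.0 else 1.0)
        # Maximisation by moment matching: all data updates sigma_N first,
        # then only data above the new sigma_N updates lambda_S.
        w_sil = sum(post_sil)
        sigma_n = math.sqrt(
            sum(p * m * m for p, m in zip(post_sil, mags)) / (2.0 * w_sil)
        )
        active = [(1.0 - p, m - sigma_n)
                  for p, m in zip(post_sil, mags) if m > sigma_n]
        lam_s = 2.0 * sum(w for w, _ in active) / sum(w * d for w, d in active)
        p_n = w_sil / len(mags)
    return p_n, sigma_n, lam_s

# Synthetic magnitudes: 80% Rayleigh noise (sigma_N = 1),
# 20% shifted Erlang "speech" (lambda_S = 0.2).
rng = random.Random(0)
noise_mags = [math.sqrt(-2.0 * math.log(1.0 - rng.random())) for _ in range(4000)]
speech_mags = [1.0 + rng.expovariate(0.2) + rng.expovariate(0.2) for _ in range(1000)]
p_n, sigma_n, lam_s = fit_uss_mixture(noise_mags + speech_mags)
```

Because a single parameter set is fitted to all time-frequency bins, the sketch also reflects the restriction to stationary noise mentioned above.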

Feature Enhancement
4.1. Feature Normalisation
4.1.1. Cepstral Mean Subtraction.
A simple approach to remove the effects of noise and transmission channel transfer functions on the cepstral representation of speech is Cepstral Mean Subtraction [7,54]. In many surroundings, for example, in a car where the speech signal is superposed by engine noise, the noise source can be considered as stationary, whereas the characteristics of the speech signal change relatively fast. Thus, a goal of preprocessing techniques for speech enhancement is to remove the stationary part of the input signal. As this quasi-non-varying part of the signal corresponds to a constant global shift in the cepstrum, speech can usually be enhanced by subtracting the long-term average cepstral vector
\[ \mu = \frac{1}{T} \sum_{t=1}^{T} x_t \tag{8} \]
from the received distorted cepstrum vector sequence of length $T$:
\[ X = \{x_1, x_2, \ldots, x_T\}. \]
Consequently, we get a new estimate $\hat{x}_t$ of the signal in the cepstral domain:
\[ \hat{x}_t = x_t - \mu, \qquad 1 \le t \le T. \]
This method also exploits the advantage of MFCC speech representation: if the input speech passes through a transmission channel, the speech spectrum is multiplied by the channel transfer function. In the logarithmic cepstral domain, this multiplication becomes an addition which can easily be removed by subtracting the cepstral mean from all input vectors. However, unlike techniques like Histogram Equalisation, CMS is not able to treat nonlinear effects of noise.
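A minimal sketch of Cepstral Mean Subtraction on a short sequence of two-dimensional cepstral vectors is given below; the function name, the vectors, and the channel offset are hypothetical. It illustrates that a constant additive shift in the cepstral domain disappears after the mean has been subtracted.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the long-term average vector from every frame."""
    num_frames = len(frames)
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / num_frames for i in range(dim)]
    return [[f[i] - mean[i] for i in range(dim)] for f in frames]

clean = [[1.0, -2.0], [3.0, 0.5], [-1.0, 1.5], [0.0, 2.0]]
# A time-invariant channel adds the same offset to every cepstral vector.
channel_offset = [0.7, -0.3]
distorted = [[x + h for x, h in zip(f, channel_offset)] for f in clean]

normalised_clean = cepstral_mean_subtraction(clean)
normalised_distorted = cepstral_mean_subtraction(distorted)
```

Both normalised sequences are identical up to rounding, so the recogniser sees the same features with and without the channel.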

Mean and Variance Normalisation.
Subtracting the mean of each feature vector component from the cepstral vectors (as done in CMS) corresponds to an equalisation of the first moment of the vector sequence probability distribution. If noise also affects the variance of the speech features, a preprocessing stage for speech enhancement can also profit from normalising the variance of the vector sequence, which corresponds to an equalisation of the first two moments of its probability distribution. This technique is known as Mean and Variance Normalisation and results in an estimated feature vector
\[ \hat{x}_t = \frac{x_t - \mu}{\sigma}, \]
where $\mu$ is the mean vector of the sequence and the division by the vector $\sigma$, which contains the standard deviations of the feature vector components, is carried out elementwise. After MVN, all features have zero mean and unity variance.
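Mean and Variance Normalisation can be sketched in the same way as CMS. The sketch assumes that the standard deviation is estimated over the whole sequence with a divisor of $T$; the function name and the example vectors are hypothetical.

```python
import math

def mean_variance_normalisation(frames):
    """Normalise every component to zero mean and unit variance."""
    num_frames = len(frames)
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / num_frames for i in range(dim)]
    std = [
        math.sqrt(sum((f[i] - mean[i]) ** 2 for f in frames) / num_frames)
        for i in range(dim)
    ]
    # Elementwise division by the standard deviation of each component.
    return [[(f[i] - mean[i]) / std[i] for i in range(dim)] for f in frames]

features = [[1.0, -2.0], [3.0, 0.5], [-1.0, 1.5], [0.0, 2.0]]
normalised = mean_variance_normalisation(features)
```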

Histogram Equalisation.
Histogram Equalisation is a popular technique for digital image processing where it aims to increase the contrast of pictures. In speech processing, HEQ can be used to extend the principle of CMS and MVN to all moments of the probability distribution of the feature vector components [9,55]. It enhances noise robustness by compensating nonlinear distortions in speech representation caused by noise and therefore reduces the mismatch between test and training data. The main idea is to map the histogram of each component of the feature vector onto a reference histogram. The method is based on the assumption that the effect of noise can be described as a monotonic transformation of the features which can be reversed to a certain degree. As the effectiveness of HEQ is strongly dependent on the accuracy of the speech feature histograms, a sufficiently large number of speech frames have to be involved to estimate the histograms. An important difference between HEQ and other noise reduction techniques like Unsupervised Spectral Subtraction is that no analytic assumptions have to be made about the noise process. This makes HEQ effective for a wide range of different noise processes independent of how the speech signal is parameterised.
When applying HEQ, a transformation
\[ \tilde{x} = F(x) \tag{12} \]
has to be found in order to convert the probability density function $p(x)$ of a certain speech feature into a reference probability density function $\tilde{p}(\tilde{x}) = p_{\mathrm{ref}}(\tilde{x})$. If $x$ is a unidimensional variable with probability density function $p(x)$, a transformation $\tilde{x} = F(x)$ leads to a modification of the probability distribution, so that the new distribution of the obtained variable $\tilde{x}$ can be expressed as
\[ \tilde{p}(\tilde{x}) = p(G(\tilde{x})) \, \frac{\partial G(\tilde{x})}{\partial \tilde{x}}, \]
with $G(\tilde{x})$ being the inverse transformation of $F(x)$. To obtain the cumulative probabilities out of the probability density functions, we have to consider the following relationship:
\[ C(x) = \int_{-\infty}^{x} p(x') \, dx' = \int_{-\infty}^{F(x)} \tilde{p}(\tilde{x}') \, d\tilde{x}' = \tilde{C}(F(x)). \]
Consequently, the transformation converting the distribution $p(x)$ into the desired distribution $\tilde{p}(\tilde{x}) = p_{\mathrm{ref}}(\tilde{x})$ can be expressed as
\[ \tilde{x} = F(x) = \tilde{C}^{-1}[C(x)] = C_{\mathrm{ref}}^{-1}[C(x)], \]
where $C_{\mathrm{ref}}^{-1}(\cdot)$ is the inverse cumulative probability function of the reference distribution, and $C(\cdot)$ is the cumulative probability function of the feature. To obtain the transformation for each feature vector component in our experiments, 500 uniform intervals between $\mu_i - 4\sigma_i$ and $\mu_i + 4\sigma_i$ were considered to derive the histograms, with $\mu_i$ and $\sigma_i$ representing the mean and the standard deviation of the $i$th feature vector component. For each component, a Gaussian probability distribution with zero mean and unity variance was used as reference probability distribution.
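The mapping through the cumulative histogram and the inverse cumulative function of a standard Gaussian can be sketched for one feature component as follows. The 500 uniform intervals between $\mu_i - 4\sigma_i$ and $\mu_i + 4\sigma_i$ follow the description above; evaluating the cumulative probability at the bin centre, clipping outliers into the edge bins, and the synthetic test data are assumptions of this sketch.

```python
import math
import random
from statistics import NormalDist, mean, pstdev

def histogram_equalise(values, num_bins=500):
    """Map one feature component onto a zero-mean, unit-variance Gaussian."""
    mu, sigma = mean(values), pstdev(values)  # assumes a non-constant feature
    low = mu - 4.0 * sigma
    width = 8.0 * sigma / num_bins

    def bin_index(v):
        # Values outside mu +/- 4 sigma are clipped into the edge bins.
        return min(max(int((v - low) / width), 0), num_bins - 1)

    counts = [0] * num_bins
    for v in values:
        counts[bin_index(v)] += 1

    # Cumulative probability at each bin centre (half of the bin's own mass).
    cdf, running = [], 0
    for c in counts:
        cdf.append((running + 0.5 * c) / len(values))
        running += c

    reference = NormalDist(0.0, 1.0)
    # Every value lies in an occupied bin, so its cdf is strictly in (0, 1).
    return [reference.inv_cdf(cdf[bin_index(v)]) for v in values]

# Gaussian features passed through a monotonic nonlinear distortion.
rng = random.Random(1)
clean = [rng.gauss(0.0, 1.0) for _ in range(2000)]
distorted = [math.exp(0.5 * g) for g in clean]
equalised = histogram_equalise(distorted)
```

Since the distortion in this example is monotonic, the equalised values are again approximately standard Gaussian; a nonmonotonic distortion could not be undone in this way.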
Summing up the three feature normalisation strategies, CMS is the simplest and most common technique which, however, cannot treat nonlinear effects of noise. MVN constitutes an improvement, but it still only provides a linear transformation of the original variable. By contrast, HEQ also compensates nonlinear distortions. However, its effectiveness and accuracy heavily depend on the quality of the estimated feature histograms, so that numerous speech frames are needed before HEQ can be expected to work well. Furthermore, Histogram Equalisation is intended to correct only monotonic transformations, but the random behaviour of noise makes the actual transformation nonmonotonic, which causes a loss of information.

Modelling of Noise.
Unlike speech, which is modelled by an SLDM, noise is modelled by a simple Linear Dynamic Model obeying the following system equation:
\[ x_t = A x_{t-1} + b + g_t. \]
Here, the matrix $A$ and the vector $b$ simulate how the noise process evolves over time, and $g_t$ represents a Gaussian noise source driving the system. A graphical representation of this LDM can be seen in Figure 3. As LDMs are time-invariant, they are suited to model signals like coloured stationary Gaussian noises as they occur in the interior of a car. As an alternative to the graphical model in Figure 3, the equations (17) can be used to express the LDM:
\[ p(x_t \mid x_{t-1}) = \mathcal{N}(x_t; A x_{t-1} + b, C), \qquad p(x_{1:T}) = p(x_1) \prod_{t=2}^{T} p(x_t \mid x_{t-1}). \tag{17} \]
Here, $\mathcal{N}(x_t; A x_{t-1} + b, C)$ is a multivariate Gaussian with mean vector $A x_{t-1} + b$ and covariance matrix $C$, whereas $T$ denotes the length of the input sequence.
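The system equation and the factorised likelihood of the LDM can be sketched for the scalar case, in which the matrix $A$, the vector $b$, and the covariance $C$ reduce to the numbers `a`, `b`, and `c`. The function names and parameter values are hypothetical, and the term $p(x_1)$ is left out of the likelihood for brevity.

```python
import math
import random

def simulate_ldm(a, b, c, x1, length, rng):
    """Draw x_t = a * x_{t-1} + b + g_t with g_t ~ N(0, c)."""
    xs = [x1]
    for _ in range(length - 1):
        xs.append(a * xs[-1] + b + rng.gauss(0.0, math.sqrt(c)))
    return xs

def gaussian_logpdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def ldm_transition_loglik(xs, a, b, c):
    """Sum of log p(x_t | x_{t-1}) for t = 2..T; log p(x_1) is omitted."""
    return sum(
        gaussian_logpdf(x, a * prev + b, c) for prev, x in zip(xs, xs[1:])
    )

rng = random.Random(2)
xs = simulate_ldm(a=0.9, b=0.5, c=0.1, x1=5.0, length=2000, rng=rng)
loglik_true = ldm_transition_loglik(xs, a=0.9, b=0.5, c=0.1)
loglik_wrong = ldm_transition_loglik(xs, a=0.5, b=0.5, c=0.1)
```

The generating parameters explain the sequence far better than the mismatched ones, which is what EM training exploits when the LDM parameters are estimated from noise frames.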

Modelling of Speech.
The modelling of speech is realised by a more complex dynamic model which also includes a hidden state variable $s_t$ at each time $t$. Now $A$ and $b$ depend on the state variable $s_t$:
\[ x_t = A(s_t) x_{t-1} + b(s_t) + g_t. \]
Consequently, every possible state sequence $s_{1:T}$ describes an LDM which is nonstationary due to $A$ and $b$ changing over time. Time-varying systems like the evolution of speech features over time can be described adequately by such models. As can be seen in Figure 4, it is assumed that there are time dependencies among the continuous variables $x_t$ but not among the discrete state variables $s_t$. This is the major difference between the SLDM used in [10] and the models used in [38] where time dependencies among the hidden state variables are included. A modification like this can be seen as analogous to extending a Gaussian Mixture Model (GMM) to an HMM. The SLDM corresponding to Figure 4 can be described as follows:
\[ p(x_t, s_t \mid x_{t-1}) = \mathcal{N}(x_t; A(s_t) x_{t-1} + b(s_t), C(s_t)) \cdot p(s_t), \qquad p(x_{1:T}, s_{1:T}) = p(x_1, s_1) \prod_{t=2}^{T} p(x_t, s_t \mid x_{t-1}). \]
To train the parameters $A(s)$, $b(s)$, and $C(s)$ of the SLDM, conventional EM techniques are used. Setting the number of states to one corresponds to training a Linear Dynamic Model instead of an SLDM to obtain the parameters $A$, $b$, and $C$ needed for the LDM which is used to model noise.
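The generative process of such an SLDM can be sketched for scalar features. In line with the assumption that there are no time dependencies among the discrete states, the state $s_t$ is drawn independently in every frame. The function name, the two-state parameter lists, and the priors are hypothetical.

```python
import math
import random

def simulate_sldm(a, b, c, priors, x1, length, rng):
    """a, b, c hold one value per state; priors holds the probabilities p(s)."""
    state_ids = list(range(len(priors)))
    xs, states = [x1], []
    for _ in range(length - 1):
        # s_t is drawn independently of s_{t-1}.
        s = rng.choices(state_ids, weights=priors)[0]
        xs.append(a[s] * xs[-1] + b[s] + rng.gauss(0.0, math.sqrt(c[s])))
        states.append(s)
    return xs, states

rng = random.Random(3)
xs, states = simulate_sldm(
    a=[0.9, 0.5], b=[0.0, 1.0], c=[0.05, 0.2],
    priors=[0.7, 0.3], x1=0.0, length=5000, rng=rng,
)
# With a single state the model reduces to an ordinary LDM.
ldm_xs, ldm_states = simulate_sldm(
    a=[0.9], b=[0.5], c=[0.1], priors=[1.0], x1=5.0, length=100, rng=rng,
)
```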

Observation Model.
In order to obtain a relationship between the noisy observation and the hidden speech and noise features, an observation model has to be defined. Figure 5 illustrates the graphical representation of the zero variance observation model with SNR inference introduced in [56]. It is assumed that speech $x_t$ and noise $n_t$ mix linearly in the time domain, which corresponds to a nonlinear mixing in the cepstral domain.
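Why linear mixing becomes nonlinear after taking the logarithm can be sketched in the log power domain, which differs from the cepstral domain only by a linear transform. The sketch assumes that the powers of speech and noise simply add (no cross term), so that the noisy log power is $y = \ln(e^{x} + e^{n}) = x + \ln(1 + e^{n - x})$; it is an illustration and not the full observation model of [56]. The function name is hypothetical.

```python
import math

def mix_log_power(x, n):
    """Log power of speech plus noise when the linear powers add."""
    # Numerically stable form of log(exp(x) + exp(n)).
    return max(x, n) + math.log1p(math.exp(-abs(x - n)))

high_snr = mix_log_power(10.0, -10.0)  # dominated by the speech term
low_snr = mix_log_power(-10.0, 10.0)   # dominated by the noise term
equal = mix_log_power(2.0, 2.0)        # equal powers: x + log(2)
```

At high SNR the observation follows the speech feature, at low SNR it follows the noise feature, and in between the relationship is a smooth nonlinear blend that depends only on the difference $n - x$.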

Posterior Estimation and Enhancement.
A possible approximation to reduce the computational complexity of posterior estimation is to restrict the size of the search space by applying the generalised pseudo-Bayesian (GPB) algorithm [57]. The GPB algorithm is based on the assumption that distinct state histories whose differences occur more than $r$ frames in the past can be neglected. Consequently, if $T$ denotes the length of the sequence and $S$ the number of discrete states, the inference complexity is reduced from $S^T$ to $S^r$, where $r \ll T$. Using the GPB algorithm, the three steps "collapse," "predict," and "observe" are conducted for each speech frame.
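The "collapse" step can be sketched as moment matching, which is the usual way to merge the Gaussians belonging to state histories that are no longer distinguished; whether the cited implementation collapses in exactly this way is an assumption. The sketch uses scalar Gaussians and a hypothetical function name.

```python
def collapse(weights, means, variances):
    """Replace a Gaussian mixture by one Gaussian with the same mean and variance."""
    total = sum(weights)
    w = [wi / total for wi in weights]
    mean = sum(wi * mi for wi, mi in zip(w, means))
    # Total variance = average component variance + spread of the component means.
    var = sum(wi * (vi + (mi - mean) ** 2) for wi, mi, vi in zip(w, means, variances))
    return mean, var

collapsed_mean, collapsed_var = collapse([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

Collapsing after every frame keeps the number of Gaussians that have to be propagated constant instead of letting it grow exponentially with the sequence length.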
The Gaussian posterior obtained in the observation step of the GPB algorithm is used to obtain estimates of the moments of $x_t$. Those estimates represent the denoised speech features and can be used for speech recognition in noisy environments. The clean features are assumed to be the Minimum Mean Square Error (MMSE) estimate $E[x_t \mid y_{1:t}]$, that is, the posterior mean of $x_t$ given the noisy observations $y_{1:t}$ up to frame $t$. Due to the noise modelling assumptions, SLDM feature enhancement has shown excellent performance also for coloured Gaussian noise even if the SNR level is negative. The linear dynamics of the speech model capture the smooth time evolution of human speech, while the switching states express the piecewise stationarity. The major limitation with respect to the noise type is that the model assumes the noise frames to be independent over time, so that only stationary noises are modelled accurately. Despite the GPB algorithm, SLDM feature enhancement is relatively time-consuming compared to simpler feature processing algorithms such as Histogram Equalisation. Another drawback is that the whole concept relies on precise voice activity detection in order to detect feature frames for the estimation of the noise LDM.

Speech Modelling in the Feature Domain.
To allow efficient speech modelling, it is common to model features extracted from the speech signal every 10 milliseconds instead of using the signal in the time domain as described in Section 5.2. As an alternative to conventional HMM modelling, the Hidden Conditional Random Field [58] will be introduced in the following and examined with respect to its noise robustness in Section 6.3.

Hidden Markov Models and Conditional Random Fields.
Generative models like the Hidden Markov Model assume that the observations are conditionally independent, meaning that an observation is statistically independent of past observations provided that the values of the latent variables are known. Whenever there are long-range dependencies between the observations, as in human speech [30], this restriction can be too strict. Therefore, model architectures like the Conditional Random Field [42,59,60] make use of an exponential distribution in order to model a label sequence given the observation sequence, and thereby drop the independence assumption between observations. Nonlocal dependencies between state and observation as well as unnormalised transition probabilities are allowed. As a Markov assumption can still be enforced, efficient inference techniques like dynamic programming can also be applied when using Conditional Random Fields. CRFs have been successfully applied in various tasks like information extraction [42] or language modelling [61].

Hidden Conditional Random Fields.
As CRFs assign a label to each observation, that is, to each frame of a time sequence, and therefore cannot directly estimate the probability of a class for an entire sequence, they need to be modified in order to be applicable to speech recognition tasks. Hence, the CRF has been extended to a Hidden Conditional Random Field which incorporates hidden state sequences [58]. The HCRF was successfully applied in various pattern recognition problems like phone classification [12], gesture recognition [62], meeting segmentation [63], or recognition of nonverbal vocalisations [64], where it partly outperformed HMM approaches. An advantage of the HCRF is its ability to handle features that are arbitrary functions of the observations without requiring a more complicated training procedure.
Similar to an HMM, the HCRF is used to model the conditional probability of a class label $w$ representing a word, given the sequence of observations $X = x_1, x_2, \ldots, x_T$. With $\lambda$ denoting the parameter vector and $f$ being the so-called vector of sufficient statistics, the conditional probability is
$$p(w \mid X, \lambda) = \frac{1}{z(X, \lambda)} \sum_{\mathrm{Seq} \in w} \exp\left\{\lambda \cdot f(w, \mathrm{Seq}, X)\right\}.$$
$\mathrm{Seq} = s_1, s_2, \ldots, s_T$ represents the hidden state sequence that is run through while the conditional probability is calculated. The normalisation of the probability is realised by the function $z(X, \lambda)$, which is
$$z(X, \lambda) = \sum_{w'} \sum_{\mathrm{Seq} \in w'} \exp\left\{\lambda \cdot f(w', \mathrm{Seq}, X)\right\}.$$
The vector $f$ determines which probability to model; $f$ can be chosen in a way that the HCRF imitates a left-right HMM as shown in [12]. We restrict the HCRF to be a Markov chain; however, the transition probabilities do not have to sum to one and the observations do not need to be real probability densities.
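Like an HMM, an HCRF can be parameterised by transition scores $a_{is}$ and observation scores $b_s(x_t)$:
$$a_{is} = \exp\left(\lambda^{(\mathrm{Tr})}_{is}\right), \qquad b_s(x_t) = \exp\left(\lambda^{(\mathrm{Occ})}_{s} + \sum_{d=1}^{D} \left(\lambda^{(\mathrm{M1})}_{s,d}\, x_{t,d} + \lambda^{(\mathrm{M2})}_{s,d}\, x_{t,d}^{2}\right)\right).$$
The conditional probability can efficiently be computed when using forward and backward recursions as derived for the HMM. The forward probability is given as
$$\alpha_s(t) = \sum_{i=1}^{S} \alpha_i(t-1)\, a_{is}\, b_s(x_t),$$
where $S$ is the number of hidden states. The backward probabilities $\beta_i(t)$ can be obtained by using the recursion
$$\beta_i(t) = \sum_{s=1}^{S} a_{is}\, b_s(x_{t+1})\, \beta_s(t+1).$$
Given the forward probabilities $\alpha_s(t)$, the probability $p(X \mid w, \lambda)$ that the model with parameters $\lambda$ representing the word $w$ produces observation $X$ can be written as
$$p(X \mid w, \lambda) = \sum_{s=1}^{S} \alpha_s(T).$$
The conditional probability of a class label $w$ given the observation $X$ is
$$p(w \mid X, \lambda) = \frac{p(X \mid w, \lambda)}{\sum_{w'} p(X \mid w', \lambda)}.$$
This HCRF definition makes it possible to use dynamic programming methods for decoding as with HMM. As shown in [12], a conditional probability density as for an HMM with transition probabilities $a_{is}$, emission means $\mu_s$, and (diagonal) variances $\sigma_s^2$ can be obtained by setting the parameters $\lambda$ as follows:
$$\lambda^{(\mathrm{Tr})}_{is} = \log a_{is}, \tag{27}$$
$$\lambda^{(\mathrm{Occ})}_{s} = -\frac{1}{2} \sum_{d=1}^{D} \left(\log 2\pi\sigma_{s,d}^{2} + \frac{\mu_{s,d}^{2}}{\sigma_{s,d}^{2}}\right), \tag{28}$$
$$\lambda^{(\mathrm{M1})}_{s,d} = \frac{\mu_{s,d}}{\sigma_{s,d}^{2}}, \tag{29}$$
$$\lambda^{(\mathrm{M2})}_{s,d} = -\frac{1}{2\sigma_{s,d}^{2}}. \tag{30}$$
Thereby $d$ indexes the dimensions of the $D$-dimensional observation, whereas $i$ and $s$ are states of the model. For the sake of simplicity, (27) to (30) consider only one mixture component. The extension to additional mixtures is straightforward.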
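The forward recursion and the HMM-equivalent parameter setting of (27) to (30) can be sketched in the log domain as follows. This is an illustrative sketch, not the implementation used in the experiments: the function names, the toy parameters, and the initial state scores `log_init` (which the text does not specify) are assumptions, and a single Gaussian with diagonal variance per state is assumed.

```python
import numpy as np

def gaussian_hcrf_params(trans, mu, var):
    """Map HMM parameters to HCRF weights, following (27)-(30)."""
    lam_tr = np.log(trans)                                                   # (27)
    lam_occ = -0.5 * np.sum(np.log(2 * np.pi * var) + mu**2 / var, axis=1)   # (28)
    lam_m1 = mu / var                                                        # (29)
    lam_m2 = -0.5 / var                                                      # (30)
    return lam_tr, lam_occ, lam_m1, lam_m2

def log_forward(X, lam_tr, lam_occ, lam_m1, lam_m2, log_init):
    """Return log sum_s alpha_s(T) for one word model.

    X has shape (T, D); log_init holds assumed initial state scores.
    """
    # log b_s(x_t) for all frames and states, shape (T, S)
    log_b = lam_occ + X @ lam_m1.T + (X**2) @ lam_m2.T
    log_alpha = log_init + log_b[0]
    for t in range(1, len(X)):
        # log alpha_s(t) = logsum_i [log alpha_i(t-1) + log a_is] + log b_s(x_t)
        log_alpha = np.logaddexp.reduce(log_alpha[:, None] + lam_tr, axis=0) + log_b[t]
    return np.logaddexp.reduce(log_alpha)

# Toy two-state model with 1-D observations (all numbers are made up)
trans = np.array([[0.7, 0.3], [0.2, 0.8]])
mu = np.array([[0.0], [1.0]])
var = np.array([[1.0], [2.0]])
init = np.array([0.6, 0.4])
X = np.array([[0.1], [0.9], [1.2]])

params = gaussian_hcrf_params(trans, mu, var)
score = log_forward(X, *params, np.log(init))
```

With these weights the score equals the log-likelihood of the corresponding Gaussian HMM; discriminative training would then move the weights away from this starting point. Class posteriors follow by normalising `exp(score)` over all word models.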

Speech Modelling in the Time Domain.
An alternative to conventional HMM modelling of speech is the modelling of the raw signal directly in the time domain. As shown in [13], modelling the raw signal can be a reasonable alternative to feature-based approaches. Such an architecture offers the advantage that including an explicit noise model is straightforward, as can be seen in Section 5.2.2.

Switching Autoregressive Hidden Markov Models.
In [14], a Switching Autoregressive HMM is applied for isolated digit recognition. The SAR-HMM is based on modelling the speech signal as an autoregressive (AR) process, whereas the nonstationarity of human speech is captured by the switching between a number of different AR parameter sets. This is done by a discrete switch variable $s_t$ that can be seen as an analogue of the HMM states. One of $S$ different states can be occupied at each time step $t$. Thereby, the state variable indicates which AR parameter set to use at the given time instant $t$. Here, the time index $t$ denotes the samples in the time domain and not the feature vectors as in Section 4.2. The current state only depends on the preceding state with transition probability $p(s_t \mid s_{t-1})$. Furthermore, it is assumed that the current sample $v_t$ is a linear combination of the $R$ preceding samples superposed by a Gaussian distributed innovation $\eta(s_t)$. Both $\eta(s_t)$ and the AR weights $c_r(s_t)$ depend on the current state $s_t$:
$$v_t = \sum_{r=1}^{R} c_r(s_t)\, v_{t-r} + \eta(s_t) \tag{31}$$
with
$$\eta(s_t) \sim \mathcal{N}\left(0, \sigma^2(s_t)\right).$$
The purpose of $\eta(s_t)$ is not to model an independent additive noise process but to model variations from pure autoregression. For the SAR-HMM, the joint probability of a sequence of length $T$ is
$$p(s_{1:T}, v_{1:T}) = p(v_1 \mid s_1)\, p(s_1) \prod_{t=2}^{T} p(v_t \mid v_{t-R:t-1}, s_t)\, p(s_t \mid s_{t-1}),$$
corresponding to the Dynamic Bayesian Network (DBN) structure illustrated in Figure 6.
As the number of samples in the time domain which are used as input for the SAR-HMM is usually a lot higher than the number of feature vectors observed by an HMM, it is necessary to ensure that the switching between the different AR models is not too fast. This is ensured by forcing the model to stay in the same state for an integer multiple of $K$ time steps.
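To make the role of the state-dependent AR parameters concrete, the sketch below scores one block of $K$ samples under a single AR parameter set; comparing such block scores across states is the basic ingredient of the switching model. It is a simplified illustration under stated assumptions: the helper name `ar_block_loglik` is hypothetical, the sign convention is $v_t = \sum_r c_r v_{t-r} + \eta$, and transition probabilities and training are left out.

```python
import numpy as np

def ar_block_loglik(v, t0, K, c, sigma2):
    """Log-likelihood of samples v[t0:t0+K] under one AR parameter set.

    c[r-1] weights v[t-r]; requires t0 >= len(c) so that enough history exists.
    """
    R = len(c)
    ll = 0.0
    for t in range(t0, t0 + K):
        history = v[t - R:t][::-1]          # v[t-1], v[t-2], ..., v[t-R]
        pred = np.dot(c, history)           # sum_r c_r * v[t-r]
        ll += -0.5 * (np.log(2 * np.pi * sigma2) + (v[t] - pred) ** 2 / sigma2)
    return ll

# Toy signal that follows v_t = 0.5 * v_{t-1} exactly (no innovation)
v = np.array([1.0, 0.5, 0.25, 0.125])
ll_match = ar_block_loglik(v, t0=1, K=3, c=np.array([0.5]), sigma2=1.0)
ll_mismatch = ar_block_loglik(v, t0=1, K=3, c=np.array([-0.5]), sigma2=1.0)
```

The matching parameter set yields the higher block score, which is what lets the switch variable select an AR model per block of $K$ samples.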
The training of the AR parameters is realised by applying the EM algorithm. To infer the distributions $p(s_t \mid v_{1:T})$, a technique based on the forward-backward algorithm is used. Due to the fact that an observation $v_t$ depends on $R$ preceding observations (see Figure 6), the backward pass is more complicated for the SAR-HMM than for a conventional HMM. To overcome this problem, a "correction smoother" as derived in [65] is applied, which means that the backward pass computes the posterior $p(s_t \mid v_{1:T})$ by "correcting" the output of the forward pass.

Autoregressive Switching Linear Dynamical Systems.
To improve noise robustness, the SAR-HMM can be embedded into an AR-SLDS to include an explicit noise process as shown in [14]. The AR-SLDS interprets the observed speech sample $v_t$ as a noisy version of a hidden clean sample. Thereby, the clean signal can be obtained from the projection of a hidden vector $h_t$ which has the dynamic properties of a Linear Dynamical System as follows:
$$h_t = A(s_t)\, h_{t-1} + \eta^{H}_{t}, \qquad \eta^{H}_{t} \sim \mathcal{N}\left(0, \Sigma_H(s_t)\right).$$
The dynamics of the hidden variable are defined by the transition matrix $A(s_t)$ which depends on the current state $s_t$. Variations from pure linear state dynamics are modelled by the Gaussian distributed hidden "innovation" variable $\eta^{H}_{t}$. Similar to the variable $\eta_t$ used in (31) for the SAR-HMM, $\eta^{H}_{t}$ does not model an independent additive noise source. To obtain the current observed sample, the vector $h_t$ is projected onto a scalar $v_t$ as follows:
$$v_t = B\, h_t + \eta^{V}_{t}, \qquad \eta^{V}_{t} \sim \mathcal{N}\left(0, \sigma_V^{2}\right). \tag{36}$$
The variable $\eta^{V}_{t}$ thereby models independent additive white Gaussian noise which is supposed to corrupt the hidden clean sample $B h_t$. Figure 7 visualises the structure of the SLDS modelling the dynamics of the hidden clean signal as well as independent additive noise.
The SLDS parameters $A(s_t)$, $B$, and $\Sigma_H(s_t)$ can be defined in a way that the obtained SLDS mimics the SAR-HMM derived in Section 5.2.1 for the case $\sigma_V = 0$ (see [14]). This has the advantage that in case $\sigma_V \neq 0$ a noise model is included without having to train new models. Since exact inference for the AR-SLDS is computationally intractable, the "Expectation Correction" algorithm developed in [66] is applied to reduce the complexity. In contrast to exact inference, which requires $O(S^T)$ operations, the passes performed by the Expectation Correction algorithm are linear in $T$.
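One common way to make a linear dynamical system reproduce an AR process is the companion form, where the hidden vector stacks the $R$ most recent clean samples. The sketch below uses that construction as an assumption; the exact parameterisation in [14] may differ (for example in the sign convention of the AR weights), and the helper name `ar_to_slds` is hypothetical.

```python
import numpy as np

def ar_to_slds(c, sigma2):
    """Companion-form LDS for one AR parameter set.

    Assumes h_t = [v_t, v_{t-1}, ..., v_{t-R+1}] and v_t = sum_r c_r v_{t-r} + eta.
    """
    R = len(c)
    A = np.zeros((R, R))
    A[0, :] = c                    # first row applies the AR weights
    A[1:, :-1] = np.eye(R - 1)     # remaining rows shift older samples down
    B = np.zeros(R)
    B[0] = 1.0                     # projection picks the newest clean sample
    Sigma_H = np.zeros((R, R))
    Sigma_H[0, 0] = sigma2         # innovation only enters the newest sample
    return A, B, Sigma_H

A, B, Sigma_H = ar_to_slds([0.5, 0.25], sigma2=0.1)
h_prev = np.array([1.0, 2.0])      # [v_{t-1}, v_{t-2}]
h_next = A @ h_prev                # noise-free prediction of [v_t, v_{t-1}]
v_clean = B @ h_next               # equals 0.5 * 1.0 + 0.25 * 2.0
```

With zero observation noise the projected sample follows the AR recursion exactly, so trained SAR-HMM parameters can be reused and the observation noise variance added afterwards.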
While the SAR-HMM has shown rather poor performance in noisy conditions, the AR-SLDS achieves excellent recognition rates for speech disturbed by white noise, as the variable $\eta^{V}_{t}$ incorporates an additive white Gaussian noise (AWGN) model. In clean conditions, however, the performance of HMM speech modelling in the feature domain cannot be reached by the AR-SLDS, since time domain modelling is not as close to the principle of human perception as the well-established MFCC features. Also for coloured noise, the AR-SLDS cannot compete with feature domain approaches such as the SLDM. Further, computational complexity is still very high for the AR-SLDS. The Expectation Correction algorithm can reduce complexity from $O(S^T)$ to $O(T)$; however, for a speech utterance sampled at 16 kHz, $T$ is 160 times higher than for a feature vector sequence extracted every 10 milliseconds.
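The factor of 160 follows directly from the sampling rate and the frame shift. Writing $f_s$ for the sampling rate and $\Delta$ for the frame shift (symbols introduced here for illustration only):

$$\frac{T_{\mathrm{samples}}}{T_{\mathrm{frames}}} = f_s \cdot \Delta = 16\,000\ \mathrm{s^{-1}} \times 0.010\ \mathrm{s} = 160.$$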

Experiments
In order to compare the different speech signal preprocessing, feature enhancement, and speech modelling techniques introduced in Sections 3 to 5 with respect to their recognition performance in various noise scenarios, we implemented all of the techniques in a noisy speech recognition experiment which will be outlined in the following.
6.1. Speech Database. The digits "zero" to "nine" as well as the letters "A" to "Z" from the TI 46 Speaker Dependent Isolated Word Corpus [67] are used as speech database for the noisy digit and spelling recognition task. The database contains utterances from 16 different speakers (8 female and 8 male). For the sake of better comparability with the results presented in [14], only the words which are spoken by male speakers are used. For every speaker, 26 utterances were recorded per word class, of which 10 samples are used for training and 16 for testing. Consequently, the overall digit training corpus consists of 800 utterances, while the digit test set contains 1280 samples. The same holds for the spelling database, consisting of 2080 utterances for training and 3328 for testing.
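The corpus sizes quoted above can be checked with a few lines of arithmetic; the variable names are illustrative.

```python
speakers = 8                       # only the male speakers are used
train_per_word, test_per_word = 10, 16
digits, letters = 10, 26           # "zero"-"nine" and "A"-"Z"

digit_train = speakers * digits * train_per_word
digit_test = speakers * digits * test_per_word
letter_train = speakers * letters * train_per_word
letter_test = speakers * letters * test_per_word

print(digit_train, digit_test, letter_train, letter_test)
```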

Noise Database.
Even though we also considered babble and white noise scenarios, the main focus of this work lies on designing a robust speech recogniser for an in-car environment. Thus, great emphasis has been laid on simulating a wide spectrum of different noise conditions that can occur in the interior of a car. In general, interior noise can be split up into four rough groups. The first one is wind noise, which is generated by air turbulence at the corners and edges of the vehicle and increases with velocity. Another noise type is engine noise, depending on load and number of revolutions. The third noise group is caused by wheels, driving, and suspension and is influenced by road surface and wheel type. Thus, a rough surface causes more wheel and suspension noise than a smooth one. Finally, buzz, squeak, and rattle noises generated by pounding or relative movement of interior components of a vehicle have to be considered [68].
In accordance with existing in-car speech recognition systems, the microphone would be mounted in the middle of the instrument panel. Consequently, all masking noises occurring in the interior of a car have been recorded at exactly that position. Figure 8 illustrates the different noise sources. Note that the mouth-to-microphone transfer function was neglected during the experiments in Section 6.3, since the masking effect of background noise was found to be much higher than the effect of convolutional noise. In an additional experiment, the slight degradation of recognition performance caused by convolving the speech signal with a recorded in-car impulse response could be fully compensated by simple Cepstral Mean Subtraction.
As interior noise masking varies depending on vehicle class and derivatives [68], speech is superposed by noise of four different vehicles as listed in Table 1.
Thus, a wide spectrum of car variations can be covered. Not only the vehicle type but also the road surface influences the interior noise; the considered road surfaces and velocities are listed in Table 2. The lowest excitation is produced by driving over a smooth city road at 50 km/h and medium revolution (CTY). Thus, for this profile, noise caused by wind, engine, wheels, and so forth has its minimum. The next higher excitation is measured for a highway drive at 120 km/h (HWY). In that case, wind noise is several times higher than for a drive at 50 km/h. The loudest interior noise is provoked by a road with big cobbles (COB). At 30 km/h, wind noise can be neglected, but the rough cobble surface causes dominant wheel and suspension noise. Figure 9 shows the SNR histograms of the noisy speech utterances for all four car types at each driving condition.
In spite of SNR levels below 0 dB, speech in the noisy test sequences is still well audible since the recorded noise samples are lowpass signals with most of their energy in the frequency band from 0 to 500 Hz (see Figure 10).
Consequently, there is no full overlap of the spectrum of speech and noise. The extremely low SNR levels for the car noises (see Figure 9) are mainly caused by intense spectral components below the spectrum of human speech (motor drone). Filtering out those spectral components did not significantly affect recognition performance. Note that no A-weighting was applied to estimate the SNR levels.
Apart from car noises (CAR), two further noise types are used in our experiments: first, a mixture of babble and street noise (BAB) at SNR levels 12 dB, 6 dB, and 0 dB, recorded in downtown Munich. This noise type is relevant for in-car speech recognition performance when driving in an urban area with open windows. Furthermore, additive white Gaussian noise (WGN) has been used (SNR levels 20 dB, 10 dB, and 0 dB).
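Superposing speech with noise at a prescribed SNR amounts to scaling the noise so that the ratio of mean signal powers matches the target. The following sketch shows one way to do this; it is not the tooling used for the experiments, the function name `mix_at_snr` is hypothetical, the SNR is computed over the whole utterance without A-weighting or voice activity detection, and the noise recording is assumed to be at least as long as the speech.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` to reach `snr_db` relative to `speech` and add it."""
    noise = noise[:len(speech)]                  # assumes len(noise) >= len(speech)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # choose gain so that p_speech / (gain^2 * p_noise) = 10^(snr_db / 10)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise, gain

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 200 * np.arange(1600) / 16000)   # toy "speech"
noise = rng.standard_normal(2000)                            # toy white noise
noisy, gain = mix_at_snr(speech, noise, snr_db=6.0)
```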
Note that heating, ventilating, and air conditioning (HVAC) noise was not examined as a further potential noise source that can occur inside a car, since fan and defrost facilities were turned off during noise recording. Although it is quite evident that such additional in-car noises can further degrade speech recognition performance, we abstained from varying fan and defrost settings, as those noise types can be characterised as stationary and are likely not to change the ranking of the individual noise compensation strategies but rather to result in a negative "performance offset." Conversely, the Lombard effect, which causes humans to speak louder when background noise is present, was also not considered, since this would mostly result in a constant shift of the SNR histogram (Figure 9) towards higher SNR levels, without affecting conclusions about the effectiveness of the different denoising strategies.

Results.
For every digit, a model was trained to build an isolated word recogniser. In the case of HMM and HCRF, each model consists of eight states with a mixture of three Gaussians per state. Thereby, clean utterances were used for training. 13 Mel-frequency cepstral coefficients as well as their first- and second-order derivatives were extracted. In addition, the usage of PLP features instead of MFCC was evaluated. Attempting to remove the effects of noise, various speech enhancement strategies as outlined in Section 4 were applied: Cepstral Mean Subtraction, Mean and Variance Normalisation, Histogram Equalisation, Unsupervised Spectral Subtraction, and Advanced Front-End feature extraction. In most of the experiments, the recognition rate for clean speech was around 99.9%. All parameters were tuned to achieve the best possible recognition performance.
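The two simplest normalisation strategies in this list can be sketched in a few lines. The version below assumes per-utterance statistics over a feature matrix of shape (frames, coefficients); the function names and the small constant `eps` are illustrative choices, not details taken from the experiments.

```python
import numpy as np

def cms(features):
    """Cepstral Mean Subtraction: remove the per-coefficient mean."""
    return features - features.mean(axis=0)

def mvn(features, eps=1e-8):
    """Mean and Variance Normalisation: zero mean, unit variance per coefficient."""
    return cms(features) / (features.std(axis=0) + eps)

rng = np.random.default_rng(1)
feats = 3.0 * rng.standard_normal((40, 13)) + 5.0    # toy 13-dimensional features
feats_cms = cms(feats)
feats_mvn = mvn(feats)
```

A constant offset in the cepstral domain, as caused by a fixed channel, is removed by `cms`, which is why this step suffices against mild convolutional distortion.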
As can be seen in Table 3, for stationary lowpass noise like the "CAR" and "BAB" noise types, the best average recognition rate can be achieved when enhancing the speech features using a global Switching Linear Dynamic Model for speech and a Linear Dynamic Model for noise (see Section 4.2). Thereby, all available clean training sequences were used to train the global SLDM which captures the dynamics of clean speech. The speech model consisted of 32 hidden states. The utterance-specific noise model consisted of a single Gaussian mixture component and was trained on the first and last 10 frames of the noisy test utterance. To speed up the calculation, the algorithm for speech enhancement was run with history parameter $r = 1$ (see Section 4.2.4). Also for more demanding recognition tasks like the Interspeech Consonant Challenge [69], SLDM feature enhancement was shown to increase recognition rates for noisy speech. The technique cannot compete with strategies using perfect knowledge of the local SNR of time-frequency components in the spectrogram like oracle masks [70][71][72]; however, compared to the Consonant Challenge HMM baseline recogniser [69], the SLDM approach can improve noisy speech recognition rates by up to 174% [73].
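Estimating noise statistics from the edges of an utterance rests on the assumption that the first and last frames contain no speech. The sketch below fits only a single Gaussian (mean and covariance) to those frames; it deliberately omits the dynamics of the noise LDM, and the function name `estimate_noise_gaussian` is hypothetical.

```python
import numpy as np

def estimate_noise_gaussian(features, n_edge=10):
    """Fit one Gaussian to the first and last `n_edge` frames.

    `features` has shape (frames, coefficients) with at least 2 coefficients.
    """
    edge = np.vstack([features[:n_edge], features[-n_edge:]])
    mean = edge.mean(axis=0)
    cov = np.cov(edge, rowvar=False)      # unbiased sample covariance
    return mean, cov

# Toy utterance: "noise" at the edges, much larger values in the middle
feats = np.full((50, 13), 100.0)
feats[:10] = 1.0
feats[-10:] = 3.0
noise_mean, noise_cov = estimate_noise_gaussian(feats)
```

The dependence on speech-free edge frames is the reason why the approach relies on accurate voice activity detection.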
Applying Hidden Conditional Random Fields instead of HMM for the classification of features enhanced by CMS did not result in a better recognition rate.
For speech disturbed by white noise, the best recognition rate (93.3%, averaged over the different SNR conditions) is reached by the autoregressive Switching Linear Dynamical System explained in Section 5.2.2, where the noisy speech signal is modelled in the time domain as an autoregressive process. As explained in Section 5.2.2, the AR-SLDS constitutes the fusion of the SAR-HMM with the SLDS. The AR-SLDS used in the experiment is based on a 10th order SAR-HMM with ten states. This concept is however not suited for lowpass noise at negative SNR levels: for the "CAR" noise type, a poor recognition rate of 47.2%, averaged over all car types and driving conditions, was obtained for AR-SLDS modelling. In case an HMM recogniser without feature enhancement is applied, PLP features perform slightly better than MFCC.
For white Gaussian noise, Table 4 compares the recognition rates obtained in this work with the performance reported in [14], using Unsupervised Spectral Subtraction, SAR-HMM and AR-SLDS modelling. Note that we used only 10 digits in our experiment ("zero" to "nine"), while [14] used 11 digits (including "oh"), which, together with extensive parameter tuning, should be the major reason why our SAR-HMM and AR-SLDS performance is better.
Since one model was trained for every noise condition in the matched conditions experiment, this strategy not only implies knowledge of the noise characteristics (e.g., by considering GPS or velocity information) but also higher memory requirements, as more than one model has to be stored. In the in-car scenario, this would entail one model for every driving condition, resulting in an increase of model size by a factor of four. The best MFCC feature enhancement methods were also applied in the spelling recognition task (see Table 6). Again, for noisy test data, SLDM performs better than conventional techniques like HEQ.

Conclusion
In this article, a wide range of different techniques to improve the performance of automatic speech recognition in noisy surroundings has been implemented and evaluated in a noisy in-car isolated digit and spelling recognition task. In contrast to previous research, diverse cars and driving conditions resulting in different spectral noise characteristics have been taken into account in order to obtain reliable conclusions about the universality of recognition performance. Thereby, four major approaches, affecting feature extraction, feature enhancement, speech decoding, and speech modelling, have been considered.
With the aim of approximating the speech recognition performance of human perception in noisy conditions, PLP features were used as speech representation, which leads to a relative error reduction of 18.6% (averaged over all evaluated noise conditions) with respect to conventional MFCC. Furthermore, we showed that feature enhancement methods based on spectral subtraction and normalisation like Cepstral Mean Subtraction, Mean and Variance Normalisation, Unsupervised Spectral Subtraction, or Histogram Equalisation are able to partly remove the effects of stationary coloured noises as they occur in the interior of a car.
As a further approach to enhance speech features, a global Switching Linear Dynamic Model was used to capture the dynamics of speech, enabling a model-based speech enhancement through joint speech and noise modelling. This technique prevailed for all car noise types and reached the best mean recognition rate of 96.9% for the noisy isolated digit recognition task.
The usage of Hidden Conditional Random Fields as an alternative model architecture could not outperform the conventional HMM. However, embedding a Switching Linear Dynamical System into a Switching Autoregressive HMM, and thereby modelling the raw signal in the time domain, leads to the best recognition performance for speech corrupted with additive white Gaussian noise.
Adapting the speech models by using noisy training data to build the models could also improve noise robustness. While matched conditions training is hardly possible in real-life applications since the exact noise condition is not known a priori, mismatched conditions training, which uses training sequences disturbed by noise conditions different from those in the test phase, outperformed training on clean data with a relative error reduction of 54.5%.
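The relative error reductions quoted in this section compare error rates, not recognition rates. A minimal sketch of the computation is given below; the function name and the example rates are hypothetical and do not reproduce entries of the result tables.

```python
def relative_error_reduction(baseline_rate, improved_rate):
    """Relative error reduction in percent, given recognition rates in percent."""
    baseline_err = 100.0 - baseline_rate
    improved_err = 100.0 - improved_rate
    return 100.0 * (baseline_err - improved_err) / baseline_err

# Hypothetical example: raising the recognition rate from 90% to 95%
# halves the error rate, that is, a relative error reduction of 50%.
rer = relative_error_reduction(90.0, 95.0)
```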
Apart from recognition performance, computational complexity and possible fields of application also have to be considered when designing a robust speech recogniser. While AFE and USS are more complex than feature normalisation techniques such as CMS or MVN, they are still suited for real-time applications. HEQ and SLDM feature enhancements achieve better recognition rates but require more computational resources. Modelling the speech signal in the time domain as done in the AR-SLDS experiment requires the most computational power and is therefore not suited for most real-life applications. For stationary noises, the SLDM is the most promising technique; however, it relies on accurate voice activity detection.
To optimise existing denoising strategies, future research effort could be spent on increasing the suitability of promising concepts like SLDM feature enhancement for the in-car speech recognition task by including discrete state transition probabilities or finding the optimum compromise between an increment of the history parameter and computational complexity. Furthermore, the AR-SLDS concept could be optimised for coloured noise to improve recognition performance when applying autoregressive speech modelling for in-car speech recognition. It would also be interesting to see how the implemented denoising methods perform in a continuous speech recognition task where, due to longer observation sequences, the parameters of a global SLDM as well as the cumulative histogram for the HEQ method could be estimated more precisely than in an isolated digit or spelling recognition experiment. Further improvements in noise robustness could also be achieved by combining different denoising concepts or by the application of other promising modelling concepts like Long Short-Term Memory Recurrent Neural Networks.
Speech recognition in noisy environments remains challenging; however, as shown in this article, spending effort on finding accurate techniques for auditory modelling, feature enhancement, speech modelling, and model adaption can remarkably reduce the performance gap between automatic speech recognition and human perception.

Figure 6: Dynamic Bayesian network structure of the SAR-HMM.

Figure 7: Dynamic Bayesian network structure of the AR-SLDS.

Figure 8: In-car speech and masking sound (top) and information flow (bottom).

Figure 10: Long-term spectrum of the car noises COB, HWY, CTY (Mini Cooper S) and the spectral characteristics of the vowel [i:] spoken by a male speaker.
Figure 3: Linear dynamic model for noise.
4.2. Model-Based Feature Enhancement. Model-based speech enhancement techniques are based on modelling speech and noise. Together with a model of how speech and noise produce the noisy observations, these models are used to enhance the noisy speech features. In [10], a Switching Linear Dynamic Model is used to capture the dynamics of clean speech. Similar to Hidden Markov Model-based approaches to model clean speech, the SLDM assumes that the signal passes through various states. Conditioned on the state sequence, the SLDM furthermore enforces a continuous state transition in the feature space.
Figure 5: Observation model for noisy speech $y_t$.

Table 2: Considered road surfaces and velocities.

Table 3: Mean isolated digit recognition rates in (%) for different noise types, noise compensation strategies, and features (training on clean data), sorted by mean recognition rate.
A reason for the poor AR-SLDS performance for the "CAR" noise type is the assumption in (36), which expects additive noise to have a flat spectrum.

Table 5 summarises the mean recognition rates of an HMM recogniser without feature enhancement for three different training strategies: training on clean data, mismatched conditions training, and matched conditions training. Here, mismatched conditions training denotes the case when training and testing are done using speech sequences disturbed by the same noise type but at unequal noise conditions (SNR levels and driving conditions, respectively). Matched conditions training means training and testing with exactly identical noise types and noise conditions. Whenever the test sequence is disturbed by noise, mismatched conditions training outperforms a recogniser that had been trained on clean data.

Table 4: Isolated digit recognition rates in (%) for different SNR levels (white Gaussian noise) and noise compensation strategies (training on clean data); comparison between the results obtained in this work and the results reported in [14].
The results for matched conditions training serve as an upper benchmark for noisy speech recognition performance, as this strategy assumes perfect knowledge of the noise properties.

Table 5: Mean isolated digit recognition rates in (%) of an HMM recogniser without feature enhancement for different noise types and training strategies: matched conditions (MC) training, mismatched conditions (MMC) training, and training with clean data.

Table 6: Mean spelling recognition rates in (%) for different noise types and noise compensation strategies (training on clean data).