An Extraction Method of Acoustic Features for Speech Recognition

This study presents a novel method for extracting acoustic features for the recognition of isolated speech words. The method is based on a bank of 41 Gabor filters, which aim to select specific modulation frequencies and to limit information redundancy at the feature level. The robustness and performance of the proposed features, named Gabor Mel Spectrum features (GMS features), are validated on isolated speech words in both clean and noisy environments and compared with those of two classic methods, PLP-features and MFCC-features. The recognition results obtained using HMMs show that our extraction method is more robust and achieves better recognition rates than the two latter methods.


INTRODUCTION
The extraction of acoustic features for speech recognition systems has been an area of extensive research, with the aim of improving the robustness of these systems in clean and noisy environments. Several of the developed approaches are inspired by the principles of the human auditory mechanism. These principles are integrated into the approaches in the form of various auditory models, such as the Mel scale filterbank (Davis and Mermelstein, 1980), the Bark filterbank (Hermansky, 1990), the Gammatone filterbank (Qi et al., 2013) and the Gammachirp filterbank (Zouhir and Ouni, 2015). However, these approaches are still far less robust to background noise than the human auditory mechanism.
Some recent studies revealed the existence of auditory cortex neurons that are explicitly tuned and sensitive to various patterns in the spectro-temporal representation of the signal. The activity of these auditory neurons is estimated by spectro-temporal receptive fields, referred to as STRFs (Mesgarani et al., 2007). These findings motivated many speech processing researchers to model the STRFs and to employ them in various applications (Mesgarani and Shamma, 2011). For example, two-dimensional (2-D) Gabor filters are used as a model in Qiu et al. (2003), and these filters have been applied to the speech recognition problem in many studies (Kovács et al., 2015; Kovács and Tóth, 2015; Ravuri and Morgan, 2010; Schädler et al., 2012; Schädler and Kollmeier, 2015). These feature extraction methods have achieved a good improvement in terms of recognition rate by applying the 2-D Gabor filters to various representations of the speech signal, such as the log Mel-spectrogram (Schädler et al., 2012; Schädler and Kollmeier, 2015) obtained from MFCC processing (Davis and Mermelstein, 1980), the PNCC spectrogram (Meyer et al., 2012) generated by Power-Normalized Cepstral Coefficients processing (Kim and Stern, 2009) and the spectro-temporal representation generated from the Bark filterbank output (Missaoui and Lachiri, 2014).
In this study, a new method for extracting acoustic features for the recognition of isolated speech words is presented. This method incorporates a set of 41 two-dimensional (2-D) Gabor filters to improve recognition performance and robustness. It is tested and evaluated on the recognition of isolated speech words from the TIMIT database in both clean and noisy environments. For this purpose, Hidden Markov Models are used to obtain the recognition rates of the proposed features, PLP-features (Hermansky, 1990) and MFCC-features (Davis and Mermelstein, 1980).

An extraction method of acoustic features for the recognition of isolated speech words in both clean and noisy environments is presented in this section. The different steps of our extraction method are shown in Fig. 1. In the first step, the speech signal is windowed with a 20 ms Hamming window and a 10 ms overlap. In the second step, the power spectrum is obtained as the squared magnitude of the Discrete Fourier Transform of each windowed frame.
The third step applies a Mel scale filterbank to the obtained power spectrum. This filterbank consists of a set of filters evenly spaced along a warped, perceptually motivated frequency scale known as the Mel frequency scale (Stevens and Volkmann, 1940). The filters are triangular bandpass filters, and the Mel scale can be approximated by equation 1.
Then, the obtained spectrum is processed by an equal-loudness pre-emphasis operation and an intensity-to-loudness conversion operation. These two operations aim to simulate, respectively, the unequal sensitivity of hearing at different frequencies and the power law of hearing (Hermansky, 1990).
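The first three steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Mel mapping uses the widely used approximation mel(f) = 2595 log10(1 + f/700), whose constants are not given in the text, and the function names, the number of filters and the 16 kHz sampling rate in the usage note are hypothetical.

```python
import numpy as np

def power_spectrum(signal, fs, win_ms=20.0, overlap_ms=10.0):
    """Steps 1-2: Hamming-windowed frames, then squared DFT magnitude."""
    win = int(round(fs * win_ms / 1000.0))
    hop = win - int(round(fs * overlap_ms / 1000.0))
    n_frames = 1 + (len(signal) - win) // hop
    window = np.hamming(win)
    frames = np.stack([signal[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, n=win, axis=1)) ** 2

def hz_to_mel(f):
    # Assumed approximation of the Mel scale (constants not from the text).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse of the assumed approximation above.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Step 3: triangular filters with edges evenly spaced on the Mel scale."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    fb = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        # Triangle: rising edge up to the centre, falling edge after it.
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb
```

For a signal sampled at 16 kHz, `power_spectrum(x, 16000) @ mel_filterbank(23, 320, 16000).T` would give a frames-by-channels Mel spectrum. The equal-loudness and intensity-to-loudness operations described above are not included in this sketch.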
The last step processes the obtained spectro-temporal representation with 41 2-D Gabor filters in order to generate the proposed Gabor features, named Gabor Mel Spectrum features (GMS features). 2-D Gabor filters have been widely and successfully used as a front-end in many speech recognition systems (Schädler et al., 2012; Meyer et al., 2011, 2012; Missaoui and Lachiri, 2014). The 41 Gabor filters used here are selected to cover many spectro-temporal modulation directions, to approximate orthogonal filters and to limit the redundancy between the outputs of these filters.
The 2-D Gabor filter, which is applied as a 2-D convolution, is the product of a complex sinusoid s(n, k) defined in equation 2 and a Hanning envelope h(n, k) defined in equation 3 (Schädler et al., 2012).
The values of the standard deviation of the envelope are denoted by W_n and W_k, while the radian frequencies that define the periodicity are denoted by ω_n and ω_k.
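The construction described above, a complex sinusoid multiplied by a Hanning envelope and applied by 2-D convolution, can be sketched as follows. The exact forms of equations 2 and 3 are not reproduced here, so this sketch makes several assumptions: a separable Hann envelope of widths `w_n` and `w_k`, a carrier centred on the middle of the filter, and a channels-by-frames layout. All function names are hypothetical.

```python
import numpy as np

def hann(width):
    # Hann window sampled at x = 1..width (assumed envelope form).
    x = np.arange(1, width + 1)
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * x / (width + 1))

def gabor_2d(omega_n, omega_k, w_n, w_k):
    """Complex sinusoid s(n, k) times Hann envelope h(n, k).

    Rows index frequency channels k, columns index time frames n.
    """
    n = np.arange(1, w_n + 1)[None, :]
    k = np.arange(1, w_k + 1)[:, None]
    n0, k0 = (w_n + 1) / 2.0, (w_k + 1) / 2.0  # filter centre
    carrier = np.exp(1j * (omega_n * (n - n0) + omega_k * (k - k0)))
    envelope = hann(w_k)[:, None] * hann(w_n)[None, :]
    return carrier * envelope

def convolve_same(spec, kernel):
    """FFT-based 2-D convolution, cropped to the size of `spec`."""
    rows = spec.shape[0] + kernel.shape[0] - 1
    cols = spec.shape[1] + kernel.shape[1] - 1
    full = np.fft.ifft2(np.fft.fft2(spec, s=(rows, cols)) *
                        np.fft.fft2(kernel, s=(rows, cols)))
    r0 = (kernel.shape[0] - 1) // 2
    c0 = (kernel.shape[1] - 1) // 2
    return full[r0:r0 + spec.shape[0], c0:c0 + spec.shape[1]]
```

Filtering a channels-by-frames spectrogram `spec` with one filter would then be `convolve_same(spec, gabor_2d(omega_n, omega_k, w_n, w_k))`. The output is complex, and how it is reduced to real-valued features (for example by taking the real part) is left open here.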

RESULTS AND DISCUSSION
In this section, we present the results of the experiments conducted to evaluate the robustness and performance of the proposed GMS features in clean and noisy environments.
The experiments are conducted with 12534 isolated speech words taken from the TIMIT database (Garofolo et al., 1993). Of these extracted words, 9240 are used for the learning phase and 3294 for the recognition phase. In the noisy environment case, the noisy isolated words are generated by adding three different noises, « babble noise », « restaurant noise » and « station noise », to the clean words at different SNR levels. These noises are drawn from the AURORA corpus (Hirsch and Pearce, 2000).
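Adding a noise recording to a clean word at a chosen SNR level can be sketched as below. The text does not describe the exact mixing procedure, so this is only an assumed, commonly used approach: the noise is scaled so that the ratio of average signal power to average noise power matches the target SNR. The function name is hypothetical.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so that the mixture has the given SNR (dB)."""
    # Repeat or truncate the noise so that it matches the clean signal length.
    noise = np.resize(noise, clean.shape)
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain that brings p_clean / (gain^2 * p_noise) to 10^(snr_db / 10).
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise
```

With this definition, a lower `snr_db` gives a larger noise gain and therefore a more degraded word.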
The recognition results of the GMS features are compared with those of PLP-features and MFCC-features. All results are obtained using a speech recognition system that employs the Hidden Markov Model Toolkit (HTK) (Young et al., 2009) to build Hidden Markov Models (HMM).
The HMM topology used is a left-to-right HMM with 5 states and 8 diagonal Gaussian mixtures per state for each isolated speech word (HMM_8_GM).
Evaluation in a clean environment: Table 1 reports the experimental results of the GMS features and those obtained using PLP and MFCC coefficients in the clean environment case. The quantities used in this table are N, the total number of isolated speech words, and H, S and D, the numbers of correctly recognized words, substituted words and deleted words, respectively. As shown, the GMS features provide a higher recognition rate than the two other feature types.
In the noisy environment case, the reported results show that the GMS features provide the highest recognition rate, compared with PLP and MFCC coefficients, in the presence of the three noises. They demonstrate significantly better performance at all SNR levels. For example, in the presence of « station noise » at an SNR level of 10 dB, the recognition rate obtained using the GMS features is 90.86%, while those obtained using PLP and MFCC are 60.50% and 60.50%, respectively.
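The quantities N, H, S and D introduced above can be related to the recognition rate as in the following sketch. It assumes the HTK convention, in which H = N - D - S and the percentage of correct words is 100 × H / N. The text does not state the formula explicitly, the function name is hypothetical, and the counts in the usage note are illustrative only, not taken from the tables.

```python
def recognition_rate(n_total, n_deletions, n_substitutions):
    """Percentage of correctly recognized words, assuming H = N - D - S."""
    hits = n_total - n_deletions - n_substitutions
    return 100.0 * hits / n_total
```

For instance, with 200 test words, 4 deletions and 16 substitutions, the function returns 90.0.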

CONCLUSION
An extraction method of acoustic features for speech recognition in clean and noisy environments was presented in this study. It is based on a set of 41 2-D Gabor filters, which are applied to a spectro-temporal representation to extract the proposed features. The performance of our features was tested on isolated speech words using HMMs with 5 states and 8 diagonal Gaussian mixtures per state and was compared with that of PLP coefficients and MFCC coefficients. It was shown that our Gabor features outperform the two classic features in terms of recognition results.

Fig. 1: The different steps of the proposed extraction method of acoustic features

Table 1: The recognition rates (%) of the proposed features (GMS features), PLP-features and MFCC-features in a clean environment using HMM_8_GM

Table 2: The recognition results of the proposed features (GMS features), PLP-features and MFCC-features using HMM_8_GM in the babble noise case

Table 4: The recognition results of the proposed features (GMS features), PLP-features and MFCC-features using HMM_8_GM in the station noise case