Improving speech recognition using bionic wavelet features

Bionic wavelet transform is a continuous wavelet transform based on an adaptive time-frequency technique. This paper presents a speech recognition system for recognizing isolated words by discretizing the continuous Bionic Wavelet (BW). Conversion from continuous to discrete is achieved by adopting central frequency and thresholding techniques. The BW features of the noisy signal are processed through MFCC to obtain the optimal features of the speech signal. SVM, Artificial Neural Network (ANN) and LSTM techniques are used to improve the recognition rate by enhancing the speech signals. The experiments are conducted on the FSDD and Kannada data sets. The speech feature vector is calculated using the parameters extracted by the bionic wavelet with the central frequencies of the Morlet, Daubechies, Bior3.5 and Coiflet5 mother wavelets. The obtained Bionic-MFCC optimal features are fed to the SVM, ANN and LSTM models for classification and recognition. The correct recognition rate varies from 95% to 96% among these models. The models are tested at SNR levels of 5 dB, 10 dB and 15 dB, and their recognition accuracies are presented for convoluted noisy speech data.


Introduction
One of the most important branches of speech processing is improving recognition for noisy signals, which involves both speech enhancement and speech recognition. Reducing noise in a speech signal is a very complex process. The main objective of speech enhancement is to find the optimal estimates of the speech features. To obtain efficient features, wavelet transforms are particularly useful because they are among the most prominent techniques for analyzing non-stationary speech signals in both the time and frequency domains.
Using wavelets [1], the noise can be reduced by appropriately selecting the wavelet coefficient threshold. These threshold values are subtracted from the noisy wavelet coefficients to obtain a noise-reduced signal. Since the features are computed from scalograms, they are more prominent than the features obtained from the short-time Fourier transform (STFT).
There are two types of wavelet transform: continuous and discrete. The discrete wavelet transform decomposes the signal into approximation and detail components by shifting and scaling copies of the basic wavelet to a required level. The bionic wavelet transform (BWT) is used in the present work because it resembles the auditory model of the human cochlea [2][3][4][5][6][7] and can be easily correlated with the MFCC feature extraction process. This helps in extracting the prominent features of the noisy speech signal.
In this paper, we propose an optimal feature selection procedure using BWT and MFCC for recognizing words in convoluted noisy speech data. To calculate the optimal features, the central frequencies of the Morlet [7], Daubechies, Bior and Coiflet mother wavelets are adapted to the BWT with thresholding and central frequency techniques.
The threshold on the BWT is calculated using one of the following selection methods [8]: i) Stein's unbiased estimate of risk (SURE), ii) the heuristic threshold selection rule, iii) the fixed selection rule, iv) minimax, and v) the sqtwolog threshold. To handle noise in the signal, the SURE threshold selection procedure is applied to the BWT to estimate the recognition accuracy.
The paper is organized as follows. Section 2 discusses work reported in the literature using bionic wavelets. Section 3 introduces the continuous bionic wavelet. Section 4 presents the procedure adopted for converting the continuous wavelet to a discrete wavelet. Section 5 discusses the data sets used for experimentation. The proposed system model is discussed in Section 6 with results. The performance analysis of the different classifiers is discussed in Section 7. Section 8 presents observations made during the simulation process. The last section discusses the conclusion and future enhancements.

Literature survey
Extracting optimal features plays a major role in classification and recognition. Many studies show that bionic wavelets with a Morlet mother wavelet are used for de-noising the speech signal by enhancing the signal component. At present, features can be extracted in three ways: 1) time-domain features, 2) frequency-domain features, and 3) features from the raw wave file. MFCC is the most popular frequency-domain method, and the last approach is gaining ground in machine learning models. MFCC is well suited to clean speech signals; this paper also presents a way of making it more robust for noisy data. To this end, bionic wavelets are used for de-noising and MFCC is made robust for handling convoluted noisy speech data.
The bionic wavelet is made adaptive by various methods, namely changing the "K" factor, using different hard/soft thresholding methods, and applying various base/central frequencies. The following are some related works on the application of bionic wavelets to de-noising speech data. A. Garg & O. P. Sahu [9] proposed a method to discretize the bionic wavelet using the CWT and ICWT with Morlet as the mother wavelet.
Fie Chen [10] proposed an adaptive DBWT by changing the T-function of the BWT and splitting the dyadic tiling map of the DWT, which uses quadrature-mirror filters, organized as a DBWT tiling map for decomposition. M. Talbi [11] proposed applying an entropy technique to the BWT to identify the two sub-bands having minimal entropy for each coefficient.
Cao Bin-Fang [12] proposed a bionic wavelet method with a hierarchical threshold based on Particle Swarm Optimization (PSO). The noisy speech signal is decomposed using the bionic wavelet transform, and PSO is used for threshold optimization. The high-frequency noise is separated by the bionic wavelet transform and fed as input to an adaptive filter. The experimental work illustrates speech enhancement under various SNR conditions.
A detailed analysis is made by Yang Xi, Liu et al. to understand the behavior of the bionic wavelet with additive noise at various dB levels. It clearly explains the use of the bionic wavelet with Morlet as the mother wavelet for removing noise of various dB levels from a speech signal. Yao and Zhango proposed an adaptive bionic wavelet with a Morlet mother wavelet base frequency "ωo" of 15165.4 Hz, which is suitable for the human auditory system.
Mourad [13] used MSS-MAP for the wavelet transform and evaluated it with four different tests, namely SNR, segmental SNR, Itakura and perceptual evaluation, for various types and levels of noise. A new speech enhancement procedure based on improved correlation function processing of the bionic wavelet coefficients is proposed by WU Li-ming [14].
Speech recognition for Arabic words is demonstrated by Ben-Nasr [15]. Feature extraction is done using MFCC with the bionic wavelet. To increase the recognition rate, delta-delta coefficients are used, and classification is done with a feedforward back-propagation neural network. Zehtabian [16] proposed a speech enhancement technique using the BWT and the singular value decomposition (SVD) method. The paper illustrates that SVD is better than BWT for higher SNRs.
Liu Yan [17] proposed a de-noising algorithm based on sub-band spectrum entropy with the bionic wavelet transform. They showed that the sub-band spectrum is good at detecting the end points of the speech signal, so it is used to distinguish speech from noise. The experimental work demonstrates that the sub-band entropy de-noising method is superior to the Wiener filter algorithm. Pritamdas [18] focuses on the continuous wavelet transform and thresholding of coefficients for speech enhancement, using thresholds and wavelet transform scales in an adaptive manner.
From the literature survey, it is observed that much of the reported work on the bionic wavelet for speech enhancement uses thresholding and rescaling procedures to convert continuous to discrete wavelet coefficients, and for additive noise only. In this paper, a procedure to convert the continuous wavelet to a discrete wavelet based on the central frequency is proposed. A new feature extraction technique and a procedure to reduce convoluted noise are presented.
To the best of our knowledge, this work is unique in de-noising convoluted noise at various levels. The next section describes the characteristics of the bionic wavelet.

Continuous bionic wavelet
The wavelet transform (WT) technique is an alternative to the STFT [19][20][21][22]. When the two are compared visually, the scalograms of the WT are better at representing the formant frequencies and structural harmonics of speech. Hence the WT is identified as one of the prominent methods for handling non-stationary signals. The CWT is fixed with a base scale [23] of $2^{1/m}$, where m is an integer greater than 1 and denotes the number of "voices per octave". Different scales are obtained by raising this base scale to positive integer powers, for example $2^{k/m}$ where k = 1, 2, 3, …. The translation parameter in the CWT is discretized to integer values, represented by l. The resulting discretized wavelets for the CWT are represented by Eq. 1:

$$\psi_{k,l}(n) = \frac{1}{\sqrt{2^{k/m}}}\,\psi\!\left(\frac{n-l}{2^{k/m}}\right) \qquad (1)$$
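The scale grid described above can be sketched in a few lines of Python. This is only an illustration: the function name `cwt_scales` and the choice of 4 voices per octave are assumptions made for the example, not values taken from the paper.

```python
def cwt_scales(voices_per_octave, num_scales):
    """Return the CWT scales 2**(k/m) for k = 1..num_scales.

    voices_per_octave is m, an integer greater than 1 as stated in the text.
    """
    if voices_per_octave <= 1 or num_scales < 1:
        raise ValueError("need m > 1 and at least one scale")
    base = 2.0 ** (1.0 / voices_per_octave)  # base scale 2**(1/m)
    return [base ** k for k in range(1, num_scales + 1)]


# Illustrative values only: 4 voices per octave, 8 scales (two octaves).
scales = cwt_scales(voices_per_octave=4, num_scales=8)
print([round(s, 4) for s in scales])
```

With m voices per octave, every m-th scale doubles, so the grid is finer than the dyadic grid of the DWT.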

Bionic wavelet
Bionic wavelet transform (BWT) is an adaptive wavelet transform based on a model of the active biological auditory system [24]. The decomposition of the BWT [2] is perceptually scaled and adaptive. It has the following properties:
i) High sensitivity and selectivity
ii) Signal with determined energy distribution
iii) Can be reconstructed
The resolution of the bionic wavelet transform can be achieved by adjusting the signal frequency and the instantaneous amplitude with its first-order differential values.

Realization of discrete bionic wavelet from continuous
This section discusses the mechanism adopted to convert the continuous wavelet to a discrete wavelet. The conversion relies on discrete thresholding and on the central (base) frequencies of different mother wavelets.
For Morlet, the index m of the central frequency varies from 1 to 22. For the other wavelets, the centfrq function of Matlab is used.
The wavelets possess different characteristics, so four wavelets are considered: Morlet and the following three. Db11: asymmetric, orthogonal, bi-orthogonal. Coif5: symmetric, orthogonal, bi-orthogonal. Bior3.5: symmetric, not orthogonal, bi-orthogonal. 22 scales are considered for the BW irrespective of the center frequency. These wavelets are preferred because they mimic the mel-scale mapping of the MFCC [26] procedure and are designed to match the basilar membrane spacing, i.e. they are based on a nonlinear perceptual model of the auditory system.
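As an illustration of how 22 Morlet center frequencies could be generated, the sketch below assumes the geometric spacing commonly quoted for the bionic wavelet transform, in which each band is the previous one divided by a constant ratio of 1.1623, starting from the base frequency of 15165.4 Hz mentioned in the literature survey. The ratio, the zero-based exponent, and the function name `bwt_center_frequencies` are assumptions for this example; the paper's own expression is not reproduced in the text.

```python
def bwt_center_frequencies(f0_hz=15165.4, ratio=1.1623, num_scales=22):
    """Return num_scales center frequencies f0 / ratio**j for j = 0..num_scales-1.

    ASSUMPTION: geometric spacing with ratio 1.1623, as commonly used for the
    BWT. The text indexes the scales as m = 1..22; under this assumption that
    corresponds to the exponent j = m - 1.
    """
    if ratio <= 1.0 or num_scales < 1:
        raise ValueError("ratio must exceed 1 and num_scales must be positive")
    return [f0_hz / ratio ** j for j in range(num_scales)]


freqs = bwt_center_frequencies()
print(f"highest band: {freqs[0]:.1f} Hz, lowest band: {freqs[-1]:.1f} Hz")
```

A geometric grid of this kind places more bands at low frequencies, which is the property the text relates to the mel scale and to basilar membrane spacing.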

Thresholding
This parameter decides the number of levels used to reduce the redundant information in the CWT during discretization of the wavelet. The following thresholding mechanisms are considered at various levels, chosen by trial and error, as listed in Table 1. Levels are fixed based on the obtained thresholds of the signal. The various ways of calculating the threshold are discussed below. Sqtwolog: $T = \sigma\sqrt{2\ln p}$, where σ is the mean absolute deviation (MAD) and p is the length of the noisy signal. The MAD is expressed in terms of ω, the wavelet coefficients, and k, the scale of the wavelet coefficients.
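The sqtwolog rule can be sketched as follows. The sketch assumes that σ is estimated with the conventional robust estimator, the median of the absolute coefficients divided by 0.6745, and that the threshold is applied by soft thresholding (shrinking each coefficient toward zero by the threshold). Both choices, and the function names, are assumptions for illustration; the text itself only states that σ comes from the MAD.

```python
import math
import statistics


def universal_threshold(coeffs):
    """Sqtwolog threshold sigma * sqrt(2 * ln(p)) for p coefficients.

    ASSUMPTION: sigma = median(|coeffs|) / 0.6745, the usual robust noise estimate.
    """
    sigma = statistics.median([abs(c) for c in coeffs]) / 0.6745
    return sigma * math.sqrt(2.0 * math.log(len(coeffs)))


def soft_threshold(coeffs, thr):
    """Shrink each coefficient toward zero by thr; magnitudes below thr become 0."""
    return [math.copysign(max(abs(c) - thr, 0.0), c) for c in coeffs]


# Hypothetical coefficients: small values stand in for noise, large ones for speech.
noisy = [0.1, -0.2, 0.05, 3.0, -0.1, 0.15, -2.5, 0.02]
thr = universal_threshold(noisy)
denoised = soft_threshold(noisy, thr)
print(round(thr, 4), [round(c, 4) for c in denoised])
```

Soft thresholding matches the description in the introduction, where the threshold value is subtracted from the noisy wavelet coefficients.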

Algorithm
Steps for discretizing the bionic wavelet:
Step 1: Read the speech signal.
Step 2: Multiply each value by "K" as shown in Eq. 2.
Step 3: Select the thresholding function with high SNR using the Matlab function thselect.
Step 4: Apply the base/central frequencies of the various mother wavelets using centfrq(wname).
Step 5: Divide the modified bionic wavelet coefficients by the "K" factor to get the coefficients; reconstruction is done by taking the inverse continuous wavelet transform. The approximation is done by the "K" factor using Eq. 3.
Step 6: Compute the inverse continuous wavelet transform.
Step 7: Obtain the Mel frequency cepstral coefficients for the de-noised signal [26].
The weighted features obtained from step 1 to step 8 of the algorithm make it clear that the wavelet-weighted feature values are better for both clean and noisy speech signals.

Data set
Two different datasets are considered: the free spoken digit dataset (FSDD) [27] and a Kannada dataset (Table 2), with recordings of spoken digits and words sampled at 8 kHz and 16 kHz respectively. The recordings are trimmed so that they have near-minimal silence at the beginning and end. FSDD consists of English pronunciations of the numbers from one to nine by four different speakers. In total, 900 signals are collected, with 100 signals for each digit. The second data set is an isolated-word Kannada data set. The words considered are shown in Table 3. These signals are sampled at 16 kHz and come from 30 speakers, 20 male and 10 female. 1000 word samples are collected from both genders for the Kannada data set. The signals are artificially convoluted with street noise [28] at SNRs of 5, 10 and 15 dB to create convoluted noisy speech signals.

System model for the proposed approach
In the proposed work, the obtained features are modeled for classification and recognition using machine learning models, namely SVM [29][30][31][32], ANN [15,33] and LSTM [34,35]. The overall data flow across all the models is shown in Figure 1.

General experimental setup:
The obtained features of all the signals are grouped into training and testing samples. These signals are convoluted with street noise [28] at 5 dB, 10 dB and 15 dB. The same data set is used by all the models for training and testing, so that their recognition accuracies can be compared. The results are discussed at two levels: i) the signal to noise ratio before and after the application of the bionic wavelet, and ii) the recognition accuracies of the models, compared with existing models where available.

Signal to Noise Ratio (SNR) [36]
The SNR is a good indicator of noise interference in a given signal. It is computed, in dB, before and after the application of the bionic wavelet.
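A minimal sketch of an SNR measurement in dB is given below, assuming the conventional definition: ten times the base-10 logarithm of the ratio of clean-signal power to the power of the residual between the clean and degraded signals. The function name `snr_db` and the sample values are hypothetical. For convoluted noise the residual is not a separate noise source, so this number is only an approximate quality measure.

```python
import math


def snr_db(clean, degraded):
    """SNR in dB: 10 * log10(sum(clean**2) / sum((degraded - clean)**2))."""
    if len(clean) != len(degraded) or not clean:
        raise ValueError("signals must be non-empty and of equal length")
    signal_power = sum(s * s for s in clean)
    residual_power = sum((x - s) ** 2 for s, x in zip(clean, degraded))
    if residual_power == 0.0:
        return math.inf  # identical signals: no measurable noise
    return 10.0 * math.log10(signal_power / residual_power)


# Hypothetical samples: the same clean signal before and after de-noising.
clean = [1.0, -1.0, 1.0, -1.0]
before = [1.3, -0.7, 1.3, -0.7]   # noisy input
after = [1.1, -0.9, 1.1, -0.9]    # de-noised output
improvement = snr_db(clean, after) - snr_db(clean, before)
print(round(improvement, 2), "dB")
```

Measuring the SNR before and after de-noising in this way gives the improvement in dB.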
Table 4 presents the application of different central frequencies to the bionic wavelet to reduce the noise levels. It is clear that the noise is reduced by an average of 2 dB.
Table 4. SNR for various central frequencies with their mother wavelets.
Table 5 depicts the application of bionic wavelets to convoluted noise considering the 22 scales mentioned in the literature. Comparing Tables 4 and 5, the SNR level is better for convoluted noise. Hence noise reduction is better in Table 4 than in Table 5.

Performance analysis of various classification methods
In our earlier works [36,37], the experiments were carried out on clean and noisy speech data sets with normal MFCC features. The current feature extraction procedure applies bionic wavelets to extract better features for the dataset specified in section 3. Hence, in this paper the new Bionic-MFCC features are used for recognition, with the noise reduced using discrete bionic wavelets. Experiments are performed on a standard benchmark dataset (FSDD) and the Kannada dataset. The various models and their parameters are as follows:

Support vector machine
Since the speech features are non-linear in nature, they need to be mapped to a high-dimensional space. The basic idea is that the input space is mapped into a high-dimensional feature space by a nonlinear transformation, and the optimal hyperplane is found in the new space. The optimal hyperplane must not only discriminate the different categories correctly but also maximize the categorization interval (margin) between them. This gives the support vector machine a stronger generalization capability. The target function of the nonlinear separable support vector machine is given by:

$$\min_{\omega,\,b,\,\xi}\ \frac{1}{2}\lVert\omega\rVert^{2} + C\sum_{i}\xi_i \quad \text{subject to}\quad y_i\left(\omega^{T}\phi(x_i) + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0$$

where ω represents the weight coefficient vector, b is a constant, φ is the nonlinear mapping, and $(x_i, y_i)$ are the training samples and their class labels. C denotes the penalty coefficient, which controls the degree of penalty for misclassified samples and balances the complexity of the model against the loss error. ξi represents the relaxation factor, which adjusts the number of misclassified samples allowed to exist in the classification process.
When SVM is used to solve multi-class classification problems, two strategies can be adopted: ONE-TO-ALL and ONE-TO-ONE. In this paper the ONE-TO-ALL method is applied for multi-classification. Kernel functions are also key to SVM, so polynomial and radial basis kernel functions are considered. Table 6 and Figure 2 depict the recognition performance of the SVM model, implemented with the RBF (r) and polynomial (p) kernel functions. It is observed that the Bionic-MFCC features classify the noisy signal well compared with the clean speech case proposed using Bionic-MFCC features [30]. SVM performs better with the RBF kernel function on the standard data set, whereas it fails on the Kannada data set. The polynomial function performs better on the Kannada data set, as shown in Figure 2. This indicates that kernel performance depends on the data set.
Table 6. Classification accuracy of SVM.
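The two kernels and the ONE-TO-ALL decision rule can be illustrated with a short sketch. The hyperparameters (`gamma`, `degree`, `coef0`), the function names, and the decision scores are hypothetical values chosen for the example; the text does not state the settings used in the experiments.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Radial basis kernel exp(-gamma * ||x - y||**2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def polynomial_kernel(x, y, degree=3, coef0=1.0):
    """Polynomial kernel (x . y + coef0) ** degree."""
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree


def one_vs_all_predict(scores):
    """Return the class whose binary classifier gives the largest decision value."""
    return max(scores, key=scores.get)


# Hypothetical decision values from three per-class binary SVMs.
scores = {"zero": -0.4, "one": 1.2, "two": 0.3}
print(one_vs_all_predict(scores))
print(polynomial_kernel([1.0, 2.0], [3.0, 4.0], degree=2))
```

In the ONE-TO-ALL scheme one binary classifier is trained per word, each separating that word from all the others, so nine classifiers would be needed for the nine FSDD digits.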

Neural network
In the literature, bionic wavelets with a Morlet base frequency are applied to an additive-noise Arabic speech recognition system [15,34,38] using a neural network (NN). Hence in this paper bionic wavelets are tried on convoluted noisy speech data to identify the level of noise reduction and the feature weights for recognition accuracy. The standard dataset has a good recognition rate compared to the Kannada data set. The lower performance is due to the variable word length and ambiguity in the speakers' utterances. Table 7 and Figure 3 show the recognition accuracies obtained.

NN Implemented Procedure:
The neural network model has an input layer of 9 nodes, each with 12 bionic MFCC features. Two hidden layers are used, and the output layer has 9 nodes, one for each digit (word). The softmax activation function is applied at the top of the network to obtain the output class-label probabilities. The model is optimized with the Adadelta optimizer, which adapts the learning rate based on a moving window.
Learning continues even after many updates have been made. The model is built with categorical cross-entropy as the loss for multi-class classification.
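The output stage of such a network can be sketched as follows: softmax turns the nine output activations into class probabilities, and categorical cross-entropy measures how much probability is assigned to the correct class. The activations below are made-up numbers, and the function names are illustrative only.

```python
import math


def softmax(logits):
    """Convert raw output activations into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs, true_index):
    """Loss for one sample with a one-hot target: -ln(probability of the true class)."""
    return -math.log(probs[true_index])


# Hypothetical activations of the 9 output nodes for one utterance.
logits = [0.2, 1.5, -0.3, 0.0, 3.1, 0.4, -1.0, 0.9, 0.1]
probs = softmax(logits)
predicted = probs.index(max(probs))
print(predicted, round(categorical_cross_entropy(probs, true_index=4), 4))
```

The loss approaches zero as the probability of the correct class approaches one, which is what the optimizer drives toward during training.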

LSTM
Procedure: The MFCC features are fed to the input layer to build a basic LSTM cell. Each layer is wrapped in a dropout layer with a probability value of 0.5, applied in each learning iteration. The group of dropout-wrapped LSTM cells is then fed to a MultiRNN cell, which stacks the layers together.
The CTC model helps to learn the labeling of a variable-length sequence when the input-output alignment is not known. Consider the features $m = (m_1, m_2, \ldots, m_T)$ and the label $n = (n_1, n_2, \ldots, n_U)$. The CTC model is trained by maximizing the probability of the label given the features. The loss function of the CTC model is computed as

$$L_{CTC} = -\ln P(n \mid m) = -\ln \sum_{\pi \in \Phi(n)} \prod_{t=1}^{T} P(k = \pi_t \mid m)$$

where the label sequence π ranges over all possible expanded CTC path alignments Φ of length T, and $P(k = \pi_t \mid m)$ is the label distribution at time step t.
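To make the sum over alignments concrete, the sketch below computes the CTC loss by brute force on a tiny example: it enumerates every path of length T, keeps those that collapse to the target label (merge repeated symbols, then drop blanks), and adds up their probabilities. Practical implementations use a dynamic-programming forward algorithm instead; the enumeration, the blank index 0, and the probability table are assumptions made for illustration.

```python
import itertools
import math

BLANK = 0  # ASSUMPTION: index 0 is the CTC blank symbol


def collapse(path):
    """Merge consecutive repeated symbols, then remove blanks."""
    out, prev = [], None
    for k in path:
        if k != prev and k != BLANK:
            out.append(k)
        prev = k
    return out


def ctc_loss_bruteforce(probs, label):
    """-ln of the total probability of all length-T paths that collapse to label.

    probs[t][k] is the probability of symbol k at time step t.
    """
    num_steps, num_symbols = len(probs), len(probs[0])
    total = 0.0
    for path in itertools.product(range(num_symbols), repeat=num_steps):
        if collapse(path) == list(label):
            total += math.prod(probs[t][k] for t, k in enumerate(path))
    return -math.log(total) if total > 0.0 else math.inf


# Two time steps, three symbols (blank, 1, 2), uniform distribution.
uniform = [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]]
loss = ctc_loss_bruteforce(uniform, [1])
print(round(loss, 4))
```

Here three paths, (1, 1), (1, blank) and (blank, 1), collapse to the label [1], so the label probability is 3/9 and the loss is ln 3.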
Finally, the stacked LSTM layers are embedded. The CTC [39,40] loss function and the Adadelta optimizer are used to define the model, and a single fully connected layer with the SoftMax activation function produces the labeled predictions. The activation function is given by:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

The Adadelta optimizer is used to minimize the loss by feeding the predictions to the mean squared error loss function. The accuracy metric is used for the training and testing process.
In the literature, Bi-LSTM and LSTM models have been used for speech classification [34], with accuracies of 95% and 96.58% for clean speech signals. T. Goehring et al. [41] use a recurrent neural network model for feature extraction with babble noise at 5 dB and 10 dB, with recognition accuracies of 78% and 82%, as illustrated in Table 8.
In the proposed work, the LSTM model is applied to convoluted noisy speech data. Its performance, shown in Table 8 and Figure 4, is better than the results identified in the literature. Using Bionic-MFCC features, recognition accuracy is improved by 1% compared to the Bi-LSTM model for speech data. Among the SVM, ANN and LSTM models, LSTM is better at modeling the convoluted speech data using the db11 mother wavelet.

Performance measures:
Word classification error rate is computed by
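A minimal sketch of this measure, assuming it is the percentage of misclassified words among all test words (which is consistent with the 96% accuracy and 4% word error rate reported in the conclusion), is shown below. The function name and the sample labels are hypothetical.

```python
def word_error_rate(predicted, actual):
    """Percentage of isolated words whose predicted label differs from the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    errors = sum(p != a for p, a in zip(predicted, actual))
    return 100.0 * errors / len(actual)


# Hypothetical test run: 25 utterances of the word "one", one of them misrecognized.
actual = ["one"] * 25
predicted = ["one"] * 24 + ["two"]
wer = word_error_rate(predicted, actual)
accuracy = 100.0 - wer
print(wer, accuracy)
```

For isolated-word recognition each utterance carries exactly one label, so no alignment between predicted and reference word sequences is needed.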

Observations and discussions
This section discusses the observations made on the models used for classification and recognition.

SVM:
- In SVM, the classification rate can be improved by applying different normalization methods.
- SVM performance varies with the choice of kernel function.
- Non-linear SVM kernels are well suited for classification of speech data.
ANN:
- Recognition accuracy can be increased by using a large data set and selecting an appropriate optimizer function.
- Increasing the number of hidden nodes improves the learning phase.
LSTM:
- It works on par with ANN, except that it requires a proper choice of the CTC loss function.
- A suitable selection of the cost function also helps to yield a good recognition rate.
- LSTM requires fewer features than SVM and ANN to model the data.
In general, SVM and ANN perform equally well, but not as well as LSTM. This is due to the optimality of the features obtained from the weighted values of the Bionic-MFCC features. The results of the LSTM model on the FSDD dataset are better with db11 than those of the other models because FSDD is a fine-tuned dataset. The results for the db11 wavelet at 15 dB are better because of the high signal to noise ratio of the noisy data.

Conclusion and future enhancements
In this work, a discretization procedure for the continuous bionic wavelet has been proposed for convoluted noisy speech recognition. The obtained bionic wavelet features are used for reducing the noise level in the speech data. These features are also used in MFCC to obtain the Bionic-MFCC speech features. The work also presents the improvement of MFCC features using continuous wavelets. From the results of the models, it is clear that LSTM with the DB11 wavelet at 15 dB SNR outperforms the others. It is also observed that the recognition accuracies depend on the nature of the dataset.
This work is unique in applying the continuous bionic wavelet to feature extraction using the central frequencies of the DB11, Coif5 and Bior3.5 wavelets for convoluted noisy speech data. It also demonstrates that even basic mother wavelet features can be adopted in converting continuous to discrete wavelets. Convoluted noisy speech data is very tedious to handle because the noise overlaps with the original data and its frequency is hard to identify (convolution of signal and noise). According to our study, additive noise can be completely removed using filters, but convoluted noise cannot. Hence this approach aims at reducing the noise using the continuous wavelet for isolated word recognition. As per the study, the LSTM model classifies better than the other models and improves the recognition accuracy up to 96%, with a word error rate of 4%. Hence the bionic wavelet holds up well, and it can be made adaptive by applying various thresholding concepts. From this study it is also observed that the central frequency and the thresholding concept play a major role in noise reduction as well as in the conversion of continuous to discrete wavelets. For the Kannada dataset, the word error rate is high because of variation in the speakers' pronunciation, whereas FSDD has a good recognition rate because it is a fine-tuned dataset.

Future enhancements:
Instead of thresholding, genetic algorithms can be adopted for feature reduction. The central frequencies of other wavelets can also be tried for discretization of the wavelets. The performance of the above models can be verified for different types of noise at various noise levels to identify the SNR. The models can also be extended to sentence-level recognition. DWT trees can also be used for speech enhancement by noise reduction.