Speaker Recognition Using Cepstral Coefficient and Machine Learning Technique

Speaker recognition is one of the important tasks in signal processing. In this study, we perform speaker recognition using MFCC with ELM. First, noise in the speech is removed with a low-pass filter, whose purpose is to remove the noise above the 4 kHz cutoff. After enhancement of the individual speech, a feature vector is formed using Mel-frequency Cepstral Coefficients (MFCC). MFCC is a nonlinear cepstral representation whose features are extracted using the DFT, the Mel scale and the DCT. The feature set is given to an Extreme Learning Machine (ELM) for training and testing on the individual speech for speaker recognition. Compared with other machine learning techniques, ELM provides faster training and good performance. Experimental results show the effectiveness of the proposed method.


INTRODUCTION
One of the most secure forms of biometric recognition is speaker recognition. Speaker recognition is the process of extracting features from an individual's voice in order to establish the speaker's identity. It comprises two tasks, verification and identification (Ai et al., 2012; Jain et al., 2004). Speaker verification determines whether a person is the claimed identity based on a voice sample; speaker identification determines which voice in the group of training voices matches the input voice sample.
Speech recognition is called a sister technology to speaker verification. The function of speech recognition is to correctly identify what the person says. Speaker recognition has become very popular and it is a pathway to speaker authentication. The total voice solution is a method by which an individual interacts with the system; it is formed by combining speaker recognition and speaker authentication.
Speaker recognition operates in three ways (Judith and Markowitz, 1998). The first operation is called speaker identification or speaker recognition. The second operation has many names, such as speaker verification, speaker authentication, voice recognition and voice verification. Speaker separation and speaker classification fall under the third operation.
As noted, speaker recognition is a biometric authentication process in which the human voice is the characteristic used as an attribute (Kinnunen and Li, 2010; Campbell et al., 2009).
A speaker recognition system has three fundamental stages: first, the speech signal is described in a compact manner using noise removal and feature extraction; then, those features are characterized by a statistical approach; finally, speaker classification is used to identify the unknown utterance. Several speaker identification algorithms are reviewed in Clarkson et al. (2001) and Hui-Ling and Fang-Lin (2007).
In this study, the proposed work focuses on designing techniques that effectively preserve speaker-related information in order to improve the speaker recognition system. The system is split into three modules:
• Speech enhancement is carried out to improve the voice signal or remove the unwanted signal through a low-pass filter.

LITERATURE SURVEY
A linear transformation technique is implemented in Sahidullah and Saha (2012); it preserves the speech information effectively to improve speaker recognition. Here, a block-based transformation approach is applied to the Mel filter bank log energies, and a multi-block DCT is used to form the cepstral coefficients. Better speaker recognition performance is obtained by combining both systems. Performance is evaluated on NIST SRE 2001 and NIST SRE 2004. Feature selection is one of the important tasks in speaker recognition and identification. Because a large number of features are extracted from the same frame of speech, redundancy is present in the extracted features, so removing the redundancy and selecting the best feature vector is most important. Sandipan and Gowtam (2010) proposed a feature selection technique using Singular Value Decomposition (SVD) followed by QR Decomposition with Column Pivoting (QRcp). This technique is applied to the baseline MFCC and LFCC features.
There are two approaches to speaker identification: text-dependent and text-independent. Text-independent speaker identification is proposed in Kumari et al. (2012), where the individual speaker is identified through two different feature sets: Mel Frequency Cepstral Coefficient (MFCC) and Inverted Mel Frequency Cepstral Coefficient (IMFCC) features. The individual speaker features are trained using the Expectation Maximization algorithm, and the data are tested using GMM for the two feature sets.
To improve the speaker recognition rate, the combination of two features with the traditional ones (MFCC and LPCC) is proposed in Chetouani et al. (2009); these features depend on the LP-residual signal.
A Probabilistic Neural Network (PNN) is used to recognize the speaker, but the recognition performance of this technique depends on the smoothing factor. To overcome this problem, optimization of the smoothing factor is combined with PNN in Fan-Zi and Hui (2013). A hybrid algorithm (DFOA-SOM-PNN) is proposed to improve speaker recognition: first, a Self-Organization Map (SOM) clusters the speaker features extracted through MFCC; then, the double fruit fly optimization algorithm is used to tune the smoothing factor of the PNN.

METHODOLOGY
This section describes the proposed work for speaker recognition. The block diagram of the proposed work is illustrated in Fig. 1.

Speech enhancement:
During recording, the speech signal is affected by noise or other unwanted signals. Usually, the information of the input signal is present in the lower frequencies. Noise may occur due to channel fading, loss of speech segments, echo or reverberation.
So, in this study, a low-pass filter is used to remove the noise. The filter passes the signal components below the cutoff frequency and stops those above it.
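The low-pass enhancement stage can be illustrated with a short NumPy sketch. The study does not specify the filter design, so the windowed-sinc FIR design, the 101 taps, the 16 kHz sampling rate and the function names `lowpass_fir` and `enhance` are all assumptions made for this example; only the 4 kHz cutoff comes from the text.

```python
import numpy as np

def lowpass_fir(cutoff_hz, fs_hz, num_taps=101):
    """Design a windowed-sinc low-pass FIR filter with unity gain at DC."""
    n = np.arange(num_taps) - (num_taps - 1) / 2.0   # tap positions centered on zero
    fc = cutoff_hz / fs_hz                           # cutoff in cycles per sample
    taps = 2.0 * fc * np.sinc(2.0 * fc * n)          # ideal low-pass impulse response
    taps *= np.hamming(num_taps)                     # Hamming taper limits stopband ripple
    return taps / taps.sum()

def enhance(signal, fs_hz, cutoff_hz=4000.0):
    """Pass components below cutoff_hz and attenuate those above it."""
    return np.convolve(signal, lowpass_fir(cutoff_hz, fs_hz), mode="same")

def rms(x):
    """Root-mean-square level, skipping the filter edge effects."""
    return float(np.sqrt(np.mean(x[200:-200] ** 2)))

# Hypothetical test tones: 1 kHz lies in the passband, 6 kHz in the stopband.
fs = 16000
t = np.arange(fs) / fs
kept = enhance(np.sin(2 * np.pi * 1000 * t), fs)
removed = enhance(np.sin(2 * np.pi * 6000 * t), fs)
print(rms(kept), rms(removed))
```

With these assumed settings, the 1 kHz tone should pass almost unchanged while the 6 kHz tone is strongly attenuated, which is the behaviour described for the enhancement stage.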
Feature extraction: This stage reduces the original signal to a number of features for dimensionality reduction (Bahoura, 2009). The block diagram for MFCC is given in Fig. 2.
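In this process, the original signal is first transformed from the time domain to the frequency domain using the Discrete Fourier Transform (DFT), and the power spectrum is used. Before the DFT, a Hamming window is applied to reduce the frequency distortion caused by segmentation.

Step 1: The original signal $y(m)$ is multiplied by a Hamming window $w(m)$ and the windowed speech frames are then processed by the DFT:

$$Y(k) = \sum_{m=1}^{M} y(m)\, w(m)\, e^{-j 2\pi m k / M}, \qquad 1 \le k \le M$$

In the above equation, $M$ defines the number of points in the DFT.

Step 2: The filter bank is created:

$$e(i) = \sum_{k=1}^{M/2} |Y(k)|^{2}\, \psi_i(k)$$

The above equation defines the energy spectrum $e(i)$, where the number of filters is indicated by $Q$ and $i = 1, 2, \ldots, Q$.

$$\psi_i(k) = \begin{cases} 0, & k < k_{b_{i-1}} \\ \dfrac{k - k_{b_{i-1}}}{k_{b_i} - k_{b_{i-1}}}, & k_{b_{i-1}} \le k \le k_{b_i} \\ \dfrac{k_{b_{i+1}} - k}{k_{b_{i+1}} - k_{b_i}}, & k_{b_i} \le k \le k_{b_{i+1}} \\ 0, & k > k_{b_{i+1}} \end{cases}$$

The above equation describes the $i$-th band-pass filter $\psi_i(k)$ of the triangular filter bank, where $k$ denotes the index of the $M$-point DFT and $k_{b_i}$ are the boundary points of the filters.

Step 3: The Mel scale is calculated using O'Shaughnessy's formula (Ai et al., 2012). It is given by the equation below:

$$f_{mel} = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

The boundary points of the filters are placed at equal spacing on the Mel scale:

$$k_{b_i} = \frac{M}{f_s}\, f_{mel}^{-1}\left[ f_{mel}(f_{low}) + \frac{i\,\{ f_{mel}(f_{high}) - f_{mel}(f_{low}) \}}{Q + 1} \right]$$

In the above equation, $f_s$ denotes the sampling frequency, and $f_{low}$ and $f_{high}$ denote the low and high frequency boundaries of the filter bank. The inverse transform $f_{mel}^{-1}$ is given by the equation below:

$$f_{mel}^{-1}(f_{mel}) = 700 \left( 10^{f_{mel}/2595} - 1 \right)$$

Step 4: The MFCC coefficients are calculated by passing the logarithm of the filter bank outputs to the DCT:

$$C_n = \sqrt{\frac{2}{Q}} \sum_{i=1}^{Q} \log\left[ e(i) \right] \cos\left[ n \left( i - \frac{1}{2} \right) \frac{\pi}{Q} \right], \qquad 0 \le n \le R - 1$$

where $R$ defines the number of MFCC coefficients.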
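Steps 1 to 4 can be sketched for a single speech frame in NumPy. This is an illustration under stated assumptions rather than the implementation used in the study: the 16 kHz sampling rate, the 512-point DFT, the 26 filters, the 13 coefficients, the flooring of the filter boundaries to DFT bins and all function names are hypothetical choices. The Mel mapping is the standard O'Shaughnessy form, 2595 log10(1 + f/700).

```python
import numpy as np

def hz_to_mel(f_hz):
    """O'Shaughnessy Mel scale."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(f_mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(f_mel, dtype=float) / 2595.0) - 1.0)

def mel_filter_bank(num_filters, n_fft, fs_hz, f_low=0.0, f_high=None):
    """Triangular filters whose boundary points are equally spaced in Mel."""
    f_high = fs_hz / 2.0 if f_high is None else f_high
    mel_points = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), num_filters + 2)
    bins = np.floor(n_fft * mel_to_hz(mel_points) / fs_hz).astype(int)
    bank = np.zeros((num_filters, n_fft // 2 + 1))
    for i in range(1, num_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising edge of the triangle
            bank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge, equal to 1 at the center
            bank[i - 1, k] = (right - k) / (right - center)
    return bank

def mfcc(frame, fs_hz, num_filters=26, num_ceps=13, n_fft=512):
    windowed = frame * np.hamming(len(frame))                      # Step 1: Hamming window
    power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2              # Step 1: DFT power spectrum
    energies = mel_filter_bank(num_filters, n_fft, fs_hz) @ power  # Steps 2 and 3
    log_energies = np.log(energies + 1e-12)                        # small offset avoids log(0)
    i = np.arange(num_filters)
    n = np.arange(num_ceps)[:, None]
    dct = np.sqrt(2.0 / num_filters) * np.cos(np.pi * n * (2 * i + 1) / (2 * num_filters))
    return dct @ log_energies                                      # Step 4: DCT

# Hypothetical 25 ms frame containing a 440 Hz tone.
fs = 16000
frame = np.sin(2 * np.pi * 440 * np.arange(400) / fs)
coeffs = mfcc(frame, fs)
print(coeffs.shape)
```

In a full system, the frame-level vectors returned by `mfcc` would be computed over overlapping frames of the enhanced speech and collected into the feature set passed to the classifier.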

Extreme Learning Machine (ELM) for speaker recognition:
Extreme Learning Machine is a useful statistical tool for machine learning and has been successfully applied to pattern recognition tasks. ELM was proposed by Huang et al. and was developed for Single Hidden Layer Feedforward Networks (SLFNs) with a wide variety of hidden nodes. The network can be represented as a linear system, which yields the smallest training error and good performance. ELM has several variants, such as optimization-method-based ELM (Sahidullah and Saha, 2012), regularized ELM and kernelized ELM (Sandipan and Gowtam, 2010). Consider $N$ training samples $\{(x_1, t_1), \ldots, (x_N, t_N)\}$, where $x_i \in \mathbb{R}^{I}$ and $t_i \in \{-1, 1\}$; the SLFN has $I$ input neurons and $L$ hidden neurons, as illustrated in Fig. 3.

The equation below gives the function for binary classification:

$$f(x) = \operatorname{sign}\big( h(x)\, \beta \big)$$

In the above equation, the weights connecting the hidden neurons and the output neuron are collected in the vector $\beta = [\beta_1, \ldots, \beta_L]^{T}$. The output of the hidden layer with respect to the input $x$ is given by $h(x) = [h_1(x), \ldots, h_L(x)]$. Each hidden node is defined by a nonlinear piecewise continuous function $G(a, b, x)$ through the following equation:

$$h_j(x) = G(a_j, b_j, x), \qquad a_j \in \mathbb{R}^{I},\ b_j \in \mathbb{R},\ j = 1, \ldots, L$$

The universal approximation capability theorems are satisfied by this nonlinear piecewise continuous function.

The hidden-layer output matrix $H$ is defined by the equation below:

$$H = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_N) \end{bmatrix} = \begin{bmatrix} h_1(x_1) & \cdots & h_L(x_1) \\ \vdots & \ddots & \vdots \\ h_1(x_N) & \cdots & h_L(x_N) \end{bmatrix}$$

Training the ELM minimizes $\| H\beta - T \|$ and $\| \beta \|$, in which $T = [t_1, t_2, \ldots, t_N]^{T}$. The solution to the problem can be calculated as the minimum-norm least-squares solution of the linear system:

$$\hat{\beta} = H^{\dagger} T$$

EXPERIMENTAL RESULTS
The NTT database (Matsui and Furui, 1993) and the large-scale Japanese Newspaper Article Sentences (JNAS) database (Itou et al., 1999) were used to evaluate the proposed method. The proposed work is compared with the previous work, which is classified by RVM, and with some other existing methods through performance metrics such as accuracy. Speaker verification performance is reported using the True Positive (TP) samples and True Negative (TN) samples:
TP: Abnormal class classified as abnormal
TN: Normal class classified as normal
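The TP and TN counts translate into rates and an overall accuracy as in the sketch below. The label convention (+1 for the positive class, -1 for the negative class), the function name `verification_metrics` and the toy labels are assumptions for illustration; they are not data from the experiments.

```python
import numpy as np

def verification_metrics(y_true, y_pred):
    """Return true positive rate, true negative rate and accuracy for +1/-1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))    # positive classified as positive
    tn = int(np.sum((y_true == -1) & (y_pred == -1)))  # negative classified as negative
    fp = int(np.sum((y_true == -1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == -1)))
    return {
        "tp_rate": tp / (tp + fn),
        "tn_rate": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }

# Hypothetical labels: 4 positive and 4 negative trials, one error in each class.
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1, 1, -1, 1]
metrics = verification_metrics(y_true, y_pred)
print(metrics)  # all three values are 0.75 for this toy example
```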

CONCLUSION
In this study, we have investigated an MFCC-ELM approach for speaker recognition, carried out in three steps. First, noise is removed through a low-pass filter with a 4 kHz cutoff, which improves the speaker recognition accuracy. Second, standard MFCC features are extracted using filters linearly spaced on the Mel scale. Third, classification for speaker recognition is performed by ELM, which is well suited to acoustic signals and leads to higher recognition accuracy than that of the other systems. The comparison with existing methods also demonstrated the performance of the proposed method: compared with the previously proposed method of DT-CWT with RVM and with the existing method, the proposed method provides better results. The results have demonstrated the effectiveness of the proposed method for speaker recognition in terms of accuracy.

Fig. 3: ELM with I input neurons and L hidden neurons

After this process, a filter bank is used to warp the frequency from the hertz scale to the Mel scale. Finally, the Discrete Cosine Transform (DCT) is used to extract the feature vectors from the logarithm of the Mel-scale power spectrum.

In the minimum-norm least-squares solution $\hat{\beta} = H^{\dagger} T$, $H^{\dagger}$ denotes the Moore-Penrose generalized inverse of the matrix $H$. Because the output weights are computed analytically, ELM has a fast training phase as well as good performance.

ELM algorithm:
Input: Training set, hidden node activation function, number of hidden nodes
Output: Weight vector
Step 1: Hidden node parameters are randomly generated.
Step 2: for i = 1 : L do
    randomly assign $a_i$, $b_i$
Step 3: end
Step 4: The hidden-layer output matrix H is calculated.
Step 5: for i = 1 : N do
    for j = 1 : L do
        H(i, j) = G($a_j$, $b_j$, $x_i$)
    end
end
Step 6: Finally, the output weight vector β is calculated.
Step 7: $\hat{\beta} = H^{\dagger} T$
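The ELM algorithm above can be sketched in NumPy for the binary case. This is a minimal illustration, not the implementation used in the experiments: the sigmoid activation, the Gaussian random initialization, the 20 hidden nodes, the function names and the two-cluster toy data are all assumptions chosen for the example.

```python
import numpy as np

def hidden_output(X, a, b):
    """Hidden-layer output matrix, H[i, j] = G(a_j, b_j, x_i) with a sigmoid G."""
    return 1.0 / (1.0 + np.exp(-(X @ a + b)))

def elm_train(X, T, num_hidden, rng):
    a = rng.standard_normal((X.shape[1], num_hidden))  # Steps 1-3: random input weights
    b = rng.standard_normal(num_hidden)                # Steps 1-3: random biases
    H = hidden_output(X, a, b)                         # Steps 4-5: hidden-layer matrix
    beta = np.linalg.pinv(H) @ T                       # Steps 6-7: beta = pinv(H) T
    return a, b, beta

def elm_predict(X, a, b, beta):
    """Binary decision f(x) = sign(h(x) beta)."""
    return np.sign(hidden_output(X, a, b) @ beta)

# Hypothetical toy data: two well-separated 2-D clusters labelled -1 and +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, (100, 2)), rng.normal(2.0, 0.5, (100, 2))])
T = np.concatenate([-np.ones(100), np.ones(100)])

a, b, beta = elm_train(X, T, num_hidden=20, rng=rng)
accuracy = float(np.mean(elm_predict(X, a, b, beta) == T))
print(accuracy)
```

Only the output weights are fitted, through a single pseudoinverse, which is why training is fast. For recognition among several speakers, one common extension is to use one target column per speaker and take the largest output instead of the sign.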

Table 1 gives the true positive and true negative rates for the proposed method. Table 2 provides the speaker recognition rate. The existing approach of Saeidi et al. (2009) for different genders provides an accuracy of 93%, and the previously proposed method of DT-CWT with RVM gives a recognition rate of 93.5%; compared with this RVM-based method, the proposed MFCC-ELM gives a better result of 98.4%. The results clearly show that the proposed MFCC-ELM method gives a better speaker recognition rate than both the previously proposed method and the existing method.