Text Independent Automatic Speaker Recognition System Using Mel-Frequency Cepstrum Coefficient and Gaussian Mixture Models



Introduction
In recent decades, interest in security systems has grown steadily. These systems are very useful because they allow security to be managed efficiently, reducing the need for human resources. Most of them implement an access control system [1][2][3][4]. In particular, a large number of research efforts have been directed at the speaker recognition problem [5][6][7][8][9][10][11][12][13][14][15]. In many strategic places it is vitally important to verify the identity of the people involved. A simple way to verify a person's identity is to analyze his or her voice. Voice-based recognition systems are biometric systems that allow access control in a fast and minimally intrusive way, requiring little collaboration from the people involved.
The human voice is peculiar to each person, and this is due to the anatomy of the phonation apparatus. The vocal tract consists of three main cavities: the oral cavity, the nasal cavity and the pharyngeal cavity [16]. The nasal cavity is essentially bony, hence static in time; furthermore, it can be isolated by the soft palate. The oral cavity is formed by the bony structure of the palate and by the soft palate; its shape can be altered significantly by the movement of the jaw, lips and tongue. The pharyngeal cavity extends to the bottom of the throat and can be compressed by retracting the base of the tongue towards the wall of the pharynx. In its lower part it ends with the vocal cords: a pair of fleshy membranes traversed by the air coming from the lungs. During the production of a sound, the space between the membranes (the glottis) can be completely open or partially closed.
Due to the peculiarity of the voice formation apparatus, it is possible to recognize a particular individual from his or her voice. In addition, this operation can be performed automatically [13][14][15]. In the literature, this problem is referred to as Automatic Speaker Recognition (ASR) [17], and it is widely discussed by the research community [13][14][15].
Speaker recognition is classified as a hybrid biometric recognition approach, since it has two components: a physical one, related to the anatomy of the vocal apparatus, and a behavioral one, related to the mood of the speaker at the moment of recording [15].
In this paper, we propose a text-independent ASR system based on Mel-Frequency Cepstrum Coefficients (MFCC) [18,19] and Gaussian Mixture Models (GMM) [20][21][22]. The model parameters are estimated according to the maximum likelihood criterion using the Expectation-Maximization (EM) algorithm [23,24]. The novel combination of these two techniques allows the system to reach high recognition rates and high operating speeds, as shown in the following sections, which makes the proposed system usable in real security contexts. In addition, unlike other works on ASR presented in the literature, a spectral subtraction algorithm [25,26] is also used, because the recorded speech signal could be corrupted by additive environmental noise. Comparisons with the state of the art demonstrate the effectiveness of the proposed approach in terms of accuracy rate.
Data acquisition can be performed with simple microphones, which are widely available and of negligible cost. However, cheap instrumentation may be more affected by disturbances such as background noise, and the spectral subtraction algorithm may no longer be sufficient for effective noise suppression.
The paper is organized as follows: Section 2 describes the ASR problem. Section 3 introduces the MFCC technique, while Section 4 introduces the GMM models. Section 5 describes the proposed ASR system and Section 6 shows the experimental results. Finally, Section 7 concludes the work.

System Description
A biometric recognition system generally consists of:
• A sensor that acquires and samples the data: in this specific case the sensor is a microphone, possibly with a high Signal-to-Noise Ratio (SNR). Since the input signal is essentially speech, the sampling rate is usually set to 8 kHz;
• A preprocessing step, which in the voice context consists of signal cleaning: a simple denoising algorithm can be applied to the recorded data after a normalization procedure. In this paper, a spectral subtraction algorithm [23,24] is used to clean the recorded speech signal of additive environmental noise;
• The extraction of the peculiar characteristics (feature extraction): in this stage the Mel-frequency cepstral coefficients are computed using a Mel filter bank, after the frequency axis has been transformed into a logarithmic one;
• The generation of a specific template for each speaker: in this work we use Gaussian Mixture Models (GMM), whose parameters are estimated by maximum likelihood using the Expectation-Maximization (EM) algorithm;
• If the user is registering with the system for the first time (enrollment), this template is added to the database, using some database programming techniques;
• Otherwise, when testing against users already present in the database, a comparison (matcher) determines which profile matches the template generated from the test speech. The matcher uses a similarity test, which yields a ratio value that is accepted if it is higher than a decision threshold.
The typical ASR system is shown in Figure 1.
The technologies used for the development of the biometric system are MFCC for feature extraction and GMM for the statistical analysis of the extracted data, for template generation and for the comparison.
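The spectral subtraction step mentioned in the preprocessing stage above can be illustrated on a single frame. This is a minimal sketch of basic magnitude spectral subtraction, not necessarily the variant used in this work: the function names, the naive DFT and the `floor` parameter are assumptions made for illustration. The noise magnitude spectrum is assumed to have been estimated beforehand, for example from speech-free frames.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT; returns the real part of each output sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
             for k in range(n)) / n).real
        for t in range(n)
    ]


def spectral_subtraction(frame, noise_magnitude, floor=0.0):
    """Subtract an estimated noise magnitude spectrum from one frame.

    The phase of the noisy frame is kept unchanged. Magnitudes that would
    become negative are clamped to `floor` (hypothetical parameter).
    """
    cleaned = []
    for value, noise in zip(dft(frame), noise_magnitude):
        magnitude = max(abs(value) - noise, floor)
        cleaned.append(cmath.rect(magnitude, cmath.phase(value)))
    return idft(cleaned)
```

With an all-zero noise estimate the frame is returned unchanged, which is a convenient sanity check before plugging in a real noise estimate.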

Mel Frequency Cepstral Coefficient
The term "cepstrum" is a pun in which the first four letters of the term "spectrum" are reversed. It was introduced in 1963 by Bogert et al. [27]. The cepstrum is defined as the inverse Fourier transform of the logarithm of the spectrum of a signal [28,29]:
$$c(n) = \mathcal{F}^{-1}\left\{\log\left|\mathcal{F}\{y(n)\}\right|\right\}$$
The cepstrum transforms the signal from the frequency domain into the quefrency domain.
When the cepstrum is applied to voice, its strength lies in its ability to separate the excitation from the transfer function. For a signal y(n) based on the source-filter model (in this context, the vocal cords and the vocal tract, respectively), the cepstrum allows the separation of the two terms in
$$y(n) = x(n) * h(n)$$
where the source x(n) passes through a filter described by the impulse response h(n) and $*$ denotes convolution. The spectrum of y(n) obtained by the Fourier transform is
$$Y(k) = X(k)\,H(k)$$
where k is the index of the discrete frequencies, i.e. the product of two spectra, the source one and the filter one respectively. Separating these two spectra directly is complicated. It is possible, however, to separate the real envelope of the filter from the rest of the spectrum by attributing all the phase to the source. The cepstrum relies on the property of the logarithm that turns the product of its arguments into a sum of logarithms.
Starting from the logarithm of the modulus of the spectrum:
$$\log|Y(k)| = \log|X(k)| + \log|H(k)|$$
it is possible to separate the fast oscillating component from the slow one, by means of a high-pass and a low-pass filter respectively, obtaining:
$$c(n) = \mathcal{F}^{-1}\{\log|Y(k)|\} = \mathcal{F}^{-1}\{\log|X(k)|\} + \mathcal{F}^{-1}\{\log|H(k)|\}$$
that is the signal cepstrum in the quefrency domain. The low quefrencies carry the information about the transfer function, while the high quefrencies carry the information about the excitation.
Hence the initial pressure wave created by the vocal cords and shaped by the throat, nose and mouth can be analyzed as the sum of a source function (given by the excitation of the vocal cords) and a filter (throat, nose, mouth). The separation between high and low quefrencies can be obtained with a high-pass lifter (filter) for the fast oscillation and a low-pass lifter for the slow one.
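The additive behaviour of the cepstrum and the liftering operation can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the real cepstrum is computed with a naive DFT, the small constant `eps` guards against taking the logarithm of zero, and the names `real_cepstrum`, `lifter` and `cutoff` are hypothetical and do not come from this work.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_cepstrum(signal, eps=1e-12):
    """Inverse DFT of the log-magnitude spectrum of a real signal."""
    n = len(signal)
    log_mag = [math.log(abs(v) + eps) for v in dft(signal)]
    # The log-magnitude spectrum of a real signal is real and even,
    # so its inverse DFT reduces to a sum of cosines.
    return [sum(log_mag[k] * math.cos(2 * math.pi * k * q / n)
                for k in range(n)) / n
            for q in range(n)]


def lifter(cepstrum, cutoff, keep="low"):
    """Keep the low quefrencies (envelope) or the high ones (excitation).

    The real cepstrum is symmetric, so indices q and n - q are treated
    as the same quefrency.
    """
    n = len(cepstrum)
    out = []
    for q, c in enumerate(cepstrum):
        is_low = min(q, n - q) < cutoff
        out.append(c if is_low == (keep == "low") else 0.0)
    return out
```

For a signal obtained by circular convolution of a source and a filter, the cepstrum returned by `real_cepstrum` equals the sum of the two individual cepstra up to numerical error, which is the property the liftering step relies on.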
Psychoacoustic studies [30][31][32] have shown that human perception of the frequency content of sound follows a nearly logarithmic scale, the Mel scale, which is linear up to 1 kHz and logarithmic thereafter:
$$\mathrm{mel}(f) = \begin{cases} f, & f \le 1\ \mathrm{kHz} \\ 2595\,\log_{10}\left(1 + \dfrac{f}{700}\right), & f > 1\ \mathrm{kHz} \end{cases}$$
with f expressed in Hz. The Mel scale is shown in Figure 2, where the compression of the Mel scale (y-axis) with respect to the Hertz scale (x-axis) is evident for frequencies above 1 kHz. On this scale, pitches are judged by listeners to be equally spaced from one another.
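The Hertz-to-Mel mapping can be written as a short function. The sketch below assumes the commonly used constants 2595 and 700 for the logarithmic branch and a linear branch below 1 kHz, matching the description above; the function name `hz_to_mel` is hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Map a frequency in Hz to mels: linear up to 1 kHz, logarithmic above.

    ASSUMPTION: the constants 2595 and 700 are the commonly used ones;
    with them the two branches meet (approximately) at 1000 Hz.
    """
    if f_hz <= 1000.0:
        return f_hz
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)
```

Doubling the frequency from 2 kHz to 4 kHz adds far fewer than 2000 mels, which is the compression visible on the upper part of the scale.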
The Mel-cepstrum estimates the spectral envelope of the output of the filter bank. Let $Y_n$ represent the logarithm of the output energy of channel n, with n = 1, ..., N; applying the discrete cosine transform (DCT) we obtain the Mel-frequency cepstral coefficients through the equation:
$$c_k = \sum_{n=1}^{N} Y_n \cos\left[k\left(n - \frac{1}{2}\right)\frac{\pi}{N}\right], \qquad k = 0, \ldots, K$$
The simplified spectral envelope is rebuilt with the first $K_m$ coefficients, with $K_m < K$:
$$C(\mathrm{mel}) = \sum_{k=1}^{K_m} c_k \cos\left(2\pi k\,\frac{\mathrm{mel}}{B_m}\right)$$
where $B_m$ is the bandwidth analyzed in the Mel domain and $K_m = 20$ is a typical value. The coefficient $c_0$ is proportional to the mean value in dB of the energy of the filter bank channels; hence it is directly related to the energy of the sound and can be used to estimate it.
Schematically, the coefficients are derived as follows:
1. the spectrum of the original signal is computed with the Fourier transform;
2. the obtained spectrum is mapped onto the Mel scale using appropriate overlapping windows;
3. the logarithm of each resulting filter output is calculated;
4. the discrete cosine transform (DCT) is calculated;
5. the coefficients are the amplitudes of the resulting spectrum.
The DCT is chosen in order to emphasize the low quefrencies.
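The last two steps of the procedure, from log filter-bank energies to cepstral coefficients, can be sketched as follows. The sketch assumes the unnormalized DCT-II form that is commonly used for MFCC, with channels indexed from 1 to N; the function name `mel_cepstrum` and its arguments are hypothetical.

```python
import math


def mel_cepstrum(log_energies, max_order):
    """Cepstral coefficients c_0 .. c_max_order from log filter-bank energies.

    Implements c_k = sum_{n=1..N} Y_n * cos(k * (n - 1/2) * pi / N),
    where Y_n is the log energy of channel n (ASSUMED DCT-II form).
    """
    n_channels = len(log_energies)
    return [
        sum(y * math.cos(k * (n - 0.5) * math.pi / n_channels)
            for n, y in enumerate(log_energies, start=1))
        for k in range(max_order + 1)
    ]
```

For a flat set of log energies only the coefficient of order 0 is non-zero, consistent with its interpretation as an overall energy term.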

Gaussian Mixture Model
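Any probability density function (pdf) can be approximated by a linear combination of unimodal Gaussian densities [20]. Under this assumption, Gaussian mixture models have been applied to model the distribution of a sequence of vectors, each of dimension D, containing the characteristics extracted from the voice of a subject, according to:
$$p(\mathbf{x}\mid\lambda) = \sum_{i=1}^{M} w_i\, p_i(\mathbf{x}) \tag{6}$$
where $w_i$ are the mixture weights of the corresponding unimodal Gaussian densities $p_i$, with $i = 1, \ldots, M$, and:
$$p_i(\mathbf{x}) = \frac{1}{(2\pi)^{D/2}\,|\Sigma_i|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^{T}\,\Sigma_i^{-1}\,(\mathbf{x}-\boldsymbol{\mu}_i)\right\}$$
The mixture weights satisfy the constraint:
$$\sum_{i=1}^{M} w_i = 1$$
Each speaker is identified by a model $\lambda$ obtained from the GMM analysis. In particular, $\lambda$ is defined as:
$$\lambda = \{w_i, \boldsymbol{\mu}_i, \Sigma_i\}, \qquad i = 1, \ldots, M$$
where $\boldsymbol{\mu}_i$ is the mean vector and $\Sigma_i$ is the covariance matrix of the i-th component.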
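Evaluating the mixture density for one feature vector can be sketched as below. The sketch assumes diagonal covariance matrices, a common simplification that this work does not explicitly state, and works in the log domain for numerical stability; the function names are hypothetical.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log-density of a D-dimensional Gaussian.

    ASSUMPTION: the covariance is diagonal and `var` holds its diagonal.
    """
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    mahalanobis = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + mahalanobis)


def gmm_log_pdf(x, weights, means, variances):
    """log p(x | lambda) = log sum_i w_i * p_i(x), using log-sum-exp."""
    terms = [math.log(w) + gaussian_log_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

Working with logarithms avoids the underflow that would occur when multiplying many small densities, which matters once the per-frame values are accumulated over a whole utterance.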
Given a sequence of feature vectors of the speaker to be identified, the model parameters $\lambda$ are estimated by maximum likelihood using the Expectation-Maximization algorithm [23,24]. The model $\lambda$ is compared with a sequence of feature vectors $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_T\}$ by calculating the log-likelihood [23]:
$$\log p(X\mid\lambda) = \sum_{t=1}^{T} \log p(\mathbf{x}_t\mid\lambda)$$
The decision is taken by means of a similarity test based on the following ratio:
$$\Lambda(X) = \frac{p(X\mid\lambda_c)}{p(X\mid\lambda_{\bar{c}})} \ge \theta$$
where $\theta$ is the decision threshold, $\lambda_c$ is the model of the speaker under consideration and $\lambda_{\bar{c}}$ is, on the contrary, a collection of models of different speakers. The final score of a certain subject c over a sequence X containing the voice features of the test is given by:
$$\Lambda_c(X) = \log p(X\mid\lambda_c) - \log p(X\mid\lambda_{pop})$$
where $\Lambda_c(X)$ represents the similarity value of X with respect to c compared with the characteristics of the other individuals in the database (pop), excluding the one taken into account.
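The scoring described above can be sketched starting from per-frame log-densities. How the population term is computed is not specified here, so the sketch assumes it is the logarithm of the average likelihood of all the other speaker models; this choice, the function names and the dictionary layout are all hypothetical.

```python
import math


def sequence_log_likelihood(frame_log_pdfs):
    """log p(X | lambda) as the sum of per-frame log-densities."""
    return sum(frame_log_pdfs)


def normalized_score(log_likelihoods, claimed):
    """Log-likelihood of the claimed speaker minus the population term.

    `log_likelihoods` maps speaker name -> log p(X | lambda_speaker).
    ASSUMPTION: the population term is the log of the average likelihood
    of every model except the claimed one.
    """
    others = [ll for name, ll in log_likelihoods.items() if name != claimed]
    peak = max(others)
    log_pop = peak + math.log(
        sum(math.exp(ll - peak) for ll in others) / len(others))
    return log_likelihoods[claimed] - log_pop


def accept(score, threshold):
    """Accept the claimed identity when the score exceeds the threshold."""
    return score > threshold
```

Because the score is a difference of logarithms, a threshold of 0 corresponds to a likelihood ratio of 1 under the assumed population term.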

System Implementation
In the pre-processing phase, the signal was enhanced using spectral subtraction [33,34] and segmented into relatively short, partially overlapping (50%) frames. Frames not containing voice were skipped. Each frame is shorter than 20 ms, so that the signal it contains can be considered stationary. Each frame was multiplied by a Hamming window to minimize the discontinuities at its edges. For each frame, 20 MFCC were calculated. The resulting data represent the characteristics of a speaker. This information, organized in a matrix containing one vector of Mel-cepstral coefficients per frame, is analyzed by a GMM with 32 mixtures. The result is a set of statistical data characterized by a mean vector, a covariance matrix and a weight vector, which constitute the template itself. The template is used when a speaker is added to the system or in the test step among the users already registered.
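The framing and windowing step can be sketched as follows. The frame length of 128 samples used in the usage note (16 ms at an 8 kHz sampling rate) is an assumption chosen to satisfy the "shorter than 20 ms" condition, and the function names are hypothetical; the skipping of non-speech frames is not included.

```python
import math


def hamming(length):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2 * pi * n / (length - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def split_into_frames(signal, frame_len, overlap=0.5):
    """Cut `signal` into overlapping frames and apply a Hamming window.

    Trailing samples that do not fill a whole frame are discarded.
    """
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames
```

For example, `split_into_frames(samples, 128)` advances by 64 samples per frame, so consecutive frames share half of their samples.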
The public voice database Voxforge.org [35] was used to validate the system. Voxforge is an internet community of researchers and "donors" of human voice. Its stated aim is to support those who intend to build and test an automatic speaker recognition system, a speech recognition engine, or any application related to the analysis, the recognition and, more generally, the study of the human voice. Anyone can register on the website and submit voice recordings to be made available to the whole community. For this study, the utterances of 450 speakers were randomly extracted from the Voxforge website. For each speaker two speech recordings were used: the first for the training phase and the second to test the system. Since the recognition system is text independent, each recording contains different words (typically paragraphs read from popular books).
In the training step, each template generated from the analysis of the speakers' utterances is stored in the system. This set of information represents the knowledge base of the system. In the test stage, the test template of each speaker is compared with the whole knowledge base of the system, i.e. all the templates stored in the training phase. This comparison is performed using the log-likelihood criterion previously described. The output of the test phase is a matrix containing the similarity estimate of each test with respect to each profile stored in the system. The matrix is structured as follows: row i represents the i-th test and column j the j-th training profile. Hence position (i,j) contains the log-likelihood of test speaker i with respect to training speaker j. Since the comparison is based on the log-likelihood, for each row (test) the system nominates the column (speaker in the system) containing the maximum value as the owner of the speech.
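The decision rule on the score matrix can be sketched in a few lines. The matrix is assumed to be a list of rows (one per test) holding log-likelihood values, and the function names `identify` and `accuracy` are hypothetical.

```python
def identify(score_matrix):
    """For each test (row), return the index of the best-matching profile."""
    return [max(range(len(row)), key=row.__getitem__) for row in score_matrix]


def accuracy(predicted, truth):
    """Fraction of tests whose nominated profile is the true speaker."""
    hits = sum(1 for p, t in zip(predicted, truth) if p == t)
    return hits / len(truth)
```

When test i and training profile i belong to the same speaker, as in the setup described above, the true labels are simply `list(range(len(score_matrix)))`.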

Experimental Results
As shown in Table 1, there were 433 correct identifications out of 450 subjects, which corresponds to an accuracy rate of 96.22%. Since the system produces a ranking of candidate owners for each test, if a correct speaker ranked among the top five were accepted as a good result, the recognition rate would rise to 97.78%.
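The two rates can be checked with a short calculation, and the top-five criterion can be expressed as a rank-based accuracy. In the sketch below, the count of 440 correct top-five results is not stated in the text; it is inferred from 97.78% of 450 subjects. The function names are hypothetical.

```python
def percent(hits, total):
    """Recognition rate in percent, rounded to two decimals."""
    return round(100.0 * hits / total, 2)


def top_k_accuracy(score_matrix, true_columns, k):
    """Fraction of tests whose true profile is among the k best-scoring ones."""
    hits = 0
    for row, truth in zip(score_matrix, true_columns):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_columns)


# 433 correct identifications out of 450 subjects (from the text).
rank_1_rate = percent(433, 450)
# ASSUMPTION: 440 hits is inferred from the reported 97.78%, not stated.
rank_5_rate = percent(440, 450)
```

With `k = 1` the function `top_k_accuracy` reduces to the plain identification accuracy based on the maximum of each row.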
With regard to time performance, it should be taken into account that the complete computation test involves the processing of the training data, the processing of the test data and the comparison between training and test data. Obviously, it is also possible to perform a single test and compare it with the profiles in the system. These timing results are specific to the database used, since the system can run on audio files of variable size, speech length and sampling rate. The time performance is reported in Table 2.

Comparison with the State of the Art
This section discusses the main speaker recognition systems found in the scientific literature. In 1995 Reynolds [36] implemented an identification system based on spectral variability, obtaining a 96.80% accuracy rate with 49 speakers. In 2009 Revathi, Ganapathy and Venkataramani [37] through an iterative clustering ap-

Conclusions
In this paper we have introduced an ASR system based on MFCC and GMM. The accuracy of the proposed system is greater than 96% with 450 speakers.
Its features, namely a high recognition rate on a large number of subjects together with a high operating speed, make it useful for real security access control applications.