A STUDY OF ACOUSTIC FEATURES FOR EMOTIONAL SPEAKER RECOGNITION IN I-VECTOR REPRESENTATION

Recognition of emotions has recently become very important in the field of speech and speaker recognition. This paper presents an experimental investigation of the acoustic features best suited to gender-dependent speaker recognition from emotional speech. Four feature sets, LPC (Linear Prediction Coefficients), LPCC (Linear Prediction Cepstral Coefficients), MFCC (Mel-Frequency Cepstral Coefficients) and PLP (Perceptual Linear Prediction) coefficients, were compared in an experimental speaker recognition system based on the i-vector representation. The system was evaluated on emotional speech recordings from a newly created Slovak emotional database, with the Mahalanobis distance metric as the scoring method. The experiments showed the MFCC representation to be the best suited for speaker verification from Slovak emotional speech, with a recognition rate higher than 80%.


INTRODUCTION
Emotions play a very important part in everyday human communication. Since human interaction is multimodal, it covers not only speech but also gestures, facial expressions and "body language". Emotions supply this kind of communication with important background information, according to which we, as people, are able to identify the meaning of the explicitly spoken message [1].
The same principle of emotion recognition plays an essential role in inter-cognitive human-computer interaction (HCI) [2] [3]. One of the many HCI applications of emotion recognition is automatic tutoring, where a tutor (an agent) may guide the study process according to the emotional expressions of the students. Another application is in systems that alert a user to signs of emotion that call for attention. In forensics, emotion recognition can be a useful supplement to a polygraph or to speaker verification, and in medicine it can help in the early diagnosis of neurological disorders and diseases.
This article focuses on speaker recognition from emotional speech. Speaker recognition is relevant to applications where access is controlled through the speaker's voice (e.g. security control applications, telephone-based access, etc.). Since everyday speech is not produced only in a neutral emotional state, emotion recognition has become very important in this field as well. Experiments focused on this objective have been reported in the literature [4] [5]. They used emotional databases in foreign languages for speaker recognition, and their results showed the contribution of emotional speech to the speaker recognition process.
Several emotional databases in foreign languages were available for this work, namely the EMA (Electromagnetic Articulography) database [6], EMO DB (Berlin emotional database) [7], EESC (Estonian emotional speech corpus) [8] and BAUM-2 [9]. None of these corpora provided the volume of emotional data per speaker needed for proper training of a speaker model. We therefore decided to create our own native emotional dataset, which is presented in this work.
Many different techniques may be used in automatic speaker verification [10]. Methods such as Gaussian Mixture Models based on a Universal Background Model (GMM-UBM) [11], Eigenvoices [12] and Eigenchannels [13] belong to the most widely applied generative models. Joint Factor Analysis (JFA) [14], nowadays the most powerful speaker verification technique, also belongs to the category of GMM-UBM models.
JFA is based on high-dimensional GMM supervectors, which it uses for modelling inter-speaker variability and for channel/session compensation. The results in [15] showed that the channel space of JFA contains speaker information as well. Accordingly, Dehak et al. [16] proposed the concept of so-called i-vectors. The main idea of i-vectors is to use only one low-dimensional subspace of the GMM supervector space. Because this space represents both speaker and channel variability, it is called the total variability space.
To separate speaker and session information in the total variability space, a proper normalization technique has to be used. Session information is mostly suppressed with LDA (Linear Discriminant Analysis) [17], WCCN (Within-Class Covariance Normalization) [18] or NAP (Nuisance Attribute Projection) [19], or with the extended version of NAP known as Eigen Factor Radial (EFR) normalization.
This article is organized as follows. Section 2 describes the recognition system, and section 3 introduces the native Slovak emotional speech database, SUS. Section 4 is dedicated to the actual experiment, and section 5 provides a discussion of the results and the conclusion.

SYSTEM DESCRIPTION
The system used for emotional speaker recognition is depicted in Fig. 1. It uses the state-of-the-art i-vector technique to represent audio information in a low-dimensional space. The Alize/LiaRal [20] toolkit for speaker recognition was used to extract i-vectors from the emotional recordings.

Total variability matrix training
In the front-end processing of the total variability training set, feature vectors are extracted. To obtain a UBM supervector for the extracted feature vectors, the UBM is adapted using the MAP (maximum a posteriori) algorithm. The i-vector technique is then applied to the resulting UBM supervectors to obtain low-dimensional vectors in which speaker and session variability are represented jointly. Within the concept of the total variability space, the i-vector method defines an adapted (speaker- and session-dependent) UBM supervector W as

$$W = w + T\omega \qquad (1)$$

In equation (1), T is a rectangular total variability matrix and w stands for the speaker- and session-independent supervector. The vector ω is a random vector with a standard normal distribution N(0, I); its components are called total factors, and the vector itself is referred to as the i-vector.
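The generative relation of equation (1) can be illustrated numerically. The following is a minimal sketch, not the Alize/LiaRal implementation: the dimensions, the randomly drawn matrix T and the least-squares recovery of ω are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

sv_dim = 12   # supervector dimension (hypothetical; real systems use thousands)
iv_dim = 3    # i-vector dimension (hypothetical)

w_ubm = rng.normal(size=sv_dim)          # speaker- and session-independent supervector w
T = rng.normal(size=(sv_dim, iv_dim))    # rectangular total variability matrix T
omega = rng.standard_normal(iv_dim)      # total factors, drawn from N(0, I)

# Equation (1): supervector of one utterance.
W = w_ubm + T @ omega

# With a noise-free supervector and a full-column-rank T, omega can be
# recovered by ordinary least squares on the centred supervector.
omega_hat, *_ = np.linalg.lstsq(T, W - w_ubm, rcond=None)
```

In a real extractor ω is not observed through a noise-free supervector; it is estimated as a posterior mean from the Baum-Welch statistics of the utterance. The least-squares step here only demonstrates that T maps a low-dimensional vector to a supervector offset.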
The training of the T matrix is similar to the training of the eigenvoice matrix. The difference is that, when training the total variability matrix, the recordings of a given speaker are treated as if they belonged to different speakers.
To compensate for session variability in the total variability space, EFR normalization is used. EFR builds on the concept of NAP [21], where the channel variability is estimated as a partial-rank component of the within-class covariance matrix. EFR itself then suppresses the nuisance dimensions of the computed i-vectors and continues with their normalization by rotating them to the first principal axis in the orthogonal subspace of the speaker onto which the i-vectors are projected. The rotation of i-vectors in the session space is depicted in Fig. 2.
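In EFR, the reduction of channel variability is defined by equation (2), where $\bar{w}$ is the mean of the i-vectors, V is the eigenvoice matrix and $W_c$ is the covariance matrix defined by equation (3):

$$W_c = \sum_{s} \frac{n_s}{n} W_s = \frac{1}{n} \sum_{s} \sum_{i=1}^{n_s} (w_i^s - \bar{w}_s)(w_i^s - \bar{w}_s)^T \qquad (3)$$

In equation (3), $W_s$ is the covariance matrix of speaker s, n is the total number of utterances, and $n_s$ is the number of utterances $w_i^s$ of speaker s, with $\bar{w}_s$ as their mean.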
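The covariance matrix of equation (3) and one EFR-style normalization pass can be sketched as follows. The sketch assumes the commonly published form of EFR, in which i-vectors are centred, whitened with the inverse square root of the total covariance of the training i-vectors and scaled to unit length. Whether V in equation (2) plays exactly this role is an assumption, and the function names and toy data are hypothetical.

```python
import numpy as np

def within_class_covariance(ivectors, speaker_ids):
    """W_c = sum_s (n_s / n) * W_s, with W_s the covariance of speaker s."""
    ivectors = np.asarray(ivectors, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    n, dim = ivectors.shape
    w_c = np.zeros((dim, dim))
    for s in np.unique(speaker_ids):
        x = ivectors[speaker_ids == s]
        centred = x - x.mean(axis=0)      # subtract the speaker mean
        w_c += centred.T @ centred        # equals n_s * W_s
    return w_c / n

def efr_step(ivectors):
    """One EFR-style pass: centre, whiten with the total covariance, length-normalise."""
    ivectors = np.asarray(ivectors, dtype=float)
    mean = ivectors.mean(axis=0)
    total_cov = np.cov(ivectors, rowvar=False, bias=True)   # assumed role of V
    eigval, eigvec = np.linalg.eigh(total_cov)
    inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T  # symmetric inverse square root
    whitened = (ivectors - mean) @ inv_sqrt
    return whitened / np.linalg.norm(whitened, axis=1, keepdims=True)

rng = np.random.default_rng(1)
toy_ivectors = rng.normal(size=(60, 10))   # 60 toy i-vectors of dimension 10
toy_speakers = np.repeat([0, 1, 2], 20)    # three hypothetical speakers
w_c = within_class_covariance(toy_ivectors, toy_speakers)
normalised = efr_step(toy_ivectors)
```

After the pass every i-vector lies on the unit sphere. Published descriptions of EFR apply such a pass iteratively, re-estimating the mean and covariance each time.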

Speaker model training
To train the speaker model, features are extracted from the training audio files. The total variability matrix T and the UBM are then used to project the extracted feature vectors into the i-vector space. Finally, GMM training of the speaker model is performed.

Testing
Based on our previous experiment [22], in which the Mahalanobis distance metric gave better results than CSS, we decided to use only this metric to evaluate the recognition system.
The Mahalanobis distance metric compares an observed entity with the mean of a known class distribution. The goal of the method is to allocate the observed entity to the best-fitting class. In this work an entity, the speaker's i-vector, is assigned to the class of speaker s according to equation (4), where $W_s$ is the covariance matrix of speaker s as in (3) and $\bar{w}_s$ is the mean of the class. The final Mahalanobis scoring is defined by equation (5), where $w_1$ and $w_2$ are two i-vectors scored by the log-probability that $w_1$ and $w_2$ belong to the same class with respect to the covariance matrix $W_s$.
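The classification and scoring steps can be sketched under the textbook definition of the Mahalanobis distance. The sketch assumes the quadratic form $(x - y)^T W^{-1} (x - y)$ and uses its negative as a similarity score; equations (4) and (5) may differ from this in constants or sign. The function names and the two-dimensional toy values are hypothetical.

```python
import numpy as np

def mahalanobis_sq(x, y, cov):
    """Squared Mahalanobis distance (x - y)^T cov^{-1} (x - y)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solving the linear system avoids forming the explicit inverse.
    return float(diff @ np.linalg.solve(cov, diff))

def assign_speaker(ivector, class_means, cov):
    """Return the label whose class mean is closest to the i-vector."""
    return min(class_means, key=lambda s: mahalanobis_sq(ivector, class_means[s], cov))

def mahalanobis_score(w1, w2, cov):
    """Similarity score for a trial: larger means more likely the same speaker."""
    return -mahalanobis_sq(w1, w2, cov)

# Toy example with made-up two-dimensional i-vectors.
cov = np.array([[2.0, 0.0], [0.0, 0.5]])
means = {
    "spk1": np.array([0.0, 0.0]),
    "spk2": np.array([3.0, 0.0]),
    "spk3": np.array([0.0, 3.0]),
}
test_ivector = np.array([2.0, 0.2])
predicted = assign_speaker(test_ivector, means, cov)
```

Because the covariance matrix weights each direction by its within-class spread, a deviation along a low-variance axis costs more than the same deviation along a high-variance axis. This is the difference from a plain Euclidean comparison.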

EMOTIONAL DATABASE
For the purposes of speaker recognition in this experiment, we created an emotional database in the native Slovak language. Emotional audio recordings of different subjects were captured from free-to-air (FTA) DVB-T transmission using a PCI digital capture card.
This database can be categorized as an induced database [23], since the captured SUS sessions consist of staged lawsuit situations in which non-professional actors take part. Participants are supposed to act according to the storyline of each session, but a slight degree of controlled expression of emotional states may be expected.
Each session deals with the resolution of a legal case; the emotional range of the recorded utterances therefore covers mostly the neutral state and curiosity, as well as negative emotions such as anger, aggressiveness, sadness and disgust. Positive emotions are rare in these recordings.
SUS audio sessions were down-sampled to 16 kHz from the original 48 kHz, 128 kbit/s mpa2 audio stream, encoded as mono LIN16 PCM and saved in WAV format.
All of the emotional utterances were manually labeled. The emotional evaluation was performed on whole sentences, so that the explicit meaning of the utterance was captured in the recording. Where more than one emotion occurred in an utterance, the utterance was divided into shorter segments following the rule of information relevance. The whole sentences or segments were then evaluated from the emotional point of view and labeled with capital letters representing the specific emotions (e.g. CU for curiosity, N for neutral, etc.). The labeled sessions were segmented using the Transcriber software [24] and then cut, using a proprietary script, into separate emotional utterances of individual speakers with a duration of 5 to 6 seconds.
The SUS database currently consists of approximately 2000 utterances from 7 speakers (3 male, 4 female) in the emotions of neutral, curiosity, anger, sadness and so on.

EXPERIMENTAL SETUP
This paper focuses on the performance of a speaker verification system when recordings from an emotional database in the Slovak language are employed. For this reason, MFCC, LPC, LPCC and PLP features were extracted in the front-end processing from emotional utterances of the three male subjects of the SUS corpus.
• LPC - Linear prediction analysis (LPA) is a method in which each sample of the speech signal is approximated as a linear combination of its p previous samples. The coefficients estimated by this method, the Linear Predictive Coefficients (LPC), describe the formants of the speech signal.
• LPCC - The linear system that models the human vocal tract can be described with cepstral coefficients as well. Linear Predictive Cepstral Coefficients (LPCC) are obtained by estimating the Power Spectral Density (PSD) of the signal. The advantage of LPCC lies in their lower correlation compared to LPC.
• MFCC - Mel-Frequency Cepstral Coefficients represent the real cepstrum of a windowed short-time signal, derived from the Fast Fourier Transform (FFT) of that signal. Since the human auditory system processes a speech signal nonlinearly, MFCC analysis represents the signal on a nonlinear frequency scale. To do so, a nonlinear mel-scale filter bank is used; the conversion from normal frequency f to mel frequency $f_{mel}$ is given by $f_{mel} = 2595 \log_{10}(1 + f/700)$. Additionally, MFCCs are robust and reliable under variations across speakers and recording conditions. When MFCCs are calculated, all audio information except the parameters similar to those used by humans for hearing speech is de-emphasized.
• PLP - Perceptual linear prediction (PLP) attempts to approximate the perception of sound by the human auditory organs. By discarding irrelevant information within a speech signal, PLP improves recognition performance. The PLP computation is identical to that of LPC, except that the spectral characteristics are first transformed to match the characteristics of the human auditory system.
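The mel-scale conversion used in MFCC analysis can be sketched in a few lines. The sketch uses the standard mapping $f_{mel} = 2595 \log_{10}(1 + f/700)$; the helper names, the number of filters (24) and the frequency range are assumptions for illustration, not settings reported in this paper.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse mapping, from mel back to Hz."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

def mel_filter_centres(n_filters, f_low_hz, f_high_hz):
    """Centre frequencies (Hz) of filters spaced uniformly on the mel scale."""
    mel_low, mel_high = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (mel_high - mel_low) / (n_filters + 1)
    return [mel_to_hz(mel_low + step * (i + 1)) for i in range(n_filters)]

# 8000 Hz is the Nyquist frequency of 16 kHz recordings;
# the number of filters (24) is an assumed value.
centres = mel_filter_centres(24, 0.0, 8000.0)
```

The centre frequencies are equidistant in mel but increasingly far apart in Hz, which gives the filter bank fine resolution at low frequencies and coarse resolution at high frequencies.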
In the feature extraction process, 19 coefficients of each of the mentioned acoustic feature sets were computed first. To augment the spectral parameters obtained by LP or mel-filterbank analysis, an energy coefficient, computed as the logarithm of the signal energy, was appended. To enhance the performance of the recognition system, time derivatives were appended to the basic coefficients as well: the first-order regression coefficients (delta coefficients), the second-order regression coefficients (acceleration coefficients) and the third-order regression coefficients. The resulting feature vector had 80 coefficients per frame, i.e. (19 + 1) × 4.
In the second step, the number of extracted basic coefficients was increased to 22. The final feature vector obtained in this way had 92 dimensions, i.e. (22 + 1) × 4. All of the audio features were extracted using a 25 ms Hamming window with a shift of 10 ms.
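The construction of the 80- and 92-dimensional vectors can be sketched as follows. The sketch assumes the widely used regression formula for delta coefficients with a half-window of two frames and with edge frames repeated; the paper does not state the regression window, so both choices, like the function names and the dummy data, are assumptions.

```python
def regression_deltas(frames, half_window=2):
    """Regression (delta) coefficients of a sequence of feature vectors.

    frames: list of equal-length lists, one per frame. Edge frames are
    handled by repeating the first and last frame.
    """
    n_frames = len(frames)
    denom = 2.0 * sum(k * k for k in range(1, half_window + 1))
    deltas = []
    for t in range(n_frames):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, half_window + 1):
                ahead = frames[min(t + k, n_frames - 1)][d]
                behind = frames[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas

def stack_with_derivatives(static, orders=3):
    """Append regression coefficients up to the given order to every frame."""
    blocks = [static]
    for _ in range(orders):
        blocks.append(regression_deltas(blocks[-1]))
    return [sum((block[t] for block in blocks), []) for t in range(len(static))]

# 19 basic coefficients plus log-energy, for 5 dummy frames on a linear ramp.
static_19 = [[float(t)] * 20 for t in range(5)]
features_19 = stack_with_derivatives(static_19)

# 22 basic coefficients plus log-energy, all-zero dummy frames.
static_22 = [[0.0] * 23 for _ in range(5)]
features_22 = stack_with_derivatives(static_22)
```

Each frame of `features_19` has 20 × 4 = 80 values and each frame of `features_22` has 23 × 4 = 92 values, matching the dimensions given in the text.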
In the next step, the energy coefficients were normalized to zero mean and unit variance, and the frames were then evaluated according to their speech energy in order to discard silent frames. Frames whose energy exceeded a specific threshold were used in the subsequent i-vector extraction. The resulting features were then mapped to a fixed-length vector in the i-vector extraction process. After experiments with different values of the dimension parameter, the dimension of the extracted i-vectors was set to 10.
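The energy-based frame selection can be sketched as below. The threshold value of 0.0 on the normalized log-energy, the energy floor and the function names are assumptions, since the text only mentions "a specific threshold"; the toy frames are made up.

```python
import math
import statistics

def frame_log_energy(frame):
    """Log of the frame energy; a small floor avoids log(0) on digital silence."""
    return math.log(sum(x * x for x in frame) + 1e-10)

def select_speech_frames(frames, threshold=0.0):
    """Return indices of frames whose normalised log-energy exceeds the threshold."""
    energies = [frame_log_energy(f) for f in frames]
    mean = statistics.fmean(energies)
    std = statistics.pstdev(energies) or 1.0   # guard against constant energy
    normalised = [(e - mean) / std for e in energies]
    return [i for i, e in enumerate(normalised) if e > threshold]

# Two near-silent frames followed by two louder ones (made-up samples).
toy_frames = [
    [0.001, -0.001, 0.002],
    [0.0, 0.001, -0.001],
    [0.5, -0.4, 0.6],
    [0.7, 0.6, -0.5],
]
kept = select_speech_frames(toy_frames)
```

Normalizing the energy per recording makes the threshold relative to the loudness of that recording, so the same value can be reused across sessions recorded at different levels.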
The gender-dependent UBM was trained using 247 recordings of background noise and of SUS male speakers not included in the training and testing phases. In this experiment 32, 64, 128 and 256 Gaussians were used in UBM training, and the best number of Gaussians was determined in the evaluation phase. The total variability matrix was trained with 250 emotional utterances from the total variability training dataset. Several experiments with different numbers of T training iterations were carried out; the best results were obtained with 10 iterations.
To create the speaker model training and testing sets, emotional utterances of three male speakers (spk1, spk2, spk3) from the SUS database in the emotions of neutral and curiosity were used. These emotions were the most common ones in the SUS sessions. Two different approaches to training the speaker model were applied, as shown in Fig. 3.
In the first approach, both neutral and curiosity utterances were used to train the speaker model. This resulted in one mixed model per speaker, trained with all available emotional recordings of that speaker.
The second approach resulted in two different models per speaker, each trained with only one emotion from the emotional training set of the given subject. The best scores in emotional speaker recognition obtained for the different extracted features are shown in the form of confusion matrices in Table 1, Table 2, Table 3 and Table 4.
In a confusion matrix, the diagonal elements represent correct classifications, while the other elements in each row express misclassifications. For example, in the first row of the table, corresponding to speaker 1 (spk1), 80% of the emotional utterances were recognized correctly, 8% were misclassified as utterances of speaker 2 (spk2) and 12% were misclassified as utterances of speaker 3 (spk3).
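How such a row-normalized confusion matrix is obtained from individual decisions can be sketched as follows. The trial counts are invented: only the spk1 row is chosen to reproduce the 80 / 8 / 12 example above, and taking the mean of the diagonal as the overall recognition rate is an assumption about how the reported rates were computed.

```python
def confusion_matrix_percent(true_labels, predicted_labels, classes):
    """Row-normalised confusion matrix: row = true speaker, column = decision."""
    counts = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    matrix = []
    for t in classes:
        row_total = sum(counts[t].values())
        matrix.append([100.0 * counts[t][p] / row_total if row_total else 0.0
                       for p in classes])
    return matrix

def recognition_rate(matrix):
    """Mean of the diagonal, i.e. the average per-speaker recognition rate."""
    return sum(matrix[i][i] for i in range(len(matrix))) / len(matrix)

classes = ["spk1", "spk2", "spk3"]
# 25 hypothetical test trials per speaker; the spk2 and spk3 rows are invented.
true_labels = ["spk1"] * 25 + ["spk2"] * 25 + ["spk3"] * 25
predicted = (["spk1"] * 20 + ["spk2"] * 2 + ["spk3"] * 3
             + ["spk1"] * 3 + ["spk2"] * 21 + ["spk3"] * 1
             + ["spk1"] * 2 + ["spk2"] * 2 + ["spk3"] * 21)
matrix = confusion_matrix_percent(true_labels, predicted, classes)
rate = recognition_rate(matrix)
```

Each row sums to 100%, so rows remain comparable even when speakers contribute different numbers of test utterances.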

DISCUSSION OF RESULTS AND CONCLUSION
According to the results in Table 1, Table 2, Table 3 and Table 4, the best results in both approaches were mostly obtained with a 128-component GMM/UBM. The exception was LPCC extraction, where the best recognition rate was achieved with a 64-component GMM/UBM and 22 coefficients per frame, regardless of whether the mixed or the non-mixed model approach was used.
For LPC and PLP, the best recognition rates (73% and 70%, respectively) were obtained with 19 extracted coefficients.
The best recognition rate overall was obtained with 22 MFCCs. The second approach (non-mixed model) gave better results than mixed model training: the recognition rate reached 85% with non-mixed model training, whereas the best recognition rate with mixed model training was 82%.
According to the investigation in this paper, 22 MFCCs provide the best recognition rate in speaker verification on the SUS emotional database. The superior performance of MFCC may be related to the fact that MFCCs are the best features for representing the perceptual aspects of the short-term speech spectrum.
Since in the non-mixed training approach the speaker model was trained with utterances of only one emotion, the emotional characteristics vary less than in mixed model training; this, we suppose, is the reason for the better recognition rate in this case.
In the future we would like to test the gender-dependent speaker verification system on the female emotional dataset of the SUS corpus. We also plan to continue enlarging the SUS corpus.

Fig. 1 Scheme of the recognition system

Universal background model training
Audio features extracted in the front-end processing of the training data waveforms are used in GMM/UBM training. The UBM was trained by means of the EM (expectation-maximization) algorithm.

Fig. 3 Mixed and non-mixed model approach

Table 1 Speaker recognition rate (%) for the mixed model with 128-component GMM/UBM training

Table 2 Speaker recognition rate (%) for the non-mixed model with 128-component GMM/UBM training (speaker model trained with neutral)

Table 3 Speaker recognition rate (%) for the non-mixed model with 128-component GMM/UBM training (speaker model trained with curiosity)