Speech Emotion Recognition Using Unsupervised Feature Selection Algorithms

Combining different speech features is a common practice to improve the accuracy of Speech Emotion Recognition (SER). However, this can sharply increase processing time, and some of the combined features contribute little to emotion recognition, often leading to incorrect predictions that substantially decrease the accuracy of the SER system. Hence, there is a need to select a feature set that contributes significantly to emotion recognition. This paper presents the use of the Feature Selection with Adaptive Structure Learning (FSASL) and Unsupervised Feature Selection with Ordinal Locality (UFSOL) algorithms for feature dimension reduction to improve SER performance with a reduced feature dimension. A novel Subset Feature Selection (SuFS) algorithm is proposed to further reduce the feature dimension and achieve comparable or better accuracy when used along with the FSASL and UFSOL algorithms. The 1582 INTERSPEECH 2010 Paralinguistic features, 20 Gammatone Cepstral Coefficients and a Support Vector Machine classifier with 10-Fold Cross-Validation and Hold-Out Validation are considered in this work. The EMO-DB and IEMOCAP databases are used to evaluate the performance of the proposed SER system in terms of classification accuracy and computational time. The result analysis shows that the proposed SER system outperforms the existing ones.


Introduction
Speech Emotion Recognition (SER) is the method of detecting the emotional state of a speaker from a speech signal. Emotion recognition has gained considerable interest in human-computer interaction, and intensive research is ongoing in this field using various feature extraction techniques and machine learning algorithms. SER is used in applications such as call-center services, in-vehicle systems, diagnostic tools in medical services, story-telling and E-tutoring.
There are six archetypal emotions (anger, happiness, disgust, surprise, fear and sadness) along with the neutral state. In situations where only a person's speech signals are available, SER plays a prominent role [1], [2]. Speech features can be classified as Continuous, Voice Quality, Spectral and Nonlinear Teager Energy Operator (TEO) features. Figure 1 shows the categorical representation of some of these speech features. A significant challenge in SER is the identification of useful speech features that hold the emotional characteristics of a speech signal, and most SER research is focused on identifying an effective feature set. It is evident from the literature that feature fusion increases the classification accuracy of the SER system and has become the most common practice.
Even though feature fusion increases the classification accuracy of the SER system, it also increases the computational overhead on the classifier, because some features contribute strongly while others may not be useful at all for emotion recognition. Feature selection methods simplify the interpretation task of the classification algorithms. They largely mitigate the losses caused by the curse of dimensionality and the problem of overfitting by improving the generalization of the model: removing redundant data that leads to incorrect predictions increases the accuracy of the SER system and enhances prediction performance while decreasing the computational time and memory required by the SER system. Hence, feature dimension reduction is an effective way to enhance the accuracy of the SER system. However, reducing the number of features can cause an uncertain loss of information and subsequently lead to instability in the performance of the SER system. To overcome this drawback and to acquire optimal feature sets that improve SER accuracy, many feature selection techniques have been developed in machine learning. In feature selection, a subset of features is selected from the original feature set with respect to their relevance and redundancy. This improves the prediction performance and reduces computational complexity and storage, providing faster and more cost-effective models [3]. In machine learning, a feature vector is the n-dimensional vector representing the features of a sample, and the space related to these vectors is the feature space. To decrease the dimensionality of the feature space, either feature selection or feature transformation can be used. In feature transformation, the original feature space is transformed into a different space with a distinct set of axes to reduce the dimensionality of the data.
In contrast, feature selection reduces the original feature space to a subspace without transformation. Some examples of feature selection methods are ReliefF, Fisher Score, Information Gain, Chi-Square, LASSO, etc. Feature selection techniques can be categorized, based on the labelling of the data, as supervised, unsupervised and semi-supervised. In supervised feature selection, the data is labelled for the feature evaluation process; if the data is huge, labelling is costly and tedious. Unsupervised feature selection overcomes these drawbacks of supervised approaches, although it is more difficult since no labelled data is available; even so, its results can be good without any prior knowledge. Based on the evaluation strategy, feature selection methods can be further classified into five types, i.e., filter, wrapper, embedded, hybrid and ensemble feature selection, as shown in Fig. 2.
Filter feature selection techniques use statistical analysis to assign each feature a score. The score ranks the features, which are then retained in or removed from the original feature vector set accordingly. These filter techniques mostly analyse a single variable at a time, treating the features as independent of each other. The most commonly used filter methods are the Chi-squared test [4], variance threshold [5], information gain, etc. A fast filter method, Fisher feature selection, is used in [6] with a decision SVM for SER. Wrapper feature selection techniques evaluate various combinations of feature subsets, treating the comparison of subsets as a search problem: each candidate subset is scored by the prediction accuracy it achieves. The search process can be systematic, stochastic or heuristic, such as best-first search, random hill-climbing, or forward and backward passes that add and remove features. Genetic Algorithms, Recursive Feature Elimination (RFE), Sequential Feature Selection (SFS), etc., are some of the wrapper methods of feature selection. In [7], SFS and Sequential Floating Feature Selection (SFFS) are used for SER.
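As an illustration of the filter approach, a minimal variance-threshold selector can be sketched as follows (the toy feature matrix and the threshold value are hypothetical, not taken from this work):

```python
def variance_threshold(features, threshold):
    """Filter method: keep the features (columns) whose variance exceeds threshold.
    features is a list of samples, each a list of feature values.
    Returns the indices of the retained features."""
    n = len(features)
    kept = []
    for j in range(len(features[0])):
        col = [row[j] for row in features]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        if var > threshold:
            kept.append(j)
    return kept

# Hypothetical toy data: the constant middle feature is scored 0 and removed.
X = [[1.0, 5.0, 0.2],
     [2.0, 5.0, 0.9],
     [3.0, 5.0, 0.1]]
print(variance_threshold(X, 0.01))  # -> [0, 2]
```

Note how the score is computed per feature in isolation, which is exactly why filter methods are fast but blind to feature interactions.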
Embedded methods select the best features during the learning process, while the model is being created, to enhance accuracy. Regularization techniques are the most commonly used embedded methods for feature selection: LASSO, FSASL, Ridge Regression, Elastic Net, etc. In [8], an L1-Norm with multiple kernel learning is used as an embedded feature selection method for SER. A hybrid method combines two or more feature selection methods (e.g., filter + wrapper), trying to acquire the benefits of both techniques by combining their corresponding strengths; it achieves improved efficiency and prediction performance and decreases computational complexity. The most widely used hybrid method combines the filter and wrapper approaches.
The ensemble method constructs a collection of feature subgroups and produces an aggregate result from the group. Its primary goal is to tackle the instability problems of most feature selection algorithms. The method is based on subsampling schemes in which one feature selection technique runs on many subsamples and the resultant features are combined to attain a more stable subset. For high-dimensional data, the feature selection performance is then no longer dependent on any individual selected subset, attaining more flexibility and robustness.
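The subsampling scheme described above can be sketched as follows, with a simple variance-based ranker standing in for the base feature selection technique (all names and toy data here are illustrative assumptions):

```python
import random

def top_variance(sub, k=2):
    """A simple base selector: return the indices of the k highest-variance features."""
    n = len(sub)
    scores = []
    for j in range(len(sub[0])):
        col = [row[j] for row in sub]
        m = sum(col) / n
        scores.append((sum((v - m) ** 2 for v in col) / n, j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def ensemble_select(X, base_selector, n_subsamples=10, sample_frac=0.7, top_k=2, seed=0):
    """Ensemble feature selection: run base_selector on random subsamples
    of X and keep the features that are selected most often."""
    rng = random.Random(seed)
    counts = {}
    size = max(2, int(sample_frac * len(X)))
    for _ in range(n_subsamples):
        sub = [X[i] for i in rng.sample(range(len(X)), size)]
        for j in base_selector(sub):
            counts[j] = counts.get(j, 0) + 1
    # Aggregate: rank features by selection frequency across subsamples.
    return sorted(counts, key=lambda j: -counts[j])[:top_k]

# Hypothetical toy data: the constant second feature is never selected.
X = [[1.0, 5.0, 0.2],
     [2.0, 5.0, 0.9],
     [3.0, 5.0, 0.1],
     [4.0, 5.0, 0.5]]
print(ensemble_select(X, top_variance))
```

The aggregation by frequency is what gives the ensemble its stability: a feature must be chosen on many subsamples, not just once, to survive.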
In [9], a sparse representation based sparse partial least squares regression (SPLSR) feature selection method is used for SER. Apart from these feature selection techniques, feature transformation methods can also be used for feature dimension reduction in SER [10][11][12]. In [10], the semi-NMF feature transformation technique with a multiple kernel Gaussian process is used for feature dimension reduction. In [11], a supervised feature transformation based dimension reduction method, i.e., the modified supervised locally linear embedding (MSLLE) algorithm, is adopted for SER. In [12], principal component analysis (PCA) is used to transform the high-dimensional feature space to a lower dimension for SER.
In [13], unsupervised feature learning is carried out using k-means clustering, sparse Auto-Encoders (AE) and sparse restricted Boltzmann machines for feature mapping to obtain an optimal feature set for SER. Adversarial AEs and variational AEs can encode a high-dimensional feature vector to a lower dimension and also reconstruct the original feature space; therefore, in [14], [15], they are used as feature dimension reduction techniques for SER. In [16], a new variant of feature extraction technique, a deep neural network based heterogeneous model consisting of an AE, a denoising AE and an improved shared hidden layer AE, is used to extract the features from the speech signal. These layers also provide feature optimization to some extent, but to obtain better SER performance with the high-dimensional feature set, a fusion level network with a support vector machine (SVM) classifier is used.
In this paper, an SER system is proposed with unsupervised feature selection algorithms and the Support Vector Machine (SVM) classifier using Linear and Radial Basis Function (RBF) kernels. The significant contributions of this work are: i) the use of the UFSOL and FSASL unsupervised feature selection algorithms, which have not yet been explored for SER; ii) a novel Subset Feature Selection (SuFS) algorithm that further improves the performance of the proposed SER system by selecting a subset of features after UFSOL and FSASL feature selection, using the 10-fold validation accuracy obtained with the UFSOL and FSASL algorithms as the decisive factor for feature selection.
The rest of the paper is structured as follows: Section 2 describes the proposed SER system with UFSOL, FSASL algorithms along with a novel Subset Feature Selection (SuFS) algorithm and Section 3 discusses the performance analysis of the proposed SER system followed by Section 4 with the conclusion and future scope of the proposed work.

Proposed Speech Emotion Recognition System using Unsupervised Feature Selection Algorithms
In the proposed SER system, after the feature extraction, the unsupervised feature selection algorithms, i.e., UFSOL and FSASL, are used individually to select the most prominent features from the original feature set, as shown in Fig. 3.

Database
In the proposed work, the EMO-DB and IEMOCAP datasets are considered for the SER analysis. EMO-DB, a German database [18], is widely used in SER analysis. The emotional data was recorded in an anechoic chamber by five male and five female actors in the age group of 25-35. In total, 535 speech signals were recorded at 48 kHz with the emotions Anger, Boredom, Disgust, Anxiety/Fear, Happiness, Sadness and Neutral, and later down-sampled to 16 kHz. The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database [19] is an acted, multimodal and multi-speaker database comprising twelve hours of audio-visual data, including video, speech, text transcriptions and motion capture of the face. In this work, as in most SER works, the speech data with the emotions anger, happiness, neutral and sadness is considered, with a total of 4490 utterances.

Pre-Processing
The speech signal is initially passed through a pre-emphasis filter to boost the energy in the higher frequencies, which are attenuated during speech production by the vocal tract [20].
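A minimal sketch of such a pre-emphasis filter is given below; the coefficient value 0.97 is a commonly used choice, not one stated in this paper:

```python
def pre_emphasis(signal, alpha=0.97):
    """First-order pre-emphasis filter: y[n] = x[n] - alpha * x[n-1].
    Boosts the higher frequencies that are attenuated during speech production."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

# A constant (purely low-frequency) signal is almost entirely suppressed.
print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))
```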

Feature Extraction
Feature Extraction in speech emotion recognition is the process of extracting the speech-specific features that carry the emotion-relevant information [1]. To obtain the emotional content of a speech signal, a particular set of features can be extracted by applying various signal processing techniques. In this work, the INTERSPEECH 2010 paralinguistic challenge feature set and Gammatone Cepstral Coefficients (GTCC) are used as features. The INTERSPEECH 2010 paralinguistic challenge set consists of 1582 features obtained by combining four sets of features [21].
The Munich open Speech and Music Interpretation by Large Space Extraction (openSMILE) toolkit [22] is utilized to extract the 1582 features for each individual speech signal. The configuration file 'IS10_paraling.conf' is used to obtain these features, and the features, along with their descriptions, are shown in Tab. 1.
The gammatone filter takes its name from its impulse response, the product of a Gamma distribution function and a sinusoidal tone centered at frequency f_c, computed as [23]:

g(t) = K t^(n-1) e^(-2*pi*B*t) cos(2*pi*f_c*t + phi)

where g(t) is the impulse response of the gammatone filter; K is the amplitude factor; n is the filter order; f_c is the central frequency in Hz; phi is the phase shift; and B is the bandwidth parameter that sets the duration of the impulse response (B = 1.019 * ERB(f_c)).
ERB is the equivalent rectangular bandwidth, i.e., ERB(f) = 24.7 + 0.108f. The center frequency f_c of each gammatone filter is equally spaced on the ERB scale. The fourth-order gammatone filter is similar to the human auditory model, therefore n = 4. Here, f_low = 62.5 Hz, f_high = 3400 Hz, and the number of gammatone filters N is 20. After obtaining the gammatone filter outputs, cepstral analysis is applied to them, obtaining a total of 20 gammatone cepstral coefficients.
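Under the definitions above, the N = 20 ERB-spaced center frequencies between f_low and f_high can be computed as sketched below (placing the first and last filters exactly at f_low and f_high is an assumption):

```python
import math

A, B = 24.7, 0.108  # ERB(f) = A + B * f, as given in the text

def erb_number(f):
    """Cumulative ERB number at frequency f (integral of 1 / ERB(f))."""
    return math.log(1.0 + B * f / A) / B

def inv_erb_number(e):
    """Inverse of erb_number: frequency in Hz at ERB number e."""
    return (A / B) * (math.exp(B * e) - 1.0)

def gammatone_center_freqs(f_low=62.5, f_high=3400.0, n_filters=20):
    """Center frequencies equally spaced on the ERB scale between f_low and f_high."""
    e_low, e_high = erb_number(f_low), erb_number(f_high)
    step = (e_high - e_low) / (n_filters - 1)
    return [inv_erb_number(e_low + i * step) for i in range(n_filters)]

fc = gammatone_center_freqs()
print(round(fc[0], 1), round(fc[-1], 1))  # endpoints at f_low and f_high
```

Because the spacing is uniform on the ERB scale, the resulting center frequencies are densely packed at low frequencies and sparse at high frequencies, mimicking auditory resolution.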

Unsupervised Feature Selection
The unsupervised feature selection algorithms UFSOL and FSASL, which have not been explored for SER so far, are used in this work. In addition, a novel Subset Feature Selection algorithm is modelled on the results obtained after using the UFSOL and FSASL algorithms, to further improve the performance of the SER system. The entire set of 1602 features is given to the feature selection algorithms to select the most prominent features, as shown in Fig. 3. The UFSOL and FSASL algorithms are discussed below.

Unsupervised Feature Selection with Ordinal Locality (UFSOL):
Consider X = [x_1, ..., x_d] in R^(m x d) as the initial feature matrix, with d speech signals and m features. Regularized-regression feature selection is generally formulated as [24]:

min_{W,H} ||W^T X - H||_F^2 + gamma ||W||_{2,q}

where W in R^(m x d_2) (d_2 < m) is the projection/feature selection matrix; the l_{2,q}-norm (q is typically set to 0 or 1) assures sparseness in the rows of W; and H is the label matrix in the case of supervised multi-class data. In this work, bi-orthogonal semi Nonnegative Matrix Factorization (NMF) is used to decompose H into two new matrices, i.e., H = UV with V >= 0, VV^T = I and U^T U = I.
If the feature representation selected for an original sample x_i is y_i = W^T x_i, then Y = W^T X. According to the principle of ordinal locality preserving, given a triplet (x_i, x_u, x_v) comprising x_i and its neighbors x_u and x_v, the corresponding selected features also form a triplet (y_i, y_u, y_v). Let dist(.,.) denote the distance metric. The feature selection preserves ordinal locality if the following condition holds:

dist(x_i, x_u) <= dist(x_i, x_v)  implies  dist(y_i, y_u) <= dist(y_i, y_v).

Based on this, finding the appropriate features for each data point is equivalent to optimizing an ordinal locality preserving loss function over a collection of triplets, where N_i is the set of indices of the k nearest neighbors of x_i and the squared Euclidean distance is used for each pairwise distance. The loss function of ordinal locality preserving has an equivalent compact matrix form in which two scalar constants control the relative weights of the corresponding terms.
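The ordinal locality preserving condition can be checked directly by counting violated triplets, as in the following sketch (the neighbor lists and toy data are hypothetical):

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def ordinal_violations(X, Y, neighbors):
    """Count triplets (x_i, x_u, x_v) whose pairwise-distance ordering in the
    original space X is flipped in the selected space Y.
    neighbors[i] lists the indices of the k nearest neighbors of x_i."""
    violations = 0
    for i, nbrs in enumerate(neighbors):
        for a in range(len(nbrs)):
            for b in range(a + 1, len(nbrs)):
                u, v = nbrs[a], nbrs[b]
                dx = sq_dist(X[i], X[u]) - sq_dist(X[i], X[v])
                dy = sq_dist(Y[i], Y[u]) - sq_dist(Y[i], Y[v])
                if dx * dy < 0:  # ordering flipped after feature selection
                    violations += 1
    return violations

# Hypothetical toy data: an identity mapping preserves ordinal locality.
X = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
neighbors = [[1, 2], [0, 2], [0, 1]]
print(ordinal_violations(X, X, neighbors))  # -> 0
```

The UFSOL objective penalizes such disagreements smoothly rather than counting them, but this sketch conveys what the ordinal locality constraint demands of the selected subspace.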
According to half-quadratic theory, the overall objective can be minimized by introducing an auxiliary diagonal matrix R and alternately updating F(W, U, V, R) as follows: i) the diagonal elements of R are updated in parallel; ii) for fixed W, (U, V) is updated by applying orthogonal semi-NMF on the projected data; iii) the optimal W comprises the d_2 eigenvectors corresponding to the smallest eigenvalues of the associated eigen-problem. These steps are repeated until convergence, as summarized in Algorithm 1, whose inputs are the number of nearest neighbors k and the parameters d_2, c and the two scalar constants. W is the resultant feature selection matrix.

Feature Selection with Adaptive Structure Learning (FSASL):
In this algorithm, consider the feature set X in R^(d x m), where d is the dimension of the speech files and m is the number of features. The parameters alpha, beta, gamma and mu are the regularization parameters used to balance the sparsity and the reconstruction error of the global and local structure learning. With the optimized data dimension denoted by c, the resultant optimized feature set lies in R^(d x c). The general objective that guides the FSASL method [25] jointly minimizes a global (sparse reconstruction) structure learning term, a local (adaptive graph) structure learning term and a sparsity regularizer on the selection matrix, where X is the input feature set and x is a particular row of the data matrix.

Algorithm 2: FSASL Algorithm
Input: feature set X in R^(d x m), where d is the dimension of the speech files and m is the number of features.

Solution:
For each data sample x_q, all the data points {x_r}, r = 1, ..., m, are considered as the neighborhood of x_q with probability P(q, r). Here, S is the weight matrix of the data matrix, s is a particular row of the weight matrix, and Z is the feature selection and transformation matrix.
The optimization problem in (10) is decomposed over its variables (S, P and Z) into sub-problems involving only a single variable each, which are solved as follows:
1) Solve for S, keeping P and Z constant. For each q, update the q-th column of S by solving a sparse reconstruction sub-problem over X' and x', the transposes of X and x.
2) Solve for P, keeping S and Z constant. For each q, update the q-th column of P, where p'(q) denotes the q-th row of P.
3) Compute the overall graph Laplacian as L = L_S + (L_S)^T, where D_P is a diagonal matrix whose i-th diagonal element is the sum of the i-th row of P. The optimal solution for Z comprises the eigenvectors corresponding to the c smallest eigenvalues of the generalized eigen-problem on L and D_P, where Lambda is a diagonal matrix whose diagonal elements are the eigenvalues.
Output: Sort all the d features according to ||z_q||_2 (q = 1, ..., d) in descending order and select the top-k ranked features.
The resultant Z is the feature selection matrix. Both the FSASL and UFSOL algorithms rearrange the original feature set according to the prominence ranks produced by the corresponding algorithm. The rearranged feature sets are then fed to the classifiers to perform emotion classification.
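The final ranking step shared by both algorithms, i.e., ordering the features by the norm of their row in the selection matrix, can be sketched as follows (the toy matrix Z is hypothetical):

```python
def rank_features_by_row_norm(Z, k):
    """Rank the d features by the l2 norm of the corresponding row z_q of the
    selection matrix Z, in descending order, and return the top-k indices."""
    norms = [(sum(v * v for v in row) ** 0.5, q) for q, row in enumerate(Z)]
    ranked = sorted(norms, key=lambda t: -t[0])
    return [q for _, q in ranked[:k]]

# Hypothetical 4-feature selection matrix with c = 2 columns.
Z = [[0.1, 0.0],
     [0.9, 0.4],
     [0.0, 0.05],
     [0.5, 0.5]]
print(rank_features_by_row_norm(Z, 2))  # features with the largest row norms
```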

Subset Feature Selection (SuFS):
After the unsupervised feature selection, a novel Subset Feature Selection algorithm is applied on top of the UFSOL and FSASL algorithms, to further reduce the dimension of the feature set without affecting the accuracy of the SER system, i.e., to obtain better accuracy with a reduced feature set. The proposed SuFS depends on the ranking vector (i.e., the prominence of the features) and on the validation accuracy obtained with the features selected by the UFSOL and FSASL algorithms. The ranking vector follows the d_2 smallest eigenvalues of the UFSOL algorithm and the d smallest eigenvalues of the FSASL algorithm. The SuFS procedure is given in Algorithm 3.

Algorithm 3: SuFS Algorithm
Input: accuracy vector a, containing the validation accuracies obtained with increasing numbers of ranked features from the feature selection algorithm; ranking vector r; original feature matrix F; l = number of features at which the highest accuracy is obtained using UFSOL or FSASL.
Solution:
1: Initialize the sub-rank vector sr with r(1) (since the first accuracy value is always > 0)
2: Initialize h = 2
   for g = 1:1:l
       if a(g + 1) > a(g)
           sr(h) = r(g + 1)
           h <- h + 1
       end
   end
3: for g = 1:1:len(sr)
       sf(:, g) = F(:, sr(g))
   end
Output: subset of the original feature vector, sf

The SuFS algorithm is applied to the features selected by UFSOL and FSASL to obtain the sf feature vector. The resulting subsets of features, i.e., those obtained from UFSOL-SuFS and FSASL-SuFS, are given to the SVM classifier for both validation and testing.
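A minimal sketch of the SuFS procedure in Algorithm 3, keeping only the ranked features whose inclusion improved the validation accuracy, is given below (the accuracy and ranking vectors are hypothetical):

```python
def sufs(acc, rank, F):
    """Subset Feature Selection (SuFS) sketch.

    acc[g]  : validation accuracy using the first g+1 ranked features
    rank[g] : original index of the g-th ranked feature
    F       : feature matrix as a list of rows (samples x features)
    Returns the reduced feature matrix and the retained feature indices."""
    # The first ranked feature is always kept (accuracy rises from zero).
    sub_rank = [rank[0]]
    for g in range(1, len(acc)):
        if acc[g] > acc[g - 1]:  # this feature improved the accuracy
            sub_rank.append(rank[g])
    # Build the subset feature matrix sf from the retained columns.
    sf = [[row[j] for j in sub_rank] for row in F]
    return sf, sub_rank

acc = [0.50, 0.55, 0.54, 0.60]   # hypothetical validation accuracies
rank = [2, 0, 3, 1]              # hypothetical feature ranking
F = [[10, 11, 12, 13],
     [20, 21, 22, 23]]
sf, kept = sufs(acc, rank, F)
print(kept)  # -> [2, 0, 1]
```

In this toy run, the third ranked feature (index 3) is dropped because adding it lowered the validation accuracy, which is exactly the pruning SuFS performs.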

Simulation Results and Discussion
In the proposed SER system, the 1602 INTERSPEECH Paralinguistic and GTCC features are extracted from the speech signal. This large set of features is fed to the UFSOL and FSASL algorithms for feature selection. The support vector machine (SVM) classifier with Linear and Radial Basis Function (RBF) kernels, using Hold-Out and 10-fold Cross-Validation, is used for emotion classification. For hold-out validation, the speech signal database is divided into training (80%) and testing (20%) datasets. The k-fold cross-validation (here, k = 10) is a resampling method employed to evaluate machine learning models on a limited dataset: the dataset is randomly divided into k folds of nearly equal size, one fold is used as the validation set, and the model is fit on the remaining k - 1 folds. In this work, the entire dataset is randomly split into 10 parts; 9 parts are used for training the classifier (SVM), and testing is carried out on the held-out tenth part. This process is repeated over 10 folds, i.e., 10 times, until the entire dataset has been used for testing. The performance of the proposed SER system is evaluated using the machine learning performance metric of classification accuracy. All the simulations are carried out on a computer with an Intel(R) Xeon(R) CPU E3-1220 v3 (3.10 GHz, 64-bit) and 16 GB RAM. To select the initial prominent feature set that gives the highest accuracy, the feature selection matrices of both the UFSOL and FSASL algorithms are given to the SVM classifier, as shown in Fig. 3.
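The 10-fold splitting described above can be sketched as follows (the shuffling seed and helper names are illustrative assumptions):

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Randomly split sample indices into k folds of nearly equal size.
    Each fold serves once as the test set while the remaining k - 1
    folds train the classifier."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]

# e.g., the 535 EMO-DB utterances split into 10 folds
splits = k_fold_indices(535, k=10)
print([len(test) for _, test in splits])
```

Each sample appears in exactly one test fold, so averaging the per-fold accuracies uses the whole dataset for both training and testing, which is what makes k-fold estimates less wasteful than a single hold-out split.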
Figures 4 and 5 show the variation of classification accuracy with the number of features using FSASL and UFSOL feature selection for EMO-DB and IEMOCAP.
For EMO-DB, the highest validation accuracy of 86% is obtained with 600 features using FSASL, and 85% with 500 features using UFSOL. For IEMOCAP, with 1250 features, the highest accuracies of 71.4% using FSASL and 72% using UFSOL are obtained.
It is evident from Figs. 4 and 5 that, even with the initially selected features from the UFSOL and FSASL algorithms, the SER accuracy does not increase monotonically; further feature selection from the initially chosen features is therefore still possible. Hence, the SuFS algorithm is applied after UFSOL and FSASL feature selection to acquire better accuracy with fewer features: the initially selected features are fed to the SuFS algorithm to further reduce their number while retaining the best performance. The most prominent features selected by SuFS are then fed to the SVM classifier with Linear and RBF kernels for emotion classification.
The performance of the proposed SER system with the different feature selection algorithms is compared with the baseline SER system (without feature selection) using the SVM classifier with Linear and RBF kernels under hold-out validation and 10-fold cross-validation, as shown in Tabs. 3 and 4, in terms of classification accuracy and validation (or) testing time. Tables 3 and 4 show the simulation results of the proposed SER system for the EMO-DB and IEMOCAP databases with hold-out validation and 10-fold cross-validation using the SVM classifier. From the results, it can be clearly seen that better performance is achieved with the Linear kernel for the EMO-DB database and with the RBF kernel for the IEMOCAP database.
Table 3 shows the hold-out validation results for the EMO-DB and IEMOCAP databases. For EMO-DB, the highest testing accuracy of 86% is obtained with the lowest computational times for training and testing, i.e., 0.165 and 0.032 seconds, using the FSASL-SuFS algorithm. Similarly, for the IEMOCAP database, the highest testing accuracy of 77.5% is obtained at the lowest computational times of 14 and 2.9 seconds for training and testing using the UFSOL-SuFS algorithm.