Speech emotion recognition using overlapping sliding window and Shapley additive explainable deep neural network

ABSTRACT Speech emotion recognition (SER) has several applications, such as e-learning, human-computer interaction, customer service, and healthcare systems. Although researchers have investigated many techniques to improve the accuracy of SER, challenges remain in feature extraction, classifier schemes, and computational costs. To address these problems, we propose a new set of 1D features extracted using an overlapping sliding window (OSW) technique for SER. In addition, a deep neural network-based classifier scheme called the deep Pattern Recognition Network (PRN) is designed to categorize emotional states from the new set of 1D features. We evaluate the proposed method on the Emo-DB and AESSD datasets, which contain several different emotional states. The experimental results show that the proposed method achieves accuracies of 98.5% and 87.1% on the Emo-DB and AESSD datasets, respectively, comparable to or better than state-of-the-art and current approaches that use 1D features on the same datasets for SER. Furthermore, SHAP (SHapley Additive exPlanations) analysis is employed to interpret the prediction model and assist system developers in selecting the optimal features to integrate into the desired system.


Introduction
Since deep learning and neural networks have advanced significantly over the past ten years, speech emotion recognition (SER) has gained a lot of attention in the field of affective computing. It seeks to identify and understand human emotions on the basis of voice cues. SER has recently been effectively incorporated into several real-world applications (Badshah et al., 2019; Chatterjee et al., 2021; Chen et al., 2020; Gong & Luo, 2007; Yoon et al., 2007; Zhu & Luo, 2007). Yoon et al. (2007) trained a speech emotion recognition agent model and integrated it into a mobile communication service so that consumers could monitor and analyze the mood of a person of interest via their smartphones. An SER system supporting different languages from diverse courses was proposed for an e-learning system in Zhu and Luo (2007) to address the absence of emotional contact.
Researchers have worked to develop effective classifier schemes and determine the best feature sets to increase the accuracy of SER systems. Energy, pitch, Mel frequency cepstral coefficients (MFCC) (Koduru et al., 2020; Kuchibhotla et al., 2016), linear predictive coding, zero-crossing rate (Lingampeta & Yalamanchili, 2020), Mel spectrograms (Chen et al., 2018; Meng et al., 2019; Pham et al., 2020), and wavelet features (Abdel-Hamid, 2020; Koduru et al., 2020) are just a few of the feature extraction techniques that have been presented over the years to extract reliable and ideal characteristics. Additionally, many classifier schemes, including support vector machines (Swain et al., 2015), artificial neural networks (ANNs), adaptive-neuro-fuzzy inference systems (Giripunje & Bawane, 2007), linear discriminant analyses (LDAs), k-nearest neighbors (Lingampeta & Yalamanchili, 2020), and regularized discriminant analyses (RDA) (Kuchibhotla et al., 2016), have been developed to increase the recognition rate for the SER. However, these handcrafted features and modeling methods could not achieve the required precision and demanded a significant amount of time and money.
Deep learning has become increasingly important with the advancement of neural networks in a variety of fields, including image processing, text recognition, speech recognition, and speech emotion recognition. A 3-dimensional deep learning model that incorporated convolutional neural networks (CNN), a long short-term memory (LSTM) network, and an attention mechanism for the SER was proposed by Chen et al. (2018). Mustaqeem et al. (2020) employed a K-means clustering algorithm with a radial basis function network (RBFN) to cluster the key segments of the input speech. The selected key segment sequence was transformed into spectrograms and then fed into ResNet-101 to extract features. Finally, these features were passed to a deep bidirectional LSTM to learn the temporal information and high-level representations for recognizing emotional states. Bao et al. (2019) introduced emotion style transfer as a data augmentation method that generates synthetic feature vectors for speech emotion recognition using a cycle-consistent generative adversarial network (CycleGAN) with a classification loss function. For multimodal emotion recognition, Nguyen et al. (2021) designed a deep learning framework with two branches for multimodal features: a 2D deep auto-encoder branch and a 1D deep auto-encoder branch to extract features from visual and audio input, respectively. These features are then fused and fed into an LSTM network for emotion recognition. However, utilizing deep learning models requires a significant amount of data and computing power.
Recent researchers have also tried to reconfigure, combine, or propose new classification loss functions to improve the speech emotion recognition rate. For instance, Meng et al. (2019) created a new loss function for the SER that combined the center loss and the softmax loss for emotion classification. By merging contrastive-center loss and softmax loss, the loss function of Meng et al. (2019) was enhanced for the SER in Pham et al. (2020). To enhance emotion identification from speech, researchers have also used attention mechanisms (Neumann & Vu, 2017) and transformers (Siriwardhana et al., 2020). However, extracting the proper characteristics is challenging in practically all SER systems, and users are unsure of which features to pick. Additionally, selecting the classifier scheme is crucial in SER systems.
It is well known that researchers have attempted to employ 2D feature representations and deep learning-based classifier schemes, with and without attention mechanisms, to enhance speech emotion identification. However, researchers have not yet determined which features should be employed to improve speech emotion recognition. Moreover, deep learning models, such as CNNs, RNNs, and their variant architectures, require a lot of data and computing resources. Therefore, this study develops a novel set of 1D features for speech emotion recognition using the overlapping sliding window (OSW) method. Time-domain and frequency-domain (TD-FD) characteristics serve as the foundation for these features. Then, utilizing the new set of 1D features, a deep Pattern Recognition Network (PRN) is created to train a classifier to improve the accuracy of the SER system. The proposed approach is also available as open source at https://github.com/nhattruongpham/osw-1d-prn-shap, which may be used to reproduce the experiments.
The following is a list of the study's key contributions:
- An OSW approach is used to extract a new collection of 1D features. A simple deep Pattern Recognition Network model is created to train a classifier from the new set of 1D features using a cross-entropy loss function, which drastically reduces the computational work required compared to CNN, RNN, and their variant architectures, since the PRN is a simple deep neural network;
- The OSW technique can also be used as a data augmentation method to enrich features for speech emotion recognition;
- The suggested technique is tested on the Emo-DB (Burkhardt et al., 2005) and AESSD (Vryzas, Matsiola et al., 2018) datasets, and SHAP analysis (Lundberg & Lee, 2017) is used to assess the contribution of each feature in the new collection of 1D features so that system developers may choose the best features for the desired system.
The rest of this paper is structured as follows. Related studies are summarized in Section 2. The proposed methodology is presented in Section 3. In Section 4, experimental results and discussion are depicted and analyzed. Finally, the paper is concluded in Section 5.

Related work
Speech emotion detection has faced significant difficulties with feature extraction. To extract reliable and ideal features for the SER, researchers have explored everything from low-level handcrafted features and conventional approaches to high-level representations and deep learning techniques. To identify emotional states from speech, Kuchibhotla et al. (2016) employed a total of 360 characteristics, including 324 MFCCs, 36 energy features, and 36 pitch features. These features were then fused or selected using sequential forward selection (SFS) or sequential floating forward selection (SFFS). When compared with other approaches, the experiment combining the RDA classifier scheme with SFFS provided competitive accuracy; however, a large number of features was needed to build the feature set for the SER. Two deep learning models were created by Zhao et al. (2019) that acquired local and global emotional characteristics from audio samples and log-Mel spectrograms to identify emotions in speech. The work of Zhao et al. (2019) did not, however, integrate several feature sets to create an algorithm that would combine the advantages of various deep features. Chen et al. (2020) proposed a two-layer fuzzy multiple random forest (TLFMRF) to recognize emotions from speech signals using a feature set of 16 basic features and 12 statistical values that belong to both personalized and non-personalized features.
In contrast, a three-step time-to-failure prognostic for rolling element bearings was developed by Wu et al. (2018) and covered feature extraction, feature reduction, and time-to-failure prediction. In the feature extraction process, TD-FD features were retrieved from raw vibration signals to create several statistical characteristics. Due to the rapid analysis of the speech signals in TD-FD, these characteristics are reliable and may be used for speech emotion identification. To evaluate audio content and extract useful information from it, Lerch (2012) identified several audio characteristics. Almost all aspects of speech processing, including speech emotion recognition, have made extensive use of some of them.
Sliding window-based (SW) approaches have been widely used in signal processing (Kusuma & Nuryani, 2019), mechanical systems (Clement et al., 2014; Li et al., 2018; Lin et al., 2021), and data streams (Domino & Gawron, 2019). In addition, to minimize noise in sound categorization, the OSW method has been utilized in the short-time Fourier transform to assess the time origin from frame to frame. The length of the window (or frame) and the sliding step are crucial components of the OSW approach, and they are chosen in accordance with the goal of the analysis. Recently, an OSW for a 3-dimensional emotional system reacting to EEG (electroencephalography) input was proposed by Garg et al. (2021). It was also asserted in Liu et al. (2020) that the OSW with an ANN classifier technique might considerably increase the SER's accuracy; however, experimental findings were given for only four emotional states of the Emo-DB dataset.
As mentioned above, SW and OSW have been investigated in many applications. The differences between a sliding window and an overlapping sliding window, and the advantages of the latter, can be summarized as follows. The sliding window method divides each input signal into several windows with a fixed interval size. There are two types of SW methods: OSWs and non-OSWs. In the OSW approach, adjacent windows overlap, whereas in the non-OSW approach they do not. For example, if the fixed interval size is 10 and the overlap is 5, the start and end points of the windows will be [1 10], [6 15], [11 20], etc., while in the non-overlapping approach they will be [1 10], [11 20], [21 30], and so on. For almost all popular windows, such as the Hanning and Hamming windows, an overlap of 50% between adjacent windows works best in many applications, such as signal processing, speech processing, mechanical systems, and control systems. In addition, because speech emotion recognition is a sequential process, the fixed-size intervals are not independent; with the OSW approach, we can extract and capture as much information as possible. As a result, this study uses the OSW in two ways. First, the OSW is moved along the voice signals to extract the new set of 1D characteristics for the SER. Second, the OSW is also regarded as a strategy for enriching features when data are limited.
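The window bookkeeping described above can be sketched in a few lines. The following minimal illustration uses 0-based, end-exclusive indices (the text's [1 10], [6 15] examples are 1-based) and is written in Python rather than the paper's MATLAB:

```python
def sliding_windows(n_samples, win_size, hop):
    """Return (start, end) index pairs over a signal of n_samples points.

    hop < win_size gives overlapping windows; hop == win_size gives
    non-overlapping ones (0-based, end-exclusive indices)."""
    return [(s, s + win_size) for s in range(0, n_samples - win_size + 1, hop)]

# Overlapping: interval size 10, overlap 5 (hop = 5)
print(sliding_windows(30, 10, 5))   # [(0, 10), (5, 15), (10, 20), (15, 25), (20, 30)]
# Non-overlapping: hop equals the window size
print(sliding_windows(30, 10, 10))  # [(0, 10), (10, 20), (20, 30)]
```

In the paper's setting, win_size would be 25,000 samples and hop 250 samples for the training data.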

Methodology
The proposed method is depicted in Figure 1 as an overview. It includes three main components: feature extraction, classification, and interpretation. For feature extraction, the OSW is created to calculate a new set of nine 1D features based on TD-FD characteristics (Lerch, 2012;Wu et al., 2018). Then these features are enriched by calculating the first and second derivatives. For classification, a simple neural network is designed to learn the representations from the extracted 1D features to discriminate emotions from speech signals. After that, SHAP analysis (Lundberg & Lee, 2017) is employed to interpret the contribution of each feature in the feature set to the prediction. Because all 27 extracted features are 1D features and the deep PRN is a shallow network, computing resources and training costs are drastically reduced. As a result, the SER system performs better. The details of using the OSW technique, designing a simple deep PRN model, and analyzing the model prediction are presented below.

A new set of 1D features using overlapping sliding window
In this work, a new set of 1D characteristics is suggested to improve the accuracy of the SER system. The new set of 1D features is built upon the TD-FD features of the mean value (mean), zero-crossing rate (zcr), root mean square value (rmsv), signal crest (sc), maximum absolute value (mav), kurtosis coefficient (kc) (Soualhi et al., 2014), square mean root value (smrv), root mean square logarithm (rmsl), clearance factor (cf), and root mean square frequency (rmsf). Figure 2, the first component of the proposed architecture, presents the OSW process that extracts the new set of 1D features to train a classification model. As illustrated in Figure 2, a sliding window or frame Fr(t_i) with a window size Ws and a hop length Hp is used to calculate these attributes at each time step t_i. The hop length and the window size are linked: the hop length indicates how far the analysis time origin advances from frame to frame, and it depends heavily on the goal of the analysis. Increased overlap yields additional analysis points, which smooths the findings over time but at a higher computational cost. In this study, the window size Ws is fixed at 25,000 samples for both the training and testing sets, while the hop length Hp is fixed at 250 samples for the training set and randomly selected in a range of 250 to 1,000 samples for the testing set. The window size is chosen based on empirical validation and experience. Furthermore, because the audio sampling rate is 16,000 Hz (16 kHz), each window spans approximately 1.5 s, a reasonable amount of time to speak the longest word. Then, a subset of the nine base features is obtained from Equations (2) through (10), where Ns is the number of data samples in each frame Fr(t_i). The mathematical models for the nine 1D base features extracted by the OSW technique are described below.
fe_rmsl = 20 log10( sqrt( (1/Ns) Σ_{j=1}^{Ns} Fr_j(t_i)^2 ) ), and Fr(t_{i−1}) = 0 is used as the initialization if Fr(t_{i−1}) does not exist. For n = 1, 2, 3, let FE^(n)_feasub = [fe_rmsv, fe_mav, fe_smrv, fe_kc, fe_cf, fe_rmsf, fe_rmsl, fe_sc, fe_zcr] be the n-th feature subset. The first subset (n = 1) is calculated directly from the speech frames Fr(t_i). To calculate the second subset (n = 2), first define Ḟr(t_i) as the time derivative of Fr(t_i), and then replace every instance of Fr(t_i) in Equations (2) through (10) with Ḟr(t_i). To compute the third subset (n = 3), replace every instance of Fr(t_i) in Equations (2) through (10) with F̈r(t_i), the time derivative of Ḟr(t_i). The final step concatenates the three subsets, defining the full feature set as FE_feature = [FE^(1)_feasub, FE^(2)_feasub, FE^(3)_feasub].
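The paper's exact Equations (2) through (10) are not reproduced here, so the following Python sketch uses common textbook definitions of these TD-FD quantities (an assumption; the paper's formulas may differ in detail). It shows how one frame yields the nine base features and, with first and second derivatives, the full 27-dimensional vector:

```python
import numpy as np

def frame_features(fr):
    """Nine TD-FD features for one frame. Definitions follow common usage
    and are only a stand-in for the paper's Equations (2)-(10)."""
    ns = len(fr)
    rmsv = np.sqrt(np.mean(fr ** 2))                         # root mean square value
    mav = np.max(np.abs(fr))                                 # maximum absolute value
    smrv = np.mean(np.sqrt(np.abs(fr))) ** 2                 # square mean root value
    kc = np.mean((fr - fr.mean()) ** 4) / np.var(fr) ** 2    # kurtosis coefficient
    cf = mav / smrv                                          # clearance factor
    spec = np.abs(np.fft.rfft(fr))
    freqs = np.fft.rfftfreq(ns)                              # normalized frequencies
    rmsf = np.sqrt(np.sum(freqs**2 * spec**2) / np.sum(spec**2))  # RMS frequency
    rmsl = 20 * np.log10(rmsv)                               # root mean square logarithm
    sc = mav / rmsv                                          # signal crest
    zcr = np.mean(np.abs(np.diff(np.sign(fr)))) / 2          # zero-crossing rate
    return np.array([rmsv, mav, smrv, kc, cf, rmsf, rmsl, sc, zcr])

def feature_vector(fr):
    """Concatenate features of the frame and its first and second differences."""
    d1, d2 = np.diff(fr), np.diff(fr, n=2)
    return np.concatenate([frame_features(fr), frame_features(d1), frame_features(d2)])

# Hypothetical 25,000-sample frame standing in for one OSW window
x = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 25000)) + 0.01
print(feature_vector(x).shape)  # (27,)
```

Each window thus produces one 27-dimensional row, and sliding the window over the utterance produces the training matrix.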

A simple deep pattern recognition network architecture
To train the classifier model, a deep Pattern Recognition Network (PRN) is utilized. The deep PRN receives as input the set of 27 extracted features. As shown in Figure 1, the deep PRN's architecture includes five hidden layers with 32, 64, 64, 32, and 16 hidden units, respectively, and a final layer with Es units corresponding to the number of emotional states in the dataset. The weights are updated during training using a cross-entropy (CE) loss function. CE is frequently used to assess the performance of classification tasks by computing the loss between predicted and ground-truth values. In this study, the likelihood of the emotional classes is calculated using the CE loss for multi-class classification, defined as

CE = − Σ_{i=1}^{Es} y_i log(ŷ_i),

where y_i represents the actual value and ŷ_i represents the predicted value of emotion i.
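The CE loss for one sample can be sketched directly from the definition above (Python here, although the paper's implementation is in MATLAB; the class probabilities are hypothetical):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """CE = -sum_i y_i * log(y_hat_i) over the Es emotion classes."""
    y_pred = np.clip(y_pred, eps, 1.0)   # guard against log(0)
    return -np.sum(y_true * np.log(y_pred))

y = np.array([0, 0, 1, 0, 0, 0, 0])      # one-hot target (7 Emo-DB classes)
y_hat = np.array([0.02, 0.03, 0.85, 0.02, 0.03, 0.03, 0.02])
print(round(cross_entropy(y, y_hat), 4))  # 0.1625
```

With a one-hot target, the loss reduces to the negative log-probability assigned to the true emotion.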
Interpretability: query the optimal features using SHAP analysis

SHAP (SHapley Additive exPlanations) analysis, a machine learning method based on game theory, was introduced in Lundberg and Lee (2017). SHAP analysis illustrates how much each attribute contributes to a prediction by measuring how much the prediction deviates from the average (the response for regression, or the score of each class for classification). It improves upon LIME (Local Interpretable Model-Agnostic Explanations), initially proposed in Ribeiro et al. (2016), which approximates a complex model close to the desired prediction with a simple interpretable model, such as a decision tree or linear model. Following kernelSHAP in Lundberg and Lee (2017), the Shapley values are computed in this study to describe the contribution of each feature to the prediction. Let M be the set of all features and N = |M| be the total number of features. Given the value function v, the Shapley value of the i-th feature for the query point x is defined as

φ_i(v_x) = Σ_{S ⊆ M \ {i}} [ |S|! (N − |S| − 1)! / N! ] ( v_x(S ∪ {i}) − v_x(S) ),

where |S| is the cardinality of the set S, i.e., the number of items in S, and v_x(S) is the value function of the features in S for the query point x. The value function indicates how much the features in S are likely to influence the prediction for the query point x. The sum of the Shapley values over all features thus equals the total deviation of the prediction for the query point from the average:

Σ_{i=1}^{N} φ_i(v_x) = f(x) − E[f(X)].

The value function v_x(S) must therefore match the expected contribution of the features in S to the prediction f for the query point x.
The value function v_x(S) of kernelSHAP is obtained from Equations (12) and (13) as

v_x(S) = E_D[ f(x_S, X_{S^c}) ] ≈ (1/O) Σ_{j=1}^{O} f(x_S, (X_{S^c})_j),

where D stands for the interventional distribution, S^c is the complement of S, x_S denotes the query point values of the features in S, X_{S^c} denotes the features in S^c, O is the number of observations, and (X_{S^c})_j denotes the values of the features in S^c in the j-th observation.
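To make the value function concrete, the following Python sketch (the paper itself uses MATLAB's shapley function) approximates v_x(S) by averaging the model output over background observations and plugs it into the Shapley weighted sum. The 3-feature linear model, background data, and query point are all hypothetical:

```python
import numpy as np
from itertools import combinations
from math import factorial

def value_function(f, x, S, X_background):
    """v_x(S): average model output with the features in S fixed at the query
    point x and the remaining features drawn from the O background rows."""
    Xs = X_background.copy()
    Xs[:, S] = x[S]          # intervene: clamp the coalition's features
    return np.mean(f(Xs))

# Toy setup (all hypothetical): a linear "score" model with 3 features
rng = np.random.default_rng(0)
f = lambda X: X @ np.array([1.0, 2.0, 3.0])
X_bg = rng.normal(size=(100, 3))   # O = 100 background observations
x = np.array([1.0, 1.0, 1.0])      # query point

# Shapley value of feature 0, straight from the weighted-sum definition
N, others = 3, [1, 2]
phi0 = 0.0
for k in range(len(others) + 1):
    for S in combinations(others, k):
        S = list(S)
        w = factorial(len(S)) * factorial(N - len(S) - 1) / factorial(N)
        phi0 += w * (value_function(f, x, S + [0], X_bg)
                     - value_function(f, x, S, X_bg))

# For a linear model the Shapley value is exactly w0 * (x0 - E[X0])
print(abs(phi0 - 1.0 * (x[0] - X_bg[:, 0].mean())) < 1e-9)  # True
```

The exact summation over all coalitions is exponential in N; kernelSHAP samples coalitions and solves a weighted regression instead, which is why it scales to the 27 features used here.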

Emo-DB
The German Emo-DB (Burkhardt et al., 2005) database of emotional speech is used in this study. Five male and five female speakers produced 535 samples at 44.1 kHz, which were further down-sampled to 16 kHz. There are various emotional states included in it, including anger, boredom, disgust, fear, happiness, neutral, and sadness. Figure 3 shows the distribution of various emotional states in the Emo-DB dataset.

AESSD
The AESSD (Acted Emotional Speech Dynamic Database), which includes five emotional states (anger, disgust, fear, happiness, and sadness), was introduced in Vryzas, Matsiola et al. (2018). AESSD was developed following SAVEE (Surrey Audio-Visual Expressed Emotion) (Jackson & Haq, 2014), although the utterances are in Greek rather than English. It contains over 500 emotive spoken utterances performed by five professional actors between the ages of 25 and 30: two men and three women. The AESSD dataset's distribution is shown in Figure 4. Figures 3 and 4 make it evident that the AESSD dataset can be regarded as balanced and the Emo-DB dataset as unbalanced. Dataset imbalance is therefore one of the most difficult problems in practically all SER systems.

Experimental setup
In this study, the Emo-DB and AESSD datasets were used to evaluate the proposed method. The OSW technique was used to extract 27 features from the raw audio signals of both datasets. Then, the extracted features of each dataset were randomly divided into training and testing sets with a ratio of 70/30. As a result, the training set contains 77,940 samples and the testing set 35,894 samples for the Emo-DB dataset, while the corresponding figures are 158,557 and 73,525 for the AESSD dataset. The training set is used to train and validate the proposed method, while the testing set is used only for prediction and evaluation.
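A 70/30 random split of the window-level features can be sketched as follows (an illustrative Python version; the array sizes and label range are hypothetical stand-ins for the extracted OSW features):

```python
import numpy as np

def split_70_30(features, labels, seed=42):
    """Randomly split window-level features into 70% training / 30% testing."""
    n = len(features)
    idx = np.random.default_rng(seed).permutation(n)
    cut = int(n * 0.7)
    tr, te = idx[:cut], idx[cut:]
    return features[tr], labels[tr], features[te], labels[te]

# Hypothetical stand-in for 1,000 27-feature windows with 7 Emo-DB labels
X = np.random.rand(1000, 27)
y = np.random.randint(0, 7, size=1000)
Xtr, ytr, Xte, yte = split_70_30(X, y)
print(len(Xtr), len(Xte))  # 700 300
```

Shuffling before splitting matters here because windows from the same utterance are adjacent in the extracted feature matrix.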
We used MATLAB R2021a to implement our proposed method on an Intel Core i5 8th Gen computer without a graphics processing unit (GPU). The deep PRN is created using the patternnet MATLAB function. Since the deep PRN is a shallow network, it does not support minibatch training; as a result, the entire dataset is used in each epoch. The training function for the deep PRN is the scaled conjugate gradient backpropagation (trainscg) algorithm, which requires no explicit learning rate. The maximum number of iterations is set to 5,000. In addition, early stopping is used to improve the deep PRN model's generalization and prevent overfitting. Moreover, to explain the contribution of each feature to the prediction, kernelSHAP is computed using the shapley function.
For performance evaluation, accuracy (ACC), precision (PCS), sensitivity (Sn), specificity (Sp), misclassification (MC), F1-score (F1), and Matthews correlation coefficient (MCC) metrics were used to validate the effectiveness and robustness of the proposed method on both the Emo-DB and AESSD datasets. These metrics are defined as

ACC = (TP + TN) / (TP + TN + FP + FN)
PCS = TP / (TP + FP)
Sn = TP / (TP + FN)
Sp = TN / (TN + FP)
MC = (FP + FN) / (TP + TN + FP + FN) = 1 − ACC
F1 = 2 × PCS × Sn / (PCS + Sn)
MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

where TN, FP, FN, and TP are the true-negative, false-positive, false-negative, and true-positive counts, respectively. Moreover, the receiver operating characteristic (ROC) curve is also depicted to evaluate the quality of the multiclass classifier.
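The metric definitions above can be computed directly from per-class confusion counts; the following one-vs-rest Python sketch uses hypothetical counts (the paper reports multi-class scores averaged over classes):

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """ACC, PCS, Sn, Sp, MC, F1, and MCC from one-vs-rest confusion counts."""
    total = tp + fp + fn + tn
    acc = (tp + tn) / total
    pcs = tp / (tp + fp)                  # precision
    sn = tp / (tp + fn)                   # sensitivity (recall)
    sp = tn / (tn + fp)                   # specificity
    mc = (fp + fn) / total                # misclassification rate = 1 - ACC
    f1 = 2 * pcs * sn / (pcs + sn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, pcs, sn, sp, mc, f1, mcc

# Hypothetical counts for one emotion class
acc, pcs, sn, sp, mc, f1, mcc = binary_metrics(tp=90, fp=10, fn=10, tn=890)
print(f"{acc:.3f} {pcs:.3f} {sn:.3f} {sp:.3f} {f1:.3f}")  # 0.980 0.900 0.900 0.989 0.900
```

Note how, with many true negatives, ACC and Sp stay high even when PCS and Sn drop, which is why MCC and F1 are reported alongside accuracy for the imbalanced Emo-DB classes.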

Results on the Emo-DB dataset
Figure 5 shows the deep PRN's confusion matrix, which compares the predicted output classes with the target classes when the new set of 1D features is used. The letters A, B, D, F, H, N, and S stand for anger, boredom, disgust, fear, happiness, neutral, and sadness, respectively. According to the confusion matrix, the deep PRN's average accuracy with the new set of 1D features is 98.5%. Additionally, Figure 6 shows the ROC curves for classifying the seven emotions, where both the true-positive and true-negative rates approach 1 for all emotions. According to the ROC curves in Figure 6, the positive and negative rates are nearly the same across all seven emotions; as a result, the overall accuracy is high. This work outperforms several earlier attempts that employ 1D characteristics for the SER. As presented in Table 1, compared with the feature sets and classifier techniques of Chen et al. (2020), Kuchibhotla et al. (2016), and Zhao et al. (2019), the new set of 1D features with the deep PRN achieves greater accuracy on the Emo-DB dataset, with improvements of 5.9%, 6.2%, and 12.9%, respectively. Figure 7 shows the Shapley values obtained from SHAP analysis, illustrating how each feature in the new collection of 1D features contributes to the deep PRN for the SER. As seen in Figure 7, x3, x16, and x10, which correspond to fe_smrv, fe'_rmsl, and fe'_rmsv, respectively, have a significant effect on the prediction of the target emotions.

Results on the AESSD dataset
Figure 8 shows the deep PRN's confusion matrix, which compares the predicted output classes with the target classes when the new set of 1D features is used. The letters A, D, F, H, and S stand for anger, disgust, fear, happiness, and sadness, respectively. According to the confusion matrix, the deep PRN's average accuracy with the new set of 1D features is 87.1%. Additionally, Figure 9 displays the ROC curves for classifying the five emotions, where all the ROC curves approach the upper left corner. As shown in Figure 9, the positive and negative rates are not the same across all emotions. With significant improvements of 7.1% and 13.1%, respectively, over the works summarized in Table 2, this study is also superior to those approaches. Figure 10 shows the Shapley values obtained from SHAP analysis, illustrating how each feature in the new collection of 1D features contributes to the deep PRN for the SER. As seen in Figure 10, x19, x15, and x12, which stand for fe''_rmsv, fe'_rmsf, and fe'_smrv, respectively, have a significant impact on the prediction of the target emotions.
Moreover, Table 3 presents the performance of the proposed method on the Emo-DB and AESSD datasets under the different evaluation metrics. As shown in Table 3, the ACC, Sn, and Sp scores are nearly identical on the Emo-DB dataset, but they differ more from one another on the AESSD dataset. Accordingly, the results of the proposed method on the Emo-DB dataset are better than those on the AESSD dataset in terms of effectiveness and robustness.

Discussion
Our proposed method is comparable to or better than recent studies that employed 1D characteristics for speech emotion recognition. Because it uses a new feature set consisting entirely of 1D features and the deep PRN is a simple neural network-based architecture, our method also requires far lower computing costs and fewer computational resources than earlier and current studies that use deep learning-based architectures such as CNNs, RNNs, and their variants. As a result, the proposed method can be trained and deployed on a system without a GPU. As observed in the experiments, the OSW technique plays an augmentation role in this study, generating additional features so that as much information as possible about the emotional states can be captured from each sliding window. Hence, using the OSW also helps improve the accuracy of the proposed method. Although using the OSW may increase processing time, it provides more analysis points and results in smoother performance over time.
Additionally, experimental findings demonstrate that our suggested approach is capable of handling the unbalanced data in the Emo-DB dataset. Moreover, system developers can choose the ideal characteristics for the intended system based on SHAP analysis.

Conclusion and future work
This paper proposes a new set of 1D characteristics for speech emotion identification. Additionally, a deep Pattern Recognition Network is used to train a classifier to identify emotions in speech. According to the experiments, the new collection of 1D characteristics not only enhances the accuracy of the SER system but also reduces the required processing power. Another important contribution is the OSW approach, which may be applied as a data augmentation strategy to enrich speech emotion features and help prevent the deep PRN model from overfitting.
In order to allow users to trade-off between accuracy and performance and select an appropriate collection of features for the desired system, the SHAP analysis is also used to assess the contribution of each feature in the new set of 1D features. The new set of 1D characteristics may also be easily applied to other applications, such as fault identification and diagnosis, structural damage detection, and ambient sound categorization, as it is based on both time-domain and frequency-domain features.
Future research will examine wavelet features (Shegokar & Sircar, 2016) to create a feature set for the SER that is both optimal and resilient and contains a variety of characteristics. Before building the deep learning model, we also plan to apply an optimal clustering approach (Nguyen et al., 2022) to separate and choose the relevant characteristics. Additionally, a previously developed adaptive-neuro-fuzzy inference system (ANFIS) (Nguyen et al., 2017) will be considered when designing a new classifier, called deep-ANFIS, to identify emotions in speech. Finally, a hybrid data augmentation approach is required to produce and synthesize more data to address the imbalance and shortage of data.