Advanced differential evolution for gender-aware English speech emotion recognition

Speech emotion recognition (SER) technology involves feature extraction and prediction models. However, recognition efficiency tends to decrease because of gender differences and the large number of extracted features. Consequently, this paper introduces a gender-aware SER system. First, gender and emotion features are extracted from speech signals to develop gender recognition and emotion classification models. Second, in view of gender differences, distinct emotion recognition models are established for male and female speakers; the gender of a speaker is determined before the corresponding emotion model is executed. Third, the accuracy of these emotion models is enhanced by an advanced differential evolution algorithm (ADE) that selects optimal features. ADE incorporates new difference vectors, mutation operators, and position learning, which effectively balance global and local search. A new position-repairing method is proposed to address gender differences. Finally, experiments on four English datasets demonstrate that ADE is superior to the comparison algorithms in recognition accuracy, recall, precision, F1-score, number of selected features, and execution time. The findings highlight the significance of gender in refining emotion models, while mel-frequency cepstral coefficients are important factors in gender differences.

knowledge from recent successful evolutionary history. To ensure that DE achieves fast convergence, Lin et al. designed a framework that combines the advantages of different mutation strategies 19. First, an improved mean-individual mutation strategy is integrated into the DE algorithm to enhance global convergence. Second, the DE/current-to-rand/1 strategy is used to improve diversity and generate disturbances that prevent the algorithm from getting stuck in local optima. Lastly, a perturbation strategy is proposed to help the population escape from local traps and improve its exploration ability. To address issues such as local stagnation and numerical instability in large-scale feature selection, Wang et al. proposed a new DE 20. First, they adopt a multi-population strategy to enhance population diversity. Then, a new adaptive mechanism selects multiple policies from a policy pool to acquire information from historical solutions. Finally, a weighted model is designed to identify important features, which enables the model to generate the most appropriate feature subset.
Based on the above analysis, improvements to DE often involve mutation operators and parameter control. It is also necessary to consider that DE is limited by positions when performing feature selection, so we design novel operators. The main contributions of this paper are summarized as follows:

1. We present a gender-aware speech emotion recognition system to address the differences between male and female voices.
2. We propose an advanced DE algorithm for feature selection in emotion recognition.
3. We verify the superiority of the proposed algorithm (ADE) on English speech emotion datasets using multiple metrics, and ADE identifies the features that influence emotion and gender.
The organization of this paper is as follows. "Related works" section presents related studies on speech emotion recognition. "Materials and methods" section describes the proposed system. "Experimental results and discussions" section presents the experimental results with discussions, while "Conclusions" section offers the conclusions.

Related works
Speech signals play an important role in human-computer interaction and serve as the primary input for various applications, such as speech recognition, speech emotion recognition, and gender recognition 21,22.
Recently, automatically extracting speakers' gender and emotional state from speech signals has emerged as a prominent research field, and researchers have explored various approaches to enhance the accuracy of emotion recognition. Emotion recognition is becoming more popular due to its many applications, but it faces challenges arising from factors such as corpus differences, speaker gender, and expression domains (spoken or sung). Zhang et al. studied the impact of these factors on the generalizability of emotion recognition systems across multiple corpora 23. They used these factors to define a multi-task learning method that incorporates corpus-induced variability; in the spoken domain, gender and corpus have equal influence. Sun presented a new SER algorithm that does not rely on hand-crafted acoustic features and incorporates speakers' gender information 24. The goal is to obtain rich information from raw speech data without human intervention. Unlike conventional SER systems that demand manual selection of acoustic features, the approach employs deep learning algorithms to automatically extract essential information from the original speech signals, which prevents the loss of emotional information that cannot be mathematically modeled as acoustic features. Velichko et al. introduced a hierarchical framework for intricate paralinguistic speech analysis 25, including gender, emotion, and deception recognition. The foundation of this framework is the study of the interrelationships between different paralinguistic phenomena: it employs gender information to predict emotional states and uses the results of emotion recognition to predict the authenticity of speech. Aggarwal et al. used naive Bayes and support vector machine (SVM) classifiers to recognize emotions and gender 26, utilizing four speech features: shimmer, jitter, energy, and pitch. The findings suggest that SVM achieves higher accuracy in gender and emotion recognition than naive Bayes.
Mishra et al. developed a two-stage emotion recognition model for gender-distinguished speech that utilizes MFCCs and a convolutional neural network (CNN) 27. Gender-independent emotion recognizers are less effective than gender-dependent ones because of the acoustic differences between male and female speakers. The results show that systems with gender recognition significantly improve performance; notably, performance is further enhanced by employing a global average pooling layer at the end of the CNN classifier. Latif et al. introduced a multi-task framework that uses gender and speaker recognition as auxiliary tasks for emotion classification 28. To maximize the effectiveness of multi-task learning, adversarial autoencoders (AAE), which have strong representation learning and discriminative abilities, are integrated into the framework. Furthermore, the combination of an unsupervised AAE and a supervised classification network achieves semi-supervised learning, which improves the generality of the framework and the overall performance of SER.
Garain et al. converted input speech signals into spectral images 29 and extracted a set of common features for gender, speaker, and emotion recognition tasks. The mayfly algorithm then chooses features with minimal redundancy and maximum relevance (mRMR). Because determining the number of units per layer in an MLP is challenging, the golden ratio is used to complete this task.
Yao et al. developed a framework that integrates three different classifiers: a deep neural network (DNN), a convolutional neural network (CNN), and a recurrent neural network (RNN) 30. This framework is used to classify four emotions: anger, happiness, neutrality, and sadness. To address feature confusion issues that complicate accurate emotion classification, Liu et al. employed a cascaded attention network for SER 31. This network selectively discovers target emotional regions from MFCC features using spatiotemporal attention and employs a joint loss function to distinguish highly similar emotion embeddings, improving overall performance. Deep learning models are suitable for processing large, complex, and high-dimensional data and can

Materials and methods
The proposed system consists of emotional databases, feature extraction, and DE-based feature selection; Fig. 1 presents the flowchart. The system builds a gender prediction model from features extracted from the emotion databases. When DE performs feature selection on the extracted emotion features, the system first predicts the gender of the speaker and then applies the corresponding gender-based emotion model to achieve accurate prediction.
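The routing logic of the system (gender first, then a gender-specific emotion model) can be sketched as follows; all function names are illustrative stubs, not the paper's implementation:

```python
# Sketch of the gender-aware routing described above. The names
# gender_clf, male_model, and female_model are illustrative stubs.

def recognize_emotion(features, gender_clf, male_model, female_model):
    """Predict gender first, then apply the gender-specific emotion model."""
    if gender_clf(features) == "male":
        return male_model(features)
    return female_model(features)

# Usage with trivial stand-in classifiers:
gender_clf = lambda f: "male" if f[0] > 0.5 else "female"
male_model = lambda f: "angry"      # placeholder emotion models
female_model = lambda f: "happy"
print(recognize_emotion([0.9], gender_clf, male_model, female_model))  # angry
```

The point is only the control flow: a single gender decision selects which of two independently trained emotion models handles the utterance.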
1. CREMA-D: a dataset of 7,442 original clips from 91 actors (48 male and 43 female, aged 20 to 74, from a variety of races). The actors speak a selection of 12 sentences, each presented with one of six emotions: Angry, Fearful, Disgust, Neutral, Happy, and Sad.
2. EmergencyCalls: a database from Kaggle. 18 speakers (9 males and 9 females) record four sentences in four emotions: Angry, Drunk, Painful, and Stressful.
3. IEMOCAP-S1: unlike many emotion databases that involve single speakers, IEMOCAP features multiple participants in various scenarios. IEMOCAP-S1 is one session of IEMOCAP and includes many emotions, such as Neutral, Happy, Sad, Angry, Surprised, Fearful, Disgust, Frustrated, Excited, and Other.
4. RAVDESS: 24 professional actors (12 female, 12 male) speak two lexically matched sentences with a neutral North American accent. Speech emotions include Fearful, Calm, Surprised, Happy, Angry, Sad, and Disgust expressions.

Feature extraction
We employ pitch features to identify gender and the OpenSmile toolkit 36 to obtain acoustic features for recognizing emotions.
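As a rough illustration of why pitch separates genders (the OpenSmile extraction itself is not shown here), a minimal autocorrelation-based pitch estimator on a synthetic signal:

```python
import numpy as np

# Minimal pitch estimate via autocorrelation -- a stand-in for the pitch
# features used for gender recognition, not the paper's actual extractor.

def estimate_pitch(signal, sr, fmin=50.0, fmax=500.0):
    """Return the dominant fundamental frequency in [fmin, fmax] Hz."""
    sig = signal - signal.mean()
    corr = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(corr[lo:hi])   # best-matching period in samples
    return sr / lag

sr = 16000
t = np.arange(sr) / sr
male_like = np.sin(2 * np.pi * 120 * t)   # ~120 Hz, a typical male F0
print(estimate_pitch(male_like, sr))
```

Typical male fundamental frequencies sit well below typical female ones, so even a crude F0 estimate like this carries gender information.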

Advanced differential evolution
Difference vectors and the scale factor affect the performance of DE.To improve its diversity, we introduce a novel selection of basis and difference vectors during the mutation process.The scale factor, based on differentiation, accelerates the convergence speed of the population.Additionally, a new position learning strategy is proposed to prevent stagnation in local optima throughout the evolution process.The flowchart of ADE is illustrated in Fig. 2.

Mutation
Individuals in DE update their positions through mutation and crossover, as depicted in Eqs. (1) and (2). If the objective function value of a new individual surpasses that of the existing individual, the latter is replaced; otherwise, the original individual is retained.

$$m_i^j(t) = X_{r_1}^j(t) + F \left( X_{r_2}^j(t) - X_{r_3}^j(t) \right) \tag{1}$$

$$X_i^j(t+1) = \begin{cases} m_i^j(t), & \text{if } rand(j) \le PCR \text{ or } j = randi() \\ X_i^j(t), & \text{otherwise} \end{cases} \tag{2}$$

where $X_i^j(t)$ represents the position of individual $i$ in the $j$-th dimension at the $t$-th iteration, $m$ is the trial value, $F$ is the scale factor, and $PCR$ is the crossover probability.
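A minimal NumPy sketch of the standard DE update in Eqs. (1) and (2); the selection step that keeps the better of trial and parent is omitted, and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def de_trial(X, i, F=0.5, PCR=0.9):
    """Classic DE mutation (Eq. 1) and binomial crossover (Eq. 2)."""
    N, dim = X.shape
    # Pick three distinct individuals different from i.
    r1, r2, r3 = rng.choice([k for k in range(N) if k != i], 3, replace=False)
    m = X[r1] + F * (X[r2] - X[r3])        # Eq. (1): mutant/trial values
    jrand = rng.integers(dim)              # one dimension always crosses over
    mask = (rng.random(dim) <= PCR) | (np.arange(dim) == jrand)
    return np.where(mask, m, X[i])         # Eq. (2): trial vector

X = rng.random((20, 5))                    # population of 20 in 5 dimensions
trial = de_trial(X, 0)
print(trial.shape)
```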
The effectiveness of mutation is strongly affected by the choice of basis and parent vectors. Their correct selection is therefore crucial to balance diversity and convergence and to supply promising search guidance. We first rank the fitness values of the population and then update the positions of only the worst half of the individuals. The other half serve as guidance vectors; that is, $r_1$, $r_2$, and $r_3$ in Eq. (1) come from this part.
Individuals with high objective function values provide useful insights into promising regions, while those with lower fitness values identify less promising areas. Superior individuals guide exploration, whereas inferior ones indicate areas to avoid. In our approach, outstanding individuals act as attractors and lower-performing individuals act as repellers. If the fitness value of $r_2$ is inferior to that of $r_3$, we multiply $F$ by -1, which ensures that the population moves quickly toward the optimal solution.
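The guidance above might be sketched as follows; drawing $r_1$, $r_2$, $r_3$ from the better half of the population and flipping the sign of F is our reading of the text, assuming a minimization objective:

```python
import numpy as np

rng = np.random.default_rng(1)

def guided_mutation(X, fitness, i, F=0.5):
    """Sketch of ADE's guided mutation (an interpretation, not verbatim):
    guidance vectors come from the better half; F flips sign when the
    difference vector points toward the worse of the two parents."""
    order = np.argsort(fitness)            # ascending: best first (minimizing)
    better_half = order[: len(X) // 2]     # guidance vectors only
    r1, r2, r3 = rng.choice(better_half, 3, replace=False)
    if fitness[r2] > fitness[r3]:          # r2 inferior to r3 -> reverse
        F = -F
    return X[r1] + F * (X[r2] - X[r3])

X = rng.random((20, 4))
fit = rng.random(20)
print(guided_mutation(X, fit, 0).shape)
```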
New basis and difference vectors improve convergence, and a differential-based scale factor is proposed to balance exploration and exploitation during the optimization process.
where $f$ represents the objective function value, and $F_i(1)$ is set to the default value of DE. Owing to the fast convergence of the algorithm, $F$ easily becomes too small, in which case it can no longer influence individuals' updates. Consequently, the objective function values of the updated individuals are sorted from best to worst, and the value of $F$ is normalized to [0.2, 0.8]. To increase search diversity, a random mutation within [-0.1, 0.1] is applied to each dimension.
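A sketch of the rank-based normalization of F described above. The exact mapping from sorted objective values into [0.2, 0.8] is not given in the text, so the linear rank scaling below is an assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

def normalize_scale_factors(obj_values, dim):
    """Map each individual's rank (best -> worst, minimization assumed)
    linearly into [0.2, 0.8], then add a per-dimension perturbation
    drawn from [-0.1, 0.1]. Illustrative only."""
    ranks = np.argsort(np.argsort(obj_values))        # 0 = best objective
    base = 0.2 + 0.6 * ranks / max(len(obj_values) - 1, 1)
    noise = rng.uniform(-0.1, 0.1, size=(len(obj_values), dim))
    return base[:, None] + noise                      # one F per dimension

F = normalize_scale_factors(np.array([0.3, 0.1, 0.5, 0.2]), dim=3)
print(F.shape)
```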

Position learning
In the early stages of ADE, individuals are distributed throughout the search space, and global search is an efficient way to gather evolutionary information. As the search progresses, individuals tend to cluster in several promising areas, and it becomes necessary to balance global and local search. In the final stages of the algorithm, individual differences slow convergence and the search tends to become localized. If the population does not update the global optimal solution for ten consecutive cycles, individuals are compelled to leave the current search area. In feature selection, the positions of individuals are limited to [0, 1] and compared with a threshold of 0.5 to determine whether the corresponding feature is selected. At the beginning of the algorithm, Eq. (4) applies opposition-based learning (OBL), and individuals search in opposite directions. In the subsequent phase, individuals randomly choose a search point between the current and opposite positions.
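The thresholding and opposition-based learning steps are straightforward to illustrate (a sketch; the schedule for choosing between current and opposite points is simplified away):

```python
import numpy as np

rng = np.random.default_rng(3)

def to_feature_mask(position, threshold=0.5):
    """Positions in [0, 1] become a binary feature-selection mask."""
    return position > threshold

def opposition(position, lo=0.0, hi=1.0):
    """Opposition-based learning: reflect each coordinate in the box."""
    return lo + hi - position

x = rng.random(6)
print(to_feature_mask(x).astype(int))
print(to_feature_mask(opposition(x)).astype(int))
```

Because the box is [0, 1] and the threshold is its midpoint, the opposite position selects exactly the complementary feature set, which is what lets OBL explore the opposite direction cheaply.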

Repairing positions
In this study, the dimension of an individual is twice the number of emotional features (2 * dim), where dim represents the number of emotional features. Male emotional features are encoded by dimensions 1 to dim, while female emotional features are encoded by dimensions dim+1 to 2 * dim. Owing to the physiological differences between males and females and the random nature of metaheuristic algorithms, the selected emotional features are not fully consistent across genders, so an individual's position needs to be corrected, as demonstrated in Algorithm 1.

Algorithm 1. Repairing positions

Lines 3 and 8 of Algorithm 1 imply that in the early stages, or when the male and female positions are similar, the algorithm does not adjust the positions, in order to preserve diversity among individuals. Lines 9-27 describe the correction process, where a and b represent the numbers of features selected for males and females in different historical optimal solutions achieved by the algorithm. Position correction ensures the consistency of acoustic features while taking gender differences into account.
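Since Algorithm 1 itself is presented as a figure, the following is only a loose, hypothetical sketch of what position repair could look like: the male and female halves of a position vector are nudged toward agreement while some gender-specific differences survive. The function name and the agree_prob parameter are inventions for illustration, not the paper's rules:

```python
import numpy as np

def repair_position(position, dim, agree_prob=0.7,
                    rng=np.random.default_rng(4)):
    """Hypothetical repair: where the male mask (first dim entries) and the
    female mask (next dim entries) disagree, copy the male value with
    probability agree_prob; remaining disagreements model gender
    differences. Algorithm 1's actual rules differ in detail."""
    pos = position.copy()
    male, female = pos[:dim], pos[dim:2 * dim]
    for j in range(dim):
        disagree = (male[j] > 0.5) != (female[j] > 0.5)
        if disagree and rng.random() < agree_prob:
            female[j] = male[j]        # align this feature across genders
    return pos

p = np.random.default_rng(5).random(12)
print(repair_position(p, dim=6).shape)
```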

Experimental results and discussions
We conduct experiments to verify the superiority of the proposed ADE algorithm against the DE 37, BBO_PSO 38, and MA 29 algorithms. BBO_PSO and MA are two state-of-the-art algorithms in emotion recognition: BBO_PSO focuses on emotion recognition alone, while MA classifies emotions based on gender. DE and ADE utilize the gender-emotion model shown in Fig. 1. Table 2 displays the main parameter settings of the compared algorithms.
The algorithms are executed 20 times, and their population size is fixed at 20. DE, BBO_PSO, and MA have a maximum of 100 iterations, while ADE has 200 iterations. To evaluate potential significant differences in the experimental results, we utilize the Wilcoxon rank-sum test and the Friedman test, with the significance level set at 0.05.
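The two significance tests can be run with SciPy; the accuracy samples below are synthetic stand-ins for the 20 recorded runs:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
ade = rng.normal(0.80, 0.01, 20)   # fake per-run accuracies, 20 runs each
de = rng.normal(0.76, 0.01, 20)

# Wilcoxon rank-sum test for a pairwise difference at alpha = 0.05.
stat, p = stats.ranksums(ade, de)
print("rank-sum significant:", p < 0.05)

# Friedman test compares >2 algorithms across datasets; here a fake
# 4 datasets x 4 algorithms score table stands in for Table 3.
scores = rng.random((4, 4))
chi2, p_f = stats.friedmanchisquare(*scores.T)
print("friedman p-value:", p_f)
```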

Objective function
Classification accuracy is the most important metric for SER algorithms, so it is used as the objective function in the experiments, as indicated in Eq. (5). Accuracy, weighted accuracy, and unweighted accuracy are all metrics for evaluating the classification ability of emotion recognition algorithms. Although weighted and unweighted accuracy better reflect a model's performance on imbalanced categories, our emotion datasets contain multiple emotion types, which already test the performance of the algorithms comprehensively. Our objective is to maximize recognition accuracy, which is achieved by evaluating accuracy directly. We analyze the confusion matrix to identify which emotions each algorithm classifies well. Additionally, we evaluate the algorithms by comparing recall, precision, F1-score, the number of selected features, and execution time.
$$\text{Accuracy} = \frac{S_1}{S_1 + S_2} \tag{5}$$

where $S_1$ and $S_2$ represent the numbers of correctly and incorrectly classified samples, respectively.
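With S1 and S2 defined as above, Eq. (5) reduces to a one-line function:

```python
def accuracy(s1, s2):
    """Eq. (5): fraction of correctly classified samples."""
    return s1 / (s1 + s2)

print(accuracy(85, 15))  # 85 correct out of 100 -> 0.85
```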

Experimental analysis
We use SVM as the classifier, and we employ 10-fold cross-validation to assess the performance of the algorithms.
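The evaluation protocol (an SVM scored by 10-fold cross-validation) looks like this in scikit-learn, here on synthetic data rather than the extracted speech features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the selected emotion features: 200 samples,
# 30 features, 4 emotion classes.
X, y = make_classification(n_samples=200, n_features=30, n_informative=10,
                           n_classes=4, random_state=0)

# One accuracy score per fold; the mean is the objective value a
# feature subset would receive.
scores = cross_val_score(SVC(), X, y, cv=10)
print(len(scores), scores.mean())
```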
For ease of reading, the best experimental results obtained by the algorithms are marked in bold. Table 3 displays the average, minimum, and maximum recognition accuracy. ADE exhibits superior classification accuracy to DE in CREMA-D, EmergencyCalls, and RAVDESS, whereas DE outperforms ADE only in IEMOCAP-S1. This suggests the effectiveness of the proposed DE improvements. Additionally, ADE demonstrates better accuracy than BBO_PSO, MA, and DE in CREMA-D, EmergencyCalls, and RAVDESS, while DE has the best average recognition in IEMOCAP-S1. The overall performance of all algorithms in IEMOCAP-S1 is mediocre, mainly because the dataset contains the most emotional features and its sample distribution is uneven, which prevents the algorithms from building accurate prediction models. In EmergencyCalls, ADE achieves the best prediction accuracy, and its worst prediction value is also better than those of BBO_PSO, MA, and DE. In IEMOCAP-S1, ADE attains the highest classification accuracy at 0.5729, outperforming the other algorithms, while DE's worst prediction value of 0.5578 is superior to those of the other algorithms. In RAVDESS, ADE and DE outperform the comparison algorithms in the best and worst prediction values, respectively.
The Wilcoxon rank-sum test reveals that BBO_PSO is statistically comparable to ADE in RAVDESS, and that DE and ADE are similar in IEMOCAP-S1. The average ranks of BBO_PSO, MA, DE, and ADE across CREMA-D, EmergencyCalls, IEMOCAP-S1, and RAVDESS are 3, 3.75, 2, and 1.25, respectively, with a p-value of 0.0440. The Friedman test therefore demonstrates that ADE performs best on the emotional datasets. MA, DE, and ADE are all gender-based emotion recognition algorithms, while BBO_PSO does not utilize gender for emotion recognition. From Table 3, we can see that DE and ADE perform better than BBO_PSO, which indicates that gender information can improve emotion recognition accuracy.
To further validate the efficiency of the algorithms, we analyze their performance in terms of precision, recall, and F1-score, as shown in Table 4. The algorithms are most effective in RAVDESS, and the results are comparable in CREMA-D and EmergencyCalls. Since some emotion classes in IEMOCAP-S1 have few samples, the algorithms cannot classify them correctly; consequently, precision and F1-score are unavailable for those classes, which also have low recall values. ADE outperforms the comparison algorithms in precision, recall, and F1-score in CREMA-D, EmergencyCalls, and IEMOCAP-S1, but is surpassed by BBO_PSO in RAVDESS. ADE outperforms BBO_PSO, MA, and DE in RAVDESS in classification accuracy, yet lags in recall and precision. The optimization goal of ADE is to improve overall classification accuracy rather than the recognition of each individual class. This may cause ADE to perform poorly on rare or borderline samples, missing some positive samples (lower recall) or misclassifying more negative samples (lower precision).
Table 5 presents the running time and the number of selected features. BBO_PSO has a clear advantage in running time, with better operational efficiency than MA, DE, and ADE in EmergencyCalls, IEMOCAP-S1, and RAVDESS, while ADE achieves the shortest running time in CREMA-D. The time complexity of the SVM classifier is between O(D * T^2) and O(D * T^3), where D is the number of features and T the number of samples, so the computation time of feature selection algorithms is dominated by the classifier. Although BBO_PSO uses more features than ADE, the time difference between them is marginal. ADE takes less time than DE on all four datasets. Because CREMA-D contains the largest number of samples and EmergencyCalls the smallest, the computation time in CREMA-D is considerably longer than in the other datasets.
Because the algorithms draw from the same extracted feature pool in each dataset, the numbers of features they select are similar. Nevertheless, ADE utilizes fewer features than the other algorithms. On the other hand,

Discussion
The time complexity of ADE is O(G * N * dim + G * N * f), where f is the execution time of the objective function, and G and N represent the maximum number of iterations and the population size. In feature selection, because f dominates, the overall time complexity can also be represented as O(G * N * f).

Figure 3 depicts the confusion matrices of ADE. In CREMA-D, ADE recognizes Angry, Neutral, and Sad well, but for Disgust and Fearful, the Sad, Happy, and Angry classes greatly interfere with accuracy. In EmergencyCalls, the recognition of Angry, Drunk, and Stressful is affected by the presence of Painful; ADE performs remarkably well in identifying Painful, Angry, and Drunk, but Stressful is hard to distinguish. In IEMOCAP-S1, the numbers of samples for Surprised, Fearful, Other, and Disgust are relatively small; ADE finds it difficult to classify them correctly, although they do not affect the recognition of other emotions. The algorithm's emotion recognition is complicated by the Neutral and Frustrated classes, and ADE classifies Sad and Excited best. In RAVDESS, ADE is the top performer in recognizing Calm, Angry, Fearful, Disgust, and Surprised; however, it easily mistakes Neutral for Calm and Sad.
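For readers less familiar with confusion matrices like Fig. 3: rows are true emotions, columns are predictions, so off-diagonal mass shows which emotions interfere with each other. A toy example with invented labels and data:

```python
from sklearn.metrics import confusion_matrix

# Six fake utterances: two Angry, two Neutral, two Sad.
true = ["Angry", "Angry", "Sad", "Neutral", "Neutral", "Sad"]
pred = ["Angry", "Sad",   "Sad", "Neutral", "Calm",    "Sad"]
labels = ["Angry", "Calm", "Neutral", "Sad"]

# cm[i, j] counts samples of true class labels[i] predicted as labels[j].
cm = confusion_matrix(true, pred, labels=labels)
print(cm)
```

Here the Angry row has an entry in the Sad column, the kind of off-diagonal confusion the discussion above points out between, e.g., Neutral and Calm in RAVDESS.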
Gender information in MFCCs 3, 6-9, and 13 differs significantly between male and female speakers, and the mean, variance, and differentiation of features also have statistical characteristics that impact recognition.

Recently, speech emotions have been recognized through CNNs and feature fusion 39,40. The method in 39 achieved an accuracy of 58.62% in RAVDESS, which increases to 78.35% with data augmentation. By combining the frequency

Figure 3. The confusion matrix of ADE.

Table 1. Summary of the extracted features.

Table 2. The main parameter settings.

Table 3. The classification accuracies of the algorithms. Significant values are in bold.

Table 4. The recall, precision, and F1-score of the algorithms. Significant values are in bold.

Table 5. The number of selected features and running time of the algorithms. Significant values are in bold.