Clustering-Based Speech Emotion Recognition by Incorporating Learned Features and Deep BiLSTM

Recognizing the emotional state of a speaker is a difficult task for machine learning algorithms and is the central problem of speech emotion recognition (SER). SER plays a significant role in many real-time applications, such as human behavior assessment, human-robot interaction, virtual reality, and emergency centers, where the emotional state of speakers must be analyzed. Previous research in this field has mostly focused on handcrafted features and traditional convolutional neural network (CNN) models that extract high-level features from speech spectrograms; these increase the recognition accuracy but also the overall cost complexity of the model. In contrast, we introduce a novel framework for SER that selects key sequence segments using radial basis function network (RBFN) similarity measurement in clusters. The selected sequence is converted into a spectrogram by applying the STFT algorithm and passed to the CNN model to extract discriminative and salient features from the speech spectrogram. Furthermore, we normalize the CNN features to ensure precise recognition performance and feed them to a deep bi-directional long short-term memory (BiLSTM) network, which learns the temporal information needed to recognize the final emotional state. In the proposed technique, we process the key segments instead of the whole utterance to reduce the computational complexity of the overall model, and we normalize the CNN features before further processing so that the model can easily recognize the spatio-temporal information. The proposed system is evaluated on standard datasets, including IEMOCAP, EMO-DB, and RAVDESS, with the aim of improving the recognition accuracy and reducing the processing time of the model. The experiments demonstrate the robustness and effectiveness of the suggested SER model in comparison with state-of-the-art SER methods, with accuracies of up to 72.25%, 85.57%, and 77.02% on the IEMOCAP, EMO-DB, and RAVDESS datasets, respectively.


I. INTRODUCTION OF SER
Automatic recognition of emotions from speech signals using machine learning, known as speech emotion recognition (SER), is a challenging task [1]. Speech is a quick and natural method of communication and information exchange between humans and computers, and SER has many real-world applications in the domain of human-computer interaction (HCI). Currently, researchers face a major challenge in feature extraction, i.e., how to select a robust method to extract salient and discriminative features from speech signals that represent the emotional state of a speaker from the acoustic content. In the past decade, many researchers have investigated low-level handcrafted features for SER, such as energy, zero-crossing rate, pitch, linear predictor coefficients, Mel-frequency cepstral coefficients (MFCC), and nonlinear features such as the Teager energy operator. Nowadays, most researchers use deep learning techniques for SER with Mel-scale filter bank speech spectrograms as the input feature. A spectrogram is a 2-D representation of a speech signal that is widely used with convolutional neural networks (CNNs) for extracting salient and discriminative features in SER [2] and other signal processing applications [3], [4]. 2-D CNNs were originally designed for visual recognition tasks [5]-[7], and their performance has inspired researchers to explore them in the field of SER. Spectrograms are a suitable representation of speech signals from which CNN models can extract high-level salient information to recognize emotions. Similarly, some researchers have developed fully convolutional networks (FCNs), built on CNNs, to handle inputs of variable size. FCNs have achieved good performance in time series classification tasks with variable-size inputs [8]. The drawback of FCNs is that they cannot learn temporal information; for this purpose, LSTM-RNNs are suitable because they learn spatial and temporal features among sequences [9]. In the field of SER, CNN-LSTM and LSTM-RNN models are now widely used for extracting hidden temporal information [10]. Some researchers are working to improve the recognition performance of SER by selecting salient segments from speech signals and learning temporal features using the CNN-LSTM model [11].
The associate editor coordinating the review of this manuscript and approving it for publication was Mohammad Zia Ur Rahman.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Badshah et al. [12] proposed an SER method using CNN features for smart affective devices to recognize the emotional state of a person in health care centers. SER is an active area of research, and researchers have recently been using deep learning techniques to develop a variety of methods for recognizing the emotional state of speakers. Typically, researchers use CNN models to learn high-level salient and discriminative features and feed them to an LSTM network, which learns hidden temporal features to recognize emotions among sequences. The use of CNNs and artificial intelligence increases recognition accuracy, but the computational cost also increases with the large number of network weights. Existing CNN and LSTM architectures have not shown a substantial improvement in accuracy or a reduction in the cost complexity of existing SER systems. In this research, we propose a novel deep learning-based approach for SER using RBF-based K-means clustering with a deep BiLSTM network. In the proposed method, we select the emotional segments from the whole audio, using an RBF-based similarity measurement technique to select one segment from each cluster. The selected sequence of segments is converted into spectrograms using the STFT algorithm. Furthermore, we extract high-level discriminative features from the selected segments using the ''FC-1000'' layer of the Resnet101 [13] model. We then normalize the features using their mean and standard deviation and feed them to the deep BiLSTM network to extract temporal information and recognize the final state. A Softmax classifier produces the probabilities of the speech emotions. The main contributions of the proposed technique are as follows:
1. We propose an efficient and novel framework for SER that is able to learn spatial and temporal information from speech spectrograms by combining a CNN with a deep bidirectional LSTM. Our model is capable of learning features and automatically modeling temporal dependencies. To the best of our knowledge, the CNN model used in our research is new to the SER domain; we therefore aim to contribute to the SER literature by using ResNet101 features in an effective manner, integrated with a sequential learning mechanism.
2. We propose a new strategy for SER based on sequence selection and extraction via a non-linear RBFN-based method that measures the similarity level in clustering. We select one key segment from each cluster, namely the segment nearest to the centroid of the cluster, to represent the rest of the segments. We then process only these key segments, which ensures accurate emotion recognition and reduces the processing time, as shown by the experiments.
3. We show that the presented technique, a deep learning approach based on key-segment sequence selection and normalization of CNN features by mean and standard deviation, can improve on existing state-of-the-art methods. To the best of our knowledge, this is a new deep learning approach for SER based on RBFN with a CNN and a deep BiLSTM. A key contribution of our framework thus lies in the normalization technique, which makes the features more useful.
4. We tested the proposed SER model on different benchmark datasets and evaluated it from different perspectives against baseline methods. The results are encouraging and suggest that the model is suitable for monitoring the real-time emotions of speakers. The achieved accuracy for the IEMOCAP, EMO-DB, and RAVDESS datasets is 72.25%, 85.57%, and 77.02%, respectively.
The rest of the manuscript is organized as follows: the literature on existing SER techniques is reviewed in Section II; the suggested SER framework is explained in detail in Section III; the experimental results of the proposed technique are given in Section IV; the experiments are discussed in Section V; and Section VI presents the conclusion and future work.

II. LITERATURE REVIEW OF SER
Digital signal processing is an emerging field of research. Over the past decade, many researchers have developed a variety of approaches for SER in this area. Typically, the SER task is divided into two main parts: feature selection and classification. Selecting discriminative features and a classification method that correctly recognizes the emotional state of the speaker is a challenging task in this domain [14]. With the increase in data and computation, deep learning approaches are increasingly used for SER [15], and many researchers have used deep learning for robust feature representation in various fields [11]. Motivated by the success of CNNs in visual recognition tasks, Huang et al. [4] presented a CNN-based approach for SER; similarly, [16] used a CNN to learn high-level discriminative features from spectrograms of speech signals and recognize the emotional state of speakers. Some researchers have used the Gaussian mixture model to classify the emotional state of the speaker with robust features [17].
Nowadays, most researchers work with 2-D CNNs to extract high-level discriminative features from speech signals. Hence, extracting spectrograms, which plot the frequency content of speech signals with respect to time, and feeding them to CNNs to learn hidden information has become a new research trend in SER [2], [18]. Similarly, transfer learning strategies can be used for SER by passing speech spectrograms through pre-trained CNN models such as VGG [5] or Alex-Net [19]. The spectrogram is a suitable representation from which CNN models can extract high-level discriminative features to recognize the emotional state of the speaker in an SER system [20]. LSTM-RNNs, in turn, are mostly used to learn the hidden temporal information in speech signals and are frequently employed in SER systems [21], [22]. Deep learning approaches now play a crucial role in the growing research interest in SER. Recently, an end-to-end LSTM-DNN model for SER was presented in [23]; it combines LSTM layers and fully connected layers to extract representations directly from raw data rather than from hand-crafted features.
A joint CNN-LSTM approach is presented in [24], in which a CNN extracts deep, salient, high-level features from raw speech data and passes them to an LSTM network that captures the sequential information, similar to [25]. Ma et al. [26] presented a neural network structure that accepts variable-length speech for SER. In this method, a CNN represents the features of speech spectrograms and RNNs handle the variable-length speech segments. Zhang et al. [27] presented a technique for SER that uses the pre-trained Alex-Net model for feature representation and a traditional support vector machine (SVM) for emotion classification. Similarly, Liu et al. [28] used the CNN-LSTM model for spontaneous SER on the RECOLA [29] natural emotion dataset.
In the field of SER, many methods use CNN models with different types of input to extract salient features from speech signals and boost the recognition accuracy [30]. Similarly, some researchers used a pre-trained model to extract high-level features from speech spectrograms and trained a separate classifier [31] for recognition, which increases the computational cost of the system. In this paper, we develop a novel SER technique that processes only useful segments of the whole audio file, selected through the K-means clustering algorithm using an RBF-based similarity measurement. The selected speech segments are converted into spectrograms, and high-level discriminative features are extracted from them using the CNN model Resnet101. Furthermore, we normalize these features using their mean and standard deviation and then feed them to a deep BiLSTM network, which learns hidden temporal information from the speech segments to recognize the final emotional state of the speaker. The proposed system reduces the execution time because it processes selected segments rather than all segments, and it increases the accuracy because it uses salient, normalized features with a deep BiLSTM network. To the best of our knowledge, the proposed architecture is novel and more efficient than the other methods described in the literature.

III. PROPOSED TECHNIQUE OF SER
In this section, the proposed SER framework and its main components are discussed in detail, including the recognition of emotion in speech. The suggested framework consists of three main blocks. The first block has two parts. In the first part, we divide the audio file into multiple segments with respect to time and compute the difference between consecutive segments. The obtained difference is compared with a threshold to assess similarity and to determine the value of ''K'' for clustering, using the shot boundary detection method [32]. We start with K = 1 and estimate the pairwise difference between consecutive segments; ''K'' is unchanged while the difference stays within the threshold, and each time the difference exceeds the threshold, ''K'' is incremented by one. In this way, we select the value of ''K'' dynamically and form the groups accordingly. Furthermore, from each group or sequence we select as the key segment the one segment that is nearest to the center of the cluster. We use the RBF strategy for similarity measurement inside the clustering algorithm, which is explained in detail in Section III (B). In the second part, the selected sequence of key segments is converted into spectrograms, which plot frequency with respect to time, using the STFT. In the second main block, we perform feature learning to extract salient and discriminative features from the speech spectrograms with a transfer learning strategy that uses the ''FC-1000'' layer of the pre-trained Resnet101 [13]. The detailed specification of each unit and layer of the proposed Resnet model is given in Table 1. The learned features are normalized using their mean and standard deviation for better performance. In the last block, we feed the normalized CNN features to the suggested deep bi-directional LSTM, which learns temporal cues, recognizes the sequential information in a sequence, and determines the final emotional state of the speaker.
A diagrammatic representation of the proposed framework is shown in Figure 1. Each block of the framework is described in detail in the subsequent sections.
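The dynamic choice of ''K'' described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the mean absolute difference used as the pairwise measure, and the threshold value are assumptions, since the text only states that consecutive differences are compared with a threshold following [32].

```python
def mean_abs_diff(a, b):
    """Hypothetical pairwise difference between two equal-length segments."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def estimate_k(segments, threshold, distance=mean_abs_diff):
    """Start with K = 1 and add one each time the difference between
    consecutive segments exceeds the threshold."""
    k = 1
    for prev, curr in zip(segments, segments[1:]):
        if distance(prev, curr) > threshold:
            k += 1
    return k


# Toy segments: two abrupt changes, so three groups are expected.
toy_segments = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1], [0.0, 0.0]]
k = estimate_k(toy_segments, threshold=1.0)
```

In this toy example the consecutive differences are small, large, small, large, so ''K'' grows from 1 to 3. The estimate depends directly on the threshold, which would have to be tuned per corpus.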

A. PRE-PROCESSING AND SEQUENCE SELECTION
In this section, we split the audio file into multiple chunks (frames) of a suitable duration, converting the whole utterance into segments. Selecting a suitable duration for the audio segments is a challenging problem. Many researchers have studied how to select a suitable duration for each speech segment and have arrived at a reasonable finding: a speech segment longer than 260 ms carries enough information to recognize the emotion in the speaker's speech [33], [34]. In this paper, we experimented with multiple frame durations and selected a 500 ms window size as the best choice for converting a single utterance into several segments. A single label is assigned to all segments of one utterance, and the segments are given to the K-means clustering [35] algorithm, which groups similar segments together. The K-means clustering algorithm is very widely used for grouping big data [36]. The Euclidean distance matrix [37], [38] is conventionally used in K-means clustering for computing the difference between elements. In this work, however, we replace the Euclidean distance matrix in K-means with radial basis functions (RBF) [37], [38] for computing the difference between two frames, because the RBF is a non-linear measure, analogous to the non-linear way in which the human brain computes differences and recognizes patterns. The other important part is the selection of the ''K'' value for partitioning the data into ''K'' groups. Conventionally, ''K'' is fixed in advance and the K-means centroids are initialized randomly; in this approach, we instead select the ''K'' value for each file dynamically, using the shot boundary detection method to estimate the similarity [32]. The pairwise difference is computed between consecutive frames, and whenever the difference is greater than the selected threshold value, ''K'' is incremented by one.
After all segments have been clustered using the K-means algorithm, one segment is selected from each cluster as the key segment, namely the segment nearest to the centroid of the cluster according to the RBF distance method, which is explained in the next section. The selected key segments are converted into spectrograms using the STFT algorithm to obtain a 2-D representation.
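The key-segment selection step can be illustrated with a short Python sketch. It assumes the cluster labels are already available from K-means, represents each segment as a plain list of numbers, and uses a Gaussian RBF with a single shared width `sigma`; the function names and the shared width are hypothetical simplifications rather than details taken from the paper.

```python
import math


def rbf_similarity(x, z, sigma=1.0):
    """Sum of 1-D Gaussian RBF responses, one per component (assumed form)."""
    return sum(math.exp(-((xi - zi) ** 2) / (2.0 * sigma ** 2))
               for xi, zi in zip(x, z))


def centroid(members):
    """Component-wise mean of the segments in one cluster."""
    n = len(members)
    return [sum(column) / n for column in zip(*members)]


def select_key_segments(segments, labels, sigma=1.0):
    """Return {cluster label: index of the segment most similar to the centroid}."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    keys = {}
    for label, indices in clusters.items():
        z = centroid([segments[i] for i in indices])
        keys[label] = max(indices,
                          key=lambda i: rbf_similarity(segments[i], z, sigma))
    return keys


toy_segments = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0],
                [10.0, 10.0], [10.2, 10.2], [11.0, 11.0]]
toy_labels = [0, 0, 0, 1, 1, 1]
key_indices = select_key_segments(toy_segments, toy_labels)
```

Here the key segment of cluster 0 is the middle segment, which coincides with the centroid, and the key segment of cluster 1 is the one closest to the cluster mean. Only these indices would then be converted to spectrograms.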

B. SIMILARITY MEASURING BASED ON RBF
In this section, we describe in detail the non-linear similarity measure between audio signal segments and discuss the RBF-based similarity approach for audio signal processing. The RBF computes the similarity between segments in a non-linear manner [39]. The visual perception system of the human brain also relies on non-linear processing to differentiate and recognize patterns. Hence, we use this approach in our proposed framework to measure the similarity between audio segments.
We explore the RBF to simulate the non-linear human perception model and to compute the similarity between audio segments. Our model therefore also works as a non-linear model based on an RBFN [40]. We use a mapping function to find the degree of similarity between audio segments, and the concept of regularization is applied to estimate the mapping function from a basic RBF. The 1-D Gaussian-shaped basis function G(x, z) [41] meets an important requirement of the regularization method: it smooths the mapping function so that similar inputs correspond to similar outputs. The center and width of the function are denoted by the parameters ''z'' and ''σ'', and the Gaussian transformation G(x, z) gives the degree of similarity between the input ''x'' and the center ''z'' as a function of their distance. An RBFN combines several such RBFs and has an exceptional ability for non-linear approximation [42]; its mapping function f(x) is defined as the sum of ''N'' Gaussians G i (x, z i ), where ''z i '' denotes the center and ''σ i '' the width of the i-th function. To reduce the computational cost of the network, we use a 1-D Gaussian RBF for every segment of the speech signal.
Here, x = [x 1 , ..., x P ] T is a particular part of the speech signal in the utterance, z = [z 1 , ..., z P ] T contains the center points of the RBFs, and σ i (i = 1, ..., P) denotes the width of the i-th RBF. We use equation 5 to calculate the similarity between two signal segments and characterize it by an adjustable width for each RBF, which yields ''P'' basis functions {G 1 (σ 1 ), G 2 (σ 2 ), ..., G P (σ P )}. The widths are obtained through parameter tuning, non-linear weighting, and sample variance estimation over the relevance set. If a specific part of the speech signal is highly relevant, its standard deviation across the speech segments is expected to be small; a high standard deviation means that it is irrelevant. The similarity is therefore more sensitive to changes in distance for those parts that have a small parameter ''σ''.
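For reference, the Gaussian RBF and the mapping function described in this section can be written compactly in LaTeX as below. This is a reconstruction from the symbols defined in the text, not the original typeset equations: the symbol G, the component-wise form, and the use of P terms in the sum are assumptions, and the original equation numbering is not reproduced.

```latex
% Assumed reconstruction; symbols follow the definitions in the text.
\begin{aligned}
G(x, z) &= \exp\!\left(-\frac{(x - z)^{2}}{2\sigma^{2}}\right), \\
f(\mathbf{x}) &= \sum_{i=1}^{P} G_i(x_i, z_i)
              = \sum_{i=1}^{P} \exp\!\left(-\frac{(x_i - z_i)^{2}}{2\sigma_i^{2}}\right).
\end{aligned}
```

Under this form, each term equals 1 when x_i = z_i and decays toward 0 as |x_i - z_i| grows relative to σ_i, so a component with a small σ_i reacts more sharply to a change in distance, which agrees with the sensitivity argument above.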

C. CNN FEATURE EXTRACTION AND RECURRENT NEURAL NETWORK
In this section, we discuss in detail the feature extraction and RNN processing of sequential audio data for recognizing the emotions of a speaker from his/her speech. CNNs are among the most powerful tools currently available for representing and recognizing hidden information in data. In our approach, we convert the speech signal into multiple segments; each individual segment is represented by CNN features, followed by a deep BiLSTM that finds the sequential information. Speech signals contain a great deal of redundant information, which is computationally expensive to process and degrades the overall efficiency of the model.
Considering this constraint, we propose a novel technique for selecting the most dominant sequence from an utterance based on K-means and RBF, as explained in detail in the sections above. Each segment of the selected sequence is converted into a spectrogram, which plots the frequencies with respect to time, using the STFT algorithm to obtain a 2-D representation. The sequence of spectrograms [43] is fed to the pre-trained CNN model Resnet101 [44], which extracts high-level discriminative features through a transfer learning strategy that uses the last ''FC-1000'' layer. The features of each segment are treated as one RNN step with respect to the time interval. RNNs are a dominant tool for analyzing hidden information in both spatial and temporal sequential data [45]. We process all key segments of every utterance, and the final state of the RNN is taken as the final emotion recognition for each utterance. RNNs can easily learn sequential data but forget the earlier parts of long sequences. This is the vanishing gradient problem of RNNs, which is solved by the LSTM [46]. The LSTM is a special type of RNN with input, output, and forget gates that allow it to learn long sequences.
In the LSTM, x t represents the input at time ''t'' and f t represents the forget gate, which clears information from the cell when needed and keeps a record of the previous state. ''o t '' represents the output gate, which is responsible for keeping information for the upcoming step, and ''g'' represents the recurrent unit, which has a ''tanh'' activation function and is computed from the present input segment and the previous state s t−1 . The hidden state of the RNN is calculated at every step from the memory cell ''c t '' through the ''tanh'' activation function. The final state of the RNN is fed to the Softmax classifier, which takes the final decision of the network. A simple LSTM network cannot correctly recognize large and complex sequences when trained on a huge amount of data. Hence, in this paper, we propose a multi-layer deep BiLSTM to learn and recognize long-term sequences in audio data for emotion recognition. The internal structure and memory block information are illustrated in Figure 2.
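The gates described above correspond to the standard LSTM formulation, which can be written in LaTeX as follows. The weight matrices W, the biases b, the input gate i_t, and the use of sigm for the logistic function are assumptions introduced here for illustration; the text names only x_t, f_t, o_t, g, c_t, and s_{t-1}.

```latex
% Standard LSTM equations in the notation of this section (assumed symbols: W, b, i_t).
\begin{aligned}
i_t &= \mathrm{sigm}\!\left(W_i\,[x_t, s_{t-1}] + b_i\right), \\
f_t &= \mathrm{sigm}\!\left(W_f\,[x_t, s_{t-1}] + b_f\right), \\
o_t &= \mathrm{sigm}\!\left(W_o\,[x_t, s_{t-1}] + b_o\right), \\
g   &= \tanh\!\left(W_g\,[x_t, s_{t-1}] + b_g\right), \\
c_t &= f_t \odot c_{t-1} + i_t \odot g, \\
s_t &= o_t \odot \tanh(c_t).
\end{aligned}
```

The additive update of c_t is what lets gradients flow across many steps, which is why the LSTM mitigates the vanishing gradient problem mentioned above.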

D. BI-DIRECTIONAL LSTM
In a BiLSTM, the output at time ''t'' depends on both the previous and the next segments of the sequence, not on a single direction only [47]. A bidirectional RNN consists of two stacked RNNs, one running in the forward direction and the other in the backward direction, and the joint output is calculated from the hidden states of both. In this paper, we use the multi-layer concept of LSTM networks; in our method, we use a two-layer network for both the backward and the forward pass. The overall concept of the suggested multilayer bidirectional LSTM is shown in Figure 3. The figure shows the external architecture, which represents the training phase of the bidirectional RNN and combines the hidden states of the forward and backward passes in the output layer. After the output layer, the cost and validation error are computed, and the weights and biases are adjusted through back propagation. The network is validated on 20% of the data, which is held out from the training data, and the error rate on the validation data is computed using cross-entropy. Adam optimization [48] is used to minimize the cost with a learning rate of 0.001. In the deep BiLSTM network, the forward and backward passes consist of cells that make the network deep, so the output is computed from both the previous and the next parts of the sequence with respect to time, because the network operates in both directions.
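The way a bidirectional layer pairs past and future context can be sketched with a generic recurrence in Python. This is a toy illustration under stated assumptions: the step functions stand in for LSTM cells, and the function name and the tuple pairing of the two states are hypothetical, not the MATLAB implementation used in the paper.

```python
def run_bidirectional(sequence, step_fwd, step_bwd, init_fwd, init_bwd):
    """Run one recurrence left-to-right and another right-to-left, then
    pair the two states at each time step."""
    forward_states, state = [], init_fwd
    for x in sequence:
        state = step_fwd(state, x)
        forward_states.append(state)

    backward_states, state = [], init_bwd
    for x in reversed(sequence):
        state = step_bwd(state, x)
        backward_states.append(state)
    backward_states.reverse()  # align backward states with time order

    return list(zip(forward_states, backward_states))


def add_step(state, x):
    """Toy recurrence (running sum) used in place of an LSTM cell."""
    return state + x


paired = run_bidirectional([1, 2, 3], add_step, add_step, 0, 0)
```

With the running-sum recurrence, the forward state at step t summarizes everything up to t and the backward state summarizes everything from t onward, so each output pair sees the whole sequence. Stacking a second such layer on the paired outputs would mirror the two-layer design described above.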

IV. EXPERIMENTAL SETUP AND RESULTS
In this section, we evaluate the effectiveness of the proposed SER system and compare it with baseline methods on publicly available benchmark speech emotion datasets. We use three public speech emotion datasets: the interactive emotional dyadic motion capture dataset (IEMOCAP) [49], the Berlin emotional dataset (Emo-DB) [50], and the Ryerson audio-visual database of emotional speech and song (RAVDESS) [51]. The datasets are described in detail in the following subsections.

A. IEMOCAP DATASET
IEMOCAP [49] is a well-known dataset that is commonly used for emotional speech recognition and contains two types of dialogs, scripted and improvised. Ten experienced actors recorded 12 hours of audiovisual data, including audio, video, facial motion, speech, and text transcriptions. The IEMOCAP dataset has five sessions, and each session involves two actors (one male and one female) who record emotional scripts that are 3 to 15 seconds long at a 16 kHz sampling rate. Each session covers different emotion categories, such as anger, sad, happy, neutral, surprise, disgust, frustrated, excited, and fearful, which were annotated by three experts. The experts labeled the data individually, and we select those utterances on which at least two experts agreed. In this paper, we evaluate our system on four emotions, namely anger, sad, happy, and neutral, which are the ones most commonly used in the literature for comparison. Table 2 shows the distribution of these four emotions over all five sessions of the IEMOCAP dataset used for evaluating the model. We use 5-fold cross-validation to train the speaker-independent model: in each fold, four sessions are used for training and one session is used for testing.

B. EMO-DB BERLIN EMOTION DATASET
The Berlin emotion database Emo-DB [50] contains 535 utterances recorded by ten actors, 5 male and 5 female. Each actor read pre-selected sentences with different emotions: anger, fear, boredom, disgust, happy, neutral, and sadness. The utterances in Emo-DB are approximately 2 to 3 seconds long with a 16 kHz sampling rate. Table 3 describes all emotions of the Emo-DB dataset, which is a small dataset with a limited set of emotions. We use 5-fold cross-validation to train the speaker-independent model to recognize emotions in daily conversations. In each fold, the sentences of 8 speakers are used for training the system and those of the other 2 speakers are used for testing.
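The speaker-independent protocol described above (8 speakers for training and 2 for testing in each of 5 folds) can be sketched in Python as follows. The speaker identifiers and the function name are hypothetical, and assigning consecutive speakers to the same test fold is an assumption; the text does not say which speakers are paired.

```python
def speaker_independent_folds(speakers, n_folds=5):
    """Split speakers into disjoint test groups so that no speaker appears
    in both the training and the test part of a fold."""
    speakers = list(speakers)
    if len(speakers) % n_folds != 0:
        raise ValueError("number of speakers must be divisible by n_folds")
    per_fold = len(speakers) // n_folds
    folds = []
    for k in range(n_folds):
        test = speakers[k * per_fold:(k + 1) * per_fold]
        train = [s for s in speakers if s not in test]
        folds.append((train, test))
    return folds


# Hypothetical speaker identifiers for a ten-speaker corpus.
speaker_ids = ["spk%02d" % i for i in range(1, 11)]
folds = speaker_independent_folds(speaker_ids)
```

Utterances would then be routed to the training or test set of a fold according to their speaker, so the model is always tested on voices it has not heard during training.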

C. RAVDESS DATASET
RAVDESS (Ryerson audiovisual database of emotional speech and songs) [51] is an acted dataset recorded in English that is widely used for emotional song and speech. The dataset contains eight emotions performed by 24 professional actors, 12 male and 12 female. The emotions are sad, calm, happy, angry, surprise, neutral, fearful, and disgust. In total, 1440 audio files were recorded at a 48000 Hz sampling rate. We performed experiments using 5-fold cross-validation to split the dataset into training and testing parts. The details are given in Table 4.

D. EXPERIMENTAL EVALUATION
In this section, we evaluate our system for speaker-dependent and speaker-independent emotion recognition. We separate each utterance into multiple segments ''fs'' with respect to time ''t'', with 25% overlap, to obtain the sequence (s = fs 1 , fs 2 , fs 3 , ..., fs n ) from each utterance. The RBF-based similarity method is used in K-means clustering to select one segment from each cluster as the key segment, namely the segment that is nearest to the centroid of the cluster and therefore represents the whole cluster. The details are given in Section III. After selecting the key segments, we extract high-level discriminative features using the ''FC-1000'' layer of the Resnet101 model and normalize the extracted features with the global mean and standard deviation to boost the accuracy of the overall model. The normalized features are fed to the deep BiLSTM network step by step to learn the hidden patterns and recognize the emotion in the given sequence. The final state of the proposed deep BiLSTM network is followed by a Softmax classifier that produces the probabilities of the emotions. The system was implemented in MATLAB 2019b using the neural network toolbox for feature extraction, model training, and evaluation. The data are divided into training and testing folds in an 80:20 ratio, and spectrograms are generated for every segment. The suggested model was trained and evaluated on a single NVIDIA GeForce GTX 1070 GPU with 8 GB of on-board memory. The speaker-dependent and speaker-independent experiments are described in detail in the following subsections.
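The overlapping segmentation and the feature normalization mentioned above can be sketched in Python. Several details are assumptions made for illustration: the 0.5 s window is taken from Section III (A), a trailing partial window is dropped, and "global" is read as a single mean and standard deviation computed over all training feature values. The function names are hypothetical.

```python
import math


def split_with_overlap(samples, sample_rate, window_s=0.5, overlap=0.25):
    """Cut a signal into fixed-length windows that overlap by a fraction."""
    window = int(round(window_s * sample_rate))
    hop = max(1, int(round(window * (1.0 - overlap))))
    return [samples[start:start + window]
            for start in range(0, len(samples) - window + 1, hop)]


def fit_global_stats(features):
    """Mean and standard deviation over every value in the feature matrix."""
    values = [v for row in features for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def normalize(features, mean, std, eps=1e-8):
    """Apply (value - mean) / std; eps guards against a zero deviation."""
    return [[(v - mean) / (std + eps) for v in row] for row in features]


segments = split_with_overlap(list(range(16000)), sample_rate=16000)
train_features = [[1.0, 2.0], [3.0, 4.0]]
mean, std = fit_global_stats(train_features)
normalized = normalize(train_features, mean, std)
```

The statistics would be fitted on the training folds only and then reused unchanged for the test fold, so that no information from the test speakers leaks into the normalization.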

E. MODEL OPTIMIZATION
In the training stage, we tuned the model with different parameters to make it efficient and optimal for SER. We performed experiments with multiple batch sizes, learning rates, and numbers of LSTM and BiLSTM layers to choose the best configuration. We selected the Adam optimizer, with its bias correction, for model optimization. We also ran experiments with normalized and un-normalized features to check the model efficiency. We selected a batch size of 512 and a learning rate of 0.001 for this model, a choice supported empirically by extensive experiments on three different speech emotion datasets. We performed two types of experiments, with and without normalized features, and compared the results of both to select the features for model training. The parameter settings and the corresponding results of the proposed model for normalized and un-normalized features are given in the tables below for every dataset. Each table presents the results for an individual dataset with different batch sizes and learning rates. The best learning rate and batch size for the whole model were chosen on the basis of these extensive experiments on all datasets. Tables 5-7 present the results of the proposed model using normalized and un-normalized features. Feature normalization improves the overall recognition accuracy by 0.4% for IEMOCAP, 0.23% for EMO-DB, and 0.19% for RAVDESS relative to un-normalized features. Hence, the recognition accuracy with normalized features is better, and the processing time for model training and testing is lower than that of the other baseline models.
Similarly, we compare the processing time of our model with that of other baseline methods under these parameter settings to demonstrate the effectiveness and feasibility of the model. We set the batch size to 512, select a learning rate of 0.001 with the Adam optimizer, and analyze the processing time on the IEMOCAP, EMO-DB, and RAVDESS datasets using the normalized features. The details are given in Table 8. Table 8 shows that the proposed model takes less time in training and testing because of its efficient strategy. In the proposed model, we do not take all segments of each utterance; we select only one segment from each cluster as a key segment that represents the whole cluster and train the model on the selected segments. This is the reason for the shorter processing time: our model processes only the selected segments rather than all segments of an utterance, extracts their CNN features, and feeds them to the deep BiLSTM network for classification.

F. SPEAKER INDEPENDENT PERFORMANCE OF THE PROPOSED MODEL
We performed experiments on the spontaneous emotional data of the IEMOCAP and EMO-DB datasets and also evaluated the effectiveness of the model on the RAVDESS corpus. The IEMOCAP and EMO-DB corpora each have 10 speakers, and the RAVDESS dataset has 24 speakers. We follow a 5-fold cross-validation technique that splits the data in an 80:20 ratio according to the number of speakers: 80% of the data are used for model training and the remaining 20% are used to test the model. We evaluated the proposed system over these datasets and checked the prediction performance on the testing data. The overall model performance is presented in terms of class-level precision, recall, and F1 score for each emotion. Similarly, we compute the weighted accuracy, the ratio between the correctly classified emotions and the total emotions in the corresponding class, and the un-weighted accuracy, the ratio between the correctly predicted emotions and the total emotions in the whole dataset. The detailed numerical results for each dataset are given in Tables 9-11. The confusion matrix of the suggested system for speaker-independent evaluation is given in Figure 4, which shows the recognition performance of the proposed model on the challenging IEMOCAP dataset for speaker-independent SER. In this experiment, we obtained 83% accuracy for anger, 78% for sad, 70% for neutral, and 58% for happy. The recognition rates of the happy and neutral emotions are low in this experiment, but our results are still better than the state of the art. The results for the EMO-DB dataset are shown in Figure 5. In this figure, the overall emotion recognition performance is higher than that of other baseline methods; the recognition rate of the happy emotion is improved but still lower than that of the other emotions. Hence, the happy emotion is the one most often confused with other emotions in classification.
Anger, fear, and boredom are recognized with greater than 90% accuracy, and disgust, neutral, and sad with greater than 80% accuracy. Overall, our proposed system achieved a high recognition score (85.57%) on the EMO-DB dataset. The RAVDESS confusion matrix is shown in Figure 6. We evaluated the effectiveness of our proposed system on the RAVDESS dataset, which is widely used for emotional song and speech. The performance of the suggested model is better than that of other baseline techniques. The system recognized anger, calm, fear, and surprise with high accuracy, while the happy, neutral, and sad emotions were recognized with lower accuracy. The system mostly confused the happy, neutral, and sad emotions and recognized them as calm because of the small differences among them. The recognition rate of calm is high, and the system tends to misrecognize other emotions as calm. The overall accuracy of the system for speaker-independent emotion recognition is better than that of other baseline methods on the IEMOCAP, EMO-DB, and RAVDESS corpora.
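As a worked illustration of the class-level scores reported in this subsection, the following sketch derives precision, recall, and F1 score from a confusion matrix, together with the two accuracy figures: the share of correct predictions over the whole dataset and the mean of the per-class recognition rates. The function names and the toy matrix are hypothetical and are not taken from the reported tables.

```python
def per_class_metrics(cm):
    """cm[i][j] counts samples whose true class is i and predicted class is j.

    Returns one (precision, recall, f1) tuple per class.
    """
    n = len(cm)
    metrics = []
    for c in range(n):
        true_positive = cm[c][c]
        actual = sum(cm[c])                          # all samples of class c
        predicted = sum(cm[r][c] for r in range(n))  # all samples labeled c
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


def accuracy_over_dataset(cm):
    """Correct predictions divided by all samples in the dataset."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[c][c] for c in range(len(cm)))
    return correct / total


def mean_per_class_accuracy(cm):
    """Average of the per-class recognition rates (per-class recall)."""
    recalls = [recall for _, recall, _ in per_class_metrics(cm)]
    return sum(recalls) / len(recalls)


# Toy two-class confusion matrix with unequal class sizes (made-up numbers).
toy_cm = [[9, 1],
          [2, 3]]
for label, (p, r, f1) in zip(["class A", "class B"], per_class_metrics(toy_cm)):
    print(f"{label}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print(f"dataset-level accuracy: {accuracy_over_dataset(toy_cm):.3f}")
print(f"mean per-class accuracy: {mean_per_class_accuracy(toy_cm):.3f}")
```

The two accuracy figures diverge whenever the classes are imbalanced, which is why both are usually reported for emotion corpora.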

G. SPEAKER DEPENDENT PERFORMANCE OF THE PROPOSED MODEL
In this type of experiment, we do not split the dataset by speaker as in the speaker-independent experiments. In the speaker-dependent system, we combine all utterances of a dataset into a single set and train on it as a whole. We divide the whole set in an 80:20 ratio for model training and testing: we shuffle the data and randomly select 80% for model training, and the remaining 20% are used for validation and testing. Similarly, we used the normalized features for model training to reduce overfitting and obtain the most reliable SER results. Furthermore, we investigated the speaker-dependent model for all datasets and report the quantitative results and statistics in terms of precision, recall, F1 score, weighted accuracy, and un-weighted accuracy. The detailed numerical results for each dataset are given in the tables beginning with Table 12. We selected the model that gives the best SER results, with a strong preference for generalization. The classification results of the speaker-dependent model are illustrated as confusion matrices in Figures 7, 8, and 9. Figure 7 presents the class-level accuracy of the proposed model in a confusion matrix that indicates the original emotion label and the predicted emotion label. In this experiment, the model recognized the anger and sad emotions at high rates of 92% and 89%, respectively. The recognition rate of the happy emotion was relatively low compared with other emotions but better than in the speaker-independent model. The happy and neutral emotions were mostly confused with sadness in both the speaker-dependent and speaker-independent experiments. The speaker-dependent confusion matrix of the EMO-DB dataset is shown in Figure 8.
In the speaker-dependent experiments, the proposed model outperformed the baselines on the EMO-DB dataset and recognized the emotions with 91.14% average recall. In this experiment, the system recognized the anger, fear, and sadness emotions at high rates; disgust, neutral, and boredom had recognition rates above 85%; and the happy emotion was recognized at a rate of 75%. The system confused the happy and neutral emotions, and happy utterances were mostly recognized as neutral, as in the speaker-independent experiments. The overall performance of the proposed system is better and more effective than that of other baseline techniques. The speaker-dependent performance of the suggested system for RAVDESS is illustrated in Figure 9.
We evaluated our model on the RAVDESS dataset to show the performance and generalization of the model for SER. We recorded the results of the model on multiple benchmark datasets, on which it outperforms the baselines. The emotion recognition rate of the proposed model was 95% for anger, 93% for fear, 96% for surprised, 95% for calm, and 90% for disgust. The recognition rate of the happy emotion was relatively low but better than in previous work. The proposed system misrecognized the happy emotion more often than other classes. In our opinion, the features of the happy emotion are easily confused with those of other emotions, and as a result the suggested model misrecognized them. Another reason for misrecognizing the happy emotion is the limited amount of data: speech emotion datasets are smaller than datasets in other pattern recognition domains such as images, video, and text. Hence, in SER, increasing the accuracy of the happy emotion would be a very significant improvement. Many researchers have worked to develop new techniques for extracting discriminative features and efficient classification methods to enhance the accuracy of SER.
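The two evaluation protocols differ only in how the data are divided, which the following sketch makes concrete. It is an illustration under assumptions rather than the authors' code: each sample is assumed to be a dictionary with a hypothetical `"speaker"` key, and the function names are invented for this example. The speaker-independent folds hold out whole speakers, while the speaker-dependent split shuffles all utterances before cutting at 80%.

```python
import random


def speaker_independent_folds(samples, n_folds=5):
    """Yield (train, test) pairs in which no speaker appears on both sides."""
    speakers = sorted({s["speaker"] for s in samples})
    for fold in range(n_folds):
        # Every n_folds-th speaker is held out, e.g. 2 of 10 speakers per fold.
        held_out = set(speakers[fold::n_folds])
        train = [s for s in samples if s["speaker"] not in held_out]
        test = [s for s in samples if s["speaker"] in held_out]
        yield train, test


def speaker_dependent_split(samples, train_ratio=0.8, seed=0):
    """Shuffle all utterances together, then cut once at train_ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Toy corpus: 10 speakers with 4 utterances each (made-up identifiers).
corpus = [{"speaker": f"spk{i:02d}", "utt": j} for i in range(10) for j in range(4)]

for train, test in speaker_independent_folds(corpus):
    overlap = {s["speaker"] for s in train} & {s["speaker"] for s in test}
    print(len(train), len(test), "shared speakers:", len(overlap))

train, test = speaker_dependent_split(corpus)
print(len(train), len(test))
```

With 10 speakers and five folds, each fold holds out two speakers, which matches an 80:20 split by speaker. In the shuffled split the same speaker can appear in both training and test data, which is consistent with the higher speaker-dependent recognition rates.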

V. DISCUSSION
In the proposed framework, the major contributions to SER from speech signals are the efficient sequence selection using K-means clustering based on RBF similarity and the use of normalized discriminative features with a deep BiLSTM. We performed speaker-dependent and speaker-independent experiments over three benchmark datasets to recognize the emotional state of a speaker from his/her speech. We developed a new technique for SER that selects a sequence from an utterance using an RBF-based K-means clustering technique. From each cluster, we selected the segment nearest to the centroid as the key segment that represents the cluster, and converted all key segments into spectrograms by applying the STFT to obtain 2-D representations. Furthermore, we extracted high-level discriminative (CNN) features from the spectrograms using the ''FC-1000'' layer of the ResNet101 CNN model. We normalized the extracted features using their mean and standard deviation and passed them to the deep BiLSTM for classification. We used this novel approach to improve the classification accuracy and reduce the processing time compared with the traditional CNN_ELM [54] and DNN_KLM [33] networks. With this approach, we obtained better results on three benchmark datasets, IEMOCAP, EMO-DB, and RAVDESS, in both speaker-dependent and speaker-independent experiments. Tables 15-17 present the comparative analysis of the proposed system against other baseline SER methods on the IEMOCAP, EMO-DB, and RAVDESS datasets, respectively. The proposed system boosts the overall accuracy by up to 6.14%, 2.14%, and 7.01% in speaker-dependent experiments and by up to 3.07%, 1.57%, and 2.41% in speaker-independent experiments on the IEMOCAP, EMO-DB, and RAVDESS datasets, respectively.
We reduce the processing time of the suggested model by processing one segment from each cluster and by using normalized features. In the state-of-the-art methods, researchers have used traditional, un-normalized features for classification. The proposed system was evaluated on three standard datasets, where it outperformed the baselines and demonstrated significant results that confirm the robustness and effectiveness of the system. The performance of the proposed system was also evaluated with different pre-trained CNN models as the feature extractor, for speaker-dependent and speaker-independent experiments on the IEMOCAP, EMO-DB, and RAVDESS datasets. The comparative analysis of these CNN models, with the recognition accuracy on each dataset, is illustrated in Figures 10 and 11. The recognition accuracy of the proposed system is higher than that obtained with the other CNN models, which clearly indicates the robustness and significance of the model for SER using spectrograms of speech signals [3].
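The feature normalization step discussed in this section can be sketched as a standard z-score transform. The text only states that the CNN features are normalized with their mean and standard deviation, so the details below are assumptions: the statistics are computed per feature dimension on the training vectors and then reused for test vectors, and the names `fit_normalizer` and `normalize` are hypothetical.

```python
import statistics


def fit_normalizer(train_features):
    """Compute the mean and standard deviation of every feature dimension."""
    columns = list(zip(*train_features))
    means = [statistics.fmean(col) for col in columns]
    # A constant dimension has zero deviation; use 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def normalize(features, means, stds):
    """Apply (x - mean) / std to every dimension of every feature vector."""
    return [
        [(x - m) / s for x, m, s in zip(vector, means, stds)]
        for vector in features
    ]


# Toy 2-D "CNN features" (the real FC-1000 vectors would have 1000 dimensions).
train = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
means, stds = fit_normalizer(train)
print(normalize(train, means, stds))
```

After this transform every non-constant dimension of the training features has zero mean and unit variance, so no single dimension dominates the input to the recurrent layers.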

VI. CONCLUSION AND FUTURE WORK
Existing CNN-based SER systems face several challenges, such as improving the accuracy and reducing the computational complexity of the whole model. To address these limitations, we designed a novel approach for SER that improves the recognition accuracy and reduces the overall computational cost and processing time of the model. Specifically, we proposed a new technique that selects a more efficient sequence from speech using an RBF-based K-means clustering algorithm and converts it into spectrograms by applying the STFT algorithm. We then extracted discriminative and salient features from the spectrograms of the speech signal using the ''FC-1000'' layer of the ResNet CNN model and normalized them with their mean and standard deviation to remove the variation. After normalization, we fed these discriminative features to a deep BiLSTM to learn the hidden information, recognize the final state of the sequence, and classify the emotional state of the speaker. We evaluated the proposed system on three standard datasets, IEMOCAP, EMO-DB, and RAVDESS, to check the robustness of the system. We achieved a recognition accuracy of 72.25% on the IEMOCAP dataset, 85.57% on the EMO-DB dataset, and 77.02% on the RAVDESS dataset. We reduced the processing time of our system, which processes only the selected segments for emotion recognition rather than all segments, yielding a computationally friendly system. The experimental results demonstrate the robustness and significance of the proposed system for correctly recognizing the emotional state of the speaker using spectrograms of speech signals.
In the future, the proposed architecture can be used for other applications, and speech emotion recognition can be explored further using DBN, GRU, and spiking networks to obtain better accuracy with less computational complexity. The proposed model can also serve as an inspiration for speaker recognition and speaker identification, which are used in many real-world problems.
MUHAMMAD SAJJAD received the master's degree from the Department of Computer Science, College of Signals, National University of Sciences and Technology, Rawalpindi, Pakistan, and the Ph.D. degree in digital contents from Sejong University, Seoul, South Korea. He is currently an Assistant Professor with the Department of Computer Science, Islamia College Peshawar, Pakistan. He is also the Head of the Digital Image Processing Laboratory, Islamia College Peshawar, where students are involved in research projects under his supervision, such as social data analysis, medical image analysis, multimodal data mining and summarization, image/video prioritization and ranking, fog computing, the Internet of Things, virtual reality, and image/video retrieval. His primary research interests include computer vision, image understanding, pattern recognition, and robot vision and multimedia applications, with current emphasis on raspberry-pi and deep learning-based bioinformatics, video scene understanding, activity analysis, fog computing, the Internet of Things, and real-time tracking.
SOONIL KWON received the M.S. and Ph.D. degrees in electrical engineering from the University of Southern California, USA, in 2000 and 2005, respectively. He is currently a Professor at the Department of Software, College of Software Convergence, Sejong University, Seoul, South Korea, where he is also the Head of the Interaction Technology Laboratory, where students are involved in research projects under his supervision, such as social data analysis, audio analysis, multimodal data analysis and speech emotion recognition, speech synthesis, speaker recognition, and speaker diarization. His research interests include speech recognition, human-computer interaction, affective computing, and speech and audio processing. He served as a professional reviewer for several well-reputed journals, such as the IEEE Communication Magazine, Sensors, Information Fusion, Information Sciences, the IEEE TRANSACTIONS ON IMAGE PROCESSING, MBEC, MTAP, SIVP, and JVCI. VOLUME 8, 2020