Wavelet Multiresolution Analysis Based Speech Emotion Recognition System Using 1D CNN LSTM Networks

Speech Emotion Recognition (SER) is the task of recognizing a speaker's emotional state from speech. SER plays a significant role in Human-Computer Interaction and psychological assessment. Several kinds of time-frequency representations, such as spectrograms, mel-frequency cepstrum coefficients (MFCCs), and mel-spectrograms, are commonly used to develop an SER system. These representations use the Fast Fourier Transform (FFT) to convert the time-domain signal to the frequency domain. However, the FFT has a fundamental limitation due to the uncertainty principle: it cannot simultaneously provide a good resolution in both the time and frequency domains. On the other hand, the multiresolution property of wavelets can provide good localization in both the time and frequency domains. Therefore, this article investigates the competency of wavelet transforms for SER. We propose a Wavelet based Deep Emotion Recognition (WaDER) method using an autoencoder together with a 1D convolutional neural network (CNN) and long short-term memory (LSTM) network. The autoencoder performs the dimensionality reduction of the wavelet features; the latent space is then used to classify the emotions with the 1D CNN-LSTM model. We conducted a Monte-Carlo K-fold validation using the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset. For speaker-dependent (SD) experiments, we achieved an unweighted accuracy (UA) of 81.45% and a weighted accuracy (WA) of 81.22%. The results of the experiments on the RAVDESS dataset show that the proposed method performs better than the state-of-the-art methods that use other time-frequency representations.


Emotion recognition plays a vital role in Human-Computer Interaction. We can use speech emotion recognition (SER) to make the conversation between machines and humans more intelligent. SER also has applications in healthcare. It can be used to identify psychological disorders, which can mitigate the risk of suicidal behaviors. Virtual Emotion AI chatbots can provide personalized therapy by interacting with patients online. SER can also be used for emotional speech generation. A London-based company, DeepZen, partnered with NVIDIA to develop a deep learning model that can generate human-like emotional speech for audiobooks. Speech emotions have a tremendous amount of acoustic variance. Therefore, the first step toward a better recognition rate in SER is to identify distinguishable and salient features in a voice segment. Traditionally, researchers use features like mel-frequency cepstrum coefficients (MFCC) [1], spectrograms, mel-spectrograms, energy, fundamental frequency (F0), spectral centroid, and zero-crossing rate. Additionally, many features can be handcrafted from the statistics of the time-domain features. Over the past several decades, Hidden Markov Models (HMM) [2], [3], Gaussian Mixture Models (GMM), and Support Vector Machines (SVM) [4], [5], [6] have been used for SER. Various researchers have leveraged combinations of different features to get a better recognition rate. Yogesh et al. [7] extracted bispectral and bicoherence features from glottal and speech waveforms. Seehapoch et al. [4] used features like fundamental frequency, energy, zero-crossing rate, and linear prediction coefficients (LPC) to train an SVM model. Although these models require fewer parameters and are highly interpretable, they have limitations when capturing complex non-linear patterns from data.
Deep learning has made a significant improvement in this field. Researchers have developed variations of CNNs and LSTMs to model the spatial and temporal dependencies in the input features. Deep learning can extract salient and discriminative information from the input features to perform an accurate classification. Spectrograms, chromagrams, and MFCCs can be fed to a CNN or LSTM network as inputs. In 2018, Zhang et al. [8] proposed to use 3-channel log-mel spectrograms as features to train their deep convolutional neural network. The three channels of the log-mel spectrogram were the static, delta, and delta-delta components, where delta and delta-delta are the first and second derivatives of the signal. This representation resembles an RGB (red, green, and blue channels) image. In 2019, Zhao et al. [9] used a local feature learning block (LFLB) and an LSTM model to learn features from raw audio and log-mel spectrograms. Similarly, Khorram et al. [10] proposed a dilated CNN-based model [11] to capture the long-term dependencies from data while keeping the number of parameters low.
Recently, the attention mechanism has become popular because it can focus on the relevant parts of the input when making a decision. The attention mechanism was initially introduced by Bahdanau et al. [12] for machine translation, but it has been widely applied for classification purposes as well. Xie et al. [13] proposed an attention-based LSTM model for SER to utilize the difference in emotional saturation between multiple time frames. Mirsamadi et al. [14] also used a bidirectional LSTM with an attention mechanism to focus on the emotionally salient parts of speech. Xu et al. [15] proposed a method called Head Fusion based on a multi-head attention mechanism for speech emotion recognition. They used MFCC features after dividing each sample into multiple fixed-size utterances. They also experimented with different types of noise injection, which increased the robustness of the model. Yu et al. [16] used IS09 and mel-spectrograms as features and trained them using an attention-based LSTM model. Some methods use raw audio to perform SER [17]. Since the human auditory system is designed to perceive the frequency and amplitude of sound [18], the focus of this article is to utilize frequency-based features instead of raw audio.
However, most of these methods utilize spectrograms, MFCCs, and mel-spectrograms, which use the FFT to convert the time-domain signal to the frequency domain. Due to the uncertainty principle, the FFT cannot simultaneously achieve a good resolution in both the time and frequency domains: it uses a fixed-size window to capture all frequencies, whereas higher frequencies require a smaller window and lower frequencies require a larger one. In contrast, the multiresolution property of the wavelet transforms provides localization in both the time and frequency domains simultaneously. In the early 1980s, orthogonal wavelets were discovered by Strömberg [19]. In 1982, Alex Grossman and Jean Morlet developed the continuous wavelet transform [20], [21] for seismic frequency analysis. With the advent of deep learning, wavelet transforms once again gained attention for time series classification, and some researchers have investigated wavelets for SER tasks using deep learning approaches. Wavelet transform features can have very high dimensionality. Earlier, it was challenging to train a neural network on such data due to computational limits. However, with the increase in computational power, high-dimensional data can now be used to train deep neural networks.
Zhiyan et al. [22] used wavelet features and an HMM model to classify Chinese emotional speech. Shegokar et al. [23] used the continuous wavelet transform (CWT) and prosodic coefficients as features and classified them using an SVM, achieving an accuracy of 60.1% with a quadratic SVM. However, they used principal component analysis (PCA) to reduce the dimensionality of the wavelet features. We found that the continuous wavelet transform features are highly non-linear; therefore, PCA is not a suitable dimensionality reduction method. In this article, we perform dimensionality reduction using an autoencoder [24]. Hamid et al. [25] used prosodic, spectral, and wavelet features to classify Arabic speech emotions. Many researchers have also utilized critical bands for speech-related tasks like SER and speaker identification. In 1961, Eberhard Zwicker proposed a psychoacoustic scale called Bark bands [26]. The center frequencies of the Bark bands are based on the human perception of different frequency ranges. Lalitha et al. [27] used a combination of the mel scale and the Bark scale to perform SER and achieved encouraging results. Jiang et al. [28] used Bark bands as a critical band division method and classified different emotions using a support vector machine; their results showed that their proposed method performed better than MFCC features. Similarly, Fernandes et al. [29] also used Bark bands as features and used an LSTM and a bidirectional LSTM model for classification.
Several researchers have also utilized wavelet packets for SER. The wavelet packet transform is a generalization of multiresolution decomposition. It divides the frequency bands into several levels and additionally decomposes the high-frequency portions that are not subdivided in multiresolution analysis [30]. In 2020, Wang et al. [31] used the wavelet packet coefficients for SER and compared them with MFCC features. They used a Sequential Floating Forward Search method for feature selection. Their experiments showed that the classifiers trained using the wavelet features achieved better results than those trained using MFCC features. Kishore et al. [32] used MFCC and Sub-Band Cepstral (SBC) features to classify emotions using a GMM. They computed SBC using the wavelet packet transform instead of the FFT. They reported that SBC features yielded a 70% accuracy and MFCC features yielded a 51% accuracy on the Surrey Audio-Visual Expressed Emotion (SAVEE) dataset.
Feng et al. [33] used wavelet packets and computed the energy of each sub-band to classify speech emotions, classifying these features with an LSTM-based model. He et al. [34] proposed new features by computing the energy entropy of wavelet packet frequency bands and classified them using the GMM algorithm. Huang et al. [35] proposed sub-band spectral centroid weighted wavelet packet cepstral coefficients for classifying emotions. Additionally, they fused the prosody and voice quality features with the wavelet packet features and classified them using a deep belief network. Their results showed that this method performs SER efficiently even under noisy conditions. Wavelet packets have also proved useful for recognizing emotions under real-world noise conditions [36].
As discussed above, researchers often use a combination of several features to increase the accuracy of an SER system. In this article, instead of using several kinds of time-frequency representations, we aim to use only one kind of robust representation that can capture distinguishable information. A comparison is made with other methods that use a single time-frequency representation rather than a fusion of different representations. Additionally, many methods do not utilize wavelets efficiently. Therefore, we revisit the wavelet transforms and explore their usage in SER. Fig. 3 illustrates the potential of using wavelet multiresolution analysis for SER. The main contributions of this article are as follows:
1) We investigate the potential of the wavelet transform as features for SER by utilizing its multiresolution property. The continuous wavelet features are extracted within a suitable frequency range by analyzing the frequencies carrying the salient information.
2) We propose a method called WaDER to perform SER, which consists of two parts. Firstly, due to the high dimensionality of the wavelet features, an autoencoder is used to reduce the dimensionality of the wavelet features at each timestep. Secondly, a 1D CNN-LSTM based model performs classification using the latent space of the autoencoder.
3) We found that the wavelet transform features can efficiently distinguish between several emotions and perform an accurate classification compared to the other time-frequency representations. We achieved an unweighted accuracy (UA) of 81.45% and a weighted accuracy (WA) of 81.22% for speaker-dependent experiments using the RAVDESS dataset.
A list of nomenclature used throughout this article is provided in Table I.

II. DATA
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [37] is used in this research. The RAVDESS dataset is an example of a simulated (acted) dataset. It contains speech and song samples with different emotions in a North American accent; only the speech samples are used in this work. The 16-bit audio files are sampled at 48 kHz and provided in the Waveform Audio File (WAV) format. The dataset contains 1440 utterances from 24 professional actors (12 female and 12 male), with 60 trials per actor. Each actor speaks two different sentences with different emotions: "Dogs are sitting by the door" and "Kids are talking by the door".
The speech emotions categories include neutral, calm, happy, sad, angry, fearful, disgust, and surprised. Apart from the neutral category, each category of emotions is expressed at two levels of emotional intensity (normal and strong). There are 192 utterances of each emotion except the neutral state. The neutral state only has 96 utterances.

A. Preprocessing
The audio files are downsampled to 16 kHz to reduce the size of the data without affecting speech quality and intelligibility. Firstly, each audio clip's leading and trailing silence is trimmed because it contains no useful information. However, the silence occurring between words is not removed, as it provides information about the speaking rate and helps distinguish between weak and strong emotions. For example, people tend to speak faster when they are angry, so the duration of silence is shorter, whereas calm speech can contain longer silences. Secondly, each clip is normalized such that the mean is 0 and the standard deviation is 1.
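A minimal sketch of this preprocessing step is shown below, assuming librosa for loading and silence trimming. The silence threshold (top_db=30) and the helper name `preprocess` are our assumptions; the article does not specify them.

```python
import librosa
import numpy as np

def preprocess(path, sr=16000, top_db=30):
    """Load an audio file at 16 kHz, trim leading/trailing silence, and standardize it."""
    # Resample to 16 kHz on load.
    y, _ = librosa.load(path, sr=sr)
    # Trim only the leading and trailing silence; pauses between words are kept.
    y, _ = librosa.effects.trim(y, top_db=top_db)
    # Zero-mean, unit-variance normalization.
    return (y - np.mean(y)) / (np.std(y) + 1e-8)
```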

B. Data Augmentation
Since there are few samples in the RAVDESS dataset, most SER algorithms tend to overfit. Therefore, data augmentation is performed to generate new samples and make the SER system more robust to noise. Additive White Gaussian Noise (AWGN) is the most widely used noise addition method, as it can model naturally occurring random processes. Therefore, new samples are augmented from each trimmed audio sample using AWGN, which can be represented by (1) and (2). The noise is added with a Signal-to-Noise Ratio (SNR) between 15 dB and 30 dB.

$$Z(t) \sim \mathcal{N}(0, \sigma^{2}) \tag{1}$$

where t is a discrete timestep, σ² is the variance of the noise, and Z is the noise drawn from a normal distribution.

$$Y(t) = X(t) + Z(t) \tag{2}$$

where Y is the output (augmented) signal and X is the input signal.
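The augmentation in (1) and (2) can be sketched as follows; drawing the SNR uniformly from [15, 30] dB and the function name `add_awgn` are assumptions, since the article only states the SNR range.

```python
import numpy as np

def add_awgn(x, snr_db):
    """Add white Gaussian noise to signal x at the given SNR (in dB), as in (1)-(2)."""
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10.0))
    z = np.random.normal(0.0, np.sqrt(noise_power), size=x.shape)  # Z(t) ~ N(0, sigma^2)
    return x + z                                                   # Y(t) = X(t) + Z(t)

# Example: augment one new sample with a random SNR between 15 dB and 30 dB.
# y_aug = add_awgn(y, np.random.uniform(15, 30))
```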

C. Wavelet Feature Extraction
After data augmentation, each audio sample's continuous wavelet transform (CWT) is computed. The CWT of a signal x(t) is represented by (3).

$$W(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt \tag{3}$$

where a is the scale parameter, b is the translation parameter, ψ(t) is the mother wavelet, ψ*(t) is the complex conjugate of the mother wavelet, and t is time. The scale and translation parameters in the CWT must be discretized to implement the algorithm. Different wavelet frequencies, F_a, are given by F_a = F_c / (a·δ), where δ is the sampling period and F_c is the central wavelet frequency, which is set to 1 Hz. If the sampling period is 1/16000 seconds, the scales 1, 2, and 3 correspond to 16 kHz, 8 kHz, and 5.33 kHz, respectively. Different discrete scales, a, are obtained by setting their values to positive integers, a ∈ {1, 2, 3, ..., N}. N is chosen such that the frequency corresponding to the scale N is more than 20 Hz because the lower limit of human hearing is 20 Hz. Similarly, the translation values, b, in the CWT are discretized to positive integer values, which are the timesteps: b ∈ {1, 2, 3, ..., T}, where T is the total number of timesteps in the signal.
As the parameters a and b change, different wavelets, called daughter wavelets, can be generated from the mother wavelet. There are several kinds of mother wavelets, such as the Daubechies, Mexican Hat, Symlet, Ricker, Haar, and Morlet wavelets, and different kinds of wavelets are used for different tasks. The Morlet wavelet is well suited for speech and image processing tasks because it is closely related to the human perception of hearing and vision. Over the past decade, it has been used for Voice Activity Detection (VAD), speaker recognition, and speech emotion recognition (SER).
The real-valued Morlet wavelet is used in this article and is shown in Fig. 1. It is given by (4).

$$\psi(t) = e^{-t^{2}/(2\sigma^{2})} \cos(\xi t) \tag{4}$$

where σ is the width of the Gaussian, ξ controls the trade-off between time and frequency resolution, and t is time. The values of σ² and ξ are usually set to 1 and 5, respectively [23].
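A small numerical sketch of the mother wavelet in (4), as reconstructed above, with σ² = 1 and ξ = 5; the normalization constant (omitted here) may differ from the implementation used in the article.

```python
import numpy as np

def real_morlet(t, sigma=1.0, xi=5.0):
    """Real-valued Morlet mother wavelet of (4): a cosine modulated by a Gaussian."""
    return np.exp(-t ** 2 / (2.0 * sigma ** 2)) * np.cos(xi * t)

# Example: evaluate the mother wavelet on a small support, e.g., for plotting.
t = np.linspace(-4, 4, 1000)
psi = real_morlet(t)  # sigma^2 = 1, xi = 5, as in the text
```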
As the scale parameter increases, the daughter wavelet dilates (expands) and captures lower frequencies; as the scale decreases, the daughter wavelet captures higher frequencies. If the mother wavelet is dilated by a factor of 2, the frequency content is shifted by an octave. The number of octaves determines the number of frequencies being investigated.
For our experiments, the absolute values of the CWT features are taken. Now the crucial part is the selection of frequencies or scales. The human ear can hear frequencies between 20 Hz and 20,000 Hz. However, most speech lies between 20 Hz and 4000 Hz [38]. Therefore, selecting frequencies outside this range is not helpful. Additionally, to observe the difference in the distribution of the frequencies present in male and female voices, the following approach is used:
- Firstly, each audio clip is standardized in the time domain (the mean is 0 and the standard deviation is 1).
- Secondly, the spectrogram of each audio clip is computed.
- Thirdly, all the frequencies are weighted by their amplitudes, and a histogram is computed as shown in Fig. 2.
This process is repeated for males and females separately. In the histograms, the frequencies between 80 Hz and 2000 Hz show a distinct pattern for males and females. Therefore, 125 frequencies in the range of 80 Hz to 2000 Hz are extracted using their corresponding closest scales.
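A sketch of the scale selection and CWT extraction is shown below, assuming the PyWavelets (pywt) implementation of the real Morlet wavelet. The logarithmic spacing of the 125 frequencies and the use of pywt's own center frequency (rather than the F_c = 1 Hz convention above) are assumptions.

```python
import numpy as np
import pywt

SR = 16000
DT = 1.0 / SR

# 125 target frequencies between 80 Hz and 2000 Hz (spacing not specified in
# the article; logarithmic spacing is assumed here).
freqs = np.geomspace(80.0, 2000.0, num=125)

# Map frequencies to scales: f = F_c / (a * dt)  =>  a = F_c / (f * dt).
# pywt's 'morl' wavelet has its own center frequency; the article assumes F_c = 1 Hz.
f_c = pywt.central_frequency('morl')
scales = f_c / (freqs * DT)

def cwt_features(y):
    """Return |CWT| features of shape (timesteps, 125) for a preprocessed signal y."""
    coefs, _ = pywt.cwt(y, scales, 'morl', sampling_period=DT)
    return np.abs(coefs).T  # transpose to (time, scales)
```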
The CWT features of all the emotions are shown in Fig. 3, and several of them look visually distinguishable. The CWT features of the "neutral", "calm", and "sad" emotions look similar: they have higher amplitudes at both lower and higher scales. However, the CWT features of emotions like "angry" and "disgust" show a higher amplitude only at the lower scales (higher frequencies), which is due to the higher pitch when speaking loudly or angrily. The CWT features of some emotions show a similar pattern, and it is difficult to differentiate between them visually. However, deep learning models should be able to uncover the complex underlying patterns from these CWT features to classify them accurately.

D. Fixed Size Segments Generation
Now we have the CWT features of all the audio samples. However, the duration of the audio samples is variable (between 3 and 5 seconds). To generate fixed-sized segments, one-second long segments are extracted from each sample with an overlap of 60%. The reason behind choosing one-second long segments is to capture multiple words and the pauses between them. Usually, some words are spoken in a more neutral manner than others. When we consider multiple words, we can estimate the emotion in that segment more accurately. Additionally, choosing long segments ensures that the segment's target emotion will not differ significantly from the whole utterance. The pauses between the words indicate how two words are connected, which helps distinguish between weaker (calm, neutral, sad) and stronger (angry, disgust, happy, surprised) emotions.
The ground truth label of the original utterance is also assigned to its segments. However, some portions of speech can carry a different (mostly neutral) emotion than the whole utterance. To address this issue, some methods dynamically generate pseudo labels for each segment [39]. In this article, however, the original utterance's label is assigned to all of its segments, and during testing, the majority vote of the segment predictions is assigned to the whole utterance. Since the sampling rate is 16 kHz, a one-second-long segment corresponds to 16000 samples. Therefore, the dimensionality of each segment is (16000, 125), where the number of timesteps is 16000 and the number of scales is 125.
Currently, one major issue with the CWT features is the requirement of large-sized arrays, which makes it difficult to load the entire data during training. Therefore, the temporal resolution is decreased by a factor of 4 by treating the CWT features like images using inter-area interpolation. Now, the dimensionality of each segment is (4000, 125), where the number of timesteps is 4000, and the number of scales is 125. All the CWT features are standardized to have a mean of 0 and a standard deviation of 1.
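A sketch of the segmentation and temporal downsampling, assuming OpenCV's inter-area interpolation; the hop size of 40% of the segment length follows directly from the stated 60% overlap.

```python
import numpy as np
import cv2

SR = 16000
SEG_LEN = SR              # 1-second segments (16000 timesteps)
HOP = int(SEG_LEN * 0.4)  # 60% overlap -> hop of 40% of the segment length

def make_segments(cwt, seg_len=SEG_LEN, hop=HOP):
    """Cut (time, 125) CWT features into fixed 1 s segments with 60% overlap."""
    segments = []
    for start in range(0, cwt.shape[0] - seg_len + 1, hop):
        seg = cwt[start:start + seg_len].astype(np.float32)  # (16000, 125)
        # Reduce temporal resolution by a factor of 4 with inter-area interpolation.
        seg = cv2.resize(seg, (seg.shape[1], seg.shape[0] // 4),
                         interpolation=cv2.INTER_AREA)        # (4000, 125)
        segments.append(seg)
    return np.stack(segments)
```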

E. Feature Compression
Currently, the number of features at each timestep is significantly high. Due to the high dimensionality and the small number of samples, the deep learning model is prone to overfitting. Therefore, a dimensionality reduction technique is applied.
Firstly, the most popular and simple dimensionality reduction technique, PCA, is explored. Currently, the CWT features form an (N × 4000 × 125) array, where N is the number of samples (segments). To apply PCA, the CWT features are reshaped to (N × 500000), i.e., the 4000 × 125 values of each segment are flattened.
A scatter plot of the first two principal components is shown in Fig. 4. It is evident from the scatter plot that the data is highly non-linear, whereas PCA performs a linear transformation. Therefore, PCA is not a suitable dimensionality reduction method in this case.
Hence, an autoencoder, which can also model non-linear data, is chosen to perform the dimensionality reduction.

F. Deep Learning Architecture
The proposed deep learning architecture consists of an autoencoder and a classifier module. The autoencoder is used to reduce the dimensionality of features while keeping the number of timesteps the same. The classifier takes the latent space as input and classifies different emotions. The deep learning architecture is shown in Fig. 5.
Let the input CWT features be X, where X ∈ R^(T×D), T is the number of timesteps, and D is the dimensionality of the features. The encoder and decoder are represented by e(·) and d(·), respectively. The latent space, z, is represented by (5).

$$z = e(X) \tag{5}$$

where z ∈ R^(T×D1) and D1 < D.
(Fig. 3 caption: the sentence spoken in all the plotted samples is "Kids are talking by the door"; for each emotion, the CWT of a 1 s clip (16000 timesteps at 16 kHz) is shown, during which only the words "Kids are talking" are uttered and annotated. The x-axis shows time in seconds, the y-axis shows the 125 scales, and the z-axis shows the amplitude. The scales are inversely proportional to frequency, and the color axes vary per plot to show the differences in amplitude per emotion.)
The reconstructed CWT features, X̂, are described by (6).

$$\hat{X} = d(z) = d(e(X)) \tag{6}$$

where X̂ ∈ R^(T×D). The encoder and decoder use time-distributed fully connected and time-distributed Batch Normalization layers. The time-distributed operation applies a specific layer to every timestep. This is done because the CWT features are viewed here as a multivariate time series instead of a standard image. Additionally, the temporal resolution is kept the same in the autoencoder; only the dimensionality of the features at each timestep is reduced. Therefore, the feature maps are computed at every timestep using time-distributed layers.
The reconstruction loss, L_a, used to train the autoencoder is given by (7).

$$L_{a} = \frac{1}{K} \sum_{k=1}^{K} \left\lVert X_{k} - \hat{X}_{k} \right\rVert^{2} \tag{7}$$

where K is the number of samples in the training data.
In our experiments, the values of D, D 1 , and T are 125, 8, and 4000, respectively.
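A sketch of such a time-distributed autoencoder in Keras is given below. The hidden-layer width (64), the choice of the Adam optimizer for the autoencoder, and the mean squared error form of the reconstruction loss in (7) are assumptions; the article specifies the input, latent, and output dimensionalities, the layer types, and the learning rate of 0.0001.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

T, D, D1 = 4000, 125, 8

def build_autoencoder():
    """Time-distributed dense autoencoder: (T, 125) -> latent (T, 8) -> (T, 125)."""
    inp = layers.Input(shape=(T, D))
    # Encoder: per-timestep fully connected layers with time-distributed batch normalization.
    x = layers.TimeDistributed(layers.Dense(64, activation='elu'))(inp)
    x = layers.TimeDistributed(layers.BatchNormalization())(x)
    z = layers.TimeDistributed(layers.Dense(D1))(x)            # latent space (T, 8)
    # Decoder mirrors the encoder.
    x = layers.TimeDistributed(layers.Dense(64, activation='elu'))(z)
    x = layers.TimeDistributed(layers.BatchNormalization())(x)
    out = layers.TimeDistributed(layers.Dense(D))(x)
    autoencoder = models.Model(inp, out)
    encoder = models.Model(inp, z)                              # used to feed the classifier
    autoencoder.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss='mse')
    return autoencoder, encoder
```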
The classifier takes the latent space, z, as input. The classifier utilizes 1D CNNs to extract the spatial features at each timestep. Then, the LSTM layer is used to extract the long-term temporal dependencies as the input sequence length is 4000.
To extract the spatial dependencies across the scales at each timestep, three 1D convolutional layers, each followed by the exponential linear unit (ELU) activation, a Time-Distributed Batch Normalization layer, and a MaxPooling layer, are used. The purpose of each layer is explained below:
1) 1D CNN: 1D convolutional layers are applied to extract the spatial dependencies from the compressed frequency information in the latent space at each timestep. 1D CNN layers are used instead of 2D CNN layers because they require less memory and are less computationally expensive. Moreover, the CWT features are treated here as a multivariate time series; therefore, 1D CNN layers are utilized to learn the local features at each timestep.
2) ELU activation: The ELU activation is similar to the ReLU activation, but it can produce negative outputs [40]. The ELU activation is represented by (8).

$$\mathrm{ELU}(x) = \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases} \tag{8}$$

where α is the hyperparameter that adjusts the saturation for negative input values. The ELU activation alleviates the effect of the vanishing gradient problem. Additionally, because it produces negative values, it pushes the mean of the activations closer to zero, which results in faster training. Clevert et al. [40] showed that ELU activations lead to better generalization performance.
3) Batch Normalization: The distribution of the inputs to a layer keeps changing as the parameters of the previous layers change, which leads to slower training. This problem is termed "Internal Covariate Shift". Batch Normalization [41] adjusts the means and variances of layer inputs by normalizing them. The Batch Normalization layer reduces the dependency of gradients on the initial parameters and makes training faster by allowing higher learning rates [41], [42]. The time-distributed Batch Normalization layer applies Batch Normalization at every timestep separately.
4) Max Pooling: 1D Max Pooling is used to reduce the temporal resolution of the CWT features by taking the maximum over a pooling region.
Let the input to the 1D CNN layer be a time series X(t). The input X(t) is convolved with a kernel w(t) of size l to obtain the output O(t), which is described by (9).

$$O(t) = (X * w)(t) = \sum_{m=0}^{l-1} X(t - m)\, w(m) \tag{9}$$
The weights of the kernel w(t) are initialized using Xavier normal initialization. The output of the CNN layer can then be represented using (10).

$$O_{i}^{l} = \sum_{k} O_{k}^{l-1} * w_{k} + b_{i}^{l} \tag{10}$$

where O_i^l is the i-th output feature at the l-th layer, O_k^(l−1) is the k-th input feature at the (l−1)-th layer, w_k denotes the convolution kernel at the k-th index, and b_i^l is the bias term for the i-th output feature at the l-th layer.
The ELU activation function, σ(·), is applied to the convolution output, O_i^l; the α parameter of the ELU activation is set to 1 (its default value). A time-distributed Batch Normalization layer, BN_TD, is then applied to normalize the output of the activation function at each timestep, which is represented by (11).

$$\tilde{O}_{i}^{l} = \mathrm{BN}_{TD}\!\left(\sigma\!\left(O_{i}^{l}\right)\right) \tag{11}$$
Now, the outputs are passed into a MaxPooling layer, which is shown in (12).

$$h_{i}^{l} = \max_{u \in \Omega_{i}} a_{u}^{l} \tag{12}$$

where Ω_i represents the pooling region with index i, a_u^l is the input feature of the l-th MaxPooling layer at index u, and h_i^l is the output feature of the l-th MaxPooling layer at index i.
After extracting the features using the 1D CNN layers, an LSTM layer is applied to extract the long-term contextual dependencies from the CWT features. The LSTM acts as a global feature extractor. The output of the LSTM cell, h_t^l, can be expressed using (13) to (17) [43].

$$f_{t} = \sigma_{g}\!\left(W_{f} h_{t}^{l-1} + U_{f} h_{t-1}^{l} + b_{f}\right) \tag{13}$$
$$i_{t} = \sigma_{g}\!\left(W_{i} h_{t}^{l-1} + U_{i} h_{t-1}^{l} + b_{i}\right) \tag{14}$$
$$o_{t} = \sigma_{g}\!\left(W_{o} h_{t}^{l-1} + U_{o} h_{t-1}^{l} + b_{o}\right) \tag{15}$$
$$c_{t} = f_{t} \circ c_{t-1} + i_{t} \circ \sigma_{c}\!\left(W_{c} h_{t}^{l-1} + U_{c} h_{t-1}^{l} + b_{c}\right) \tag{16}$$
$$h_{t}^{l} = o_{t} \circ \sigma_{c}\!\left(c_{t}\right) \tag{17}$$

where the W, U, and b terms denote the neural network weight matrices and biases, σ_g is the logistic sigmoid function, i, o, and f are the input, output, and forget gates, respectively, i_t, o_t, and f_t are the gate vectors, c is the cell state, the operator ∘ represents the element-wise product of vectors, σ_c is the hyperbolic tangent function, and l and l − 1 denote the index of the output and input features. Note that our experiments use no activation on the LSTM layer's output in (17).
The output of the LSTM layer is finally passed through fully connected layers, with a softmax activation on the final layer. The softmax output, ŷ, is represented by (18) and (19).

$$h_{1} = \sigma_{e}\!\left(\mathrm{BN}\!\left(W_{1} h^{l} + b_{1}\right)\right) \tag{18}$$
$$\hat{y} = \sigma_{s}\!\left(W_{2} h_{1} + b_{2}\right) \tag{19}$$

where h^l is the output of the LSTM layer, W_1 and W_2 are the neural network weight matrices, BN is the Batch Normalization layer, σ_e is the ELU function, σ_s is the softmax function, h_1 is the output of the first fully connected layer, and b_1 and b_2 are the bias vectors. Using the softmax probabilities, the predicted class, y_class, is given by (20).

$$y_{class} = \arg\max_{i}\, \hat{y}_{i} \tag{20}$$
where ŷ_i is the probability of the i-th class. The classifier is trained using the categorical cross-entropy loss, L_c, which is represented by (21).

$$L_{c} = -\frac{1}{K} \sum_{k=1}^{K} y_{k} \log \hat{y}_{k} \tag{21}$$
where K is the number of samples in the training data, y_k is the ground truth of sample k, and ŷ_k is the prediction for sample k.
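A sketch of the classifier described by (8)-(21) in Keras is given below. The filter counts, kernel and pooling sizes, LSTM width, and dense-layer width are assumptions; the article specifies the layer types and their order but not these hyperparameters.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

T, D1, NUM_CLASSES = 4000, 8, 8

def build_classifier():
    """1D CNN-LSTM classifier over the autoencoder latent space (T, 8)."""
    inp = layers.Input(shape=(T, D1))
    x = inp
    # Three blocks of Conv1D -> ELU -> time-distributed BatchNorm -> MaxPooling1D.
    for filters in (32, 64, 128):
        x = layers.Conv1D(filters, kernel_size=3, padding='same',
                          kernel_initializer='glorot_normal')(x)   # Xavier normal init
        x = layers.ELU(alpha=1.0)(x)
        x = layers.TimeDistributed(layers.BatchNormalization())(x)
        x = layers.MaxPooling1D(pool_size=4)(x)
    x = layers.LSTM(128)(x)                        # global temporal feature extractor
    x = layers.Dense(64)(x)                        # fully connected head, as in (18)-(19)
    x = layers.BatchNormalization()(x)
    x = layers.ELU(alpha=1.0)(x)
    out = layers.Dense(NUM_CLASSES, activation='softmax')(x)
    model = models.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss='categorical_crossentropy', metrics=['accuracy'])
    return model
```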
After predicting the labels of segments, the majority vote, y vote , of the labels of all the segments is assigned to the whole utterance.
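The utterance-level majority vote can be sketched as follows; ties are broken at random, as described in the experimental setup.

```python
import numpy as np
from collections import Counter

def utterance_prediction(segment_preds, rng=None):
    """Majority vote over segment-level predicted labels; ties are broken at random."""
    rng = np.random.default_rng() if rng is None else rng
    counts = Counter(segment_preds)
    top = max(counts.values())
    candidates = [label for label, count in counts.items() if count == top]
    return rng.choice(candidates)
```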
The pseudo-code of the entire method is presented in Algorithm 1.

A. Evaluation Metric
The most widely used evaluation metrics for SER are weighted accuracy (WA) and unweighted accuracy (UA). WA is the average accuracy over all samples, which can be computed using (22).

$$WA = \frac{\sum_{j=1}^{k} n_{j}}{\sum_{j=1}^{k} N_{j}} \tag{22}$$

where k is the number of classes, n_j is the number of correctly classified samples in class j, and N_j is the total number of samples in class j.
UA is the average per-class accuracy, which can be computed using (23). When the class distribution is skewed, WA is not a reliable metric; therefore, UA is primarily used for unbalanced classes.

$$UA = \frac{1}{k} \sum_{j=1}^{k} \frac{n_{j}}{N_{j}} \tag{23}$$

where k is the number of classes, n_j is the number of correctly classified samples in class j, and N_j is the total number of samples in class j.

Algorithm 1: WaDER.
1: for each audio sample in the RAVDESS dataset do
2: Preprocess the sample (downsample, trim silence, standardize).
3: Augment two new samples from the sample using AWGN.
4: Extract the CWT features from all three samples.
5: Divide the CWT features into fixed-sized segments.
6: Decrease the temporal resolution of the CWT features, R^(16000×125) → R^(4000×125).
7: Append the CWT features to the array X_all.
8: end for (X_all is an (N_1 × T × D) array, where N_1 is the number of segments, T is the number of timesteps, and D is the dimensionality of the CWT features.)
9: Standardize the feature array, X_all.
10: Train the autoencoder using the features X_all.
11: Train an ensemble of seven classification models, using the latent space of the autoencoder, z ∈ R^(T×D1), as the input features for the classification models, where D1 is the dimensionality of the latent space and D1 < D.
12: Segment-level prediction: make a prediction, y_class, for each segment using the ensemble of models.
13: Utterance-level prediction: compute the majority vote, y_vote, of the predicted segment labels to make a prediction at the utterance level.
14: Append y_vote to the array P.
15: return P
For the experiments, the values of T, D, and D1 are 4000, 125, and 8, respectively.
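WA and UA as defined in (22) and (23) correspond to the overall accuracy and the macro-averaged per-class recall, respectively; a short sketch using scikit-learn:

```python
from sklearn.metrics import accuracy_score, recall_score

def wa_ua(y_true, y_pred):
    """Weighted accuracy (22) and unweighted accuracy (23) from utterance-level labels."""
    wa = accuracy_score(y_true, y_pred)                      # overall sample-level accuracy
    ua = recall_score(y_true, y_pred, average='macro')       # unweighted mean of per-class recalls
    return wa, ua
```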

B. Experimental Setup
Speaker-dependent (SD) speech emotion recognition is performed here. A Monte Carlo experiment consisting of seven runs is conducted to obtain a robust estimate of performance. At the beginning of each run, the data is randomly split into training and testing sets: 85% of the samples from each class are used for training, and the remaining 15% are used for testing.
Note that the same training data is used for both autoencoder and classifier models to prevent data leakage.
1) Autoencoder: The autoencoder is trained using 5-fold cross-validation. In each fold, 10% of the data is used for validation. The model is trained for 40 epochs in each fold. The batch size is set to 128, and the learning rate is set to 0.0001. The autoencoder is trained using the reconstruction loss in (7); the resulting reconstruction losses are reported in Table II.
2) Classifier: The classification deep neural network is trained using 8-fold cross-validation. In each fold, 10% of the data is used for validation. The model is trained for 25 epochs in each fold. The batch size is set to 64, and the learning rate is set to 0.0001. The classifier model is trained using the Adam optimizer and the categorical cross-entropy loss function. An ensemble of 7 classification models is used to make robust predictions; all the models have the same architecture but are initialized with different weights. The autoencoder and the classifier are trained using the CWT features of segments and not of the whole utterance. Therefore, after predicting the labels of the segments, a majority vote of the predicted labels of all the segments is taken, and the label of the majority of the segments is assigned to the whole utterance. A tie is broken at random. In both the autoencoder and the classifier models, experiments are conducted by switching the order of the Batch Normalization and ELU activation layers, and a similar performance is observed.
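A sketch of the Monte Carlo splitting procedure is given below, assuming scikit-learn's stratified `train_test_split`; the per-run random seeds and the helper name `monte_carlo_splits` are assumptions.

```python
from sklearn.model_selection import train_test_split

def monte_carlo_splits(X, y, n_runs=7, test_size=0.15, seed=0):
    """Yield stratified 85/15 train/test splits for the Monte Carlo experiments."""
    for run in range(n_runs):
        yield train_test_split(X, y, test_size=test_size,
                               stratify=y, random_state=seed + run)

# Example usage:
# for X_tr, X_te, y_tr, y_te in monte_carlo_splits(X_all, y_all):
#     ...train the autoencoder and the classifier ensemble on (X_tr, y_tr)...
```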

C. Results
The mean UA and WA achieved over the Monte Carlo experiments are 81.45 ± 1.19% and 81.22 ± 1.31%, respectively. The confusion matrix of one of the experiments is shown in Fig. 6.

D. Comparison and Analysis
To validate the effectiveness of the proposed method, WaDER, a comparison is made with state-of-the-art methods in terms of speaker-dependent unweighted and weighted accuracy (Table III). The comparison uses the mean accuracies reported by the different methods, [15, Tab. VII], and [48, Tab. IX]. The proposed method outperforms all the other methods in terms of unweighted and weighted accuracy. The best unweighted and weighted accuracies achieved by our method are 83.4% and 83.7%, respectively. Our method is similar to the work by Aghajani et al. [49]; one difference is that we reduce the dimensionality of the wavelet features using an autoencoder and choose the wavelet scales carefully. The classifier's feature extraction blocks are similar to the local feature learning block (LFLB) of Zhao et al. [9]. Some methods report better results than ours. However, a direct comparison with those methods is not possible because they either classify only a few classes from a benchmark dataset or use a combination of various features [48]. Those methods focus on increasing SER accuracy by combining different kinds of features, whereas this work aims to explore the potential of the wavelet transform as features for SER. Additionally, instead of time-frequency domain features, some methods [50], [51] use embeddings from a pretrained model as features to train their SER model; therefore, a comparison is not made with such methods. Nevertheless, Farooq et al. [51], who used features from a pretrained AlexNet, achieved a mean weighted accuracy of 81.3%, which is only 0.1% more than our method's mean weighted accuracy. The 95% confidence interval of our model's weighted and unweighted accuracies also overlaps with the accuracies reported by Kwon et al. [46].
The confusion matrix in Fig. 6 shows that the proposed method achieves high accuracy for all the emotions. However, there is some confusion between "neutral" and "calm" samples. It is quite challenging to differentiate between these two classes as they have similar pitch and speaking rates, and even human listeners cannot distinguish between them with 100% accuracy. Therefore, several methods merge these two classes into a single class because of their high similarity. Similarly, there are some ambiguities between "calm" and "sad" samples. Strong emotions like "angry", "disgust", "surprised", and "fearful" are also slightly confused with each other, as they all possess higher amplitudes at higher frequencies.
One key point was reducing the dimensionality of the continuous wavelet transform features. The autoencoder compresses the wavelet features by a factor of about 15.6 (from 125 to 8 dimensions per timestep). Table II shows that the autoencoder is able to reconstruct the data from the latent space efficiently. The optimal size of the latent space was found to be (4000, 8) (the original size was (4000, 125)) after experimenting with different latent space sizes. If the size of the latent space is decreased further, more information is lost and the wavelet features are reconstructed with a higher loss; on the other hand, if the size of the latent space is increased, the classifier begins to overfit. The primary reason for choosing an ensemble of seven models is to avoid overfitting. Since the RAVDESS dataset contains few samples, new samples are augmented by adding Additive White Gaussian Noise with an SNR between 15 dB and 30 dB, which results in a more robust performance. However, the model is still very prone to overfitting: if it is trained for more epochs, it immediately begins to overfit, so training is stopped as soon as overfitting begins. Nevertheless, the proposed method still outperforms the other methods in terms of weighted and unweighted accuracy.

V. CONCLUSION
This article uses the continuous wavelet transform as features to perform speech emotion recognition. The proposed method, WaDER, first uses an autoencoder to reduce the dimensionality of the wavelet features; the latent space is then used to perform classification using an ensemble of seven deep neural networks. The experiments are conducted on the RAVDESS dataset. We observed that the continuous wavelet transform features are able to distinguish between several emotions and enable an accurate and robust classification, demonstrating the potential of the multiresolution property of wavelets for classifying emotions. However, the current methodology still requires some improvements. Firstly, extracting features from the frequencies that carry the most discriminative emotional information could improve SER performance; a channel-wise attention mechanism that extracts the salient features from the frequencies at each timestep could be used to achieve this. Secondly, a new strategy is required to overcome the severe overfitting problem. Thirdly, the model needs to be extended to speaker-independent (SI) scenarios and its performance evaluated. Our future work will focus on overcoming these challenges and developing a more efficient SER architecture.