Cascaded Convolutional Recurrent Neural Networks for EEG Emotion Recognition Based on Temporal–Frequency–Spatial Features

Featured Application: The proposed method can improve emotion recognition accuracy in human–computer interactions


Introduction
Emotion represents a complex psychological construct that encompasses an individual's affective, cognitive, and behavioural responses to external stimuli, accompanied by corresponding physiological reactions [1]. In the modern era of intelligent technology, people's daily lives are becoming increasingly intertwined with these advanced systems, underscoring the growing importance of accurate emotion recognition in human-computer interaction. Studies in neurological physiology [2] and social psychology [3] have revealed a strong correlation between EEG signals and numerous cognitive processes, including emotional responses; because EEG can objectively reflect the real emotions of subjects, EEG-based emotion recognition has become a cutting-edge research area in cognitive science [4].
Owing to the nonlinear, non-stationary, low signal-to-noise ratio (SNR), and multichannel correlation characteristics of EEG data, which are inherently complex and chaotic, numerous data processing methods are available in the research field [5], but fully extracting the key features remains a challenge. The main contributions of this paper are as follows:

•	EEG data are converted into a 4D matrix structure consisting of multiple frames, which contain information in three dimensions: temporal, frequency, and spatial, and can effectively represent the neural features of different emotions.

•	A novel attention module, FcaNet, is introduced. FcaNet redistributes the weights of different channels to obtain high-quality discrimination and is found to be superior to the traditional channel attention squeeze-and-excitation network (SENet) while incurring no significant computational cost.

•	To satisfy the real-time demands of the emotion recognition system, a residual network is devised that combines depthwise convolution (DC) and pointwise convolution (PC) to decrease the computational burden, exploiting the ability of depthwise-separable convolution to separate the spatial and channel mixing dimensions. Furthermore, the residual structure helps prevent overfitting. Finally, Bi-LSTM is employed to learn the temporal interdependence among the frames in a sample; the hidden-layer states at each frame moment are assigned weights and then summed to serve as the input to the softmax classifier.
The experimental results show that the designed model achieves advanced performance on the DEAP dataset. The rest of the article is arranged as follows: Section 2: Related Work, Section 3: Methods, Section 4: Materials and Experimental Results, and Section 5: Conclusions.

Related Work
In the realm of emotion recognition, machine learning has traditionally been a popular approach for simple classification tasks [6]. Some prominent algorithms, including support vector machines (SVM), decision trees (DT), and k-nearest neighbours (KNN), have been successfully utilized in this field. Zubair [7] used the discrete wavelet transform (DWT) to extract temporal-frequency information and applied the maximum relevancy and minimum redundancy algorithm (mRMR) to select the most relevant features. In the literature [8,9], the wavelet transform (WT) was applied for sub-band EEG signal decomposition, and the processed smooth feature information was then fed into an SVM for classification. This method demonstrated promising improvement in accuracy for EEG emotion-state recognition in machine learning. However, machine learning techniques still face significant limitations when processing nonlinear and indistinguishable data, which restrict their capability for more complex classification tasks.
With the advent of deep learning, the limitations of machine learning are gradually being overcome, and deep learning techniques are being successfully applied to EEG emotion recognition. Typically, deep-learning-based approaches focus on feature extraction from three dimensions: temporal, frequency, and spatial. For instance, in terms of the temporal information of EEG signals, Xing [10] proposed a framework combining a stacked autoencoder (SAE) and long short-term memory neural networks (LSTM). The SAE is employed to simulate the mixing process in EEG and to separate the source signals. Then, the source signals are framed, and frequency features are extracted and combined into chained data, followed by discriminative classification using LSTM. Ma [11] designed the multimodal residual LSTM (MMResLSTM) network, using different LSTM layers to learn the temporal characteristics of different physiological signals and sharing parameters to achieve information interaction between different modal data. References [12][13][14] introduced a temporal learning architecture that employs a 1D convolutional neural network (CNN) to extract temporal information from multichannel chained data.
Researchers have been interested in exploring physical models between electrode positions to characterize spatial features in EEG signals. Hwang [18] proposed a method to generate an image by performing a polar coordinate projection of the channel DE features and using different interpolation methods to fill the blank space after the projection, thus proving that the spatial topology based on electrode arrangement is effective. Song [19] utilized graph theoretic ideas to model multichannel EEG signals and used DGCNN to explore the depth spatial information of neighbouring channels. Other studies [20][21][22][23] constructed a connectivity matrix containing structural information of the brain to express features in different ways and then input the rearranged EEG signals into an end-to-end CNN model.
The researchers have also considered various feature information combinations. References from the literature [23,24] extracted the DE features of the four EEG signal sub-bands, which were mapped into a 3D matrix based on the electrode distribution to retain its channel information. Finally, the spatial-frequency information was extracted by different 2D convolutions. Researchers in [25][26][27][28] introduced a combined CNN and LSTM model that learns spatial-frequency and temporal features, respectively, from the input signal. Experimental findings reveal that the accuracy of combined multidimensional feature information surpasses that of a single dimension.
There exists a broad range of feature extraction methods; however, fully exploiting key features remains a significant challenge. Introducing an attention mechanism has greatly enhanced the capabilities of various classification models. Researchers in EEG emotion recognition have noted that attentional mechanisms can selectively focus on brain regions associated with emotional stimulation and have begun to explore their application to EEG emotion recognition to improve performance. Zhang [29] introduced band attention and temporal attention in a hybrid deep learning model to adaptively assign weights for different frequency bands and times, respectively. In [22,30], researchers constructed 3D data containing temporal-frequency information, introduced the channel attention module SENet to assign weights to frequency bands and channels, and obtained advanced results.

Methods
In this section, data preprocessing and the three components of the proposed model are explained in detail. A complete overview of the model framework is shown in Figure 1.

Data Preprocessing
In the original data acquisition, a few seconds of the subject's steady state before the emotional stimulus are recorded. Most studies remove this part of the data directly and consider only the EEG signal in the stimulated state. Studies [25,30,31] demonstrated that preprocessing with the baseline signal is effective for improving experimental robustness; the difference between the signal under emotional stimulation and the baseline signal is taken as the indicator of the segment's emotional state. Assume that the dataset is sampled at a frequency of S. Consider the entire baseline clip X_base ∈ R^(M×N1), where M is the number of electrodes (32) and N1 is the total number of sampling points of the baseline clip. First, X_base is uniformly divided into 1 s segments X^1_base, X^2_base, ..., X^O_base, where X^i_base ∈ R^(M×S) (i = 1, 2, ..., O) denotes the i-th baseline segment and O = N1/S. Then, the average baseline X̄_base ∈ R^(M×S) is calculated as follows:

X̄_base = (1/O) Σ_{i=1}^{O} X^i_base.

Let X_trial ∈ R^(M×N2) denote the experimental EEG signal under the emotional stimulation condition, where N2 is the total number of sampling points of the experimental EEG signal. As above, the experimental data are split into Q segments of the same length as the baseline segments to obtain X^1_trial, X^2_trial, ..., X^Q_trial, where X^i_trial ∈ R^(M×S) (i = 1, 2, ..., Q) denotes the i-th experimental segment and Q = N2/S. Finally, the baseline-removed data are obtained by subtracting the baseline mean X̄_base from each segment of the experimental data:

X^i_trial.rmov = X^i_trial − X̄_base, i = 1, 2, ..., Q,

where X^i_trial.rmov represents the contrast between X^i_trial and X̄_base. The Q baseline-removed segments are then concatenated into the complete data X_trial.rmov = [X^1_trial.rmov, X^2_trial.rmov, ..., X^Q_trial.rmov] ∈ R^(M×N2), on which the subsequent feature extraction is performed.
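The baseline-removal step above can be sketched in numpy as follows (a minimal illustration; the function and variable names are ours, not the paper's):

```python
import numpy as np

def remove_baseline(x_base, x_trial, s):
    """Subtract the mean 1 s baseline segment from each 1 s trial segment.

    x_base : (M, N1) baseline EEG, x_trial : (M, N2) stimulus EEG,
    s      : sampling rate (samples per 1 s segment).
    """
    m, n1 = x_base.shape
    o = n1 // s                                   # number of baseline segments O
    # mean of the O baseline segments, shape (M, S)
    base_mean = x_base[:, :o * s].reshape(m, o, s).mean(axis=1)
    n2 = x_trial.shape[1]
    q = n2 // s                                   # number of trial segments Q
    segs = x_trial[:, :q * s].reshape(m, q, s)
    rmov = segs - base_mean[:, None, :]           # broadcast subtraction
    return rmov.reshape(m, q * s)                 # X_trial.rmov, shape (M, N2)
```

For DEAP, `s = 128`, the baseline is 3 s, and the trial is 60 s per video.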

Multiband Four-Dimensional Feature Construction
In this section, the process of forming a 4D matrix based on temporal–frequency–spatial features is explained in detail, as shown in Figure 2.

Since the integration of the temporal and frequency-band information of all available channels requires a three-dimensional spatial representation, a matrix of filtered signals obtained from all channels must be considered. Take the DEAP dataset as an example, which includes a total of 1280 samples (32 subjects × 40 experiments). Because a small quantity of data can cause overfitting when the network is deep, we reduce this effect by increasing the number of samples. Specifically, for the signal X_trial.rmov ∈ R^(M×N2) after baseline removal, a nonoverlapping windowing process with a duration of u seconds (u = 5 s in this paper) is performed to obtain the divided data X^1_S, X^2_S, ..., X^n_S, where the i-th window segment X^i_S = {x_1, x_2, ..., x_M} ∈ R^(M×R) (i = 1, 2, ..., n), x_c ∈ R^R (c = 1, 2, ..., M) denotes the data from the c-th electrode channel in X^i_S, and R is the total number of sampling points in the whole window segment.
Different frequency bands of EEG reflect different emotional information, as detailed in Table 1 [32]. Emotion-oriented research commonly requires a highly linear phase, and a finite impulse response (FIR) filter provides a rigorously linear phase with high stationarity and low interference power caused by operational errors. Therefore, we use an FIR filter to divide each window segment X^i_S into sub-bands. Qu [25] experimentally compared EEG emotion recognition results under different band combinations and found that the combination of the α, β, and γ bands had the highest recognition accuracy. In addition, Frantzidis [33] remarked that θ-band features are closely correlated with arousal. Therefore, in this paper, the four bands θ, α, β, and γ are chosen to study the emotional state features of EEG signals. Through the FIR filter, X^i_S is decomposed into the band signals p^θ_i, p^α_i, p^β_i, p^γ_i (i = 1, 2, ..., n). The specific realization formula is as follows:

h(n) = h_d(n) · w(n),
h_d(n) = (1/2π) ∫_{−π}^{π} H_d(e^{jw}) e^{jwn} dw = [sin(W_h(n − τ)) − sin(W_l(n − τ))] / [π(n − τ)],

where h(n) is the filter coefficient, H_d(e^{jw}) is the corresponding frequency response, H(w) is the amplitude–frequency response function, h_d(n) is the unit impulse response, w(n) is the window function, W_h and W_l are the cut-off frequencies of the bandpass filter, τ = (M − 1)/2, and M is the number of filter taps. Treating all channel data of a single frequency band as a whole, X^i_S is transformed into p^θ_i, p^α_i, p^β_i, p^γ_i, where p = {p_1, p_2, ..., p_M} denotes all channel data of a single frequency band; the arrangement order is shown in Figure 3.

After data enhancement, our focus shifts to emotion recognition at the segment level. Considering that human emotion changes are temporally dynamic, the window-segmented signal X^i_S is divided into equal-length frames of 0.5 s. Using the 0.5 s data of each channel as the vector components of a single frame window, X^i_S is converted into a sequence of 2u frame vectors f^j_i (i = 1, 2, ..., n; j = 1, 2, ..., 2u). Considering the information complementarity between different features, the DE and PSD features of all channels within each frame window are extracted in this paper [16]. DE extends Shannon's concept of information entropy to a continuous probability distribution and is a good method for describing internal EEG information. For an EEG segment of a specific length that approximately obeys a Gaussian distribution N(µ, σ²_i), it is calculated as:

DE = (1/2) log(2πe σ²_i),

where σ²_i is the variance of the sequence signal. PSD is a physical quantity that characterizes the relationship between the power of a signal and its frequency and is often used to study random vibration signals; it can describe the activation level and emotional complexity of EEG signals. It is calculated as:

P(w) = (1/(M·U)) |Σ_{n=0}^{M−1} x^i_N(n) d(n) e^{−jwn}|², U = (1/M) Σ_{n=0}^{M−1} d²(n),

where x^i_N(n) is the sampled data in segment i, d(n) is the selected window function, M is the length of each segment, and U is the normalization factor.
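The two per-frame features can be sketched in numpy as follows (a minimal illustration assuming one band-filtered 0.5 s frame; summarizing the modified periodogram by its mean power is our illustrative choice, not necessarily the paper's exact reduction):

```python
import numpy as np

def frame_features(frame):
    """DE and PSD features for one 0.5 s frame of band-filtered EEG.

    frame : (M, L) array, M channels by L samples.  DE assumes the
    band-passed signal is approximately Gaussian, giving
    DE = 0.5 * log(2*pi*e*var).  PSD uses a windowed (modified)
    periodogram, summarized here as the mean power over the spectrum.
    """
    var = frame.var(axis=1)
    de = 0.5 * np.log(2 * np.pi * np.e * var)       # shape (M,)
    m, L = frame.shape
    d = np.hanning(L)                               # window function d(n)
    u = (d ** 2).sum() / L                          # normalization factor U
    spec = np.fft.rfft(frame * d, axis=1)
    pxx = (np.abs(spec) ** 2) / (L * u)             # modified periodogram
    psd = pxx.mean(axis=1)                          # shape (M,)
    return de, psd
```

For DEAP at 128 Hz, a 0.5 s frame gives `L = 64` samples per channel.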
Both of the resulting feature sets form a two-dimensional vector sequence over the four frequency bands, where the DE frame vectors f^{jγ}_i and the PSD frame vectors g^{jγ}_i retain the same arrangement as p^{jγ}_i. They can be described as:

F_i = {f^1_i, f^2_i, ..., f^{2u}_i}, G_i = {g^1_i, g^2_i, ..., g^{2u}_i}.

Nevertheless, frequency features alone can never fully characterize all the feature components of the entire signal. EEG is acquired from electrodes placed in different regions, and the positional relationships between these electrodes contain information about the spatial structure associated with emotions. Therefore, in this paper, each single-band frame vector is mapped into a two-dimensional matrix using spatial mapping. References [34][35][36] proposed the sparse transform, compact transform, and sensitive transform methods, respectively. The sparse transformation matrix is 19 × 19, which undoubtedly requires many computations [34]. The compact matrix of Shen [35] reduces the size to 8 × 9, which drastically reduces the computational effort and strengthens the connections between adjacent electrodes in the matrix, but it is not sensitive to spatial information and does not perform well experimentally. Therefore, we use the sensitive transformation method of Xu [36] to map the data to a 9 × 9 matrix. Compared with the first two methods, the connection relationships between electrode points better conform to the "10/20" system while keeping the computational effort reasonable. These three mapping methods are shown in Figure 4. Each frame vector f^{jγ}_i of a single frequency band is thereby converted into a two-dimensional matrix. SEED and DEAP contain 62 and 32 electrode channels, respectively, which are mapped to electrode positions; the corresponding DE and PSD values fill the elements with a position mapping, and the remaining positions are set to 0. In this way, each window segment is transformed into a two-dimensional matrix sequence from the perspective of a single frequency band.
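The scatter of per-channel feature values into the 9 × 9 matrix can be sketched as follows. The electrode coordinates below are a small hypothetical subset chosen for illustration, not the paper's full "sensitive" layout:

```python
import numpy as np

# Hypothetical subset of a 9x9 electrode map: channel name -> (row, col).
# The real mapping follows the 10/20 system; only a few entries are shown.
ELECTRODE_POS = {"Fp1": (0, 3), "Fp2": (0, 5), "F3": (2, 2), "F4": (2, 6),
                 "C3": (4, 2), "Cz": (4, 4), "C4": (4, 6), "O1": (8, 3)}

def map_to_matrix(values, channel_names, pos=ELECTRODE_POS, size=9):
    """Scatter per-channel feature values into a size x size matrix.

    Positions without an electrode stay 0, as in the paper.
    """
    mat = np.zeros((size, size))
    for v, name in zip(values, channel_names):
        r, c = pos[name]
        mat[r, c] = v
    return mat
```

Applying this per band and per frame yields the 9 × 9 matrix sequence described above.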
Taking the θ-band as an example, S^θ = {S^θ_1, S^θ_2, ..., S^θ_j} ∈ R^(9×9×2u) (j = 1, 2, ..., 2u), which extends the representation into three dimensions. The four-dimensional data of a single feature over the whole window segment are obtained by fusing the four frequency bands; taking the DE feature of the window segment as an example, S_(DE) = {S_(DE)1, S_(DE)2, ..., S_(DE)j} ∈ R^(9×9×4×2u). Considering the information complementarity between different features, we stack the same frequency-band data of the DE and PSD features together to construct the final four-dimensional data S = {S_1, S_2, ..., S_j} ∈ R^(9×9×8×2u). After obtaining the mapping matrices of all frame-window feature data, Equation (10) is used to normalize the data to a distribution with a mean of 0 and a variance of 1, to avoid extreme values that may negatively affect the recognition results during gradient descent:

x_scale = (x − x_mean)/σ, (10)
where x mean is the mean of the eigenvalue data, σ is the standard deviation of each set of features, x is the actual eigenvalue, and x scale is the final normalized data.

FcaNet
In this study, the aim is to obtain high-quality EEG feature information. To achieve this goal, we incorporate an attention module called FcaNet [37] into the backbone network to lower the weight of low-quality EEG information. FcaNet is a novel attention module based on SENet [38], which was initially used in target detection. Unlike the global average pooling (GAP) used in SENet to squeeze the feature map, FcaNet uses the two-dimensional discrete cosine transform (2D-DCT) for the same purpose. GAP corresponds only to the lowest-frequency component of the 2D-DCT, so using it alone discards the remaining frequency components in the channel. A comparison of SENet and FcaNet is shown in Figure 5. Given an input X, it is divided into n parts along the channel dimension: X^0, X^1, X^2, ..., X^(n−1), where i ∈ {0, 1, 2, ..., n − 1}, X^i ∈ R^(H×W×C′), C′ = C/n. Each part is assigned a corresponding 2D-DCT frequency component, and the 2D-DCT result is used as the compressed representation of that channel group. The 2D-DCT is:

Freq^i = 2D-DCT^(u_i,v_i)(X^i) = Σ_{h=0}^{H−1} Σ_{w=0}^{W−1} X^i_{:,h,w} · cos((πu_i/H)(h + 1/2)) · cos((πv_i/W)(w + 1/2)),

where H, W, and (u_i, v_i) are the height, width, and 2D frequency index of X^i, respectively. The whole feature-information compression vector can be represented by cascading:

Freq = cat([Freq^0, Freq^1, ..., Freq^(n−1)]).

The complete FcaNet framework can then be described as follows:

ms_att = sigmoid(fc(Freq)),

where fc is a fully connected mapping and ms_att contains the channel weights applied to X.
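A minimal numpy sketch of this multi-spectral channel attention follows (illustrative shapes and names; a real FcaNet layer would learn `w_fc` and typically use a bottleneck fc, which is omitted here):

```python
import numpy as np

def fca_attention(x, freq_uv, w_fc):
    """FcaNet-style channel attention (sketch, numpy only).

    x       : (C, H, W) feature map.
    freq_uv : list of (u, v) 2D-DCT frequency indices, one per channel group;
              channels are split into len(freq_uv) equal parts.
    w_fc    : (C, C) weight of the fully connected layer (no bias assumed).
    """
    c, hh, ww = x.shape
    n = len(freq_uv)
    step = c // n
    freq = np.empty(c)
    h = np.arange(hh)[:, None]
    w = np.arange(ww)[None, :]
    for i, (u, v) in enumerate(freq_uv):
        # 2D-DCT basis for this group's assigned frequency component
        basis = (np.cos(np.pi * u * (h + 0.5) / hh)
                 * np.cos(np.pi * v * (w + 0.5) / ww))
        part = x[i * step:(i + 1) * step]
        freq[i * step:(i + 1) * step] = (part * basis).sum(axis=(1, 2))
    att = 1.0 / (1.0 + np.exp(-(w_fc @ freq)))      # sigmoid(fc(Freq))
    return x * att[:, None, None]
```

With `(u, v) = (0, 0)` for every group, the basis is constant and the compression degenerates to (scaled) global average pooling, which is exactly the SENet special case discussed above.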

Spatial-Frequency Feature Learning
Extracting spatial-frequency features is primarily accomplished through the use of the convolutional encoder and the attention module FcaNet discussed in Section 3.3.1, with the module structure illustrated in Figure 6.
Specifically, the 2u frames S_j in each segment are fed into the CNN module sequentially in time order, and the three-dimensional data structure of each frame is 9 × 9 × 8. To better retain information in the relatively small three-dimensional data structure, two convolutional layers are first applied. The first layer employs a 1 × 1 convolution kernel with 64 kernels, and the second layer utilizes a 3 × 3 convolution kernel with 128 kernels. These different convolution kernels extract deeper information from the three-dimensional data. Subsequently, the residual network is constructed using DC and PC. This combination reduces the number of training parameters while extracting internal features from the expanded individual feature maps using DC and expressing cross-feature-map relationships using PC. Moreover, the residual structure effectively avoids network degradation. ReLU is used as the activation function for each convolutional layer, and BatchNorm processing is performed. Padding operations are carried out for DC to keep the output size of each convolutional layer consistent. FcaNet assigns weights to the different channel features to enhance model performance. The data are reduced in dimensionality by a 2 × 2 max pooling layer at the end of the cycle, followed by conversion to one-dimensional data via a flattening layer. Finally, each data frame is convolutionally encoded into the vector S′_j ∈ R^1152.
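The parameter saving from replacing a standard convolution with DC + PC can be checked with simple arithmetic (a generic comparison at the 128-channel width mentioned above, not the exact layer configuration of the paper):

```python
def conv_params(c_in, c_out, k):
    """Parameter count of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def sep_conv_params(c_in, c_out, k):
    """Depthwise (DC: one k x k filter per input channel) + pointwise (PC: 1 x 1)."""
    return k * k * c_in + c_in * c_out

std = conv_params(128, 128, 3)       # 147456 parameters
sep = sep_conv_params(128, 128, 3)   # 17536 parameters (~8.4x fewer)
```

The roughly 8× reduction is what makes the depthwise-separable residual blocks attractive for the real-time requirement stated in the introduction.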

Temporal Feature Learning
Since emotion changes are temporally dynamic, the changes between frames in the four-dimensional data may hide emotion-related information. To explore the temporal correlation features of the whole window segment, we input the spatial–frequency features S′ = {S′_1, S′_2, ..., S′_j} obtained after convolutional encoding into the Bi-LSTM network in time order.
Building on traditional recurrent neural networks (RNN), LSTM solves the problem that distant contextual information cannot be exploited while nearby but semantically weakly related information dominates. It is a model for processing sequential signals that mitigates the gradient vanishing that occurs with long sequence inputs in an RNN. The computation of an LSTM cell is as follows:

i_t = sigmoid(W_i · [h_{t−1}, x_t] + b_i),
f_t = sigmoid(W_f · [h_{t−1}, x_t] + b_f),
O_t = sigmoid(W_O · [h_{t−1}, x_t] + b_O),
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C),
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t,
h_t = O_t ⊙ tanh(C_t),

where x_t is the current input feature; W and b are the weight matrices and bias vectors to be trained; and i_t, f_t, and O_t are the three gates introduced by LSTM (the input gate, forget gate, and output gate, respectively), each passed through the sigmoid function so that its range is controlled between 0 and 1. The cell state C_t characterizes long-term memory, the candidate state C̃_t represents the new information to be deposited into C_t, and h_t is the hidden state. In comparison, the output of the Bi-LSTM combines the two directions:

h_t = [h_t^forward ; h_t^backward]. (21)

Bi-LSTM networks combine an LSTM that moves forward from the beginning of the sequence with an LSTM that moves backward from the end; the backward layer supplements the forward pass with information from later in the sequence. It is worth mentioning that the parameters of the two LSTM networks in Bi-LSTM are mutually independent; they share only the input vector sequence. This structure merges the gated architecture with bidirectionality and experimentally proves more efficient than a single LSTM for sequence feature extraction. The network architecture of Bi-LSTM is shown in Figure 7. The time sequence is input to the model; the forward layer holds the information at time t and earlier, while the backward layer holds the information at time t and later. The hidden-layer outputs of the two LSTM layers can be combined by summation, averaging, or concatenation; Equation (21) gives the output in concatenated form.
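One step of the gate equations above can be sketched in numpy (a minimal illustration with the four gate weights stacked into one matrix; names and shapes are our convention):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following the gate equations above.

    W : (4*H, D+H) stacked weights for the i, f, O and candidate gates;
    b : (4*H,) stacked biases; D = input size, H = hidden size.
    """
    hsz = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    i_t = sigmoid(z[:hsz])                 # input gate
    f_t = sigmoid(z[hsz:2 * hsz])          # forget gate
    o_t = sigmoid(z[2 * hsz:3 * hsz])      # output gate
    c_tilde = np.tanh(z[3 * hsz:])         # candidate state
    c_t = f_t * c_prev + i_t * c_tilde     # long-term memory update
    h_t = o_t * np.tanh(c_t)               # hidden state
    return h_t, c_t
```

A Bi-LSTM runs this recurrence once over the frame sequence forward and once backward with independent parameters, then concatenates the two hidden states per frame as in Equation (21).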
To highlight the data at key frame moments, the output of the Bi-LSTM is not used directly as the result of the temporal feature learning module in this paper. Instead, a nonlinear transformation is first applied to the hidden-layer state at each frame moment:

H_temp,j = tanh(W_temp,j · h_j + b_temp,j).

The number of memory cells in each LSTM layer is q/2, and the hidden layer is processed in concatenated form to obtain h_j ∈ R^q. H_temp,j is the nonlinear expression of the hidden layer, W_temp,j ∈ R^(d×q) and b_temp,j ∈ R^d are the weights and bias vector of the tanh transformation, respectively, and d is set to 512.
After obtaining H_temp,j ∈ R^d, the softmax function is used to calculate the weight of each frame moment:

A_temp,j = exp(u_temp,jᵀ H_temp,j) / Σ_j exp(u_temp,jᵀ H_temp,j),

where u_temp,j ∈ R^d is a trainable parameter; the greater the value of A_temp,j, the more important the corresponding frame is in the sequence. The hidden-layer states of all frames are then multiplied by their weights and summed:

Z_temp = Σ_j A_temp,j · h_j,

where Z_temp is the output of the whole temporal feature learning module, which not only contains the temporal correlation of the whole window segment but also enhances the important frame data and suppresses irrelevant information. Finally, Z_temp is fed to the softmax classifier to obtain the prediction result. The entire temporal information extraction structure is shown in Figure 8 (time feature extraction module). In the proposed method, we conducted ablation experiments on the number of memory cells in the unidirectional LSTM layer, and the results are shown in Table 2; accordingly, each unidirectional LSTM layer is set to 256 memory cells (512 in total).
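The attention pooling over hidden states can be sketched as follows (a minimal numpy illustration of the three equations above; parameter names and shapes are our convention):

```python
import numpy as np

def temporal_attention(h_seq, w_temp, b_temp, u_temp):
    """Frame-level attention over Bi-LSTM hidden states (sketch).

    h_seq  : (T, q) hidden states h_j for T frames;
    w_temp : (d, q), b_temp : (d,) for the tanh transformation;
    u_temp : (d,) trainable context vector.
    """
    h_nl = np.tanh(h_seq @ w_temp.T + b_temp)   # H_temp,j = tanh(W h_j + b)
    scores = h_nl @ u_temp                      # u^T H_temp,j
    a = np.exp(scores - scores.max())
    a = a / a.sum()                             # softmax weights A_temp,j
    z = (a[:, None] * h_seq).sum(axis=0)        # Z_temp = sum_j A_j h_j
    return z, a
```

When all frames carry identical information, the weights are uniform and Z_temp reduces to the mean hidden state; training shifts the weights toward emotionally salient frames.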

Dataset
The DEAP dataset comprises EEG signals from 32 participants, and the acquisition process is illustrated in Figure 9. During data collection, the "10-20" international standard 32-lead electrode cap is used to record signals, and each participant watches 40 one-minute videos while EEG signals are recorded for 63 s (3 s baseline + 60 s of video stimulation) per sample. Thus, the entire dataset consists of 1280 (32 × 40) samples, with each sample containing 63 s of data from 32 channels. Following video viewing, participants subjectively evaluated the videos based on arousal, valence, dominance, and liking using a 1-9 scale. Two versions of the DEAP dataset are officially available: one is the raw signal containing noise such as electromyography (EMG) and electrooculogram (EOG); the other is the preprocessed data, which was downsampled from 512 Hz to 128 Hz and filtered and denoised with a 4-45 Hz bandpass filter.
This study utilizes the preprocessed version of the DEAP dataset in Python to conduct the experiments. The downsampling operation considerably reduces the computational effort, and the resulting impact on accuracy is minor. Upon removing the baseline from the dataset's signal, a single channel of 60 s yields 60 × 128 samples, and the resulting X_trial.rmov obtained from the 32 channels of a subject has the form 32 × 60 × 128. To create more samples, the data were segmented into nonoverlapping windows of duration u = 5 s; specifically, the data were partitioned into 12 windows of 5 s, and the data form of X_i^S was 32 × 5 × 128. All follow-up processing treats each window fragment X_i^S as a single experimental sample for emotion recognition. The data form is expanded to 32 × 5 × 128 × 4 by dividing X_i^S into frequency bands. To capture the hidden temporal information of the sequence signal, the single-band data p_i^θ, p_i^α, p_i^β, p_i^γ in X_i^S are separated into frames of 0.5 s, and each frame has the data form 32 × 0.5 × 128 × 4. On a per-frame basis, the vector data are converted into a two-dimensional matrix with the sensitive transformation, and the DE and PSD features of each 0.5 s frame are used as the matrix elements.
Each frame of a single frequency band becomes a 9 × 9 matrix; combining the two groups of features with the four frequency bands yields the final frame data in the form 9 × 9 × 8. Viewed as a whole, the sample X_i^S is transformed into a four-dimensional feature sequence of 9 × 9 × 8 × 10. Each subject has 40 (videos) × 60 (seconds) of EEG data; with 5 s as a sample, 480 (40 × 12) samples are generated per subject, and for 32 subjects a total of 15,360 samples are generated.
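The segmentation arithmetic above can be checked with a short NumPy sketch. Random placeholders stand in for real EEG; the DE/PSD extraction and the 9 × 9 electrode mapping are omitted, so only the shape bookkeeping is shown.

```python
import numpy as np

# DEAP preprocessing shapes: 32 channels, 60 s at 128 Hz after baseline
# removal; window u = 5 s, frame t = 0.5 s.
fs, channels = 128, 32
X_trial = np.random.randn(channels, 60 * fs)          # (32, 7680)

# 12 nonoverlapping 5 s windows -> each window has form (32, 5 x 128)
windows = X_trial.reshape(channels, 12, 5 * fs).transpose(1, 0, 2)
assert windows.shape == (12, 32, 640)

# each window -> ten 0.5 s frames of form (32, 64)
frames = windows[0].reshape(channels, 10, fs // 2).transpose(1, 0, 2)
assert frames.shape == (10, 32, 64)

# per subject: 40 videos x 12 windows = 480 samples; 32 subjects -> 15,360
print(40 * 12, 32 * 40 * 12)                          # -> 480 15360
```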

Experimental Parameter Settings and Evaluation Indices
The code in this article was implemented with CUDA 11.2 and PyTorch 1.11, and the hardware was a server with four Nvidia RTX 2080Ti GPUs, manufactured by Lenovo, Beijing, China. The loss function was cross-entropy, with L2 regularization applied to enhance generalization. The Adam optimizer was used, with a learning rate of 0.001 and a dropout rate of 0.5, under 10-fold cross-validation. To determine the optimal number of iterations for the subsequent experiments, a range of values was assessed, and epoch = 50 was chosen based on the achieved model accuracy. The outcomes of the comparative analysis are presented in Table 3.
Various performance metrics can be employed to assess a recognition system. Although accuracy is widely used to indicate the percentage of correctly predicted samples, it is not always the most appropriate metric, particularly for imbalanced data. Therefore, the F1-Score is used as an additional performance metric in this study. The F1-Score, the harmonic mean of precision and recall, assesses the performance of a binary classifier. Specifically, it is expressed as follows:
Precision = TP / (TP + FP), Recall = TP / (TP + FN), F1 = 2 × Precision × Recall / (Precision + Recall)
where TP and TN denote samples whose positive and negative labels, respectively, agree with the recognition results, FP denotes negative samples incorrectly recognized as positive, and FN denotes positive samples incorrectly recognized as negative.
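As a quick illustration of the metric (not code from the paper), the F1-Score can be computed directly from the confusion counts:

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp)   # fraction of predicted positives that are correct
    recall = tp / (tp + fn)      # fraction of actual positives that are found
    return 2 * precision * recall / (precision + recall)

# e.g. 90 true positives, 10 false positives, 20 false negatives
print(round(f1_score(90, 10, 20), 4))   # -> 0.8571
```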

Emotion Recognition Binary Classification Experiment
To verify the efficacy of the feature extraction method proposed in this article, we selected EEG features that lack frequency feature extraction and spatial transformation and compared them with the method outlined in this paper. For each set of features, we performed two-class and four-class classification tasks on the dataset. The set of features lacking frequency feature extraction and spatial transformation is referred to as the "baseline features" throughout this study; their detailed extraction process is shown in Figure 10. For a fair comparison, the same baseline removal, window segmentation, and frame segmentation were applied to the baseline features. The resulting X_trial.rmov obtained from the 32 channels of a subject has the form 32 × 60 × 128, and window segmentation generates the window fragment data X_i^S with form 32 × 5 × 128. Then, X_i^S is divided into five frequency bands, and its data form is expanded to 32 × 5 × 128 × 5.
To extract the complete sample timing information, a framing process with a length of 0.5 s is then carried out, transforming the form to 32 × 0.5 × 128 × 5. The five bands are superimposed along one dimension to convert the data into a three-dimensional matrix sequence of size 5 × 32 with a sequence length of 64 (0.5 × 128), so that each frame enters the convolutional coding module proposed in this paper in the form 5 × 32 × 64. The important temporal features of the EEG bands in the matrix sequence data are extracted by Bi-LSTM, and the results of the binary classification on the dataset are shown in Figure 11. For the baseline features, the average training and testing accuracies of binary classification over the four evaluation metrics were 98.99% and 94.43%, respectively.
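The baseline-feature reshaping can likewise be sketched with placeholder data; only the shape bookkeeping is shown, not the band decomposition itself.

```python
import numpy as np

# Baseline-feature pipeline: 5 bands x 32 channels x 64 samples per 0.5 s
# frame (random placeholders, not real EEG).
bands, channels, fs = 5, 32, 128
window = np.random.randn(bands, channels, 5 * fs)     # (5, 32, 640)

# Split the 5 s window into ten 0.5 s frames, each of form (5, 32, 64)
frames = window.reshape(bands, channels, 10, fs // 2).transpose(2, 0, 1, 3)
assert frames.shape == (10, 5, 32, 64)
```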
To enhance the spatial representation of the EEG signal, this paper employs the sensitive transformation technique, which maps each frequency band's sequential signal into a two-dimensional matrix. This approach rectifies the loss of spatial information in the baseline features. Additionally, the more indicative frequency features, namely DE and PSD, are utilized as the matrix element values in place of the sub-band amplitudes used in the baseline features. The results of the binary classification on the four DEAP dataset metrics are displayed in Figure 12. The proposed feature extraction method achieves average training and testing accuracies of 99.21% and 97.84%, respectively, across the valence, arousal, dominance, and liking states, which is approximately a 3% improvement in test accuracy over the baseline features.

Experiment on Four Classes of Emotion Recognition
There were 32 subjects with 40 affective trials each, for 1280 data groups in total. They were labelled and categorized into four classes: LALV, LAHV, HALV, and HAHV. The baseline features achieved 88.92% training and 84.70% testing accuracy on the four-class task, while the feature extraction approach proposed in this article achieved 90.15% and 88.46%, respectively. The test accuracy improved by nearly 4% compared to the baseline features. The results are shown in Figure 13.


Ablation Experiments
In Section 3.2, three forms of electrode mapping were proposed to determine the optimal mapping method for the deep learning model presented in this article. The performances of these mapping methods were experimentally compared, and the results are presented in Table 4. The sensitive transformation method, which includes more precise electrode locations and provides spatial information more consistent with emotional neural features, yielded the highest accuracy. Furthermore, its time consumption was comparable to that of the compact mapping method and approximately one-third of that of the sparse mapping method. Therefore, considering both time and accuracy, the sensitive transformation method is the most suitable approach.
The present study performs the recognition task on window segments as samples, relying on interframe correlation to extract temporal features. The selection of an appropriate window length u and frame length t is therefore crucial, and experiments were conducted to determine their optimal values. The accuracy of the model for different values of u and t is compared in Table 5. The results show that the best classification performance on the DEAP dataset was achieved with u = 5 s and t = 0.5 s. For binary valence and arousal, the maximum accuracy differences were 4.04% and 3.89%, respectively, and the maximum difference in the four-class task was 3.41%. For a window segment of a given length, increasing the number of frames enables Bi-LSTM to extract richer temporal information: for the same window segment, t = 0.25 s or t = 0.5 s yields higher accuracy than 1 s framing. However, at t = 0.25 s, doubling the number of frames does not bring a continued increase in accuracy; the overall accuracy is close to that at t = 0.5 s, while the computational effort of the network increases greatly.
Furthermore, although the proposed model showed good performance on both the binary and four-class tasks, the contribution of the individual modules within the model remained unknown. To address this, we designed five models with different structures for ablation experiments; their compositions and results are presented in Table 6. All models perform the recognition task on the 4D features designed in this paper, and the baseline CNN is designed without an attention mechanism. The LSTM models not labelled otherwise take the output of the last time step as the input to softmax. For the binary classification task, the combined CNN-LSTM model achieves an average accuracy of 84.45%, indicating that the combined model integrates the temporal-frequency-spatial information of the multidimensional features well. Bi-LSTM overcomes the limitation of LSTM in learning sequences only in one direction by learning them in both directions and combining the hidden-layer states to determine the output, which improves accuracy by approximately 4% over the CNN-LSTM model. SENet focuses on important channels in the feature matrix and improves model accuracy by assigning them different weights. FcaNet compresses the feature map with two-dimensional discrete cosine transforms, avoiding the loss of frequency components caused by SENet's global average pooling, and improves accuracy by 2% at a negligible computational cost. Additionally, to highlight the importance of different frame moments in the sample, the weighted sum of the hidden-layer states over all time steps is chosen as the output, achieving the best accuracy of 94.58%.
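For intuition, a simplified FcaNet-style multi-spectral channel attention can be sketched as follows. This is an illustrative NumPy reduction (one fixed 2D DCT basis per channel group, untrained fully connected weights), not the authors' implementation. Note that the (0, 0) frequency component reduces to global average pooling up to a constant, which is exactly the SENet special case that FcaNet generalizes.

```python
import numpy as np

def dct_basis(h, w, u, v):
    """2D DCT basis B_{u,v} evaluated on an h x w grid."""
    i = np.arange(h)[:, None]
    j = np.arange(w)[None, :]
    return (np.cos((2 * i + 1) * u * np.pi / (2 * h)) *
            np.cos((2 * j + 1) * v * np.pi / (2 * w)))

def multispectral_attention(x, freqs, W1, W2):
    """x: (C, H, W) feature map; freqs: one (u, v) pair per channel group.
    Returns the channel-reweighted map, SE-style but with DCT compression."""
    C, H, W = x.shape
    g = C // len(freqs)                    # channels per frequency group
    desc = np.empty(C)
    for k, (u, v) in enumerate(freqs):
        B = dct_basis(H, W, u, v)
        # Frequency descriptor: elementwise product with the DCT basis, summed
        desc[k * g:(k + 1) * g] = (x[k * g:(k + 1) * g] * B).sum(axis=(1, 2))
    # SE-style excitation: FC -> ReLU -> FC -> sigmoid channel weights
    s = 1.0 / (1.0 + np.exp(-(W2 @ np.maximum(W1 @ desc, 0))))
    return x * s[:, None, None]

rng = np.random.default_rng(1)
C, H, W, r = 8, 9, 9, 2                    # r is the SE reduction ratio
x = rng.standard_normal((C, H, W))
y = multispectral_attention(x, [(0, 0), (0, 1)],
                            rng.standard_normal((C // r, C)),
                            rng.standard_normal((C, C // r)))
assert y.shape == x.shape
```

The 9 × 9 spatial size here mirrors the electrode matrix used in this paper; the channel count, reduction ratio, and frequency pairs are arbitrary placeholders.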

Experimental Comparison
In conclusion, the proposed approach is compared with models from other references in the literature, as shown in Tables 7 and 8. In binary classification, some studies, such as [10,11], applied only 1D convolution to extract temporal information from EEG data, whereas Xu [20] constructed 3D spatial-frequency DE features and selected a fully convolutional residual network as the recognition model, and Saha [27] used 3D convolutional kernels to extract and process the spatiotemporal features of EEG data. However, none of these networks fully considered the feature information of all three dimensions, and their average accuracies were lower than those of the combined models of [23,24,26], except where Xu [20] and Saha [27] used channel attention to augment model accuracy. These findings demonstrate that attention mechanisms and the consideration of multidimensional information are significant for improving the performance of recognition systems. In this article, we employed Bi-LSTM, built on LSTM, to learn both past and future emotional information from the sequence signals, and we used FcaNet to compensate for the shortcomings of SENet. Consequently, we achieved the best binary classification accuracy of 98.10%.
In the four-class task, prior works such as [4,5,6] utilized wavelets to extract time-frequency features for emotion recognition with SVM and mRMR algorithms. However, these methods did not incorporate the spatial information and dynamic temporal features of EEG. Mei [18] and Chao [36] represented features by constructing a connectivity matrix of brain structures and then extracting spatial features with a CNN. Similarly, PSO-BiLSTM [35] utilized DWT to decompose the signal, applied a third-order cumulant transformation to a high-dimensional space, and then reduced the dimension to eliminate redundancy; however, this approach also did not consider spatial feature learning.
In contrast, the four-dimensional data used in this paper contain more information than the two-and three-dimensional data used in prior works. Furthermore, the proposed approach in this article constructs a combined network that can adapt to multidimensional features and extract spatial-frequency and temporal features. It also employs the more advanced FcaNet and fully considers information from all frame moments. These advantages make the proposed method more effective compared to those in prior examples in the literature.

Conclusions
This study proposes a cascaded convolutional recurrent neural network based on multidimensional features for emotion recognition. To address the limitations of the previous literature, a 4D matrix is constructed to incorporate the emotional features of the signal in the temporal, frequency, and spatial dimensions. Additionally, a hybrid deep learning model is proposed to better fit the extracted feature matrix. The convolutional encoder is mainly used to extract spatial-frequency features from the 4D input data, and the residual network composed of DC and PC improves the real-time performance of the recognition system. FcaNet assigns more accurate weights to the different feature channels at a negligible computational cost, allowing useful feature information to be further highlighted. Finally, to emphasize the temporal significance of the frame windows in the sample, the weighted sum of the hidden-layer states of the Bi-LSTM over all frame moments is utilized as the input to the softmax layer. The experimental results demonstrate that the proposed method performs well compared with the rest of the literature, with an average accuracy of 97.84% in the binary classification experiments and 88.46% in the four-class experiments. In future work, we will explore more expressive feature extraction methods and apply a more streamlined network to the recognition task, making emotion recognition more rapid in the HCI domain.