ScalingNet: extracting features from raw EEG data for emotion recognition

Convolutional Neural Networks (CNNs) have achieved remarkable performance breakthroughs in a variety of tasks. Recently, CNN-based methods fed with hand-extracted EEG features have steadily achieved strong performance on the EEG-based emotion recognition task. In this paper, we propose a novel convolutional layer, which we refer to as the scaling layer, that adaptively extracts effective data-driven spectrogram-like features from raw EEG signals. It leverages convolutional kernels scaled from one data-driven pattern to expose a frequency-like dimension, addressing the shortcomings of prior methods that require hand-extracted features or their approximations. The proposed neural network architecture based on the scaling layer, referred to as ScalingNet, achieves state-of-the-art results on the established DEAP benchmark dataset.


Introduction
Emotion recognition plays a very important role in human-computer interaction [1]. By recognizing human emotions more accurately and quickly, we can promote a smarter life [2]. Generally, expressive modalities such as facial expressions, audio-visual expressions, and body language are used to judge human emotions [3]. In recent years, more and more studies have recognized human emotions from physiological electrical signals [4] [5], such as electrocardiogram (ECG), electromyography (EMG) and electroencephalography (EEG). Among them, EEG signals can better reflect real human emotions because they are not affected by subjective factors [6]. In this work, we use EEG signals to recognize human emotions.
It has been shown that there are close correlations between human emotions and different brain states [7] [8]. With the progress in EEG hardware, it is now more convenient to collect EEG signals at a high sampling rate [9]. Meanwhile, methods for processing and analyzing EEG signals are constantly being explored [10]. In EEG-based emotion recognition, researchers mainly focus on three technical approaches. The first and most widespread approach relies on feature engineering and machine learning algorithms to recognize human emotions [11], which requires hand-extracting emotion-related features from EEG signals, such as Power Spectral Density (PSD) and Differential Entropy (DE). With the advent of deep learning, some methods combine feature engineering with deep neural networks, replacing the machine learning classifier with a deep neural network such as a CNN [12]. Furthermore, some researchers extract data-driven features from EEG signals, employing parameterizable data representation methods or neural networks [13] as the feature extractor. While the feature extraction methods mentioned above have achieved remarkable performance in EEG-based emotion recognition, there is still potential for improvement. Hand-extracted features are mostly task-specific and mostly require strong hypotheses and mathematically driven theoretical support. In practice, hand-extracting features is not easy and is potentially not robust.
arXiv:2105.13987v1 [eess.SP] 7 Feb 2021
Motivated by the shortcomings of methods based on hand-extracted features, we introduce an end-to-end artificial neural network, which we refer to as ScalingNet, built mainly from our data-driven signal feature extraction layer. It robustly performs emotion recognition on raw EEG data without requiring any hand-extracted features. The idea of the layer, which we refer to as the scaling layer, is to dynamically generate a series of convolution kernels scaled from one data-driven pattern in order to produce a robust data-driven spectrogram-like feature map from raw EEG signals for downstream tasks. The introduced architecture has several interesting properties: (1) It automatically extracts robust feature maps from raw EEG signals without any manual intervention. (2) It handles EEG signals of any length without requiring data alignment. (3) It is fully convolutional. (4) It is compatible with existing neural networks, providing robust feature extraction for different downstream tasks. We validate the proposed approach on the challenging DEAP benchmark dataset, achieving a state-of-the-art result that highlights the potential of models for data-driven feature extraction from raw EEG signals.

Related Work
In EEG-based emotion recognition, machine learning methods fed with hand-extracted EEG features are possibly the most widely used framework. With the development of deep learning, researchers have gradually replaced machine learning methods with deep neural networks, especially CNNs [14]. The hand-extracted EEG features are mainly time-domain, frequency-domain, time-frequency-domain and spatial signal features. The classification methods mainly include random forests, SVMs, CNNs, LSTMs, etc. Zheng et al. [15] [17] proposed to perform the Continuous Wavelet Transform (CWT) on the EEG signal of each channel, convert the result to scalograms, and then feed the constructed frames into CNNs and Long Short-Term Memory (LSTM) networks for emotion recognition. Kim et al. [18] proposed to extract brain asymmetry features and heart rate features separately, and used ConvLSTM (a combination of CNN and LSTM) for classification.
Inspired by the powerful feature transformation ability of deep neural networks, some researchers have committed to designing end-to-end frameworks for EEG-based emotion recognition. Wang et al. [19] proposed EmotionNet, a network for EEG-based emotion classification that takes EEG as input and uses 3-D convolution to extract spatial and temporal features for emotion recognition. However, it is hard for general-purpose network layers to learn and extract robust features from signals. In the long run, this research field still has great potential for development. Consequently, there is a need for a neural network layer specially designed for robust feature extraction from raw EEG signals, and for a neat neural network architecture that can naturally perform inference on raw EEG signals.

Methodology
In this section, we first present the building block layer used to adaptively extract effective data-driven spectrogram-like features from raw EEG signals, which we refer to as the scaling layer. We then introduce a fully convolutional neural network built on the scaling layer, which we refer to as ScalingNet since its core feature is the application of the scaling layer.

Scaling layer
The motivation is to dynamically generate a series of convolutional kernels by scaling one data-driven pattern to different periods in order to expose a frequency-like dimension from signals. This makes it possible to automatically and adaptively extract effective and robust data-driven spectrogram-like features from raw EEG signals for downstream tasks.
We consider a multi-kernel convolutional layer that takes a one-dimensional signal of shape (sampling points, 1) as input and produces a two-dimensional spectrogram-like feature map of shape (sampling points, scaling levels) as output, with the following layer-wise propagation rule:

$$H_{\mathrm{output}}(l) = \delta\big(\mathrm{downSample}(\mathrm{weight}, l) \otimes H_{\mathrm{input}} + \mathrm{bias}(l)\big) \qquad (1)$$

where H_input is the input vector of shape (time steps, 1), i.e. the one-dimensional signal; H_output is the matrix of activations of shape (time steps, scaling levels), i.e. the data-driven spectrogram-like feature map; bias holds the biases for the kernels generated by scaling the basic kernel; δ(·) denotes an activation function; weight is the basic kernel from which the other kernels are scaled; and l is a hyper-parameter that controls the scaling level.
⊗ is a valid cross-correlation operator, normally defined as:

$$(f \otimes g)[n] = \sum_{m} f[m]\, g[n+m] \qquad (2)$$

where, in this layer, f is the scaled kernel and g is the padded input signal. downSample is a pooling operator that downsamples weight with an average filter of window size 2. To ensure that the length of the downsampled weight remains odd, downSample applies a padding of size 1 to the filter whenever the directly downsampled weight would have even length.
Further, bias(l) is the bias for the kernel generated at the l-th scaling level, and H_output(l) is the activation at the l-th scaling level. downSample(weight, l) denotes the kernel generated from weight at the l-th level, obtained by recursively filtering weight l times.
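The recursive kernel scaling described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the text fixes the window size (2) and the padding size (1), while the stride of 2, the use of zero padding, and counting the padded zero in the average are assumptions made here. The function names `down_sample_once` and `down_sample` are hypothetical.

```python
import numpy as np

def down_sample_once(weight):
    """One averaging step with window 2 (stride 2 is an assumption)."""
    weight = np.asarray(weight, dtype=float)
    # Without padding the result has len(weight) // 2 entries. If that
    # would be even, pad by 1 (zero padding, assumed) so it stays odd.
    pad = 1 if (len(weight) // 2) % 2 == 0 else 0
    padded = np.pad(weight, pad)
    pairs = len(padded) // 2
    # Average non-overlapping pairs; a trailing leftover element is dropped.
    return padded[: 2 * pairs].reshape(pairs, 2).mean(axis=1)

def down_sample(weight, level):
    """Kernel at scaling level `level`: filter the basic kernel `level` times."""
    kernel = np.asarray(weight, dtype=float)
    for _ in range(level):
        kernel = down_sample_once(kernel)
    return kernel

weight = np.arange(1.0, 34.0)  # a stand-in basic kernel of length 33
lengths = [len(down_sample(weight, level)) for level in range(6)]
print(lengths)  # [33, 17, 9, 5, 3, 1]
```

Under these assumptions every scaled kernel keeps an odd length, which is what allows the symmetric padding of the input signal at each level.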
Step by step, assume we want to extract features from the signal H_input at the l-th scaling level. We first generate the l-th level kernel from weight by downSample(weight, l). Then, we compute the cross-correlation of the scaled kernel and H_input by Equation 2. Finally, we add bias(l) to the result and feed the sum to the activation function δ(·), i.e. Equation 1.
We repeat the above process once per scaling level, with the hyper-parameter l ranging from 0 to the maximum scaling level. Finally, we stack all extracted feature vectors into a 2D tensor to obtain the data-driven spectrogram-like feature map. In particular, to ensure the alignment of the extracted feature vectors, the length of the basic kernel weight must be odd and the input signal H_input must be padded with (scaledKernelLength − 1)/2 at each end. For backpropagation, the trainable parameters are the basic kernel weight and the biases bias, which are handled by the autograd mechanism.
The core principle of the scaling layer is illustrated in Figure 1.
Figure 1: The core principle of the scaling layer. The scaling layer directly extracts data-driven spectrogram-like feature maps from raw EEG signals for downstream tasks. It extracts features with multiple kernels generated by scaling one data-driven pattern.
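The forward pass described in this subsection can be sketched as follows. It is a NumPy illustration under assumptions rather than the paper's PyTorch code: the pooling stride of 2, zero padding of both kernel and signal, and tanh as the activation δ are all choices made here for the example, and `scaling_layer` is a hypothetical name. Training is out of scope, so weight and bias are plain arrays.

```python
import numpy as np

def down_sample(weight, level):
    """Average-pool `weight` `level` times (window 2, assumed stride 2)."""
    kernel = np.asarray(weight, dtype=float)
    for _ in range(level):
        # Pad by 1 when the unpadded result would have even length.
        pad = 1 if (len(kernel) // 2) % 2 == 0 else 0
        padded = np.pad(kernel, pad)
        pairs = len(padded) // 2
        kernel = padded[: 2 * pairs].reshape(pairs, 2).mean(axis=1)
    return kernel

def scaling_layer(signal, weight, bias, activation=np.tanh):
    """Map a 1-D signal to a (time steps, scaling levels) feature map."""
    signal = np.asarray(signal, dtype=float)
    columns = []
    for level, level_bias in enumerate(bias):
        kernel = down_sample(weight, level)
        half = (len(kernel) - 1) // 2
        padded = np.pad(signal, half)  # (kernel length - 1) / 2 at each end
        response = np.correlate(padded, kernel, mode="valid")
        columns.append(activation(response + level_bias))
    # One column per scaling level, all with the input's length.
    return np.stack(columns, axis=1)

rng = np.random.default_rng(0)
weight = rng.standard_normal(33)  # basic kernel, odd length
bias = np.zeros(6)                # one bias per level, levels 0..5
for length in (256, 1000):
    feature_map = scaling_layer(rng.standard_normal(length), weight, bias)
    print(length, feature_map.shape)
```

Because each level pads the signal by half of its own kernel length, every column has exactly as many entries as the input, so signals of different lengths need no alignment before stacking.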

ScalingNet
In this subsection, we introduce ScalingNet, a neural network architecture mainly constructed from a series of parallel scaling layers, to perform raw EEG data based emotion recognition.
Figure 2: The ScalingNet architecture. It is mainly constructed from a series of parallel scaling layers followed by neat convolutional and linear layers. With the help of the data-driven spectrogram-like feature maps extracted by the scaling layers, it performs raw EEG data based emotion recognition without any hand-extracted features.
The ScalingNet architecture is illustrated in Figure 2. Because the scaling layers extract data-driven spectrogram-like feature maps for each EEG channel separately, the figure depicts the EEG channels by stacking the feature maps extracted from the different channels into a 3D tensor.
The EEG signals of different channels are first fed to scaling layers separately to extract data-driven spectrogram-like feature maps. Then, the feature maps extracted by the scaling layers are stacked into a 3D tensor along the EEG channel dimension. The 3D tensor is then fed into several convolutional layers that transform the feature maps. Finally, the transformed feature maps are fed into a global average pooling layer and a linear layer to perform emotion classification. Notably, the ScalingNet architecture robustly performs raw EEG data based emotion recognition without requiring any hand-extracted features.
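The shapes flowing through this pipeline can be traced with a small NumPy sketch. It is deliberately simplified and not the authors' network: the convolutional layers are omitted, so the stacked tensor stands in directly for their output, random arrays stand in for the scaling-layer feature maps, and the 32 channels, 6 scaling levels and 2 classes are illustrative assumptions. `classify_head` is a hypothetical name.

```python
import numpy as np

def classify_head(feature_maps, linear_weight, linear_bias):
    """Global average pooling followed by a linear layer.

    feature_maps: (filters, time steps, scaling levels)
    """
    pooled = feature_maps.mean(axis=(1, 2))  # -> (filters,)
    return linear_weight @ pooled + linear_bias

rng = np.random.default_rng(0)
n_channels, n_levels, n_classes = 32, 6, 2  # illustrative values
linear_weight = rng.standard_normal((n_classes, n_channels))
linear_bias = np.zeros(n_classes)

for time_steps in (512, 8064):
    # Stand-ins for the per-channel outputs of the scaling layers.
    per_channel = [rng.standard_normal((time_steps, n_levels))
                   for _ in range(n_channels)]
    stacked = np.stack(per_channel, axis=0)  # (channels, time, levels)
    logits = classify_head(stacked, linear_weight, linear_bias)
    print(stacked.shape, logits.shape)
```

The point of the sketch is that global average pooling removes the time axis, so the linear layer sees a fixed-size vector regardless of how long the input signal is.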

Experiments & Results
We evaluate the performance of the proposed ScalingNet architecture on the EEG-based emotion recognition task using the established and challenging DEAP dataset [20], and compare it with strong benchmarks and previous state-of-the-art methods. In this section, we first introduce the DEAP dataset, then give a detailed description of the experimental setup, and finally report the experimental results.

Datasets
DEAP [21] is an established and challenging benchmark dataset for EEG-based emotion recognition. The dataset contains EEG and physiological signals collected from 32 subjects while they watched music videos. Immediately after watching each video, the subjects rated their valence, arousal, dominance, and liking on a scale of 1 to 9. Each subject watched 40 videos, and 63 seconds of signals were collected for each video. By default, the signals are downsampled to 128Hz and filtered with a 4.0Hz to 45.0Hz bandpass filter. In this paper, only the EEG signals are used to classify valence, arousal, and dominance with a rating threshold of 5, which closely follows the setting of [22]. Specifically, 1280 EEG samples from 32 subjects are used for three binary classification tasks of cross-subject emotion recognition.
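The figures in this description can be checked with a few lines of arithmetic. The sketch below also shows one way to turn ratings into binary labels; whether a rating of exactly 5 counts as high or low is not stated in the text, so the strict `>` comparison is an assumption, and `binarize` is a hypothetical helper name.

```python
SUBJECTS, VIDEOS = 32, 40
SECONDS, SAMPLE_RATE_HZ = 63, 128

n_samples = SUBJECTS * VIDEOS                # one EEG sample per subject and video
points_per_trial = SECONDS * SAMPLE_RATE_HZ  # sampling points per EEG channel

def binarize(ratings, threshold=5.0):
    """Label a rating 1 (high) if above the threshold, else 0 (assumed rule)."""
    return [int(rating > threshold) for rating in ratings]

print(n_samples, points_per_trial)          # 1280 8064
print(binarize([1.0, 4.9, 5.0, 5.1, 9.0]))  # [0, 0, 0, 1, 1]
```

The same thresholding would be applied independently to the valence, arousal and dominance ratings, giving the three binary tasks.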

Experimental setup
A five-fold cross-validation strategy is employed to objectively evaluate the raw EEG data based emotion recognition performance of the proposed ScalingNet architecture. We manually optimize the hyper-parameters of the proposed ScalingNet architecture on the DEAP dataset, and the most relevant tuned hyper-parameters are reported in Table 1.
In Table 1, the "length of weight" is the size of the basic kernel weight of the scaling layer in Equation 1. The "kernel size" is the size of the convolutional kernels used in the feature map transform convolutional layers of the proposed ScalingNet architecture illustrated in Figure 2. The "number of filter" is the number of filters used in those same convolutional layers. All experiments in this paper were conducted on a GeForce RTX 2080 Ti. The machine learning framework used in this paper is PyTorch [23].
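As an illustration of the evaluation protocol, a five-fold split over the 1280 samples could look like the sketch below. The text does not say how the folds were formed, so the shuffled, non-stratified split over samples and the fixed seed are assumptions, and `five_fold_indices` is a hypothetical helper.

```python
import numpy as np

def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, n_folds)
    for k in range(n_folds):
        test_idx = folds[k]
        # All remaining folds form the training set for this round.
        train_idx = np.concatenate(
            [folds[j] for j in range(n_folds) if j != k]
        )
        yield train_idx, test_idx

for train_idx, test_idx in five_fold_indices(1280):
    print(len(train_idx), len(test_idx))  # 1024 256 for every fold
```

Each sample appears in exactly one test fold, so averaging the five test accuracies uses every sample once for evaluation.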

Results
The experimental results of the proposed ScalingNet architecture, compared with previous state-of-the-art methods using the DEAP dataset and the same evaluation strategy, are shown in Table 2. Closely following previous studies, the evaluation criteria are the emotion recognition accuracies for arousal, valence, and dominance. In Table 2, Chao et al. [24] extract MFM features and use CapsNet as a classifier for emotion recognition. Chen et al. [25] use H-AAT-BGRU to classify emotions. Li et al. [26] and Yang et al. [27] use SVM for classification after extracting DBN features and VAE features, respectively. Gupta [28] uses graph-theoretic features and RVM for classification.
The results in Table 2 show that the accuracy of the proposed method is 69.99%, 71.13%, and 70.78% for arousal, valence, and dominance, respectively, all of which are higher than those of the previous state-of-the-art studies. This indicates that the proposed ScalingNet architecture is effective and feasible for EEG-based emotion recognition. Notably, it achieves the state-of-the-art result without any manual feature extraction.

Discussion
In this section, we design a series of experiments to explore the properties of the scaling layer and ScalingNet, visualize the data-driven spectrogram-like feature maps extracted by the scaling layers to explore their interpretability, and verify the scaling layer's contribution through ablation experiments.
Since the scaling layer handles EEG signals of any length without requiring data alignment, we can arbitrarily adjust the length of the basic kernel weight to explore the relationship between model capacity and representational capacity. We explore this relationship by observing the emotion recognition performance of ScalingNet with different setups of the scaling layers. In the experiments, we deliberately select several representative lengths for the basic kernel weight in the scaling layers. The results are shown in Table 3.
We can observe that the representational capacity is best when the length of weight is set to 33. The value 33 is evidently specific to the DEAP dataset; here we are more interested in Table 3 itself. In addition, we visualize the data-driven spectrogram-like feature maps extracted by the scaling layers within the ScalingNet architecture on the DEAP dataset. The visualized feature maps are shown in Figure 3, where the horizontal axis denotes sampling points (time) and the vertical axis denotes the frequency-like dimension (scaling levels). We can observe that Figure 3 (a) contains more low frequency-like energy and (b) contains more high frequency-like energy. Both originate from the one data-driven pattern that is used to generate the scaled kernels that extract useful information. The useful information contained in the data-driven spectrogram-like feature maps is aggregated by the subsequent layers and used for downstream tasks.
Figure 3: (a) lower frequency-like energy; (b) higher frequency-like energy.
To verify the contribution of the proposed scaling layer, we also conduct ablation experiments. The results are shown in Table 4. In the ablation experiments, we compare the scaling layer with an ordinary convolutional layer to explore their feature extraction capability for EEG signals, by observing the emotion recognition performance when the scaling layer of ScalingNet is replaced by the convolutional layer. We can observe that the scaling layers play an important role in ScalingNet. The results also indicate that the scaling layer extracts more robust features from EEG signals, with better generalization performance.

Conclusion
We have presented the scaling layer and ScalingNet: a novel convolutional layer for extracting a spectrogram-like feature map from raw signals, and a neural network that operates on raw EEG data for classification, leveraging convolutional kernels dynamically generated by scaling one data-driven pattern. We demonstrate that it can automatically and adaptively extract robust data-driven spectrogram-like feature maps and can be successfully applied to raw EEG data based emotion recognition. It thus addresses many shortcomings of prior methods based on hand-extracted features with strong hypotheses or their approximations. Our ScalingNet model, built on scaling layers, achieves state-of-the-art performance on the well-established DEAP emotion recognition benchmark.