An improved self-supervised learning method for EEG classification

Abstract: Motor Imagery EEG (MI-EEG) classification plays an important role in different Brain-Computer Interface (BCI) systems. Recently, deep learning has been widely used in MI-EEG classification tasks; however, this technology requires a large number of labeled training samples, which are difficult to obtain, and insufficient labeled training samples will degrade classification performance. To address this degradation problem, we investigate a Self-Supervised Learning (SSL) based MI-EEG classification method that reduces the dependence on large numbers of labeled training samples. The proposed method includes a pretext task and a downstream classification task. In the pretext task, each MI-EEG is rearranged according to its temporal characteristics, and a network is pre-trained using the original and rearranged MI-EEGs. In the downstream task, an MI-EEG classification network is first initialized by the network learned in the pretext task and then trained using a small number of labeled training samples. A series of experiments is conducted on Data sets 1 and 2b of BCI competition IV and Data set IVa of BCI competition III. With one third of the labeled training samples, the proposed method obtains an obvious improvement over the baseline network without SSL. In experiments under different percentages of labeled training samples, the results show that the designed SSL strategy is effective and beneficial to improving classification performance.


Introduction
Electroencephalography (EEG), as a non-invasive and cost-effective approach, is commonly used in various fields such as rehabilitation [1] and disease diagnosis [2]. Motor Imagery EEG (MI-EEG) classification plays an important role in different Brain-Computer Interface (BCI) systems [3]. The goal of an MI-EEG based BCI system is to control different external devices [4]. Due to the non-stationarity, nonlinearity, and randomness of MI-EEGs [5], accurately classifying MI-EEGs is a crucial step in the MI-EEG based BCI system.
Up to now, many different MI-EEG classification methods have been proposed. Among traditional classification methods, Müller-Gerking et al. [6] proposed the Common Spatial Pattern (CSP) method to classify single-trial MI-EEGs. Huang et al. [7] employed the Surface Laplace Transform (SLT) and Power Spectral Density (PSD) to extract MI-EEG features. Chatterjee et al. [8] used wavelet energy, root mean square error, and average power for feature extraction, and a Support Vector Machine (SVM) to classify left- and right-hand MI-EEGs. Recently, deep learning has developed rapidly in the computer vision and signal processing fields [9,10] and provides an effective tool for the MI-EEG classification task [11,12]. Schirrmeister et al. [13] found that Batch Normalisation (BN) [14] and Exponential Linear Units (ELU) [15] could effectively improve the representation capability of a Convolutional Neural Network (CNN). In [16], Augmented CSP (ACSP) was first used to extract features and a CNN was then used to classify the MI-EEGs. Besides CNNs, a deep belief network was used for the MI-EEG classification task [17]. Li et al. [18] utilized spatial location and time-frequency information to build 2D images, which were then used as inputs to train a network. Lawhern et al. [19] proposed EEGNet to identify the paradigm of EEGs. The above-mentioned MI-EEG classification methods belong to supervised learning, whose performance relies on a large number of labeled training samples.
However, accurate MI-EEG labeling is time-consuming and expensive. For example, labeling EEG during sleep requires professional technicians to check the EEG for several hours and mark the 30-second windows one by one [20]. With a small number of labeled training samples, the performance of supervised learning may be poor and unstable. In the past years, Self-Supervised Learning (SSL) has been verified to be a novel and effective strategy to improve performance under a small number of labeled training samples [21]. SSL generally includes two parts: a pretext task and a downstream task. A network is learned by the pretext task and transferred to the downstream task. Pretext tasks are mainly divided into two categories: generation and contrast. Typical generation methods include colorization, auto-encoders, etc. In typical contrastive methods, training samples are generated according to some strategy and used to train a pretext network. In SSL, how to define the pretext task is the core issue.
Recently, SSL has been widely used in computer vision, natural language processing, and other fields. Noroozi et al. [22] utilized a puzzle approach to design the pretext task. Li et al. [23] proposed a Twin-Cycle Autoencoder (TCAE) to learn the pretext network from videos without manual annotation. Banville et al. [24] constructed a time-contrastive SSL method for classifying EEGs. In the MI-EEG classification field, few SSL-based studies have been conducted. Therefore, it is meaningful to design an SSL strategy to improve MI-EEG classification performance.
This paper aims to propose a novel SSL approach to alleviate the performance degradation caused by a small number of labeled training samples. In this paper, a Temporal-Rearrange based MI-EEG network (TRMINet) is designed. A pretext task based on the temporal characteristics of MI-EEGs is first built to learn a pretext network from all the MI-EEGs. The pretext network is then used to initialize the downstream MI-EEG classification network. Finally, the downstream network is trained using a small number of labeled training samples. The main contributions are outlined as follows:
• A pretext task is designed according to the temporal characteristics of MI-EEGs.
• The differences of the MI-EEG representation ability among different CNNs are investigated.
• The impact of the pretext task on the classification performance of the downstream task is analyzed.
The rest of the paper is organized as follows. Some of the networks used in this paper are described in Section 2. Section 3 describes the pretext and downstream tasks. The details of the datasets are described in Section 4. The experimental configurations and results are described in Section 5. The conclusion and future directions are given in Section 6.

Background
This section describes the networks used in the experiments. Deep learning has produced many classic networks, such as AlexNet [25], GoogLeNet [26], and ResNet [9]. Among these, AlexNet used ReLU [27] as the activation function and was trained in parallel on two GPUs; the ReLU function reduced the computational complexity of the network and the interdependence of its parameters. The inception architecture was proposed in GoogLeNet, which could further reduce the computational complexity of the network. Meanwhile, BN [14] was used in GoogLeNet to standardize the output mean and variance of each layer of the network; BN also alleviated the vanishing gradient problem. Residual connections were introduced in ResNet to deal with the training of deep network models. EfficientNet [10] was proposed by the Google team in 2019 and balances network width, depth, and resolution to improve performance; it can focus on more image details than other scaling methods. The MobileNet Convolution (MBConv) was used in the convolutional layers of EfficientNet, except for the first convolutional layer, which used a normal convolutional structure. MBConv includes a 1 × 1 pointwise convolution, a depthwise convolution [28], and a Squeeze-and-Excitation (SE) module [29]. The CNNs commonly used for EEG classification include DeepConvNet, ShallowConvNet [13], and EEGNet [19]. In particular, EEGNet attempts to classify EEGs from several paradigms using a single network; it uses depthwise and separable convolutions instead of traditional ones, and ELU [15] as the activation function. The architecture of EEGNet is shown in Figure 1 and the parameters of each layer of EEGNet are given in Table 1. Meanwhile, a glossary containing all initials and abbreviations is given in Table 2.
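The depthwise and separable convolutions that distinguish EEGNet can be sketched as a single PyTorch module. This is a simplified illustration, not the exact EEGNet block: the kernel size and channel counts below are placeholder assumptions.

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise + pointwise convolution in the style of EEGNet's
    separable-convolution block (a simplified sketch, not the full
    architecture; kernel sizes and channel counts are illustrative)."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        # Depthwise: one spatial filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   groups=in_ch, padding="same", bias=False)
        # Pointwise: 1x1 convolution that mixes the channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)   # BN as in EEGNet
        self.act = nn.ELU()                # EEGNet uses ELU activations

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```

Compared with a standard convolution, this factorization cuts the parameter count, which matters when training on small EEG datasets.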

Proposed methods
To address the performance degradation under a small number of labeled training samples, we design a self-supervised network framework, as shown in Figure 2. Firstly, MI-EEGs are pre-processed by denoising and filtering. In the pretext task, the Temporal Rearrange (TR) approach generates the rearranged MI-EEGs. The network of the pretext task is then trained to learn the MI-EEG representation using the original and rearranged MI-EEGs. The network of the downstream task is initialized by the pretext one and trained using the labeled MI-EEGs.

Pretext task
The framework of the pretext task is shown in Figure 3. To generate labeled samples from multichannel MI-EEGs, we propose a Temporal Rearrange (TR) method. A fixed-length sliding window H is used to divide each original MI-EEG into multiple time-window blocks. A new MI-EEG is then obtained by rearranging these time-window blocks. The label of each original MI-EEG is set to 1, and that of each rearranged MI-EEG is set to −1. The samples and labels are represented by x_i and y_i, respectively, and the number of generated samples is n; together they constitute the training set of the pretext task. In this paper, EEGNet is used as the baseline network. The prediction function is denoted as Y = w^T F_Θ(Z) + w_0, where F_Θ denotes the convolution function in EEGNet, Θ denotes the parameters of the convolution function, and w and w_0 denote the weight and bias of the fully connected layer, respectively. The weight parameters are denoted as W = [Θ, w, w_0]. Cross entropy is used as the loss function in the pretext task, which is calculated by Eq (1).
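The TR sample generation can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the paper gives no implementation details, so the function names, the uniform random shuffling of blocks, and the handling of the window length H are all illustrative choices.

```python
import numpy as np

def temporal_rearrange(x, H, rng=None):
    """Generate a temporally rearranged copy of one MI-EEG trial.

    x : array of shape (C, T) -- C channels, T sampling points.
    H : length of the sliding window (sampling points per block).
    The original trial keeps label 1; the rearranged copy gets -1.
    Note: a random permutation may occasionally equal the identity;
    handling that case is left out of this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    C, T = x.shape
    n_blocks = T // H
    # Split the trial into fixed-length time-window blocks.
    blocks = [x[:, i * H:(i + 1) * H] for i in range(n_blocks)]
    # Shuffle the block order to break the temporal structure.
    order = rng.permutation(n_blocks)
    return np.concatenate([blocks[i] for i in order], axis=1)

def build_pretext_set(trials, H, rng=None):
    """Build the pretext training set from unlabeled trials."""
    xs, ys = [], []
    for x in trials:
        xs.append(x)
        ys.append(1)    # original MI-EEG
        xs.append(temporal_rearrange(x, H, rng))
        ys.append(-1)   # rearranged MI-EEG
    return np.stack(xs), np.array(ys)
```

The pretext network is then trained to tell the two classes apart, which forces it to model the temporal structure of MI-EEGs.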
The optimal network parameters W* = [Θ*, w*, w_0*] can be obtained by training the pretext network.
After fine-tuning, the optimal downstream network parameters W_d* can be obtained, and the MI-EEGs can finally be predicted using W_d*.

Datasets and preprocessing
This section describes the data sets used. Data set 1 of BCI competition IV contains 7 subjects. Each subject chooses two classes from left hand, right hand, and foot for the MI-EEGs. In Data set 1, an MI-EEG trial contains 8 s of data: a cross mark is displayed on the screen from 0 to 2 s; from 2 to 6 s, an arrow indicator appears in the center of the screen, superimposed on the cross mark, and the subject performs motor imagery according to the direction of the arrow; the screen is finally blank from 6 to 8 s, indicating the end of the trial. The data are collected from 59 EEG channels, and the sampling frequency is set to 100 Hz; therefore, one MI-EEG trial has 800 sampling points. Event-Related Desynchronization (ERD) and Event-Related Synchronization (ERS) are most evident in EEG channels C3, C4, and Cz [30]; thus, these three channels are selected in the experiments. The positions of the electrode caps [31] are illustrated in Figure 5. Since electrooculogram (EOG) artifacts and noise can be present in the MI-EEGs, the MI-EEGs are first preprocessed in the experiments. Generally speaking, the 8-30 Hz frequency band reflects the characteristics of the MI-EEGs [32]. Therefore, a third-order Butterworth band-pass filter at 8-30 Hz is built to eliminate the effects of baseline drift and noise. Then, Independent Component Analysis (ICA) is used to reduce the influence of EOG.
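The 8-30 Hz third-order Butterworth band-pass filter can be sketched with SciPy as below. Zero-phase filtering via `filtfilt` is an assumption here; the paper does not state the filtering direction.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass_8_30(x, fs=100.0, order=3):
    """Third-order Butterworth band-pass filter at 8-30 Hz.

    x  : MI-EEG array of shape (C, T).
    fs : sampling frequency in Hz (100 Hz for Data set 1).
    filtfilt applies the filter forward and backward, so the
    output is not shifted in time (an implementation assumption).
    """
    nyq = fs / 2.0
    b, a = butter(order, [8.0 / nyq, 30.0 / nyq], btype="band")
    return filtfilt(b, a, x, axis=-1)
```

A band-pass at 8-30 Hz suppresses both the slow baseline drift below 8 Hz and high-frequency noise above 30 Hz, keeping the mu and beta rhythms relevant to motor imagery.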

Experiment configuration
All the networks are implemented using PyTorch on an NVIDIA RTX 3090 GPU and an Intel(R) Core(TM) i9-10940X CPU. The convolutional function in the CNN is initialized by the Kaiming normal function, and the fully connected layer is initialized by a normal distribution N(0, 0.1). Adam is used for optimization. In the pretext task, the epoch, batch size, and learning rate are set to 200, 8, and 0.0001, respectively, and there is no validation or test set; the loss value is used to measure the performance of the trained pretext model. In the downstream task, the epoch, batch size, and learning rate are set to 100, 50, and 0.001, respectively. The dataset of each subject is randomly divided 3:1 to constitute the training and test sets, and this process is repeated 10 times for each subject. No validation set is used because the number of samples for each subject is small.
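The initialization and optimizer settings described above can be sketched in PyTorch as follows; the function names are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

def init_weights(m):
    """Initialization used in the experiments: Kaiming normal for
    convolutional layers, N(0, 0.1) for the fully connected layer."""
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight)
    elif isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=0.1)

def make_optimizer(model, task="pretext"):
    """Adam with the stated learning rates: 1e-4 for the pretext task
    (200 epochs, batch size 8), 1e-3 for the downstream task
    (100 epochs, batch size 50)."""
    lr = 1e-4 if task == "pretext" else 1e-3
    return torch.optim.Adam(model.parameters(), lr=lr)
```

A model would be set up as `model.apply(init_weights)` followed by `make_optimizer(model, "downstream")` before fine-tuning.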

Evaluation metrics
Accuracy (ACC) and Area Under the Curve (AUC) are used to measure the performance of the networks. ACC reflects the classification ability of the networks, while AUC is an aggregate measure of performance across all possible classification thresholds. In addition, Confidence Intervals (CI) at the 95% confidence level are used in the ablation study to quantify the uncertainty of the accuracy estimates.
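ACC and a 95% CI can be computed as below. The normal-approximation CI is an assumption on our part; the paper does not state which CI construction it uses.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of correctly classified trials (ACC)."""
    correct = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def acc_confidence_interval(acc, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy
    estimated from n test trials; z=1.96 gives the 95% level.
    Clipped to [0, 1] since accuracy cannot leave that range."""
    half = z * math.sqrt(acc * (1.0 - acc) / n)
    return max(0.0, acc - half), min(1.0, acc + half)
```

For AUC, an off-the-shelf implementation such as scikit-learn's `roc_auc_score` is the usual choice.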

Comparison of different models
In this section, to determine a suitable network for the MI-EEG task, experiments with six popular networks are conducted on all subjects: ResNet [9], MobileNet [28], EfficientNet [10], DeepConvNet, ShallowConvNet [13], and EEGNet [19]. The experimental results are shown in Table 3.
It can be seen from Table 3 that EEGNet performs better than the other networks on most subjects, and ShallowConvNet yields comparable performance to EEGNet on most subjects. This indicates that EEGNet and ShallowConvNet learn better MI-EEG representations than the other four networks. Therefore, EEGNet and ShallowConvNet are chosen as the base classifiers.

Ablation analysis
An ablation experiment is conducted to verify the effectiveness of the proposed pretext network under a small number of labeled training samples. The base classifiers and one third of the training set are used in the experiments. The MI-EEG classification method without SSL is denoted as Base, which uses a random strategy to initialize the network weights. The proposed SSL method is denoted as TRMINet, which uses the pretext task: the pretext network initializes the weights of the downstream classification network. The effectiveness of the pretext network is evaluated according to the performance of the downstream classification network. To visualize the accuracies of the Base and TRMINet methods, bar charts are plotted in Figures 6 and 7.
As shown in Figures 6 and 7, the accuracies of the TRMINet method are higher than those of the Base method on most subjects. This shows that the pretext task can learn a good representation, which has a positive impact on network training under a small number of labeled training samples.
AUC is used to further compare the TRMINet and Base methods; the results are shown in Table 4. As seen from Table 4, the AUC of the TRMINet method is comparable to, if not higher than, that of the Base method on most subjects. This shows that SSL under a small number of labeled training samples can improve the classification performance of the downstream network, and that the proposed pretext task can effectively learn MI-EEG representations. The representation ability of the pretext task is evaluated by the classification results of the downstream classification task. Figures 8 and 9 show the overall accuracy trends of the nine subjects as the percentage of labeled training samples increases. The accuracies of the TRMINet method are higher, and increase more obviously, than those of the Base method as the labeled training samples increase on most subjects. This shows that the proposed pretext task can effectively learn MI-EEG representations, and that SSL can effectively improve classification performance under a small number of labeled training samples.

Conclusions
In this study, we propose an SSL method, called Temporal Rearrange, which exploits the temporal characteristics of MI-EEGs to learn a pretext network by distinguishing the original from the rearranged MI-EEGs. The network learned by the pretext task is used to initialize that of the downstream task. The experimental results show that the proposed method can improve classification performance under a small number of labeled training samples, indicating that it can alleviate the performance degradation caused by insufficient labeled training samples. Therefore, it is possible to improve the classification accuracy of the MI-based BCI system and reduce the burden of labeling MI-EEGs. In the future, transfer learning and subject-wise strategies will be employed to further improve the classification performance of the network across multiple subjects.
{(x_i, y_i)}_{i=1}^{n} is used to denote the MI-EEG dataset, where x_i denotes each MI-EEG and y_i ∈ {−1, 1} denotes the label of x_i. C denotes the number of MI-EEG channels and T denotes the number of sampling points of the MI-EEGs.

Figure 2. The proposed SSL classification framework.

Downstream task
The downstream task aims to classify MI-EEGs by training an optimal network. The network of the downstream task is the same as that of the pretext task. The convolutional function in the network is denoted as F_{Θ_d}, and the parameters of the convolutional function are denoted as Θ_d. The weight and bias of the fully connected layer are denoted as w_d and w_0^d, respectively. The weight parameters are denoted as W_d = [Θ_d, w_d, w_0^d]. The downstream network is initialized by the pretext one, i.e., W_d = [Θ*, w*, w_0*]. Then, the downstream network is fine-tuned using the labeled MI-EEGs. Cross entropy is used as the loss function in the downstream task, which is calculated by Eq (2).
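The initialization of the downstream network with the pretext parameters W* = [Θ*, w*, w_0*] can be sketched in PyTorch; the checkpoint path and function name are illustrative. It relies on the two tasks sharing one architecture, as stated above.

```python
import torch

def transfer_pretext_weights(downstream_net, pretext_ckpt_path):
    """Initialize the downstream network with the parameters learned
    in the pretext task, i.e., W_d = [Theta*, w*, w0*].

    Because the downstream network has the same architecture as the
    pretext one, the full state dict transfers directly; the network
    is then fine-tuned on the labeled MI-EEGs.
    """
    state = torch.load(pretext_ckpt_path, map_location="cpu")
    downstream_net.load_state_dict(state)
    return downstream_net
```

After the transfer, fine-tuning proceeds as ordinary supervised training with the cross-entropy loss of Eq (2).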

Figure 5. The positions of the electrode caps.

Figure 6. Test accuracy using one third of the labeled training samples of Data set 1 of BCI competition IV and Data set IVa of BCI competition III.

Figure 7. Test accuracy using one third of the labeled training samples of Data set 2b of BCI competition IV.

Figure 8. Test accuracy using different percentages of labeled training samples of Data set 1 of BCI competition IV and Data set IVa of BCI competition III.
Figure 9.

Table 2. Glossary of terms and abbreviations.

Table 3. Results of the six models.

Table 4. Results of ablation studies.