Global Adaptive Transformer for Cross-Subject Enhanced EEG Classification

Due to individual differences, EEG signals from other subjects (sources) can hardly be used to decode the mental intentions of the target subject. Although transfer learning methods have shown promising results, they still suffer from poor feature representation or neglect long-range dependencies. In light of these limitations, we propose the Global Adaptive Transformer (GAT), a domain adaptation method that utilizes source data for cross-subject enhancement. Our method first uses parallel convolution to capture temporal and spatial features. Then, we employ a novel attention-based adaptor that implicitly transfers source features to the target domain, emphasizing the global correlation of EEG features. We also use a discriminator to explicitly drive the reduction of the marginal distribution discrepancy by learning against the feature extractor and the adaptor. Besides, an adaptive center loss is designed to align the conditional distribution. With the aligned source and target features, a classifier can be optimized to decode EEG signals. Experiments on two widely used EEG datasets demonstrate that our method outperforms state-of-the-art methods, primarily due to the effectiveness of the adaptor. These results indicate that GAT has good potential to enhance the practicality of BCI.


I. INTRODUCTION
Brain-computer interface (BCI) is a cutting-edge technology that bridges humans and external devices by allowing users to directly control machines with their intentions, which are decoded from brain signals [1]. Among the many techniques for detecting brain signals, electroencephalography (EEG), a kind of noninvasive physiological signal, has been widely used to record the voltage at multiple electrodes on the scalp through a wearable electrode cap. Due to its reliability and convenience, EEG-based BCI has broad prospects in many application fields in daily life, ranging from specific scenarios such as functional rehabilitation for patients with movement disorders [2], sleep stage classification [3], and emotion regulation to general intelligent applications like brain-controlled systems [4].
Numerous machine learning methods have been intensively investigated for EEG classification, which plays a crucial role in the performance of EEG-based BCI. However, the significant individual differences in EEG make it very challenging to learn a general model applicable across subjects [5]. Conventional methods have to collect a large number of labeled samples over a few training sessions and then use these data to train a new model for each subject, since directly using data from existing subjects yields no significant gain or even deteriorates classification performance. Unfortunately, this demand is time-consuming and may lead to poor user experience. Therefore, transfer learning has been developed to leverage information from existing subjects and thus make up for the limitation of training data [6], [7], [8]. As a popular branch of transfer learning, domain adaptation approaches refer to the EEG samples from the new subject as the target domain and those from existing subjects as the source domain [9]. They aim to align the feature distribution of the source domain with that of the target domain, thereby expanding the useful information in the target domain. For example, some methods use statistical metrics to narrow the difference between the two domains, such as maximum mean discrepancy (MMD) and correlation alignment [10], [11]. Inspired by the generative adversarial network (GAN), approaches based on adversarial learning employ a discriminator to automatically reduce the distribution discrepancy and learn domain-invariant features [12], [13]. These domain-adaptive methods align distributions by constraining the similarity of features or statistical metrics but ignore the global dependencies within the EEG context of different domains, leading to inefficient alignment.
During the process of distribution alignment, feature representation is of particular importance. Recently, neural networks with deep architectures, especially convolutional neural networks (CNN), have demonstrated strong capabilities in EEG feature learning [14]. For example, the deep ConvNet in [15] employs a combination of temporal and spatial convolution filters as the first convolution-pooling block and introduces three more convolution-pooling blocks to further reduce the feature dimension before feeding the output to the classification layer. Similarly, EEGNet [16] starts with a temporal convolution to learn frequency filters, then uses a depthwise convolution to extract frequency-specific spatial features, followed by a separable convolution to learn further temporal representations. C2CM [17] also breaks the two-dimensional convolution into two separate one-dimensional layers and applies spatial filters to temporally filtered features. However, these methods usually learn temporal filtering first and spatial filtering afterwards, which may fail to preserve the temporal and spatial structures simultaneously.
To tackle the above issues, we propose a novel domain adaptation method called Global Adaptive Transformer (GAT) for cross-subject enhanced EEG classification. With GAT, the feature representation of existing subjects is well aligned to improve the classification performance for the target subject. GAT comprises four components: a feature extractor, a global adaptor, a discriminator, and a classifier. We first design a parallel temporal and spatial convolution module to simultaneously preserve the spatio-temporal structure information. The global adaptor, a Transformer-based module, measures the non-local correlation of the target domain and uses it to implicitly map source features to the target domain. The discriminator distinguishes whether aligned features come from the source or the target domain to drive marginal distribution alignment during training. We also design a loss function to reduce the conditional distribution discrepancy. Finally, we adopt a simple classifier optimized with target and aligned source features for classification.
To sum up, our contributions are fourfold:

• We propose a domain adaptation framework that uses attention and adversarial learning to address the issues of global dependencies and marginal distribution alignment.
• We introduce an adaptive center loss to leverage the label information and further reduce the conditional distribution discrepancy.
• We investigate a parallel convolution module to learn robust representation, which preserves the spatial-temporal structure information.
• We conduct extensive experiments on two real EEG datasets to demonstrate the capability of our method to utilize the source domain for enhancement.

The remainder of this paper is organized as follows. Related works are presented in Section II. The proposed method is illustrated in Section III. The details of implementation, experiments, and results are shown in Section IV. A careful discussion is given in Section V. Finally, we come to a conclusion in Section VI.

II. RELATED WORKS

A. Machine Learning for EEG Classification
A wide variety of machine learning methods are devoted to improving the performance of EEG decoding via feature extraction and classification. Traditional methods usually rely on prior knowledge to extract hand-crafted feature representations, such as the fast Fourier transform (FFT), the continuous wavelet transform (CWT) [18], and the common spatial pattern (CSP) [19]. In addition, Ang et al. [19] presented filter bank CSP (FBCSP) by employing the spatial information of multiple frequency bands to enhance category discrimination. The extracted features are subsequently fed into linear or nonlinear classifiers, such as linear discriminant analysis (LDA), support vector machine (SVM), and the multilayer perceptron (MLP) [20], [21]. More recently, deep learning methods have become increasingly popular in EEG decoding. Sakhavi et al. [17] proposed a CNN to learn the temporal features of EEG for classification. Schirrmeister et al. [15] designed deep and shallow ConvNets to decode task-related information from EEG without hand-crafted features, achieving performance competitive with FBCSP. However, these methods usually perform poorly in cross-subject cases due to the large individual differences in EEG signals.

B. Transfer Learning and Domain Adaptation
Transfer learning is an important technique for dealing with inter-subject variability in EEG [22]. It allows knowledge learned from existing subjects to be transferred to reduce calibration time and improve the target subject's performance [23]. To address individual differences, some works learn features or data structures that are invariant across subjects. For example, Liu et al. [7] performed transfer learning by aligning the spatial pattern and covariance. Zanini et al. [24] presented a Riemannian geometry framework to center the covariance matrices of every domain with respect to a reference covariance matrix. Zhang et al. [25] pre-trained a deep network with the data of all available subjects excluding the target subject and fine-tuned the fully-connected layer with the data of the target subject. Zhang et al. [26] compared the effects of different transfer schemes in detail. Besides, domain adaptation offers a more interpretable way to align the distributions of the source and target domains. Zhao et al. [12] matched the deep representations obtained by a CNN with the help of adversarial learning. Hong et al. [27] proposed a dynamic joint domain adaptation framework and reduced the marginal and conditional distribution discrepancies with multiple discriminators.

C. Attention Mechanism and Transformer
Recently, the Transformer based on the attention mechanism has aroused great interest in the fields of computer vision and natural language processing [28]. The attention mechanism enables models to capture long-term dependencies with no limitation on sequence length due to its non-local characteristic [29]. Liu et al. [30] proposed a hierarchical Transformer with shifted windows to serve as a general-purpose backbone for computer vision, which achieved satisfactory performance on the tasks of image classification, object detection, and semantic segmentation. Wang et al. [31] presented a pyramid vision Transformer for dense prediction without convolutions. Bagchi et al. [32] combined CNN and Transformer to learn both local temporal features and inter-region interactions. The long-range dependencies that the Transformer can capture are rarely explored in EEG sequences, not to mention under individual differences. Therefore, we propose a global adaptive Transformer framework for cross-subject EEG classification, which encodes the non-local correlation of EEG between the source and target domains during the domain adaptation.

III. METHOD

A. Overview
In the context of EEG-based BCI, we assume the annotated EEG signals from the target subject and those from the source subjects belong to two different but related domains, namely, the target domain $\mathcal{D}_t$ and the source domain $\mathcal{D}_s$. Both domains share the same feature space and label space.

Let $\{(x_i^t, y_i^t)\}_{i=1}^{N_t} \in \mathcal{D}_t$ denote the $N_t$ trials of EEG signals in the target domain, where $x_i^t \in \mathbb{R}^{C \times T}$ is the $i$th EEG trial with $C$ electrode channels and $T$ time samples, and $y_i^t \in \{1, \cdots, M\}$ is its corresponding label among $M$ categories. Similarly, $\{(x_j^s, y_j^s)\}_{j=1}^{N_s} \in \mathcal{D}_s$ denotes the $N_s$ EEG trials in the source domain. In the real world, the sample size $N_t$ is usually too small to train a reliable model for the target subject, while the annotated data in the source domain are much more abundant than those in the target domain. However, due to the large individual difference and the special characteristics of EEG signals, the two domains draw different joint distributions, namely, $P(x^t, y^t) \neq P(x^s, y^s)$, which can be further split into the marginal distribution $P(x^t) \neq P(x^s)$ and the conditional distribution $P(y^t|x^t) \neq P(y^s|x^s)$, as in Fig. 1. In this regard, how to better reduce the distribution discrepancies and leverage useful information from the source domain to improve the decoding performance for the target domain remains a challenging problem.

Fig. 1. … Thus we can use source data to enhance the classification performance of the target data. We consider global correlation within EEG for better distribution alignments.
To tackle this problem, we propose a novel domain adaptation framework, GAT, for cross-subject enhanced EEG classification. The overall framework of GAT is depicted in Fig. 2. The whole structure can be divided into four components: a feature extractor, an essential global adaptor, a domain discriminator, and a classifier. In the training phase, samples from the two domains are fed into the feature extractor to learn feature maps for the source and target domains, respectively. The feature extractor consists of two parallel temporal and spatial convolutional branches, which allows both the temporal and spatial structures to be preserved simultaneously for feature representation. With the features learned from the two domains, the global adaptor subsequently encodes the global dependencies in the target domain and utilizes them to guide the transfer of features from the source domain to the target domain. After that, a domain discriminator is adopted to determine whether the features come from the source or the target domain. Based on adversarial learning, the discriminator is updated alternately and learns against the combination of the feature extractor and the global adaptor until a Nash equilibrium is reached [33]. In this way, the features learned from the two domains become similar enough to confuse the discriminator, and thus the marginal distribution discrepancy is gradually eliminated. Furthermore, an adaptive center loss is proposed to make use of the label information, which pulls features from the same category closer while pushing away features from different classes. Therefore, the conditional distribution is also aligned, and the joint distributions can be considered similar between the source and the target domains. Finally, a classifier can be safely trained with the aligned features from both domains for EEG decoding. In the test or online phase, we directly use the well-learned feature extractor and classifier to decode newly arrived EEG trials of the target subject.

B. Preprocessing
To avoid structural information loss, we use as few preprocessing operations as possible before feeding the raw EEG trials into the model. Here, we independently apply band-pass filtering and standardization to the raw EEG trials of each subject. To remove various high- or low-frequency artifacts while retaining valuable rhythms, we employ a 6th-order Chebyshev filter to constrain the frequency of the EEG data to $[W_1, W_2]$ Hz, based on the paradigm. Then, a z-score standardization is performed on the filtered data to reduce the non-stationarity with
$$x_o = \frac{x_i - \mu}{\sigma}, \tag{1}$$
where $x_i$ and $x_o$ denote the band-pass filtered data and the standardized output, respectively, and $\mu$ and $\sigma^2$ represent the mean and variance, which are calculated from the filtered training trials and used directly on the test data.
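To make the preprocessing concrete, a minimal sketch is given below. It assumes SciPy's Chebyshev type-I design for the 6th-order band-pass filter; the paper does not specify the filter type, ripple, or filtering direction, so the 0.5 dB ripple and zero-phase filtering here are illustrative choices:

```python
# Minimal preprocessing sketch; cheby1 (type I) and rp=0.5 dB are assumptions,
# since the paper only specifies a 6th-order Chebyshev band-pass filter.
import numpy as np
from scipy.signal import cheby1, filtfilt

def preprocess(train_trials, test_trials, fs=250.0, w1=4.0, w2=40.0):
    """Band-pass filter and z-score EEG trials of shape (N, C, T)."""
    b, a = cheby1(N=6, rp=0.5, Wn=[w1, w2], btype="bandpass", fs=fs)
    train_f = filtfilt(b, a, train_trials, axis=-1)   # zero-phase filtering
    test_f = filtfilt(b, a, test_trials, axis=-1)
    mu, sigma = train_f.mean(), train_f.std()         # statistics from training trials only
    return (train_f - mu) / sigma, (test_f - mu) / sigma
```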
C. Network Architecture

Fig. 2. The overall framework consists of four components: a feature extractor with two parallel branches to capture temporal and spatial features, a global adaptor to adapt global dependencies with the attention mechanism, a discriminator to drive adversarial learning, and a classifier. The adaptive center loss is given to reduce the conditional distribution discrepancy. Only the feature extractor and the classifier, connected by thick black lines, are used in the test phase.

1) Feature Extractor: After preprocessing, we construct a feature extractor with convolutional layers to take as input and learn robust feature representations from the source and target EEG trials of size $C \times T$. Since the dimensional units are different and $C$ is much smaller than $T$, we decompose the 2D convolution into two 1D convolutions along the spatial and temporal dimensions. As shown in Fig. 3(a), conventional feature extractors typically apply the spatial convolution to the output feature maps of the temporal convolution, and as a result, the original spatial information is inevitably destroyed. Therefore, we propose a more efficient combination of parallel temporal and spatial branches in Fig. 3(b). For the temporal branch, the convolutional kernel is of size $(k \times 1 \times n)$, where $k$ denotes the number of feature maps and $n$ denotes the kernel length along the temporal dimension. Similarly, the kernel size of the spatial convolution is $(k \times m \times 1)$, where $m$ is generally equal to the number of EEG channels $C$. The resulting outputs of the temporal and spatial convolutions are connected to an additional convolutional layer to reshape the feature maps for further additive merging. Each convolutional layer is followed by batch normalization and an ELU activation function to enhance the nonlinear representation capability. In this way, the two parallel branches simultaneously preserve the temporal-spatial structural information, which is helpful for the subsequent feature transfer between different domains.
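A compact PyTorch sketch of the parallel extractor is shown below. The filter counts k1 = 2, k2 = 5, k = 10 and kernel length n = 51 follow Section IV, while the exact placement of the reshaping convolutions is our illustrative reading of Fig. 3(b):

```python
# Sketch of the parallel temporal/spatial feature extractor; input is assumed
# to be (batch, 1, C, T). The reshaping-convolution layout is an assumption.
import torch
import torch.nn as nn

class ParallelExtractor(nn.Module):
    def __init__(self, C=22, n=51, k1=2, k2=5, k=10):
        super().__init__()
        # temporal branch: kernel slides along the time axis only
        self.temporal = nn.Sequential(
            nn.Conv2d(1, k1, kernel_size=(1, n)), nn.BatchNorm2d(k1), nn.ELU())
        # spatial branch: kernel spans all C electrode channels
        self.spatial = nn.Sequential(
            nn.Conv2d(1, k2, kernel_size=(C, 1)), nn.BatchNorm2d(k2), nn.ELU())
        # extra convolutions reshape both branches to k maps of size (1, T-n+1)
        self.temporal_reshape = nn.Sequential(
            nn.Conv2d(k1, k, kernel_size=(C, 1)), nn.BatchNorm2d(k), nn.ELU())
        self.spatial_reshape = nn.Sequential(
            nn.Conv2d(k2, k, kernel_size=(1, n)), nn.BatchNorm2d(k), nn.ELU())

    def forward(self, x):                               # x: (B, 1, C, T)
        t = self.temporal_reshape(self.temporal(x))     # (B, k, 1, T-n+1)
        s = self.spatial_reshape(self.spatial(x))       # (B, k, 1, T-n+1)
        return t + s                                    # additive merging
```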
2) Global Adaptor: Because of the coherent brain-driven behavior, we assume the corresponding long EEG sequences are also context-dependent. However, due to the limited receptive field of the convolutional operation, the global dependencies within EEG signals are rarely explored or considered during the domain adaptation procedure. Therefore, we propose a global adaptor based on the attention mechanism to enhance feature alignment via encoding the non-local correlation in the EEG features. Specifically, we first flatten the channel of the feature maps and split the source and target features into multiple slices with a length of $d$ in the temporal dimension. In this way, we obtain $(T - n + 1)/d$ slices of size $1 \times dk$, each slice representing a small feature segment. As depicted in Fig. 4, the slices of the source and the target domain are linearly transformed into vectors of the same size, namely the target query ($Q$), the target key ($K$), and the source value ($V$). Then, the global correlation in the target domain can be evaluated by the pairwise similarity between the target query and key, which is calculated by the dot product between $Q$ and $K$ with a scaling of $1/\sqrt{d}$. The Softmax function converts the global correlation into attention scores to guide a transformation of the source value. Two additional fully-connected layers are used to improve the fitting ability. This attention process can be formulated as
$$\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V. \tag{2}$$
In the implementation, we use a multi-head attention (MHA) layer to improve the performance of the global adaptor. The attention input is separated equally into smaller parts, called heads. Afterward, the attention process is performed individually on each head, and the resulting outputs are concatenated as follows:
$$\text{MHA}(Q, K, V) = \text{Concat}(\text{head}_1, \cdots, \text{head}_h), \quad \text{head}_l = \text{Attention}(QW_l^Q, KW_l^K, VW_l^V), \tag{3}$$
where MHA denotes multi-head attention, $W_l^Q, W_l^K, W_l^V \in \mathbb{R}^{dk \times dk/h}$ are the learnable linear transformations to obtain the query, key, and value, respectively, and $l$ is the index of the heads. The multi-head attention is then followed by two fully connected layers to improve the fitting ability. In this way, the global attention module encodes the non-local correlation of the target domain and uses it to guide the feature mapping for the source domain.
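The following sketch illustrates the cross-domain attention of Eqs. (2)-(3) in PyTorch: queries and keys come from target-domain slices, values from source-domain slices. Computing all heads with one fused linear layer and the per-head scaling are illustrative implementation choices:

```python
# Sketch of the global adaptor. One nn.Linear(dk, dk) computes all h head
# projections at once, equivalent to the per-head W_l of Eq. (3).
import torch
import torch.nn as nn

class GlobalAdaptor(nn.Module):
    def __init__(self, dk, h=5):
        super().__init__()
        assert dk % h == 0
        self.h, self.dk = h, dk
        self.w_q = nn.Linear(dk, dk)   # target query
        self.w_k = nn.Linear(dk, dk)   # target key
        self.w_v = nn.Linear(dk, dk)   # source value
        self.ffn = nn.Sequential(nn.Linear(dk, dk), nn.ELU(), nn.Linear(dk, dk))

    def forward(self, src, tgt):       # both: (B, S, dk), S = (T - n + 1) / d
        B, S, _ = src.shape
        def split(x):                  # (B, S, dk) -> (B, h, S, dk/h)
            return x.view(B, S, self.h, self.dk // self.h).transpose(1, 2)
        q, k, v = split(self.w_q(tgt)), split(self.w_k(tgt)), split(self.w_v(src))
        # scaled dot product; the head-dimension scaling here is an assumption
        scores = q @ k.transpose(-2, -1) / (self.dk // self.h) ** 0.5
        out = torch.softmax(scores, dim=-1) @ v          # target correlation reweights source values
        out = out.transpose(1, 2).reshape(B, S, self.dk) # concatenate heads
        return self.ffn(out)                             # two FC layers improve fitting
```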
3) Domain Discriminator: Inspired by generative adversarial networks (GAN) [33], we use a discriminator to train against the preceding feature processing modules. Taking the learned features as input, the discriminator aims to distinguish which domain a feature comes from. This is formulated as a binary classification problem and achieved by two fully-connected layers followed by a single output neuron. In contrast, the features of different domains generated by the feature extractor and the global adaptor should be similar enough to fool the discriminator. In this way, the marginal distribution between the source and target features is gradually aligned until the discriminator is confused by the feature extractor and the global adaptor. Following WGAN [34], we employ the Wasserstein loss for the discriminator to ensure the stability of the adversarial learning procedure with
$$\mathcal{L}_D = \mathbb{E}\left[D(A(F(x^s)))\right] - \mathbb{E}\left[D(F(x^t))\right] + \lambda_{gp} GP, \tag{4}$$
$$GP = \mathbb{E}_{\hat{x}}\left[\left(\left\|\nabla_{\hat{x}} D(\hat{x})\right\|_2 - 1\right)^2\right], \tag{5}$$
where $\hat{x} = \alpha x^t + (1-\alpha)x^s$. $D$, $A$, and $F$ are short for the discriminator, the global adaptor, and the feature extractor, respectively. $GP$ denotes the gradient penalty, $\lambda_{gp}$ is its weight, and $\alpha$ represents a random value between $(0, 1)$.
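A sketch of the critic update corresponding to Eqs. (4)-(5) follows, with λ_gp = 10 as in Section IV. Interpolating in feature space and the helper names are assumptions:

```python
# WGAN-GP critic loss sketch; the interpolation point and helper names are
# illustrative assumptions.
import torch

def gradient_penalty(D, feat_t, feat_s, lambda_gp=10.0):
    """Gradient penalty of Eq. (5), computed on feature interpolations."""
    alpha = torch.rand(feat_t.size(0), *([1] * (feat_t.dim() - 1)),
                       device=feat_t.device)
    x_hat = (alpha * feat_t + (1 - alpha) * feat_s).requires_grad_(True)
    grads = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    return lambda_gp * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def critic_loss(D, feat_t, feat_s):
    """Wasserstein critic loss of Eq. (4); feat_t = F(x_t), feat_s = A(F(x_s)),
    both detached from the generator before the critic step."""
    return D(feat_s).mean() - D(feat_t).mean() + gradient_penalty(D, feat_t, feat_s)
```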
The feature extractor and global adaptor are viewed as a generator with the constraint of
$$\mathcal{L}_G = -\mathbb{E}\left[D(A(F(x^s)))\right]. \tag{6}$$

4) Classifier: After feature alignment, the features extracted from both domains can be used to train the classifier. Here, we adopt a simple network with two fully-connected layers as the classifier. The output goes through Softmax to obtain an $M$-dimensional vector, where each dimension represents the probability of a category. With the predicted output $\hat{y}$ and the ground truth label $y$, we use the cross-entropy loss
$$\mathcal{L}_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{M} y_{i,c}\log \hat{y}_{i,c}. \tag{7}$$

5) Adaptive Center Loss: Although the marginal distribution between the source and target domains is aligned by the domain discriminator, the conditional distribution remains different. Therefore, we further present the adaptive center loss to align the conditional distribution as
$$\mathcal{L}_{act} = \frac{1}{2N_t}\sum_{i=1}^{N_t}\left\|F(x_i^t) - c_{y_i^t}^t\right\|_2^2 + \frac{1}{2N_s}\sum_{j=1}^{N_s}\left\|A(F(x_j^s)) - c_{y_j^s}^t\right\|_2^2, \tag{8}$$
where $c_c^t$ denotes the deep feature center of class $c$ in the target domain. Under this constraint, the intra-class variation is reduced, and the inter-class distance is amplified. At the same time, the source domain features converge towards the corresponding class centers of the target domain.
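A sketch of the adaptive center loss of Eq. (8) is given below. Features from both domains are pulled toward the target-domain center of their class; keeping the centers as learnable parameters updated by the joint optimizer is an implementation assumption:

```python
# Adaptive center loss sketch; learnable running centers are an assumption.
import torch
import torch.nn as nn

class AdaptiveCenterLoss(nn.Module):
    """Eq. (8): pull target features and adapted source features toward the
    target-domain class centers."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feat_t, y_t, feat_s, y_s):
        loss_t = (feat_t - self.centers[y_t]).pow(2).sum(dim=1).mean()  # intra-class compactness
        loss_s = (feat_s - self.centers[y_s]).pow(2).sum(dim=1).mean()  # source -> target centers
        return 0.5 * (loss_t + loss_s)
```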
For the whole network, we set up two optimizers and train them alternately: one optimizes the discriminator with $\mathcal{L}_D$, and the other jointly optimizes the feature extractor, global adaptor, and classifier with
$$\mathcal{L}_{joint} = \mathcal{L}_{cls} + \omega_G \mathcal{L}_G + \omega_{act} \mathcal{L}_{act}, \tag{9}$$
where $\omega_G$ and $\omega_{act}$ are hyperparameters. The overall processing is summarized in Algorithm 1.

Algorithm 1 Global Adaptive Transformer
1: Input: source EEG data $x^s \in \mathbb{R}^{C \times T}$, source labels $y^s$; target EEG data $x^t \in \mathbb{R}^{C \times T}$, target labels $y^t$.
2: Output: $y_{pred}$, prediction results for the target data.
3: Initialize: the networks of feature extractor $F$, adaptor $A$, discriminator $D$, and classifier $C$ with normal weights.
4: while not converged do
5:   Update $D$:
6:     obtain $D(A(F(x^s)))$ and $D(F(x^t))$;
7:     compute the loss $\mathcal{L}_D$ by Eq. (4);
8:     update $D$ weights with $\mathcal{L}_D$;
9:   Jointly update $F$, $A$, and $C$:
10:    obtain $D(A(F(x^s)))$;
11:    compute the loss $\mathcal{L}_G$ by Eq. (6);
12:    obtain $C(A(F(x^s)))$ and $C(F(x^t))$;
13:    compute the loss $\mathcal{L}_{cls}$ by Eq. (7);
14:    obtain $A(F(x^s))$ and $F(x^t)$;
15:    compute the loss $\mathcal{L}_{act}$ by Eq. (8);
16:    update $F$, $A$, and $C$ weights with $\mathcal{L}_{joint}$ by Eq. (9);
17: end while
18: Return: $F$ and $C$ trained with aligned $x^s$ and limited $x^t$.
19: Testing or Online Phase: decode newly arrived target EEG trials $x^{t'}$ with the trained $F$ and $C$.
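Under the assumptions of the sketches above, the alternating optimization of Algorithm 1 may be condensed as follows. The `slices` helper (flattening feature maps into attention slices), the paired source/target `loader`, and the names `Cls` (the classifier C of Algorithm 1) and `Fn` (torch.nn.functional, aliased to avoid clashing with the extractor F) are all assumptions:

```python
# Condensed training-loop sketch for Algorithm 1; module and helper names
# refer to the illustrative sketches above.
import itertools
import torch
import torch.nn.functional as Fn

opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(itertools.chain(F.parameters(), A.parameters(),
                                         Cls.parameters(), act_loss.parameters()),
                         lr=2e-4, betas=(0.5, 0.999))
w_g, w_act = 1.0, 0.2                                   # loss weights from Section IV

for x_s, y_s, x_t, y_t in loader:                       # paired source/target batches
    f_t, f_s = slices(F(x_t)), slices(F(x_s))           # hypothetical `slices` helper
    # --- critic step (Eq. 4) ---
    opt_d.zero_grad()
    critic_loss(D, f_t.detach(), A(f_s, f_t).detach()).backward()
    opt_d.step()
    # --- joint generator step (Eq. 9) ---
    opt_g.zero_grad()
    a_s = A(f_s, f_t)                                   # adapted source features
    l_g = -D(a_s).mean()                                # Eq. (6)
    l_cls = Fn.cross_entropy(Cls(a_s), y_s) + Fn.cross_entropy(Cls(f_t), y_t)  # Eq. (7)
    l_act = act_loss(f_t, y_t, a_s, y_s)                # Eq. (8)
    (l_cls + w_g * l_g + w_act * l_act).backward()
    opt_g.step()
```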

IV. EXPERIMENTS AND RESULTS

A. Datasets
We evaluate our method on two real public EEG datasets, namely, Dataset 2a and Dataset 2b of BCI Competition IV. These two datasets use different experimental designs, acquisition devices, and data sizes, which helps verify the generalization of our method. The motor imagery paradigm used in these datasets is one of the most affected by individual differences.
1) Dataset 2a of BCI Competition IV: The EEG dataset provided by Graz University of Technology was acquired from nine subjects with twenty-two Ag/AgCl electrodes at a sampling rate of 250 Hz. Four different motor imagery tasks were collected in two sessions on different days, including imagination of moving the left hand, right hand, both feet, and tongue. Each session contained 288 trials (72 trials per task), with the [2, 6] second interval of each trial used in our experiments. The data were band-pass filtered to [4, 40] Hz as in [19], [35]. The first session was used as the training set, and the second session was used as the test set in our experiments.
2) Dataset 2b of BCI Competition IV: The EEG dataset (https://www.bbci.de/competition/iv/desc_2b.pdf) was collected from nine subjects with three bipolar electrodes at a sampling rate of 250 Hz. Two motor imagery tasks were collected in five sessions, including imagination of moving the left hand and the right hand. 120 trials of each session, with the [3, 7] second interval of each trial, were used in our experiments. The data were band-pass filtered to [4, 40] Hz. We train with the first three sessions and test with the last two sessions.

B. Experiment Details
Our method is implemented with the PyTorch library in Python 3.6 on a GeForce 2080 Ti GPU. All EEG channels in the datasets were used, discarding the electrooculogram channels. We train the model using the Adam optimizer with the learning rate, $\beta_1$, and $\beta_2$ chosen to be 0.0002, 0.5, and 0.999, respectively. The batch size is set to 64. In the feature extractor, the numbers of filters $k_1$, $k_2$, and $k$ are set to 2, 5, and 10, respectively. The kernel size for the temporal convolution $n$ is set to 51 to encapsulate local temporal information. In the global adaptor, both the slice length $d$ and the number of heads $h$ are chosen to be 5. In the discriminator, $\lambda_{gp}$ is typically set to 10. The loss weights $\omega_G$ and $\omega_{act}$ are set to 1 and 0.2, respectively.
We employ classification accuracy and kappa as evaluation metrics. Kappa is a normalized measurement that takes the chance level into account and is calculated with
$$\kappa = \frac{p_o - p_e}{1 - p_e}, \tag{10}$$
where $p_o$ denotes the average accuracy over all the trials and $p_e$ denotes the accuracy of a random guess. Besides, we use the Wilcoxon Signed-Rank Test to analyze the statistical significance of performance comparisons.
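Since kappa normalizes by the chance level, for an M-class balanced task p_e = 1/M and the reported values can be checked directly; a minimal helper:

```python
def kappa(p_o: float, num_classes: int) -> float:
    """Kappa of Eq. (10) for a balanced task, where p_e = 1/M."""
    p_e = 1.0 / num_classes
    return (p_o - p_e) / (1.0 - p_e)

# e.g., 76.58% on the 4-class Dataset 2a gives the reported kappa of 0.6877
print(round(kappa(0.7658, 4), 4))   # 0.6877
```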

C. Baseline Comparison
To evaluate our framework broadly, we compare the proposed method on both Dataset 2a and 2b with state-of-the-art approaches, including the skillfully hand-crafted feature extractors FBCSP [19] and ssCSP [36], the conventional EEG classifier SSMM [37], the end-to-end convolutional neural networks ConvNet [15], MI-CNN [38], C2CM [17], and SHNN [39], and the adversarial-learning-based domain adaptation model DRDA [12]. The classification performance for each individual and the average results are presented in Table I and Table II. It can be seen that the proposed global adaptation framework achieves superior performance on both datasets, with the highest average accuracy of 76.58% and kappa value of 0.6877 on Dataset 2a. Our method performs significantly better than the methods with subtly hand-crafted feature extraction, like FBCSP and ssCSP, as well as the end-to-end CNN-based models such as ConvNet and MI-CNN. Though C2CM and SHNN inherit the advantages of both conventional feature extractors and CNNs, and even fine-tune the model hyperparameters for each subject, they are still inferior to ours. Compared with DRDA, our method pays attention to the global correlation of the target subject during the feature adaptation and thus gains a 1.84% improvement in average accuracy. Similarly, our method also outperforms the other methods on Dataset 2b and achieves the highest average accuracy of 84.44% and kappa of 0.6889, which validates the effective generalization of our method.

D. Effect of GAT-Based Domain Adaptation
The key of the proposed method is the attention-based domain adaptation framework, which improves the decoding performance of the target subject by leveraging useful information from other subjects. Therefore, we first confirm the significance of our method by evaluating the efficiency of the domain adaptation module. Specifically, we obtain a degraded method without domain adaptation by removing the pivotal global adaptor, the domain discriminator, and the adaptive center loss. Then, we train two degraded models using only the target data and the mixed data from both domains, respectively. Fig. 5 compares the test performance of these two degraded models and our domain adaptation method on Dataset 2a and Dataset 2b. It can be observed that the degraded model with mixed domains increases the accuracy by 1.58% (p = 0.2419), taking the degraded model trained on the target data as the baseline. Accuracy even decreases on Subjects 3, 7, 8, and 9. There is no obvious boost from directly fusing the source data into the target training set, which demonstrates the detrimental effect of individual differences on cross-subject EEG classification. On the contrary, our method with domain adaptation significantly improves the average test accuracy, which is 13.62% (p < 0.01) and 12.04% (p < 0.01) higher than the baseline and the degraded model with mixed data, respectively. Kappa values show the same trend as the accuracy, where the baseline is slightly lower than the mixed-data model by 0.0211 and much lower than our domain adaptation method by 0.1815. Besides, our model also provides significant help on Dataset 2b compared to the baseline. The accuracy of every subject is improved, and an average accuracy improvement of 4.49% (p < 0.01) is achieved. The kappa value obtained with our method also increases by 0.0898 compared with using only the target data as the training set. Therefore, the proposed domain adaptation framework is capable of handling individual differences and making good use of the source domain, thus enhancing cross-subject classification performance.

E. Ablation Study
After validating the whole framework, we conduct ablation experiments to evaluate the effectiveness of several key feature transfer components by removing each component from the entire model in turn. The experimental results are presented in Table III and Table IV.

1) Global Adaptor: The global adaptor is the most critical part of the overall framework. It encodes the non-local dependencies of the target subject into the feature mapping for the source data, so that the learned source features have a global correlation similar to that of the target features. As shown in Table III, we compare the test performance with and without the global adaptor on Dataset 2a. There is a noticeable drop of 9.45% (p < 0.01) in the average classification accuracy when we remove the global adaptor. Among the subjects, those with poorer performance are more affected; for example, the test accuracy of S04 is reduced by 15.96%, and even the least affected subject, S03, degrades by 3.82%. Similar phenomena also occur on Dataset 2b, as shown in Table IV. The average accuracy drops by 3.71% (p < 0.01) without the global adaptor. This illustrates the efficiency of adapting source-domain features by considering the global correlation within the target subject, so our adaptor module considerably improves the results.
2) Discriminator: The discriminator aims to align the marginal distribution between the source and target domains by learning against the combination of the feature extractor and the global adaptor based on adversarial learning. The second rows of Table III and Table IV present the classification performance of the proposed method without the discriminator module. The average accuracy on Dataset 2a is 3.32% (p < 0.01) lower than that of the whole framework. In addition, almost all subjects have degraded performance, except for S06. The absence of the discriminator also introduces a 3.23% (p < 0.01) degradation on Dataset 2b. The results show that the discriminator used for adversarial learning plays a staple role in driving domain adaptation.

3) Adaptive Center Loss: We propose the adaptive center loss to further constrain the inter- and intra-class distances, since the conditional distribution varies across subjects. From Table III, we can see that removing the adaptive center loss negatively impacts the overall performance on Dataset 2a. The average classification accuracy degenerates by 3.16% (p < 0.01), with the worst case, S04, dropping by 5.89% and the least affected case, S07, dropping by 1.73%. From Table IV, we can see similar results, where the average accuracy degenerates by 1.96% (p < 0.01) on Dataset 2b without the adaptive center loss. The results show that the adaptive center loss is substantially helpful in reducing conditional distribution discrepancies and enhancing domain adaptation performance. We also compare different designs of the adaptive center loss to illustrate the effect of aligning the source and target domains. The results of constraining features to the class centers within only the target or only the source data are given in Table V. It can be seen that the average accuracy significantly decreases by 2.66% (p < 0.05) when applying the center constraint only to the target data for Dataset 2a.
The same trend is observed on Dataset 2b, where constraining only the target data brings a 1.99% (p < 0.01) decrease in average accuracy. Constraining only the source domain even brings slightly negative effects: the average accuracy is reduced by 3.63% (p < 0.01) on Dataset 2a and 2.35% (p < 0.01) on Dataset 2b compared to using the complete adaptive center loss.

F. Evaluation of the Feature Extractor Design
Convolutional neural networks are the current benchmark for EEG feature extraction. Convolutional layers are connected layer by layer to extract features along different dimensions. However, the typical series connection inevitably leads to information loss between adjacent layers. Therefore, we propose a parallel strategy to simultaneously capture the temporal and spatial structural information in two branches. In this experiment, we compare the results of the proposed parallel strategy and two different serial connections. One serial connection is shown in Fig. 3(a), where the temporal convolutional filter is followed by the spatial filter (i.e., temp-spat); in the other, the spatial filter is followed by the temporal one (i.e., spat-temp). The results are illustrated in Fig. 6. We can see that the parallel temporal-spatial branches, as in Fig. 3(b), have a significant advantage over the series connections on all nine subjects. The average accuracy of the parallel connection is 6.13% (p < 0.01) higher than the temp-spat connection and 4.55% (p < 0.01) higher than the spat-temp connection. Unexpectedly, the spat-temp connection performs 1.58% (p = 0.0273) better than the temp-spat connection. It is worth noting that better performance may come with higher computational cost. The parameters and FLOPs of the three convolution connections are compared in Table VI.

G. Visualization
In our method, the marginal distributions of the source and target domains are aligned through the synergy of the global adaptor and the discriminator, while the adaptive center loss further constrains the discrepancy between the conditional distributions. Here we visualize the features via t-SNE [40] to show the contribution of each component to the feature alignment.
For illustrative purposes, we treat the data of S01 from Dataset 2a as the target domain and the remaining subjects' data as the source domain. Fig. 7 visualizes the feature distribution of the source and target domains under the influence of the discriminator, the global adaptor, and the adaptive center loss progressively. Blue dots denote the target domain, and the other colors denote the source domain.
From Fig. 7(a), it is evident that the original feature distributions of different individuals are significantly different. Therefore, training the target model directly with mixed data from the source and target domains, as shown in Fig. 7(b), would lead to poor performance for the target subject. Introducing the discriminator for adversarial learning helps align the feature distributions between the source and target domains, as observed in Fig. 7(c). However, the feature distribution remains disordered. On the other hand, when incorporating the global adaptor without the discriminator, as depicted in Fig. 7(d), there is a noticeable improvement in the alignment of the marginal distribution. By combining both the discriminator and the global adaptor, as shown in Fig. 7(e), the marginal distributions of the source and target domains become uniformly similar. This demonstrates the effective role played by the discriminator and the global adaptor in feature alignment, with mutual enhancement when used together.
Next, we focus on the adjustment of the conditional distribution by the adaptive center loss. Fig. 7(e) shows that, without any constraint on the conditional distribution, the features are scattered and the boundaries between different categories become blurred. With the adaptive center loss, as illustrated in Fig. 7(f), features belonging to the same category from different domains are aggregated, resulting in clearer boundaries between classes. In this way, the conditional distribution of the source domain gradually adapts to that of the target domain. Overall, the results highlight the effective roles of both the discriminator and the global adaptor in feature alignment between the source and target domains. Additionally, integrating the adaptive center loss helps adapt the conditional distribution, leading to improved alignment between the domains.

V. DISCUSSION
Due to large individual differences, people have to collect sufficient target-subject data to calibrate EEG-based BCI systems. Domain adaptation provides a feasible way to enhance EEG classification of the target subject by leveraging the data from other subjects. Existing methods commonly use fixed constraints or directly employ adversarial learning to align features from the source and target domains. However, the EEG series that reflect the intentions of our brain are context-dependent in most tasks. In this case, such methods are limited by poor feature representation because they ignore long-term dependencies. Therefore, we propose the Global Adaptive Transformer for more effective feature representation by emphasizing the global correlation of the target domain. Parallel convolutions are first adopted to obtain the local spatial-temporal features of EEG. Then an attention-based adaptor is designed to transfer the global dependencies of the target domain to the source domain.
In experiments, we observe that GAT achieves state-of-the-art performance on public datasets and a significant improvement compared to only using target data or directly mixing target and source data for training. The results prove that our method performs effective domain adaptation, so that the data from other subjects can be used to calibrate the target model. The ablation study shows that the attention-based global adaptor plays a major role in aligning the feature distributions. The discriminator and the adaptive center loss further improve the overall performance by a large margin. It is worth noting that the parallel connection works better than the series connection of temporal and spatial convolutional layers widely used in EEG analysis. The visualization shows that the GAT framework successfully aligns the marginal and conditional distributions of the source and target domains.
There are still several limitations to our method. First, we only evaluate GAT on motor imagery EEG data, which is heavily affected by individual differences. The performance on other paradigms still needs further validation. In addition, we have confirmed that the global interactions calculated by the attention mechanism are significant for EEG data during domain adaptation, but we have not further explored whether the global representation is helpful for EEG decoding itself. Besides, we treat the data from all subjects as a single source domain for practical implementation, but the contribution of different subjects may vary in the domain adaptation process. In the future, we will extend our method across multiple paradigms and further explore its online practice on multi-distributed source data. Another important issue is that this paper explores domain adaptation for EEG classification from a deep learning perspective, while ignoring the comparison with some traditional methods [41], [42]. These methods have good interpretability, focusing on specific signal characteristics. We will also incorporate these methods into deep learning for better performance.

VI. CONCLUSION
In this paper, we have introduced a domain adaptation framework designed for cross-subject enhanced EEG classification. Our framework incorporates parallel convolutional layers to capture the temporal-spatial structure information, an attention-based adaptor to align non-local correlations, and an adaptive center loss to address conditional discrepancies. Through extensive experiments conducted on two real EEG datasets, we have demonstrated the effectiveness of the proposed method in leveraging the source data. Our approach achieves remarkable improvements compared to state-of-the-art methods in EEG classification. We believe that our method holds significant potential in facilitating EEG classification tasks and enhancing the practicality of BCI systems.