A Dual-Adversarial Model for Cross-Time and Cross-Subject Cognitive Workload Decoding

Electroencephalogram (EEG) signals are widely utilized in the field of cognitive workload decoding (CWD). However, when the recognition scenario is shifted from subject-dependent to subject-independent or spans a long period, the accuracy of CWD deteriorates significantly. Current solutions are either dependent on extensive training datasets or fail to maintain clear distinctions between categories, additionally lacking a robust feature extraction mechanism. In this paper, we tackle these issues by proposing a Bi-Classifier Joint Domain Adaptation (BCJDA) model for EEG-based cross-time and cross-subject CWD. Specifically, the model consists of a feature extractor, a domain discriminator, and a Bi-Classifier, containing two sets of adversarial processes for domain-wise alignment and class-wise alignment. In the adversarial domain adaptation, the feature extractor is forced to learn the common domain features deliberately. The Bi-Classifier also fosters the feature extractor to retain the category discrepancies of the unlabeled domain, so that its classification boundary is consistent with the labeled domain. Furthermore, different adversarial distance functions of the Bi-Classifier are adopted and evaluated in this model. We conduct classification experiments on a publicly available BCI competition dataset for recognizing low, medium, and high cognitive workload levels. The experimental results demonstrate that our proposed BCJDA model based on cross-gradient difference maximization achieves the best performance.

task [1], [2] and has gained vast attention. There is, however, still no firm consensus [3] on the definition of cognitive workload, which can be viewed as a juxtaposition of the following definitions: Welford [4] defined cognitive workload as the resources available to meet task demands; Wickens argued that it is the relationship between the mental resources required by the task and the resources available to the operator [5]; and Young and Stanton suggested that cognitive workload reflects the attentional resources needed to meet both subjective and objective performance criteria [6]. Generally, cognitive workload is divided into low and high levels [7], [8], and some work [9], [10] classifies it into three more precise and practical categories: underload, normal, and overload. Because human mental resources are limited, maintaining a moderate cognitive workload helps operators perform their work safely and efficiently [11]. Thus, cognitive workload decoding (CWD) has been proposed to automatically judge the operator's mental state from physiological signals. In recent years, CWD has been widely used in various industries, including education (online course evaluation [12]), medicine (sequelae of rehabilitation [13]), transportation (car driving [8]), and aerospace [2], [14].
Multiple methodologies that utilize physiological signals have been proposed for estimating qualitative cognitive workload levels [15]. Among the many physiological signals, EEG is collected directly from the scalp and has been shown to correlate with cognitive workload [16]. EEG reflects the activity of brain neurons and has the unique advantages of high temporal resolution, low cost, and easy acquisition [7], [17], which foster its wide use in CWD. Moreover, EEG signals are not easily camouflaged and can provide reliable results [18]. As a key neurofeedback application, EEG-based CWD aids in managing mental states effectively with its non-invasive, harmless, and enduring qualities [19]. Thus, we adopt an EEG-centric perspective, focusing on the identification of reliable biomarkers that correlate with cognitive workload.
However, the low signal-to-noise ratio makes EEG susceptible to noise interference. Because of discrepancies between different times, subjects, and tasks, the EEG signal representation is unstable [20]. Therefore, in the field of EEG-based CWD, it remains a great challenge to obtain a general model that can be shared across different times and subjects (the cross-time and cross-subject CWD problems). One solution, re-collecting calibration data for each subject and session before every test, is impractical because acquiring EEG data in advance is time-consuming [17], [21] and imposes considerable burden and fatigue on users. Consequently, the challenge lies in making the best use of limited datasets, a pivotal issue that needs to be resolved [22]. This has drawn attention to the cross-time and cross-subject CWD problem, whose solution would allow existing data to be reused, and this is the difficulty our paper focuses on.
In subject-dependent and time-dependent CWD, some machine learning [8], [10], [23] and deep learning methods [11], [24] have been proposed and have achieved satisfactory results. However, they cannot adapt well to the cross-subject and cross-time CWD scenarios because they do not take the distribution differences between domains into account. Domain generalization and domain adaptation methods have therefore been exploited to find shared label-common features (discussed in Section II) through domain alignment. However, they fail to preserve the class difference information at the same time, because deep domain adaptation tends to smooth out the gaps between class features, which inadvertently reduces classification accuracy.
To address these issues, we propose a Bi-Classifier Joint Domain Adaptation (BCJDA) model for EEG-based cross-time and cross-subject CWD, which contains not only adversarial domain adaptation but also adversarial inter-class distances of a Bi-Classifier, as shown in Fig. 1. BCJDA extracts domain-invariant features through domain adaptation while maximizing the discriminant distance between the two classifiers. This approach aims to maintain class alignment and thereby improve the reliability of boundary samples. Unlike existing models, BCJDA takes raw EEG data as input for end-to-end training and considers both domain alignment and category alignment. This integration is achieved through adversarial training and is complemented by a comparative analysis of three distinct disparity algorithms. Our contributions are as follows:
1) We design a novel CWD model based on a Bi-Classifier and domain adaptation, alleviating the problem of class difference loss in deep domain adaptation models.
2) We investigate the effect of using a task-specific Bi-Classifier with different adversarial disparity algorithms on the model performance. To our knowledge, we are the first to evaluate the influence of different determinacy disparities between classifiers in CWD.
3) We evaluate the BCJDA model on a public EEG dataset to recognize the practical low, medium, and high workload levels, demonstrating significant improvement over baselines.
The rest of this paper is structured as follows. Section II briefly reviews work related to cross-time and cross-subject CWD. Section III introduces the structure and implementation of the proposed model in detail. Section IV presents the experimental setup and a preliminary analysis of the experimental results. The results are discussed in Section V. Finally, we conclude the paper and suggest some directions for future work.

II. RELATED WORKS

A. CWD With Machine Learning Methods
Traditional machine learning methods, such as support vector machine (SVM) [10], [23], K-nearest neighbor [23], [25], and linear discriminant analysis [26], require features to be extracted manually in advance and construct classification models through statistical methods. Although traditional machine learning methods are simple to implement and easy to train, they have reached a bottleneck owing to incomplete input features and weak fitting ability for EEG samples, which are susceptible to noise. Researchers are more inclined to exploit deep learning methods for EEG-based classification tasks because deep learning methods have more powerful representation capabilities [24] and support end-to-end training, eliminating the need for manual feature extraction. Assuming that different sessions share a set of domain-invariant features, Jin et al. [27] achieved subject-specific CWD across time with a deep separable convolutional neural network (CNN) based on transfer learning; Kuanar et al. [28] presented a deep recurrent neural network (RNN) to learn cognitive workload features that are robust to inter-subject differences. Ni et al. [29] used an adversarial EEG generation method combined with a hierarchical RNN to alleviate the performance degradation of event-related-potential-based BCI in cross-subject applications. As reviewed in [30], the CNN module is one of the most commonly used modules in the field of CWD, and we also adopt it in our proposed method.

B. CWD With Domain Adaptation Methods
Domain adaptation aims to transfer knowledge from the source domain to the unlabeled target domain and is suitable for knowledge transfer with a small number of samples [31], as is the case for EEG datasets. Domain adaptation can be a remedy for the nonstationarity of EEG. In cross-time CWD, the first session of a single subject is regarded as the source domain and the second session is regarded as the target domain. In cross-subject CWD, subjects whose EEG data have been acquired form the source domain, and unknown subjects form the target domain. Currently, domain adaptation has been applied to various other EEG-based state monitoring tasks to address the problem of excessive inter-domain differences. Furukawa et al. [32] used a small amount of estimated target data in emotion recognition to relieve the EEG measurement burden and proposed a model with multiple domain discriminators in their subsequent work [33]. For medical applications, a model utilizing hierarchical domain adaptation with projective dictionary pair learning [34] was used in epilepsy diagnosis for medical IoT to integrate pathological data from different nodes. Chambon et al. [35] and Heremans et al. [36] also explored the influence of domain adaptation on sleep state recognition. However, its application in the field of CWD remains limited. For example, Zhou et al. [15] proposed using domain adaptation approaches combined with machine learning models for binary-classification cross-task CWD, followed by a work [22] that constructs a CWD model with a deep neural network based on adversarial domain adaptation. These methods employ traditional machine learning techniques or fully connected layers for feature extraction. As a result, they inadequately exploit the deep temporal information of EEG signals and exhibit limited capacity in feature extraction. Moreover, none of these methods addresses the loss of category discrepancy that arises from domain adaptation.
To address the aforementioned challenges, we propose BCJDA for cross-time and cross-subject CWD to boost general performance. The feature extractor is composed of several CNN and pooling modules, which have a higher tolerance to noise; they strengthen automatic extraction of spatio-temporal features, reduce the dimensionality, and lower the computational complexity. BCJDA leverages adversarial learning between the feature extractor and the domain discriminator to acquire domain-invariant features, and the class difference is maintained by maximizing the disparity of the Bi-Classifier. By jointly aligning the domains and the classes, the domain distribution difference is diminished while more category feature differences are kept, and the BCJDA model is compelled to produce more robust outputs.

III. METHODS
Since domain distribution and class discrepancy knowledge are equally important for EEG classification, we adopt joint alignment, comprising domain-wise alignment and class-wise alignment, to reduce inter-domain differences while retaining the class differences to the greatest extent. Thus, the classification performance of BCJDA is improved. This section is organized as follows: 1) we explain the overall structure of our proposed model; 2) we give the detailed structure and loss function of each module of the model; 3) we introduce the two groups of adversarial methods and the overall optimization function, and give the three classifier adversarial losses used by our model.

A. Our Model Structure
Our model design draws inspiration from some existing domain adaptation methods, such as GANs, MCD [37], and BCDA [38]. There are three main parts in BCJDA: one feature extractor, one Bi-Classifier consisting of two task-specific label classifiers, and one domain discriminator, as shown in Fig. 2. Both the domain discriminator and the Bi-Classifier obtain features from the feature extractor. The Bi-Classifier further extracts deep features and then predicts the cognitive workload level. It aims to minimize prediction errors with respect to the ground truth on source domain samples while maximizing prediction discrepancies on target domain samples to preserve category differences (inconsistency loss in Fig. 2). The domain discriminator, simultaneously, tries to distinguish whether features come from the source domain or the target domain by minimizing domain classification errors. By contrast, the feature extractor attempts to deceive both the Bi-Classifier and the domain discriminator, thus forming two sets of minimax games, which are discussed in detail in part D. Therefore, the feature extractor can gradually generate domain-invariant features that maintain the differences between categories. The Bi-Classifier performs deep extraction on the features and learns label-specific features during adversarial learning. Finally, our model's prediction results are determined by the superposition of the two classifiers.
The detailed layer structure and parameters are summarized in Table I. It describes the layers used by each module of the model and the number of parameters for the different layers, excluding dropout and batch-norm layers. Table SI in the supplementary materials compares our proposed model's modules with the other common CNN methods described in Section IV, part D. Compared with common CNN methods in EEG classification tasks (i.e., EEGNet and DeepCNN), our model introduces only minimal increases in parameters and computational complexity with two additional classifiers.

B. Feature Extractor
EEG data is small in size and highly sensitive, making it unsuitable for general deep network models designed for other data types such as images and speech. We explored various common EEG feature extraction models and ultimately devised the feature extractor depicted in Fig. 3, drawing inspiration from DeepCNN, ShallowCNN, and Multi-branch 3D CNN. Some of the subject-dependent pre-experimental results are given in Table SII in the supplementary materials to support our choice. Raw EEG data is fed into the feature extractor and, after passing through four blocks, is output as a 200 × 1 feature map. Block 1 includes a 1 × 10 temporal filter and a 61 × 1 spatial filter, where the latter fuses the channels with different weights and compresses the channel dimension to one. Blocks 2 to 4 each use only a 1 × 10 temporal filter, capturing the local and global timing information of EEG signals. A max pooling layer is adopted in all four blocks to reduce the computational complexity while retaining EEG feature information and mitigating noise, in consideration of the low signal-to-noise ratio and non-stationarity of EEG.
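As a sanity check on the stated 200 × 1 output, the following sketch traces the temporal length through four blocks of unpadded 1 × 10 convolution followed by max pooling. The pooling size of 3 and the per-block filter counts (25, 50, 100, 200) are assumptions borrowed from DeepCNN, since they are not given in the text above; the function and constant names are illustrative only.

```python
def trace_time_axis(n_samples=500, kernel=10, pool=3, n_blocks=4):
    """Track the temporal length through (valid conv, max-pool) blocks."""
    lengths = []
    length = n_samples
    for _ in range(n_blocks):
        length = length - kernel + 1  # 1 x kernel temporal convolution, no padding
        length = length // pool       # non-overlapping max pooling (assumed size)
        lengths.append(length)
    return lengths


# Assumed filter counts per block (DeepCNN-style); not stated in the text.
ASSUMED_FILTERS = [25, 50, 100, 200]

lengths = trace_time_axis()
output_shape = (ASSUMED_FILTERS[-1], lengths[-1])
print(lengths)       # [163, 51, 14, 1]
print(output_shape)  # (200, 1)
```

Under these assumptions, a 500-point epoch collapses to a single time step after the fourth block, which would be consistent with a 200 × 1 feature map.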
To minimize misclassification and guarantee the effectiveness of the model's classification ability, it is necessary to minimize the training loss on the annotated data in the source domain.
Here we take $F_y$ as a classifier and take $\theta_y$ and $\theta_f$ as the parameters of the classifier and the feature extractor, respectively. The loss function $L_{class}$ is formulated in Eq. (1), where $G_f$ and $x_s$ represent the feature extractor and the source domain samples, respectively, as shown in Fig. 2; $y_s$ denotes the label of the sample $x_s$; $p$ denotes the softmax function that turns the classifier output into a probability for a given label $l_k$; and $L_y$ is the label prediction loss, also called the supervised loss.
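To make the supervised loss concrete, the sketch below computes a softmax cross-entropy over a batch of source samples. It assumes the standard cross-entropy form for $L_y$; the function names and the toy logits are illustrative and not taken from the original implementation.

```python
import math


def softmax(logits):
    """Turn classifier outputs into probabilities over the labels l_k."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def supervised_loss(batch_logits, labels):
    """Mean negative log-probability of the true label over a source batch."""
    losses = [-math.log(softmax(logits)[y])
              for logits, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


# Toy batch: classifier outputs F_y(G_f(x_s)) for three workload classes.
logits = [[2.0, 0.1, -1.0], [0.0, 0.0, 0.0]]
labels = [0, 2]
print(supervised_loss(logits, labels))
```

An uninformative classifier (equal logits) yields a loss of ln 3 for three classes, which is a useful reference point when monitoring training.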

C. Bi-Classifier & Domain Discriminator
The Bi-Classifier consists of two task-specific classifiers ($F_{y1}$, $F_{y2}$) with the same structure, which is a network with three fully connected layers. The input layer and hidden layer each have 128 neurons, and the output layer has 3 neurons (the same as the number of classes). During testing, the classifiers cooperate to determine the final result. In training, in addition to the supervised loss determined by Eq. (1), the confrontation between the classifiers is also considered in order to maximize the distance of the Bi-Classifier. We modify Eq. (1) to give the supervised loss function in the form of the Bi-Classifier in Eq. (2), where $n_s$ represents the number of source domain samples, $x_{si}$ represents a source domain sample, and $y_{si}$ is its corresponding label. The domain discriminator ($G_d$) is a special classifier consisting of three fully connected layers, where the input layer, hidden layer, and output layer contain 1024, 1024, and 1 neurons, respectively. As shown in Fig. 2, it takes features from the feature extractor and tries to accurately identify the domain each feature group belongs to. The feature generator, in contrast, is trained to mislead the domain discriminator. To this end, the domain discriminator and the feature generator are jointly optimized to perform domain alignment with parameters $\theta_d$ and $\theta_f$. The loss function $L_{domain}$ is given in Eq. (3), where $n = n_s + n_t$ is the total number of samples in a group, $d_i$ is the domain label of the sample $x_i$, and $L_d$ is the label prediction loss in the domain discriminator.
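The domain classification loss can be illustrated with a short sketch. It assumes that $L_d$ is a binary cross-entropy on the single output neuron of the discriminator, with domain label 0 for source and 1 for target; this labeling convention and the function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def domain_loss(domain_logits, domain_labels):
    """Binary cross-entropy averaged over all n = n_s + n_t samples."""
    n = len(domain_logits)
    total = 0.0
    for z, d in zip(domain_logits, domain_labels):
        p = sigmoid(z)  # predicted probability that the sample is from the target
        total += -(d * math.log(p) + (1 - d) * math.log(1 - p))
    return total / n


# Two source samples (d = 0) and two target samples (d = 1).
print(domain_loss([-1.5, -0.5, 0.7, 2.0], [0, 0, 1, 1]))
```

When the discriminator cannot tell the domains apart (all logits near zero), the loss approaches ln 2, which is the regime the feature generator pushes toward.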

D. Joint Domain Adaptation
This part mainly introduces two sets of adversarial processes in training, namely the game between the feature generator and the domain discriminator to implement domain-level alignment, and the game between the feature generator and the Bi-Classifier to implement class-level alignment.
For domain-level alignment, the loss function includes the supervised loss and the domain classification loss, as shown in Eq. (2) and Eq. (3). The overall loss function at the domain level, with weight coefficient $\lambda$, is denoted as $L_0$ in Eq. (4).
For class-level alignment, we selected three algorithms for the determinacy disparity between two classifiers: the L1-norm, Bi-Classifier Determinacy Maximization (BCDM) [38], and Cross-Domain Gradient Discrepancy Minimization (CGDM) [39]. Fig. 4 shows their calculation methods.
The L1-norm is easily calculated by taking the absolute value of the difference between the outputs of the two classifiers and averaging it over all the samples. The BCDM method calculates the difference using the relevance matrix obtained by multiplying the two output vectors, as shown in Fig. 4(b). The CGDM method additionally uses the recognition results of the source domain samples, as in Fig. 4(c). It uses the ground truth and source domain samples to calculate the supervised loss, and uses the model in training to pseudo-label the target samples to obtain the self-supervised loss. We then take the partial derivatives of the two losses with respect to the parameters $\theta_{y1}$ and $\theta_{y2}$, which yields the gradients $g_t$ and $g_s$. Finally, the gradient discrepancy loss is derived from $g_t$ and $g_s$. The formulas of the L1-norm, BCDM, and CGDM are given in Eqs. (5)-(7), denoted as $L_{dis0}$, $L_{dis1}$, and $L_{dis2}$, respectively, where $p^{y_i}_t$ represents the prediction probability matrix of classifier $i \in \{1, 2\}$ for target domain samples, and $T(\cdot)$ represents the transpose function.
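The three disparity measures can be sketched as follows. These are illustrative forms under stated assumptions rather than the exact equations: the L1 variant averages over classes as well as samples, the BCDM-style variant takes the off-diagonal mass of the relevance matrix $p^{y_1}_t \, T(p^{y_2}_t)$, and the CGDM-style variant uses one minus the cosine similarity of the flattened gradients $g_s$ and $g_t$. All function names are hypothetical.

```python
import math


def l1_discrepancy(p1, p2):
    """Mean absolute difference between two batches of class probabilities."""
    diffs = [abs(a - b) for r1, r2 in zip(p1, p2) for a, b in zip(r1, r2)]
    return sum(diffs) / len(diffs)


def relevance_discrepancy(p1, p2):
    """Off-diagonal mass of the K x K relevance matrix, averaged over samples."""
    total = 0.0
    for r1, r2 in zip(p1, p2):
        matrix = [[a * b for b in r2] for a in r1]  # outer product of the two outputs
        full = sum(sum(row) for row in matrix)
        diag = sum(matrix[k][k] for k in range(len(r1)))
        total += full - diag  # zero only when both classifiers agree with certainty
    return total / len(p1)


def gradient_discrepancy(g_s, g_t):
    """One minus the cosine similarity of two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g_s, g_t))
    norm = math.sqrt(sum(a * a for a in g_s)) * math.sqrt(sum(b * b for b in g_t))
    return 1.0 - dot / norm


agree = [[1.0, 0.0, 0.0]]
differ = [[0.0, 1.0, 0.0]]
print(l1_discrepancy(agree, differ))         # about 0.667
print(relevance_discrepancy(agree, differ))  # 1.0
print(gradient_discrepancy([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

All three measures are zero when the classifiers (or gradients) agree perfectly, and grow as they diverge, which is the property the adversarial class-level game relies on.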
In EEG recognition problems, the division of boundary samples is very important, especially when adopting cross-domain methods, because some category discrepancy is lost when the source domain knowledge is transferred to the target domain and because EEG signals are unstable. Class-level alignment can mitigate these issues, so we want the Bi-Classifier to discern target domain samples close to the classification boundary and to force the feature generator to generate features that preserve significant category discrepancy. The overall class-level loss function $L_1$ is therefore given in Eq. (8). Our model is trained with the back-propagation of two adversarial losses, and the overall loss function $L$ is given in Eq. (9). To reduce the discrepancy of the distribution across domains, thus enabling the feature generator to learn domain-invariant features, the parameter $\theta_d$ is determined to minimize $-L_{domain}$, that is, to maximize $L_0$, and the parameters $\theta_f$, $\theta_{y1}$, $\theta_{y2}$ are determined to maximize $L_{class}$, that is, to maximize $L_0$ too. In this case, the domain discriminator will try its best to tell which domain the sample belongs to, and the feature extractor will seek shared knowledge. The optimization objective of this process is summarized in Eq. (10), in which $x \in \{x_s, x_t\}$; that is, Eq. (10) is optimized over all samples.
To counter the smoothing of discriminative structure in EEG features under domain adaptation and to align the sample partitions of the source and target domains, the Bi-Classifier maximizes its discriminative difference on the target domain and forces the feature extractor to yield features with clear category discrepancy. Minimizing the supervised loss on the source domain is also taken into account. Thus, the parameters $\theta_{y1}$ and $\theta_{y2}$ are optimized to maximize $L_1$ and minimize $L_{class}$; that is, the discrepancy is maximized on target samples while the supervised loss is minimized on source samples.
The optimization objective of this process is summarized in Eq. (11). It should be noted that although Eq. (11) may appear contradictory, it operates on separate samples, which allows the same loss function to be used during training with a gradient reversal layer. Through Eqs. (9), (10), and (11), once BCJDA is well trained, it can perform joint alignment of EEG features across domains.
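The role of the gradient reversal layer can be sketched with a hand-written forward and backward pair. This is a conceptual toy under the assumption that the layer is the identity in the forward pass and multiplies the incoming gradient by $-\lambda$ in the backward pass; in a deep learning framework it would be implemented as a custom autograd operation, and the class below is hypothetical.

```python
class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, features):
        # Features pass through unchanged on the way to the adversarial head.
        return list(features)

    def backward(self, upstream_grad):
        # The feature generator receives the negated gradient, so one loss
        # can be minimized by the head and maximized by the generator.
        return [-self.lam * g for g in upstream_grad]


grl = GradientReversal(lam=0.5)
print(grl.forward([1.0, -2.0]))   # [1.0, -2.0]
print(grl.backward([2.0, -4.0]))  # [-1.0, 2.0]
```

With this sign flip, a single backward pass updates the adversarial head and the feature generator in opposite directions, which is why one loss function suffices for each minimax game.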

A. Dataset
The public cognitive workload dataset released by the Neuroergonomics Conference is used in our experiments. We first describe the components of this dataset and then the preprocessing steps used. More details about the dataset can be found at www.neuroergonomicsconference.um.ifi.lmu.de.
The dataset contains a total of 15 subjects, 6 females and 9 males, with an average age of 25 years. Each subject was invited to the lab for three independent sessions, exactly seven days apart. Each experiment session had a short warm-up period, after which the EEG data of the subject's resting state were recorded. Participants then completed a multi-attribute task battery (MATB) divided into three five-minute modules, each of which presented a different level of difficulty (that is, a different level of workload; see Table SIII of the supplementary materials for details) in a pseudo-random order. Three levels of workload were elicited by varying the number and complexity of subtasks, and were validated by statistical analysis of subjective and objective behavioral and cardiac data. All EEG data were recorded with a 64-channel EEG cap following the international 10-20 system. In the final dataset, each subject has 3 sessions, and each session contains 447 cognitive workload samples, which are evenly divided into 3 categories. The raw EEG signals include a reference electrode, a cardiac activity electrode, and a bad lead, which were removed. We use the first two sessions, for which labels are provided, for model training and evaluation. Each sample has 500 sampling points (2 s epoch, sampling rate 250 Hz). Data preprocessing is done with EEGlab, and the specific steps are as follows:
1) Data were divided into 2-second non-overlapping epochs, using the right mastoid electrode as a reference.
2) A high-pass filter (FIR filter, pop-filtnew from EEGlab) at 1 Hz was applied.
3) Electrode suppression (mean amplitude above 2 sd across channels) with spherical interpolation was applied.
4) SOBI with automatic IC_Label rejection was applied (muscle, heart, and eye components are rejected with a threshold of 95%).
5) A 40 Hz low-pass filter (FIR filter) was applied.
6) A common average reference was applied, and the data were downsampled to 250 Hz.
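The figures quoted above are mutually consistent, as the short check below shows. It assumes that the three removed electrodes are counted among the 64 recorded channels, which matches the 61 × 1 spatial filter of the feature extractor; the constant names are illustrative.

```python
SAMPLING_RATE_HZ = 250
EPOCH_SECONDS = 2
RECORDED_CHANNELS = 64
REMOVED_CHANNELS = 3        # reference, cardiac, and bad lead (assumed part of the 64)
EPOCHS_PER_SESSION = 447
N_CLASSES = 3

points_per_epoch = SAMPLING_RATE_HZ * EPOCH_SECONDS   # 500 sampling points
eeg_channels = RECORDED_CHANNELS - REMOVED_CHANNELS   # 61 channels
epochs_per_class = EPOCHS_PER_SESSION // N_CLASSES    # 149 epochs per workload level
seconds_per_class = epochs_per_class * EPOCH_SECONDS  # 298 s, just under five minutes

print(points_per_epoch, eeg_channels, epochs_per_class, seconds_per_class)
```

Each input sample is therefore a 61 × 500 array, and the 149 epochs per class cover almost all of each five-minute MATB module.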

B. Comparative Methods
SVM: SVM is a supervised machine learning algorithm that maps the data into a high-dimensional space by using different kernel functions to find an optimal hyperplane to segment the data into different classes.
A common feature for CWD using traditional machine learning methods is Power Spectral Density (PSD). We use the multi-taper method proposed by Thomson [40] to extract the PSD features of each channel and concatenate the features of all channels to form a feature vector as the input of the SVM. This is a non-parametric method of PSD estimation, which does not require any prior information about the signal generation process and has a low estimation variance.
EEGNet [41]: EEGNet is a compact CNN widely used in the field of EEG classification, which uses depthwise separable convolutions to build an EEG-specific model with the ability to classify across BCI paradigms. It has a strong generalization ability when the training data is limited.
Multi-branch 3D CNN [42]: Multi-branch 3D CNN (MB3D) extracts the spatio-temporal features of EEG signals with a multi-branch 3D convolutional neural network, where each branch has a different receptive field size and can capture EEG features at different scales. This approach fully exploits the 3D structure of EEG signals and enhances the classification accuracy and robustness.
DeepCNN [43]: DeepCNN (DCNN) uses multiple CNN modules as the main model structure and mines the task-related deep features of EEG signals through multi-level temporal feature extraction. It has been proven useful for spatially mapping the learned features.
ShallowCNN [43]: ShallowCNN (SCNN) is a simplified CNN that uses only one feature layer to extract the intrinsic features of EEG sequences, and then uses training time and accuracy to evaluate the classification effect. It has a small number of parameters and high real-time performance.
Joint Distribution Adaptation [44]: Joint Distribution Adaptation (JDA) is a domain adaptation method that strives to obtain latent representations in which the source domain is similar to the target domain by aligning marginal distributions and conditional distributions. JDA uses domain discriminators and associative reinforcement to deal with shallow and deep features. The same PSD features as used for the SVM are fed into this model.

C. Implementation Details
The experimental part comprises cross-time and cross-subject CWD experiments. In the cross-time scenario, session 1 of a single subject serves as the source domain and session 2 as the target domain. Since time is irreversible, we do not consider the reverse direction. We apply all eight aforementioned methods to each subject, and a total of 360 experiments are conducted on 15 subjects. We then take the average over all subjects as the final result. In the cross-subject scenario, we employ leave-one-subject-out (LOSO) cross-validation to evaluate model performance. This involves using one subject as the target domain while utilizing all other subjects as the source domain in each session. A total of 240 experiments were performed for these two sessions. During training, the number of epochs is set to 10. The batch size is set to 32 in the CNN methods and 48 in JDA and our proposed method. The learning rate is set to 0.003 in the CNN methods and 0.005 in JDA and our proposed methods, and a learning rate decay strategy is adopted. CrossEntropyLoss is used as the training loss, and Adam and SGD are selected as the optimizers in the CNN methods and ours, respectively. For the weight parameter $\lambda$ in Eq. (4) and Eq. (9), if it is set too high, the model will have difficulty learning the class differences. If it is set too low, the source domain data will not be aligned well with the target domain, which reduces the classification performance. Therefore, in L1-based BCJDA, we set the weights of the $L_{domain}$ and $L_{dis0}$ losses to 1.0, whereas in the other two BCJDAs, we set the weight of the $L_{dis}$ loss to 0.01.
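The LOSO protocol can be sketched as below. The subject identifiers 1 to 15 and the function name are illustrative; the run count simply multiplies the folds by the two sessions and the eight methods mentioned above.

```python
def loso_folds(subject_ids):
    """Return (source_subjects, target_subject) pairs, one per held-out subject."""
    folds = []
    for target in subject_ids:
        sources = [s for s in subject_ids if s != target]
        folds.append((sources, target))
    return folds


subjects = list(range(1, 16))   # 15 subjects, numbered here for illustration
folds = loso_folds(subjects)
n_sessions = 2
n_methods = 8
total_runs = len(folds) * n_sessions * n_methods

print(len(folds))    # 15
print(total_runs)    # 240
```

Each fold trains on 14 labeled source subjects and evaluates on the one held-out target subject, whose labels are never used for training.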
We select accuracy (ACC), F1-score, sensitivity (SEN), and specificity (SPE) as the model evaluation metrics. The macro-averaged F1-score is reported over the 3 classes, and sensitivity and specificity are reported separately for each class.
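These metrics can all be derived from a three-class confusion matrix in a one-vs-rest fashion, as the following sketch shows. The confusion matrix values are made up for illustration and are not experimental results; the function names are hypothetical.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(k)) - tp
        tn = total - tp - fn - fp
        results.append({
            "sen": tp / (tp + fn),            # sensitivity (recall) of class c
            "spe": tn / (tn + fp),            # specificity of class c
            "f1": 2 * tp / (2 * tp + fp + fn),
        })
    return results


def summarize(cm):
    """Return accuracy, macro-averaged F1, and the per-class metrics."""
    per_class = per_class_metrics(cm)
    total = sum(sum(row) for row in cm)
    acc = sum(cm[c][c] for c in range(len(cm))) / total
    macro_f1 = sum(m["f1"] for m in per_class) / len(per_class)
    return acc, macro_f1, per_class


# Illustrative confusion matrix (rows: true class, columns: predicted class).
toy_cm = [[8, 1, 1], [2, 6, 2], [0, 3, 7]]
acc, macro_f1, per_class = summarize(toy_cm)
print(acc, macro_f1)
print(per_class[0])
```

Because the macro average weights each class equally, a gap between ACC and macro F1 indicates that performance is uneven across the three workload levels.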
We used the Linux operating system (Ubuntu 20.04) and an Nvidia RTX3090 graphics card with 24GB of video memory to train the model.Our model training speed is about 10.8 iters/s on average, and the training completion time (10 epochs) is around 138.6s.

D. Results
We validate our BCJDA model on the dataset described in part A under the cross-time and cross-subject scenarios. Table II shows the recognition accuracy across time, and Tables III and IV show the recognition accuracy across subjects. The highest mean value in each table is shown in boldface, and the second highest mean value is underlined. The experimental settings are as given in part C. The significance of these results is discussed in Section V, part A.
For the cross-time CWD scenario, we find that CGDM-based BCJDA achieves the best performance, with an increase of about 1%-1.5% over L1-norm-based and BCDM-based BCJDA, and all our proposed methods outperform the compared methods. L1-norm-based BCJDA is about 18% higher than SVM and 13%-16% higher than the deep learning methods and JDA, which demonstrates that our methods can effectively extract domain-invariant features and consider the alignment between categories while transferring the source domain knowledge to the target domain. Furthermore, we compare the accuracy, F1-score, sensitivity, specificity, and macro-averaged sensitivity and specificity of the methods across time in 15 subjects, as shown in Fig. S1 and Table SIV in the supplementary materials. Taking L1-norm-based BCJDA as an example, the method is 17%-22% ahead of the CNN methods in terms of F1-score. In addition, SEN is significantly improved in the C2 (Normal) and C3 (Overload) categories, and SPE is significantly improved in the C1 category. In general, the F1-score of our methods shows little decrease relative to ACC (< 1%), whereas the other methods all show varying degrees of decrease (> 3%). The well-known JDA method, which also belongs to domain adaptation, outperforms several CNN methods despite using only fully connected layers. Our method improves on it by about 14%.
For the cross-subject CWD scenario, Table III shows the ACC results under session 1, and Table IV shows the ACC results under session 2. We observe that L1-norm-based BCJDA and CGDM-based BCJDA achieve the best performance in sessions 1 and 2, respectively, with an increase of about 17%-19% over the CNN methods and JDA. Furthermore, we compare various metrics of the methods across subjects in 2 sessions, as shown in the supplementary materials in Fig. S2 and Tables SIV and SVI. Fig. S2(a) and Table SIV give the metrics in session 1, and Fig. S2(b) and Table SVI give the metrics in session 2. In general, the F1-score of our methods is basically the same as ACC (< 1%), whereas the other methods show varying degrees of decrease (> 3%), further illustrating the advantage of our methods in CWD across subjects. From the comparison of SEN and SPE, the classification performance of our methods is significantly improved on C1 (Underload) and C2 (Normal).

A. Results Distribution
This section presents the statistical analysis of the ACC metrics and the significance test using the Friedman test, followed by the Nemenyi post-hoc test [18]. To further illustrate the differences among the three determinacy disparity measures of the Bi-Classifier in BCJDA, we conduct separate tests for the three methods we proposed, in addition to the unified test for all models.
Our models achieve state-of-the-art performance in all experimental scenarios, as shown by the significant differences (p < 0.01) from the other models. Specifically, in the cross-time scenario (Fig. 5(a)), the CGDM-based BCJDA differs significantly from the L1-norm-based BCJDA. In the cross-subject scenario of session 1 (Fig. 5(b)), the CGDM-based BCJDA differs (p < 0.05) from the BCDM-based BCJDA. In the cross-subject scenario of session 2 (Fig. 5(c)), the CGDM-based BCJDA shows some advantage over the other two methods (p < 0.1), but the differences are not pronounced. These results suggest that the type of inter-classifier determinacy disparity has a significant impact on the performance, and that the CGDM-based BCJDA is superior to the other methods.
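As an illustration of the testing procedure, the sketch below computes the Friedman statistic from per-subject accuracies and the Nemenyi critical difference (CD) between mean ranks. It is a simplified version under stated assumptions: no tie correction is applied to the Friedman statistic, the critical values in `Q_05` are the commonly tabulated ones for alpha = 0.05 and should be checked against a statistics reference, and the accuracies are hypothetical rather than our results.

```python
import math

# Commonly tabulated Nemenyi critical values q_alpha (alpha = 0.05) for k = 2..10 methods.
# Assumption: verify against a statistics reference before relying on them.
Q_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
        7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}


def rank_row(scores):
    """Rank one subject's scores (1 = highest accuracy); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def friedman(acc):
    """acc[n][k] = accuracy of method k on subject n. Returns (chi2, mean_ranks)."""
    n, k = len(acc), len(acc[0])
    ranks = [rank_row(row) for row in acc]
    mean_ranks = [sum(r[j] for r in ranks) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4)
    return chi2, mean_ranks


def nemenyi_cd(n, k):
    """Two methods differ at alpha = 0.05 if their mean ranks differ by more than CD."""
    return Q_05[k] * math.sqrt(k * (k + 1) / (6 * n))


# Hypothetical accuracies: 5 subjects x 3 methods.
toy_acc = [[0.80, 0.75, 0.60],
           [0.82, 0.78, 0.65],
           [0.79, 0.74, 0.61],
           [0.85, 0.80, 0.66],
           [0.81, 0.77, 0.62]]
chi2, mean_ranks = friedman(toy_acc)
cd = nemenyi_cd(n=len(toy_acc), k=len(toy_acc[0]))
```

The statistic `chi2` is compared against a chi-square distribution with k - 1 degrees of freedom; only if the Friedman test rejects the null hypothesis are the pairwise mean-rank differences compared against `cd`.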

B. Confusion Matrix
To further analyze the misclassification rate of our proposed models on the three-class recognition of cognitive workload, we visualize the recognition accuracy using confusion matrices. Fig. 6 shows the confusion matrices of BCJDA under the two experimental scenarios. For instance, in the cross-time scenario, the three BCJDAs achieve the highest recognition accuracy for the Underload state, at 80.10%, 79.42%, and 82.51%, respectively.
The recognition accuracy of the Normal state is the lowest, and the Normal state is misclassified as the Underload and Overload states in similar proportions, indicating that the Normal state is not well separated from the other two states. In the Overload state, about 33% of the samples are misclassified as the Normal state. These results suggest that the Normal state has less distinctive features, partly because the subjects are fatigued when performing this part of the experiment, so the EEG signals generated are closer to the Overload state even though the task difficulty is set to moderate. The samples classified as Overload by our model indicate that the subject is approaching the limit of cognitive resources while performing the task and needs special attention.
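The percentages reported in the confusion matrices are obtained by pooling the predictions of all subjects in a scenario and normalizing each true-class row. A minimal sketch of this aggregation follows; the helper names and the toy labels are hypothetical.

```python
def confusion_matrix(y_true, y_pred, n_classes=3):
    """Count matrix: rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def pool(matrices):
    """Sum the count matrices of several subjects element-wise."""
    n = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) for j in range(n)] for i in range(n)]


def row_percent(cm):
    """Convert counts to percentages of each true class (each row sums to 100)."""
    out = []
    for row in cm:
        s = sum(row)
        out.append([100.0 * v / s if s else 0.0 for v in row])
    return out


# Hypothetical labels for two subjects: 0 = Underload, 1 = Normal, 2 = Overload.
subj_a = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 2, 1])
subj_b = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 1, 1, 2, 2, 1])
pooled = pool([subj_a, subj_b])
percent = row_percent(pooled)
```

In `percent`, the diagonal entries are the per-class recognition accuracies, and the off-diagonal entries of a row show where that class is misclassified.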

C. Analysis of EEG Topography
In the field of EEG classification, Power Spectral Density (PSD) is a common feature for distinguishing between different cognitive workload levels. Although the CNNs and our methods do not directly employ PSD features and instead extract other features from raw EEG signals, PSD features are still implicit in these features. Therefore, it is meaningful to perform a visual analysis of PSD features. We visualized the PSD features of five common EEG frequency bands in the two sessions of subject 7 (whose PSD exhibited the best classification ability) as topographic maps. The five frequency bands are: δ waves (1-3 Hz), θ waves (4-7 Hz), α waves (8-13 Hz), β waves (14-30 Hz), and γ waves (31-50 Hz). Two-way ANOVA was used to test the correlation of the PSD features with Sessions and Workloads. The results are shown in Fig. 7.
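To make the band definitions concrete, the sketch below estimates the power of one EEG channel in the five bands listed above. It is illustrative only: it uses a plain one-sided periodogram computed with a naive DFT rather than the PSD estimator used for Fig. 7, and the 250 Hz sampling rate and the synthetic 10 Hz signal are assumptions made for the example.

```python
import cmath
import math

# Band edges in Hz, as listed in the text.
BANDS = {"delta": (1, 3), "theta": (4, 7), "alpha": (8, 13),
         "beta": (14, 30), "gamma": (31, 50)}


def periodogram(x, fs):
    """One-sided periodogram (power per Hz) of a real signal via a naive DFT."""
    n = len(x)
    mean = sum(x) / n
    x = [v - mean for v in x]                      # remove the DC offset
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coef = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        # Double all bins except DC and Nyquist to fold in negative frequencies.
        scale = 1.0 if k == 0 or (n % 2 == 0 and k == n // 2) else 2.0
        freqs.append(k * fs / n)
        power.append(scale * abs(coef) ** 2 / (fs * n))
    return freqs, power


def band_powers(x, fs):
    """Integrate the periodogram over each band (rectangle rule)."""
    freqs, power = periodogram(x, fs)
    df = fs / len(x)                               # frequency resolution in Hz
    return {name: sum(p for f, p in zip(freqs, power) if lo <= f <= hi) * df
            for name, (lo, hi) in BANDS.items()}


# Assumed setup: 500 time points at 250 Hz, containing a pure 10 Hz (alpha) oscillation.
fs = 250
signal = [math.sin(2 * math.pi * 10 * t / fs) for t in range(500)]
bp = band_powers(signal, fs)
```

Applying `band_powers` to every channel yields one value per electrode and band, which is the quantity that a topographic map interpolates over the scalp.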
We observe that the PSD features show similar properties, including the activated brain areas and the intensity of activation, in different domains under the same state. This demonstrates that PSD features respond consistently to the cognitive workload state, indicating that joint alignment across domains is feasible. However, different Sessions exhibit a lower correlation with PSD features. Specifically, among the five frequency bands, only the α band demonstrates a significant correlation (p < 0.05). This finding suggests that the variation between sessions is more pronounced than the differences related to Workload. Given these results, we emphasize the necessity and significance of pursuing cross-domain recognition.
In the same domain, we find that the PSD feature distributions of the Underload, Overload, and Normal states are quite different and correlate more strongly with the PSD features; however, the PSD distributions of the Normal and Overload states are closer, and the power difference between the Normal and Overload states is not obvious. This explains why models (both CNNs and our methods) are prone to misclassifying the Normal state. As shown in the confusion matrices and the SEN and SPE bar charts of Fig. 6 and Fig. 2 of the supplementary materials, the SEN and SPE of the classification on Underload are high, whereas the SEN and SPE of the classification on Normal are low, and Normal samples are mainly confused with Overload. In this case, although the subjects were performing moderately difficult tasks, their cognitive workload tended toward overload.

D. Feature Visualization
To further analyze the feature representation ability of SVM, CNNs, and BCJDAs, we use t-SNE [22] for visual analysis. t-SNE is a statistical method for visualizing high-dimensional data by giving each data point a location in a two- or three-dimensional map. Fig. 8 visualizes the distribution of features. For SVM, we use the PSD features extracted beforehand. We find that the features extracted by the three types of models in the cross-time scenario are more discriminative than those in the cross-subject scenario.
Compared with the other two methods, which deteriorate when the domain difference is large, CGDM-based BCJDA not only has a feature extraction ability similar to that of DeepCNN but also considers domain alignment and category alignment, so it achieves good performance in cross-domain recognition. As shown in Fig. 8(f) (the solid line is the common boundary of the two domains, which achieves the domain alignment), the feature distribution between different domains is consistent, and the discriminability between categories is preserved. Fig. 8(c) indicates that its category boundary is sharp, and the classification performance is not significantly reduced when transferring knowledge from source to target.

E. Ablation Study
In this part, we perform an ablation analysis to evaluate the impact of the domain discriminator and the Bi-Classifier module on the overall performance of BCJDA, based on the process in Fig. 2. We set up four cases for this study: case1 is the CGDM-based BCJDA model retaining both the Bi-Classifier and the domain discriminator; case2 and case3 remove the domain discriminator and use CGDM and BCDM, respectively, as the determinacy disparity for the inter-classifier adversarial distance; case4 retains the domain discriminator but applies only a single classifier (i.e., there is no adversarial distance).
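The four ablation cases can be summarized as a small configuration table. The sketch below is only an organizational aid: the class, field, and loss-term names are hypothetical and do not correspond to identifiers in our implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AblationCase:
    name: str
    use_domain_discriminator: bool
    n_classifiers: int            # 2 = Bi-Classifier, 1 = single classifier
    disparity: Optional[str]      # "CGDM", "BCDM", or None when there is one classifier


CASES = [
    AblationCase("case1", True, 2, "CGDM"),    # full CGDM-based BCJDA
    AblationCase("case2", False, 2, "CGDM"),   # no domain discriminator
    AblationCase("case3", False, 2, "BCDM"),   # no domain discriminator
    AblationCase("case4", True, 1, None),      # no inter-classifier adversarial distance
]


def active_losses(case):
    """List the loss terms (hypothetical names) that remain active in a case."""
    losses = ["classification"]
    if case.use_domain_discriminator:
        losses.append("domain_adversarial")
    if case.n_classifiers == 2:
        losses.append(f"classifier_disparity[{case.disparity}]")
    return losses


summary = {case.name: active_losses(case) for case in CASES}
```

Reading `summary` row by row shows which adversarial process each case removes relative to case1.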
The final results are presented in Fig. 9. These results show that both the domain discriminator and the Bi-Classifier effectively improve the performance and robustness of the model. Although the Bi-Classifier can simulate the function of the domain discriminator to a certain extent, the domain discriminator performs more targeted adversarial training and thus mines more effective domain-invariant features. In addition, compared with a single classifier, the Bi-Classifier preserves the discriminability of features during domain adaptation and thereby optimizes the partition of features near the boundary. In conclusion, the comparison between the ablated variants and the proposed model confirms that both modules contribute, to varying degrees, to BCJDA.

F. Analysis of Generalization Performance
To further evaluate the robustness and generalizability of the BCJDA model, we conducted cross-subject experiments on an emotion dataset called 'SEED'. In this section, we present the cross-subject recognition accuracy of our model and the compared methods on this dataset, and provide detailed data for our model on each subject.
The SEED dataset consists of 15 subjects, each of whom took part in 3 EEG recording sessions. In each session, each subject was required to watch 15 movie clips, including 5 negative, 5 neutral, and 5 positive clips. Self-assessment ensured that each subject's emotion was consistent with that shown in the movie clips. The EEG data were collected according to the international 10-20 standard with 62 electrode channels at a sampling rate of 200 Hz. More information can be found in [45]. In our experiments, we divided each subject's signal into 1692 samples, each with a length of 2 s. For the cross-subject experiments, one subject served as the target domain, and all the remaining subjects formed the source domain. We used the most representative L1-norm-based BCJDA model and selected the session with the best performance.
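The data preparation described above can be sketched as follows. The sketch assumes non-overlapping 2 s windows, which the text does not specify, and the function names are hypothetical; with a 200 Hz sampling rate, each window contains 400 time points.

```python
def window_bounds(n_points, fs=200, win_s=2.0):
    """Start/end indices of consecutive non-overlapping windows (assumed scheme)."""
    size = int(fs * win_s)                        # 400 points for 2 s at 200 Hz
    return [(start, start + size)
            for start in range(0, n_points - size + 1, size)]


def leave_one_subject_out(subject_ids):
    """Yield (source_subjects, target_subject): each subject is the target once."""
    for target in subject_ids:
        source = [s for s in subject_ids if s != target]
        yield source, target


# 1692 samples of 2 s each correspond to 1692 * 400 time points per subject.
bounds = window_bounds(n_points=1692 * 400)
splits = list(leave_one_subject_out(list(range(1, 16))))
```

Each of the 15 splits pairs one unlabeled target subject with the 14 remaining labeled source subjects.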
Table SVII in the supplementary materials gives the accuracy, F1-score, sensitivity, and specificity metrics for the cross-subject recognition of our model on the SEED dataset, where the F1-score is computed in weighted form. Table SVIII in the supplementary materials shows the average accuracies of the compared methods. Our method achieves the best result with an accuracy of 88.81%, which is an improvement of about 29% over the CNN method and slightly higher than the JDA method, with a smaller standard deviation. The F1-score is also consistent with the accuracy, indicating that our model performs well in each category.
The experimental results show that our method is not only applicable to CWD, but can also be used in other areas of cross-domain decoding of EEG signals, such as emotion categorization.

VI. CONCLUSION
In this study, we introduce the BCJDA model, tailored for decoding cognitive workload across time and subjects and incorporating domain-level and class-level alignment. The evaluation of our model encompassed three classification scenarios involving cross-time and cross-subject experiments, advancing the state of the art in adversarial domain adaptation for the recognition of three cognitive states. Furthermore, we compared the effect of the different determinacy disparity algorithms employed by the Bi-Classifier on model performance, revealing that the CGDM-based BCJDA exhibited superior performance. Our model not only mitigates the loss of category distinction during knowledge transfer but also significantly enhances model performance, demonstrating its efficacy in classifying sensitive EEG data. Given the diversity of network structures beyond CNNs that are used for EEG classification tasks, we intend to integrate more powerful modules, such as the transformer [46], the adaptive graph convolution module [47], and the multi-head self-attention layer [48], into BCJDA.

Fig. 1. Illustration of joint alignment in our proposed model. Domain-wise alignment and class-wise alignment are conducted in two flows of the adversarial learning process. The three shapes represent the three states of cognitive workload in our experiments.

Fig. 2. The architecture overview of Bi-Classifier Joint Domain Adaptation (BCJDA), where green indicates the source domain and yellow indicates the target domain; G_f is the feature extractor; the Bi-Classifier is composed of F_y1 and F_y2, which are Classifier 1 and Classifier 2, respectively; x_s and x_t are the features of the source domain and target domain extracted by G_f. The inconsistency loss is specifically expressed as the degree of difference between the deep feature matrices generated by the two classifiers for the target domain samples. The numbers 1 and 2 on the target output labels indicate which classifier produced the label.

Fig. 3. Structure of the feature extractor in BCJDA. The input is the raw 61 × 500 (Channel × Time) EEG data, and the output is the 200 × 1 feature map.

Fig. 4. Determinacy disparity of the Bi-Classifier in BCJDA. Part (a) is the L1-norm calculation method, part (b) is the BCDM difference calculation method, and part (c) is the CGDM difference calculation method.

Fig. 5. Accuracy distributions and significance tests between the proposed models and the compared methods. The Friedman test is used; * represents p < 0.05 and ** represents p < 0.01. (a) shows the results of the cross-time experiment; (b) and (c) show the results of sessions 1 and 2 of the cross-subject experiments, respectively.

Fig. 6. The accuracies (%) of BCJDA represented by confusion matrices. (a), (b), and (c) show the results under the cross-time scenario; (d), (e), and (f) show the results under the cross-subject scenario. The results displayed in each confusion matrix represent the aggregated classification outcomes across all subjects within a specific experimental scenario. For example, in (a), 80.1% of the Underload samples from the 15 subjects are predicted correctly, and 6.28% are misclassified as Overload.

Fig. 7. PSD features and correlation analysis of subject 7 in five commonly used bands displayed in the form of EEG topography.

Fig. 8. Visualization of EEG features extracted by the models. Red dots represent Overload samples, green dots represent Normal samples, and blue dots represent Underload samples. Semi-transparent points represent source domain samples, and opaque points represent target domain samples. (a), (b), and (c) are the visualized features for cross-time recognition; (d), (e), and (f) are the visualized features for cross-subject recognition in session 2. Model and subject details are annotated at the top of the subfigures (T means target domain).

Fig. 9. The impact of classifiers and domain discriminators on overall model performance under the cross-time scenario.

TABLE I
SUMMARY OF THE LAYER STRUCTURE USED IN OUR PROPOSED BCJDA MODEL WITHOUT DROP & BATCH-NORM LAYERS

TABLE II
CROSS-TIME COGNITIVE WORKLOAD DECODING RESULTS OF ACCURACY (%) IN 15 SUBJECTS

TABLE III
CROSS-SUBJECT COGNITIVE WORKLOAD DECODING RESULTS OF ACCURACY (%) IN 15 SUBJECTS IN SESSION 1

TABLE IV
CROSS-SUBJECT COGNITIVE WORKLOAD DECODING RESULTS OF ACCURACY (%) IN 15 SUBJECTS IN SESSION 2