A Multi-Source Transfer Joint Matching Method for Inter-Subject Motor Imagery Decoding

Individual differences among different subjects pose a great challenge to motor imagery (MI) decoding. Multi-source transfer learning (MSTL) is one of the most promising ways to reduce individual differences, which can utilize rich information and align the data distribution among different subjects. However, most MSTL methods in MI-BCI combine all data in the source subjects into a single mixed domain, which will ignore the effect of important samples and the large differences in multiple source subjects. To address these issues, we introduce transfer joint matching and improve it to multi-source transfer joint matching (MSTJM) and weighted MSTJM (wMSTJM). Different from previous MSTL methods in MI, our methods align the data distribution for each pair of subjects, and then integrate the results by decision fusion. Besides that, we design an inter-subject MI decoding framework to verify the effectiveness of these two MSTL algorithms. It mainly consists of three modules: covariance matrix centroid alignment in the Riemannian space, source selection in the Euclidean space after tangent space mapping to reduce negative transfer and computation overhead, and further distribution alignment by MSTJM or wMSTJM. The superiority of this framework is verified on two common public MI datasets from BCI competition IV. The average classification accuracy of the MSTJM and wMSTJ methods outperformed other state-of-the-art methods by at least 4.24% and 2.62% respectively. It’s promising to advance the practical applications of MI-BCI.

B RAIN-COMPUTER interface (BCI) is a direct interactive system between the brain and the outside world that can convert brain signals into control instructions to interact with external devices [1]. Due to the high temporal resolution and good portability, electroencephalogram (EEG) is the most commonly used control signals for BCIs [2]. Motor imagery (MI) is a popular active BCIs paradigm where users image the movements of their body parts but don't execute them. Since MI is a spontaneous brain activity and doesn't require external stimuli, it has been widely studied and applied in stroke rehabilitation and online BCI game fields [2].
Nevertheless, the large inter-subject variability in brain patterns will hinder the widespread use of BCI devices. Because the data distribution change caused by individual differences will significantly degrade the decoding performance of MI-BCI [3], [4], [5], [6]. Typically, a 20-30 minutes system calibration phase needs to be done at the beginning, aiming at acquiring sufficient labeled samples to train a subjectspecific model for a new user [7]. It's time-consuming and fatiguing for users. Whereas if the calibration is reduced, the available labeled samples are limited, resulting in poor decoding performance [8]. Thus, developing a reliable intersubject system that can shorten calibration time and maintain satisfactory performance is highly desirable in the practical application of MI-BCI [9].
One promising way to reduce individual differences and data requirement is transfer learning (TL) [4], [10]. It can transfer shared knowledge across different subjects, and use some existing data to alleviate the limitation of insufficient data of target subjects [4], [11]. For cross-subject transfer, the multiple existing subjects are usually called the source domains, and the current new user with few or without labeled samples is called as the target domain. Several related works have attempted to reduce individual differences by considering the information on different levels of instances, features, and models [5], [6], [12].
In recent years, some studies have proved that data of multi-source subjects can obtain better accuracy than single subject [6], [13], since it can expand the available data and increase data diversity, which would facilitate the transfer model to learn more generalized and robust representations [14], [15]. Thus, some researchers have adopted multi-source transfer learning (MSTL) to reduce individual differences in MI-BCI field. Multi-source fusion adaptation regularization (MFAR) is based on Euclidean alignment via rest-state knowledge (EARK) [16]. Cai et al. integrated two Riemannian manifolds into a TL framework to align marginal distribution and joint distribution simultaneously, named manifold embedded TL (METL) [17]. Besides that, Liang et al. proposed a multi-source fusion TL (MFTL) algorithm [18]. Reference [7] presented two TL methods, where multi-source joint domain adaption (MJDA) worked in Riemannian space, while multi-source joint Riemannian adaption (MJRA) worked in Euclidean space.
To sum up, most of these inter-subject transfer works can be summarized into two categories: one is to select the most similar subject as a source domain to assist the target subject task, and the other is to directly combine multiple source domains of other existing subjects into a mixed domain [15]. The former does not make full use of available data and needs to find the optimal golden subject as the source domain [19]. The latter treats each instance and each source domain equally, which ignores the large difference in multiple subjects, a common nature in physiological signals [20], [21]. It would cause negative transfer and degrade the MI decoding accuracies, even result in worse results than using one appropriate subject as the source domain sometimes [13]. In fact, an effective transfer method should assign a large weight to vital instances in the source domain [22]. It is the same true for source domains with more similar distribution [23]. In general, distribution shift exists not only between each pair of source and target domains, but also exists on different source domains, so directly combining multiple domains may influence each other during knowledge transfer [14]. Thus, how to effectively use the rich information from multiple sources is important to MSTL.
To resolve the above problems, this paper propose an inter-subject MI-BCI framework to reduce individual differences, as shown in Fig. 1. The hypothesis is that, despite some differences, the stable and consistent patterns still exist across subjects [24], [25]. A simple yet effective TL method, transfer joint matching (TJM) [22], is introduced into the MI decoding task to address the problems of few labeled trials available for a new subject and inter-subject variability. The framework adopts a semi-supervised domain adaptation (SSDA) setting [26]. It assumes that individual differences lie in multiple levels, including instance and feature levels.
TJM jointly performs two TL methods, instance reweighting and feature matching alignment, in a principled dimensionality reduction procedure. It's suitable for EEG data with noises and high-dimensional features [27], [28]. However, it works only in a single source to the target, so we extend it to a multiple sources condition. The main contributions of this paper are as follows.
• We improve TJM to multi-source methods, referred to as MSTJM and weighted MSTJM (wMSTJM). They can effectively utilize the information of multiple subjects to overcome the lack of new subject data and consider the large differences in multiple source subjects.
• We propose an inter-subject MI-BCI framework to reduce inter-subject variability based on MSTJM or wMSTJM. It could reduce the effects of individual differences and task-irrelevant instances. In addition, only ten calibration data are needed to select important source subjects whose data distribution is similar to that of target subjects.
• The superiority of this framework is verified on two public MI datasets of BCI Competition IV. The classification accuracies can achieve 85.53% and 82.69% respectively, superior to most state-of-the-art (SOTA) methods. The rest of this paper is organized as follows: MI-BCI related works are briefly reviewed in Section II. Then we describe the inter-subject framework based on MSTJM and wMSTJM. Experimental settings and results are shown in Section IV. The following are discussions and conclusion.

II. RELATED WORK A. Methods on Inter-Subject Variability
To cope with the inter-subject variability and construct a generic MI decoding model, common spatial pattern (CSP) is a commonly used spatial filter method in MI-BCI, and there are many variants, such as regularized CSP [29], and filter bank CSP [30], since the spatial information and frequency band are important for decoding MI. When performing MI, it will typically elicit a decrease in mu and beta rhythms contralateral to the movement [24], [25]. Another commonly used method in recent years is based on deep neural networks, making use of good learning capacity [24], while it requires many labeled training samples. The performance of MI decoding is limited when only a few labeled calibration data can be obtained from the current user at the beginning. Besides that, a series of MI decoding algorithms have been proposed, such as selecting the optimal frequency band or channels [31], [32], and adding physiological information [15]. However, most of them ignore the distribution shift between training and test samples, resulting in a degradation of decoding performance [33]. Thus, it's highly necessary to find a way to achieve high decoding performance with limited calibration data.

B. TL on Inter-Subject Variability With Limited Data
TL can not only bridge the gap in the data distributions among different subjects, but also compensate for the lack of labeled data by leveraging data of other existing subjects [34]. Thus, in recent years, TL has been widely applied in BCI [4], [34]. According to the learning strategies, TL in MI-BCI can be divided into 3 categories, including instance-based, feature-based, and model-based transfer [11], [18].
The first type of instance-based TL usually selects important instances which have more impact on the parameter estimation or weights instances according to the distribution similarity, such as active transfer learning [12]. While it cannot handle task-irrelevant instances or noises when randomly selecting instances [17].
The second feature-based TL maps the features of both domains into a common latent space or feature distribution matching. Some widely used TL methods have been introduced into MI-BCI, such as transfer component analysis (TCA) [35], balanced distribution adaptation (BDA) and weighted BDA (WBDA) [36]. Another widely used method is based on Riemannian manifold, since the congruence invariance property of Riemannian metric can reduce the influence of brain's volume conduction effect [6], [37]. Typical works are the minimum distance to Riemannian mean (MDRM) classifier [37], Riemannian geometry alignment (RGA) [5], and manifold embedded knowledge transfer (MEKT) [6]. Moreover, Wu et al. proposed Euclidean space data alignment (EA) [38], and covariance matrix centroid alignment (CA) [6] to simplify the calculation of geodesic. However, most studies perform TL on a certain level. Long et al. proved that jointly performed feature matching and instance reweighting would be more effective when there are large differences [22].
The last model-based TL focus on sharing parameters. Pre-training model and fine-tuning it on a few labeled instances is one of the most popular methods [24], but [39] suggested that the effect of fine-tuning is not obvious when the distribution shift is large.
Thus, we adopt CA as a preprocessing step, then TJM integrates feature distribution matching and instance reweighting into a unified framework to reduce individual differences.

III. THE PROPOSED METHOD
Our inter-subject MI-BCI framework based on MSTJM or wMSTJM will be presented in this section. It consists of 3 modules, as illustrated in Fig. 1. In the first modules, we adopt CA as preprocessing to align the centroid of the covariance matrix of source and target trials by taking Euclidean mean as a reference matrix, which can minimize marginal probability distribution shift and whiten EEG into an approximate identity matrix. The second module is source selection after features Tangent Space Mapping (TSM). The source subjects similar to the target subjects are selected by a few calibration data (ten trials). In the third module, MSTJM or wMSTJM is utilized to reduce the data distribution shift among different subjects and the influence of task-irrelevant instances or noise. Then the results of multi-source transfer learning models are fused in the decision-making stage, by majority voting rules or weighted voting rules. The details will be described below.
where E and T s are the number of electrodes and time points, and y i ∈ R C is the corresponding label for X i of C classes. The target domain is consists of a few labeled samples in D T l and many unlabeled samples in D T u . These few labeled samples in the target domain are also called calibration data in MI-BCI from the new subject. A feature space X and a marginal probability distribution P(X) form a domain D, i.e., D = {X , P(X)}, where X ∈ X .
2) Definition 2 (Task): A task T consists of C label set Y and a modle f (x) to learn a relationship after given domain , it also can be explained from the perspective of conditional probability distribution. In our experiments, only binary classification problem is considered, i.e., C ∈ {−1, +1}. The goal of TL is to predict y t ∈ Y t using data in D S and D T l by the constructed task-specific classifier H : X t → y t .
3) Problem Setting: Under SSDA setting, besides a labeled source domain D i = (x 1 , y 1 ) , . . . , x n s , y n s , there are a small number of labeled calibration data and many unlabeled instances in the target domain, i.e., D T = D T l ∪ D T u , where D T l = x n s +1 , . . . , x n s +n l , and D T u = x n s +n l +1 , y n s +n l +1 , . . . , x n s +n l +n u , y n s +n l +n u . The aim is to reduce the inter-subject variability in a new feature representation space. All notions are summarized in Table I.

B. Covariance Matrix Centroid Alignment
In order to make use of the congruence invariance property of Riemannian metric while reducing computing overhead, we adopt CA as preprocessing and Euclidean mean as a reference matrix to minimize marginal probability distribution shift among different subjects [6].
For i-th EEG trial in the source or target domains, the sample covariance matrix (SCM) is where T s is time points of each trial and P i ∈ R E × E . Since the SCMs are SPD matrices, thus that each of them can be regarded as a point in Riemannian manifold space M [5], [37]. Then the Riemannian distance of two point P 1 and P 2 is: , where log is the logarithm for the eigenvalues of P −1 1 P 2 . Euclidean mean M E , i.e., the arithmetic mean, is used to calculate distribution distance. The Euclidean distance of two points is δ E (P 1 , P 2 ) = ∥P 1 − P 2 ∥ F on M. Then the Euclidean mean of SPD matrices can be defined by With Euclidean mean M E , the SCMs are aligned by It's the same to the target domain samples, so we can obtain .
Two desirable properties of CA can be used to align the data distribution [6]: 1) Minimization of marginal probability distribution shift. 2)EEG trial whitening.
According to the nature property of Riemannian manifold, including congruence invariance, we have with Gl(n) = {W ∈ M} the set of invertible matrices. These properties are very important in MI-BCI, because it means that the distance between two SPD matrices is invariant to a change of reference matrix [37]. So we can perform some operations, such as PCA, on this space without affecting the distance.
When the reference M r e f = M E −1/2 is adopted, The arithmetic centers of different domains will be approximated as an identity matrix. Thus, the data distribution of different subjects are brought closer, and the aligned SCMs of EEG trials after CA is equivalent to whitening on the manifold.

C. Source Selection After Tangent Space Mapping
Following the CA is Tangent Space Mapping (TSM), which can transform the operation on Riemannian manifold into a Euclidean tangent space. A SPD matrix P i is in manifold space, while by using TSM, it will be converted to a vector . TSM can be denoted as Here, upper(.) operator represents taking the upper triangular part of an SPD matrix while vectorizing it. The unity weights are applied to diagonal elements and √ 2 are assigned to off-diagonal elements [37].
The tangent space consists of a set of tangent vectors at a point P i on Riemannian manifold. When two conditions are met: 1) P i is embedded in the local space of manifold; 2) P is the mean of the P i , the distance in tangent space is approximately equal to that in Riemannian manifold space. Further details can be found in [37] and [40].
Then, source subjects similar to the target subject are selected to avoid negative TL and reduce the computation overhead. After TSM, the source subjects selection can be done in Euclidean space with accuracy. The SVM is trained on the TSM features x S of each source subject to get base classifier h : x S → y s . The vectorized calibration data x T, j are utilized to evaluate the similarity directly by the classification accuracies. We hold that if the model has higher accuracy, the corresponding subject in the source domain may be more similar to the subject in the target domain, so the data of them can be mixed as training samples to train the classification model of target domain when there are limited data.

D. Multi-Source Transfer Joint Matching
The third module is the MSTJM or wMSTJM approaches for domain adaptation, which combines both MSTL and decision fusion into a uniform framework, as shown in Fig. 2. It's composed of two main steps: 1) distribution alignment separately according to each selected subject in the source domain; 2) decision fusion to integrate multiple results of base classifiers. The difference between MSTJM and wMSTJM is whether the decision fusion process considers the different weights. The scheme for distribution alignment before fusion instead of integrating before alignment, can consider the difference between source subject and target subject, but also the difference in different subjects in the source domains.
1) Distribution Alignment Separately: TJM is introduced to handle the difference in two levels, including samples-based and feature-based differences, which transfers knowledge by jointly implementing instance reweighting and feature distribution matching to construct domain-invariant features in the subspace [22]. For adaptive instance reweighting, structured sparsity penalty, ℓ 2,1 -norm is performed on instances in the source domains. The features are mapped into a reproducing kernel Hilbert space (RKHS) H by "kernel trick", and then feature distribution is matched by minimizing the maximum mean discrepancy (MMD) between the source domain and target domain. Principal Component Analysis (PCA) is used to combine these two operations since the reconstruction error of the input data can be minimized to learn a new feature representation in this dimensionality reduction process.
Given the number of samples n = n s + n l + n u , and the feature dimension m, the input data matrix can be denoted as X = [x 1 , . . . , x n ] ∈ R m × n . Through kernel mapping ψ(X) = [ψ (x 1 ) , . . . , ψ (x n )], we can calculate the kernel matrix by K = ψ(X) T ψ(X) ∈ R n × n . The kernelize PCA can be denoted as where A ∈ R n × k is the mapping matrix to realize kernelize PCA. The new representation is embedded in Z = A T K. MMD is adopted to measure the distance between Kernel-PCA representations for comparing different distributions in the RKHS. MMD matrix M can be computed by Then MMD between source and target domain is Eq. (7) is maximized by minimizing Eq. (9), such that statistics of feature distributions in the first-and high-order are matched under the new representation Z = A T K .
Although feature matching can partly minimize the distribution shift among subjects, there are also differences caused by non-stationary EEG and task-irrelevant noises. The instance reweighting method is used to further deal with these differences. The structured sparsity regularizer, ℓ 2,1 -norm is applied to the transformed matrix A. By introducing row-sparsity to each instance, so the instances are reweighted as follows: where A s := A 1:(n s +n l ) ,: is the transformed matrix for source instances, and A u := A n s +n l +1:n s +n l +n u ,: is the transformed matrix for the target instances.
By integrating Eq. (7) and Eq. (10), the optimization problem can be expressed as: where λ is the balance parameter to weigh the importance of feature matching and instance reweighting. By solving the transformation matrix in the new representation Z = A T K, the discrepancy between different subjects can be reduced.

2) Decision Fusion of Base Classifiers:
After TJM, the test data of target subject is transformed into a new subspace, and the distribution of them is relatively consistent with the corresponding single subject in the source domains. Then the key issue is how to integrate the new represents of test data from multiple TL models. We adopt parallel architecture to construct base classifiers of each subject in the source domains, and then the final decisions of new test samples from these base learners are fused by majority voting or weighted voting, corresponding to MSTJM and wMSTJM.
For MSTJM, suppose labeled source domains D S = {D 1 , D 2 , . . . , D N }, and binary classification problem, i.e., C ∈ {−1, +1}. The goal is to learn a more robust and accurate classifier H = {h 1 , h 2 , . . . , h N } by integrating multiple base classifiers h : z S → y s to predict y t using data in D S and D T l . The base classifiers are constructed by SVM classifiers, using data of each subject in the source domains. Given a sample x and base classifier h i , the prediction of each classifier is ). Here, h j i (x) represents the output of h i on class label C j . Then the output label is obtained by the most voted class through the majority voting method by Eq. (12), which could achieve a good performance [41].
For wMSTJM, weighted voting is utilized to ensemble multiple outputs by considering the unequal similarity of different subjects. After constructing base models, a few labeled calibration data x T, j in the target domain D T l are used to measure differences between each pair of source subject and target subject. We intuitively believe that the higher the accuracy, the more similar the two subjects are. Thus, higher weights are endowed to the corresponding subjects, and the integration weights are learned adaptively for each target subject to cope with subject variations. In general, the voting weights should be normalized. The final output is the label with the highest voting class through the weighted voting method by Eq. (13). If the voting scores of two classes are the same, randomly select one as the final label.

A. Datasets and Preprocessing
Two common public MI datasets were used in the experiments, BCI competition IV 1 dataset 1 and dataset 2a, as [6], [16]. Recording paradigms of these two datasets are similar, displayed in Fig. 3 For both datasets, a 8-30Hz bandpass filter was used to remove artifacts and help to analyze the characteristics of Mu and Beta rhythms [6], [24], [25]. The time interval of each trial EEG data was segmented between [2.5, 5.5] seconds.

B. Experimental Settings
To estimate our subject-independent MI-BCI framework based on MSTJM and wMSTJM, we implemented the leave one-subject-out cross validation (LOSOCV) paradigm on multi-to-single (MTS) transfer tasks as [17] and [6]. Here, we took each subject as the current target domain to test the model in turn, and other existing subjects as multiple source domains mixed a few calibration trials of a target subject to train the base models. For dataset1, there were seven MTS tasks, and dataset2a included nine MTS tasks. The mean value of the classification accuracy of each target subject was used as the evaluation metric.
where D t is the test data in the target domain, y(x) is the truth label of x, and y(x) is the predicted label.

C. Baseline Algorithms
The proposed MSTJM and wMSTJM algorithms were compared with many algorithms, including classical CSP, 1 https://www.bbci.de/competition/ several SOTA Riemannian manifold relevant methods, and MSTL algorithms for MI decoding. Some common used TL methods were also considered into our experiments, such as TCA, WBDA, JDA, CORAL, JGSA and GFK, combined with different preprocessing steps, such as EA, EARK or CA.
• CSP-LDA, a typical decoding method in Euclidean space. • EA-CSP-LDA. EA is Euclidean alignment by task-state knowledge as a preprocessing step [38].
• S-STM (supervised style transfer mapping) [48], a similar MSTL method with two different prototypes, nearest prototype and Gaussian model.
• Semi-STM (semisupervised style transfer mapping) [48], using both labeled calibration data and unlabeled data in the target domain to learn STM.

D. Parameters Details
For CSP, we used three pairs of CSP variances to form 6 features as [49]. The parameters were set as original papers for other methods. In our MSTJM and wMSTJM, the number of calibration data was set to 10 depending on the aim of our study and experience, which will be discussed later. The weights for wMSTJM directly came from the sorted accuracies of calibration data to adapt to the subject variants. The regularization parameter and subspace bases were λ = 0.01, k = 300 for dataset1, and λ = 0.01, k = 200 for dataset2a. In addition, the number of iterations and kernel type were fixed as T = 10 and 'RBF'. Note that the parameters were the same in both algorithms. In addition, since data distribution between the source domain and target domain were different for different subjects, tuning optimal parameters by crossvalidation is not realistic, so we empirically searched the parameter space to obtain the optimal parameters as [13], [22]. λ searched in the range of λ ∈ {0.01, 0.1, 1}, and k searched in the range of k ∈ {20, 50, 100, 200, 300}. When the highest average accuracy achieved on each dataset, the optimal parameters were fix up, then the experiment repeated ten times to avoid randomness.

E. Results of MSTJM and wMSTJM
The average accuracies and standard deviation of different target subject in turn are illustrated in Table II. Bold indicates the optimal result and underlined indicates the suboptimal result. The results of MSTJM and wMSTJM achieved 85.53 ± 10.80% and 84.03 ± 11.69% in dataset1, and those of dataset2a can achieve 82.69 ± 10.03% and 80.95 ± 11.17% respectively. It's clear that our algorithms outperform all the other SOTA algorithms in both datasets. The average accuracy of the MSTJM and wMSTJ methods in two data sets were 4.24% and 2.62% higher than the suboptimal result. Compared with MEKT [6], which has similar preprocessing steps, our MSTJM and wMSTJM could obtain better results, especially in dataset2a. Meanwhile, CA-JDA [47], S-STM [48] and METL with JDA [17] showed good results on dataset1. S-STM and semi-STM had suboptimal results in dataset2a, which also regarded each subject as a source domain to reduce the difference in the source domains. The main difference between them lies in how to transfer knowledge in different subjects. All results demonstrated the necessity of considering the differences in the source subjects, and the effectiveness of MSTJM on EEG data with large differences.
To further interpret the results, Table III and Table IV list the details of each subject when they were taken as the target domain. 'S1-S7' represents 7 subjects in dataset1. The results are consistent with [5] and [16]. There are obvious individual  IV  THE MEAN ACCURACY (%) OF DATASET2A WHEN EACH SUBJECT  IS TAKEN AS THE TARGET DOMAIN IN TURN   TABLE V  SIX EXPERIMENTAL SETTINGS AND MEAN ACCURACIES(%) OF ABLATION EXPERIMENTS differences in dataset1 and dataset2a, where the results of S2, S5 in dataset1 and those of S2, S4, S5 in dataset2a are relatively worse than other subjects. These subjects who are not proficient in MI are called 'Bad subject' [5] or 'BCI illiteracy' [3], since they are unable to control the BCI equipment well. Although many other subjects' data were used for knowledge transfer, the results of our MSTL methods were relatively limited. Thus, it is necessary to develop more advanced TL methods to overcome individual differences.

V. DISCUSSIONS A. Ablation Experiments
To figure out which part of the framework works, we did ablation experiments. The six experimental settings are shown in Table V, abbreviated as C1-C6. The source-combined scheme means that all data in the multiple source domains were combined into a single mixed source. While the multisource scheme considers each existing subject as an independent source domain to transfer knowledge individually. Note that the scheme C5 and C6 correspond to MSTJM and wMSTJM.
The mean accuracies of each target in ablation experiment are shown in Table V. Obviously, the results of C5 and C6 were much better than those in other cases, especially those without TL. Compared with C1 and C2, C4 and C5, transferring each source domain individually gave better results than combining all source domains. This is consistent with our hypothesis that the data distribution of source subjects is different, thus each source need to process separately.
In addition, compared with C2 and C3 in the setting without TL, weighted voting could improve the decoding performance than majority voting. However, compared with C5 and C6 in the setting with TL, the mean accuracy of the weighted scheme in both datasets were worse than those of majority voting. Further analyzing the results of Table III in dataset1, the result of 'S5' in the wMSTJM was far worse than MSTJM, about 5%. It's the same for dataset2a in Table IV, especially for 'S5' and 'S7'. The weighted scheme is not always effective against those 'Bad subjects'. This is because the weights are directly from the standardized accuracy of 10 calibration data of target subject, while the data distribution of them is significantly different. Moreover, compared with C1 and C2 or C3, the multi-source scheme was superior to the sourcecombined scheme. It's the same to C4, C5 and C6.
To sum up, the effectiveness of MSTJM and wMSTJM were verified by these ablation experiments. The performance improvement stems not only from considering the differences between the source and target domains, but also from considering the differences between different source domains. However, when the data distribution is quite significant, this weighting method will fail. It is worth noting that Eq. (13) holds only when the outputs of the base models are independent of each other [41]. But in our experiments, the base models established for each subject seem to be relatively independent, but all of them are used to decode the same MI decoding problem, which means that there is a strong correlation between the outputs of these base models, so they do not meet the independence hypothesis. Therefore, in practical MI application, the results from weighted voting cannot be guaranteed to be superior to that of majority voting method. A more adaptive and better weighting method is needed for MI-BCI.

B. Visualization of Feature Distribution
To further explain the effect of MSTJM and wMSTJM on distribution, we did feature distribution visualization based on t-SNE (t-distributed stochastic neighbor embedding) [50]. The data distribution of dataset1 is intuitively illustrated Fig. 4, when data of subject1 were used as a target. The SCM features were vectorized by TSM to form a 1 × 1770 feature, then it was reduced to 300 after MSTJM. The dimension of all features was further reduced to two by t-SNE.
In Fig.4(a), the data spatial distribution of left hand MI for each subject is completely independent. There is an obvious noise point, marked with a green box, which is away from the cluster center for target subject1, abbreviated as T1. After MSTJM, the effect of noise point is not obvious, and the distribution of different subjects is relatively reduced. The data of T1 and S2 are almost overlapped, which indicates that after MSTJM, the distribution of the two subjects' data is relatively close. We also checked the corresponding weights of calibration data for S2, and it showed the highest accuracy. Three clusters after MSTJM illustrate that only some data are similar to the target data, so it's necessary to select source domains and data. It's the same for the MI tasks of right hand.
Moreover, the data distribution of two classes for only two subjects is also exhibited in Fig. 5, when data of subject1 were used as target domain and subject2 as source domain. The data distribution of the subjects in the source domain and the target domain is reduced after MSTJM, which means the individual difference is reduced. In addition, despite some mismatching data points, the data from different classes in the target domain can be separated more easily which is conducive to classification. For a better classification performance, it is not only necessary to reduce the intra-class distance, but also to increase the inter-class distance [51]. Thus, MSTJM and wMSTJM can further advance classification.

C. Effect of Centroid Alignment
We also used the imagesc to visualize the SCM before and after CA, as illustrated in Fig. 6. As explained in Section III, when using Euclidean mean as the reference matrix, the aligned SCM after CA is approximate to the identity matrix. The results in Fig. 6 verified this property of CA. Here, we took the first EEG trial of subject2 in dataset1 and dataset2a as examples. The left and right columns represent the raw SCM and the aligned SCM, respectively. It is obvious that the diagonal elements of SCM are close to 1, while offdiagonal values are around 0, thus that EEG whitening was approximately achieved by CA. This property of CA has been proven before. The marginal probability distribution shift of EEG trials will be minimized simultaneously for each subject.
Note that CA is used as a preprocessing step, which is similar to other Riemannian manifold methods, such as RGA [5], EA [38], EARK [16] and so on. Moreover, we can also directly match the covariance by calculating the distance between two matrices [52]. The effectiveness of Riemannian manifold in the practice of BCI has been verified in the above researches and many other works. It may become a standard paradigm for EEG data preprocessing in the future.

D. Effect of Knowledge Transfer With Limited Data
To further analyze the effect of subject transfer, and confirm our hypothesis that TL can facilitate MI decoding with limited data of the target subject, we compared our MSTLs with two basic methods without TL.
• Baseline 1, SVM was trained using only the labeled calibration data of target subject, and then the model was tested using the unlabeled data of target subject.
• Baseline 2, the training samples were mixed data from calibration data of target subject and labeled data of all existing subjects in the source domains. To simulate real-world scenarios when a new subject uses BCI equipment initially, the number of calibration data was set to 2, 4, 6, 8 and 10 in sequence. We also verified the effectiveness of TL simultaneously, when the number of calibration data increased from 20 to 80 and the step length increased by 10. Fig. 7 shows that our MSTLs on both datasets are better than those without TL, no matter how much labeled data of the target subjects were taken. Thus, in order to obtain better decoding performance with a few calibration data, we only used 10 instances samples of the target subject for calibration. Moreover, the result of baseline2 is better than those of baseline 1 when the labeled calibration data is insufficient. However, with labeled data increase, the results of baseline1 will exceed. This phenomenon is consistent with the results of [18] and [16]. This is because of the large differences among subjects. With the increase of target subject's data, a reliable model can be trained directly by his own data. Most traditional machine learning algorithms, such as SVM, hold the assumption of independent identically distributed (i.i.d.) [11]. When it comes to the non-i.i.d. data from different subjects, the individual variability should be reduced by TL methods. It indicates that there is inter-subject variability among different subjects, and our MSTLs can not only reduce individual differences, but also can maintain a relatively stable result with limited data.

VI. CONCLUSION AND FUTURE WORK
This paper proposed two MSTL algorithms, MSTJM and wMSTJM, to minimize individual differences under the SSDA setting, which reduces the differences at two levels of instance and feature. They first align the data distribution of each pair of subjects in the source domain and target domain, and then fuse the results of multiple TL models in the decision-making stage. They not only consider the differences between subjects in the source domains and the target domain, but also the differences among subjects in the source domains. We also construct an inter-subject BCI framework for MI decoding. This framework consists of three modules: CA, source selection, and data distribution alignment by MSTJM or wMSTJM. The results of two public MI datasets demonstrate the superiority of MSTJM and wMSTJM over other methods.
In the future, we will consider further optimizing our MSLTs from the following aspects: (1) The limited MI decoding performance may be improved by other deep TL methods [53], due to the stronger representation ability when using a larger MI dataset. (2) Domain generalization techniques could be considered in MI decoding to avoid using labeled data from the target subject. (3) In order to effectively utilize the rich information of multi-source domains, an adaptive and efficient source domain selection method should be adopted for different target subjects in practice.