Motor Imagery EEG Classification Based on Decision Tree Framework and Riemannian Geometry

This paper proposes a novel classification framework and a novel data reduction method to distinguish multiclass motor imagery (MI) electroencephalography (EEG) for brain computer interface (BCI) based on the manifold of covariance matrices in a Riemannian perspective. For method 1, a subject-specific decision tree (SSDT) framework with filter geodesic minimum distance to Riemannian mean (FGMDRM) is designed to identify MI tasks and reduce the classification error in the nonseparable region of FGMDRM. Method 2 includes a feature extraction algorithm and a classification algorithm. The feature extraction algorithm combines semisupervised joint mutual information (semi-JMI) with general discriminate analysis (GDA), namely, SJGDA, to reduce the dimension of vectors in the Riemannian tangent plane. And the classification algorithm replaces the FGMDRM in method 1 with k-nearest neighbor (KNN), named SSDT-KNN. By applying method 2 on BCI competition IV dataset 2a, the kappa value has been improved from 0.57 to 0.607 compared to the winner of dataset 2a. And method 2 also obtains high recognition rate on the other two datasets.


Introduction
Brain computer interface (BCI) based on motor imagery (MI) is used to analyze human intention by electroencephalogram (EEG) signals generated by human brain electrophysiological activity [1,2]. Based on BCI technology, exoskeletons can be used to help people with physical disabilities regain their motor ability, and BCI also has wide applications in smart home, entertainment, military, and other fields [3][4][5][6].
Common spatial pattern (CSP) is widely used in motor imagery to extract EEG features [7]. CSP has excellent performance in two classification tasks, but the drawback is that it needs a lot of electrodes [8].
Despite its short history, the use of the Riemannian geometry in BCI decoding is currently attracting increasing attention [9][10][11][12][13]. Covariance matrices lie in the space of symmetric positive definite (SPD) matrices, which can be formulated as a Riemannian manifold [14]. In the BCI field, the connections of the CSP algorithm and the tools of information geometry have been investigated, considering several divergence functions in alternative to the Riemannian distance [15][16][17][18]. Barachant et al. proposed a simple data augmentation approach for improving the performance of the Riemannian mean distance to mean (MDM) algorithm [13]. Kumar et al. propose a single band CSP framework for MI-BCI that utilizes the concept of tangent space mapping in the manifold of covariance matrices, and the proposed method obtains good results when compared to other competing methods [19]. A hierarchical MDM classifier for multiclass problem has been tested in [20].
Advanced classifiers based on the tangent space on the Riemannian manifold of positive matrices are also receiving increasing attention. Barachant et al. map the covariance matrices in the tangent space and apply feature selection and linear discriminate analysis (LDA) in the tangent space [10]. For the application of the classifier in the tangent space, the problem is that the curse of dimensionality. Traditional data dimensionality reduction methods include two categories: linear dimensionality reduction (LDR) and nonlinear dimensionality reduction (NLDR). Since most of the actual data are nonlinear, NLDR techniques such as locally linear embedding (LLE) [21], isometric mapping (ISOMAP) [22], maximum variance unfolding (MVU) [23], and t-distributed stochastic neighbor embedding (t-SNE) [24,25] are used to tackle problems widely. Lee et al. used discrete wavelet transform (DWT) and continuous wavelet transform (CWT) to extract features of MI tasks, and Gaussian mixture model (GMM) was used to construct GMM supervectors; this method accelerates the speed of training and improves the accuracy of motor imagery [26]. Sadatnejad et al. propose a new kernel to preserve the topology of data points in the feature space, and the proposed kernel is strong, particularly in the cases where data points have a complex and nonlinear separable distribution [8]. Xie et al. proposed a framework for intrinsic submanifold learning from a highdimensional Riemannian manifold; the proposed method exhibited strong robustness against a small training dataset [27].
ere is still another approach for overcoming the problem of high dimensionality in SPD manifolds. And this method maps from a high-dimensional SPD manifold to a lower dimensional one while the geometry of SPD manifolds is preserved. And there are only two works of this way. Davoudi et al. [14] proposed distance preservation to local mean (DPLM) as dimensionality reduction technique, combined with FGMDM, the best performance of this article in terms of kappa value is 0.60. Harandi et al. [28] learned a mapping that maximizes the geodesic distances between interclass and simultaneously minimizes the distances between intraclass, and it is done via an optimization on Grassmann manifolds.
In this paper, we proposed a novel SSDT-FGMDRM and SSDT-KNN for the classification of multiclass MI tasks by designing a simple yet efficient subject-specific decision tree framework. Method 1 contains SSDT-FGMDRM to improve the performance of FGMDRM. For each individual, method 1 first separates the two most discriminative classes from the group. Furthermore, the remaining categories including the misclassification samples of the previous nodes are reclassified in the last node. Method 2 contains SSDT-KNN and a NLDR method named SJGDA. SJGDA combines the advantage of semi-JMI and GDA, and method 2 performed well on different datasets. e aims of this article are as follows: (1) To verify the effectiveness of the proposed SSDT framework through dataset 1 (2) To verify the superiority of SJGDA in feature extraction, compared with semi-JMI and GDA (3) To validate the generalization ability of method 2 through different datasets, in this paper e rest of the paper is organized as follows: Section 2 introduced the mathematical preliminaries of the Riemannian geometry. Section 3 discussed the proposed methods in detail. ree datasets are introduced in Section 4. e results of our work are discussed in Section 5. And in Section 6, we compared our methods with the state of the art. is paper concludes in Section 7.

Geometry of SPD Matrices
Let X i represent a short segment of continuous EEG signals, and X i can be denoted as follows: where X i corresponds to the ith trail of imaged movement starting at time t � T i . T s denotes the number of sampled points of the selected segment. For the ith trail, the spatial covariance matrix (SCM) P i ∈ R n×n can be calculated as follows: Based on the SCM, there are two ways to classify MI tasks in the Riemannian manifold.

Filter Geodesic Minimum Distance to the Riemannian
Mean.
e Riemannian distance between two SPD matrices P 1 and P 2 in P(n) is given by [29] Given m SPD matrices P 1 , . . . , P m , the geometric mean in the Riemannian sense is defined as For algorithm mean Riemannian distance to Riemannian mean (MDRM), we compute the Riemannian distance between unknown class P to the Riemannian mean point of each class and classify the unknown class into categories corresponding to the shortest distance. Inspired by the principal geodesics analysis (PGA) method [30], the literature [31] finds a set of filters by applying an extension of Fisher linear discriminant analysis (FLDA) named Fisher geodesic discriminant analysis (FGDA). And then, apply these filters to MDRM to form filter geodesic minimum distance to Riemannian mean (FGMDRM). More details can be seen from [31]. Figure 1, the SPD matrix of P is denoted by a differentiable Riemannian manifold Z. Each tangent vector S i can be seen as the derivative at t � 0 of the geodesic Γ(t) between P and the exponential mapping P i � EXP P (S i ), defined as follows:

Tangent Space Mapping. As shown in
e inverse mapping is given by the logarithmic mapping and can be defined as follows: log P P i � S i � P 1/2 log P −1/2 P i P −1/2 P 1/2 .
Using the Riemannian geodesic distance, the Riemannian mean of I > 1 SPD matrices by 2 Computational Intelligence and Neuroscience I P 1 , . . . , P I � arg min Using the tangent space located at the geometric mean of the whole set trials, P I � I(P i , i � 1 . . . I), and then, each SCM P i is mapped into this tangent space, to yield the set of m � n(n + 1)/2 dimensional vectors: Many efficient classification algorithms can be implemented in the Riemannian space [10].

Subject-Specific Decision Tree Framework.
Decision tree is a common machine learning method. Each node of decision tree can be defined as a rule. Guo and Gelfand [32] proposed classification trees with neural network, and this method embeds multilayer neural networks directly in nodes. In the decision tree, one of the most important things is to construct a proper binary tree structure; the upper nodes have the greater impact of the accuracy of the whole samples [33]. In order to solve the multiclassification problem in this paper, we constructed a subject-specific decision tree (SSDT) classification framework as shown in Figure 2 according to the best separating principle [34]. As can be seen from Figure 2, the SSDT proposed in this paper trains a different classification model at different nodes of the decision tree. e advantages of the SSDT framework are as follows: (1) is model separates the two MI tasks (e.g., C.1 and C.2) with the highest recognition rate as far as possible (2) At the last node, we reclassify some samples to enhance the classification ability of the classifier

Method 1: A Direct Classification Method Based on SSDT-FGMDRM.
Firstly, we point out one problem of the multiclass FGMDRM by using an example. Figure 3 gives a threeclass classification problem. Figure 3(a) shows the classification progress by FGMDRM. We can see that three Riemannian mean points (RMPs) are located on the manifold. Since the classification criterion is decided by the distance calculated between the test point and the RMP, it caused a wrong classification. Figure 3(b) shows the example of the classification results obtained by using the first node of the SSDT-FGMDRM framework. It can be seen that the error classification is corrected by using the decision tree framework. Method 1 is used to classify four types of MI tasks directly. e training and testing diagram is shown in Figure 4.

Feature Extraction Algorithm Based on the Riemannian
Tangent Space. In this paragraph, we propose a novel data reduction method which combines semi-JMI and GDA, namely SJGDA, to solve the dimension disaster problem after tangent space mapping.

Semisupervised Joint Mutual Information.
are unlabelled data. A binary random variable S is introduced to determine the distribution of labelled dataset and unlabelled dataset. When s � 1, we record the value of y, otherwise not. In this way, the labelled set D L comes from the joint distribution p(x, y|s � 1), while the unlabelled set D U comes from the distribution p(x|s � 0). e underlying mechanism S turns out to be very important for feature selection.
Feature selection method based on mutual information theory is a common feature selection method [35]. In these methods, we rank the features according to the score and select the features with higher scores. For example, by ranking the features according to their mutual information with the labels, we get the sort of correlation that is related to class labels. e characteristics of the score are defined as follows: where X θ represents the set of the features already selected and X k is the feature ranked by scores. Y represents the label corresponding to feature X k . Semi-JMI is a method of using a semisupervised dataset as a training set for JMI. More details can be seen from Reference [36]. In this paper, the missingness mechanism is class-prior-change semisupervised scenario (MAR-C) [37]. After feature ranking, we can obtain a feature vector as follows: where n is the length of the tangent vectors S i . Since information redundancy exists in f, we select the best vector length m (m < n) of each subject by the classification recognition rate:

Generalized Discriminant Analysis.
After variable selection, this paper uses generalized discriminant analysis Figure 1: e tangent space at point P, and the geodesic Γ(t) between P and P i . Computational Intelligence and Neuroscience 3 (GDA) [38,39], which is a nonlinear feature reduction technique based on kernels to reduce the length of the feature vectors f SJ and their redundancies. Mapping X (f SJ ) into a high-dimensional space F through a kernel function Φ: e linear Fisher decision is performed in the F space, and the criterion function for its extension is where W Φ ∈ F and S B and S W are between-class scatter and within-class scatter, respectively.
Training classifier in an OVR scheme. According to the ranking of classification recognition rate, the highest recognition rate model is placed in this node The second high recognition rate model is placed at this node The classifier is trained in an OVO scheme this node    Computational Intelligence and Neuroscience For the convenience of the numerical calculation, kernel functions are introduced to solve the problem: Gauss kernel, poly kernel, and sigmoid kernel are widely used in GDA [40]. For test data z, its image Φ(z) in F space projects on W Φ is as follows: is paper uses ploy kernel to reduce the dimension. After GDA, we can get a vector f G as follows: where d of f G is decided by the actual needs, and in this paper we set d � 1. And then, SJGDA is applied to the dataset of this paper, and the final feature vectors are constructed as follows:

Method 2: SJGDA and Subject-Specific Decision Tree k-Nearest
Neighbor. Method 2 is used to classify four types of MI tasks after tangent space mapping. e training and testing diagram is shown in Figure 5.

Dataset 3.
In our own dataset, Emotiv Epoc+ is used to collect EEG data of motor imagery. It is a portable EEG acquisition device with a sampling rate of 128 Hz. It has fourteen electrode channels (AF3, F7, F3, FC5, T7, P7, O1, O2, P8, T8, FC6, F4, F8, and AF4), two inference electrodes (CMS and DRL), and the electrode placement follows the international 10-20 standard. Equipment and the Emotiv 14 electrodes are located over 10-20 international system positions as shown in Figure 6. is experiment collected three kinds of EEG signals of one joint: imagination of shoulder flexion (F), extension (E), and abduction (A), as shown in Figure 7.
Seven subjects participated in this experimental study. ese subjects were in good health. During the experiment, subjects were naturally placed with both hands, trying to avoid body or head movement. During the experiment, subjects carried out motor imagery under the outside cue, a single experiment collected EEG signal for 5 seconds, and then took 5-7 seconds to have rest, each action repeated acquisition 20 times. e experimental process is shown in Figure 8.

Results of Method 1.
We use SSDT-FGMDRM to classify multiclass MI tasks as introduced in Section 3.1. Since there are four classes, we can have four pairs of MI tasks: left vs rest (L/ RE), right vs rest (R/RE), foot vs rest (F/RE), and tongue vs rest (T/RE). For each subject, the pair with the highest accuracy is used to train N.1, and the pair with the second highest accuracy is to train N.2. Table 1 gives the ten-folder crossvalidation results obtained using FGMDRM in OVR scheme. Table 2 displays the kappa values obtained by method 1. Compared with other methods, five subjects (A03, A06, A07, A08, and A09) achieved higher kappa value of nine without exploring the frequency domain information by method 1. In the case of fixed frequency window, we have improved the mean kappa value of 0.069 than MDRM (p � 0.4683), and 0.139 than FGMDM_fixed (p � 0.1423). Our approach also shows significant improvement than FGMDM (p � 0.6607), which has exploited subject-specific frequency information, in terms of the kappa value of 0.039.

Results of Method 2.
e results in Figure 9 show the T/ RE feature distribution of the five features of subject A09. Figure 9(a) shows the first five ranked features with semi-JMI. After applying the semi-JMI, the first five best features extracted have shown statistically significant improvement in the separability with p values <0.05 except feature 2 with p value 0.77. In Figure 9(b), the first five features extracted from primitive feature vectors with p value of 0.13, 0.05, 0.87, 0.05, and 0.13. e p values indicate that the pair T/RE have no significance in the primitive feature vectors. e results show that with our semisupervised feature ranking algorithm, the separable degree of the feature has been greatly improved. Figure 10 shows the evolution of the classification accuracy with KNN (k � 5 in this paper) against the number of ranked variables in OVR scheme. L/RE and T/RE are the two pairs with the highest recognition rate, and they achieved the highest recognition rate in 100 variables. But this is still a curse of dimensionality for classifiers; GDA is used to analyze the first 100 sorted variables in our study.
As the separation of characteristics cannot meet our requirements, GDA is used to get more obvious variables.    Computational Intelligence and Neuroscience Figure 11 illustrates distributions for the first five most discriminant variables with GDA and semi-JMI. It can be seen from Figure 11 that L/RE is separated equally well by using GDA. Table 3 displays ten-folder cross-validation results obtained using SJGDA and KNN in OVR scheme. It can be seen that the vectors which are mapped to the tangent space have better classification performance than that in the Riemannian manifold directly. Table 4 presents the results obtained by SJGDA in pairwise way for multiclass MI tasks. We have six pairs of MI tasks: left and right (L/R), left and foot (L/F), left and tongue (L/T), right and foot (R/F), right and tongue (R/T), and foot and tongue (F/T). Table 5 displays the comparison of classification accuracy using SJGDA and KNN for L/R task in 10-folder cross validation. References [8,[43][44][45] contain the classification of    Computational Intelligence and Neuroscience other publications. We have improved the accuracy compared with Reference [44] (p � 0.85) and Reference [45] (p � 0.45). Gaur et al. [43] (p � 0.95) explored the specific frequency information for each subject, and Sadatnejad and Shiry Ghidary [8] (p � 0.90) used a novel kernel for dimensionality reduction which is similar to SJGDA. Although the results in the paper are not as high as those in Reference [43], it can be concluded that there is no difference between the results in Reference [43] and those in this paper because of p � 0.95. Table 6 presents the results in terms of the kappa value. e proposed method 1 achieved a mean performance of 0.589 which ranks this method to the first place of the competition. And with our proposed method 2, we have achieved a mean performance of 0.607, which makes method 2 to acquire the best performance of the state of the art.
Dataset 2 is used to verify the effect of method 2, and the classification results are given directly in this paper. e results are shown in Table 7. As can be seen from Table 7, method 2 obtained the second highest recognition rate in the comparative literature. Compared with the recent reference [47], method 2 achieved good classification results.

Results of Dataset 3.
Dataset 3 is used to evaluate the performance of method 2. Figure 12 shows the classification error with KNN against the number of ranked variables in OVR scheme. A/RE and F/RE are the two pairs with the lowest classification error, and they all achieved the highest recognition rate within 60 variables. In this paper, the first 60 ranked variables are used for the next analysis. Figure 13 displays 5-folder cross-validation results obtained by using SJGDA and KNN in OVR and OVO scheme.
is Figure 13(a) illustrates three possible pairs of MI tasks (F/RE, E/RE, and A/RE) for each subject. It can be learned from the figure that flexion and abduction are the easiest movement to distinguish in six subjects of seven, and the six subjects are S1, S3, S4, S5, S6, and S7. However, due to individual differences, the highest recognition rate of each subject is different.
We also compared three possible pairs (F/E, F/A, and E/ A) in OVO scheme of seven subjects. Figure 13(b) depicts the comparison results for each subject, and it can be seen that the pair of F/A obtained the highest recognition rate in seven subjects. Combined with the analysis results of Figures 13(a) and 13(b), it can be considered that flexion and extension are more obvious in the three MI tasks.
As SJGDA is a new method proposed in this paper, we also compared the feature distribution of SJGDA, GDA, and semi-JMI to illustrate the effectiveness of SJGDA. Figure 14 depicts the feature distribution of F/E MI tasks of seven subjects. e blue and red circles represent the two different feature classes. As shown in Figure 14, the F/E MI tasks learned by SJGDA have high separability than GDA and semi-JMI. e performance of the proposed method 2 is evaluated by using classification accuracy. Since there are three classes, the chance level is 33.33%. Figure 15 demonstrates that the proposed method achieves higher performance for six subjects (S1, S2, S3, S4, S5, and S6) out of seven except S7 compared to semi-JMI and GDA methods. In addition, it also can be seen that GDA obtains a better classification accuracy for four subjects of seven (S1, S2, S5, and S7) compared with semi-JMI. e reasons for this phenomenon can be attributed to as follows: In the process of feature selection, we manually select feature dimensions suitable for classifiers, which results in partial information loss. As a feature dimensionality reduction technique, GDA is suitable for the preservation of useful information from the primitive vectors. And the proposed method SJGDA in this paper not only preserves the advantages of GDA but also adds some

Discussions
In this paper, we proposed a novel SSDT framework combined with classifiers to improve the performance of classifiers for multiclass MI tasks. We also proposed a novel NLDR method named SJGDA, and this NLDR method performs better than both semi-JMI and GDA on different datasets. In the following paragraphs, we have discussed the two methods in detail. Method 1 indicates the drawback of FGMDRM, and then the novel SSDT framework is used to improve the accuracy for each individual. As shown in Table 2, compared with other published results, method 1 gets a quite good result in the case of processing the EEG signals of fixed frequency segment (8-30 Hz).

Computational Intelligence and Neuroscience
As shown in Table 6, Gaur et al. [43] proposed SS-MEMDBF to select the subject-specific frequency to obtain enhanced EEG signals which represent MI tasks related to µ and β rhythms, then classification with the Riemannian distance directly. TSLDA was proposed by Barachant et al. [10], and the covariance matrices are mapped onto a higher dimensional space where they can be vectorized and treated as Euclidean objects. Ang et al. [46] is the winner of the competition, FBCSP and multiple OVR classifiers were used for MI tasks, and achieved the mean kappa value of 0.57. Sadatnejad and Shiry Ghidary [8] proposed a new kernel for NLDR over the manifold of SPD matrices, the kappa value is 0.576. Davoudi et al. [14] considered the geometry of SPD matrices and provides a low-dimensional representation of the manifold with high-class discrimination, and the best result of this method in terms of the kappa value is 0.60.
In method 2, SJGDA is used to get more obvious vectors from the tangent vectors, and a SSDT-KNN classifier is used to identify different MI tasks. Combined with SJGDA and SSDT-KNN, we have achieved a better performance compared with method 1 (p � 0.8632), Reference [     It is clear that the proposed method in this paper is effective for MI tasks in a BCI system. In order to prove the effectiveness of the proposed method 2, we tested it on two other datasets. As shown in Table 7 and Figure 15, method 2 achieves good classification results on two datasets.

Conclusion
e experimental results of method 1 show that the proposed classification framework significantly improves the classification performance of the classifier. e experimental results of method 2 show that the SJGDA algorithm proposed in this paper is superior to GDA and semi-JMI in feature extraction, and method 2 has the highest recognition rate in this paper. However, as the classifiers in the SSDT framework is substitutable, the focus of the next work is to combine more advanced classifiers with SSDT to increase the recognition rate of the BCI systems.
Data Availability e dataset 1 and dataset 2 used to support the findings of this study are available from http://bnci-horizon-2020.eu/ database/data-sets. e dataset 3 used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest
e authors declare no conflicts of interest. Computational Intelligence and Neuroscience 11