A Survey on Machine Learning Algorithms in Little-Labeled Data for Motor Imagery-Based Brain-Computer Interfaces

Abstract: Brain-Computer Interfaces (BCIs) have been proposed and used in therapeutics for decades. However, the need for a time-consuming calibration phase and the lack of robustness, both caused by little labeled data, restrict the advance and application of BCI, especially for BCIs based on motor imagery (MI). In this paper, we review recent developments in machine learning algorithms for MI-based BCI that may provide potential solutions to this issue. We classify these algorithms into two categories, namely, enhancing the representation and expanding the training set. Specifically, the methods for enhancing the representation of features collected from few EEG trials are based on extracting features from multiple frequency bands, regularization, and so on. The methods for expanding the training dataset include transfer learning approaches (session-to-session transfer, subject-to-subject transfer) and generating artificial EEG data. These techniques have resolved the challenges to some extent. As a developing research area, the study of BCI algorithms for little-labeled data increasingly requires advances in research on the physiological structure of the human brain and on transfer learning algorithms.

Motor imagery (MI) is a typical BCI paradigm based on ERS/ERD. ERS/ERD is a spontaneous rhythmic signal that does not require external stimulation, which makes motor imagery more advantageous than other signal patterns for disabled users. However, the supervised learning algorithms commonly used in BCI require a large amount of labeled data, as EEG signals are typically non-stationary and high-dimensional. Furthermore, EEG signals, especially ERS/ERD, are susceptible to the subject's own physical condition, so a long calibration time is required for subjects to learn to adjust the amplitude of the rhythmic signal. Since abundant labeled data are rarely available in practice [Fazli, Dahne, Samek et al. (2015)], machine learning algorithms for little-labeled data are needed to improve the performance of the BCI system. In this paper, we review approaches that focus on addressing these problems and explore techniques for processing little-labeled datasets with machine learning tools in MI-based BCI. To provide a broad overview, we divide these methods into two categories: enhancing the representation of the feature and expanding the training dataset. As shown in Tab. 1, the information and results of some algorithms, classified into different types, are presented as reported on open datasets such as the BCI Competition.

Methods of enhancing the representation of the little-labeled dataset

A typical algorithm for feature extraction in BCI designs is the Common Spatial Pattern (CSP), which has proven to be one of the most efficient algorithms for motor imagery tasks [Ramoser, Muller-Gerking and Pfurtscheller (2000)]. CSP aims at learning a spatial filter that maximizes the variance of one class while minimizing the variance of the other class [Ramoser, Muller-Gerking and Pfurtscheller (2000); Ang, Chin, Zhang et al. (2008)].
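As an illustration, the CSP filter can be obtained from the class-wise average covariance matrices by solving a generalized eigenvalue problem. The following is a minimal sketch (the function name and the trace normalization are our own choices, not taken from any cited paper):

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_a, trials_b, n_pairs=2):
    """Toy CSP: spatial filters that maximize the variance of class A
    while minimizing the variance of class B.

    trials_* : arrays of shape (n_trials, n_channels, n_samples).
    Returns a (n_channels, 2 * n_pairs) filter matrix W.
    """
    def mean_cov(trials):
        # average trace-normalized spatial covariance over trials
        return np.mean([t @ t.T / np.trace(t @ t.T) for t in trials], axis=0)

    Ca, Cb = mean_cov(trials_a), mean_cov(trials_b)
    # generalized eigenvalue problem: Ca w = lambda (Ca + Cb) w
    vals, vecs = eigh(Ca, Ca + Cb)
    order = np.argsort(vals)
    # keep filters from both ends of the spectrum: large eigenvalues
    # favor class A, small eigenvalues favor class B
    picks = np.concatenate([order[:n_pairs], order[-n_pairs:]])
    return vecs[:, picks]
```

Classification features are then typically the log-variances of the spatially filtered trials.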
However, the CSP filter is subject-specific and dramatically sensitive to noise, notably for subjects with very few training trials [Reuderink and Poel (2008)]. To address the problem of little labeled data, many machine learning algorithms have been built on CSP. Ang et al. [Ang, Chin, Zhang et al. (2008)] proposed a filter-bank-based feature selection approach that processes features in multiple frequency bands to make them more discriminative. It comprises four stages: multiple band-pass filters, spatial filtering with the CSP algorithm, selection of the CSP features, and classification based on the selected features. The first stage cuts the EEG data into multiple bands. The second stage uses the CSP filter to extract CSP features for each band. The feature selection stage is defined as follows: given a set F of d features, find the subset S ⊂ F with k features that maximizes the mutual information I(S; Ω) with the class variable Ω. The mutual information between the two variables is

I(S; Ω) = H(Ω) − H(Ω | S),    (1)

where H(Ω) = −Σ_ω p(ω) log2 p(ω) is the entropy of the class variable and H(Ω | S) = −Σ_ω ∫ p(ω | s) p(s) log2 p(ω | s) ds is the conditional entropy. This approach addresses the problem of selecting an appropriate operational frequency band for extracting discriminative CSP features, and achieves an average accuracy of 89% (see Tab. 1).
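The selection stage can be sketched with a simple histogram estimator of I(X; Ω) applied to each feature individually (a simplification of the cited approach, which scores feature subsets; all names here are illustrative):

```python
import numpy as np

def mutual_information(feature, labels, n_bins=8):
    """Histogram estimate of I(X; Omega) for one continuous feature
    X and a discrete class label Omega, in bits."""
    bins = np.quantile(feature, np.linspace(0, 1, n_bins + 1))
    x = np.clip(np.digitize(feature, bins[1:-1]), 0, n_bins - 1)
    mi = 0.0
    for xv in np.unique(x):
        px = np.mean(x == xv)
        for yv in np.unique(labels):
            pxy = np.mean((x == xv) & (labels == yv))
            py = np.mean(labels == yv)
            if pxy > 0:
                mi += pxy * np.log2(pxy / (px * py))
    return mi

def select_features(F, labels, k=4):
    """Rank the columns of feature matrix F (n_trials, n_features)
    by individual mutual information with the label; keep the top k."""
    scores = [mutual_information(F[:, j], labels) for j in range(F.shape[1])]
    return np.argsort(scores)[::-1][:k]
```

In a filter-bank pipeline, F would hold the CSP log-variance features from all bands concatenated column-wise.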
As an extension of the filter-bank idea, Suk et al. [Suk and Lee (2013)] proposed a Bayesian framework in which the spatio-spectral filter is optimized by probabilistic and information-theoretic means to extract discriminative features. It defines the frequency band as a random variable and formulates the optimization of the spatio-spectral filter as the estimation of a posterior probability density function (pdf) via Bayes' rule. After extracting features from the estimated pdf, a weighted label decision rule is computed by linearly combining the outputs of multiple classifiers. The weight of each particle is computed from its likelihood,

w^(i) ∝ p(F | β^(i)),

where F denotes a feature vector set extracted from the spectrally and spatially filtered signals, and β^(i) is a particle representing a single frequency band. The classification accuracy of this algorithm improved by nearly 15% for subjects with ordinary performance, but decreased for subjects lacking BCI efficiency. Some similar strategies, such as Sub-band CSP (SBCSP) [Novi, Guan, Dat et al. (2007)] and the Optimum Spatio-Spectral Filtering Network (OSSFN) [Haihong, Yang, Keng et al. (2011)], also decompose the EEG signal into multiple bands. Besides, stationary subspace analysis (SSA) [Von Bunau, Meinecke, Kiraly et al. (2009)] and Riemannian geometry [Barachant, Bonnet, Congedo et al. (2012)] have been used for this task. Samek et al. [Samek, Vidaurre, Muller et al. (2012)] proposed combining CSP and SSA and achieved better classification results: SSA is used to find the stationary part of the EEG data before CSP computes the spatial filter, and the results indicate that this stationary CSP outperforms other methods for subjects lacking BCI efficiency.
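The first stage shared by these sub-band methods, decomposing each trial into band-passed copies, might look like the following sketch (the 4 Hz-wide bands are illustrative, not the exact bands of any cited paper):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def filter_bank(eeg, fs, bands=((4, 8), (8, 12), (12, 16), (16, 20),
                                (20, 24), (24, 28), (28, 32),
                                (32, 36), (36, 40))):
    """Decompose an EEG trial (n_channels, n_samples) sampled at fs Hz
    into one zero-phase band-passed copy per frequency band.
    Returns an array of shape (n_bands, n_channels, n_samples)."""
    out = []
    for lo, hi in bands:
        # 4th-order Butterworth band-pass, applied forward and backward
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype='band')
        out.append(filtfilt(b, a, eeg, axis=-1))
    return np.stack(out)
```

Each band-passed copy is then spatially filtered (e.g. by CSP) before feature selection.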

Methods of expanding the training dataset
Supervised learning aims to infer an unknown input-output model by which unlabeled samples can be classified. If only a small training dataset is available, or the data cannot fully reflect the feature distribution, the covariance matrices may be poorly estimated, which leads to under-fitted spatial filters or classifiers. One of the most direct ways to solve the problem of insufficient data is to supplement the training set with relevant trials from other sessions or subjects. These approaches can be divided into the following types: session-to-session transfer, subject-to-subject transfer, and generating artificial data.
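Before resorting to external data, a standard remedy for poorly estimated covariance matrices is shrinkage regularization. A minimal sketch with a fixed shrinkage coefficient (analytic choices such as Ledoit-Wolf also exist) could be:

```python
import numpy as np

def shrunk_covariance(X, alpha=0.1):
    """Regularized covariance estimate for small-sample data (sketch).

    X : (n_samples, n_features). Shrinks the sample covariance toward
    a scaled identity, keeping the estimate well-conditioned even when
    there are fewer samples than features.
    """
    S = np.cov(X, rowvar=False)
    mu = np.trace(S) / S.shape[0]  # average variance across features
    return (1 - alpha) * S + alpha * mu * np.eye(S.shape[0])
```

The shrunk estimate is always invertible, which matters because CSP and LDA both require inverting (sums of) covariance matrices.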

Session to session transfer
Since CSP filters are subject-specific, data from similar sessions of a given subject can be used to extend the training set and improve filter performance. Krauledat et al. [Krauledat, Tangermann, Blankertz et al. (2008)] aimed at mitigating the estimation bias of the model by reusing data from previous sessions of the same user in a clustering approach, suggesting that similar filters should be found across all sessions of a given subject. To find the more densely sampled regions in the space of CSP filters, it computes the angle between the column vectors of the CSP filter matrices as a distance d(w, v), and then the average distance (γ-index) of a filter w to its k nearest neighbors:

γ(w) = (1/k) Σ_{v ∈ NN_k(w)} d(w, v),

where NN_k(w) is the set of the k nearest neighbors of w. The lowest γ-index denotes a filter inside a region containing many other filters, and the corresponding filter is chosen as the cluster prototype. However, this approach is not applicable to a completely new subject. Gradually, covariate shift was introduced into BCI designs. Covariate shift is the situation where the training and testing input points follow different distributions, under the hypothesis that the conditional distribution of the outputs given the inputs is invariant [Sugiyama, Krauledat and Müller (2007)]. Li et al. [Li, Kambara, Koike et al. (2010)] combined the labeled training data with the unlabeled test data, assuming that the marginal distribution changes across sessions but the decision rules do not. They proposed re-weighting the data of previous sessions to fine-tune the predictive function f(x) and thereby correct the covariate shift, by minimizing

Σ_i w(x_i) ℓ(x_i, y_i, f(x_i)),

where x_i denotes the training samples, w(x) = p_test(x) / p_train(x) is the importance weight, and ℓ(x, y, f(x)) is a loss function.
However, this produces a large-variance estimator; Li et al. [Li, Kambara, Koike et al. (2010)] introduced a technique called bagging to overcome this weakness. Overall, the accuracy of the algorithm improved by about 10% over the traditional LDA algorithm.
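The importance-weighted training idea can be sketched with a weighted least-squares classifier, where the weights approximating p_test(x)/p_train(x) are supplied by the caller (estimating them is a separate problem, and this interface is our own, not the cited paper's):

```python
import numpy as np

def importance_weighted_fit(X, y, weights):
    """Weighted least-squares classifier f(x) = sign(x . theta) (sketch).

    Minimizes sum_i w_i * (y_i - x_i . theta)^2, i.e. the importance-
    weighted loss used to correct covariate shift, with w_i an estimate
    of p_test(x_i) / p_train(x_i). y must be in {-1, +1}.
    """
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    W = np.diag(weights)
    return np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)

def predict(theta, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(Xb @ theta)
```

With uniform weights this reduces to ordinary least squares; up-weighting samples that are likely under the test distribution pulls the decision boundary toward the test session.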
Besides, semi-supervised learning is a way to exploit unlabeled data effectively [Tu, Lin, Wang et al. (2018)]. Meng et al. [Meng, Sheng, Zhang et al. (2014)] used a semi-supervised approach to tackle the lack of data. First, a small amount of data is used for pre-training to obtain a weak spatial filter, which is applied to the test dataset to find the trial with the highest confidence. This trial is then added to the training dataset, and the process is iterated until the number of trials reaches 20. However, this algorithm lacked robustness, and its performance worsened when the training set contained outliers.
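The confidence-based self-training loop can be sketched as follows, with the classifier passed in as a pair of callables (a hypothetical interface, not from the cited paper):

```python
import numpy as np

def self_train(clf_fit, clf_proba, X_lab, y_lab, X_unlab, target=20):
    """Grow the labeled set one most-confident unlabeled trial at a
    time until it reaches `target` trials (self-training sketch).

    clf_fit(X, y) -> model; clf_proba(model, X) -> (n, n_classes).
    """
    X_lab, y_lab = X_lab.copy(), y_lab.copy()
    pool = list(range(len(X_unlab)))
    while len(X_lab) < target and pool:
        model = clf_fit(X_lab, y_lab)
        proba = clf_proba(model, X_unlab[pool])
        best = int(np.argmax(proba.max(axis=1)))  # most confident trial
        idx = pool.pop(best)
        X_lab = np.vstack([X_lab, X_unlab[idx:idx + 1]])
        y_lab = np.append(y_lab, proba[best].argmax())  # pseudo-label
    return X_lab, y_lab
```

The loop's weakness mentioned above is visible here: a confidently mislabeled outlier, once added, contaminates every subsequent refit.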

Subject to subject transfer
Early research supplemented the training dataset with regularized data from other subjects [Kang, Nam and Choi (2009); Lotte and Guan (2011)]. Kang et al. [Kang, Nam and Choi (2009)] modified CSP for subject-to-subject transfer by compositing covariance matrices from other subjects. In this approach, the weights are defined according to the Kullback-Leibler (KL) divergence between subjects:

α_{ij} = (1/Z) · 1 / KL[P_i || P_j], i ≠ j,

where α_{ij} denotes the importance of subject j to subject i, Z = Σ_{j ≠ i} 1 / KL[P_i || P_j] is a normalization constant, and the KL divergence between the zero-mean Gaussian distributions P_i and P_j with covariances Σ_i and Σ_j is

KL[P_i || P_j] = (1/2) ( log(det Σ_j / det Σ_i) + tr(Σ_j^{-1} Σ_i) − d ),

with d the number of channels. The modified CSP filter improved the classification performance degraded by the lack of training trials: the accuracy for subjects aw and ay from dataset IVa of BCI competition III increased by nearly 15%. However, the performance of this algorithm was unstable due to the potentially large variability between subjects. Thus, Barachant et al. [Barachant, Bonnet, Congedo et al. (2013)] proposed using data not from all available subjects but from selected subjects only. Subjects are selected sequentially from a subset: the BCI is trained on the data of the selected subjects and tested on the training data of the target subject, and based on the classification performance on the target subject, it is decided whether a subject's data should be removed from or added to the current dataset. This method offers a way to select the source data. In short, these methods aim at suppressing the influence of noise on CSP, and the results show that the best regularized CSP can outperform CSP by almost 10% in median classification accuracy. Building on regularization, Samek et al. 
[Samek, Kawanabe and Muller (2014)] designed a divergence-based framework to compare various divergence computations and to select the regularization coefficient automatically from historical data. To find relationships between datasets from different subjects, Fazli et al. [Fazli, Popescu, Danóczy et al. (2009); Morioka, Kanemura, Hirayama et al. (2015); Kang and Choi (2011)] learned an invariant sparse representation across multiple datasets to predict for different subjects. Sparse representation is a typical signal processing method that represents the main information of a signal using as few non-zero coefficients as possible [Wang, Shen, Li et al. (2018)]. This kind of approach guides the algorithms toward better performance, enabling them to generalize with little labeled data. Meanwhile, multi-task learning has made achievements in this regard. In the field of BCI, each task generally learns a classification model for a specific user while exploiting the similarity between users' data, so that even users with little labeled data affect the model less. Kang et al. [Kang and Choi (2014, 2011)] combined CSP with a Bayesian model for multi-subject learning, assuming that spatial patterns across subjects share a latent subspace. Another frontier approach is based on the Riemannian geometry of the manifold of symmetric positive definite (SPD) matrices [Barachant, Bonnet, Congedo et al. (2012); Zanini, Congedo, Jutten et al. (2017)]. Riemannian metric learning has been proposed to process data on the space of SPD matrices [Yger, Berar and Lotte (2017)], and one of the fundamental problems in applying the SPD manifold is finding the nearest neighbor of an SPD matrix [Zheng and Song (2018)]. Formally, the Riemannian distance between two SPD matrices A and B is

δ_R(A, B) = || log(A^{−1/2} B A^{−1/2}) ||_F = ( Σ_i log² λ_i )^{1/2},

where λ_i are the eigenvalues of A^{−1} B. Barachant et al. [Barachant, Bonnet, Congedo et al. 
(2013)] proposed an affine transformation of the covariance matrices of each subject to make data from other subjects comparable. While results in the original data space were often poor, the affine transformation provided much better classification accuracy and precision. Tested on BCI competition IV dataset IIa, the approach outperformed the conventional CSP method by 18%.
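The Riemannian distance used by these approaches can be computed from the generalized eigenvalues of the pair (B, A); a minimal sketch:

```python
import numpy as np
from scipy.linalg import eigh

def riemann_distance(A, B):
    """Affine-invariant Riemannian distance between SPD matrices:
    delta_R(A, B) = sqrt(sum_i log^2(lambda_i)), with lambda_i the
    generalized eigenvalues of B v = lambda A v (i.e. of A^{-1} B)."""
    lam = eigh(B, A, eigvals_only=True)
    return np.sqrt(np.sum(np.log(lam) ** 2))
```

The affine invariance, δ_R(A, B) = δ_R(W A Wᵀ, W B Wᵀ) for any invertible W, is what makes the metric attractive for cross-subject comparison: a shared linear mixing of the sources does not change distances between covariance matrices.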

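The KL-weighted composite covariance idea described above for subject-to-subject transfer can be sketched as follows (the blending parameter `lam` and the zero-mean Gaussian assumption are our simplifications, not the exact formulation of the cited paper):

```python
import numpy as np

def kl_gaussian(S1, S2):
    """KL divergence between zero-mean Gaussians N(0, S1) and N(0, S2)."""
    d = S1.shape[0]
    return 0.5 * (np.log(np.linalg.det(S2) / np.linalg.det(S1))
                  + np.trace(np.linalg.inv(S2) @ S1) - d)

def composite_covariance(target_cov, other_covs, lam=0.5):
    """Blend the target subject's covariance with other subjects',
    weighting each donor by the inverse KL divergence to the target,
    so that similar subjects contribute more."""
    w = np.array([1.0 / (kl_gaussian(C, target_cov) + 1e-12)
                  for C in other_covs])
    w = w / w.sum()  # normalize weights to sum to one
    pooled = sum(wi * C for wi, C in zip(w, other_covs))
    return (1 - lam) * target_cov + lam * pooled
```

The composite matrix then replaces the target subject's (poorly estimated) covariance inside the usual CSP computation.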
Generating artificial data
Further, to address situations with little labeled data, some research has proposed generating artificial data. The main idea is to generate multiple trials from the few available training trials in order to increase their number [Lotte (2015)]. Initially, the idea of Lotte [Lotte (2011)] was to divide the training trials into several segments in the time domain and recombine them into new artificial trials. The process is schematized in Fig. 1. This method is extremely simple yet effective, but it may cause mismatches at segment boundaries and decrease the signal-to-noise ratio.

Conclusion

In this paper, we surveyed existing approaches to the problem of little labeled data and classified them into two categories. Methods that enhance the representation of the data can dramatically improve the classification accuracy of BCI designs. When little training data is available, expanding the training dataset is the most direct and effective way; a large body of literature has emerged in this field, making it the most mature approach at present. In the absence of other available data, generating artificial EEG data shows better performance. Several points may be improved in the future. With advances in research on the physiological structure of the human brain, more prior knowledge could be incorporated into the feature processing stage. Combining more discriminative information from other brain signals is an important issue, and techniques for dealing with outliers in small training datasets are needed. Finally, we expect more transfer learning algorithms to be introduced, which could bring great progress to BCI design.