An Enhanced Probabilistic LDA for Multi-Class Brain Computer Interface

Background There is a growing interest in the study of signal processing and machine learning methods, which may make the brain computer interface (BCI) a new communication channel. A variety of classification methods have been utilized to convert the brain information into control commands. However, most of the methods only produce uncalibrated values and uncertain results. Methodology/Principal Findings In this study, we presented a probabilistic method “enhanced BLDA” (EBLDA) for multi-class motor imagery BCI, which utilized Bayesian linear discriminant analysis (BLDA) with probabilistic output to improve the classification performance. EBLDA builds a new classifier that enlarges training dataset by adding test samples with high probability. EBLDA is based on the hypothesis that unlabeled samples with high probability provide valuable information to enhance learning process and generate a classifier with refined decision boundaries. To investigate the performance of EBLDA, we first used carefully designed simulated datasets to study how EBLDA works. Then, we adopted a real BCI dataset for further evaluation. The current study shows that: 1) Probabilistic information can improve the performance of BCI for subjects with high kappa coefficient; 2) With supplementary training samples from the test samples of high probability, EBLDA is significantly better than BLDA in classification, especially for small training datasets, in which EBLDA can obtain a refined decision boundary by a shift of BLDA decision boundary with the support of the information from test samples. Conclusions/Significance The proposed EBLDA could potentially reduce training effort. Therefore, it is valuable for us to realize an effective online BCI system, especially for multi-class BCI systems.


Introduction
Brain computer interface (BCI) is a new communication channel that directly translates brain activities into control commands or messages for peripheral equipments. BCI may enable the disabled to control a computer application or a neuroprosthesis [1,2]. For both laboratory study and practical application, accuracy and information transfer rates (ITR) [3] are two important factors for BCI performance evaluation. At present, BCI applicability is severely limited by its unsatisfactory ITR and accuracy. A feasible way to increase ITR of a BCI system is to change the usual binary decision to a more diverse decision [4,5].
However, when the number of brain patterns increases, both signal processing (feature extraction) and machine learning (pattern classification) will encounter difficulties. For example, the classification accuracy may decrease due to the interference of the new patterns. Currently, some classifiers (e.g., the linear discriminant analysis (LDA), multilayer perception, nearest neighbor classifier, and combined algorithms [6]) have been introduced for a multiclass BCI. However, most of the classifiers only use the information in training set without considering the possible change of the statistic properties between training and test sets.
Recently, certain probabilistic methods (e.g., Gaussian processes [7], Bayesian learning [8,9]) have been introduced to improve the robustness and generalization of a BCI. In recent studies, because of its low computational complexity and immunity to overfit, LDA classifier is often preferred over its nonlinear counterparts in BCI, especially when a small number of samples are available for training [10]. Motivated by the success of LDA in BCI, Hoffman et al. developed an evidence framework based Bayesian LDA (BLDA), and verified its usefulness in a P300 based BCI [11]. Although probabilistic methods may provide confidence level of the output that is meaningful for further post-processing (e.g., classifiers combination [11,12]), these algorithms have not been discussed in terms of solving multi-class problem in BCI.
By focusing on the application of BLDA in multi-class motor imagery task, this paper proposed an enhanced BLDA (EBLDA), which could increase the performance of BCI by using the information mined from test samples (i.e., adding reliable tested samples with high classification probability to the training set to further improve the classifier performance).
The paper is organized as follows. Section Materials and Methods provides a detailed description of BLDA and EBLDA, and the simulated dataset, the BCI experimental dataset, and the preprocessing techniques (e.g., selection of filters, time interval, and feature extraction) are included in this section too. Results are provided in Section Results. Section Discussion is a general discussion for the results.

BLDA
BLDA is based on the evidence framework for Bayesian regression and has been certified very useful in a P300 based BCI [11]. BLDA is a Bayesian version of regularized LDA, in which regularization parameters are estimated with Bayesian regression. Previous studies have revealed that compared with LDA, BLDA is more competitive for the conditions with a small number of train sets or strong noise contamination [9,11].
Assuming that the target y and feature vector X are linearly related, and bias z has Gaussian form, the linear classifier should have a form as follows, Let S XY~f (x 1 ,y 1 ),:::,(x L ,y L )g be the training set composed of feature vector x i and the corresponding binary states y i [Y~f1,2g.
From (1) we can obtain the likelihood function for the weights w as, Where Y denotes a vector containing all the training targets,X X denotes the matrix that is obtained from the horizontal stacking of the training feature vectors, b denotes the inverse variance of the noise, and L denotes the number of training samples. In BLDA, combing the bias z into the weights w, by expanding the dimension of w, i.e., the last entry of w is the bias term, the prior distribution of the weights w is assumed as, where w is extended to n+1 dimension with last entry being the bias term, and I 0 (a) is a diagonal matrix with n+1 dimension having form as The prior for the weights is thus an isotropic, zero-mean Gaussian distribution with variance 1=a, and the prior for the bias being the last entry in w, is a zero-mean univariate Gaussian process with variance 1=e, where e is a small bias value for overcoming the danger of overfitting. n is the dimension of a feature vector. From the likelihood function and prior distribution, the posterior distribution can be computed by using Bayes rule as, Since the prior distribution and likelihood distribution are Gaussian, the posterior distribution also has Gaussian form and the distribution can be determined by the mean (m) of w and covariance (C ) as follows, From the posterior distribution and likelihood function, the predictive distribution can be obtained by inserting a new test feature vectorX X as, where p(ỹ yjb,X X ,w) denotes a normal distribution. The predictive distribution can be characterized by its means (u) and variance (s 2 ) as, Accordingly, we can calculate the probability of featureX X belonging to class label y = 1(similar to class label y = 21) as, where W denotes the standard normal cumulative distribution function.
Obviously, the probabilistic output (10) depends on the mean u and variance s 2 calculated from equation (9). Therefore, both the posterior distribution (5) of w and the predictive distribution (10) of y y depend on the parameters a and b. In BLDA, the parameter selection problem is solved efficiently by maximum-likelihood estimates [11]: where n denotes the dimension of a feature vector and L denotes the number of training samples. BLDA uses an iterative scheme to estimate the parameters as : 1) C and m are computed for an initial value of a and b; 2) the hyperparameters are updated according to (11) and (12); 3) Equations (6), (7), and (11), (12) are iterated to obtain the predictive distribution until the values for the hyperparameters converged; 4) the probability that featureX X belongs to class 1 by standard normal cumulative distribution function is calculated by (9) and (10). At last, the linear decision boundary for binary problems is,

EBLDA
In order to release the training effort and improve the performance of a BCI system, this paper proposed a postprocessing algorithm by rebuilding a new classifier through adding additional test samples with high probability to training sets. We refer to the new algorithm as enhanced Bayesian LDA (EBLDA). Generally, the size of training sets can influence the performance of classifiers in two ways. On one hand, a large training set may contain more outliers and artifacts, which may reduce classification accuracy for learning machine because the classifier may overfit the training data. On the other hand, a small training set may not provide enough information for classification. Therefore, the classification accuracy cannot be guaranteed either.
We supposed that unlabeled samples with high classification probability may provide valuable information to enhance the learning process and generate a classifier with refined decision boundaries. Hence, in our study, probabilistic information from BLDA was regarded as a confident evaluation criterion to select reliable test samples, which could enlarge the training set. For example, in a binary class problem that contained a positive and negative class, we obtained the probability from BLDA classifier for a test sample. When the probability of a sample was lager than a relatively strict threshold (such as 0.9), this test sample would be added to the training set for classifier calibration.

Simulation and real data tests
In this paper, based on the finding that that BLDA is more robust in the BCI application compared with other approaches like LDA, MD, SVM [9,11], BLDA serves as the baseline for performance comparison.

The Simulated Data Sets
To explore when and why BLDA and EBLDA are effective in practice, we first constructed a simulated data set, whose exact decision boundaries were known. Based on this simulated dataset, we estimated the parameters of BLDA and EBLDA to obtain their corresponding decision boundaries.  (15). The optimal discrimination channels of different tasks were found to be located at C4 for left hand, C3 for right hand, Cz for foot and CP6 for tongue, respectively. doi:10.1371/journal.pone.0014634.g001 The data set consists of two 2-dimensional normal distributions. The positive class (labeled y = 1) and the negative class (label y = 21) are given by below model parameters, where u1 and P 1 are the mean and covariance matrix for class labeled with 1, and u2 and P 2 are the mean and covariance matrix for class labeled with 21.
Mathematically, the Bayesian optimal decision boundary of this data set is,

Experiment Data
Two datasets were used in this study, where dataset 1 was recorded from our BCI system, and dataset 2 was Data set IIIa in BCI Competition III 2005 provided by BCI-Lab.
In dataset 1, EEG was recorded from two healthy male righthanded subjects aged 22 and 26 (P1, Y1) respectively. During the experiment, the subjects sat in a relaxing chair with armrests. The trial began with a fixation cross ''+'' appearing in the screen center. After 2 s' presentation of the fixation, a letter cue indicating the  motor imagery for left hand, right hand and foot or tongue was presented. The subject was asked to perform the corresponding motor imagery. The experiments consisted of several runs (6 or 9 runs), each of which contained 40 trials and lasted about 9 minutes. All runs were conducted in one session with a 3-4 minutes break between them. In the experiment, the four movement tasks have the equal probability to appear (i.e., 25% for each task). The total number of trials was 240 for P1 and 360 for Y1. For the two subjects, the trials were split into a training set and an unlabeled test set. The recording was done by the Net Amps 200 systems with a 129-channel electrode cap (Electrical Geodesics Incorporated, USA), two channels for EOG and the other 127 for EEG. EEG was recorded with Cz as reference at sampling rate 250Hz, and the band-pass filter between 0.1 to 48Hz was applied to recordings.
The paradigm designed for dataset 2 was similar to that of dataset 1. The details can be found in [13]. Three subjects (K1, K6 and L1) participated in the experiment. The recording was made with a 64-channel EEG amplifier from Neuroscan, using the left mastoid as reference and the right mastoid as ground. The total number of trials was 360 for Subject K1 and 240 for the other two subjects (K6 and L1). The data were sampled at 250 Hz and filtered from 1 to 50 Hz. In our offline analysis, all data sets were down-sampled to 100Hz, and re-referenced to common average reference. The trials for each subject were also split into training and testing sets for performance evaluation.

Subject-Specific Feature Extraction
Event related synchronization and event related desynchronization (ERS/ERD) [14] could be observed over sensorimotor cortex during motor imagery tasks, and the experimental observation showed that particular mental tasks had related effects on the spatial distribution of EEG at m (8-13Hz) and b (18-26Hz) rhythms. Further classification requires extraction of the rhythm related features from scalp EEG signals. In this study, in order to tackle the multi-class motor imaginary problems with binary classifiers, one-versus-one strategy [5] was adopted to change the multi-class problems to sub-binary problems. The final classification output is obtained by majoring voting for outputs of those sub-binary classifiers.
CSP has been proved to be an effective method to extract ERD/ERS related features from multi-channel EEG data of a two-motor imaginary task [15,16]. The CSP extensions for multiclass problem have been shown in [14,17]. Generally, the spatial filters were calculated individually for each subject, and such hyper-parameters of CSP as the frequency band, time section, optimal channel subset, m and b band-pass filters could be semiautomatically estimated for each subject [18]. In our study, these hyperparameters were estimated in the following procedure.
1) Activity regions for different patterns. After the single trial log band power was estimated by spectrum estimation technique for each channel, the average band power during the motor imagery execution (Individual power) is calculated for each task based on each channel by averaging the task-correlated trials. The activity region of different patterns can be obtained by using Individual power minus Background power, where Background power is the average of band power over all the training trails of the four tasks. At last, the optimal discrimination channels of different tasks were selected by using below r 2 , where X 1 and X 2 are the band power related features of certain class and the background power respectively, while L 1 and L 2 are the numbers of samples for different classes. Fig. 1 provides an example for Subject K1 that the optimal activity regions of different patterns were found to be located at C4 for left hand, C3 for right hand, Cz for foot and CP6 for tongue. The related positions may vary across different subjects, but the similar pattern as that in Fig. 1 could be observed.
2) Selection of optimal time interval. The amplitude envelope of m-rhythm for an optimal discrimination channel selected in the above step 1) was calculated by Hilbert transform and averaged over all training trials as the Background module. The amplitude envelopes during the execution of the four motor imagery tasks were calculated respectively as the Individual modules. The optimal time interval differentiating the four tasks was determined by the r 2 , which is defined in equation (15). Fig. 2 gives the optimal time interval 4.2s-7.5s for Subject K1.
The gray translucent rectangle marks the optimal time interval for the classification of the four different tasks.
3) Band-pass filter used for m and b rhythms. Though the ERD/ERS were observed for all the 5 subjects, the band of the m and b rhythms were different among subjects. In this paper, the best band-pass filter was estimated according to both the best discrimination electrodes obtained from step 1) and the optimal time segments extracted from step 2) for each subject. In our study, 8-33Hz band filter is identified as a good choice for all the five subjects.  4) Selection of channels for CSP. Selection of the discriminative channels was meaningful to lower the computation load for feature extraction and promote the stability of BCI. There are various techniques to select the channels [19]. In this study, we used the Greedy iterative to search for channel selection because it is easy to understand and implement. In each iteration of Greedy iterative algorithm, 4 channels randomly selected from the channel set were removed and the classification was performed using the remaining channels. The average classification result of a 265-fold cross-validation (CV) was used as the criteria for electrode selection. The iteration continued until all possible combinations of electrodes were tested. The best channel set was the channel combination that has the best classification accuracy in iterations.

5) Selection of the CSP filters.
After electrodes were selected, CSP was used to extract the features for classification. In CSP, good contrasts were provided by the paired filters, which had high eigenvalue and low eigenvalue, respectively. In this study, the optimal number of CSP filters was chosen from the number set {2, 4, 6, 8} for each subject by a 265-fold cross-validation. Fig. 3 shows two dominant CSP filters for the right and left imaginary tasks for Subject K1.
In this paper, the log-variance of the CSP projected signals is used as features for classification.
In summary, the optimal filters for m and b and the time section were determined from the selected best channel. Based on the time window and frequency band, the optimal channel subset was selected for a robust CSP implementation. Finally, the log-variance of CSP filtered time series was treated as feature for classification.
The above steps provided us with a fixed procedure to select the optimal parameters for different subjects. In this paper, we followed this fixed procedure to find the relatively optimal parameters and extract the rhythm related features for each subject.

Majority Voting
The one-versus-one decomposition transforms a N-class problem into N(N21)/2 binary classification problems, and the final classifier output for N-classes could be made by the voting of all the binary classifiers. Let p ij (ijj; x) denote the probability of feature x belonging to class i when the classification is made between class i and class j. The below voting strategy could be used to obtain the probability of feature x belonging to class i with all binary classifiers (named as p(ijx)) as, Based on the combination of classifiers, the input x is assigned into the class y[class~f1,2,3,4g using the majority of votes,

Kappa Coefficient
The kappa coefficient [20] is an evaluation criterion for unifying different number classification problems. In the N class problems, the proper performance measure of the classifier is described by its confusion matrix [20].If the N classes occur equally with probability of 1/N, the relationship between kappa coefficient k and accuracy (acc) can be described as, where acc is the classification accuracy. In our work, there were an equal number of trials of each class in each session. Therefore, we took this simplified equation to evaluate the performance of classifiers.
In the algorithm realization, the parameters of BLDA and EBLDA (mean and covariance matrix) were not obtained by the time-consuming CV procedure in this study. Instead, we got these parameters by an iteration algorithm, which was constructed according to the probabilistic output model and allowed a quick and automatic parameters estimation with a few iterations as noted in Section BLDA [9].

Why and When do BLDA and EBLDA work?
As illuminated in section the Simulated Data Sets, the optimal decision boundaries can be determined by the distribution parameters of the simulated data (equation (14)). To observe the efficiency of EBLDA for the small training set, we increased the size of training sets from 20 to 200, and another 100 test samples were used to select high probabilistic samples. As for EBLDA, the samples from the original training sets combined with the selected samples were used to train the classifier. BLDA and EBLDA used the same test dataset, which contained 100 samples for performance evaluation. Fig. 4 reveals the change of boundaries between BLDA and EBLDA when the size of training set increased. Fig. 4 showed four cases where the initial training sample sizes were 50, 100,150 and 200 respectively. In Fig. 4, the blue circles and the red crosses represent the training samples of classes 1 and 2 respectively. The green solid lines are the Bayesian optimal decision boundaries of the two classes.
5610-fold CV was adopted to obtain the average kappa coefficient for BLDA and EBLDA, where the 10-fold CV was repeated for 5 times. Meanwhile, the paired t-test was performed to investigate the difference between BLDA and EBLDA. Table 1 shows the mean Kappa and standard deviation of 5-fold cross validation with BLDA and EBLDA for the simulated dataset.

Experimental results
As for this dataset, the five steps described in III-C were firstly used to select the subject-specified parameters. As for the 4 motor imagery tasks with the one-versus-one CSP method, there were six individual binary-class groups in total, ({1,2}, {1,3}, {1,4}, {2,3},  {2,4}, {3,4}). We took the 2J CSPs filters that best discriminated each binary class problem to obtain the feature vector, where 265 fold CV was used to choose a suitable J in the range of 1,6. The different 2J features vectors were obtained for each class group, and the final results were obtained by voting strategy. In our study, the optimal number of CSP filters was six for Subjects K6 and L1 and four for other subjects. BLDA and EBLDA were tested with a 5610-fold cross-validation. To investigate the effect of training set size on classifier performance, the trials were split into three sets (i.e., the training set, the enlarging set for obtaining high probabilistic samples, the test set for evaluating the performance of BLDA and EBLDA). The size of training set ranged from 40 to 120 for all subjects. The size of the enlarging set was 60 for Subjects K6, L1 and P1 and 90 for Subjects K1 and Y1, and the size of test set equaled that of the enlarging set for each subject. In each CV procedure, EBLDA and BLDA had the same test set for performance evaluation and comparison. In this paper, as a model selection procedure, the values of the BLDA parameters (the mean and covariance matrixes) were automatically estimated by an iterative algorithm introduced in [11] according to the training set for different subjects. For the tested datasets, hyper-parameters optimization usually converged after eight to fifty iterations.
The classification performance for those five subjects when different approaches were used was listed in Table 2.
The number of the test samples selected for enlarging training set was determined by the probability threshold, which was used for reliable sample selection. The one-versus-one decomposition strategy transformed the four-class problem into six binary subtasks, ({1,2}, {1,3}, {1,4}, {2,3}, {2,4}, {3,4}), where three classifiers were related to one task. Fig. 5 illustrated the number of high probabilistic test samples in the test samples and the corresponding accuracies achieved by combining those high probabilistic test samples for classification in EBLDA. In Fig. 5, the confident threshold is 0.90 for all subjects, and those test trials having probabilities above 0.90 will be added to the training set for classifier re-training. Fig. 6 shows the curves of probability threshold vs accuracy (blue) and probability threshold vs number of selected reliable samples (green) for subject K1. Fig. 4 reveals that the classification boundary of BLDA shifted to the tangent direction of Bayesian optimal interface when the training sets were expanded by adding the reliable test samples that had high probability. The angle between BLDA to EBLDA became small when the size of training set increased, suggesting that the boundaries of BLDA and EBLDA were more similar to each other. Furthermore, the final decision boundaries of these two classification methods had a high degree of overlap when the training set had enough samples. Table 1 reveals that the EBLDA method produced a significantly better kappa than BLDA with a training size of 20 to 100 (t-test, p,0.05). The results suggest that high probabilistic test samples can refine the BLDA boundary and EBLDA can obtain a more stable result using information from the reliable test samples. As a linear classifier, based on BLDA [9], EBLDA can solve the overfitting problem better than other nonlinear counterpart like LDA, MD, when a limited sample size is available. Furthermore, EBLDA can also relearn information from new samples, which can produce a better boundary for normal distributions especially when a small sample set is used. The kappa coefficients shown in Table 1 also show that the expansion of the training set with reliable test samples improves the classifier performance, where more obvious improvement can be observed for small training set. Generally, EBLDA has better performance using the information in the additional high probabilistic test samples, and it could be a refined version of BLDA.

Multi-class problem and majority voting
Unlike the two-class problem, a multi-class problem may have unclassifiable region where a sample may have an equal chance to be classified into a few classes when a 0/1 voting is adopted. As for the real dataset containing four tasks, we adopted probability majority voting, which could automatically remove uncertainty.  6 demonstrates that with larger threshold, the accuracy increased and the selected samples will be reduced. Besides, as shown in Fig. 5, the size of training samples for Subjects K1 and Y1 is 180 and 120 for other subjects (K6, L1, P1). The rest of the trials are treated as test samples. When 0.9 is used as the confident threshold, the number of test samples added to training set is about 1/3,1/2 of the size of the test sets for most of the subjects. According to Fig. 6, all test samples are added to the training set if we set a probabilistic threshold smaller than 0.5 for Subjects K1, but the accuracy may be small. If we choose a high threshold such as 0.75, the classification accuracy will be higher than 0.9 for K1. This fact suggests that if the test samples with high probability could be selected for classifier training, the classification accuracy could be substantially improved. On the other hand, the unreliable information introduced by the trials with small probability will distort the classifier performance. In this study, 0.9 is a good threshold for Subjects K1 and Y1, 0.85 is good for subjects K6, L1, and 0.8 is suitable for subject P1. This threshold has to be estimated for each individual subject and the selection criteria is to include as many high probability samples as possible while high accuracy could be guaranteed at the same time.

EBLDA for individual subjects
As Table 2 shows, EBLDA generated better results than BLDA for all subjects except P1. The averaged classification Kappa coefficients of the five subjects ranged from 0.586 to 0.657. When the training size increased, the kappa coefficient of EBLDA did not fluctuate as much as that obtained by BLDA, especially for subjects with high classification accuracy. For Subject P1 with low classification accuracy, the classification result of EBLDA was even worse than that of BLDA when the training set was expanded. One possible explanation is that when the training set was expanded with misclassified samples, the classifier would be distorted by the unreliable information from those misclassified samples. However, for subjects with high classification kappa coefficients (i.e., K1, L1 and Y1), the performance of EBLDA was better than that of BLDA, suggesting that EBLDA can learn useful information from the correctly classified test samples to improve the classification accuracy, especially for subjects with high kappa coefficients.

Conclusion and prospect
The results confirmed that EBLDA could achieve substantial improvement over the traditional BLDA algorithm, especially when the size of training set is small. Since the unlabeled test samples added to the training set are strictly selected by the probability threshold, EBLDA could produce a more creditable result than BLDA. Some previous studies have confirmed that BLDA can get more reliable results for the multi-class classification compared to the traditional classifier like LDA, MD, SVM, and accordingly EBLDA can have superior performance to those traditional counterparts according to the performance relationship between BLDA and EBLDA revealed in this paper.
In summary, BLDA is robust to noise in the training data [9], and can capture information from test samples without much human intervention. Since Bayesian approach of prediction could take the posteriori uncertainty of the parameters into account, it could produce a more accurate estimation of the uncertainty in predictions, especially when the training data do not have enough information for a precise estimation of the model parameters. Based on BLDA, EBLDA can efficiently use the additional information from test samples for classifier calibration. EBLDA also has other properties, which are very important for practice of BCI. First, EBLDA can obtain probabilistic output, which can be used to reject trials that cannot be classified with certainly. Therefore, it could help alleviate the negative effect of wrong decisions [9]. Furthermore, probabilistic output can be used for continuous control with high classification results. Finally, the hyperparameters of BLDA can be estimated quickly, which may satisfy the demand for real-time BCI communication.