Combining multiple features for error detection and its application in brain–computer interface

Background: Brain–computer interface (BCI) is an assistive technology that conveys users' intentions by decoding various brain activities and translating them into control commands, without the need for verbal instructions and/or physical interactions. However, errors in BCI systems greatly affect their performance, which in turn confines the development and application of BCI technology. Extracting error potentials from electroencephalography (EEG) recordings has been demonstrated to be viable. Methods: This study proposed a new approach of fusing multiple-channel features from the temporal, spectral, and spatial domains through two rounds of dimensionality reduction based on neural networks. Twenty-six participants (13 males; mean age 28.8 ± 5.4 years, range 20–37) took part in the study, engaging in a P300 speller task spelling cued words from a 36-character matrix. In order to evaluate the generalization ability across subjects, the data from 16 participants were used for training and the rest for testing. Results: The total classification accuracy with the combination of features is 76.7%. The receiver operating characteristic (ROC) curve and area under the ROC curve (AUC) further indicate the superior performance of the combined features over any single feature in error detection. The average AUC reaches 0.7818 with combined features, compared with 0.7270, 0.6376, and 0.7330 for the temporal, spectral, and spatial features alone, respectively. Conclusions: The proposed method combining multiple-channel features from the temporal, spectral, and spatial domains has better classification performance than any individual feature alone. It has good generalization ability across subjects and provides a way of improving error detection, which could serve as promising feedback to promote the performance of BCI systems.

While successful demonstrations have been achieved in laboratory settings, the application of BCI technologies in real-life scenarios still faces critical challenges. Due to the noninvasive nature of EEG, the recordings are relatively far from the signal sources and are further smeared by the scalp, cerebrospinal fluid, and other soft tissues sitting in between. As a result, useful EEG signals are usually weak and susceptible to static and electromagnetic interference, as well as to other spontaneous activities, such as electromyography from movements of the head and eyes [13]. These limitations of EEG make errors inevitable in the process of detecting users' intentions in BCI systems [14]. It is thus of great importance to improve the robustness and reliability of BCI systems in order to achieve real-life applications.
Humans and other species [15] learn and adapt their behaviors through the perception of errors. Past studies found that a time-locked negative deflection in EEG, mostly visible at frontal and central cortical sites, accompanies the occurrence of errors; it is termed error-related negativity (ERN) [16]. Similar negativities in EEG signals have been reported in BCI studies when subjects observe incorrect outputs from BCI systems [17,18]. The negative potentials detected at the onset of unexpected feedback (feedback ERN, or fERN) [19,20] can be utilized to adjust the command outputs of BCI systems. Thus, improvements in the detection of error potentials (ErrPs) could facilitate the development of BCI systems with improved accuracy. Spuler et al. [21,22] implemented an error-correction scheme in a P300 speller to correct erroneous outputs in order to improve writing speed, which instantiates the application of ErrP detection in EEG data for promoting the performance of BCI systems. However, both the scanty knowledge about the neural mechanism of ErrPs and their temporal variations in status, amplitude, and latency impose difficulties on the investigation [23].
The key factor in error detection is to effectively extract specific features from raw EEG data, which are rich in information but of low signal-to-noise ratio. Various algorithms have been developed in the search for effective methods to extract characteristic features of ErrPs. Dal Seno et al. [24] proposed a genetic algorithm to extract features based on encoding different weight functions; such an algorithm is applicable not only to the extraction of P300 features but also to ERN signals. Omedes et al. [25] utilized low-frequency components as features on top of the traditional feature extraction method in the temporal domain. Zhang et al. [26] came up with a method using the directed transfer function (DTF) to extract continuous features that can improve the detection rate of error-related potentials such as the ERN. In terms of spatial features, Ramoser et al. [27] proposed a spatial filtering method, the common spatial pattern (CSP), to extract features related to motor imagery in EEG. This method searches for a set of weight coefficients that combine multiple EEG channels into one, on which the variance difference between task conditions is maximized in order to improve the classification rate. Due to the vulnerability of the CSP algorithm to overfitting, Song and Yoon proposed an adaptive CSP [28], Lotte and Guan investigated means of regularizing CSP [29], and Li et al. proposed an L1-norm-based CSP [30], all in an effort to overcome the overfitting problem. Shou and Ding [31,32] proposed blind source analysis and studied EEG signals, including ErrPs associated with errors.
Because of the nonstationarity of EEG, no optimal features can be extracted from the temporal or spectral domain alone. Meanwhile, because various activities take place across different brain regions, overfitting might occur if features from all channels are used for classification [33]. On the other hand, it is a critical challenge to select feature channels containing large inter-condition differences without affecting the performance of BCI systems [34]. To tackle these problems, the present study proposes a procedure that applies two rounds of neural-network-based dimensionality reduction to three types of features from the temporal, spectral, and spatial domains, and then combines the reduced features for classification. The present results from experimental data suggest superior classification performance of the combined features over any individual feature alone.

Experimental protocol
EEG data from the BCI challenge at the IEEE EMBS NER 2015 conference were chosen for evaluation [35]. Perrin et al. designed the experimental protocol and collected the EEG data [36]. Twenty-six healthy subjects took part in this study (13 males and 13 females; mean age 28.8 ± 5.4 years, range 20–37). All subjects went through five copy-spelling sessions. Each session consisted of 12 five-letter words, except the fifth, which consisted of 20 five-letter words.
All subjects reported normal or corrected-to-normal vision and had no previous experience with the P300 speller paradigm or any other BCI application. EEG data were recorded with 56 passive Ag/AgCl EEG sensors whose placement followed the extended international 10-20 system. All signals were referenced to a sensor at the nose. The ground electrode was placed on the shoulder, and impedances were kept below 10 kΩ. Signals were sampled at 600 Hz.
In order to evaluate the generalization ability across subjects, the data from 16 participants were used for training and those from the remaining 10 for testing.

Preprocessing
The downloaded EEG data had been downsampled to 200 Hz.
Since previous literature indicates that the information of error-related potentials mainly falls into the theta band and mu rhythm [17,25], before further processing we first used a fourth-order Butterworth bandpass filter (1-20 Hz) to remove the DC component and high-frequency noise [37]. After that, independent component analysis (ICA) was applied to the filtered EEG data to remove common artifacts, such as eye movements and electrocardiography (ECG). EEG data from all channels were then re-referenced to the common average reference (CAR) to further increase the signal-to-noise ratio [38]. Finally, all data points of each epoch between 200 and 1000 ms after feedback onset were selected as one sample.
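As a rough illustration, the preprocessing chain above (band-pass filtering, common average referencing, and epoching) might be sketched as follows. The `preprocess` helper and its defaults are illustrative, not the authors' code, and the ICA artifact-removal step is omitted for brevity:

```python
import numpy as np
from scipy import signal

def preprocess(eeg, fs=200.0):
    """Sketch of the preprocessing chain: 4th-order Butterworth
    band-pass (1-20 Hz), common average reference, then epoching
    200-1000 ms after feedback onset. `eeg` has shape
    (channels, samples); sample 0 is taken as the feedback onset."""
    b, a = signal.butter(4, [1.0, 20.0], btype="bandpass", fs=fs)
    filtered = signal.filtfilt(b, a, eeg, axis=1)  # zero-phase filtering
    # Common average reference: subtract the mean across channels.
    car = filtered - filtered.mean(axis=0, keepdims=True)
    # Keep the 200-1000 ms window after onset as one sample.
    start, stop = int(0.2 * fs), int(1.0 * fs)
    return car[:, start:stop]
```

At 200 Hz this yields a 160-point epoch per channel, matching the 800 ms analysis window.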

Feature extraction
Features from the temporal, spectral, and spatial domains were extracted from the EEG signals, and a back-propagation neural network (BP neural network) was adopted to perform two rounds of dimensionality reduction. In the end, the three resulting features from the temporal, spectral, and spatial domains were fed into another BP neural network for classification. The procedure is detailed in Fig. 1.
Step 1: Extract the temporal feature F1 from each EEG channel as the level-1 features.
Step 2: Use the level-1 features F1 from the training group to train a BP neural network, which was applied to classify the F1 features. The derived one-dimensional posterior probabilities were the level-2 features F1′.
Step 3: Use the level-2 features F1′ from all channels to train another BP neural network, which was applied to classify the 56-dimensional level-2 features F1′, resulting in the one-dimensional level-3 temporal feature F1″.
Step 4: Extract the level-1 features F2 in the spectral domain and repeat steps 2 and 3 to obtain the one-dimensional spectral feature F2″.
Step 5: Extract the level-1 features F3 in the spatial domain and repeat steps 2 and 3 to obtain the one-dimensional spatial feature F3″.
Step 6: Combine the three features [F1″, F2″, F3″] from the training group to train a feedforward neural network, which was applied to classify samples from the testing group.
To extract the level-1 features in the different domains, a series of algorithms were implemented, as described below.
① Extraction of the level-1 features in the temporal domain (F1): the training data were separated into two classes based on their labels, i.e., positive and negative feedback, and ȳ (with y ∈ {positive, negative}) denotes the mean waveform of each class. The correlation R_xy and covariance C_xy between each sample x and ȳ were then computed as the feature set F1:

$$R_{xy}(m) = \frac{1}{N}\sum_{n=1}^{N-m} x(n)\,\bar{y}(n+m)$$

$$C_{xy}(m) = \frac{1}{N}\sum_{n=1}^{N-m} \left[x(n)-\mu_x\right]\left[\bar{y}(n+m)-\mu_{\bar{y}}\right]$$

where x denotes each sample, N is the length of each sample, m is the corresponding latency (lag), and μ_x and μ_ȳ are the means of x and ȳ, respectively.
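These correlation/covariance features could be computed along the following lines. This is a sketch: the function name and the choice to evaluate lags m = 0 … max_lag−1 are hypothetical, not taken from the paper:

```python
import numpy as np

def temporal_features(x, y_bar, max_lag=3):
    """Level-1 temporal features: correlation R_xy(m) and covariance
    C_xy(m) between a sample x and a class-mean waveform y_bar, for
    lags m = 0..max_lag-1. The lag range is a hypothetical choice."""
    N = len(x)
    R, C = [], []
    for m in range(max_lag):
        xm, ym = x[:N - m], y_bar[m:]          # overlap at lag m
        R.append(np.sum(xm * ym) / N)          # correlation term
        C.append(np.sum((xm - xm.mean()) * (ym - ym.mean())) / N)
    return np.array(R), np.array(C)
```

In the study this would be evaluated once against the positive-class mean and once against the negative-class mean, per channel.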
② Extraction of the level-1 features in the spectral domain (F2): the extraction followed the approach of Huang et al. [39]. Empirical mode decomposition (EMD) was first performed to decompose the samples from each channel into intrinsic mode functions (IMFs):

$$x(t) = \sum_{i=1}^{n} c_i(t) + r_n(t)$$

where c_i is the ith IMF, n is the number of IMFs, and r_n is the residue after the EMD decomposition. The Hilbert transform was then performed on each IMF component:

$$H[c_i](t) = \frac{1}{\pi}\,P\!\int_{-\infty}^{\infty} \frac{c_i(\tau)}{t-\tau}\,d\tau$$

where P denotes the Cauchy principal value. The analytic signal z_i(t) was obtained as

$$z_i(t) = c_i(t) + jH[c_i](t) = a_i(t)e^{j\theta_i(t)}$$

where a_i(t) and θ_i(t) are the instantaneous amplitude and phase, respectively, calculated by

$$a_i(t) = \sqrt{c_i^2(t) + H^2[c_i](t)}, \qquad \theta_i(t) = \arctan\frac{H[c_i](t)}{c_i(t)}$$

The instantaneous frequency of the ith IMF component was then acquired by taking the derivative of θ_i(t):

$$\omega_i(t) = \frac{d\theta_i(t)}{dt}$$

Thus the signal x(t) can be described as below, reflecting its changing amplitude along time and frequency:

$$x(t) = \mathrm{Re}\sum_{i=1}^{n} a_i(t)\,e^{\,j\int \omega_i(t)\,dt}$$

The Hilbert spectrum H_i(ω, t) of each IMF component gives its amplitude a_i(t) along the instantaneous-frequency trajectory ω_i(t). Finally, the relative energy coefficient (E), mean frequency (MF), mean slope (MS), and coefficient of variance (CV) were calculated to form the level-1 features in the spectral domain, e.g.,

$$E_i = \frac{\sum_t c_i^2(t)}{\sum_k \sum_t c_k^2(t)}, \qquad CV_i = \frac{\sigma_i}{\mu_i}$$

where μ_i and σ_i are the mean and standard deviation of the ith IMF component.
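Assuming the IMFs have already been obtained from an EMD step (e.g. via a third-party package such as PyEMD), the Hilbert-based descriptors might be computed as follows. The exact definitions of mean frequency and mean slope here are plausible readings, not the authors' exact formulas:

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_features(imfs, fs=200.0):
    """Per-IMF spectral descriptors from the Hilbert transform:
    relative energy E, mean instantaneous frequency MF, mean slope MS
    of the instantaneous frequency, and coefficient of variance CV of
    the instantaneous amplitude. `imfs` has shape (n_imfs, n_samples)."""
    analytic = hilbert(imfs, axis=1)                # z_i(t) = c_i + j*H[c_i]
    amp = np.abs(analytic)                          # a_i(t)
    phase = np.unwrap(np.angle(analytic), axis=1)   # theta_i(t)
    inst_freq = np.diff(phase, axis=1) * fs / (2 * np.pi)  # in Hz
    energy = np.sum(imfs ** 2, axis=1)
    E = energy / energy.sum()                       # relative energy
    MF = inst_freq.mean(axis=1)                     # mean frequency (Hz)
    MS = np.diff(inst_freq, axis=1).mean(axis=1)    # mean slope
    CV = amp.std(axis=1) / amp.mean(axis=1)         # coefficient of variance
    return np.stack([E, MF, MS, CV], axis=1)
```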
③ Extraction of the level-1 features in the spatial domain (F3): the extraction was implemented through four steps based on the approach of Ramoser et al. [27]:

a. Calculate the mean covariance matrices R_p and R_n for the two classes (positive and negative feedback), and perform eigenvalue decomposition on their sum:

$$R_p + R_n = U_C \Lambda_C U_C^{T}$$

b. Form the whitening transform $P = \Lambda_C^{-1/2} U_C^{T}$, which equalizes the variances in the space spanned by U_C.

c. Whiten the class covariances, $S_p = P R_p P^{T}$ and $S_n = P R_n P^{T}$; S_p and S_n share common eigenvectors, $S_p = B \Lambda_p B^{T}$ and $S_n = B \Lambda_n B^{T}$ with $\Lambda_p + \Lambda_n = I$.

d. Obtain the projection matrix $W = B^{T} P$, each row of which is one spatial filter.

While the common spatial pattern filter (each row of W_56×56) provides a mathematical means of combining features in the spatial domain, manual adjustment is usually still required to further improve the performance [33,40]; otherwise, overfitting could occur in classification due to the hyper-dimensional space [33]. In our method, however, the filters need not be chosen manually: the dimensionality reduction of the spatial features from level 2 to level 3 is realized by the neural network, so all spatial filters can be used, bypassing the redundant manual work.
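A minimal CSP sketch in the spirit of Ramoser et al. [27] (not the authors' implementation) could look like this, using the generalized eigenvalue formulation, which is equivalent to the whitening steps above:

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_pos, trials_neg):
    """Minimal CSP sketch: trials_* are lists of (channels, samples)
    arrays. Returns a (channels, channels) matrix W whose rows are
    spatial filters ordered by descending variance ratio between the
    two classes."""
    def mean_cov(trials):
        covs = []
        for X in trials:
            C = X @ X.T
            covs.append(C / np.trace(C))   # normalized spatial covariance
        return np.mean(covs, axis=0)
    Rp, Rn = mean_cov(trials_pos), mean_cov(trials_neg)
    # Generalized eigenvalue problem: Rp w = lambda (Rp + Rn) w.
    vals, vecs = eigh(Rp, Rp + Rn)
    order = np.argsort(vals)[::-1]          # descending eigenvalues
    return vecs[:, order].T                 # each row is one spatial filter
```

The first rows maximize variance for the positive class and the last rows for the negative class; as noted above, the proposed method keeps all rows and lets the neural network weight them.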

Dimensionality reduction
Each type of feature had a different dimensionality: 164, 12, and 161 dimensions for the temporal, spectral, and spatial features, respectively. Thus, the total length of the level-1 feature vector was 56 × (164 + 12 + 161). For convenience, we write it as 56 × 3 × M, where 56 is the number of channels, 3 is the number of feature types, and M ∈ {164, 12, 161} is the length of the corresponding feature. The whole dimensionality reduction process is illustrated in Fig. 2. The first dimensionality reduction collapsed the level-1 features from a 3D space to level-2 features on a 2D plane, by replacing the samples in the level-1 features with posterior probabilities. The dimension of the level-2 features from all channels was then further collapsed, which can be visualized as the linearization of a plane (Fig. 2). A feedforward BP neural network was used to reduce the dimensionality of the features. Given multi-dimensional level-1 features F_i as input, one-dimensional level-2 features F_i′ were acquired after dimensionality reduction by

$$F_i' = \mathrm{tansig}(W^{T} F_i + b), \qquad i \in \{1, 2, 3\}$$

where i denotes the different features, W and b are the weights and bias of the neural network, respectively, acquired from the training datasets, and tansig is the hyperbolic tangent sigmoid transfer function that calculates a layer's output from its net input.
When the same steps were repeated with the level-2 features, the outputs were the level-3 features F″.
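The two-stage reduction for one feature type can be sketched as below. The network weights here are illustrative placeholders; in the study they would come from training the BP networks:

```python
import numpy as np

def tansig(x):
    """Hyperbolic tangent sigmoid transfer function."""
    return np.tanh(x)

def two_stage_reduction(F1, channel_nets, pooling_net):
    """Sketch of the two dimensionality reductions for one feature
    type. F1 has shape (56, M). Stage 1 collapses each channel's M-dim
    feature to one value via that channel's net (W_c, b_c); stage 2
    collapses the 56 channel values to a single level-3 feature via
    (W_p, b_p). All weights are illustrative, not trained."""
    level2 = np.array([tansig(W.T @ f + b)
                       for (W, b), f in zip(channel_nets, F1)])
    W_p, b_p = pooling_net
    return tansig(W_p.T @ level2 + b_p)    # scalar level-3 feature
```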

Classification
For classification, a feedforward neural network was implemented after obtaining the level-3 features p = [F1″, F2″, F3″]. The neural network can be described as

$$\mathrm{Output} = \mathrm{logsig}(W^{T} p + b)$$

where Output is the classification result, and W and b are the weights and bias of the neural network, respectively, obtained from the level-3 features p using the training data. Logsig is the log-sigmoid transfer function:

$$\mathrm{logsig}(x) = \frac{1}{1 + e^{-x}}$$
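The final classification stage might be sketched as follows, again with illustrative weights rather than trained ones:

```python
import numpy as np

def logsig(x):
    """Log-sigmoid transfer function, mapping the net input to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def classify(p, W, b, threshold=0.5):
    """Final feedforward stage: p = [F1'', F2'', F3''] is the level-3
    feature vector; W and b are illustrative placeholders (in the
    study they are obtained by training). Returns the network output
    and the thresholded class label (True = positive feedback)."""
    out = logsig(W @ p + b)
    return out, bool(out >= threshold)
```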

Results
F1, F2, and F3 denote the features extracted from the temporal, spectral, and spatial domains, respectively. The magnitude differences between the two classes reflect the ability of each feature to distinguish the two types of signals.
The feature set F1 consists of R_p, R_n, C_p, and C_n (the correlations and covariances of each sample with the positive- and negative-class means) and is shown in Fig. 3. F2 comprises the intrinsic mode functions (IMFs) obtained by EMD decomposition. Samples from each channel are decomposed into four IMF components; because the fourth IMF is a monotonic curve, the first three components are chosen as the F2 features, as shown in Fig. 4. The F3 features are the projections of the EEG from each channel onto the projection matrix W from CSP. The projections of the EEG onto the first or last eigenvectors, or some subset of eigenvectors in B, are commonly used. Although that would decrease the difficulty, signal leakage could occur [41]. Figure 5 presents the projections of the EEG onto the first eigenvector in B.

Performance in error detection
The training and prediction programs were run on a personal computer (CPU: Intel(R) Core(TM) i5-4590 @ 3.30 GHz; RAM: 8 GB; system: Windows 10 64-bit; platform: Matlab R2014a). The data from 16 participants (5400 samples in total) were used as the training set, and training took 137 min. Predicting one sample in the testing set took 3.56 s.
A confusion matrix was used to evaluate the classification performance. Figure 6 shows the results with the combined features (F1″ + F2″ + F3″) from the testing group.

Influence of features and individual variance
In order to further evaluate the effectiveness of the feature extraction and the performance of the classification, receiver operating characteristic (ROC) curves using different features and their combination are shown in Fig. 7. F1, F2, and F3 represent the temporal, spectral, and spatial features, respectively. The combination of the three features leads to the best performance. Table 1 shows the variance among individual subjects in error-detection accuracy, measured as the area under the ROC curve (AUC), for the different types of features. It demonstrates that the combination of features improves the classification performance; a one-way analysis of variance (ANOVA) shows a significant difference (F = 7.24, p < 0.005) between the single features and the combination of the three. The data also reflect variance among individuals. Based on the classification performance with the combined features, the participants fall roughly into two groups: one group of 6 subjects with an average AUC of 0.8339, and another group of 4 subjects with an average AUC of 0.6585. The performance using each of the three single features in the first group also surpasses that in the second, as shown in the first three columns of Table 1. One possible explanation is that the participants in the first group were more concentrated on the task, so the signal-to-noise ratios (SNRs) of their EEG data were higher during the extraction of useful features. Previous studies have shown that electrophysiological responses reflect participants' involvement in the task [36,42,43].
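Per-subject AUC values of the kind reported in Table 1 can be computed with a rank-based estimator such as this sketch (equivalent to the Wilcoxon-Mann-Whitney statistic; not the authors' evaluation code):

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive
    sample scores higher than a randomly chosen negative one, with
    ties counted as 0.5."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    # Count positive/negative pairs where the positive outranks the negative.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```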

Influence of electrodes
Although multi-channel EEG signals can provide more comprehensive and complete information about the different conditions, the added dimensionalities can also lead to overfitting and reduce the classification performance. Figure 8 presents the influence of the electrodes on the classification performance using the combination of the three features. First, it reveals the effect of electrode location on the classification results: features from electrodes in central brain regions generally exceed those at the periphery in classification performance. Second, it shows that single electrodes perform poorly in detecting errors, with an average AUC of 0.5726; the AUC values of features from some electrodes (AF4, F4, F6, O2) are near 0.5, demonstrating poor classification ability. As features from more electrodes are added (accumulating from FP1 onward), the classification performance generally shows an increasing pattern, except for a few electrodes (i.e., AF4, F8, and T8), as illustrated by the stars in Fig. 8. The added features lead to the adjustment of the neural network weights in desired directions, which in turn contributes to the improvements in classification performance.

Discussion
Feature extraction and representation are critical factors in error detection. Single features from the temporal, spectral, and spatial domains have been widely investigated in many studies [25,27]. In the present study, we proposed an error-detection method using neural networks to combine various features from multi-electrode EEG, which not only combines features from different domains but also addresses the overfitting issue caused by the curse of dimensionality. In the contest, our performance score was 0.7818, ranking fourth among all 260 teams attending the challenge, as shown in Table 2. The error-detection abilities of the three features F1, F2, and F3 can be observed in Figs. 3, 4, and 5, respectively, revealed by the magnitude differences in the various features. This observation is in line with the classification performance in Fig. 7 and Table 1. For example, the magnitude differences are small for F2 compared with the other two features, and its classification performance is also worse than the others, as shown in the third column of Table 1. Nevertheless, there is still constructive information in F2 for error detection, as suggested by the improved detection performance when F2 is added to the combination of features in Table 1. Such combinations make use of information from the temporal, spectral, and spatial domains, and provide more comprehensive information about errors than individual features. However, a simple concatenation of features would result in long feature vectors. When the added information from different electrodes is further considered, such a complex model is very susceptible to overfitting in classification, which might degrade performance.
Therefore, feature extraction and dimensionality reduction play important roles in error detection in the present study. A feedforward BP neural network is implemented to reduce the dimensionality of the features. The outputs of the neural network are essentially the posterior probabilities of the primary inputs, with values in [0, 1] (values close to 1 favor labeling as the positive class, and values close to 0 the negative class). After two rounds of dimensionality reduction, the level-3 features are reduced to a single dimension.
In previous studies, some researchers realized dimensionality reduction through channel selection, selecting electrodes by observing topographic EEG power maps over the scalp [36,37]. In addition, other studies implement PCA [44], ICA [45,46], or other channel selection algorithms [39] for the selection of spatial features.
In terms of detection performance, the following factors have an impact on the proposed method. The first is the preprocessing of the raw EEG data, such as the removal of eye artifacts, the time window length, and the cutoff frequencies of the bandpass filter. It was found that removing eye-movement artifacts during EEG preprocessing could enhance the accuracy by about 2%. Another factor is the feature extraction process, such as the selection of the time-delay parameter m in the temporal feature extraction: the larger the value of m, the more error-related information the F1 features contain. When extracting features in the spatial domain, it was found that other spatial filtering methods, such as xDAWN [47], could also be used to improve performance.
Error detection is essentially a binary classification problem. Such classification usually suffers greatly from unbalanced sample numbers across classes: the imbalance biases the classification toward the majority class and lowers the detection rate for the minority class [48,49]. To tackle this problem, different techniques have been explored to compensate for inter-class sample differences, such as over-sampling and under-sampling [50]. In addition, some researchers have improved the prediction rate for the minority class by adapting classifier algorithms [51]. In future work, this could be an important aspect to investigate in order to improve the accuracy of error detection.
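A simple random over-sampling step of the kind mentioned above might look like this (a sketch; the helper name and seed are arbitrary, and under-sampling would instead discard majority-class samples):

```python
import numpy as np

def oversample(X, y, seed=0):
    """Random over-sampling: duplicate randomly chosen minority-class
    samples until both classes have equal counts. X has shape
    (n_samples, n_features); y holds the two class labels."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    idx = np.flatnonzero(y == minority)
    extra = rng.choice(idx, size=deficit, replace=True)
    return np.concatenate([X, X[extra]]), np.concatenate([y, y[extra]])
```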

Conclusions
In the present study, to capture the discriminative information about error potentials in features from different domains and to avoid the overfitting caused by high-dimensional features, we proposed a new approach of combining multiple-channel features from the temporal, spectral, and spatial domains through two rounds of dimensionality reduction based on neural networks. It takes advantage of information from multiple electrodes and of the combination of features from different domains rather than single features. The classification results, assessed with ROC curves and AUC metrics, suggest superior performance of the combined features over single features and show the good generalization ability of the proposed algorithm across subjects. The improved accuracy in error detection in the present study demonstrates great potential for promoting the performance of BCI systems integrated with an error-correction scheme, which could facilitate the development of robust BCI systems for real-life environments.