Channel selection in motor imagery-based brain-computer interfaces: a particle swarm optimization algorithm

The number of electrode channels in a brain-computer interface affects not only its classification performance, but also its convenience in practical applications. However, an effective method for determining the number of channels has not yet been established for motor imagery-based brain-computer interfaces. This paper proposes a novel evolutionary search algorithm, binary quantum-behaved particle swarm optimization, for channel selection. It is implemented in a wrapping manner, coupled with common spatial pattern for feature extraction and a support vector machine for classification. The fitness function of binary quantum-behaved particle swarm optimization is defined as the weighted sum of the classification error rate and the relative number of channels. The classification performance of the binary quantum-behaved particle swarm optimization-based common spatial pattern was evaluated on an electroencephalography data set and an electrocorticography data set. It was then compared with that of three other common spatial pattern methods: using the channels selected by binary particle swarm optimization, all channels in the raw data sets, and channels selected manually. Experimental results showed that the proposed binary quantum-behaved particle swarm optimization-based common spatial pattern method outperformed the other three common spatial pattern methods, significantly decreasing both the classification error rate and the number of channels compared with the common spatial pattern method using all channels in the raw data sets. The proposed method can significantly improve the practicability and convenience of a motor imagery-based brain-computer interface system.

Keywords
Brain-computer interface; motor imagery; common spatial pattern; channel selection; binary quantum-behaved particle swarm optimization

Introduction
A brain-computer interface (BCI) is a type of communication system that establishes a non-muscular pathway between the brain and the outside world (Wolpaw et al., 2002). As such, BCIs can help people with motor disabilities communicate with external environments or control an external device. BCI systems can be divided into non-invasive and invasive types (Leuthardt et al., 2004). Non-invasive human BCIs currently use electroencephalography (EEG), magnetoencephalography (MEG), functional magnetic resonance imaging (fMRI), and functional near-infrared spectroscopy (fNIRS) as brain imaging techniques. Among them, EEG and fNIRS are attractive owing to their low cost and portability (Arvaneh et al., 2011; Aydemir et al., 2018; Blankertz et al., 2008; Ehrsson et al., 2003; Hong et al., 2015, 2017; Khan and Hong, 2017; Naseer and Hong, 2015; Shin et al., 2012). In contrast, invasive human BCIs are primarily based on electrocorticography (ECoG) signals recorded from the cortical surface (Lal et al., 2005; Leuthardt et al., 2004).
Various paradigms are used for building a BCI system. One of them utilizes motor imagery (MI) to generate distinguishable brain signals (Ehrsson et al., 2003; Pfurtscheller and Neuper, 2001). For example, imagination of a limb movement results in two neurophysiological phenomena: event-related desynchronization and event-related synchronization (ERD/ERS) (Pfurtscheller and da Silva, 1999; Toro et al., 1994). These are, respectively, a decrease and an increase in the power of EEG signals in the frequency bands of the µ rhythm (8-14 Hz) and β rhythm (16-28 Hz) over the motor and sensorimotor areas. Common spatial pattern (CSP) is an effective algorithm for extracting ERD/ERS features from EEG data (McFarland et al., 1997). As a powerful spatial filtering algorithm, CSP can detect the oscillatory characteristics of EEG signals in specific brain areas, which facilitates its use for discriminating between two classes of EEG patterns (Muller-Gerking et al., 1999). However, these specific brain areas may vary between people owing to differences in physiology and anatomy. One approach, then, is to record data from as many channels as possible. However, this introduces significant noise into the data and can cause an overfitting problem for the CSP algorithm (Blankertz et al., 2008). Furthermore, using a large number of electrodes impedes the convenience of practical applications.
To balance the need for both performance and convenience in a BCI, it is crucial to remove task-irrelevant channels using a channel selection method (Arvaneh et al., 2011). So far, the methods for channel selection can be divided mainly into three categories, namely filtering, wrapping, and embedded (Alotaiby et al., 2015). Filtering methods are independent of the subsequent learning algorithm, and rely on certain criteria to evaluate candidate channel subsets. For example, Arvaneh et al. (2011) formulated a sparse common spatial pattern (SCSP) algorithm as an optimization approach to reduce the number of channels. Filtering methods can reduce the number of channels at high speed, but usually at the cost of classification accuracy. Conversely, wrapping methods combine channel selection with a classification algorithm. Candidates are assessed by classification accuracy, so wrapping methods can yield more robust results, but they are also more computationally expensive than filtering methods. Aydemir et al. (2018) recently presented such a sequential forward search method (SFSM) for channel selection. Finally, in an extension of wrapping techniques, embedded methods select channels based on criteria generated during the learning process of a specific classifier. For example, Lal et al. (2004) embedded feature selection algorithms, recursive feature elimination (RFE) and zero-norm optimization, into support vector machines (SVM) to recursively eliminate the channels that yield the worst classification results. Together, all these methods can reduce the number of channels to a considerable degree.

Figure 1. (a) The placement of the 118 electrodes in the EEG data set according to the extended international 10/20 system; (b) The placement of the 8 × 8 electrode grid used for recording ECoG data of the second patient in the ECoG data set. It was placed on the right hemisphere.

Despite the volume of research on channel selection, accurately determining the number and position of channels is still a major challenge for MI-based BCIs. This study employs the idea of a wrapping method to construct a channel selection process, and attempts to answer the following two research questions: 1) What is the degree of improvement in the performance of a BCI system using only selected channels for classification, as compared to using all recording channels? and 2) What is the minimum number of channels required to achieve satisfactory classification performance (classification accuracy of approximately 90%)? We selected the wrapping method because it offers higher classification performance than the filtering method and lower computational complexity than the embedded method.
To address our research questions, we propose a novel evolutionary search algorithm, the binary quantum-behaved particle swarm optimization (BQPSO) (Xi et al., 2010), for channel selection in MI-based BCIs. The BQPSO evaluates all candidate channel subsets under the guidance of the fitness value, and continuously updates the candidate subsets until the maximum number of iterations is reached. Based on the chosen channels, the CSP algorithm is used for feature extraction, and support vector machine (SVM) for classification. The fitness function of the BQPSO is defined as the weighted sum of classification error rate and the relative number of channels, so that the number of channels can be reduced as much as possible, on the premise that the classification performance meets the need of BCI applications.
Another evolutionary search algorithm, binary particle swarm optimization (BPSO) (Kennedy and Eberhart, 1997), has been used for channel selection of MI-based BCIs by Kim et al. (2015). To demonstrate the advantages of BQPSO for channel selection, we evaluated the performance of BQPSO-based CSP on an EEG data set and an ECoG data set, in comparison with the performance of BPSO-based CSP, CSP with all recording channels, and CSP with channels chosen manually according to prior knowledge of neurophysiology. We report that BQPSO-based CSP achieved superior performance compared to the other three CSP methods.

Experimental data
In this study, two data sets were used for evaluating the performance of the proposed CSP method. The first is a publicly available EEG data set, data set IVa of BCI Competition III (Blankertz et al., 2006). The other is an ECoG data set provided by the authors of Lal et al. (2005), and used in their study. These data sets were employed owing to their use of many recording electrodes. The two data sets differ primarily in their signal-to-noise ratios (SNR) and sizes (i.e. numbers of total trials).

Figure 2. (a) The timing scheme of each trial for the EEG data set. In each trial, the duration of motor imagery was 3.5 s and the next 1.75-2.25 s was the time for a subject to relax; (b) The timing scheme of each trial for the ECoG data set. In each trial, the duration of motor imagery was 4 s and the next 2 s was the time for a subject to relax.

The EEG data set
The EEG data set was originally provided for classifying EEG data with small training sets and is widely employed in BCI studies to compare different classification algorithms. It consists of five data subsets derived from five healthy subjects (aa, al, av, aw and ay). Each subject participated in an MI-based BCI experiment, in which they were required to perform mental tasks of imagining left hand, right hand or right foot movements, following a given visual cue denoted by a letter (L, R, or F). Starting from the visual cue, the subjects carried out the corresponding MI task for 3.5 s. Successive cues were separated by intervals of random length ranging from 1.75 s to 2.25 s, during which the subject could relax. EEG signals were collected using a BrainAmp amplifier and a 128-channel Ag/AgCl electrode cap. 118 electrodes were used for recording experimental data, according to the extended international 10/20 system. The EEG data were digitized at 1000 Hz by the amplifier and re-sampled to 100 Hz by the competition organizers for offline analysis. A total of 280 trials per class were performed by each subject. Only the data from MI tasks of right hand (R) and right foot (F) were provided for the competition. The electrode placement for recording EEG data and the timing scheme of each trial are illustrated in Fig. 1 (a) and Fig. 2 (a), respectively.

The ECoG data set
The ECoG data set was recorded from three epileptic patients (AM, JS, SS) with intracranial electrodes. All patients suffered from focal epilepsy and had to undergo a surgical operation to have their foci resected. Prior to the surgery, the localization of epileptic foci required placing electrodes onto the surface of the cortex and into deeper regions of the brain. After several days of recovery and follow-up examinations following the implantation surgery, the BCI experiments were carried out in the hospital.
For the experiment, each subject was asked to repeatedly imagine two different limb movements according to the visual cue.
Each trial started with a fixation cross displayed in the center of the screen and lasted for 7 s. At second 1, a visual cue appeared on the screen indicating the MI task to be performed. The cue for patient SS was an arrow pointing to the left or right hand, whereas that for patients AM and JS was a picture showing either a tongue or a little finger. The imagination phase lasted 4 s. In the final 2 s of each trial, the patient could relax.
All three patients had grid electrodes implanted, but patients JS and SS had additional strip electrodes. The electrode grids were placed on the cortex under the dura mater, covering the primary motor and premotor areas as well as the frontotemporal region of either the right or left hemisphere. The electrodes were connected to an EEG amplifier by cables. The ECoG signals were recorded at a sampling rate of 1000 Hz and re-sampled to 100 Hz for offline analysis. The number and positions of implanted electrodes, the tasks performed, and the number of trials recorded from each patient are listed in Table 1. The electrode placement for recording ECoG data of the second patient and the timing scheme of each trial are illustrated in Fig. 1 (b) and Fig. 2 (b), respectively.

Methods
The channel selection algorithm based on the wrapper is illustrated in Fig. 3. As shown in Fig. 3 (a), raw EEG data are first band-pass filtered at 8-15 Hz, then subjected to channel selection via BPSO/BQPSO, and finally classified by SVM. To assess classification performance accurately, 10-fold cross-validation is applied, i.e. the whole data set is divided into 10 equal parts, with each part used once as the testing set and the other nine parts as the training set. The average error rate and the number of channels are used to calculate the fitness value at each iteration. As shown in Fig. 3 (b), within each fold the CSP spatial filters are used to filter both training and testing data. Feature signals are extracted from the spatially filtered data. Feature signals from the training set are used to train an SVM classifier model, which then classifies the feature signals from the testing set.

Figure 3. Channel selection algorithm based on the wrapper. (a) To realize the 10-fold cross-validation, the training data were divided into 10 equal-size parts. Each part was used as the testing set once and the other nine parts were used as the training set; (b) Feature extraction and classification of EEG/ECoG signals based on CSP and SVM using selected channels in one fold of the 10-fold cross-validation process.
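The 10-fold cross-validation used to score a candidate channel subset can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the names `kfold_indices` and `cv_error_rate`, the shuffling seed, and the `evaluate_fold` callback (which would wrap CSP and SVM training for one fold) are hypothetical.

```python
import random

def kfold_indices(n_trials, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every trial is tested exactly once."""
    idx = list(range(n_trials))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k parts of near-equal size
    for i in range(k):
        test = folds[i]
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        yield train, test

def cv_error_rate(n_trials, evaluate_fold, k=10):
    """Average the per-fold error rate returned by a user-supplied callback."""
    errors = [evaluate_fold(train, test)
              for train, test in kfold_indices(n_trials, k)]
    return sum(errors) / len(errors)
```

In a wrapper, `evaluate_fold` would be called once per fold for every candidate subset, which is why wrapping methods cost more than filtering methods.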

Channel selection
The particle swarm optimization (PSO) developed by Kennedy and Eberhart (1995) is a population-based evolutionary search method. The main idea of the algorithm comes from the social behavior of animals, such as bird flocking, fish schooling, and animal herding. The original PSO designed for continuous search space was modified to be applicable to discrete binary search space, thus termed binary PSO (BPSO) (Kennedy and Eberhart, 1997). From the perspective of quantum mechanics, Shin et al. (2004) adapted the PSO algorithm to develop a novel quantum-behaved particle swarm optimization (QPSO), using the quantum uncertainty principle to describe the motion state of particles. Subsequently, they further generalized the QPSO algorithm to discrete binary search spaces, developing the binary QPSO (BQPSO) (Xi et al., 2010).

Binary particle swarm optimization
In PSO, a particle swarm consists of $M$ particles that denote potential solutions, $X = \{X_1, X_2, \cdots, X_M\}$. A potential solution to a problem is expressed as a particle flying in a $D$-dimensional space with the position vector $X_i = \{X_{i1}, X_{i2}, \cdots, X_{iD}\}$ and the velocity vector $V_i = \{V_{i1}, V_{i2}, \cdots, V_{iD}\}$. Each particle maintains a record of the position of its previous best performance (i.e. the position with the best fitness value) in a vector, $pbest_i = \{pbest_{i1}, pbest_{i2}, \cdots, pbest_{iD}\}$. At each iteration, each particle competes with the others in the population for the best position, denoted as $gbest = \{gbest_1, gbest_2, \cdots, gbest_D\}$. Particles then move in the search space according to the following equations:

$$V_{id}(t+1) = w V_{id}(t) + \varphi_1 \left(pbest_{id} - X_{id}(t)\right) + \varphi_2 \left(gbest_d - X_{id}(t)\right) \tag{1}$$

$$X_{id}(t+1) = X_{id}(t) + V_{id}(t+1) \tag{2}$$

where $i = 1, 2, \cdots, M$ and $d = 1, 2, \cdots, D$. The inertia weight $w$ is introduced to accelerate the convergence of PSO, and $\varphi_1$ and $\varphi_2$ are two random positive numbers generated for each $i$ and $d$. At each iteration, the value of $V_{id}$ is confined to $[-V_{max}, V_{max}]$.

In a discrete binary space, the velocity of a particle can be described by the number of bits changed per iteration, i.e. the Hamming distance between the particle at time $t$ and at time $t + 1$. A particle with zero bits flipped does not move, while it moves the "farthest" with all bits flipped. Accordingly, the velocity of a particle can be defined in terms of the probabilities that a bit will be in one state or the other. That is to say, a particle moves in a state space with each dimension confined to 0 and 1, where each $V_{id}$ determines the probability of bit $X_{id}$ taking the value 1.
In summary, the particle swarm Eqn. (1) remains unchanged in BPSO, but now $pbest_{id}$, $gbest_d$ and $X_{id}$ take values in {0, 1}. Since $V_{id}$ is interpreted as a probability, it must be mapped into [0.0, 1.0]. This can be implemented by a sigmoid limiting transformation function $S(v) = 1/(1+\exp(-v))$. Thus, the main difference between PSO and BPSO is that Eqn. (2) is replaced by the following Eqn. (3):

$$X_{id}(t+1) = \begin{cases} 1, & \text{if } rand() < S\left(V_{id}(t+1)\right) \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

where $rand()$ is a random number drawn uniformly from [0, 1], and $V_{max}$ limits the ultimate probability that bit $X_{id}$ will take on a binary value.
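A single BPSO update for one particle can be sketched as below. This is an illustrative Python sketch under stated assumptions: the function name `bpso_step` and the default values of the inertia weight `w`, the scale `c` of the random coefficients, and `v_max` are hypothetical and are not the settings used in the paper.

```python
import math
import random

def sigmoid(v):
    """Sigmoid limiting function S(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

def bpso_step(x, v, pbest, gbest, w=0.9, c=2.0, v_max=4.0, rng=random):
    """One BPSO update of a single particle.

    x, pbest, gbest are lists of bits; v is a list of floats.
    Returns the new bit string and the new velocity.
    """
    new_x, new_v = [], []
    for d in range(len(x)):
        phi1 = c * rng.random()  # random positive coefficients, drawn per dimension
        phi2 = c * rng.random()
        vd = w * v[d] + phi1 * (pbest[d] - x[d]) + phi2 * (gbest[d] - x[d])
        vd = max(-v_max, min(v_max, vd))  # confine velocity to [-v_max, v_max]
        new_v.append(vd)
        # The bit becomes 1 with probability S(vd), independent of its old value.
        new_x.append(1 if rng.random() < sigmoid(vd) else 0)
    return new_x, new_v
```

Clamping the velocity keeps S(v) away from 0 and 1, so every bit retains some chance of flipping.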

Binary quantum-behaved particle swarm optimization
Quantum-behaved particle swarm optimization (QPSO) is a variant of PSO that outperforms PSO in search ability. In QPSO, there are no concepts of velocity and trajectory, only those of position and distance. A particle moves in the continuous search space according to the following equations:

$$mbest_d = \frac{1}{M} \sum_{i=1}^{M} pbest_{id} \tag{4}$$

$$p_{id} = \varphi \, pbest_{id} + (1 - \varphi) \, gbest_d \tag{5}$$

$$X_{id}(t+1) = p_{id} \pm \alpha \left| mbest_d - X_{id}(t) \right| \ln(1/\mu) \tag{6}$$

where $mbest$ is the mean best position among the particles. $p_{id}$ is a stochastic point between $pbest_{id}$ and $gbest_d$, i.e. the $d$th coordinate of the local attractor $p_i$ of the $i$th particle. $\varphi$ and $\mu$ are two random numbers distributed in [0, 1], and $\alpha$ is a parameter of QPSO called the contraction-expansion coefficient.
Since the iteration equations of QPSO differ greatly from those of PSO, the methodology of BPSO does not apply to QPSO. Because the position of a particle in a discrete space is expressed as a binary string, the key problem in designing BQPSO is to define the distance between two positions and the corresponding transformation. In BQPSO, the distance is defined as the Hamming distance between two binary strings $X$ and $Y$, i.e. $|X - Y| = d_H(X, Y)$, where $d_H(\cdot)$ computes the Hamming distance, i.e. the number of bits that differ between the two strings. In BQPSO, the variable $X_{id}$ stands for the $d$th substring (i.e. the $d$th decision variable) of the $i$th particle, rather than the $d$th bit of a binary string. Let the length of $X_{id}$ be $l_d$; then the length of $X_i$ can be calculated as

$$l = \sum_{d=1}^{D} l_d \tag{7}$$

The remaining problem in BQPSO design is adapting the continuous evolution Eqns. (4)-(6) to discrete binary spaces. In QPSO, the mean best position of all particles ($mbest$) is derived from Eqn. (4), whereas in BQPSO, the $j$th bit of $mbest$ is determined by the states of the $j$th bits of all particles' pbests. The $j$th bit of $mbest$ is 1 if the mean of the $j$th bits over all pbests is greater than 0.5, 0 if it is less than 0.5, and randomly taken as 1 or 0 if it equals 0.5. In BQPSO, $P_i$ can be generated through a crossover operation on $pbest_i$ and $gbest$, which can be either a one-point or a multi-point operation.
The update Eqn. (6) for QPSO can be rewritten as $|X_{id} - p_{id}| = \alpha \, |mbest_d - X_{id}| \ln(1/\mu)$, $\mu = Rand()$. It can be further adapted for use in BQPSO as follows:

$$d_H(X_{id}, P_{id}) = \lceil b \rceil, \qquad b = \alpha \, d_H(X_{id}, mbest_d) \ln(1/\mu) \tag{8}$$

where $\lceil \cdot \rceil$ rounds its argument to an integer. According to the above equations, a new substring $X_{id}$ can be calculated with time complexity $O(b \, l_d)$. To reduce the computation cost, $X_{id}$ is generated by mutating each bit of $P_{id}$ with the mutation probability

$$P_r = \begin{cases} b / l_d, & \text{if } b \le l_d \\ 1, & \text{if } b > l_d \end{cases} \tag{9}$$
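The discrete update can be sketched in Python as follows. This is a hedged illustration rather than the authors' code. It assumes that $b$ is computed from the Hamming distance between the current substring and $mbest$, by analogy with the rewritten Eqn. (6), and that each bit of the attractor is mutated with probability min(1, b / l_d). The function names, the default `alpha`, and the particular one-point crossover (head from pbest, tail from gbest) are hypothetical choices.

```python
import math
import random

def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(1 for p, q in zip(a, b) if p != q)

def mbest_bits(pbests, rng=random):
    """Bitwise majority vote over all pbests; ties are broken at random."""
    out = []
    for j in range(len(pbests[0])):
        mean = sum(p[j] for p in pbests) / len(pbests)
        if mean > 0.5:
            out.append(1)
        elif mean < 0.5:
            out.append(0)
        else:
            out.append(rng.randint(0, 1))
    return out

def local_attractor(pbest, gbest, rng=random):
    """One-point crossover (needs length >= 2): head of pbest, tail of gbest."""
    cut = rng.randint(1, len(pbest) - 1)
    return pbest[:cut] + gbest[cut:]

def bqpso_update(x, p, mbest, alpha=0.75, rng=random):
    """Return a new substring by mutating each bit of the attractor p."""
    mu = 1.0 - rng.random()  # uniform in (0, 1], so log(1 / mu) is finite
    b = alpha * hamming(x, mbest) * math.log(1.0 / mu)
    pr = min(1.0, b / len(x))  # assumed mutation probability
    return [1 - bit if rng.random() < pr else bit for bit in p]
```

When a particle already coincides with `mbest`, `b` is zero and the particle jumps straight to its local attractor.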

Fitness function
When BPSO/BQPSO is used for channel selection, individuals in a population are represented as $n$-bit binary strings, corresponding to the $n$ channels used for data recording. BPSO/BQPSO operates on a population of binary strings and chooses channels by optimizing a fitness (or objective) function. There are two goals in channel selection: improving classification accuracy and reducing the number of channels. Accordingly, the fitness function, $f(z)$, can be defined as the weighted sum of two decision variables, the error rate of 10-fold cross-validation, $f_1(z)$, and the relative number of channels, $f_2(z)$, for a minimization problem (Hasan et al., 2010; Reyes-Sierra and Coello, 2006):

$$f(z) = w_1 f_1(z) + w_2 f_2(z) \tag{10}$$

where the weights $w_i$ are normalized, i.e. $w_1 + w_2 = 1$. $f_1(z)$ is the error rate obtained from the given channel subset denoted by an individual, $z$, and $f_2(z)$ is derived by dividing the number of channels chosen in the individual $z$ by the total number of channels in the raw data set, i.e. the length of the binary string, $n$. Since $f_1(z)$ and $f_2(z)$ both range from 0 to 1, so does $f(z)$.

Table 3. Classification error rates (%) and the number of channels yielded respectively by BQPSO-CSP and BPSO-CSP at weight coefficients $w_1 = w_2 = 0.5$, the CSP methods using all channels and the 18 channels around the electrodes C3 and C4, and the best channel selection method (SCSP-1) (Arvaneh et al., 2011)

Several important steps for BPSO/BQPSO-based channel selection are explained below:
1) Coding. Each particle in a population is coded as a binary string, whose length is equal to the total number of channels in a raw data set. When a bit of the binary string is 1, the corresponding channel is retained; otherwise the corresponding channel is removed. Thus, each particle denotes a different subset of channels, which is a candidate solution to the problem of channel selection.
2) Initialization. An initial population of M particles (M = 20 in this study) is randomly generated, and each bit of the binary string of every particle is randomly set to 1 or 0.
3) Selection. Channel selection is equivalent to finding a global minimum of the fitness function. At each generation, the particle with the smallest fitness value is found. After each iteration, the positions of all particles are updated, and the current best position of each particle is compared with that of the best particle of the previous generation to find the global optimal position, i.e. the best particle at the current generation. The best particle at the final generation encodes the channels selected by BPSO/BQPSO.
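The coding step and the weighted-sum fitness can be sketched as below. This is a minimal Python sketch: `error_rate_fn` is a hypothetical callback standing in for the cross-validated CSP and SVM error rate, and assigning the worst fitness to an empty channel subset is an assumption of this sketch, not something stated in the text.

```python
def decode_channels(mask):
    """Indices of the retained channels: bit 1 keeps a channel, bit 0 drops it."""
    return [i for i, bit in enumerate(mask) if bit]

def fitness(mask, error_rate_fn, w1=0.5, w2=0.5):
    """Weighted sum of error rate f1 and relative channel count f2 (lower is better)."""
    channels = decode_channels(mask)
    if not channels:
        return 1.0  # assumption: an empty subset cannot be classified
    f1 = error_rate_fn(channels)     # cross-validated error rate in [0, 1]
    f2 = len(channels) / len(mask)   # fraction of channels retained
    return w1 * f1 + w2 * f2

# Example: 2 of 4 channels kept with a 10% error rate.
example = fitness([1, 0, 1, 0], lambda channels: 0.1)
```

With equal weights, the example evaluates to 0.5 × 0.1 + 0.5 × 0.5 = 0.3, so halving the channel count contributes as much to the score as a 50-point change in error rate.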

Feature extraction

Data preprocessing
Prior to channel selection, both the EEG and ECoG data sets were preprocessed with respect to time windowing, temporal filtering, and electrode referencing. In the EEG data set, the raw data in a time window of 1-2 s after the visual cue were segmented from each channel for classification (Shin et al., 2012). The windowed EEG data were band-pass filtered between 8 and 15 Hz to extract µ rhythm signals associated with MI (Shin et al., 2012). In the ECoG data set, the raw data in a time window of 0.5-2.5 s following the visual cue were segmented from each channel for classification.
The data segments used for classifying the two data sets were not optimized, but were determined experimentally and heuristically. Common average reference (CAR) was used to re-reference the windowed data to reduce sensitivity to artifacts (Ludwig et al., 2009). Re-referenced ECoG data were band-pass filtered between 8 and 30 Hz to extract both µ and β rhythm signals associated with MI (Wei et al., 2007).
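Time windowing and common average referencing can be sketched as follows. This is an illustrative sketch assuming NumPy and a trial stored as a channels-by-samples array with time zero at the cue; the function name `window_and_car` is hypothetical, and band-pass filtering is deliberately left out.

```python
import numpy as np

def window_and_car(trial, fs, t_start, t_end):
    """Cut a time window from one trial and apply common average reference.

    trial: array of shape (n_channels, n_samples), sample 0 at the visual cue.
    fs: sampling rate in Hz; t_start, t_end: window bounds in seconds.
    """
    start = int(round(t_start * fs))
    stop = int(round(t_end * fs))
    segment = trial[:, start:stop]
    # CAR: subtract the instantaneous mean over channels from every channel.
    return segment - segment.mean(axis=0, keepdims=True)
```

After CAR the channels sum to zero at every sample, so the spatial covariance matrix loses one rank, which matters for any later step that inverts it.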

Common spatial pattern
Common spatial pattern (CSP) is a powerful algorithm for spatial filtering, which has been successfully employed in MI-based BCIs for discriminating between two classes of EEG data. By spatially filtering multi-channel EEG signals, CSP maximizes the variance of one class while minimizing the variance of the other class, making subsequent classification more effective (Blankertz et al., 2008;Lotte and Guan, 2011;Muller-Gerking et al., 1999).
The purpose of CSP is to extract task-related signal components and suppress task-unrelated components and noise. Assume that there are two-class EEG signals evoked by two mental tasks, e.g. MI of left hand and right hand. Let $X_1$ and $X_2$ denote single-trial EEG signals of classes 1 and 2, respectively, each with dimension $N \times T$ ($N$ channels by $T$ sampling points). Two normalized spatial covariance matrices, $R_1$ and $R_2$, are calculated from $X_1$ and $X_2$, respectively, as

$$R_i = \frac{X_i X_i^T}{\operatorname{trace}(X_i X_i^T)}, \quad i = 1, 2 \tag{11}$$

where the superscript $T$ denotes the transpose operation, and $\operatorname{trace}(A)$ stands for the trace operation, i.e. the sum of the diagonal elements of matrix $A$. The averaged spatial covariance matrix, $\bar{R}_i$, across all training trials can be obtained for each class. Subsequently, the composite spatial covariance matrix $R_c$ can be calculated as

$$R_c = \bar{R}_1 + \bar{R}_2 \tag{12}$$

Since $R_c$ is a real and symmetric matrix, it can be factored as $R_c = U_c \Sigma_c U_c^T$, where $U_c$ is the eigenvector matrix and $\Sigma_c$ is the diagonal eigenvalue matrix. $U_c$ and $\Sigma_c$ can be used for calculating the whitening transform matrix as $P = \Sigma_c^{-1/2} U_c^T$, which transforms $\bar{R}_i$ as follows:

$$S_i = P \bar{R}_i P^T, \quad i = 1, 2 \tag{13}$$

Consequently, $S_1$ and $S_2$ share the same eigenvectors. If $S_1$ is factored as $S_1 = B \Sigma_1 B^T$, then $S_2$ is factored as $S_2 = B \Sigma_2 B^T$, and $\Sigma_1 + \Sigma_2 = I$, where $I$ is the identity matrix. Given that the sum of the two eigenvalues corresponding to the two-class EEG signals is always equal to one, eigenvectors with the largest eigenvalues for $S_1$ correspond to those with the smallest eigenvalues for $S_2$, and vice versa. This property is extremely important for classification of EEG signals, because it means that when the signal variance for one class is maximized, that for the other class is minimized.
The CSP algorithm leads to a spatial filter matrix as follows:

$$W = B^T P \tag{14}$$

where $W \in \mathbb{R}^{N \times N}$. In general, the first and the last $m$ rows of $W$ are used as two spatial filters $W_1$ and $W_2$ for the two mental tasks, respectively. The two spatial filters are optimal in the sense that they extract task-related components and eliminate common components.
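The CSP computation described above can be sketched with NumPy as follows. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, each trial is assumed to be a channels-by-samples array, and the composite covariance matrix is assumed to be full rank (it is not after common average referencing, in which case a reduced-rank whitening would be needed).

```python
import numpy as np

def normalized_cov(x):
    """Trace-normalized spatial covariance of one trial (channels x samples)."""
    c = x @ x.T
    return c / np.trace(c)

def csp_filters(trials_1, trials_2, m=1):
    """Return (W1, W2, W): first m rows, last m rows, and the full filter matrix."""
    r1 = np.mean([normalized_cov(x) for x in trials_1], axis=0)
    r2 = np.mean([normalized_cov(x) for x in trials_2], axis=0)
    evals, u = np.linalg.eigh(r1 + r2)      # composite covariance R_c = U diag(evals) U^T
    p = np.diag(evals ** -0.5) @ u.T        # whitening transform P
    s1 = p @ r1 @ p.T                       # whitened class-1 covariance
    lam, b = np.linalg.eigh(s1)             # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]           # sort so class-1 variance is largest first
    w = b[:, order].T @ p                   # W = B^T P, one spatial filter per row
    return w[:m], w[-m:], w
```

The full matrix `w` jointly diagonalizes both class covariances, so `w @ (r1 + r2) @ w.T` is the identity up to rounding error.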

Feature definition
The last step of feature extraction is to define feature signals for classification. Suppose that task 1 causes a relatively increased EEG variance over a specific area of the brain. Then the variance of the EEG component filtered by $W_1$ is greatly enhanced compared with that filtered by $W_2$, and vice versa for task 2. Given a single-trial spatiotemporal signal matrix, $X$, with unknown label, two runs of spatial filtering by $W_1$ and $W_2$ are applied. Then, features $f_1$ and $f_2$ are defined as follows:

$$f_i = \log\left(\frac{\operatorname{var}(W_i X)}{\sum_{j=1}^{2} \operatorname{var}(W_j X)}\right), \quad i = 1, 2 \tag{15}$$

where $\operatorname{var}(\cdot)$ denotes the variance over time, and the ratio inside the logarithm takes a value between 0 and 1.
In theory, before the logarithmic operation, $f_1$ takes the value 0 for trials from task 2 and 1 for trials from task 1. The opposite holds for $f_2$. The logarithmic operation is adopted to make the distribution of the elements in $f_i$ closer to normal. Ultimately, the feature vector used for classification can be structured as

$$V = [f_1, f_2]^T \tag{16}$$
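The log-variance features can be sketched as follows. This sketch assumes NumPy and normalizes by the total variance over all retained filtered components, which is one common convention and an assumption here. The function name `csp_features` is hypothetical.

```python
import numpy as np

def csp_features(x, w1, w2):
    """Log of normalized variances of a trial filtered by W1 and W2.

    x: trial of shape (n_channels, n_samples).
    w1, w2: spatial filters of shape (m, n_channels).
    Returns a feature vector of length 2 * m.
    """
    v1 = np.var(w1 @ x, axis=1)   # variance over time of each filtered component
    v2 = np.var(w2 @ x, axis=1)
    total = v1.sum() + v2.sum()
    return np.concatenate([np.log(v1 / total), np.log(v2 / total)])

# Toy trial: channel 0 has variance 1, channel 1 has variance 4.
toy = np.array([[1.0, -1.0, 1.0, -1.0], [2.0, -2.0, 2.0, -2.0]])
feats = csp_features(toy, np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]]))
```

For the toy trial, the normalized variances are 0.2 and 0.8, so the features are their logarithms.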

Classification
A linear support vector machine (SVM) was used as the classifier in this study. Proposed by Cortes and Vapnik (1995), SVM is a widely used classification algorithm in the fields of pattern recognition and machine learning. In BCI studies, SVM has exhibited robust classification performance (Blankertz et al., 2003; Kaper et al., 2004; Schlögl et al., 2005). The purpose of SVM is to design a hyperplane $g(V) = w^T V + w_0 = 0$ that maximizes the margin between two classes of training data, where $w$ is a weight vector and $w_0$ is an offset. Maximizing the margin is what gives the classifier its good generalization performance.
A linear SVM can be summarized as the following optimization problem:

$$\min_{w, w_0, \zeta} \ \frac{1}{2} \|w\|^2 + C \sum_i \zeta_i \quad \text{s.t.} \quad y_i \left(w^T V_i + w_0\right) \ge 1 - \zeta_i, \quad \zeta_i \ge 0 \tag{17}$$

where $i$ is the index of training trials, $V_i$ and $y_i \in \{-1, +1\}$ are the feature vector and class label of the $i$th trial, $\zeta_i$ is a slack variable and $C$ is a regularization parameter. The role of $\zeta_i$ is to relax the requirement of linear separability, whereas that of $C$ is to make a compromise between the bias and variance of classification results. A linear SVM classifier (Müller et al., 2003) is trained with the function fitcsvm in the MATLAB Statistics and Machine Learning Toolbox. Usually, a model selection procedure is required for determining the regularization parameter $C$, in order to improve classification accuracy. Since the purpose of this research is to evaluate the search algorithm for channel selection, we adopted the default parameter in fitcsvm, i.e. $C = 1$.

Table 4. Classification error rates (%) and the number of channels yielded respectively by BQPSO-CSP and BPSO-CSP at weight coefficients $w_1 = w_2 = 0.5$, the CSP method using all channels, and the channel selection method (RCE cross-val.) (Lal et al., 2005)

Results
The efficacy of the CSP algorithm depends heavily upon the number and positions of channels used for classification. Hence, before feature extraction was conducted by CSP, different channel sets were applied, including i) channel subsets chosen by BQPSO and BPSO, ii) all channels contained in the raw data sets, and iii) 18 benchmarked channels around electrodes C3 and C4 for the EEG data set. These four CSP methods are hereafter labelled BQPSO-CSP, BPSO-CSP, Basic-CSP-1, and Basic-CSP-2, respectively.
The performance of BQPSO-CSP was tested and compared with the other three CSP methods on the two data sets, EEG and ECoG. For the EEG data set, the three most important pairs of spatial filters (i.e. m = 3 in Eqn. (14)) were used, according to their contribution to classification. For the ECoG data set, only the single most important pair of spatial filters (i.e. m = 1 in Eqn. (14)) was used. The parameters used for channel selection in the BQPSO and BPSO algorithms are listed in Table 2.

BQPSO/BPSO for channel selection
In this study, both the error rate and the number of chosen channels yielded by BQPSO-CSP and BPSO-CSP were the results of 10-fold cross-validation averaged across 5 independent executions (Xi et al., 2016). The weights $w_1$ and $w_2$ in the fitness function (10) were varied jointly, i.e. one weight coefficient changed with the other. We tested 9 combinations of $w_1$ and $w_2$ in which $w_1$ increased from 0.1 to 0.9 in increments of 0.1 while $w_2$ correspondingly decreased from 0.9 to 0.1. Thus, for each subject or patient, both BQPSO-CSP and BPSO-CSP had 9 sets of classification results. Fig. 4 and Fig. 5 depict the classification error rate and the number of channels yielded by BQPSO-CSP and BPSO-CSP at the nine pairs of weight coefficients on the two data sets. Each mark (circle or asterisk) in each subplot represents the error rate and the number of channels achieved at one pair of weight coefficients. When the weight of the error rate ($w_1$) was assigned its maximum value (0.9), the two methods for channel selection produced the lowest (or near-lowest) classification error rates by excluding the most redundant channels. Conversely, when the weight of the channel number ($w_2$) was assigned its maximum value (0.9), the two methods retained the minimal (or near-minimal) number of channels without increasing the error rates compared with the CSP method using all channels.
From each subplot of Fig. 4 and Fig. 5, it can be observed that the changing curve of error rate with the number of channels yielded by BQPSO-CSP is always located on the left of that yielded by BPSO-CSP. This means that to obtain a roughly equal error rate, the latter needs to select many more channels than the former. Examining data from subject al as an example, a 2.17% error rate required an average of 9.6 channels for the former, whereas a 2.42% error rate required an average of 27.6 channels for the latter. Therefore, the proposed BQPSO-CSP method outperformed the BPSO-CSP method, particularly when the number of channels is small. Fig. 6 and Fig. 7 display the topological analyses of spatial patterns yielded by the three methods for channel selection in representative subjects in the two data sets. The spatial patterns are derived from the CSP filters, i.e., the inverse of the CSP filter matrix (Eqn. 14). The first and the last columns of the inverse matrix constitute the most important spatial patterns. In the two figures, the first row plots the topological maps obtained from all channels, whereas the second and third rows show the topological maps obtained from channels selected by BPSO-CSP and BQPSO-CSP, respectively. The dots in each topological map represent the positions of total channels or chosen channels.
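The relation between CSP filters and spatial patterns can be illustrated with a short sketch. It assumes that the rows of the filter matrix W are the spatial filters, so that the columns of its inverse are the spatial patterns; the whitening-based construction, the function names, and the random data are illustrative assumptions and are not taken from the study.

```python
import numpy as np

def csp_filters(cov_a, cov_b):
    """Return a CSP filter matrix W (rows are filters), sorted so that the
    first row captures the most class-a variance and the last row the least."""
    # Whiten the composite covariance matrix.
    evals, evecs = np.linalg.eigh(cov_a + cov_b)
    whitening = np.diag(evals ** -0.5) @ evecs.T
    # Diagonalize the whitened class-a covariance matrix.
    d, b = np.linalg.eigh(whitening @ cov_a @ whitening.T)
    order = np.argsort(d)[::-1]  # largest class-a variance first
    return b[:, order].T @ whitening

def spatial_patterns(filters):
    """Columns of the inverse filter matrix are the spatial patterns."""
    return np.linalg.inv(filters)

# Synthetic two-class data (channels x samples), for illustration only.
rng = np.random.default_rng(0)
n_channels = 6
x_a = rng.standard_normal((n_channels, 500))
x_b = rng.standard_normal((n_channels, 500)) * np.linspace(0.5, 2.0, n_channels)[:, None]
cov_a = x_a @ x_a.T / x_a.shape[1]
cov_b = x_b @ x_b.T / x_b.shape[1]

W = csp_filters(cov_a, cov_b)
A = spatial_patterns(W)
most_important = A[:, [0, -1]]  # first and last columns of the inverse matrix
```

Each column of `most_important` holds one weight per channel, which is the quantity a topological map would display at the electrode positions.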

Spatial patterns
It can be observed from Fig. 6 that the spatial patterns obtained from all channels (1st row) have large weights scattered in several locations irrelevant to the MI tasks. For subject aa in particular, the spatial patterns obtained from MI of foot movement appear messy, displaying no clear focus. After BPSO- or BQPSO-based channel selection, the focus of the spatial patterns (2nd and 3rd rows) is clearer than that obtained using all channels. Moreover, the foci of these patterns move to (or near) locations related to the MI tasks.
With respect to these two methods for channel selection, while BPSO could reduce the number of channels employed, the positions of the chosen channels were relatively scattered. In addition, some channels outside the focus area were also selected, potentially introducing noise into the data used for classification. By contrast, BQPSO selected fewer channels, and these channels were concentrated mainly on the focus area. The positions and number of the chosen channels explain the decrease in error rate compared to that of BPSO and the all-channel method. Moreover, the channels selected by BQPSO were almost identical to those selected in the focus area by BPSO, especially for subject AM in

Error rate and the number of channels
As shown in Fig. 4 and Fig. 5, the nine combinations of weight coefficients resulted in nine pairs of error rate and number of channels. Thus, the channels in a BCI system can be configured according to these results and the error rate required for a specific application. As an example, the error rate and the number of channels yielded by BQPSO-CSP and BPSO-CSP at the weight coefficients w1 = w2 = 0.5 on the EEG and ECoG data sets are listed in Table 3 and Table 4, respectively. For comparison, the error rate and number of channels yielded by Basic-CSP-1 and Basic-CSP-2 are listed in Table 3, and those yielded by Basic-CSP-1 only are listed in Table 4. (As Basic-CSP-2 uses the 18 benchmarked channels around electrodes C3 and C4 for the EEG data set, there are no Basic-CSP-2 values for the ECoG data set in Table 4.) To compare the proposed method with previously presented methods for channel selection, the error rate and number of channels yielded by sparse CSP for channel selection (SCSP-1) (Arvaneh et al., 2011) and recursive channel elimination (RCE cross-val.) (Lal et al., 2005) are also listed in Table 3 and Table 4, respectively.
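Configuring a system from the nine (error rate, number of channels) pairs can be sketched as a simple lookup: among the configurations that satisfy the required error rate, take the one with the fewest channels. The function name and the numbers below are hypothetical placeholders and are not results from this study.

```python
def pick_configuration(results, max_error):
    """Return the (error, channels) pair with the fewest channels whose error
    does not exceed max_error, or None if no configuration qualifies."""
    feasible = [r for r in results if r[0] <= max_error]
    if not feasible:
        return None
    # Fewest channels first; ties are broken by the lower error rate.
    return min(feasible, key=lambda r: (r[1], r[0]))

# Hypothetical (error rate in %, mean number of channels) pairs,
# NOT values reported in the paper.
results = [(12.0, 4.1), (9.5, 6.3), (8.0, 9.8), (7.9, 15.2), (7.8, 22.0)]
choice = pick_configuration(results, max_error=10.0)
```

With these placeholder values and a 10% requirement, the lookup selects the 6.3-channel configuration, since the 4.1-channel one misses the error target.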
It is observed from Table 3 that BQPSO-CSP yielded the lowest error rate for each subject among the four CSP methods. In particular, subject av demonstrated a substantial drop in error rate from 32.79% (yielded by the full complement of 118 channels) to 18% (by an average of 14.5 channels selected by BQPSO). On average, BQPSO-CSP achieved a reduction of 2.12%, 5.74%, and 7.71% in error rate compared to BPSO-CSP, Basic-CSP-1, and Basic-CSP-2, respectively. These decreases are remarkable in terms of MI-based BCIs. Paired Wilcoxon signed rank tests at 95% confidence level establish a significant difference in error rate between BQPSO-CSP and the other three CSP methods, and between BPSO-CSP and the two Basic-CSP methods, with p values all equaling 0.043. In addition, the average number of channels used by BQPSO-CSP was considerably decreased to an average of 14.9, as compared to 30.8 (in BPSO) and 118 (in Basic-CSP-1). Paired Wilcoxon signed rank tests at 95% confidence level reveal a significant difference in the number of channels between any two of the former three CSP methods, with p values all equaling 0.043. Finally, compared with SCSP-1, BQPSO-CSP remarkably reduced the average error rate by 9.66% and the average number of channels by 7.7.
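The repeated p value of 0.043 is what the normal approximation of the signed-rank test gives when all paired differences share the same sign. The sketch below assumes five paired observations (the five EEG subjects) with no ties and no continuity correction; which variant of the test was actually used is not stated in the text, so this is an illustration, not a reconstruction of the analysis.

```python
import math

def wilcoxon_normal_p(n, w_minus):
    """Two-sided p value of the signed-rank statistic using the normal
    approximation (no ties, no continuity correction)."""
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_minus - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))

def wilcoxon_exact_p_extreme(n):
    """Exact two-sided p value when all n differences share one sign (no ties)."""
    return 2 / 2 ** n

p_normal = wilcoxon_normal_p(5, 0)      # rank sum of the minority sign is 0
p_exact = wilcoxon_exact_p_extreme(5)
```

Under these assumptions the approximation gives about 0.043, whereas the exact small-sample distribution gives 0.0625, so the reported value depends on which variant of the test is applied.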
In the ECoG data set, Table 4 reveals that BQPSO-CSP yielded an average reduction of 8.63% in the error rate of Basic-CSP-1 while decreasing the average number of channels from 74 to 7.9. The decrease in error rate is especially large for subject AM (14.98%). Likewise, BPSO-CSP reduced the average error rate by 3.63% with a remarkable drop in the average number of channels from 74 to 16.9. Hence, both BQPSO-CSP and BPSO-CSP are capable of reducing the error rate by removing a large number of redundant channels. However, BQPSO-CSP was considerably more effective than BPSO-CSP, as evidenced by its considerably lower error rate with fewer channels selected for each of the three patients. Compared with RCE cross-val., BQPSO-CSP reduced the average error rate by 8.75% and the average number of channels by 2.9.

Discussion
Feature extraction is a crucial component in a BCI system, as the classification performance depends primarily upon the quality of the feature vectors used for classification rather than the classifier itself. CSP is a powerful spatial filtering algorithm that is widely used for feature extraction in MI-based BCIs. However, the use of excessive electrodes for data recording makes the CSP algorithm prone to over-fitting, especially when the training set is small. Furthermore, installing a large number of electrodes adds inconvenience to the practical application of BCIs. Therefore, determining the minimum optimal number and positions of electrodes is an extremely important step in building a high-performance BCI. This can be accomplished by channel selection.
While channel selection has been studied extensively, it is still a huge challenge to accurately determine the number and positions of channels for a specific subject. In this context, we propose a novel evolutionary search algorithm, BQPSO, to optimize channel selection in MI-based BCIs, with the aim of obtaining high classification accuracy using as few channels as possible. BQPSO combines the strength of the genetic algorithm (GA) with the features of PSO and is thus able to determine the global optimum of an optimization problem more efficiently than BPSO. This is verified by our results in Fig. 3 and Fig. 4, where BQPSO-CSP consistently achieved a significantly lower error rate than BPSO-CSP using a nearly identical number of channels, or a nearly identical error rate with significantly fewer channels.
Figure 6. Visualization of the most important spatial patterns of the two MI tasks derived from three CSP methods for subjects aa and al in the EEG data set (118 raw electrodes in total). The dots in each topological map represent the whole channels in raw data or the channels selected by BPSO/BQPSO. The color at each electrode denotes the magnitude of spatial patterns.
What is the degree of performance improvement following channel selection? This question can be answered by the results in Table 3 and Table 4. These results indicate that both BQPSO-CSP and BPSO-CSP significantly decrease the average error rate as compared to Basic-CSP-1, which uses all available channels. Thus, both channel selection processes are effective. Interestingly, Basic-CSP-2, which used 18 channels selected manually from prior knowledge of neurophysiology, increased the average error rate rather than reducing it. BQPSO-based channel selection decreased the average error rate from 13.8% to 8.06% for the EEG data set, a relative improvement of 41.59%, and from 24.98% to 16.05% for the ECoG data set, a relative improvement of 35.75%.
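The two improvement figures are relative reductions of the baseline error rate, which can be checked with a few lines. The function name is an illustrative choice; the four error rates are the averages quoted above.

```python
def relative_improvement(baseline_error, new_error):
    """Percentage reduction of the error rate relative to the baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error

eeg = round(relative_improvement(13.8, 8.06), 2)     # EEG: 13.8% -> 8.06%
ecog = round(relative_improvement(24.98, 16.05), 2)  # ECoG: 24.98% -> 16.05%
```

Note that these are relative changes; the corresponding absolute reductions in error rate are 5.74 and 8.93 percentage points.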
It is important to note that these results were achieved at only one combination of weight coefficients, i.e. w1 = w2 = 0.5. Considering that there were eight other pairs of weight coefficients, further improvements in classification performance are entirely possible. It must also be noted that since the electrodes for the ECoG data set were arranged for removing epileptic foci, they did not cover the whole motor area important for MI-based BCI study. This may explain why the average error rate of the ECoG data set was larger than that of the EEG data set, although the former had a higher SNR. Despite this, the average improvement in error rate remained as high as 35.75%. This should be attributed to BQPSO-based channel selection, which succeeded in selecting informative channels while removing redundant ones.
For MI-based BCI paradigms, different types of brain signals can be used as input for a BCI system. In the study conducted by Naseer and Hong (2015), fNIRS signals arising from two mental tasks of right and left wrist MI were exploited for building a BCI. The mean and slope of changes in oxygenated hemoglobin (HbO) concentration were extracted as the feature signal for classification. The results, based on the slope of changes in HbO concentration, suggest an average classification accuracy of 87.28% across ten subjects using the data segment of 2-7 s. This degree of accuracy is on par with that obtained in our study (91.94% for the EEG data set, and 83.95% for the ECoG data set), demonstrating the promising potential of fNIRS-based BCIs.
Figure 7. Visualization of the most important spatial patterns of the two MI tasks derived from three CSP methods for the patient AM in the ECoG data set (64 raw electrodes in total). The dots in each topological map represent the whole channels in raw data or the channels selected by BPSO/BQPSO.
How many channels are necessary to achieve satisfactory classification performance (∼90%) for MI-based BCI? The answer depends upon several factors, including the subjects, the experimental conditions, and the signal processing algorithms used for classification. When the latter two factors are fixed, the number of channels required for high accuracy becomes subject-specific, i.e. it is heavily determined by the subjects themselves. It can be observed from Fig. 3 that an error rate of 10% was achieved by subjects al, aw, and ay, using 10 or fewer channels, and by subject using 20 channels. Subject av could not achieve this error rate regardless of the number of channels used for classification. It can be observed from the third row of Fig. 5 and Fig. 6 that the optimal position of electrodes might vary for different subjects, but was nevertheless primarily located in motor areas related to the corresponding limbs. Note that the results in Table 4 cannot be used to address the question of the number of channels, as the ECoG recording channels were confined to localized brain regions for the purpose of surgery. In summary, for most well-trained subjects, about 20 carefully selected channels can ensure satisfactory classification performance if the experimental conditions and classification algorithms are well-designed.
This study focused on channel selection methods in MI-based BCI applications. There are two requirements for channel selection: first, to reduce the number of channels, and second, to reduce the error rate compared to that yielded by using all available channels in the raw data. To this end, the BQPSO-based wrapping approach is proposed for channel selection. Although it is computationally demanding, the subset of selected channels can achieve better classification results. The proposed BQPSO-CSP method for channel selection outperforms Basic-CSP-1 in terms of both classification accuracy and the number of selected channels. That is to say, the BQPSO-CSP method can achieve higher classification accuracy with fewer channels compared to the CSP method using all available channels. As such, the convenience (fewer channels) and practicability (lower error rate) of a BCI system can be improved simultaneously.

Conclusion
To increase the classification ability of CSP, an evolutionary search algorithm, BQPSO, is proposed for channel selection and implemented in a wrapping manner. The fitness function of BQPSO is defined as the weighted sum of the error rate and the relative number of channels. The classification performance of the BQPSO-based CSP method was tested on two data sets and compared with that of BPSO-based CSP and of Basic-CSP, which employs either all channels or manually selected channels. Experimental results demonstrate that the proposed BQPSO-CSP method outperforms the BPSO-CSP method and, compared with the Basic-CSP method using all channels, reduces both the error rate and the number of channels required for an MI-based BCI.