Bayesian Uncertainty Modeling for P300-Based Brain-Computer Interface

P300 potential is important to cognitive neuroscience research and has also been widely applied in brain-computer interfaces (BCIs). Many neural network models, including convolutional neural networks (CNNs), have achieved outstanding P300 detection results. However, EEG signals are usually high-dimensional. Moreover, since collecting EEG signals is time-consuming and expensive, EEG datasets are typically small. Therefore, data-sparse regions usually exist within EEG datasets. However, most existing models compute predictions based on point estimates. They cannot evaluate prediction uncertainty and tend to make overconfident decisions on samples located in data-sparse regions, so their predictions are unreliable. To solve this problem, we propose a Bayesian convolutional neural network (BCNN) for P300 detection. The network places probability distributions over weights to capture model uncertainty. In the prediction phase, a set of neural networks can be obtained by Monte Carlo sampling. Integrating the predictions of these networks amounts to ensembling, so the reliability of prediction can be improved. Experimental results demonstrate that BCNN achieves better P300 detection performance than point-estimate networks. In addition, placing a prior distribution over the weights acts as a regularization technique; experimental results show that this improves the robustness of BCNN to overfitting on small datasets. More importantly, with BCNN, both weight uncertainty and prediction uncertainty can be obtained. The weight uncertainty is then used to optimize the network through pruning, and the prediction uncertainty is applied to reject unreliable decisions so as to reduce detection error. Therefore, uncertainty modeling provides important information for further improving BCI systems.


I. INTRODUCTION
Brain-computer interface (BCI) allows users to communicate and interact with external devices through brain signals without using any muscle or peripheral nerve activity [1]. It provides people with paralysis or other severe motor disabilities a new enhanced communication technology that greatly improves their quality of life. Event-related potentials (ERPs) are a type of brain signal commonly used in BCI; they are very small voltages generated in the brain in response to specific events or stimuli [2]. Among them, the P300 potential is the most widely studied ERP component. P300 is a positive potential evoked in the brain approximately 300 ms after the onset of stimulation and was first discovered by Sutton et al. in 1965 [3]. Currently, the P300-based speller is one of the most widely used BCIs.
An efficient P300 detection algorithm is key to implementing a BCI speller. Many traditional pattern recognition methods for P300 detection have been proposed. Kaper et al. [5] proposed an SVM-based P300 detection method. Linear discriminant analysis (LDA) [6] and Bayesian LDA (BLDA) [7] have also been widely applied in P300-based BCIs. Rakotomamonjy and Guigue [8] proposed an SVM-based ensemble method for P300 detection, which achieved the best performance on Dataset II of BCI Competition III.
Compared with traditional machine learning methods, deep learning is able to extract discriminative features from raw EEG signals, thus avoiding complex manual feature extraction. At present, deep learning has been widely used in P300 detection and has achieved excellent results. Cecotti and Graser [9] first applied a convolutional neural network (CNN) to P300 detection, which achieved high character recognition accuracy on Dataset II of BCI Competition III. Li et al. [10] proposed a spatial-temporal discriminative restricted Boltzmann machine to extract the spatial and temporal features of ERP; their experiments showed that the method achieved state-of-the-art ERP detection performance. Ma et al. [11] used a capsule network (CapsNet) for ERP detection and visualized the ERP features encoded in the capsules. They found that, in addition to P300, the P100 component also played an important role in ERP detection. A comprehensive review and evaluation of deep learning methods in P300 detection can be found in [12].
For most non-invasive BCI applications, multi-channel EEG data collection (e.g., 32 or 64 channels) is necessary. Hence, EEG signals are usually high-dimensional. Moreover, since collecting EEG signals is a time-consuming and costly process, EEG datasets are typically small. Therefore, data-sparse regions usually exist within EEG datasets. Classification of inputs located in these regions results in higher uncertainty. However, traditional deep networks use point estimates as weights. As a result, these networks cannot effectively evaluate uncertainty and tend to output high probabilities for samples from these regions [13], [14], so their predictions are sometimes overconfident.
To address this drawback of point-estimate networks, various Bayesian neural networks (BNNs) have been proposed [15]. A BNN places probability distributions over model weights to capture model uncertainty. In the prediction phase, a set of neural networks can be obtained by Monte Carlo sampling. Integrating the predictions of these networks amounts to ensembling, which yields more reliable predictions than point-estimate networks. To implement BNNs, various methods have been proposed [15], among which the Bayes by Backprop algorithm [16] and Monte Carlo dropout (MC dropout) [17] have been widely used. In view of the superiority of BNNs, they have been successfully applied in computer vision [18] and speech recognition [20]. Recently, BNNs have also been applied to EEG classification, including motor imagery classification [21], [23], sleep staging [22], and emotion recognition [24], and have achieved satisfactory results. Milanés-Hermosilla et al. [23] proposed a BNN combining a shallow CNN with variational inference, which achieved encouraging results on Datasets 2a and 2b from BCI Competition IV. Asadzadeh et al. [24] proposed a Bernoulli-Laplace-based BNN to overcome the challenge of the low spatial resolution of EEG recorders, which significantly improved emotion recognition accuracy on the SEED and DEAP datasets. It is worth mentioning that [21] and [22] built their BNNs on the MC dropout technique; moreover, they focused on classification accuracy and did not fully explore the application of uncertainty estimation in BCI.
Despite the importance of the uncertainty modeling provided by BNNs, there are few comprehensive and in-depth studies on applying BNNs in the BCI community. Therefore, we propose a Bayesian convolutional neural network (BCNN) for P300 detection based on Bayes by Backprop [16] and explore how the uncertainty information can improve BCI performance. By treating the weights as distributions and integrating over all possible values through Monte Carlo sampling, BCNN is expected not only to obtain more reliable predictions than point-estimate networks, but also to evaluate prediction uncertainty for each signal sample. The weight uncertainty can further be used to identify and remove redundant weights in the network. In addition, the weights can be regularized by placing prior distributions over them, so BCNN is expected to be more robust to overfitting than point-estimate networks.
The rest of this paper is organized as follows. Section II introduces the experimental data used in this work. The data preprocessing method and the proposed BCNN for P300 detection are introduced in Section III. In Section IV, the experimental results of P300 detection and character recognition are shown. Several applications of uncertainty estimation in P300 detection are discussed in Section V. Finally, conclusions are made in Section VI.

II. EXPERIMENTAL DATA
Dataset II of BCI Competition III and Dataset IIb of BCI Competition II are widely used P300 datasets [25], so we conducted experiments on them. The row-column paradigm in the experiments was proposed by Farwell and Donchin [4]. Specifically, the subjects were presented with a 6 × 6 character matrix (see Fig. 1) and focused their attention on the pre-specified target characters. Then, all the rows and columns of the character matrix were intensified in random order at a frequency of 5.7 Hz. Each intensification lasted 100 ms and the interval between intensifications was 75 ms. Each stimulation round included 12 intensifications (6 rows and 6 columns), among which only two highlighted the target character. Theoretically, the two intensifications that highlighted the target character would evoke P300 potentials, while the other ten would not. Therefore, the target character can be recognized by inferring the row and column in which P300 is evoked. Ideally, the 12 intensifications of a single stimulation round are enough to infer the target character. However, the signal-to-noise ratio (SNR) of P300 is usually low, which makes it difficult to accurately detect single-trial P300. To improve the character recognition accuracy, the stimulation round was repeated 15 times for each character in the experiments. Therefore, there were 30 target samples and 150 non-target samples for each character.
The EEG signals were recorded from 64 electrodes (shown in Fig. 2). The signals were digitized at 240 Hz and bandpass-filtered between 0.1 and 60 Hz [25]. Dataset II of BCI Competition III was collected from two subjects, A and B. We refer to their data as III-A and III-B, respectively. For each subject, there were 85 characters for training and 100 characters for testing. Therefore, the training set contains 2550 target samples and 12750 non-target samples, while the test set contains 3000 target samples and 15000 non-target samples. Dataset IIb of BCI Competition II was collected from one subject. We refer to the subject as Subject C and the dataset as II-C. There were 42 characters for training and 31 characters for testing in this dataset. Therefore, the training set contains 1260 target samples and 6300 non-target samples, while the test set contains 930 target samples and 4650 non-target samples. The number of target and non-target samples for each dataset is shown in Table I.
The probability of occurrence of target stimuli in the row-column paradigm is 1/6, resulting in a highly imbalanced dataset (see Table II).

III. METHODS

A. Data Preprocessing
The collected EEG signals are usually mixed with noise of different frequencies, so the raw signals need to be preprocessed to improve the SNR. First, the signals were bandpass-filtered from 0.1 to 20 Hz using a 4th-order Butterworth filter. Second, we extracted the 0-650 ms segment of the signal after stimulus onset for detection, giving 240 × 0.65 = 156 sample points per channel. Then, the signals were downsampled by a factor of 2 to reduce the number of sample points per channel to 78. The signal sample input to the network was an N_c × N_t matrix, where N_c is the number of channels and N_t is the number of sample points per channel (i.e., N_c = 64 and N_t = 78). Finally, each signal sample was normalized as follows:

x′_ij = (x_ij − µ_i) / σ_i,    (1)

where x_ij denotes the signal value of the i-th channel at the j-th sample point, and x′_ij denotes the normalized signal value. µ_i and σ_i are the mean and standard deviation of the signal of the i-th channel, respectively. It is worth noting that the mean and standard deviation are calculated from each individual sample. In traditional P300 detection, artifact removal techniques are usually necessary to improve detection performance. However, deep learning frameworks provide an end-to-end detection approach that can capture representative high-level features from raw brain signals [26]. Since a CNN is used in our work, no specific preprocessing was designed to deal with artifacts.
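The steps above can be sketched as a minimal NumPy routine. The 0.1-20 Hz Butterworth filtering is assumed to happen upstream (e.g., with scipy.signal); the window length, decimation factor, and per-sample channel normalization follow the text.

```python
import numpy as np

def preprocess_sample(raw, fs=240, window_s=0.65, decim=2):
    """Extract the post-stimulus window, downsample, and z-score per channel.

    raw: (64, n_samples) band-pass filtered EEG, stimulus onset at index 0.
    """
    n_t = int(fs * window_s)               # 240 * 0.65 = 156 points per channel
    x = raw[:, :n_t:decim]                 # keep every 2nd point -> 78 points
    mu = x.mean(axis=1, keepdims=True)     # per-channel mean of THIS sample
    sigma = x.std(axis=1, keepdims=True)   # per-channel std of THIS sample
    return (x - mu) / sigma                # x'_ij = (x_ij - mu_i) / sigma_i

sample = preprocess_sample(np.random.randn(64, 200))
print(sample.shape)  # (64, 78)
```

Note that the statistics are computed from each individual sample, not from the whole training set, matching the normalization described above.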

B. Network Architecture
CNN is capable of extracting discriminative features from raw signals and has been successfully used in BCI [12]. Cecotti and Graser [9] were the first to apply CNN to P300 detection. Their model achieves superior detection performance and is widely used as a benchmark. However, it is a point-estimate network and cannot provide prediction uncertainty. To improve prediction reliability, we propose a Bayesian convolutional neural network (referred to as BCNN) for P300 detection based on the architecture of this CNN. In BCNN, all point-estimate weights are replaced by probability distributions to capture model uncertainty. The architecture of BCNN is shown in Fig. 3.
The input of BCNN is a single-trial signal sample of size 64 × 78. The first layer is a spatial convolution layer, which contains ten convolution kernels of size 64 × 1 with a stride of 1. In this layer, the number of channels of the input signal is reduced from 64 to 1. Therefore, the convolution kernels in this layer act as spatial filters. The size of the output feature map in this layer is 1 × 78 × 10.
The second layer is the temporal convolution layer, which contains 50 convolution kernels of size 1 × 13 with a stride of 13. In this layer, the signal is convolved along the temporal dimension, so the convolution kernels act as temporal filters. The size of the output feature map of this layer is 1 × 6 × 50. The activation function used in [9] is adopted after the first two convolutional layers (2).
The third layer is a fully connected layer containing 100 neurons, with a sigmoid activation. The output layer is also a fully connected layer, containing two neurons. The outputs of these two neurons, P_1 and P_2, represent the probabilities that the input is a target and a non-target sample, respectively. The predicted label of the input sample is defined as:

C(X) = 1 if P_1 > P_2, and C(X) = 0 otherwise,    (3)

where X is the input sample and C(·) is the classifier. C(X) = 1 indicates that the current input signal is classified as a target sample; otherwise it is classified as a non-target sample. In addition, P_1 can be used to infer the target character because it represents the probability that the input EEG involves P300. More details on character recognition are described in Section III-E. The hyperparameters of BCNN (i.e., the size and number of spatial and temporal convolution kernels, and the number of neurons) were set according to the network configuration in [9] and various attempts in our experiments. Table III shows the number of parameters in each layer of BCNN; the total number of parameters is 75004. It is worth noting that the weights in the convolution kernels and in the fully connected layers are both distributions. Fig. 4 illustrates the difference between the convolution operations in a traditional CNN and in BCNN. Unlike the traditional CNN, BCNN uses distributions instead of point estimates as the weights of the convolution kernels, so each output unit in the feature map is also a distribution.
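As a sanity check on the total of 75004 parameters, the count can be reproduced from the layer sizes above, assuming each temporal kernel spans all ten spatial maps and that every Bayesian weight and bias stores two parameters (a mean and a standard deviation):

```python
# Point-estimate parameter counts per layer (weights + biases)
spatial  = 10 * (64 * 1 + 1)        # 10 kernels of size 64x1          -> 650
temporal = 50 * (10 * 1 * 13 + 1)   # 50 kernels of 1x13 over 10 maps  -> 6550
fc       = (6 * 50) * 100 + 100     # 300 features -> 100 neurons      -> 30100
out      = 100 * 2 + 2              # 100 -> 2 output neurons          -> 202

point_total = spatial + temporal + fc + out
bayes_total = 2 * point_total       # each weight stores (mu, sigma)
print(point_total, bayes_total)     # 37502 75004
```

The doubling relative to the point-estimate count is the price of replacing each scalar weight with a diagonal Gaussian.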

C. Bayes by Backprop Algorithm
In a traditional CNN, the weights can be learned through back-propagation. However, the weights in BCNN are distributions, so conventional back-propagation cannot be directly applied. Moreover, the posterior distribution of the weights is usually very complex and exact posterior inference is intractable, so many methods have been proposed to approximate it. Among them, variational inference (VI) is a commonly used technique [27], [28], which is adopted in this work. Specifically, VI approximates the true posterior P(w|D) with a tractable variational distribution q(w|θ), where D is the data and θ is the variational parameter. The Kullback-Leibler (KL) divergence [29] is usually used to measure the difference between two probability distributions. In order for the variational posterior to approximate the true posterior, the KL divergence between q(w|θ) and P(w|D) is minimized to obtain the optimal variational parameters θ_opt:

θ_opt = argmin_θ KL[q(w|θ) ∥ P(w|D)]
      = argmin_θ KL[q(w|θ) ∥ P(w)] − E_q(w|θ)[log P(D|w)] + log P(D),    (4)

where P(w) is the prior over the weights and P(D|w) is the likelihood term. Since the evidence P(D) is a constant, minimizing KL[q(w|θ) ∥ P(w|D)] is equivalent to minimizing the following objective function F(D, θ):

F(D, θ) = KL[q(w|θ) ∥ P(w)] − E_q(w|θ)[log P(D|w)].    (5)

This objective function is the negative of the evidence lower bound (ELBO) [30], also known as the variational free energy. By minimizing F(D, θ), θ_opt can be obtained. The first term in F(D, θ) is called the complexity cost and acts as a regularization term; the second term is called the likelihood cost. By expanding KL[q(w|θ) ∥ P(w)], F(D, θ) can be written as:

F(D, θ) = E_q(w|θ)[log q(w|θ) − log P(w) − log P(D|w)].    (6)

To minimize F(D, θ) in neural networks, Blundell et al. [16] proposed the Bayes by Backprop algorithm, which combines Monte Carlo sampling and back-propagation to learn the distributions over the weights.
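The variational objective F(D, θ) described above can be illustrated on a toy problem with a single Gaussian weight: draw w from the variational posterior q(w|θ) and average log q − log P(w) − log P(D|w) over the samples. The standard normal prior and Gaussian likelihood here are purely illustrative choices, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def f_objective(mu, sigma, data, n_mc=1000):
    """Monte Carlo estimate of F(D, theta) for one Gaussian weight."""
    w = rng.normal(mu, sigma, size=n_mc)           # w^(i) ~ q(w|theta)
    log_q = log_normal(w, mu, sigma)               # log q(w^(i)|theta)
    log_prior = log_normal(w, 0.0, 1.0)            # log P(w^(i)), N(0,1) prior
    log_lik = np.array([log_normal(data, wi, 1.0).sum() for wi in w])
    return np.mean(log_q - log_prior - log_lik)    # averaged MC estimate

data = rng.normal(0.5, 1.0, size=20)
# F is lower for a posterior centred where the data actually are
f_good = f_objective(np.mean(data), 0.1, data)
f_bad = f_objective(5.0, 0.1, data)
print(f_good < f_bad)  # True
```

A posterior far from the data pays heavily through the likelihood cost, which is exactly the trade-off F(D, θ) encodes.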
First, the method approximates (6) by Monte Carlo sampling as follows [16]:

F(D, θ) ≈ Σ_{i=1}^{n} [log q(w^(i)|θ) − log P(w^(i)) − log P(D|w^(i))],    (7)

where w^(i) denotes the i-th Monte Carlo sample drawn from the variational posterior q(w|θ) and n is the number of Monte Carlo samples. Reference [16] adopted a diagonal Gaussian distribution as the variational posterior. Let µ and σ denote the mean and standard deviation of the Gaussian distribution, respectively. Then, using the Bayes by Backprop algorithm proposed in [16], we can obtain the optimal variational parameters (i.e., µ_opt and σ_opt) by optimizing the objective function (7). The process of the Bayes by Backprop algorithm is described in [16]. After training, the variational posterior replaces the true posterior to make predictions. The output of BCNN on test data, P(y|x, D), is defined as follows:

P(y|x, D) = ∫ P(y|x, w) q(w|θ) dw,    (8)

where x is the input data and y is the label. It can be seen that BCNN is equivalent to an ensemble of infinitely many networks. This not only improves the generalization performance of the network but also avoids overconfident predictions. The disagreement between the predictions of these networks can be interpreted as prediction uncertainty. However, the integral in (8) is computationally intractable, so it is usually estimated by the Monte Carlo method. The prediction of BCNN can be estimated as follows:

P(y|x, D) ≈ (1/T) Σ_{t=1}^{T} P(y|x, w^(t)),    (9)

where w^(t) denotes the t-th Monte Carlo sample drawn from q(w|θ) and T is the number of Monte Carlo samples. In this work we set T = 100. Kwon et al. [39] proposed to measure the prediction uncertainty by the variance of the prediction distribution:

Var ≈ (1/T) Σ_{t=1}^{T} [diag(p̂_t) − p̂_t p̂_tᵀ] + (1/T) Σ_{t=1}^{T} (p̂_t − p̄)(p̂_t − p̄)ᵀ,    (10)

where diag(v) denotes a diagonal matrix with the elements of the vector v, p̂_t denotes the predicted probability vector output by the network in the t-th Monte Carlo simulation, and p̄ = (1/T) Σ_{t=1}^{T} p̂_t. According to the derivation in [39], the first term in (10) is the aleatoric uncertainty. It captures noise inherent in the observations [18] and is therefore known as data uncertainty.
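The Monte Carlo prediction of (9) and Kwon et al.'s variance decomposition of (10) can be sketched in NumPy. The probability vectors below are hypothetical MC outputs for one sample, not values from the paper:

```python
import numpy as np

def predict_with_uncertainty(p):
    """Average T Monte Carlo softmax outputs and decompose their variance.

    p: (T, K) array; p[t] is the output of the t-th sampled network.
    Returns the averaged prediction plus the aleatoric and epistemic
    parts of the predictive variance (both K x K matrices).
    """
    p_bar = p.mean(axis=0)                                           # eq. (9)
    aleatoric = np.mean([np.diag(pt) - np.outer(pt, pt) for pt in p], axis=0)
    epistemic = np.mean([np.outer(pt - p_bar, pt - p_bar) for pt in p], axis=0)
    return p_bar, aleatoric, epistemic

# Four hypothetical MC passes for one sample (target vs. non-target)
p = np.array([[0.90, 0.10],
              [0.80, 0.20],
              [0.85, 0.15],
              [0.95, 0.05]])
p_bar, alea, epis = predict_with_uncertainty(p)
print(p_bar)  # [0.875 0.125]
```

The aleatoric term depends only on how uncertain each individual network is, while the epistemic term grows with the disagreement between the sampled networks.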
The second term in (10) is epistemic uncertainty. It accounts for uncertainty in the model parameters [18] and is known as model uncertainty.
Following [16], we adopted a diagonal Gaussian distribution as the variational posterior. In addition, [16] chose a scale mixture of two zero-mean Gaussian densities as the prior over the weights:

P(w) = Π_j [π N(w_j | 0, σ_1²) + (1 − π) N(w_j | 0, σ_2²)],    (11)

where w_j denotes the j-th weight of the network and π is the scaling factor. Blundell et al. [16] suggested that the standard deviation of the first Gaussian density should be larger than that of the second, i.e., σ_1 > σ_2, and that σ_2 should be much less than 1. We compared this Gaussian mixture (GMM) with single Gaussian and Laplace priors; experimental results show that the GMM prior performed best among all the prior distributions, so it was used. We tried various combinations of hyperparameters in (11) and determined the optimal configuration as σ_1 = 0.1, σ_2 = 0.0005, and π = 0.5. These prior parameters were shared among all the weights of the network. We did not optimize the prior parameters during the training phase because doing so did not improve performance but led to slow convergence and local minima [16].
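The log-density of the scale-mixture prior in (11), with the hyperparameters reported in the text (σ_1 = 0.1, σ_2 = 0.0005, π = 0.5), can be sketched for a single weight:

```python
import numpy as np

def normal_pdf(w, sigma):
    """Zero-mean Gaussian density at w."""
    return np.exp(-w**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def log_mixture_prior(w, pi=0.5, sigma1=0.1, sigma2=0.0005):
    """log P(w) under the scale mixture of two zero-mean Gaussians, eq. (11)."""
    return np.log(pi * normal_pdf(w, sigma1) + (1 - pi) * normal_pdf(w, sigma2))

# The narrow component concentrates mass near zero, so weights far from
# zero are penalized much more than near-zero weights.
print(log_mixture_prior(0.0) > log_mixture_prior(0.3))  # True
```

The very small σ_2 makes the prior behave like a spike-and-slab surrogate: many weights are pushed toward zero, which is the regularizing effect discussed above.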

D. Training
In BCNN, both the weights W and the biases b follow Gaussian distributions. The means of the weights, W_µ, were initialized with Kaiming uniform initialization [31]. This initialization prevents the output of the activation layer from vanishing or exploding during forward propagation, which benefits the convergence of the network. The means of the biases, b_µ, were initialized as follows:

b_µ ∼ U[−1/√n, 1/√n],    (12)

where U[a, b] denotes the uniform distribution on the interval [a, b] and n is the number of feature points in the output of the previous layer. The standard deviations of the weights and biases were initialized to a constant value of 0.05. The number of Monte Carlo samples was set to 3 during the training phase (i.e., n = 3 in (7)) and 100 during the testing phase (i.e., T = 100 in (9)). We chose stochastic gradient descent [32] as the optimizer, with a learning rate of 0.001 and a batch size of 100. Our models were implemented in PyTorch and ran on a single Titan Xp GPU. The models were trained for 100 epochs. The training time of BCNN was 23.95 minutes and the prediction time per sample was 0.0057 seconds, which meets the real-time requirements of a P300 BCI.

E. Target Character Recognition
The P300 detection probability P_1 output by BCNN (see (3)) can be used to infer the target character. It is worth noting that P_1 is obtained by averaging over the 100 Monte Carlo simulations in (9). In theory, the target character can be recognized using the signals from only one stimulation round. However, due to the low SNR of P300, character recognition accuracy is often poor in the single-trial case. Hence, the cumulative probability over n (1 ⩽ n ⩽ 15) stimulation rounds is usually used to improve character recognition accuracy. Let P_1(i, j) denote the P_1 output by BCNN when the input signal is evoked by the j-th (1 ⩽ j ⩽ 6) column intensification or the (j − 6)-th (7 ⩽ j ⩽ 12) row intensification at the i-th stimulation round. Then the cumulative probability of P300 detection for intensification j (1 ⩽ j ⩽ 12) over the first n (1 ⩽ n ⩽ 15) stimulation rounds, q(j), is calculated as follows:

q(j) = Σ_{i=1}^{n} P_1(i, j).    (13)

Then the column index c and row index r of the predicted target character are obtained as follows:

c = argmax_{1⩽j⩽6} q(j),   r = argmax_{7⩽j⩽12} q(j) − 6.    (14)

The predicted character lies at the intersection of the r-th row and the c-th column in the character matrix (see Fig. 1).
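The cumulative-probability character recognition above can be sketched as follows, with a synthetic P_1 matrix standing in for real network outputs:

```python
import numpy as np

def recognize_character(P1, n_rounds):
    """Infer the row/column of the target character.

    P1: (15, 12) array; P1[i, j] is the P300 probability for intensification
    j+1 at stimulation round i+1 (j+1 in 1..6: columns, 7..12: rows).
    """
    q = P1[:n_rounds].sum(axis=0)    # cumulative score q(j) over first n rounds
    c = np.argmax(q[:6]) + 1         # best column index, 1..6
    r = np.argmax(q[6:]) + 1         # best row index, 1..6
    return r, c

rng = np.random.default_rng(1)
P1 = rng.uniform(0.0, 0.4, size=(15, 12))
P1[:, 2] += 0.5   # simulated target column 3 (intensification j = 3)
P1[:, 8] += 0.5   # simulated target row 3 (intensification j = 9)
print(recognize_character(P1, 5))  # (3, 3)
```

Summing probabilities over rounds averages out single-trial noise, which is why accuracy improves as n grows.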

IV. EXPERIMENTAL RESULTS

A. P300 Detection
P300 detection is not only a binary classification task but also an imbalanced classification task. It is worth noting that the input of BCNN is a single-trial signal. To measure the detection performance of a model, the following evaluation metrics are used in this paper: true positive (TP), true negative (TN), false positive (FP), false negative (FN), Recall, Precision, F1, and recognition rate (Reco.). F1 is the harmonic mean of Recall and Precision; a higher F1 indicates better detection performance on both the positive and negative samples. In contrast, Reco. cannot measure the average performance on positive and negative samples. Therefore, F1 is used as the primary evaluation metric in this paper. Since BCNN is built on the network structure of the CNN proposed by Cecotti and Graser [9], we compare its performance with that of their models. Note that the CNN models in [9] are traditional point-estimate networks. Among the seven models proposed in [9], CNN-1 and the multi-classifier convolutional neural networks (MCNN-1 and MCNN-3) are the three models with the best classification performance; MCNN-1 and MCNN-3 are ensemble models based on CNN-1. In addition, to verify whether applying the Bayesian uncertainty estimation technique improves the performance of the network, we replace the weight distributions in BCNN with point estimates and test the resulting model, called P300-CNN. The network structure of P300-CNN is the same as that of BCNN, but P300-CNN is a deterministic neural network optimized by traditional back-propagation.
The P300 detection results of BCNN and the other models on the three datasets are shown in Tables IV, V, and VI. The numbers in bold indicate the highest F1 and Reco. among all the models. The accuracies of CNN-1, MCNN-1, and MCNN-3 presented in Tables IV and V are taken from [9]; Table VI lacks the results of these three models because [9] did not report accuracy scores for Dataset II-C. The experimental results show that BCNN achieves the highest F1 and Reco. among all the models on the three datasets, indicating that BCNN achieves superior single-trial P300 detection accuracy. In addition, the detection performance of P300-CNN is lower than that of BCNN on all three datasets, which demonstrates that applying the Bayesian uncertainty estimation technique in a neural network helps improve P300 detection performance.

B. Character Recognition
The numbers of characters correctly recognized by BCNN and the other models on the three datasets are shown in Tables VII, VIII, and IX. There are 100 test characters in Datasets III-A and III-B and 31 test characters in Dataset II-C. The numbers in bold indicate the maximum values in each column (a "-" symbol means that the prior literature did not provide the accuracy). It has been reported that ensembles of support vector machines (ESVM) [8] achieved the best recognition performance on Dataset II of BCI Competition III, and that SVM [5] and a method based on Student's t-statistic and the continuous wavelet transform (t-CWT) [33] achieved the highest accuracy on Dataset IIb of BCI Competition II; hence, we compared BCNN with these methods. When the number of repetitions is n, only the signal samples from the first n stimulation rounds were used to recognize the characters. As the tables show, when the number of repetitions reached 15, BCNN correctly recognized 99% and 97% of the characters for Datasets III-A and III-B, respectively, the highest recognition accuracy among all the methods. For Dataset II-C, BCNN needed only 4 repetitions to correctly recognize all the characters, whereas the other methods required 5 or 6. Therefore, BCNN achieves outstanding character recognition performance, which can improve the practicality of P300-based character spellers. Moreover, the recognition accuracy of P300-CNN is lower than that of BCNN on all three datasets, which further validates the effectiveness of the Bayesian uncertainty estimation technique in BCI deep learning.
We used paired t-tests to compare the character recognition accuracy of BCNN with that of the other methods. The analysis was performed using the accuracies from all the repetitions (i.e., 30 pairs for Datasets III-A and III-B; 15 pairs for Dataset II-C). The results are shown in Table X. It can be seen that the recognition performance achieved by BCNN on the three datasets is significantly higher than that of the other methods. We further compared the information transfer rate (ITR) of BCNN with that of the other methods. ITR is commonly used to measure the speed of character spellers and is defined as follows:

ITR = (60/T) [log_2 N + P log_2 P + (1 − P) log_2((1 − P)/(N − 1))],    (15)

where P denotes the character recognition accuracy and N denotes the number of character categories (i.e., N = 36).
T denotes the selection time. In the row-column paradigm, each intensification lasts 100 ms and the interval between intensifications is 75 ms. Each stimulation round includes 12 intensifications, so the time required for one stimulation round is 12 × (100 + 75) = 2100 ms. There is an interval of 2.5 s before each character experiment, so T is calculated as follows:

T = 2.5 + 2.1n,    (16)

where n is the number of repetitions of the stimulation round. Fig. 5 shows the average ITR obtained by each method on Datasets III-A and III-B; the ITR is calculated from the average accuracy of Subjects A and B. Fig. 6 shows the ITR obtained by each method on Dataset II-C. It can be seen that BCNN achieved significantly higher ITR than the other methods, including the CNNs, especially in the first few stimulation rounds. These results show that BCNN can effectively improve the spelling efficiency of BCI.
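The ITR computation can be sketched as follows; the selection time T = 2.5 + 2.1n seconds follows from the timing just described, N = 36, and the guarded case P = 1 avoids log_2(0):

```python
import math

def itr_bits_per_min(P, n, N=36):
    """Information transfer rate (bits/min) for the row-column speller.

    P: character recognition accuracy; n: number of stimulation-round
    repetitions; N: number of character categories.
    """
    T = 2.5 + 2.1 * n                      # selection time in seconds
    if P >= 1.0:
        bits = math.log2(N)                # limit of the formula as P -> 1
    else:
        bits = (math.log2(N) + P * math.log2(P)
                + (1 - P) * math.log2((1 - P) / (N - 1)))
    return 60.0 / T * bits                 # bits per character -> bits per minute

# Perfect accuracy after 4 repetitions (as BCNN achieved on Dataset II-C)
print(round(itr_bits_per_min(1.0, 4), 2))  # 28.46
```

Because T grows linearly with n while the per-character information is capped at log_2 36 ≈ 5.17 bits, recognizing characters with fewer repetitions is what raises the ITR.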

V. DISCUSSION

A. Model Performance on Small Dataset
To investigate the robustness of BNN on small EEG datasets, we trained BCNN and P300-CNN with only 75%, 50%, and 25% of the training data. Recall that P300-CNN is a point-estimate network with the same architecture as BCNN. We conducted the experiments on Dataset III-A. Table XI shows the P300 detection results of BCNN and P300-CNN before and after reducing the training data, and Table XII shows the number of correctly recognized characters. It can be seen that the F1 achieved by BCNN remained high when using less training data. From Table XII, the number of characters recognized by BCNN was significantly higher than that of P300-CNN for all sizes of training data. With only 25% of the training data, BCNN could still recognize up to 96 characters, whereas P300-CNN recognized only 90, far fewer than BCNN. We conducted paired t-tests to compare the character recognition accuracy of BCNN and P300-CNN when training with different proportions of the training data. The results are shown in Table XIII: for every proportion of training data, BCNN achieved significantly higher character recognition accuracy than P300-CNN (p < 0.05). Based on the above results, BCNN still performs well on small EEG datasets. CNN was originally proposed for image processing, where the amount of data is usually large enough to meet its training needs. However, EEG datasets are usually much smaller than image datasets, so CNN is prone to overfitting on EEG data. In BNN, the weights are regularized by placing prior distributions over them [34], so BCNN is more robust to overfitting and therefore still performs well on small datasets.
In addition, the prediction of BCNN is obtained by integrating the outputs of multiple networks, which improves the generalization ability of the network.

B. Analysis of Spatial Filters
The first layer of BCNN contains ten spatial filters. To study the spatial features extracted by BCNN from the input signal, we visualized the weights of the spatial filters. Blundell et al. [16] measured the importance of a weight by the SNR of its distribution, defined as |µ|/σ, where µ and σ are the mean and standard deviation of the distribution, respectively. Weights with high SNR are considered informative. For visualization, we first calculated the SNR of the weights of each spatial filter. Then, for each subject, the SNRs of the weights of the ten filters were averaged and normalized to [0, 1]. Finally, the averaged SNRs were projected onto the scalp electrodes, as shown in Fig. 7. The red regions correspond to weights with high SNR, and the electrodes located in these regions provide more discriminative information; the blue regions are the opposite. These three topographic maps are similar to the results reported in [9] and [10]. Table XIV shows the eight electrodes with the highest SNR in the spatial filters for each subject, arranged in descending order of SNR. As can be seen from Fig. 7 and Table XIV, the discriminative electrodes are mainly located in the parietal, occipital, and central regions. This result is consistent with [35] and [36]. In addition, the electrodes listed in Table XIV are similar to those selected in [8], [36], and [37]; the common electrodes include Pz, PO7, PO8, POz, CPz, O1, PO3, and PO4. These results demonstrate that BCNN is able to extract discriminative spatial features from the P300 signal.

C. Network Pruning
There are usually redundant weights in a trained network. These weights do not improve the performance of the network and increase the risk of overfitting. In traditional point-estimate neural networks, weights with small absolute values are considered redundant [38]. In BNN, each weight is a distribution, so Blundell et al. [16] identified redundant weights by measuring weight uncertainty and investigated the level of redundancy in the network through pruning. Specifically, they first computed the SNR of each weight distribution; the weights with the lowest SNR were then removed by replacing them with a constant zero. To investigate the level of redundancy in BCNN for P300 detection, we applied the method in [16] to prune BCNN. As a comparison, pruning based on the absolute values of the weights was performed on the point-estimate model P300-CNN, where the weights with the lowest absolute values were removed. We examined the performance of the two networks at pruning rates of 50%, 75%, and 85%. Since the experimental results were consistent across the three datasets, we take Dataset III-A as an example to present the results. Tables XV and XVI show, respectively, the P300 detection results and the character recognition results of BCNN and P300-CNN at different pruning rates (BCNN is pruned by the SNR of the weights, and P300-CNN by the absolute values of the weights); results at a 0% pruning rate are those before pruning. When 50% and 75% of the weights were removed, the drop in F1 and character recognition accuracy of BCNN was negligible, so its detection performance was hardly affected at these two pruning rates. In contrast, the F1 and character recognition accuracy of P300-CNN were significantly reduced after pruning.
After removing 85% of the weights, BCNN could still recognize 96% of the characters, whereas P300-CNN recognized only 79%, much lower than BCNN. This indicates that many weights in P300-CNN that were helpful for detection were eliminated. We further performed paired t-tests to compare the character recognition accuracy of BCNN and P300-CNN at the various pruning rates. As shown in Table XVII, BCNN achieved significantly higher character recognition accuracy than P300-CNN at every pruning rate (p < 0.05). Comparing the pruning results of BCNN and P300-CNN shows that pruning based only on the absolute values of the weights degrades network performance, indicating that weight uncertainty provides important information for pruning. Therefore, by exploiting weight uncertainty, we can eliminate redundant weights in BCNN while retaining detection performance, thereby reducing unnecessary computation at inference time.
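The SNR-based pruning of [16] can be sketched as follows; the posterior parameters are random stand-ins for the trained BCNN's weights, and the 75% rate matches one of the settings above:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = rng.normal(size=1000)               # posterior means of 1000 weights
sigma = rng.uniform(0.01, 1.0, 1000)     # posterior standard deviations

def prune_by_snr(mu, sigma, rate):
    """Zero out the fraction `rate` of weights with the lowest |mu|/sigma."""
    snr = np.abs(mu) / sigma
    k = int(rate * mu.size)
    drop = np.argsort(snr)[:k]           # indices of the k lowest-SNR weights
    mask = np.ones_like(mu, dtype=bool)
    mask[drop] = False                   # False = pruned (set to constant zero)
    return mu * mask, mask

mu_pruned, mask = prune_by_snr(mu, sigma, 0.75)
```

For the point-estimate baseline, the same routine applies with `snr` replaced by `np.abs(w)`, which is exactly the magnitude-based pruning used for P300-CNN.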

D. Result of Classification With Rejection Option
In some practical BCI applications, models that lack the ability to estimate uncertainty can be catastrophic: in a brain-controlled wheelchair or rehabilitation robot, for example, acting on an overconfident but wrong decision may endanger the user, so it is safer for the model to refrain from classifying samples it is uncertain about.

[TABLE XVI: Number of correctly recognized characters of BCNN and P300-CNN at different pruning rates on Dataset III-A (test set = 100 characters; BCNN pruned by SNR of weights, P300-CNN pruned by absolute values of weights).]

To verify the feasibility of this idea, we investigated whether rejecting samples with high prediction uncertainty could improve the performance of the network. Equation (10) gives us a way to estimate the aleatoric and epistemic uncertainty of each sample. Shridhar et al. [40] demonstrated that aleatoric uncertainty depends on the dataset rather than the model, since the same dataset exhibits roughly constant aleatoric uncertainty across different models. In contrast, BNN exhibits increased epistemic uncertainty for test samples located in data-sparse regions, and Cabiscol [41] demonstrated that epistemic uncertainty can be reduced by observing more data. We therefore used epistemic uncertainty as the measure of prediction uncertainty for evaluating the model. In the testing phase, we calculated the epistemic uncertainty of each test sample, sorted all test samples by this value, and had the network reject (i.e., not classify) the samples with the highest epistemic uncertainty. We then recalculated the P300 detection metrics after rejecting these samples. Table XVIII shows the P300 detection results of BCNN at different rejection rates on the three datasets; a rejection rate of n% means that the n% of samples with the highest epistemic uncertainty are rejected by BCNN. The results show that both F1 and Reco. improved when samples with high epistemic uncertainty were rejected.
The more samples were rejected, the higher the F1 and Reco. achieved by BCNN. It is worth noting that, after rejection, FP and FN were reduced much more than TP and TN. Since FP and FN count misclassified samples, most of the rejected samples were misclassified ones, which explains why rejecting them improved the P300 detection accuracy. In summary, BCNN is able to evaluate prediction uncertainty, and highly uncertain inferences are more likely to be wrong. By refraining from classifying such ambiguous samples, the detection error rate of the model can be greatly reduced, thereby improving the reliability of the BCI.
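The rejection procedure can be sketched as follows. This uses the common mutual-information decomposition of predictive uncertainty (total entropy minus expected entropy), which may differ in detail from the paper's Eq. (10); the Monte Carlo predictions are random stand-ins for the T stochastic forward passes through the trained BCNN.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Monte Carlo predictions: T sampled networks, N test samples,
# 2 classes (P300 / non-P300), each row a valid probability vector.
T, N = 50, 200
probs = rng.dirichlet([1.0, 1.0], size=(T, N))   # shape (T, N, 2)

def entropy(p, axis=-1):
    return -np.sum(p * np.log(p + 1e-12), axis=axis)

mean_p = probs.mean(axis=0)                  # ensemble prediction per sample
total = entropy(mean_p)                      # total predictive uncertainty
aleatoric = entropy(probs).mean(axis=0)      # expected entropy over samples
epistemic = total - aleatoric                # mutual information (>= 0)

# Reject the n% of samples with the highest epistemic uncertainty.
reject_rate = 0.10
n_reject = int(reject_rate * N)
rejected = np.argsort(epistemic)[::-1][:n_reject]
accepted = np.setdiff1d(np.arange(N), rejected)
```

Detection metrics (F1, Reco.) would then be recomputed on `accepted` only, mirroring the evaluation in Table XVIII.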

VI. CONCLUSION
Unlike traditional point-estimate neural networks, BNN can efficiently estimate prediction uncertainty and thus provide more reliable classification results. However, few studies have comprehensively explored its application to P300 detection. In view of the advantages of BNN, we proposed a BCNN model for P300 detection and explored its various applications in BCI. The proposed BCNN achieves better P300 detection performance than point-estimate CNN and other benchmark methods. We also demonstrated that BCNN can still achieve satisfactory performance on small EEG datasets. In addition, the pruning experiments show that weight uncertainty can be used to eliminate redundant weights while maintaining network performance, which helps to reduce storage cost. More importantly, we designed a classification method with a rejection option based on the prediction uncertainty output by BCNN. The experimental results show that samples with high prediction uncertainty are likely to be misclassified; by rejecting these ambiguous samples, the detection accuracy can be greatly improved. Although this paper studies the application of BNN to P300 detection, our methods can also be applied to BCI systems with other modalities.