ST-CapsNet: Linking Spatial and Temporal Attention With Capsule Network for P300 Detection Improvement

A brain-computer interface (BCI), which provides direct human-machine interaction, has gained substantial research interest in the last decade for its great potential in applications including rehabilitation and communication. Among these applications, the P300-based BCI speller is a typical one that identifies the characters a user intends to select. However, the applicability of the P300 speller is hampered by its low recognition rate, which is partially attributed to the complex spatio-temporal characteristics of EEG signals. Here, we developed a deep-learning analysis framework named ST-CapsNet to improve P300 detection using a capsule network with both spatial and temporal attention modules. Specifically, we first employed spatial and temporal attention modules to obtain refined EEG signals by capturing event-related information. The refined signals were then fed into the capsule network for discriminative feature extraction and P300 detection. To quantitatively assess the performance of the proposed ST-CapsNet, two publicly available datasets (i.e., Dataset IIb of BCI Competition 2003 and Dataset II of BCI Competition III) were used. A new metric, averaged symbols under repetitions (ASUR), was adopted to evaluate the cumulative effect of symbol recognition under different repetitions. In comparison with several widely used methods (i.e., LDA, ERP-CapsNet, CNN, MCNN, SWFP, and MsCNN-TL-ESVM), the proposed ST-CapsNet framework significantly outperformed the state-of-the-art methods in terms of ASUR. More interestingly, the absolute values of the spatial filters learned by ST-CapsNet are higher in the parietal lobe and occipital region, which is consistent with the generation mechanism of P300.


I. INTRODUCTION
BRAIN-COMPUTER interfaces (BCIs) provide an opportunity for people to directly interact with their surroundings through brain waves [1], [2]. For example, Long et al. combined motor imagery and P300 potentials to control 2-D cursor movement [3], and further developed a BCI-based system to control the movement of a wheelchair [4]. Wang et al. identified the user's gaze direction using frequency-encoded steady-state visual evoked potentials [5]. Lin et al. developed a BCI-based system to estimate drivers' drowsiness [6]. Zheng et al. proposed a high-performance brain switch based on code-modulated visual evoked potentials with both fast reaction and a low false positive rate (FPR) during the idle state [7]. Among the methods of acquiring brain waves, electroencephalography (EEG) has attracted many researchers due to its high temporal resolution and noninvasive nature [8], [9]. An event-related potential (ERP) is a brain response in the EEG that is elicited directly by a specific event [10]. A typical ERP component, P300, which occurs around 300 ms after the target stimulus onset at the parietal lobe, has been widely used in BCI [11], [12], [13]. For instance, Farwell and Donchin [14] proposed a P300 speller paradigm in 1988, allowing individuals to type with their minds. Many P300 datasets are based on this pioneering paradigm. It is worth mentioning that the international BCI competition datasets also include the P300 paradigm, and they are usually used as benchmark datasets to compare the performance of various models on EEG classification. Alain Rakotomamonjy and Vincent Guigue [15] won BCI Competition III [16] using an ensemble of support vector machines (ESVMs) for P300 detection. However, the method does not take into account the importance of the individual electrodes and simply feeds the raw data into the classifier for training.
To improve the accuracy of detecting ERP signals, Rivet et al. [17] proposed xDAWN, a spatial filtering method, to enhance P300 potentials with respect to the Non-P300 potentials; further, Barachant generalized xDAWN to any type of ERP [18]. Most recently, with graphics processing units (GPUs) becoming more powerful, deep learning has grown tremendously. Zhang et al. proposed an improved EEGNet [19] that combined xDAWN spatial filtering with EEGNet [20] for the individually calibrated rapid serial visual presentation (RSVP) task and won second place in the BCI Controlled Robot Contest at the 2022 World Robot Contest [21]. Wang et al. proposed denoising autoencoder neural networks, which can automatically learn features from unlabeled data and mitigate the problem of local minima caused by random initialization, and improved the symbol recognition accuracy by about 0.7% compared to ESVMs [22]. Cecotti and Graser [23] used convolutional neural networks to detect P300 for the first time and achieved a high recognition rate (95.5%) in the 15th repetition. However, their method has a low symbol recognition rate in the first 5 or even 10 repetitions, leading to a low information transfer rate (ITR). To further increase the symbol recognition rate in the first 5 repetitions, Wang et al. [24], who won the P300-based BCI competition at the 2019 World Robot Conference, proposed Multiscale-CNN to enhance the performance of P300 detection. Three temporal kernels at different scales were applied in its temporal convolution layer to obtain discriminative temporal features. However, some valuable information that would help classification is lost during forward propagation, because the max pooling operation used to reduce the feature maps retains only the most active features and discards the rest. To overcome the information loss in the pooling operation, Sabour et al. [25] proposed the capsule network (CapsNet).
A capsule contains a set of neurons whose output is a vector that represents various instantiation parameters of an entity, such as position, size, and rotation. The length of the vector represents the probability of the corresponding class. The lower-level capsules are connected to the higher-level capsules by a dynamic routing algorithm. Several recent studies have demonstrated that CapsNet could achieve better performance than traditional techniques. For example, in our previous study we used a multi-kernel capsule network to identify schizophrenia, which outperformed other methods [26]. Chao et al. combined a multiband feature matrix (MFM) with CapsNet and outperformed 2D-CNN in emotion recognition [27]. Liu et al. employed 1D-CapsNet to detect P300 and reached a 96% symbol recognition rate [28]. Ma et al. used ERP-CapsNet for ERP detection, obtained much better results than the traditional machine learning algorithms and CNNs [29], and also explained how P300 components are preserved in capsules. However, ERP-CapsNet just took the raw EEG signals as input, which introduced additional noise.
In order to reduce signal noise and further improve the P300 detection accuracy, we employed spatial and temporal attention mechanism to refine the input EEG signals, and then fed the refined EEG signals into ERP-CapsNet for classification. Several attention mechanisms have been widely used, such as the Squeeze-Excitation (SE) block [30] proposed by Hu et al. which adaptively generates channel attention maps and recalibrates the feature responses of channels by explicitly modelling the interdependencies between channels. It first generates average-pooled features from the original convolutional feature maps via the average pooling functions, then feeds the generated features into a multilayer perceptron (MLP) with Sigmoid activation, which yields a channel attention map. Then the element-wise multiplication of original convolutional feature maps with the channel attention map gives the calibrated channel feature response, i.e., the channel refined feature map. The work of Hu et al. inspired Woo et al. to develop a more powerful attention mechanism, the Convolutional Block Attention Module (CBAM) [31]. It consists of a channel attention module and a spatial attention module. The channel attention module is a variant of the SE block. It generates average-pooled and max-pooled features from the original convolutional feature maps via the average pooling and max pooling functions which are then fed into a shared MLP where the outputs are summed and activated by a Sigmoid function to produce a channel attention map. The channel attention map is also element-wise multiplied with original convolutional feature maps to obtain channel refined feature maps. The spatial attention module first compresses the channel refined feature maps into two features via the max and average pooling functions respectively, and then generates the spatial attention map via a 7 × 7 convolution. 
Finally, the channel refined feature maps are element-wise multiplied with the spatial attention maps to obtain channel and spatial refined feature maps. Their experimental results on various image datasets showed that inserting CBAMs into the baseline model can significantly improve the classification performance. Inspired by this, we try to combine ERP-CapsNet [29] with CBAMs, which we call ST-CapsNet, expecting to improve the performance of P300 detection.
The main contributions of this work are summarized as follows: 1) To our knowledge, this is the first attempt to combine spatial and temporal attention with a capsule network to improve the accuracy of P300 detection. 2) We proposed a more comprehensive method (ASUR) to measure symbol recognition performance by comparing the average correctly recognized symbols under the first 5, 10 and 15 repetitions of a stimulus round.

A. Description
The data sets used in this paper are the dataset IIb of BCI competition 2003 and dataset II of BCI competition III [16]. We separated dataset II into two data sets: dataset II-A and II-B because it contains two subjects (subjects A and B). These datasets are complete records of P300 evoked potentials recorded with BCI2000 [32] using a paradigm described by Farwell and Donchin [14]. The subjects were presented with a 6 × 6 matrix of symbols. All rows and columns in the matrix were randomly intensified at a frequency of 5.7 Hz. By staring at the desired symbol in the matrix, a P300 evoked potential would occur in the subjects' brains when the desired symbol flashed. When other symbols flashed, stimulated potentials do not have a P300 component and are called Non-P300 evoked potentials. The P300 potentials are different from the Non-P300 potentials, because the rare target stimuli cause subjects' brains to generate P300 potentials [33]. Six columns and six rows were randomly intensified in the matrix; only one column and one row contain the desired symbol, which means there are two P300 evoked potentials and ten Non-P300 evoked potentials in one stimulation round. Due to the extremely low SNR of ERPs, the stimulation round should be repeated several times to improve the P300 recognition accuracy. The EEG data was recorded from 64 electrodes at a sampling rate of 240 Hz in several sessions. Each session consisted of a number of runs. In each run, subjects focused on a series of symbols. At first the screen was displayed for 2.5 seconds, during which time each symbol had the same intensity (i.e., the matrix was blank). Subsequently, one of the rows or columns in the matrix was randomly enhanced for 100 ms, and then the matrix was blanked for 75 ms. The enhancement of the rows/columns was carried out randomly 12 times in a block. The block was repeated 15 times for each symbol to spell. 
There were a total of 31 symbols in dataset IIb, and 100 symbols in datasets II-A and II-B. Table I shows the number of P300 and Non-P300 samples for training and testing in each dataset. For more information pertaining to the dataset, please refer to https://www.bbci.de/competition.
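The timing and class balance implied by the paradigm description can be checked with a few lines of arithmetic. The sketch below uses only the figures stated above (100 ms intensification, 75 ms blank, 12 intensifications per block, 15 repetitions, two target intensifications per block); the variable names are illustrative.

```python
# Timing and epoch counts implied by the paradigm description.
INTENSIFY_MS = 100        # duration of one row/column intensification
BLANK_MS = 75             # blank period after each intensification
STIMULI_PER_BLOCK = 12    # 6 rows + 6 columns
REPETITIONS = 15          # blocks per symbol
TARGETS_PER_BLOCK = 2     # one row and one column contain the target

# One intensification cycle lasts 175 ms, i.e. about 5.7 Hz.
stimulus_rate_hz = 1000 / (INTENSIFY_MS + BLANK_MS)

# Duration of one block (one repetition) in seconds.
block_seconds = STIMULI_PER_BLOCK * (INTENSIFY_MS + BLANK_MS) / 1000

# Epochs recorded per symbol, split into P300 and Non-P300.
epochs_per_symbol = STIMULI_PER_BLOCK * REPETITIONS
p300_per_symbol = TARGETS_PER_BLOCK * REPETITIONS
non_p300_per_symbol = epochs_per_symbol - p300_per_symbol

print(round(stimulus_rate_hz, 1), block_seconds)
print(epochs_per_symbol, p300_per_symbol, non_p300_per_symbol)
```

The resulting 1:5 ratio of P300 to Non-P300 epochs is the class imbalance that the preprocessing step has to handle.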

B. Data Preprocessing
To reduce the effect of the imbalance of the data sets, we repeatedly averaged two randomly selected P300 samples to create new P300 samples until the number of P300 samples equaled the number of Non-P300 samples. The preprocessing step consists of the following stages. We first extracted all data samples from 0 to 650 ms, i.e., 156 time samples, after the start of an intensification. Afterwards, an FIR band-pass filter (Hamming window) with a frequency range of 0.1 to 20 Hz was applied, followed by downsampling (to half of the sample points for each channel) and normalization (via the Z-score in eq (1) and the sigmoid function) of the filtered EEG data. The sigmoid function was used because the value range of the reconstructed EEG signal in the decoder layer is from 0 to 1. The resulting band-pass filtered and normalized EEG data was used as the input of ST-CapsNet.
$$\hat{X}_{ij} = \frac{X_{ij} - \bar{X}_i}{\sigma_i} \quad (1)$$
where $X \in \mathbb{R}^{C \times 78}$ is the half-downsampled filtered EEG signal and $X_{ij}$ is the signal value of the i-th electrode at the j-th time point. $\bar{X}_i$ and $\sigma_i$ are the average and standard deviation of the i-th electrode signal. C represents the number of electrodes, and 78 stands for the number of time samples of the signals. We set C to 64 because datasets IIb, II-A, and II-B all have 64 electrodes.
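The normalization stage can be sketched in NumPy as follows. This is a minimal illustration rather than the code used in this work: the function name `normalize_epoch`, the small epsilon, the use of the population standard deviation, and downsampling by keeping every second sample are all assumptions, and random data stands in for a band-pass filtered epoch.

```python
import numpy as np

def normalize_epoch(x):
    """Z-score each electrode over time, then map the result to (0, 1).

    x: array of shape (C, T), one filtered and downsampled epoch.
    """
    mean = x.mean(axis=1, keepdims=True)   # per-electrode average
    std = x.std(axis=1, keepdims=True)     # per-electrode standard deviation
    z = (x - mean) / (std + 1e-8)          # epsilon is an added safeguard
    return 1.0 / (1.0 + np.exp(-z))        # sigmoid keeps values in (0, 1)

rng = np.random.default_rng(0)
filtered = rng.normal(size=(64, 156))      # stand-in for a 0-650 ms epoch at 240 Hz
downsampled = filtered[:, ::2]             # assumed: keep every second sample (156 -> 78)
normalized = normalize_epoch(downsampled)
```

Because the decoder ends in a sigmoid, mapping the input to the same (0, 1) range makes the mean squared reconstruction error meaningful.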
III. METHODS
ERP-CapsNet has shown good performance in P300 detection [29]. However, it just took the raw EEG signals as input, which might introduce some additional noise. Hence, to reduce the noise of EEG signals and improve the accuracy of P300 detection, we linked spatial and temporal attention modules with ERP-CapsNet as illustrated in Fig. 1.

A. Spatial Attention
We define the variable $V \in \mathbb{R}^{c \times h \times w}$, where c, h and w represent the channel, height and width dimensions of V, respectively. The spatial attention module is used to enhance the spatial information of the raw input EEG signal X, as summarised below:
$$M_S = \sigma\left(W_1\left(W_0\left(F^s_{avg}\right)\right) \oplus W_1\left(W_0\left(F^s_{max}\right)\right)\right) \quad (2)$$
$$X_S = M_S \otimes X_R \quad (3)$$
where $F^s_{avg}$ and $F^s_{max} \in \mathbb{R}^{C \times 1 \times 1}$ are the features generated from the reshaped signal $X_R \in \mathbb{R}^{C \times 1 \times 78}$ through average pooling and max pooling along the width dimension (the pooling kernel size and pooling stride were set to 78 and 1, respectively). In the shared MLP, $W_0 \in \mathbb{R}^{C \times \frac{C}{r}}$ is the weight between the input layer and the hidden layer, while $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ is the weight between the hidden layer and the output layer, and r is the reduction ratio. We set r to 16 as suggested in [31]. The operator $\oplus$ denotes element-wise addition, and $\sigma$ is the sigmoid operation. $M_S \in \mathbb{R}^{C \times 1 \times 1}$ is the spatial attention map produced by the spatial attention module. By multiplying the reshaped signal $X_R$ with the spatial attention map $M_S$ through the operator $\otimes$, which denotes element-wise multiplication, we get the spatial refined signal $X_S \in \mathbb{R}^{C \times 1 \times 78}$. Note that $M_S$ is automatically broadcast along the width dimension during the element-wise multiplication, owing to the broadcasting mechanism of PyTorch [34].
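The spatial attention computation can be illustrated with a small NumPy sketch. It is not the PyTorch implementation used in this work: the function names are hypothetical, the weights are random rather than learned, biases are omitted, and the ReLU in the hidden layer is an assumption carried over from CBAM [31].

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def spatial_attention(x, w0, w1):
    """x: (C, T) epoch; w0: (C, C // r); w1: (C // r, C).

    Returns the refined signal (C, T) and the attention map (C,).
    """
    f_avg = x.mean(axis=1)                  # average pooling over time -> (C,)
    f_max = x.max(axis=1)                   # max pooling over time -> (C,)

    def shared_mlp(f):
        hidden = np.maximum(f @ w0, 0.0)    # ReLU is assumed, following CBAM
        return hidden @ w1

    m_s = sigmoid(shared_mlp(f_avg) + shared_mlp(f_max))
    x_s = x * m_s[:, None]                  # broadcast the map along time
    return x_s, m_s

C, T, r = 64, 78, 16
rng = np.random.default_rng(1)
x = rng.normal(size=(C, T))
w0 = 0.1 * rng.normal(size=(C, C // r))     # random stand-ins for learned weights
w1 = 0.1 * rng.normal(size=(C // r, C))
x_s, m_s = spatial_attention(x, w0, w1)
```

Each electrode is scaled by a single weight between 0 and 1, so the module re-weights electrodes without changing the temporal shape of any individual channel.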

B. Temporal Attention
In the temporal attention module, the spatial refined signal $X_S$ is first compressed into two feature maps (i.e., $F^t_{avg}$ and $F^t_{max} \in \mathbb{R}^{1 \times 1 \times 78}$) through average pooling and max pooling along the channel dimension (the pooling kernel size and pooling stride were set to C and 1, respectively). The two feature maps are then stacked and convolved by a convolution layer with a 1 × D filter (D can be 3, 5, or 7), a stride of 1, same padding, and sigmoid activation, producing a temporal attention map $M_T \in \mathbb{R}^{1 \times 1 \times 78}$:
$$M_T = \sigma\left(f^{1 \times D}\left(\left[F^t_{avg}; F^t_{max}\right]\right)\right) \quad (4)$$
Afterwards, we obtain the refined EEG signal $X_{ST} \in \mathbb{R}^{C \times 1 \times 78}$ through the element-wise multiplication below, in which $M_T$ is automatically broadcast along the channel dimension:
$$X_{ST} = M_T \otimes X_S \quad (5)$$
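A NumPy sketch of the temporal attention step is given below, again as an illustration under stated assumptions: the function names are hypothetical, the kernel is a placeholder rather than a learned filter, the bias is a scalar, and "same" padding is taken to mean zero padding.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def temporal_attention(x_s, kernel, bias=0.0):
    """x_s: (C, T) spatially refined epoch; kernel: (2, D) with odd D.

    Returns the refined signal (C, T) and the attention map (T,).
    """
    # Pool over electrodes and stack the two maps -> (2, T).
    stacked = np.stack([x_s.mean(axis=0), x_s.max(axis=0)])
    d = kernel.shape[1]
    pad = d // 2
    padded = np.pad(stacked, ((0, 0), (pad, pad)))   # zero "same" padding
    t = x_s.shape[1]
    # Stride-1 cross-correlation, as in deep-learning convolution layers.
    logits = np.array(
        [np.sum(padded[:, j:j + d] * kernel) for j in range(t)]
    ) + bias
    m_t = sigmoid(logits)
    return x_s * m_t[None, :], m_t                   # broadcast over electrodes

rng = np.random.default_rng(2)
x_s = rng.normal(size=(64, 78))
kernel = 0.1 * rng.normal(size=(2, 5))               # D = 5, placeholder weights
x_st, m_t = temporal_attention(x_s, kernel)
```

Here every time point receives one weight shared by all electrodes, which complements the per-electrode weighting of the spatial module.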

C. Capsule Network
In the capsule network, we first extracted spatial features from $X_{ST}$ using 10 C × 1 spatial filters through a convolution operation and the ReLU function, where the stride is 1. Next, we used 64 1 × 13 temporal filters with convolution and ReLU operations to extract temporal features of size 64 × 1 × 6. The temporal features are divided into 8 groups. The size of each group is 8 × 1 × 6, which corresponds to six 8D primary capsules. We therefore obtained 48 8D primary capsules in total as the input of the dynamic routing. The output of the dynamic routing is two 16D output capsules. The lengths of the two output capsules, calculated through the L2 norm and then activated by Softmax, represent the probabilities of P300 and Non-P300, respectively. We can determine the label of the input sample X using eq (6):
$$\text{label}(X) = \begin{cases} \text{P300}, & p_{target} > p_{non} \\ \text{Non-P300}, & \text{otherwise} \end{cases} \quad (6)$$
where $p_{target}$ and $p_{non}$ represent the probability that the model identifies sample X as a P300 and a Non-P300 sample, respectively.
The mechanism of the dynamic routing algorithm is completely different from that of the CNN and is described in Algorithm 1. Sabour et al. suggested that better convergence can be obtained by using three routing iterations than one iteration [35]. Therefore, we set the maximum number of routing iterations, i.e., N, to 3. After the dynamic routing layer, we keep the output capsule representing the category of the input EEG sample X as the input of the decoder network and mask the other output capsule. The decoder network consists of three fully connected layers; the numbers of neurons are 512, 1024, and C × 78, and the activation functions are ReLU, ReLU, and sigmoid, respectively.
The loss function of ST-CapsNet consists of two components, namely margin loss and reconstruction loss. The margin loss is defined as follows:
$$L_j = T_j \max\left(0, m^+ - \lVert v_j \rVert\right)^2 + \lambda \left(1 - T_j\right) \max\left(0, \lVert v_j \rVert - m^-\right)^2 \quad (7)$$
where $L_j$ stands for the loss of the j-th output capsule $v_j$, $\lambda = 0.5$, $m^+ = 0.9$, and $m^- = 0.1$. $T_j = 1$ if the label of the input sample is j, otherwise $T_j = 0$. For binary classification, the margin loss function is more efficient, because it penalizes the predictions depending on how closely they match the sign of the target [36]. The reconstruction loss $L_r$ is obtained by calculating the mean squared error between the input EEG signal and the reconstructed EEG signal. Adding the reconstruction loss can boost the routing performance [25]. The total loss of ST-CapsNet is summed as follows:
$$L = \sum_j L_j + \alpha L_r \quad (8)$$
where $\alpha$ is set to 0.0005.
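Routing-by-agreement and the loss can be sketched in NumPy as follows. This follows the general procedure of Sabour et al. and is not a reproduction of Algorithm 1: the function names are hypothetical, the prediction vectors `u_hat` are random stand-ins for the transformed primary capsules, and the margin loss is applied to raw capsule lengths without the Softmax activation described above.

```python
import numpy as np

def squash(s):
    """Shrink each vector so that its length lies in [0, 1)."""
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + 1e-9)

def dynamic_routing(u_hat, n_iter=3):
    """u_hat: (n_in, n_out, d_out) prediction vectors -> (n_out, d_out)."""
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                        # routing logits
    for _ in range(n_iter):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)           # softmax over output capsules
        s = np.einsum("ij,ijd->jd", c, u_hat)          # weighted sum of predictions
        v = squash(s)
        b = b + np.einsum("ijd,jd->ij", u_hat, v)      # reward agreement
    return v

def margin_loss(lengths, target, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Sum of per-capsule margin losses for one sample."""
    total = 0.0
    for j, length in enumerate(lengths):
        t_j = 1.0 if j == target else 0.0
        total += t_j * max(0.0, m_pos - length) ** 2
        total += lam * (1.0 - t_j) * max(0.0, length - m_neg) ** 2
    return total

def total_loss(margin, reconstruction_mse, alpha=0.0005):
    return margin + alpha * reconstruction_mse

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(48, 2, 16))                   # 48 primary -> 2 output capsules
v = dynamic_routing(u_hat, n_iter=3)
lengths = np.linalg.norm(v, axis=1)                    # one length per class
```

As a worked example, a sample of class 0 with capsule lengths (0.5, 0.5) incurs a margin loss of 0.4² + 0.5 · 0.4² = 0.24, whereas lengths (0.95, 0.05) incur no loss because both lie beyond their margins.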

D. Training
We used the parameters of a pre-trained model to initialize the attention layers and the two convolution layers of ST-CapsNet to obtain better convergence and avoid local optima, as suggested in [35]. The pre-trained model is shown in Table II. All models were implemented in PyTorch and trained on a GeForce RTX 2080 Ti GPU. The batch size was set to 64. The learning rate was initially set to 0.001 with an exponential decay rate of 0.96. For the pre-trained CNN, we employed cross-entropy loss. The Adam optimizer with default parameters was used to optimize all models. To avoid overfitting, early stopping and the data augmentation in braindecode [37] were used.

E. Target Symbol Recognition
The StimulusCodes [16] shown in Fig. 2 have a value range of 0 to 12 (0 when no row/column is being intensified, 1 to 6 for intensified columns, 7 to 12 for intensified rows). Because of the low SNR of ERPs, 15 repetitions are used to recognize one symbol in the P300 speller paradigm. Each repetition has 12 stimuli that correspond to the 12 stimulus codes. Let $p_k(i)$ denote the length of the 16D output capsule, which stands for the probability of P300 when the stimulus code is k in the i-th repetition. $P_k$ is the sum of those P300 probabilities from the first to the n-th repetition:
$$P_k = \sum_{i=1}^{n} p_k(i) \quad (9)$$
Then we can identify the column c and row r of the target symbol in the n-th repetition by:
$$c = \arg\max_{k \in [1,6]} P_k \quad (10)$$
$$r = \arg\max_{k \in [7,12]} P_k \quad (11)$$
Fig. 2. Different row/column intensifications are assigned to the StimulusCodes [16]. The numbers in blue are the StimulusCodes.
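The accumulation over repetitions and the row/column selection can be sketched in plain Python. The function name `recognize_symbol` and the character layout in `MATRIX` are assumptions for illustration; the layout shown is the one commonly used for the 6 × 6 speller and should be checked against the dataset description.

```python
# Assumed 6 x 6 speller layout; rows are indexed top to bottom.
MATRIX = ["ABCDEF", "GHIJKL", "MNOPQR", "STUVWX", "YZ1234", "56789_"]

def recognize_symbol(p, n, matrix=MATRIX):
    """Select a symbol from the first n repetitions.

    p[i][k - 1] is the P300 probability for stimulus code k (1..12)
    in repetition i (0-based).
    """
    # Accumulate the probability of each stimulus code over n repetitions.
    totals = [sum(p[i][k] for i in range(n)) for k in range(12)]
    # Codes 1..6 (indices 0..5) are columns, codes 7..12 (indices 6..11) are rows.
    col = max(range(0, 6), key=lambda k: totals[k])
    row = max(range(6, 12), key=lambda k: totals[k]) - 6
    return matrix[row][col]

# Hypothetical scores: code 3 (third column) and code 8 (second row) stand out.
scores = [[0.1] * 12 for _ in range(15)]
for rep in scores:
    rep[2] = 0.9
    rep[7] = 0.8
symbol = recognize_symbol(scores, n=5)
```

Summing scores before taking the arg max is what produces the cumulative effect: a stimulus code that is only weakly favoured in each repetition can still win once enough repetitions are added.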

A. Algorithms for Comparison
To evaluate the accuracy of P300 detection and symbol recognition, we compared our ST-CapsNet with six models (i.e., a capsule network, two traditional methods, two deep learning models, and a method combining deep learning and traditional algorithms). The details of the models are described as follows:
1) ERP-CapsNet, which was state of the art, is the first capsule network applied to ERP detection and achieved good results [29]. The network structure is the same as the Capsule Network in Fig. 1.
2) CNN-1 is the first proposed CNN model for P300 detection [23]. It consists of four layers; the first two are convolutional layers (with a 64 × 1 spatial kernel and 50 1 × 13 temporal kernels, respectively) used to extract spatial and temporal features, and the last two are fully connected layers (with 100 and 2 neurons, respectively) used to classify ERP signals.
3) MCNN-1 is an ensemble of five CNN-1 models, each trained on a different partition of the data [23]. There are five data partitions in total because the number of Non-P300 samples is five times larger than the number of P300 samples in the original data. Each data partition is derived from the original data and has the same number of P300 and Non-P300 samples. CNN-1 and MCNN-1 are often used as benchmarks for comparing P300 detection performance.
4) Linear discriminant analysis (LDA) with covariance shrinkage has shown better performance than a conventional LDA classifier in detecting single-trial ERP signals [38]. We abbreviated this approach as S-LDA and used the reproduction results reported by Ma et al. [29] for a clear comparison.
5) Spatially Weighted FLD-PCA (SWFP) is designed for single-trial ERP detection and outperformed Hierarchical Discriminant Component Analysis (HDCA) [39] and the Hierarchical Discriminant Principal Component Analysis Algorithm (HDPCA) [40]. First, a spatial filter is estimated at each time point using Fisher Linear Discriminant (FLD), and then all the estimated spatial filters (78 in total) are applied to an EEG sample to obtain a spatially filtered EEG sample. Principal component analysis (PCA) is then applied to each channel of this EEG sample for dimensionality reduction. Six principal components are retained to explain > 70% of the variance, as reported in [40].
6) MsCNN-TL-ESVM was proposed by Sourav Kundu and Samit Ari [41]. It consists of two blocks, the feature extraction block and the classification block. The authors first used a convolutional network with fixed-size spatial filters (64 × 1) and multiple temporal filters of different sizes (1 × 20 and 1 × 10), trained with transfer learning, to extract discriminant spatial and temporal features. They then applied the Fisher ratio to select important features and sent the selected features to the ensemble of SVMs for symbol recognition.

B. Evaluation Metrics
We adopted accuracy (Acc.) and F1-score as metrics to evaluate the performance of single-trial P300 detection. To evaluate the performance of symbol recognition, it is not sufficient to compare the number of symbols correctly recognized under separate repetitions, because the P300-based speller paradigm has a cumulative effect, i.e., the recognition accuracy of the previous repetition affects the recognition accuracy of the next repetition. Here we assume that a good model should perform well with fewer repetitions (reach a higher information transfer rate) without sacrificing overall performance (correctly identifying as many symbols as possible under all repetitions). Hence, to quantify the performance of models in recognizing symbols under different repetitions, we proposed a comprehensive evaluation measure as follows:
$$\mathrm{ASUR}_k = \frac{1}{k} \sum_{i=1}^{k} C_i \quad (12)$$
where $C_i$ is the number of correctly recognized symbols in the i-th repetition. $\mathrm{ASUR}_k$ stands for the average number of correctly recognized symbols per repetition when we take k repetitions into account. We take three values of k (5, 10, 15). $\mathrm{ASUR}_5$, $\mathrm{ASUR}_{10}$ and $\mathrm{ASUR}_{15}$ represent the average number of correctly recognized symbols in the first five, ten and fifteen repetitions, respectively. It is worth mentioning that $\mathrm{ASUR}_{15}$ reflects the overall performance of symbol recognition because there are 15 repetitions in total. Besides, higher $\mathrm{ASUR}_5$ and $\mathrm{ASUR}_{10}$ mean higher accuracy of symbol recognition with fewer repetitions. In addition, to compare the symbol recognition speed of models under different repetitions, we referred to the formula in [42] for calculating the ITR under the i-th repetition, in which $A_i$ is the symbol recognition accuracy (in percent) under the i-th repetition, and G (G is 36 here) is the number of symbols presented in the P300 speller paradigm as shown in Fig. 2.
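Both measures are straightforward to compute. The sketch below implements ASUR as the mean of the per-repetition counts and, for the ITR, assumes the widely used Wolpaw definition of bits per selection; whether this matches the exact formula of [42] is an assumption, as are the function names, the use of accuracy as a fraction rather than a percentage, and the `seconds_per_selection` parameter. The symbol counts in the example are hypothetical and are not results from this work.

```python
import math

def asur(correct_per_repetition, k):
    """Average number of correctly recognized symbols over the first k repetitions."""
    return sum(correct_per_repetition[:k]) / k

def bits_per_selection(accuracy, n_symbols=36):
    """Wolpaw bits per selection; accuracy is a fraction in [0, 1]."""
    bits = math.log2(n_symbols)
    if accuracy > 0:
        bits += accuracy * math.log2(accuracy)
    if accuracy < 1:
        bits += (1 - accuracy) * math.log2((1 - accuracy) / (n_symbols - 1))
    return bits

def itr_bits_per_min(accuracy, seconds_per_selection, n_symbols=36):
    """Bits per selection scaled by the (assumed) time needed for one selection."""
    return 60.0 * bits_per_selection(accuracy, n_symbols) / seconds_per_selection

# Hypothetical counts of correct symbols in repetitions 1..15 (out of 100).
counts = [20, 35, 50, 60, 70, 75, 80, 84, 87, 90, 92, 94, 95, 96, 97]
asur_5, asur_10, asur_15 = asur(counts, 5), asur(counts, 10), asur(counts, 15)
```

Since the time per selection grows with the number of repetitions while the bits per selection are capped at log2(36), a fixed accuracy yields a lower ITR at later repetitions.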

C. Performance of P300 Detection in Single Trial
The kernel size of the temporal attention module in ST-CapsNet was chosen to be 1 × 5. The results are shown in Table III. ST-CapsNet outperforms the other models in both accuracy and F1-score on datasets IIb and II-B, while ERP-CapsNet has a slightly higher F1-score than ST-CapsNet on dataset II-A. The results indicate that the attention modules of ST-CapsNet can boost the performance of single-trial P300 detection.

D. Performance of Symbol Recognition
The numbers of correctly recognized symbols in every repetition for each model on datasets IIb, II-A, and II-B are shown in Tables IV, V, and VI, respectively. The character '-' means the authors did not report the value in their papers. Table IV illustrates that ST-CapsNet, ERP-CapsNet, CNN-1 and MsCNN-TL-ESVM can correctly identify all symbols in the 4th repetition, while S-LDA requires 5 repetitions and SWFP needs as many as 8 repetitions to correctly recognize all symbols on dataset IIb. In addition, ST-CapsNet and MsCNN-TL-ESVM have almost the same performance and are better than the other methods. On dataset II-A, both ST-CapsNet and ERP-CapsNet correctly identified 98 symbols in the 15th repetition; ST-CapsNet is more accurate in the 5th to 10th repetitions, while ERP-CapsNet is more accurate in the 11th to 13th repetitions. On dataset II-B, ST-CapsNet has the highest accuracy from repetition 4 to 7, while MsCNN-TL-ESVM is the most accurate from repetition 9 to 13.
As summarized in Table V and Table VI, some models have higher accuracy when there are more repetitions but lower recognition accuracy when there are fewer repetitions, which means that different models have different accuracy tendencies under repetitions. Our ST-CapsNet tends to be more accurate with fewer repetitions, while ERP-CapsNet and MsCNN-TL-ESVM need more repetitions to be accurate. Table VII illustrates that ST-CapsNet has the highest overall accuracy of symbol recognition (highest ASUR 15) on the three datasets (IIb, II-A and II-B). ERP-CapsNet is a little more accurate in the first 5 repetitions. In summary, our ST-CapsNet outperforms ERP-CapsNet by about 1 percent and is better than the other models in symbol recognition.
TABLE IV. NUMBER OF CORRECTLY CLASSIFIED SYMBOLS FOR DATASET IIB
TABLE V. NUMBER OF CORRECTLY CLASSIFIED SYMBOLS FOR DATASET II-A
TABLE VI. NUMBER OF CORRECTLY CLASSIFIED SYMBOLS FOR DATASET II-B
TABLE VII. ASUR K (K = 5, 10, 15) ON DATASETS IIB, II-A AND II-B

E. Performance of ITR
To show the speed of symbol spelling, we compared the ITR under each repetition as shown in Fig. 3. The kernel size of the temporal attention module was chosen to be 1 × 5. On dataset IIb, ST-CapsNet and MsCNN-TL-ESVM achieved almost the same ITR performance (both with the highest ITR of 51.56 bits/min) and outperformed the other models significantly. Furthermore, ST-CapsNet achieved the highest ITR of 13.32 bits/min in the 6th repetition on dataset II-A and 19.74 bits/min in the 2nd repetition on dataset II-B. Interestingly, we found that at the same symbol recognition rate, the ITR decreases significantly as the number of repetitions increases. Thus, improving the symbol recognition rate in the first few repetitions is key to obtaining a higher ITR.

F. Effect of Temporal Attention to Model Performance Under Various Kernel Sizes
We also explored the performance of ST-CapsNet with different temporal attention kernel sizes (1 × 3, 1 × 5, 1 × 7). Table VIII illustrates that, in single-trial P300 detection, the 1 × 3 kernel outperformed the other two on dataset IIb, and the 1 × 5 kernel is the best on datasets II-A and II-B. Although there is a difference in performance among these three kernels in detecting the P300, it is not significant. The numbers of correctly recognized symbols and the ASUR k values are given in Tables IX and X, respectively. ST-CapsNet with the 1 × 7 kernel has better symbol recognition performance in the first five and ten repetitions, while ST-CapsNet with the 1 × 5 kernel has the best overall performance. These findings show that ST-CapsNet is not sensitive to the choice of kernel size in the temporal attention module.

V. DISCUSSION
In this paper we used a capsule network with spatial and temporal attention modules to improve the performance of detecting P300. This method has superior performance compared to ERP-CapsNet, CNN-1, MCNN-1, S-LDA, SWFP, and MsCNN-TL-ESVM for single-trial P300 detection. Among them, the traditional methods (S-LDA, SWFP) have the worst performance, probably because their handcrafted features do not contain rich discriminative information, and the number of parameters of these two models is so small that there is a risk of underfitting. The results of the classical convolutional networks (CNN-1, MCNN-1) are slightly better, but still less satisfactory. ERP-CapsNet is about two percentage points higher than the classical convolutional networks, probably because the capsule network uses a dynamic routing layer to replace the max pooling layer, thus avoiding information loss during forward propagation. MsCNN-TL-ESVM combines a multi-scale convolutional network (which automatically extracts rich multi-scale temporal features) with ensembled SVMs (which reduce the variance of the classifiers to avoid the risk of overfitting), and employs a transfer learning training strategy (which ensures a sufficient amount of training data). Its results are excellent and nearly the same as those of ERP-CapsNet. Our proposed ST-CapsNet outperformed ERP-CapsNet by about 1 percentage point, probably because we employed attention mechanisms that make the capsule model automatically learn and strengthen discriminative features in space and time.
To accurately detect the symbols to be spelled, a typical solution is to increase the number of repetitions, which improves the SNR. However, as the number of repetitions increases, the time taken to detect individual symbols becomes longer. A good model should be able to recognize as many symbols as possible with as few repetitions as possible. A traditional way of evaluating the accuracy of symbol recognition is to directly compare the numbers of correctly recognized symbols at repetitions 5, 10 and 15, as used in [43] and [44]. However, this approach does not take into account the cumulative effect of the P300-based speller paradigm, where the spelling accuracy of the previous repetition affects the accuracy of the next repetition. Thus, we introduced a new metric, ASUR, to evaluate the accuracy of symbol recognition. Higher ASUR 5 and ASUR 10 indicate a higher average symbol recognition rate for the first 5 and the first 10 repetitions, respectively. Higher ASUR 15 indicates better overall performance of symbol recognition. Our experimental results show that the spatial and temporal attention modules can improve the accuracy of ERP-CapsNet for symbol recognition at low repetitions without losing overall performance. In addition, in the temporal attention module, we tested different kernel sizes (1 × 3, 1 × 5 and 1 × 7). All three kernels achieved better results than ERP-CapsNet on both single-trial P300 detection and symbol recognition, with similar performance, indicating that ST-CapsNet is not very sensitive to the choice of kernel size.
To investigate the region of interest learned by the spatial attention module, we ranked the averaged values of the spatial attention maps in descending order and marked the top eight electrodes in red, as shown in Fig. 4. We found that all three spatial attention maps share two common channels (Cz and CPz), and the enhanced electrodes were located roughly in the central and parietal lobes of the brain, indicating that the attention module was able to capture the spatial features of P300. Furthermore, the learned spatial attention maps generally accord with those of previous studies [15], [23], [24].
To further investigate how the spatial and temporal attention modules affect the raw EEG signals, we passed all raw EEG signals through the attention layers and obtained refined EEG signals. However, due to the complex non-linear transformations, the characteristics of the EEG signals change considerably in time and space, which makes them difficult to interpret directly. Comparing the difference between the mean P300 signal and the mean Non-P300 signal is a better approach, as the attention layers maximize the difference between the P300 samples and the Non-P300 samples, as shown in Table III. We therefore subtracted the mean Non-P300 signal from the mean P300 signal and averaged the EEG topographies over the entire time period, as plotted in Fig. 5. This figure shows that on datasets IIb and II-A, the energy of both the refined and raw EEG topographies is concentrated in the parietal lobe, while on dataset II-B, the energy in the parietal lobe is more focused in the refined EEG topography than in the raw one. These findings indicate that the attention mechanisms can enhance the ability to capture P300 features.
We also explored the spatial features learned by the capsule network in ST-CapsNet. First, we selected 10 spatial filters in the first convolutional layer of the capsule network and took their absolute values for normalization. Next, we used MNE-Python [45] to plot the topographies for datasets IIb, II-A, and II-B. Fig. 6 shows the weights of each of the learned spatial filters, and Fig. 7 shows the average of the 10 spatial filter weights for each dataset. The average spatial filter has higher values in the parietal and occipital regions, which is consistent with the results in [23] and [24]. The rankings of the best eight electrodes for datasets IIb, II-A, and II-B are shown in Table XI, where the electrodes are arranged in descending order of the absolute values of the averaged 10 spatial filters. The electrodes common to the three datasets are PO7, PO8, O1, CPz, and Pz, which is in general agreement with the results in [23].
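The spatial filter analysis can be sketched in a few lines. The sketch assumes that the filter weights are available as an array of shape (n_filters, n_channels) and that "normalization" means min-max scaling of the absolute weights within each filter; the text does not specify the scaling, so this is an assumption. All function names and toy values are hypothetical.

```python
import numpy as np

def average_filter_map(filters):
    """Average of per-filter min-max normalized absolute weights."""
    w = np.abs(np.asarray(filters, dtype=float))   # (n_filters, n_channels)
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)         # guard against flat filters
    return ((w - lo) / span).mean(axis=0)          # one score per channel

def rank_channels(scores, channel_names, k=8):
    """Names of the k highest-scoring channels, in descending order."""
    order = np.argsort(scores)[::-1][:k]
    return [channel_names[i] for i in order]

def common_channels(rankings):
    """Channels that appear in every ranking, sorted alphabetically."""
    common = set(rankings[0])
    for ranking in rankings[1:]:
        common &= set(ranking)
    return sorted(common)

# Toy example: two filters over three channels (values are made up).
names = ["Pz", "Oz", "Fz"]
scores = average_filter_map([[-4.0, 1.0, 0.0], [0.5, -1.0, 0.0]])
best_two = rank_channels(scores, names, k=2)
shared = common_channels([["Pz", "Oz"], ["Oz", "Pz"], ["Pz", "Cz"]])
print(best_two, shared)
```

The resulting per-channel scores could then be passed to a topography plotting routine; that step is omitted here because it requires a channel montage.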
Our approach illustrates that extracting good spatial and temporal features is crucial for the classification of EEG signals, as reported by others. For example, the deep subject-adapted convolutional neural network (SACNN) by Liu et al. uses parallel multiscale convolutional networks to extract temporal and spatial features from raw EEG data and achieves good classification accuracy [46]. Despite the excellent performance of ST-CapsNet in P300 detection, the method has some shortcomings. First, the capsule network has a relatively large number of parameters compared with traditional methods and CNNs, which means that it needs a longer training time and higher-performance hardware. Second, although ST-CapsNet achieves higher symbol recognition accuracy at low repetitions, we cannot precisely control the recognition accuracy at a single repetition, because single-trial P300 detection and symbol recognition are two different tasks, and our model and loss function are designed for the first task without a dedicated training method for the second. In the future, we will seek to reduce the number of parameters in the model and to design a separate training method for the symbol recognition task.

VI. CONCLUSION
In this study, we proposed a novel deep-learning analysis framework, ST-CapsNet, to enhance the performance of P300 detection. Specifically, instead of sending EEG signals directly to the capsule network, the complex spatio-temporal characteristics of the EEG signals were first extracted through spatial and temporal attention modules, and the refined signals then served as inputs to the capsule network for P300 detection. In this way, both the spatial and temporal features of P300 could be captured. Performance evaluation on two publicly available datasets revealed the superiority of the proposed ST-CapsNet in both single-trial P300 detection and the cumulative effect under different repetitions (i.e., better ASUR). Our results demonstrate the beneficial effect of adding attention mechanisms to the capsule network in the P300 speller, which may lead to new directions for developing better P300-based BCI communication systems.