A Cascade xDAWN EEGNet Structure for Unified Visual-Evoked Related Potential Detection

Visual-based brain-computer interfaces (BCIs) enable people to communicate with others by spelling words from brain signals and help professionals recognize targets in large numbers of images. P300 signals evoked by different types of stimuli, such as words or images, may vary significantly in both amplitude and latency. A unified approach is required to detect variable P300 signals, which facilitates BCI applications and deepens the understanding of the P300 generation mechanism. In this study, we propose a cascade network structure that combines xDAWN and the classical EEGNet. This network is designed to classify target and non-target stimuli in both P300 speller and rapid serial visual presentation (RSVP) paradigms. The proposed approach recognizes more symbols with fewer repetitions (up to 5 rounds) than other models while achieving a better information transfer rate (ITR), as demonstrated on Dataset II of BCI Competition III (17.22 bits/min in the second repetition round). Additionally, our approach has the highest unweighted average recall (UAR) for both 5 Hz ($0.8134\pm 0.0259$) and 20 Hz ($0.6527\pm 0.0321$) RSVP. The results show that the cascade network structure outperforms the compared models on both the P300 speller and RSVP paradigms, indicating that such a cascade structure is robust enough for dealing with P300-related signals (source code is available at https://github.com/embneural/Cascade-xDAWN-EEGNet-for-ERP-Detection).


I. INTRODUCTION
BCIs are devices that enable people to communicate with their surroundings using brain waves [1], [2]. They are particularly helpful for individuals with disabilities, such as those with amyotrophic lateral sclerosis (ALS), who may rely on a BCI to interact with others [3], [4], [5]. Electroencephalogram (EEG)-based BCIs are becoming more popular in both commercial and academic settings because they are non-invasive, portable, and provide fast responses, and over 80% of BCI publications rely on EEG [6], [7].
Various components of EEG signals are applicable to BCIs, such as sensorimotor rhythms (SMRs), steady-state visual evoked potentials (SSVEPs), code-modulated visual evoked potentials (c-VEPs), miniature asymmetric visual evoked potentials (aVEPs) and event-related potentials (ERPs). Specifically, the ERP, a time-locked electrical potential elicited by an event, has gained attention in BCI studies because different modalities such as visual, auditory, or tactile events can be used to control an ERP-based BCI [8], [9], [10]. The P300, a response that occurs about 300-600 milliseconds after stimulus onset, has been extensively studied in ERP-based BCIs [11]. In visual P300 paradigms such as row/column spellers, differences between target and non-target ERPs are used to generate characters by flashing the corresponding rows and columns, so that the P300-based speller allows the user to communicate letter by letter [7].
Because EEG for the P300 speller is collected noninvasively, the resulting signal-to-noise ratio (SNR) is low [12]. To improve the SNR, stimuli are repeated and multiple EEG trials are averaged [13]. However, this averaging method is time-consuming and inefficient [14]. It adds a temporal overhead to the pattern recognition process and may negatively impact real-time BCI performance by incorporating past information. Therefore, accurately classifying the P300 for the P300 speller in a single trial remains a challenge.
Apart from the row/column speller, RSVP-based BCIs use the P300 to detect and recognize objects, scenes and events in static images and videos. Unlike the P300 speller, the RSVP-based BCI can benefit applications where a large number of images must be reviewed by professionals but cannot be analyzed well by computers [15], [16], [17]. In the RSVP-based P300 experiment, participants need to distinguish between target and non-target stimuli, where the targets should be recognizably different and make up 5%-10% of all stimuli [15]. Although both the P300 speller and RSVP BCIs use the P300 component for classification, there are some differences between the two paradigms. Firstly, in the P300 speller, the operator sets the sequence repetition number to improve the SNR, while in RSVP, the target or non-target stimuli are presented in a single trial. Secondly, the stimulus rate differs between the two paradigms: RSVP usually presents stimuli at a rate of 5-10 Hz, while the P300 speller flashes for 125 ms and turns off for 62.5 ms until the next flash [18]. Thirdly, due to the repetition of target stimuli in the P300 speller, the average amplitude of the target P300 is weaker than that produced by the single-trial RSVP paradigm. Lastly, the EEG of the P300 speller is more stable, whereas the RSVP paradigm is prone to noise because it requires button presses to indicate the target stimuli. Notable ERPs during target events, compared with nontarget events, emerge at 315 ms in the RSVP paradigm but at 262 ms in the row-column P300 speller paradigm [18]. In [19], [20], the experimental protocol consisted of two sessions performed on two different days, with the P300 speller experiment on day one and the RSVP experiment on day two. The results showed that the temporal filtering capacity in the RSVP task can be a predictor of both the P300-based BCI accuracy and the amplitude of the P300 elicited when performing the BCI task. Moreover, Won et al. reported a significant positive correlation between P300 amplitude in RSVP and P300 speller performance. They also observed a strong negative correlation between the variation in P300 latency across trials in RSVP and P300 speller performance [21]. These findings indicate that there is a relationship between RSVP task characteristics and P300 speller performance.
To develop a sophisticated BCI system, a unified processing framework is essential, as it can simplify system maintenance and upgrades and reduce overall development costs. Additionally, a sophisticated BCI system must meet the diverse requirements of its users. For instance, the standard row-column paradigm used in the P300 speller may not be appropriate for patients who lack gaze control [22]. In such cases, the RSVP paradigm, which presents the stimuli in the same position, may serve as a useful alternative. While the ERPs obtained from these two paradigms exhibit similarities, a unified decoding approach is necessary to develop a BCI system that can be tailored to different users [22].
To address the above characteristics of the P300 speller and RSVP paradigms, various methods have been proposed for P300 component detection. Traditional machine learning methods such as support vector machines (SVMs), discriminant analysis, and common spatial pattern algorithms were applied first [20]-[28]. These methods detect ERP signals with manual feature extraction, and the quality of the extracted features strongly affects how well the algorithm performs. More recently, with the emergence of deep learning models, ERP features can be learned automatically from data without manual intervention. Several convolutional neural networks (CNNs) such as EEGNet, DeepConvNet, ShallowConvNet, CNN1 and BN3 have been developed for detecting ERP signals [5], [13], [23], [24]. However, deep learning models often require a large number of samples for better learning, as they lack domain knowledge about the data [25]. An important piece of domain knowledge regarding ERP signals is that the SNR can be enhanced through trial averaging [26], [27]. To take advantage of this domain information, it might be more suitable to process the ERP signals with xDAWN spatial filtering before they are sent to neural networks, as xDAWN estimates the spatial filters by maximizing the difference between the averaged signals of the corresponding category and the whole EEG data [28]. The combination of xDAWN with classification algorithms has been explored extensively, with promising results. For instance, Cecotti et al. integrated xDAWN with MLP, BLDA, and linear SVM, resulting in performance enhancement [29]. Similarly, an xDAWN-based algorithm won the Kaggle BCI competition NER 2015 by leveraging Riemannian geometry, channel subset selection, L1 regularization, and elastic net regression [30], [31]. Moreover, Wu et al. aligned ERP data with Euclidean alignment and enriched features with xDAWN and tangent space mapping, which secured the top spot in the RSVP detection competition at the World Robot Contest 2021 [32], [33]. Meanwhile, in our previous work, Zhang et al. achieved second place in the same competition by combining xDAWN with EEGNet [34]. These studies showcase the potential of integrating xDAWN with classification algorithms to achieve superior performance across diverse applications. However, in the existing literature, there is no single algorithm that is well suited for both P300 speller and RSVP target detection. To tackle this challenge, we build on our prior research [34] and extend the methodologies of xDAWN and EEGNet to encompass both the P300 speller and RSVP paradigms.
In the rest of this work, we describe the dataset in Section 2 and detail the methods applied in Section 3. We present the corresponding results in Section 4 and compare them with prevailing ones in the discussion part. Finally, we conclude our findings.

A. Dataset Description
The study utilized two datasets: the P300 speller-based dataset and the RSVP dataset. The P300 speller-based dataset was derived from two datasets: dataset IIb from BCI Competition II, which involved one participant, and dataset II from BCI Competition III, which consisted of subjects A and B. The data were recorded using the BCI2000 system, which employed 64 electrodes at a sampling rate of 240 Hz. During the experiment, participants were shown a 6×6 symbol matrix and were instructed to pay attention to specific target symbols. The rows and columns of the symbol matrix were intensified in random order at a frequency of 5.7 Hz. Each intensification lasted for 100 ms. After each intensification, the matrix remained blank for 75 ms, followed by the next intensification of a row or column. Each symbol presentation consisted of 15 rounds, with each round containing 12 stimuli. Of these 12 stimuli, only two (one row and one column) corresponded to the desired symbol, and the elicited responses were labeled as P300 samples, while the responses elicited by the remaining 10 stimuli were labeled as non-P300 samples. Thus, each symbol contains 150 non-P300 samples and 30 P300 samples. In dataset IIb, there were 42 symbols for training and 31 symbols for testing. In dataset II, for each subject, 85 symbols were used for training and 100 symbols for testing. The dataset is available at https://www.bbci.de/competition/.
The RSVP dataset [35], on the other hand, employed a stimulus set of 200 visual objects from different categories, presented to 16 adult participants (5 females; age range 18-38 years) who were instructed to count target stimuli (boats or geometric star shapes) randomly inserted into the sequence, with a maximum of 4 targets per sequence. Each sequence lasted between 40.2 and 40.8 seconds, with a presentation rate of 5 Hz in the first session and 20 Hz in the second session. There were a total of 40 sequences for each session. The EEG data were recorded at a sampling rate of 1000 Hz using a BrainVision ActiChamp system with 64 electrodes placed according to the international standard 10-10 system. During recording, all scalp electrodes were referenced to Cz. The recorded data were then filtered with a Hamming-windowed FIR filter (0.1 Hz high-pass and 100 Hz low-pass) and down-sampled to 250 Hz for further processing. This dataset is available at https://osf.io/a7knv/.

B. Data Preprocessing
To preprocess the P300 speller-based dataset, we first cropped an 800 ms segment after the stimulus onset and then detrended the data to remove linear trends. Next, we applied a 30 Hz zero-phase low-pass Chebyshev filter to remove high-frequency noise while preserving the desired signals. Additionally, we down-sampled the data by half, resulting in a 96-sample segment. For the RSVP-based datasets, we retrieved a 0-1000 ms segment after the stimulus onset, detrended the data, and applied the same 30 Hz zero-phase low-pass Chebyshev filter, resulting in a 250-sample segment. Here, $X_i \in \mathbb{R}^{C \times T}$ denotes the EEG data sample of the i-th trial, where C is the number of EEG electrodes and T is the number of time samples in one trial. We set T to 96 and 250 for the P300 speller and RSVP datasets, respectively. Since both datasets were collected using a 64-electrode system, C was set to 64.
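The preprocessing chain above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the text does not state the Chebyshev type, order, or ripple, so a type I filter of order 8 with 0.5 dB ripple is assumed here, and `preprocess_epoch`, its parameters, and the random input are hypothetical.

```python
import numpy as np
from scipy.signal import cheby1, detrend, sosfiltfilt


def preprocess_epoch(raw, onset, fs=240.0, window_s=0.8, cutoff_hz=30.0,
                     order=8, ripple_db=0.5, decimate=2):
    """Crop, detrend, low-pass filter and down-sample one epoch.

    raw   : (C, n_samples) continuous EEG
    onset : sample index of the stimulus onset
    """
    n = int(round(window_s * fs))
    epoch = raw[:, onset:onset + n]                  # crop after stimulus onset
    epoch = detrend(epoch, axis=-1, type="linear")   # remove linear trends
    # Assumed design: Chebyshev type I; order and ripple are illustrative only.
    sos = cheby1(order, ripple_db, cutoff_hz, btype="low", fs=fs, output="sos")
    epoch = sosfiltfilt(sos, epoch, axis=-1)         # forward-backward = zero phase
    return epoch[:, ::decimate]                      # keep every `decimate`-th sample


rng = np.random.default_rng(0)
raw = rng.standard_normal((64, 2400))                # stand-in for recorded EEG

# P300 speller settings: 800 ms at 240 Hz, down-sampled by half -> 96 samples.
speller_epoch = preprocess_epoch(raw, onset=480)
# RSVP settings: 1000 ms at 250 Hz, no further down-sampling -> 250 samples.
rsvp_epoch = preprocess_epoch(raw, onset=480, fs=250.0, window_s=1.0, decimate=1)
print(speller_epoch.shape, rsvp_epoch.shape)
```

The segment lengths follow directly from the stated settings: 0.8 s × 240 Hz / 2 = 96 samples for the speller and 1.0 s × 250 Hz = 250 samples for RSVP.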

C. xDAWN Spatial Filtering
EEG signals are noisy and have a low SNR because they record whole-brain activity, including areas of the brain that are not relevant to the task, which makes classification hard [30]. In addition, some channels carry more valuable information than others, such as the channels around the parietal lobe in the P300 speller and RSVP paradigms. To enhance the task-relevant information in the channels, we used the xDAWN spatial filtering method to refine the original EEG signals. The xDAWN algorithm is defined as follows.
Compute the averaged pattern of class k:
$$P^{(k)} = \frac{1}{m_k}\sum_{i=1}^{m_k} X_i^{(k)} \tag{1}$$
where $X_i^{(k)} \in \mathbb{R}^{C \times T}$, $P^{(k)} \in \mathbb{R}^{C \times T}$ and $m_k$ are the i-th trial EEG data in class k, the averaged pattern for class k, and the number of trials of class k, respectively.
Estimate spatial filters for class k:
$$w^{*} = \arg\max_{w} \frac{w^{\top} P^{(k)} P^{(k)\top} w}{w^{\top} X X^{\top} w} \tag{3}$$
The spatial filter is a vector $w \in \mathbb{R}^{C \times 1}$, and $w^{*} \in \mathbb{R}^{C \times 1}$ represents an estimated spatial filter. $X \in \mathbb{R}^{C \times (T \sum_k m_k)}$ is the concatenation of all the trials (from all classes). Because (3) is a generalized Rayleigh quotient, the solution is given by the eigenvectors of the matrix $(X X^{\top})^{-1} P^{(k)} P^{(k)\top}$. The top n eigenvectors (ordered by eigenvalue) were selected as spatial filters.
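The two steps above (class average, then maximization of the generalized Rayleigh quotient) can be sketched in NumPy as follows. This is an illustrative sketch on synthetic data rather than the pyRiemann implementation used in the study; the function name `xdawn_filters` and the data are hypothetical, and the generalized eigenproblem is solved by Cholesky whitening, which is one of several equivalent routes.

```python
import numpy as np


def xdawn_filters(trials, labels, target_class, n_filters):
    """Estimate spatial filters for one class.

    trials : (n_trials, C, T) array, labels : (n_trials,) array.
    Returns W of shape (C, n_filters), columns ordered by decreasing eigenvalue.
    """
    trials = np.asarray(trials, dtype=float)
    labels = np.asarray(labels)
    n_channels = trials.shape[1]

    # Averaged pattern of the class, shape (C, T).
    P = trials[labels == target_class].mean(axis=0)
    # Concatenation of all trials along time, shape (C, T * n_trials).
    X = trials.transpose(1, 0, 2).reshape(n_channels, -1)

    A = P @ P.T                      # numerator matrix of the quotient
    B = X @ X.T                      # denominator matrix of the quotient

    # Solve A w = lambda B w by whitening with the Cholesky factor of B.
    L = np.linalg.cholesky(B)
    L_inv = np.linalg.inv(L)
    M = L_inv @ A @ L_inv.T
    eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2.0)
    top = np.argsort(eigvals)[::-1][:n_filters]
    return L_inv.T @ eigvecs[:, top]


# Synthetic example: 60 trials, 6 channels, 50 samples, 10 target trials.
rng = np.random.default_rng(0)
n_trials, n_channels, n_times = 60, 6, 50
trials = rng.standard_normal((n_trials, n_channels, n_times))
labels = np.zeros(n_trials, dtype=int)
labels[:10] = 1
pattern = np.outer(rng.standard_normal(n_channels), np.hanning(n_times))
trials[labels == 1] += pattern       # add a fixed evoked pattern to targets

W = xdawn_filters(trials, labels, target_class=1, n_filters=2)
enhanced = W.T @ trials[0]           # spatially filtered trial, shape (2, T)
print(W.shape, enhanced.shape)
```

Applying `W.T` to a trial replaces the C electrode channels with `n_filters` virtual channels in which the averaged class pattern stands out most strongly against the overall signal power.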
Finally, we applied Z-score normalization to each individual enhanced EEG segment:
$$\hat{X} = \frac{\tilde{X} - \mu}{\sigma} \tag{4}$$
where $\tilde{X}$ is the enhanced EEG segment, and µ and σ are the mean and standard deviation of each channel of the enhanced EEG data, respectively. This approach ensured that the data were standardized and consistent across participants and electrodes.
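A short sketch of this normalization is given below. It assumes that µ and σ are computed per channel over the time axis of each segment, which is one reading of the description; the small `eps` term is an added safeguard against division by zero and is not part of the original formula.

```python
import numpy as np


def zscore_segment(segment, eps=1e-8):
    """Z-score each channel (row) of a (channels, time) segment over time."""
    mu = segment.mean(axis=-1, keepdims=True)
    sigma = segment.std(axis=-1, keepdims=True)
    return (segment - mu) / (sigma + eps)   # eps is an assumption, for stability


rng = np.random.default_rng(0)
segment = 5.0 + 3.0 * rng.standard_normal((8, 96))   # hypothetical enhanced segment
normalized = zscore_segment(segment)
print(normalized.mean(axis=-1).round(6), normalized.std(axis=-1).round(6))
```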

D. Network Architecture
The xDAWN filtering approach estimates a set of spatial filters by maximizing the difference between different classes [26]. It is hard for deep learning models to learn specific domain knowledge, as they often require a large amount of data to learn a certain inductive bias [25]. Therefore, we linked xDAWN with the classic EEGNet to maximize the use of the prior information and further improve the model's detection performance on ERP signals.
Fig. 1 shows the structure and detailed description of our proposed architecture. The xDAWN spatial filters were first applied to the input EEG data $X_i$, followed immediately by a temporal convolution (kernel size 1×64, stride 1 and 'same' padding) and BatchNorm to produce F1 feature maps (F1 was set to 8 in this paper). We then processed these feature maps using depthwise convolution, with depth D set to 2, a kernel size of M×1 and 'valid' padding. BatchNorm was then applied, and the results were activated by ELU. Next, average pooling was used to reduce the size of the feature maps, and dropout was used to avoid overfitting. The average pooling kernel size and dropout rate p were set to 1×4 and 0.25, respectively. We then used separable convolution (kernel size 1×16, stride 1 and 'same' padding) to extract deeper features, again followed by BatchNorm and ELU activation. Separable convolution is composed of a depthwise convolution and a pointwise convolution, which reduces the number of model parameters [35]. Next, we applied an average pooling layer of size 1×8 for dimension reduction and a dropout layer with p equal to 0.25. In the dense layer, N neurons are densely connected to the features of the previous layer and activated by Softmax.
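To make the layer sequence easier to follow, the sketch below traces the feature-map shapes through the described layers for both segment lengths. It is bookkeeping only, not a trainable model. Two points are assumptions: the number of separable-convolution feature maps (`f2=16`, the common EEGNet choice F1×D) is not stated in the text, and pooling is assumed to floor the output length. The number of rows after xDAWN filtering (`n_spatial`) is passed in as a free parameter.

```python
def trace_shapes(n_spatial, n_times, f1=8, depth=2, f2=16, pool1=4, pool2=8):
    """Return (layer name, output shape) pairs for the described layer sequence.

    Shapes are (feature maps, rows, time samples). f2 and floor-mode pooling
    are assumptions, not values taken from the text.
    """
    shapes = [("input after xDAWN", (1, n_spatial, n_times))]
    shapes.append(("temporal conv 1x64, same", (f1, n_spatial, n_times)))
    shapes.append(("depthwise conv Mx1, valid", (f1 * depth, 1, n_times)))
    t1 = n_times // pool1
    shapes.append(("average pool 1x4", (f1 * depth, 1, t1)))
    shapes.append(("separable conv 1x16, same", (f2, 1, t1)))
    t2 = t1 // pool2
    shapes.append(("average pool 1x8", (f2, 1, t2)))
    shapes.append(("flatten", (f2 * t2,)))
    return shapes


for name, shape in trace_shapes(n_spatial=8, n_times=96):    # P300 speller, T = 96
    print(f"{name:28s} {shape}")
for name, shape in trace_shapes(n_spatial=4, n_times=250):   # RSVP, T = 250
    print(f"{name:28s} {shape}")
```

Under these assumptions the dense layer receives 48 features for the P300 speller segments and 112 features for the RSVP segments.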

E. Training
We used the pyRiemann package [36] to estimate the xDAWN spatial filters and reproduced EEGNet with PyTorch [37]. The proposed model was trained on a GeForce RTX 2080 Ti GPU. The batch size was set to 64. The Adam optimizer with default parameters was used, and the learning rate was initially set to 0.001 with an exponential decay rate of 0.96. The P300 speller dataset consisted of two categories: non-P300 and P300 samples. In contrast, the RSVP dataset encompassed three categories: non-target, boat (target 1), and geometric star (target 2) samples. Both datasets exhibited class imbalance, with the P300 speller dataset having a category ratio of 5:1 and the RSVP dataset showing a more pronounced imbalance with a category ratio of 145:1:1. To reduce the effect of imbalance, we employed focal loss [38] with weights:
$$FL = -\sum_{k} w_k \,(1 - p_k)^{\gamma}\, t_k \log(p_k) \tag{5}$$
where $p_k$ is the predicted probability of class k, $t_k$ is the truth label (a value of 0 or 1), $w_k$ is the assigned weight for class k (see (6)), and γ is a hyperparameter that down-weights the loss of well-classified samples. We set γ to 2 as recommended in [38].
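A minimal NumPy sketch of a class-weighted focal loss of this form is shown below. The class weights in the example are hypothetical placeholders (the weighting rule is not reproduced here), and averaging over the batch is an assumption.

```python
import numpy as np


def weighted_focal_loss(probs, onehot, class_weights, gamma=2.0, eps=1e-12):
    """Class-weighted focal loss, averaged over the batch (assumed reduction).

    probs  : (batch, n_classes) predicted probabilities
    onehot : (batch, n_classes) one-hot truth labels
    """
    probs = np.clip(probs, eps, 1.0)                      # avoid log(0)
    per_class = -class_weights * (1.0 - probs) ** gamma * onehot * np.log(probs)
    return float(per_class.sum(axis=1).mean())


probs = np.array([[0.9, 0.1],      # confidently correct non-P300 sample
                  [0.3, 0.7]])     # less confident P300 sample
onehot = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
weights = np.array([1.0, 5.0])     # hypothetical weights for a 5:1 imbalance

print(weighted_focal_loss(probs, onehot, weights, gamma=2.0))
# With gamma = 0 and unit weights the expression reduces to cross-entropy.
print(weighted_focal_loss(probs, onehot, np.ones(2), gamma=0.0))
```

The factor $(1 - p_k)^{\gamma}$ shrinks the contribution of samples that are already classified with high confidence, so training focuses on the harder, typically minority-class, samples.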
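In addition, we used Mixup [39], defined in (7), for data augmentation, where λ∼Beta(α, α). It picks two random samples $x_i$ and $x_j$ with the corresponding one-hot labels $y_i$ and $y_j$, and then combines them linearly to generate a new sample $\tilde{x}$ and label $\tilde{y}$:
$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j \tag{7}$$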
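The Mixup step can be sketched as below. Pairing each sample with a randomly permuted partner from the same batch and drawing a single λ per batch is a common implementation choice and is assumed here; the function name and the example batch are hypothetical.

```python
import numpy as np


def mixup_batch(x, y_onehot, alpha, rng):
    """Mix each sample with a randomly chosen partner from the same batch."""
    lam = rng.beta(alpha, alpha)                  # lambda ~ Beta(alpha, alpha)
    perm = rng.permutation(len(x))                # partner index for every sample
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 96))               # hypothetical batch of segments
y = np.eye(2)[[0, 1, 0, 0]]                       # one-hot labels, two classes
x_mix, y_mix, lam = mixup_batch(x, y, alpha=0.3, rng=rng)
print(lam, y_mix)
```

Because the labels are mixed with the same λ as the inputs, each mixed label is a soft label whose entries still sum to one.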
We set up different configurations for the P300 speller and RSVP paradigms. Specifically, for the P300 speller, we set the number of xDAWN filters to 8 and the Mixup alpha value to 0.3. For the RSVP paradigm, a different configuration yielded better performance, so we set the number of xDAWN filters to 4 and the Mixup alpha value to 0.4. For the P300 speller task, we trained the model for 80 epochs. For the RSVP task, where a dedicated test set was not available, we employed three-fold cross-validation to evaluate the model. The dataset was divided into three folds; each fold was used once as the validation set, while the remaining two folds were used for training. For each subject, we computed the unweighted average recall (UAR) on all three folds during cross-validation and averaged the three validation score curves across epochs, resulting in a single averaged validation score curve per subject. Finally, the epoch with the highest UAR on the averaged validation score curve was selected as the RSVP result for each subject.
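The epoch selection described above amounts to averaging the per-fold validation curves and taking the maximum of the averaged curve. A small sketch with made-up UAR values for three folds and four epochs:

```python
import numpy as np


def select_epoch(fold_curves):
    """Average per-fold validation curves and return (best epoch index, score)."""
    mean_curve = np.mean(np.asarray(fold_curves, dtype=float), axis=0)
    best = int(np.argmax(mean_curve))
    return best, float(mean_curve[best])


# Hypothetical validation UAR per epoch for the three folds of one subject.
fold_curves = [
    [0.60, 0.70, 0.72, 0.71],
    [0.58, 0.69, 0.75, 0.70],
    [0.62, 0.68, 0.74, 0.73],
]
best_epoch, best_uar = select_epoch(fold_curves)
print(best_epoch, round(best_uar, 4))
```

Here the epoch index is zero-based, so the third epoch is selected in this illustrative example.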

F. Symbol Decision Function for P300 Speller
In the case of the P300 speller dataset, once the model is trained, the next step is to determine the position of the desired symbol based on the model's output. This involves detecting the row and column in which the symbol is located. Each symbol in the dataset is repeated 15 times, with each repetition consisting of 12 stimuli represented by a stimulus code value ranging from 1 to 12. Let $q_j^{(i)}$ denote the softmax output (with temperature 10) of the neuron that represents the P300 probability when the stimulus code value is j in the i-th repetition. $Q_j^{(z)}$ is the sum of those probabilities from the first to the z-th repetition under stimulus code value j:
$$Q_j^{(z)} = \sum_{i=1}^{z} q_j^{(i)}$$
Then, on the z-th repetition, we may identify the target symbol's column c and row r by:
$$c = \arg\max_{1 \le j \le 6} Q_j^{(z)}, \qquad r = \arg\max_{7 \le j \le 12} Q_j^{(z)}$$
where stimulus codes 1-6 index the columns and codes 7-12 index the rows of the symbol matrix.
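The decision rule can be sketched as follows. The sketch assumes the stimulus coding of the competition datasets (codes 1-6 for columns, 7-12 for rows) and the standard 6×6 symbol layout; the probabilities are synthetic, with the column and row of the letter "I" boosted for illustration. The helper `p300_probability` shows one way to apply a softmax with temperature, assuming the logits are divided by the temperature and that output index 1 is the P300 neuron.

```python
import numpy as np

# Symbol matrix layout assumed for the competition speller (rows top to bottom).
MATRIX = ["ABCDEF", "GHIJKL", "MNOPQR", "STUVWX", "YZ1234", "56789_"]


def p300_probability(logits, temperature=10.0):
    """Softmax with temperature over two output neurons; index 1 = P300 (assumed)."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return (e / e.sum(axis=-1, keepdims=True))[..., 1]


def decide_symbol(q, z):
    """q: (n_repetitions, 12) P300 probabilities; column j-1 holds stimulus code j."""
    Q = q[:z].sum(axis=0)                 # accumulate over the first z repetitions
    col = int(np.argmax(Q[:6]))           # stimulus codes 1-6: columns
    row = int(np.argmax(Q[6:12]))         # stimulus codes 7-12: rows
    return MATRIX[row][col], row + 1, col + 1


rng = np.random.default_rng(0)
q = 0.2 + 0.1 * rng.random((15, 12))      # synthetic background probabilities
q[:, 2] += 0.5                            # stimulus code 3 (third column)
q[:, 7] += 0.5                            # stimulus code 8 (second row)
symbol, row, col = decide_symbol(q, z=5)
print(symbol, row, col)
```

Summing probabilities over repetitions lets evidence accumulate, which is why accuracy typically improves as z grows from 1 to 15.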

G. Method of Evaluation
We used symbol recognition accuracy and ITR to evaluate the performance of different models in the P300 speller paradigm. Following [40], the ITR (in bits/min) in the i-th repetition is defined as:
$$ITR_i = \frac{60}{2.5 + 2.1\, i}\left[\log_2 G + A \log_2 A + (1 - A)\log_2\frac{1 - A}{G - 1}\right]$$
where A is the symbol recognition accuracy, and G (i.e. 36) denotes the number of symbols presented in the P300 speller paradigm. For the RSVP paradigm, we adopted the unweighted average recall (UAR) [41] to evaluate accuracy on the imbalanced dataset, defined as:
$$UAR = \sum_{k=1}^{N} w_k \frac{c_k}{t_k}$$
where $N$ denotes the total number of categories, $w_k$ denotes the weight factor applied to each category, which was set to [0.33, 0.33, 0.33], $t_k$ denotes the number of images in category k, and $c_k$ is the number of correct predictions in category k.
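Both metrics are straightforward to compute. In the sketch below, the time per symbol of 2.5 + 2.1·i seconds is the value commonly used for this competition dataset and is taken as an assumption; the accuracy of 0.56 and the per-class counts are illustrative inputs, not reported results.

```python
import math


def _xlog2(x, y):
    """x * log2(y), with the convention that the term is 0 when x == 0."""
    return 0.0 if x == 0 else x * math.log2(y)


def itr_bits_per_min(accuracy, repetition, n_symbols=36):
    """ITR for a given symbol accuracy after `repetition` rounds."""
    a = accuracy
    bits = (math.log2(n_symbols)
            + _xlog2(a, a)
            + _xlog2(1 - a, (1 - a) / (n_symbols - 1)))
    return 60.0 * bits / (2.5 + 2.1 * repetition)


def uar(correct, totals, weights=None):
    """Weighted sum of per-class recalls; equal weights give the usual UAR."""
    n = len(totals)
    if weights is None:
        weights = [1.0 / n] * n
    return sum(w * c / t for w, c, t in zip(weights, correct, totals))


# Illustrative inputs: 56% accuracy in the second repetition round.
print(round(itr_bits_per_min(0.56, repetition=2), 2))
# Three classes with recalls 0.9, 0.5 and 0.0.
print(round(uar(correct=[90, 1, 0], totals=[100, 2, 2]), 4))
```

Because every class recall carries the same weight, a model that ignores the rare target classes scores poorly on UAR even if its overall accuracy is high.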

H. Models for Comparison
A series of prevailing models were selected for comparison with the proposed one. A brief description of the characteristics of these models follows:
1) Spatially weighted Fisher linear discriminant-PCA (SWFP) is a method designed for single-trial ERP detection. It uses the Fisher linear discriminant (FLD) to estimate spatial filters at each time point, which are then applied to an EEG sample for spatial filtering. PCA is then used for dimensionality reduction, with six principal components retained to explain over 70% of the variance [42].
2) Ensemble Support Vector Machines (ESVMs) [43] is a machine learning method that combines multiple SVM models to improve classification performance for P300 detection. This approach has been shown to achieve good classification performance.
3) DeepConvNet [24] is a deeper convolutional model for end-to-end EEG analysis that uses temporal convolution, spatial convolution and pooling operations and is a general approach for EEG decoding tasks in the BCI domain.
4) ShallowConvNet [24] is a simpler model with fewer convolutional layers and has also been successful in EEG decoding tasks in the BCI domain.

A. Performance of Symbol Recognition For P300 Speller
Dataset II consists of two subjects, A and B, while dataset IIb contains only one subject. 2 present the number of symbols correctly recognized per repetition by each model on dataset II and dataset IIb, respectively. TABLE 1 presents the results of paired t-tests (i.e. 30 pairs for dataset II and 15 pairs for IIb), which compare symbol accuracy according to [45]. The authors did not report the P300 speller performance of DeepConvNet, ShallowConvNet, EEGNet and DCPM. Hence, we ran these models (except for DCPM, as it is a traditional machine learning algorithm) with the same training strategy (see section E Training). We also took the reported results of other models (SWFP, ESVMs, BN3, CNN-RG-MINMA and ST-CapsNet) for comparison. In dataset II, our proposed model recognizes more symbols with fewer repetition rounds. Notably, our model performs better than the other models ($p < 0.05$), except for DeepConvNet ($p > 0.05$). In dataset IIb, our model performs significantly better than SWFP, BN3 and DCPM ($p < 0.05$). While the other methods may show better performance than our model in dataset IIb, these differences are not statistically significant ($p > 0.05$). These findings underscore the effectiveness of our model as a viable option for P300 signal detection, particularly in scenarios where the number of repetition rounds is limited and high accuracy must be balanced against the number of repetitions.

B. Effect of xDAWN Number on Symbol Recognition
Extensive experiments were conducted to investigate the effect of the number of xDAWN filters on symbol recognition rates, with results presented in Fig. 2. The averaged symbols under repetitions (ASUR) metric [40] was used to compare the performance of models with different numbers of xDAWN filters more intuitively.
$$ASUR_k = \frac{1}{k}\sum_{i=1}^{k} C_i$$
where $ASUR_k$ stands for the average number of correctly recognized symbols per repetition when k repetitions are taken into account, and $C_i$ stands for the number of correctly recognized symbols in the i-th repetition. The number of xDAWN filters ranged from 0 to 20 with an interval of 2, where 0 means that no xDAWN filter was added (i.e. EEGNet). The Mixup alpha values tested were 0.2, 0.3, and 0.4. Analyzing Fig. 2, we observed that in dataset II, subject A displayed an upward trend in the average symbol recognition rate as the number of xDAWN filters increased from 2 to 8, eventually reaching a stable level. Conversely, subject B exhibited a noticeable improvement in symbol recognition from 2 to 4 filters, followed by stabilization with a slight decline. Notably, when the number of xDAWN filters was set to 8, both subjects A and B demonstrated a performance improvement. In addition to examining the influence of xDAWN filters, we also investigated the impact of the Mixup alpha value on the symbol recognition rate. We found that the Mixup alpha value had a relatively minor effect compared to xDAWN. Our results indicate that employing 8 xDAWN filters and a Mixup alpha value of 0.3 led to improved performance compared to scenarios where xDAWN and Mixup were not used. However, it is important to note that for dataset IIb, the xDAWN filters resulted in a decrease in the symbol recognition rate. This finding emphasizes the need for careful consideration when selecting the number of xDAWN filters, taking into account the specific dataset and task at hand.
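For reference, the ASUR metric used in this comparison is simply the mean number of correctly recognized symbols over the first k repetitions. A minimal sketch with hypothetical counts:

```python
def asur(correct_per_repetition, k):
    """Average number of correctly recognized symbols over the first k repetitions."""
    return sum(correct_per_repetition[:k]) / k


# Hypothetical counts of correctly recognized symbols in repetitions 1 to 5.
counts = [20, 40, 60, 70, 80]
print(asur(counts, k=5))
```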

C. Performance of ITR For P300 Speller
The ITRs of each model on datasets II and IIb were plotted in Fig. 3 to visually compare the speed of symbol spelling. For subject A of dataset II, BN3 and ESVMs had higher ITRs than the other models, while our method and DeepConvNet had higher ITRs for subject B. Overall, our model had the best ITR performance on dataset II, particularly in the second repetition, where its ITR reached 17.22 bits/min. On dataset IIb, ST-CapsNet had the best ITR performance, as shown in Fig. 3.

D. Performance of UAR For RSVP
To evaluate the model performance on RSVP tasks, 3-fold cross-validation was implemented for each model. The training strategy (refer to section E Training for details) was kept consistent across all models (except for SWFP, ESVMs, and DCPM, which are traditional machine learning algorithms). Significant differences were analyzed by paired t-test (n = 16). The UAR performance of each model at 5 Hz and 20 Hz RSVP is illustrated in Fig. 4 (a) and (b), respectively. The proposed method achieved the highest UAR performance for 5 Hz RSVP (proposed method: 0.8134±0.0259; EEGNet: 0.7823±0.0201; p < 0.05). Moreover, our method exhibited even more significant improvements (p < 0.0001) over the other models for both 5 Hz and 20 Hz RSVP. It is worth noting that DeepConvNet exhibits significant variance across subjects. This can be primarily attributed to its large number of parameters and its sensitivity to hyperparameter selection, such as the γ value in the focal loss function, which leads to convergence difficulty (i.e. it failed to learn information from multiple categories). SWFP performed the worst among all models. In conclusion, the cascade xDAWN-EEGNet model demonstrated the best UAR performance on RSVP tasks compared to the other models. These findings suggest that the proposed method has potential as an effective approach for analyzing EEG data in RSVP tasks.
In Fig. 5 (b), for the detection of 20 Hz RSVP EEG, only when the number of xDAWN filters was 4 was there a slight improvement in performance over EEGNet. In other cases, performance decreased, but the improvement of UAR performance by Mixup was still significant. The results suggest that the number of xDAWN filters should be carefully chosen according to the specific RSVP characteristics. In addition, the findings indicate that our model is particularly effective for improving detection accuracy in RSVP tasks at low presentation rates.

IV. DISCUSSION
In this study, we proposed a cascade structure combining xDAWN and EEGNet for both the P300 speller and RSVP paradigms. Compared with other methods such as DeepConvNet or plain EEGNet, the proposed method achieves a higher ITR with fewer repetition rounds for the P300 speller and a high UAR in the RSVP paradigm.

A. P300 speller and RSVP paradigm
Selective attention, as measured by accuracy in an RSVP task, is closely linked to an individual's ability to update changing information over time and may also be connected to performance in P300 speller tasks [20], [47], [48]. This ability relies on attentional filtering capacity, which involves distinguishing the object of interest from distractors and maintaining this differentiation over time [21]. This concept aligns with the P300 speller, where individuals must filter out nontarget rows/columns and focus on the target until a letter is identified. The similarity between selective attention and attentional filtering is reflected in EEG components. Moreover, a multi-feature predictor that includes multiple RSVP features has been shown to predict P300 speller performance accurately, outperforming single-feature predictors [21]. Furthermore, the P300 speller and RSVP paradigms share characteristics, such as the use of low-frequency stimuli and the presence of a positive waveform within a specific time window following stimulus presentation. These findings demonstrate the commonality between the two paradigms and suggest that a uniform method could be applied across different BCI-related tasks.

B. xDAWN could enhance the P300 pattern
In the proposed approach, we used xDAWN spatial filtering to improve the SNR of the raw EEG signals before feeding them into EEGNet, thus providing a process for incorporating prior domain knowledge into the model.

C. Comparison with other methods
The key for ERP-based BCIs is to distinguish ERPs from the background EEG signals, as ERPs have many components, are weak, and can be influenced by many factors. Linear discriminant analysis (LDA) is a traditional method for the detection of ERPs. However, such a method cannot handle the various components of ERPs [42], [49], [50]. Furthermore, as mentioned before, DCPM is a robust method that has excellent performance for the detection of ERPs from various paradigms [46]. To clearly illustrate the difference between DCPM and the proposed method, we compare our work with [46]. In [46], only the single-trial classification performance of various models was compared across different ERP paradigms. However, important metrics such as symbol recognition and ITR performance in the P300 speller paradigm were not taken into consideration. These metrics are crucial for evaluating the effectiveness of a BCI system. In contrast, we conducted a comprehensive evaluation of our method and compared it with other models on standard BCI competition P300 speller datasets. We considered both ITR and symbol recognition performance, and our results showed that our method outperformed DCPM in both aspects.
Furthermore, EEGNet has proven its effectiveness in BCI competitions [41], [51], [52]. Notably, EEGNet generalizes well, performing well on diverse datasets, and is robust to noise, enabling reliable performance even in the presence of noisy input signals [23]. In addition, EEGNet stands out for its computational efficiency, allowing for efficient real-time processing. The effectiveness and simplicity of its architecture make it an optimal choice as our basic model.
This work extends our previous study [34], which combined xDAWN with EEGNet and achieved second place in the RSVP competition at the BCI Controlled Robot Contest during the 2021 World Robot Contest. In this extended work, we focus on analyzing the impact of varying the number of xDAWN filters on RSVP classification results, a crucial factor that was not previously investigated [34]. Moreover, we have explored the applicability of our algorithm to the P300 speller and investigated the effectiveness of Mixup data augmentation for both the P300 speller and RSVP tasks. Through our investigations, we found that selecting an appropriate number of xDAWN filters and Mixup alpha value can enhance the performance of our model. These findings highlight the capability of our algorithm to address a wider range of BCI applications effectively.

V. CONCLUSION
This study introduces a cascade structure for unified detection of visual-evoked related potentials. Evaluated on Dataset II of BCI Competition III, our method exhibited better symbol recognition accuracy and a higher ITR than competing approaches, reaching 17.22 bits/min in the second repetition round. Furthermore, our algorithm outperformed other models in terms of UAR on the RSVP task (0.8134±0.0259 at 5 Hz and 0.6527±0.0321 at 20 Hz). Additionally, we observed that applying xDAWN filters to raw evoked EEG signals effectively enhances the P300 pattern. As a result, our algorithm shows improved performance on both the P300 speller and RSVP paradigms. These results underscore the effectiveness of the proposed cascade structure for detecting P300-related signals across both paradigms.
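For readers unfamiliar with the two metrics reported above, the sketch below shows how they are commonly computed. It assumes the widely used Wolpaw definition of ITR and defines UAR as the mean of the per-class recalls; the exact timing convention used in this study is not restated here, so `seconds_per_selection` and the numbers in the usage line are hypothetical.

```python
import math


def wolpaw_itr(n_classes, accuracy, seconds_per_selection):
    """Information transfer rate in bits/min (Wolpaw definition)."""
    if n_classes < 2 or not 0.0 <= accuracy <= 1.0:
        raise ValueError("need n_classes >= 2 and accuracy in [0, 1]")
    p, n = accuracy, n_classes
    bits = math.log2(n)
    if p > 0.0:                       # p*log2(p) tends to 0 as p -> 0
        bits += p * math.log2(p)
    if p < 1.0:                       # error term vanishes when p == 1
        bits += (1.0 - p) * math.log2((1.0 - p) / (n - 1))
    return bits * 60.0 / seconds_per_selection


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so each class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical operating point: 36 symbols, 90% accuracy, 10 s per symbol.
print(f"{wolpaw_itr(36, 0.90, 10.0):.2f} bits/min")
```

UAR is preferred over plain accuracy for RSVP data because target stimuli are far rarer than non-targets, and a classifier that always predicts the non-target class would still score a high accuracy.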

Fig. 1 Network structure of the proposed cascade xDAWN-EEGNet. The network first applies xDAWN spatial filters to the input EEG signals and then feeds the enhanced signals into EEGNet. Max-norm constraints were applied to the weights of the DepthwiseConv2D and dense layers, with limits of 1 and 0.25, respectively.
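The first stage of the cascade, spatial filtering before EEGNet, can be sketched as follows. This is a simplified, xDAWN-like illustration and not the authors' code: it estimates the evoked response by plain averaging of target epochs, whereas the full xDAWN algorithm uses a least-squares estimate that accounts for overlapping stimuli. The function names, array shapes, and the 1/0 label coding are assumptions.

```python
import numpy as np


def fit_xdawn_like_filters(epochs, labels, n_filters=4):
    """Spatial filters maximizing evoked power relative to total power.

    epochs : array (n_trials, n_channels, n_samples)
    labels : array (n_trials,), 1 for target, 0 for non-target (assumed)
    Returns an array of shape (n_filters, n_channels).
    """
    epochs = np.asarray(epochs, dtype=float)
    labels = np.asarray(labels)
    n_channels = epochs.shape[1]
    # Covariance of the averaged target response (the "signal").
    evoked = epochs[labels == 1].mean(axis=0)
    c_signal = evoked @ evoked.T / evoked.shape[1]
    # Covariance of all recorded EEG (signal plus noise).
    flat = epochs.transpose(1, 0, 2).reshape(n_channels, -1)
    c_total = flat @ flat.T / flat.shape[1]
    # Solve the generalized eigenproblem by whitening c_total first.
    d, u = np.linalg.eigh(c_total)
    whiten = u / np.sqrt(np.maximum(d, 1e-12))
    s, v = np.linalg.eigh(whiten.T @ c_signal @ whiten)
    order = np.argsort(s)[::-1][:n_filters]   # largest ratios first
    return (whiten @ v[:, order]).T


def apply_filters(filters, epochs):
    """Project every epoch onto the spatial filters."""
    return np.einsum("fc,tcs->tfs", filters, epochs)


# Usage on synthetic data (sizes are illustrative only).
rng = np.random.default_rng(0)
epochs = rng.standard_normal((40, 8, 50))
labels = np.array([1, 0] * 20)
epochs[labels == 1] += np.outer(np.arange(8), np.hanning(50))
filters = fit_xdawn_like_filters(epochs, labels, n_filters=4)
print(apply_filters(filters, epochs).shape)
```

The filtered epochs, with `n_filters` virtual channels instead of the original electrodes, would then be passed to the classifier in place of the raw signals.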

Fig. 2 Effect of the number of xDAWN filters on the symbol recognition rate. Subplots (a), (b), and (c) show the symbol recognition rates for subjects A and B in dataset II and for dataset IIb, respectively.

Fig. 3

Fig. 4 Comparison of the UAR performance of each model on RSVP. Figures (a) and (b) show the UAR performance of the models at 5 Hz and 20 Hz RSVP, respectively.
To show the effect of the number of xDAWN filters and of Mixup on UAR, several experiments were conducted; the results are shown in Fig. 5. The number of xDAWN filters ranged from 0 to 14 in steps of 2, excluding the case of 2 filters because of its known poor performance, as shown in Fig. 2. Fig. 5(a) shows that as the number of xDAWN filters increases, the detection accuracy for 5 Hz RSVP EEG also increases. Furthermore, larger Mixup alpha values lead to a larger UAR. In Fig. 5(b), for the detection of 20 Hz RSVP EEG, only when the number of xDAWN filters was 4 was there a slight

To visualize the EEG signals evoked by different paradigm stimuli, we first plotted the EEG signals evoked by the P300 speller, 5 Hz RSVP, and 20 Hz RSVP, along with their EEG spatial distributions (0.25-0.6 s), as shown in Fig. 6(a), (c), and (e), respectively. These three figures show that the spatial distribution of the EEG signals evoked by the target sample in the P300 speller is similar to that of the EEG signals evoked by the target 1 and target 2 samples in the 5 Hz and 20 Hz RSVP. Afterward, we used the xDAWN spatial filter to enhance the EEG signal evoked by the target stimulus. For uniformity, we set the number of xDAWN spatial filters to 4 for visualization. The evoked signals of each paradigm after xDAWN filtering, the time-frequency thermograms, and the extracted EEG distribution patterns are shown in Fig. 6(b), (d), and (f), respectively.

Fig. 5 Effect of different xDAWN filters and different Mixups on UAR. Figures (a) and (b) show the effects on UAR for 5 Hz and 20 Hz RSVP, respectively.

Fig. 6 Comparison of EEG signals of different paradigms. Figures (a), (c), (e) represent the averaged raw signal and its EEG topography; Figures (b), (d), (f) represent the evoked potentials, time-frequency thermograms, and xDAWN-extracted patterns after xDAWN filtering.

A Cascade xDAWN EEGNet Structure for Unified Visual-evoked Related Potential Detection Hongtao Wang, Senior Member, IEEE, Zehui Wang, Yu Sun, Senior Member, IEEE,

This article has been accepted for publication in IEEE Transactions on Neural Systems and Rehabilitation Engineering. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/TNSRE.2024.3415474 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

of averaged patterns and the overall EEG signals. Such estimated spatial filters make full use of strong domain knowledge (e.g. the SNR of ERP signals can be improved by trial averaging and the averaged trial signals are representative of typical ERP signals for respective tasks

TABLE 1 NUMBER OF SYMBOLS CORRECTLY RECOGNIZED PER REPETITION FOR EACH MODEL ON DATASET II

TABLE 2 NUMBER OF SYMBOLS CORRECTLY RECOGNIZED PER REPETITION FOR EACH MODEL ON DATASET IIB

Discriminative Canonical Pattern Matching (DCPM) is a machine learning algorithm that is highly robust in detecting ERP components from different paradigms with excellent performance. It is especially useful when there is limited training data available [46].
9) EEGNet [23] is a lightweight end-to-end CNN that incorporates temporal convolution, spatial convolution, separable convolution, and classification layers. It has demonstrated good robustness and has been widely used as a benchmark in EEG analysis.

III. RESULTS

(d). These findings suggest that our cascaded xDAWN-EEGNet model, DeepConvNet, and ST-CapsNet can be suitable models for achieving high ITRs in P300 speller systems.