Comparing user-dependent and user-independent training of CNN for SSVEP BCI

Objective. We present a comparative study on the training methodologies of a convolutional neural network (CNN) for the detection of steady-state visually evoked potentials (SSVEP). Two training scenarios were compared: user-independent (UI) training and user-dependent (UD) training. Approach. The CNN was trained in both UD and UI scenarios on two types of features for SSVEP classification: magnitude spectrum features (M-CNN) and complex spectrum features (C-CNN). The canonical correlation analysis (CCA), widely used in SSVEP processing, served as the baseline. Additional comparisons were performed with task-related component analysis (TRCA) and filter-bank canonical correlation analysis (FBCCA). The performance of the proposed CNN pipelines, CCA, FBCCA and TRCA was evaluated on two datasets: a seven-class SSVEP dataset collected from 21 healthy participants and a twelve-class publicly available SSVEP dataset collected from ten healthy participants. Main results. The UD training methods consistently outperformed the UI methods when all other conditions were the same, as one would expect. However, the proposed UI-C-CNN approach performed similarly to the UD-M-CNN across all cases investigated on both datasets. On Dataset 1, the average accuracies of the different methods for a 1 s window length were: CCA: 69.1% ± 10.8%, TRCA: 13.4% ± 1.5%, FBCCA: 64.8% ± 15.6%, UI-M-CNN: 73.5% ± 16.1%, UI-C-CNN: 81.6% ± 12.3%, UD-M-CNN: 87.8% ± 7.6% and UD-C-CNN: 92.5% ± 5%. On Dataset 2, the average accuracies of the different methods for a data length of 1 s were: UD-C-CNN: 92.33% ± 11.1%, UD-M-CNN: 82.77% ± 16.7%, UI-C-CNN: 81.6% ± 18%, UI-M-CNN: 70.5% ± 22%, FBCCA: 67.1% ± 21%, CCA: 62.7% ± 21.5%, TRCA: 40.4% ± 14%.
Visualizing the features extracted by the CNN pipelines using t-SNE further revealed that the C-CNN method likely learned both the amplitude and phase related information from the SSVEP data for classification, resulting in superior performance compared to the M-CNN methods. The results suggest that the UI-C-CNN method proposed in this study offers a good balance between performance and the cost of collecting training data. Significance. The proposed C-CNN based method is a suitable candidate for SSVEP-based BCIs and provides improved performance in both UD and UI training scenarios.

A brain-computer interface (BCI) enables a user to communicate with the external environment using brain activity alone. BCIs can be classified into two broad categories: endogenous BCIs and exogenous BCIs [5]. Endogenous BCIs allow a user to modulate his/her neuronal activity based entirely on covert intentions. These include paradigms such as motor imagery BCI, imagined tactile response based BCI, etc. Exogenous BCIs depend on an external stimulus to modulate the user's neuronal activity, providing contextual information regarding the intention of the user. Examples include P300 based BCIs, steady-state visual evoked potential (SSVEP) based BCIs, steady-state motion visual evoked potential (SSMVEP) BCIs, etc. A key advantage of exogenous BCIs over endogenous BCIs is that they do not require as much user training. In the present study, we investigate the SSVEP BCI paradigm. Desirable properties of SSVEP-based BCIs include low participant training time, high signal-to-noise ratio and high information transfer rate (ITR). In this paradigm, one or more visual stimuli in the form of flickering light sources are presented to the user on a computer screen, with each stimulus flickering at a certain frequency. When the user attends to one of the stimuli, an SSVEP response is elicited in the occipito-parietal region of the brain. These responses manifest as an increase in the amplitude of the EEG at the corresponding flicker frequency and often its harmonics. Each stimulus is usually mapped to a command in the external control application. The stimulus with the user's attention can be identified by analyzing the dominant response in the EEG, and the corresponding command is generated.
As with any human-machine interface, there are two learning agents in BCI: the user and the algorithm. One of the advantages of SSVEP-based BCIs is that little, if any at all, user training is required. On the other hand, several feature extraction and classification algorithms have been developed for SSVEP processing, which can be categorized into three broad categories: training-free methods, user-specific training methods and user-independent (UI) training methods [6]. Training-free algorithms do not require any training data and the user of the BCI can immediately start using the system [7][8][9][10]. The most widely used training-free method for SSVEP classification is the canonical correlation analysis (CCA) [7,8]. This method has been used as the baseline algorithm for SSVEP detection. CCA is a multi-variable statistical technique that allows us to capture the underlying correlation between two random variables. One variable is the EEG data and the other variable is a set of sinusoidal reference templates corresponding to the stimuli frequencies.
Secondly, the user-specific or user-dependent (UD) training methods require training data from each user, from which a user-specific model is generated. Finally, the UI training methods require training data from multiple participants and a generalized model, or a 'UI' model, is generated such that it can be applied to unseen users. This method is particularly suited for SSVEP BCIs as these responses are more consistent across most humans, compared with other signal modalities of BCI such as sensory-motor rhythm [11]. Figure 1 illustrates the UD and UI training methods. Among the three categories, the UI method is favorable next to the training-free method as this does not require any training/calibration data to be collected from a new unseen user, thus making the system virtually calibration-free once properly trained. Most UD and UI approaches for SSVEP have been extensions of the classical CCA method that incorporate some form of template matching scheme in addition to the sinusoidal reference templates. Most widely used UD methods include: Combination method-CCA [12], Individual Template CCA (IT-CCA) [13] and more recently proposed Task Related Components Analysis (TRCA) [14]; UI methods include: Filter-Bank CCA (FBCCA) [15], Combined-tCCA and Adaptive Combined CCA (A3C) [16].
Recently, there has been an increased interest in the application of deep learning algorithms for detection and classification in EEG based BCI [17][18][19]. Deep learning offers the advantage of automatic feature extraction, either in the time-domain EEG or in a transform domain, as opposed to sophisticated hand-crafted feature extraction methods. A recent survey indicated that 41% of studies used some form of transform/feature extraction before applying a deep neural network [19]. The convolutional neural network (CNN) was the most prevalent architecture, accounting for 43% of these studies [19]. Many recent studies have shown that CNNs provide significant improvement in performance compared to traditional techniques for SSVEP detection [20][21][22][23][24][25]. Among these studies, [21, 23, 24] transformed the SSVEP trials into the frequency domain before providing them as input to the CNN for classification. In [24], an asynchronous SSMVEP BCI was developed in which EEG was converted to the frequency domain using a fast Fourier transform (FFT) and then applied as input to the CNN to distinguish between an intentional control (IC) and a no control (NC) state. Subsequently, CCA was used to classify the SSMVEP targets. This approach was shown to outperform traditional approaches such as the CCA-Threshold and CCA-kNN methods. A similar FFT based transformation was applied to the SSVEP data in [21, 23]. In [23], the authors showed that CNN classification was better than LASSO [26] in decoding the SSVEP targets. It is important to note that these studies used only the UD training procedure. On the other hand, [22] was one of the early studies to evaluate a UI training procedure for SSVEP detection using a CNN. The authors used the time-domain EEG directly as input to the CNN and showed that it was able to learn discriminable features to classify twelve unique SSVEP targets.
In addition, the authors showed the importance of phase related information present in the SSVEP data, which aided in better classification accuracy. Although these studies have independently performed UD and UI training of CNNs, to our knowledge, there are only a limited number of studies comparing UD and UI training of a CNN for SSVEP detection. A recent survey on training methods for SSVEP BCIs indicated a glaring gap in the literature: a lack of comparative performance studies between UD and UI methods for SSVEP BCIs [6]. Therefore, in the current study, we address this gap by providing a comparison of UD and UI training methods for SSVEP BCIs. Specifically, we compare the performance of different feature extraction methods with a CNN for SSVEP classification.
A generic architecture that works across multiple datasets is highly desirable for any deep neural network-based approach. Using the time domain as input to a CNN poses some challenges in this regard. The dimensions of the input data directly depend on the sampling rate of the EEG system. A subsequent up- or down-sampling step may be required when a CNN model trained on data with a lower sampling rate is to be used on data with a higher sampling rate, and this could lead to loss of information. The ITR is directly influenced by the window length of the time-domain data, and changing the window length thus requires modifying the input layer of the CNN. These challenges can be addressed by using the frequency-domain representation as input to the CNN, achieved by fixing the resolution of the FFT. Moreover, earlier studies using CNNs were exclusively based on the magnitude spectrum of the FFT and ignored the phase related information [21,23]. Many studies have used a frequency-phase coding approach in the stimulus design [12,27] and use the phase information as part of the classifier to detect the SSVEP targets [28,29]. Models that ignore the phase information could potentially underperform on these datasets. The frequency-domain models were also computationally lighter than the CNN using time-domain data [22], the goal of which was to modify a previously proposed Compact-CNN [30], designed for BCI tasks such as P300, error-related negativity (ERN), movement related cortical potentials (MRCP) and sensorimotor rhythms (SMR), to be re-purposed for SSVEP BCIs. Although a generic architecture that works across multiple BCI paradigms is desirable, a task-specific CNN model could provide high performance while maintaining a less complex architecture.
In the current study, we propose combining the real and imaginary parts of the FFT and providing them as input to the CNN, as this combines both the amplitude and phase related information in the SSVEP for decoding the targets. The preliminary results of this approach were presented in [31]. The proposed method aimed to study the performance in UD and UI scenarios while maintaining a simple architecture. This architecture was inspired by [21] and was used in a previous study [32]. This model achieved reduced computational complexity and a reduced number of tunable parameters compared to the CNN model previously published in [22] for SSVEP classification. One of the key challenges highlighted in deep learning based methods for BCIs is the reproducibility of results [18], where the authors provided guidelines such as: providing a clear description of the architecture, clearly describing the data used, using existing datasets and evaluating the performance against a baseline. Therefore, the proposed method was compared with the magnitude spectrum-based transformation under both UD and UI training scenarios. CCA was used as the baseline algorithm. FBCCA and TRCA were also compared with the proposed methods. In addition, two datasets were used in this study for the comparison: (1) a seven-class SSVEP dataset with 21 participants recorded in our lab and (2) a publicly available 12-class SSVEP dataset with ten participants, which has been used in many earlier studies [12,13,16,22]. Additional comparisons were performed with other published methods that reported classification accuracies on the same public dataset.
In the next section, the datasets and methodologies used in this study are detailed. In section 3, the results of the comparison of all the methods on both seven class and 12 class datasets are presented. Section 4 discusses the results of the experiments. The conclusion and directions for future work are provided in section 5.

Dataset description
The proposed methods and other previously published methods were evaluated and compared on two datasets; a dataset acquired in our lab-Dataset 1 [33] and a public dataset-Dataset 2 [12].

Dataset 1
Twenty-one healthy adults (six females and 15 males, aged 19-28 years) with normal or corrected-to-normal vision participated in an offline experiment. The experiment was approved by the Office of Research Ethics of the University of Waterloo (ORE # 31850). Written informed consent was obtained from each participant before starting the experiment. All participants were seated in a comfortable chair at 0.6 m from an LCD monitor. Seven flickering stimuli were displayed on the monitor (60 Hz refresh rate) with the following flicker frequencies: 8.423 Hz, 9.375 Hz, 9.961 Hz, 10.84 Hz, 11.87 Hz, 13.4 Hz, and 14.87 Hz. One stimulus was fixed at the center of the screen and six surrounding stimuli were placed concentrically around the central stimulus. Each stimulus was white in colour and circular in shape, as these were shown to elicit better SSVEP responses than other shapes and colors [34,35]. The inter-stimulus distance or viewing angle was measured as a function of the distance between the centers of each stimulus. This was fixed at 5.24° as used in previous studies [33].
EEG data was acquired using the g.USBamp and Gammabox (g.tec Guger Technologies, Austria) wet electrode (g.Scarabeo) system with a sampling rate of 1200 Hz. Six active electrodes were placed at the occipital and occipito-parietal areas as follows: O1, O2, Oz, PO3, POz and PO4, according to the international 10-20 system. FPz was used as the ground and right ear lobe was used as the reference.
The experimental protocol was similar to the one used in our previous study [33]. At the beginning of each trial, the participant was directed by a visual cue (yellow marker above the target stimulus) to gaze at the target stimulus of the trial on the screen. This cuing period was 2 s. The participant was asked to focus on the target stimulus for a 6 s period. A break of 4 s between two consecutive trials was provided. Each stimulus was repeated eight times in a single run, resulting in 56 trials per run. A pseudorandom sequence was generated for stimulus presentation. In addition, the participants were asked to avoid any eye blinks or sudden jerky movements during the trials. The experimental protocol and stimulus were designed using OpenViBE software [36]. All data were recorded, stored and analyzed offline.

Dataset 2
An offline SSVEP dataset collected on ten healthy volunteers was downloaded from a public repository [12]. All participants were seated in a comfortable chair at 0.6 m from an LCD monitor in a dim room. Twelve flickering stimuli were displayed on the monitor with the following flicker frequencies: 9.25 Hz, 9.75 Hz, 10.25 Hz, 10.75 Hz, 11.25 Hz, 11.75 Hz, 12.25 Hz, 12.75 Hz, 13.25 Hz, 13.75 Hz, 14.25 Hz, and 14.75 Hz. The stimuli were arranged in a 4 × 3 grid of 6 cm × 6 cm squares that represented a numeric keypad.
The EEG data was acquired using the BioSemi ActiveTwo EEG (Biosemi B.V., Netherlands) system with a sampling rate of 2048 Hz. Eight active electrodes were placed over the occipito-parietal areas. At the beginning of each trial, the participant was directed by a red square cue to gaze at a specific stimulus on the screen. The cuing period was 1 s. The participant was asked to focus on the targeted stimulus for a duration of 4 s. One block consisted of 12 trials with one trial for each of the 12 stimuli on the screen. They were presented in a random order. A total of 15 blocks were presented leading to a total of 180 trials.

Pre-processing and feature extraction

Pre-processing
The pre-processing for each dataset was performed separately. For Dataset 1, the signals from three occipital channels O1, O2 and Oz were filtered using a 4th order Butterworth band-pass filter between 1 Hz and 40 Hz. Each 6 s trial was then segmented with a sliding window scheme with different widths: 0.5 s, 1 s, 1.5 s, 2 s, 2.5 s, 3 s, and with a step of 100 ms to bootstrap the number of training epochs. For Dataset 2, the signals were pre-processed based on [12, 22] for comparison. All eight channels were used from this dataset. Consistent with the analysis methods in [12, 22], a 4th order Butterworth band-pass filter between 6 Hz and 80 Hz was used to filter the data. Each 4 s trial was divided into 1 s non-overlapping segments as per [22].
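The filtering and sliding-window segmentation described above can be sketched as follows; the function name and the zero-phase `filtfilt` choice are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(eeg, fs, band, win_s, step_s):
    """Band-pass filter multi-channel EEG and cut sliding-window segments.

    eeg: (n_channels, n_samples) array; fs: sampling rate in Hz;
    band: (low, high) cut-offs in Hz; win_s/step_s: window and step in seconds.
    """
    # 4th order Butterworth band-pass, applied forward-backward (zero phase)
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, eeg, axis=1)
    win, step = int(win_s * fs), int(step_s * fs)
    segments = [filtered[:, s:s + win]
                for s in range(0, filtered.shape[1] - win + 1, step)]
    return np.stack(segments)  # (n_segments, n_channels, win)

# e.g. one 6 s trial at 1200 Hz, 1 s windows, 100 ms step -> 51 segments
trial = np.random.randn(3, 7200)
segs = preprocess(trial, fs=1200, band=(1, 40), win_s=1.0, step_s=0.1)
```

The sliding windows illustrate how a single 6 s trial is bootstrapped into dozens of training epochs.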

Magnitude spectrum features
Prior studies have considered the use of magnitude spectrum features as input to a CNN for SSVEP classification [21,23,24,32]. In these prior studies, the pre-processed time-domain EEG signal x(n) was transformed into the frequency domain X(k) by computing the FFT, resulting in a sequence of complex numbers Re(X(k)) + j Im(X(k)), from which the magnitude spectrum was calculated as:

|X(k)| = √(Re(X(k))² + Im(X(k))²). (1)

In the current study, the frequency resolution of the FFT was fixed at 0.2930 Hz and the frequency components between 3 Hz and 35 Hz were selected. As a result, the length of the FFT-transformed signal was N_fc = 110. The resultant signals computed along each channel were stacked one below the other to form a matrix with dimensions N_ch × N_fc, where N_ch was the number of channels and N_fc was the number of frequency components, and this matrix was provided as input to the CNN. In this study, we refer to this approach as the M-CNN method. For Dataset 1 (N_ch = 3), the input I_M-CNN is a 3 × 110 matrix whose rows are the magnitude spectra of channels O1, O2 and Oz. This approach only considers the magnitude at different frequencies; the phase information is ignored. Earlier studies have shown that phase information provides significant information in decoding SSVEP [22, 27-29]. Therefore, we propose the use of the complex spectrum features directly as input to the classifier.

Complex spectrum features
The magnitude and phase related information can both be extracted from the FFT of a signal. The input time-domain signal was transformed into the complex FFT representation using the standard FFT computation with a resolution of 0.2930 Hz. Next, the frequency components of the real part and the imaginary part along each channel were extracted between 3 Hz and 35 Hz, resulting in two vectors of length 110. These two vectors were concatenated into a single feature vector I = Re(X) || Im(X), where the first half contains the real part and the second half contains the imaginary part of the complex FFT. The resultant signals were stacked one below the other to form a matrix with dimensions N_ch × N_fc, where N_fc = 220. This approach of using the complex FFT as input to the CNN is referred to as the C-CNN method. For Dataset 1 (N_ch = 3), the input is defined as:

I_C-CNN = [Re(X_c) || Im(X_c)], c ∈ {O1, O2, Oz}, (2)

a 3 × 220 matrix in which each row concatenates the real and imaginary spectra of one channel.
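The C-CNN features differ from the M-CNN features only in keeping the real and imaginary parts instead of their magnitude; a sketch under the same assumed FFT parameters:

```python
import numpy as np

def complex_features(segment, fs, nfft=4096, n_fc=110):
    """C-CNN features: real and imaginary FFT parts concatenated per channel.

    Keeping Re and Im separately preserves the phase information that the
    magnitude spectrum discards.
    """
    X = np.fft.fft(segment, n=nfft, axis=1)
    res = fs / nfft                            # ~0.293 Hz per bin
    lo = int(round(3.0 / res))                 # first bin near 3 Hz
    band = X[:, lo:lo + n_fc]                  # 110 bins up to ~35 Hz
    # row layout: [Re(X) | Im(X)] -> (n_channels, 2 * n_fc)
    return np.concatenate([band.real, band.imag], axis=1)

feats_c = complex_features(np.random.randn(3, 1200), fs=1200)  # (3, 220)
```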

Convolutional neural network
The CNN architecture used in this study was the one proposed in our previous study [32] and was inspired by the one proposed in [21]. Figure 2 illustrates the CNN architecture used in this study. The CNN consists of four main layers: an input layer, two convolutional layers, and a fully connected output layer. The features extracted in the previous step were provided as input to the CNN. The input layer of the CNN had dimensions N_ch × N_fc. This was followed by the convolutional layer Conv_1, which was designed based on the intuition of spatial filtering. This layer performed 1D convolutions across the channel dimension (N_ch) with kernel dimensions of N_ch × 1. The objective of this layer was to learn to weigh the contribution of each channel differently. The number of feature maps in the Conv_1 layer was 2·N_ch and each feature map had dimensions 1 × N_fc. The Conv_2 layer operated on the spectral representation of the input. The kernel dimension for this layer was 1 × 10. The number of feature maps in this layer was also 2·N_ch. As a result of the convolution, the feature maps in this layer had dimensions of 1 × (N_fc − 10 + 1). Batch normalization was performed on the outputs of layers Conv_1 and Conv_2. The rectified linear unit (ReLU) was used as the activation function. Dropout was added to the network as a regularization technique to prevent overfitting. Batch normalization has been shown to reduce the internal covariate shift within input samples, resulting in the samples having zero mean and unit variance [37]. Dropout and batch normalization have been shown to improve the generalization performance and training speed of neural networks [24,37]. The output layer of the network consisted of K units, equal to the number of SSVEP classes in the input data. The output layer was equipped with the softmax function to output the probability that a given input segment belonged to a particular class.
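The layer structure above can be sketched in PyTorch; this is an illustrative reconstruction from the stated dimensions (class name, dropout rate and layer ordering are assumptions), not the authors' released code:

```python
import torch
import torch.nn as nn

class SSVEPNet(nn.Module):
    """Sketch of the 4-layer CNN: input, Conv_1, Conv_2, dense softmax output.

    Conv_1 spans all channels (spatial filtering); Conv_2 slides a 1x10
    kernel along the spectral axis. Both use 2*n_ch feature maps.
    """
    def __init__(self, n_ch=3, n_fc=220, n_classes=7, dropout=0.25):
        super().__init__()
        f = 2 * n_ch                                  # feature maps per layer
        self.features = nn.Sequential(
            nn.Conv2d(1, f, kernel_size=(n_ch, 1)),   # -> (f, 1, n_fc)
            nn.BatchNorm2d(f), nn.ReLU(), nn.Dropout(dropout),
            nn.Conv2d(f, f, kernel_size=(1, 10)),     # -> (f, 1, n_fc - 9)
            nn.BatchNorm2d(f), nn.ReLU(), nn.Dropout(dropout),
        )
        self.classifier = nn.Linear(f * (n_fc - 9), n_classes)

    def forward(self, x):                             # x: (batch, 1, n_ch, n_fc)
        z = self.features(x).flatten(1)
        return self.classifier(z)                     # logits; softmax in loss

model = SSVEPNet()
model.eval()                                          # deterministic forward pass
logits = model(torch.randn(8, 1, 3, 220))             # a batch of 8 C-CNN inputs
```

With N_ch = 3 and N_fc = 220 (C-CNN input for Dataset 1), the flattened feature size is 6 × 211 = 1266, which keeps the parameter count small.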

Training parameters
The weights of the CNN were initialized based on a Gaussian distribution ∼ N(0, 0.01). The network was trained using the backpropagation technique by minimizing the categorical cross-entropy loss function. Stochastic gradient descent with momentum was used as the optimization algorithm for training the network. A grid search was employed to find the best values for the training hyper-parameters; the search space was defined over hyper-parameters including the learning rate, and the values that led to the best average accuracy across all participants were chosen. The hyper-parameter optimization was performed separately for the four combinations of dataset (Dataset 1 and 2) and pipeline (M-CNN and C-CNN). Within each of the four combinations, the same hyper-parameters were used for all participants and window sizes.
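The grid search step amounts to an exhaustive loop over hyper-parameter combinations scored by mean accuracy across participants. A minimal sketch; the grid values and the toy scoring function below are purely hypothetical (the paper's exact search space is not reproduced here):

```python
import itertools

# Hypothetical search space -- values are placeholders, not the paper's grid.
space = {"lr": [1e-3, 1e-2, 1e-1], "dropout": [0.25, 0.5], "momentum": [0.9]}

def grid_search(evaluate, space):
    """Return the hyper-parameter combination with the best mean accuracy."""
    best, best_acc = None, -1.0
    for values in itertools.product(*space.values()):
        params = dict(zip(space.keys(), values))
        acc = evaluate(params)          # mean accuracy across all participants
        if acc > best_acc:
            best, best_acc = params, acc
    return best, best_acc

# toy evaluation: pretend accuracy peaks at lr=1e-2 with low dropout
best, acc = grid_search(lambda p: 0.9 - abs(p["lr"] - 1e-2) - p["dropout"] / 10,
                        space)
```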

User-dependent training procedure (UD)
In this method, a classifier was trained using the data of a single participant and validated on the same participant's data. To achieve this, 10-fold cross-validation was performed on each participant's dataset. First, all trials of one participant were pre-processed using different window lengths (W) and both types of features were extracted. The pre-processed trial epochs were split into ten non-overlapping parts and the CNN was trained separately for each window length on nine parts and tested on the one remaining part. This procedure was carried out for Dataset 1. For Dataset 2, a similar 10-fold cross-validation was performed on the 1 s window length of the data. No other window length was used because 1 s was the window length used in [12, 22].
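The 10-fold split can be sketched as below; splitting at the trial level (an assumption consistent with the non-overlapping parts described above) avoids leakage between overlapping windows cut from the same trial:

```python
import numpy as np

def kfold_indices(n_trials, k=10, seed=0):
    """Split one participant's trial indices into k non-overlapping folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_trials), k)

folds = kfold_indices(56, k=10)           # e.g. 56 trials per run in Dataset 1
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # train the CNN on segments from train_idx trials,
    # evaluate on segments from test_idx trials
```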

User-independent training procedure (UI)
The proposed method was evaluated in a UI training scenario for its efficacy in classifying a novel unseen user's SSVEP data, leading to a calibration-free system. In this method, a leave-one-participant-out scheme was used for training and validation of the classifier. If a given dataset contains P participants, the classifier was trained by combining the data of P-1 participants and tested on the data of the single unseen participant. This procedure was performed individually for each feature extraction method and for each window length of data. For example, the total numbers of 1 s segments in the training fold were 54880 (Dataset 1) and 6480 (Dataset 2), and in the testing fold were 2744 (Dataset 1) and 720 (Dataset 2), respectively. The parameters that resulted in the highest average accuracy across all participants were selected. The methods using this type of training with the two types of feature extraction were referred to as UI-M-CNN and UI-C-CNN, respectively.
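A minimal sketch of the leave-one-participant-out splitting (the generator name is illustrative):

```python
def lopo_splits(n_participants):
    """Leave-one-participant-out: train on P-1 users, test on the held-out one."""
    ids = list(range(n_participants))
    for test_id in ids:
        yield [p for p in ids if p != test_id], test_id

splits = list(lopo_splits(21))            # 21 participants in Dataset 1
# each element: (list of 20 training participant ids, one test participant id)
```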

Canonical correlation analysis (CCA)
CCA was performed on each segment of the EEG data. This is a multivariate statistical method used to find the underlying correlation between two sets of multidimensional variables. Prior studies have shown that CCA can produce a superior performance in detecting SSVEP responses in EEG [7,8]. It is the most widely used baseline classification method for SSVEP detection [6,12,13,27]. CCA is based on linear transformations. Consider the transformations x = X^T w_x and y = Y^T w_y, where X refers to the set of multi-channel EEG data and Y refers to a set of reference signals of the same length as X. The objective of CCA is to find projection vectors w_x and w_y that maximize the correlation between x and y by solving:

ρ = max_{w_x, w_y} E[w_x^T X Y^T w_y] / √(E[w_x^T X X^T w_x] · E[w_y^T Y Y^T w_y]). (3)

The maximum of ρ with respect to w_x and w_y is the maximum canonical correlation. The reference signals Y_n ∈ R^(2N_h × N_s) were defined as:

Y_n = [sin(2π f_n t), cos(2π f_n t), …, sin(2π N_h f_n t), cos(2π N_h f_n t)]^T, t = 1/f_s, 2/f_s, …, N_s/f_s, (4)

where f_n is the stimulation frequency, f_s is the sampling frequency, N_s is the number of samples, and N_h is the number of harmonics. In this study, N_h = 2. The canonical features ρ_f_i, where i = 1, 2, …, K, were extracted for each segment of the EEG data, and the output class C for a given sample was determined as C = argmax_i (ρ_f_i).

Filter-bank canonical correlation analysis (FBCCA)
FBCCA is a UI variant of the CCA method [6,15]. In this method, the multi-channel EEG data X was decomposed into J sub-band components (X_j, j = 1, 2, …, J) and the standard CCA was applied to each of the sub-band components separately. The correlation values between the sub-band components X_j and the predefined reference signals Y_n belonging to the nth stimulation frequency, denoted ρ_{j,n}, were calculated. A weighted sum of squares of the correlation values was calculated as the feature for SSVEP detection:

ρ_n = Σ_{j=1}^{J} w(j) · ρ_{j,n}²,

where j is the index of the sub-band. The weight corresponding to each sub-band component was defined as:

w(j) = j^(−a) + b,

where a and b are constants that maximize the classification performance. In this study, the following values were set empirically: J = 5, N_h = 2, a = 1.25 and b = 0.25. The five sub-bands, given as (low cut-off, high cut-off) in Hz, were designed as (6, 40), (10, 40), (14, 40), (20, 40) and (26, 40) for Dataset 1, and (6.5, 80), (12.5, 80), (18.5, 80), (24.5, 80) and (30.5, 80) for Dataset 2.

Task-related component analysis (TRCA)
TRCA is a UD training method used to obtain spatial filters that extract task-related source activities from multi-channel EEG data [14]. Using individual training data, TRCA extracts task-related components by maximizing their reproducibility during task periods. Consider the multi-channel EEG x(t) ∈ R^(N_c); TRCA finds a linear coefficient vector w ∈ R^(N_c) that maximizes the inter-trial correlation of its projections y(t) = w^T x(t), which is called a task-related component. The hth trial of the observed EEG is given by x^(h) ∈ R^(N_c × N_s) and the corresponding task-related component is given by y^(h) ∈ R^(N_s). The covariance C_{h1,h2} between the h1-th and h2-th trials of y^(h) is described as:

C_{h1,h2} = Cov(y^(h1), y^(h2)) = w^T Cov(x^(h1), x^(h2)) w.

All possible combinations of N_t trials are summed as:

Σ_{h1,h2 = 1, h1 ≠ h2}^{N_t} C_{h1,h2} = w^T S w, with S = Σ_{h1 ≠ h2} Cov(x^(h1), x^(h2)).

To obtain a finite solution, the variance of y(t) is constrained as:

Var(y(t)) = w^T Q w = 1,

where Q is the covariance matrix of the EEG data. The constrained optimization problem can then be solved by:

ŵ = argmax_w (w^T S w) / (w^T Q w).

The eigenvalues of the matrix Q^(−1) S indicate the task consistency among multiple trials, and the eigenvector corresponding to the largest eigenvalue was selected as the spatial filter w to extract task-related components. In this study, the following values were set empirically for the filter-bank extension: N_h = 2, a = 1.25 and b = 0.25, with the number of sub-bands set as J = 5 and the same sub-bands used in FBCCA. The performance of this method was evaluated on the segmented SSVEP data of each participant based on a 10-fold cross-validation scheme.
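The eigenvalue solution above can be sketched as follows; the function name and the synthetic trials (a shared 10 Hz component mixed differently into three channels) are illustrative assumptions:

```python
import numpy as np

def trca_filter(trials):
    """TRCA spatial filter: eigenvector of Q^-1 S with the largest eigenvalue.

    trials: (n_trials, n_ch, n_samples). Returns w of shape (n_ch,).
    """
    n_trials, n_ch, _ = trials.shape
    centered = trials - trials.mean(axis=2, keepdims=True)
    # S: summed inter-trial covariance over all pairs h1 != h2
    S = np.zeros((n_ch, n_ch))
    for h1 in range(n_trials):
        for h2 in range(n_trials):
            if h1 != h2:
                S += centered[h1] @ centered[h2].T
    # Q: covariance of the concatenated trials (variance constraint)
    X = np.concatenate(list(centered), axis=1)
    Q = X @ X.T
    vals, vecs = np.linalg.eig(np.linalg.solve(Q, S))
    return np.real(vecs[:, np.argmax(np.real(vals))])

# six trials sharing a reproducible 10 Hz component plus channel noise
rng = np.random.default_rng(0)
t = np.arange(250) / 250
common = np.sin(2 * np.pi * 10 * t)
trials = np.stack([np.outer([1.0, 0.5, -0.3], common)
                   + 0.3 * rng.standard_normal((3, 250)) for _ in range(6)])
w = trca_filter(trials)
y = w @ trials.mean(axis=0)   # projected average trial
```

Projecting the averaged trial through w should recover the reproducible component, which is the sense in which TRCA maximizes inter-trial consistency.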

Statistical analysis
Statistical analysis was performed on the results of both datasets to evaluate the performance of the different classification methods. The UD training methods and the UI training methods were compared with each other and with the baseline CCA method. Additional comparisons were performed with FBCCA and TRCA. A mixed-effect model ANOVA was used to evaluate the classification methods. The metric of interest was the overall accuracy of each method in classifying the different SSVEP targets; therefore, the response variable was the classification accuracy. For Dataset 1, the participant was a random factor, the window length (W) was a random factor with six levels (W = 0.5 s to 3 s), and the classification algorithm was a fixed factor with seven levels (CCA, FBCCA, TRCA, UD-M-CNN, UD-C-CNN, UI-M-CNN, UI-C-CNN). The null hypothesis was that the classification accuracy was the same for all classification algorithms. A 95% confidence interval was used for the comparison and analysis. The same statistical analysis was performed on both datasets, with the slight modification for Dataset 2 that window length was not a factor as it was fixed at W = 1 s.

Dataset 1

These results indicate that the proposed C-CNN pipeline outperformed the M-CNN pipeline when both methods were applied in either the UI or UD scheme. Even when the C-CNN was used in the UI scheme, it performed similarly to the M-CNN used in the UD scheme, highlighting its advantages. Subsequent analysis was performed to measure the interactions between window lengths and the different classification methods. The tests revealed that both UD CNN methods outperformed CCA (p ⩽ 0.001) and FBCCA (p < 0.001) across all window lengths. Among the UI methods, UI-C-CNN showed significant improvement over CCA for window lengths between 0.5 s and 2 s (0.001 ⩽ p < 0.022). Across all windows, UI-C-CNN was significantly better than FBCCA (p ⩽ 0.003).
Similarly, across all windows, UD-M-CNN was significantly better than UI-M-CNN (p ⩽ 0.002). There was a significant difference in accuracy at the shorter windows from 0.5 s to 1.5 s between UD-C-CNN and UI-C-CNN (0.002 ⩽ p < 0.043). The average accuracies of the different methods for the 1 s window length were: TRCA: 13.4% ± 1.5%, FBCCA: 64.8% ± 15.6%, CCA: 69.1% ± 10.8%, UI-M-CNN: 73.5% ± 16.1%, UI-C-CNN: 81.6% ± 12.3%, UD-M-CNN: 87.8% ± 7.6% and UD-C-CNN: 92.5% ± 5%. Figure 3 summarizes these results for Dataset 1. Figure 4(a) summarizes the average accuracies of all the classification methods for Dataset 2 across ten participants for the data length of 1 s. It can be inferred from the figure that the UD-C-CNN method achieves the highest accuracy of 92.33% ± 11.1%. Among the UD and UI methods, the UD CNN methods outperform the UI CNN methods, FBCCA and CCA, as expected. The TRCA method had the lowest performance among all methods and was not included in subsequent analysis. The likely explanation for the poor performance of TRCA is that the method is not suitable for asynchronously processed SSVEP data, as explained further in subsequent sections. The average accuracies of the different methods for the data length of 1 s were: UD-C-CNN: 92.33% ± 11.1%, UD-M-CNN: 82.77% ± 16.7%, UI-C-CNN: 81.6% ± 18%, UI-M-CNN: 70.5% ± 22%, FBCCA: 67.1% ± 21%, CCA: 62.7% ± 21.5% and TRCA: 40.4% ± 14%.

Dataset 2
The mixed-effect model ANOVA revealed a significant difference between all the classification methods (p < 0.001). Post hoc Bonferroni simultaneous comparison was performed to compare the different algorithms. The UD-C-CNN, UD-M-CNN and UI-C-CNN significantly outperformed CCA (p < 0.001) and FBCCA (p < 0.002). There was a significant difference between UI-M-CNN and UI-C-CNN (p = 0.037). There was no significant difference between UI-M-CNN and FBCCA (p > 0.05) or between UI-M-CNN and CCA (p = 0.398). There was no significant difference between UD-M-CNN and UD-C-CNN (p = 0.118). Further analysis was carried out to compare the UD and UI CNN methods based on each feature extraction technique: UI-M-CNN versus UD-M-CNN (p = 0.014), UI-C-CNN versus UD-C-CNN (p = 0.045), and UD-C-CNN versus UI-M-CNN (p < 0.001). The difference between UD-M-CNN and UI-C-CNN (p = 1) was not significant. Figure 4(b) illustrates the average ITR for all the methods for W = 1 s, calculated with a 0.5 s gaze shift time between selections.
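An ITR of the kind plotted in figure 4(b) is conventionally computed with the standard Wolpaw formula, where the selection time is the window length plus the gaze shift interval; a sketch (the function name and 0.5 s default are illustrative):

```python
import math

def itr_bits_per_min(n_classes, acc, window_s, gaze_shift_s=0.5):
    """Wolpaw ITR in bits/min; selection time = window + gaze shift interval."""
    if acc <= 1.0 / n_classes:
        return 0.0                         # at or below chance: zero ITR
    bits = math.log2(n_classes) + acc * math.log2(acc)
    if acc < 1.0:
        bits += (1 - acc) * math.log2((1 - acc) / (n_classes - 1))
    return bits * 60.0 / (window_s + gaze_shift_s)

# e.g. 12 classes at 92.33% accuracy with 1 s windows -> roughly 117 bits/min
rate = itr_bits_per_min(12, 0.9233, 1.0)
```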

Discussion
The results of this study clearly indicate that the UD methods perform better than the UI methods. It is interesting to note from the results of both datasets that the C-CNN methods outperformed the other methods in both UD and UI training scenarios. The UI-C-CNN method performed similarly to the UD-M-CNN method and outperformed the UI-M-CNN method. Further investigation was performed to compare the results of the UI-M-CNN with the UI-C-CNN by visualizing the feature representations learned by the CNN on both datasets. The features were visualized using the t-distributed Stochastic Neighbor Embedding (t-SNE) technique [38], a widely used method that enables high-dimensional feature spaces to be visualized in two or three dimensions [16, 22, 39]. The features from the input layer, the ReLU layer of Conv_1 and the ReLU layer of Conv_2 were extracted. The 1 s long SSVEP segments were visualized for both datasets using the magnitude and complex features for the UI method. Figures 5(a)-(f) illustrate the features at different layers of the network for the UI-M-CNN ((a)-(c)) and UI-C-CNN ((d)-(f)) methods, respectively. To achieve this, the CNN was trained on the P − 1 participants' data, the unseen test participant's data was forward propagated through the pre-trained network, and features were extracted at the output of each layer. Each data point in the figure corresponds to a 1 s segment of a single trial and is coloured according to its class label. It can be observed that the features become increasingly clustered in the deeper layers of the network for both methods. Comparing the outputs of the ReLU of the Conv_2 layers of the M-CNN (c) and C-CNN (f), it can be observed that using the complex representation of the inputs leads to better clustering and class separation. The CNN has learned seven unique classes from the training data and is able to cluster the unseen participant's data into one of the seven classes.
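The visualization step above can be sketched as follows, assuming scikit-learn's `TSNE` implementation; the activation dimensions and the random stand-in activations are placeholders for the real Conv_2 outputs of the pre-trained network:

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_layer_features(activations, perplexity=30.0, seed=0):
    """Project high-dimensional layer activations (n_segments, n_features)
    to 2-D with t-SNE for visual inspection of class clustering."""
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=seed)
    return tsne.fit_transform(activations)  # shape: (n_segments, 2)

# Hypothetical usage: stand-in activations for the ReLU of Conv_2 on the
# held-out participant's 1 s segments; in practice these come from a
# forward pass through the network trained on the other P - 1 participants.
rng = np.random.default_rng(0)
acts = rng.standard_normal((70, 128))   # 70 segments, 128-dim activations
labels = np.repeat(np.arange(7), 10)    # seven SSVEP classes, for colouring
pts = embed_layer_features(acts, perplexity=10.0)
```

Each row of `pts` is then plotted as one point, coloured by its entry in `labels`, to produce figures such as figure 5.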
There is less overlap between the classes in the C-CNN than in the M-CNN. This is also evident from the classification accuracies, where the C-CNN outperforms the M-CNN at all window lengths. Therefore, by including the real and imaginary parts of the complex FFT as input to the CNN, the network can extract significantly more discriminative features that lead to better overall separation and higher classification accuracy than the magnitude spectrum features.
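The distinction between the two input representations can be sketched as follows; the sampling rate, FFT length and frequency band are illustrative assumptions, not the study's exact preprocessing parameters:

```python
import numpy as np

def magnitude_features(segment, fs=256, nfft=512, band=(3.0, 35.0)):
    """M-CNN style input: |FFT| of each EEG channel within the SSVEP
    band. Taking the magnitude discards the phase information."""
    spec = np.fft.rfft(segment, n=nfft, axis=-1)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    keep = (freqs >= band[0]) & (freqs <= band[1])
    return np.abs(spec[:, keep])

def complex_features(segment, fs=256, nfft=512, band=(3.0, 35.0)):
    """C-CNN style input: real and imaginary parts concatenated, so both
    amplitude and phase information are retained for the CNN."""
    spec = np.fft.rfft(segment, n=nfft, axis=-1)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    keep = (freqs >= band[0]) & (freqs <= band[1])
    kept = spec[:, keep]
    return np.concatenate([kept.real, kept.imag], axis=-1)
```

The complex features carry twice as many values per channel as the magnitude features, and the magnitude can always be recovered from them, so the C-CNN input is strictly more informative.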

Dataset 2
Similar feature visualization was performed on Dataset 2. Figure 6(a) illustrates the overall feature clustering on Dataset 2 for all participants. The UI-M-CNN and UI-C-CNN methods were compared based on the t-SNE visualization. It can be clearly seen that the feature space at the ReLU of the Conv_2 layer of the C-CNN shows more distinct clusters than that of the M-CNN. This type of class separation aids in achieving better classification accuracy. A previous CNN study [22] used time domain features on this dataset and showed that within-class clusters were captured by their proposed method. In the present study, using complex spectrum features and a lighter CNN architecture, within-class differences similar to those reported in [22] were learned by the UI-C-CNN. Figure 6(b) illustrates an example of the trials belonging to the 12.25 Hz class, in which four distinct clusters can be identified. Further analysis found that these correspond to the four non-overlapping 1 s segments of the 4 s trials of the 12.25 Hz data. The clusters were colour-coded according to the segment label in figure 6(c). It can be observed that the CNN has learned a segment-level clustering, i.e. all the 1 s segments were grouped into four clusters: 0 s-1 s, 1 s-2 s, 2 s-3 s and 3 s-4 s. This separation is likely due to phase-related information extracted from the complex representation of the input, which enabled the grouping into four clusters. Such detailed discriminative information was not present with the M-CNN method. This provides evidence that the UI-C-CNN method is capable of extracting both phase- and amplitude-related features.
These clustering results are consistent with the findings reported in [22], where a radial phase plot analysis of the 1 s segments showed that the phases of the 0 s-1 s and 1 s-2 s segments were separable from those of the 2 s-3 s and 3 s-4 s segments; that is, the first and second segments were in opposite phase to the third and fourth segments of the trial. These results were also consistent across multiple classes, as can be observed in figure 6(b) of the present study. From these results, it is evident that the C-CNN method proposed in this study can significantly improve the overall SSVEP decoding performance.
The results obtained on both datasets indicate that the TRCA method had the lowest performance among all compared methods. One likely reason is that, in the current study, the SSVEP data was processed in an asynchronous manner: the training trials were segmented using a fixed window and step size, and the data was not phase-locked to the stimulus. Previous studies applying TRCA were based on synchronous SSVEP paradigms, with fixed windows of data tied precisely to the onset of the stimulus. A similar observation was reported in [22], where the combined-CCA method performed poorly for asynchronously processed SSVEP data. Therefore, future studies could investigate the application of TRCA to asynchronous SSVEP paradigms.
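The fixed-window, fixed-step segmentation described above can be sketched as follows; the 256 Hz sampling rate and 0.25 s step size are illustrative assumptions, not the study's actual parameters:

```python
import numpy as np

def sliding_segments(trial, fs=256, window_s=1.0, step_s=0.25):
    """Asynchronous-style segmentation: cut a multi-channel trial
    (channels, samples) into fixed-length windows at a fixed step,
    without phase-locking the windows to stimulus onset."""
    win = int(window_s * fs)
    step = int(step_s * fs)
    n_samples = trial.shape[-1]
    starts = range(0, n_samples - win + 1, step)
    # stack segments along a new leading axis: (n_segments, channels, win)
    return np.stack([trial[..., s:s + win] for s in starts])
```

For example, a 4 s, 8-channel trial at 256 Hz with a 1 s window and 0.25 s step yields 13 overlapping segments; because the segments slide freely over the trial, their phase relative to stimulus onset varies, which is the property that disadvantages phase-sensitive methods such as TRCA.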
Further analysis was performed to compare the results achieved on Dataset 2 with those reported in the literature. The methods presented in this study were CCA, FBCCA, TRCA, UD-M-CNN, UD-C-CNN, UI-M-CNN and UI-C-CNN. A recent review reported that a vast majority of published studies on deep learning for EEG-based BCIs either did not compare the proposed techniques to state-of-the-art methods or performed biased comparisons [40]. In the current study, we have attempted to provide an unbiased comparison of our methods with other techniques proposed in the literature. We therefore compared our methods with two UD and two UI methods as identified in [6]. The selected methods came from published studies that tested on Dataset 2 and reported results for the 1 s data length; we directly compared the accuracies reported in those studies with the methods evaluated in this paper. Among the UD methods, the combination method [12] and the individual template based CCA (IT-CCA) [13] were selected. Among the UI methods, the Compact-CNN [22] and the combined-tCCA [16] methods were selected. Figure 7 compares the classification accuracies of the calibration-free CCA, FBCCA and TRCA, and the UD and UI CNN training methods, with the accuracies reported in these previously published studies [12, 13, 16, 22]. The overall results show that the UD methods achieve higher accuracies than the UI methods and CCA. Among the UD methods, the proposed UD-C-CNN (92.33% ± 11.1%) outperforms CCA (62.7% ± 21.5%), UD-M-CNN (82.77% ± 16.7%), IT-CCA (81.17% ± 18.84%) and TRCA (40.4% ± 14%), and is similar in performance to the combination method (92.78% ± 10.22%). Among the UI methods, the proposed UI-C-CNN (81.6% ± 18%) achieves the highest performance compared to CCA (62.7% ± 21.5%), UI-M-CNN (70.5% ± 22%), FBCCA (67.1% ± 21%), Compact-CNN (79% ± 15%) and combined-tCCA (75% ± 24%).
Only those studies that used Dataset 2 for benchmarking were compared here. Pooled transfer-based methods were used in this comparison; adaptive learning methods, such as the adaptive combined-CCA [16], were not included.

Conclusions and future work
In this study, we investigated CNNs for SSVEP classification under both the UI and UD training schemes. We introduced a method to extract complex spectrum-based features from the SSVEP, which were provided as input to a CNN for classification. The classifier was evaluated in both UD and UI training schemes, and the proposed method was compared with the magnitude spectrum features (M-CNN) and CCA. The results indicated that the proposed C-CNN outperformed both M-CNN and CCA across all processing window lengths in both UI and UD training scenarios. The UD training methods consistently achieved higher classification accuracies than the UI methods, as one would expect, and the UD-C-CNN method ranked highest among the compared methods. Within the UI methods, the UI-C-CNN achieved the highest performance, and its performance was similar to that of the UD-M-CNN. Visualizing the features extracted by the UI-C-CNN method indicated that the method likely learned phase-related information from the SSVEP data. The comparisons performed on a publicly available twelve-class SSVEP dataset showed that the findings were consistent with those reported in the literature: the UI-C-CNN method achieved the highest accuracy among the tested UI methods on the public dataset, and the UD-C-CNN performed similarly to the combination method, which was the best SSVEP decoder in [12].
A comparative study was required to inform whether the cost of training should be borne by the user (in the case of UD training) or by the developer of the BCI (in the case of UI training) [6]; we have addressed some of these points in this study. There is a trade-off between achieving high classification accuracy and the cost of collecting training data. If the performance of the system is the higher priority, UD methods offer the best accuracy compared to UI and training-free methods. However, this requires each user to undergo a calibration session, which could lead to issues such as poor user compliance. On the other hand, if the developers are willing to collect training data from multiple participants, then the UI-C-CNN method proposed in this study offers a good balance between performance and the cost of training data. Transfer learning-based methods have the potential to combine the high accuracy of the UD methods with the calibration-free convenience of the UI methods. A future study could build a model from multiple participants' data and fine-tune it with minimal calibration data collected from the unseen user; with such pre-trained models, online adaptation strategies could also be employed to improve the overall performance of the BCI. The number of participants required to build a sufficiently accurate UI model should likewise be explored; in the current study, we evaluated the methods on two datasets consisting of 21 and ten participants, respectively. The methods presented in this study were evaluated in an offline manner, so future studies can explore online performance in an asynchronous SSVEP-BCI setting.
In conclusion, the proposed C-CNN based methods are suitable candidates for SSVEP-based BCIs and provide improved performance in both UD and UI training scenarios.