A Dynamic Window Method Based on Reinforcement Learning for SSVEP Recognition

Steady-state visual evoked potential (SSVEP) is one of the most used brain-computer interface (BCI) paradigms. Conventional methods analyze SSVEPs at a fixed window length. Compared with these methods, dynamic window methods can achieve a higher information transfer rate (ITR) by selecting an appropriate window length. These methods dynamically evaluate the credibility of the result by linear discriminant analysis (LDA) or Bayesian estimation and extend the window length until credible results are obtained. However, the hypotheses introduced by LDA and Bayesian estimation may not align with the collected real-world SSVEPs, which leads to an inappropriate window length. To address the issue, we propose a novel dynamic window method based on reinforcement learning (RL). The proposed method optimizes the decision of whether to extend the window length based on the impact of decisions on the ITR, without additional hypotheses. The decision model can automatically learn a strategy that maximizes the ITR through trial and error. In addition, compared with traditional methods that manually extract features, the proposed method uses neural networks to automatically extract features for the dynamic selection of window length. Therefore, the proposed method can more accurately decide whether to extend the window length and select an appropriate window length. To verify the performance, we compared the novel method with other dynamic window methods on two public SSVEP datasets. The experimental results demonstrate that the novel method achieves the highest performance by using RL.


Weizhi Zhou, Le Wu, Member, IEEE, Yikai Gao, and Xun Chen, Senior Member, IEEE

I. INTRODUCTION
BRAIN-COMPUTER interfaces (BCIs) enable humans to interact with external machines by decoding brain signals.

The authors are with the Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027, China (e-mail: zhouweizhi@mail.ustc.edu.cn; lewu@ustc.edu.cn; ykgao@mail.ustc.edu.cn; xunchen@ustc.edu.cn).
ITR has been commonly used to evaluate the performance of BCIs [17], and it represents the effective information transmitted per unit time. ITR is positively correlated with the number of detection targets and the detection accuracy, and negatively correlated with the detection window length. Therefore, there are three ways to improve the performance: (1) increasing the number of detection targets, (2) improving the detection accuracy, and (3) decreasing the detection window length.
A number of studies use mixed frequency, phase, or sequential coding methods to increase the number of detection targets. Nakanishi et al. implemented a 32-target SSVEP speller by mixing 8 frequencies and 4 phases [13]. Recently, Chen et al. implemented an SSVEP speller that can achieve 160 classifications while guaranteeing an average accuracy of around 0.9, using frequency sequential coding [18]. Although these works have reported promising performances, their effectiveness remains constrained by the detection accuracy of the recognition methods.
Another approach is to improve the detection accuracy to increase the ITR. A variety of methods have applied spatial filtering techniques that enhance task-related components and suppress task-unrelated components [19]. Canonical correlation analysis (CCA) designs spatial filters to calculate the correlation coefficients between EEG and sine-cosine reference signals at the detection frequencies [20]. The frequency corresponding to the maximum coefficient is taken as the classification result. Chen et al. proposed filter bank CCA (FBCCA), which enhances CCA-based detection by incorporating components from different frequency bands [21]. Due to their robustness and ease of implementation, CCA and FBCCA have been popular in SSVEP-based BCIs [21], [22], [23], [24], [25], [26], [27]. However, their detection accuracy is limited because they do not utilize subject-dependent information. To further improve the detection accuracy, a number of studies utilize subject-dependent training data to calibrate SSVEP detection. Nakanishi et al. proposed ensemble task-related component analysis (ensemble TRCA), which extracts subject-dependent components and achieved an ITR of 325.3 bits/min [28]. Liu et al. proposed task-discriminant component analysis (TDCA), which learns the spatial filter from SSVEP signals with various frequencies and utilizes temporal information [29]. TDCA and ensemble TRCA are representative, high-performance recognition methods in SSVEP.
Although the above methods can significantly improve the ITR, they lack optimization of the window length. Traditional methods analyze the SSVEPs of all trials at a fixed window length. SSVEP is easily influenced by the environment and the subject, and the optimal window length differs across trials. The window length of traditional methods cannot be adaptively adjusted for different trials. Therefore, dynamic window methods were proposed to select an appropriate window length for different SSVEPs. These methods extend the window length until credible results are obtained. One dynamic window method proposed in [30] builds a linear discriminant analysis (LDA) classifier to predict whether the current classification result is correct. When detecting SSVEP, this method extends the window length until the result is predicted to be classified correctly. Another class of dynamic window methods uses Bayesian estimation to estimate the credibility of the detection result and extends the window length until the credibility is higher than a predefined threshold [30], [31], [32], [33].
Although the existing dynamic window methods have improved the ITR, the hypotheses introduced by LDA and Bayesian estimation may not align with the collected real-world SSVEPs. For example, the linear projection in LDA ignores nonlinear features, and the risk function in Bayesian estimation is not specifically designed for SSVEP signals. To solve this issue, we propose a dynamic window method based on reinforcement learning (RL). RL is a powerful technique for optimizing strategies [34]. The proposed method optimizes the policy for adjusting the window length without additional hypotheses. The decision model can automatically learn a strategy that maximizes the ITR through trial and error. Additionally, we utilize neural networks that can extract more effective features for the dynamic selection of window length. Since the deep Q-network (DQN) is a widely recognized RL method [35], we use DQN to verify the improvement that RL brings to dynamic window methods.

II. METHODS
We use DQN to implement an SSVEP detection method that can adaptively adjust the window length. DQN is a representative RL method and can learn effective policies without additional hypotheses. By using DQN, the proposed method can select a more appropriate window length, based on the features of the SSVEP signals, to achieve higher SSVEP performance. In the following subsections, we introduce the model establishment, the training process, and the testing process of the proposed dynamic window method.

A. Model Establishment
DQN enables intelligent agents to learn a policy for taking actions in an environment to achieve a specific goal through a feedback mechanism. The agent first observes the state of the environment. Then, DQN takes the state as input and outputs the value of each action. Finally, the agent executes the action with the highest value. Actions change the environment and generate rewards that serve as feedback from the environment. Rewards guide DQN to optimize the model parameters to achieve the goal. Dynamic window methods decide whether to extend the window length based on the features of SSVEPs. By taking the features of SSVEPs as the state and SSVEP performance as the goal, a dynamic window method based on DQN can learn how to control the window length. Next, we introduce the "goal", "action", "state", "reward", and "value" of our proposed method in detail.
The "goal" of dynamic window methods is to optimize the performance of SSVEP by choosing an appropriate window length.As a communication interface, SSVEP-BCI needs to take into account both transmission speed and accuracy.The transmission speed is inversely related to the window length.Therefore, the dynamic window method is optimal when the window length is adjusted to the minimum value that ensures correct classification.
The dynamic window method uses the "action" a to adjust the window length. There are two distinct actions: (1) a_1, i.e., extending the window length, and (2) a_2, i.e., outputting the result. Specifically, when a_2 is executed, the window length stops extending and the current SSVEP trial is terminated. The ideal strategy is to extend the window length when the SSVEP is misclassified and promptly output the result when the SSVEP is correctly classified.
The "state" s is the foundation for judging whether to extend the window length and needs to indicate the credibility of the SSVEP classification.We use a recognition method to classify SSVEP and obtain the state of the SSVEP signal.Recognition methods calculate the features of each target from the SSVEP signal and take the target corresponding to the maximum feature as the classification result.Several studies show that if the difference between the maximum feature and the remaining features is greater, the accuracy of the recognition result is higher [26], [36].Thus, existing dynamic window methods estimate the credibility of the recognition result based on these features [30], [31], [32], [33].We follow their ideas and take features ρ = [ρ 1 , ρ 2 , . . ., ρ N t ] ∈ R 1×N t extracted by TDCA or ensemble TRCA to generate the state, where N t denotes the number of detection targets.Referring to these works, we also normalize and sort ρ: When outputting the result, the proposed method will output the target corresponding to ρ max .Additionally, the window length T can provide effective information for dynamic window methods [30].We concatenate ρ * and T as the state The "reward" r is the immediate feedback of an action in a state and can guide the dynamic window method to optimize SSVEP performance.During the trial and error process, the DQN tries different actions and these actions result in different rewards.Positive rewards are used to encourage actions that improve SSVEP performance, and negative rewards are used to punish actions that diminish SSVEP performance.Therefore, we manually set three types of rewards: (1) r 1 (r 1 < 0), which denotes the reward of extending the window length; (2) r 2 (r 2 < 0), which denotes the reward of error output; (3) r 3 (r 3 > 0), which denotes the reward of correct output.The correctness of the output result is determined by the SSVEP recognition method.r 1 punishes extending the window and guides the proposed method to increase the transmission speed.r 2 punishes outputting incorrect results, and r 3 encourages outputting correct results.r 2 and r 3 guide the proposed method to improve the recognition accuracy.
The value Q represents the expected cumulative reward of performing an action in a certain state. Since actions affect subsequent states and results in the long term, DQN uses Q to evaluate the quality of an action. The optimal cumulative reward Q* is defined as

$$Q^*(s_t, a_t) = \max_{\pi} \mathbb{E}\!\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i} \,\middle|\, s_t, a_t, \pi\right], \tag{3}$$

where γ is the discount factor and r_{t+i} is the reward under the state s_{t+i} and the action a_{t+i} at time t + i (i = 0, 1, ...). Q* assumes that all actions performed after t are optimal; thus, future rewards are taken at their maximum. Q* can also be written in the Bellman form

$$Q^*(s_t, a_t) = \mathbb{E}\!\left[r_t + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \,\middle|\, s_t, a_t\right]. \tag{4}$$

Fig. 1 shows the process of DQN estimating Q*. In the DQN, the input is s_t, and the two outputs are the Q* of extending the window length and the Q* of outputting the result at s_t.
The proposed method performs the action with the higher Q*. We set up 6 hidden layers for the neural network, with sizes of 1024, 256, 128, 64, 32, and 16, respectively.
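A minimal PyTorch sketch of this Q-network follows. The input dimension matches the state s = [ρ*, T] (N_t features plus the window length), and the two outputs correspond to a_1 and a_2; the ReLU activation is an assumption, as the text does not name the activation function.

```python
# A sketch of the Q-network with the hidden sizes stated above.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, n_targets: int):
        super().__init__()
        sizes = [n_targets + 1, 1024, 256, 128, 64, 32, 16]
        layers = []
        for i in range(len(sizes) - 1):
            layers += [nn.Linear(sizes[i], sizes[i + 1]), nn.ReLU()]
        layers.append(nn.Linear(sizes[-1], 2))  # Q*(s, a1), Q*(s, a2)
        self.net = nn.Sequential(*layers)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)
```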

B. Training Process
During the training process, DQN optimizes the estimate of Q* based on states, actions, and rewards. At the i-th iteration, suppose the network parameters of DQN are θ_i. To update the network, the loss function L_i(θ_i) at each iteration is set as

$$L_i(\theta_i) = \mathbb{E}\!\left[\left(r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta_i^-) - Q(s_t, a_t; \theta_i)\right)^2\right]. \tag{5}$$

DQN also uses a separate target network with parameters θ_i^- to calculate the target optimal value Q_target. θ_i is updated at every iteration, while θ_i^- is only periodically synchronized with θ_i every C (C ≫ 1) iterations and stays fixed at other iterations. The relative stability of θ_i^- improves the learning efficiency of the network. By calculating the loss function and using stochastic gradient descent, DQN gradually updates the weights of the neural network to optimize the estimate of Q*(s_t, a_t).
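The following is a hedged sketch of one update step per equation (5), assuming a mean-squared-error loss over a sampled batch and a `done` flag that stops bootstrapping when a_2 terminates a trial; both are standard DQN details not spelled out in the text.

```python
# One DQN gradient step on a batch of experiences (s, a, r, s_next, done).
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    s, a, r, s_next, done = batch  # `a` is a LongTensor of action indices
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s_t, a_t; theta_i)
    with torch.no_grad():
        # The target uses the periodically synchronized parameters theta_i^-.
        q_next = target_net(s_next).max(dim=1).values
        q_target = r + gamma * q_next * (1.0 - done)  # no future reward after a2
    loss = F.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```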
Fig. 2 shows the training process of the dynamic window method based on DQN. We first generate the data required for training and then use these data to train DQN. Equation (5) indicates that the calculation of the loss depends on s_t, a_t, r_t, and s_{t+1}. In the proposed method, s_{t+1} represents the state extracted from a window extended beyond the previous one. Therefore, training DQN requires quadruples (s_t, a_t, r_t, s_{t+1}), and these quadruples are called experiences. For each SSVEP trial in the training set, we perform the same operation on the SSVEP signals under different window lengths. We first use a recognition method to obtain the state and the prediction label, where the prediction label is the target corresponding to the largest feature. Next, we consider both candidate actions for SSVEP detection: extending the window length and outputting the result. Action a_1 produces a feedback of r_1. As the window length increases, SSVEP signals are usually recognized incorrectly at first and then correctly. When a_2 is executed, an incorrect recognition produces r_2, and a correct recognition produces r_3. By combining the states, actions, and rewards at the current window length with the states at the next window length, we generate the experiences. Finally, we randomly select a certain number of experiences at each iteration to train DQN.
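A sketch of experience generation for one training trial is shown below, under the reward scheme described above. The `recognize` helper (returning the features and the predicted label for a given window length) is hypothetical, standing in for ensemble TRCA or TDCA, and actions are coded 0 for a_1 and 1 for a_2.

```python
# Generate experiences (s_t, a_t, r_t, s_t+1) from one labeled SSVEP trial.
def generate_experiences(trial, true_label, window_lengths,
                         r1=-0.04, r2=-1.0, r3=1.0):
    states, labels = [], []
    for T in window_lengths:
        rho, pred = recognize(trial, T)  # hypothetical recognition call
        states.append(build_state(rho, T))
        labels.append(pred)
    experiences = []
    for t in range(len(window_lengths)):
        # Action a2 (output): reward depends on correctness; trial terminates.
        r = r3 if labels[t] == true_label else r2
        experiences.append((states[t], 1, r, states[t], True))
        # Action a1 (extend): reward r1, transition to the next (longer)
        # window; not available at the maximum window length.
        if t + 1 < len(window_lengths):
            experiences.append((states[t], 0, r1, states[t + 1], False))
    return experiences
```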

C. Testing Process
Fig. 3 shows the testing process of the dynamic window method based on DQN. The window length starts at a minimum value. We first use a recognition method to obtain the state from the SSVEP signals and predict the result. Then, based on the state, DQN predicts the Q* of the different actions. Finally, the proposed method takes the action corresponding to the higher Q*. If the proposed method decides to output the result, it outputs the prediction and terminates the detection. If the proposed method decides to extend the window length, it waits for more SSVEP data and then reassesses the decision under the longer window. Besides, the proposed method will output the result if the window length reaches its maximum value.
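A minimal sketch of this testing loop follows. The `acquire` and `recognize` helpers (returning the SSVEP segment at a given window length, and the features with the predicted target) are hypothetical; the action coding follows the training sketch.

```python
# Dynamic window detection for a single trial.
import torch

def detect(q_net, acquire, recognize, window_lengths):
    for i, T in enumerate(window_lengths):
        segment = acquire(T)               # wait for T seconds of SSVEP data
        rho, pred = recognize(segment)
        state = torch.as_tensor(build_state(rho, T), dtype=torch.float32)
        with torch.no_grad():
            q = q_net(state.unsqueeze(0)).squeeze(0)
        # Output the result if a2 has the higher Q*, or at the maximum window.
        if q[1] >= q[0] or i == len(window_lengths) - 1:
            return pred, T
```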

III. EXPERIMENTS
In this section, we conducted synchronous SSVEP experiments on two public datasets with four methods: the fixed window method (FW), the dynamic window method in FBCCA with dynamic window (FBCCA-DW) [32], the dynamic stopping method (DS) [30], and the dynamic window method based on DQN (DW-DQN). Besides, we abbreviate the dynamic window method in FBCCA-DW as DW.

A. Parameter Setting and Comparison Algorithms

1) Two Recognition Methods and Data Pre-Processing:
Ensemble TRCA [28] and TDCA [29] are representative subject-specific methods for SSVEP detection. When testing SSVEP signals, we use the rest of the SSVEP signals from the subject to construct a classification model. Besides, the window length is set from 0.08 s to 1.0 s with an interval of 0.04 s. SSVEP signals were filtered by a bandpass filter. The code of the two classification methods and the data pre-processing refers to [37].
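For completeness, a hedged pre-processing sketch using SciPy is given below. The passband (6-90 Hz, a common choice in SSVEP analysis) and the filter order are assumptions, since the text does not state them; only the 250 Hz sampling rate is taken from the dataset descriptions.

```python
# Zero-phase bandpass filtering of EEG along the time axis (assumed band).
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(eeg: np.ndarray, fs: float = 250.0,
             low: float = 6.0, high: float = 90.0, order: int = 4) -> np.ndarray:
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="bandpass")
    return filtfilt(b, a, eeg, axis=-1)
```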
2) Fixed Window Method: Most methods, such as ensemble TRCA and TDCA, take fixed-length SSVEPs for detection. In practice, these methods apply the window length that maximizes the ITR across all trials. However, the window length cannot be dynamically adjusted for different SSVEP trials. We tested the performance of the fixed window methods under different window lengths.
3) Dynamic Window Method in FBCCA-DW: FBCCA-DW takes FBCCA to extract features of the detection targets and uses Bayesian estimation to evaluate the credibility of the classification result based on the features. FBCCA-DW analyzes SSVEP data of the appropriate length when the credibility is higher than a threshold [32]. Features obtained from ensemble TRCA or TDCA can also be used to evaluate the credibility. Thus, ensemble TRCA with dynamic window (ensemble TRCA-DW) and TDCA with dynamic window (TDCA-DW) were proposed. The threshold is a hyperparameter that requires manual tuning and influences the average window length and the performance of SSVEP detection. To maximize the performance of this dynamic window method, the threshold is set to its optimal value in our experiments.
4) Dynamic Stopping Method: Jiang et al. proposed two dynamic stopping methods [30]. Both methods make use of the distribution of correct and incorrect classifications. The Bayes-based DS method takes Bayesian estimation to calculate the posterior probability and outputs the result once the posterior probability is higher than a threshold. The discriminant-based DS method trains an LDA classifier to predict the result and outputs the result once the prediction is deemed correct. Besides, the performance of ensemble TRCA with the two DS methods was shown in [30].

5) Dynamic Window Method Based on DQN:
We propose a dynamic window method based on DQN. The proposed method directly determines whether to extend the window length based on the features of the SSVEP. We set r_1 = −0.04, r_2 = −1, and r_3 = 1. Besides, the proposed method takes a fully connected network to predict the action. γ is set to 0.99. We call ensemble TRCA with the proposed dynamic window method ensemble TRCA-DW-DQN and TDCA with the proposed dynamic window method TDCA-DW-DQN.

B. Two SSVEP Datasets
We used the TH dataset [38] and the BETA dataset [39] to validate the performance. On the TH dataset, 35 subjects participated in 6 blocks of an SSVEP speller. On the BETA dataset, 70 subjects participated in 4 blocks of an SSVEP speller. In each block, both datasets contain 40 visual flicker stimuli, encoded with different frequencies (8 Hz to 15.8 Hz with an interval of 0.2 Hz). Both datasets recorded 64-channel EEG data at a sampling frequency of 250 Hz. There are 0.5 s of EEG data before stimulation and 0.5 s of EEG data after stimulation in both SSVEP datasets. The average visual latency is 0.14 s on the TH dataset and 0.13 s on the BETA dataset. Besides, the duration of visual flicker stimulation is 5 s on the TH dataset and 2 s (for S1 to S15) or 3 s (for S16 to S70) on the BETA dataset.
We conducted cross-subject experiments for performance validation on the two datasets, respectively. For DW-DQN, the EEG data from one subject were utilized for testing, while the EEG data from the remaining subjects were used for training. The EEG of nine channels over the occipital cortex in the two datasets was used in this experiment. Since the EEG data corresponding to the visual latency and the EEG data without stimulation do not contain detection information, these data were not incorporated into the detection.

C. Performance Evaluation
The accuracy P is an important indicator for evaluating the performance of recognition methods. P is defined as the number of correct detections divided by the total number of detections. Besides, the ITR is also widely used to evaluate the performance of SSVEP-based BCIs and is defined as

$$\mathrm{ITR} = \frac{60}{T}\left[\log_2 N + P \log_2 P + (1 - P)\log_2\frac{1-P}{N-1}\right], \tag{6}$$

where P, T, and N denote the detection accuracy, the window length (in seconds), and the number of targets, respectively. Calculating the ITR takes into account 0.5 s of gaze shifting. Since the average window length differs across dynamic window methods, the accuracy is difficult to compare at the same window length; the ITR is therefore more commonly used to compare the performance of different dynamic window methods.
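As a worked example of the formula above, the following sketch computes the ITR with the 0.5 s gaze-shifting time added to the window length, as described in the text.

```python
# ITR in bits/min for accuracy P, window length T (s), and N targets.
import math

def itr_bits_per_min(P: float, T: float, N: int, gaze_shift: float = 0.5) -> float:
    T_total = T + gaze_shift
    bits = math.log2(N)
    if 0 < P < 1:  # the P*log2(P) terms vanish at P = 1
        bits += P * math.log2(P) + (1 - P) * math.log2((1 - P) / (N - 1))
    return 60.0 / T_total * bits

print(round(itr_bits_per_min(0.9, 1.0, 40), 1))  # 40 targets, P=0.9, T=1.0 s -> 173.0
```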

IV. RESULTS

A. DW-DQN vs FW and DW
We compared the performance of FW, DW, and DW-DQN on the two public SSVEP datasets. Fig. 4 shows the performance of ensemble TRCA-FW, TDCA-FW, ensemble TRCA-DW, TDCA-DW, ensemble TRCA-DW-DQN, and TDCA-DW-DQN on the two datasets. For both the ensemble TRCA-based methods and the TDCA-based methods, the proposed dynamic window method significantly outperformed FW and DW (p < 0.001). In particular, the ITR of TDCA-DW-DQN was the highest among these methods (p < 0.001). Besides, Table I lists the ITR and its corresponding window length for the different methods on the two datasets. The ITR of ensemble TRCA-DW-DQN was 296.3 bits/min at 0.962 s on the TH dataset and 194.7 bits/min at 1.099 s on the BETA dataset. The ITR of TDCA-DW-DQN was 320.3 bits/min at 0.912 s on the TH dataset and 223.0 bits/min at 1.060 s on the BETA dataset. Compared with FW, DW-DQN improved the ITR by at least 25.0% on the two datasets; compared with DW, by at least 8.8%. On the other hand, on the TH dataset, 33 subjects performed better with ensemble TRCA-DW-DQN than with ensemble TRCA-DW and ensemble TRCA-FW, and all subjects performed better with TDCA-DW-DQN than with TDCA-DW and TDCA-FW. On the BETA dataset, 65 subjects performed better with ensemble TRCA-DW-DQN than with ensemble TRCA-DW and ensemble TRCA-FW, and 67 subjects performed better with TDCA-DW-DQN than with TDCA-DW and TDCA-FW.

B. DW-DQN vs DS
Since Jiang et al. had validated the performance of ensemble TRCA-based discriminant DS and ensemble TRCA-based Bayes DS on the TH dataset, we adopted the results reported in [30]. Table II illustrates the comparison of DW-DQN and DS, where the gaze shifting time was 0.55 s and only the sixth block was used for testing [30]. The ITR of ensemble TRCA-DW-DQN was 284.9 bits/min, while the ITR of ensemble TRCA-based discriminant DS was 222.4 bits/min and that of ensemble TRCA-based Bayes DS was 230.2 bits/min. These results suggest that DW-DQN outperformed DS by a clear margin.

Fig. 4. The performance of ensemble TRCA-FW, ensemble TRCA-DW, ensemble TRCA-DW-DQN, TDCA-FW, TDCA-DW, and TDCA-DW-DQN on the two datasets. The window length, accuracy, and ITR are averaged across all subjects.

TABLE II: COMPARISON OF DW-DQN AND DS ON THE TH DATASET

V. DISCUSSIONS
The results in Section IV suggest that DW-DQN outperformed FW, DW, and DS. It is worth discussing why DW-DQN performed better than the other methods.

A. Performance Comparison
FW analyzes SSVEPs only at a fixed window length, while DW-DQN selects an appropriate window length for different SSVEP signals. Thus, DW-DQN outperformed FW. Fig. 5 shows the increasing rate of the DW-DQN methods compared with the FW methods. First, the increasing rate of all subjects is greater than 0, which means that DW-DQN can improve the performance of all subjects. Additionally, the increasing rate of DW-DQN was negatively related to the ITR of FW (p < 0.01). In particular, DW-DQN improved the ITR severalfold for subjects whose ITR with the FW methods was lower than 40 bits/min. This result demonstrates that DW-DQN brings larger gains for subjects with poor performance.
The results also show that DW-DQN significantly outperformed DW and DS. Compared with DW and DS, DW-DQN can make more reasonable decisions. On the one hand, DW-DQN considers the longer-term impact of decisions on outcomes: DQN optimizes the policy according to long-term rewards, as shown in equation (3), while the other dynamic window methods make decisions based only on the distribution of the current features. On the other hand, the model of DW-DQN is more complex and has more parameters with which to model the dynamic window strategy. A dynamic window method with more parameters has a stronger ability to learn to adjust the window length.

B. Contribution of the Long-Term Rewards
For the DW-DQN methods, we compare the ITR when considering long-term rewards with the ITR when considering short-term rewards. In equation (3), γ ∈ [0, 1] adjusts the influence of the future reward. A smaller γ means that the proportion of future rewards in the cumulative reward is smaller. In particular, if γ = 0, DW-DQN considers the current reward only. To better understand the contribution of the future reward, we experimented with DW-DQN under different γ. In Fig. 6, the ITR of these methods increased with γ on the two datasets. In other words, methods that weigh future rewards more heavily achieve a higher ITR. This result demonstrates that considering the long-term reward substantially boosts the ITR.

C. Necessity of Sorting
We sort the features extracted from the SSVEP. To verify the necessity of sorting, we compared DW-DQN and unsorted DW-DQN on the two datasets. Fig. 7 shows the performance of the DW-DQN methods and the unsorted DW-DQN methods. DW-DQN significantly outperformed unsorted DW-DQN (p < 0.001). According to Fig. 7, the unsorted DW-DQN methods learn an unreasonable value for each action from the unsorted frequency features and thus detect SSVEP at an unreasonable window length. In order to compare the changes before and after sorting, we plot the t-distributed stochastic neighbor embedding (t-SNE) projections of the unsorted features and the sorted features in Fig. 8. The t-SNE method projects the features down to two dimensions and visualizes the distribution of features among trials from different targets. DW-DQN needs to separate SSVEP trials that should output the result from SSVEP trials that should extend the window length. Before sorting, the maximum frequency features of each target are clustered in the dimension corresponding to that target, so the features of different targets are difficult to classify simultaneously. After sorting, the maximum frequency features of different targets are clustered in the same dimension, i.e., the first dimension, and the desired decision boundaries of different targets are more similar. Thus, the classifier for sorted features is easier to learn. Besides, due to the similar distribution of sorted features across targets, DW-DQN can learn across targets. Therefore, sorting can significantly improve the performance of DW-DQN.

D. The Effect of Rewards
DW-DQN learns a dynamic window strategy to maximize the cumulative reward. Therefore, the rewards (r_1, r_2, and r_3) are important hyperparameters that affect the performance of DW-DQN. r_1 is the reward for extending the window; its value of −0.04 reflects the loss of 0.04 s of detection time. r_2 and r_3 represent the rewards for an incorrect output and a correct output, respectively. To simplify hyperparameter adjustment, we set r_2 = −r_3. We kept r_1 unchanged and verified the performance of DW-DQN under different r_2 and r_3, where r_2 = −r_3. The results in Table III show that the ITR of DW-DQN first increases and then decreases as r_3 increases, reaching its peak when r_3 = 1. To maximize the cumulative reward, DW-DQN becomes more inclined to extend the window length to improve the accuracy when r_3 is larger. Therefore, both the accuracy and the window length increase as r_3 increases. Several SSVEP-based BCIs also consider other performance metrics, such as individual user preferences and the practical bit rate [40]. When the design goals of an SSVEP system change, DW-DQN can adapt by adjusting the rewards. For example, our proposed method can increase r_3 to meet a design requiring higher accuracy.

E. The DQN Size
We explore the impact of the DQN size on the ITR in Table IV. The table shows that, within a suitable size range, the proposed method achieves excellent results. The network size set in Section II achieves the highest average ITR across the different recognition methods and datasets. In addition, the complexity of the DQN does not hinder practical use. We used an NVIDIA GeForce RTX 3080 GPU for the SSVEP experiments. Our dynamic window method has a computational time of approximately 0.03 s on a single trial, while the window length (excluding the 0.5 s of gaze shifting) of a single trial is approximately 0.5 s. Since the calculation time of the dynamic window method is much less than the detection window length, the method can work in practice.

F. Future Work
This is the first time that RL has been applied to dynamic window methods and to the detection of SSVEP. RL can better select the window length for SSVEP detection, which shows its potential for more widespread application in SSVEP. In the future, we will further explore the structure and size of dynamic window methods based on RL. More advanced RL methods also deserve to be applied: DQN is an important milestone in RL, and a large number of related works have recently been proposed [41], [42], [43], [44], [45], improving performance by adjusting the loss function or the structure of the neural network. Additionally, although sorting can significantly improve the performance, the sorted features lose their frequency information. A novel method that can analyze unsorted frequency features is needed.
RL can also be applied to asynchronous BCIs, where subjects can control the timing of operation. In asynchronous SSVEP-based BCIs, there are two states: the idle state and the control state [46]. The control state indicates that users watch a flickering stimulus to give commands, and the idle state indicates that subjects need not give commands and do not look at the flickers. In the control state, the reward for outputting the result is higher than the reward for extending the window length; in the idle state, the opposite holds. Thus, dynamic window methods can also distinguish the idle state from the control state. Yang et al. implemented an asynchronous dynamical BCI by using a spatio-temporal equalization multi-window method [47]. If RL is applied to asynchronous SSVEP-based BCIs, it will greatly facilitate bringing SSVEP-based BCIs to practical application.

VI. CONCLUSION
We used DQN to implement a dynamic window method that can select a proper window length based on the frequency features of SSVEPs. The novel method addresses the issue that the hypotheses introduced by the existing dynamic window methods may not align with real-world SSVEPs. The results demonstrate that DW-DQN outperformed FW, DW, and DS on the two public SSVEP datasets. Specifically, TDCA-DW-DQN achieved the best performance. This is the first time that RL has been introduced to dynamic window methods and the detection of SSVEP. Furthermore, RL has broad application potential in the field of asynchronous dynamical BCIs.

Manuscript received 17 July 2023; revised 4 January 2024 and 8 April 2024; accepted 28 May 2024. Date of publication 3 June 2024; date of current version 7 June 2024. This work was supported in part by the National Key Research and Development Program of China under Grant 2023YFC3603600, in part by the National Natural Science Foundation of China under Grant 32271431, and in part by the Fundamental Research Funds for the Central Universities under Grant KY2100000123. (Corresponding author: Xun Chen.)

Fig. 1. The model of DQN in the dynamic window method.

Fig. 2. The training process of the dynamic window method based on DQN. L_T denotes the true label of the SSVEP signal, and L_E denotes the label of an erroneous prediction. s_i (i = 1, 2, ...) denotes the state under the i-th window length. Different actions generate different rewards. a_1 produces r_1. The left side of the blue dashed line corresponds to incorrect classification, with prediction label L_E; the right side corresponds to correct classification, with prediction label L_T. When a_2 is executed, an incorrect prediction produces r_2 and a correct prediction produces r_3. According to equation (5), the experiences (s_t, a_t, r_t, s_{t+1}) are used to train DQN.

Fig. 3. The testing process of the dynamic window method based on DQN. T_max denotes the maximum window length.

Fig. 5. The increasing rate for individual subjects under the two recognition methods and the two datasets. The increasing rate is defined as the ITR of DW-DQN minus the ITR of FW, divided by the ITR of FW.

Fig. 7. Comparison of the DW-DQN methods and the unsorted DW-DQN methods on the two SSVEP datasets. (a) shows the ITR of the ensemble TRCA-based methods and (b) shows the ITR of the TDCA-based methods.

Fig. 8. The t-SNE projections of the unsorted features and the sorted features. The 'star', 'circle', 'plus', and 'square' markers denote SSVEP trials from different targets. For illustration, only four different targets are shown. Dark black means the result should be output, and light black means the window length should be extended. The dashed line indicates the desired decision boundary.

TABLE I: THE ITR AND ITS WINDOW LENGTH (T)

TABLE III: THE PERFORMANCE OF DW-DQN WITH DIFFERENT r_3 ON THE TWO DATASETS AND TWO RECOGNITION METHODS, RESPECTIVELY. THE VALUE OF r_1 REMAINS UNCHANGED, AND r_2 = −r_3

TABLE IV: THE ITR (BITS/MIN) OF DW-DQN UNDER DIFFERENT NETWORK SIZES ON THE TWO DATASETS (TH AND BETA) AND THE TWO RECOGNITION METHODS (TDCA AND ENSEMBLE TRCA). "NETWORK SIZE" REPRESENTS THE RATIO OF THE NUMBER OF NEURONS IN THE NETWORK TO THE NUMBER IN SECTION II. FOR EXAMPLE, THE NETWORK SIZE OF 0.5 CORRESPONDS TO LAYERS OF 512, 128, 64, 32, 16, AND 8