Towards error categorisation in BCI: single-trial EEG classification between different errors

Objective. Error-related potentials (ErrP) are generated in the brain when humans perceive errors. These ErrP signals can be used to classify actions as erroneous or non-erroneous, using single-trial electroencephalography (EEG). A small number of studies have demonstrated the feasibility of using ErrP detection as feedback for reinforcement-learning-based brain-computer interfaces (BCI), confirming the possibility of developing more autonomous BCI. These systems could be made more efficient with specific information about the type of error that occurred. A few studies differentiated the ErrP of different errors from each other, based on direction or severity. However, errors cannot always be categorised in these ways. We aimed to investigate the feasibility of differentiating very similar error conditions from each other, in the absence of previously explored metrics. Approach. In this study, we used two data sets with 25 and 14 participants to investigate the differences between errors. The two error conditions in each task were similar in terms of severity, direction and visual processing. The only notable differences between them were the varying cognitive processes involved in perceiving the errors, and differing contexts in which the errors occurred. We used a linear classifier with a small feature set to differentiate the errors on a single-trial basis. Main results. For both data sets, we observed neurophysiological distinctions between the ErrPs related to each error type. We found further distinctions between age groups. Furthermore, we achieved statistically significant single-trial classification rates for most participants included in the classification phase, with mean overall accuracy of 65.2% and 65.6% for the two tasks. Significance. As a proof of concept our results showed that it is feasible, using single-trial EEG, to classify these similar error types against each other. 
This study paves the way for more detailed and efficient learning in BCI, and thus for a more autonomous human-machine interaction.


Introduction
When a human recognises that an error has been committed, either by themselves or in actions that they are observing, characteristic signals known as error-related potentials (ErrP) are generated in the brain [1]. A number of studies have shown that it is possible to differentiate between errors and correct actions, by detecting ErrP using electroencephalography (EEG), on a single-trial basis [2][3][4][5]. Interestingly, previous studies have confirmed the possibility of using single-trial error versus non-error classification as a feedback function for reinforcement-learning-based brain-computer interfaces (BCI) [2][3][4][5]. This opens up the possibility of moving toward autonomous BCI systems, allowing the machine to learn appropriate low-level actions based on the human's perceptions of which actions are correct, and which are errors. Such systems are able to reduce human mental workload by learning quasi-optimal solutions in scenarios such as simple navigation tasks [3,4]. However, when tasks increase in complexity, learning will become slower if the only available information is whether a given action was correct or erroneous. Hence, if a system can be given more detailed information about the type of error that occurred, it can correct its actions more appropriately, and learn more quickly.

More recently, a handful of studies have shown that, beyond classifying errors against correct actions, it is possible to distinguish different errors from each other based on their ErrP. In a study by Iturrate et al, participants observed a virtual robotic arm, which had the task of selecting a specific basket [6]. However, the arm could also erroneously select baskets 1 or 2 steps away from the target, to the left and to the right. The study showed that there were significant differences between the ErrP for errors to the left versus those to the right, and also between those of small versus large errors. In addition to this, a small number of studies have considered neurophysiological differences arising from varying sources of errors. Different ErrP and error types that have been discussed are as follows: 'response ErrP' caused when a human recognises that they have responded incorrectly to a task [5,7,8], 'feedback ErrP' caused when a human is informed that they have made an error of which they were previously unaware [5,7], 'observation ErrP' occurring when a human observes an error committed by a machine or another human [5,7], 'execution errors' occurring when a machine fails to execute a command as instructed by the human [5,9], and 'outcome errors' appearing when a human experiences a task failure [5,9]. A study by Spüler and Niethammer showed that it is possible to classify outcome errors (committed by a human) against execution errors (committed by a machine) on a single-trial basis [9].
Despite these recent advances, the vast majority of literature in the field concerns the classification of errors against correct actions, rather than the classification of different error types against each other. Where single-trial error categorisation has been explored in a few recent studies, metrics that have been considered to distinguish the error categories include direction, severity, and whether the error was committed by the human or the machine. However, different errors cannot always be categorised by such metrics. For example, if we are trying to navigate to a target location we could either take a wrong turn on the way, or we could reach the target but then pass it. These two errors could be of the same direction and magnitude, and therefore indistinguishable by currently explored metrics, but knowing which one had occurred would provide useful information. Therefore, it is important to consider whether there are significant neurophysiological distinctions in EEG signals between the brain's responses to very similar error conditions, even in cases where metrics explored in existing literature are not available.
To address this question, we evaluated data from two tasks. In the first task, users were presented with 'go' and 'no-go' stimuli and asked to respond to 'go' stimuli, but withhold responses to 'no-go' stimuli. All of the errors considered by this experiment were response errors committed by humans who failed to withhold responses to 'no-go' stimuli, and then recognised their own errors. None of the errors had any direction associated with them, and participants were not instructed to consider any errors as more or less severe than any others. The key difference between the error conditions lay in the cognitive processes required to recognise them, with the recognition of one error condition being more memory-dependent than the other. In the second task, users observed a virtual robot attempting to navigate to, and grab, a target object. Here, we investigated users' EEG responses to two navigational errors: moving away from the target when in position and ready to grab it, and moving further away from the target object if not already in position. Errors were equally likely to be made to the left or the right. In this case, all errors were being committed by the machine. As with the first task, direction could not be used to distinguish the error conditions, and users were not told to consider either error to be more or less severe than the other. As such, the error conditions considered here could not be differentiated by metrics used in existing literature. However, the contexts in which the errors arose differed slightly: In one condition, the expected correct action would be a lateral movement towards the target. In the other condition, the expected correct action would be to grab the target. We aimed to use distinctions in the EEG signals, arising from these subtle differences of cognitive load and context, to classify the error conditions against each other.
To explore the neurophysiological distinctions between the responses to these error conditions, we used time domain data to compare the latency and amplitude of key ErrP features: the error-related negativity (ERN), and the error positivity (Pe). The ERN is a negative deflection, usually peaking fronto-centrally around 100 ms after an error [1,2,10]. The Pe is a slower positive wave, often peaking centro-parietally between 200-400 ms after the error [2,10-12]. In contrast with the ERN, the Pe has been shown to depend on participants' awareness and confidence that an error has been committed [13][14][15][16], suggesting that the Pe is linked to conscious processing of errors. In addition to amplitude, the 'build-up rate' of the Pe (i.e. the steepness of the slope as amplitude increases to the peak) has also been identified as a marker of evidence accumulation for error detection [17]. Further to this, secondary Pe peaks have been identified, again being linked to conscious, evaluative processes [18,19]. The ERN and Pe have been displayed in a variety of previous single-trial error classification studies [2,3,7,9].
We also investigated the spatial distribution of the brain's response to each error condition, using topographical maps. In order to distinguish between error conditions on a single-trial basis, we employed a stepwise linear discriminant analysis classification strategy, using a small, highly discriminative set of time domain features from 20 electrode sites. We tested the efficacy of this strategy using data from 20 young and five older adults performing one task, and 14 young adults performing the other task.

Participants
This study used data collected during two tasks, which we refer to as the 'error awareness dot task' (EADT) and the 'claw observation task' (COT). Fifty-four healthy adults were recruited for the EADT. Twenty-eight of these were young (aged 18-34) and 26 were older (aged 65-80). Seventeen healthy adults were recruited for the COT.
All of these participants were included in neurophysiological analyses, but some were excluded from the single-trial classification phase of this study. Twenty-three were excluded from the EADT (four young, 19 older) due to not producing enough artefact-free trials for all conditions. A further six from the EADT (four young, two older) were excluded as it may have been possible to classify their data based on motor signals, rather than ErrPs. The rationale for these exclusions is explained in further detail in section 2.4.1. This left 25 participants from the EADT (20 young, five older) to be included in the single-trial classification phase. Three participants were excluded from the COT due to not producing enough artefact-free trials for all conditions. All COT participants used for single-trial classification were young (aged 18-35).
All participants for both tasks had normal or corrected-to-normal vision. They reported no history of psychiatric illness, head injury, or photosensitive epilepsy. Written informed consent was provided before testing began. All participants of the EADT also reported that they had no history of colour-blindness. All procedures for both tasks were in accordance with the Declaration of Helsinki. Procedures for the EADT were approved by the Trinity College Dublin Ethics Committee, and procedures for the COT were approved by the University of Sheffield Ethics Committee in the Automatic Control and Systems Engineering Department.

Experimental setup
EEG setup
For the EADT, 64 channels of EEG were recorded at 512 Hz, using the BioSemi ActiveTwo system. Electrodes were placed using the 10-20 system. Electrooculogram (EOG) electrodes were also placed at the outer canthus of each eye, and above and below the left eye. Reference electrodes were placed on the left and right mastoid.

The error awareness dot task
The EADT was a time-critical reaction task, requiring sustained attention. The task employed a 'go/no-go' paradigm, requiring participants to react to 'go' stimuli with a mouse click, but withhold their reaction in the case of 'no-go' stimuli.
Participants were shown a succession of randomised, differently-coloured dots on a computer screen, with a blank grey screen shown between dots, as shown in figure 1. Participants were asked to perform a left mouse click, in a timely manner, in response to the presentation of each new dot. However, in two 'no-go' scenarios, they were asked to withhold their response. These scenarios were the presentation of a blue dot, or of a dot that was the same colour as the previous dot. These are known as the 'colour condition' and 'repeat condition', respectively. If participants did click in either of these scenarios, they were asked to perform a second click with the right mouse button, in order to indicate their awareness of the error.
Figure 1. Participants were asked to respond to 'go' stimuli with a left mouse button click (L). They were asked to withhold this response in the event of either a 'colour no-go' stimulus (the stimulus is blue) or 'repeat no-go' stimulus (the stimulus is the same colour as the previous stimulus). If participants performed a left mouse click following a no-go stimulus, they were asked to follow this with a right mouse button click (R), to register their awareness of their error.
Before testing began, a practice block took place, in which participants had to respond successfully to three consecutive no-go trials, either by withholding their initial response or, if they did click erroneously, by following up with an awareness click. Eight blocks of trials were collected from each participant, with the exception of five, for whom 4-6 blocks of trials were collected. Each block lasted approximately 6 min, and contained 176 'go' trials, 16 'repeat condition' trials, and eight 'colour condition' trials.
The duration for which each stimulus was shown varied throughout the task, depending on the accuracy of the participant in performing correct responses to go and no-go trials. Initially, stimuli were displayed for 750 ms. However, if the participant's accuracy were below 50%, stimulus duration would increase to 1000 ms. Conversely, if the participant's accuracy were above 60%, stimulus duration would decrease to 500 ms. Accuracy between 50 and 60% would result in stimulus duration remaining at, or reverting to, 750 ms. Stimulus duration was updated every 40 trials. An inter-stimulus gap, in which the screen was a blank grey, remained constant at 750 ms. This meant that the time period between the onset of stimulus n and the onset of stimulus n + 1 could vary between 1250 ms and 1750 ms.
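The adaptive timing rule described above can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the original task code: the function and constant names are hypothetical, and the sketch assumes that an accuracy of exactly 50% or 60% counts as falling within the 50-60% band.

```python
INTER_STIMULUS_GAP_MS = 750  # blank grey screen between dots (constant)


def stimulus_duration_ms(accuracy_percent: float) -> int:
    """Stimulus duration to use until the next update (every 40 trials)."""
    if accuracy_percent < 50.0:
        return 1000  # low accuracy: stimuli are shown for longer
    if accuracy_percent > 60.0:
        return 500   # high accuracy: stimuli are shown more briefly
    return 750       # 50-60% accuracy: remain at, or revert to, 750 ms


def onset_to_onset_ms(accuracy_percent: float) -> int:
    """Time from the onset of stimulus n to the onset of stimulus n + 1."""
    return stimulus_duration_ms(accuracy_percent) + INTER_STIMULUS_GAP_MS
```

Under these assumptions, the onset-to-onset interval takes the values 1250 ms, 1500 ms or 1750 ms, which matches the range stated in the text.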

The claw observation task
In the COT, the errors in question were committed by the machine and observed by the participants, as opposed to errors being committed by the participants themselves in the EADT. Thus, the COT is similar to error-driven BCI scenarios in which users observe actions made by a machine [3,6].
Here, participants were asked to observe a computer-controlled simulation of an arcade 'claw crane' game. Participants were shown a screen with eight coloured circles arranged in a row and, above the circles, a virtual robotic arm, as shown in figure 2. A single circle, selected at random at the start of each run, was designated as the target. This circle was coloured blue and marked with a score of +25 points. Every other circle was coloured red. The red circles immediately adjacent to the target were marked with a score of −10 points, and the scores marked on each circle decreased by a further five points with each step further from the target. The robotic arm began each run directly above a circle either 2 or 3 steps away from the target. Every 1.5 s, the robotic arm would either move 1 step to the left, move 1 step to the right, or extend downward to grab the circle beneath it. Movements occurred instantaneously. The probability of each type of action occurring depended on whether or not the arm was positioned directly above the target circle. A table of action probabilities is shown in table 1.
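As a small illustration of the scoring layout, the score marked on each circle can be written as a function of its distance from the target. The function name and the zero-based position indexing are assumptions made for this sketch only.

```python
def circle_score(position: int, target: int) -> int:
    """Score marked on the circle at `position`, given the target position."""
    distance = abs(position - target)
    if distance == 0:
        return 25  # blue target circle
    # Adjacent red circles score -10, with 5 more points lost per extra step.
    return -10 - 5 * (distance - 1)


# Example: eight circles in a row, with the target at index 2.
scores = [circle_score(p, target=2) for p in range(8)]
```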
A score was also displayed in the top left corner of the screen. When a 'grab' action was performed, the score would be updated according to the score marked on the circle that had been grabbed. After each 'grab' action the run would finish and the screen would become completely black. Nine of the COT participants were asked to silently count the number of times each movement error was made in each run, in an attempt to help them stay focused on the task. These participants were asked to write down the number of errors on a sheet provided at the end of each run. As such, the gap between the end of one run and the start of the next run was 10 s. The remaining eight COT participants were not asked to perform the counting. For these participants, the gap between runs was 5 s. In either case, a beep would sound 1 s before the next run began. Participants were asked to refrain from movement and blinking during each run, but told that they could move and blink freely between runs, while the screen was blank. This process repeated until the end of the block, with each block lasting approximately 4 min. The score was reset to 0 at the beginning of each new block.
The actions considered for this study were movement errors. Movements in which the virtual robot was aligned over one of the red non-target balls, and moved further away from the target, are hereafter referred to as 'condition 1' errors. Movements in which the virtual robot was aligned over the blue target ball, but stepped off it, are hereafter referred to as 'condition 2' errors. A third error type was present in the task: a 'grab error', when the robot grabbed a non-target ball. These errors occurred from a different type of movement than condition 1 and 2 errors, which both occurred as a result of lateral movements. The robot would always have information about whether it had made a lateral movement or a grab action. As such, in a BCI application, there would be no need to differentiate grab errors against other error types using EEG. Standard error detection applied following a grab action would be enough to identify them. For this reason, grab errors were not considered as a part of this study. The score was only updated after a 'grab' action, and not after lateral movements (including either 'condition 1' or 'condition 2' errors), therefore no points were directly gained or lost as a result of either error condition. Considering this, together with the fact that each error was of the same magnitude (1 step), we considered them to be of similar severity.
Participants were asked to observe blocks, with breaks of as long as they wished between blocks, until they reported their concentration levels beginning to decrease. Most participants observed six blocks of trials. However, four participants observed 3-5 blocks, and three participants observed seven or eight blocks.

Data analysis
For both tasks, EEG data were first resampled to 64 Hz. To do this, trials were first upsampled, then filtered using a least squares linear phase anti-aliasing FIR filter with a lowpass cutoff of 32 Hz. The filtered data were then downsampled by averaging across data points, and initial data points from the output of filtering were removed to compensate for the delay introduced by the linear phase filter. After resampling, data were band-pass filtered from 1 Hz to 10 Hz, as ErrP components have been shown to occur at low frequencies [1,2]. Event related spectral perturbation plots confirmed that activity for these tasks occurred predominantly in low frequencies (see supplementary figure 1 (stacks.iop.org/JNE/17/016008/mmedia)). For the EADT, trials were included in cases where the error was followed by a secondary mouse click to indicate the participant's awareness of their error. Trials were extracted from a time window of −300 ms to 700 ms, relative to the commission of each error (i.e. the initial, erroneous mouse click). Previous literature has shown evidence that participants' EEG may show signs of an error response before they commit the error [12]. As such, the EADT time window began before error commission. Errors of which the participants were unaware were not considered as part of the main investigations of this study. As the COT involved errors committed by the machine, rather than the human, it would not have been pertinent to consider signals prior to error commission. Therefore, for the COT, trials were extracted from a time window of 0 ms to 1000 ms, relative to the movement of the virtual robot. Each extracted error trial was baseline corrected relative to a period of 200 ms immediately before the presentation of its related stimulus. Artefact rejection was performed by discarding any trials in which the range between the highest and lowest amplitudes, in any channel, was greater than 100 µV.
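The amplitude-range artefact rejection step can be sketched as follows, assuming the epochs are held in a NumPy array of shape (trials, channels, samples) in µV. The function name, array layout and toy data are assumptions for illustration, not the original analysis code.

```python
import numpy as np


def reject_artefacts(epochs: np.ndarray, max_range_uv: float = 100.0):
    """Discard trials whose amplitude range exceeds the limit in any channel.

    epochs: array of shape (n_trials, n_channels, n_samples), in microvolts.
    Returns the retained epochs and the boolean mask of retained trials.
    """
    ranges = epochs.max(axis=2) - epochs.min(axis=2)  # (n_trials, n_channels)
    keep = (ranges <= max_range_uv).all(axis=1)       # every channel within limit
    return epochs[keep], keep


# Toy data: three trials, two channels, four samples.
epochs = np.zeros((3, 2, 4))
epochs[1, 0] = [0.0, 120.0, -10.0, 5.0]  # range of 130 µV: rejected
epochs[2, 1] = [0.0, 60.0, -40.0, 0.0]   # range of exactly 100 µV: retained
clean, keep = reject_artefacts(epochs)
```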
In EADT data, a mean of 1.9 colour condition trials and a mean of 3.0 repeat condition trials were rejected per participant, from overall means of 22.2 and 32.5 trials per participant for the two conditions, respectively. In COT data, a mean of 2.0 trials from condition 1 and a mean of 0.7 trials from condition 2 were rejected per participant, from overall means of 48.8 and 23.4 trials per participant for the two conditions, respectively. Further to this, independent component analysis (ICA) was performed on the pooled trials from all participants combined, for each task. Components resembling EOG artefacts, as identified by visual inspection of topographic maps, were filtered out of the data. Thus, one component was removed from the data related to each task, from a total of 64 components for the EADT and 20 components for the COT. The remaining components for each task were then recombined. Grand average time domain ErrP data were plotted using the extracted trials, showing the mean voltage ±1 standard error for the following comparisons: EADT colour condition versus repeat condition in young adults, EADT colour condition versus repeat condition in older adults, and COT condition 1 versus condition 2 in all participants. A small number of trials were excluded from the grand average time domain plots for the EADT, where the initial click had occurred at least 550 ms after the presentation of the stimulus. This was because, with longer reaction times, stimulus n + 1 (which could appear as little as 1250 ms after stimulus n in the EADT) could fall within the time window (−300 ms to 700 ms, relative to the click) of stimulus n, so including these trials could have contaminated the late part of the grand average data with responses to the following stimulus. In total, 14 out of 717 colour condition trials and 12 out of 1181 repeat condition trials were excluded from these plots for this reason.
Figure 2. Participants were asked to observe as a virtual robotic claw attempted to navigate towards, and grab, a blue target ball. If the claw was aligned over the target ball, possible actions were either to grab the ball or take 1 step away from the target. If the claw was not aligned over the target ball, possible actions were either to move 1 step towards the target, move 1 step further away from the target, or grab the red ball beneath the claw's current position.
Table 1. Action probabilities for the claw observation task (columns: arm location, action type). Note that correct actions and grabbing errors were not considered as a part of this study, as the robot would always have information about whether it had performed a lateral movement or a grab action.
Peak analysis was performed in order to identify the latencies at which ERN and Pe occurred in the ErrP data. ErrP signals are known to be associated with midline electrodes [8]. Visual inspection of time domain ErrP and topographical plots showed high positive Pe activity around the central midline across all tasks and age groups, with the most notable amplitude difference between the classes being visible in Cz time domain data. As such, electrode site Cz was chosen as the most suitable channel for peak analysis for this study. In each task, this peak analysis was carried out on the grand average ErrP waveform related to each error condition, and also for the grand average ErrP of all trials of the two error conditions pooled together. In the EADT, the analysis was carried out separately for each age group. For each group, the data were first averaged, and then peaks were identified in the resultant waveform. The ERN was identified as the most prominent negative peak, and the Pe as the highest positive peak, occurring in specific time windows. Time windows for ERN were −100 ms to 200 ms in the EADT, and 0 ms to 300 ms in the COT. Time windows for Pe were 0 ms to 400 ms in the EADT, and 100 ms to 600 ms in the COT. These time windows were selected based on a visual inspection of the time-domain data; ERN windows started slightly before the start of the negative deflection in grand average plots and centred on the negative peaks, and Pe windows began just before the start of the positive deflection and ended once amplitudes had returned approximately to baseline levels. As discussed earlier in this section, evidence has shown that some participants may show signs of an error response before they commit the error [12], hence the ERN time window in the EADT beginning 100 ms prior to error commission.
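A minimal sketch of the windowed peak search is given below. It simplifies 'most prominent negative peak' to the minimum value inside the window, and 'highest positive peak' to the maximum, which is an assumption of this sketch; the function name and the toy waveform are also hypothetical.

```python
import numpy as np


def find_peak(times_ms, waveform_uv, window_ms, polarity):
    """Return (latency_ms, amplitude_uv) of the extreme value in a window.

    polarity = -1 searches for the most negative point (ERN),
    polarity = +1 searches for the most positive point (Pe).
    """
    times_ms = np.asarray(times_ms, dtype=float)
    waveform_uv = np.asarray(waveform_uv, dtype=float)
    in_window = (times_ms >= window_ms[0]) & (times_ms <= window_ms[1])
    idx = np.flatnonzero(in_window)
    best = idx[np.argmax(polarity * waveform_uv[idx])]
    return float(times_ms[best]), float(waveform_uv[best])


# Toy averaged Cz waveform, using the EADT windows from the text.
times = [-100, 0, 100, 200, 300, 400]
cz = [0.0, -4.0, -1.0, 3.0, 6.0, 2.0]
ern = find_peak(times, cz, window_ms=(-100, 200), polarity=-1)
pe = find_peak(times, cz, window_ms=(0, 400), polarity=+1)
```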
To check for statistically significant differences in peak latencies across error conditions, the same peaks were identified in the average time domain data for each individual participant with at least 12 trials per condition and at least 40 trials in total, as previous literature has suggested that a minimum of 12 trials are required to achieve a reasonable level of temporal stability of ERN and Pe, and that temporal stability increases with the number of trials [20]. Wilcoxon signed-rank tests were then carried out on these data, comparing the latencies identified in each of these participants' average time domain waveforms for the two conditions. To check for statistically significant differences in peak amplitude, the amplitude was calculated in each of these participants' average waveforms for each condition, in a 50 ms window surrounding the ERN and Pe peaks identified in grand average data (from peak −25 ms to peak +25 ms). Wilcoxon signed-rank tests were carried out to compare these amplitudes. Furthermore, the build-up rate of the Pe was calculated for the average waveform of each participant, in each error condition, for both tasks. This was achieved by performing a linear regression on a time window, 100 ms in duration, ending at the identified Pe peak. This gives an indication of the rate at which the amplitude is increasing up to the peak. Wilcoxon signed-rank tests were carried out to check whether the build-up rates of the different error conditions varied in a statistically significant way.
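The per-participant amplitude and build-up rate measures can be sketched as below. The sketch assumes that 'amplitude' means the mean voltage in the 50 ms window, and that the build-up rate is the slope of an ordinary least-squares line; function names and the toy waveform are illustrative, and the subsequent Wilcoxon signed-rank comparison is not shown.

```python
import numpy as np


def mean_amplitude(times_ms, waveform_uv, peak_ms, half_width_ms=25.0):
    """Mean amplitude from peak - 25 ms to peak + 25 ms."""
    t = np.asarray(times_ms, dtype=float)
    w = np.asarray(waveform_uv, dtype=float)
    mask = (t >= peak_ms - half_width_ms) & (t <= peak_ms + half_width_ms)
    return float(w[mask].mean())


def pe_build_up_rate(times_ms, waveform_uv, pe_peak_ms, window_ms=100.0):
    """Slope (µV per ms) of a straight line fitted to the 100 ms before the Pe peak."""
    t = np.asarray(times_ms, dtype=float)
    w = np.asarray(waveform_uv, dtype=float)
    mask = (t >= pe_peak_ms - window_ms) & (t <= pe_peak_ms)
    slope, _intercept = np.polyfit(t[mask], w[mask], deg=1)
    return float(slope)


# Toy waveform rising linearly at 0.02 µV/ms, with a Pe peak assumed at 300 ms.
times = np.arange(0, 401, 25)
wave = 0.02 * times
amplitude = mean_amplitude(times, wave, peak_ms=300)
rate = pe_build_up_rate(times, wave, pe_peak_ms=300)
```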
Topographical maps were then plotted for each error condition, using the same time windows. All topographical maps for a given task used the same scale, from the minimum to the maximum value across all grand averages.
While the main focus of this study was on errors of which the participants were aware, a brief analysis was carried out to compare the number of 'aware errors' (errors followed by an awareness click) versus 'unaware errors' (errors not followed by an awareness click) in the EADT. The percentage of errors of which each participant was aware was calculated for each error condition in each task. Wilcoxon signed-rank tests were carried out in order to check whether there was any significant difference between awareness rates for the various conditions.

Classification
Broadly, the same classification protocol was followed for all participants of both tasks. However, different time windows were used to extract features for the two tasks. The protocol is described in this section. Twenty electrode channels were available in the COT data (F7, F3, Fz, F4, F8, FC1, FC2, T7, C3, Cz, C4, T8, CP1, CP2, P3, Pz, P4, PO7, PO8, and Oz). As such, these 20 channels were used for single-trial classification of both tasks. As with the neurophysiological analysis, data for classification were resampled to 64 Hz and band-pass filtered between 1 Hz and 10 Hz. In the EADT, trials were extracted from −100 ms to 400 ms, relative to the commission of errors (i.e. the erroneous click), in cases where the participants showed awareness of the error. In the COT, trials were extracted from 100 ms to 700 ms, relative to the virtual robot's movement. These time windows were selected based on visual inspection of grand average time domain data for each task, aiming to encapsulate the areas which indicated differences between the amplitudes of responses to the two conditions. Trials were baseline corrected to a period of 200 ms immediately before presentation of the stimulus, and artefact rejection was performed to remove any trials with a range of greater than 100 µV between the highest and lowest amplitude in any of the channels being used for classification. After this, remaining EOG artefacts were cleaned using ICA, as previously described in section 2.3.

Preprocessing
As discussed in section 2.3, temporal stability of the ERN and Pe have been shown to increase with the number of trials, with a minimum of 12 trials being recommended to achieve a reasonable level of stability [20]. As such, for the purpose of single-trial classification, we only included participants who had generated at least 12 trials per error condition, and a minimum of 40 trials overall.
Due to the experimental setup of the EADT, which involved participants clicking a mouse to confirm error awareness, motor movements would sometimes occur less than 400 ms after error commission, i.e. within the classification time window. As such, it was important to ensure that the classification was based on error responses rather than sensorimotor rhythms. To this end, two analyses were carried out on the latency between error commission and awareness confirmation in the various error conditions. Firstly, for each participant, a Fisher's exact test was carried out on the number of trials that did contain awareness confirmation within the time window used for classification versus the number that did not, in each of the two error conditions. This test was to check, for each participant, whether significant classification could feasibly be achieved based on the presence or absence of sensorimotor rhythms. Secondly, for each participant in each task, Welch's t-test was carried out, comparing the latencies at which participants confirmed their error awareness, between the two error conditions. The latencies of mouse clicks, confirming error awareness, were included in the t-test if they occurred within the classification time window (−100 ms to 400 ms). Clicks outside this window were ignored as they were not deemed to have a potential effect on classification. The t-test was automatically marked as not significant if there were no awareness confirmations within the classification epoch. The purpose of this test was to act as a guide, for each participant, as to whether significant classification could feasibly have been achieved based on differences in the time at which awareness-based sensorimotor rhythms occurred. We were mindful that the classification results of this study could have been unfairly biased if we had included any participants for whom classification may have been possible due to differences between motor signals across the two conditions. Therefore, participants for whom a significant result (p < 0.05) was recorded, in either the Fisher's exact test or the t-test, were discarded from the classification phase.
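The first of these checks can be illustrated with a self-contained two-sided Fisher's exact test on a 2 × 2 table of trial counts (rows: error condition; columns: awareness click inside or outside the classification window). The implementation, the function names and the example counts are assumptions made for this sketch, and the Welch's t-test p-value is taken as an input computed elsewhere.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Sum over all tables that are no more probable than the observed one.
    p = sum(prob(x) for x in range(lowest, highest + 1)
            if prob(x) <= p_observed * (1 + 1e-7))
    return min(1.0, p)


def exclude_participant(fisher_p: float, welch_p: float, alpha: float = 0.05) -> bool:
    """A participant is excluded if either test gives a significant result."""
    return fisher_p < alpha or welch_p < alpha


# Hypothetical counts: condition A has 8 trials with a click in the window and
# 2 without; condition B has 1 with and 5 without.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
excluded = exclude_participant(p_value, welch_p=0.40)
```

With these illustrative counts the imbalance between conditions is large enough for the participant to be excluded, since classification could plausibly exploit the presence of a click rather than the ErrP.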
After preprocessing, 25 participants remained to be used in the classification phase from the EADT (20 young, five older), and 14 remained from the COT (eight asked to count errors, six not asked to count errors).

Feature extraction
Our EEG data, having been resampled at 64 Hz, contained 33 time points per trial in the EADT and 40 time points per trial in the COT. If we were to consider all available time domain data, there would have been a total of 660 features (20 channels × 33 time points) or 800 features (20 channels × 40 time points) to describe each trial. Although we employed minimum cutoffs of 12 trials per condition and 40 overall trials, many participants still had relatively few trials per class. With the number of features given by the full time domain data greatly outweighing the number of trials per condition, it was clear that the curse of dimensionality could cause problems if we attempted to classify based on all available time domain data [21].
Our classification was performed using stepwise linear discriminant analysis (SWLDA), as described in section 2.4.3. However, the feature selection inherent in SWLDA is relatively sophisticated, and less complex methods are known to be less susceptible to overfitting [22]. Therefore, we opted to reduce the dimensionality by using a simpler first step for preliminary feature extraction. This allowed the SWLDA to be applied to a small number of highly discriminative selected features.
For each participant, the preliminary step was carried out as follows. For each time-domain feature (i.e. each time point in each channel), there was a set of training data points. Each point had an amplitude and an associated class label. A linear correlation coefficient was calculated between these amplitudes and class labels, resulting in each feature having an associated correlation coefficient. The correlation coefficients acted as a simple indication of how strongly related the amplitude was to the class labels in a given feature, and thus how separable the classes may be based on the amplitude. In each channel, the feature with the largest absolute correlation coefficient was selected. This meant that each trial was represented by 20 features.
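The preliminary step described above can be illustrated with a minimal sketch. It is not the code used in the study: the nested-list data layout, the 0/1 coding of the class labels, and the function names `pearson`, `select_one_feature_per_channel` and `reduce_trial` are all assumptions made for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def select_one_feature_per_channel(trials, labels):
    """Pick, for every channel, the time index with the largest |r|.

    trials[i][c][t] is the amplitude of training trial i at channel c and
    time point t (assumed layout); labels[i] is the class coded as 0 or 1.
    """
    n_channels = len(trials[0])
    n_times = len(trials[0][0])
    selected = []
    for c in range(n_channels):
        best_t, best_r = 0, -1.0
        for t in range(n_times):
            amplitudes = [trial[c][t] for trial in trials]
            r = abs(pearson(amplitudes, labels))
            if r > best_r:
                best_t, best_r = t, r
        selected.append(best_t)
    return selected

def reduce_trial(trial, selected):
    """Represent one trial by its selected time point in each channel."""
    return [trial[c][t] for c, t in enumerate(selected)]

# Toy data: 4 trials, 2 channels, 3 time points.
trials = [
    [[0.1, 0.0, 0.3], [0.5, 0.2, -1.0]],
    [[0.2, 0.0, 0.1], [0.4, 0.3, -1.0]],
    [[0.1, 1.0, 0.2], [0.5, 0.2, 1.0]],
    [[0.2, 1.0, 0.3], [0.4, 0.3, 1.0]],
]
labels = [0, 0, 1, 1]
selected = select_one_feature_per_channel(trials, labels)
```

With 20 channels, `reduce_trial` would return the 20-element representation described above. The selection has to be computed on training trials only, so that the held-out trial does not influence which time points are chosen.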

Stepwise linear discriminant analysis implementation
In order to classify the data based on the most pertinent subset of the extracted features, SWLDA was chosen as our classification approach, since it has previously been shown to perform well in feature selection and classification of EEG data [23][24][25]. Stepwise regression was performed to select which features would be included in the model. Initially, an empty model was created. At each step, a regression analysis was performed on models with and without each feature, producing an F-statistic with a p-value for each feature. If the p-value of any feature was <0.025, the feature with the smallest p-value would be added. Otherwise, if the p-value of any features already in the model had risen to >0.075 at the current step, the feature with the largest p-value would be removed from the model. This process continued until no feature's p-value reached the thresholds for being added to, or removed from, the model. If no features were added to the model at all, a single feature with the smallest p-value would be selected. Training and test trials were then reduced to the selected features. The class with the fewest training trials was oversampled in order to ensure that training occurred with an equal number of trials per class. A linear classification model was then trained and tested.
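The add/remove logic of the stepwise procedure can be sketched as follows. This is a hedged illustration of the control flow only: the partial F-test itself is abstracted behind a hypothetical callable `p_value_fn(feature, others)`, which is assumed to return the p-value of `feature` given the other features in the model, and the `max_steps` guard is an addition for safety that is not mentioned in the text.

```python
def stepwise_select(candidates, p_value_fn, p_enter=0.025, p_remove=0.075,
                    max_steps=1000):
    """Stepwise feature selection using the thresholds given in the text.

    p_value_fn(feature, others) is a hypothetical callable returning the
    p-value of the partial F-test for `feature`, given that the features in
    `others` are in the model.
    """
    model = []  # start from an empty model
    for _ in range(max_steps):
        outside = [f for f in candidates if f not in model]
        p_out = {f: p_value_fn(f, model) for f in outside}
        p_in = {f: p_value_fn(f, [g for g in model if g != f]) for f in model}
        if p_out and min(p_out.values()) < p_enter:
            # Add the most significant feature that is not yet in the model.
            model.append(min(p_out, key=p_out.get))
        elif p_in and max(p_in.values()) > p_remove:
            # Otherwise drop the least significant feature in the model.
            model.remove(max(p_in, key=p_in.get))
        else:
            break  # nothing qualifies for entry or removal
    if not model:
        # Fallback from the text: keep the single feature with the smallest p.
        p_all = {f: p_value_fn(f, []) for f in candidates}
        model = [min(p_all, key=p_all.get)]
    return model

# Stub p-values for demonstration; a real p_value_fn would fit regressions.
fixed = {0: 0.01, 1: 0.5, 2: 0.02}
chosen = stepwise_select([0, 1, 2], lambda f, others: fixed[f])

weak = {0: 0.3, 1: 0.2}
fallback = stepwise_select([0, 1], lambda f, others: weak[f])

def dependent(f, others):
    # Feature 0 loses significance once feature 1 is in the model.
    if f == 0:
        return 0.5 if 1 in others else 0.01
    return 0.02

after_removal = stepwise_select([0, 1], dependent)
```

Using an entry threshold that is stricter than the removal threshold, as in the text, leaves a margin between the two decisions and reduces the chance of a feature being added and removed repeatedly.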
All classifiers were trained and tested using leave-one-out cross validation. For each iteration, one trial was selected as the test sample, and all the other trials were used as the training samples. Feature extraction and training of the stepwise linear model were then performed on the training samples. The model was then tested on the test sample. This process was repeated until each trial had been selected as the test sample. To test statistical significance of the classification, a right-tailed Fisher's exact test was performed on the confusion matrix of each participant's results. As the individual participants were independent, no p-value adjustments were necessary [26]. Therefore, classification for an individual was deemed to be significant if the p-value was less than 0.05. In order to test the significance at a group level, individual p-values were combined into a group p-value using Fisher's method [27,28]. To test whether there was any difference in the efficacy of the classification strategy across age groups, Welch's t-test was carried out comparing the overall accuracies of all young adults with those of older adults in the EADT.
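The two significance tests named above can be sketched with the Python standard library. This is an illustrative implementation under stated assumptions, not the authors' code: the confusion matrix is assumed to have true classes as rows and predicted classes as columns, and the function names are hypothetical.

```python
import math

def fisher_exact_right_tailed(table):
    """Right-tailed Fisher's exact test on a 2x2 table [[a, b], [c, d]].

    Returns the probability, with all margins fixed, of a top-left count at
    least as large as the observed one (hypergeometric upper tail).
    """
    (a, b), (c, d) = table
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    total = math.comb(n, row1)
    tail = sum(math.comb(col1, k) * math.comb(n - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / total

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under the null; for even degrees of freedom the
    survival function has the closed form used below. All p-values must be > 0.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # X / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total

# Hypothetical confusion matrix: rows are true classes, columns predictions.
p_single = fisher_exact_right_tailed([[8, 2], [1, 5]])
p_group = fisher_combined_p([0.5, 0.5])
```

A right-tailed test is the natural choice here because only an excess of correct classifications on the diagonal, not a deficit, counts as evidence of separation.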

Neurophysiological analysis of error-related potentials
Peak analysis was used to identify ERN and Pe latencies based on the grand average Cz time domain waveform for each combination of task, condition, and age group. The identified latencies are shown in table 2.
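A minimal sketch of window-based peak picking on a grand average waveform is given below, using the search windows listed in the caption of table 2. It is a simplification: it takes the extreme sample within each window as the peak, which may differ from the "most prominent" peak used in the study, and the function name, data layout and toy waveform are assumptions.

```python
# Search windows in ms relative to error commission (from the table 2 caption).
WINDOWS_MS = {
    "EADT": {"ERN": (-100, 200), "Pe": (0, 400)},
    "COT": {"ERN": (0, 300), "Pe": (100, 500)},
}

def find_peak(times_ms, waveform, window_ms, polarity):
    """Return (latency_ms, amplitude) of the extreme sample in a window.

    polarity is "negative" for the ERN and "positive" for the Pe.
    """
    lo, hi = window_ms
    in_window = [(t, v) for t, v in zip(times_ms, waveform) if lo <= t <= hi]
    if not in_window:
        raise ValueError("no samples inside the search window")
    pick = min if polarity == "negative" else max
    return pick(in_window, key=lambda pair: pair[1])

# Toy grand average at channel Cz (illustrative values, not study data).
times_ms = [-100, 0, 100, 200, 300, 400]
cz = [0.5, -3.0, -1.0, 2.0, 6.0, 4.0]
ern = find_peak(times_ms, cz, WINDOWS_MS["EADT"]["ERN"], "negative")
pe = find_peak(times_ms, cz, WINDOWS_MS["EADT"]["Pe"], "positive")
```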
Wilcoxon signed-rank tests were carried out to check for statistically significant differences in the ERN and Pe amplitudes and latencies generated in response to the different error conditions, as discussed in section 2.3. The results of these tests are shown in table 3.

Error awareness dot task
In the grand average ErrP of young adults in the EADT, responses to both conditions showed ERN
Table 2. ERN and Pe latencies, relative to error commission, as identified by peak analysis on the grand average channel Cz time domain waveform. The most prominent negative peak, between −100 ms and 200 ms in the EADT, or between 0 ms and 300 ms in the COT, relative to error commission, was selected as the ERN. The highest positive peak, between 0 ms and 400 ms in the EADT, or between 100 ms and 500 ms in the COT, relative to error commission, was selected as the Pe.
Participants of all ages indicated awareness of a higher proportion of colour condition errors (mean 89.3%, SD 17.7%) than repeat errors (mean 76.4%, SD 23.5%). A Wilcoxon signed-rank test showed that this difference was significant (p = 8.7 × 10⁻⁸).
In the older adults' EADT data, early positivity stalls the ERN, and some differences between the error conditions can be seen in the time domain data prior to error commission, as shown in figure 3 (green and brown lines). However, the difference between responses to the conditions was not found to be significant in older adults at the ERN. As with younger adults, the latencies of the ERN and Pe showed no significant difference (p = 0.69 and p = 0.15, respectively). While the build-up rate of the Pe appeared to be steeper in response to the colour condition than the repeat condition, a Wilcoxon signed-rank test did not find this to be significant in older EADT participants (p = 0.25). Again, the most notable difference between the two error conditions was the greater amplitude of the Pe in the colour condition, as compared to the repeat condition (p = 0.016).
Both ERN and Pe peaks were observed to be more positive in older adults than young adults, in response to both error conditions. Welch's t-tests confirmed that these age-related amplitude differences were statistically significant (p = 2.1 × 10⁻¹⁵ for colour condition related ERN amplitudes, p = 5.4 × 10⁻⁸ for colour condition related Pe amplitudes, p = 3.1 × 10⁻²⁰ for repeat condition related ERN amplitudes, and p = 5.4 × 10⁻¹³ for repeat condition related Pe amplitudes).
The typical fronto-central negativity cannot be identified by visual inspection of the topographical maps of the ERN in response to either error condition for older adults' EADT data (figures 4(c) and (d)). A posterior-anterior shift in aging (PASA) has been reported in previous literature [29,30] and is evident here in the Pe related to both conditions of the EADT. As discussed previously, the most positively active areas during the Pe are centro-parietal in young adults, as shown in figures 4(e) and (f). In older adults, this shifts toward more fronto-central activity, in both the colour condition and the repeat condition, as can be seen in figures 4(g) and (h). Indeed, the electrode sites with the highest grand average Pe amplitudes in young adults were CPz & Cz for the colour condition, and CPz & Pz in the repeat condition. In older adults, the highest grand average Pe amplitudes were found at electrode sites FCz and FC1, for both error conditions. Across all EADT participants, mean amplitudes for individual channels in the selected time windows ranged from −11.1 µV to 8.1 µV, and their associated standard deviations ranged from 0.04 µV to

Claw observation task
Time domain data related to responses to the COT can be seen in figure 5. Here, no statistically significant difference was found between either the latency or amplitude of the ERN (p = 0.72 and p = 0.22, respectively). In contrast to the EADT, neither the amplitude of the main Pe peak, nor the build-up rate of the Pe showed significant differences (p = 0.19 and p = 0.60, respectively). However, the latencies of the Pe peaks, at their highest points, were found to be significantly different (p = 0.032), with the Pe in responses to condition 2 peaking later than that related to condition 1.
A secondary component of the Pe also appeared to be present in the grand average COT data, and appeared to be more prominent in response to condition 2 than condition 1, followed by a difference in grand average amplitudes. We identified that the maximum difference here occurred at 538 ms (see supplementary figure 4 for illustration), and performed a further Wilcoxon signed-rank test on the amplitudes of the two conditions in the 50 ms window surrounding this latency. The difference in amplitudes at this point was found to be statistically significant (p = 6.1 × 10⁻⁴).
Topographical maps showed broad, slightly negative amplitudes across the brain during the ERN of the COT, in response to both error conditions, as shown in figures 6(a) and (c). Slightly more positive amplitudes can be seen in fronto-central regions in response to condition 1. During the Pe, strong positive activity can be seen in central and centro-parietal regions, as shown in figures 6(b) and (d).
Mean amplitudes for individual channels in the time window ranged from −1.1 µV to 5.4 µV, and their associated standard deviations ranged from 0.01 µV to 0.8 µV. Further topographical maps showing the standard deviation from the mean at each channel in the COT are shown in supplementary figures 3(i)-(l).

Classification of EADT errors
The classification accuracies achieved for each individual participant in the EADT are shown in table 4. The mean overall accuracy for all EADT participants was 65.2%. Amongst young adults, mean overall accuracy was 63.7%, and for older adults it was 71.3%. Mean colour condition accuracy was 60.4% for all participants, 59.4% for young adults, and 60.4% for older adults. The mean accuracy of the repeat condition was 67.6% for all participants, 66.0% for young adults, and 74.0% for older adults. Trained classification models for the EADT included a mean of 3.7 ± 1.3 features. Generally, more features were selected from posterior regions of the brain than anterior regions, echoing the heightened activity, varying in amplitude across the two classes, that was shown in these regions. A Wilcoxon signed-rank test was used to compare the average number of features selected per channel, for each participant, in more anterior channels (fronto-central channels and further anterior) against those in more posterior channels (centro-parietal channels and further posterior). The results showed that the average number of selected features per channel was significantly higher in the posterior region than in the anterior region (p = 4.9 × 10⁻⁴). At an individual level, features were often selected where the subject-average amplitude displayed a relatively large difference between the two classes. Supplementary figure 5 contains a further breakdown of feature selection rates, including an example for an individual EADT participant.
Statistically significant separation of the error conditions (p < 0.05) was found, using Fisher's exact tests, for 17 of the 25 participants overall (68.0%). Statistical significance was achieved for 13 of the 20 young adults (65.0%), and four of the five older adults (80.0%). At a group level, the classification results were found to be statistically significant in each age group (p = 1.6 × 10⁻¹⁶ for young adults and p = 3.2 × 10⁻¹¹ for older adults) and overall (p = 2.7 × 10⁻²⁵).
The overall accuracies of young adults were compared with those of older adults using Welch's t-test. The result did not show any significant difference (p = 0.16). While Welch's t-test is considered to be reliable in dealing with unequal sample sizes [31,32], it should be noted that only five older adults remained in the single-trial classification, which may mean that this finding should be treated with a measure of caution.

Classification of COT errors
The classification accuracies achieved for each individual participant in the COT are shown in table 5.
The mean overall accuracy for all COT participants was 65.6%. Mean accuracy for condition 1 was 69.5%, and the mean accuracy for condition 2 was 57.4%. Welch's t-test showed no significant difference in participants' accuracy depending on whether or not they were asked to keep count of the errors (p = 0.80, see supplementary table 1). Trained classification models for the COT included a mean of 2.9 ± 1.5 features. At a population level, it was difficult to discern clear patterns of which features were selected. However, as in the EADT, at an individual level features were often selected where there was a relatively large difference between the subject-average amplitudes of the classes. Supplementary figure 5 contains a further breakdown of feature selection rates, including an example for an individual COT participant.
Statistically significant separation of the error conditions (p < 0.05) was found, using Fisher's exact tests, for 10 of the 14 participants (71.4%) in the COT. At a group level, the classification results were found to be statistically significant (p = 1.9 × 10⁻¹¹).
Table 4. Single-trial classification results of EADT data. Overall accuracy calculated as the percentage of trials, of either class, correctly classified. SD refers to standard deviation. The participant for whom the highest overall accuracy was achieved is highlighted in italics. Group p-values were calculated by combining p-values using Fisher's method.

Distinctions in responses by condition and age
Previous literature has shown that different tasks can elicit differing ErrP waveforms [33]. In some cases, distinctions have been shown in ErrPs even when the errors are committed during variants of the same task [34,35]. Indeed, our findings are aligned with those of the previous literature on this point. Interestingly, when comparing the error conditions within each task, the key neurophysiological distinctions that we were able to identify were found in different components of the ErrP for the two tasks in this study.
In the EADT, the clearest distinction shown between the error conditions was in the amplitude of the Pe. We witnessed greater amplitudes of Pe in the colour condition than the repeat condition for both young and older adults. Previous studies, including some which were based on error awareness tasks, have shown a diminished Pe in errors of which participants are unaware, compared to errors of which they are aware [13][14][15][16]. Here, in the case of the colour condition, all necessary information for the participant to know whether they have committed an error is present, on-screen, in the current stimulus. With the repeat condition, however, participants are relying on their memory of the previous stimulus to determine whether or not they have committed the error. Indeed, Wilcoxon signed-rank tests found that participants were significantly more likely to be aware that they had committed a colour condition error than a repeat condition error. While this study was focused on trials in which participants signified awareness of their errors, it is possible that participants could be more confident in their assertion of the error for some trials than others. It is possible, therefore, that the higher amplitude of the Pe in the colour conditions, compared to the repeat conditions, is due to greater certainty and confidence that an error was committed. Previous studies have also identified the build-up rate of the Pe as a marker of evidence accumulation for error detection [17]. In young adults, the build-up rate to the Pe was found to be significantly greater in the colour condition than the repeat condition. This is a further indication that a greater degree of awareness may be present in the case of colour condition errors than repeat condition errors.
Some distinctions were also noted between the different age groups in the EADT. Older participants' responses were found to generate more positive amplitudes at both the ERN and Pe latencies, for both error conditions. A posterior-anterior shift in aging was also identified in the spatial distribution of the Pe.
In the COT, the most notable difference in time domain data appeared to result from a secondary component of the Pe. This occurred at around 500 ms, causing an increase in the amplitude of responses to condition 2 compared to those of condition 1 in the grand average signals. This gap remained until beyond 600 ms. A Wilcoxon signed-rank test found the amplitude difference, at its widest point (538 ms), to be statistically significant (p = 6.1 × 10⁻⁴). As discussed in section 1, secondary Pe components have previously been identified, and have been linked to conscious, evaluative processes [18,19]. This suggests that condition 2, in which the virtual robot steps off the target, having been aligned above it, elicits stronger responses in the aware aspect of the error response.
Table 5. Single-trial classification results of COT data. Overall accuracy calculated as the percentage of trials, of either class, correctly classified. SD refers to standard deviation. The participant for whom the highest overall accuracy was achieved is highlighted in italics. The group p-value was calculated by combining p-values using Fisher's method.

Single-trial classification
Across all participants who were included in the classification stage, we achieved a mean overall accuracy of 65.2% for the EADT data, and 65.6% for the COT data. The associated standard deviations were relatively high (8.2% and 7.6%, respectively) as, although statistically significant classification was not possible for some participants, high classification rates were achieved for others. Indeed, in the best cases, for both tasks, the error conditions were classified against each other with around 80% overall accuracy. Group p-values calculated using Fisher's method showed that, at a population level, statistically significant separation of the error conditions was achieved (p = 2.7 × 10⁻²⁵ for the EADT and p = 1.9 × 10⁻¹¹ for the COT). As a proof of concept, these classification accuracies show that it is possible to classify these subtly different error conditions, which could not be differentiated by previously explored metrics such as direction or severity, against each other using single-trial EEG. A Welch's t-test, comparing the results of young adults with those of older adults, returned non-significant results. Though this finding should be taken tentatively, due to the small number of older participants included in the classification phase, it suggests that our chosen classification strategy is robust across different age groups, despite some age-related neurophysiological differences.
In previous literature regarding error decoding, a wide variety of classification accuracies have been reported. When classifying errors against non-errors, some studies have been able to achieve very high single-trial classification rates. For example, SVM-based classification models have been used to achieve average accuracies of 80% [36] or even above 90% [5], a deep learning approach achieved an average accuracy of 84% [37], and Gaussian models have been reported to achieve a high of around 90% [38].
Classification of different error conditions against each other can be considered more challenging than error versus non-error classification, as the EEG signals in response to errors are expected to be more similar to each other than to the signals of non-errors. Nonetheless, some errors have been classified against each other on a single-trial basis with a high level of success. In a virtual robot reaching task, performed by two participants, Iturrate et al reported correct classification of left versus right sided errors with an impressive 90% accuracy [6]. Furthermore, in the same study, they were able to distinguish small versus larger errors with around 75% accuracy. Spüler and Niethammer reported an overall accuracy of 75.5% for the classification of execution errors against outcome errors (i.e. errors committed by a machine versus errors committed by a human) during a computer game task [9]. However, they did not find significant differences between movement errors occurring at different angles, highlighting the potential difficulty of differentiating errors based on subtle differences.
One of the challenges in error decoding is that data sets for error trials may be small, as errors often occur more rarely than correct actions, both in real-world scenarios [5] and experimental paradigms [39]. Small sample sizes are known to be challenging in classification problems [40,41]. This is exacerbated when attempting error versus error classification, as the error trials are divided into still smaller groups. Indeed, for both tasks of the present study, we were able to achieve higher classification accuracy for the class with more training samples, on average.
Given the challenges of comparing such similar error conditions as the ones in this study, we believe that the results are encouraging. Separation of the error conditions was above chance level for most participants across both tasks. While mean overall classification rates did not reach the accuracy of the most successful studies discussed above, this study has shown that it is indeed feasible to classify ErrPs of different error conditions against each other based on differences in cognitive process, or in the context of differing expected actions. The fact that overall accuracy of around 80% was achieved for some participants is particularly encouraging. In future, it may be interesting to investigate the use of other classification techniques such as those discussed above, especially if larger training sets are available, with the aim of increasing classification accuracy further.

Implications for BCI
Error detection is becoming an increasingly useful aspect of BCI [2]. It has proven to be utilisable in increasing the accuracy of existing BCI control techniques, such as motor imagery [42] and P300 [43], by performing immediate error correction [44]. Furthermore, error detection has been successfully integrated into various BCI systems as feedback for reinforcement learning (RL) strategies, allowing the systems to gradually improve over time [3][4][5]45]. As discussed in section 1, this creates the possibility of BCI becoming more autonomous [3,4]. RL-based systems such as these can work effectively as long as the classification accuracy exceeds chance level [2,3].
It has been shown, in previous literature, that different errors can elicit different ErrP waveforms [46,47]. Recently, a few studies have begun to classify different errors using single-trial EEG, based on aspects such as the direction of the error [6], the severity of the error [6], or whether the error was committed by the human themselves or by a machine [9].
In the COT, we presented a scenario in which a virtual robot was attempting to navigate towards, and grab, a target object among several non-target objects. This scenario could be used in an error-driven BCI. Each robot action would be followed by single-trial EEG classification, to tell the robot what kind of action the human had observed. If we employed simple error detection, we would be able to tell the robot when it had made an incorrect move. However, with the error categorisation displayed in this study, an extra layer of detail could be switched on for participants with statistically significant separation. In the case of condition 1 errors, we could tell the robot that the target is in the other direction, but is not in the adjacent location. In the case of condition 2 errors, we could tell the robot precisely that the target is in the location it just stepped away from. These principles could be applied to a number of BCI-based navigation or target selection scenarios.
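The extra information carried by each error category can be made concrete with a small sketch. It assumes a hypothetical one-dimensional row of numbered locations in which the robot has just moved one step from `prev` to `cur`; the function name, the feedback labels and this layout are illustrative assumptions, not details of the COT implementation.

```python
def candidate_targets(n_locations, prev, cur, feedback):
    """Locations that may still hold the target after an erroneous step.

    feedback is "error" (plain error detection), "condition_1" or
    "condition_2" (error categorisation as described in the text).
    """
    step = cur - prev
    if step not in (-1, 1):
        raise ValueError("the robot is assumed to move exactly one location")
    if feedback == "condition_2":
        # The robot stepped off the target: it was directly below `prev`.
        return [prev]
    # Any error implies the target lies opposite to the direction of travel.
    behind = [x for x in range(n_locations) if (x - cur) * step < 0]
    if feedback == "condition_1":
        # The target is in the other direction, but not the adjacent location.
        return [x for x in behind if x != prev]
    if feedback == "error":
        return behind
    raise ValueError("unknown feedback label")

# Robot moved from location 3 to location 4 in a row of 7 locations.
plain = candidate_targets(7, 3, 4, "error")
cond1 = candidate_targets(7, 3, 4, "condition_1")
cond2 = candidate_targets(7, 3, 4, "condition_2")
```

In this toy example, plain error detection leaves four candidate locations, whereas the categorised feedback narrows the search to three locations (condition 1) or a single location (condition 2).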
Investigating the EADT allowed us to provide further evidence that errors can be categorised in the absence of previously used metrics, with only subtle differences between error conditions. Statistically significant classification accuracy was achieved for most of the participants included in the classification phase in our study (17 of 25). Thus, the error categorisation displayed here is accurate enough to be utilised in a BCI, for immediate and specific error correction, or as an integral part of a learning system. This opens up the potential for more detailed information to be garnered about the category of error that has occurred, thus allowing for a BCI with more effective error correction and more efficient error-driven learning.

Conclusion
The error conditions considered in this study were very similar to one another. Nevertheless, due to the different cognitive processes required to recognise the errors in the EADT, and the different contexts in which the errors occurred in the COT, we were able to identify differences between the grand average ErrP waveforms of the different error conditions. In the EADT, the clearest distinction between the error conditions was found in the amplitude of the Pe. The colour conditions generally elicited greater amplitudes than the repeat conditions, leading us to speculate that the increased Pe in these conditions could be due to greater certainty that an error had been committed. In the COT we found distinctions in the ERN, and in a secondary component of the Pe. These distinctions led us to speculate that participants may have had a heightened anticipation of a correct action when the virtual robot was aligned above the target, ready to grab it.
Interestingly, we were able to classify the error conditions of both the EADT and the COT, the latter of which could be directly applied in a BCI, with over 65% mean overall accuracy, and around 80% in the best cases. Classification rates were above chance level (p < 0.05) for most participants, of those included in the classification phase of the study, for both tasks, and group-level analysis showed the single-trial separation of the different error conditions to be highly significant overall (p = 2.7 × 10⁻²⁵ for the EADT and p = 1.9 × 10⁻¹¹ for the COT). The ability to classify such similar errors using single-trial EEG, as we have shown here, is very promising for the future prospect of making error-driven BCI more efficient through the acquisition of more detailed information.
We believe that the findings of this study uncover new opportunities in brain-machine interaction, pushing towards a more autonomous BCI.