Real-time estimation of perceptual thresholds based on the electroencephalogram using a deep neural network

Background: Perceptual thresholds are measured in scientific and clinical settings to evaluate performance of the nervous system in essential tasks such as vision, hearing, touch, and registration of pain. Current procedures for estimating perceptual thresholds depend on the analysis of pairs of stimuli and participant responses, relying on the commitment and cognitive ability of subjects to respond accurately and consistently to stimulation. Here, we demonstrate that it is possible to measure the threshold for the perception of nociceptive stimuli based on non-invasively recorded brain activity alone using a deep neural network. New method: For each stimulus, a trained deep neural network performed a 2-interval forced choice procedure, in which the network had to choose which of two time intervals in the electroencephalogram represented post-stimulus brain activity. Network responses were used to estimate the perceptual threshold in real-time using a psychophysical method of limits. Comparison with existing methods: Network classification was able to match participants in reporting stimulus perception, resulting in average network-estimated perceptual thresholds that matched perceptual thresholds based on participant reports. Results: The neural network successfully separated trials containing brain responses from trials without and could consistently estimate perceptual thresholds in real-time during a Go-/No-Go procedure and a counting task. Conclusion: Deep neural networks monitoring non-invasively recorded brain activity are now able to accurately predict stimulus perception and estimate the perceptual threshold in real-time without any verbal or motor response from the participant.


Introduction
In 1860, the psychologist Gustav Fechner first recognized that inner consciousness could be measured in terms of behavior, and since then methods have been developed and refined for the systematic exploration of sensory systems (Leek, 2001). The evaluation of perceptual thresholds plays a crucial role in the evaluation of human sensory function in vision, hearing, touch and nociception. Evaluation of the perceptual threshold is also increasingly popular in pain research as stimulus perception can be used to assess altered central and peripheral nociceptive processing. For instance, mechanical and thermal perceptual thresholds are increased in patients with neuropathic pain and signs of central sensitization (Maier et al., 2010), or as a result of peripheral nerve dysfunction in diabetes (Courtin et al., 2020; Krämer et al., 2004). More recently, studies have suggested the use of intra-epidermal electric stimulation to assess nociceptive function by measuring the perceptual threshold to nociceptive specific stimulation (Doll et al., 2015; Hennings et al., 2017; Tanaka et al., 2021; Van den Berg and Buitenweg, 2018; van den Berg and Buitenweg, 2021; Van den Berg et al., 2020, 2022).
The evaluation of perceptual thresholds using psychophysical procedures relies on the commitment and cognitive ability of participants to respond accurately and consistently to stimulation regardless of long and sometimes boring experiments. The complex adaptive psychophysical procedures used to measure an unbiased perceptual threshold often preclude usage in non-standard adult populations or children (Smith et al., 2018). In some cases, estimation of the perceptual threshold based on report is not possible because subjects are unable to reliably communicate stimulus perception. In other cases, the physician cannot rely on reported perception, due to potential simulation or malingering while no organic reason for sensory loss can be identified (Austen and Lynch, 2004; Bruce and Newman, 2010). Even in standard healthy adult participants, perception report (i.e. behavior) is not always equal to perception (i.e. conscious access to sensory information), as participants might fail to report a perceived stimulus (also known as lapsing), or might report stimulus perception while actually no stimulus was perceived (also known as guessing). To potentially enable the accurate and objective assessment of perceptual thresholds in a wide variety of patient groups, we propose a fully automated approach that relies on non-invasive recordings of cortical brain activity rather than reported perception, to estimate the perceptual threshold.
When a (change in) visual stimulus is perceived during a psychophysical procedure, this results in the generation of the famous P300 peak in the event-related potential (Picton, 1992). The P300 is a positive peak in the human event-related potential and is considered a key marker of conscious access to sensory information (Rutiku et al., 2015; Salti et al., 2012). A similar peak with the same functional significance, referred to as the P2, can be observed in experiments using nociceptive stimulation (Mouraux and Iannetti, 2009). Peaks like the P300 and P2 are easily identified in averaged EEG responses, but remain difficult to detect on a single-trial basis. One of the best performing feature-based decoding and classification approaches (winner of the 2015 Kaggle BCI competition), consisting of a combination of xDAWN spatial filtering (Rivet et al., 2009) and Riemannian geometry classification (Barachant et al., 2012), achieves an area-under-the-curve (AUC) of around 0.8 for cross-subject classification of visual evoked potentials. An even higher AUC of 0.9 is achieved for the same dataset of visual evoked potentials using an end-to-end decoding and classification approach using a deep neural network (Lawhern et al., 2018). Such end-to-end decoding approaches using a deep neural network have the drawbacks of a high computational complexity and a large number of trainable model parameters. Nevertheless, provided that one has sufficient computational resources and access to a large amount of reliably labeled training data, a deep neural network could discover informative features and potentially find superior solutions to the classification problem by itself (Gemein et al., 2020; Roy et al., 2019). Although visualization and interpretation of the learned features inside such neural networks remains a challenge and topic of ongoing research, several approaches exist to visualize which parts of the recorded EEG contribute most to the neural network classifications, such as visualization of the network's sensitivity to the occlusion of parts of the input (Zeiler and Fergus, 2014).
In this work, we used a deep neural network to detect stimulus perception based on the EEG, and we used the neural network classification scores to control adaptive stimulation and compute perceptual thresholds in real-time (Fig. 1). We hypothesized that when a stimulus is perceived, task-related neural activity associated with the conscious access to sensory information will be present in the EEG and detectable by a deep neural network. Therefore, a sequence of stimulus-classification pairs could be used to estimate the perceptual threshold. The first challenge in this work was to use a deep neural network for the detection of brain responses to stimuli close to the perceptual threshold, i.e. stimuli that were difficult to perceive or sometimes not even reported as perceived by the participant. The second challenge was that the direct classification of perception using a deep neural network or other machine learning methods can suffer from an unknown calibration bias, introduced by the cross-subject application of the neural network (i.e. the classification score cannot be calibrated for an unseen subject) and non-stationarity of the EEG. We addressed this potential calibration bias in classification scores by performing a 2-interval forced choice (2IFC) classification task, where we compare the score of each post-stimulus interval to the score of each pre-stimulus interval. In the next sections, we will show that we can accurately estimate perceptual thresholds based on these classifications (pre-stimulus or post-stimulus) post-hoc, and we will provide a proof-of-concept that we might use this classification to control an adaptive stimulus sequence and estimate the perceptual threshold in real-time.

Data
The study made use of three different datasets for training and validation of the method. Here we will briefly describe the general outline of each dataset and proceed with providing the implementation details. The experiments presented here were approved by the university ethical review committee (nr. RP 2021-05 and RP 2021-176) and were conducted in accordance with the 1964 Declaration of Helsinki and its later amendments, and all participants provided written informed consent prior to participation in these experiments. The three different datasets are described in the text below and summarized in Table 1.

Fig. 1. Closed-loop estimation of perceptual thresholds based on neurophysiological activity using the presented methods. Stimuli are applied using an adaptive procedure to approach the perceptual threshold. Brain responses to those stimuli are measured using electroencephalography (EEG) and fed to a deep neural network to perform a psychophysical classification task. Subsequently, the stimulus sequence adapts based on classification of the network.

Dataset 1 (DS1)
Initial network training, validation, calibration and performance testing was done using a collection of data obtained from previous studies (Van den Berg et al., 2021a, 2021b; van den Berg and Buitenweg, 2021) and an ongoing study (registered as NL66136.100.18 at toetsingonline.nl) at the Department of Anesthesiology, Intensive Care and Pain Medicine of the St. Antonius Hospital (Nieuwegein, the Netherlands), and at the University of Twente (Enschede, the Netherlands). This dataset includes a total of 38 healthy participants (16 male, age 18-73), which we will refer to as DS1. Participants were electrically stimulated on the back of the right or left hand using an electrode for preferential activation of nociceptive afferents in the skin. Participants had to perform a go-/no-go (GN) task by pressing a button and briefly releasing the button when a stimulus was perceived. Stimulus amplitudes were selected by an adaptive psychophysical method of limits to center amplitudes around the perceptual threshold. Each participant received a total of 450 (on one hand) or 900 (on both hands) stimuli of three types, resulting in 150 or 300 stimulus-response pairs per type:
• A single square-wave cathodic pulse with a pulse-width of 0.21 ms.
• A double square-wave cathodic pulse with a pulse-width of 0.21 ms and an inter-pulse interval of 10 ms.
• A double square-wave cathodic pulse with a pulse-width of 0.21 ms and an inter-pulse interval of 40 ms.
Note that the evoked potential waveforms in response to these stimulus types are very similar (van den Berg and Buitenweg, 2021; van den Berg et al., 2020). Therefore, we included all stimulus types for training, validation and testing, to improve algorithm performance and generalizability of the results.

Dataset 2 (DS2)
Performance of the calibrated network in post-hoc EEG classification and threshold estimation was evaluated on a second independent dataset comprising 15 healthy participants (8 male, age 19-25), which we will refer to as DS2 (for more information, see Van den Berg et al., 2021a, 2021b). Participants were electrically stimulated on the back of the right hand using an electrode for preferential activation of nociceptive afferents in the skin. Participants had to perform a GN task by pressing a button and briefly releasing the button when a stimulus was perceived. Stimulus amplitudes were selected by an adaptive psychophysical method of limits to center amplitudes around the perceptual threshold. Each participant received a total of 130 stimuli of two types, resulting in 65 stimulus-response pairs per type:
• A single square-wave cathodic pulse with a pulse-width of 0.21 ms.
• A double square-wave cathodic pulse with a pulse-width of 0.21 ms and an inter-pulse interval of 10 ms.

Dataset 3 (DS3)
Performance of the calibrated network for real-time tracking and estimation of the perceptual threshold was evaluated in a proof-of-concept experiment on 8 healthy participants (5 male, age 19-24), which we will refer to as DS3. Participants were electrically stimulated on the back of the right hand with double-pulse stimuli (pulse-width: 0.21 ms, inter-pulse interval: 10 ms) using an electrode for preferential activation of nociceptive afferents in the skin.
Part I: The perceptual threshold based on reported stimulus perception was compared to the perceptual threshold estimated by the neural network. Participants had to perform a GN task by pressing a button and briefly releasing the button when a stimulus was perceived. The neural network performed a 2IFC classification task, where it had to classify which of two intervals (pre- and post-stimulus) contained post-stimulus brain activity (also see Section 2.2). Stimulus amplitudes were selected by two independent adaptive methods of limits (in randomized order) to center amplitudes around the perceptual threshold, where one used the neural network classification as feedback, and the other used button-release. Each participant received a total of 200 stimuli.
Part II: The perceptual threshold based on reported stimulus perception was compared to the perceptual threshold estimated by the neural network. In contrast with the earlier experiments, participants had to perform a GN task by counting each stimulus when perceived. The neural network performed a 2IFC classification task, where it had to classify which of two intervals (pre- and post-stimulus) contained post-stimulus brain activity (also see Section 2.2). Stimulus amplitudes were selected by two independent adaptive methods of limits (in randomized order) to center amplitudes around the perceptual threshold, where each used the neural network classification as feedback. Each participant received a total of 200 stimuli.

Table 1
Summary of the datasets used in this study, the psychophysical task performed by the participant and the classification task performed by the neural network in each dataset. Stimulus types are abbreviated as SP (single-pulse, 0.21 ms pulse-width), DP10 (double-pulse, 0.21 ms pulse-width, 10 ms inter-pulse interval) and DP40 (double-pulse, 0.21 ms pulse-width, 40 ms inter-pulse interval). Tasks are abbreviated as GN (go-/no-go) and 2IFC (2-interval forced choice). During the Go/No-go (GN) task the participant receives a sequence of stimuli at a randomized interval, and is instructed to report when a stimulus was detected, e.g. by releasing a response button. In the 2-Interval Forced Choice (2IFC) task the neural network classifies whether interval 1 or interval 2 contains post-stimulus brain activity. This classification task is repeated for each stimulus.

Estimation of perceptual thresholds
Perceptual thresholds were estimated by fitting a psychophysical model to stimulus-response pairs. Stimulus amplitudes were selected by an adaptive psychophysical method of limits aimed at tracking non-stationary perceptual thresholds. The original stimulation procedure from Doll et al. (2015) was designed for tracking perceptual thresholds using a GN task (Fig. 2). Here, we adapted the procedure to track and estimate perceptual thresholds for either a GN task, executed by the participant by briefly releasing the response button, or a 2IFC task, executed by the neural network by classifying which interval of the EEG contains post-stimulus brain activity.

Nociceptive stimulation
A custom-made electrode consisting of 5 microneedles (0.5 mm) embedded in a layer of flexible silicone (Steenbergen et al., 2012) was used to stimulate nociceptive afferents in the skin through intra-epidermal electric stimulation (Inui and Kakigi, 2012) using a constant current stimulator (NociTRACK AmbuStim, University of Twente, Enschede, the Netherlands). This type of electric stimulation was shown to preferentially activate nociceptive Aδ-fibers in the skin provided that the current remains below twice the perceptual threshold (Mouraux et al., 2010; Poulsen et al., 2020). Stimuli were applied with a uniformly randomized inter-stimulus interval of 3.5-4.5 s.

Stimulus selection
Stimuli were selected using an adaptive randomized procedure developed by Doll et al. (2015). Such an adaptive procedure is necessary to center stimulus amplitudes around a (time-dependent) perceptual threshold, in order to estimate the psychometric curve and associated perceptual threshold based on stimulus-response or stimulus-classification pairs (Doll et al., 2015; Leek, 2001; Treutwein, 1995). Randomization of the stimulus amplitude and inter-stimulus interval in this procedure is used to prevent an expectation bias.
One stimulus parameter (e.g. stimulus amplitude) is selected for threshold estimation. The value of this parameter is updated based on the response in the classification task as follows (Fig. 3):
1) The parameter value is randomly chosen from a vector V consisting of an odd number k of values separated by a fixed step size s.
2) If the stimulus is classified correctly, all vector values are decreased by $d_{\mathrm{correct}}$. If it is classified incorrectly, all vector values are increased by $d_{\mathrm{incorrect}}$.
To let the mean of the applied parameter values converge to the perceptual threshold, it should hold that $\frac{d_{\mathrm{incorrect}}}{d_{\mathrm{correct}} + d_{\mathrm{incorrect}}} = p_{\mathrm{threshold}}$. For a GN task, the perceptual threshold is defined as the parameter value where 50% of the stimuli are correctly classified, i.e. $p_{\mathrm{threshold}} = 0.5$. For a 2IFC task, the perceptual threshold is defined as the parameter value where 75% of the stimuli are correctly classified, i.e. $p_{\mathrm{threshold}} = 0.75$. For the adaptive stimulation procedure in DS1 and DS2, we used k = 5 values, s = 0.025 mA and $d_{\mathrm{correct}}$ = 0.025 mA. For the demonstration experiment (DS3), we explored through simulation which combination of the parameters k, s and $d_{\mathrm{correct}}$ is optimal for estimating the perceptual threshold in each procedure. We found that the best estimates of the nociceptive perceptual threshold to double-pulse stimuli could be obtained in both procedures with k = 7 values, s = 0.008 mA and $d_{\mathrm{correct}}$ = 0.008 mA, and used these settings for the adaptive stimulation procedure in DS3.
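The update rule and convergence condition can be illustrated with a short simulation. This is a minimal sketch, not the authors' implementation: the function names, the simulated observer (true threshold 0.3 mA, spread 0.05 mA) and the lower bound at 0 mA are assumptions made for illustration. The sketch lowers all candidate amplitudes after a correct trial and raises them after an incorrect one, which is the direction for which the convergence condition above gives a stable equilibrium at $p_{\mathrm{threshold}}$.

```python
import math
import random


def d_incorrect_for(d_correct, p_threshold):
    """Solve d_incorrect / (d_correct + d_incorrect) = p_threshold."""
    return d_correct * p_threshold / (1.0 - p_threshold)


def run_method_of_limits(respond, n_trials, k, s, d_correct, p_threshold, seed=0):
    """Return the applied amplitudes (mA) and outcomes of one adaptive run."""
    rng = random.Random(seed)
    d_incorrect = d_incorrect_for(d_correct, p_threshold)
    vector = [i * s for i in range(k)]  # k values, initialized starting from 0 mA
    amplitudes, outcomes = [], []
    for _ in range(n_trials):
        amplitude = rng.choice(vector)  # random pick from the vector
        correct = respond(amplitude, rng)
        # Lower all values after a correct trial, raise them after an incorrect one.
        shift = -d_correct if correct else d_incorrect
        shift = max(shift, -vector[0])  # assumption: amplitudes never drop below 0 mA
        vector = [v + shift for v in vector]
        amplitudes.append(amplitude)
        outcomes.append(correct)
    return amplitudes, outcomes


def simulated_gn_observer(amplitude, rng, threshold=0.3, spread=0.05):
    """Hypothetical GN observer with a cumulative-normal detection probability."""
    z = (amplitude - threshold) / spread
    p_detect = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return rng.random() < p_detect


# DS1/DS2 settings from the text: k = 5, s = 0.025 mA, d_correct = 0.025 mA.
amps, _ = run_method_of_limits(simulated_gn_observer, 2000, k=5, s=0.025,
                               d_correct=0.025, p_threshold=0.5)
mean_amplitude = sum(amps[500:]) / len(amps[500:])
print(f"mean applied amplitude after burn-in: {mean_amplitude:.3f} mA")
```

For a 2IFC task with $p_{\mathrm{threshold}} = 0.75$, the same condition gives an upward step three times the downward step, e.g. `d_incorrect_for(0.025, 0.75)` is approximately 0.075 mA.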

Psychophysical model
The probability of correct classification or detection of a sensory stimulus p(β, x) was modeled using the psychometric function described by Eq. (1).
In Eq. (1), x denotes a vector of n stimulus parameters (e.g. stimulus amplitude, stimulus duration, number of previous stimuli), β denotes the vector of effect sizes of those stimulus parameters and Φ denotes the cumulative normal probability distribution function. Furthermore, $\lambda_{\mathrm{upper}}$ and $\lambda_{\mathrm{lower}}$ denote the upper and lower classification limit, respectively. The upper classification limit is the probability of correct classification when the values of the stimulus parameters go to infinity. The lower classification limit is the probability of correct classification when the values of the stimulus parameters go to 0. For post-hoc classification of DS2, with both single-pulse and double-pulse stimuli, we modeled classification/detection probability as a function of the first pulse amplitude ($p_1$), the second pulse amplitude ($p_2$) and the number of received stimuli (t) (Eqs. (2.1) and (2.2)). For real-time classification in DS3, we modeled classification/detection probability as a function of the pulse amplitude (p) and the number of received stimuli (t) (Eqs. (2.3) and (2.4)). Note that $\lambda_{\mathrm{upper}}$ and $\lambda_{\mathrm{lower}}$ were set to 1 and 0, respectively, in the case of a GN procedure, and to 1 and 0.5, respectively, in the case of a 2IFC procedure.
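As an illustration of how the classification limits shape the curve, the sketch below uses one common form that is consistent with the definitions above: $p = \lambda_{\mathrm{lower}} + (\lambda_{\mathrm{upper}} - \lambda_{\mathrm{lower}})\,\Phi(\beta \cdot x)$, where x includes a constant term. This form is an assumption made for illustration, since Eq. (1) is not reproduced in this text, and the function names and example coefficients are hypothetical.

```python
import math


def normal_cdf(z):
    """Cumulative standard normal distribution, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def psychometric(beta, x, lam_lower, lam_upper):
    """Assumed form: lam_lower + (lam_upper - lam_lower) * Phi(beta . x)."""
    linear = sum(b * xi for b, xi in zip(beta, x))
    return lam_lower + (lam_upper - lam_lower) * normal_cdf(linear)


def threshold_amplitude(beta_0, beta_amplitude):
    """Amplitude where Phi(beta_0 + beta_amplitude * amplitude) equals 0.5."""
    return -beta_0 / beta_amplitude


# Hypothetical coefficients: intercept -6 and slope 20 per mA (threshold 0.3 mA).
beta = [-6.0, 20.0]
x_at_threshold = [1.0, threshold_amplitude(*beta)]
print(psychometric(beta, x_at_threshold, lam_lower=0.0, lam_upper=1.0))  # GN
print(psychometric(beta, x_at_threshold, lam_lower=0.5, lam_upper=1.0))  # 2IFC
```

Under this assumed form, the amplitude at which Φ equals 0.5 corresponds to 50% correct in a GN procedure and 75% correct in a 2IFC procedure, matching the threshold definitions used for stimulus selection.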

Threshold estimation
Fig. 3. Randomized method of limits used to center stimulus amplitudes around the perceptual threshold (Doll et al., 2015). A vector of k stimulus amplitudes separated by a step size s is initialized starting from 0, or centered around an initial estimate of the perceptual threshold. The stimulus amplitude is chosen randomly from this vector. The vector is decreased by $d_{\mathrm{correct}}$ if the stimulus is detected (in GN, based on the participant response) or if the reported interval is correct (in 2IFC, based on neural network classification). The vector is increased by $d_{\mathrm{incorrect}}$ if the stimulus was not detected (in GN, based on the participant response) or if the reported interval is incorrect (in 2IFC, based on neural network classification).

B. van den Berg et al.

The psychometric function and corresponding perceptual thresholds were estimated based on the stimulus-classification pairs (SCPs) obtained using the adaptive procedure in Section 2.2.2. Parameters of the psychometric function were estimated by minimizing the negative log-likelihood using an implementation of the GlobalSearch algorithm (Ugray et al., 2007) in combination with an interior-point algorithm to find local minima (Coleman and Li, 1996) in Matlab.
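The likelihood-based fit can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in the study: a coarse-to-fine grid search stands in for GlobalSearch with an interior-point solver, the model has only a threshold and a spread parameter (no effect of the number of received stimuli), and the data are simulated from a hypothetical GN observer. All function and parameter names are illustrative.

```python
import math
import random


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def negative_log_likelihood(threshold, spread, pairs, lam_lower, lam_upper):
    """Bernoulli negative log-likelihood of (amplitude, correct) pairs."""
    nll = 0.0
    for amplitude, correct in pairs:
        p = lam_lower + (lam_upper - lam_lower) * normal_cdf((amplitude - threshold) / spread)
        p = min(max(p, 1e-9), 1.0 - 1e-9)  # keep the logarithm finite
        nll -= math.log(p) if correct else math.log(1.0 - p)
    return nll


def fit_threshold(pairs, lam_lower, lam_upper, max_amplitude=1.6, steps=20, rounds=3):
    """Coarse-to-fine grid search (a stand-in for a global optimizer)."""
    t_lo, t_hi = 0.0, max_amplitude
    s_lo, s_hi = 0.005, 0.5
    best = None
    for _ in range(rounds):
        for i in range(steps + 1):
            t = t_lo + (t_hi - t_lo) * i / steps
            for j in range(steps + 1):
                s = s_lo + (s_hi - s_lo) * j / steps
                nll = negative_log_likelihood(t, s, pairs, lam_lower, lam_upper)
                if best is None or nll < best[0]:
                    best = (nll, t, s)
        # Zoom in around the best grid point found so far.
        _, t_best, s_best = best
        t_half, s_half = (t_hi - t_lo) / 8.0, (s_hi - s_lo) / 8.0
        t_lo, t_hi = max(0.0, t_best - t_half), min(max_amplitude, t_best + t_half)
        s_lo, s_hi = max(1e-3, s_best - s_half), s_best + s_half
    return best[1], best[2]


# Simulated GN data: hypothetical observer with threshold 0.3 mA, spread 0.05 mA.
rng = random.Random(1)
pairs = []
for _ in range(400):
    amplitude = rng.uniform(0.1, 0.5)
    p_detect = normal_cdf((amplitude - 0.3) / 0.05)
    pairs.append((amplitude, rng.random() < p_detect))

threshold, spread = fit_threshold(pairs, lam_lower=0.0, lam_upper=1.0)
print(f"estimated threshold: {threshold:.3f} mA, spread: {spread:.3f} mA")
```

For a 2IFC procedure the same routine would be called with `lam_lower=0.5`, so that the fitted threshold corresponds to 75% correct classification.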

Performance of threshold estimation
Using DS2, the performance of post-hoc threshold estimation based on neural network classifications was 1) compared to the performance of threshold estimation using only the maximum value at the Cz channel during each interval for classification, and 2) compared to the performance of threshold estimation using a random score during each interval for classification. Differences between participant thresholds and neural network (or maximum value) estimated thresholds were assessed based on Bland-Altman analysis (Bland and Altman, 1986), using the BlandAltmanPlot function available on the Matlab File Exchange. Differences in the limits of agreement between neural network and maximum value estimated thresholds were tested by estimating the 95% confidence interval for the difference between limits of agreement (as the confidence interval between two sample means (Pfister and Janczyk, 2013)), and assessing whether the confidence interval included zero. The neural network classifications in DS2 were also used to compute explanations of the network classifications, which is further described in Section 2.3.4.
Using DS3, the performance of adaptive stimulation and real-time threshold estimation based on neural network classifications was assessed based on Bland-Altman analysis. In addition, the difference between stimulation amplitudes and estimated perceptual thresholds was assessed as a potential marker for the reliability of perceptual thresholds estimated using this paradigm.
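For readers unfamiliar with Bland-Altman analysis, the bias and limits of agreement can be computed as in the sketch below. It uses the standard definition (mean difference ± 1.96 standard deviations of the differences); the threshold values are made-up numbers for illustration only, not data from this study, and the function name is hypothetical.

```python
import statistics


def bland_altman(reference, estimate):
    """Return the bias and the lower and upper 95% limits of agreement."""
    differences = [e - r for r, e in zip(reference, estimate)]
    bias = statistics.mean(differences)
    sd = statistics.stdev(differences)  # sample standard deviation
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Illustrative thresholds in mA (hypothetical, not study data).
participant_thresholds = [0.20, 0.30, 0.40, 0.50]
network_thresholds = [0.21, 0.29, 0.41, 0.49]

bias, loa_lower, loa_upper = bland_altman(participant_thresholds, network_thresholds)
print(f"bias: {bias:.4f} mA, limits of agreement: {loa_lower:.4f} to {loa_upper:.4f} mA")
```

Narrow limits of agreement around a bias close to zero indicate that the two methods can be used interchangeably for the quantity being measured.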

EEG recording
The EEG was recorded using a SAGA amplifier (TMSi, Oldenzaal, the Netherlands) with a sampling rate of 1024 Hz at 32 Ag/AgCl electrodes that were placed on the scalp according to the international 10-20 system.

Classification task
Electroencephalography data were filtered by a causal high-pass filter of 0.1 Hz and a causal low-pass filter of 100 Hz, and downsampled to 512 Hz. Data were classified by a multilayer convolutional neural network performing a 2IFC classification task. The EEG was separated into an interval of pre-stimulus activity, ranging from −1.5 to −0.5 s, and an interval of post-stimulus activity, ranging from 0.05 to 1.05 s with respect to stimulus onset. Note that the first 50 ms following stimulation were excluded from the post-stimulus interval to rule out any influence of stimulation artifact. Each interval was centered by subtracting the mean and scaled by the standard deviation. The neural network was trained to label these EEG intervals as either pre-stimulus or post-stimulus activity. In the 2IFC classification task, the neural network computed a score for each interval, representing the probability that the interval represented post-stimulus activity. If the score of the second (post-stimulus) interval was higher than the score of the first (pre-stimulus) interval, the network classification was labeled as correct. Otherwise, the network classification was labeled as incorrect. The randomized method of limits described above was performed based on whether the interval classification by the neural network was correct or incorrect.
Note that the same neural network could also be used to classify whether a stimulus was perceived or not by determining if the network score of the post-stimulus interval is higher than a cutoff value, which is equal to performing a GN task. However, this would introduce an estimation bias in the computed perceptual threshold, as network scores were calibrated based on the entire test set, but not for each individual new participant. Furthermore, as brain activity is non-stationary, the potential calibration bias of neural network scores could vary over time. Therefore, the best reference we could compare the score of the post-stimulus interval with is the score of the pre-stimulus interval. This comparison between post- and pre-stimulus interval scores is what we refer to as the 2IFC classification task.
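The 2IFC comparison described above can be sketched for a single channel as follows. The interval boundaries and the 512 Hz sampling rate are taken from the text; everything else is an assumption for illustration: the function names are hypothetical, the sketch handles one channel rather than the full multichannel input, and `peak_score` is a toy stand-in for the trained network (it is not the network's scoring function).

```python
import math
import statistics

FS = 512  # sampling rate after downsampling (Hz)


def window(signal, onset, start_s, length_s=1.0, fs=FS):
    """Cut a 1 s interval starting start_s seconds relative to stimulus onset."""
    start = onset + round(start_s * fs)
    return signal[start:start + round(length_s * fs)]


def standardize(interval):
    """Center by the mean and scale by the standard deviation."""
    mean = statistics.fmean(interval)
    sd = statistics.pstdev(interval)
    return [(value - mean) / sd for value in interval]


def two_ifc_correct(signal, onset, score_fn):
    """Correct if the post-stimulus interval scores higher than the pre-stimulus one."""
    pre = standardize(window(signal, onset, -1.5))   # -1.5 to -0.5 s
    post = standardize(window(signal, onset, 0.05))  # 0.05 to 1.05 s
    return score_fn(post) > score_fn(pre)


def peak_score(interval):
    """Toy stand-in for the network score: the largest standardized value."""
    return max(interval)


# Synthetic single-channel trace: a small oscillation plus an evoked peak
# about 0.3 s after stimulus onset (purely illustrative).
onset = 1024
signal = [0.1 * math.sin(2.0 * math.pi * 10.0 * i / FS)
          + 10.0 * math.exp(-((i - (onset + 150)) / 20.0) ** 2)
          for i in range(2560)]
print(two_ifc_correct(signal, onset, peak_score))
```

Because only the ordering of the two scores matters, any monotonic miscalibration of the scores for a new participant leaves the 2IFC outcome unchanged, which is the motivation given in the text for preferring this comparison over a fixed cutoff.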

Deep neural network for EEG classification
An adapted version of a convolutional neural network developed for EEG-based BCIs, EEGNet (Lawhern et al., 2018), was used for EEG classification in Matlab. The epochs in DS1 that were reported as 'detected' by participants were used for training, validation and testing. The detected epochs were split into a training set of 30868 intervals, a validation set of 3198 intervals and a test set of 4580 intervals, where half of the intervals were pre-stimulus and half of the intervals were post-stimulus. In the training and validation set, the intervals where no EEG was measured due to technical issues and the intervals where the maximum value over all channels was larger or smaller than that of 97.5% of all intervals were excluded. Intervals with technical issues or extreme noise were not excluded from the test set to have a representative distribution of test data, i.e. close to the distribution we would encounter with real-time classification. Recordings from the same subject were assigned to the same set (train, validation or test), to prevent a bias through learning of subject-specific artefacts. The EEG intervals were shuffled within each set, and each interval was centered by subtracting the mean and scaled by the standard deviation.
Network and training parameters were explored using the training and validation set, resulting in the final network architecture shown in Table 2. Training was done with a total of 9 epochs, a mini-batch size of 128, a gradient threshold of 2, and an initial learning rate of 0.1, which was dropped by a factor of 0.2 every 3 epochs. Following training, the test set was used to evaluate classification performance, and to calibrate neural network scores with respect to the true classification probability using binomial regression with a complementary log-log link function. In addition, the accuracy and AUC of the calibrated neural network scores were evaluated using 10-fold cross-validation, where in each non-overlapping fold 80% of DS1 was used for training, 10% for validation and 10% for testing.
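The learning-rate schedule stated above can be written out explicitly. The sketch assumes that "dropped by a factor of 0.2" means the rate is multiplied by 0.2 after every completed block of 3 epochs (a piecewise-constant schedule); the function name and this interpretation are assumptions, not details confirmed by the text.

```python
def learning_rate(epoch, initial=0.1, drop_factor=0.2, drop_period=3):
    """Learning rate for a 1-based epoch under a piecewise-constant schedule."""
    completed_drops = (epoch - 1) // drop_period
    return initial * drop_factor ** completed_drops


# The 9 training epochs reported in the text.
for epoch in range(1, 10):
    print(epoch, round(learning_rate(epoch), 6))
```

Under this reading, epochs 1 to 3 use a rate of 0.1, epochs 4 to 6 a rate of 0.02, and epochs 7 to 9 a rate of 0.004.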

Network classification explanations
Explanatory maps of the importance of each channel at several latencies for classification were obtained by computing the occlusion sensitivity (Zeiler and Fergus, 2014). In this technique, parts of the original input are systematically ablated to compute how much the output score changes by missing that part of the input signal. Occlusion sensitivity was computed for each channel in each interval using a mask of 50 samples and a stride of 10 samples. Nearest neighbor interpolation was used to resize the computed occlusion sensitivity to the original input size. Outliers were removed from the set of occlusion sensitivity maps where the standard deviation was larger than 0.05 or lower than 0.001. Overall occlusion sensitivity was determined by computing the median occlusion sensitivity over all intervals at each latency.
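A minimal single-channel sketch of occlusion sensitivity with the mask and stride stated above is given below. It is an illustration under assumptions rather than the study's implementation: occluded samples are replaced by zeros, the nearest-neighbour resize uses a simple index mapping, and `sum_score` is a toy scoring function standing in for the network output. All names are hypothetical.

```python
def occlusion_sensitivity(signal, score_fn, mask=50, stride=10, fill=0.0):
    """Score drop caused by occluding each mask position, resized to len(signal)."""
    baseline = score_fn(signal)
    coarse = []
    for start in range(0, len(signal) - mask + 1, stride):
        occluded = signal[:start] + [fill] * mask + signal[start + mask:]
        coarse.append(baseline - score_fn(occluded))
    # Nearest-neighbour resize of the coarse map to the original input length.
    n = len(signal)
    return [coarse[i * len(coarse) // n] for i in range(n)]


def sum_score(interval):
    """Toy stand-in for the network score."""
    return sum(interval)


# A 512-sample interval that is zero except for a short burst of activity.
interval = [0.0] * 512
for i in range(200, 220):
    interval[i] = 1.0

sensitivity = occlusion_sensitivity(interval, sum_score)
print(max(sensitivity), sensitivity[0])
```

Positions where occlusion removes the burst show the largest score drop, while positions far from it show none; applied per channel, maps of this kind indicate which latencies drive the classification.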

Analysis of EEG data
Electroencephalography data were processed using Fieldtrip, a Matlab toolbox for EEG and MEG signal analysis (Oostenveld et al., 2011). Similar to the data used for classification, continuous EEG data were filtered by a causal high-pass filter of 0.1 Hz and a causal low-pass filter of 100 Hz. Artefacts caused by eye movements and blinking were removed using independent component analysis (Delorme et al., 2007). Epochs with excessive EMG activity or other remaining artefacts were removed through visual inspection. The averages shown in the results section either show filtered data at Cz, or data that were filtered, cleaned and re-referenced at Cz-M1M2, which will be indicated explicitly. Contrasts between detected and non-detected and between correct and incorrect intervals, as well as the contrast between post-stimulus and baseline, were evaluated using cluster-based non-parametric statistical testing (Maris and Oostenveld, 2007).

Neural Network Performance
Classification performance was initially evaluated on a single original test set in DS1 (also see Section 2.3.3).The classification performance of the neural network on this test set and calibrated network scores are shown in Fig. 4. Here, the network classified post-stimulus and pre-stimulus EEG with an accuracy of 0.84 and AUC of 0.92.The network performed better classifying pre-stimulus EEG in comparison with classifying post-stimulus EEG.
Classification accuracy and AUC were also evaluated using 10-fold cross-validation, which was used to obtain more accurate estimates of network performance by averaging across multiple folds. Classification accuracy and AUC during 10-fold cross-validation are shown in Table 3. Although some folds show a similar classification performance as on the original test set, the average classification performance is lower with an average accuracy of 0.75 (± 0.05 std.) and an average AUC of 0.82 (± 0.06 std.).
Calibrated network scores should ideally be equal to the true probability of an interval being post-stimulus brain activity, corresponding to the diagonal line in Fig. 4. Calibrated network scores were almost diagonal, showing that network scores after calibration were a good estimator of the average probability, and therefore suitable for EEG classification in a psychophysical procedure.

Perceptual thresholds
The second dataset (DS2) was used to evaluate network performance in post-hoc estimation of the perceptual threshold. Two participants were excluded from this dataset due to failure to perform the detection task, defined as a detection rate of below 0.2 for one or both stimulus types. Performance of post-hoc threshold estimation based on neural network classifications was compared 1) to the performance of threshold estimation using only the maximum value at the Cz channel during each interval, and 2) to the performance of threshold estimation using a random score during each interval. An estimate of the perceptual threshold was considered invalid if the optimizer failed to converge to a solution, or if the estimated perceptual threshold was below 0 mA or above the maximum limit of stimulus amplitude in the experimental setup of 1.6 mA. As is shown in Table 4, using a random classifier almost always led to an invalid estimate of the perceptual threshold, while using the maximum value at Cz or using neural network scores returned more valid estimates.
The boxplot in Fig. 5 shows that using the maximum value at Cz for interval classification usually leads to a threshold estimate that is larger than the reference value (based on participant responses). Bland-Altman analysis in Fig. 5 shows that using the maximum value at Cz for interval classification leads to relatively large limits of agreement (LoA) of −0.30 and 0.56 for single-pulse stimuli and of −0.04 and 0.12 for double-pulse stimuli. The estimate of the perceptual threshold for double-pulse stimuli is significantly biased as the 95% confidence interval (CI) of the mean is above 0.
The boxplot in Fig. 6 shows that using the neural network score for interval classification usually leads to a threshold estimate that is similar to the reference value (based on participant responses), with one exception marked with a red circle. Bland-Altman analysis in Fig. 6 shows that using the neural network score for interval classification leads to limits of agreement (LoA) of −0.08 and 0.05 for single-pulse stimuli and of −0.03 and 0.05 for double-pulse stimuli, with no significant estimation bias. The upper limits of agreement of single- and double-pulse threshold estimates based on neural network scores were significantly lower (i.e. closer to zero) than the upper limits of agreement of estimates based on maximum values (p < .05). The lower limit of agreement of single-pulse threshold estimates based on neural network scores was significantly higher (i.e. closer to zero) than the lower limit of agreement of estimates based on maximum values (p < .05).

Typical Examples
Tracked perceptual thresholds and average filtered EEG activity at Cz (not cleaned for artefacts) for an exceptional case (marked by red circles in Fig. 6) are shown in Fig. 7. Note that this participant showed a very large difference between the threshold computed based on participant responses and the threshold computed based on brain activity. Averaged EEG activity shows that central evoked responses were also present for stimuli reported as non-detected. As a consequence, the neural network successfully classified post-stimulus brain activity in 123 out of 130 epochs. This resulted in an inaccurate estimate of the perceptual threshold, because too few incorrectly classified stimuli were available for threshold estimation.
Tracked perceptual thresholds and average filtered EEG activity at Cz (not cleaned for artefacts) for a typical participant are shown in Fig. 8. Stimulus amplitudes were distributed around the estimated perceptual thresholds, and the perceptual thresholds based on neural network classification were close to the perceptual thresholds based on participant responses. Note that the brain activity in response to non-detected or incorrectly classified stimuli remained close to baseline.

Evoked Potential
The average evoked potential at Cz-M1M2 (cleaned) in response to detected/correct and non-detected/incorrect stimulus-response and stimulus-classification pairs is shown in Fig. 9. There was a significant contrast (p < .05) between detected and non-detected and between correct and incorrect epochs. Note that the average evoked potential in response to correctly classified epochs was lower than the average evoked potential in response to detected epochs, because of the lower limit of 0.5 on the probability of correct classification in a 2IFC task. As such, the average evoked potential of correctly classified epochs could also include epochs with little or no brain activity. Also note that the average evoked potential in response to non-detected and incorrectly classified epochs was significant with respect to baseline at some latencies.
The average amplitude of the peak between 380 and 420 ms for several detection probabilities is displayed on the right of Fig. 9. In order to compare detection probability during the GN task performed by the participant (ranging from 0 to 1) with detection probability during the 2IFC classification task performed by the neural network, detection probability values during 2IFC were transformed to the same range of 0-1. The average peak amplitude appears to increase proportionally with detection probability for participant as well as neural network responses.
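The exact transformation is not specified here. One common choice, shown below purely as an assumption, is the linear correction for guessing, which maps the 2IFC chance level of 0.5 to 0 and perfect classification to 1.

```python
def rescale_2ifc_probability(p_correct):
    """Map a 2IFC proportion correct (chance level 0.5) onto the range 0-1.

    Assumed linear rescaling p' = 2p - 1; values below chance are clipped to 0.
    """
    scaled = 2.0 * p_correct - 1.0
    return min(1.0, max(0.0, scaled))
```

With this mapping, a network that classifies 75% of intervals correctly at a given amplitude corresponds to a detection probability of 0.5.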

Network classification explanations
Average occlusion sensitivity topographies in Fig. 10 show that the neural network mainly focuses on a single dipole with a maximum at CP1 and a minimum at P3. The average evoked potential at Cz (not cleaned for artefacts) is compared to the occlusion sensitivity at CP1 and P3 in Fig. 11. The occlusion sensitivity at P3 is minimal and the occlusion sensitivity at CP1 is maximal around the peak at Cz occurring between 380 and 420 ms.
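As a reminder of what occlusion sensitivity measures, the sketch below occludes one channel over one time window at a time and records the resulting change in the classifier score. It is a generic illustration only: the occlusion value (zero), the window length, the epoch layout (a list of channels, each a list of samples) and the `score_fn` callable are assumptions of this sketch and do not describe the implementation used in this study.

```python
def occlusion_sensitivity(epoch, score_fn, window=10):
    """Score change caused by occluding each (channel, time window) patch.

    epoch    : list of channels, each a list of samples (assumed layout)
    score_fn : hypothetical callable returning a scalar classifier score
    window   : number of samples occluded at a time (assumed value)
    Returns one row per channel with one sensitivity value per window.
    """
    baseline = score_fn(epoch)
    n_samples = len(epoch[0])
    sensitivity = []
    for ch in range(len(epoch)):
        row = []
        for start in range(0, n_samples, window):
            # Copy the epoch so that each patch is occluded independently.
            occluded = [list(channel) for channel in epoch]
            for t in range(start, min(start + window, n_samples)):
                occluded[ch][t] = 0.0
            # Positive values: occluding this patch lowers the score.
            row.append(baseline - score_fn(occluded))
        sensitivity.append(row)
    return sensitivity
```

Averaging such maps over epochs and participants gives channel-by-latency sensitivities of the kind summarized in Figs. 10 and 11.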

Perceptual Thresholds
As a proof-of-concept, an experiment was performed (DS3) where we used the neural network scores to control adaptive stimulation in real-time and to estimate the perceptual threshold based on the resulting stimulus-classification pairs. In Part I, we simultaneously controlled stimulus amplitude based on an adaptive method of limits using participant responses, and based on an adaptive method of limits using neural network classification. In Part II, we simultaneously controlled stimulus amplitude based on two independent instances of the adaptive method of limits using neural network classification.
The boxplot in Fig. 12 shows that using the neural network score for interval classification usually leads to a threshold estimate that is similar to the reference value (based on participant responses) in Part I, where participants were assigned the task of releasing a response button when a stimulus was perceived. In Part II, where participants were assigned the task of counting the number of perceived stimuli, estimates by the two independent instances of the neural network appear less consistent in 2 participants. Table 5 shows a potential reason for inaccuracy in those estimates. The average absolute difference between stimulus amplitudes and the estimated perceptual threshold (|Δ|) tends to be larger when the estimates in Fig. 12 are less consistent (marked in orange). Note that a large |Δ| indicates that the stimuli applied to the participant were far from the perceptual threshold, which could make it more difficult for optimization algorithms to find parameters of the psychometric function and the corresponding perceptual threshold.

Table 4
The number of invalid threshold estimates (out of a total of 13 estimates) for each type of classifier, where the optimizer failed to converge to a solution or the estimated threshold was below 0 or above 1.6 mA.

Bland-Altman analysis in Fig. 12 shows that using the neural network score for interval classification leads to limits of agreement (LoA) of −0.09 and 0.06 with respect to estimates based on participant responses during a GN button-release task. Using two instances of the neural network for interval classification during a GN counting task leads to limits of agreement of −0.09 and 0.10 between the estimates of both instances.

Typical examples
An exceptional case where estimation of the perceptual threshold was inaccurate in Part II is shown in Fig. 13. The threshold estimates in Part I appear to be consistent and stable. A |Δ| of 0.04 and of 0.06 for the neural network threshold and the participant threshold respectively (Table 5) indicates that the stimuli applied were close to the estimated perceptual threshold. Evoked responses at Cz (not cleaned for artefacts) that were incorrectly classified remain close to baseline, while a clear response is visible in the average of correctly classified stimuli. This confirms that the neural network performed well in the classification of evoked responses. Nevertheless, the estimate of threshold 2 in Part II is clearly inaccurate, which is signaled by a large |Δ| of 0.28 (Table 5).
Tracked perceptual thresholds and average filtered EEG activity at Cz (not cleaned for artefacts) for a typical participant in Parts I and II are shown in Fig. 14. Stimulus amplitudes were distributed around the estimated perceptual thresholds. The perceptual thresholds based on neural network classification were close to the perceptual thresholds based on participant responses in Part I. Both independent estimates of the perceptual threshold based on neural network responses in Part II are consistent. Note that the average evoked response to incorrectly classified stimuli remained close to baseline, while an evoked response is visible in the average of correctly classified stimuli.
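The |Δ| values quoted in this section follow directly from the applied stimulus amplitudes and the estimated threshold. A minimal sketch is given below; the function names are hypothetical, and the 0.1 limit is the value used for marking estimates in Table 5.

```python
def mean_abs_deviation_from_threshold(amplitudes_ma, threshold_ma):
    """Average absolute difference (|Δ|) between stimulus amplitudes and the threshold, in mA."""
    return sum(abs(a - threshold_ma) for a in amplitudes_ma) / len(amplitudes_ma)


def is_suspect(amplitudes_ma, threshold_ma, limit_ma=0.1):
    """Flag an estimate whose |Δ| exceeds the limit used for marking in Table 5."""
    return mean_abs_deviation_from_threshold(amplitudes_ma, threshold_ma) > limit_ma
```

Under this rule, the Part II estimate of the exceptional case (|Δ| = 0.28) would be flagged, whereas its Part I estimates (|Δ| = 0.04 and 0.06) would not.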

Discussion
In this work, we proposed a fully automated approach to estimate perceptual thresholds based on cortical brain activity rather than based on reported perception. To accurately estimate the perceptual threshold based on cortical activity, two challenges needed to be addressed. First, we needed the ability to accurately classify brain activity evoked by stimuli close to the perceptual threshold. Second, we needed to make sure that automated estimation based on neural network classification results in an unbiased estimate of the perceptual threshold.
We hypothesized that when a stimulus is perceived, task-related neural activity can be detected by a deep neural network and used for estimation of the perceptual threshold. In accordance with this hypothesis, neural network classification of post-stimulus EEG performed well, with an accuracy of 0.84 and AUC of 0.92 on the test set, and an accuracy of 0.75 and AUC of 0.82 during 10-fold cross-validation. In this study, we had the luxury of a large training set obtained through previous studies of nociceptive processing, which likely contributed to the success of deep learning. The deep neural network was most sensitive to occlusion of channel latencies around the observed P2 evoked response at Cz, associated with conscious access to sensory information. Topographies of the occlusion sensitivity show focus on a single dipolar source around these latencies. Importantly, these observations confirm that the neural network focusses on brain activity, rather than on artefacts such as eye-blinks or muscular activity. The difference between the observed dipolar topographies of occlusion sensitivity and the diffuse central topography associated with vertex potentials is not surprising: the occlusion sensitivity topography potentially represents the influence of any type of recorded neural activity on network scores at these latencies, phase-locked and non-phase-locked, time-locked and non-time-locked, whereas the topography of the grand average vertex potential only represents time-locked and phase-locked neural activity (which also has to overlap between subjects). However, the exact reasons for the dipolar topography of occlusion sensitivity around 300 and 400 ms remain unknown, although the spatial distribution could suggest involvement of the primary somatosensory cortex.
The challenge of obtaining unbiased estimates of the perceptual threshold was addressed by letting the neural network perform a 2IFC classification task. Although using this task helps us to deal with a potentially biased classification score due to cross-subject application of the neural network and the non-stationarity of EEG signals (also see Section 2.3.2), a few other potential reasons for error remain. For instance, presence of a readiness potential in pre-stimulus activity could have facilitated interval classification. Nevertheless, the experimental paradigm was designed to reduce expectation bias when participants perform a psychophysical task by randomizing stimulus amplitude and the inter-stimulus interval (Doll et al., 2015), and this most likely reduced associated readiness potentials. In addition, these readiness potentials were not time-locked with the stimulus due to a 1 s randomized inter-stimulus interval, further reducing the potential influence of readiness potentials. A remaining influence of readiness potentials could have negatively impacted the performance of threshold estimation by biasing estimated thresholds towards lower values. However, no evidence of such a negative bias was found in this study. Note that presence of an electrical stimulation artefact could be another reason for estimation bias, and the period between 0 and 50 ms was excluded from the post-stimulus interval to prevent a potential estimation bias.
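The reason a 2IFC decision is insensitive to a biased score can be made concrete with a short sketch: only the comparison between the two interval scores matters, so a constant offset that affects both intervals equally does not change the outcome. The function names, the tie-breaking rule and the `score_fn` callable are assumptions for illustration; `score_fn` stands in for the neural network score, or for the maximum value at Cz in the simpler comparison method.

```python
def two_interval_forced_choice(score_fn, interval_1, interval_2):
    """Return 1 or 2, the interval judged to contain post-stimulus activity.

    score_fn is a hypothetical scoring callable. Ties are resolved in favour
    of interval 2 here, which is an arbitrary choice of this sketch.
    """
    return 1 if score_fn(interval_1) > score_fn(interval_2) else 2


def classification_is_correct(choice, post_stimulus_index):
    """True if the chosen interval is the one that followed the stimulus."""
    return choice == post_stimulus_index
```

For example, using Python's built-in `max` as `score_fn` on two lists of samples reproduces a maximum-value comparison of the two intervals; adding the same offset to both scores leaves the choice unchanged.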
We used the network classifications for estimation of the perceptual threshold by fitting a psychometric function to stimulus-classification pairs.If the neural network successfully classified post-stimulus brain activity by recognizing task-related brain activity, one would expect the estimated perceptual threshold to be equal to the perceptual threshold based on participant report.On the other hand, if the neural network classified post-stimulus brain activity by recognizing stimulus-evoked sensory-discriminative brain activity, one would expect the estimated perceptual threshold to be lower than the perceptual threshold based on participant report, based on the notion that sensory information is available for decision making before it is consciously perceived (Dehaene and Changeux, 2011).
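To illustrate what fitting a psychometric function to stimulus-classification pairs involves, the sketch below fits a logistic function with a 2IFC guess rate of 0.5 by maximum likelihood over a parameter grid. The functional form, the grid search and all names are assumptions of this sketch; the study relied on an optimizer whose details are not reproduced here. In this parameterization, the threshold is the amplitude at which the probability of a correct classification equals 0.75.

```python
import math


def p_correct(amplitude, threshold, slope):
    """Assumed 2IFC psychometric function: chance level 0.5, logistic rise to 1."""
    return 0.5 + 0.5 / (1.0 + math.exp(-slope * (amplitude - threshold)))


def neg_log_likelihood(pairs, threshold, slope):
    """Negative log-likelihood of (amplitude, correct) pairs under the model."""
    eps = 1e-9  # keeps probabilities away from 0 and 1 before taking logs
    nll = 0.0
    for amplitude, correct in pairs:
        p = min(1.0 - eps, max(eps, p_correct(amplitude, threshold, slope)))
        nll -= math.log(p if correct else 1.0 - p)
    return nll


def fit_threshold(pairs, thresholds, slopes):
    """Grid-search maximum-likelihood fit; returns (threshold, slope)."""
    best = min(
        (neg_log_likelihood(pairs, t, s), t, s)
        for t in thresholds
        for s in slopes
    )
    return best[1], best[2]
```

A grid search is used here only to keep the example dependency-free; when nearly all classifications are correct, the likelihood becomes flat in the threshold parameter, which mirrors the estimation problems described for the exceptional case.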
During post-hoc estimation, the neural network estimate was remarkably close to the estimate based on participant responses, with a difference of less than 0.1 mA in 11 out of 13 participants. Furthermore, Bland-Altman analysis showed that using a neural network for classification resulted in limits of agreement of −0.08 and 0.05 for single-pulse stimuli and of −0.03 and 0.05 for double-pulse stimuli, with no significant estimation bias. These limits of agreement are sufficiently small for clinical applications using nociceptive perceptual thresholds, such as observing the loss of intra-epidermal nerve fibers following capsaicin application (Doll et al., 2016), or the assessment of impaired nociceptive processing following sleep deprivation (van den Berg et al., 2022). In contrast, simply classifying the post-stimulus interval based on maximum values during both intervals led to larger upper limits of agreement of 0.56 for single-pulse and 0.12 for double-pulse stimuli, and to a lower limit of agreement further from zero (−0.30) for single-pulse stimuli. Threshold estimates based on maximum values for double-pulse stimuli were significantly biased upwards. This shows that it is necessary to use neural networks or other similarly sophisticated classification methods within the protocol presented here, in order to obtain clinically relevant estimates of the perceptual threshold.
In a first proof-of-concept experiment on 8 healthy participants, we evaluated the performance of the neural network for real-time tracking and estimation of the perceptual threshold. When participants were assigned the GN task of releasing a response button when a stimulus was perceived (DS3, Part I), 8 out of 8 neural network estimates were close to the estimate based on participant responses, with a difference of less than 0.1 mA. When participants were assigned the GN task of counting the number of perceived stimuli (DS3, Part II), 7 out of 8 neural network estimates were close to the independent estimate of a second instance of the neural network, with a difference of less than 0.1 mA. Furthermore, Bland-Altman analysis showed that using a neural network for classification resulted in limits of agreement of −0.09 and 0.06 during the GN task based on button-release and of −0.09 and 0.10 during the GN task based on counting, with no significant estimation bias. Nevertheless, the average absolute difference between stimulus amplitudes and the estimated perceptual threshold (|Δ|) suggests that threshold estimation performed worse during the counting task in two participants. A potential reason for this decrease in performance is the difference in task, which could mean that the brain activity used for classification during the experiment was different from the activity on which the neural network was trained. As such, the performance during a GN task based on counting might still be improved by using transfer learning to train the neural network on brain activity during a counting task.
In addition, analysis of individual participants in DS2 and DS3 identifies two other potential reasons for error. First, in post-hoc estimation of the perceptual threshold, the estimate will be inaccurate if a clear P2 is present despite the participant reporting the stimulus as not perceived. As this dissociation between brain activity and reported perception resulted in a very high number of correctly classified epochs in one of the participants, optimization algorithms were unable to accurately estimate the psychometric function and associated perceptual threshold for this participant. Second, inaccurate estimates during real-time estimation of the perceptual threshold occur when the optimization algorithm is unable to estimate parameters of the psychometric function based on the stimulus-classification pairs. However, these problems with threshold estimation are signaled adequately by a large difference between the applied stimulus amplitudes and the estimated threshold, which could be used as a quality control measure.

Conclusion
Deep learning enables accurate classification of EEG recordings in real-time, and thereby allows for non-invasive automated estimation of perceptual thresholds based on brain activity. While current BCI literature mainly focuses on the detection of visual evoked potentials, we found that neural networks can also be used as a reliable classifier of brain activity in response to nociceptive stimuli. In this work, we showed that deep neural network classification of the electroencephalogram leads to accurate estimates of the perceptual threshold post-hoc, and we provided a first proof-of-concept that we might use deep neural network classification to control an adaptive stimulus sequence and estimate the perceptual threshold in real-time. Further studies should assess if the real-time method could accurately estimate the perceptual threshold in a larger group of participants outside the laboratory. Automated perceptual threshold estimation based on the electroencephalogram using deep neural network classification enables development of technology for accurate and objective assessment of perceptual thresholds in a wide variety of patient groups in which obtaining reliable perceptual reports can be difficult due to cognitive impairment, communication problems or potential simulation or malingering.

Fig. 2. Psychophysical tasks executed by the participant or the neural network. During the Go/No-go (GN) task the participant receives a sequence of stimuli at a randomized interval and is instructed to report when a stimulus was detected, e.g. by releasing a response button. In the 2-Interval Forced Choice (2IFC) task the neural network classifies whether interval 1 or interval 2 contains post-stimulus brain activity. This classification task is repeated for each stimulus.

B. van den Berg et al.

Fig. 5. Left: Perceptual thresholds based on participant responses (Resp.) and based on interval classification using the maximum value at Cz (Max.). Thresholds in the same participant are connected by a line. In each case, equal threshold estimates would indicate a good performance of automated threshold estimation. Right: Bland-Altman analysis of the difference between thresholds estimated based on maximum values and reference thresholds based on participant responses. Differences at more than 1.5 times the interquartile range above the upper quartile or below the lower quartile were excluded from Bland-Altman analysis as outliers. There was a significant bias in the thresholds estimated for double-pulse stimuli.

Fig. 6. Left: Perceptual thresholds based on participant responses (Resp.) and based on interval classification using the neural network score (NN). Thresholds in the same participant are connected by a line. In each case, equal threshold estimates would indicate a good performance of automated threshold estimation. One exceptional case, marked by red circles, is discussed in Section 3.2.2. Right: Bland-Altman analysis of the difference between thresholds estimated based on neural network scores and reference thresholds based on participant responses. Differences at more than 1.5 times the interquartile range above the upper quartile or below the lower quartile were excluded from Bland-Altman analysis as outliers. There was no significant estimation bias and relatively narrow limits of agreement.

Fig. 7. Exceptional case. Detected (top) or correctly classified stimuli (bottom) are indicated by filled circles. Non-detected (top) or incorrectly classified stimuli (bottom) are indicated by open circles. The participant showed a very large difference between the threshold computed based on participant responses (top) and the threshold computed based on brain activity (bottom). Average evoked response at Cz (not cleaned for artefacts) shows a clear peak even for the stimuli reported as non-detected. The neural network correctly classified the interval of post-stimulus brain activity for all but 7 stimuli (bottom), leading to a very low estimate of the perceptual threshold, which was likely inaccurate due to the small number of incorrect classifications available for threshold estimation.

Fig. 8. Typical case. Detected stimuli are indicated by filled markers. The stimulus amplitudes were distributed around the estimated perceptual thresholds. There was a clear average evoked response at Cz (not cleaned for artefacts) to detected or correctly classified stimuli, while the evoked response to non-detected or incorrectly classified stimuli remained close to baseline. The perceptual threshold estimated by the neural network was close to the perceptual threshold estimated using participant responses.

Fig. 9. Average evoked potential at Cz-M1M2 (cleaned) for detected and non-detected (participant, go-/no-go), and for correctly and incorrectly classified (neural network, 2-interval forced choice) single-pulse (SP) and double-pulse (DP) stimuli. There was a significant contrast between detected and non-detected and between correct and incorrect. Average amplitude of the peak between 380 and 420 ms appears to increase in proportion to detection probability.

Fig. 10. Average occlusion sensitivity topographies for each 100 ms post-stimulus. The neural network appears to focus on a single dipole pair with a maximum at CP1 (magenta) and a minimum at P3 (cyan). (For interpretation of the references to colour in this figure, the reader is referred to the web version of this article.)

Fig. 11. Comparison between the average evoked potential waveform observed at Cz (not cleaned for artefacts, top) and the occlusion sensitivity at CP1 and P3 (bottom). Classification intervals are marked by a grey patch. Maxima and minima of the occlusion sensitivity coincide with the peak in evoked potential between 380 and 420 ms.

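Fig. 12. Left: Perceptual thresholds based on participant responses (Resp.) and based on interval classification using the maximum value at Cz (Max.). Thresholds in the same participant are connected by a line. In each case, equal threshold estimates would indicate a good performance of automated threshold estimation. Thresholds with an average absolute difference (|Δ|) between stimulus amplitudes and estimated perceptual threshold larger than 0.1 are marked in orange. Right: Bland-Altman analysis of the difference between thresholds estimated based on maximum values and reference thresholds based on participant responses. Differences at more than 1.5 times the interquartile range above the upper quartile or below the lower quartile were excluded from Bland-Altman analysis as outliers. There was no significant bias of estimated thresholds.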

Fig. 13. Exceptional case. Detected stimuli are indicated by filled markers. Estimated perceptual thresholds in Part I are consistent and stable (|Δ| = 0.04 and |Δ| = 0.06). There was a clear average evoked response at Cz (not cleaned for artefacts) to correctly classified stimuli, while the evoked response to incorrectly classified stimuli remained close to baseline. However, the estimate of threshold 2 in Part II appears to be inaccurate, which is signaled by a large |Δ| of 0.28.

Fig. 14. Typical case. Detected stimuli are indicated by filled markers. The stimulus amplitudes were distributed around the estimated perceptual thresholds. There was a clear average evoked response at Cz (not cleaned for artefacts) to correctly classified stimuli, while the evoked response to incorrectly classified stimuli remained close to baseline. The perceptual threshold estimated by the neural network was close to the perceptual threshold estimated using participant responses in Part I, and both independent estimates of the perceptual threshold based on neural network classification in Part II are consistent.

Table 5
Estimated perceptual thresholds (T) and slopes (S) based on participant (P) and neural network (NN) responses. The average absolute difference between stimulus amplitudes and the estimated perceptual threshold (|Δ|) is a potential marker for the quality of the estimated threshold. Participants where the |Δ| is larger than 0.1 are marked in orange.