On the (un)reliability of common behavioral and electrophysiological measures from the stop signal task: Measures of inhibition lack stability over time

Response inhibition, the intentional stopping of planned or initiated actions, is often considered a key facet of control, impulsivity, and self-regulation. The stop signal task is argued to be the purest inhibition task we have, and it is thus central to much work investigating the role of inhibition in areas like development and psychopathology. Most of this work quantifies stopping behavior by calculating the stop signal reaction time as a measure of individual stopping latency. Individual difference studies aiming to investigate why and how stopping latencies differ between people often do this under the assumption that the stop signal reaction time indexes a stable, dispositional trait. However, empirical support for this assumption is lacking, as common measures of inhibition and control tend to show low test-retest reliability and thus appear unstable over time. The reasons for this could be methodological, where low stability is driven by measurement noise, or substantive, where low stability is driven by a larger influence of state-like and situational factors. To investigate this, we characterized the split-half and test-retest reliability of a range of common behavioral and electrophysiological measures derived from the stop signal task. Across three independent studies, different measurement modalities, and a systematic review of the literature, we found a pattern of low temporal stability for inhibition measures and higher stability for measures of manifest behavior and non-inhibitory processing. This pattern could not be explained by measurement noise and low internal consistency. Consequently, response inhibition appears to have mostly state-like and situational determinants, and there is little support for the validity of conceptualizing common inhibition measures as reflecting stable traits.


Introduction
Broadly, inhibition is conceptualized as the ability to stop or suppress unwanted thoughts, memories and behaviors, and it is generally considered a central facet of higher-order constructs such as cognitive control, self-regulation and/or impulsivity (e.g., Bari & Robbins, 2013; Hofmann, Schmeichel, & Baddeley, 2012; Lenartowicz, Kalar, Congdon, & Poldrack, 2010; Logan, Schachar, & Tannock, 1997). Using experimental approaches and laboratory measurements, we mostly investigate a narrower aspect of inhibitory control termed response inhibition or action stopping, which allows for clear operationalization at the behavioral level. While several tasks have been designed for this purpose, the stop signal task (SST; Lappin & Eriksen, 1966; Logan & Cowan, 1984) is often considered the purest laboratory measure of inhibition (e.g., Aron & Poldrack, 2005; Crosbie, Pérusse, Barr, & Schachar, 2008). In the SST, participants are asked to respond quickly and accurately to a primary go signal, but to try to cancel their response if the go signal is followed by a stop signal after a variable delay (stop signal delay, SSD), thus capturing the stopping of initiated and/or prepotent actions.
The SST presumably owes much of its popularity to the formalization of an independent horse race model (Logan & Cowan, 1984; Logan, Cowan, & Davis, 1984), a mathematical model developed to describe SST performance. In this model, inhibitory performance on stop trials is determined by a race between two competing processes: a go-triggered process aimed at producing the movement races against a stop-triggered process aimed at stopping it. The winner of the race dictates whether inhibition is successful or not. Central to the model's popularity is that it allows for calculating the presumed time an individual would need to cancel their initiated action, essentially providing a reaction time of stopping in the absence of a direct measure of this hidden process. While there are several approaches to calculating the stop signal reaction time (SSRT), most rely on a simple difference measure: by summarizing an individual's distribution of go reaction times (goRTs) and SSDs into point estimates (such as a mean or a given percentile), one can calculate the SSRT by subtracting the SSD from the goRT. The resulting point estimate for each individual presumably reflects the temporal distance between stop signal presentation and the time taken to stop an initiated response.
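Schematically, letting $p$ denote an individual's probability of responding despite a stop signal, this difference logic can be written as follows (a generic rendering of the approach described above, not a new result):

$$\mathrm{SSRT} = \mathrm{goRT}_{(p)} - \overline{\mathrm{SSD}},$$

where $\mathrm{goRT}_{(p)}$ is the $p$-th quantile of the go reaction time distribution and $\overline{\mathrm{SSD}}$ is the mean stop signal delay. For instance, with $p = .5$, a corresponding goRT quantile of 500 ms, and a mean SSD of 300 ms, the estimated SSRT would be 500 − 300 = 200 ms.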
Research using the SSRT as a measure of inhibition has shown that people are slower to inhibit responses during childhood and old age (Madsen et al., 2020; Singh et al., 2022; Williams, Ponesse, Schachar, Logan, & Tannock, 1999), as well as across a range of psychological and neurological disorders such as attention-deficit hyperactivity disorder, obsessive compulsive disorder, schizophrenia, Parkinson's disease, and substance abuse disorders (Jahanshahi, Obeso, Baunez, Alegre, & Krack, 2015; Lipszyc & Schachar, 2010; Smith, Mattick, Jamadar, & Iredale, 2014). Consequently, inhibition is proposed to be an important facet of normal development and aging, as well as psychopathology. Coupled with the theoretical role of impulsivity and self-regulation in shaping a range of life outcomes (e.g., Enkavi et al., 2019), it is no wonder that significant efforts are put towards understanding why and how SSRT estimates differ between individuals.
While the latency of action stopping has traditionally been limited to model estimates of covert behavior, recent work has also suggested a direct and observable measure of stopping speed (Atsma, Maij, Gu, Medendorp, & Corneil, 2018; Raud & Huster, 2017). Occasionally, small bursts of electromyographic (EMG) activity can be recorded over the responding muscles during successful inhibition, presumably representing partial responses (prEMG) that decline before the movement ends in a button press. Thus, the latency of the attenuation of muscular activity provides an alternative estimate of stopping behavior. The prEMG peak latency correlates with the SSRT but is argued to be at least partially distinct from it, and it is compatible with several possible neural mechanisms (Raud, Thunberg, & Huster, 2022).
A common assumption when interpreting SST outcomes is that the SSRT first and foremost reflects an inhibition process and a stable, dispositional inhibitory capability (something specific to the individual, persisting over time and measurement situations) as opposed to other supporting processes or contextual factors. However, as a descriptive framework, the independent horse race model stays silent about which psychological and neural processes determine the SSRT, and it does not specify what sort of factors influence and shape these processes. This descriptive approach has undoubtedly benefited the field, providing us with a model that can be used to investigate inhibition across sensory and response modalities, situations, conditions, development and lifespan, health and disorders, and across species. However, its flexibility might also lead us astray, as common assumptions tend to go unchecked despite lacking a strong theoretical foundation. For instance, while successfully inhibiting a response has repeatedly been shown to be influenced by situational factors like motivation and reward (Boehler, Schevernels, Hopf, Stoppel, & Krebs, 2014; Greenhouse & Wessel, 2013; Leotti & Wager, 2010; Schevernels et al., 2015), a general tendency seems to be to ascribe interindividual variation in the SSRT to variation in a latent trait-like process without considering whether state-driven fluctuations could account for the data. This is especially apparent in work using the SSRT as an endophenotype, or to guide the search for biomarkers of psychopathology (e.g., Arnatkeviciute et al., 2023; Aron & Poldrack, 2005; Congdon, Lesch, & Canli, 2008; Fineberg et al., 2010; Fortgang, Hultman, Erp, & Cannon, 2016; Graczyk, Sahakian, Robbins, & Ersche, 2023; Rommelse et al., 2009). Consequently, if our measures of inhibition do not show trait-like stability, it would have large implications for individual difference approaches to inhibitory control and severely limit the kinds of questions we could hope to answer.
Trait-like stability of the SSRT could be supported by empirical findings of score stability over time, for instance by showing high test-retest reliability. In other fields investigating trait-like aspects of human behavior and cognition, such as personality or intelligence, measures tend to stabilize in adulthood and show high stability over years (Bleidorn et al., 2022; Calamia, Markon, & Tranel, 2013; Rantanen, Metsäpelto, Feldt, Pulkkinen, & Kokko, 2007; Roberts & DelVecchio, 2000; Schneider, Niklas, & Schmiedeler, 2014). Although the psychometric properties of task-derived psychological measures have received far less attention than their self-report counterparts (Parsons, Kruijt, & Fox, 2019; Zuo, Xu, & Milham, 2019), the cognitive control field has worried about an apparent lack of test-retest reliability for key task variables for decades (Miyake et al., 2000; Rabbitt, 1997). This concern has been revived by recent studies showing that task measures of control and impulsivity do not show the stability they are commonly assumed to have (Enkavi et al., 2019; Hedge, Powell, & Sumner, 2018; Schuch, Philipp, Maulitz, & Koch, 2022). For the stop signal task specifically, the SSRT has repeatedly been shown to have low test-retest reliability and few stable factors influencing its variation (Faßbender, Meyhöfer, & Ettinger, 2023; Hedge et al., 2018; Rodas & Greene, 2020; Wöstmann et al., 2013).
There are several proposed causes for why measures from cognitive control tasks such as the SST appear to show little stability over time. These can largely be divided into methodological and substantive explanations, i.e., concerning how the research is done on the one hand, and the underlying phenomena we are studying on the other. On the methodological side, one potential explanation stems from a longstanding concern about the reliability of difference measures (e.g., Cronbach & Furby, 1970; Lord, 1956). Difference measures are prominent in cognitive control tasks since control is often operationalized as a contrast between two conditions. As difference scores inherit the error and uncertainty from their constituent variables (and the error increases further if the correlation between the constituent variables increases; Gulliksen, 1950), some worry that they are inherently unreliable and thus lack utility for studying individual differences. Another possible explanation comes from what has been termed the reliability paradox, which describes a situation where a range of robust cognitive tasks fail to reliably differentiate between individuals (Hedge et al., 2018). This phenomenon is argued to stem primarily from task design: the tasks that succeed in showing robust group- or treatment-level effects could achieve this power by reducing the variance between people. Accordingly, individual scores reflect a larger proportion of noise relative to meaningful individual differences. Alternatively, it has been argued that low reliability is not necessarily caused by issues with the tasks themselves, but with failing to account for measurement error and the uncertainty in person-level estimates (Chen et al., 2021; Haines et al., 2020; Rouder & Haaf, 2019). In cognitive tasks, person-level estimates tend to reflect a summary of performance across trials, such as a mean reaction time in some condition. Accordingly, misrepresenting the underlying data generating process and/or varying the number of trials influences the precision of the person-level estimate, which could again impact the reliability estimate.
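The classical concern about difference scores can be made explicit. For a difference score $D = X - Y$, standard psychometric theory (e.g., Gulliksen, 1950) gives its reliability as

$$r_{DD'} = \frac{\sigma_X^2\, r_{XX'} + \sigma_Y^2\, r_{YY'} - 2\,\sigma_X \sigma_Y\, r_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\,\sigma_X \sigma_Y\, r_{XY}},$$

where $r_{XX'}$ and $r_{YY'}$ are the reliabilities of the components and $r_{XY}$ is their correlation. With equal component variances this reduces to $r_{DD'} = (\bar{r} - r_{XY})/(1 - r_{XY})$, so, for example, two components with reliabilities of .8 that correlate at .6 yield a difference score with a reliability of only (.8 − .6)/(1 − .6) = .5.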
These proposed methodological explanations differ in their focus, but all fundamentally converge on measurement noise as the key factor driving low temporal stability. If this is the case, the inability to differentiate between people is not only something happening over time, but it should also be present within a single measurement session. Previous reports have suggested that the internal consistency of the SSRT is generally higher than its test-retest reliability, indicating that the low test-retest reliability might not be methodological in its origin. However, some have voiced concerns over whether high internal consistency is merely a by-product of methodology (Hedge et al., 2018). In the SST, it is recommended to use an adaptive staircase approach to adjust the delay between go and stop signals to a duration where the participant succeeds in stopping about half the time (Verbruggen et al., 2019). To do this, the delay is lengthened if participants stop and shortened if they respond. This kind of adaptation gives trials that are close in time similar delays. As task-based internal consistency is often estimated by splitting the data into odd and even (i.e., alternating) trials to calculate the split-half reliability, task design and reliability approach could interact by producing very similar SSD distributions for each split, at least if go and stop trials are split separately. The inflated similarity between splits could cause higher SSD reliability. When calculating the SSRT, higher SSD reliability could contribute to an artificially high estimate of internal consistency for the SSRT as well.
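To illustrate the suspected mechanism, the following minimal Python simulation (our sketch, not the studies' analysis code; all parameters are schematic) compares odd/even and random splits of staircased SSD series:

```python
import numpy as np

rng = np.random.default_rng(7)

def staircase_ssd(n_stop, p_inhibit, start=250, step=50, lo=100, hi=800):
    """Simulate a staircased SSD series: the delay lengthens after a
    successful stop and shortens after a failed one (schematic)."""
    ssd, out = start, np.empty(n_stop)
    for t in range(n_stop):
        out[t] = ssd
        ssd = min(hi, ssd + step) if rng.random() < p_inhibit else max(lo, ssd - step)
    return out

n_sub, n_stop = 60, 108
subs = [staircase_ssd(n_stop, rng.uniform(.40, .60)) for _ in range(n_sub)]

# Odd/even split: adjacent staircase trials have nearly identical delays,
# so the two half-means are almost perfectly matched within each person.
odd = np.array([s[0::2].mean() for s in subs])
even = np.array([s[1::2].mean() for s in subs])

# Random split: the halves differ more within each person.
r1, r2 = [], []
for s in subs:
    idx = rng.permutation(n_stop)
    r1.append(s[idx[: n_stop // 2]].mean())
    r2.append(s[idx[n_stop // 2:]].mean())

print("odd/even half r:", np.corrcoef(odd, even)[0, 1])
print("random half r:  ", np.corrcoef(r1, r2)[0, 1])
```

In such simulations the odd/even correlation is typically the higher of the two, consistent with the worry that alternating splits inflate SSD (and hence SSRT) consistency.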
If inhibition measures are unstable over time, for instance by fluctuating over days or weeks, it would have large consequences for work hoping to understand why these measures differ between individuals. Since the possible causes are many, there is a need for increased understanding of the temporal stability of these measures. For instance, previous work assessing the test-retest reliability of stop signal task measures has largely focused on measures of task behavior such as stopping latencies, reaction times and error rates (Bartholow et al., 2018; Bender, Filmer, Garner, Naughtin, & Dux, 2016; Choudhury et al., 2019; Czapla et al., 2016; Hedge et al., 2018; Kräplin, Scherbaum, Bühringer, & Goschke, 2016; Kuntsi, Stevenson, Oosterlaan, & Sonuga-Barke, 2001; Palmer, Langbehn, Tabrizi, & Papoutsi, 2018; Robertson et al., 2015; Rodas & Greene, 2020; Soreni, Crosbie, Ickowicz, & Schachar, 2009; Weafer, Baggott, & de Wit, 2013; Wöstmann et al., 2013). Consequently, we know little about the reliability of SST-derived variables other than the purely behavioral ones. Whereas some have begun to characterize this for measures derived from functional magnetic resonance imaging (Korucuoglu et al., 2021) and stop-related measures of motor system state (Chowdhury, Livesey, & Harris, 2020), no one appears to have investigated the temporal stability of SST-derived event-related potentials, which have been central to electrophysiological investigations of processes contributing to inhibitory control for decades. Additionally, we are not familiar with any studies characterizing the temporal stability of other proposed inhibition measures, such as the presence of stop-related beta bursts over the frontocentral cortex presumably indexing the initiation of inhibition, or the prEMG peak latency which provides an alternative measure of stopping latency. Not being difference measures, they might have an advantage over the SSRT in terms of measurement stability.
Furthermore, since measurement noise might be central to low temporal stability, joint investigations of split-half and test-retest reliability could delineate whether imprecise measures within sessions drive instability over longer time periods. Given the concern that some approaches to internal consistency can cause a confound with common SST design choices, more robust split-half characterization would allow us to see whether the low reliability is something happening over time, or if the measures also lack internal consistency, which would severely limit their use.
Lastly, previous work aiming to summarize current knowledge about the test-retest reliability of SST measures has shown that, while overall low, these reliability estimates show large variation across studies (Enkavi et al., 2019). However, this finding relied on collapsing across different variables. As different task measures presumably rely on at least partially separable processes and are often thought of as measuring different constructs, we currently know little about whether this variability is driven by differences between variables, differences between studies, or both.
In sum, psychometric investigations have found that common measures of inhibition have low test-retest reliability, which poses a threat to studies using these measures as indices of stable, dispositional traits. However, low test-retest reliability can be driven by both methodological and substantive factors, and we currently know little about how these different aspects influence reliability measures in the stop signal task. Additionally, reliability estimates from the literature show large heterogeneity, but it is uncertain whether this reflects differences between variables, study contexts, or sample characteristics. With the aim of answering some of these questions, we took an exploratory approach and estimated the internal consistency and test-retest reliability of a range of behavioral, electroencephalographic and electromyographic variables derived from the stop signal task in three independent samples completing two separate measurement sessions. This allowed us to examine potential test-retest and split-half reliability patterns both within and/or across measurement modalities, as well as to see whether the patterns were consistent across studies. We also aimed to summarize current knowledge of stop signal task reliability through a systematic literature review on both the test-retest and split-half reliability of common stop signal measures, focusing our meta-analyses on potential differences between variables, as well as factors moderating reliability heterogeneity for specific variables.

Methods
We report how we determined our sample size, all data exclusions, all inclusion/exclusion criteria, whether inclusion/exclusion criteria were established prior to data analysis, all manipulations, and all measures in the study. No part of the study procedures or analysis plans was preregistered prior to the research being conducted. Raw and summary-level data, digital study materials, as well as processing and analysis code are available on the Open Science Framework: https://osf.io/dcb6g/ (Thunberg, Wiker, Bundt, & Huster, 2023).

Participants
Participants were recruited through poster advertisements at the Department of Psychology at the University of Oslo, as well as advertisements on social media. The participants had normal or corrected-to-normal vision and no history of psychological or neurological disorders. All analyses rely on data collected in earlier studies, and some of it is a re-analysis of data included in previously published reports (Study 1: Thunberg, Messel, Raud, & Huster, 2020; Study 2: Raud et al., 2022), thus all sample sizes were fixed before this study was conceptualized. The studies were carried out in accordance with the Declaration of Helsinki and approved by the institutional review board of the Department of Psychology, University of Oslo. All participants consented in writing prior to participating, and they all received monetary compensation. Initially, a total of 28 (study 1), 26 (study 2) and 44 (study 3) participants were recruited. However, we excluded participants who were missing data from the second session, or whose stop signal task performance in either session indicated that they were not following task instructions. Specifically, they were excluded if they failed to respond on more than 10 % of go trials, if their stop accuracy was lower than 25 % or higher than 75 %, or if their unsuccessful stop reaction times were slower than their go reaction times (Verbruggen et al., 2019). After applying these criteria, we were left with 76 participants for further analysis (N study 1 = 24, N study 2 = 20, N study 3 = 32).
We also chose to exclude participants for subsets of the analyses if they had insufficient data for a given modality. Specifically, if a participant had poor EEG (more than 25 % discarded data) or EMG (less than 75 % detectable muscle bursts in go trials) data quality, they were excluded from all EEG and/or EMG analyses, respectively. Additionally, participants with missing data from EEG electrodes of interest (Cz for ERPs, FCz for beta burst detection) were excluded from those analyses specifically. Together, this resulted in excluding 2 (beta, FCz missing) and 1 (EMG data quality) participants from those analyses.
All exclusion criteria were determined prior to data analysis.

Design and procedures
Participants in all three studies completed several sessions consisting of a stop signal task with concurrent EEG and EMG recordings. In study 1, participants completed 3 sessions spaced 1 week apart, where each session consisted of three task runs. Transcranial direct current stimulation was applied during the second task run. Here, we included the first task run from the first and second session. In study 2, participants completed 2 sessions spaced 1 week apart, where each session consisted of two task runs. Here, we included the first task run from both sessions. In study 3, participants completed 2 sessions spaced 2 weeks apart, each session consisting of one task run.
All three studies used a basic visual SST presented using E-Prime 2.0 (study 1 and 2; Psychology Software Tools, Pittsburgh, PA) or Psychtoolbox version 3.0.16 (study 3; Brainard, 1997; Kleiner et al., 2007; Pelli, 1997). In study 1 and 2, participants performed the stop signal task and a two-choice reaction time task in alternating blocks. Study 3 only included SST blocks, but additionally included other cognitive tasks. For this analysis, we have focused on the SST exclusively.

Stop signal task
In the stop signal task, the primary task was to respond to the appearance of a right- or left-pointing arrow by thumb abduction with the right or left thumb, respectively. All responses were collected using specialized response devices (study 1 and 2: The BlackBox Toolkit Ltd., UK; study 3: Cedrus RB-740, Cedrus Corporation, San Pedro, CA, USA). In a subset of trials, the initial go arrow was followed by a second arrow indicating the need to cancel the initiated response. The included task runs all had a total of 450 trials, with 108 (study 1 and study 2) or 110 (study 3) stop signals, giving a stop signal probability of .24 for all studies. In all studies, each trial started with a centrally presented colored arrow (duration = 100 ms) indicating the need for a response (see Fig. 1). In stop trials, a second arrow (duration = 100 ms) followed the first after a variable delay. Stimuli were interleaved between continuous presentation of a fixation cross. Arrows could be either orange, green or blue (study 1 and 2), or orange or blue (study 3), and the go-stop color assignment was counterbalanced across participants. The delay between go and stop stimuli (stop signal delay; SSD) started at 250 ms and was subsequently adjusted based on performance within a range of 100–800 ms (study 1 and 2) or 100–600 ms (study 3). Specifically, the SSD increased or decreased by 50 ms following successful and unsuccessful stop trials, respectively, with adaptations combined across left- and right-hand response trials. Only responses made within 1000 ms after go stimulus onset were included for analysis, and any response made after this was treated as an omission. This response interval was fixed for all studies and was followed by a continued fixation period of 500–1000 ms (study 1 and 2) or 700–1200 ms (study 3) until the start of the next trial.
The first block always started with 10 go trials, but otherwise trial order was randomized within each block. The studies had 3 (study 1 and 2) and 5 (study 3) blocks, separated by short pauses. Participants received automated performance feedback after every 75th trial, instructing them to be faster if their average goRT was above 600 ms, and to be more accurate if their average stop accuracy was below .40. Prior to the experiment, participants completed a short training session with automated performance feedback after every trial. In addition, they were told that they should aim to be both fast and accurate, that they should not wait for a stop signal, and that they would not be able to stop for all stop signals.

Data acquisition
For all three studies, we recorded EEG and EMG activity with a BrainAmp system (Brain Products GmbH, Germany) using online low-pass filters at 250 Hz (EEG) and 1000 Hz (EMG) and a 5000 Hz sampling rate. EMG activity was sampled with a .5 μV resolution in all three studies, whereas EEG recordings had a resolution of .5 μV (study 1 and 2) or .1 μV (study 3). For all studies, EEG electrode impedances were kept equal to or below 5 kΩ. EEG was recorded from 9 (study 1), 19 (study 2) and 29 (study 3) Ag/AgCl electrodes, placed according to the 10–20 system. In all studies, FCz and AFz served as online reference and ground electrode, respectively (in study 1, these 2 electrodes were switched by mistake for 2 participants, who were excluded from all analyses on FCz). Additional electrodes were placed on the tip of the nose (study 1 and 2) as well as on each earlobe (all studies) for offline re-referencing. EMG activity was recorded with a bipolar Ag/AgCl electrode montage parallel to the belly of the abductor pollicis brevis on each hand, with ground electrodes placed on each forearm (study 1 and 2) or over the ulnar styloid (study 3).

Behavior
Individual summaries of task performance were calculated using custom scripts written in MATLAB R2021b (The MathWorks, Inc., Massachusetts, USA) and visualized using raincloud plots (Allen, Poggiali, Whitaker, Marshall, & Kievit, 2019). Specifically, we extracted go accuracies, go errors, go omissions, go reaction times, stop accuracies, unsuccessful stop reaction times, and stop signal delays at the trial-level, and calculated stop signal reaction times at the task-level. SSRTs were calculated using the integration method, with error RTs included in the goRT distributions and go omissions replaced by the maximum RT (Verbruggen et al., 2019). For reaction time analyses, we identified trials with improbable (<100 ms) or outlier (±3.5 SD) reaction times and excluded them from further analyses.
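A minimal Python sketch of this integration-method calculation (our illustration following Verbruggen et al., 2019, not the study's MATLAB scripts; argument names are ours):

```python
import numpy as np

def ssrt_integration(go_rts, p_respond, mean_ssd, max_rt=1000.0):
    """Integration-method SSRT.
    go_rts: go-trial RTs in ms, with np.nan marking omissions (error RTs
    are kept in the distribution); p_respond: proportion of stop trials
    with a response; mean_ssd: mean stop signal delay in ms."""
    rts = np.sort(np.where(np.isnan(go_rts), max_rt, go_rts))  # replace omissions
    nth_idx = int(np.ceil(p_respond * rts.size)) - 1           # the "nth" go RT
    return rts[nth_idx] - mean_ssd

# e.g., for a participant who responded on half the stop trials:
# ssrt = ssrt_integration(go_rts, p_respond=.5, mean_ssd=300.)
```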

EEG
We did all processing in MATLAB (v.R2021b) using functions from the EEGLAB toolbox (v.2021.1; Delorme & Makeig, 2004), the ERPLAB plugin (Lopez-Calderon & Luck, 2014), along with custom scripts. Time courses were visualized using shadedErrorBar (Campbell, 2021). First, EEG data was re-referenced to the average of the earlobe electrodes, followed by low-pass filtering (40 Hz), downsampling to 500 Hz, and high-pass (.1 Hz) filtering. All filtering was done using a Hamming-windowed sinc finite impulse response filter. After this, we did an independent component analysis using the infomax algorithm, and components reflecting ocular artefacts were manually identified and removed (mean number of ICs excluded = 1.3158, range: 1–4). We also identified noisy data segments using a sliding window procedure (500 ms window, 50 % overlap) and removed segments with a peak-to-peak voltage exceeding 175 μV (mean percent rejected = .32 %, range: 0–6.3 %).
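As an illustration of the sliding-window rejection step, a small numpy sketch (our re-implementation of the described criterion, not the original MATLAB code; data is assumed to be a channels × samples array in μV):

```python
import numpy as np

def bad_segments(data, sfreq, win_ms=500, overlap=.5, thresh_uv=175.):
    """Return (start, stop) sample indices of windows whose peak-to-peak
    voltage exceeds the threshold on any channel."""
    win = int(win_ms / 1000 * sfreq)
    hop = max(1, int(win * (1 - overlap)))
    bads = []
    for start in range(0, data.shape[1] - win + 1, hop):
        seg = data[:, start:start + win]
        if (seg.max(axis=1) - seg.min(axis=1)).max() > thresh_uv:
            bads.append((start, start + win))
    return bads
```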
For ERP estimates, data from go, unsuccessful and successful stop trials were segmented relative to go- or stop-signal onset (−200 to 800 ms) and baseline corrected by subtracting the mean of the pre-stimulus baseline. We extracted amplitude and latency estimates for the N2 and P3 ERPs from electrode Cz separately for all trial types, participants, and sessions, both using participant-averaged and trial-level waveforms. Specifically, we extracted mean peak amplitudes (10 ms interval, peak-centered), mean amplitudes (over a 100 ms window), peak latency, as well as onset latency (here defined as the 50 % fractional peak latency, i.e., the last timepoint preceding the peak reaching 50 % of the peak amplitude). For peak estimates, we searched for local minima/maxima between 100–300 ms and 250–500 ms for the N2 and P3, respectively. For mean amplitude estimates, we used narrower regions and focused on 150–250 ms for the N2, and 300–400 ms for the P3. Estimation windows were kept constant across all trial types, participants, and sessions. For the trial-level ERP estimates, we used the mean estimate for each trial type as a summary of central tendency along with the SD as a summary of variability.
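The four parameters are straightforward to express in code. A numpy sketch for a single waveform (our illustration, not the ERPLAB implementation; the windowed extremum stands in for the full local-peak search):

```python
import numpy as np

def erp_params(times, wave, search, mean_win, polarity):
    """times: vector in ms; wave: amplitudes in microvolts;
    search/mean_win: (start, stop) windows in ms;
    polarity: -1 for the N2 (negative peak), +1 for the P3 (positive peak)."""
    idx = np.flatnonzero((times >= search[0]) & (times <= search[1]))
    peak_i = idx[np.argmax(polarity * wave[idx])]       # signed peak in window
    peak_lat = times[peak_i]
    # mean peak amplitude: 10 ms window centered on the peak
    peak_amp = wave[(times >= peak_lat - 5) & (times <= peak_lat + 5)].mean()
    # mean amplitude over the a priori 100 ms window
    mean_amp = wave[(times >= mean_win[0]) & (times <= mean_win[1])].mean()
    # onset: last pre-peak sample below 50% of the (signed) peak amplitude
    pre = np.flatnonzero(polarity * wave[:peak_i] < 0.5 * polarity * wave[peak_i])
    onset_lat = times[pre[-1]] if pre.size else np.nan
    return peak_amp, mean_amp, peak_lat, onset_lat

# e.g., P3 from a Cz waveform: search 250-500 ms, mean window 300-400 ms
# p3 = erp_params(times, cz_wave, (250, 500), (300, 400), polarity=+1)
```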
For beta burst estimates, we focused on detecting the presence of stop-related beta bursts in electrode FCz in a manner similar to earlier reports (e.g., Sherman et al., 2016; Shin, Law, Tsutsui, Moore, & Jones, 2017; Wessel, 2020). To do this, we segmented data from all trial types relative to go-signal onset (−500 to 1500 ms) and, for each frequency between 10 and 35 Hz, we convolved the electrode-level data with a complex 7-cycle Morlet wavelet. Bursts were identified as local peaks in these trial-level time-frequency representations. For the analysis, we only included bursts that peaked in the 15–30 Hz range, and with a peak power at least 6 times the median power for that frequency in that electrode across all trials (regardless of trial type). The burst threshold was decided prior to analysis. Stop-related beta burst rate was defined as the proportion of successful stop trials with at least one burst in the time interval between the trial-specific SSD and that participant's SSRT.
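A condensed Python sketch of this detection logic (our re-implementation for illustration; `trials` and `windows` are assumed inputs, and a simple threshold crossing stands in for the full local-peak detection described above):

```python
import numpy as np

def morlet_power(trials, sfreq, freqs, n_cycles=7):
    """Time-frequency power (freq x trial x time) via complex Morlet
    convolution. Absolute scaling is irrelevant here, as the burst
    threshold is defined relative to the per-frequency median."""
    out = np.empty((len(freqs),) + trials.shape)
    for fi, f in enumerate(freqs):
        sigma = n_cycles / (2 * np.pi * f)             # 7-cycle wavelet width
        t = np.arange(-3 * sigma, 3 * sigma, 1 / sfreq)
        w = np.exp(2j * np.pi * f * t - t ** 2 / (2 * sigma ** 2))
        for tr in range(trials.shape[0]):
            out[fi, tr] = np.abs(np.convolve(trials[tr], w, mode="same")) ** 2
    return out

def beta_burst_rate(power, freqs, windows, lo=15, hi=30, factor=6):
    """Proportion of trials with supra-threshold beta power inside the
    per-trial (start, stop) sample window."""
    thr = factor * np.median(power, axis=(1, 2))       # per-frequency median
    beta = (freqs >= lo) & (freqs <= hi)
    hits = sum(
        (power[beta, tr, t0:t1] > thr[beta, None]).any()
        for tr, (t0, t1) in enumerate(windows)
    )
    return hits / len(windows)

# usage (inputs are assumptions): trials = successful-stop FCz epochs
# (n_trials x n_samples) at 500 Hz; windows = per-trial SSD-to-SSRT samples
# rate = beta_burst_rate(morlet_power(trials, 500, np.arange(10, 36)),
#                        np.arange(10, 36), windows)
```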

EMG
All EMG preprocessing was done in MATLAB (v.R2021b) using EEGLAB (v.2021.1) and custom scripts (see Raud et al. (2022) for more details on processing choices). Time courses were visualized using shadedErrorBar (Campbell, 2021). First, we bandpass filtered the EMG signal between 20 and 250 Hz using a second order Butterworth filter, followed by lowering the sample rate to 500 Hz. We segmented data for all trial types relative to go signal onset (−200 to 1600 ms) and rejected trials with an average baseline (−200 to 0 ms) activity above 100 μV from further analyses (across participants and sessions, this led to the rejection of 58 trials, roughly .0008 of all trials). The remaining epochs were baseline corrected by subtracting the mean of the pre-stimulus baseline. Each timepoint was transformed to the root mean square of its nearest neighbors (±5 datapoints) using a moving average procedure. After this, we normalized the data by dividing the full epoch time course by the mean activity during baseline. For each participant and hand, we concatenated the epochs and z-scored the activity block-wise.
We extracted estimates of peak latency and burst rate for all trial types. Muscle bursts were detected at the trial-level and defined as post-baseline scores where z > 1.2, and burst rate was defined as the number of trials with a detected burst relative to all trials of that trial type. We extracted peak latencies using both trial-level and participant-averaged waveforms. First, we excluded outlier trials, defined as trials with an onset latency later than 1.5 times the interquartile range (all trial types) and peak latencies later than 1.5 times the interquartile range (successful stop trials only). On average, this led to excluding 1.9 % of the trials. At the trial-level, the peak latency was defined as the time point with the highest amplitude in each trial. For successful stop trials, we subtracted the trial-specific SSD from the estimates to express them relative to stop signal onset. For these trial-level estimates, we calculated the mean of the resulting distributions as a summary measure of central tendency, along with the SD as a summary measure of variability.
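A condensed Python/scipy sketch of the core EMG steps (our re-implementation for illustration; the original pipeline was MATLAB-based, and details such as the exact filter application may differ):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def preprocess_emg(raw, fs_in=5000, fs_out=500):
    """Band-pass (20-250 Hz, 2nd-order Butterworth), downsample, and
    smooth one EMG channel with a moving RMS over +/-5 samples."""
    sos = butter(2, [20, 250], btype="bandpass", fs=fs_in, output="sos")
    x = sosfiltfilt(sos, raw)
    x = x[:: fs_in // fs_out]              # simple decimation for illustration
    sq = np.pad(x ** 2, 5, mode="edge")    # moving RMS, 11-sample window
    return np.sqrt(np.convolve(sq, np.ones(11) / 11, mode="valid"))

def burst_peak(z_epoch, onset_idx, z_thresh=1.2):
    """Peak-latency sample of a detected burst (z > 1.2 after baseline),
    or None if no burst is detected in the epoch."""
    post = z_epoch[onset_idx:]
    return onset_idx + int(np.argmax(post)) if (post > z_thresh).any() else None
```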
For the estimates using participant-averaged waveforms, we averaged shorter segments (−200 to 800 ms) around go onset (go and unsuccessful stop trials) and stop onset (successful stop trials). Peak latency was defined as the time point with the highest amplitude for each trial type. For go and unsuccessful stop trials, we used the entire post-baseline window. For successful stop trials, we only looked for peaks in the remainder of the response window when accounting for individual mean SSD.

Test-retest reliability
To estimate test-retest reliability, we calculated Pearson's r between all variables from the first and second session. All correlation coefficients were calculated for each study separately. For average correlation coefficients, we Fisher Z-transformed the coefficients before calculating the mean, then back-transformed the estimates to r for reporting. For a subset of variables, we also calculated an additional test-retest estimate accounting for potential attenuation due to measurement noise by correcting for split-half reliability (Spearman, 1904). We calculated these deattenuated correlation coefficients using the formula

$$r_{xy}^{c} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}},$$

where $r_{xy}$ is the test-retest reliability, and $r_{xx}$ and $r_{yy}$ are the split-half reliabilities for the first and second session, respectively.
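Both steps amount to a few lines of code; a minimal Python sketch (function names are ours):

```python
import numpy as np

def mean_r(rs):
    """Average correlation coefficients via Fisher Z, back-transformed to r."""
    return np.tanh(np.mean(np.arctanh(np.asarray(rs))))

def deattenuate(r_xy, r_xx, r_yy):
    """Spearman's (1904) correction for attenuation."""
    return r_xy / np.sqrt(r_xx * r_yy)
```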

Split-half reliability
We estimated split-half reliability for key SST variables separately for each study and testing session. More specifically, we did this for go reaction times (both from valid go trials and the nth trial used for SSRT calculation), SSDs, and SSRTs, trial-level ERP parameters (their central tendency, not their variability), and trial-level EMG parameters (again their central tendency, not their variability). We used a permutation-based split-half approach implemented in custom-written MATLAB scripts (v.R2021b), where within-session trial-level data for each participant was split into two random halves (keeping trial type counts constant) to calculate a summary measure for each half. These point estimates were then correlated across participants within-session and corrected using the Spearman-Brown prophecy formula (Brown, 1910; Spearman, 1910). We ran 10 000 permutations, giving us a distribution of reliability scores. We used the mean of this distribution as our split-half reliability estimate, with confidence intervals given by the 2.5th and 97.5th percentiles.
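In outline, the procedure corresponds to the following Python sketch (our illustration of the described MATLAB routine; shown for a single trial type, so the constraint of keeping trial-type counts constant is implicit):

```python
import numpy as np

def split_half(trial_values, n_perm=10_000, seed=0):
    """Permutation-based split-half reliability.
    trial_values: one array of trial-level scores per participant."""
    rng = np.random.default_rng(seed)
    rs = np.empty(n_perm)
    for p in range(n_perm):
        h1, h2 = [], []
        for v in trial_values:
            idx = rng.permutation(v.size)
            half = v.size // 2
            h1.append(v[idx[:half]].mean())
            h2.append(v[idx[half:]].mean())
        r = np.corrcoef(h1, h2)[0, 1]
        rs[p] = 2 * r / (1 + r)            # Spearman-Brown prophecy correction
    return rs.mean(), np.percentile(rs, [2.5, 97.5])
```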

Interpreting reliability coefficients
There are many proposed guidelines for interpreting the size of reliability coefficients (e.g., the CNTRICS Executive Committee, 2008; Cicchetti, 1994; Clayson & Miller, 2017; Fleiss, 1986; Koo & Li, 2016; Nunnally, 1978; Shrout & Fleiss, 1979). While the differences between guidelines can be substantial, they tend to agree that interpretation should be relative to the intended use. This means that "good" for one kind of use might be "minimally acceptable" if the aim is different. We have evaluated the size of reliability coefficients with this in mind, meaning that lower reliability estimates might be more acceptable for basic research than for applied and clinical research (e.g., Nunnally, 1978; Portney & Watkins, 2000). For measures of internal consistency, values above .7 meet a level commonly considered an acceptable starting point for exploratory research (e.g., Nunnally, 1978), whereas values above .8 have been considered more appropriate in established research contexts (e.g., Clayson & Miller, 2017). We thus consider split-half coefficients larger than .8 to be "high" or "good" for basic research purposes, but meeting this minimal criterion does not mean "high" or "good" in all contexts. For applied and clinical research, values around .9 might be more appropriate, and even higher values desirable if individual decision making is the aim. For test-retest reliability, commonly investigated traits such as personality and intelligence tend to show high stability over time, especially from young adulthood onwards. For instance, Big Five traits have been found to have a test-retest reliability of around .75 for people in their mid-twenties and older (Bleidorn et al., 2022). Test-retest reliabilities of the same size or higher have also been found for most subtests of the Wechsler Adult Intelligence Scale (Calamia et al., 2013). Consequently, test-retest reliabilities around this size would be consistent with what is found for measures commonly agreed to capture stable dispositions.

Review and meta-analysis
We searched PubMed and Google Scholar for articles and preprints posted between 01.01.1975 and 04.05.2021 using the following search strings: PubMed: ((response-inhibition OR stop-signal OR countermanding) AND (reliability OR retest OR internal consistency OR split-half)); Google Scholar: ("response-inhibition" OR "stop-signal" OR countermanding) AND (reliability OR retest OR "internal-consistency" OR "split-half"). For the search in Google Scholar, we extracted the first 500 search hits using the Publish or Perish software (Harzing, 2007). After combining with the PubMed hits and deduplicating any common findings, we had 636 records for screening. In total (and across all SST variables), we ended up extracting 67 estimates of test-retest reliability from 18 studies, and 82 estimates of internal consistency from 40 studies. For more information about our inclusion criteria, screening, moderators, and variable coding, please see the supplementary material.

Meta-analysis
We did all analyses using R (v4.3.1) via RStudio (2023.06.0+421). To check whether the test-retest reliability of the SSRT changed as a function of time, we focused on studies using Pearson's r to estimate test-retest reliability in nonclinical samples (N = 14 observations from 9 studies). We then fit a three-level meta-regression model using the metafor package (Viechtbauer, 2023), with interval duration (number of days between test and retest session) as a fixed effect and reliability estimates nested within studies as random effects.
We also checked whether any change over time would be similar for go reaction times (N = 6 observations from 5 studies), again fitting a three-level meta-regression model with interval duration as a fixed effect and reliability estimates nested within studies as random effects.
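For readers less familiar with meta-regression, the following Python sketch conveys the gist using a simplified two-level stand-in (weighted least squares on Fisher-z transformed coefficients) rather than the three-level metafor model actually used; all input values are toy placeholders, not the extracted estimates:

```python
import numpy as np
import statsmodels.api as sm

# Toy placeholder values, for illustration only.
r = np.array([.62, .55, .48, .40, .30])    # reported test-retest correlations
n = np.array([40, 25, 60, 32, 28])         # sample sizes
days = np.array([7, 14, 60, 180, 365])     # test-retest interval

# Fisher-z transform; weight by the inverse sampling variance of z, 1/(n - 3).
z = np.arctanh(r)
fit = sm.WLS(z, sm.add_constant(days), weights=n - 3).fit()
print(fit.params)   # a negative slope indicates reliability decays with interval
```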
To see if we could find other potential predictors of SSRT reliability, we did an exploratory search for relevant moderators using estimates of internal consistency from the literature, this time focusing on studies using split-half correlations (N = 29 observations from 16 studies). We first fit a meta-analytic model with reliability estimates nested under studies as random effects to assess whether there was significant heterogeneity across studies. Next, we took a machine learning approach to build a meta-analysis model by utilizing an adapted random forest algorithm available through the MetaForest package (van Lissa, 2020). During data extraction, we identified the following potential moderators for which we could find sufficient data in the literature: clinical status of sample, sample age, the number of SST trials, the stop signal probability, the SSD method, the modality of the stop signal, and the difficulty of the primary go task. We first reduced this potential moderator list by 100-fold replicated feature selection, keeping only those moderators that showed a positive variable importance in more than 50 % of the replications. This left us with three candidate moderators: age group of participants, stop signal probability, and SSD method. Next, we used group-wise (with group = study) leave-one-out cross-validation to tune the model hyperparameters via the caret package (Kuhn et al., 2023). To do this, we grew 5000 random trees with uniform, fixed and random-effect weights, 2–3 potential moderators per split, and allowing 2–6 cases per node. The model with the lowest root mean squared error used fixed weights, 3 variables per split, and at least 2 observations per node, and had positive estimates of predictive performance (out-of-bag R² = 10 %, cross-validated R² = 32 %).

Results
We analyzed behavioral, EMG and EEG data from the stop signal task in three different studies, each having two sessions separated by an interval of 1 (study 1 and 2) or 2 (study 3) weeks. Across all samples and sessions, we found that task performance was similar to previous studies including young healthy participants, with an overall go reaction time of 529 ms, an unsuccessful stop reaction time (USRT) of 470 ms, and an SSRT of 199 ms. Task performance was largely similar between sessions (Fig. 2a) as well as between studies (Supplementary Figures 1 and 2). For summaries of task performance, see Table 1.
As an alternative behavioral measure, we analyzed EMG activity recorded over the responding muscle (Fig. 2d). Across all samples and sessions, we detected EMG bursts in 99 % of go trials, 99 % of unsuccessful stop trials and 31 % of successful stop trials. The peak latencies of those bursts were 520 ms for go trials and 460 ms for US trials (estimated using the average trial-level method), thus mirroring the go and US reaction times. PrEMG in successful stop trials peaked at 163 ms, markedly earlier than the SSRT, which is consistent with previous reports (e.g., Atsma et al., 2018; Jana, Hannah, Muralidharan, & Aron, 2020; Raud et al., 2022; Raud & Huster, 2017; Thunberg et al., 2020).
We also analyzed EEG activity in both the time and time-frequency domains, focusing on the amplitudes and latencies of the N2/P3 complex of the event-related potential in all trial types as well as the beta burst rate in successful stop trials. As can be seen in Fig. 2c, ERP waveforms were largely similar in the test and retest session (see Supplementary table 2 for descriptive summaries of ERP parameters). On average, 12 % of successful stop trials contained at least one beta burst in the time between the trial-specific stop signal onset and an individual's SSRT (see Fig. 2b for exemplar trials), which is similar to the frequency reported in earlier studies, both in humans (Jana et al., 2020; Wessel, 2020) and in monkeys (Errington, Woodman, & Schall, 2020).

Are SST variables stable over time?

Variables reflecting manifest behavior show high test-retest reliability, but variables presumably reflecting a latent inhibitory process do not

Overall, we found that behavioral variables reflecting temporal aspects of manifest behavior (e.g., reaction times, stop signal delays) showed high test-retest reliability (all M r > .7), suggesting that these measures are quite stable across time (see Fig. 3a for an overview of behavioral reliability coefficients). Note that reliability was higher for the central tendency of these measures than for their variability, meaning that intraindividual variation was less stable in these samples. The SSRT, presumably reflecting the speed of a latent inhibitory process, showed low test-retest reliability. To further investigate whether the low test-retest reliability for the SSRT appeared primarily driven by its calculation as a difference measure or whether it could suggest a more general instability over time for inhibitory control measures, we also estimated the reliability for EMG-derived variables. The prEMG peak latency has been suggested as a measure of individual stopping latency (Atsma et al., 2018; Raud & Huster, 2017), thus providing an alternative behavioral measure of stopping speed that does not come with the potential issues often accompanying difference scores. In the EMG data, we focused on estimates of peak latency in different trial types, their intraindividual variability, as well as the potential impact of methodological choices.
We again found patterns of high and low test-retest reliability within the different EMG measures (Fig. 3b), suggesting that the method itself can provide stable measures, but that not all measures necessarily reflect aspects stable over time. The EMG peak latencies in unsuccessful stop (US) trials and go trials showed overall acceptable test-retest reliabilities (again with intraindividual variability measures less stable), thus mirroring our reaction time findings. While the rate of prEMG bursts appeared to be somewhat stable over time as well (M r = .69), the peak latency of prEMG in SS trials showed considerably lower stability. This was most prominent for estimates from participant-level waveforms (M r = .23), but the test-retest reliability for single-trial estimates was also low (M r = .58). However, the trial-level estimates showed higher test-retest reliability in study 2 and 3 compared to study 1. This could suggest a difference between estimation methods, with some measures of prEMG peak latency being stable over time and others not.
The low test-retest reliability of the participant-level prEMG latency could stem from occasional high-amplitude events. These could have a large impact on the average waveform, and a disproportionate effect on the peak latency. When using single-trial estimates, all individual peaks would be given equal weighting. However, this would not explain why both reliability measures are low in study 1. To probe this further, we started with a closer visual inspection of scatter plots (Supplementary figure 4) of all prEMG test-retest relationships. This suggested that some participants had an unusually large difference between test and retest, potentially skewing our correlation estimates. Inspection of the electrophysiological data suggested that those peak latency values represented genuine estimates, but to check their influence on our results we redid the analyses focusing on the potential impact of outliers at the trial-level, as well as at the participant-level. For the trial-level re-analysis, we re-estimated all peak latencies using the median of the single-trial distributions instead of the mean, as this would be more robust to potential outlier values remaining after our trial-level rejection. This revealed the same pattern as the estimates using the average values (Supplementary table 3), suggesting that more stringent trial-level cleaning would not increase reliability. To check the impact of potential outlier participants, we re-ran the prEMG peak latency analyses removing one participant from study 1 and one participant from study 2. This increased some reliability coefficients, decreased others, and left some practically unchanged (Supplementary table 4), thus suggesting that no overall patterns could be explained by the presence or absence of these participants either. Together, this suggests that while the prEMG might show good stability in some contexts, it could be drastically lower in other situations.
In sum, while several EMG variables showed good test-retest reliability, prEMG variables (i.e., those estimated from successful stop (SS) trials) were on average less reliable, again indicating that measures of individual stopping latency do not show stability over time.

Non-inhibitory EEG variables show high test-retest reliability, but inhibition-specific variables do not
Across studies, trial types, parameters and estimation methods, ERP measures reflecting the amplitudes (mean and peak) and latencies (onset and peak) of the N2 and P3 potentials tended to show high stability over time (see Fig. 4 for an overview of all test-retest reliability coefficients for ERP variables and Supplementary figures 5 and 6 for scatter plots), with several showing high enough test-retest reliability to have promise for individual difference work, especially when estimated from single-trial waveforms. Variables derived from successful stop trials also exhibited good reliability, with for instance P3 onset latencies showing an average test-retest reliability of .79 when estimated at the trial-level. This shows that measures collected during successful stopping can show high reliability in general. However, while potentially stable, neither of these ERP variables is commonly assumed to reflect an inhibitory process directly; rather, they are thought to contribute to the cascade of cognitive processing that combines to produce behavioral stopping.
Despite the overall high reliability for ERP variables, some measures stood out as having slightly lower overall reliability than others. For example, measures of N2 latency were relatively less stable than N2 amplitude or P3 latency measures (Fig. 4a and b). This was the case both when they were estimated from participant-level waveforms (N2 latency M r = .61 across measures, trial types and studies; P3 latency M r = .65 across measures, trial types and studies) as well as when derived from single-trial waveforms (N2 latency M r = .73 across measures, trial types and studies; P3 latency M r = .78 across measures, trial types and studies). In contrast to EMG peak latencies, which seemed to show selectively reduced reliability in successful stop trials, the relatively lower reliability for N2 latencies appeared present across trial types.
To investigate whether an EEG-derived measure more directly tied to inhibitory control would also show high test-retest reliability, we focused on burst-like activity in the beta band, which has recently been proposed as a marker for the initiation of inhibition. More specifically, we looked at the rate of successful stop trials which showed burst-like activity in the beta band in the time between the presentation of the stop signal and an individual's SSRT and estimated the test-retest reliability of these rates. We found that beta burst rate showed practically no stability over time (Fig. 4c), with test-retest coefficients of .03 (study 1), .14 (study 2) and .32 (study 3).
In sum, while some of the EEG-derived variables showed high test-retest reliability (with trial-level ERP measures in particular showing promise for individual difference work), beta burst rates, presumably reflecting the latent inhibitory process, were unstable and did not appear to reflect a stable trait.

No large between-study differences in test-retest reliability
Reliability scores can differ depending on the sample they are measured in, both due to random sampling variation and due to systematic between-study differences. While the analyzed studies were largely similar in design, they did consist of independent samples and had several smaller design differences. For instance, study 3 differed slightly from the others both in task design (such as a different SSD range) and measurement choices (such as placement of the EMG ground electrode). We had no a priori reasons to expect that these specific factors would influence the stability of SST measures, neither based on theory nor previous empirical work. However, to investigate if there were any larger between-study differences in test-retest reliability, we looked at the study-specific average test-retest reliability for each measurement modality.
Overall, we found that the studies had similar reliability estimates. For behavioral variables, the studies had average test-retest reliabilities of .55 (study 1), .53 (study 2) and .55 (study 3). For EMG variables, the studies had average test-retest reliabilities of .58 (study 1), .48 (study 2) and .58 (study 3). The somewhat lower reliability for study 2 seemed to stem from an overall lower reliability on several variables, but the within-modality reliability pattern did not appear different from that of the other studies. For EEG variables, the studies had average test-retest reliabilities of .70 (study 1), .78 (study 2) and .77 (study 3). The slightly lower reliability of study 1 variables seemed to stem from an overall lower reliability on several participant-level ERP measures. These did not seem to show a different within-modality pattern, however.
In sum, while there were some smaller modality-specific differences in test-retest reliability between the different studies, none of the studies appeared to show a different reliability pattern than the others, nor overall lower reliability across modalities.

Are SST variables consistent within a testing session?

While some variables derived from a stop signal task clearly can show high stability and provide the ability to differentiate between individuals over time, the variables presumably reflecting the speed or implementation of an inhibitory process (i.e., the SSRT, the prEMG peak latency and the stop-related beta burst rate) all showed low reliability, both when compared to other stop signal task measures and when compared to the test-retest reliabilities commonly found for trait-like measures. This inhibition-specific instability over time was found across measurement modalities and raises the possibility that these measures do not capture a trait-like aspect of behavior. However, it is possible that the low test-retest reliability for inhibition-related measures does not necessarily reflect a meaningful lack of stability, but rather that measurement noise causes these measures to lack internal consistency in general. To investigate this, we analyzed the internal consistency of key behavioral, EMG and EEG variables using split-half correlations.

Regardless of measurement modality, SST variables have high to excellent split-half reliability
The variables reflecting task performance all showed high split-half reliability (see Fig. 5 and Supplementary table 5 for an overview of split-half coefficients and their confidence intervals). Reaction times and SSDs showed excellent internal consistency (all r > .99), and the SSRT was highly consistent as well (M r = .86).
We also found high-to-excellent split-half reliability for the EMG peak latency measures (Fig. 5 and Supplementary table 6). The measures from go and US trials both showed excellent reliability (r > .95), and the prEMG peak latencies showed high split-half reliability as well (M r = .84).
The N2 and P3 amplitude and latency measures also showed high-to-excellent split-half reliability, with more than half the ERP variables having an r > .95 and nearly all showing an r > .90 (Fig. 5 and Supplementary table 7). There was a tendency for N2 latency measures to have lower split-half reliability than the other ERP measures, but on average, they showed high internal consistency as well (N2 onset latency M r = .87, N2 peak latency M r = .90).

Reduced internal consistency does not account for SSRT test-retest reliability
Despite all variables showing good internal consistency, we did spot a pattern of overlap between the variables showing low test-retest reliability and those that showed slightly lower split-half reliability.
Therefore, we re-calculated the test-retest coefficients for the SSRT and the prEMG peak latency to correct for possible attenuation. The deattenuated test-retest reliability for the prEMG peak latency now reached an overall acceptable level (M r = .73, r study 1 = .34, r study 2 = .85, r study 3 = .83), but the same was not the case for the SSRT, which remained unreliable over time (M r = .40, r study 1 = .23, r study 2 = .31, r study 3 = .62).

Does the reliability pattern generalize across reports from the literature?
While previous research has pointed to low test-retest reliability for inhibition and control measures, there also seems to be considerable heterogeneity in previously reported reliability estimates from the stop signal task (Enkavi et al., 2019). To see whether we could replicate the main pattern of between-variable differences discovered in our own data and identify any additional factors moderating the size of the reliability coefficients, we performed a systematic literature review.
In total, we identified 67 estimates of test-retest reliability and 82 estimates of internal consistency across a range of task performance variables, reliability approaches, and sample characteristics. We again found substantial heterogeneity in the reported coefficients. As test-retest reliability estimates for the SSRT and go reaction times using Pearson's r were reported most often, we focused our between-variable analyses on those estimates to see if we could replicate the pattern of low stability for inhibition measures but higher stability for measures of manifest behavior. To see if we could discover potential moderators of reliability, we focused on studies reporting split-half reliability for the SSRT.

The test-retest reliability of SSRTs decreases with time
As is evident from Fig. 6a, there was considerable variation between the estimates of SSRT test-retest reliability reported in the literature. Given that our own results indicated that the SSRT might be better conceptualized as reflecting state-dependent processing, we decided to see whether any of this heterogeneity could be explained by time, i.e., the duration of the interval between test and retest session. To do this, we performed a three-level meta-analysis with time as a potential moderator, thus quantifying the reliability-time relationship across 14 different experiments/conditions/groups from 9 different studies (Fig. 7a). The expected test-retest coefficient after a one-week test interval was low (r = .55), and there was a significant negative effect of time (b = −.011, z = −2.788, p = .005), showing that as the time between test and retest session increases, the temporal stability of the SSRT decreases.
Tentatively, a low SSRT test-retest reliability that decreases over time would seem to support the interpretation of our own results, meaning that the SSRT might reflect a variable with more state-like than trait-like contributions. However, it could also reflect a general tendency for the reliability of SST variables to decrease over time. To check whether this was the case, we performed an additional meta-regression to see if we could predict the reliability of go reaction times using time as a moderator, this time quantifying the reliability-time relationship across 6 different experiments/conditions/groups from 5 different studies (Fig. 7b). Here, however, the estimated coefficient after a one-week interval was higher (r = .74) and we found no significant effect of time (b = −.003, z = −.800, p = .423).
In sum, we again found a pattern of lower test-retest reliability for inhibition variables and higher reliability for manifest behavior. We also found a moderating effect of time for the SSRT only, raising the possibility that the decrease is specific to inhibition variables.

SSRT reliability appears lower in adulthood than in other age groups
Our initial meta-regression showed that SSRT reliability could be moderated by time. However, we also found significant residual heterogeneity (p = .029), suggesting that there is variation in reliability estimates that cannot be explained by time alone. Estimates of SSRT split-half reliability in the literature were overall high (r = .87), suggesting that the variation is unlikely to be caused solely by measurement noise. However, we also found significant heterogeneity in the estimates of internal consistency from the literature (p < .0001), further supporting that there could be other factors affecting reliability.
To explore this variation further, we performed a data-driven search for meaningful moderators using a random forest modeling approach, this time focusing on the split-half estimates we extracted from the literature. The variable showing the highest importance was participant age (Fig. 7c), with partial dependence estimates suggesting that the SSRT might show lower reliability in adult samples (defined as samples with a mean age above 30 but below 65) compared to younger or older age groups (Fig. 7d). We also found a potentially small effect of the approach used to determine the SSD, where the conventional staircase approach using 50 ms steps was associated with higher reliabilities than approaches using cumulative tracking or fixed SSD intervals chosen relative to the reaction time (Fig. 7e). Due to the relatively low number of studies using other approaches, however, this is hard to investigate with much certainty. The experiment-wide stop signal probability did not appear to influence SSRT split-half estimates at all (Fig. 7f), but note that the range of stop signal probabilities used was restricted in the first place.
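Schematically, this kind of moderator search can be expressed as below. The moderator names and data are hypothetical stand-ins, and the actual analysis may have differed in model settings and software; the sketch only illustrates the combination of a random forest, permutation importance, and partial dependence:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance, partial_dependence

rng = np.random.default_rng(1)

# Hypothetical data: one row per split-half estimate from the literature.
X = pd.DataFrame({
    "mean_age":  rng.uniform(8, 75, 80),     # sample mean age
    "ssd_50ms":  rng.choice([0, 1], 80),     # 1 = conventional 50 ms staircase
    "stop_prob": rng.uniform(.20, .33, 80),  # experiment-wide stop probability
})
# Fabricated outcome with a reliability dip for mid-adulthood samples:
y = .9 - .1 * ((X.mean_age > 30) & (X.mean_age < 65)) + rng.normal(0, .05, 80)

forest = RandomForestRegressor(n_estimators=500, random_state=1).fit(X, y)

# Permutation importance ranks the candidate moderators.
imp = permutation_importance(forest, X, y, n_repeats=20, random_state=1)
print(dict(zip(X.columns, imp.importances_mean.round(3))))

# Partial dependence traces the marginal effect of age on predicted reliability.
print(partial_dependence(forest, X, features=["mean_age"])["average"])
```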

Discussion
We aimed to characterize the test-retest reliability and internal consistency of a range of common behavioral and electrophysiological measures from a stop signal task. All variables showed high split-half reliability, suggesting that they are consistent within a testing session. However, across studies and measurement modalities, we only found good test-retest reliability for variables reflecting manifest behavior and non-inhibitory processing. Variables presumably reflecting the latency and implementation of inhibition (the SSRT, the prEMG peak latency, and the beta burst rate in successful stop trials) showed low temporal stability. We interpret this as inhibition measures reflecting situational factors more than stable trait-like influences. In support of this interpretation, a meta-analysis revealed that SSRT test-retest reliability estimates are moderated by the duration of the retest interval, with longer intervals associated with lower reliability. This relationship was not found for go reaction times. We conclude that there is little support for the common conceptualization of inhibition as a stable and dispositional ability.

Measures of inhibition are unstable over time
Common inhibition measures, while consistent within a session, lack stability over time. This suggests either that these measures are primarily influenced by state-like factors (such as arousal or mood) or that the underlying ability changes rapidly over time (for instance due to task experience and applied strategies). The low test-retest reliability of the inhibition measures poses a serious challenge to findings relying on them to test and develop individual difference theories. This is especially problematic for the SSRT. Given its status as the gold-standard measure of inhibition (Aron & Poldrack, 2005; Crosbie et al., 2008), the SSRT is often used under the assumption that inhibition is a trait-like ability.
This assumption has given credence to the notion that inhibition measures can be useful clinical measures, reflecting impulsive symptoms or a more general vulnerability for psychopathology. Accordingly, much of our knowledge about inhibitory control differences depends upon the SSRT's validity as a measure of a stable ability.
That inhibition measures mainly reflect state-dependent and situational behavior is further supported by our meta-analysis showing that the SSRT has low test-retest reliability after a one-week interval, and that this further decreases over time. Initially, a decrease in SSRT stability might be expected even if it is trait-like, given that a dispositional ability presumably would be amenable to development and change. After all, decreases in test-retest reliability suggesting slow changes over time are found in other areas studying trait stability (Calamia et al., 2013; Roberts & DelVecchio, 2000). However, such studies tend to show that stability decreases slightly over years. For instance, Roberts and DelVecchio (2000) predicted a .03 drop in the test-retest reliability of personality measures if the test-retest interval increased from 1 to 5 years. In comparison, our model predicts that for the SSRT, this drop happens with an interval increase of 3 days. Combined with the SSRT's lack of utility for predicting a range of real-life outcomes (Von Gunten, Bartholow, & Martins, 2020), there seems to be little support for its use as a stable trait measure. This lack of stability over time should be especially concerning for work aiming to utilize the SSRT as an endophenotype, or as a criterion for validating potential biomarkers of psychopathology. Over decades, the SSRT has been suggested as an endophenotype for attention-deficit/hyperactivity disorder in particular (e.g., Aron & Poldrack, 2005; Crosbie et al., 2008; Rommelse et al., 2009), but also for autism spectrum disorder (Schmitt et al., 2019), obsessive-compulsive disorder (Chamberlain, Blackwell, Fineberg, Robbins, & Sahakian, 2005, 2007), and for transdiagnostic symptoms like impulsivity (Congdon et al., 2008; Fineberg et al., 2010; Fortgang et al., 2016). For instance, as impaired inhibition is considered central to a range of disorders, Arnatkeviciute et al. (2023) recently used the SSRT as an endophenotype to investigate genetic factors influencing inhibitory control. Despite showing significant heritability, no significant genome-wide associations could be found, suggesting a highly complex and polygenic architecture. However, since variations in SSRTs appear to be primarily state-driven and thus do not show the temporal stability and state-independence required of an endophenotype (Gottesman & Gould, 2003; Hasler, Drevets, Gould, Gottesman, & Manji, 2006), it is hard to know what this heritability would reflect and how it would help us understand the genetic influences shaping vulnerability to psychopathology.
A lack of temporal stability is often discussed primarily in terms of its implications for clinical and individual difference work, as opposed to studies on group differences and basic cognitive neuroscience (Blair, Mathur, Haines, & Bajaj, 2022; Enkavi & Poldrack, 2021; Hedge, Bompas, & Sumner, 2020; Karvelis, Paulus, & Diaconescu, 2023; Rodebaugh et al., 2016). In the latter category, the SSRT is central to much work aiming to find neural correlates of inhibition using the stop signal task, for instance by using its latency as an upper temporal limit for when inhibition-related processing must occur (e.g., Errington et al., 2020; Swann et al., 2011; Wagner, Wessel, Ghahremani, & Aron, 2018) or by using correlations between the SSRT and other measures as evidence that they index the same latent process (e.g., Wessel & Aron, 2015). While finding neural correlates of inhibition comes with unique challenges (Huster et al., 2020; Schall & Godlove, 2012), it does not necessarily require temporal stability. However, one should be mindful that interpreting such markers as indicators of stable dispositions is likely to be flawed, and that the utility of such markers for individual decision-making or clinical diagnostics is therefore low. Consequently, low temporal stability means that basic neuroscience work also requires caution in the study of inhibition, even if mainly in its interpretation rather than its execution.

Temporal instability does not reflect measurement noise
Low test-retest reliability for inhibition measures could reflect unsystematic error and measurement noise, rather than meaningful state-driven intraindividual fluctuations. This could happen because of a reliance on difference measures, if there is too little interindividual variation in stop signal task performance to consistently differentiate between individuals (Hedge et al., 2018), and/or if trial-level variability causes too much uncertainty in the person-level estimates that reliability is calculated from (Chen et al., 2021; Rouder & Haaf, 2019). However, whereas difference scores are often assumed to be inherently unreliable, they can show high reliability as well (Clayson, Baldwin, & Larson, 2021; Trafimow, 2015; Zimmerman, 2009). In our case, we found low temporal stability for inhibition measures regardless of whether they were calculated as difference scores or not. Furthermore, our estimates of split-half reliability showed that all measures, including those presumably measuring inhibitory control, consistently rank participants within a single session. This was in line with the high split-half reliability of the SSRT reported in the literature as well, and these previous reports did not appear to be moderated by the number of trials they were based on. Lastly, accounting for the internal consistency of the SSRT did not raise its temporal stability to an acceptable level. Hierarchical modelling approaches that account for trial-level uncertainty and deattenuation of correlation coefficients share the same underlying assumptions about measurement noise, and the two approaches have been found to give similar estimates (Rouder & Haaf, 2019). Consequently, as deattenuating the SSRT test-retest coefficients did not have a notable impact in our own work, we do not have strong reasons for believing that hierarchical model estimates would show higher stability. In sum, measurement noise does not seem to be the primary factor causing inhibition measures to show instability over time.
In earlier work, the high internal consistency of the SSRT has been attributed to an artefact of methodology (Hedge et al., 2018), stemming from the combination of an odd-even split of the data with dynamic SSD adaptations. In general, both odd-even and first-last splits of the data can cause systematic confounds due to task design and time-on-task effects. In the stop signal task, however, dynamic design confounds might actually reduce split-half reliability, rather than increase it (Pronk, Molenaar, Wiers, & Murre, 2022). Permutation-based approaches that sample random trials provide a more robust estimate of internal consistency (Parsons et al., 2019; Pronk et al., 2022). In line with others using robust approaches (Pronk et al., 2022; Wang et al., 2021), we found high estimates of internal consistency for the SSRT, further supporting that it is not a methodological artefact of how the data were split into halves. However, dynamic SSD adaptations could impact the reliability measures regardless. Our search for potential moderators of internal consistency suggested that staircase tracking might be associated with higher split-half reliability. However, as the studies using other approaches were few and still relied on some sort of personalized adjustment, these results should be interpreted with caution. Future studies that use fixed SSDs (as opposed to tracked ones), that compare different approaches to SSD adjustment, and/or that contrast different ways of summarizing the SSD distribution when calculating the SSRT could contribute to our understanding of this. For now, we conclude that the internal consistency of the SSRT does not appear to be merely an artefact of the reliability calculation, nor of the task design.
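As an illustration of the permutation-based logic, consider the following minimal sketch with simulated reaction times (not our full pipeline; real use would summarize each half into an SSRT or another task measure before correlating):

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated data: 50 participants x 200 trials with stable person-level offsets.
rts = rng.normal(450, 60, size=(50, 200)) + rng.normal(0, 40, size=(50, 1))

def permutation_split_half(data, n_perm=1000):
    """Correlate person-level means from random half-splits of trials,
    applying the Spearman-Brown correction to full test length."""
    n_trials = data.shape[1]
    out = np.empty(n_perm)
    for i in range(n_perm):
        idx = rng.permutation(n_trials)
        a = data[:, idx[: n_trials // 2]].mean(axis=1)
        b = data[:, idx[n_trials // 2:]].mean(axis=1)
        r = np.corrcoef(a, b)[0, 1]
        out[i] = 2 * r / (1 + r)  # Spearman-Brown step-up
    return out.mean(), np.percentile(out, [2.5, 97.5])

mean_r, ci = permutation_split_half(rts)
print(round(mean_r, 2), ci.round(2))  # mean with 2.5th/97.5th percentiles
```

Randomly re-sampling the split on every iteration avoids the systematic odd-even and first-last confounds discussed above.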

Manifest behavior and non-inhibitory processing show higher temporal stability
Measures of manifest behavior and non-inhibitory processing were overall more stable across time, which could suggest that situational factors do not influence all aspects of stop signal task performance equally, but affect action stopping specifically.
The higher stability of go reaction times was supported by our meta-analysis, which indicated generally acceptable one-week test-retest reliability and little decrease over time. Reaction time stability was further supported by the high test-retest reliability of processes presumably contributing to reaction time latency, here indexed by the N2 and P3 potentials during go trials. For some of the manifest behavior measures, test-retest reliability approached the level seen for traits like personality and intelligence. However, both the test-retest coefficients estimated on our own data and those identified in the literature were based on relatively short test-retest intervals. Consequently, we do not know whether these results suggest that reaction time measures are dependable only in the short term, or whether they also show long-term temporal stability. This could pose a problem for individual difference studies, since reaction times, parameters derived from them, and other non-inhibitory aspects of stop signal task performance are increasingly used to investigate psychopathology, development, as well as brain structure and function (e.g., Cañigueral et al., 2023; Mirajkar & Waring, 2023; Wiker, Norbom, et al., 2023; Wiker, Pedersen, et al., 2023). Future studies should aim to assess the long-term test-retest reliability of reaction time measures and parameters derived from them, which would be needed to conclude that they reflect stable dispositions, as well as factors that might influence their stability and change.
ERP-derived measures of non-inhibitory processing also showed high overall stability in stop trials, both when stopping was successful and when it was not. At first, it might seem counterintuitive that processes presumably contributing to stopping are stable, whereas the stopping speed itself is not. However, the N2 and P3 amplitudes showed the highest test-retest reliability, and these amplitude measures are not consistently associated with the SSRT in the first place (Huster et al., 2020). Of the latency measures, the P3 latency was overall more stable than the preceding N2 latency. However, the P3 occurs after stopping has happened and presumably indexes a mix of processing related to performance monitoring, goal maintenance, and evaluation (e.g., Huster et al., 2020; Sajad et al., 2022). As action stopping might be highly automatic and thus too fast for conscious access and evaluation (Raud et al., 2020; Verbruggen, Best, Bowditch, Stevens, & McLaren, 2014), its test-retest reliability could then reflect interindividual differences in when stimulus and performance information becomes available for use in such higher-order operations. The N2 occurs closer to the time of stopping, and its onset might thus be more influenced by stopping speed. Interestingly, the N2 onset latency showed the lowest stability of the non-inhibitory processing measures, and it also appeared to be less consistent within a session, especially in successful stop trials. It is thus possible that situational factors that influence stopping speed have an indirect effect on the N2 latency as well.

Processing choices influence the stability of electrophysiological measures
We found that both EMG and ERP measures derived from the stop signal task generally showed good test-retest reliability. However, test-retest reliability for both ERP and EMG measures differed depending on whether participant-level parameters were extracted from participant-level or trial-level time courses, with trial-level estimates being more reliable regardless of measurement modality. Generally, this could mean that trial-level estimates are better starting points for studies hoping to investigate individual differences using electrophysiological measures.
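The distinction can be sketched as follows, using simulated waveforms with a hypothetical peak-plus-noise generative model (illustrative only; the actual EMG/ERP pipelines involve more preprocessing):

```python
import numpy as np

rng = np.random.default_rng(3)
times = np.arange(-200, 800)  # ms relative to stimulus onset

# Simulate 100 single-trial time courses for one participant:
# a Gaussian peak at a variable latency plus noise (made-up model).
latencies = rng.normal(300, 30, size=100)
trials = np.stack([
    np.exp(-0.5 * ((times - lat) / 50) ** 2) + rng.normal(0, .3, times.size)
    for lat in latencies
])

# Participant-level parameter: peak latency of the averaged waveform.
wave_peak = times[trials.mean(axis=0).argmax()]

# Trial-level parameter: peak latency per trial, then aggregated.
trial_peak = np.median(times[trials.argmax(axis=1)])

print(wave_peak, trial_peak)
```

Averaging before extraction smears latency jitter into the participant-level waveform, whereas per-trial extraction retains it, which may be one reason the two approaches yield estimates with different reliabilities.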
For both the EMG and ERP analyses, we used the same processing pipeline for participant- and trial-level estimates; the only difference between them was thus the level the parameters were extracted from. However, preprocessing choices have been found to influence reliability estimates both in work using electroencephalography (Suarez-Revelo et al., 2016) and magnetic resonance imaging (Zuo et al., 2013). This is not always the case, though (Chowdhury et al., 2023; Klawohn, Meyer, Weinberg, & Hajcak, 2020), suggesting that some measures could be more sensitive to initial processing than others. Additionally, while there are general recommendations for which participants and/or trials one might want to exclude when analyzing stop signal task performance (Bissett, Jones, Poldrack, & Logan, 2021; Verbruggen et al., 2019), there are few generally accepted criteria for electrophysiological measures. As reliability is normally calculated at the group level, which participants are included or excluded can impact both split-half and test-retest reliability coefficients.
Together, this suggests that reliability estimates could differ between contexts utilizing different processing pipelines and/or different criteria for excluding trials and participants. While we hope our results can provide some recommendations for individual difference work, more extensive future investigations would contribute to our understanding of the factors influencing the reliability of electrophysiological measures in the stop signal task. This could, for instance, involve more stringently assessing the impact of outlier criteria at different levels (Congdon et al., 2012), assessing reliability at the level of individuals (Clayson, Brush, & Hajcak, 2021), and/or assessing whether the reliability of different parameters is robust across a range of reasonable processing choices (Clayson & Miller, 2017; Klawohn et al., 2020). Additionally, as our findings were limited to stability over the course of weeks, assessing the long-term test-retest reliability of these measures would be necessary to properly understand their stability.

Generalizability of findings
We found the same pattern of unstable inhibition measures and stable non-inhibition measures across three independent studies, and it was further supported by our meta-analyses of test-retest estimates from the literature. Based on this, we expect the main patterns of our results to generalize to future work in similar contexts. An important avenue for future research could be to identify whether there are populations where this pattern fails.
For instance, we know less about the psychometric properties of measures derived from the stop signal task in specific patient populations. Our meta-analysis of the internal consistency of the SSRT indicated that whether the coefficients came from a clinical sample or not had no systematic impact on split-half reliability, but we did not find enough data to look at specific patient populations. For test-retest reliability, we could only identify four SSRT test-retest estimates (from three studies) calculated in clinical samples. They ranged from r = .12 (Czapla et al., 2016) to r = .76 (Kräplin et al., 2016), with considerable heterogeneity in between. Consequently, we currently have no reason to believe that the SSRT has different psychometric properties in clinical populations in general, but we believe this could be more thoroughly characterized in future work.
In addition, our data-driven meta-analysis showed participants' age to be a potential moderator of the internal consistency of the SSRT, with lower split-half reliability estimates in adult samples compared to both younger and older age groups. This is in contrast to traits like personality, where internal consistency does not appear to show age-related modulations (Lee & Hotopf, 2005). It is possible that by the time we reach adulthood, we are more resistant to the state-like and situational factors that contribute to stopping variation in other age groups. Intuitively, though, this would result in more consistent individual differences, not less. However, action stopping unfolds rapidly and might be highly reliant on automatic processing (Jana et al., 2020; Raud et al., 2020; Verbruggen, Best, et al., 2014). Given that rapid stopping provides a limited time window for interindividual differences to play out, it is possible that stopping at some point becomes so skilled and automatic that its latency approaches ceiling levels, thus limiting genuine interindividual variation and appearing more as a fixed effect in the population. However, these findings could also reflect factors unspecific to action stopping itself (such as sampling differences in different age groups), indicating that action stopping measures might not be invariant to the population they are measured in. To begin answering this, an important first question for future work is whether these exploratory findings would reproduce in a more confirmatory setting. If they do, a potential next step would be to characterize measurement properties in different age groups and/or the individual processes that contribute to action stopping and their variability (or lack thereof) through the lifespan.

Open questions
We have focused on the reliability of different proposed inhibition measures and the validity of interpreting them as trait-like, but we have not aimed to answer how these measures relate to each other or to inhibition as a construct.
Neither the SSRT, the prEMG peak latency, nor the beta burst rate is derived from a theory that specifies inhibition as a stable and dispositional ability, even though such a property might be implicitly assumed when they are used. Thus, while the lack of stability undermines their validity as trait measures, it does not necessarily invalidate them as measures of inhibition. Importantly, however, our findings cannot validate them as such either. Even though high split-half reliability suggests that these measures provide consistent scores within a session, this does not tell us whether they consistently measure inhibition. It also cannot tell us whether the different measures relate to each other, or whether they are influenced by the same state-like factors. While our findings are silent on these questions, we do think they still need answers. For instance, some have questioned whether beta band activity measures can be considered valid indices of an inhibitory process (Errington et al., 2020; Mosher, Mamelak, Malekmohammadi, Pouratian, & Rutishauser, 2021; Schall & Godlove, 2012). There is also uncertainty about the degree to which the SSRT and prEMG peak latency reflect the same influences (Raud et al., 2022). And while some relationships between SST variables seem to be present across repeated testing (Thunberg et al., 2020) and other measures show associated changes over time (Chowdhury et al., 2020), we know relatively little about whether this reflects shared reliance on some kind of state-like influence. Consequently, we think there are many pressing questions about the validity of different inhibition measures, how we expect them to behave, and the degree to which they reflect shared influences.

Furthermore, differences in inhibition measures have repeatedly been linked to other factors, such as aging and psychopathology. Many of these studies have relied on differences between groups, not between individuals. Our results do not invalidate group-level findings or suggest that such findings are not robust. However, they do raise the question of how we should interpret group-level differences. For instance, it is common to think of inhibition scores as something that varies along a single dimension, suggesting that people could be divided into good or bad inhibitors. However, since individual inhibition scores seem to largely rely on unspecified and transient factors, an open question is how to best characterize the inhibition score space. Could inhibition scores be interpreted not only based on their distance to each other, but also based on their relative positions in a higher-dimensional space? Could we identify which dimensions such a space would consist of? If so, would different populations tend to cluster in different regions within that space, even though individual positions within a potential cluster might fluctuate? What would it take to move between regions, rather than just within a region? Currently, we can only speculate on what might be interesting questions, and we do not aim to answer them here. Given the potential importance of inhibition for a range of life outcomes, however, we hope questions like these will be explored further in future work.

Future directions: is there hope of stability beyond the SSRT?
Many methodological causes have been suggested to explain the low reliability of task-based control measures, such as task designs leading to little interindividual variation, a failure to consider the uncertainty of person-level estimates, and a reliance on difference measures. While they all come with unique challenges for the field to meet, they also ultimately suggest that methodological refinement might provide solutions to low test-retest reliability. Either we can develop new tasks or modify old ones (Zorowitz & Niv, 2023), we can move to generative modeling and hierarchical approaches (Chen et al., 2021; Haines et al., 2020; Rouder & Haaf, 2019), or we can find alternatives to the difference measures we often rely on (Hedge et al., 2018; Raud et al., 2022).
For stop signal task measures, two of these possibilities seem fruitful to explore in the short term. First, while the test-retest reliability of the prEMG peak latency was low on average, two of the data sets we analyzed showed better test-retest reliability when peak latency was estimated at the trial level. We found no apparent reason for the low test-retest reliability in the remaining study, however, suggesting that stability could be influenced by situation and context. Future studies aiming to delineate such influences could help determine if the prEMG peak latency tends to show stability over time, or if such stability requires specific testing contexts. Second, modelling approaches have been found to increase estimates of test-retest reliability for a range of cognitive task measures (Chen et al., 2021; Haines et al., 2020; Rouder & Haaf, 2019). While our deattenuated correlation coefficients suggest that there is little to gain from hierarchical modelling of traditional SSRT estimates, the deattenuated prEMG test-retest reliability suggested that there could be a benefit in aiming to increase its precision, for instance by modelling or by increasing the number of trials with prEMG. Additionally, we think there is promise in exploring behavioral stop signal task models with a wider set of parameters that might ultimately describe the data better. Specifically, traditional SSRT estimates can only provide single point estimates of stopping latencies, thus limiting information about intraindividual variability. Furthermore, they do not account for failures to trigger stop-related processing in the first place (Matzke, Love, & Heathcote, 2017). It has been argued that trigger failures are a more likely cause of the inhibition impairments seen in attention-deficit/hyperactivity disorder than slower inhibitory control itself (Weigard, Heathcote, Matzke, & Huang-Pollock, 2019), and they are likely implicated in the inhibitory performance of people with schizophrenia (Matzke, Hughes, Badcock, Michie, & Heathcote, 2017). Trigger failures also seem to be more strongly related to measures of impulsivity than the SSRT is (Skippen et al., 2019). Future studies should aim to investigate whether trigger failure parameters and/or distributional parameters have psychometric properties more in line with what is expected of a stable ability.
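To make the trigger-failure idea concrete, here is a minimal race simulation under simplified Gaussian assumptions (a sketch with made-up parameter values, not the hierarchical Bayesian models cited above):

```python
import numpy as np

rng = np.random.default_rng(11)

def stop_failure_rate(n=100_000, ssd=250, p_tf=.15,
                      go_mu=450, go_sd=80, stop_mu=200, stop_sd=40):
    """Independent race: the go finish time races SSD + stop finish time.
    With probability p_tf the stop process is never triggered, so the
    response always escapes inhibition on those trials."""
    go = rng.normal(go_mu, go_sd, n)
    stop = ssd + rng.normal(stop_mu, stop_sd, n)
    triggered = rng.random(n) >= p_tf
    return ((go < stop) | ~triggered).mean()

print(stop_failure_rate(p_tf=.00))  # failures produced by the race alone
print(stop_failure_rate(p_tf=.15))  # the same race, inflated by trigger failures
```

Because trigger failures inflate the observed response rate on stop trials, conventional SSRT estimates that ignore them are biased, which is part of the motivation for estimating such parameters explicitly.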
A potentially more long-term avenue of exploration could be to investigate task alternatives to the SST. For instance, the anticipated response inhibition task (ARI; Slater-Hammel, 1960) is an alternative to the more widely used stop signal task. Early investigations revealed that response time precision in the ARI appears to have good test-retest reliability (Slater-Hammel, 1960). However, while ARI-derived SSRTs appear to be stable at the mean level both within and across sessions (Hall, Jenkinson, & MacDonald, 2022; Leunissen, Zandbelt, Potocanac, Swinnen, & Coxon, 2017), we currently do not know whether these alternative SSRT estimates allow for consistently rank-ordering individuals. One limiting factor for their use as interindividual difference measures could be that ARI-derived SSRTs show low levels of interindividual variation compared to SST-derived estimates (Leunissen et al., 2017), potentially limiting the ability to find consistent differences in the first place. Thus, while other task approaches to estimating stopping latencies could provide reliable measures, whether they actually will is still an open question.
But even if methodological refinement reveals stable measures associated with inhibition and action stopping, this would not support the continued use of the traditional SSRT estimate as a stable trait index, as is done today. And if no such stable inhibition measures can be found, the more substantive implication of low test-retest reliability, namely that action stopping is best conceptualized as an act with many situational determinants rather than a stable dispositional ability, is harder to get around. We might have no other choice than to revise our assumptions and theories accordingly.

Conclusion
Inhibition is often conceptualized as a trait-like, dispositional ability, but psychometric investigations of cognitive tasks have shown that common inhibition variables tend to be unstable over time. By synthesizing the pattern of reliability estimates from the stop signal task across several measurement modalities, three independent studies, and several meta-analyses, we show that there is little support for the validity of conceptualizing common inhibition measures as reflecting stable, dispositional traits. Measures of manifest behavior and non-inhibitory processing might show higher trait-like influences, but the longevity of such stability is still unknown. This has major implications for studies using inhibition measures to test and develop individual difference theories, as well as for the hope of using these measures to guide the search for biomarkers and endophenotypes of psychopathology. As inhibition measures appear to be determined mainly by state-like and situational factors, we discourage their continued use as indices of an underlying trait-like ability.

Open practices section
The study in this article has earned Open Data and Open Material Badges for transparent practices. The data and material used in this study are available at: https://osf.io/dcb6g

CRediT authorship contribution statement

Fig. 1 – Overview of stop signal task.

Fig. 2 – Common behavioral and electrophysiological measures in the stop signal task. a. Task performance across all studies for each session separately. b. Trial-level data from electrode FCz for one exemplar participant. The plots show time-frequency decompositions of successful stop trials time-locked to go signal onset. In these trials, one or more beta bursts (marked with arrow) were detected in the time between stop signal onset and SSRT. The data are represented in magnitudes of the median, meaning that values below and above 6 fall below and above our burst detection threshold, respectively. The unshaded region shows the frequency region used for analysis. c. Event-related potentials from electrode Cz. The time courses show the mean (SEM) across all participants, estimated for each session separately. Segments are time-locked to go signal onset (GO ERP) and stop signal onset (US ERP, SS ERP). d. Electromyographic activity over the abductor pollicis brevis. The time courses show the mean (SEM) across all participants, estimated for each session separately. Segments are time-locked to go signal onset (Go EMG, US EMG) and stop signal onset (SS EMG).

Fig. 3 – Test-retest reliability for task performance and electromyographic variables. Figures show the test-retest estimates for each study, as well as the mean estimate across studies. a. Test-retest reliability for task performance variables. b. Test-retest reliability for SST-derived EMG variables. Parameters are estimated from both participant-level and trial-level waveforms, denoted by wave and trial, respectively. Abbreviations: GoRT – go reaction times, USRT – unsuccessful stop reaction times, SSD – stop signal delay, SSRT – stop signal reaction time, avg – average, SD – standard deviation, peaklat – peak latency, GO – go trials, US – unsuccessful stop trials, SS – successful stop trials.

Fig. 4 – Test-retest reliability for electroencephalographic variables. Figures show the test-retest estimates for each study, as well as the mean estimate across studies. a. Heatmap showing test-retest reliability for parameters from event-related potentials estimated at the participant level. b. Heatmap showing test-retest reliability for parameters from event-related potentials estimated at the trial level. c. Scatter plot showing the percentage of stop-related beta bursts in the first and second measurement session. Lines indicate least squares fits. Abbreviations: GO – go trials, US – unsuccessful stop trials, SS – successful stop trials, peakamp – peak amplitude, meanamp – mean amplitude, peaklat – peak latency, onsetlat – onset latency.

Fig. 5 – Split-half reliability estimates for behavioral and electrophysiological variables. Each plot shows the mean reliability estimate along with the 2.5th and 97.5th percentiles for each study and session separately. The vertical dashed line represents a reliability value of .7 for reference. Abbreviations: GoRT – go reaction time, SSD – stop signal delay, SSRT – stop signal reaction time, GO – go trials, US – unsuccessful stop trials, SS – successful stop trials, lat – latency, amp – amplitude.

Fig. 6 – Estimates of test-retest reliability for the stop signal reaction time. Note that some studies contributed several estimates, either from different experiments, different conditions, or different groups. a. Overview of all identified SSRT test-retest reliability coefficients. b. Overview of the subset of reliability coefficients that met our inclusion criteria for the meta-analysis.

Fig. 7 – Meta-analysis of test-retest and split-half reliability coefficients. a.-b. Test-retest coefficients for (a) stop signal reaction times and (b) go reaction times plotted against the duration of the retest interval. Each circle represents a reported reliability estimate, with the circle size indicating the meta-analytic weighting. The lines illustrate the estimated relationship between reliability and interval duration, whereas the shaded region shows the slope uncertainty. c. Variable importance estimates for potential moderators of SSRT split-half reliability. d.-f. Partial dependence estimates for age, stop signal delay method, and the experiment-wide probability of a stop signal as potential factors moderating SSRT split-half reliability.

Table 1 – Summary of task performance for all studies and all sessions.