Authors’ reply to the commentary on “Establishing norms for error-related brain activity during the arrow Flanker task among young adults”

In their commentary on our article, "Establishing norms for error-related brain activity during the arrow Flanker task among young adults" (Imburgio et al., 2020), Clayson and colleagues (this issue) voiced their concerns about our development of norms for an event-related potential measure of error monitoring, the error-related negativity (ERN). The central flaw in their commentary is the idea that because we don't know all the factors that can affect the ERN, it should not be normed. We respond to this idea, while also reiterating points made in our original manuscript: a) at present, the reported norms are not intended to be used for individual clinical assessment and b) our norms should be considered specific to the procedures (i.e., recording and processing parameters) and task used (i.e., arrow Flanker). Contrary to Clayson and colleagues' claims, we believe that information about the distribution of the ERN (i.e., our norms) in a large sample representative of those used in much of the ERN literature (i.e., unselected young adults) will be useful to the field and that this information stands to increase, not decrease, understanding of the ERN.


We have norms for many different measures without fully understanding their variability. Yet Clayson and colleagues (Clayson et al., 2021) argue that because we do not know all the factors that can affect the error-related negativity (ERN), it should not be normed. One point they make is that the ERN should not be normed because the reasons for individual differences in the ERN are sometimes unclear. It is certainly true that the ERN reflects contributions from multiple, overlapping phenotypes (Hajcak et al., 2019), meaning that bivariate correlations (rather than, for example, regression-based models) are likely to fail at painting a cohesive picture of how the ERN relates to psychopathology. Pointing to conflicting or null zero-order correlations is therefore not particularly useful at this point in time. Along these lines, the National Institute of Mental Health's (NIMH) Research Domain Criteria (RDoC) initiative (and contemporary thinking on psychopathology more broadly) suggests that psychopathology should be conceptualized along multiple dimensions, rather than categorically. As such, the utility of between-group norms (e.g., separate norms for clinically anxious patients and healthy controls) is at present unclear. Our sample was not selected to be psychiatrically healthy, and we would expect it to provide normative variation in phenotypes related to the ERN.
Despite overlapping associations between the ERN and various dimensional phenotypes, the ERN has, nonetheless, been robustly associated with clinical anxiety (e.g., Cavanagh et al., 2017; Moser et al., 2013; Riesel et al., 2011; Saunders and Inzlicht, 2020; Stern et al., 2010; Weinberg et al., 2012). Indeed, this is reflected in our norms. For example, among datasets that used the same task and data collection/processing procedures as in our norms, participants with clinical anxiety (i.e., generalized anxiety disorder) were found to have ERN difference scores that fell in the 75th-90th percentile range according to our norms (−9.24 μV in Weinberg et al., 2010; −8.69 μV in Weinberg et al., 2012), whereas psychiatrically healthy individuals from the same datasets fell around the 50th percentile (−5.20 μV and −5.38 μV, respectively). This demonstrates the utility of our norms for anchoring and contextualizing group results, which was not possible before. Nonetheless, categorization at the individual level (e.g., for psychiatric decision-making) involves other considerations (e.g., sensitivity and specificity) and is not advisable at the current time.
Instead, the norms will be useful in different ways. As exemplified above, researchers with ERN data (who used the same, widely used task and processing stream) will now have an idea of how their data align with those of a large sample representative of the samples frequently used in ERP research (i.e., unselected young adults). This will be useful not just for clinical samples, but also for ascertaining where ERN data from any new sample fall relative to the population. By analogy, when assessing associations between depression symptoms and reaction time, it is useful to know whether the observed depression scores represent only a narrow segment or the full array of scores possible in the broader population. In addition to benchmarking study results, our norms may also facilitate the recruitment or stratification of research samples according to the size of the ERN. As described in our original manuscript, and as advocated by the NIMH (e.g., Insel et al., 2010), research participants can now be recruited according to their scores on neurophysiological variables that correspond to variation along dimensions of relevance to psychopathology. By flipping the traditional research design on its head (i.e., dependent variables become independent variables), it may be possible to gain a greater understanding of both normal and abnormal functioning. Before the publication of our norms, such an approach was not feasible for the ERN.
Clayson and colleagues also raise the issue of nuisance factors. Undoubtedly, multiple factors can affect the magnitude of an ERP component (Frodl et al., 2001; McCarthy and Wood, 1985; Urbach and Kutas, 2002), though skull thickness appears to have a different effect than Clayson and colleagues describe. MRI-EEG work indicates that skull thickness has an additive (not multiplicative) effect on ERPs (Frodl et al., 2001), which would preserve the validity of difference scores, consistent with our recommendation in the original manuscript that difference scores may be more appropriate for normative comparisons. Generally speaking, however, any measure in the fields of psychology and neuroscience will be subject to some contamination by nuisance variables. For instance, performance on a cognitive test may be lower when an individual is sleep-deprived, yet norms for such tests are still considered informative. Indeed, sample sizes for norms are large for exactly this reason: so that the contribution of individuals who are not representative of the population is attenuated.
Clayson and colleagues also note that there are differences in the way researchers measure the ERN. The pooled data presented in our norms were derived using only the arrow version of the Flanker task, with stimuli, timing, and instructions consistent across datasets. Here, we reiterate that our norms should only be assumed to hold for this (widely used) arrow version of the Flanker task. We also note that the ERN from this task has been shown to have high internal consistency (Clayson, 2020; Foti et al., 2013; Olvet and Hajcak, 2009; Riesel et al., 2013), rendering Clayson and colleagues' point about varied internal consistency in the Clayson meta-analysis (Clayson, 2020) irrelevant. We did find that ERN values varied somewhat with the length of the task, though it was not clear that this variation mattered much overall (Imburgio et al., 2020).
To emphasize how different ways of quantifying the ERN can affect resulting values, Clayson and colleagues point to the difference between the mean raw ERN (i.e., not the difference score) of the 326 males in our norms (3.18 μV) and that of the 429 males in Fischer and colleagues (−5.37 μV; Fischer et al., 2016), who used a different means of scoring the ERN. Using the error-minus-correct difference score (as recommended in our manuscript; Imburgio et al., 2020), our mean amplitude of −6.75 μV is actually quite similar to the −6.80 μV reported by Fischer et al. (2016). Nonetheless, we agree that measurement decisions, such as what task to use, how to clean the data, and what time windows to use for scoring the ERN, can affect amplitudes (Sandre et al., 2020); as such, our norms cannot be assumed to hold for data measured with a different processing pipeline. That is, we established a set of norms for interpreting the ERN in a young adult sample (similar to the samples often used in ERN research), in the context of a particular task and set of processing parameters. Our norms should be used only for data collected within these same parameters, just as norms from one cognitive test are not used to benchmark the results of a different cognitive test.
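For readers wishing to apply the error-minus-correct convention described above, the arithmetic can be sketched as follows. This is an illustrative sketch of ours, not code or data from the original study: the function names and all numeric values (including the normative sample) are hypothetical placeholders, not the published norms.

```python
# Illustrative sketch only (not code from Imburgio et al., 2020).
# Demonstrates the error-minus-correct difference-score convention and a
# simple percentile lookup against a normative sample. All values below
# are hypothetical placeholders, not published norms.

def ern_difference_score(error_uv, correct_uv):
    """ERN difference score: error-trial amplitude minus correct-trial
    amplitude, both in microvolts (more negative = larger ERN)."""
    return error_uv - correct_uv

def percentile_rank(score_uv, normative_scores_uv):
    """Percent of normative difference scores at or above (i.e., less
    negative than) the observed score, so that more negative (larger)
    ERNs map onto higher percentiles."""
    n_at_or_above = sum(1 for v in normative_scores_uv if v >= score_uv)
    return 100.0 * n_at_or_above / len(normative_scores_uv)

# Hypothetical normative sample of difference scores (microvolts)
norms = [-2.0, -4.0, -5.0, -6.0, -8.0, -10.0]
observed = ern_difference_score(-12.0, -3.0)  # -9.0 microvolts
rank = percentile_rank(observed, norms)
```

Note that this toy percentile rule simply counts less-negative normative scores; actual benchmarking against the published norms should use the percentile tables reported in the original manuscript.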
In closing, we believe that work toward a more standardized ERP processing pipeline can be pursued simultaneously with the goal of understanding how the ERN is distributed in the population. Science is by definition a process of revision, and we are open to the idea that different norms may be published in the future (e.g., using a standardized processing pipeline shared by most ERP researchers). For now, however, when abiding by the limitations reiterated above (i.e., do not use the norms for individual clinical assessment at this point in time; apply the norms only to studies using the same task and processing parameters, and a similar sample), we believe that our norms stand to increase, not decrease, understanding of this widely studied ERP component.

Data and code availability statement
There are no associated data or code.