A commentary on establishing norms for error-related brain activity during the arrow flanker task among young adults

We suggest that a large data set for the error-related negativity (ERN) and error positivity (Pe) components of the scalp-recorded event-related brain potential (ERP) recently published as normative is not ready for such use in research and, especially, clinical application. Such efforts are challenged by an incomplete understanding of the functional significance of between-person differences in amplitudes and of nuisance factors that contribute to amplitude differences, a lack of standardization of methods, and the use of a convenience sample for the potentially normative database. To move ERPs toward standardization and useful norms, we encourage more research on the meaning of differences in ERN scores, including factors that influence between- and within-person variation, and the dissemination of protocols for data collection and processing.


We appreciate the efforts of Imburgio et al. (2020) to establish normative data for the error-related negativity (ERN) and error positivity (Pe) components of the scalp-recorded event-related brain potential (ERP). The paper will be valuable for a number of reasons, including the encouragement of standardization of procedures and publication of additional norms. However, critical issues that it did not address raise important questions regarding the establishment and use of normative ERP data. We outline these issues and associated concerns below. Although for brevity we focus here on ERN, each point applies to Pe as well.
Research indicates that ERN involves multiple neural generators and neurotransmitters and is influenced by a combination of cognitive, affective, motivational, and motor processes (Gehring et al., 2012). As a result, variation in "true" ERN signal can be due to a range of factors. The causes of individual differences in ERN scores are often unclear, and such differences have little predictive utility in isolation. For example, both larger and smaller ERNs have been observed in the context of depression, and differences in either direction have been interpreted as clinically meaningful (Moran et al., 2017). Higher cardiorespiratory fitness also appears to be related to both larger (Themanson et al., 2008) and smaller ERNs (Pontifex et al., 2011), yet each study interpreted these opposing ERN findings as indicating that better fitness related to improved performance monitoring. This interpretive inconsistency about the functional significance of ERN amplitudes (e.g., larger ERNs viewed as better due to "stronger" responses, and smaller ERNs viewed as "more efficient") is common across studies and is a barrier to establishing general norms, especially when there is also inconsistency in methods across studies. In other words, without knowing the functional significance of ERN amplitude in a specific context (population, task, etc.), identifying a given individual's ERN as larger or smaller than a comparison group provides little information about brain function.
Between-person differences in ERN amplitude can also occur due to factors other than "true" ERN signal. Specifically, the amplitude and morphology of an ERP component can vary across individuals due to nuisance variables that have nothing to do with cognitive processing,¹ including skull thickness, orientation of neural generators due to cortical folding, non-neural bioelectric signals, and changes in unmeasured participant state variables, such as attention and fatigue (Luck et al., 2011). Although the Imburgio et al. article attempts to address these factors with the use of error-minus-correct difference waves, these between-condition difference waves do not fully mitigate this problem. For example, a difference in skull thickness that causes the ERN to be twice² as large in one subject as some norm would likely also cause the correct-trial ERN (CRN) to be twice as large in that individual, and this increased amplitude would therefore still be present in an error-minus-correct difference wave. To eliminate the influence of such factors with difference waves, one needs to compare the same component in two experimental conditions (e.g., the ERN from compatible versus incompatible flanker trials), but this approach was not explored. Indeed, the influence of such factors was likely underestimated in the data set by their elimination of "outlier" participants from the creation of their norms, an approach that is not standard in ERP research and seems questionable when the goal is to create a normative database representative of standard ERPs from an unselected sample.
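The point about multiplicative nuisance factors can be illustrated with a minimal sketch. The amplitudes, the gain of 2, and the 3 µV offset below are hypothetical values chosen only for illustration, and the function `difference_wave` is not taken from any published pipeline. The sketch shows that an additive nuisance factor cancels in an error-minus-correct difference score, whereas a multiplicative one scales the difference score by the same factor.

```python
def difference_wave(ern_uv, crn_uv, gain=1.0, offset_uv=0.0):
    """Return the error-minus-correct amplitude as measured at the scalp.

    gain      -- hypothetical multiplicative nuisance factor (e.g., skull-related)
    offset_uv -- hypothetical additive nuisance factor, in microvolts
    """
    measured_ern = gain * ern_uv + offset_uv
    measured_crn = gain * crn_uv + offset_uv
    return measured_ern - measured_crn


# Hypothetical "true" amplitudes in microvolts (illustration only).
TRUE_ERN_UV = -6.0
TRUE_CRN_UV = -1.0

baseline = difference_wave(TRUE_ERN_UV, TRUE_CRN_UV)
with_offset = difference_wave(TRUE_ERN_UV, TRUE_CRN_UV, offset_uv=3.0)
with_gain = difference_wave(TRUE_ERN_UV, TRUE_CRN_UV, gain=2.0)

print(f"no nuisance factor:    {baseline:.1f} uV")
print(f"additive offset:       {with_offset:.1f} uV (cancels in the difference)")
print(f"multiplicative gain 2: {with_gain:.1f} uV (doubles the difference)")
```

Under these assumptions, two participants with identical underlying ERN and CRN activity would still receive different difference-wave scores if they differed only in gain.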
Another nuisance factor that results in problematic variance in ERN scores is measurement error, which is reflected in the widely variable estimates of internal consistency observed in a meta-analysis of 4499 participants from 68 samples nested within 43 studies (Clayson, 2020). Estimated coefficient alphas for eight ERN trials ranged from 0.02 to 0.94, with estimates partially moderated by type of paradigm, clinical status of the sample, approach for correcting ocular artifact, measurement sensors, and approach to calculating coefficient alpha. These data demonstrate the need for standardization and for consideration of contextual factors and nuisance variables that influence ERN scores.
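For readers less familiar with coefficient alpha, the following is a minimal sketch of the textbook formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), with single trials treated as items. It is not necessarily the calculation used in the studies summarized by Clayson (2020), which differed in how alpha was computed. The function name `coefficient_alpha` and the amplitudes are hypothetical.

```python
from statistics import variance


def coefficient_alpha(scores):
    """Textbook coefficient (Cronbach's) alpha.

    scores -- list of participants, each a list of k trial amplitudes
              (trials are treated as "items"); needs k >= 2 and >= 2 participants.
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least 2 trials and 2 participants")
    if any(len(row) != k for row in scores):
        raise ValueError("every participant needs the same number of trials")
    # Variance of each trial (item) across participants.
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    # Variance of each participant's summed score.
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical single-trial ERN amplitudes (microvolts), 3 participants x 3 trials.
consistent = [[-4.0, -4.0, -4.0], [-6.0, -6.0, -6.0], [-8.0, -8.0, -8.0]]
noisy = [[-4.0, -5.0, -3.0], [-6.0, -5.0, -7.0], [-8.0, -9.0, -8.0]]

alpha_consistent = coefficient_alpha(consistent)
alpha_noisy = coefficient_alpha(noisy)
print(f"alpha (identical trials): {alpha_consistent:.2f}")
print(f"alpha (noisier trials):   {alpha_noisy:.2f}")
```

In this toy example, trial-to-trial noise alone lowers alpha even though the ordering of participants is unchanged, which is one way measurement error adds unwanted variance to ERN scores.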
Flanker tasks are among the most widely used for eliciting ERN, but the numerous variants of the task and numerous approaches to data processing limit their generalizability. Tasks vary widely on a number of potentially important characteristics, including number of trials, type of stimuli, stimulus luminance, length of inter-trial intervals, use of feedback, and task instruction. The data processing pipelines and quality assurance procedures used across labs are similarly variable. Imburgio et al. acknowledged the potential for many such factors to impact ERN scores, and they themselves used different lengths of the flanker task and different recording procedures across recruitment sites in the data they pooled. However, we see this lack of standardization as fatal to a potential normative database. As acknowledged by Imburgio et al., the published normative dataset represents just one instantiation of ERN processing. This necessarily limits its generalizability. It is unknown how applicable these norms are to other labs with different variants of the flanker task, data collection systems, data quality, or analysis pipelines. Indeed, even in the case of the Imburgio study, which kept many of these factors consistent, statistically significant differences in ERN difference waves were observed across sites. Taken together, the consequences for other researchers, peer reviewers, or clinicians who may rely on prematurely established norms could be substantial.
The lack of standardization of methods represents a significant barrier to individual-differences research. For example, the Research Domain Criteria (RDoC) initiative emphasizes examining the feasibility of neurophysiological measures of dimensional constructs with an eye toward clinical prediction (Kozak and Cuthbert, 2016). The ERN was initially investigated in healthy participants and was later used to study group differences in clinical populations (Gehring et al., 2018). However, neurophysiological measures of group/condition differences do not easily translate to individual-differences research (Hajcak et al., 2017; Infantolino et al., 2018), and ERN research still has such obstacles to overcome.
¹ See Nunez and Srinivasan (2006), Buzsaki, Anastassiou, and Koch (2012), and Zahn, Carpenter, and McGlashan (1981).
² Skull thickness has a multiplicative rather than additive impact on voltages measured at the scalp, as illustrated by Ohm's law (voltage = current × resistance). Variance in skull thickness alters resistance (impedance), which will have a multiplicative impact on voltage measured at the scalp. This is especially relevant for difference scores in light of variability in skull thickness (and resistance) across people and across the lifespan (e.g., Frodl et al., 2001; Lillie, Urban, Lynch, Weaver, & Stitzel, 2016). Multiplicative differences in ERPs can also lead to mistaken statistical inferences in the analysis of interaction effects (McCarthy & Wood, 1985).
As an example of a challenge in establishing norms, the mean ± standard deviation for ERN scores from 326 males in Imburgio et al. (Table 7) was +3.18 ± 6.50 µV, and the mean ERN score from 429 males (ERP Analysis section, Fig. S3) in Fischer et al. (2016) was −5.37 µV. These two studies had large samples with different demographic characteristics, used different variations of the flanker task, and varied in recording and data-reduction parameters. Each study employed high-quality methods and made reasonable decisions with regard to each characteristic. If the Imburgio et al. database were used to characterize the "average" male participant from the Fischer et al. sample, an ERN score of −5.37 µV would correspond to a z score of −1.32 (percentile rank = 9.34% or 90.66%, depending on the direction in which scores are ranked). This could be interpreted as indicating that the average male in the Fischer et al. sample is abnormal, which is rather unlikely.
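The arithmetic behind this example can be checked with a short sketch that uses only the three values quoted above. It assumes a normal distribution for the normative scores and rounds z to two decimals before looking up the percentile, which appears to be how the reported figures were obtained; the variable names are ours.

```python
from statistics import NormalDist

# Values quoted in the text (microvolts).
NORM_MEAN_UV = 3.18      # Imburgio et al., males, mean
NORM_SD_UV = 6.50        # Imburgio et al., males, standard deviation
FISCHER_MEAN_UV = -5.37  # Fischer et al. (2016), males, mean

# z score of the Fischer et al. mean relative to the Imburgio et al. norms,
# rounded to two decimals (assumption: matches the reported rounding).
z = round((FISCHER_MEAN_UV - NORM_MEAN_UV) / NORM_SD_UV, 2)

# Percentile rank under a standard normal distribution, in both directions.
lower_pct = 100 * NormalDist().cdf(z)
upper_pct = 100 - lower_pct

print(f"z = {z:.2f}")                                              # z = -1.32
print(f"percentile rank = {lower_pct:.2f}% or {upper_pct:.2f}%")   # 9.34% or 90.66%
```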
Numerous other issues arise when selecting a normative database, such as how representative the database is of the population(s) of interest (Mitrushina et al., 2005). To this end, sampling procedures for normative databases often stratify on age, sex, race/ethnicity, education level, and socioeconomic status. Imburgio et al. did not report using a standardized sampling procedure³ and excluded participants with ERP scores greater than three standard deviations away from the mean, which truncates the distribution and narrows its spread, leading to overestimation of deficits. Excluding outliers mischaracterizes the population and compromises the normative data. Unsystematic sampling procedures can yield unrepresentative cell sizes for each demographic characteristic, limiting generalizability.
More "ERPology" (Luck, 2014) is required to understand the functional significance of differences in ERN scores, including the diverse factors that influence between- and within-person variation. The Imburgio et al. data set is a valuable basis for that. The publication of protocols for ERN data processing is a necessary first step. Missing information about data processing appears to be a significant problem for ERP research broadly (Clayson et al., 2019; Keil et al., 2014), not just ERN research. Some labs have moved toward publishing supporting documentation that outlines all data recording and processing procedures (e.g., see Farrens et al., 2019). This practice serves to improve the replicability of processing pipelines, and such communication is crucial for standardization.
Opening up our lab notebooks by depositing ERN paradigms, scripts, etc. that are routinely used in-house via repositories will help to disseminate paradigms for optimization and standardization. The development of the ERP CORE (Compendium of Open Resources and Experiments) represents such an effort (https://erpinfo.org/erp-core; Kappenman et al., 2020). ERP CORE is a resource of open EEG paradigms, data, and processing scripts aimed at optimization and standardization of task and analysis procedures. After sufficient optimization and standardization, stratified samples can then be collected to build normative databases. In short, we appreciate the work of Imburgio et al. but believe that the characterization of values obtained for ERN and Pe in a single paradigm and analysis pipeline from a convenience sample as norms is premature for use in research and, especially, clinical application.
³ Standardization samples comprise data that adhere to rigorous standards, including a standard procedure for recruiting participants. The recruited sample of participants should be appropriately stratified to reflect important demographic characteristics of the population of interest (see Mitrushina et al., 2005; Strauss, Sherman, & Spreen, 2006). In addition, tests should be administered and scored in a systematic and standardized fashion. Without proper standardized procedures, scores that are deviant from the normative sample could be due to any number of factors in the administration or scoring of the measures, and spurious interpretations can be made (Bigler & Dodrill, 1997).