The data-processing multiverse of event-related potentials (ERPs): A roadmap for the optimization and standardization of ERP processing and reduction pipelines

In studies of event-related brain potentials (ERPs), numerous decisions about data processing are required to extract ERP scores from continuous data. Unfortunately, the systematic impact of these choices on ERP scores themselves, as well as on the data quality and psychometric reliability of those scores, is virtually unknown, which is a barrier to the standardization of ERPs. The aim of the present study was to optimize processing pipelines for the error-related negativity (ERN) and error positivity (Pe) by considering a multiverse of data processing choices. A multiverse analysis of a data processing pipeline examines the impact of a large set of different reasonable choices to determine the robustness of effects, such as the impact of different decisions on between-trial standard deviations (i.e., data quality) and between-condition differences (i.e., experimental effects). ERN and Pe data from 298 healthy young adults were used to determine the impact of different methodological choices on data quality and experimental effects (correct vs. error trials) at several key stages: highpass filtering, lowpass filtering, ocular artifact correction, reference, baseline adjustment, scoring sensors, and measurement procedure. This multiverse analysis yielded 3,456 ERN scores and 576 Pe scores per person. An optimized pipeline for ERN included a 15 Hz lowpass filter, ICA-based ocular artifact correction, and a region of interest (ROI) approach to scoring. For Pe, the optimized pipeline included a 0.10 Hz highpass filter, a 30 Hz lowpass filter, regression-based ocular artifact correction, a -200 to 0 ms baseline adjustment window, and an ROI approach to scoring. The multiverse approach can be used to optimize pipelines for eventual standardization, which would support efforts toward establishing normative ERP databases.
The proposed process of analyzing the data-processing multiverse of ERP scores paves the way for better refinement, identification, and selection of data processing parameters, ultimately improving the precision and utility of ERPs.


Introduction
Studies in social, cognitive, and affective neuroscience often examine relationships between measures of neural activity, such as event-related brain potentials (ERPs) or functional magnetic resonance imaging (fMRI), and external correlates, such as behavior. For example, ERP studies of the error-related negativity (ERN) component have tested its potential use as a biomarker for anxiety or depression (Hajcak et al., 2019) and its sensitivity to harmful social contexts (de Bruijn et al., 2020). However, ERP studies often fail to consider nuisance variables that might obscure between-person relationships, rendering observed relationships tentative and potentially irreplicable when examining relationships between ERPs and external correlates. In a typical ERP study, 1 all single-trial waveforms from a participant are averaged together to obtain the single-subject ERP, and the amplitude or latency of an ERP component is quantified. The present manuscript focuses on ERP component amplitude as the ERP "score" of interest, but the approach for optimizing data processing is also applicable to latency measurements. Within-person variance comprises differences in single-trial amplitude scores for a person. Within-person variance is equivalent to between-trial variance for a given person, and greater within-person/between-trial variability in amplitude scores reflects lower data quality (Kolossa and Kopp, 2018; Luck et al., 2021). Between-person variance comprises differences in scores between people. Within- and between-person variability impact estimates of psychometric internal consistency, which is a ratio of between-person variance to the sum of between-person and within-person variances (Hedge et al., 2017).
The typical focus of psychometric work is on variables that impact between-person ERP score variance (e.g., which populations show psychometrically reliable ERP scores?), but this focus has come at the potential cost of ignoring important factors that influence within-person variance, such as the wide variability in data processing choices (Clayson, 2020). To achieve a high estimate of internal consistency, between-person variance must be large compared to within-person variance. Internal consistency estimates can therefore be improved by simply reducing within-person variance, assuming that between-person variance is held constant. Demonstrating adequate internal consistency is a requisite initial step to making inferences about between-person relationships (i.e., individual differences), and if internal consistency is poor (or not reported), the validity of potential inferences is unclear or, at worst, unknowable (Clayson and Miller, 2017b). Therefore, understanding the factors that reduce within-person variance is key for optimizing internal consistency for studies of between-person differences to support statistical inferences.
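This variance ratio can be made concrete with a short sketch. The following Python snippet is a minimal illustration with hypothetical variance components, not values from the study (the study's analyses were conducted in R); it shows how the internal consistency of a trial-averaged score improves either by averaging more trials or by reducing within-person variance:

```python
def internal_consistency(var_between, var_within, n_trials):
    """Reliability of a trial-averaged score: the ratio of between-person
    variance to total variance, where averaging over n_trials shrinks the
    within-person (between-trial) contribution (cf. Hedge et al., 2017)."""
    return var_between / (var_between + var_within / n_trials)

# Hypothetical variance components in squared microvolts.
few_trials = internal_consistency(4.0, 16.0, 8)    # ~0.667
more_trials = internal_consistency(4.0, 16.0, 32)  # ~0.889
less_noise = internal_consistency(4.0, 4.0, 8)     # ~0.889
```

Either route, retaining more trials or obtaining cleaner single-trial scores, raises the estimate, which is the logic behind optimizing pipelines for within-person variance.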
Considerations of internal consistency are paramount in studies of performance monitoring that examine ERPs based on participant accuracy, because whether a participant makes an error is not entirely under a researcher's control. A straightforward way to improve internal consistency is simply to collect more ERP trials, because the impact of within-person variance on overall internal consistency estimates can be minimized by retaining more trials for averaging (Baldwin et al., 2015; Carbine et al., 2021; Clayson et al., 2021b; Clayson and Miller, 2017a). Unfortunately, collecting more error trials is not always feasible in studies of performance monitoring when accuracy rates are high without unduly burdening participants. Hence, researcher efforts to minimize within-person variances for error trial scores should decrease the number of trials needed to obtain adequate internal consistency.
Two scalp-recorded ERP components associated with performance monitoring are ERN and the error positivity (Pe). ERN is a negative-going peak that occurs within 100 ms of a participant committing an error and is larger (i.e., more negative) for erroneous responses than for correct responses (Gehring et al., 1993, 2012; Larson et al., 2014). Prevailing explanations of ERN are that it possibly represents conflict between the mental representation of a correct response and an observed error response (Larson et al., 2014; Yeung et al., 2004; Yeung and Nieuwenhuis, 2009) or an emotional response to errors that are seen as aversive (for reviews, see Weinberg et al., 2015, 2016). Following ERN, the Pe component is a slow positive-going wave between 200 and 400 ms that is larger for erroneous responses than for correct ones (Nieuwenhuis et al., 2001; Overbeek et al., 2005). Prominent explanations of Pe suggest that it indexes late error awareness and reflects a combination of autonomic, cognitive, and sensory inputs (Steinhauser and Yeung, 2012; Ullsperger et al., 2010; Wessel, 2012).

The primary focus of the present study is on methodological choices that impact data quality (between-trial variability within persons) of ERN and Pe scores. In the current literature, internal consistency estimates are often used as a proxy for data quality instead of directly examining metrics like between-trial standard deviations. Internal consistency is impacted by both within-person and between-person variance, and the former is relevant to data quality (Luck et al., 2021). Although internal consistency is an indirect measure of data quality, estimates of internal consistency shed some light on the important methodological choices that might directly impact data quality. Studies have identified several potential processing stages that might impact data quality (see Clayson, 2020; Klawohn et al., 2020; Sandre et al., 2020), although none have rigorously examined those stages from a purely data quality perspective. The data processing stages identified as key in these studies included condition (e.g., correct vs. error trials), filter cutoffs, reference schemes, eye-movement correction procedures (EMCPs), baseline adjustment windows, electrode sites used for measurement, and amplitude measurement approaches (Clayson, 2020; Klawohn et al., 2020; Sandre et al., 2020). These studies guided the selection of data processing stages to explore in the current study.

1 The typical ERP study averages trials together based on the assumption that each trial represents a "true" signal and random background noise. Therefore, averaging across many trials should reduce the contribution of the background noise to the true signal of interest when that true signal is constant across the task. However, changes in ERP scores across the task could also be of interest (e.g., Berry et al., 2019; Brush et al., 2018; Volpert-Esmond et al., 2018), and such studies are particularly interested in the factors that impact within-person differences.
By understanding the unique impact of these processing stages on data quality, the present findings will help identify an optimized data processing pipeline within this particular set of processing and analysis choices. To this end, we statistically evaluated a data processing multiverse across the above-mentioned processing stages.
A multiverse analysis is based on the premise that any final dataset used in analysis represents a multitude of decisions on the part of a researcher (Harder, 2020; Steegen et al., 2016). Consequently, raw data do not give rise to a single dataset, but rather a "multiverse" of datasets depending on any given data processing pipeline. For example, the analysis of any ERP dataset can be biased by a number of reasonable decisions that a researcher has to make to obtain ERP scores from continuously recorded EEG. Hence, to circumvent such bias, a multiverse analysis examines the impact of a large set of different reasonable decisions to determine the robustness of any one conclusion in light of the multiverse (Harder, 2020; Steegen et al., 2016). Multilevel models are well suited for multiverse analyses because they allow us to simultaneously model the impact of each methodological decision.
Taken together, the broad aim of the present study was to determine an ERP processing pipeline for ERN and Pe that is optimized for data quality and experimental effects by examining a multiverse of methodological choices. Data quality was quantified as the standard deviation of single-trial amplitude scores (Kolossa and Kopp, 2018; Luck et al., 2021), which characterizes the spread of single-trial scores for a participant. Data processing choices that maximize the experimental effects (between-condition differences in correct and error trials) for average ERP amplitude scores were considered optimal. Multilevel modeling was used to simultaneously evaluate the impact of the data processing multiverse on the standard deviation of single-trial scores and average amplitude measurements. The consideration of the multiverse resulted in 3456 ERN scores and 576 Pe scores per person. An ideal data processing pipeline would yield high estimates of data quality (i.e., small within-person/between-trial standard deviations) and robust condition effects by maximizing correct and error trial differences.

Participants
To examine the impact of data processing pipelines on ERN and Pe amplitudes, we included 334 healthy participants from published (Baldwin et al., 2015; Clayson et al., 2011; Clayson and Larson, 2011a, 2011b; Larson and Clayson, 2011; Larson et al., 2012, 2013) and unpublished studies from our lab. The present analyses focused on the impact of data processing choices on ERN and Pe amplitudes, which was not reported in any of the mentioned studies. Data from these studies were used because each study employed an identical flanker paradigm. Lastly, we decided a priori to exclude participants with fewer than 10 correct or 10 error trials 2 in an effort to ensure the stability of standard deviation estimates. Hence, 298 participants (159 women, 139 men) were included in analyses for this study, and the average participant age was 21.1 years (SD = 2.7, range: 18 to 31).
All participants provided written informed consent as approved by the Institutional Review Board. Participants were recruited from undergraduate psychology courses. Exclusion criteria for the original studies were assessed via participant self-report and included current or previous diagnosis of a psychiatric disorder, psychoactive medication use, substance use or dependence, neurological disorders, head injury with loss of consciousness, left-handedness, or uncorrected visual impairment.

Experimental task
Participants completed a modified version of the Eriksen flanker task (Eriksen and Eriksen, 1974). Each trial consisted of either congruent (<<<<<, >>>>>) or incongruent (<<><<, >><>>) arrow stimuli presented in white on a black background of a 17 in. computer monitor approximately 20 in. from the participant's head. Participants were instructed to respond as quickly and accurately as possible with a right-hand key press. An index-finger button press was used if the middle arrow (i.e., the target stimulus) pointed to the left, and a middle-finger button press was used if the middle arrow pointed to the right. Flanker stimuli were presented for 100 ms prior to the onset of the target stimulus, which remained on the screen for 600 ms. The response window was 1600 ms; if the participant responded after 1600 ms, the trial was counted as an error of omission. To decrease expectancy effects, the inter-trial interval (ITI) varied pseudo-randomly between 800 ms, 1000 ms, and 1200 ms. Three blocks of 300 trials (900 total trials) were presented; the distribution of congruent and incongruent trials was equal within each block (150 of each per block). Participants completed 24 practice trials prior to beginning the experimental task to ensure adequate understanding of the instructions.

Electrophysiological data recording and reduction
Electroencephalogram (EEG) was recorded from 128 equidistant passive Ag/AgCl scalp electrodes using a 129-channel hydrocel geodesic sensor net and an Electrical Geodesics, Inc. (EGI; Eugene, OR) amplifier system (20 K nominal gain, bandpass = 0.10-100 Hz). The sensor layout of the hydrocel net is shown in Clayson et al. (2011). EEG was referenced online to the vertex electrode, and a ground sensor was placed at location PCz. EEG was digitized continuously at 250 Hz with a 24-bit analog-to-digital converter. Impedances were maintained below 50 kΩ.

2 The number of retained trials and the interaction between number of trials and event (correct, error) were included as covariates. Each main effect and interaction was significant (ps < .001). The standard deviation of ERN and Pe was larger for higher numbers of error trials and smaller for higher numbers of correct trials. ERN and Pe were smaller during correct and error trials for higher trial numbers, and the relationship was stronger for correct trials than for error trials. It is difficult to use these findings for optimizing task lengths due to possible changes in ERN and Pe over the course of a task (e.g., Volpert-Esmond et al., 2018), but future research could explore a multiverse of data processing using single-trial data. Inclusion of the number of retained trials as a covariate had little impact on the overall pattern of effects.
Each EEG recording went through all branches of the data processing pipeline described below. Key factors that varied across pipelines are italicized (see Fig. 1). Continuous EEG data were first highpass filtered using a sixth order (36 dB/oct) infinite impulse response (IIR) Butterworth filter; the three highpass half-amplitude cutoffs used were 0.01 Hz, 0.05 Hz, and 0.10 Hz. Next, the data were lowpass filtered using the same sixth-order IIR Butterworth filter; the three lowpass half-amplitude cutoffs used were 15 Hz, 20 Hz, and 30 Hz. These filter choices were based on studies examining the impact of filter settings on ERPs related to language and cognition (e.g., Duncan-Johnson and Donchin, 1979; Tanner et al., 2015) as well as on ERN/Pe work that has used these settings (e.g., Banica et al., 2019; Härpfer et al., 2020; Muir et al., 2020). Continuous EEG data were then epoched from 500 ms prior to the participant's response to 800 ms following the response.
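A zero-phase Butterworth filter of this kind can be sketched with SciPy. Two hedges apply: scipy.signal.butter defines its cutoff at the half-power (-3 dB) point rather than the half-amplitude point specified above, and applying the filter forward and backward doubles the effective order, so this is an illustrative approximation rather than the exact implementation used in the study:

```python
import numpy as np
from scipy import signal

FS = 250.0  # Hz, the sampling rate described above

def butterworth_filter(data, cutoff_hz, btype, order=6, fs=FS):
    """Sixth-order IIR Butterworth filter applied forward and backward
    (zero phase). SciPy's cutoff is the -3 dB (half-power) frequency,
    not the half-amplitude frequency given in the text."""
    sos = signal.butter(order, cutoff_hz, btype=btype, fs=fs, output="sos")
    return signal.sosfiltfilt(sos, data)

# Hypothetical demo: a 5 Hz component survives a 15 Hz lowpass filter,
# while a 60 Hz component is strongly attenuated.
t = np.arange(0, 2, 1 / FS)
eeg = np.sin(2 * np.pi * 5 * t) + np.sin(2 * np.pi * 60 * t)
filtered = butterworth_filter(eeg, 15.0, "lowpass")
```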
Ocular artifact correction was performed on response-locked epochs, and two different eye movement correction procedures (EMCPs) were used. These choices were based on a recent meta-analysis of ERN score internal consistency, which observed that independent components analysis (ICA) and regression-based approaches were the most commonly used EMCPs (Clayson, 2020). The first EMCP was an ICA procedure: epoched EEG data from all channels were processed through a binary version of EEGLab's runica function called binica (Delorme and Makeig, 2004). Any ICA components that correlated at 0.9 or above with the scalp topography of a blink template and at 0.8 or above with the scalp topographies of vertical and horizontal saccade templates were removed from the data. The templates used for artifact correction included those automatically generated by the ERP PCA Toolkit (Dien, 2010) and those created by the present authors. The second EMCP was a regression-based approach for ocular artifact correction, which used the Miller et al. (1988) implementation; this approach removes recorded activity associated with vertical saccades, horizontal saccades, and blinks.
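The template-matching step can be sketched as follows. This is a simplified stand-in for the ERP PCA Toolkit procedure: it flags a component when its topography crosses either the blink threshold (0.9) or the saccade threshold (0.8), which is one plausible reading of the criteria above, and the templates and components here are simulated rather than real scalp topographies:

```python
import numpy as np

def flag_artifact_components(component_topos, blink_template,
                             saccade_templates, blink_r=0.9, saccade_r=0.8):
    """Return indices of ICA components whose scalp topography correlates
    at or above blink_r with the blink template, or at or above saccade_r
    with any saccade template (thresholds from the text)."""
    flagged = []
    for i, topo in enumerate(component_topos):
        if abs(np.corrcoef(topo, blink_template)[0, 1]) >= blink_r:
            flagged.append(i)
            continue
        if any(abs(np.corrcoef(topo, templ)[0, 1]) >= saccade_r
               for templ in saccade_templates):
            flagged.append(i)
    return flagged

# Simulated 128-channel topographies: component 0 resembles a blink.
rng = np.random.default_rng(0)
blink = rng.normal(size=128)
saccades = [rng.normal(size=128), rng.normal(size=128)]
components = np.stack([blink + 0.1 * rng.normal(size=128),
                       rng.normal(size=128)])
flagged = flag_artifact_components(components, blink, saccades)
```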
Following artifact correction, for all data pipelines, channels with more than a 100 μV step within 100 ms intervals, a voltage difference of more than 300 μV through the duration of the epoch, or an absolute correlation with the nearest six neighboring channels that fell below 0.4 were marked as bad for the epoch. Channels that were marked as bad for more than 20% of epochs were considered globally bad. Bad channels were interpolated using spherical splines (Perrin et al., 1989), but if more than 10% of channels were marked bad for an epoch, the entire epoch was rejected.
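These epoch-level criteria translate into a straightforward channel screen. The sketch below is a hypothetical reimplementation: the text does not specify whether the neighbor criterion summarizes the six correlations by their mean or some other statistic, so the mean is an assumption, and the 100 ms windows are treated as sliding windows:

```python
import numpy as np

def epoch_bad_channels(epoch, fs=250, step_uv=100.0, range_uv=300.0,
                       neighbor_r=0.4, neighbors=None):
    """Flag channels in one epoch (channels x samples, microvolts) that show
    a >100 uV change within any 100 ms window, a >300 uV range across the
    epoch, or (if a neighbor map is given) a mean absolute correlation with
    their nearest neighbors below 0.4."""
    n_chan, n_samp = epoch.shape
    win = int(0.1 * fs)  # samples per 100 ms window
    bad = np.zeros(n_chan, dtype=bool)
    for ch in range(n_chan):
        x = epoch[ch]
        step = max(x[i:i + win].max() - x[i:i + win].min()
                   for i in range(n_samp - win + 1))
        if step > step_uv or (x.max() - x.min()) > range_uv:
            bad[ch] = True
            continue
        if neighbors is not None:
            rs = [abs(np.corrcoef(x, epoch[nb])[0, 1]) for nb in neighbors[ch]]
            if np.mean(rs) < neighbor_r:
                bad[ch] = True
    return bad

# Hypothetical two-channel epoch: an abrupt 200 uV step flags channel 0,
# while a low-amplitude oscillation on channel 1 passes.
demo = np.zeros((2, 325))
demo[0, 160:] = 200.0
demo[1] = 10.0 * np.sin(np.linspace(0, 10, 325))
bad = epoch_bad_channels(demo, neighbors=None)
```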
The following data processing stages and choices were based on studies that attempted to optimize processing pipelines for psychometric internal consistency (see Clayson, 2020; Klawohn et al., 2020; Sandre et al., 2020). Data were referenced using either an average of all channels (i.e., average reference) or an algebraic link of two channels, TP9 and TP10 (i.e., linked mastoids). Four different baseline adjustment windows were then used: −500 to −300 ms, −400 to −200 ms, −300 to −100 ms, and −200 to 0 ms, with 0 ms referencing the time of the button press. Response-locked ERPs were then analyzed at different sensors or sensor clusters (i.e., regions of interest [ROIs]). For ERN, the three different possibilities included sensor 6 (FCz), sensor 129 (Cz), and the average activity across four fronto-central sites (6 [FCz], 129 [Cz], 7, 106; for the electrode configuration, see Clayson et al., 2011). Then, ERN was scored using four different measures: time-window mean amplitude (average activity from 0 to 100 ms), adaptive time-window mean amplitude (average activity from 16 ms [4 samples] before the most negative peak to 16 ms after the most negative peak between 0 and 150 ms), absolute negative peak amplitude (amplitude of the most negative peak between 0 and 150 ms), and peak-to-peak amplitude (amplitude of the absolute negative peak between 0 and 150 ms minus the amplitude of the absolute positive peak between −100 and 50 ms).
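The four ERN measures can be sketched as follows, using the window edges given above. This is an illustrative reimplementation rather than the study's actual scoring code; the time axis assumes the 250 Hz sampling and −500 to 800 ms epochs described earlier, and the demo waveform is synthetic:

```python
import numpy as np

TIMES = np.arange(-500, 800, 4)  # ms; 250 Hz sampling, -500 to 800 ms epoch

def score_ern(waveform, times=TIMES):
    """Score one averaged waveform (microvolts) with the four measures:
    mean amplitude (0-100 ms), adaptive mean (+/-16 ms around the negative
    peak), negative peak amplitude (0-150 ms), and peak-to-peak amplitude
    (negative peak minus the positive peak between -100 and 50 ms)."""
    mean_amp = waveform[(times >= 0) & (times <= 100)].mean()
    seg = np.flatnonzero((times >= 0) & (times <= 150))
    peak_idx = seg[np.argmin(waveform[seg])]
    peak_amp = waveform[peak_idx]
    adaptive = waveform[max(peak_idx - 4, 0):peak_idx + 5].mean()
    pos_peak = waveform[(times >= -100) & (times <= 50)].max()
    return {"mean": mean_amp, "adaptive": adaptive,
            "peak": peak_amp, "peak_to_peak": peak_amp - pos_peak}

# Synthetic error-trial waveform: a negative deflection peaking near 60 ms.
demo = -10.0 * np.exp(-((TIMES - 60) / 40.0) ** 2)
scores = score_ern(demo)
```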
For ERN and Pe, the outcomes of interest were the standard deviation of single-trial scores (i.e., within-person/between-trial variability) and the arithmetic mean of single-trial scores, and these outcomes were calculated separately for correct and error trials. For ERN there were a total of 3456 measurements per person for each outcome, and for Pe there were a total of 576 measurements. Fewer Pe measurements were obtained because the sensors facet contained only two levels (compared to three for ERN) and the measures facet was not considered. A time-window mean amplitude procedure is the only reasonable measurement approach for Pe due to its slow, tonic nature, and therefore Pe analyses did not include a measures factor.

Data analysis
We used multilevel models to estimate the effect of each data processing choice on the participant-level standard deviation of scores and average amplitude of scores. Models were fit in the R package lme4 (Bates et al., 2015). For the present data, measurements are nested within participants. Each model included a global intercept, fixed-effects estimates for each data processing choice, and random intercepts for participants. The ERN model included the following predictors: event (correct, error; model reference level: correct), highpass filter (0.01 Hz, 0.05 Hz, 0.10 Hz; reference level: 0.01 Hz), lowpass filter (15 Hz, 20 Hz, 30 Hz; reference level: 15 Hz), EMCP (ICA, regression; reference level: ICA), reference (average, linked mastoids; reference level: average), baseline window (−500 to −300 ms, −400 to −200 ms, −300 to −100 ms, −200 to 0 ms; reference level: −500 to −300 ms), sensor (Cz, FCz, ROI; reference level: Cz), and measure (mean amplitude, adaptive mean, peak amplitude, and peak-to-peak amplitude; reference level: mean amplitude). The Pe model included the following predictors: event, highpass filter, lowpass filter, EMCP, reference, baseline window, and sensor (Pz, ROI; reference level: Pz). Standard deviation scores were log-transformed prior to model fitting to address skewness of the data. All multilevel models used maximum likelihood estimation and Satterthwaite degrees of freedom for testing significant effects. Degrees of freedom were obtained using the R package lmerTest (Kuznetsova et al., 2017). Tests of estimated marginal means were used to follow up significant tests using the R package emmeans (Lenth, 2020) and Tukey's multiple comparison tests (Tukey, 1977).
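As a structural illustration of this model specification: the actual models were fit with lme4 and lmerTest in R, but the Python statsmodels analogue below conveys the random-intercept structure using simulated data with a hypothetical effect size (and, unlike lmerTest, statsmodels does not provide Satterthwaite degrees of freedom):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate log-transformed SD scores nested within participants, with a
# random intercept per participant and a fixed effect of event
# (0 = correct, 1 = error) whose true value is 0.5.
rng = np.random.default_rng(1)
rows = []
for sub in range(30):
    sub_intercept = rng.normal(0, 0.3)
    for _ in range(20):
        event = int(rng.integers(0, 2))
        log_sd = 1.0 + sub_intercept + 0.5 * event + rng.normal(0, 0.2)
        rows.append({"subject": sub, "event": event, "log_sd": log_sd})
data = pd.DataFrame(rows)

# Random-intercept multilevel model, analogous in structure to the
# lme4 formula log_sd ~ event + (1 | subject), fit by maximum likelihood.
fit = smf.mixedlm("log_sd ~ event", data, groups=data["subject"]).fit(reml=False)
event_effect = float(fit.params["event"])  # estimate near the true 0.5
```

In the actual analyses, each additional pipeline facet enters as a further fixed effect in the same formula.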

Results
The ERP data for conducting the multiverse analyses and the R scripts for doing so are posted on the Open Science Framework (OSF; https://osf.io/fa24t/). The multiverse considered between-trial standard deviations of amplitude scores and person-average amplitude scores. An ideal processing pipeline was expected to maximize data quality (i.e., smallest between-trial standard deviations) and to maximize experimental effects (i.e., largest correct vs. error trial differences). If a choice regarding a data processing stage was consequential for one metric (e.g., standard deviation scores) but not the other (e.g., average scores), then the choice that was consequential for only one metric was recommended for an optimized pipeline. We first present information for ERN standard deviation scores followed by ERN average magnitude scores, then do the same for Pe.

ERN standard deviation scores
Grand average waveforms for CRN, ERN, and ΔERN are shown in Fig. 2 . An online plot of the figure is available at https://pclayson.shinyapps.io/ern_gavs , where each waveform for each particular data processing stream can be viewed. Summary statistics for the model predicting ERN standard deviation scores are shown in Table 1 , and they are not repeated below for text brevity. The predicted estimated marginal means in standard deviation units for each effect are displayed in Fig. 3 . The primary effects of interest below are those that minimize the standard deviation of between-trial ERN amplitude scores. Hence, significant interactions were followed up with pairwise contrasts across levels of an effect within correct trials and then within error trials.

Event
The omnibus test of event was significant, F(1, 1,025,403) = 31,344, p < .001. Event was involved in a number of significant interactions, which are interpreted below.

Highpass
The highpass main effect was significant, F(2, 594) = 40.45, p < .001, and the difference between each pair of filter settings was significant (|ts| > 3.0, ps < .01). The smallest standard deviations were observed for the 0.10 Hz highpass filter, and the size of the standard deviation increased in the order of 0.05 Hz and then 0.01 Hz.
There was also a significant ordinal Event x Highpass interaction, F(2, 1,025,403) = 156.30, p < .001. For correct and error trials, the pattern of differences between highpass filter settings was largely consistent with the overall main effect (ts > 4.14, ps < 0.001), but significant differences were not observed between the 0.01 and 0.05 Hz filters for error trials, t(609.9) = 1.96, p = .12.

Fig. 2. Grand average waveforms of all error-related negativity data processing pipelines showing error-trial activity (top), correct-trial activity (middle), and the difference between error-trial and correct-trial activity (bottom). An interactive version of this figure is available at https://pclayson.shinyapps.io/ern_gavs/ .

Fig. 3. The estimated marginal means for the error-related negativity (ERN) model predicting the standard deviation of scores separately for correct and error trials. Estimated marginal means were back-transformed from the log scale to the original response scale. The predictors include highpass filter (0.01 Hz, 0.05 Hz, 0.10 Hz), lowpass filter (15 Hz, 20 Hz, 30 Hz), eye-movement correction procedure (EMCP; independent components analysis [ICA], regression), reference (average, linked mastoids), baseline window (−500 to −300 ms, −400 to −200 ms, −300 to −100 ms, −200 to 0 ms), sensor (Cz, FCz, region of interest [ROI]), and measure (mean amplitude, adaptive mean, peak amplitude, and peak-to-peak amplitude). Note: %Δ refers to the percent change in the standard deviation of error-related negativity scores. For comparison levels, please see the Data Analysis section. EMCP = eye-movement correction procedure; ROI = region of interest. * p < .05.

Lowpass
The main effect of lowpass was significant, F(2, 594) = 34.36, p < .001, and standard deviations from smallest to largest went in the order of 15 Hz, 20 Hz, and then 30 Hz (|ts| > 3.7, ps < 0.001). The Event x Lowpass interaction was also significant, F(2, 1,025,403) = 24.57, p < .001, and the pattern of effects within correct and within error trials matched the pattern for the overall main effect of lowpass filter setting (|ts| > 2.4, ps < 0.04).

EMCP
Standard deviations of overall ERN scores were lower for the ICA-based EMCP than for the regression-based EMCP, F(1, 297) = 133.09, p < .001. The Event x EMCP interaction was significant, F(1, 1,025,403) = 1742.24, p < .001, and the pattern of effects within correct trials and within error trials was consistent with the overall main effect (|ts| > 9.0, ps < 0.001).

Reference
Standard deviations of overall ERN scores were lower for an average reference than for a mastoid reference, F(1, 297) = 2149.8, p < .001. The Event x Reference interaction was also significant, F(1, 1,025,403) = 3357.0, p < .001. An average reference corresponded to lower standard deviations than a mastoid reference for both correct trials and error trials (|ts| > 45.0, ps < 0.001).

Baseline
The baseline main effect was significant, F(3, 891) = 1719.39, p < .001, and each follow-up pairwise contrast was significant (|ts| > 21.8, ps < 0.001). The baseline window from −200 to 0 ms showed the lowest standard deviations, followed in increasing order by −300 to −100 ms, −400 to −200 ms, and then −500 to −300 ms. The Event x Baseline interaction was significant, F(3, 1,025,403) = 34.81, p < .001. All pairwise contrasts within correct trials and within error trials were significant and followed the same pattern as the overall main effect (|ts| > 19.7, ps < 0.001).

Sensor
The main effect of sensor was significant, F(2, 594) = 120.99, p < .001. The standard deviations were lowest for the ROI and then were higher for Cz and then for FCz (|ts| > 7.1, ps < 0.001). The Event x Sensor interaction was significant, F(2, 1,025,403) = 923.94, p < .001. Each pairwise contrast was significant (|ts| > 6.1, ps < 0.001), and the pattern within correct trials and within error trials was consistent with the overall main effect.

Measure
A main effect of measure was observed, F(3, 891) = 220.57, p < .001, and each pairwise contrast was significant (|ts| > 4.1, ps < 0.001). Standard deviation scores were smallest for peak-to-peak scoring and increased in the order of mean amplitude, adaptive mean, then peak amplitude. The Event x Measure interaction was significant, F(3, 1,025,403) = 6120.50, p < .001. The pattern of effects within correct trials and within error trials was nearly identical to the pattern for the overall main effect. Within correct trials, the difference between the mean amplitude and adaptive mean amplitude approaches was not significant, t(895) = −2.48, p = .06. The remaining differences between measurement approaches within correct trials and within error trials were significant (|ts| > 3.5, ps < 0.01).

Interim summary
The primary question for the model predicting standard deviation scores is which procedure yielded the smallest estimates. Correct trials corresponded to lower standard deviations than error trials. Furthermore, the pattern of effects within correct trials was largely consistent with the pattern of effects within error trials, which indicates that data processing choices yielded consistent impacts within both trial types. These findings suggest that an ideal processing pipeline for ERN would consist of a 0.10 Hz highpass filter, a 15 Hz lowpass filter, an ICA-based EMCP, an average reference scheme, a − 200 to 0 ms baseline window, an ROI measurement approach, and a peak-to-peak scoring procedure.

ERN average scores
Summary statistics for the model predicting average ERN scores are shown in Table 2 . The predicted estimated marginal means for each effect are displayed in Fig. 4 . For the ERN average-score analyses, the primary effects of interest are those that yield the largest error trials vs. correct trials difference score ( ΔERN).

Event
The event predictor was significant, F(1, 1,025,404) = 636,226, p < .001, and larger ERN (i.e., more negative) scores were observed for error trials than for correct trials. Event type was also involved in a number of interactions, which are interpreted below.

Highpass

The Event x Highpass interaction was significant, F(2, 1,025,404) = 3.37, p = .04. Error-trial scores were larger than correct-trial scores for each highpass filter (ts > 458.6, ps < 0.001). The ΔERN was more negative for the 0.01 Hz highpass filter than for the 0.10 Hz highpass filter, t(1,025,404) = 2.58, p = .03. The remaining differences in ΔERN amplitude were not significant (ts < 1.4, ps > 0.38).

Reference
The mastoid reference yielded more negative overall ERN scores than the average reference, F(1, 297) = 787.65, p < .001, and the Event x Reference interaction was also significant, F(1, 1,025,404) = 2622.21, p < .001. ERN scores were more negative for error trials than for correct trials for both reference schemes (ts > 527.8, ps < 0.001), and ΔERN was larger for the mastoid reference than for the average reference, t(1,025,404) = −51.21, p < .001.

Baseline
There were four baseline adjustment windows considered, and they included − 500 to − 300 ms, − 400 to − 200 ms, − 300 to − 100 ms, and − 200 to 0 ms. The main effect of baseline was significant, F (3, 890) = 324.4, p < .001, and each pairwise contrast of estimated marginal means was significant (| t s| > 6.8, p s < 0.001). Overall ERN scores were largest (i.e., most negative) for the − 200 to 0 ms baseline and decreased in the order of − 300 to − 100 ms, − 400 to − 200 ms, and then − 500 to − 300 ms.
The Event x Baseline interaction was significant, F(3, 1,025,404) = 179.1, p < .001. Error-trial scores were more negative than correct-trial scores for all baseline time windows (ts > 388.4, ps < .001). ΔERN was largest for the −500 to −300 ms window and decreased in magnitude in the order of −400 to −200 ms, −300 to −100 ms, and then −200 to 0 ms. All pairwise contrasts were significant (|ts| > 5.7, ps < .001), except for the difference in ΔERN between −300 to −100 ms and −200 to 0 ms, t(1,025,404) = −1.41, p = .49.
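Baseline adjustment itself is a simple operation; the factor under study is only which time window feeds it. A minimal numpy sketch (the sampling rate and waveform below are illustrative, not taken from this dataset):

```python
import numpy as np

def baseline_adjust(epoch, times, window):
    """Subtract the mean amplitude within `window` (ms) from the epoch.

    epoch : 1-D array of amplitudes (µV), one value per time point
    times : 1-D array of time points (ms) relative to the response
    window: (start_ms, end_ms) baseline interval, e.g. (-200, 0)
    """
    mask = (times >= window[0]) & (times < window[1])
    return epoch - epoch[mask].mean()

# Toy epoch: -500 to 500 ms sampled every 2 ms (hypothetical values)
times = np.arange(-500, 500, 2)
epoch = np.sin(times / 100.0) + 3.0  # constant offset stands in for slow drift

adjusted = baseline_adjust(epoch, times, (-200, 0))
# After adjustment, the baseline interval averages (numerically) zero
assert abs(adjusted[(times >= -200) & (times < 0)].mean()) < 1e-9
```

Because the subtracted constant differs across windows, any condition difference present within the window (e.g., pre-response ERN activity in a −200 to 0 ms window) propagates into the adjusted scores, which is the concern raised in the Discussion.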

Sensor
The overall main effect of sensor (Cz, FCz, ROI) was significant, F(2, 595) = 124.15, p < .001. Overall ERN was largest (i.e., most negative) when measured at FCz and decreased in the order of the ROI and then Cz (|ts| > 6.0, ps < .001). This effect was qualified by a significant Event x Sensor interaction, F(2, 1,025,404) = 166.34, p < .001. ERN was larger for error trials than for correct trials when measured at each sensor (ts > 445.7, ps < .001). ΔERN was larger using an ROI or FCz than when using Cz (|ts| > 14.7, ps < .001), but the difference between the ROI and FCz was not significant, t(1,025,404) = −2.00, p = .11.
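Why an ROI can match or beat a single sensor is easy to show with simulated single-trial scores; the sensor labels follow the levels above, but the signal and noise magnitudes are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single-trial ERN scores (µV) at three fronto-central sensors:
# a shared trial-level signal plus independent sensor noise.
n_trials = 200
signal = rng.normal(-6.0, 1.0, n_trials)      # trial-level ERN amplitude
fcz = signal + rng.normal(0, 2.0, n_trials)   # independent sensor noise
cz = signal + rng.normal(0, 2.0, n_trials)
fz = signal + rng.normal(0, 2.0, n_trials)

roi = np.mean([fcz, cz, fz], axis=0)          # ROI score = mean across sensors

# Averaging across sensors cancels part of the independent noise, so the
# between-trial SD is lower for the ROI than for any single sensor.
assert roi.std() < fcz.std()
```

The cancellation only helps when the sensors share the signal, i.e., when the component has a broad topography that the ROI actually captures.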

Fig. 5.
Grand average waveforms of all error positivity (Pe) data processing pipelines showing error-trial activity (top), correct-trial activity (middle), and the difference between error-trial and correct-trial activity (bottom). An interactive version of this figure is available at https://pclayson.shinyapps.io/pe_gavs/.

Measure
The main effect of measure (mean, adaptive mean, peak, peak-to-peak) was significant, F(3, 888) = 3622.7, p < .001. Pairwise contrasts indicated that the difference between each pair of measurement approaches was significant (|ts| > 9.1, ps < .001). Overall ERN amplitude was largest (i.e., most negative) using the peak-to-peak amplitude procedure and decreased in the order of peak amplitude, adaptive mean, and then mean amplitude.
The Event x Measure interaction was also significant, F(3, 1,025,404) = 2876.0, p < .001. Each measurement approach yielded larger error-trial than correct-trial ERN amplitudes (ts > 340.1, ps < .001). All pairwise comparisons of ΔERN were significant (|ts| > 20.8, ps < .001). ΔERN was largest using the peak amplitude procedure and then decreased in the order of adaptive mean, peak-to-peak, and then mean amplitude.
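The four measurement procedures can be sketched as follows; the scoring window, the window searched for the preceding positivity, and the adaptive-mean half-width are illustrative assumptions, not the exact parameters used in this study:

```python
import numpy as np

def score_amplitudes(wave, times, window=(0, 100), half_width=4):
    """Four common amplitude scores for a negative-going component (sketch)."""
    m = (times >= window[0]) & (times <= window[1])
    w = wave[m]

    mean_amp = w.mean()                        # time-window mean amplitude
    peak_idx = w.argmin()                      # most negative point = peak
    peak_amp = w[peak_idx]
    lo = max(peak_idx - half_width, 0)         # adaptive mean: mean around peak
    hi = min(peak_idx + half_width + 1, len(w))
    adaptive_mean = w[lo:hi].mean()
    # peak-to-peak: preceding positive peak minus the negative peak
    pre = wave[(times >= -100) & (times < window[0])]
    peak_to_peak = pre.max() - peak_amp
    return mean_amp, peak_amp, adaptive_mean, peak_to_peak

times = np.arange(-200, 300, 2)
# Synthetic error trial: positivity before 0 ms, negative peak near 50 ms
wave = 3 * np.exp(-((times + 60) / 40.0) ** 2) \
       - 8 * np.exp(-((times - 50) / 40.0) ** 2)
mean_amp, peak_amp, adaptive_mean, p2p = score_amplitudes(wave, times)
assert peak_amp <= adaptive_mean <= 0   # peak is the extreme value
assert peak_amp <= mean_amp
assert p2p >= -peak_amp                 # peak-to-peak adds the preceding positivity
```

Because `argmin` chases the most extreme sample, peak-based scores absorb high-frequency noise, which is the statistical bias discussed later in the paper.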

Interim summary
The primary comparisons of interest for the ERN average-score analyses are those that maximize ΔERN. Therefore, an ideal processing pipeline for ERN average scores would consist of a 0.01 Hz highpass filter, a mastoid reference, a baseline from −500 to −300 ms, an ROI or FCz measurement approach, and a peak amplitude scoring procedure.

Pe standard deviation scores
The ERP waveforms showing Pe activity are displayed in Fig. 5, and an interactive version of the figure is available at https://pclayson.shinyapps.io/pe_gavs. Summary statistics for the model predicting Pe standard deviation scores are shown in Table 3. The predicted estimated marginal means in standard deviation units for each effect are displayed in Fig. 6. The primary effects of interest below are those that minimize the standard deviation of Pe scores. Hence, significant interactions were followed up with pairwise contrasts across levels of an effect within correct trials and then within error trials.
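The data-quality metric used throughout these analyses, the between-trial standard deviation within each trial type, is straightforward to compute from single-trial scores. A simulated sketch (the means, SDs, and trial counts below are assumptions, not values from this dataset):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical single-trial Pe scores (µV) for one participant and pipeline
correct = rng.normal(2.0, 3.0, 120)   # correct trials
error = rng.normal(8.0, 8.0, 40)      # error trials: fewer and noisier

# Data quality here = between-trial SD within each trial type
sd_correct = correct.std(ddof=1)
sd_error = error.std(ddof=1)
assert sd_correct < sd_error  # mirrors the direction reported for Event
```

Each of the pipeline factors below is then evaluated by how much it shrinks these within-person, between-trial SDs.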

Event
The omnibus test of event was significant, F(1, 168,359) = 525.60, p < .001. Generally speaking, correct-trial standard deviations were smaller than error-trial standard deviations, but event will be interpreted in more detail below with the other relevant predictors.

Highpass
The highpass main effect was significant, F(2, 594) = 48.32, p < .001, and the difference between each filter setting was significant (|ts| > 3.7, ps < .001). The smallest standard deviations were observed for the 0.10 Hz highpass filter, and the size of the standard deviations increased in the order of 0.05 Hz and then 0.01 Hz. There was also a significant ordinal Event x Highpass interaction, F(2, 168,359) = 17.74, p < .001. For correct and error trials, the pattern of differences between highpass filter settings was consistent with the overall main effect (ts > 2.68, ps < .03).

Lowpass
The main effect of lowpass was significant, F(2, 594) = 20.66, p < .001, and standard deviations from smallest to largest went in the order of 30 Hz, 20 Hz, and then 15 Hz (|ts| > 2.5, ps < .04). The Event x Lowpass interaction was also significant, F(2, 168,359) = 12.66, p < .001. The pattern of effects within correct trials matched the pattern for the overall main effect of lowpass filter setting (|ts| > 3.1, ps < .01). For error trials, a 30 Hz lowpass filter was associated with smaller standard deviations than either a 15 Hz or a 20 Hz lowpass filter (|ts| > 3.9, ps < .001), but there was no significant difference between 15 Hz and 20 Hz, t(835.4) = 0.42, p = .91.
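The lowpass settings compared here belong to the IIR Butterworth family noted in the Limitations; a minimal zero-phase filtering sketch, assuming a 500 Hz sampling rate (an assumption, not this study's recording parameter):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass(data, cutoff_hz, fs, order=4):
    """Zero-phase IIR Butterworth lowpass filter (sketch)."""
    b, a = butter(order, cutoff_hz / (fs / 2.0), btype="low")
    return filtfilt(b, a, data)  # forward-backward pass avoids phase shift

fs = 500.0                               # assumed sampling rate (Hz)
t = np.arange(0, 2, 1 / fs)
slow = np.sin(2 * np.pi * 5 * t)         # 5 Hz component (passed)
fast = np.sin(2 * np.pi * 60 * t)        # 60 Hz component (attenuated)
filtered = lowpass(slow + fast, 30.0, fs)

# The 30 Hz lowpass strongly attenuates the 60 Hz component
assert np.std(filtered - slow) < 0.2 * np.std(fast)
```

Lower cutoffs (15 or 20 Hz) remove more high-frequency noise but also begin to attenuate genuine fast components, which is one way a cutoff choice can trade data quality against effect size.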

Reference
Standard deviations of overall Pe scores were lower for an average reference than for a mastoid reference, F(1, 297) = 704.8, p < .001. The Event x Reference interaction was also significant, F(1, 168,359) = 193.01, p < .001. An average reference corresponded to lower standard deviations than a mastoid reference for both correct trials and error trials (|ts| > 25.8, ps < .001).
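The two reference schemes amount to subtracting different estimates of the reference potential from every channel. A simulated sketch (the channel counts and noise levels are assumptions) shows why averaging many channels tends to yield less noisy referenced data than averaging two mastoids:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical single-trial data: rows = channels, columns = time points
n_channels, n_times = 64, 500
data = rng.normal(0, 5.0, (n_channels, n_times))
mastoids = rng.normal(0, 5.0, (2, n_times))   # left/right mastoid channels

# Average reference: subtract the mean across all scalp channels
avg_ref = data - data.mean(axis=0, keepdims=True)

# Linked-mastoids reference: subtract the mean of the two mastoid channels
mast_ref = data - mastoids.mean(axis=0, keepdims=True)

# Averaging 64 channels cancels more random noise than averaging 2,
# so the average-referenced data are less noisy.
assert avg_ref.std() < mast_ref.std()
```

This is the noise-cancellation argument cited in the Discussion; in real data the benefit also depends on the number of electrodes and their spatial sampling.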

Baseline
The baseline main effect was significant, F(3, 891) = 636.24, p < .001, and all follow-up pairwise contrasts were significant (|ts| > 12.4, ps < .001). The −200 to 0 ms baseline window showed the smallest standard deviations, followed in increasing order by −300 to −100 ms, −400 to −200 ms, and then −500 to −300 ms. The Event x Baseline interaction was also significant, F(3, 168,359) = 21.95, p < .001. All pairwise contrasts within correct trials and within error trials were significant and followed the same pattern as the overall main effect (|ts| > 10.0, ps < .001).

Sensor
The main effect of sensor was significant, F(1, 297) = 618.97, p < .001, and standard deviations were smaller when measuring Pe using an ROI than when using Pz alone. The Event x Sensor interaction was not significant, F(1, 168,359) = 0.05, p = .83.

Interim summary
The focus for the model predicting standard deviation scores is which procedure yielded the smallest estimates. Correct trials corresponded to lower standard deviations than error trials, and the pattern of effects within each trial type (correct vs. error) was nearly identical. These findings suggest that an ideal processing pipeline would include a 0.10 Hz highpass filter, a 30 Hz lowpass filter, an average reference scheme, a −200 to 0 ms baseline window, and an ROI measurement approach.

Pe average scores
Summary statistics for the model predicting Pe subject-average scores are shown in Table 4. The predicted estimated marginal means for each effect are displayed in Fig. 7. For the Pe average-score analyses, the primary effects of interest are those that yield the largest error-trial vs. correct-trial difference score (ΔPe).

Event
The event predictor was significant, F(1, 168,953) = 313,435, p < .001, and larger Pe (i.e., more positive) scores were observed for error trials than for correct trials. Event type was also involved in a number of interactions, and those interactions will be interpreted below with the other relevant predictors.

Highpass
The omnibus test of the three different highpass filters was significant, F(2, 594) = 11.75, p < .001. Overall Pe amplitudes were larger using a 0.01 Hz or 0.05 Hz highpass filter than when using a 0.10 Hz filter (ts > 2.7, ps < .02), but the difference between 0.01 Hz and 0.05 Hz was not significant, t(594) = 2.12, p = .09. The Event x Highpass interaction was not significant, F(2, 168,953) = 1.28, p = .28.

Reference
The mastoid reference yielded larger Pe than the average reference, F(1, 297) = 37.74, p < .001, and the Event x Reference interaction was also significant, F(1, 168,953) = 7180.15, p < .001. Pe was larger for error trials than for correct trials for both reference schemes (|ts| > 335.9, ps < .001), and ΔPe was larger for the mastoid reference than for the average reference, t(168,953) = 84.74, p < .001.
Baseline

The Event x Baseline interaction was also significant, F(3, 168,953) = 37.24, p < .001. Error-trial Pe was larger than correct-trial Pe for all baseline time windows (|ts| > 273.1, ps < .001). All ΔPe pairwise contrasts between baseline adjustment windows were significant (|ts| > 2.7, ps < .03). ΔPe was largest for the −200 to 0 ms window and decreased in magnitude in the order of −300 to −100 ms, −400 to −200 ms, and then −500 to −300 ms.

Sensor
The overall main effect of sensor was significant, F(1, 297) = 168.48, p < .001, and overall Pe was larger when measured at Pz than when using an ROI.
The Event x Sensor interaction was also significant, F(1, 168,953) = 2628.68, p < .001. Pe was larger for error trials than for correct trials using Pz and the ROI (ts > 359.6, ps < .001). ΔPe was larger when measured at Pz than when using an ROI, t(168,953) = −51.27, p < .001.

Interim summary
The primary comparisons of interest for the Pe average-score analyses are those that maximize ΔPe. Therefore, the ideal processing pipeline comprises a regression-based EMCP, a mastoid reference scheme, and a −200 to 0 ms baseline window.

Discussion
The aim of our analysis was to identify optimized data processing pipelines for flanker-task ERN and Pe measures that lead to the highest data quality (i.e., low within-subject, between-trial standard deviations) and to robust experimental effects (i.e., maximized differences between correct and error trials). The multiverse analyses suggest that certain data processing decisions contribute to substantial differences in optimizing ERP estimates (for a summary, see Table 5). Furthermore, different data processing parameters perform best depending on whether the objective of an analysis is to minimize within-person variation or to maximize average-score differences between conditions. Future ERN/Pe studies using the flanker task may use the present findings to select data processing parameters a priori using this data-driven approach. The proposed process of examining the data processing multiverse provides a roadmap for optimizing and eventually standardizing ERP scores, and the tools for performing multiverse analyses on other datasets are posted in the online supplementary material on OSF (https://osf.io/fa24t/).

An ideal processing pipeline for ERN and Pe would minimize between-trial standard deviations while also maximizing between-condition differences (correct vs. error). For the analyses of between-trial standard deviations, the primary focus is on minimizing standard deviations for correct and error trials, because lower standard deviations indicate higher data quality (Luck et al., 2021). For the average-score analyses, when different choices during a data processing stage increase or decrease correct- and error-trial scores similarly, those choices are considered inconsequential for between-condition differences (correct-trial scores conceptually serve as a baseline or reference point). Based on this guidance, there are some clear recommendations for ERN and Pe data processing during the flanker task (see Table 5).
An optimized processing pipeline for ERN would include a 15 Hz lowpass filter, ICA-based ocular artifact correction, and a region of interest (ROI) approach to scoring ERN amplitude. For Pe, the optimized pipeline would include a 0.10 Hz highpass filter, a 30 Hz lowpass filter, regression-based ocular artifact correction, a −200 to 0 ms baseline adjustment window, and an ROI approach to scoring amplitude.
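The size of the multiverse follows directly from crossing the processing choices. Using the factor levels reported in the Results, the ERN pipelines enumerate as follows (the dictionary is only a compact restatement of those levels):

```python
from itertools import product

# Processing choices examined for ERN (levels taken from the Results text)
ern_factors = {
    "highpass": [0.01, 0.05, 0.10],      # Hz
    "lowpass": [15, 20, 30],             # Hz
    "ocular": ["ICA", "regression"],     # EMCP approach
    "reference": ["mastoid", "average"],
    "baseline": [(-500, -300), (-400, -200), (-300, -100), (-200, 0)],  # ms
    "sensor": ["Cz", "FCz", "ROI"],
    "measure": ["mean", "adaptive mean", "peak", "peak-to-peak"],
}

pipelines = list(product(*ern_factors.values()))
assert len(pipelines) == 1728  # x 2 events (correct, error) = 3,456 scores
```

Crossed with the two trial types, this reproduces the 3,456 ERN scores per person reported for this study, and it illustrates how quickly a modest number of stages multiplies into a large multiverse.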
When the multiverse analyses for standard deviations and for average scores provide contradictory recommendations, other principles of ERP data analysis can be used to guide decision making. For example, when deciding whether to use a mastoid or an average reference, an average reference is generally preferred to avoid exaggerating vertically oriented ERPs, such as ERN and Pe (Dien, 2017). The quality of the average reference is also strongly related to the number of channels and the spatial sampling of those channels (e.g., Dien, 1998; Junghöfer et al., 1999), and thus another reference scheme could be more appropriate depending on the EEG equipment used. Similarly, although peak or peak-to-peak amplitude approaches might be recommended by a multiverse analysis, these ERP quantification approaches are statistically biased and misrepresent the "true" underlying ERP peak due to distortion from high-frequency background EEG noise (Luck, 2014). However, when there is no clear time window for extracting ERP amplitudes, a peak amplitude approach might be the only reasonable method (Luck and Gaspelin, 2017). In light of ERN showing a clear, temporally stable deflection, a time-window mean amplitude is likely preferred over peak-amplitude approaches for extracting ERN. Although the examination of ERN standard deviation scores indicated that a −200 to 0 ms baseline might be optimal, this time window is likely inappropriate for ERN, because the ERN deflection begins before 0 ms. Therefore, a baseline window of −200 to 0 ms likely includes the signal of interest (i.e., ERN activity) and consequently impacts ERN scores. Taken together, although a multiverse analysis provides a rigorous bottom-up approach for optimizing data processing pipelines, there might be additional considerations that lead to the selection of choices that were observed as suboptimal.
The multiverse analysis also sheds light on some data processing choices that generally lead to poor data quality of ERN and Pe measurements. For example, the use of an algebraically linked mastoids reference led to much larger between-trial standard deviations in ERN and Pe scores than an average reference. This finding is consistent with a previous comparison of the two reference schemes during a novelty oddball task (Dien, 2017). Those findings showed that EEG noise levels were higher for the linked mastoids reference than for the average reference; the average reference scheme cancels out some random noise by averaging activity across all electrodes. The multiverse analysis also showed that the use of a single sensor to score ERN and Pe led to larger between-trial standard deviations than ROI approaches. This finding is consistent with other studies that showed lower noise when using ROI approaches than when measuring ERPs from a single electrode (Baldwin et al., 2015; Dien, 2017; Huffmeijer et al., 2010). However, an ROI approach is only likely to outperform a single-sensor approach when the ERP signal of interest has a broad topography and is captured by the ROI (Clayson, 2020; Clayson and Miller, 2017b), which might require high-density EEG arrays for some ERPs (cf. Sandre et al., 2020).
Ideally, an EEG processing pipeline that prioritizes low within-person variability would not dramatically reduce between-condition experimental effects. The data from the multiverse analyses can be used to determine the sample sizes necessary to detect psychometrically reliable and robust ΔERN and ΔPe effects when using parameters intended to reduce within-person variability. Determining sample size a priori is a strategy for increasing reproducibility and replicability in psychophysiological research. However, the variance components needed to conduct sample size calculations are rarely reported in the human electrophysiology literature, making a priori calculations difficult (Guo et al., 2014; Larson and Carbine, 2017). To that end, the data used in the multiverse analysis are posted on OSF for future power analyses of any of the proposed processing pipelines.
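Once such variance components are available, an a priori sample size calculation for a within-subject ΔERN or ΔPe contrast is routine. A sketch using the normal approximation to the paired t-test; the effect size is purely illustrative, not an estimate from these data:

```python
from math import ceil
from statistics import NormalDist

def n_for_paired_effect(dz, alpha=0.05, power=0.80):
    """Approximate N for a two-sided paired t-test (normal approximation).

    dz = mean(difference score) / SD(difference score), e.g. mean(ΔERN)/SD(ΔERN)
    """
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # critical value
    z_b = NormalDist().inv_cdf(power)          # power term
    return ceil(((z_a + z_b) / dz) ** 2)

# Illustrative medium within-subject effect: about 32 participants
assert n_for_paired_effect(0.5) == 32
```

Because SD(Δ) depends directly on the between-trial standard deviations that the pipeline choices above influence, a noisier pipeline inflates the required N.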
The present multiverse analyses have implications for software that automates approaches for processing ERP data, and such software packages are becoming more common (e.g., Bigdely-Shamlo et al., 2015; Cowley et al., 2017; Debnath et al., 2020). The development of these software packages furthers the goal of replicability of ERP research by providing pipelines that do not require user input. However, the automation of data processing is concerning if the effects of different data processing steps are unknown. In the present analyses, different pipelines would be recommended for the data collected during the same flanker task depending on whether the goal is to analyze ERN or Pe. These findings point to the need to fine-tune automated processing pipelines for recording particular ERPs for a particular purpose and within a particular context. For now, it is unclear whether unique data processing pipelines are needed for different ERN/Pe paradigms or participant populations (e.g., healthy vs. clinical). At the very least, data processing pipelines are being automated and implicitly recommended before the consequences of key data processing parameters that can impact ERP data quality and within-person differences are known.
In light of the potential clinical utility of ERN as a biomarker of symptoms of, and risk for, anxiety (Hajcak et al., 2019), at least one large-scale effort to establish normative ERN and Pe data has been undertaken (Imburgio et al., 2020). Such efforts are challenged by a lack of standardization of methods for recording ERN/Pe and a lack of understanding of the methodological choices that impact ERN/Pe scores (Clayson et al., 2021c). The present examination of the multiverse of data processing identifies an optimized pipeline for robust and clean ERN and Pe, which is a step toward standardization. The proposed process of analyzing the multiverse of data processing choices also provides a roadmap for determining the impact of other methodological choices and stages on ERN and Pe.

Nonetheless, rigorous standardization of paradigms, administration, and data analysis is critical to achieving dependability, which is a fundamental concept in psychological testing. Dependability implies that information gleaned from psychological measures will be invariant across time and situations (Russell et al., 2005). In neuropsychological assessment, for example, client performance is compared to normative samples on standardized measures, and such comparisons are used for diagnosis and treatment planning. A keystone of neuropsychological assessment is the standardization of test administration, scoring, and interpretation. Without proper standardization, scores that deviate from the normative sample could be due to any number of factors in the administration or scoring of the measures, and spurious interpretations can be drawn (Bigler and Dodrill, 1997). Hence, the widely held view is that rigorous standardization of recruitment, assessment, and scoring needs to be followed when creating normative databases. An analysis of the data processing multiverse helps researchers optimize ERP data processing and analysis for eventual standardization, and these efforts support the establishment of normative ERP databases for clinical and cognitive neuroscience. Failing to standardize ERP assessment before providing normative databases has numerous deleterious consequences for researchers and clinicians who may rely on them (Clayson et al., 2021c).
The present multiverse analysis attempted to optimize ERN and Pe data processing pipelines with an eye toward standardization. Additionally, studies could routinely investigate the impact of multiple data processing and analysis decisions on ERP effects of interest to understand the factors that maximize data quality. Ideally, effects of interest should be reasonably robust to variations in data processing and analysis, and this robustness is particularly important in the search for tractable biomarkers for use in real-world trials (for a similar discussion, see Clayson et al., in press). Therefore, routine examination of the data processing and analysis multiverse would optimize choices within a particular dataset for data quality, circumventing the need for a consensus on the optimal approach for measuring a particular ERP component.
There are some limitations to note. First, the present analyses examined a limited number of data processing stages that were based on previous efforts to optimize ERN/Pe processing pipelines (see Clayson, 2020; Klawohn et al., 2020; Sandre et al., 2020), and data were limited to one EEG system and one version of the flanker paradigm. Similarly, analyses focused on a narrow sliver of possible choices within each data processing stage, such as the type of filter (i.e., IIR Butterworth), regression-based EMCP, and ICA-based EMCP. For example, it is possible that other ICA approaches to correcting ocular artifact, or even other sequences of data processing, would perform differently than those used in the current analyses. Regardless, future research could conduct similar multiverse analyses to examine the impact of data processing stages (e.g., artifact rejection or sampling rates) and choices within each stage different from those included in the present study. Future research could also examine whether optimized data processing pipelines are similar across different recording systems. Second, only healthy right-handed undergraduates were included in the current analyses. Different optimized pipelines might be observed in community or clinical populations, and future validation studies across populations, EEG systems, and components are necessary. Third, average scores and between-trial standard deviations were analyzed separately, which prevented examining the simultaneous impact of data processing choices on average scores and standard deviations. Recent research has started to implement multilevel location-scale models in psychology (e.g., Williams et al., 2019, 2021) and in ERN research (Clayson et al., 2021d), and these models could be used to examine the simultaneous impact of each data processing choice on average scores and standard deviations.
Although highly computationally demanding (the difference between ∼2 million rows of subject-average data and ∼2 billion rows of single-trial data for the present multiverse), future research might explore the use of such models for data processing optimization. Results from multilevel location-scale models could also be used to identify separate optimized data processing pipelines for each participant, as has been done for estimates of ERP psychometric internal consistency, and these models could include sensor as another factor of interest to identify which sensors, for which individuals, show robust between-condition differences and high data quality.
The present study focused on identifying data processing pipelines for ERN and Pe optimized for data quality and experimental effects, and we propose that the data processing multiverse be considered to optimize and eventually standardize ERPs for use in research and clinical contexts. To move ERN and Pe toward standardization, a clearer understanding of within-person variation (e.g., data processing pipelines) and between-person variation (e.g., demographic variables or individual-difference measures) in scores is needed. The present study used a subset of all possible data processing stages and decisions within those stages, and the proposed process provides a roadmap for considering additional stages and decisions. Although the present data-driven approach helps to identify key parameters for data processing and analysis, it can be necessary to select a less "optimal" parameter for theoretical reasons. Future research could evaluate the paradigmatic factors (e.g., type of stimuli, number of trials, inter-trial intervals) that lead to robust scores and high internal consistency to support establishing normative databases. Furthermore, consideration of the data processing multiverse can also be used to optimize processing pipelines for ERN studies in psychopathology, such as identifying which data processing pipeline yields the strongest relationship between ERN and anxiety symptoms. The proposed process of analyzing the data processing multiverse of ERP scores paves the way for better selection of data processing pipelines, ultimately improving the utility of ERPs in clinical and cognitive neuroscience.