Exploration of the influence of the quantification method and reference scheme on feedback-related negativity and standardized measurement error of feedback-related negativity amplitudes in a trust game

Various approaches have been taken over the years to quantify event-related potential (ERP) responses, and these approaches may vary in their utility for connecting empirical research and scientific claims. In this work we compared different quantification methods as well as the influence of three reference methods (linked mastoids, average reference, and current source density) on the resulting ERP amplitude. We use the experimental effects and effect sizes (Cohen's d) to evaluate the different methodological variants, and we calculate intraclass correlation coefficients (ICC). In addition, the bootstrapped standardized measurement error (SME; Luck et al., 2021), which was recently suggested as a quality criterion for ERP research, is used for this purpose. Our example for an ERP is the feedback-related negativity (FRN) to feedback about trustee behavior in a trust game with participants in the trustor position. We found that the quantification methods influenced the absolute value of condition effects on the FRN in the experimental paradigm. Yet, the patterns of effects were detected by all chosen methods, except for the 'individual difference wave'-based peak window approach. In addition, our findings stress the importance of checking the reference electrodes for effects of the experimental conditions. Furthermore, interactions of topographical distribution


Introduction
The electroencephalogram (EEG) and its application in humans have been in use for over 90 years (Berger, 1929). It is measured with electrodes on the surface of the scalp and reflects the summed electrical activity of neurons, in particular their excitatory postsynaptic potentials and the resulting dipoles (see Cohen, 2014; Luck, 2005). The electrical signals of the brain need to be amplified and separated from other electrical measurements and background noise such as muscle activity, eye movements, and currents induced by head movements of the EEG montage. The EEG signal can be analyzed in different manners, for instance by averaging the raw signals of repeated events, yielding the so-called event-related potentials (ERPs), or by frequency analysis targeting specific frequency bands (see Cohen, 2014; Luck, 2005). For ERPs, the mean of several trials of an event creates a specific EEG wave, with positive and negative peaks that can be numbered in order of occurrence after the respective event.
Various ways of quantifying ERPs have been realized in many different tasks that are often designed to evoke a distinct ERP. Accordingly, these different ways of quantifying and analyzing EEG may constitute a weak link in the derivation chain from empirical observations to inferred scientific claims in cognitive neuroscience (Meehl, 1990). In this work we aim to check methodological choices in this derivation chain, and we thus compare different methodological decisions in the analysis of ERPs. In particular, we examine the choice of EEG reference and the kind of quantification of potentials in the ERP in the time range of the N2 component. In order to compare and evaluate these different methodological choices, we use the effects and effect patterns of the ERPs in our experimental paradigm, we calculate the standardized measurement error (SME; Luck, Stewart, Simmons, & Rhemtulla, 2021), and we report standard measures of reliability, namely the intraclass correlation coefficient (ICC).
The trust game (Berg, Dickhaut, & McCabe, 1995) is the experimental paradigm and background for our comparison, and we present the paradigm and relevant findings in the next section. Subsequently, we report the different methodological choices for the analysis of EEG data that we aim to compare. We start with the feedback-related negativity (FRN; e.g., Holroyd & Coles, 2002; Miltner, Braun, & Coles, 1997), which is a negative deflection in the ERP in response to negative as compared to positive feedback, and we present the various options for its quantification in one section. We continue in a subsequent section by introducing three kinds of EEG references. Finally, we report on the standardized measurement error (SME) as an evaluation tool for our comparison.

Trust game
The Trust Game is a social dilemma designed by Berg et al. (1995) to study trust decision-making and reciprocity in a social situation of conflict. In the Trust Game, an agent (the trustor) plays with another real or fictive person (the trustee).
At the beginning of the game, both parties are handed a certain amount (money or points). The trustor is then faced with the decision (1) to transfer the amount to the trustee or (2) to keep it (Berg et al., 1995). This decision is understood as a reflection of the trustor's readiness to make himself vulnerable to the trustee's decision, which is an operational definition of social trust. In case the trustor refuses to transfer the initial amount to the trustee, both players come off with their initial sum (generally 10 points/dollars/euros). In case the trustor decides to transfer his initial amount to the trustee, this amount is tripled, and the trustee is entrusted with unilateral control over the resources (usually now 40 points/dollars/euros). In the final step of the game, the trustee decides either (1) to share the money between both players (cooperation) or (2) to keep the whole amount to himself (non-cooperation).
One round is completed with the feedback of the trustee's decision. Fig. 1 illustrates the decision logic and possible outcomes for the trustor. The guarantee of full anonymity between the two agents combined with the logic of a non-recurring transaction ensures the elimination of trust-independent mechanisms maintaining investments, such as tit-for-tat strategies (Berg et al., 1995; Yamagishi, 2011).
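To make the payoff logic concrete, the following minimal R sketch reproduces the outcomes described above (the function name and structure are ours, for illustration only):

```r
# Payoffs for one round of the trust game (initial endowment: 10 points each).
trust_game_payoff <- function(trustor_trusts, trustee_cooperates) {
  if (!trustor_trusts) {
    return(c(trustor = 10, trustee = 10))  # no trust: both keep their endowment
  }
  # The transferred 10 points are tripled; the trustee now controls 30 + 10 = 40
  if (trustee_cooperates) {
    c(trustor = 20, trustee = 20)          # cooperation: the 40 points are shared
  } else {
    c(trustor = 0, trustee = 40)           # non-cooperation: trustee keeps it all
  }
}

trust_game_payoff(TRUE, FALSE)  # unrewarded trust: trustor = 0, trustee = 40
```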
The decision-making process in the Trust Game is influenced by the social expectations of the involved parties (Dunning, Anderson, Schlösser, Ehlebracht, & Fetchenhauer, 2014). These expectations have been shown to manifest in the FRN response in the trust game if the decision to trust was not rewarded (e.g., Wang, Jing, Zhang, Lin, & Valadez, 2017), making the trust game a valuable example of a paradigm evoking an FRN response. The expectations of the trustors could be influenced by different aspects. One aspect is the anticipation of rational strategies of the trustee, leading to no trust (cf. von Neumann & Morgenstern, 1944). This expectation has been known to be modified by borderline symptomatology (e.g., Franzen et al., 2011), so we included this variable as a control variable in our paradigm. However, the excessive tendency of trustors to trust and trustees to cooperate (e.g., Dunning et al., 2014; Wang, Zhang, Jing, Valadez, & Simons, 2016) suggests that the trustor follows the norm of reciprocity (e.g., Kelley & Thibaut, 1978; West, Griffin, & Gardner, 2007; West & Gardner, 2010) and anticipates the trustee to cooperate, leading to an FRN if this expectation is violated (cf. Wang et al., 2016). Accordingly, Wang et al. have repeatedly reported more negative amplitudes for non-cooperation, i.e., the unrewarded trust of the participant, and more positive amplitudes for cooperation, i.e., the rewarded trust of the participant, with the topographical focus of the effect being at fronto-central electrode positions (cf. Wang et al., 2016, 2017). Therefore, the trust game, with its stepwise process of deciding to trust the trustee and then getting feedback on whether the trust is rewarded, is well suited for measuring an FRN response that is driven by reward expectation. Additionally, as the participants are not forced to trust, they may still decide not to trust the other. Accordingly, they may receive feedback about the consequences of not trusting, confronting them with their decision and their deviance from the norm of reciprocity (e.g., Kelley & Thibaut, 1978; West et al., 2007; West & Gardner, 2010). Taken together, there are three different outcomes in the game as the main experimental conditions: the participant did not trust (1), the participant trusted but the trustee was non-cooperative (2), and the participant trusted and the trustee cooperated (3).

The FRN as an example of different quantification methods in ERP research
One important ERP of interest that has frequently been investigated in the context of economic decision-making paradigms is the feedback-related negativity (FRN; e.g., Holroyd & Coles, 2002; Miltner et al., 1997). This component is sometimes also called the N2 component, the second negative peak (see Baker & Holroyd, 2011), or medial frontal negativity (MFN; Boksem & De Cremer, 2010). It is linked to the evaluative aspect of an event, leading to a more negative electro-cortical midfrontal signal if an event or outcome is more negative than expected (Holroyd & Coles, 2002). This negative peak occurs around 200 ms to 400 ms after the respective event (Yeung & Sanfey, 2004), has a fronto-central midline topography, and originates in the anterior cingulate cortex (e.g., Debener et al., 2005; Gehring & Willoughby, 2002; Hewig et al., 2007; Miltner et al., 1997). More recent research, however, reinterpreted the feedback components as a complex of reward-dependent reactions (Hajcak, Moser, Holroyd, & Simons, 2006; Hajcak Proudfit, 2015), dampening the normally occurring N2 response, and introduced the term reward positivity (Rew-P; Baker & Holroyd, 2011) instead of FRN. Hence, reactivity to reward is driving differences in the component complex formerly named FRN (cf. Baker & Holroyd, 2011; Hewig et al., 2007; Holroyd, Pakzad-Vaezi, & Krigolson, 2008). However, negative feedback remains important for the FRN component (e.g., Hajcak et al., 2006; Hewig et al., 2007), although the proposed mechanism leading to this negative deflection was changed from the negative violation of expectancy (Holroyd & Coles, 2002) to a global outcome evaluation process (Kujawa, Smith, Luhmann, & Hajcak, 2013), leading to a higher FRN or more negative N2 response if reward, and therefore the Rew-P, is absent. This negative feedback parameter was used to assess offers, (monetary) outcomes, or bargaining opportunities in economic games in addition to behavioral choices and feedback in general. Concerning the psychometric properties, the homogeneity of the FRN component is reasonably large (>.7) if enough trials (~20 per condition) are given (Marco-Pallares, Cucurell, Münte, Strien, & Rodriguez-Fornells, 2011). However, there have been different attempts to quantify this ERP over the years in different contexts (e.g., Sambrook & Goslin, 2015).
Three different types have mostly been used: the mean amplitude, where the mean activation over a defined time window is calculated (e.g., Luck, 2005); the baseline-to-peak amplitude, where a peak value is compared to the mean of the baseline period (e.g., Luck, 2005; Segalowitz et al., 2010); and the peak-to-peak amplitude, where a preceding component of opposite polarity is used to quantify the component (e.g., Segalowitz et al., 2010). Examples of these quantification methods for the FRN component are given in Table 1 below; a schematic code sketch follows. Only quantification approaches that could be identified clearly from the methods sections of publications were included in the present manuscript.
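The three basic quantification types can be sketched as follows (a minimal R sketch; the window limits follow the values used later in this manuscript, while the baseline period and all variable names are placeholder assumptions):

```r
# erp: numeric vector of an averaged ERP (µV); times: matching time vector (ms)
mean_amplitude <- function(erp, times, win = c(200, 400)) {
  mean(erp[times >= win[1] & times <= win[2]])
}

baseline_to_peak <- function(erp, times, base = c(-100, 0), win = c(200, 400)) {
  baseline <- mean(erp[times >= base[1] & times <= base[2]])
  peak     <- min(erp[times >= win[1] & times <= win[2]])  # negative peak (N2/FRN)
  peak - baseline
}

peak_to_peak <- function(erp, times, p2 = c(150, 300), n2 = c(200, 400)) {
  p2_peak <- max(erp[times >= p2[1] & times <= p2[2]])  # preceding positive peak
  n2_peak <- min(erp[times >= n2[1] & times <= n2[2]])
  n2_peak - p2_peak
}
```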
Every quantification approach has specific biases. The time window approach is affected by two important sources of variance within and between subjects: different latencies of peak onsets and different shapes of peaks due to their maximum amplitude, which in combination impact the shape of the grand average ERP (across subjects, see Luck, 2005). These differences may be driven by person- or condition-specific variations, as well as by within-person variations. Hence, a completely a priori definition of a large time window, e.g., based on the literature, may suffer from the problem of being defined too broadly and therefore already containing parts of the subsequent or even the preceding component of the signal, for example the P2 or P3 component in the case of an N2 response. To correct for this problem and to account for differences in designs and samples, time windows have been selected based on the grand average ERP of experimental conditions across all participants or on the individual level (cf. Hauser et al., 2014; Marco-Pallares et al., 2011; Rodrigues, Liesner, Reutter, Mussel, & Hewig, 2020; Weismüller & Bellebaum, 2016).
Another solution was to measure the ERPs from peak to peak, using the P2 as the first peak and the N2 as the second peak in the case of the FRN (e.g., Hajcak et al., 2006; Holroyd et al., 2003; Osinsky et al., 2012; Zottoli & Grose-Fifer, 2012). This peak-to-peak approach has mostly been used on the individual level, but in some cases it is not entirely clear what was done (see Table 1). However, when considering whether the method has been applied to the mean signal, the means of the conditions, or the difference waves, based on the individual participant or the entire sample, seven quantification methods or quantification approaches were identified: time window, grand mean peak window, grand mean condition peak window, grand mean difference wave peak window, individual mean peak to peak, individual mean condition peak to peak, and individual mean difference wave peak window. Given these different quantification methods, the question arises whether all these different quantifications come to similar results concerning the statistical analysis of the FRN effects across conditions.

Reference schemes
In EEG research, there have been different attempts to re-reference the measured signal after recording to ease interpretation of the data. The recording system produces a relative signal, often using double difference amplification: each electrode signal is compared to the reference electrode signal, and the reference signal is compared to the ground electrode signal, to separate brain activity from other electrical measurements and background noise such as muscle activity and eye movements (Luck, 2005). Many approaches have been taken over the years to interpret EEG data, each with specific biases concerning the interpretation of the data (e.g., Allen, Urry, Hitt, & Coan, 2004; Hagemann, 2004; Junghöfer, Elbert, Tucker, & Braun, 1999; Kayser, 2009; Reznik & Allen, 2018; Rodrigues, Allen, Müller, & Hewig, 2021). The most used references are the average reference, linked mastoids, common vertex (Cz), and current source density (CSD). It has been found that average reference and linked mastoids are substantially correlated, while the common vertex reference leads to deviant results (e.g., Hagemann, 2004). In several works on frequency band analysis, the CSD reference was recommended, as it creates a locally reference-free signal and a topographical sharpening of the activation patterns that relate to personality traits and psychopathology more reliably (e.g., Allen & Reznik, 2015; Hagemann, 2004; Reznik & Allen, 2018; Rodrigues, Allen, et al., 2021; Stewart, Coan, Towers, & Allen, 2014). In the present work, three different reference schemes are included and compared: the linked mastoid reference (LM), the average reference, and the CSD transformation. The linked mastoid reference is created by taking the mastoids (A1+A2, often substituted by TP9+TP10) as reference electrodes. A specific confounding bias in EEG signals when using the LM is the consequence of activation that is present at the mastoid electrodes themselves (e.g., brain activity in the vicinity of the mastoids), as the entire signal is compared to this activation. The average reference takes the whole-head activation as reference. This leads to the polar average effect as a potentially problematic bias if the electrodes are unevenly spread or an inadequate density of electrodes is used (Junghöfer et al., 1999). The potentially problematic bias of the CSD reference is the sharpening of the topographical activation (e.g., Cohen, 2014; Kayser, 2009). Additionally, the unit of CSD differs from LM or the other references mentioned above, as the unit is not µV but electrical activation per area. In particular, the unit of measurement will be µV/cm² or µV/m², depending on the transformation algorithm. Hence, a direct comparison of the CSD with other references in one model of analysis is not advisable or possible. This has to be kept in mind when comparing these reference schemes as intended in the present manuscript. Importantly, CSD and average reference may be used to check whether there is systematic confounding activity at the mastoid electrodes, which would invalidate the data referenced to the mastoids. Concerning the differences between the reference schemes, we expect the CSD reference to have enhanced topographical sharpness compared to linked mastoids and average reference.
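The two voltage-based schemes amount to simple linear transformations of the channel-by-sample data matrix, as the following minimal R sketch illustrates (variable names are ours; the CSD transformation is omitted, as it requires a spherical-spline estimate of the surface Laplacian, e.g., following Kayser & Tenke, 2006):

```r
# eeg: channels x samples matrix of recorded voltages (µV); rownames = channels
rereference_linked_mastoids <- function(eeg, mastoids = c("TP9", "TP10")) {
  ref <- colMeans(eeg[mastoids, , drop = FALSE])  # mean of both mastoid signals
  sweep(eeg, 2, ref, "-")                         # subtract from every channel
}

rereference_average <- function(eeg) {
  ref <- colMeans(eeg)  # mean over all channels at each sample
  sweep(eeg, 2, ref, "-")
}
```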

Quality criteria for ERPs: the standardized measurement error
In order to compare the quantification methods and the reference schemes, criteria for an evaluation are needed. Recently, there has been an attempt to identify a metric of data quality for ERPs that could be used to quantify the precision of the results. Precision in this context is defined as the degree to which repeated measurements under unchanged conditions yield similar results. The standardized measurement error (SME) has been suggested as such a criterion (Luck et al., 2021). This special case of the standard error of measurement accounts for the overall data quality as well as the data quality on the participant level. Thus, it not only provides an overall criterion for quantifying the precision of different methods but also helps to identify participants with low data quality. However, it is only applicable to averaged ERPs and not to single-trial ERP analysis in general, as the procedure requires an averaging process that is repeatedly bootstrapped from the available trials (Luck et al., 2021). To be precise, the SME is computed by bootstrapping the FRN 10,000 times out of the given FRN trials (with replacement). This means one draws, 10,000 times, the number of FRN trials out of the FRN trials randomly (with replacement), constructs the bootstrapped FRN by averaging these drawn trials, and then computes the SD of these 10,000 bootstrapped FRNs. This SD is in fact the (estimated) SD of the mean of the FRN trials, so it is the (estimated) standard error of the sample mean. This criterion is used to compare the different quantification methods. However, one must be aware that the SME is to be interpreted as a relative measure in comparable units of measurement (e.g., µV vs µV/m²). In particular, preprocessing decisions concerning the reference may influence this unit of measurement independent of whether there is actually a change in precision, especially if they change the unit of measurement, as for example the current source density (CSD), i.e., Laplacian transformation (Cohen, 2014; Kayser, 2009; Kayser & Tenke, 2006). As Luck et al. (2021) also mention, one should not focus solely on the SME, but also use other criteria that support validity. Hence, we also include additional quality control criteria, such as the ICC(2,k) (Koo & Li, 2016) and effect size (Cohen's d).
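In code, the bootstrapped SME for one participant, condition, and electrode reduces to a few lines (a minimal R sketch; trial_amps is assumed to hold one FRN score per trial):

```r
# trial_amps: numeric vector with one FRN amplitude per trial (µV)
bootstrap_sme <- function(trial_amps, n_boot = 10000) {
  n <- length(trial_amps)
  boot_means <- replicate(n_boot,
                          mean(sample(trial_amps, n, replace = TRUE)))
  sd(boot_means)  # SD of the bootstrapped means = estimated SE of the sample mean
}
```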

Present study
To sum up our present attempt: we explore the influence of the reference scheme as well as the quantification method of the FRN on the FRN results and on the quality of the data. We evaluate the different methodological choices using condition effects and effect sizes (Cohen's d), and we use the SME to estimate precision. As an additional quality criterion, we use the well-established test-retest reliability in the form of the ICC(2,k) (two-way random effects, absolute agreement, multiple measurements; see Koo & Li, 2016) to compare the quantification as well as the reference choices (cf. Rodrigues, Müller, Mühlberger, & Hewig, 2018).
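The ICC(2,k) can be obtained from any two-way decomposition of a subjects-by-measurements matrix; the present analyses used the SimplyAgree package (Caldwell, 2022), but an equivalent computation with the psych package is sketched here (the data layout is our assumption for illustration):

```r
library(psych)

# ratings: subjects x measurements matrix, e.g., FRN scores from repeated blocks
icc_all <- ICC(ratings)  # Shrout & Fleiss ICC variants
icc_2k  <- icc_all$results[icc_all$results$type == "ICC2k", "ICC"]
```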
A first central question is whether the different quantification methods lead to similar or dissimilar effects in the FRN responses in the trust game and whether some methodological choices should be favored over others.
Secondly, the same question applies to the different reference schemes. To evaluate this, the condition effects, effect patterns, effect sizes (Cohen's d), and the topographies of the three main conditions will be compared. Furthermore, we use SME and reliability measures to evaluate the precision and reliability of the different quantification methods and EEG reference schemes.
Thirdly, we aimed to explore the consistency of the different evaluation criteria. Accordingly, we explore the relation of SME, ICC, FRN amplitude, and effect size of the conditions, to determine whether these evaluation criteria are consistent with each other. In particular, we examine whether the SME is a useful criterion in the context of choosing between different quantification methods and different reference schemes. This is especially important as the SME has been suggested and used as a selection tool for processing and preprocessing choices, despite the advice not to rely on it solely (Luck et al., 2021; Zhang & Luck, 2023).

Material and methods

Ethical statement
The study was carried out in accordance with the recommendations of the "Ethical guidelines of the Association of German Professional Psychologists" ("Berufsethische Richtlinien, Berufsverband Deutscher Psychologinnen und Psychologen") with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki before they participated in the experiment. The protocol was approved by the local ethics committee.

Participants
We report how we determined our sample size, all data exclusions, all inclusion/exclusion criteria, whether inclusion/exclusion criteria were established prior to data analysis, all manipulations, and all measures in the study: Our sample was not based on a sample size calculation and a priori power analysis but was sampled ad hoc in three different phases. In total, one hundred and three right-handed, healthy female subjects participated in the study (mean age = 23.4, range = 18–35). Due to poor data quality, three participants had to be excluded from the sample, leaving 100 participants. Individual borderline symptomatology, as measured by the short version of the "Borderline Symptom List", the BSL-23 (Wolf et al., 2009), was considered as a control variable in the analyses, and sampling was originally done in three different groups according to high, medium, and low scores on this criterion (since the BSL is not the main focus of the present study, see supplemental materials for analyses including this measure). The demographic data and BSL score of one person could not be recovered, and this person was removed from the analysis (see Table S1 in supplemental materials), leading to a final sample of 99 participants. The subjects received 20€ or course credits for their participation. All participants had corrected-to-normal vision.

Procedure and paradigm
Berg et al.'s (1995) Trust Game in a modified version by Wang et al. (2016) was used as the behavioral task. As a cover story, participants were informed that they played with a different adult trustee randomly selected in each round and that the choices of the trustees were randomly selected from a pre-study (Wang et al., 2016). The responses of the trustees followed the same pre-programmed procedure across all participants, with random reciprocity decisions across all trials. Originally, the experiment had three phases: First, subjects played 110 rounds of the trust game from the trustor's perspective (10 practice trials, block 1: 50 rounds, block 2: 50 rounds). The reinforcement rate for the trustor (rate of cooperation and receiving 20 points after an investment decision) amounted to 50%. In the second part of the experiment, a role change followed, in which the subjects played 110 rounds in the role of the trustee with a (trustor-sided) trust rate of 70% (10 practice trials, block 1: 50 rounds, block 2: 50 rounds). In the third and final part of the experiment, subjects switched back to the role of the trustor and played 100 rounds of the trust game (block 1: 50 rounds, block 2: 50 rounds). For the present manuscript, the two phases of the trust game as trustor were used and collapsed. Fig. 2 illustrates the sequence of a single trial in the role of the trustor in the Trust Game. Participants saw a decision tree for 1500 ms showing the trustor's options and their possible outcomes. Then, a variable fixation cross was presented for 800 ms to 1000 ms, followed by a decision screen for 2000 ms.
Participants had to indicate whether they wanted to keep the 10 points (cued by '10', indicating 'no trust') or whether they wanted to invest the 30 points with the trustee (cued by '30', indicating 'trust'). To indicate their decision, participants had to press the left or right mouse button, respectively; the position of the decision options was counterbalanced between participants. If no response was given, a warning message appeared on the screen pointing out that the participant had responded too slowly. Then a black screen was presented for a variable 800 ms to 1200 ms, followed by the presentation of the investment feedback for 1200 ms. After a trust decision, the possible feedback was 'loss' (cued by '0:40') or 'gain' (cued by '20:20'). After a no-trust decision, neutral feedback (cued by '10:10') was presented. The final screen displayed the current total score for 2000 ms. The FRN was analyzed as a neural correlate of the feedback presentation.
An additional electrode to register eye movements and blinks was placed below the left eye. Electrode impedances were kept below 5 kOhm for the EEG. Data were recorded with a sampling rate of 500 Hz and a bandpass filter (Rodrigues, Weiß, et al., 2021). The electrode positions F3, F4, FC2, FC1, Fz, and FCz were used to quantify FRN responses (cf. Sambrook & Goslin, 2014). Cz was not used as an electrode of interest due to different latency responses concerning feedback potentials (Luck & Hillyard, 1994). The time window approach used 200 ms to 400 ms (Yeung & Sanfey, 2004), and for the peak window approach, the negative peak was searched for in this time window and a 40 ms window around the peak (+20 ms/−20 ms) was used to extract the FRN response (e.g., Rodrigues et al., 2020, 2022); a sketch of this extraction is given below. For peak-to-peak analysis, the time window was taken from 150 ms to 300 ms for the P2 peak selection (e.g., Luck & Hillyard, 1994) and 200 ms to 400 ms for the negative N2 peak selection (e.g., Yeung & Sanfey, 2004). The SME of each FRN quantification was computed with 10,000 iterations of bootstrapping (Luck et al., 2021). If a mean was needed to determine the peak activation, an unweighted mean was computed (partly deviating from the EPOS pipeline, but more common in the field, cf. Rodrigues, Weiß, et al., 2021).
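The peak window extraction described above can be sketched as follows (a minimal R sketch at the 500 Hz sampling rate used here; variable names are ours):

```r
# erp: averaged ERP at one electrode (µV); times: time vector (ms)
frn_peak_window <- function(erp, times, search = c(200, 400), half_width = 20) {
  in_search <- which(times >= search[1] & times <= search[2])
  peak_time <- times[in_search[which.min(erp[in_search])]]  # negative peak latency
  in_window <- times >= peak_time - half_width & times <= peak_time + half_width
  mean(erp[in_window])  # unweighted mean over the 40 ms window around the peak
}
```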

Statistics
For the FRN, a single-trial analysis was computed with participant as the random cluster, a random slope for trust experience (trust not rewarded/not trusted/trust rewarded), and correlated random slopes for quantification (time window/grand mean peak window/grand mean condition peak window/grand mean difference wave peak window/individual mean peak to peak/individual mean condition peak to peak/individual mean difference wave peak window) and electrode (FCz/Fz/FC1/F3/FC2/F4). The fixed effects were the factors quantification (time window/grand mean peak window/grand mean condition peak window/grand mean difference wave peak window/individual mean peak to peak/individual mean condition peak to peak/individual mean difference wave peak window), trust experience (trust not rewarded/not trusted/trust rewarded), and electrode (FCz/Fz/FC1/F3/FC2/F4), with the fixed continuous predictor trial number standardized within participants (and mean-centered borderline symptom level for supplementary analyses). The first level was used as the reference level for each factor.
For the SME (in all cases not calculated on single trials; Luck et al., 2021), similar models were analyzed, but without trial as a predictor (as the SME cannot be computed for a single trial but only across several trials, see Luck et al., 2021) and adding the participant-centered FRN responses (to account for possible systematic influences of the amplitudes and condition effects on the SME) and the centered number of datapoints given by the quantification as additional continuous predictors.
Each model was computed once for the CSD reference, the average reference, and the linked mastoid reference. The best model for each reference and variate was chosen from the null model, the simple additive model, and the complex interaction model using the corrected Akaike Information Criterion (AICc) and the probability of information loss (Burnham & Anderson, 2002). All p-values were Bonferroni-adjusted for each single fixed effect term (including the intercept) in the respective target model. The models were computed using R (R Core Team, 2020) and the glmmTMB package (Brooks et al., 2017); visualization was done using ggplot2 (Wickham, 2016). For the FRN response, Cohen's d was calculated for the Bonferroni-adjusted significant terms, using the effect estimates in the numerator and the square root of the entire random effect variance in the denominator (cf. Brysbaert & Stevens, 2018; Westfall, Kenny, & Judd, 2014). For ICC calculations, the SimplyAgree package was used (Caldwell, 2022).
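In glmmTMB notation, the single-trial FRN model described above corresponds roughly to the following sketch (variable names, the exact random-effects syntax, and the Cohen's d extraction are our assumptions; the reported models may have differed in detail):

```r
library(glmmTMB)

# single_trial_data: one row per trial, quantification, and electrode, with the
# factors quantification, trust_experience, electrode and the covariate trial_z
model <- glmmTMB(
  frn ~ quantification * trust_experience * electrode + trial_z +
    (1 + trust_experience | participant) +           # random slope, trust experience
    (1 + quantification + electrode | participant),  # correlated random slopes
  data = single_trial_data
)

# Cohen's d for a fixed effect: estimate divided by the square root of the total
# random variance (cf. Brysbaert & Stevens, 2018; Westfall et al., 2014); one
# plausible implementation, the coefficient name below is illustrative only
est    <- fixef(model)$cond["trust_experiencetrust_rewarded"]
var_re <- sum(sapply(VarCorr(model)$cond, function(v) sum(diag(v)))) + sigma(model)^2
d      <- est / sqrt(var_re)
```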

Analysis of FRN: comparison of quantification methods using three references
The model testing revealed that the full interaction model (not including the borderline trait, see supplement) was to be preferred for all references (see Table S2 in supplemental materials). In Table 2, the random effects used in the respective target models are shown. The numerical differences are partly explained by the differences in measurement units (i.e., larger values for the CSD reference).
In Fig. 3, "robust" effects (after Bonferroni correction) for each reference are displayed. For a complete picture of exploratorily significant findings, see Table S3 in the supplemental analysis.
There are significant differences in FRN amplitude between the quantification methods (see effects including quantification in Fig. 3). The main effect of quantification showed larger, more negative FRN values for the peak window approaches (i.e., 'grand mean condition peak window', 'grand mean peak window') as well as for the 'individual mean condition peak to peak' measurements, while the difference wave peak window quantifications (i.e., 'individual mean difference wave peak window', 'grand mean difference wave peak window') led to reduced FRN responses (except for the CSD reference in the case of 'individual mean difference wave') compared to the 'time window' approach, which served as the baseline category for the comparisons (see Fig. 3, Figs. 4 and 5). Hence, concerning the effect sizes, the peak window approaches as well as the 'individual mean condition peak to peak' quantification led to larger effects. Interestingly, the main effect of electrode revealed that the essential position of the FRN response in the chosen paradigm seems to be rather frontal, with more negative FRN responses for Fz, F3, and F4 compared to FCz. For the CSD reference, the main effects of quantification and trust experience interacted, revealing two different patterns of FRN reaction, especially for the quantification approaches 'grand mean condition peak window' and 'grand mean peak window' (more positive FRN than time window) and the 'grand mean difference wave peak window' (leading to more negative FRN than time window).
For the not-trusted experience, a midfrontal activation with the negative peak at FCz, Fz, F3, and F4 was found (see Figs. 3 and 4), while for trust being rewarded or not, only the more frontal electrodes Fz, F3, and F4 were essential. This topographical pattern is present to a lesser extent also in the average reference (see Figs. 3 and 4).
For the trust behavior evaluation, not having trusted led to a main effect with more negative FRN values in this condition compared to having trust not rewarded (see Fig. 3, Figs. 4 and 5). Being rewarded for the trust decision led to a main effect of the FRN with more positive values only for the average reference and the linked mastoid reference, while for the CSD reference only an exploratory effect was found (see Fig. 3, Figs. 4 and 5). In general, the interaction of the quantification with the trust experience led to a less pronounced reward positivity for rewarded trust if the 'individual mean difference wave peak window' quantification was used, which can be explained by a topographical shift of the peak when a difference was taken.
When comparing Cohen's d, the same pattern as already present in the ERPs was evident. The average reference and the linked mastoid reference reveal larger differences between the rewarded and unrewarded trust experience (see Fig. 3, Figs. 4 and 5). Concerning the quantification, the CSD seems to lead to larger effects concerning the absolute values of the FRN signal (see Fig. 3). The effect of electrode location was dampened in the linked mastoid reference, as was to be expected. CSD reveals a narrowly focused difference between trust rewarded and not rewarded in a short time frame. This is revealed by electrode position interactions for CSD in Fig. 3 and depicted in Fig. 4. In contrast, the difference between the rewarded and unrewarded conditions is spatially and temporally smeared, likely due to differences between these conditions at the reference electrode positions, for both linked mastoids and average reference (see Figs. 4 and 5 and the next section). The entire model parameters of all reference schemes can be seen in the supplemental materials (Table S3).

Do CSD reference and average reference indicate that there is systematic activation, i.e., bias, at the mastoid electrodes?
The present section follows up on the previous results section in order to examine whether the findings for CSD and average reference suggest that there is systematic confounding activity at the mastoid electrodes, which would invalidate the data referenced to the mastoid electrodes.
The model testing of the mastoid electrodes revealed that the full interaction model was the model to be preferred for all references (see Table S5 in supplemental materials).
The exploration of the linked mastoid electrode signal revealed a main effect for the 'grand mean difference wave peak window' method, leading to a more negative signal in general (see Fig. 6 and Table 3).
Also, a main effect was found for the trust experience, with no trust leading to a more positive signal in general. This effect was moderated by the quantification method, being larger for the 'grand mean peak window' approach, and in CSD also for the 'grand mean condition peak window' approach, while it was dampened for the 'grand mean difference wave peak window' with this reference.
For trust being rewarded, only the average reference revealed a negative effect in the linked mastoid signal, yet this was dampened for the 'individual mean difference wave peak window' quantification.
These findings reveal possible biases for the linked mastoid reference that are modified by choosing different quantification methods and that may even vary over different conditions as well as trials in the paradigm. This bias (see Fig. 6 and Table 3), which may be introduced by the reference electrodes of the linked mastoid reference, may be one reason for the seemingly enhanced sensitivity to condition differences in the analyses using a linked mastoid reference compared to the CSD reference above (see previous section). The average reference seems to reveal similar signal problems of the linked mastoid electrodes as the CSD.

Analysis of SME
AICc and the probability of information loss identified the additive model as the best-fitting model (see Table S4 in supplemental materials). The analysis of the SME revealed that the biggest influence came from the number of datapoints of the quantification method for all reference schemes, with lower SME for more datapoints (see Fig. 7 and Table 4). This might be trivial, as less variance may occur when taking the mean over longer data segments, which leads to less variance over the drawn trials in the bootstrapping. Also, the electrode position had a strong influence on the SME, with every electrode leading to a lower SME than FCz, except for Fz in the CSD and average reference. This reveals a maximum of activation of the FRN at midline positions for every reference, as there seems to be the most variance there (see Table 4).
Concerning the quantification methods, after accounting for the number of datapoints used, no quantification method led to lower or higher SME for any reference scheme (see Table 4), except 'individual mean peak to peak', which had a significantly lower SME for all three references and thus would be the method of choice according to the SME. The mean FRN amplitude nevertheless played a role, with more negative amplitudes leading to a higher SME, meaning that larger amplitudes seem to cause more variance showing up in the SME (see Table 4). Concerning the references, the 'average reference' had the lowest SME, followed by the 'linked mastoids' and the 'CSD'. However, since the CSD has a different measurement unit, it is questionable whether the numerically higher value can be interpreted.
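The mechanical dependence of the SME on the number of averaged datapoints can be illustrated with a small simulation (ours, not part of the original analyses): trial scores formed by averaging more noisy samples vary less, so their bootstrapped SME shrinks even though nothing about the underlying signal quality changes.

```r
set.seed(1)
n_trials <- 50

sim_sme <- function(window_samples) {
  # each trial score = mean over window_samples noisy samples (true signal: -5 µV)
  trial_scores <- replicate(n_trials,
                            mean(-5 + rnorm(window_samples, sd = 10)))
  bootstrap_sme(trial_scores, n_boot = 2000)  # function sketched in the introduction
}

sapply(c(10, 50, 100), sim_sme)  # SME decreases as the window grows
```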

Exploration of test-retest reliability: ICC
The exploration of the ICC revealed higher retest reliability for the linked mastoids reference, followed by the average reference and lastly the CSD reference. This could be explained by the topographical sharpening of the CSD reference leading to more variance in the target area and hence to less intercorrelation. Also, the quadratic unit of the CSD leads to an amplification of the differences between trials, leading to lower homogeneity between the trials. For the linked mastoids reference, the higher homogeneity could be due to the systematic bias introduced by the linked mastoids signal in this case, explored in the preceding section. In other words, systematic variance that produces a difference at the mastoid reference electrodes leads to greater similarity between trials, which increases reliability. However, this may reduce validity because larger parts of the systematic variance are due to activity at the reference electrode rather than at the target electrode.
Concerning the quantification methods of the FRN, the grand mean condition peak window approach as well as the individual mean condition peak to peak approach seem to be favorable on a descriptive level and partly also on a statistical level (see Table 5). The least favorable quantification approaches concerning homogeneity were the 'grand mean peak window' and the 'individual mean peak to peak' quantification. This leads to the impression that the influence of the conditions played an important role in the homogeneity of the trials. Yet, the difference in homogeneity may also simply come from the difference in datapoints. Hence, it may be important to consider the differences in homogeneity as well as the differences in datapoints when it comes to quantification choices. Also, despite having the most datapoints and therefore possibly also the least variance, the 'time window' approach did not outperform the other quantification approaches. This difference from the SME results further stresses the importance of using different quality criteria and not relying solely on the SME (cf. Luck et al., 2021, p. 14). Especially the selection of processing choices may be flawed if the biases explained above are not considered (e.g., Zhang & Luck, 2023).

Discussion
In this work, we compared different quantification methods of the FRN for three different reference schemes. For this comparison we used several quality criteria. In addition to the SME (Luck et al., 2021), we calculated the ICC, and we used the effects, effect sizes, and effect patterns of the amplitude of the FRN in a trust game paradigm.
Concerning the quantification method of the FRN, there were major differences in the absolute values. However, only few interactions with the trust experience conditions were present. In general, the peak window approaches on the signal led to more pronounced negative values for the FRN, which is a simple consequence of selecting the negative peak within the "big"/general time window and then using a smaller time window around the peak. Interestingly, the difference wave peak window quantification did not lead to a more negative value than the general time window approach. This may be caused by the differences between the conditions shifting the negative peak of the difference wave away in time from the negative peak of the signal. Furthermore, interindividual and intraindividual differences in latency, leading to a wide distribution of latencies and therefore no well-defined window that includes the maximum of negative FRN amplitudes, may be disadvantageous compared to the time window approach (cf. Sambrook & Goslin, 2015). The general effect pattern (not trusted: most negative FRN; trust not rewarded: "middle" FRN; trust rewarded: most positive FRN or reward positivity, Baker & Holroyd, 2011) was similar for all quantifications (see Table 5). However, according to Cohen's d, the effect size was dampened for the 'individual mean difference wave peak window' approach. This questions the validity of this approach, as the difference wave seems to measure a concept or signal quality that is slightly different from all other quantifications, as seen above. Given a rather large time window around the difference peak, this may not have that much of an impact (cf. Cohen et al., 2007; Marco-Pallares et al., 2011), but with a narrow time window, the effect pattern differs from every other quantification that was used. However, it is important to note that we did not find any other consistent major difference between the quantification approaches in the present trust game paradigm concerning the general effect pattern (Table 5). Accordingly, despite small differences in effect size, all quantifications seem to be sensitive to the underlying effect pattern. These results seem highly comforting, as the influence of the different habits of quantification may not change the results too much. Hence, we may recommend all quantification methods mentioned above based on the present data. Finally, the SME indicated the lowest values for the 'individual mean peak to peak' method, suggesting a particularly high precision of this quantification method in the present dataset.
Our second research question was a comparison of the reference schemes, and the results lead to the conclusion that the FRN responses were detected similarly for all three references (see Table 5). Yet, the linked mastoids and the average reference seemed to be more sensitive to the effects of the trust experience conditions, as indicated by larger effect sizes (see trust experience effects in Fig. 3). However, on closer explorative inspection, the linked mastoid activation might be confounded in the present paradigm by activity at the reference electrode (see Fig. 6) and may therefore also show artificially high intraclass correlations (see Table 5). Hence, if one chooses to pick a specific site such as linked mastoids or common vertex reference (Cz), the measurement may be distorted by activity at the chosen reference site. Accordingly, we advise checking the condition-specific activity of the respective reference electrode (e.g., mastoids, vertex) by using CSD or average reference. This does not mean that the CSD or average reference have a completely "unbiased" view on the data. In fact, the bias of the CSD concerning topographical sharpening is substantial (e.g., Allen & Reznik, 2015; Cohen, 2014; Hagemann, 2004; Kayser, 2009; Reznik & Allen, 2018; Rodrigues, Allen, et al., 2021; Stewart et al., 2014), as is the polar average effect for the average reference when unevenly distributed sensors are used (Junghöfer et al., 1999). Interestingly, in the present data, the bias of the linked mastoid reference seemed to work in favor of detecting an effect concerning the conditions (see Figs. 3 and 4), or, vice versa, the CSD reference may have lost sensitivity concerning the trust experience conditions as well as homogeneity, due to overly splitting and sharpening the topography of the FRN response (see Fig. 3 and Table 5). However, the greater sensitivity of the linked mastoid reference may be due to activity at temporal sites rather than FRN activity at frontal sites. The topographies in Figs. 3 and 7, and the significant interaction effects with electrode position in Fig. 3, indicate a greater sensitivity to topographical patterns for the CSD reference. The CSD also showed the highest effect sizes concerning the differences between the quantification approaches (see quantification effects in Fig. 3). Both arguments support CSD as the method of choice. Yet, CSD measures had lower reliability (ICC).
Taken together, the biases introduced by reference schemes may not be in favor of or against effect detection per se, yet we would argue that they have to be checked and controlled. This is particularly important because it is possible that the activation at the reference electrode sites may sometimes hinder effect detection if the effects have the same polarity at the reference as at the target electrode. Furthermore, the quantification may interact with the chosen reference scheme and the resulting topographical pattern. Hence, it is vital to check, control, and report these activation patterns of a chosen reference electrode in EEG research, to ensure the validity of the reported ERP concerning the specificity of topographical activation.
Concerning the recently suggested quality criterion SME (Luck et al., 2021), we found no differences between quantification methods. However, the SME was confounded with the number of datapoints in the quantification method of the ERP (see Table 4 and Fig. 7). Hence, the systematic dependence of the SME on the datapoints limits the extent to which variance in SME reflects data quality. This can be seen as one example of changes that favor a reduced SME but may decrease validity (cf. Luck et al., 2021). In their work, Luck and colleagues mention the examples of excessive filtering or "flatlining" the EEG, yet the number of datapoints of the quantification can also be seen as such an example. Comparing SME with Cohen's d, it is evident that different reference schemes would be preferred based on these two parameters (e.g., CSD having a higher SME but the highest Cohen's d). In addition, if the datapoints were not taken into account, the SME would favor the 'time window' approach, despite it not leading to a preferable Cohen's d (see Fig. 3 and supplemental Table S7). Therefore, the SME should be used with caution when using a transformation or processing choice that affects the number of datapoints. As we were interested in the difference between quantification methods, the SME might not be suitable as the only tool in such cases. The dependency on the number of datapoints in the respective analysis window and the associated bias toward less variance in SME if more datapoints are present favor the "largest" (concerning datapoints) quantification methods available (see Table 4 and Fig. 7). Previous research concerning the bootstrapping method did reveal the importance of the window length being neither too long nor too short (e.g., Berkowitz & Kilian, 2000; Bühlmann, 2002; Hall, Horowitz, & Jing, 1995). In our case, the smaller sampling window may represent the optimal sampling window concerning the bootstrapping of time series. If this time window is exceeded, a possible underestimation of the variance may be present in bootstrapping approaches (e.g., Politis, 2003). Following classical test theory, one could also assume the ERP to consist of a random signal (error) confounded with the real signal, leading to less error variance in longer data segments up to a specific threshold. This effect would be desirable and may be present in some datasets (e.g., P3b, Verleger, 2020). However, we also found that a more positive FRN amplitude led to a lower SME, and thus a more negative FRN amplitude (and therefore a more intense N2 response) showed a higher SME (Table 4). This leads to two assumptions or issues: First, there is not only random variance, but (non-target) systematic variance that may overshadow short ERP responses in long data segments. For example, the uniformity and much greater size or amplitude of other frontal components like the P300 amplitude (P3a, Verleger, 2020) or the P200 amplitude may confound the SME. Especially long data windows will lead to the confounding inclusion of these other ERPs in the signal of interest. This other systematic variance "twists" the SME to target this "alien" homogeneity instead of the desired homogeneity of the target effects, in our example the FRN. Summing up this first problem of the SME: it may be confounded by variance of other components and therefore other "true signals" that may overshadow the component of interest if the quantification window is too large. The second, and partly related, issue is
the problem of regarding an ERP as an unchanging, non-dynamic, mathematically true signal: ERPs are not invariant over time and are subject to learning and adaptation processes that may produce variance. The presence of an effect therefore leads to a higher SME (cf. electrode FCz having the highest SME), due to accompanying effect variance (e.g., adaptation processes) leading to more variance than if no effect were present at all. This variance may arise, for example, from participants adapting and learning over time in their FRN responses (cf. Cohen et al., 2007). This second issue is similar to the first issue, but partly even worse, as higher effect sizes on the individual level also tend to carry more accompanying variance (e.g., regression to the mean). Summing up the second issue of the SME: it may be confounded by the variance accompanying (large) effect sizes.
Hence, this quality criterion may be criticized because a higher effect in this case may lead to a lower quality criterion.
To exaggerate and reduce it ad absurdum: maybe the "best" SME would be present if there were no effect at all and possibly a very long data segment with a very high sampling rate, as this would lead to the least variance (see Luck et al., 2021, for the extreme example of bridging all electrodes together). Or, to put it even worse, as this criterion has been suggested for choosing processing and preprocessing decisions (Luck et al., 2021; Zhang & Luck, 2023), it might favor pipelines that detect effects not best, but worst. This exaggeration is merely meant to show the problematic aspects of the SME as it was suggested and is not meant to totally "dump" or undermine the general idea of precision of measurement. However, the SME criterion must be interpreted with caution, as already argued by Luck et al. (2021), and the number of datapoints should be considered as a control variable. Especially if complex decisions like EEG processing choices are to be evaluated, other quality criteria should be used in addition to the SME, for example effect size measures.
FRN analyses have used varying filter settings (e.g., 25 Hz low-pass filter: Mussel, Hewig, & Weiß, 2018; 30 Hz low-pass filter: Osinsky et al., 2012; 20 Hz low-pass filter: Weiß, Gutzeit, Rodrigues, Mussel, & Hewig, 2019). In addition, the FRN responses were not compared to similar psychophysiological responses (e.g., midfrontal theta activation, cf. Rodrigues et al., 2020, 2022). The EEG recording was done using only 32 electrodes instead of high-density montages of 64, 128, or even 256 electrodes. This limits topographical interpretations and the separation of signal and noise using ICA, and it might interact with the results for the effect sizes as well as the SME. Yet, it has been shown that at least signal processing and signals are similar in interpretation in 32-electrode montages (e.g., Kayser & Tenke, 2006).
Concerning the sample that was explored here, there could be a much more detailed theoretical introduction and analysis concerning, for example, the relevant borderline personality trait. However, due to the methodological focus of the manuscript and the limited space, time, and resources, we decided not to add further factors to our data exploration and to keep the theoretical description of our sample to a minimum, using it simply for illustrative purposes concerning methodological questions.

Conclusion
In this work we could show that the quantification methods for the FRN influenced the absolute value of effects and effect sizes, but the patterns themselves were detected by all methods used. Additionally, we found that the linked mastoid reference was biased by effects of the experimental conditions that were present at the mastoid electrodes. Yet, the linked mastoid reference and the average reference were more sensitive to differences between the (trust experience) conditions than the CSD, possibly due to the CSD overly sharpening the FRN activation. Hence, it could be useful to present the effect patterns at the reference electrodes using references that retain them as active electrodes, e.g., the average reference or the CSD reference, to get more detailed insights into the data at a reference electrode like the mastoids. Also, it is important to examine the effect pattern not only at one electrode, but also concerning its topographical distribution, which CSD and average reference were most capable of. Finally, we were able to show that the SME is highly dependent on the number of datapoints in the quantification window of the respective EEG feature (i.e., the FRN in our case), as well as on the amplitude of the component itself. Therefore, one should consider controlling for these important aspects of the data when considering using the SME to compare different processing choices, and it is advisable to include other criteria in addition to the SME for complex decisions like EEG processing choices.

Fig. 1 – Decision logic of the trust game.
Fig. 2 – Trial sequence in the role of the trustor in the Trust Game.

Fig. 3 – Cohen's d for the significant single-trial FRN analyses for all references.

Fig. 4 – ERP response with CSD reference, linked mastoids, and average reference to the three experimental conditions (trusted: not rewarded, not trusted, trust rewarded). The black lines indicate the time window approach, while the grey lines indicate the mean-based peak windows. The topographical maps represent the grand mean over all three conditions in the respective time window of the specific sample. The shaded error bars represent the between-subjects SEM.

Fig. 5 – FRN response in the trust game dependent on trust experience and reference scheme. Error bars represent the between-subjects SEM.

Fig. 6 – Linked mastoid signal measured with CSD reference and average reference in the trust game. Shaded error bars represent the between-subjects SEM.

Fig. 7 – Relation of the SME to the number of datapoints used in the quantification methods for all three references. Colors indicate participants.

Table 1 – Examples of quantification approaches used in FRN research.

Table 2 – Random effect variance for the target model for CSD reference, linked mastoid reference, and average reference.

Table 3 – Significant effects of the linked mastoid activation analysis using CSD reference and average reference, with Bonferroni correction.

Table 4 – SME of the FRN for CSD reference, linked mastoid reference, and average reference.