Don't SNARC me now! Intraindividual variability of cognitive phenomena – Insights from the Ironman paradigm

Two implicit generalizations are often made from group-level studies in cognitive experimental psychology and their common statistical analysis in the general linear model: (1) Group-level phenomena are assumed to be present in every participant, with variations between participants often treated as random error in data analyses; (2) phenomena are assumed to be stable over time. In this preregistered study, we investigated the validity of these generalizations in the commonly used parity judgment task. In the proposed Ironman paradigm, the intraindividual presence and stability of three popular numerical cognition effects were tested in 10 participants on 30 days: the SNARC effect (Spatial-Numerical Association of Response Codes, i.e., faster left-/right-sided responses to small/large magnitude numbers, respectively; Dehaene, Bossini, & Giraux, 1993), the MARC effect (Linguistic Markedness of Response Codes, i.e., faster left-/right-sided responses to odd/even numbers, respectively; Nuerk et al., 2004), and the Odd effect (i.e., overall faster responses to even than to odd numbers; Hines, 1990).


Introduction
Generalization is the core of inductive reasoning in science. By collecting several observations, supported by a toolkit of statistical inference, we generalize our findings beyond the observed data. Some of the generalizations we make are explicit, while some remain implicit and unspoken. In experimental psychology, the research process quite often goes from collecting data from a sample of individuals and searching for group-level effects to generalizing that the observed effects are present in the entire population by means of frequentist (or Bayesian) statistics. An extensive body of literature describes controversies and limitations of generalizability (Simons, Shoda, & Lindsay, 2017; Yarkoni, 2020). The extent and validity of sample-to-population generalizability are far from being resolved and remain beyond the scope of this paper.
Apart from this most obvious generalization, at least two more generalizations may be implicitly made by researchers at the same time. The first of these is the generalization that the group-level effect is reflected in each participant, which may be the case but is not necessarily warranted (see discussions of the ecological fallacy in Fisher, Medaglia, & Jeronimus, 2018, and the group-to-person generalizability problem described by McManus, Young, & Sweetman, 2023). Until recently, this was even an explicit assumption in standard statistical methods, such as the general linear model. The general linear model assumes that the random variables characterizing the responses given by participants in the same experimental condition are independent and identically distributed. That is, these variables have the same mean and standard deviation, while any between-participant variability is treated as random error. The issue of between-participant variability has been gaining more attention recently, as new methods are being described to quantify the individual prevalence of group-level psychological phenomena (e.g., Cipora et al., 2019; Haaf & Rouder, 2019; Rouder & Haaf, 2018; Rouder & Haaf, 2019; Zayas, Sridharan, Lee, & Shoda, 2019). Such studies usually demonstrate that not every participant, and in some cases even less than half of the participants, shows a reliable and replicable group-level effect at the individual level. Even though the methods for quantifying individual prevalence are relatively new, the problem has a much longer history. Phenomena initially described as general principles of cognition, such as the advantage of global over local processing (Navon, 1977), turned out to reflect one of several possible cognitive styles (Happé & Frith, 2006; Poirel, Pineau, Jobard, & Mellet, 2008). In their dominance account, Rouder and Haaf (2018) differentiate between dominant phenomena like the congruency effect in the Stroop task (i.e., when naming ink colors of color words,
responses are slower and more likely erroneous if the ink color and the word mismatch than if they match; Stroop, 1935), which are present in all people, and indominant phenomena like the advantage of global processing (Happé & Frith, 2006; Poirel et al., 2008), which are not present or even reversed in some individuals. Crucially, traditional significance testing at the group level cannot differentiate between dominant and indominant phenomena. For this, the presence of the phenomenon under scrutiny must be checked for each participant separately. A solution for estimating the proportion of the sample that reveals an observed group-level phenomenon at the individual level is bootstrap confidence intervals, which are built on the outcome measure's individual variability for each participant (Cipora, van Dijck, et al., 2019). Investigations of several cognitive phenomena have successfully employed bootstrap confidence intervals and shown them to be a suitable approach to answer the dominance question (e.g., Guida, Mosinski, Cipora, Mathy, & Noël, 2020; Hohol et al., 2020; Van Dijck, Abrahamse, & Fias, 2020; Van Dijck, Fias, & Cipora, 2022). In contrast to hierarchical modelling, which can estimate overall variability coming from separate sources (i.e., group level, participant level, and session level), bootstrap confidence intervals are a straightforward way to look at one level (i.e., participant level or session level) and disentangle the level's units (i.e., identify which participants or sessions show the cognitive phenomenon of interest reliably).
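As an illustration, a percentile bootstrap confidence interval for a single participant's effect score can be built by resampling that participant's trials with replacement. The following is a minimal Python sketch with simulated trial data; the function name `bootstrap_ci` and the simulated effect size are our own illustrative choices, not the exact procedure of Cipora, van Dijck, et al. (2019).

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(trial_scores, n_boot=5000, alpha=0.05):
    """Percentile bootstrap CI for one participant's mean effect score.

    trial_scores: per-trial effect values (simulated here). The effect
    counts as reliably present if the CI excludes zero.
    """
    scores = np.asarray(trial_scores, dtype=float)
    boot_means = np.empty(n_boot)
    for i in range(n_boot):
        # Resample trials with replacement, keeping the original n
        resample = rng.choice(scores, size=scores.size, replace=True)
        boot_means[i] = resample.mean()
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Simulated participant: true effect of -20 ms plus trial-level noise
trials = rng.normal(-20, 50, size=160)
lo, hi = bootstrap_ci(trials)
print(f"95% CI: [{lo:.1f}, {hi:.1f}]; reliable effect: {hi < 0 or lo > 0}")
```

Because the interval is computed from each participant's own trial-level variability, no distributional assumption about the effect across participants is needed.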
The second implicit generalization, which is our focus, is assuming that the effects observed at the participant level are stable over time and reflect their typical patterns of behavior.While interindividual consistency of group-level effects has become an issue recently, intraindividual consistency over time is largely overlooked in cognitive psychology.
While in diagnostics or personality psychology the test-retest reliability of measured variables is a major topic, in cognitive psychology we usually measure only once and assume that basic cognitive effects are intraindividually stable.This generalization of a current effect to a permanent intraindividually stable effect seems to be mostly unchallenged in cognitive psychology, which might be due to the historical context of the development of scientific psychology.

Cognitive effects in correlational and experimental psychology
Historically, psychology has been divided between two approaches: correlational and experimental, as outlined by Cronbach (1957, 1975). Differential psychology focuses on explaining sources of variability between individuals, mostly taking the correlational approach. The goal of the measures employed is to capture maximum variability between participants, as observing variance is a necessary but not sufficient condition for observing covariance and, in consequence, correlation between measures. Experimental psychology, following the tradition of its founder, Wilhelm Wundt, is interested in general principles of human functioning. In contrast to differential psychology, experimental psychology uses measures and tasks that have been optimized to produce stable and replicable group-level effects. According to Wundt's reasoning, between-subjects variability reflects noise, which should be minimized (Jensen, 2006). However, designing measures and tasks with this goal and mindset makes them unsuitable for correlational studies (Hedge, Powell, & Sumner, 2018).
Measures used in contemporary cognitive psychology stem from the experimental tradition. However, researchers use individual scores obtained by participants in a cognitive task as variables to be correlated with other variables or even as predictors of behavioral phenomena. Such correlations are quite often relatively low. In a seminal paper, Hedge et al. (2018) address this issue, referring to it as the reliability paradox. As explained above, cognitive tasks aim at producing stable group-level effects (in line with the experimental tradition) and are designed to minimize between-subjects variability. Thus, the largest part of the between-subjects variability remaining in individual scores can be assumed to come not from true differences between subjects regarding the measured construct but only from measurement error. This may be the case, for instance, when the task used is too easy or when the sample is homogeneous regarding the measured effects (Paap & Sawi, 2016). Therefore, tasks which produce stable group-level effects often lack reliability as defined in differential psychology and psychometrics.
At the same time, there are numerous cognitive phenomena which, despite being robust at the group level, produce systematic between-subjects variability. One of them is the SNARC effect (Spatial-Numerical Association of Response Codes; Dehaene et al., 1993; Wood, Willmes, Nuerk, & Fischer, 2008, for a meta-analysis; Cipora, Soltanlou, Reips, & Nuerk, 2019, for an online replication; Toomarian & Hubbard, 2018, for a review), which describes a left/right side advantage in responding to small/large magnitude numbers, respectively. It is the hallmark effect in the field of numerical cognition and is typically measured in a parity judgment task with a left and a right response key, where the numerical magnitude affects the response pattern even though it remains task-irrelevant. The effect is indexed with regression slopes, in which dRTs (mean RTs right side minus mean RTs left side) are regressed on numerical magnitude for each participant separately. More negative values correspond to a more pronounced SNARC effect, that is, to a stronger association of small/large magnitude numbers with the left/right side, respectively. Slopes are subsequently tested against zero with a one-sample t-test to check for the presence of the effect at the group level (repeated-measures regression approach by Lorch & Myers, 1990; applied to the SNARC effect by Fias, Brysbaert, Geypens, & D'Ydewalle, 1996). Importantly, with this method, the SNARC effect is not tested at the participant level. Despite robustness at the group level, the SNARC effect varies between participants: Even descriptively, only between 70 and 80% of participants reveal negative slopes pointing towards a SNARC effect (Cipora, van Dijck, et al., 2019; Wood et al., 2008). When testing whether a reliable SNARC effect is present at the participant level (i.e., unlikely to reflect only random error), the proportion of participants revealing it is even smaller and ranges from 35 to 45%. Relatively few individuals (5% to 10%) reveal the
reversed effect (i.e., an association of small/large magnitude numbers with the right/left side, respectively), and about half of the participants do not reveal the effect in any direction (Cipora, van Dijck, et al., 2019). The authors developed two bootstrapping approaches to determine confidence intervals for each participant to check whether their observed score is likely to occur by chance, which we make use of in the current study. Since the construction of these confidence intervals was non-parametric and based on each participant's individual behavioral pattern, it was possible to identify participants revealing a reliable regular or reversed effect. Specifically, the approach allows observed effects that might be a pure byproduct of random variability to be differentiated from (most probably) true underlying effects that would be unlikely to be produced only by fluctuations in reaction times. Some cognitive effects, such as the SNARC effect, reveal variability between individuals despite their robustness at the group level and would therefore be considered indominant phenomena according to Rouder and Haaf (2018). However, the presence of variability does not automatically imply reliability of the effect, because variability could be produced either by stable interindividual differences (with high reliability as measured by test-retest correlations) or by high intraindividual variability of the effect (with low reliability as measured by test-retest correlations).
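The individual slope computation described above (dRT regressed on magnitude) can be sketched in a few lines of Python. The per-number mean RTs below are hypothetical, and `snarc_slope` is our own naming; the point is only to show how a single participant's SNARC index is obtained.

```python
import numpy as np

# Stimulus set of the parity judgment task (5 is omitted)
numbers = np.array([1, 2, 3, 4, 6, 7, 8, 9])

def snarc_slope(rt_left, rt_right):
    """Slope of dRT (right-hand minus left-hand mean RT, per number)
    regressed on numerical magnitude. A negative slope indicates a
    regular SNARC effect (small numbers favor the left side)."""
    drt = np.asarray(rt_right, dtype=float) - np.asarray(rt_left, dtype=float)
    slope, _intercept = np.polyfit(numbers, drt, deg=1)
    return slope

# Hypothetical participant: small numbers answered faster on the left,
# large numbers faster on the right (RTs in ms)
rt_left  = [480, 485, 490, 495, 505, 510, 515, 520]
rt_right = [520, 515, 510, 505, 495, 490, 485, 480]
print(round(snarc_slope(rt_left, rt_right), 1))  # -10.0 ms per unit magnitude
```

At the group level, such slopes (one per participant) would then be submitted to a one-sample t-test against zero, following Lorch and Myers (1990).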
Using a design with enough trials increases the reliability of a task and of the estimated effects by decreasing the measurement error (Baker et al., 2021; Schuch, Philipp, Maulitz, & Koch, 2022; Zorowitz & Niv, 2023). In the case of the SNARC effect, enough trials (Cipora & Wood, 2017, for guidelines) ensure internal consistency as measured with the split-half reliability method (see Supplementary Materials in Cipora, van Dijck, et al., 2019). This seems to make the design promising for use in correlational studies (e.g., Cipora et al., 2016; Cipora & Nuerk, 2013). However, apart from robustness at the group level, sufficient between-subjects variability, and internal consistency, we should also be able to assume that the measure is stable over time, as it is only suitable to be correlated with other measures if it reflects a robust underlying mental process or structure. In terms of personality psychology, the question is whether a cognitive effect reflects a trait (i.e., a long-term characteristic) or a state (i.e., a transient, unstable cognitive process) of the participant.
From a theoretical point of view, the SNARC effect is meant to reflect the mind's tendency to map numerical magnitudes onto space. Many researchers hypothesize a cognitive construct in long-term memory called the mental number line, which might help to process abstract concepts by using cognitive structures responsible for processing concrete objects (Dehaene, 2005). According to this hypothesis, the SNARC effect can be treated as a trait and correlated with other measures (similar claims of trait-like nature are also made about other cognitive phenomena which researchers want to consider in correlational studies). For instance, a stronger mental number line (i.e., mapping numerical magnitude to space with greater ease and efficacy) might be associated with better mathematical performance. This translation from cognitive theory to individual differences is not very straightforward, given that a correlation between the SNARC effect and mathematical performance is not consistently observed across studies (see Cipora, He, & Nuerk, 2020, for a review). A simple explanation for this is that the SNARC effect is not intraindividually stable over time.
Assuming intraindividual instability is not far-fetched given that the SNARC effect is susceptible to situated influences (Cipora, Patro, & Nuerk, 2018, for a review and taxonomy), such as variations of the experimental situation or task instructions. For instance, the SNARC effect's strength or even direction can be manipulated by previous tasks, such as reading texts in languages with left/right reading direction (Shaki & Fischer, 2008) or texts including small/large numbers near the left/right ends of the lines (Fischer, Mills, & Shaki, 2010). Moreover, representational manipulations can affect the SNARC effect, for instance, memorizing ascending or descending number sequences before parity judgment that need to be recalled later (Lindemann, Abolafia, Pratt, & Bekkering, 2008). Different response setups can also influence the strength or direction of the SNARC effect, for example, crossed hands in bimanual setups (Dehaene et al., 1993, but see Wood, Nuerk, & Willmes, 2006a), or verbal response setups with "left" and "right" labels instead of spatial response setups with left and right keys (Gevers et al., 2010). It is not contradictory to assume that the SNARC effect has both trait-like and state-like elements, because most personality models nowadays assume the expression of personality constructs (here: the mental number line) to be a function of both personality and situational context (e.g., Endler & Kocovski, 2001; Wilson, Thompson, & Vazire, 2017). The question which we aim to answer with the current study is more fundamental: If the situational context is not systematically manipulated and participants do the same SNARC experiment repeatedly, will their SNARC effect be stable? The intraindividual stability of the SNARC effect over time has not been established; it has never even been systematically investigated. This is a prerequisite for substantial aspects of our theory building about spatial-numerical associations and for correlational research conducted with the SNARC effect.
To sum up, the SNARC effect is a commonly investigated effect in numerical cognition that reveals large interindividual differences despite its high robustness at the group level. Moreover, it is highly susceptible to situated influences, and its intraindividual stability over time remains unclear. Taken together, the SNARC effect seemed to be the perfect candidate for the present investigation of intraindividual variability.

Temporal stability of cognitive effects
Temporal stability can be quantified in terms of test-retest reliability and in terms of absolute stability of scores. Test-retest reliability measures the extent to which the order of participants' scores in a cognitive task remains consistent across measurement occasions (Zorowitz & Niv, 2023). More precisely, test-retest reliability is based on the correlation of scores obtained by the same participants on two different occasions and therefore provides information about the stability of individual scores in relation to the sample mean. However, it does not consider the stability of scores in absolute terms, that is, their stability in the units used to measure them (e.g., in milliseconds). Test-retest reliability is perfect if all participants' values change by the same linear transformation (e.g., if all participants' mean RTs shift by the same difference between the test and the retest), as the correlation coefficient is invariant to linear transformations. However, even if a similar group-level effect can be observed in the same sample on different occasions, it can come from non-correlated observations that largely differ in absolute terms. Importantly, test-retest reliability is surprisingly low for some cognitive tasks and often not reported (Zorowitz & Niv, 2023). However, Parsons, Kruijt, and Fox (2019) argue that determining and reporting reliability in cognitive psychology should become standard practice, because it can avoid many years of research using measures unsuited to individual differences research. Moreover, Schuch et al.
(2022) investigated four common cognitive-control measures and found that these were stable and reliable when measured at the group level but did not show high split-half and test-retest reliability when assessing interindividual differences as done in correlational approaches. In contrast to test-retest reliability, absolute stability considers how similar individuals' scores obtained on two different occasions are (e.g., how much mean reaction time [RT] varies between different experimental sessions). While test-retest reliability assumes that inconsistencies are entirely due to measurement error, it is important to consider absolute stability to account for variability caused by the participants. As outlined above, bootstrapping can provide valuable insights into absolute stability. Establishing both test-retest reliability and, even more, absolute stability is crucial to ensure that generalizations from single sessions to individuals' typical behavior are justified, but studies investigating this are surprisingly rare.
Test-retest reliability and absolute stability should therefore not be confused with one another. A cognitive effect or measure can be highly reliable while not being stable over time. For instance, if the RTs of all participants increased by 3000 milliseconds between two time points, the correlation between them would be 1 and the effect or measure would be perfectly reliable despite low stability of the scores. At the same time, high absolute stability does not imply test-retest reliability. If all participants had very similar RTs and there was very small random variation between time points, the absolute stability would be very high, but the reliability would be very low (Miller & Ulrich, 2013). To conclude, test-retest reliability and absolute stability, while intuitively similar, provide complementary but distinct types of information.
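The 3000 ms example can be demonstrated numerically: because the Pearson correlation is invariant to adding a constant, a uniform slowing leaves test-retest reliability perfect while destroying absolute stability. A small Python sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical mean RTs (ms) of ten participants at time 1
t1 = rng.normal(500, 60, size=10)
# Every participant slows by exactly 3000 ms at time 2
t2 = t1 + 3000

r = np.corrcoef(t1, t2)[0, 1]              # test-retest correlation
mean_abs_change = np.mean(np.abs(t2 - t1))  # absolute (in)stability
print(f"r = {r:.2f}, mean absolute change = {mean_abs_change:.0f} ms")
# perfect correlation (r = 1.00) despite a 3000 ms shift in every score
```

The reverse case (high absolute stability, low reliability) arises when scores barely move between sessions but the tiny remaining variation is pure noise, so the correlation collapses.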
On top of test-retest reliability and absolute stability, measurement precision can also not be concluded from robust and replicable group-level effects and is independent of test-retest reliability and absolute stability. Luck (2019) defines precision as the extent to which one gets the same values when measuring the same thing on different occasions (or when using different subsets of trials from the same task, as when calculating split-half reliability). In speeded RT tasks, this would be the trial-to-trial variability in RT for a given condition within one participant. The lower this variability is, the more precise the measurement is. Measurement precision is determined within a session and therefore does not say anything about absolute stability and test-retest reliability between sessions.
One approach that takes both between-subjects and within-subjects variation into account is intraclass correlations (ICCs; Hedge et al., 2018; Shrout & Fleiss, 1979). Several forms of ICCs exist, but all of them are calculated as the ratio of the variance of interest over the total variance (i.e., the sum of between-subjects variance, between-sessions variance, and error variance). The different forms of ICCs can be distinguished depending on whether they are based on consistency, such that they consider the relative order of participants' scores across time or across raters, or on absolute agreement, such that they consider the absolute stability of scores across time or across raters (Koo & Li, 2016). In other words, depending on the researchers' interest, ICCs can be calculated to put either the within-subjects variability of cognitive effects (e.g., their absolute stability between sessions) or their between-subjects variability (i.e., their robustness at the group level) in proportion to the overall variance. However, ICCs are not routinely reported, and due to the different variants of this measure, they may be confusing to some readers.
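For illustration, one of the simplest variants, the one-way random-effects ICC (ICC(1) in the Shrout & Fleiss, 1979, taxonomy), can be computed from mean-square terms as follows. The SNARC slopes below are hypothetical, and this is a minimal sketch of one ICC variant, not necessarily the variant used by Hedge et al. (2018).

```python
import numpy as np

def icc_oneway(scores):
    """ICC(1): proportion of total variance attributable to stable
    between-subjects differences, from an n_subjects x n_sessions
    score matrix (one-way random-effects model)."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)
    # Mean squares between and within subjects
    ms_between = k * np.sum((row_means - grand) ** 2) / (n - 1)
    ms_within = np.sum((scores - row_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical SNARC slopes of four participants over three sessions
slopes = [[-12, -10, -11],
          [ -2,  -1,  -3],
          [ -8,  -9,  -7],
          [  1,   0,   2]]
print(round(icc_oneway(slopes), 2))  # high ICC: subjects differ, sessions agree
```

With these made-up data, session-to-session fluctuation is small relative to the differences between participants, so the ICC is close to 1; noisy within-subject scores would push it toward 0.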
Cognitive psychology does not usually investigate the intraindividual stability of effects over time but only looks at average effects at the group level, which become more robust with increasing sample sizes and are more accurately captured with high measurement precision. Smith and Little (2018) argue that, apart from addressing the replication crisis in psychology by increasing sample sizes, repeatedly measuring psychological phenomena in single individuals can also provide more reliable insights; citing Skinner: "It is more useful to study one animal for 1000 hours than to study 1000 animals for one hour" (Skinner, 1966, p. 21). Similarly, it is possible that situational factors affect the observed effects. Participants take part in studies at different times of the day and at different levels of tiredness, alertness, and motivation. While these fluctuations can be expected to cancel each other out at the group level, this is not the case at the participant level. Situational factors remain largely overlooked, even though they have been demonstrated to affect scores in cognitive tasks. For instance, noise during testing has been shown to affect measures of cognitive control (Hommel, Fischer, Colzato, van den Wildenberg, & Cellini, 2012).
There has been a lack of studies investigating the stability of cognitive phenomena over time, and only few studies have investigated the test-retest reliability of the most established cognitive phenomena. Importantly, test-retest reliability has been shown to be low or moderate for the cognitive effects reflected by RT difference scores in the Simon, Stroop, Flanker, stop-signal, go/no-go, Posner cueing, Navon, anti-saccade, attentional network, and color-shape switching tasks; yet, test-retest reliability was high for RTs in these tasks (Cevada, Conde, Marques, & Deslandes, 2019; Eide, Kemp, Silberstein, Nathan, & Stough, 2002; Hedge et al., 2018; Paap et al., 2014; Paap & Sawi, 2016; Strauss, Allen, Jorgensen, & Cramer, 2005). While Paap et al. (2014) found a systematic decrease in the flanker effect over 100 sessions, the stability of other cognitive phenomena, to the best of our knowledge, has not been investigated over the course of many sessions, and therefore especially their intraindividual stability remains unclear. Similarly, in the case of the SNARC effect, so far very few studies have addressed reliability, stability, and precision. The test-retest reliability of the SNARC effect in the parity judgment task has been found to be small to moderate (r = 0.36, p = .01 in Cipora & Göbel, 2013; r = 0.41, p = .014 in Georges, Hoffmann, & Schiltz, 2013; and r = 0.274, p = .116, or, after exclusion of three participants because of high Cook's distance, r = 0.372, p = .04, in Viarouge, Hubbard, & McCandliss, 2014). Note that effects with larger standardized effect sizes or with larger absolute sizes show higher reliability (Schuch et al., 2022). Further, Hedge et al.
(2018) reported the between-subjects variance in the SNARC effect as a proportion of overall variance to be rather low in a magnitude judgment task (ICC of 0.22). Thus, the intraindividual variability of the SNARC effect was relatively high compared to its interindividual variability. However, the 95% confidence interval (CI) for this ICC ranged from 0.00 to 0.49. Importantly, it remains unclear whether the relatively high intraindividual variability in the above-mentioned studies arose from systematic or unsystematic changes in the SNARC effect between sessions or whether it was due to measurement imprecision. Crucially, to find out whether cognitive effects such as the SNARC effect are suitable for correlational (and not only experimental) psychological research, their within-subjects variability should be low and measurement precision should be high (Luck, 2019). Correlations between estimates of cognitive effects and other variables of interest, as well as assessments of individual differences, are limited by the test-retest reliability of the measures (Lord & Novick, 2008) and are therefore only generalizable if both the cognitive effects and the other variables are stable over time and if measurement errors are small (Schuch et al., 2022).
The mentioned limitations in reliability, absolute stability, and measurement precision in SNARC experiments have an important theoretical impact. In fact, we consider the ease and high automaticity of mapping numbers to space to be an important mediator of, or metaphor for, mathematical development (Cipora, Patro, & Nuerk, 2015), being at the core of much research about spatial-numerical associations (and the mental number line). If there is no such stable ability, but rather a transient, highly fluctuating state of association of space and numbers, which bears no intraindividual stability, such conceptual implications would be void. Such considerations apply to other cognitive phenomena as well.

Objectives of the present study
This preregistered study aims at illustrating that the absolute stability of cognitive phenomena cannot be taken for granted. Specifically, we explored the intraindividual stability of the SNARC effect and other numerical cognition effects by checking whether observations in single experimental sessions reflect typical and stable behavior of the participants. For this purpose, we asked participants to perform the same parity judgment task on 30 out of 35 consecutive days. In the following, we will call this extensive study design, in which each participant completes the exact same experiment multiple times regularly within a given period, the Ironman paradigm (in reference to the triathlon competition, as participation in the numerous sessions of our study likewise requires high perseverance). Furthermore, we investigated whether the effects vary systematically over time at the participant level or are associated with the participant's current state. In this study, we considered sleep duration, tiredness, consumption of stimulants, and time of the day as the most obvious situational factors.
As outlined above, the SNARC effect is highly replicable at the group level, even if reliably present in less than half of the participants of previous studies (Cipora, van Dijck, et al., 2019). The SNARC effect is therefore an indominant phenomenon (as classified by Rouder & Haaf, 2018). Investigating its intraindividual stability was the focus of our study. Additionally, the SNARC effect is highly susceptible to situated influences (Cipora et al., 2018). The current study allows us to investigate its stability in individuals who reveal the typical effect, no effect, or the atypical reversed effect, and it potentially also permits us to investigate the role of situational context. Collecting data across 30 sessions per participant allowed us to clarify whether the SNARC effect indeed exists in every participant but was previously not reliably detectable in most samples due to a lack of power, or whether it is genuinely present in less than half of the participants. Furthermore, although the SNARC effect was our main interest, the parity judgment task allowed us to investigate other important numerical cognition effects as well. Namely, we also explored the reliable presence and intraindividual stability of the MARC effect (Linguistic Markedness of Response Codes; Nuerk et al., 2004; i.e., faster left-/right-sided responses to odd/even numbers, respectively) and of the Odd effect (Hines, 1990; i.e., overall faster reactions to even than to odd numbers, irrespective of the response side). Finally, besides the mentioned numerical cognition effects, we also investigated the intraindividual stability of RTs themselves. This measure should reveal less unsystematic variance within participants and instead more systematic changes over time as compared to difference scores (Cevada et al., 2019; Eide et al., 2002; Hedge et al., 2018; Paap et al., 2014; Paap & Sawi, 2016; Strauss et al., 2005), with participants becoming faster from session to session.

Participants
As preregistered, the sample size chosen for the study was small due to the high time requirement of the experiment. In total, 10 participants completed the task by conducting 30 sessions (one per day) of a bimanual parity judgment task within a 40-day period each (the preregistered period was 35 days, but three participants needed some additional days). Three authors of this study also participated in the experiment themselves. Additional participants were recruited through word of mouth at the University of Tübingen, Loughborough University, and Thomas More University of Applied Sciences, as well as in the personal environments of the authors. Students at the University of Tübingen were granted ten course credits. The study was conducted in accordance with the Declaration of Helsinki (World Medical Association, 2013). All subjects gave written informed consent. All participants completed the experiment in its entirety.
All participants were female, between 19 and 56 years old (M = 27.30 years, SD = 10.99 years), and left-to-right readers. Nine participants were right-handed and one was left-handed. Seven participants reported being native German speakers (one of them bilingual, with Spanish as the second native language), two were native Dutch speakers, and one was a native Italian speaker.

Materials
This study implemented a computerized speeded bimanual parity judgment task (the task is available at https://osf.io/p6u5h). Participants had to judge the parity of a number presented on their computer screen by pressing a left key ("D") or a right key ("K") on their computer keyboard with the respective index finger. Stimuli (Arial font, 72 pt) were presented centrally in black (RGB: 0, 0, 0) against a gray background (RGB: 210, 210, 210). Each trial started with a fixation cross (+) shown for 300 ms, followed by a number (1, 2, 3, 4, 6, 7, 8, or 9) presented for a maximum of 2500 ms or until response. An inter-stimulus interval of 500 ms with a blank gray screen followed.
Each daily experimental session consisted of two parity judgment blocks with reversed response-to-key assignment. The block order was counterbalanced between days to minimize sequence effects. In each block, each number was presented 30 times, leading to a daily total of 480 trials (8 numbers × 30 repetitions × 2 blocks). The presentation order was pseudo-randomized so that no number could appear more than twice in a row. Each block was preceded by a brief practice session (two trials per number), during which accuracy feedback was provided for incorrect responses by a red "X" (RGB: 255, 0, 0; Arial font, 72 pt, presented for 800 ms). The task was implemented in OpenSesame (Mathôt, Schreij, & Theeuwes, 2012), version 3.3.10.

Procedure
Participants were asked to install the OpenSesame software on their computers. They received additional instructions on how to code participant and session numbers, the order of experiment versions to be used, and how to save and upload the data to a shared cloud storage. They were instructed to complete the task at any time during the day.
In each session, participants first completed a short questionnaire. They provided their participant code and current session number. Subsequently, they reported their sleep duration in the previous night (open input, rounded to the nearest 0.5 h), their current level of tiredness (11-point Likert scale ranging from 0 [not tired at all] to 10 [almost falling asleep]), their consumption of stimulants (coffee, cola, black tea) within one hour prior to the experiment (binary response format), and the current time of day (three options: morning [waking up until 12 noon], afternoon [12 noon to 6 PM], or evening [6 PM until going to sleep]).
After completion of the questionnaire, instructions for the first practice session were provided, followed by the practice session.
Instructions for the first experimental block followed, and participants then completed the first experimental block. After an opportunity to take a short break, the second block with reversed response-to-key assignment followed.

Data preparation
First, practice trials were discarded. The data were reviewed for any incomplete sessions (i.e., <480 completed trials), which were subsequently discarded from analysis. Furthermore, trials with incorrect responses were excluded from the data. Then, as preregistered, the algorithm proposed by Cipora, Soltanlou, et al. (2019) was applied for further data trimming. Trials with RTs under 200 ms were excluded as anticipations. Then, sequential trimming was applied per participant and session: trials with RTs more than 3 SDs above or below the respective session's mean RT were excluded sequentially until means and SDs no longer changed. Finally, sessions with <66% remaining valid trials, as well as sessions with any empty number × response side cell, were fully excluded from analysis.
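The sequential trimming step can be sketched as follows (a minimal Python sketch under our own naming, assuming RTs in milliseconds; this is not the authors' analysis code, which is available via the OSF link):

```python
import numpy as np

def sequential_trim(rts, n_sd=3.0, min_rt=200.0):
    """Trim RTs (ms) for one participant x session.

    First drops anticipations (< min_rt), then repeatedly removes RTs
    lying more than n_sd standard deviations from the session mean,
    recomputing mean and SD each pass, until no trial is removed.
    """
    rts = np.asarray(rts, dtype=float)
    rts = rts[rts >= min_rt]                  # exclude anticipations
    while True:
        m, sd = rts.mean(), rts.std(ddof=1)
        keep = np.abs(rts - m) <= n_sd * sd
        if keep.all():                        # means and SDs no longer change
            return rts
        rts = rts[keep]
```

Recomputing the mean and SD after each removal is what makes the procedure "sequential": a single extreme outlier can inflate the SD enough to shield smaller outliers, which only become detectable on later passes.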
An additional inspection of the data showed that two participants (Participants 1 and 6) had not completed the experiment versions in the correct order (i.e., alternating execution of the two experiment versions to counterbalance block orders). Participant 1 used the same experiment version (right hand for odd numbers and left hand for even numbers in Block 1, right hand for even numbers and left hand for odd numbers in Block 2) for Sessions 5 through 25; Participant 6 used the same experiment version as above for Sessions 13 through 15. However, we included these two participants' data because this study focuses on intraindividual analyses, so including their data did not influence outcomes for other participants. Nevertheless, we explored whether deviating from the order of experiment versions specified in the instructions could have influenced the results.

SNARC and MARC slopes
To analyze the SNARC and MARC effects per participant and session, unstandardized and standardized SNARC and MARC slopes were calculated as proposed by Cipora, Soltanlou, et al. (2019). Since unstandardized slopes are more sensitive to variations in RT characteristics, fluctuations were expected to be more pronounced for unstandardized than for standardized slopes. The standardized SNARC and MARC slopes should therefore better reflect the stable characteristics of an individual's spatial mapping of numbers (Lyons, Nuerk, & Ansari, 2015).
For calculating SNARC and MARC slopes, the individual regression method (Fias et al., 1996) was used as implemented in Cipora, Soltanlou, et al. (2019). dRTs were calculated for each participant, each session, and each number. These were regressed on number magnitude and contrast-coded parity (−0.5 and 0.5 for odd and even numbers, respectively). Both unstandardized and standardized regression slopes were calculated, with the latter being Fisher-z transformed to approximate a normal distribution. Importantly, if the numbers 1, 2, 3, 4, 6, 7, 8, and 9 are used, the parity and magnitude predictors are orthogonal, and the presence of one predictor affects neither the unstandardized nor the standardized slope of the other. Both SNARC and MARC effects are reflected by negative slopes: a negative SNARC slope reflects the increase of the right- over left-hand advantage in milliseconds per one-unit increase in number magnitude, while a negative MARC slope reflects the increase of the right- over left-hand advantage in milliseconds when the number's parity status is even instead of odd. For each participant, the proportions of sessions revealing negative SNARC and MARC slopes were calculated as a first indication of the intraindividual stability of the effects. Furthermore, "grand" unstandardized and standardized SNARC and MARC slopes were calculated for each participant by aggregating all data across all sessions.
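The individual regression step can be illustrated with a short Python sketch (our own hypothetical helper, not the authors' code; it assumes dRT is the right-hand minus left-hand mean RT per number, as is conventional for this method):

```python
import numpy as np

# Stimulus set; note 5 is omitted, which keeps magnitude and the
# parity contrast orthogonal (both predictor groups have mean 5).
NUMBERS = np.array([1, 2, 3, 4, 6, 7, 8, 9])

def snarc_marc_slopes(mean_rt_left, mean_rt_right):
    """Unstandardized SNARC and MARC slopes for one session.

    Takes mean RTs per number for left- and right-hand responses,
    forms dRT = right - left, and regresses dRT on number magnitude
    and contrast-coded parity (odd = -0.5, even = +0.5).
    """
    drt = np.asarray(mean_rt_right) - np.asarray(mean_rt_left)
    parity = np.where(NUMBERS % 2 == 0, 0.5, -0.5)
    X = np.column_stack([np.ones(len(NUMBERS)), NUMBERS, parity])
    beta, *_ = np.linalg.lstsq(X, drt, rcond=None)  # OLS fit
    # beta = [intercept, SNARC slope, MARC slope];
    # negative values indicate regular SNARC / MARC effects
    return beta[1], beta[2]
```

Because the two predictors are orthogonal for this stimulus set, each slope is identical whether the other predictor is included in the model or not.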

Intraindividual prevalence and stability of SNARC and MARC effects
Besides the calculation of the observed SNARC and MARC slopes, we were also interested in the reliability of the observed effects and the corresponding individual proportions of sessions revealing reliable SNARC and MARC effects. To test the reliabilities of all session and grand slopes, two bootstrapping methods were used to calculate individual 80%, 90%, 95%, and 99% CIs as proposed by Cipora, van Dijck, et al. (2019). Bootstrap CIs are calculated in a non-parametric approach, so that no distributional assumptions need to be made (Bland & Altman, 2015). Their major advantage is that their width is calculated individually for each participant and session, so that measurement precision (as defined by Luck, 2019) can be estimated without being influenced by other participants or sessions. Hence, bootstrap CIs are applicable no matter whether the sample is homogeneous or heterogeneous regarding the cognitive phenomenon under scrutiny. Note that this is not the case for correlation-based reliability approaches (e.g., test-retest reliability, split-half reliability, internal consistency), which largely depend on the sample's variance. The two approaches described in the following are therefore well suited for estimating the prevalence of reliable SNARC and MARC effects at the participant level or even at the session level within each participant.
In the H1 bootstrapping approach, 30 trials were randomly sampled with replacement from all valid trials within each participant × session × number × response side configuration when calculating CIs for session slopes, and 900 trials (i.e., 30 repetitions across 30 sessions) within each participant × number × response side configuration when calculating CIs for grand slopes. Based on these sets of selected trials, unstandardized and standardized SNARC and MARC slopes were calculated. This procedure was repeated 5000 times. Subsequently, the session and grand CIs were built from the ranges containing the middle 80%, 90%, 95%, and 99% of the bootstrapped slopes. Since CIs calculated with the H1 bootstrapping approach are used to test the extent of variation of an effect depending on which trials are considered for averaging, any observed (reversed) SNARC or MARC effect can be considered reliable when the corresponding CI does not include 0. Furthermore, the visualization of the observed slopes along with the H1 bootstrapped CIs allows evaluation of the similarities of SNARC and MARC effect estimates across sessions for each participant. We also examined the extent to which the session CIs overlapped with the corresponding grand CIs.
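The H1 procedure for one participant × session can be sketched as follows (an illustrative Python sketch with hypothetical names; `trials` maps each number × response side cell to its valid RTs, and `slope_fn` computes a slope from the resampled cell means):

```python
import numpy as np

def h1_bootstrap_ci(trials, slope_fn, n_boot=5000, n_per_cell=30,
                    level=0.90, seed=0):
    """Percentile bootstrap CI for one session's slope (H1 approach).

    Resamples n_per_cell trials with replacement within each
    number x response side cell, recomputes the slope from the cell
    means, and repeats n_boot times.
    """
    rng = np.random.default_rng(seed)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        cell_means = {key: rng.choice(rts, size=n_per_cell,
                                      replace=True).mean()
                      for key, rts in trials.items()}
        slopes[b] = slope_fn(cell_means)
    lo, hi = np.percentile(slopes, [(1 - level) / 2 * 100,
                                    (1 + level) / 2 * 100])
    return lo, hi   # an observed effect is 'reliable' if the CI excludes 0
```

The same machinery yields grand-slope CIs by pooling trials across sessions and raising `n_per_cell` accordingly (900 in the study).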
In the H0 bootstrapping approach, two samples of 30 trials were randomly drawn with replacement from valid trials within each participant × session × number configuration for session CIs, and two samples of 900 trials within each participant × number configuration for grand CIs. The two samples were then treated as "left side" and "right side" responses, respectively, to calculate unstandardized SNARC and MARC slopes (as discussed by Cipora, van Dijck, et al., 2019, the H0 bootstrapping approach is not suitable for standardized slopes).² Again, this procedure was repeated 5000 times. The ranges containing the middle 80%, 90%, 95%, and 99% of the bootstrapped slopes were subsequently used to build the corresponding CIs of slopes likely to be observed if there was no association between magnitude/parity and response side. Any observed slope outside these intervals can therefore be considered a reliable (reversed) SNARC or MARC effect.
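The H0 logic differs from H1 in that both "response sides" are drawn from the same pool, so the resulting slope distribution reflects what would be expected with no magnitude-to-side association. A minimal sketch (our own naming; `pooled` maps each number to all of that number's valid RTs regardless of response side, and the SNARC slope is shown for brevity):

```python
import numpy as np

def h0_bootstrap_ci(pooled, numbers, n_boot=5000, n_per_sample=30,
                    level=0.90, seed=0):
    """Null distribution CI for the unstandardized SNARC slope.

    For each number, draws two samples from the pooled trials and
    arbitrarily labels them 'left' and 'right'; the slope of the
    resulting dRTs on magnitude is a null-hypothesis slope.
    """
    rng = np.random.default_rng(seed)
    numbers = np.asarray(numbers)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        drt = np.empty(len(numbers))
        for i, n in enumerate(numbers):
            left = rng.choice(pooled[n], size=n_per_sample, replace=True)
            right = rng.choice(pooled[n], size=n_per_sample, replace=True)
            drt[i] = right.mean() - left.mean()
        slopes[b] = np.polyfit(numbers, drt, 1)[0]
    lo, hi = np.percentile(slopes, [(1 - level) / 2 * 100,
                                    (1 + level) / 2 * 100])
    return lo, hi   # observed slopes outside (lo, hi) count as reliable
```

Note that the null interval is centered near zero by construction; an observed slope is declared reliable by falling outside it, rather than by its own CI excluding zero as in the H1 approach.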
In the analysis, we focus on 90% CIs, as in Cipora, van Dijck, et al. (2019). Results for other confidence levels are presented in Appendix A.

Time-series analysis of SNARC and MARC effects
To test for autocorrelations within the observed SNARC and MARC slopes, time-series analyses were conducted for each participant. To this end, correlations between time series with lags of one, two, three, four, and five days were calculated. A lag-1 correlation represents the correlation between a slope on day X and its equivalent on day X + 1; correspondingly, lag-2, lag-3, lag-4, and lag-5 correlations represent the correlations between a slope on day X and its equivalents on days X + 2, X + 3, X + 4, and X + 5, respectively. It was of interest whether there would be a monotone decrease in the autocorrelations with increasing lag (lag-1 larger than lag-2, lag-2 larger than lag-3, etc.). For each participant, the permutation-based variant of the C statistic (Tryon, 1982) was used to test for deviations from randomness. Furthermore, each of the five autocorrelations per participant was tested against a distribution of correlations obtained by randomly reshuffling the observations (without replacement) and repeating this procedure 5000 times. Any correlation outside the middle 95% of this distribution was considered a significant autocorrelation.
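The lagged-autocorrelation permutation test can be sketched as follows (an illustrative Python sketch under our own naming, covering the reshuffling test but not Tryon's C statistic):

```python
import numpy as np

def lag_autocorrelation(series, lag):
    """Correlation between the series and itself shifted by `lag` days."""
    series = np.asarray(series, dtype=float)
    return np.corrcoef(series[:-lag], series[lag:])[0, 1]

def permutation_test(series, lag, n_perm=5000, level=0.95, seed=0):
    """Test one lagged autocorrelation against reshuffled series.

    Randomly permutes the observations (without replacement), recomputes
    the lag autocorrelation each time, and builds the middle `level`
    range of this null distribution.
    """
    rng = np.random.default_rng(seed)
    obs = lag_autocorrelation(series, lag)
    perm = np.array([lag_autocorrelation(rng.permutation(series), lag)
                     for _ in range(n_perm)])
    lo, hi = np.percentile(perm, [(1 - level) / 2 * 100,
                                  (1 + level) / 2 * 100])
    return obs, (lo, hi)   # obs outside (lo, hi) counts as significant
```

Permuting the observations destroys any temporal ordering while preserving the marginal distribution of slopes, so the null interval reflects chance autocorrelation for that participant's own data.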

Influences of predictor variables on SNARC and MARC effects
To obtain some insight into possible predictors of intraindividual SNARC and MARC effect fluctuations, we tested whether the magnitude of the SNARC and MARC slopes in a single session was influenced by the participants' sleep duration during the previous night, their current level of tiredness, their consumption of stimulants within one hour prior to the experiment, or the time of day. To this end, repeated-measures correlations of sleep duration, tiredness, and consumption of stimulants with the SNARC and MARC slopes were calculated as proposed by Bakdash and Marusich (2017) using their R package rmcorr. Repeated-measures correlation models are linear mixed models (LMMs) with a fixed linear predictor, which account for the within-subjects design with random participant intercepts. To test whether time of day influenced the SNARC and MARC effects, LMMs with time of day as a fixed categorical predictor and with random participant intercepts were computed using the R package lmerTest (Kuznetsova et al., 2017). Repeated-measures correlations could not be used here because we did not expect time of day to have an ordinal, let alone linear, effect on SNARC or MARC slopes.
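The core idea of the repeated-measures correlation can be illustrated with a short Python sketch (our own simplified implementation, not the rmcorr package: centring both variables within each participant removes between-participant differences, so the pooled correlation reflects the common within-participant association; the package additionally provides df-adjusted inference):

```python
import numpy as np

def rm_corr(participant, x, y):
    """Common within-participant correlation between x and y.

    Centres x and y within each participant (i.e., removes each
    participant's own mean, mimicking random participant intercepts)
    and correlates the pooled centred values.
    """
    participant = np.asarray(participant)
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)
    xc, yc = x.copy(), y.copy()
    for p in np.unique(participant):
        mask = participant == p
        xc[mask] -= x[mask].mean()   # remove participant-level intercept
        yc[mask] -= y[mask].mean()
    return np.corrcoef(xc, yc)[0, 1]
```

The within-centring is what distinguishes this from an ordinary pooled correlation: two variables can correlate positively within every participant yet negatively across participant means, and only the former is captured here.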
Additional repeated-measures correlations and LMMs were calculated to validate the predictor variables. As a plausibility check, the correlation between the arousal measures sleep duration and tiredness was calculated, which we expected to be negative. Moreover, we investigated whether sleep duration, tiredness, consumption of stimulants, and time of day influenced the mean RTs and the standard deviations (SDs) of RTs revealed by the participants on the corresponding days. A negative correlation between mean RTs and sleep duration, and between mean RTs and consumption of stimulants, as well as a positive correlation between mean RTs and tiredness, would corroborate our measures. Additional analyses considering the Odd Effect, RT stability, session sampling, and block order, along with their justifications, are reported in Appendices C to F. Finally, apart from investigating the systematicity of fluctuations in the SNARC and MARC effects via autocorrelations and Tryon's C statistic (Tryon, 1982), we also ran repeated-measures correlations between the SNARC and MARC slopes and the session number to investigate whether the two effects increase or decrease over time.

Table 1
Unstandardized (upper part) and standardized (lower part) SNARC effect. Within each part, one row corresponds to one participant. The left part shows descriptive results: proportions of sessions with negative and positive slopes; mean slope (averaged slopes from each session), standard deviation, and grand slope (calculated by aggregating all trials across all sessions). The two following columns correspond to the CI bounds for grand slopes resulting from both bootstrapping approaches. The following part summarizes the proportions of sessions with reliable negative (regular), no reliable, and reliable positive (reversed) effects according to both bootstrapping approaches.

² This is a deviation from the preregistration: while these data had been analyzed, Cipora et al. discovered a flaw in the reasoning behind the H0 bootstrapping in the case of standardized slopes. We therefore do not report this problematic analysis in this work.

Overview
After reviewing the data sets of all 300 sessions, data from one incomplete session (i.e., fewer than 480 trials) were excluded. The overall proportion of trials excluded due to incorrect responses was 7.0%. Fewer than 0.001% of the trials were excluded as anticipations. After applying the preregistered sequential trimming method, another 3.3% of the trials were excluded. Two sessions were excluded due to excessive error rates (i.e., <66% valid trials). The check for a lack of valid RTs in any number × response side configuration did not lead to further session exclusions. Excluding these three sessions, 89.7% of all trials were retained for analysis.

Intraindividual prevalence and stability of SNARC effects
At an aggregated level (grand SNARC slopes), eight out of 10 participants descriptively revealed unstandardized and standardized SNARC effects (Participants 1, 2, 3, 5, 6, 8, 9, and 10). Among all participants, the magnitudes of the grand slopes range from −4.83 to 0.73 for unstandardized SNARC effects and from −1.43 to 0.14 for standardized SNARC effects. When looking at the individual means of session slopes instead, the results remain similar, with the exception that one additional participant revealed a negative mean unstandardized SNARC slope (Participant 4). Without taking CIs into account, the intraindividual proportions of sessions showing SNARC effects range from 33.33% to 93.33%. Table 1 gives an overview of the results of the different analyses.

H1 bootstrapping approach
A more detailed look at the 90% CIs indicates that five out of 10 participants (Participants 1, 2, 3, 5, and 9) showed reliable unstandardized and standardized SNARC effects at the aggregated level (grand slopes). Participant 7 revealed reliable reversed effects for both measures. The intraindividual proportions of sessions revealing reliable effects ranged from 3.33% to 63.33% for both unstandardized and standardized SNARC effects. Again, the number of participants showing reliable SNARC effects in more than half of the sessions was low (one participant for both unstandardized and standardized SNARC effects). Figs. 1 and 2 visualize the individual observed distributions of grand and session slopes with the corresponding 90% H1 CIs for unstandardized and standardized SNARC effects, respectively. Note that the H1 CIs for standardized effects, in contrast to those for unstandardized effects, are not symmetrical around the observed effect estimates: H1 CIs for negative estimates include fewer values smaller than the estimate than values larger than it, and similarly, H1 CIs for positive estimates include fewer values larger than the estimate than values smaller than it. The closer an estimate is to zero, the more symmetrical its H1 CI is. The reason for this asymmetry is the restricted range of possible values for standardized effects, namely a minimum of −1 and a maximum of +1 (even though the slope is later Fisher-z transformed). Hence, the more extreme the estimate, the less likely it is to obtain an even more extreme one by bootstrapping.
A closer look at Figs. 1 and 2 indicates that there is considerable intraindividual variation in the effect estimates across the different sessions and that the CIs of the different sessions do not overlap. Illustratively, the intraindividual proportions of session CIs that overlap with the grand CI were calculated. These proportions range from 76.67% to 96.67% for unstandardized SNARC effects and from 46.67% to 100% for standardized SNARC effects, as shown in Table 2. These intraindividual proportions are even lower for the overlap between session CIs and the grand slope estimate instead of the grand CI.
Looking at the H1 bootstrapping CIs for the grand slopes, one can notice that in the case of unstandardized slopes, they are much narrower than those for single sessions (Fig. 1 for SNARC and Fig. 4 for MARC). This is because they consider many more observations, which leads to a more precise grand slope estimate compared to each session estimate. However, in the case of standardized slopes, some CIs for the grand mean are as large as those for single sessions (Fig. 2 for SNARC and Fig. 5 for MARC). This is because the width of H1 bootstrap CIs for standardized slopes largely depends on the fit of the linear regression model rather than on the total number of trials; the model fit can vary more substantially depending on which points are selected during bootstrapping.
For unstandardized slopes, H1 bootstrap CIs for the grand slope are located roughly at the mean of all session slopes (Fig. 1 for SNARC and Fig. 4 for MARC). Hence, most of them include zero or are relatively close to zero. However, in the case of standardized slopes, some H1 bootstrap CIs for the grand mean are shifted towards extreme values (Fig. 2 for SNARC and Fig. 5 for MARC) and are not necessarily located at the mean of all session slopes. We observe such a pattern when the fit of the linear regression model to the grand dRTs is very high (see our plot with grand dRTs per participant per number in the data analysis folder at https://osf.io/p6u5h). For instance, the unstandardized grand SNARC slope is only slightly negative for Participant 9, but as the grand dRTs are well aligned to it, the CI for the standardized grand SNARC slope for Participant 9 is more negative than the CIs for the standardized session SNARC slopes.

Table 3
Unstandardized (upper part) and standardized (lower part) MARC effect. Within each part, one row corresponds to one participant. The left part shows descriptive results: proportions of sessions with negative and positive slopes; mean slope (averaged slopes from each session), standard deviation, and grand slope (calculated by aggregating all trials across all sessions). The two following columns correspond to the CI bounds for grand slopes resulting from both bootstrapping approaches. The following part summarizes the proportions of sessions with reliable negative (regular), no reliable, and reliable positive (reversed) effects according to both bootstrapping approaches.

H0 bootstrapping approach
At a 90% confidence level, four out of 10 participants (Participants 1, 3, 5, and 9) showed reliable unstandardized SNARC effects at the aggregated level (grand slopes), while Participant 7 showed a reliable reversed unstandardized grand SNARC effect. The intraindividual proportions of sessions revealing reliable effects ranged from 3.33% to 60.00%. Only Participant 1 revealed a reliable unstandardized SNARC effect in more than half of the sessions. As expected, the prevalence of reliable SNARC effects is therefore much lower when taking 90% CIs into account than the intraindividual proportions of descriptively negative slopes reported above. The intraindividual proportions of sessions revealing reliable reversed effects ranged from 0.00% to 10.00%. The individual distributions of observed grand and session unstandardized SNARC slopes including 90% H0 CIs are visualized in Fig. 3.

Intraindividual prevalence and stability of MARC effects
At an aggregated level (grand MARC slopes), seven out of 10 participants descriptively revealed negative MARC effect slopes. Among all 10 participants, the magnitudes of the grand slopes range from −38.74 to 2.55 for unstandardized MARC effects and from −1.70 to 0.23 for standardized MARC effects. The numbers of participants revealing effects remain the same when looking at the individual means of session slopes instead. Without taking CIs into account, the intraindividual proportions of sessions showing MARC effects range from 40.00% to 90.00% (see Table 3).

H1 bootstrapping approach
At the 90% confidence level, six participants (Participants 1, 3, 4, 6, 7, and 10) showed reliable unstandardized and standardized grand MARC effects, while no participant showed reliable reversed unstandardized or standardized grand MARC effects. For both unstandardized and standardized MARC effects, the intraindividual proportions of sessions revealing reliable effects ranged from 17.86% to 80.00%, and two out of 10 participants (Participants 3 and 7) showed reliable MARC effects in more than half of the sessions. Figs. 4 and 5 visualize the individual observed distributions of grand and session slopes with the corresponding 90% H1 CIs for unstandardized and standardized MARC effects, respectively.
Again, the intraindividual proportions of session CIs that overlap with the grand CI were calculated to illustrate intraindividual variation in the effect estimates.These proportions range from 43.33% to 80.00% for unstandardized MARC effects and from 30.00% to 89.29% for standardized MARC effects (Table 2).

H0 bootstrapping approach
When looking at the 90% confidence level, six out of 10 participants (Participants 1, 3, 4, 6, 7, and 10) showed reliable unstandardized MARC effects at the aggregated level (grand slopes), while no participant showed reliable reversed unstandardized or standardized grand MARC effects. The intraindividual proportions of sessions revealing reliable effects ranged from 16.67% to 80.00%. Two participants (Participants 3 and 7) revealed a reliable unstandardized MARC effect in more than half of the sessions. The intraindividual proportions of sessions revealing reliable reversed effects ranged from 6.67% to 30.00%. The individual distributions of observed grand and session unstandardized MARC slopes including 90% H0 CIs are visualized in Fig. 6.

Time-series analysis of SNARC and MARC effects
The autocorrelations between the observed SNARC and MARC slopes, calculated for time series with lags of one, two, three, four, and five days, as well as the corresponding C statistic values, are summarized in Table 4. Descriptively, none of the participants revealed a monotone decrease in the autocorrelations with increasing lag for any of the observed SNARC and MARC slopes. However, when looking at Tryon's C statistic, only Participant 1 revealed significant positive autocorrelations for unstandardized and standardized SNARC and MARC slopes, and only Participant 10 revealed significant negative autocorrelations for unstandardized and standardized MARC slopes.
A further investigation of each of the five time-lag autocorrelations by means of H0 bootstrapped 95% CIs reveals no significant autocorrelations between observed unstandardized SNARC slopes for any of the participants. For standardized SNARC slopes, only Participant 9 showed a significant negative lag-5 autocorrelation. Only Participant 10 revealed significant negative lag-1 autocorrelations for unstandardized and standardized MARC slopes. Additionally, Participant 2 showed a significant positive lag-2 autocorrelation for unstandardized MARC slopes and Participant 1 a significant positive lag-3 autocorrelation for standardized MARC slopes. All autocorrelations and bootstrapped CIs are visualized in Figs. B1 (unstandardized SNARC slopes), B2 (standardized SNARC slopes), B3 (unstandardized MARC slopes), and B4 (standardized MARC slopes) in Appendix B.

Influences of predictor variables on SNARC and MARC effects
Repeated-measures correlations were calculated to test whether fluctuations in SNARC and MARC slopes were predicted by the participants' sleep duration in the previous night, their current level of tiredness, and their consumption of stimulants within one hour prior to the experiment. None of the calculated correlations was significant (see Table 5). Moreover, we computed LMMs with a categorical predictor to check whether time of day influenced the SNARC and MARC effects, but these results were also non-significant (see Table 6). Note that only random participant intercepts were included within these repeated-measures correlations and LMMs for sleep, tiredness, stimulants, and daytime. LMMs for sleep, tiredness, and daytime that additionally include random participant slopes can be found in Appendix G. Due to the lack of sessions with prior consumption of stimulants (no session for Participants 5 and 9, only one session for Participant 7), no LMMs with random participant slopes were calculated for stimulants.
As preregistered, we conducted an additional analysis to cross-validate the measures. First, we calculated the repeated-measures correlation between sleep duration and current level of tiredness. This analysis revealed a modest correlation significantly different from zero. As further validation measures, we calculated repeated-measures correlations and LMMs to investigate the relationship between all predictor variables (sleep duration, tiredness, consumption of stimulants, time of day) and the mean RTs, as well as the SDs of RTs, that the participants revealed on the corresponding days. Most results remained non-significant (see Table 5 and Table 6). Only time of day was a significant predictor of mean RTs, such that responses were approximately 13 ms faster when participants carried out the experiment in the morning than when they participated in the afternoon or evening (see Table 6).

Non-preregistered analyses
In addition to the preregistered analyses, we conducted non-preregistered analyses that can be found in the Appendix. Namely, we ran H0 bootstrapping for the intraindividual prevalence of the Odd Effect to be able to compare it with the individual prevalence of the SNARC and MARC effects (Appendix C); ran H1 bootstrapping for mean RTs, repeated-measures correlations of mean RTs with covariates, and time-series analyses for mean RTs (Appendix D); sampled group-level t-tests for the SNARC, MARC, and Odd effects (Appendix E); and used LMMs to test block order influences on the SNARC and MARC effects (Appendix F).
Further, we calculated repeated-measures correlations of the SNARC and MARC slopes, mean RTs, and SDs of RTs with the session number. Results indicated a significant decrease of the SNARC effect over time (unstandardized SNARC: r = 0.319, standardized SNARC: r = 0.264), but no significant change of the MARC effect over time was found (see Table 5). Session number correlated negatively with mean RTs, indicating that responses got faster over time, and with SDs of RTs, reflecting decreasing variability over time (see Table 5).

Discussion
In experimental psychology, researchers often draw conclusions from the robustness of effects at the group level to their robustness at the participant level (termed the group-to-person generalizability problem by McManus et al., 2023); they also assume that participant-level effects are stable over time and reflect the participant's typical behavior, and thus a trait rather than a state characteristic. The goal of the current study was to provide a framework for evaluating these two types of generalizations. Combining the Ironman paradigm with bootstrapping techniques, we demonstrated that intraindividual stability of cognitive phenomena cannot be taken for granted.
As an exemplary phenomenon to illustrate this point, we chose the SNARC effect, which is one of the most thoroughly investigated phenomena in numerical cognition and highly replicable at the group level. It is typically investigated in a parity judgment task, where other phenomena like the MARC effect and the Odd Effect can be observed as well. Hence, our aim was to analyze the intraindividual stability of the SNARC and other effects when looking at data from 10 participants with 30 sessions each within a 40-day period in the Ironman paradigm. Besides the calculation of the regular unstandardized and standardized SNARC and MARC slopes, as well as Odd Effect estimates at the session level and at the aggregated participant level, two bootstrapping approaches were used to obtain CIs for the grand slope to determine the reliability of the effects at the participant level. Moreover, we investigated the intraindividual stability of RTs themselves in a similar way. The data collected in the Ironman paradigm provide an interesting new perspective on the mentioned numerical-cognition phenomena.
The SNARC effect's robustness at the group level has been shown in various experimental tasks and for various populations (see Fischer & Shaki, 2014, for a review; Wood et al., 2008, for a meta-analysis). Interestingly, researchers have investigated its high interindividual variability found across studies (e.g., Wood, Nuerk, & Willmes, 2006b) but have not yet looked at its intraindividual stability over time. Instead, only the SNARC effect's test-retest reliability in terms of between-sessions correlations has been examined and found to be surprisingly low (Cipora & Göbel, 2013; Georges et al., 2013; Viarouge et al., 2014).
Further, Hedge et al. (2018) found that the SNARC effect's intraindividual variability is relatively high compared to its interindividual variability. Importantly, it remained unclear whether the low test-retest reliability and the relatively high intraindividual variability of the SNARC effect are due to systematic or unsystematic changes (i.e., a lack of absolute stability) in the SNARC effect over time, or possibly even due to measurement imprecision. This suggests that the SNARC effect is an indominant phenomenon (Rouder & Haaf, 2018), which is further supported by its susceptibility to situated influences (Cipora et al., 2018). In the current study, at an aggregated level over sessions, eight participants descriptively showed a grand SNARC effect, and seven participants descriptively showed a grand MARC effect. Even though these numbers must be interpreted with caution due to the very small sample size, they are in line with the proportions found in larger SNARC effect studies, which consistently report negative SNARC slopes for approximately 70% to 80% of participants (for an overview, see Cipora, van Dijck, et al., 2019). Interestingly, grand slopes were very small. As expected, when taking CIs from the H0 and H1 bootstrapping approaches into account, fewer participants show reliable effects, namely between four (H0 bootstrapping) and five (H1 bootstrapping) for the SNARC effect and six (H0 and H1 bootstrapping) for the MARC effect. A reversed SNARC effect was revealed by one participant and a reversed MARC effect by none, while the rest of the sample did not reveal effects in either direction. Participant 7, who revealed a reversed SNARC effect at the aggregated level, did not remarkably differ in any demographics from the other participants: She was the third-youngest participant, female, right-handed, and bilingual (German and Spanish), but not proficient in any language with a right-to-left reading and writing direction. Again, the observed proportions for the SNARC measures are in line with the results by Cipora, van Dijck, et al. (2019), indicating a reliable SNARC effect in about 35% to 45% of participants.

Table 4
Autocorrelations in unstandardized and standardized SNARC and MARC effects with time lags of up to five days (lag-1, lag-2, lag-3, lag-4, lag-5) and Tryon's C statistic.
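The lag-k autocorrelations summarized in Table 4 can be sketched as follows. The session series is purely hypothetical, and the simple product-moment estimator below is one common convention, which may differ in detail from the implementation used in the analyses:

```python
import numpy as np

def lag_autocorr(x, k):
    """Lag-k autocorrelation of a session-by-session effect series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                       # center the series
    return float((x[:-k] @ x[k:]) / (x @ x))

rng = np.random.default_rng(0)
# Hypothetical series of 30 daily SNARC slopes (ms per digit) for one participant.
slopes = rng.normal(-7, 5, 30)
for k in range(1, 6):
    print(f"lag-{k}: {lag_autocorr(slopes, k):+.2f}")
```

Values near zero across all lags, as reported for most participants, indicate that a session's effect size carries little information about the next sessions' effect sizes.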
While at an aggregated level a reliable grand SNARC effect was shown by four to five participants, a reliable grand MARC effect by six participants, and a reliable grand Odd Effect by five out of 10 participants, the picture changes when looking at the intraindividual proportions of sessions with reliable effects. Only one participant showed reliable unstandardized and standardized SNARC effects in more than half of the sessions in both bootstrapping approaches. This means that even in most of the four to five participants who revealed reliable grand SNARC effects at the aggregated level, the effect could not be found on most days, indicating a very low intraindividual absolute stability of the SNARC effect. Crucially, we know that this is not an artifact originating solely from using a chronometric task, but instead a property specific to the SNARC effect: Hohol et al. (2020) have shown that, within the same datasets, different proportions of participants reveal the SNARC effect, the Numerical Distance Effect, and the Numerical Size Effect. Interestingly, the SNARC effect seems to be considered a trait characteristic of individuals in the literature, but our results suggest that it is instead, or concurrently, a state characteristic. One might wonder how the only participant showing the SNARC effect in more than half of the sessions differs from the others, who exhibited the SNARC effect in only a few sessions or not at all. To start with, she ran the same experiment version (i.e., the same block order) for Sessions 5 through 25; however, we could not find any block-order influences on the SNARC effect (Appendix F). Moreover, this participant was the oldest, and the SNARC effect has been shown to increase with age (Ninaus et al., 2017). Also, she was the only participant who was entirely naïve to cognitive psychology experiments. However, to the best of our knowledge, there is no evidence for the SNARC effect being affected by familiarity with cognitive psychology research, because it is an automatic effect and not deliberately produced by participants. Importantly, the participant had the second-largest mean RTs (cf. Fig. D1, Appendix D), and the SNARC effect is usually more pronounced in slower responses (e.g., Didino, Breil, & Knops, 2019). For the MARC and Odd Effects, the number of participants showing reliable effects in most of the sessions was slightly higher: Two participants showed reliable unstandardized and standardized MARC effects, and two participants showed a reliable Odd Effect, in the majority of the sessions in the bootstrapping approaches. However, these numbers are still considerably lower than expected, considering that six participants showed reliable grand unstandardized and standardized MARC effects and five participants showed a reliable grand Odd Effect at the aggregated level.
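For reference, session-level SNARC slopes of the kind analyzed here are conventionally obtained by regressing right-minus-left RT differences per digit on number magnitude. The following sketch uses hypothetical per-digit mean RTs, and the z-based standardization shown is an assumption rather than the documented procedure of this study:

```python
import numpy as np

rng = np.random.default_rng(2)
digits = np.array([1, 2, 3, 4, 6, 7, 8, 9])   # typical parity-judgment stimuli

# Hypothetical mean RTs (ms) per digit and response side for one session.
rt_left = 520 + 1.5 * digits + rng.normal(0, 5, 8)
rt_right = 535 - 2.0 * digits + rng.normal(0, 5, 8)

drt = rt_right - rt_left                      # right-minus-left difference per digit
slope = np.polyfit(digits, drt, 1)[0]         # unstandardized SNARC slope (ms per digit)

# One possible standardization (an assumption): regress z-scored dRTs on magnitude.
z = (drt - drt.mean()) / drt.std(ddof=1)
slope_std = np.polyfit(digits, z, 1)[0]
print(f"unstandardized: {slope:.2f} ms/digit, standardized: {slope_std:.3f}")
```

A negative slope reflects the regular SNARC pattern: relatively faster right-sided responses as magnitude increases.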
The intraindividual instability of the effects over time is even more dramatic when taking into account that a considerable proportion of some participants' session CIs do not even overlap with their grand CI. Interestingly, the time-series analyses with autocorrelations and Tryon's C statistics suggest that the fluctuations in the SNARC and MARC effects between consecutive sessions are not systematic in most participants. Nevertheless, the SNARC effect, but not the MARC effect, decreased over sessions on average. Moreover, we did not find any relationship between four aspects of the participants' current state (sleep duration, tiredness, consumption of stimulants, and time of the day) and our measures of the SNARC or MARC effects. Crucially, despite the strong variability and high instability over time in the SNARC and MARC effects, and although the sample was so small, the group-level t-tests in our non-preregistered bootstrapping analysis (see Appendix E), with one randomly selected observed slope from each of our 10 participants, revealed group-level SNARC and MARC effects in a quarter of all cases. Applying the same bootstrapping procedure to the Odd Effect shows that it would have been found in half of all cases. These findings emphasize that effects found at the group level in experimental cognitive psychology by testing random samples of participants only once do not necessarily reflect typical human behavior. One may ask why this is the case. At least three scenarios can potentially underlie the observed variability in the SNARC effect. First, the variability could be a purely statistical artifact that is not related to underlying theoretical models. More precisely, Cipora, van Dijck, et al. (2019) have shown that the group-level effect can be driven by a minority revealing the effect in the expected direction (in the case of the SNARC effect, by around 40% of the individuals). At the same time, very few individuals reveal a reliable reversed effect. However, the fact that the SNARC effect goes from left to right rather than from right to left (even if still only in a minority of the sample) makes this purely statistical explanation problematic. Second, the SNARC effect might be present in everyone, but not precisely captured by the parity judgment task, so that both interindividual and intraindividual differences are just a by-product of the task. However, the explanation that the parity judgment task is imprecise or unsuitable for measuring the SNARC effect seems implausible, because its reliability is reasonable, especially in our Ironman paradigm with 900 data points per experimental cell for each participant. Moreover, this explanation seems unlikely in light of results from Hohol et al. (2020) showing that variation in the SNARC effect is larger than in the Numerical Distance Effect and smaller than in the Numerical Size Effect (even though both other effects have no spatial component when it comes to lateralized responses). Their results show that cognitive effects differ in terms of their individual prevalence, even when calculated from the same data, the same participants, and the same task.
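The non-preregistered group-level bootstrap described above (one randomly selected session slope per participant, followed by a one-sample t-test) can be sketched with simulated data; all numbers below are hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical session-level SNARC slopes: 10 participants x 30 sessions,
# mean -7 ms/digit with substantial day-to-day variability.
slopes = rng.normal(-7, 12, (10, 30))

# Mimic single-session group studies: draw one random session per participant
# and run a one-sample t-test against zero; repeat many times.
n_iter = 2000
hits = 0
for _ in range(n_iter):
    sample = slopes[np.arange(10), rng.integers(0, 30, 10)]
    if stats.ttest_1samp(sample, 0).pvalue < .05:
        hits += 1
print(f"significant group-level SNARC in {hits / n_iter:.0%} of simulated studies")
```

This illustrates how a group-level effect can surface in a sizable share of simulated one-shot studies even when the individual-level effect fluctuates strongly from day to day.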
Third, there might be truly underlying trait differences in spatial-numerical associations. Some humans might associate numbers with space in a stable way, while others do not, which is in our view the most likely underlying scenario. This interindividual variability can occur on two levels: Individuals can differ in the spatial mental representations of numbers (consider the extreme differences between number-form synesthetes for an illustration) and in their intraindividual variability (i.e., susceptibility to situational context). This would mean that each individual has a personal underlying SNARC distribution, with different means and standard deviations reflecting the likelihood of different effect sizes and directions. Some individuals would then show a regular and typically sized SNARC effect of around −7 ms in most of the sessions (e.g., SD = 1 ms). In contrast, other individuals' SNARC effect might depend to a much larger degree on the situational context (e.g., on tasks preceding the parity judgment, on exact instructions, or on aspects of the current state), causing stronger variations between sessions (e.g., a mean slope of −7 ms, SD = 5 ms) and therefore sometimes inconsistent correlations.
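A small simulation illustrates this scenario; the per-session measurement noise and both day-to-day SDs are hypothetical choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(4)
n_sessions, noise_sd = 30, 3      # per-session measurement noise (an assumption)

# Two hypothetical individuals with the same mean SNARC slope (-7 ms/digit)
# but different intraindividual (day-to-day) variability.
props = {}
for label, day_sd in [("stable", 1), ("situated", 5)]:
    true_daily = rng.normal(-7, day_sd, n_sessions)          # the day's "true" slope
    observed = true_daily + rng.normal(0, noise_sd, n_sessions)
    props[label] = float((observed < 0).mean())
    print(f"{label} (day-to-day SD = {day_sd}): "
          f"{props[label]:.0%} of sessions with a negative observed slope")
```

Even with identical mean slopes, the more situated individual produces noticeably more sessions without a regular-direction effect, which would also dilute correlations with other variables.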
Several situational influences might affect the SNARC effect at the same time and even interact with each other. This would be in line with recent findings by Judd, Aristodemou, and Kievit (2023), who ruled out monocausal explanations for the effect sizes in eleven cognitive-control tasks when investigating variability in children tested in 25 sessions. They claim that variables such as attention, fatigue, and dopamine levels, considered individually, can at most provide incomplete explanations for the high variability of cognitive effects and performances. Judd et al.'s (2023) results, showing that children with strong variability in some tasks showed only weak variability in other tasks, speak in favor of this third potentially underlying scenario. A single dimension of variability across individuals and across tasks is therefore inadequate. What is more, a recent study by Belli and Fischer (2023) showing that breathing can shift spatial attention demonstrates that variability is present even on a very short time scale: Specifically, processing of stimuli presented on the left/right was facilitated during exhalation/inhalation, respectively.
In contrast to the SNARC and MARC effects, which are based on differences in RTs between response sides, our additional non-preregistered analyses showed that RTs vary in a more systematic way (see Appendix D). We found positive autocorrelations and a trend over time towards faster responses in most participants. This is consistent with previous findings from other cognitive tasks (e.g., Cevada et al., 2019; Eide et al., 2002; Paap & Sawi, 2016; Strauss et al., 2005), where high test-retest reliabilities were found for RTs, although only low or moderate test-retest reliabilities were found for the actual effects of interest, such as the Simon, flanker, or Stroop effects. Along with responses getting faster over time on average, their variance also decreased over time (as reflected by negative correlations of the session number with mean RTs and SDs of RTs), reflecting that absolute stability (as defined in the Introduction) is low. Moreover, we replicated the tendency of smaller SNARC effects in faster responses (e.g., Gevers et al., 2006).

Also, in our attempt to replicate earlier observations regarding the influences of block order on the MARC effect in parity judgment (see Appendix F), we unexpectedly discovered the exact opposite pattern. While we found no influence of block order on the SNARC effect, interestingly, the MARC effect was stronger when the first block was MARC-compatible and the second block MARC-incompatible in the present study, whereas a previous analysis of datasets including over 1000 participants revealed a stronger effect in the reversed block order (Cipora et al., 2020). An explanation for the result of the previous study could be that participants who start with the MARC-incompatible block must both familiarize themselves with the task and overcome their natural response tendency in the first block; in the second block, they are already familiar with the task and do not have to counteract their spatial association of parity. Therefore, these participants have two reasons to be relatively slow in their first block and two reasons to be relatively fast in their second block, resulting in a rather strong MARC effect. Participants who start with the MARC-compatible block, however, must only familiarize themselves with the task but do not have to overcome their natural response tendency in the first block; in the second block, they must only counteract their spatial association of parity, as they are already familiar with the task, which leads to a rather weak MARC effect. Importantly, while in that study block order differed between participants, it differed within participants in the present study, where we counterbalanced it between sessions. Our participants did not need to familiarize themselves with the task anew in each session's first block, such that the only difference between blocks was their MARC compatibility. Hence, the influence of block order on the MARC effect seems to be different when familiarity with the task is ruled out. Taken together, it seems that the 10 participants in the current study revealed response patterns that are rather typical in behavioral studies. By applying the Ironman paradigm with 30 sessions per participant, we collected enough data to capture these basic response patterns. Hence, although future studies need to confirm this with larger samples, the high variability in SNARC and MARC effects that we observed might not be due to our small specific sample; it might rather be a characteristic of the SNARC effect, the MARC effect, and other cognitive phenomena.

Table 6
The influence of time of the day on unstandardized and standardized SNARC and MARC effects, on mean RTs, and on the SDs of RTs was investigated by comparing two LMMs in a likelihood-ratio test (LRT): LMM1, including only random intercepts per participant, vs. LMM2, including both random intercepts per participant and a fixed term for time of the day, which was categorically coded with morning (M), afternoon (A), and evening (E). We report the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) for each model as well as the χ2 statistic (two degrees of freedom) and the p-value resulting from the LRT. Moreover, the table displays estimates for the model of choice, namely variances of participant intercepts and of residuals as random effects, and slope estimates (with standard errors in brackets) for all three times of the day (LMM1), or separately in case of a significant LRT (LMM2), as fixed effects (dummy-coded with M as reference).
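The likelihood-ratio comparison described for Table 6 can be sketched as follows. As a simplification, fixed per-participant intercepts stand in for the random intercepts of the reported LMMs, so this illustrates the LRT mechanics rather than reimplementing the exact models; all data are simulated with a null time-of-day effect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated session-level SNARC slopes: 10 participants x 30 sessions,
# with participant-specific baselines and no true time-of-day effect.
n_part, n_sess = 10, 30
pid = np.repeat(np.arange(n_part), n_sess)
tod = rng.integers(0, 3, n_part * n_sess)      # 0 = morning, 1 = afternoon, 2 = evening
y = rng.normal(-7, 5, n_part * n_sess) + rng.normal(0, 2, n_part)[pid]

def ml_loglik(y, X):
    """Gaussian ML log-likelihood of a least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / len(y)            # ML variance estimate
    return -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)

# Participant intercepts (fixed here, standing in for random intercepts).
P = np.eye(n_part)[pid]
# Dummy codes for afternoon and evening (morning as reference).
T = np.column_stack([(tod == 1).astype(float), (tod == 2).astype(float)])

ll1 = ml_loglik(y, P)                          # LMM1 analogue: intercepts only
ll2 = ml_loglik(y, np.hstack([P, T]))          # LMM2 analogue: + time of day
lrt = 2 * (ll2 - ll1)                          # chi-squared with 2 df under H0
p = stats.chi2.sf(lrt, df=2)
print(f"LRT = {lrt:.2f}, p = {p:.3f}")
```

A non-significant p-value, as in the reported analyses, favors the simpler model without a time-of-day term.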

Implications for different traditions in psychology
Taken together, these results may be a first indication that the SNARC and MARC effects are both characterized by the seemingly paradoxical attributes of high robustness in the population and concurrent low intraindividual robustness. The implications we discuss for these effects can be applied to other cognitive phenomena, if these turn out to reveal high variability as well.
Importantly, the finding that the SNARC effect does not exist in every individual and is not stable within individuals does not seem to be grounded in measurement imprecision: The SNARC effect's reliability in the parity judgment task is typically within a reasonable range and not correlated with the proportion of participants showing a reliable SNARC effect, as shown in an extensive reanalysis of 18 datasets with over 1000 participants (Cipora, Soltanlou, et al., 2019; Cipora, van Dijck, et al., 2019). Moreover, the confidence intervals for null effects or reliable reversed SNARC effects are typically not wider than for reliable regular SNARC effects, pointing at similar measurement precision for reversed, null, and regular effects (see Cipora, Soltanlou, et al., 2019; Cipora, van Dijck, et al., 2019). What is more, we made this observation despite collecting up to 900 data points per experimental cell (30 repetitions per session) for each participant (i.e., a total of 14,400 trials per participant). If this can be confirmed in a larger sample, the original idea of the effects representing some sort of stable mapping of numbers onto space (Cipora et al., 2018) needs to be refined. Importantly, this shows that the SNARC and MARC effects should not, or at least not purely, be considered trait characteristics, at least when measured with the parity judgment task. Considering the SNARC effect as a state instead of a trait can account for the rather mixed pattern of correlations between the SNARC effect and other variables such as mathematics skills, with some studies showing such relationships and others failing to observe them (for an overview, see Table 1 in Cipora, van Dijck, et al., 2020). This might also explain mixed results when it comes to correlations of other cognitive phenomena with other variables (e.g., Hedge et al., 2018).
Taken together, results from correlational studies with these effects can only be generalized when the correlations are replicated between and within samples, and/or the situated influences driving the correlations are fully understood. It remains an open question whether other cognitive phenomena are stable across time and whether they are dominant or indominant (Rouder & Haaf, 2018), and we believe our study opens a discussion on this issue.

Limitations and future research directions
For the interpretation of the study results, several limitations need to be considered. First and most obviously, our sample was relatively small, and the tested individuals are not necessarily representative of the broad population. Unfortunately, the extensive study design with a high time requirement, paired with the commitment to run the experiment regularly, made recruiting and incentivizing participants difficult. At the same time, we believe that the benefits of this approach outweigh the difficulties, given the novelty of the insights brought by this method. We wish to encourage researchers to apply the Ironman paradigm, because it is in our view the only way to investigate the intraindividual stability of cognitive phenomena. The time requirements and experimental schedule might at first be perceived as a highly inflexible commitment, but in fact this is not the case. Potential participants can be motivated by explaining that the experimental task can be run wherever they have their computer or laptop with them and at whatever time of day is convenient for them. On top of that, being tired or unmotivated in at least some of the sessions is not a problem, but instead even leads to the variance in the participant's current state that is desirable from the experimenter's perspective. This also reflects the variability between participants' current states in typical experimental setups with one session per participant, where some participants are concentrated and motivated, while others are tired, bored, or hungry (e.g., a hangover during participation in a cognitive experimental task due to alcohol consumption on the previous day can affect results; see Murgia et al., 2020).
Second, it must be noted that not all participants were naïve to the research questions of the study; three of the authors also acted as participants. These participants' behavior may be biased, because they knew the research questions and might consciously or unconsciously have had desired or undesired outcomes in mind while participating. However, we assume that number processing in speeded cognitive RT tasks such as the parity judgment task is highly automatized, such that it can hardly be deliberately influenced by expectations or motivation.
Third, participants might have applied different strategies in different sessions, which we did not ask about. The use of different strategies might, however, be related to the size of the effects. Moreover, covariates other than the four that we tested in the current study might play a role. To explore whether different strategies or covariates explain some of the effects' variability, future studies using the Ironman paradigm could assess task-solving strategies (e.g., whether the participants made "odd" vs. "even" or rather "yes" vs. "no" judgments, focusing only on even numbers; whether they applied some visualization strategies; etc.) or further covariates (e.g., body-related covariates like appetite, or context-related covariates like dealing with numbers shortly before the experiment).
Fourth, it needs to be considered that the mapping of numbers onto space might be a trait rather than a state, but that the SNARC effect measured in the parity judgment task might not capture this mapping in a stable way. Therefore, we recommend that similar studies be undertaken using other tasks in which the SNARC effect can be measured, such as a magnitude classification task.
Lastly, we did not ask about or determine participants' chronotype, which could influence the presence of numerical cognition effects. For instance, a meta-analysis by Preckel, Lipnevich, Schneider, and Roberts (2011) found chronotype, in terms of different circadian rhythms and fluctuations in physiological and psychological functions (morningness vs. eveningness), to affect cognitive abilities. Hence, there might be an interaction between chronotype and the time-of-day covariate.
Future research should be dedicated to finding reasons for the low intraindividual stability of the SNARC effect in the parity judgment task despite its robustness and high replicability at the group level. The methodology of the present study, using the Ironman paradigm combined with bootstrapping approaches, should be applied to other cognitive phenomena to answer the following questions: Should other cognitive phenomena be considered dominant (present in all individuals) or indominant (present at the group level, but absent or even reversed in some individuals), as defined by Rouder and Haaf (2018)? Are other cognitive phenomena intraindividually stable over time, and should they be considered state or trait characteristics of the participants? Which situational contexts are most impactful for the intraindividual variability of cognitive phenomena, and does this differ between cognitive tasks? Instead of only considering the overall strength of phenomena, as typically done in cognitive psychology, should we also consider their intraindividual variability as an important index, as done in some diagnostic approaches (e.g., attention diagnostics in ADHD), and if so, when? Is the overall strength of cognitive phenomena (i.e., the mean effect size) correlated with their intraindividual variability?

Conclusion
The current study shows that cognitive effects that are robust at the group level might not be present in every individual and are not necessarily stable over time within an individual. This interindividual and intraindividual variability in phenomena like the SNARC effect, the MARC effect, or the Odd Effect should not be considered error variance or noise. It might not reflect measurement problems either, but instead that the underlying phenomena do, in fact, differ between individuals and are not stable over time. Importantly, before correlating cognitive phenomena with other variables, their intraindividual stability and the situated influences on them must be established. The Ironman paradigm combined with bootstrapping approaches is an innovative and effective study design permitting these relevant and unique insights. Instead of considering cognitive phenomena solely as trait characteristics of individuals, researchers should think of them in terms of state characteristics as well.

C.1. Intraindividual prevalence of the Odd Effect
After seeing the results, we were interested in whether we would find similar intraindividual patterns when analyzing the data with different measures or looking into other phenomena that can be observed in the parity judgment task. Therefore, we investigated the Odd Effect (i.e., faster responses to even than to odd numbers), obtained by calculating the difference between RTs for odd and for even numbers at the session and the aggregated levels (grand effects) per participant (positive estimates indicating an Odd Effect).
At an aggregated level (grand effects), all ten participants descriptively revealed an Odd Effect (i.e., faster responses to even than to odd numbers). The magnitudes of the grand empirical differences ranged from 0.34 to 18.52 ms. The results were similar for the individual means of session estimates. Without taking CIs into account, the intraindividual proportions of sessions with an Odd Effect ranged from 46.67% to 100.00% (see Table C).

C.2. H0 bootstrapping approach
Subsequently, we conducted an H0 bootstrapping analysis to test whether the Odd Effect was more robust than the SNARC and MARC effects. At the 90% confidence level, five out of 10 participants (Participants 1, 3, 5, 7, and 8) showed a reliable Odd Effect at the aggregated level (grand empirical difference), while no participant showed a reliable reversed Odd Effect. The intraindividual proportions of sessions revealing reliable Odd Effects ranged from 10.00% to 86.67%. Two participants showed reliable Odd Effects in more than half of the sessions (Participants 5 and 7). The intraindividual proportions of sessions revealing a reliable reversed Odd Effect ranged from 0.00% to 13.79%. The individual distributions of observed grand empirical and session effects, including 90% CIs, are visualized in Fig. C.
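Our reading of the H0 bootstrapping logic for a single session can be sketched as follows; the trial numbers and RT distributions are hypothetical, and the exact resampling scheme used in the study may differ:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical single-session trial RTs (ms) for odd and even targets.
rt_odd = rng.normal(560, 60, 120)
rt_even = rng.normal(545, 60, 120)
observed = rt_odd.mean() - rt_even.mean()      # positive = Odd Effect

# H0 resampling (our reading of the approach): resample trial RTs from the
# pooled data, ignoring the odd/even labels, to build a null distribution of
# the difference and its 90% CI.
pooled = np.concatenate([rt_odd, rt_even])
null = np.array([
    rng.choice(pooled, 120).mean() - rng.choice(pooled, 120).mean()
    for _ in range(5000)
])
lo, hi = np.percentile(null, [5, 95])
reliable = bool(observed < lo or observed > hi)
print(f"difference = {observed:.1f} ms, null 90% CI = [{lo:.1f}, {hi:.1f}], "
      f"reliable: {reliable}")
```

An observed difference falling outside the null 90% CI would be counted as a reliable session-level effect under this logic.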

Table C
Summary of the Odd Effect results. One row corresponds to one participant. The left part shows descriptive properties of the empirical difference in ms (RTs for odd numbers minus RTs for even numbers): proportions of sessions with positive and negative empirical differences (positive/negative empirical differences reflecting a regular/reversed Odd Effect, respectively); the mean empirical difference (averaged over sessions), its standard deviation, and the grand Odd Effect (calculated by aggregating all trials across all sessions). The two following columns correspond to the 90% CI bounds for grand Odd Effects resulting from the H0 bootstrapping approach. The following part summarizes the proportions of sessions with reliable positive, no reliable, and reliable negative Odd Effects. Note. Colors represent the experiment version (red - Version 1: starting with right hand for odd numbers and left hand for even numbers (Block 1), and then reversed; blue - Version 2: reversed order). Note that here only the mean difference between conditions is calculated, so there is no equivalent of standardized slopes in this analysis. *Participant 9 was left-handed.

Fig. 1. Distributions of unstandardized observed grand and session SNARC slopes (dots) and 90% H1 CIs per participant. Note. Colors represent the experiment version (red - Version 1: right hand for odd numbers and left hand for even numbers in Block 1, right hand for even numbers and left hand for odd numbers in Block 2; blue - Version 2: reversed order). *Participant 9 was left-handed.

Fig. 5. Distributions of standardized observed grand and session MARC slopes (dots) and 90% H1 CIs per participant. Note. Colors represent the experiment version (red - Version 1: starting with right hand for odd numbers and left hand for even numbers (Block 1), and then reversed; blue - Version 2: reversed order). *Participant 9 was left-handed.

Fig. 6. Distributions of observed grand and session slopes (dots) and 90% H0 CIs per participant for unstandardized MARC effects. Note. Colors represent the experiment version (red - Version 1: starting with right hand for odd numbers and left hand for even numbers (Block 1), and then reversed; blue - Version 2: reversed order). *Participant 9 was left-handed.

Fig. C. Distributions of observed grand and session Odd Effects (dots) and 90% H0 CIs per participant. Note. Colors represent the experiment version (red - Version 1: starting with right hand for odd numbers and left hand for even numbers (Block 1), and then reversed; blue - Version 2: reversed order). Note that here only the mean difference between conditions is calculated, so there is no equivalent of standardized slopes in this analysis. *Participant 9 was left-handed.

Table 2
Proportions of overlap between the individual session CIs and the corresponding grand CIs obtained from H1 bootstrapping for SNARC and MARC slopes.
a Participant 9 was left-handed.

Table 5
Repeated-measures correlations between sleep duration, current level of tiredness, consumption of stimulants, and SNARC and MARC slopes, as well as RT characteristics.