Experiment-based calibration in psychology: Optimal design considerations

Psychological theories are often formulated at the level of latent, not directly observable, variables. Empirical measurement of latent variables ought to be valid. Classical psychometric validity indices can be difficult to apply in experimental contexts. A complementary validity index, termed retrodictive validity, is the correlation of theory-derived predicted scores with actually measured scores in specifically designed calibration experiments. In the current note, I analyse how calibration experiments can be designed to maximise the information garnered, and specifically how to minimise the sample variance of retrodictive validity estimators. First, I harness asymptotic limits to analytically derive distribution features that impact on estimator variance. Then, I numerically simulate various distributions with combinations of feature values. This allows deriving recommendations for the distribution of predicted values, and for resource investment, in calibration experiments. Finally, I highlight cases in which a misspecified theory is particularly problematic.


Introduction
Many areas of psychology deal with latent variables that are not directly observable, such as subjective value, confidence, or arousal. Empirical research mandates that such variables are assessed via a suitable measurement procedure. This raises a question of measurement validity: does the measurement actually relate to the latent variable of interest, and if yes, then how close is that relation? A validity argument is usually based on several sources (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014), among them quantitative indices such as convergent and discriminant validity, or reliability. While some of these are commonly used, others are less widespread in practice. For example, discriminant validity, while theoretically indispensable for the interpretation of convergent validity (McDonald, 2013), is less frequently assessed (Zumbo & Chan, 2014). This might reflect practical and interpretational challenges in the application of psychometric indices (Bach, Rigdon & Sarstedt, 2023). For a practical example, validity assessment requires demonstrating convergent and divergent validity, which necessitates finding at least two measurement methods for the same, and for a different, latent variable; this can be challenging. For an interpretational example, when convergent validity is low, the index cannot adjudicate between a ''more'' and a ''less'' valid method (Bach, Rigdon et al., 2023). Furthermore, psychometrics is historically grounded in the study of interindividual differences, and all psychometric indices require incidental between-person variability to be meaningful (Brandmaier et al., 2018). In contrast, much of experimental psychology is concerned with general or average treatment effects, where the goal of experimental design is often to minimise variability of the treatment effect between persons (Hedge, Powell, & Sumner, 2018). In such cases, psychometric indices can be misleading.

* Correspondence to: Hertz Chair for Artificial Intelligence and Neuroscience, University of Bonn, Germany.
Experiment-based calibration has recently been proposed as a complementary strategy to compare measurement methods, and does not suffer from the aforementioned shortcomings. Inspired by contemporary approaches to the evaluation of physical measurement (Phillips, Estler, Doiron, Eberhardt, & Levenson, 2001), the strategy is to generate well-defined values of a latent variable in a standardised experiment, and to evaluate methods by how well they can reproduce these values (Bach, Melinscak, Fleming, & Voelkle, 2020). This is formalised in the concept of retrodictive validity: the correlation of intended scores of the latent variable with the actually measured scores (see Fig. 1) (Bach & Melinscak, 2020). Conceptually, this can be seen as a type of criterion validity, with the external validity criterion replaced by experiment-defined standard scores (Bach et al., 2020).
While previous work has formally laid out the concept of retrodictive validity, and detailed the statistical assumptions under which it is informative on measurement accuracy (Bach et al., 2020), it has not addressed how the concept can be put into practice, or in other words, how a calibration experiment should be designed. There are at least two perspectives on calibration design. The first is domain-specific: to define a suitable experimental manipulation that can be used to impact on a particular latent variable. Recent work has suggested an expert consensus procedure to this end, and has provided an exemplary experiment to calibrate the measurement of associative learning (Bach et al., 2023). The second perspective is statistical: how can numerical and distributional design features be used to improve calibration? This is the topic of the current work.

Fig. 1. Schematic of the calibration approach. In the analytical derivations, no assumptions are made for the distributions of aberration and measurement error (black arrows). In the simulations, the aberration is decomposed into a non-linearity and random aberration, and measurement error is assumed to be unbiased i.i.d. for simplicity (grey arrows).

https://doi.org/10.1016/j.jmp.2023.102818 Received 12 May 2023; Received in revised form 28 September 2023; Accepted 25 October 2023
To clearly define the objective, i.e. what it means to ''improve calibration'', I draw on statistical estimation theory. Development of experimental measurement methods is often incremental, and previous research in the field of psychophysiology has suggested that the achievable improvements might be relatively small (Bach & Melinscak, 2020). In this situation, the challenge is to distinguish between two relatively similar population values for retrodictive validity from sample estimators. Consequently, retrodictive validity estimates themselves should be accurate. Thus, I ask how design features can be used to maximise the accuracy of retrodictive validity estimators. Retrodictive validity is expressed as a bivariate correlation. Sample estimators for correlation coefficients are approximately unbiased for large samples, such that the objective of maximising accuracy reduces to minimising estimator variance. Estimator variance, in this case, depends not only on sample size but crucially on features of the parent distributions, which will often be non-normal. Hence, the approach taken here is to analytically identify relevant distribution features using asymptotic limits, and then to numerically simulate distributions with combinations of these features in finite samples. Note that all simulation results depict the quantity nV(ρ̂), which is (approximately) independent of the sample size n. The variance of the retrodictive validity estimator in a finite sample of particular size can be derived from this quantity by dividing through the actual sample size used.

Overview
To recapitulate the calibration approach, an experiment is conducted in which the independent variable is chosen to generate predicted standard scores x of the latent variable (see Table 1). However, these predicted values are not accurately achieved: the truly achieved scores deviate from x by experimental aberration δ. Measurement operations are then performed to quantify the true scores, but the measured scores deviate from them by some error ε. Because psychological variables have no scale, we can disregard linear mappings between standard, true, and measured scores without loss of generality. To illustrate this with an example, consider the calibration of lightness measurement (i.e., the perception of an object's luminance). Experimenters can vary the (physical) luminance and colour of an object in order to manipulate (perceived) lightness. Then the standard scores x are defined by a vector of the physical quantities (luminance and colour) together with a formal theory of lightness (e.g. the CIECAM02 model) that takes these quantities as input. A simple measurement operation might consist in asking people to indicate lightness on a visual analogue scale.
Experimental aberration δ has two sources. The first is that the theory that links the experimental manipulation with latent variable scores (the CIECAM02 lightness model in our example) might be inaccurate; in other words, it is mis-specified. This is modelled here with a systematic non-linearity f(x). The second source is random variation due to imprecision (arising from physical and psychological factors), which is modelled by a (scalable) probability distribution that may depend on x (corresponding to non-additive noise). Measurement error ε also has two sources: systematic mis-specification in the measurement system, and random variation.
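The generative structure described above can be sketched in a few lines of code. The following Python sketch (the paper's own simulations used Matlab) is illustrative only: binary standard values, an identity non-linearity, normally distributed aberration and error, and the function name and scale parameters are all assumptions rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_calibration(n, f=lambda x: x, sigma_delta=0.5, sigma_eps=0.5):
    """Simulate one calibration experiment and return the sample
    retrodictive validity, i.e. the correlation of predicted standard
    scores with measured scores."""
    x = rng.choice([-1.0, 1.0], size=n)              # standard scores: E(x)=0, V(x)=1
    y = f(x) + sigma_delta * rng.standard_normal(n)  # true scores = non-linearity + random aberration
    z = y + sigma_eps * rng.standard_normal(n)       # measured scores = true scores + error
    return np.corrcoef(x, z)[0, 1]

rho_hat = simulate_calibration(100_000)
```

With these illustrative settings, the population retrodictive validity is 1/√(1 + 0.25 + 0.25) ≈ .82, and the sample estimate converges towards that value as n grows.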
Table. Variance of normal distributions used in various simulations.

Standard values
Standard values x are the experimenters' prediction for the effect of their experimental manipulation on the latent variable. Since experimenters have full control over the experimental manipulation, the distribution of x is a key consideration for designing a calibration experiment. In some cases where the theory is well-developed (e.g. the CIECAM02 model for lightness perception), x might be chosen from a large range, allowing many different distributions, including approximately normal distributions. In many other cases, the theory that links the independent variable to latent attribute values might not be numerically specified and might only allow predictions for a discrete number of values of the independent variable. For example, attributes like ''fear memory'' or ''confidence'' might only allow predicting two standard values, i.e. ''low'' and ''high'' (Bach et al., 2023). In the numerical simulations, we consider discrete and continuous distributions, including a standard normal distribution and a truncated normal distribution, where the latter is more feasible for experimental implementation.
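The candidate distributions for x can be generated and compared directly. The sketch below, in Python under the convention E(x) = 0 and V(x) = 1, uses an assumed set of candidates (binary, five-point discrete uniform, continuous uniform, normal truncated at ±2, and normal); the specific bounds and grid are illustrative, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

def standardise(v):
    # enforce the scaling convention: zero mean, unit variance
    return (v - v.mean()) / v.std()

trunc = rng.standard_normal(4 * n)
trunc = trunc[np.abs(trunc) < 2][:n]     # truncated normal via rejection sampling

candidates = {
    "binary":             rng.choice([-1.0, 1.0], n),
    "discrete uniform":   rng.choice(np.linspace(-1.0, 1.0, 5), n),
    "continuous uniform": rng.uniform(-1.0, 1.0, n),
    "truncated normal":   trunc,
    "normal":             rng.standard_normal(n),
}
candidates = {k: standardise(v) for k, v in candidates.items()}

# kurtosis E(x^4) / V(x)^2 differs markedly between these choices and
# feeds into the weighted-kurtosis term of the analytical decomposition
kurtosis = {k: float(np.mean(v**4)) for k, v in candidates.items()}
```

After standardisation, the choices differ mainly in kurtosis (1 for binary, 1.8 for continuous uniform, 3 for normal), which is the feature the analytical results below identify as relevant.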

Experimental aberration
The distribution of the experimental aberration δ is not under direct experimental control. First, if the systematic non-linearity f(x) were precisely known, then one would be able to include it in the model and thus adjust the predicted standard values x. Secondly, precisely revealing the distribution of random aberration would already require a perfect measurement system. However, it is still useful to consider the impact of these two terms. Regarding the non-linearity, theoretical or experimental considerations might suggest a particular class of non-linearity f(x). For example, the Helmholtz-Kohlrausch effect, which is not included in the CIECAM02 model, would suggest that for certain colours, the model over- or underestimates the true perceived lightness, even if this predicted mis-specification is not known numerically. Regarding random aberration, it may be possible to unspecifically reduce the scale of the aberration with experimental means. For example, conducting a psychophysics study in a dedicated lab instead of as an online experiment is likely to result in smaller aberration. This is likely to include an impact on the shape of the aberration distribution, which is unknown. Here, we model this situation with an unspecific impact on the scale of random aberration. This is why we decompose the random aberration, σ_δ δ₀, into a shape δ₀ and a scale σ_δ.

Measurement error
Measurement error is not under direct experimental control in a calibration experiment. Crucially, data transformations applied after the experiment may influence measurement error, which is therefore not known a priori. Hence, a calibration experiment should be designed in a way that is useful under various levels of measurement error. In the numerical simulations, we assume throughout that aberration and measurement error are not only uncorrelated but also stochastically independent, and that measurement error follows a normal distribution. We explore several levels of error variance in order to account for measurement instruments of different precision. To simplify the results, and because systematic measurement error is difficult to foresee a priori, we do not consider its effect.

Scaling
Because psychological variables have no natural scale or anchor point, we can assume, without loss of generality:

E(x) = 0, V(x) = 1. (1)

Numerical simulations
In our simulations, we evaluate the following distributions. These examples are based on the analytical results and represent distributions with extreme values for some relevant features. For simplicity, the simulations consider calibration experiments in which the standard scores are independently randomised for each instantiation of the experiment, rather than being balanced.

Systematic non-linearity in the experimental aberration
For the simulated non-linearities f(x) (see Table 1), the parameter a is numerically determined to ensure that the linear component of the non-linearity is fixed, i.e. Cov(x, f(x)) = 1, which simplifies model scaling. For each of the two types of non-linear mappings, we consider a version with smaller and with larger deviation from linearity, i.e. low and high values for V(f(x)).
1. No non-linearity: f(x) = x.
2. Sigmoid non-linearity: f(x) = a·x^(1/3) and f(x) = a·x^(1/5). This means that the predicted standard scores are closer to their mean than the true scores; in other words, the standard scores overestimate the true scores when the standard scores are close to the mean, and they underestimate the true scores when the standard scores are extreme. Such non-linearity might be suspected when the latent variable is supposed to be bounded and saturates, while the experimental independent variable is unbounded or has very wide bounds. This might be the case in many areas of perception.
3. Inverse sigmoid non-linearity: f(x) = a·x³ and f(x) = a·x⁵. Such non-linearity might be suspected when there is evidence that small changes of the independent variable can have strong effects at extreme values. A classical example of inverse sigmoid non-linearity is subjective probability weighting in value-based decision-making: if a latent variable is manipulated by economic gambles with two fixed potential outcomes and different probabilities of winning, then the predicted and true values of the latent variable are likely to relate by an inverse sigmoid mapping. (In this example, of course, this known inverse sigmoid mapping could be taken into account when using prospect theory to predict the standard values.)
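The normalisation Cov(x, f(x)) = 1 can be determined numerically as described above. The following Python sketch (function name and sample size are illustrative assumptions) rescales the cubic and cube-root mappings accordingly; for x ∼ N(0, 1) and the cubic, the analytical value is a = 1/3, because Cov(x, x³) = E(x⁴) = 3.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(1_000_000)   # standard values with E(x)=0, V(x)=1

def with_fixed_linear_component(g, x):
    """Rescale a non-linearity g so that Cov(x, a*g(x)) = 1."""
    a = 1.0 / np.cov(x, g(x))[0, 1]
    return a * g(x)

f_inv_sigmoid = with_fixed_linear_component(lambda v: v**3, x)   # a*x^3
f_sigmoid     = with_fixed_linear_component(np.cbrt, x)          # a*x^(1/3), odd root

cov_check = np.cov(x, f_inv_sigmoid)[0, 1]   # ~1 by construction
v_inv = float(np.var(f_inv_sigmoid))         # V(f(x)): deviation from linearity
v_sig = float(np.var(f_sigmoid))
```

With the linear component fixed, V(f(x)) quantifies the deviation from linearity, and the cubic (inverse sigmoid) version deviates more strongly than the cube root (sigmoid) version under a normal x.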

Random aberration due to imprecision: shape
A scaling parameter in some of these simulations was numerically determined to ensure that V(δ₀) = 1. For all simulations, Cov(x, δ₀) = 0. These distributions were selected as they provide extreme values for some features highlighted in the analytical derivation, and they are illustrated in Fig. 2. Some tentative intuition for where these features might occur in psychology is given after each distribution; these explanations do not refer to the specific example distributions but rather to the feature values being modelled.

1. Continuous uniform distribution (platykurtic example): δ₀ ∼ U(−√3, √3). This might for example occur due to physical or conceptual limitations in the resolution of the independent variable, i.e. it is specified only within a range.
2. Standard normal distribution (mesokurtic example): δ₀ ∼ N(0, 1). This is a standard assumption in many types of research.
This could happen due to floor and ceiling effects if there are person-independent boundaries on achievable true scores.
7. (Conditional) gamma distribution with positive skew for smaller x. This might occur due to a combination of a categorical/qualitative and a quantitative mechanism in generating the true score, e.g. a percept is generated as either positive or negative in relation to a reference value, and then imbued with a specific value on the half-axes, which will have more variability into the extreme direction than towards the reference.
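Two of these shapes can be generated as follows. This Python sketch assumes the unit-variance parameterisations given above; the gamma shape parameter k = 2 and the binary standard values are arbitrary illustrative choices, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
x = rng.choice([-1.0, 1.0], n)   # binary standard values, for illustration

# platykurtic shape: the uniform distribution on (-sqrt(3), sqrt(3))
# has zero mean and unit variance, satisfying V(delta0) = 1
d_uniform = rng.uniform(-np.sqrt(3), np.sqrt(3), n)

# conditionally skewed shape: centred, unit-variance gamma deviates whose
# skew direction depends on x (positive skew for smaller x)
k = 2.0
g = (rng.gamma(k, 1.0, n) - k) / np.sqrt(k)
d_gamma = np.where(x < 0, g, -g)

skew_low  = float(np.mean(d_gamma[x < 0] ** 3))   # positive skew for x = -1
skew_high = float(np.mean(d_gamma[x > 0] ** 3))   # mirrored negative skew
```

Both shapes satisfy the standardisation V(δ₀) = 1, while the conditional gamma additionally exhibits the x-dependent skew that the weighted-skewness term of the analytical decomposition picks up.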

Random aberration due to imprecision: scale
Simulated values for σ_δ were drawn from a wide range, including typical values encountered in experimental settings. These can be derived from reported retrodictive validity values in calibration experiments. Systematic research using retrodictive validity as a metric has mainly been conducted in the field of human fear conditioning. Here, retrodictive validity has typically been expressed as (Cohen's) d for a within-subjects difference between two experimental treatments. Published values range between d = 0.4 for some skin conductance analysis methods, and d = 1.2 for fear-potentiated startle eye-blink (e.g. Bach & Melinscak, 2020). This corresponds to retrodictive validity estimates between (Pearson's) ρ = .20 and ρ = .51. From Bach et al. (2020) (SI Eq. (33)), we have

ρ² = V(x) ∕ (V(x) + V(δ) + V(ε))

and so, with V(x) = 1,

V(δ) + V(ε) = 1∕ρ² − 1. (2)

Thus, the two variances combined will take values between 24 for ρ = .20, and 2.8 for ρ = .51. Since these reported estimates derive from experiments with only two possible values of x, we can assume f(x) = x. If aberration and error variance were equal, this would correspond to σ_δ = 3.46 and σ_δ = 1.18, respectively. Assuming that retrodictive validity might be lower or higher in other areas of psychology, σ_δ was varied over the interval [0.1, 4].
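The arithmetic above can be checked directly. The sketch below uses the standard two-group conversion ρ = d/√(d² + 4), which reproduces the reported correspondence between d and ρ; the function names are illustrative.

```python
import numpy as np

def d_to_rho(d):
    """Standard two-group conversion of Cohen's d to a correlation."""
    return d / np.sqrt(d**2 + 4.0)

def combined_noise_variance(rho, var_x=1.0):
    """V(delta) + V(epsilon) implied by Eq. (2), with V(x) = 1 by default."""
    return var_x * (1.0 / rho**2 - 1.0)

rho_low, rho_high = d_to_rho(0.4), d_to_rho(1.2)   # ~.20 and ~.51
v_low  = combined_noise_variance(0.20)             # 24
v_high = combined_noise_variance(0.51)             # ~2.8
sigma_low  = np.sqrt(v_low / 2)    # ~3.46 if aberration and error variance are equal
sigma_high = np.sqrt(v_high / 2)   # ~1.19
```

This recovers the values quoted in the text (24 and roughly 2.8 for the combined variances; 3.46 and roughly 1.18 for the equal-split scales).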

Measurement error
Because measurement error is not usually known a priori, and can depend on the other model quantities in complicated ways, all simulations use a placeholder error distribution that represents i.i.d. Gaussian noise with three levels: ε ∼ N(0, σ_ε²) with σ_ε ∈ {0.1, 1, 4}.

Simulation settings
All simulations used Matlab's default Mersenne twister to generate random numbers. Distributions were simulated by drawing 10⁷ samples. Asymptotic results used all of these samples to approximate the expectations in Eq. (3), from which estimator variance was analytically calculated. For finite-sample results, I randomly selected (with replacement) from these simulated distributions 10⁶ samples of size n = 100 (based on the sample size suggested in Bach et al. (2023)) and computed the resulting variance of the sample retrodictive validity estimator. All simulation results depict the quantity nV(ρ̂), from which estimator variance is computed by division through the actual sample size. For small samples, sample size can have an impact on nV(ρ̂). Hence, I compared the results to simulations with sample sizes of n = 50 and n = 1000. On average, the absolute differences in nV(ρ̂) between the differently sized samples were about 1%, and in more than 90% of the simulated scenarios, the absolute difference was below 5%.
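The finite-sample procedure can be sketched on a much smaller scale. The following Python version (the original simulations ran in Matlab with far more replications) uses an assumed simple scenario with binary standard values, i.i.d. normal aberration and error, and no systematic non-linearity; replication counts and scale parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

def n_times_var(n_obs, n_rep=10_000, sigma_delta=1.0, sigma_eps=1.0):
    """Monte Carlo approximation of n * V(rho_hat): draw n_rep experiments
    of size n_obs, compute the sample correlation in each, and scale the
    variance of these estimates by n_obs."""
    x = rng.choice([-1.0, 1.0], (n_rep, n_obs))
    z = x + sigma_delta * rng.standard_normal((n_rep, n_obs)) \
          + sigma_eps * rng.standard_normal((n_rep, n_obs))
    xc = x - x.mean(axis=1, keepdims=True)
    zc = z - z.mean(axis=1, keepdims=True)
    r = (xc * zc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (zc**2).sum(axis=1))
    return n_obs * r.var()

v_100 = n_times_var(100)
v_50  = n_times_var(50)   # close to v_100 if n*V(rho_hat) is ~independent of n
```

Comparing the two sample sizes illustrates the near-independence of nV(ρ̂) from n reported in the text.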

Analytical results
The asymptotic limit of the sample estimator variance can be decomposed into the following terms (see Appendix for derivation), where h collapses all terms involving the (a priori unknown and unknowable) measurement error. In the following, I discuss the five terms not involving measurement error, keeping in mind the goal of minimising nV(ρ̂). Fig. 2 illustrates the different quantities. I note (see Appendix) that although the sum of these terms is non-negative, some of them can be negative (and thus serve to reduce estimator variance).

Weighted kurtosis of the standard value distribution.
because E(x) = 0 and V(x) = 1. Both factors in this term are non-negative for ρ ∈ [0, 1]. The kurtosis of the standard value distribution is under experimental control, and reducing it will always reduce g₁. Fig. 2A illustrates different distributions and their kurtosis. The second factor has minima at ρ = 0 and ρ = 1, and a maximum in between. For fixed V(ε), this means that reducing V(δ) from a given value can increase or reduce this term. The maximum value of g₁ on the interval ρ ∈ [0, 1] is κ(x)∕27. While kurtosis is theoretically unbounded, high-kurtosis distributions of x appear unnecessary in practice, and for a plausible normal distribution of x, the maximum is g₁ ≈ 0.11. Thus, overall this term makes a relatively modest contribution to nV(ρ̂).
Weighted kurtosis of the aberration distribution.
using E(δ) = 0, and thus κ(δ) = E(δ⁴)∕V²(δ), and Eq. (1). Both factors in this term are non-negative for ρ ∈ [0, 1]. Reducing the aberration kurtosis would thus always reduce g₂, although in practice this will usually not be under experimental control (Fig. 2BC). For fixed V(ε), the second factor has a maximum at V(δ) = 2V(ε) + 2. In words, if the aberration is relatively large compared to the measurement error, then decreasing the aberration increases g₂. If V(δ) is below this bound, then decreasing the aberration decreases g₂. This is the case for example if V(δ) = V(ε). Because V(δ) also impacts on the other terms, its net effect can only be understood in the numerical simulations.
Weighted covariance of squared intended values and squared aberration.
with Eq. (1) and δ = δ₀√V(δ). The first factor is non-negative. If x and δ are stochastically independent, or if there are just two equiprobable values of x, then the first factor equals 1. In other cases, the first factor is large when δ takes larger absolute values for more extreme values of x (Fig. 2DE). For the second factor, if V(δ) > 1∕2, then reducing V(δ) reduces this factor; the precise bound V*(δ) below which this reverses depends on V(ε), with V*(δ) < 0.3. The impact of changing V(δ) is analysed in detail in the numerical simulations below.

Weighted systematic aberration.
with Eq. (1) and δ = δ₀√V(δ). The first factor is zero if there are two or three equidistant values of x, or if there is no non-linearity, i.e., f(x) = x. Otherwise, the first factor is positive when f(x) takes more extreme values for more extreme values of x (e.g. inverse sigmoid non-linearity, Fig. 2F), and negative in the opposite case (e.g. sigmoid non-linearity, Fig. 2G). The second factor is non-negative. On an interval [0, V*(δ)], decreasing V(δ) decreases this factor. The minimum value of V*(δ) is achieved when V(ε) = 0, which yields V*(δ) = 5. It is plausible to assume that V(δ) will rarely be larger than this value. The impact of changing V(δ) then only depends on the first factor and is analysed in detail in the numerical simulations below.
Weighted skewness of the aberration.
with Eq. (1) and δ = δ₀√V(δ). The first factor is positive if δ is positively skewed for large(r) values of x and negatively skewed for small(er) values of x (see Fig. 2HI). For two values of x, this would mean that the distribution of the measured scores has long tails in both directions, but fewer values in the middle. In the opposite situation, the first factor will be negative. If δ has no skew for any value of x, then this term will be zero. The second factor is negative. On an interval [0, V*(δ)], decreasing V(δ) increases this factor. The minimum value of V*(δ) is achieved when V(ε) = 0, which yields V*(δ) = 5. It is plausible to assume that V(δ) will rarely be larger than this value. The impact of changing V(δ) then only depends on the first factor and is analysed in detail in the numerical simulations below.

Numerical results
Because several of the analytically identified terms depend on the aberration in different ways, I numerically simulated various scenarios to quantify their impact on estimator variance. Fig. 3 shows simulated asymptotic results, where rows refer to different distributions of random aberration δ₀, columns to different systematic non-linearities and levels of measurement error V(ε), and the x-axis to different levels of random aberration σ_δ. Using the same layout, Fig. 4 shows simulated results from finite samples, and Fig. 5 shows the direct comparison.
We first consider estimator variance in the absence of systematic non-linearity (left three columns in Figs. 3 and 4). When δ₀ is independent of the standard values x (top three rows), different distributions of x yield relatively similar estimator variance. This confirms the analytically derived result that although the kurtosis of x does affect g₁, its numerical impact is relatively small. Furthermore, reducing the scale of the aberration can reduce estimator variance considerably, but only for low to medium levels of measurement error, i.e. when the error variance V(ε) is of a lower or the same order of magnitude as the variance of the aberration. In high-noise scenarios, when measurement error variance is of a larger order, reducing the aberration has only a small effect.
Next, we consider conditional distributions of δ₀ that depend on x. Here, different choices of x yield more dissimilar estimator variance. In most situations, binary x yields the smallest, and normally distributed x the largest, estimator variance, with a small impact of truncating x. In most circumstances, reducing the scale σ_δ reduces estimator variance. However, in the specific case that V(δ₀) inversely scales with the absolute values of x (fourth row), these results are reversed: binary x yields the largest estimator variance, and reducing the scale σ_δ can increase estimator variance, in particular if measurement error is large.

Fig. 5. Direct comparison of asymptotic limits (Fig. 3) and finite-sample results (Fig. 4). Faint colours denote the asymptotic limits.
For small and medium levels of measurement noise, this effect is mainly due to term g₂, as anticipated in the analytical results. However, g₂ has hardly any impact at high measurement noise. Here, the increase in estimator variance is due to an interplay of several factors, including the error variance-dependent term h. At high levels of measurement noise, this term generally increases with decreasing σ_δ. Note that in the absence of prior knowledge on the distribution of measurement error, I assumed i.i.d. normally distributed error for the simulations. Insofar as some of these results depend on h, they may not generalise.
In the presence of sigmoid non-linearity in the aberration, the results look very similar (middle three columns), and there is almost no impact of the variance of the systematic aberration. However, inverse sigmoid aberration changes this picture (right three columns) in several important ways. First, asymptotic estimator variance becomes very large for large systematic non-linearity when x follows a normal distribution and the scale σ_δ is small. This is due to term g₂. The implementation of inverse sigmoid aberration used here is a cubic function, and the kurtosis of the cube of a standard normally distributed variable is governed by its 12th moment, which is 11!! = 10395. This excessive kurtosis is reduced by adding random aberration; hence the phenomenon is particularly pronounced for small σ_δ and small measurement error. However, kurtosis is not as excessive for any truncated x or in finite samples (see Fig. 4), and this phenomenon may therefore be irrelevant for practical purposes. The following three differences apply both for asymptotic and finite-sample cases. Secondly, estimator variance is generally much larger for a (truncated) normal distribution of x than for the discrete and continuous uniform distributions, due to an increase of the term g₃. Third, the scale of the non-linearity has a larger impact on estimator variance here than for sigmoid non-linearity, again due to an impact on term g₃. Fourth, there are more scenarios in which decreasing the scale σ_δ increases estimator variance.

Discussion
In this note, I investigated quantities that determine the sample variance of retrodictive validity estimators.Of particular interest are design features that an experimenter can control; and it is useful to consider how different settings of these features fare under various combinations of non-controllable features.
The first feature under experimental control is the sample size, which has a well-known linear relation with estimator variance. Increasing sample size can, in any scenario, reduce estimator variance. All other results concern the quantity nV(ρ̂), where n is the sample size. In other words, they describe how estimator variance is affected when sample size is constant. The second feature under experimental control is the distribution of x, which the experimenter can freely choose. In the absence of systematic non-linearity in the aberration, and when imprecision aberration is i.i.d., the choice of x has no large impact on estimator variance. However, in many other situations, estimator variance increases from binary to discrete uniform to continuous uniform to normally distributed x. The third feature of interest is the scale of the aberration distribution σ_δ, which is at least under indirect experimental control. Reducing imprecision aberration reduces estimator variance in many but not all scenarios, and can sometimes even increase it. Because it is generally unknown which scenario occurs in a given calibration experiment, the effect of reducing imprecision aberration is difficult to predict.
From these results, several recommendations for the design of calibration experiments can be derived. First, it may be preferable to use a relatively small number of discrete experimental treatments, as is usually the case anyway in many fields of experimental psychology. Binary treatments fare particularly well, but in advanced experimental fields, a goal may be to calibrate a measurement across a range of values and to include a possible impact of systematic aberration, which would motivate the use of several discrete values. If a continuous distribution is desired, then normal distributions, however convenient for statistical computation, should be avoided in favour of a continuous uniform distribution. A second recommendation relates to aberration and the use of resources. Reducing aberration is obviously desirable in its own right, but it may be costly if it involves specific experimental efforts such as shielded testing rooms, prolonged resting periods, monitoring participants over extended periods of time, etc. The present results suggest that reducing imprecision aberration is useful in most scenarios, but not when measurement noise is relatively high, and not under some particular aberration distributions. Since these distributions are not a priori known to the calibration planner, limited resources may be better invested into increasing sample size, which uniformly and linearly decreases estimator variance.
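The first recommendation can be illustrated with a small simulation. The Python sketch below compares binary and normally distributed standard values under an inverse sigmoid (cubic) aberration; the scale parameters and replication counts are illustrative assumptions, so this reproduces the qualitative pattern rather than the paper's exact numbers.

```python
import numpy as np

rng = np.random.default_rng(6)

def n_times_var(draw_x, n_obs=100, n_rep=10_000, sigma_delta=0.5, sigma_eps=1.0):
    """n * V(rho_hat) under an inverse sigmoid non-linearity f(x) = x^3 / a,
    with a chosen so that the linear component Cov(x, f(x)) is ~1."""
    x = draw_x((n_rep, n_obs))
    a = np.mean(x**4)                  # Cov(x, x^3) = E(x^4) when E(x) = 0
    z = x**3 / a + sigma_delta * rng.standard_normal(x.shape) \
                 + sigma_eps * rng.standard_normal(x.shape)
    xc = x - x.mean(axis=1, keepdims=True)
    zc = z - z.mean(axis=1, keepdims=True)
    r = (xc * zc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (zc**2).sum(axis=1))
    return n_obs * r.var()

v_binary = n_times_var(lambda s: rng.choice([-1.0, 1.0], s))
v_normal = n_times_var(lambda s: rng.standard_normal(s))
```

For binary x the cubic reduces to the identity, whereas for normal x the heavy tails of x³ inflate estimator variance, so v_normal exceeds v_binary, in line with the recommendation to prefer few discrete treatment levels.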
Other distributional features had a pronounced impact on estimator variance, but many of these will be hard to control or even to know, for example the shape of the aberration distribution. However, one of these features, systematic non-linearity, deserves particular attention, as it relates to shortcomings in the substantive theory used to generate the predicted standard values x. While not under experimental control in a particular calibration experiment, substantive theory can be improved over time, and its results can be incorporated into the analysis of a calibration experiment even after the data are acquired. A particularly problematic case, inverse sigmoid aberration, is exemplified by subjective probability weighting in value-based decision making. If a latent variable is manipulated by changing outcome probabilities in an economic gamble with two fixed potential outcomes, then the latent variable is likely to relate to the probabilities in an inverse-sigmoid manner. If, however, predicted values were derived with linear probability weighting, then these would be misspecified with an inverse-sigmoid aberration. This could be improved by using cumulative prospect theory to derive more accurate predictions of the true scores. Thus, reducing systematic aberration merits investment and yields long-term returns, as the results can be used retrospectively and prospectively in many calibration experiments. This requires improving substantive theory and thus resonates with recent theoretical proposals that measurement should principally be based on substantive theory (Borgstede & Eggert, 2023).
These recommendations are based on the objective of minimising estimator variance. I note that there might be other objectives when designing a calibration experiment. For example, experimental aberration constitutes an important source of measurement uncertainty (BIPM, IEC, IFCC, ILAC, IUPAC, IUPAP, ISO, & OIML, 2012; Rigdon & Sarstedt, 2022; Rigdon, Sarstedt, & Becker, 2020). Aberration cannot be reduced by better measurement instruments. In the terminology of this article, experimental aberration places an upper limit on achievable values of retrodictive validity. This might be undesirable in itself, and as such, reducing the scale of experimental aberration might be an independent goal of calibration design, beyond reducing estimator variance.
All results are necessarily limited to the choice of scenarios analysed. To motivate these scenarios, I analytically derived distribution features that have a large impact on estimator variance, and then simulated distributions with extreme values of these features. Doing so may have missed some other distributions that are possibly more relevant in particular fields of experimental psychology. All simulations are openly available and can easily be expanded to encompass other situations. Furthermore, the present analysis is focused on independently randomised (rather than balanced) experimental treatments. This is a plausible experimental approach for between-subjects studies, but the case of balanced experimental treatments is particularly important for within-subjects designs. Future work will explore this experimental design in a purely numerical approach.
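The simulation logic underlying such scenarios can be sketched in a few lines: draw predicted standard values from a chosen distribution, add random aberration, compute the retrodictive validity estimator (a Pearson correlation) for many samples, and scale its variance by sample size. The specific distribution choices below (uniform standard values, normal aberration) are illustrative, not those of the article.

```python
# Sketch: Monte Carlo estimate of the finite-sample variance of the
# retrodictive validity estimator (the Pearson correlation between
# predicted standard values and measured scores), scaled by sample size.
import numpy as np

def var_times_n(n=100, sigma=0.5, n_sims=20_000, seed=0):
    rng = np.random.default_rng(seed)
    r = np.empty(n_sims)
    for i in range(n_sims):
        x = rng.uniform(-1.0, 1.0, size=n)      # predicted standard values
        y = x + rng.normal(0.0, sigma, size=n)  # measured scores with aberration
        r[i] = np.corrcoef(x, y)[0, 1]
    return n * r.var()  # estimator variance scaled by sample size

# The scaled variance is roughly constant across n, which is why increasing
# sample size is the most predictable route to reducing estimator variance.
```

Other scenarios are obtained by swapping the distributions of `x` and of the aberration term, or by passing `x` through a non-linearity before adding noise.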
As a general limitation of the calibration approach, and similar to the concept of construct validity (Cronbach & Meehl, 1955), it does not allow one to empirically separate trueness (bias) and precision of the measurement (Bach et al., 2020). Trueness of the measurement refers to systematic deviation from the true scores. This can occur, for example, due to unknown non-linearities in the mapping from true scores to measured scores (see for illustration e.g., Dunn & Anderson, 2022), or because one is measuring not the latent variable of interest but a different, but somewhat correlated, variable. Precision, on the other hand, relates to variability in the measurement under constant true scores. Although these can be distinguished in theory, just as one can distinguish systematic aberration (bias) from imprecision aberration, retrodictive validity collapses both into one accuracy metric. In order to distinguish the two, there are two approaches (Bach et al., 2020). The weaker one is to analyse pairs of standard values, in which case the influence of trueness will likely be reduced compared to a larger set of standard values. The second is to additionally evaluate reliability (Brandmaier et al., 2018), from which precision can be computed under justifiable assumptions (McDonald, 2013). Note, however, that the design of parallel tests in the context of experimental measurement can be non-trivial, as discussed previously (see supplemental information in Bach et al. (2020)). Future work will address this experimental limitation and hopefully contribute to the derivation of uncertainty metrics in experimental measurement (Rigdon & Sarstedt, 2022; Rigdon et al., 2020).
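The second approach rests on a standard classical test theory identity: reliability is the ratio of true-score variance to observed-score variance, so the error (imprecision) variance follows directly. A minimal sketch, with illustrative values:

```python
# Sketch: under classical test theory, reliability = true-score variance /
# observed-score variance, so the error (imprecision) variance equals the
# observed variance times (1 - reliability). Inputs below are illustrative.

def error_variance(observed_var: float, reliability: float) -> float:
    """Measurement error variance implied by a reliability in [0, 1]."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return observed_var * (1.0 - reliability)

# e.g. an observed variance of 4.0 with reliability .8 implies an error
# variance of 0.8, i.e. an error standard deviation of about 0.89.
```

This computation presupposes the classical test theory assumptions (in particular, error uncorrelated with true scores), which is the sense in which precision is obtained "under justifiable assumptions".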
To summarise, estimator variance in calibration experiments can be reduced by using uniformly distributed, preferably discrete, experimental treatments; by reducing inverse-sigmoid experimental aberration; and, in many but not all circumstances, by reducing imprecision aberration. If resources are limited, then the most predictable route to reducing estimator variance, under all experimental circumstances, remains increasing sample size.

Fig. 2. Illustration of calibration-specific terms that impact on variance of retrodictive validity estimators. A: Kurtosis of the different standard value distributions used for numeric simulations. B, C: Two aberration distributions with different kurtosis. B: Continuous uniform distribution (platykurtic example). C: Student's t-distribution with 5 degrees of freedom (leptokurtic example). D, E: Two random aberration distributions with different covariance of squared standard values and aberration. D: (Conditional) normal distribution with larger variance for smaller absolute standard values. E: (Conditional) normal distribution with smaller variance for larger absolute standard values. F, G: Sigmoid and inverse-sigmoid non-linearity. H, I: Two random aberration distributions with different skewness of the aberration. H: Gamma distribution with positive skew for larger standard values. I: Gamma distribution with negative skew for larger standard values.

Fig. 3. Simulated asymptotic results for 63 scenarios with different standard value distributions (colours), different random aberration distributions (rows), different scales of random aberration (x-axis), different systematic non-linearity (columns), and different levels of measurement error (columns). All panels show the product of asymptotic estimator variance with sample size. Note that under inverse-sigmoid aberration, two columns are displayed with a different limit on the y-axis.

Fig. 4. Simulated finite-sample results for 63 scenarios with different standard value distributions (colours), different random aberration distributions (rows), different scales of random aberration (x-axis), different systematic non-linearity (columns), and different levels of measurement error (columns). All panels show the product of estimator variance with sample size. This quantity was averaged across 10^5 samples of size 100.

Table 1
Notation used in the methods and results.