The ideal outcome of a behavioral experiment in many fields of experimental psychology is the predicted effect in the dependent measure of interest—usually either response times (RTs) or proportions of correct responses (PCs)—and no effect between conditions in the respective other measure (or at least an effect in the same direction). This outcome is ideal for two reasons: (a) it is easy to interpret, and (b) with just one outcome measure on which to test a hypothesis, there is no need to correct for multiple testing.

Unfortunately, it is sometimes not predictable whether participants will focus more on doing the task right or on doing it fast (i.e., which point on the speed–accuracy trade-off [SAT] continuum they will choose; for reviews, see, e.g., Heitz, 2014; Luce, 1986; Pachella, 1974; Sanders, 1998). Indeed, SATs can vary unpredictably within and across participants (e.g., Dutilh et al., 2012; Gueugneau, Pozzo, Darlot, & Papaxanthis, 2017; Liesefeld, Fu, & Zimmer, 2015), participants can adapt their SATs trial-wise and at will (e.g., Paoletti, Weaver, Braun, & van Zoest, 2015; Reuss, Kiesel, & Kunde, 2015; Voss, Rothermund, & Voss, 2004; Wickelgren, 1977), and they might even change their SATs systematically between conditions. Sometimes, of course, shifts in SATs are the phenomenon of interest (e.g., as an account of post-error slowing; Laming, 1968; see also Botvinick, Braver, Barch, Carter, & Cohen, 2001; Thura, Guberman, & Cisek, 2017), but usually SAT shifts hinder the goals of a study, probably confounding the investigated effect.

Even when the dependent measure of interest is set a priori for theoretical reasons, researchers still routinely check whether the respective other measure points in the same direction (e.g., shorter RTs and more correct responses/higher PC). This is because RTs and PCs pointing in opposite directions (i.e., shorter RTs and fewer correct responses/lower PC) would indicate that the observed effects may (in part) be due to SATs instead of to “real” effects. Furthermore, the (sometimes subjective) decision to interpret the one or the other measure would yield conflicting conclusions regarding the direction of the effect. Therefore, ignoring either would obviously be wrong.

These intricate situations might be resolved by using a combination of RTs and PCs, such that (a) possible SAT contributions are cancelled (or at least dramatically attenuated), while (b) “real” effects remain in the data. There are suggestions of such combined measures in the literature, and they have been employed when the data show signs of SATs (e.g., Kunde, Pfister, & Janczyk, 2012; Kristjánsson, 2016). The goal of this article is to test whether these available measures conform to the two criteria just mentioned, and to arrive at well-founded recommendations as to whether or not to use them. For this purpose, we simulate pure SATs and “real” effects with the diffusion model (Ratcliff, 1978) and determine the degree to which each measure attenuates SATs and maintains “real” effects. Additionally, we will formally introduce and test a new alternative that was conceived by Liesefeld et al. (2015).

Measures to combine speed (RT) and accuracy (PC)

In the following section we describe four measures in the literature that combine RT and accuracy into a single performance measure. To make perfectly clear how the different measures are calculated, an example data set and the respective calculations are given in Table 1.

Table 1 Example calculations for the inverse efficiency score (IES; Eq. 1), rate-correct score (RCS; Eq. 2), linear integrated speed–accuracy score (LISAS; Eq. 3), and balanced integration score (BIS; Eq. 4)

Inverse efficiency score (IES)

The most often suggested combined measure is the inverse efficiency score (IES; Townsend & Ashby, 1983), which is typically defined as mean correct RTs divided by PCs (Akhtar & Enns, 1989; Bruyer & Brysbaert, 2011):

$$ {IES}_{i,j}=\frac{\overline{RT_{i,j}}}{PC_{i,j}} $$
(1)

where \( \overline{RT_{i,j}} \) is participant i’s mean RT on correct-response trials in condition j and \( {PC}_{i,j} \) is participant i’s proportion of correct responses in condition j. Although most (if not all) studies using or evaluating IES have included only correct RTs, all RTs (including those from error trials) seem to be taken into account according to the original proposal (Townsend & Ashby, 1983, p. 204). We will, however, evaluate the version without incorrect RTs, because this is the one typically reported in empirical studies. This measure can be interpreted as “the average energy consumed by the system over trials” (Townsend & Ashby, 1983, p. 204).
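As a minimal sketch of Eq. 1 (not the authors' code), the following Python function computes IES for a single participant-by-condition cell. The function name, the trial-level input format (parallel lists of RTs and correctness flags), and the example values are assumptions made for illustration.

```python
from statistics import mean

def inverse_efficiency_score(rts, correct):
    """IES for one cell: mean RT of correct trials divided by proportion correct."""
    # Only RTs from correct-response trials enter the numerator (the variant evaluated here).
    correct_rts = [rt for rt, ok in zip(rts, correct) if ok]
    pc = sum(correct) / len(correct)
    # Note: a cell without any correct response has no defined IES (mean() raises).
    return mean(correct_rts) / pc

# Hypothetical example cell: five trials, one error.
example_rts = [500, 520, 480, 610, 450]
example_correct = [True, True, True, False, True]
example_ies = inverse_efficiency_score(example_rts, example_correct)  # 487.5 / 0.8
```

In this example, the mean correct RT is 487.5 and PC is .8, so the score is inflated relative to the mean RT in proportion to the error rate.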

Rate-correct score (RCS)

An alternative suggestion is the rate-correct score (RCS; Woltz & Was, 2006):

$$ {RCS}_{i,j}=\frac{NC_{i,j}}{\sum_{k=1}^{n_{i,j}}{RT}_{i,j,k}} $$
(2)

where \( {NC}_{i,j} \) is participant i’s number of correct responses in condition j and the denominator reflects the total time participant i spent on trials in condition j (in other words, the sum of RTs across all \( n_{i,j} \) trials of participant i in condition j). RCS “can be interpreted directly as number of correct responses per unit time” (Woltz & Was, 2006, p. 673). As is evident from a comparison of Eqs. 1 and 2 and as detailed in Appendix A, RCS is similar or even identical to the inverse of IES (see also Vandierendonck, 2018), and therefore no strong differences between the two are to be expected. Both measures also bear some similarity to reward rate (e.g., Balci et al., 2011; Gold & Shadlen, 2002), which instead of mere RTs takes into account all of the time between two responses (i.e., also the time between a response and the next trial).
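A minimal sketch of Eq. 2, again with a hypothetical function name and made-up example data. It also illustrates the arithmetic behind the similarity to IES: dividing numerator and denominator of Eq. 2 by the number of trials shows that 1/RCS equals the mean RT over all trials divided by PC, which coincides with IES only if IES is computed from all RTs rather than from correct RTs alone.

```python
def rate_correct_score(rts, correct):
    """RCS for one cell: number of correct responses per unit of total response time."""
    # The denominator sums RTs over ALL trials, including error trials.
    return sum(correct) / sum(rts)

# Hypothetical example cell: five trials (RTs in ms), one error.
example_rts = [500, 520, 480, 610, 450]
example_correct = [True, True, True, False, True]
example_rcs = rate_correct_score(example_rts, example_correct)  # 4 / 2560 per ms

# Inverse of RCS = mean RT across all trials / PC (here 512 / 0.8).
mean_rt_all = sum(example_rts) / len(example_rts)
pc = sum(example_correct) / len(example_correct)
ies_all_rts = mean_rt_all / pc
```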

Linear integrated speed–accuracy score (LISAS)

Vandierendonck (2017) suggested the linear integrated speed–accuracy score (LISAS), which is defined as

$$ {LISAS}_{i,j}=\overline{RT_{i,j}}+\frac{S_{RT_{i,j}}}{S_{PE_{i,j}}}\cdot {PE}_{i,j} $$
(3)

where \( \overline{RT_{i,j}} \) is participant i’s mean RT on correct-response trials in condition j and \( {PE}_{i,j} \) is participant i’s proportion of errors (1 − PC) in condition j. Note that both the mean RT and \( {S}_{RT_{i,j}} \) include only correct trials and that \( {S}_{RT_{i,j}} \) and \( {S}_{PE_{i,j}} \) are the across-trial sample standard deviations of participant i in condition j (i.e., with n in the denominator, and not the unbiased population estimate with n – 1 in the denominator).Footnote 1

Whereas IES and RCS were constructed to have a straightforward interpretation (viz. the energy consumed on average and the number of correct responses per second of activity), the goal of LISAS is to obtain a linear integration.
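Eq. 3 can be sketched in Python as follows. The function name and example data are hypothetical. Standard deviations use n in the denominator, as described above. Returning the plain mean RT when a cell contains no errors (so that the standard deviation of the error indicator is zero) is an assumption of this sketch, made only to avoid a division by zero.

```python
from statistics import mean, pstdev

def lisas(rts, correct):
    """LISAS for one cell: mean correct RT plus the error proportion, rescaled to RT units."""
    correct_rts = [rt for rt, ok in zip(rts, correct) if ok]
    errors = [0 if ok else 1 for ok in correct]   # trial-wise error indicator
    pe = mean(errors)                              # proportion of errors
    s_rt = pstdev(correct_rts)                     # SD of correct RTs (n in denominator)
    s_pe = pstdev(errors)                          # SD of error indicator (n in denominator)
    if s_pe == 0:
        # Assumption of this sketch: no errors in the cell, so nothing is added to the mean RT.
        return mean(correct_rts)
    return mean(correct_rts) + (s_rt / s_pe) * pe

# Hypothetical example cell: five trials, one error.
example_rts = [500, 520, 480, 610, 450]
example_correct = [True, True, True, False, True]
example_lisas = lisas(example_rts, example_correct)
```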

Balanced integration score (BIS)

Liesefeld et al. (2015) devised a measure that focuses on giving equal weights to RTs and PCs, which we term balanced integration score (BIS).Footnote 2 It is calculated by first standardizing RTs and PCs to bring them to the same scale and then subtracting one standardized score from the other:Footnote 3

$$ {BIS}_{i,j}={z}_{P{C}_{i,j}}-{z}_{\overline{R{T}_{i,j}}} $$
(4)

with \( {z}_{x_{i,j}}=\frac{x_{i,j}-\overline{x}}{S_x} \).

Thus, BIS is the difference between standardized PCs and standardized mean correct RTs. The mean and sample standard deviation must be calculated over all cells that contribute relevant variance. If values were, for example, standardized per condition, all conditions would have the same mean value of 0, thus eliminating any effects. Therefore, \( \overline{RT} \), \( \overline{PC} \), \( S_{RT} \), and \( S_{PC} \) are typically calculated across all observed mean RTs and all PCs from the analyzed experiment (including all subjects and all conditions).
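A minimal sketch of Eq. 4, standardizing across all cells of a (hypothetical) experiment at once. Function names and example values are illustrative assumptions. The sketch uses the standard deviation with n in the denominator; using n – 1 instead would only rescale all BIS values by a constant factor.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize values using the mean and SD computed over ALL supplied cells."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def balanced_integration_scores(mean_correct_rts, pcs):
    """BIS per cell: standardized PC minus standardized mean correct RT."""
    z_rt = zscores(mean_correct_rts)
    z_pc = zscores(pcs)
    return [zp - zr for zp, zr in zip(z_pc, z_rt)]

# Hypothetical cells (all participants x conditions of one experiment, flattened).
cell_mean_rts = [500, 600, 550, 650]
cell_pcs = [0.95, 0.85, 0.90, 0.80]
bis = balanced_integration_scores(cell_mean_rts, cell_pcs)
```

Because both constituents are centered across all cells, the BIS values sum to zero, and higher values indicate better performance (faster and more accurate) relative to the grand mean.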

Assessing the combined measures

Quality criteria for combined measures

As we stated above, we believe that a combined measure should ideally (a) cancel out SATs and (b) maintain “real” effects. Other criteria, however, have been put forward and are discussed in the following section.

Bruyer and Brysbaert (2011) examined how well IES worked in several empirical data sets in comparison to RTs or PCs alone. Their criterion was whether IES “clarifies matters” (p. 9). In particular, they checked (a) whether effects in RTs or PCs were preserved in IES, (b) whether new effects emerged in IES, and (c) whether IES would yield a more orderly data pattern than its constituents. After comparing the result patterns in RTs, PCs, and IES in data from several studies, they plaintively concluded: “It looks pretty much like every type of change is possible with the introduction of IES” (p. 9). This indicates that employing convenience samples of empirical data sets is not well suited to examining the suitability of a given combined measure. A promising and flexible alternative is creating artificially generated data, in which the relative contributions of SATs and “real” effects can be known.

Vandierendonck (2017) created such artificial data in order to examine properties of IES, RCS, and LISAS. He simulated pure effects on RTs or PEs and effects in both variables in the same or opposing directions and found that all three combined measures performed quite well in recovering these effects, with RCS and LISAS working better than IES.Footnote 4 Although the effects in opposing directions can be conceived as SATs,Footnote 5 his evaluation was focused on whether effects are maintained or amplified in a given combination of speed and accuracy. In contrast to this standpoint, we believe that the most important property of a combined measure should be its insensitivity to SATs, yielding ideally no or considerably reduced effects when the effects on RTs and PEs point in opposing directions. Only then would using the combined measure avoid interpreting spurious effects. Indeed, when patterns of RTs and PEs were opposing, the combined measures examined by Vandierendonck somewhat reduced the effects, but these SAT effects were still alarmingly large (see his Table 11 and Fig. 3d).

Interestingly, whereas Bruyer and Brysbaert (2011) advised against the use of IES, Vandierendonck’s (2017) conclusion was more favorable for this particular measure (see also Vandierendonck, 2018). Both Vandierendonck (2017) and Bruyer and Brysbaert concluded that one should always examine RTs and PCs/PEs separately in addition to any combined measure. However, if such an examination is indeed always needed, it appears questionable to us whether anything is gained by examining the combined measure at all, or whether the additional analysis simply enlarges the Results section (and the alpha error).

Simulating SAT and “real” effects with a diffusion model

As was exemplified above, attempts to assess the behavior of the combined measures with empirical data have the disadvantage that the degree of SATs and the impacts of other variables on RTs and PCs (such as “real” effects) are unknown. To gain pure SATs and pure effects, we based our assessment of combined speed–accuracy measures on data artificially generated by the well-established diffusion model (Ratcliff, 1978; Ratcliff, Smith, Brown, & McKoon, 2016; Ulrich, Schröter, Leuthold, & Birngruber, 2015; Vandekerckhove & Tuerlinckx, 2007; Voss & Voss, 2007; Wagenmakers, van der Maas, & Grasman, 2007). In particular, we modeled SATs and “real” effects on RTs and PCs by variation of threshold separation and drift rate, respectively (see below for details on the model and the parameters). Furthermore, as we will demonstrate below (see the Balanced Integration of Measures section), the relative weighting of RTs and PCs depends on the accuracy level, and the different combined measures’ effectiveness in canceling SATs and maintaining “real” effects varies differentially across accuracy ranges. Therefore, we densely sampled accuracies ranging from pure guessing (50%) to virtually perfect performance (100%) instead of picking only a few points from this spectrum.

Diffusion models (Ratcliff, 1978) are a class of random-walk models that have been successfully applied to modeling decision behavior and predicting RT distributions and PCs in a variety of paradigms and fields of research (for recent reviews, see Forstmann, Ratcliff, & Wagenmakers, 2016; Ratcliff et al., 2016; Voss, Voss, & Lerche, 2015; Wagenmakers, 2009). The basic idea of these models (see Fig. 1 for an illustration) is that a diffusion process starts at a specified point and noisily accumulates evidence with a certain drift rate v (reflecting the strength of evidence) until one of two thresholds is exceeded. In one interpretation of the diffusion model, the upper threshold a is associated with a correct response and the lower threshold 0 represents an erroneous response. Although the exact starting point can theoretically vary between 0 and a, most often it is set to a/2, thus without any bias toward one or the other threshold. Once the accumulated evidence exceeds one of the thresholds, the correct or wrong decision is made (and a response is given, in many experimental settings). Although the drift rate v drives the diffusion process in the direction of a, noise can cause the diffusion process to reach 0, and thus an error results. Typically, the time from accumulation onset until a threshold is reached is considered the decision time, and additional processes such as perception, non-decision-related cognitive processes, and motor execution are captured by a nondecisional constant t0. For a more complete description of parameters, see, for example, Voss et al. (2015) or Ratcliff et al. (2016).

Fig. 1

Illustration of a simple diffusion model. In this example, the upper threshold a is associated with correct responses and the lower threshold, at 0, is associated with erroneous responses. Without any bias, evidence accumulation starts at a/2. On each time step, a fixed drift rate and random noise are added to the evidence. The green line represents a trial with a (relatively) high drift rate (hitting the threshold a rather early), whereas the blue line represents a trial with a (relatively) small drift rate (hitting the threshold a rather late). Due to random noise, it is also possible that the accumulated evidence will hit the lower threshold at 0, thus representing an erroneous response.

Of particular importance for the present purposes, larger values of a will lead to longer RTs (since it takes longer to reach a threshold) but at the same time increase the likelihood of correct responses. In other words, varying a across simulated conditions can be used to induce a “pure” shift along the SAT continuum without any confound with “real” effects that would reflect between-condition differences in task difficulty (and that can be simulated by varying v or t0).
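The direction of both effects can be illustrated with the standard closed-form expressions for an unbiased Wiener diffusion process (starting point a/2, no variability in parameters), which give the probability of reaching the upper threshold and the mean decision time. The sketch below is an illustration under these simplifying assumptions, not part of the reported simulations. The function name is hypothetical, and the default values for σ and t0 merely mirror the values reported for the simulations in this article.

```python
import math

def wiener_predictions(a, v, sigma=4.0, t0=300.0):
    """Return (proportion correct, mean RT) for an unbiased Wiener diffusion process."""
    k = a * v / sigma ** 2
    # Probability of absorption at the upper (correct) threshold when starting at a/2.
    pc = 1.0 / (1.0 + math.exp(-k))
    # Mean time until either threshold is reached, plus the nondecisional constant.
    mean_decision_time = (a / (2.0 * v)) * math.tanh(k / 2.0)
    return pc, mean_decision_time + t0

# Raising the threshold separation at a fixed drift rate: slower but more accurate.
pc_low, rt_low = wiener_predictions(a=100, v=0.25)
pc_high, rt_high = wiener_predictions(a=200, v=0.25)
```

Under these expressions, increasing a raises both predicted values, whereas increasing v raises accuracy and shortens the mean decision time.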

In our simulations, a Wiener diffusion process was used (e.g., Ratcliff, 1978; Ulrich et al., 2015) in which noise is modeled as Brownian motion to which a (linear) drift function is added. We fixed the starting point at a/2. For the present purposes, we varied threshold separation a in order to induce an SAT, with a ∈ {5, 10, 15, …, 290, 295, 300}. We repeated the simulation with six different drift rates, v ∈ {0.20, 0.22, …, 0.30}, as an operationalization of “real” effects.Footnote 6 For each simulated cell, an individual value for the drift rate was sampled from a normal distribution with mean v and a standard deviation of 0.01, to induce error variance. Further error variance was induced by sampling the nondecision time t0 for each simulated cell from a normal distribution with mean 300 and a standard deviation of 20. Within each of these 60 (threshold separation a) × 6 (drift rate v) = 360 combinations, 100 trials were simulated as a Wiener diffusion process (i.e., Brownian motion with a positive drift):

$$ X(t)=B(t)\sigma + vt, $$

where B(t) represents the Brownian motion at time t. In our simulation, we used σ = 4. The simulated data were organized in such a way that the data could be conceived as 100 independent between-subjects experiments with n = 20 participants in each cell and random variation in each participant’s drift rate and nondecision time. For each simulated participant and each of the combinations of a and v, those variables were computed that were necessary in order to calculate IES, RCS, LISAS, and BIS in a subsequent step.Footnote 7 The full data set can be retrieved from https://osf.io/pyshv/
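The simulation of a single cell can be sketched as follows. This is a simple discrete-time approximation and not the authors' implementation: the time step of 1 (interpreted as 1 ms), the function names, and the example threshold separations are assumptions of this sketch, whereas σ = 4, the starting point a/2, and the distributions of v and t0 follow the description above.

```python
import random

def simulate_trial(a, v, sigma=4.0, t0=300.0, dt=1.0, rng=random):
    """Simulate one trial; return (RT, correct)."""
    x = a / 2.0                      # unbiased starting point
    t = 0.0
    step_sd = sigma * dt ** 0.5      # SD of the Brownian increment per time step
    while 0.0 < x < a:
        x += v * dt + rng.gauss(0.0, step_sd)   # drift plus noise
        t += dt
    return t0 + t, x >= a            # upper threshold = correct response

def simulate_cell(a, v_mean, n_trials=100, rng=random):
    """Simulate one cell with cell-specific drift rate and nondecision time."""
    v = rng.gauss(v_mean, 0.01)      # error variance in drift rate
    t0 = rng.gauss(300.0, 20.0)      # error variance in nondecision time
    trials = [simulate_trial(a, v, t0=t0, rng=rng) for _ in range(n_trials)]
    rts = [rt for rt, _ in trials]
    correct = [ok for _, ok in trials]
    return rts, correct

# Two hypothetical cells that differ only in threshold separation (a pure SAT).
rng = random.Random(1)
rts_low, correct_low = simulate_cell(a=20, v_mean=0.25, n_trials=200, rng=rng)
rts_high, correct_high = simulate_cell(a=200, v_mean=0.25, n_trials=200, rng=rng)
pc_low = sum(correct_low) / len(correct_low)
pc_high = sum(correct_high) / len(correct_high)
```

A finer time step would approximate the continuous process more closely at the cost of longer run times.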

In a nutshell, we orthogonally manipulated threshold separation and drift rate and added some error variance. The variation in threshold separation emulates a variation in SATs, and the variation in drift rate emulates a variation in “real” effects. Given the goal to examine how well the various combined measures cancel SATs, the main focus in the following analyses was on the threshold separation. The drift rate was varied orthogonally for two purposes: (a) to check whether the threshold-dependent behavior of all measures generalizes across various drift rates and (b) to subsequently test how well the combined measures maintain “real” effects. Error variance was added so as to increase the comparability to real data and to prevent even negligible effects from becoming significant.

Effects on speed and accuracy

Figure 2 visualizes RTs and PCs of the simulated data as a function of threshold separation a and drift rate v. As desired, RTs were in a range typically observed in cognitive experiments, and importantly, responses speeded up with increasing v and slowed down with increasing a (Fig. 2, left panel). At the same time, responses were more accurate with higher values for drift rate v and with larger threshold separations a (Fig. 2, right panel). Also, PCs ranged from close to pure guessing (50%) to virtually perfect performance (100%). Thus, the data cover the whole spectrum of typically observed effects, and our variation of threshold separation a implements an SAT: the higher the threshold separation, the slower but more accurate the responses. The data set is complemented by a “real” effect as induced by the drift-rate manipulation. Hence, this simulated data set is well-suited for examining the properties of the combined measures.

Fig. 2

RTs and PCs as a function of threshold separation a and drift rate v. For a given drift rate, both RTs and PCs increase with increasing threshold separation. This is the pattern expected in the case of a speed–accuracy trade-off.

Balanced integration of measures

In the absence of any good reason to amplify the influence on the combined measure of either RTs or PCs, it appears reasonable to give equal weight to both constituents. An operational definition of such balanced integration is that the combined measure shares as much variance with RTs as with PCs. This is achieved in BIS by design (see Appendix B). To assess the relative contributions of RTs and PCs to the other combined measures, we calculated for each combined measure M (M ∈ {IES, RCS, LISAS}) the index \( I_M \) as

$$ {I}_M=\frac{r_{RT,M}^2}{r_{PC,M}^2} $$
(5)

where \( {r}_{RT,M}^2 \) is the squared Pearson correlation of M with RTs and \( {r}_{PC,M}^2 \) is the squared Pearson correlation of M with PCs. If RTs and PCs contribute to the same degree to the measure M, the index should take a value of \( I_M \approx 1 \). In contrast, \( I_M < 1 \) means a dominance of PC, and \( I_M > 1 \) means a dominance of RT, with the extreme cases of \( I_M = 0 \) (exclusively influenced by PC) and \( I_M = \infty \) (exclusively influenced by RT). For each measure, \( I_M \) was calculated for each combination of drift rate v and threshold separation a. In particular, correlations were calculated separately for each of the 100 experiments, Fisher z-transformed, averaged across experiments, transformed back, and entered into Eq. 5.
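The computation of the index in Eq. 5 can be sketched as follows. All function names and the example correlations are hypothetical. The sketch assumes that no correlation is exactly ±1, because the Fisher z-transformation is undefined there.

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_r(rs):
    """Average correlations via Fisher z: transform, average, transform back."""
    return math.tanh(mean([math.atanh(r) for r in rs]))

def balance_index(r_rt_per_experiment, r_pc_per_experiment):
    """Eq. 5: squared averaged correlation with RT over that with PC."""
    r_rt = average_r(r_rt_per_experiment)
    r_pc = average_r(r_pc_per_experiment)
    return r_rt ** 2 / r_pc ** 2

# Hypothetical per-experiment correlations of a measure M with RTs and with PCs.
index_balanced = balance_index([0.5, 0.5], [-0.5, -0.5])   # equal shared variance
index_rt_heavy = balance_index([0.9, 0.9], [0.3, 0.3])     # RT dominates
```

Squaring the correlations makes the index insensitive to their sign, so a measure that decreases with PC and increases with RT to the same degree still counts as balanced.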

As is evident in Fig. 3, IES, RCS, and LISAS exhibit a pattern considerably deviating from a balanced integration of RTs and PCs. For RCS and IES, the balance changes depending on a: With smaller threshold separations, the influence of RTs is negligible, whereas at larger separations, RTs take over so as to affect the measures predominantly. Furthermore, the influence of RTs increases with increasing drift rates v for medium to large threshold separations a. For LISAS, the pattern is different: RTs dominate at low levels of a, and the respective influences become more balanced at higher levels (yet never reaching the point of balanced integration at \( I_{LISAS} = 1 \)).

Fig. 3

Relative contributions of RTs and PCs to three of the combined measures, as a function of threshold separation a and drift rate v. The point of balanced integration is indicated by the dotted line at \( I_M = 1 \). Values below 1 indicate a predominance of accuracy, and values above 1 indicate a predominance of RTs. BIS is not plotted, because it integrates in a balanced manner by design, so that for all cells \( I_{BIS} = 1 \) (see Appendix B). Note that the y-axis scaling for LISAS differs from the other two measures because LISAS overemphasizes RTs to a much larger degree, and the overemphasis on PCs for low thresholds in IES and RCS would be disguised with an adapted scaling.

The behavior of the three combined measures might be problematic: Not only do these measures not weight RTs and PCs in a balanced manner, but the relative influences strongly depend on the accuracy level. Thus, even if it is possible to find a weighting factor so that RTs and PCs are balanced in one experimental condition, this weighting will yield unbalanced integration in other cells of the design. Consequently, none of the three measures integrates RTs and PCs in a balanced way. BIS differs in this regard, and in fact \( I_{BIS} = 1 \) across all levels of threshold separation a and drift rate v (see Appendix B for a formal proof).

However, whether a balanced and constant (across SATs) weighting is desirable is an open question. One advantage of the unbalanced and accuracy-dependent integration in IES and RCS could be that RTs contain much more relevant information when accuracies are close to ceiling, and accuracies contain more relevant information when RTs are very fast. Thus, in the following sections, we directly test more unambiguously desirable characteristics of any combined measure: Do they cancel SATs while maintaining “real” effects?

Testing the efficiency of the combined measures to compensate for SATs

In the worst case, an experimental manipulation would only induce a pure SAT. Analyzing either RTs or PCs would then yield spurious effects and wrong conclusions; analyzing both would yield contradictory results. In Fig. 2, for example, although threshold separation does not influence task difficulty, this parameter exerts strong effects on RTs and PCs. The combined measures should—ideally—compensate for these SATs and provide the same values irrespective of the threshold separation. Figure 4 visualizes the combined measures as a function of threshold separation a and drift rate v. From simple visual inspection, it becomes clear that none of the four measures fulfills this criterion perfectly, although the effect of threshold separation (i.e., of SATs) is clearly smaller for BIS than for all competing measures.Footnote 8

Fig. 4

IES, RCS, LISAS, and BIS as a function of threshold separation a (SATs) and drift rate v (“real” effects). All of the measures retain “real” effects (i.e., the differences between lines should be large) at reasonable levels of a (i.e., when there is sufficient time to accumulate evidence, so that PCs clearly deviate from guessing performance). However, effectiveness in cancelling SATs (the lines ideally should be flat) strongly differs between the various combined measures.

Due to the different y-axis scalings, the sizes of the SAT effect (and of the drift-rate effect) are not directly comparable between the four panels of Fig. 4. Additionally, the figures do not contain information on the statistical error variance, and one can therefore not gauge which differences are statistically significant. To address this question directly, we ran inferential statistics on different subsets of the data to report effect sizes and how often an SAT effect was statistically significant for a given measure. We will start with t tests and continue with analyses of variance (ANOVAs).

Comparisons of two conditions using t tests

We first concentrate on situations with two conditions in which the dependent measures are assessed by t tests. Therefore, we ran two-sample t tests comparing all combinations of threshold separation a for each of the 100 experiments while keeping drift rate v constant, thus looking at pure SATs. The results are illustrated in the form of heat maps, in which each point represents one comparison of two levels of a (as designated on the x- and y-axes). In Fig. 5, color codes the resulting values for effect size d (Cohen, 1988), and in Fig. 6, it codes the proportions of significant t tests. In both figures, the upper/left half of each heat map represents the data with drift rate v = 0.2, and the lower/right half the data with drift rate v = 0.3.
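For one such pairwise comparison, the effect size and test statistic can be sketched as below. The text does not specify which variant of d was used; this sketch assumes Cohen's d based on the pooled standard deviation and the classical equal-variance two-sample t test. The function name and example data are hypothetical.

```python
import math
from statistics import mean, variance

def cohens_d_and_t(x, y):
    """Return (Cohen's d, t statistic, degrees of freedom) for two independent groups."""
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    # Pooled variance from the unbiased (n - 1) group variances.
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / df
    d = (mean(x) - mean(y)) / math.sqrt(pooled_var)
    t = d / math.sqrt(1 / n1 + 1 / n2)
    return d, t, df

# Hypothetical scores on one measure for two levels of threshold separation.
group_a = [1, 2, 3, 4, 5]
group_b = [3, 4, 5, 6, 7]
d, t, df = cohens_d_and_t(group_a, group_b)
```

The p value then follows from the t distribution with the returned degrees of freedom. Repeating this for every pair of levels of a and every simulated experiment yields the entries of the heat maps.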

Fig. 5

Effect sizes d for pairwise comparisons of threshold separations a, reflecting SATs, at two different values of drift rate v (v = 0.2 and v = 0.3, above and below the diagonal, respectively). Each point in a panel denotes a comparison between two combinations of threshold separation (e.g., a = 250 vs. a = 200). The black diagonals indicate the absence of comparisons between a cell and itself (e.g., a = 250 vs. a = 250). Note that all measures except BIS are heavily influenced by the simulated SATs, with RTs alone and LISAS performing worst, followed by PCs. IES and RCS show lower (but still substantive, up to d = 2) effects of SATs. For most measures, the effect depends approximately linearly on the difference in threshold separation. IES and RCS diverge from this orderly pattern due to their highly nonlinear dependence on threshold separation, displayed in Fig. 4. The effects on BIS do not exceed d = 0.12.

Fig. 6

Proportions of significant pairwise comparisons (two-sample t tests) of threshold separations a, reflecting SATs, at two different values of drift rate v (v = 0.2 and v = 0.3, above and below the diagonal, respectively). Each point in a panel denotes a comparison between two combinations of threshold separation (e.g., a = 250 vs. a = 200). The black diagonals indicate the absence of comparisons between a cell and itself (e.g., a = 250 vs. a = 250).

As we intended through the simulation of SATs, for RTs and PCs, the d values clearly become larger, the larger the difference between the two levels of threshold separation a (with generally larger effects on RTs). The results obtained for LISAS are very similar to those for RTs, with only slightly smaller values for d. This is not surprising, given that LISAS mainly represents RTs (see Fig. 3). The d values for IES, RCS, and BIS, in contrast, are much smaller. However, IES and RCS both also exhibit an increase in d values when small to medium values for threshold separation a are compared with large values (the orange regions far off the diagonal). That these regions are not centered at the edges (as for RTs, PCs, and LISAS) results from the nonmonotonic behavior of both measures in the smaller range of threshold separation a (in particular, values with 0 < a ≤ 50 do not follow the general trend of rising [IES] or falling [RCS] with increasing levels of a; see Fig. 4).

Additional, subtle patterns become more clearly visible when we consider the proportions of significant t tests, in Fig. 6 (recall that we simulated 100 experiments for each cell). First, for RTs, PCs, and LISAS, nearly all t tests are significant, with exceptions only close to the diagonal—that is, for small differences in SAT. Another effect of the above-mentioned nonmonotonic behavior of IES and RCS (Fig. 4) becomes apparent in Fig. 6, for comparisons of the smallest threshold separations to small-to-medium threshold separations: Deviating from the high incidence of significant tests in other regions, only about 20%–30% of the t tests are significant, yielding the yellowish stripes in the lower left corners of the graphs. This is again due to the nonmonotonic behavior of IES and RCS as a function of threshold separation, as described above and visible in Fig. 4. For BIS, large areas with no or very few significant t tests are apparent. Surprisingly, some tests are significant around the diagonal—that is, for small differences in SAT. Additionally, more tests are significant for the higher drift rate of v = 0.3, in particular when combined with high threshold separations. However, the proportion of significant tests is lower than for any competing measure for virtually all comparisons.

Comparisons of three conditions using ANOVAs

The examined measures are by no means restricted to comparisons of two conditions (in contrast to other alternatives, such as the binning score; Draheim et al., 2016; Hughes et al., 2014). To get an impression of their behavior in more complex situations, we also considered the case of three conditions, as would be assessed with ANOVAs. Unfortunately, this more complex design does not allow an exhaustive examination of all possible comparisons, as was possible with pairwise comparisons. Instead, we drew three random levels (without replacement) 100 times for each experiment and drift rate v. Then we calculated a one-way between-subjects ANOVA on each data set. Analogous to the analyses with t tests above, Fig. 7 visualizes the mean effect sizes η² and the mean proportions of significant ANOVAs. As we expected, the ANOVA almost always revealed high effect sizes and significant effects for RTs and PCs. However, this was also true for IES, RCS, and LISAS. For BIS, the effect sizes and proportions of significant ANOVAs were considerably smaller. Surprisingly and in contrast to the other measures, the effect of SATs on BIS depends on the size of the “real” effect, with an increase from around 50% to 60% significant with increasing drift rate v.
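One iteration of this procedure can be sketched as follows: draw three levels of threshold separation without replacement and compute a one-way between-subjects ANOVA with its effect size. The function name, the seed, and the toy data are hypothetical, and η² is computed here as the ratio of between-groups to total sum of squares, which is an assumption about the exact variant used.

```python
import random
from statistics import mean

def one_way_anova(groups):
    """Return (F, eta squared, df_between, df_within) for independent groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f, eta_squared, df_between, df_within

# Draw three of the 60 simulated threshold separations without replacement.
levels = list(range(5, 305, 5))
rng = random.Random(0)
drawn_levels = rng.sample(levels, 3)

# Toy data standing in for one measure in the three drawn conditions.
f, eta_squared, df_between, df_within = one_way_anova([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```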

Fig. 7

Mean effect sizes η² and the proportions of significant one-way analyses of variance (between subjects) with three (randomly drawn) threshold separations for each measure.

Testing the efficiency of the combined measures to maintain “real” effects

One trivial explanation for the efficiency of BIS in canceling SAT effects would be that it cancels any effect. If this were the case, BIS would be useless as a combined measure. Although Fig. 4 already shows that this is unlikely, we also analyzed “real” effects, as simulated by variations in drift rate, to address this concern more formally (Figs. 8 and 9 for combinations of drift rates as assessed with t tests; Fig. 10 for averages across all combinations of three drift rates, as assessed with ANOVAs; all calculations were performed for four exemplary threshold separations with a ∈ {5, 100, 200, 300}). As can be seen, all combined measures nicely maintain “real” effects, and sometimes even enhance them. For very small threshold separations—where performance is close to chance (pure guessing)—variations in drift rate have no effect on any of the examined measures (see also Fig. 4). This confirms the usual recommendation not to consider data (in particular RTs) when performance is close to chance.

Fig. 8

Effect sizes d for pairwise comparisons of drift rates v, reflecting “real” effects, at four different values of threshold separation a (panel a: a = 5 and a = 100; panel b: a = 200 and a = 300, above and below the diagonal, respectively). Each point (square) in a panel denotes a comparison between two combinations of drift rate (e.g., v = 0.20 vs. v = 0.22). The black diagonals indicate the absence of comparisons between a cell and itself (e.g., v = 0.20 vs. v = 0.20). Note that all combined measures reveal “real” effects as well as or better than any of the constituents (RTs or PCs) in most comparisons, and that BIS reveals these effects virtually as well as any other competitor in most comparisons, and only slightly worse than the best competitor in some comparisons (for a = 300, lower right areas in panel b, in particular).

Fig. 9

Proportions of significant pairwise comparisons of drift rates v (two-sample t tests), reflecting “real” effects, at four different values of threshold separation a (panel a: a = 5 and a = 100; panel b: a = 200 and a = 300, above and below the diagonal, respectively). Each point (square) in a panel denotes a comparison between two combinations of drift rate (e.g., v = 0.20 vs. v = 0.22). The black diagonals indicate the absence of comparisons between a cell and itself (e.g., v = 0.20 vs. v = 0.20).

Fig. 10

Mean effect sizes η2 and the proportions of significant one-way ANOVAs averaged across all possible combinations (without replacement) of three drift rates (between subjects) for each measure.

Discussion

In the present study, we examined the usefulness of several approaches for combining response times (RTs; speed) and proportions correct (PC; accuracy) to control for speed–accuracy trade-offs. Instead of using empirical data in which the levels of SAT are unknown, we simulated pure SATs and “real” effects without any confound, by varying threshold separation and drift rate in the diffusion model (Ratcliff, 1978). Arguably, a useful combined measure should cancel any effects of differential SATs, so that it is insensitive to the specific trade-off along the SAT continuum, without markedly attenuating “real” effects. Whereas all combined measures fulfilled the latter criterion, BIS came far closer to the first criterion than the other alternatives, and it might be worth a recommendation whenever a combination of RTs and PCs is desired.

Recommendations on the use of combined measures

Advantages of combining RTs and PCs

There are several potential reasons why a researcher may want to combine RTs and PCs. First, when using BIS, SATs are canceled to a large degree, thus considerably decreasing the likelihood of interpreting spurious effects that are mainly driven by SATs.

Second, combining RTs and PCs can yield a gain in statistical power in two (not necessarily mutually exclusive) situations: (a) If some participants focus more on speed and others focus more on accuracy, the effects of experimental manipulations will be distributed across RTs and PCs, and a combination of the two can potentially reconstitute the full effect (see also Hughes et al., 2014, p. 705). (b) For situations in which there is no clear theoretical reason to focus on either RTs or PCs, testing both would yield an inflation of alpha error and, thus, require an adaptation of the alpha level (i.e., with the typical level of α = .05, only tests with p < .025 could safely be considered significant). Deciding a priori to analyze BIS instead would allow for maintaining the original alpha level.
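The arithmetic behind point (b) can be sketched as follows. The sketch assumes a Bonferroni correction and, for the family-wise error rate, two independent tests; the independence assumption is for illustration only, because RTs and PCs from the same participants are usually correlated.

```python
# Two tests (one on RTs, one on PCs), each run at the nominal alpha level.
alpha = 0.05
n_tests = 2

# Probability of at least one false positive if the two tests were independent
# (an illustrative assumption; RTs and PCs are usually correlated).
familywise_error = 1 - (1 - alpha) ** n_tests

# Bonferroni-adjusted per-test threshold that keeps the family-wise rate
# at or below alpha.
bonferroni_alpha = alpha / n_tests

print(round(familywise_error, 4))  # 0.0975
print(bonferroni_alpha)            # 0.025
```

A single preregistered test on BIS involves only one comparison, so no such adjustment is required.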

When to combine RTs and PCs

Before using any combination of RTs and PCs (or any other measure), the researcher must, of course, critically ask whether this combination makes theoretical sense in the given situation. Only when the cognitive process of interest affects both RTs and PCs (see Vandierendonck, 2017, p. 654), and when a trade-off is possible (i.e., when the process is more error-prone when it is speeded, like a decision), can RTs and PCs reasonably be interpreted as the result of a common underlying process. In situations in which RTs and PCs are mainly influenced by different cognitive mechanisms, combined measures should be avoided. As an example, in the change-detection task (a typical task to assess visual working memory), participants see two subsequent arrays of objects and have to decide whether these are identical or whether one object has changed in between. In this task, the researcher might be interested in the capacity of working memory, which mainly affects PCs (when capacity is exceeded; Alvarez & Cavanagh, 2004; Luck & Vogel, 1997, 2013), or in the efficiency of the comparison between working memory entries and a test display, which mainly affects RTs (Gilchrist & Cowan, 2014; Hyun, Woodman, Vogel, Hollingworth, & Luck, 2009; Liesefeld, Liesefeld, Müller, & Rangelov, 2017). Combining RTs and PCs would confound capacity limitations and comparison efficiency, and therefore would complicate rather than clarify the interpretation of potential effects.

Risk of p hacking

It might be tempting to check one or several combined measures whenever RTs and PCs yield a nonsignificant, but “trending,” result in the same direction. Interpreting any resulting effect as confirmatory evidence would, of course, be misleading, due to the associated inflation in alpha error. It is, however, perfectly fine to decide in advance that BIS will be analyzed when the theory makes no clear predictions as to whether an effect influences RTs or PCs, or when the effect is expected to be distributed across both measures (e.g., due to inter- and intra-individual variation in SATs).

Sample dependence of BIS

The standardization across subjects and conditions involved in the calculation of BIS implies a major deviation from all previous combined measures: There is no BIS for a single cell; instead, individual values reflect whether performance was below (BIS < 0) or above (BIS > 0) the average performance (across all subjects and cells) for the respective subject in the respective cell of the design. In other words, a particular value of BIS is not only influenced by the performance of the respective subject in the respective cell, but also by the performance of the subject in other cells and the performance of the other subjects in the sample. In fact, the way the standardization is performed is the main difference from LISAS, which standardizes per subject/condition and thus provides sample-independent performance values (Vandierendonck, 2017, 2018).
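To make the sample dependence concrete, the following minimal sketch computes BIS as standardized PC minus standardized mean RT, with both standardizations taken across all cells. The data are hypothetical, the helper names `zscores` and `bis` are ours, and the use of the sample standard deviation (rather than the population standard deviation) is an assumption that affects only the scale of the scores.

```python
from statistics import mean, stdev

def zscores(values):
    """Standardize values using the mean and sample SD of the whole list."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def bis(mean_rts, pcs):
    """BIS per subject-by-condition cell: standardized PC minus standardized mean RT.

    Both lists hold one entry per cell, pooled over all subjects and conditions,
    so every score depends on the entire sample.
    """
    return [z_pc - z_rt for z_pc, z_rt in zip(zscores(pcs), zscores(mean_rts))]

# Hypothetical data: two subjects x two conditions, flattened into four cells.
mean_rts = [520.0, 610.0, 480.0, 700.0]  # mean RT per cell, in ms
pcs = [0.95, 0.90, 0.85, 0.80]           # proportion correct per cell

scores = bis(mean_rts, pcs)
print([round(s, 2) for s in scores])

# Replacing the second subject's data changes the scores of the first
# subject's cells as well, although those cells are untouched.
shifted = bis([520.0, 610.0, 900.0, 990.0], [0.95, 0.90, 0.60, 0.55])
print(round(scores[0], 2), round(shifted[0], 2))
```

The first cell keeps its raw RT and PC in both calls, yet its score differs, which is exactly the sample dependence described above.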

The reader might wonder whether this sample dependence of BIS is problematic for its use. The answer to this question depends on the goal of calculating the measure. If the goal is to determine the absolute performance (i.e., without comparison to a specific group) of a particular subject in a particular task (this might sometimes apply during job recruiting or grading of academic achievements), BIS is not suited. It is, however, well suited for determining relative performance—that is, whether one (group of) subject(s) is better than another (group of) subject(s), or one condition is more difficult than another condition. This is exactly the type of question typically asked in (experimental) psychological research, for which the sample dependence of BIS is, consequently, unproblematic. On the contrary, concerning this type of question, BIS is often easier to interpret than the constituent measures (RTs and PCs), because it directly expresses whether a (group of) subject(s) performs above or below average (BIS > 0 or BIS < 0, respectively).

To approach this question from another vantage point, consider that statistical tests are insensitive to linear transformations such as the standardizations involved in BIS. In particular, the difference between two conditions in mean RTs will result, by definition, in the exact same t value as the difference of any linear transformation, if the transformation is applied uniformly to all RTs (such as a standardization with the same mean and the same standard deviation used in the calculation of BIS; see Footnote 9). This feature of linear transformations also means that it is not the standardization, but the additive component (the subtraction), that does the job of controlling for SATs (as we demonstrated above).
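This invariance can be checked numerically. The sketch below uses hypothetical RTs and a hand-written pooled-variance t statistic (the function name `t_independent` is ours); it compares the t value for raw RTs with the t value obtained after standardizing all RTs with one common mean and one common standard deviation.

```python
from math import sqrt
from statistics import mean, stdev, variance

def t_independent(x, y):
    """Student's two-sample t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled * (1 / nx + 1 / ny))

# Hypothetical mean RTs (ms) for two independent groups.
group_a = [512.0, 548.0, 530.0, 575.0, 498.0]
group_b = [601.0, 587.0, 640.0, 566.0, 622.0]

# Standardize with ONE mean and ONE SD taken over the pooled sample.
pooled_sample = group_a + group_b
m, s = mean(pooled_sample), stdev(pooled_sample)
z_a = [(v - m) / s for v in group_a]
z_b = [(v - m) / s for v in group_b]

t_raw = t_independent(group_a, group_b)
t_z = t_independent(z_a, z_b)
print(abs(t_raw - t_z) < 1e-9)  # True
```

Subtracting a constant cancels in the numerator, and dividing by a positive constant rescales numerator and denominator alike, so the ratio is unchanged.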

Standardizing across different subsamples

In the present examination, RTs and PCs were standardized across all conditions and all subjects for a given test. This is, in a way, the most conservative approach, because all variance is kept (see the previous section). There might, however, be situations in which it is reasonable to remove some of the variance. When the research focus is on a Group × Treatment interaction, for example, it might be a good idea to remove the main effect of group (and the related error variance) by standardizing per individual (e.g., Bush, Hess, & Wolford, 1993; Faust, Balota, Spieler, & Ferraro, 1999) or per group.

In general, when calculating BIS, it is important to carefully ponder what shall be compared and therefore should be minimally included in the standardization. If, for example, standardization was performed separately per condition, any differences between conditions would be removed by design, and it would be impossible to detect any effects. If there is no particular reason to exclude a particular contributor of variance (as in the Group × Treatment example above), we recommend including the mean RT and PC for all subjects, groups, and conditions of the experiment, because this maximizes the data basis for calculating means and standard deviations.
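The warning about standardizing separately per condition can be illustrated with a short sketch on hypothetical RTs (the same logic applies to PCs). Joint standardization preserves the difference between conditions, whereas per-condition standardization sets both condition means to zero.

```python
from statistics import mean, stdev

def zscores(values):
    """Standardize values using their own mean and sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical mean RTs (ms) of five subjects in an easy and a hard condition.
easy = [480.0, 510.0, 495.0, 530.0, 470.0]
hard = [620.0, 655.0, 600.0, 690.0, 640.0]

# Standardizing across both conditions keeps the condition difference.
z_all = zscores(easy + hard)
z_easy_joint, z_hard_joint = z_all[:len(easy)], z_all[len(easy):]
joint_diff = mean(z_hard_joint) - mean(z_easy_joint)

# Standardizing within each condition forces both condition means to zero.
separate_diff = mean(zscores(hard)) - mean(zscores(easy))

print(round(joint_diff, 2))
print(abs(separate_diff) < 1e-9)  # True
```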

Comparison to model fitting

In the present study, the combined measures were used to extract “real” effects that were simulated via manipulations of the drift-rate parameter of the diffusion model. In a way, the aim was to “recover” effects on drift rate and to ignore variations in another parameter (threshold separation). Obviously, the best way to recover any parameter of the diffusion model would be to fit the (simulated) data to the diffusion model itself. Indeed, a diffusion-model analysis of speed–accuracy data has several advantages in many situations (Forstmann et al., 2016; Ratcliff et al., 2016; Voss et al., 2015; Wagenmakers, 2009). Many, but not all, researchers would argue that major strengths of the diffusion model are that it is based on several well-validated theoretical assumptions and that its parameters are psychologically interpretable. These strengths, however, also restrict its use to specific situations, namely those in which a decision process is at the heart of the observed behavior.

BIS, in contrast, was developed on the basis of purely statistical considerations (the same is likely true for LISAS). It yields a balanced integration of RTs and PCs in any task, independent of what the task measures and which cognitive processes it involves. It would, of course, be advantageous to show for each specific task that BIS cancels SATs while maintaining “real” effects. Still, BIS was not specifically developed for decision processes and still performs quite well with data generated by a decision-process model; this gives us some confidence that BIS would perform equally well on data generated by other models/processes.

To elaborate a bit on how BIS might complement the modeling approach: First, most popular models (such as the diffusion model) focus on the decision process, whereas many phenomena of interest to cognitive psychologists are captured in the residual “nondecision time” (e.g., Schmitz & Voss, 2012). Arguably, SATs can also occur in nondecision components of a task (Rinkenauer, Osman, Ulrich, Müller-Gethmann, & Mattes, 2004). For example, in a mental rotation task, Liesefeld et al. (2015) found that participants differed from each other in the time they took for performing the rotation, whereby taking less time meant that the resulting rotated representation of the original stimulus was less accurate and therefore more errors were committed; thus, it was not the decision component of the task, but the rotation component preceding the decision that was influenced by SATs.

Second, it is an empirical fact that researchers do use combined measures (e.g., Collignon et al., 2008; Gabay, Nestor, Dundas, & Behrmann, 2014; Kristjánsson, 2016; Kunde et al., 2012; Mevorach, Humphreys, & Shalev, 2006; Petrini, McAleer, & Pollick, 2010; Röder, Kusmierek, Spence, & Schicke, 2007; Spence, Kingstone, Shore, & Gazzaniga, 2001a; Spence, Shore, Gazzaniga, Soto-Faraco, & Kingstone, 2001b), and that reviewers do request the use of these measures (according to our own experiences and informal reports from colleagues), especially if RTs and PCs show opposite patterns of effects (which would indicate condition-contingent SATs). One reason for using simple combinations of RTs and PCs instead of decision models might be that authors desire a combined measure that does not rely on a specific psychological theory. This is understandable if the research focus does not lie on decision making (which is the case for most of the studies cited above) or if behavioral data are secondary to the research question (as in many neuroimaging studies employing SAT measures; e.g., Kiss, Driver, & Eimer, 2009; Küper, Gajewski, Frieg, & Falkenstein, 2017; Reeder, Hanke, & Pollmann, 2017). Thus, from a practical standpoint, there is quite some demand for combined speed–accuracy measures.

Finally, BIS is easy to calculate, and therefore is potentially accessible to a wider range of researchers. Although easy-to-use implementations and accessible tutorials for the diffusion model and other evidence accumulation models are available (e.g., Donkin, Brown, & Heathcote, 2011; Voss et al., 2015; Wagenmakers et al., 2007; Wagenmakers, van der Maas, Dolan, & Grasman, 2008), their correct application and interpretation still require considerable theoretical background. Easy-to-use code for calculating BIS in Matlab, R, and Excel can be retrieved from https://github.com/Liesefeld/BIS.

EZ-diffusion model

Another powerful, yet easy to calculate, tool for combining speed and accuracy data is the EZ-diffusion model (Wagenmakers et al., 2008; Wagenmakers et al., 2007). Based on the diffusion model (with a few simplifying assumptions), this model provides simple equations for calculating drift rate v, threshold separation a, and nondecision time t0 on the basis of mean RTs, PCs, and the variance in RTs. In a way, the drift rate of EZ diffusion corresponds to the combined measures examined here (it would reflect our “real” effects while canceling out SATs). In addition, it provides estimates of threshold separation (and thus of the degree of SAT) and nondecision time. This model, of course, cannot be reasonably compared to the other combined measures with the present set of simulated data, because EZ diffusion is based on the same model that was used to generate the data. In fact, our simulations were designed so that they would meet all (or virtually all; see below) assumptions underlying EZ diffusion (which is not guaranteed with real data; see also the section below on Desirable Extensions of the Present Simulations).
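For orientation, the following sketch transcribes the closed-form EZ equations as they are commonly stated following Wagenmakers et al. (2007); it should be checked against that source before use. Several points are assumptions of the sketch rather than details of the present simulations: the scaling parameter s = 0.1, RTs expressed in seconds, the function name `ez_diffusion`, and the illustrative input values. No edge correction for PC values of 0, .5, or 1 is included.

```python
from math import copysign, exp, log

def ez_diffusion(pc, vrt, mrt, s=0.1):
    """Return (v, a, ter) from proportion correct, RT variance, and mean RT.

    RTs are in seconds. s is the scaling parameter; 0.1 is a common
    convention and not necessarily the scaling of any given simulation.
    """
    if pc in (0.0, 0.5, 1.0):
        # The closed-form equations are undefined here; an edge correction
        # of PC would be needed before calling this function.
        raise ValueError("pc must differ from 0, 0.5, and 1")
    logit = log(pc / (1 - pc))
    x = logit * (logit * pc**2 - logit * pc + pc - 0.5) / vrt
    v = copysign(1.0, pc - 0.5) * s * x**0.25   # drift rate
    a = s**2 * logit / v                        # threshold separation
    y = -v * a / s**2
    mean_decision_time = (a / (2 * v)) * (1 - exp(y)) / (1 + exp(y))
    ter = mrt - mean_decision_time              # nondecision time
    return v, a, ter

# Illustrative summary statistics for one participant in one condition.
v, a, ter = ez_diffusion(pc=0.802, vrt=0.112, mrt=0.723)
print(round(v, 3), round(a, 3), round(ter, 3))
```

Only three summary statistics per cell are needed, which is what makes the approach easy to calculate compared with fitting the full diffusion model.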

To illustrate the use of the EZ-diffusion model and to also validate our simulations, we extracted drift rate, threshold separation, and nondecision time from our data set using the EZ-diffusion model. The results are visualized in Fig. 11, and two interesting pieces of information are revealed: First, the results validate our simulation by showing that the parameters are recovered well in large parts. In particular, the drift-rate parameter seems mostly independent of the simulated threshold separation, but it is influenced by our “real” effects (as induced via variations in drift rate). Second, the results point to a limitation of the EZ-diffusion model whenever its assumptions are violated. The particular violation here (and elsewhere; e.g., Ratcliff, 2008) is that a trial is aborted after a while (a response deadline; here, around 3,500 ms), which implies that sometimes the decision process cannot finish. The larger the threshold separation is, the more often this happens, thus leading to distorted RT patterns in these cases (a few very long RTs are missing in the data; see note 7). In Fig. 11, this becomes most obvious in the overestimations of nondecision time with high threshold separations (right panel).

Fig. 11

Threshold separation a, drift rate v, and nondecision time t0 as extracted from the simulated data using the EZ-diffusion model (Wagenmakers et al., 2008; Wagenmakers et al., 2007), as a function of threshold separation a and drift rate v as implemented in the simulation generating the data set. Note that the mean nondecision time was set to 300 ms in our simulations.

Additionally, it remains to be investigated whether parameter extraction using the EZ-diffusion model has advantages over the other measures in canceling SATs when data were not generated with the diffusion model but with, for example, the leaky competing accumulator model (Usher & McClelland, 2001), the linear ballistic model (Brown & Heathcote, 2008), or the fast-guess model (Ollman, 1966; see Van Ravenzwaaij & Oberauer, 2009, for related comparisons).

Outlook: Open questions and future directions

Confounds of variation in threshold separation and drift rate

This article has treated only two idealized situations: pure SATs without any “real” effects (a threshold separation variation) and pure “real” effects without any SAT (a drift-rate variation). In these situations, there is no correlation between SATs and the “real” effects across conditions (because one of the two was always kept constant). All types of combinations of these two situations are, of course, possible and likely do occur in reality.

Unfortunately, a systematic investigation of confounds between threshold separation and drift rates is subject to a combinatory explosion (in the present case of 60 levels of threshold separation and six levels of drift rates, there are 360 possible combinations of these two parameters, yielding 64,620 pairwise comparisons) and is beyond the scope of the present article. That BIS cancels out pure SATs and leaves intact pure “real” effects is already important information, especially in light of our demonstrations that other measures dramatically fail already in the simplest situation of pure SATs. Nevertheless, preliminary explorations of this combinatory space are in line with the general pattern reported here: BIS cancels or strongly reduces effects of SATs, while typically maintaining “real” effects (see Appendix C for more details).
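The counts given above follow from elementary combinatorics, as this short check shows (the variable names are ours).

```python
from math import comb

n_threshold_levels = 60
n_drift_levels = 6

n_cells = n_threshold_levels * n_drift_levels  # parameter combinations
n_pairs = comb(n_cells, 2)                     # unordered pairs of distinct combinations

print(n_cells, n_pairs)  # 360 64620
```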

Desirable extensions of the present simulations

Although the diffusion model is arguably one of the best validated and most established models to reflect core cognitive processes employed in a wide range of experimental tasks (Forstmann et al., 2016; Ratcliff et al., 2016; Voss et al., 2015; Wagenmakers, 2009), it is likely that data simulated by this model differ from real data in various respects. Furthermore, although variations in threshold separation are the standard approach to induce SATs, there is some debate as to whether SATs are (typically) reflected in (pure) variations of threshold separation (Lerche & Voss, in press; Rae, Heathcote, Donkin, Averell, & Brown, 2014; Rinkenauer et al., 2004; Starns, Ratcliff, & McKoon, 2012; Voss et al., 2004), casting additional doubt on the validity of the present simulations. For these reasons, future studies should strive to confirm the results reported here with alternative operationalizations of SATs—namely, by using other parameter combinations (including nondecision time t0 and drift rate v; see, e.g., Rae et al., 2014; Rinkenauer et al., 2004) and/or other models (e.g., the leaky competing accumulator model, Usher & McClelland, 2001; the linear ballistic model, Brown & Heathcote, 2008; or the fast-guess model, Ollman, 1966), and, with some qualifications (see the Quality Criteria for Combined Measures section above), real data (see Bruyer & Brysbaert, 2011; Vandierendonck, 2018).

Other statistical tests

We have rather exhaustively tested comparisons of two independent samples, have only parenthetically tested designs with more levels of a factor (three), and have excluded any multifactorial and within-subjects designs. Although there is no reason to believe that other combined measures would gain the lead in these situations, and although the results look very similar in some preliminary explorations with such designs, these assumptions should be tested carefully and systematically in future work. Similarly, it appears likely that BIS improves results in correlative approaches (see, e.g., Draheim et al., 2016; Hughes et al., 2014; Van Ravenzwaaij & Oberauer, 2009), but this topic, too, must await future validation.

Unequal weighting of RTs and PCs

BIS was designed to integrate RTs and PCs in a balanced manner. There is no guarantee, however, that balanced weighting is ideal (this would be a rather surprising coincidence, in fact). Thus, future research should strive to determine which weighting of RTs and PCs is ideal in a given situation. For the time being, an equal integration of the two constituents seems the most reasonable choice to us. If it turns out that an unequal weighting is preferable, the equal (and constant across accuracy levels) weighting is still a convenient feature of BIS, because it makes it easy to adapt the relative weights of RTs and PCs. This can be achieved by simply adding a weighting parameter w to Eq. 4 (with 0 < w < 1), as in, for example:

$$ BIS_{i,j} = w \cdot z_{PC_{i,j}} - \left(1 - w\right) \cdot z_{\overline{RT_{i,j}}} $$

A major difficulty with such an endeavor would be to find criteria according to which one should determine w. Obviously, to try different values for w until a desired outcome (a statistically significant effect) is obtained would inflate the alpha error and must be avoided.
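A minimal sketch of the weighted variant is given below, using hypothetical data and the sample standard deviation for standardization (both are assumptions of the sketch, as is the function name `weighted_bis`). The weight is passed as an argument so that it can be fixed before the data are inspected.

```python
from statistics import mean, stdev

def zscores(values):
    """Standardize values using the mean and sample SD of the whole list."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def weighted_bis(mean_rts, pcs, w=0.5):
    """w * z_PC - (1 - w) * z_RT per cell, with 0 < w < 1 fixed in advance."""
    if not 0 < w < 1:
        raise ValueError("w must lie strictly between 0 and 1")
    return [w * z_pc - (1 - w) * z_rt
            for z_pc, z_rt in zip(zscores(pcs), zscores(mean_rts))]

# Hypothetical cell means pooled over all subjects and conditions.
mean_rts = [520.0, 610.0, 480.0, 700.0]  # ms
pcs = [0.95, 0.90, 0.85, 0.80]

balanced = weighted_bis(mean_rts, pcs, w=0.5)
accuracy_heavy = weighted_bis(mean_rts, pcs, w=0.8)
print([round(b, 2) for b in balanced])
print([round(h, 2) for h in accuracy_heavy])
```

With w = .5, each score is exactly half of the unweighted difference of the two z scores; because this is a uniform rescaling, test statistics are unaffected.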

Transforming the constituents

Close inspection of Figs. 5, 6, and 7 indicates that for BIS, SATs influence ANOVAs much more than they influence t tests (although this influence is still considerably less than for all of the alternatives). One remedy would be to avoid using ANOVAs for testing critical hypotheses and instead to focus on t tests or contrasts (which usually reflect the hypothesis of interest much better, anyway). Another alternative might be to transform RTs and PCs before entering them into Eq. 4. In particular, BIS integrates RTs and PC linearly, ignoring that RTs and PCs are typically not linearly related. Closer approximation to a linear relationship between RT and PC can be achieved by first transforming both measures. It turns out that the following transformations provide reasonable approximations to linearity (see Footnote 10; but see, e.g., Lo & Andrews, 2015, for potential pitfalls of such transformations):

\( {\overline{RT_{i,j}}}^{\prime }=\ln \left(\overline{RT_{i,j}}\right) \), and \( {PC}_{i,j}^{\prime }=\ln \left(\frac{1}{1-{PC}_{i,j}}\right) \)
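The two transformations can be applied to each cell before standardization, as in the following sketch (the function names are ours). Note that the PC transformation is undefined for perfect accuracy; how such cells should be handled is not specified here, so the sketch simply rejects them.

```python
from math import log

def transform_rt(mean_rt):
    """Natural logarithm of the mean RT."""
    return log(mean_rt)

def transform_pc(pc):
    """ln(1 / (1 - PC)); undefined for PC = 1."""
    if not 0 <= pc < 1:
        raise ValueError("pc must satisfy 0 <= pc < 1")
    return log(1 / (1 - pc))

print(round(transform_rt(500.0), 3))  # 6.215
print(round(transform_pc(0.9), 3))    # 2.303
```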

Combining multiple measures

BIS is in no way restricted to combining only RTs and PCs. To give an example, complex-span tasks are measures of working memory capacity that correlate highly with general intelligence. In these tasks, participants have to remember a sequence of memoranda (e.g., words) and after each memorandum a short processing task has to be solved (e.g., verification of an algebraic equation). After several memorandum–processing pairs, participants have to recall all memoranda. Usually, analyses of this type of task focus on recall performance, but it turned out that accuracy on the processing part correlates with intelligence, too (Unsworth, Redick, Heitz, Broadway, & Engle, 2009)—potentially because people do trade off memorizing and processing. BIS could be used to combine performance on both aspects of the task in order to gain a more comprehensive measure of complex-span performance.

Furthermore, BIS can combine an arbitrary number of performance measures, by simply standardizing all constituents and adding measures for which high values reflect good performance (such as PC) and subtracting measures for which high values reflect bad performance (such as RTs). To stick with the example of complex-span tasks, in addition to recall accuracy and processing accuracy, processing time could be included as a third performance measure (see Unsworth et al., 2009).
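The generalization to more than two constituents can be sketched as follows. The complex-span values are hypothetical, the function name `integrate` is ours, and equal weighting of all constituents is assumed.

```python
from statistics import mean, stdev

def zscores(values):
    """Standardize values using the mean and sample SD of the whole list."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def integrate(measures):
    """Combine any number of measures into one score per observation.

    measures: list of (values, higher_is_better) pairs, all of equal length.
    Standardized values are added when high means good and subtracted otherwise.
    """
    n = len(measures[0][0])
    total = [0.0] * n
    for values, higher_is_better in measures:
        sign = 1.0 if higher_is_better else -1.0
        for i, z in enumerate(zscores(values)):
            total[i] += sign * z
    return total

# Hypothetical complex-span data for four participants.
recall_accuracy = [0.80, 0.65, 0.90, 0.70]
processing_accuracy = [0.95, 0.85, 0.92, 0.88]
processing_time_ms = [1450.0, 1900.0, 1300.0, 1750.0]

scores = integrate([
    (recall_accuracy, True),
    (processing_accuracy, True),
    (processing_time_ms, False),
])
print([round(s, 2) for s in scores])
```

As with the two-measure case, scores above zero indicate above-average performance relative to the sample that entered the standardization.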

Conclusion

We have formally introduced and validated a new approach to control for speed–accuracy trade-offs, the balanced integration score (BIS), and compared it to alternative measures. This measure effectively controls for speed–accuracy trade-offs while retaining true effects. Furthermore, it is highly flexible and easy to calculate. Matlab and R code as well as an Excel sheet for calculating this measure can be retrieved from https://github.com/Liesefeld/BIS

Author note

H.R.L.’s work is supported by LMU Munich’s Institutional Strategy LMUexcellent within the framework of the German Excellence Initiative, and by the Graduate School of Systemic Neurosciences, Munich Center for Neurosciences–Brain & Mind. M.J.’s work is supported by the Institutional Strategy of the University of Tübingen (ZUK 63, German Research Council). We thank Scott Brown, Marc Brysbaert, Anna Liesefeld, Rolf Ulrich, and André Vandierendonck for helpful and constructive comments and suggestions on a previous version of the manuscript.