A guide to the measurement and interpretation of fMRI test-retest reliability

The test-retest reliability of functional neuroimaging data has recently been a topic of much discussion. Despite early conflicting reports, converging evidence now suggests that test-retest reliability is poor for standard univariate measures, namely voxel- and region-level task-based activation and edge-level functional connectivity. Understanding the implications of these recent studies requires understanding the nuances of test-retest reliability as commonly measured by the intraclass correlation coefficient (ICC). Here we provide a guide to the measurement and interpretation of test-retest reliability in functional neuroimaging and review major findings in the literature. We highlight the importance of making choices that improve reliability so long as they do not diminish validity, pointing to the potential of multivariate approaches that improve both. Finally, we discuss the implications of recent reports of low test-retest reliability in the context of ongoing work in the field.


Introduction
The ability to attain similar results given repeated measurements, or test-retest reliability, is a desirable quality of a measure. Converging evidence has demonstrated that univariate brain measures derived from functional magnetic resonance imaging (fMRI) show poor test-retest reliability, whether these measures reflect voxel- or region-level task-based activation or edge-level functional connectivity [1,2]. Furthermore, as will be discussed, a number of studies have highlighted factors that can improve reliability. However, the implications of this body of work are nuanced and often misunderstood. It is crucial to understand how we measure and interpret test-retest reliability because this informs the interpretation of existing research and the choices we should make moving forward. Drawing from lessons learned in our recent review and meta-analysis of the test-retest reliability of functional connectivity [1], alongside other important work, we will provide a guide to the measurement of test-retest reliability in fMRI, highlight major findings, and provide the context needed to facilitate their interpretation.

A primer on measuring fMRI test-retest reliability with the intraclass correlation coefficient
Test-retest reliability is a Measurement Theory concept that quantifies the stability of a measure under repeated measurements [3]. This is important since it not only informs how precisely we can characterize an object, but also how precisely we are able to measure associations with other variables of interest. As will be discussed at the end of this section, reliability is considered to be complementary to validity, broadly defined as the 'relevance of a measuring instrument for a particular purpose' [3].
While there are many measures of test-retest reliability [4], it is most commonly measured in fMRI using the intraclass correlation coefficient (ICC). A theoretical discussion of the properties of the ICC and statistical inference can be found in Refs. [5,6]; here we will highlight only a few key features. In human-oriented research, the ICC is typically defined as the proportion of total measured variance (e.g. variability between people, sessions, etc.) that can be attributed to variability between people:

ICC = σ²_between-subject / (σ²_between-subject + σ²_within-subject)

Note that between-subject variance is in the numerator and within-subject variance is included in the total variance in the denominator; thus, the ICC increases as subjects become more distinct from each other and/or as within-subject measurements become more similar. The ICC is commonly denoted ICC(n,k), where n represents the model and k represents the type for averaging, described as follows. Three ICC models (n = 1, 2, or 3) are commonly used: ICC(1,k) is said to reflect absolute agreement and does not explicitly include facets (i.e. sources of error [8]) in the model; ICC(2,k) also reflects absolute agreement and includes a random facet (i.e. levels reflect a random sample of possible levels); and ICC(3,k) (a.k.a. Cronbach's alpha) reflects consistency and includes a fixed facet (i.e. levels reflect all possible levels of interest). The reliability of single measurements is estimated with k = 1, and the reliability of average measures may be estimated by choosing k > 1. The ICC is similar to the Pearson's correlation between repeated measurements, except that the Pearson's correlation is invariant to linear transformations between measurements (i.e. reflects fit to y = mx + b); in contrast, consistency ICCs are only invariant to translation (i.e. reflect fit to y = x + b) and agreement ICCs are affected by both translation and scaling (i.e. reflect fit to y = x) [6,7].
This follows from the assumption that repeated measurements are of the same class, and thus reflect the same population variance (and, for 'agreement' ICCs, the same mean). Incidentally, the ICC can also be seen as being sensitive to inter-subject discriminability, since it increases with greater distance between subjects and less distance within subjects. A common historical rule of thumb categorizes ICC as poor (<0.4), fair (0.4-0.59), good (0.6-0.74), or excellent (≥0.75) [9].
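To make the definitions above concrete, the single-measurement ICCs can be computed from the mean squares of a two-way subject-by-session ANOVA. The following is a minimal sketch, not code from any of the cited studies; the function name and the toy data are our own illustration:

```python
import numpy as np

def icc_single(Y, model=2):
    """Single-measurement ICC from an (n_subjects, n_sessions) matrix Y.

    model=2 -> ICC(2,1): absolute agreement, sessions as a random facet
    model=3 -> ICC(3,1): consistency, sessions as a fixed facet
    """
    n, k = Y.shape
    grand = Y.mean()
    # Two-way ANOVA sums of squares: subjects (rows), sessions (columns), residual
    ss_rows = k * ((Y.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((Y.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((Y - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    if model == 3:
        # Consistency: session (column) variance is ignored
        return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
    # Absolute agreement: session variance counts against reliability
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Four subjects, two sessions; session 2 is uniformly shifted by +1
Y = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
print(icc_single(Y, model=3))  # consistency is perfect: 1.0
print(icc_single(Y, model=2))  # agreement is penalized by the shift: ~0.93
```

Note that this plain ANOVA estimator shares the limitation mentioned in the text: it can return negative values, which are commonly truncated to 0 in practice.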
Despite its straightforward appearance, the ICC can be estimated in a number of ways. The following is a selection of choices one can make in estimating the ICC:

- how many facets (sources of error) to include, if any (for >2 facets, see Generalizability Theory [8]);
- whether to model facets as random (ICC(2,k)) or fixed (ICC(3,k));
- whether to estimate ICC for average measurements (a 'Decision Study' permits exploring combinations of facets [8]);
- which variance estimation procedure to use (note that the standard ANOVA has several limitations, including negative estimates commonly set to 0; for alternatives, see Refs. [8,10]);
- how precisely to model the error structure (for structural equation modeling procedures, see Refs. [11,12]);
- whether to include covariates of no interest [13]; and more.

Since it can be difficult to navigate the many choices, ICC(2,1) may be an ideal starting point: most univariate fMRI ICC studies include repeated measurements over time, which can introduce systematic error across subjects, and most study findings will be relevant to single rather than average measures. The choice of random or fixed facets can be tricky, so it is useful to consider whether the investigator wishes to generalize beyond the measured levels of the facet in the given study (e.g. whether a multisite study should inform other studies using other sites, or whether it should only inform future experiments performed at those same sites). If one is unsure about the desired generalization, it may help to start by considering facets as random, because this is more conservative (i.e. over-estimates the facet contribution) and in practice yields similar results to fixed facets for fMRI [10]. Special consideration is due for time-related facets (e.g. session). In a standard non-nested model, the time-related variance component reflects changes over time shared across individuals, which are likely to be minimal at non-developmental timescales [14].
However, this does not reflect individual-specific changes over time; that variability is instead included in the residual variance component, which is typically very large [15].
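When reliability of average rather than single measurements is of interest (the k > 1 case above), the standard Spearman-Brown prophecy formula projects the reliability of an average of k parallel measurements from single-measurement reliability; a Decision Study generalizes this idea across facets. A minimal sketch (the function name is ours):

```python
def spearman_brown(rel_single, k):
    """Predicted reliability of the average of k parallel measurements,
    each with single-measurement reliability rel_single."""
    return k * rel_single / (1 + (k - 1) * rel_single)

# A 'poor' single-session reliability of 0.4 rises substantially
# when four sessions are averaged:
print(spearman_brown(0.4, 4))  # ~0.73
```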
Reliability is typically reported as the average of univariate ICC coefficients across the brain. For task-based activation, the ICC is often calculated for percent signal change at each voxel. For functional connectivity, it is often calculated for the correlation coefficient at each edge. However, it can be desirable to estimate the ICC for any number of other measures: node strength, beta coefficients, test statistics, regional homogeneity, and so on. A couple of considerations are worth bearing in mind: first, that the measure be available at the level of individual subjects, and second, that the measure be relevant to the question of interest; for example, if one is interested in the reliability of edge strength, it may not make sense to calculate the reliability of node strength. A few multivariate ICCs may also be used, including 'multivariate generalizability' [8] and the I2C2, which pools variance components across the image [16]. We have not observed an ICC based on multivariate distance, but related measures of discriminability include 'fingerprinting' [17] and a metric called 'discriminability' [18].
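As an illustration of the discriminability-style measures just mentioned, connectome 'fingerprinting' identifies a subject by matching their session-1 connectivity pattern to the most similar session-2 pattern across all subjects. A minimal sketch on synthetic data (the function name and simulation parameters are our own, not taken from Ref. [17]):

```python
import numpy as np

def fingerprint_accuracy(day1, day2):
    """Fraction of subjects whose day-1 vectorized connectome is most
    correlated with their own day-2 connectome (rows = subjects)."""
    a = (day1 - day1.mean(1, keepdims=True)) / day1.std(1, keepdims=True)
    b = (day2 - day2.mean(1, keepdims=True)) / day2.std(1, keepdims=True)
    corr = a @ b.T / day1.shape[1]          # subject-by-subject correlations
    return float((corr.argmax(axis=1) == np.arange(len(day1))).mean())

rng = np.random.default_rng(0)
stable = rng.standard_normal((20, 1000))    # each subject's stable pattern
day1 = stable + 0.5 * rng.standard_normal((20, 1000))  # session-specific noise
day2 = stable + 0.5 * rng.standard_normal((20, 1000))
print(fingerprint_accuracy(day1, day2))     # high multivariate identifiability
```

Note that identification can be nearly perfect here even though the noise makes each individual edge only moderately reliable, mirroring the dissociation between multivariate and univariate reliability discussed in the text.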
The interpretation of the ICC should be considered alongside, and secondary to, validity. It is often stated that reliability provides an upper bound for validity, which refers to the fact that the correlation between observed variables cannot exceed their within-variable correlations [19,20]. Specifically, the observed correlation r(X, Y) (between observed variables X and Y) can be described as depending on the true correlation r(X_T, Y_T) (between the true values of the variables, X_T and Y_T) and the within-measure correlations (r(X, X′) and r(Y, Y′), that is, reliability coefficients):

r(X, Y) = r(X_T, Y_T) × √(r(X, X′) × r(Y, Y′))

Thus improving reliability in isolation can increase the observed correlation, thereby reducing the sample size required to detect an effect [21,22] (although correcting for this seeming attenuation is not recommended in practice for many reasons [23]). There are a few ways to think of validity in this context. Defining validity as r(X, Y) and treating Y as ground truth is analogous to the popularly used criterion validity. One can also think of the true correlation in the absence of test-retest error, r(X_T, Y_T), as reflecting 'true' criterion validity. It has also been argued that this is 'validation' and that 'validity' should be reserved for the ontological claim that an existing attribute causally affects a measure [24]. For the latter two interpretations, validity can indeed be present despite low reliability: think of a noisy thermometer that gives very different results at each measurement but the correct result averaged across 100 measurements. And clearly high reliability does not imply validity: think of a thermometer that always registers zero. However, both cases are associated with low utility (and low observed criterion validity) for a single test. In summary, both reliability and validity are essential for a high-quality measure, although the relationship between the two can vary depending on the definitions used.
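The attenuation relationship above can be checked numerically: simulate correlated true scores, degrade each with independent measurement noise calibrated to a target reliability, and compare the observed correlation with the predicted attenuated value (all parameter values below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
rho_true = 0.6           # correlation between true scores X_T and Y_T
rel_x, rel_y = 0.5, 0.8  # target test-retest reliabilities of X and Y

# Correlated true scores with unit variance
x_true = rng.standard_normal(n)
y_true = rho_true * x_true + np.sqrt(1 - rho_true**2) * rng.standard_normal(n)

# Add noise so that var(true) / var(observed) equals the target reliability
x_obs = x_true + np.sqrt(1 / rel_x - 1) * rng.standard_normal(n)
y_obs = y_true + np.sqrt(1 / rel_y - 1) * rng.standard_normal(n)

observed = np.corrcoef(x_obs, y_obs)[0, 1]
predicted = rho_true * np.sqrt(rel_x * rel_y)  # attenuation formula
print(round(observed, 3), round(predicted, 3))  # both close to 0.379
```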

Factors influencing test-retest reliability of fMRI
Using data amassed across the past decade, recent meta-analyses have underscored the poor test-retest reliability of univariate fMRI, that is, at the voxel and region level for task-based activation [2] and at the edge level for resting-state functional connectivity [1]. For context, structural MRI measures exhibit relatively high reliability compared with functional MRI, reflecting the expected immutability of brain structure and the expected higher level of state and/or noise fluctuations at rest [2,25]. However, the literature points towards a number of factors that can increase reliability. In the following, we highlight a selection of references from the recent literature and the most recent available literature reviews of task-based activation [26] and functional connectivity [1]. We limit our discussion to these commonly used fMRI units of analysis, which have recently drawn attention in the literature, but note that other fMRI measures are actively being investigated (e.g. Refs. [27,28]).
Both task-based activation and functional connectivity reliability have been found to increase with the following factors: shorter test-retest intervals [26,29,30]; task type [26,31] (task > rest for functional connectivity; basic > complex tasks for activation); locations with larger and significant effects (although this is typically a small association) [26,29,32,33]; locations in cortex rather than subcortex [2,15,34]; and non-clinical populations [35,36]. Meanwhile, minimal effects on functional connectivity reliability were observed with the use of multiple harmonized sites and scanners [37,38] (see Ref. [39] for more about increasing reliability with harmonization; see Ref. [40] for details on how to harmonize scanners), although site effects become more prominent with multivariate measures [37,38].
While univariate reliability is low in fMRI, multivariate reliability is substantially greater. Near-perfect discriminability has been observed on the basis of the multivariate pattern of connectivity by comparing correlations between connectomes ('fingerprinting') [17] and distances between connectomes ('discriminability') [18], even with poor univariate ICC in the same data [15,41]. The I2C2, which pools univariate estimates of variance across the image, also shows substantially higher reliability of connectivity than the ICC [16]. In this vein, second-order network measures may show greater reliability than first-order measures, with greater dissociation after global signal regression [42]. Task-based activations have also been shown to exhibit greater reliability when combined across multiple areas or within a multivariate model [43].
Some findings have been reported specifically within either the activation or the functional connectivity literature. Activation studies reveal more task-relevant effects: reliability has been shown to improve for block rather than event-related designs and for target-nontarget rather than task-rest contrasts [44]. Note that the latter effects may be small [2], especially if the nontarget condition results in similar activation to the target [45]. In general, reliability is expected to depend on a number of task-specific factors, including the task itself. Functional connectivity reliability has been shown to improve with more within-subject data [30,46]; eyes-open, awake, and active recordings [47,48]; no artifact correction (a highly variable and complicated result; cf. [1,49] and Implications); within-network location [15]; averaging over longer rather than shorter intervals within a given dataset [15,31]; no task regression for task data [31]; full rather than partial correlation-based connectivity with shrinkage [50]; and younger rather than older adult populations [51]. Meanwhile, minimal effects were observed with slice timing correction [49]. We expect that many of the factors listed here that improve reliability in functional connectivity studies also improve reliability in activation studies. For example, we expect reliability to increase with scan duration for not only connectivity but also activation [52], although this was not found in a recent activation meta-analysis [2].
Finally, although we have not yet observed investigations of reliability across the lifespan for connectivity or activation, we have observed age-related differences in the reliability of connectivity, including lower ICCs in infants [53,54] and older adults [51] compared with younger adults, as well as differences in the spatial distribution of reliability between children and adults [55]. Thus, we hypothesize that reliability will follow an inverted U-shape (i.e. lower at young ages, peaking in young adulthood, and decreasing in older adults) and exhibit changes in spatial distribution with age.

Implications
Evidence amassed across the field points to the low test-retest reliability of univariate fMRI data, as well as to a number of factors that can affect its reliability. What does this mean for existing research, and how should we move forward? The low reliability of these fundamental levels of analysis in fMRI can clearly impair our ability to detect effects (see A primer on measuring fMRI test-retest reliability). This is particularly problematic for fMRI, where typical effect sizes are likely small to moderate [56,57,58]. Thus, it is reasonable to consider ways we can improve reliability (see Factors influencing test-retest reliability of fMRI). At the same time, we urge the reader to temper an unduly pessimistic interpretation of these findings by considering the following.
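To illustrate how attenuation inflates sample-size requirements, the approximate n needed to detect a correlation at 80% power and two-sided alpha = 0.05 via the standard Fisher z approximation can be compared before and after attenuation. The effect sizes below are arbitrary illustrations, not estimates from the cited studies:

```python
import math

def n_required(r, z_alpha=1.96, z_beta=0.8416):
    """Approximate sample size to detect correlation r (two-sided alpha,
    power set by z_beta) using the Fisher z approximation:
    n = ((z_alpha + z_beta) / atanh(r))**2 + 3."""
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

r_true = 0.3                                   # hypothetical true correlation
r_attenuated = r_true * math.sqrt(0.5 * 0.5)   # both measures at reliability 0.5
print(n_required(r_true))        # ~85 subjects
print(n_required(r_attenuated))  # several times more after attenuation
```

Halving the reliability of both measures thus roughly quadruples the required sample size in this toy scenario, which is the practical force behind the attenuation formula discussed earlier.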
Much existing work relies on group-level inference, which can be robust in the absence of high individual-level test-retest reliability [59]. While increasing the number of subjects sampled from a population does not change the expected value of the ICC or effect size, it does increase the power to detect an effect [22,60] along with the precision of the ICC and effect size estimates [41,61] (see Ref. [58] for an illustration of low precision leading to inflation in fMRI). Thus, a study with a large sample size can have substantial power and precision to detect effects even if observed effects are small due to low test-retest reliability. An illustrative example is the robust group-level activation in bilateral amygdala for both emotion tasks in Fig. 4 of Ref. [2] despite low reliability, likely due to a combination of relatively large sample size and the magnitude of the underlying effect. Still, one must bear in mind the limited generalizability of group effects to the individual [59] and the possible attenuation of observed effects due to low reliability (although even small true effects can be meaningful [62]).
In addition, the low reliability frequently reported pertains to univariate measures, yet the multivariate analyses commonly used in modern neuroimaging are substantially more reliable (see Factors influencing test-retest reliability of fMRI), reflecting greater stability in the pattern of the image and/or the pooling of elements. As such, it may be helpful to think of the reliability of univariate fMRI measures as establishing a lower bound on the reliability of fMRI [43]. Moreover, recent work has underscored the poor power of mass univariate analyses for activation [63] and connectivity [57,58], stemming from small univariate effect sizes and multiple testing correction requirements. Multivariate approaches tend to show larger effect sizes [58] and are therefore recommended for better reliability, power, and precision.
Finally, it is important to recognize that reliability is not the same as validity. Choices that improve reliability may not improve validity -and can even do the opposite, depending on the definition of validity employed. Thus, it is worth proceeding with caution when making choices just to improve reliability. For example, recall that reliability generally increases with longer scans and decreases with artifact removal. Acquiring longer scans may be a reasonable decision that better captures an individual as they move through different states [15]. However, deciding to retain artifacts in an effort to increase reliability can be more problematic; the increased reliability could be due to unwanted but reliable artifacts like motion [64,65], and retaining artifacts is associated with poorer outcome measures [49,66]. It can be a simple choice to remove a nuisance variable that is known to block measurement of desired associations, but it can be tricky to understand and properly remove fMRI confounds when their removal decreases desired associations [67]; either way, the decision to retain artifacts should not be based solely on an expected improvement in reliability. A related problem is that low reliability due to high within-subject variability may be meaningful or it could reflect noise. This is often difficult to disentangle in complex living systems, where both sources of variability are expected. In summary, both reliability and validity are desirable, and we should not strive for one without the other.

Conclusion
With increasing interest in reproducible findings, there has been increasingly critical attention to the reliability of fMRI, typically measured by the ICC. Recent surveys of the functional connectivity and task-based activation literatures point towards the low reliability of univariate measures, as well as a number of factors that influence reliability. However, study design and analysis decisions should not be made based on reliability alone; instead, one should also seek to understand the impact of any decision on validity. Notably, there is growing evidence that multivariate approaches improve both reliability and validity, and thus offer a promising avenue for future research. A better understanding of fMRI reliability and validity, alongside the adoption of best practices in the field [68], will leave us better positioned to achieve both.

Conflict of interest statement
None declared.