Statistical considerations for repeatability and reproducibility of quantitative imaging biomarkers

Quantitative imaging biomarkers (QIBs) are increasingly used in clinical studies. Because many QIBs are derived through multiple steps in image data acquisition and data analysis, QIB measurements can exhibit large variability, posing a significant challenge to translating QIBs into clinical trials and, ultimately, clinical practice. Both repeatability and reproducibility constitute the reliability of a QIB measurement. In this article, we review the statistical aspects of the repeatability and reproducibility of QIB measurements by introducing methods and metrics for the assessment of QIB repeatability and reproducibility and by illustrating the impact of QIB measurement error on sample size and statistical power calculations, as well as on predictive performance when a QIB is used as a predictive biomarker.


INTRODUCTION
Medical imaging modalities such as CT, MRI, and positron emission tomography (PET) are routinely used in clinical practice for disease screening, diagnosis, staging, therapeutic monitoring, evaluation of residual disease, and assessment of disease recurrence. Traditionally, qualitative, image contrast-based interpretation of medical images has been the most common radiology practice. With advances in imaging technologies in recent years, imaging metrics that can quantify tissue biological and physiological properties, in addition to those that quantify tissue morphology such as disease size, are increasingly used in research and early phase clinical trials to characterize disease and response to treatment. A recent review 1 by a group of principal investigators from the Quantitative Imaging Network (National Cancer Institute, National Institutes of Health) has called for wider incorporation of quantitative imaging methods into clinical trials, and eventually clinical practice, for the evaluation of cancer therapy response. In the emerging era of precision medicine, quantitative imaging biomarkers (QIBs) can be integrated with quantitative biomarkers from genomics, transcriptomics, proteomics, and metabolomics to facilitate patient stratification for individualized treatment strategies and to improve treatment outcomes. 2 A QIB is defined as "an objective characteristic derived from an in vivo image measured on a ratio or interval scale as an indicator of normal biological processes, pathogenic processes or a response to a therapeutic intervention." 3 QIBs can be broadly classified into five types: structural, morphological, textural, functional, and physical property QIBs. 4 Kessler et al 3 have introduced terminology related to QIBs for scientific studies. Study designs and statistical methods for assessing the technical performance of QIBs have been extensively reviewed.
[4][5][6][7][8] Because many QIBs are derived through multiple steps in image data acquisition and data analysis that often involve different manufacturer scanner platforms and different computer algorithms and software tools, QIB measurements can produce large variabilities, which pose a significant challenge in translating QIBs into clinical trials, and ultimately, clinical practice. In order for a QIB and its changes to be interpretable in clinical settings across institutions and clinics for disease characterization and therapy response assessment, it is highly important to evaluate the repeatability and reproducibility of the QIB.
Both repeatability and reproducibility constitute the reliability of a QIB measurement. Repeatability refers to the precision of a QIB measured under identical conditions (e.g. using the same measurement procedure, the same measurement system, the same image analysis algorithm, and the same location over a short period of time; also known as the repeatability condition), 3,4 which is mainly a measure of the within-subject variability and the variability caused by the same imaging device over time. Reproducibility, on the other hand, refers to the precision of a QIB measured under different experimental conditions 3,4 (also known as the reproducibility condition), which is mainly a measure of the variability associated with different measurement systems, imaging methods, study sites, and populations. In recent years, many studies have been conducted to investigate the reliability of different QIBs. For example, Yokoo et al 9 studied the precision of hepatic proton-density fat-fraction measurements by using MRI; Lodge 10 examined the repeatability of standardized uptake value (SUV) in oncologic PET; [11][12][13][14][15][16][17][18][19] Winfield et al, 20 Weller et al, 21 Lecler et al, 22 and Lu et al 23 estimated the repeatability and reproducibility of the apparent diffusion coefficient (ADC) derived from diffusion-weighted MRI; Hagiwara et al 24 studied the repeatability and reproducibility of quantitative relaxometry with a multidynamic multiecho MRI sequence using a phantom and normal healthy human subjects; Jafari-Khouzani et al 25 appraised the repeatability of brain tumor perfusion measurement using dynamic susceptibility contrast MRI; and Han et al 26 and others [27][28][29] have reported similar assessments for additional QIBs.

In this article, we review the statistical aspects of the repeatability and reproducibility of QIBs. We introduce methods and metrics for the assessment of QIB repeatability and reproducibility, and illustrate the impact of QIB measurement error on sample size and statistical power calculations, as well as on performance when a QIB is used as a predictive biomarker.

MEASUREMENT ERROR MODEL
The precision of a QIB is defined as the closeness of agreement between repeated measurements of the QIB, 3 and repeatability and reproducibility comprise different sources of variability that may impact the precision of a given QIB. Measurement error is defined as the difference between a measured quantity and its true value. 30 Any source of variability can cause measurement error in QIB measurements. Although it is critical to identify and obtain valid inference on the impact of every component of variation (see Section 3), we start by introducing a general measurement error model. Table 1 lists the symbols commonly used in this article.
In a measurement error model, instead of the true QIB value, we can only observe QIB values with errors that are random across different QIB measurements. If the errors are constant for all measurements, the error is called bias. 3 Since both repeatability and reproducibility mainly concern random errors, we assume that there is no bias and that the random errors are independent and identically distributed with mean zero and variance σ²_ϵ. Thus, the only unknown parameter σ²_ϵ measures the level of variability, with larger values indicating larger variability or worse precision. Let Y_itl be the measured QIB value from the lth measurement of a repeated QIB measurement made at time t for subject i, with X_it being the corresponding true value. The measurement error model can then be expressed as

Y_itl = β_0t + β_1 X_it + ϵ_itl, (1)

where β_0t represents the bias of the QIB measurement and β_1 represents the proportional bias of the QIB measurement. The random error ϵ_itl is assumed to be normally distributed (other commonly used distributional assumptions are discussed below). Because neither the bias nor the proportional bias can be identified solely through the QIB measurements, it is usually assumed that these values are constant and known in advance from ground-truth studies such as a phantom study. 31 Because we can always standardize the QIB measurement through the transformation (Y_itl − β_0t)/β_1 to remove the effect of bias and proportional bias, without loss of generality we can assume the setting of no bias and no proportional bias (β_0t = 0 and β_1 = 1), and model (1) becomes

Y_itl = X_it + ϵ_itl. (2)

Following the measurement error model (2), we consider a simplified setting where all measurements for subject i are made within a relatively short time interval so that the true value X_i remains unchanged.
That is, let Y_ijk be the kth repeated QIB measurement made on subject i under experimental condition j (different experimental conditions may include different measurement systems, imaging methods, and countries/regions). In a similar fashion to model (1), we employ a linear relationship between Y_ijk and X_i, 4,5,32 but further break down the measurement error ϵ_ijk into different components of repeatability- and reproducibility-related errors. Similar to model (2), we assume there is no bias or proportional bias, and the model that accounts for both repeatability- and reproducibility-related errors can be written as

Y_ijk = X_i + γ_j + (γδ)_ij + δ_ijk. (3)

The terms δ_ijk, γ_j, and (γδ)_ij represent different components of measurement error caused by within-subject variability (under the repeatability condition), between-condition variability (under the reproducibility condition), and the interaction between subject and condition, respectively, and we assume they follow normal distributions. In general, the random error variances σ²_δ, σ²_γ, and σ²_γδ are the key performance characteristics used in repeatability and reproducibility studies (see details below).

Repeatability
Many studies have been conducted to investigate the repeatability of QIBs for different imaging modalities, including but not limited to CT, 17,18 MRI, 9,12,14,15,[20][21][22][23][24][25] and PET. 10,11,18,29 Test-retest studies are usually performed to evaluate the repeatability of QIB measurements. These studies typically require each subject to be scanned repeatedly over a short period of time, under the assumption that X_i does not change. If the repeatability condition holds, i.e. all repeated scans are performed at the same location, with the same measurement procedure, and using the same measurement system and image analysis algorithm, an estimate of QIB repeatability can be calculated.
In practice, test-retest studies can be performed fairly easily using a phantom, but can be difficult with human subjects due to expense and logistical challenges. Therefore, repeatability studies using human subjects are often limited to a small number of replicates (usually two) per subject. Furthermore, for imaging studies with contrast administration, because a contrast washout period is usually required between two consecutive scans (e.g. consecutive dynamic contrast-enhanced (DCE) MRI scans are usually required to be performed at least 24 h apart), 4,31 "coffee-break" experiments, in which there is only a short break between repeated scans, are not always possible. Thus, for repeatability studies with a long interval between scans, possible changes in the true values should also be considered in the model.
Specifically, for a test-retest study with n subjects and m replicates for each subject, since all experiments are conducted under the same experimental condition, without loss of generality we set j = 1 and model (3) becomes

Y_i1k = X_i + γ_1 + δ_ik, (4)

for i = 1, ..., n and k = 1, ..., m. The random effect variance σ²_δ is the key performance characteristic used in a repeatability study, with smaller values corresponding to better repeatability. Because a test-retest repeatability study considers QIB measurements only in a single-site, single-measurement-system setting, the random error γ_j in (3), which measures reproducibility-related variability such as between-site variability, becomes a condition-specific systematic error γ_1 in (4), representing the condition-specific bias. As discussed in Section 2, the constant bias γ_1 cannot be identified through the test-retest study and should be estimated through a phantom study with ground-truth values. The term (γδ)_ij in model (3) vanishes because there is only one study condition and the interaction effect cannot be observed.
Many metrics related to the estimate of the within-subject variance σ²_δ have been proposed to quantify the magnitude of repeatability. 4 Table 2 shows the metrics considered in this article for repeatability measurement. The within-subject standard deviation (wSD) is the most commonly used metric for assessing repeatability. It is the standard deviation (σ_δ) of repeated measurements for a single subject. If we assume all subjects have the same σ_δ and the true values X_i are independent and normally distributed, wSD can be obtained by fitting a linear mixed-effects model with subject-specific random intercepts using maximum likelihood. 33 Although other estimation procedures such as method-of-moments estimators can also provide consistent estimation of wSD, the maximum likelihood method is usually preferred for small sample sizes when model (3) is correctly constructed. Alternatively, wSD can be calculated by averaging the within-subject sample variances: 32

wSD = [ (1/n) Σ_i (1/(m − 1)) Σ_k (Y_i1k − Ȳ_i)² ]^(1/2), (5)

where Ȳ_i is the mean of the m measurements on subject i. This estimator (equation (5)) is equivalent to the estimator obtained from the one-way analysis of variance (ANOVA) model. 34 Another closely related metric for repeatability measurement is the intraclass correlation coefficient (ICC), 35,36 which is defined as the proportion of total variation that is associated with the variation of the true value. That is, if we assume X_i is normally distributed with variance σ²_X, then

ICC = Var(X_i)/Var(Y_i1k) = σ²_X/(σ²_X + σ²_δ). (6)

Comparing the variance of the QIB measurement Y_i1k (denominator of equation (6)) with the variance of the true value X_i (numerator of equation (6)), we can observe that the extra variation equals the within-subject variation σ²_δ. If σ²_δ is much smaller than σ²_X, then Var(Y_i1k) ≈ Var(X_i) and ICC is close to 1, which indicates that the measurement error contributes little to the variation of the QIB measurement. In other words, a larger ICC implies better repeatability and a smaller ICC implies worse repeatability.
ICC can be estimated with known estimates of σ_δ and σ_X (equation (6)), both of which can be obtained by fitting either a linear mixed-effects model with subject-specific random intercepts or a one-way ANOVA model. When using ICC as the measure of repeatability, it is crucial to ensure that the subjects participating in the study are representative of the study population so that the estimated QIB variation reflects the variation of the study population (the variance of X_i).
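As a concrete illustration, the ANOVA-type (method-of-moments) estimators of wSD (equation (5)) and ICC (equation (6)) can be computed in a few lines. The sketch below is a minimal Python implementation under the stated model assumptions; the simulated study size (n = 50 subjects, m = 2 replicates) and the true values σ_X = 1 and wSD = 0.3 are hypothetical, chosen purely for illustration.

```python
import numpy as np

def wsd_icc(y):
    """ANOVA-type (method-of-moments) estimates of wSD and ICC.

    y : (n_subjects, m_replicates) array of test-retest QIB measurements.
    wSD^2 averages the within-subject sample variances (equation (5));
    ICC = sigma_X^2 / (sigma_X^2 + sigma_delta^2) (equation (6)).
    """
    y = np.asarray(y, dtype=float)
    n, m = y.shape
    within_var = y.var(axis=1, ddof=1).mean()      # estimate of sigma_delta^2
    msb = m * y.mean(axis=1).var(ddof=1)           # between-subject mean square
    sigma_x2 = max((msb - within_var) / m, 0.0)    # estimate of sigma_X^2
    icc = sigma_x2 / (sigma_x2 + within_var)
    return np.sqrt(within_var), icc

# Hypothetical test-retest study: n = 50 subjects, m = 2 replicates,
# true sigma_X = 1.0 and true wSD = 0.3 (so the true ICC is about 0.92).
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(50, 1))             # true values X_i
y = x + rng.normal(0.0, 0.3, size=(50, 2))         # model (4) with gamma_1 = 0
wsd, icc = wsd_icc(y)
```

In practice, a linear mixed-effects model (e.g. statsmodels `MixedLM` in Python or lme4 in R) would give the maximum likelihood counterparts of these moment estimates.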
The within-subject coefficient of variation (wCV) (Table 2) is an alternative metric to wSD. The wCV is defined as the ratio of the within-subject standard deviation to its mean, and is commonly used for test-retest studies of repeatability when the wSD is not constant among the studied subjects and model (4) becomes inadequate. A useful alternative to model (4) is to assume that the wSD increases proportionally with the true value X_i, i.e.,

Y_i1k = γ_1 + X_i δ_ik. (7)

In model (7), with the extra constraint of δ_ik > 0, it is more adequate to assume that δ_ik follows a log-normal or Weibull distribution 37 with the mean of δ_ik equal to one. Raunig et al 4 suggest using the log-normal distribution so that the log-transformed QIB measurement log(Y_i1k) is normally distributed (after adjusting for the site-specific bias γ_1):

log(Y_i1k − γ_1) = log(X_i) + δ′_ik, δ′_ik ~ N(0, σ²_δ′). (8)

Under model (8), wCV only depends on the log-transformed within-subject variance σ²_δ′:

wCV = [exp(σ²_δ′) − 1]^(1/2). (9)

Therefore, we can apply any of the estimators of wSD to the log-transformed QIB measurements to obtain valid estimates of σ²_δ′, and wCV can be estimated by plugging σ²_δ′ into model (9). Without the log-normal distribution assumption, by mimicking estimator (5), when m = 2 we can still estimate wCV by pooling and averaging the within-subject sample coefficients of variation: 32

wCV = [ (1/n) Σ_i (Y_i11 − Y_i12)² / (2 Ȳ_i²) ]^(1/2). (10)

Both ICC and wCV have the benefit of being dimensionless, which makes them useful for comparing quantities measured on different scales.
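The two wCV estimators just described, the log-normal-based estimator from models (8)-(9) and the distribution-free pooled estimator (10) for m = 2, can be sketched as follows. The simulated multiplicative-error data, with a true wCV of roughly 10%, are hypothetical.

```python
import numpy as np

def wcv_lognormal(y):
    """wCV under the log-normal model (8)-(9): estimate the log-scale
    within-subject variance, then wCV = sqrt(exp(s2) - 1)."""
    logy = np.log(np.asarray(y, dtype=float))
    s2 = logy.var(axis=1, ddof=1).mean()
    return np.sqrt(np.exp(s2) - 1.0)

def wcv_pooled(y):
    """Distribution-free pooled estimator for m = 2 replicates
    (equation (10)): average the squared within-subject CVs."""
    y = np.asarray(y, dtype=float)
    within_var = (y[:, 0] - y[:, 1]) ** 2 / 2.0    # per-subject variance, m = 2
    return np.sqrt(np.mean(within_var / y.mean(axis=1) ** 2))

# Hypothetical multiplicative-error data (model (7) with gamma_1 = 0)
# with a true wCV of about 10%.
rng = np.random.default_rng(1)
x = rng.uniform(1.0, 5.0, size=(200, 1))           # true values X_i
y = x * np.exp(rng.normal(0.0, 0.1, size=(200, 2)))
v_log, v_pooled = wcv_lognormal(y), wcv_pooled(y)
```

Both estimators should land near the true wCV of about 0.10 for these data.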
Instead of a single point estimate of the repeatability metric (whether wSD, ICC, or wCV), it is desirable to make inferences on these values through either constructing confidence intervals (CIs) or performing hypothesis testing. Both confidence intervals and hypothesis tests involve estimating the distribution of the estimator and thus depend on the choice of the estimation method. For wSD (denoted as σ, with the corresponding estimator denoted as σ̂), if ANOVA-type estimators such as equations (5) and (10) are used, n(m − 1)σ̂²/σ² follows a χ² distribution with degrees of freedom (df) of n(m − 1). Thus, the (1 − α) × 100% CI of σ is

( σ̂ [df/χ²_{df,1−α/2}]^(1/2), σ̂ [df/χ²_{df,α/2}]^(1/2) ), (11)

where χ²_{df,α/2} and χ²_{df,1−α/2} are the α/2 and 1 − α/2 quantiles, respectively, of the χ² distribution with df degrees of freedom. To test whether the level of wSD is greater than a threshold value c, we conduct a hypothesis test with null hypothesis H_0: σ² ≤ c² vs alternative hypothesis H_1: σ² > c². The corresponding test statistic is T = n(m − 1)σ̂²/c², and we reject the null hypothesis if T > χ²_{df,1−α}. On the other hand, if maximum likelihood estimators are used, making inferences based on the asymptotic distribution of σ̂ is usually problematic for small sample sizes, and numerical methods such as the bootstrap CI 38 or profile likelihood CI 39 should be considered. For wCV, a CI for σ²_δ′ on the log-transformed data is first determined using formula (11); the CI of wCV can then be obtained through model (9). Because the estimator of ICC is a nonlinear function of σ_δ and σ_X, its exact sampling distribution is not available. In this case, bootstrap confidence intervals have been extensively used in the literature. 40,41 We can also construct the confidence interval of ICC by approximating its sampling distribution using either the F (also known as the Satterthwaite approximation) or β distribution.
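The χ²-based confidence interval (11) and the one-sided test for wSD translate directly into code. In the sketch below, the inputs (estimated wSD of 0.30 from n = 50 subjects with m = 2 replicates, tested against a hypothetical repeatability threshold c = 0.20) are illustrative values, not taken from any cited study.

```python
import numpy as np
from scipy.stats import chi2

def wsd_ci_test(wsd_hat, n, m, alpha=0.05, c=None):
    """Chi-square based (1 - alpha) CI for wSD (equation (11)); optionally
    also tests H0: sigma <= c vs H1: sigma > c at level alpha."""
    df = n * (m - 1)
    lo = wsd_hat * np.sqrt(df / chi2.ppf(1.0 - alpha / 2.0, df))
    hi = wsd_hat * np.sqrt(df / chi2.ppf(alpha / 2.0, df))
    reject = None
    if c is not None:
        t_stat = df * wsd_hat ** 2 / c ** 2        # T = n(m-1) sigma_hat^2 / c^2
        reject = bool(t_stat > chi2.ppf(1.0 - alpha, df))
    return (lo, hi), reject

# Hypothetical inputs: sigma_hat = 0.30 from n = 50, m = 2, threshold c = 0.20.
(lo, hi), reject = wsd_ci_test(0.30, n=50, m=2, c=0.20)
```

Here the point estimate 0.30 sits inside the interval, and the null hypothesis σ ≤ 0.20 is rejected.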
42 Based on Monte Carlo simulation of the sampling distribution of the generalized pivotal quantity of ICC, Ionan et al 43 suggest the generalized confidence interval proposed by Weerahandi. 44 Sample size calculation can be conducted using the method provided by A'Hern. 36

Reproducibility

Reproducibility concerns the consistency or precision of QIB measurements made on the same subject with the same experimental design but under different experimental conditions, such as different measurement devices. The reproducibility of QIBs for different imaging modalities, such as CT, 17 MRI, 9,12,16,19,24 and PET, 13,29 has also been extensively studied. Although many experimental factors can be included in the reproducibility study condition, it is practically impossible to consider all conditions in a single reproducibility study. Raunig et al 4 provided a list of conditions that can be tested in reproducibility studies. Depending on which condition is being tested, reproducibility studies can be classified into two categories: (1) repeated measurement designs and (2) cohort measurement designs. For example, the former can be used to study the variability caused by different scanners, while the latter can be used to study the variability caused by different study sites. Because the within-subject variability is generally embedded in the variability under different experimental conditions, a reproducibility study can generate repeatability results for each experimental condition. Specifically, in a repeated measurement design, subject i is repeatedly measured m times under each of the J experimental conditions; in a cohort measurement design, each subject is repeatedly measured m times under one of the experimental conditions. That is, a repeated measurement design requires each subject to be measured m × J times, while a cohort measurement design requires each subject to be measured only m times.
Model (3) is valid for both experimental designs, and the key performance characteristic is the sum of the random effect variances (σ²_ϵ = σ²_δ + σ²_γ + σ²_γδ), which represents the total variation under the reproducibility study condition. However, subjects in the cohort measurement design are measured under only a single experimental condition, and thus the subject-condition interaction effect (γδ)_ij does not exist. In this case, we can assume σ²_γδ = 0, and the total variation for the cohort measurement design becomes σ²_δ + σ²_γ.
Similar to repeatability studies, either linear mixed-effects models or two-way ANOVA can be used to fit the data and obtain valid estimates of σ²_δ, σ²_γ, and σ²_γδ. The square root of the total variance σ²_ϵ, denoted as the total SD (tSD) (Table 2) or reproducibility SD, can be used as a metric to quantify the magnitude of reproducibility. If estimators based on linear mixed-effects models are considered, the sampling distributions of these estimators are unknown, and numerical methods such as the bootstrap or permutation should be used to make valid inferences on these parameters. On the other hand, if ANOVA-type estimators are considered, each of the scaled quantities df·σ̂²_δ/σ²_δ, df·σ̂²_γ/σ²_γ, and df·σ̂²_γδ/σ²_γδ follows a χ² distribution with its respective degrees of freedom, and the corresponding CIs and test statistics can be easily obtained. However, because the sampling distribution of σ̂²_ϵ, which equals σ̂²_δ + σ̂²_γ + σ̂²_γδ, is unknown, numerical methods are recommended.
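For a balanced repeated measurement design, the two-way ANOVA (method-of-moments) variance component estimates and the resulting tSD can be sketched as follows. The design parameters and true variance components in the simulation are hypothetical, and negative moment estimates are truncated at zero, a common convention.

```python
import numpy as np

def variance_components(y):
    """Two-way ANOVA (method-of-moments) variance components for a
    balanced repeated measurement design under model (3).
    y : (n_subjects, J_conditions, m_replicates) array.
    Negative moment estimates are truncated at zero."""
    n, J, m = y.shape
    cell = y.mean(axis=2)                          # subject x condition means
    sub, cond, grand = cell.mean(axis=1), cell.mean(axis=0), cell.mean()
    mse = ((y - cell[:, :, None]) ** 2).sum() / (n * J * (m - 1))
    msab = m * ((cell - sub[:, None] - cond[None, :] + grand) ** 2).sum() \
        / ((n - 1) * (J - 1))
    msb = n * m * ((cond - grand) ** 2).sum() / (J - 1)
    s2_delta = mse                                 # within-subject
    s2_inter = max((msab - mse) / m, 0.0)          # subject x condition
    s2_gamma = max((msb - msab) / (n * m), 0.0)    # between-condition
    return s2_delta, s2_gamma, s2_inter

# Hypothetical study: 40 subjects, J = 3 conditions, m = 2 replicates;
# true SDs: sigma_delta = 0.3, sigma_gamma = 0.4, interaction SD = 0.2.
rng = np.random.default_rng(2)
n, J, m = 40, 3, 2
y = (rng.normal(0, 1, (n, 1, 1))                   # X_i
     + rng.normal(0, 0.4, (1, J, 1))               # gamma_j
     + rng.normal(0, 0.2, (n, J, 1))               # (gamma delta)_ij
     + rng.normal(0, 0.3, (n, J, m)))              # delta_ijk
s2d, s2g, s2gd = variance_components(y)
tsd = np.sqrt(s2d + s2g + s2gd)                    # reproducibility SD
```

Note that with only a few conditions (here J = 3), σ̂²_γ is estimated with very few degrees of freedom and is therefore highly variable, which is one reason numerical inference methods are recommended for tSD.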
In a scenario where only two experimental conditions are being compared, e.g. comparing two scanner platforms, the variance of γ_j (σ²_γ) is no longer estimable (only γ_1 and γ_2 exist). We then use the agreement between these two experimental conditions as the measure of reproducibility. For such a situation, where m = 1 for a repeated measurement design, Lin 45 proposed the concordance correlation coefficient (CCC) (Table 2) as an agreement measure of reproducibility, 46 defined as

CCC = 2 ρ_12 σ_1 σ_2 / (σ_1² + σ_2² + (μ_1 − μ_2)²),

to evaluate the QIB agreement between the two experimental conditions, where μ_1 and μ_2 (σ_1 and σ_2) are the means (standard deviations) of the measured QIB under experimental conditions 1 and 2, respectively, and ρ_12 is the Pearson correlation between the measured QIB values Y_i1 and Y_i2. Similar to the Pearson correlation, the CCC ranges between −1 and 1, with values close to 1 (or −1) representing good concordance (or good discordance) and 0 representing no correlation.
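A direct sample-based computation of the CCC follows its definition; the toy vectors below are illustrative only.

```python
import numpy as np

def ccc(y1, y2):
    """Sample concordance correlation coefficient:
    CCC = 2*cov(y1, y2) / (var1 + var2 + (mean1 - mean2)^2)."""
    y1, y2 = np.asarray(y1, dtype=float), np.asarray(y2, dtype=float)
    mu1, mu2 = y1.mean(), y2.mean()
    cov = ((y1 - mu1) * (y2 - mu2)).mean()
    return 2.0 * cov / (y1.var() + y2.var() + (mu1 - mu2) ** 2)

# Identical measurements give CCC = 1; a constant offset between the two
# conditions lowers the CCC even though the Pearson correlation stays 1.
y = np.arange(10, dtype=float)
perfect, shifted = ccc(y, y), ccc(y, y + 2.0)
```

This illustrates why CCC, unlike the Pearson correlation, penalizes systematic shifts between conditions.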
The method introduced by Lin 45 is commonly used to estimate the CCC and its CI. 47 Bland-Altman plots provide another widely used graphical assessment of agreement and measurement error. Figure 1 demonstrates Bland-Altman plots based on 100 simulated data points with X generated uniformly at random from the interval 0 to 5. When gold-standard or reference values are available (see the setting described in Obuchowski et al 32 ), the differences between the QIB measurements and the corresponding reference values are plotted against the reference values in Bland-Altman plots. Figure 1a illustrates the case of additive error (model (4) with σ = 0.8), while Figure 1b shows the case of multiplicative error (model (7) with σ = 0.3). When reference values are not available, e.g. in test-retest studies, the standard deviations (or differences, for the case of m = 2) of repeated measurements are plotted against the averages of the repeated measurements in Bland-Altman plots. Figure 1c and d show the cases of additive and multiplicative errors based on two repeated measurements and the same values of σ as in Figure 1a and b, respectively. When multiplicative errors are observed (e.g. in Figure 1b and d), log-transformed QIB measurements should be considered, as suggested in Section 3.1.
When only two experimental conditions are being compared, Bland-Altman plots can also be used in reproducibility studies, especially when gold-standard or reference values are not available. 4 In addition, a scatter plot of experimental condition 1 vs condition 2 with a fitted regression line can provide a useful visualization of the agreement between the two conditions (Figure 2). When there are more than two experimental conditions, the same procedure can be followed for each pair of conditions.
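Although Figure 1 shows full Bland-Altman plots, the underlying numerical summary, the mean difference (bias) and the conventional 95% limits of agreement (bias ± 1.96 SD of the differences), is easy to compute. The simulated additive-error data below mirror the setting of Figure 1a (σ = 0.8); the seed and sample values are arbitrary.

```python
import numpy as np

def bland_altman_limits(y1, y2):
    """Bland-Altman numerical summary for paired measurements:
    mean difference (bias) and 95% limits of agreement (bias +/- 1.96 SD)."""
    d = np.asarray(y1, dtype=float) - np.asarray(y2, dtype=float)
    bias, sd = d.mean(), d.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# 100 simulated subjects with additive error (sigma = 0.8), as in Figure 1a.
rng = np.random.default_rng(3)
x = rng.uniform(0.0, 5.0, 100)                     # true values
y1 = x + rng.normal(0.0, 0.8, 100)                 # measurement 1
y2 = x + rng.normal(0.0, 0.8, 100)                 # measurement 2
bias, (loa_low, loa_high) = bland_altman_limits(y1, y2)
```

With no systematic difference between the two measurements, the estimated bias is near zero and the limits of agreement are roughly ±1.96√2·σ.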

EXAMPLES OF THE IMPACT OF QIB MEASUREMENT ERRORS ON CLINICAL STUDIES
QIB as trial endpoint
QIBs can serve as a clinical trial endpoint to assess treatment efficacy, where subjects enrolled in the study are scanned before and after treatment, and the difference of the mean QIB measurements over the treatment course is used to determine the efficacy of the treatment. Since it is often nearly impossible to perform repeated measurements at a single time point in a longitudinal study, it is difficult to assess repeatability- and/or reproducibility-related QIB measurement errors. Following model (2), for a setting without repeated measurements, let Y_i1 and Y_i2 be the QIB measurements (or log-transformed QIB measurements) before and after intervention for subject i, respectively, and further assume that the corresponding true value X_it follows a normal distribution with mean μ_t and variance σ²_X for t = 1, 2 and i = 1, ..., n. The common approach to assess treatment efficacy is to test whether the mean difference is greater than a threshold value c so that the difference is practically meaningful (i.e. null hypothesis H_0: μ_2 − μ_1 ≤ c vs alternative hypothesis H_1: μ_2 − μ_1 > c). Under model (2), the corresponding test statistic Z is

Z = (Ȳ_2 − Ȳ_1 − c) / [2(σ²_X + σ²_ϵ)(1 − ρ)/n]^(1/2),

where Ȳ_t is the sample mean at time t and ρ is the Pearson correlation (ranging between 0 and 1) between Y_i1 and Y_i2. Thus, for a statistical significance level of α, the minimum sample size required to achieve power 1 − β is

n = 2(z_α + z_β)² (σ²_X + σ²_ϵ)(1 − ρ) / (μ_2 − μ_1 − c)², (12)

where z_α and z_β are the upper α and β quantiles of the standard normal distribution. For example, under the setting of c = 0 and μ_2 − μ_1 = 0.5, which approximately represents a 50% change against no change in the hypothesis test if the log-transformed QIB is considered, Figure 3 illustrates the required sample sizes to achieve 80% power at a significance level of 5% for different values of σ_ϵ, σ_X, and ρ. From both Figure 3 and equation (12), we note that the required sample size is an increasing function of σ²_X and σ²_ϵ, and a decreasing function of ρ.
For a longitudinal study, the correlation parameter ρ measures the level of dependence between the QIB measurements before and after treatment, with a longer time interval between the measurements generally resulting in a smaller ρ. Obuchowski et al 6 showed that the range of ρ, which depends on the time interval between the two measurements, is between 0 and σ²_X/(σ²_X + σ²_ϵ). Using equation (12), the corresponding range of the required sample size is from 2(z_α + z_β)² σ²_ϵ/(μ_2 − μ_1 − c)² (at the maximum ρ) to 2(z_α + z_β)² (σ²_X + σ²_ϵ)/(μ_2 − μ_1 − c)² (at ρ = 0). Although the required sample size n is a decreasing function of ρ and ρ is a decreasing function of the time interval between the measurements, we cannot jump to the conclusion that studies with a smaller time interval require smaller sample sizes. This is because a smaller time interval usually also results in a smaller difference between μ_1 and μ_2.
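The sample size formula (12) translates directly into a short routine. In the sketch below, z_α and z_β are taken as upper quantiles, and the example inputs (a 0.5 change on the log scale, σ_X = 1, σ_ϵ = 0.5, ρ = 0.6, c = 0) are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def sample_size(delta, sigma_x, sigma_eps, rho, alpha=0.05, power=0.80, c=0.0):
    """Minimum n from equation (12) for testing H0: mu2 - mu1 <= c, where
    the pre/post difference has variance 2*(sigma_x^2 + sigma_eps^2)*(1 - rho)."""
    z_a, z_b = norm.ppf(1.0 - alpha), norm.ppf(power)
    var_diff = 2.0 * (sigma_x ** 2 + sigma_eps ** 2) * (1.0 - rho)
    return int(np.ceil((z_a + z_b) ** 2 * var_diff / (delta - c) ** 2))

# Hypothetical inputs: delta = mu2 - mu1 = 0.5 on the log scale.
n_required = sample_size(delta=0.5, sigma_x=1.0, sigma_eps=0.5, rho=0.6)
```

Doubling the measurement error SD to σ_ϵ = 1.0 with the same inputs raises the requirement, illustrating the sample size cost of poor repeatability and reproducibility.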

QIB as predictive biomarker
In addition to serving as clinical trial endpoints, QIBs can also be used as predictive biomarkers for early prediction of treatment effect, or as intermediate endpoints in multiarm, multistage trials. 50 Under this scenario, it is usually assumed that the true value of the QIB is associated with the primary trial endpoint, but we only measure the QIB with error (see model (1)). Many methods have been proposed to adjust for measurement error in covariates in regression models. [51][52][53] However, such adjustment requires knowledge of the distribution of the measurement errors, which can only be obtained from additional studies such as repeatability or reproducibility studies. This requirement, on the one hand, emphasizes the importance of repeatability and reproducibility studies; on the other hand, it may not always be met, in which case the standard approach that ignores the measurement error is used. 54 In this study, we use simulation to illustrate the impact of measurement error when the standard approach is used.

Figure 3. Sample size required to achieve 80% power with a significance level of 5% against measurement error standard deviation σ_ϵ. The true difference μ_2 − μ_1 is 0.5, which approximately represents a 50% change for a log-transformed QIB; ρ = 0.4, 0.6, or 0.8; σ_X = 1 or 1.5; and c = 0. QIB, quantitative imaging biomarker.
Because the sample sizes in published studies where QIBs were used as predictive biomarkers are usually small, it is difficult to analytically evaluate the impact of measurement error on QIBs as predictive biomarkers. Based on the study design and assumptions, Monte Carlo simulations can be used to numerically approximate the impact of measurement error. For illustration purposes, we designed our simulations based on the study by Tudorica et al, 54 where DCE-MRI QIBs were used for early prediction of breast cancer response [pathologic complete response (pCR) vs non-pCR] to neoadjuvant chemotherapy (NACT). We denote Z_i as the indicator of pCR and X_i as the true value of a DCE-MRI QIB for subject i. The true DCE-MRI QIB X_i was generated from a normal distribution. Tudorica et al 54 provided a list of DCE-MRI QIBs with their means and SDs for pCR and non-pCR patients. Here, we considered the percent change in the QIB K_trans (transfer rate constant) after the first cycle of NACT relative to baseline (pCR: mean = −64%, SD = 9%; non-pCR: mean = −14%, SD = 41%), which showed the best predictive performance for pCR vs non-pCR in that study. 54 Consistent with the sample size of that study, 54 we included a total of 28 subjects in this simulation study. Without loss of generality, we assumed that the first five subjects are pCR patients and the remaining subjects are non-pCR patients. As noted above, we can only observe the DCE-MRI QIB with error (model (2)). The standard approach is then to fit a univariate logistic regression model using the observed DCE-MRI QIB Y_i as the covariate, i.e.

logit{P(Z_i = 1)} = b_0 + b_1 Y_i.
The area under the receiver operating characteristic curve (AUC) was used to evaluate the predictive performance of the QIB. Sample size calculation for AUC can be conducted using the formula provided by Obuchowski et al. 55 Because the effect of measurement error on AUC is still not clear, we performed a simulation study to evaluate this effect. The true K_trans percent change values were repeatedly generated 1000 times, and the average AUC across these 1000 simulated data sets and the corresponding 95% CIs were calculated. Figure 4 illustrates the average AUCs against different values of σ_ϵ (σ_ϵ = 5%, 10%, ..., 30%), the measurement error standard deviation in K_trans percent change.
Our simulation results show that the predictive performance as measured by the average AUC decreases, and the length of 95% CIs increases with increased measurement error ( σ ϵ ).
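A simplified version of this simulation can be sketched as follows. Instead of fitting the logistic regression, the sketch computes the empirical AUC of the observed QIB directly via the Mann-Whitney statistic; because a univariate logistic model is monotone in its covariate, both yield the same AUC. The class sizes (5 pCR vs 23 non-pCR) and distribution parameters follow the values quoted above, while the seed and simulation count are arbitrary choices.

```python
import numpy as np

def auc_low_positive(pos, neg):
    """Empirical AUC = P(pos < neg) + 0.5 * P(tie), i.e. the Mann-Whitney
    statistic; pCR (positive) K-trans changes are expected to be lower."""
    p = np.asarray(pos, dtype=float)[:, None]
    q = np.asarray(neg, dtype=float)[None, :]
    return (p < q).mean() + 0.5 * (p == q).mean()

def mean_auc(sigma_eps, n_sim=1000, seed=4):
    """Average AUC over n_sim simulated trials of 5 pCR (mean -64, SD 9)
    vs 23 non-pCR (mean -14, SD 41) subjects, with the observed values
    contaminated by N(0, sigma_eps^2) measurement error (model (2))."""
    rng = np.random.default_rng(seed)
    aucs = np.empty(n_sim)
    for s in range(n_sim):
        pos = rng.normal(-64.0, 9.0, 5) + rng.normal(0.0, sigma_eps, 5)
        neg = rng.normal(-14.0, 41.0, 23) + rng.normal(0.0, sigma_eps, 23)
        aucs[s] = auc_low_positive(pos, neg)
    return aucs.mean()

auc_small_err, auc_large_err = mean_auc(5.0), mean_auc(30.0)
```

Consistent with Figure 4, the average AUC with σ_ϵ = 30% is clearly lower than with σ_ϵ = 5%.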

DISCUSSION
In this review article, we provided a general introduction to the study designs, statistical models, and statistical metrics that can be used to assess the repeatability and reproducibility of QIB measurements. We also illustrated the impact of repeatability- and reproducibility-related QIB measurement errors on QIB applications, e.g. on sample size calculation when a QIB is used as a clinical trial endpoint.
The statistical models presented here assume that the measurement errors are normally distributed and independent of the true QIB values. If the measurement errors increase in proportion to the true QIB values, i.e. multiplicative errors (see model (7)), log-transformed QIB values can be used, as the relationship between error and true value becomes additive after the transformation. In practice, when QIB measurements have values equal or close to zero, a small constant can be added to all QIB values before the log transformation. For more complex error structures, such as non-Gaussian or heterogeneous measurement errors, the statistical methods introduced in this article can provide reasonable approximations of the repeatability and reproducibility metrics of interest, e.g. wSD and ICC, but statistical inferences on these estimates can be biased and may lead to false conclusions.
Test-retest studies are commonly used to study the repeatability and reproducibility of a QIB, where each object, e.g. a phantom or a human subject, is repeatedly measured. This approach may sometimes be impractical for human subject studies for reasons such as cost, time, and the invasiveness of the imaging scan. As an alternative strategy, Obuchowski et al 32 proposed a method to estimate the measurement error under the repeatability condition when a test-retest study is not feasible. The method requires that a reference (gold-standard) value be available for each subject. By assuming the reference value to be the true QIB value (X_i), it can serve as the second measured value for wSD or wCV estimation.

Repeatability and reproducibility can be part of the same study. It is possible to study repeatability in a restricted subset of a reproducibility study to ensure repeatability is acceptable, e.g. in an initial subset of subjects going into the study, to ensure the study is worth pursuing.
There is an increasing need to accelerate the clinical translation of QIBs. However, significant challenges remain. Using solid tumor therapy response as an example, the 1D imaging tumor size measurement based on the RECIST (Response Evaluation Criteria In Solid Tumors) 1.1 guidelines 56 is the only widely used QIB in today's standard of care and clinical trials. Many QIBs that interrogate tumor biology and physiology, and are thus well suited for evaluating response to increasingly used and effective molecular targeted therapies, have been difficult to translate into clinical trials and practice. This is mainly due to the variability in quantifying these QIB parameter values caused by differences in vendor imaging platforms, imaging data acquisition methods, and imaging data analysis algorithms and software tools. Because of the lack of sufficient repeatability and reproducibility studies to understand the variability of these functional QIBs, unlike the RECIST tumor size measurement, there is currently no consensus on the magnitudes of changes in these QIBs for defining clinical response endpoints such as complete response, stable disease, etc. In order to establish a path to clinical translation for functional QIBs, there is a clear need not only for standardization of data acquisition and analysis to minimize variability, 1 but also for more effort in the assessment of QIB repeatability and reproducibility. 1 It is our hope that the statistical tools presented in this article may contribute to this endeavor.