Test-retest reliability and sample size estimates after MRI scanner relocation

OBJECTIVE
Many factors can contribute to the reliability and robustness of MRI-derived metrics. In this study, we assessed the reliability and reproducibility of three MRI modalities after an MRI scanner was relocated to a new hospital facility.


METHODS
Twenty healthy volunteers (12 females, mean age (standard deviation) = 41 (11) years, age range [25-66]) completed three MRI sessions. The first session (S1) was one week prior to the 3T GE HDxt scanner relocation. The second (S2) occurred nine weeks after S1 and at the new location; a third session (S3) was acquired 4 weeks after S2. At each session, we acquired structural T1-weighted, pseudo-continuous arterial spin labelled, and diffusion tensor imaging sequences. We used longitudinal processing streams to create 12 summary MRI metrics, including total gray matter (GM), cortical GM, subcortical GM, white matter (WM), and lateral ventricle volume; mean cortical thickness; total surface area; average gray matter perfusion; and average diffusion tensor metrics along principal white matter pathways. We compared mean MRI values and variance at the old scanner location to multiple sessions at the new location using Bayesian multi-level regression models. K-fold cross validation allowed identification of important predictors. Whole-brain analyses were used to investigate any regional differences. Furthermore, we calculated within-subject coefficient of variation (wsCV), intraclass correlation coefficient (ICC), and Dice similarity index (SI) of cortical segmentations across scanner relocation and within-site. Additionally, we estimated sample sizes required to robustly detect a 4% difference between two groups across MRI metrics.


RESULTS
All global MRI metrics exhibited little mean difference and small variability (bar cortical gray matter perfusion) both across scanner relocation and within-site repeat. T1- and DTI-derived tissue metrics showed <|0.3|% mean difference and <1.2% variance across scanner location and <|0.4|% mean difference and <0.8% variance within the new location, with between-site intraclass correlation coefficient (ICC) > 0.80 and within-subject coefficient of variation (wsCV) < 1.4%. Mean cortical gray matter perfusion had the highest between-session variability (6.7% [0.3, 16.7], estimate [95% uncertainty interval]), and hence the smallest ICC (0.71 [0.44, 0.92]) and largest wsCV (13.4% [5.4, 18.1]). No global metric exhibited evidence of a meaningful mean difference between scanner locations. However, surface area showed evidence of a mean difference across the within-site repeat (between S2 and S3). Whole-brain analyses revealed no significant areas of difference across scanner relocation or within-site. For all metrics, we found no support for a systematic difference in variance across relocation sites compared to within-site test-retest reliability. Necessary sample sizes to detect a 4% difference between two independent groups varied from a maximum of n = 362 per group (cortical gray matter perfusion), to total gray matter volume (n = 114), average fractional anisotropy (n = 23), total gray matter volume normalized by intracranial volume (n = 19), and axial diffusivity (n = 3 per group).


CONCLUSION
Cortical gray matter perfusion was the most variable metric investigated (necessitating large sample sizes to identify group differences), with other metrics showing substantially less variability. Scanner relocation appeared to have a negligible effect on variability of the global MRI metrics tested. This manuscript reports within-site test-retest variability to act as a tool for calculating sample size in future investigations. Our results suggest that when all other parameters are held constant (e.g., sequence parameters and MRI processing), the effect of scanner relocation is indistinguishable from within-site variability, but may need to be considered depending on the question being investigated.


Introduction
The importance of, and interest in, magnetic resonance imaging (MRI) techniques to glean information about the state of the brain in health and disease cannot be overstated. However, to provide useful information for researchers and clinicians, it is imperative that MRI-derived metrics are accurate and reproducible. Encouragingly, measures of cortical thickness and gray matter volume derived from T1-weighted MRI, resting arterial spin labeling (ASL) perfusion imaging, and diffusion tensor imaging (DTI) metrics generally show good reproducibility and reliability (Boekel et al., 2017; Dickerson et al., 2008; Hodkinson et al., 2013; Iscan et al., 2015; Madan and Kensinger, 2017; Madhyastha et al., 2014; Shahim et al., 2017; Ssali et al., 2016; Vollmar et al., 2010).
While previous literature suggests that structural MRI, PCASL, and DTI metrics are reliable and reproducible across varied circumstances, to the best of our knowledge, there is no published information about the effect of scanner relocation on these sequences. Recently, the MRI scanner we utilize in a number of ongoing research studies was relocated from a private radiology clinic to a new hospital. This involved completely ramping down the scanner magnet, removal from one shielded MRI suite, transport across the city to the new facility, installation in the new MRI environment, re-establishing the superconducting magnet, and a physical re-shim by the vendor. While no hardware or software was changed during the move, we were unsure of the influence the move and new scanning environment would have on an established neuroimaging protocol. Thus, in this study we assessed the reliability and reproducibility of three MRI modalities (T1-weighted, PCASL, and DTI) across scanner relocation. Specifically, we compared both global and regional MRI values and variance at the old scanner location to multiple sessions at the new location. Furthermore, using the variance observed in these global MRI metrics, we estimated sample sizes needed to robustly detect a 4% difference between two groups across the various MRI metrics.

Material and methods
Twenty healthy volunteers (12 females, mean age (standard deviation) = 41 (11) years, age range [25-66]) provided written informed consent. Participants completed three MRI sessions. The first session (S1) occurred one week prior to scanner relocation; the second (S2) occurred nine weeks after S1 and at the new location, while the third session (S3) occurred 4 weeks after S2. This design allowed us to investigate variability associated with moving the scanner (S1-S2), as well as standard between-session test-retest reliability (S2-S3).

Pseudo-continuous arterial spin labelling
A stack of spiral, fast spin echo acquired images was prepared with pseudo-continuous arterial spin labelling and background suppression to measure whole brain perfusion quantitatively (Dai et al., 2008): TR = 6 s, echo spacing = 9.2 ms, post-labelling delay = 1.525 s, labelling duration = 1.5 s, eight interleaved spiral arms with 512 samples at 62.5 kHz bandwidth and 30 phase encoded 5 mm thick slices, NEX = 5. Participants were asked to close their eyes during the PCASL acquisition.

Structural MRI processing
To extract estimates of cortical thickness, surface area, and volume, T1-weighted images were processed using the longitudinal stream in Freesurfer (v6.0.0; http://surfer.nmr.mgh.harvard.edu/) (Reuter et al., 2012). An unbiased within-subject template space and image (Reuter and Fischl, 2011) were created using robust, inverse consistent registration (Reuter et al., 2010). Further processing included skull stripping, Talairach transforms, atlas registration as well as spherical surface maps and parcellations, which were initialized with common information from the within-subject template. We extracted global metrics (mean cortical thickness, total surface area, total gray matter (GM) volume, cortical and subcortical GM volume, total white matter (WM) volume, and lateral ventricle volume) in each participant at each timepoint; regional thickness and surface area were extracted from the Desikan-Killiany Freesurfer parcellations at each timepoint (Desikan et al., 2006). To facilitate whole-brain investigation of regional test-retest reliability, we created difference maps normalized to S2: (S2-S1)/S2 for the between-site difference and (S2-S3)/S2 for the within-site difference. These maps were warped to fsaverage space and smoothed with a circularly symmetric Gaussian kernel across the surface with a full width at half maximum of 10 mm. Hippocampal volume was calculated using the longitudinal hippocampal subfield segmentation stream (Iglesias et al., 2016).
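As an illustration of the normalized difference maps described above, the following minimal sketch computes (S2-S1)/S2 and (S2-S3)/S2 element-wise. The function name and the thickness values are hypothetical and are not taken from the study's analysis code; real maps would hold one value per vertex after warping to a common surface.

```python
def normalized_difference(reference, other):
    """Element-wise (reference - other) / reference, e.g. (S2 - S1) / S2."""
    if len(reference) != len(other):
        raise ValueError("maps must have the same number of elements")
    # Guard against division by zero at elements with no signal in the reference.
    return [(r - o) / r if r != 0 else float("nan")
            for r, o in zip(reference, other)]


# Hypothetical cortical thickness values (mm) at four vertices per session.
s1 = [2.50, 2.40, 3.10, 2.00]
s2 = [2.50, 2.50, 3.00, 2.00]
s3 = [2.45, 2.50, 3.00, 2.10]

between_site = normalized_difference(s2, s1)  # (S2 - S1) / S2
within_site = normalized_difference(s2, s3)   # (S2 - S3) / S2
```

Normalizing to S2 expresses both differences relative to the same reference session, so the between-site and within-site maps share a common scale.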

PCASL processing
At each timepoint, quantified cerebral blood perfusion images were co-registered to the structural image using Freesurfer's bbregister (with default values). These registration parameters were combined with the structural longitudinal warping parameters to register the perfusion image to the within-subject template. To account for any remaining misalignment, perfusion images were smoothed (using FSL, with sigma = 2). Global, regional cortical (from Desikan-Killiany parcellation after projection onto the cortical surface), and subcortical perfusion values were extracted for each subject at each timepoint. We also created normalized difference maps ((S2-S1)/S2 and (S2-S3)/S2) that were warped to fsaverage space.

DTI processing
Preprocessing was performed similarly to previous longitudinal DTI studies using FSL (v5.0.9) (Engvig et al., 2012; Madhyastha et al., 2014; Melzer et al., 2015). At each timepoint, this included motion and eddy-current distortion correction; rotation of the b matrix accordingly; motion quantification via root mean square deviation between each pair of realigned diffusion images and averaging over all pairs to create a single, 'relative' motion metric; smoothing (using fslmaths with a 1 voxel box kernel and -fmedian flag), which has been shown to increase reliability (Madhyastha et al., 2014); brain extraction; and fitting a diffusion tensor to produce fractional anisotropy (FA), mean diffusivity (MD), axial diffusivity (AD, the principal diffusion eigenvalue), and radial diffusivity (RD, the mean of the second and third eigenvalues) images. For each participant, we used mri_robust_template (Reuter et al., 2012) (part of Freesurfer) to create an unbiased, within-subject FA median template and to robustly register FA/MD/AD/RD images from S1, S2, and S3 to this within-subject template. The within-subject FA template was then entered into a tract-based spatial statistics (TBSS; Smith et al., 2006) analysis to create the group-wise FA skeleton (thinned at FA > 0.25), representing the centers of all tracts common to all participants. Midpoint-space registered FA images (S1, S2, and S3) were projected onto the skeleton to create FA skeletons containing S1, S2, and S3 values. Using the transforms derived from the FA procedure, we created separate MD, AD, and RD skeletons at each timepoint. Average DTI metrics (FA/MD/AD/RD) along the skeleton, and regionally within 17 white matter tracts (defined by the JHU ICBM-DTI-81 White-Matter Labels atlas (Mori et al., 2008)), were extracted for each participant at each timepoint. Lastly, we created normalized difference maps along the skeleton ((S2-S1)/S2 and (S2-S3)/S2) for each DTI metric.
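For readers less familiar with the four tensor-derived scalars named above, the sketch below computes them from the three eigenvalues of a single diffusion tensor using the standard definitions (AD as the principal eigenvalue, RD as the mean of the second and third, MD as the mean of all three). The function name and the eigenvalues are illustrative assumptions; in the study these images were produced by the FSL tensor fit, not by code like this.

```python
import math


def dti_scalars(l1, l2, l3):
    """Return FA, MD, AD and RD from eigenvalues ordered l1 >= l2 >= l3."""
    md = (l1 + l2 + l3) / 3.0   # mean diffusivity
    ad = l1                     # axial diffusivity: principal eigenvalue
    rd = (l2 + l3) / 2.0        # radial diffusivity: mean of 2nd and 3rd
    num = (l1 - md) ** 2 + (l2 - md) ** 2 + (l3 - md) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    # Fractional anisotropy: 0 for isotropic diffusion, approaching 1
    # when diffusion is confined to a single direction.
    fa = math.sqrt(1.5 * num / den) if den > 0 else 0.0
    return {"FA": fa, "MD": md, "AD": ad, "RD": rd}


# Hypothetical eigenvalues (mm^2/s) for a coherent white matter voxel
# and for an isotropic voxel.
white_matter = dti_scalars(1.7e-3, 0.3e-3, 0.2e-3)
isotropic = dti_scalars(1.0e-3, 1.0e-3, 1.0e-3)
```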

Global MRI metric analysis
We assessed each of 12 global metrics independently for reliability: total GM, cortical GM, subcortical GM, total WM, and lateral ventricle volume; average cortical thickness and total surface area; average cortical perfusion; and average FA, MD, AD, and RD along the skeleton. Bayesian models were used for analysis and were fit using the 'brms' (v2.9.0) package (Buerkner, 2017; Carpenter et al., 2017) in R (v3.6.2). In each model, four chains with 10000 iterations each were used to generate posterior samples. In the baseline model an intercept, varying intercept by subject, session (referenced to S2, the first scan at the new location), and between-session variance by subject were included. To determine whether allowing between-session variance to differ between the scanner relocation and within-site repeat helped explain variance in the data, we performed model comparison using k-fold cross validation to estimate the expected log pointwise predictive density (ELPD) (Vehtari et al., 2017). The initial model fit (which included a constant variance across all sessions) was compared to a second model in which variance was allowed to vary between the scanner locations. A higher ELPD indicated the model provided a better fit to the data. The standard error of the difference in ELPD between models gave a measure of the uncertainty. When the difference in ELPD was at least twice the standard error of the estimated difference, we took this as reasonable evidence that the better-performing model should be preferred. ELPD values are reported relative to the initial model (where variance across scanner location was held constant).
To account for the multiple models and comparisons used to investigate the 12 global metrics, we included a Student-t prior to shrink estimates in all models, thereby reducing the potential multiple comparisons issue (degrees of freedom = 3, mean = 0, standard deviation = 0.5% of the average of the specific MRI metric at S2).
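The model-comparison rule and the prior scale described above can be written down compactly. The sketch below is only an illustration of that decision logic: the function names and the example numbers are hypothetical, and it does not fit any model (the models themselves were fit with brms in R).

```python
def prefer_model(elpd_diff, se_diff, threshold=2.0):
    """Classify an ELPD difference (candidate minus baseline model).

    A difference of at least `threshold` standard errors is treated as
    reasonable evidence for the better-performing model.
    """
    if abs(elpd_diff) < threshold * se_diff:
        return "no clear preference"
    return "candidate" if elpd_diff > 0 else "baseline"


def prior_scale(mean_metric_s2, fraction=0.005):
    """Scale of the shrinkage prior: 0.5% of the metric's average at S2."""
    return fraction * mean_metric_s2


# Hypothetical ELPD differences and standard errors.
small_gain = prefer_model(1.2, 1.5)    # within 2 SE of zero
clear_gain = prefer_model(5.0, 2.0)    # more than 2 SE above zero
clear_loss = prefer_model(-6.0, 2.0)   # more than 2 SE below zero

# Hypothetical average cortical thickness at S2 of 2.5 mm.
thickness_prior_sd = prior_scale(2.5)
```

Expressing the prior scale as a fraction of each metric's S2 average keeps the amount of shrinkage comparable across metrics measured in very different units.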

Additional test-retest metrics
To further investigate reliability, we also calculated within-subject coefficient of variation (wsCV) (Bland and Altman, 1996), the intraclass correlation coefficient (ICC) (McGraw and Wong, 1996), and the Dice similarity index (SI) (Dice, 1945). SI was used to quantify the similarity between cortical segmentations across imaging sessions. For ICC, we used a two-way mixed effects model with absolute agreement (participant was modelled as a random effect and scanner as a fixed effect), using the R packages brms and sjstats (v0.16.0, https://doi.org/10.5281/zenodo.1284472).
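To make the three test-retest metrics concrete, the following sketch implements one common formulation of each: a root-mean-square wsCV, the ANOVA-based single-measurement, absolute-agreement ICC (ICC(A,1) in McGraw and Wong's notation), and the Dice index on sets of labelled elements. These are assumptions for illustration; the study estimated ICC within a Bayesian framework using brms and sjstats, so this classical ANOVA version is a stand-in rather than the authors' implementation, and all data values are hypothetical.

```python
import math


def ws_cv(data):
    """Root-mean-square within-subject CV; one list of repeats per subject."""
    ratios = []
    for scores in data:
        m = sum(scores) / len(scores)
        var = sum((x - m) ** 2 for x in scores) / (len(scores) - 1)
        ratios.append(var / m ** 2)  # squared CV for this subject
    return math.sqrt(sum(ratios) / len(ratios))


def icc_a1(data):
    """Two-way, absolute-agreement, single-measurement ICC from mean squares."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    msr = ss_rows / (n - 1)                                        # subjects
    msc = ss_cols / (k - 1)                                        # sessions
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))     # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def dice(a, b):
    """Dice similarity index of two non-empty sets of labelled elements."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical mean cortical thickness (mm): four subjects, two sessions.
thickness = [[2.50, 2.52], [2.40, 2.38], [2.60, 2.61], [2.30, 2.33]]
cv = ws_cv(thickness)
icc = icc_a1(thickness)
overlap = dice({1, 2, 3}, {2, 3, 4})
```

In this toy example the between-subject spread is much larger than the session-to-session differences, so the ICC is close to 1 and the wsCV is well under 1%.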

Sample size estimates
Based on the observed within-site variance, we performed frequentist sample size calculations with 80% power at alpha = 0.05 to detect a 4% difference in all global MRI metrics; sample sizes were also calculated for global volumes normalized by ICV. Sample size was calculated for a difference between two independent groups, assuming equal group sizes and a two-tailed significance test at p < 0.05 (i.e., a 5% type I error rate) (Mak et al., 2017).
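A calculation of this kind can be sketched with the standard normal-approximation formula for comparing two independent means, n = 2 (z_{1-alpha/2} + z_{power})^2 (SD / difference)^2 per group. The sketch below assumes the variability is expressed as a coefficient of variation (SD divided by the mean) so that it can be compared directly with a 4% relative difference; that parameterization, the function name, and the example CVs are assumptions, and the exact procedure used in the study may differ.

```python
import math
from statistics import NormalDist


def n_per_group(cv, rel_diff=0.04, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample test (normal approximation).

    cv: standard deviation as a fraction of the mean (assumed input).
    rel_diff: group difference to detect, as a fraction of the mean.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * (cv / rel_diff) ** 2)


# Hypothetical coefficients of variation.
n_low_variability = n_per_group(0.04)   # 16 per group
n_high_variability = n_per_group(0.08)  # 63 per group
```

Because the required n scales with the square of the variability, doubling the CV roughly quadruples the sample size, which is why the more variable metrics demand far larger groups for the same 4% difference.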

Data availability
Analysis code and data are available at https://github.com/nzbri/scanner-test-retest.

Results
One of 20 participants (12 females, mean age (standard deviation) = 41 (11) years, age range [25-66]) did not complete S3, leaving a total of 59 imaging sessions. No structural or PCASL images were excluded; however, three DTI datasets (from two subjects) were excluded due to excessive motion (defined as mean relative motion greater than three times the standard deviation of the group).
We observed no evidence of a difference in mean MRI metrics between scanner locations; however, there was evidence that surface area (SA) varied within-site (from S2 to S3). Fig. 1 and Table 1 present global MRI metrics across sessions. Across all MRI metrics, there was no indication that allowing between-session variance to be different between scanner locations improved the model fit (Table 2). That is, we found no evidence of a systematic difference in variance pre- or post-relocation. Furthermore, ICC was high and within-subject coefficient of variation was low across all metrics (Table 1).

Global results
All T1-derived metrics demonstrated excellent reproducibility and reliability across scanner relocation and within-site repeat, with no evidence of a difference in SI between cortical segments at S1 or S3 relative to S2. While we estimated a 0.0% change [-0.3, 0.3] in SA across scanner relocation, we did observe a 0.3% [0.1, 0.6] increase in SA across the within-site repeat (S3 > S2).

Regional results
Regional CTh and SA from Desikan-Killiany cortical parcellations across sessions are presented in Supplementary Figs. 1 and 2, respectively; subcortical volumes are presented in Supplementary Fig. 3. Vertex-wise, we identified no significant differences in CTh or SA between or within-site repeat.

Fig. 1. MRI metrics for all participants are displayed relative to session 2 (S2, the first session at the new location). All individual trajectories (n = 20) are displayed in gray, while the mean at each session is overlaid in color. Each panel displays one of the 12 MRI metrics investigated. CBF demonstrates the largest variability, while the other 11 metrics demonstrate robust consistency across relocation and within-site test-retest reliability. S1 = session 1 (scanner at old location), S2 = session 2 (first scan at scanner's new location), S3 = session 3 (second scan at scanner's new location), Cortical Vol = total cortical gray matter volume, SubCort Vol = total subcortical gray matter volume, GM Vol = total gray matter volume, WM Vol = total white matter volume, Vent Vol = total lateral ventricular volume, CTh = average cortical thickness, SA = total surface area, CBF = average cortical gray matter cerebral blood flow (perfusion), FA = average fractional anisotropy along principal white matter tracts, MD = average mean diffusivity along principal white matter tracts, AD = average axial diffusivity along principal white matter tracts, RD = average radial diffusivity along principal white matter tracts.

Fig. 1). Consequently, there was no evidence of a significant systematic difference in mean signal nor any difference in variance between sites or for the repeat.

Regional results
Regional cortical and subcortical mean perfusion values across sessions are presented in Supplementary Figs. 4 and 5, respectively. We identified no areas of significantly different perfusion across the cortex between scanner relocation or within-site repeat.

Table 1. Reliability and reproducibility across global MRI metrics. Session S2, the first session at the new scanner location, was the reference. Thus, the population estimate is the estimate for each MRI metric at the new location. The 'PreMove' population estimate quantifies the mean difference at the old scanner location (S1), as a percentage of the reference population estimate (the new scanner location, S2); this quantifies the mean difference due to scanner relocation. The 'Repeat' population estimate quantifies the mean difference between a repeat scan at the new location (S3), relative to the first session at the new location (S2). PreMove between-session variability quantifies the standard error between S1 and S2, as a percentage of the population estimate; repeat variability quantifies S2 and S3. 95% uncertainty intervals are reported for population estimate, between-session variability, and ICC; 95% confidence intervals are reported for wsCV. Note: total GM, cortical GM, subcortical GM, WM, ventricular volume, and SA were normalized to intracranial volume for calculation of ICC. AD = axial diffusivity, FA = fractional anisotropy, GM = gray matter, ICC = intraclass correlation coefficient, MD = mean diffusivity, RD = radial diffusivity, WM = white matter, wsCV = within-subject coefficient of variation.
T.R. Melzer et al. NeuroImage 211 (2020) 116608

DTI results

Global results
DTI metrics were highly reproducible and reliable. Fig. 1 also shows relative skeletal FA, MD, AD, and RD across the three scanning sessions. We observed estimated mean difference between sites and variability between scanner relocation and within-site repeat to be <0.3% for all metrics (Table 1).

Regional results
Regional DTI metrics from white matter tracts are presented in Supplementary Figs. 6-9. Corrected voxel-wise analyses along the white matter skeleton showed no differences in DTI metrics between relocation or within-site repeat.

Sample size calculations
Estimated sample sizes required to detect a 4% difference between two independent groups are summarized in Fig. 2.
Fig. 2. Necessary sample size to detect a 4% difference between two independent groups given the observed variance in different global MRI metrics, at 80% power, and a 2-sided 0.05 level of significance. AD = average skeletal axial diffusivity, RD = average skeletal radial diffusivity, MD = average skeletal mean diffusivity, CTh = average cortical thickness, SA = total surface area, /ICV = indicates that the metric has been divided by intracranial volume, GM Vol = total gray matter volume, FA = average fractional anisotropy, Cortical Vol = total cortical gray matter volume, WM Vol = total white matter volume, Subcort Vol = total subcortical volume, CBF = average cortical gray matter perfusion. Note: Lateral ventricular volume has been excluded from this plot as the necessary sample size (2686) was too large to appropriately display on the same axes.

Volume-based processing results
Test-retest reliability of the four global metrics processed using volumetric, CAT12-derived processing is presented in Supplementary Fig. 10 and Supplementary Table 1. As with the surface-based processing, we found no evidence of a systematic difference in variance pre- or post-relocation (Supplementary Table 2). Of the four metrics derived using CAT12 processing, only WM volume exhibited evidence for a small difference in mean volume (-0.5% [-0.9, -0.04]) between scanner locations (see Supplementary Results).

Discussion
An MRI scanner we use for ongoing longitudinal research was relocated from a private radiology clinic to a new hospital. In this study, we found that structural, perfusion, and diffusion MRI metrics showed good to excellent reproducibility and reliability both globally and regionally. We observed no evidence for a systematic difference in mean signal (bar a small change within-site for total SA) nor any difference in variance across scanner relocation or within-site repeat. Both between-site and within-site test-retest metrics for volume, thickness, and surface area (including mean difference, ICC, wsCV, and SI) were equivalent to or higher than those reported in the within-site literature (Eggert et al., 2012; Iscan et al., 2015; Jovicich et al., 2013; Madan and Kensinger, 2017), which suggests that these structural MRI metrics are not necessarily compromised by relocation of the MRI scanner.
While we were not necessarily powered to investigate regional variability, we did so in a conservative manner. Neither vertex-based nor voxel-based analysis revealed any region-specific variation between-site or within-site. Our global results suggest there was no global bias as a consequence of the scanner relocation; these regional results, supported by Supplementary Figs. 1-9, suggest that there were also no additional region-specific differences in test-retest reliability.
Table 2. ELPD ± standard error. A positive difference indicates a better fit than the baseline model and a negative difference, a worse fit (generally indicating overfitting). When this difference is greater than twice the standard error, there is reasonable evidence the better-performing model should be preferred. In all cases there was no strong evidence that a model that allowed a predictor for each session or the between-session variance to be different improved the model fit, and in many cases there was evidence that the simpler model was preferred. 'Common' modelled the variance between scanner relocation and repeat with a single parameter; 'Separate' allowed the variance between scanner relocation and repeat to vary. GM = Gray matter, WM = White matter, Ventricle = lateral ventricular volume, CTh = Cortical thickness, SA = Surface area, CBF = cerebral blood flow, FA = fractional anisotropy, MD = mean diffusivity, AD = axial diffusivity, RD = radial diffusivity, ELPD = Estimated log pointwise predictive density.

Mean perfusion from the PCASL acquisition had the highest between-session variability, and hence the smallest ICC and largest within-subject coefficient of variation. This variability may be associated with the nature of the PCASL acquisition; the measurement is a result of a difference image, which is susceptible to motion and low signal to noise ratio (Alsop et al., 2015; Dai et al., 2008). Additionally, this high variability likely reflects the more functional nature of perfusion imaging. As opposed to structural metrics, such as T1 or diffusion MRI, perfusion imaging is capable of capturing differences associated with short term brain state. This is both an advantage and a challenge. For example, scanning with eyes open versus eyes shut produces large perfusion differences in the visual cortex. Here, we asked participants to close their eyes; however, we did not monitor compliance. Circadian rhythm has also been shown to associate with cerebral perfusion (Hodkinson et al., 2014). We therefore attempted to scan participants at the same time of day over the three sessions, but there were situations where this was not practical. Both chronic and acute caffeine intake can change perfusion values by up to 24% (Clement et al., 2018). Acquisition at different times of day, with varying levels of caffeine intake, in addition to any of the other 58 documented perfusion modifiers (such as levels of drowsiness (Poudel et al., 2012), changes in mood, alcohol use, and temporal proximity to physical exercise), may help explain the surprisingly high PCASL variability (Clement et al., 2018). This greater variability suggests that larger effect sizes or larger sample sizes are required in order to identify effects of interest, and that subtle effects may be extremely difficult to identify robustly with PCASL. Despite the higher variability of PCASL relative to T1- and DTI-derived metrics, the technique still exhibits much better reliability than both task-fMRI and fMRI-based functional connectivity measures. A recent study that performed a meta-analysis and an empirical test-retest reliability analysis of 11 commonly used task-fMRI experiments in the Human Connectome Project and the Dunedin Longitudinal Study reported poor reliability, with ICCs = 0.067-0.485 (Elliott et al., 2019). Similarly, a recent systematic review and meta-analysis of functional connectivity reported that on average, individual connections exhibited a "poor" ICC of 0.29 (Noble et al., 2019). In the current study, within-site cortical GM perfusion ICC was 0.81 [0.63, 0.95], which still falls within the "excellent" range (>0.75).
Similar to previous work directly comparing PCASL to task-fMRI and resting state fMRI (Holiga et al., 2018), our results suggest that PCASL should not be abandoned and provides much higher test-retest reliability than either task-fMRI or resting state functional connectivity.
DTI with TBSS processing also exhibited excellent reliability in terms of both bias and variance across relocation and test-retest within the new location. Substantial variability in DTI metrics has been reported across different scanners, acquisition protocols, and processing methodologies (Magnotta et al., 2012; Palacios et al., 2017; Shahim et al., 2017; Vollmar et al., 2010; Wang et al., 2012). However, metrics appear highly reproducible when test-retest reliability is examined on the same scanner, with the same acquisition, and same processing protocol, with wsCV generally below 7% (Shahim et al., 2017; Vollmar et al., 2010; Zhou et al., 2018), and down to 0.73% (FA) and 0.16% (MD) (Palacios et al., 2017), and ICCs > 0.9 (Boekel et al., 2017) when global values from TBSS on a single scanner were used (Jovicich et al., 2013). With our consistent acquisition and processing, all investigated DTI metrics showed equivalent reliability both across and within relocation sites, suggesting that the scanner relocation had a negligible effect on average DTI values both regionally and globally along the white matter skeleton.
Calculated sample sizes required to robustly identify a 4% difference between two groups varied widely, depending on MRI metric (Fig. 2). For example, in order to show a 4% difference in cortical gray matter perfusion, groups of 362 participants per arm would be needed; for the same mean difference, only 23 participants per group would be needed for average FA and only 9 per group for cortical thickness. These dramatically different sample size requirements further highlight the real-world consequences of metrics exhibiting higher variance. Regional brain volumes are frequently normalized by some measure of intracranial volume to reduce variability (Hansen et al., 2015;Whitwell et al., 2001). Our sample size calculations demonstrate and reinforce this important methodological procedure; while 114 participants per group would be needed if total gray matter volume were used, normalizing by ICV provides much more power, requiring only 19 per group.
A systematic difference between relocation means of 0.1% in total GM volume (from the current study) will most likely not change interpretations of group-level cross-sectional studies where large effect sizes are expected (e.g. in advanced Parkinson's disease (Melzer et al., 2012)). Even in situations where effects of interest are relatively small (for example, tracking aging in a group of 60-year-olds, with an expected total GM volume loss in healthy aging of ~0.45% per year (Schippling et al., 2017)), this systematic bias may not significantly influence interpretations. Our results suggest that systematic bias as a consequence of scanner relocation appears negligible. However, this type of scanner change could (and should) be included in statistical modelling when possible (Lee et al., 2019). We did not investigate the reliability of more advanced diffusion acquisitions and models, and are therefore unable to comment on the interesting question of reliability of high angular resolution diffusion imaging (HARDI) sequences (Zhou et al., 2018).
The surface-based (Freesurfer) and volume-based (CAT12, Supplementary Material) processing streams showed very similar results. For example, the population estimate, mean difference, and between-session variability were virtually identical for both estimates of total GM volume and average GM perfusion. However, we observed a small effect of scanner relocation for CAT12-processed WM volume (-0.5% [-0.9, -0.04]), but the difference for Freesurfer-derived WM volume between sites (0.2% [-0.9, 0.4]) did not meet our statistical threshold. Freesurfer-derived metrics required smaller sample sizes per group than the equivalent CAT12-derived metrics (Supplementary Results). Freesurfer-derived total GM volume divided by ICV required n = 19 per group to robustly identify a 4% difference between groups, while the CAT12-derived equivalent required n = 45 per group, a 2.4-fold increase. This suggests that the longitudinal Freesurfer pipeline allows for smaller sample sizes (Supplementary Discussion).
There may be instances in the future when a scanner needs to be relocated. Here we provide an indication of the changes in MRI metrics that can be expected. With appropriate vendor support and re-commissioning of the scanning environment (re-shim), we have shown that the change in MRI metrics across relocation appears equivalent to within-site test-retest reliability. Beyond relocation, we provide estimates of test-retest reliability (both mean and variance), which allow others to estimate the variance of specific metrics, facilitating power analyses. This manuscript also reinforces the robustness of T1 and DTI metrics, but highlights the high variability of cortical perfusion. This is important to consider when comparing the relative importance of structural versus PCASL-based investigations and the influence these have on interpretation of results. Despite this higher variability for the PCASL acquisition, it still compares favorably to task-fMRI and functional connectivity in terms of reliability.
In this work, we investigated the effect of MRI scanner relocation on the reliability of global and regional MRI metrics derived from structural, perfusion, and diffusion acquisitions. Cortical gray matter perfusion was the most variable metric investigated, with all other metrics showing substantially less variability. We observed no evidence for a systematic difference in mean signal or variance, either across scanner relocation or at within-site repeat (bar a small mean difference in surface area within-site). These results suggest that when all other parameters are held constant (e.g., sequence parameters and MRI processing), the effect of scanner relocation is small, but may need to be considered depending on the question at hand.

Data/code availability statement
Analysis code and data are available at https://github.com/nzbri/scanner-test-retest.

Ethics
Twenty healthy volunteers provided written informed consent.

Funding
Research reported in this publication was supported by Pacific Radiology Christchurch. This work was performed while TRM was supported by a Sir Charles Hercus Early Career Development Fellowship from the Health Research Council of New Zealand (17/039).

Role of funder
The funders had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.