Multi-centre, multi-vendor reproducibility of 7T QSM and R2* in the human brain: Results from the UK7T study

Introduction We present the reliability of ultra-high field T2* MRI at 7T, as part of the UK7T Network's “Travelling Heads” study. T2*-weighted MRI images can be processed to produce quantitative susceptibility maps (QSM) and R2* maps. These reflect iron and myelin concentrations, which are altered in many pathophysiological processes. The relaxation parameters of human brain tissue are such that R2* mapping and QSM show particularly strong gains in contrast-to-noise ratio at ultra-high field (7T) vs clinical field strengths (1.5–3T). We aimed to determine the inter-subject and inter-site reproducibility of QSM and R2* mapping at 7T, in readiness for future multi-site clinical studies. Methods Ten healthy volunteers were scanned with harmonised single- and multi-echo T2*-weighted gradient echo pulse sequences. Participants were scanned five times at each “home” site and once at each of four other sites. The five sites had 1× Philips, 2× Siemens Magnetom, and 2× Siemens Terra scanners. QSM and R2* maps were computed with the Multi-Scale Dipole Inversion (MSDI) algorithm (https://github.com/fil-physics/Publication-Code). Results were assessed in relevant subcortical and cortical regions of interest (ROIs) defined manually or by the MNI152 standard space. Results and Discussion Mean susceptibility (χ) and R2* values agreed broadly with literature values in all ROIs. The inter-site within-subject standard deviation was 0.001–0.005 ppm (χ) and 0.0005–0.001 ms−1 (R2*). For χ this is 2.1–4.8 fold better than 3T reports, and 1.1–3.4 fold better for R2*. The median ICC from within- and cross-site R2* data was 0.98 and 0.91, respectively. Multi-echo QSM had greater variability vs single-echo QSM especially in areas with large B0 inhomogeneity such as the inferior frontal cortex. Across sites, R2* values were more consistent than QSM in subcortical structures due to differences in B0-shimming. On a between-subject level, our measured χ and R2* cross-site variance is comparable to within-site variance in the literature, suggesting that it is reasonable to pool data across sites using our harmonised protocol. Conclusion The harmonized UK7T protocol and pipeline delivers on average a 3-fold improvement in the coefficient of reproducibility for QSM and R2* at 7T compared to previous reports of multi-site reproducibility at 3T. These protocols are ready for use in multi-site clinical studies at 7T.


Introduction
Neurodegenerative diseases are a significant global health burden. In many instances, neurodegeneration is associated with the deposi-multi-site study with a dementia cohort to assess feasibility in patient groups. Imaging as part of the C-MORE study (Capturing the MultiORgan Effects of  is also including harmonized multi-centre sequences which might provide insights into the long-term impact in survivors of . Yet, in order to perform such multi-centre studies, it is necessary to first guarantee the consistency and reproducibility of imaging markers. A popular approach to estimating iron concentration in the human brain uses gradient-echo (GE) magnetic resonance imaging (MRI). In grey matter, iron is mainly found in the protein ferritin which, due to its antiferromagnetic core and the presence of uncompensated spins at the surface or in the core, exhibits a superparamagnetic behaviour ( Makhlof et al., 1997 ;Langkammer et al., 2012 ). This paramagnetic iron interacts with the MRI scanner's static magnetic field (B 0 ) causing local dipolar field perturbations. These accentuate the rate of transverse signal decay causing T 2 * relaxation in surrounding tissue, which is visible as decreasing signal amplitude with increasing echo time in a series of GE images. This effect causes an increase in the rate of transverse relaxation, R 2 * , which correlates well with non-heme iron concentrations in grey matter ( Gelman et al., 1999 ;Langkammer et al., 2010 ), and has been used to investigate the distribution of iron in the healthy brain and in disease ( Haacke et al., 2005 ;Yao et al., 2009 ;Li et al., 2019 ).
The local presence of iron (and to a lesser extent myelin and calcium) also affects the signal phase of GE images because of the effect of the field perturbation on the local Larmor frequency ( House et al., 2007 ;He et al., 2009 ;Lee et al., 2012 ). Quantitative Susceptibility Mapping (QSM) methods attempt to deconvolve these dipole phase patterns to identify the sources of the magnetic field inhomogeneity. In other words, QSM estimates quantitative maps of tissue magnetic susceptibility from GE phase data ( Li and Leigh, 2004 ;Wang and Liu, 2015 ). This approach has shown sensitivity to several neurological conditions ( Lotfipour et al., 2012 ;Acosta-Cabronero et al., 2013 ;Blazejewska et al., 2015 ;Acosta-Cabronero et al., 2016 ) and offers advantages over magnitude R 2 * such as having reduced blooming artifacts or being able to distinguish between paramagnetic and diamagnetic substances ( Eskreis-Winkler et al., 2017 ). R 2 * imaging and QSM have been shown to provide reproducible results in single-site and cross-site studies at 1.5T and 3T ( Hinoda et al., 2015 ;Cobzas et al., 2015 ;Deh et al., 2015 ;Lin et al., 2015 ;Santin et al., 2017 ;Feng et al., 2018 ;Spincemaille et al., 2019 ).
The dipole-inversion problem at the heart of QSM methods benefits from the increased sensitivity to magnetic susceptibility variation and spatial resolution at ultra-high fields ( B 0 ≥ 7 T) ( Yacoub et al., 2001 ;Reichenbach et al., 2001 ;Tie-Qiang et al., 2006 ;Duyn et al., 2007 ;Wharton and Bowtell, 2010 ). At 7T, close attention must be paid to B 0 shimming and gradient linearity to achieve accurate QSM and R 2 * mapping ( Yang et al., 2010 ). Head position is also an important factor that affects the susceptibility anisotropy ( Lancione et al., 2017 ;Li et al., 2017 ).
In this study, we introduce single-echo and multi-echo GE imaging protocols for QSM and R 2 * mapping at 7T which were standardised on three different 7T MRI scanner platforms, from two different vendors. We applied this standardised protocol in the UK7T Network's "Travelling Heads " study on 10 subjects scanned at 5 sites. We report reproducibility for derived R 2 * and QSM maps and make recommendations for the design of future multi-centre studies.

Measurement setup
Ten healthy volunteers (3 female, 7 male; age 32.0 ± 5.9 years) were recruited: comprising two subjects from each of the five 7T imaging sites in the UK7T Network (described in Table 1 ). Each subject was scanned five times at their "home " site, and once at the other sites, under local ethics approval for multi-site studies obtained at Site-4 (HBREC.2017.08). Scans for each subject were completed within a period of between 83 and 258 days. The five home-site scans were performed across different sessions: the median time to acquire all five scans was 59 days (range: 3-71 days).

QSM and R 2
* data processing QSM maps were generated from both the single-echo and multi-echo T 2 * -weighted datasets using the Multi-Scale Dipole Inversion (MSDI) algorithm, as implemented in QSMbox v2.0 ( Acosta-Cabronero et al., 2018 ). Briefly: first the local field was estimated by phase unwrapping ( Abdul-Rahman et al., 2005 ) and magnitude-weighted least squares phase echo fitting was performed on the multi-echo data. Then, independently for both single-echo and multi-echo data, background field was removed using the Laplacian Boundary Value (LBV) method followed by the variable Spherical Mean Value (vSMV) algorithm with an initial kernel radius of 40mm ( Zhou et al., 2014 ;Acosta-Cabronero et al., 2018 ). MSDI inversion was estimated with two scales: the self-optimised lambda method was used on the first scale with filtering performed using a kernel with 1mm radius, and on the second scale the regularization term was set to = 10 2.7 (the optimal value for in-vivo 7T datasets found in ( Acosta-Cabronero et al., 2018 )) and filtering was done with a kernel radius set to 5mm. Brain masks used in the analysis were obtained with FSL's Brain Extraction Tool (BET) with fractional intensity threshold = 0.2 for single-echo data ( Smith, 2002 ). These were then mapped to multi-echo data space.

Data registration
The neck was cropped from the magnitude data with FSL's "robustfov " command ( https://fsl.fmrib.ox.ac.uk/fsl/ ), applied to the single-echo data and the 4th echo of the multi-echo data. Highresolution single-echo and multi-echo templates were made from this cropped data for each subject with antsMultivariateTemplate-Construction2.sh from the Advanced Normalization Tools (ANTs, http://stnava.github.io/ANTs/ ). Two approaches were compared: transformations using rigid registration with mutual information similarity metric (denoted as "Rigid " below) or using symmetric diffeomorphic image registration with cross-correlation similarity metric (denoted "SyN " below). Other settings were kept the same for both approaches: 4 steps with 0.1 gradient step size, maximum iterations per step 1000, 500, 250 and 100, smoothing factors per step of 4, 3, 2, and 1 voxels, and shrink factors per step of 12 ×, 8 ×, 4 ×, and 2 ×. The resulting registrations were then applied to the QSM and R 2 * maps which were averaged to create single-echo and multi-echo QSM and R 2 * templates for each subject.

Selection of Regions of Interest (ROIs)
Five regions of interest (Substantia Nigra, Red Nucleus, Caudate Nucleus, Putamen and Globus Pallidus) were manually segmented based on the subject-specific QSM templates of the single-echo data registered with the "SyN " approach. In order to minimize the amount of segmentation variability, these ROIs were then mapped to the single-echo "Rigid ", and multi-echo "SyN " and multi-echo "Rigid " spaces with nearest neighbour interpolation and via non-linear registrations obtained with the default settings in the antsRegistrationSyN.sh command in ANTs.
Magnitude data were first registered to the T 1 -weighted MP2RAGE scans (Rigid transformations; MI similarity metric) and later to the standard T 1 "MNI152 brain " (Montreal Neurological Institute 152) (using settings in antsRegistrationSyN.sh) applied to the single-echo data and to the 1st echo of the multi-echo data. These registrations were then used to map the 48 probabilistic cortical ROIs, "cortical ROIs ", from the Harvard-Oxford Cortical Atlas and the 21 probabilistic subcortical ROIs, "subcortical ROIs ", from the Harvard Oxford Subcortical Atlas to the QSM and R 2 * template spaces. The T 1 -weighted MP2RAGE data was bias-field corrected, brain extracted, and segmented into five tissues using SPM ( https://www.fil.ion.ucl.ac.uk/spm/ ): the grey matter (GM), white matter (WM) and cerebral-spinal fluid (CSF) volumes were mapped into each subject-specific QSM template space. Then, using "fslmaths " from FSL ( https://fsl.fmrib.ox.ac.uk/fsl/ ), the mapped cortical ROIs were thresholded at 10% of the "robust range " of non-zero voxels and multiplied by the GM tissue map in order to obtain GM-specific cortical ROIs. The mapped subcortical ROIs were thresholded at 50% of the "robust range " of non-zero voxels. From these, any CSF voxels were excluded from the left and right Caudate Nucleus, Putamen and Globus Pallidus, and the voxel sets from the left and right counterparts were merged together.
From the single-echo and multi-echo data, average and R 2 * values were extracted from the manual and atlas-based ROIs for all volunteers and sessions in template space (values given in Supplementary material 1).
In order to estimate where the magnetic field is spatially more variable, field-maps were first estimated from the multi-echo datasets. ΔB 0 was calculated from the background field removal step of the QSM pipeline, and was defined, per-voxel, as the average difference between the field in a voxel and its immediate nearest neighbours. The average ΔB 0 was extracted for each of the cortical ROIs and averaged across all subjects and sessions. Then the cortical ROIs were divided into two groups based on the ΔB 0 values: wherever | ΔB 0 | > 0.005 Hz the ROI was grouped into "high ΔB 0 " regions, otherwise it was grouped into "low ΔB 0 " regions. ΔB 0 was calculated from the multi-echo pipeline only, as differences to values calculated using single-echo data were minimal ( Fig. 1, Supplementary material 2).
We explored three possible susceptibility reference regions for QSM processing. The average QSM signal was extracted from: 1 A whole brain mask, "wb "; 2 A whole-brain CSF mask eroded in two steps, "csf "; 3 A manually placed cylindrical ROI in the right ventricle, "cyl " (across all subjects the ROI volume was 104 ± 11 mm 3 ).

Statistical analysis
Statistical analysis was performed with R 3.5.3 ( R Core Team, 2013 ). Cross-site analysis used only the 1st scan at the "home " site along with the scans at the other four sites. To obtain the within subject average, AV w , the and R 2 * values were averaged within the same site and across the sites and then averaged across subjects: where n is the number of sessions ( = 5 for within-site and cross-site) and m the number of subjects. Relative reliability was measured using the intra-class correlation coefficient (ICC) from within and cross-site data independently for each ROI ( Weir, 2005 ): where MS b and MS w are the between-subjects and within-subjects mean square from a random-effects, one-way analysis of variance (ANOVA) model. Intra-subject absolute variability is assessed by measuring the within-subject standard-deviation (SD w ) calculated as ( Santin et al., 2017 ): ∕ is the replicate average for each subject. SD w was computed using within-site data and cross-site data independently. Simi-larly, cross-subject variability was calculated by measuring the betweensubject standard-deviation (SD b ): is the measurement average across subjects and sessions. Note that SD b is computed using data from all sites. Statistical testing on AV w , SD w and ICC values extracted from manual and template-based ROIs was done by first fitting the data with normal, log-normal, gamma and logistic distributions. The goodness-of-fit statistics for the parametric distributions were calculated and the distribution which showed the lowest Akaikes Information Criterion (AIC) was then used on a general linear model fitting. All models included as fixed main effects ROI number and data type (within-and cross-site). When evaluating the data registration type, the model also included registration type ( "Rigid " and "SyN ") as a fixed main effect. When testing for QSM reference, the model also included reference region ( "wb ", "csf ", and "cyl ") as a fixed main effect. On multi-echo QSM data, a model was fitted which also included the number of echoes processed as a fixed main effect. When comparing the manual and subcortical ROIs, the ROI type (manual vs. atlas-based) was also included as a fixed main effect. Finally, on the data from the cortical ROIs, ROI number was replaced with "high ΔB 0 " and "low ΔB 0 " ROI type as covariate. A p-value less than 0.05 was considered significant.

Head orientation
We investigated the effect of head orientation on QSM variability. Since all our data was acquired with transverse slice orientation, the slice normal vector in the acquired images was parallel to B 0 . We used the per-subject rotation matrices of the affine transforms from this acquired transverse data to MNI space to estimate the z -axis rotation with respect to the B 0 vector (0,0,1) ( Fig. 7 (A)): where M 33 is the 3rd row, 3rd column of the affine transform matrix.
Two linear mixed effects models, 'mod1' and 'mod2', were fitted on the within-site and cross-site data separately: both models included site, ROI, and session number as fixed effects, and subject number as a random effect, while 'mod2' also included as a fixed effect. For each model, the R 2 was evaluated and both models were compared with a chi-squared test.
Finally, from 'mod2' the fit coefficients were used to estimate corrected -values based on a chosen standard for all of the measurements ( = 0 . 52 radians). Then, new within-site and cross-site SD w of the corrected were calculated based on the same approach as in Section 2.5 .

Fig. 1 shows QSM and R 2
* maps for one example subject. Basal ganglia structures, including Caudate Nucleus, Putamen and Globus Pallidus are clearly visible consistent with previous findings ( Langkammer et al., 2010 ;Wang et al., 2015 ;Betts et al., 2016 ;Acosta-Cabronero et al., 2016 ). Fig. 2 , Supplementary material 2 highlights the difference in QSM data quality when using our chosen Roemer coil combination method vs using sum-of-squares coil combination.  Fig. 3 shows boxplots over ROIs of the within-and cross-site AV w (A), SD w (B) and ICC (C) values for the manual ROIs on the and R 2 * maps. The AV w from R 2 * maps measured on the same site is systematically higher compared to the AV w measured across sites ( p < 0.0001; e.g., on the Putamen ROI, AV w_within-site = 0.0493 ms − 1 vs AV w_cross-site = 0.0489 ms − 1 ). On this comparison, QSM data did not show significant differences between within-site and cross-site groups for either single-echo data ( p = 0.053) or multi-echo data ( p = 0.65).

QSM and R 2
From all the data in the manual ROIs, the median SD w of single-echo -values was approximately 29% lower than for multi-echo -values ( p = 0.0010). There was a significantly larger SD w cross-site compared to within-site on single-echo data ( p < 0.0001; e.g., on the PN ROI, SD w_within-site = 0.00088 ppm vs SD w_cross-site = 0.0014 ppm), multi-echo ( p = 0.033) and on R 2 * data ( p < 0.0001). The ICC values for within-and cross-site R 2 * data (median ICC was 0.98 and 0.91, respectively) were found to be significantly higher than values for single-echo (median ICC was 0.89 and 0.64, respectively) or for multi-echo (median was ICC 0.76 and 0.38, respectively) ( p = 0.00011). For all measurements, the ICC for cross-site data was significantly lower than for within-site data (single-echo QSM: p < 0.0001; multi-echo QSM: p = 0.017; R 2 * : p < 0.0001). Similar statistics were obtained for AV w , SD w and ICC measurements in the Altas-based cortical ROIs (Table 2 Supplementary material 2).

Registration
The within-and cross-site standard deviations for one axial slice from one example subject using "Rigid " and "SyN " registration approaches are shown in Fig. 4 . Generally, with both registration methods, withinsite and cross-site SD w increases in veins, in the orbitofrontal regions and at the cortical surface (white and green arrows, Figure 4 ). These are areas associated with large B 0 inhomogeneities and gradient non-linearity. However, there is a decrease in the cross-site standard deviation in the orbitofrontal region and close to the edges of the cortex when using the "SyN " compared to the "Rigid " method (green arrows, Fig. 4 ).
On the manual ROIs increased variability was observed for R 2 * on "Rigid " registered data compared to "SyN " (SD w : p < 0.0001; ICC: p = 0.013) but not for single-echo or multi-echo : for example, the median cross-site R 2 * SD w from all ROIs was 0.00066 ms − 1 using "SyN " method and 0.00086 ms − 1 using the "Rigid " registration method. On the atlasbased cortical ROIs, the same significant trend was observed for R 2 * and single-echo data (Table 2 Supplementary material 2).

QSM referencing
To assess the optimal QSM susceptibility referencing, Fig. 5 shows boxplots of the SD w for single-echo and multi-echo using different referencing methods on the manual ROIs. On single-echo data, compared to "wb " correction (chosen correction for this study), the "csf " reference did not increase significantly the SD w ( p = 0.93) but with "cyl " the median SD w increased by approximately 14% ( p < 0.0001).
Multi-echo data showed an increase in the median SD w of, respectively, 11% ( p = 0.00096) and 8% ( p = 0.00064) when using "csf " and "cyl " methods for correction. The effect of varying the referencing of QSM data was similar in within-site and cross-site data, for all methods tested.

Multi-echo QSM and R2 *
On average across all the manual ROIs and compared to single echo data, multi-echo data (using two or more echoes) showed a significant 14% increase of the SD w ( Fig. 6 ) and 3% of the ICC ( Table 1 , Supplementary material 2). This supports the single-echo and multi-echo comparison in Section 3.2 . Similar behaviour was observed on the atlas-based   cortical ROIs (Table 2 Supplementary material 2). On the manual ROIs, there is no significant difference in AV w ( p = 0.79) or in SD w ( p = 0.11) from computed from multiple echoes (i.e. 2 or more echoes in the QSM analysis). Yet, in the atlas-based cortical ROIs, long echo times (i.e. using 6 or more echoes) showed an average increase of 15.7% in SD w ( p < 0.0001) compared to using 2 to 5 echoes and a decrease of 1.75% in ICC ( p < 0.0001) ( Table 2 Supplementary material 2).
In the manual ROIs, R 2 * showed no significant change in variability across all ROIs when different number of echoes were used in the fitting (SD w : p = 0.11; ICC: p = 0.95) ( Fig. 6 (B)) or on AV w ( p = 0.97). In the atlas-based cortical ROIs, the number of echoes used influenced the average R 2 * value (AV w : p < 0.0001), weakly ICC ( p = 0.021), but not SD w ( p = 0.61). Tables 1 and 2 Supplementary material 2 display individual statistics.

ROI selection
There is a small but significant higher average from manually drawn ROIs compared to the atlas-based subcortical ROIs in single-echo QSM data ( p < 0.0001; e.g. 0.042 ± 0.009 ppm vs 0.033 ± 0.010 ppm in the caudate nucleus) and in multi-echo QSM data ( p < 0.0001; e.g. 0.048 ± 0.010 ppm vs 0.038 ± 0.011 ppm in the caudate nucleus) ( Fig. 2 ). Similarly, for R 2 * (e.g. 0.041 ± 0.004 ms − 1 vs 0.039 ± 0.006 ms − 1 in the caudate nucleus) this difference was significant ( p < 0.0001). In addi- Fig. 4. Voxel-wise within-and cross-site standard deviation of an example subject from single-echo and multi-echo QSM and R 2 * data with data registered with "Rigid " (A) and "SyN " (B) transformations. Arrows point to regions where the SD w decreased with the "SyN " transformations (green) are compared to "Rigid " (white). The right Caudate Nucleus (a), Putamen (b) and Globus Pallidus (c) are outlined in white. Multi-echo -maps were calculated with data from all eight echoes.

Fig. 5.
Boxplots from data obtained on the manual ROIs of within-and cross-site SD w (red and green, respectively) of single-echo QSM (A) and multi-echo QSM (B) with a whole-brain reference (wb), with a csf reference (csf), and with a cylinder reference (cyl). Data from each ROI is shown with a different marker for each boxplot. Legend: SN = Substantia Nigra; RN: Red Nucleus; CN: Caudate Nucleus; Pu: Putamen; GP: Globus Pallidus. Multi-echo -maps were calculated with data from all eight echoes. tion, the SD w was, on average, approximately two times higher and the ICC lower in the atlas-based subcortical ROIs compared to the manual ROIs in all datasets (SD w : single-echo QSM p < 0.0001, multi-echo QSM p < 0.0001, R 2 * p < 0.0001; ICC: single-echo QSM p = 0.00021, multiecho QSM p = 0.0023, R 2 * p = 0.012). So, ROI selection should be done consistently in a study.

Spatial distribution of the magnetic field
On the atlas-based cortical ROIs the SD w increased by approximately 28% and 88% on "high ΔB 0 " regions compared to "low ΔB 0 " regions on multi-echo and R 2 * data, respectively ( p = 0.0011 and p < 0.0001) ( Table 2 Supplementary material 2). Similarly, ICC values decreased significantly for single-echo and multi-echo and R 2 * values.

Table 2
Within-site to Cross-site ratio of the median SD w obtained from all five manually-defined ROIs on single-echo and multi-echo without and with -correction.

QSM variability with head orientation
When analysing in the manually-defined ROIs with respect to , a consistent negative trend was observed for all subjects. Fig. 7 (C) and (D) shows an example for the analysis in the Globus Pallidus ROI. Fitting a linear model on , with and ROI as fixed variables, showed a significant negative correlation with single-echo ( p < 0.0001) and multi-echo ( p = 0.015).
In addition, for , the within-site SD w was nearly half of the crosssite SD w (0.011 and 0.028 radians, respectively), indicating that there was larger variability in head orientation across sites (subject-wise variability of , i of equation [3], is plotted in Fig. 7 (B)).
Separately for within-site and cross-site data, we assessed the goodness-of-fit of a model containing as an explanatory variable. On single-echo within-site data, the marginal R 2 increased from 0.71 with 'mod1' to 0.76 with 'mod2' (which includes ) (Chi-squared test, p = 0.041). The corresponding cross-site R 2 s were: 0.77 and 0.80 (Chisquared test, p = 0.057). On multi-echo data, the marginal R 2 increased from 0.75 with 'mod1' to 0.79 with 'mod2' on within-site data (Chisquared test, p = 0.041) and maintained at 0.79 on both models for cross-site data (Chi-squared test, p = 0.14).
From the corrected -values at norm , results show a slight decrease in the ratio of within-site to cross-site SD w ( Table 2 ), but variability of obtained from cross-site data was still higher than from within-site ( with -correction, p = 0.01; uncorrected , p < 0.0001 ( Section 3.2 )). For multi-echo data, the SD w obtained from the corrected -values were similar on within-site compared to cross-site ( with -correction, p = 0.11; uncorrected , p = 0.033 ( Section 3.2 )).

Discussion
In this paper, the reproducibility of QSM and R 2 * measurements in cortical and subcortical regions of the brain was assessed for the first time in a multi-site study at 7T for two different protocols (a single-echo 0.7mm isotropic T 2 * -weighted scan and a 1.5mm isotropic multi-echo T 2 * -weighted scan), using three different scanner platforms provided by two different vendors.
Previous studies at 1.5T and 3T have shown good reproducibility for and R 2 * data acquired on the same scanner or across sites (1.5T and 3T) ( Hinoda et al., 2015 ;Cobzas et al., 2015 ;Deh et al., 2015 ;Lin et al., 2015 ;Santin et al., 2017 ;Feng et al., 2018 ;Spincemaille et al., 2019 ). In terms of QSM and depending on the subcortical region, intra-scanner 3T repeatability studies report an SD w of 0.002-0.005 ppm ( Feng et al., 2018 ) and 0.004-0.006 ppm ( Santin et al., 2017 ), and the cross-site 3T study by Lin et al. (2015) reported an average SD W of 0.006-0.010 ppm. We observed a within-site SD w range of 0.0009-0.004 ppm and crosssite SD w range of 0.001-0.005 ppm at 7T. Compared to 3T studies, this is a 2.0-5.3 fold decrease in the within-site SD w , and a 2.1-4.8 decrease in the cross-site SD w .
The range of within-site SD w values for R 2 * was averaged 0.0003-0.001 ms − 1 in our study and the cross-site SD w range was 0.0005-0.001 ms − 1 . The cross-site values are comparable to the same site reported Fig. 7. In QSM, it is assumed that the macroscopic susceptibility in an imaging voxel is isotropic. However, it has been shown that this assumption is too simplistic for single head orientation QSM methods, complicating the interpretation of the QSM results ( Li et al., 2017 ). We investigated the effect of head orientation on QSM estimation in our data: (A) Considering that data was all acquired in the transverse plane with B 0 perpendicular to the imaging slice, subjects had a variable head rotation with respect to B 0 . To estimate , we used MNI space as a common head orientation (z MNI ) across all scans. From the affine registration matrix M converting acquired data into MNI space, the angle of rotation from the rotated z-axis, z 2 , will be given by = cos −1 ( 33 ) where M 33 is the 3rd row, 3rd column of the affine transform matrix. (B) Subject-wise within-site and cross-site i measurements on . (C) Single-echo and (D) multi-echo scatter plots of measurements according to on the Globus Pallidus manual ROI. For each subject a linear trend is also plotted and the fit coefficients are given in the plot legend. Data from each site is displayed with a different symbol. Multi-echo -maps were calculated with data from all eight echoes. at 3T: 0.0005-0.0009 ms − 1 ( Feng et al., 2018 ), 0.0006-0.002 ms − 1 ( Santin et al., 2017 ). Compared to the latter, our cross-site results show a 1.1-3.4 improvement over the same brain regions in R2 * variability.
The study from Hinoda et al. (2015) measured QSM reproducibility at 1.5T and 3T by scanning subjects twice on each of the scanners. They showed there is a 1.1-2.1 fold decrease in the upper and lower limits in Bland-Altman plots at 3 T compared to 1.5 T, which is in line with the expected signal-to-noise ratio (SNR) increase between these two field strengths ( Edelstein et al., 1986 ;Wardlaw et al., 2012 ). Compared to 3T reports, there is, on average, an improvement of approximately 3fold in our QSM and R 2 * 7T measurements of reproducibility. This is in line with the expected SNR increase in brain imaging from 3T to 7T ( Pohmann et al., 2016 ).
The higher values of cross-site SD w compared to the within-site values in our study may be attributed to the different gradient systems and automatic distortion corrections used in the different scanner platforms and to the different approaches to shimming, which lead to different geometrical distortions and dropout regions ( Figs. 3 and 4 Supplementary material 2) ( Yang et al., 2010 ). In our study we verified that not only regions in the cortex close to air-tissue interfaces show differences in B 0 across scanners, but also large subcortical regions such as the CN, the Pu and the GP ROIs.
We also showed that the use of a non-linear registration method (here, "SyN " in ANTs) significantly reduced the inter-scanner variability of cortical QSM compared to rigid-body registration, indicating that differences in geometric distortion across scanners were present. The R 2 * results for both cortical and subcortical structures also show significantly lower inter-scanner variability when a non-linear registration was used. For QSM, higher cross-site variability may also be attributed to the head orientation with respect to B 0 ( Lancione et al., 2017 ;Li et al., 2017 ). Our results indicate head orientation varied somewhat between scans and there was greater variation between sites than intra-site; we also observed a consistent negative correlation between and head orientation ( ). Using a linear model to attempt to regress-out the effects of head rotation improved the reproducibility of both within-site and cross-site data. It also reduced the penalty for multi-site scanning vs single-site scanning, but not completely.
In this study, the reproducibility of QSM using single-echo, highresolution (0.7 mm isotropic resolution; TE = 20ms) and multi-echo standard-resolution (1.4 mm isotropic resolution; TE = 4,9,14,19,24,29,34 and 39 ms) protocols were compared, and the results show that the multi-echo QSM data has a significantly higher variability than single-echo QSM. Although multi-echo phase data has been combined with a magnitude-weighted least squares regression of phase to echo time, it may carry inconsistent phase accumulation across echoes that were inconsistently unwrapped. This is also particularly relevant for regions of large field inhomogeneities, where phase accumulation could exceed ± between neighbouring voxels, resulting in multiple phase wraps, which the unwrapping algorithm maybe unable to correct ( Cronin et al., 2017 ). This has also been verified on the analysis of QSM data from the cortical ROIs reconstructed with different numbers of echoes: long echo times increase significantly the test-retest variability. Alternatively, phase unwrapping can be completely omitted by calculating the phase change over all echo-times using a complex fitting routine ( Liu et al., 2012 ;Liu et al., 2013 ) (fit_ppm_complex.m of MEDI toolbox) and calculating the Laplacian directly from the resulting, still wrapped, temporal phase change data .
It has been shown that resolution influences QSM estimation. Haacke et al. (2015) showed on phantom data that by decreasing slice thickness from 3 mm to 0.5 mm partial volume effects are reduced, absolute susceptibility values decrease, and accuracy improves up to 25%. Similar findings on in vivo brain data are reported in Sun et al. (2017) (single-echo data) and Karsa et al. (2019) (multi-echo data). Our results support the suggestion that a reduction of partial volume effects at higher-resolution might play a role in decreasing both test-retest and cross-site variability on the single-echo high-resolution data compared to the multi-echo low-resolution data. R 2 * values show significantly lower variability, reflected in the higher ICC within and across-sites compared to corresponding values for in subcortical areas. This may be because the estimation is globally more sensitive to background field inhomogeneity compared to magnitude data. However, in orbitofrontal and lower temporal regions large through-plane field variations from tissue-air interfaces dominate the field changes and produce dropouts in the signal magnitude and increase the background phase, affecting both QSM and R 2 * maps by increasing variability and decreasing ICC across sites. In addition, because of large field variations, the estimated cortical R 2 * increases significantly when late echo times are used for the fitting, but this effect is not seen in subcortical areas.
QSM can only determine relative susceptibility differences ( Cheng et al., 2009 ) and most approaches to calculation of susceptibility from measured phase yield maps in which the average value of susceptibility is zero over the masked imaging volume. Issues related to referencing of QSM data have been investigated ( Feng et al., 2018 ;Straub et al., 2017 ), with aim of finding a reference region or tissue to which all susceptibility values are referred that produces well-defined and reproducible values of susceptibility. Here we investigated how the choice of reference affects the within-site and cross-site variability of measured susceptibility at ultra-high-field. We tested three accepted reference regions: total whole brain signal, "wb ", whole brain CSF eroded in order to exclude any pial or skull surfaces, "csf ", and a manually selected cylindrical ROI in the right ventricle, "cyl ". We found that the "cyl " referencing generally increased the variability of the cross-site and within-site susceptibility measurements in cortical and subcortical ROIs compared to "wb " referencing. In the case of the multi-echo acquisition the "csf " referencing also increased the variability relative to "wb " data. This may be because of imprecision in systematically obtaining average QSM signal from CSF regions. Referencing using a small ROI in the ventricles might be prone to subjectivity given the natural variation in ventricle size in healthy subjects and in disease. Furthermore, the ventricles do not contain pure CSF: they are traversed by blood vessels with a different ( Sullivan et al., 2002 ). This makes whole-brain referencing attractive in many situations. Yet, in patient cohorts where there is substantial iron load in subcortical structures ( Snyder and Connor, 2009 ), whole brain referencing might not be an appropriate approach. In this case, the more appropriate approach will be to choose a small reference region which shows no changes in the particular disease to be "zero " susceptibility at a cost of a slight increase in SD.
To eliminate operator-dependent bias in segmentation when determining brain structures, we have analysed data using both manual and atlas-based segmentation. From our results, manual ROIs showed significantly lower variability compared to atlas-based methods. This happens because of imprecision in registration between MNI and subject space as well as the empirical thresholding that was chosen to obtain the subcortical ROIs. This resulted in larger ROIs being derived from the atlas-based method compared to the manual method (Wilcoxon test, CN: p = 0.014; Pu: p = 0.00018; GP: p = 0.0010). Overestimation of the region ( Fig. 5 Supplementary material 2) meant including boundary voxels that, generally, have lower susceptibility (white-matter, for example), lowering the average and R 2 * . However, traditional manual drawing of ROIs for cohort studies is difficult, time consuming and potentially unsuitable as it biases results towards particular cohorts ( Collins et al., 2003 ) so it may not always be the most appropriate approach.
In this study, harmonized protocols were produced for all five scanners without any significant sequence alterations, as a product 3D gradient echo (GE) sequence was readily available on all systems (the product 'gre' sequence from Siemens and the product 'ffe' from Philips). The protocols and an example dataset are provided in ( Clarke, 2018 ). Generally, we also relied on the vendors' reconstruction. However, at the end of the reconstruction pipeline of the Siemens systems we adopted a different coil combination approach based on Roemer et al. (1990) and Walsh et al. (2000) , to match the SENSE approach implemented on Philips scanners ( Pruessmann et al., 1999 ;Robinson et al., 2017 ). This was required due to artifacts appearing on phase images in Siemens data reconstructed with the vendor's pipeline, such as open-ended fringe lines or singularities ( Chavez et al., 2002 ) ( Fig. 2 , Supplementary material 2). These reduce the consistency of the QSM results ( Santin et al., 2017 ). However, other coil combination methods such as a selective channel combination approach ( Vegh et al., 2016 ) or the COMPOSER (COMbining Phase data using a Short Echo-time Reference scan) method ( Bollmann et al., 2018 ) have also been shown to reduce open-ended fringe lines and noise in the signal phase. For future investigations, the raw k-space data collected from all sites in this study has been stored and is available from the authors upon request.
On the QSM reconstruction, an imperfect background field filtering can influence the reproducibility of QSM data. For this reason, we performed background removal in two steps as implemented in QSMbox v2.0 and as described in ( Acosta-Cabronero et al., 2018 ): first with the LBV approach and then followed by the vSMV method. Regularized field-to-susceptibility inversion strategies have been proposed to overcome the ill-posed problem in QSM with data acquired at a single head orientation ( de Rochefort et al., 2010 ). We opted to use the MSDI implementation in QSMbox v2.0 ( Acosta-Cabronero et al., 2018 ), as it ranked top-10 in all metrics of the 2016 QSM Reconstruction Challenge ( Langkammer et al., 2018 ), and also now includes a new self-optimized local scale, which results in a better preservation of phase noise texture and low susceptibility contrast features. On the second step, the regularization factor, , used for this study was set to 10 2.7 , as recommended by Acosta-Cabronero et al. (2018) based on an L-curve analysis ( Hansen et al., 1993 ) with high-resolution 7T data.
The standard multi-echo GE protocol in this study was produced as a harmonised sequence that could be performed at all sites, with a relatively short acquisition time (approximately 5 min), which is accept- Fig. 8. Illustration of the feasibility of a 7T QSM clinical study. (A) and R 2 * (B) for four ROIs (Substantia Nigra, SN; Caudate Nucleus, CN; Putamen, Pu; Globus Pallidus, GP) from healthy volunteer (HV) and synthetic "patient " (PT) data for which AV lit and SD lit were obtained from Langkammer et al. (2016) and SD b were calculated from data of the current study. AV lit values for R 2 * were linearly scaled to 7T according to Yao et al. (2007) . Blue bars show the AV lit ± SD lit and green bars the AV lit ± SD b . Statistical differences between HV and PT obtained from Langkammer et al. (2016) are also shown. For each ROI, the sample size that would have been needed to give a significant effect was calculated from the group means, AV lit , and the SD b per ROI and is shown in circles. Multi-echo -maps were calculated with data from all eight echoes. able for patient studies. Mid-brain structures such as the basal ganglia are identifiable, yet small subcortical structures will suffer from partialvolume effects, which could be a limitation of this harmonized protocol for future ultra-high field multi-site studies.
At ultra-high field there can be variations in SNR in magnitude data caused by the variable B 1 + across the brain ( Abduljalil et al., 2003 ). As R 2 * is estimated voxel-wise, and as there is always a reasonable SNR in the magnitude data, the coefficient in the exponential fit that estimates R 2 * will not be strongly affected by variations in B 1 + . QSM maps are estimated from filtered phase data which is not strongly affected by transmit B 1 + variations. On our data, no correlations were found between QSM or R 2 * maps and B 1 + flip-angle maps collected in the same session ( Fig. 6 Supplementary material 2).
To minimise confounding effects of age or pathology, we assessed test-retest reliability and cross-site variability with ten healthy young subjects. The cross-site, between-subject standard-deviation, SD b , measured in this study was evaluated together with healthy and Parkinson's disease data from ( Langkammer et al., 2016 ). A power analysis revealed a sample size that would have been required for a multi-site clinical study in each ROI as shown in Fig. 8 . For all the significant ROIs the number of subjects that would have been required per group was less or equal to 44. Since this is lower than the sample size we have used in this study (90 healthy volunteer scans) and the numbers in the Langkammer study (66 patients and 58 control subjects), it gives strong confidence of feasibility for future 7T QSM clinical studies.

Conclusion
We investigated test-retest reliability and reproducibility of T 2 *weighted imaging protocols at ultra-high field MRI. Considering the increase in susceptibility effects at 7T, we found that variability of measurements of QSM and R 2 * in the basal ganglia are reduced compared to reports from lower field strengths, 1.5T and 3T. Scanner hardware differences give more modest improvements for cortical measurements of QSM and R 2 * . Multi-echo protocols do not benefit from long echo times as these increase the imprecision in the estimation of QSM. We suggest that 7T MRI is suitable for multicentre quantitative analyses of brain iron, in health and disease.

Centre funding
The Wellcome Centre for Integrative Neuroimaging is supported by core funding from the Wellcome Trust (203139/Z/16/Z).
Cardiff University Brain Research Imaging Centre is supported by the UK Medical Research Council (MR/M008932/1) and the Wellcome Trust (WT104943).
This research was co-funded by the NIHR Cambridge Biomedical Research Centre. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care. The Cambridge 7T MRI facility is co-funded by the University of Cambridge and the Medical Research Council (MR/M008983/1).

Individual funding
CTR is funded by a Sir Henry Dale Fellowship from the Wellcome Trust and the Royal Society [098436/Z/12/B]. JBR is supported by the Wellcome Trust (WT103838).