Cross-scanner and cross-protocol multi-shell diffusion MRI data harmonization: Algorithms and results

Cross-scanner and cross-protocol variability of diffusion magnetic resonance imaging (dMRI) data is known to be a major obstacle in multi-site clinical studies since it limits the ability to aggregate dMRI data and derived measures. Computational algorithms that harmonize the data and minimize such variability are critical to reliably combine datasets acquired from different scanners and/or protocols, thus improving the statistical power and sensitivity of multi-site studies. Different computational approaches have been proposed to harmonize diffusion MRI data or remove scanner-specific differences. To date, these methods have mostly been developed for or evaluated on single b-value diffusion MRI data. In this work, we present the evaluation results of 19 algorithms that were developed to harmonize the cross-scanner and cross-protocol variability of multi-shell diffusion MRI using a benchmark database. The proposed algorithms rely on various signal representation approaches and computational tools, such as rotationally invariant spherical harmonics, deep neural networks and hybrid biophysical and statistical approaches. The benchmark database consists of data acquired from the same subjects on two scanners with different maximum gradient strength (80 and 300 mT/m) and with two protocols. We evaluated the performance of these algorithms for mapping multi-shell diffusion MRI data across scanners and across protocols using several state-of-the-art imaging measures. The results show that data harmonization algorithms can reduce the cross-scanner and cross-protocol variability to a level similar to the scan-rescan variability of the same scanner and protocol.
In particular, the LinearRISH algorithm, based on adaptive linear mapping of rotationally invariant spherical harmonic (RISH) features, yields the lowest variability for our data in predicting the fractional anisotropy (FA), mean diffusivity (MD), mean kurtosis (MK) and the RISH features themselves. However, other algorithms, such as DIAMOND, SHResNet, DIQT and CMResNet, show further improvement in harmonizing the return-to-origin probability (RTOP). The performance of the different approaches provides useful guidelines for data harmonization in future multi-site studies.


Introduction
Diffusion magnetic resonance imaging provides information to characterize tissue microstructure by probing the diffusive displacements of water molecules. It is increasingly used to investigate the structural connections in human brains and tissue abnormalities related to disorders (Kubicki et al., 2007; Pasternak et al., 2018; Mueller et al., 2005). Recent advances in acquisition protocols with multiple b-values provide imaging measures that are even more sensitive and/or specific than standard approaches based on single b-value data (Jensen et al., 2005; Özarslan et al., 2013; Ning et al., 2015). On the other hand, dMRI scans have intrinsic variability caused by various factors including, but not limited to, scanner field strength, gradient strength and acquisition protocols (Vollmar et al., 2010; Grech-Sollars et al., 2015; Veenith et al., 2013; Landman et al., 2011). In particular, the variability of imaging measures between datasets acquired from different scanners with different protocols can be much more significant than the intra-scanner variability between two data repetitions acquired from the same subject using the same protocol at different times (Karayumak et al., 2019). Therefore, it is imperative to reduce the cross-scanner and cross-protocol variability in order to reliably aggregate multi-site databases for increasing the statistical power and sensitivity of studies.
In view of the significance of cross-scanner and cross-protocol variability, several types of methods have recently been developed to reliably combine dMRI datasets in multi-site studies (Karayumak et al., 2019; Moyer et al., 2020; Blumberg et al., 2018, 2019; Mirzaalian et al., 2018; Fortin et al., 2017; Mirzaalian et al., 2016; Pohl et al., 2016). Some approaches are able to enhance the spatial or angular resolution or other imaging features of dMRI data with low b-values to match the data quality of state-of-the-art scanners and protocols (Grech-Sollars et al., 2015; Zhu et al., 2011; Vollmar et al., 2010). However, these approaches were designed or implemented to retrospectively harmonize the statistical differences in imaging measures between groups of subjects from different sites with matched age, gender, handedness and other socio-economic factors. A more direct approach to evaluating cross-scanner and cross-protocol variability is to use datasets of the same subjects acquired on the different scanners. This enables accurate measurement of the intra-subject reproducibility and cross-scanner variability that is not provided by retrospective analysis of group data. In recent years, the intra-scanner reproducibility has been investigated using test and retest scans of the same subjects on the same scanner (Shahim et al., 2017; Duan et al., 2015; Boekel et al., 2017) or on scanners with different field strengths (1.5T vs 3T and 3T vs 7T). Here, we specifically seek to understand scanner- and/or protocol-related variability in current multi-site studies based on 3T scanners.
To facilitate research on cross-scanner and cross-protocol variability, a benchmark database was provided with data acquired from 3 different scanners using 5 different acquisition protocols. An open competition was organized during the Computational Diffusion MRI (CDMRI) workshop at the 20th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2017), which invited participants to develop algorithms to solve a dMRI harmonization task: to find a mapping between datasets acquired using different scanners and/or protocols with the same b-value. The results of this competition have been reported previously. Most algorithms used spherical harmonics to represent single-shell dMRI signals and applied convolutional neural networks (CNNs) to learn mappings between datasets.
While the majority of previous works focused on the harmonization of single-shell diffusion MRI, this work presents the evaluation of harmonization algorithms for multi-shell diffusion data. In particular, this paper is a continuation and extension of the earlier results, and reports on the "MUlti-SHell Harmonization and enhancement Challenge" (MUSHAC), part of CDMRI at MICCAI 2018. We report the summary and results of the second open competition on harmonizing multi-shell dMRI datasets. The database used in this competition consists of four acquisitions of the same 15 healthy participants scanned on two MRI systems with different maximum gradient strength (80 and 300 mT/m) and different protocols. Different from the earlier work that focused on single-shell data, this paper reports the results of a multi-shell comparison from a significantly wider range of algorithms (19 algorithms from 9 research groups). Such systematic evaluations (Klein et al., 2009) have the potential to provide important guidelines for future multi-site studies on diffusion MRI.
Results demonstrate that state-of-the-art computational tools, e.g. CNNs and bio-statistical approaches, enable accurate harmonization of multi-shell diffusion MRI across scanners and protocols, reducing between-scanner/between-protocol variability to levels comparable to scan-rescan variability. These results pave the way towards the reliable combination of multi-center data in large clinical studies, which is key in the era of big data.

Methods
The acquisition parameters and preprocessing approaches are described in Section 2.1. Section 2.2 introduces the tasks and rules of the MUSHAC challenge on multi-shell dMRI harmonization. Section 2.3 provides the evaluation strategies and metrics. Section 2.4 summarizes the algorithms participating in the open competition.

Data
The database was acquired from 15 healthy volunteers, including all the subjects reported in Table 1 of the original challenge description. The data acquisition was approved by the Cardiff University School of Psychology ethics committee. All subjects were scanned using three different scanners with different maximum gradient strength. This work focuses on the multi-shell database acquired from the two scanners with relatively high maximum gradient strength: a) a 3T Siemens Prisma (80 mT/m), and b) a 3T Siemens Connectom (300 mT/m). The average time between acquisitions on scanners a) and b) was 1 month. The scanners had no software upgrades during the course of the study.
Two types of protocols were acquired on each scanner: 1) a more 'standard' (ST) protocol with acquisition parameters matched to a typical clinical protocol; and 2) a more advanced or 'state-of-the-art' (SA) protocol where the superior hardware and software specifications were utilized to increase the number of acquisitions and the spatial resolution per unit time. The ST data from both scanners have an isotropic resolution of 2.4 mm, TE = 89 ms and TR = 7.2 s. Both ST datasets have 30 directions at b = 1200, 3000 s/mm^2. On the other hand, the Prisma-SA data have a higher isotropic resolution of 1.5 mm, TE = 80 ms, TR = 7.1 s and 60 directions at the same b-values, while the Connectom-SA data have the highest resolution of 1.2 mm with TE = 68 ms, TR = 5.4 s and 60 directions. Reversed phase-encoding (AP and PA) b0 images were acquired in both protocols. Additional b = 0 s/mm^2 images were also acquired with matched TE and TR across scanners. Moreover, structural T1-weighted (T1w) images were acquired on both scanners using an MPRAGE sequence with 1 mm^3 isotropic voxels. More detailed information about the acquisition parameters can be found in Table 2 of the original challenge description.
The data preprocessing includes the following steps: the b0 volumes were corrected for EPI distortions by applying FSL TOPUP on reversed phase-encoding pairs (Andersson et al., 2003). The data were corrected for eddy-current-induced distortions and gradient-nonlinearity distortions with FSL TOPUP/eddy and in-house software kindly provided by Massachusetts General Hospital (MGH) (Andersson and Sotiropoulos, 2016). Spatiotemporally varying b-vectors and b-values due to gradient nonlinearities of the Connectom scanner were made available (Bammer et al., 2003; Sotiropoulos et al., 2013; Rudrapatna et al., 2018). The dMRI data of Prisma-SA and Connectom-SA were affinely registered to Prisma-ST using the corresponding fractional anisotropy (FA) maps with FSL FLIRT (Jenkinson and Smith, 2001), using sinc interpolation and appropriate b-vector rotation. Prisma-ST was used as the reference space since additional geometric distortions can still be present in Connectom data because of gradient nonlinearities (Setsompop et al., 2013; Jones et al., 2018; Tax et al., 2019). The T1w images were registered to the FA of Prisma-ST to avoid extra interpolation of the dMRI data.

Competition tasks
The processed data from 10 randomly selected subjects were used as training datasets which were released to the participants of the MUSHAC. The Prisma-ST scans and the corresponding T1w images of the remaining 5 subjects were used as a testing set. The tasks were to predict the other three datasets using the provided Prisma-ST scans. In particular, the following two tasks were evaluated:

1. Harmonization: the prediction of dMRI signals of the Connectom-ST dMRI scans using the provided Prisma-ST data;

2. Enhancement: the prediction of dMRI signals of the state-of-the-art dMRI scans, including Prisma-SA (Task 2a) and Connectom-SA (Task 2b) scans, using the provided Prisma-ST data.
All participants of the dMRI harmonization competition were required to at least complete Task 1, while Task 2 was optional. The preliminary results of 12 algorithms were announced during the CDMRI workshop at MICCAI 2018, as this provided useful feedback regarding algorithm performance and capabilities to the participants. The participants were then allowed to re-submit updated or new results after the workshop. New participants were also allowed to enter the competition without providing preliminary evaluations of their algorithms. By the final deadline of the competition, 19 algorithms implemented by 9 different research groups had been submitted for evaluation on the test datasets.

Evaluation
Harmonization algorithms had to predict the dMRI signals of the three missing datasets from the provided b-values, b-vectors and Prisma-ST scans. From the predicted and the underlying ground-truth datasets, diffusion tensor models were estimated using the dMRI signal at b = 0, 1200 s/mm^2 with the weighted least-squares estimate provided by the dtifit command in FSL (Jenkinson et al., 2012). The fractional anisotropy (FA) and mean diffusivity (MD) measures were subsequently extracted for each voxel. Moreover, two multi-shell dMRI measures, the mean kurtosis (MK) provided by the diffusion kurtosis imaging (DKI) model (Jensen et al., 2005) and the return-to-origin probability (RTOP) from MAP-MRI (Özarslan et al., 2013), were computed using the multi-shell diffusion data. The FA, MD, MK and RTOP measures were all computed using the Dipy toolbox (Garyfallidis et al., 2014). In addition, the zeroth and second order rotationally invariant spherical harmonic (RISH) features, R0 and R2 (Mirzaalian et al., 2016), were computed for the normalized dMRI signals at b = 1200, 3000 s/mm^2 to measure the angular frequency of the dMRI signals at each b-value shell. The RISH features were computed using an in-house MATLAB (Mathworks, Natick, MA) toolbox.
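The RISH features are simple functions of the SH coefficients: each feature is the total energy of one SH band, R_l = Σ_m c_{l,m}², which is invariant under rotations of the gradient scheme. A minimal numpy sketch, assuming a per-voxel coefficient array ordered with the single l = 0 coefficient first, followed by the five l = 2 coefficients (the array layout is illustrative, not the toolbox's):

```python
import numpy as np

def rish_features(sh_coeffs):
    """Compute zeroth- and second-order RISH features from per-voxel SH
    coefficients (even orders 0 and 2: 1 + 5 = 6 coefficients).

    R_l = sum_m c_{l,m}^2 -- invariant to rotations of the gradient set.
    """
    sh_coeffs = np.asarray(sh_coeffs)
    r0 = sh_coeffs[..., 0] ** 2                      # l = 0 band (1 coefficient)
    r2 = np.sum(sh_coeffs[..., 1:6] ** 2, axis=-1)   # l = 2 band (5 coefficients)
    return r0, r2

# Toy example: 2 voxels with order-2 SH coefficients.
c = np.array([[1.0, 0.1, 0.2, 0.0, -0.1, 0.3],
              [0.5, 0.0, 0.0, 0.0, 0.0, 0.0]])
r0, r2 = rish_features(c)
```

The second voxel has no angular contrast (all l = 2 coefficients are zero), so its R2 vanishes while R0 reflects its mean signal energy.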
The estimated imaging measures of the predicted datasets were compared against the corresponding features derived from the acquired ground-truth data. The performance of the algorithms was evaluated in brain regions specified by brain masks excluding the cerebellum, which was not always included within the foot-head field-of-view coverage. The evaluation mask was obtained with the Geodesic Information Flow (GIF) algorithm (Cardoso et al., 2015), which was used to segment different brain areas in the anatomical T1w images. The evaluation masks were provided to the participants. The absolute percentage error (APE) between the predicted and ground-truth measures, i.e. APE = |(predicted - ground-truth)/ground-truth|, was then computed at the voxel level. Subsequently, the mean and standard deviation of the APE were computed globally in a brain mask excluding the cerebellum, and regionally in brain regions provided by FreeSurfer (Fischl, 2012).
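The voxel-level APE and its summary statistics can be computed directly from the two maps and a mask. An illustrative numpy implementation (the `eps` guard against division by zero and the array shapes are our assumptions, not part of the original evaluation code):

```python
import numpy as np

def ape(predicted, ground_truth, mask, eps=1e-12):
    """Absolute percentage error per voxel, summarized within a brain mask.

    APE = |(predicted - ground_truth) / ground_truth|
    eps guards against division by zero in background voxels.
    """
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    voxelwise = np.abs((predicted - ground_truth) / (ground_truth + eps))
    values = voxelwise[mask]
    return values.mean(), values.std()

# Toy example: 2x2 FA "maps" and a mask excluding one voxel.
gt   = np.array([[0.5, 0.4], [0.8, 0.0]])
pred = np.array([[0.55, 0.4], [0.72, 0.3]])
mask = np.array([[True, True], [True, False]])
mean_ape, std_ape = ape(pred, gt, mask)
```

In the challenge, the same statistics were additionally computed per region using the FreeSurfer parcellation rather than a single global mask.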

Algorithms
Participants of the open competition developed 19 algorithms to solve the harmonization challenge (Task 1: the prediction of the Connectom-ST data using the Prisma-ST data). A subset of the 19 algorithms was also implemented to solve the enhancement challenge (Task 2: the prediction of the Prisma-SA data (Task 2a) and the Connectom-SA data (Task 2b) using the Prisma-ST data). For Task 1, the Prisma-ST and the Connectom-ST scans have exactly the same acquisition parameters, including the spatial resolution and the gradient directions. Therefore, the voxel-wise differences in imaging features between the two datasets can be computed directly without any interpolation. The corresponding APE value reflects the true cross-scanner differences and can be considered as the reference when evaluating the performance of harmonization algorithms, since no harmonization is applied. Since the voxel size of the Prisma-ST scans differs from those of the Prisma-SA and Connectom-SA scans, no natural reference is available for Task 2a and Task 2b, but the performance of interpolation-based algorithms can be considered as the baseline against which to evaluate the more complex algorithms. A detailed description of the participating algorithms is provided below and a brief summary is presented in Table 1.

Spherical-harmonic based interpolation (SHInterp)-Additional nonlinear registrations based on ANTs (Avants et al., 2011) were applied to align the Prisma-ST datasets to the other three scans with B-spline interpolation. Then the MPRAGE images were used to obtain brain segmentations with FSL FAST (Jenkinson et al., 2012) and brain masks with the FSL BET command (Jenkinson et al., 2005). The normalized diffusion signals in white matter (WM) and gray matter (GM) voxels within the brain mask were transformed to SH coefficients of order four with Laplace-Beltrami regularization of λ = 0.006.
The SHInterp approach simply transformed the SH coefficients back to diffusion signals along the set of gradient directions of the to-be-predicted scans. Thus, due to the SH fitting, it is analogous to a spatial smoothing or regularization in the angular domain of each shell. Since this approach does not remove any variability in the data, its performance is expected to reflect the true cross-scanner and/or cross-protocol variability of the datasets. For Task 2, standard cubic interpolation was utilized to increase the spatial resolution.
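The core of SHInterp is a least-squares SH fit on the acquired directions followed by evaluation of the fitted series along the target directions. A self-contained sketch, restricted to an order-2 real symmetric basis for brevity (the challenge used order 4 with Laplace-Beltrami regularization, which is omitted here):

```python
import numpy as np

def sh_basis_order2(dirs):
    """Real symmetric SH basis up to order 2 (6 functions), evaluated at
    unit direction vectors `dirs` of shape (n, 3). Order 4 (15 functions)
    was used in the challenge; order 2 is shown here for brevity."""
    x, y, z = dirs[:, 0], dirs[:, 1], dirs[:, 2]
    return np.stack([
        np.full_like(x, 1.0 / np.sqrt(4 * np.pi)),    # Y_{0,0}
        np.sqrt(15 / (4 * np.pi)) * x * y,            # Y_{2,-2}
        np.sqrt(15 / (4 * np.pi)) * y * z,            # Y_{2,-1}
        np.sqrt(5 / (16 * np.pi)) * (3 * z**2 - 1),   # Y_{2,0}
        np.sqrt(15 / (4 * np.pi)) * x * z,            # Y_{2,1}
        np.sqrt(15 / (16 * np.pi)) * (x**2 - y**2),   # Y_{2,2}
    ], axis=1)

def resample_shell(signal, dirs_in, dirs_out):
    """Fit SH coefficients on the acquired directions (least squares),
    then evaluate the fit along the to-be-predicted directions."""
    coeffs, *_ = np.linalg.lstsq(sh_basis_order2(dirs_in), signal, rcond=None)
    return sh_basis_order2(dirs_out) @ coeffs

rng = np.random.default_rng(0)
dirs_in = rng.normal(size=(30, 3)); dirs_in /= np.linalg.norm(dirs_in, axis=1, keepdims=True)
dirs_out = rng.normal(size=(60, 3)); dirs_out /= np.linalg.norm(dirs_out, axis=1, keepdims=True)
true_coeffs = np.array([1.0, 0.2, -0.1, 0.3, 0.0, 0.1])
signal = sh_basis_order2(dirs_in) @ true_coeffs
resampled = resample_shell(signal, dirs_in, dirs_out)
```

Because the fit is a projection onto a low-order basis, any angular content above the chosen order is smoothed away, which is exactly the regularization effect described above.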

DIAMOND-based approach (DIAMOND-a, -b, -c and -d)-
This approach applied two additional preprocessing steps for noise filtering and data up-sampling. In particular, noise filtering was performed using the low-rank matrix approximation approach of (Veraart et al., 2016), followed by Gibbs ringing correction (Kellner et al., 2016). The up-sampling step interpolated the data to isotropic voxels of size 1 mm using ITK with sinc interpolation (Meijering et al., 1999).
The signal prediction approach was developed based on the hybrid biophysical and statistical DIAMOND method (Scherrer et al., 2016, 2017), which fits a probabilistic model that characterizes diffusion signals with a multi-compartment model. DIAMOND models each compartment in each voxel with a continuous statistical distribution of diffusion tensors (Scherrer et al., 2016), whose expectation characterizes the average 3-D diffusivity of the compartment and whose concentration captures the intra-compartment microstructural heterogeneity. The number of fascicle compartments at each voxel was automatically determined using the AICU model order selection method (maximum: 3 fascicle compartments). All voxels also included an isotropic diffusion compartment to characterize free-water diffusion.
Two algorithms, denoted DIAMOND-a and DIAMOND-b, based on the central and non-central DIAMOND models (Scherrer et al., 2016, 2017), respectively, were used to characterize the diffusion signals, the latter allowing for separate modeling of axial and radial heterogeneity. In both algorithms, the sum of the model coefficients was constrained to be 1. The participating team also submitted two other algorithms based on central and non-central DIAMOND without the constraint on the sum of the coefficients, but their experimental results were very similar to those of DIAMOND-a and DIAMOND-b. Thus, the results without the constraint are not reported in this work.
Based on the estimated model coefficients using the Prisma-ST dataset, the same signal expressions were applied to predict diffusion signals using the b-vectors and b-values of the other three scans. No additional steps were applied to remove potential variability among the datasets from different scanners and protocols.
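The prediction step reduces to evaluating the fitted multi-compartment signal expression at the target b-values and b-vectors. The sketch below shows only this forward prediction for fixed mean tensors, S(b, g)/S0 = Σ_i f_i exp(-b gᵀ D_i g); the tensor-distribution (heterogeneity) terms of the full DIAMOND model are omitted, and all parameter values are illustrative:

```python
import numpy as np

def predict_signal(fractions, tensors, bvals, bvecs):
    """Predict the normalized diffusion signal of a fitted multi-compartment
    tensor model: S(b, g)/S0 = sum_i f_i * exp(-b * g^T D_i g).
    Full DIAMOND additionally models a distribution of tensors per
    compartment; only the mean-tensor prediction step is sketched here."""
    # quad[i, n] = g_n^T D_i g_n for every compartment i and direction n
    quad = np.einsum('nj,ijk,nk->in', bvecs, tensors, bvecs)
    return fractions @ np.exp(-bvals[None, :] * quad)

# One anisotropic fascicle plus free water (diffusivities in mm^2/s).
fractions = np.array([0.7, 0.3])
tensors = np.array([np.diag([1.7e-3, 0.3e-3, 0.3e-3]),   # fascicle
                    np.diag([3.0e-3, 3.0e-3, 3.0e-3])])  # isotropic (free water)
bvals = np.array([1200.0, 1200.0, 3000.0])               # target b-values
bvecs = np.array([[1.0, 0, 0], [0, 1.0, 0], [1.0, 0, 0]])  # target b-vectors
s = predict_signal(fractions, tensors, bvals, bvecs)
```

Supplying the b-values and b-vectors of the to-be-predicted scan, as done here, is exactly the "no additional harmonization" prediction described above.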

Linear regression based on RISH features (LinearRISH)-This approach, denoted LinearRISH, first preprocessed the raw diffusion MRI following the HCP pipeline steps using in-house software. The Gibbs unringing approach (Kellner et al., 2016) was then applied to reduce ringing artifacts in the Prisma-ST data. Then, the steps described in (Karayumak et al., 2019) were followed: (i) RISH features up to 6th order were estimated separately for the diffusion signals at the two b-shells with b = 1200, 3000 s/mm^2, using a Tikhonov-regularized algorithm for estimating the SH coefficients with a regularization parameter of 0.001; (ii) mean templates were created for each RISH feature at b = 3000 s/mm^2 using the antsMultivariateTemplateConstruction2.sh command provided by ANTs (Avants et al., 2011); (iii) scale maps characterizing the corresponding cross-scanner and/or cross-protocol variability were calculated between the mean RISH features of the Prisma-ST datasets and the other three scans, and the estimated maps were applied to scale the SH coefficients in the subject space; the same steps were also applied to scale the SH coefficients at b = 1200 s/mm^2; (iv) the scaled SH coefficients were transformed back to diffusion signals along the provided set of b-vectors for each of the two b-shells, yielding the harmonized signal at each shell. Finally, to preserve the overall integrity of the harmonized signal, the mean harmonized diffusion signal was scaled with the average signal from the other three protocols.
Considering potential inter-subject variability and the limited number of training subjects, the adaptive LinearRISH approach used different subsets of the training dataset to estimate the mappings for each test subject. Briefly, the three training subjects most similar to each test subject were selected heuristically, where the similarity criterion was the mean squared error between the RISH features of the Prisma-ST datasets of the training and test subjects at the whole-brain as well as the region-wise level. The LinearRISH steps described above were then followed using these subsets of the training dataset to harmonize each test subject.
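Because R_l is a sum of squared coefficients, matching a source RISH energy to a target energy amounts to multiplying every coefficient in band l by sqrt(R_l^target / R_l^source). A simplified per-voxel sketch of this scaling step (template construction and registration are omitted; the array layouts and helper names are illustrative):

```python
import numpy as np

def rish_scale(coeffs_source, rish_source, rish_target, band_slices, eps=1e-12):
    """Scale SH coefficients band-by-band so the source RISH energy matches
    the target: since R_l = sum_m c_{l,m}^2, multiplying every coefficient
    in band l by sqrt(R_l^target / R_l^source) maps the energies linearly.
    A simplified per-voxel sketch of the mapping step in LinearRISH."""
    scaled = coeffs_source.copy()
    for l, sl in band_slices.items():
        scale = np.sqrt(rish_target[l] / (rish_source[l] + eps))
        scaled[..., sl] = coeffs_source[..., sl] * scale[..., None]
    return scaled

# Toy: one voxel, order-2 coefficients (bands l = 0 and l = 2).
coeffs = np.array([[2.0, 0.3, 0.4, 0.0, 0.0, 0.0]])
bands = {0: slice(0, 1), 2: slice(1, 6)}
rish_src = {0: np.array([4.0]), 2: np.array([0.25])}  # energies of `coeffs`
rish_tgt = {0: np.array([1.0]), 2: np.array([1.0])}   # template energies
harmonized = rish_scale(coeffs, rish_src, rish_tgt, bands)
```

Scaling whole bands rather than individual coefficients preserves the angular structure of the signal while adjusting its per-band energy, which is what makes the mapping linear in the RISH domain.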

Tissue-wise regression forest (TWRF-a and TWRF-b)-The MPRAGE images were used to obtain label maps for WM, GM and CSF regions using FSL FAST (Zhang et al., 2001). For each tissue region, a random forest regression (Breiman, 2001) was trained to predict the diffusion signals using different training features. Two sets of training features were extracted for each voxel. The first set included the normalized diffusion signal; the eigenvalues of the diffusion tensor; the trace, linearity, planarity and sphericity, selected to make the solution comparable to (Alexander et al., 2017); the mean signals from the 3 × 3 × 3 neighboring voxels; and the b-values and b-vectors of both the input and the to-be-predicted scans. The second set of features only included the normalized diffusion signals and the mean of the neighborhood. The following parameters were used in training: maximum depth of the decision trees = 10, number of decision trees (estimators) = 10, mean squared error as the split criterion, and minimum samples per leaf node = 1. The scikit-learn toolbox (Pedregosa et al., 2011) was used for the training. The prediction results corresponding to the two feature sets are denoted TWRF-a (full feature set) and TWRF-b (reduced feature set), respectively. For training, 6 subjects were used to train the algorithm and the remaining subjects were used for validation before applying the algorithms to the testing datasets.

Single-layer SH network (SH-1Layer)-This approach first interpolated Prisma-ST to match the resolution of the two SA scans. Then, the 4th order SH coefficients of the normalized diffusion signals, i.e. s(b, u)/s(0) with s(b, u) being the diffusion signal at b-value b along the gradient direction u and s(0) being the baseline, were computed separately at the two non-zero b-shells. The harmonization algorithm was based on a network with a single convolutional layer which took the SH volume (both shells concatenated) as input and returned a spherical harmonic volume (again, both shells concatenated) as output. It was trained without validation because of the relatively low risk of overfitting for this simple network. The performance of SH-1Layer can be considered as a "baseline" against which other, more complex network-based methods should be compared. The network was trained using the Adam optimizer and the mean squared error (MSE) as the loss function, with a learning rate of 0.0001 and 30 epochs.
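A single 1 × 1 × 1 convolution over concatenated SH channels is equivalent to one linear map shared across all voxels. The sketch below trains such a map with plain gradient descent in numpy (the challenge entry used the Adam optimizer with learning rate 0.0001; the synthetic data and the larger step size here are illustrative):

```python
import numpy as np

# A single 1x1x1 convolutional layer over SH channels is one shared linear
# map applied at every voxel. Minimal sketch: learn W (30 -> 30
# concatenated two-shell SH channels) by gradient descent on the MSE.
rng = np.random.default_rng(0)
n_voxels, n_ch = 1000, 30
X = rng.normal(size=(n_voxels, n_ch))                # input SH coefficients
W_true = np.eye(n_ch) + 0.1 * rng.normal(size=(n_ch, n_ch))
Y = X @ W_true                                       # target-scanner SH coefficients

W = np.zeros((n_ch, n_ch))
lr = 0.1
losses = []
for epoch in range(30):
    pred = X @ W
    losses.append(np.mean((pred - Y) ** 2))
    grad = 2 * X.T @ (pred - Y) / n_voxels           # gradient of the MSE
    W -= lr * grad
```

Since the model is linear and the loss quadratic, the training loss decreases monotonically toward the least-squares solution, which is why such a one-layer network is a sensible lower bound for the deeper architectures below.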

SHResNet-a:
The method from the first team applied the same preprocessing steps as SHInterp. This algorithm, denoted SHResNet-a, used the SH coefficients from a 3 × 3 × 3 voxel neighborhood (input size: 3 × 3 × 3 × 15) to predict the 15-dimensional SH coefficients of the center voxel using 3D convolutional networks. The predicted coefficients were converted to diffusion signals using the inverse SH transform. More detailed information on the structure of this network can be found in Fig. 3 of (Tax et al., 2019).

SHResNet-b:
The methods from the second team include five variants, SHResNet-b0 to -b4, which all applied the same preprocessing approaches as SH-1Layer. The first variant, denoted SHResNet-b0, used the same network structure and training method as SHResNet-a; the difference between SHResNet-a and SHResNet-b0 lies in the preprocessing, i.e. nonlinear registration vs. interpolation. Variants SHResNet-b1, SHResNet-b2 and SHResNet-b3 included a one-hot orthogonal vector of SLANT labels (Huo et al., 2018) as input to reduce anatomically dependent variability across scanners. In particular, SHResNet-b1 used 133 labels to denote different anatomical regions. SHResNet-b2 simplified the hierarchy of the labels into 23 unique regions (i.e. combining cortical lobes, cerebral WM, cerebellum, etc., using the hierarchical algorithm in (Asman and Landman, 2014)). SHResNet-b3 further smoothed the labels using a Gaussian filter. Instead of anatomical labels, SHResNet-b4 included a one-hot orthogonal vector of the voxel location in x, y and z coordinates (normalized from -1 to 1) as input to reduce FOV/location-dependent artifacts. All the models, SHResNet-b0 to SHResNet-b4, were trained using the same parameters and cost functions as SH-1Layer in Section 2.4.6.

Spherical network (SphericalNet)-This method first separates the two input shells into ten linear combinations based on the neighboring q-space signals, while the last layer combines the output of the ResNet to predict the two shells of the three to-be-predicted scans. Further, each residual block was extended to be applicable to multi-shell data.
2.4.9. Tiny-patch network (Tinypatch)-In pre-processing, this method, named Tinypatch, applied FSL FAST to obtain label maps for white-matter, gray-matter and CSF regions from the MPRAGE images. The diffusion data were projected onto an 8th order spherical harmonics (SH) basis using an L2-minimal projection. These projections were concatenated voxelwise with the tissue labels and then fed into a patch-wise feed-forward fully connected neural network, which output estimates of the corresponding patch from the to-be-predicted site.
Patches were constructed using a center voxel and the 6 immediately adjacent voxels. The input was then vectorized, and then fed into the fully connected network. The neural network consisted of 2 encoder layers, one center layer, and 2 decoder layers, with (128,64), (32), and (64,128) hidden units respectively, using tanh non-linear activations at each hidden unit. The encoder and center layers were pre-trained using an auto-encoder task, attempting to reconstruct the base site using a temporary set of decoder layers. The full network was then trained using corresponding patches between scans; for the higher resolution scans (Task 2a and Task 2b) the spatial offsets were concatenated to the activations at the center layer. Loss was computed by projecting SH estimates back to the subject specific b-vectors. This loss was then propagated back to the rest of the network.
A separate network was trained for each task site (Task 1, Task 2a, and Task 2b). Each network was trained to convergence on 9 of the training subjects, leaving one subject for validation. The best performing weight set on the validation subject was then run on the test dataset, and the outputs submitted for evaluation.
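Patch construction in Tinypatch uses the center voxel plus its 6 face-adjacent neighbors, vectorized into one input. An illustrative numpy sketch (the volume size, feature count and border handling are our assumptions):

```python
import numpy as np

def six_neighbor_patches(volume, mask):
    """Extract Tinypatch-style patches: for every masked voxel, concatenate
    the feature vectors of the center voxel and its 6 face-adjacent
    neighbors into one flat input vector."""
    offsets = [(0, 0, 0), (1, 0, 0), (-1, 0, 0),
               (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    centers = np.argwhere(mask)
    patches = []
    for (i, j, k) in centers:
        patches.append(np.concatenate(
            [volume[i + di, j + dj, k + dk] for di, dj, dk in offsets]))
    return np.array(patches)

# Toy 4D volume: 5x5x5 voxels with 6 features each; the mask excludes the
# border so every neighbor index stays inside the array.
rng = np.random.default_rng(0)
vol = rng.normal(size=(5, 5, 5, 6))
mask = np.zeros((5, 5, 5), dtype=bool)
mask[1:-1, 1:-1, 1:-1] = True
X = six_neighbor_patches(vol, mask)
```

Each row of `X` would then be fed through the 128-64-32-64-128 encoder/decoder stack described above.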
2.4.10. Deeper image quality transfer (DIQT)-Raw data were pre-processed to extract signal features by spherical harmonic (SH) deconvolution (Tournier et al., 2004). The pre-processing was performed using a home-made Python script, which also included some functions from DIPY (Garyfallidis et al., 2014). For each of the 2 diffusion gradient shells, the 15 coefficients of the 4th order SH deconvolution were estimated from the normalized raw signal at the different diffusion gradient directions, using a real and symmetric SH basis. These 30 SH coefficients (15 coefficients of the 4th order SH deconvolution for each of the 2 shells), computed from the available subjects in the training set, were then used as input to a CNN to predict the corresponding 30 SH coefficients for the unseen subjects in the test set. From the predicted 30 SH coefficients, the normalized raw signal for the subjects in the test set was computed by evaluating the SH along the provided set of directions.
The DIQT approach used supervised learning on 3D input-target patches extracted from the Prisma-ST scans. The center voxel of each input patch was contained within the brain mask, and each patch was of physical size 11 × 2.4^3 mm^3. Before extracting the patches, the mean and standard deviation were computed for each input channel, i.e. each SH coefficient, across the training subjects. The training (and validation) data were then normalized to zero mean and unit standard deviation. The input patches were then used to predict image patches in the target space, where the size of the target patch was chosen such that, when considered in the same physical space, the target patches were contained within the input patches.
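The per-channel standardization step can be sketched as follows; the statistics are pooled over the voxels of the training subjects and then re-applied to any held-out data (the array shapes are illustrative):

```python
import numpy as np

def fit_channel_stats(train):
    """Mean and standard deviation per input channel (SH coefficient),
    pooled over voxels of the training subjects, as used to standardize
    DIQT input patches."""
    return train.mean(axis=0), train.std(axis=0)

def standardize(data, mean, std, eps=1e-12):
    """Apply the training statistics to new data (zero mean, unit std)."""
    return (data - mean) / (std + eps)

# Toy: 10000 voxels x 30 SH channels (15 coefficients per shell, 2 shells).
rng = np.random.default_rng(0)
train = rng.normal(loc=2.0, scale=3.0, size=(10000, 30))
mean, std = fit_channel_stats(train)
train_z = standardize(train, mean, std)
```

Reusing the training-set statistics on the validation and test data, rather than recomputing them, keeps the normalization consistent with what the network saw during training.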
The performance validation of this algorithm was initially explored on two fixed subjects. Then cross-validation was utilized, where the hold-outs were different to the data exploration subjects, to estimate the generalized error-rate. 10% of the training data was allocated as a pre-validation set, for early stopping of model training. The final validation/submission was performed by multiple reconstruction of a subject's brain. For a reconstruction, a subject's input brain was divided into non-overlapping input patches, where the patch predictions were combined to create a whole brain prediction. This procedure was performed 64 times and the reconstructions averaged. Variants of the DIQT Network (Blumberg et al., 2018;Blumberg et al., 2019) were used for the three tasks, where the shuffle in (Blumberg et al., 2018) was removed in Task1. In Tasks 2 and 3, the shuffle in was replaced with a transposed convolutional layer. The model hyperparameters of these networks, including the number of reversible network layers per block (a proxy for the network's depth), size of input/target patch, learning rate and decay, sampling scheme for training data, were explored in . The code for the extension of this challenge and  is available at https://github.com/sbb-gh/. Next, this method used the Spherical Harmonics and a Radial Decomposition (SHARD) proposed in (Christiaens et al., 2018) as rotation-invariant representation of multi-shell dMRI data. SHARD learns the optimal low-rank representation of a given dataset (single subject or group), using a singular value decomposition across shells and spherical harmonic bands. In this work, a groupwise basis was learned for each protocol (Prisma-ST and Connectom-ST) from the 6 training subjects (subjects A-F). The data of all (training, validation, and test) subjects is then projected onto the basis, thus representing each scan by a 4-D image consisting of 57 vol of SHARD coefficients in the basis of its protocol. 
The SHARD images were estimated in the original space of Connectom scans then warped to the space of Prisma data.

Spherical Harmonics and a Radial Decomposition (SHARD) based ResNet (CMResNet-a and CMResNet-b)
Two ResNet networks, denoted CMResNet-a and CMResNet-b, were used to learn mappings between the SHARD images; each includes an encoder and two decoders. The encoder E transforms the source SHARD image to a common embedding. The two decoder networks D_S and D_T transform from the embedding to the source and target domains, respectively. The networks consist of multiple depth-wise separable convolution layers, residual blocks, and a Squeeze-and-Excitation connection (Hu et al., 2017) that uses channel-wise average pooling inside the brain mask as features for adaptive channel scaling.
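The masked Squeeze-and-Excitation step can be sketched in a few lines of numpy. The two-layer bottleneck with reduction factor r = 4 follows the figure description; the weight shapes and function name are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def squeeze_excite(feat, mask, w1, w2):
    """Squeeze-and-Excitation with masked pooling: average each channel
    over voxels inside the brain mask, pass the pooled vector through a
    small bottleneck, and rescale channels by the resulting sigmoid gates."""
    # feat: (C, X, Y, Z); mask: boolean (X, Y, Z)
    pooled = feat[:, mask].mean(axis=1)            # (C,) masked global average
    hidden = np.maximum(w1 @ pooled, 0.0)          # ReLU bottleneck (C/r units)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid, one gate per channel
    return feat * gates[:, None, None, None]
```

Because the gates lie in (0, 1), the block can only attenuate channels, adaptively down-weighting those that are uninformative inside the brain mask.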
The structures of the two networks, CMResNet-a and CMResNet-b, are shown in Fig. 1 to Fig. 3. The networks use a multiplicative connection between the input and the encoder output and expand the N = 57 SHARD volume channels to C = 72 channels. The networks were trained using a mean squared loss on the SHARD coefficients with either the source or the target image as input. The transformed image from a single pass through the encoder and a decoder was evaluated against the same subject scanned with the other protocol (see Fig. 1a) and b)). Additionally, using two passes through the network, the input image was compared to the prediction given itself as input, as shown in Fig. 1c) and d). In the first two training methods the loss was evaluated between domains; in the latter, it was evaluated using a single image from either domain. Chaining E, D_S, E and D_T allows training the network using a cycle-consistency loss (Zhou et al., 2016; Liu et al., 2017) on source images without the need for well-aligned matching target data. To avoid learning an identity mapping in this setting, either D_T or D_S is held constant during cycle-consistency training steps, and the network is trained simultaneously with paired and unpaired losses.

Fig. 4(a) and (b) illustrate the average APE of the FA and MD measures over all voxels within the evaluation masks for the 19 algorithms that completed Task 1. These DTI features were computed using only diffusion signals at the b-shell with b = 1200 s/mm^2, since DTI only captures the second-order cumulant, and higher-order cumulants become more prominent at higher b-values. Similarly, Fig. 5(a) to 5(d) show the APE of the zeroth and second order RISH features, R0 and R2, for diffusion signals at the two b-shells, respectively. Fig. 6(a) and (b) illustrate the APE of MK and RTOP computed using multi-shell diffusion signals. Algorithms of the same type, i.e. interpolation-, regression- or CNN-based, are coded by the same color.
A summary of the mean APE values of the 19 algorithms is provided in Table 2. SHInterp has similar or slightly better performance than the reference. The two DIAMOND model-based algorithms perform very similarly, indicating that the different choices of DIAMOND models have little effect on the results. SHOREInterp performs similarly to SHInterp and the reference. TWRF-b is slightly better than TWRF-a, but both actually increased variability compared to the reference. SHResNet-a is the best among the six SHResNet-based algorithms. DIQT is in general the best CNN-based approach, but slightly worse than SH-1Layer or CMResNet-b in some measures. Overall, LinearRISH has the best performance in all evaluation metrics except RTOP, for which it has the second highest error, worse than the reference and SHInterp. Fig. 7 illustrates the spatial distribution of the mean APE over the ROIs of the Desikan-Killiany atlas from FreeSurfer (Fischl, 2012) in sagittal, axial and coronal views of a brain for 8 representative algorithms. The value in each ROI is the average APE over all measures and all underlying voxels; the size of the ROIs is not considered in the analysis. In particular, Fig. 7(a) to 7(h) show the APE corresponding to Reference, DIAMOND-a, LinearRISH, SH-1Layer, SphericalNet, SHResNet-b0, DIQT and CMResNet-b. Fig. 7(a) shows that the frontal and temporal lobes have the highest errors, which may be related to residual, uncorrected distortions and motion-induced effects in these regions. The errors are significantly reduced by several methods in Fig. 7(b) to 7(f), though the frontal and temporal lobes still have relatively higher errors than other regions. Fig. 7(c) shows that LinearRISH has relatively lower errors than other methods in almost all regions. Table 3 summarizes the mean APE in the gray matter and white matter regions for all 19 algorithms.
It also provides the two ROIs with the highest or lowest APE for each method. For all methods, the APE values in gray matter are higher than those in white matter. Moreover, several ROIs, such as the entorhinal cortex, the superiorparietal cortex, and the rostralmiddlefrontal cortex, consistently have the highest errors for most algorithms. On the other hand, white matter regions in the paracentral area, the banks of the superior temporal sulcus (bankssts) and the insula have the lowest errors.
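The aggregation used throughout these comparisons can be sketched as below. The exact normalization used by the challenge may differ; the percentage form of the absolute percentage error (APE) shown here is an assumption.

```python
import numpy as np

def mean_ape(pred, ref, mask, eps=1e-12):
    """Mean absolute percentage error of a dMRI measure (e.g. FA) over all
    voxels inside an evaluation mask or ROI, relative to the reference scan."""
    err = np.abs(pred[mask] - ref[mask]) / (np.abs(ref[mask]) + eps)
    return 100.0 * err.mean()
```

Evaluating this per ROI (rather than over the whole brain mask) yields the regional maps of Fig. 7 and Tables 3 and 5.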

Task 2: spatial- and angular-resolution enhancement
3.2.1. Global evaluation-The barplots in Figs. 8 to 10 illustrate the APE of 8 measures for 19 algorithms evaluated for Task 2a. Overall, the average APE is higher than the corresponding results in Task 1 shown in Figs. 4 to 6. The reference value of APE is not available in this task since the Prisma-ST and Prisma-SA scans have different spatial resolutions. But the SHInterp approach can be considered as the "baseline" since it provides harmonization errors as assessed in Task 1 that are similar to the intrinsic between-scanner differences. A summary of the mean APE values for 8 evaluation measures is shown in Table 4.
As in Task 1, the DIAMOND- and SHORE-based approaches perform better than the SHInterp method, indicating that DIAMOND and SHORE are more robust to scanner- or protocol-dependent noise and errors in the measurements. The LinearRISH approach still performs best in predicting the FA, MD, MK and RISH measures, though it has a higher error in RTOP than SHInterp.
For CNN-based approaches, the simple SH-1Layer approach performs better than SHResNet-b1 to SHResNet-b4 and DIQT, implying that these more complex networks need further training to improve their performance. Nonetheless, other deep-network-based methods, such as SHResNet-b0 and CMResNet-b, perform better than SH-1Layer.
The global evaluation results for Task 2b are provided in Figs. S1-S3 and Table S1 in the Supplementary Materials. Overall, results similar to Task 2a were found for this task. The LinearRISH approach still performs best in most metrics except RTOP. The DIAMOND-based methods and the SH-1Layer approach have similar performance. SHResNet-b0 and DIQT are two of the best CNN-based algorithms.
3.2.2. Regional evaluation-The spatial distribution of APE values has patterns similar to Task 1. In particular, the frontal and temporal lobes have relatively higher errors than other regions, which may be caused by image distortions or head motion. A detailed illustration of the spatial patterns for the algorithms evaluated for Task 2a and Task 2b is provided in Figs. S4 and S5 in the Supplementary Materials. Table 5 shows the mean APE values in GM and WM and the two ROIs with the highest or lowest APE values for Task 2a. The corresponding results for Task 2b are shown in Table S2 in the Supplementary Materials.
The estimation errors have similar spatial patterns as in Task 1. In particular, the errors in gray matter are significantly higher than the errors in white matter. The cortical regions in the superiorparietal, the entorhinal, the temporalpole and the frontalpole have the highest APE values. On the other hand, white matter in paracentral areas consistently exhibits the lowest APE for most algorithms.

Discussion
In this work, we have evaluated the performance of 19 algorithms that have participated in an open competition on harmonization of multi-shell dMRI scans from different scanners and protocols. Next, we discuss several factors that may limit the results of this study and possible solutions to further improve the performance in future work.

Sample size of training data
The performance of harmonization algorithms may depend on the number of subjects used in training (Karayumak et al., 2019). Ideally, the number of subjects should be large enough that the training dataset contains sufficient information to learn the variability across datasets acquired using different scanners and/or protocols. If the harmonization algorithm is based on deep networks, the number of training examples should be large enough to ensure the convergence of the training algorithms and to avoid overfitting. In practice, however, travelling subjects are expensive and logistically difficult, making them infeasible for many multi-center studies. Thus, many studies only use one subject, or a phantom, to test the stability of imaging measures (Teipel et al., 2011). The benchmark database evaluated in this study is one of the largest open datasets with scans from travelling subjects using different scanners and protocols. However, the analysis shows that several algorithms with complex network structures perform worse than a simple one-layer network, SH-1Layer, indicating that the training data are still insufficient for complex algorithms. Thus, more effective network structures and training algorithms need to be explored in future multi-site harmonization studies.

Inter-subject variability and adaptive prediction
In the evaluation results, the LinearRISH method demonstrated better performance than the other algorithms in most evaluation metrics, followed by the DIAMOND- and SHORE-based approaches. Three possible factors contributing to its superior performance are as follows. First, the simple linear-regression approach avoids overfitting and provides robust data predictions; the voxel-wise linear models used in LinearRISH reliably reduce spatially dependent variability in the datasets. Second, this approach adaptively selected different training subjects to predict each testing dataset based on similarities in the corresponding Prisma-ST scans. Though it is expected that there exist subject-independent factors related to the cross-scanner and cross-protocol variability in the dataset, there may also exist other factors, such as head size and shape, that lead to subject-specific variability across scans. The adaptive prediction approach used by LinearRISH was shown to be an effective solution for reducing subject-dependent variability when learning mappings between dMRI data from different scanners and protocols. However, the potential of adaptive prediction in clinical applications with pathological data needs to be explored in future work. Third, LinearRISH is the only method that uses 6th-order SH representations, while the other methods use 4th-order SH representations.
The higher-order SH representations may provide more detailed information about the data, which can help improve the performance of harmonization algorithms.
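The core idea of a voxel-wise linear mapping of RISH features can be sketched as follows. This is a simplification: it omits LinearRISH's adaptive subject selection and its 6th-order SH representation, and the least-squares slope-through-origin estimator is an assumption, as are the function names.

```python
import numpy as np

def fit_scale_maps(rish_source, rish_target, eps=1e-12):
    """Voxel-wise linear mapping between RISH features of two scanners:
    estimate a per-voxel scale from co-registered training subjects
    (least-squares slope through the origin)."""
    # rish_*: (n_subjects, X, Y, Z) maps of one RISH feature per scanner
    num = (rish_source * rish_target).sum(axis=0)
    den = (rish_source ** 2).sum(axis=0) + eps
    return num / den                       # per-voxel scale map

def harmonize(rish_new, scale_map):
    """Apply the learned scale map to a new scan's RISH feature."""
    return rish_new * scale_map
```

Because a separate scale is estimated per voxel (and per RISH band and shell), this scheme can absorb spatially varying scanner differences while remaining too constrained to overfit.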

On evaluation metrics
This study used several metrics, including DTI, DKI, MAP-MRI and RISH features, to evaluate the performance of harmonization algorithms at global and regional levels. Note that the FA and MD measures were computed using only dMRI signals at the lower b-shell at b = 1200 s/mm^2, since higher-order cumulants become more prominent at higher b-values. Moreover, the RISH features were computed separately for the two b-shells at b = 1200 and 3000 s/mm^2. For most algorithms, the prediction error of the RISH features is higher at the higher b-value. The MK measure from DKI and the RTOP measure from MAP-MRI were computed by integrating dMRI signals at both b-shells. These two measures reflect different aspects of the signals. In particular, the MK measure is based on the cumulant expansion, which does not provide an accurate prediction of diffusion signals at high b-values. Moreover, the maximum b-value of b = 3000 s/mm^2 might be beyond the radius of convergence for DKI, jeopardizing accuracy but potentially improving precision (Chuhutin et al., 2017). On the other hand, the RTOP measure is based on an analytical expression of the ensemble average propagator estimated from the diffusion signals. Though other approaches have been proposed to compute the RTOP measure, these methods all rely on continuous extrapolation of the diffusion signals. Thus, the RTOP measure is sensitive to the decay of the diffusion signals and to the estimation method. Although at least two b-shells are needed to estimate MK and RTOP for characterizing non-exponential signal decay, more b-values are expected to further improve the consistency of the estimation results. In the evaluation results, the LinearRISH approach achieved the best performance in almost all metrics except RTOP, which may be caused by errors in the relative signal decay of the predicted signals along the radial direction of b-values.
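To make the cumulant argument concrete, with exactly two shells the expansion ln(S/S0) = -b D + (1/6) b^2 D^2 K can be solved exactly per direction. The sketch below assumes b in ms/um^2 (so b = 1.2 and 3.0 correspond to the two shells); the per-direction scalar form is a simplification of the full DKI tensor fit.

```python
import numpy as np

def fit_kurtosis(s0, s1, s2, b1=1.2, b2=3.0):
    """Solve ln(S/S0) = -b*D + (1/6)*b^2*D^2*K exactly from two shells,
    per diffusion direction. Returns the apparent diffusivity D and
    apparent kurtosis K along that direction."""
    y1, y2 = np.log(s1 / s0), np.log(s2 / s0)
    # Linear system in (D, c) with c = D^2 * K / 6:  [-b, b^2] @ [D, c] = y
    A = np.array([[-b1, b1 ** 2], [-b2, b2 ** 2]])
    D, c = np.linalg.solve(A, np.array([y1, y2]))
    K = 6.0 * c / D ** 2
    return D, K
```

With only two b-values the fit is exact by construction, which is precisely why measurement noise or a b-value beyond the radius of convergence propagates directly into MK.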
Caution should be exercised when interpreting the results from the RISH (R0 and R2) metrics, since these are based on the same underlying model as LinearRISH. However, evaluation with different metrics should reduce the potential bias toward metrics whose underlying model is similar to a proposed algorithm. These diversified metrics can provide a guideline for selecting harmonization algorithms according to the metrics of interest.
The ROI-based analysis, e.g. Fig. 7, shows that GM has much higher APE values than WM. A possible reason is that GM has greater heterogeneity due to its layered structure and cortical folding, which makes it more difficult to harmonize the dMRI data. The findings of the regional analysis suggest that caution should be taken when interpreting results in these high-APE regions. Future work is needed to develop more reliable algorithms to harmonize these ROIs. Finally, we note that a limitation of the analysis is that only the APE is used to compare different dMRI measures. The APE metric may not reflect the impact of differences in dMRI data on statistical analyses in clinical studies. Further evaluation metrics will be needed in future studies to characterize the differences between dMRI data from the multiple perspectives relevant to clinical studies.

On signal representations and models
The harmonization algorithms used four types of approaches to represent or model diffusion signals: spherical harmonics (SH), Spherical Harmonics and a Radial Decomposition (SHARD), the DIAMOND model and the SHORE model. SH was the most commonly used approach and provides a complete basis for representing single-shell diffusion signals. Typically, fourth- or sixth-order SH representations were used to remove high-frequency noise. Moreover, the SH coefficients were usually estimated separately for each b-shell, without modeling the radial signal decay. On the other hand, the novel SHARD representation and the representations based on the DIAMOND and SHORE models were computed using multi-shell diffusion signals. These approaches use either parsimonious/band-limited representations or biophysical models to remove noise in the signals. In the interpolation-based approaches, SH, DIAMOND and SHORE were all used to predict dMRI signals without explicitly removing cross-scanner and cross-protocol variability. Interestingly, all of these approaches, especially the DIAMOND-based methods, provided lower prediction error than the reference, indicating that these hybrid biophysical and statistical models accurately characterize the diffusion properties of the underlying tissues and reduce the variability across scanners, particularly the variability caused by measurement noise.
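For illustration, a single-shell SH fit and the resulting RISH band energies (R0, R2) can be sketched as below. The challenge methods used 4th- or 6th-order bases; the order-2 truncation here is purely for brevity, and the function names are illustrative.

```python
import numpy as np

def sh_basis_order2(dirs):
    """Real spherical harmonics up to l=2 evaluated at unit vectors (N, 3)."""
    x, y, z = dirs.T
    c0 = 0.5 / np.sqrt(np.pi)
    c2 = 0.5 * np.sqrt(15.0 / np.pi)
    return np.stack([
        np.full_like(x, c0),                             # l=0
        c2 * x * y,                                      # l=2, m=-2
        c2 * y * z,                                      # l=2, m=-1
        0.25 * np.sqrt(5.0 / np.pi) * (3 * z ** 2 - 1),  # l=2, m=0
        c2 * x * z,                                      # l=2, m=1
        0.25 * np.sqrt(15.0 / np.pi) * (x ** 2 - y ** 2),# l=2, m=2
    ], axis=1)                                           # (N, 6)

def rish_features(signal, dirs):
    """Least-squares SH fit of one shell, then band-wise energies R0, R2."""
    B = sh_basis_order2(dirs)
    coef, *_ = np.linalg.lstsq(B, signal, rcond=None)
    return coef[0] ** 2, np.sum(coef[1:] ** 2)           # R0, R2
```

Because each band energy is a sum of squared coefficients, rotating the gradient scheme leaves R0 and R2 unchanged, which is what makes RISH features suitable for cross-scanner comparison.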

Conclusion
In this study, we have presented a comprehensive evaluation of the performance of multi-shell harmonization algorithms for reducing cross-scanner and/or cross-protocol variability in dMRI data using a benchmark database. The results showed that some algorithms were able to significantly reduce the variability to a similar level as the scan-rescan variability between different scans of the same subjects using the same scanner and protocol. Moreover, the results from different types of algorithms indicate that inter-subject variability and measurement noise are important factors causing variability among dMRI scans. The performance of the large set of algorithms evaluated in this study could provide useful information to guide the selection of harmonization algorithms in future multi-site studies with travelling subjects. The benchmark database used in this study can be obtained at: https://www.cardiff.ac.uk/cardiff-university-brain-research-imaging-centre/research/projects/cross-scanner-and-cross-protocol-diffusion-MRI-data-harmonisation. Evaluation pipelines and the results are available at: https://projects.iq.harvard.edu/files/cdmri2018/files/mushac_evaluation.zip. The organizing team of MUSHAC can be contacted by email at cdmri18@cs.ucl.ac.uk.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.

Figure captions:

Fig. 1. The network uses the encoder E to transform the source and/or target image to a common embedding that the decoders D_T and D_S transform to the target and source domain, respectively. The weights are shared across all networks with identical names. The rows show the methods for training the encoder and decoder networks to predict the target SHARD image given the source SHARD image a), the inverse mapping b), and the full cycle from source to predicted source c) and from target to predicted target d).

Fig. 2. The encoder and decoder network architecture of CMResNet-a and their building blocks. SConv3D denotes a depth-wise separable convolution consisting of a convolution (with kernel size 3 × 3 × 3, unless otherwise stated, stride = 1, zero padded), grouped by input channels, followed by a 1 × 1 × 1 convolution (stride = 1, no padding) mapping to the number of output channels. The number of channels is omitted if it is unchanged. The Squeeze-and-Excitation reduction factor (r) is 4.

Fig. 3. Encoder and decoder network architecture of CMResNet-a. See Fig. 2 for the layer legend.

Fig. 4. The APE of the FA and MD measures computed using diffusion signals at the b-shell with b = 1200 s/mm^2 for algorithms evaluated for Task 1. The reference (dashed lines) illustrates the difference between Prisma-ST and Connectom-ST scans with no harmonization. The y-axis does not include the origin in order to emphasize the differences between methods.

Fig. 5. The APE of the zeroth and second order RISH features, R0 and R2, of diffusion signals at two b-shells for the algorithms evaluated for Task 1. The reference (dashed lines) illustrates the difference between Prisma-ST and Connectom-ST scans with no harmonization.

Fig. 6. The APE of MK and RTOP computed using multi-shell diffusion signals for algorithms evaluated for Task 1. The reference (dashed lines) illustrates the difference between Prisma-ST and Connectom-ST scans with no harmonization.

Fig. 7. The average regional APE values of all dMRI measures in different brain regions defined by the Desikan-Killiany atlas from FreeSurfer for several representative algorithms evaluated for Task 1 and the reference.

Fig. 8. Global APE values for the FA and MD measures computed using diffusion signals at the b-shell with b = 1200 s/mm^2 for algorithms evaluated for Task 2a.

Fig. 9. Global APE values for the zeroth and second order RISH features, R0 and R2, of diffusion signals at two b-shells for the algorithms evaluated for Task 2a.

Fig. 10. Global APE values for the MK and RTOP measures computed using multi-shell diffusion signals for algorithms evaluated for Task 2a.