Outliers in diffusion-weighted MRI: Exploring detection models and mitigation strategies

Diffusion-weighted MRI (dMRI) is a medical imaging method that can be used to investigate the brain microstructure and structural connections between different brain regions. The method, however, requires relatively complex data processing frameworks and analysis pipelines. Many of these approaches are vulnerable to signal dropout artefacts that can originate from subjects moving their head during the scan. To combat these artefacts and eliminate such outliers, researchers have proposed two approaches: to replace outliers or to downweight outliers during modelling and analysis. With the rising interest in dMRI for clinical research, these types of corrections are increasingly important. Therefore, we set out to investigate the differences between outlier replacement and weighting approaches to help the dMRI community select the best tool for their data processing pipelines. We evaluated dMRI motion correction registration and single tensor model fitting pipelines using Gaussian Process and Spherical Harmonic based replacement approaches and outlier downweighting, using highly realistic whole-brain simulations. As a proof of concept, we applied these approaches to infant dMRI data sets that contained varying numbers of dropout artefacts. Based on our results, we concluded that Gaussian Process based outlier replacement provided similar tensor fit results to Gaussian Process based outlier detection and downweighting. Therefore, if only the least-squares estimate of the single tensor model is of interest, our recommendation is to use outlier replacement. However, outlier downweighting can potentially provide a more accurate estimate of the model precision, which could be relevant for applications such as probabilistic tractography.


Introduction
Subject motion during a diffusion-weighted MRI (dMRI) acquisition (Johansen-Berg and Behrens, 2014) can cause signal dropout artefacts, which effectively pose a problem of randomly missing data (Andersson et al., 2016; Sairanen et al., 2018b). When such data are used in modelling of the brain microstructure (Alexander et al., 2019; Novikov et al., 2019), brain connectivity (Zhang et al., 2021), or diffusion-weighted signal representations (Basser et al., 1994; Alexander et al., 2002), these artefacts must be accounted for to avoid biased results (Andersson et al., 2016; Sairanen et al., 2018b, 2021, 2022). Due to the typically used echo-planar imaging (EPI) technique, these artefacts mostly affect a whole slice, or a group of slices in the case of multi-band acquisition (Andersson et al., 2016). Therefore, it is sensible to classify these outliers in a slicewise or groupwise manner as well.
Tools to handle missing data problems are generally referred to as robust methods (Little, 2002), and they can be further divided into two categories: outlier replacement and outlier downweighting. With modern registration frameworks it has also become possible to correct for intra-volume effects (Andersson et al., 2017; Christiaens et al., 2021). These image registration steps mean in practice that outlier slices classified in the original, non-corrected DWI space are no longer aligned with the motion-corrected DWI and cannot be directly used to downweight outlier information. This is one reason why replacement approaches are more straightforward to use in dMRI. However, there is nothing to prevent applying the motion correction transforms to the outlier classifications and using them in the weighted modelling approach (Sairanen et al., 2018b). This would provide statistical flexibility to investigate the precision of dMRI metrics and avoid biases arising from replacement approaches (Little, 2002).
The replacement approach has been widely popular in the dMRI community due to its implementation in the FSL EDDY tool, which is used in many preprocessing pipelines to correct for subject motion and geometrical image distortions (Andersson and Sotiropoulos, 2016; Andersson et al., 2016, 2017). EDDY predicts the replacement signal with a Gaussian Process (GP). Recently, spherical harmonic (SH) based replacement methods have been proposed (Özarslan et al., 2013; Fick et al., 2016; Christiaens et al., 2019). It is not known which replacement approach, GP or SH, is better. Likely this will depend heavily on the implementation of such methods and can vary between different acquisition methods due to varying degrees of freedom. Regardless of the approach, using replacement is better than doing nothing about the outliers.
If the model errors or confidence intervals (Efron, 1982; Little, 2002) are of interest, the counterpart of replacement, weighted modelling, could provide useful information. In this case, the outliers can be traced throughout the motion correction steps to obtain voxelwise uncertainty scores (e.g. z-scores) (Sairanen et al., 2018b). These scores can subsequently be used in model fitting as well as in model uncertainty evaluation, e.g. using robust bootstrap algorithms (Sairanen and Tax, 2021).
All robust approaches have in common that they must identify which measurements are outliers, which means that some sort of data classification is necessary. For replacement, this is a binary selection, with data either kept intact or replaced (Andersson et al., 2016). The same binary approach can be used in a weighted approach, in which case outlier measurements would be completely excluded from the model fit. However, the weighted approach gives additional freedom to downweight measurements based on the severity of the outlier. This could be beneficial in some, likely rare, cases where the model estimation problem becomes mathematically ill-conditioned due to a large number of outliers (Sairanen et al., 2017). The severity of an outlier can be estimated based on how far it is from a model prediction or from other measurements (Sairanen et al., 2018b).
In this work, our aim is to clarify what kind of preprocessing of DWIs should be used when outliers are present in the data. We implemented an SH based replacement method in FSL EDDY (Andersson and Sotiropoulos, 2016; Andersson et al., 2016) to evaluate which prediction method, GP or SH, provides more accurate outlier classification results and tensor model estimates. We compare both approaches to a slice-wise outlier classification tool (SOLID) (Sairanen et al., 2018b). Furthermore, we study the differences in the robust tensor model estimates calculated using replacement and weighted approaches. Lastly, we investigate how model parameter precision relates to the number of outliers in the data and demonstrate a relevant difference between the two robust approaches. Our results could help dMRI researchers select which preprocessing methods to use when investigating subjects that move a lot during the scan (e.g. children) or in situations where accurately estimating the first-level uncertainty would be beneficial.

Methods
In this paper we aimed to (1) implement a spherical harmonic estimator to replace outliers (Section 2.1), (2) compare outlier classification for Gaussian Process and Spherical Harmonic forward models (Section 2.2), and (3) compare the robust approaches (replacement and weighted modelling) for the diffusion tensor model (Section 2.5). We used highly realistic whole-brain simulations to obtain ground-truth test data, described in Section 2.3. To provide a proof of concept, we tested all methods with dMRI data from seven infant subjects with various degrees of motion artefact, described in Section 2.4. The outline of this study is summarised in Fig. 1. Additionally, we investigated how these robust processing pipelines would affect the estimated uncertainty of the model parameters using bootstrap analysis in Section 2.7.

Spherical harmonic estimator
To compare spherical harmonic (SH) based replacement with the previous Gaussian Process (GP) based replacement in the EDDY framework, we first needed to implement an SH estimation algorithm within EDDY. This implementation is incremental to the current EDDY implementation, as it replaces only the Gaussian Process estimator, with all other code identical.
As basis functions for this implementation, we chose to follow the sign conventions used in the MRtrix 3.0 software (Tournier et al., 2019) for the spherical harmonics (Alexander et al., 2002):

$$
Y_\ell^m(\theta,\phi) = \sqrt{\frac{2\ell+1}{4\pi}\,\frac{(\ell-|m|)!}{(\ell+|m|)!}}\;P_\ell^{|m|}(\cos\theta)\times
\begin{cases}
\sqrt{2}\,\sin(|m|\,\phi), & m < 0\\
1, & m = 0\\
\sqrt{2}\,\cos(m\,\phi), & m > 0
\end{cases}
$$

where $\ell$ and $m$ define the degree and order of the basis, $P_\ell^{|m|}$ is the corresponding associated Legendre polynomial, and $(\theta, \phi)$ are spherical coordinates. To perform weighted estimation of the SH coefficients $c$, we propose solving the following minimisation problem:

$$
\hat{c} = \arg\min_{c}\;(S - Bc)^{\mathrm{T}} W (S - Bc) + \lambda\, c^{\mathrm{T}} L c,
$$

where $S$ is an $N \times 1$ vector of the dMRI signals ($N$ being the number of measurements), $B$ is an $N \times R$ design matrix of SH basis functions ($R$ is determined by the selected highest degree $\ell_{\max}$ of basis functions), $W$ is an $N \times N$ weight matrix whose diagonal contains possibly different weights for all measurements, $\lambda$ is an arbitrary tuning factor, and $L$ is an $R \times R$ regularisation matrix whose diagonal contains factors similar to the Laplace-Beltrami factors in Descoteaux et al. (2007). By setting all weights to one ($W = I$) this problem reduces back to a normal Ridge regression. The regularisation term ensures that higher order coefficients remain reasonably small and therefore limits the noise propagation into the model. An estimate of the SH coefficients $c$ is given by

$$
\hat{c} = \left(B^{\mathrm{T}} W B + \lambda L\right)^{-1} B^{\mathrm{T}} W S.
$$

To estimate the regularisation tuning factor $\lambda$, we chose to use Leave-One-Out Cross-Validation (LOOCV) (Efron, 1982). In our implementation, a subset of all brain voxels (1000 by default) is randomly selected and then Brent's method (Press et al., 2007) is used to find the optimal $\lambda$ value.
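As a minimal numerical sketch of this weighted, regularised estimation (assuming the SH design matrix B has already been evaluated at the gradient directions; the function names and the diag(ℓ²(ℓ+1)²) Laplace-Beltrami-style penalty follow Descoteaux et al. (2007) but are our own illustration, not the EDDY source code):

```python
import numpy as np

def laplace_beltrami_penalty(l_max):
    """Diagonal penalty l^2 (l+1)^2 per even-degree SH coefficient
    (Descoteaux et al., 2007 style)."""
    degrees = [l for l in range(0, l_max + 1, 2) for _ in range(2 * l + 1)]
    return np.diag([float(l ** 2 * (l + 1) ** 2) for l in degrees])

def weighted_sh_fit(S, B, w, lam, L):
    """Weighted ridge estimate c = (B^T W B + lam L)^(-1) B^T W S,
    with W = diag(w); unit weights recover ordinary ridge regression."""
    BtW = B.T * w  # equivalent to B.T @ diag(w)
    return np.linalg.solve(BtW @ B + lam * L, BtW @ S)
```

With lam = 0 and unit weights this reduces to ordinary least squares, which is a convenient sanity check; setting a measurement's weight to zero removes its influence entirely.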

Outlier classification
We evaluated four classification methods with the simulated data: (1) EDDY GP, (2) EDDY SH, (3) SOLID Mean, and (4) SOLID Variance. The EDDY methods used different signal prediction approaches (Gaussian Process or Spherical Harmonics), which affects both the image registration and the outlier replacements; otherwise, they followed the exact same algorithm. The working principles were the same for both SOLID methods, with the only difference being the outlier classification based on signal means or variances. Unlike the EDDY methods, which search for outliers iteratively throughout the registration framework, the SOLID methods classify outliers before the motion correction.

[Figure caption: Subject motion during a dMRI acquisition can cause slice-wise dropout artefacts. These artefacts can be remedied by using outlier replacement with different predictions, e.g. from Gaussian Processes (GP) or from Spherical Harmonics (SH). Alternatively, an outlier position can be traced throughout the motion correction registrations and used to downweight such unreliable information during model estimation. We set out to investigate these different approaches using whole-brain simulations with the aim of finding out which approach would be the most beneficial in practice.]
We quantified classification performance with the Area-Under-Curve (AUC) values of the Receiver-Operating-Characteristic (ROC) curves (Hanley and McNeil, 1982) and Precision-Recall (PR) curves (Davis and Goadrich, 2006). To do this, we calculated false and true positive and negative findings by thresholding classification scores between 0 and 10 with a 0.1 step size. The classification score for the EDDY methods was the number of standard deviations that the measurement was off from the prediction (Andersson et al., 2016), whereas for the SOLID methods it was the modified Z-score based on signal averages or variances (Sairanen et al., 2018b).
An ROC curve details how a classifier trades specificity (true negative rate), true negatives / (true negatives + false positives), for sensitivity (true positive rate), true positives / (true positives + false negatives); conversely, a PR curve details how a classifier trades sensitivity for precision (positive predictive value), true positives / (true positives + false positives). For both curves, an optimal classifier would produce an AUC value of 1.0. For the ROC curve, a completely random classifier would produce an AUC value of 0.5 (for the PR curve, the random baseline equals the proportion of outlier samples).
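A minimal sketch of this evaluation, assuming slice-wise scores and ground-truth outlier labels are available as arrays (the threshold grid of 0 to 10 in 0.1 steps follows the text; the trapezoidal AUC integration and the function names are our own illustration):

```python
import numpy as np

def _trapezoid(y, x):
    # trapezoidal integration of y over x
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def roc_pr_auc(scores, is_outlier, thresholds=None):
    """Sweep a score threshold and return (ROC AUC, PR AUC)."""
    if thresholds is None:
        thresholds = np.arange(0.0, 10.1, 0.1)
    tpr, fpr, prec = [], [], []
    for t in thresholds:
        pred = scores >= t                      # classify as outlier
        tp = int(np.sum(pred & is_outlier))
        fp = int(np.sum(pred & ~is_outlier))
        fn = int(np.sum(~pred & is_outlier))
        tn = int(np.sum(~pred & ~is_outlier))
        tpr.append(tp / max(tp + fn, 1))        # sensitivity
        fpr.append(fp / max(fp + tn, 1))        # 1 - specificity
        prec.append(tp / max(tp + fp, 1))       # precision
    tpr, fpr, prec = map(np.asarray, (tpr, fpr, prec))
    order = np.lexsort((tpr, fpr))              # sort ROC points by FPR
    roc_auc = _trapezoid(tpr[order], fpr[order])
    order = np.lexsort((prec, tpr))             # sort PR points by recall
    pr_auc = _trapezoid(prec[order], tpr[order])
    return roc_auc, pr_auc
```

Note that on such a coarse threshold grid the PR AUC is only approximate near the low-recall end, which matches the text's observation that this region has very limited effect on the results.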

Simulation data
We used the same highly realistic whole-brain simulations that were used in the original EDDY outlier paper (Andersson et al., 2016). These simulations were generated with the POSSUM MRI simulator (Drobnjak et al., 2006, 2010; Graham et al., 2016), which generates dMRI signals in k-space by solving the Bloch and Maxwell equations with known head motions, eddy currents, and ground-truth signal dropouts. These simulations were used to assess the effect of the frequency of outliers in the data, between 3% and 18% with 3% increments, as well as the amount of head motion with small or large motion parameters. The latter simulations were also used to evaluate the effect of single-band (SB) and multi-band (MB) imaging. The small and large head motion cases refer to the ''good'' and ''bad'' cases in Andersson et al. (2016). All simulations were repeated with signal-to-noise ratios of 20 and 40. For more details, please see the previous publication (Andersson et al., 2016) and the POSSUM papers (Drobnjak et al., 2006, 2010; Graham et al., 2016). These simulations gave us ground-truth images to evaluate our methods with, as well as a possibility to accurately compare our results with the previous publication (Andersson et al., 2016). We used an SH degree (ℓ_max) of eight to represent these simulations in the augmented EDDY framework. Simulations were used to control for the degree of motion and the outlier frequency as independent factors, whereas in empirical data outliers are typically caused by motion and studying them separately would be difficult.
Outlier simulations were realised by selecting 30% of the DWI volumes randomly. In these volumes, 10% of slices were randomly marked as outliers. This amount of outliers was used for the SB and MB motion simulations, whereas for the outlier frequency simulations both these probabilities were multiplied by √n (where n was 1, 2, ..., 6) to generate data sets with 3, 6, 9, 12, 15, and 18% of outliers. The degree of an outlier (the amount of signal drop) was randomly selected in the range from 0.1 to 0.9, with 0 being full signal dropout and 1 being a signal with no dropout. The level of movement in these outlier frequency simulations was based on the ''good'' (small head motion) subject simulation. All simulations were based on an acquisition scheme with 12 interspersed b=0 s/mm² volumes, 32 b=700 s/mm² volumes, and 64 b=2000 s/mm² volumes. The gradient directions on the non-zero shells were optimised using the electrostatic repulsion approach on a half-sphere (Jones et al., 1999).
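The outlier injection described above can be sketched as follows (a toy illustration on a synthetic 4D array; the function name and the exact rounding of volume and slice counts are our own assumptions, not the POSSUM code):

```python
import numpy as np

def inject_dropout(dwi, n, rng, vol_frac=0.3, slice_frac=0.1):
    """Randomly mark slices as outliers: both base fractions are scaled
    by sqrt(n), so the expected outlier fraction is 0.03 * n.
    Outlier slices keep a random 0.1-0.9 fraction of their signal.
    Returns the corrupted data and a (Z, V) ground-truth outlier mask."""
    dwi = dwi.copy()
    _, _, Z, V = dwi.shape
    mask = np.zeros((Z, V), dtype=bool)
    n_vols = int(round(vol_frac * np.sqrt(n) * V))
    n_slices = int(round(slice_frac * np.sqrt(n) * Z))
    for v in rng.choice(V, size=n_vols, replace=False):
        for z in rng.choice(Z, size=n_slices, replace=False):
            dwi[:, :, z, v] *= rng.uniform(0.1, 0.9)  # remaining signal
            mask[z, v] = True
    return dwi, mask
```

The returned mask serves as the ground truth against which classifier scores can be evaluated.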

Infant data
We obtained data from an ongoing infant study to evaluate our methods with in vivo measurements. We selected seven subjects that all had different degrees of motion. A 3T Siemens Skyra MRI system (Erlangen, Germany) with a 32-channel head coil was used for the dMRI acquisition. The scans consisted of 13 non-diffusion-weighted images that were interspersed between 60 diffusion-weighted images with a b-value of 750 s/mm² and 74 diffusion-weighted images with a b-value of 1800 s/mm², each with a unique gradient direction. A bipolar gradient scheme was used to minimise geometrical distortions due to eddy currents (Reese et al., 2003). The image resolution was 2 mm isotropic with an 80 × 80 × 44 imaging matrix. The in-plane acceleration factor (SENSE) was 2 and the multi-band acceleration factor was 2. Only anterior-posterior phase-encoded images were acquired, as the reverse phase encoding required manual adjustment during the scan, which was deemed infeasible in the corresponding clinical scan environment. The use of infant data in this work was approved by the relevant Ethics Committee of the Helsinki University Hospital. To represent these measurements in the EDDY framework with SH, we used an SH degree (ℓ_max) of eight.

Diffusion tensor estimates
We used ordinary linear least squares (OLLS) or weighted linear least squares (WLLS) algorithms to obtain fractional anisotropy (FA) and mean diffusivity (MD) from the diffusion tensor model (Basser et al., 1994).
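A minimal, self-contained sketch of such a log-linearised tensor fit (the design-matrix construction and the standard FA/MD formulas follow Basser et al. (1994); the function names and the simple diagonal weighting shown here are our own illustration, not the exact pipeline code):

```python
import numpy as np

def dti_design(bvals, bvecs):
    """Rows of the log-linear DTI design: ln S = ln S0 - b g^T D g."""
    gx, gy, gz = bvecs.T
    return np.column_stack([
        np.ones_like(bvals),
        -bvals * gx**2, -bvals * gy**2, -bvals * gz**2,
        -2 * bvals * gx * gy, -2 * bvals * gx * gz, -2 * bvals * gy * gz])

def fit_tensor(signals, X, weights=None):
    """OLLS (weights=None) or WLLS fit; returns the 3x3 tensor D."""
    y = np.log(signals)
    if weights is None:
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    else:
        XtW = X.T * weights                 # X^T diag(w)
        beta = np.linalg.solve(XtW @ X, XtW @ y)
    return np.array([[beta[1], beta[4], beta[5]],
                     [beta[4], beta[2], beta[6]],
                     [beta[5], beta[6], beta[3]]])

def fa_md(D):
    """Fractional anisotropy and mean diffusivity from tensor eigenvalues."""
    ev = np.linalg.eigvalsh(D)
    md = ev.mean()
    fa = np.sqrt(1.5 * np.sum((ev - md)**2) / np.sum(ev**2))
    return fa, md
```

On noiseless data with enough independent directions, both fits recover the simulated tensor exactly, which makes the simulation comparisons in this paper well-posed.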
OLLS was used to obtain a baseline in cases where the outlier replacement in EDDY was used only for registration purposes, i.e. outliers were left in the data during the model fit. These cases mimic a naïve situation that occurs if outliers are not addressed in modelling while still ensuring that they do not affect the registration steps. We refer to these baselines as GP naive or SH naive, depending on the EDDY method. OLLS was also applied in cases where either EDDY method was used to replace the outliers, which is one way to obtain robust tensor estimates. We refer to these as GP repol or SH repol, depending on the replacement method, with repol referring to replaced outliers.
WLLS was used to obtain robust tensor estimates after either EDDY method was used for the registrations but outliers were left in place. We obtained the voxel-wise weights as follows: First, a slice-wise difference between prediction and measurement was calculated (the predicted image is subtracted from the observed, and the difference image is averaged across all intracerebral voxels). Second, these differences were used to calculate a slice-wise z-score which was saved to a 2D slice-by-volume matrix. Third, the z-scores were used to fill a 4D image with the same dimensions as the corresponding DWI data, i.e. for a given slice and volume in the 4D data all voxels were given the same z-score as that in the corresponding slice and volume location in the 2D matrix. Fourth, the 4D z-score image was transformed using the same transformations that were used to correct the DWI for subject motion and geometrical distortions, to provide voxel-wise z-scores. Fifth, the voxel-wise z-scores were scaled so that unaffected data (scores between −3 and 0) were given a weight of 1, outliers (scores below −4) were given a weight of 0, and scores in between were linearly interpolated. We refer to these as GP weighted or SH weighted, depending on the EDDY method. It should be noted that because the slice-wise difference is based on all intracerebral voxels it is very sensitive, and a z-score of −4, which means the slice is given a weight of zero, might not even be identifiable as an outlier by eye.
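Steps two through five can be sketched as follows (a toy illustration; the function names are ours, and the spatial transformation of the 4D z-score image in step four is omitted for brevity):

```python
import numpy as np

def zscore_to_weight(z):
    """Scale z-scores to weights: z >= -3 -> 1, z <= -4 -> 0,
    linear interpolation in between (dropout produces negative z)."""
    return np.clip(np.asarray(z, dtype=float) + 4.0, 0.0, 1.0)

def slicewise_z_to_4d(z_matrix, shape):
    """Broadcast a 2D (slices x volumes) z-score matrix to a 4D image
    of the given (X, Y, Z, V) shape, step three of the pipeline."""
    X, Y, Z, V = shape
    assert z_matrix.shape == (Z, V)
    return np.broadcast_to(z_matrix[None, None, :, :], shape).copy()
```

The resulting 4D weight image can then be passed voxel-by-voxel to the WLLS tensor fit.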
The simulation data were used to evaluate how close to the ground-truth FA and MD values our methods could get. This evaluation was done by calculating the Pearson's correlation coefficient across all voxels in the brain between the test simulation and the ground-truth value. These tests show how the different processing methods perform when the number of outliers increases and for different amounts of head motion.
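This voxel-wise comparison can be sketched as follows (the function name and mask handling are our own illustration):

```python
import numpy as np

def masked_pearson(map_a, map_b, brain_mask):
    """Pearson correlation between two scalar maps (e.g. FA) over the
    voxels inside a boolean brain mask."""
    x = map_a[brain_mask].astype(float)
    y = map_b[brain_mask].astype(float)
    x = x - x.mean()
    y = y - y.mean()
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))
```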
Infant data with unknown ground-truth FA and MD values were used to assess differences between the methods. The Pearson's correlation coefficients were calculated as a cross-comparison of all approaches using every brain voxel. These tests provide correlation plots that visualise how similar the results are between different approaches. If replacement (an OLLS fit with outliers replaced) and weighted modelling (a WLLS fit with outliers left in place) perform very similarly, it lends credence to both robust approaches.

Registration errors
The success of outlier classification can affect the iterative EDDY process. If the classification fails, or the model prediction is not accurate, the image registration algorithm used in EDDY might not be able to accurately register the artefactual volume. We used the simulations to compare the registration errors between the GP repol and SH repol cases. We also compared the differences in eddy current corrections using the infant data.

Robust wild residual bootstrap
One approach to evaluate the confidence of a diffusion tensor model fit is to use wild residual bootstrapping (Whitcher et al., 2005; Liu, 1988). However, if the original data sample contains outliers, the residuals need to be adjusted to accommodate that. We implemented an approach similar to the preliminary work by Sairanen et al. (2018a), in which residuals corresponding to the outlier measurements are downweighted by their weight $W$. The bootstrap samples $y^*$ were generated as follows:

1. The tensor model was fitted to the original data $y$, resulting in parameter estimates $\hat{\beta} = (X^{\mathrm{T}} W X)^{-1} X^{\mathrm{T}} W y$, where $X$ is the design matrix and $W$ is a diagonal matrix of measurement-wise weights. In addition, a vector of residuals $\varepsilon$ was calculated by $\varepsilon = y - X\hat{\beta}$.
2. Bootstrap samples were generated from the original parameter estimates as $y^* = X\hat{\beta} + D W \varepsilon$, where $D$ is a diagonal matrix in which each element is randomly assigned the value 1 or −1.
3. For each sample $y^*$ another set of parameter estimates $\hat{\beta}^*$ was calculated. From each of these sets of parameters (tensor elements) we calculated FA and MD. This yields a set (we performed 2000 bootstraps) of FA and MD estimates for each voxel. By calculating the standard deviation across those 2000 values we obtained an estimate of the uncertainty of these two scalar parameters for each voxel.
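These three steps can be sketched as follows (a minimal illustration; we assume the residuals are multiplied by the weights exactly once, which should be checked against Sairanen et al. (2018a), and the function name is ours):

```python
import numpy as np

def robust_wild_bootstrap(y, X, w, n_boot=2000, rng=None):
    """Wild residual bootstrap with outlier-downweighted residuals.
    Returns an (n_boot x n_params) array of bootstrap estimates."""
    rng = np.random.default_rng() if rng is None else rng
    XtW = X.T * w                                  # X^T diag(w)
    A = XtW @ X
    beta = np.linalg.solve(A, XtW @ y)             # step 1: WLLS fit
    resid = y - X @ beta                           # step 1: residuals
    betas = np.empty((n_boot, X.shape[1]))
    for i in range(n_boot):
        signs = rng.choice([-1.0, 1.0], size=len(y))   # step 2: diagonal of D
        y_star = X @ beta + signs * (w * resid)        # downweighted sample
        betas[i] = np.linalg.solve(A, XtW @ y_star)    # step 3: refit
    return betas
```

The voxel-wise uncertainty is then, e.g., the standard deviation of FA computed from each row of the returned estimates; note how a fully downweighted outlier contributes nothing to the resampled data.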
If the outliers are replaced (i.e. repol is used), all measurements have the same weight of one and the problem reduces to a normal wild residual bootstrap. To demonstrate the differences between the robust modelling methods, we calculated bootstrap samples of a subsample from the outlier frequency simulations. The subsample consisted of voxels that were in the highest 5th percentile of ground-truth FA values. We fitted the single tensor model and calculated the voxel-wise standard deviations of the ensuing FA and MD maps across bootstrap samples. We averaged these standard deviations across all the voxels under consideration to explore whether there was a difference in estimated uncertainty between outlier replacement and downweighting. This was repeated for all ten simulation cases for different outlier frequencies. Additionally, we did the same analysis for a brain simulation that had no outliers (i.e. 0% frequency).

Outlier classification
We evaluated four different methods to classify axial DWI slices into normal and outlier data: the first was FSL's EDDY with the Gaussian Process predictor (Andersson et al., 2016), the second was FSL's EDDY with the Spherical Harmonic predictor, and the third and fourth were SOLID (Sairanen et al., 2018b) with the signal intensity average and variance used as the test score, respectively. All of these approaches provide a similar output that essentially indicates how far off from unaffected data any given DWI slice is. In the case of EDDY, the difference is measured in standard deviations from the model prediction. In the case of SOLID, the difference is measured in median absolute deviations from the median of the measured slice-wise metric, the metric being the mean or the variance.
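A simplified sketch of such a SOLID-style slice-wise score (single-shell case; the 0.6745 modified Z-score constant follows the usual Iglewicz-Hoaglin convention, and the function name and single-shell simplification are our own assumptions about the full multi-shell implementation):

```python
import numpy as np

def solid_scores(dwi, metric="mean"):
    """Slice-wise mean or variance per (slice, volume), then a modified
    Z-score across volumes for each slice position; dropout slices
    produce strongly negative scores."""
    X, Y, Z, V = dwi.shape
    flat = dwi.reshape(X * Y, Z, V)
    s = flat.mean(axis=0) if metric == "mean" else flat.var(axis=0)
    med = np.median(s, axis=1, keepdims=True)             # per slice position
    mad = np.median(np.abs(s - med), axis=1, keepdims=True)
    return 0.6745 * (s - med) / np.maximum(mad, 1e-12)    # (Z, V) scores
```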
We chose to use a score of 10 as the arbitrary upper limit, as it would be extremely unlikely that any non-outlier slice could have that high a deviation score, i.e. the number of false positive findings could not increase beyond that. While outliers will not follow a normal distribution and could have higher scores, increasing the threshold would affect only marginally the top left part of the PR curves, with no effect at all on the ROC curves. Since the area under that part of the PR curve would be very small, this decision has very limited effect on our AUC results.

[Figure caption: Based on these results, EDDY GP had better performance in nearly all simulations with both SNRs. The average registration error remained relatively stable throughout the increasing outlier percentage for all methods when the outlier frequency was below 10%. The bottom figure shows average registration errors for the small and large head motion simulations in the SNR 20 simulations. The error plot indicates the mean and the standard deviation. In these cases, the GP approaches performed better than the SH approaches, having both a smaller average deviation and less variability.]
Figs. 2 and 3 summarise our findings as AUC values of the ROC and PR curves of all the classifier methods. Fig. 2 depicts how the classifiers were able to handle situations with different outlier frequencies (i.e. how large a percentage of the data was affected) when the subject remained still during the scan. Fig. 3, on the other hand, depicts how the classifiers were able to handle situations where subjects had moved during the scan, causing small and large motion artefacts. Additionally, small and large motion artefacts were also simulated using a multi-band acquisition approach, in which case multiple slices were affected by the same artefact at the same time.
In the case of incremental frequency of outliers (Fig. 2), we did not observe any large differences between the EDDY predictor approaches. Both GP and SH produced similar results, retaining very good classification results up to 12% of outliers, after which both approaches started to struggle. By contrast, the simpler slice-wise metric (SOLID mean and variance) approaches remained relatively stable throughout the range of simulated outlier frequencies. Interestingly, with a small number of outliers, SOLID, especially the variance based detection, produced a relatively large number of false positive findings, leading to decreased PR AUC values. It should be noted that this simulation of incremental outlier percentage might not reflect a real-life situation very well: usually larger numbers of signal dropout artefacts are correlated with notable head motion. In these simulations, head motion was omitted to study the signal dropout in isolation and observe the algorithm behaviour under such hypothetical circumstances.
The cases of small and large head motion with outliers (Fig. 3) demonstrated again that both EDDY prediction approaches produced similar results. The simple slice-wise metrics (SOLID mean and variance) produced higher AUC values of the ROC curves and lower AUC values of the PR curves in the non-multi-band simulations. In the multi-band simulations, EDDY based classification became nearly perfect, producing truly optimal ROC AUC values and nearly optimal PR AUC values as well. Since most modern acquisition approaches rely on multi-band, it is relatively safe to say that EDDY based outlier classification is preferable in such cases over the evaluated counterpart. Of course, the simple slice-wise metric could be adjusted to benefit from the same a priori knowledge of multi-band information as well, if deemed necessary for some specific application (e.g. data acquired using a gradient scheme that is not compatible with the modelling approaches in EDDY).

Registration errors
We investigated the differences in eddy current correction between the methods by calculating an average registration error from all brain voxels in the warp images. These results are shown in Fig. 4. Since the amount of motion was a known parameter in these simulations, these results indicate which method was better and produced smaller registration errors in general. In practice, we calculated the voxel-wise RMSE between the ground-truth warp images used in the simulations and the resulting warp images from both methods. In almost every case, the original EDDY GP method produced smaller errors.
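One plausible reading of this error measure can be sketched as follows (we assume displacement fields stored as (X, Y, Z, 3) arrays and take the per-voxel Euclidean error averaged within a brain mask; the function name and this exact definition are our own illustration, not the pipeline code):

```python
import numpy as np

def mean_warp_error(warp_est, warp_true, brain_mask):
    """Average per-voxel Euclidean error between two displacement
    fields of shape (X, Y, Z, 3), restricted to brain-mask voxels."""
    err = np.sqrt(np.sum((warp_est - warp_true) ** 2, axis=-1))
    return float(err[brain_mask].mean())
```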
Similarly, we evaluated the differences in infant image registration between the EDDY GP and SH methods by calculating the voxel-wise RMSE. This analysis only tells whether there is a difference between the methods and should not be used to indicate which method is better, because, as in any real-life situation, there is no ground truth available for the infant data. The average registration differences from the least severe to the most severe cases were 23.8, 24.0, 24.0, 48.6, 243.9, 298.0, and 604.3 μm, indicating that in the less severe cases both methods produced similar results and could be viable approaches.

Diffusion tensor estimates
We fitted the single tensor model to all simulation cases using the six methods described in the Methods section: OLLS fitting with outliers retained in the data was used to provide a baseline of what happens if the outliers are left in place (GP naive and SH naive). OLLS was also used with outliers replaced to obtain tensor fits with outlier replacement (GP repol and SH repol), whereas WLLS was used with outliers retained in the data to obtain robust tensor fits without outlier replacement (GP weighted and SH weighted). Subsequently, we calculated FA and MD values from these tensors and calculated the voxel-wise Pearson's correlation coefficients with the known ground-truth values.
Fig. 5 details how capable the six compared approaches were at producing FA and MD values for incremental outlier percentages in the data. With the GP naive and SH naive approaches, where the outliers were left in place, the accuracy (seen as the general trend of the graph) of both FA and MD decreased almost linearly and the precision started to decrease (seen as the area around the slope increasing). With the replacement approaches (GP repol and SH repol), the accuracy and the precision remained good and stable up to 12% of outliers in the data. After that, both methods started to lose accuracy (declining trend of the graph) as well as precision (widening area). The weighted modelling approaches (GP weighted and SH weighted) behaved nearly identically to the replacement approaches. A difference in SNR caused basically a shift in correlation coefficients, with higher SNR producing better correlations in all cases.
Fig. 6 details how the six compared approaches produced FA and MD values in the case of small and large motion artefacts with single-band and multi-band. The GP naive and SH naive approaches again produced the worst results, having the largest deviation from the ground truth and the highest variation. Interestingly, here we observed a difference between the replacement approaches (GP repol and SH repol): the SH based replacement performed slightly worse than GP while still producing more consistent results (less variation) than either of the naïve approaches. This was likely a consequence of the difference in registration accuracy. A similar finding was made for the weighted modelling approaches (GP weighted and SH weighted); again, the GP based approach performed slightly better.
Based on the evidence that the GP based methods performed better, we conducted a more detailed comparison between GP based replacement and weighted modelling (GP repol vs GP weighted). Supplementary Figs. 1 and 2 detail the differences between the approaches for the incremental outlier percentage. Supplementary Figs. 3 and 4 detail the differences between the approaches for small and large head motions with single-band and multi-band. Results were plotted so that each simulation was represented by one dot (the average of the correlation coefficients), i.e. ten dots per SNR. GP repol results were shown on one axis and GP weighted on the other; therefore, dots above the diagonal line from 0 to 1 would indicate the weighted method being better, and vice versa. These figures indicate that the differences between these two approaches were minimal, were located almost symmetrically around the diagonal line, and could be seen only in the third decimal of the FA or MD correlation coefficients. This means that regardless of the robust approach, the least-squares tensor estimate remained the same, which was a very encouraging finding.

Infant data analysis
We applied both the EDDY GP and SH pipelines to the infant data and fitted single tensor models with the same logic as for the simulations: OLLS with outliers in place was used as a naïve baseline of what happens if the outliers are left in place, OLLS with outliers replaced was used to evaluate the differences between the replacement methods, and WLLS with outliers left in place was used to evaluate the performance of weighted tensor model estimates. Additionally, we applied SOLID mean and variance pipelines in which the motion corrections were performed using EDDY GP, outliers were left in place, and WLLS with weights derived from the SOLID mean or variance metrics was used in the estimation.
We did a cross-comparison between these methods by calculating FA and MD values from the whole brain. We used scatter plots and Pearson's correlation coefficients to visualise and assess the similarities of the methods. An example from one of the infants is shown in Fig. 7. We chose to omit SOLID variance from this visualisation as it produced results similar to SOLID mean.
While these results cannot tell which method is objectively the best, as there is no known ground truth, they do reveal whether there are large differences between the methods. With the GP repol and weighted approaches providing the best results in the simulation cases, it is not unreasonable to assume that this holds here as well. Moreover, it is reasonable to assume that the naïve OLLS approach cannot produce very accurate results; therefore, methods that deviate from OLLS and are more similar to the GP repol and weighted approaches are likely the better performers.
These results provide additional support for the simulation cases. Both replacement and weighted modelling led to very similar tensor estimates. For example, the comparison in Fig. 7 between GP repol and GP weighted had a Pearson's correlation coefficient of 1.0, indicating very high correlation. Similar results were obtained for SH repol and SH weighted. Interestingly, the GP and SH approaches did not correlate as highly with each other; however, in those comparisons the correlations were still good, with Pearson's correlation coefficients of approximately 0.992 to 0.995. Combining SOLID with the eddy motion parameters also produced high correlation coefficients (0.998 for both MD and FA) with the GP repol and weighted methods, and slightly lower correlations with the other methods.

Robust wild residual bootstrap
In this section, we briefly demonstrate the differences in the estimated uncertainty when outliers are either replaced or downweighted. We generated 2000 bootstrap samples for each analysed voxel, fitted the tensor model, and calculated FA and MD along with the corresponding standard deviations. Finally, to compare GP repol, GP weighted, SH repol, and SH weighted, we calculated the mean across these voxelwise estimated standard deviations. These are shown in Fig. 8. For both metrics (FA and MD) and both predictive models (GP and SH), the estimated standard deviation is lower for replacement than for weighting. This is in line with expectations and demonstrates that replacing outliers can cause the precision of the FA and MD estimates to be overestimated.

Discussion
We aimed to investigate different robust modelling approaches (Little, 2002) used in diffusion-weighted MRI to correct for subject motion related outliers (Andersson et al., 2016; Sairanen et al., 2018b). Gaussian Process (GP) based replacement (Andersson et al., 2016) provided the most accurate outlier classifications and single tensor (Basser et al., 1994) fits. Importantly, the same processing pipeline without outlier replacement, using weighted modelling (WLLS) instead, produced very similar results to the replacement approach. This confirmed that tensor estimates can be robustly obtained using either of these approaches.
Pragmatically, an important advantage of outlier replacement over downweighting is that many processing methods later in dMRI pipelines (e.g. model estimation) can be used as they are, so replacement readily slots into existing pipelines. In contrast, weighted modelling approaches require changes in the software used for modelling or other downstream processing. Specifically, the probability of any voxel being an outlier needs to be passed from the software used to detect the outliers to the downstream software. Therefore, switching from replacement to weighted modelling might not be straightforward in all cases. However, our results demonstrate that weighted modelling can produce more accurate estimates of model precision. The significance of this will depend on the application and the ultimate analysis; for example, it would impact probabilistic tractography, but not deterministic tractography.
While the Spherical Harmonic (Alexander et al., 2002; Descoteaux et al., 2007) based replacement method that we implemented produced relatively good outlier classification results, it had higher registration errors and slightly less accurate tensor model fits. The slightly poorer tensor model fits are likely a direct consequence of the poorer registration. We think that the explanation for the poorer registration is that any movement-induced variance within the column space of the linear spherical harmonic design matrix was explained perfectly by the model, and was thus invisible to the correction method. The expected correlation between the movement parameters and the design matrix is zero, but for any one realisation (data set) there is a risk that the correlation is non-zero, and hence that a proportion of the movement will not be corrected. We attempted to overcome this problem by regularising the SH model fit to the data, using a method equivalent to that used to estimate the hyper-parameters of the GP to estimate the regularisation weight. It is conceivable that a higher weight or a different form of the regulariser would improve the SH based registration, but investigating that is outside the scope of the present paper.
However, the Gaussian Process (GP) approach (Andersson and Sotiropoulos, 2016; Andersson et al., 2016, 2017) is already an effective method. If data is acquired using multi-band or a similar technique (Barth et al., 2016), the GP method can provide perfect outlier classification results, as shown in our ROC and precision-recall curve AUC evaluations (Fig. 3). This accuracy and precision in outlier classification is also beneficial for the proposed weighted model estimator, as the difference between the outlier data and the predicted data can sensibly be said to reflect the severity of the signal dropout artefact. Subsequently, this severity, i.e. how many standard deviations a measurement is from the prediction, can be scaled into weights between 0 (full outlier) and 1 (good data) to be used in modelling, similar to Sairanen et al. (2018b, 2022).
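The scaling from z-scores to weights could, for instance, take the following form. The thresholds and the cosine taper here are our illustrative choices; they are not the exact piecewise function used by EDDY or SOLID.

```python
import numpy as np

def zscore_to_weight(z, z_good=2.0, z_bad=4.0):
    """Illustrative piecewise mapping from an outlier z-score to a model
    weight in [0, 1]. The thresholds z_good and z_bad are hypothetical.
    |z| <= z_good -> weight 1 (treated as good data)
    |z| >= z_bad  -> weight 0 (treated as full outlier)
    in between    -> smooth cosine taper."""
    z = np.abs(np.asarray(z, dtype=float))
    t = np.clip((z - z_good) / (z_bad - z_good), 0.0, 1.0)
    return 0.5 * (1.0 + np.cos(np.pi * t))

w = zscore_to_weight(np.array([0.0, 2.0, 3.0, 4.0, 6.0]))
```

Any monotonically decreasing map with these two plateaus would serve the same purpose; the smooth transition avoids a hard inlier/outlier cut-off.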
In our WLLS tests, we used a piecewise continuous function, akin to a sigmoid, to translate the EDDY outlier z-scores into model weights. We have not done any detailed work to ensure that this function is optimal, and this could be the subject of future work. Besides the need to select such a mapping function, another weakness of weighted modelling is its computational demand. The OLLS estimates for all brain voxels can be computed as one large matrix multiplication. For WLLS this is not possible, since the weighted design matrix depends on the weights, which can be unique for each voxel.
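The computational difference can be illustrated with a toy sketch (random data, hypothetical dimensions): OLLS reuses one pseudoinverse across all voxels, whereas WLLS needs a per-voxel solve because the weights enter the normal equations.

```python
import numpy as np

rng = np.random.default_rng(2)
n_meas, n_params, n_voxels = 32, 7, 500

X = rng.normal(size=(n_meas, n_params))             # shared design matrix
Y = rng.normal(size=(n_meas, n_voxels))             # log-signals, one column per voxel
W = rng.uniform(0.1, 1.0, size=(n_meas, n_voxels))  # per-voxel outlier weights

# OLLS: one precomputed pseudoinverse serves every voxel in a single product.
beta_olls = np.linalg.pinv(X) @ Y                   # shape (n_params, n_voxels)

# WLLS: the normal equations (X^T diag(w) X) beta = X^T diag(w) y depend on
# each voxel's weights, so a separate solve is needed per voxel. Scaling the
# rows of X and y by sqrt(w) gives the equivalent ordinary problem.
beta_wlls = np.empty((n_params, n_voxels))
for v in range(n_voxels):
    sw = np.sqrt(W[:, v])
    beta_wlls[:, v], *_ = np.linalg.lstsq(sw[:, None] * X, sw * Y[:, v], rcond=None)
```

With unit weights the two estimators coincide, which is a handy sanity check when wiring weighted fitting into an existing pipeline.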
We also provided a proof of concept that our method is applicable to real-life infant data. These tests agreed well with our simulation results, indicating that the GP repol and weighted methods had highly correlated tensor estimates; the same could be said about the SH methods. However, the GP and SH methods produced slightly different results. Given that our simulations indicate that the GP method performs better, this could mean that the SH method fails to represent the infant dMRI signal equally well. The infant tests also demonstrated that if there are too many outliers in the data, it is unlikely that any algorithm can fix it, and a reacquisition is needed or the whole subject should be excluded from further analyses. This was seen in the registration errors diverging (Section 3.2) between the robust pipelines, with neither producing visibly reliable results. This finding agrees well with our simulation results and prior studies (Andersson et al., 2016) showing that more than 10% of outliers can be problematic for data processing pipelines.
We focused our tests on the single tensor model as it is still the most popular and widely used dMRI model in clinical research. In the future, these kinds of evaluations could be performed with more advanced microstructural models such as NODDI (Zhang et al., 2012), and effects on tractography might be of interest as well (Sairanen et al., 2022). However, it seems unlikely that our findings would be contradicted by such studies, as these higher order models tend to require more data than single tensor estimates. Therefore, if a method fails to capture details necessary for the tensor fit, it likely fails to capture details necessary for more complex models as well.
Finally, the good news: both robust approaches (outlier replacement and outlier downweighting in modelling) produced very similar results, indicating that both are suitable for obtaining robust diffusion tensor estimates. However, when considering the precision of the tensor model parameters obtained using robust wild residual bootstrapping (Liu, 1988; Whitcher et al., 2005; Jones, 2008; Sairanen et al., 2018a; Sairanen and Tax, 2021), we demonstrate (Fig. 8) that there is a clear difference between the approaches. This difference could be important in applications that rely on the precision of the model parameters, such as probabilistic tractography.
With replacement, confidence in the subsequent model fits becomes inflated, i.e. the estimated variance initially decreases as the proportion of outliers increases, as seen in Fig. 8. This is, of course, contrary to the true state of affairs, where the precision should go down as the amount of valid data decreases. To see why the precision instead increases, consider what would happen if all data were replaced by model-driven expectations and the same model were subsequently used in the bootstrapping: one would obtain exactly the same model parameters for each bootstrap sample, and the estimated variance would be zero. In practice this does not happen, as other problems (i.e. poor correction for movement and distortions) appear when the outlier frequency reaches a certain level, leading to the decreased precision seen in Fig. 8 when the outlier frequency exceeds 12%.
In contrast to outlier replacement, the precision estimates obtained with weighted modelling exhibit the expected behaviour of increasing variance (decreasing precision) as the amount of valid data decreases. In our simulations shown in Fig. 8, the precision of the weighted approach decreases approximately linearly until registration problems start appearing at an outlier frequency of 12%-15%; after that, there is an inflexion point and the precision starts decreasing faster. Theoretically, it is reasonable to expect that this difference between replacing and downweighting outliers is common to all models used in dMRI. However, it remains to be investigated what practical implications, if any, this has for the various modelling methods used in dMRI.

Conclusion
Diffusion-weighted MRI data is likely affected by motion artefacts and outliers to some degree. Therefore, motion-robust processing methods that account for both image registration needs and outliers in the data are necessary to obtain reliable model estimates. Based on our results, the Gaussian Process (GP) based replacement was the most accurate approach for classifying the outliers and therefore produced slightly better performance in the EDDY framework than the SH approach. More importantly, we showed that replacement followed by a regular model estimation algorithm and a weighted model estimation algorithm produced very similar results. This means that results obtained by these methods are comparable with each other, lending credence to both robust approaches.

Fig. 1 .
Fig. 1. The outline of the experiments used in this study. Subject motion during a dMRI acquisition can cause slice-wise dropout artefacts. These artefacts can be remedied by using outlier replacement with different predictions, e.g. from Gaussian Processes (GP) or from Spherical Harmonics (SH). Alternatively, an outlier's position can be traced throughout the motion correction registrations and used to downweight such unreliable information during model estimation. We set out to investigate these different approaches using whole-brain simulations with the aim of finding out which approach would be the most beneficial to use in practice.

Fig. 2 .
Fig. 2. Outlier classification in the case of an incremental percentage (Frequency) of outliers in the data (x-axis). Results are summarised with Area Under Curve (AUC) values (y-axis) for the four compared methods shown in the corresponding columns from left to right: EDDY with Gaussian Process (EDDY GP) predictor, EDDY with Spherical Harmonic (EDDY SH) predictor, and the simple slicewise metrics Mean and Variance. The first row depicts the Receiver-Operating Characteristic (ROC) AUC values and the second row the Precision-Recall AUC values. Classification was performed for both Signal-to-Noise Ratios (SNRs) separately. Line plots show the average AUC values and the similarly coloured areas show the corresponding standard deviations.

Fig. 3 .
Fig. 3. Outlier classification in the case of Small and Large motion artefacts with single-band (SB) and multi-band (MB). The compared methods are shown on the x-axis from left to right: EDDY with Gaussian Process (EDDY GP) predictor, EDDY with Spherical Harmonic (EDDY SH) predictor, and the simple slicewise metrics Mean and Variance. The error plots show the mean and standard deviation of the Area Under Curve (AUC) values. The first row depicts the Receiver-Operating Characteristic (ROC) AUC values and the second row the Precision-Recall AUC values. Classification was performed for both Signal-to-Noise Ratios (SNRs) separately.

Fig. 4 .
Fig. 4. The top figure shows the average registration error (mm) in the single-band (SB) simulations with no motion for increasing outlier frequency. Solid and dashed lines represent the SNR 20 and 40 simulations, respectively. The lines indicate the average registration error and the shaded areas the corresponding standard deviation. Based on these results, EDDY GP performed better in nearly all simulations with both SNRs. The average registration error remained relatively stable for all methods when the outlier frequency was below 10%. The bottom figure shows the average registration errors for the small and large head motion simulations at SNR 20. The error plot indicates the mean and standard deviation. In these cases, the GP approaches performed better than the SH approaches, having both a smaller average error and less variability.

Fig. 5 .
Fig. 5. Tensor metric results of the outlier frequency simulations. The first two rows show FA results and the last two rows show MD results. The first column is the baseline (i.e. the GP naive and SH naive approaches), the second column is with outlier replacement (GP repol and SH repol), and the third column is with outlier downweighting (GP weighted or SH weighted). Each subplot shows the increasing outlier frequency on the x-axis and the Pearson's correlation coefficients between the ground-truth and estimated values on the y-axis. Average results are shown using lines and the corresponding standard deviations are visualised using a similarly coloured region around each line.

Fig. 6 .
Fig. 6. Tensor metric results of the subject motion simulations. The first row shows FA results and the second row shows MD results. Columns from left to right indicate the level of motion artefact (small or large) and the simulation type (single-band SB or multi-band MB). Pearson's correlation coefficients between the ground-truth and estimated values are shown on the y-axis for each tested method detailed on the x-axis. The error plot depicts the mean and standard deviation of these results.

Fig. 7 .
Fig. 7. FA and MD results from one infant subject with moderate motion artefacts. The diagonal contains a sagittal FA slice and the title of the method used: GP naive, GP repol, SH repol, GP weighted, SH weighted, or SOLID mean. The subplots are made by calculating the Pearson's correlation coefficient between the two compared methods; for example, on the top row the first correlation plot shows the comparison between GP naive and GP repol. FA (red) results are shown in the upper triangle of subplots and MD (blue) results in the lower triangle. This plot does not tell which method is the best; it is simply a comparison between methods.

Fig. 8 .
Fig. 8. Difference in the model variance between replacing and downweighting outliers. Lines indicate the average and shaded areas the standard deviation calculated over the ten simulated brains. As there was only one 0% frequency simulation, that frequency has a single data point and no variance to show. With the replacement approaches, the estimated precision of the model parameters is inflated, which is seen as a downward trend for both GP repol and SH repol. In contrast, with the downweighting approach, the estimated precision decreases as the outlier frequency increases, which is seen as an upward trend for both GP weighted and SH weighted. The latter is the expected behaviour, as clearly the precision will go down with an increasing frequency of missing data. Regardless of the approach, when the point of too many outliers is reached (approximately 12% here), problems arising from image registration start to take effect and the linear trends disappear.