Reliability on multiband diffusion NODDI models: A test retest study on children and adults

Neurite Orientation Dispersion and Density Imaging (NODDI) and Bingham-NODDI are well-known diffusion MRI models that represent powerful tools for the estimation of brain microstructure. In order to efficiently translate NODDI imaging findings into diagnostic clinical practice, a test-retest approach would be useful to assess the reproducibility and reliability of NODDI biomarkers, thus validating the precision of different fitting toolboxes. In this context, we conducted a test-retest study to assess the effects of different factors (i.e. fitting algorithms, multiband acceleration, shell configuration, age of subject and hemispheric side) on the reliability of diffusion models, assessed in terms of Intra-class Correlation Coefficient (ICC) and Variation Factor (VF). To this purpose, data from pediatric and adult subjects were acquired with the Simultaneous Multi-Slice (SMS) imaging method with two different acceleration factors (AF) and four b-values, subsequently combined into seven shell configurations. Data were then fitted with two different GPU-based algorithms to speed up the analysis. Results show that each factor investigated had a significant effect on the reliability of several diffusion parameters. In particular, both datasets reveal very good ICC values for the higher AF, suggesting that faster acquisitions do not jeopardize reliability and are useful to decrease motion artifacts. Although only very small reliability differences appear when comparing shell configurations, diffusion parameters are more variable in shell configurations with lower b-values, especially for a simple model like NODDI. The fitting tools also have a significant effect on reliability, but their difference occurs in both datasets and for both AF, so it appears to be independent of misalignment and motion artifacts as well as of noise and SNR.
The main achievement of the present study is to show that a 10-minute multi-shell diffusion MRI acquisition for NODDI can give reliable results in WM. More complex models do not appear to be more affected by reduced data acquisition or by noisier data, which supports the idea that Bingham-NODDI has greater sensitivity to true subject variability.


Introduction
Neurite Orientation Dispersion and Density Imaging (NODDI) is a diffusion MRI (dMRI) technique that produces a direct estimation of brain microstructure by modeling the tissue as a multi-compartmental system, accounting for free, hindered and restricted water diffusion. Expressing the dMRI signal as a sum of contributions from each of the three compartments modeled, NODDI provides specific metrics (i.e. the Orientation Dispersion Index ODI, and the volume fractions of the intra-cellular, v intra , and cerebro-spinal fluid, v iso , compartments) which are able to quantify the effects of density and orientation distribution of dendrites and axons. Based on NODDI, the Bingham-NODDI model was then introduced ( Tariq et al., 2016 ) in order to characterize the anisotropic orientation dispersion of complex configurations like fanning and bending axons, overcoming the limitations of the commonly used Diffusion Tensor Imaging (DTI).
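As a sketch of the three-compartment sum described above, the normalized NODDI signal is commonly written in the following form. The symbols S intra, S extra and S iso (the signals of the intra-cellular, extra-cellular and isotropic compartments) are introduced here only for illustration and are not defined elsewhere in this text:

$$ S = (1 - v_{iso}) \left[ v_{intra} \, S_{intra} + (1 - v_{intra}) \, S_{extra} \right] + v_{iso} \, S_{iso} $$

In this form, v iso weights the free-water contribution, while v intra splits the remaining tissue signal between restricted (intra-cellular) and hindered (extra-cellular) diffusion.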
Since DTI metrics (e.g. the Fractional Anisotropy (FA) and Mean Diffusivity (MD)) were estimated by approximating a single fiber at each voxel and assuming mono-Gaussian model for water diffusion, they lack specificity in the White Matter (WM) tracts ( Kodiweera et al., 2016 ;Wheeler-Kingshott and Cercignani, 2009 ).
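For reference, the two DTI metrics mentioned above can be computed from the three eigenvalues of the diffusion tensor with their standard definitions. The snippet below is only a minimal sketch: the function names are hypothetical and no particular toolbox is implied.

```python
import math


def mean_diffusivity(l1, l2, l3):
    """Mean Diffusivity (MD): average of the three tensor eigenvalues."""
    return (l1 + l2 + l3) / 3.0


def fractional_anisotropy(l1, l2, l3):
    """Fractional Anisotropy (FA): normalized dispersion of the eigenvalues.

    Returns 0 for isotropic diffusion and 1 when diffusion occurs along
    a single direction only.
    """
    md = mean_diffusivity(l1, l2, l3)
    num = (l1 - md) ** 2 + (l2 - md) ** 2 + (l3 - md) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    if den == 0:
        # Degenerate voxel (all eigenvalues zero): FA is set to 0 by convention here.
        return 0.0
    return math.sqrt(1.5 * num / den)
```

Because both metrics summarize a single tensor per voxel, they cannot separate neurite density from orientation dispersion, which is the lack of specificity referred to above.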
Similarly to DTI, NODDI and Bingham-NODDI methods allow one fiber tract in each voxel and they represent powerful instruments, able to provide highly specific and biologically meaningful measures of tissue microstructure ( Kunz et al., 2014 ). For this reason, these techniques have recently gained increasing interest for applications in the clinical field, as they allow non-invasive investigation of cognitive processes and of the in-vivo cellular architecture of the developing human brain, and make it possible to follow brain maturation ( Lynch et al., 2020 ; Mah et al., 2017 ). In particular, the ability to detect possible alterations in WM fiber organization ( Dean et al., 2017 ) has made the NODDI approach suitable to probe brain injury ( Caverzasi et al., 2016 ; Palacios et al., 2020 ) and to investigate specific neurological disorders, such as Alzheimer's disease ( Fu et al., 2020 ), Multiple Sclerosis ( Mustafi et al., 2019 ; Schneider et al., 2017 ) and epilepsy disorders ( Sone et al., 2018 ).
Despite their success in quantifying complex brain morphology, both NODDI and Bingham-NODDI present some clinical limitations, mainly related to the long acquisition and processing times. A multi-shell acquisition protocol with a large number of uniformly distributed diffusion-weighted gradient directions for each b-value leads to a long scan time, which limits its application in clinical practice, especially in the pediatric population. In this context, Multiband or Simultaneous Multi-Slice (SMS) imaging could represent a valid option for scan time reduction as it is able to simultaneously excite multiple slices ( Setsompop et al., 2012 ), thus providing an acceleration in data acquisition which is proportional to the number of excited slices (i.e. the acceleration factor, AF). Moreover, fitting the data to estimate diffusion parameter maps is computationally expensive, particularly for larger datasets and for highly complex biophysical models. The computational power offered by recent parallel computing architectures like Graphics Processing Units (GPUs) could represent a helpful solution to reduce processing time ( Cheng et al., 2013 ). Among other algorithms for fast model fitting optimization, like the Accelerated Microstructure Imaging via Convex Optimization (AMICO) ( Daducci et al., 2015 ), two GPU-based toolboxes have been recently presented to fit both NODDI and Bingham-NODDI models. The first tool is the Microstructure Diffusion Toolbox (MDT), a Python-based toolbox freely available under an open source LGPL license that is primarily meant for microstructure modeling and analysis of dMRI ( Harms et al., 2017 ), while the second tool is the CUDA Diffusion Modeling Toolbox (cuDIMOT), a GPU-based library for accelerated nonlinear optimization ( Hernandez-Fernandez et al., 2019 ).
Compared to the widely used AMICO toolbox, MDT and cuDIMOT fit NODDI and Bingham-NODDI models much faster, but it is not yet clear whether the significant reduction in fitting time could sacrifice the reliability of the estimated metrics. In order to efficiently translate NODDI imaging findings into diagnostic clinical practice ( Lerma-Usabiaga et al., 2019 ), a test-retest approach would be useful to assess the reproducibility and reliability of NODDI biomarkers, thus validating the precision of the method, especially when considering GPU-based approaches. In this context, several recent studies focused on investigating the reliability and reproducibility of metrics from different dMRI techniques, such as DTI ( Boekel et al., 2017 ; Duan et al., 2015 ), NODDI ( Parvathaneni et al., 2018 ; Tariq et al., 2012 ) and 3-tissue Constrained Spherical Deconvolution (CSD) ( Newman et al., 2019 ). Focusing on NODDI, previous studies demonstrated a significant dependence of metrics on magnetic field strength ( Chung et al., 2016 ) and reported excellent repeatability, especially for ODI and v intra metrics ( Andica et al., 2020 ; Granberg et al., 2017 ; Huber et al., 2019 ). Moreover, although a reproducibility study was recently performed to assess the effect of b-values and gradient directions on metrics ( Parvathaneni et al., 2018 ), the effect of their interaction with other factors (e.g. SMS protocol and fitting tools) still needs to be evaluated.
The purpose of this work is therefore to perform a test-retest reliability assessment of NODDI and Bingham-NODDI models fitted with two different GPU-based algorithms (MDT, cuDIMOT) in terms of Intraclass Correlation Coefficient (ICC) and Variation Factor (VF). Focusing only on NODDI metrics, the reliability results from GPU-based algorithms were also compared to those from AMICO toolbox in order to provide quantifiable comparison between toolboxes with different performances. Particularly, the test-retest reliability was evaluated based on children and adult data for different SMS protocols and for seven different shell configurations. Along with it, a test on the brain asymmetry and its effect on reliability was performed.

Subjects
This research was conducted in accordance with the Declaration of Helsinki. Ethical committee approval was obtained through the Institutional Review Board (IRB) and written informed consent was obtained from every subject or next of kin for the execution of MRI acquisitions. Between September 2017 and September 2020, MRI data from healthy subjects were collected at the Bambino Gesù Children's Hospital (Rome, Italy). According to the recommendations on sample size for test-retest studies ( Bujang and Baharum, 2017 ), assuming two observations per subject, a minimum sample size of 15 subjects is considered to be sufficient to detect ICC values of 0.6 for fixed values of alpha and power (0.05 and 80%, respectively). Consequently, we identified 31 subjects: 15 children (mean age = 8.9 y, range = 4.5-15.6 y, 8 males and 7 females) and 16 adult volunteers (mean age = 29.2 y, range = 22.8-59.2 y, 8 males and 8 females).
Five additional brain volumes with no diffusion weighting ( b = 0 s/mm 2 ) were collected. In adult volunteers only, one further acquisition with a reversed phase encoding direction (posterior to anterior) was also performed to allow the estimation of susceptibility-induced distortion. All diffusion images were acquired with a spatial resolution of 2.0 × 2.0 × 2.0 mm 3 and an acquisition matrix of 128 × 128. Total acquisition time for diffusion imaging was approximately 20 min for AF = 2 and 15 min for AF = 3. Children did not exit the scanner between test and retest, but the two sessions were separated by about 20 min. Conversely, adult volunteers were out of the scanner for about 20 min between sessions.

Data fitting
NODDI and Bingham-NODDI were applied to the pre-processed data from all the shell configurations. Model fitting was performed with two different toolboxes: MDT ( https://github.com/cbclab/MDT , version 0.19.1), a Python-based GPU-accelerated toolbox freely available under an open source LGPL licence ( Harms et al., 2017 ), and cuDIMOT ( https://users.fmrib.ox.ac.uk/~moisesf/cudimot/ ), a GPU-based library for accelerated nonlinear optimization ( Hernandez-Fernandez et al., 2019 ). We used Powell's and Levenberg-Marquardt optimization algorithms for MDT and cuDIMOT, respectively. Additionally, the NODDI model was also fitted using the Python package of the AMICO toolbox ( https://github.com/daducci/AMICO ). The fitting was run on an Oracle cloud server equipped with two NVIDIA® Tesla® P100 GPUs, four Intel® Xeon® Gold 5120 CPUs @ 2.20 GHz and 187 GB RAM. Each toolbox produced three maps (ODI, v intra and v iso ) for the NODDI model and six maps (ODI along the primary and secondary directions, ODI p and ODI s ; total ODI, ODI tot ; Dispersion Anisotropy, DA; v intra and v iso ) for the Bingham-NODDI model. An analysis based on Regions of Interest (ROIs) was then performed to investigate test-retest reliability. To this purpose, individual FA maps were first computed by fitting the Diffusion Tensor Model on a multi-shell configuration and then aligned with the Tract-Based Spatial Statistics (TBSS) FSL pipeline ( Smith et al., 2006 ) to construct the mean WM skeleton. ROIs were automatically extracted from the JHU-ICBM-DTI-81 WM atlas and overlapped with the skeleton (fslmaths) in order to remove any cerebro-spinal fluid (CSF) and gray matter (GM) voxels and to ensure that reproducibility was evaluated on the same projections. Finally, the ROIs were back-projected from Montreal Neurological Institute (MNI) space into each subject space with the inverse TBSS transformations.
During the fitting, each tool approximates the noise standard deviation, whose estimate could be influenced by changes in the brain mask and misalignment in the b0 images. For this reason, we investigated the effect of noise standard deviation estimates on the test-retest variability. To this purpose, we fitted both NODDI and Bingham-NODDI models on data from all shell configurations with the MDT toolbox, comparing the reliability results obtained in the standard setting with those obtained with noise standard deviation estimates previously computed with dedicated approaches, i.e. Moments ( https://github.com/samuelstjean/autodmri ) and MPPCA ( https://github.com/mrtrix3 ) ( St-Jean et al., 2020 ). A MANOVA analysis was performed between data with different noise standard deviation estimates both on ICC and VF (see Tables 20,21 of Supplementary Material).

Test-retest reliability
Diffusion parameter maps were computed for each shell configuration for both NODDI ( Fig. 1 A) and Bingham-NODDI ( Fig. 1 B) models. We evaluated ICC and VF distributions within multi-shell configurations for the two fitting toolboxes. In particular, we compared results from seven different shell combinations (P 21 , P 22 , P 23 , P 31 , P 32 , P 33 and P 4 ) for MDT and cuDIMOT. This approach was used for SMS sequences acquired with different AF both for pediatric subjects (we refer to children acquired with AF = 2/3 as PED_SMS2/PED_SMS3) and adult volunteers (AD_SMS2/AD_SMS3). Since pediatric subjects were acquired with a limited acquisition stack to reduce exam time, we focused their reliability assessment on 36 supra-tentorial WM ROIs. Out of the 36 ROIs, 32 were investigated for test-retest reliability in order to assess the effect of hemispheric side (16 ROIs for each hemispheric side). For adults, the ROI-based analysis was performed within both 32 supra- and 8 infra-tentorial ROIs. Both ICC and VF were obtained on all diffusion scalar metrics (ODI, v intra , v iso for NODDI and ODI s , ODI p , ODI tot , DA, v intra , v iso for Bingham-NODDI) and for each ROI. The VF gives a measure of within-subject variability ( Albi et al., 2018 ; Jovicich et al., 2014 ; Papinutto et al., 2013 ), where lower scores indicate better reproducibility. It is computed as the ratio between the absolute test-retest difference and the test-retest mean. The ICC is a widely used statistical approach ( McGraw and Wong, 1996 ; Shrout and Fleiss, 1979 ) which combines both within- and across-subject variability to assess agreement in test-retest reliability ( Albi et al., 2018 ; Buchanan et al., 2014 ; Duan et al., 2015 ; Hodkinson et al., 2013 ).
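The VF definition above can be illustrated with a short sketch. It assumes that the VF is expressed as a percentage and that per-subject values are averaged to obtain one VF per ROI and metric. Both choices, as well as the function names, are assumptions made for illustration rather than details stated in the text.

```python
def variation_factor(test, retest):
    """Percent absolute test-retest difference relative to the test-retest mean."""
    mean = (test + retest) / 2.0
    if mean == 0:
        raise ValueError("test-retest mean is zero; VF is undefined")
    return 100.0 * abs(test - retest) / abs(mean)


def mean_variation_factor(tests, retests):
    """Average the per-subject VF over all subjects (assumed aggregation)."""
    if len(tests) != len(retests) or not tests:
        raise ValueError("need one test and one retest value per subject")
    vfs = [variation_factor(t, r) for t, r in zip(tests, retests)]
    return sum(vfs) / len(vfs)
```

For example, a ROI-averaged ODI of 0.50 at test and 0.52 at retest gives a VF of about 3.9%.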
In particular, according to the practical guidelines reported in the literature ( Koo and Li, 2016 ), we identified the two-way mixed effects model for single measurement as the best ICC measurement for our study ( McGraw and Wong, 1996 ), ranging from 0 (poor reliability) to 1 (excellent reliability) and defined as follows:

$$ ICC = \frac{MS_R - MS_E}{MS_R + (k - 1) \, MS_E + \frac{k}{n} \, (MS_C - MS_E)} $$

where MS_R is the mean square for rows (i.e. between subjects), MS_C is the mean square for columns (i.e. within subjects), MS_E is the mean square error, k is the number of measurements and n is the number of subjects.
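A minimal sketch of this ICC computation is given below. It assumes the single-measurement form of McGraw and Wong (1996) that uses all the quantities listed above, i.e. ICC = (MS_R - MS_E) / (MS_R + (k - 1) MS_E + (k / n)(MS_C - MS_E)), with the mean squares taken from a two-way ANOVA without replication. The function name and the data layout (one row per subject, one column per session) are hypothetical.

```python
def icc_two_way_single(data):
    """ICC for single measurements from an n-subjects x k-sessions table.

    data: list of n rows (subjects), each holding k measurements (sessions).
    """
    if not data:
        raise ValueError("empty data")
    n = len(data)
    k = len(data[0])
    if n < 2 or k < 2 or any(len(row) != k for row in data):
        raise ValueError("need >= 2 subjects and the same >= 2 sessions per subject")

    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    # Sums of squares of the two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_r = ss_rows / (n - 1)               # between subjects
    ms_c = ss_cols / (k - 1)               # between sessions
    ms_e = ss_error / ((n - 1) * (k - 1))  # residual

    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + (k / n) * (ms_c - ms_e))
```

In a test-retest design k = 2, so each row would hold the test and retest value of one metric in one ROI for one subject. A systematic shift between sessions increases MS_C and therefore lowers the ICC, even when subjects keep the same ranking.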

Statistical analysis
MANOVA analysis was performed on ICC and VF values for each NODDI parameter as extracted from the different ROIs. The model included main effects of acceleration factor (SMS), adult or children group (Subj), shell configuration, software choice (Tools) and hemispheric side. For each factor (Tool, SMS, Configuration, Subj and Side), tables of statistical results reported mean ICC (see Table 1 and Supplementary Material Tables 1,3) and VF (see Table 2 and Supplementary Material Tables 2,4) values and standard deviation, values of F statistic, significance and effect size ( ) for each diffusion parameter.
Table 1. Mean values, standard deviation, F-statistics, p-value and effect size of ICC for NODDI and Bingham-NODDI metrics to assess the effect of different factors (tool, SMS, configuration and side) within 32 supra-tentorial ROIs. N.S. means Not Significant.

In order to assess significant differences between diffusion models (NODDI and Bingham-NODDI), we ran a t-test on the ICC and VF of common metrics (i.e. ODI, ODI tot , v intra and v iso ). Moreover, the effect of different noise standard deviation estimates on reliability was also investigated by performing a MANOVA analysis (see Tables 20,21 of Supplementary Material). The statistics were carried out in the Statistical Package for the Social Sciences (SPSS) software. We also computed the mean and standard deviation of VF and ICC values (see Tables 5-4 of Supplementary Material). For every model parameter, toolbox performances and shell configurations of each group were compared. We analyzed results with Bland-Altman plots, where the parameter difference between acquisitions was plotted against its mean value, averaged over ROIs for each subject. We report outcomes from the two toolboxes, shell configurations and datasets (see Supplementary Material).
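The quantities behind such a Bland-Altman plot can be sketched as follows. The per-subject averaging over ROIs follows the description above. The 1.96 standard deviation limits of agreement are the conventional choice and are an assumption here, as are the function name and the nested-list input layout.

```python
from statistics import mean, stdev


def bland_altman(test, retest):
    """Bland-Altman summary for one metric.

    test, retest: lists indexed as [subject][roi].
    Returns one (mean, difference) point per subject, averaged over ROIs,
    plus the bias and the assumed 1.96-SD limits of agreement.
    """
    diffs, means = [], []
    for t_rois, r_rois in zip(test, retest):
        diffs.append(mean(t - r for t, r in zip(t_rois, r_rois)))
        means.append(mean((t + r) / 2.0 for t, r in zip(t_rois, r_rois)))

    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)  # requires at least two subjects
    return {
        "points": list(zip(means, diffs)),
        "bias": bias,
        "limits": (bias - spread, bias + spread),
    }
```

Each returned point would be drawn with the test-retest mean on the horizontal axis and the test-retest difference on the vertical axis.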
Moreover, statistics on models (NODDI and Bingham-NODDI) revealed no significant differences for ICC values of ODI, v intra and v iso between the 2 NODDI models under investigation, while VF values of NODDI ODI were significantly lower ( p < 10^-7 ) than those of Bingham ODI (ODI tot ).

Dataset results
Due to the lack of infra-tentorial acquisition in children, the comparison between the 2 datasets was performed on the 32 supra-tentorial ROIs. Children exhibited ICC values of roughly 0.9 for both NODDI and Bingham-NODDI models ( Table 1 ) and VF values below 5.5% ( Table 2 ). Higher VF values were observed only for v iso (18% for NODDI v iso and 15% for Bingham-NODDI v iso ). For adult supra-tentorial areas, we found ICC values of around 0.75 for ODI and DA, whereas v intra and v iso were around 0.8 ( Table 1 ). VF in the adult dataset was in the range of 2-7% for all estimated parameters except v iso , which showed over 31% variation ( Table 2 ). The 2 datasets show significant differences in all metrics both for ICC and VF ( Fig. 2 ).

Shell configurations results
A comparison among the seven different shell combinations (P 21 , P 22 , P 23 , P 31 , P 32 , P 33 and P 4 ) was performed. For supra-tentorial ROIs, we observed that shell configuration had a statistically significant effect on the VF of v intra and v iso for both NODDI models and of DA. As reported in Fig. 3 A, the lowest VF values for v intra were obtained in the P 33 configuration, whereas the highest VF values were always obtained in the P 21 configuration ( Table 2 ). We also observed that shell configuration had a statistically significant effect on the ICC of v iso for both NODDI models and of v intra for NODDI. In particular, the highest ICC values for NODDI v intra were obtained in the P 33 configuration, whereas the lowest were obtained in P 21 ( Table 1 and Fig. 3 B), which is in accordance with the VF results. The infra-tentorial results did not show any significant effect of shell configuration on ICC, while slightly significant results were obtained for the VF of DA and Bingham v iso , both showing the worst VF values for the P 22 configuration ( Tables 1 , 2 of Supplementary Material).

Acceleration factor results
We then assessed the effects of AF on reliability results for both adult and pediatric datasets ( Fig. 4 ). Focusing on the supra-tentorial analysis, we found that SMS3 ICC values were significantly higher than those in SMS2 for all parameters ( Table 1 ). Concerning the VF, we observed that SMS2 VF values were significantly lower than those in SMS3 for v iso in both models, v intra of Bingham-NODDI as well as DA ( Table 2 ). Conversely, ODI VF values of both models were significantly higher in SMS2 than in SMS3. The infra-tentorial ICC confirmed the supra-tentorial results, showing that ICC values were significantly higher in SMS3 than in SMS2 for all diffusion parameters, while VF values were significantly different for ODI and v intra of both models, with lower values for SMS3 ( Table 2 of Supplementary Material).

Fig. 3. Boxplots of ICC (left) and VF (right) values within each shell configuration (P 21 , P 22 , P 23 , P 31 , P 32 , P 33 and P 4 ) for NODDI v intra parameter. Asterisks highlight significant differences between shell configurations.

Fitting toolboxes results
Since the fitting was performed with two different GPU-based tools, we compared their reliability results ( Fig. 5 ). Regarding the supra-tentorial data, a significant VF difference resulted for all the metrics except NODDI ODI. In particular, MDT revealed significantly higher VF values than the cuDIMOT toolbox in all metrics, except for v intra of both NODDI models ( Table 2 ). ICC values were significantly higher in v intra and ODI and lower in v iso and DA when comparing MDT to cuDIMOT ( Table 1 ). The infra-tentorial results showed a strongly significant difference in VF values for v intra and v iso of both models, as well as DA, ODI p and ODI tot , where MDT revealed higher VF values than cuDIMOT ( Table 2 of Supplementary Material). The effect of tool on ICC was not significant for any diffusion parameter except Bingham v iso , where MDT produced higher ICC than cuDIMOT ( Table 1 of Supplementary Material). Beyond that, an interesting difference between cuDIMOT and MDT was the fitting runtime ( Table 3 ). On the Oracle Cloud server setup (two NVIDIA® Tesla® P100 GPUs, four Intel® Xeon® Gold 5120 CPUs @ 2.20 GHz, 187 GB RAM), MDT performed fitting 400% faster than cuDIMOT for Bingham-NODDI and up to 600% faster for NODDI. The NODDI model was also fitted with the AMICO toolbox ( Table 4 ). Assessing the effect of tool, we found significantly increased ICC values for AMICO when compared to cuDIMOT in ODI and v intra metrics, while no differences were found in v iso . Compared to MDT, AMICO revealed better ICC values in v iso , while significantly reduced values appeared in v intra ( Fig. 6 ). The VF assessment reported significantly higher values for AMICO when compared to both MDT and cuDIMOT in ODI and v intra metrics ( Fig. 6 ).

Hemispheric side results
Based on recent evidence of hemispheric asymmetries in the NODDI signal ( Schmitz et al., 2019 ), we computed diffusion metrics within left and right ROIs, comparing their reliability results ( Fig. 7 ). For supra-tentorial ROIs, we observed that hemispheric side had a significant effect on the ICC of ODI and v iso of both models, with higher ICC values on the left and right sides, respectively ( Table 1 ). Concerning the VF, significant differences resulted for v intra and v iso of both models and for NODDI ODI, always with higher VF values for left ROIs ( Table 2 ). The infra-tentorial data confirmed significant VF results only for ODI, again with higher values for left ROIs ( Table 2 of Supplementary Material), while ICC results showed significant differences for Bingham ODI p and v intra , where ICC was higher within right and left ROIs, respectively.

Discussion
The study investigated the test-retest reliability of diffusion MRI metrics computed for different acquisition settings, with the aim of assessing whether fast acquisition and processing settings could help to translate NODDI into the clinical setting. To this purpose, we assessed reliability for 7 different shell configurations and 2 AF. We also analyzed the data with 2 different GPU-based tools and compared the results. The study made use of 2 metrics to assess the reliability of the measurements, ICC and VF.
Excellent reliability in terms of ICC was observed for all diffusion parameters, especially for v intra (see Table 1 and boxplots in the Supplementary Materials); the observation is in accordance with previous NODDI reproducibility studies performed on human brains ( Chang et al., 2015 ; Tariq et al., 2013, 2012 ), on rat brains ( McCunn et al., 2019 ) and on the spinal cord ( Grussu et al., 2015 ), showing that those values are well fitted and robustly estimated in different acquisition schemes. Conversely, our results show slightly lower ICC and very high VF values for v iso (always above 15%). This is confirmed by recent literature findings ( Tariq, 2018 ) suggesting that v iso is a poorly reliable parameter, since it accounts for the isotropic volume fraction and generally has very low values in the WM. Concerning Bingham-NODDI metrics, ODI s , ODI tot , ODI p and v intra showed good to excellent reliability, while DA produced worse results, likely because it is less robust to noise ( Tariq, 2018 ) and harder to estimate than other diffusion parameters ( Tariq et al., 2016 ). As with NODDI, the very low reliability of v iso was confirmed in Bingham-NODDI, in agreement with the evidence found in other Bingham-NODDI reproducibility tests ( Tariq, 2018 ). Although NODDI and Bingham-NODDI models have a different number of parameters and are difficult to compare, both produced very similar results in terms of reliability, and no differences were recorded between the ICC values of these models. Conversely, we found significantly lower VF values for NODDI ODI when compared to ODI tot of the Bingham model.

Table 4. Mean values, standard deviation, F-statistics and p-value of ICC and VF for NODDI metrics to assess the effect of different tools (MDT, cuDIMOT and AMICO) within 32 supra-tentorial ROIs.

Dataset results
Focusing on reliability within the supra-tentorial areas, it was possible to compare results from the two different datasets, pediatric and adult. The pediatric dataset always exhibited better reliability than the adult dataset, showing increased ICC values for ODI (about 21% for both NODDI and Bingham models), v intra (15% for NODDI and 21% for the Bingham model), v iso (14% and 19% for NODDI and Bingham models, respectively) and DA (13%). This evidence occurred for both NODDI and Bingham-NODDI models ( Fig. 2 ) and could be attributed to acquisition differences between populations. Children did not exit the scanner between test and retest sessions and consequently had better data alignment than adults, which led to improved data stability. Moreover, in order to reduce patient discomfort and to collect images as fast as possible, acquisition was limited to the supra-tentorial areas and no additional volumes with reversed phase-encoding direction were acquired. The lack of these acquisitions meant that the pediatric data could not be preprocessed with the standard correction for susceptibility-induced distortion. Conversely, adult volunteers received a longer acquisition with whole-brain coverage, which was further corrected for susceptibility-induced distortion. During the ROI-based analysis, supra-tentorial and infra-tentorial areas were considered separately for reliability assessment in the adult dataset. Average reliability was usually better in supra-tentorial areas ( Table 1 ) than in infra-tentorial areas ( Table 1 of Supplementary Material). In particular, we recorded a global ICC reduction for all the metrics when the analysis was performed within the 8 infra-tentorial ROIs. The infra-tentorial data are indeed more affected by noise and more prone to fitting errors than the supra-tentorial ROIs.

Shell configurations
Comparing shell configurations, very small reliability differences appeared and the only parameters affected by the configuration were v iso of both models and NODDI v intra . Referring to Table 1 and Table 2 , v intra is better reproduced between the 2 sessions in the P 33 configuration, whereas the performance dropped in the P 21 configuration, both in terms of ICC and VF ( Fig. 3 ). This trend suggests that diffusion parameters are more variable in shell configurations with lower b-values (e.g. 300 s/mm 2 ), especially for a simple model like NODDI. The ODI from both models was not affected by the configuration, a finding that is in line with previous results showing that the orientation dispersion index is well estimated even with a single shell, whereas it is more sensitive to the number of gradient directions. In our case, ODI results with fewer shells were comparable in terms of reliability to those with up to 4 shells, as already confirmed in a study ( Timmers et al., 2016 ) showing that single-shell ODI analysis was able to reproduce the group differences from multi-shell analysis. Furthermore, our results seem to be in line with the ODI behavior seen in a recent study ( Parvathaneni et al., 2018 ). The authors showed that, in WM, acquisitions with a single shell with b = 1000/2000 s/mm 2 outperform sequences acquired with 2 shells in terms of root mean square error (RMSE). Although we did not perform single-shell acquisitions, our results revealed a trend of increased ICC for a lower number of shells (i.e. two-shell configurations) in ODI (see Table 1 ). Moreover, similar results were obtained for DA, which might suggest the ability to reliably fit the data even for more complex models such as the Bingham model.
Conversely, our findings confirmed the sensitivity of v intra to the choice of outer shell b-value ( Parvathaneni et al., 2018 ), since, for an equal number of shells, the ICC increased when considering higher b-values.

Acceleration factors
The study observed that SMS3 ICC values were significantly higher than those in SMS2 for all diffusion parameters, while there was no clear winner when looking at VF, which exhibited very high values for v iso in both SMS2 and SMS3 ( Fig. 4 ). This parameter was already shown elsewhere to be unreliably fitted, and we might hypothesize that the deviation from Gaussian noise towards a non-central chi distribution in the case of multiband acquisition ( St-Jean et al., 2020 ) mostly affects the model fitting at higher acceleration factors because of their lower SNR. Better alignment of the data along different directions and different shells, as in the case of a shorter acquisition time, might also be responsible for more reliable performance. The protocol with the higher acceleration factor was indeed 5 min shorter than the one with lower acceleration, so the data from the latter were more prone to motion misalignment. Given the dependence of diffusion parameters on the number of directions, the lower SMS2 ICC values are likely due to this motion artifact, which is only partially corrected via post-processing. In this context, head motion estimates corroborate the idea of an overall slightly larger motion artifact in the case of the lower acceleration factor (see Fig. 11 in Supplementary Material). Head motion can disrupt reliability and it seems to be the main factor related to slice acceleration when looking at ICC and VF (see Figs. 10-12 of Supplementary Material). VF values for v intra , v iso and DA were higher in SMS3 than in SMS2, probably due to the higher acquisition noise of the faster acquisition. In this context, we hypothesized a possible competitive effect between acquisition noise and motion artifacts, which could be evaluated by looking at the two datasets separately. Pediatric subjects were less prone to motion artifacts since they did not exit the scanner, thus their reliability values mainly account for acquisition noise.
As we can see from the Supplementary Material (Tables 5 and 7), although ICC values were always higher for SMS3 in both cohorts, the gap between the two AF was more prominent in adults than in children, reflecting the beneficial effect of faster acquisition on motion artifacts, which were indeed strongly reduced in children. Focusing on the diffusion parameters significantly affected by AF, we found that SMS3 produced significantly higher VF values in all parameters for children and in v intra , v iso and DA for adults (Tables 6 and 8 of Supplementary Material), probably related to higher inter-subject variability caused by the acquisition noise of the faster protocol.

Fitting toolboxes
The Bland-Altman plot in Fig. 8 showed two clusters of results for v intra in the adult dataset. The data fitted with cuDIMOT had higher values and were more spread out in the plot. Most of the difference in spread between SMS2 and SMS3 was observed in the cuDIMOT estimates, whereas the parameters fitted with MDT were more tightly clustered. Interestingly, cuDIMOT's VFs were generally lower than their MDT counterparts, whereas the mean values for each metric were higher (see Bland-Altman plots in Supplementary Material). The latter might explain the lower ICC observed, and hence a lower reliability in terms of how well the method can discriminate between individuals. We do not have a clear explanation for the discrepancy in the results for v iso between the two toolboxes. Both toolboxes use iterative optimization algorithms to solve non-linear least-squares problems. Even though they do not reach the minimum in the same way (Powell for MDT and Levenberg-Marquardt for cuDIMOT), their approaches are very similar, as they both interpolate between Gauss-Newton and gradient descent, so this is unlikely to be the cause of the discrepancy. We also observed that the parameter bounds are tighter in MDT than in cuDIMOT, which could give rise to more variability and overestimation of the metrics for the latter. The use of different optimization methods is likely to influence the analysis time, which was shorter for MDT. Furthermore, both toolboxes use a cascade method to fit complex models starting from simpler ones, and the way this is applied is similar in the two toolboxes. Finally, this difference occurs in both datasets and with both acceleration factors, so it appears to be independent of misalignment and motion artifacts as well as of noise and SNR.
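The agreement analysis underlying a Bland-Altman plot can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the values are invented rather than taken from the study, and the 95% limits of agreement use the conventional bias ± 1.96 SD of the paired differences.

```python
from statistics import mean, stdev

def bland_altman(a, b):
    """Return per-pair means, per-pair differences, bias and 95% limits of agreement."""
    diffs = [x - y for x, y in zip(a, b)]
    means = [(x + y) / 2 for x, y in zip(a, b)]
    bias = mean(diffs)      # systematic offset between the two methods
    sd = stdev(diffs)       # sample standard deviation of the differences
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    return means, diffs, bias, limits

# Hypothetical estimates of one parameter from two fitting toolboxes
toolbox_a = [0.62, 0.66, 0.71, 0.58, 0.69]
toolbox_b = [0.55, 0.57, 0.60, 0.54, 0.58]

means, diffs, bias, (lower, upper) = bland_altman(toolbox_a, toolbox_b)
print(f"bias = {bias:.3f}, limits of agreement = ({lower:.3f}, {upper:.3f})")
```

A non-zero bias corresponds to one toolbox giving systematically higher values, while wide limits of agreement correspond to a larger spread of the differences in the plot.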
Only MDT has the option to model the noise distribution (Gaussian, Offset-Gaussian and Rician). As far as we know, the Offset-Gaussian model provides stable results (Panagiotaki et al., 2012) compared with other noise distributions, which could explain the better reliability, in terms of ICC, of MDT compared with cuDIMOT for the ODI and v intra parameters. Conversely, MDT showed a higher VF and a lower ICC than cuDIMOT for v iso and DA (Fig. 5).
Focusing on v iso and on the effect of the noise standard deviation estimate on VF, we found that the autodMRI noise standard deviation estimate produced better results, almost comparable with those from cuDIMOT. Since the poorer reliability of MDT for v iso is not well understood, these results suggest that a more precise estimate of the noise standard deviation could be a possible solution. Although MDT uses OpenCL and cuDIMOT uses CUDA, MDT showed better run-time performance than cuDIMOT. This is not in line with the manuscript by the author of cuDIMOT, who concludes that cuDIMOT would potentially achieve better performance on Nvidia GPU cards.
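To illustrate why the choice of noise model and of the noise standard deviation can matter for low signals, the sketch below compares a Gaussian log-likelihood with one commonly used form of the Offset-Gaussian log-likelihood, in which the predicted signal E is replaced by sqrt(E² + σ²). This form is an assumption made for illustration and is not taken from the MDT source code; the function names and signal values are hypothetical.

```python
import math

def gaussian_loglik(obs, pred, sigma):
    """Gaussian log-likelihood of observed signals given model predictions."""
    norm = math.log(sigma * math.sqrt(2 * math.pi))
    return sum(-(o - p) ** 2 / (2 * sigma ** 2) - norm
               for o, p in zip(obs, pred))

def offset_gaussian_loglik(obs, pred, sigma):
    """Offset-Gaussian log-likelihood (assumed form): the prediction is
    shifted to sqrt(pred^2 + sigma^2) to mimic the magnitude noise floor."""
    norm = math.log(sigma * math.sqrt(2 * math.pi))
    return sum(-(o - math.sqrt(p ** 2 + sigma ** 2)) ** 2 / (2 * sigma ** 2) - norm
               for o, p in zip(obs, pred))

sigma = 0.05                  # hypothetical noise standard deviation
observed = [0.05, 0.05]       # hypothetical signals sitting at the noise floor
pred_zero = [0.0, 0.0]        # model predicting full signal decay
pred_floor = [0.05, 0.05]     # model predicting the observed values directly

gauss_prefers_floor = (gaussian_loglik(observed, pred_floor, sigma)
                       > gaussian_loglik(observed, pred_zero, sigma))
offset_prefers_zero = (offset_gaussian_loglik(observed, pred_zero, sigma)
                       > offset_gaussian_loglik(observed, pred_floor, sigma))
print(gauss_prefers_floor, offset_prefers_zero)
```

Under these assumptions, the Gaussian model treats the noise floor as genuine signal, whereas the Offset-Gaussian model attributes it to noise. Because σ enters the shifted prediction directly, an inaccurate estimate of the noise standard deviation changes which parameter values are favoured.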

Hemispheric side
Recent studies used NODDI to investigate possible microstructural hemispheric asymmetries across the whole brain in adults (Schmitz et al., 2019), children (Dimond et al., 2020) and infants (Dean et al., 2017). In particular, both leftward and rightward microstructural asymmetries were observed over 22.78%, 12.78% and 6.11% of the overall brain area for v intra, ODI and v iso, respectively (Schmitz et al., 2019). In our data, multivariate analysis of diffusion data from right-handed subjects revealed a significant effect of hemispheric side on ODI and v iso reliability (Fig. 7), suggesting that brain microstructure plays a role in reliability estimates.
Having said that, the present study has a few limitations. First, even though the study includes a larger overall number of subjects than a few other studies on the topic (Andica et al., 2020; Tariq, 2018), relatively few subjects per acquisition were obtained, which makes the inference less robust and the results harder to interpret. In this context, we provided reliability information by assessing ICC and VF statistics on a single dataset in which pediatric and adult data were merged (see Tables 3 and 4 of Supplementary Material), but future studies should address this limitation by recruiting a larger number of adult subjects. Moreover, in this study we used the default mode of NODDI fitting, thus neglecting that the optimised intrinsic diffusivity in NODDI might differ between children and adults (Guerrero et al., 2019). This aspect needs to be investigated further in future studies. A second limitation of this study is the heterogeneity of the acquired data, with differences between the pediatric and adult datasets. Since pediatric data were much more difficult to acquire, a few sacrifices were made in terms of protocol: only the supra-tentorial part of the brain was sampled, and no additional volumes with reversed phase-encoding direction were available for children, which prevented correction of susceptibility-induced distortions during pre-processing of the pediatric data. Moreover, children did not exit the scanner between the test and retest acquisitions, while adults were taken out of the scanner for about 20 min.

Conclusion
In summary, we applied the NODDI and Bingham-NODDI models to a cohort of pediatric and adult subjects and showed the effects of different acquisitions and methods on the reliability of NODDI metrics. We observed that different metrics show different patterns with respect to multiband factors and shell configurations. The main achievement of the present study is to show that a 10 min NODDI acquisition with 3 shells can yield reliable results in WM. More complex models do not appear to be more sensitive to reduced data acquisition or to noisier data, which supports the idea that Bingham-NODDI has greater sensitivity to true subject variability. Multiband acquisition did not worsen reliability; conversely, shortening the acquisition might help attenuate motion artifacts. Lastly, we compared two GPU-boosted analysis toolboxes, mainly showing that faster fitting does not jeopardize reliability when compared with more stable methods.