Reproducibility of cerebellar involvement as quantified by consensus structural MRI biomarkers in advanced essential tremor

Abstract Essential tremor (ET) is the most prevalent movement disorder with poorly understood etiology. Some neuroimaging studies report cerebellar involvement whereas others do not. This discrepancy may stem from underpowered studies, differences in statistical modeling or variation in magnetic resonance imaging (MRI) acquisition and processing. To resolve this, we investigated the cerebellar structural differences using a local advanced ET dataset augmented by matched controls from PPMI and ADNI. We tested the hypothesis of cerebellar involvement using three neuroimaging biomarkers: VBM, gray/white matter volumetry and lobular volumetry. Furthermore, we assessed the impacts of statistical models and segmentation pipelines on results. Results indicate that the detected cerebellar structural changes vary with methodology. Significant reduction of right cerebellar gray matter and increase of the left cerebellar white matter were the only two biomarkers consistently identified by multiple methods. Results also show substantial volumetric overestimation from SUIT-based segmentation—partially explaining previous literature discrepancies. This study suggests that current estimation of cerebellar involvement in ET may be overemphasized in MRI studies and highlights the importance of methods sensitivity analysis on results interpretation. ET datasets with large sample size and replication studies are required to improve our understanding of regional specificity of cerebellum involvement in ET. Protocol registration The stage 1 protocol for this Registered Report was accepted in principle on 21 March 2022. The protocol, as accepted by the journal, can be found at: 10.6084/m9.figshare.19697776.

Several Magnetic Resonance (MR) imaging studies suggest structural changes in the cerebellum associated with ET [6][7][8][9][10]. Specifically, earlier work based on voxel-based morphometry (VBM) suggests bilateral cerebellar atrophy in ET subjects 11,12, with a predilection for the vermis 11. More recently, volumetric studies using high-resolution cerebellar atlases 13 report significant decreases of gray matter (GM) in cerebellar lobules I-IV, V, VI, VII and VIII, in addition to the vermis 14. Contrary to these findings, other work indicates that there is no significant association between ET symptoms and cerebellar degeneration 5,15,16. Furthermore, a meta-analysis comprising 16 pooled VBM studies fails to find consistent cerebellar abnormalities and gray matter alterations in the ET population 15.
Another recent VBM meta-analysis 17 suggests that, although there is some evidence for volumetric changes in ET patients, significant heterogeneity in the published literature limits definitive conclusions on cerebellar tissue loss based on MRI. The small median sample size of the studies in this meta-analysis (n = 19.5 for ET, 20 for Normal Control (NC)) implies an estimated median power of 0.44 for a single test, and lower power still if, for example, 10 lobules or regions of interest (ROIs) were tested and corrected for multiple comparisons. The literature on cerebellar atrophy in ET patients may therefore suffer from the winner's curse effect and low positive predictive value 18, in addition to the file drawer effect, i.e., the publication bias towards reporting significant findings 19. These factors make the meta-analysis results difficult to interpret, and the authors of the meta-analyses therefore suggest additional replication studies. Meta-analysis caveats and winner's curse effects are not specific to the field of neuroimaging 20,21 and can be especially relevant in low power settings.
Beyond cerebellar involvement, an abnormal cerebello-thalamo-cortical network has been proposed in ET 22,23. These abnormalities could be a logical functional consequence of cerebellar pathology, or alternatively reflect a wider structural degenerative process beyond the cerebellum. Thus far, the cortical changes in ET and their association with cerebellar degeneration are not well characterized in neuroimaging studies and lack consensus 11,[24][25][26][27][28][29]. In addition to possible decreases in volume, some work even suggests an increase in gray matter in the supplementary motor area of ET patients based on a VBM analysis 30. These inconsistencies motivate further exploration of the coincidence of cerebellar and cortical structural changes to improve our understanding of patterns of degeneration in ET.
The current inconsistencies in MR imaging studies that link varying cerebellar changes to ET may be attributed to several sources. First, it is difficult to collect large-scale, well-characterized, randomized ET and control samples, and the collection and analysis of disparate cohorts with small sample sizes limit valid hypothesis testing and interpretation of findings. Second, the disagreements between imaging studies may arise from the complexity and flexibility of the neuroimaging processing pipelines and the statistical models [31][32][33]. We refer to the study of the robustness of findings across pipelines and statistical models as "methods sensitivity analysis". In ET studies, these pipelines include VBM, ROI volumetry, and cortical thickness estimation, which quantify biomarkers at different scales and regional specificities. Typically, the pipeline choice stems from underlying hypotheses about a biomarker's spatial specificity and sensitivity (e.g., voxels vs regions) in identifying case-control differences, the availability of data, and familiarity with the software toolboxes. Most of the aforementioned studies choose only one among the many available imaging analysis pipelines, such as VBM using SPM or ROI analysis using Freesurfer. The lack of identical (or similar) pipelines between two studies complicates direct comparison of the results. Third, the statistical modeling varies. The existing literature employs differing approaches to hypothesis testing (GLM and permutation tests), confounder control and covariate selection, which can introduce further inconsistencies in the biological findings 7,9,11,12,14. The situation is further complicated at times by a lack of statistical and neuroimaging reporting standards.
In some studies, we were not able to find full details of the statistical analyses. For example, z or t values, effect sizes, and details of multiple comparison corrections were not reported in a consistent fashion 7,9,11,14 . Additionally, studies performing analysis based on presumed disease subtypes that may in fact exist in a continuum could also dilute statistical power and inflate the effect sizes that can be detected 20,21 in smaller cohort studies. All these complexities, compounded possibly by the file drawer effect, make the comparison and interpretation of neuroimaging studies difficult, and hinder the translation of research findings to clinical applications 34,35 .
To address these methodological issues in the current ET imaging literature, we carried out multiple neuroimaging analyses at different phenotypic scales and compared them against the findings from the literature. For these analyses, we used a local sample of ET patients referred to a specialized neurosurgical movement disorders clinic. The patients presented with an advanced stage of ET with disabling upper extremity symptoms. The local sample also comprised a limited number of control subjects; however, their age and sex were not well matched with the ET group. We therefore augmented the control sample by drawing from two publicly available datasets, the Parkinson's Progression Markers Initiative (PPMI) 36 and the Alzheimer's Disease Neuroimaging Initiative (ADNI) 37, allowing us to obtain a sample of control subjects with age and sex distributions and scanner types similar to those of the local ET sample. With this augmented sample, we aim to investigate group differences between ET and NC groups using structural imaging biomarkers derived from T1 MRIs. Specifically, we aim to answer the following three questions: 1. Can we detect a consensus in cerebellar involvement as quantified by structural MR imaging biomarkers in an advanced ET sample? 2. What is the impact of methodological variation resulting from the use of different image processing pipelines and statistical models on the above findings? Could these variations explain the literature discrepancies? To answer question 1, we tested the hypothesis that the ET group shows significant cerebellar changes compared to the NC group that are detectable using a consensus of 3 different MRI biomarkers: (1) cerebellar VBM, (2) cerebellar gray and white matter volumetry, and (3) cerebellar lobular volumetry.
We answered the second research question of the impact of pipeline and statistical model selection with a systematic methodological sensitivity analysis that includes: (1) comparisons with alternative segmentation pipelines to estimate cerebellar lobular volumes, (2) parametric versus non-parametric significance tests and alternative confounder control models and intracranial volume choices.
We investigated the third question by comparing the differences in the correlation patterns between cerebellar and cortical structural features of ET and NC groups in a secondary exploratory analysis. The overview of this study is illustrated in support information Fig. S1.

Results
No consistent detection of cerebellar involvement in advanced ET by all 3 MRI biomarkers. The consensus-based hypothesis testing results are illustrated in Fig. 1. Despite our large cohort (N = 211, patient group: N = 34, control group: N = 177), we were not able to detect significant voxel-level differences between ET and augmented NC using VBM with a cerebellar mask. The full VBM report can be found in support information (SI) Fig. S3. The hypothesis tests for cerebellar gray matter and white matter (GM & WM) volumetry and for cerebellar lobular volumetry were carried out using a general linear model (GLM) with age, sex, the intracranial volume (eTIV, estimated Total Intracranial Volume), cohort and group as covariates, with the Bonferroni approach for multiple comparison correction. The volumetric comparisons of regions of interest (ROIs) include left and right cerebellar GM & WM estimated by Freesurfer, and left and right CrusI, CrusII, and dentate nucleus as well as vermis CrusI, CrusII, and VI for cerebellar lobular volumes estimated by SUIT (SUIT provides no white matter estimation). The only significant ROI we obtained was the left cerebellar WM with p = 0.0122, z = 2.5059. The positive z value suggests "hypertrophy" rather than "atrophy" in ET. These results differ from those of previous MRI studies 11,38, but may agree with a recent histology study which reports "focal swellings of Purkinje cell axons" 39. In summary, we were not able to detect cerebellar involvement associated with ET from the consensus of all 3 MRI biomarkers.
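As a small arithmetic check on the statistics reported above, the sketch below converts a z value into a two-sided p value under a standard normal reference distribution and computes a per-test Bonferroni threshold. The function names are hypothetical, and the test count of 10 is purely illustrative; it is not the number of comparisons used in this study.

```python
from statistics import NormalDist


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p value for a standard-normal test statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# z value reported for the left cerebellar WM
p_left_wm = two_sided_p_from_z(2.5059)
print(round(p_left_wm, 4))  # prints 0.0122

# Illustrative only: threshold if 10 ROIs were tested at alpha = 0.05
print(bonferroni_threshold(0.05, n_tests=10))
```

The two-sided normal tail of z = 2.5059 reproduces the reported p = 0.0122, which suggests the two reported numbers are mutually consistent under this assumption.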

Findings across different statistical models and cerebellar segmentation pipelines.
We assessed the impacts of (1) an alternative cerebellum segmentation pipeline, i.e., MAGeT 40, and (2) alternative statistical models on the hypothesis testing results. Note that Freesurfer only segments cerebellar GM & WM, SUIT 13 segments GM including hemispheric lobules, vermis and deep nuclei, and MAGeT segments GM & WM including only hemispheric lobules without vermis or deep nuclei. To evaluate the replicability of the previous findings, we repeated the hypothesis tests for cerebellar GM & WM volumetry and cerebellar lobular volumetry based on the Freesurfer, SUIT and MAGeT segmentations using the 10 most commonly used statistical models. The results are summarized in Fig. 2, which shows the four most consistent findings across pipelines and statistical models. See the methods section and SI for detailed results (Fig. S5-7). Since no single ROI showed consistent significant differences between ET and NC across all 3 pipelines, we focus on the consensus findings obtained from at least 2 pipelines.
Significant right cerebellar GM reduction in ET was detected by both Freesurfer and MAGeT using models 7 and 9, with effect sizes of -0.8996 and -0.8780. These models employ permutation tests using cerebellar volume as a confounding factor. The significant increase in the left cerebellar WM was confirmed by all 10 models based on Freesurfer segmentations; however, it was not significant for the MAGeT results. We also noticed that all the statistical models based on SUIT segmentations showed increases, some of which were significant. We note that SUIT based results should be interpreted with caution because of issues with its segmentation quality and the consequent high correlations among the lobular volume estimates (more details in the methods sensitivity analysis and quality control sections). MAGeT, on the other hand, provided better cerebellar lobular segmentations than SUIT (refer to the quality assessment results) and showed a reduction of right cerebellar GM in ET.

Methods sensitivity analysis. Different statistical models and confounding effect control settings.
To evaluate the effects of different statistical models comprising various confounder control strategies and covariate settings, we tested the hypothesis of cerebellar involvement with the same cerebellar volumetric data (Freesurfer cerebellar GM & WM volumes and SUIT cerebellar lobular volumes) using 10 commonly used models for testing including: general linear model (GLM) based family of tests (models 2-5) with age, sex, eTIV, cohort and group as covariates, and permutations based family of tests (models 6-11) with different confounder control settings (refer to methods section and Table S2 for full details). Model 1 is a direct permutation test with multiple comparison correction as a reference for the other 10 models. The confounder correction settings are different within the testing family: (1) "covariate inclusion" and "variable transformation" for GLM and (2) "residual based methods" and "variable transformation" for permutation tests. All of the hypothesis testing results are summarized in Fig. S5. We detected a significant increase in left cerebellar WM in the ET group with all models (except direct comparison without adjusting for confounding variables) based on Freesurfer segmentations. This result is consistent with some recent histological studies 39,41 . GLM based tests always give larger effect sizes than the permutation tests (e.g., the mean effect size of left cerebellar WM from Freesurfer was 2.6457 for models 2-5 and 0.9260 for models 6-11 as illustrated in Fig. S5) suggesting departure from distribution normality. Permutation tests discovered more significant ROIs than GLM. Whereas GLM was only able to detect the increase of left cerebellar WM, the permutation tests additionally detected the increase of left and right CrusI apart from the reduction of right cerebellar GM (with effect size -0.8996 from model 7 and -0.8780 from model 9). 
Right cerebellar cortex reduction was only detected when we controlled for eTCV instead of eTIV with permutation tests, which was in accordance with the literature findings. GLM based models 3 and 5 showed larger effect sizes, but the statistical tests did not reach significance.
[Figure caption fragment (Table S2): two horizontal arrows with lighter color represent conflicting results across the 10 different models. A short gray line indicates that no data are available for the ROI from a specific pipeline.]
Different segmentation pipelines of SUIT and MAGeT. Since we obtained different results from SUIT and MAGeT, we further explored the cerebellar lobular volume differences between these 2 pipelines. The distributions of cerebellar lobular volumes estimated by SUIT and MAGeT are illustrated in Fig. S4 in SI. We observed high interlobular correlations with small variance (0.8901 ± 0.1211) in the SUIT results as seen in Fig. 3a, and lower interlobular correlations with larger variance (0.4092 ± 0.1523) in the MAGeT results as seen in Fig. 3b. In Fig. 3c, the cross hemispheric lobular correlations between SUIT and MAGeT were also comparatively low (0.3170 ± 0.1036), with left VIIIb showing the largest mean correlation between these 2 pipelines (ρ = 0.4469) and right X showing the smallest correlation (ρ = -0.0062). We also calculated the correlations of cross hemispheric cerebellar lobular volumes within and across pipelines (SUIT and MAGeT) as summarized in Fig. 3d. SUIT showed extremely high hemispheric cerebellar lobular volume correlations with small variance (mean 0.985 ± 0.007), whereas MAGeT gave high correlations with larger variance (0.773 ± 0.083). In summary, SUIT lobular segmentations showed high correlations with less variance, whereas MAGeT segmentations showed comparatively low correlations with larger variance. These results were coupled with visual inspections by our anatomy experts, who suggested that the MAGeT results appeared more biologically plausible.
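The "mean ± spread" summaries of interlobular correlations reported above can be illustrated with a short sketch. It assumes Pearson correlation and a sample standard deviation over all unique lobule pairs; the text does not state which correlation coefficient or spread measure was used, and the function names and toy volumes are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def interlobular_summary(volumes):
    """Mean and SD of correlations over all unique lobule pairs.

    volumes maps a lobule name to a list of per-subject volumes;
    at least three lobules are needed so that the SD is defined.
    """
    pairs = combinations(sorted(volumes), 2)
    rs = [pearson(volumes[a], volumes[b]) for a, b in pairs]
    return mean(rs), stdev(rs)


# Toy data (hypothetical volumes for four subjects, three lobules)
toy = {
    "left_CrusI": [10.1, 11.3, 9.8, 12.0],
    "left_CrusII": [8.0, 8.9, 7.7, 9.6],
    "left_VI": [6.5, 6.1, 7.0, 6.4],
}
mean_r, sd_r = interlobular_summary(toy)
print(f"{mean_r:.3f} +/- {sd_r:.3f}")
```

A pipeline whose lobular volumes are all close to scaled copies of one another would yield a mean near 1 with a very small SD, which is the pattern described above for SUIT.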
Cerebello-cortical structural covariance patterns vary with pipelines. In an exploratory analysis, we show in Fig. 4 the cerebello-cortical structural covariance between cerebellar GM & WM volumes and the cerebral cortical thickness aggregated using the DKT parcellation ($n_{ROI} = 62$). Following the literature convention 42, the cortical thickness was corrected for confounding effects from age, sex and cohort, and we additionally controlled for eTIV and for cerebellar GM & WM volumes using the residual method 43. Given the unsatisfactory quality of the SUIT segmentations, we limited this analysis to the results from the MAGeT (a, c) and FreeSurfer (b, d) pipelines.
Generally, the NC groups showed small and positive structural covariance patterns between cerebellar GM & WM and cortical thickness. The mean cerebellar GM and cortical ROI correlations in NC were 0.0758 for MAGeT and 0.0778 for FreeSurfer; whereas the mean cerebellar WM and cortical ROI correlations were [...] WM. We discuss the implications of these findings for future studies in the next section. All these findings suggest that methods sensitivity analysis should be seriously considered when drawing biological inferences from complex computational models and pipelines.

Discussion
In summary, we proposed a principled consensus based approach to analyze cerebellar involvement in ET with an augmented cohort with high power, while considering the impacts of the MRI processing pipelines and statistical models. The quality of all the images and the processing results were evaluated by both neuroanatomy and image processing experts. We were not able to detect the cerebellar involvement for advanced ET from the consensus of 3 MRI biomarkers namely VBM, cerebellar GM & WM volumetry and cerebellar lobular volumetry. We further tested the same hypothesis using 10 most commonly used statistical models based on the biomarkers derived from Freesurfer, SUIT and MAGeT. No cerebellar ROI derived from these 3 pipelines showed consistent significant difference. The two regions that showed cross pipeline agreement between FreeSurfer and MAGeT included (1) reduction in right cerebellar GM volume found significant with permutation tests by 2 out of 10 statistical models using cerebellar volume as confounding factor, and (2) increase in left cerebellar WM volume found either significant by all the 10 statistical models based on the Freesurfer results or non-significant but trending in the same direction based on MAGeT results. Based on results from hypothesis testing, we carried out exploratory analysis to investigate covariance patterns between cerebellar GM & WM volumes affected in ET and cortical thickness in cerebrum quantified using DKT parcellation. The results showed ET group had a consistent overall decrease in association between cerebellar GM volume (estimated by Freesurfer and MAGeT) and cortical thickness, although the trends were not consistent for cerebellar WM. This discrepancy may stem from the different definitions of cerebellar WM in Freesurfer and MAGeT atlases. 
Both Freesurfer and MAGeT segment the trunk-like main cerebellar WM volume reliably, but the MAGeT atlas excludes the smaller branch-like fronds of cerebellar WM underneath the cerebellar cortex. The correlations between left and right cerebellar WM from Freesurfer and MAGeT were 0.79, 0.76; 0.77, and 0.78 as detailed in the Results. Previous studies [44][45][46] have reported alterations in cortical thickness in Parkinson's and ET, and a few fMRI studies 27,30 have linked tremor severity with the cerebello-thalamo-cortical pathway. However, structural atrophy patterns associated with this pathway and related cerebello-cortical networks remain relatively unexplored. The cerebellar GM decrease is consistent with previous studies 7,9,11,17,38, which use VBM and cerebellar GM & WM volumetry, including some studies that accounted for different clinical variables. The WM increase contradicts previous findings 38 which used Freesurfer 4.0.5 with eTIV (estimated by SPM2) as a covariate; however, it is in line with the recent histology studies 39,41 that report cerebellar WM increase due to possible "focal swellings of Purkinje cell axons". For lobular volumetric analyses, both manual and SUIT based segmentation results in the literature report significant atrophy 7,9,14 in different cerebellar lobules. However, we were not able to detect these differences with MAGeT. Together with the other sensitivity studies 33,47, this work highlights the fact that the results derived from complex modeling and image processing pipelines can be sensitive to algorithmic and parametric choices. Our extensive, time-consuming quality control procedure for all the subjects (MNI, PPMI and ADNI), carried out by both anatomical and image processing experts, sheds some light on the sources of variation in neuroanatomical findings in the ET literature. The detailed quality assessment (QA) results are shared with our OSF preregistration (https://osf.io/ucrxf/).
The main observations regarding cerebellar segmentations are as follows: (1) Freesurfer is generally reliable for various datasets; however, it only estimates the global volumes of cerebellar GM & WM without finer lobular segmentations. (2) The SUIT pipeline with its accompanying cerebellar atlas (default for SPM/FSL/AFNI) is the most commonly used method for cerebellar segmentation and is the only pipeline that segments the vermis and dentate nucleus, although it does not segment cerebellar white matter. However, its overall results were generally poor in our datasets. SUIT overestimated lobular volumes, often segmenting the space between neighboring lobules. The high inter-lobular correlations with low variance are biologically unlikely and need further investigation. (3) MAGeT gives the most anatomically reliable results, possibly due to its multi-atlas registration approach comprising 5 manually segmented templates and the high resolution of these templates. However, it does not provide segmentations for the vermis and dentate nucleus. (4) From the computation cost perspective, Freesurfer is computationally intensive and also gives cerebral parcellations, whereas SUIT is computationally economical but requires manual re-orientation before processing due to its cerebellum extraction step. MAGeT requires extensive computing resources due to the large number of registrations involved. In terms of statistical models, the permutation test is more sensitive to group differences, and we found that controlling for cerebellar volume instead of the total intracranial volume seems better suited to the study of cerebellar subregional differences. In general, the interpretability of results depends jointly on the choice of confounding variables and on variable transformation techniques such as direct proportion adjustment.
There are several limitations in this study: (1) The ET group is still small with only 34 subjects. Increasing the number of NC subjects can improve the power to some extent, but the gain reaches a plateau. As shown in the preregistration examples, 325 more NC can only increase the power to 0.97, while we used 177 NC to get 0.9 power in the present study. Since there are no open ET datasets, we were not able to use more advanced matching procedures, like propensity score matching 48.
(2) In this study, the cohort effect (MNI, PPMI and ADNI) is modeled as a simple linear effect when we pooled NC subjects, but the actual cohort effects could be more complex and require more complex modeling 49. (3) We only included age, sex, cohort, and eTIV/eTCV in our statistical models, without other potentially important clinical variables such as disease duration, since we did not have access to these data at the time of the present analysis. Results could vary if these clinical variables were included. (4) We used the default configurations of these pipelines, as other investigators have done. The performance may be improved with better tuning by the pipeline experts 50.
Scientific Reports | (2023) 13:581 | https://doi.org/10.1038/s41598-022-25306-y | www.nature.com/scientificreports/
Overall, this study emphasizes the significance of pipelines and methods sensitivity analysis in biological inferences, reinforcing the importance of preregistration procedures. Methods sensitivity analysis and detailed data & processing quality assessments should be reported in future studies. While ET studies are numerous in the literature, more replication studies and accessible datasets are essential in order to draw robust conclusions regarding the extent of cerebellar involvement in ET based on MRI analysis.

Methods
Data and cohort matching. This study used 3 T T1 MRI images from 3 datasets which had already been collected: (1) The MNI dataset with 70 subjects including 38 well characterized pre-surgical advanced ET subjects and 32 normal control (NC) subjects; (2) The PPMI dataset, a subset of the PPMI control cohort with 116 NC subjects; (3) The ADNI dataset, a subset of the ADNI control cohort with 312 NC subjects. More details of the datasets and image acquisitions can be found in the support information (SI). Due to image processing errors or low processing quality, we discarded 4 ET and 3 NC subjects from the MNI dataset, 38 NC subjects from PPMI and 89 NC subjects from ADNI. Based on the number of ET subjects left (34), power = 0.9, the mean literature effect size = 0.61 (more details in the pre-registration power analysis) and a significance level of 0.05, we calculated the number of NC subjects needed: 177 for 2-sided tests. We randomly selected these 177 age and sex matched NC subjects from the pooled MNI, PPMI and ADNI2 NC subjects to form the NC group with an L2 based matching algorithm 51 (more details in SI). We have 211 subjects in total (34 ET and 177 NC). The age and sex distributions are illustrated in Fig. S2 and summarized in Table 1 below. Cohort membership is modeled as a linear random effect in the later analyses.
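The relationship between group sizes, effect size and power described above can be sketched with a normal approximation to the two-sample test. This is not the preregistered power calculation: the function names are hypothetical, the approximation ignores the t distribution, and the resulting numbers can therefore differ somewhat from the values quoted in the text.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_case, n_control, effect_size, alpha=0.05):
    """Approximate two-sided power for a standardized mean difference."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)
    # Noncentrality grows with the harmonic-type term n1 * n2 / (n1 + n2)
    ncp = effect_size * sqrt(n_case * n_control / (n_case + n_control))
    return nd.cdf(ncp - z_crit) + nd.cdf(-ncp - z_crit)


def controls_needed(n_case, effect_size, target_power=0.9,
                    alpha=0.05, max_controls=100_000):
    """Smallest control group size reaching the target power, or None."""
    for n_control in range(2, max_controls + 1):
        if approx_power(n_case, n_control, effect_size, alpha) >= target_power:
            return n_control
    return None


# Values taken from the text: 34 ET subjects, effect size 0.61, 177 NC
print(approx_power(34, 177, 0.61))
print(controls_needed(34, 0.61))
```

Because the term n1 * n2 / (n1 + n2) can never exceed the case group size n1, adding controls yields diminishing returns, which is one way to see why power levels off when only the NC group grows.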
MRI processing. The original raw (DICOM) T1-weighted (T1w) MR images were converted into NIfTI format and organized according to the BIDS standard with HeuDiConv 0.8.0 52. All the T1 data were preprocessed with the anatomical workflow of fMRIPrep 20.2.0 53,54. The Freesurfer pipeline (http://surfer.nmr.mgh.harvard.edu/, version 6.0.1), which is part of fMRIPrep 20.2.0, estimates the cerebellar GM and WM volumes using the default "recon-all" processing. We quantified cerebral cortical thickness and cerebellar GM & WM (gray and white matter) volumes using the default "DKT atlas + aseg" labels.
Quality control procedure. The quality control (QC) procedure was carried out for MNI, PPMI and ADNI. The quality of the images and the processed results (normalization and segmentation) were evaluated by two expert neuroanatomists (M.A. and A.F.S.) and an imaging expert (Q.W.) and the results are summarized in Fig. 5. Refer to the full quality assessment report in SI for more details.
Considering the quality assessment (QA) results, MAGeT gave more informative and anatomically plausible cerebellar segmentations (see the full QA report in SI). The SUIT segmentations raised concerns because of their general tendency toward overestimation, the high inner-pipeline correlations (Fig. 3a) and their comparatively low processing quality. SUIT also provided estimates of deep cerebellar nucleus volumes, e.g., the dentate nucleus; however, T1 MRIs alone did not allow for quality control of these anatomical structures. Freesurfer generally provided cerebellar GM and WM segmentations of acceptable quality. However, the quality 2 classifications of Freesurfer results were mainly due to the overestimation of cerebellar WM.
Cerebellar segmentation pipelines. We used the SUIT pipeline 13 (version 3.4) to segment the cerebellum into finer lobules. SUIT is the most used pipeline for cerebellar lobular segmentation. It first extracts the cerebellum from the entire brain image, then segments the cerebellar gray and white matter, and finally segments the cerebellar gray matter into 34 lobules according to the SUIT atlas.
In contrast to SUIT, the MAGeT Brain 40 (version 1.0) pipeline employs a multi-atlas procedure to perform volumetric segmentation of brain structures. The multi-atlas approach, combined with an intermediate cohort-specific bootstrapping procedure, can better capture neuroanatomical variability and thus offer more accurate segmentations.
A consensus based hypothesis testing of cerebellar involvement of ET. We tested the hypothesized cerebellar structural differences associated with ET compared to the NC group with a consensus approach. The full model is detailed under the name "model 2" in the methods sensitivity analysis in Table S2, and the detailed results are illustrated in Fig. S5 and summarized in Fig. 1.

Methods sensitivity analysis. Statistical models and confounder control settings sensitivity analysis.
In general, we used 2 hypothesis testing approaches (GLM and permutation hypothesis testing) and 2 families of confounding control methods (residual based methods and adjustment based methods 43). We denote each combination of testing approach and confounding control method as one model; the details of the models can be found in Table S2 and the results in Fig. 2. We tested the 2 most widely used approaches for controlling the confounding effects of intracranial volumes:
(1) Residual based methods: confounders (age, sex, estimated intracranial volume (eTIV), and cohort) are first included as covariates in a regression model of the volume of interest $V_{oi}$, in which $b_0$ is the ROI volume with confounding effects corrected. Usually, the model is fitted with the NC data first, and the $b_0$ values are then calculated for both ET and NC groups with the fitted model 55. Besides eTIV, the total cerebellar volume (eTCV) can also be used in this model if it is considered a confounder. This is similar to controlling for total hippocampus volume when comparing hippocampus subregions 56.
(2) Adjustment based methods: intracranial volume normalized ROIs or log transformed normalized ROIs (direct proportion adjustment and power proportion adjustment [57][58][59]) are used in the GLM and permutation approaches instead of the original volume, for example: $V_{dpa} = V_{oi}/eTIV$ (direct proportion adjustment, DPA); $V_{ppa} = V_{oi}/eTIV^{b_1}$ with $\log(V_{ppa}) = b_0 + b_1 \cdot \log(eTIV)$ (power proportion adjustment, PPA), where $V_{dpa}$ is the proportion adjusted volume and $V_{ppa}$ is the power proportion adjusted volume. When the intracranial volume adjusted variables are used in the GLM, the model becomes $V_{dpa}$ (or $V_{ppa}$) $= b_0 + b_1 \cdot age + b_2 \cdot sex + b_3 \cdot cohort + b_4 \cdot group + \varepsilon$ instead. In practice, we used only DPA with the GLM for better interpretability ($V_{oi}$ ratio). In addition, we compared the differences between using eTCV and eTIV to adjust for global volume effects in both GLM and permutation tests.
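The three volume adjustments named above (residual based, DPA and PPA) can be sketched as follows. This is a simplified illustration rather than the models of Table S2: it uses eTIV as the only confounder, the function names are hypothetical, and the PPA exponent is taken as the slope of log volume on log eTIV, which is the usual power-proportion formulation and an assumption about the intended definition.

```python
from math import log
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1


def residual_adjust(vol, etiv, is_control):
    """Fit vol ~ eTIV on controls only, then remove the eTIV term for all.

    The returned value equals intercept + residual for each subject.
    """
    ctrl_x = [e for e, c in zip(etiv, is_control) if c]
    ctrl_y = [v for v, c in zip(vol, is_control) if c]
    _, b1 = fit_line(ctrl_x, ctrl_y)
    return [v - b1 * e for v, e in zip(vol, etiv)]


def direct_proportion_adjust(vol, etiv):
    """DPA: volume expressed as a fraction of eTIV."""
    return [v / e for v, e in zip(vol, etiv)]


def power_proportion_adjust(vol, etiv):
    """PPA: divide by eTIV raised to the fitted log-log slope (assumed)."""
    _, b1 = fit_line([log(e) for e in etiv], [log(v) for v in vol])
    return [v / e ** b1 for v, e in zip(vol, etiv)]


# Toy data (hypothetical volumes and eTIV values, arbitrary units)
etiv = [1400.0, 1500.0, 1600.0, 1700.0]
vol = [52.0, 55.5, 58.0, 62.5]
print(direct_proportion_adjust(vol, etiv))
print(power_proportion_adjust(vol, etiv))
print(residual_adjust(vol, etiv, [True, True, True, False]))
```

Fitting the residual model on controls only, as described in the text, keeps the confounder slope from being influenced by any disease-related volume change in the ET group.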
We used n = 5000 permutations for all the permutation tests. Details of the models used in the sensitivity analysis are fully described in Table S2.
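A permutation test of a group difference can be sketched as below, using the 5000 permutations mentioned above. This is a minimal illustration on the difference in group means, with hypothetical function names and toy data; it does not reproduce the confounder control settings of models 6-11, which would be applied to the volumes before permuting.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_perm=5000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns (observed difference, p value). Group labels are shuffled
    n_perm times; the p value counts permuted differences at least as
    extreme as the observed one, with a +1 correction.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)


# Toy data (hypothetical adjusted volumes for two groups)
et = [48.2, 47.5, 49.1, 46.8, 47.9, 48.4]
nc = [50.3, 51.0, 49.8, 50.7, 51.4, 50.1]
diff, p_value = permutation_test(et, nc)
print(diff, p_value)
```

The +1 in numerator and denominator keeps the estimated p value strictly positive, so with 5000 permutations the smallest attainable p value is 1/5001.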
Cerebellar volumetry and cerebellar segmentation pipeline selection. Cerebellar volumetry can be sensitive to the choice of segmentation pipeline and anatomical atlas. Therefore, we compared the lobular volumetric group differences derived from: (1) the SUIT pipeline with the SUIT atlas 13, which is widely used for cerebellar segmentation by the imaging community; (2) the MAGeT Brain pipeline with a multi-atlas segmentation method.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.