Exploring the potential of representation and transfer learning for anatomical neuroimaging: Application to psychiatry

The perspective of personalized medicine for brain disorders requires efficient learning models for anatomical neuroimaging-based prediction of clinical conditions. There is now a consensus on the benefit of deep learning (DL) in addressing many medical imaging tasks, such as image segmentation. However, for single-subject prediction problems, recent studies yielded contradictory results when comparing DL with Standard Machine Learning (SML) on top of classical feature extraction. Most existing comparative studies were limited to predicting phenotypes of little clinical interest, such as sex and age.


Introduction
With the ever-growing availability of brain imaging data (e.g., UK-BioBank Bycroft et al., 2018, HCP Van Essen et al., 2013, ABIDE Di Martino et al., 2014, etc.), Machine Learning (ML) and, in particular, Deep Learning (DL) models are starting to emerge for personalized medicine and biomarker discovery in psychiatry and neurology. Psychiatric disorders are complex and highly heterogeneous, gathering clinical, biological, and environmental variabilities (Wolfers et al., 2018), thus making their neurobiological characterization challenging. In this context, ''Standard'' ML (SML) models, including (regularized) linear models and kernel-based methods, have been broadly used in neuroimaging studies, where the number of available samples N is usually small (N < 10³) and the number of imaging features p quite large (typically p > 10⁵). One main drawback that limited their applicability in many medical imaging applications (LeCun et al., 2015) (and more broadly in biomedicine) is their need for pre-selected features, designed manually or automatically (e.g., through feature engineering). As opposed to SML methods, DL, and in particular Convolutional Neural Networks (CNNs), can automatically learn from raw data a hierarchical representation of features relevant to the task at hand (e.g., classification or regression). They have shown impressive results on supervised and unsupervised learning problems, both on natural and medical images, by learning a high abstraction of the data in a layer-wise manner. Several studies have started to benchmark such models on functional
Fig. 1. New paradigm for discriminating psychiatric disorders at the subject level. In a pre-training phase, a non-linear DNN is trained to learn a low-dimensional embedding from a large brain imaging dataset of healthy controls, discovering the general variability associated with non-specific variables such as age and sex. This pre-training can be performed either with (i) self-supervised tasks (e.g., contrastive learning Dufumier et al., 2021a; Chen et al., 2020), (ii) generative modeling (e.g., VAE Kingma and Welling, 2014), or (iii) discriminative tasks (e.g., age prediction Bashyam et al., 2020). In the second step, the model is initialized with the pre-trained weights and fine-tuned to discriminate between patients and controls. Our main hypothesis is that the representation learned during pre-training will allow easier discovery of the specific variability associated with the pathology of interest (e.g., abnormal cortical atrophy in temporal and pre-frontal regions for schizophrenia or ASD).
brain imaging (He et al., 2020a; Mellema et al., 2022) and cortical data (Dahan et al., 2022) for phenotype prediction, some of them showing improvement of DL over SML (Mellema et al., 2022). However, as noted in several recent studies (Abrol et al., 2021; Schulz et al., 2020; He et al., 2018; Quaak et al., 2021; Vieira et al., 2020), the benefit of using DL on anatomical brain MRI data for single-subject prediction (required for psychiatric disorder diagnosis or prognosis) is unclear, and a careful and extensive comparison with simple regularized linear models and kernel methods is still missing. For bipolar disorder (Hibar et al., 2018) and schizophrenia (Van Erp et al., 2016), several very large meta-analyses led by the ENIGMA consortium have shown significant variations in cortical regions, including prefrontal, anterior temporal, and insula cortices, visible in structural neuroimaging. More fine-grained analysis at the voxel level is required to improve the diagnosis and prognosis accuracy of machine learning models at the subject level (Nunes et al., 2020). For ASD (Van Rooij et al., 2018), smaller subcortical volumes of the pallidum, putamen, amygdala, and nucleus accumbens, as well as increased cortical thickness in the frontal cortex and decreased thickness in the temporal cortex, were observed from structural brain imaging. Only a case-control study was performed in this case, and more effort is required at the single-subject level to investigate anatomical brain abnormalities and their link to behavior.
In a recent study (Schulz et al., 2020) based on the UK Biobank dataset (Bycroft et al., 2018) (UKB), Schulz et al. studied whether the two main priors encoded in current CNNs, namely translational invariance (derived from the convolution operation) and compositionality (derived from their hierarchical structure), can be exploited to capture non-linear dependencies in structural/functional Magnetic Resonance Imaging (sMRI/fMRI) data for individual prediction tasks. In particular, they showed that SML and DL models have a similar scaling trend, even in the large-scale regime (N = 8k), on both modalities (sMRI and fMRI) for a variety of tasks (age and sex prediction, but also fluid intelligence or household income prediction). However, these results contradict the ones obtained by Peng et al. (2021) on both the Predictive Analytics Competition (Fisch et al., 2021) and UKB, as noted by Abrol et al. (2021). Specifically, Abrol et al. pointed out some technical flaws in the work of Schulz et al. that heavily affected their conclusions. The main shortcomings were the feature selection step performed for both SML and DL (with an arbitrary number of reduced dimensions) and the use of a single central brain slice in their main experiments, limiting DL representation capacity. On the contrary, in their study (Abrol et al., 2021), Abrol et al. performed feature selection only for SML models, and they used a whole-brain approach for DL. They found a significantly better scaling trend for DL on UKB with training samples ranging from N ≈ 2000 to N = 10⁴, and they attributed the performance drop in the work of Schulz et al. to a coding bug. Moreover, they also found a small but significant increase in performance on the Mini-Mental State Examination (MMSE) regression task (N = 428, −0.07 of MAE, Mean Absolute Error, for DL vs. SML), which might be in contradiction with a recent benchmark (Wen et al., 2020) on Alzheimer's detection that found no significant differences between SML and DL. While this score indicates Alzheimer's disease severity, it does not translate into an Alzheimer's diagnosis (Dinomais et al., 2016), which may explain the different findings. Finally, they suggested that DL can consistently extract robust brain representations according to different saliency map techniques, showing consistent patterns across runs and saliency methods for age and sex prediction.
Nonetheless, the past literature comparing DL and SML with neuroimaging data still has several limitations that we highlight here.
Limited number of prediction tasks. First, most recent papers (Abrol et al., 2021; Schulz et al., 2020; Peng et al., 2021) have mainly focused their analysis on age and sex prediction in the healthy population. While age regression has become an important research field for many research questions (discovery of new biomarkers for psychiatric disorders or neurocognitive impairment with the brain age gap (Koutsouleris et al., 2014; Cole and Franke, 2017; Cole et al., 2018; Jonsson et al., 2019) or normative modeling (Marquand et al., 2019; Zabihi et al., 2020; Wolfers et al., 2018)), DL evaluation on psychiatric disorder classification is (also) urgently required. The advances made in the ML field are remarkable, and the availability of large-scale neuroimaging data previously inaccessible to research gives a unique opportunity to study these clinical tasks. The question of whether non-linearities can be captured in highly heterogeneous clinical cohorts, including patients with schizophrenia (Wolfers et al., 2018; Koutsouleris et al., 2014) (SCZ), Bipolar Disorder (Wolfers et al., 2018) (BD), and Autism Spectrum Disorders (Zabihi et al., 2020) (ASD), is still debated, and no clear consensus arises (Salvador et al., 2017; Wen et al., 2020; Quaak et al., 2021). This is mainly due to the small sample size of current datasets (typically N < 10³), which causes ML models to over-fit and biases the neuroimaging community towards over-optimistic results (Varoquaux, 2018; Schnack and Kahn, 2016; Kambeitz et al., 2018; Pulini et al., 2019; Flint et al., 2021). These disorders involve subtle anatomical atrophies/hypertrophies in cortical and subcortical structures, and their identification is still a difficult challenge.
No replication on external multi-site data. Second, both Abrol et al. and Schulz et al. have based their analysis mainly on a unique homogeneous (i.e., single-site and single-scanner model) dataset (UKB) that does not reflect the inevitable heterogeneity of emerging large multi-site and multi-scanner clinical data collections (e.g., ABIDE, ABCD, SCHIZCONNECT, etc.). As such, a comprehensive complementary benchmark on phenotype prediction with large-scale multi-site datasets is required. As noted in a recent study (Koppe et al., 2021), since DL has an exceptional capacity to learn any function (even random noise Zhang et al., 2021a), it can also learn ''disease-irrelevant site-specific characteristics'', and its generalization capacity on data acquired at never-seen sites must also be reported.
No evaluation on ''raw'' data. Third, previous studies (Abrol et al., 2021) argued that DL models should be evaluated on voxel-level brain imaging data rather than ROI-based or slice-based MRI, as they were originally conceived to automatically extract features to perform complex tasks (LeCun et al., 2015). Previous studies (Peng et al., 2021; Abrol et al., 2021) have concentrated their effort on fully preprocessed voxel-based MRI. However, much less research has been devoted to the pre-processing pipeline and its impact on DL performance. Recent findings on brain age (Cole et al., 2017; Hwang et al., 2021; Wachinger et al., 2021; Peng et al., 2021) suggest that DL models perform similarly on raw images (with only linear registration and possibly non-brain tissue removal) and fully pre-processed ones (with gray matter extraction, non-linear diffeomorphic registration, and several bias correction steps, as performed with Voxel-Based Morphometry (VBM)), suggesting that CNNs do not extract extra information from raw data. This is a significant difference with classical vision tasks (e.g., ImageNet classification), since we know that automatic feature extraction of color, shape, and texture is the cornerstone of today's CNN performance. As a result, a fundamental question is whether the usual non-linear, computationally demanding pre-processing steps can remove non-linear discriminative information for brain disorders that could have been leveraged by DL (e.g., cortical folding patterns). This problem has rarely been addressed for mental disorders such as schizophrenia, bipolar disorder, and autism, especially with large multi-site studies (e.g., ABIDE, SCHIZCONNECT). Furthermore, several recent works (Cole et al., 2017; Hwang et al., 2021; Wachinger et al., 2021; Wen et al., 2020; Glocker et al., 2019) showed that the prediction capacity of CNNs on images from never-seen sites is worse when using raw data than VBM pre-processed data, for age prediction (Cole et al., 2017; Hwang et al., 2021; Wachinger et al., 2021), Alzheimer's diagnosis (Wen et al., 2020), and sex prediction (Glocker et al., 2019; Wachinger et al., 2021). This suggests that CNNs probably over-fit acquisition sites when using raw data rather than extracting discriminative information. This point is critical since most large-scale clinical datasets arising in the neuroimaging field are highly multi-centric (e.g., ABIDE, SCHIZCONNECT, ENIGMA Nunes et al., 2020).
No evaluation of transfer learning strategies. Finally, probably the most important difference between DL and SML models is the ability of the former to learn a generalizable representation from a large dataset that can be transferred to other tasks they were not trained on (i.e., Transfer Learning and Self-Supervised Learning). Initiated by the work of Caruana (1997) on transfer learning and multi-task learning, this idea was first successfully applied to natural images (Bengio, 2012; Yosinski et al., 2014) (by re-using features first learned on ImageNet (Deng et al., 2009), a large-scale dataset with more than 10⁶ images and 1000 categories), and then to medical datasets (Azizi et al., 2021; Zhou et al., 2021) (by pre-training a CNN on unlabeled medical images in a self-supervised manner). While this idea has been discussed in recent works (Abrol et al., 2021; Koppe et al., 2021) (considering the availability of large brain MRI datasets of healthy subjects, e.g., HCP Van Essen et al., 2013 or UKB Bycroft et al., 2018), very few studies (Dufumier et al., 2021a) have evaluated this approach on brain disorder classification tasks, remaining mainly limited to age and sex prediction tasks (Malik and Bzdok, 2022).
To summarize previous studies, there is no consensus on the superiority of deep learning for individual prediction tasks. While Schulz et al. (2020) only provided a partial analysis of age and sex prediction, Abrol et al. (2021) extended their findings on these two tasks, arguing that DL was able to outperform SML. Both works remained mainly limited to the same prediction tasks (age and sex prediction), and they provide empirical evidence from the same benchmarking resource (UKB). In this work, we propose to investigate more clinically relevant tasks using a different neuroimaging dataset to compare the learning capacity of DL against SML. We also aim to explore new learning strategies for DL based on Transfer Learning (TL), which has not been investigated in previous studies.

Contributions.
We propose to revisit and extend the analysis initiated in recent works (Schulz et al., 2020; Abrol et al., 2021) to large multi-site datasets. We perform extensive experiments to compare DL vs. linear and kernel-SVM (i.e., SML) models on five supervised tasks (age prediction, sex prediction, and brain disorder diagnosis) using one of the largest multi-site clinical datasets to date. We investigate the pre-processing of anatomical data (VBM and quasi-raw), data augmentation for DL models, dimensionality reduction for SML models (Gaussian Random Projection, Univariate Feature Selection, Recursive Feature Elimination), and cross-site generalization, both in the medium-scale (N ≈ 1k) and large-scale (N ≈ 10k) data regimes. Unlike previous literature, we also consider three main transfer learning strategies for mental disorder classification with DL, based on self-supervised pre-training, generative modeling, and supervised pre-training (see Fig. 1). Finally, we consider the Deep Ensemble technique to quantify uncertainty in deep models and analyze its impact on prediction.
In summary, in this work, we are interested in digging into key questions for neuroimaging: can current SOTA DL models extract non-linearities from highly multi-center brain disorder datasets? How do they scale compared to standard machine learning models? Can we transfer a brain representation of the healthy population to better discriminate patients with mental disorders?

Data
All data have been collected through various data-sharing initiatives, consortiums, and platforms that can be consulted in the dedicated papers and webpages accessible through hyperlinks in Table 1. We have reported the most important demographic information in Table 1 for all datasets. Importantly, since we acknowledge that reproducibility is critical for all ML/DL studies, we have also integrated the recently released OpenBHB dataset (Dufumier et al., 2022), which can be found here. The testing splits used for both age and sex prediction are defined using only data from OpenBHB, for reproducibility purposes, as described in Section 2.4. VBM pre-processing is performed with CAT12 (Gaser and Dahnke, 2016) from the SPM toolbox; it essentially consists of noise and bias-field correction followed by Gray Matter (GM), White Matter (WM), and Cerebrospinal Fluid (CSF) segmentation. Images are non-linearly aligned to the MNI template with DARTEL (Ashburner, 2007) and modulated using the Jacobian deformation field map. All sMRI scans are re-sampled to an isotropic 1.5 mm³ spatial resolution with dimension 121 × 145 × 121 using a linear spline interpolation. Going to a higher spatial resolution would have induced a higher computational burden and, considering the differences in scanner parameters in our cohorts (e.g., permanent magnetic field), we decided to fix this resolution for all images. We also normalized all images using the Total Intracranial Volume (TIV) estimated by CAT12 to account for the (irrelevant) differences in head size. We applied a visual quality check to all pre-processed images to remove poorly segmented images or images with obvious MRI artifacts.
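As a minimal illustrative sketch of the head-size normalization step described above (the exact scaling used with the CAT12-estimated TIV is an assumption here; `reference_tiv_mm3` is a hypothetical parameter):

```python
import numpy as np

def normalize_by_tiv(gm_map, tiv_mm3, reference_tiv_mm3=1500.0):
    # Hypothetical sketch: rescale voxel intensities by the ratio of a
    # reference head size to the subject's Total Intracranial Volume
    # (TIV), so that head-size differences no longer drive the signal.
    return gm_map * (reference_tiv_mm3 / tiv_mm3)

vol = np.ones((4, 4, 4))
vol_norm = normalize_by_tiv(vol, tiv_mm3=1600.0)
```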

Quasi-raw
As opposed to VBM, quasi-raw pre-processing was designed to be minimal. Only essential steps have been kept to map the images from different sites and scanners to the same space with the same resolution, and only necessary image correction steps have been applied. Specifically, each scan is rigidly re-oriented to the MNI space and then re-sampled to a 1.5 mm³ spatial resolution through a linear spline interpolation. The bias field is corrected using the N4ITK algorithm (Tustison et al., 2010) from ANTs (Avants et al., 2009), and the brain is extracted with BET2 (Jenkinson et al., 2005) (the skull and non-brain tissues are removed). Each image is linearly registered (9 degrees of freedom) to the MNI template with FLIRT from FSL (Jenkinson and Smith, 2001). During the training of DL models, we normalize each quasi-raw image by subtracting its mean and dividing by its standard deviation computed across the voxels in each volume.
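The per-volume intensity normalization described in the last sentence can be sketched as follows (a small `eps` guard against a zero standard deviation is our addition):

```python
import numpy as np

def zscore_volume(vol, eps=1e-8):
    # Per-volume normalization for quasi-raw DL training: subtract the
    # volume's mean and divide by its standard deviation, both computed
    # over all voxels of that single volume.
    vol = vol.astype(np.float64)
    return (vol - vol.mean()) / (vol.std() + eps)

v = zscore_volume(np.arange(27.0).reshape(3, 3, 3))
```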

Machine learning pipeline for phenotype prediction
First, we want to confirm the results obtained by several studies (Peng et al., 2021; Abrol et al., 2021) on age and sex prediction from anatomical data as we increase the number of training samples N, for both DL and SML, but with several key differences: (i) we do not apply feature selection (Salvador et al., 2017; Chu et al., 2012) for either DL or SML (this point is studied in depth in Section 3.1.4); (ii) we separately predict age and sex to avoid arbitrary age discretization; (iii) we assess the generalization performance on an external test set, including never-seen sites, and an internal test set stratified on age, sex, and site (see Section 2.4 hereafter). The use of an external test set allows us to give unbiased results since the model cannot make predictions based on confounding variables related to site information (Wachinger et al., 2021). Then, we explore DL performance compared to SML models on three increasingly difficult binary classification tasks for psychiatric diagnosis, including patients with schizophrenia, bipolar disorder, and ASD. Importantly, these three tasks do not have the same difficulty (at least with SML (Salvador et al., 2017; Eslami et al., 2021)), and one might expect improvement with non-linear models on more challenging tasks where SML models under-perform (e.g., in ASD (Hoogman et al., 2022)). We pooled a large number (19) of datasets covering a wide age range (from childhood to elderhood) and balanced between males and females (see Section 2.1).

SML models
We considered two linear models, with ℓ2 and ℓ1 + ℓ2 penalization, to promote parsimonious and shrunk solutions, along with a Radial-Basis Function kernel SVM (rbf-SVM). These models have been commonly used in the literature (Abrol et al., 2021; Schulz et al., 2020; Salvador et al., 2017) and consistently resulted in similar performance, even when additional kernel functions were included during cross-validation (e.g., polynomial or sigmoidal).
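In scikit-learn (the library used later in this paper), the three model families above could be instantiated as sketched below; the toy random features standing in for flattened voxel data are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Hypothetical toy features standing in for flattened voxel data.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 100))
y = rng.integers(0, 2, size=60)

models = {
    # l2-penalized linear classifier
    "l2": LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
    # l1 + l2 (elastic-net) penalty; requires the saga solver
    "l1+l2": LogisticRegression(penalty="elasticnet", l1_ratio=0.5,
                                C=1.0, solver="saga", max_iter=1000),
    # Radial-Basis Function kernel SVM
    "rbf-svm": SVC(kernel="rbf", gamma="scale"),
}
predictions = {name: m.fit(X, y).predict(X) for name, m in models.items()}
```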

DL architectures
Due to the lack of standard benchmarks in the neuroimaging field, there is still no consensus about the DL architectures adapted to our downstream tasks. We focused our analysis on SOTA CNN architectures and Transformers, as they consistently resulted in top performance on image recognition tasks. Specifically, we chose a classical 3D-AlexNet (Krizhevsky et al., 2012) architecture, as defined by Abrol et al. (2021), consisting of five convolutional layers. This network was called ''DL1'' by Abrol et al. and was used in most of their experiments. To leverage recent advances in the DL field, we also retained 3D-ResNet18 (He et al., 2016) and 3D-DenseNet121 (Huang et al., 2017), similar to recent works that have used structural neuroimaging data (Dufumier et al., 2021a). The latter network has 121 layers and is the deepest network used in this paper. Finally, we also compared a Transformer-based architecture and smaller CNN backbones in Supplementary E, but they systematically underperformed compared to the three models selected in this study. All networks have been implemented in Python, and the code is available here.

Cross-validation procedure and training splits
For age regression and sex prediction, we have built a multi-site dataset including both OpenBHB (see Table 1) -a public dataset that can be accessed without further authorization- along with more restricted datasets: HCP (Van Essen et al., 2013), OASIS 3 (LaMontagne et al., 2019) (only Healthy Controls, HC), ICBM (Mazziotta et al., 2001), BIOBD (Sarrazin et al., 2018) (only HC), SCHIZCONNECT-VIP1 (only HC), and BSNIP (Tamminga et al., 2014) (only HC). Eventually, we gathered N = 11,210 scans from 8679 participants across 99 sites. We first derived an external test dataset from MPI-Leipzig and NAR (640 scans from 619 participants distributed across the lifespan, from 3 sites). Then, from OpenBHB, we derived an age/sex/site-stratified internal test dataset and a stratified validation dataset with, respectively, 662 scans from 480 participants and 655 scans from 482 participants. The remaining training set includes N = 9253 scans from 7098 participants. Importantly, each participant appears in only one split, so that we avoid any data leakage from the validation/test sets. We chose to use validation/test sets only from OpenBHB to promote reproducibility in our work. Finally, we sub-sampled this training set in a stratified manner (on age, sex, and site) in order to compute performance at varying training sample sizes (N ∈ [100, 500, 1000, 3000, 5000, 9253]) for both age and sex prediction, using a Monte-Carlo Cross-Validation (CV) procedure, similarly to Abrol et al. (2021) and Schulz et al. (2020). We repeated this sub-sampling five times for N ≤ 500 and three times otherwise to keep a reasonable computational budget while still deriving a consistent estimator of the classifiers' performance. Regarding schizophrenia, bipolar disorder, and autism detection, we detail the splits used in Table 2. We used the same splits for all models (SML and DL) and repeated each experiment 30 times, using different random initializations and reporting the average and standard deviation.
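The repeated stratified sub-sampling used for the Monte-Carlo CV procedure can be sketched as follows; the combined stratification label (one code per sample mixing age bin, sex, and site) is an assumption of this illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def monte_carlo_subsamples(n_total, strata, n_train, n_repeats, seed=0):
    # Repeatedly draw a stratified training subset of size n_train,
    # mimicking Monte-Carlo CV at varying training-set sizes. `strata`
    # is one label per sample (e.g., a combined age-bin/sex/site code).
    indices = np.arange(n_total)
    draws = []
    for r in range(n_repeats):
        subset, _ = train_test_split(indices, train_size=n_train,
                                     stratify=strata, random_state=seed + r)
        draws.append(np.sort(subset))
    return draws

strata = np.repeat([0, 1, 2, 3], 50)            # 200 samples, 4 strata
draws = monte_carlo_subsamples(200, strata, n_train=100, n_repeats=5)
```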

DL and SML training
We performed a grid search for SML models to choose the best hyper-parameter values, using the full training set for all tasks. Specifically, for Logistic and Ridge Regression, we tuned the regularization strength within the values [10⁻¹, 1, 10, 10², 10³], and for ElasticNet, we also tuned the ℓ1 ratio within the values [0.1, 0.5, 0.9]. As for the rbf-SVM, we tuned the gamma parameter within the values [10⁻¹, 1, 10, 100] for both classification and regression problems.
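A grid search over these values could be set up in scikit-learn as sketched below. Note that scikit-learn's `C` is the *inverse* of the penalty strength, so the mapping to the paper's regularization values is loose, and the toy data are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 30))
y = rng.integers(0, 2, size=90)

# Grids mirroring the paper's search: a regularization value in
# [1e-1, 1, 10, 1e2, 1e3] and an l1 ratio in [0.1, 0.5, 0.9]
# for the elastic-net model.
grid = GridSearchCV(
    LogisticRegression(penalty="elasticnet", solver="saga", max_iter=2000),
    param_grid={"C": [0.1, 1.0, 10.0, 100.0, 1000.0],
                "l1_ratio": [0.1, 0.5, 0.9]},
    cv=3)
grid.fit(X, y)
```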
We implemented all DNNs with the PyTorch (Paszke et al., 2019) library and SML models with the scikit-learn library (Pedregosa et al., 2011). Similarly to Abrol et al. (2021), we used the Adam (Kingma and Ba, 2015) optimizer to perform Stochastic Gradient Descent (SGD) with a weight decay fixed to 10⁻⁵. We tuned the learning rate α within the values [10⁻³, 10⁻⁴, 10⁻⁵] for all regression and classification tasks, each time with the maximum number of training samples, finding that α = 10⁻⁴ was a good value for all DNNs. We then cross-validated the decay factor γ ∈ [0.2, 0.4, 0.8], by which the initial learning rate α is decreased every ten epochs, for all DNNs and tasks. For computational reasons, we set the batch size to 32. We optimized all DNNs for 300 epochs on age and sex prediction and 100 epochs for diagnosis classification. While we did our best to cross-validate critical hyper-parameters for DL models, we could not reasonably test all hyper-parameters with grid search (e.g., non-linearities, optimizers, etc.). This is a fundamental challenge when working with DL, since we optimize highly non-convex functions with many local minima. It motivated the emergence of standard benchmarks in computer vision (such as ImageNet) that allow easy reproducibility and comparison between SOTA models. Such a benchmark is urgently needed for the neuroimaging community; meanwhile, we did our best to obtain strong baselines for all SML and DL models (in line with recent studies on the same topic Abrol et al., 2021; Schulz et al., 2020; Salvador et al., 2017; Nunes et al., 2020).
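The step-decay learning-rate schedule described above can be written as a small closed-form helper; we assume here that "decreasing" the rate every ten epochs means multiplying it by the cross-validated factor γ:

```python
def step_decay_lr(alpha, gamma, epoch, step_every=10):
    # Assumed schedule: multiply the initial learning rate alpha by the
    # cross-validated factor gamma once every `step_every` epochs.
    return alpha * gamma ** (epoch // step_every)
```

For example, with α = 10⁻⁴ and γ = 0.2, the rate drops to 2 × 10⁻⁵ at epoch 10 and 4 × 10⁻⁶ at epoch 20.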

VBM vs quasi-raw pre-processing for brain imaging analysis
In this study, we compare two main pre-processing pipelines: Voxel-Based Morphometry (VBM) and quasi-raw. VBM data provide volumetric information about gray matter density in each voxel, which is a good predictor of phenotype (Abrol et al., 2021; Schulz et al., 2020; Peng et al., 2021; Jonsson et al., 2019). However, original raw MR images may contain more information than VBM, in particular related to cortical folding patterns, which may be predictive of psychiatric disorders (e.g., the gyrification index Sasabayashi et al., 2021). This suggests that raw images could bring more discriminative information than VBM images. We aim to elucidate whether DNNs can extract such complementary patterns and consequently achieve better performance. First, we evaluate DL models on VBM data and quasi-raw data on our five benchmarking tasks by following the analysis pipeline described in Section 2.3.
Then, we hypothesize that the domain gap between internal and external tests for age prediction is more pronounced for raw data than for VBM pre-processed data. To check this hypothesis, we plot both raw and VBM pre-processed images (from the internal and external test sets) encoded by a DenseNet trained on age prediction with the largest sample size available (N = 9253). We use t-SNE (Van der Maaten and Hinton, 2008) to map the embedded images to 2D representations.
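The t-SNE mapping of network embeddings to 2D can be sketched with scikit-learn; the random embeddings below are a hypothetical stand-in for the DenseNet's learned representations:

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical penultimate-layer DenseNet embeddings for test images.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 64))

# Map the high-dimensional embeddings to 2D for visualization; internal
# and external test points can then be plotted with different colors.
coords_2d = TSNE(n_components=2, perplexity=20,
                 random_state=0).fit_transform(embeddings)
```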
Finally, we perform an indirect test to check whether noise induced by the scanner explains the discrepancy in results obtained from DNNs trained on VBM vs quasi-raw data for psychiatric disorders. From a network trained to predict a given psychiatric condition with a given pre-processing (VBM or quasi-raw), we train a linear classifier to predict the acquisition site from the network representation. We hypothesize that the acquisition site is much more easily decodable from quasi-raw data than from VBM data when embedded by a DNN trained on clinical tasks.
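This linear site-decoding probe can be sketched as follows; the embeddings and site labels are hypothetical placeholders for the frozen network's representations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical frozen DNN embeddings and the acquisition site of each scan.
rng = np.random.default_rng(0)
Z = rng.normal(size=(120, 32))
sites = rng.integers(0, 4, size=120)

# Train a linear probe to decode the acquisition site from the
# representation; high cross-validated accuracy would suggest the network
# retained scanner/site information rather than purely biological signal.
probe = LogisticRegression(max_iter=1000)
site_acc = cross_val_score(probe, Z, sites, cv=3).mean()
```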

Data augmentation
Considering the small sample size (typically N ≈ 1k) and high input dimensionality of brain images (> 1M voxels) in clinical datasets, data augmentation should provide a simple way to artificially increase the dataset size, limit over-fitting, and improve performance. From the vicinal risk minimization point of view, Chapelle et al. (2001) showed that it can be seen as a regularization technique that imposes invariance to given transformations for a prediction task. We evaluate five standard augmentations for all psychiatric disorder classification tasks: affine transformation (with both rotation and translation), flip, random Gaussian noise, cropping, and cutout. For each strategy, we test both strong and light augmentations. As noted by Hernández-García et al. (2018), strong augmentations produce more biologically plausible representations than light augmentations, because they generate examples that DNNs should explore for good generalization on test images, exploiting domain knowledge. The cross-validated hyper-parameters are indicated in Supplementary Table 8. We first evaluate these augmentations on VBM data. We might also hypothesize that these transformations are better suited to quasi-raw images, since the latter are only linearly registered to the MNI template and could be noisier than VBM images. To test this hypothesis, we apply a random combination of all previous augmentations (cutout, crop, affine, Gaussian noise, flip), with a probability of 50% for each transformation, to quasi-raw images. We report the performance for light and strong augmentations with the same hyper-parameters as in Supplementary Table 8, and we compare them to baseline results without augmentation on VBM and quasi-raw data. We use the DenseNet121 backbone in all these experiments.
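The random combination of augmentations, each applied with 50% probability, can be sketched on a 3D volume as follows (only flip, Gaussian noise, and cutout are shown; the noise level and cutout size are illustrative, not the paper's tuned values):

```python
import numpy as np

def augment(vol, rng, p=0.5, noise_sd=0.1, cutout=8):
    # Apply a random combination of simple augmentations, each drawn
    # independently with probability p.
    vol = vol.copy()
    if rng.random() < p:                      # left-right flip
        vol = vol[::-1, :, :].copy()
    if rng.random() < p:                      # additive Gaussian noise
        vol = vol + rng.normal(0.0, noise_sd, size=vol.shape)
    if rng.random() < p:                      # cutout: zero a random cube
        c = [rng.integers(0, d - cutout) for d in vol.shape]
        vol[c[0]:c[0] + cutout, c[1]:c[1] + cutout, c[2]:c[2] + cutout] = 0.0
    return vol

rng = np.random.default_rng(0)
out = augment(np.ones((16, 16, 16)), rng)
```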

Data harmonization with ComBat and linear adjusted regression
As reported in several multi-site studies (Wachinger et al., 2021; Glocker et al., 2019), the high heterogeneity between scanners and acquisition protocols leads ML models to under-perform on cross-site images (i.e., images coming from sites other than the ones used during training). This also explains why we carefully introduced an external test set to evaluate the generalization performance of the models. Here, we leverage two SOTA harmonization methods to remove non-biological variance: ComBat (Johnson et al., 2007; Fortin et al., 2018) and Linear Adjusted Regression. These two methods directly harmonize the data without changing the model (as opposed to recent methods (Dinsdale et al., 2021) that act on DL representations), allowing for a fair comparison between SML and DL methods. Both ComBat and Linear Adjusted Regression need image statistics on all sites to remove site information. However, in our case, only the training and internal test sets contain the same sites, so we only residualized these two sets, leaving the external test set unchanged.
Linear Adjusted Regression is a linear harmonization method that tries to preserve the biological variability in the data while removing non-biological effects (such as the site effect). The model itself can be expressed as (Wachinger et al., 2021):

y_{ijv} = α_v + x_{ij}^T β_v + γ_{iv} + ε_{ijv}

where y_{ijv} is the value of feature (voxel) v for subject j acquired at site i, x_{ij} are the biological covariates, γ_{iv} is the additive site effect, and ε_{ijv} is the residual noise; harmonized data are obtained by subtracting the estimated site effect γ̂_{iv}. ComBat (Fortin et al., 2018) adds a multiplicative site effect δ_{iv} on the residual noise, which leads to a different residualization scheme that also requires the biological variables x_{ij}:

y_{ijv} = α_v + x_{ij}^T β_v + γ_{iv} + δ_{iv} ε_{ijv}

These models generally require access to all imaging sites during training. In our experimental design, this was possible only when using the internal test set but not when using the independent external test set. To avoid possible data leakage during residualization, we propose to set δ_{iv} = 1 and γ_{iv} = 0 for all unknown test sites i in both linear adjusted regression and ComBat. This is not ideal, and other DL-based solutions (Dinsdale et al., 2021; Torbati et al., 2021) are starting to emerge in the literature, but there is still no consensus, and most current studies use ComBat or Linear Adjusted Regression (Radua et al., 2020; Ball et al., 2021).
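A minimal sketch of the additive (linear-adjustment) residualization is given below: each feature is regressed on the biological covariates plus one-hot site indicators, and only the fitted site component is subtracted. Real implementations (e.g., the ComBat reference code) also restore a pooled intercept and apply an empirical-Bayes shrinkage step; both are omitted in this illustration.

```python
import numpy as np

def residualize_site(Y, site_onehot, covars):
    # Fit, per feature, Y ~ covariates + site indicators by least
    # squares, then subtract only the estimated site effect so that the
    # biological (covariate-related) variability is preserved.
    X = np.hstack([covars, site_onehot])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    k = covars.shape[1]
    return Y - site_onehot @ beta[k:]

rng = np.random.default_rng(0)
covars = rng.normal(size=(100, 2))                 # e.g., age and sex
site_onehot = np.eye(4)[rng.integers(0, 4, size=100)]
Y = rng.normal(size=(100, 10)) + site_onehot @ rng.normal(size=(4, 10))
Y_harmonized = residualize_site(Y, site_onehot, covars)
```

After residualization, refitting the same design on the harmonized data yields site coefficients close to zero, which is a simple sanity check.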

Dimensionality reduction for SML models
Previous studies (Schulz et al., 2020; Abrol et al., 2021) argued that dimensionality reduction was a necessary step for SML models to limit over-fitting and work properly (especially considering the very high dimensionality of 3D MRI, p > 300K). We carefully reproduce the experimental design of these studies (same reduced feature dimension, p = 784, and same reduction methods) on BHB-10K to test this hypothesis.
Specifically, we use three different feature selection methods, Gaussian Random Projection (GRP), Recursive Feature Elimination (RFE), and Univariate Feature Selection (UFS), following Abrol et al. (2021) and Schulz et al. (2020). As opposed to RFE and UFS, GRP is an unsupervised reduction method that applies a random matrix to the data and preserves the Euclidean distance between points, up to an error $\epsilon$ depending on the number of features selected. We evaluate these strategies on our five prediction tasks across both internal and external test sets. As a direct comparison with the current literature, we also evaluate the reduction methods on UKBioBank (Bycroft et al., 2018) for the age regression task.
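GRP's distance-preservation property is easy to illustrate: projecting onto $k$ random Gaussian directions approximately preserves pairwise Euclidean distances, with the error shrinking as $k$ grows (a minimal sketch; scikit-learn's `GaussianRandomProjection` provides an equivalent, production-ready version):

```python
import numpy as np

def gaussian_random_projection(X, k, seed=0):
    """Project X (n, p) onto k random Gaussian directions.
    By the Johnson-Lindenstrauss lemma, pairwise distances are
    preserved up to a (1 +/- eps) factor, with eps ~ O(1/sqrt(k))."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    # variance 1/k keeps expected squared norms unchanged after projection
    R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(p, k))
    return X @ R
```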

DL and SML models interpretation
While DL models are often considered ''black box'' models, several interpretability methods have been proposed over the years to highlight the image areas that were important for the model's decision (see Zhang et al., 2021b for a comprehensive survey). Here, we aim to elucidate whether DL models (trained from scratch) and linear models make their decisions based on the same brain region patterns, which is a critical question for precision psychiatry.
In this regard, linear models are much simpler to interpret since we have direct access to the weighted maps (or importance maps, Ball et al., 2021). In a weighted map, each weight is associated with a unique input feature. Higher absolute weight values indicate a stronger influence of the corresponding input features on the final prediction score. In particular, in a clinical context with anatomical images, hypertrophy (resp. atrophy) in regions with high positive (resp. negative) weights translates into a stronger brain signature for a given pathology, i.e., a higher predictive score.
To generalize to the non-linear case, we chose a gradient-based method (Simonyan et al., 2014) for DL model interpretability. This sensitivity analysis computes the gradient of the predicted output w.r.t. each input voxel (i.e., it quantifies how much the output prediction varies with the input voxel value). More sophisticated gradient-based methods have been proposed over the years, but they do not necessarily result in more accurate saliency maps (Adebayo et al., 2018). Similarly to Abrol et al. (2021), we compute brain region importance maps using the Automated Anatomical Labeling (AAL) atlas (Rolls et al., 2020) for each model trained with the maximum number of samples on each task. Specifically, a weighted map is computed through sensitivity analysis for each input image, and the absolute values are summed per region. The resulting importance map is normalized so that it sums to one. Finally, all importance maps for each test set (internal and external) are averaged. We compute the correlation matrix between all averaged maps to compare the region importance obtained with SML and DL models.
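The per-region aggregation step described above can be sketched as follows, assuming a voxel-wise saliency map and an integer-labeled atlas volume (hypothetical inputs, not the study's code):

```python
import numpy as np

def region_importance(saliency, atlas, n_regions):
    """Aggregate a voxel-wise saliency map into a per-region
    importance vector: sum of absolute saliency per atlas label,
    normalized to sum to one (as done here with the AAL atlas)."""
    imp = np.array([np.abs(saliency[atlas == r]).sum()
                    for r in range(1, n_regions + 1)])
    return imp / imp.sum()
```

Importance vectors from different models can then be compared directly through their correlation, as in the matrix of Fig. 6.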

Deep ensemble learning
Deep ensemble for DNN uncertainty quantification. In a real-world scenario where an AI tool is implemented in a hospital, knowing the uncertainty associated with a prediction is crucial. First, it allows the clinician to trust (or not) the system. Second, an over-confident system could strongly bias an expert's opinion towards incorrect predictions. Third, knowing when a prediction is likely to be incorrect (e.g., for out-of-domain images) may improve performance since it allows the system to "go beyond binary statements on existence vs. non-existence of an effect; and afford credibility estimates around all model parameters at play, which thus enable single-subject predictions with rigorous uncertainty intervals" (Bzdok et al., 2020). In this regard, Bayesian models (such as MC-Dropout, Gal and Ghahramani, 2016) and Deep Ensemble learning (Lakshminarayanan et al., 2017) have been developed to quantify the predictive uncertainty of DNNs. A recent benchmark (Gustafsson et al., 2020) has shown the superiority of the latter over the former, and considering its simplicity, we adopted this framework for brain disorder classification.
The previous deep models do not integrate any notion of uncertainty in their predictions. Once trained, they estimate the predictive distribution $p(y|x, \mathcal{D})$ for any input image $x$, given the training set $\mathcal{D}$ (where $y$ represents the clinical status). However, modern DNNs tend to be over-confident in their predictions (Guo et al., 2017), strongly limiting their reliability and their clinical use. Gal (2016) introduced the notion of epistemic uncertainty to quantify the uncertainty associated with the model's weights $w$ inside DNNs. Lakshminarayanan et al. (2017) showed that Deep Ensemble provides a simple way to quantify this uncertainty by aggregating the outputs $p(y|x, w^{(k)})$ of several DNNs trained with Stochastic Gradient Descent (SGD) from different random initializations. The averaged distribution $\bar{p}(y|x, \mathcal{D}) = \frac{1}{K}\sum_{k=1}^{K} p(y|x, w^{(k)})$ for $K$ trained DNNs can be seen, from a Bayesian perspective, as an estimation of the posterior distribution $p(y|x, \mathcal{D})$ through Monte-Carlo sampling $w^{(k)} \sim p(w|\mathcal{D})$.
Implementation. As shown by Lee et al. (2015), deep ensemble learning with neural networks independently trained on the whole dataset benefits accuracy and calibration much more than bagging. As a result, in our study, we use the standard deep ensemble strategy often used in DL: we train each network with a different random seed and perform stochastic gradient descent on the whole training set. Then, for the regression task (resp. classification task), the output values (resp. probabilities computed after softmax) of all networks are averaged. This strategy encourages diversity in the learned DL representations without sharing weights between networks. While increasing the number $K$ of independently trained networks can increase this diversity, it is computationally costly. As a trade-off between performance, computational time, and memory, we fixed $K = 3$ in our experiments.
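A minimal sketch of the averaging step for classification, assuming $K$ sets of logits produced by independently trained networks (names are illustrative, not the study's code):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_per_model):
    """Average the post-softmax probabilities of K independently
    trained networks (standard deep ensemble for classification)."""
    probs = np.stack([softmax(l) for l in logits_per_model])
    return probs.mean(axis=0)
```

For regression, the same averaging is applied directly to the raw output values instead of the softmax probabilities.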

Pre-training strategies for transfer learning
Deep models have several key advantages over SML besides leveraging raw data. Since DL should be able to learn both low- and high-level imaging features relevant to a given task, it has been hypothesized that at least part of this information could be important for other tasks or domains. Transfer Learning (Caruana, 1997; Raina et al., 2007; Bengio, 2012; Yosinski et al., 2014) is grounded on this idea, and it has achieved good performance on both natural and medical images (Azizi et al., 2021; Mustafa et al., 2021; Yosinski et al., 2014). Closely related to this idea, in a recent study on resting-state fMRI, He et al. (2022) showed how an ML system trained to predict a large bank of phenotypes (e.g., cognition or blood biomarkers) can boost the prediction of a correlated, but distinct, set of phenotypes on UKBioBank (Bycroft et al., 2018). As suggested by a recent study (Abrol et al., 2021), in the large-scale data regime a DNN trained to predict phenotypic or demographic information can significantly outperform SML (e.g., for age regression). This suggests that non-linear patterns related to variables non-specific to pathology can be discovered from brain imaging. The discovery of these non-specific axes of variance should then allow learning, in a second phase, the specific variability associated with mental disorders.
We propose the new paradigm depicted in Fig. 1 to train a DNN to discriminate mental disorders from controls. In the first step, we pre-train a DNN on brain MRI of the healthy population (from childhood to elderhood) to learn a representation capturing the biological and environmental variability of the healthy brain. This can be achieved with a large-scale dataset. Then, in the second step, the network is fine-tuned to predict the mental condition from brain MRI. Our main assumption is that the representation learned during pre-training will help to discover the pathological variability related to specific mental conditions.
We explore five pre-training strategies to learn anatomical features from the healthy population before applying transfer learning to clinical datasets: (1) our proposed weakly self-supervised model that integrates the participant's age as auxiliary information, namely Age-Aware Contrastive Learning (Dufumier et al., 2021a); (2) self-supervised contrastive learning (SimCLR, Chen et al., 2020); (3) a SOTA self-supervised model for medical imaging based on context-based restoration (Model Genesis, Zhou et al., 2021); (4) a Variational AutoEncoder (VAE, Kingma and Welling, 2014), considered a SOTA generative model (easier to train than a GAN (Goodfellow et al., 2014) and integrating an encoder that can be fine-tuned); and (5) a discriminative supervised model trained on age prediction. Importantly, age information is only used during pre-training of the age-aware contrastive and supervised models, and it is never used during fine-tuning. All these models are pre-trained on OpenBHB (together with HCP, ICBM, and OASIS3 to increase the dataset size, and without ABIDE to avoid data leakage on ASD prediction). This dataset is international and highly multi-centric, promoting heterogeneity in the population under study as well as in image quality. To cross-validate the hyper-parameters, we derived the same validation set as for age and sex prediction (stratified on age, sex, and site). We provide a detailed description of these five strategies hereafter.
2.12.1. Self-supervised learning

Age-aware contrastive learning. To learn a brain representation of the healthy population, we developed a new self-supervised algorithm (Dufumier et al., 2021a), built on recent developments in contrastive learning (Chen et al., 2020; He et al., 2020b). In particular, this algorithm is able (i) to encode invariance to a set of image transformations T and (ii) to integrate phenotype information (in our case, the participant's chronological age) to enforce images with close phenotypes to have close representations in the DL space. The set T is chosen according to the exploratory work (Dufumier et al., 2021a) we performed on psychiatric disorders. In our case, T consists of random cutout, i.e., a black patch covering 1/16 of the input image is applied at a random location. Two brain images with small missing parts from the same individual still share most of their anatomical features. Consequently, property (i) enforces the encoder to map these two images to the same point in the representation space. To fulfill property (ii), we used a Radial Basis Function (RBF) kernel to measure the similarity between two chronological ages. We optimized the Age-Aware InfoNCE loss as described in Dufumier et al. (2021a); the kernel bandwidth $\sigma$ was cross-validated in {1, 2, 3, 5}. Similarly to our previous work (Dufumier et al., 2021a), we used DenseNet121 as the DL encoder and a 2-layer MLP as the non-linear projection head (see our code). We set the batch size to $b = 64$.
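The RBF similarity between chronological ages used for property (ii) can be sketched as follows ($\sigma$ denotes the bandwidth; a minimal illustration, not the released code):

```python
import numpy as np

def age_kernel(ages, sigma=1.0):
    """RBF similarity between chronological ages, used to weight
    pairs in Age-Aware contrastive learning: close ages get a
    similarity near 1, distant ages a similarity near 0."""
    d = ages[:, None] - ages[None, :]
    return np.exp(-d ** 2 / (2 * sigma ** 2))
```

Images of participants with close ages thus receive a high kernel weight and are pulled together in the representation space, while distant ages contribute as negatives.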
After pre-training with the Age-Aware InfoNCE loss, we fine-tuned the encoder on each downstream task by cross-validating the learning rate and scheduler hyper-parameters in the same way as for the DL models trained from scratch (see Section 2.5). A randomly initialized linear layer is added on top of the pre-trained encoder and trained end-to-end on each downstream task.
Contrastive learning. As a fair comparison with the previous algorithm, we also explored SimCLR (Chen et al., 2020), a SOTA contrastive learning model adapted here for brain MRI. Specifically, we used the same transformations T (based on cutout) during pre-training, and we trained it for 100 epochs. Since the pretext task is solved quickly (reaching 99% accuracy in less than 10 epochs), we fine-tuned the pre-trained model after (i) 10 epochs, (ii) 30 epochs, and (iii) 100 epochs, cross-validating the scheduler hyper-parameters during fine-tuning and setting the learning rate to $10^{-4}$. The best results were obtained with the model pre-trained for 10 epochs, suggesting a rapid over-fit on the training set (even though we reach ≈ 10k samples).
Context-based restoration. Context-based restoration is a distinct category of self-supervised models that emerged recently for medical imaging. It can be seen as a special case of denoising autoencoders (Vincent et al., 2008) for representation learning (like inpainting, Pathak et al., 2016), where the idea is to retrieve the original image from an artificially degraded version using an encoder-decoder neural network. This method mainly requires defining the degrading module transforming an input image into a degraded version. It is worth noting that degraded images need not be realistic, but should rather hide or transform important semantic information that can be deduced from the surrounding context (by analogy with Natural Language Processing, where a typical self-supervised task consists in retrieving a missing word in a sentence, Devlin et al., 2019). Model Genesis (Zhou et al., 2021) defines such a module and introduces different strategies to learn context, texture, and appearance. The original formulation leverages a UNet backbone (with skip connections between the encoder and decoder) to learn 3D image representations from medical images. We take the same original transformations and backbone to pre-train the network on the same brain MRI dataset as the other methods. We train it for 200 epochs using a learning rate of $10^{-4}$ and the Adam optimizer.
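A simplified version of such a degrading module is a patch-masking corruption on a 3D volume, after which the network must restore the hidden intensities from context (a hypothetical sketch; Model Genesis combines several richer transformations):

```python
import numpy as np

def degrade(img, n_patches=3, patch=4, seed=0):
    """Hide random cubic patches of a 3D volume. The pretext task is
    to restore the original intensities from the surrounding context
    with an encoder-decoder network."""
    rng = np.random.default_rng(seed)
    out = img.copy()
    for _ in range(n_patches):
        x, y, z = (rng.integers(0, s - patch) for s in img.shape)
        out[x:x + patch, y:y + patch, z:z + patch] = 0.0
    return out
```

Training then minimizes a reconstruction loss (e.g., MSE) between the network output on `degrade(img)` and the original `img`.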

Variational auto-encoder
A VAE (Kingma and Welling, 2014) is a generative model that uses an encoder-decoder architecture to (i) reconstruct an input image from its latent representation and (ii) impose a prior distribution in the latent space (generally a Gaussian distribution). Once trained, the VAE can be used either to generate new samples from the known prior distribution or to encode input images through its encoder. One main difficulty encountered during training is avoiding posterior collapse, where the posterior over the latent variable equals the prior (thus ignoring the input signal). This is notably due to the non-identifiability of the latent variable (Wang et al., 2021), caused in part by the model architecture. We used two strategies to avoid such behavior: (1) the encoder-decoder architecture is light, including only 5 convolutional blocks in the encoder (and a symmetric decoder with transposed convolutions); (2) a $\beta$-VAE (Higgins et al., 2016) objective function to restrict the parameter space. $\beta$ is chosen small ($\beta = 10^{-5}$), and the pre-trained model is validated using linear probing. Linear probing is a simple tool from the representation learning field. Here, it consists in training a linear layer on top of the pre-trained VAE encoder to predict the phenotypes (age and sex). We hypothesize that if these biological variables can be successfully predicted from the latent representation, then the VAE has learned transferable anatomical brain patterns. Ridge regression is used for predicting age and logistic regression for predicting sex, with a regularization term $\lambda \in \{10^{-2}, 10^{-1}, 1, 10^{1}, 10^{2}, 10^{3}\}$ cross-validated on the validation set.
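Linear probing with ridge regression on frozen latent features can be sketched in closed form (a minimal illustration with hypothetical inputs; the study cross-validates the regularization strength $\lambda$):

```python
import numpy as np

def linear_probe_ridge(Z_train, y_train, Z_test, lam=1.0):
    """Fit a ridge regression on frozen latent features Z (linear
    probing) via the closed-form normal equations, then predict on
    held-out features. The encoder itself stays frozen."""
    Zb = np.column_stack([np.ones(len(Z_train)), Z_train])
    d = Zb.shape[1]
    I = np.eye(d)
    I[0, 0] = 0.0  # do not penalize the intercept
    w = np.linalg.solve(Zb.T @ Zb + lam * I, Zb.T @ y_train)
    Zt = np.column_stack([np.ones(len(Z_test)), Z_test])
    return Zt @ w
```

If age (or sex, with a logistic probe) is well predicted this way, the frozen representation has retained the corresponding biological variability.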

Supervised learning
This pre-training strategy is the simplest but also the most widely used in transfer learning (Yosinski et al., 2014): the network is trained to predict a rich signal in a supervised manner on a large-scale database, assuming that the high-level semantic features will be re-used on downstream tasks. In our context, it consists in modeling normal brain aging by training a DNN (DenseNet121 here) to predict age from our large-scale dataset of HC. It has two crucial advantages over ImageNet pre-training: (i) there is no domain gap between natural and medical images, and (ii) we can directly transfer to 3D data using a 3D DNN. Recent studies (Raghu et al., 2019; Azizi et al., 2021) on transfer learning with medical images suggest that a domain gap can hurt performance.

Variance analysis
To better explain the performance of TL and Deep Ensemble, we hypothesize that pre-trained models do not escape from their initial basin in the loss landscape as randomly initialized models do (Neyshabur et al., 2020), leading to less variance during model optimization. We tested this hypothesis on SCZ vs. HC and BD vs. HC by training $N = 30$ independent DNNs on each task using the same training set each time but different initializations (random for the baseline and pre-trained for transfer and transfer + deep ensemble). We then compute the variance in performance across models every 50 epochs and report the standard deviation. We did not run this experiment for ASD vs. HC considering the computational cost (ASD is the largest clinical dataset in this study). The standard deviation is estimated using 30 independent measures for all tasks and models, except for transfer + deep ensemble, where it is estimated with 10 measures (since we aggregate three DNNs for each measure).

Models cannot use site-specific information for their prediction on this test set, eliminating a strong bias reported in the literature. For age and sex prediction, we performed a 5-fold (resp. 3-fold) Monte Carlo cross-validation sub-sampling procedure for $N_{train} \in \{100, 500\}$ (resp. $N_{train} \in \{1000, 3000, 5000, 9253\}$). As for the diagnosis classification tasks, each model is trained 30 times with different random initializations, and averages and standard deviations are reported. Mean Absolute Error (MAE) is the reference measure for age prediction, while Area Under the Curve (AUC) is the preferred metric for binary classification tasks since it does not depend on a particular threshold (it only measures a classifier's discriminative power). Overall, SML models perform equally well as DL models for sex prediction (up to $N_{train} = 9253$), SCZ vs. HC, BD vs. HC, and ASD vs. HC. SML and DL performance keeps improving for age prediction when increasing the number of training subjects $N_{train}$ on the external test. On the other hand, performance increases very slowly (almost a plateau) on the internal test starting from a few thousand training samples, with an important improvement for non-linear DL models over SML.

Comparable performance between DL trained from scratch and linear models on psychiatric disorders prediction
We start by evaluating the performance of DL and SML models on our five prediction tasks across multi-site datasets on VBM data. From Fig. 2, we observe very similar performance on all classification tasks (both sex prediction and diagnosis classification) across all models, even in the very large data regime ($N_{train} > 9000$ for sex prediction). Specifically, all models achieve almost perfect AUC scores (Area Under the Curve) on sex prediction on both test sets (AUC = 98.32% for Logistic Regression and AUC = 98.47% for DenseNet with $N_{train} = 9253$ on the external test set). While DenseNet is almost always the best-performing network for detecting schizophrenia, bipolar disorder, and autism, it achieves performance on par with Logistic ℓ2 and rbf-SVM, i.e., AUC ≈ 85% on SCZ vs. HC, ≈ 76% for BD vs. HC, and ≈ 65% on ASD vs. HC, on the internal test. DenseNet (like the other models) shows poor generalization performance on the external test, losing −10%, −5%, and −1% AUC for SCZ vs. HC, BD vs. HC, and ASD vs. HC, respectively.
As for age regression, we observe that DL outperforms SML only in the large-scale data regime ($N_{train} > 9$k) on the external test, e.g., a gap of 0.82 MAE between AlexNet and ElasticNet. On the internal test, DL always outperforms SML, in line with Peng et al. (2021) and Abrol et al. (2021). We obtain SOTA performance compared to previous studies (MAE = 2.36 ± 0.04), which validates the architectural design of the DL models (see Supplementary E for more experiments with Transformers). This discrepancy between internal and external tests suggests poor generalization performance on cross-site images due to a large over-fitting on the acquisition site (discussed hereafter).
To further validate our results on age prediction, we replicate our SML analysis pipeline on the UKBioBank dataset. In Supplementary A, we show that DL largely outperforms SML on age regression by 0.9 MAE with $N_{train} = 9253$, but it requires more than $N_{train} = 100$ samples to achieve such results on the external test, as opposed to what was found on the internal test (Peng et al., 2021). This confirms that our SML pipeline is competitive with the current literature on UKBioBank, and it extends previous results reported in the literature to cross-site generalization for age regression.

DL models underperform on raw data
We plot the performance of DL models trained and evaluated on quasi-raw data and compare them to the previous results obtained on VBM data. Fig. 3 shows that DL models under-perform on quasi-raw images compared to VBM data for all tasks and test sets at the current sample size ($N_{train} \leq 10$k). More specifically, we observe a degradation of the ROC-AUC by 1.6% for sex classification and of 0.25 MAE for age regression with $N_{train} = 9253$ for DenseNet and ResNet respectively, the best-performing models on these two tasks on the external test set. For the classification of psychiatric disorders, this effect is even more pronounced, with −14%, −4%, and −3% AUC on average between performance on VBM and quasi-raw data for schizophrenia, bipolar disorder, and autism, respectively, on the external test set. The only exception is the age prediction performance on the internal test, but these models still generalize poorly to external data compared with models trained on VBM data. This suggests that DL models over-fit more on acquisition sites with quasi-raw images than with VBM.
In the latent space of a DenseNet trained to regress age, we observe in Supplementary Fig. 8 a clear separation between embedded raw images coming from either the internal or the external test set (especially for middle-aged participants between 20 and 40 years old). This is not the case for VBM images, where inter- and intra-site embedded images overlap correctly in the latent space for a given age range (blue/orange and yellow/cyan). This greater difference (i.e., domain gap) between internal and external test sets for raw encoded images could explain the differences shown in Fig. 3 for age prediction, supporting the site over-fitting hypothesis. These results indicate that DL models fail at extracting more discriminative features from raw images than from fully pre-processed ones, even in the large-scale data regime.
Quantitatively, we show in Supplementary Table 12 that the acquisition site can be linearly decoded with 70%, 82%, and 48% balanced accuracy from the embedding of a DenseNet trained on quasi-raw data to classify SCZ, BD, and ASD, respectively. This is >40% higher than when the same network is trained on VBM data. It suggests that DNNs fail at compressing disease-related features from raw images and tend to rapidly over-fit on scanner-induced noise.

Data augmentation does not improve performance
To artificially increase the sample size of the clinical datasets, we study data augmentation on VBM data. Surprisingly, in Fig. 4, we do not observe significant improvements in performance for DL models. Augmentation can even degrade performance, using either light or strong augmentations, depending on the task. This suggests that the current augmentations are highly class-dependent.
We also report the performance on quasi-raw images in Supplementary Table 9. Again, we observe no improvement with the tested augmentations, except for SCZ vs. HC on the internal test, but it always remains far below the baselines on VBM data. Therefore, in the rest of this study, we only apply weight decay as a regularization technique, without data augmentation.

Data harmonization produces mixed results
To account for site-related effects on neuroimaging data, we explore the benefit of data harmonization for both SML and DL models. From Table 10 in Supplementary, we observe that data residualization does not bring improvement for DL models, while it marginally improves performance for SML with $N_{train} = 9253$ on age regression. This is not reproducible on the external tests (in line with the results obtained by Fortin et al. in the original ComBat study (Fortin et al., 2018) on age prediction). However, the difference is more pronounced on the psychiatric datasets, with a gain of 1-3% AUC overall on the three tasks with SML on the internal tests. On the external tests, the improvement is mixed, especially for BD vs. HC and ASD vs. HC. As for DL models, we observe a significant degradation in performance on both internal and external test sets, indicating that current residualization methods fail at preserving the non-linear biological variability extracted by DL models (in line with a recent study on Alzheimer's disease, An et al., 2022). We perform additional experiments on DenseNet and ResNet, clearly supporting these conclusions; see Tables 10 and 11 in Supplementary. Data harmonization techniques for anatomical MRI have been mainly crafted for SML models, and their adaptation to DL is still in its infancy (e.g., Bashyam et al., 2022; Dinsdale et al., 2021). Overall, applying data harmonization does not significantly change our main conclusions from Section 3.1.

Fig. 5. Three dimensionality reduction methods are evaluated on 3D anatomical VBM images from BHB-10K, namely Gaussian Random Projection (GRP), Recursive Feature Elimination (RFE), and Univariate Feature Selection (UFS). We reproduce the same experimental setting as in previous studies (Abrol et al., 2021; Schulz et al., 2020) by setting the number of reduced dimensions to $p = 784$ (from ≈300K gray matter voxels). The performance of SML (including rbf-SVM and penalized linear models) on reduced data is reported on the age/sex prediction tasks (shown here for $N_{train} \in \{100, 500, 1000, 3000, 5000, 9253\}$) and the diagnosis classification tasks (see Fig. 11 in Supplementary). We use the same training/validation/testing splits as previously. The performance in the original space is also reported for comparison purposes. All models are tested on the internal test (shown here) and the external one (see Fig. 12 in Supplementary), with similar conclusions. In all cases, dimensionality reduction provides no improvement for SML, and it significantly decreases performance when using UFS or GRP.

Fig. 6. Correlation matrix computed between the brain region importance maps obtained for each task and model. A strong correlation indicates good agreement between two models for a given task. Each brain region importance map is obtained through sensitivity analysis (i.e., using a gradient-based method) for both DL and linear models. All models considered have been trained with the maximum number of training samples. Brain regions are defined through the AAL atlas.

Dimensionality reduction hurts performance for SML models
In Fig. 5, we plot the performance of SML models on reduced BHB-10K data with $p = 784$ features, and we compare them to our previous baseline results on the original data (≈300K dimensions). We also perform the same experiments on our clinical datasets and report the results in Fig. 11 in Supplementary. We observe a strong degradation in performance for all models, especially with GRP (a drop of 12% AUC for sex prediction, +2.7 MAE for age regression, and >10% AUC for all binary classification tasks on the clinical datasets with SML models and the maximum number of training samples). This is somewhat expected since GRP is fully unsupervised, i.e., it does not rely on the target variable to preserve relevant features (and can thus focus on general non-biological variability, e.g., the acquisition site). RFE seems to be the best-performing method, but it still under-performs compared to the regularized linear models applied directly to the original data (−5% AUC for sex classification and +0.97 MAE for age prediction with $N_{train} = 9253$). This suggests that the non-redundancy and sparsity hypothesis on the final solution is violated on these tasks (Chu et al., 2012). Similar results are obtained on the external test set (see Fig. 12 in Supplementary). Finally, we replicate these findings in Supplementary G on UKBioBank (Bycroft et al., 2018), a benchmarking resource intensively used by the machine learning community in previous studies (Schulz et al., 2020; Abrol et al., 2021; Peng et al., 2021).

Fig. 7. (Dufumier et al., 2021a). We plot the t-SNE representation (top) of latent features encoded from new healthy brain images in the external BSNIP dataset (unseen during training). Below, we report the decoding performance to predict demographic information (age/sex) from the latent features (Pearson's correlation for age and balanced accuracy for sex), using linear probing. While Age-Aware contrastive (Dufumier et al., 2021a) and Age Supervised both use age as a weak signal during pre-training, all other models are unsupervised. All models use a DenseNet121 backbone except the VAE (using a smaller CNN architecture with 5 layers to avoid posterior collapse) and Model Genesis (UNet backbone, as in the original formulation, Zhou et al., 2021).
In these experiments, the number of selected components $p$ is a critical hyper-parameter, and it was not discussed in previous studies (Abrol et al., 2021; Schulz et al., 2020). For completeness, we also performed additional experiments with $p = 10$k (see Fig. 13 in Supplementary), showing that we can reach performances similar to those in the original space on age and sex prediction while reducing the input size by a factor of 30 (the gray matter mask contains about 300k voxels), using RFE (a difference of only 0.03 MAE and 0.32% AUC with SML models for age and sex prediction, respectively).
Overall, these experiments show that dimensionality reduction is not necessary for SML on multi-site neuroimaging data, and it can severely decrease performance without careful model selection. Regularized SML models can also learn from very high-dimensional data.

Deep and linear models make their decision based on the same brain regions for psychiatric disorders, aging and sex
Fig. 6 shows two clear patterns, both reproducible across test sets. First, all DL models generate saliency maps similar to those of logistic regression with ℓ2 regularization for all tasks (correlation ρ > 0.70 between the linear model and all DL models on all tasks). This is in line with recent studies (Salvador et al., 2017; Ball et al., 2021) on SML models applied to age prediction and to schizophrenia and bipolar disorder detection. Both linear and non-linear models result in similar final weighted maps, with various degrees of noise and sparsity. Second, ElasticNet generates extremely sparse maps (which is expected), but the highlighted regions are overall poorly correlated with the other models (ρ = 0.21, ρ = 0.22, ρ = 0.25, and ρ = 0.24 between ElasticNet and Logistic ℓ2, DenseNet, ResNet, and AlexNet, respectively, on ASD detection). This is more pronounced as the task difficulty increases (e.g., age or sex prediction with >95% AUC vs. ASD detection with ≈60% AUC). Furthermore, for completeness, we also used an occlusion-based method (Zeiler and Fergus, 2014) to compare the saliency maps given by sensitivity analysis and by occlusion. Occlusion consists in monitoring the variation of the model prediction while occluding each brain region independently (regions being defined by the AAL atlas). We report in Supplementary Fig. 9 the correlation between the saliency maps obtained from occlusion vs. sensitivity analysis. Overall, we found an excellent agreement between these two methods (ρ > 0.70 for all models and tasks, except AlexNet on sex prediction and DenseNet on bipolar disorder detection).

Exploring pre-training strategies
In Fig. 7, we plot the latent representation of healthy brain images encoded through the pre-trained models described in Section 2.12. These brain images come from the external dataset BSNIP, unseen during training. We also report the decoding performance for demographic information from the latent features using linear probing. Interestingly, our proposed Age-Aware contrastive model (Dufumier et al., 2021a) is the only one that captures both age and sex phenotypes well, even though it has not been trained with sex information. It also has better decoding performance for age, even compared to a fully age-supervised model. This can be explained by our previous results showing the poor generalization of this supervised model, i.e., DenseNet121, to new external images (see Fig. 2). It suggests that the Age-Aware contrastive model encodes robust features independent from the scanner. The VAE also captures demographic information well while not being trained with weak supervision. Nonetheless, it still under-performs compared to our proposed model.
Transfer to clinical datasets. To further compare these strategies, we fine-tune the different models on the three classification tasks and report the performance in Table 3. We observe that the Age-Aware contrastive model gives the best performance by a large margin (+2%, +4%, and +8% AUC on SCZ vs. HC, BD vs. HC, and ASD vs. HC respectively, sorted by task difficulty) compared to all other pre-training strategies. Interestingly, adding phenotype information (in particular age) during pre-training (either with discriminative or weakly self-supervised models) boosts performance compared to completely unsupervised models (self-supervised and generative). It notably implies that (1) anatomical knowledge related to age can be transferred to discriminate a wide range of psychiatric disorders and (2) decoder-free self-supervised models provide more robust, reproducible features across sites. Interestingly, a discriminative approach with age prediction as pre-training can improve ASD performance. However, it does not replicate on the external test, suggesting over-fitting on the scanner. In the following, we have thus used our Age-Aware contrastive model for pre-training.
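Fine-tuning amounts to warm-starting the downstream optimization from pre-trained weights. The following numpy toy (a logistic model on hypothetical synthetic data, not the actual DenseNet fine-tuning) illustrates why a related initialization can help within a fixed optimization budget:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, w0, lr=0.1, steps=50):
    # plain gradient descent on the logistic loss, starting from w0
    w = w0.copy()
    for _ in range(steps):
        p = sigmoid(X @ w)
        w -= lr * X.T @ (p - y) / len(y)
    return w

def log_loss(X, y, w):
    p = np.clip(sigmoid(X @ w), 1e-9, 1 - 1e-9)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(2)
n, d = 200, 10
w_task = rng.normal(size=d)                  # hypothetical downstream task
X = rng.normal(size=(n, d))
y = (X @ w_task > 0).astype(float)

w_pretrained = w_task + 0.5 * rng.normal(size=d)   # weights from a related task
w_finetuned = train_logreg(X, y, w_pretrained)
w_scratch = train_logreg(X, y, np.zeros(d))        # cold start, same budget
```

With the same number of gradient steps, the warm-started model ends at a lower loss than the cold start, the same intuition behind transfer to small clinical cohorts.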

Knowing what you don't know helps: quantifying DNN uncertainty with Deep Ensemble
In Table 4, we show that quantifying DNN uncertainty through Deep Ensemble allows (i) drastically improving DNN calibration (i.e., whether the confidence score of a DNN for a given prediction can be trusted) and (ii) improving performance on all psychiatric disorder prediction tasks. We report the results with the DenseNet backbone on the external test set and an increasing number of ensemble models. We observe a significant improvement in calibration for all tasks as we increase the number of ensemble models, with −6%, −10%, and −14% ECE, respectively, for SCZ vs. HC, BD vs. HC, and ASD vs. HC. Interestingly, the calibration error of the baseline model was larger for harder tasks (e.g., ASD), suggesting that DNNs were largely over-confident even when making many mistakes. Additionally, the improvement in calibration systematically goes with a performance improvement.
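A Deep Ensemble averages the post-softmax outputs of independently initialized networks, and ECE bins predictions by confidence to measure calibration. A self-contained numpy sketch of both ingredients, on synthetic over-confident "members" rather than the paper's models:

```python
import numpy as np

def ece(probs, labels, n_bins=10):
    # Expected Calibration Error for binary probabilities
    conf = np.maximum(probs, 1.0 - probs)
    correct = ((probs > 0.5).astype(int) == labels)
    err, edges = 0.0, np.linspace(0.5, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (conf > lo) & (conf <= hi)
        if m.any():
            err += m.mean() * abs(correct[m].mean() - conf[m].mean())
    return err

rng = np.random.default_rng(3)
n, n_models = 2000, 5
labels = rng.integers(0, 2, size=n)
# each member sees the same weak signal plus its own noise, then over-sharpens
logits = (2 * labels - 1) + 2.0 * rng.normal(size=(n_models, n))
member_probs = 1.0 / (1.0 + np.exp(-3.0 * logits))   # over-confident members
ensemble_probs = member_probs.mean(axis=0)           # average post-softmax outputs

ece_single = ece(member_probs[0], labels)
ece_ensemble = ece(ensemble_probs, labels)
```

Averaging disagreeing members pulls confidences toward more honest values, which is why the ensemble's ECE drops relative to any single member.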

Table 3
Fine-tuning results of models pre-trained with the five previous strategies. All models are pre-trained on healthy brains only. We report the average AUC (%) for all models and the standard deviation over three repetitions of each experiment. The baseline uses the DenseNet121 backbone, which gives the best results for mental disorder classification and thus provides a strong reference.

Task
Test set

Pre-training strategies
Weakly self-supervised Self-supervised Generative Discriminative

Table 5
Combining Deep Ensemble learning and Transfer Learning improves the DL representation over SML models, especially on complex tasks such as ASD and BD detection. We report the average AUC for all models and the standard deviation over three repetitions of each experiment. We use the DenseNet121 backbone for all DL models. The baseline corresponds to a single network trained from scratch on VBM images. For Deep Ensemble, we aggregate three networks trained from different random initializations. For Transfer Learning, we pre-train a single network with Age-Aware contrastive learning (Dufumier et al., 2021a) and fine-tune all weights on each clinical task. For Transfer+Deep Ensemble, we aggregate three networks, all pre-trained with Age-Aware contrastive learning (only once) and fine-tuned on each downstream task. The randomness thus comes from the gradient descent optimization on each downstream task.

Coupling deep ensemble and transfer learning outperforms SML and achieves SOTA results
We present the results on mental disorder classification when combining the new paradigm presented in Fig. 1 with the previously described Deep Ensemble strategy. We compare them to SML trained on VBM data (results on residualized data are reported in Supplementary Section 3.1.3).
From Table 5, we observe a consistent increase in performance when combining both Deep Ensemble learning and Transfer Learning w.r.t. the baseline on the external test (+0.84%, +9.44%, and +6.75% AUC on schizophrenia, bipolar disorder, and autism spectrum disorder detection respectively). This improvement over the best-performing SML model is significant for bipolar disorder and ASD detection but not for schizophrenia (p = 0.03, p = 0.02, and p = 0.32 respectively in a two-sample t-test on the external test; p = 0.01, p = 0.01, and p = 0.75 on the internal test).
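The significance test used here is a standard two-sample t-test over repeated runs; with SciPy it can be reproduced as follows (the AUC values below are hypothetical placeholders, not the paper's numbers):

```python
import numpy as np
from scipy.stats import ttest_ind

# hypothetical AUC scores over three repeated runs (placeholders, illustration only)
auc_dl = np.array([0.746, 0.752, 0.749])    # e.g. a Transfer+Deep Ensemble model
auc_sml = np.array([0.702, 0.710, 0.705])   # e.g. the best-performing SML model
t_stat, p_value = ttest_ind(auc_dl, auc_sml, equal_var=False)  # Welch variant
```

With only three repetitions per model, such tests have low power, so only large, consistent gaps (as for BD and ASD here) reach significance.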
Deep Ensemble results support the hypothesis that different random initializations lead to different representations after training. Transfer learning results show that anatomical features learned from the healthy population during brain maturation and aging can be re-used to drastically improve DL generalization on hard clinical tasks (such as bipolar disorder and ASD detection). Nonetheless, DL performance remains only on par with SML models on easier tasks (such as schizophrenia), task difficulty being measured by linear performance.
Variance analysis. From Table 6, we observe that Transfer+Deep Ensemble offers the lowest variance in all cases (while also being the best-performing model, see Table 5). Interestingly, transfer learning drastically lowers the SD for SCZ vs. HC, favoring our hypothesis that solutions are constrained to the same basin of the loss landscape, thus confirming previous findings on natural and medical images (Neyshabur et al., 2020). Results are mixed for BD vs. HC, where Deep Ensemble seems to be a crucial component for achieving low variance.

Discussion
In this study, we investigated the potential of DL models to extract non-linearities from large-scale and medium-scale multi-site datasets for key problems in neuroimaging, including single-subject prediction of psychiatric disorders and of age/sex, as compared to standard linear and kernel machine learning methods (SML).
We first confirm recent findings (Schulz et al., 2020) raising doubts about a universal usage of DL models in anatomical neuroimaging. In particular, we found no difference in performance between DL methods trained from scratch and SML for simple and more complex single-subject neuroimaging classification tasks, including (1) sex prediction, (2) schizophrenia detection, (3) bipolar disorder, and (4) autism spectrum disorder classification. Our results on psychiatric disorders extend those of a recent benchmark on Alzheimer's detection (Wen et al., 2020), showing that DL is on par with a simple linear SVM trained on ADNI (Jack et al., 2008), the largest neuroimaging initiative to date for Alzheimer's disease (in their case, comprising 666 participants with several time points per participant). Nonetheless, we did find that DL outperforms SML on the age regression task, confirming recent studies on this topic (Peng et al., 2021; Abrol et al., 2021). Still, it needs a vast number of samples (more than 9000) to extract a better representation than a simple regularized linear model when images come from sites never seen during training. A question then arises: why does DL outperform SML in computer vision on challenging image classification tasks but not on single-subject neuroimaging tasks?
The first reason explaining this phenomenon is the highly complex pre-processing pipeline engineered for years in neuroimaging, allowing for noise reduction, spatial alignment, and data harmonization. In particular, diffeomorphic spatial registration, as well as brain tissue segmentation and other non-linear image corrections (e.g., bias field correction, intensity rescaling, etc.), have been developed over the last two decades (Ashburner, 2007; Gaser and Dahnke, 2016) for statistical analysis and allow powerful statistical learning with simple linear models. This whole pipeline can be viewed as a complex non-linear function mapping raw brain images to well-aligned and denoised anatomical images, thus explaining the success of SML (including both linear and kernel methods) in the neuroimaging community. A second obvious reason is clinical data scarcity. Brain imaging produces large, yet limited, input volumes with more than 300k dimensions across no more than a few thousand subjects. This is 1000 times fewer samples than ImageNet, with potentially less diversity. A third reason could be the very high inter-individual heterogeneity in the anatomy of patients labeled with the same diagnosis, e.g., bipolar disorder or autism (Wolfers et al., 2018; Zabihi et al., 2020; Nunes et al., 2020). This last hypothesis is further supported by the current re-conceptualization of major disorders in psychiatry (for instance, through the RDoC initiative).

Are brain images too noisy?
The main hypothesis made by Schulz et al. (2020) regarding the similar scaling trend between DL and linear models is the linearization of decision boundaries when input images are overwhelmed by noise (e.g., MRI artifacts) unrelated to the underlying neurobiological changes associated with the pathology. This was well illustrated on the MNIST dataset (LeCun et al., 1998) (a grayscale image dataset of handwritten digits ranging from 0 to 9) with a simple experiment: the authors added Gaussian noise to the images and, the stronger the noise, the closer the learning curves between DNNs and linear models. We argue that our experiments on VBM vs. raw images support this hypothesis. We showed that site-related noise was well preserved in the representation space of a DNN trained to predict age/sex or mental condition, especially on raw measurements, even when we know that a more discriminative signal is present. This hypothesis was also supported by the experiments on age prediction in Section 3.1: while the learning curve of SML was significantly worse than that of DL on the internal test (reaching a plateau early, at roughly 3000 training samples), it was not the case on the external test. These findings suggest that current site-related noise in MRI prevents DNNs from exploiting non-linear signals, thus somehow linearizing their decision boundaries for psychiatric condition classification.
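The linearization-under-noise argument can be illustrated with a toy XOR problem, where the advantage of a non-linear decision rule over the best linear one shrinks as Gaussian noise grows. This is a schematic stand-in for the MNIST experiment of Schulz et al. (2020), not a reproduction:

```python
import numpy as np

rng = np.random.default_rng(4)

def xor_data(n, sigma):
    # XOR problem: label = [sign(x1) == sign(x2)], linearly inseparable
    x = rng.choice([-1.0, 1.0], size=(n, 2))
    y = (x[:, 0] * x[:, 1] > 0).astype(int)
    return x + sigma * rng.normal(size=x.shape), y   # observe noisy inputs

def nonlinear_advantage(sigma, n=2000):
    X, y = xor_data(n, sigma)
    # oracle non-linear rule: threshold the product feature
    nonlin_acc = ((X[:, 0] * X[:, 1] > 0).astype(int) == y).mean()
    # the best linear rule on XOR is chance level
    lin_acc = 0.5
    return nonlin_acc - lin_acc

gap_clean = nonlinear_advantage(sigma=0.3)   # low noise: large non-linear gain
gap_noisy = nonlinear_advantage(sigma=3.0)   # heavy noise: gain nearly vanishes
```

As the noise drowns the signal, even the oracle non-linear rule collapses toward chance, so its advantage over the linear baseline disappears.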
Data augmentation is usually suggested as the simplest way to artificially increase the sample size in small clinical datasets. We showed that standard geometrical transformations degrade DNN performance on VBM and quasi-raw data for all tasks. We hypothesize that standard geometrical transformations (e.g., rotation, translation, flip) are not adapted to our data since all images are non-linearly registered to the same template. As for Gaussian noise, it also appears unnecessary since we already apply a smoothing kernel to regularize our data. Taken together, these results highlight the necessity to design new augmentation schemes specifically crafted for neuroimaging data. In particular, we acknowledge that new methods are emerging for generating meaningful synthetic images through non-linear deep generative models (Chadebec et al., 2022), and we leave this research axis as future work.
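For reference, the standard geometrical augmentations discussed above can be implemented in a few lines with scipy.ndimage; the parameter ranges below are illustrative, not the ones evaluated in the paper:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, rotate, shift

rng = np.random.default_rng(5)

def augment(vol, max_angle=10.0, max_shift=4.0):
    # random rotation, translation and flip of a 3D volume
    angle = rng.uniform(-max_angle, max_angle)
    vol = rotate(vol, angle, axes=(0, 1), reshape=False, order=1, mode="nearest")
    vol = shift(vol, rng.uniform(-max_shift, max_shift, size=3),
                order=1, mode="nearest")
    if rng.random() < 0.5:
        vol = vol[::-1]                      # flip along the first axis
    return vol

# smoothed random volume, mimicking a spatially smoothed VBM map
vol = gaussian_filter(rng.normal(size=(32, 32, 32)), sigma=2.0)
aug = augment(vol)
```

On images already non-linearly registered to a template, such transformations move voxels away from their template-aligned positions, which is consistent with the performance degradation reported above.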
Transfer learning from a large-scale healthy dataset to medium-scale clinical studies. Crucially, we propose a new transfer learning paradigm for discriminating patients with mental disorders from controls, achieving a new SOTA for ASD and bipolar disorder detection. This paradigm is versatile and does not prescribe a particular pre-training strategy. It mainly relies on the hypothesis that capturing, with a large-scale dataset, the biological variability in the healthy population related to non-specific variables (e.g., age, sex, etc.) allows easier discovery of specific pathological variability (e.g., subtle cortical atrophy in the pre-frontal and temporal lobes for ASD detection) during fine-tuning on small-scale cohorts. Our findings with the proposed Age-Aware contrastive strategy suggest that age-related features are also implicated in BD and ASD diagnosis, supporting previous findings on this topic (Courchesne et al., 2003; Greimel et al., 2013) (e.g., related to brain overgrowth during childhood). In this regard, integrating other phenotypes (e.g., cognition) during pre-training using y-Aware contrastive learning opens up a new avenue for transfer and representation learning. This would enable structuring the representation space according to non-imaging variables and possibly learning a richer manifold from large-scale healthy datasets.
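Schematically, y-Aware contrastive learning replaces the hard positive pairs of InfoNCE with soft positives weighted by a Gaussian kernel on the continuous label (age here). A minimal numpy sketch of such a loss, on synthetic embeddings; the kernel width sigma and temperature tau are illustrative values, not the paper's:

```python
import numpy as np

def y_aware_infonce(z, y, sigma=5.0, tau=0.1):
    # soft positives: samples are pulled together in proportion to a Gaussian
    # kernel on |y_i - y_j| (age), instead of InfoNCE's hard positive pairs
    z = z / np.linalg.norm(z, axis=1, keepdims=True)     # cosine similarity
    sim = (z @ z.T) / tau
    n, loss = len(y), 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        w = np.exp(-((y[i] - y[others]) ** 2) / (2.0 * sigma ** 2))
        w /= w.sum()
        log_softmax = sim[i, others] - np.log(np.exp(sim[i, others]).sum())
        loss -= (w * log_softmax).sum()
    return loss / n

rng = np.random.default_rng(6)
y = rng.uniform(20.0, 80.0, size=16)                     # ages
# embeddings ordered by age along a half-circle (what the loss encourages) ...
theta = (y - 20.0) / 60.0 * np.pi
z_aligned = np.stack([np.cos(theta), np.sin(theta)], axis=1)
# ... versus embeddings unrelated to age
z_random = rng.normal(size=(16, 2))
```

The loss is lower when the embedding geometry follows age, which is precisely the structure the paper reads off the t-SNE plots in Fig. 7.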
Additionally, we also show how uncertainty quantification (''knowing what you don't know'') is crucial for DL models, and that it can be addressed with Deep Ensemble. Considering their over-confidence when solving complex tasks, even with noisy data, modeling and quantifying predictive uncertainty is essential for computer-aided diagnosis and clinical trial design.
Quantitatively, we found that DL combined with TL establishes a new state-of-the-art prediction performance for bipolar disorder detection from anatomical brain imaging (> 78% AUC on both internal and external tests, with 1173 subjects including 471 patients with BD), in light of recent results from the ENIGMA consortium (Nunes et al., 2020) (the largest study to date, with 3020 subjects including 853 patients with BD).
In their experiments, Nunes et al. (2020) achieved ≈ 70% AUC (resp. ≈ 75%) on the external (resp. internal) test after linear residualization adjusted for age, sex, and site. These findings suggest that (i) discriminative, transferable, non-linear anatomical patterns can be learned with DL through pre-training on brain imaging of the healthy population; (ii) different DL initializations converge to different solutions after training that, when aggregated, can outperform SML; (iii) DL models tend to learn simple features on easy tasks (such as schizophrenia detection), falling into the Simplicity Bias (Shah et al., 2020), which encourages DNNs to find the simplest features to perform the task (thus hurting generalization on external test sets).
Interestingly, for schizophrenia, the easiest clinical task among the three tackled in this paper (relative to ML diagnosis accuracy), DL struggles to find a better representation than simple regularized linear models, even when performing TL or Deep Ensemble learning. We hypothesize that this might be due to the simplicity bias (Shah et al., 2020), whereby DL trained with standard procedures, such as Stochastic Gradient Descent (SGD), tends to rely on the simplest features even if more complex ones could bring more discriminative information. We saw that aggregating different DL representations trained from scratch on SCZ detection leads to marginal improvement (+0.46% AUC on the internal test), as opposed to BD and ASD classification (+3% and +2.92% AUC respectively), suggesting that different DL models extract dissimilar (potentially non-linear) features only on complex tasks. This would also explain the performance drop on the external test for SCZ vs. HC (−9.92% AUC compared to the internal test), viewed as an out-of-domain dataset, since the simplicity bias leads to poor out-of-domain generalization (Shah et al., 2020). This performance drop was only observed on SCZ vs. HC after performing TL and Deep Ensemble. Simplicity bias is a relatively new concept, and removing it from current DL models is still an open challenge. We hypothesize that, by avoiding simplicity bias, we may also benefit from the powerful representation capacity of DL on simpler clinical tasks such as schizophrenia detection.
We acknowledge that current DL architectures may not be ideal for brain anatomical data. On natural images, DL architectures (in particular CNNs) bring a strong inductive bias (e.g., translation invariance, hierarchical representation) that seems very beneficial for challenging computer vision tasks, which could partly explain their success. In particular, on MNIST (LeCun et al., 1998) (a highly popular benchmarking image dataset containing handwritten digits), CNNs can outperform SML (by > +15% accuracy (Schulz et al., 2020)) with as few as 100 samples. Another work (Alain and Bengio, 2016) also showed that the representation space of a randomly initialized CNN can be used as such to achieve accurate results on MNIST (> 90% accuracy). More remarkably, randomly initialized (i.e., untrained) CNNs can be used as a ''handcrafted prior'' for image denoising, inpainting, image reconstruction (Ulyanov et al., 2018), and object localization (Cao and Wu, 2022) on ImageNet to achieve SOTA results.
On the other hand, we hypothesize that the current inductive bias in CNNs may not be sufficient for brain anatomical data, where all images are already aligned and share the same colors and textures (in line with a recent review, Eitel et al., 2021). Other recent DL architectures, such as Transformers (Vaswani et al., 2017), integrating attention modules at their core and relaxing the inductive bias constraints present in CNNs, might be another exciting research direction for neuroimaging. While Transformers still require massive amounts of data on natural images (because of their flexibility; Lin et al., 2022), first works in neuroimaging are starting to appear (He et al., 2021) and should receive special attention.
Our findings demonstrate that DL and SML tend to rapidly over-fit the acquisition sites, even in the large-scale data regime. On the age regression task, we observe a significant performance drop for all DL and SML models between internal and external tests (average MAE increase: ΔMAE(DL) = 1.00, ΔMAE(SML) = 0.88, with 10k images acquired at 17 sites). A similar drop in classification performance is found for schizophrenia detection (with 1300 samples): ΔAUC(DL) = 7.81%, ΔAUC(SML) = 9.72%. Such a decrease in performance can mainly be attributed to site acquisition settings. Moreover, it suggests a systematic bias in results obtained on test images stemming from sites seen during the training phase. DL models appear to over-fit even more on raw data than on VBM for age regression, explaining their higher performance drop between internal and external tests, observed in Fig. 3 and confirmed in Supplementary Fig. 8. This is in line with the inter-scanner reliability test performed by Cole et al. (2017) on DL models. Our results again favor the handcrafted VBM pre-processing for DL, since it seems to limit the site bias (at least on age regression). Interestingly, similar results were obtained for Alzheimer's detection (Wen et al., 2020), with poor DL generalization when using raw images from never-seen sites.
Overall, this shines a light on a recurrent issue in multi-site neuroimaging studies related to data harmonization and debiasing in DL. While SOTA data harmonization techniques (ComBat (Fortin et al., 2018) and linear adjusted regression) have been partially beneficial for SML on clinical applications, this was not the case for DL (see Table 10 in Supplementary). It suggests that current harmonization techniques still fail at preserving the non-linear input relationships leveraged by DL to perform the downstream task. Removing site information from DL representations while protecting variables of interest (e.g., biological ones such as diagnosis, age, and sex, or sensitive attributes in the context of trustworthy AI) is an open challenge both in computer vision (Barbano et al., 2021) and neuroimaging (Dinsdale et al., 2021; Barbano et al., 2023a,b). It is still a relatively new research area with no benchmarking datasets or metrics in neuroimaging.
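The simplest of these harmonization baselines, linear site adjustment, regresses a one-hot site design matrix out of each feature. A numpy sketch on synthetic data (a simplified stand-in for ComBat, without its full empirical-Bayes model):

```python
import numpy as np

def residualize_site(X, site):
    # regress a one-hot site design matrix out of every feature, keep residuals
    sites = np.unique(site)
    D = (site[:, None] == sites[None, :]).astype(float)
    beta, *_ = np.linalg.lstsq(D, X, rcond=None)
    return X - D @ beta + X.mean(axis=0)     # re-center on the grand mean

rng = np.random.default_rng(7)
n, d = 300, 5
site = rng.integers(0, 3, size=n)
X = rng.normal(size=(n, d)) + 2.0 * site[:, None]   # strong additive site offset

X_res = residualize_site(X, site)
# after adjustment, the per-site feature means coincide
mean_site0 = X_res[site == 0].mean(axis=0)
mean_site1 = X_res[site == 1].mean(axis=0)
```

Such an adjustment only removes additive (linear) site effects per feature; any non-linear site signature survives it, which is one reason these corrections help SML more than DL, as discussed above.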
Although DL models are often considered ''black boxes'', we provide empirical evidence that randomly initialized DL models base their decisions on brain regions very similar to those used by linear models. We observed this agreement between DL and linear models on both internal and external test sets. This consistency across DL and linear models is reassuring and suggests the reliability of the features extracted by DL models. It should also be noted that different CNNs based their decisions on highly similar importance maps for all evaluated tasks. DL reliability is crucial in the context of precision medicine for psychiatry as a first step towards building models accepted and trusted by clinicians.
Overall, our study confirms that the utility of DL over SML on challenging clinical applications in psychiatry comes from TL and Deep Ensemble learning. Coupling these two strategies outperforms SML on both BD and ASD and achieves new state-of-the-art BD results. While DL trained from scratch did not dominate simple linear models on psychiatric disorders, we showed that recent advances in contrastive learning (Dufumier et al., 2021a,b, 2023; Louiset et al., 2024) applied to a large healthy population (≈ 10k subjects) allow DL models to learn re-usable features. Aggregating other modalities (e.g., functional or diffusion MRI, genetics) to perform representation learning remains an exciting challenge that might be solved with contrastive learning. It would improve our understanding of brain disorders and possibly pave the way towards personalized medicine in psychiatry through predictive models of clinical outcome, where only small longitudinal cohorts are, and will be, available in the near future.

Fig. 2 .
Fig. 2. DL vs. SML performance on phenotype prediction and increasingly difficult diagnosis classification tasks on highly multi-site datasets. For SML methods, two linear models with ℓ2 (Logistic) or ℓ1 + ℓ2 (ElasticNet) penalization are evaluated, as well as a non-linear Radial Basis Function (RBF) SVM. As for DL, a vanilla AlexNet (Krizhevsky et al., 2012) (previously introduced by Abrol et al. (2021), with 2.5M parameters and 5 layers) and the more advanced ResNet18 (He et al., 2016) (33.2M parameters, 18 layers) and DenseNet121 (Huang et al., 2017) (11.2M parameters, 121 layers, taking advantage of skip-connections and feature re-use) are considered. Both DL and SML algorithms are trained on whole-brain 3D anatomical images. All models are evaluated on two different test sets: an internal test stratified on age, sex, and site (662 and 655 healthy subjects), and on diagnosis for the clinical datasets (118 SCZ vs. 116 HC, 107 BD vs. 103 HC, 184 ASD vs. 188 HC); and an external test including sites never seen during training (640 healthy subjects, 133 SCZ, 131 BD, 207 ASD). Models cannot use site-specific information for their predictions on this test set, eliminating a strong bias reported in the literature. For age and sex prediction, we performed a 5-fold (resp. 3-fold) Monte Carlo Cross-Validation sub-sampling procedure for training sizes in {100, 500} (resp. {1000, 3000, 5000, 9253}). As for diagnosis classification tasks, each model is trained 30 times with different random initializations, and averages and standard deviations are reported. Mean Absolute Error (MAE) is the reference measure for age prediction, while Area Under the Curve (AUC) is the preferred metric for binary classification tasks since it does not depend on a particular threshold (it only measures a classifier's discriminative power). Overall, SML models perform as well as DL models for sex prediction (up to 9253 training subjects), SCZ vs. HC, BD vs. HC, and ASD vs. HC. SML and DL performance keeps improving for age prediction on the external test when increasing the number of training subjects. On the other hand, performance increases very slowly (almost a plateau) on the internal test starting from roughly 3000 training subjects, with an important improvement for non-linear DL models over SML.

Fig. 3 .
Fig. 3. DL performance is evaluated on raw brain images and on extensively pre-processed, non-linearly registered, anatomical Gray Matter (GM) brain images (namely VBM). Results indicate that DL models fail to extract more discriminative features from raw images than from fully pre-processed ones, even in the large-scale data regime.

Fig. 4 .
Fig. 4. Five augmentation strategies are evaluated on VBM data with two different sets of hyper-parameters (light and strong, see Table 8 in Supplementary). For completeness, these augmentations are also evaluated on quasi-raw data for all clinical tasks (see Table 9 in Supplementary). Overall, data augmentation does not significantly improve performance on any clinical task. In the rest of this study, we do not perform any particular augmentation when training deep models.

Fig. 7 .
Fig. 7. We explore several pre-training strategies based on representation learning applied to brain MRI, among which our proposed model, Age-Aware contrastive (Dufumier et al., 2021a). We plot the t-SNE representation (top) of latent features encoded from new healthy brain images in the external BSNIP dataset (unseen during training). Below, we report the decoding performance when predicting demographic information (age/sex) from the latent features (Pearson's correlation for age and balanced accuracy for sex), using linear probing. While Age-Aware contrastive (Dufumier et al., 2021a) and Age Supervised both use age as a weak signal during pre-training, all other models are unsupervised. All models use the DenseNet121 backbone except the VAE (which uses a smaller 5-layer CNN architecture to avoid posterior collapse) and Model Genesis (UNet backbone, as in the original formulation; Zhou et al., 2021).

Table 1
Demographic information about the datasets used throughout this study. We integrated OpenBHB, a large freely available multi-site sMRI dataset, from which we drew our training set (up to 5000 training subjects) and our internal and external testing sets for all our experiments on age and sex prediction.

Table 2
Training/validation/test splits used for the detection of the three mental disorders. Out-of-site images always make up the external test set, and each participant falls into only one split, avoiding data leakage. The internal testing set is always stratified according to age, sex, site, and diagnosis with respect to the training and validation sets. All models use the same splits.

Table 4
Deep Ensemble improves calibration and performance for all clinical tasks. Calibration is measured by the Expected Calibration Error (ECE) and performance by ROC-AUC. In this experiment, the Deep Ensemble model takes the average representation (given after the softmax layer) of the ensembled models, each trained with supervision from a different random initialization.
Green numbers indicate improvement over the DL baselines. * indicates p < 0.05 for a two-sample t-test against the best-performing SML model on the current task.

Table 6
Standard Deviation (SD) of the AUC performance reported during model optimization, depending on initialization. TL and TL+Deep Ensemble drastically reduce the SD, suggesting that they do not escape much from the initial basin of the loss landscape. The SD is estimated using 30 measures for all (task, model) pairs, except for Transfer+Deep Ensemble, which is estimated with ten measures (3 models are used for each Deep Ensemble).