Cross-cohort generalizability of deep and conventional machine learning for MRI-based diagnosis and prediction of Alzheimer’s disease

This work validates the generalizability of MRI-based classification of Alzheimer’s disease (AD) patients and controls (CN) to an external data set and to the task of prediction of conversion to AD in individuals with mild cognitive impairment (MCI). We used a conventional support vector machine (SVM) and a deep convolutional neural network (CNN) approach based on structural MRI scans that underwent either minimal pre-processing or more extensive pre-processing into modulated gray matter (GM) maps. Classifiers were optimized and evaluated using cross-validation in the Alzheimer’s Disease Neuroimaging Initiative (ADNI; 334 AD, 520 CN). Trained classifiers were subsequently applied to predict conversion to AD in ADNI MCI patients (231 converters, 628 non-converters) and in the independent Health-RI Parelsnoer Neurodegenerative Diseases Biobank data set. From this multi-center study representing a tertiary memory clinic population, we included 199 AD patients, 139 participants with subjective cognitive decline, 48 MCI patients converting to dementia, and 91 MCI patients who did not convert to dementia. AD-CN classification based on modulated GM maps resulted in a similar area-under-the-curve (AUC) for SVM (0.940; 95%CI: 0.924–0.955) and CNN (0.933; 95%CI: 0.918–0.948). Application to conversion prediction in MCI yielded significantly higher performance for SVM (AUC = 0.756; 95%CI: 0.720–0.788) than for CNN (AUC = 0.742; 95%CI: 0.709–0.776) (McNemar’s test, p<0.01). In external validation, performance was slightly decreased. For AD-CN, it again gave similar AUCs for SVM (0.896; 95%CI: 0.855–0.932) and CNN (0.876; 95%CI: 0.836–0.913). For prediction in MCI, performances decreased for both SVM (AUC = 0.665; 95%CI: 0.576–0.760) and CNN (AUC = 0.702; 95%CI: 0.624–0.786). Both with SVM and CNN, classification based on modulated GM maps significantly outperformed classification based on minimally processed images (p=0.01).
Deep and conventional classifiers performed equally well for AD classification and their performance decreased only slightly when applied to the external cohort. We expect that this work on external validation contributes towards translation of machine learning to clinical practice.


Introduction
The diagnostic process of dementia is challenging and takes a substantial period of time after the first clinical symptoms arise: on average 2.8 years in late-onset and 4.4 years in young-onset dementia (Van Vliet et al., 2013). The window of opportunity for advancing the diagnostic process is, however, much larger than these few years. For Alzheimer's disease (AD), the most common form of dementia, there is increasing evidence that disease processes start 20 years or more ahead of clinical symptoms (Gordon et al., 2018). Advancing the diagnosis is essential to support the development of new disease-modifying treatments, since late treatment is expected to be a major factor in the failure of clinical trials (Mehta et al., 2017). In addition, early and accurate diagnosis has great potential to reduce healthcare costs, as it gives patients access to supportive therapies that help to delay institutionalization (Prince et al., 2011).
Machine learning offers an approach for automatic classification by learning complex and subtle patterns from high-dimensional data. In AD research, such algorithms have been frequently developed to perform automatic diagnosis and predict the future clinical status at an individual level based on biomarkers. These algorithms aim to facilitate medical decision support by providing a potentially more objective diagnosis than that obtained by conventional clinical criteria (Klöppel et al., 2012; Rathore et al., 2017). A large body of research has been published on classification of AD and its prodromal stage, mild cognitive impairment (MCI) (Ansart et al., 2021; Wen et al., 2020; Rathore et al., 2017; Arbabshirani et al., 2017; Falahati et al., 2014; Bron et al., 2015). Overall, classification methods show high performance for classification of AD patients and control participants, with an area under the receiver-operating characteristic curve (AUC) of 85-98%. Reported performances are somewhat lower for prediction of conversion to AD in patients with MCI (AUC: 62-82%). Structural T1-weighted (T1w) MRI to quantify neuronal loss is the most commonly used biomarker, whereas the support vector machine (SVM) is the most commonly used classifier. Following the trends and successes in medical image analysis and machine learning, neural network classifiers, convolutional neural networks (CNN) in particular, have increasingly been used in recent years (Wen et al., 2020; Cui and Liu, 2019; Basaia et al., 2019), but have not been shown to significantly outperform conventional classifiers. Most CNN studies perform little or no pre-processing of the structural MRI scans used as input for their classifier (Wen et al., 2020; Basaia et al., 2019; Hosseini-Asl et al., 2018; Vieira et al., 2017), while others use more extensive pre-processing strategies proven successful for conventional classifiers, such as gray matter (GM) density maps (Cui and Liu, 2019; Suk et al., 2017). Although CNNs are designed to extract high-level features from raw imaging data, it is conceivable that the learning process for complex tasks is improved by dedicated pre-processing that enhances disease-related features, which reduces model complexity and enables a more stable learning process. It is not yet clear whether CNNs improve AD classification over conventional classifiers and whether they benefit from extensive MRI pre-processing.
Despite the high performance of machine learning diagnosis and prediction methods for AD, it is largely unknown how these algorithms would perform in clinical practice. A next step would be to assess the generalizability of classification methods from a specific research population to another study population. There are, however, only very few studies assessing classification performance on an external data set (Wen et al., 2020; Bouts et al., 2019; Archetti et al., 2019; Hall et al., 2015). Results varied from only a minor reduction in performance for some experiments (Wen et al., 2020; Hall et al., 2015) to a severe drop for others (Bouts et al., 2019; Archetti et al., 2019; Wen et al., 2020). While generalizability seemed related to how well the training data represented the testing data (e.g. an external data set with similar inclusion criteria showed a smaller performance drop than a data set with very different criteria (Wen et al., 2020)), a better understanding is crucial before applying such methods in routine clinical practice. Therefore, this work aims to assess the generalizability of MRI-based classification performance to an external data set representing a tertiary memory clinic population, for both diagnosis of AD and prediction of AD in individuals with MCI. To evaluate the value of neural networks and to determine their optimal MRI pre-processing approach, we compare a CNN with a conventional SVM classifier using two pre-processing approaches: minimal pre-processing using only rough spatial alignment, and more extensive pre-processing into modulated GM maps. First, we optimize the methods using a large research cohort and assess classification performance using cross-validation. Subsequently, we validate AD prediction performance in MCI patients of the same cohort as well as AD diagnosis and prediction performance in the external data set.

Study population
We used data from two cohorts. The first group of 1715 participants was included from the Alzheimer's Disease Neuroimaging Initiative (ADNI; adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether clinical and neuropsychological assessment, serial magnetic resonance imaging (MRI), positron emission tomography (PET), and other biological markers can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). For up-to-date information, see www.adni-info.org. We included all participants with a T1w MRI scan at baseline from the ADNI1/GO/2 cohorts: 336 AD patients, 520 control participants (CN), 231 patients with mild cognitive impairment (MCI) who converted to AD within 3 years (MCIc) and 628 MCI patients who did not convert (MCInc).
The second group of participants was included from the Health-RI Parelsnoer Neurodegenerative Diseases Biobank (PND; www.health-ri.nl/parelsnoer), a collaborative biobanking initiative of the eight university medical centers in the Netherlands (Manniën et al., 2017). The Parelsnoer Neurodegenerative Diseases Biobank focuses on the role of biomarkers in the diagnosis and the course of neurodegenerative diseases, in particular of Alzheimer's disease (Aalten et al., 2014). It is a prospective, multi-center cohort study, focusing on tertiary memory clinic patients with cognitive problems including dementia. Patients have been enrolled since March 2009 and are followed annually for two to five years. In the PND biobank, a total of 1026 participants have been included. Inclusion criteria for the current research were: a high-resolution T1w MRI at baseline, clinical consult at baseline, 90 days or less between MRI and clinical consult, and a baseline diagnosis of SCD, MCI, or dementia due to AD. A flow diagram of the inclusion can be found in the supplementary files (Fig. S1). A total of 557 participants met inclusion criteria. One person was excluded because image analysis failed. This led to inclusion of 199 AD patients and 138 participants with SCD. Of the MCI group, we included the 139 participants that had a follow-up period of at least 6 months. Of this group, 48 MCI patients converted to dementia within the available follow-up time and 91 MCI patients remained stable. Demographics are shown in Table 2.

Image pre-processing
We evaluated two pre-processing approaches based on T1w images: minimal pre-processing and a more extensive pre-processing into modulated GM maps.
To prepare T1w images with minimal pre-processing, scans were non-uniformity corrected using the N4 algorithm (Tustison et al., 2010) and subsequently transformed to MNI-space using registration of brain masks with a similarity transformation. A similarity transformation is a rigid transformation including isotropic scaling. Registrations were performed with Elastix registration software (Klein et al., 2010; Shamonin et al., 2014). To account for variations in signal intensity, images were normalized within the brain mask to have zero mean and unit variance.
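The intensity normalization step can be illustrated with a minimal sketch. It assumes the image and brain mask are already available as NumPy arrays of equal shape; the function name `normalize_within_mask`, the toy volume, and the choice to set voxels outside the mask to zero are assumptions of this sketch, not details taken from the study's code.

```python
import numpy as np


def normalize_within_mask(image, brain_mask):
    """Scale intensities so that voxels inside the mask have zero mean and unit variance."""
    inside = brain_mask.astype(bool)
    mean = image[inside].mean()
    std = image[inside].std()
    normalized = (image - mean) / std
    # Assumption of this sketch: voxels outside the brain mask are set to zero.
    normalized[~inside] = 0.0
    return normalized


# Toy 3D volume standing in for a bias-corrected, spatially aligned T1w scan.
rng = np.random.default_rng(0)
image = rng.normal(loc=100.0, scale=15.0, size=(8, 8, 8))
brain_mask = np.zeros((8, 8, 8), dtype=bool)
brain_mask[2:6, 2:6, 2:6] = True

normalized = normalize_within_mask(image, brain_mask)
```

Computing the statistics only inside the mask keeps background voxels from influencing the scaling.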
To obtain modulated GM maps encoding gray matter density, the Iris pipeline was used (Bron et al., 2014). To compute these maps, a group template space was defined using a procedure that avoids bias towards any of the individual T1w images using pairwise registration (Seghers et al., 2004). The pairwise registrations were performed using a similarity, affine, and nonrigid B-spline transformation model consecutively. We selected a subset of images for the definition of the template space. This template set consisted of the images of 50 ADNI participants that were randomly selected preserving the ratio between diagnostic groups (subject list available at https://gitlab.com/radiology/neuro/bron-cross-cohort). The other images of both ADNI and PND data sets were registered to the template space following the same registration procedure. For the current work, some changes to the template space construction procedure as used in Bron et al. (2014) were made: non-uniformity correction was performed, skull-stripping was performed, and the template space corresponded to MNI-space. Using similarity registration based on brain masks, we computed the coordinate transformations of MNI space to each of the template set's images, which were subsequently concatenated with the pairwise transformations before averaging. After template space construction, probabilistic GM maps were obtained with the unified tissue segmentation method of SPM8 (Statistical Parametric Mapping) (Ashburner and Friston, 2005). To obtain the final feature maps, probabilistic GM maps were transformed to the template space and modulated, i.e. multiplied by the Jacobian determinant of the deformation field, to take compression and expansion into account (Ashburner and Friston, 2000). To correct for head size, modulated GM maps were divided by intracranial volume.
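The final two steps, modulation and head-size correction, amount to a voxel-wise product followed by a scalar division. The sketch below assumes that the GM probability map and the Jacobian determinant map have already been resampled to the template space as NumPy arrays; the function name and toy values are hypothetical and only illustrate the arithmetic.

```python
import numpy as np


def modulated_gm_map(gm_probability, jacobian_determinant, intracranial_volume):
    """Modulate a template-space GM probability map and correct it for head size."""
    # Modulation: weight each voxel by the local volume change of the deformation.
    modulated = gm_probability * jacobian_determinant
    # Head-size correction: divide by the subject's intracranial volume (a scalar).
    return modulated / intracranial_volume


# Toy values (not study data): uniform GM probability 0.5, local volume factor 2.0.
gm_probability = np.full((2, 2, 2), 0.5)
jacobian_determinant = np.full((2, 2, 2), 2.0)
feature_map = modulated_gm_map(gm_probability, jacobian_determinant, intracranial_volume=1.5)
```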

Classification approaches
Two machine learning approaches were used for classification: a support vector machine (SVM) and a convolutional neural network (CNN).

Support vector machine (SVM)
An SVM with a linear kernel was used, as this approach previously showed good performance using voxel-based features for AD classification (Klöppel et al., 2008; Cuingnet et al., 2011; Bron et al., 2014, 2015). The C parameter was optimized with 5-fold cross-validation on the training set. Input features, i.e. voxel values of the pre-processed images within a brain mask, were normalized to zero mean and unit variance based on the training set. The classifier was implemented using Scikit-Learn.
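A set-up of this kind could be sketched in Scikit-Learn as follows. This is an illustration under assumptions, not the study's implementation: the candidate values for C, the use of AUC as selection criterion, the stratified folds, and the synthetic feature matrix are all choices made for this example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for voxel features within a brain mask (not study data).
rng = np.random.default_rng(0)
n_per_class, n_features = 40, 50
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_per_class, n_features)),
    rng.normal(0.5, 1.0, size=(n_per_class, n_features)),
])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Placing the scaler inside the pipeline means that the normalization statistics
# are estimated on the training part of each fold only.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="linear"))
param_grid = {"svc__C": [1e-3, 1e-2, 1e-1, 1.0, 10.0]}  # hypothetical grid

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="roc_auc",  # assumed selection criterion
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
best_C = search.best_params_["svc__C"]
```

After fitting, `search.best_estimator_` holds the pipeline refitted on the full training set with the selected C.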
To gain insight into the classifications, we calculated statistical significance maps (p-maps) that show which features contributed to the SVM decision. These maps were computed using an analytical expression that approximates permutation testing (Gaonkar et al., 2015). Clusters of significant voxels were obtained using a p-value threshold of α ≤ 0.05. P-maps were not corrected for multiple comparisons, as permutation testing has a low false-positive rate (Gaonkar and Davatzikos, 2013).

Convolutional neural network (CNN)
An all convolutional neural network was used (Springenberg et al., 2015), which is a fully convolutional network (FCN) architecture that uses standard convolutional layers with stride two instead of the pooling layers used in most CNNs. This approach was chosen as it has previously shown good classification performance for AD based on structural MRI (Cui and Liu, 2019; Basaia et al., 2019). The architecture used is shown in Fig. 1. Specifically, the network was built of 7 blocks, each consisting of a 3D convolutional layer (filter size 3; stride 1), followed by dropout, batch normalization (BN), and a rectified linear unit (ReLU) activation function, succeeded by a second 3D convolutional layer (filter size 3; stride 2), dropout, BN, and ReLU activation (Cui and Liu, 2019; Basaia et al., 2019). The number of filters changed over blocks: 16 filters in block 1, 32 in blocks 2 and 3, 64 in blocks 4 and 5, 32 in block 6, and 16 in block 7. The final output layer of the network was a softmax layer.
To artificially enlarge the training data set and to remove the class imbalance, data augmentation was used. The training set was augmented to 1000 samples per class based on the 'mixup' approach (Eaton-Rosen et al., 2018; Zhang et al., 2018). Mixup is a data-agnostic augmentation approach that is not based on spatial transformations, and therefore does not degrade the spatial normalization. Augmented samples were constructed by linearly combining two randomly selected images of the same class: a fraction of 80% of the first image was added to a fraction of 20% of the second image.
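The same-class mixup described above can be sketched as follows. The sketch assumes that the original images are kept and that blended samples are added until the target count is reached, and that the two images of a pair are distinct; neither detail is stated in the text. The function name and toy data are hypothetical.

```python
import numpy as np


def mixup_same_class(images, n_target, rng, weight=0.8):
    """Augment the images of one class to n_target samples by blending image pairs."""
    augmented = list(images)  # assumption: original images are kept
    n_new = max(n_target - len(images), 0)
    for _ in range(n_new):
        # Assumption: the two images of a pair are distinct.
        i, j = rng.choice(len(images), size=2, replace=False)
        augmented.append(weight * images[i] + (1.0 - weight) * images[j])
    return np.stack(augmented)


# Toy example (not study data): 5 images of one class, augmented to 12 samples.
rng = np.random.default_rng(0)
images = rng.random((5, 4, 4, 4))
augmented = mixup_same_class(images, n_target=12, rng=rng)
```

Applying the function separately to each class with the same target count yields a balanced training set, since both classes end up with the same number of samples.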
The network was compiled with a binary cross-entropy loss function and Adam optimizer (learning rate = 0.001, epsilon = 1e-8, decay = 0.0). To facilitate a stable convergence, the learning rate followed a step decay schedule, i.e. after every ten epochs the learning rate was divided by two. The dropout rate was set to 20%. Data were propagated through the network with a batch size of 4. Input images were normalized to zero mean and unit variance based on the augmented training set. A validation set was created by randomly splitting off 10% of the training data, which was not used for training but only for regularization by early stopping, i.e. training was stopped when the validation AUC had not increased for 20 epochs. The model of the epoch with the highest validation AUC was selected as the final model. Implementation was based on Keras and Tensorflow.
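The step decay schedule and the early stopping rule can be written down independently of any deep learning framework. The sketch below uses plain Python and zero-based epoch indices, which is an assumption; the function names and the example AUC sequence are hypothetical.

```python
def step_decay(epoch, initial_lr=0.001, drop=0.5, epochs_per_drop=10):
    """Learning rate for a zero-based epoch index: halved after every ten epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)


def early_stopping(val_aucs, patience=20):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation AUCs.

    Training stops at the first epoch at which the validation AUC has not
    increased for `patience` epochs; the model of `best_epoch` is kept.
    """
    best_auc, best_epoch = float("-inf"), -1
    for epoch, auc in enumerate(val_aucs):
        if auc > best_auc:
            best_auc, best_epoch = auc, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_aucs) - 1


# Made-up validation AUCs: the best value occurs at epoch 2, followed by a plateau.
val_aucs = [0.60, 0.70, 0.80] + [0.75] * 25
best_epoch, stop_epoch = early_stopping(val_aucs)
learning_rates = [step_decay(epoch) for epoch in range(stop_epoch + 1)]
```

In Keras, equivalent behaviour is typically obtained with learning rate scheduling and early stopping callbacks, although the callbacks actually used in the study are not specified in the text.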
To gain insight into the classifications, we made saliency maps that show which parts of the brain contributed the most to the prediction of the CNN, i.e. which voxels lead to an increase or decrease of the prediction score when changed. Saliency maps were made using guided backpropagation, changing the activation function of the output layers from softmax to linear activations (Springenberg et al., 2015). Maps were averaged over correctly classified AD patients (Rieke et al., 2018).

Analysis and statistics
Classification performance was quantified by the area under the curve (AUC) and accuracy. For AD-CN classification, the data of the ADNI AD and CN groups were randomly split for 20 iterations preserving relative class sizes in each training and testing sample, using 90% for training and 10% for testing. Random splits were the same for both SVM and CNN. In each iteration, classification model parameters were optimized on the training set as explained above. The models were optimized solely on the training set; the test set was used only for evaluation of the final model. Ninety-five percent confidence intervals (95%CI) for the mean performance measures were constructed using the corrected resampled t-test based on the 20 cross-validation iterations, thereby taking into account that the samples in the cross-validation splits were not statistically independent (Nadeau and Bengio, 2003). Subsequently, we retrained classifiers using all AD and CN participants from the ADNI as training set. These retrained classifiers were used for visualization and their performance was evaluated on three independent test sets: ADNI MCIc-MCInc, PND AD-SCD, and PND MCIc-MCInc. 95%CIs were obtained based on 500 bootstrap samples of the test set. Significant differences between classifiers were assessed using the non-parametric McNemar Chi-square test (Dietterich, 1998) (α < 0.013 after Bonferroni correction for 4 comparisons in each test set).
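Two of the statistical procedures above can be sketched in a few lines of standard-library Python. The first function builds a confidence interval with the variance correction of Nadeau and Bengio (2003), which inflates the sample variance by a factor (1/J + n_test/n_train) for J resampling iterations. The second is the continuity-corrected McNemar test in the form described by Dietterich (1998). Function names, the hard-coded t critical value, and all numbers in the example are illustrative assumptions, not study results.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 19 degrees of freedom
# (20 iterations minus one), hard-coded to avoid a SciPy dependency.
T_CRIT_DF19 = 2.093


def corrected_resampled_ci(scores, test_fraction, t_crit=T_CRIT_DF19):
    """95% CI for the mean of J resampled scores with the Nadeau-Bengio correction."""
    j = len(scores)
    mean = statistics.fmean(scores)
    variance = statistics.variance(scores)  # unbiased sample variance
    ratio = test_fraction / (1.0 - test_fraction)  # n_test / n_train
    half_width = t_crit * math.sqrt((1.0 / j + ratio) * variance)
    return mean - half_width, mean + half_width


def mcnemar_test(n01, n10):
    """Continuity-corrected McNemar chi-square test.

    n01: test cases misclassified by classifier A only.
    n10: test cases misclassified by classifier B only.
    """
    if n01 + n10 == 0:
        return 0.0, 1.0
    chi2 = max(abs(n01 - n10) - 1, 0) ** 2 / (n01 + n10)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Made-up AUCs for 20 iterations with a 90%/10% train/test split.
scores = [0.93, 0.95] * 10
ci_low, ci_high = corrected_resampled_ci(scores, test_fraction=0.1)

# Made-up discordant counts; Bonferroni threshold for 4 comparisons.
chi2, p_value = mcnemar_test(n01=15, n10=5)
alpha = 0.05 / 4
significant = p_value < alpha
```

With a 10% test fraction, the correction term n_test/n_train is about 0.11, more than twice the 1/J term for J = 20, so the corrected interval is considerably wider than a naive t-interval.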
Trained models, lists of included subjects and all code used in preparation of this article are available from https://gitlab.com/radiology/neuro/bron-cross-cohort.
The performance of external validation, i.e. the application of the classifiers in the PND data set, is shown in Fig. 4. For AD-SCD diagnosis, the AUC for SVM was 0.896 (95%CI: 0.855–0.932) and that for CNN was 0.876 (95%CI: 0.836–0.913). Both AUC and accuracy followed the same patterns as in ADNI: SVM and CNN showed similar performance and modulated GM maps yielded higher classification performance than minimally processed T1w images (McNemar's test; p < 0.01 for SVM, p = 0.01 for CNN). Performances were, however, slightly lower; PND confidence intervals for AUC (but not for accuracy) overlapped with those of ADNI.
For prediction of MCI conversion in PND, classification performance was also lower than that in ADNI. For the modulated GM maps, the AUC for CNN was 0.702 (95%CI: 0.624–0.786) and that for SVM was 0.665 (95%CI: 0.576–0.760). Confidence intervals were relatively large and overlapped with those in the ADNI data. No significant differences were seen between classifiers or between pre-processing approaches.
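The width of such intervals follows from how they are constructed: 95%CIs on the independent test sets were based on 500 bootstrap samples of the test set. A minimal sketch of a bootstrap interval for the AUC is given below. It assumes a percentile interval and redraws bootstrap samples that contain only one class; both are assumptions, as the exact procedure is not described in the text. The scores are synthetic and only mimic the PND MCI group sizes (48 converters, 91 non-converters).

```python
import random


def auc(scores, labels):
    """AUC as the probability that a positive case scores higher than a negative case."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def bootstrap_auc_ci(scores, labels, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC of a test set."""
    rng = random.Random(seed)
    n = len(scores)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        boot_labels = [labels[i] for i in idx]
        if 0 < sum(boot_labels) < n:  # both classes must be present
            aucs.append(auc([scores[i] for i in idx], boot_labels))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic classifier outputs (not study data) for 48 converters and 91 non-converters.
rng = random.Random(1)
scores = [rng.gauss(0.8, 1.0) for _ in range(48)] + [rng.gauss(0.0, 1.0) for _ in range(91)]
labels = [1] * 48 + [0] * 91

point_estimate = auc(scores, labels)
ci_low, ci_high = bootstrap_auc_ci(scores, labels)
```

With a test set of this size, the resampled AUCs vary considerably from one bootstrap sample to the next, which is consistent with the relatively wide intervals reported for the prediction task.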
Brain regions that contributed to the classifications are visualized using SVM p-maps in Fig. 5 and using CNN saliency maps in Fig. 6. The SVM p-map for the minimally processed T1w images showed small clusters of significant voxels, mainly located in the medial temporal lobe (hippocampus), around the ventricles and at larger sulci at the outside of the brain. For modulated GM maps, clusters of significant voxels in the p-map were larger and predominantly visible in the hippocampus. In addition, smaller clusters were located in the rest of the temporal lobe and the cerebellum. CNN saliency maps showed a very limited contribution of the temporal lobe. Instead, the saliency map for the T1w images mainly showed contribution of voxels at the edge of the brain, in white matter regions around the ventricles and in the cerebellum. For modulated GM maps, clusters of contributing voxels were located in the subcortical structures, the white matter around the ventricles and the cerebellum.

Discussion
We performed a comparative study focusing on the generalizability of diagnostic and predictive performance of machine learning based on MRI data of the ADNI research cohort to the PND multi-center data set representing a tertiary memory clinic population. Both cross-validation and external validation results for AD-CN diagnosis showed similar performance for the deep learning classifier and the conventional classifier. Both approaches significantly benefited from the use of modulated GM maps instead of minimally processed T1w images. Application to MCI conversion prediction yielded higher performance for SVM than for CNN in ADNI, but this was not seen in PND. Performances were in line with the state-of-the-art (Rathore et al., 2017; Wen et al., 2020; Ansart et al., 2021). For MCI conversion prediction, Ansart et al. (2021) showed that performance of current methods converges to an AUC of about 75% as the number of subjects increases, which aligns with our results.
While in many medical imaging applications CNNs convincingly outperformed conventional classifiers (Litjens et al., 2017), our results showed similar performance for CNN and SVM, which confirms the findings by Wen et al. (2020). Other CNN designs could possibly improve on this, but we made an effort to follow the state-of-the-art for CNN design. Promising developments to further improve performance could come from changes in network architecture (e.g., successful standard architectures like InceptionNet or ResNet, adversarial training, discriminative auto-encoders) and improvements in data collection and handling (e.g., larger datasets to learn more complex models, or pretraining on other collections of brain imaging data). In addition, data augmentation could play a role in further improvement. While a strength of the mixup approach is that it is data-agnostic, an augmentation approach that uses, for example, prior knowledge may have added value.
This work shows that the need for dedicated pre-processing is lower for CNN than for SVM, but that such pre-processing nevertheless has added value for the performance. While we evaluated only one implementation of the pre-processing procedure (Bron et al., 2014), we expect that alternative implementations (e.g. SPM12, FSL-VBM) could have slightly changed the results but would have led to the same conclusions. With sufficiently large datasets, the need for dedicated pre-processing including spatial normalization may be reduced.
Although SVM and CNN classifiers yielded similar performance, their visualizations showed different brain regions to be involved in the classification. SVM significance maps showed a clear contribution of the hippocampus and medial temporal lobe, as previously shown and expected based on prior knowledge (Bron et al., 2017). CNN saliency maps showed involvement of subcortical structures, regions prone to white matter hyperintensities, and the cerebellum. For both classifiers, classification based on minimally processed T1w images showed voxels at the edge of the brain to be involved, which is expected as only similarity transformation to template space had been performed. In addition to the brain edges, the CNN classifier, which outperformed the SVM for these minimally processed input images, also highlighted regions similar to those shown by the saliency map for the modulated GM images. This may imply that the CNN's non-linear operations, in contrast to the linear kernel of the SVM, could extract feature maps that partly resemble modulated GM maps. The regions highlighted by the CNN saliency maps could possibly be related to AD using prior knowledge, but we refrain from over-interpretation here. It is, however, unexpected that the medial temporal lobe is not covered, in contrast to previous CNN saliency maps on ADNI data (Dyrba et al., 2020; Rieke et al., 2018). Differences between the SVM and CNN classifiers in involved brain regions could be attributed both to the differences in the classification approaches and to the differences in the visualization techniques used. If the first reason dominates, i.e. if the classifiers actually use different brain regions, combining classifiers into a hybrid approach would be an interesting future direction. However, for a full understanding of the brain regions involved in CNN-based classification of AD, further research is required.
This work is one of the few to address how AD classification performance of MRI-based machine learning generalizes to an independent cohort (Wen et al., 2020; Hall et al., 2015; Bouts et al., 2019; Archetti et al., 2019). On the PND data, the resulting AUC values (0.896 for SVM, 0.876 for CNN) were competitive with values reported for AD-CN in the literature, but they were still 0.04–0.07 lower than those in the ADNI cross-validation experiment. The main patterns in the results corresponded between ADNI and PND data, i.e. similar performance for SVM and CNN and added value of dedicated MRI processing. For prediction in MCI, AUC values in the PND data set were 0.04–0.10 lower than those in ADNI. Overall, similar to experiments by Wen et al. (2020) and Hall et al. (2015), we observed only a minor performance drop. This largely preserved performance could be related to the similarities between the ADNI and PND studies, which include a multi-center set-up, within-study standardization of cognitive protocols, and diagnostic criteria for AD (McKhann et al., 1984, 2011) and MCI (Petersen, 2004). The performance reduction could be attributed to differences between the studies, such as the MRI protocols (all high-resolution T1w, but more homogeneous within ADNI than within PND), country of origin (United States vs. the Netherlands), control population (a combination of cognitively normal and SCD vs. SCD only), MCI population (amnestic MCI only vs. a broad MCI group) and patient inclusion criteria (ADNI used hard cut-offs on cognitive scores and clinical dementia rating whereas PND did not) (Petersen et al., 2010; Aalten et al., 2014). Studies that found much worse generalizability in their experiments described larger differences in inclusion and diagnostic criteria between training data and validation data than we did (Bouts et al., 2019; Wen et al., 2020).
A limitation of this study is that the diagnosis was based on clinical criteria rather than post-mortem histopathological examination. Although diagnosis was typically confirmed by follow-up, it is possible that some of the patients were misdiagnosed. An alternative could be to use amyloid data from PET imaging or cerebrospinal fluid to classify AD pathology instead of relying on the clinical diagnosis (e.g., Son et al. (2020)). In addition, because of the limited availability of diagnostic information at follow-up in the PND data set, its MCI group is relatively small. This is reflected by the large confidence intervals for the performance metrics in the prediction task. To maximize the number of PND MCI participants, we chose to use the last available time point for final diagnosis. As a result, the time-to-prediction ranged between 1 and 5 years, whereas for ADNI a fixed time interval of three years was chosen. As time-to-prediction is related to predictive performance (Ansart et al., 2021), a fixed time interval would be preferred for inter-cohort performance comparison.
While the external validation performance was quite high, some performance drop was observed, as expected. Therefore, research focusing on approaches to mitigate such performance drops, such as transfer learning, is highly relevant (Wachinger and Reuter, 2016). In addition, whereas this work only exploited structural MRI, other works have shown that performance can be increased with the use of multi-modal inputs, i.e. cognitive test scores, fluid-based biomarker measurements, genetic information and other imaging modalities such as PET, diffusion MRI or perfusion MRI (Bron et al., 2017; Ansart et al., 2021; Venkatraghavan et al., 2019). While multi-modal classification would therefore be a logical and important extension, this may also lead to a decrease of generalizability, as chances of differences between studies increase with multiple modalities.
In conclusion, classification performance of ADNI data generalized well to the multi-center PND biobank cohort representing tertiary memory clinic patients, with only a minor drop in performance. Conventional SVM classifiers and deep learning approaches using CNN showed comparable results, and both methods benefited from dedicated MRI processing using modulated GM maps. We hope that external validation results like those presented here will contribute to setting next steps towards the implementation of machine learning in clinical practice for aiding diagnosis and prediction.

Figure 2: Cross-validation performance for classification of the Alzheimer's disease patients (AD) and controls (CN) of the ADNI data set expressed by (a) area under the ROC curve (AUC) and (b) accuracy. Performance is shown for SVM and CNN classifiers using two inputs: minimally processed T1w scans and modulated GM images. Error bars indicate 95%CIs.

Figure 4: Classification performance on the PND data set: (a) area under the ROC curve (AUC) and (b) accuracy. Classifiers were trained on ADNI AD-CN and applied to PND AD-SCD (left figures) and PND MCIc-MCInc (right figures). Performance is shown for SVM and CNN classifiers using two inputs: minimally processed T1w scans and modulated GM images. Error bars indicate 95%CIs. P-values for significant differences are shown in (b).

Figure 5: Visualization of the SVM classifiers using analytic significance maps (p-maps) based on two inputs: (a) minimally processed T1w images and (b) modulated GM maps.

Table 1: Demographics for the ADNI data set.

Table 2: Demographics for the PND data set. FU: follow-up time.