Introduction

Dementia is a syndrome in which cognitive function deteriorates beyond what might be expected from the usual consequences of biological aging. Currently, more than 55 million people live with dementia worldwide, and this number is projected to rise to 78 million in 2030 and 139 million in 2050 [1]. Alzheimer’s disease (AD) is the most common form of dementia and may contribute to 60–70% of cases. It has physical, psychological, social, and economic impacts on people living with dementia, their carers, families, and society. Early diagnosis allows clinicians to initiate treatment as early as possible, arresting or slowing disease progression more effectively [2].

Many machine learning approaches have been proposed for the automatic classification of AD stages, which range from cognitively normal (CN), through the intermediate state of mild cognitive impairment (MCI), to the final AD stage. Some exploit longitudinal studies to estimate the progression from one stage to another [3,4,5,6], as reviewed in recent surveys [7,8,9]. Other approaches rely on cross-sectional studies, aiming to classify the degree of the disease based on the data from a single, predetermined visit [10,11,12,13]. Some of them [12,13,14,15,16] focus on traditional machine learning (ML) techniques and rely on hand-designed features extracted from the data according to domain-specific knowledge from AD research. Other methods [9, 11, 17,18,19,20,21,22,23,24,25,26] exploit the ability of deep learning (DL) architectures to automatically discover the discriminant features in the data. Examples of recent ML- and DL-based methods are given in “Comparison with the State-of-the-Art”.

Biomedical image analysis has become a significant research field for various biomedical applications [27,28,29,30,31,32,33]. In the case of AD, the most frequently adopted imaging data include magnetic resonance images (MRIs) and positron emission tomography images (PETs) [9, 11, 17,18,19, 22, 34, 35]. In addition, other data modalities are commonly taken into account, including omics data (e.g., gene expression (GE) data [12, 36, 37]), sometimes coupled with clinical data [2, 6, 21, 38]. Recently, some research has started focusing on the integration of omics data with information from biomedical images [13, 39,40,41,42]. These omics imaging methods, bringing together information from different sources, can reveal hidden genotype–phenotype relationships, helping to understand the onset and progression of many diseases and to identify new diagnostic and prognostic biomarkers [43].

In [13], we proposed an ML-based omics imaging approach to AD classification that relies on data acquired by non-invasive techniques. The multi-modal data, derived from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset, consisted of omics and imaging features extracted from GE values of blood samples and from MRIs, respectively. We showed how a suitable integration of these data modalities, using well-known ML techniques, can often lead to better results than either of them taken separately.

Here, we explored a different dataset, namely the ANMerge dataset [3], and performed a thorough performance analysis of AD classification results. Specifically, we varied the settings of the evaluation procedure based on cross-validation (CV), testing three different classifiers, four atlases for extracting the MRI features, and different criteria for extracting the GE features. Analogous experiments have also been extended to data from ADNI to perform a fair comparison between the two sets of results. Finally, we also considered the adoption of clinical information (demographic features and cognitive tests), exploring the advantages of combining it with the imaging and omics data types. As an added value, we make publicly available the software implementing the evaluation procedure for AD classification, to simplify the comparison with results from other methods.

The rest of the paper is organized as follows. “Materials and Methods” describes the data adopted in the experiments for classifying AD patients and the procedures used to extract their features. “Experiments” presents the evaluation procedure and discusses the results achieved with the proposed framework. Finally, “Conclusions” concludes our paper and gives some future research directions.

Materials and Methods

The omics imaging data adopted in [13] for AD classification come from the ADNI database. For a detailed description of these data and of the extraction of the related features, the reader can refer to [13]. Here, we considered analogous features obtained from the ANMerge dataset [3], an improved and updated version of AddNeuroMed [44]. It provides multi-modal data from more than 1,700 participants in a longitudinal study, including clinical assessments, MRIs, genotyping, transcriptomic profiling, and blood plasma proteomics.

We adopted GE values extracted from blood samples together with MRIs, as we aim to use multi-modal information integrating omics and imaging data acquired by non-invasive techniques. In the ANMerge dataset, these data, available for selected subsets of patients, both come from the first visit, so they are already aligned in time. Table 1 summarizes the number of ANMerge patients having MRI, GE, or both types of data for their first visit, together with their diagnosis (AD, MCI, or CN). It can be observed that the three classes of the MRI+GE subset of ANMerge data selected for the experiments (in the following, simply referred to as the subset of ANMerge data) are very well balanced. This was not the case for the subset of ADNI data having both MRI and GE data used in [13], which consisted of 42 AD, 428 MCI, and 250 CN patients.

Table 1 Number of ANMerge patients having MRI, GE, or both types of data, and their diagnosis (AD, MCI, or CN)

Imaging Data: MRI

To maintain consistency with [13], rather than using the imaging features made available in ANMerge, we extracted imaging features from the MRIs using the Clinica open-source framework [9, 45, 46].

The basic step common to all Clinica pipelines for preprocessing and feature extraction is the conversion of the dataset into the Brain Imaging Data Structure (BIDS) format [47]. Although Clinica provides automatic conversion tools for some publicly available datasets (including ADNI), there is none for ANMerge. Therefore, we prepared Matlab/Python scripts for centering the MRIs of all selected ANMerge patients and placing them into the BIDS structure.
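
The conversion itself is straightforward once the target layout is known. The following is a minimal Python sketch of the kind of script involved; the source directory, filename pattern, and patient identifiers are hypothetical placeholders, not the actual ANMerge layout:

```python
# Hypothetical sketch: arrange raw T1-weighted scans into a BIDS tree.
# Assumes one NIfTI file per patient in RAW_DIR, named <patient_id>.nii.gz.
import shutil
from pathlib import Path

RAW_DIR = Path("anmerge_raw")    # placeholder source directory
BIDS_DIR = Path("anmerge_bids")  # BIDS root consumed by Clinica

for scan in sorted(RAW_DIR.glob("*.nii.gz")):
    subject = f"sub-{scan.stem.split('.')[0]}"          # e.g., sub-ANM001
    anat_dir = BIDS_DIR / subject / "ses-M00" / "anat"  # baseline visit
    anat_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(scan, anat_dir / f"{subject}_ses-M00_T1w.nii.gz")

# BIDS requires a dataset_description.json file at the root.
(BIDS_DIR / "dataset_description.json").write_text(
    '{"Name": "ANMerge", "BIDSVersion": "1.4.0"}'
)
```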

Then, we adopted the Clinica framework to generate voxel-based features from the MRIs, which have been shown [9] to lead to high-performance results with the Support Vector Machine (SVM) [48] classifier. The Unified Segmentation procedure [49] is first applied, simultaneously performing tissue segmentation, bias correction, and spatial normalization of each input image. Next, a group template is created with the DARTEL algorithm for diffeomorphic image registration [50], using the subjects’ tissue probability maps in native space obtained from segmentation. The DARTEL to MNI method [50] is then applied, registering the native space images into the MNI (Montreal Neurological Institute) space using routines from the Statistical Parametric Mapping (SPM) package. These steps transform all the images into a common space, providing a voxel-wise correspondence across subjects. A set of imaging features is finally extracted based on regional measurements, where the anatomical regions are obtained from an atlas in the MNI space and the average gray matter density is computed in each region. In the experiments, we considered the regional features derived from four different atlases: AAL2 [51] (121 features), AICHA [52] (385 features), Hammers [53, 54] (69 features), and LPBA40 [55] (57 features).
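
For illustration, the final feature-extraction step amounts to averaging the normalized gray matter density map over each atlas region. Below is a minimal Python sketch using nibabel and numpy; the file names are placeholders, since the real maps are produced by the Clinica pipeline described above:

```python
# Sketch of regional feature extraction: mean gray matter (GM) density per
# atlas region, both maps already registered to MNI space (placeholder files).
import nibabel as nib
import numpy as np

gm = nib.load("sub-ANM001_gm_density_mni.nii.gz").get_fdata()
atlas = nib.load("atlas_mni.nii.gz").get_fdata().astype(int)  # labels, 0 = background

labels = np.unique(atlas)
labels = labels[labels != 0]
# One feature per region: mean GM density over the voxels carrying that label.
features = np.array([gm[atlas == lab].mean() for lab in labels])
```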

Omics Data: GE

Normalized gene expression data from blood transcriptomics of ANMerge participants were downloaded via Synapse. The post-quality-control, batch-corrected expression values described in [3] were used for the differential abundance analysis. The lumiHumanIDMapping R package v. 1.10.1 [56] was used to map nuIDs to gene symbols. The dataset contained 5213 nuIDs for 691 samples from participants under 90 years of age. Multiple probes corresponding to the same gene symbol were aggregated by their median value. The Limma R package v. 3.46.0 [57] was used to perform differential expression analysis with linear models and to find significant differentially expressed genes (DEGs) from the three unpaired two-class contrast matrices (AD vs. CN, AD vs. MCI, MCI vs. CN). Several filtering criteria based on the log-fold change (LogFC), which quantifies the difference in gene expression between two conditions, and the Benjamini–Hochberg adjusted p value (BH-adj.pvalue) were applied to each contrast to deem genes significant and select the omics features: (1) AD vs. CN: BH-adj.pvalue \(\le 0.05\), 308 genes (“AD-CN_pv005”); (2) AD vs. MCI: BH-adj.pvalue \(\le 0.05\), 6 genes (“AD-MCI_pv005”) and BH-adj.pvalue \(\le 0.1\), \(|\text{LogFC}| \ge 0.3\), 59 genes (“AD-MCI_pv01_LFC03”); (3) MCI vs. CN: BH-adj.pvalue \(\le 0.05\), \(|\text{LogFC}| \ge 0.3\), 472 genes (“MCI-CN_pv005_LFC03”) and BH-adj.pvalue \(\le 0.05\), \(|\text{LogFC}| \ge 0.4\), 42 genes (“MCI-CN_pv005_LFC04”).
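
The differential expression analysis itself was carried out in R with Limma; the following is a hedged Python/pandas sketch of the two surrounding steps only, probe aggregation and DEG-based feature selection, assuming hypothetical CSV exports of the expression matrix, the nuID-to-symbol mapping, and one Limma results table with "gene", "logFC", and "adj.P.Val" columns:

```python
# Sketch of probe aggregation and DEG filtering (file names are hypothetical).
import pandas as pd

expr = pd.read_csv("anmerge_expression.csv", index_col=0)  # nuID x sample matrix
mapping = pd.read_csv("nuid_to_symbol.csv", index_col=0)   # nuID -> "symbol" column

# Collapse multiple probes mapping to the same gene symbol by their median value.
expr_by_gene = expr.join(mapping).groupby("symbol").median()

# Select significant DEGs for one contrast, e.g., "MCI-CN_pv005_LFC03":
# BH-adjusted p value <= 0.05 and |logFC| >= 0.3.
deg = pd.read_csv("limma_MCI_vs_CN.csv")
selected = deg[(deg["adj.P.Val"] <= 0.05) & (deg["logFC"].abs() >= 0.3)]

# Keep only the selected genes as omics features.
omics_features = expr_by_gene.loc[selected["gene"]]
```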

Clinical Data: Clin (Dem+Cog)

Clinical data, possibly together with other types of data, are often adopted as features for AD classification (e.g., [2, 6, 21, 38]). As we already observed in [13], some of these data are considered by medical doctors to diagnose the disease state of each patient, are directly adopted for labeling patients as belonging to different classes [58], or are used as criteria for including/excluding patients from AD datasets, as in [3]. Therefore, their use as features for AD classification appears to bias the results strongly (and positively). Here, we investigated their role in classification performance, substantiating our doubts.

Among the clinical information included in the ANMerge dataset, we selected five features that are available for most of the patients: three demographic features (sex, age, and years of education) and two cognitive test scores, the Clinical Dementia Rating Sum of Boxes (CDR_SOB) and the Mini-Mental State Examination (MMSE). The few missing values were imputed through kNN, separately for each class, to introduce the minimum possible alteration of the data. The statistics of the subset of ANMerge data, computed on the chosen clinical features, are reported in Table 2. In the experiments, we considered the results of using demographic features, cognitive tests, or both types of clinical data, also coupled with omics and imaging data.
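
A minimal Python sketch of the class-wise kNN imputation follows; the input file, the column names, and the number of neighbors are assumptions (the text specifies kNN but not k), and sex is assumed to be already numerically encoded:

```python
# Class-wise kNN imputation of missing clinical values (hypothetical file and
# columns; k=5 is an assumption).
import pandas as pd
from sklearn.impute import KNNImputer

clin = pd.read_csv("anmerge_clinical.csv")  # sex, age, education, CDR_SOB, MMSE, diagnosis
feature_cols = ["sex", "age", "education", "CDR_SOB", "MMSE"]

imputed_parts = []
for diagnosis, group in clin.groupby("diagnosis"):  # impute AD, MCI, CN separately
    imputer = KNNImputer(n_neighbors=5)
    part = group.copy()
    part[feature_cols] = imputer.fit_transform(group[feature_cols])
    imputed_parts.append(part)

clin_imputed = pd.concat(imputed_parts).sort_index()  # restore original row order
```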

Table 2 Summary statistics describing the subset of ANMerge data

Experiments

Evaluation Procedure

The evaluation procedure adopted in the experiments, illustrated in Fig. 1, is similar to the one used in [13].

Fig. 1 Scheme of the evaluation procedure

It consists of NumIter iterations of k-fold cross-validation, with stratified partitions of the data into training and test subsets, using different classifiers. At each iteration, data are standardized by z-scoring the training folds and using their mean and variance to z-score the test fold accordingly. The performance results are computed as averages, over the NumIter iterations, of a set of well-known metrics: accuracy (Acc), sensitivity (Sens), specificity (Spec), precision (Prec), F-measure (F1), geometric mean (Gm), area under the ROC curve (AUC), Matthews correlation coefficient (MCC), and balanced accuracy (BA). For their descriptions, please refer to [13]. We report so many metrics to make it easier to compare other methods with our results. However, most of our conclusions are based on the values achieved for MCC and BA, which provide clear overall results regardless of any class imbalance [59, 60]. The Matlab scripts implementing the evaluation procedure and the computation of the performance metrics are made publicly available through our web pages.
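
The released scripts are in Matlab; the following Python sketch shows the core logic of the procedure with a linear SVM (the fold-wise z-scoring and the MCC/BA averaging match the description above, while the other metrics are omitted for brevity):

```python
# Sketch of the evaluation procedure: NumIter repetitions of stratified k-fold
# CV, with standardization parameters estimated on the training folds only.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import matthews_corrcoef, balanced_accuracy_score

def evaluate(X, y, num_iter=50, k=5):
    """X: numpy array (samples x features); y: class labels."""
    mcc, ba = [], []
    for it in range(num_iter):
        skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=it)
        for train_idx, test_idx in skf.split(X, y):
            scaler = StandardScaler().fit(X[train_idx])  # train-fold mean/variance
            clf = SVC(kernel="linear").fit(scaler.transform(X[train_idx]), y[train_idx])
            y_pred = clf.predict(scaler.transform(X[test_idx]))
            mcc.append(matthews_corrcoef(y[test_idx], y_pred))
            ba.append(balanced_accuracy_score(y[test_idx], y_pred))
    # Average over all folds of all iterations.
    return np.mean(mcc), np.mean(ba)
```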

Evaluation on ANMerge Data

The described procedure has been applied to the subset of ANMerge data for each of the three binary problems. We fixed the number of CV folds to 5 and the number of CV iterations to 50, and varied the classifier (SVM with linear kernel, kNN with k=5, and logistic regression (LR)), the atlas for imaging features (AAL2, AICHA, Hammers, and LPBA40; see “Imaging Data: MRI”), and the filtering criterion for extracting the GE features (AD-CN_pv005 for AD vs. CN; AD-MCI_pv005 and AD-MCI_pv01_LFC03 for AD vs. MCI; MCI-CN_pv005_LFC03 and MCI-CN_pv005_LFC04 for MCI vs. CN; see “Omics Data: GE”).
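
For reference, this grid of settings can be summarized as follows (a sketch using scikit-learn estimators, consistent with the CV procedure sketched in the previous section; the LR max_iter value is an assumption added for convergence):

```python
# The experimental grid: three classifiers, four atlases, task-specific GE criteria.
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

classifiers = {
    "SVM": SVC(kernel="linear"),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "LR": LogisticRegression(max_iter=1000),
}
atlases = ["AAL2", "AICHA", "Hammers", "LPBA40"]
ge_criteria = {
    "AD vs. CN": ["AD-CN_pv005"],
    "AD vs. MCI": ["AD-MCI_pv005", "AD-MCI_pv01_LFC03"],
    "MCI vs. CN": ["MCI-CN_pv005_LFC03", "MCI-CN_pv005_LFC04"],
}
# Each (classifier, atlas, GE criterion) combination is run through the
# 5-fold, 50-iteration CV procedure sketched above.
```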

The best performance results for each of the ANMerge binary problems are reported in Table 3. They are those showing the highest MCC and BA values, almost always achieved using the SVM classifier, except for all the cases of MRI data alone and for the MRI+GE data in the AD vs. MCI task, where LR leads to better performance. The atlas for MRI data was AAL2 in all cases, except for the MRI+GE data in the AD vs. MCI task, where LPBA40 leads to better performance. The chosen criteria for GE data were AD-CN_pv005 for the AD vs. CN task, AD-MCI_pv005 for the AD vs. MCI task, and MCI-CN_pv005_LFC03 for the MCI vs. CN task.

Based on the described best choices for setting up the evaluation procedure, Table 3 suggests that using omics imaging data (MRI+GE) from the subset of ANMerge data leads to better classification results than imaging or omics data alone for all three binary classification tasks.

Table 3 Best performance results on the subset of ANMerge data

Comparison with Results on ADNI Data

To better compare the results achieved on the subset of ANMerge data with those obtainable on the ADNI dataset, we repeated the experiments reported in [13], fixing the same parameter values as above (5 folds and 50 iterations) and varying the classifier and the atlas as done for ANMerge. The method for extracting the GE features varied among those described in [13] (SAM-Tstat and SAM-Wilc for both AD vs. CN and AD vs. MCI, and Topvar300 for MCI vs. CN). The best performance results for each of the ADNI binary problems are reported in Table 4. The highest MCC and BA values were reached using the SVM classifier, except for the GE and MRI+GE cases in the MCI vs. CN task, where LR leads to better results. As in [13], the best choices for the AD vs. CN task are confirmed to be the AICHA atlas and the SAM-Wilc method, and the same holds for the MCI vs. CN task with the AAL2 atlas and the Topvar300 method. However, for the AD vs. MCI task, the best atlas was LPBA40 (instead of AICHA), and the best GE extraction method was SAM-Tstat (instead of SAM-Wilc).

Table 4 Best performance results on the subset of ADNI data

The overall results obtained in [13] for the ADNI data were confirmed: omics imaging data achieve better performance than imaging or omics data alone for all the binary tasks except MCI vs. CN. Besides this task being much more challenging than the others, the worse results on ADNI (but not on ANMerge) are probably due to its more unbalanced class distribution.

Apart from the different choices for the MRI atlas and the GE extraction procedure, the performance results on ADNI data in terms of MCC and BA appear similar to those on ANMerge data for the AD vs. CN and AD vs. MCI tasks. However, probably due to the strong class imbalance for these tasks in the subset of ADNI data, the accuracy for the AD minority class (i.e., Sens) is much lower than that for the majority class in each task (i.e., Spec). This phenomenon is absent in the results on the well-balanced subset of ANMerge data, where Sens and Spec achieve similar values.

Instead, a significant performance improvement can be observed when using data from ANMerge for the MCI vs. CN task, an open issue already highlighted in [13]. This appears to be connected to the more significant GE features that could be extracted from ANMerge compared to those extracted from the ADNI data.

Evaluation Using Also Clinical Data

The best classification results obtained using demographic features (Dem), cognitive tests (Cog), or all clinical (Dem+Cog) data, also in combination with MRI and GE data, are reported in Table 5. The best performance values have been chosen by fixing the SVM classifier while varying the MRI atlas and the GE criterion.

Table 5 Best performance results on the subset of ANMerge data using also clinical data

From Table 5, we can observe that clinical data alone, and mainly just the two cognitive test scores, always lead to better performance than any of the other types of data or their combinations. This is undoubtedly due to their strong link with the medical doctor’s diagnosis and justifies our concern about using these features if a fair evaluation is sought. Indeed, cognitive test scores are confirmed to be strongly informative of patients’ AD status. However, we believe these tests alone cannot solve the problem, as patients become acquainted with them over time, so their scores in subsequent visits would be biased by the acquired experience. Therefore, an automatic tool for AD classification based on omics imaging data acquired by non-invasive techniques, such as the one proposed, could help clinicians make their diagnoses.

Comparison with the State-of-the-Art

As ANMerge is relatively recent, we could not find other methods using this dataset to evaluate AD classification in cross-sectional studies. Nonetheless, to provide a rough performance comparison with state-of-the-art approaches, in Table 6 we report the classification performance of several methods on various datasets published in the recent literature. Our best results from Table 3 are also included for more immediate comparison. For each classification problem, methods are grouped as DL-based (top) and ML-based (bottom). For each method, we report an acronym (formed from the last name of the first author and the publication year) and reference, the adopted dataset, the number of samples per class, the type(s) of data modality, and the performance values in terms of the most commonly used metrics (Acc, Sens, AUC, and BA).

Table 6 Performance (%) on various datasets of recent methods for AD classification

DL-based methods considered in Table 6 include [9, 11, 17,18,19,20,21,22, 24,25,26]. Aderghal et al. [17] integrate the MRI and DTI (Diffusion Tensor Imaging) modalities from ADNI data. Due to the scarcity of DTIs, they adopt cross-modal transfer learning from MRIs to DTIs and combine the classification results of multiple CNNs by a majority vote. In [18], Backstrom et al. propose a 3D CNN for AD vs. CN classification using ADNI MRIs. Multiple MRIs per subject, acquired at different times, are considered; the results reported in Table 6 are those obtained by the authors using a subject-separated data partitioning strategy. Li et al. [19] propose a classification method based on multiple cluster dense convolutional neural networks (DenseNets) to learn features from ADNI MRIs. Each whole-brain image is first partitioned into different local regions, and a fixed number of 3D patches is extracted from each region. These patches are grouped into different clusters with k-means clustering, and a DenseNet is constructed to learn the patch features for each cluster. The features learned from the discriminating clusters of each region are combined for classification, and the results from different local regions are integrated to enhance the final image classification. Pan et al. [20] propose a DL framework for AD diagnosis based on MRI and PET images from the ADNI dataset. Missing PET images are imputed from the corresponding MRI data using 3D Cycle-consistent Generative Adversarial Networks (3D-cGAN) to capture their relationship. A deep multi-modal multi-instance neural network is then used for AD classification on subjects with both MRI and PET (either real or synthetic). Senanayake et al. [21] use 3D MR volumes and neuropsychological measure-based (NM) feature vectors from ADNI. To combine these two data sources, which have very different dimensions (35 NM features against more than ten million features from 3D MR volumes), they propose a DL-based pipeline that reduces the dimension of the MRI features to one comparable with that of NM and uses the feature vector merging the two sets of features. The accuracy values reported in Table 6 have been extracted from their bar plots. Shi et al. [22] propose a multi-modal algorithm based on a stacked deep polynomial network (MM-SDPN). Two SDPNs are first used to learn high-level features from MRIs and PETs coming from the ADNI dataset, taken separately. These are then fed to another SDPN that fuses the multi-modal neuroimaging information, retaining the intrinsic properties of both modalities and their correlation. Bae et al. [11] develop a CNN-based algorithm to classify AD patients using coronal slices of T1-weighted MRIs that cover the medial temporal lobe. The performance results reported in Table 6 come from the within-dataset validation performed by the authors on data from the ADNI and the Seoul National University Bundang Hospital (SNUBH) datasets. Islam et al. [24] propose an approach for generating synthetic brain PET images to build a large-scale dataset for training DL models for AD classification. The PET image generator exploits Deep Convolutional Generative Adversarial Networks (DCGANs) [62], learning on the three AD stages. A 2D CNN model using axial, coronal, and sagittal slices from the generated PET data is finally adopted for classification. Jo et al. [25] propose a 3D CNN-based DL model on PET images for AD classification, adopting random under-sampling (RUS) for class balancing. Wen et al. [9] present an open-source framework for AD classification using CNNs and T1-weighted MRIs and use it to compare different CNN architectures. Table 6 reports the best results obtained by training the DL models on a subset of ADNI data and testing them on the remaining ADNI data, as well as on the AIBL (Australian Imaging, Biomarkers and Lifestyle Flagship Study of Ageing) and OASIS (Open Access Series of Imaging Studies) datasets. Yu et al. [26] propose the THS-GAN (Tensor-train decomposition, Higher-order pooling, and Semi-supervised learning-based GAN) method for AD classification on ADNI data. The tensor-train decomposition is applied to all layers in the classifier and discriminator, reducing the number of parameters while exploiting the structural information of the brain. The higher-order pooling leverages the second-order statistics of the MRIs, effectively capturing long-range dependencies between slices along different directions. Moreover, the model is designed in a semi-supervised manner to take advantage of both labeled and unlabeled MRIs.

ML-based methods considered in Table 6 include [12,13,14,15,16, 23]. Hett et al. [14] propose a texture-based grading framework based on 3D Gabor filters to better capture the structural alterations caused by AD. An adaptive patch-based fusion strategy based on a local confidence criterion is adopted to combine all the grading maps estimated on texture maps. Moreover, contrary to usual grading-based methods that use the average grading values over the considered ROI, a classification step based on a nonparametric representation of the grading value distribution is applied to better discriminate the pathology stages. Zheng et al. [15] describe a network-based approach built on multiple morphological features to enhance the MRI-based accuracy of AD classification. A multifeature-based network (MFN) is constructed for each patient using sparse linear regression performed on six types of morphological features to promote structure-based diagnosis. SVM was adopted to examine the diagnostic performance of the MFN through cross-validation. The performance values reported in Table 6 were obtained by combining the properties of the MFN with the morphological features. Gupta et al. [16] propose a framework based on SVM and feature selection to discriminate the various stages of ADNI patients using a combination of FDG-PETs, MRIs, CSF protein levels, and APOE genotype data. Stamate et al. [23] evaluate the performance of three state-of-the-art machine learning models (XGBoost, Random Forest, and DL) on AD classification based on plasma metabolites as potential AD biomarkers. Data samples are gathered by the European Medical Information Framework for AD Multimodal Biomarker Discovery (EMIF-AD) [63]. The study demonstrates that XGBoost, whose performance is reported in Table 6, is more effective than RF and DL for this particular dataset and that its accuracy for clinical diagnosis is broadly similar to that achieved by CSF markers of AD pathology [64]. Lee and Lee [12] classify AD vs. CN using blood gene expression data not only from the ADNI dataset but also from two AddNeuroMed [44] datasets (ANM1 and ANM2). Table 6 reports the best results obtained using suitable feature selection methods and classifiers.

Despite the variability of the experimental conditions and, consequently, of the performance results, a few observations can be derived from Table 6. (1) Generally, but not always (e.g., see the results of [65] and [15]), DL-based methods lead to better performance than those based on traditional ML; (2) different datasets can lead to quite different performance for the same method (e.g., see the results of [9] and [12] when varying the dataset); (3) the combination of multiple data sources appears promising for better performance (e.g., [17, 20, 22]); however, it should be taken into account that the extraction of further data modalities (e.g., the CSF used in [16]) could require quite invasive interventions; (4) our performance results using MRI and GE features appear in line, on average, with those achieved by state-of-the-art methods for the various classification tasks. Being the only ones produced for the ANMerge dataset, they can be used as a baseline for evaluating future AD classification methods on these data. Moreover, the promising integration of different data modalities obtainable by non-invasive techniques could be exploited by future ML- or DL-based methods.

Conclusions

We presented an extensive evaluation of a machine learning procedure for classifying Alzheimer’s patients using data from the ANMerge dataset. We considered data from different modalities, including imaging, omics, and clinical features, taken alone or in combination.

Overall results suggest that integrating omics and imaging features leads to better performance than either of them taken separately. This result holds for every binary AD classification problem considered, including the most challenging task of distinguishing MCI vs. CN patients, unlike what we previously observed on analogous data from the ADNI dataset.

Moreover, we showed that clinical features consisting of just two cognitive test scores always lead to better performance than any other type of data or their combinations. Our results show how their adoption as classification features positively biases the results, as they are involved in the clinicians’ diagnosis process, thus preventing a fair evaluation.

We believe that the results of our extensive experiments on the ANMerge dataset can serve as a baseline for the evaluation of future AD classification methods. Indeed, as the dataset (at least in its merged and revised form) is quite new, no other methods experimenting on it could be found. Toward this goal, we believe that the public availability of the software developed for the experiments will make such future comparisons easier.

Future research is directed toward adopting data from different cohort studies for independent training and testing. This will certainly require a general method for reducing batch effects between different experiments and an ad hoc procedure for optimizing the classifier hyperparameters on the training set.

Supplementary information

The Matlab scripts implementing the evaluation procedure adopted for AD classification and the computation of the performance metrics are made publicly available through our web pages.