A minimalistic approach to classifying Alzheimer’s disease using simple and extremely small convolutional neural networks

BACKGROUND
There is a broad interest in deploying deep learning-based classification algorithms to identify individuals with Alzheimer's disease (AD) from healthy controls (HC) based on neuroimaging data, such as T1-weighted Magnetic Resonance Imaging (MRI). The goal of the current study is to investigate whether modern, flexible architectures such as EfficientNet provide any performance boost over more standard architectures.


METHODS
MRI data was sourced from the Alzheimer's Disease Neuroimaging Initiative (ADNI) and processed with a minimal preprocessing pipeline. Among the various architectures tested, the minimal 3D convolutional neural network SFCN stood out, composed solely of 3x3x3 convolution, batch normalization, ReLU, and max-pooling. We also examined the influence of scale on performance, testing SFCN versions with trainable parameters ranging from 720 up to 2.9 million.


RESULTS
SFCN achieved a test ROC AUC of 96.0%, while EfficientNet achieved an ROC AUC of 94.9%. SFCN retained high performance down to 720 trainable parameters, achieving an ROC AUC of 91.4%.


COMPARISON WITH EXISTING METHODS
The SFCN is compared to DenseNet and EfficientNet as well as the results of other publications in the field.


CONCLUSIONS
The results indicate that using the minimal 3D convolutional neural network SFCN with a minimal preprocessing pipeline can achieve competitive performance in AD classification, challenging the necessity of employing more complex architectures with a larger number of parameters. This finding supports the efficiency of simpler deep learning models for neuroimaging-based AD diagnosis, potentially aiding in better understanding and diagnosing Alzheimer's disease.


Introduction
Alzheimer's disease (AD) is a neurodegenerative disease and the most common cause of dementia (Jack et al., 2018). Today 10.7% of the population over the age of 65 has dementia caused by Alzheimer's disease (Association, 2022). The cause of AD is not fully understood, and while there are multiple drugs approved by the FDA, their utility has been limited due to moderate symptom relief and severe side effects (Association, 2022; Athar et al., 2021; Loera-Valencia et al., 2019). There is a pressing need for high-precision diagnostic tools. A common benchmark task is distinguishing AD patients from healthy controls (HC) using T1-weighted magnetic resonance images (MRI) of the brain. AD vs HC classification provides a basis to evaluate model architectures for a classification task based on two cognitively distinct groups (Jack et al., 2008). By investigating the properties of AD vs HC classification, we hope to gain insight that will enable classification of stable versus progressive mild cognitive impairment (sMCI vs pMCI).
The field of deep learning is evolving quickly. A major part of this research is conducted on 2D image classification tasks on datasets such as ImageNet (Russakovsky et al., 2015). New architectures and training regimes have greatly improved classification accuracy (Tan and Le, 2019; Canziani et al., 2016). However, MRI brain scans are 3D volumes. Often, 2D architectures are expanded to 3D data by simply replacing 2D convolutions with their 3D counterparts, assuming that the techniques that improve performance on 2D natural images generalize across dimensions and domains (Uemura et al., 2020; Liang et al., 2018; Ruiz et al., 2020; Chen et al., 2019). Peng et al. (2021) challenged this assumption by proposing the 3D CNN named Simple Fully Convolutional Network (SFCN) for brain-age prediction on T1w MRI. SFCN was specifically designed to be simple and ''shallow'' compared to modern architectures. In this paper, we rigorously test SFCN against the popular architectures DenseNet (Huang et al., 2017) and EfficientNet (Tan and Le, 2019) for AD vs HC classification. It is well established that, in general, the performance of a model is highly dependent on the number of parameters (He et al., 2016; Nakkiran et al., 2021). We investigate this claim explicitly by studying the performance of the SFCN as we shrink the feature width towards one.
Data leakage in machine learning refers to the phenomenon where information from the test data influences the training of the model. This issue can contribute to systematic errors and biases in reported results of AD classification, as discussed in Wen et al. (2020). For example, data leakage can occur when a dataset is improperly split such that one participant with multiple assessments is part of both the training and test sets. Another typical example is using the test set for hyper-parameter selection (Kriegeskorte et al., 2009). Oversight in constructing the datasets can lead to overly optimistic classification accuracy (i.e., >98%) (Wen et al., 2020). It is prudent to build on the AD classification literature to better understand the factors contributing to high performance. Some of the earlier findings may be inflated, in part due to sub-optimal data management. In the present study, we make use of a relatively large sample and rely on 5-fold cross-validation. We tune hyper-parameters using a validation set, avoiding prematurely exposing the models to the test set. Hyper-parameters for all architectures were found using the same search procedure, and the final test results were only generated once. We believe this minimizes the likelihood of data leakage and, subsequently, inflation of our reported results.
Our primary contribution is demonstrating that the simple and shallow SFCN architecture can achieve competitive performance with more complex architectures like DenseNet and EfficientNet in AD vs HC classification, using minimal preprocessing. Searching for deep learning architectures is computationally demanding and can consume significant research time. We show that our results are competitive with other efforts in the literature, which typically involve much more complex pipelines. Additionally, we demonstrate that good results can be obtained with very small architectures. Smaller models are less hardware-intensive, making research on AD classification, as well as clinical inference, more accessible. The models with trained weights and the dataset splits can be found at: https://github.com/CRAI-OUS/simple_ad

Dataset and preprocessing
We used structural T1-weighted MRIs of the brain from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (Jack et al., 2008). We used a minimal preprocessing pipeline to ensure that as much of the information in the raw data as possible was available to the model, and to minimize the complexity of the total pipeline. Each brain scan was skullstripped with HD-BET (Isensee et al., 2019), a deep learning-driven skullstripping tool. HD-BET was chosen because it is fast (t < 10 s/scan) compared to other alternatives such as Freesurfer (Fischl, 2012). Rapid processing is important in making the models more accessible for clinical use. The scans were resampled to 1 mm isotropic resolution and cropped to a size of 160 × 192 × 160 voxels. The top 5 percent of intensity values were clipped, and each scan was normalized to the interval [0, 1]. Clipping the top intensities ensures that noisy outliers do not shift the contrast when rescaling intensities to [0, 1].
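The clipping and rescaling steps can be sketched as follows. This is a minimal illustration on a flat list of voxel intensities; in practice the operation would be applied to the full 3D volume (e.g. with numpy) after skull-stripping and resampling, and the exact percentile interpolation used in the original pipeline is an assumption here:

```python
import math

def clip_and_normalize(voxels, clip_percentile=95.0):
    """Clip the top intensities at the given percentile, then rescale to [0, 1].

    `voxels` is a flat list of intensity values; the percentile is computed
    with a simple nearest-rank rule for illustration.
    """
    ordered = sorted(voxels)
    # Nearest-rank percentile: index of the clip threshold in the sorted list.
    k = min(len(ordered) - 1, math.ceil(clip_percentile / 100.0 * len(ordered)) - 1)
    ceiling = ordered[k]
    clipped = [min(v, ceiling) for v in voxels]
    lo, hi = min(clipped), max(clipped)
    span = hi - lo if hi > lo else 1.0
    return [(v - lo) / span for v in clipped]
```

Because the bright outliers are clipped before rescaling, a few extreme values cannot compress the contrast of the rest of the volume.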
Only HC and AD patients were considered in this paper, which amounted to 1597 subjects scanned for a total of 5054 MRI sessions. The data were divided into 5 folds such that all sessions of each subject were contained in one fold. Each fold was stratified so that the male-to-female ratio, average age, and diagnosis ratio were matched across folds. We created 5 data splits where the train and validation set consisted of 4 folds, and the remaining fold was used for testing. 10% of the 4 train-validation folds were used for validation, which was also balanced with respect to the training set. For testing and validation, one random session was used for each subject, as opposed to training, when all available sessions were used. This was to avoid a few subjects with many sessions influencing the test results too much. We refer to split 0 as the combination of the training, validation, and test set that uses fold 0 as the test set. An overview of the folds and the stratified values can be seen in Table A.1.
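The subject-level fold assignment can be sketched as follows. This is a simplified illustration: the stratification by sex ratio, age, and diagnosis described above is omitted, and the session records are hypothetical:

```python
import random
from collections import defaultdict

def subject_level_folds(sessions, n_folds=5, seed=0):
    """Assign sessions to folds so that all sessions of a subject share a fold.

    `sessions` is a list of (subject_id, session_id) pairs. Subjects are
    shuffled and dealt out round-robin; real stratification is not shown.
    """
    subjects = sorted({subj for subj, _ in sessions})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    fold_of = {subj: i % n_folds for i, subj in enumerate(subjects)}
    folds = defaultdict(list)
    for subj, sess in sessions:
        folds[fold_of[subj]].append((subj, sess))
    return dict(folds)
```

The key property is that a subject with many sessions can never contribute to both a training fold and the test fold, which is one of the data-leakage pitfalls discussed in the introduction.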

Architectures
In this paper, we compare three architectures. DenseNet (Huang et al., 2017) was chosen since it is a popular architecture in medical image classification (Liang et al., 2018; Uemura et al., 2020; Ruiz et al., 2020). EfficientNet (Tan and Le, 2019) was chosen since it can be scaled down to a small size, which makes it practical for training on 3D T1 scans. While published in 2019, it is still competitive on ImageNet classification with a top-1 accuracy of 84.3%.
For these two architectures, we used the 3D implementations provided by MONAI (Cardoso et al., 2022). We used DenseNet121 and EfficientNet-B0, as these were the smallest versions of the models, and we could therefore train each model on a single GPU.
Our focus in this paper is the Simple Fully Convolutional Network (SFCN) (Peng et al., 2021) architecture. It was originally designed for brain-age estimation, for which it has achieved state-of-the-art performance (Peng et al., 2021; Gong et al., 2021; Leonardsen et al., 2022). It has also been successfully applied to AD classification (Leonardsen et al., 2023; Gupta et al., 2023). As the name suggests, the architecture was designed to be simple. SFCN consists of 6 blocks of 3 × 3 × 3 convolution, batch normalization, ReLU activation, and max pooling. At the final layer, global average pooling is performed, followed by a single linear layer. A diagram of the architecture can be seen in Fig. 1.
We investigated how the number of trainable parameters affected the predictive performance. SFCN was down-scaled by keeping the depth constant and dividing the width of each layer by a multiple of 2. For the smallest models, we deviated from this rule to avoid layers of width 1. The models were ranked by their number of parameters from SFCN-0 to SFCN-7. SFCN-7 is the base model, with the original size as defined by Peng et al. (2021). The configuration of the down-scaled SFCN models can be seen in Table 1.
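The effect of shrinking the feature widths on model size can be illustrated with a rough parameter counter. This sketch assumes every block is a 3 × 3 × 3 convolution with bias followed by batch normalization (one scale and one shift per channel), ending in global average pooling and a single linear output unit; the published SFCN variants may differ in detail (e.g. in the final block), so the totals here will not exactly match the paper's reported counts:

```python
def sfcn_param_count(widths, in_channels=1, kernel=3):
    """Rough trainable-parameter count for an SFCN-style convolutional stack.

    `widths` lists the number of feature channels per block, as in Table 1.
    Assumptions (see lead-in): conv with bias + batch norm in every block,
    then global average pooling and one linear output unit.
    """
    total, prev = 0, in_channels
    for w in widths:
        total += (kernel ** 3) * prev * w + w  # conv weights + bias
        total += 2 * w                          # batch-norm gamma + beta
        prev = w
    total += prev + 1                           # final linear layer (weight + bias)
    return total
```

Because almost all parameters sit in the convolution weights, halving every width cuts the parameter count by roughly a factor of four, which is why the SFCN-0 to SFCN-7 family spans several orders of magnitude.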

Hyper-parameter selection and training of the models
The performance of deep learning models is sensitive to the hyper-parameters of the models. Different architectures might benefit from different hyper-parameters. For a fair comparison between architectures, we therefore performed a grid-based hyper-parameter search for all architectures on the training and validation set of split 0, searching for an optimal optimizer, learning rate, and weight decay. In order to fit the training on a single GPU, we fixed the batch size to 4. For each architecture, the hyper-parameters that gave the highest Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) on the validation set were selected. Each model was then trained with optimal hyper-parameters on the other 4 splits and tested on their respective test sets. As optimizers, we tested both AdamW (Loshchilov and Hutter, 2017) and Stochastic Gradient Descent (SGD). For AdamW we used β1 = 0.9 and β2 = 0.95. No momentum was used with SGD. The models were trained for 50 epochs with binary cross-entropy loss. The learning rate was scheduled using linear warmup up to epoch 10, followed by cosine decay. No weight decay was used for normalizing layers and bias terms (Brock et al., 2021; He et al., 2022; Jia et al., 2018). Since a model might benefit from early stopping, two sets of the trainable parameters were saved for each training session, one from the last epoch and one from the epoch with the highest validation ROC AUC. We used the checkpoint of the last epoch as our default but investigated the properties of the best ROC AUC checkpoint. The hyper-parameters of the grid search can be found in Table A.2 and the optimal parameters for each model in Table A.3. A summary of the method for training SFCN-Base on AD vs HC classification can be found in Fig. 2.
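The learning-rate schedule (linear warmup to epoch 10, then cosine decay over the remaining epochs) can be sketched as a simple function of the epoch. The exact endpoint handling in the original training code is an assumption:

```python
import math

def lr_at_epoch(epoch, base_lr, warmup_epochs=10, total_epochs=50):
    """Linear warmup to `base_lr` over the first `warmup_epochs` epochs,
    followed by cosine decay to zero at `total_epochs`.
    """
    if epoch < warmup_epochs:
        # Warmup: ramp linearly from base_lr / warmup_epochs up to base_lr.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay: progress goes from 0 (end of warmup) to 1 (last epoch).
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Under the linear scaling law mentioned in the next paragraph, increasing the batch size from 4 to 16 would multiply `base_lr` by 4 while leaving this schedule shape unchanged.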
When training the smaller versions of SFCN, we used the best hyper-parameters found for SFCN-Base. However, we changed the batch size from 4 to 16 and used a linear scaling law to adapt the learning rate to the new batch size (Goyal et al., 2017). Since the smaller configurations of SFCN could benefit from less weight decay, a hyper-parameter search for weight decay was performed. The hyper-parameters for the smaller versions of SFCN can be found in Table A.4.

Metrics
We chose to rely on ROC AUC as our metric for hyper-parameter selection as well as model evaluation. ROC AUC is in many ways considered a better performance metric than accuracy and related measures such as sensitivity and specificity (Dinga et al., 2019). This is partly because ROC AUC is invariant to class imbalance. Furthermore, we consider it a clinical task to decide what rate of false positives and false negatives can be tolerated in clinical settings. We do, however, report the accuracies of our models to facilitate an intuitive interpretation of model performance and enable comparisons with other methods.
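ROC AUC has a useful probabilistic reading: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half (the Mann-Whitney U formulation). A minimal pure-Python sketch, quadratic in the number of samples and intended only for illustration:

```python
def roc_auc(scores, labels):
    """ROC AUC via the Mann-Whitney U statistic.

    `scores` are model outputs, `labels` are 1 (positive, e.g. AD) or
    0 (negative, e.g. HC). Counts the fraction of positive/negative pairs
    where the positive outscores the negative, with ties worth one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This pairwise definition also makes the class-imbalance invariance evident: duplicating every negative sample scales both the numerator and the denominator by the same factor, leaving the AUC unchanged.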

AD vs HC classification
We found that EfficientNet, DenseNet, and SFCN-Base performed similarly on the test sets, with average ROC AUCs of 94.9%, 94.9%, and 96.0%, respectively. SFCN had a slightly higher ROC AUC than the two other architectures in 4 out of 5 test folds. A comparison of the ROC AUCs and accuracies across the architectures can be seen in Fig. 3. Using the checkpoint with the highest validation ROC AUC yielded similar results, with a test ROC AUC of 95.2% for EfficientNet, 94.7% for DenseNet, and 95.7% for SFCN.
Next, we compare our results to the results from other publications. From the list of AD classification publications with ''no data leakage'' by Wen et al. (2020), we selected the 5 publications with the highest accuracy. In addition, we selected a few newer publications on AD classification. The complete comparison can be found in Table 2. Overall, our simple pipeline performed competitively with state-of-the-art methods.

Qualitative model characteristics
We investigated the correlation of the model predictions. Using the pre-sigmoid output of the models, we calculated the Pearson Correlation Coefficient (PCC) for pairs of the architectures. We found the models to be highly correlated, with a PCC of 0.88 for DenseNet and EfficientNet, 0.90 for SFCN and EfficientNet, and 0.91 for SFCN and DenseNet.
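The pairwise correlation computation is standard; for completeness, a self-contained sketch of the Pearson coefficient as applied here to two models' pre-sigmoid outputs over the same test cases:

```python
import math

def pearson_corr(x, y):
    """Pearson correlation coefficient between two equal-length sequences,
    e.g. the pre-sigmoid outputs of two models on the same test set."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

A PCC near 0.9, as reported above, indicates the three architectures rank the test subjects in largely the same order, which limits how much an ensemble can add.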

Table 2
Comparison of AD vs HC deep-learning classifiers on ADNI data. Methods marked with a '+' indicate ''no data leakage'' as per Wen et al. (2020). Table abbreviations: BA: Balanced Accuracy. Modalities: T1w: T1-weighted MRI, MD: Mean diffusivity MRI, FDG-PET: Fluorodeoxyglucose positron emission tomography. Preprocessing: The main steps of the preprocessing pipelines of the methods. If given in the method, the software used in preprocessing is given in parentheses. R: Linear registration, NR: Nonlinear registration, N: Normalization or bias field correction, SS: Skullstripping, Seg: Segmentation, LMD: Custom landmark detection, None: No preprocessing was performed. Preprocessing software: SPM: Statistical Parametric Mapping (Ashburner et al., 2012), FS: Freesurfer (Fischl, 2012), DPARSF: Data Processing Assistant for Resting-State fMRI (Yan and Zang, 2010), HD-BET: (Isensee et al., 2019), FSL: FMRIB Software Library (Woolrich et al., 2009), ANTs: Advanced Normalization Tools (Tustison et al., 2021), DARTEL: Diffeomorphic Anatomical Registration Through Exponentiated Lie algebra (Goto et al., 2013). Model types: 2.5D CNN: CNN that processes images as slices using 2D convolution and integrates the information across slices. 3D CNN: CNN with 3D convolution operating on the full image, 3D ROI CNN: 3D CNN that operates on subregions of the image. VAE: Variational Autoencoder, FEAT MLP: Multilayer perceptron that processes features extracted from the image.

To further test whether the different models picked up different useful features, we used Linear Discriminant Analysis (LDA) to construct new classifiers from the outputs of the three model architectures. We fitted 5 LDAs on the validation sets and performed inference on the test data. In Fig. 4 the pre-sigmoid outputs of the models are plotted against each other together with the LDA classification line. The LDA models that used all three architectures had an average ROC AUC of 96.19%, an increase of only 0.19% compared to SFCN.
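The LDA step on the models' pre-sigmoid outputs can be sketched with the closed-form Fisher discriminant, w = Σ⁻¹(μ₁ − μ₀) with pooled within-class covariance Σ. This is a minimal illustration in pure Python; the concrete LDA implementation used in the study is not specified here:

```python
def lda_direction(class0, class1):
    """Fisher LDA weight vector w = Sigma^{-1} (mu1 - mu0) for two classes of
    feature vectors (here, each vector would hold the pre-sigmoid outputs of
    the three architectures for one subject)."""
    d = len(class0[0])

    def mean(rows):
        return [sum(r[i] for r in rows) / len(rows) for i in range(d)]

    m0, m1 = mean(class0), mean(class1)
    # Pooled within-class covariance (biased normalization for simplicity).
    n = len(class0) + len(class1)
    cov = [[0.0] * d for _ in range(d)]
    for rows, m in ((class0, m0), (class1, m1)):
        for r in rows:
            c = [r[i] - m[i] for i in range(d)]
            for i in range(d):
                for j in range(d):
                    cov[i][j] += c[i] * c[j] / n
    # Solve cov @ w = (mu1 - mu0) by Gaussian elimination with pivoting.
    b = [m1[i] - m0[i] for i in range(d)]
    a = [row[:] + [b[i]] for i, row in enumerate(cov)]
    for col in range(d):
        piv = max(range(col, d), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(d):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][d] / a[i][i] for i in range(d)]
```

Projecting each subject's output vector onto `w` yields the combined score; when the individual model outputs are as correlated as reported above, the combined score differs little from any single model's, consistent with the small 0.19% gain.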

Small architecture performance
Next, we investigated how the size of SFCN affects the test performance. Using the same hyper-parameters as SFCN-Base, we tested 7 smaller architectures. We found that SFCN-3, with 13k parameters, achieved a ROC AUC of 94.6%. SFCN-3 has only 0.44% of the parameters of SFCN-Base, but the relative reduction in ROC AUC is only 1.45%. We further observed that the extremely small SFCN-0, with 720 parameters and a feature width of [2, 2, 2, 3, 3, 2], still achieved a respectable 91.4% ROC AUC. The performance metrics as a function of model size can be seen in Fig. 5.
We visualized all the feature maps of SFCN-0 for the two subjects in the test set for which the model gave the highest and lowest AD scores. The feature maps are displayed in Fig. 6.

Discussion
We investigated whether modern architectures, proven effective on ImageNet, increase AD vs HC classification performance over simpler architectures for structural MRI data. Our results show that the three architectures, SFCN-Base, DenseNet, and EfficientNet, achieved approximately the same ROC AUC, with a slight advantage to SFCN-Base. Considering the extensive efforts in designing these architectures, this finding is unexpected. SFCN represents a simplistic architecture and might be considered outdated. This raises questions about why DenseNet and EfficientNet, which are effective classifiers for 2D tasks, do not exhibit the same efficiency for AD classification using 3D brain MRI. Furthermore, why do our results remain competitive with other studies that employ pretraining or carefully designed pipelines and architectures?
Although we cannot present strong evidence for why SFCN is sufficient for AD classification, we can hypothesize. DenseNet and EfficientNet are architectures built to perform well on ImageNet. ImageNet has 1000 diverse labels, many of which have very distinct characteristics. In contrast, AD classification has only two classes, with relatively minor visual differences distinguishing the AD group from the HC group, especially compared to the visual diversity in ImageNet. The images are also centered and always facing the same direction, simplifying the classification task. We see that on the AD classification task the model capacity can be very small, with SFCN-0 having only 720 parameters and still reaching an ROC AUC of 91.4%. On ImageNet classification, comparable drops in performance are observed in models with many orders of magnitude more parameters than the models in our experiments. The ImageNet top-1 accuracy of DenseNet drops from 77.85% to 74.8% when the model size shrinks from 33M to 7M parameters (Huang et al., 2017). A visual inspection of the feature maps of SFCN-0 in Fig. 6 shows that the network extracts regions of cerebrospinal fluid (CSF) in the first layers while discarding the texture in the rest of the brain. This may be a hint as to why SFCN is able to achieve a high ROC AUC even with a tiny model. Since CSF is darker than brain tissue in T1w MR images, a simple threshold is sufficient for an estimate of brain atrophy. Such a threshold function can be easily implemented by a linear layer and a ReLU function, similar to the first layer of SFCN. Understanding how SFCN processes these regions further down the network is challenging, due to the shrinking spatial size and convoluted interactions between the features. However, since one can see all the features of the first layer, we believe that the CSF segmentation-like behavior is central to further processing.
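The point that a linear layer plus ReLU can act as an intensity threshold is easy to verify directly. The sketch below uses a negative weight on intensity so the unit responds only to dark voxels; the specific threshold and gain values are illustrative, not learned weights from SFCN:

```python
def relu(z):
    return max(0.0, z)

def dark_voxel_detector(intensity, threshold=0.2, gain=10.0):
    """A single linear unit followed by ReLU that responds only to voxels
    darker than `threshold` (e.g. CSF in T1w images, which is darker than
    brain tissue). Output is zero for bright voxels and increases linearly
    as the intensity drops below the threshold."""
    # Linear part: gain * (threshold - intensity) = -gain*intensity + gain*threshold,
    # i.e. a weight of -gain and a bias of gain*threshold.
    return relu(gain * (threshold - intensity))
```

Summing such activations over the volume approximates the amount of dark (CSF-like) tissue, which is the atrophy-related quantity hypothesized above.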
We see a potential in utilizing the small versions of SFCN, and similar simple models, for new applications. It could be possible to explicitly analyze the features of such a model to understand what it has learned. Feature analysis is usually not feasible when a model has hundreds or thousands of features, but with a model with no more than 10 features in each layer, visualizing and interpreting them is a realistic possibility.
Other applications can be in a clinical setting on a machine without a GPU.In this case, a slight decrease in performance may be acceptable if the model is small enough to enable rapid analysis on a single CPU.

Conclusion
In this paper, we have demonstrated that a simple preprocessing pipeline and the simple architecture SFCN yielded competitive results on AD vs HC classification relative to two other larger and more sophisticated model architectures.We found that the SFCN model architecture could be scaled to a surprisingly small size, with only a small deterioration in performance.This work suggests that a simple CNN with minimal preprocessing could serve as a viable baseline when testing new machine learning pipelines for AD-related classification.

Fig. 1 .
Fig. 1. Diagram of the SFCN architecture. The spatial dimension of the feature space is shown between each block. The feature width for each feature space can be found in Table 1.

Fig. 2 .
Fig. 2. Summary of the deep learning pipeline for AD vs HC classification using a simple convolutional neural network.

Fig. 3 .
Fig. 3. (a) AD vs HC test accuracy and ROC AUC for EfficientNet, DenseNet, and SFCN-Base for each split. (b) Comparison of average test accuracy and ROC AUC for AD vs HC using EfficientNet, DenseNet, and SFCN-Base.

Fig. 4 .
Fig. 4. The pre-sigmoid output of each model architecture plotted against each other. The values on the axes represent the output before the final sigmoid activation. Higher values indicate that, given the input, the model has higher confidence in the AD class; lower values represent higher confidence in the HC class. The scatter plot is created by combining the 5 test sets. The pre-sigmoid outputs are from the 5 models trained on the training set belonging to each test set. The Pearson correlation coefficient (PCC) for each model pair is shown in the right corner. The black lines are the class separation lines from the linear discriminant analysis (LDA) models fitted on the pre-sigmoid outputs of the 5 validation sets.

Fig. 5 .
Fig. 5. AD vs HC classification test accuracy and ROC AUC for SFCN models of different sizes. The lines mark the average ROC AUC and accuracy, and the points mark the accuracy and ROC AUC for each split.

Fig. 6 .
Fig. 6. All feature maps of SFCN-0. The two subjects with the highest and lowest AD scores in the test set are displayed. Each 3D feature map is displayed as 3 orthogonal slices, along with the index of the Block (B) and Feature (F). Zoom in to see details.

Table 1
Feature width for the configurations of SFCN.Model 7 is the ''Base'' model and is identical to the original SFCN model architecture.