A 3D convolutional neural network to classify subjects as Alzheimer's disease, frontotemporal dementia or healthy controls using brain 18F-FDG PET

With the arrival of disease-modifying drugs, neurodegenerative diseases will require an accurate diagnosis for optimal treatment. Convolutional neural networks are powerful deep learning techniques that can provide great help to physicians in image analysis. The purpose of this study is to introduce and validate a 3D neural network for classification of Alzheimer's disease (AD), frontotemporal dementia (FTD) or cognitively normal (CN) subjects based on brain glucose metabolism. Retrospective [18F]-FDG-PET scans of 199 AD, 192 FTD and 200 CN subjects were collected from our local database, Alzheimer's disease and frontotemporal lobar degeneration neuroimaging initiatives. Training and test sets were created using randomization on a 90%-10% basis, and training of a 3D VGG16-like neural network was performed using data augmentation and cross-validation. Performance was compared to clinical interpretation by three specialists in the independent test set. Regions determining classification were identified in an occlusion experiment and Gradient-weighted Class Activation Mapping. Test set subjects were age- and sex-matched across categories. The model achieved an overall 89.8% accuracy in predicting the class of test scans. Areas under the ROC curves were 93.3% for AD, 95.3% for FTD, and 99.9% for CN. The physicians' consensus showed a 69.5% accuracy, and there was substantial agreement between them (kappa = 0.61, 95% CI: 0.49-0.73). To our knowledge, this is the first study to introduce a deep learning model able to discriminate AD and FTD based on [18F]-FDG PET scans, and to isolate CN subjects with excellent accuracy. These initial results are promising and hint at the potential for generalization to data from other centers.


Introduction
Alzheimer's disease (AD) is a debilitating condition that affects millions of people worldwide. It is the fourth leading cause of disability-adjusted life-years (DALYs) in older adults and its prevalence is set to increase dramatically due to population aging (Nichols et al., 2022). This underscores the urgent need for effective disease-modifying drugs to treat neurodegenerative disorders.
An accurate diagnosis is paramount to the development of effective treatments because of the variety of pathological mechanisms at play in different neurodegenerative diseases. This is even more challenging in diseases where clinical pictures overlap, such as AD and the frontotemporal dementia (FTD) spectrum: the gold standard diagnosis relies on pathological findings, which are only available after the patient's death, whereas an early diagnosis is needed to change the course of the disease. Current diagnostic criteria rely on a compatible clinical presentation, cerebrospinal fluid (CSF) biomarkers, magnetic resonance imaging (MRI) and positron emission tomography (PET), giving brain glucose metabolism a foremost position (Dubois et al., 2014; Gorno-Tempini et al., 2011; Rascovsky et al., 2011). However, in our clinical experience, interpretation of metabolic brain patterns can be challenging even for experts, as illustrated by the moderate inter-observer agreement (Brucher et al., 2015). Using standalone FDG-PET, approximately 20 % of subjects with dementia are incorrectly labeled AD (Bloudek et al., 2011). This is particularly problematic in cases with overlapping metabolic patterns, as seen in the frontal variant of AD and behavioral variant FTD, or in logopenic primary progressive aphasia (PPA), which is related to AD pathology, and other PPA variants (Minoshima et al., 2021). With the anticipated approval of anti-amyloid therapy, which will require a highly certain diagnosis of AD, there is a need to develop more accurate and objective interpretation tools to diagnose patients with precision.
Artificial intelligence (AI) techniques have been successfully applied in radiology and are increasingly popular in neuroimaging, especially for structural imaging (Hu et al., 2020; Jo et al., 2019; Nemoto et al., 2021). FDG-PET exhibits a fair level of evidence in detecting AD or FTD and is therefore an essential tool for diagnosis, a pivotal milestone in a patient's clinical pathway (Arbizu et al., 2018). Additionally, it has the ability to reclassify misdiagnosed subjects, which makes a strong case for applying AI techniques to metabolic imaging (Jack et al., 2010; Perini et al., 2021). A remarkable example is a recent work that used a 3D convolutional neural network (CNN) to classify FDG-PET scans with high accuracy into dementia with Lewy bodies (DLB), Alzheimer's spectrum or cognitively normal (Etminani et al., 2022). Regarding AD vs FTD classification using FDG-PET, machine learning approaches such as decision trees, support vector machines (SVM), or principal component analysis have been employed, achieving accuracies ranging from 80 to 95 % (Perovnik et al., 2022; Sadeghi et al., 2008; Xia et al., 2014). Some studies have also incorporated additional information, such as clinical data or structural MRI (Dukart et al., 2013; Garcia-Gutierrez et al., 2022). Crucially, deep learning methods, which can autonomously learn complex features from data, have to our knowledge never been applied to AD vs FTD classification using brain glucose metabolism.
Therefore, the primary objective of the current study is to serve as a proof-of-concept, showcasing the potential of a 3D CNN approach in aiding the diagnosis of AD, FTD or normal aging through [18F]-FDG-PET. The research premise is that, following training, this model will display strong classification performance on test data acquired in the same conditions as training data, with results comparable to those achieved by specialists relying solely on [18F]-FDG-PET scans for diagnosis. Secondary objectives were to explore the model's underlying mechanisms through testing on a sample of subjects with DLB, which is the third most prevalent dementing disorder and also exhibits cortical hypometabolism on FDG-PET (Minoshima et al., 2022). Additionally, visualization of brain regions driving classification and dimensionality reduction of extracted features were conducted.
For each category, the ground truth label was based on a probable final clinical diagnosis as per consensual criteria (Gorno-Tempini et al., 2011; McKhann et al., 2011; Rascovsky et al., 2011). Subjects who showed conversion to another disorder were excluded. The AD category included 100 AD scans randomly selected from the ULille database and 104 scans from ADNI 2 & 3. ULille subjects also had typical CSF biomarkers (total tau > 525 pg/mL, phospho-tau > 73 pg/mL, Aβ1-42 < 615 pg/mL and Aβ1-42/Aβ1-40 < 5.6 %) and 16 subjects later had their diagnosis confirmed by autopsy or genetics, while ADNI subjects had a high-confidence diagnosis as reported by clinicians. All AD subjects were at the stage of dementia.
The FTD category comprised all 59 scans of the FTLDNI database that had FDG-PET, 4 of which were excluded due to absence of diagnosis, as well as all 137 FTD scans acquired at ULille. Among these, 171 exhibited a behavioral presentation, 11 were classified as a semantic variant of PPA and 10 as an agrammatic variant of PPA. Sixteen ULille subjects later had a certain diagnosis established by autopsy or genetics. All FTD subjects were at the stage of dementia.
The CN category included 164 randomly selected baseline scans from ADNI 2 & 3 and all 36 control scans from FTLDNI (supplementary Fig. 1).

Data acquisition
PET scans from the ULille database were acquired on a hybrid 4-ring Biograph mCT-Flow PET/CT with 20-slice CT and 4 × 4 mm² lutetium oxyorthosilicate crystals (Siemens Medical Solutions USA, Inc., Molecular Imaging, Knoxville, TN, USA). Mean tracer dose was 177 MBq (SD = 19 MBq). Thirty minutes post injection, a low-dose CT scan of the brain was acquired for attenuation correction of the PET data, and 10-minute emission images were subsequently acquired. The PET data were reconstructed iteratively using an ordered subset expectation maximization algorithm with 8 iterations and 21 subsets. The reconstruction process included decay, random and scatter corrections and 2-mm full width at half-maximum Gaussian kernel smoothing. For each PET examination, the reconstructed images comprised a series of 109 axial slices with the following parameters: field of view = 408 × 408 × 221.3 mm³, matrix = 400 × 400 × 109, and voxel size = 1.02 × 1.02 × 2.03 mm³.
For ADNI acquisition, the protocol can be found at https://adni.loni.usc.edu/wp-content/uploads/2010/05/ADNI2_PET_Tech_Manual_0142011.pdf. FDG tracer dose was 185 MBq (+/-10 %), and between 30 and 60 min after injection 6 dynamic 3D scans of 5-min frames were acquired. A low-dose CT scan was acquired for attenuation correction, or for PET-only scanners an attenuation correction scan was acquired using rod sources.
All FTLDNI scans were acquired at the Mayo Clinic center on a GE Discovery RX PET/CT scanner. Participants were injected with 185 MBq (+/-10 %) of FDG and acquisition started 30 min later, consisting of six 5-minute dynamic frames. A CT scan, obtained prior to injection of FDG, was used for attenuation correction, and reconstruction used a 3D filtered back-projection technique (Bejanin et al., 2020).
All images were reviewed by an expert (AR) for visual quality control.

Image preprocessing
Preprocessing was done using MATLAB R2020a (MathWorks, Natick, MA, USA) and Statistical Parametric Mapping 12 (SPM12; Wellcome Trust Centre for Neuroimaging, London, UK, http://www.fil.ion.ucl.ac.uk/spm). PET scans from the ADNI and FTLDNI databases were downloaded in NIFTI format. Each of them consisted of 6 NIFTI files, which were realigned and averaged. PET scans from the ULille database were downloaded in DICOM format and subsequently converted to NIFTI with dcm2niix (Li et al., 2016). We spatially normalized each scan using default parameters of the SPM12 normalization function and the International Consortium of Brain Mapping (ICBM) template. Subject-specific gray matter, white matter, CSF, bone, soft tissue and air probability maps were estimated from PET images using default parameters and tissue probability maps of the SPM12 segmentation function. Masks were generated for voxels with a probability > 0.7 of being gray matter, white matter or CSF. These individual masks were then merged into a unified mask, which was applied to spatially normalized volumes. As a result, voxels presumed to represent bone, soft tissue, or air were excluded. Feature-wise normalization was done by dividing each voxel by the maximum of the 3D scan it belonged to, so that all voxel values lay in the range [0, 1]. No extra intensity normalization or smoothing was performed. Every scan was inspected in MRIcron (https://www.nitrc.org/projects/mricron) to ensure correct preprocessing. After visual inspection, 5 ADNI AD scans were excluded because of negative-valued voxels, which may have conflicted with the use of the ReLU function. Nineteen FTLDNI FTD, 6 ULille FTD, 1 ULille AD and 1 ADNI CN scans (N = 27, 4.6 %) were found to have segmentation issues, and were reprocessed by changing the default clean-up parameter from light cleanup to no cleanup. Preprocessing resulted in 3D volumes of 79×95×79 voxels (2 mm isotropic) along the standard x, y, z axes as used in SPM. Top and bottom slices without brain region information were removed, and final volume dimensions were 79×95×60.
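The masking and per-scan scaling steps described above can be illustrated with a minimal sketch. It operates on flattened voxel lists rather than NIFTI volumes, and the function names `brain_mask` and `mask_and_scale` are hypothetical. Setting excluded voxels to zero is an assumption, since the text only states that they were excluded.

```python
def brain_mask(gm, wm, csf, threshold=0.7):
    """Union of the three tissue masks: a voxel is kept when its gray matter,
    white matter or CSF probability exceeds the threshold (0.7 in the text)."""
    return [g > threshold or w > threshold or c > threshold
            for g, w, c in zip(gm, wm, csf)]


def mask_and_scale(volume, mask):
    """Zero out voxels outside the mask (assumption), then divide every voxel
    by the maximum of the scan so that values lie in [0, 1]."""
    masked = [v if keep else 0.0 for v, keep in zip(volume, mask)]
    peak = max(masked)
    if peak <= 0:
        raise ValueError("scan has no positive voxel inside the mask")
    return [v / peak for v in masked]


# Toy example with four voxels (values are illustrative, not real data).
volume = [2.0, 4.0, 8.0, 1.0]
gm = [0.9, 0.1, 0.2, 0.0]
wm = [0.0, 0.8, 0.1, 0.0]
csf = [0.0, 0.0, 0.3, 0.75]
scaled = mask_and_scale(volume, brain_mask(gm, wm, csf))
```

In this toy case the third voxel has no tissue probability above 0.7 and is dropped, so the maximum used for scaling is taken over the remaining brain voxels only.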

Dataset and data augmentation
The whole dataset contained 591 preprocessed [18F]-FDG PET scans. Training and test sets were randomly built on a 90 %-10 % basis for each of the three AD, FTD and CN groups in keeping with previous related works, resulting in 532 scans in the training set and 59 scans in the test set (Etminani et al., 2022; Nguyen et al., 2023). Dataset splitting was repeated randomly 30 times to improve stability of results. Due to human resource constraints and to uphold methodological validity, a single split was randomly selected for both comparison with specialist-based classification and secondary investigation of the model. Results are presented for this single split unless mentioned otherwise.
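As an illustration of the per-class 90 %-10 % splitting, the following sketch draws 30 stratified random splits from the class sizes given in the text. The function name `stratified_split`, the seeds and the rounding rule for the test-set size are assumptions, chosen here because they reproduce the reported 532/59 partition.

```python
import random


def stratified_split(labels, test_fraction=0.1, seed=0):
    """Return (train_indices, test_indices), sampling the test fraction
    separately within each class."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)  # assumed rounding rule
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test


# Class sizes as reported: 199 AD, 192 FTD and 200 CN scans.
labels = ["AD"] * 199 + ["FTD"] * 192 + ["CN"] * 200
splits = [stratified_split(labels, seed=s) for s in range(30)]
train_idx, test_idx = splits[0]
```

With these class sizes each split holds 20 AD, 19 FTD and 20 CN scans in the test set.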
To prevent overfitting of the model, we augmented training data using a customized pipeline that generated batches and augmented data in real-time. Flipping along the sagittal plane, and +/-10° random rotations across the 3 planes were performed in 50 % of cases to limit computational cost. Ten-percent translations across the 3 axes were performed for each scan, i.e. +/-8, +/-10 and +/-6 voxels along the x, y and z axes, respectively.
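The augmentation settings above can be sketched as a parameter sampler. This is only an illustration: the names `max_shifts` and `sample_augmentation` are hypothetical, and drawing the flip and the rotation independently, each with probability 0.5, is an assumption because the text does not say whether they were applied jointly. Applying the sampled transform to a volume is left out.

```python
import random

VOLUME_SHAPE = (79, 95, 60)  # final preprocessed volume dimensions


def max_shifts(shape, fraction=0.10):
    """Largest translation per axis, as 10 % of the volume size in voxels."""
    return tuple(round(n * fraction) for n in shape)


def sample_augmentation(rng, shape=VOLUME_SHAPE, max_angle=10.0, p=0.5):
    """Draw one set of augmentation parameters for a training scan."""
    flip = rng.random() < p      # sagittal flip in about half of the cases
    rotate = rng.random() < p    # rotation in about half of the cases (assumed independent)
    angles = tuple(rng.uniform(-max_angle, max_angle) if rotate else 0.0
                   for _ in range(3))
    # A translation is drawn for every scan, within +/- the per-axis bound.
    shift = tuple(rng.randint(-s, s) for s in max_shifts(shape))
    return {"flip": flip, "angles": angles, "shift": shift}


params = sample_augmentation(random.Random(42))
```

For a 79×95×60 volume, `max_shifts` gives bounds of 8, 10 and 6 voxels, which matches the translation ranges quoted in the text.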

Neural network building
Training was done at the Lille In vivo Imaging and Functional Exploration (LIIFE) research lab at Lille University Hospital, on a computer with the Linux Ubuntu 20.04 operating system, 12 Intel® Xeon® W-2133 3.60 GHz CPUs, 102 GB of memory, and an NVIDIA Quadro RTX 6000 with 24 GB of memory. The network was inspired by the VGG16 architecture (Simonyan and Zisserman, 2014), and consisted of 2 blocks of two 3D convolution layers and a max pooling layer, followed by a flattening layer and two dense layers (Fig. 1-B). The loss function used was cross-entropy. The input was preprocessed brain FDG-PET volumes, while the output was a probability value for each of AD, FTD and CN. To facilitate hyperparameter selection, we used Bayesian optimization for 200 iterations over the training set (supplementary Table 1). Following this, we retained the Adagrad optimizer, learning rate = 0.0005, dropout rate = 0.5 and batch size = 6, and performed end-to-end training on the augmented training set for 150 epochs using 5-fold cross-validation, with early stopping (patience = 20) based on validation loss and saving of the best model based on validation accuracy. Finally, the model with the highest validation accuracy within cross-validation was fine-tuned on the whole non-augmented training set with stochastic gradient descent (learning rate = 0.0001, momentum = 0.9) for 50 epochs. This was repeated for each of the 30 random dataset splits.
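The interplay between early stopping on validation loss and checkpointing on validation accuracy can be sketched as follows. This is a framework-independent illustration and not the training code of the study: the function name `run_with_early_stopping` and the toy history are hypothetical, and real training would compute the validation metrics after each epoch instead of reading them from a list.

```python
def run_with_early_stopping(history, patience=20):
    """Walk through per-epoch (val_loss, val_acc) pairs.

    Training stops once validation loss has not improved for `patience`
    consecutive epochs. Independently, the epoch with the highest validation
    accuracy seen so far is remembered as the checkpoint to keep.
    Returns (last_epoch_run, best_checkpoint_epoch), both 0-indexed.
    """
    best_loss = float("inf")
    best_acc = -1.0
    best_epoch = None
    wait = 0
    last_epoch = -1
    for epoch, (val_loss, val_acc) in enumerate(history):
        last_epoch = epoch
        if val_acc > best_acc:      # checkpoint criterion: validation accuracy
            best_acc, best_epoch = val_acc, epoch
        if val_loss < best_loss:    # stopping criterion: validation loss
            best_loss, wait = val_loss, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return last_epoch, best_epoch


# Toy history: loss stops improving after the third epoch.
toy = [(1.0, 0.50), (0.8, 0.60), (0.7, 0.70), (0.9, 0.80), (0.9, 0.75), (0.9, 0.70)]
stopped_at, checkpoint = run_with_early_stopping(toy, patience=2)
```

The toy run shows that the two criteria can disagree: the retained checkpoint may come from an epoch after the one with the lowest validation loss.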
In a complementary analysis, we repeated the procedure using the same hyperparameters while retaining only AD and FTD scans. This allowed us to assess the model's performance when it was exclusively trained on data of subjects with a neurodegenerative condition.

Specialist-based classification
Each scan of the test set was reviewed by 2 French board-accredited nuclear medicine physicians (HL & FH) and a resident in nuclear medicine (AR), with 12, 9 and 3 years of experience in the field of nuclear medicine, respectively. Non-normalized native brain volumes were visualized using ITK-SNAP (www.itksnap.org), without any clinical information (Yushkevich et al., 2006). Due to physicians' preference, a French rainbow lookup table (LUT) was used for visualization (supplementary Fig. 2). Each scan was classified by specialists into the following categories: AD, FTD, or CN. Scans for which there was disagreement were reviewed to reach consensual agreement. Fleiss' and Cohen's kappa coefficients were calculated to evaluate interobserver agreement and consensus/model agreement.
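For readers unfamiliar with the agreement statistic used for the three readers, a minimal sketch of Fleiss' kappa is given below. The function name `fleiss_kappa` and the toy ratings are hypothetical and do not reproduce the study data.

```python
def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a fixed number of raters per item.

    `ratings` is a list with one entry per scan, each entry being the list
    of labels given by the raters (same number of raters for every scan).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j]: number of raters who assigned scan i to category j
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Observed agreement, averaged over scans
    p_items = [(sum(k * k for k in row) - n_raters) / (n_raters * (n_raters - 1))
               for row in counts]
    p_observed = sum(p_items) / n_items
    # Agreement expected by chance, from the overall category proportions
    proportions = [sum(row[j] for row in counts) / (n_items * n_raters)
                   for j in range(len(categories))]
    p_expected = sum(p * p for p in proportions)
    return (p_observed - p_expected) / (1 - p_expected)


# Toy example: 2 scans, 3 raters.
toy_ratings = [["AD", "AD", "FTD"], ["FTD", "FTD", "FTD"]]
kappa = fleiss_kappa(toy_ratings, ["AD", "FTD", "CN"])
```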

Model visualization with saliency maps
Building on the assumption that prediction probability for the real class of a scan will substantially decrease when voxels relevant for classification are occluded, we performed an occlusion experiment over the non-augmented training set of the specified random split (Zeiler and Fergus, 2014). A 5-voxel occluding cube was applied on each normalized scan with a stride of 2 in all 3 directions. The variations in prediction probability for the class of each scan were plotted as a function of the position of the occlusion cube, generating 38×46×29 voxel volumes. To allow overlay, these volumes were normalized by dividing by their maximum, resized to 75×91×56 voxel volumes through cubic spline interpolation, padded to match the 79×95×60 dimensions, and averaged over each category. Thresholding was performed using the minimum voxel outside the brain. Similarly, Gradient-weighted Class Activation Mapping (Grad-CAM) heatmaps were calculated for the last convolution layer, resized to 75×95×60 voxel volumes, and averaged over each class (Selvaraju et al., 2017). Finally, heatmaps were displayed over each category's averaged brain.
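The occlusion procedure can be sketched as below. Two points are assumptions rather than details from the text: the last cube along each axis is clipped at the volume border (this choice reproduces the reported 38×46×29 grid), and occluded voxels are set to 0. The function names and the `predict` callable, which stands for the trained network returning the probability of the true class, are hypothetical.

```python
import copy
import math


def occlusion_starts(n, cube=5, stride=2):
    """Start indices of the occluding cube along an axis of length n."""
    return list(range(0, n - cube + stride, stride))


def occlusion_grid(shape, cube=5, stride=2):
    """Number of cube positions per axis (last cube may be clipped)."""
    return tuple(math.ceil((n - cube) / stride) + 1 for n in shape)


def occlusion_map(volume, predict, cube=5, stride=2, fill=0.0):
    """Drop in true-class probability for every cube position.

    `volume` is a nested list indexed [x][y][z]; `predict(volume)` returns
    the probability assigned to the true class of the scan.
    """
    nx, ny, nz = len(volume), len(volume[0]), len(volume[0][0])
    baseline = predict(volume)
    xs, ys, zs = (occlusion_starts(n, cube, stride) for n in (nx, ny, nz))
    drops = [[[0.0 for _ in zs] for _ in ys] for _ in xs]
    for i, x0 in enumerate(xs):
        for j, y0 in enumerate(ys):
            for k, z0 in enumerate(zs):
                occluded = copy.deepcopy(volume)
                for x in range(x0, min(x0 + cube, nx)):
                    for y in range(y0, min(y0 + cube, ny)):
                        for z in range(z0, min(z0 + cube, nz)):
                            occluded[x][y][z] = fill
                drops[i][j][k] = baseline - predict(occluded)
    return drops


grid = occlusion_grid((79, 95, 60))

# Toy check: a 7x7x7 volume and a stand-in "model" that only looks at one voxel.
toy_volume = [[[1.0] * 7 for _ in range(7)] for _ in range(7)]
toy_drops = occlusion_map(toy_volume, lambda v: v[6][6][6])
```

In the toy case, only the cube position covering the corner voxel produces a drop, which is the behaviour the saliency maps rely on.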

Voxel-based hypometabolism evaluation
For further exploration of metabolism between categories and for comparative purposes with the saliency maps, we conducted a voxelwise analysis over the training set of the specified random split, to align with the occlusion experiment, using Statistical Parametric Mapping 12 (SPM12, Wellcome Trust Centre for Neuroimaging, London, UK, http://www.fil.ion.ucl.ac.uk/spm) in MATLAB R2020a. Two-sample t-tests were performed for AD vs CN scans, FTD vs CN scans and AD vs FTD scans. The modeling function used default parameters, and grand mean scaling was applied. Multiple comparisons were corrected for using a peak-level family-wise error (pFWE) threshold < 0.05.

Network extrapolability
With the purpose of testing the network on significantly different subjects, 20 DLB scans were included from the ULille database. Nineteen DLB scans had a probable diagnosis and 1 had a certain diagnosis as per the consensus diagnostic criteria (McKeith et al., 2017). Image preprocessing was identical to that of scans from the training/test sets.

Dimensionality reduction examination
We aimed to investigate how data separated based on class and acquisition center over the entire dataset. To achieve this, we used the Uniform Manifold Approximation and Projection (UMAP) technique to project the final dense layer of our model into a 2D space. This method was chosen for its ability to better preserve the global structure of data and its faster processing compared to t-distributed stochastic neighbor embedding (McInnes et al., 2018).

Demographics
Subject demographics are summarized in Table 1. ANOVA tests were computed on age and MMSE, while chi-square tests were performed for sex between categories within each dataset. There was no significant difference for age and sex across categories in the test set (p > .05). Conversely, MMSE scores were significantly different across categories in the test set, and post-hoc testing revealed CN subjects had a significantly higher MMSE than AD or FTD subjects (p = .0007 and .0001, respectively), but there was no significant difference between AD and FTD subjects (p = .83).

Model training
Training of 200 models using Bayesian optimization was completed in 49 h and 31 min. The results of the 20 best performing models can be found in supplementary Table 1. Cross-validation training of the final model with data augmentation on the training set took 1 h and 58 min to complete for one random split. Within this split, model 1 was chosen for fine-tuning on the whole training set, as it demonstrated the highest validation accuracy of 91.5 % (Fig. 2).

Model evaluation
Model accuracy on the test set of the evaluated split was 89.8 %. Model classification is described in detail in Table 2. The AUCs for predictions were 93.3 % for AD, 95.3 % for FTD and 99.9 % for CN (Fig. 3). The average AUCs across all data splits were 94.9 %, 97.9 % and 99.3 %, respectively. The model performed best in detecting CN with 100 % sensitivity (all 20 CN cases were detected), 97 % specificity (38 out of 39 non-CN cases were correctly ruled out by the model), 95 % precision (20 out of 21 cases labeled as CN were correctly classified), and an F1 score of 98 %. Detailed metrics for AD and FTD can be found in Table 3. A typical example of error by the model can be found in Fig. 4.
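The CN metrics above follow directly from one-vs-rest counts. The sketch below recomputes them from the counts implied by the text (59 test scans, 20 CN scans all detected, 21 scans labeled CN, hence 1 false positive and 38 true negatives). The function name `one_vs_rest_metrics` is hypothetical.

```python
def one_vs_rest_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, precision and F1 score for one class
    treated as positive against all other classes."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# CN vs (AD + FTD) on the 59-scan test set, counts derived from the text.
cn = one_vs_rest_metrics(tp=20, fp=1, fn=0, tn=38)
```

Here the F1 score equals 40/41, the harmonic mean of a precision of 20/21 and a sensitivity of 1.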
In the complementary analysis with AD and FTD data only, model accuracy on the same test set was 87.2 %, and AUC was 91.3 %. All FTD patients were labeled as FTD and 75 % of AD patients were classified as AD. Average accuracy across all splits was 88.2 % and the average AUC was 95.7 % (Fig. 5).
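For a binary task such as AD vs FTD, the AUC can be read as the probability that a randomly chosen positive scan receives a higher score than a randomly chosen negative scan. The sketch below computes it through this pairwise (Mann-Whitney) formulation. The function name `roc_auc` and the toy scores are hypothetical, and this is not necessarily how the AUCs of the study were computed.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from predicted scores and binary labels
    (1 = positive class, 0 = negative class). Ties count for one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy example: predicted probability of the positive class for four scans.
toy_auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

The pairwise form is quadratic in the number of scans, which is acceptable for test sets of the size used here.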

Specialist-based classification
Across the test set, there was substantial agreement between physicians, as evidenced by a Fleiss' kappa coefficient of 0.61 (95 % CI: [0.49, 0.73]). Consensual physician labeling (Table 2) had a 69.5 % accuracy (N = 41). The highest specificity was observed for FTD at 92.5 %, but this contrasted with a low sensitivity of 47 %. Comparatively, CN showed the highest sensitivity of 90 % (Table 3). Further information on labeling and detailed metrics per physician can be found in supplementary Tables 2 and 3. Cohen's kappa between the consensus and the model for the whole test set was 0.59 (95 % CI: [0.48, 0.71]), and for cases incorrectly labeled by the consensus or the model was −0.08 (95 % CI: [−0.15, −0.02]).
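Agreement between two labelings, such as the physician consensus and the model, can be quantified with Cohen's kappa. A minimal sketch follows; the function name `cohen_kappa` and the toy labels are hypothetical and unrelated to the study data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two raters who labeled the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the marginal label frequencies of each rater
    expected = sum(count_a[c] * count_b[c]
                   for c in set(labels_a) | set(labels_b)) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy example with four scans.
consensus = ["AD", "AD", "FTD", "CN"]
model = ["AD", "FTD", "FTD", "CN"]
kappa_cm = cohen_kappa(consensus, model)
```

A value near zero or below, as reported for the incorrectly labeled cases, indicates agreement no better than chance.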

Saliency maps
The most prominent regions driving AD classification were found to be the precuneus and posterior cingulate cortex (PCC). The insular cortex and cerebellum were also found to be important in Grad-CAM. The FTD heatmaps showed that anterior regions such as the dorsolateral and mesial frontal regions were driving classification. The FTD Grad-CAM map also highlighted posterior associative regions, but this was less evident in the occlusion heatmap, whereas the occlusion map highlighted the cerebellum. The CN occlusion heatmap showed a diffuse pattern, while the Grad-CAM map highlighted a few regions such as the cerebellum and precentral regions overlying a diffuse pattern (Figs. 6, 7).

Voxel-based hypometabolism evaluation
When performing the AD vs CN t-test, the most hypometabolic regions were the precuneus/PCC region, the bilateral posterior temporoparietal cortex and the mesial temporal lobe (all pFWE < 0.001). When performing the FTD vs CN analysis, the most significant voxels were found in the anterior insula, the whole prefrontal cortex, the anterior temporal lobes, the basal ganglia and to a lesser extent the precuneus/PCC and posterior temporoparietal regions (all pFWE < 0.001). When performing the AD vs FTD analysis, a few voxels in the posterior temporoparietal and precuneus/PCC regions were found hypometabolic in AD subjects (pFWE < 0.01), while FTD subjects displayed hypometabolism of the anterior cingulate cortex (ACC), prefrontal mesial cortex, anterior insula and OFPFC (pFWE < 0.01). Thresholded maps with pFWE < 0.05 are shown in Fig. 8.

Network extrapolability
Demographics of subjects with DLB can be found in supplementary Table 3. Thirteen DLB scans were classified as FTD, 7 as AD and 0 as CN.

Dimensionality reduction examination
As shown in Fig. 9, there appears to be a slight center separation in the CN class, where FTLDNI cases cluster on one side, although still displaying substantial overlap with ADNI cases. On the other hand, there is no evident separation by center for the AD and FTD classes.

Discussion
The future of nuclear medicine in neurodegenerative diseases may require multiple scans over time for a single patient. In this study, we introduce a deep learning model able to classify [18F]-FDG-PET scans into AD, FTD or CN subjects with high accuracy, outperforming in all metrics a consensus reached by physicians specialized in nuclear medicine. Such a tool would be of great assistance to physicians in reducing intra- and inter-observer variability in clinical strategies involving repeated PET scans.
We extend the previous work of Etminani et al. by presenting a CNN model highly efficient in differentiating AD and FTD-spectrum diseases (Etminani et al., 2022). They reported in their study 96.4 %, 71.4 %, 96.2 % and 94.7 % AUC for AD, MCI, DLB and CN, respectively. Our model tends to show slightly better performance, but this should be put into perspective with the fact that we did not include MCI subjects, which, understandably, was the most misclassified category in their work. In addition, in the AD vs FTD analysis, which better reflects the standard situation met in clinical settings, we found an average accuracy of 88.2 % and an average AUC of 95.7 %. Regarding specialist-based diagnosis, while the consensus accuracy is similar at 69.5 % in the present study (compared to 57 % in Etminani et al. with an additional category), our Fleiss' kappa of 0.61 (compared to 0.19) indicates stronger agreement. This low variance in physician performance reflects a shared understanding of the test set and suggests that it is unlikely for another physician to outperform the model. On the other hand, the model and the physicians did not commit the same errors, as shown by the absence of agreement (Cohen's kappa = −0.08).
Furthermore, we also extend classical machine learning approaches that aimed to distinguish AD and FTD using FDG-PET. An early work with spatial decision trees yielded around 90 % accuracy. However, the use of brain regions drawn empirically may have led to overfitting, and a high rating accuracy by experts, also approaching 90 %, indicates that cases might have been more straightforward to classify (Sadeghi et al., 2008).
Another method using a multiple kernel algorithm demonstrated accuracy near 95 %, but once more empiric selection of the most efficient features may have resulted in some overfitting, and sample size was limited (Xia et al., 2014).More recently, an SVM approach yielded an overall accuracy of 78-80 % compared to expert accuracy of 71 % (Perovnik et al., 2022).
The main strength of our model lies in its exceptional accuracy in isolating CN subjects, and this could be envisioned as a first AI clinical application for screening brain scans and presenting them to physicians as having a high probability of being normal or abnormal (Hao et al., 2022; McKinney et al., 2020). Our model accurately classified all CN subjects and erroneously labeled only 1 non-CN scan as CN. None of the CN test subjects were reported to have converted to MCI or dementia since scanning. The clinician consensus had a high sensitivity too at 90 % (vs 100 %), but specificity was moderate at 74 % (vs 97 %). Interestingly, MMSE scores were significantly higher for dementia scans classified CN by physicians compared to true positives (supplementary Table 5). This supports the hypothesis that the model may detect metabolic abnormalities still indiscernible to the human eye.
In the occlusion experiment and Grad-CAM analysis, we found salient regions consistent with the traditional hypometabolism pattern described in visual interpretation and illustrated in Fig. 8. The PCC was found to be by far the most important region driving AD classification, but in clinical practice the emphasis tends to be put on other posterior associative areas such as the temporoparietal junctions, and therefore it reiterates the need to scrupulously examine the PCC/precuneus when facing a possible diagnosis of AD. The PCC is an important area for cognition showing decreased [18F]-FDG uptake in early AD (Minoshima et al., 1997) and also, although inconsistently, decreased uptake in FTD (Scheltens et al., 2018). Contrary to the aforementioned work, in which the PCC was a common driver for classification into AD, MCI or DLB (Etminani et al., 2022), herein the PCC was not found to guide classification towards FTD, and consequently should not be considered as a hallmark of neurodegenerative disease. This may also account for the misclassification of a few AD scans without hypometabolic PCC as FTD (Fig. 4). As expected, frontal regions were highlighted in both FTD occlusion and Grad-CAM maps. The CN heatmaps showed a diffuse pattern. Surprisingly, the cerebellum was a decisive region in several heatmaps, driving classification in FTD occlusion and highlighted in both AD and CN Grad-CAM maps. Although it was not confirmed on the corrected maps of the voxelwise analysis, lower cerebellar metabolism was seen on the uncorrected maps (using p < .001) when comparing FTD to AD or CN. Whether cerebellar glucose metabolism is altered in C9orf72 mutations (N = 10 in the training set) vs non-C9orf72 FTD is debated (Castelnovo et al., 2019; Diehl-Schmid et al., 2019). However, the cerebellum is considered intact in most neurodegenerative disorders and is recommended as a reference region for intensity normalization by the European Association of Nuclear Medicine (Guedj et al., 2022). This calls for further investigation, and it may be interesting in the future to perform a voxelwise analysis between non-C9orf72 FTD and AD using a cerebellar mask.
Fig. 5. ROC curve for AD vs. FTD analysis. Red line, selected iteration. Black lines, average of all iterations. ROC, receiver operating characteristic. AUC and confidence intervals are presented for the averaging across the 30 dataset splits.
In testing extrapolability on DLB, we found that all scans were classified as AD or FTD. This highlights the coherence of our model, which never labeled subjects with another neurodegenerative disease as healthy subjects, an error that would be the most detrimental in a clinical setting. This is also in line with the findings of Etminani et al., in which the dimensionality reduction visualization clearly separated CN and DLB subjects.
Our UMAP dimensionality reduction revealed a slight center separation within the CN class, where FTLDNI cases cluster on one side, perhaps due to all FTLDNI scans being acquired on a single PET camera. However, there was no evident separation by center for AD and FTD.

Strengths
As shown in Fig. 2, we successfully prevented overfitting during the cross-validated training using several regularization methods such as data augmentation, batch normalization, dropout and early stopping. This, we believe, shows that the network could learn robust features during training that would allow its extrapolation to new brain scans with a different acquisition protocol. Similarly, training over a large multicentric dataset that is well-balanced across the different categories would also facilitate utilization in different conditions. We also demonstrated in a supplementary analysis that although acquisition parameters varied between centers, there was no significant difference in accuracy between them (supplementary).
Another strength is that we could include subjects who received a certain diagnosis (N = 30). This is of interest since there is a large overlap in radiological presentations of neurodegenerative diseases (Olney et al., 2017), and protein-targeting drugs will require an unequivocal diagnosis. In addition, the rest of our subjects had a probable diagnosis. Longitudinal analyses have shown that individuals fulfilling FTD probable criteria will continue to do so over time or move to the certain category after postmortem analysis (Devenney et al., 2015). This is, however, to be put into perspective with the fact that neurodegenerative pathologies most often overlap, as pure AD is thought to represent only 30 % of cases and is associated with TDP-43 pathology in 30 % of cases (Villain and Dubois, 2019).

Limitations
Unfortunately, no single center could provide subjects for all 3 categories, which would have ensured that the network did not learn any acquisition specificities. For example, the CN category did not include any ULille subjects. This is because our database, built from clinical data, does not include any healthy control subjects. This could also be viewed as a wider limit to the development of AI in clinical practice, since healthy controls are not commonly found among hospital patients. On the same topic, data augmentation can theoretically increase bias between classes (Balestriero et al., 2022). However, it was required to expand data during training, and the absence of underfitting gives confidence in the absence of significant bias (Fig. 2).
It could be argued that subjects having several scans may have led to inflated performance of the model. However, except for 1 subject whose scans were 1 month apart, there was a minimum of 6 months between scans for other subjects, which is enough to see progressive metabolic changes (Forster et al., 2011). Additionally, we repeated the classification analyses using only one scan for each patient, and average accuracy was 84.1 %, still substantially higher than physician accuracy (supplementary Fig. 3).
Physicians reviewed studies in their native space, while the model labeled spatially and feature-wise normalized volumes. This decision aligned with physician preferences for examining scans and ensured that the scans they reviewed displayed the highest resolution. Therefore, if this introduced a bias, it is more likely that it contributed to increasing physician performance rather than the opposite.
Probably because of regions showing severe hypometabolism, SPM segmentation with basic parameters first considered large cortical areas as not gray matter, and these scans (N = 27) required another segmentation treatment. This is a minor issue in the current study, but it reflects what may happen when developing an AI pipeline. Even if AI is successfully incorporated in imaging departments, this highlights the need for quality checks to ensure coherent results.
Finally, in the age of emerging blood biomarkers and new radiotracers, some may question the clinical relevance of using FDG-PET, which might seem an out-of-fashion approach to neurodegenerative classification. However, we argue that this is not the case. Lumbar puncture remains an invasive procedure, blood biomarkers are promising but have yet to be seen in clinical routine, and amyloid PET can be twice as expensive (Contador et al., 2023; Teunissen et al., 2022). Furthermore, a recent consortium has reaffirmed the importance of FDG-PET for dementia diagnosis, even prioritizing it over CSF markers in certain situations (Chetelat et al., 2020).

Conclusion
In this study, we demonstrated the ability of a tailor-made 3D CNN to accurately classify [18F]-FDG PET scans as AD, FTD or CN. Our results showed that this model outperforms clinical interpretation by experienced physicians and displays an excellent capability in identifying control subjects. These findings add to the growing field of AI in metabolic imaging and suggest that our clinical practice may change in the near future by integrating these tools. To our knowledge, this is the first work using deep learning techniques to classify AD vs FTD subjects based on brain glucose metabolism.

Fig. 1. Overview of methods. (A) Dataset of 199 AD, 192 FTD and 200 CN scans from ULille, ADNI and FTLDNI databases split between training and test sets on a 90 %-10 % basis. Scans were spatially normalized to the ICBM152 template. (B) Network architecture and Bayes search to select hyperparameters. (C) Training on augmented data through 5-fold cross-validation, with the best model kept for retraining on the whole training set. (D) Clinical interpretation of the native scans by 3 physicians. (E) Performance comparison in the independent test set between the model and the physicians' consensus. (F) Occlusion experiment and Grad-CAM. (G) Predictions of the model on DLB scans. (H) UMAP visualization. AD, Alzheimer's disease; ADNI, Alzheimer's disease neuroimaging initiative; CN, cognitively normal; DLB, dementia with Lewy bodies; FTD, frontotemporal dementia; FTLDNI, frontotemporal lobar degeneration neuroimaging initiative; ICBM, International Consortium for Brain Mapping; ULille, University of Lille.

Fig. 2. Model training using 5-fold cross-validation. Top graphs show training and validation accuracies. Bottom graphs show training and validation losses. Model 1 had the highest validation accuracy and was selected for further training.

Fig. 4. AD scan labeled FTD. Top: native scan; bottom: preprocessed data. Hypometabolism is observed in posterior parietal associative regions and, to a lesser degree, in mesial frontal regions, whereas the posterior cingulate and precuneus appear relatively normal.

Table 1
Data demographics. Data are presented for the same split as used for comparison with specialist-based classification and secondary investigation. * Some subjects had several scans acquired at different timepoints; these are treated as different individuals for calculation of demographic variables. ** Two AD subjects and 12 FTD subjects did not have an MMSE score within a year of scanning. ANOVA was used for age and MMSE, and chi-square for sex. AD, Alzheimer's disease; ADNI, Alzheimer's disease neuroimaging initiative; CN,