Fine-grained age-matching improves atrophy-based detection of mild cognitive impairment more than amyloid-negative reference subjects

Graphical abstract


[...] groups significantly increased accuracy, more so than regression-based approaches and using amyloid-negative reference groups.

Introduction

Mild cognitive impairment (MCI), characterized by unimpaired activities of daily living in the presence of objectifiable cognitive deficits [Albert et al., 2011], often precedes dementia in the course of Alzheimer's disease (AD). MCI is associated with an increased risk of progression to dementia due to AD [Lombardi et al., 2020; Mitchell and Shiri-Feshki, 2009], but it can also be caused by other neurodegenerative diseases, vascular lesions, medication side effects, or other medical conditions, e.g., depression [Lopez et al., 2003].

Cerebral imaging plays a central role in the diagnostic work-up of MCI to rule out underlying structural lesions and to detect cerebral atrophy to guide further diagnostic steps. However, since neurodegenerative changes overlap with age-related atrophy [Franke et al., 2010; Raji et al., 2009], it remains challenging to differentiate between the two.

Different approaches have been taken to standardize the quantification of brain atrophy. Semiquantitative rating scales of global and regional cerebral atrophy allow a quick assessment [Koedam et al., 2011; Pasquier et al., 1996; Scheltens et al., 1992] but show inter-rater variability [Pasquier et al., 1996; Scheltens et al., 1992; Scheltens et al., 1995].

An alternative is quantifying the deviation of brain and tissue volumes from the mean of a control sample, as is common practice in evaluating cerebral glucose metabolism measured using positron emission tomography (PET) [Minoshima et al., 1995]. In the case of MRI-based atrophy measurements, a patient's high-resolution T1-weighted image is automatically segmented into tissue classes and spatially normalized to a reference template. The difference between the patient's gray matter (GM) and the mean GM of a control group is then computed and expressed as multiples of the standard deviation of the control group, i.e., Z-statistics [Matsuda et al., 2012]. It is possible to differentiate between cognitively normal (CN) individuals and patients with AD dementia with high accuracy using various indices. However, the classification accuracy is considerably lower when distinguishing between CN and MCI [Li et al., 2019; Waragai et al., 2014].

The reference group in this approach is age-matched to the patient sample but typically with an age range of around 30 years [Hirata et al., 2005; Komatsu et al., 2018; Matsuda et al., 2012; Matsunari et al., 2007; Tateno et al., 2015; Waragai et al., 2014], inducing a considerable bias if a patient's age is close to the extremes of the reference sample. Alternatively, a regression-based approach can be used to account for confounds such as age. Here, GM volume is regressed on the confound in a reference sample, and the resulting parameter estimates are used to compute an expected GM volume for the individual patient's level of that confound. The expected GM volume is then compared to the actual GM volume of the patient [Alzheimer's Disease Neuroimaging Initiative et al., 2015; Hedderich et al., 2020; Hedderich et al., 2022; Mühlau et al., 2009]. However, this approach assumes a linear relationship between the measure of interest and confounds, which, in the case of age, is not necessarily valid [Dima et al., 2022; Fjell et al., 2013; Hedman et al., 2012].

An amyloid-negative reference group without amyloid-related atrophy could thus be more sensitive to disease-related atrophy.

We hypothesized that optimizing the reference group could improve MR-based differentiation between CN and MCI patients. Specifically, we hypothesized that an approach with amyloid-negative reference groups closely matched to an individual patient's age would be more sensitive to neurodegenerative changes in GM volume, resulting in greater classification accuracy, even compared to regression-based approaches. Using structural MRI as well as amyloid-PET data of CN participants and patients with MCI from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.ucla.edu) and an independent sample acquired in our center, we investigated the effects of different types of reference groups on the utility of Z-statistics-based atrophy quantification in differentiating between CN and MCI. Furthermore, we examined the effects of atrophy thresholds, the extent of spatial smoothing, and region of interest (ROI) on classification accuracy.


Methods

Data used to prepare this study were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD).


Subjects

Structural MRI and (AV45) amyloid-PET scans of CN individuals and participants with MCI were retrieved from the ADNI-GO and the ADNI-2 phases. Detailed documentation of the inclusion criteria and diagnostic categories can be found on the ADNI website (https://adni.loni.usc.edu/methods/documents/).

In short, CN was defined by an MMSE greater than 23 points, a Clinical Dementia Rating (CDR) of 0, and neuropsychological performance within normal ranges. MCI was defined as an MMSE greater than points, a CDR of 0.5, a subjective memory concern reported by the patient, caregiver, or treating clinician, a memory loss objectified using the education-adjusted delayed-recall performance on the Wechsler Memory Scale Logical Memory II, and preserved activities of daily living.

For each participant, the T1 scan performed closest to an amyloid-PET scan was selected and classified as CN or MCI based on the neuropsychological assessment with the shortest delay from the MRI scan. Participants with cerebral infarcts or significant depressive symptoms, indicated by a GDS score greater than 5, were not included in the analysis. Furthermore, seven CN participants and two MCI patients were excluded because of confluent white matter lesions corresponding to Fazekas grade 3 [Fazekas et al., 1993], and four CN and one MCI participant had to be excluded due to image processing failures. The resulting final data set included a total of 141 cognitively normal individuals and 91 participants with MCI. The average time between PET and MRI measurements was 28.31 ± 28.46 days.


Confirmatory sample

Findings were validated on an independent sample of CN and MCI participants acquired in our center as part of a study approved by the ethics committee of the University of Cologne's medical faculty. All participants gave written informed consent. In this sample, CN was defined by an MMSE greater than 23 points and unimpaired performance on neuropsychological tests of memory, language, and executive function. MCI was defined as an MMSE greater than 23 points, a subjective memory concern reported by the patient, caregiver, or treating clinician, memory loss objectified using the age-adjusted delayed-recall performance on the Wechsler Memory Scale Logical Memory II, and preserved activities of daily living, based on information provided by the spouse or a caregiver. This data set consisted of 19 CN individuals and 19 MCI patients. MCI patients presented with a predominantly amnestic phenotype, and 18 of the 19 patients had positive CSF biomarkers or amyloid-PET indicative of Alzheimer's pathology, as previously described [Conwell et al., 2018; Richter et al., 2020].


PET and MRI acquisition

ADNI-GO/-2 MRI data were acquired on 3 T MRI scanners by Siemens, Philips, and General Electric Healthcare. The present study used the scanner-specific 3D sagittal T1-weighted magnetization-prepared rapid gradient-echo (MPRAGE) sequences. ADNI's original MPRAGE sequences undergo standardized image correction steps during preprocessing to increase signal uniformity across different scanner types and trial centers.

AV45-PET data were also acquired on different scanners at the different trial centers. Therefore, the PET data in the ADNI study also undergo standardized preprocessing steps to increase signal uniformity across centers. The imaging protocols used at the different trial centers are described in detail on the ADNI website (https://adni.loni.usc.edu/data-samples/).

T1-weighted MPRAGE images for the confirmatory sample were acquired using a 3 T MAGNETOM Trio (Siemens, Erlangen, Germany) with a custom-built BrainPET insert in the bore of the magnet, using a transmit-receive coil and an 8-channel receive coil. The scan parameters were: TR = 2250 ms, TE = 3.03 ms, FA = 9°, FOV = 256 × 256 mm², matrix = 256 × 256, voxel resolution = 1 mm isotropic, 176 sagittal slices, no gap, interleaved, scan time = 5 min and 14 s. Vacuum cushions were used to reduce head motion. Automated and manual shimming was applied before data acquisition to account for field inhomogeneities resulting from the BrainPET insert.


MRI data processing - Voxel-based morphometry (VBM)

MRI data were processed using statistical parametric mapping (SPM12, Wellcome Trust Center for Neuroimaging) with the computational anatomy toolbox (CAT12, https://www.neuro.uni-jena.de/cat/) implemented in MATLAB R2012b (MathWorks, Natick, MA, USA).

Images were bias-corrected and automatically segmented into GM, white matter (WM), and cerebrospinal fluid (CSF). The GM segment underwent visual inspection for misclassification of tissue. GM maps of three CN and one MCI participant had to be excluded from further analyses because of tissue classification errors.

Tissue maps resulting from the segmentation were high-dimensionally warped to a study-specific template using diffeomorphic anatomical registration through exponentiated Lie algebra (DARTEL) [Ashburner, 2007]. The template was generated from the T1 images of 34 amyloid-positive and 34 amyloid-negative CN participants of the ADNI sample, who were matched according to age and gender. The GM partitions were warped to the template space and modulated for the nonlinear normalization only to preserve tissue concentrations while accounting for differences in total intracranial volume (TIV). Resulting GM maps were smoothed with Gaussian kernels of 2 mm, 4 mm, and 8 mm full width at half maximum (FWHM) for subsequent analyses.
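
The kernels above are specified by their FWHM, whereas some smoothing tools (for example, the `-s` option of FSL's `fslmaths`) expect the standard deviation of the Gaussian. The following is a minimal sketch of the standard conversion, sigma = FWHM / (2 * sqrt(2 * ln 2)); the function name is illustrative and is not taken from the original processing pipeline.

```python
import math


def fwhm_to_sigma(fwhm_mm: float) -> float:
    """Convert the FWHM of a Gaussian kernel (mm) to its standard deviation (mm)."""
    return fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


# Kernel sizes named in the text
for fwhm in (2.0, 4.0, 8.0):
    print(f"FWHM {fwhm:.0f} mm -> sigma {fwhm_to_sigma(fwhm):.3f} mm")
```

The conversion factor is approximately 2.355, so an 8 mm FWHM kernel corresponds to a sigma of roughly 3.4 mm.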


Regions of interest

Atrophy was assessed at the level of total GM and cortical GM in the individual cerebral lobes defined using the Montreal Neurological Institute (MNI) atlas [Mazziotta et al., 2001], as well as a medial temporal lobe (MTL) ROI, including the hippocampus, the amygdala, the parahippocampal, and the temporal fusiform gyri, defined using the Harvard-Oxford atlas [Desikan et al., 2006].


Amyloid status

Amyloid status in the ADNI sample was defined based on the standard uptake value ratios (SUVR) of florbetapir, as made available from ADNI. Briefly, in the framework of ADNI, preprocessed florbetapir scans (https://adni.loni.usc.edu/methods/pet-analysis-method/pet-analysis/) were coregistered to the participants' T1 scans, which were segmented and parcellated into ROI using the software Freesurfer (https://surfer.nmr.mgh.harvard.edu/, version 5.3.0). The SUVR for each participant was then obtained by dividing the mean florbetapir uptake in a set of cortical ROI by the florbetapir uptake in the whole cerebellum (white and gray matter) [Landau et al., 2012]. Participants with a florbetapir SUVR greater than 1.11 were considered amyloid-positive, and those with a florbetapir SUVR < 1.11 were considered amyloid-negative [Clark, 2011; Joshi et al., 2012]. Amyloid status in the confirmatory sample was defined based on cerebrospinal fluid (CSF) biomarkers or clinical amyloid-PET.
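
The SUVR-based classification described for the ADNI sample can be sketched as follows. This is a simplified illustration, not the ADNI pipeline: it assumes an unweighted mean over the cortical ROI, and the function names and example uptake values are hypothetical. Only the cutoff of 1.11 is taken from the text.

```python
AMYLOID_SUVR_CUTOFF = 1.11  # florbetapir cutoff stated in the text


def suvr(cortical_uptakes, cerebellum_uptake):
    """Mean cortical uptake divided by whole-cerebellum uptake.

    Assumption: an unweighted mean over the cortical ROI is used here.
    """
    if cerebellum_uptake <= 0:
        raise ValueError("reference-region uptake must be positive")
    return sum(cortical_uptakes) / len(cortical_uptakes) / cerebellum_uptake


def is_amyloid_positive(suvr_value, cutoff=AMYLOID_SUVR_CUTOFF):
    """Amyloid-positive if the SUVR exceeds the cutoff."""
    return suvr_value > cutoff


# Hypothetical uptake values for one participant
value = suvr([1.2, 1.4], 1.0)
print(round(value, 2), is_amyloid_positive(value))
```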


Statistics

Data were tested for normality of distribution using the Shapiro-Wilk test. Group comparisons were performed using independent-samples t-tests for normally distributed data and the Wilcoxon rank-sum test for non-normally distributed data. Gender distribution was assessed using the chi-square test. Statistical analyses were performed with the software R (Version 3.6.3, https://cran.r-project.org/). Receiver operating characteristic (ROC) analyses were performed within R using the package "pROC". Image arithmetic and spatial smoothing were performed with modules of the FSL software package (FMRIB's Software Library, Version 5.0, https://www.fmrib.ox.ac.uk/fsl).


Definition of atrophy

Atrophy was operationalized for each participant as the deviation of GM volume from the mean of a reference group: GM maps of participants were transformed to voxel-wise Z-statistics by subtracting the mean of a reference group and dividing the difference by the standard deviation of that reference group. As the central element of this investigation, the effect of using different reference groups was analyzed (see below). The voxel-wise Z-maps were thresholded at different Z-levels (-2.5, -3.5, and -4.5), and the number of subthreshold voxels was assessed in each ROI. For each ROI, the number of subthreshold voxels was entered in a ROC analysis to determine the accuracy with which the participants could be classified as CN or MCI patients.
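
The voxel-wise Z-transformation and the counting of subthreshold voxels can be sketched in plain Python as follows. This is a toy illustration on flat lists of voxel values, not the FSL-based implementation used in the study; the function names, the use of the sample standard deviation, and the handling of zero-variance voxels are assumptions.

```python
from statistics import mean, stdev


def z_map(patient, reference):
    """Voxel-wise Z = (patient - reference mean) / reference SD.

    patient:   list of GM values, one per voxel
    reference: list of reference subjects, each a list of GM values
    """
    z = []
    for v, value in enumerate(patient):
        ref_values = [subject[v] for subject in reference]
        sd = stdev(ref_values)  # sample SD (assumption)
        # Voxels without variance in the reference group are set to 0 (assumption)
        z.append((value - mean(ref_values)) / sd if sd > 0 else 0.0)
    return z


def count_subthreshold(z, roi_mask, threshold):
    """Number of voxels inside the ROI whose Z-value lies below the threshold."""
    return sum(1 for zv, in_roi in zip(z, roi_mask) if in_roi and zv < threshold)


# Toy data: four reference subjects, three voxels
reference = [[0.5, 0.6, 0.7], [0.6, 0.6, 0.8], [0.7, 0.6, 0.9], [0.6, 0.6, 0.8]]
patient = [0.3, 0.6, 0.75]
z = z_map(patient, reference)
for threshold in (-2.5, -3.5, -4.5):
    print(threshold, count_subthreshold(z, [True, True, True], threshold))
```

The resulting count per ROI is the single number per participant that enters the ROC analysis.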


Types of reference groups

Age effects were investigated by comparing the performance of two approaches using different age-specific reference groups to the 'standard' approach of using one reference group consisting of CN participants in the same age range as the whole patient group [Hirata et al., 2005; Matsuda, 2016] and to a regression-based approach explained below. In the first type of age-specific approach, mean and standard deviation GM maps were computed from 20 CN participants whose age deviated less than five years from the respective age. The second type of age-specific approach used reference groups of 20 CN participants whose age deviated less than 2.5 years from the age investigated (Fig. 1). Z-statistics were computed for each participant, comparing them with the mean of the reference group corresponding to their age. In the 'standard' approach, the mean and standard deviation of the 141 CN participants from the ADNI sample were used.

For the two age-specific approaches, Z-statistics could not be computed for all participants: owing to the normal age distribution, fewer than 20 CN participants were available to compute mean and standard deviation images at the lower and upper extremes of the age range. Using all CN participants, reference groups of 20 participants with ages ± 2.5 years of the participant could be generated for ages 66 to 80 years and ± 5 years of the participant for ages 63 to 82 years.
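
The construction of an age-specific reference group can be sketched as follows. The text specifies groups of 20 CN participants within the age window but not how the 20 are chosen when more candidates are available; picking those closest in age is an assumption made here for illustration, as are the function and parameter names.

```python
def select_reference_group(target_age, cn_ages, half_width, group_size=20):
    """Return indices of CN participants forming the reference group, or None.

    Candidates are CN participants whose age differs from target_age by less
    than half_width years (2.5 or 5 in the text). Assumption: the group_size
    candidates closest in age are selected.
    """
    candidates = [i for i, age in enumerate(cn_ages)
                  if abs(age - target_age) < half_width]
    if len(candidates) < group_size:
        return None  # too few CN participants at this age: no Z-map possible
    candidates.sort(key=lambda i: abs(cn_ages[i] - target_age))
    return candidates[:group_size]


# Hypothetical CN ages between 70.0 and 72.9 years
cn_ages = [70 + 0.1 * i for i in range(30)]
print(len(select_reference_group(71, cn_ages, 2.5)))  # enough candidates
print(select_reference_group(80, cn_ages, 2.5))       # too few candidates
```

Returning None for sparsely covered ages mirrors the restriction described above, where Z-statistics are only available for the age ranges with enough CN participants.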

In order to assess the relevance of amyloid status for classification accuracy, the same analyses were also performed using only the amyloid-negative CN participants to form the reference groups. Using only amyloid-negative CN, reference groups with ages within 2.5 years of the participant were generated for ages 68 to 78 and within five years of the participant for ages 63 to 84.

The difference in the number of subthreshold voxels between groups was assessed using Wilcoxon rank-sum tests because of their non-normal distribution. Differences were deemed significant at a p < 0.0001, corresponding to a Bonferroni-corrected p = 0.0216.


ROC analyses

ROC analyses were performed to determine the parameters that distinguish CN and MCI with the highest accuracy. Specifically, we examined the effect of smoothing (unsmoothed, Gaussian kernels of 2 mm, 4 mm, and 8 mm FWHM), Z-threshold, amyloid status, and region of interest. The statistical significance of the difference in AUC was assessed using DeLong's test as implemented in the R package 'pROC'. The same analyses were applied to an independent validation sample to validate our observations for the optimal parameters.
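
For orientation, the area under the ROC curve (AUC) for a single predictor such as the number of subthreshold voxels can be computed directly from pairwise comparisons. The sketch below is a plain-Python illustration of this rank-based definition; it is not the pROC implementation used in the study and does not include DeLong's test. The function name and the example counts are hypothetical.

```python
def roc_auc(scores_neg, scores_pos):
    """AUC as the probability that a randomly chosen positive case (MCI)
    has a higher score than a randomly chosen negative case (CN);
    ties count as one half."""
    if not scores_neg or not scores_pos:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical numbers of subthreshold voxels in one ROI
cn_counts = [0, 2, 5, 1]
mci_counts = [3, 8, 5, 12]
print(roc_auc(cn_counts, mci_counts))
```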


W-Scores

A number of publications have reported on the use of W-scores to adjust for covariates such as age using linear regression. We have included additional analyses to compare the performance of this approach to the use of age-specific references. For details please see the Supplementary Materials.
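
As a rough illustration of the regression-based idea, a W-score is commonly defined as the difference between the observed value and the value predicted from a regression fitted in the reference sample, divided by the standard deviation of the regression residuals. The sketch below assumes this common definition with age as the only covariate; the implementation actually used is described in the Supplementary Materials and may differ. All names and example values are hypothetical.

```python
import math
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


def w_score(age, gm, ref_ages, ref_gm):
    """(observed - age-expected GM) / SD of the residuals in the reference sample."""
    intercept, slope = fit_line(ref_ages, ref_gm)
    residuals = [g - (intercept + slope * a) for a, g in zip(ref_ages, ref_gm)]
    # Residual SD with n - 2 degrees of freedom (two fitted parameters)
    sd = math.sqrt(sum(r * r for r in residuals) / (len(residuals) - 2))
    return (gm - (intercept + slope * age)) / sd


# Hypothetical reference sample: GM value declines linearly with age
ref_ages = [60, 70, 80, 60, 70, 80]
ref_gm = [0.71, 0.65, 0.61, 0.69, 0.65, 0.59]
print(round(w_score(70, 0.62, ref_ages, ref_gm), 2))
```

Unlike the age-specific reference groups, this approach uses the whole reference sample for every patient but relies on the linearity assumption discussed in the Introduction.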

Fig. 1. Schematic illustration of the two approaches for the computation of Z-statistics using age-specific GM reference maps in the exemplary case of a 73-year-old participant. For each age with at least 20 cognitively normal participants (CN) within a 5- or 10-year bracket, mean and SD GM maps were computed. For each participant, an age-specific Z-map was computed using the two approaches. For comparison, Z-maps were also computed in a standard approach using the mean and SD of a CN group of similar age as the patient group. As indicated in the schematic, age brackets narrow the range of ages that can be addressed, especially when using 5-year age brackets. SD = standard deviation, GM = gray matter.
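The bracket-based computation described in the caption can be sketched as follows. This is a simplified illustration, not the processing pipeline used in the study: maps are represented as flat lists of GM values, all CN inside the bracket are used as the reference, and the sample SD (n − 1) is assumed. The function and parameter names are hypothetical.

```python
from statistics import mean, stdev

MIN_REFERENCE_SIZE = 20  # minimum number of CN per bracket, as stated in the caption


def age_specific_z_map(patient_age, patient_map, cn_ages, cn_maps, bracket_years=10):
    """Return a voxel-wise Z-map, or None if the bracket holds too few CN.

    Maps are flat lists of GM values, one entry per voxel.
    """
    half_width = bracket_years / 2
    # Keep only CN whose age lies within the bracket centered on the patient's age.
    reference = [m for a, m in zip(cn_ages, cn_maps) if abs(a - patient_age) <= half_width]
    if len(reference) < MIN_REFERENCE_SIZE:
        return None
    z_map = []
    for voxel, value in enumerate(patient_map):
        ref_values = [m[voxel] for m in reference]
        z_map.append((value - mean(ref_values)) / stdev(ref_values))
    return z_map
```

Returning `None` for sparsely populated brackets mirrors the point made in the caption: narrower brackets reduce the range of ages for which a reference can be built.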


Results


De
MCI patients performed more poorly than CN on the MMSE and the LM-DR in both samples and on the TMT in the validation sample. Interestingly, while CN and MCI of sample 1 did not differ in TMT performance, amyloid-positive CN performed worse than amyloid-negative CN on both the TMT-A (amyloid-positive CN = 38.91 s ± 10.81 s, amyloid-negative CN = 31.99 s ± 10.72 s, p < 0.001) and the TMT-B (amyloid-positive CN = 91.47 s ± 39.47 s, amyloid-negative CN = 77.28 s ± 43.13 s, p = 0.007; Table 1).

In sample 1 (ADNI), 34 of 141 CN (24%) and 42 of 91 MCI participants (46%) were amyloid-positive, while the majority of MCI participants in the validation sample (18 of 19) had CSF biomarkers or amyloid PET indicative of AD pathology. These biomarkers of AD pathology had not been assessed in the CN of the validation sample.

There was no difference between CN and MCI groups with respect to gender distribution or level of education in either sample.


Group differences in the number of subthreshold voxels

Using the standard reference group, significant differences in the number of subthreshold voxels between CN and MCI were observed in frontal, temporal, parietal, and total GM at Z-thresholds of −3.5 and −4.5 without smoothing and in frontal and total GM with 2 mm smoothing. Using age-specific reference groups, significant differences in the number of subthreshold voxels were observed for all ROIs, Z-thresholds, and degrees of smoothing, except for MTL and lobar GM at a Z-threshold of −4.5 with 8 mm smoothing. There were no significant differences between the two types of age-specific reference groups concerning the number of subthreshold voxels (Table 3).
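The outcome measure compared here, the number of subthreshold voxels within an ROI, can be illustrated with a short sketch. The flat-list representation of the Z-map and ROI mask, the function name, and the use of a strict inequality at the threshold are assumptions for illustration.

```python
def count_subthreshold_voxels(z_map, roi_mask, z_threshold=-3.5):
    """Count voxels inside the ROI whose Z-value lies below the threshold."""
    return sum(1 for z, in_roi in zip(z_map, roi_mask) if in_roi and z < z_threshold)


# Hypothetical five-voxel example; the last voxel lies outside the ROI.
z_values = [-4.2, -1.0, -3.6, 0.3, -5.0]
mask = [True, True, True, True, False]
print(count_subthreshold_voxels(z_values, mask))                    # 2 voxels below -3.5
print(count_subthreshold_voxels(z_values, mask, z_threshold=-4.5))  # none below -4.5 in the ROI
```

The second call shows why very low thresholds can reduce discrimination: once few or no voxels survive, the counts of CN and MCI participants become indistinguishable.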


Classification accuracy based on the number of subthreshold voxels

ROC analyses revealed an optimal separation between CN and MCI using age-specific reference groups. Diagnostic accuracy was substantially higher when using the 10-year bracket approach than when using the standard reference group, reaching an AUC of 1 for temporal, parietal, and total GM with 2 mm or no smoothing kernels. Using 5-year brackets did not increase diagnostic accuracy compared to 10-year brackets. The greatest AUC achieved using the standard approach was 0.731 for total GM without spatial smoothing, which is significantly poorer than the accuracies achieved with both age-specific approaches under those conditions (DeLong's tests, p < 0.001). Diagnostic accuracy increased with decreasing Z-thresholds, with the greatest accuracy generally seen around a Z-threshold of −3.5 (Table 4). Lowering the Z-threshold to −4.5 resulted in a decrease in AUC in some areas. With the exception of the MTL, the ROIs did not differ significantly in AUC under optimal conditions (0 or 2 mm smoothing, Z-threshold = −3.5; all DeLong's tests, p > 0.05). The AUC was greater in all other ROIs than in the MTL (for example, compared to total GM at 0 mm and Z-threshold = −3.5: AUC MTL = 0.985, AUC total GM = 1, DeLong's test p = 0.0263).
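To make the AUC values concrete: for a single predictor such as the number of subthreshold voxels, the AUC equals the probability that a randomly chosen MCI participant has a higher count than a randomly chosen CN participant, with ties counted as one half (the Mann-Whitney formulation). The sketch below computes it directly; the function name and the toy counts are hypothetical, and no claim is made about the software used for the reported analyses.

```python
def auc_from_counts(cn_counts, mci_counts):
    """AUC as the fraction of (MCI, CN) pairs in which the MCI count is higher."""
    wins = 0.0
    for m in mci_counts:
        for c in cn_counts:
            if m > c:
                wins += 1.0
            elif m == c:
                wins += 0.5  # ties contribute one half
    return wins / (len(cn_counts) * len(mci_counts))


print(auc_from_counts([0, 1, 2], [3, 4, 5]))  # 1.0: every MCI count exceeds every CN count
print(auc_from_counts([0, 1], [0, 1]))        # 0.5: no discrimination
```

An AUC of 1 therefore means that the counts of the two groups do not overlap at all for the given ROI, threshold, and smoothing kernel.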


Effect of amyloid status

When only amyloid-negative CN were included in the reference groups, the greatest AUC for the standard reference group without age brackets was 0.825 for total GM (Table 5). Without age brackets, the AUC was greater for the amyloid-negative reference group than for the reference group including both amyloid-negative and amyloid-positive CN when using 2 mm or no smoothing kernels. The most pronounced difference was observed for parietal GM at a Z-threshold of −4.5, but this difference was barely significant (AUC with amyloid-negative CN reference group = 0.689, AUC with amyloid-negative and -positive reference group = 0.593, DeLong's test p = 0.039). Restricting the age-specific reference groups to amyloid-negative CN did not improve AUC values (Tables 4 and 5).


Optimal cut-off values

The highest accuracy for distinguishing between CN and MCI was achieved using age-specific brackets. From a practical standpoint, the ideal approach uses age-specific 10-year brackets including CN irrespective of amyloid status, as this covers the greatest age range. Using this approach, the greatest accuracy was achieved using unsmoothed data and a Z-threshold of −3.5, with a cut-off of 98.5 subthreshold voxels for total GM, 3.5 subthreshold voxels in medial temporal GM, 29 subthreshold voxels in frontal GM, 19.5 subthreshold voxels in temporal GM, 24 subthreshold voxels in the parietal GM, and subthreshold voxels in the occipital GM. Similar AUC values were observed when using a 2 mm smoothing kernel and a Z-threshold of −3.5, with a cut-off of 61.5 subthreshold voxels for total GM, 0.5 subthreshold voxels in MTL GM, 13.5 subthreshold voxels in frontal GM, 6 subthreshold voxels in temporal GM, 14.5 subthreshold voxels in the parietal GM, and 3.5 subthreshold voxels in occipital GM.
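Half-valued cut-offs such as 98.5 are consistent with candidate thresholds placed midway between adjacent observed voxel counts, a common convention in ROC software. The following sketch shows one way such a cut-off could be selected. The selection criterion used here (maximizing Youden's J, i.e., sensitivity + specificity − 1) is an assumption, as the criterion is not stated in this section, and the function name and toy counts are hypothetical.

```python
def best_cutoff(cn_counts, mci_counts):
    """Return (cut-off, Youden's J) maximizing sensitivity + specificity - 1.

    Candidate cut-offs are midpoints between adjacent distinct observed counts.
    Returns None if fewer than two distinct counts are observed.
    """
    values = sorted(set(cn_counts) | set(mci_counts))
    candidates = [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
    best = None
    for cut in candidates:
        # Participants with a count above the cut-off are classified as MCI.
        sensitivity = sum(m > cut for m in mci_counts) / len(mci_counts)
        specificity = sum(c < cut for c in cn_counts) / len(cn_counts)
        j = sensitivity + specificity - 1
        if best is None or j > best[1]:
            best = (cut, j)
    return best


# Hypothetical counts of subthreshold voxels in total GM.
print(best_cutoff([0, 2, 5, 10], [40, 90, 157, 300]))  # (25.0, 1.0)
```

In the toy data the groups do not overlap, so the midpoint between the highest CN count and the lowest MCI count separates them perfectly.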


Group differences in the validation sample

The observations regarding the reference group type (standard vs. age-specific) and the optimal Z-thresholds and smoothing kernels were validated in an independent in-house sample of 19 CN and 19 MCI patients (Table 6). The in-house sample was tested using 10-year brackets to allow the inclusion of the maximum number of participants. With the standard reference group approach, no differences in the number of subthreshold voxels between CN and MCI reached significance at p < 0.0001. Using the age-specific reference group approach, significant differences between the diagnostic groups were seen in medial temporal GM and temporal GM without smoothing and with 2 mm smoothing at all Z-thresholds, and in total GM at a Z-threshold of −4.5, while the frontal and parietal GM showed trends toward a group difference.


Classification accuracy in the validation sample

The greatest AUC achieved using the standard reference group in this sample was 0.773 in the MTL without smoothing and in total GM with mm smoothing (Table 7).

The greatest classification accuracies using the age-specific approach with 10-year brackets were achieved for medial temporal and temporal GM with no or 2 mm smoothing. The maximum was an AUC of 0.985 at a Z-threshold of −3.5 without smoothing in the MTL. Moderate AUC values ranging between 0.839 and 0.881 were observed for total GM without and with 2 mm smoothing. AUC values, especially in frontal and parietal GM but also in total GM, were much lower in this sample than in the ADNI sample.

Group differences were computed as Wilcoxon tests. Bold font and gray background indicate a significant group difference at p < 0.0001. ROI = region of interest, Z = Z-threshold, MCI = mild cognitive impairment, CN = cognitively normal, SD = standard deviation, GM = gray matter, MTL = medial temporal lobe. The smoothing kernel is reported in mm of full width at half-maximum.

Using the age-specific 10-year brackets, the best classification accuracy was achieved using unsmoothed data and a Z-threshold of − 3.5, with an extent cut-off of 1111.5 subthreshold-voxels for total GM, 161 subthreshold-voxels in MTL GM, 280 subthreshold-voxels in frontal GM, 294 subthreshold-voxels in temporal GM, 280 subthreshold-voxels in the parietal GM, and 64 subthreshold-voxels in the occipital GM.Similar AUC values were observed when using a 2 mm smoothing kernel and a Z-threshold of − 3.5 with a cut-off of 1006.5 subthreshold voxels for total GM, 76.5 voxels in the MTL, 239 subthreshold-voxels in frontal GM, 243.5 subthreshold-voxels in temporal GM, 147 subthreshold-voxels in the parietal GM, and 48.5 subthreshold-voxels in occipital GM.


Discussion

We demonstrate that VBM can differentiate between MCI-like atrophy and atrophy in cognitively normal aging with very high accuracy. Age-specific reference groups significantly increased accuracy, more so than regression-based approaches or the use of amyloid-negative reference groups. Constraining the age range for the reference template to within five years of the patient's age (i.e., a 10-year bracket centered on the patient's age) improved accuracy substantially. A further narrowing of the age range led to only marginal improvements in accuracy while reducing the number of patients that could be examined.


Age-adjustment outweighs amyloid-status

Approaches accounting for age differences within the reference group consistently outperformed those that merely used a reference group in a similar age range as the patients. This finding is likely because the ages of the MCI patients examined often range from 60 to 90 and beyond [Hedderich et al., 2022; Hirata et al., 2005; Matsuda et al., 2012; Waragai et al., 2014]. A considerable loss of brain volume characterizes this part of the life span [Bethlehem et al., 2022; Hedman et al., 2012]. Comparing all patients to a single reference group therefore leads to an overestimation of atrophy in patients at the upper end of the age range and an underestimation in patients at the younger end.

Five-year age brackets did not perform better than 10-year brackets, likely because the width of the bracket only defined the maximum age difference between the patient and the members of the reference sample for that age. For most ages, especially near the middle of the age range, there were so many CN within the age brackets that the 20 CN closest to the patient's age were much closer to the patient's age than the width of the bracket suggested.

Accounting for age using W-scores improved classification accuracy but not to the degree achieved using age-specific reference groups (for details, please see the Supplementary Materials). While W-scores account for age differences, they are based on the assumption of a strictly linear relationship between age and GM volume. There is evidence, however, that the rate of atrophy in some brain regions does not follow a linear trajectory [Bethlehem et al., 2022; Fjell et al., 2013; Hedman et al., 2012; Pfefferbaum et al., 2013; Scahill et al., 2003] and may even accelerate with increasing age, whereas it has also been reported that it levels off around the age of 80 in CN [Schuff et al., 2012]. Thus, the age-specific brackets likely captured age-related changes in GM volume more accurately.


Table 4

Areas under the curve for the comparison of MCI and CN determined using different types of references. ROI = region of interest, AUC = area under the curve, Z = Z-threshold, GM = gray matter, MTL = medial temporal lobe, CN = cognitively normal. The smoothing kernel is reported in mm full width at half-maximum.

Contrary to our hypothesis, the data indicated that an amyloid-negative reference group was only superior to a mixed one when age was not accounted for. The most parsimonious explanation for this finding is that the age effects on GM volume outweigh those of amyloid positivity. Previous work indicated that differences in GM volume between amyloid-negative and amyloid-positive CN are detectable but subtle [Becker et al., 2011; Harrison et al., 2021; Whitwell et al., 2013]. Furthermore, amyloid-positive CN only made up 24% of the CN reference sample, limiting the influence of amyloid positivity. Another possible explanation for the increase in classification accuracy after removing amyloid-positive CN from reference groups that did not account for age is that the amyloid-positive CN in the present sample were older than both their amyloid-negative counterparts and the MCI patients. Their removal thus moved the average age of the reference groups closer to that of the patients. In summary, our data suggest that the amyloid status in a reference sample is negligible if appropriate measures account for age effects.


Achieved level of accuracy

The highest accuracy in differentiating between CN and MCI in the ADNI sample was a perfect AUC of 1 when using age-specific reference groups. The AUCs previously reported for the distinction between CN and MCI ranged between 0.86 and 0.949 [Hirata et al., 2005; Matsuda et al., 2012; Waragai et al., 2014]. However, Z-statistics were computed using the whole control group's means and standard deviations in those studies. In the present study, accuracies based on Z-statistics using the mean and SD of all CN were much lower, with the highest AUC being 0.825. A possible explanation may be that the MCI patients in the present ADNI sample were much more mildly impaired, with an average MMSE of 28.37, while MCI patients in the other studies had average MMSE scores ranging between 26 and 27 [Hirata et al., 2005; Matsuda et al., 2012; Waragai et al., 2014].

It is conceivable that the values we observed were particularly high since the reference groups for the computation of the Z- and W-statistics were derived from the same sample. However, in comparing the results to the literature, it has to be considered that most previous studies have taken a similar approach [Hirata et al., 2005; Matsuda et al., 2012; Waragai et al., 2014]. Furthermore, in our independent validation sample, we also achieved very high levels of classification accuracy, with an AUC of up to 0.985 when using age-specific reference brackets, while the conventional approach using the mean and SD of the whole CN group only reached a maximum AUC of 0.773.


Differences in atrophy patterns

In the ADNI sample, the highest classification accuracy was observed for cortical areas, with the poorest performance observed for the MTL. In the validation sample, in line with the literature [Hirata et al., 2005; Matsuda et al., 2012; Waragai et al., 2014], the greatest AUCs were observed in the temporal lobe, especially the MTL. A possible explanation is that only about half of the MCI patients in the ADNI sample were amyloid-positive, indicating AD pathology. The number of patients exhibiting AD-typical temporal lobe atrophy [Scheltens et al., 1992] in this group would thus be expected to be lower than in the validation sample, where all but one participant had biomarkers indicative of AD pathology.

Another explanation could be that MCI in the validation sample was defined solely based on logical memory. In the ADNI sample, however, MCI was defined not only by impairment in logical memory but also by a score of 0.5 on the Clinical Dementia Rating Scale Sum of Boxes (CDR-SB), which can also be caused by mild impairments in other cognitive domains that are not as strictly associated with the MTL as memory [Balsis et al., 2015; Cedarbaum et al., 2013].

N. Richter et al.

Table 5

Areas under the curve for comparing MCI and CN, determined using different amyloid-negative references. ROI = region of interest, AUC = area under the curve, Z = Z-threshold, GM = gray matter, MTL = medial temporal lobe, CN = cognitively normal. The smoothing kernel is reported in mm full width at half-maximum.


Optimal smoothing kernels and thresholds for classification

Independent of the type of reference groups used, the best classification accuracy was achieved without spatial smoothing or with a small smoothing kernel of 2 mm FWHM, in line with a study distinguishing between CN and AD patients using Z-statistics derived from GM data [Komatsu et al., 2018]. Generally, in group comparisons using VBM data, smaller smoothing kernels are more sensitive [Shen and Sterr, 2013], but larger smoothing kernels perform better in small samples [Mikl et al., 2008; Shen and Sterr, 2013]. However, the present data indicate that the concept of 'larger kernels for smaller samples' does not apply when comparing an individual to a group average. Arguably, this is because the current approach does not rely on the spatial overlap of atrophy between patients, as the presence of only one 'patient' eliminates the averaging of atrophy.

In the literature on voxel-wise analyses of GM atrophy, cut-offs for Z-statistics indicating significant atrophy tend to be around −2 [Matsuda et al., 2012; Matsuda, 2016] or −2.5 [Caspers et al., 2021]. Our data indicate that a higher classification accuracy may be achieved using lower thresholds, with an optimum around Z = −3.5. However, this analysis will fail if the threshold is lowered too far, as no more subthreshold voxels are detected.


Limitations

The main limitation of this study lies in the fact that the CN sample used to generate the reference groups was also used in subsequent classification analyses.This procedure is common in the literature [Hirata et al., 2005;Matsuda et al., 2012] but may have inflated the classification accuracy within the ADNI sample.The approach was chosen despite this limitation to ensure that the definition of t

Limitations
The main limitation of this study lies in the fact that the CN sample used to generate the reference groups was also used in subsequent classification analyses. This procedure is common in the literature [Hirata et al., 2005; Matsuda et al., 2012] but may have inflated the classification accuracy within the ADNI sample. The approach was chosen despite this limitation to ensure that the definition of the amyloid status was consistent across participants. This is also why only ADNI participants with an amyloid PET within a year of the MR scan were included, even though this limited the sample size.


Even though we were able to reproduce our findings from the ADNI data in a separate sample using a different 3 T MR scanner, it should be noted that this sample was relatively small, consisting of only 19 participants in each group.

To validate our approach, we tested it in an independent validation sample and observed very high but slightly lower classification accuracy levels than in the ADNI sample. Furthermore, the optimum number of subthreshold voxels, i.e., the extent of atrophy that best differentiated between CN and MCI, was much lower in the ADNI sample than in the validation sample. A possible explanation may be that the MCI patients in the validation sample were more severely affected (average MMSE of 25.79) than the MCI patients in the ADNI sample (average MMSE of 28.37). It is also conceivable that the number of subthreshold voxels that best discriminated between the groups was so low in the ADNI sample because that sample was used to generate a reference group, as highlighted above. Consequently, extent thresholds would presently have to be specifically defined for different samples or centers. However, further research may identify factors that contribute to inter-site variability in thresholds. By comparing measurements from a number of different sites, it may also be possible to identify thresholds that reduce accuracy to a tolerable level but result in the greatest consistency across sites and samples. The use of larger databases that do not incorporate amyloid status may in turn allow the generation of larger, more representative reference groups.
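The sample dependence of the extent threshold can be made concrete with a small sketch. It assumes that each participant is summarized by a count of subthreshold voxels and that a participant is labelled MCI when this count reaches a cut-off. Balanced accuracy is used here as the selection criterion purely for illustration; it is an assumption and not necessarily the criterion used in the analyses above. The function name `best_extent_cutoff` and all counts are hypothetical.

```python
import numpy as np

def best_extent_cutoff(counts_cn, counts_mci):
    """Return (cutoff, balanced_accuracy) for the rule 'MCI if count >= cutoff'.

    counts_cn, counts_mci: numbers of subthreshold voxels per participant
    in the CN and MCI groups, respectively.
    """
    counts_cn = np.asarray(counts_cn)
    counts_mci = np.asarray(counts_mci)

    # Every observed count is a candidate cut-off.
    candidates = np.unique(np.concatenate([counts_cn, counts_mci]))

    best_cutoff, best_acc = None, -1.0
    for c in candidates:
        sensitivity = np.mean(counts_mci >= c)  # MCI correctly flagged
        specificity = np.mean(counts_cn < c)    # CN correctly not flagged
        acc = 0.5 * (sensitivity + specificity)
        if acc > best_acc:                      # ties keep the lowest cut-off
            best_cutoff, best_acc = int(c), float(acc)
    return best_cutoff, best_acc

# Hypothetical counts for two sites with different atrophy severity.
site_a = best_extent_cutoff([0, 2, 5, 10], [8, 50, 120, 300])
site_b = best_extent_cutoff([0, 20, 40, 90], [150, 400, 900, 2000])
print(site_a, site_b)
```

With these made-up counts the two sites yield different optimal cut-offs, which is the kind of inter-site variability that would currently require site-specific extent thresholds.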
Group differences were computed as Wilcoxon tests. Bold font and gray background indicate a significant group difference at p < 0.0001. CN = cognitively normal, MCI = mild cognitive impairment, ROI = region of interest, Z = Z-threshold, SD = standard deviation, GM = gray matter, MTL = medial temporal lobe. The smoothing kernel is reported in mm of full width at half-maximum.


Conclusions

We expanded on the established method of using VBM-based Z-statistics to quantify GM atrophy systematically. We observed that the accuracy with which MCI-like atrophy can be distinguished from age-related atrophy could be substantially increased using age-specific reference groups. In contrast, limiting reference groups to amyloid-negative CN did not improve diagnostic accuracy.



CRediT authorship contribution statement

Nils Richter: Conceptualization, Methodology, Software, Formal analysis, Writing - original draft. Stefanie Brand: Data curation, Formal analysis. Nils

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Table 1
Demographic and neuropsychological characteristics of the ADNI sample.
Except for age, data were not normally distributed. Group differences were computed with t-tests or Wilcoxon tests as appropriate. The chi-square test was used to assess differences between gender distributions. Bold font indicates statistical significance at p < 0.05 (two-sided). SD = standard deviation; MMSE = Mini-Mental State Examination; LM-DR = Logical Memory Delayed Recall; TMT-A = Trail Making Test Part A (seconds); TMT-B = Trail Making Test Part B (seconds). 1 = missing data for one participant. 2 = missing data for two participants.

Table 2
Demographic and neuropsychological characteristics of the validation sample.
Except for age, data were not normally distributed. Group differences were computed with t-tests or Wilcoxon tests as appropriate. The chi-square test was used to assess differences between gender distributions. Bold font indicates statistical significance at p < 0.05 (two-sided). CN = cognitively normal; MCI = mild cognitive impairment; SD = standard deviation; MMSE = Mini-Mental State Examination; LM-DR = Logical Memory Delayed Recall; TMT-A = Trail Making Test Part A (seconds); TMT-B = Trail Making Test Part B (seconds).

Table 3
Number of subthreshold voxels for CN and MCI determined using age-specific reference groups and the standard approach.

Table 6
Number of subthreshold voxels for CN and MCI determined using an age-specific and standard reference approach in the validation sample.