Fully bayesian longitudinal unsupervised learning for the assessment and visualization of AD heterogeneity and progression

Tau pathology and brain atrophy are the closest correlate of cognitive decline in Alzheimer’s disease (AD). Understanding heterogeneity and longitudinal progression of atrophy during the disease course will play a key role in understanding AD pathogenesis. We propose a framework for longitudinal clustering that simultaneously: 1) incorporates whole brain data, 2) leverages unequal visits per individual, 3) compares clusters with a control group, 4) allows for study confounding effects, 5) provides cluster visualization, 6) measures clustering uncertainty. We used amyloid-β positive AD and negative healthy subjects, three longitudinal structural magnetic resonance imaging scans (cortical thickness and subcortical volume) over two years. We found three distinct longitudinal AD brain atrophy patterns: one typical diffuse pattern (n=34, 47.2%), and two atypical patterns: minimal atrophy (n=23 31.9%) and hippocampal sparing (n=9, 12.5%). We also identified outliers (n=3, 4.2%) and observations with uncertain classification (n=3, 4.2%). The clusters differed not only in regional distributions of atrophy at baseline, but also longitudinal atrophy progression, age at AD onset, and cognitive decline. A framework for the longitudinal assessment of variability in cohorts with several neuroimaging measures was successfully developed. We believe this framework may aid in disentangling distinct subtypes of AD from disease staging.


Introduction
Imaging biomarkers of brain morphology have been increasingly used in research and clinical routine during the last decades (Dickerson & Sperling, 2005). More specifically, dementia research has utilized such markers for the investigation of disease-related patterns from populations around the world and many cohorts with complete neuroimaging data are now available to the research community (Frisoni et al., 2013;Whitwell, 2018). Structural neuroimaging markers are also used for selection of participants for clinical trials in Alzheimer's disease (AD) (Menéndez-González, de Celis Alonso, Salas-Pacheco, & Arias-Carrión, 2015). The availability of longitudinal data provides us with the opportunity to assess changes over time in healthy and pathological individuals. A new challenge for the imaging research community is the incorporation of longitudinal information in their study designs (Caruana, Roman, Hernández-Sánchez, & Solli, 2015). Other challenges include the assessment and fixation (ceteris paribus) of different study effects, the meaningful visualization of group differences and finally the simultaneous optimization of all these procedures for the sake of reproducibility in the presence of pragmatic sample sizes.
Unsupervised classification (clustering) is widely applied to neuroimaging data when the aim is to unveil heterogeneous features within samples (Whitwell, 2018). When samples include only one diagnosis, a common use of clustering methods is to investigate whether the neuroimaging measures of interest show heterogeneous patterns within that same diagnostic label. Several studies have investigated the heterogeneity in AD with the aim to define disease specific subtypes (Byun et al., 2015(Byun et al., , 2015Corlier et al., 2018;Park et al., 2017;Poulakis et al., 2018;Schwarz et al., 2018;Whitwell et al., 2012;Young et al., 2017). When samples include more than one diagnosis, the main aim of unsupervised clustering methods is to investigate whether neuroimaging markers can be used to distinguish between the diagnostic classes without specifying them with a label. The clustering methods that are used today are mostly cross-sectional, in the sense that they utilize baseline observations for a set of individuals. In the AD research field, many studies have focused on the unbiased identification of cortical and subcortical patterns of atrophy with structural MRI (sMRI). One recent study utilizes longitudinal atrophy markers to find sets of brain regions with common progression patterns (Marinescu et al., 2019). However, to date no cluster-based study has included longitudinal atrophy data in their method scheme, in order to identify groups of individuals with similar atrophy trajectories and our current study intends to meet this necessity.
In studies where the aim is to investigate neuroimaging measures in association to some clinical outcome, we often wish to account for or exclude the effect of confounders that can potentially introduce bias and may drive the results of our analysis. More specifically, in connection with cluster analysis, two approaches are widely used in the literature. The first approach, is called the residual (de-trending) method (Falahati et al., 2016;Voevodskaya et al., 2014). In this approach a "correction" is applied to a neuroimaging measurement with respect to a confounder that should not affect the results of the main analysis. The adjusted measurement will not be correlated with the confounder anymore. After that, we apply the clustering algorithm on the de-trended data (Corlier et al., 2018;Hwang et al., 2016;Noh et al., 2014;Varol et al., 2017;Zhang et al., 2016). When using the detrending approach, the statistical tests that we need increase dramatically in numbers (one correction for each vertex/voxel/region of interest). Moreover, the cluster parameters are not optimized conditional to the original data but given the artificial data (de-trended data). All these features can make the interpretation of results more difficult and introduce errors in reproducibility, since the results are based on a chain of statistical procedures that are not connected in statistical terms. According to the second approach (for confounders in a clustering study), it is suggested to incorporate the effect that we want to account for in the analysis (Dong, Honnorat, Gaonkar, & Davatzikos, 2016;Young et al., 2017). This can be achieved with the addition of a fixed effect in the case of a statistical clustering model.
Another important feature of a neuroimaging clustering study is the comparison of differences in brain morphology between clusters of individuals. This comparison commonly involves, i) groups of the same pathology with different atrophy patterns or ii) a pathological group and a cognitively unimpaired group with similar demographical characteristics, or other combinations of comparisons between groups. This step is either incorporated in the clustering procedure, or it is performed as an independent post-clustering step. When this step is not included in the clustering procedure but added as a separate step, we need to correct the resulting images for multiple statistical comparisons since multiple models are implemented for that purpose. This issue can be avoided in the case of a simultaneous clustering and visualization.
Previous clustering studies grouped AD patients based on sMRI features from a single time-point (Dong et al., 2016(Dong et al., , 2017Ferreira et al., 2017;Noh et al., 2014;Park et al., 2017;Poulakis et al., 2018;Varol et al., 2017). Their conclusions were based on a single observation in time and the chance that those clusters reflect different stages of the disease and not particular patterns of atrophy (distinct AD subtypes) cannot be excluded. A longitudinal clustering design can reduce the risk that the results will reflect different disease stages. Even if the clusters reflect different disease stages, we can infer them with higher certainty than in a cross-sectional study. Moreover, the follow up MR acquisitions can be irregularly distributed between subjects may drop-out or miss certain visits. We model this feature in order to obtain accurate estimates of atrophy progression.
In this study, we aimed to design and assess a framework for longitudinal clustering that incorporates: 1) simultaneous clustering of several longitudinal neuroimaging measures (multivariate data over time), 2) information for individuals with irregularly sampled observations, 3) comparison of the clusters with a control group, 4) the study and fixation (optional) of effects that should not drive the resulting clusters, 5) visualization of the resulting clusters for interpretation, 6) measures of uncertainty in the clustering. Our overall goal is to perform all the aforementioned methodological steps in one statistical model in order to avoid the statistical pitfalls of a "pipeline" study that limits the ability to correctly identify disease mechanisms because of weak statistical inferences. The designed framework is applied to longitudinal sMRI data of mainly amyloid-β (Aβ) positive AD patients and Aβ negative cognitively unimpaired (CU) subjects over a period of two years (three sMRI time points). To validate the results from our new longitudinal clustering framework, we included all data with longitudinal information from our previous cross-sectional clustering study (Poulakis et al., 2018). This allows us to compare the results from cross-sectional and longitudinal clustering in the same dataset. To be able to estimate cluster-specific atrophy trajectories in time is a very important aspect that has been overlooked by previous cross-sectional AD subtypes studies.

Participants
We used data obtained from the Alzheimer's disease neuroimaging initiative (ADNI), a large project launched in October 2004 in North America from Michael W. Weiner, MD. The initial goal of the ADNI 1 cohort that will be used for the analysis, was to gather neuroimaging data that would help to better detect and track AD in its early stages. More specifically, positron emission tomography, MRI and other data from individuals diagnosed with AD, mild cognitive impairment (MCI) and elderly CU were collected between 2004 and 2010 from different sites of USA and Canada. The inclusion criteria for AD patients were the following: 1) to fulfil the NINCDS/ADRDA probable AD criteria, 2) a Clinical dementia rating scale (CDR) global score between 0.5 and 1, and 3) an MMSE total score between 20 and 26. The exclusion criteria for AD included: the use of psychotropic medication that could affect memory, history of significant head trauma, evidence of significant focal lesions at the screening MRI, and the existence of a significant neurological disease other than AD. For the healthy cognitively unimpaired (CU) subjects, inclusion criteria were an MMSE total score between 24 and 30 and a CDR global score equal to 0. Exclusion criteria for CU subjects comprised presence of depression, MCI or dementia. For more information on the ADNI study, see http://adni.loni.usc.edu/about/.
We included all subjects with longitudinal sMRI data and available CSF data (101 AD and 113CU) from our previously published cross-sectional study on AD subtypes (Poulakis et al., 2018). This was done to be able to compare the cross-sectional and the longitudinal clustering approach in a proper way in the same set of participants. In total 75 subjects were excluded due to bad longitudinal image quality and processing results (see below). At baseline, 94% of the AD subjects were Aβ1-42 positive, while only 31 CUs were included, since we wanted them all to be negative for Aβ1-42 and Ptau. The cut-offs for Aβ1-42 and Ptau used, are discussed by (Shaw et al., 2009). Moreover, the CU sample was further limited by additional inclusion criteria: 1) remain as CU subjects across all the available follow-ups and not only the ones that are used in this study (0-36 months of continuous follow up for the 31 CU subjects), 2) have longitudinal MRI for all the time points of the analysis.
Altogether, 104 individuals were included in the final analysis, 72 AD patients (72 subjects had baseline and 12 months MRI scans, and 57 subjects had a 24 months MRI scan) and 31 CU (baseline, 12-and 24 months MRI scans).

MRI acquisition and preprocessing
The MRI dataset consists of high-resolution sagittal 3D 1.5T T1-weighted Magnetization Prepared RApid Gradient Echo (MPRAGE) volumes (voxel size 1.1×1.1×1.2 mm 3 ). Full brain and skull coverage were required and detailed QC was applied to all the images (Simmons et al., 2011).
Images underwent pre-processing with the longitudinal stream of the FreeSurfer pipeline (version 6.0), where a subject specific template is used (Reuter, Schmansky, Rosas, & Fischl, 2012 was also extracted for the needs of the statistical modelling of the volumetric data (Buckner et al., 2004). This segmentation approach has previously been used for multivariate classification of Alzheimer's disease and healthy controls, neuropsychological-image analysis and biomarker discovery (Ferreira et al., 2014;Maioli et al., 2015;Westman et al., 2010). All data was processed through theHiveDB system (Muehlboeck, Westman, & Simmons, 2014). The FreeSurfer output underwent manual visual QC to find errors in parcellations/segmentations to ensure optimal estimation of thickness and volumes. After QC, 28 AD and 48 CU subjects were excluded because of low output quality, image quality, or because less than two continuous time points existed per subject after the QC. Finally, one AD patient was excluded due to failed parcellation of regions that are included in the analysis.

Data standardization
The cortical thickness and subcortical volume ROI data of AD patients were standardized based on the sample of cognitively unimpaired subjects, including mean centering and unit variance scaling.
The two main benefits of mean centering and unit variance scaling of the patients' data are: 1) after the transformation, each ROI value will represent how many standard deviations below the average CU an AD subject's value is; 2) since we have volume and thickness data, after transformation, all the variables will have the same unit and fair statistical comparisons will be possible (without affecting the kurtosis or skewness of the distributions). This transformation has previously been applied for cross-sectional assessment of AD subtypes (Park et al., 2017;Westman, Muehlboeck, & Simmons, 2012;Zhang et al., 2016). In this study, we adapted this procedure to longitudinal data in order to account for the atrophy that is caused by the normal ageing process in the CU group over time , = , −̂, / ̂, ,where is the original measurement of subject , in the time point for the region , while ̂ and ̂ are the mean and standard deviation of the CU group at time and region . After this calculation, each value will resemble an atrophy level corrected for normal aging levels and also normal decline over time, which was not done previously, and is crucial for biological and clinical interpretation of brain atrophy.

Statistical longitudinal clustering
We set out to test an analytical framework that enables us to investigate longitudinal patterns in sMRI feature analysis. For this reason, we considered a multivariate mixture model that allows us to incorporate many brain regions in the model. Moreover, in our effort to establish a general framework that will be able to facilitate both continuous and discrete data trajectories in the clustering, while accounting for the longitudinal design of the study, we decided to choose a generalized linear instead of general linear (mixed effects) approach. In addition, such an approach allows us to incorporate fixed and random effects that can serve in different ways in sMRI and other modalities. The algorithm clusters the random intercepts and slopes of each individual's outcomes of interest (ROI measures in this study) with repeated measurements instead of repeated measurements data of each individual subject. Hence, a pair of subjects with similar estimated trajectories of atrophy (similar starting value/intercept and slope over time) will be grouped together, while subjects with different trajectories will be assigned to different groups. A Gaussian distribution is used to model the ROI data (general linear model), but data of ordinal or nominal nature can be analysed by changing the link function. Since the random effects in the model account for the repeated measurements, the fixed effects remain to be defined. As mentioned in the introduction, accounting for external effects that might drive the resulting clusters within the model is convenient in this kind of analysis. Therefore, fixed effects are estimated for each of the external variables that we want to assess during the clustering analysis. Such an approach allows fitting the resulting cluster profiles (atrophy maps) for different combinations of fixed effects to investigate their regional contribution. Finally longitudinal trajectories of atrophy to study whether they vary within the AD dementia spectrum.
The clustering algorithm estimates different outcomes. One outcome is the different cluster components. Each estimated multivariate Gaussian component resembles a pattern of atrophy that is observed in the dataset. Each individual subject is assigned a probability to belong to any of the components (soft clustering) rather than being assigned to a single component. The assignment of subjects into clusters is based on the maximum posterior probability rule (an individual is assigned to the component with the highest individual component probability). This is a much more realistic approach in comparison to hard clustering approaches used in most previous data-driven studies (Byun et al., 2015;Gamberger, Ženko, Mitelpunkt, & Lavrač, 2016;Na et al., 2016;Noh et al., 2014;Park et al., 2017;Varol et al., 2017), since heterogeneity in AD is modelled here as a continuum and allows for mixed patterns instead of single patterns. Hence, the data-driven algorithm provides explicit information on whether a subject has a distinct atrophy pattern or a mixture of patterns through the estimation of subject component probabilities. The proposed framework clusters subjects of a cohort into groups (provides probability of subjects to belong in any of the clusters) and not patterns of atrophy into groups for a cohort (clusters of regions/vertices) as in the study of Marinsecu and colleagues (Marinescu et al., 2019).
A schematic representation of the proposed analytical framework is portrayed in Figure 1. The time from the first visit (baseline) was defined as a random effect for the sake of comparability with the previous literature on AD subtypes were only one observation for each subject is included (Noh et al., 2014;Poulakis et al., 2018). Therefore, the intercept of the model will correspond to the atrophy levels on the first visit and the slope will show how these atrophy levels change over the months   Supplementary table 1 for more information). In our hybrid model evaluation approach, all three quality criteria were considered as important in the assessment process (scaled to the same interval, 0-1) (Brooks & Roberts, 1998).

Clustering evaluation.
The reported results are based on 750000 iterations with 500 iterations thinning where 250000 iterations were burn-in period, which therefore saved 1000 MCMC samples. The distributions of the estimated parameters started converging after the burn-in samples and it remained stable afterwards for the rest of the simulations. As expected, the general tendency of the deviance for the different models decreased with the increasing amount of clusters (Supplementary table 1). The different initializations brought various outputs from which the one with the packages' default settings was the worst in terms of deviance. The model with initialization in the means of the clusters from our previous study (Poulakis et al., 2018) and the addition of uniform noise for 8 clusters was optimal in terms of quality.   1, 2 and 3). B) Subjects are coloured based on HPD intervals classification. In comparison to A, in B we added the uncertain classification with orange colour (Two subject from cluster 2 and 1 subject from cluster 7 cannot be classified to any cluster with high certainty). C) Colours are the same as in B, but we excluded from the plot the HPD uncertain classification subjects: orange and the outlier clusters 7: black and 8: yellow. D) The subjects are coloured exactly as in C but the MDS components 1, 2 and 5 are plotted, to showcase the separation between Cluster 4, 5 6. The names in the parenthesis after the cluster numbers refer to the figure 3 and table 2.

Cluster characterization
Three main patterns of atrophy were found in the dataset: i) typical AD pattern (clusters diffuse 1, 2 and 3) (Figure 3 B), ii) a minimal atrophy pattern ( Figure 3A) and iii) a hippocampal sparing pattern (hippocampal sparing early and late onset) ( Figure 3C). The Minimal atrophy cluster is characterized by initial atrophy in the entorhinal cortex (right) and longitudinal decrease in thickness first in the right and then in the left inferior temporal gyrus during the 24 months of follow up ( Figure 3A). The atrophy patterns in the three Diffuse clusters (reported as typical AD), more closely follow the NFTs pattern suggested by (Braak & Braak, 1991).However, differences do exist and may be attributed to age (even after correcting for this effect). Further, the atrophy in the Diffuse 3 cluster is more advanced ( Figure 3B) and these subjects have lower cognitive performance (Table 2). This may be the reason for why the Diffuse 3 cluster only had MRI for baseline and 12 months follow-up. Within the hippocampal sparing AD subtype two clusters are observed. The degree of atrophy as well as the age at onset of dementia differentiate these two clusters ( Figure 3C, Table 2). The early onset hippocampal sparing cluster has greater level of atrophy at baseline and accumulates atrophy faster over time, in contrast to the late onset hippocampal sparing cluster. In both clusters the precuneus and the inferior parietal gyri ( Figure 3C) are atrophied.
For a more comprehensive understanding of the atrophy distributions in the cortex of the different clusters, we can also utilize the 1 st and 3 rd quartile images that present the dispersion around the mean cortical atrophy of each cluster (Supplementary Figure 1).
The six clusters did not differ in terms of sex distribution but they differed in the years of formal education, with the average education being around 16 years ( Table 2). The lowest and highest median years of education are observed in the Diffuse 2 cluster (12 years) and the Hippocampal sparing early onset (18 years). The two clusters with hippocampal sparing patterns of atrophy, differ in several aspect such as the age of onset of dementia. The Minimal atrophy cluster has the slowest decline over time in MMSE and CDR, while the Hippocampal sparing early onset cluster has the steepest decline (Table 2, Supplementary figure 4). The Hippocampal sparing early onset cluster also has a steep decline in constructional praxis, and the greatest deficits in ideational praxis at baseline, but not steeper than the Diffuse 3 group. Although the Minimal atrophy cluster has the best scores in all the ADAS subscales at baseline, the Hippocampal sparing late onset group has a better score in the word recognition task at baseline, but declines very fast during the next two years (2.7 points/year). Alzheimer's disease assessment scale, m0 = first visit, m24 = visit after 24 months. Annual changes in the cognitive assessment scales were estimated with linear regression (follow up data as predictor, two parameters and variance estimation). Standard errors of the estimated parameters are included in brackets. No statistical tests between groups are performed due of small sample sizes in some of the groups. CSF values are in pg/ml.

Comparison to previous results
The atrophy patterns of the different clusters have similarities with our previously reported crosssectional results on AD subtypes (Poulakis et al. 2018), while the differences yield from the longitudinal information that is now added in the algorithm.  Correspondence between the assignments of subjects in the cross sectional clustering (Poulakis et al., 2018) and the current longitudinal study (clustering according to the highest posterior density intervals). The cross-sectional study clusters are in the rows and the longitudinal study clusters are in columns.
The subjects in the cross-sectional study (Poulakis et al., 2018) that were assigned to the Diffuse 1 subtype are now distributed in more than one cluster with the highest prevalence in the Diffuse 1 and 2 clusters (Table 3). Three subjects from the cross-sectional Diffuse 2 cluster are now in the Diffuse 3 (2 subjects) and cluster 8 (1 subject, outlier cluster). All the seven subjects from the cross sectional hippocampal sparing subtype are still in the Hippocampal sparing clusters. Three subjects, assigned to the limbic predominant atrophy pattern in the cross-sectional study are now in cluster 7 (outlier cluster), Diffuse 1 and the HPD group. The subjects in the minimal atrophy group are still mainly in Minimal atrophy in the present study (17 subjects out 20) while two subjects are assigned to the Diffuse 1 cluster and one subject to the Hippocampal sparing late onset cluster. Out of four CSF Aβ1-42 negative AD subjects that are included in the current study, one subject is assigned to the longitudinal diffuse 2 cluster (was in the cross-sectional diffuse 1 cluster), one in the longitudinal outlier cluster 7 (was in the cross-sectional limbic predominant cluster) and two are assigned to the longitudinal minimal atrophy cluster (both subjects were in the cross-sectional minimal atrophy cluster) (Table 3).

Discussion
The optimization of the longitudinal clustering model provided us with interesting findings that support its future use in imaging research for studying heterogeneity in healthy and pathological ageing. Clustering with several longitudinal measures that were irregularly sampled was successfully achieved. We incorporated information from a cognitively unimpaired sample to calculate agecorrected levels of atrophy, while avoiding the need to correct for multiple comparisons. This allowed the direct visualization of atrophy trajectories. Estimated subject-component probabilities made it possible to assess whether subjects are clustered with high certainty or not. All these features provided us with useful insights that substantially helped in the interpretation of the clusters. Moreover, the assessment of some study effects within the model, can potentially assist to investigate which brain regions are statistically associated with them. The framework identified and characterized three overall groups of AD subjects with distinct atrophy patterns with different trajectories over time and cognitive profiles.

Longitudinal clustering initialization and performance
The use of the current dataset helped the evaluation of our framework, because the patterns of atrophy at baseline are known from our previous results (Poulakis et al., 2018). Thus, the longitudinal information incorporated in our framework helped us study if AD atrophy patterns at baseline change when information about the course of the disease is added. The optimization process was longer and more intensive for larger numbers of clusters, since every additional component increased the number of new parameters to be estimated. Initially, the packages' default values for the parameters were used to see the extent of adaptation of the model to the data without any help of locally optimal solutions. The results showed that the model tends to produce 1-2 components that represent the actual dataset, while the rest of the components have non-sensible values.
Moreover, the subjects were classified with high certainty in these 1-2 realistic components. This is advantageous because it means that the probabilistic clustering correctly identifies the components that represent the data in the best way. However, the rest of the components remained empty, which is a sign that the algorithm estimates components with zero presence in the dataset if it is not given some hints on where the data actually lie in the parameter space. The model with default initial values was not considered adequate to describe the dataset since too many parameters had no meaning in our application.
The decision to start the algorithm from the cross-sectional clustering results showed that when the algorithm is fed with initial information for the mixed-effect parameters, the components are more meaningful, in the sense that if not all, almost all the components have some subjects in them.
However, some of the cross-sectional solutions may not be optimal since they were not specifically adjusted to the dataset. For example, when we started the cluster intercepts optimization (initial values) from a k-means cross-sectional clustering solution, the resulting model had low quality, because a more sophisticated method is needed to find suitable clusters that can describe the AD dataset. In contrast, when we started the optimization with initial values (for the cluster intercepts) from the cross-sectional AD subtypes results (Poulakis et al., 2018), the model received the best quality scores among the different initializations. The lack of initial values for the slopes of each cluster (we set the initial slopes to zero due to lack of longitudinal cluster information) might be the reason behind the superiority of a solution with the introduction of uniform noise. In this way, we let the algorithm search for an optimal solution that may not fit (in the parametric space) exactly to the previous study's solution but in a parametric region close to it. Thus, we give more flexibility to the optimizer of the model to end up in the same values (as the cross-sectional study), only if these are the optimal ones. In this way, we avoid stumbling on a local optimum.
We also checked that the variance of the posterior distribution of the fixed effects was considerably smaller than the large prior value that we set it to, in accordance to the Supplementary material of the paper where the clustering methodology was presented (Arnošt Komárek & Komárková, 2013).
The cluster-specific parameters (random effects) such as the mean, covariance matrix and proportion of cluster parameters were the most demanding parameters to optimize, especially in the case of 7 and 8 cluster solutions. The visual inspection of the MCMC trace plots for these parameters showed large steps at the first thousand iterations (burn in period and some iterations later) and then a stable distribution (good chain mixing) is produced.
The idea behind calculating a composite measure of model quality was inspired by the fact that all chains converged perfectly for none of the models. However, some autocorrelation was allowed to exist, which often happens in applications of Bayesian statistics (Gelman et al. 2013, chapter 11). We accepted a certain extent of autocorrelation within chains but did not accept any solution with high values (Dobson and Barnett 2018, chapter 13). The number of chains that had some autocorrelation among the random effects of the selected model was only 6% of the overall parameters, which is a reasonable amount (considering that the chains are generally mixing sufficiently well).
This proposed clustering provides us with two additional types of information apart from the cluster assignment: 1) which subjects in a cohort are not well represented by one cluster (i.e. outliers), 2) which subjects are at risk of shifting from one cluster to another (for example from a cognitive normal cluster to a pathological cluster, i.e. HPD uncertain). In this study we also decided that two clusters of the output model should be considered as outliers. The number of observations that are needed in order to treat a cluster as an outlier is not well defined in the literature. However, we decided that 2 subjects are too few to allow an interpretation of the cluster characteristics and/or an extrapolation to the AD population. The estimated components should have a certain presence in the population in order to interpret them, otherwise the weakness of these clusters might introduce noise in the understanding of heterogeneity in the context of this application. For the sake of transparency, the data of the subjects that were excluded from interpretation are reported in the Supplementary Figure 3 and Supplementary Table 3. Overall, the longitudinal clustering model combined with a priori chosen initial values for the cluster specific parameters produced reasonable cluster estimates for meaningful interpretation of our longitudinal neuroimaging data.

AD subtypes and longitudinal clustering of brain atrophy
The results of the model support that information about atrophy trajectories has the potential to advance our current understanding about the heterogeneity within AD. We identified three main patterns of atrophy with different atrophy signature over time: i) a typical AD pattern, ii) an AD pattern where the cortex is mainly involved while the hippocampus is relatively spared and iii) a minimal atrophy pattern were subjects exhibited mild or no atrophy in cortical and subcortical regions. Within typical AD, we found three atrophy patterns. The most typical AD like atrophy pattern is observed in the Diffuse 1 cluster that has all the demographical and cognitive characteristics of AD, such as the age of AD onset (>65 years of age), MMSE (18.5±7.1) and CDR global (1.3±0.8) (Byun et al., 2015;Ferreira et al., 2017;Kim et al., 2005;Whitwell et al., 2012). The Diffuse 2 cluster is not substantially different in median fitted values from the Diffuse 1 cluster.
However, the higher age at onset (7 years older) and the percentage of females (53.5% in comparison to 46.7%) in the Diffuse 2, together with the atrophy distribution dispersion in this cluster provided by the 1 st and 3 rd quartile atrophy maps (supplementary figure 1), is somehow reminiscent of the AD subtype known as limbic predominant AD (Byun et al., 2015;Ferreira et al., 2017;Kate et al., 2018;Noh et al., 2014;Poulakis et al., 2018;Whitwell et al., 2012). We speculate, given the longitudinal data and the previous cross-sectional study results (Poulakis et al., 2018), that the limbic predominant atrophy patterns is part of the AD disease staging rather than a distinct subtype. For some reason this cluster has later onset but patients seem to follow the Braak staging for NFT distribution and spread, hence they will likely develop typical AD at advanced stages.
Regarding the Diffuse 3 cluster, this is the most atrophied group of subjects in this dataset, its cognitive scores are very low and its frequency in the data is very small (4 subjects). Being already reported in previous results of our group (Poulakis et al., 2018), we can now show that this group consists of subjects with already advanced atrophy at the time of the MRI. The model estimates a random intercept for each ROI at the time of the first MRI acquisition for each subject. Therefore, the few subjects of the diffuse 3 cluster were separated from the other two diffuse atrophy clusters, since they had very low intercepts (great amount of atrophy) in the limbic areas and association cortex, as we can see in figure 3.
The Minimal atrophy cluster that includes subjects with minimal atrophy changes over time is a cluster of considerable interest since the low amount of atrophy correlates well to the very slow cognitive decline in this group. The frequency of minimal atrophy in the current study is higher than in previous studies (Byun et al., 2015;Ferreira et al., 2017;Poulakis et al., 2018), most probably due to the longitudinal design that allows subjects with slow cognitive decline to be followed up for a longer period. This interpretation is supported by the finding that the Diffuse 3 cluster, the more severe group, is the only cluster that did not have 24 months follow up (early drop-out). It is proposed that tau-related pathophysiology and abnormal levels of Aβ alone without significant atrophy are enough to produce the dementia symptoms in the minimal atrophy subtype , perhaps through disruption of relevant brain networks in the absence of overt brain atrophy (Ferreira et al 2019), in the context of lower cognitive reserve (Ferreira et al., 2014;Persson et al., 2017).
The hippocampal sparing subtype with accumulation of atrophy mainly in cortical areas is a subtype that has been consistently reported in many studies (Byun et al., 2015;Dong et al., 2017;Na et al., 2016;Poulakis et al., 2018;Whitwell et al., 2012). Interestingly, our current study disentangled the observed hippocampal atrophy pattern in two different clusters with atrophy in the precuneus and the inferior parietal lobe. A unique characteristic of the most atrophied group of the two is the early onset as well as the high number of years of education, which is a proxy of cognitive reserve. This group seems to decline in cognition more rapidly than any other AD group, in agreement with the cognitive reserve hypothesis of faster disease progression in subjects with high reserve once a specific threshold has been reached (Stern, 2009). In contrast, the less atrophied hippocampal sparing group has a late onset in the AD symptoms, which might be the reason of the less aggressive phenotype (Koedam et al., 2010).

Comparison between longitudinal and cross-sectional AD atrophy clusters
The subjects of this study in their majority are grouped in longitudinal clusters similar to our previously published study (Poulakis et al., 2018). However, subjects from the Diffuse 1 subtype of the cross-sectional study are now distributed in more clusters because of two main reasons: 1) The Diffuse 1 cluster from the cross-sectional study is a cluster that gathered the most typical AD patterns. However, the separation from the other clusters was not very clear as discussed in that study. This cluster had the highest heterogeneity within itself and in the multidimensional scaling plot it was located between the other clusters of atrophy with more distinct patterns. 2) Importantly, the longitudinal trajectories, with help of both intercepts and slopes have disentangled the courses of the disease for the subjects that before were clustered based only in one observation in that cluster (Diffuse 1 of the cross-sectional study). In the cross-sectional study (Poulakis et al., 2018) we observed 4 patterns of atrophy and found 5 clusters while in the current longitudinal study we identified 3 main patterns of atrophy in 6 clusters. The existence of two different patterns of atrophy within the hippocampal sparing subtype (with differences in the AD onset) remains to be validated in larger datasets, whilst shows the potential in this method to identify them. Altogether, these findings highlight the importance of longitudinal clustering methods to advance our current ability to unravel disease heterogeneity. Our current findings show that a certain proportion of the heterogeneity may be missed by cross-sectional clustering.
There are also other aspects which differentiate between cross-sectional and longitudinal clustering.
The statistical approach of the longitudinal clustering is based on distributional assumptions (each cluster has multivariate normal distribution), while the cross-sectional clustering was distance based (random forest). Therefore, the longitudinal model could accommodate fixed effects (variables that we want to account for), while the cross-sectional model could not (we de-trended these effects in advance). Another important methodological difference between the two approaches is the visualization of the clusters. The cross-sectional design included one more step after the clustering to compare AD groups with the sample of CU subjects in terms of ROI volumes (p-value maps). This is indeed the standard approach. Instead, the longitudinal model has an internal measure of similarity between AD groups and the CU sample, namely the fitted value maps where p-values are not calculated. We achieved a comparison between healthy aging and AD clusters without overloading our dataset with statistical comparisons. More importantly the level of difference in actual cortical thickness or volume between two clusters of subjects (fitted value) is easier to interpret biologically and clinically than the statistical differences between clusters of subjects (p-values).

Limitations and strengths of the study
Our study has some limitations. The sample size is limited due to two main reasons. Firstly, we wanted to use the results of our previous study as a ground truth for the clustering. Additionally, the exclusion criteria for CU subjects and AD patients were very strict (See material and methods), to ensure that the former group resembles a true sample of the healthy population over time, while the latter group had no missing information that can bias the interpretation of the results. This was intended to be a methodological study, although some biological interpretations are done. Hence, for the methodological part we believe our current sample size is sufficient. Yet, it is our plan to replicate our current findings in a larger sample in the future. Furthermore, the variable used as time component in this study was the time from the first MRI acquisition, which helped the interpretation of the results in relation to the previous cross-sectional study, but it might limit the ability to assess if a cluster of AD subjects reflects a distinct pattern of atrophy or a stage of the disease . Our study has some strengths as well. We demonstrated that incorporating longitudinal information in the clustering of imaging data is possible. We can now apply it to different imaging modalities in order to label longitudinal data and to better understand the mechanisms underlying the aging process. The estimated model makes it possible to do two more things that were not available before: 1) to estimate future levels of atrophy for any individual subject that belongs to the clusters (prognostic value) and 2) to estimate cluster assignment of new subjects that are not included in the model training (diagnostic value).

Conclusion
In conclusion, a framework for the longitudinal assessment of variability in cohorts with several neuroimaging measures was successfully tested and the results show that it can be used to understand complex processes in ageing and neurodegenerative disorders.