Analyzing the effect of APOE on Alzheimer's disease progression using an event-based model for stratified populations

Alzheimer's disease (AD) is the most common form of dementia and is phenotypically heterogeneous. APOE is a triallelic gene which correlates with phenotypic heterogeneity in AD. In this work, we determined the effect of APOE alleles on the disease progression timeline of AD using a discriminative event-based model (DEBM). Since DEBM is a data-driven model, stratification into smaller disease subgroups would lead to more inaccurate models as compared to fitting the model on the entire dataset. Hence our secondary aim is to propose and evaluate novel approaches in which we split the different steps of DEBM into group-aspecific and group-specific parts, where the entire dataset is used to train the group-aspecific parts and only the data from a specific group is used to train the group-specific parts of the DEBM. We performed simulation experiments to benchmark the accuracy of the proposed approaches and to select the optimal approach. Subsequently, the chosen approach was applied to the baseline data of 417 cognitively normal, 235 mild cognitively impaired who convert to AD within 3 years, and 342 AD patients from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset to gain new insights into the effect of APOE carriership on the disease progression timeline of AD. The presented models could aid understanding of the disease, and in selecting homogeneous group of presymptomatic subjects at-risk of developing symptoms for clinical trials.


Introduction
Dementia affects roughly 5% of the world's elderly population of whom 60 − 70% are affected by Alzheimer's Disease (AD), which is the most common form of dementia (World Health Organization, 2017). There are several neurobiological subtypes of AD (Ferreira et al., 2020) and each subtype potentially needs a different strategy to prevent or slow the progression of AD. Understanding the pathophysiological processes in AD is thus crucial for selecting novel preventive or therapeutic targets for clinical trials of disease modifying treatments, identifying target groups for such trials and tracking the disease progression in patients.
While several studies have looked into the pathophysiology of AD (Jack Jr. et al., 2013;Bloom, 2014;Weigand et al., 2019), it is still not completely understood. Although it has been observed that AD is phenotypically heterogeneous (Murray et al., 2011;Au et al., 2015;Patterson, 2018) with potentially different pathways for disease progression, these pathways remain unclear. There is hence a need to understand the phenotypic heterogeneity in AD while leveraging neuroimaging, fluid and cognitive biomarkers.
APOE is a triallelic gene in which the ε2 allele reduces the risk of AD (van der Lee et al., 2018), the ε3 allele acts as a reference allele and the ε4 allele is a major genetic risk factor of AD (Saunders et al., 1993;Kim et al., 2009;Genin et al., 2011). APOE has been shown to correlate with phenotypic heterogeneity in AD (Weintraub et al., 2019). Hence we hypothesize that the pathophysiology of AD can be better understood when considering the effect of APOE carriership on biomarker changes.
In the context of data-driven methods for understanding AD pathophysiology, disease progression models have been used to study the trajectories of individual biomarkers (Jedynak et al., 2012;Schiratti et al., 2015;Lorenzi et al., 2017) as well as their progression with respect to each other (Fonteijn et al., 2012;Venkatraghavan et al., 2017;Young et al., 2014;Huang and Alexander, 2012). Unlike typical machine learning approaches, these models are interpretable by design and provide insight for understanding the mechanisms of and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_ to_apply/ADNI_Acknowledgement_List.pdf disease progression. Event-based models (EBMs) are a class of such interpretable disease progression models that estimate the timeline of neuropathologic change during AD progression using crosssectional data (Fonteijn et al., 2012;Venkatraghavan et al., 2019a).
Our primary aim is to use the discriminative event-based model (DEBM), which was shown to be more accurate than previously proposed EBMs (Venkatraghavan et al., 2019a), to understand the effect of different APOE alleles on the disease timeline of AD. To shed light on different aspects of neurodegeneration and identify the earliest brain regions affected, we included commonly studied cerebrospinal fluid (CSF) biomarkers, cognitive scores, and volumetric biomarkers from neuroimaging including those of hippocampus subfields. Hippocampus subfields were included in our model because while Hippocampus neuronal dysfunction is quite commonly studied to be affected in AD, it is not a homogenous structure, and subfields could be affected differentially due to APOE carriership (Mueller and Weiner, 2009).
The default approach for estimating the disease progression timeline would be to stratify the population based on their APOE ε2 − 4 carrier status and independently train the DEBM model on the stratified populations (Young et al., 2014). However, since DEBM is a data-driven model, stratification into smaller groups would lead to less accurate models than those obtained by the original method on the entire dataset. Hence our secondary aim is to propose and evaluate a novel approach in which we split the different steps of DEBM into group-aspecific and group-specific parts, where the entire dataset is used to train the group-aspecific parts and only the data from a specific group is used to train the group-specific parts of the DEBM. We present two different variations of this approach and we hypothesize that the optimal split of the DEBM steps into the group-aspecific and groupspecific parts would result in better accuracy of the estimated disease progression timeline. Since the ground-truth timelines are unknown in a clinical setting, we evaluate the accuracy of the proposed variations using simulation experiments and we select the optimal method for the analysis on the effect of APOE on the AD progression timeline on patient data.
To summarize, our contributions in this paper include proposing and evaluating a novel approach for using DEBM in stratified populations and estimat-ing a comprehensive timeline of AD progression, in terms of biomarker changes, in the presence of different APOE alleles.

Methods
An introduction to the DEBM model (Venkatraghavan et al., 2019a) is provided in Section 2.1. In Section 2.2 we propose our novel approach for using DEBM in stratified populations with its two variations.

Discriminative event-based modeling
In a cross-sectional dataset (X) of M subjects, including cognitively normal individuals (CN), subjects with mild cognitive impairment (MCI) and patients with AD, let X j denote a measurement of biomarkers for subject j ∈ [1, M ], consisting of scalar biomarker values x j,i for i ∈ [1, N ]. x ·,i denotes the i-th biomarker for any unspecified j. DEBM estimates the posterior probabilities of individual biomarkers being abnormal. These posterior probabilities are used to estimate the ordering of biomarker changes for each subject independently. The central ordering and disease progression timeline for the entire dataset are estimated based on these subject-specific orderings. The resulting disease progression timeline is used for assessing the severity of disease in an individual based on his/her biomarker values. Figure 1 shows the different steps involved in DEBM.
Step 1 -Mixture Modeling: As AD is characterized by a cascade of neuropathological changes that occurs over several years, presymptomatic CN subjects can have some abnormal biomarker values. On the other hand, in atypical AD subjects, a proportion of biomarkers may still have normal values, especially in patients at an early disease stage. Hence clinical labels cannot directly be propagated to individual biomarkers to label normal and abnormal biomarker values. We shall refer to this as biomarker label noise in the rest of the paper. In order to estimate the posterior probabilities of individual biomarkers being abnormal, DEBM, similar to previously proposed EBMs Fonteijn et al. (2012); Huang and Alexander (2012); Young et al. (2014), fits a Gaussian mixture model (GMM) to construct the normal / pre-event probability density function (PDF), p(x ·,i |¬E i ), and abnormal / post-event PDF, p(x ·,i |E i ). Event E i in this notation is used to denote the corresponding biomarker becoming abnormal and ¬E i denotes the corresponnding biomarer being normal. The aforementioned PDFs can be expressed as: Where, N (µ, σ) is the normal distribution with mean µ and standard deviation σ.
For estimating these parameters robustly in the presence of biomarker label noise, the normal and abnormal PDF estimates are first initialized using the mean and standard deviations after truncating the overlapping tails of the observed distributions in CN and AD subjects. This can be observed in Figure 2, where the initialization is performed only based on the non-overlapping parts of green and red curves, while the overlapping part is left out to account for biomarker label noise. At this stage of GMM initialization, MCI subjects are left out as well, because it is unsure a priori whether their biomarkers are normal or abnormal. The resulting initialized PDFs are denoted as p(x ·,i |¬E i )) and p(x ·,i |E i ). This is followed by an alternating GMM maximum likelihood optimization scheme until both the Gaussian parameters as well as the mixing parameters converge. All the subjects, including MCI, are used for GMM optimization. After convergence, these Gaussians are used to represent the PDFs p(x ·,i |¬E i ) and p(x ·,i |E i ). The mixing parameters (θ i ) are used as prior probabilities to convert these PDFs to posterior probabilities p(¬E i |x ·,i ) and p(E i |x ·,i ). Figure 2 shows an overview of this optimization scheme.
Step 2 -Subject-specific Orderings: p(E i |x j,i )∀i are used to estimate the subjectspecific orderings s j . s j is established such that: Step 3 -Central Ordering: DEBM computes the central event ordering S from the subjectspecific estimates s j . To describe the distribution of s j , a generalized Mallows model is used (Fligner and Verducci, 1988). The central ordering is defined as the ordering that minimizes the sum of distances to all subject-specific orderings s j , with Input for the DEBM model is a cross-sectional dataset X with M subjects and various biomarkers (A,B,C and D) representing different aspects of neuro-degeneration. Using Gaussian mixture modeling (GMM), mixing parameters (θ i ) and probability density functions of normal (p(x ·,i |¬E i )) and abnormal (p(x ·,i |E i )) levels are estimated for each biomarker. This is followed by the estimation of subject-specific orderings (s j ), for each subject in the dataset. Disease progression timeline consisting of central ordering (S) and event-centers (λ) are estimated based on these subject-specific orderings. Based on the constructed disease progression timeline, patient stages (Υ j ) of subjects in an independent test-set can be estimated. probabilistic Kendall's Tau being the distance measure (Venkatraghavan et al., 2019a). While S denotes the sequence of biomarker events, the relative position of these events (event-centers) in a normalized scale of [0, 1] is denoted by the vector λ. The pair {S, λ} together forms a disease progression timeline.
Step 4 -Patient Staging: Once the disease progression timeline is created, subjects in an independent test set (T ) can be placed on this timeline to estimate disease severity. This is achieved by converting the biomarker values of the test subjects to posterior probabilities p(E i |x j,i ), ∀j ∈ T . These can be used to estimate disease severities in test subjects by first estimating the conditional distribution p(i|S, X j ), which estimates the probability that the first i events of S have occurred for a testsubject and the rest are yet to occur.
The patient stage of a test subject (Υ j ) is defined as the expectation of λ(i) with respect to the conditional distribution p(i|S, X j ).

Group-specific and group-aspecific parts of DEBM
We propose extensions of DEBM for stratified populations, i.e., when the dataset X can be subdivided in groups g ∈ [1, G], based on, e.g., genotype or phenotype of the subjects. Since DEBM is a data-driven model, data stratification into smaller groups would lead to more inaccurate models (Venkatraghavan et al., 2019a). To obtain better DEBM accuracies in such scenario, we propose to co-train DEBM for estimating disease timelines ∀g by splitting DEBM into group-aspecific and groupspecific parts. The group-aspecific parts of DEBM are estimated using the entire dataset and groupspecific parts are estimated for each group independently.
We first discuss the default way of independently training DEBM in the different groups and then propose two different approaches for splitting DEBM into group-aspecific and group-specific parts.
Approach 1: Independent DEBM In this default approach, each group is considered as an independent dataset and the disease progression timeline in each group is estimated independently. GMM in such a scenario is illustrated in Figure 3a.
group-specific (6) In this approach, we assume that the different groups share the normal and abnormal PDFs, but the ordering in which these biomarkers become abnormal are different. The mixing parameters (θ i,g ) are considered as group-specific part of the DEBM algorithm because the proportion of subjects with normal and abnormal biomarker values in each group g is correlated with the position of the biomarker along the ordering S g , which we expect to be different in each group.
Hence, in our approach, we modify the alternating GMM optimization scheme to jointly optimize the GMM parameters of multiple groups. First, the GMM algorithm is initialized without considering the groups, as explained in Section 2.1. Secondly, as with the default DEBM, Gaussian parameters and mixing parameters are alternately optimized. In contrast in coupled DEBM, the Gaussian parameters are estimated jointly for all groups, while mixing parameters are estimated separately for each group. This has been illustrated in Figure 3b.
Once the GMM optimization has been performed, S g and λ g are estimated in each group. Patient staging (Υ j ) of the test-subjects in group g are computed based on the disease progression timeline {S g , λ g }.
Approach 3: group-specific (7) In this approach, we assume that the different groups do not share the normal and abnormal PDFs, but that they are close to each other. Hence, in co-init DEBM, we relax the constraint on p(x ·,i |¬E i ) and p(x ·,i |E i ) and instead consider the initialized values of normal and abnormal PDFs ( p(x ·,i |¬E i ) and p(x ·,i |E i )) to be groupaspecific part of DEBM. We estimate p g (x ·,i |¬E i ) and p g (x ·,i |E i ) independently for each group. This is illustrated in Figure 3c.
As with the previous approach, S g , λ g and the patient staging of the test-subjects in group g are computed independently for each group.

Experiments
Section 3.1 describes the experiments to evaluate the proposed DEBM approaches on a stratified population. Since ground-truth orderings are unknown in real clinical data, we use simulated datasets for evaluating the methods. After evaluating the proposed approaches, we select the best approach for analyzing the effect of APOE on AD progression using subjects from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. Section 3.2 descibes the details of these experiments.

Simulation Experiments
We used the framework developed by Young et al. (2015) for simulating cross-sectional data consisting of scalar biomarker values for CN, MCI and AD subjects in two groups. In this framework, disease progression in a subject is modeled by a series of biomarker changes representing the temporal cascade of biomarker abnormality as estimated by an EBM. Individual biomarker trajectories are represented by sigmoids varying from the biomarker's normal value to its abnormal value. To account for inter-subject variability, the normal and abnormal values for different subjects are drawn randomly from Gaussian distributions.
The simulation dataset used in our experiments are based on a set of seven biomarkers as described in the simulation experiments of Venkatraghavan et al. (2019a). The simulated datasets were stratified into two groups, with each group having its own distinct disease progression patterns. There are two ways in which the progression of disease in the groups can differ: 1. difference in groundtruth orderings S 1 and S 2 ; 2. difference in the abnormal biomarker PDFs in the two groups i.e. p 1 (x ·,i |E i ) and p 2 (x ·,i |E i ). Each of these differences could affect the accuracy of the proposed approaches. Hence, we evaluated the proposed approaches in the presence of each of these differences. Normalized Kendall's Tau distance between the estimated ordering (S) and the ground-truth ordering (S gt ) was used as an evaluation measure in these experiments: where K(A, B) is the number of swaps required to obtain ordering B from ordering A. The normalization ensures that ε S falls in the range [0, 1], with 0 as the distance when the two orderings are the same, and 1 as the distance when the two orderings are the reverse of each other.
Experiment 1: The first simulation experiment studied the effect of the difference in ordering between the two groups. The ordering in the first group (Group 1) was fixed and the ordering in the second group (Group 2) was selected randomly such that the normalized Kendall's Tau distance between the two groups was a fixed number, say ε O . ε O was varied from 0 to 1 in steps of 0.2. The number of subjects in Group 2 was kept constant at 900. The number of subjects in Group 1 was varied from 100 to 900 in steps of 200, to study how the different approaches perform in small as well as large groups. The normal and abnormal biomarkers levels in the two groups were sampled from the same Gaussian distribution for this experiment. We generated 50 random repetitions of the simulated datasets, and reported mean and standard deviation of ε S for independent DEBM, coupled DEBM, and co-init DEBM in groups 1 and 2.
Experiment 2: This experiment studied the performance of the proposed approaches with the µ g,i,E parameter of the p g (x ·,i |E i ) distribution being different in the two groups. µ 1,i,E was fixed, and µ 2,i,E was varied such that the difference where the abnormal Gaussians are the same in the two groups. µ g,i,¬E were kept the same in the two groups. Hence, when ε G = −0.2d, the abnormal biomarker levels are closer to the normal biomarker levels in Group 2 than in Group 1. This results in Group 2 biomarkers being weaker than their Group 1 counterparts when ε G = −0.2d and stronger when ε G = +0.2d. The number of subjects in Group 2 was kept a constant at 900, while the subjects in Group 1 increased from 100 to 900. ε O between the two groups was fixed at 0.4. We again generated 50 random repetitions of the simulated datasets, and reported mean and standard deviation of ε S for coupled DEBM, co-init DEBM and DEBM.
These experiments were used to evaluate the different approaches mentioned in Section 2 and select the best method for analyzing the effect of APOE alleles in AD progression.

Studying the effect of APOE
We considered the baseline measurements from 417 CN, 235 MCI converters and 342 AD subjects in ADNI1, ADNIGO and ADNI2 studies 2 . The MCI converters are subjects who had MCI at baseline but converted to AD within 3 years of baseline measurement. We excluded subjects with significant memory concerns (without a diagnosis of AD or

MCI) and MCI non-converters in our experiments
to select a more phenotypically homogeneous group of subjects with prevalent or incident AD. In each of the experiments, the dataset was divided into three groups (protective, neutral and risk) based on the subject's APOE carriership. The protective group consisted of subjects with APOE ε2, 2 and ε2, 3 (van der Lee et al., 2018). The neutral group consisted of subjects with APOE ε3, 3. The risk group consisted of subjects with APOE ε3, 4 and ε4, 4 (Saunders et al., 1993;Kim et al., 2009;Genin et al., 2011). Subjects with APOE ε2, 4 (n=34) were not included in either group because of the presence of both protective and risk alleles. Subject demographics and their APOE carrierships are summarized in Table 1. The modalities considered were structural imaging (T1-weighted MRI) biomarkers, biomarkers extracted from cerebrospinal fluid (CSF), and cognitive biomarkers. Details of the MRI acquisition protocols of ADNI can be found in Jack Jr. et al. (2008. Imaging biomarkers were estimated from T1weighted MRI scans analysed with FreeSurfer software v6.0 cross-sectional stream and outputs were visually checked. We assumed a symmetric pattern of atrophy in AD and averaged imaging biomarkers between the left and right hemisphere. Experiment 3: For this experiment, the selected imaging biomarkers were: volumetric measures of the hippocampus, entorhinal cortex, fusiform gyrus, middle-temporal gyrus and precuneus, together with whole brain volume and ventricles (Archetti et al., 2019;Frisoni et al., 2010;Vemuri and Jack, 2010  biomarkers were: CSF concentrations of Amyloid-(ABETA), total Tau (TAU) and phosphorylated Tau (PTAU) proteins (Blennow and Hampel, 2003;Blennow et al., 2010), Neurogranin (Thorsell et al., 2010) and Neurofilament light chain (Jin et al., 2019;de Wolf et al., 2020). Mini mental state examination (MMSE) and Alzheimer's Disease Assessment Scale -Cognitive (13 items) (ADAS13) were used as cognitive biomarkers. The availability of these multimodal biomarkers in the ADNI database is summarized in Table 2. For the 12 selected biomarkers, we estimated the disease timelines in the three aforementioned groups using the method selected after simulation experiments. We studied the positional variance of the estimated orderings by creating 100 bootstrapped samples of the data.
Experiment 4: The above experiment was repeated by including volumes of subfields of hippocampus (Iglesias et al., 2015) as biomarkers instead of considering the entire hippocampus volume as a single biomarker. We included the tail of the hippocampus (Tail), Subiculum, CA1, CA2/3, hilar region of the dentate gyrus (CA4), hippocampal fissure (Fissure), area 27 in Brodmann (1909) (Presubiculum), area 49 in Brodmann (1909) (Parasubiculum), Molecular layer of the subiculum and CA fields (ML), granule cell and molecular layer of the dentate gyrus (GC ML DG), and the hippocampus-amygdala transition area (HATA). We excluded the white matter structures of Alveus and Fimbria in our analysis.
Experiment 5: In this experiment, we validated the disease stage (Υ j ) by computing its correlation with the subjects' MMSE and ADAS13 values. We used a 10-fold cross validation, where the training set was used to estimate the disease timeline in the aforementioned groups and the test subjects' dis-ease stage was evaluated by placing them on this disease timeline. We used the volume-based and CSF-based biomarkers from Experiment 3, but excluded MMSE and ADAS13 scores from the model. It can be seen that both coupled-training methods (i.e., co-init DEBM and coupled DEBM) outperform the default method of independently training DEBM models. It can also be observed that in both co-init DEBM and coupled DEBM the ordering errors decrease as ε O increases and that co-init DEBM outperforms coupled DEBM for lower values of ε O , whereas the performance is on par with coupled DEBM for higher values of ε O . Experiment 2: Figures 5 (a) and (b) show ε S in Group 1 and Figures 5 (c), (d) and (e) show the same in Group 2, when varying ε G . Even with ε G = 0, coupled training (i.e., co-init DEBM and coupled DEBM) outperformed the default method of independently training DEBM models. Co-init DEBM showed negligible change in the errors when ε G = 0. The performance of coupled DEBM in Group 1 worsened for ε G = +0.2d (Figure 5 (a)) and in Group 2 for ε G = −0.2d (Figure 5 (d)).

Studying the effect of APOE
The results in Experiments 1 and 2 show that the performance of co-init DEBM is more accurate and robust than coupled DEBM in most scenarios. We hence analyzed Experiments 3 − 5 using co-init DEBM.
Experiment 3: Figure 6 shows orderings of CSF, global cognition and volumetric biomarkers in the APOE based protective, neutral and risk groups along with their uncertainty estimates. It can be seen that the uncertainty of the ordering in the protective group was high. Despite this uncertainty, some biomarkers (i.e. MMSE, NG and PTAU) seem to occur earlier than the other biomarkers in this group.
In the group of subjects with neutral alleles, ABETA was very prominently the earliest biomarker, followed by cognitive scores of MMSE and ADAS13. Among the CSF biomarkers, PTAU followed immediately after ABETA, which was inturn followed by TAU. NFL and NG were late biomarkers. Among the structural biomarkers, volumes of fusiform and middle-temporal gyri were the first to become abnormal, followed by ventricular volume and wholebrain volume. Hippocampus, precuneus and entorhinal volumes were late biomarkers in this group.
In the risk group, the CSF biomarkers followed a pattern that was similar to that of the neutral alleles. The cognitive biomarkers were early biomarkers in this group as well. However the ordering in structural biomarkers was very different from that in the neutral group. Hippocampus and entorhinal volumes were early biomarkers in this group, followed by middle-temporal and fusiform gyri volumes. Wholebrain, ventricular and precuneus volumes were late biomarkers.
Experiment 4: Figure 6 shows orderings of CSF, global cognition and volumetric biomarkers along with that of Hippocampus subfields in the APOE based protective, neutral and risk groups along with their uncertainty estimates. Parasubiculum was an early biomarker in the protective group, which became abnormal even before MMSE and NG. Fissure, Tail and Presubiculum followed immediately after that. Subiculum was one of the late biomarkers.
Fissure and Parasubiculum were early structural biomarkers for the neutral group as well. This was followed by dysfunction in CA3 and HATA.
A majority of the Hippocampal subfields were early biomarkers in the risk group, with ML, Subiculum and Tail as the three earliest regions.
Experiment 5: The variation of MMSE and ADAS13 scores with respect to the estimated disease stages has been plotted in Figure 8, for all three groups. The patient stages showed a significant correlation with both MMSE and ADAS13 scores. The correlation coefficients were also comparable in the three groups.

Discussion
DEBM models have been shown to be effective in determining the temporal cascade of biomarker abnormality as AD progresses, from cross-sectional data. In this work, we introduced a novel concept of splitting the different steps of DEBM into group-specific and group-aspecific parts for coupled training in stratified population. We considered two novel variations to split the steps of DEBM in this manner and through thorough experimentation in simulation datasets we observed that coinit DEBM helps in obtaining more accurate orderings in a stratified population. Using this method, we estimated the biomarker cascades in AD progression with protective, neutral and risk alleles of APOE, based on cross-sectional ADNI data. While the findings in neutral and risk groups fit the amyloid cascade hypothesis of Alzheimer's disease with high-confidence (which posits that ABETA accumulation triggers neurodegeneration), the finding in the protective group shows evidence for an alternative pathway (with relatively low confidence). In this section, we discuss the insights provided by the simulation experiments (Section 5.1) used for method selection as well as the insights into the AD progression pathways provided by our experiments on the ADNI dataset (Section 5.2).

Choice of the method
Coupled DEBM and co-init DEBM both split DEBM into group-specific and group-aspecific steps for coupled training of an EBM in stratified populations. Experiment 1 and 2 showed that coupled training of the group-aspecific parts of DEBM and independently training the group-specific parts of DEBM results in more accurate orderings in the groups better than the default approach of independently training a DEBM model in each group.
While splitting DEBM into group-specific and group-aspecific parts, we started with the assumption that the latent true normal and abnormal biomarker distributions in the groups are either same or similar. The difference between coinit DEBM and coupled DEBM is that, co-init DEBM accounts for slight differences in the underlying biomarker distributions between the groups whereas coupled DEBM does not.
The simulation dataset generated in Experiment 1 had the same true normal and abnormal biomarker distributions in the different groups, from which the simulated subjects were randomly   sampled, aligning well with the assumption of coupled DEBM. However, this did not result in overall better accuracies for coupled DEBM than that of co-init DEBM. Co-init DEBM was also more robust than coupled DEBM as its accuracy was less dependent on ε O , the distance between the groundtruth orderings in the two groups. Another observation in Experiment 1, which was rather counter-intuitive, was that the errors made by the co-init and coupled DEBM models decreased as the distance between the ground-truth orderings in the two groups increased. When the orderings are further apart, the combined biomarker distributions in CN and AD groups have a larger overlap. The non-overlapping initialization (before the GMM optimization) thus results in the normal and abnormal distributions to be further apart. We hypothesize that this results in a better estimation of the mixing parameters during GMM optimization and in-turn resulted in more accurate orderings, as mixing-parameters are dependent on the biomarker's position in the ordering.
In Experiment 2, we checked the performance of our approaches when the assumption (true normal and abnormal biomarker distributions being same across groups) is violated in the dataset. This experiment showed that the orderings obtained using co-init DEBM are more robust to differences between the abnormal Gaussians across groups than those obtained with coupled DEBM. With coupled DEBM, the error increased in the group with weaker biomarkers i.e., Group 1 in the case of ε G = +0.2d and Group 2 in the case of ε G = −0.2d. This shows that coupled DEBM introduces a systematic bias in the estimation of ordering that is detrimental to the group with weaker biomarkers. Co-init DEBM also showed a similar bias, but to a much lesser extent.
We hence selected co-init DEBM as the preferred approach for splitting and performed our analysis on ADNI dataset using this approach. We expect that this idea of splitting DEBM into group-specific and group-aspecific parts can be easily extended to the EBM introduced by Fonteijn et al. (2012).

Cascade of biomarker changes in the APOE based groups
Dividing the total population into groups based on APOE carriership enabled us to create more phenotypically homogeneous groups (Weintraub et al., 2019), each with potentially specific disease progression timeline. In this section, we discuss our results in these APOE carriership based groups.
Our finding suggests that among the CSF biomarkers in the APOE risk and neutral groups, ABETA abnormality is the earliest biomarker event followed by PTAU. This fits the amyloid cascade hypothesis of AD, which posits that ABETA accumulation triggers neurodegeneration (Hardy, 2017). It also confirms the need for preventing the accumulation of ABETA in high-risk patients. NFL and NG are late biomarkers in the neutral and risk groups, which suggests that axonal (Ashton et al., 2019) and synaptic (Thorsell et al., 2010) degeneration do not occur until very late in the disease process in these groups. NG being abnormal after PTAU and TAU in the neutral and risk groups is also consistent with the previous findings that Tau mediates synaptic damage in AD (Jadhav et al., 2015).
In the protective group, we found that the abnormal NG and PTAU are the earliest CSF events, even before ABETA becomes abnormal. This could hint at the existence of an alternative pathway for the formation of tau tangles in the brain before ABETA accumulation, as suggested in Weigand et al. (2019), but needs more extensive validation.
Among the volumetric biomarkers, Entorhinal cortex is one of the early biomarkers in the risk group which is supported by the findings in Huijbers et al. (2014), but is one of the last biomarkers to become abnormal in the neutral allele. Ventricular volume is a late biomarker in the risk group but it becomes abnormal quite early in the neutral group as also observed by Nestor et al. (2008). Hippocampus volume is the earliest biomarker in the risk group, but is a relatively late biomarker in the neutral and protective groups. This suggests that incidence of hippocampal sparing AD (Ferreira et al., 2017) could correlate with APOE carriership.
Volumetric biomarkers of large regions are generally insensitive to small changes in volume of subparts. Thus to find the earliest regions that are affected in AD progression, we analyzed the ordering of biomarker events while also including hippocampal subfields. Our results suggests that in the risk group, molecular layer of the subiculum and CA fields (ML) followed by subiculum and the tail of the hippocampus are the earliest volumetric biomarker events. Tail of the hippocampus is not a histologically distinct region but consists of CA1-4 and dentate gyrus (Iglesias et al., 2015).
Each of these regions as well as subiculum were observed to be associated to AD in Parker et al. (2019). Hippocampal fissure was one of the last regions to become abnormal. In contrast, in the neutral group, subiculum and CA1 were the last regions to become abnormal. Parasubiculum was one of the earliest region to become abnormal in this group. Parasubiculum as an early biomarker was also observed in (Zhao et al., 2019) where they reported a significant difference in its volume between CN and MCI. In the protective group, Parasubiculum, Fissure, Tail and Presubiculum were among the early volumetric biomarkers. Our findings thus highlight that Hippocampus subfields are differentially affected due to APOE carriership.
The findings related to these orderings of biomarker events were validated by correlating the patient stages derived from these orderings with MMSE and ADAS13 scores. Patient stages of subjects in all three groups, when used as test-subjects in a cross-validated manner, showed a significant correlation (p < 0.001) with these scores. These correlations validate our findings and suggest that these genotype-specific disease progression timelines could be used for patient monitoring.

Conclusion and Future work
We conclude that co-init DEBM provides the best accuracy and robustness when estimating orderings in stratified populations. Future work on co-init DEBM can focus on extending the approach for high-dimensional imaging biomarkers (Venkatraghavan et al., 2019b). This work also provides groundwork for extending the method towards hypothesis-free, data-driven stratification of phenotypes.
We gained new insights into the disease progression timeline of AD in the APOE based risk, neutral and protective groups of APOE. While we observed that the estimated disease progression timelines in the risk and neutral groups fit the amyloid cascade hypothesis with high confidence, the estimated time lines in the protective group may suggest an alternative pathway for the formation of tau tangles in the brain before amyloid β accumulation, albeit with relatively low condence. We expect that these genotype-specific disease progression timelines will benefit patient monitoring in the future, and may help optimize selection of eligible subjects for clinical trials.

Acknowledgement
This work is part of the EuroPOND initiative, which is funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 666992. E.E. Bron acknowledges support from the Dutch Heart Foundation (PPP Allowance, 2018B011), Medical Delta Diagnostics 3.0: Dementia and Stroke, and the Netherlands CardioVascular Research Initiative (Heart-Brain Connection: CVON2012-06, CVON2018-28).
Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimers Association; Alzheimer's Drug Discovery