Predicting Alzheimer’s disease progression using deep recurrent neural networks

Early identification of individuals at risk of developing Alzheimer's disease (AD) dementia is important for developing disease-modifying therapies. In this study, given multimodal AD markers and clinical diagnosis of an individual from one or more timepoints, we seek to predict the clinical diagnosis, cognition and ventricular volume of the individual for every month (indefinitely) into the future. We proposed and applied a minimal recurrent neural network (minimalRNN) model to data from The Alzheimer's Disease Prediction Of Longitudinal Evolution (TADPOLE) challenge, comprising longitudinal data of 1677 participants (Marinescu et al., 2018) from the Alzheimer's Disease Neuroimaging Initiative (ADNI). We compared the performance of the minimalRNN model and four baseline algorithms up to 6 years into the future. Most previous work on predicting AD progression ignores the issue of missing data, which is prevalent in longitudinal data. Here, we explored three different strategies to handle missing data. Two of the strategies treated the missing data as a "preprocessing" issue, by imputing the missing data using the previous timepoint ("forward filling") or linear interpolation ("linear filling"). The third strategy utilized the minimalRNN model itself to fill in the missing data both during training and testing ("model filling"). Our analyses suggest that the minimalRNN with "model filling" compared favorably with baseline algorithms, including support vector machine/regression, the linear state space (LSS) model, and the long short-term memory (LSTM) model. Importantly, although the training procedure utilized longitudinal data, we found that the trained minimalRNN model exhibited similar performance when using only 1 input timepoint or 4 input timepoints, suggesting that our approach might work well even with just cross-sectional data. An earlier version of our approach was ranked 5th (out of 53 entries) in the TADPOLE challenge in 2019.
The current approach is ranked 2nd out of 63 entries as of June 3rd, 2020.


Introduction
Alzheimer's disease (AD) dementia is a devastating neurodegenerative disease with a long prodromal phase and no available cure. It is widely believed that an effective treatment strategy should target individuals at risk for AD early in the disease process (Scheltens et al., 2016). Consequently, there is significant interest in predicting the longitudinal disease progression of individuals. A major difficulty is that although AD commonly presents as an amnestic syndrome, there is significant heterogeneity across individuals (Murray et al., 2011; Noh et al., 2014; Zhang et al., 2016; Risacher et al., 2017; Young et al., 2018). Since AD dementia is marked by beta-amyloid- and tau-mediated injuries, followed by brain atrophy and cognitive decline (Jack et al., 2013), a multimodal approach might be more effective than a single-modality approach for disentangling this heterogeneity and predicting longitudinal disease progression (Marinescu et al., 2020).
In this study, we proposed a machine learning algorithm to predict multimodal AD markers (e.g., ventricular volume, cognitive scores, etc.) and clinical diagnosis of individual participants for every month up to six years into the future. Most previous work has focused on a "static" variant of the problem, where the goal is to predict a single timepoint (Duchesne et al., 2009; Stonnington et al., 2010; Zhang and Shen, 2012; Moradi et al., 2015; Albert et al., 2018; Ding et al., 2018) or a set of pre-specified timepoints in the future (regularized regression; Wang et al., 2012; Johnson et al., 2012; McArdle et al., 2016; Wang et al., 2016). By contrast, our goal is the longitudinal prediction of clinical diagnosis and multimodal AD markers at a potentially unlimited number of timepoints into the future, as defined by The Alzheimer's Disease Prediction Of Longitudinal Evolution (TADPOLE) challenge (Marinescu et al., 2020), which is arguably a more relevant and complete goal for tasks such as prognosis and cohort selection. One popular approach to this longitudinal prediction problem is mixed-effects regression modeling, where longitudinal trajectories of AD biomarkers are parameterized by linear or sigmoidal curves (Vemuri et al., 2009; Ito et al., 2010; Samtani et al., 2012; Sabuncu et al., 2014; Zhu and Sabuncu, 2018). However, such a modeling approach requires knowing the shapes of the biomarker trajectories a priori. Furthermore, even though the biomarker trajectories might be linear or sigmoidal when averaged across participants (Caroli and Frisoni, 2010; Jack et al., 2010; Sabuncu et al., 2011), individual subjects might deviate significantly from the assumed parametric forms.
Consequently, it might be advantageous not to assume that the biomarker trajectories follow a specific functional form. For example, Xie and colleagues proposed an incremental regression modeling approach to predict the next timepoint based on a fixed number of input timepoints (Xie et al., 2016). The prediction can then be used as input to predict the next timepoint, and so on indefinitely. However, the training procedure requires each participant to have exactly two timepoints, thus "wasting" data from participants with fewer or more than two timepoints. Therefore, state-based models (e.g., discrete or continuous state Markov models) that do not constrain the shapes of the biomarker trajectories or assume a fixed number of timepoints might be more suitable for this longitudinal prediction problem (Sukkar et al., 2012; Goyal et al., 2018). Here, we considered recurrent neural networks (RNNs), which allow an individual's latent state to be represented by a vector of numbers, thus providing a richer encoding of an individual's "disease state" beyond a single integer (as in the case of discrete state hidden Markov models). In the context of medical applications, RNNs have been used to model electronic health records (Lipton et al., 2016a; Choi et al., 2016; Esteban et al., 2016; Pham et al., 2017; Rajkomar et al., 2018; Suo et al., 2018) and AD disease progression (Nguyen et al., 2018; Ghazi et al., 2019).
Most previous work on predicting AD progression ignores the issue of missing data (Stonnington et al., 2010; Sukkar et al., 2012; Lei et al., 2017; Liu et al., 2019). However, missing data is prevalent in real-world applications and arises from study design, delays in data collection, subject attrition or mistakes in data collection. Missing data poses a major difficulty for modeling longitudinal data since most statistical models assume feature-complete data (García-Laencina et al., 2010). Many studies sidestep this issue by removing subjects or timepoints with missing data, thus potentially losing a large quantity of data. There are two main approaches for handling missing data (Schafer and Graham, 2002). First, the "preprocessing" approach handles the missing data in a separate preprocessing step, by imputing the missing data (e.g., using the missing variable's mean or more sophisticated machine learning strategies; Azur et al., 2011; Rehfeld et al., 2011; Stekhoven and Buhlmann, 2011; White et al., 2011; Zhou et al., 2013) and then using the imputed data for subsequent modeling. Second, the "integrative" approach integrates the missing data issue directly into the models or training strategies, e.g., by marginalizing the missing data in Bayesian approaches (Marquand et al., 2014; Wang et al., 2014; Goyal et al., 2018; Aksman et al., 2019).
In this work, we proposed to adapt the minimalRNN model (Chen, 2017) to predict AD progression. The minimalRNN has fewer parameters than other RNN models, such as the long short-term memory (LSTM) model, so it might be less prone to overfitting. Although RNNs are usually trained using feature-complete data, we explored two "preprocessing" approaches and one "integrative" approach to deal with missing data. We used data from the TADPOLE competition, comprising longitudinal data from 1677 participants. An earlier version of this work was published at the International Workshop on Pattern Recognition in Neuroimaging and utilized the more complex LSTM model (Nguyen et al., 2018). Here, we extended our previous work by using a simpler RNN model, expanding our comparisons with baseline approaches and exploring how the number of input timepoints affected prediction performance. We also compared the original LSTM and current minimalRNN models using the live leaderboard on TADPOLE.

Problem setup
The problem setup follows that of the TADPOLE challenge. Given the multimodal AD markers and diagnostic status of a participant from one or more timepoints, we seek to predict the cognition (as measured by ADAS-Cog13; Mohs et al., 1997), ventricular volume (as measured by structural MRI) and clinical diagnosis of the participant for every month indefinitely into the future.

Data
We utilized the data provided by the TADPOLE challenge. The data consisted of 1677 subjects from the ADNI database (Jack et al., 2008). Each participant was scanned at multiple timepoints. The average number of timepoints was 7.3 ± 4.0 (Fig. 1A), while the average number of years from the first timepoint to the last timepoint was 3.6 ± 2.5 (Fig. 1B).
For consistency, we used the same set of 23 variables recommended by the TADPOLE challenge, which included diagnosis, neuropsychological test scores, anatomical features derived from T1 magnetic resonance imaging (MRI), positron emission tomography (PET) measures and CSF markers (Table 1). The diagnostic categories corresponded to normal control (NC), mild cognitive impairment (MCI) and Alzheimer's disease (AD).

Proposed model
We adapted the minimalRNN (Chen, 2017) for predicting disease progression. Here, we utilized the minimalRNN instead of the LSTM because it has fewer parameters and is therefore less likely to overfit (see Appendix A for details). The model architecture and update equations are shown in Fig. 2. Let x_t denote all variables observed at time t, comprising the diagnosis s_t and the remaining continuous variables g_t (Eq. (1) in Fig. 2B). Here, diagnosis was represented using one-hot encoding, i.e., as a vector of length three: if the first entry was one, the participant was a normal control; if the second entry was one, the participant was mildly cognitively impaired; if the third entry was one, the participant had AD dementia. For now, we assume that all variables were observed at all timepoints; the missing data issue will be addressed in Section 2.4. At each timepoint, the transformed input u_t (Eq. (2) in Fig. 2B) and the previous hidden state h_t−1 were used to update the hidden state h_t (Eqs. (3) and (4) in Fig. 2B). The hidden state can be interpreted as integrating all information about the subject up until that timepoint. The hidden state h_t was then used to predict the observations at the next timepoint, x_t+1 (Eqs. (5) and (6) in Fig. 2B).
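Concretely, the update equations (Eqs. (2)-(4) in Fig. 2B) can be sketched in NumPy as follows. This is a minimal illustration of the minimalRNN cell (Chen, 2017); the weight names (`W_x`, `U_h`, `U_u`, `b`) and the toy dimensions are our own choices, not taken from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def minimal_rnn_step(x_t, h_prev, W_x, U_h, U_u, b):
    """One minimalRNN update: embed the input, then gate the hidden state."""
    u_t = np.tanh(W_x @ x_t)                      # cf. Eq. (2): input embedding
    z_t = sigmoid(U_h @ h_prev + U_u @ u_t + b)   # cf. Eq. (3): update gate
    h_t = z_t * h_prev + (1.0 - z_t) * u_t        # cf. Eq. (4): convex combination
    return h_t

# toy dimensions: 23 input features (as in Table 1), hidden size 4
rng = np.random.default_rng(0)
d_in, d_h = 23, 4
W_x = rng.standard_normal((d_h, d_in)) * 0.1
U_h = rng.standard_normal((d_h, d_h)) * 0.1
U_u = rng.standard_normal((d_h, d_h)) * 0.1
b = np.zeros(d_h)

h = np.zeros(d_h)                                 # h_0 = 0, as in the paper
for _ in range(3):                                # integrate three timepoints
    h = minimal_rnn_step(rng.standard_normal(d_in), h, W_x, U_h, U_u, b)
print(h.shape)  # (4,)
```

Because the new hidden state is a convex combination of the previous state and a tanh-embedded input, it stays bounded, which is one reason this cell is stable to train.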
In the ADNI database, data were collected at a minimum interval of 6 months. However, in practice, data might be collected at an unscheduled time (e.g., month 8 instead of month 6). Consequently, the duration between timepoints t and t + 1 in the RNN was set to be 1 month.
However, experiments with different durations were also performed, with little impact on the results (see Section 2.7.2).

Training with no missing data-The RNN training is illustrated in Fig. 3. The RNN was trained to predict the next observation (x_t) given the previous observations (x_1, x_2, …, x_t−1). The errors between the predicted outputs (e.g., x̂_2) and the ground-truth outputs (e.g., x_2) were used to update the model parameters. The error (or loss L) was defined as follows:

It is important to note that the loss function was only evaluated using available observations; missing data were not considered when computing the loss. Furthermore, we note that the two terms in the loss function (Eq. (7)) were weighted equally. Changing the relative weights of the two terms could potentially influence model performance, but doing so would increase the number of hyperparameters, so we did not experiment with varying the weighting in this study. The value of h_0 was set to 0. During training, gradients of the loss L with respect to the model parameters were back-propagated to update the RNN parameters. The RNN was trained using Adam (Kingma and Ba, 2015). Fig. 4 illustrates how the RNN was used to predict AD progression in an example subject (from the validation or test set). Given observations for months 1, 2 and 3, the goal of the model was to predict observations in future months. From month 4 onwards, the model predictions (x̂_4 and x̂_5) were fed in as inputs to the RNN (for months 5 and 6 respectively) to make further predictions (dashed lines in Fig. 4).
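Eq. (7) itself is not reproduced in this excerpt. Based on the surrounding description (a diagnosis term and a continuous-variable term, weighted equally and evaluated only at observed entries), a plausible form, shown here as an illustrative sketch rather than the paper's exact equation, is:

```latex
L \;=\; -\frac{1}{|\mathcal{T}_s|}\sum_{t \in \mathcal{T}_s}\sum_{c=1}^{3} s_{t,c}\,\log \hat{s}_{t,c}
\;+\; \frac{1}{|\mathcal{T}_g|}\sum_{t \in \mathcal{T}_g}\frac{1}{|\mathcal{O}_t|}\sum_{i \in \mathcal{O}_t}\bigl| g_{t,i} - \hat{g}_{t,i} \bigr|
```

where T_s and T_g index timepoints with an observed diagnosis and observed continuous variables respectively, O_t is the set of continuous variables observed at time t, and hats denote model predictions.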

Missing data
As seen in Table 1, there were a lot of missing data in ADNI. This was exacerbated by the fact that data were collected at a minimum interval of 6 months, while the sampling period in the RNN was set to be 1 month (to handle off-schedule data collection). During training, the loss function was evaluated only at timepoints with available observations. Similarly, when evaluating model performance (Section 2.6), only available observations were utilized.
The missing data also posed a problem for the RNN update equations (Fig. 2B), which assumed that all variables were observed. Here, we explored two "preprocessing" strategies (Sections 2.4.1 & 2.4.2) and one "integrative" strategy (Section 2.4.3) to handle the missing values. As explained in the introduction, "preprocessing" strategies impute the missing data in a separate preprocessing step; the imputed data are then used for subsequent modeling. "Integrative" strategies, on the other hand, incorporate the missing data issue directly into the model or training strategies.

2.4.1. Forward filling-Forward filling involved imputing the data using the last timepoint with available data (Che et al., 2018; Lipton et al., 2016b). Fig. 5A illustrates how forward filling in time was used to fill in missing input data. In this example, there were two input variables, A and B. The values of feature A at times t = 2, 3 and 4 were filled using the last observed value of feature A (at time t = 1). Similarly, the values of feature A at t = 7 and 8 were filled using its value at t = 6, when it was last observed. If data was missing at the first timepoint, the mean value across all timepoints of all training subjects was used for the imputation.
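Forward filling can be sketched as follows; the function name and the training-set mean value are illustrative, and the example sequence mirrors feature A in Fig. 5A (observed at t = 1, 5 and 6).

```python
import numpy as np

def forward_fill(x, train_mean):
    """Impute NaNs with the last observed value; fall back to the
    training-set mean when the feature is missing at the start."""
    x = np.array(x, dtype=float)
    last = train_mean  # training-set mean, used until a value is observed
    for i in range(len(x)):
        if np.isnan(x[i]):
            x[i] = last
        else:
            last = x[i]
    return x

nan = float('nan')
a = [2.0, nan, nan, nan, 4.0, 5.0, nan, nan]
print(forward_fill(a, train_mean=3.0))
# -> [2. 2. 2. 2. 4. 5. 5. 5.]
```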

Linear filling-
The previous strategy utilized information only from previous timepoints for imputation. One could imagine that it might be helpful to use both previous and future timepoints for imputation. The linear filling strategy performed linear interpolation between the previous timepoint and the next timepoint with available data (Junninen et al., 2004). Fig. 5B shows an example of linear interpolation. Values of feature A at times t = 2, 3, 4 and 6 were filled in using linear interpolation. However, linear filling did not work for months 8, 9 and 10 because there was no future observed data for linear interpolation, so forward filling was utilized for those timepoints. As with forward filling, if data was missing at the first timepoint, the mean value across all timepoints of all training subjects was used for the imputation.
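Linear filling, with the forward-fill fallback for trailing gaps and the training-set mean for a missing first timepoint, can be sketched as follows (function and parameter names are illustrative):

```python
import numpy as np

def linear_fill(x, train_mean):
    """Linearly interpolate between observed values; forward fill the
    trailing gap; use the training-set mean if the first timepoint is missing."""
    x = np.array(x, dtype=float)
    obs = np.flatnonzero(~np.isnan(x))   # indices of observed timepoints
    if len(obs) == 0:
        return np.full_like(x, train_mean)
    if obs[0] != 0:                      # missing first timepoint -> mean
        x[0] = train_mean
        obs = np.r_[0, obs]
    t = np.arange(len(x))
    # np.interp forward-fills beyond the last observed point automatically
    return np.interp(t, obs, x[obs])

# feature A observed at t = 1, 5 and 7 (0-indexed: 0, 4, 6), as in Fig. 5B
a = [1.0, np.nan, np.nan, np.nan, 5.0, np.nan, 7.0, np.nan, np.nan, np.nan]
print(linear_fill(a, train_mean=3.0))
# -> [1. 2. 3. 4. 5. 6. 7. 7. 7. 7.]
```

Note how the last three entries are forward-filled with 7.0 because there is no later observation to interpolate towards.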

Model filling-
We also considered a novel "model filling" strategy for filling in missing data. As seen in Section 2.3.2 (Fig. 4), the predictions of the RNN could be used as inputs for the next timepoint. The same approach can be used for filling in missing data. Fig. 5C shows an example of how the RNN was used to fill in missing data. At times t = 2 to 6, the values of feature A were filled in using predictions from the RNN. The RNN could also be used to extrapolate features that "terminate early" (e.g., at times t = 8 and 9).
A theoretical benefit of model filling was that the full set of features was utilized for the imputation. For example, both features A and B at time t = 1 were used by the RNN to predict both input features at time t = 2 (Fig. 5C). This was in contrast to forward or linear filling, which utilized only feature A (or B) to impute feature A (or B).
Like forward filling, if data was missing at the first timepoint, the mean value across all timepoints of all training subjects was used for the imputation.
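The model-filling rollout can be sketched as follows. Here a stand-in `predict_next` function replaces the actual RNN (which additionally maintains a hidden state); the toy model, which increments every feature by 1 per step, is purely illustrative.

```python
import numpy as np

def model_fill_rollout(x_seq, predict_next, x0_mean):
    """Roll an autoregressive model through a sequence, replacing missing
    entries (NaN) with the model's own prediction for that timepoint.

    predict_next maps the (imputed) observation at time t to a predicted
    observation at time t+1; it is a placeholder for the RNN here.
    """
    filled = np.array(x_seq, dtype=float)
    # mean-impute missing entries at the first timepoint, as in the paper
    first = filled[0]
    first[np.isnan(first)] = x0_mean[np.isnan(first)]
    for t in range(1, len(filled)):
        pred = predict_next(filled[t - 1])
        miss = np.isnan(filled[t])
        filled[t, miss] = pred[miss]   # keep observed values, fill the rest
    return filled

# toy "model": predicts that every feature grows by 1 per timestep
predict_next = lambda x: x + 1.0
x = [[0.0, 10.0], [np.nan, 11.0], [np.nan, np.nan], [3.5, np.nan]]
print(model_fill_rollout(x, predict_next, x0_mean=np.zeros(2)))
# -> [[0. 10.], [1. 11.], [2. 12.], [3.5 13.]]
```

Observed values always override the model's prediction, so the rollout is only used where data is actually missing; the same loop also extrapolates features that terminate early.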

Constant prediction-
The constant prediction algorithm simply predicted all future values to be the same as the last observed values. The algorithm did not need any training. While this might seem overly simplistic, we will see that the constant prediction algorithm is quite competitive for near-term prediction.

SVM/SVR-As explained in the introduction, most previous studies have focused on a "static" variant of the problem, where the goal is to predict a single timepoint or a set of pre-specified timepoints in the future. Here, we considered such a baseline by using SVM to predict clinical diagnosis (which was categorical) and SVR to predict ADAS-Cog13 and ventricular volume (which were continuous). The models were implemented using scikit-learn (Pedregosa et al., 2011). We note that separate models were trained for each target variable (clinical diagnosis, ADAS-Cog13 and ventricular volume).
Because SVM/SVR accepts fixed-length feature vectors, it cannot handle subjects with varying numbers of input timepoints. Therefore, we trained different SVM/SVR models using 1 to 4 input timepoints (spaced 6 months apart) to predict the future. The 6-month interval was chosen because the ADNI data was collected roughly every 6 months. As can be seen in Section 3.1, the best results were obtained with 2 or 3 input timepoints, so we did not explore more than 4 input timepoints. The features were concatenated across the input timepoints. For example, since there were 23 features at each timepoint, the input features for the "2 input timepoints" SVM/SVR models constituted a vector of length 46, while those for the "3 input timepoints" SVM/SVR models constituted a vector of length 69.
For each SVM/SVR baseline, we trained separate SVM/SVR models to predict 10 sets of timepoints (spaced 6 months apart) into the future, i.e., 6, 12, 18, …, 60 months into the future. Sixty months was the maximum because there was insufficient data to train SVM/SVR models to predict further into the future (Fig. 1B). To summarize, separate SVM/SVR models were trained for each target variable (clinical diagnosis, ADAS-Cog13 and ventricular volume), each number of input timepoints (1, 2, 3 or 4) and each prediction horizon (6, 12, 18, …, 60 months), yielding a total of 3 × 4 × 10 = 120 SVM/SVR models.
To maximize the number of data samples for training, we used all available timepoints in the training subjects to train the SVM/SVR models. For example, let us consider a training subject with 10 observed timepoints spaced 6 months apart. In the case of the SVM/SVR models with one input timepoint, this subject would contribute 9 training samples to train a model for predicting 6 months ahead, 8 training samples to train a model for predicting 12 months ahead, 7 training samples to train a model for predicting 18 months ahead, and so on.
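The construction of training samples described above can be sketched as follows; `build_samples` is a hypothetical helper, assuming feature-complete (imputed) 6-monthly data, and the shapes reproduce the worked example of a subject with 10 timepoints.

```python
import numpy as np

def build_samples(subject, n_in, horizon_steps, target_idx):
    """Build (X, y) pairs for one subject for a fixed-horizon SVM/SVR.

    subject: (T, 23) array of 6-monthly features; n_in: number of input
    timepoints to concatenate; horizon_steps: prediction horizon in
    6-month steps; target_idx: column of the target variable.
    """
    T = subject.shape[0]
    X, y = [], []
    for start in range(T - n_in - horizon_steps + 1):
        X.append(subject[start:start + n_in].ravel())  # e.g. 2 x 23 = 46
        y.append(subject[start + n_in - 1 + horizon_steps, target_idx])
    return np.array(X), np.array(y)

# a subject with 10 observed 6-monthly timepoints, 23 features each
subj = np.arange(10 * 23, dtype=float).reshape(10, 23)
X1, y1 = build_samples(subj, n_in=1, horizon_steps=1, target_idx=0)
print(X1.shape, y1.shape)  # (9, 23) (9,): 9 samples for predicting 6 months ahead
X2, _ = build_samples(subj, n_in=2, horizon_steps=1, target_idx=0)
print(X2.shape)            # (8, 46): concatenated 2-timepoint input
```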
The linear filling strategy (Fig. 5B) was used to handle missing data. We also experimented with using multivariate functional principal component analysis (MFPCA) for filling in the missing data (Happ and Greven, 2018; Li et al., 2018). Because prediction performance was evaluated at every month in the future, predictions at intermediate months (e.g., months 1 to 5, months 7 to 11, etc.) were linearly interpolated. Predictions from month 61 onwards utilized forward filling based on the prediction at month 60.
One tricky issue arose when a test subject had insufficient input timepoints for a particular SVM/SVR baseline. For example, the 4-timepoint SVM/SVR baseline required 4 input timepoints in order to predict future timepoints. In this scenario, if a test subject only had 2 input timepoints, then the 2-timepoint SVM/SVR was utilized for this subject even though we were considering the 4-timepoint SVM/SVR baseline. We utilized this strategy (instead of discarding the test subject) in order to ensure the test sets were exactly the same across all algorithms.

Linear state space (LSS) model-We considered a linear state space (LSS) baseline by linearizing the minimalRNN model (Fig. 6). Other than the update equations (Fig. 6), all other aspects of training and prediction were kept the same. For example, the LSS models utilized the same data imputation strategies (Section 2.4) and were trained with the same cost function using Adam.

Long short-term memory (LSTM) model-
The LSTM model is widely used for modeling sequences and temporal trajectories (Ghazi et al., 2019; Lipton et al., 2016a). We have previously used the LSTM for predicting AD progression (Nguyen et al., 2018). Here, we favored the minimalRNN over the LSTM because it has fewer parameters and is therefore less prone to overfitting when data is limited. See Appendix A for further discussion.

Performance evaluation
We randomly divided the data into training, validation and test sets. The ratio of subjects in the training, validation and test sets was 18:1:1. The training set was used to train the model. The validation set was used to select the hyperparameters. The test set was used to evaluate the models' performance. For subjects in the validation and test sets, the first half of the timepoints of each subject were used to predict the second half of the timepoints of the same subject. All variables (except diagnostic category, which was categorical rather than continuous) were z-normalized. The z-normalization was performed on the training set; the mean and standard deviation from the training set were then utilized to z-normalize the validation and test sets. The random split of the data into training, validation and test sets was repeated 20 times to ensure stability of the results (Li et al., 2019; Varoquaux, 2018). Care was taken to ensure the test sets were non-overlapping, so that the test sets across the 20 data splits covered the entire dataset.
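The training-set-only z-normalization can be sketched as follows (the split sizes and data are toy values; using only training-set statistics avoids leaking validation/test information into the model):

```python
import numpy as np

def z_normalize(train, val, test):
    """z-normalize all splits using statistics from the training set only."""
    mu = np.nanmean(train, axis=0)   # nan-aware: features may have missing data
    sd = np.nanstd(train, axis=0)
    return [(a - mu) / sd for a in (train, val, test)]

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
# an 18:1:1 split by subject, as in the paper
train, val, test = data[:180], data[180:190], data[190:]
train_z, val_z, test_z = z_normalize(train, val, test)
print(np.allclose(train_z.mean(axis=0), 0.0, atol=1e-8))  # True
```

The validation and test splits will generally not have exactly zero mean and unit variance after this transform, which is expected: they are standardized relative to the training distribution.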
The HORD algorithm (Regis and Shoemaker, 2013; Eriksson et al., 2015; Ilievski et al., 2017) was utilized to find the best hyperparameters by maximizing model performance on the validation set. We note that this optimization was performed independently for each training/validation/test split of the dataset. The hyperparameter search space for the minimalRNN, LSS and LSTM models is shown in Table 2. The hyperparameter search space for the SVM/SVR is shown in Table 3. The final sets of hyperparameters are found in Tables S1 to S13.
Following the TADPOLE competition, diagnosis classification accuracy was evaluated using the multiclass area under the receiver operating characteristic curve (mAUC; Hand and Till, 2001) and the balanced class accuracy (BCA). The mAUC was computed as the average of three two-class AUCs (AD vs not AD, MCI vs not MCI, and NC vs not NC). For both mAUC and BCA, higher values indicate better performance. ADAS-Cog13 and ventricular volume prediction accuracy was evaluated using the mean absolute error (MAE); lower MAE indicates better performance. The final performance for each model was computed by averaging the results across the 20 test sets. Even though the 20 test sets do not overlap, the subjects used for training the models do overlap across the test sets. Therefore, the prediction performances were not independent across the 20 test sets. To account for this non-independence, we utilized the corrected resampled t-test (Bouckaert and Frank, 2004) to evaluate differences in performance between models.
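The mAUC as described, an average of three one-vs-rest AUCs, can be sketched as follows. Note that Hand and Till's original measure averages pairwise class AUCs; what is shown here is the simplified one-vs-rest form stated in the text, with toy labels and probabilities of our own.

```python
import numpy as np

def auc_binary(y_true, scores):
    """Rank-based AUC: probability a random positive outranks a random
    negative, with ties counting half."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def mauc(y_true, proba):
    """Average of the one-vs-rest AUCs over classes (NC, MCI, AD)."""
    return np.mean([auc_binary((y_true == c).astype(int), proba[:, c])
                    for c in range(proba.shape[1])])

y = np.array([0, 0, 1, 1, 2, 2, 1, 0])
p = np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.2, 0.6, 0.2],
              [0.3, 0.5, 0.2], [0.1, 0.2, 0.7], [0.2, 0.3, 0.5],
              [0.4, 0.4, 0.2], [0.5, 0.3, 0.2]])
print(mauc(y, p))  # perfectly ranked toy example -> 1.0
```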

Impact of the number of input timepoints on prediction accuracy-For the minimalRNN to be useful in clinical settings, it should ideally perform well with as few input timepoints as possible. Therefore, we applied the best model (Section 2.6) to the test subjects using only 1, 2, 3 or 4 input timepoints (Fig. 7). This differs from the main benchmarking analysis (Section 2.6), where all input timepoints (which accounted for half of the total number of timepoints) of the test subjects were used for predicting future timepoints. Test subjects with fewer than 4 input timepoints were discarded, so that the same test subjects were evaluated across the four conditions (i.e., 1, 2, 3 or 4 input timepoints).
Because we discarded some test subjects, the result of this analysis is not comparable to that of the main benchmarking analysis (Section 2.6).

Effect of temporal resolution of minimalRNN-Even though the ADNI data was collected at a minimum interval of 6 months, in practice data was not always collected at exactly 6-month intervals, e.g., data might be collected at month 4 instead of the scheduled month 6. Furthermore, the TADPOLE challenge required participants to make future predictions at a monthly interval, with prediction performance evaluated at a monthly resolution. Therefore, our main analysis utilized minimalRNN models with a temporal resolution of 1 month.
However, the choice of temporal resolution (i.e., number of months between timepoints) might affect the performance of the minimalRNN. For example, using a finer temporal resolution (e.g., 1-month interval versus 6-month interval) leads to more missing data, which might lead to worse performance. On the other hand, using a coarser temporal resolution (e.g., 6-month interval versus 1-month interval) leads to greater mis-alignment between the minimalRNN's timepoints and the actual observations. For example, if we consider a minimalRNN with a temporal resolution of 6 months, then actual observed data at month 10 would need to be assigned to month 12, which might lead to worse performance. Finally, using a coarser temporal resolution results in fewer hidden state updates between two points in time, making it potentially easier for the minimalRNN to learn longer-term temporal patterns.
Here, we experimented with three different temporal resolutions: 1-month, 3-month and 6-month intervals. The RNN models were trained and tested using the same procedure described in Section 2.6, including the hyperparameter search. For training the 3-month and 6-month minimalRNN models, observed data were assigned to the closest timepoint. To evaluate the performance of the 3-month and 6-month minimalRNN models, their predictions were linearly interpolated to obtain a temporal resolution of 1 month. Performance was evaluated only at timepoints with observed ground-truth data.
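Assigning an off-schedule visit to the closest model timepoint can be sketched as follows (the month-10 example from the previous section corresponds to the 6-month resolution):

```python
def snap_to_grid(month, resolution):
    """Assign an off-schedule visit (in months) to the closest model
    timepoint on a grid with the given resolution (in months)."""
    return int(round(month / resolution)) * resolution

# a visit at month 10, under each temporal resolution tested
print([snap_to_grid(10, r) for r in (1, 3, 6)])  # -> [10, 9, 12]
```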

Impact of different terms in the minimalRNN model-To investigate which terms in the minimalRNN model were important for model performance, we conducted ablation experiments whereby we gradually simplified the minimalRNN update equations in 4 steps (Fig. 8). In the last step (Variant 4), the simplified update equations were the same as those of the linear state space (LSS) model. The ablated RNN models were trained and tested using the same procedure described in Section 2.6, including the hyperparameter search.

Impact of different features on prediction performance-
We performed feature ablation to analyze the contributions of different features to the prediction performance of the trained minimalRNN model. To ablate a feature in the input data, the value of that feature was set to its mean value in the dataset, while the other input features were left unaltered. Thus, there were 23 different versions of the input data, each with a different feature ablated. We used the trained minimalRNN model from each split of the data (as described in Section 2.6) and the ablated input data to make predictions on the test data. A large drop in prediction performance when a feature was ablated would suggest that the feature was important for the trained minimalRNN model to make accurate predictions.
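The feature ablation procedure can be sketched as follows (array shapes and the helper name are illustrative):

```python
import numpy as np

def ablate_feature(data, feat_idx, feat_mean):
    """Return a copy of the input with one feature replaced by its dataset
    mean, leaving the other features untouched."""
    ablated = data.copy()
    ablated[:, :, feat_idx] = feat_mean[feat_idx]
    return ablated

# toy data: 5 subjects x 4 timepoints x 23 features
rng = np.random.default_rng(0)
data = rng.standard_normal((5, 4, 23))
mean = data.reshape(-1, 23).mean(axis=0)

# one ablated version of the input per feature -> 23 versions in total
versions = [ablate_feature(data, j, mean) for j in range(23)]
print(len(versions))  # 23
```

Each ablated version would then be fed through the trained model, and the drop in test performance relative to the unablated input quantifies that feature's importance.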

TADPOLE live leaderboard
The TADPOLE challenge involves the prediction of ADAS-Cog13, ventricular volume and clinical diagnosis of 219 ADNI participants for every month up to five years into the future. We note that these 219 participants were a subset of the 1677 subjects used in this study. However, the future timepoints used to evaluate performance on the live leaderboard (https://tadpole.grand-challenge.org/D4_Leaderboard/) were not part of the data utilized in this study. Here, we utilized the entire dataset (1677 participants) to tune a set of hyperparameters (using HORD) that maximized performance either (1) one year into the future or (2) across all years into the future. We then submitted the predictions for the 219 participants to the TADPOLE leaderboard.

Data and code availability
The code used in this paper can be found at https://github.com/ThomasYeoLab/CBIG/tree/master/stable_projects/predict_phenotypes/Nguyen2020_RNNAD. This study utilized data from the publicly available ADNI database (http://adni.loni.usc.edu/data-samples/access-data/). The particular set of participants and features we used is available at the TADPOLE website (https://tadpole.grand-challenge.org/).

Table 4 shows the test performance of all models across all three missing data strategies.

Overall performance
We performed statistical tests comparing the three minimalRNN variants (RNN-FF, RNN-LF and RNN-MF) with all other baseline approaches (LSS, LSTM, constant prediction, SVM/SVR). Multiple comparisons were corrected with a false discovery rate (FDR) of q < 0.05. RNN-MF showed the best results and was statistically better than most baseline approaches (Table 4). For example, RNN-MF was statistically better than LSS-MF for clinical diagnosis, but not ADAS-Cog13 or ventricular volume. Similarly, RNN-MF was statistically better than LSTM-MF for clinical diagnosis and ventricular volume, but not ADAS-Cog13.
In terms of handling missing data, model filling (MF) performed better than forward filling (FF) and linear filling (LF), especially when predicting ADAS-Cog13 and ventricular volume (Table 4). Interestingly, more input timepoints did not necessarily lead to better prediction in the case of SVM/SVR. In fact, the SVM/SVR model using one timepoint was numerically better than SVM/SVR models using more timepoints, although the differences were small. This might be because SVM/SVR models with one input timepoint had access to more training data than SVM/SVR models with more input timepoints (Section 2.5.2). Furthermore, SVM/SVR models with more input timepoints had to handle longer feature vectors, which increased the risk of overfitting (Section 2.5.2).
Recall that for test subjects, the first half of the timepoints of each subject were used to predict the second half of the timepoints of the same subject (Section 2.6). Table 5 shows the breakdown of subjects based on their clinical diagnoses at the last input timepoints and at the last timepoints (considering only timepoints with observed clinical diagnoses). For example, if a subject had 10 timepoints, the 10 timepoints were split into 5 input (observed) timepoints and 5 unobserved timepoints we sought to predict. In the case of this subject, the last input timepoint would be timepoint 5 and the last timepoint would be timepoint 10. If the subject did not have an observed clinical diagnosis at timepoint 10, we would consider the clinical diagnosis at timepoint 9, and so on. We note that a small number of subjects were not included in Table 5 because they did not have any observed clinical diagnosis in the first half and/or second half of the timepoints.

Fig. 10 shows the breakdown of the prediction performance (Fig. 9) into six different groups. The "stable" groups (NC-S, MCI-S, AD) comprised subjects whose diagnostic categories were the same at the last input timepoint and the last timepoint. The "progressive" groups (NC-P, MCI-P) comprised subjects who progressed along the AD dementia spectrum (e.g., from MCI to AD). Finally, the MCI recovered (MCI-R) group comprised subjects who had reverted from MCI to NC. We did not consider the 4 subjects who reverted from AD to MCI because of the small sample size. We note that diagnostic prediction performance was measured using accuracy (fraction of correct predictions) instead of mAUC and BCA because there was only one class in the stable groups.
In the case of predicting ventricular volume or ADAS-Cog13, the minimalRNN was comparable to or numerically better than all baselines. In the case of diagnostic category, the minimalRNN compared favorably with all baselines except for constant prediction in the stable groups. The reason is that, in the stable groups, it is optimal to predict all future diagnostic categories to be the same as the last observed diagnosis. However, in reality, whether subjects are stable is not known in advance. Therefore, for the stable groups, constant prediction should be treated as an upper bound on prediction performance rather than as a baseline. We note that constant prediction did not achieve 100% accuracy in the stable groups because the clinical diagnoses could fluctuate over time. For example, if a subject had 4 timepoints with corresponding diagnoses NC, NC, MCI and NC, then the subject would be classified as NC-stable because the second and fourth timepoints had the same NC diagnosis, yet constant prediction would be wrong at the third (MCI) timepoint.

Fig. 11 shows the breakdown of the prediction performance from Fig. 9 in yearly intervals up to 6 years into the future. Not surprisingly, the performance of all algorithms became worse for predictions further into the future. The constant baseline was very competitive with the other models for the first year, but its performance dropped very quickly in subsequent years. The minimalRNN model was comparable to or numerically better than all baseline approaches across all years.

Since the minimalRNN with model filling (RNN-MF) performed the best (Table 4), we further explored how well the trained RNN-MF model would perform on test subjects with different numbers of input timepoints. Fig. 12 shows the performance of RNN-MF averaged across 20 test sets using different numbers of input timepoints. The exact numerical values are reported in Table 6. RNNs using 2 to 4 input timepoints achieved similar performance across all metrics. The RNN using 1 input timepoint had numerically worse results, especially for ventricular volume.
However, there was no statistical difference between using 1 input timepoint and 4 input timepoints, even in the case of ventricular volume (p = 0.20).

Table 7 shows the prediction performance of the RNN-MF model when the temporal resolution was varied from 1-month to 6-month intervals. There was no significant difference in prediction performance across the different temporal resolutions.

Table 8 shows the performance of the original minimalRNN model (RNN-MF) and 4 ablated variants decreasing in complexity from RNN-MF to variant 4 (LSS-MF). Numerically, RNN-MF had the best results compared with all 4 variants. However, it was not the case that performance degraded monotonically from the most complex model (RNN-MF) to the least complex model (LSS-MF). Interestingly, among the 4 variants, LSS-MF (variant 4) showed the worst performance for clinical diagnosis, but close to the best performance for ADAS-Cog13 and ventricular volume. This suggests that some level of nonlinearity might be more useful for predicting clinical diagnosis, but less so for ADAS-Cog13 and ventricular volume. Overall, it was difficult to conclude that any specific component was essential to the minimalRNN's performance. This might not be surprising because, as its name suggests, the minimalRNN was designed to be as simple as possible, so removing any component yielded somewhat worse results.

Impact of different features on prediction performance
The results of the feature ablation experiments are shown in Table 9. Unsurprisingly, ablating diagnosis resulted in the most significant drop in diagnostic mAUC and BCA, while ablating ADAS-Cog13 and ventricular volume resulted in the most significant increase in ADAS-Cog13 MAE and ventricular MAE respectively. Ablating CDRSB also led to a noticeable drop in diagnosis mAUC and BCA, probably because CDRSB is used in the diagnosis of an individual. Interestingly, ablating CDRSB also led to a noticeable increase in ventricular MAE.
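The "replace with the mean" ablation can be sketched as follows (a minimal sketch; the paper replaced a feature with its mean value, and since the continuous features were z-normalized, that mean is approximately zero; here the column mean of the given array is used purely for illustration):

```python
import numpy as np

def ablate_feature(X, col):
    """Replace one feature column with its mean across rows, removing that
    feature's information while keeping the input dimensionality fixed.
    X: (timepoints, features) array; NaNs are ignored when averaging."""
    X = np.array(X, dtype=float)
    X[:, col] = np.nanmean(X[:, col])
    return X

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(ablate_feature(X, 1))  # second column replaced by its mean, 4.0
```

The model is then re-evaluated on the ablated inputs; the drop in mAUC/BCA or rise in MAE indicates how much the model relied on that feature.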

TADPOLE live leaderboard
The original LSTM model (Nguyen et al., 2018) was ranked 5th (out of 53 entries) in the TADPOLE grand challenge in July 2019 (entry "CBIL" at https://tadpole.grand-challenge.org/Results/). Our current minimalRNN models were ranked 2nd and 3rd (out of 63 entries) on the leaderboard as of June 3rd, 2020 (entries "CBIL-MinMFa" and "CBIL-MinMF1"; https://tadpole.grand-challenge.org/D4_Leaderboard/). Interestingly, the model obtained from hyperparameters tuned to predict all years into the future ("CBIL-MinMFa") performed better than the model obtained from hyperparameters tuned to predict one year into the future ("CBIL-MinMF1"), even though the leaderboard currently utilizes only about one year of future data for evaluation.

Discussion
In this work, we adapted a minimalRNN model for predicting longitudinal progression in AD dementia. Our approach compared favorably with baseline algorithms, such as SVM/SVR, LSS, and LSTM models. However, we note that there was no statistical difference between the minimalRNN and LSS for predicting ADAS-Cog13 and ventricular volume, even though other studies suggested benefits of modeling nonlinear interactions between features (Popescu et al., 2019).
As can be seen when setting up the SVM/SVR baseline models (Section 2.5.2), there were many edge cases to consider in order to adapt a "static" prediction algorithm (e.g., SVM/SVR) to the more "dynamic" longitudinal prediction problem we considered here. For example, data are wasted because static approaches generally assume that participants have the same number of input timepoints. Therefore, for the SVM/SVR models using 4 input timepoints, we ended up with only 1454 participants out of the original 1677 participants. This might explain why the SVM/SVR model using 1 input timepoint compared favorably with the SVM/SVR model using 4 input timepoints (Table 4). Another issue with static models is that the relationship between input features and outputs might vary over time (i.e., temporal conditional shift; Oh et al., 2018), so better performance might be achieved by building separate models to predict month 12, month 18, and so on. Here, we built multiple separate SVM/SVR models to predict at a fixed number of future timepoints and performed interpolation at intermediate timepoints. By contrast, state-based models (e.g., minimalRNN, LSS, or LSTM) are more elegant in the sense that they handle participants with different numbers of timepoints and can in principle predict an unlimited number of timepoints into the future.
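The data wastage of the static approach can be illustrated with a toy sketch: building a fixed-length feature vector from the first k timepoints necessarily drops every participant with fewer than k visits (the data below are made up):

```python
def build_static_features(subjects, k):
    """Concatenate each subject's first k visits into one flat feature
    vector, as a static model (e.g., SVM/SVR) requires. Subjects with
    fewer than k visits cannot be used and are dropped."""
    kept = {}
    for sid, visits in subjects.items():
        if len(visits) >= k:
            kept[sid] = [value for visit in visits[:k] for value in visit]
    return kept

subjects = {
    "s1": [[1, 2], [3, 4], [5, 6], [7, 8]],  # 4 visits: usable with k = 4
    "s2": [[9, 10]],                         # 1 visit: dropped when k = 4
}
print(build_static_features(subjects, 4))  # only "s1" survives
print(build_static_features(subjects, 1))  # both subjects survive
```

Larger k also means longer feature vectors, which is the overfitting risk noted for the 4-timepoint SVM/SVR models.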
Even though the ADNI dataset comprised participants with multiple timepoints, for the algorithm to be clinically useful, it has to be successful at dealing with missing data and with participants who have only one input timepoint. We found that the "integrative" approach of using the model to fill in the missing data (i.e., model filling) compared favorably with "preprocessing" approaches, such as forward filling or linear filling. However, it is possible that more sophisticated "preprocessing" approaches, such as matrix factorization (Mazumder et al., 2010; Nie et al., 2017; Thung et al., 2016) or wavelet interpolation (Mondal and Percival, 2010), might yield better results. We note that our model filling approach can also be considered a form of matrix completion, since the RNN (or LSS) was trained to minimize the predictive loss, which is equivalent to maximizing the likelihood of the training data. However, matrix completion usually assumes that the training data can be represented as a matrix that can be factorized into low-rank or other specially structured matrices. By contrast, our method assumes temporal dependencies between rows in the data matrix (where each row is a timepoint).
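The model-filling loop can be sketched schematically as follows (`model.step` is a hypothetical one-step predictor standing in for the trained minimalRNN; the toy model below simply predicts that the next visit equals the current one, purely to demonstrate the loop):

```python
import numpy as np

def model_fill_rollout(model, observations, n_future):
    """Roll a recurrent model through a sequence, substituting the model's
    own prediction wherever an observation is missing (NaN), then keep
    feeding predictions back as inputs for n_future further steps.

    observations: (T, F) array with np.nan marking missing values.
    model.step(h, x) -> (h_next, x_pred) is a hypothetical one-step API.
    """
    h = model.init_state()
    x_pred = None
    for x_obs in observations:
        x = x_obs.copy()
        if x_pred is not None:
            missing = np.isnan(x)
            x[missing] = x_pred[missing]  # model filling
        h, x_pred = model.step(h, x)
    future = []
    for _ in range(n_future):
        future.append(x_pred)
        h, x_pred = model.step(h, x_pred)  # predictions become inputs
    return np.array(future)

class CarryLastModel:
    """Toy stand-in for a trained RNN: predicts the next visit to equal
    the current one. Used only to demonstrate the filling loop."""
    def init_state(self):
        return None
    def step(self, h, x):
        return h, x.copy()

obs = np.array([[1.0], [np.nan], [3.0]])  # middle visit missing
future = model_fill_rollout(CarryLastModel(), obs, 2)
print(future.ravel().tolist())  # [3.0, 3.0]
```

The same substitution is applied during both training and testing, which is what distinguishes model filling from the "preprocessing" strategies.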
Our best model (minimalRNN with model filling) had similar performance when using only 1 input timepoint instead of 4 input timepoints, suggesting that our approach might work well with just cross-sectional data (after training using longitudinal data). However, we might have simply lacked the statistical power to distinguish among the different conditions because of the smaller number of subjects in this experiment. Overall, there was no noticeable difference among using 2, 3 or 4 input timepoints, while the performance using 1 input timepoint appeared worse, but the difference was not statistically significant (Fig. 12).
Although our approach compared favorably with the baseline algorithms, we note that any effective AD dementia treatment probably has to begin early in the disease process, potentially at least a decade before the emergence of behavioral symptoms. However, even in the case of our best model (minimalRNN with model filling), prediction performance of clinical diagnosis dropped from a BCA of 0.935 in year 1 to a BCA of 0.810 in year 6, while ventricular volume MAE increased from 0.00104 in year 1 to 0.00511 in year 6. Thus, significant improvement is needed for clinical utility.
One possible future direction is to investigate new features, e.g., those derived from diffusion MRI or arterial spin labeling. Previous studies have also suggested that different atrophy patterns (beyond the temporal lobe) might influence cognitive decline early in the disease process (Noh et al., 2014; Byun et al., 2015; Ferreira et al., 2017; Zhang et al., 2016; Risacher et al., 2017; Sun et al., 2019), so the atrophy features considered in this study (Table 1) might not be optimal. Although the new features may be correlated with currently used features, the new features might still provide complementary information when modeling AD progression (Popescu et al., 2019). Another possible source of information might come from electronic health records (EHR), which can be collected more frequently and easily than neuropsychological test scores or MRI scans (Tjandra et al., 2020). Combining neuropsychological test scores, MRI scans and EHR might potentially yield better prediction.
As mentioned in the introduction, an earlier version of our algorithm was ranked 5th out of 53 entries in the TADPOLE competition. Our current model was ranked 2nd out of 63 entries on the TADPOLE live leaderboard as of June 2020. Interestingly, the top team considered additional handcrafted features, which might have contributed to its success. Furthermore, the top team utilized a non-deep-learning algorithm, XGBoost (Chen and Guestrin, 2016), which might be consistent with recent work suggesting that for certain neuroimaging applications, non-deep-learning approaches might be highly competitive (He et al., 2020).

Conclusion
Using 1677 participants from the ADNI database, we showed that the minimalRNN model was better than other baseline algorithms for the longitudinal prediction of multimodal AD biomarkers and clinical diagnosis of participants up to 6 years into the future. We explored three different strategies to handle the missing data issue prevalent in longitudinal data. We found that the RNN model can itself be used to fill in the missing data, thus providing an integrative strategy to handle the missing data issue. Furthermore, we also found that after training with longitudinal data, the trained RNN model can perform reasonably well using one input timepoint, suggesting the approach might also work for cross-sectional data.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.

Appendix A
This appendix summarizes the differences between the minimalRNN and the LSTM. For the convenience of the readers, the minimalRNN state equations (Fig. 2B) are repeated below:

u_t = tanh(W_u x_t + b_u)
f_t = σ(U_f h_{t−1} + V_f u_t + b_f)
h_t = f_t ⊙ h_{t−1} + (1 − f_t) ⊙ u_t
For ease of comparison, we use a similar set of notations to show the LSTM state equations below.
i_t = σ(W_i x_t + R_i h_{t−1} + b_i)
f_t = σ(W_f x_t + R_f h_{t−1} + b_f)
o_t = σ(W_o x_t + R_o h_{t−1} + b_o)
u_t = tanh(W_u x_t + R_u h_{t−1} + b_u)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ u_t
h_t = o_t ⊙ tanh(c_t)

As can be seen, the minimalRNN uses fewer parameters than the LSTM by doing away with the output gate (o_t) and setting the input gate (i_t) to be the complement of the forget gate (i.e., 1 − f_t). The hyperbolic tangent is also removed from the computation of h_t, thus making h_t = c_t. In addition, the term h_{t−1} is removed from the computation of the term u_t in the minimalRNN, so the hidden state (h_t) of the minimalRNN decays to zero when the input (x_t) is zero. Note that in the context of our study, all variables (except clinical diagnosis) were z-normalized (Section 2.6), so an input of zero corresponds to observing the mean value. In contrast, the hidden state of the LSTM can fluctuate even when the input is zero.

Fig. 2.
The minimalRNN model. The input x_t to each RNN cell comprised the diagnosis s_t and continuous variables g_t (Table 1; Eq. (1)). Note that s_t was represented using one-hot encoding. The hidden state h_t was a combination of the previous hidden state h_{t−1} and the transformed input u_t (Eq. (4)). The forget gate f_t weighed the contributions of the previous hidden state h_{t−1} and the current transformed input u_t toward the current hidden state h_t (Eq. (3)). The model predicted the next-month diagnosis s_{t+1} and continuous variables g_{t+1} using the hidden state h_t (Eqs. (5) and (6)). ⊙ and σ denote the element-wise product and the sigmoid function respectively. The minimalRNN was trained to predict the next observation given the current observation (e.g., predicting x_2 given x_1). Errors between the actual observations (e.g., x_2) and the predictions (e.g., x̂_2) were used to update the model parameters. The hidden state h_t encoded information about the subject up until time t.
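The minimalRNN update summarized in this appendix can be sketched in NumPy (weight names and shapes are illustrative, not the paper's exact parameterization); the example also demonstrates the decay-to-zero property discussed above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minimal_rnn_step(x, h_prev, params):
    """One minimalRNN update:
    u_t = tanh(W_u x_t + b_u)                  (no h_{t-1} term in u_t)
    f_t = sigma(U_f h_{t-1} + V_f u_t + b_f)   (forget gate)
    h_t = f_t * h_{t-1} + (1 - f_t) * u_t      (input gate = 1 - f_t)
    There is no output gate and no tanh on h_t, i.e., h_t = c_t."""
    W_u, b_u, U_f, V_f, b_f = params
    u = np.tanh(W_u @ x + b_u)
    f = sigmoid(U_f @ h_prev + V_f @ u + b_f)
    return f * h_prev + (1.0 - f) * u

# With zero input (the mean value after z-normalization) and zero biases,
# u_t = 0, so h_t = f_t * h_{t-1} shrinks every step: h decays toward zero.
rng = np.random.default_rng(0)
d_in, d_h = 3, 4
params = (np.zeros((d_h, d_in)), np.zeros(d_h),
          rng.standard_normal((d_h, d_h)), rng.standard_normal((d_h, d_h)),
          np.zeros(d_h))
h = np.ones(d_h)
for _ in range(1000):
    h = minimal_rnn_step(np.zeros(d_in), h, params)
print(np.abs(h).max())  # effectively zero
```

Since every forget-gate value lies strictly between 0 and 1, each component of h is multiplied by a factor below 1 at every step, which is the decay behavior that distinguishes the minimalRNN from the LSTM under zero input.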
Prediction started at month 4. Since there were no observed data at timepoints 4 and 5, the predictions (x̂_4 and x̂_5) were used as inputs (at timepoints 5 and 6 respectively) to predict further into the future.

Different strategies to impute missing data. (A) Forward filling imputed missing values using the last observed value. (B) Linear filling imputed missing values using linear interpolation between the previous observed and next observed values. Notice that linear filling did not work for months 8, 9 and 10 because there were no future observed data for linear interpolation, so forward filling was utilized for those timepoints. (C) Model filling imputed missing values using model predictions.

The input x_t to each LSS cell comprised the diagnosis s_t and continuous variables g_t (Table 1; Eq. (10)). Like before, s_t was represented using one-hot encoding. The hidden state h_t was a combination of the previous hidden state h_{t−1} and the input x_t (Eq. (11)). The model predicted the next-month diagnosis s_{t+1} and continuous variables g_{t+1} using the hidden state h_t (Eqs. (12) and (13)).

Prediction performance as a function of the number of input timepoints in the test subjects.

Nguyen et al., Neuroimage. Author manuscript; available in PMC 2021 January 10.

Fig. 8.
Different ablated minimalRNN models. Ablation is done by simplifying the update equations.

…Table 4 respectively. MinimalRNN performed the best. See Fig. S1 for all models.

Prediction performance from Fig. 9 broken down into yearly intervals up to 6 years into the future. For brevity, we denote minimalRNN as RNN. All algorithms became worse further into the future. MinimalRNN was comparable to or numerically better than all baseline algorithms across all years. See Fig. S2 for all models.

Test performance of the minimalRNN model with the model filling strategy (RNN-MF) using different numbers of input timepoints (after training with all timepoints). Results were averaged across 20 test sets. Even though the minimalRNN model using 1 input timepoint yielded numerically worse results, the differences were not significant (see Table 6).

Table 7
Test performance of the minimalRNN model with the model filling strategy (RNN-MF) at different temporal resolutions. We note that the top row (1-month interval) was the same as in Table 4. Results were averaged across 20 test sets. The best result for each performance metric is bolded. There was no significant difference across the temporal resolutions.

Table 8
Test performance of the original minimalRNN model (RNN-MF) and different ablated variants. Results were averaged across 20 test sets. The best result for each performance metric is bolded.

mAUC (more=better) | BCA (more=better) | ADAS-Cog13 (less=better) | Ventricles (less=better)
RNN-MF

Table 9
Test performance of the minimalRNN model (RNN-MF) with different features ablated (replacing the input feature with its mean value). Results were averaged across 20 test sets. Prediction performance of the original model is bolded. For each column, the top two ablated features leading to the largest drop in performance are bolded and italicized.