Modeling longitudinal imaging biomarkers with parametric Bayesian multi‐task learning

Abstract Longitudinal imaging biomarkers are invaluable for understanding the course of neurodegeneration, promising the ability to track disease progression and to detect disease earlier than cross‐sectional biomarkers. To properly realize their potential, biomarker trajectory models must be robust to both under‐sampling and measurement errors and should be able to integrate multi‐modal information to improve trajectory inference and prediction. Here we present a parametric Bayesian multi‐task learning based approach to modeling univariate trajectories across subjects that addresses these criteria. Our approach learns multiple subjects' trajectories within a single model that allows for different types of information sharing, that is, coupling, across subjects. It optimizes a combination of uncoupled, fully coupled and kernel coupled models. Kernel‐based coupling allows linking subjects' trajectories based on one or more biomarker measures. We demonstrate this using Alzheimer's Disease Neuroimaging Initiative (ADNI) data, where we model longitudinal trajectories of MRI‐derived cortical volumes in neurodegeneration, with coupling based on APOE genotype, cerebrospinal fluid (CSF) and amyloid PET‐based biomarkers. In addition to detecting established disease effects, we detect disease related changes within the insula that have not received much attention within the literature. Due to its sensitivity in detecting disease effects, its competitive predictive performance and its ability to learn the optimal parameter covariance from data rather than choosing a specific set of random and fixed effects a priori, we propose that our model can be used in place of or in addition to linear mixed effects models when modeling biomarker trajectories. A software implementation of the method is publicly available.

, repeated measures over time (i.e., longitudinal data) in neuroimaging are often limited to a baseline measurement and a few follow-up time-points per subject. This is primarily due to the costs and complexities of collecting such data. Consequently, withinsubject trajectory models of regions of interest (ROIs) or clinical measures that are based on such limited data may not be robust to measurement errors from image acquisition or post-processing. Such noise may lead to poor inferences of true underlying trajectory parameters and poor predictions of future values, diminishing the value of trajectory based biomarkers (Curran, Obeidat, & Losardo, 2010). An additional problem is a limit to the flexibility of the models that can be estimated: with two time-points one can only estimate a linear model, with three only a quadratic, and so on.
There has been growing interest in methods that efficiently use longitudinal neuroimaging data; e.g., Telzer et al., (Telzer et al., 2018) provide an overview related to fMRI analysis. By far the most popular approaches are based on mixed effect modeling, which combines fixed effects, that is, pooling subjects' data to create an average trajectory for all subjects, with random effects, that is, individualizing models about the average trajectory. The mixed effects modeling approach is well suited to both balanced (fixed number of samples, fixed time interval between samples) and unbalanced (varying samples or time intervals) longitudinal designs, allowing for separate analysis of between and within subject variability (Fitzmaurice, Laird, & Ware, 2011;Laird & Ware, 1982).  and Guillaume, Hua, Thompson, Waldorp, and Nichols (2014) provide overviews of linear mixed effects (LME) models within neuroimaging and apply them to Alzheimer's disease (AD).
Features from longitudinal measurements remove inter-individual differences and thus make for better descriptions of disease progression. Recent models of disease progression have integrated both crosssectional and longitudinal information to estimate discrete or continuous disease stages for individuals (Donohue et al., 2014;Fonteijn et al., 2012;Jedynak et al., 2012;Lorenzi et al., 2017;Schiratti, Allassonnière, Colliot, & Durrleman, 2017;Young et al., 2014). They have been inspired by, and seek to quantify the hypothetical models of disease progression proposed by neurodegenerative disease researchers (Buckner et al., 2005;Jack et al., 2010). Oxtoby and Alexander ( 2017) provide an overview of the methods within this emerging field. While the purpose of disease progression modeling is to estimate disease stage and find group-level (typically monotonic) trajectories for each biomarker, this procedure can be thought of as a form of coupling of biomarker trajectories across subjects. However, most of these models are not explicitly setup to couple subjects' trajectories within each biomarker (e.g., a brain structural ROI) based on other biomarkers' information (e.g., genetic risk or brain amyloid deposition). As such these models may not fully capitalize on valuable multi-modal information that may improve trajectory estimates within each biomarker.
Multi-kernel learning (MKL) presents an approach to combining multiple biomarker similarity measures. It has been previously applied to neuroimage-based pattern recognition and machine learning to discriminate disease (Hinrichs, Singh, Xu, & Johnson, 2011;Zhang, Wang, Zhou, Yuan, & Shen, 2011). Young et al. (2013) applied a Bayesian MKL approach to AD discrimination, which avoided the need for costly cross-validation when tuning the kernel weightings. In addition to a more efficient means of tuning such hyperparameters, Bayesian modeling also provides a principled approach to incorporating prior information, comparing models, making probabilistic predictions, and inferring distributions over parameters (Gelman et al., 2013). It has been applied in many contexts within neuroimaging (Woolrich, 2012) and more specifically in both parametric and nonparametric trajectory models (Lorenzi, Ziegler, Alexander, & Ourselin, 2015;Ziegler, Penny, Ridgway, Ourselin, & Friston, 2015).
In this article, we develop an approach that realizes the benefits of MKL within a Bayesian trajectory model rather than a disease classification model. To do so we use multi-task learning, which aims to learn multiple related tasks simultaneously, sharing information across tasks. Bayesian MTL was previously applied in neuroimaging within the context of multi-subject fMRI analysis (Marquand, Brammer, Williams, & Doyle, 2014;Marquand, Brammer, et al. 2014) based on the method proposed by Bonilla et al. (2008). Bayesian MTL has also been applied to imaging genetics via a hierarchical Bayesian model that encourages both individual and group sparsity (Greenlaw, Szefer, Graham, Lesperance, & Nathoo, 2017;Nathoo, Greenlaw, & Lesperance, 2016). Here we set the learning of each subject's biomarker trajectory as a task and apply MTL to share information across subjects. We develop a parametric extension of Marquand et al.'s approach, a joint Bayesian linear regression that allows for full coupling across all subjects along with coupling based on biomarker similarity, so that subjects with similar measures in one biomarker may have more similar trajectories in another. Furthermore, we use MKL to couple trajectories within a biomarker based on an optimal balance of multiple other biomarkers' information. We compare our approach, which learns the optimal parameter covariance from data to standard LME modeling, where the parameter covariance structure depends on the a priori choice of random and fixed effects.
This article (a) contributes a parametric model that learns a separate trajectory for each subject while allowing for information-sharing across subjects and the integration of multi-modal information during model training, resulting in better predictions, and inferences; (b) performs simulations to validate the model and understand its properties; (c) applies the model to clinical neuroimaging data, modeling cortical region of interest (ROI) trajectories in neurodegeneration using various biomarkers for coupling and (d) interprets and discusses the results. We present a univariate model of the temporal trajectory of a scalar variable (e.g., values of an ROI or a clinical measure) across multiple subjects. We set the learning of each subject's trajectory as a task and use MTL to share information across subjects to better learn all subjects' trajectories as a single, coupled model. Empirical Bayes allows us to automatically tune the degree and type of coupling across subjects using hyperparameters that control the overall covariance structure of the parameters being learned.
We start by specifying a single, large model for all n subjects' trajectories, stacking the longitudinal observations of all subjects into the vector y = y 0 1 ÁÁÁ y 0 where M ii is defined below and Σ prior is of size nd × nd. The first term in Equation (4), with weight α 1 , allows for a diagonal (i.e., independent) covariance structure in the parameters 1 and ensures the matrix is positive definite. The second term is a sum of d Kronecker products. When fitting linear models (i.e., polynomial models of degree one, as we will do throughout this article), there are two such products: one for the 0 th order parameters (i.e., intercepts), the other for the first order parameters (i.e., slopes or rates of change). In each case, we take the Kronecker product of an inter-subject (coupling) matrix and an intra-subject matrix to form part of the overall covariance matrix. Each Σ ci is an n × n matrix parametrized to allow for fully independent parameters (α i1 I n term, where I n is an n-dimensional identity matrix), fully coupled parameters (α i2 1 n term, where 1 n is an n-dimensional matrix of ones) and coupling based on the set of k biomarker based kernels (the K j 's, each an n-dimensional positive definite matrix). The form of these biomarker kernels is detailed later in this paper.
As a result, each Σ ci contains at least k + 2 hyperparameters 2 and overall there are at least d(k + 2) covariance-related hyperparameters. It is important to bear in mind the distinction between the hyperparameters (the α's) used to control coupling among subjects and the parameters of individuals' trajectory models (the w i 's stacked within w). tity matrix and Σ c = α 1 I n + α 2 1 n is closest to the form used in Marquand et al. Using instead Σ c = α 1 I n + α 2 1 n + P k j = 1 α j + 2 ð Þ K j implements MKL. Both variations use fewer hyperparameters than our proposed parametrization. However, these simpler models tie the coupling of parameters types (e.g., intercepts and slopes) together and as such may be more prone to inducing spurious coupling in one set of parameters while capturing true coupling in another (e.g., spuriously coupling intercepts along with slopes). This may, in turn, lead to the increased false positive group differences in subjects' parameters.
This prior structure differs from that used in hierarchical Bayesian modeling, where individuals' first level parameter prior means are specified as linear combinations of second level parameters that may include covariates and grouping variables that also have prior distributions. In contrast, we assume a zero-mean prior on individuals' parameters and incorporate covariates via kernels within each Σ ci . Kernels allow for nonlinear similarity measures between covariates, adding modeling flexibility not present in linear hierarchical models. Hierarchical models, in contrast, offer a well-developed framework for modeling fixed and random effects. The two approaches are not mutually exclusive: it is possible to combine them, though for the sake of simplicity we leave this as a topic for future work. 1 In practice, we include an additional diagonal term εI nd , with ε set to 10 −6 throughout, that aids numerical stability when inverting Σ prior . 2 There may be more hyperparameters if the kernels themselves contain tuneable parameters.
Finally, we choose the likelihood term, that is, the data observation model, to resemble the GLM from Equation (1), setting p y ð jX, wÞ = N y ð jXw, β −1 I m Þ ð 6Þ with X and w defined as before. We allow the model to learn the (inverse) measurement noise level β within the likelihood as an additional hyperparameter. With the prior and likelihood thus specified, we can use Bayes' rule to update our beliefs on the parameter distribution given some observed data (i.e., find the posterior distribution). In this case as we have a Gaussian prior and a Gaussian likelihood the posterior is also Gaussian and has the following closed-form solution (Bishop, 2007): where we have collected the covariance prior's hyperparameters into the vector α and have made the dependence of the prior covariance on these parameters explicit using the notation Σ(α) prior .
To estimate optimal values for α and β, we take the empirical Bayesian approach described in Huertas et al. (2017) and Marquand, Brammer, et al. (2014a) finding the α and β that maximize the marginal likelihood of the observed data under our assumed model structure. With the prior as in Equations (3)-(5) and the likelihood as in Equation (6) the log marginal likelihood becomes: where A = Σ α ð Þ − 1 prior + βX T X and m = βA −1 X T y. We used minimize, a conjugate gradients optimizer available within the gpml toolbox (Rasmussen & Nickisch, 2010), which uses partial derivatives of the marginal likelihood with respect to each of the hyperparameters. We optimized these variables in the log domain to ensure positivity (see Appendix for further details).
In this article we predict the biomarker value at each subject's mean baseline and final follow-up ages, taking the probabilistic approach of integrating over all possible posterior model parameters.
With the Gaussian posterior as in Equations (7)-(9) and the Gaussian likelihood p(y * | X * ,w) = N (y * |X * w, β −1 I m ) of observing predictions y * given input X * , the predictive distribution that results from integration has a closed form solution (Rasmussen, 2006): where A = Σ α ð Þ − 1 prior + βX T X and X * is an n × nd design matrix with a row encoding either the mean baseline age or final sample age in this case (or 2n × nd to predict both at once). In general, we can predict an arbitrary number of time-points per subject by modifying X * accordingly.
For all models, we standardize (i.e., z-score) the training data across all subjects (the longitudinal observations y and each nonconstant column of the design matrix X) as well as the out-of-sample testing data (X * , using the means and standard deviations from X) during model building and prediction. This ensures that all trajectories are modeled on a similar scale, which aids numerical stability when optimizing the hyperparameters. We rescale both the predictions and the estimated parameters back to their original dimensions for subsequent analysis (e.g., estimating annualized rates of change and group differences in parameters). With higher order models, the columns of X may be highly correlated, leading to unstable variance estimates. In such cases, one may orthogonalize the columns prior to fitting the model.
We use the log Bayes factor, a ratio of the logarithm of model evidences (i.e., marginal likelihoods) to compare biomarker-information based coupling to random-information based coupling. Log Bayes factors are a principled way of comparing Bayesian models, under the assumption that each model has the same prior probability (Penny, 2012;Penny, Stephan, Mechelli, & Friston, 2004).

| Simulations: Data generation
We created a simulation of subjects' trajectories that allowed us to compare several versions of our proposed model along with a baseline model. We simulated two scenarios: (a) linear trajectories with intercept variation: group differences in intercept with fixed slope and (ii) linear trajectories with slope variation: group differences in slope with fixed intercept. In both cases, we simulated 200 subjects' trajectories at each simulation run. To simulate intercept variation, for a given subject i we randomly selected an intercept w i0 from the set {−10, −8, −6, −4, −2} and a fixed slope of w i1 = − 1, and then randomly selected an initial measurement time t i1 between 0 and 10. We then generated three simulated samples with fixed time intervals: y(t i1 )= w i0 + w i1 t i1 + ε i1 = w i0 − t i1 + ε i1 , y(t i2 = t i1 + 0.05) where ε i1 , ε i2 , ε i3 are three independent measurement errors, each drawn from N(0, σ m ). We simulated four different levels of measurement noise, σ m = 1, 2, 4, 8. For each noise level, we made 30 simulation runs and evaluated nine different models (described below) on each run. We used the first two samples of each subject (t i1 , t i2 for subject i) to train each model and the third sample (t i3 for subject i) to evaluate out-of-sample prediction accuracy. The top row of Figure 1 provides an example of one simulation run for 200 subjects at each noise level.
To simulate slope variation, for a given subject i we randomly selected a slope w i1 from the set {−1.0, −1.5, −2.0, −2.5, −3.0} and a fixed intercept of w i1 = 0. We randomly selected t i1 as before and generated y(t i1 ), y(t i2 = t i1 + 0.05), and y(t i3 = t i1 + 0.10) with measurement errors ε i1 , ε i2 , ε i3 drawn from N(0, σ m ). We used the same four measurement noise levels and again made 30 simulation runs, linear both Allow both slope and intercept coupling via true biomarker Gaussian both Allow both slope and intercept coupling via true biomarker linear int Allow intercept coupling via true biomarker Gaussian int Allow intercept coupling via true biomarker linear slope Allow slope coupling via true biomarker Gaussian slope Allow slope coupling via true biomarker No biomarker based coupling None α 1 I 2n + α 11 I n + α 12 1 n ð Þ M 11 Note. Last column contains number of hyperparameters in given covariance prior. We evaluated the same nine models using the first two samples of each subject for training and the third for prediction. The bottom row of Figure 1 provides an example of one simulation run at each noise level.

| Simulations: Model building
We investigated how different variations of the model, with and without biomarker-kernel-based coupling (i.e., via the K j matrices described above), performed under the two parameter variation scenarios and the four measurement noise levels. We created a 200 × 1 biomarker vector b as a noisy measure of the group difference in the parameters, so that the i th element of b was b i = w i0 + ε i under intercept variation and b i = w i1 + ε i under slope variation, with ε i drawn from N(0, 1) in both cases. 3 Using this biomarker, we compared two commonly used similarity matrices: (a) a rank-one approximation K(b) linear = bb T (the outer product of b with itself); and (b) a squared exponential (SE) kernel (also referred to as a Gaussian radial basis in the i th row and j th column, where we make the dependence of these matrices on the input explicit. The term σ SE > 0 is a parameter that gives the kernel additional scaling flexibility. It is also possible to encode categorical (i.e., group) membership via a binary similarity matrices rather than an SE kernel, with one indicating two subjects belong to the same class and zero otherwise.
When using SE kernels, we treated the σ SE term as a covariance prior hyperparameter (in addition to the α's in Equations [4] and [5]).
We formed six different models using these two kernels: "linear both" and "Gaussian both" had covariance prior structure as in Equations (4) and (5), parameterizing full independence, full coupling and kernel-based coupling in both the intercept and slope parameters. The "linear int" and "Gaussian int" models restricted kernel-based coupling to the intercepts: referring to Equations (4) and (5) and assuming one coupling kernel K 1 , this means we allow a K 1 term in Σ c1 but not in Σ c2 . The "linear slope" and "Gaussian slope," in contrast, restricted kernel-based coupling to the slopes, allowing a K 1 term in Σ c2 but not in Σ c1 . These latter four models allowed us to test the effect of "oracle-like" (i.e., with perfect a priori) knowledge of simulation scenario: e.g., whether the two "int" models outperform other models in the variable intercept, fixed slope scenario. In such cases, biomarkerbased coupling of slopes in the intercepts variation case (or vice versa) is extraneous and may lead to spurious inference of group differences if a model is allowed to infer coupling where none exists.
We compared biomarker coupled models to several simpler coupled models. The simplest of these was the "OLS" model, an uncoupled model which asymptotically corresponds to a Bayesian model with a parameter covariance prior of α 1 I 2n with α 1 tending to infinity (i.e., a high prior uncertainty on all parameters for all subjects).
The second model, "plain," trades off fully independent and fully coupled covariance priors, without any kernel-based coupling. It is very similar to an LME model with random intercepts and random slopes. To understand the role of kernel-based coupling, we compared biomarker-kernel coupled models to random-information-kernel coupled models. We formed a 200 × 1 vector r with each element drawn from N(0, 1), so that each subject was assigned a random number, and formed another SE kernel K(r) SE based on it. We used this to create "random," parametrized exactly as "Gaussian both." See Table 1 for further model details.
We compared our approach to standard LME modeling using the LME implementation available in Freesurfer . 4 Specifically, we built two LME models, with fixed effects of baseline age and baseline biomarker value and either random intercepts (termed "LME: rI") or random intercepts and slopes ("LME: rI, rS").
We also compared our empirical Bayesian approach, which produces point estimates of hyperparameter priors (i.e., an estimate of prior means with zero variance) to a fully Bayesian approach, in which we place priors on the hyperparameters and estimate their posterior distribution. In this way, the fully Bayesian approach accounts for the hyperparameter uncertainty, which may improve parameter and prediction coverage by improving the estimation of their uncertainties. We compared the empirical Bayesian version of "plain," with five covariance hyperparameters (the α 0 s) and one inverse observation noise parameter (β) to a fully Bayesian model with the same covariance structure. We Finally, we performed additional simulations to test our assumption of independent measurement noise, parameterized by β in Equation (6), choosing five representative models in all cases: "Gaussian both" and "plain" MTL models plus the two LME models and "OLS." In the first set of simulations, we varied the amount of within-subject noise correlation by instead allowing a block diagonal noise structure. For each subject's measurements, we used a noise covariance with σ 2 m = 4 on the diagonal and all off-diagonal terms set to ρσ 2 m , so that we recover uncorrelated noise when ρ = 0 and perfectly correlated noise (i.e., the same for all observations over time) when ρ = 1. We simulated with three levels of ρ (0, 0.5, 0.75). In the second set of simulations used three levels of noise skewness: zero (Gaussian, as before), 0.6 (slightly skewed), and 0.9 (highly skewed), with σ 2 m = 4 in all cases.
We can thus compare the performance of models in terms of both their accuracy of predicting ground truth trajectories and how well they estimate model parameters. The latter objective is potentially important when there are group differences in trajectories, for example, when a disease group has a steeper rate of gray matter volume decline in a ROI compared to a healthy control group. In such a case, using linear models we should be able to detect a difference in the slope parameters across the groups and, assuming the decline starts from the same level, no corresponding difference in intercepts.
For each model, we evaluated the mean absolute error (MAE) of predicting subjects' held-out samples and quantified the accuracy of the inferred trajectory parameters (intercepts and slopes) via coverage probability and MAE measures. We defined the coverage probability as the fraction of times the true value of a parameter (intercept or slope) falls within two standard deviations of its estimated value, that is, within the posterior credible interval of the parameter. As this is a 95% credible interval a coverage probability of 0.95 is an ideal outcome. For the Bayesian MTL models, we can easily calculate these quantities using the posterior means and variances (Equations (7)- (9)). For the LME models, direct estimates of the posterior parameter variance were not available: we therefore estimated them by adding the fixed effect and random effect variance estimates of the intercept and slope when appropriate.
In practice, the coverage probability may not be sufficient for understanding how accurately a model estimates a parameter as a model may simply estimate a high enough variance so that the true value is always covered. For this reason, we also computed the error (MAE) between the estimated parameters (intercept, slope) and their known true values.  et al. (2018). Briefly, tracer uptakes in the cortical ROI were standardized to the uptake in a composite reference region following recommendations from (Landau et al., 2015). We also used amyloid-β, total tau and phosphorylated tau (pTau) from CSF measured at or before baseline as measures of severity of amyloid and tau pathology. Lastly, we retrieved subjects' apolipoprotein E (APOE) genetic information, particularly the number of ε2 and ε4 alleles (Harold et al., 2009;Liu, Kanekiyo, Xu, & Bu, 2013).

| ADNI application: Dataset
We parcellated the T1-weighted images using geodesic information flows (GIF; Cardoso et al., 2015), creating 20 cortical sub-lobe volumes from each image (see Figure S7 for the list of ROIs). We then normalized each of these ROIs using each subject's total intracranial volume (TIV). Normalized ROIs were subsequently used for longitudinal trajectory modeling and out-of-sample prediction. We withheld the final follow-up ROI from each model to test the out-of-sample prediction accuracy of our models.

| ADNI application: Model building
We built eight different types of models, detailed in Table 2, for each of the 20 regions for a total of 160 models.
We used first order (linear) polynomial models for all regions: previous work has shown this is a reasonable assumption for modeling cortical trajectories . Based on our simulations (see Results), we chose the "Gaussian both" type of model when using biomarker coupling, allowing both intercept and slope coupling, assuming no prior knowledge of the type of coupling that exists in the data. We formed four different kernels (K 1 , K 2 , K 3 , K 4 ) based on true biomarkers along with a fifth kernel based on a random biomarker-based kernel (K 5 = K(r) SE , r a vector of random values) as in the simulations. Kernels with AD, and CSF amyloid-β, which decreases in those with AD (Sunderland et al., 2003), log transformed to improve normality, and (c) b 3 , a vector encoding CSF pTau (Hampel et al., 2010).
To encode the APOE genetic similarity between subjects we used the weighted identity by state (weighted-IBS) kernel function as in Kwee et al., (Kwee, Liu, Lin, Ghosh, & Epstein, 2008): where the IBS ij, ε2 , IBS ij, ε4 terms (each taking values 0, 1, or 2) refer to the number of ε2 and ε4 alleles shared by subjects i and j. The inverse minor allele frequency (1/MAF) weights w ε2 , w ε4 serve to up-weight rarer SNPs. The range of this function is between zero and two. To better compare to the other SE kernels, we created an exponentiated version of this kernel function: that includes a scaling hyperparameter σ and has a range between zero and one. We formed the K 4 kernel matrix using this kernel function.
We also compared our approach to (Freesurfer-based) LME models. We built three LME models with fixed effects of age and baseline amyloid load (measured via PET SUVR, as in the "PET amyloid" MTL model) and either random intercepts (termed "Rand Int"), random intercepts and slopes ("Rand Int/Slp") or random intercepts, random slopes, and random amyloid ("Rand Int/Slp/Amyloid"). Models using SE kernel-based coupling ("Gaussian" type models) generally performed better than their linear kernel counterparts ("linear" type models). The advantage of the SE kernel in some cases may be attributable to the ability to tune the kernel width (the σ SE term) as an additional hyperparameter, which adds scale flexibility. "Gaussian both" was consistently among those with lowest MAE. We expected the oracle-like models ("int" type in top row, "slope" type in bottom, marked with asterisks in the figures) to outperform the other models, however, overall, they perform similarly to the other models in most cases. Importantly, all MTL based models (including "plain") outperform "OLS" by a large margin, roughly halving the error. Figure S1 depicts histograms of parameter estimates for both "plain" and "OLS" for a representative simulation run,

| Simulations: Results
showing that the Bayesian model shrinks both the slopes and intercepts to their respective group mean, decreasing the variance of the estimates considerably. The shrinkage also results in a small increase in bias, evidenced by the larger distance between the parameter means of The two LME models also performed well, with similar MAEs to the "Gaussian" models. Figure S2 depicts the corresponding prediction coverage probabilities, showing that both the MTL and LME models' predictions cover the true target value at close to the ideal rate of 0.95, again outperforming "OLS" by a large margin, especially at higher noise levels. We also observe that the simpler MTL models ("linear" models, along with "plain") have both high coverage in Figure S2 and relatively high MAE in Figure 2, meaning that, compared to the other MTL and LME models, they make relatively inaccurate predictions but estimate high enough measurement uncertainty to cover the true target value. while intercept coverage varied greatly across models and noise levels. The LME models generally did not cover the true intercept values as frequently as the MTL models, particularly the random intercepts model ("LME: rI"). The random intercepts, random slopes model ("LME: rI, rS") performed better at higher noise levels, but was generally outperformed by the MTL models. One possible explanation is that the MTL models explicitly model parameter uncertainty as part of their Bayesian formulation, while we have had to estimate the LME models' overall parameter uncertainty by combining the associated Allow coupling via all four true biomarkers K 1 , K 2 , K 3 , K 4 α 1 I 2n + P 2 i = 1 α i1 I n + α i2 1 n + PET amyloid Allow coupling via SUVR similarity K 1 = K(b 1 ) SE α 1 I 2n + P 2 i = 1 α i1 I n + α i2 1 n + α i3 K 1 ð Þ M ii 9 CSF tau/aBeta Allow coupling via tau/ aBeta similarity CSF pTau Allow coupling via pTau similarity APOE Allow coupling via APOE ε2, ε4 allele similarity random Allow coupling via random biomarker plain No biomarker based coupling None α 1 I 2n + (α 11 I n + α 12 1 n ) M 11 + (α 21 I n + α 22 1 n ) M 22 5

OLS
No coupling None α 1 I 2n , α 1 ! ∞ 0 Note. Last column contains number of hyperparameters in given covariance prior.
F I G U R E 2 Boxplots of log mean absolute errors (MAEs) of predictions of all models across all simulations runs for the two scenarios: intercept variation (top row) and slope variation (bottom row) for four levels of measurement noise (σ m ). Models with oracle-like prior information are marked with an asterisk (top row: "int" models; bottom row: "slope" models) [Color figure can be viewed at wileyonlinelibrary.com] F I G U R E 3 Boxplots of parameter coverage probabilities (i.e., fractions of times the true parameter value fell within the posterior credible region) and log mean absolute errors (MAE) between estimated and actual parameters for the two scenarios: intercept variation (top row) and slope variation (bottom row) for four measurement noise levels (σ m ). Models with oracle-like prior information are marked with an asterisk (top row: "int" models; bottom row: "slope" models) [Color figure can be viewed at wileyonlinelibrary.com] fixed and random effect uncertainties. In addition, the LME models also have higher parameter estimation errors (the intercept and slope log MAE figures in the top row), which measures the quality of parameter mean estimates rather than variances. Overall in this scenario the "Gaussian" models outperformed all others in terms of parameter coverage and estimation error while the "linear" models and "plain" were comparable to the LME models in terms of parameter estimation error.
In the second scenario, fixed intercept and varying slopes ( Figure 3, bottom row), the "Gaussian" models performed well in both parameter coverage and parameter estimation error; in this case, the two LME models also performed competitively. "LME: rI, rS" had consistently highest intercept and slope coverage and lowest parameter estimation errors at low measurement noise levels, reflecting the fact that the random slopes assumption is appropriate in this scenario.
However, this model's parameter estimation error, particularly the slope MAE, increases sharply with higher measurement noise levels while the "Gaussian" models are relatively unaffected.
The two simulation scenarios suggest that the "Gaussian" style MTL models are a good choice for both prediction and parameter inference and compare favorably with standard LME models in many cases. Among these, "Gaussian both" is appealing, as it makes no a priori assumptions on the type of coupling that exists within the data.
Therefore, we used this type of model throughout our experiments with the Alzheimer's study data.  Table S1 gives the convergence diagnostics for the posterior estimates of the hyperparameters of "MCMC plain" for one run of the intercept varying scenario. The estimates appear to have converged: all R values were at their ideal values of one, the number of effective samples (N eff ) was high in all cases and the MCSE's were small compared to the estimated posterior means, so that the 95% confidence intervals on the means did not cross zero. Figure S4 depicts results from the simulations that varied noise correlation while Figure S5 depicts those with varied skewness. In Figure S4, part A, we observe that the prediction related metrics are similar between "Gaussian both" and the two LME models in both simulations scenarios and that all models' prediction errors fall as noise correlation increases. Figure S4, part B shows corresponding parameter estimation metrics for both scenarios: "Gaussian both" outperforms the LME models on intercept coverage and parameter error in the intercept varying scenario (top row) but "LME: rI, rS" has near optimal coverage in the slope varying scenario (bottom row), where the random slopes assumption is appropriate. However, these two models are similar in terms of slope coverage and both intercept and slope estimation errors. Figure S5, part A depicts the three different levels of measurement error skewness that we used in the second set of simulations. In this case, we observe that varying skewness does not substantially affect any of the models' prediction or parameter estimation metrics. Again, in general "Gaussian both" and the two LME models have similar prediction coverages and errors (part B), while "Gaussian both" outperforms the LME models in parameter estimation in the intercepts varying scenario (part C, top) and has similarly good performance in the slopes varying scenario (part C, bottom).

| ADNI application: Results
The likelihood term in Equation (6) assumes that observations are normally distributed about their mean (i.e., the residuals are normal) and uncorrelated over time within each subject. We tested the impact of these assumptions on ADNI modeling by comparing the histograms of residuals for two models in Figure S6: "CSF tau/aBeta," which was representative of the biomarker-coupled MTL models, and "OLS," the uncoupled reference model. Across all regions, the residuals of "CSF tau/aBeta" are much closer to being normally distributed that those of "OLS." We also tested for heteroscedasticity, calculating the correlation of residuals to baseline age across subjects and found no significant correlation in both models for any region.
Figure 4 (top part) depicts the true and estimated annualized rates of change across the 20 cortical ROIs for four representative models ("OLS," "plain," "random," "multiple"). Note that there is no information exchange between ROIs; in its presented form our method is univariate, coupling across subjects within each ROI and modeling ROIs separately. We computed the true annualized rate of change by dividing the percentage change from baseline to final (held-out) follow-up by the number of elapsed years. We note that this true annualized rate is essentially a two-point OLS estimate and is therefore more a silver than a gold standard. We see most of the cortex degenerating by 0.33% (middle cingulate) to 1.3% (posterior insula) annually, with the lateral regions generally degenerating faster than the medial regions We computed the models' estimated annualized rates using their predictions of the held-out sample instead of the true held-out value.
Figure 4 (bottom) depicts the associated MAEs of these predictions, with the three MTL models ("plain," "random," "multiple") having lower MAEs than the "OLS" model across most ROIs. The two kernel coupled models ("random" and "multiple") have further decreased MAEs compared to "plain," though there is no further discernible difference in MAE between the two. icantly lower than those of "OLS" across all ROIs bar the lateral temporal region (where only "plain" and "random" have higher error than "OLS"); (b) there is further improvement due to kernel coupling, evidenced by significantly lower absolute errors in at least one model relative to "plain" in 13 out of 20 ROIs; (c) as in simulations, there is a small difference in error between random-information-based and biomarker-based kernel coupling, with some biomarker-based models having significantly lower error than "random" (within 10 ROIs: anterior insula, DLPFC, lateral occipital, lateral parietal, lateral temporal, medial parietal, medial temporal, posterior cingulate, supratemporal, and temporal pole regions). Statistical tests were Bonferroni corrected for 320 (eight models × 20 ROIs × 2 parameter types) comparisons.
Furthermore, Figure S8 depicts the MAEs of predicting annualized rate of change for three LME models (described in Methods) built using the same information as in "PET amyloid." All models had very similar MAEs across ROIs in this case. showing "CSF tau/aBeta," "PET amyloid" and "multiple" have the largest and most widespread improvements in model evidence. We also see that "multiple" is most similar to "PET amyloid," the best individual biomarker coupled model in terms of model evidence, providing some assurance that combining kernels works as expected. Figure 6 depicts the significance of diagnostic group differences (CN, MCI, or AD) in subjects' estimated parameters across OLS, LME, and MTL assessed via one-way analysis of variance (ANOVAs) and Bonferroni corrected for 480 (12 model comparisons × 20 ROIs × 2 parameter types) comparisons. The three models with highest model evidence are depicted in Figure 6 ("CSF tau/aBeta," "PET amyloid," "multiple"); Figure S9 depicts all MTL models, with Bonferroni correction for 320 comparisons.
In both figures, cross-sectional differences in predicted volumes at mean baseline age across subjects (73.5 years) are depicted instead of intercepts. Intercepts represent group differences at age zero (i.e., at birth) while we have measured and modeled cortical degeneration in older adults. The three MTL models in Figure 6 agree that there are significant cross-sectional disease-related differences in volumes across the cortex, with sparing of the sensorimotor and cingulate regions ( Figure 6, top row). The three LME models, in contrast, detect a less widespread and less significant pattern of cross-sectional differences than the MTL models while "OLS" detects even fewer cross-sectional differences.
Longitudinally, the bottom row of Figure 6 shows both "Rand Int" and "Rand Int/Slp" have almost no significant slope differences in any region, while "Rand Int/Slp/Amyloid" detects some significant differences within parts of the temporal lobe, insula and parietal regions, F I G U R E 6 Top: Significance of (cross-sectional) diagnostic group differences in predicted volume at mean baseline age (73.5 years) for OLS, LME, and selected MTL models bottom: same for (longitudinal) group differences in estimated slopes [Color figure can be viewed at wileyonlinelibrary.com] but does not detect the expected slope difference within the medial temporal lobe. The three MTL models, including "PET amyloid," which represents the fairest comparison to the LME models, have significant differences across the temporal lobe (including the medial temporal lobe), insula, orbitofrontal region and, in the case of "multiple," the lateral parietal region. Overall the MTL models infer a more plausible pattern of both cross-sectional and longitudinal disease effects than standard LME models.
It is reassuring that similar types of biological coupling (amyloid load measured via CSF and PET in "CSF tau/aBeta" and "PET amyloid" respectively) result in similar patterns of longitudinal differences. The longitudinal differences in the lateral parietal region detected by "multiple" may be due to its incorporation of all biomarker coupling priors: "APOE" and "CSF ptau" also show some differences within that region ( Figure S9, bottom row). In contrast to these biomarker-coupled models, "random" does not detect any significant slope differences while "OLS" detects few cross-sectional differences; neither model is plausible given other studies of AD-related atrophy (Frisoni, Fox, Jack, Scheltens, & Thompson, 2010;Risacher et al., 2010;Ziegler et al., 2015).
Interestingly, "plain" only detects longitudinal differences within the medial temporal and temporal pole, suggesting that while this model can reliably detect strong true effects (see simulations), it may not be as sensitive as models with additional prior information. Further to this, Figure S10 depicts data for two regions: the medial temporal region, where most models agree that there are both cross-sectional and longitudinal differences, and the lateral temporal region, where "plain" detects no longitudinal differences, though they are clearly evident in the figure. We further observe that "APOE" detects a similar though weaker pattern of longitudinal differences compared to the other biomarker coupled models, suggesting that coupling based on similarity of genetic AD risk, conferred at birth, is less informative than coupling based on levels of amyloid accumulated decades later in older adults.
We also tested for cortical differences in subjects with differing numbers of APOE ε4 alleles (either 0, 1, or 2), analyzing each diagnostic group separately. Figure S11 depicts group differences in number of alleles for "APOE," "PET amyloid," and "multiple." "OLS," "plain," and "random" showed no differences in neither baseline volumes nor slopes (data not shown) while "CSF ptau" and "CSF tau/aBeta" (not shown) had spatial patterns resembling that of "PET amyloid," which shows allele-related differences in temporo-parieto-frontal, insular and anterior cingulate regions within the MCI group. There is a more widespread pattern of slope differences in "APOE" and "multiple" which is consistent with Ziegler et al. (2015), who found slope differences within temporo-parieto-frontal cortical gray matter in stable MCI subjects. However, these models also find slope differences within the insula and anterior cingulate in MCI subjects that were not reported in that study.

| DISCUSSION
We have presented a multi-task learning based approach to modeling individuals' longitudinal biomarker trajectories, setting the learning of each trajectory as a "task" and using flexible covariance priors to couple tasks (i.e., subjects) during model training. Thanks to its parametric Bayesian formulation, our approach makes probabilistic predictions, infers distributions over parameters and allows for the comparison of competing models via model evidence. Using empirical Bayes (rather than time-consuming cross-validation), we showed how we can combine (a) fully decoupled models (i.e., individual-specific trajectory models; OLS-like); (b) fully coupled models (i.e., a common trajectory across subjects; LME-like); and (c) models coupled via one or more biomarker-based similarity matrices (i.e., kernels). In this way, our approach uses multi-kernel learning and capitalizes on different aspects of biology measured by different biomarkers, within a multitask learning framework.
We performed simulations of trajectories having group wise variations in intercept and slope, showing that even the simplest version of our proposed model ("plain," mixing decoupled and fully coupled models) dramatically outperforms decoupled models ("OLS") in terms of predictive accuracy. We achieved further reductions in prediction error by adding kernel-based coupling using both random information and biomarkers (in various configurations, with and without oracle-like knowledge of simulation scenario). Interestingly, random-informationbased kernels performed almost as well as the biomarker-based kernels ( Figure 2), though the biomarker-based models ("Gaussian both" and the oracle-like models of each scenario) had better inference of true group differences ( Figure 3). As such, biomarker-based models are the better choice for accurately making predictions and inferring parameters. We further conclude that "Gaussian both," which allows biomarker-based coupling in all parameter types (e.g., intercepts and slopes) is a better choice than "linear both" in terms of predictive performance and parameter inference. We emphasize the importance of parameter inference for both scientific (e.g., model interpretation) and translational purposes (e.g., trajectory parameter-based biomarkers such as ROI rates of change).  Huertas et al., 2017).
We tested the assumption of independent measurement noise, parameterized by β in Equation (6), with additional simulations. In general, prediction and parameter estimation errors were similar to or lower than LME models across varying noise correlations ( Figure S4).
However, some degradation of parameter coverage was evident, particularly at the highest noise correlation level, suggesting that a more general parameterization of the measurement error covariance (e.g., a block diagonal form allowing within-subject correlation) may be necessary in some situations. We also explored the effect of non-Gaussian distributed measurement error, finding that in general both the MTL and LME models were robust to error skewness in terms of both predictions and parameter estimates ( Figure S5).
We applied the model to longitudinal data from the ADNI study, modeling trajectories of cortical ROIs across CN, MCI, and AD subjects using kernels formed from amyloid PET, CSF, and genetic (APOE) information. We showed degeneration throughout the cortex, with lateral regions degenerating faster than medial regions (Figure 4).
We showed significantly decreased prediction errors due to coupling ("plain" vs. "OLS") and further decreases when adding kernel-based coupling, with a small difference between the random-information versus biomarker coupled models in some ROIs (Figures 4 and S7).
Our model offers improved interpretability and more concrete biological explanations of trajectory differences across diagnostic groups compared to the baseline models. Here, we were mainly interested in understanding how cortical degeneration varies across diagnostic groups, which required that we carefully interpret the patterns of group differences ( Figures 6 and S9). Single biomarker models based on either "PET amyloid" or "CSF tau/aBeta" had cross-sectional and longitudinal group differences that were consistent with the literature and had the most evidence in their favor ( Figure 5). Importantly, this analysis showed the benefit in coupling cortical trajectories based on baseline measures of amyloid deposition measured via PET or amyloid-to-tau ratio via CSF, which is consistent with the prevailing disease progression model of AD in which amyloid deposition precedes change in brain structure . Coupling based on genetic risk for AD as realized by the APOE genotype was inferior to using baseline amyloid-based biomarkers. This agrees with the literature in that APOE genotype is the genetic risk and amyloid biomarker levels represent the realization of that risk. Furthermore, we showed that combining multiple kernels is effective in the sense that "multiple," the multi-modal model, was as good as the best individual model in terms of model evidence and parameter inference. Thus, our approach removes the requirement to pre-select any specific biomarker.
All coupled models had significant diagnostic group differences across the cortex at mean baseline age, agreeing with the pattern of later-stage neurofibrillary changes due to AD that have been shown to be detectable via MRI (Braak & Braak, 1991;Whitwell et al., 2008), along with many of the AD discrimination studies that have used cross-sectional structural MRI based features (Arbabshirani, Plis, Sui, & Calhoun, 2016). In particular, the pattern of cross-sectional differences we find aligns with Karas et al., (Karas et al., 2003), which reported AD-related differences within the temporal lobe and insula, with sparing of the sensorimotor cortex. We also found no significant differences within the motor and sensory ROIs, supporting the idea that sensorimotor function is relatively spared in AD, unless the disease is very advanced (Ferreri et al., 2016;Suvà et al., 1999). Among the models with high model evidence in their favor, namely "CSF tau/aBeta," "PET amyloid," and "multiple," there were significant longitudinal (i.e., slope) differences within the temporal lobe, orbitofrontal region, insula and lateral parietal region. These findings are similar to the patterns of group differences in 1 year atrophy depicted in Risacher et al. (2010), although the authors did not focus on their apparent findings within the insula. Insel et al. (2015) identified changes within the insula and temporal regions occurring prior to the clinical threshold for amyloid-β positivity, and interestingly, we detect longitudinal differences in these regions with models that couple based on similarity of protein measures.
In addition to clinical diagnosis, we also investigated the effect of APOE ε4, the major genetic risk factor for late-onset AD, analyzing differences in subjects grouped by number of ε4 alleles ( Figure S11).
We found no cross-sectional volume differences at mean baseline age within each group and few significant longitudinal differences within the CN and AD groups. The CN finding is consistent with the literature: Filippini et al. (2009) found no volumetric differences within the brain between young, healthy ε4 carriers and matched non-carriers using cross-sectional information while Raz et al. (2010) found no differences due to ε4 within healthy middle-aged and older adults using longitudinal data. Our findings indicate similar homogeneity within AD subjects. Within the MCI group we found a temporo-parietal-frontal pattern of slope differences that aligned with previous literature  along with additional slope differences with the insula and anterior cingulate. We note that the findings within this group may be due to both the larger sample size and greater heterogeneity of the MCI group compared to the CN and AD groups.
We also compared our novel MTL approach to a widely available LME implementation, showing that MTL makes very similar prediction errors on the held-out ADNI follow-ups ( Figure S8). However, MTL detected more widespread cross-sectional group differences than the three LME models we considered and, importantly, more significant longitudinal differences within the temporal lobe ( Figure 6). As such the MTL based parameters appear to be more plausible than the LME based parameters. Additionally, our method automatically finds the covariance structure that best explains the training data (within the limit of the chosen parameterization), removing modeling decisions such as whether a variable is or is not a random effect.
The approach we have presented has several limitations, however. There are multiple directions for future work. As mentioned in the introduction, the method we present is not a disease progression model and as such it does not estimate a disease stage for each subject. It does, however, provide plausible estimates of trajectory parameters, which may serve as valuable inputs to a staging model. Future work will investigate the staging of subjects based on these parameters within an EBM (Young et al., 2014), providing insight into the role of brain structural changes during the progression from normal cognition to Alzheimer's disease . We can also extend our understanding of the relationship between genetics and cortical atrophy beyond APOE to all single nucleotide polymorphisms (SNPs) using multivariate methods such as partial least squares (PLS; Lorenzi et al., 2018) or canonical correlation analysis (Szefer, Lu, Nathoo, Beg, & Graham, 2017). Finally, we can generalize the approach to simultaneously model multiple variables across subjects (i.e., multi-output learning), where interesting modeling possibilities (coupling parameters across variables within and between subjects) and computational challenges abound. It is important to note that benefits of such an approach depend on whether there are strong multivariate relationships that can be modeled through either correlated parameters or errors. For example, Marinescu et al., (2019) show that there are widespread patterns in neurodegenerative disease progression that can be modeled via spatial coupling, while earlier studies showed only a modest benefit of this type of coupling relative to the computational effort involved (Marquand, Brammer, et al., 2014a;Marquand, Brammer, et al., 2014;Zhang & Shen, 2012).

ACKNOWLEDGMENTS
where α i is an element within vector α, b = (I nd − βA −1 X T X)A −1 X T y and prior c = − βA − 1 FA −1 X T y and the ∂Σ(α) prior /∂α i depends on the α i in question. We can easily change the parameterization of Σ(α) prior without breaking these equations, provided it remains invertible and differentiable with respect to its parameters.