Comparison of single and multitask learning for predicting cognitive decline based on MRI data

The Alzheimer's Disease Assessment Scale-Cognitive subscale (ADAS-Cog) is a neuropsychological tool that has been designed to assess the severity of cognitive symptoms of dementia. Personalized prediction of the changes in ADAS-Cog scores could help in timing therapeutic interventions in dementia and at-risk populations. In the present work, we compared single and multitask learning approaches to predict the changes in ADAS-Cog scores based on T1-weighted anatomical magnetic resonance imaging (MRI). In contrast to most machine learning-based prediction methods for ADAS-Cog changes, we stratified the subjects based on their baseline diagnoses and evaluated the prediction performances in each group. Our experiments indicated a positive relationship between the predicted and observed ADAS-Cog score changes in each diagnostic group, suggesting that T1-weighted MRI has a predictive value for evaluating cognitive decline in the entire AD continuum. We further studied whether correction of the differences in the magnetic field strength of MRI would improve the ADAS-Cog score prediction. The partial least squares-based domain adaptation improved the prediction performance, but the improvement was marginal. In summary, this study demonstrated that ADAS-Cog change could be, to some extent, predicted based on anatomical MRI. Based on this study, the recommended method for learning the predictive models is a single-task regularized linear regression due to its simplicity and good performance. It appears important to combine the training data across all subject groups for the most effective predictive models.


I. INTRODUCTION
Alzheimer's disease (AD) is a chronic neurodegenerative disorder and a major health burden, with 152 million people expected to suffer from AD by 2050 [1]. Pathophysiological changes in AD begin many years prior to clinical manifestations of disease, and the spectrum of AD spans from clinically asymptomatic to severely impaired [2]. Because of this, there is an appreciation that AD should not only be viewed with discrete and defined clinical stages, but also as a multifaceted process moving along a continuum. Mild cognitive impairment (MCI) is an essential concept along this continuum, representing a transitional stage between healthy elderly individuals and AD [3]. Approximately 10% to 20% of MCI patients tend to progress to AD annually, whereas others will continue with cognitive decline or even revert to normal cognition (NC) [4]. Many treatment strategies have been proposed to decelerate AD, with limited success [5]; one problem is that treatments are not administered within the correct time window along the AD continuum. Therefore, early prediction of disease progression is a crucial step towards better therapies, unburdening the health care system, and preventing adverse events caused by AD [6], [7].
arXiv:2109.10266v1 [cs.LG] 21 Sep 2021
Cognitive test batteries have been developed to assess the cognitive decline of individuals. Two of the most commonly used standards are the Mini-Mental State Examination (MMSE) [8] and the Alzheimer's Disease Assessment Scale-Cognitive subscale (ADAS-Cog) [9], which are important criteria for the clinical diagnosis of AD. In the evaluation of cognitive decline due to dementia, ADAS-Cog is considered to be more sensitive and reliable than MMSE [10]. The ADAS-Cog is widely used to evaluate the cognitive state of patients with mild to advanced AD. The modified ADAS-Cog 13 contains 13 items for assessing cognitive dysfunction, with a total score ranging from 0 to 85 and higher scores indicating greater dysfunction [11].
For the above-stated reasons, machine learning (ML) for predicting ADAS-Cog scores, as opposed to the diagnosis (see [12] for a recent review), has been gaining research interest. For example, Utsumi et al. [13] found that the combination of two variants of personalized Gaussian process models can improve the accuracy of predicting future ADAS-Cog13 scores using a limited set of subjects with multimodal data, which included imaging biomarkers (MRI, diffusion tensor imaging), cerebrospinal fluid (CSF) biomarkers, demographics, genetics, and cognitive scores. Zhu et al. [14] concluded that the canonical feature selection method had a significant effect on improving the performance of sparse multitask learning (MTL) to predict the clinical scores of ADAS-Cog. Prakash et al. [15] utilized multivariate regression techniques and determined that longitudinal prediction of AD progression is possible with multimodal data from the baseline, which included MRI, positron emission tomography (PET), CSF biomarkers, cognitive scores, and APOE. Tsao et al. [16] found that leveraging hippocampal surface features together with multimodal data, which included sex, age, MRI, APOE, and baseline MMSE, might boost the prediction of cognitive scores, such as MMSE and clinical dementia rating (CDR). However, PET is far less commonly used and more expensive than MRI [17], and obtaining CSF is highly invasive.
Accordingly, an interesting approach to biomarkers is the use of only standard T1-weighted MRI to predict the progression rate of dementia. MRI enables the noninvasive study of various aspects of the human brain to detect biomarkers associated with AD [18], and it is a widely available imaging modality. Indeed, several studies have investigated the association between cognitive scores and MRI biomarkers, such as gray matter volume [19], [20] and cortical and subcortical volumes [21], [22]. Lei et al. [23] studied the relationship between MRI data and cognitive scores by introducing a framework that includes correntropy regularized joint learning and a deep polynomial network for feature construction, as well as ensemble learning based on support vector regression for the prediction of cognitive scores. Bhagwat et al. [24] proposed an artificial neural network model for predicting cognitive scores from the cortical thickness and hippocampal subfield volumes. Jiang et al. [25] proposed a novel MTL formulation that considers a correlation-aware sparse and low-rank constrained regularization in order to explore the relationship between the MRI features and cognitive scores. Huang et al. [26] presented a random forest (RF) with sparse regression and soft-split technique, which adopted probabilistic paths during the testing stage in RF to predict cognitive scores at multiple time points. Zhou et al. [27] proposed two MTL formulations based on a temporal group Lasso regularizer and the convex fused sparse group Lasso, which utilize the common temporal patterns of MRI biomarkers to predict disease progression measured by the cognitive scores. To date, most existing progression models focus on predicting cognitive scores derived from the entire AD continuum, from healthy elderly to moderate AD, using a single model, e.g., [28]-[31], with the exception of Duchesne et al. [32], who studied the relationship between MRI and one-year MMSE changes in the MCI population.
However, there is little evidence that this one-size-fits-all strategy would be optimal. Moreover, individuals at different stages of the continuum can be expected to regress differently (for example, most cognitively normal individuals are likely to be cognitively normal after three years, while most AD patients are expected to have regressed during that time period). Therefore, evaluating the prediction models using the whole-continuum data leads to results that are perhaps hard to interpret, and we argue that the prediction models should be evaluated while stratifying the subject population based on the baseline diagnosis.
The evaluation of the prediction models while stratifying for the baseline diagnosis leads to a question of whether the other diagnostic groups are still useful when training the predictive models. This subsequently leads to the consideration of MTL approaches to improve the generalization performance by simultaneously solving multiple learning tasks while exploiting commonalities and differences across tasks [33]. One of the critical issues in MTL is to identify the essential relatedness between the tasks and to build learning models in order to obtain this task relatedness. MTL approaches with sparsity-inducing regularization have been studied to investigate the prediction of cognitive measures. For example, Tabarestani et al. [34] applied $\ell_1$-norm regularization to introduce sparsity among all features that could select a small subset of features to predict MMSE at six time points. Zhou et al. [27] and Lei et al. [35] employed joint sparsity regularization ($\ell_{2,1}$-norm) in order to share a common subset of features for all tasks simultaneously, where each task refers to AD progression prediction at a single time point. Wang et al. [36] formulated the progression of AD as a weakly supervised temporal multitask matrix regression framework that considers the prediction of cognitive scores at each time point as a regression task. However, MTL has not been studied in cases where different tasks correspond to cognitive score prediction of different diagnostic groups within the AD continuum. In addition, MTL has not been studied for adapting predictive modeling for differences in MRI acquisition.
In this study, we explore whether MRI at the baseline can potentially predict changes in ADAS-Cog scores while stratifying the population based on the baseline diagnosis. We will compare single and multitask learning approaches for the task. We will also address multitask learning in the presence of differences in terms of MRI acquisition; in this case, MRIs have been acquired using two different magnetic field strengths (MFSs). We compare MTL to two heterogeneity reduction approaches, partial least squares (PLS) domain adaptation [37] and ComBat [38].

A. ADNI DATASET
Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu) 1 . The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessments can be combined to measure the progression of MCI and early AD.

B. SUBJECTS AND MRI
The data in this study included baseline MRIs from 1376 ADNI subjects (430 NC, 662 MCI, 284 AD), aged 54 to 91 years old. Detailed characteristics of the subjects are presented in Table 1. For these subjects, the baseline MRI data were obtained with a T1-weighted MP-RAGE sequence at 1.5 T (with a 256 × 256 × 170 acquisition matrix and a voxel size of 1.25 × 1.25 × 1.2 mm³) and 3.0 T (with a 256 × 256 × 170 acquisition matrix and a voxel size of 1.0 × 1.0 × 1.2 mm³). Specifically, 808 subjects were from the ADNI-1 cohort with the MRI acquired at 1.5 T, and 571 subjects were from the ADNI-2 cohort with the MRI acquired at 3.0 T. As seen in Table 1, the number of subjects with ADAS-Cog-13 scores decreased during the follow-up due to subject drop-out and missing data. The roster identification numbers (RIDs) of the subjects employed in this study are provided in the Supplementary Material.

C. IMAGE PREPROCESSING
The preprocessing of the T1-weighted images was performed using the fully automated CAT12 package running under MATLAB 2 . T1-weighted images were first denoised by using adaptive nonlocal means filtering [39], then they were corrected for bias field inhomogeneities and segmented into gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF) [40]. After segmentation, partial volume estimation (PVE) with a simplified mixed model with a maximum of two tissue types was performed, resulting in maps of the tissue type densities [41]. Furthermore, the segmented images were spatially normalized into the standard MNI space by utilizing the high-dimensional DARTEL normalization algorithm [42]. This procedure resulted in spatially aligned maps for tissue fractions of WM and GM. We only utilized the GM images in this study. Finally, we averaged the gray matter density values according to the brain regions defined by the AAL atlas, resulting in 122 regional GM density values.
1 For up-to-date information, see https://www.adni-info.org.

D. ADAS-COG SCORE
The ADAS-Cog was developed as an outcome measure to assess the severity of cognitive dysfunction in AD. We consider ADAS-Cog-13, which yields a measure of cognitive performance by combining the original tasks of the ADAS-Cog-11 [43] (subject-completed tests and observer-based assessments) as well as a test of delayed word recall and a number cancellation or maze task [9]. The ADAS-Cog-13, later referred to as ADAS-Cog, scores are in the range of 0 to 85, with higher scores indicating more severe impairment [44]. The ADAS-Cog scores used in this study were acquired once a year as described in the ADNI General Procedures Manuals 3 .

E. OVERVIEW OF THE METHODS
We aim to predict the change in the ADAS-Cog score ($\Delta_t \mathrm{ADAS}_i$) for each subject $i$ and each time point ($t$ = 12, 24, 36 months) based on regional, MRI-derived gray matter density values at the baseline ($t$ = 0 months). The change in the ADAS-Cog score is defined as $\Delta_t \mathrm{ADAS}_i = \mathrm{ADAS}_i(t) - \mathrm{ADAS}_i(0)$, where $\mathrm{ADAS}_i(t)$ is the ADAS score of subject $i$ at time $t$. We frame this as a single-task (Fig. 1(A)) or multitask (Fig. 1(B)) prediction problem, where different tasks correspond to the groups of different, but related, subjects (three diagnostic groups and/or two magnetic field strengths used to acquire the data). More formally, we build $S$ prediction models:

$$\Delta_t \mathrm{ADAS}_i = f_d(x_i), \quad d = 1, \ldots, S,$$

where $x_i$ denotes the MRI (122 regional gray matter density values) of subject $i$ at baseline and $f_d$ is the prediction model for subject group $d$. We consider the cases where $S = 3$, where the groups are determined based on the baseline diagnosis (NC, MCI, and AD), and $S = 6$, where the groups are determined based on the baseline diagnosis and MFS (NC at 1.5 T, NC at 3 T, MCI at 1.5 T, MCI at 3 T, AD at 1.5 T, and AD at 3 T). Multitask learning aims to enhance the precision of learning algorithms by jointly learning the models for multiple tasks. This learning approach works well, especially if the tasks have some commonalities, as we expect the $f_d$ to have. We compare multitask learning to harmonizing MRI acquired with different field strengths. We do this by adopting a recent domain adaptation technique [37] and using ComBat data harmonization [45]. The algorithms compared in this work are summarized in Table 2. Supplementary Figure 1 delineates the workflow of the applied methods for predicting the change of ADAS-Cog scores.
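The target and task definitions above can be illustrated with a short sketch. It assumes that the change score is the follow-up score minus the baseline score, and it uses hypothetical toy records (the dictionary keys `dx`, `mfs`, and `adas` are illustrative names, not ADNI field names).

```python
from collections import defaultdict


def delta_adas(scores_by_month, months=(12, 24, 36)):
    """Return {t: ADAS(t) - ADAS(0)} for every follow-up t that was observed."""
    baseline = scores_by_month[0]
    return {t: scores_by_month[t] - baseline
            for t in months if t in scores_by_month}


def task_label(diagnosis, field_strength, split_by_mfs=False):
    """Task index d: diagnosis only (S = 3) or diagnosis x field strength (S = 6)."""
    return (diagnosis, field_strength) if split_by_mfs else diagnosis


# Hypothetical toy records, not ADNI data.
subjects = [
    {"dx": "NC", "mfs": 1.5, "adas": {0: 8.0, 12: 9.0, 24: 9.5}},
    {"dx": "MCI", "mfs": 3.0, "adas": {0: 15.0, 12: 18.0, 24: 22.0, 36: 25.0}},
    {"dx": "AD", "mfs": 1.5, "adas": {0: 30.0, 12: 36.0}},
]

# Group the per-subject targets by task, here with S = 3.
tasks = defaultdict(list)
for s in subjects:
    tasks[task_label(s["dx"], s["mfs"])].append(delta_adas(s["adas"]))
```

Subjects with missing follow-up visits simply contribute fewer time points, which mirrors the decreasing number of subjects over the follow-up.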

F. PENALIZED LINEAR REGRESSION
In the simplest cases, we assume that either 1) the $f_d$ are independent or 2) the $f_d$ are equal, and we can proceed with single-task learning. As a single-task learning algorithm, we use least-squares linear regression with an elastic net penalty to predict the ADAS-Cog changes. More formally, the ∆ADAS scores are predicted by solving the following linear regression problems:

$$\Delta\mathrm{ADAS}_i = a_d^T x_i + b_d + \epsilon_i,$$

where $i$ refers to a subject, $a_d$ and $b_d$ are the model parameters for the task $d$, and $\epsilon_i$ is the error term. Adding the elastic net penalty, the model is solved by minimizing the following elastic net cost function:

$$\frac{1}{2N_d} \sum_{i=1}^{N_d} \left(\Delta\mathrm{ADAS}_i - a_d^T x_i - b_d\right)^2 + \lambda \left[ (1 - \alpha)\,\frac{\|a_d\|_2^2}{2} + \alpha \|a_d\|_1 \right],$$

where $N_d$ is the number of training samples, $\lambda$ is the complexity parameter found by cross-validation, $\alpha \in [0, 1]$ defines the compromise between the ridge penalty $\|a_d\|_2^2/2$ and the lasso penalty $\|a_d\|_1$, and $\|\cdot\|_1$ denotes the L1-norm. Here, we selected $\alpha = 0.5$ to give equal weights for the ridge and lasso penalties. We modeled each time point ($t$ = 12, 24, 36 months) separately.
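As a rough illustration of the cost function described above, the following sketch evaluates a squared-error loss scaled by 1/(2N) plus an elastic net penalty for given parameters. The scaling follows the glmnet convention, which is an assumption here, and the function name `elastic_net_cost` and the toy data are illustrative only; this evaluates the objective and does not fit a model.

```python
def elastic_net_cost(X, y, a, b, lam, alpha=0.5):
    """Squared-error loss / (2N) plus the elastic net penalty on the weights a."""
    n = len(y)
    # Residual of the linear model a^T x + b for every subject.
    residuals = [yi - (sum(aj * xj for aj, xj in zip(a, xi)) + b)
                 for xi, yi in zip(X, y)]
    loss = sum(r * r for r in residuals) / (2 * n)
    ridge = sum(aj * aj for aj in a) / 2      # ||a||_2^2 / 2
    lasso = sum(abs(aj) for aj in a)          # ||a||_1
    return loss + lam * ((1 - alpha) * ridge + alpha * lasso)


# Toy data (hypothetical): three subjects, two features.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]

unpenalized = elastic_net_cost(X, y, a=[1.0, 2.0], b=0.0, lam=0.0)
penalized = elastic_net_cost(X, y, a=[1.0, 2.0], b=0.0, lam=1.0, alpha=0.5)
```

With `alpha=0.5`, the ridge and lasso terms receive equal weight, matching the choice made in the text.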

G. MULTITASK LINEAR REGRESSION
Multitask regression incorporates the shared information among different tasks (task relatedness) into the regression model. To introduce the multitask regression approaches, we need to introduce additional notation. Let $X_d$ be an $N_d \times M$ matrix of the input MRI data at the baseline corresponding to task $d$, where $N_d$ is the number of subjects in group $d$, and $M$ is the dimensionality of the feature space. We formulate the multitask learning as minimization of the penalized empirical loss:

$$\min_{W} \; L(W) + E(W),$$

where $W \in \mathbb{R}^{M \times S}$ is the weight matrix, which is estimated from the feature matrices $X_d$ and ∆ADAS-Cog scores in the training set, $L(W)$ is the empirical loss on the training set, and $E(W)$ is the regularization term that encodes the relatedness of different tasks. Moreover, we use $w_d$ to denote the weights related to the group $d$, i.e., $W = [w_1, \ldots, w_S]$. We then use the least squares loss as the empirical loss function for the regression tasks:

$$L(W) = \sum_{d=1}^{S} \left\| X_d w_d + b_{d,t}\mathbf{1} - y_{d,t} \right\|_2^2,$$

where $y_{d,t}$ is a shorthand for the $\Delta_t \mathrm{ADAS}$ scores of the $N_d$ subjects in group $d$ at the time point $t$ (12, 24, or 36 months), $b_{d,t}$ are the bias terms, and $\mathbf{1}$ is an $N_d$-element vector of ones. Different regularization terms were used to encode different assumptions on task relatedness [46]-[49]. The regularization terms are described in the following, where we drop the subindex $t$ as we separately consider the prediction at three time points.

1) Multitask Lasso
Multitask Lasso is a simple generalization of the elastic net penalty [50] to multitask regression. The regularization term is defined as:

$$E(W) = \rho_1 \|W\|_1 + \rho_{L2} \|W\|_F^2,$$

where $\rho_1$ is a regularization parameter for controlling the sparsity among all tasks and $\rho_{L2}$ is an optional regularization parameter that controls the 2-norm penalty.

2) Joint feature selection
Joint feature selection is used to constrain all models to share a common set of features [46], [51]. This goal is achieved by using the following $\ell_{2,1}$-norm regularization term as $E(W)$ [52], [53]:

$$E(W) = \rho_1 \|W\|_{2,1} + \rho_{L2} \|W\|_F^2,$$

where $\|W\|_{2,1} = \sum_{j=1}^{M} \sqrt{\sum_{d=1}^{S} w_{dj}^2}$, with $w_{dj}$ denoting the weight of feature $j$ in task $d$, is the group sparse penalty, $\rho_1$ is a regularization parameter for controlling the sparsity among all tasks, and $\rho_{L2}$ is an optional regularization parameter that controls the 2-norm penalty.

3) Dirty model for multitask learning
The Dirty model estimates a superposition of two sets of parameters and regularizes them differently [54]. In more detail, each $w_d$ is written as a sum $w_d = s_d + r_d$, where the corresponding matrices $S$ and $R$ are encouraged to have elementwise sparsity and block-structured row sparsity, respectively. This leads to the minimization problem

$$\min_{S, R} \; L(S + R) + \rho_1 \|R\|_{1,\infty} + \rho_2 \|S\|_1, \tag{8}$$

where $\|R\|_{1,\infty} = \sum_j \max_d |r_{dj}|$. The final output is $\hat{W} = \hat{S} + \hat{R}$, where $\hat{S}$ and $\hat{R}$ are the minimizers of (8).
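To make the difference between the sparsity-inducing penalties more concrete, the following sketch evaluates the elementwise 1-norm, the 2,1-norm, and the 1,∞-norm on a small weight matrix. The layout (rows are features, columns are tasks) and the toy values are assumptions made for illustration.

```python
def l1_norm(W):
    """Elementwise sparsity penalty: sum of absolute values of all weights."""
    return sum(abs(v) for row in W for v in row)


def l21_norm(W):
    """Group sparsity penalty: sum over features of the Euclidean norm across tasks."""
    return sum(sum(v * v for v in row) ** 0.5 for row in W)


def l1_inf_norm(W):
    """Block sparsity penalty: sum over features of the largest absolute weight across tasks."""
    return sum(max(abs(v) for v in row) for row in W)


# Rows are features, columns are tasks (an assumed layout, hypothetical values).
W = [[3.0, 4.0, 0.0],
     [0.0, 0.0, 0.0],
     [1.0, -1.0, 1.0]]

penalties = {
    "l1": l1_norm(W),
    "l21": l21_norm(W),
    "l1_inf": l1_inf_norm(W),
}
```

The second feature, which is zero for every task, contributes nothing to any of the penalties; the two group penalties reward exactly this kind of feature that is switched off for all tasks at once.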

4) Low rank assumption
The task commonalities can be utilized to constrain the prediction models from different tasks to share a low-dimensional subspace, i.e., constrain $W$ to be of low rank [55]. Since rank optimization is, in general, NP-hard, the rank function [56] is replaced by the trace norm [47], [48], which is given by the sum of the singular values of $W$: $\|W\|_* = \sum_d \sigma_d(W)$, and the regularization term is

$$E(W) = \rho_1 \|W\|_*,$$

where the regularization parameter $\rho_1$ controls the rank of $W$.
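A small worked example, added here for illustration with hypothetical matrices, shows why the trace norm favors low rank. Both matrices below have the same Frobenius norm, but the rank-one matrix, in which the two tasks share identical weights, has the smaller trace norm.

```latex
\begin{align*}
W_1 &= \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, &
W_2 &= \sqrt{2}\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, &
\|W_1\|_F &= \|W_2\|_F = 2, \\
\sigma(W_1) &= (2,\, 0), &
\|W_1\|_* &= 2, \\
\sigma(W_2) &= (\sqrt{2},\, \sqrt{2}), &
\|W_2\|_* &= 2\sqrt{2} \approx 2.83 .
\end{align*}
```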

H. MAGNETIC FIELD STRENGTH HARMONIZATION
To correct for differences in features caused by imaging at two MFSs in ADNI1 and ADNI2, we studied three different techniques: 1) PLS-based domain adaptation introduced in [37], 2) ComBat harmonization originating from genetics [38], which has become widely used in brain imaging [45], [57], and 3) multitask learning.

1) Partial least squares domain adaptation
We next briefly explain the PLS-based domain adaptation method, introduced in [37] for correcting the site dependency of cortical thickness measurements in predicting the severity of autism spectrum disorder. PLS is a linear feature transformation method for modeling relations between two sets of observed variables. The idea of PLS-based domain adaptation is to project the input features into a new (lower-dimensional) space, which is a product of two orthogonal subspaces: a subspace that is dependent on MFSs and its (orthogonal) complement. This can be achieved by applying a PLS algorithm that ensures the orthogonality of the resulting latent features, with the response variable being a codification of the scanner MFS and the predictor variables being the original features. However, as demonstrated in [37], this works best when different brain regions are corrected separately. Therefore, [37] suggested a two-stage strategy in which a PLS-based domain adaptation is first performed for each brain region separately, and the prediction task is then performed for each brain region. These predictions are then combined in a stacking framework. We applied the AAL atlas to decompose the gray matter density values into 122 distinct regions. For the prediction of ∆ADAS for each brain region, we utilized support vector regression (with a radial basis function kernel), as suggested in [37]. Finally, the per-region predictions were combined using the elastic-net penalized linear regression described in Section II-F.
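The core idea of separating a scanner-dependent direction from the remaining feature variation can be sketched with a single NIPALS-style PLS1 component followed by deflation. This is a simplified illustration under stated assumptions, not the procedure of [37]: it extracts only one component, uses a 0/1 coding of the field strength as the response, and the function names and toy values are hypothetical.

```python
def center_columns(X):
    """Subtract the column mean from every feature."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[v - m for v, m in zip(row, means)] for row in X]


def remove_one_pls_component(X, y):
    """Extract one PLS1 component with y as response, then deflate X."""
    Xc = center_columns(X)
    y_mean = sum(y) / len(y)
    yc = [v - y_mean for v in y]
    n_feat = len(Xc[0])
    # Weight vector: feature-space direction with maximal covariance with y.
    w = [sum(row[j] * yi for row, yi in zip(Xc, yc)) for j in range(n_feat)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Latent score of every subject along that direction.
    t = [sum(wj * xj for wj, xj in zip(w, row)) for row in Xc]
    tt = sum(v * v for v in t)
    # Loadings, then deflation: X_res = Xc - t p^T is orthogonal to t.
    p = [sum(row[j] * ti for row, ti in zip(Xc, t)) / tt for j in range(n_feat)]
    X_res = [[xj - ti * pj for xj, pj in zip(row, p)]
             for row, ti in zip(Xc, t)]
    return X_res, t


# Hypothetical toy features for one region; 0 codes 1.5 T and 1 codes 3 T.
X = [[1.0, 2.0], [1.2, 1.9], [2.0, 2.1], [2.3, 2.0]]
mfs = [0, 0, 1, 1]
X_res, scores = remove_one_pls_component(X, mfs)
```

The deflated features `X_res` are orthogonal to the extracted scanner-related score vector; a fuller treatment would extract several components and apply the training-set projection to held-out subjects.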

2) ComBat harmonization
We applied ComBat harmonization to reduce any potential biases induced by different MFSs. This method was initially proposed to correct for the site effects in genomics [38]. Later, ComBat was applied to correct for site effects in imaging applications, including diffusion tensor imaging data [45], cortical thickness [57], positron emission tomography [58], and functional connectivity [59]. In this study, ComBat utilizes a multivariate linear mixed-effects regression to model MFS-adapted feature measurements. Let $x_{ij}$ denote a regional gray matter density value for subject $j$ with MRI acquired at MFS $i \in \{1, 2\}$ (where 1 refers to 1.5 T and 2 refers to 3 T). Then, $x_{ij}$ can be written as:

$$x_{ij} = \alpha + C_{ij}\beta + \gamma_i + \delta_i \epsilon_{ij},$$

where $\alpha$ is the overall GM density, $C_{ij}$ is a design matrix for the covariates of interest (age), and $\beta$ is the regression coefficient corresponding to the covariate $C$. The term $\gamma_i$ represents the location parameter effect of MFS $i$, $\delta_i$ describes the multiplicative effect of MFS $i$, and $\epsilon_{ij}$ is an error term from a normal distribution with a zero mean [45]. The ComBat-harmonized GM densities (MFS-adjusted) are then defined as:

$$x_{ij}^{\mathrm{ComBat}} = \frac{x_{ij} - \hat{\alpha} - C_{ij}\hat{\beta} - \gamma_i^{*}}{\delta_i^{*}} + \hat{\alpha} + C_{ij}\hat{\beta},$$

in which $\hat{\alpha}$ and $\hat{\beta}$ are estimators of the parameters $\alpha$ and $\beta$, and $\gamma_i^{*}$ and $\delta_i^{*}$ are the empirical Bayes estimators of $\gamma_i$ and $\delta_i$. To correct the difference among MFSs, we considered two strategies: 1) using ComBat without adjusting any biological covariates (i.e., setting $C = 0$) and 2) using ComBat while adjusting the age as a biological covariate.
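The location and scale adjustment behind this model can be sketched for a single feature as follows. This is a deliberately simplified illustration and not ComBat itself: it assumes no covariates (C = 0) and replaces the empirical Bayes estimators with plain per-batch means and standard deviations; the function name and toy values are hypothetical.

```python
def harmonize_location_scale(values, batches):
    """Shift and rescale one feature so all batches share the pooled mean and spread.

    Simplified: no covariates and plain per-batch estimates instead of the
    empirical Bayes estimators used by ComBat.
    """
    n = len(values)
    alpha_hat = sum(values) / n                      # overall level
    groups = {}
    for v, b in zip(values, batches):
        groups.setdefault(b, []).append(v)
    batch_mean = {b: sum(g) / len(g) for b, g in groups.items()}
    batch_sd = {b: (sum((v - batch_mean[b]) ** 2 for v in g) / len(g)) ** 0.5
                for b, g in groups.items()}
    # Pooled spread of the within-batch residuals.
    resid = [v - batch_mean[b] for v, b in zip(values, batches)]
    pooled_sd = (sum(r * r for r in resid) / n) ** 0.5
    harmonized = []
    for v, b in zip(values, batches):
        gamma = batch_mean[b] - alpha_hat            # additive batch effect
        delta = batch_sd[b] / pooled_sd              # multiplicative batch effect
        harmonized.append((v - alpha_hat - gamma) / delta + alpha_hat)
    return harmonized


# Hypothetical GM density values for one region at two field strengths.
values = [0.40, 0.44, 0.48, 0.50, 0.60, 0.70]
batches = ["1.5T", "1.5T", "1.5T", "3T", "3T", "3T"]
harmonized = harmonize_location_scale(values, batches)
```

After the adjustment, both field-strength groups have the same mean and standard deviation for this feature, which removes group-level differences but, as the results in this paper suggest, need not improve individual-level prediction.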

I. IMPLEMENTATION AND EVALUATION
To evaluate the performance of the models, we utilized repeated, nested 10-fold cross-validation (CV). We used two evaluation metrics: (1) R is the correlation coefficient between the predicted and observed ∆ ADAS-Cog scores, averaged over ten repeats of the CV; and (2) MAE is the mean absolute error between the observed and predicted ∆ ADAS-Cog values, averaged over the subjects and ten CV repeats. We then used ten repeats of the 10-fold CV and averaged the metrics to reduce the random variation due to the sampling of subjects to different folds. We computed 95% confidence intervals for cross-validated, averaged correlations R and MAEs using a bootstrap method [60], [61]. Confidence intervals represent an approximation of the overall performance of the prediction model, and the specific bootstrap method used is adapted to be used in repeated CV (see [61] for details).
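The two evaluation metrics and their averaging over CV repeats can be written down compactly. The sketch below uses only the Python standard library; the helper name `average_over_repeats` and the toy numbers are illustrative assumptions, and the bootstrap confidence intervals of [60], [61] are not reproduced here.

```python
def pearson_r(observed, predicted):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / (var_o * var_p) ** 0.5


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def average_over_repeats(metric, repeats):
    """Average a metric over CV repeats; repeats is a list of (observed, predicted)."""
    return sum(metric(o, p) for o, p in repeats) / len(repeats)


# Hypothetical cross-validated predictions from two CV repeats.
observed = [1.0, 2.0, 3.0, 4.0]
repeats = [
    (observed, [2.0, 4.0, 6.0, 8.0]),
    (observed, [1.0, 2.0, 3.0, 4.0]),
]
mean_r = average_over_repeats(pearson_r, repeats)
mean_mae = average_over_repeats(mean_absolute_error, repeats)
```

The toy example also shows why both metrics are reported: the first repeat has a perfect correlation but a large absolute error, since R is insensitive to the scale of the predictions.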
The values of the hyperparameters were selected in the inner CV loop, and predictions of the ∆ ADAS-Cog scores were evaluated in the outer CV loop to avoid the problem of training on the testing data. The distribution of the ADAS-Cog score changes in each of the CV folds was similar since we used stratified cross-validation folds 4 [62].
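One common way to obtain folds with similar target distributions in a regression setting is to sort the subjects by the target value and spread each block of neighboring subjects over the folds. The sketch below illustrates this idea; it is an assumption about how such stratification can be done and not necessarily the procedure of [62], and the function name is hypothetical.

```python
import random


def stratified_regression_folds(y, n_folds=10, seed=0):
    """Assign each sample to a fold so that folds have similar target distributions."""
    rng = random.Random(seed)
    order = sorted(range(len(y)), key=lambda i: y[i])
    folds = [None] * len(y)
    # Walk through the sorted samples in blocks of n_folds and give every
    # member of a block a different fold, in random order.
    for start in range(0, len(order), n_folds):
        block = order[start:start + n_folds]
        fold_ids = list(range(n_folds))
        rng.shuffle(fold_ids)
        for idx, fold in zip(block, fold_ids):
            folds[idx] = fold
    return folds


# Hypothetical target values (e.g., ADAS-Cog changes) for 100 subjects.
targets = [float(v) for v in range(100)]
fold_of_subject = stratified_regression_folds(targets, n_folds=10, seed=1)
```

In a nested design, the same function could be applied once to form the outer folds and again within each outer training set to form the inner folds used for hyperparameter selection.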
The implementation of the elastic-net penalized linear regression model was performed by using the glmnet library 5 [63]. The optimal value of the regularization parameter ($\lambda$) was selected in the 10-fold inner CV loop by minimizing the mean squared error (MSE). The implementation of multitask learning techniques was performed using the MALSAR package running in MATLAB [49]. The functions implemented in MALSAR have many parameters to tune. Since fine-tuning all parameters with a grid search was impractical, we only considered $\rho_1$ (the regularization parameter for controlling the sparsity among all tasks) and $\rho_2$ (an optional regularization parameter that controls the 2-norm penalty) as the most crucial parameters for the grid search. The $\rho_1$ parameter was selected among the candidate set $\{10^{-3}, 10^{-2.5}, \ldots, 10^{2}, 2 \cdot 10^{2}, 2.5 \cdot 10^{2}, \ldots, 5 \cdot 10^{2}\}$, whereas the parameter $\rho_2$ was chosen among the candidate set $\{10^{-3}, 10^{-2.5}, \ldots, 10^{2}, 2 \cdot 10^{2}, 2.5 \cdot 10^{2}, \ldots, 10 \cdot 10^{2}\}$ by minimizing the root-mean-square error (RMSE). Default values were used for the optional optimization parameters (starting points, termination criterion, endurance, and a maximum number of repetitions).
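The candidate sets for the grid search can be generated programmatically. The sketch below assumes that the elided values follow the visible pattern, i.e., half-decade steps on the logarithmic part and steps of 0.5 · 10^2 on the linear part; the function name `candidate_grid` is hypothetical.

```python
def candidate_grid(upper_multiple):
    """Candidate values: 10^-3, 10^-2.5, ..., 10^2, then 2e2, 2.5e2, ..., upper_multiple * 1e2.

    The step sizes for the elided entries are an assumption based on the
    visible pattern of the set.
    """
    log_part = [10 ** (k / 2) for k in range(-6, 5)]
    n_linear = int(round((upper_multiple - 2) / 0.5)) + 1
    linear_part = [(2 + 0.5 * i) * 100 for i in range(n_linear)]
    return log_part + linear_part


rho1_grid = candidate_grid(5)    # ends at 5 * 10^2
rho2_grid = candidate_grid(10)   # ends at 10 * 10^2
```

Under these assumptions, a full grid search over both parameters evaluates every pair from the two lists in the inner CV loop.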
The implementation of ComBat was performed using a publicly available MATLAB package 6 . The PLS-based domain adaptation was performed as instructed in [37]. The implementation of SVR, required by the PLS domain adaptation, was performed using LIBSVM 7 [64]. The SVR model parameters were set to their default values (C = 1, ν = 0.5, λ = 1/K, where K refers to the dimensionality of the feature space). The implementation of elastic-net penalized linear regression, required by the PLS domain adaptation, was performed as described above.

A. PREDICTION PERFORMANCE OF SINGLE- AND MULTITASK LEARNING
We evaluated the performance of single- and multitask learning approaches by predicting the future change in ADAS-Cog scores using baseline MRI features. The experimental results presented in this subsection ignore the variation in the magnetic field strengths used to acquire the MRIs. Table 3 compares different multitask learning methods based on the least-squares loss function, including multitask Lasso (least lasso), joint feature selection (JFS), dirty model (least dirty), and trace-norm regularization (least trace), with two single-task learning strategies (SEP-EN and ALL-EN) based on elastic-net penalized linear regression (EN). Note that the low number of subjects in the AD group at 36 months (only ten subjects) cannot provide a reliable validation of predictive models; the results for this group and time point are nevertheless shown in Table 3. As shown in Table 3, the average correlation coefficients between the predicted and actual ∆ADAS-Cog scores were positive for all baseline diagnoses and time points. This indicates that MRI-based predictive models were able to predict the disease progression. A comparison of the two single-task learning strategies, SEP-EN and ALL-EN, indicated that ALL-EN, which utilized all diagnostic groups for training, performed better in the NC and MCI groups. For instance, R of the NC subjects increased from 0.09 to 0.12 at 12 months, from 0.09 to 0.17 at 24 months, and from 0.04 to 0.19 at 36 months. Moreover, the ALL-EN prediction model achieved the best performance among all methods in terms of the correlation score for the NC and MCI groups (e.g., R for the MCI group at 12, 24, and 36 months were 0.22, 0.41, and 0.39, respectively). However, R of the other methods were typically within the 95% confidence intervals of R of ALL-EN, indicating that the improvement was not large. All multitask learning algorithms performed highly similarly to ALL-EN.
However, especially in the predictions concerning the AD group, these performed slightly better than ALL-EN: For instance, R of the AD subjects increased from 0. [...] predicted change in ADAS scores in the CV run with the median correlation (R). These scatter plots imply that the ∆ADAS-Cog scores with very high values were the most difficult to predict at all time points, since the number of individuals with observed ADAS-Cog score changes over 20 was small (e.g., the numbers of such MCI subjects at 12, 24, and 36 months were 2, 8, and 19, respectively). In summary, these results support the notion that auxiliary data were useful in predicting the ADAS-Cog change in the NC and MCI groups, but not in the AD group. Complex multitask learning algorithms did not demonstrate benefits over simpler single-task learning methods.
Interestingly, in the majority of methods, R scores typically increased with the length of the follow-up (e.g., R in ∆ADAS-24 was higher than R of ∆ADAS-12). The potential reason for this higher correlation is that as the changes in ADAS-Cog become more prominent, they are easier to predict based on MRI. In Table 3, the ALL column lists the evaluation results while agglomerating all subject groups in the validation. For all methods, the R values were higher when all the subjects were combined than when they were stratified based on the baseline diagnosis. Since the prediction models were the same in both cases, this inflation in the prediction performance can be seen as artificial and one to avoid. It is likely a product of the interaction between the heterogeneity of subject groups and the particular evaluation measure (correlation) that scales according to this heterogeneity.
Table 3. Comparison of single and multitask learning on predicting the change in ADAS-Cog. The methods are given in Table 2. SEP-EN refers to the single-task method trained for each diagnostic group separately. ALL-EN refers to the single-task method trained with the data from all diagnostic groups. R is the cross-validated correlation between the actual and predicted ∆ADAS-Cog scores averaged over 10 CV runs and MAE is the mean absolute error averaged over 10 CV runs. Values in parentheses give the bootstrapped 95% confidence intervals. The asterisk (*) implies that the validation result is not trustworthy due to the low number of samples.
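The inflation of the pooled correlation can be reproduced with a small synthetic example. In the sketch below, the group labels and all numbers are hypothetical toy values chosen so that the predictions track the group means well but the within-group variation only weakly.

```python
def pearson_r(observed, predicted):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / (var_o * var_p) ** 0.5


# Hypothetical observed and predicted changes for three diagnostic groups.
groups = {
    "NC": ([0.0, 2.0, 1.0, 1.0], [1.0, 1.1, 0.9, 1.2]),
    "MCI": ([3.0, 5.0, 4.0, 4.0], [4.0, 4.1, 3.9, 4.2]),
    "AD": ([8.0, 10.0, 9.0, 9.0], [9.0, 9.1, 8.9, 9.2]),
}

within_r = {name: pearson_r(obs, pred) for name, (obs, pred) in groups.items()}

# Pool all groups before computing the correlation.
pooled_obs = [o for obs, _ in groups.values() for o in obs]
pooled_pred = [p for _, pred in groups.values() for p in pred]
pooled_r = pearson_r(pooled_obs, pooled_pred)
```

In this toy setting, the within-group correlations are all modest, whereas the pooled correlation is close to one because it is dominated by the differences between the group means.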

B. PREDICTION PERFORMANCE WHILE ACCOUNTING FOR DIFFERENCES IN MFS
This subsection focuses on the situation associated with heterogeneity reduction between data from two MFSs and explores the prediction models by including MFS correction approaches to reduce unwanted variance across features. In addition, based on the results of the previous subsection, we selected ALL-EN as a single-task learning method and a Dirty model as a multitask learning method for further analysis.
To demonstrate the differences between GM density values of images acquired at 1.5 T and 3.0 T, we applied a standard voxel-based morphometry approach to compare the GM densities of NC subjects scanned at the two field strengths. The voxelwise t-statistics in Fig. 2(A) demonstrate considerable differences in the GM density values between 1.5 T and 3.0 T. We repeated the analysis after applying the ComBat harmonization method. Fig. 2(B) shows that, at the group level, ComBat harmonization performed exceptionally well in removing the nuisance variability associated with the two MFSs.
We adopted two strategies to harmonize the MRI data for MFS differences and studied whether harmonization can improve the performance of ADAS-Cog prediction. First, we applied ComBat to the 122 regional GM density measurements and then used ALL-EN to predict ∆ ADAS-Cog scores. Second, we applied the PLS-based domain adaptation method, as described in Section 2.8.1. Table 4 presents the R and MAE scores. Combined with ALL-EN, the PLS-based domain adaptation method performed slightly better than ComBat in terms of the average correlation. For example, with the PLS approach, R for the NC, MCI, and AD groups at 24 months was 0.16, 0.40, and 0.28, respectively; with ComBat, the corresponding values were 0.11, 0.38, and 0.24. Table 4 also shows that the performance of the Dirty model was similar to or worse than that of ALL-EN after the correction for the MFS differences. For example, for the AD group at 12 months with the PLS method, the average R dropped from 0.30 to 0.24 when the Dirty model was substituted for ALL-EN. Table 4 additionally reports the results of the 6-task learning approaches for MFS adaptation. These methods performed on par with the other correction approaches but failed to consistently improve on the prediction of the baseline model (ALL-EN).
Fig. 3 illustrates the performance comparison between single and multitask learning strategies before and after utilizing correction approaches. The comparison shows that ComBat did not improve the prediction performance at the individual level, although it worked well at the group level, as demonstrated in Fig. 2. Fig. 3 indicates that the PLS domain adaptation based on the single-task learning model performed consistently better than the other methods. Moreover, combining multitask learning with ComBat slightly improved the performance in the AD group.
Comparison of the predictive model performance with and without MFS harmonization. ALL-EN refers to the baseline method without any harmonization. ComBat_ALL-EN (PLS_ALL-EN) refers to ComBat (PLS) harmonization followed by ALL-EN. ComBat_DirtyModel (PLS_DirtyModel) refers to ComBat (PLS) harmonization followed by the Dirty model. DirtyModel_6-Tasks (LRA_6-Tasks) refers to the 6-task learning with the Dirty model (LRA). The asterisk (*) implies that the validation result is not trustworthy due to the low number of samples.
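To make the idea of group-level harmonization concrete, the following minimal Python sketch aligns one feature across two acquisition batches. It is not the ComBat algorithm: it is a simplified per-batch location and scale alignment, without the empirical Bayes shrinkage or covariate preservation that ComBat provides. The function name `align_batches` and all data values are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def align_batches(values, batches):
    """Shift and scale one feature so each batch matches the pooled mean and SD.

    Simplified stand-in for harmonization (assumption): no empirical Bayes
    shrinkage and no preservation of biological covariates, unlike ComBat.
    """
    pooled_mean = mean(values)
    pooled_sd = pstdev(values)
    out = list(values)
    for b in set(batches):
        idx = [i for i, lab in enumerate(batches) if lab == b]
        b_vals = [values[i] for i in idx]
        b_mean, b_sd = mean(b_vals), pstdev(b_vals)
        for i in idx:
            # Standardize within the batch, then map to the pooled scale.
            z = (values[i] - b_mean) / b_sd if b_sd > 0 else 0.0
            out[i] = pooled_mean + pooled_sd * z
    return out


# Hypothetical GM density values for one region, three subjects per field strength.
gm = [0.50, 0.54, 0.58, 0.60, 0.66, 0.72]
mfs = ["1.5T", "1.5T", "1.5T", "3.0T", "3.0T", "3.0T"]
harmonized = align_batches(gm, mfs)

# After alignment, the batch means coincide with the pooled mean.
mean_15 = mean(h for h, m in zip(harmonized, mfs) if m == "1.5T")
mean_30 = mean(h for h, m in zip(harmonized, mfs) if m == "3.0T")
```

In this sketch the batch means coincide after alignment, while the ordering of subjects within each batch is unchanged, because the transformation within a batch is affine and increasing.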

C. AGE AS A COVARIATE
In MRI-based predictive modeling of AD, age plays an essential role; for example, regressing age out of MRI has been shown to improve MCI-to-AD conversion prediction [65]. Therefore, we studied whether removing or preserving age as a biological variable can improve the ADAS-Cog prediction, focusing on the ALL-EN model. We considered different data harmonization methods: (1) we applied ComBat by considering age as a covariate to preserve its effect while removing the variability associated with MFS (ComBat_Age) and (2) Figure 4. In addition, we illustrate the cross-validated accuracy of ComBat and PLS with and without considering age as a covariate in Fig. 4.
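Regressing age out of an MRI feature can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: it assumes a linear age effect estimated per feature by ordinary least squares on the training subjects only, which is then subtracted from both training and test features. The helper names `fit_line` and `remove_age` and all numbers are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def remove_age(feature, age, slope, intercept):
    """Subtract the fitted linear age effect from each feature value."""
    return [f - (slope * a + intercept) for f, a in zip(feature, age)]


# Hypothetical ages (years) and GM density values for one region.
train_age = [65.0, 70.0, 75.0, 80.0]
train_gm = [0.62, 0.60, 0.57, 0.55]
test_age = [72.0]
test_gm = [0.56]

# Assumption: the age effect is estimated on training subjects only,
# so that no information from the test subjects enters the model.
slope, intercept = fit_line(train_age, train_gm)
train_resid = remove_age(train_gm, train_age, slope, intercept)
test_resid = remove_age(test_gm, test_age, slope, intercept)
```

The residuals, rather than the raw values, would then be passed to the prediction model. Estimating the age effect inside each cross-validation fold is one design choice; other choices, such as estimating it from NC subjects only, are also possible.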

IV. DISCUSSION
We predicted the changes in the ADAS-Cog scores (∆ ADAS-Cog) in three distinct subject groups (NC, MCI, and AD) based on MRI for up to 36 months. We explored this problem by comparing various formulations of single- and multitask learning algorithms and scrutinizing whether multitask learning can help to cope with differences in the MRI data caused by different MFSs. MTL models aim to enhance generalization performance by utilizing relatedness among various tasks; here, the tasks are the prediction of ∆ ADAS-Cog in different subject groups. We predicted ∆ ADAS-Cog scores from regional GM density values by single-task learning via elastic net penalized linear regression as a baseline learning method. The single-task learning was applied based on two distinct strategies: 1) training the model by pooling together the data across the diagnostic groups (ALL-EN), and 2) training a separate model for each diagnostic group (SEP-EN). We compared single-task models to multitask learning approaches, where we treated the prediction of cognitive scores in different baseline diagnoses as separate tasks. The experiments revealed a positive correlation between observed and predicted ∆ ADAS-Cog scores in all diagnostic groups at all time points. This indicates that MRI has predictive value for changes in ADAS-Cog scores across all subject groups. As shown in Table 3, the SEP-EN and MTL methods performed similarly; however, the ALL-EN method performed slightly better than the other methods regarding the average correlation score. In addition, a comparison of average correlation scores obtained from the two single-task learning strategies (SEP-EN vs. ALL-EN) showed that pooling the training data across all diagnostic groups was beneficial for predicting disease progression in the NC and MCI groups. More complex multitask learning approaches were unable to provide benefits over single-task learning in our experiments.
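The stratified evaluation described above, in which predictions are scored separately within each baseline diagnostic group, can be sketched in a few lines of Python. This is a minimal illustration, not the study's evaluation code: the function `stratified_scores` and all observed and predicted values are hypothetical, and cross-validation and confidence intervals are omitted.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def mae(x, y):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def stratified_scores(observed, predicted, groups):
    """Compute R and MAE separately within each diagnostic group."""
    scores = {}
    for g in sorted(set(groups)):
        obs = [o for o, lab in zip(observed, groups) if lab == g]
        pred = [p for p, lab in zip(predicted, groups) if lab == g]
        scores[g] = {"R": pearson_r(obs, pred), "MAE": mae(obs, pred)}
    return scores


# Hypothetical observed and predicted score changes, three subjects per group.
observed = [0.0, 1.0, 2.0, 1.0, 3.0, 5.0, 2.0, 6.0, 10.0]
predicted = [0.5, 0.5, 1.5, 2.0, 3.0, 4.0, 4.0, 5.0, 9.0]
groups = ["NC"] * 3 + ["MCI"] * 3 + ["AD"] * 3

scores = stratified_scores(observed, predicted, groups)
# Pooled correlation over all subjects, shown only for contrast.
pooled_r = pearson_r(observed, predicted)
```

A correlation pooled over all subjects can be driven largely by differences between the groups, which is why the per-group values in `scores` are the quantities of interest in a stratified evaluation.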
Considering the different MFSs used in the ADNI1 and ADNI2 cohorts, we studied 1) whether correcting for this difference affects the ADAS-Cog prediction and 2) whether multitask learning would be useful for such a correction. We used two heterogeneity reduction approaches that are typically applied to correct for site differences: PLS-based domain adaptation [37] and ComBat [38]. Correcting the MRI with PLS-based domain adaptation marginally improved the ADAS-Cog change prediction, but the improvement was typically not statistically significant, as seen by comparing the confidence intervals in Table 4. Multitask learning with six tasks, corresponding to the three baseline diagnoses and the two MFSs of MRI, did not bring any improvement over single-task learning.
We investigated the role of age as a covariate in the prediction models. The evaluation demonstrated that the accuracy of the predicted ∆ ADAS-Cog scores improved when age was regressed out of the MRI data. This agrees with previous studies indicating that age has a significant effect on the accuracy of cognitive score prediction [28], [44].
Several studies have analyzed the role of ADAS-Cog scores in the evaluation of AD, as well as the relationship between ADAS-Cog and MRI ([25]-[30], [66], [67]). For instance, Wang et al. [66] proposed a multitask exclusive relationship learning model to automatically capture the intrinsic relationship among tasks at different time points for estimating clinical measures based on longitudinal imaging data. Yan et al. [30] proposed a new group-sparse multitask regression model for predicting ADAS, MMSE, and RAVLT cognitive scores at the baseline using cortical thickness measurements. Duchesne et al. [32] applied a linear regression model to predict one-year MMSE changes using baseline MRI features and revealed that the baseline MRI features moderately predict one-year MMSE changes in the general MCI population. However, to the best of our knowledge, this work is the first to stratify the estimation of the prediction performance between diagnostic groups and utilize the relatedness between the diagnostic groups to boost the prediction performance. In addition, we demonstrated the necessity of stratifying subjects based on a baseline diagnosis to evaluate the predictive modeling of the change in ADAS-Cog.

V. CONCLUSION
We explored single and multitask learning to predict the changes in ADAS-Cog scores based on T1-weighted anatomical MRI. We stratified the subjects based on their baseline diagnoses and evaluated the prediction performances in each group. Our results indicated a positive relationship between the predicted and observed ADAS-Cog score changes in each diagnostic group, suggesting that standard T1-weighted MRI has a predictive value for evaluating the cognitive decline in the AD continuum. We further studied whether correction of the differences in MFS of MRI would improve the ADAS-Cog score prediction. The PLS-based domain adaptation slightly improved the prediction performance, but the improvement was marginal. In summary, this study demonstrated that ADAS-Cog changes could be, to some extent, predicted based on anatomical MRI. Based on this study, the recommended method for learning the predictive models is ALL-EN, due to its simplicity and good performance.

ACKNOWLEDGMENTS
This study received funding from the Academy of Finland project 316258, "Predictive Brain Image Analysis", to Jussi Tohka.
Data collection and sharing for this project were funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer's Association;

AVAILABILITY OF MATERIALS AND METHODS
The source code of the single and multitask algorithms, as well as the MRI pre-processing, is available at https://github.com/vandadim/ADAS_MRI. The dataset used in this paper was obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) and is available at http://adni.loni.usc.edu/data-samples/access-data/. The roster identification numbers (RIDs) of the subjects employed in this study are provided as a supplementary csv-file.