Making use of longitudinal information in pattern recognition

Abstract Longitudinal designs are widely used in medical studies as a means of observing within‐subject changes over time in groups of subjects, thereby aiming to improve sensitivity for detecting disease effects. Paralleling an increased use of such studies in neuroimaging has been the adoption of pattern recognition algorithms for making individualized predictions of disease. However, at present few pattern recognition methods exist to make full use of neuroimaging data that have been collected longitudinally, with most methods relying instead on cross‐sectional style analysis. This article presents a principal component analysis‐based feature construction method that uses longitudinal high‐dimensional data to improve predictive performance of pattern recognition algorithms. The method can be applied to data from a wide range of longitudinal study designs and permits an arbitrary number of time‐points per subject. We apply the method to two longitudinal datasets, one containing subjects with mild cognitive impairment along with healthy controls, the other with early dementia subjects and healthy controls. Across both datasets, we show improvements in predictive accuracy relative to cross‐sectional classifiers for discriminating disease subjects from healthy controls on the basis of whole‐brain structural magnetic resonance image‐based voxels. In addition, we can transfer longitudinal information from one set of subjects to make disease predictions in another set of subjects. The proposed method is simple and, as a feature construction method, flexible with respect to the choice of classifier and image registration algorithm. Hum Brain Mapp 37:4385–4404, 2016. © 2016 Wiley Periodicals, Inc.

subjects with mild cognitive impairment along with healthy controls, the other with early dementia subjects and healthy controls. Across both datasets, we show improvements in predictive accuracy relative to cross-sectional classifiers for discriminating disease subjects from healthy controls on the basis of whole-brain structural magnetic resonance image-based voxels. In addition, we can transfer longitudinal information from one set of subjects to make disease predictions in another set of subjects. The proposed method is simple and, as a feature construction method, flexible with respect to the choice of classifier and image registration algorithm. Hum Brain Mapp 37: [4385][4386][4387][4388][4389][4390][4391][4392][4393][4394][4395][4396][4397][4398][4399][4400][4401][4402][4403][4404]2016.

INTRODUCTION
Longitudinal studies aim to follow a set of subjects, making repeated observations of the same variables over time. They have been widely used in medical studies to provide more sensitive detection of disease effects by focussing on within-subject changes in clinical groups [Fitzmaurice and Ravichandran, 2008]. Such study designs are increasingly being used in neuroimaging research, often with the goal of gathering imaging biomarkers for assessing disease progression and, potentially, response to therapy [Bernal-Rusiel et al., 2013;Chen et al., 2012;Douaud et al., 2013;Guillaume et al., 2014]. At the same time, pattern recognition algorithms have been adopted by the neuroimaging community to make use of such biomarkers to make individualized predictions of subjects' current or future disease state [Kl€ oppel et al., 2012;Orr u et al 2012;Wolfers et al., 2015]. At present, however, most pattern recognition-based analyses use cross-sectional data or analyze longitudinal data in a cross-sectional manner and therefore do not capitalise on the full value of longitudinal designs. This article describes a novel method of constructing features from high-dimensional longitudinal data such as imaging biomarkers that solves this problem. Our method is applicable to most forms of highdimensional data; here, we have used structural magnetic resonance images (MRI), with the goal of discriminating subjects with mild cognitive impairment (MCI) from healthy controls (HCs).
Structural MRI is a popular source of imaging biomarkers due to the high spatial resolution images that can be obtained in a safe and clinically acceptable manner. Traditionally, cross-sectional classifiers that only use data from a single time-point have performed reasonably well with such data on certain problems. For example, in the case of discriminating subjects with Alzheimer's Disease (AD) from HCs using structural MR images, the pathology of the disease is severe enough that there are usually significant volumetric differences between age-matched samples from these two groups to discriminate subjects crosssectionally. In a direct and controlled comparison, it was shown that both radiologists and automated methods perform comparably, achieving roughly 90% accuracy in both cases for discriminating subjects with sporadic AD from controls [Kl€ oppel et al., 2008].
Given promising results discriminating such severe forms of neurodegeneration from controls using structural MRI, researchers have attempted to discriminate earlier forms of neurodegeneration as well as subjects who are more likely to progress to dementia in the future [Frisoni et al., 2010]. MCI, in particular, has been a focus of intense study in recent years as it is widely believed to be the best stage for interventional treatments aimed at delaying or preventing progression to AD [Morris et al., 2001]. Discriminating MCI from controls, particularly with structural MRI, has proven to be more difficult however [Chu et al., 2012;M€ onninghoff et al., 2015]. In addition to the problem of head size and shape differences between subjects that pertains to much of neuroimage-based analysis, the study of early stage diseases and disease precursors is often confounded by demographic factors such as age, gender, and education level [Bakkour et al., 2013;Dukart et al., 2013]. Due to such sources of inter-subject variance, the more subtle nature of the pathology of MCI may be hard to discern from healthy aging via volumetric biomarkers derived from structural MRI. This in turn may explain the poorer performance in discriminating MCI from controls using pattern recognition-based classifiers, commonly reported thus far [Orr u et al., 2012].
In cases of high inter-subject variation, longitudinal study designs provide a means of increasing the sensitivity of detecting disease effects of interest by tracking intrasubject differences over time. Such study designs are particularly appropriate for studying neurodegenerative diseases as they are, by definition, a class of diseases marked by progressive loss of structure and/or function [Franke and Gaser, 2012;Misra et al., 2009;Risacher et al., 2009]. Raz and Lindenberger [2011] strongly advocate the increased use of intra-subject longitudinal information rather than inter-subject cross-sectional information to study aging. Several recent neuroimaging studies have done so in a mass univariate context. Zipunnikov et al. [2014] developed an efficient method for handling very high dimensional neuroimaging data using dimensionality reduction, building a mixed effects model that decomposes the variability of repeated observations into subject specific r Aksman et al. r r 4386 r cross-sectional, longitudinal and exchangeable visit-to-visit variabilities. Ziegler et al. [2015], applied Bayesian linear mixed effects modelling to structural neuroimaging data from dementia subjects as well as older healthy subjects, comparing temporal trajectory models with varying model orders for fixed and random effects in a principled manner, through the use of model evidence. Such mass univariate methods hold great value for explaining the sources of variance and localizing disease effects in a cohort of study subjects. However, there is no mechanism for understanding how well such models generalize to unseen data. Pattern recognition-based approaches, in contrast, aim to assess predictive performance on held out samples, making them better suited for clinical applications.
Other neuroimaging methods have made use of longitudinal information at the image registration stage, thereby making them applicable to pattern recognition as feature construction methods [Holland and Dale, 2011;Leung et al., 2012]. Ashburner and Ridgway [2013] developed a group-wise longitudinal registration technique which creates an intra-subject template using two or more images of a subject. In the case of two images per subject, longitudinal features for pattern recognition can then be computed that quantify the contraction or expansion of each voxel from baseline to follow-up time-point. In one of the few direct applications of pattern recognition to longitudinal neuroimaging data, Gray et al. [2012] used a longitudinal registration-based approach to classify dementia. The authors non-rigidly registered the 12-month follow-up FDG-PET images of subjects to their corresponding baseline images. The authors show that longitudinal features, encoding the change from baseline to follow-up, perform worse in two class classifications involving AD, MCI, and HCs than cross-sectional features. However, concatenating longitudinal features with cross-sectional features resulted in a small improvement in balanced accuracies compared to using purely cross-sectional features.
Longitudinal information has also been integrated into a classifier at the level of the algorithm itself, by providing subjects' cross-sectional images from multiple time-points as inputs to the algorithm. Chen and DuBois Bowman [2011] presented a method that generalizes the optimization problem of the popular support vector machine (SVM) by constructing a support vector classifier (SVC) that includes a linear combination of an arbitrary number of cross-sectional time-points from all subjects. At each iteration, their algorithm proceeds by optimizing the support vector (SV) coefficients under fixed time-point weightings followed by optimizing the time-point weightings, common across all subjects, under the updated SV coefficients, until convergence.
In this work, we propose a method for incorporating longitudinal information into a classifier via feature construction rather than modifying a particular pattern recognition algorithm's optimization problem or relying on specialized registration techniques. We project cross-sectional data onto a low-dimensional linear subspace formed by performing principal component analysis (PCA) on a matrix of coefficients describing longitudinal changes. Such low-dimensional subspaces, formed using training data, have been broadly used in the context of facial recognition problems [Belhumeur et al., 1997;Chen et al., 2005;He et al., 2005;Turk and Pentland, 1991;Zhao et al., 2007]. In the context of neuroimaging, Wolz et al. [2010] learned a low-dimensional manifold using both longitudinal and cross-sectional information to classify dementia, using two-sample difference images to represent longitudinal change. Our method is similar to this approach but is applicable to both "balanced" (fixed number of samples per subject and fixed time interval between samples) and "unbalanced" designs (varying number of samples per subject or varying intervals between samples) and is able to make use of an arbitrary number of timepoints per subject. Importantly, it also has the ability to transfer longitudinal information from one set of subjects to make individualized predictions for subjects from another set. Furthermore, the method is fast and relatively simple, having only one parameter that is tuned via crossvalidation. As a feature construction method it is not restricted to any neuroimaging modality, image registration technique or pattern recognition algorithm.
Here, we demonstrate the method using a linear SVC and structural MRI-based biomarkers on the tasks of discriminating subjects with MCI from HCs and early dementia from HCs using two independent datasets. We compare the accuracies obtained using longitudinal features to those obtained using only cross-sectional features and predict that longitudinal information will yield performance improvements that depend on the type of longitudinal study design used and number of samples per subject that are available.

Subjects
For the first dataset, we used clinical and imaging data derived from a substudy of the Heinz Nixdorf Recall (HNR; Risk Factors, Evaluation of Coronary Calcium and Lifestyle) study. The HNR study is a population-based prospective cohort study with subjects randomly selected from mandatory city registries in Germany. Study methods have been previously described in detail [Schmermund et al., 2002;Stang et al., 2005]. Briefly, 4,814 participants 45 to 75 years of age were enrolled between 2000 and 2003 in the Ruhr area in Germany. After the baseline examination participants were followed over a five year period when a second examination was conducted. The second examination (response rate: 90.2%) included a short cognitive performance assessment (for details regarding participants and drop-outs see Dlugaj et al. [2010]), which was accomplished in 4,086 study r Longitudinal Information in Pattern Recognition r r 4387 r participants. At this follow-up time-point, a random sample of participants (aged 50-80) with impaired short cognitive performance assessment results (n 5 701) and age appropriate short cognitive performance assessment results (n 5 316) were invited to a detailed neuropsychological and neurological examination to assess MCI and its subtypes for inclusion in the HNR substudy (HNRS) [Dlugaj et al., 2010]. The neuropsychological examinations performed and MCI diagnosis criteria used in this study are detailed in the Supporting Information.
In total, the HNRS consisted of 148 MCI cases identified and matched, at the five year follow-up of the total HNR cohort, to 148 HCs according to age, sex, and education. Participants were examined with MRI at the substudy baseline (starting at the five year follow-up of the total cohort) and at the substudy 2.5 years follow-up. A second follow-up with MRI, approximately 5.5 years after substudy baseline, is being conducted, with data not yet available. All participants gave their written informed consent. The study was approved by the local institutional ethical committee and followed established guidelines of good epidemiological practice.
The second dataset used was the OASIS longitudinal sample, the full details of which have been reported previously [Marcus et al., 2010]. The longitudinal MRI data consisted of 150 subjects aged 60 to 96, recruited primarily through media appeals and word of mouth. Each of the 150 study subjects had at least two separate visits in which clinical data and MRI data were collected. Clinical data consisted of dementia status, as measured by the Clinical Dementia Rating (CDR) scale. The CDR scale ranges from 0 for no cognitive impairment, to 0.5 for very mild dementia, one for mild dementia, two for moderate dementia, and three for severe dementia. Subjects with a primary cause of dementia other than AD (e.g., vascular dementia, primary progressive aphasia), active neurological or psychiatric illness (e.g., major depression), serious head injury, history of clinically meaningful stroke, and use of psychoactive drugs were excluded, as were subjects with gross anatomical abnormalities evident in their MRI images (e.g., large lesions, tumors).

Data Acquisition and Quality Control
For the HNRS dataset, MR examinations of the head were performed in all study participants with a confirmed diagnosis of MCI and in their matched controls. Each MR examination was performed on a single 1.5T MR scanner (Magnetom Avanto, Siemens Healthcare, Erlangen). The MR scanner was equipped with a 12-channel receive-only matrix head coil provided by the vendor. A sagittal 3D Fast Low Angle SHot T1 weighted image (TR 5 40 ms; TE 5 5 ms; flip angle 5 408; matrix size 5 256 3 256 with 176 1.0 mm thick slices; FoV 5 26 mm 3 26 mm; bandwidth 5 160 Hz/pixel) was acquired and used in this analysis. In addition, fluid-attenuated inversion recovery and T2 weighted sequences were also acquired for standard radiological examination.
For the OASIS dataset, all of the sagittal 3D magnetization prepared rapid acquisition of gradient echo images (TR 5 9.7 ms; TE 5 4 ms; TI 5 20 ms; TD 5 200 ms; flip angle 5 108; matrix size 5 256 3 256 with 128 1.0 mm thick slices; FoV 5 25.6 mm 3 25.6 mm) were also acquired on a Siemens MR system operating at 1.5T (Magnetom Vision, Siemens Healthcare, Erlangen). At each visit, three or four T1-weighted images were acquired for each subject.
For both studies, all T1 weighted images were visually inspected. Images with artefacts or any pathology not associated with dementia, such as signs of stroke, were discarded from further analysis.

Data Pre-Processing
The same image segmentation and image registration procedure described here was performed for both datasets, which were analyzed separately. Subjects' baseline and follow-up images meeting the quality control standards described above were segmented into gray matter and white matter using SPM8's "New Segment" procedure. Gray matter and white matter tissue class images were registered to a common inter-subject space, by creating a study specific template using SPM8's DARTEL registration procedure [Ashburner, 2007]. We formed the template using images from all time-points. In this study, we used Jacobian determinant images from the warping of each subject's gray matter and white matter images to the common template. While our method is generic and can be applied to any type of feature (e.g., modulated gray-matter images), Jacobian determinant images have shown promise in discriminating neurodegeneration by allowing a comparison of the expansion and contraction of voxels across and within subjects [Anderson et al., 2012;Hua et al., 2009Hua et al., , 2010Studholme et al., 2004].
A whole brain mask was applied to exclude the extracerebral voxels. To form the mask, we segmented the MNI152 brain with the same "New Segment" procedure, then registered the resulting gray matter image to the study template's gray matter tissue class image using FMRIB's Nonlinear Image Registration Tool [Andersson et al., 2007a,b]. We applied the resulting warp to the Harvard-Oxford subcortical atlas [Desikan et al., 2006] that has been affine-registered to MNI152, available in FSL . We then formed the mask as a binary image consisting of all atlas regions excluding the brainstem and cerebellum. Masked images, retaining 356,365 voxels in all cases, were then reshaped to form highdimensional data samples used in subsequent analysis.
For comparison to our proposed features, we also created features quantifying longitudinal change using a within-subject diffeomorphic registration method developed by Ashburner and Ridgway [2013] and available in SPM12b. We coregistered each subject's baseline image to r Aksman et al. r r 4388 r its follow-up, generating a temporal midpoint image in the process as well as Jacobian determinant images. We then performed the same inter-subject procedure on the midpoint images as before, segmenting them using SPM8's "New Segment" procedure and using DARTEL to form an inter-subject template with the gray and white matter tissue class images from the segmentation. Rather than using the Jacobian determinant images from the warping of each subject's midpoint to the common template, we warped the Jacobian determinant images from the intra-subject registration to the common template. A whole brain mask was also applied to these images in the same manner described above. We used these registered intra-subject Jacobian determinant images as longitudinal features, thereby quantifying the amount of expansion or contraction that each voxel undergoes due to within-subject coregistration.

Method Overview
We introduce a model for subject specific changes in samples as polynomial functions of time in the general case of an unbalanced longitudinal design. Using the coefficient matrices from the solutions of these subject specific models, we create a transformation into a subspace describing intra-subject longitudinal change that is common across subjects. We form features for classification by projecting cross-sectional samples onto this longitudinal subspace. For the special case of two samples per subject, for both balanced and unbalanced designs, we show that the intra-subject difference between samples, scaled by the time differences between samples, is equivalent to the matrix of coefficients of linear change over time (i.e., the slope coefficients), leading to a simple method for creating linear transformations that model longitudinal changes.
We will make use of several sets in this section. The longitudinal set L of subjects will be defined as the set of subjects for whom the necessary number of longitudinal timepoints is available. As we use cross-validation to evaluate a classifier's performance on unseen data, we define the classification set C as the set that is split, at each crossvalidation fold, into a training set N of subjects used to train a classifier and a set T of subjects used to test the classifier's ability to generalize to unseen data. Additionally we define the longitudinal training set L as the set difference between L and T , L5 , that is, the longitudinal set subjects that are not being held out for testing in a given cross-validation fold.

Longitudinal Trajectory Model
We assume that a regression model can be estimated independently for each of the l subjects in the longitudinal training set L, such that for subject i, having m i samples of dimensionality D, we have longitudinal design matrix with chosen model order P (with a constant term included as the extra dimension) and B i is an P11 ð Þ3D matrix of voxel trajectory coefficients for that subject. Using this assumed model form, Appendices A and B derive expressions for B 1 ð Þ , the matrix of first-order (slope) coefficients of voxels' longitudinal trajectories across subjects and image dimensions, for the general case of an unbalanced design (varying numbers of samples per subject or varying time intervals, Appendix A) and the special case of two time-point per subject balanced and unbalanced longitudinal designs (with fixed or varying time intervals, respectively, Appendix B). We consider this special case separately as longitudinal data in neuroimaging are often limited to the minimum of a single baseline and follow-up image per subject. Due its simplicity and importance in our analysis, Eq. (B.4), the expression for B 1 ð Þ , the matrix of linear (slope) coefficients across subjects and voxels in the balanced two sample case, is restated here as where we have defined the matrix D L as the intra-subject (longitudinal) differences between the baseline timepoint's samples X L t 1 ð Þ and the follow-up time-point's samples X L t 2 ð Þ, scaled by t D , the fixed time interval between samples.
In this article, we use the B 1 ð Þ matrix (derived from either the general or special cases described above) to create projections that capture a subspace of linear longitudinal change that is common across subjects. We note that one can model non-linear longitudinal changes by building a subspace using higher order coefficient matrices (B 2 ð Þ ; . . . ; B P ð Þ ; P > 1Þ based on the method described in Appendix A, although we focus on linear subspaces by fitting first-order models (P51) here. However, our approach easily accommodates datasets with a greater number of follow-up timepoints.

Longitudinal Subspace Formation
To generalize to unseen data, we use PCA to create an orthogonal projection of the l D-dimensional samples in the l3D matrix B 1 ð Þ onto a low-dimensional linear subspace that describes most of the variance in the data. The projection matrix, U k , is composed of a small set of k principal components (PCs) of B 1 ð Þ , that we term "eigenslopes" in the spirit of Turk and Pentland [1991], such that U k is of size D3k, with k D. We have chosen to use PCA as it is a computationally efficient and linear technique, the latter property allowing for easier interpretation of results. Section 12.1.4 of Bishop [2007] describes the PCA procedure we used to compute U k , with the number of retained PCs k chosen as described in the Nested Cross-Validation section.

Within-Set Prediction Problem
When the longitudinal set L and the classification set C are equal all subjects for whom we are interested in making predictions via cross-validation have the necessary longitudinal samples. At each cross-validation fold, we form a training set N and test set T as subsets of C and form the projection U k using the longitudinal information of the subjects in N , which is equivalent to the longitudinal training set L in this case. This is a straightforward application of our method, which seeks to answer whether longitudinal information from a set of subjects can improve predictions for subjects within the set (Fig. 1A).

Information Transferring Prediction Problem
In addition to the Within-Set Prediction problem, our method can be used to transfer longitudinal information from one set of subjects to another. In this case, the longitudinal set L and the classification set C are disjoint, meaning they have no subjects in common. It follows that the longitudinal training set L, used to form the longitudinal transform U k , is disjoint from both the training set N and test set T at each cross-validation fold. This problem illustrates that our approach can be used to form a longitudinal subspace with one set of subjects to make diagnostic predictions for another subject set (Fig. 1B).

Feature Formation
For both cases described above, we form training and test feature matrices, used in classification, by projecting the cross-sectional samples of the training and test subjects' samples at a particular time-point t onto a rank k longitudinal space described by U k U k T . Note that the difference between the two cases described above is whether  (2), the cross-sectional training data (in purple) and test data (non-overlapping area in red) is projected onto a subspace formed using the longitudinal training data (blue area). Graphic B depicts the Information Transferring Prediction problem, where the longitudinal subject set and classification subject sets are disjoint. In this case, there is no overlap in data between the longitudinal data (blue area) used to form the subspace in Eq.
(2) and the classification data (both training and test data, red area) being projected. Here, the cross-sectional data may come from a baseline or follow-up time-point. r Aksman et al. r r 4390 r subject set L, whose longitudinal data is used to form U k , is equal to or disjoint from set N .
For either case, at each cross-validation fold the features used in classifier training and testing have the form Here, we have introduced X N t ð Þ and X T t ð Þ as, respectively, matrices of cross-sectional data for the training and test sets at a particular time-point t, such as at baseline or followup. We use X train as an input feature matrix to train a classifier to predict y train 2 1; 21 f g, the binary class labels of the training set at a given time-point. We test the trained classifier's predictive performance using X test as an input feature matrix to predict y test 2 1; 21 f g, the binary class labels of the test set at a given time-point. In this article we set X N t ð Þ5X N t 2 ð Þ and X T t ð Þ5X T t 2 ð Þ when we predict the follow-up class labels using projected follow-up cross-sectional samples and X N t ð Þ5X N t 1 ð Þ and X T t ð Þ5X T t 1 ð Þ when we predict the baseline labels using projected baseline samples. It is also possible to make prognostic predictions of the follow-up labels by projecting the baseline crosssectional data matrices X N t 1 ð Þ and X T t 1 ð Þ onto the longitudinal subspace. We have chosen to transform the features back into the original D31 space by post-multiplying by U k T , allowing an easier interpretation of the resulting model's D31 weight vector as voxels in a masked brain image. Alternatively, to obtain low-dimensional feature vectors, one may simply project the baseline or follow-up features onto U k instead of U k U T k . Figure 2 helps provide an understanding of the projection described by Eq. (2) by means of a simple twodimensional longitudinal subspace example. The projected features can be seen as the components of cross-sectional data that lie in a space of common longitudinal change described by the retained PCs (eigenslopes) in U k . In the figure there are two retained PCs, forming a twodimensional plane on which a higher dimensional crosssectional (follow-up) sample is projected. The schematics in Figures 3 and 4 depict the formation of training and test features at each cross-validation fold for the Within-Set Prediction and Information Transferring Prediction problems respectively, with matrix U k formed by performing LM-PCA on balanced longitudinal design data in both cases.

Kernel Formation
To keep the number of tuneable parameters to a minimum, we use a linear kernel of the form K5XX 0T , where X and X 0 may be a training or testing feature matrix. Without loss of generality, the kernel used during classifier training can be expressed as as U k is composed of a set of orthonormal vectors. Therefore, the effective dimensionality of the feature vectors is reduced from D to k by the PCA procedure described. Prior to use in training and testing, we mean-centre the features using the training data.

Classification Algorithm
As a feature construction method, our method can be used with a range of different classification algorithms. In this work we show results using the linear SVM algorithm, which has been widely applied to neuroimaging [Davatzikos et al., 2008;Mourão-Miranda et al., 2005]. Lemm et al. [2011] andOrr u et al. [2012] provide a thorough explanation and review of the algorithm's application to disease state classification using neuroimaging. We used the LIBSVM [Chang and Lin, 2011] implementation of the C cost SVC described above [Cortes and Vapnik, 1995]. We used a fixed value of C51 throughout all classifications. Figure S1 in the Supporting Information shows that this choice provides near optimal prediction performance for the datasets we consider here because it lies in a stable region of maximal classifier performance across the three types of features used in Table I. However, in some cases it may be necessary to optimize C by crossvalidation to obtain optimal performance. This may be the case if the input vectors are already relatively lowdimensional (e.g., region of interest [ROI] summary measures).

Nested Cross-Validation
The classification model as described has only one parameter that must be tuned via nested cross-validation: (2), we consider a simple case of a sample vector (Follow-up Sample) that lies in three dimensional space (an image consisting of three voxels) being projected onto the two dimensional plane described by two principal component vectors (PC1, PC2) extracted from a hypothetical coefficient matrix describing subjects' longitudinal changes. The projected sample (Projected Follow-up) is thereby composed of the components of the sample vector that lie in the space of common longitudinal changes, described by the principal components. r Longitudinal Information in Pattern Recognition r r 4391 r the number of PCs k that are retained to form the lowrank projection matrix U k U k T . We used an outer leaveone-out cross-validation (LOO-CV) scheme to make predictions via the optimized parameter k and an inner LOO-CV scheme to optimize k. Within each inner CV fold, the model was trained and tested across a range of k values corresponding to specified amounts of variance explained by the retained PCs. For a desired fraction of explained variance p var ; k is chosen as the minimum such that P k i51 k i = P n i51 k i p var , where k i is the ith highest eigenvalue of the covariance matrix used in the PCA [Bishop, 2007]. In this study, we varied explained variance from 5% to 95% in increments of 5%. The final optimized value of k, used in the outer CV fold, was based on the explained variance value that had the highest balanced accuracy across all inner CV folds.
The balanced accuracy, introduced by Brodersen et al. [2010], is the average of Sensitivity5TP= TP1FN ð Þ and Specificity5TN= TN1FP ð Þ . Here TP, FP, TN, FN are the number of true positives (correctly classified disease subjects), false positives (incorrectly classified control subjects), true negatives (correctly classified control subjects) and false negatives (incorrectly classified disease subjects), respectively. This metric, Sensitivity1Specificity À Á =2, was used rather than the overall accuracy, defined as TP1 TN ð Þ = TP1TN1FP1FN ð Þ , as it is less sensitive to imbalanced class sizes.  Tables I and III. We have emphasized that the longitudinal training set is equivalent to the classification training set in this case: the same follow-up data used to form the longitudinal difference matrix [scaled by the time interval between scans as in Eq. (1)] is projected onto the longitudinal subspace in the training step.

Method Comparison and Performance Evaluation
In addition to the balanced accuracy, sensitivity, and specificity measures of classifier performance already mentioned, we provide measures of positive predictive value (PPV), negative predictive value (NPV) and a summary of the receiver operating characteristic's area under the curve (ROC AUC). PPV is the proportion of disease state (11 class) predictions which are correct, PPV5TP= TP1FP ð Þ . N PV is the proportion of healthy state (21 class) predictions which are correct, NPV5TN= TN1FN ð Þ . The ROC AUC plots the true positive versus false positive rate as the classifier's decision threshold is varied and is considered to be a robust measure of classifier performance in the sense that it is independent of an arbitrary choice of decision threshold.
To compare the performance of a classifier that uses our proposed features, we also build a cross-sectional classifier using features based strictly on either the baseline or follow-up imaging features, as appropriate, which reflects current practice. We also demonstrate the performance of longitudinal features formed using the within-subject registration of Ashburner and Ridgway [2013], as described in the Data Pre-processing section.  Table II. We have emphasized that the longitudinal and classification subject sets are disjoint in this case, with the data in the "Information Transfer" area in blue being used to form the longitudinal difference matrix [scaled by the time interval between scans as in Eq. (1)]. As a result, the follow-up data that is projected onto the the longitudinal subspace differs from the followup data used to form the longitudinal difference matrix.

Permutation Testing
Permutation tests were performed to assess the statistical significance of the balanced accuracies of each classifier, relative to chance. Each permutation test involved randomly rearranging the order of the elements in the vector of n subjects' class labels y 1 ; . . . ; y n ½ T to form a random vector of class labels that retains the original number of subjects in each class. Predicting these random class labels allowed us to build up null distributions of balanced accuracies for our proposed features and the comparison features. A P-value was then estimated by dividing the number of balanced accuracies in the null distribution that match or exceed the true balanced accuracy by the total number of permutations. In this work we present results using null distributions built with 1,000 permutations of class labels [Jockel, 1986;Mourão-Miranda et al., 2005;Nichols and Holmes, 2002].

Model Interpretation: Weight Maps and Forward Maps
As the resulting SVM weight vector w is in the original Ddimensional feature space, we can visualize the weights of the classifier. In general, the weight vector is calculated as where / x i ð Þ is the feature vector in the space implicitly defined by the chosen kernel and the a i 's are the signed versions of the a i 0 SV weights from the SVM optimization [Chang and Lin, 2011]. As we use linear features in this work, we have / x i ð Þ5x i . Equation (4) can then be expressed as whereX is the mean-centred version of a training feature matrix and a is the n31 vector of a i 's, with entries corresponding to the rows ofX .
In addition to visualizing the classifier weights, we can also build a "forward map," defined in Haufe et al. [2014], in the following way: Here, a is the D31 mapping that encodes the n31 classifier function outputỹ in the feature space. The weight map defined by Eq. (5) is useful for understanding the contribution of each feature toward the prediction of a sample's diagnostic label. As discussed in Haufe et al. [2014], interpretation of such maps should be done with care as a feature may have a high weight by virtue of a group difference or as a result of high collinearity between features, for example, features may obtain a high weight to cancel out noise in other features. In contrast, the encoding weights from the covariance-based forward map in Eq. (6) represent the group differences between classes, which are often of interest when interpreting a trained classifier. If the data are standardized, the forward maps are equivalent to "structure coefficients" widely used in multiple linear regression [Kraha et al., 2012]. When discriminating a positively labelled (disease) class from a negatively labelled (control) class, stronger positive values in a forward map indicate a stronger association of a region with the positive class while negative values indicate a stronger association with the negative class (alternatively, a weaker association with the positive class).

RESULTS
The results in this section make reference to the two types of classification problems described in the Methods section, namely the Within-Set Prediction and Information Transferring Prediction problems as well as the two types of PCA-based projections: LTC-PCA and LM-PCA. We will refer to Balanced problems when the longitudinal subject set has both fixed follow-up times and a fixed number of scans per subject (in this case two) and Unbalanced problems when the longitudinal subject set has either varying follow-up times or a varying number of scans per subject. We applied our proposed projection method (LM-PCA and/or LTC-PCA) as well as several comparison methods to five classification problems across two datasets (described below, with results shown in Tables I-IV).

Balanced within-Set Prediction (HNRS Dataset)
As the HNRS has a balanced longitudinal design, with a baseline and follow-up time-point made available, we use the balanced version of LM-PCA, described by Eq. (1) (Balanced Within-Set Prediction problem using this dataset, predicting the follow-up time-point's diagnostic labels). These results are derived from the 24 MCI subjects and 23 HCs with data from both time-points (follow-up periods of 2.5 6 0.2 years). To verify that the improvement we show due to the LM-PCA projection is not due strictly to dimensionality reduction, we projected cross-sectional (follow-up) samples onto cross-sectional (baseline and follow-up) subspaces. Table I shows the classifier performance measures using unprojected follow-up features, "longitudinal coregistration" based features [Ashburner and Ridgway, 2013], and follow-up features projected onto three different subspaces: formed using baseline timepoint information, formed using follow-up time-point information and formed using LM-PCA. In all cases, we performed nested cross-validation to choose the optimal number of retained PCs in the projection within each outer cross-validation fold. The LM-PCA projected features were the only ones whose balanced accuracy exceeded chance (P < 0.002, permutation test). In Figure S2 in the r Aksman et al. r r 4394 r Supporting Information, we see across all cross-validation folds 90% (k55 retained PCs) and 95% (k516 retained PCs) explained variances were selected to form LM-PCA projections. Table II shows the performance measures obtained when testing the information transfer capability of our approach using the HNRS dataset. Here, we formed a balanced LM-PCA projection matrix with the subjects used in Table I (24 MCI, 23 controls), then predicted the baseline time-point's diagnostic labels using unprojected and LM-PCA projected baseline features on a disjoint set of subjects that had a baseline scan but no available follow-up (79 MCI subjects and 87 controls). 1 Also shown in Table II is the result of predicting the follow-up time-point's diagnostic labels using unprojected, follow-up subspace projected and LM-PCA projected follow-up features on a disjoint set of subjects with a follow-up scan but no corresponding baseline scan (10 MCI subjects and 20 controls) using the same projection matrix (formed with 24 MCI, 23 controls). As each subject has only one time-point's information in these experiments, we could not form the "longitudinal coregistration" based features in this case as in the other tables. In this case, none of the balanced accuracies statistically exceeded chance under permutation testing.

Balanced Within-Set Prediction (OASIS Dataset)
The OASIS dataset has an unbalanced longitudinal design, with each subject scanned on two or more visits. To compare the two time-point, balanced design LM-PCA method across datasets we mimicked a balanced longitudinal design with this dataset by restricting the follow-up times of subjects to be roughly similar to that of the HNRS dataset. We used a subject set composed of subjects with follow-up scans between 1.4 and 2.5 years after baseline, resulting in followup periods of 1.9 6 0.3 years (24 subjects with very mild dementia, i.e., CDR 0.5, and 25 HCs, i.e., CDR 0). Table III shows the classifier performance metrics in this case. The LM-PCA projected features' balanced accuracy exceeded chance (P < 0.01, permutation test) while the unprojected follow-up features' and "longitudinal coregistration" features' balanced accuracies did not. As in Table I we also projected follow-up samples onto subspaces formed using both baseline and follow-up information. In neither case did the classifier balanced accuracies achieve statistical significance. In Figure S2 in the Supporting Information, we see in this case that the number of PCs explaining mostly 55% and 60% of variance were selected across cross-validation folds to form LM-PCA projections, with k5 8 and k59 retained PCs respectively. This contrasts with the higher explained variances (90% and 95%) selected with the HNRS dataset (Table I), highlighting the need to use nested cross-validation to tune this parameter.

Unbalanced Within-Set Prediction (OASIS Dataset)
In the final experiment, we considered the problem of discriminating those subjects with three or more longitudinal time-points using the OASIS dataset, without restricting the time interval between samples. To increase the number of disease class subjects, we formed a class consisting of five subjects with mild dementia (corresponding to CDR 1) along with nine subjects with very mild dementia (CDR 0.5), for a total of 14 subjects being discriminated from 28 healthy subjects (CDR 0). There were 33 subjects with three longitudinal measurements, 8 subjects with four longitudinal measurements and one subject with five longitudinal measurements.
We compared the performance of features formed using the cross-sectional data at the final follow-up time-  3)] using the last two time-points' information for each subject. In this case the time interval between scans is the shortest possible for each subject, resulting in an interval of 2.0 6 0.8 years across subjects. The "Final Follow-up, LM-PCA Projected (2 TPs, Long)" features were formed in a similar manner by creating an LM-PCA projection using the first and last time point for each subject, that is, with the longest time intervals between scans for each subject, resulting in an interval of 4.2 6 1.2 years across subjects. Finally, "Final Follow-up, LTC-PCA Projected (All TPs)" were formed using all time-points for each subject. We used the more general LTC-PCA for this (with polynomial model order P51, as it is in LM-PCA), which allowed us to estimate the slope of samples' longitudinal trajectories using all (three or more) time-points of each subject rather than having to select two time-points when using LM-PCA. As in Table I and III we also compare to "longitudinal coregistration" features. In this case we coregistered the first and last time-point of each subject to best compare to the LTC-PCA projected features. Table I  Predicting Follow-up Class Label (MCI n 5 10, HC n 5 20) using disjoint longitudinal subject set (MCI n 5 24, HC n 5 23) from Table I   The performance metrics for all features are shown in Table  IV. Classifiers that used the "longitudinal coregistration" and LTC-PCA projected features were the only ones that statistically exceeded chance under permutation testing. This suggests that when more than two longitudinal samples are available per subject, we can derive a better estimate of the linear coefficient matrix with the more general LTC-PCA than with the two-sample LM-PCA approach.

Statistical Comparison of Methods
Because of the small sample sizes in each comparison, we have limited statistical power to detect differences between the methods within individual comparisons. Indeed, LM-PCA/LTC-PCA projection only improved accuracy relative to unprojected cross-sectional features or "longitudinal coregistration" based features in one case (LM-PCA projected vs. unprojected features in Table I, P < 0.05, McNemar's test). However, we are more interested testing whether our method provided a consistent improvement overall (i.e., across all classification problems and datasets) rather than whether it improves individually for any individual classification problem. To assess this, we performed a 3 3 1 repeated measures analysis of variance (ANOVA) with three different methods as the "withincomparison factor" with the five balanced accuracies across comparisons as the samples for each method. The three methods compared were: (i) the most appropriate version of our method for each contrast (LM-PCA for the comparisons in Tables I-III or LTC-PCA for the comparisons in  Table IV); (ii) the latest follow-up samples projected onto the latest follow-up subspace (the single follow-up in Tables I and III, the baseline in the first part of Table II, the follow-up in the second part of Table II, the final follow-up in Table IV) and (iii) features formed using the (unprojected) latest follow-up samples. The results show a significant difference in the effect of the projection method, with F(1.52, 6.08) 5 10.91, P 5 0.01. Post-hoc pairwise t-tests showed that LM-PCA/LTC-PCA projected features performed better overall than unprojected features (P 5 0.03) and cross-sectionally projected features (P 5 0.003) whereas cross-sectionally projected features did not outperform unprojected features (P 5 0.74). We could not include "longitudinal coregistration" based features in the ANOVA as it was not possible to form such features in Table II due to a lack of either baseline or follow-up information in the classification subject set. Therefore, we conducted an additional paired t-test, comparing the balanced accuracies of LM-PCA in Tables I and III and LTC-PCA in Table IV against "longitudinal coregistration" based features' balanced accuracies. This showed no significant differences.

Discriminative and Forward Brain Maps
Figures 5 and 6 display t-statistic images, weight maps and forward maps for the cross-sectional features and LM-PCA projected cross-sectional features derived from the Balanced Within-Set Prediction problem (HNRS and OASIS datasets, respectively). For both types of features, the figures show a good correspondence between the unthresholded two sample t-statistic maps describing group differences between disease subjects and controls at each voxel and forward maps that were generated via Eq. (6). The weight maps depicted, generated via Eq. (5), are useful for understanding which regions are driving the classifier's decisions. However, they differ greatly from the other two map types, reflecting the fact that they do not represent group differences [Haufe et al., 2014].
In Figure 5, which corresponds to the HNRS dataset result in Table I, we see that the cross-sectional follow-up feature-based forward maps and the LM-PCA projected counterparts have negative values (cold colors) across many gray and white matter regions. As we are interpreting maps of Jacobian determinant features, quantifying the values in the parietal and precuneus areas, the frontal pole and the temporal lobe relative to the cross-sectional map.
In Figure 6, which corresponds to the OASIS dataset result in Table III, we see that both the cross-sectional and LM-PCA projected cross-sectional forward maps show Comparison of maps for Balanced Within-Set Prediction problem, OASIS dataset. Mass univariate differences between groups' features via an (unthresholded) t-statistic and classifier weight and forward maps, discriminating very mild dementia (CDR 0.5) versus HC using cross-sectional (follow-up time-point) features compared to LM-PCA projected cross-sectional (same time-point) features, corresponding to Table III. Weight and forward maps are scaled such that the maximum absolute value in each image is equal to one. In all maps, positive values (hot colors) depict areas of stronger association to the positive class (CDR 0.5 group) while negative values (cold colors) depict areas of stronger association to the negative class (HC group). r Longitudinal Information in Pattern Recognition r r 4399 r strong positive values within the lateral ventricles and the Sylvian fissure, indicating a regional shrinkage of brain tissue in early dementia subjects compared to controls. In addition, there are negative values in the temporal lobes in both sets of forward maps. Qualitatively, the two sets of forward maps in this figure resemble each other, with some evidence of de-noising due to the PCA procedure via a reduction in the number of negative value regions in the LM-PCA projected cross-sectional map compared to the purely cross-sectional forward map. Overall, these patterns of coefficients are highly consistent with the literature on whole brain and ventricular volume change in dementia subjects relative to controls [Freeborough and Fox, 1997;Jack et al., 2004;Nestor et al., 2008].

DISCUSSION
We have introduced a new feature construction-based technique for pattern recognition, which makes use of the longitudinal information available in both unbalanced and balanced longitudinal designs. We demonstrated the effectiveness of this technique using high-dimensional neuroimaging data on the problem of discriminating both MCI subjects from HCs and early dementia subjects from controls, showing improvements in accuracy across two separate datasets. The method introduces a single tuning parameter and, as a feature construction operation, is able to work with existing registration algorithms used in preprocessing and any pattern recognition algorithm (including regression models). In principle, the method is not limited to neuroimaging and can be applied to any highdimensional longitudinal dataset.
We have shown that: (i) in the case of balanced longitudinal designs with two samples per subject, projected crosssectional features have higher predictive accuracy in discriminating disease than do unprojected cross-sectional features; (ii) for unbalanced designs, higher classification accuracy can be attained using three or more samples to estimate the linear coefficient matrix used to form the necessary longitudinal projection compared to using the minimum of two samples and (iii) information transfer from one set of subjects to another is possible using the proposed method. In addition, we have shown that the proposed features, based on a linear transformation, retain the ability to visualize a linear classifier's weight maps and forward maps.
Comparing our method to others in the literature, the improvements we show on the Balanced Within-Set Prediction problem are similar to those achieved by Chen and DuBois Bowman [2011]. However, a direct comparison is not possible as the authors were discriminating AD from controls using PET data with a 12-month follow-up period. The authors report an approximately 10% improvement in predictive accuracy using ROI-based voxels, similar to what we achieved on the OASIS dataset. Another comparison can be made to Wolz et al. [2010], who created nonlinear manifolds using longitudinal and cross-sectional information into which they embedded both crosssectional and longitudinal data to form features for classification. The authors also considered the problem of discriminating MCI from controls (among other contrasts between AD, MCI, and controls) using T1-weighted structural MRI. They showed improved discrimination accuracy from 64% using baseline imaging features to 69% using baseline plus longitudinal imaging features. Ziegler et al. [2015] estimated rates of change (slope) within subjects' samples in a mass-univariate context, noting that a smaller number of longitudinal samples per subject resulted in reduced parameter accuracy. This is in agreement with the results we present in Table IV, where the "Final Follow-up, LTC-PCA Projected (All TPs)" features indicate that improved estimation of the linear coefficient matrix using three or more longitudinal samples per subject leads to better predictions compared to LM-PCA projected features, which were based on the minimum of two samples per subject (see below).
There appears to be an advantage in using balanced design data rather than unbalanced data when forming longitudinal projections. In Tables I and III we see that features formed using balanced LM-PCA projections have statistically significant balanced accuracies while in Table IV the unbalanced LM-PCA projected features' balanced accuracies did not achieve significance. This observation is in agreement with Ziegler et al. [2015], who note that less balanced designs led to poorer correspondence of slope estimates with ground truth as well as higher noise levels. When unbalanced design information is available, the result in Table IV suggests that better predictions result from using all available time-points' information when forming a longitudinal projection (through the use of LTC-PCA).
In this article, we have started with the assumption of subject-specific models of longitudinal trajectories, deriving expressions for the first-order coefficient (i.e., slope) matrix describing intra-subject changes over time, across all subjects and all dimensions, for the general case of an unbalanced longitudinal design. The benefit of such an approach is its conceptual and computational simplicity, as one can compute each subject's coefficients once and then assemble the desired coefficient matrix (of any model order) for a particular set of subjects at each cross-validation fold. Longitudinal trajectories are likely correlated among similarly aged subjects, however, and accounting for these intersubject correlations may lead to better estimates of the desired coefficient matrix, potentially improving the Information Transferring Prediction capability of our method.
Accounting for inter-subject correlations is particularly important for neuroimaging data, where the number of longitudinal samples per subject is often small, amounting to two samples per subject in the HNRS dataset and between two and five in the OASIS dataset. Therefore, for future work we will investigate methods to model such correlations, potentially providing better estimation of subject level coefficients by borrowing strength between r Aksman et al. r r 4400 r subjects. Multi-task learning models have recently been applied to neuroimaging to account for between-subject correlations [Marquand et al., 2014;Zhang and Shen, 2012] and may be particularly well suited for this purpose.
We have focussed on building a subspace projection using the matrix of first-order coefficients. One potential limitation of this decision, imposed by a relatively small number of subjects having full follow-up data in both datasets, is that the subspace we build is most appropriate for situations where the linear term dominates temporal trajectories. There is evidence that the non-linear component of trajectories may be important, particularly in older subjects [Fjell et al., 2014;Raz et al., 2010;Raz and Lindenberger, 2011]. To this end, one can estimate separate projections based on the first-order and second-order coefficient matrices, selecting the number of PCs retained in each projection using a common or model order specific amount of explained variance, for instance. Features could be then composed of a concatenation of cross-sectional features projected onto the linear coefficient subspace with features projected onto the quadratic coefficient subspace. Estimating the matrix of second-order coefficients requires at least three time-points per subject, with the result in Table IV suggesting that more than this minimum may be necessary for good estimates. Indeed, the small sample size that results from requiring that all subjects have data in all timepoints is an important limitation of this study. Across the five contrasts we performed, we have shown that there is an improvement due to LM-PCA/LTC-PCA projected features relative to strictly cross-sectional features; however, due to limited number of longitudinal samples, we could not show a consistent improvement within each contrast. Therefore, additional validation of the method on larger samples (such as ADNI 2 ) and exploring the conditions under which using higher order coefficient matrices would benefit a predictive model is an important avenue for future work.
In summary, we have introduced a novel means of capitalising on longitudinal information for pattern recognition analysis of high-dimensional data. We have provided a conceptual framework for modelling longitudinal trajectories of change over time, which enables: (i) the use of balanced and unbalanced longitudinal designs; (ii) the modelling of linear and non-linear effects; and (iii) the ability to transfer longitudinal information between sets of subjects. Our results suggest that longitudinal subspace projection is a promising method for pattern recognition analysis of longitudinal neuroimaging data.