Prediction of Expected Years of Life Using Whole-Genome Markers

Genetic factors are believed to account for 25% of the interindividual differences in Years of Life (YL) among humans. However, the genetic loci that have thus far been found to be associated with YL explain a very small proportion of the expected genetic variation in this trait, perhaps reflecting the complexity of the trait and the limitations of traditional association studies when applied to traits affected by a large number of small-effect genes. Using data from the Framingham Heart Study and statistical methods borrowed largely from the field of animal genetics (whole-genome prediction, WGP), we developed a WGP model for the study of YL and evaluated the extent to which thousands of genetic variants across the genome examined simultaneously can be used to predict interindividual differences in YL. We find that a sizable proportion of differences in YL—which were unexplained by age at entry, sex, smoking and BMI—can be accounted for and predicted using WGP methods. The contribution of genomic information to prediction accuracy was even higher than that of smoking and body mass index (BMI) combined; two predictors that are considered among the most important life-shortening factors. We evaluated the impacts of familial relationships and population structure (as described by the first two marker-derived principal components) and concluded that in our dataset population structure explained partially, but not fully the gains in prediction accuracy obtained with WGP. Further inspection of prediction accuracies by age at death indicated that most of the gains in predictive ability achieved with WGP were due to the increased accuracy of prediction of early mortality, perhaps reflecting the ability of WGP to capture differences in genetic risk to deadly diseases such as cancer, which are most often responsible for early mortality in our sample.


Introduction
Agricultural and biomedical research has shown through controlled experiments and familial studies that many complex traits are highly heritable, suggesting that in principle, such traits could be predicted early in life from knowledge of individuals' genotypes. Human longevity is not an exception: empirical evidence from twin and familial studies indicate that approximately 25% of inter-individual differences in human lifespan can be attributed to genetic factors [1][2][3].
Research with model organisms offers several examples of genetic polymorphisms having a sizable effect on lifespan [4]. However, although genome-wide association studies (GWAS) and linkage scans in humans have uncovered several regions significantly associated with longevity and aging traits [5][6][7][8][9], only a few of these associations have been consistently confirmed, and our ability to predict inter-individual differences in expected Years of Life (YL) remains limited [7].
Several diseases (e.g., cancer, cardiovascular disease) and biological events (e.g., stroke, heart failure) can lead to death, and the genetic architecture (i.e., the set of genes having an effect on the trait and the ways they interact) of each of these mortalityrelated traits is expected to be disorder-specific. Therefore, the genetic architecture of YL is likely to include a large number, perhaps thousands, of possibly interacting genes.
Recent articles [10,11] have suggested that the limited advances in our ability to predict complex human traits and diseases using genomic information may partially reflect the limitations of traditional GWAS to detect significant associations with complex genetic architectures. These authors have suggested that Whole Genome Prediction (WGP) may be better suited than traditional GWAS to the prediction of complex traits.
Whole genome prediction exploits multi-locus linkagedisequilibrium (LD) between quantitative trait loci (QTL) and genome-wide markers (e.g., SNPs) to predict inter-individual differences in a quantitative trait that are attributable to genetic factors. Unlike traditional association studies, in which the association between markers and phenotypes is tested one marker at a time, WGP uses all available markers to regress phenotype onto genomic information. This methodology was first proposed in the field of animal breeding by Meuwissen Hayes and Goddard in 2001 [12]. Since then, several simulation [12,13] and empirical studies have demonstrated its predictive power with plant [14,15] and animal [16][17][18][19] data.
More recently, research with human height showed that much of the so-called missing heritability of complex traits could be recovered using genome-wide panels of common variants [11] and, more importantly, that regression using WGP methods can improve the prediction of yet-to-be observed human phenotypes [20]. A next logical question is whether these findings apply to traits of greater medical or practical importance. Here, we: (a) extend WGP methods, which were originally developed for continuous un-censored outcomes, to accommodate censoring, a feature commonly encountered in applications with human data, (b) developed a WGP model for YL and (c) quantified the ability of this model to account for and to predict inter-individual differences in human YL that are not accounted for by major factors such as sex, Body Mass Index (BMI, kg/m 2 ) and smoking.

Model
Many outcomes in human-health studies are either binary (e.g., presence/absence of diseases) or are subject to censoring (i.e., bounds of the outcome are known, but the exact value of outcome remains unknown). And it is well established that ignoring censoring yields biased estimates [21]. The linear models commonly used for WGP can be easily extended to accommodate binary or censored outcomes. Here, we present an extension that accommodates censoring. Similar ideas can be used to model binary outcomes as well [22].
In our WGP models, we describe YL (y i , i = 1,…,n) as the sum of individual-specific means (m i ) which, as we explain below, will be a function of genetic and non-genetic factors, and of a model residual (e i ) which is assumed to be a normal random variable with mean zero and variance s 2 ; therefore y i~mi ze i . For individuals with known YL, we observey i ; for individuals with censoring at age equal to t i , the observed event is y i wt i . In our WGP model, expected YL (m i ) was described using a linear regression, which had three components: m, an effect common to all subjects; P J j~1 x ij c j , a regression component accounting for the effects of non genetic covariates (sex, smoking and BMI covariates in our application); and P L l~1 z il b l , a regression on SNP genotypes z ij È É where z ij [ 0,1,2 f g counts the number of copies of the least frequent allele at the j th SNP. By combining (1) with the normal assumptions described above, we derived the likelihood function for censored and un-censored individuals (see Methods S1 for further details).
The Bayesian model is completed by assigning a prior density to the collection of model unknowns m,c,b,s 2 f g . Here, we structure the prior density using a modified version of the Bayesian LASSO (BL) [23]. This model has been effectively used for WGP in plants [14,15,24], animals [14,18,19,25] and humans [20]. We extend this model to accommodate censoring as well as effects other than those of markers. In our model, we assigned independent vague prior densities to the intercept (m) and to the effects of sex, smoking and BMI (c). This treatment yields estimates of the effects of these non-genetic factors that are similar to those obtained with likelihood-based methods. For the remaining unknowns we adopt the prior-specification of the BL of Park and Casella [23] (see Methods S1 for further details). The joint prior-density (see expression 2 in the Methods S1) is indexed by a set of four hyperparameters, including the prior degree of freedom and scale assigned to the residual variance (denoted as df and S, respectively), and the rate and shape parameters (dentoed as d and s, respectively) assigned to the regularization parameter of the BL. A discussion of how these can be chosen is given in Perez et al. [26].
Here, following those guidelines, we set H~df~5,S~170,d~1|10 {4 ,s~2 È É . Given the characteristics of our data (sample size, number of markers and allele frequencies and observed variability on YL), these values provide priors with small influences on predictions.

Implementation
Models were fitted using a modified version of the BLR package [27] of R [28] which handles censoring (right, left and interval) according to the model described above. In addition to BLR, Rpackages bayesm [29], splines [28] and SuppDists [30] were used to implement the sampler.

Data
(N = 5,117) were from the original (N = 1,493) and offspring (N = 3,624) cohorts of the Framingham Heart Study. Data and material distributions from this study are made in accordance with the individual consent history of each participant (see http://www. framinghamheartstudy.org/research/consentfms.html for further details about consent forms). And the current study has been approved by the Internal Review Board of University of Alabama at Birmingham (IRB Protocol Number: X090720002). The criteria for inclusion in the study included being 18 years or older at time of recruitment, having survival information as of 2007, and having complete information for covariates (sex, smoking and BMI).
Average age at entry was 37 with a standard deviation (SD) of 9.0 years. Of the participants, two thirds (N = 3,390) were censored (i.e., at the time at which survival records were defined, these individuals were still alive), 55% were female, and 36% never smoked. Mean BMI at first exam was 25.0 with a SD of 4.1 kg/ m 2 . Subjects were genotyped using the Affymetrix GeneChip Human Mapping 500K Array Set. For details on the genotyping method, please refer to Framingham SHARe at the NCBI dbGaP website (http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/ study.cgi?study_id = phs000007.v3.p2). Other editing and genotyping quality control and imputation procedures were as described in Makowsky et al. [20].

Primary Data Analysis
Using the specification of equation (3) we generated a sequence of nested models by changing the predictors included in the righthand side of the linear predictor (m i ). Our baseline model (denoted as M A-0 ) includes an intercept, sex, and age at entry; the latter modeled nonparametrically using a 4-df natural spline [31] with interior and boundary knots chosen using the default specifications of the natural spline (ns) function of the spline package of R [28]; with 4-df, interior knots were placed at the 25 th , 50 th and 75 th sample percentiles of the predictor variables. We extended this model by adding smoking and BMI (also modeled nonparameterically using a 4-df natural spline [31]) this model is denoted as M B-0 . Subsequently, models M A-0 and M B-0 were then extended by adding subsets of evenly spaced SNPs, from 2.5K (K = thousand) to 80K; these models were denoted as M ( Models were first fitted to the entire dataset to obtain parameter estimates (estimated posterior means of effects and of variance parameters) and to evaluate the goodness of fit and the Deviance Information Criterion [32]. Subsequently, the prediction accuracy of each of the models was assessed using a 10-fold cross-validation (CV). Prediction accuracy was evaluated using two different metrics: a CV R-squared (R 2 CV ) and the area under Longitudinal Receiving Operating Characteristic Curves (AUC(t) [33]). The CV measures the proportion of inter-individual differences in years of life that can be accounted by CV-predictions, this was is the estimated average YL, derived from a model that only included an intercept and the effect of age at entry, which is taken here as our baseline model. This statistic can be evaluated only with subjects that have already died; therefore, in the 10-fold CV, the summation in the formula for R 2 CV uses only data from subjects with an observed age at death. Un-censored subjects do not constitute a random sample of individuals and this may induce bias in our estimate of R-squared. Because of this, we consider a second measure of prediction performance based on longitudinal AUCs [33]. To this end we defined a sequence of thresholds (t = 60, 65, 70, 75, 80, 85, 90, 95 YL) and for each of these thresholds we generated survival indicator variables if individual i was still alive at time t, and un-determined if individual i had an age at censoring smaller than t. The number of individuals for which d i,t was determined (i.e., those that had known YL or age at censoring greater than t) were 4495, 3836, 3262, 2773, 2366, 2102, 1889 and 1762 for t~65,…, t~95, respectively. Using these survival indicator variables and CV predictions of YL (m m i,M: ) derived from the models above described we computed the AUC(t) for every threshold using the R-package pROC [34].

Evaluation of the effects of population structure and familial relationships on prediction accuracy
The distribution of genotypes, their allele frequency, levels of LD, etc., can be affected by factors such as population structure, admixture or familial relationships. Therefore, a certain proportion of the prediction accuracy of WGP could be attributed to those factors. To further explore this, a series of additional analysis were carried out. First, in order to account for population structure, we extended the model including age, sex, smoking and BMI as predictors (M B-0 ) by adding the effects of the first two principal components (PCs) derived from the same set of 80K SNPs used in M (.)-80K . Second, to quantify the relative importance of familial relationships on prediction accuracy we carried out two additional analyses: (a) we extended M B-0 by adding an effect representing a regression on the pedigree. This was done using the standards of the additive infinitesimal model of quantitative genetics [35], and this model is denoted as M B-PED . And (b) we fitted models M A-0K , M B-0K and M B-80K in a 10fold CV where entire families, as opposed to individuals, were assigned to folds; therefore, in this CV predictions are derived from nominally-unrelated individuals.

Full data analysis
Using M B-0 , we estimated an average (6 posterior SD) difference in YL between females and males of 3.1 (60.42) years and between smokers and nonsmokers of 24.1 (60.44) years. Using estimates from M B-0 , we computed the expected YL of a nonsmoking 35-year-old by sex and BMI; the results are displayed in (Figure S1). Expected YL was greatest within the range BMI[ [20,25]; extreme BMI values, lower than 20 or higher than 25, were associated with a decrease in YL. Using M B-0 we estimate an expected decrease in YL of 0.43 year per extra unit of BMI in the range BMI[ [25,40]. Overall, these patterns are in agreement with what has been reported previously for the effect of sex [36][37][38], smoking [36,39] and BMI [21,36] on YL. Table 1 shows estimates of residual variance and DIC by model. The intercept-only model (not included in Table 1) yielded an estimate of variance of YL of 135, and the estimated residual variance of M A-0 was 104.1; therefore, approximately 23%, computed as 100| 1{(104:1=135) ð Þ , of observed variability in YL in our dataset can be explained by differences in age at entry and sex. Model M B-0 yielded an estimate of residual variance of 98.7, indicating that BMI and smoking accounted for about 5% of inter-individual differences in YL that were not accounted for by age at entry and sex; this was computed as 100| 1{98:7=104:1 ð Þ . Adding SNPs to M A-0 or M B-0 resulted in a marked increase in goodness of fit, and this is reflected in a substantial reduction in the estimated residual variance (Table 1). For instance, M B-80K yielded an estimate of the residual variance that was 65% smaller than that of the M B-0K , computed as 100| 1{34:4=98:7 ð Þ . Due to the curse of dimensionality [40], the increase in goodness of fit achieved by adding SNPs to the model may reflect genetic variability captured by SNPs, over-fitting, or a combination of both. However, DIC, a model comparison criterion that balances goodness of fit and model complexity, decreased monotonically with the number of SNPs, suggesting that information is being added as marker density increases. Figure 1 shows estimated R 2 CV versus marker density (from 0 to 80K) by model. The R 2 CV of a model including age at entry and sex, R 2 CV M A{0 ð Þ, was approximately 6%. The addition of smoking and BMI resulted in a doubling of R 2 CV , from

Evaluation of prediction accuracy in cross validation
; as expected, the addition of smoking and BMI increased prediction accuracy by a sizable amount. Prediction accuracy increased monotonically with the number of markers both in models with and without BMI and smoking covariates. These results confirm that markers are capturing information about expected YL that cannot be predicted using major factors such as age at entry, sex, smoking and BMI. Using 80K markers, we were able to increase R 2 CV from 6% to 11% for the model without smoking and BMI (M A-(.) ) and from 12% to 21% for the model including smoking and BMI (M B-(.) ). (Table S1)  The above results indicate that markers can explain a sizable proportion of inter-individual differences in YL that are not accounted for by age at entry, sex, smoking and BMI. To obtain further insights into the source of this improvement in prediction accuracy, we present in Figure 2 the average absolute value of the CV prediction error (from the 10-fold CV) and its SE by range of YL for models M B-0K and M B-80K . As expected, for both models, the absolute value prediction error was lowest for people dying around median age (80 YL) and increased for people dying early or late in life. Predictions derived from model M B-80K were much more accurate than those of M B-0 for the prediction of YL of people dying early in life; however, the prediction accuracy of the model with markers was slightly higher than that of M B-80K for subjects dying at intermediate ages. This suggests that the overall higher predictive ability of M B-80K is due mostly to improvements in prediction of early mortality. Figure 3 shows the AUC (vertical axis) for models M A-0 , M B-0 , and M B-80K for each of the 8 thresholds (horizontal axis). Adding BMI and smoking information to a model that included sex and age (M B-0 vs M A-0 ) resulted in an increase in AUC(t) of roughly 5-7%. When 80 thousand SNPs (M B-80K ) were added to a model that included age, sex, smoking and BMI as covariates we observed a substantial increase in classification performance for prediction of early stage survival status (relative to M B-0 , M B-80K yielded an increase in AUC(60) of 18%), a more modest increase in AUC(t) for survival status at ages 65-90 (M B-80K outperformed M B-0 by about 14% for AUC(65) and by 7-10% for AUC(70)-AUC(90)), and no change in AUC(95). These results are consistent with those observed with R 2 CV in that they indicate that genomic information can increase the prediction accuracy of lifespan, mostly due to an increase in the prediction of early mortality. Table S1 shows estimates of AUC(t) for models M A-0 , M B-0 and M B-80K by fold of the CV. Similar to what we observed for R 2 CV , although we found an overall superiority in the classification performance of M B-0 relative to that of M A-0 such superiority was not consistently observed in every fold. However, for early and intermediate survival status (t#85) model M B-80K had a classification performance that was consistently higher than that of models without genetic information (M A-0 and M B-0 ). For late mortality (t.85) such superiority was not consistently observed across folds.

Effects of population structure
The estimated R 2 CV of model M B-GWPC was 15.77%, this is roughly half the way from the R 2 CV of model M B-0 (11.45%) and   that of model M B-80K (21.40%). Results for AUC showed similar patterns. This indicates that a sizable proportion of interindividual differences in YL could be attributed to genetic differences associated to population structure. On the other hand, the fact that the R 2 CV of model M B-80K was 37% higher than that of model M B-GWPC suggests that genetic factors beyond those associated with population structure account for a sizable proportion of inter-individual differences in YL.

Effects of familial relationships
Model M B-PED , which including age at entry, sex, BMI, smoking and pedigree information showed clear signs of overfitting (the posterior mean of the residual variance was 11.1, compared to 34.4 for model M B-80K ) and, consequently, had a very poor predictive performance; even worse than our baseline model (M A-0 ). This is most likely to occur because of two reasons. First, the pedigree is very sparse, with 37% of nominally un-related individuals and most of the remaining individuals coming from relatively small nuclear families (74% of individuals were in families with 3 or less members). Additionally, in the great majority of nuclear families the offspring have censored YL. Therefore, in this dataset the amount of familial information available for prediction is very limited. To further illustrate this, we counted for every subject in the 10-fold CV where individuals were randomly assigned to folds the number of close-relatives (father, mother, offspring or full-sib) which were used for prediction (i.e., those which were assigned to a different fold). We found that in our CV 41.6 of the observations were predicted without having any direct relative in the training dataset (i.e. in other folds) and 70.75 were predicted without having any uncensored direct relative available for training. Only 10% of individuals had 3 or more direct relatives in the training datasets, and no-one had 3 or more direct relative with observed YL assigned to a different fold.
Our second approach to quantify the relative importance of family relationships on prediction accuracy consisted on fitting models M A-0 , M B-0 and M B-80K in a10-fold CV where entire families, as opposed to individuals, were assigned to folds. Such setting guarantees that no-direct relatives are used for prediction. The R 2 CV obtained in this new CV were very similar (R 2 CV were 11.9% and 22.3% for M B-0 and M B-80K , respectively) to the ones we obtained when subjects, as opposed to entire families, were assigned to folds (here, R 2 CV were 11.9% and 22.3% for M B-0 and M B-0 , respectively). Combining all these results we conclude that in our analysis familiar relationships were not a major factor explaining the prediction accuracy obtained with WGP.

Prediction accuracy and causes of mortality
Our results suggest that genomic information can enhance prediction of lifespan, mostly by improving prediction of early mortality. This can be due to several factors, one of which may be that SNPs are capturing genetic risk to certain diseases that are most responsible for early mortality. Figure 4 presents the distribution of death by cause and range of age at death in the Framingham sample. Cancer was the leading cause of death for people dying early in life, and the relative importance of cancer as a cause of death declined with increasing YL. On the other hand, the relative importance of other causes of death was much higher for people dying at older ages.

Discussion
Familial studies suggest that roughly 25% of the inter-individual differences in YL can be attributed to genetic factors [7]. Although linkage and association studies have reported several variants associated with human lifespan and aging-related traits [6,8,9,41], the individual effects of these variants is usually small and our ability to use genetic information to predict human lifespan remains very limited. Recent studies [10,11,20] suggest that WGP is effective at predicting complex traits. Here, we developed a WGP model for the prediction of YL and evaluated its predictive power using data from the Framingham longitudinal study.
When genetic markers were added to a model accounting for age at entry, sex, smoking, and BMI, the increase in R 2 CV obtained by adding 80K SNPs (,9-10% of inter-individual differences in YL) was greater than the increase obtained by adding smoking and BMI (,6% of inter-individual differences in YL), indicating that genetic markers are making a relatively important contribution to predictive ability. Similar results were obtained when prediction accuracy was evaluated using longitudinal AUC's.
As anticipated, our results suggest that the genetic basis of YL involves a large number of variants. The observation that DIC and prediction accuracy improved with marker density suggests that a large number of markers spread across the genome are needed to account for differences at QTLs affecting YL, and this is consistent with what one would expect for a trait that conforms to an ''infinitesimal'' model [42,43]. This pattern is also consistent with empirical evidence obtained for traits that conform to the infinitesimal model, such as human height [20] or production traits in dairy cattle [19].
Our results are also consistent with those of Yashin et al. [44] who, using a subset of the dataset used here (1,173 individuals of the original cohort), found that a sizable proportion of interindividual differences in YL (20% in the training dataset) can be explained by the joint influence of 168 small-effect genetic variants which were pre-selected using p-values derived from single-marker regressions. Although the study by Yashin et al. [44] and the one presented here both suggest that a large number of variants is needed to account for interindividual differences in YL, the two studies differ in many respects: (a) our study uses a larger sample size (N = 5,117, versus N = 1,173) and incorporates both uncen- sored and censored observations, (b) unlike the Yashin study, where markers were pre-selected using statistics derived from single marker regression, here we used a much larger number of markers (up to 80K), spread along the whole genome, (c) although the two studies used an additive linear score to predict YL, the two scores are different. In the Yashin study the score consist of a sum of so-called ''longevity alleles'', while in our study the predictive score is a weighted sum of allele dosage, with weights given by estimates of marker effects, (d) in some of our models we account for the effects of BMI and smoking, while these covariates were not accounted for in the Yashin study. Finally (d) in our study we focused on prediction accuracy of yet-to be observed outcomes, while the study by Yashin et al. reports the proportion of interindividual differences in YL that could be accounted for in the same dataset that was used to derive the predictive score. Nevertheless, despite the differences in the datasets and methods used, both studies provide consistent evidence that an important proportion of differences in YL can be predicted using genomic information and that capturing those patterns requires considering a large number of small-effects variants.
In addition to demonstrating that a sizable proportion differences in YL can be predicted using genomic information, we found that most of the gains in prediction accuracy obtained with use of genetic information came from increased accuracy of prediction of early mortality. Further examination of the distribution of causes of death by age at death reveled that cancer was the leading cause of death for people dying early in life. Therefore, a possible explanation of our results is that the ability of our WGP to capture cancer risk (indirectly through YL) was higher than for other death-related disorders. Further studies, using disorder-specific responses (e.g., presence/absence or onset of cancer) and case-control datasets will be needed to confirm this conjecture.
The Framingham dataset has a familial design and exhibits some level of population structure, much of which can be described through PCA of genome-wide SNPs. Whole-Genome Prediction exploits multi-locus LD between markers and QTL. These patterns of LD are likely to change across sub-groups in the population and because of this, models fitted using WGP cannot be regarded as 'universal equations'. The validity across subgroups of the patterns captured by a WGP model will depend on the extent to which genetic features (e.g., stratification) present in training samples are also present in those used for validation.
Including the first two marker-derived PCs increased prediction accuracy markedly, indicating that YL covariates with ancestry, as described by the first 2 PCs. However, the level of prediction accuracy attained by models using the first two marker-derived PCs was substantially lower than that of the model using 80K genome-wide SNPs, suggesting that the genetic factors affecting YL cannot be fully described by features such as population structure. The effects of familial relationships on the prediction accuracy of WGP are well established [13,20]. However, in our study, the pedigree is relatively sparse and when families with more than one subject exist the offspring are highly likely to be censored; therefore, familiar relationships are not very informative to begin with, explaining why in this study familial relationships did not show a strong effect on the prediction accuracy of WGP.

Supporting Information
Methods S1 Describes the Bayesian model used.