Associating multiple longitudinal traits with high-dimensional single-nucleotide polymorphism data: application to the Framingham Heart Study.

Cardiovascular diseases are associated with combinations of phenotypic traits, which are in turn caused by a combination of environmental and genetic factors. Because of the diversity of pathways that may lead to cardiovascular diseases, we examined the so-called intermediate phenotypes, which are often repeatedly measured. We developed a penalized nonlinear canonical correlation analysis to associate multiple repeatedly measured traits with high-dimensional single-nucleotide polymorphism data.


Background
Cardiovascular diseases (CVD) are associated with combinations of phenotypical traits, such as increased blood pressure, blood glucose, or cholesterol, and many other risk factors. These traits are in their turn caused by a combination of environmental and genetic factors. Because of the high diversity of pathways that may lead to CVD, our focus lies on the so-called intermediate phenotypes that often have a much stronger relationship with genetic markers. The disadvantage of this approach is that there are many such intermediate phenotypes, and moreover they are often repeatedly measured in patients, e.g., lipid profiles, blood pressure, and glucose. We developed a new method to associate multiple repeatedly measured phenotypical traits with high-dimensional single-nucleotide polymorphism (SNP) data, and illustrate its use with the data sampled in the Framingham Heart Study, as provided by the Genetic Analysis Workshop 16 (GAW16).
We have previously shown that penalized canonical correlation analysis (CCA) can be a valuable tool to study the association between two high-dimensional sets of variables [1,2]. It penalizes the two datasets such that it finds a linear combination of a selection of variables in one set that maximally correlates with a linear combination of a selection of variables in the other set, thereby making the results more interpretable.
Although CCA accounts for the correlation between variables within the same set, it neglects the longitudinal Page 1 of 6 (page number not for citation purposes)

BioMed Central
Open Access nature of the variables. Furthermore, it does not deal well with categorical data, such as we encounter when dealing with SNP data. We adapted CCA such that it captures the correlation between the multivariate longitudinal responses instead of the correlation between separate measurements within one set. The set with SNP variables is transformed via optimal scaling [3], such that each SNP variable is transformed into one continuous variable, which captures the measurement characteristics of the SNP. Hereafter, this set is penalized such that the number of suspected SNPs is reduced to an interpretable number. Using these approaches, we are able to extract groups of SNPs that have a high association with multiple longitudinal traits.

Data
Data from the Framingham Heart Study containing information about common characteristics that contribute to CVD, together with genetic data of about 550,000 SNPs were provided by the GAW16. Three generations were followed over a period of several years, when at regular time-points characteristics of CVD were measured.
In our analysis we focused on individuals of the offspring generation because the repeated measurements in these individuals were taken under approximately similar conditions; we considered the measurements of total cholesterol, high-density lipoprotein (HDL) cholesterol, triglycerides, and blood glucose, each measured four times. Furthermore, we analyzed the data of the Affymetrix 50 k chip containing 50,000 SNPs. The offspring generation exists of 2,760 individuals, with at most four measurements taken every 7 to 12 years. We only took the individuals (over the age of 17) for whom both phenotypical data and genetic data were available (2,584), and discarded the individuals with more than 5% missing SNP data (6) and individuals who attended only one exam (26). Six individuals with extremely high measures for one of the four traits were also deleted from further analysis.
Monomorphic SNPs and SNPs with a missing percentage of 5% or more were deleted from further analysis. Remaining missing data was randomly imputed once based only on the marginal distribution of the SNP in all other individuals; no assumptions were made about the variation in the imputation. We were primarily interested in common SNP variants, so we grouped SNP classes with less than 1% observations with a neighboring SNP-class; i.e., we grouped homozygotes of the rare allele together with the heterozygotes, if there were less than 1% homozygotes. This research was in accordance with the Helsinki Declaration of Human Rights, compliant with the data use agreement for the Framingham Heart Study, and was approved by the local medical ethics committee of the Academic Medical Center Amsterdam (date of decision, 1 April 2008).
Penalized nonlinear CCA Consider the n × p matrix Y, containing p (measured trait) variables, and the n × q matrix X containing the q (SNP) variables, obtained from n subjects. CCA captures the common features in the different sets by finding a weighted linear combination of all the variables in one set that correlates maximally with a weighted linear combination of all the variables in the other set. These linear combinations are the so-called canonical variates ω and ξ, such that ω = Yu and ξ = Xv, with weight vectors u T = (u 1 , ..., u p ) and v T = (v 1 , ..., v q ).
Because CCA neglects the longitudinal nature of the variables, each repeatedly measured trait is summarized into two measures, one representing the slope and one the intercept for each individual (see next section). Moreover, CCA cannot deal with categorical variables, therefore each SNP variable is transformed into one continuous variable via optimal scaling [3]. Each of the transformed variables are restricted to the measurement characteristics of the SNP. That is, where a is wildtype, b is heterozygous, and c is homozygous. This restriction indicates that the effect of the heterozygous form of SNP j always lies between the effect of the wildtype and homozygous genotype. To make the results more interpretable, the SNP set is penalized using univariate soft-thresholding. The canonical variates are optimized via the following alternating least squares algorithm (see Figure 1): 1. Standardize Y and X.
Summary method for the repeatedly measured phenotypes The four traits (total cholesterol, HDL cholesterol, triglycerides, and blood glucose levels) are log-transformed. Because of the linear nature of the repeated measures, for each subject i its repeatedly measured trait variables can be summarized into two measures, an intercept (b 0 ) and a slope (b 1 ); after correcting for the treatment effect of cholesterol-lowering drugs, according to the following model, at age t: y it = b 0i + b li Age + b 2i trt + b 3i trt* Age + e it , where y is the trait and trt = 0 if no treatment was taken and trt = 1 if cholesterol treatment was taken. Hereafter, the sex-effect was removed from the eight newly obtained intercept and slope variables.
The new dataset Y contains eight variables, two for each of the four phenotypical measures.

Model optimization
Optimization of the penalty parameter is determined via k-fold cross-validation. The weight vectors u and v and the transformation functions τ j per SNP variable are estimated for different penalties in the training set and validated in the validation set.
Instead of determining the penalty l, for interpretation it is easier to determine the optimal number of SNP variables. The optimal number of variables are obtained when the mean difference between the canonical correlation of the training and the validation set is minimized, i.e.,

Results and discussion
To decrease computation time, the number of SNP variables was reduced, using univariate analysis. Each of the four intercept variables were separately associated with the SNPs set using the optimal scaling method. Only the intercept variables were used because the absolute correlation between the intercept and slope of the same phenotype was at least 0.89. For each intercept variable, the top 10% of SNPs with the highest weights were selected for further analysis, i.e., SNP variables that received a high weight for any of the four intercept variables were selected. This resulted in a data set with 12,682 SNP variables.
Next we performed penalized nonlinear CCA; ten-fold cross-validation was performed to determine the optimal number of SNP variables. The optimal number of variables was the number where the canonical correlation of the validation set was closest to the canonical correlation of the training set. Figure 2 shows the effect of the number of selected SNP variables on the difference in canonical correlation of the validation set and the training set. It shows that as the number of selected variables increased, the difference in canonical correlation also increased, which caused the predictive performance of the selected variables to decrease.
If the number of variables is very large, then there is a high probability that a random pair of variables has a very high correlation by chance. To identify a canonical correlation that was large by chance only, a permutationanalysis on the validation sets was performed. We permuted the canonical variate ξ (SNP variables) and kept the canonical variate ω (phenotypical variables) fixed, then the difference in canonical correlation of the permuted validation sets and the training set was determined ( Figure 2).
Furthermore, we performed an additional simulation test in which we associated the actual set of phenotypical variables with a set of permuted SNP variables. For each SNP variable the observations were randomly distributed over the different subjects, then a ten-fold cross-validation was performed (see Figure 2).
The difference in canonical correlation obtained from the permuted data and the simulated data were small (Figure 2), while the difference in the canonical correlation of the real data was smaller, indicating that the simulation set only contained noise data and the results of the real data were larger than would be expected by chance. Figure 2 shows a decreasing trend as

Figure 2
Determination of the optimal number of SNP variables.

Conclusion
Our penalized nonlinear CCA is able to identify SNPs that are associated with repeatedly measured phenotypical markers. In this study two important SNPs (rs328 and rs3764261) were found that are located close to or in a gene that has been reported to be associated with HDL concentrations [4].
Although we chose to model the repeated measurements with a linear random effects model, more complex and more flexible models can be easily incorporated in our penalized nonlinear CCA. Besides the first pair of canonical variates, different pairs can be obtained using the residual matrices of the preceding canonical variate pairs.
The family structure in this study was ignored. McArdle et al. [5] noticed that ignoring the family structure mainly affects the type I error rate and not the bias of the point estimate. Because we focused on the point estimates, we expect no major changes when family structure is considered.