Generating a Robust Statistical Causal Structure Over 13 Cardiovascular Disease Risk Factors Using Genomics Data

Understanding causal relationships among large numbers of variables is a fundamental goal of biomedical sciences and can be facilitated by Directed Acyclic Graphs (DAGs) where directed edges between nodes represent the influence of components of the system on each other. In an observational setting, some of the directions are often unidentifiable because of Markov equivalency. Additional exogenous information, such as expert knowledge or genotype data can help establish directionality among the endogenous variables. In this study, we use the method of principle component analysis to extract information across the genome in order to generate a robust statistical causal network among phenotypes, the variables of primary interest. The method is applied to 590,020 SNP genotypes measured on 1,596 individuals to generate the statistical causal network of 13 cardiovascular disease risk factor phenotypes. First, principal component analysis was used to capture information across the genome. The principal components were then used to identify a robust causal network structure, GDAG, among the phenotypes. Analyzing a robust causal network over risk factors reveals the flow of information in direct and alternative paths, as well as determining predictors and good targets for intervention. For example, the analysis identified BMI as influencing multiple other risk factor phenotypes and a good target for intervention to lower disease risk. Graphical Abstract


Introduction
Interindividual variation in disease susceptibility is influenced by genetic variants, which can be organized into a defined biologic pathways or data-driven associative networks [ 1 -3 ]. By identifying variables correlated with the primary endpoint of interest, we are able to classify individuals and predict future disease. Going beyond partial correlations and evaluating causal relationships among variables plays an essential first step in risk prediction, thereby promoting more efficacious treatment of current disease and prevention of future disease. By changing the level of a causal variable (e.g. LDL-cholesterol levels), we are able to change the risk of future disease (e.g. coronary heart disease), which may not be the case for mere associated variables (e.g. HDL-cholesterol levels) [ 4 ]. In the case of a randomized intervention, such as a clinical trial, identification of causation is conceptually straight forward. However, in observational studies, which represent the majority of most large-scale epidemiologic studies, causal inference is more complex. In most applications, especially "big data" applications, causal inference is embodied in Directed Acyclic Graphs (DAG), where any inference is based on an estimated graph (i.e. nodes and edges). DAGs are illustrations of causal relationships among the variables. Mendelian randomization is an established approach to identify causal relationships [ 5 -8 ] and it is natural in a biomedical setting to integrate genomics and phenotypic information to help establish directionality within a network of phenotypes. We apply this technique in large data sets from different granularities to achieve robust causal graphs (i.e. DAGs). In the present context, granularities are defined as hierarchical levels with different quiddity that the causal relationship between them is known, e.g. they are reflecting different levels of biologic organization and measurement (genomic and phenotypic, [ 4 ]). In the application shown here, we use data from a deeper granularity, the genome, to generate a robust statistical causal network among 13 risk factor phenotypes. Inclusion of genotypes in the analysis of phenotypes (e.g. plasma glucose levels) provides two advantages: first, genotypes are assumed to be measured without error, and second, there is a natural order between these granularities (genome variation → phenotype variation; G→P) and this knowledge helps identify robust directionality in the upper granularity.
Using genome information is a promising approach to identify directionality that is less susceptible to confounding. Previous applications in data integration using gene expression data and genotypes have followed a similar logic [ 9 -12 ]. For example, Mehrabian et al., [ 9 ] integrated genotypic and phenotypic data in a segregating mouse population to generate causal relationships. Aten et al., [ 11 ] introduced an algorithm to estimate directionality among nodes in a DAG by applying information from selected single nucleotide polymorphisms (SNPs). In this study, we apply the concept of granularity in a comprehensive manner and extract information from a deeper granularity, here the genome, to achieve a robust causal network among variables of interest in the upper level of granularity, here cardiovascular risk factor phenotypes. To go beyond using a sample of SNPs, which are incomplete and may introduce instability in the study results [ 13 ], the method of principal components is used to extract information across the genome. Integration of genome information embedded in the deeper granularity and captured using principal component analysis with phenotype information in the upper granularity results in a robust causal network among the phenotypes, and we call this algorithm granularity directed acyclic graph (GDAG).
We first briefly review the theory of graphical causal inference and introduce the granularity framework and the GDAG algorithm. The utility of this approach is introduced by application to a data set including 13 cardiovascular disease risk factors and 590,020 SNP genotypes measured on 1,596 individuals and then the estimated structure is further interpreted. Use of information from the genome level of granularity allowed us to robustly generate the statistical causal network among the phenotypes. A discussion of the GDAG algorithm and the results is provided.

Background
Assume a DAG D = (v,ε) where v is a set of nodes with p elements which corresponds to a set of p random variables and ε is a set of edges which connect the nodes and shows the partial correlation between two corresponding variables. The existence of a directed edge between two nodes shows the causal relationship between the corresponding variables. Assume P is a joint probability distribution over the variables corresponding to the nodes in DAG D = (v,ε). The underlying assumption for a DAG is the Markov condition over D and P [ 14 ]. D and P must satisfy the Markov condition: every variable Y i , i ∈ v is independent of any subset of its predecessors conditioned on a set of variables, corresponds to parents/ immediate causes of node i, where Y k occurs before Y i and parental set pa(i) = pa D (.) denotes the set of parents of node i relatives to the underlying structure of DAG D. For j ∈ pa(i), we denote j → i or .
A topology or skeleton of a DAG is a graph without direction and is obtained by identification of conditional (in)dependencies, see section "Identification the Topology of Nodes" below. Identification of directions is however a challenging problem due to the Markov equivalent property of observational data. Analysis of data in the upper granularity can identify only v-structures, two nonadjacent nodes pointing inward toward a third node. A complete assessment of directionality (i.e. statistical causal relationships) usually cannot be determined from such data alone, resulting in Markov equivalent DAGs [ 15 -16 ]. Different DAGs on the same set of nodes are Markov equivalent (ME DAGs) if and only if they have the same topology and the same v-structures [ 17 ]. When the number of nodes grows, the number of ME DAGs can grow super-exponentially [ 18 ]. Complete determination of directionality over the corresponding set v is not, however, possible in most of cases.

The GDAG Method
Identifying robust and complete directionality and showing flow of information is a difficult task, but can be facilitated by integration of different data types (i.e. granularities) where we know the direction of effect is from one granularity to the other. Assume we are seeking a DAG between two phenotypes Y 1 and Y 2 . For this example, assume genome-wide information, related to the set (Y 1 ,Y 2 ) is captured in the variable X 1 . Based on the results of an analysis assessing conditional independencies, we find that X 1 is correlated to Y 1 and is independent of Y 2 given Y 1 , by notationY 2 ⊥X 1 |Y 1 . Since genome sequence variation is a causal factor in phenotypic differences (and not the other way around), the direction of the effect is from X 1 to Y 1 , as shown in DAG A in Figure 1. Knowing the relationship between X 1 and Y 1 helps generate the directionality between Y 1 andY 2 based on the property Y 2 ⊥ X 1 |Y 1 , and the direction shows the flow of information is from Y 1 toY 2 , as shown in DAG B in Figure 1. If we obtain X 1 ⊥ Y 2 & X 1 ⊥̸ Y 2 |Y 1 by analysis of the data, then the direction of effect would be from Y 2 to Y 1 , as shown in DAG C in Figure 1, which represents a vstructure at Y 1 .
To identify the direction among three variables in ME DAGs , we need to have at least two variables from the genome (i.e. a lower level of granularity, where G→P) influencing Y 1 and Y 2 or one variable from the genome influencing Y 3 . By integrating multi-omics data from different granularities, we are able to derive causal inference that is less susceptible to confounding and, as a result, estimate causal networks robustly and uniquely. Partial information from a deeper granularity creates weak instrumental variables and may result in unstable structures in the upper granularity [ 13 ], and we may not be able to find a genome variable strongly associated with every phenotype under study [ 19 ]. Therefore, we go beyond inclusion of a sample of SNP marker genotypes and extract comprehensive information across the genome by application of principal component analysis (PCA) to reduce the dimensionality of the data while retaining most of the variation in the data set. Since PCA is an unsupervised approach, it avoids increasing false discovery using the same data twice. The steps of the GDAG algorithm are summarized as follows: The GDAG Algorithm: Steps to identify a Granularity Directed Acyclic Graph (GDAG) over a set of variables of interest, Y, using data from a deeper granularity, X

1
Extract genome information by principal component analysis. Select the principal components responsible for a majority of genome variation, set X.

2
Estimate a topology over sets Y and X. *

3
If a variable in set X is linked to a variable in set Y, draw an arrow from the former to the latter.

4
Use the established directions from step 3, generate other directions using partial correlations recorded in step 2. **

5
If there is an undirected link between Ys, use rules in [ 20 ] to identify directionality. *** * Topology estimation is detailed in the following section ** Presented at the beginning of this section. *** The supplementary information provides further details.
To make the algorithm feasible, we assume the underlying network is sparse. A sparse network is a network with fewer numbers of links than the maximum possible number of links within the same network [ 21 ]. Under the sparsity assumption, the run-time of the algorithm is reduced to a polynomial and as a result the number of nodes can grow with the sample size. This assumption is reasonable and is often considered in most biomedical applications [ 22 -25 ].
When applying the GDAG algorithm, we are primarily interested in the causal relationships among nodes in the upper granularity, not among nodes in the deeper granularity or the relationship between the deeper and upper granularities. In the current application, genome information summarized by PCAs is applied to identify a robust statistical causal network structure among cardiovascular risk factor phenotypes in upper granularity. In genetics and epidemiology, application of PCA for summarizing genome information is frequent [e.g. 26 -29 ].

Identification the Topology of Nodes
In this manuscript, generating the basic topology among nodes and then assessing directionality are carried out by finding conditional independencies in the framework of data integration. One statistical approach to estimate conditional independencies under a Gaussian assumption is assessing partial correlations [ 30 -32 ]. Conditioning only on one variable, the partial correlation is defined as where is the Pearson product-moment correlation coefficient. Fisher's Z transform is used to assess the statistical significance of the sample correlation coefficient, r. If the partial correlation between two variables Y i and Y j given variables corresponding to a subset s ⊆v\{i,j} is not determined to be significantly different from zero at some significance threshold, then the corresponding nodes i and j are not connected with an edge. On the other hand, there is an edge between nodes i and j if and only if given all subsets s ⊆ v\{i, j}, Y i and Y j are significantly correlated (see [ 33 ] prop.

5.2)
. Assessing all partial correlations in the case of multivariate normal distribution to estimate conditional independencies is computationally unfeasible. Therefore, a sparsity assumption is employed, meaning that each node is connected to only some but not all of the nodes in the network.

Application to Simulated Data
To evaluate the properties of the GDAG algorithm, we used simulated data. We estimated the simulated DAG without genomic information and separately with the GDAG algorithm incorporating the genomic information. We then compared the frequency of false discoveries (FD), which are the number of wrong directions, and non-discoveries (ND), which are the number of non-directed edges, estimated by each method. Since the performance of the algorithms depends on the number of v-structures in the underlying causal graph, we considered DAGs with different numbers of v-structures. The underlying models of the simulations are depicted in Figure 2.
In order to have genotype data with a realistic linkage disequilibrium structure, we generated 10,000 SNPs for 2000 individuals on the basis of a coalescent model that mimics the linkage disequilibrium (LD) (i.e. non-random association of alleles at different loci) pattern, local recombination rate and the population history of African American and European American using a previously validated demographic model [ 34 ].
Phenotype values were generated using the structural model for j < i, where the phenotypes (Z j ) in the model are compatible with the graphs in Figure 2.
For each scenario, SNPs were selected randomly from the larger set of generated SNPs. The ε i in the model was assumed to be Gaussian with unit variance. The value of non-zero genome effects were randomly sampled from a U (−0.9, − 0.5) and U (0.5,0.9) and the value of the non-zero phenotype effects were sampled from a uniform U (− 1.9, − 1.0) and U (1.0,1.9). The values of these extreme points were based on preliminary analyses. While other studies such as [ 32 ] considered only positive effects, we considered both negative and positive effect sizes.
The simulated data were analyzed considering only the phenotypic data using the PCalgorithm [ 35 ] which is implemented with polynomial complexity in high-dimensional sparse setting [ 32 ]. The simulated data were also analyzed using the GDAG algorithm based on both phenotypic and genomic data. To apply the GDAG algorithm, we extended the PCalgorithm to analyze data from different granularities. We extracted information from the generated SNPs using principal component analysis and selected the first 110 principal components responsible for almost 90% of variation to form the set X in the GDAG algorithm. Result of the comparison of the performance of the PC and the GDAG algorithms based on fifty repetitions is summarized in figure 3, which shows the number of false discoveries (FD) and non-discoveries (ND) under different scenarios.
The GDAG algorithm has fewer FDs and NDs compared to a simple DAG application without the genome information. As can be seen in Figure 3, the performance of the PCalgorithm is improved dramatically by adding information from a deeper granularity. There have been other attempts to improve the PC-algorithm's characteristics. For example, de Campos et al. [ 36 ] improved the PC-algorithm by employing three types of structural restrictions, and Shojaei et al. [ 25 ] achieved better performance using a penalization approach.

The GDAG Performance for Different Numbers of Samples and Significance Levels
The GDAG algorithm can generate directionality robustly because it leverages information from a deeper granularity. Here, we examine the performance of the GDAG algorithm for different numbers of observations and two significance levels. We simulated different number of individuals and 40 replicates for each sample size for an underlying network with directionality. Using data from a deeper granularity, the GDAG algorithm was used to generate the topology and directionality for each replication. Therefore, when assessing the GDAG performance across different scenarios, showing either the false discovery rate (FDR) or true discovery rate are sufficient. The mean FDRs across the 40 replicates for each sample size were calculated. A smooth line over mean of FDRs is depicted in Figure 4. The red line shows the mean FDRs at the significance α =0.01 and the black at α =0.001.
Examination of the FDR rate in Figure 4 indicates unsatisfactory rates of false discovery at small sample sizes (e.g. <600 for alpha =0.001 and <1200 for alpha =0.01) and satisfactory rates at larger sample size at both significance levels. For sample sizes between 600 and 1200, alpha=.001 provides more reliable result than alpha=.01.

Application to Genotype and Risk Factor Data
As an application of the GDAG algorithm, we identified causal relationships among 13 chronic disease risk factor phenotypes: BMI (body mass index), SB (systolic blood pressure), DB (diastolic blood pressure), FG (fasting glucose), FS (fasting insulin), HDL (high density lipoprotein), LD (low density lipoprotein), TRIG (triglyceride), TC (total cholesterol), FIBR (Fibrinogen), PLT (Platelet count), as well as electrocardiogram measurements QT and QRS. The data set includes 590,020 measured genotypes in sample of 1,596 non-Hispanic white individuals from the National Heart Lung and Blood Institute (NHLBI) GO-ESP, which is an ancillary study of the Atherosclerosis Risk in Communities (ARIC) study [ 37 ], the Cardiovascular Health Study (CHS) [ 38 ], and the Framingham Heart Study [ 39 ]. The data were obtained from dbGAP [ 40 ], and this analysis is part of ongoing studies having local Institutional Review Board approval. The following steps were undertaken in order:

1.
Prior to calculation of the principle components, we identified a subset of informative SNPs using hierarchical clustering and the r 2 measure of linkage disequilibrium, where r 2 is the square of correlation between two SNPs [ 41 ].

2.
Since chromosomes are assumed to be independent, we applied principal component analysis over each chromosome separately and selected the first principal components responsible for approximately 90% of the variation. Over all of the chromosomes, 586 principal components were selected.

3.
The topology of the network was determined over the 13 risk factor phenotypes and the principal components. 130 genome wide principal components remained in the model at significant level 0.01.

4.
The direction of effect was assumed to be from the genome-wide principal components to the risk factor phenotypes (and not the other way around).

5.
Use partial correlations to estimate the Markov properties among the risk factor phenotypes and directions in step 4, the directionality of the network is determined. Details can be found in the supplementary materials.
The resulting GDAG among the risk factor phenotypes is shown in Figure 5.
As was shown in the simulated data, use of information from the genome embedded in the principal components allowed us to estimate the causal network among the phenotypes. Some of the relationships in Figure 5 are expected, such as those among the lipid phenotypes. The analysis identified a causal link between BMI and HDL-cholesterol and an indirect effect of BMI on triglycerides. The relationship between fibrinogen and platelets underscores the important role of fibrinogen in platelet aggregation and function [ 42 ]. It is important to note the effect of BMI throughout the network.

Discussion
DAGs are illustrations of causal relationships among a set of related variables. To definitively identify causal relationships, interventions are required. However, interventions, even in some parts of the graph, are not possible in most human observational studies. Data analysis alone does not robustly identify causal relationships, except in very special cases for non-Gaussian distributions [ 43 ]. As shown here, application of domain expert knowledge and data from another granularity is helpful for identifying causal networks, including the direction of arrows and estimating the magnitude of effect sizes. In a granularity framework, we take advantages of genotype information to identify a robust statistical causal network structure among phenotypes (i.e. GDAG), which provides a high degree of certainty about finding causal relationships. Any algorithm for DAGs can be extended in the granularity framework to be able to achieve causal inference that is less susceptible to confounding by hidden variables and, as a result, estimate robust statistical causal networks which are well anchored to domain knowledge. In previous applications, a priori biologic candidate gene variation has been used to analyze phenotypes [ 11 ], but a comprehensive approach to the concept of granularity has not been used. Leveraging eQTLs identified from previous association studies to reduce the number of Markov equivalence classes among phenotypes is well-established [ 2 -3 & 12 ] but distinct from the concept of granularity. To the best of our knowledge, there is no report using PCAs derived from genome-wide SNP information to identify the causal network structure among phenotypes. In the proposed granularity framework, the domain knowledge "genome variation causes phenotypic differences" is used along with objective dependencies in the data to estimate causal relationships.
The concept of granularity can be applied to reduce the running time of the GDAG algorithm. Since the primary interest is generating a causal network structure over the variables Y in the upper level of granularity, the topology of the causal network can be estimated only over the variables Y. After the topology is established, the GDAG algorithm seeks partial correlations between any two variables, one from X and the other from Y.
Since the variables in set X, the genome-wide principal components, are independent of one another, the GDAG algorithm does not require estimating the partial correlations between the Xs. This results in further reduction of the running time.
To implement the granularity framework, we extended the Peter and Clark (PC) algorithm because it is computationally feasible and often fast for sparse problems having many nodes (variables) [ 32 ]. This method can be applied to generate network structures among many variables and reveal patterns in complex systems. A robust statistical causal network reveals patterns in the underlying structure, thereby identifying good targets for intervention and prediction. The total effects among phenotypes can be estimated by structural equations while a sufficient set of confounders identified graphically are in the model [ 44 ]. We applied the GDAG algorithm to 13 risk factor phenotypes and genome-wide principal components as the deeper granularity. Visualization of the phenotype GDAG shown in Figure 5 provides opportunities for improved disease prediction and identifying targets for risk factor intervention. Nodes with a high in-degree (i.e. number of arrows pointing into a node) correspond to variables influenced by multiple other risk factors. These nodes may be good predictors of disease since they capture information from multiple risk factors. On the other hand, nodes with a high out-degree (i.e. number of arrows pointing out of a node) correspond to variables having influences on multiple other risk factors. These nodes may be good intervention targets to lower risk and influence clinical outcomes. For example, according to the GDAG in Figure 5, a good disease predictor may be fibrinogen (FIBR) since it is influenced by multiple other risk factors. BMI may be a good intervention target since it has a high impact on the other risk factor levels, such as fibrinogen, HDL, glucose, and insulin. Changes in BMI are predicted to yield changes in the majority of the network of risk factors. In conclusion, generating a robust statistical causal network among risk factor phenotypes and using this directionality, we are able to identify good candidates for future manipulation.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.

Highlights
Understanding causal relationships among large numbers of variables is a fundamental goal of biomedical sciences and can be facilitated by Directed Acyclic Graphs (DAGs) where directed edges between nodes represent the influence of components of the system on each other. In an observational setting, some of the directions are often unidentifiable because of Markov equivalency. Additional exogenous information, such as expert knowledge or genome data can help establish directionality among the endogenous variables. In this study, we use the method of principle component analysis to extract information across the genome in order to generate a robust statistical causal structure among phenotypes, our variables of interest. The method is applied to 590,020 SNP genotypes measured on 1596 individuals to generate the structure on a set of 13 cardiovascular disease risk factor phenotypes. First, principal component analysis was used to capture information across the genome. The principal components were then used to identify a robust causal structure, GDAG, among the phenotypes. Analyzing a robust causal structure over risk factors reveals the flow of information in direct and alternative paths, as well as determining predictors and good targets for intervention. For example, the analysis identified BMI as influencing multiple other risk factor phenotypes and potentially a good target for intervention to lower disease risk.  DAG A is a representation of three connected variables as well as the knowledge about direction of the effect between two granularities where variable X 1 is from a deeper granularity. DAG B and DAG C represent direction identification based on analysis of data.