Detecting epistatic interactions contributing to human gene expression using the CEPH family data.

It is believed that epistatic interactions among loci contribute to variations in quantitative traits. Several methods are available to detect epistasis using population-based data. However, methods to characterize epistasis for quantitative traits in family-based association analysis are not well developed, especially for studying thousands of gene expression traits. Here, we proposed a linear mixed-model approach to detect epistasis for quantitative traits using family data. The proposed method was implemented in a widely used software program SOLAR. We evaluated the power of the method by simulation studies and applied this method to the analysis of the Centre d'Etude du Polymorphisme Humain family gene expression data provided by Genetics Analysis Workshop 15 (GAW15).


Background
With the ability to measure simultaneously thousands of gene expression traits, understanding the causes of transcriptional variation has been of great interest. Genetic interactions, also called epistasis, have been shown to affect gene expression phenotypes. For example, Brem and Kruglyak [1] found that the genetic basis of transcripts in yeast is more often likely to be polygenic rather than monogenic and that in yeast epistasis effects are present in more than 15% of transcripts. Therefore, it is essential to analyze epistatic interactions between loci that contribute to variations in gene expression traits.
Several statistical methods for studying epistatic interactions between loci for quantitative traits using populations of unrelated individuals or from experimental designs have been developed [2][3][4][5][6]. For quantitative traits using family-based samples (related individuals), epistatic testing has been incorporated into the variancecomponent linkage analysis and implemented in the software SOLAR [7]. However, epistatic detection on the basis of the linkage analysis can only locate the two interacting loci in wide confidence intervals and will have small power for data sets with small sample sizes, such as in the GAW15 (Genetic Analysis Workshop 15) CEPH (Centre d'Etude du Polymorphisme Humain) data set, which only contains 194 individuals. In this situation, associationbased methods are expected to have higher power for detecting epistasis. A variety of approaches [8,9] that focus on association testing can be used to detect epistasis [3]. However, these are transmission-disequilibrium test (TDT)-based methods, which also have lower power in epistasis studies. Also, it is complicated to identify the best statistical model for the joint effects of loci including their interactions through model selection, particularly for analyzing thousands of gene expression traits for thousands of markers.
In this paper, we have extended the association-based linear regression model [2,3] by adding a random polygenic effect into the model to allow for familial data for epistasis detection of quantitative traits. The proposed linear mixed model was implemented in the widely used software program SOLAR [7], which calculates significance levels for each covariate, and performs covariate screening in the model. We applied the proposed method to a subset of the gene expression profiles in the CEPH data set as provided by GAW15.

Statistical methods
Based on the linear regression model of Cokerham and Zeng [2] (also see Cordell [3]), we propose a linear mixed model for detecting epistatic interactions for quantitative traits using family-based data: This model assumes diallelic marker loci and that y is a normally distributed quantitative gene expression phenotype from related individuals, W is a vector of fixed covariates such as sex effects, β is the corresponding vector of Significant interaction effects imply presence of epistasis.
To detect epistasis, for each gene expression phenotype, we ran Model (1) in SOLAR for each pair of single-nucleotide polymorphisms (SNPs) in the selected candidate regions (see Description of the data set for more details). The number of tests for each gene expression phenotype ranges from 6 to 820, depending on the marker density and size of the candidate regions selected for the epistasis search. For each gene expression phenotype, individual pvalues were adjusted using false-discovery rate (FDR) under the general dependency assumption [10] within each phenotype. FDR-adjusted p-values equal to or less than 0.05 (FDR ≤ 0.05) are considered to be significant.

Simulation study
We simulated a data set based on the pedigree structure from CEPH family data, which has 14 three-generation families of 194 individuals. We considered two unlinked diallelic markers in our analysis with allele frequency of 0.5. Marker genotypes for the grandparents were generated assuming Hardy-Weinberg equilibrium at each locus. Genotypes for parents and children were simulated conditional on their parental genotypes following Mendel's law. As an example we evaluated the power (true negative rate) and type I error (false-positive rate) of the proposed method in identifying additive-additive epistatic effect i aa .

Description of the dataset
The CEPH family data provided by GAW15 includes 3554 Affymetrix ® gene expressions measured for 194 individuals from 14 three-generation CEPH families. In addition, 2882 autosomal and X-linked SNPs were typed for these individuals. The software package pedStat as distributed in Merlin version 1.0.1 [11] was used to check for Mendelian inconsistence, genotyping proportions, and heterozygosity of SNPs. SNP markers with minor allele frequencies less than 1% (equivalent to heterozygousity < 1.98%), markers with greater than 30% missing genotypes, and markers with only two of the three possible genotypes were removed from analysis, which left 2436 SNPs for our analysis.
We limited our analysis to the 27 gene expression phenotypes with the strongest linkage evidence of cis effects ( Table 1 from Cheung et al [12]). Their Table 1 listed one to two peak markers for each phenotype, where a peak marker is the SNP with the most significant finding in the genome-wide association analysis (GWA) for this phenotype. Fourteen of the 27 gene expression phenotypes exhibited cis regulation (with cis peak markers) by the GWA analysis. For the phenotype PPAT there are two peak markers that point to both cis and trans regulation for this gene. For the remaining 12 phenotypes, the peak markers are trans markers. In our study, for each of the 27 phenotypes we selected a 15-Mb candidate region centered on the target gene location. If a trans peak marker was identified in the GWA analysis, we also selected an additional 15-Mb candidate region centred on that marker. We analyzed the epistatic effects for all possible combinations of the SNPs within the candidate regions.

Simulated data
We calculated the ROC profiles for detecting interacting SNPs using linear mixed Model (1). An ROC curve measures the trade-off between sensitivity (true-positive rate, TPR) and 1 -specificity (false-positive rate, FPR) for different cut-offs ( Figure 1). Each threshold value results in a sensitivity and 1 -specificity, which is represented by a point on the ROC curve. For 100 simulated data sets, the TPR starts at 0.82 and quickly increases to 0.95, with FPR less than 0.11. When the nominal threshold is 0.05 for the significant calls of epistasis, the power is 0.95 and the type I error is 0.08 in identifying the additive by additive epistasis effect i aa . In summary, simulation results indicate the proposed linear mixed model has high sensitivity (power) and specificity for detecting SNP-pair interaction in family data for two unlinked markers.

CEPH family data
When applying the linear mixed model to the analysis of the CEPH data, a total of six SNP pairs showed significant epistatic interactions for three gene expression phenotypes (Table 1). Two out of six significant epistatic effects are from SNPs that are located on two regions from different chromosomes. These two regions are corresponding to the cis linkage region and the trans association marker.
These results show gene expression regulation cannot be simplified as cis or trans regulation because both cis or trans effects simultaneously contribute to variation in message RNA expression level. The strong epistatic interactions between SNPs could be interpreted as the strong epistatic interactions between genes if the nearby genes are in linkage disequilibrium (LD) with SNPs. For example, the SNP rs1537638 is within the gene PTK7 (protein tyrosine kinase 7), which is located on chromosome 6 and shows significant interaction with the SNP rs1505694, which is close to the gene ITGB1BP1 (integrin β1 binding protein 1) on chromosome 2. This suggests that these two genes might physically or genetically interact with each other or be involved in the same biological process. In fact, proteins encoded by these two genes are indeed involved in the cell adhesion process, in which PTK7 is a receptor tyrosine kinase transducing extracellular signals across the cell membrane and ITGB1BP1 plays an important role during integrin-dependent cell adhesion by binding the β1 integrin cytoplasmic domain.

Conclusion and discussion
We have presented an association-based method for detecting epistatic interactions for quantitative traits using family data, and applied this method to the analysis of gene expression phenotypes of CEPH family data provided by GAW15. When we applied the proposed method to the CEPH data, we detected six SNP pairs that showed significant epistatic interactions for 3 gene expression Receiver operating characteristic (ROC) curves for 100 sim-ulated data Figure 1 Receiver operating characteristic (ROC) curves for 100 simulated data.
phenotypes among the 27 phenotypes analyzed. We study the epistasis among genes by analyzing the interactions of SNPs located in the corresponding genes. This kind of epistasis detected from statistical tests does not necessarily correspond to the classic model of epistasis. Strong epistatic interactions among SNPs may not always indicate biological interactions among genes.
Although we demonstrated the association-based linear mixed-model approach for analyzing 27 phenotypes, the method is mainly proposed for analyzing thousands of phenotypes in genome-wide study. In general, one could identify two interacting linkage regions (QTL intervals) using two-dimensional genome linkage scan by allowing a higher a false-positive rate. Or one could do stepwise search of two interacting QTLs by identifying one primary QTL and then searching for the secondary QTL conditional on the primary locus being linked [13]. Once the two candidate intervals (regions) have been identified, the proposed linear mixed-model approach in this work could be used for epistasis detection between SNPS.
The proposed linear mixed model could be implemented in two steps by first regressing out the fixed effects (not including genetic effects) and polygenic effects and then detecting genetic interaction effects using predicted residuals from the first-step analysis. This strategy is attractive because of its flexibility in identifying the best statistical model for the joint effects of loci and computational efficiency for analyzing thousands of gene expression traits in genome-wide study.