Abstract

A variety of methods have been proposed for studying the association of multiple genes thought to be involved in a common pathway for a particular disease. Here, we present an extension of a Bayesian hierarchical modeling strategy that allows for multiple SNPs within each gene, with external prior information at either the SNP or gene level. The model involves variable selection at the SNP level through latent indicator variables and Bayesian shrinkage at the gene level towards a prior mean vector and covariance matrix that depend on external information. The entire model is fitted using Markov chain Monte Carlo methods. Simulation studies show that the approach is capable of recovering many of the truly causal SNPs and genes, depending upon their frequency and size of their effects. The method is applied to data on 504 SNPs in 38 candidate genes involved in DNA damage response in the WECARE study of second breast cancers in relation to radiotherapy exposure.

1. Introduction

The Women’s Environment, Cancer And Radiation Epidemiology (WECARE) study [1] is aimed at a comprehensive examination of genes involved in particular functional pathways. The study is a population-based nested case-control study of 708 contralateral breast cancers (CBC) within a notional cohort of about 65,000 survivors of a first breast cancer, 1401 of whom serve as controls, and the primary exposure of interest is ionizing radiation dose to the contralateral breast from radiotherapy for treatment of the first cancer. Ionizing radiation is known to cause double strand breaks (DSBs) in DNA, which can invoke any of several DNA damage response mechanisms, primarily DSB repair via either homologous recombination or nonhomologous end joining, cell cycle checkpoint regulation, or apoptosis. The original study focused on mutations in the ATM gene, which plays a central role in the recognition of DSBs. The study was then extended to include BRCA1, BRCA2, and CHEK2, which are all involved in homologous recombination repair (HRR), and later still to include a broader set of 38 candidate genes involved in this and other pathways for DSB damage response. We have previously reported on the main effects of ionizing radiation [2, 3], ATM [46], BRCA1/2 [712], CHEK2 [13], and the interactions of radiation with ATM [14] and BRCA1/2 [15] as well as with other treatments and reproductive factors [16, 17], amongst other risk factors. The aim of this paper is to provide a comprehensive modeling strategy for examining the effects of all genes in a pathway and to apply the approach to candidate genes for CBC risk in the WECARE study.

There are a growing number of literature works on methods for pathway modeling, motivated in large part by an interest in mining GWAS data for commonalities across related genes that individually may not achieve genomewide significance but in the aggregate may point to novel pathways (see [18] for a review of gene set enrichment analysis and alternatives). Our goal here is more modest, guided by an a priori selection of strong candidate genes [19]. Like other methods of pathway analysis, however, we aim to exploit external knowledge about the biological function of each gene and the relationships between them [20].

Our starting point is a model for multiple variants proposed by Quintana et al. [11], which collapses a subset of the variants within a gene into a single “burden” type index, similar to a number of other recent rare variant proposals (see Basu and Pan [21] for a review and comparison by simulation), but extended to allow for both deleterious and protective effects and to explicitly allow for uncertainty about which variants to include in the model (and which direction for those that are included) by Bayesian model averaging. This approach was further extended to incorporate prior covariates in the probabilities of SNP inclusion [12, 22]. Hoffman et al. [23] introduced a step-up variable selection approach that allows for deleterious and protective effects but did not consider model uncertainty except in the form of a permutation procedure for the overall significance test so is unable to assess the importance and direction of particular variants or alternative models. Chen et al. [24] describe a somewhat similar model that combines variable selection at the SNP level with shrinkage at the gene level. In the current paper, we extend this approach to multiple genes, incorporating prior covariates and prior gene-gene similarity information in a hierarchical modeling framework.

2. Model Specification

We have information on individuals with binary outcomes , a vector of fixed effects (age, family history, etc.), and a vector of SNP genotypes within multiple genes for each individual. We propose a novel model based on a hierarchical Bayes framework with three levels: (i) a subject-level model for the association between genes and disease, (ii) a gene-level model for the regression coefficients in the gene-disease association model, and (iii) a SNP-level model describing which variants contribute to each gene and the direction of their effects. (These submodels are described by (1), (2), (4), and (5), resp., below and the surrounding text.) The general framework is similar to one recently proposed by Quintana et al. [12, 22] but differs in a number of details. The overall model is represented as a directed acyclic graph in Figure 1, where boxes represent observed data and circles represent latent variables or model parameters; single arrows denote stochastic links, while double arrows denote deterministic links. The 3 dotted rectangles enclose the covariates and parameters included in each level of the model and their relations.

The subject-level model is specified in terms of a burden index for each gene, a deterministic function comprised of the number of positively associated SNPs minus the number of negatively associated SNPs; however, the choice of whether a SNP is included or not and, if included, its direction is stochastic, governed by prior probabilities that could in principle vary across genes or across SNPs within genes. The gene-level model has means and covariances for each ln RR (relative risk in log scale) coefficient that can depend upon external information (“prior covariates” and prior “gene-gene connections”). In principle, the SNP-level model could also include prior covariates [22], although that is not considered here. For the simulations and the analysis of the real WECARE data, we used the Gene Ontology (GO, [26] a pathway ontology database, http://www.geneontology.org/) for the 38 WECARE candidate genes to construct the prior covariate and connection information, as described in more detail in the simulation section.

Level 1. The subject-level model for case-control data uses a conditional logistic regression model to relate burden indexes for genes to a binary outcome variable , the disease status for individual . Here, denotes a deterministic function of the SNP genotypes for SNP in gene with corresponding weights defined in the level 3 model. Thus, the first level model is of the following form: where denotes a vector of fixed covariates (confounders) with coefficient vector . The offset term is needed to account for the counter-matched design in the WECARE study [1].

Each gene burden index has a log regression coefficient describing its contribution to risk, the interpretation of which will depend upon the current assignment of weights. A change of the genotype of a single SNP in the function is reflected by the change of on logit scale. This is based on all SNPs tested in the gene, but each SNP has a different weight with different prior probabilities; the details are explained in level 3 of the model. The exponentiation of the s ensures that the effects of each gene will be positive, thereby avoiding the label-switching problem that would arise if the signs of and all the were reversed for a given gene. This also avoids having to deal with truncated normal distributions if were not exponentiated but instead constrained to be positive. (We call (1) Model I and briefly describe this alternative possibility (Model II) in Section 7.)

Level  2. The regression coefficients in the first level logistic regression model are given by the gene level of the hierarchical model: where

The level 2 model uses a simple linear regression to relate the regression coefficients from the level-1 model to external information on the genes’ involvement in certain pathways and the similarity of their effects. We incorporate information regarding prior predictions of the effects of each gene into the design matrix , here structured as a gene-by-pathway matrix of binary values, each indicating whether a gene is in a particular pathway. Basically, contains second-stage covariates for each of the genetic factors. is a column vector of coefficients corresponding to these higher-level effects and is assigned an independent normal prior with mean 0 and variance and identity matrix . Prior information about gene-gene connections is incorporated in the matrix for the random effects with a multivariate normal distribution centered at zero with variance . The term is included as a residual error, also given a zero mean independent normal distribution, with specifying the residual variance of the second-stage covariates.

Level 3. The SNP-level model defines the deterministic functions , where each gene is uniquely determined by the SNP inclusion indicator variables . The serve as a design matrix of genetic factors for the individuals within the study. In other words, the function serves as a risk index for each gene and as a weighted sum of SNP effects within each gene: where the weights , 0, or +1 have prior probabilities: Here, denotes the number of SNPs in gene and the average number of SNPs across all genes; we assigned to be the minimum number of SNPs within any gene. and represent the parameters of the prior probabilities for deleterious and protective SNP effects, respectively. This form of prior probabilities for the SNP indicator variables keeps the expected number of SNPs included in the model to be roughly similar across genes while allowing genes with more SNPs to have similar probabilities of being included as genes with fewer SNPs. For now, we treat as fixed parameters, but these too could be given hyperpriors and estimated.

The posterior estimates for the association parameters resulting from the three-level hierarchical Bayesian analysis are an inverse-variance weighted average between the conventional estimates from the logistic regression only and the estimated conditional second-stage means, . Between the maximum likelihood first-stage estimates and the second-stage prior estimates, the weights will favor the one with smaller variance. This intuitive weight adjustment is one of the important differences between Bayesian hierarchical approach and the single-stage logistic regression analysis.

Finally, the variance components are given standard conjugate inverse gamma hyperprior distributions:

3. Fitting the Model

The full model is fitted in a sequence of Markov chain Monte Carlo (MCMC) steps described in detail in the Appendix. Basically, the selection of SNPs to include in each gene is performed by sampling from their full conditional distributions one at a time; this involves an approximation to the change in the corresponding estimate of and hence the likelihood that would result from adding or deleting that SNP. The gene-level regression coefficients and correlated random effects are accomplished by the Metropolis-Hastings moves for the entire and vectors, conditional on the current SNPs in the model, the prior covariates , and gene-gene correlation matrix , using a multivariate normal proposal. The second-level gene-level coefficients and the independent and correlated variances and are then sampled using further Metropolis-Hastings moves. Updating the coefficients of the fixed covariates involves only a standard update for logistic regression.

4. Posterior Summarization

Instead of parameter estimation, we focus primarily on hypothesis testing and model selection. We use the Bayes factors (BF) at both the SNP level and the gene level to compare the posterior odds provided by data to their prior odds of a pair of hypotheses. Kass and Raftery [27] suggest a qualitative interpretation of BF > 3 (or equivalently ) as providing “positive” evidence, >20 as “strong” evidence, and >150 as “very strong” evidence.

We tabulate the following quantities, where denotes the ensemble of all the data.(i)For each SNP, the posterior probability of and Bayes factor where the first factor is the ratio of posterior probabilities that SNP in gene g has any effect (positive or negative) versus no effect given the data and the second factor is the corresponding ratio of prior probabilities.(ii)For each gene, the Bayes factor for the probability that at least one SNP is included in the model is We also tabulate the posterior means and standard deviations of each, along with the mean number of SNPs included in the model.(iii)For the other parameters, , , , , and , we simply tabulate the posterior means and SDs.(iv)Finally, we tabulate the posterior distributions of numbers of SNPs and numbers of genes with at least one SNP included in the model.

5. Simulation Studies

We conducted simulation studies based on the structure of the real WECARE study data described below. Specifically, we used the real SNP, covariate, and counter-matching offset data for each risk set and reassigned case/control status in each risk set based on an assumed relative risk model. We used the estimated values of the coefficients for the fixed covariates and randomly assigned weights to SNPs and log relative risk coefficients to each gene under the models described above. There were a total of 504 SNPs in 38 genes (ranging from 1 to 51 SNPs per gene) involved in DNA damage response pathways (DNA repair, cell cycle checkpoint control, and apoptosis). Using the Gene Ontology, we extracted 860 terms relating to biological process or molecular function annotated to any of these 38 genes and selected four of these GO terms as prior covariates in the matrix (specifically, DNA damage checkpoint, MRE11 complex, double-strand break repair via nonhomologous end joining, and negative regulation of cell cycle), with , and 1 respectively, and the intercept was set to −2. All 860 GO terms were used to construct a correlation matrix for the similarity in the ways each pair of genes was described in the GO (Figure 2). The log relative risk coefficients were assigned with mean and SDs of and . SNP weights were assigned with and . The resulting gene indices and the corresponding , along with the real and estimated coefficients and offset terms, were then used to compute each subject’s relative risk and randomly assign which member of each risk set would be designated as the case. The estimates are based on 10 replicates for the data of each of 10 realizations of the and from these model parameters, using 1000 MCMC scans for tabulation after a burn-in of 500 scans. It yielded a total of 32 causal SNPs in 24 of the genes on average. Table 1 summarizes the posterior probabilities for SNP and gene inclusion, along with the proportion of SNPs and genes with BFs greater than 3, 20, and 150. Although the differences between null and causal SNPs and genes are somewhat modest, there is a clear shift in both the posterior probabilities and the Bayes factors in the appropriate directions.

6. Application to the WECARE Study Data

Using the same settings as for the simulation studies, we analyzed the real WECARE study data, except that 10,000 scans were retained after a burn-in of 4,000 iterations. The posterior distributions of numbers of genes with at least one SNP included and numbers of SNPs included are shown in Figures 3(a) and 3(b). An average of 10 SNPs in 9 genes was included in the model. Figure 4 shows the posterior probabilities (a) and Bayes factor for each of the genes (b) and SNPs (c). At the gene level, only MDC1 and RAD51 were included with substantial Bayes Factors of 20.71 (“strong evidence”) and 3.51 (“positive evidence”), respectively, while ATM and NBN were identified only with BFs between 1 and 3. In this analysis, the known deleterious variants in ATM, BRCA1, BRCA2, and CHEK2 were treated as fixed covariates rather than being lumped in with the other tag SNPs. None of the four GO terms selected as prior covariates contributed significantly to the model, the strongest being DNA damage checkpoint (, SE = 0.27). The correlated variance , and the independence variance , suggesting moderately strong residual gene-gene similarities (spatiality ) defined by the ensemble of all GO terms and not explained by the regression of s on the subset of selected GO terms.

Table 2 lists the numbers of pairs of the homozygous reference allele, heterozygous allele, and homozygous risk allele for cases (CBC) and controls (UBC), respectively, for all the SNPs identified by our models and by a previous WECARE publication [25]. We also report the estimated s from simple logistic regression for each selected SNP, adjusted for the same set of covariates (age, menarche, menopause, family history, pregnancy, histology, treatment, the FGFR2 GWAS-identified SNP, and deleterious variants in ATM, BRCA1, BRCA2, CHECK2s and offset term) as in our model. The logistic regression found SNPs rs4713354 and rs2269705 in MDC1 to be strongly associated with CBC risk (), and SNPs rs1800057 v_IVS14 m55, rs13447682, rs3736640, and rs1801320 had protective effects with statistical significance () or with marginal statistical significance (rs6005861 and rs9297757, ).

Table 2 also shows the SNP Bayes factors, based on which our model selected a total of nine SNPs with positive to strong evidence for disease association. Two SNPs (one in NBN and one in RAD51) were identified with strong evidence (BF > 20) and seven SNPs from four genes (ATM, CHEK2, MDC1, MRE11A) with positive evidence (BF > 3). In a prior study by the WECARE study Collaborative Group, 134 common variants in six DNA damage response genes (CHEK2, MRE11A, MDC1, NBN, RAD50, and TP53BP1) were tested separately or within haplotypes for association with CBC risk [25]. Six SNPs were reported to be associated with CBC risk with , but none remained statistically significantly associated after correction for multiple comparisons. Five SNPs (rs6005861 in CHEK2, rs4713354 in MDC1, rs13447682 in MRE11A, and rs9297757 and rs3736640 in NBN) among those six SNPs reported by Brooks et al. were selected by our model for showing positive or strong evidence for CBC risk. The remaining SNP (rs14448 in NBN) reported by Brooks et al. was not statistically significantly associated with CBC in the logistic regression (). All the SNPs except rs4713354 in MDC1 reported by Brooks et al. were found to have protective effects in the log-additive model. The same direction of the risk was also found for each SNP in the logistic regression. In addition, our model shows positive evidence of CBC risk for SNP rs1800057, a variant in ATM, which was previously shown to be associated with a statistically significant reduction in CBC risk [28] in the WECARE study. Its protective effect was also found in the logistic regression (ln RR = −0.47, ).

Seven of the nine SNPs selected by our model have been found associated with breast cancer risk in previous investigations. Besides the six SNPs reported in the previous WECARE study, rs1801320 (135G > C), a SNP in the 5′-untranslated region (UTR) of the RAD51 gene, was found with mixed results for its role in breast cancer risk from other breast cancer risk studies [2931]. In addition to those previously reported SNPs, our model selected rs4987951 in ATM and rs2269705 in MDC1, about which we found no previous reports of association with breast cancer.

7. Discussion

Our model is motivated in part by ongoing work on methods for testing associations with multiple rare variants in next generation sequencing data [12, 22], for which it is obvious that attaining statistically significant results for any single variant is difficult because of their rarity and the enormous multiple comparisons penalty. This motivates our choice of a burden index for gene-level associations comprising simple −1/0/+1 weights with model averaging across their uncertainty distribution. For common variants with minor allele frequencies (MAF) >5% (and perhaps in candidate gene studies for uncommon variants with 1% < MAF < 5%), it may be possible to allow each SNP to have its own regression coefficient from some continuous distribution, but constraints would be needed to ensure identifiability if both SNP- and gene-level parameters were to be estimated.

As a compromise, we have treated the known deleterious variants in ATM, BRCA1/2, and CHEK2 as fixed covariates, along with age, treatment, reproductive variables, and so forth, since it seems unreasonable to consider these variants as exchangeable with the tagging SNPs. Unfortunately, this precludes borrowing strength across all the variants within these genes—that is, given that we know that some variants in these genes are deleterious, it would seem more likely that there would be other causal variants in the same genes. Furthermore, if these four genes have similar prior covariate values , that should inform the estimation of the corresponding and draw the estimates of s for other genes that are highly correlated with them in the matrix towards the values for these genes.

We have included prior information only on genes, not SNPs, in our model, since the GO does not provide any annotation of specific variants within genes. However, there are many ways of classifying SNPs a priori, such as simple indicators for whether they are coding or noncoding variants or the predictions of programs like SIFT [32] and PolyPhen [33] based on predicted effects on protein conformation or evolutionary conservation. Such information could easily be incorporated into a multinomial logistic or probit model for the inclusion probabilities [12, 22]. The current version of our program treats and as fixed constants, but these could simply be assigned prior Beta distributions, subject to the constraint that .

In addition to the model described above (Model I), we considered an alternative Model II with a similar structure, except that the gene log RR coefficients are not exponentiated: To ensure that they are positive, the second level of the hierarchical model is in the following form: where denotes the probability density of normal distribution and denotes the cumulative density of normal distribution. This is a proper density for , since it integrates to one. The third level of Model II remains the same as Model I. Model fitting is similar to Model I except for some details in updating s and s.

In the simulations, Model II yielded a total of 47 causal SNPs in 25 of the genes on average. Model I showed higher sensitivity and specificity for SNP selection (Table 2) than Model II based on both posterior SNP inclusion and SNP BFs. Model II showed a higher sensitivity for gene selection than Model I based on the posterior gene inclusion, but a lower specificity. In addition, Model I showed a higher sensitivity based on gene BFs.

In the application to WECARE data, Model II identified 5 SNPs in genes MDC1, NBN, and RAD51, with positive evidence for disease association (BF > 3). Four (rs4713354, rs2269705, rs9297757, and rs1801320) of the five selected SNPs are in common with Model I, two (rs4713354, rs9297757) are in common with Brooks et al. [25], and one (rs11620361) is not in common with previous methods. One gene (MDC1) was selected with positive association based on gene-level Bayes factors (BF = 6). Both the simulation study and real data application suggested that Model I performs better than Model II in terms of selecting causal variants.

We have extended the model to incorporate gene-environment () interactions with radiotherapy or radiation dose since the focus of the WECARE study is on these genes acting in response to the DSB damage induced by radiotherapy exposure. Extending the model to incorporate interactions is straightforward, simply adding the main effect of and an additional vector of interaction terms to the subject-level model and then putting a similar prior on the interaction coefficients. For the time being, we have treated the s and s as independent, but a more appealing approach would be to treat them as having bivariate normal distributions depending on and . No significant interactions were found in this model (results not shown).

It remains to be seen whether this approach is scalable to GWAS data. As currently implemented with MCMC methods, the approach would not be computationally feasible, even with parallel implementations on high-performance computing environments. However, work in progress (Quintana et al. [11, 12, 22]) suggests that analytic approximations may be possible that would obviate the need for MCMC methods.

Appendix

Model Fitting

At each iteration, the following updates are performed.

Selection of SNPs to include in the model involves evaluating the three posterior probabilities for and selecting with the corresponding probability where is a single Newton step iteration towards the maximum likelihood estimate (MLE) of if were set to .

Update the vector of regression coefficients using a multivariate Metropolis-Hastings move with proposal and acceptance probability Update the vector of random effects with a similar Metropolis-Hastings move with acceptance probability Note that an alternative possibility would be to sample from its marginal distribution and omit the update of the s.

Update the prior regression coefficients by a simple linear regression and taking a multivariate normal around its MLE, Update the variances and using a Metropolis-Hastings move with proposals and similarly for , with acceptance probabilities As noted above, we treat the s as fixed, but these too could be given prior distributions and estimated as well.

The coefficients () of subject-level confounders are updated using single Newton-Raphson iteration towards the MLE of , following a random multivariate normal update to sample the new . The procedure is based on the approximation that the likelihood for is quadratic with flat priors.

Acknowledgments

The authors greatly appreciate valuable methodological suggestions from Melanie Quintana and David Conti. This work was supported by NIH Grants R01-ES019876, P30-ES07048, R01-CA112450, R01-ES016813, R01-CA129639, R01-MH084678, and R01-HG005927. The authors are grateful to the WECARE investigators for providing the data used for the application.