Integrating genome-wide association and eQTLs studies identifies the genes associated with age at menarche and age at natural menopause

Objective An early onset of menarche and, later, menopause are well-established risk factors for the development of breast cancer and endometrial cancer. Although the largest GWASs have identified 389 independent signals for age at menarche (AAM) and 44 regions for age at menopause (ANM), GWAS can only identify the associations between variants and traits. The aim of this study was to identify genes whose expression levels were associated with AAM or ANM due to pleiotropy or causality by integrating GWAS data with genome-wide expression quantitative trait loci (eQTLs) data. We also aimed to identify the pleiotropic genes that influenced both phenotypes. Method We employed GWAS data of AAM and ANM and genome-wide eQTL data from whole blood. The summary data-based Mendelian randomization method was used to prioritize the associated genes for further study. The colocalization analysis was used to identify the pleiotropic genes associated with both phenotypes. Results We identified 31 genes whose expression was associated with AAM and 24 genes whose expression was associated with ANM due to pleiotropy or causality. Two pleiotropic genes were identified to be associated with both phenotypes. Conclusion The results point out the most possible genes which were responsible for the association. Our study prioritizes the associated genes for further functional mechanistic study of AAM and ANM and illustrates the benefit of integrating different omics data into the study of complex traits.


Introduction
Menarche is the first menstrual cycle and signals the possibility of fertility. An early onset of menarche is associated with risks for obesity, type 2 diabetes, cardiovascular disease, breast cancer and all-cause mortality [1]. Menopause is defined as the permanent cessation of menses due to the loss of ovarian follicular activity. Younger age at natural menopause (ANM) is associated with low risk of breast cancer and ovarian cancer, but higher risks of osteoporosis, cardiovascular disease and type 2 diabetes [1]. A Mendelian randomization study has found that later ANM causally increased the risk of breast cancer [2]. These two traits also mark the beginning and the end of a woman's reproductive life [3].
Genome-wide association studies (GWAS) are capable of identifying the association between target phenotypes and millions of genetic variants. GWAS of age at menarche (AAM) identified 106 loci containing 389 independent signals [4]. GWAS of ANM has successfully identified dozens of significantly associated loci [2,5,6]. Most of these loci encode proteins that appear to be involved in DNA repair, immune response and breast cancer processes [2,5]. However, GWAS can only identify those SNPs strongly associated with the target phenotypes, without pinpointing the target genes and the underlying biological mechanism. For example, the largest GWAS of ANM identified 44 loci containing at least one common variant significantly associated with ANM [2]. However, the significant SNPs in 21 loci were annotated to more than one gene in each locus. It suggested that the specific causal genes remain mostly unidentified.
A large number of genetic variants influences the target phenotypes by causal regulatory effect rather than directly influencing the structure of the protein [7]. Expression quantitative trait locus (eQTL), which is a genetic variant influencing a target gene's expression, is often used to explain the underlying biological mechanism of significant SNPs identified by GWAS. Previous studies have suggested that in the significant loci, those SNPs which were also eQTLs were more likely to be functional SNPs [8]. Zhu et al. proposed a summary-based Mendelian randomization (SMR) analysis to combine GWAS and eQTL data into a single analysis [7]. SMR integrates GWAS and eQTL data identified from the whole blood tissue to identify potential functionally relevant genes at the significant loci identified in GWAS. Previous studies have shown that whole blood can be a proxy of relevant tissues for various phenotypes and diseases [7,9].
In this study, we identified genes whose expression levels were associated with AAM or ANM due to pleiotropy or causality, by integrating ANM GWAS data with eQTL data. We conducted a colocalization analysis to identify significant SNPs causally associated with both phenotypes.

AAM GWAS summary dataset
The largest AAM GWAS meta-analysis contained data from ReproGen consortium studies UK Biobank and 23andMe study. Using the 1000 Genomes Project-imputed genotype data in up to~370,000 European women, 389 independent signals (P < 5 × 10 −8 ) were identified for age at menarche [4]. The summary data were downloaded from the following website (http:// www.reprogen.org).

ANM GWAS summary dataset
The largest-scale GWAS meta-analysis summary data of ANM was used in this study [2]. The GWAS meta-analysis conducted with a total sample of 69,360 individuals of European descent from 21 studies identified 44 significant loci. SNPs with the minor allele frequency (MAF) no less than 0.01 and the imputation quality larger than 0.4 were included in the meta-analysis. The summary data were downloaded from the following website (http://www.reprogen.org).

eQTL dataset
Because the Westra eQTL data [9] had a low coverage of human genes (5,967), in this study we used the Consortium for the Architecture of Gene Expression (CAGE) eQTL data which contained 11,829 unique probes to perform the SMR test [10]. The CAGE study was performed to investigate the genetic architecture of gene expression in peripheral blood in 2,765 European individuals [10]. We set the p-value threshold to be 5×10 −8 to select the top associated eQTL for the SMR test. After removing those probes where the p-value of the top eQTL was larger than 5×10 −8 , there were 8,144 probes left in the eQTL summary data. The binary summary data can be downloaded from http://cnsgenomics.com/software/smr/#DataResource.

Genetic correlation
We used stratified linkage disequilibrium score regression (LDSC) to estimate the genetic correlation between AAM and ANM using GWAS summary statistics [11].

SMR analysis
The method of SMR was fully described in the previous paper [7]. In this study, the phenotypic trait is the outcome (Y), gene expression is the exposure (X), and the top cis-eQTL that is strongly associated with gene expression is used as the instrumental variable (Z). SMR method assumes that the eQTL has an effect on the trait through the gene expression. In brief, there were three models including causality (Z -> X -> Y), pleiotropy (Z -> X and Z -> Y) and linkage (Z 1 -> X, Z 2 -> Y, and Z 1 and Z 2 are two variants in linkage disequilibrium (LD) in the cis-eQTL region). Since the SMR analysis assumes that the instrument (top cis-eQTL) has a strong effect on the exposure (gene's expression level), only probes with at least one cis-eQTL at P eQTL (a p-value from the eQTL study indicating the significance of the eQTL associated with the gene expression) smaller than 5×10 −8 in the cis-eQTL region were included in the eQTL summary data (hg19) [7]. We excluded cis-eQTL with MAF < 0.01 and cis-eQTL in the MHC region because of the complexity of LD patterns in this region [7]. In this study, we tried to identify those genes with causal or pleiotropic effect on AAM or ANM.
To distinguish the causality and pleiotropy models from the linkage model, we conducted the heterogeneity in dependent instruments (HEIDI) test [7]. The HEIDI test considers the pattern of associations using all the SNPs that are significantly associated with gene expression (eQTLs) in the cis-eQTL region (±500kb from the center of the gene probe). The null hypothesis is that there is a single variant affecting trait and gene expression (pleiotropy or causality). The alternative hypothesis is that gene expression and trait are affected by two distinct variants. Under Hardy-Weinberg equilibrium and linkage disequilibrium (LD), b XY (the effect of the gene expression on the trait) estimated at the top associated cis-eQTL (b XY(top) ) will be equal to that estimated at any of the cis-SNPs in LD that is associated with gene expression. If we define d i = bXY(i)−b XY(top) , with b XY(i) being the b XY value of the i-th cis-eQTL, then it is equal to test whether d i = 0. If the number of significant eQTL (excluding the top cis-eQTL) in the cis-eQTL region is m, then we can have a normal vector z d  [7]. SNPs in the cis-eQTL region with a P eQTL > 1.6 × 10 −3 (equivalent to χ 2 < 10) were removed to avoid weak instrumental variables according to the original paper [7]. We used P HEIDI > 0.05 to exclude those genes belonging to the linkage model [7]. The SMR software was downloaded from http://cnsgenomics.com/software/smr/#Download.
It was impossible to give a causal conclusion based on only one instrumental variable. In Mendelian randomization studies, multiple uncorrelated instrumental variables (for example, the trans-eQTLs and/or uncorrelated cis-eQTLs) were needed to identify the causality. However, multiple uncorrelated instrumental variables (IVs) were not available in the CAGE eQTL data. In this study, we did not distinguish the causality model from the pleiotropy model.

Colocalization analysis
Colocalization analysis was used to identify the genetic variants affecting both phenotypes. The method was detailed in the previous paper [12]. In brief, the method based on a hierarchical Bayesian model which can be used to find the region containing a variant that influences both phenotypes. According to the previous paper, there were four models that a given genomic region either 1) contains a genetic variant that influences the first trait, 2) contains a genetic variant that influences the second trait, 3) contains a genetic variant that influences both traits, or 4) contains both a genetic variant that influences the first trait and a separate genetic variant that influences the second trait [12]. It estimated the posterior probability of each model. The threshold of posterior probability equal to 0.9 was used to control the false discovery rate at level 0.1 [12].

Results
The genetic correlation between AAM and ANM was 0.0079 (p-value = 0.0032). The genomewide significant level for SMR analysis was P smr < 6.14×10 −6 (0.05/8,144, Bonferroni test). We identified 98 gene-trait associations with P smr < 6.14×10 −6 . After the application of the HEIDI test, this reduced to 54 gene-trait associations (P HEIDI > 0.05). Those genes which did not pass the HEIDI test may be associated with AAM or ANM due to linkage.
We identified 31 genes associated with AAM (Table 1) and 24 genes associated with ANM due to pleiotropy or causality ( Table 2). Three (ATP1B3, NAAA, and GRTP1) among of the 31 genes associated with AAM can be considered as novel genes, i.e. no previously identified SNP reported as genome-wide significant in the primary GWAS paper in the cis-eQTL region of the probes (Fig 1). Among the 24 genes associated with ANM, seven genes (MSH6, TLK1, SYCP2L, BRCA1, PGAP3, DIDO1, and DDX17) were previously annotated to be responsible for the association based on distance, biological function, eQTL effect and non-synonymous SNP in high LD. We also identified 6 new genes (AK125462, MSL2, CLSTN3, TRAPPC2L, DDX5, and CPNE1) where there was no significant SNP (p < 5×10 −8 ) in the cis-eQTL region of the probes (Fig 2). C17orf46 was the only gene identified to be associated with both phenotypes.
To identify more pleiotropic SNPs and genes associated with both phenotypes, we conducted a colocalization analysis. One region was identified to contain a variant influencing both phenotypes with the posterior probability of 0.92 (Table 3). Thirteen regions were considered to influence the two phenotypes through different variants (Table 3). rs3136249, with the largest posterior probability (0.37), was considered to be the causal SNP influencing both phenotypes.

Discussion
In this study, we identified 31 genes whose expressions were associated with AAM and 24 genes whose expressions were associated with ANM due to pleiotropy or causality. In total, we identified 9 new genes where there was no significant SNP in the cis-eQTL region of the gene probe. Many of these genes participated in DNA repair, immune response, and breast cancer process [2,5]. C17orf46 was identified to be associated with both phenotypes by integrating GWAS and eQTLs data. We also found one region with a pleiotropic effect influencing two phenotypes through the colocalization analysis. Although the previous study performed the SMR analysis with Westra eQTL data which had a low coverage of human genes (5,967) compared to CAGE eQTL data (8,144). Thus the previous study may omit many potential genes. Eleven genes of the 31 genes associated with AAM were successfully identified by using CAGE eQTL data (Table 1). SMR demonstrated that it was useful to prioritize genes associated with AAM or ANM. SMR tests reduced the multiple hypothesis burdens by testing tens of thousands of genes instead of millions of SNPs [13]. It suggested that SMR was useful in identifying novel genes associated with AAM or ANM.
In this study, we identified 3 novel genes associated with AAM and 6 novel genes associated with ANM. One novel gene DDX5, which is also known as p68, was identified to be associated with ANM. DDX5 is a prototypic member of the DEAD box family of RNA helicases that encompasses multiple functions. DDX5 was highly expressed in a high proportion of breast topSNP was the most significant SNP in the cis-region of the probe. topSNP_bp was the position of the most significant SNP. p_GWAS was p-value from GWAS.
p_eQTL was p-value from the eQTL study. b_smr was effect size from SMR test. p_smr was p-value from SMR test. p_HEIDI was p-value from HEIDI test. a: These genes were annotated by previous ANM GWAS.
b: These genes were considered as novel genes. No SNP in the cis-eQTL region of the probes was identified to be significantly associated with ANM according to the primary GWAS. https://doi.org/10.1371/journal.pone.0213953.t002 Identifying the genes associated with age at menarche and age at natural menopause Identifying the genes associated with age at menarche and age at natural menopause cancers. Patients with a detectable level of both DDX5 and polo-like kinase-1 (pLK1) often had a poor prognosis [14].
In the significant loci, we redefined the functional genes which were more likely to play important roles in the process of menarche or natural menopause. We presented a list of genes to be followed up in future functional validation experiments. For example, HSD17B12 coding a 17beta-hydroxysteroid dehydrogenase transforms estrone (E1) into estradiol (E2) [15]. E2 is involved in the regulation of the estrous and menstrual female reproductive cycles. However, the previous study annotated the significant SNP to MIR129-2 (Table 1). Fatty acid amide hydrolase (FAAH), the enzyme that breaks down the endocannabinoid anandamide and controls its levels, is regulated by estrogen [16]. The previous study annotated the significant SNP in this locus to RAD54L (Table 2), however, we found that FAAH was more likely to be the causal gene in this locus. Another example is SRPK1, encoding the splicing factor kinase SRSF protein kinase 1, which was highly expressed in basal breast cancer cells [17]. The knockdown of SRPK1 significantly suppressed metastasis of breast cancer cells [18].
Despite the common belief that multiple genes are responsible for controlling the timing of menarche and natural menopause, very few genes have been identified that contain common genetic variants associated with AAM and ANM. In this study, we identified two genes (MSH6 and C11orf46) associated with both traits. rs3136249 is located in the intronic region of MSH6. MSH6, which is a mismatch repair gene, was found to be associated with ANM by the previous study [19]. The colorization analysis showed C11orf46 locus may be associated with both traits with the posterior probability of 0.79. This may be caused by the relatively large posterior probability of model 4 (0.21). However, the function of C11orf46 is unknown, further studies are needed to prove this result.
The present study may have some limitations that should be considered. Although we redefined the functional genes in the significant loci, these genes may be associated with age at natural menopause due to pleiotropy which meant that some of these genes may be not the causal genes. Due to the limitation of the method, we did not distinguish those pleiotropic genes from causal genes. So, further works are warranted to confirm the functional genes and explore the underlying mechanism.
In conclusion, we highlighted the putative functional genes in the significant loci for AAM and ANM. Our study prioritizes the associated genes for further functional mechanistic study of AAM and ANM and illustrates the benefit of integrating different omics data into the study of complex traits. Our study may help to understand the ovarian function and benefit for women's reproductive health.