Associations between genetic variants in mRNA splicing-related genes and risk of lung cancer: a pathway-based analysis from published GWASs

mRNA splicing is an important mechanism to regulate mRNA expression. Abnormal regulation of this process may lead to lung cancer. Here, we investigated the associations of 11,966 single-nucleotide polymorphisms (SNPs) in 206 mRNA splicing-related genes with lung cancer risk by using the summary data from six published genome-wide association studies (GWASs) of Transdisciplinary Research in Cancer of the Lung (TRICL) (12,160 cases and 16,838 controls) and another two lung cancer GWASs of Harvard University (984 cases and 970 controls) and deCODE (1,319 cases and 26,380 controls). We found that a total of 12 significant SNPs with false discovery rate (FDR) ≤0.05 were mapped to one novel gene PRPF6 and two previously reported genes (DHX16 and LSM2) that were also confirmed in this study. The six novel SNPs in PRPF6 were in high linkage disequilibrium and associated with PRPF6 mRNA expression in lymphoblastoid cells from 373 Europeans in the 1000 Genomes Project. Taken together, our studies shed new light on the role of mRNA splicing genes in the development of lung cancer.

Despite the great success achieved by GWASs, the identified SNPs by GWASs still only explain a small fraction of heritability of lung cancer, a phenomenon called "missing heritability" 18 . Therefore, many complementary approaches have been developed to improve the study power of GWASs. For example, one of such approaches is the pathway-based association analysis through additional imputation to increase the number of SNPs to be analyzed, which could detect the risk-associated genes with multiple, independent and small effect sizes. This approach could alleviate issues due to insufficient chip coverage of earlier GWASs, reduce the multiple-testing burden of GWAS, achieve consistent results across studies and provide understanding of genetic findings with some functional relevance [19][20][21][22] .
Among the biological candidate pathways for lung carcinogenesis, mRNA splicing, a modification of the pre-mRNA transcript in which introns are removed and exons are joined, is of significant importance 23 . In most of the circumstances, this pathway functions normally for the human genome to generate proteomic diversity sufficient for various biological processes. However, cancer cells may take advantage of this mechanism to produce aberrant proteins with added, deleted, or altered functional domains that contribute to carcinogenesis including lung cancer 24 . Recent studies had shown that somatic mutations and germline variants of some mRNA splicing-related genes were associated with development of lung cancer. For instance, Shen et al. found several SNPs in mRNA splicing associated genes (SRSF7, PTBP2 and HNRNPQ) were associated with lung cancer risk in a Chinese population 25 . To date, however, only a limited number of candidate genes and SNPs in the pathway have ever been studied and reported. In the present study, we systematically selected 206 mRNA splicing-related genes and comprehensively investigated associations between 11,966 genetic variants of these genes and lung cancer risk using eight published lung-cancer GWAS datasets from the Transdisciplinary Research in Cancer of the Lung (TRICL) Consortium.

Results
Overall associations between SNPs in mRNA splicing-related genes and lung cancer risk in the TRICL Consortium. Overall,11,966 SNPs in 206 mRNA splicing-related genes in six GWAS datasets were selected from the TRICL Consortium (an imputated dataset that included 5,472,374 common SNPs), and their associations with lung cancer risk are shown in the Manhattan plot (Fig. 1). The sample size for each study had been listed in Table 1. After FDR correction (P FDR corrected ≤ 0.05), the 12 SNPs in three genes remained statistically significant (Fig. 1). The most significant SNP for each gene was: rs75100087 in PRPF6 (P = 6.83E-06); rs115420460 in DHX16 (P = 3.93E-09) and rs114312980 in LSM2 (P = 2.19E-07). Their basic information and associations with lung cancer risk are listed in Table 2. SNPs in DHX16 (rs115420460) and LSM2 (rs114312980, rs115489726, rs115801685, rs114637560, rs115834633) were excluded from further analysis due to their locations within the previously identified lung cancer susceptible region Chr6p21.33 and in high LD with previously reported lung cancer GWAS SNPs. Specifically, as shown in Supplement Figure 2, rs115420460 in DHX16 was in high LD (r 2 > 0.6) with the previously reported lung cancer GWAS SNP rs4324798 9 . Similarly, Supplement Figure 3 shows that all of the five SNPs in LSM2 were in high LD (r 2 > 0.8) with SNP rs3117582 that was identified from a the previously published lung cancer GWAS 14 . Therefore, we focused on the remaining six newly identified SNPs in PRPF6 (Table 2) for further analysis. Regional plot and recombination rates of PRPF6 (rs8126213 ± 500 kb) in the TRICL consortium are presented in Fig. 2.
Expression quantitative Trait loci (eQTL) analysis. Expression quantitative trait loci (eQTL) analysis, which directly investigates the correlations between genetic variants and gene expression, has been gradually proved as an effect method to characterize the function of SNPs. In the present study, the eQTL analysis was performed by using the data available from lymphoblastoid cell lines derived from 373 individuals of European descendent (http://www.1000genomes.org/). Finally, all of these six SNPs in PRPF6 were found to be significantly   Fixed-effects models were used when no heterogeneity was found between studies (Q > 0.10 and I 2 < 25.0); otherwise, random-effects models were used. 6 + means positive association, and − means negative association.
associated with expression levels of PRPF6 in both additive and dominant models (Supplementary Figure 5 and Supplementary Table 2). We also queried the GTEx database (http://www.gtexportal.org/) and found that SNP rs8126213 was significantly correlated with mRNA expression levels of PRPF6 in normal lung tissues (P = 1.50 × 10 −5 ), which was consistent with the results in the lymphoblastoid cell lines. Similar results were found for other five SNPs (Supplementary Figure 6).

Replication of PRPF6 SNP (rs8126213) in another two lung-cancer GWASs.
We then selected one of these six PRPF6 SNPs (rs8126213) and investigated its association with lung cancer risk in two additional lung-cancer GWASs of Caucasian origin, Harvard Lung Cancer Study (984 cases and 970 controls) and Icelandic Lung Cancer Study (deCODE from the ILCCO) (1,319 cases and 26,380 controls). However, no significant association was observed in these two studies (P = 0.801 and 0.850, respectively), which may potentially result from a relatively limited sample size of the cases, although their effects were in the same direction as in the TRICL-ILCCO studies, and the overall effect among all these eight GWASs remained significant (OR = 0.91, 95% CI = 0.87-0.95, P = 6.36E-05 in an additive genetic model) ( Table 3). Similar effects were also observed on adenocarcinoma (OR = 0.91, 95% CI = 0.87-0.95, P = 0.002) and squamous cell carcinoma (OR = 0.90, 95% CI = 0.83-0.97, P = 0.005) in the subgroup analysis.

Discussion
mRNA splicing is an important biological mechanism to regulate mRNA expression. Somatic mutations and germline variants of the mRNA splicing-related genes have been found to be associated with various kinds of cancers. For example, SF3B1, U2AF1 and SRSF2 mutations were observed in myeloid and lymphoid lineage tumors 26 . SF3B1 mutations were commonly found in uveal melanomas 27 . Mutations affecting spliceosome genes that resulted in defective splicing were attributed to leukemia 28 . For lung cancer, it had been reported that U2AF1 mutation was found in 3% of lung adenocarcinoma cases 29 . However, few studies reported a role of genetic variants in the etiology of cancer. For example, only one recent study found that genetic variants in three mRNA splicing-associated genes might modify individual susceptibility to lung cancer in a Chinese population 25 .
In the present study, we systematically evaluated the associations between 11,966 genetic variants in 206 mRNA splicing-related genes and lung cancer risk using the data from eight published GWAS datasets. To the best of our knowledge, this is the first and largest study focusing on exploring associations between SNPs from mRNA splicing-related genes and risk of lung cancer. Overall, our study identified six SNPs within mRNA splicing-related gene PRPF6 that might play an important role in the development of lung cancer. Furthermore, all of these six SNPs were found to be associated with PRPF6 mRNA expression in both lymphoblastoid cell lines and normal lung tissues of European descendent. Therefore, we proposed that these six SNPs were associated with abnormal expression of PRPF6 and thus modified individual's susceptibility to lung cancer. However, two major issues of the present study should be addressed. Firstly, the associations of these six SNPs with risk of lung cancer were observed in the six TRICL GWASs but not in Harvard and deCODE GWASs, probably due to a relatively limited sample size of the cases in the replication. Further replication studies are warranted to confirm our results. Secondly, the lung cancer-associated mRNA splicing-related genes identified in the previous Chinese study 25 were not replicated in our study, which could be attributed to different genetic background or different environmental exposures as well as other unknown contributing factors between two populations.
PRPF6 is located on chromosome 20q13.33 and is an essential member of the small nuclear ribonucleic proteins (snRNPs), playing an important role in mRNA splicing. For instance, a missense mutation in PRPF6 impairs mRNA splicing and contributes to Autosomal-Dominant Retinitis Pigmentosa 30 . Its role in the development of cancer had also been reported in previous studies. For example, it was over-expressed in lung adenocarcinomas according to The Cancer Genome Atlas (TCGA) projects 31 . In addition, PRPF6 was over-expressed in colon cancer 32 to drive cancer proliferation by preferential splicing of genes associated with growth regulation 33 . Abnormal expression of PRPF6 alters the constitutive and alternative splicing of a discrete number of genes, including an oncogenic isoform of the ZAK kinase 33 that activates several cancer (including lung cancer)-related signaling pathways, such as those of NF-kB, Wnt/b catenin, AP1, ERK and JNK 34,35 . It should be mentioned that the protective alleles of these six SNPs were associated with both a decreased lung cancer risk and a decreased mRNA expression of PRPF6, which is consistent with the previous findings of overexpression of the gene in colon and lung cancers. Therefore, based on all these, we hypothesized of abnormal expression of PRPF6 might contribute to development of lung cancer in the following way: the risk allele of PRPF6 was associated with elevated PRPF6 mRNA expression, then the oncogenic isoforms, such as the ZAK kinase, were generated as a result of abnormal splicing and activation in normal lung tissues, which finally resulted in lung carcinogenesis.
In summary, the present large-scale meta-analysis of eight published lung-cancer GWASs consisting of 14,463 lung cancer cases and 44,188 controls revealed a novel lung cancer susceptibility locus in the mRNA splicing-related gene PRPF6 and provided some new insight into genetic architecture and carcinogenic mechanisms of lung cancer. Further validation and functional evaluation of this genetic variant are warranted to verify our findings.

Materials and Methods
Study populations. In the present study, we used the data from eight lung-cancer GWASs with a total of 14,463 lung cancer cases and 44,188 health controls. As shown in Table 1, the analysis included six GWAS datasets from the TRICL consortium (12,160 lung cancer cases and 16,838 controls of European ancestry) and another two datasets of lung cancer GWASs: the Caucasian origin-Harvard Lung Cancer Study (984 cases and 970 controls) and the Icelandic Lung Cancer Study (deCODE from the ILCCO) (1,319 cases and 26,380 controls). As described previously 16    addition, standard quality control on samples by excluding individuals with low call rate (< 90%) and extremely high or low heterozygosity (P < 1.0 × 10 −4 ), as well as those estimated to be of non-European ancestry (using the HapMap phase II CEU, JPT/CHB and YRI populations as reference) were conducted among all of the studies.
Genes and SNPs selection. The mRNA splicing-related genes were selected with the combination of two online databases commonly exploited in the gene-set enrichment analysis: Molecular Signatures Database (http://www.broadinstitute.org/gsea/index.jsp) and Genecards (http://www.genecards.org/). Overall, 206 mRNA splicing-related genes were selected (Supplementary Table 1), which included a total of 11,966 genotyped or imputed common SNPs extracted from these genes (include 2-kb upstream and downstream of genes) among the TRICL Consortium with the following inclusion criteria: 1) Genotyping call rate ≥ 90%; 2) Minor allele frequency (MAF) ≥ 5% among Europeans; and 3) Hardy Weinberg Equilibrium exact P value ≥ 10 −5 . The detailed work-flow can be found in Supplementary Figure 1.
Statistical analysis. The associations between SNPs and lung cancer risk were evaluated using an additive genetic model by R (v2.6), Stata (v10, State College, Texas, US) and PLINK (v1.06) software 36 . For the studies of ICR, MDACC, IARC, Toronto, NCI and Harvard, top significant principle components were included in the analysis to control for population stratification that might cause inflation of test statistics, while for the deCODE study, genomic control method was used to adjusted for population stratification. No population structure was found in the German Lung Cancer Study (GLC). With the application of the inverse variance method, a meta-analysis under the fixed and random-effects models was performed on the results of a log-additive model for 11,966 SNPs. In order to assess the heterogeneity among studies, the Cochran's Q statistic to test for heterogeneity and the I 2 statistic to quantify the proportion of the total variation due to the heterogeneity were calculated 31 . A fixed-effects model was applied, if no heterogeneity existed among studies (P Q-test > 0.10 and I 2 < 25%); otherwise, a random-effects model was chosen. The Benjamini-Hochberg false discovery rate (FDR) procedure was employed for the correction of multiple testing with a cutoff value ≤ 0.05. Regional association plots were generated with LocusZoom on the basis of 1000 Genomes European (EUR) reference data (phase I integrated release 3, March 2012) 37 . Haploview v4.2 was employed to construct the Manhattan and LD plot. All analyses were conducted with SAS (version 9.1.3; SAS Institute, Cary, NC, USA) unless specified otherwise.