Introduction

Lung cancer is one of the most common human malignancies and the leading cause of cancer-related deaths in both men and women1. For 2016, an estimated 224,390 new lung cancer cases are expected to be diagnosed in the United States2. Lung cancer risk likely results from the joint effects and interactions of environmental and genetic factors.

Single nucleotide polymorphisms (SNPs) are the most common genetic variants and have been shown to be associated with lung cancer risk3. Genome-wide association studies (GWAS) have identified 30 loci in 13 genomic regions associated with lung cancer risk4,5,6,7,8,9,10,11,12,13,14,15. However, most of the SNPs identified to date have not been shown to be functional. Complementary approaches to GWAS, including pathway-based analysis with reduced dimensionality and a lower multiple-testing burden, have emerged to identify possibly functional SNPs associated with lung cancer risk.

The T-cell protein tyrosine phosphatase (TCPTP/PTPN2) is an important member of the protein-tyrosine phosphatase (PTP) family. Activating and deactivating mutations in PTP genes often result in enzymes that can either promote or suppress oncogenesis. The TCPTP pathway consists of signaling events mediated by TCPTP through negative regulation of several receptor tyrosine kinases such as epidermal growth factor receptor (EGFR)16, vascular endothelial growth factor receptor-2 (VEGFR2)17, platelet-derived growth factor receptor beta (PDGFRβ)18, signal transducer and activator of transcription subtypes 1 (STAT1)19, 3 (STAT3)20, and 6 (STAT6)21, and the insulin receptor22.

Studies have shown that mutations and genetic variants of some genes in the TCPTP pathway are associated with lung cancer risk and survival23, 24. However, SNPs in many candidate genes in the pathway have not been studied or reported. In the present study, we systematically investigated all potentially functional SNPs in TCPTP pathway genes by assessing their associations with lung cancer risk using eight published lung cancer GWAS datasets.

Results

Analysis of six GWAS datasets

Overall, 5,162 SNPs from 43 TCPTP pathway genes in the six GWAS datasets from the Transdisciplinary Research in Cancer of the Lung and The International Lung Cancer Consortium (TRICL-ILCCO) were identified, and their associations with lung cancer risk are shown in the Manhattan plot (Fig. 1A). After multiple-testing correction, 112 SNPs in eight genes (ATR, EGFR, MET, PIK3R1, PIK3R3, PTPN2, STAT3, and STAT5A) remained significantly associated with lung cancer risk at FDR <0.20. The association results are summarized in Supplementary Table S2. Based on LD analysis (r2 > 0.30) and online functional prediction using SNPinfo, RegulomeDB, and HaploReg, we selected 11 SNPs for further analysis: rs11707731 in ATR; rs845553, rs1140762 and rs17172432 in EGFR; rs34280975 in MET; rs706714 in PIK3R1; rs7538978 in PIK3R3; rs2847297 and rs2847282 in PTPN2; rs3744483 in STAT3; and rs1135669 in STAT5A (Supplementary Figure S1 and Supplementary Table S3).

Figure 1

Screening of SNPs in the TCPTP pathway. (A) Manhattan plot of genome-wide association results of 5,162 SNPs in 43 TCPTP pathway genes and lung cancer risk in the TRICL-ILCCO Consortium. SNPs are plotted on the X-axis according to their positions on each chromosome. The association P values with lung cancer risk are shown on the Y-axis (as −log10 (P) values). The horizontal red line represents the FDR threshold of 0.20. The horizontal blue line represents a P value of 0.05; (B) SNPs in PTPN2 within 500 kb up- and downstream of the gene region and (C) LD plots of the SNPs in PTPN2 with FDR <0.20. In B, the left-hand y-axis shows the association P value of each SNP, which is plotted as −log10 (P) against chromosomal base pair position; the right-hand y-axis shows the recombination rate estimated from the hg19/1000 Genomes European population.
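The FDR <0.20 selection step can be illustrated with a minimal sketch. It assumes the Benjamini-Hochberg step-up procedure, which the text does not name explicitly; the function name and the example P values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical P values, for illustration only (not from the study).
example_p = [0.001, 0.008, 0.039, 0.041, 0.60]
q = benjamini_hochberg(example_p)
# Keep the tests whose adjusted value falls below the 0.20 threshold.
selected = [i for i, q_value in enumerate(q) if q_value < 0.20]
```

With a threshold as permissive as 0.20, roughly one in five selected SNPs may be a false positive, which is why the selected SNPs are followed up with functional evidence.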

Functional validation by eQTL analysis

We assessed associations between the 11 SNPs and mRNA expression levels using the genotyping and expression data available from lymphoblastoid cell lines derived from 373 individuals of European descent (http://www.1000genomes.org/), and found that only rs2847297 and rs2847282 were associated with expression levels of PTPN2 in additive, dominant and recessive models (Table 1). Regional association plots for rs2847297 and rs2847282, covering 500 kb up- and downstream, are shown in Fig. 1B. The SNP rs2847297 was in low LD with rs2847282 (Fig. 1C). PTPN2 mRNA expression levels decreased significantly with an increasing number of rs2847297 G alleles in the additive (P = 0.002) (Fig. 2A), dominant (P = 0.017) (Fig. 2B) and recessive (P = 0.005) (Fig. 2C) models. The eQTL results for rs2847282 were also significant (Fig. 2D,E,F). In addition, we compared mRNA expression levels of PTPN2 in 109 paired target tissue samples from The Cancer Genome Atlas (TCGA) and found that PTPN2 mRNA expression levels were significantly higher in tumor tissues than in normal tissues (P = 3.01E-05) (Supplementary Figure S2). The two SNPs rs2847297 and rs2847282 were chosen as tagSNPs because they were significantly associated with lung cancer risk in the overall association analysis and had potential functions according to the eQTL analysis.

Table 1 Summary of the functional prediction and eQTL analysis results of the 11 selected SNPs in the TCPTP pathways in silico.
Figure 2

The correlations between identified SNPs and PTPN2 mRNA expression. rs2847297 in PTPN2: (A) additive model, P = 0.002; (B) dominant model, P = 0.017; (C) recessive model, P = 0.005. rs2847282 in PTPN2: (D) additive model, P = 0.0006; (E) dominant model, P = 0.001; (F) recessive model, P = 0.029.

Expanded analysis including two additional GWAS studies

We expanded our analysis by including two additional independent lung cancer GWAS studies, the Harvard Lung Cancer Study and the Icelandic Lung Cancer Study (deCODE). We performed an overall meta-analysis to evaluate associations between the two PTPN2 SNPs and lung cancer risk. The overall effects across all eight GWAS studies remained significant (OR = 0.95, 95% CI = 0.92–0.98, Phet = 0.476, and P = 0.004 for rs2847297; OR = 0.95, 95% CI = 0.92–0.99, Phet = 0.523, and P = 0.009 for rs2847282) (Table 2 and Fig. 3A,B).

Table 2 Summary of the association results of two SNPs in the eight lung cancer GWAS studies.
Figure 3

Forest plots of effect size and direction for tagSNPs from the TRICL-ILCCO consortium. (A) PTPN2 rs2847297: P combined = 0.004 in all individuals; P combined = 0.052 in overall adenocarcinoma individuals; P combined = 0.002 in overall squamous cell carcinoma individuals; P combined = 0.042 in overall ever smoking individuals; P combined = 0.465 in overall never smoking individuals. (B) PTPN2 rs2847282: P combined = 0.009 in all individuals; P combined = 0.114 in overall adenocarcinoma individuals; P combined = 0.016 in overall squamous cell carcinoma individuals; P combined = 0.066 in overall ever smoking individuals; P combined = 0.960 in overall never smoking individuals. Each box and horizontal line represent the OR point estimate and 95% CI derived from the additive model. The area of each box is proportional to the statistical weight of the study. Diamonds represent the ORs obtained from the combined analysis with 95% confidence intervals indicated by their widths. The meta-analysis includes eight GWAS studies [the Institute of Cancer Research (ICR) GWAS, the MD Anderson Cancer Center (MDACC) GWAS, the International Agency for Research on Cancer (IARC) GWAS, the National Cancer Institute (NCI) GWAS, the Lunenfeld-Tanenbaum Research Institute (Toronto) GWAS, German Lung Cancer Study (GLC) GWAS, Harvard Lung Cancer Study (Harvard) GWAS, Icelandic Lung Cancer Study (deCODE) GWAS]. NCI GWAS includes four sub-studies: the Alpha-Tocopherol, Beta-Carotene Cancer Prevention Study (ATBC), the Cancer Prevention Study II Nutrition Cohort (CPS-II), the Environment and Genetics in Lung Cancer Etiology (EAGLE), and the Prostate, Lung, Colon, Ovary Screening Trial (PLCO).

In subgroup analysis by histology (Table 2, Fig. 3), we found that the rs2847297 G allele was borderline associated with lung adenocarcinoma (AD) risk (OR = 0.95, 95% CI = 0.91–1.00, P = 0.052) and significantly associated with squamous cell lung carcinoma (SQ) risk (OR = 0.92, 95% CI = 0.87–0.97, P = 0.002, Fig. 3A). The rs2847282 G allele was also associated with SQ risk (OR = 0.93, 95% CI = 0.88–0.99, P = 0.016), but not significantly with AD risk (OR = 0.96, 95% CI = 0.91–1.01, P = 0.114, Fig. 3B). In subgroup analysis by smoking status, there was a marginally significant decrease in lung cancer risk for the rs2847297 G allele among ever smokers (OR = 0.96, 95% CI = 0.91–1.00, P = 0.042), but not among never smokers (OR = 0.95, 95% CI = 0.83–1.09, P = 0.465, Fig. 3A). However, there was no significant association between the PTPN2 rs2847282 G allele and lung cancer risk among ever smokers (OR = 0.96, 95% CI = 0.91–1.00, P = 0.066) or never smokers (OR = 1.00, 95% CI = 0.86–1.16, P = 0.960, Fig. 3B).
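As a rough consistency check on the reported estimates, an OR and its 95% CI can be converted back to an approximate z statistic and two-sided P value. The sketch below assumes a normal approximation on the log-odds scale and uses the rounded values reported for rs2847297 in SQ, so it can only be expected to agree with the published P value approximately; the function name is hypothetical.

```python
import math


def z_and_p_from_or(or_estimate, ci_low, ci_high, z_crit=1.96):
    """Approximate z and two-sided P from an OR and its 95% CI (normal approximation)."""
    # The CI spans 2 * z_crit standard errors on the log-odds scale.
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = math.log(or_estimate) / se
    # Two-sided tail probability of the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# rs2847297 G allele and SQ risk as reported: OR = 0.92, 95% CI = 0.87-0.97.
z, p = z_and_p_from_or(0.92, 0.87, 0.97)
```

Because the inputs are rounded to two decimals, the recovered P value is of the same order as the reported P = 0.002 rather than identical to it.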

Discussion

In the present study, we sought to investigate associations between genetic variants in the TCPTP pathway genes and lung cancer risk using eight published GWAS studies of 14,463 cases and 44,188 controls. The principal findings were two novel, potentially functional SNPs in PTPN2, rs2847297 and rs2847282, that were both associated with a decreased lung cancer risk and a decreased mRNA expression level of PTPN2, particularly in the subgroups of ever smokers and squamous cell lung carcinoma. Four articles from our laboratory on pathway-based analysis and lung cancer risk (centrosome, DNA repair, lncRNA and RNA degradation) have been accepted or published. The loci of the two PTPN2 SNPs differ from those reported in previous studies from our laboratory and in published GWAS.

PTPN2 plays a dual role in the development and progression of cancer. Proliferation and cell cycle assays demonstrated that overexpression of PTPN2 decreased serum requirement, increased formation of larger colonies in soft agar, altered cell morphology, and accelerated progression through the G1 and S phases and the rate of cell division25, 26. Another study showed that the proliferation rate was reduced in TCPTP (−/−) lymphocytes compared with TCPTP (+/+) lymphocytes27. We found that PTPN2 mRNA expression levels in lung cancer tissues from the TCGA database were increased compared with matched adjacent normal tissues, and other studies have also demonstrated that PTPN2 expression levels were higher in lung AD28, 29 and SQ30, 31 than in normal lung tissues. These findings provide evidence for an oncogenic role of PTPN2 and are consistent with our results that the two susceptibility loci of PTPN2 were associated with a decreased lung cancer risk, possibly as a result of decreased mRNA expression of the gene. In addition, we found that the eQTL result for rs2847297 in lung tissue was also significant in the GTEx analysis (P = 4.0E-07) (http://www.gtexportal.org/home/eqtls/bySnp?snpId=rs2847297&tissueName=All). This result is also consistent with the eQTL analysis from the lymphoblastoid cell lines in the present study. However, it has been reported that overexpression of PTPN2 induces apoptosis in p53+ A549 and MCF-7 cells but not in p53- HeLa cells, which is consistent with features of a tumor suppressor32. Another study demonstrated that PTPN2 was absent in a large proportion of “triple-negative” primary human breast cancers and that PTPN2 overexpression suppressed tumor growth33.

In subgroup analysis, we found that the two SNPs were more likely to be associated with SQ risk, and that the risk associated with the rs2847297 G allele was more evident among ever smokers. Cigarette smoke is the major risk factor for lung cancer, especially for SQ. One study showed that smoking led to increased expression of Nkx234, which is a transcription factor (TF) of PTPN2. Therefore, the locus may influence lung cancer risk in ever smokers by changing the expression of PTPN2.

Our study has some limitations. First, genes in the TCPTP pathway were identified mainly from the Molecular Signatures Database and Genecards. Although we searched relevant articles to complete the list of genes in the pathway, some newly discovered genes in the pathway might have been missed. Second, although we demonstrated the association of the two novel, potentially functional loci in PTPN2 with lung cancer risk, with functional evidence from eQTL analyses, the exact biochemical and molecular mechanisms are still unclear. Third, our eQTL analyses were limited to publicly available data from lymphoblastoid cell lines rather than target tissues, which could provide more direct evidence of correlation between the two SNPs and PTPN2 expression.

Taken together, the present study revealed two novel, potentially functional susceptibility loci in PTPN2 associated with lung cancer risk in European populations, particularly among ever smokers and for squamous cell carcinoma. Further validation and functional evaluation of these genetic variants are warranted to verify our findings.

Materials and Methods

Study populations

The present study first used genotyping data from the TRICL-ILCCO consortium, which included 12,160 lung cancer cases and 16,838 controls (all Europeans) from six previously published GWAS studies: The University of Texas MD Anderson Cancer Center (MDACC), Institute of Cancer Research (ICR), National Cancer Institute (NCI), International Agency for Research on Cancer (IARC), the Samuel Lunenfeld Research Institute study (Toronto), and German Lung Cancer Study (GLC). The expanded analysis included two additional GWAS studies of European ancestry from the ILCCO: the Harvard Lung Cancer Study (984 cases and 970 controls)35 and the Icelandic Lung Cancer Study (deCODE) (1,319 cases and 26,380 controls)36. Details of the study populations are presented in the supplementary file. Written informed consent was obtained from participants by all participating GWAS studies. All methods were performed in accordance with the relevant guidelines and regulations for each of the participating institutions, and the present study followed the study protocols approved by the Duke University Health System Institutional Review Board.

Selection of Genes and SNPs from TCPTP pathway

Genotyping in these GWAS studies was performed using one of the Illumina HumanHap 317, 317 + 240S, 370Duo, 550, 610 or 1M arrays. IMPUTE2 v2.1.1 or MaCH v1.0 software was used for imputation. Genes in the TCPTP pathway were identified from the Molecular Signatures Database (http://www.broadinstitute.org/gsea/index.jsp)37 and Genecards (http://www.genecards.org/). Overall, 43 genes located on autosomal chromosomes were selected (detailed in Supplementary Table S1). The final meta-analysis contained 5,162 SNPs that met the following inclusion criteria: genotyping rate >95%, minor allele frequency (MAF) ≥ 5%, and Hardy-Weinberg Equilibrium (HWE) exact P value ≥ 1E-05. The detailed workflow is shown in Fig. 4.

Figure 4

Flowchart of SNP selection among the TCPTP pathway genes.
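The three inclusion criteria for SNPs (genotyping rate >95%, MAF ≥ 5%, HWE exact P ≥ 1E-05) can be sketched as a simple filter. This is an illustrative sketch only: the function name, the record layout and the genotype counts are hypothetical, and the HWE exact P value is assumed to have been computed beforehand by a tool such as PLINK.

```python
def passes_qc(n_aa, n_ab, n_bb, n_missing, hwe_p,
              min_call_rate=0.95, min_maf=0.05, min_hwe_p=1e-5):
    """Apply the genotyping-rate, MAF and HWE filters to one SNP."""
    called = n_aa + n_ab + n_bb
    if called == 0:
        return False
    # Genotyping (call) rate: fraction of samples with a genotype call.
    call_rate = called / (called + n_missing)
    # Frequency of allele B, then fold to the minor allele frequency.
    freq_b = (2 * n_bb + n_ab) / (2 * called)
    maf = min(freq_b, 1.0 - freq_b)
    return call_rate > min_call_rate and maf >= min_maf and hwe_p >= min_hwe_p


# Hypothetical SNP records, for illustration only (not from the study).
hypothetical_snps = {
    "snp_common": dict(n_aa=600, n_ab=340, n_bb=60, n_missing=0, hwe_p=0.45),
    "snp_rare": dict(n_aa=950, n_ab=48, n_bb=2, n_missing=0, hwe_p=0.80),
    "snp_low_call": dict(n_aa=500, n_ab=300, n_bb=100, n_missing=100, hwe_p=0.30),
    "snp_hwe_fail": dict(n_aa=700, n_ab=100, n_bb=200, n_missing=0, hwe_p=1e-9),
}
kept = [name for name, record in hypothetical_snps.items() if passes_qc(**record)]
```

Only the first hypothetical SNP passes all three filters; each of the others fails exactly one criterion.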

In silico functional prediction and validation

We used three in silico tools, SNPinfo (http://snpinfo.niehs.nih.gov/snpinfo/snpfunc.htm)38, RegulomeDB (http://regulomedb.org/)39, and HaploReg (http://www.broadinstitute.org/mammals/haploreg/haploreg.php)40, to predict potential functions. The expression quantitative trait loci (eQTL) analysis was performed using data from the 1000 Genomes Project41. mRNA expression in lung cancer tissue samples was assessed using data from TCGA42.
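The additive, dominant and recessive eQTL models reported in the Results differ only in how the genotype is coded before it is related to expression. The sketch below assumes a simple linear regression of expression on the coded genotype, which the text does not state explicitly; the function names and the data are hypothetical, and the significance test on the slope is omitted.

```python
def code_genotype(n_effect_alleles, model):
    """Map a count of effect alleles (0, 1 or 2) to a model-specific predictor."""
    if model == "additive":
        return n_effect_alleles                      # 0, 1, 2
    if model == "dominant":
        return 1 if n_effect_alleles >= 1 else 0     # carriers vs non-carriers
    if model == "recessive":
        return 1 if n_effect_alleles == 2 else 0     # homozygotes vs the rest
    raise ValueError("unknown model: " + model)


def ols_slope(x, y):
    """Least-squares slope of y on x (change in expression per unit of x)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical effect-allele counts and expression values (not from the study).
allele_counts = [0, 0, 1, 1, 2, 2]
expression = [10.0, 10.2, 9.6, 9.8, 9.0, 9.2]

slopes = {
    model: ols_slope([code_genotype(g, model) for g in allele_counts], expression)
    for model in ("additive", "dominant", "recessive")
}
```

A negative slope in all three codings would correspond to the pattern described for the rs2847297 G allele, where expression falls as the number of G alleles rises.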

Statistical analysis

Odds ratios (ORs) and their 95% confidence intervals (CIs) were calculated using Stata (v10, College Station, Texas, USA) and PLINK (v1.06) software. A meta-analysis with the inverse variance method was performed on the 5,162 SNPs. We used Cochran’s Q statistic to test for heterogeneity and the I2 statistic to quantify the proportion of total variation due to heterogeneity43. The fixed-effects model was used when there was no heterogeneity among GWAS studies (Q-test P > 0.100 and I2 < 25%); otherwise, the random-effects model was used. The false discovery rate (FDR) method44 was applied to control for multiple testing, with a threshold of <0.20. mRNA expression levels in lung cancer and adjacent normal tissues from the TCGA database were compared using a paired t-test. Regional association plots were generated with LocusZoom45. Haploview v4.2 was used to generate the Manhattan plot and LD plots46. All other analyses were conducted with SAS (Version 9.3; SAS Institute, Cary, NC, USA).
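The meta-analysis rule described above (inverse-variance pooling, Cochran's Q, I2, and the switch between fixed- and random-effects models) can be sketched as follows. This is a minimal illustration, not the Stata or PLINK implementation used in the study: the DerSimonian-Laird estimator for the random-effects model is an assumption, all function names are hypothetical, and the input log-ORs and standard errors are made up.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: normal tail plus a finite series (closed form).
    term, total = math.sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2.0 / math.pi) * math.exp(-half) * total


def meta_analysis(log_ors, ses, q_p_cut=0.10, i2_cut=25.0):
    """Pool per-study log-ORs; fixed effects unless heterogeneity is detected."""
    w = [1.0 / se ** 2 for se in ses]                       # inverse-variance weights
    beta_fixed = sum(wi * b for wi, b in zip(w, log_ors)) / sum(w)
    q = sum(wi * (b - beta_fixed) ** 2 for wi, b in zip(w, log_ors))
    df = len(log_ors) - 1
    p_het = chi2_sf(q, df)
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    if p_het > q_p_cut and i2 < i2_cut:
        model, beta, se = "fixed", beta_fixed, math.sqrt(1.0 / sum(w))
    else:
        # DerSimonian-Laird between-study variance (assumed estimator).
        c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
        tau2 = max(0.0, (q - df) / c)
        w_re = [1.0 / (s ** 2 + tau2) for s in ses]
        beta = sum(wi * b for wi, b in zip(w_re, log_ors)) / sum(w_re)
        model, se = "random", math.sqrt(1.0 / sum(w_re))
    p = math.erfc(abs(beta / se) / math.sqrt(2.0))          # two-sided P value
    return {"model": model, "or": math.exp(beta), "se": se,
            "ci": (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)),
            "p": p, "p_het": p_het, "i2": i2}


# Hypothetical inputs, for illustration only (not from the study).
homogeneous = meta_analysis([-0.05, -0.05, -0.05], [0.03, 0.03, 0.03])
heterogeneous = meta_analysis([-0.20, 0.10], [0.05, 0.05])
```

In the first hypothetical call the studies agree, so the fixed-effects estimate is returned; in the second, Q and I2 exceed the stated cut-offs and the sketch falls back to the random-effects estimate with a wider standard error.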