Identification of four differentially methylated genes as prognostic signatures for stage I lung adenocarcinoma

Lung adenocarcinoma (LUAD) is the main subtype of non-small cell lung cancer with a low survival prognosis. We aimed to generate a prognostic model for the postoperative recurrence of LUAD. The methylated DNA data of LUAD patients were downloaded from the Cancer Genome Atlas (TCGA). The differentially methylated genes were identified and protein–protein interacting network was constructed, with which prognostic signature of this cancer was generated. Survival and functional pathways analysis w used to evaluate the clustering ability of the prognostic signature. We identified 151 differentially methylated genes related to relapse-free survival of patients with LUAD. Nine hub genes were identified in PPI network, with which 4 gene pair signature was selected as prognostic signature. The potential functions of 6 genes (JDP2, SERPINA5, PLG, SEMG2, RFX5, and POLR3B) in the 4-gene pair signature were enriched in intracellular protein synthesis and transportation. The four gene pair signature can predict the prognosis of patients with stage I LUAD. Our study provides a reference for patients with postoperative adjuvant therapy.


Background
Lung cancer has two main types known as small-cell lung carcinoma (SCLC) and non-small-cell lung carcinoma (NSCLC) [1,2]. Currently, no treatment can cure this kind of cancer, making lung cancer the leading cause of the death from cancer worldwide. Hereinto, lung adenocarcinoma (LUAD) is the main subtype of non-small cell lung cancer (NSCLC). With an increasingly high incidence, LUAD is always a threat to human beings regardless of gender and smoking condition over the past few decades in many countries [3]. Previous reports showed that patients with LUAD had a shorter survival time among patients with other types of non-small cell lung cancer (NSCLC) [4][5][6].
As a common histological subtype of NSCLC, most of clinical treatments of LUAD are basing on the cancer staging system of American Joint Committee on Cancer (AJCC) [7,8]. Surgical resection is usually used to treat patients with stage ILUAD, but, achieved a poor prognosis. The recurrence rate of postoperative patients is as high as 35-50% [9]. For the large amount of patients suffering for reoccurrence, adjuvant chemotherapy could be an effective approach to significantly improve their survival time [10]. Thus, it is indeed urgent to discover an accurate and reliable clinical method to evaluate and/or predict the prognosis of LUAD, especially at early stage.
The most popular approach in recent years is to screen oncogenes or/and microRNAs as biomarkers for clinical diagnosis and treatment [11]. For diagnosis and treatment, the application of biomarkers can improve the survival of patients following with more personalized treatments [12,13]. In addition to the genetic causes, of cancer, epigenetic changes also contribute to the development of cancer. Hereinto, DNA methylation is a kind of mark in a variety of tumors including lung cancer [14]. In DNA methylation, hypermethylation is recognized not only able to repress transcription of tumor suppressor genes, but also to trigger oncogenesis. It has been studied that tumor-suppressor genes are inactivated by the hypermethylations occurring oncytosine-guanine (CpG) island which locates in the promoter regions [15]. In addition, the enlarging public database for DNA methylation also provide wider and sufficient resources to study the mechanism and to explore methylated gene biomarkers [16,17]. Taken these, identifying the target genes that can be silenced by DNA methylation or screen the cancer related methylations have a great impact on early diagnosis, cancer staging, and prognosis of cancer patients. However, less study is focusing on the relationships of DNA methylation and recurrence of LUAD patients [18].
Here selected the Cancer Genome Atlas (TCGA) database and downloaded methylated DNA data of LUAD patients. Then we identified 151 differentially methylated genes related to relapse-free survival (RFS) of patients with LUAD. Nine hub genes were identified in PPI network, with which 4 gene pair signature was selected as prognostic signature, which successfully clustered LUAD patients at stage I into high-and low-risk groups with significant differences. The potential functions of 6 genes (JDP2, SERPINA5, PLG, SEMG2, RFX5, and POLR3B) in the 4-gene pair signature were enriched in intracellular protein synthesis and transportation. The prognostic signature generated in the present study can predict the prognosis of patients with stage I LUAD, providing reference on the recurrence risk. The application of this signature will as well contribute to improve the overall survival of LUAD patients with postoperative adjuvant therapy. It should be noted that the signatures identified in this study are generally applicable. But for the personalized cases, specific methods would be applied in practice.

Data source
Expressing profiles of gene-specific DNA methylation data and the follow-ups of corresponding patients were downloaded from database The Cancer Genome Atlas (TCGA, https://tcga-data.nci.nih.gov/tcga/) before April 17th, 2017. The clinic information of the patients was listed in detail (

Data preprocessing and differentially methylated sites
To clear the DNA methylation data, CPG sites would be deleted if the CPGs were in absence in more than 50% of the total samples. CPG sites deficient in less than 50% samples would be weighted by using k-Nearest-Neighbor values [19].
Methylated CPG sites differentially expressed in samples with LUAD was screened by using SAM [20]. This process was repeated 1000 times for each disturbance. For the multiple binomial tests, the p-values are adjusted by the Benjamin and Hochberg method to control the false discovery rate (FDR) [21].

Protein and protein interactions (PPIs)
To analyze the biological functions of the selected genes, protein-protein interacting (PPI) network was constructed to screen the hub genes associating with prognosis of LUAD [22]. All the protein networks used in the present study were obtained in database signor [23].

Survival analysis
The correlation between each differentially methylated gene pair and the RFS was evaluated in univariate survival analysis. To estimate the independent prognostic value of this signature basing on several clinical factors including age, gender and smoking, the multivariate survival analysis was performed subsequently. Cox proportional-hazards regression model was applied in all the above survival analysis [24]. At the same time, the significance of the correlation between the ridge expansion osteotomy (REO) of a gene pair and the poor RFS was validated. Survival curves were drown by using the Kaplan-Meier method and were compared in the logrank test [25].

Enrichment analysis
The KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways of selected methylated genes were analyzed by using DAVID (The Database for Annotation, Visualization and Integrated Discovery) [26]. Functional key genes were determined by Spearman's correlation coefficient FDR < 0.01, r > 0.6.

Data source
All gene-specific DNA methylation data were obtained from 578 patient samples from lung adenocarcinoma (LUAD), including 56 normal samples next to cancer tissues. There were 191 survival information containing RFS from 578 patients with stage I LUAD.

Differentially methylated sites
In identification of the differentially methylated CPG sites (1000 times, FDR < 0.01), a total of 5029 CPG sites were identified up-regulated and 3269 ones were downregulated. Then these differentially methylated CPG sites were mapped onto corresponding genes.
One gene would be defined as differentially methylated if one differentially methylated CPG site occurred on its promoter. On the contrary, if there were two methylated sites occurred on the promoter with different directions, this gene would be deleted. Finally, we obtained 3498 hypermethylation of genes and 2465 ultra-low methylated genes. In univariate Cox analysis basing on survival information, genes related to non-recurrence of postoperative patients with stage I LUAD were screened as well (p < 0.01), 61 hypermethylation and 90 ultra-low methylated genes.
Survival-associated gene pairs were generated with these 9 key genes. Five gene pairs were identified significantly associating with survival of LUAD (p < 0.01) in univariate Cox analysis. By using the C-index of gene pairs and forward-stepwise algorithm, a 4-gene pair prognostic signature, consisting of 6 genes (Table 1), were generated for postoperative patients with stage I LUAD (Fig. 2). Postoperative patients with on gene pair of the 4-gene pair signature were defined as poor prognosis of recurrence. In Fig. 2, patients with stage I LUAD were able to be classified into high-and low-risk group with significant RFS.

Survival analysis
We applied Kaplan-Meier method and log-rank test to evaluate the impacts of prognostic signature casting on the RFS of stage I LUAD patients. The relationships of RFS time with age, gender and smoking condition were further analyzed ( Table 2). Figure 3 indicated that the 4 gene-pair signature was able to classify the stage I LUAD patients into two groups with significant RFS (univariate COX p < 0.0057, multivariate COX p < 0.0069). However, in the relationships of RFS and age, gender and smoking condition, no factors showed influences on our prognostic signature.

Enrichment analysis
The potential functions of 6 genes (JDP2, SERPINA5, PLG, SEMG2, RFX5, and POLR3B) in the 4-gene pair signature were further analyzed in enrichment analysis together with all 5963 differentially methylated genes. A total of 560 hypermethylation genes were screened according to the threshold of FDR < 0.01 and r > 0.6. In GO term, all these genes were associated with intracellular protein synthesis and transportation. All the enriched pathways were listed in Table 3.

Discussion
In non-small cell lung cancer (NSCLC), lung adenocarcinoma (LUAD) is a common histological subtype with high mortality but poor outcome in both male and female, smokers and non-smokers [27]. To date, the current therapy of stage I LUAD patients is still surgery. However, almost half of the patients after surgical treatment suffer from recurrence or even death, resulting in a low 5-year survival rate [28,29]. Therefore, an effective  Smoker prognostic signature to evaluate the survival outcomes or to predict the risk of postoperative recurrence is in need.
In the present research, we screened out 560 hypermethylation genes (FDR < 0.01 and r > 0.6). But for further construction of prognostic signature, 151 genes associating with non-recurrence of postoperative patients were used to analyze for hub genes. Finally, 17 hub genes were obtained, indicating 9 key genes (KLK3, GUCY2F, KLK2, SERPINA5, PLG, SEMG2, RFX5, POLR3B, JDP2) relating to non-recurrence of postoperative patients with LUAD. In the selecting process of prognostic methylated genes, threshold of differential significance p < 0.01 was referred in univariate Cox analysis. In the present research, we generated a 4-gene-pair signature consisting of 6 genes with potential functions in LUAD. All these genes were enriched in intracellular protein synthesis and transportation pathways in GO term. However, further functional analysis specifically on every gene was in need to explore the mechanism of this prognostic signature.
Subsequently, the signature was applied to evaluate the impacts of prognostic signature casting on the RFS of stage I LUAD patients and to detect the relationships between RFS and age, gender and smoking condition. It is satisfied that the 4 gene-pair signature was able to classify the stage I LUAD patients into two groups with significant RFS (univariate COX p < 0.0057, multivariate COX p < 0.0069). Meanwhile, the clustering ability of this signature was not influence by the age, gender and smoking condition. All these indicated that this signature was reliable and stable for evaluation of the prognosis of LUAD.
To get further knowledge of the 6 hub genes involved in the prognostic signature, we carried out literature review to explore the functions or association in cancer. Among these 6 genes, several were identified related to cancer.
That is, most of the genes in the present study were firstly identified related to lung cancer.
Gene JDP2 encodes Jun dimerization protein-2 that functions as AP-1 transcription factor. This protein can repress the transactivation mediated by the Jun family of proteins [30]. It has been identified that decreasedexpression of JDP2 was related to lymph node metastasis and distant metastasis. Besides, its depression was as well found casting strongly relationships with the post-surgery survival time. Yuanhong et al. stated that there might be a possible relationship between the expression of JDP2 and metastasis in pancreatic carcinoma, suggesting JDP2 as a prognostic biomarker for patients with pancreatic carcinoma [31]. Besides, research also discovered an association of epithelial-mesenchymal transition (EMT) with Jun dimerization protein 2 (JDP2) in pancreatic cancer [32]. Hereinto, EMT is considered to contribute to the invasion and metastasis of a variety of malignant tumors. Taken all these previous studies which conform to our discovery, we suggest the JDP2 might play an important role in invasion and metastasis of LUAD as well. The signature model containing JDP2 is reasonable when clustering the LUAD patients. Further research focusing on the specific functions of this gene in lung is still in need to complete the mechanism of the recurrence of LUAD at gene level. Besides, the probable cooperating genes of JDP2 also deserve more attention.
Another cancer related gene discovered in our prognostic signature is SERPINA5, also called Protein C Inhibitor(PCI), belongs to the serine protease inhibitor super family as well. This gene is known to prevent metastasis and anti-angiogenesis in tumor cells, including renal, breast, prostate and ovarian cancers [33][34][35][36][37]. However, little research indentified this depression of SERPINA5 in LUAD till now. The association between SERPINA5 and this subtype of lung cancer is an inspiring result, but the expression status (at transcriptional or post-transcriptional level) and biological function of this gene in LUAD are largely unknown.
Semenogelin-2 (SEMG2) was ever found being catalyzed to degrade by the proteolytic activity of the active PSA-enzyme in prostate cancer [38] and that the functions in HER2+ cellular models of breast cancer [39]. As to RFX5 (Regulatory Factor X), being able to bind DNA and lend promoter specificity [40], was identified as a transcriptional activator of the TPP1 gene in hepatocellular carcinoma [41]. Besides, RFX5 was tested to express differentially in breast cancer [42]. POLR3B (Polymerase RNA III beta subunit) is reported to code for the other subunit that forms Pol III's catalytic site, which transcribes small untranslated RNAs [43]. Researchers were searching the mutations in POLR3, so that its functions in several tumors could be revealed at mRNA levels by using RNA-Seq analysis [44]. Thus, the identification of this gene in LUAD might provide a reference for further studies.
According to the discussions above, we supposed that the 4-gene-pair signature could be used to evaluate the prognosis of LUAD patients generally. However, considering the gene diversity of cancer patients and even within a pedigree, personalized insight is also the key point for cancer managements including the prognosis and therapy.

Conclusion
All in all, in the present study, we generated a 4-gene pair signature with 6 methylated genes. The signature was able to classify the stage I LUAD patients into two groups with significant RFS. However, age, gender and smoking condition did not influence the prediction of our prognostic signature. The potential functions of 6 genes were associated with intracellular protein synthesis and transportation. This signature might be used to provide clinical reference for postoperative chemotherapy of patients with stage I LUAD. In addition, our signature may independently predict the prognosis of LUAD without depending on data standardization.