Epigenetic signature predicts overall survival clear cell renal cell carcinoma

Recently, increasing study have found that DNA methylation plays an important role in tumor, including clear cell renal cell carcinoma (ccRCC). We used the DNA methylation dataset of The Cancer Genome Atlas (TCGA) database to construct a 31-CpG-based signature which could accurately predict the overall survival of ccRCC. Meanwhile, we constructed a nomogram to predict the prognosis of patients with ccRCC. Through LASSO Cox regression analysis, we obtained the 31-CpG-based epigenetic signature which were significantly related to the prognosis of ccRCC. According to the epigenetic signature, patients were divided into two groups with high and low risk, and the predictive value of the epigenetic signature was verified by other two sets. In the training set, hazard ratio (HR) = 13.0, 95% confidence interval (CI) 8.0–21.2, P < 0.0001; testing set: HR = 4.1, CI 2.2–7.7, P < 0.0001; entire set: HR = 7.2, CI 4.9–10.6, P < 0.0001, Moreover, combined with clinical indicators, the prediction of 5-year survival of ccRCC reached an AUC of 0.871. Our study constructed a 31-CpG-based epigenetic signature that could accurately predicted overall survival of ccRCC and staging progression of ccRCC. At the same time, we constructed a nomogram, which may facilitate the prediction of prognosis for patients with ccRCC.


Background
Renal cell carcinoma (RCC) is a cancer that originates in the renal epithelial cells and accounts for more than 90% of renal cancer, of which clear cell RCC (ccRCC) is most common subtype and causes the most deaths [1]. According to statistics, ccRCC caused 14,400 deaths in 2017 [2]. The TNM staging system is still the most commonly used clinical tool to stratify ccRCC patients, however, it is not enough to reflect the biological heterogeneity of tumor and accurately predict the prognosis of ccRCC patients [3], and a better marker is urgently needed to help us to accurately predict.
The aberrant methylation status of CpG islands located in the promoter region of tumor genes is becoming increasingly important in the search for new potential biomarkers for cancer [4][5][6], because these aberrant are relatively stable and potentially reversible [7]. Combining multiple markers, rather than using only one marker to construct a prognostic model, will make the results more stable and reliable, and improve its predictive value [8]. Many studies have used multiple targets to construct the prognostic model of ccRCC [9,10]. Here, we used the methylation data of the ccRCC of TCGA database to construct a 31-CpG-based signature.
In this study, we used the methylation data of the ccRCC of TCGA database to construct a 31-CpG-based epigenetic signature, which can accurately predict the overall survival rate of ccRCC through the most recently used The Least Absolute Shrinkage And Selection Operator method (LASSO) algorithm. We then verified this epigenetic marker using two validation sets (testing set and entire set). Furthermore, we constructed a nomogram to facilitate clinicians to accurately predict overall survival in ccRCC patients. Our results identified a new 31-CpG-based epigenetic marker that may be a new target for predicting overall survival of ccRCC.

Study population and data collection
A total of 319 cases of ccRCC were randomly assigned to two groups, which had a 2:1 ratio. The former was defined as the training set (n = 213) and the latter as testing set (n = 106). The corresponding clinical follow-up information of all cases included the survival time, survival outcome, tumor staging, grading and other information of the cases. The level three of RNA-seq (Illumina RNASeqV2) and the Infinium HumanMethylation450 BeadChip array (Illumina) data were downloaded from The Cancer Genome Atlas (TCGA) database (http:// cance rgeno me.nih.gov/), The details were listed in Table 1, and the flow chart of our entire experiment was shown in Fig. 1.

Differentially methylated positions screening
We first filtered out the probes that have too many zeros, then used the R package "ChAMP" [11] to quality control, remove the batch effects, BMIQ normalization [12] and Single Nucleotide Polymorphism (SNP) filtering (if it located in the CpG dinucleotide) [13]. After that, we used "champ.DMP" function to perform differentially methylated positions (DMP) analysis, we chosen the false discovery rate (FDR) < 0.05 and | Δβ | ≥ 0.2 as the threshold for screening DMPs.

Establishment and validation of epigenetic signature
LASSO cox regression analysis was conducted for the DMPs obtained in the previous step, based on R packet "glmnet". The degree of LASSO regression complexity adjustment was controlled by the parameter λ, and the larger λ was, the greater the penalty was, so as to obtain a model with fewer variables [14]. We construct the model according to the protocol of official website (https :// cran.r-proje ct.org/web/packa ges/glmne t/glmne t.pdf ). Frist, the function "glmnet" returned a sequence of models for us to choose from. Second, we used the function "cv.glmnet" to perform Cross-validation, we followed the protocol, did 100 times cross-validations, and finally got an λ average. Based on this λ value, we constructed an epigenetic signature based on 31 CpGs from training  set. The accuracy of the epigenetic signature was then verified in the testing set and the entire set by using the time-dependent ROC curve and Kaplan-Meier survival curve analysis. Time-dependent ROC curve analysis was performed by R package "survivalROC" [15] while the Kaplan-Meier survival curve analysis was performed by R package "survival".

Cox regression analysis of the epigenetic signature
To investigate whether the epigenetic signature we constructed could be independent of other factors affecting the prognosis of ccRCC, we conducted univariate and multivariate analyses by R package "survival". First of all, the epigenetic signature score, age, gender, neoadjuvant treatment, lymph node count, histologic stage and pathologic grade were performed the univariate analysis. Then, we put the univariate analysis has significant difference factor to carry on the multivariate analysis. Finally, we used the forest map to visualize our results by R package "forestplot".

The construction of the nomogram
Based on the results of the previous step, we used the factors with significant differences (epigenetic signature score and histologic stage) in multivariate regression analysis to construct a nomogram by using the R package "rms". In order to verify the accuracy of the nomogram, we used the calibration curve to evaluate the nomogram.
In the calibration curve, if the observed value and the actual value are more coincident, it indicates that the prediction accuracy of the nomogram is higher.

Functional annotation of the epigenetic signature
We calculated the top 500 genes with the highest correlation with epigenetic signature score, and then enriched the functions of these 500 genes, thus indirectly predicting the functional annotation of epigenetic marker score.
The 500 genes were uploaded to the DAVID website (https ://david .ncifc rf.gov/) for GO analysis and KEGG pathway analysis. The following, the STRING database (https ://strin g-db.org/) was used to performed protein-protein interaction (PPI) network [16] analysis. We've picked the top 50 genes from the "cytoHubba" app at Cytoscape and used the Cytoscape to visualize the results. According to the scores of the epigenetic signature, patients were divided into two groups with high and low risk, and then the corresponding RNA-seq data of these two groups were analyzed by single-samples gene-set enrichment analysis. The R package "GSVA" [17] were used to perform ssGSEA for the 31-CpGs-based epigenetic signature. The C2 (c2.cp.kegg.v6.1.symbols. gmt) set was downloaded from the Molecular Signatures Database (http://softw are.broad insti tute.org/gsea/msigd b/index .jsp). And it was chosen as the signature to perform ssGSEA. The R package "limma" was used to identify the differentially expressed gene sets. The threshold for screening differentially expressed gene sets were the p-value < 0.05.

Screening of DMPs
The R package "ChAMP" was used to screen DMPs between ccRCC and normal ccRCC samples in ccRCC dataset of the TCGA database, a total of 3858 DMPs were identified (1641 up-regulated and 576 down-regulated), under the threshold of FDR < 0.05 and | Δβ | ≥ 0.2. All DMPs were listed in Additional file 1: Table S1.

Establishment and validation of the epigenetic signature
We used the R package of "glmnet" to construct LASSO cox regression model. After 100 cross -validation, we got an optimal λ value (λ = 0.054), and finally we constructed an epigenetic signature based on 31 CpGs. The formula of epigenetic signature score were obtained by calculating the LASSO regression coefficient: epigenetic signature score = n k=1 β ki X ki , here, n is the number of the CpG site; β is the LASSO regression coefficient of the CpG site; X is the methylation value of CpG k and patient i; k is the CpG site. Details of the 31 CpGs were listed in Table 2, the heatmap of the 31 CpGs were shown in Fig. 2a, and the results of ten-time cross-validation were shown in Fig. 2b.
We divided the patients in the training set into high and low risk groups based on the median epigenetic signature scores (here, the median epigenetic signature score was 2.96, Fig. 3a, left panel). Then, time-dependent ROC curve analysis revealed the epigenetic signature we constructed were able to predict the overall survival rate of patients with ccRCC with great accuracy. The AUC of time-dependent ROC for 1-year overall survival was 0.875, 3-year 0.876, 5-year 0.851 and 7-year 0.855 (Fig. 3b middle panel). Finally, Kaplan-Meier survival curve analysis showed that the high-risk group had a significantly worse prognosis than the low-risk group (HR = 13.0, 95% CI 8.0-21.2, P < 0.0001, Fig. 3a right  panel). In order to avoid over-fitting effect, testing set (HR = 4.1, CI 2.2-7.7, P < 0.0001, Fig. 3b) and entire set (HR = 7.2, CI 4.9-10.6, P < 0.0001, Fig. 3c) were used for verification our results.

Nomogram creating and calibrating
In order to more conveniently predict the overall survival rate of ccRCC patients for clinicians, we constructed a nomogram based on the R package "rms". This nomogram could accurately predict 3-or 5-year overall survival in patients with ccRCC (Fig. 5a). Moreover, we used the calibration curve to calibrate the nomogram and found that the prediction value of nomogram was significantly correlated with the actual value (Fig. 5b, c).

Subgroup analysis of the epigenetic signature
To explore the predictive value of epigenetic signature in different subgroups, we performed subgroup analyzed  Fig. 6a, b), histologic grade (III/IV vs I/II, Fig. 6c, d), lymph node examination (Positive vs negative, Fig. 6e, f ) and age (>=65 vs < 65, Fig. 6g, h). The detailed information of subgroup analysis for the epigenetic signature were listed in Table 4.

Functional annotation of the epigenetic signature
Pearson correlation coefficients of all genes and the epigenetic signature scores were calculated. The 500 genes with the highest correlation were selected for subsequent analysis, and the threshold value was set as p-value less than 0.05. Functional enrichment of the 500 genes showed that this epigenetic signature was mainly enriched in the cell adhesion, notch signaling pathway, axon guidance, thyroid hormone signaling pathway and so on (Additional file 2: Figure S1). The PPI network diagram we built was shown in Additional file 3: Figure S2. After the enrichment scores of each sample were calculated, the differences of the high and low groups were analyzed according to the epigenetic signature scores. A total of 26 differentially expressed gene sets (10 up-regulated and 16 down-regulated) were screened under the threshold FDR < 0.05 and |LogFC| > 0.015. The result of ssGSEA revealed that patients in high risk group mainly enriched in p53, nod like receptor, cytosolic DNA sensing and other signaling pathways. The patients in low risk group were mainly enriched in the fatty acid metabolism pathway, PPAR signaling pathway, renin angiotensin system and so on (Fig. 7c, Additional file 1: Table S2).

Verification of CpG island methylation levels in ccRCC tissues and renal carcinoma cell lines
We collected 4 samples diagnosed as ccRCC and 2 paracancer tissue from the Zhongnan Hospital of Wuhan University, plus human renal tubular epithelial cell line (HK2) and 2 kidney cancer cell lines (ACHN and 769-P), and sequenced them with targeted methylation (Methyl-Target, Illumina Hiseq, Illumina, CA, USA). The results showed that CpG island involved in the construction of prognostic model had different expression patterns between carcinoma and paracancer (Fig. 7d).

Discussion
Patients with ccRCC exhibit a highly invasive clinical process, poor prognosis, and high recurrence rate [1]. The TNM staging system is still widely used to predict the prognosis of ccRCC, however, it relies mainly on anatomical information and has no biological characteristics [9]. Therefore, we need a new target to accurately predict the prognosis of ccRCC.
Overfitting often occurs in selecting prognostic models using higher-dimensional data [18], here, we chosen the LASSO algorithm, which could eliminate this disadvantage [19], as the method to select CpG for constructing epigenetic signature. The algorithm of LASSO has been widely used to construct predictive survival models, for instance, colon cancer [20], gastric cancer [21], T-cell lymphoblastic lymphoma [22] and ccRCC [9,10,23].
In recent years, epigenetics has been studied more and more in ccRCC, DNA methylation is one of the most studied patterns in epigenetic regulation [24][25][26]. And there have also been a number of studies using a Heatmap of the 31-CpG-sites that were used to construct the epigenetic signature. b Ten-time cross-validation for parameter chosen. Here, λ = 0.054 was chosen by ten-time cross-validation methylation data to construct prognostic signatures [5,6,27]. Here, we constructed and validated a 31-CpGbased epigenetic signature that could accurately predicted overall survival of ccRCC. Using the epigenetic signature we constructed, we divided patients into high and low risk groups, and high-risk patients had significantly lower overall survival rates in each set. Although previous studies have constructed several molecular   biomarkers to predict the survival of ccRCC, few of the signature they have constructed are as accurate as our epigenetic signature. The AUC of time-dependent ROC was all greater than 0.75 in each set. After the epigenetic signature combined with staging and grading, the AUC predicted 5-year overall survival was reached 0.871. At the same time, we validated our epigenetic signature in different subgroups (pathologic stage, histologic grade, lymph node examination, and age) and found that the epigenetic signature showed excellent prognostic value. We use ssGSEA and other methods to infer the functional annotation of epigenetic signature, and we found that the high risk group patients were mainly enriched in p53 and nod like receptor signaling pathway, which have been reported in previous studies to be associated with the pathogenesis of ccRCC [28][29][30]. The group of low risk were enriched in fatty acid metabolism and which has also been reported to be associated with ccRCC [31][32][33]. We speculate that these pathways may be responsible for the very different overall survival rates between the two groups based on the epigenetic signature. So what we're going to do is to look at how do these pathways affect the prognosis of ccRCC.
We performed methylation sequencing on ccRCC tissue samples and kidney cancer cell lines, and the results showed that CpG islands involved in the construction of the prognosis model had significantly different expression patterns. However, our study has some limitations, we lack more data to verify it, but we have built a new biological sample database called Biological Repositories Zhongnan Hospital of Wuhan University (http://bioba nk.znhos pital .cn), which will provide a lot of data to verify our epigenetic markers in the future.
In conclusion, we provide an effective epigenetic signature that accurately predicts overall survival of ccRCC independent of other risk factors (age, gender, neoadjuvant treatment, lymph node examination, histologic grade and pathologic stage). At the same time, we also constructed a nomogram for the convenience of clinicians to predict the prognosis of ccRCC.

Conclusion
We developed a 31-CpG-based signature using bioinformatics methods (LASSO cox regression analysis). We found that this epigenetic signature could accurately predict the overall survival rate of ccRCC, contributing to