A 16-CpG-based Prognostic Signature for Colorectal Cancer

Background: This study aimed to develop a CpG-based prognostic model for predicting survival risk in colorectal cancer. Methods: Differential methylation analysis was performed on 309 colorectal cancer specimens and 38 paracancerous (tumor-adjacent) specimens from The Cancer Genome Atlas (TCGA). Results: A total of 2,113 hypermethylated sites and 723 hypomethylated sites were identified, from which 16 prognosis-related CpG methylation loci were selected. A risk score was calculated from these loci and used as an independent prognostic variable in a multivariate Cox regression model, which was further refined by combining the independent prognostic factors (stage and risk score). Conclusion: This study identified several potential prognostic biomarkers and established a CpG-based prognostic prediction model for colorectal cancer, which provides a valuable reference for future clinical research.


Introduction
Colorectal cancer, a predominant malignant tumor of the gastrointestinal tract, is one of the leading causes of cancer incidence and death worldwide. Moreover, its incidence and mortality rates have shown a rising trend [1]. Early symptoms are not obvious; they are followed by changes in bowel habits, hematochezia, diarrhea, alternating diarrhea and constipation, and local abdominal pain, whereas late-stage disease presents with systemic symptoms such as anemia and weight loss [2][3][4]. Colorectal cancer mostly occurs in people aged 40-70 years, but it has also been found in young people, suggesting a trend toward younger age at onset. In current statistics, colorectal cancer is more common in men, with a male-to-female incidence ratio of about 2:1 [5]. Among malignant tumors of the digestive system, its morbidity and mortality are second only to those of gastric cancer, esophageal cancer, and primary liver cancer [6]. As with other malignant tumors, its cause is still unclear, and it can occur in any part of the colon or rectum. The disease can spread to other tissues and organs through the lymphatic system, the blood circulation, or direct extension [7]. At present, the diagnosis can be confirmed by clinical manifestations, X-ray barium enema, or fiberoptic colonoscopy. The key to treatment is early detection, timely diagnosis, and surgical cure [8,9].
It is well known that DNA methylation is abnormal in the majority of cancers, including colorectal cancer [10][11][12]. Compared with normal cells of the same tissue type, tumor DNA methylation shows two general changes: demethylation in many regions of the genome, accompanied by de novo methylation of certain specific CpG islands. More notable, however, are the changes that occur across a wide range of CpG islands, which are normally unmethylated in every tissue [13]. Although early observations suggested that this occurred primarily at the promoters of tumor suppressor genes as a result of growth selection, it now appears to be a widely programmed process, possibly based on a commonly used Polycomb targeting mechanism [14]. Vertebrate CpG islands are short, discrete DNA sequences that are rich in GC and CpG and are mainly unmethylated, which differs significantly from the average genomic pattern [15]. CpG islands are also considered sites of transcription initiation, including thousands of sites beyond the currently annotated promoters [16][17][18][19]. A growing number of studies demonstrate that abnormal DNA methylation of CpG islands plays an important role in the occurrence and development of tumors. Its characteristics, such as high stability in biological samples, sensitivity to tumor environmental factors, and ease of detection, make it well suited as a clinically applicable biomarker for colorectal cancer diagnosis and prognosis.
Given the complexity of colorectal cancer prognosis, current prediction models are not mature and need to be updated with additional targets. To address these issues, we constructed a prognostic model for colorectal cancer based on 16 specific prognosis-related CpG methylation loci. To make the prediction model more accurate, we evaluated it at multiple time points. Overall, the model can help predict the overall survival of colorectal cancer patients and may therefore be of considerable help in their clinical management.

Study population
In this study, 347 colorectal cancer (CRC) samples were obtained from TCGA, consisting of 309 tumor samples and 38 paracancerous samples from a total of 296 patients. Of these, 287 CRC patients had complete survival information; their detailed clinicopathological characteristics are provided in Table 1.

Differential Methylation Analysis
Differential methylation analysis between tumor and paracancerous samples was performed with the minfi package in R, after removing CpG sites containing missing values. We then calculated the average methylation level of each site and removed sites with an average below 0.02.
Methylation sites located on autosomes with an FDR below 0.05 and an absolute difference in average β value greater than 0.4 were identified as differentially methylated CpGs.
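The filtering criteria above can be illustrated with a minimal Python sketch. This is not the minfi workflow itself: the function names `benjamini_hochberg` and `select_dmcs`, the input layout, the use of the Benjamini-Hochberg procedure for the FDR, and the order of the filtering steps are all assumptions made for illustration, since the text does not specify them.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def mean(values):
    return sum(values) / len(values)


def select_dmcs(sites, fdr_cut=0.05, delta_cut=0.4, min_mean=0.02):
    """Filter CpG sites with the thresholds quoted in the text.

    sites: list of dicts with hypothetical keys 'id', 'tumor', 'normal'
    (lists of beta values) and 'p' (raw p-value from any two-group test).
    """
    # Drop sites whose average methylation over all samples is below min_mean.
    kept = [s for s in sites if mean(s["tumor"] + s["normal"]) >= min_mean]
    fdr = benjamini_hochberg([s["p"] for s in kept])
    selected = []
    for site, q in zip(kept, fdr):
        delta = mean(site["tumor"]) - mean(site["normal"])
        if q < fdr_cut and abs(delta) > delta_cut:
            direction = "hyper" if delta > 0 else "hypo"
            selected.append((site["id"], direction, delta))
    return selected
```

A site is labeled "hyper" when its mean β value is higher in tumor than in paracancerous tissue and "hypo" otherwise, which mirrors the hyper- and hypomethylation counts reported in the results.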

Construction of Prognostic Model
The 300 samples with complete survival information were randomly divided at a ratio of 2:1 into a training set and a validation set. Univariate Cox regression analysis was used to screen for CpGs significantly related to CRC prognosis, with a threshold of P < 0.05. LASSO Cox regression analysis was then performed on the training set to select the CpG set for the prognostic model of colorectal cancer. A risk score was obtained for every CRC patient according to the model:

$$\text{Risk score} = \sum_{i=1}^{n} \text{Coef}_i \times X_i$$

where Coef_i is the risk coefficient of each CpG calculated by the LASSO-Cox model, X_i is the methylation level of each CpG, and n is the number of CpGs in the model. CRC patients were classified into a high-risk group and a low-risk group according to the cut-off value obtained with the R package "survminer".
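The split and scoring steps can be sketched as follows. This is a hedged illustration in Python rather than the authors' R code: the helper names `split_train_validation`, `risk_score`, and `risk_group`, the fixed random seed, and the convention that a score strictly above the cut-off counts as high risk are all assumptions. The coefficients and cut-off must come from a fitted LASSO-Cox model and a cut-point search, neither of which is reproduced here.

```python
import random


def split_train_validation(sample_ids, ratio=(2, 1), seed=0):
    """Randomly split samples into training and validation sets (2:1 by default)."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * ratio[0] / sum(ratio))
    return ids[:n_train], ids[n_train:]


def risk_score(coefs, beta):
    """Sum of Coef_i * X_i over the CpGs in the model.

    coefs: {cpg_id: coefficient}; beta: {cpg_id: methylation level}.
    """
    return sum(coef * beta[cpg] for cpg, coef in coefs.items())


def risk_group(score, cutoff):
    """Assumed convention: scores strictly above the cut-off are high risk."""
    return "high" if score > cutoff else "low"
```

For example, with two hypothetical CpGs with coefficients 2.0 and -1.0 and methylation levels 0.5 and 0.25, the score is 2.0 × 0.5 - 1.0 × 0.25 = 0.75.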

Establishment and Analysis of Nomogram Prognosis Model
All independent prognostic factors identified by multivariate Cox regression were used to establish a nomogram predicting the probability of 1-year, 3-year, and 5-year overall survival (OS) in CRC patients. A calibration curve of the nomogram was drawn to examine the relationship between the probability predicted by the nomogram and the actual incidence.
We then used time-dependent ROC curves to compare the performance of nomograms that included one or all prognostic factors in predicting CRC OS.

Survival Analysis
The Kaplan-Meier method was used to estimate the overall survival rate of CRC patients. The Wilcoxon signed-rank test was applied to compare prognostic differences among multiple groups. The significance of differences in survival between groups was assessed with the log-rank test, with a threshold of p < 0.05.
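For readers unfamiliar with the estimator, the Kaplan-Meier product-limit calculation can be sketched in a few lines of Python. This is a minimal illustration, not the R routine used in the study; the function name `kaplan_meier` and the input encoding (1 for an observed death, 0 for censoring) are assumptions.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times: follow-up time for each patient.
    events: 1 if death was observed at that time, 0 if censored.
    Returns a list of (time, survival_probability) at each distinct death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t.
        at_risk -= removed
    return curve
```

Censored patients contribute to the number at risk up to their last follow-up time but never produce a step in the curve, which is why the estimate only changes at observed death times.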

Statistical analysis
A multivariate Cox regression model was used to assess the independence of the risk score. The chi-square test was used in Table 1 to analyze the clinical features of CRC patients. Statistical analyses were performed using R software v3.5.2. A P value < 0.05 was considered significant in all of the above analyses.

Prognosis-related CpGs
We obtained a total of 2,836 differentially methylated CpGs (DMCs) in CRC tumor tissues compared with adjacent normal tissues, including 2,113 hypermethylated and 723 hypomethylated sites. Figure S1A illustrates the differential methylation landscape of all CpGs, and Figure S1B shows the methylation levels of the 2,836 DMCs in CRC tumor and adjacent normal tissue samples as a heatmap.

CRC Prognostic Model
We performed univariate Cox regression analysis on 287 CRC patients and identified 53 DMCs significantly associated with OS. The coefficients of these 53 CpGs are illustrated in Figure S2A. A LASSO-Cox model was then fitted and selected 16 optimal prognostic CpGs: cg01129320, cg01992382, cg03904639, cg05317090, cg06084210, cg09170112, cg09492451, cg09918510, cg11712188, cg12740527, cg14675211, cg15265085, cg16834823, cg18237607, cg18624636, and cg19611175. The lambda value of the tuning parameter in the LASSO-Cox model is shown in Figure S2B. The risk score was computed as the sum of the methylation levels of these 16 CpGs, each weighted by its LASSO-Cox coefficient.

Risk score was a good assessment of patient survival prognosis
Patients were divided into high-risk and low-risk groups based on a cut-off of 6.01. The Kaplan-Meier curves showed that the high-risk group had significantly shorter survival than the low-risk group. The AUCs for 1-, 3-, and 5-year OS were 0.895, 0.89, and 0.959 in the training set; 0.625, 0.659, and 0.74 in the first testing set; 0.615, 0.565, and 0.65 in the second testing set; and 0.709, 0.697, and 0.793 in the whole data set, indicating that the risk score could predict patients' 1-, 3-, and 5-year survival (Fig. 1A-1D). Collectively, these results indicate that the prognostic model built on these 16 CpG sites performed well in terms of survival prognosis.

Risk Score Was Independent of Other Prognostic Factors
The 300 samples were grouped according to age, gender, and stage, and their risk scores were calculated. The results revealed a significant difference in risk scores between age groups, whereas no significant difference was observed between genders or stages (Fig. 2A-2C). Next, a multivariate Cox regression model including age, gender, and stage was fitted with the survival package in R, and it showed that the risk score of our prognostic model was strongly independent of these factors (Fig. 2D). In addition to the risk score, stage was also an independent prognostic factor.

Nomogram could better predict the 1-year, 3-year, and 5-year survival of patients
The nomogram constructed from the two independent prognostic factors, stage and risk score, to predict the 1-, 3-, and 5-year OS of CRC patients is shown in Fig. 3A. The calibration chart used to assess the accuracy of the nomogram is displayed in Fig. 3B; it showed that the nomogram might slightly overestimate or underestimate the actual survival probability. Moreover, the 1-year, 3-year, and 5-year AUCs of the combined model were higher than those of either single factor, indicating that a nomogram built from all independent prognostic factors could predict patients' 1-year, 3-year, and 5-year survival better than risk score or stage alone (Fig. 4).

Discussion
The prognosis of colorectal cancer is critical and depends mainly on early diagnosis, timely surgical treatment, and effective prognostic prediction models. Researchers have been seeking breakthroughs in the prevention, early screening, diagnosis, and treatment of colorectal cancer and have made considerable progress in these areas, but prediction of its prognosis is still insufficient [20][21][22]. In this study, we performed differential methylation analysis on COAD samples and paracancerous control samples to reveal differentially methylated sites, and then identified 16 methylation sites associated with prognosis. Furthermore, we established a prognostic prediction model for colorectal cancer based on these specific loci.
It is well known that alterations in DNA methylation occur in cancer, including hypomethylation of oncogenes and hypermethylation of tumor suppressor genes. In this study, we identified 2,836 differentially methylated sites, comprising 2,113 hypermethylated sites and 723 hypomethylated sites, from 309 cancer samples and 38 paracancerous samples (see Figure S1). It is worth noting that the majority of these sites were hypermethylated in colorectal cancer. This raises an interesting point: although the differential sites in tumor tissue are predominantly hypermethylated, certain important sites are hypomethylated. Since DNA methylation is a molecular mechanism related to gene repression, it is believed that methylation in cancer may promote the tumor phenotype by repressing genes that are initially active in the source tissue, especially tumor suppressors such as Rb and p53 [23,24]. On the other hand, the hypomethylated sites are potentially associated with oncogene-directed, methylation-associated gene-repression pathways, with Myc, Ras, and Src as examples [25][26][27]. A study by Irizarry et al. showed that most methylation alterations in colorectal cancer occur in 'CpG island shores', with hypermethylation enriched closer to the associated CpG islands [28]. Moreover, Guo et al. demonstrated that epigenomic changes in DNA CpG methylation are closely associated with the local inflammatory response in colorectal cancer [29]. Here we identified 16 specific CpG methylation sites, which are associated with nine different genes (CCDC48, CSNK1A1L, GDNF, HYDIN, IRX5, LONRF2, NALCN, SLC16A12, and TNXB). Based on our literature search, several of these genes, including CCDC48, LONRF2, and TNXB, have not been well studied in colorectal cancer, which provides a valuable starting point for future research.
In addition to the comprehensive analysis of CpG methylation sites, another feature of this study is the establishment of the risk score and the determination of its optimal cut-off value. According to this cut-off value, patients can be effectively divided into a low-risk group and a high-risk group (see Fig. 1 for details). Several previous studies have developed prognostic prediction models. Most of them were constructed from competing endogenous RNA networks, using the prognostic information and expression of lncRNAs, miRNAs, and mRNAs in colorectal cancer specimens [30][31][32]. In this study, the prognostic prediction model was established from all independent prognostic factors (stage and risk score), with the risk score based on colorectal cancer-specific CpG methylation sites. This provides an alternative model that may help advance prognosis prediction in colorectal cancer.
To conclude, this article studied CpG methylation sites from the TCGA database. Through comparison between colorectal cancer and paracancerous control samples, 16 CpG methylation sites showed specific methylation patterns, which indicates their potential roles in colorectal cancer. With these sites, a prognostic prediction model has been developed. Overall, we shed light on questions and challenges posed by colorectal cancer and established a prediction model that may help future understanding of the disease.

CRC: colorectal cancer
DMCs: differential methylation CpGs

Declarations
Ethics approval and consent to participate: Not applicable.

Consent for publication: Not applicable.
Availability of data and materials: The dataset supporting the conclusions of this article is available in the TCGA (http://tcgaportal.org/).

Competing interests:
The authors declare that they have no competing interests.

Figure 1. Evaluation of the prognostic ability of the risk score. (A-D) The time-dependent ROC curves, the risk scores, the methylation heat map, and the Kaplan-Meier survival curves stratified by the optimal risk score for CRC samples from the TCGA training set, the two testing sets, and the whole TCGA data set, respectively. The horizontal axis of the ROC curve is the false positive rate, and the vertical axis is the true positive rate. The horizontal axis of the risk score scatter plot is the sample data volume, the vertical axis is the risk score value, the color represents ...

Figure 3. Prediction of overall survival in colorectal cancer patients by nomogram. (A) Constructed nomogram model. For each patient, three lines were drawn up to determine the points obtained from each factor. The sum of these points is located on the total points axis, and a line is drawn down to determine the likelihood of overall survival for colorectal cancer patients at 1, 3, and 5 years. (B) Calibration chart for internal verification of the nomogram. The horizontal axis represents the nomogram-predicted probability of overall survival, and the vertical axis represents the actual survival.
