Establishment of a 12-gene expression signature to predict colon cancer prognosis

A robust and accurate gene expression signature is essential to assist oncologists to determine which subset of patients at similar Tumor-Lymph Node-Metastasis (TNM) stage has high recurrence risk and could benefit from adjuvant therapies. Here we applied a two-step supervised machine-learning method and established a 12-gene expression signature to precisely predict colon adenocarcinoma (COAD) prognosis by using COAD RNA-seq transcriptome data from The Cancer Genome Atlas (TCGA). The predictive performance of the 12-gene signature was validated with two independent gene expression microarray datasets: GSE39582 includes 566 COAD cases for the development of six molecular subtypes with distinct clinical, molecular and survival characteristics; GSE17538 is a dataset containing 232 colon cancer patients for the generation of a metastasis gene expression profile to predict recurrence and death in COAD patients. The signature could effectively separate the poor prognosis patients from good prognosis group (disease specific survival (DSS): Kaplan Meier (KM) Log Rank p = 0.0034; overall survival (OS): KM Log Rank p = 0.0336) in GSE17538. For patients with proficient mismatch repair system (pMMR) in GSE39582, the signature could also effectively distinguish high risk group from low risk group (OS: KM Log Rank p = 0.005; Relapse free survival (RFS): KM Log Rank p = 0.022). Interestingly, advanced stage patients were significantly enriched in high 12-gene score group (Fisher’s exact test p = 0.0003). After stage stratification, the signature could still distinguish poor prognosis patients in GSE17538 from good prognosis within stage II (Log Rank p = 0.01) and stage II & III (Log Rank p = 0.017) in the outcome of DFS. Within stage III or II/III pMMR patients treated with Adjuvant Chemotherapies (ACT) and patients with higher 12-gene score showed poorer prognosis (III, OS: KM Log Rank p = 0.046; III & II, OS: KM Log Rank p = 0.041). Among stage II/III pMMR patients with lower 12-gene scores in GSE39582, the subgroup receiving ACT showed significantly longer OS time compared with those who received no ACT (Log Rank p = 0.021), while there is no obvious difference between counterparts among patients with higher 12-gene scores (Log Rank p = 0.12). Besides COAD, our 12-gene signature is multifunctional in several other cancer types including kidney cancer, lung cancer, uveal and skin melanoma, brain cancer, and pancreatic cancer. Functional classification showed that seven of the twelve genes are involved in immune system function and regulation, so our 12-gene signature could potentially be used to guide decisions about adjuvant therapy for patients with stage II/III and pMMR COAD.


INTRODUCTION
Colorectal cancer (CRC) is one of the most common cancers in men and women, representing almost 10% of the global cancer incidents and the third leading cause of cancer death worldwide (McGuire, 2016). CRC comprises three different subtypes according to distinct pathway operate: chromosomal-instable, microsatellite-instable, and CpG island methylator phenotype, all of which differ in morphology, genetic background, molecular profile, clinical behavior, and response to therapy (De Sousa et al., 2013). Current prognostic model based on the classic tumor-node-metastasis (TNM) staging is the standard prognosis factor for CRC in clinical practice. However, due to the high heterogeneity of disease, the patients at similar stage behave differently in terms of recurrence and response to chemotherapy often differs. Better parameters to guide patients' prognostic stratification and personalized medicine are urgently needed. Currently, some prognostic and predictive molecular markers have been developed. Microsatellite instability (MSI) is the molecular hallmark of DNA mismatch repair (MMR) deficiency. In stage II of the disease, MSI status helps select patients with high risk of developing recurrence (Brychtova et al., 2017). MSI status can also be a predictor of the benefit of adjuvant chemotherapy with fluorouracil in stage II and stage III colon cancer (Ribic et al., 2003). KRAS mutation status has been validated as a molecular marker for prediction of nonresponse to EGFR targeted drugs in metastatic CRC (Cunningham et al., 2010;Karapetis et al., 2008;Siena et al., 2009). However, due to complex pathways contributing to cancer progression, single molecular marker might not be efficient enough to predict prognosis and individualize in selecting adjuvant therapy.
The development of gene expression profiling technologies such as microarray and Next Generation Sequencing (NGS) provide further opportunities to comprehensively characterize the molecular features of cancer. Gene-expression profiling has been used to develop genomic tests that may provide better predictions of clinical outcomes in combination with traditional clinicopathologic factors (Gray et al., 2007;Venook et al., 2011;Meropol et al., 2011;Ebata, Hirata & Kawauchi, 2016;Guinney et al., 2015;Marisa et al., 2013;Smith et al., 2010;Gentles et al., 2015). Some commercially genomic assays are available for the prediction of clinical outcome in CRC patients. The most well-known one is the Oncotype DX Colon Cancer Assay, which is a 12-gene (seven cancer related genes and five reference genes) genomic test that has been used to help identify individuals with high recurrence risk from stage II colon cancer patients with T3 and MMR proficient tumors (Gray et al., 2007;Venook et al., 2011;Meropol et al., 2011). However, the five reference genes in Oncotype DX Assay contain PGK1 and GPX1, which are important players in the process of energy metabolism and cellular oxidative stress, both of which are actively involved in cancer development and metastasis (Ebata, Hirata & Kawauchi, 2016;Moloney & Cotter, 2017). Normalization with PGK1 and GPX1 might have diluted the tumorous heterogeneities among cancer patients. In this work, we applied two steps of supervised machine-learning method and established a 12-gene expression signature to precisely predict colon adenocarcinoma (COAD) prognosis by exhaustively using expression of all genes of The Cancer Genome Atlas (TCGA) COAD patients.

TCGA and GEO datasets
RNA-seq data and clinic information for all cancer types were obtained from TCGA RNA-seq database (https://cancergenome.nih.gov/). Microarray expression data and clinic information for COAD patients were retrieved from Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/).

Development of the gene expression signature
The development process has a training and validation phase.

Phase I Grouping
The TCGA COAD patients were used for the development of prototype of the 118-gene signature that could predict COAD prognosis. We applied a similar supervised machine learning method that was used for MammaPrint (Van et al., 2002). Forty-two patients that experienced relapse within three years were designated as poor prognosis. Forty-nine patients who were relapse free for at least three years were categorized as good prognosis. The gene expression values were centered and scaled before grouping. For the training dataset, 32 and 39 patients were randomly chosen from poor and good prognosis category, respectively. The rest of the patients were grouped as test dataset. Detailed clinic information is listed in Table S1.

Selection of genes with high correlation to real prognosis status
Overall, there are 20,530 genes in the raw RNA-seq data. The Pearson correlation coefficients with real prognosis status were calculated for all genes. Genes with absolute correlation coefficient greater than 0.3 were selected. To test whether such correlation coefficient distribution could be found by chance, a permutation method was used to generate 10,000 Monte-Carlo simulations randomizing the correlation between gene expression data of the 71 training patients and corresponding prognostic categories.

Supervised machine-learning method
Gene number incorporated in the signature needs to be optimized. One thousand, five hundred and ten genes were reordered by absolute coefficients from maximum to minimum. Starting from the top two genes on the list, 755 signatures were generated by adding two more genes from the top list each time until all the 1,510 genes were exhaustively used as reporters. A Leave-One-Out Cross-Validation (LOOCV) method was employed to check the performances of these signatures: Step 1: leave one tumor out; Step 2: calculate the good-and poor-prognosis expression template by averaging the expression values for each gene incorporated in good-prognosis group and poorprognosis group, respectively. Then we defined a parameter called risk coefficient (risk-coef.). For a tumor, risk coefficient was calculated with its gene expression profile and good-and poor-prognosis expression template: Risk-coef = cor-coef. to good template − cor-coef. to poor template; Step 3: calculate the risk-coefs for all the remaining 70 training samples and the left out sample. Reorder the 71 samples by ranking their risk-coefs from small to large. Determine the genomic risk by taking first 32 tumors as high genomic risk and the rest 39 as low genomic risk. Check the consistency between genomic risk and real risk for the left out sample; Step 4: repeat step 1-3 iteratively until all the 71 samples have been left out once. Collect the error counts when there is a disagreement between genomic risk and real risk for the left out sample. Better signatures with least error counts were selected.

Cross-validation without information leak
The 1,510 genes were obtained using all training samples including the one left out for cross validation, so there might be an over-fitting issue due to information leak. A modified LOOCV with no information leak was performed as below: Step 1: leave one patient out; Step 2: calculate the Pearson correlation coefficients with real prognosis status for all genes with the reminding 70 training samples. Filter the genes with |coefficient| ≥ 0.3.
Step 3: generate the signature with the genes selected and predict the genomic risk for the left out sample.
Step 4: repeat step 1-3 iteratively until all the 71 samples have been left out once.

Phase II
Further machine learning process was applied to generate a concise scoring system. Before machine learning, the RPKM (Reads Per Kilobase per Million mapped reads) values need normalization, which was done through dividing them by geometric mean of RPKM values of TFRC, GUSB, and RPLP0. Firstly, the TCGA COAD patients (Table S2) were split into training and test dataset. There is no significant difference between the clinicopathologic factors of these two groups (Table 1). For each of the 118 genes, we calculated the coefficient and p-value in univariate Cox Proportional Hazard regression model (CPH) with training dataset. Then we reordered the gene list by sorting the univaraite Cox-regression p-value from minimum to maximum. So the top genes have stronger correlations with cancer prognosis. Starting from the top one gene in the list, we added one more gene iteratively from the top for multivariate CPH analysis. In every iteration step, the fitness of established signature on test dataset was checked by calculating Kaplan Meier Log Rank p-value (KM-p). At the end of iteration, signature incorporating the top 12 genes has the minimum test dataset KM-p and was deemed as the optimal one. The multivariate Cox coefficient of each gene in the final signature was extracted to generate the scoring system: Ei * βi.
Ei: expression level of gene i; βi: multivariate Cox-regression coefficient of gene i.

Validation stage
The GEO microarray datasets were used to validate the final gene expression signature. For genes with more than one probe, the probe showing minimum univariate CPH p-value was selected. Relative expression level was obtained via dividing the probe signal by geometric mean of signals of TFRC, GUSB, and RPLP0. For each tumor, a risk score was obtained by calculating the weighted summation of relative expressions of the 12-gene. For a certain dataset, patients with risk scores below the median value of the population were designated as the low risk group, while the rest of the patients were categorized as the high risk group. Survival comparisons between high and low risk groups were conducted by Kaplan-Meier plotting. Log Rank p value <0.05 was considered as significantly different. Other cancer types in TCGA library were also retrieved to validate the 12-gene signature.

Development of signature prototype
The development process was shown as the flow chart in Fig. 1. With the TCGA COAD data, an unbiased screening method was used to obtain 1,510 genes showing absolute correlations greater than 0.3 with disease outcomes. The frequency distribution of number of genes with absolute coefficient no less than 0.3 in the 10,000 Monte-Carlo trials was displayed in Fig. S1. The probability of obtaining 1,510 genes or more with an absolute correlation coefficients of at least 0.3 with prognosis categories purely by chance was 0.0019 (p < 0.05), which was fair for us to reject the null hypothesis. During the (Leave-One-Out Cross Validation) LOOCV process, 755 signatures were generated. Least violation times were observed when signature employed the top 16, 36, 40, 42, 44, 46, 48, 50, 56, 58, 60, 62, 64, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, or 118 genes. We further found that the predictive accuracy rates were high towards the 71 training samples with the signature containing the top 118 genes (Fig. 2). We had the luxury to further validate the established signatures using the remaining 20 independent samples in test dataset. For each signature, receiver operating characteristic curve (ROC) was plotted with the information of risk-coefs and real risk of the 91 TCGA patients to compare the performances of the 25 signatures. There was no significant difference among the performances of these signatures ( Fig. 3 and Table 2). Because the above 1,510 genes were obtained using all the training samples including the one left out for cross validation, a modified LOOCV without information leak was performed. Seventy-one additional signatures were created. The vast majority of the original 1,510 genes were shared by most of the 71 signatures (Fig. S2). So there was very limited information leak introduced during the previous training process.

Development of 12-gene signature
For the purpose of concise and simplicity, we further established a 12-gene expression signature based on the 118 genes obtained in phase I training stage. Expressional coefficients were assigned to respective genes. Each patient has a risk score by calculating the weighted summation of expression values of the 12 genes. The Kaplan-Meier (KM) survival analysis showed that among TCGA COAD patients, the high risk group displayed significantly poorer prognosis than low risk group regarding to disease free survival (DFS) (training dataset: KM Log Rank p = 0.0001; test dataset: KM Log Rank p = 0.0005) (Fig. 4).  (Marisa et al., 2013). In patients with proficient mismatch repair system (pMMR), our 12-gene signature could effectively distinguish high risk group from low Interestingly, advanced stage patients were significantly enriched in high 12-gene score group (Table 3).

Predictive performances of the 12-gene signature in other cancer types
We also tested the performance of the signature in other cancer types. TCGA RNA-seq data and corresponding clinic information for 24 cancer types were retrieved for validation. Surprisingly, KM results showed that our signature successfully separated good prognosis patients from poor prognosis patients in several other cancer types including pan-kidney cohort (KIPAN) (

DISCUSSION
Numerous attempts have been made to establish gene expression signatures for the purpose of precise prediction of colorectal cancer prognosis (Gray et al., 2007;Venook et al., 2011;Meropol et al., 2011;Ebata, Hirata & Kawauchi, 2016;Guinney et al., 2015;Marisa et al., 2013;Smith et al., 2010;Gentles et al., 2015). A meta-analysis was done to assess the clinical value of several published prognosis gene expression signatures in colorectal cancer (Sanz-Pamplona et al., 2012). Although most of the published signatures showed significant  statistical association with prognosis, their accuracy to classify independent tumor samples into high-risk and low-risk group is limited. So we need more robust and accurate gene expression signature that can predict prognosis cross independent COAD datasets. Here we established a gene expression signature by applying two steps of supervised machinelearning method. The predicative accuracy of our gene expression signature was proven by validation in two large independent gene expression microarray datasets (GSE39582, N = 459; GSE17538, N = 232). Decision making regarding adjuvant therapy has been a debate among professional clinical organizations over the past 20 years (Dotan & Cohen, 2011;Meropol, 2011;Vachani, 2013). Currently speaking, uncertainty is present in adjuvant chemotherapeutic effects among stage II COAD patients who are mismatch repair system proficient. The Scottish Intercollegiate Guidelines Network (SIGN), ASCO, and NCCN are following different guidelines regarding this issue (Gao et al., 2016). Resectable COAD patients with pMMR routinely receive 5-FU based postoperative adjuvant chemotherapy (POCT) which has been shown to provide a relatively small absolute benefit (Andre et al., 2009;Gill et al., 2004;Sargent et al., 2009;Gray et al., 2011;Alex et al., 2017), indicating that many COAD patients might have been over-treated due to the lacking of an effective test to stratify the patients further. Our gene signature showed important prognostic value for stage II or/and III pMMR COAD patients. There validation results in GSE39582 indicate that lower 12-gene score patients have gained survival benefit from adjuvant chemotherapies, while high score patients treated with adjuvant chemotherapies didn't receive survival benefit. So our 12-gene signature could potentially be used to guide decisions about adjuvant therapy for patients with stage II & III and pMMR colon cancer. Seven of the proteins encoded by the 12 genes were related to immune system, they are TREML2, PADI4, NCKIPSD, PTPRN, PGLYRP1, C5orf53, and TREML3, indicating the essential roles of deregulated immune response in COAD progression and metastasis (Table S3). TREML2, acting as the counter-receptor for B7-H3, promotes T cell responses (Hashiguchi et al., 2008). PADI4 protein catalyzes the conversion of arginine to citrulline residue. With specific high expression in blood lymphocytes (Asaga et al., 2001;Anzilotti et al., 2010), PADI4 is believed to be an active autoimmune player in synovial tissue of rheumatoid arthritis (Chang et al., 2005). It is reported that cell free circulation PADI4 mRNA level (together with cfDNA, PPBP, and haptoglobin) in peripheral blood of nonsmall cell lung cancer patients was significantly higher than that in healthy donors, so PADI4 may serve as a potential marker for NSCLC diagnosis (Ulivi et al., 2013). As a member of protein tyrosine phosphatase (PTP), PTPRN is an autoantigen in the sera of insulindependent diabetes mellitus (IDDM) patients, making it a promising therapeutic target of autoimmunity in IDDM (Rabin et al., 1994;Solimena et al., 1996). Hypermethylation in PTPRN was associated with longer progression-free survival in ovarian cancer (Bauerschlag et al., 2011). If that is the case, hypomethylation (upregulated mRNA expression level) in PTPRN may be associated with poor prognosis, which is consistent with our results. NCKIPSD is a protein containing SH3 and proline-rich domains. Reports have shown that NCKIPSD is involved in the maintenance of sarcomeres and assembly of myofibrils into sarcomeres (Lim et al., 2001). A very recent study reported that NCKIPSD downregulation and increased α-tubulin acetylation promotes stiffness of tumor stroma, which in turn, may inhibit chemotherapeutic drug uptake and regulate tumor sensitivity to chemotherapy, resulting in high risk of breast cancer recurrence within 5 years (You et al., 2017). Consistently, our findings also showed decreased NCKIPSD expression is associated with high risk of colon cancer recurrence. PGLYRP1 is a member of peptidoglycan recognition proteins which are conserved innate immunity proteins, recognize bacterial peptidoglycan, and play a role in antibacterial immunity and inflammation (Dziarski & Gupta, 2010). PGLYRP1 interacts with Hsp70 to induces cytotoxic activity in tumor cells via TNFR1 receptor (Yashin et al., 2015). C5orf53 is also called a IgA inducing protein, which enhances IgA secretion from B-cells stimulated via CD40 (Endsley et al., 2009). TREML3 is a inhibitory receptor involved in antigen processing (Cella et al., 1997).
Numerous studies have shown that cancer patients' prognosis and sensitivity to therapy are closely associated with infiltration and density of immunologic cells within primary tumors (Wels et al., 2008;McConnell & Yang, 2009;McLean et al., 2011;Sethi & Kang, 2011;Smith & Kang, 2013). Of particular note, by applying a novel machine-learning method, called Cell-type Identification By Estimating Relative Subsets of known RNA Transcripts (CIBERSORT), Gentles et al. developed several gene expression signatures to inferring distinct leukocyte subsets representation in bulk tumor transcriptomes (Gentles et al., 2015). In several solid tumors including colon cancer, the signatures relating to plasma cells and polymorphonuclear cells were the most significant favorable and adverse module to cancer outcomes, respectively. The broad spectrum involvement of lymphocyte infiltration and intra-tumor immune-suppression implies that this could be the main reason why our 12-gene signature could also predict patient prognosis in several other cancer types including kidney cancer, lung cancer, uveal and skin melanoma, brain cancer, and pancreatic cancer.
Other five genes (NOG, VIP, RIMKLB, NKAIN4, and FAM171B) in the 12-gene signature are functionally sporadic. NOG is related to mesodermal commitment and differentiation pathway (Costamagna et al., 2016). High expressing of gene signature including NOG showed a strong trend for a worse prognosis of patients with lung adenocarcinomas (Rajski, Saaf & Buess, 2015). VIP, a member of glucagon/secretin superfamily, is the ligand of class II G protein-coupled receptor (Umetsu et al., 2011). It causes vasodilation and lowers arterial blood pressure. VIP signaling is enhanced in human prostate cancer (Fernandez-Martinez et al., 2012). Elevated VIP secretion is associated with advanced tumor stage in colorectal carcinoma (Hirayasu et al., 2002). RIMKLB is involved in alanine, aspartate and glutamate metabolism. RIMKLB up-regulation is associated with radio-resistance in nasopharyngeal carcinomas (Li et al., 2016). NKAIN4 may interact with the beta subunit of Na, K-ATPase (Gorokhova et al., 2007). FAM171B which is a single-pass type I membrane protein, belongs to the FAM171 family. It is up-regulated in gemcitabine-resistant pancreatic cancer cell line (Zhou et al., 2015). The associations of these genes with cancer and cancer outcomes are very relevant to our findings in this study.
Our signature generated a novel scoring system with relative gene expression values by dividing the raw expression with geometric mean of RPKM values of three house-keeping genes (TFRC, GUSB, and RPLP0). In order to preserve the heterogeneities among tumors to the most extent, ACTB and GAPDH were avoided using as reference genes due to the fact that cytoskeleton and energy metabolism might be greatly deregulated among cancer individuals (Xiang, Chen & Fu, 2017;Stine & Dang, 2013). A recent study overcomes hypoxia-induced tumor cell resistance by synergistic GAPDH-siRNA and chemotherapy (Guan et al., 2017), indicating the important roles of GAPDH in tumor cell resistance. Our normalization process also makes the gene expression scoring system very friendly to different gene expression detection systems including qPCR, RNA-seq, and QuantiGene Plex.

CONCLUSION
A robust and accurate gene expression signature is essential to assist oncologists to determine which subset of patients at similar TNM stage has high recurrence risk and could benefit from adjuvant therapies. Here we report a new 12-gene expression signature that could separate resectable COAD patients into poor-and good-prognosis group in several independent TCGA and microarray datasets. Functional classification showed that seven of the twelve genes are involved in immune system function and regulation. Our gene expression signature could potentially serve as an effective genomic test in helping identify resectable COAD patients with high risk of poor prognosis. The accuracy and robustness of the signature as a prognostic classification requires further confirmation with large prospective patient cohorts.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This work was supported by the National Natural Science Foundation of China (No. 81672334) and the International Science and Technology Cooperation Project of Shanghai (No. 15410710100). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.