Gene signature for prognosis in comparison of pancreatic cancer patients with diabetes and non-diabetes

Background Pancreatic cancer (PC) has much weaker prognosis, which can be divided into diabetes and non-diabetes. PC patients with diabetes mellitus will have more opportunities for physical examination due to diabetes, while pancreatic cancer patients without diabetes tend to have higher risk. Identification of prognostic markers for diabetic and non-diabetic pancreatic cancer can improve the prognosis of patients with both types of pancreatic cancer. Methods Both types of PC patients perform differently at the clinical and molecular levels. The Cancer Genome Atlas (TCGA) is employed in this study. The gene expression of the PC with diabetes and non-diabetes is used for predicting their prognosis by LASSO (Least Absolute Shrinkage and Selection Operator) Cox regression. Furthermore, the results are validated by exchanging gene biomarker with each other and verified by the independent Gene Expression Omnibus (GEO) and the International Cancer Genome Consortium (ICGC). The prognostic index (PI) is generated by a combination of genetic biomarkers that are used to rank the patient’s risk ratio. Survival analysis is applied to test significant difference between high-risk group and low-risk group. Results An integrated gene prognostic biomarker consisted by 14 low-risk genes and six high-risk genes in PC with non-diabetes. Meanwhile, and another integrated gene prognostic biomarker consisted by five low-risk genes and three high-risk genes in PC with diabetes. Therefore, the prognostic value of gene biomarker in PC with non-diabetes and diabetes are all greater than clinical traits (HR = 1.102, P-value < 0.0001; HR = 1.212, P-value < 0.0001). Gene signature in PC with non-diabetes was validated in two independent datasets. Conclusions The conclusion of this study indicated that the prognostic value of genetic biomarkers in PCs with non-diabetes and diabetes. The gene signature was validated in two independent databases. Therefore, this study is expected to provide a novel gene biomarker for predicting prognosis of PC with non-diabetes and diabetes and improving clinical decision.


INTRODUCTION
PC is an aggressive cancer of the digestive system, which is becoming a serious health problem worldwide. Overall survival for patients with pancreatic cancer is poor, mainly due to a lack of biomarkers to enable early diagnosis and a lack of prognostic markers that can inform decision-making, facilitating personalized treatment and an optimal clinical outcome (Siegel, Miller & Jemal, 2016). In most cases, type-II diabetes frequently occurs in patients with PC .Thus, it is considered to be an important risk factor for malignancy of PC (Huxley et al., 2005). However, non-diabetes PC patients have no early diagnosis indicator, which makes it more difficult to diagnose. In addition, PC with diabetes and without diabetes are very different in histopathology (Girelli et al., 1995) and molecular levels. Currently, many studies do not consider the difference between PC with diabetes and non-diabetes. They just considered that diabetes was a risk factor in PC development (Fisher, 2001). With the deeper understanding of the relationship between PC patient with diabetes and non-diabetes, recent data suggests that diabetes and altered in glucose metabolism are the consequence of PC, and yet, the clinical presentation of the altered glucose metabolism in these patients vary considerably (Yalniz & Pour, 2005). So, PC patients with diabetes and non-diabetes may represent two types of PC. Therefore, we predict that PC patients with diabetes and non-diabetes are also different in their prognostic biomarkers. The different prognostic biomarkers indicate that they should be treated respectively via their own different ways.
Generally, patients with diabetes have more opportunities to detect the potential risk of pancreatic cancer, while patients without diabetes often lack indicators for early diagnosis and miss the best opportunity for pancreatic cancer treatment. Furthermore, good prognostic markers can also be targeted at two types of pancreatic cancer patients to propose better treatment options, improve the prognosis.
In this study, The Cancer Genomic Atlas (TCGA) database, the Gene Expression Omnibus (GEO) database and the International Cancer Genome Consortium (ICGC)were employed to investigate and validate gene biomarker for prognosis in PC with or without diabetes. By characterizing genetic alterations, TCGA project has provided a large number of comprehensive genomic cancer data and corresponding clinical data that we can be used to figure out the relationship between them, which allows us to understand PC better and more accurate. However, high through-put genomic data (microarray or High seq V2) may encounter the problem in statistics which called ''curse of dimensionality'' (Mramor et al., 2005). Due to this problem, ordinary regression is subject to over-fitting and instable coefficients and stepwise variable selection methods do not scale well (Jr, Lee & Mark, 1996). Therefore, the least absolute shrinkage and selection operator (LASSO) method is employed to resolve this problem (Wang, You & Lian, 2015;Simon et al., 2011). Through adjusting the coefficient of Cox regression, LASSO can penalize the regression in high dimensionality and collinearity to solve ''curse of dimensionality'' (Tibshirani et al., 2012;Friedman, Hastie & Tibshirani, 2010). Least Absolute Shrinkage and Selection Operator (LASSO) regression and a hybrid of these (elastic net regression); all three methods are based on penalizing the L1 norm, the L2 norm, and both the L1 norm and L2 norm with tuning parameters. Although the traditional Cox proportional hazards model is widely used to discover cancer prognostic factors, it is not appropriate for the genomic setting due to the high dimensionality and collinearity. Several groups have proposed to combine the Cox regression model with the elastic net dimension reduction method to select survival-correlated genes within a high-dimensional expression dataset and have made available the associated computation procedures. Many studies have adopted elastic-net regression to screen genes, in order to predict survival of patients. In the current study, we are going to subject the integrated mRNA and clinical factors profiles of PC patients, aiming to identify and analyze gene biomarker that can predict the overall survival (OS) in the diabetes and non-diabetes of PC patients by LASSO.
Recently, many studies employed the TCGA (TCGA-PAAD) and GEO dataset (GSE62452) to identify useful gene biomarker which can predict prognosis in many various cancer patients (Bing et al., 2016;Yang et al., 2016). In this study, the ICGC dataset was also employed to validate prognostic gene signature. Along with the increasing genomic data of PC patients, lots of corresponding studies begin to analyze the genomic data and try their best to explore interesting and meaningful but extremely difficult problems (Gore et al., 2015;Craven et al., 2015).

Information of patients
All related studies about diabetic and non-diabetic patients with PC were identified and collected by carefully searching from the online TCGA (TCGA: GDC TCGA Pancreatic Cancer) databases (http://tcga-data.nci.nih.gov/tcga/). The following combination of keywords was simultaneously applied for the literature search according to the requirement of this study 'pancreatic cancer' or 'PC' or 'pancreatic tumor' or 'pancreatic malignancy' and 'diabetes' and 'non-diabetes'. In addition, the following research feature criteria are used to further improve and screen the desired search samples: (1) researches that concentrated on patients with diabetes and non-diabetes were selected; (2) survival time involved of patients was more than 30 days; (3) patients who didn't receive any adjuvant therapy before. (4) all tissues that were from patients must be the primary tumor. After filtering and screening the data by these above criteria, 136 samples were selected from TCGA databases, which included 99 non-diabetic patients and 37 diabetic patients with PC.

RNA data gathering and filtering
The data of mRNA expression was downloaded from TCGA database. And the IIIumina HiSeq RNASeqV2 platform is selected.

Clinical factors and survival analysis
Clinical factors for the both diabetic and non-diabetic patients with PC are listed exhaustively in Table 1. For the correlation between RNA expression and OS was carried out by forthputting univariate Cox regression (the two-sided log-rank test). In the present meta-analysis, HRs and corresponding 95% CIs were combined to estimate the value of cancer prognosis. The hazard ratio (HR) was calculated from exp (β) and β was the coefficient from Cox regression. Clinical variables from univariate Cox proportional hazards regression P-value ≤0.05 were regarded as an important indicator of diabetic and non-diabetic patient prognosis.

The expression of mRNA associated with survival analysis
The relationship between patient survival and mRNA expression was analyzed through drawing on the univariate Cox proportional hazard regression. The null-selected RNA is calculated again and again. P-value ≤0.05 screened for mRNA (P ≤ 0.05). In normal conditions, RNAs that had a HR>1 and P value ≤0.05 were considered to be a risky gene while HR<1 is seen as an improved low-risky gene. In diabetic patients with PC, we reached a conclusion that 64 mRNAs are significantly associated with overall survival time (p < 0.05) by univariate Cox regression. In non-diabetic patients with PC, we acknowledged that 1,559 mRNAs are obvious significantly associated with overall survival time (p < 0.05).
In data of high dimension gene expression, the coefficients (β) of Cox regression model needs to be penalized in order that it can fit better and minimize errors as much as possible. Therefore, elastic net-regulated Cox regression method is applied to calculate the results from univariate Cox regression. The penalized log-likelihood function is defined as following: With the value of λ increasing, value of p j=1 |β j | would be decreased. Then, some coefficients (β) of RNAs would be changed into 0. This result was analyzed by selecting the LASSO-adjusted Cox regression coefficient =0 mRNA. These steps are carried out by R package ''glmnet''. Finally, we obtained eight mRNAs in diabetic patient with PC and 20 mRNAs in non-diabetic patients with PC.

Prognosis index construction
PI is calculated from linear combination of candidate RNAs and their expression for each PC patient. We defined a weighted prognostic index (WPI) (Xiong et al., 2014) for integrating indicators of RNAs for each PC patient, as following: Where β i represents the coefficient in Cox regression of the i th variable. And V i ii signifies the value of the i th variable. Mean (PI) and SD (PI) stand for the mean value and standard deviation of the PI, respectively. Where V i is the expression value of each mRNA (log2-transformed expression value) and β i is the LASSO regulated Cox proportional hazards regression coefficient of the i th RNA or clinical traits.

Risk stratification and ROC curves
The capacity of the integrated RNA and clinical model to predict clinical outcome was evaluated by comparing the analysis of area under curve (AUC) of the receiver operation characteristic (ROC) curves. AUC for the ROC curve was applied to the ''survival ROC'' package in R software (Heagerty, Lumley & Pepe, 2000). The higher AUC is considered as a better model performance and range of AUC value is from 0.5 to 1. The AUC range from 0.80-0.90 is treated as good performance. And the range from 0.90-1.00 was considered to be excellent performance. The risk of patient group was classified into two groups based on the median of WPIs: high-risk and a low-risk. Survival analysis is forthputting Kaplan-Meier curves. Statistical analysis and graph in this study were performed using the software of R software (Ihaka &Gentleman, 1996), version 3.2.4 andBioconductor, version 2.15 (Gentleman et al., 2004).

Gene ontology and pathway enrichment
Gene ontology (GO) functional enrichment analysis was performed to RNAs which classified as low-risk and high-risk group by making use of the online tool of the DAVID (version 6.8). We chose ''Homo sapiens'' as the background in order to search terms ''GO_TERM_BP_FAT'' for further analysis. And these genes are also enriched in Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway for analysis (Aoki & Kanehisa, 2002).

Validation data of patient information collection
In this study, we selected two independent datasets to validation. An independent mRNA expression data of PC patients with 65 PC patients was downloaded from Gene Expression Omnibus (GEO: GSE62452) database (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi? acc=GSE62452). The clinical traits and expression were all downloaded from GSE62452. And the mRNA expression data were generated by Affymetrix Human Genome U133A Array. Data from GEO was analyzed using the updated July 26, 2018.
Another database was downloaded from ICGC database (https://dcc.icgc.org/). We selected Pancreatic Cancer -AU data for further validation. This dataset included 92 PC patients with RNAseq and clinical information. The genomic data of this dataset uses the technology of next generation sequencing. This gene data contained 56,026 RNAs and 92 patients' follow-up data. We extracted gene signature from 56,026 RNAs for verification prognosis. (All raw data and code was listed in File S1).

Clinical traits
In the TCGA PC cohort of the 136 patients, 99 patients are non-diabetic PC patients and 37 patients are diabetic PC patients. We calculated the clinical factors by adopting univariate survival analysis and multivariable Cox regression analysis. We selected nine clinical variables including age, gender, tumor status, alcohol history, history of chronic pancreatitis, number of lymph nodes positive, maximum tumor dimension, neoplasm histologic grade and pathologic stage. And these data are summarized in Table 1. In pancreatic patients without a diabetes cohort, tumor status was significantly associated with overall survival by long-rank and multivariate Cox regression analysis. This result indicated that tumor status is an independent factor correlated with overall survival. In pancreatic patients with diabetes cohort, gender is significantly associated with overall survival time. But this factor is not an independent factor by multivariate Cox regression analysis (Fig. 1, Table 1).

Gene signature analysis in PC cohort
By analyzing of non-diabetes and diabetes PC patients through LASSO Cox regression and multivariate Cox regression, we have obtained 20 mRNAs and 8 mRNAs biomarkers, respectively, which were significantly associated with overall survival. Among these genes, the values of HR<1 and P value <0.01 were considered as protective RNAs and otherwise the values of HR >1 were risky RNAs (Tables 2 and 3). The graph for elastic net Cox regression can be found in Supplementary 1 and Supplementary 2. The PI was significantly associated with pancreatic patient survival. After normalized PI to WPI, the median value of WPI is acted as cutoff threshold to classify low-risk and high-risk patient cohort (Fig. 1).

Validation of the prognostic gene signature
The results were employed in two different ways to verify its stability and reliability. Firstly, we used the gene biomarker in PC patients with diabetes (8 mRNAs) to test the survival curve in PC patients with non-diabetes. Secondly, we used the gene biomarker in PC patients with non-diabetes (20 mRNAs) to swap above calculation.   The validated results showed that the gene biomarker in two groups performed poor result after exchange (Fig. 2). The results indicated that the gene biomarker in different groups has specificity in each condition.
For validation result, independent mRNA expression data and corresponding clinical information of PC patient with non-diabetes is downloaded from GEO database to estimate the reproducibility and robustness of the results from TCGA database.

Gene ontology enrichment
The Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.8 was employed to discover the function of genes both in PC patient with diabetes and non-diabetes. The eight genes in PC with diabetes were associated with regulation of transcription with a Benjamini-Hochberg correction P-value < 0.05. And many genes had DNA binding function. For 20 genes identified in PC without diabetes were not enriched statistically significant association.

Comparison of clinical traits and gene biomarker for predicting prognosis
We integrated clinical traits that significantly associated with survival and PI of gene biomarker that significantly associated with survival to analyze the pancreatic cancer in diabetic and non-diabetic individuals. After multivariate Cox regression analysis, the results showed that PI of gene biomarker performed greatest P-value (Table 4). We filtered the clinical factors that significantly associated with survival by log-rank test into integrative model. In PC with non-diabetes, tumor status, number of lymph nodes positive, stage G2, G3 and G4 were significantly associated with survival (Table 1). And in PC with diabetes, gender, stage G2 and G3 were significantly associated with survival by log-rank test (Table 1).
From the Table 4, we find PI of gene biomarker have smallest P-value after multivariable Cox regression. Although HR is not the highest among clinical traits, P-value is the  smallest. Besides, we can find that tumor status is another significant risk factor in PC with non-diabetes.

Independent data validation for PC with non-diabetes
For further validation result, independent mRNA expression data and corresponding clinical information of PC patient with non-diabetes is downloaded from GEO database (GSE62452) to estimate the reproducibility and robustness of the results from TCGA database. The results showed that the gene signature from TCGA data could be validated in GEO database (n = 65). PI was calculated from gene signature can effectively predict survival of PC with non-diabetes. The median of PI value divided 65 patients into high-risk group and low-risk group (HR = 3.006, P-value < 0.001). And results of ROC showed that AUC = 0.828. The results indicated that the gene signature from TCGA could be validated in independent dataset (Fig. 3). Pancreatic cancer data was downloaded from ICGC database. This data included 92 patients with genomic data and clinical information. The gene signature was matched ICGC database and constructed PI model. The results showed that the PI from gene signature can divided patients into high-risk and low-risk groups significantly (HR = 2.84, P-value < 0.001) in ICGC data. ROC showed that AUC = 0.74, which indicated that the gene signature also validated in ICGC and predict performance well in 3 years (Fig. 4).

DISCUSSION
PC patients showed different prognostic gene signature in diabetes and non-diabetes. Identification special gene signature in different types of PC patients would provide precise medicine for different patients. We identified and verified specific high-risk genes for PC patients without diabetes. And these genes have not been reported before. These gene targets may be potential therapeutic targets for pancreatic cancer.
In this study, we proposed two classes of gene biomarkers in PC patients with and without diabetes which can guide us to predict PC patient survival better and more accurate. To a large extent, PC patients with and without diabetes have quite different gene biomarker for predicting prognosis. After a series of studies, we not only find that genes candidate in both PC patient groups have no overlapping but also figure out that gene biomarker in non-diabetes PC patients is validated by GEO and ICGC datasets. The result indicated that the two sets of gene biomarker in both groups have been very specified. Therefore, they have their own gene biomarker for predicting their prognosis. Because the differences between diabetic and non-diabetic pancreatic cancer patients are often ignored, we only got two types of patients in TCGA database. Other validation databases contained only non-diabetic patients. Furthermore, non-diabetic patients with pancreatic cancer are more likely to be ignored in the diagnosis, leading to a higher risk of such patients. Thus, we validated gene biomarker in non-diabetes PC patients in more datasets. Although a large number of studies have reported some biomarkers in PC patients, many genes have been identified primarily in PC patients without diabetes. We identified and compared the gene signature that predict both types of PC patients. And many genes have not been reported yet so far. Among the high risk prognostic genes, CRCT1, MUC20, RTP1, C10orf111, SPACA5 and FZD10 have high level of HR. MUC20, FZD10 have been identified in PC patients (Lee et al., 2016;Kirikoshi & Katoh, 2002) and these two genes play a vital role in two important pathways associated with cancer. MUC20 is involved in MET (Mesenchymal-Epithelial transitions) process which is a common process in many tumors (Spaderna et al., 2007). And it may regulate MET signaling cascade. It appears to decrease hepatocyte growth factor (HGF)-induced transient MAPK activation (Higuchi et al., 2004). FZD10 is associated with WNT signaling pathway which is implicated in embryogenesis as well as in carcinogenesis (Terasaki et al., 2002). Other genes were not reported in PC patients, but only SPACA5 is reported in bladder cancer (Zhang, Chen & Chen, 2016). Although many genes have not been reported before, we find that these combinations of these genes can greatly distinguish high-risk and low-risk PC patients with non-diabetes. In addition, these genes were validated in an independent GEO database and ICGC database. The results of GSE62452 in the GEO database indicated that these genes were stably expressed and the gene biomarker could distinct between high-risk and low-risk gene greatly.
The gene biomarker in PC patients with diabetes, three genes are high-risk genes. We can find that the production of these three genes (ZNF793, GBP6, FOSL1) are binding function proteins. Thus, we infer that they are all transcription factors. Of the three genes, FOSL1 has been reported to be closely associated with PC Sahin et al., 2005). But these studies have not reported that this high-risk gene is associated with PC with diabetes yet. Only one study reported that FOSL1 is closely associated with diabetes mellitus (Portal-Núñez et al., 2010). And this gene has not been identified in PC with non-diabetes. GBP6 is reported in diabetes (O'Tierney et al., 2012) but is not reported in PC patients with diabetes. ZNF793 is not identified in both PC and diabetes. Thus, we infer that the gene is a potential risk factor in PC patients with diabetes.
Through multivariate Cox regression analysis, it is interesting to note that tumor status is an independent predictor of prognosis in non-diabetes PC patients. Gender is an independent predictor of prognosis in patients with diabetes in PC. Tumor status is a vital clinical factor for predicting the prognosis in many cancers.
From the results, we find that there was no overlapping of both groups. Thus, we conclude that two types of PC vary greatly at the molecular level. Prognostic gene signature in non-diabetes PC patients showed robustness among two datasets (GEO and ICGC). Many genes have not reported in publication and we hope that these genes can predict prognosis for improving clinical decision.

CONCLUSION
Pancreatic cancer patients with diabetes and without diabetes have different gene signature for predicting their respective prognosis. The results indicated that the gene signature