Identification of a stemness-related prognostic model in Pancreatic ductal adenocarcinoma by Weighted gene co-expression network analysis


 Objective: Aim of this study was to identify the stemness-related genes in Pancreatic ductal adenocarcinoma (PDAC). Methods: The RNA-seq data of PADC were downloaded from The Cancer Genome Atlas (TCGA) and the Genotype-Tissue Expression (GTEx) databases. mRNA expression base-index (mRNAsi) and epigenetically regulated mRNAsi (EREG-mRNAsi) of PADC were evaluated. The mRNAsi-based gene sets in PADC were identified by Weighted gene co-expression network analysis (WGCNA). Functional Enrichment Analyses were performed with key genes. Kaplan–Meier survival analysis and the Cox proportional hazards model were used to evaluate the prognostic value of the key genes. Prognosis-associated hub genes were applied to establish nomograms. The receiver operating characteristic curves (ROC) and concordance index (C-index) were utilized to assess the discrimination and accuracy of the nomogram. Finally, these results were validated in the Gene Expression Omnibus (GEO) database and immunohistochemistry (IHC). Results: 36 key genes associate with mRNAsi were screened via WGCNA. Next a five-gene signature compromised TPX2, NCAPH, UBE2C, CCNB2, CEP55. Based on the expression of the signature, the PADC patients were classify patients into high- and low-risk groups. Cox regression analysis revealed that the high-risk group was significantly positive with overall survival (OS). Moreover, the nomogram has better sensitivity and specificity for predicting the OS. And the ROC, C-index indicated good performance of the prognostic signature in the TCGA and GEO dataset. Conclusion: Prognostic model associated with cancer stem cells (CSCs) reliably predict OS in PADC. this might be beneficial for the treatment of PADC patients.


Introduction
Pancreatic ductal adenocarcinoma (PDAC) is the most prevalent type of pancreatic neoplasm, accounting for more than 90% of the total number of pancreatic tumors.
It has a 5-year survival rate of less than 10% [1,2]. Surgical resection represents the only chance for cure, and advancements in adjuvant chemotherapy have improved long-term outcomes in PDAC patients [3]. Lack of early diagnosis and effective interventions are the major factors in PADC. Badly, these treatments cause great trauma to the patient, thus reducing the patient's quality of life to some extent.
Fortunately, molecular targeted therapy, which blocks cancer growth, progression and metastasis by specific molecular biomarkers, emerges as an individual treatment strategy because of less trauma and excellent efficacy [4]. However, identifying molecular biomarkers remains a challenge due to the insufficient knowledge of PDAC.
Cancer stem cells (CSCs) were found to promote tumor recurrence, metastasis and drug resistance because of their self-renewal ability, proliferation and multi-lineage differentiation potential [5]. Besides, emerging evidence indicated that stemness of CSCs is associated with high intratumoral heterogeneity in most cancers, including PADC [6,7]. Intratumoral heterogeneity reportedly drives disease progression and causes treatment failure. And it includes phenotypic diversity such as cell surface markers and (epi)genetic abnormality. This markers as well as gene mutation types are often considered as basis for the pathological classification and clinical treatment of tumor [8]. Pancreatic CSCs, first described in 2007 [9], account for less than 1% of all pancreatic cancer cells [10], and are responsible for PDAC tumor development, metastasis, and chemoresistance. As previously reported, PADC involves the activation of biomarkers and signaling pathways related to CSCs, including CD133, CD24, CD44, MYC, WNT/β-catenin, Notch, etc. [11]. Although the use of these markers has increased our potential biological knowledge of Pancreatic CSCs, such as unlimited self-renewal capacity, unique tumorigenicity, and chemoresistance, more research is needed on the molecular mechanisms of Pancreatic CSCs. Our study focuses on uncovering heterogeneity of PADC from the perspective of stemness features of CSCs below.
Currently, the stemness features of CSCs have been mainly characterized mRNA expression-based-index (mRNAsi) (eg. Epigenetic regulation based-index (EREG-mRNAsi)) [5]. The high score of mRNAsi in CSCs helps to identify new targets and reflects the high degree of carcinogenic dedifferentiation [12]. Studies have reported that the scores of mRNAsi associated with CSCs stemness could be computed by the one-class logistic regression (OCLR) machine learning algorithm, and that there was significant correlation between mRNAsi obtained and prognosis of PADC [5].
However, genes related to CSCs stemness have not been identified throughoutly, or the biological function of genes related to CSCs stemness remains unknown.
Analysis of differentially expressed genes (DEGs) has been reported to identify genes. However, such analysis was found to show limitations in elaborating the connections between genes. The gene network analysis can reflect the complex biological processes during tumor neogenesis [8]. Weighted gene co-expression network analysis (WGCNA) was used to explore co-expression modules associate with clinical traits, which include mRNAsi [13,14].
In current studies, WGCNA was used to identify stemness-related preserved modules for discovering the key genes. Next, the functions of these genes may be associated with cell cycle regulation during the PADC development by Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses and gene set enrichment analysis (GSEA). Subsequently, a novel seven-gene prognostic model (i.e. TPX2, NCAPH, UBE2C, CCNB2, CEP55) was constructed and verified and assessed in TCGA and GEO database. Finally, prognosis-associated genes were applied to established predicted nomograms. Overall, our new model and nomogram might provide a powerful tool to predict PADC prognosis.

Datasets sources and processing
In this work, the criteria for screening patients in the TCGA PADC cohort include the following: (1) histologically confirmed primary PADC (2)  expression of genes analyzed in normal tissues was collected using GTEx. And these data were all available on September 9, 2020. The RNA-seq data included 169 normal samples and 142 tumor samples, then were integrated into a matrix file using a merge script in Perl language (http://www.perl.org/).

mRNAsi in PADC and its clinical significance
The mRNAsi used to evaluate oncogenic dedifferentiation was from previous studies [5]. The mRNAsi score of PADC samples can be calculated from PADC datasets inside TCGA by OCLR. The gene expression-based stemness index ranges from low (zero) to high (one) stemness. Significant differences of mRNAsi in tumors and non-tumors were analyzed by Wilcoxon-test. Next, to evaluate the prognostic value of mRNAsi scores, an overall survival analysis according to mRNAsi scores was performed using the " survival "package in R software v3.6.1 (https://CRAN.Rproject.org/package=survival). the Kaplan-Meier (K-M) analysis for samples with the high and low mRNAsi set was carried out. Furthermore, we used the Wilcoxontest to investigate the correlation between mRNAsi score and patient age.

Differentially expressed genes (DEGs) in UCEC
Wilcoxon-test was used to screen the DEGs between normal pancreases samples and tumor samples. The false discovery rate (FDR) <0.05 and |log2 fold change (FC)|≥1, were considered the cut-off criteria for screening.

WGCNA construction and module preservation
An R package " WGCNA " was utilized to discover the correlations among genes by constructing important modules. And the modules were obtained by clustering genes with similar expression patterns, reflecting external sample characteristics [14]. 1750 DEGs with high precision and accuracy were selected for the next step of WGCNA analysis. To obtain a high-scale independence and average connectivity, the softthresholding power value was calculated from a gradient test range from 1 to 20 and determined by pickSoftThreshold function. Then, we calculated the topological overlap matrix (TOM) according to the corresponding soft-thresholding power. The TOM was introduced to identify modules of highly co-expressed genes for making networks less sensitive to spurious connections. Genes with high absolute correlations were further clustered into the same modules. The modules were defined by cutting the clustering tree into branches by using a dynamic tree cutting algorithm and assigned to different colors for visualization. Through the analysis of hierarchical clustering based on TOM-based dissimilarity measurements, we constructed module dendrograms; To avoid over-splitting, correlated modules (r < 0.25) were then merged and each module was added a corresponding color [15].

Identification of Modules Associated with Clinical characteristics
Module eigengenes (MEs) are considered to be the main components of principal component analysis of each gene module, and the expression patterns of all genes can be summarized as a single feature expression profile within a given module [16].
To show the correlation between genes and sample characteristics, the gene significance (GS) was calculated. And module significance (MS) refers to the average GS of all genes in the module. MS was used to measure the correlation between modules and sample characteristics. The module with higher correlation with MS were identified in the further analysis.

Screening of the key Genes in red module
The key genes in co-expressed network were highly interconnected with nodes in a module, and they were used to explore the biological functions of the identified dysregulated genes [17]. The module membership (MM) was defined as the correlation of gene expression profile with module eigengenes (MEs). The intramodular key genes were defined as cor.GS (gene significance) > 0.5 and cor. gene MM > 0.8. And then the "corrplot" R package was applied to describe the Pearson's correlation among the key genes.

Function and pathway enrichment analysis
To investigate the biological function of key genes, GO and KEGG analyses were employed [18,19]. The "clusterProfiler" R package used to implement this functional annotation and pathway enrichment analysis. FDR ≤ 0.05 were considered to determine statistical significance. The protein-protein interaction (PPI) network of the key genes in key modules was constructed using the STRING (https:// www.string-db.org/) [20]. The criterion for selecting hub nodes was a combined score of ≥0.8 and a connectivity degree of ≥20.

Establishment of a prognostic model
We used K-M survival analysis curves and the stratified Cox proportional hazards analysis to evaluate the prognostic value of the key genes. Then risk score for each prognostic gene was calculated based on the sum of products of Si (expression level of key genes) and β (corresponding coefficients) generated from Cox model, respectively. And the PADC patients were divided into high-and low-risk subgroups by median risk score for fitting K-M analysis to further validate whether they went through diametrically distinct prognosis. predictive accuracy of each prognostic signature was accessed by calculating Uno's inverse-probability of censoring weighting estimation of dynamic time-dependent receiver-operator characteristic (ROC) area under the curve (AUC) values (time spanning from 1 to 3 years) with "timeROC" package (version 0.3), according to the method proposed by Blanche et al [21]. Finally, the expression distribution of the signature genes was visualized in the heatmap using the "pheatmap" R package.

Prognostic value of the five-gene prognostic model
The risk scores and the clinicopathological parameters, including age, gender, TNM stage, grade, OS, postoperative radiotherapy (postoperative_tx_tx), radiotherapy, alcohol story, were included in the univariate and multivariate Cox model. Next, based on the result of multivariable Cox regression analysis, a nomogram was established using the "rms" package (version 5.1.2) [22]. The nanogram was applied to predict 1-, 2, 3-years OS of PADC patients in the TCGA dataset. Subsequently, using the "timeROC" package of R, the time-dependent ROC curve was utilized to assess the sensibility and specificity of the nomogram, and these was quantified by AUC. The predictive accuracy discrimination of the nomogram was assessed by the Harrell's C-index. The C-index was calculated to assess nomogram discrimination based on a bootstrap method with 1,000 resamples.

Validation of the prognostic signature
To minimize bias caused by small study cohorts, the prognostic capability of fivegene model was validated using dataset GSE62452 the GEO database (www.ncbi.nlm.nih.gov/geo/). The optimal cut-off for risk score for the GEO dataset was the same as for the primary TCGA cohort. K-M analysis was used to plot survival curves for the low-risk and high-risk groups. The predictive accuracy of the prognostic model was investigated using ROC curves, and the performance of the

Statistical analysis
Statistical analysis was performed using R software v3. 6

Data processing and survival analysis
RNA-seq data related to CSCs were acquired from the TCGA PADC cohorts, including 142 PADC samples. The 169 normal samples were collected from TCGA and GTEx normal tissues. Gene expression-based stemness indices for PADC were extracted by OCLR [5]. mRNAsi is an indicator used to quantify the degree of similarity between tumor cells and stem cells. mRNAsi is based on the gene expression. EGER-mRNAsi is based on training stem cell epigenetic regulation of the expression of related genes. As shown in Figure 1A

DEGs identification in UCEC
According to cut-off criteria, by Wilcoxon-test, 1750 DEGs were screened, including 845 down-regulated genes and 905 up-regulated genes. We visualized the expression of DEGs using the "beeswarm" R package, and the results showed that DEGs were significant differentially expressed in PADC and normal groups with a clear display of volcano plot and heat map ( Figure 1C).

co-expression modules were identified by WGCNA
Based on the DEGs, the WGCNA package was used to identify the highly stemnessrelated modules. In order to ensure accuracy of the results, ß = 11 (scale-free R 2 = 0.95) as a soft threshold (power) was used to ensure a scale-free network ( Figure   2A). and then we established a sample dendrogram and trait heatmap of mRNAsi and EGER-mRNAsi ( Figure 2B). Based on this cluster analysis of PADC, there were a total of 14 different modules (module size ≥ 50, cut height ≥ 0.25) in the network (blue, tan, brown, greenyellow, cyan, green, magenta, red, purple, yellow, black, turquoise, grey) ( Figure 2C). Three modules were identified as related to mRNAsi.

Analysis of key gene in red module
Genes in the same module exhibit common gene expression patterns. And in this work, we obtained 36 key genes out of 1750 DEGs that were the relevant to mRNAsi in the red module. We also compared the differential expression of genes with minimum p-value(P<0.01) in the PADC and normal tissues, and found that these genes were upregulated in tumor tissues ( Figure 3A). Meanwhile, in order to explicit the difference clearly, we drew a heat map by "ggpubr" R package ( Figure 3B).
Moreover, the interrelationships among key genes were visualized as a heatmap ( Figure 3C, p<0.01).

Functional Enrichment and PPI Network Analysis of the key genes
To fully understand biological functions of 36 key genes in PADC, GO and KEGG function and enrichment analyses were performed. The functions of these genes are mainly enriched to 30 pathways, including BP, CC, and MF ( Figure 4A). BP was found to be relevant to mitotic nuclear division. CC was related to the spindle. MF includes microtubule binding. As shown in Figure 4B, KEGG pathway analysis indicated that the hub genes were enriched in cell-cycle signaling pathway. The signaling pathway with FDR corrected p-values < 0.01 was deemed significant.
Furthermore, a PPI network of key genes was constructed to identify gene interactions ( Figure 4C).

Identification and evaluation of stemness-related prognostic model
To find out the prognostic value of stemness-feature associated genes, a total of 36 key genes were identified to be associated with OS based on univariate Cox regression analysis, and the 35 genes related OS were identified as risk factors ( Figure 5A). Next, a prognostic model composed of 5 genes was then developed based on the red module using the stepwise multivariate Cox regression analysis.  Figure 5B. the heatmap of 5prognistic genes were also showed in Figure 5B.  were statistically significant in multivariable Cox regression analysis.

Validation of the five-gene prognostic model
We evaluated the predictive power of the prognostic gene signature in the PADC cohorts from the GEO database (GSE62452). Figure 6C showed that the high-risk cohort also was a significantly poorer prognosis than the low-risk cohort (P<0.0043).
The protein expression levels of these hub genes were obtained from HPA database.
notably, the proteins of CCNB2, TPX2, NCAPH2 and CEP55 were not or low expressed in normal pancreas tissues, whereas medium and high expression levels of these genes in PADC tissues (Figure 8). This may indicate that this five-gene prognostic model has some reference value for the diagnosis or treatment of patients with PADC.

Gene set enrichment analysis (GSEA)
In order to further explore the potential functions on the 5 genes in PADC, GSEA was applied. these results suggest that these genes are mainly associated with cell cycle checkpoint, mitotic nuclear division, DNA integrity checkpoint ( Figure 9).

Discussion
PDAC is a highly heterogeneous malignancy, with high morbidity and mortality.
What' worse, stem cells which show similar to cancer cells in some bioactivities are found to reside in some cancer tissues. And, there is evidence that stem cells and tumor cells are similar in the following three aspects: 1. The regulatory mechanism of self-renewal is similar; 2. Tumor cells may be derived from normal stem cells; 3.
Tumor tissues may contain a small proportion of CSCs with unlimited proliferative potential [24][25][26]. Several years after the conceptually proposed existence of stemlike cancer cells, experimental evidence was first provided in leukemia models, confirming that CD34+CD38−eukemic cells show bone marrow hematopoietic stem cell characteristics [27,28]. Solid tumor CSCs (CD44+ CD24−/lowLineage− cells) have been found in breast cancer, ovarian, prostate, colon, pancreatic, liver, and lung cancers. therefore, CD34+, CD44+ have been suggested for common or unique CSC signatures [29,30]. In addition, CSCs were identified in PDAC using some cell surface makers, including CD133, CD44, CD24, ESA/EpCAM (epithelial-specific antigen), etc. [31]. And it is now universally accepted that PDAC metastasis, chemoresistance, and disease relapse are driven by a subpopulation of highly plastic "stem-like" cells known as CSCs in tumors, and from a clinical perspective, understanding how these cells function and developing inhibitors to eliminate the CSCs population should lead to tumor eradication [32][33][34]. However, the molecular mechanisms driving the stem-like state of pancreatic CSCs is still incomplete and requires further investigation. In agreement with previous reports [36][37][38], dysregulation of the 5 key genes that we identified were found to result in the development of tumors. In addition, these hub genes were found to be closely associated with cell cycle. These 5 genes functioned to disrupt mitotic cyclins and cell cycle progression and might be involved in cancer progression. TPX2 was found to encode a microtubule-associated protein. Up-regulation TPX2 levels led to increased proliferation and invasion of renal cancer cells [39]. CEP55 were reported to play an important role in promoting self-renewal and stemness maintains of glioma stem like cells [40]. Dysregulation of UBE2C was associated with proliferative marker Ki-67 and a poor overall survival in colorectal carcinoma [41,42]. In advanced prostate cancer, Aaron et al showed that the abnormal expression of CCNB2 was closely related to cell cycle-driven subpopulation [43]. And NCAPH may regulate cell cycle progression and DNA damage via activation of the DNA damage response (Chk1/Chk2) signaling pathways in pancreatic cancer [44]. In PADC, there are few studies on these 5 genes related to stemness characteristics of CSCs. In the study, we identified the key genes as having functions in regulating cell division cycle in PADC. These findings suggest that these genes may promote tumor initiation, metastasis, as well as recurrence.
Therefore, the study of the genes related to the stemness characteristics of CSCs provides a new idea for the targeted therapy of PADC in the future. Future studies should focus on the role of targeting the key genes to direct adjuvant treatment of PADC and further exploration about molecular mechanism that sensitive to tailored targeted therapies.

Conclusion
In summary, we established a stemness index-related signature and developed a prognostic nomogram in combination with prognosis-related clinical parameters.
And the model had predictive value for predicting the OS of patients with PADC.

Ethics approval and consent to participate
Not applicable.

Consent for publication
In this study, the privacy of individual participants was not involved, such as images, videos or other details. All authors have contributed to, read, and approved this submitted manuscript in its current form.

Availability of Data and materials
All data generated or analyzed during this study are included in the manuscript.