Development of a four-gene prognostic model for pancreatic cancer based on transcriptome dysregulation

We systematically developed a prognostic model for pancreatic cancer that was compatible across different transcriptomic platforms and patient cohorts. After performing quality control measures, we used seven microarray datasets and two RNA sequencing datasets to identify consistently dysregulated genes in pancreatic cancer patients. Weighted gene co-expression network analysis was performed to explore the associations between gene expression patterns and clinical features. The least absolute shrinkage and selection operator (LASSO) and Cox regression were used to construct a prognostic model. We tested the predictive power of the model by determining the area under the curve of the risk score for time-dependent survival. Most of the differentially expressed genes in pancreatic cancer were enriched in functions pertaining to the tumor immune microenvironment. The transcriptome profiles were found to be associated with overall survival, and four genes were identified as independent prognostic factors. A prognostic risk score was then proposed, which displayed moderate accuracy in the training and self-validation cohorts. Furthermore, patients in two independent microarray cohorts were successfully stratified into high- and low-risk prognostic groups. Thus, we constructed a reliable prognostic model for pancreatic cancer, which should be beneficial for clinical therapeutic decision-making.


INTRODUCTION
Risk stratification (also commonly called "prognostic modeling") is a useful tool in cancer management, since it enables timely interventions for high-risk patients while obviating unnecessary treatments for low-risk patients [1,2]. Classical prognostic factors such as the clinical tumornode-metastasis (cTNM) stage and pathological stage (pTNM) are not completely prognostically relevant in some patients [3][4][5]. Accordingly, the guidelines for prognostic assessments have been continually modified to improve their accuracy while reducing their complexity for daily clinical use [6][7][8][9].
Novel molecular factors such as immunoscores assessing in situ immune cell infiltration in tumors, abnormal DNA levels and mRNA levels are more accurate risk predictors than the existing tumor parameters [10][11][12]. Highthroughput technologies provide an efficient means of measuring the molecular disruptions in tumors [13]. For example, a prognostic landscape of cancer was recently developed, which integrated the transcriptomes and clinical data of approximately 26,000 patients across 39 malignancies to establish the patterns and determinants of responses to targeted therapy [14]. Since numerous cancer-related microarrays and sequencing platforms have been generated in recent years, it is essential to integrate Pancreatic cancer has a dismal prognosis, with a fiveyear survival rate of only 9% [18]. It is characterized by desmoplastic stroma, perineural invasion [5], invasiveness and immune suppression [13], which are largely responsible for the early metastasis [19], chemoresistance [20] and cachexia [21] observed in patients. Based on the transcriptome data of pancreatic cancer cells, tumors can be classified into the squamous, pancreatic progenitor, aberrantly differential endocrine exocrine, and immunogenic subtypes [13]. The squamous subtype is associated with a poor prognosis, and the immunogenic subtype involves the upregulation of gene networks for acquired immune suppression. A better understanding of the molecular landscape of pancreatic cancer would enable the development of novel therapeutic strategies to improve clinical outcomes and facilitate the stratification of patients into prognostic groups to guide personalized treatment. However, a comprehensive prognostic model with compatibility across different transcriptomic platforms and patient cohorts has not been systematically developed.
To determine the prognostic significance of the pancreatic cancer transcriptome, we screened multiple RNA-Seq and microarray datasets for genes that were differentially expressed between normal and tumorous tissues, and identified genes that were significantly associated with overall survival. We then developed a prognostic risk score and successfully validated it in three independent pancreatic cancer cohorts. We thereby devised a prognostic model that can predict the post-surgical prognosis of pancreatic cancer patients with moderate accuracy.

Combined analyses of multiple pancreatic cancer microarray datasets
We searched the Gene Expression Omnibus (GEO) database for all the human tissue microarrays that included pancreatic cancer tissues and paired/unpaired normal pancreatic tissues. Then, we used Transcriptome Analysis Console software (Applied Biosystems, version 4.0.2) to evaluate the data for hybridization and labeling controls. Affy [22] was used to assess RNA degradation, and simpleAffy [23] was used to determine the 3'-to-5' ratios of β-actin and GAPDH (Supplementary Figure 2). Two pancreatic ductal adenocarcinoma (PDAC) datasets (GSE22780 and GSE27890) were thus excluded, and seven datasets (GSE32676, GSE16515, GSE71989, GSE41368, GSE15471, GSE28735 and GSE62452) were selected for further analysis (Table 1).
After seven cases were excluded from these datasets, the data of 177 normal pancreatic tissue samples and 226 PDAC tissue samples were included in subsequent analyses. A robust rank aggregation analysis [24] identified 616 differentially expressed genes (DEGs) between the normal and PDAC samples across all datasets, with an adjusted p value < 0.05 and |log2 FC (fold change) | > 1 as the cut-offs. Among these genes, 403 were upregulated and 213 were downregulated in PDAC tissues. The heatmap displaying the top 10 significantly overexpressed or suppressed genes is shown in Figure  1A.
Gene Ontology (GO) analysis of the DEGs revealed significant enrichment in the GO terms for 158 biological processes (BPs), 26 cellular components (CCs) and 28 molecular functions (MFs) (p value < 0.01 and q value < 0.01 as cut-offs) ( Figure 1B). The top BP terms were related to three aspects: i) extracellular stroma formation, including extracellular structure organization and extracellular matrix organization, which was not surprising, since stiffness is the defining characteristic of PDAC; ii) immune cell responses, such as the innate immune response, neutrophil activation and neutrophil mediated immunity; and iii) fundamental pancreatic functions, such as regulating pancreatic juice secretion and epithelial cell proliferation. Major CC terms included the extracellular matrix and various components of the intracellular lumen and the apical part of the cell. The most enriched MF terms were extracellular matrix formation, cell adhesion and receptor ligand activity.
Thus, through a combined analysis of seven highquality GEO microarrays, we identified 616 genes that were consistently differentially expressed between normal pancreatic tissues and PDAC tissues. Most of the DEGs were associated with the pancreatic extracellular stroma and the tumor immune microenvironment.

Combined analyses of TCGA and GTEx RNA-Seq datasets
To determine whether the DEGs were independent of the detection method, we also analyzed their expression in PDAC RNA-Seq datasets. Since TCGA only contains data from four normal pancreatic tissue samples [25], we also included normal tissue data from the Genotype-Tissue  [26,27]. The GTEx and TCGA RNA-Seq data and phenotypic information were obtained from the University of California Santa Cruz (UCSC) Xena platform (https://xena.ucsc.edu/), which is routinely updated and integrated [28]. In total, 171 normal and 178 pancreatic cancer samples were analyzed, and 5,886 DEGs were identified by the same cut-off criteria used to analyze the PDAC microarrays (|log2 FC | > 1 and false discovery rate < 0.05). Among these genes, 2,980 were upregulated and 2,906 were downregulated in pancreatic cancer tissues relative to normal pancreatic tissues. These DEGs were enriched in 600 BP, 106 CC and 32 MF terms, and the most significantly enriched BP terms were related to leukocyte function, extracellular matrix formation, inflammation, etc. (Figure 2A). Consistent with the DEGs in the microarrays, most of the genes were enriched in the extracellular matrix or junctions for adhesion and antigen-binding functions ( Figure 2B). Thus, both sequencing and multi-microarray data indicated that the tumor immune and stroma microenvironment was significantly disturbed during pancreatic tumorigenesis.

Identification of 542 genes that were consistently differentially expressed across independent platforms
We next investigated whether the 616 DEGs identified by the seven microarrays were consistent with the 5,886 DEGs identified by PDAC transcriptome sequencing.
When we compared the DEG profiles obtained by these two methods, we detected 542 common genes. Surprisingly, all of the overlapping genes displayed consistent expression trends in the two types of profiles (Supplementary Table 1).
Since transcription factors (TFs) and kinases are key components of cancer regulatory networks and are preferred targets for drug development [29,30], we cross-referenced the DEGs with both the Cistrome Cancer human TF database [31] and a list of 518 human kinases [32]. Of the DEGs, 19 displayed TF activity ( Table 2) and 16 were kinases, of which six belonged to the Tyrosine Kinase group (Table 3). Thus, we identified 542 pancreatic-cancer-related genes that were consistently dysregulated in both multi-microarray datasets and sequencing datasets, including 35 welldefined protagonists harboring core regulatory functions.

The dysregulated transcriptome is associated with overall survival in pancreatic cancer patients
To determine the phenotypic relevance of the DEGs, we performed a weighted gene co-expression network analysis (WGCNA) [33] to identify gene modules   AGING (groups of highly interconnected genes) that were significantly associated with the clinico-pathological features of pancreatic cancer. Since sufficient sample sizes and adequate phenotypic data (including prognostic data) are prerequisites for analyzing gene coexpression networks that are associated with clinical characteristics, we only extracted the gene expression profiles and corresponding clinical data of the PDAC patients from TCGA who met these criteria (N = 135). Fourteen clusters (modules) of highly interconnected genes with co-expression similarity values > 0.75 for the module eigengenes were identified (Supplementary Figure 3, Supplementary Table 2).
Since we had already identified the characteristic gene co-expression profiles of each module, we searched for significant correlations between the module eigengenes and clinical traits. Overall survival status was associated with three modules, although the Pearson correlations were weak; for example, the black module correlated positively (r = 0.27, p = 0.001) and the pink module correlated negatively with overall survival (r = -0.26, p = 0.002) ( Figure 3). Additionally, age and tumor grade were associated with one module each, while the other four clinical traits we examined (gender, tumor stage, T classification and N classification) were not associated with any module. Thus, WGCNA analysis indicated that overall survival was the main clinical trait associated with the pancreatic cancer profiles in TCGA, so we focused on overall survival in our subsequent investigations.

Construction of a prognostic model for pancreatic cancer
Since GSE62452 and TCGA harbored an adequate number of cases and sufficient clinical follow-up data, we performed a univariate Cox regression analysis of AGING these datasets to investigate the prognostic significance of the 542 consistently identified DEGs described above. A prognosis-associated gene was defined as impacting overall survival with a hazard ratio (HR) and 95% confidence interval (CI) greater than or less than 1. While 92 of the 542 DEGs had prognostic value in GSE62452 (p < 0.05), 76 of these 92 genes were still prognostically relevant in the cohort from TCGA (p < 0.05) (Figure 4).
A least absolute shrinkage and selection operator (LASSO) regression model [34] was then used to select key prognosis-associated genes. In the LASSOpenalized Cox regression, as log λ (a tuning parameter) changed, the corresponding coefficients of certain genes were reduced to zero, indicating that their effects on the model could be omitted because they were shrinking parameters ( Figure 5A). Following cross-validation, nine genes achieved the minimum partial likelihood deviance ( Figure 5B). Moreover, at this point, log λ was approximately -2.16, and the nine genes displayed non-zero effects, all contributing positive HRs to the model. The nine genes that were thus fit into the Cox model were PBK, which encodes MAPKK-like protein kinase; DLGAP5, which encodes a mitotic phosphoprotein; RACGAP1, also called Rac GTPase activating protein 1; DSG3, also called cadherin family member 6; ARNTL2, which functions as a TF; NUSAP1, also called nucleolar and spindle associated protein 1; DKK1, also called Dickkopf WNT signaling pathway inhibitor 1; KRT7, which encodes a cytokeratin; and C15orf48, a protein-coding gene that is also called chromosome 15 open reading frame 48.
Then, we randomly divided the 176 pancreatic cancer patients in TCGA (two cases in the enrolled TCGA pancreatic cancer cohort (N=178) were excluded due to insufficient follow-up data) into a training cohort (N = 88) for the construction of a prognostic model, and a validation cohort (N = 88) for internal selfvalidation (Table 4). Multivariate Cox regression analysis of the training cohort indicated that ARNTL2, NUSAP1, DSG3 and KRT7 were independent prognostic factors ( Figure 5C).

Stratification of TCGA training and self-validation cohorts using the four-gene signature
We then divided the training cohort into high-risk and low-risk groups ( Figure 6A). We used the median risk score (6.12) as the cut-off because the risk score usually exhibits a skewed distribution. The high-risk group displayed a higher frequency of poor survival outcomes than the low-risk group ( Figure 6B). The three-year survival rates of the high-risk and low-risk groups were 7.78% and 51.3%, respectively ( Figure 6C). To determine the predictive accuracy of this prognostic model, we performed a receiver operating characteristic (ROC) curve analysis, which demonstrated that the area under the curve (AUC) was 0.805 for one-year survival and 0.839 for three-year survival in the training cohort of TCGA ( Figure 6D).
The model was further tested in the self-validation cohort by the same protocol ( Figure 6E), which indicated that higher scores corresponded to worse overall survival ( Figure 6F). The three-year survival rate was 28.6% in the high-risk group and 50.4% in the low-risk group ( Figure 6G). Moreover, the AUCs for one-year and three-year survival in the validation cohort were 0.747 and 0.695, respectively ( Figure 6H). Thus, the prognostic model successfully stratified pancreatic cancer patients from TCGA into high-and low-risk groups with moderate predictive power.

The four-gene prognostic model is reliable in independent cohorts
The predictive capacity of the four-gene model was further tested on two independent GEO microarray datasets (GSE28735 and GSE62452) that also included clinical data. The risk scores were calculated as described above (Figure 7A), and the expression of each gene was found to be greater in the high-risk group than in the low-risk group ( Figure 7B). Consistent with the results in the entire TCGA cohort ( Figure 7C), the risk score accurately stratified the patients of both datasets in terms of their survival outcomes. The three-year overall survival rates of the high-and low-risk groups in GSE28735 were 14.8% and 41.3%, respectively ( Figure  7D), while those in GSE62452 were 5.15% and 45.3%, respectively, displaying an even greater prognostic difference ( Figure 7E). Therefore, the four-gene signature is an accurate, reliable and independent predictive tool for determining the prognosis of pancreatic cancer patients.

DISCUSSION
In this study, a pancreatic cancer prognostic model based on transcriptome dysregulation was developed, which was AGING compatible across the microarray and RNA-Seq platforms and among different patient cohorts. Our prognostic model containing four DEGs (DSG3, ARNTL2, NUSAP1 and KRT7) was used to determine the risk scores of pancreatic cancer patients, and patients with high risk scores were found to exhibit poor overall survival.
Molecular predictors such as the multi-gene expression assays in lung cancer [35] and renal cell carcinoma [36] or the immunoscore in colon cancer [10] have shown promise in facilitating clinical decision-making in large international validation studies. However, commonly mutated genes are not the primary determinants of PDAC prognosis [4]. On the other hand, transcriptome profiles have been successfully translated into prognostic markers in some prospective studies; for instance, a 21-gene signature was recently developed for breast cancer (the Oncotype DX breast cancer assay) [37]. To this end, we constructed a prognostic model for pancreatic cancer by analyzing the transcriptomes of a large number of pancreatic cancer patients across multiple datasets, and subsequently developed a fourgene risk score.
Since gene expression platforms are based on different analytical and data processing methods, it is often challenging to compare and integrate results from multiple datasets. In some previous studies, researchers have obtained integrated results by intersecting the results from different cohorts, which may lead to bias. Therefore, we used the robust rank aggregation method to screen for significantly altered genes across seven microarray datasets in an unbiased manner, and subsequently identified 542 DEGs that overlapped between the microarray datasets and the RNA-Seq datasets from TCGA and GTEx. A similar approach has been used to identify DEGs in prostate cancer [38] and bladder cancer patients [39], as well as in Pan-cancer [40] and multi-omics analyses [41]. The pancreaticcancer-related DEGs were functionally enriched in the tumor immune microenvironment, with particular influence on the desmoplastic stroma, immune cell infiltration and perineural invasion, which contribute to cancer progression [5], metastasis [42] and chemotherapy resistance [43].  High-throughput data tend to be interpreted from a clinical transformation perspective in the precision oncology era [44,45]. It is necessary to integrate all the available information to identify the most relevant markers in a critical and comprehensive analysis. WGCNA is a powerful bioinformatics tool that detects clusters of functionally correlated genes and therefore can identify clinically relevant markers [13,46]. WGCNA has been successfully used to identify molecular signatures in brain cells with distinct spatial distributions [47], to determine the key factors promoting hepatic ischemia-reperfusion injury [46] and to demarcate the molecular subtypes of pancreatic cancer [13]. The LASSO regression algorithm is another genotype-to-phenotype "bridge" that has been used to construct prognostic models from key radiomic [48] and immunohistochemical [49] features. Using these approaches, we found that overall survival was the main clinical trait associated with the transcriptome profiles of pancreatic cancer patients, and that DSG3, ARNTL2, NUSAP1 and KRT7 were independent prognostic factors.
Desmoglein 3 (DSG3) is a component of desmosomes, the button-like structures in the cytomembrane that facilitate intercellular and cell-to-matrix adhesion. DSG3 is overexpressed in head and neck cancer, where it functions as an oncogene [50,51]. Previous studies have demonstrated that DSG3 is an accurate biomarker for staging sentinel lymph nodes in head and neck cancer [52,53], and for distinguishing lung squamous cell carcinoma from other subtypes of lung cancer [54].

AGING
Similarly, high expression of the pro-metastatic transcription factor ARNTL2 predicts poor survival in lung adenocarcinoma [55]. Forced expression of Arntl2 in estrogen receptor-negative breast cancer cells was found to increase their metastatic potential and thus portend a poor prognosis [56]. NUSAP1 promotes mitosis, cell cycle progression and the DNA damage response as a substrate of Cyclin F [57][58][59]. Overexpression of NUSAP1 correlates with poor survival in melanoma [60], cervical carcinoma [61], prostate cancer [62] and glioblastoma multiforme [63]. KRT7, a membrane-cytoskeleton linker required for cell adhesion, is overexpressed in colon cancer [64] and esophageal squamous cell carcinoma [65], and is associated with poor survival and metastasis in colon cancer [64]. In an in vivo model, KRT7 was found to promote the transition of basal cells into the multilayered epithelium and Barrett's esophagus [66]. Thus, all of the above genes have displayed pro-metastatic effects associated with poor outcomes in multiple cancers.
According to the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) statement, internal validation (also called self-validation) is a necessary part of model development, and external validation to evaluate the performance of a model with other datasets is strongly recommended [67]. Therefore, we formulated our four-gene risk score based on the expression and Cox regression coefficient of each gene. Our model stratified pancreatic cancer patients of three independent cohorts into high-and low-risk groups with moderate accuracy. This three-cohort validation, together with the fact that our study was conducted with nine pancreatic cancer datasets in a platform-independent manner, indicates that our model is compatible across different platforms. AGING MCOLN3 and SLC25A45) [68] and a four-gene signature (LYRM1, KNTC1, IGF2BP2 and CDC6) [69] were proposed to predict the overall survival of pancreatic cancer patients. Four of the genes identified by these two studies (MET, CEP55, ITGB6 and ARNTL2) were also identified as prognosis-associated genes in our study. The AUCs for three-year overall survival prediction with the previously reported ninegene model were 0.621, 0.814 and 0.670 in one training and two external validation cohorts, respectively, while the AUCs for three-year overall survival prediction with our four-gene model were 0.839, 0.695, 0.747 and 0.872 (the latter two are data not shown) in one training, one internal and two external validation cohorts, respectively. Thus, our four-gene prognostic model achieved moderate accuracy with less complexity than the nine-gene model.
On the other hand, certain proposed models, such as a five-gene [70] or 25-gene signature [71], were developed through the comparison of gene expression profiles between long-term and short-term survival groups of pancreatic cancer patients. In these studies, genes involved in extracellular matrix organization, cell adhesion and immune response suppression tended to be activated in the short-term survival group, consistent with our findings (Figures 4 and 7B). Other previous bioinformatics studies included only one or two PDAC datasets, and thus could not characterize common prognosis-associated genes across various datasets [72][73][74]. In contrast, our risk score was based on multiple datasets, a significantly larger number of patients and both microarray and sequencing platforms. This difference in patient groups and data processing methods may account for the different results.
This was the first study to combine comprehensive pancreatic cancer cohorts in a systematic analysis strategy. The major strengths of this study were the performance of quality control measures, the crossvalidation of the results and the use of multiple cohorts for consistency. The use of publicly available data from millions of assays is a challenge [75], and one prerequisite is retaining the quality of the raw data. Since issues with technique or RNA degradation may occur, we checked the quality of the enrolled cases and removed any cases with potential problems. Another concern when integrating different platforms is balancing the different batch effects or detection methods, so we adopted robust rank aggregation to perform an unbiased analysis.
Our prognostic model, which was derived from multiple cohorts and validated by three cohorts, deserves further validation and translation into clinical practice, since the four genes in our model encode proteins and can be examined in clinical practice through routine costeffective methods. However, since this model was developed from microarray and sequencing expression profiles, more common and practical methods (e.g., quantitative real-time PCR or immunohistochemistry) need to be used to translate the model into clinical practice. Additional studies are needed to elucidate whether the protein levels of these genes are consistent with their transcriptional levels in pancreatic cancer. For oncological research, these consistently altered genes are worthy of further investigation, and the prognosisassociated genes should be further characterized for both their involvement in cancer progression and their value as therapeutic targets.
In conclusion, we constructed a four-gene prognostic model for pancreatic cancer that can predict postsurgical prognosis with moderate accuracy and facilitate therapeutic decision-making and clinical monitoring.

Study design and cohorts
In order to avoid biases caused by single or small numbers of cohorts, such as those based on a specific race, detection method or analysis technique, here we conducted a systematic retrospective analysis. We screened all the available high-throughput Affymetrix microarray datasets (up to August 2, 2019) in the GEO database to identify datasets that included both normal and cancerous pancreatic tissues and passed our quality control assessment for all the enrolled raw data. The analysis strategy is summarized in Supplementary  Figure 1. Ultimately, we enrolled two RNA-Seq cohorts and seven microarray datasets. We first sought to identify common DEGs across all nine enrolled cohorts. WGCNA was then performed to connect clinical traits with pancreatic cancer expression profiles in TCGA. Based on the results of the WGCNA, we focused on determining the significance of the DEGs in predicting post-operation overall survival.
To establish the molecular prognostic model, we combined the prognosis-associated genes from TCGA and GSE62452 in a univariate Cox regression analysis. Next, the intersecting results were shrunk with the LASSO regression algorithm, such that highly interconnected genes were alternated to avoid overfitting. The filtered genes were entered into the multivariate Cox regression analysis, and a scoring model was built to predict overall survival.
The cohort from TCGA was randomly divided into two comparable sub-cohorts (each N = 88). The training cohort was used to optimize the parameters in the AGING LASSO and multivariate Cox regression analyses to build the risk score model, while the validation cohort was used to self-validate its performance. Two independent GEO microarray cohorts (GSE28735 and GSE62452) were also used to further validate the prognostic prediction model.

Microarray raw data quality control and identification of DEGs
Since all the selected GEO datasets were based on the Affymetrix platform, quality control was performed with Transcriptome Analysis Console software (Applied Biosystems, version 4.0.2) and the R 'simpleAffy' [23] and 'Affy' [22] packages. After averaging the expression values of the genes corresponding to the multi-microarray probes, we calculated the log2 FC ratios between the normal and tumorous samples in each dataset, and determined their statistical significance with the R 'limma' package [86]. The 'RobustRankAggreg' R package was then used to identify the DEGs across the multi-microarray datasets based on a prioritized gene list, with a numerical core and p value determined through Bonferroni correction [24].

TCGA and GTEx sequencing data integration and DEG identification
After the UCSC Toil RNA-Seq Recompute data were downloaded, the FPKM (fragments per kilobase of transcript per million mapped reads) values from GTEx were log2 (x+0.001) transformed, and the values from TCGA were log2 (x+1) transformed. Both forms were unified as log2 (x+1) , and the 'limma' package was used to screen for DEGs between normal and tumorous pancreatic tissues, with a |log2 FC | > 1 and a false discovery rate < 0.05 as the cut-offs. For both the sequencing and microarray platforms, a log2 FC > 0 indicated gene overexpression in the tumor tissues.

Construction of the heatmap and GO plot
The DEGs in each sample were plotted with the R 'pheatmap' package [87]. Entrez gene annotations were referred to 'org.Hs.eg.db' (Carlson M, 2019. org.Hs.eg.db: Genome wide annotation for Human. R package version 3.8.2.), and the GO analysis was performed in the R 'clusterProfiler' package [88] with an adjusted p value < 0.01 and a q value < 0.01 as the cut-offs. The GO cluster was plotted with the R 'GOplot' package [89].

WGCNA
The R 'WGCNA' package was used to detect gene modules and evaluate the correlation of each module with clinico-pathological factors. To be specific, (i) we extracted data from TCGA on pancreatic cancer patients with complete clinical data regarding age, gender, tumor histological grade, clinical stage, TNM classification (except for M, as there were numerous cases lacking metastasis information), overall survival rate and duration, along with their gene expression profiles; (ii) sample clustering was performed to detect any outliers; (iii) the power (soft thresholding) β value was set to 10 so that we could achieve a scale-free topology fit index (scale-free R 2 ) greater than 0.9 and maintain optimal mean connectivity; (iv) the adjacency matrix was transformed into a topological overlap matrix (TOM) to define gene co-expression similarity; (v) the 'hclust' algorithm was used to create a gene hierarchical clustering based on the TOM dissimilarity measure; (vi) the optimal module size was set to 30, and a dynamic tree cut was used to identify the modules; (vii) after the dissimilarity of the module eigengenes was calculated, the similarity cut-off was set to 0.75 in order to merge the modules; (viii) since we had already identified the featured gene expression profiles for each module and the clinical traits of the patients, the correlations of the module eigengenes with the clinicopathological factors were determined.

Construction of the prognostic model
Prognosis-associated genes were identified by the R 'survival' package [90] as those impacting overall survival with HRs and 95% CIs greater than or less than 1. Identified genes were then subjected to univariate Cox regression analysis with p < 0.05 as the significance threshold. The list was further narrowed down by the LASSO algorithm with the R 'glmnet' AGING package [91], and the optimal tuning parameter (λ) was chosen to achieve the minimal partial likelihood deviance in the cross-validation plot. The genes still harboring non-zero corresponding coefficients were entered into the multivariate Cox model. Finally, the expression of each gene was multiplied by its Cox regression coefficient, and these values were summed to calculate the risk score.

Training and validation of the prognostic model
The R 'caret' package [92] was used to divide the cohort from TCGA randomly into training and selfvalidation sets (N = 88 each). Two microarray datasets (GSE28735 and GSE62452) were selected as independent validation cohorts. The risk score was calculated for each patient, and the patients were then divided into high-risk and low-risk groups based on the median risk score of the training cohort. The performance of the model was evaluated in terms of its ability to predict one-year and three-year overall survival in the high-and low-risk groups.

Statistical analysis
Continuous variables with normal distributions are reported as the mean + standard deviation, while those with skewed distributions are reported as the median (25 th percentile -75 th percentile). Categorical variables are reported as frequencies (proportions). All analyses were conducted with the R foundation for statistical computing (version 3.6.1). Pearson correlation analysis was used to determine the correlation between a module eigengene and a clinico-pathological factor, with p < 0.05 indicating statistical significance. Kaplan-Meier survival analysis and the log-rank test were performed with the R 'survival' package. The time-dependent ROC curve from censored survival data was plotted with the R 'survivalROC' package [93]. With the exception of four outliers, all the samples could be hierarchically clustered. Binary variables (overall survival status, gender) are presented as red or white blocks, and continuous variables (overall survival days, age, grade, stage, T, N) are presented as colorcoded blocks, with the color saturation corresponding to the value of the variable. (B) Determination of the soft-thresholding power for the optimal scale-free topology fit index (scale-free R 2 ) (left) and mean connectivity (right). The red horizontal line represents R 2 = 0.9. (C) Module identification. The dendrogram represents the gene clustering based on the TOM dissimilarity measure. Genes with relative interrelatedness are located on the same or neighboring branches. A dynamic tree cut at module size 30 resulted in 17 color-coded modules. By merging the modules after calculating the dissimilarity of the module eigengenes at a cut-off of 0.75, we identified 14 modules.