Identification of an Individualized Prognostic Signature Based on the RWSR Model in Early-Stage Bladder Carcinoma

Bladder cancer (BLCA) is the fourth common cancer among males in the United States, which is also the fourth leading cause of cancer-related death in old males. BLCA has a high recurrence rate, with over 50% of patients which has at least one recurrence within five years. Due to the complexity of the molecular mechanisms and heterogeneous cancer feature, BLCA clinicians find it hard to make an efficient management decision as they lack reliable assessment of mortality risk. Meanwhile, there is currently no screening suitable prognostic signature or method recommended for early detection, which is significantly important to early-stage detection and prognosis. In this study, a novel model, named the risk-weighted sparse regression (RWSR) model, is constructed to identify a robust signature for patients of early-stage BLCA. The 17-gene signature is generated and then validated as an independent prognostic factor in BLCA cohorts from GSE13507 and TCGA_BLCA datasets. Meanwhile, a risk score model is developed and validated among the 17-gene signature. The risk score is also considered an independent factor for prognosis prediction, which is confirmed through prognosis analysis. The Kaplan-Meier with the log-rank test is used to assess survival difference. Furthermore, the predictive capacity of the signature is proved through stratification analysis. Finally, an effective patient classification is completed by a combination of the 17-gene signature and stage information, which is for better survival prediction and treatment decisions. Besides, 11 genes in the signature, such as coiled-coil domain containing 73 (CCDC73) and protein kinase, DNA-activated, and catalytic subunit (PRKDC), are proved to be prognosis marker genes or strongly associated with prognosis and progress of other types of cancer in published literature already. As a result, this paper would more accurately predict a patient's prognosis and improve surveillance in the clinical setting, which may provide a quantitative and reliable decision-making basis for the treatment plan.


Introduction
Bladder cancer (BLCA) is the fourth most common cancer for men in the United States, with an estimated 80,470 adults (61,700 men and 18,770 women) and 17,670 deaths (12,870 men and 4,800 women) in 2019 [1,2]. For respective incidence and mortality rates, men are about four times higher than women globally. Besides, incidence rates in white men are double those of black men [3]. On the other hand, BLCA patients tend to older adults. Ninety percent of the patients are older than age 55 [4,5]. Meanwhile, BLCA is third lead-ing cancer and the fourth leading cause of cancer-related death in older men, and sixth and eighth in those of older women, separately [1,6]. Finally, BLCA can be mainly divided into two subtypes based on the cancer cell infiltration: nonmuscle-invasive BLCA and muscle-invasive BLCA. The former has a high recurrence rate but less aggressive, while the latter has a relatively poor prognosis and is easier to metastasize [7][8][9]. It reports that BLCA has a high recurrence rate with over 50% of patients which at least have one recurrence within five years, and it possibly progresses to an aggressive, muscle-invasive, and even metastatic forms [10][11][12].
In clinical practice, the initial purpose of treatment is to slow down its development for early-stage BLCA. However, it is hard to achieve a better outcome based on the heterogeneous cancer feature [13] as well as their recurrent tendentiousness in time and location. At the same time, with the number of comorbidities increasing, it is complicated for clinicians too often making a challenging decision on how to choose effective treatment plans for an individual patient. It may take many resources in an aspect of humans, materials, and finances. Many authorities support the view of intensive surveillance and treatment for early-stage BLCA patients in the practice guidelines [14,15]. It is implied that patients should prevent further progression or, at the very least, be able to detect recurrence early enough so that subsequent interventions are more successful and palatable. However, many cases may not readily satisfy a typical scenario in the published guidelines, thus leaving clinicians enough room to make a decision for the individual patient. And the essential evidence is poor and often due to expert experience and medical theory in terms of some view of these guidelines.
In the phase of staging and risk assessment, further imaging studies [16] will be completed to confirm the stage after patients have confirmed muscle invasion histology, such as computed tomography (CT) or magnetic resonance imaging (MRI). But both tests are often unable to reliably identify T2 from T3a, T3b, or even T4a, separately. For neoadjuvant and adjuvant therapy about muscle-invasive BLCA, the treatment plans are mainly from randomized trials, with lower methodological quality and suspicion of bias [17]. Meanwhile, there is still insufficient evidence for the routine use of adjuvant chemotherapy in early clinical stage (stage IA) practice [18]. High-risk patients may likely benefit most from adjuvant chemotherapy. And for further adjuvant chemotherapy, the clinical data is limited, so the evidence is not strong enough to guide treatment. As a result, clinicians lack quantitative and reliable estimates of competing for mortality risks when considering treatments, although there are guidelines for reference [19,20]. At the same time, there is currently no efficient screening prognostic signature or method recommended for people in early-stage detection.
In this study, the purpose of a robust RWSR model is to find a prognosis signature that is strongly associated with clinical characters through quantitative analysis. And the signature possesses a potential prognostic value for patients with early-stage BLCA and may provide new information for research and treatment. Details are shown as follows. Firstly, the risk coefficient is calculated for each gene through the risk regression algorithm using the mRNA gene expression level and overall survival in the clinical dataset [21]. Genes with zero coefficients are excluded. Secondly, the risk probability is calculated for the remaining genes through a risk estimation algorithm using a gene expression value and risk coefficient. Finally, the risk probability is considered a parameter, and the risk coefficient is considered a dependent variable in the weighted least absolute deviation-smoothly clipped absolute deviation (WLAD-SCAD) algorithm. The genes of nonzero coefficients are identified to be candidate signature in both of GSE13507 and TCGA-BLCA, separately. The common genes between the two datasets construct the gene signature. After that, a series of statistical analyses are performed to validate how accurate, independent, and significant is the 17-gene signature individually.

Data Collection.
In this study, we download mRNA expression profile data and corresponding early-stage (I-III) clinical information of BLCA patients from GEO and TCGA, respectively. The gene expression data of the GEO dataset was calculated on Affymetrix U133 Plus 2.0 microarray platform and contained mRNA gene expression profile and clinical information of 256 patients from Chungbuk National University Hospital in GSE13507 Series (https:// www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE13507). And another dataset was analyzed on the Illumina sequencing platform and contained mRNA gene expression profile and corresponding clinical information of 408 patients from TCGA (https://portal.gdc.cancer.gov/). Samples with more than 30 percent of zero in gene expression values are excluded. Characters with missing values in the clinical dataset are excluded, such as not available and unknown.

Risk-Weighted Sparse Regression (RWSR) Model for
Screening Signature. A risk-weighted sparse regression model is proposed to screen the 17-gene signature, which represents the relationship between prognosis of early-stage BLCA patients and mRNA gene expression level. This model is performed using the R language, including three steps in total. In the first step, we obtain the risk coefficient matrix between the mRNA gene expression level and overall survival by [26] in Equation (1) and Equation (2) as follows: BioMed Research International where X 1 represents the gene expression matrix, Rðt, XÞ represents overall survival, and β 1 , β 2 , ⋯, β n is the risk coefficient. the log-likelihood function is shown in Equation (3) as follows: Then, the risk coefficient is calculated according to the Newton-Raphson algorithm in Equation (4) as follows: where a new gene set is obtained based on the result of the Cox regression algorithm.
Referring to Equation (6), if HR i < 1 and β i < 0, the corresponding genes are identified as candidate protective genes. If HR i > 1 and β i > 0, the relevant genes are defined as risky candidate genes. And if HR i = 1 and β i = 0, the relevant genes are a nonassociate factor.
In the second step, to obtain the risk probability matrix, we calculate the risk score and risk probability for each patient using the risk coefficient in Equation (7) and the gene expression value in Equation (8): where n represents the number of genes and X i represents the i th gene expression value. BLCA patients were separated into high-risk and low-risk groups by the median value of the risk score as a cutoff value.
In the last step, to further screen the gene signature, [27] proposed a sparse linear regression model (WLAD-SCAD), considering rp as a parameter of X i , given by Equation (9) as follows: where Q nω ðγ n Þ is the objective function, p λð:Þ is SCAD penalty function, X i is the gene expression value, rs i is the overall survival, and γ = ðγ 1 , γ 2 , ⋯γ p Þ T represents the regression coefficient.
In order to calculate the objective function, [28] proposed an efficient weighted method, the process is shown as follows: Firstly, in order to compress the dataset into arranging (0,1), a transformation of X i is given by Equation (10) as follows: where x i j represents the i th row, j th column element of matrix X and xˇi j represents the i th row, j th column element of the matrix Xˇ. Secondly, the Euclidean distance is used to calculate the center distance given by Equation (11) as follows: where xˇi is the i th row, mˇis the median vector of xˇi, and k:k represents the Euclidean distance. To obtain d i , i = 1, 2, ⋯, n, we order d i in a decreasing sequence. In the end, a subset is constructed in Equation (12) as follows: where center distance of xˇi is d i , ð1 ≤ i ≤ l = 0:6 * nÞ. Finally, the weight function is calculated by Equation (13), which can not only avoid heavy calculation burden but also improve the robustness of estimation, shown as follows: Obviously, the weight is inversely proportional to the subset size of the center distance, which can greatly reduce the impact of outliers on regression, and wlad has better robustness than other methods. In order to calculate Equation (9), the penalty Equation (14) is local linear approximation through [29]: BioMed Research International where γ nj ≠ 0, and then, γ n is calculated based on In order to improve reliability, only interaction genes between the two datasets are identified to construct the prognostic signature.

Prognosis Model.
After screening the signature through the RSWR model, the multiple Cox proportional hazard regression model is used to estimate whether the signature could be an independent prognostic factor for patient survival. A multigene-based prognostic risk score is constructed in Equation (18) as follows: where n is the number of prognostic genes, p is the number of patients, X i represents the expression level of gene i, and γ i the regression coefficients from the multivariate Cox regression model. A risk score is considered a prognostic index.
Taking the median risk score as a cutoff value, patients from TCGA-BLCA and GES13507 are divided into high-risk and low-risk groups. The univariate and multivariate Cox regression analyses are applied to evaluate the prognostic role of the risk score, along with age, gender, grade, and TNM stage.

Functional Enrichment
Analysis. The functional enrichment analysis of Gene Ontology (GO) and Kyoto encyclopedia of genes and genomes (KEGG) is conducted using the hypergeometric distribution method to identify significantly enriched biological themes including GO terms and KEGG pathways. GO functional terms limited in the "Biological Process" and KEGG pathways with a P value < 0.01 are considered significant. Four pathway databases and one GO terms are downloaded from Explore the Molecular Signatures Database (MSigDB) for doing enrichment analysis, including GO-BP database, KEGG pathway database, Reactome pathway database, Pathway Interaction Database (PID), and BioCarta pathway database.

Statistical Analysis.
To predict the differences between the two risk BLCA patient groups based on survival time, we use the Kaplan-Meier method and calculate the log-rank value to identify the statistical significance between groups. Multivariable Cox regression analysis and stratification analysis are used to estimate the independence of the risk score with other clinical factors [30]. Time-dependent receiver operational feature (ROC) curve analyses are made to evaluate the predictive capacity of the model [31]. And AUC is compared to judge the prognostic performance of Cox analysis, which is the area under the ROC curve with a significance of P < 0:05. In addition, comprehensive survival analysis is also implemented to analyze the relationship between the different clinical characters (stage, grade, and age) and the prognosis model. The P value < 0.05 is used as a cutoff during the prognostic analysis.

Result
3.1. Prognostic Signature Generation. The novel model RWSR is used to identify the gene signature, which is significantly associated with the overall survival of BLCA patients.
In the first step, with P < 0:01 and hazard ratio (HR) <1 as the cutoff value, 617 genes and 1761 genes are selected to be candidate protective genes in GSE13507 and TCGA individually. With P < 0:01 and HR > 1 as the cutoff value, 1,399 genes and 2096 genes are selected to be risky candidate genes in GSE13507 and TCGA, separately. It is named com_prot that the intersection of two candidate protective gene sets from both initial datasets, with 268 genes. Similarly, it is called com_risk for standard risk gene sets with 189 genes in total. Based on the second step, the risk score and risk probability are obtained individually. Followed by the last step, the genes with a nonzero coefficient are identified, 106 genes and 403 genes have remained in the gene set of com_pret_geo and com_pret_tcga. Meanwhile, 327 genes and 496 genes have remained in the gene set of com_risk_geo and com_risk_ tcga. Only common genes on the two datasets are considered candidate signature reliability. As a result, a 17-gene signature is identified, which is strongly correlated with overall survival depending on two independent datasets, including 5 protective genes and 12 risky genes, separately. In order to validate the fitness of the novel model directly, a point plot presents the relationship between the predicted risk score value and actual value, as shown in Figure 1. In Figure 1, there is a value R 2 shown in Figure 1, which is the determination coefficient, also known as the goodness of fit. The arrangement is between 0 and 1; the larger the value is, the higher the fitting degree between the regression model and the actual data is. It is evident that fitness is better in which the value of R2 is 0.937 in GES13507 (Figure 1(a)) and 0.808 in TCGA-BLCA (Figure 1(b)), separately. The points in Figure 1 are distributed near the diagonal of y = x. The general information of the 17 genes is displayed in Table 1.
The prognostic analysis information of the 17 genes with overall survival of early-stage BLCA patients in both datasets is shown in Table 2.
3.2. 17-Gene Prognostic Signature Validation. Based on the risk coefficients, a risk score is built up for signature. To the assess overall survival, a prognostic model is constructed. The patients are separated into the high-risk group and the low-risk group by using the median risk score as a cutoff point. In Figure 2, the risk score distribution and the patients' survival status in two datasets are displayed, ranked based on the risk score values for the 17-gene signature. It is obvious that the patients in the high-risk group has a shorter overall survival than those in the low-risk group (GSE13507: According to these results, the risk score can be considered a prognostic factor. The corresponding ROC curves present   Among the 17-gene signature, 12 genes are associated with high risk (RER1, ZNF284, ZNF780B, XKR8, CCDC73, ATP4A, ZNF680, CNTNAP4, GDF7, PRDM14, EWSR1, and GSG1L; HR > 1) and five genes appear to be protective (PRKDC, SYNPO, FRYL, FCGR3B, and CRB1; HR < 1). We examine the expression level of the prognostic genes according to the comparison of the differences between high risk and low risk. It is evident that patients with high-risk scores prefer expressing risky genes, while patients with the low-risk group prefer expressing protective genes; the corresponding boxplot is shown in Figure 4.  Table 3.

Stratification
Analysis. Some clinicopathological characters are also considered independent prognostic characters during multivariate Cox regression analysis. In order to assess the predictive ability of the 17-gene signature in the same clinical character subgroup, a stratification analysis is adopted in this study. Patients are manually stratified due to clinical characters, such as age (<=70/>70), gender (male/female), stage (0-III), and invasiveness (yes/no). The result shows that the 17-gene signature could divide patients with the same characters into high-risk and low-risk groups, separately. Patients with low-risk scores have a longer overall survival, and vice versa, which is shown in Figure 5.

Survival Prediction by Stage and 17-Gene Signature
Combination. It is proved that the tumor stage, as an emphasis clinical character, has a significant survival predictive value in clinical management. In this study, the stage and the risk score are confirmed as independent prognostic factors in two independent datasets individually. Therefore, a further prognostic model is constructed for survival assessment, trying to integrate the character of stage and 17-gene signature. Based on the stage status and the risk score, patients are divided into eight independent groups: Group 1 (stage 0 and low risk), Group 2 (stage 0 and high risk), Group 3 (stage I and low risk), Group 4 (stage I and high risk), Group 5 (stage II and low risk), Group 6 (stage II and high risk), Group 7 (stage III and low risk), and Group 8 (stage III and high risk), which are shown in Figure 6(a). Patients are all classified into high-risk and low-risk groups under stages 0 to III in both datasets. And in general, patients in the high-risk group have a poor prognosis. According to the result demonstrated in Figure 6, the patients in the high-risk group have worse outcomes than those in the low-risk group among the same stage. It means patients in Groups 2, 4, 6, and 8 are worse than in Groups 1, 3, 5, and 7 individually. However, there are no significant changes in the overall survival among patients in Group 2 and Group 3 as shown in Figure 6(g). Meanwhile, the overall survival in patients of Group 6 and Group 7 is nearly the same as shown in Figure 6(i). These results suggest that patients with a high-risk score in stage II might have similar prognosis as those with a low-risk score in stage III, suggesting that intravesical therapy and neoadjuvant chemotherapy should also be used in patients diagnosed as stage II with a high-risk score.
Among the eight groups, Group 1 demonstrates the best prognosis result obviously; on the contrary, Group 8 displays the worst. In the future clinical practice, patients can be divided into eight groups due to stage information and risk scores to estimate treatment outcomes through the model proposed in this study.
3.6. Signature Enrichment Analysis. To obtain more insights into the functional roles of the 17-gene signature in BLCA, we performed enrichment analysis for the signature to investigate the associated biological processes and pathways. A P value < 0.002 is considered statistically significant as shown in Table 4. And according to the ascending order of P values, the top five GO-BP terms are GO cell fate commitment, GO negative regulation of the fibroblast growth factor receptor signaling pathway, GO negative regulation of cellular senescence, GO lymphoid progenitor cell differentiation, and GO sodium ion export. The other results of the GO-BP term are shown in Supplementary Table S1.

Discussion
Due to cancer heterogeneous and complex molecular mechanisms, BLCA remains one of the most commen malignancies in the world. BLCA patients still face a crisis   9 BioMed Research International of mortality. And BLCA clinicians are hard to make an efficient management decision as they lack of reliable assessment of prognosis. In this study, a novel prognostic signature is identified and validated through RNA-seq data and plenty of clinical data from early-stage BLCA patients in two independent datasets. The discovered gene signature can discriminate early-stage BLCA patients between high and low risk in poor prognosis. This feature may improve BLCA surveillance in clinical treatment and may be an emphasis step in making treatment decisions for early-stage BLCA patients.

RWSR Model for Screening the Prognostic Signature.
In order to screen the gene signature effectively, an RWSR model is proposed to describe the relationship between the prognosis of early-stage BLCA patients and the mRNA gene expression levels. This model is mainly based on a sparse linear regression algorithm. We consider the risk possibility of each gene and the Euclidean distance of the prognosis index as restricted parameters of the linear regression algorithm [32][33][34][35][36][37], which is the difference between existing methods and RWSR. In order to improve reliability, only interaction genes between the two datasets are identified to construct the prognostic signature. An index for validation of the model is the goodness of fit, which is larger than 0.8 (range [0, 1]) in two linear models. The larger the value is, the higher the fitting of the model. The other index is the P value, which represents the statistical significance of the model. In this study, two models are significant with a P value less than 2.2E-16 and a value of R square more than 0.8 in Figure 1, respectively.

Prognosis Model through the 17-Gene Signature.
A prognosis model is established by the risk score and multivariate Cox analysis model for predicting BLCA outcomes. The corresponding ROC curves present the accuracy (AUC) of the signature up to values of 0.74 and 0.737 in GSE13507 and TCGA, respectively, suggesting that the 17-gene signature has good survival prediction performance in Figure 3. Our study found that high expression levels of the gene signature   Table 3. Thus, it is proved that grouping BLCA patients into the high-risk and low-risk groups by the 17-gene-based risk scoring model can be considered early prevention or detection of BLCA recurrence in high-risk patients. The gene signature is derived from a common gene set, and different kinds of survival analysis are adopted to validate the possibility and accuracy of prediction ability for prognostic and detection in early-stage BLCA.

Prognostic Signature for BLCA.
As a result of this study, the 17-gene signature is identified to be an independent prognosis factor. The general information of the signature is shown in Table 1 and Table 2. Among it, ATP4A, which encodes protein as a family member of P-type cationtransporting ATPases, catalyzes the hydrolysis of ATP coupled with the exchange of H(+) and K(+) ions across the plasma membrane, being responsible for acid production in the stomach [38]. Meanwhile, ATP4A and ATP4B downregulation involve DNA methylation and methylated ATP4B DNA in the plasma are potential biomarkers for gastric cancer [39,40]. CNTNAP4 which encodes a member of the neurexin protein family, which function in the vertebrate nervous system, is considered cell adhesion molecules and receptors [41]. Meanwhile, in breast cancer patients, 16q deletion is associated with survival, molecular subtypes, mRNA expression, and germline haplotypes, and the cell recognition gene CNTNAP4 is included [42]. EWSR1 encodes a multifunctional protein that is involved in various cellular processes, including gene expression, cell signaling, and RNA processing and transport [43]. Mutations in this gene are known to cause Ewing sarcoma as well as neuroectodermal and various other tumors [44,45]. Alternative splicing of this gene results in multiple transcript variants. The fusion of short fragments between EWSR1 and FLI1, contributing to the oncogenic gene to construct and maintain expression programs, is enough to recapitulate BAF complex retargeting and EWS-FLI1 activities [46]. GDF7 encodes a secreted ligand of the TGF-beta superfamily of proteins, which binds various TGF-beta receptors leading to recruitment and activation of SMAD family transcription factors that regulate gene expression [47]. A mutation in this gene may be associated with increased risk for Barrett's esophagus and esophageal adenocarcinoma [48,49]. PRDM14 encodes a protein that may possess histone methyltransferase activity. It plays a critical role in cell pluripotency by suppressing the expression of differentiation marker genes. Expression of this gene can reduce the tumor size and distant metastasis of these cells in nude mice, which may be an effective and radical therapy for solid cancers, such as pancreatic cancer, testicular cancer, and breast cancer [50][51][52]. Its related pathways are developmental biology and transcriptional regulation of pluripotent stem cells [53]. An important paralog is PRDM6. CCDC73 is associated with ovarian cancer [54], hepatocellular carcinoma [55], and endometrial cancer [56]. RER1 enhances carcinogenesis and the stemness of pancreatic cancer under the   14 BioMed Research International hypoxic environment [57], which is associated with hepatocellular carcinoma [58]. The protein encoded by FCGR3B is a low-affinity receptor for the Fc region of gamma immunoglobulins (IgG), which function is to capture immune complexes in the peripheral circulation. FCGR3B is associated with innate immune system-related disease, including neonatal alloimmune neutropenia and eosinophilic granulomatosis with polyangiitis [59]. The function of FCGR3B is associated with the innate immune system, renal cell carcinoma, and antiglomerular basement membrane antibody disease (anti-GBM disease) [60]. Meanwhile, FCGR3B may be a helpful prognostic tool for patients with metastatic carcinoma [61]. The research indicates that FRYL is a direct target of miR-1205 through experience and calculation methods. Meanwhile, miR-1205 regulates the proliferation and migration of prostate epithelial cells, and loss of miR-1205 promotes a tumorigenic phenotype in prostate cancer. Consequently, strategies to increase miR-1205 or target FRYL may have therapeutic potential in androgen-independent prostate cancer [62]. A previous study has implicated that downregulation of PRKDC-sensitized MCF-7 cells to chemodrugs in vitro, which is a potential prognostic and predictive marker of response to adjuvant chemotherapy in breast cancer patients [63]. It is proved that PRKDC expression is significantly increased in breast cancer tissue samples compared with NATs and is correlated with reduced overall and progression-free survival in highgrade glioma patients [64]. Downregulation of PRKDC reduces colorectal cancer cell proliferation/survival and induces apoptosis partially through inhibiting AKT activation in colorectal cancer cells. Meanwhile, PRKDC has no relationship with tumor growth but is associated with OS in colorectal cancer patients [65]. Spinal miRNA-124 regulates SYNPO and nociception in an animal model of bone cancer pain [66], so SYNPO is upregulated in bone cancer. However, there are no any reports or laboratory data on the relationship between cancer and the following genes: GSG1L, CRB1, XKR8, ZNF680, ZNF284, and ZNF780B, which is part of the 17-gene signature. Among them, according to the latest report, GSG1L is associated with the plasma concentration of methadone, which can predict treatment responses and methadone-related deaths for individuals. Methadone maintenance treatment is commonly used for controlling opioid dependence, preventing withdrawal symptoms, and improving the quality of life of heroindependent patients [67]. XKR8 can be considered a specific signal for engulfment. CRB1 maintains cell polarization and adhesion. And mutation of CRB1 is correlated with a severe form of retinitis pigmentosa and with Leber congenital amaurosis [68]. But its homolog protein CRB3 may mediate the extracellular signal transduction in clear cell renal cell carcinoma development via an intracellular signal, such as Hippo signal [69]. ZNF680, ZNF284, and ZNF780B are all members of the zinc finger gene (ZNF) family which is one of the vertebrate transcription factors (TF). But the ZNF family is a notable exception that novel ZNF gene types have arisen, duplicated, and diverged independently throughout evolution to yield many lineage-specific TF genes, which makes identification of ZNF complicated [70]. And many members of the ZNF family are associated with cancer, including breast cancer, colorectal cancer, hepatocellular carcinoma, and lung cancer [71][72][73][74].
It is noteworthy that, in predicting survival status, the AUC value of the 17-gene signature is more significant than 0.6 in two datasets, respectively. This means a combination of 17 genes can be considered a new prognosis indicator for BLCA patients. Besides, the 17-gene signature represents the powerful ability to divide BLCA patients into a highand low-group using stratification analysis. As a result, it could be a vital method considered in providing better prescriptions and improving prognosis.

Enrichment
Analysis for the 17-Gene Signature. In the functional enrichment analysis, 24 pathways are significantly enriched among signature genes, including cell cycle, signaling pathway, and transport pathway, which is shown in Table 4. And 142 GO terms of biological process are enriched considerably among the 17-gene signature in Supplementary  Table S1., including the signaling pathway, immune response, cell development/aging, regulation of the immune process, and transport. Taken together, it demonstrated that the signature significantly participated in immune-, cell cycle-, and transport-associated biological processes, which are cancer-related biological activities [75].
Nowadays, the tumor staging system is still the most essential tool of survival prediction and treatment decisions for BLCA patients. Despite having a large clinical value, it is not enough to guide management of its ability on prognosis and prediction. In particular, the present staging system is far from estimate survival at an individual, because 50% of early-stage patients will develop to be recurrence disease [76]. This is directly associated with the decision on 4.5. Validation of the 17-Gene Signature for BLCA. In stratification analysis, the 17-gene signature could dedicate the prognosis value of patients in stages 0-III. Furthermore, it could separate patients whose survival prospects are significantly different in the same stage into high-and low-risk group, which means the signature may improve the accuracy of survival prediction in Figure 5. We can also see that age, stage, and invasiveness could separate patients into highand low-risk groups in two independent datasets individually. And the P value is less than 0.04. Additionally, a prognosis model, estimating survival, is constructed to integrate characters of the 17-gene signature. We find that survival of patients in Group 3 are better than those in Group 2, Group 5 are better than Group 4, and Group 7 are better than Group 6 in Figure 6, which means survival of parts with high-stage patients have longer survival time than those with low-stage patients. These results demonstrate that the patients with a high risk score in stage 0 should received the same close monitoring with stage I patients with a low risk score, patients with high risk score in stage I should accept the same treatment as those with a low risk score in stage II, and the patients with a high risk score in stage II should take out the same therapy as those with a low risk score in stage III. The discovery may help clinicians to select high-risk patients except transurethral resection (TUR) doing adjuvant chemotherapy. Importantly, it reveals that stage and overall survival are significantly correlated no matter which univariant or multivariant Cox regression model is, which is shown in Table 3. We can also see that the value of HR in sex, age, stage, grade, and risk score is more than 1 for both datasets. In addition, the overall survival of some patients in low stage I is worse than that in high stage, it may be that the reason is the extra reexam time needed by a clinician in order to decide the second step diagnosis plan after initial treatment. But it could be more efficient and cheaper if patients are grouped into highand low-risk at the very beginning according to the prognosis model constructed in this study.
The findings in this study may have significant clinical value for early-stage BLCA patients. Significantly, there are still a few limitations existing. Firstly, the clinical information is incomplete, so some records with missing information cannot be used in the study. Secondly, it should be more comprehensive in the future research. The microenvironment, different kinds of data and more analysis methods would be considered into the next step, such as the immune environment, lncRNA, and optimized model.

Conclusion
A robust 17-gene signature is identified to predict the prognosis of early-stage BLCA patients. The results show that the 17-gene signature is a powerful prediction factor for the overall survival of early-stage BLCA patients. In addition, this signature is an independent factor for prognosis. There is no relationship with any other clinical characters, such as stage status. Finally, a prognostic model was proposed combining the gene feature of the 17 genes and stage information of BLCA patients. This study might help prognosis and treatment individually more accurately, especially for high-risk BLCA patients. So, it has clinical practice significance.
In the future, more mRNA datasets could be adopted to identify signature, which could reduce the range of signature and improve the accuracy. On the other hand, multiple omics data analysis could be tried to improve our result.

Data Availability
The raw data of RNA-seq and clinical data from TCGA and GEO datasets are available in the Supplemental Files.