Gene signatures with predictive and prognostic survival values in human osteosarcoma

Osteosarcoma is a common malignancy seen mainly in children and adolescents. The disease is characterized by poor overall prognosis and lower survival due to a lack of predictive markers. Many gene signatures with diagnostic, prognostic, and predictive values were evaluated to achieve better clinical outcomes. Two public data series, GSE21257 and UCSC Xena, were used to identify the minimum number of robust genes needed for a predictive signature to guide prognosis of patients with osteosarcoma. The lasso regression algorithm was used to analyze sequencing data from TCGA-TARGET, and methods such as Cox regression analysis, risk factor scoring, receiving operating curve, KMplot prognosis analysis, and nomogram were used to characterize the prognostic predictive power of the identified genes. Their utility was assessed using the GEO osteosarcoma dataset. Finally, the functional enrichment analysis of the identified genes was performed. A total of twenty-gene signatures were found to have a good prognostic value for predicting patient survival. Gene ontology analysis showed that the key genes related to osteosarcoma were categorized as peptide–antigen binding, clathrin-coated endocytic vesicle membrane, peptide binding, and MHC class II protein complex. The osteosarcoma related genes in these modules were significantly enriched in the processes of antigen processing and presentation, phagocytosis, cell adhesion molecules, Staphylococcus aureus infection. Twenty gene signatures were identified related to osteosarcoma, which would be helpful for predicting prognosis of patients with OS. Further, these signatures can be used to determine the subtypes of osteosarcoma.


INTRODUCTION
Osteosarcoma (OS) is a common primary malignancy seen mainly in children and adolescents (Yang et al., 2018). Its annual incidence ranges from two to three in 1 million people (De Azevedo et al., 2020), and it accounts for 40.51% of primary bone malignancies (Li et al., 2020). The incidence of OS is higher in males than in females (Damron, Ward & Stewart, 2007), and it accounts for 15% of all extracranial tumors in the 10-19-year-old age group) (Nie & Peng, 2018). It is considered the third most common cancer in adolescence (Zhang, Lan & Lin, 2018). OS treatment is aggressive and combines neoadjuvant chemotherapy, extensive surgical resection, and additional postoperative adjuvant chemotherapy. The five-year survival rate of non-metastatic patients is 65%-70% (Siegel, Miller & Jemal, 2018). Regarding patients with distant metastases, the 5-year survival rate is only 15%-30% (Whelan & Davis, 2018). Further, relapse rate remains high at approximately 35% (Yu et al., 2014). Metastasis remains the main cause of death in patients with OS. Many prognostic factors are reported to be related to OS: miR-195, miR-21, TGF-β, MMP-9, HIF-1, APE1, and COX2. However, use of these factors has not improved survival rate related to OS (Zamborsky et al., 2019). Therefore, clarifying the molecular mechanisms underlying the occurrence and progression of OS and exploring the potential diagnostic and therapeutic targets are of great importance for the diagnosis and treatment of OS. Many biomarkers have been evaluated (Table 1) (Wan-Ibrahim et al., 2015), but none have been approved by the Food and Drug Administration for clinical settings (Wan-Ibrahim et al., 2015). This lack may be due to research deficiencies, unacceptable heterogeneity, or absence of effective evaluation. Biomarkers for patients with OS, particularly those with metastases, are urgently needed for early diagnosis and establishment of treatment goals. Serum biomarkers are used for predicting prognosis of other cancers but are rarely characterized in OS (Zamborsky et al., 2019). An obvious need in OS is effective biomarkers for characterizing disease progression and associated prognosis.
One study reported that CDC20 and its downstream substrates, secure, cyclin A2 and cyclin B2 are good prognostic factors for OS (Wu et al., 2019). Savage et al. (2013) suggested that two loci in the GRM4 gene at 6p21.3 and in the gene, desert, at 2p25.2, These two loci warrant further exploration to uncover the biological mechanisms underlying susceptibility to osteosarcoma.The study addressed a single gene and did not take into account interactions among molecules that regulate tumorigenesis. Three candidate genes (ALOX5AP, CD74 and FCGR2A) were found. Their expression levels in lung and lymph nodes were higher than levels in matched cancer tissues, and they may be expressed in microenvironments (Li et al., 2020). Some limitations exist in these studies. First, accuracy cannot be guaranteed with only one dataset because of an expected high false-positive rate. Further, using a single high-throughput analysis method (only sequencing or chip data), results obtained will be biased. Second, a patient's sample data are too limited. Finally, clinical information is incomplete.
Identifying the minimum number of robust genes needed to produce a predictive signature for prognosis for patients with OS was the objective of this study. The lasso regression algorithm was used to analyze sequencing data from TCGA-TARGET, and Cox regression analysis, risk factor score, receiving operating curve (ROC), KMplot prognosis analysis, nomogram and other methods were used to assess genes for their predictive power. Next, the accuracy and predictive power of twenty-gene identified in this process were assessed using the GEO OS dataset. Finally, we performed functional enrichment analysis on these twenty-gene.

Construction of gene signatures
A linear regression multiple regression model was developed for the underlying expression levels of genes for prognostic risk scores. The method chosen by lasso Cox was 10-fold cross-validation. According to the median cutoff value (the cutoff value refers to the content before the brackets of the HR value in each dataset) of the risk score, patients with OS were divided into high-risk and low-risk groups. Model prediction efficiency using the training set was evaluated by Kaplan-Meier log-rank test, time-dependent ROC curve analysis, Cox regression analysis, and risk factor score for validation and test sets. A nomogram was constructed using Iasso's guidelines.

Weighted correlation network analysis of genes
Based on the variance of gene expression in TARGET-OS data, the top 5000 genes were selected for WGCNA (Langfelder & Horvath, 2008). This analysis proceeded as: check for outliers in all samples, construct sample tree with hclust, and remove outliers according to cut height. To explore the correlation between expression data and clinical phenotypes, the sample tree and characteristic heat map were visualized. Subsequently, the strength of associations between pairs of nodes of the adjacency matrix aij was calculated as: sij = |cor (xi, xj)| aij = Sijβ. xi and xj are genes i and j. The vector of expression value, sij, indicates the strength of Pearson's correlation coefficient between genes i and j. The aij coding network connects genes i and j. β value is a soft threshold (power value). Further, the Scale-Free Topology Fit Index (scale-free R2) range from 0 to 1 is used to determine the scale-free topological model. Selecting a set of soft threshold powers (range: 1 to 20) assists in calculating scale-free topological model fitting. The soft threshold of β = 7 was used to define the adjacency matrix. The corresponding scale-free R2 value is 0.87, suggesting a satisfactory scale-free topology model.
In coexpression networks, the highest absolute association genes were clustered into the same module to generate a clustering dendrogram. Relationships between clinical traits and risk scores were analyzed by Pearson correlation and results were visualized by heat map analysis. Genes in the module were analyzed for gene ontology (GO) and KEGG pathway enrichment. Moreover, Cytoscape (Smoot et al., 2010) (version 3.7.2) was used to visualize the weighted coexpression network. gene ontology (GO) analysis, which includes annotation of biological processes (BPs), molecular functions (MFs), and cellular components (CCs), serves as a major bioinformatics tool to annotate genes and analyze the biological processes of these genes. Huimei Wang et al. analyzed the biological classification of DEGs, showing that changes in BPs of DEGs were significantly enriched in positive regulation of associated cell response, by GO analysis (Wang et al., 2018). Clusterprofiler (Yu et al., 2012) was used to study modules related to biological function for determining functional and pathway enrichment. A multiple testing correction was performed using hypergeometric test functions and the Benjamini-Hochberg method. The GOplot (Walter, Sánchez-Cabo & Ricote, 2015) package was used to visualize the enrichment analysis.

Construction of genes classifier for OS
Prognostically significant genes for OS in TARGET-OS data were analyzed and a total of 1151 genes were significantly associated with poor prognosis. These 1151 candidate genes were included in the lasso prognostic classifier for further screening and model construction. A twenty-gene classifier for the OS was developed (Figs. 1A and 1B). The gene information in the model is shown in Table 1. Patients were divided into a high-risk group (n = 42) and low-risk group (n = 42) based on risk scores. The median risk score was set as the cutoff (the cutoff value refers to the content before the brackets of the HR value in each dataset). Table 2 shows clinical characteristics of patients with OS in the training set according to their high risk and low risk scores. The Kaplan-Meier log-rank test suggested a significant difference between high-risk and low-risk groups in the training set (P < 0.001; Fig. 1C). In the time-dependent analysis of the ROC curve, AUCs for OS in the first, third, and fifth years were 0.94, 0.98, and 0.97, respectively (Fig. 1D).

Identification of a prognostic risk score model based on the training set
Based on univariate regression analysis (Cox's proportional hazard model), forest plots of selected genes with p-value and hazard ratios are shown in Fig. 2A. HR, Hazard Rate in Fig. 2A, shows that the p-value of selected genes was less than 0.05, showing the statistical difference, which cannot be ignored. Only when 95% CI conclude 1, it could be proved that these selected genes signatures were not significantly associated with the prognosis of patients with OS. Although the extremum of HR is very close to 1, they show that these selected genes might be associated with the prognosis of patients with OS. Risk scores and survival status of patients with OS are shown in Figs. 2B and 2C, respectively. These results indicate that patients with high risk scores have poor outcomes compared with patients with low risk scores (Fig. 2D). Using these data, a nomogram combining the classifier with clinicopathological features to predict the survival probability of patients with different risk scores was prepared (Fig. 3A). The calibration chart demonstrated that predicted three-year and five-year survival rates were very close to observed ratios (Fig. 3B).

Validation of twenty-gene signature for survival prediction in the validation set
Subsequently, the validation set was used to assess the power of the twenty-gene signature in predicting prognosis. OS of the low-risk group was superior to the high-risk group (P < 0.05; Fig. 4A). Also, time-dependent ROC analysis is used to assess the effectiveness of risk models in predicting outcomes. Areas under ROC was 0.81, 0.78, and 0.76 at one, three and five years, respectively, suggesting that the twenty-gene signature displays good predictive power (Fig. 4B). The C-index value is 0.944. The risk score (Fig. 4C) and survival status (Fig. 4D) of patients with OS, and distribution of risk scores of twenty-gene expression profiles are shown in a heat map of 53 patients in the validation set (Fig. 4E). These results suggest that the twenty-gene signature shows good prognostic value for patient survival. This finding is further validated since patients with high risk scores were associated with poorer prognoses compared with patients with low risk scores. Previous studies have investigated gene factors in an attempt to identify new prognostic OS markers (Goh et al., 2019;Guan, Guan & Song, 2020). Hence, I have compared the prognostic gene set for osteosarcoma, Cox univariate and multivariate analysis showed that BACE2, ING2 ALOX5AP, HLA-DMB, HLA-DRA, and SPINT2 were not the independent prognostic factors for osteosarcoma (Table 3).

Gene co-expression network analysis
The relationship between risk scores and gene expression profiles was evaluated using the top 5000 genes in the variance filter for WGCNA analysis. The red line (cut height = 8000) was used to remove the abnormal samples in the sample tree. TARGET-40-PAUXOZ-01A was excluded after removing outliers (Fig. 5). The sample dendrogram  and trait heat map placed selected samples into different sample clusters that provided clinical trait information (Fig. 6A). Independence and average connectivity of coexpression modules were determined by power (β) and scale R 2 value. A series of soft thresholds and corresponding performance power were plotted. The threshold for scale R 2 value was set at 0.85. The power value of seven is the threshold that first reaches the scale R 2 value, and was chosen as the soft threshold to construct and identify coexpression modules (Figs. 6B and 6C). The expression matrix was converted to an adjacency matrix, and subsequently into a topology matrix. Based on TOM, genes are clustered based on criteria of mixed dynamic shear trees using average connection-level distance. The minimum number of genes in each module was set at 7. Finally, a total of 11 modules were identified with the WGCNA package (Fig. 6D). Statistics of gene numbers in each module are shown in Table 4. Associations of these modules with clinical characteristics (including sex, age, relapse, OS, metastasis, major cancer sites, specific cancer sites, and risk scores) are shown in Fig. 6E. T module showed the highest correlation with the risk score. Further, the magenta module also was negatively correlated with sex and OS and positively correlated with major cancer sites. GSEA analysis between high-risk and low-risk groups can better illustrate the risk-score related to the biological process. But GSEA analysis between high-risk and low-risk groups only considers the expression level of gene sets. Analysis of WGCNA is based on the clinical relevant factors, which makes it sense, although its risk-score is low.

Functional enrichment analysis in the gene coexpression network of the magenta module
GO and KEGG enrichment analysis was used to characterize biological functions of genes in the magenta module as they related to risk scores. GO analysis showed that key genes related to OS were mainly enriched in peptide-antigen binding, clathrin-coated endocytic vesicle membrane, peptide binding, and MHC class II protein complex (Fig. 7A). KEGG analysis showed that the key genes related to OS were mainly enriched in antigen processing and presentation, phagosome, cell adhesion molecules, Staphylococcus aureus infection (Fig. 7B). Subsequently, the interaction network for enrichment pathways in the magenta module was visualized (Fig. 7C). Moreover, an interaction network was constructed to visualize genes in the coexpression magenta module (Fig. 7D).

DISCUSSION
OS is a disease involving complex interactions among many factors. Overall, OS disrupts cell signaling pathways, causing loss of bone tissue homeostasis (Otoukesh et al., 2018). The urgent need to obtain better clinical results highlights the related need for better diagnostic, prognostic, and predictive biomarkers (Ludwig & Weinstein, 2005). Presently, no specific markers are available for OS diagnosis. To reduce mortality and increase limb salvage, biomarkers are needed for early identification of disease (Smida et al., 2017).
Possible genetic biomarkers to address this need were identified in the present study. Lasso regression screened in twenty-gene in a training set. These genes have certain prognostic value in time ROC, risk factor, and Kaplan-Meier plots. Results were verified with a validation set.
One current study showed that inhibition of BNIP3 expression by baicalein treatment could inhibit cell apoptosis (Ye et al., 2015). Moreover, the MYC gene is also reported to be amplified in a subset of OS (Ladanyi et al., 1993). MYC promotes the proliferation of OS cells through the autophagy pathway (Mo et al., 2019). Cross-species genomics identified DLG2 as a tumor suppressor in OS (Shao et al., 2019). Further, depletion of KIF25 leads to the formation of actin stress fibers, which may be due to the changes of Rho signaling observed before microtubule destabilization (Wittmann & Waterman-Storer, 2001). In addition to the above studies of genes related to OS, relationships between other genes screened in this study and malignancies have been reported in the literature but have not been reported in OS. A lack of related research reports to help assess the potential of these genes as targets for the treatment and diagnosis of OS currently exits.
The current study also explored the relationship between risk scores and gene expression profiles using WGCNA analysis. Important genes related to OS were enriched in 11 different modules. Results showed that pathways related to inflammation and immunity were primarily enriched in the turquoise module. The gray and turquoise modules share the most pathways among all pairwise comparisons. Genes in these modules may play similar roles in OS.
LASSO is a method of shrinking and variable selection linear regression model. The purpose of LASSO regression is to obtain a subset of the predictors to minimize the prediction errors of the quantitative response variables. The lasso does this by constraining the model parameters, making the regression coefficients of some variables approach 0. ''Coxnet'' fits a lasso penalty, and its adaptive forms, such as adaptive lasso. Moreover, it treats the number of non-zero coefficients as another tuning parameter and simultaneously Previous studies report that CXCR3 may be an independent prognostic risk factor, suggesting a possible benefit of immunotherapy for OS (Tang et al., 2019). The susceptibility and severity of OS may also be related to functional polymorphism of inflammatory genes (Oliveira et al., 2007). Another study has shown that increased expression of MIF indicates an increased risk of metastasis, and MIF is related to angiogenesis and cell infiltration of OS. MIF can be used as a prognostic marker of OS and a potential therapeutic target (Kim et al., 2008). Therefore, clarifying molecular mechanisms of OS may facilitate the identification of novel therapeutic and prognostic targets. Limitations to the study exist. First, although these twenty-gene were identified to have certain prognostic value for OS, all data analyzed in the study were retrieved from the online databases. Thus, further experimental evidence, such as real-time PCR, western blot, immunohistochemistry assays, is required to fully elucidate the role of 20-gene signatures. Second, to determine the diagnostic accuracy of gene associations, a larger sample size would be useful for additional internal validation. Third, assessment of all clinically relevant influencing factors for OS was not included, and more clinical information and PFS-related data are needed. Finally, differentially expressed genes were not evaluated and nor was the need to add a normal control group and joint verification of multiple tumor sites.In conclusion, the current study proposes a 20-gene signature for diagnostic and prognostic purposes for OS. The twenty-gene signature is independently related to prognostic parameters of OS classification. Also, the signature is a good classifier for different subtypes of patients with OS. This signature may provide a new perspective on the prognosis of OS. The biological functions and pathways enriched in specific modules will be beneficial to the development of new therapeutic methods for the treatment of OS.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
The authors received no funding for this work.