A Four-Pseudogene Classifier Identified by Machine Learning Serves as a Novel Prognostic Marker for Survival of Osteosarcoma

Osteosarcoma is a common malignancy with high mortality and poor prognosis due to lack of predictive markers. Increasing evidence has demonstrated that pseudogenes, a type of non-coding gene, play an important role in tumorigenesis. The aim of this study was to identify a prognostic pseudogene signature of osteosarcoma by machine learning. A sample of 94 osteosarcoma patients’ RNA-Seq data with clinical follow-up information was involved in the study. The survival-related pseudogenes were screened and related signature model was constructed by cox-regression analysis (univariate, lasso, and multivariate). The predictive value of the signature was further validated in different subgroups. The putative biological functions were determined by co-expression analysis. In total, 125 survival-related pseudogenes were identified and a four-pseudogene (RPL11-551L14.1, HR: 0.65 (95% CI: 0.44–0.95); RPL7AP28, HR: 0.32 (95% CI: 0.14–0.76); RP4-706A16.3, HR: 1.89 (95% CI: 1.35–2.65); RP11-326A19.5, HR: 0.52(95% CI: 0.37–0.74)) signature effectively distinguished the high- and low-risk patients, and predicted prognosis with high sensitivity and specificity (AUC: 0.878). Furthermore, the signature was applicable to patients of different genders, ages, and metastatic status. Co-expression analysis revealed the four pseudogenes are involved in regulating malignant phenotype, immune, and DNA/RNA editing. This four-pseudogene signature is not only a promising predictor of prognosis and survival, but also a potential marker for monitoring therapeutic schedule. Therefore, our findings may have potential clinical significance.


Introduction
Osteosarcoma is the most common malignancy derived from bone, which is originated in mesenchymal tissue like many other sarcomas. It usually occurs in children and young adults as well as the elderly with a typical bimodal distribution in age [1]. The incidence of osteosarcoma is 4.4 per-million among those aged 0-24 years old, and in the second peak of age the disease is usually secondary, accompanied with Paget's disease or other bony lesions [2]. The primary therapy for osteosarcoma is surgery, but the survival of patients with treatment of surgery alone remains disappointing, around 15-17% [3]. At present, the therapy for osteosarcoma is a combination of chemotherapy and surgery, which could cure about 70% of the patients. For the patients with localized tumor, the reaction of chemotherapy is the best predictor to predict prognosis for now [4]. However, for patients with recurrent or metastatic disease, their overall survival is still not optimistic, which remained at 20% which is a kind of RNA-seq data scaled by gene length and sequencing depth (M) [24]. After excluding the samples with incomplete information, as well as pseudogenes with low expression (average TPM ≤ 1 across all samples), 94 osteosarcoma expression data of 1333 pseudogenes accompanied by corresponding clinical follow-up information was obtained. The expression value was processed as log2 (TPM + 1) for further analysis [25].

Construction of Prognostic Signature
Univariate cox regression analysis was performed to screen survival related pseudogenes. The pseudogenes with p-value < 0.05 were screened as candidate pseudogenes for next analysis (FDR for p-value adjustment). These candidate pseudogenes were further screened by LASSO regression, which is an efficient method for regression analysis of high-dimensional predictors [26], and was widely used in Cox proportional hazard regression model for survival analysis [27]. 10-fold cross-validation was conducted to select the tuning parameter to determine the magnitude of penalization, and was then used to choose stable features with nonzero coefficient [28][29][30][31][32]. Then, multivariate regression analysis was performed and pseudogenes with p < 0.05 was selected for predictive signature construction. Corresponding risk score of each patient was calculated by the formula: risk score = expression of RPL11-551L14.
Patients were divided into high and low risk groups according to the median risk score (1.23). The Kaplan-Meier (K-M) survival analysis was performed on the two groups. Furthermore, the receiver operating characteristic (ROC) analysis was performed to evaluate the predicting efficiency of the model to predict 3-, 5-, and 8-year survival and the area under curve (AUC) was calculated. All of these processes were conducted by R software (version 3.5.1).

Nomogram and Calibration
We combined several basic clinical information with the risk score signature to predict the 3-, 5-, and 8-year survival of osteosarcoma patients by plotting nomogram. Calibration plot, which demonstrates whether the predicted outcome is similar with the actual outcome, was used to evaluate the performance of the nomogram. In this section, R package rms was used to plot nomogram and calibration plot.

Correlation Analysis of the Four Pseudogenes and Annotation of Their Function
The Pearson correlation coefficients between the expression profiles of the four prognostic pseudogenes and protein-coding genes (PCGs) were calculated to determine the co-expression relationships of the pseudogenes and PCGs. The PCGs positively or negatively correlated with the four pseudogenes were considered as pseudogene-related PCGs for functional analysis. The online Database for Annotation, Visualization, and Integrated Discovery (DAVID, https://david.ncifcrf.gov/summary. jsp) [33] was used to explore the functions of pseudogenes-related PCGs and p-value < 0.05 was regarded significant, which is described in detail in our previous studies [19,21,34,35]. The biological processes and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways were presented.

Clinical Characteristics of the Patients
The osteosarcoma cohort includes data from a total of 274 samples. After excluding samples without clinical follow-up information, 94 osteosarcoma patients were involved in this study and the baseline of clinical characters were presented ( Table 1). The study flowchart is outlined in Figure 1.

Identification of Osteosarcoma Survival-Related Pseudogenes
The survival-related pseudogenes were screened using univariate Cox proportional hazards regression analysis. Using p < 0.05 as the cut off, 125 pseudogenes were identified by the univariate analysis as significant to overall survival ( Table 2 and Table S1). The top 20 significant survival-related pseudogenes were demonstrated by Forest plot (Figure 2A).

Construction of the Prognostic Pseudogene Signature for Osteosarcoma
The predictive model was constructed based on the 125 survival-related pseudogenes using LASSO regression. The LASSO Cox regression model was used to select variables in order to avoid overfitting of the predictive model, and 15 pseudogenes with non-zero coefficient were selected with the minimum criteria ( Figure 2B,C). Multivariate Cox proportional hazards regression analysis was then performed, and the significant pseudogenes (p < 0.05) were used to construct the model. A four-pseudogene (RPL11-551L14.1, RPL7AP28, RP4-706A16.3, and RP11-326A19.5) based risk model was finally obtained ( Figure 2E; Table 3), and a risk-score formula was established according to their expression levels. The four-pseudogene risk score for each patient was calculated, and the patients were divided into high and low risk groups (N = 47 each) according to the median risk score ( Figure 2D; Table 4, Table S2).

Predictive Value of the Four-Pseudogene Signature
Kaplan-Meier survival analysis was performed to assess the potential prognostic value of the four-pseudogene signature, which revealed significantly worse prognosis of the patients in the high-risk group (p < 0.0001; Figure 3A). In addition, ROC analysis was used to determine the accuracy of this signature in predicting the 3-, 5-, and 8-year survival ( Figure 3B), and the respective AUCs were 0.885, 0.878, and 0.796, indicating high sensitivity and specificity of this signature. In addition, a higher risk score was associated with shorter survival and more death events ( Figure 3C). Taken together, this four-pseudogene signature-based risk model can distinguish the high-risk osteosarcoma patients from the low-risk patients, indicating a prognostic significance for osteosarcoma. The Kaplan-Meier estimates of the OS for high-risk and low-risk patient cohorts grouping by the four-pseudogene signature (N = 94). The OS differences between the two groups were determined by the two-sided log-rank test. It can be concluded that higher risk scores are significantly associated with worse OS (p < 0.0001). (B) ROC analysis of sensitivity and specificity for the four-pseudogene signature in predicting the OS of patients for 3-, 5-, and 8-years. (C) The distribution of four-pseudogene risk score, patients' survival status and pseudogene expression signature were analyzed. As the risk score rising, the patients had a shorter survival time, more death events and the expression value of four pseudogenes ascended or decreased. OS: overall survival.

The Four-Pseudogene Signature Is an Independent Prognostic Predictor of Osteosarcoma
To detect the possible contribution of other factors like age, gender, and metastatic status on patient survival, we also grouped the patients according the above variables, and applied the four-pseudogene signature to different sub-groups. There were 40 males and 54 females in the osteosarcoma cohort, which did not differ significantly in terms of the risk score distribution ( Figure 4A). In addition, the high-risk patients had significantly shorter overall survival (OS) (p < 0.01) in both the male and female groups ( Figure 4B,C) with respective five-year AUC values of 0.753 and 0.805 ( Figure 5A,B), indicating that the four-pseudogene signature was independent of gender. There was also no significant difference in the distribution of the risk score of younger (<18 years, N = 72) and older (>18 years, N = 22) patients ( Figure 4D). Furthermore, the low-risk patients had significantly longer OS (p < 0.01; Figure 4E,F), and similar five-year AUC values (0.888 and 0.861; Figure 5C,D) in both age-stratified groups, indicating that the four-pseudogene signature was also independent of age. Based on metastatic status, the patients were divided into non-metastatic and metastatic groups. The both have a similar risk score and a shorter OS in high risk group ( Figure 4G-I), and the five-year AUC values were 0.87 and 0.89 for non-metastatic and metastatic respectively ( Figure 5E,F), indicating that the four-pseudogene signature was independent of metastatic status. Taken together, the four-pseudogene signature can be applied to osteosarcoma patient subgroups stratified on the basis of clinical characteristics, and is an independent prognostic predictor for osteosarcoma. . Kaplan-Meier analysis with two-sided log-rank test was performed to estimate the differences in OS between the low-risk and high-risk patients in male cohort and female cohort. (D) The dot plot showed the distribution of risk score based four-pseudogenes between different age group (<18 and ≥18). The risk score is not different between the two groups (p-value > 0.05). (E,F) Kaplan-Meier analyses of patients with osteosarcoma in different age cohorts, grouping based on their age at initial diagnosis: <18 (N = 72), ≥18 (N = 22). Kaplan-Meier analysis with two-sided log-rank test was performed to estimate the differences in OS between the low-risk and high-risk patients in <18 and ≥18. (G) The dot plot showed the distribution of risk score based four-pseudogenes between different metastatic status group (non-metastatic = 72 and metastatic = 22). The risk score is not different between the two groups (p-value > 0.05). (H,I) Kaplan-Meier analyses of patients with osteosarcoma in different metastatic status cohorts. Kaplan-Meier analysis with two-sided log-rank test was performed to estimate the differences in OS between the low-risk and high-risk patients in metastatic and non-metastatic groups.

Combined Pseudogene and Clinical Risk Score Is an Independent Predictor of Survival
To combine the basic clinical status (age, gender, site, and metastatic status) with the four-pseudogene signature for predicting patient survival, we constructed a multivariable cox probability hazard model to predict the 3-year, 5-year, and 8-year survival, and visualized it by a nomogram with an assigned score for each term ( Figure 6A). Both the forest plot and the nomogram indicated that the risk score is an independent predictor for patient survival ( Figure 6B). We also tested the stability and accuracy of the nomogram in terms of agreement between prediction and actual survival. The model performance is shown in Figure 6C and represents perfect prediction.

Functional Analysis of the Predictive Pseudogenes
Correlation analysis was used to identify the protein coding genes co-expressing with the pseudogenes, followed by gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis to determine their function. The top 10 correlated PCGs of each pseudogene and their relationship are presented as a network ( Figure 7A). The genes with highest correlation to each pseudogene were as follows: RPL11-551L14.1-EHMT2, RPL7AP2 8-HMG20B, RP4-706A16.3-ING5, and RP11-326A19.5-ABHD2 ( Figure 7B). The functions of the pseudogenes as per GO (BP, biological process) and KEGG analyses are shown in Figure 7C,D. We found RP11-326A19.5 is associated with cell migration and adhesion such as wound healing, focal adhesion, and pathways in cancer. The RPL11-551L14.1 is participating in regulation of transcription and molecular metabolism. As for RP4-706A16.3, it plays a role in translation of ribosome. Lastly, RPL7AP28 may be related to immune regulation of MHC. A simple summary table for all results was in Table 5. The gene ontology enrichment analysis for the four pseudogenes correlated genes were carried out in DAVID to reveal the potential function of the four pseudogenes. The top five significant biological process terms for each pseudogene were shown. (D) The KEGG pathway enrichment analysis for the four pseudogenes correlated genes were carried out in DAVID to reveal the potential pathways in which the four pseudogenes are involved. The top five significant pathway terms for each pseudogene were shown.

Comparison and Combination of Gene-Signature and Pseudogene-Signature
To compare the performance of coding genes as signature with pseudogenes, we used the same method listed in the manuscript, and obtained a two-gene signature predicting the prognosis of osteosarcoma patients (MT1A, HR: 1.41 (95% CI: 1.18-1.69); MPV17L2, HR: 0.39 (95% CI: 0.24-0.65)) ( Figure 8A,B). The Kaplan-Meier showed that the two-gene signature can distinguish high risk and low risk patients efficiently. The OS difference between the two groups was significant and high-risk group was associated with worse OS (p < 0.0001) ( Figure 8C). ROC analysis revealed that the AUC of the signature predicting the 3-, 5-, and 8-year survival are 0.821, 0.861, and 0.781 respectively ( Figure 8D). Compared with the pseudogene model (3-, 5-, and 8-year AUCs were 0.885, 0.878 and 0.796), it seems that the pseudogene-based model has a better performance in our study. Additionally, we combined the gene signature and pseudogene signature together to explore whether the pseudogenes can provide an additional power to the prediction of gene signature or they can work together to give a better performance than alone ( Figure 9A). The ROC curve and K-M plot showed that the combined signature can also distinguish the high-risk and low-risk patients and the performance of predicting prognosis is satisfactory ( Figure 9B,C). Comparing the three signatures by AUC of ROC curve, the combined signature performed best in sensitivity and specificity in which the AUC is 0.956 (5 years), indicating the genes and pseudogenes can work together to make a better prediction ( Figure 9D). Forestplot showed results of multivariate cox analysis. Two significant coding genes in multivariate cox analysis were screened out (p_value < 0.05) as candidate to construct the risk model. (C) The Kaplan-Meier estimates of the OS for high-risk and low-risk patient cohorts grouping by the two-coding gene signature (N = 94). The OS differences between the two groups were determined by the two-sided log-rank test. It can be concluded that higher risk scores are significantly associated with worse OS (p < 0.0001). (D) ROC analysis of sensitivity and specificity for the two-gene signature in predicting the OS of patients for 3, 5, and 8 years.  (N = 94). The OS differences between the two groups were determined by the two-sided log-rank test. It can be concluded that higher risk scores are significantly associated with worse OS (p < 0.0001). (D) ROC analysis of sensitivity and specificity for the comparison among the two-gene signature, four-pseudogene signature, and combination signature in predicting the OS of patients for 5 years and combination signature performed better than the other two.

Discussion
With the application of chemotherapy in the 1970s, the treatment of osteosarcoma has made great progress. However, survival rate of metastatic and relapse cases remains to be unsatisfactory, and the poor prognosis of such patients is the major problem for osteosarcoma [36]. Thus, identification of novel biomarkers to predict patients' outcome might help to customize more personalized therapy and would be able to improve their prognosis. Growing evidence supports the role of pseudogenes in the oncogenesis and progression in different cancers [23,37,38] and there are also a few studies which recognized the importance of pseudogenes in osteosarcoma [39,40]. High throughput RNA-seq has paved the way for exploitation of various biomarkers for the diagnosis and prognosis of many cancers including osteosarcoma [18,41,42]. In this study, we took a systematic analysis for the potential role of pseudogenes as prognostic predictor and provided first evidence of survival related pseudogenes of osteosarcoma. We made several important discoveries during the course of this analysis.
First, we identified 125 survival-related pseudogenes using univariate Cox analysis, and most of them are risk factors (91/125) which may play the oncogenesis role. Second, we identified a four-pseudogene signature and established a scoring system that was significantly associated with the OS of osteosarcoma patients. This signature helped to stratify the low-and high-risk groups and predicted the OS of osteosarcoma patients with high sensitivity and specificity. Out of four, RP4-706A16.3, is a risk factor and the another three are protective factors. Of the four pseudogenes we have identified, none were reported before, suggesting that these pseudogenes were newly found and required more attention. Third, in order to validate the applicability in different patients and extend the signature to various subgroups, K-M survival analysis and ROC curve analysis were performed in different subgroups. We found that it was independent of other potential predictors-including age, gender, and metastatic status-and the performance of predicting survival was satisfactory. As for the important clinical feature-stage, we did not perform related analysis on it due to the incompleteness of the stage information. Further studies and data are needed to uncover the role of stage. We visualized the pseudogene signature and the other clinical information by a nomogram to simplify the use of this signature in clinical practice. Last, to further understand the biological function and explore the underlying oncogenic mechanism of the four pseudogenes, co-expression analysis was employed. Results showed the four pseudogenes were involved in multiple biological processes and pathways including malignant phenotype, immune, and DNA/RNA editing, which might be the underlying mechanism of osteosarcoma progression. Last, we compared the gene signature and pseudogene signature by ROC curve and found the pseudogene signature is a little better than the gene signature. The AUC of them were very close indicating the two signatures may have similar performance. Maybe the patient sample size leads to these results, a large size cohort is needed to verify this finding.
There are some limitations and shortcomings in this study that cannot be ignored. First, this study was mainly focused on data mining and data analysis, which are based on methodology and the results were not validated using experiments. Further experiments are needed to verify the findings of this study. Second, the datasets we were able to obtain were limited as we could only obtain one osteosarcoma dataset that contained both patient RNA-seq data and clinical follow-up information. If there were another dataset that matched our requirements, it could have been used to further validate our results. Additional datasets should be included to obtain a better result. Besides, there is currently no other study exploiting pseudogene signature for osteosarcoma, meaning that we also cannot validate our result in another independent study. Third, when constructing a pseudogene signature for prognosis, one must take it into consideration of the application of such a model. Since different methods of detecting pseudogenes might lead to different results, the procedure of detection, quantification, and determination of transcriptional activity of pseudogenes must be standardized [43]. Therefore, the four newly found prognosis-related pseudogenes deserve more attention and the next step for our research is to validate our results using experiments. We hope that these results could give other researchers inspiration for further study.

Conclusions
Taken together, we identified a novel four-pseudogene signature for osteosarcoma which is a promising independent survival predictor and served as an important biomarker for guiding the clinical treatment of osteosarcoma to improve management for patients. In addition, our findings provide new insights into exploring the underlying molecular mechanisms of osteosarcoma, and present a promising new prognostic marker. Therefore, our findings in the signature have a very promising clinical significance.
Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4425/10/6/414/s1, Table S1: The results of univariate analyses for the significant survival-related pseudogenes, Table S2: The detail of clinical information for the 94 osteosarcoma patients. Funding: This research received no external funding.