Identification of potential gene signatures associated with osteosarcoma by integrated bioinformatics analysis

Background Osteosarcoma (OS) is the most primary malignant bone cancer in children and adolescents with a high mortality rate. This work aims to screen novel potential gene signatures associated with OS by integrated microarray analysis of the Gene Expression Omnibus (GEO) database. Material and Methods The OS microarray datasets were searched and downloaded from GEO database to identify differentially expressed genes (DEGs) between OS and normal samples. Afterwards, the functional enrichment analysis, protein–protein interaction (PPI) network analysis and transcription factor (TF)-target gene regulatory network were applied to uncover the biological function of DEGs. Finally, two published OS datasets (GSE39262 and GSE126209) were obtained from GEO database for evaluating the expression level and diagnostic values of key genes. Results In total 1,059 DEGs (569 up-regulated DEGs and 490 down-regulated DEGs) between OS and normal samples were screened. Functional analysis showed that these DEGs were markedly enriched in 214 GO terms and 54 KEGG pathways such as pathways in cancer. Five genes (CAMP, METTL7A, TCN1, LTF and CXCL12) acted as hub genes in PPI network. Besides, METTL7A, CYP4F3, TCN1, LTF and NETO2 were key genes in TF-gene network. Moreover, Pax-6 regulated four key genes (TCN1, CYP4F3, NETO2 and CXCL12). The expression levels of four genes (METTL7A, TCN1, CXCL12 and NETO2) in GSE39262 set were consistent with our integration analysis. The expression levels of two genes (CXCL12 and NETO2) in GSE126209 set were consistent with our integration analysis. ROC analysis of GSE39262 set revealed that CYP4F3, CXCL12, METTL7A, TCN1 and NETO2 had good diagnostic values for OS patients. ROC analysis of GSE126209 set revealed that CXCL12, METTL7A, TCN1 and NETO2 had good diagnostic values for OS patients.


INTRODUCTION
Osteosarcoma (OS) is a type of primary malignant bone cancer that causes public health concern, especially in children and adolescents (Isakoff, Meltzer & Gorlick, 2015;Lindsey, Markel & Kleinerman, 2017). Several treatment strategies of OS such as surgical resection, traditional adjuvant chemotherapy and radiotherapy have been favored by clinical oncologists in the past few decades (Nagarajan et al., 2011). Accordingly, the 5-year survival rate has been raised to approximately 70% (Bielack et al., 2002). However, it is estimated that 80% of OS patients may suffer from the micro-metastasis, which cannot be detected at early diagnosis (Messerschmitt et al., 2009). Although multiple methods for the diagnosis and treatment of OS have been developed, new methods for the prevention and treatment of OS are still needed. The pathogenesis of OS progression remains not fully understood. Therefore, the identification of effective diagnostic makers and exploring underlying molecular etiology of OS is a pressing need.
The emergence of high-throughput sequencing technology has become an effective way to illuminate the pathogenic genes in a variety of human diseases, which help to explore pathogenesis and develop biomarkers. Many research groups have screened multiple biomarkers associated with OS by analyzing gene expression data. For example, Sun et al. evaluated the difference of genes in the expression level between OS metastasis and OS non-metastasis and discovered that TGFB1, LFT3, KDM1A, and KRAS may participate in the occurrence of OS. Xiong et al. (2015) found that CCT3, COPS3 and WWP1 were involved in the OS development by integrating gene expression and genomic aberration data. Liu, Zhao & Chen (2017) constructed a co-expression network based on a Gene Expression Omnibus (GEO) dataset and identified many potential biomarkers such as CTLA4 and PBF for diagnosis and treatment of OS. However, the molecular mechanisms of OS initiation and development have not been fully explored.
In this study, we retrieved GEO database and obtained four OS datasets. Subsequently, the differentially expressed genes (DEGs) between OS samples and normal samples were obtained and subjected to functional analysis. A protein-protein interaction (PPI) network was constructed followed by the establishment of transcription factor (TF)-target regulatory network. Following this, we downloaded two published GEO datasets of OS as the validation set for assessing the expression levels of key candidate genes. Finally, the receiver operating characteristic (ROC) analysis was conducted to evaluate the diagnostic values of key candidate genes. This study will discover novel gene signatures associated with OS, providing new trains of thought for the diagnosis and treatment of OS.

Data acquisition
The datasets were retrieved from the National Center for Biotechnology Information-GEO (http://www.ncbi.nlm.nih.gov/geo/) repository using the key terms of 'osteosarcoma' AND 'Homo sapiens' [porgn]. All selected datasets in this study should meet the following criteria: (i) the datasets contained genome-wide expression data of tumor tissues and normal control tissues of OS patients; and (ii) all data were standardized or raw data. As shown in Table S1, a total of five datasets were obtained. Notably, the GSE9508 dataset contained over 50% missing data and was subsequently removed from the following analysis. Eventually, four datasets (GSE12865, GSE19276, GSE87624 and GSE99671) were included in this study, which included 118 OS tissues and 28 normal bone tissues. The GSE12865 series (GPL6244 platform) included a total of 14 samples (12 OS and two normal control tissues). The platform for GSE19276 was GPL6848 including 44 OS and five normal control bone tissue. The platform for GSE87624, consisting of 44 OS and three normal control bone tissue samples, was GPL11154. GSE99671 was in GPL20148 platform, which contained 18 OS and 18 normal control bone tissue samples. The platform and series matrix files were downloaded. The dataset information was listed in Table S1. The impact of different platforms on the results, we normalized the data through the log function, and centralized and standardized the scale function to eliminate the impact of the dimension on the data structure.

Data pre-processing and DEGs identification
The standardized data from included datasets were firstly processed as follows: (i) the probes that mapped to several genes were deleted; and (ii) if the gene was matched by multiple probes, the probe with the greatest gene expression value would be retained. Overall, there were overlapping 14981 genes among four datasets. Subsequently, MetaMA (https://cran.r-project.org/web/packages/metaMA/), an R package, was used to combine data from four GEO datasets. Individual p-values were obtained using Limma R package. The inverse normal method was used to combine P values in meta-analysis (Marot, Mayer & Jaffrézic, 2009). We carried out the multiple comparison correction by Benjamini & Hochberg approach to acquire false discovery rate (FDR). Herein, the DEGs between OS tissues and normal controls were defined according to the cutoff of false discovery rate (FDR) < 0.05 and those DEGs with different directionality were removed from this study. Finally, the hierarchical clustering analysis of top 100 DEGs was also carried out by pheatmap package in R software.

Functional enrichment analyses
To systematically explore the underlying biological functions of identified DEGs, the Metascape (http://metascape.org/gp/index.html), an online tool that integrates multiple data resources such as Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) and Universal Protein (Uniprot) database, was used to perform GO and KEGG pathway enrichment analysis of DEGs. GO analysis involved three categories: biological process (BP), cellular component (CC) and molecular function (MF). P values < 0.05 was set as the thresholds for significant enrichment analyses.

Protein-protein interaction (PPI) network analysis
The Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database, a freely web-based analytic tool, can predict the interactions among proteins (Szklarczyk et al., 2019). Here, a PPI analysis was conducted to examine the interactive associations between protein products of DEGs. The Cytoscape software (http://www.cytoscape.org) was utilized to establish a PPI network. In addition, the CytoNCA (http://apps.cytoscape. org/apps/cytonca) was used to analyze topological characteristics of PPI network. The top 15 nodes were considered as hub genes according to the degree value.

TF-candidate gene network analysis
TFs can bind to specific DNA sequences in promoter region of target gene to regulate gene expression. The top 20 up-and down-regulated genes were regarded as the candidate genes. The DNA sequences (2 kb) in the upstream promoter region of these candidate genes were firstly downloaded from the University of California, Santa Cruz (http://www.genome.ucsc.edu/) databases. Then, we employed online match tool from TRANSFAC (http://genexplain.com/transfac) to predict potential TFs that targeted candidate genes. Notably, the TFs that had only one binding site with target genes were retained in this study. Finally, the Cytoscape software was used to build a transcriptional regulatory network and perform node degree analysis.

Evaluation the expression level and diagnostic values of key genes
Two published OS datasets were obtained from GEO database for the expression level evaluation of seven key DEGs (CAMP, METTL7A, TCN1, LTF, CXCL12, CYP4F3 and NETO2). Then, we performed a ROC analysis using the pROC package in R software (http://web.expasy.org/pROC/) to evaluate the diagnostic value of these seven DEGs. Accordingly, the area under the curve (AUC) was computed and the ROC curve was built. The AUC value > 0.8 showed a good diagnostic value for OS.

Identification of DEGs
After data pre-processing, a total of 1,059 DEGs (569 up-regulated genes and 490 downregulated genes) were identified between OS samples and normal controls according to methods described above. The clustering analysis indicated that top 100 DEGs could distinguish OS samples and controls from four datasets (Fig. S1). The top 20 up-and down-regulated genes were listed in Table 1.

GO and KEGG enrichment analysis of DEGs
The GO enrichment analysis of DEGs showed that a total of 214 GO terms were enriched, including 194 GO-BP terms, 15 GO-CC terms and 5 GO-MF terms (Table S1). Specifically, for GO-BP analysis, these DEGs were strongly associated with positive regulation of cell death, myeloid cell activation involved in immune response and regulation of protein kinase activity. Meanwhile, protein domain specific binding and transcription factor binding were significantly enriched GO-MF terms. Many DEGs were primarily involved in multiple GO-CC terms, such as anchored component of membrane and tertiary granule. The top 20 clusters of significantly enriched GO terms were displayed in Fig. 1. In addition, these DEGs were markedly enriched in 20 KEGG pathways such as regulation of lipolysis in adipocytes, protein processing in endoplasmic reticulum and pathways in cancer (Table 2). Notably, the top 20 up-regulated genes did not enrich in any KEGG pathway. However, four of top 20 down-regulated genes played vital roles in multiple significantly enriched KEGG pathways, including FABP4 (fatty acid binding protein 4), CXCL12 (C-X-C motif chemokine ligand 12), CXCL12 (C-X-C motif chemokine ligand 12) and CAT (Catalase). More specifically, FABP4 was involved in regulation of lipolysis in adipocytes and CXCL12 participated in pathways in cancer and axon guidance (Table 2). CYP4F3 was closely correlated with arachidonic acid metabolism pathway and CAT was significantly enriched in biosynthesis of amino acids and AMPK signaling pathway (Table 2).

Evaluation the expression level and diagnostic values of key genes
An external dataset (GSE39262) was obtained from GEO database, which contained 10 human osteosarcoma cell lines and five untransformed cell lines samples. The platform for this dataset was GPL96 [HG-U133A] Affymetrix Human Genome U133A Array. GSE126209 was downloaded from the GEO database, which inclued 12 osteosarcomas tumors and 11 adjacent normal tissues samples. The platform for this dataset was GPL20301 Illumina HiSeq 4000. Seven key genes (CAMP, METTL7A, TCN1, LTF, CXCL12, CYP4F3 and NETO2) were selected to verify in GSE39262. Among them, CAMP, METTL7A, TCN1, LTF and CXCL12 acted as hub genes in PPI network. METTL7A, CYP4F3, TCN1, LTF and NETO2 were key genes in TF-gene network. Moreover, Pax-6 regulated four key genes (TCN1, CYP4F3, NETO2 and CXCL12). The gene differential expression analysis of the GSE39262 dataset revealed that NETO2 was significantly up-regulated while CXCL12, METTL7A and TCN1 were significantly down-regulated, which were consistent with our integration analysis ( Fig. 4; Table S2). The gene differential expression analysis of the GSE126209 dataset displayed that NETO2 was significantly up-regulated while CXCL12 was significantly down-regulated, which were consistent with our integration analysis ( Fig.  5; Table S3). ROC analysis is a commonly used method to evaluate the value of genetic diagnosis and has been used in previously biomedical works (Le et al., 2019;Le, Yapp & Yeh, 2019;Thi, Trang & Khanh, 2020). Additionally, the results of the GSE39262 dataset showed that five genes had good diagnostic values for OS (CXCL12, CYP4F3, METTL7A, NETO2 and TCN1; Fig. 6). The AUC of CXCL12 was 1.000 and the specificity and sensitivity of this model were 100.0% and 100%, respectively. The AUC of CYP4F3 was 0.840 and the specificity and sensitivity of this model were 80.0% and 80.0%, respectively. The AUC of METTL7A was 0.900 and the specificity and sensitivity of this model were 100.0% and 70.0%, respectively. The AUC of NETO2 was 0.860 and the specificity and sensitivity of this model were 100.0% and 80.0%, respectively. The AUC of TCN1 was 0.820 and the specificity and sensitivity of this model were 80.0% and 90.0%, respectively. Ultimately, the results of the GSE126209 dataset showed that four genes had good diagnostic values for OS (CXCL12, METTL7A, NETO2 and TCN1; Fig. 7). CXCL12 was 1.000 and the specificity and sensitivity of this model were 100.0% and 100%, respectively. The AUC of METTL7A was 0.856 and the specificity and sensitivity of this model were 100.0% and 83.3%, respectively. The AUC of NETO2 was 0.833 and the specificity and sensitivity of this model were 100.0% and 75.0%, respectively. The AUC of TCN1 was 0.879 and the specificity and sensitivity of this model were 81.8% and 83.3%, respectively.

DISCUSSION
OS is a common malignant bone tumor and originates from mesenchymal stromal cells (MSCs) (Xiao, Hogendoorn & Cleton-Jansen, 2013). The heterogeneous histopathological characteristics and complex genomic landscape of OS have been major challenges for elaborating underlying the molecular pathogenesis of OS. In this study, we included four OS datasets and identified 1,059 DEGs (569 up-regulated DEGs and 490 down-regulated DEGs) between OS and normal samples. These genes were significantly enriched in 54 KEGG pathways such as pathways in cancer. Moreover, CAMP, METTL7A, TCN1, LTF and CXCL12 served as hub genes in PPI network. METTL7A, CYP4F3, TCN1, LTF and NETO2 were key players in TF-target gene regulatory network. Interestingly, TCN1, CYP4F3,  NETO2 and CXCL12 were all regulated by Pax-6. Additionally, the expression patterns of key genes (CAMP, METTL7A, TCN1, LTF, CXCL12, CYP4F3 and NETO2) were selected to verify in two published OS datasets (GSE39262 and GSE126209). CAMP, also known as hCAP18 or LL37, is an antimicrobial peptide gene in human (Larrick et al., 1995). The C-terminal of the protein product of CAMP contains a 37amino acid-long peptide with broad spectrum-antibacterial activity (Vandamme, Luyten & Schoofs, 2012). There are positive expressions of CAMP in the multiple cell systems, such as epithelial cells, neutrophils and macrophages (Dhawan et al., 2015;Frew et al., 2014;Li et al., 2018). Wu et al. (2012) suggested that bone marrow stroma could express CAMP, which may be a potential ex vivo priming factor for hematopoietic stem progenitor cells to promote hematopoietic reconstitution after transplantation. Later, Coffelt et al.
(2019) discovered that CAMP expression level was elevated in MSCs compared to that in ovarian cancer cells. Herein, our analysis showed that CAMP was the most down-regulated gene in patients suffering from OS. Besides, CAMP acted as a hub gene in PPI network, suggesting that this gene may be involved in the pathologic mechanism of OS. Although the underlying role of CAMP on the initiation and progression of OS has not been investigated, available evidence showes that CAMP plays significant roles in several cancers, including breast cancer, lung cancer and pancreatic cancer (García-Quiroz et al., 2016;Sainz Jr et al., 2015;Von Haussen et al., 2008). More notably, existing data indicated that CAMP had either carcinogenic or anti-cancer effects (Chen et al., 2018;Wu et al., 2010). Therefore, the influence of CAMP on OS occurrence and development needs to be further clarified in future.
Our gene differential expression revealed that CXCL12 and TCN1 were down-regulated in OS patients, which were verified in a validation dataset. Moreover, these two genes also acted as hub genes in PPI network. In addition, up-regulated NETO2 and down-regulated CYP4F3 had high degree in TF-gene regulatory network. Interestingly, CXCL12, TCN1, NETO2 and CYP4F3, regulated by Pax-6, exhibited important diagnostic values for OS. CXCL12 is also called stromal cell-derived factor-1 (SDF-1) and can bind to G-proteincoupled chemokine receptor CXCR4 (Nagasawa, 2014). Increasing studies suggested that CXCL12/CXCR4 axis played pivotal roles in tumor growth and development (Balkwill, 2004;Lu et al., 2015;Perissinotto et al., 2005). Li et al. (2018). highlighted that epigenetic regulation of CXCL12 by DNA methyltransferase 1 was associated with the metastasis and immune response in OS. Previous reports also indicated that down-regulation of CXCR4 induced OS cell apoptosis via suppressing PI3K/Akt/NF-κβ pathway (Pollino et al., 2019). However, there is no directive evidence to support the involvement of TCN1, NETO2 and CYP4F3 in OS. Notably, Pax-6 is a highly conserved evolutionarily TF and belongs to paired box TF family (Mansouri & Gruss, 1996). Several studies have pointed out that Pax6 participated in the regulation of cancer cell proliferation and progression (Shyr et al., 2010;Zong et al., 2011). Yang et al. (2019) established a TF-top 20 DEGs regulatory network by integrating and analyzing three GEO datasets (GSE66673, GSE49003 and GSE37552), and found that Pax-6 down-regulated BMP6 expression in non-metastatic OS samples. Taken together, we inferred that CXCL12/TCN1/NETO2/CYP4F3-Pax-6 axis may be implicated in the pathogenesis of OS, and four genes (CXCL12, TCN1, NETO2 and CYP4F3) were novel diagnostic biomakers for OS.
METTL7A and LTF are reported to act as tumor suppressor genes (Qi et al., 2017;Zhang et al., 2011;Zhang et al., 2015). Similarly, our findings showed that the expressions of METTL7A and LTF were decreased in OS samples. Moreover, these two genes were both hub genes in PPI analysis and key gene nodes in TF-gene regulatory analysis. These results implied that METTL7A and LTF may be correlated with underlying mechanisms of OS. However, the potential effects of METTL7A and LTF down-regulation on OS progression needs to be further investigated.
Although we have identified multiple novel gene signatures associated with OS, there are still limitations in this work. Our conclusion was drawn based on an integrated bioinformatic analysis. Therefore, additional experiments are required to confirm our findings. In addition, a larger sample size verification will also improve the reliability of our conclusion. Moreover, the clinical information should be collected to evaluate the diagnostic value of biomarkers for OS patients. Finally, the biological significances of key biomarkers will be investigated in model systems or cell lines.
In summary, a total of 1,059 DEGs were identified between OS and normal samples. Among them, up-regulation of NETO2 and down-regulation of METTL7A, TCN1, and CXCL12 may be potential gene signatures related to OS. Pax-6 was also probably associated with the pathological process of OS. However, a comprehensive bioinformatics analysis with larger sample size and in vivo or in vitro assays should be performed to confirm our results.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This work was supported by the Tianjin Union Medical Center Research Projec-tïĳĹ2019JZPY06, 2019JZPY02. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.