Construction of an integrated human osteosarcoma database, HOsDb, based on literature mining, microarray analysis, and database retrieval

Osteosarcoma (OS) is the most frequent primary malignancy of bone, with a high incidence in adolescence. This study aimed to construct a publicly available, integrated database of human OS, named HOsDb. Microarray data, current databases, and a literature search of PubMed were used to extract information relevant to human OS-related genes and their transcription factors (TFs) and single nucleotide polymorphisms (SNPs), as well as methylation sites and microRNAs (miRNAs). In total, we identified 7191 OS tumor-related genes, 763 OS metastasis-related genes, and 1589 OS drug-related genes, corresponding to 190,362, 21,131, and 41,135 gene-TF pairs, respectively; 3,749,490, 358,361, and 767,674 gene-miRNA pairs, respectively; and 28,386, 2532, and 3943 SNPs, respectively. Additionally, 240 OS-related miRNAs, 1695 genes with copy number variations in OS, and 18 genes with methylation sites in OS were identified. These data were collated to construct the HOsDb, which is available at www.hosdatabase.com. Users can search OS-related molecules using this database. The HOsDb provides a platform that is comprehensive, quick, and easily accessible, and it will enrich our current knowledge of OS.

Many databases have been developed to investigate the association between certain molecules of interest and disease pathogenesis from different perspectives. For instance, Online Mendelian Inheritance in Man (OMIM) [8] contains information on the relationship between the phenotype and genotype of all known Mendelian disorders. Wikigenes [9] is a portal that provides information about genes, proteins, and chemical compounds, and their reported associations with various diseases. The miR2Disease [10] and Human microRNA Disease Database (HMDD) [11] aim to provide a comprehensive collection of microRNAs (miRNAs) associated with various human diseases. MethyCancer [12] contains highly integrated data regarding cancer-related genes, DNA methylation sites, and information on cancer from public resources. TRANSFAC is a database of transcription factors (TFs), which offers an integrated system for predicting gene expression regulation [13]. Although research data regarding OS have accumulated during the past decades, to the best of our knowledge, there is only one available database specifically focusing on OS molecular biology, called Osteosarcoma Database [14]. However, this database includes only 911 OS-associated genes and 81 miRNAs collected through manual literature mining, and it provides no information regarding other OS-related molecules, such as TFs or methylation sites [14]. The development of high-throughput laboratory techniques, such as microarray analysis, has enabled the generation of large quantities of data associated with OS, which are an important resource for the exploration of potential OS-related molecules, including genes, miRNAs, and copy number variations (CNVs) [15][16][17][18]. While these data provide insight into certain aspects of OS, they have not been assembled in a structured format. Thus, there is a need to establish an integrated, OS-specific database or platform of OS-related genes, TFs, methylation sites, and miRNAs.
We collected detailed OS-related data, including OS-related genes, TFs, single nucleotide polymorphisms (SNPs), miRNAs, methylation sites, and CNVs by analyzing several microarray deposits in the Gene Expression Omnibus (GEO) data repository, searching current databases, and mining the literature in PubMed. Using these data, we aimed to construct a publicly available, integrated database of human OS to facilitate the exploration of human OS-related molecules and create a unique resource for research into this disease.

Construction and content

Database construction
The integrated database of human OS, named HOsDb, aims to provide a high-quality collection of human OS-related genes, methylation sites, CNVs, miRNAs, TFs, and SNPs based on literature mining, microarray analysis, and database retrieval. The data collection and processing steps are illustrated in Fig. 1.

OS-related genes
Initially, mRNA expression microarrays related to OS were downloaded from the GEO database [19]. Detailed information regarding the datasets used, such as the GEO accession number and sample type and size, is shown in Table 1.

OS-related miRNAs
Normalized miRNA expression microarray data related to OS were also downloaded from the GEO database (Table 1). Differentially expressed miRNAs (DEMs) were identified using the limma package with a cutoff value of |logFC| > 1 and FDR < 0.05. Known OS-related miRNAs were extracted from the miR2Disease database (updated on 2011.04.14) [10] and the HMDD database (updated on 2012.09.09) [11]. In total, 209 OS-related DEMs were identified based on the miRNA expression microarray data, and 31 known OS-related miRNAs were identified in the miR2Disease and HMDD databases, generating a final count of 240 OS-related miRNAs for inclusion (Table 2).
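The selection rule above can be illustrated with a minimal Python sketch. It is not the authors' pipeline, which used the R limma package; it only applies the stated cutoff (|logFC| > 1 and FDR < 0.05) to a table of per-miRNA statistics and merges the result with a list of known miRNAs. The function names, the record layout, and the miRNA symbols and values are hypothetical, and the FDR is assumed here to be a Benjamini-Hochberg adjustment of raw p-values.

```python
from typing import Dict, List, Set, Tuple


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_dems(
    stats: Dict[str, Tuple[float, float]],
    logfc_cutoff: float = 1.0,
    fdr_cutoff: float = 0.05,
) -> Set[str]:
    """Select miRNAs with |logFC| > logfc_cutoff and FDR < fdr_cutoff.

    `stats` maps a miRNA symbol to (logFC, raw p-value).
    """
    symbols = list(stats)
    fdr = benjamini_hochberg([stats[s][1] for s in symbols])
    return {
        s
        for s, q in zip(symbols, fdr)
        if abs(stats[s][0]) > logfc_cutoff and q < fdr_cutoff
    }


# Hypothetical statistics: symbol -> (logFC, raw p-value).
example_stats = {
    "miR-A": (2.3, 0.0004),
    "miR-B": (-1.6, 0.001),
    "miR-C": (0.4, 0.0002),   # significant, but the fold change is too small
    "miR-D": (1.8, 0.30),     # large fold change, but not significant
}
# Hypothetical list of miRNAs already reported in curated databases.
known_mirnas = {"miR-B", "miR-E"}

dems = select_dems(example_stats)
combined = dems | known_mirnas  # set union removes duplicates
```

Using a set union means that a miRNA found both by the microarray analysis and in a curated database is counted once in the combined list.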

OS-related CNVs
Normalized comparative genomic hybridization (CGH) microarray data were downloaded from the GEO database (Table 1) and analyzed using the DNAcopy [32] and cghMCR [33] packages. The criteria were set at |segment gain or loss| > 0.2 and incidence > 30%. A total of 1695 genes with CNVs in OS were identified (Table 2).
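One possible reading of these criteria is sketched below in Python. It assumes that each gene has one segment value (log ratio) per sample, that a sample counts as a gain or a loss when that value is above 0.2 or below -0.2, and that incidence is the fraction of samples carrying the alteration. This is an illustration under those assumptions, not the DNAcopy or cghMCR implementation; the function name, gene names, and values are hypothetical.

```python
from typing import Dict, List


def recurrent_cnv_genes(
    segment_values: Dict[str, List[float]],
    threshold: float = 0.2,
    min_incidence: float = 0.30,
) -> Dict[str, str]:
    """Return genes whose gain or loss incidence exceeds min_incidence.

    `segment_values` maps a gene to its per-sample segment values.
    The result maps each reported gene to "gain" or "loss".
    """
    calls: Dict[str, str] = {}
    for gene, values in segment_values.items():
        if not values:
            continue  # no samples, nothing to call
        n = len(values)
        gain_rate = sum(v > threshold for v in values) / n
        loss_rate = sum(v < -threshold for v in values) / n
        # Report the more frequent alteration type (ties go to "gain").
        kind, rate = max(
            (("gain", gain_rate), ("loss", loss_rate)), key=lambda kv: kv[1]
        )
        if rate > min_incidence:
            calls[gene] = kind
    return calls


# Hypothetical per-sample segment values for three genes in five samples.
example_segments = {
    "GENE_X": [0.5, 0.3, 0.1, -0.05, 0.25],   # gained in 3 of 5 samples
    "GENE_Y": [-0.4, 0.0, 0.1, -0.1, 0.05],   # lost in 1 of 5 samples
    "GENE_Z": [-0.6, -0.3, 0.0, 0.1, -0.25],  # lost in 3 of 5 samples
}

cnv_calls = recurrent_cnv_genes(example_segments)
```

In this example GENE_Y is not reported, because a loss in one of five samples (20%) does not exceed the 30% incidence cutoff.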

OS-related methylation sites
MethyCancer [12] and PubMeth [34] databases were searched using the keyword "osteosarcoma." Eighteen genes with methylation sites related to OS were identified for further analysis (Table 2).

Data storage
The data obtained using the methods described were collated and used to construct the integrated human OS database (HOsDb), which is available for use at www.hosdatabase.com. HOsDb is a one-stop comprehensive platform for OS researchers.

Database description
The HOsDb provides a search engine for retrieving detailed information on each OS-related term stored in the database. Terms include 'Home,' 'Introduction,' 'Tumor vs. normal,' 'Metastasis vs. non,'  (Fig. 2). The 'miRNA' term links users to a list of OS-related miRNAs, and users can search for a particular miRNA by entering its symbol in the search bar. Users can define their own thresholds (logFC and p-value) for gene or miRNA expression; the default settings are logFC > 1 and p-value < 0.05 (Fig. 3a). The 'copy number variation' term generates a list of genes with CNVs in OS. Users can query whether the copy number of a certain gene changes in OS by entering the corresponding gene ID or symbol (Fig. 3b). The 'methylation' term lists all genes with methylation sites related to OS. Users can enter a gene symbol to check whether its sequence has methylation sites in OS (Fig. 3c). The 'Related database' terms include several internal resources or databases, which are cross-linked in HOsDb, including NCBI,

Utility and discussion
Compared with a previously established OS database [14], the HOsDb provides more information. For example, our analyses of mRNA and miRNA expression microarrays and CGH microarrays provide a comprehensive list of candidate genes, miRNAs, and CNVs, which will help users navigate the complexity of OS. Moreover, the HOsDb contains detailed gene regulation information, such as potential TF- and miRNA-gene pairs associated with OS, which is convenient for the identification of novel gene relationships involved in OS. Furthermore, information regarding SNPs in OS-related genes is provided in the HOsDb, which will help direct further studies of OS-related SNPs. The OS-related CNVs listed in the HOsDb were generated through analysis of three CGH microarray datasets and are therefore more reliable than those generated from a single dataset. Additionally, the HOsDb incorporates a user-friendly interface, which makes all the features easily accessible. Although data in the HOsDb were collected using a number of different platforms and approaches, all data were normalized prior to analysis, which adds to the reliability of our results. However, microarray data regarding OS are likely to be constantly updated in the GEO database, and next-generation sequencing studies can also provide OS-related data, which will offer new insights into OS biology. This updated information will need to be added to the HOsDb once it is available. Although the HOsDb in its current form has advantages over the only other known OS-related database, we plan to update the database periodically to maintain the quality of the OS-related data available and to keep up to date with changes and improvements in the field.

Conclusions
The HOsDb provides a one-stop, comprehensive platform for human OS research that is quick and easily accessible. We believe that the HOsDb will be particularly attractive to communities and researchers interested in OS, and that the HOsDb will considerably facilitate research regarding the pathogenesis of OS.