OncoMX: A Knowledgebase for Exploring Cancer Biomarkers in the Context of Related Cancer and Healthy Data

PURPOSE The purpose of OncoMX 1 knowledgebase development was to integrate cancer biomarker and relevant data types into a meta-portal, enabling the research of cancer biomarkers side by side with other pertinent multidimensional data types.
METHODS Cancer mutation, cancer differential expression, cancer expression specificity, healthy gene expression from human and mouse, literature mining for cancer mutation and cancer expression, and biomarker data were integrated, unified by relevant biomedical ontologies, and subjected to rule-based automated quality control before ingestion into the database.
RESULTS OncoMX provides integrated data encompassing more than 1,000 unique biomarker entries (939 from the Early Detection Research Network [EDRN] and 96 from the US Food and Drug Administration) mapped to 20,576 genes that have either mutation or differential expression in cancer. Sentences reporting mutation or differential expression in cancer were extracted from more than 40,000 publications, and healthy gene expression data with samples mapped to organs are available for both human genes and their mouse orthologs.
CONCLUSION OncoMX has prioritized user feedback as a means of guiding development priorities. By mapping to and integrating data from several cancer genomics resources, it is hoped that OncoMX will foster a dynamic engagement between bioinformaticians and cancer biomarker researchers. This engagement should culminate in a community resource that substantially improves the ability and efficiency of exploring cancer biomarker data and related multidimensional data.


INTRODUCTION
Cancer biomarkers are molecules that can be assayed from bodily fluids or tissues and whose presence indicates one or more processes associated with cancer. 2 Molecular characterization of cancers can lead to pan-cancer or cancer type-specific biomarkers, which are becoming integral components of risk assessment, pathologic diagnosis, monitoring of disease progression, and therapeutic decisions. 3,4 This is especially true when biomarker assays enable the identification of subpopulations amenable to treatment by targeted molecular therapies. Cancer biomarkers are commonly individual genes or proteins, multigene/protein panels, or biomolecules, such as glycans, analyzed from urine, blood, stool, or other biologic samples. Biomarker diversity is ever growing and may also include findings from image analyses, 5 gut microbiome abundance, 6 and others.
Attempts at molecular characterization of cancer samples have increased as a result of advances in high-throughput sequencing and other technologic progress. 7,8 Paired with increased funding opportunities, such as those enabled through the Precision Medicine Initiative of 2015, it is not surprising to find increased reports of potentially novel cancer biomarkers. 9 However, reproducibility of initial findings, clinical validation, and access to harmonized biomarker and associated data remain formidable challenges. 7 While issues of reproducibility and clinical validation must be addressed by biomarker research and regulatory communities, usability and accessibility of findings can be addressed in the public domain. In fact, a number of research groups already make cancer biomarker data available, including EDRN, the Early Detection Research Network, which focuses on the research and development of biomarkers and technologies for the clinical application of early cancer detection strategies; 10 CBD, the Colorectal Cancer Biomarker Database, which contains colorectal cancer biomarkers reported in articles in PubMed; 11 ResMarkerDB, the database of biomarkers of drug response to monoclonal antibody therapy in breast and colorectal cancer; 12 and more. 13 Additional current and past efforts exist, such as those belonging to the Cancer Biomarkers Research Group of the National Cancer Institute Division of Cancer Prevention, including the Alliance of Glycobiologists for Cancer Research, 14 the Consortium for Imaging and Biomarkers (CIB), 15 and others, 16 and more biomarker information can be obtained directly from the literature or other published sources.
Although not always biomarker centric, the National Cancer Institute Informatics Technology for Cancer Research (ITCR) program funds a number of resources, including HemOnc, 17 the Cancer Proteome Atlas (TCPA), 18 the Patient-Specific Drug-Gene Networks for Recommending Targeted Therapies (CDGNet), 19 and others, that describe, generate, analyze, or link cancer data that could provide additional evidence for biomarker relevance. It seems logical, then, that these various data could be combined into a meta-resource for the search and exploration of cancer biomarker-related information.
However, the above-mentioned data sets are highly variable in scope, developed for specific biological applications, specific to certain cancer types, limited by regulatory status, or pertinent to specific clinical populations. Furthermore, data formats are frequently heterogeneous, lacking common attributes to enable integration. Improved modeling of cancer biomarker data would facilitate the integration of interscope biomarker data. Ontology unification and mapping through common resource accessions would improve the quality of integration, downstream assertions, and reasoning. Improved data provenance tracking would transparently communicate parametric configurations and assumptions made during processing and retrieval, and the availability of integrated data (searchable via a Web portal designed with key user input) would enhance the usability of the underlying biomarker data.
To address these issues, the OncoMX group developed a data model to integrate public biomarker data from EDRN and the US Food and Drug Administration (FDA) and additional related data around persistent accessions and identifiers. The resulting model provides the foundation for integration of heterogeneous biomarker evidence (using the BioCompute Object [BCO] framework 20,21 ) into the OncoMX knowledgebase and Web portal for cancer biomarker exploration. Integrated data types include cancer mutation, cancer differential expression (mRNA and miRNA), cancer expression specificity (single cell RNA [scRNA] sequencing), healthy expression (mRNA from human and mouse), literature mining for cancer mutation and expression, EDRN biomarkers, and FDA-approved breast cancer biomarkers. 2 Figure 1 shows an overview of OncoMX integration.

Data Retrieval
Data were provided as .txt, .tsv, or .csv files for the following: literature mining for differential expression in cancer from DEXTER (Disease-Expression Relation Extraction from Text) 22 and for mutation in cancer from Extraction of Mutation Association to Diseases (DiMeX) (University of Delaware); 23 RNA sequencing-derived healthy expression calls and ranks for human and mouse from Bgee, a Database for Gene Expression Evolution (SIB Swiss Institute of Bioinformatics); 24 and the FDA-approved breast cancer biomarkers and cancer cell-type expression specificity from scRNA sequencing (George Washington University). Data were pulled from external consortium databases for public biomarker data from EDRN, 10 cancer mutation data from BioMuta, 25,26 cancer differential expression of mRNA and miRNA from BioXpress, 26,27 and Reactome. 28 The Data Supplement provides a summary of data access and pertinent details for OncoMX and contributing resources.

CONTEXT
Key objective: To develop a data model and central, unified, integrated Web resource to enable improved exploration of cancer biomarkers in the context of related evidence.
Knowledge generated: A data model was developed that describes cancer biomarker evidence, accommodates heterogeneously structured extant data, and is extensible to diverse new data types. Data integrated and unified through this model were made available through OncoMX, a knowledgebase and Web portal for exploring cancer biomarker data and related evidence.
Relevance: Cancer biomarker evidence can be accumulated across various resources from a single access point through OncoMX to generate summary reports for a given gene, including available information on clinical status and relevance. Furthermore, OncoMX links to several other resources, improving the reach of each resource reciprocally and increasing the utility of any given data set.

Processing and Unification
Many data sets used by OncoMX were initially processed by external pipelines documented in source resources. A summary of relevant preprocessing details for cancer mutation, cancer differential expression, healthy expression, and biomarkers is included in the Data Supplement, along with links to documentation. Data sets were unified through Cancer Disease Ontology (CDO) slim. 29 An OncoMX field dictionary was created that leveraged field names and ontology terms of existing resources (UniProtKB, GlyGen, 32 HUGO Gene Nomenclature Committee [HGNC], 33 and so on), and headers were uniformly formatted and mapped to existing OncoMX field names.
Resulting files were checked for completeness and adherence to expected content and subjected to an automated quality check to ensure integrity and format sanity. Pipeline steps, provenance details, and other metadata were captured using the BCO 20 specification and output in JavaScript Object Notation (JSON) and .txt formats. All processed data and corresponding metadata objects were entered into a database and subjected to version control. Additional processing considerations for specific data sets are described below.
Mutation and differential expression in cancer. Earlier versions of the cancer mutation and differential expression data sets were updated to include additional data (from The Cancer Genome Atlas, International Cancer Genome Consortium, and Clinical Interpretations of Variants in Cancer [CIViC]), and the expression data set was updated to include miRNA analysis. 26 A new pipeline was developed for cancer mutation to improve the tracking of downloaded content, unify data in .vcf format, subject resulting information to an automated quality check, automate mapping of disease terms to CDO slim terms and of genes/transcripts to UniProtKB AC using gene symbols or alternate accessions, generate frequency across sources, and automate site annotation and functional impact prediction. Minor updates were made to the differential expression analysis pipeline, including updating the analysis package to DESeq2, 34 mapping cancer subtypes to parent CDO slim terms, and quantifying patient ratios for each direction of expression change (the Data Supplement provides additional details). Resulting data were subjected to the general OncoMX processing steps outlined above before integration.
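The patient ratio for each direction of expression change can be illustrated with a short sketch. This is not the BioXpress pipeline itself; it assumes that one log2 fold change per patient (tumor versus paired normal) is already available, and the function name and the input values are hypothetical.

```python
from collections import Counter


def direction_ratios(log2_fold_changes):
    """Return the fraction of patients with increased, decreased, or unchanged expression.

    Assumption: each value is a per-patient log2 fold change (tumor vs. paired normal).
    """
    if not log2_fold_changes:
        raise ValueError("at least one paired sample is required")
    counts = Counter(
        "up" if lfc > 0 else "down" if lfc < 0 else "unchanged"
        for lfc in log2_fold_changes
    )
    total = len(log2_fold_changes)
    return {direction: counts[direction] / total
            for direction in ("up", "down", "unchanged")}


# Synthetic input: 49 patients with an increase and 3 with a decrease.
values = [1.5] * 49 + [-0.8] * 3
ratios = direction_ratios(values)
print(ratios)
```

Reporting a ratio per direction, rather than a single averaged fold change, keeps visible how consistently a change is observed across patients.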
Healthy expression. A custom format of healthy (wild type) expression data was devised for RNA sequencing-derived expression calls for human and mouse. Data were limited to tissue types that had at least one reported association in the cancer mutation or differential expression data sets, as determined through mapping between CDO slim and Uberon Anatomical Entity. Data were further limited to adult developmental life stages. Analyzed expression values were ranked in two series for each species: one compared the expression of a given gene with that of all genes in a given tissue, and the other compared the expression of a given gene with that of the same gene across all tissues. To enable cross-species exploration of expression profiles, the set of 1:1 orthologs between human and mouse was retrieved from the Orthologous MAtrix (OMA) database 35 and used to map human genes to orthologous mouse genes (the Data Supplement provides additional details).
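The two ranking series can be sketched as follows. This is a minimal illustration and not Bgee's actual rank computation: it assumes a simple nested mapping of gene to tissue to expression value, uses dense ranking (tied values share a rank), and all gene names, tissue names, and values are placeholders.

```python
def rank_descending(values):
    """Map each key to a 1-based dense rank, highest value first."""
    ordered = sorted(set(values.values()), reverse=True)
    position = {value: index + 1 for index, value in enumerate(ordered)}
    return {key: position[value] for key, value in values.items()}


def two_series_ranks(expression):
    """expression: {gene: {tissue: value}} -> (within_tissue, across_tissues).

    within_tissue[(gene, tissue)]: rank of the gene among all genes in that tissue.
    across_tissues[(gene, tissue)]: rank of the tissue among all tissues for that gene.
    """
    tissues = {t for per_tissue in expression.values() for t in per_tissue}
    within_tissue = {}
    for tissue in tissues:
        column = {g: vals[tissue] for g, vals in expression.items() if tissue in vals}
        for gene, rank in rank_descending(column).items():
            within_tissue[(gene, tissue)] = rank
    across_tissues = {}
    for gene, per_tissue in expression.items():
        for tissue, rank in rank_descending(per_tissue).items():
            across_tissues[(gene, tissue)] = rank
    return within_tissue, across_tissues


# Placeholder data for two genes in two tissues.
expression = {
    "GENE_A": {"liver": 50.0, "lung": 5.0},
    "GENE_B": {"liver": 10.0, "lung": 20.0},
}
within_tissue, across_tissues = two_series_ranks(expression)
print(within_tissue[("GENE_A", "liver")], across_tissues[("GENE_B", "liver")])
```

The first series answers whether a gene is highly expressed relative to other genes in a tissue; the second answers in which tissues a given gene is most expressed.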
Literature mining. DEXTER, 22 an automated text-mining tool, was applied on Medline abstracts to extract gene, disease, expression level, experimental context, and conditions being compared. Customized for OncoMX, DEXTER extracted expression differences in cancer compared with normal/control tissues and verified whether the normal tissue came from the same patient or not. DEXTER was applied on a comprehensive set of cancer-related abstracts identified using the PubMed query "cancer OR cancers OR carcinoma OR carcinomas OR neoplasm OR neoplasms," which returned 3,717,745 abstracts (as of March 2018). Abstracts were filtered for those that contained words or phrases pertinent to expression, which reduced the number of abstracts to 1,750,928.
DiMeX 23 is a tool that detects different types of connections between mutations and disease in the literature. For OncoMX, less reliable connections, such as those extracted on the basis of comentions, were dropped while maintaining association relations between mutation and a disease aspect. The mutation detection module of DiMeX was improved by integration with tmVar 36 after performance evaluation of several mutation detection tools on a range of well-known corpora. The resulting integration enabled the reduction of false negatives without increasing false positives. DiMeX's relation extraction module was also refined and improved by integrating more recent parsing technology.
Cancer biomarkers. Public biomarker records were retrieved from EDRN, and a new data set of FDA-approved biomarkers in breast cancer was generated by manual search across Web resources (details are provided in the Data Supplement). The generation and integration of these data sets enabled the OncoMX team to troubleshoot issues of extension, facilitated augmentation of infrastructure to ensure interoperability between data of differing types, and demonstrated the immediate utility of newly added data and views available to the end user. A new, easily extensible, biomarker-centered data model was developed to describe a biomarker evidence data object, accounting for the relationships between data sets reporting clinical status of known and/or actively studied biomarkers (EDRN and/or FDA) and other biomarker evidence data types.
Cancer expression specificity. Expression of 10 cancer cell types across the brain, lung, and colon was analyzed from scRNA sequencing data from three studies retrieved from recount2 37 and integrated into OncoMX (details are provided in the Data Supplement). Count tables were filtered for low-quality cells and low-abundance genes by filtering out samples with library sizes and features that expressed fewer than three median absolute deviations from the median value of each quality control metric computed from mitochondrial gene expression across all samples. Cell-specific biases between samples were normalized with a deconvolution approach 38 in which pooled counts from multiple samples are used to compute size factors that are deconvolved to infer size factors for each sample. Data sets were filtered for target cell sequencing runs, and dimensionality reduction was performed using principal component analysis (PCA), t-distributed stochastic neighbor embedding, and uniform manifold approximation and projection (UMAP). scRNA sequencing gene-level count matrices were analyzed for cell type-specific expression using Preferential Expression Measure (PEM), a metric designed for tissue-level expression specificity analysis, 39 and qualitative annotations of specificity were computationally determined for each PEM score. Resulting data were mapped to UniProtKB AC, HGNC gene IDs, Ensembl gene IDs, 41 DOIDs, and DO names, and were subjected to OncoMX processing steps described above.
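As an illustration of the specificity metric, the sketch below computes PEM under its commonly cited definition, PEM = log10(observed / expected), where the expected value assumes that a gene's total expression is distributed across cell types in proportion to each cell type's share of all expression. The input layout, function name, and numbers are assumptions for illustration; the thresholds used for the qualitative annotations are not given here and are therefore not modeled.

```python
import math


def pem_scores(counts):
    """counts: {gene: {cell_type: expression}} -> {(gene, cell_type): PEM or None}.

    PEM = log10(observed / expected), with
    expected = gene_total * cell_type_total / grand_total.
    Scores are left as None where the observed value is zero (log undefined).
    """
    gene_totals = {gene: sum(row.values()) for gene, row in counts.items()}
    type_totals = {}
    for row in counts.values():
        for cell_type, value in row.items():
            type_totals[cell_type] = type_totals.get(cell_type, 0.0) + value
    grand_total = sum(gene_totals.values())

    scores = {}
    for gene, row in counts.items():
        for cell_type, observed in row.items():
            expected = gene_totals[gene] * type_totals[cell_type] / grand_total
            if observed > 0 and expected > 0:
                scores[(gene, cell_type)] = math.log10(observed / expected)
            else:
                scores[(gene, cell_type)] = None
    return scores


# Placeholder counts: G1 is enriched in cell type A, G2 in cell type B.
counts = {
    "G1": {"A": 90.0, "B": 10.0},
    "G2": {"A": 10.0, "B": 90.0},
}
scores = pem_scores(counts)
print(scores[("G1", "A")], scores[("G1", "B")])
```

A positive score indicates expression above the uniform expectation for that cell type, and a negative score indicates expression below it.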

Integration and Data Modeling
All data were mapped to UniProtKB AC directly or through HGNC gene ID and/or Ensembl gene ID to UniProtKB AC. A data model describing a biomarker evidence object was devised such that an object of type biomarker evidence can be diverse (molecular characteristic, clinical relevance, descriptive literature, and so on) and is currently expected to map to information about one or many genes.
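The mapping order described above (a direct UniProtKB AC first, otherwise through an HGNC gene ID or Ensembl gene ID) can be sketched as a small resolver. The lookup tables, field names, and function name are hypothetical stand-ins for the real cross-reference files; the single ERBB2 entry is included only as an example.

```python
# Hypothetical cross-reference tables; a real pipeline would load these from
# the mapping files of the source resources.
HGNC_TO_UNIPROT = {"HGNC:3430": "P04626"}          # ERBB2
ENSEMBL_TO_UNIPROT = {"ENSG00000141736": "P04626"}  # ERBB2


def to_uniprot_ac(record):
    """Resolve a record to a UniProtKB AC, or return None if no mapping exists.

    Resolution order: direct AC, then HGNC gene ID, then Ensembl gene ID.
    """
    direct = record.get("uniprotkb_ac")
    if direct:
        return direct
    hgnc_id = record.get("hgnc_id")
    if hgnc_id in HGNC_TO_UNIPROT:
        return HGNC_TO_UNIPROT[hgnc_id]
    ensembl_id = record.get("ensembl_gene_id")
    if ensembl_id in ENSEMBL_TO_UNIPROT:
        return ENSEMBL_TO_UNIPROT[ensembl_id]
    return None


print(to_uniprot_ac({"hgnc_id": "HGNC:3430"}))
print(to_uniprot_ac({"ensembl_gene_id": "ENSG00000141736"}))
print(to_uniprot_ac({"ensembl_gene_id": "ENSG_UNKNOWN"}))
```

Returning None for unmapped records, rather than raising, lets a quality-control step count and report unmapped identifiers in one pass.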
Database Architecture, Back-End Infrastructure, and Front-End Implementation
OncoMX was developed in a virtual machine environment configured with 32 GB of RAM and 12 CPUs, running the CentOS Linux 7 operating system. The database was built on the MariaDB version 5.5.60 engine, running on the same server as, and connected directly to, the application, which was developed using the Python Django framework with Bootstrap 3 for user interface design and jQuery for interactive capabilities. The application was designed with user login security based on the Django authentication library and built to use RESTful Web services to expose the underlying database for external access. The application environment, including the Django framework, 42 was dockerized into a container, which allowed easy transfer and deployment across servers while maintaining a consistent architectural environment.

Integrated Data
Resulting data sets encompassed 939 and 96 unique biomarkers from EDRN and FDA, respectively, mapped to 20,576 genes with available mutation and/or differential expression in cancer. Sentences reporting mutation in cancer were extracted from 14,360 PubMed publications, and those reporting differential expression in cancer were extracted from 25,865 publications. Healthy expression data for 33,753 human and 45,879 mouse Ensembl gene IDs were retrieved, mapping to 19,555 canonical human genes, 15,349 with mouse orthologs. Table 1 lists data sets currently available at https://data.oncomx.org.

Biomarker Evidence Data Modeling
Integration of EDRN and FDA data sets solidified the need to describe both clinical assertions related to actively characterized cancer biomarkers and other information that could be leveraged as biomarker evidence. A data model was developed to describe a biomarker evidence object that is extensible to any number of components but currently has components for provenance, genomic variation, gene expression, clinical status, and literature mining. Although not required, the current core of the provenance domain is the gene symbol because available biomarker-related data contain or link to one or many genes or miRNAs. Granularity of biomarker-level details was retained without sacrificing gene-level detail for panels that contained multiple genes by mapping to each unique gene. The resulting model (Fig 2) accommodates all OncoMX data types and is readily extensible to new data types containing some combination of keys and additional core attributes of disease and tissue; data types not mapped to genes can still be connected through other core attributes.
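One way to picture the biomarker evidence object is as a record with one slot per component named above. The sketch below is an assumption-laden illustration, not the published schema: the class name, field names, and the per-gene index helper are hypothetical, and the biomarker identifiers are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BiomarkerEvidence:
    """Hypothetical container mirroring the components listed in the text."""
    biomarker_id: str
    gene_symbols: List[str]                 # one gene, or several for a panel
    disease: str = ""                       # core attribute
    tissue: str = ""                        # core attribute
    provenance: Dict[str, str] = field(default_factory=dict)
    genomic_variation: List[dict] = field(default_factory=list)
    gene_expression: List[dict] = field(default_factory=list)
    clinical_status: List[dict] = field(default_factory=list)
    literature_mining: List[dict] = field(default_factory=list)


def index_by_gene(records):
    """Map each unique gene to the biomarker objects that reference it.

    Panels keep their biomarker-level record while remaining reachable
    from every gene they contain.
    """
    index = {}
    for record in records:
        for symbol in dict.fromkeys(record.gene_symbols):  # de-duplicate, keep order
            index.setdefault(symbol, []).append(record)
    return index


# Placeholder records: a single-gene biomarker and a two-gene panel.
single = BiomarkerEvidence("BM_SINGLE", ["ERBB2"], disease="breast cancer")
panel = BiomarkerEvidence("BM_PANEL", ["ERBB2", "ESR1"], disease="breast cancer")
gene_index = index_by_gene([single, panel])
print([r.biomarker_id for r in gene_index["ERBB2"]])
```

Keeping the panel as a single object and indexing it under each gene preserves biomarker-level detail while still supporting gene-centric search.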

Usage and Utility
OncoMX Web site features. Early Web development focused on basic gene search functionality such that a string search query results in the display of all visual and tabular information for a specified gene. Combinatorial filtering and analysis on the integrated data layer allow filtering by P value thresholds or specific disease labels. Other features include a landing page with search bar and quick links for data set exploration; a dashboard for visualizing statistics and summaries; a table viewer rendered dynamically on the basis of user interaction; various types of user documentation; and a contact form to solicit user feedback.
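The combinatorial filtering described above can be sketched as a function that applies an optional P value threshold and an optional set of disease labels. This is an illustration only and does not reflect the portal's actual implementation; the row layout, field names, and values are hypothetical.

```python
def filter_rows(rows, max_p_value=None, diseases=None):
    """Keep rows that satisfy every filter that was supplied.

    rows: list of dicts with hypothetical keys "gene", "disease", "p_value".
    max_p_value: keep rows with p_value <= this threshold (skipped if None).
    diseases: keep rows whose disease label is in this collection (skipped if None).
    """
    selected = []
    for row in rows:
        if max_p_value is not None and row["p_value"] > max_p_value:
            continue
        if diseases is not None and row["disease"] not in diseases:
            continue
        selected.append(row)
    return selected


# Synthetic rows for illustration.
rows = [
    {"gene": "GENE_A", "disease": "prostate cancer", "p_value": 0.001},
    {"gene": "GENE_A", "disease": "breast cancer", "p_value": 0.2},
    {"gene": "GENE_B", "disease": "prostate cancer", "p_value": 0.04},
]
significant_prostate = filter_rows(rows, max_p_value=0.05, diseases={"prostate cancer"})
print([row["gene"] for row in significant_prostate])
```

Treating each filter as optional lets the same function serve both a single-criterion and a combined query.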
Use case 1: Search for an individual gene biomarker.
PCA3 is a long noncoding RNA that is overexpressed in most prostate cancer (PCa) cells and is involved in regulating the expression of epithelial-mesenchymal transition markers, androgen receptor cofactors, and the PCa suppressor PRUNE2. 43-45 As depicted in Figure 3, a search for "PCA3" shows overexpression of PCA3 in prostate cancer. In fact, 49 of 52 samples have a log2 fold change increase in PCA3 expression in the tumor sample compared with its adjacent normal. Biomarker description, aliases, and publications reported by EDRN can be viewed in the biomarker details tab. Moreover, there are 14 pieces of literature evidence reporting upregulated PCA3 in prostate cancer. Of note, PCA3 expression in healthy samples is qualified as "MEDIUM" for earlier adult stages but "HIGH" for later stages, which suggests an association between PCA3 expression and age.
ERBB2 46-48 is included in 10 FDA-approved biomarker products: nine individual amplification markers and one panel, the Prosigna multigene prediction panel. In the detailed search view, there are multiple pieces of evidence for ERBB2 in the literature, both for differential expression and for mutation. Of interest, whereas the majority of differential expression evidence (n = 48) indicates upregulation, two pieces of evidence indicate downregulation. Exploration of the underlying data shows that downregulation has been observed in response to pertuzumab treatment in mouse xenograft studies 49 and that ERBB2 expression has been observed to be lower in obese patients with early-stage breast cancer. 50 Visualization of available information shows that this gene is approved for various indications, including prognosis, prediction, and companion diagnosis (Fig 4). The Data Supplement describes additional use cases.

Leveraging Existing Resources to Promote Sustainability
OncoMX was designed to combine existing and newly generated data, establishing a strong application customized for cancer biomarker research. Although other multidimensional, integrated cancer resources exist, such as the cBioPortal for Cancer Genomics (cBioPortal) 51,52 and CIViC, 53 the focus on biomarkers, foundational inclusion of large-scale literature mining, ontology-driven unification, and cross comparison of analyses of healthy samples across species are aspects that are unique to OncoMX. However, OncoMX leverages the utility of such extant related resources, including cBioPortal, 51,52 CDGnet (Cancer Drug Gene Network), 19 CDSA (Cancer Digital Slide Archive), 54 CIViC, 53 HemOnc, 17 iPTMnet, 55 PDX (patient-derived tumor xenograft) Finder, 56 and TCPA (The Cancer Proteome Atlas). This model promotes the extensibility and sustainability of OncoMX and referenced resources by allowing OncoMX to focus on harmonization, integration, annotation mapping, evidence tagging, and database maintenance while outsourcing data set creation and curation to collaborators and other experts.

Future Directions
OncoMX is actively seeking new data and types, such as imaging, glycan biomarkers, drug targets, methylation, alternative splicing, and more. Work in progress includes extending the data model to new types, integrating FDA data sets for additional cancers upon user request, improving healthy expression and cross-species visualization, and expanding cross references to key cancer resources. The OncoMX team will continue hosting workshops to ensure that development is shaped by users, facilitating engagement between stakeholders and ultimately resulting in an adaptable resource for the improved exploration and potential discovery of cancer biomarkers.
Food and Drug Administration biomarker data sets; Stephanie Singleton for early drafts of user documentation and critical administrative assistance; Sneh Talwar for assistance preparing draft figures and data tables; Peng Su for contributions to and support for mutation literature mining; Edmund Cauley and Daniel Lyman for manuscript review; and Jeet Kiran Vora and Rahi Navelkar for thoughtful discussion regarding data set maintenance and related considerations.