Exploring Integrative Analysis Using the BioMedical Evidence Graph

PURPOSE The analysis of cancer biology data involves extremely heterogeneous data sets, including information from RNA sequencing, genome-wide copy number, DNA methylation data reporting on epigenetic regulation, somatic mutations from whole-exome or whole-genome analyses, pathology estimates from imaging sections or subtyping, drug response or other treatment outcomes, and various other clinical and phenotypic measurements. Bringing these different resources into a common framework, with a data model that allows for complex relationships as well as dense vectors of features, will unlock integrated data set analysis. METHODS We introduce the BioMedical Evidence Graph (BMEG), a graph database and query engine for discovery and analysis of cancer biology. The BMEG is unique from other biologic data graphs in that sample-level molecular and clinical information is connected to reference knowledge bases. It combines gene expression and mutation data with drug-response experiments, pathway information databases, and literature-derived associations. RESULTS The construction of the BMEG has resulted in a graph containing > 41 million vertices and 57 million edges. The BMEG system provides a graph query–based application programming interface to enable analysis, with client code available for Python, Javascript, and R, and a server online at bmeg.io. Using this system, we have demonstrated several forms of cross–data set analysis to show the utility of the system. CONCLUSION The BMEG is an evolving resource dedicated to enabling integrative analysis. We have demonstrated queries on the system that illustrate mutation significance analysis, drug-response machine learning, patient-level knowledge-base queries, and pathway level analysis. We have compared the resulting graph to other available integrated graph systems and demonstrated the former is unique in the scale of the graph and the type of data it makes available.


INTRODUCTION
Biological data produced by large-scale projects now routinely reach petabyte levels thanks to major advances in sequencing and imaging. With multiple profiling methods, platforms, versions, formats, and pipelines, a major unaddressed issue is querying across the increasingly heterogeneous data. When faced with the substantial labor and computation costs, researchers may use outdated and/or only a fraction of publicly available data.
Graph databases are useful tools for integrating complex and interconnected data. [1][2][3] In the commercial sector, several major data aggregators have successfully used graph databases to integrate heterogeneous data. Facebook (Menlo Park, CA) uses the "Social Graph" 4 to represent the connections between people and their information, whereas Google's search engine (Alphabet, Mountain View, CA) uses Google's "Knowledge Graph" to connect various facts about different subjects. On the basis of these observations, we have built the BioMedical Evidence Graph (BMEG) to allow for complex integration and analysis of heterogeneous biological data.
semantics. Independently, a researcher would need to download hundreds of files, write multiple file parsers, develop an integrated data model, map identifier systems, and normalize analytical results. The BMEG centralizes that work and makes it searchable using a full-feature query engine. To enable analysis and machine learning, the BMEG includes high-quality feature-extraction methods applied consistently to all samples. This includes the results provided by the best methods of somatic variant calling and RNA-sequencing analysis for cancer genomic and transcriptomic data sets. We used open challenges to create leaderboards of the best methods from all of those submitted by the community. We then participated in the development of open standards to enable the exchange of genomic associations from cancer knowledge bases. Finally, we implemented computational integrity checks and unit tests for each of the data import modules.

Graph Schema
At the core of BMEG's metadata is a tree representing the organization of all the different data elements (Fig 1). Program node represents the root of the tree, defining a cohort of samples studied by a consortium. For example, The Cancer Genome Atlas (TCGA; National Cancer Institute, Bethesda, MD) is one such "program" and cohorts for different tumor types can be selected using the program's child node called "Project." Each tumor type is then populated by a number of Case nodes, which in turn have multiple Sample nodes, which can then be subdivided into a number of Aliquot nodes. The BMEG schema builds on this base structure to include data from a number of additional sources including: (1) genome reference, (2) gene and pathway annotations, (3) somatic variants, (4) gene expression data, and (5) knowledge bases.

Data Sources
Initial data sources (Appendix Table A1) for the BMEG were centered on large cohorts of patient-derived samples, with DNA and RNA profiling, cell lines with drug-response data, and literature-derived drug-phenotype associations. The goal was to provide uniform input data for analysis and machine learning.

RNA-Sequencing Data
To identify the best methods for RNA analysis, we launched the somatic mutation calling-RNA challenge, which benchmarked isoform quantification methods to prioritize the methods used for processing data that would be ingested into the BMEG. For example, for transcript abundances we used results from Kallisto (Patcher Lab, University of California, Berkeley, Berkeley, CA), a top contending method in the somatic mutation calling-RNA challenge, to process the TCGA and Cancer Cell Line Encyclopedia (CCLE) 5 data sets. In addition, the Genotype-Tissue Expression project (National Institutes of Health, Bethesda, MD) 6 provided gene-level, transcript-per-million-mapped reads estimates for normal tissues that could be contrasted with tumors. Combinations of these resources provide 36,000 vertices to the BMEG graph.

TCGA Metadata
The Genomic Data Commons (GDC; National Cancer Institute, Bethesda, MD) created a data system to track the clinical and administrative meta-data of the TCGA samples and files. We used their web application programming interface (API) to obtain TCGA patient and sample metadata for the evidence graph.

TCGA Genomic Data
To determine the best methods for somatic mutation calling, we partnered with the Dialogue on Reverse Engineering Assessment of Methods (DREAM) consortium, Sage BioNetworks (Seattle, WA), and the Ontario Institute for Cancer Research to launch the International Cancer Genome Consortium-TCGA Somatic Mutation Calling challenge. 7 Many methods evaluated by this effort were incorporated into pipelines that would be deployed on the TCGA's 10,000 exomes as part of the Multi-Center Mutation Calling in Multiple Cancers (MC3) project. 8 The MC3 adds 10,000 vertices that connect to 3 million alleles (2.6 million CONTEXT Key Objective Can a graph database help researchers access multiple integrated data sets for cancer omics analysis? Knowledge Generated We have developed and tested an integrated graph database, BioMedical Evidence Graph, that connects patient sample data, cell-line drug-response data, and multiple knowledge bases. Simple queries to this system allowed cross-data set analysis in seconds when the same questions would have required days or weeks of manual effort otherwise. Relevance Analysis of cancer systems biology data requires a large variety of different kinds of data. The BioMedical Evidence Graph system provides a uniform interface for data interrogation that will make it easier to pose a variety of clinically relevant queries. distinct alleles) in the graph. For the set of copy-number alteration events, we used the Gistic2 9 data from the Broad Institute's Firehose system (Massachusetts Institute of Technology, Cambridge, MA).

Cell-Line Drug-Response Data
Cell-line clinical attributes and drug-response data has been collated by the DepMap (Broad Institute) 10 and Pharmacodb (BHKLAB, Princess Margaret Cancer Centre -University Health Network, Toronto, ON, Canada) 11 projects, respectively. This includes response curves, half maximal inhibitory concentration and half maximal effective concentration (EC 50 ) scores from CCLE, 12 Cancer Therapeutics Response Portal (CTRP) 13,14 and Genomics of Drug Sensitivity in Cancer. 15 In addition, the DepMap and Cell Model Passports (Wellcome Sanger Institute, Hinxton, Cambridgeshire, UK) 16 provided variant calls for a number of cell lines.

Variant Drug Associations
The Genotype To Phenotype (G2P; Wellcome Sanger Institute) schema 17 was designed to enable several different cancer knowledge-base resources to be aggregated into a coherent resource. The entries from these knowledge bases typically demarcate associations such as "the T41A mutation in CTNNB1 causes sensitivity to imatinib." With this resource, the BMEG has aggregated associations from   29 and WikiPathways. 30 Once loaded into the graph, these resources provided approximately 2 million vertices that could be queried by the user.

Reference Data
The BMEG uses Ensembl identifiers 31 as a global identifier to unite various genomic components present across the ingested biologic reference data and experimental results. The genomic annotations from Ensembl were modeled into the graph to provide a consistent chromosomal coordinate system for any sequence-level sample information. Part of the import pipeline includes annotating sample variants using Variant Effect Predictor 32 to connect them to gene, transcript, and exon data from Ensembl. These, in turn, link to Protein and Pfam (protein family) 33 assignments, as well as Gene Ontology 34 functional annotations.

Queries Using a Graph Language
To enable various analytical queries and provide a framework for analysts to build custom queries, we developed the Graph Integration Platform (GRIP) to design queries to use the BMEG. GRIP stores multiple forms of data and has the ability to hold thousands of data elements per vertex and per edge of the graph. This allows it to store sparse relationship data, such as pathways and ontologies, as well as dense matrix-formatted data, such as expression levels for thousands of genes across hundreds of samples.
The query language implements most operations needed for subgraph selection, as well as aggregation of features. A general purpose end point places more emphasis on the client side, building smart queries to obtain the data they need rather than having custom server-side components provide specialized facets. Because of this, clients can easily create new queries, unanticipated by the server developers, that still have the correct desired effect. The API is available via Python (Python Software Foundation, Wilmington, DE), Javascript (PluralSight, Farmington, UT), and R (R Foundation, https://www.r-project.org/) clients.

RESULTS
To test the utility of the BMEG and its query engine, we have crafted example queries that traverse different parts of the graph, to demonstrate how the system can quickly provide an analyst with connected data. Although it would be possible for an analyst to find the solutions to the following exercises without using the BMEG, the analyst would need to download and merge data from multiple different repositories such as from the TCGA's GDC system, somatic mutations predicted from Broad Institute's CCLE collection and the TCGA's MC3 variant-calling project, seven different somatic variant-to-phenotype association catalogs, PubChem for the names and modes of action of molecular compounds, pathway gene sets from Pathway Commons, and three different drug-response databases. Thus, the benefits of ingesting all into a uniform graph data structure should provide a more seamless presentation that users will find easier to use once the API becomes familiar.
The GRIP Query Language is a traversal-based graphselection language inspired by Gremlin. 35 The user describes a series of steps that will be undertaken by a "traveler." An example traversal would start on a vertex with label Project, move to edges labeled samples, then move along edges labeled aliquots. The engine then scans the graph for all valid paths that can be completed given the instructions. Each of the traversal instructions is based on the graph schema seen in Figure 1. The commands are written using the Python version of the client, but they could be executed similarly in R or Javascript. These queries can be visualized as paths traversed through the BMEG graph (Fig 2).
We begin with an example that counts the mutations per gene in a cancer cohort. This is a useful statistic to gauge one aspect of whether a gene may be a driver of progression, as evidenced by its prevalence in a subpopulation (in this case, breast cancer). As seen in example 1 in the following section, the query starts on a breast cancer project node (TCGA-BRCA) then traverses to the Case, Sample, and Aliquot nodes while filtering out any data properties that do not belong to the previous node information. As it passes the Sample node, it filters for tumor samples. Once on the Aliquot node, it continues to the SomaticCallset, which represents sets of variants produced by a single mutation calling analysis. The traversal then identifies the edges that connect the SomaticCallset to different alleles, this time using the outE command to land on the edge rather than the destination vertex. With the gene identifier in hand, it then uses the aggregate method to count the various terms that occur in the ensembl_ gene field.

Example 2: Identify Pathways Containing Mutated Genes
There may be a theme among the most frequently mutated genes in breast cancer. The next two example queries address this by finding the most frequently affected pathways. First, example 2 identifies pathways containing mutated genes. Example 3 tallies the number of genes per pathway. To identify the pathways involved in mutations, the example derives a list of all mutated genes, finds their associated pathways, and retrieves the tuples of every gene-pathway pair, using the as_ command (the underscore is added to avoid clashing with the Pythonreserved word as) to store the gene and then using the render function to display only the needed data. Query.
• G.query().V(genes).as_("gene") • .out("pathways").render(["$gene._gid", "$._gid"]) Result. The result returns all the pathways for which each gene is a member. This query uses data extracted from Ensembl and Pathway Commons. Continuing in this investigation to derive the most affected pathways, the next step is to aggregate the mutations per pathway and then sum them. To do this, the preceding listed information can be combined with the previously found result tabulating the mutations per pathway. To sum all the mutations in all the genes in a particular pathway, the traversal starts on the Pathway vertex marked for later retrieval using the as_ command. Once the traveler has split and moved out to the multiple child Gene vertices, the select command recalls the stored pathway vertex and moves the traveler back. At this point, an aggregation is called to count the number of travelers on each Pathway vertex. Query.
• G.query().V("Project:TCGA-BRCA") • .out("cases").out("samples") • .has(gripql.eq("gdc_attributes.sample_type", "Primary Tumor")) • .out("aliquots").out("somatic_callsets").out("alleles") • .out("g2p_associations").out("publications") • .aggregate(gripql.term("pub", "_gid")) Result. The result returns a list of the number of mutations for all genes connected in each of the returned papers. This query connects data from the GDC, the knowledge bases imported from the G2P project, and PubMed. The phenotypes in the G2P associations, linked to the collected breast cancer mutations, may be associated with drugs that treat specific conditions. Example 5 defines a traversal of the graph to uncover compounds linked to phenotypes on the basis of specific alleles. The traversal is much like the one illustrated in example 4; however, it also includes a distinct operation to identify unique pairs of cases and compounds. If there are multiple known association records from different publications and these publications link one allele to the same drug-response phenotype, then only one relationship will be noted per patient. Query.

Example 6: Find Drugs Tested in Breast Cancer Cell Lines
Taking the analysis one step further, the next two examples identify drugs proven effective against breast cancer cell lines as determined in the CTRP project. Example 6 identifies those compounds that have been tested in breast cancer cell lines as part of the CTRP project. The query uses the drugs found in example 5 through a list named compounds as a starting point. Query.
• G.query().V(compounds).as_("compound").out("projects") • .has(gripql.eq("project_id", "CTRP_Breast_Cancer")) • .select("compound").render(["_gid", "synonym"]) Result. The result lists those drugs that were profiled in the CTRP effort. For example, at the top of the list, one finds that the compound fulvestrant has been tested against breast cancer cell lines in the CTRP project. To get a sense of the effectiveness of each of these drugs, a natural extension of this line of inquiry is to find out how sensitive the cells are to them. The EC 50 measures the concentration achieving a response midway between the baseline and maximum when cells are exposed to a drug. It is a widely used measure of sensitivity (although a measure, GR 50 , that factors in growth rate, has been shown to be more useful) and available for compounds tested in the CTRP project. To this end, example 7 searches for the EC 50 values for the breast cancer cell lines tested against fulvestrant.
This query includes a call to the render method, which shapes the output into a custom JavaScript Object Notation structure (JSON). In this case, it forms a tuple with the stored sample identifier and EC 50 value. The list of tuples returned by the client can then be passed directly into a Pandas DataFrame. 36 Query.
• G.query().V("Program:CTRP").out("projects") • .out("cases").out("samples").as_("sample") • .out("aliquots").out("drug_response").as_("response") • .out("compounds").hasId("Compound:CID104741") • .render(["$sample._gid","$response.submitter_compound _id","$response.ec50"]) Result. The result lists the EC 50 values for each of the BRCA cell lines to fulvestrant, connecting the data from CTRP to PubChem entries. Drug response data often are not available for patient samples; thus, machine-learning methods that can use more widely available data, such as gene expression data from RNA sequencing, to predict drug response are highly promising. Example 8 illustrates how associated transcriptomic data can be obtained for the cell lines collected in the previous steps. There is no RNA sequencing available from the CTRP project; however, many of the cell lines were assayed as part of the complementary CCLE project. To identify these samples, example 8 follows the edge connecting the list named samples found in example 7 to their parent cases. It then follows the same_as edge to identify Case vertices in other projects that have overlapping identifiers, and then follows the tree down to the GeneExpression node to obtain the expression values. Again, the example uses the render function to return properly formatted data structures that can be passed directly into Pandas. Query.
• G.query().V(samples).as_("sample") • .out("case").out("same_as") • .out("samples").out("aliquots").out("gene_expressions"). as_("exp") • .render(["$sample._gid", "$exp._data.values"]) Result. The resulting matrix (Table 1) lists the expression values of each gene across cell lines with variants in CTRP and RNA in CCLE. The matrix can be used to develop transcriptome-based drug-response prediction models. 37,38 The BioMedical Evidence Graph Data Releases. The BMEG resource was designed to be portable and open, with multiple ways to access the data (Fig 3). The graph query engine that runs the system is open source and easy to install, and all the compiled source files are made available for bulk download. This will allow other researchers to build on our existing system and to reuse the collected data. We provide translations of the BMEG to make it compatible with a number of different query engines. Part of the BMEG toolkit is a set of scripts to translate the data set and load it into other graph database systems, including Neo4J (San Mateo, CA) and Dgraph (Dgraph Labs, San Francisco, CA).

DISCUSSION
Recently, several graph-based data integration projects have appeared, including biograkn, 39 Biograph, 40 Bio4j (discontinued), 41 Bio2RDF, 42 and Hetionet. 43 Many of these systems were built to aggregate pathway and link genotype to phenotype. The BMEG holds genomic, transcriptomic, and phenotypic data from cancer cases, as well as from cell-line samples, pathway data, genomic descriptions, and extractions from genome-variant knowledge bases. The system includes unit tests composed of built-in Python conversion code, implemented in a Travis continuous integration facility, for technical validation to ensure data are copied and represented accurately. The unique accumulation of various high-quality data types differentiates the BMEG from other data systems. As demonstrated in the example queries, data interrogation can traverse sample mutations, pathway descriptions, knowledge bases, and drug-response data, all within a few lines of query code.  The fundamental idea of the BMEG is to define a connected data set to enable many possible investigations without the effort needed to collect, normalize, and merge information across disparate systems, thus saving time and effort to focus on research questions rather than data wrangling. Adopting systems like the BMEG will drive analyses that can tap a wide range of sample data, with structured annotations, allowing for a number of feature and prediction label combinations for machine-learning applications to support new pattern discovery.

ACKNOWLEDGMENT
We thank the referees for helpful suggestions.