CIViCpy: A Python Software Development and Analysis Toolkit for the CIViC Knowledgebase

PURPOSE Precision oncology depends on the matching of tumor variants to relevant knowledge describing the clinical significance of those variants. We recently developed the Clinical Interpretations for Variants in Cancer (CIViC; civicdb.org) crowd-sourced, expert-moderated, and open-access knowledgebase. CIViC provides a structured framework for evaluating genomic variants of various types (eg, fusions, single-nucleotide variants) for their therapeutic, prognostic, predisposing, diagnostic, or functional utility. CIViC has a documented application programming interface for accessing CIViC records: assertions, evidence, variants, and genes. Third-party tools that analyze or access the contents of this knowledgebase programmatically must leverage this application programming interface, often reimplementing redundant functionality in the pursuit of common analysis tasks that are beyond the scope of the CIViC Web application.
METHODS To address this limitation, we developed CIViCpy (civicpy.org), a software development kit for extracting and analyzing the contents of the CIViC knowledgebase. CIViCpy enables users to query CIViC content as dynamic objects in Python. We assess the viability of CIViCpy as a tool for advancing individualized patient care by using it to systematically match CIViC evidence to observed variants in patient cancer samples.
RESULTS We used CIViCpy to evaluate variants from 59,437 sequenced tumors of the American Association for Cancer Research Project GENIE data set. We demonstrate that CIViCpy enables annotation of > 1,200 variants per second, resulting in precise variant matches to CIViC level A (professional guideline) or B (clinical trial) evidence for 38.6% of tumors.
CONCLUSION The clinical interpretation of genomic variants in cancers requires high-throughput tools for interoperability and analysis of variant interpretation knowledge. These needs are met by CIViCpy, a software development kit for downstream applications and rapid analysis. CIViCpy is fully documented, open-source, and available free online.


INTRODUCTION
The use of massively parallel sequencing to profile the molecular composition of human tissues has become increasingly commonplace in the clinical setting to inform diagnosis and therapeutic strategy for patients' tumors. 1,2 This has led to an ever-growing body of biomedical literature describing the impact of tumor variants on disease progression and response to therapy, creating a bottleneck of expert review of relevant literature to construct a clinical report. 3 The Clinical Interpretations for Variants in Cancer (CIViC) community knowledgebase (civicdb.org) is a platform for expert crowdsourcing of the clinical interpretation of variants in cancer. 4 To date, CIViC contains 6,471 interpretation evidence records (ie, clinical significance statements extracted from biomedical literature) describing 2,312 variants in 402 genes. Evidence in CIViC is used to construct interpretation assertions of clinical significance (ie, therapeutic, prognostic, diagnostic, or predisposing effects) of gene variants on the basis of published criteria and guidelines for the classification of variant interpretations. 5,6 CIViC evidence and assertions are also linked to data classes describing genes, drugs (if applicable), and diseases, in addition to the myriad supporting data for tracking the provenance and community activity surrounding these concepts and their relationships. These data are released under a Creative Commons public domain dedication (CC0), promoting their redistribution and use in downstream applications.
As a curation platform for the Clinical Genome Resource (ClinGen) Somatic Working Group, 7 CIViC supports the export of generated assertions to ClinVar, in line with existing ClinGen submission practices for germline diseases. 8 This was accomplished through the development of the civic2clinvar export utility and Python package, which constructs ClinVar-style submission records from CIViC assertions. 7 In developing civic2clinvar, several unmet needs in building an application on the CIViC database and application programming interface (API) were identified: (1) simplified retrieval and use of CIViC records as native Python objects, (2) routines for local caching of CIViC data for analysis, (3) support for high-throughput queries, and (4) export of CIViC records to established variant representation formats, such as variant call format (VCF). 9 Here, we describe CIViCpy, a software development kit (SDK) that addresses these needs and enables rapid downstream tool development and analysis by removing the burden of implementing these features in independent applications. We demonstrate the use of the SDK in an associated analysis notebook to evaluate 59,437 tumors from patients cataloged in the American Association for Cancer Research Project GENIE cohort. 10 CIViCpy is open-source, Massachusetts Institute of Technology (MIT) licensed, and readily available for installation on the Python Package Index (PyPI; pypi.org). CIViCpy documentation is available online at civicpy.org.

ASSOCIATED CONTENT
Data Supplement
Author affiliations and support information (if applicable) appear at the end of this article.
Accepted on January 15, 2020 and published at ascopubs.org/journal/cci on March 19, 2020: DOI https://doi.org/10.1200/CCI.19.00127

METHODS
We designed the CIViCpy Python SDK as a standalone package to retrieve the CIViC knowledgebase content and transform responses into Python objects with intuitive structures and interobject linkages. The resulting software is a toolkit to support numerous downstream operations, including exploratory analyses, variant annotation, and application development (Fig 1). Here, we describe the optimizations and design choices made to construct CIViCpy.

CIViCpy Objects
The primary data class in CIViCpy is the CivicRecord. This class provides the framework for all first-class entities in CIViC: Genes, Variants, Variant Groups, Evidence, and Assertions. First-class entities are delineated from other object classes in CIViC by the combination of persistent public identifiers, dedicated API end points for returning object details, and tracked provenance (Table 1). Provenance tracking records the history of all actions taken on the object as part of the CIViC curation cycle: object submission, revisions, and editor approval. We also create CivicRecord objects from CIViC Sources, Users, and Organizations, despite their lack of provenance tracking; CivicRecord objects only require that a CIViC class is identifiable and has supporting API end points. Documentation for each of the CivicRecord subclasses can be found online at http://bit.ly/civicrecord-types.
The CIViCpy CivicAttribute is a data class for representing composite data entities not captured by CivicRecord. This includes composite entities with nested or list attributes (eg, diseases, coordinates, or variant_aliases), as opposed to primitive, single-valued entities (eg, description, allele_registry_id, or evidence_direction). CivicAttribute inherits from CivicRecord but is not indexed and accordingly overrides many of the features of its parent class. Importantly, CivicAttribute is not cached (see Caching) except as a linked object to other (non-CivicAttribute) CivicRecord objects, and cannot be retrieved independently.
One of the strengths of the CivicRecord class is the ability to dynamically evaluate nested CivicRecord objects. A Variant, for instance, may have multiple associated Evidence records, each of which may have a source linked to multiple Evidence records describing other Variants. A CivicRecord will automatically link nested objects; consequently, one can chain through linked objects when analyzing CIViC records to efficiently get to values of interest, such as evaluating the Association for Molecular Pathology/ASCO/College of American Pathologists somatic variant classification 6 for a CIViC Assertion (Fig 1, Object Inspection). Therefore, when evaluating evidence for a variant using CIViCpy, the associated Evidence objects are returned rather than a list of evidence identifiers. Evidence, in turn, will link to other CivicRecord objects (eg, Assertion, Source), which can also be explored. An example of exploring Assertions linked to a Variant using CIViCpy is provided in the Data Supplement (see the CIViCpy Objects section).
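The linking behavior described above can be pictured with a minimal, self-contained sketch. It does not use CIViCpy itself: the Record class, the REGISTRY and LINKS dictionaries, and all field names are hypothetical stand-ins, assumed here only to show how stored identifiers might be resolved into full objects when an attribute is accessed.

```python
"""Toy records that resolve nested identifiers into objects on access.

Every name here (Record, REGISTRY, LINKS, field names) is hypothetical
and is not part of the CIViCpy API.
"""

# Lookup table: (record type, id) -> Record
REGISTRY = {}

# Attribute name -> record type that its stored identifiers refer to
LINKS = {"evidence": "evidence", "variants": "variant"}


class Record:
    def __init__(self, record_type, record_id, **fields):
        self.type = record_type
        self.id = record_id
        self._fields = fields
        REGISTRY[(record_type, record_id)] = self

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        fields = self.__dict__.get("_fields", {})
        if name not in fields:
            raise AttributeError(name)
        value = fields[name]
        if name in LINKS:
            # Swap stored identifiers for the linked Record objects.
            return [REGISTRY[(LINKS[name], i)] for i in value]
        return value


# Invented example data: one variant linked to two evidence records.
variant = Record("variant", 1, name="example variant", evidence=[10, 11])
Record("evidence", 10, evidence_level="A", variants=[1])
Record("evidence", 11, evidence_level="B", variants=[1])

# Chain through linked objects instead of handling identifiers.
levels = [e.evidence_level for e in variant.evidence]
round_trip = variant.evidence[0].variants[0]
```

Because resolution happens at access time in this sketch, records can be registered in any order, and chaining from a variant to its evidence and back returns the same variant object.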

Querying CIViC
CIViC is built on a segregated server/client architecture, where all functionality of the CIViC Web client is managed through RESTful API calls to the underlying Rails server. CIViCpy leverages this architectural design to post complex queries to the CIViC advanced search API end points. This enables high-throughput searches for records of interest, including full data set requests. CIViC full data sets include Evidence and Assertions in multiple states of the CIViC review cycle: those that have been editor-reviewed and approved (accepted), those that are pending review (submitted), and those that have been rejected for inclusion in CIViC. Because the CIViC full data set contains rejected and submitted Evidence items and Assertions that have not been approved by CIViC editors, these records may be inaccurate, in a partial state, or incongruent with the CIViC knowledge model.
CivicRecord objects may be retrieved by the corresponding get_all functions (eg, get_all_variants(), get_all_assertions()). These functions may optionally be passed a parameter for explicitly including only objects of a given status. For example, a user may request only evidence that has been accepted or submitted (and exclude any rejected evidence). Although these behaviors are readily reproducible by users without leveraging this feature (eg, through inspection of evidence.status), we expect most downstream workflows would desire to include only accepted and/or submitted evidence, and so we have provided this functionality as a convenience. An example of filtering evidence by status using CIViCpy is provided in the Data Supplement (see the Filtering Evidence section).
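As a rough illustration of status filtering, the sketch below applies the idea to invented in-memory data. The function name filter_by_status, its include_status parameter, and the dictionary layout are assumptions made for this example and should not be read as the CIViCpy signatures.

```python
"""Toy status filter on invented records; not the CIViCpy API."""

EVIDENCE = [
    {"id": 1, "status": "accepted"},
    {"id": 2, "status": "submitted"},
    {"id": 3, "status": "rejected"},
    {"id": 4, "status": "accepted"},
]


def filter_by_status(records, include_status=None):
    """Return the records whose status is listed in include_status.

    With include_status=None, all records are returned regardless of status.
    """
    if include_status is None:
        return list(records)
    allowed = set(include_status)
    return [r for r in records if r["status"] in allowed]


# Keep accepted and submitted evidence, excluding rejected evidence.
kept = filter_by_status(EVIDENCE, include_status=["accepted", "submitted"])
kept_ids = [r["id"] for r in kept]
```

The same result could be obtained by inspecting each record's status after retrieval; passing the filter as a parameter simply moves that check into the retrieval step.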

Caching
CIViCpy also serves as a standalone component for services intended to perform large-volume operations on the CIViC knowledgebase. A key design consideration, therefore, is the local caching of CIViC content for quick retrieval. Each cache maintains a timestamp of when it was last generated. This information is used when loading the cache from file to determine whether a fresh cache needs to be built or retrieved from a remote source. CIViCpy will expire a cache 7 days (or a user-configurable length of time) after it is initially built and will then retrieve the newest (nightly) cache from the CIViC live server. The local cache can also be manually updated from the command line with the civicpy update utility.
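The timestamp-based expiry described here can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than CIViCpy's cache code: the function names, the pickle payload layout, and the rebuild callback are all hypothetical, and the now argument exists only so the example can simulate the passage of time.

```python
"""Toy timestamped cache with expiry; not CIViCpy's actual cache code."""
import pickle
import tempfile
import time
from pathlib import Path

SEVEN_DAYS = 7 * 24 * 60 * 60  # default cache lifetime, in seconds


def save_cache(path, records, now=None):
    """Store records together with the time the cache was generated."""
    payload = {"generated": time.time() if now is None else now,
               "records": records}
    Path(path).write_bytes(pickle.dumps(payload))


def load_cache(path, rebuild, max_age=SEVEN_DAYS, now=None):
    """Return cached records, rebuilding first if missing or expired.

    `rebuild` is a caller-supplied function that fetches fresh records.
    """
    now = time.time() if now is None else now
    path = Path(path)
    if path.exists():
        payload = pickle.loads(path.read_bytes())
        if now - payload["generated"] < max_age:
            return payload["records"]
    records = rebuild()
    save_cache(path, records, now=now)
    return records


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "cache.pkl"
    calls = []

    def fake_fetch():
        # Stand-in for downloading a fresh copy of the data.
        calls.append(1)
        return ["record-%d" % len(calls)]

    first = load_cache(cache_file, fake_fetch, now=0)                # built
    second = load_cache(cache_file, fake_fetch, now=60)              # reused
    third = load_cache(cache_file, fake_fetch, now=SEVEN_DAYS + 1)   # expired
```

Storing the generation time inside the cache file, rather than relying on file modification time, keeps the expiry decision independent of how the file was copied or restored.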

Variant Coordinate Search
When loading all variants from CIViC, a sorted variant coordinate index is also constructed after cache generation to enable coordinate search and lookup strategies (Fig 2A). A similarly sorted list of CoordinateQuery objects represents the variants to query, such as those observed in a patient's tumor (Fig 2B). The index and CoordinateQuery objects are used by the search algorithm for high-throughput searches (Fig 2C). Importantly, the search algorithm supports the notion of variant ranges for CIViC records and queries, and provides several search modes to accommodate varying sensitivity and specificity tradeoffs (Fig 2D).
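One way to picture a search over a sorted index and a sorted query list is a single sweep that keeps a small set of still-relevant index entries. The sketch below is an assumption-laden illustration, not the algorithm in Fig 2C: the Interval tuple, the bulk_search function, and the restriction to two modes ("exact" and "any") are choices made for this example.

```python
"""Toy sweep over sorted coordinates; not CIViCpy's actual search code."""
from collections import namedtuple

# Inclusive coordinates; `label` identifies the record or the query.
Interval = namedtuple("Interval", "chrom start stop label")


def matches(entry, query, mode):
    """Return True if an index entry matches a query under `mode`."""
    if entry.chrom != query.chrom:
        return False
    if mode == "exact":
        return (entry.start, entry.stop) == (query.start, query.stop)
    if mode == "any":
        return entry.start <= query.stop and query.start <= entry.stop
    raise ValueError("unknown search mode: %r" % mode)


def bulk_search(index, queries, mode="any"):
    """Match sorted queries against a sorted index in one pass.

    Both inputs must be sorted by (chrom, start, stop).
    Returns {query label: [labels of matching index entries]}.
    """
    results = {}
    active = []  # entries that may still match this or a later query
    i = 0
    for q in queries:
        # Admit entries that start at or before the end of this query.
        while (i < len(index)
               and (index[i].chrom, index[i].start) <= (q.chrom, q.stop)):
            active.append(index[i])
            i += 1
        # Drop entries on earlier chromosomes or ending before this query
        # starts; with sorted queries they cannot match later ones either.
        active = [e for e in active
                  if e.chrom == q.chrom and e.stop >= q.start]
        results[q.label] = [e.label for e in active if matches(e, q, mode)]
    return results


index = sorted([
    Interval("7", 100, 100, "snv"),
    Interval("7", 90, 200, "range"),
    Interval("9", 50, 60, "other"),
])
queries = sorted([
    Interval("7", 100, 100, "q1"),
    Interval("7", 150, 150, "q2"),
    Interval("9", 10, 20, "q3"),
])
exact_hits = bulk_search(index, queries, mode="exact")
any_hits = bulk_search(index, queries, mode="any")
```

In this toy data the exact mode matches only the query whose coordinates equal an index entry, while the any mode also reports the overlapping range entry, which mirrors the sensitivity and specificity tradeoff between search modes.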

Variant Exports
CIViCpy enables the export of CIViC Variant records and their associated Evidence Items and Assertions to VCF (Fig 1).
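To make the idea of a VCF export concrete, the following sketch writes a few invented variant dictionaries as minimal VCF records. It is not CIViCpy's exporter: the write_vcf function, the dictionary keys, and the EID INFO field are hypothetical and chosen only for illustration.

```python
"""Toy VCF writer; dictionary keys and the EID INFO key are hypothetical."""
import io

VCF_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def write_vcf(variants, handle):
    """Write variant dictionaries as minimal VCF records, sorted by position."""
    handle.write("##fileformat=VCFv4.2\n")
    handle.write('##INFO=<ID=EID,Number=.,Type=String,'
                 'Description="Linked evidence identifiers (illustrative)">\n')
    handle.write("\t".join(VCF_COLUMNS) + "\n")
    for v in sorted(variants, key=lambda v: (v["chrom"], v["pos"])):
        info = "EID=" + ",".join(str(e) for e in v["evidence_ids"])
        # QUAL and FILTER are left as missing values (".").
        row = [v["chrom"], str(v["pos"]), str(v["id"]), v["ref"], v["alt"],
               ".", ".", info]
        handle.write("\t".join(row) + "\n")


# Invented records for illustration only.
variants = [
    {"id": 2, "chrom": "7", "pos": 200, "ref": "A", "alt": "T",
     "evidence_ids": [5, 6]},
    {"id": 1, "chrom": "7", "pos": 100, "ref": "G", "alt": "C",
     "evidence_ids": [4]},
]
buffer = io.StringIO()
write_vcf(variants, buffer)
lines = buffer.getvalue().splitlines()
```

Carrying linked evidence identifiers in the INFO column is one plausible way to keep the association between a variant and its supporting records when the file is consumed by an external annotation pipeline.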
User documentation is written using reStructuredText markup language and the Sphinx documentation framework (sphinx-doc.org). Documentation is hosted on Read the Docs (readthedocs.org) and can be viewed at civicpy.org.
This documentation serves as both a "quick start" guide and a detailed reference for developers and bioinformaticians.
This project is licensed under the MIT License (https://opensource.org/licenses/MIT). CIViCpy has been packaged and uploaded to the PyPI under the civicpy package name and can be installed by running the pip install civicpy command. Installation requires a Python version 3.7 environment. Releases are also made available on GitHub (https://github.com/griffithlab/civicpy/releases).

GENIE Analysis
To evaluate the performance of CIViCpy in annotating patient data, we performed a demonstrative analysis in a Jupyter Notebook, available on the CIViCpy GitHub repository (git.io/civicpy-genie). The Project GENIE 10 version 5.0 extended mutations file, which describes 445,655 variants across 59,437 patient tumors, was downloaded from https://www.synapse.org/#!Synapse:syn17394041. Coordinates from the reported variants were extracted and each coordinate set was tagged using the corresponding tumor sample barcode.
The extracted coordinates were then transformed into a sorted list of CIViCpy CoordinateQuery objects, which were passed to the bulk-query search method using an exact search strategy (Fig 2D). Match results and query times were recorded for the full set of GENIE variants, in addition to timings for exponentially increasing subsets of one to 300,000 variants. Match results from an anticonservative search strategy, which allowed for any coordinate overlap (Fig 2D), were also recorded.
Match results from the full set of variants were grouped by tumor identifier and summarized by counts of Exact, Any, and no matches. In addition, tumors for which no variations were reported for querying were also summarized. Finally, we grouped tumors on the number of variants matching CIViC evidence by Exact search, and summarized the highest level of evidence found across the tumors in those groups.
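The grouping and summary steps can be sketched on a handful of invented match results. The data layout, the summarize function, and the tumor labels below are assumptions for illustration and do not reproduce the analysis notebook.

```python
"""Toy per-tumor summary of match results; all data are invented."""
from collections import Counter, defaultdict

# (tumor barcode, evidence levels matched by Exact search for one variant).
# An empty list means that variant had no Exact match.
MATCHES = [
    ("T1", ["B", "D"]),
    ("T1", []),
    ("T2", ["A"]),
    ("T2", ["C"]),
    ("T3", []),
]


def summarize(matches):
    """Return per-tumor matched-variant counts and highest evidence level."""
    matched = defaultdict(int)
    best = {}
    for tumor, levels in matches:
        matched[tumor] += 1 if levels else 0
        if levels:
            # Levels are letters A-E, so the alphabetical minimum is the
            # highest level of evidence.
            candidates = levels + ([best[tumor]] if tumor in best else [])
            best[tumor] = min(candidates)
    return dict(matched), best


matched_counts, highest_level = summarize(MATCHES)

# Number of tumors with zero, one, two, ... matched variants.
tumors_by_match_count = Counter(matched_counts.values())
```

Tumors with no matched variants still appear in the counts with a value of zero, which keeps them in the denominator when proportions are reported.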

GENIE Tumor Variant Results
Using the Exact search strategy, CIViCpy successfully matched CIViC evidence to 7.6% (n = 34,642) of GENIE variants, and using the Any strategy, an additional 42.9% of variants (n = 195,349) were matched (Fig 3A). Furthermore, 46.3% of tumors (n = 27,545) had at least one Exact match to a reported variant and an additional 40.2% of tumors (n = 23,925) had at least one Any match. Notably, 8.7% of tumors (n = 5,200) in the cohort had no reported variants to query against the knowledgebase. We evaluated the highest CIViC evidence level reported for the 27,545 tumors with matched evidence, and found 14.0% of those tumors (n = 3,852) matched to a CIViC Validated Association (Level A). An additional 69.3% of tumors (n = 19,098) matched to Clinical Evidence (Level B). Far fewer tumors had Case Study (Level C, 11.8%; n = 3,248), Preclinical (Level D, 3.9%; n = 1,066), or Inferential (Level E, 1.0%; n = 281) as the highest-level evidence match (Fig 3B). In total, 38.6% (n = 22,950) of all GENIE tumors (including those without reported variants) had at least one variant that matched to Level A or B evidence. Tumors from the cohort had an average of 6.88 (median, 4) variants reported, and although most tumors (53.7%; n = 31,892) did not Exact match to CIViC variants, many tumors matched one (35.4%; n = 21,026), two (9.4%; n = 5,592), three (1.3%; n = 796), or more (0.2%; n = 131) CIViC variants.

CIViCpy Search Performance
CIViCpy uses a search strategy (Fig 2C) designed to scale linearly with input query size. We evaluated the performance of CIViCpy annotation of the GENIE cohort and found that queries with ≥ 1,000 variants scale linearly with input size, increasing total search time by approximately 0.808 ms/variant (Fig 3C). Queries with < 1,000 variants exhibit higher performance and are returned in < 1 second. We annotated the entire GENIE data set (n = 445,655 variants) in 369 seconds at a rate of 1,236 variants/s.

DISCUSSION
CIViCpy is an SDK and high-throughput analysis toolkit for exploring and analyzing content within the CIViC knowledgebase. CIViCpy has tools for object inspection, with convenient features for record retrieval and search. Variant annotation through CIViCpy is demonstrated to handle variant searches at 1,236 variants/s through the provided coordinate search methods. In addition, the SDK provides convenient tools for exporting CIViC content to VCF for integration in external annotation pipelines and tools.
The CIViCpy SDK has demonstrated utility in downstream applications, including the previously published civic2clinvar utility 7 and the Variant Interpretation for Cancer Consortium. 11 Additional features to improve CIViCpy are already being planned, including extensions for other variant export formats such as the Browser Extensible Data (BED) format 12 and the Global Alliance for Genomics and Health VR specification. 13 These extensions will also support export of variant types beyond the single-nucleotide variants and insertions/deletions currently supported by our VCF export utility. In addition, we are planning to develop utilities to allow users to annotate their own VCFs with CIViC data. We are also developing strategies to incorporate CIViC Drug, Disease, and Phenotype entities as full CivicRecord objects.
Finally, we have provided all source code for CIViCpy (git.io/civicpy) and the analyses in this manuscript (git.io/cpygenie) in a public repository under the permissive MIT license and have uploaded CIViCpy distributions to the PyPI for ease of installation. The permissive licensing and easy installation through PyPI allow for rapid integration into existing analytical workflows. See our documentation and project homepage at civicpy.org for more details.