Open-Sourced CIViC Annotation Pipeline to Identify and Annotate Clinically Relevant Variants Using Single-Molecule Molecular Inversion Probes

PURPOSE Clinical targeted sequencing panels are important for identifying actionable variants for patients with cancer; however, existing approaches do not provide transparent and rationally designed clinical panels to accommodate the rapidly growing knowledge within oncology. MATERIALS AND METHODS We used the Clinical Interpretations of Variants in Cancer (CIViC) database to develop an Open-Sourced CIViC Annotation Pipeline (OpenCAP). OpenCAP provides methods to identify variants within the CIViC database, build probes for variant capture, use probes on prospective samples, and link somatic variants to CIViC clinical relevance statements. OpenCAP was tested using a single-molecule molecular inversion probe (smMIP) capture design on 27 cancer samples from 5 tumor types. In total, 2,027 smMIPs were designed to target 111 eligible CIViC variants (61.5 kb of genomic space). RESULTS When compared with orthogonal sequencing, CIViC smMIP sequencing demonstrated a 95% sensitivity for variant detection (n = 61 of 64 variants). Variant allele frequencies for variants identified on both sequencing platforms were highly concordant (Pearson’s r = 0.885; n = 61 variants). Moreover, for individuals with paired tumor and normal samples (n = 12), 182 clinically relevant variants missed by orthogonal sequencing were discovered by CIViC smMIP sequencing. CONCLUSION The OpenCAP design paradigm demonstrates the utility of an open-source and open-access database built on attendant community contributions with peer-reviewed interpretations. Use of a public repository for variant identification, probe development, and variant interpretation provides a transparent approach to build dynamic next-generation sequencing–based oncology panels.


INTRODUCTION
Despite recognition that genomics plays an important role in tumor prognosis, diagnosis, and treatment, scaling genetic analysis for routine analysis of most tumor specimens has been unattainable. 1,2 Barriers preventing widespread incorporation of genomic analysis into treatment protocols include costs associated with genomic sequencing and analysis, 3 computational limitations preventing timely identification of relevant variants, 3 and rapidly evolving knowledge of the clinical actionability of variants. 4 Technologic improvements in sequencing and data analysis continue to reduce these first 2 limitations; however, less progress has been made in integrating dynamic genomic annotation into clinical workflows. More than 22% of oncologists have acknowledged limited confidence in their own understanding of how genomic knowledge applies to patients' treatment, and 18% reported testing patients' genetics infrequently. 5 In the face of exponential growth in clinically relevant genomic findings, driven by precision oncology efforts, there will likely be increased inability for physicians to command the most current information, resulting in increasing delay between academic discovery and clinical utility. This information gap has been described as the interpretation bottleneck. [4][5][6] Alleviating the interpretation bottleneck will require codevelopment of targeted sequencing panels, bioinformatic tools, and variant knowledgebases that effectively elucidate and annotate clinically actionable variants from sequencing data. 7,8 These requirements each raise separate challenges. 
With regard to targeted panel development, commercial and academic pancancer clinical gene capture panels have now become commonplace, with at least 2 obtaining US Food and Drug Administration approval (FoundationONE CDx 9 [Foundation Medicine, Cambridge, MA] and Memorial Sloan Kettering-Integrated Mutation Profiling of Actionable Cancer Targets 10 [Memorial Sloan Kettering Cancer Center, New York, NY]). Even so, few panels indicate how genomic loci are selected for panel inclusion (Data Supplement), and none have proposed a sustainable or scalable mechanism to allow for panel evolution over time in response to knowledge advances in molecular oncology. With regard to bioinformatic tool development, the OncoPaD 11 portal provides one of the only methods to create rationally designed panels by linking clinically relevant variants to genomic loci on the basis of a cohort of tumor samples; however, this database is not directly linked to actively updated clinical interpretations with detailed underlying evidence. The final challenge, building knowledgebases for variant interpretation, is perhaps the greatest and most persistent. Commercial entities typically rely on the manual curation and organization of research findings into structured databases, which are expensive to create and maintain, forcing companies to limit public access or to charge for use. The resulting lack of transparency creates inefficiencies in the field through unnecessary replication of curation effort and suboptimal communication with clinicians, ultimately hindering development of effective patient treatment plans.
Separately, governmental and academic institutions have developed variant interpretation resources, such as the Catalogue of Somatic Mutations in Cancer, 12 ClinVar, 13 and cBioPortal, 14,15 that have drastically improved research efforts and academic discovery; however, these resources do not have well-supported (evidence-based) clinical relevance summaries for cancer variants that can be easily accessed and used by physicians. Several resources provide detailed clinical interpretation of cancer variants (eg, OncoKB, 16 JAX Clinical Knowledgebase, 17 and others), but these databases are either limited by license restrictions or closed curation models.
To address these limitations, we developed a method to identify, capture, and annotate variants using the Clinical Interpretation of Variants in Cancer (CIViC) database. 18 The CIViC database is a freely accessible (public domain content), publicly curated, expert-moderated repository of therapeutic, prognostic, predisposing, and diagnostic information in precision oncology. 19 The database provides a powerful platform for panel development and variant annotation for the following reasons: each variant within CIViC is described by clinical relevance summaries linked to medical literature; the history of curation within CIViC is stored and publicly available to all users; and CIViC has an open-source, open-access application programming interface (API) for external query. Using the CIViC database and API, we developed the Open-Sourced CIViC Annotation Pipeline (OpenCAP) for creating custom capture panels, executing capture panel sequencing on prospective samples, identifying variants from sequencing data, and annotating variants for clinical relevance. 20 An exemplary clinical capture panel was created using OpenCAP to demonstrate utility. Specifically, variants within the CIViC database were identified based on clinical relevance, and single-molecule molecular inversion probes (smMIPs) were designed to target variants of interest. This panel was used on cancer samples to evaluate design, and identified somatic variants were compared with orthogonal sequencing. Variants identified via smMIP capture were linked back to the CIViC database for clinical annotation (Fig 1). Ultimately, this method could be used to rapidly and efficiently link variants to clinical relevance summaries, enabling the development of custom capture panels for a variety of clinical and research scenarios.
library preparation, and high-throughput sequencing. The final sections describe identifying variants from raw sequencing data and annotating those variants for clinical relevance.

Determining Eligible CIViC Variants for smMIP Capture
Variants in CIViC were filtered using their Variant Evidence Score (required > 20 points) and sequence ontology identification numbers (SOIDs; must be DNA based; Appendix). Variants were also excluded if all evidence supported only germline clinical relevance, if evidence was directly conflicting, or if a majority of evidence in a container variant (eg, MUTATION) pointed to a hotspot that was already being covered. The remaining variants were eligible for the CIViC smMIP capture panel.
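The eligibility rules above can be sketched in a few lines of Python. This is a minimal illustration, not the OpenCAP implementation: the `CivicVariant` record, its field names, and the example entries are hypothetical, the score threshold is assumed to be a strict greater-than comparison, and the container-variant hotspot rule is omitted because it depends on curated evidence that is not modeled here.

```python
from dataclasses import dataclass

MIN_EVIDENCE_SCORE = 20  # assumed strict threshold: score must exceed 20 points


@dataclass
class CivicVariant:
    """Hypothetical record; field names are illustrative, not CIViC's schema."""
    gene: str
    name: str
    evidence_score: float
    soid_class: str              # "DNA-based", "RNA-based", or "protein-based"
    germline_only: bool = False
    conflicting_evidence: bool = False


def is_eligible(variant: CivicVariant) -> bool:
    """Return True if the variant passes the score, SOID, and evidence filters."""
    return (
        variant.evidence_score > MIN_EVIDENCE_SCORE
        and variant.soid_class == "DNA-based"
        and not variant.germline_only
        and not variant.conflicting_evidence
    )


# Made-up records for illustration only.
candidates = [
    CivicVariant("GENE_A", "hotspot missense", 45.0, "DNA-based"),
    CivicVariant("GENE_B", "low evidence", 12.5, "DNA-based"),
    CivicVariant("GENE_C", "expression", 60.0, "RNA-based"),
    CivicVariant("GENE_D", "germline only", 30.0, "DNA-based", germline_only=True),
]
eligible = [v.gene for v in candidates if is_eligible(v)]
print(eligible)
```

Keeping each rule as a separate boolean term makes it straightforward to rerun the filter as the database grows and scores change.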

Designing smMIPs for the CIViC Capture Reagents
Variants were further categorized by length. If the variant length was < 250 base pairs, the variant was eligible for hotspot targeting. If the variant was > 250 base pairs, the variant required either sparse or full tiling of the protein coding exons (Appendix). For all variants, smMIPs were designed and synthesized as previously described 23 with the single alteration that the "-double_tile_strands_separately" 24 flag was used with the MIPgen tool to separately capture each strand of DNA surrounding the target.

Rescue and Annotation of Clinically Relevant Variants
Variants called using the CIViC smMIP capture panel were compared with variants called using original sequencing for samples that had matched tumor and normal sequencing. All genomic loci were manually reviewed 23 using both the smMIP aligned Binary Alignment Map (BAM) files and the original aligned BAM files. Variants only identified using smMIP sequencing were grouped into the following 4 categories: germline polymorphism, pipeline artifact (low variant support or poor mapping), variant support on smMIP sequencing but no support on original sequencing, or variant support on both smMIP sequencing and original sequencing. For variants that showed support on smMIP sequencing but no variant support on original sequencing, the binomial probability was used to assess whether ≤ 3 variant-supporting reads would be detected with 95% confidence using the original coverage and the observed smMIP variant allele frequency (VAF). The accession number for the first release of the Database of Genotypes and Phenotypes study was phs001890.v1.p1, and the accession number for the first release of the Sequence Read Archive was PRJNA529857.

Identification of Eligible CIViC Variants for smMIP Targeting
At the time of the CIViC smMIP capture panel design, there were 988 variants from 275 genes within the CIViC database with at least 1 evidence item. After filtering based on the Variant Evidence Score and the SOID (Appendix, Data Supplement), smMIPs were designed to cover all eligible  (Fig 3). The average consensus read depth for these 65 variants was 2,942 reads (standard deviation, 4,697 reads).
Accuracy of CIViC smMIP variant identification compared with exome or genome variant identification. Of the 65 variants identified on exome sequencing, all but 4 were also identified using CIViC smMIP sequencing (Fig 3). One variant was missed as a result of lack of adequate coverage, 2 variants were missed as a result of low-performing probes, and 1 variant was retrospectively considered ineligible as a result of smMIP design (Appendix). After removing this ineligible variant from the list of eligible variants, CIViC smMIP capture sequencing attained a 95% sensitivity for variant detection (61 of 64 variants).
VAF correlation between CIViC smMIP sequencing and exome or genome sequencing. VAFs obtained via original sequencing were compared with the VAFs obtained using the CIViC smMIPs. To compare VAF quantitation across platforms, the 19 variants obtained from samples that failed the CIViC smMIP sequencing quality check were eliminated (Fig 4A). Subsequently, we eliminated the 4 variants that were not validated using the CIViC smMIP reagents (Fig 4B). When comparing original VAFs to CIViC smMIP VAFs, Pearson correlation for the remaining 61 variants was 0.885. There were several variants whereby the VAF observed by the CIViC smMIP sequencing was lower than that observed by the original sequencing. These outliers were not associated with tumor type, sequencing mass input, average coverage, presence of matched normal, or sample type (Figs 4C to 4F).
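For readers who want to reproduce this kind of cross-platform comparison, the following sketch computes a Pearson correlation between paired VAFs. The function is a plain standard-library implementation, and the VAF values are invented for illustration; they are not the study data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return covariance / (spread_x * spread_y)


# Invented paired VAFs (fractions) for the same variants on two platforms.
original_vaf = [0.10, 0.25, 0.40, 0.55]
smmip_vaf = [0.08, 0.27, 0.35, 0.50]
print(round(pearson_r(original_vaf, smmip_vaf), 3))
```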

Analysis of Variants Only Identified Using CIViC smMIP Sequencing
Using samples that had sequencing data for both tumor and matched normal (n = 12 samples), we evaluated whether the targeted CIViC smMIP sequencing could identify clinically relevant variants that had not been observed by the original sequencing. There were 273 variants recovered by CIViC smMIP sequencing that were not identified using original sequencing. After manually reviewing these variants within the original exome or genome alignments, 55 variants (20.1%) were identified as germline mutations. The clustering of smMIP sequencing VAFs at 50% and 100% further supported that these variants were germline polymorphisms (Fig 5A). An additional 36 variants (13.2%) were thought to be caused by pipeline artifacts and attributable to assumptions underlying automated callers or alignment problems. The majority of these artifacts were associated with nucleotide repeats in the reference sequence (Fig 5B). There were 171 variants (62.6%) called as somatic using CIViC smMIPs that did not have any variant support on the original sequencing. For these variants, we calculated the binomial probability that ≤ 3 reads would support the variant given the original coverage (number of chances to get a variant-supporting read) and the observed smMIP VAF (likelihood that a read would show variant support). If the binomial probability of ≤ 3 variant-supporting reads was > 95%, then it was considered statistically unlikely that a variant would be called using original sequencing data. Using this calculation, 162 variants (94.7%) showed insufficient coverage in the original sequencing for detection (Fig 5C). Finally, 11 variants (4.2%) were not called as somatic on original sequencing but did show some variant support in those original sequencing data. The VAFs observed on original sequencing data were strongly correlated with the VAFs observed using CIViC smMIP sequencing (Pearson's r = 0.92; Fig 5D).
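The binomial calculation described above can be written out directly. The sketch below treats the original coverage as the number of trials and the smMIP VAF as the per-read success probability, then asks whether the probability of seeing 3 or fewer variant-supporting reads exceeds 95%. The function names and the example coverage and VAF values are hypothetical.

```python
from math import comb


def prob_at_most_k(n_reads: int, vaf: float, k: int = 3) -> float:
    """P(X <= k) for X ~ Binomial(n_reads, vaf)."""
    upper = min(k, n_reads)
    return sum(
        comb(n_reads, i) * vaf**i * (1 - vaf) ** (n_reads - i)
        for i in range(upper + 1)
    )


def insufficient_coverage(n_reads: int, vaf: float, threshold: float = 0.95) -> bool:
    """True when 3 or fewer supporting reads are expected with > threshold probability."""
    return prob_at_most_k(n_reads, vaf) > threshold


# Hypothetical cases: shallow coverage of a low-VAF variant vs deep coverage
# of a higher-VAF variant.
print(insufficient_coverage(n_reads=40, vaf=0.01))
print(insufficient_coverage(n_reads=200, vaf=0.10))
```

In the first case fewer than one supporting read is expected, so a miss on the original platform is unsurprising; in the second, roughly 20 supporting reads are expected and a miss would need another explanation.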
Reviewing manual review files from the original sequencing, we observed that 6 of these variants failed manual review as a result of low VAF, 4
within precision oncology to identify variants that are relevant for capture. In addition, the public API permits rapid mapping of identified somatic and germline variants to CIViC clinical relevance summaries. Most importantly, the variants covered by CIViC and associated clinical summaries can be updated in real time as knowledge is entered into the database to accommodate new information discovered within the field of precision oncology.

The smMIP capture method for sequencing provides inherent error correction capability, scalability to detect ultrasensitive variation, and cost effectiveness within a modular design. Combining the public access CIViC database with an ultrasensitive and versatile capture reagent provides an advantageous and principled method for building precision oncology capture reagents. This approach could enable a standardized framework for detecting and interpreting cancer-relevant genomic variation, lowering barriers to use of genomic analysis in the clinical practice of oncology. For maximal flexibility, OpenCAP describes methods for using both unique molecular identifiers (UMIs) and non-UMI-based probes to capture variants of interest.
In summary, the methods described here validate that community curated data on clinically relevant cancer variants can provide a systematic and dynamic method for capture reagent design. The curated coordinates in the database accurately map to desired variants, and probes designed using these coordinates show accurate recapitulation of the genomic landscape described by orthogonal sequencing. It is our hope that OpenCAP will provide the research community with a novel method to develop next-generation sequencing-based oncology panels.
Filtering based on the sequence ontology identification number. Variants were also filtered to only include variants that could be analyzed using a DNA-based sequencing platform. This required use of curated sequence ontology identification numbers (SOIDs). Within CIViC, SOIDs are manually classified as DNA based, RNA based, and/or protein based (Data Supplement). For example, variants with the variant type of "missense_variant" would be labeled as "DNA-based," whereas variants with the variant type of "transcript_variant" would be labeled as "RNA-based."
Variants that had a "DNA-based" SOID were considered eligible for smMIP targeting, and variants whose SOIDs were "RNA-based" and/or "protein-based" were ineligible.

Categorization of Variants Based on Length
Using CIViC curated coordinates, variant length was determined (ie, variant stop position minus variant start position). This length was used to infer the total number of smMIP probes required to adequately assess each variant.
Hotspot targeting. If the variant length was < 250 base pairs, the variant was eligible for hotspot targeting. For variants that required hotspot targeting, smMIP probes were designed for the genomic region indicated in the CIViC database.
Sparse exon tiling and full exon tiling. If the variant was > 250 base pairs, the variant required some or total tiling of the protein coding exons. For all variants that required sparse exon tiling or full exon tiling, the representative transcript from the CIViC database was used to obtain all possible exons associated with each Ensembl gene. The Ensembl gene was used to obtain all possible exons (biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", dataset="hsapiens_gene_ensembl"). Exons were further filtered by Biotype to remove untranslated regions. Some large-scale copy number variants (ie, "AMPLIFICATION," "LOSS," "DELETION") were eligible for sparse tiling, wherein 10 probes distributed across the exons of the gene were retained to enable assessment of copy number state. Other variant types such as "MUTATION" or "FRAMESHIFT MUTATION" required tiling of all protein coding exons. Categorization of all variants eligible for capture is described in the Data Supplement. For variants that required full exon tiling, overlapping smMIPs (ie, at least 1 base pair of overlap) were designed to tile across all protein coding exons in the gene that encompassed the variant. For variants that required sparse exon tiling, approximately 10 smMIPs were designed to cover a portion of the transcript.
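The length-based categorization can be summarized as a small decision function. This is a sketch under stated assumptions rather than the published code: the function names and example coordinates are hypothetical, the threshold is read as strictly fewer than 250 base pairs for hotspot targeting, a variant of exactly 250 base pairs (which the text does not address) is sent to tiling, and every long variant whose name is not one of the listed copy number terms is assumed to need full exon tiling.

```python
HOTSPOT_MAX_BP = 250  # variants shorter than this are targeted as hotspots
SPARSE_TILING_NAMES = {"AMPLIFICATION", "LOSS", "DELETION"}  # copy number variants
SPARSE_PROBE_COUNT = 10  # approximate number of probes retained for sparse tiling


def variant_length(start: int, stop: int) -> int:
    """Length in base pairs from curated start and stop coordinates."""
    return abs(stop - start)


def tiling_strategy(variant_name: str, start: int, stop: int) -> str:
    """Classify a variant as hotspot, sparse exon tiling, or full exon tiling."""
    if variant_length(start, stop) < HOTSPOT_MAX_BP:
        return "hotspot"
    if variant_name.upper() in SPARSE_TILING_NAMES:
        return "sparse_exon_tiling"
    # Assumption: all remaining long variants (eg, MUTATION) get full tiling.
    return "full_exon_tiling"


# Hypothetical coordinates for illustration only.
print(tiling_strategy("V600E", 1_000_000, 1_000_002))
print(tiling_strategy("AMPLIFICATION", 2_000_000, 2_150_000))
print(tiling_strategy("MUTATION", 3_000_000, 3_020_000))
```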

smMIP Sequencing and Data Analytics
Sequencing library construction and balancing of the probe pool were performed as described previously, 21 and sequencing was performed using an Illumina NextSeq 500 (Illumina, San Diego, CA). Probes were excluded from the final reagent if they demonstrated poor hybridization to target sequence during initial quality checks.
Sequence data analysis was performed as previously described 21 with 3 enhancements. First, consensus reads were generated using the fgbio tools (http://fulcrumgenomics.github.io/fgbio/) CallMolecularConsensusReads utility with parameters "--error-rate-post-umi=30 --min-reads=2 --min-input-base-quality=20". Second, a custom variant caller was used to identify all consensus calls at a site having at least 2 supporting reads with a minimum specified mapping quality (mapping quality score > 0). Third, variants were required to be detected on at least 4 DNA strands (at least 2 positive and at least 2 negative) to be considered real, rather than postbiologic artifacts (Eijkelenboom A, et al: J Mol Diagn 18:851-863, 2016). Collectively, these provisions require that at least 2 reads are derived from a common unique molecular identifier to create a consensus read and that multiple consensus reads in both directions support the apparent variant. This helps to exclude preanalytic artifacts reflecting DNA damage and stochastic errors that occur during library construction and sequencing. DNA input ranged from 100 to 500 ng across samples; however, any sample with an overlapping variant that had a variant allele frequency (VAF) < 5% used 500 ng to increase the number of template molecules interrogated.
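The strand-support requirement amounts to a simple check on the number of variant-supporting consensus reads per strand. The sketch below is illustrative only; the function name and its arguments are hypothetical, and the custom variant caller used in the study may apply the rule differently. The example counts mirror the situations described for the missed variants in this Appendix (strong support on one strand only, or too few consensus reads on either strand).

```python
MIN_SUPPORT_PER_STRAND = 2  # at least 2 positive and at least 2 negative strands


def passes_strand_filter(forward_support: int, reverse_support: int,
                         min_per_strand: int = MIN_SUPPORT_PER_STRAND) -> bool:
    """True when variant-supporting consensus reads meet the per-strand minimum."""
    return forward_support >= min_per_strand and reverse_support >= min_per_strand


# Strong forward support but no reverse reads: not called.
print(passes_strand_filter(forward_support=34, reverse_support=0))
# Support in both directions but only 1 reverse consensus read: not called.
print(passes_strand_filter(forward_support=2, reverse_support=1))
# Adequate support on both strands: called.
print(passes_strand_filter(forward_support=5, reverse_support=3))
```

Requiring support from both strands trades some sensitivity for specificity, which is why a single poorly performing probe can cause an otherwise well-supported variant to be missed.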

Assessment of Variants Missed Using the CIViC smMIP Capture Panel
Of the 65 variants identified on exome sequencing, all but 4 were also identified using CIViC smMIP sequencing. One variant was missed as a result of lack of adequate coverage, 2 variants were missed as a result of low-performing probes, and 1 variant was retrospectively considered ineligible as a result of smMIP design. The variant missed as a result of inadequate coverage was a TP53 (p.G266R) variant identified in the AML31 tumor sample. Original sequencing indicated that this variant was present at 0.04% VAF; therefore, given smMIP coverage of 2,388 reads at this site, there was only a 0.01% chance that this variant would have been detected (1-tailed probability of ≥ 4 reads [K] of 2,388 reads [n]; P = .0046). However, this low-prevalence variant could have been recovered given additional sequence coverage. In addition, there were 2 variants missed as a result of low molecular inversion probe (MIP) performance. The first variant that was missed (chr10:g.89690805G>A in the SCLC8 tumor sample at 94% VAF) was a result of poor performance of the MIP covering the region of interest in the reverse direction. This MIP showed only 1 aligned read across all 36 samples and had no aligned reads in SCLC8. Despite the fact that there was extensive support from the forward MIP (95% VAF with 34 of 35 consensus reads), the requirement that both forward and reverse reads show support prevented this variant from being called. The second missed variant (PTEN e8-1 in the SCLC4 tumor sample at 100% VAF) was a result of low performance of MIPs in both directions. Even though both the forward and the reverse MIPs showed variant support, the forward MIP only contained 2 consensus reads and the reverse MIP only contained 1 consensus read, preventing it from being called as somatic.
The final variant (chr17:g.7577094C>T in the CRC5 tumor sample at 32% VAF) was retrospectively considered ineligible because the original smMIPs developed to cover the eligible STK variant called for sparse tiling (ie, identification of copy number change). As such, the variant was contained by a region that did not have full coverage in the forward direction. When evaluating the reverse MIP that contained this site, we observed a 34% VAF (402 of 1,184 reads), which was comparable to the original sequencing data. However, lack of a secondary probe designed against the complementary DNA strand prevented this variant from being called as somatic.

Code and Accessibility
All raw data, analysis, and preprocessing code are publicly available in the GitHub repository (https://github.com/griffithlab/civic-panel/). All plots were produced using the Matplotlib library in Python (Hunter JD: Comput Sci Eng 9:90-95, 2007). The raw sequencing data are publicly available for most projects included in this study (Data Supplement). The smMIP sequence analysis pipeline is accessible on Bitbucket (https://bitbucket.org/uwlabmed/smmips_analysis).

Data Statement
The raw smMIP sequencing data associated with samples from the McDonnell Genome Institute (head and neck squamous cell carcinomas, SCLCs, HLs, and AMLs) have been submitted to the Database of Genotypes and Phenotypes under accession No. phs001890.v1.p1. Institutional review board approval, consent forms and versions, and other demographic data are provided in this submission. The raw smMIP sequencing data associated with samples from Washington University (CRCs) have been submitted to the Sequence Read Archive under accession No. PRJNA529857.