Benchmarking challenging small variants with linked and long reads

Summary Genome in a Bottle benchmarks are widely used to help validate clinical sequencing pipelines and develop variant calling and sequencing methods. Here we use accurate linked and long reads to expand benchmarks in 7 samples to include difficult-to-map regions and segmental duplications that are challenging for short reads. These benchmarks add more than 300,000 SNVs and 50,000 insertions or deletions (indels) and include 16% more exonic variants, many in challenging, clinically relevant genes not covered previously, such as PMS2. For HG002, we include 92% of the autosomal GRCh38 assembly while excluding regions problematic for benchmarking small variants, such as copy number variants, that should not have been in the previous version, which included 85% of GRCh38. It identifies eight times more false negatives in a short read variant call set relative to our previous benchmark. We demonstrate that this benchmark reliably identifies false positives and false negatives across technologies, enabling ongoing methods development.


In brief
Analogous to placing puzzle pieces that look similar, mapping sequences to regions of the genome that look similar is challenging. Wagner et al. describe a new Genome in a Bottle Consortium resource for benchmarking accuracy of human genome sequencing in more challenging regions of the genome.

INTRODUCTION
Advances in genome sequencing technologies have continually transformed biological research and clinical diagnostics, and benchmarks have been critical to ensure the quality of the sequencing results. The Genome in a Bottle Consortium (GIAB) developed extensive data 1 and widely used benchmark sets to assess the accuracy of variant calls resulting from human genome sequencing. [2][3][4] To use these benchmarks, the Global Alliance for Genomics and Health (GA4GH) Benchmarking Team developed tools and best practices for benchmarking. 5 These benchmarks and benchmarking tools helped enable the development and optimization of technologies and bioinformatics approaches, including linked reads, 6 highly accurate long reads, 7 deep-learning-based variant callers, 8,9 graph-based variant callers, 10 and de novo assembly. 11,12 As these new technologies and methods accessed increasingly challenging regions of the genome, studies highlighted many known medically relevant genes that were excluded from these previous benchmarks. 7,[13][14][15] These studies demonstrated the need for improved benchmarks covering segmental duplications, the major histocompatibility complex (MHC), and other challenging regions. A separate synthetic diploid benchmark was generated from assemblies of error-prone long reads for two haploid hydatidiform mole cell lines, but this had limitations in terms of cell line availability and small insertion or deletion (indel) errors because of the high error rate of the long reads. 16 Many of the difficult regions of the genome lie in segmental duplications and other repetitive elements. Linked reads have been shown to have the potential to expand the GIAB benchmark by 68. 9 Mbp to some of these segmental duplications. 6 A circular consensus sequencing (CCS) method was recently developed that enables highly accurate 10-to 20-kb-long reads. 7 This technology identified a few thousand likely errors in the GIAB benchmark, mostly in long interspersed nuclear elements (LINEs). It also had more than 400,000 variants in regions mappable with long reads but outside the benchmark, and it covered many difficultto-map, medically relevant genes that are challenging to call using short reads or lower-accuracy long reads. GIAB recently used these data to produce a local diploid assembly-based benchmark for the highly polymorphic MHC region.
Here we use linked reads and long reads to expand GIAB's benchmark to include challenging genomic regions for the GIAB pilot genome NA12878 and the GIAB Ashkenazi and Han Chinese trios from the Personal Genome Project, which are more broadly consented for genome sequencing and commercial redistribution of reference samples. 17 We more carefully exclude segmental duplications that are copy number variant (CNV) in the GIAB samples 18 or missing copies in GRCh37 or GRCh38 19,20 because these currently cannot be reliably benchmarked for small variants. We also refined the methods used to produce the diploid assemblybased MHC benchmark 21 to include most of the MHC region in each member of the trio. We show that our benchmark reliably identifies false positives (FPs) and false negatives (FNs) across a variety of short-, linked-, and long-read technologies. The benchmark has already been used to develop and demonstrate new variant callers in the precisionFDA Truth Challenge V2. 22

RESULTS
The new benchmark covers more of the reference, including many segmental duplications GIAB previously developed an integration approach to combine results from different sequencing technologies and analysis methods, using expert-driven heuristics and features of the mapped sequencing reads to determine at which genomic positions each method should be trusted. This integration approach excludes regions where all methods may have systematic errors or locations where methods produce different variants or genotypes and have no evidence of bias or error. The previous version (v.3.3.2) primarily used a variety of short-read sequencing technologies and excluded most segmental duplications. 4 Our new HG002 v.4.2.1 benchmark adds long and linked reads to cover 6% more of the autosomal assembled bases for GRCh37 and GRCh38 than v.3.3.2 ( Table 1). The median coverage by linked-and long-read datasets for each genome is shown in Table S1. We also replace the mappingbased benchmark with assembly-based benchmark variants and regions in the MHC. 21 v.4.2.1 adds more than 300,000 single nucleotide variants (SNVs) and 50,000 indels compared with v. 3 and  Table S4.
The new benchmark includes additional challenging genes To focus analysis on potential genes of interest, we analyzed inclusion of genes previously identified to have at least one exon that is difficult to map with short reads, which we call ''difficultto-map, medically relevant genes.'' 13 v.4.2.1 covers 88% of the 10,009,480 bp in difficult-to-map, medically relevant genes on primary assembly chromosomes 1-22 in GRCh38, much larger than the 54% covered by v.3.3.2 (Table S5). 3,913,104 bp of the difficult-to-map, medically relevant genes lie in segmental duplication or low-mappability regions. The v.4.2.1 benchmark includes 2,928,012 bp (74.8%) of those segmental duplications and low-mappability regions, whereas the v.3.3.2 benchmark includes 208,882 bp (5.3%). Future work will be needed to include 49 of the 159 genes on chromosomes 1-22 that still have less than 90% of the gene body included (Figure 2A), such as a recently published assemblybased approach. 20 For example, 5 genes that have potential duplications in HG002 were previously partially included in v.3.3.2 but are excluded in v.4.2.1 because new methods will be needed to resolve and represent benchmark variants in duplicated regions ( Figure 2B). The medically relevant gene KIR2DL1 was partially included in v.3.3.2 but is now completely excluded because the copy number variable KIR region is removed from the v.4.2.1 benchmark regions. v.4.2.1 also does a better job excluding regions that are duplicated in the benchmark sample relative to the reference, specifically because it excludes regions with higher-than-normal PacBio HiFi and/or Oxford Nanopore (ONT) coverage ( Figure 3). We detail the inclusion of each difficult-to-map, medically relevant gene in Table S6.
PMS2 is an example of a medically important gene involved with DNA mismatch repair that is included more by v.4.2.1 (85.6%) than by v.3.3.2 (25.9%) for HG002 ( Figure 4). Variant calling in PMS2 is complicated by the presence of the pseudogene PMS2CL, which contains identical sequences in many of the exons of PMS2 and is within a segmental duplication. 23 Using long-range PCR followed by Sanger sequencing, we  Figure S2).

High Mendelian consistency in trios
To further evaluate the accuracy of the benchmark, we evaluated the Mendelian consistency of our v.4.2.1 benchmark sets for the son, father, and mother in two trios from GIAB of Ashkenazi ancestry (HG002, HG003, and HG004) and Han Chinese ancestry (HG005, HG006, and HG007). In the intersection of the benchmark regions for the Ashkenazi trio, this evaluation identified 2,502 variants that had a genotype pattern inconsistent with Mendelian inheritance of the 4,968,730 variants in at least one member of the trio (0.05%), slightly below the rate for v. 3 Following manual inspection of 10 random de novo SNVs in HG002, 10 of 10 appeared to be true de novo. After manual inspection of 10 random de novo indels, 10 of 10 appeared to be true de novo indels in homopolymers or tandem repeats. The violations that were not heterozygous in the son and homozygous reference in both parents fell in a few categories: (1) clusters of variants in segmental duplications where a variant was missed or incorrectly genotyped in one individual, (2) complex variants in homopolymers and tandem repeats that were incorrectly called or genotyped in one individual, and (3) some overlapping complex variants in the MHC that were correctly called in the trio but the different representations were not reconciled by our method (even though we used a method that is robust to most differences in representation). 4,25 We exclude all Mendelian inconsistencies that are not heterozygous in the son and homozygous reference in both parents from the v.4.2.1 benchmark regions of each member of the trio because most are unlikely to have a biological origin. Conservative paternal|maternal phasing for HG002 on GRCh38 was performed for the MHC using local diploid assembly and outside the MHC using phasing that was consistent between trio analysis and integrated Strand-seq and PacBio HiFi phasing (1,812,845 of 2,449,937 heterozygous benchmark variants).
Regions excluded from the benchmark A critical part of forming a reliable v.4.2.1 benchmark was to identify regions that should be excluded from the benchmark. In Table 2 and Figure S3, we detail each region type that is excluded, the size of the regions, and reasons for exclusion. We describe how each region is defined in STAR Methods. These excluded regions fall into several categories: (1) the modeled centromere and heterochromatin in GRCh38 because these are highly repetitive regions and generally differ in structure and copy number between any individual and the reference; (2) the VDJ, which encodes immune system components and undergoes somatic recombination in B cells; (3) in GRCh37, regions that are expanded or collapsed relative to GRCh38; (4) segmental duplications with more than 5 copies longer than 10 kb and identity greater than 99%, where errors are likely in mapping and variant calling (e.g., because of structural or copy number variation resulting in calling paralogous sequence variants); 26,27 (5) potential large duplications that are in HG002 relative to GRCh37 or GRCh38; (6) putative insertions, deletions, and inversions greater than 49 bp in size and flanking sequence; and (7) (Tables S8 and S9). Many of these were in centromere regions that have very few benchmark variants but were erroneously included in the v.3.3.2 short-read-based benchmark; e.g., in chromosome 20 (chr20). Our new benchmark correctly excludes these regions from the benchmark because they cannot be confidently mapped with short, linked, or long reads used to form the benchmark. Table S10 describes refinements to these excluded regions between the initial draft release and the v.4.2.1 benchmark.
Evaluation and manual curation demonstrates reliability of the benchmark GIAB has established a community evaluation process for draft benchmarks before the official release, following the reliable identification of errors (RIDE) principle for benchmarks. 3 The RIDE principle is designed to ensure that, when comparing state-of-the-art query call sets with the benchmark, at least 50% of the putative false positives and false negatives are errors in the query call set and not errors in the benchmark. GIAB recruited volunteer experts in particular variant calling methods to follow the GA4GH Benchmarking Team's best practices 5 (Table S11; STAR Methods). Each call set developer curated a random selection of FPs and FNs to ensure that the benchmark reliably identifies errors in the query call set. Overall, we found that the benchmark was correct and the query call set was not correct in the majority of FP and FN SNVs and indels ( Figure 5, with all curations in Table S12). Overall, 433 of 452 (96%) curated FP and FN SNVs and indels inside v.3.3.2 benchmark regions and 422 of 463 (91%) outside v.3.3.2 benchmark regions were determined to be correct in the v.4.1 benchmark. Some technologies/variant callers, particularly deep-learning-based variant callers using HiFi data, had more sites where it was unclear whether the benchmark was correct or the query call set was correct. These sites tended to be near regions with complex structural variation or places that appeared to be inside potential large duplications in HG002 but were not identified in our CNV approaches. In general, most sites that were not clearly correct in the benchmark and wrong in the query were in regions where the answer was unclear with current technologies ( Figure 5B). For example, the v.4.1 benchmark correctly excludes much of the questionable region in Figure S4 but still includes some small regions where there may be a duplication and the variant calls in the benchmark and the query are questionable. Future work will be aimed at developing a new benchmark in the small fraction of questionable regions, but these evaluations demonstrate that the new benchmark reliably identifies FPs and FNs across a large variety of variant call sets, including those based on short, linked, and long reads, as well as mapping-based, graph-based, and assembly-based variant callers.

New benchmark regions are enriched for false negatives
We demonstrate the benchmarking utility of v.4.2.1 by comparing an example query call set to the new and old benchmark sets for HG002. For a standard short-read-based call set (Ill GATK-BWA in Figure 5), the number of SNVs missed (even when including filtered variants) was 8.5 times higher when benchmarking against v. 4

DISCUSSION
We present the first diploid small variant benchmark that uses short, linked, and long reads to confidently characterize a broad spectrum of genomic contexts, including non-repetitive regions as well as repetitive regions, such as many segmental duplications, difficult-to-map regions, homopolymers, and tandem repeats. We demonstrated that the benchmark reliably identifies false positives and false negatives in more challenging regions across many short-, linked-, and long-read technologies and variant callers based on traditional methods, deep learning, 8,9 graph-based references, 10 and diploid assembly. 12 The benchmark was used in the precisionFDA Truth Challenge V2 held in 2020. This challenge focused on difficult regions not covered well by the v.3.2 benchmark used in the first Truth Challenge in 2016, and SNV error rates of the winners of the first Truth Challenge increase by as much as 10-fold when evaluated against the v.4.2 benchmark compared with the v.3.2 benchmark. 22 We designed this benchmark to cover as much of the human genome as possible with current technologies as long as the benchmark genome sequence is structurally similar to the   Article ll OPEN ACCESS assembly), standards for representing variants in these regions, and benchmarking methodology and tools. For example, for variants inside segmental duplications for which the individual has more copies than the reference, methods are actively being developed to assemble these regions, 26 but no standards exist for representing on which copy the variants fall or how to compare with a benchmark. We expect that future benchmarks will increasingly use highly contiguous diploid assembly to access the full range of genomic variation. Our current benchmark is helping to enable this transition by identifying opportunities to improve assemblies in the genome regions that are structurally similar to GRCh37 and GRCh38.

Limitations of the study
It is critical to understand the limitations of any benchmark. Because our current benchmark excludes regions that structurally differ from GRCh37 or GRCh38, it will not identify deficiencies in mapping-based approaches because of their reliance on these references or highlight advantages of assembly-based approaches that do not rely on these references. Although we have tried to exclude all regions where our samples differ structurally from the reference, some regions with gains in copy number and other large structural variants remain, particularly in segmental duplications where these are more challenging to identify. Similarly, we may not exclude all inversions, particularly those mediated by segmental duplications. In addition, the benchmark still excludes many indels between 15 and 50 bp in size. We also do not characterize mosaic variants, so methods detecting somatic or mosaic variants may identify true variants missing from the benchmark. Although we have significantly increased our inclusion of difficult-to-map, medically relevant genes, more work remains. Many of these genes are excluded because of putative SVs or copy number gains, and future work will be needed to understand whether these are true SVs or copy number gains, and if so, how to fully characterize these regions. The genomes characterized in this work come from individuals of European, Ashkenazi Jewish, and Han Chinese ancestry, and future benchmarks are needed to understand any potential differences in variant benchmarking for other ancestries.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

ACKNOWLEDGMENTS
We thank the Genome in a Bottle Consortium for ongoing feedback and discussions about the benchmark. We thank participants in the precisionFDA Truth Challenge V2 for helpful feedback about the v.4.2 benchmarks for the trio. We thank Valerie Schneider for advice regarding alignments of GRCh38 to GRCh37. Chunlin Xiao was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health. P.E. and T.M. acknowledge computational infrastructure provided by the Center for Information and Media Technology at the University of D€ usseldorf and funding from the German Research Foundation (grants 391137747 and 395192176) as well as support by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A537B, 031A533A, 031A538A, 031A533B, 031A535A, 031A537C, 031A534A, and 031A532B). Certain commercial equipment, instruments, or materials are identified to specify adequately experimental conditions or reported results. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the equipment, instruments, or materials identified are necessarily the best available for the purpose.

Lead contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Justin Zook (justin.zook@nist.gov). Data and code availability d DNA sequencing data were previously deposited in the NCBI SRA under BioProject PRJNA200694 and are publicly available as of the date of publication. Accession numbers are listed in the key resources table. d All original code has been deposited at Zenodo and is publicly available as of the date of publication. DOIs are listed in the key resources table. d Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

EXPERIMENTAL MODEL AND SUBJECT DETAILS
The Genome in a Bottle Consortium selected the seven human lymphoblastoid cell lines described in Materials availability for characterization because the pilot female HG001 had extensive pre-existing public data, and HG002 to HG007 are two son-father-mother trios from the Personal Genome Project that have a broader consent that permits commercial redistribution and recontacting participants for further sample collection.

Generating callable files with haplotype-separated BAMs
We use the CallableLoci utility in GATK3 to find regions with good coverage of high mapping quality reads. For PacBio HiFi and 10x Genomics read data, we use WhatsHap 32 haplotag to partition reads by haplotype then use the bamtools filter function to generate separate BAM files for the two haplotypes. To partition reads by haplotype, we used a vcf that combined 10x linked read phasing with trio information described in the v0.6 structural variant benchmark paper. 3 For CallableLoci with the unseparated BAM, we set the callable maxDepth threshold to 2 times the median coverage for VCF entries, then the minDepth threshold to 20. For the haplotype separated BAMs, we use median coverage for VCF as the maxDepth and 5 as the minDepth. For PacBio HiFi, we first remove multi-allelic entries from the VCF and 50 bp on each side of the variant then remove RefCall entries with QUAL value below 40 along with 50 bp on each side of those variants. We then find callable regions for each haplotype BAM and the unseparated BAM then use bedtools multiIntersectBed to find the union of those regions.
For 10x Genomics, we first remove filtered indels along with 50 bp on each side from its callable regions. Then we find callable regions on each haplotype and the unseparated BAM. After using multiIntersectBed to find the union of those callable regions we subtract regions that were callable on one haplotype but not callable on the other haplotype.

Python integration
We implemented the integration pipeline using Python as opposed to the Bash and Perl implementation for v3.3.2. The integration maintains a similar structure and we generated a DNAnexus applet to run on the same platform as v3.3.2. We updated the v4.2.1 pipeline to exclude all of a tandem repeat that is only partially covered by the benchmark regions. We also provide an option to not provide a callable file for given callsets, which for v4.2.1 we do not use callable regions for Ion Torrent or SOLiD. This results in a benchmark VCF that includes annotations for those technologies but variants are not excluded based on disagreement with their calls.
Regions excluded from the benchmark We determined regions to exclude from the benchmark where it was not currently possible to determine which technologies were correct due to the difficulty of resolving variation in that region. The difficult regions included those that had a structural variant identified in the GIAB SV v0.6 Benchmark, regions in which the HG002 sample had a copy variation compared to the reference, high depth and highly similar segmental duplications, tandem repeats >10 kb, regions that are collapsed and expanded from GRCh37/38 Primary Assembly Alignments, modeled centromere and heterochromatin, VDJ, and inversions. The bed files excluded from the benchmark are being made available in the v3.00 stratifications from GIAB under https://ftp-trace.ncbi.nlm.nih.gov/ ReferenceSamples/giab/release/genome-stratifications/. We refined these regions with several rounds of internal and external evaluation on intermediate versions of the benchmark. We describe intermediate versions of the benchmark in Table S10.

Modeled centromere and heterochromatin
We use the modeled centromere for GRCh38 from ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/ NISTv3.3.2/GRCh38/supplementaryFiles/genomic_regions_definitions_modeledcentromere.bed and the heterochromatin region ftp:// ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/NA12878_HG001/NISTv3.3.2/GRCh38/supplementaryFiles/genomic_regions_ definitions_heterochrom.bed. 33 VDJ region subject to somatic recombination We downloaded the UCSC genes tracks 34 from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/kgXref.txt.gz and selected entries with ''abParts''. We then subset to chromosomes 2, 14, and 22 which contain the IGK, IGH, and IGL components that make up the VDJ recombination regions. KIR region v4.2.1 excludes the KIR region (chr19:54716689-54871732 in GRCh38 and 19:55228188-55383188 in GRCh37) because it is highly variable in copy number in the population, variant representation is challenging, and our current mapping-based methods have more errors in this region. Regions that are collapsed and expanded from GRCh37/38 primary assembly alignments The GRC aligned GRCh37 to GRCh38 (excluding alts) with results available at: ftp://ftp.ncbi.nlm.nih.gov/pub/remap/Homo_sapiens/ 2.1/GCF_000001405.13_GRCh37/GCF_000001405.26_GRCh38/. We parsed the file GCF_000001405.13.xlxs for two Discrepancy values: (1) SP that denotes collapsed regions and (2) SP-only that denotes a region that was expanded between the reference versions. Highly similar and high depth segmental duplications longer than 10kb We begin with the segmental duplications track from UCSC: 34 http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/ genomicSuperDups.txt.gz. We filter for entries larger than 10 kb and with identity >99%. We then use bedtools genomecov to calculate segmental depth and subset to those greater than 5. Potential copy number variation We employed several approaches to find potential regions of large duplications in HG002 that are not in GRCh37 and GRCh38: 1 Short read and Long Read Intersection: We used mosdepth 35 to find 1,000 bp windows that have higher than average coverage/2*2. 5 in ONT and PacBio HiFi data. We intersected these regions with results from the CNV analysis tool, mrCaNaVar, 36  GRCh37/GRCh38 with bedgraph files that denote coverage of the reference by the number of contigs for the maternal and paternal haplotypes. We used bedtools unionBedGraphs and then found reference regions that are covered by 2 or more contigs in the union of haplotypes. We did this separately on a TrioCanu assembly using ONT, 37 a Flye assembly using ONT, 38 and a TrioCanu assembly of PacBio HiFi 15 kb reads. 7 We found an intersection across the three assemblies and subset to regions greater than 10 kb.

Inversions
We used SVrefine (github commit f0fb99969b6e239d1f49bc64a8f6cf.5d52a2b88b) to call structural variants with, -maxsize 1000000 option. We then extracted inversions from the call set. Variants were merged with SVmerge (github commit aa8beb6f1cb5c53 9eea9f980ff30d2648caeee21), default maximum "distances", which were 0.2 for all. SVrefine and SVmerge were from SVanalyzer (https://github.com/nhansen/SVanalyzer). HG002 v0.6 GIAB Tier1 plus Tier 2 SV Benchmark expanded by 150% We used bedtools 39 slop with parameters -b -pct 0.25 to expand the GIAB v0.6 Structural Variant benchmark file: ftp://ftp-trace.ncbi. nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1plusTier2_v0.6. 1.bed. This file defines regions in which calls with PASS in the FILTER field as well as regions should contain close to 100% of true insertions and deletions R50 bp, with variants merged into a single region if they were within 1 kb. SVs excluded from HG001 and HG003-HG007 Because we don't have SV benchmarks for HG001 and HG003-HG007, we used pbsv (https://github.com/PacificBiosciences/pbsv) SVs >49 bp from PacBio HiFi data for HG001 and HG003-HG007, and well as svanalyzer and dipcall SVs >49 bp from trio-hifiasm assemblies of HG001 and HG005. We expanded these SVs to include overlapping tandem repeats and homopolymers and expanded the resulting regions by 25% of the region size on each side with bedtools 39 slop with parameters -b -pct 0.25. Tandem repeats greater than 10,000 bp We took the union of SimpleRepeat dinucleotide, trinucleotide, and tetranucleotide STRs as well as RepeatMasker_LowComplexity, RepeatMasker_SimpleRepeats, and TRF_SimpleRepeats downloaded from UCSC Genome Browser. We then subset to tandem repeats longer than 10,000 bp. Reference assembly contigs shorter than 500,000 bp We downloaded the gap track from UCSC: 34 ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gap.txt.gz. Then subset to regions that are gap. We used bedtools complemented with GRCh37/GRCh38 to find contigs then subset to those less than 500 kb.