Misannotation of multiple-nucleotide variants risks misdiagnosis

Multiple Nucleotide Variants (MNVs) are miscalled by the most widely utilised next generation sequencing analysis (NGS) pipelines, presenting the potential for missing diagnoses. These variants, which should be treated as a single insertion-deletion mutation event, are commonly called as separate single nucleotide variants. This can result in misannotation, incorrect amino acid predictions and potentially false positive and false negative diagnostic results. Using simulated data and re-analysis of sequencing data from a diagnostic targeted gene panel, we demonstrate that the widely adopted pipeline, GATK best practices, results in miscalling of MNVs and that alternative tools can call these variants correctly. The adoption of calling methods that annotate MNVs correctly would present a solution for individual laboratories, however GATK best practices are the basis for important public resources such as the gnomAD database. We suggest integrating a solution into these guidelines would be the optimal approach.


Introduction
The rapid progress and reduced cost of Next Generation Sequencing (NGS) has transformed approaches to genomic research and clinical diagnostic testing 1 . While single-gene tests, for instance using Sanger (dideoxy) sequencing, will produce a short list of variants which can be manually evaluated, this is not feasible for next generation analysis. Sequencing at this scale requires highly automated analysis pipelines. High throughput sequencing services are dependent on automated tools to annotate and classify variants by potential consequence. For this reason, it is particularly important that any tools used to call and annotate variants do so accurately without the need for any manual assessment to avoid potential misdiagnosis.
Multiple Nucleotide Variants (MNVs) 2 present a particular challenge for automated NGS analysis pipelines. These variants consist of multiple Single Nucleotide Variants (SNVs) located very close together on the same strand of DNA. The Human Genome Variation Society (HGVS) guidelines state that in most circumstances, two adjacent substitutions should be classified as a single deletion-insertion mutation event, rather than two or more separate SNVs 3 .
MNVs that contain multiple SNVs within the same codon may have a significantly different protein consequence than if the separate SNVs are annotated independently. For instance, a CTG codon (Leu) can be changed to TTG or CTC (two separate SNVs) without any protein coding consequence, but when changed to TTC (an MNV) the consequence is a missense (see Figure 1). Importantly, some MNVs would meet the evidence criteria for pathogenicity when called as a single mutational event, but would not when each SNV is treated separately 4 . NGS pipelines that annotate these MNVs as two independent SNVs could fail to correctly identify a pathogenic variant, potentially negatively impacting on clinical care.
Most standard NGS variant calling pipelines, including the widely adopted GATK best practices 5 , do not deal with MNVs correctly -calling them as separate SNVs 6 . Consequently, most laboratories using NGS technologies are at risk of miscalling these variants. Some NGS variant callers incorporate haplotype information to correctly call MNVs 7,8 . Another approach to correctly call MNVs is to re-process variant calls, for example using the Multi-Nucleotide Variant Annotation Corrector (MAC) 9 . There is also a GATK tool, ReadBackedPhasing 10 , which performs phasing of SNVs based on the overlap between reads and uses this information to call variants. However, this tool is not part of the current versions of the widely followed GATK best practice guidelines.
The scale of the potential problem with MNVs was highlighted by the ExAC database. The variants within this data set were called using a GATK best practices pipeline which does not recognise MNVs as single mutation events. Lek et al. 6 identified an average of 23 MNVs that were incorrectly annotated by the original analysis within each whole exome in the ExAC data set. In total, 778 MNVs where a STOP codon should have been annotated were noted, including multiple examples within known dominant disease genes. Crucially, they identified 10 MNVs that have previously been reported as pathogenic, but which were missed by the original pipeline. The impact of MNVs has also been highlighted by the diagnosing developmental disorders (DDD) study -Kaplanis et al. 11 showed that 2% of de novo variants appeared as part of MNVs and that these were significantly enriched in genes associated with developmental disorders in affected children.
HGMDpro is a commercially owned, curated database of published putative pathogenic variants associated with human genetic disorders widely used by genomic diagnostic and research laboratories. There are 628 2bp MNVs, 108 3bp MNVs and more than 150 larger MNVs listed within this database. These previously reported pathogenic variants are at risk of being inaccurately called by standard analysis pipelines. These misannotations represent potential misdiagnoses unless this problem is fully addressed.
In order to investigate the potential extent of this problem for clinical diagnostic services we devised two experiments. Firstly, to establish how MNVs are analysed we modified a set of NGS data to create simulated MNVs and processed this data using both a standard GATK best practices pipeline, and pipelines incorporating GATK ReadBackedPhasing 10 , VarDict 7 , Platypus 12 or MAC 9 . Secondly, we re-analysed a cohort of 1447 samples previously tested using a targeted panel of genes for diagnosis of monogenic diabetes and congenital hyperinsulinism 13 to determine if any potential diagnoses were missed.
By simulating MNVs in NGS sequencing data and testing for them using a typical NGS pipeline employed by an NHS diagnostic laboratory, we demonstrate that MNVs are incorrectly annotated by standard diagnostic NGS pipelines, potentially generating false positive and false negative results and negatively impacting on patient care.

GATK best practices pipeline
The Molecular Genetics Laboratory at the Royal Devon & Exeter NHS Foundation Trust routinely uses a targeted NGS testing pipeline to interrogate an extended panel of genes associated with monogenic diabetes and congenital hyperinsulinism 13 . This uses GATK 3.7.0. The pipeline aligns reads to the hg19/GRCh37 human reference genome with BWA mem 0.7.15 14 , applies Picard 2.5.0 for duplicates removal 15 and GATK IndelRealigner for local re-alignment 16 . GATK haplotypecaller is then used to identify variants and these are annotated using Alamut batch version 1.5.2 (Interactive Biosoftware, Rouen, France). This analysis approach is based on the GATK best practice guidelines 5 . We also tested GATK 4.0.11.0 to see if the problem had been corrected in the later version of the software.

Generating simulated MNVs
To determine whether the pipeline correctly annotates MNVs, we generated a BAM file containing five simulated MNVs in the HNF4A gene. These MNVs are detailed in Table 1. Each variant was generated as a homozygous call (GT 1/1 with no reads supporting the reference allele). We processed these variants with the standard GATK best practices pipeline described above.
This dataset is publicly available at https://github.com/ rdemolgen/MNV-test-data to provide a simple method for laboratories to test if their current analysis pipeline will annotate MNVs correctly.

Re-processing with alternative tools
To investigate whether using alternative tools results in correct annotation of MNVs, we re-processed the VCF file of simulated MNVs using GATK 3.6.0 ReadBackedPhasing 10 (default parameters plus "-maxDistMNP 2 -enableMergeToMNP") or MAC 1.2 9 then annotated the resulting VCF files using Alamut batch version 1.5.2 (Interactive Biosoftware, Rouen, France). We also tested re-calling the variants using VarDict 1.4 7 and Platypus 0.8.1 12 .

Investigating NGS targeted panel data for MNVs
Using the GATK ReadBackedPhasing tool 10 , we re-examined a set of 1447 samples previously sequenced using a custom panel of genes for the diagnosis of monogenic diabetes and congenital hyperinsulinism 13 to determine if any MNVs with an incorrect annotation were present.

Results
Simulated MNVs are miscalled using GATK best practices All five of the simulated MNVs described above were called as two separate SNVs using GATK best practices, and thus Simulated MNVs were correctly called using alternative software As described above, when our simulated MNVs are called using GATK v3.7.0 best practices they are incorrectly called as two separate variants. In contrast when re-analysed using GATK ReadBackedPhasing 10 , MAC 9 and Platypus 12 the separate SNVs are correctly merged into a single MNV in all five cases and the MNVs were correctly annotated by Alamut batch 1.5.2 as in-frame insertion-deletions. VarDict 7 correctly calls four variants but fails to call variant 1, which is a CTG to TTC non-consecutive change, as a single event. We also tested GATK 4.0.11.0 to see if the updated version of the software dealt with MNVs differently to older versions but the results were the same.
Variants identified through an NGS diagnostic targeted panel are miscalled by GATK best practices The Molecular Genetics Laboratory at the Royal Devon & Exeter NHS Foundation Trust utilises an NGS analysis pipeline based on GATK best practices. Having established, using simulated data that GATK ReadBackedPhasing 10 correctly called MNVs, we re-analysed 1447 samples tested on a diagnostic panel for monogenic diabetes and congenital hyperinsulinism 13 to examine if any MNVs had been incorrectly annotated using the GATK best practices pipeline.
On four occasions MNVs were found to have been miscalled as two separate single base substitution variants (Table 3). In three cases the correct annotation for the MNV was a missense variant; however GATK best practices resulted in two different missense variants being called. The fourth MNV should also have been called as a missense variant, but was called as a nonsense variant and a different missense variant. In  The GATK best practice guidelines 5 have been widely adopted and are employed in the analysis pipelines for the majority of diagnostic and research NGS facilities worldwide. Our analysis pipeline, based on GATK best practices, which is currently in use at our diagnostic laboratory, failed to correctly call our simulated MNVs and four MNVs identified by reanalysis of targeted gene panel data.
Our analysis demonstrated that in contrast to GATK best practices, alternative tools 7,9,12 are available which merge the nearby SNVs correctly into a single MNV, which is essential for correct annotation of variant consequence. There are two distinct approaches for correcting the problem, either changing the variant caller used to one such as Platypus 12 which calls MNVs correctly or post-process variant calls with tools such as MAC 9 or ReadBackedPhasing to correct the variant calls. Both solutions present problems integrating into existing pipelines. Platypus 12 does not produce the same quality metrics making it more challenging to integrate into an existing GATK based pipeline while ReadBackedPhasing does not maintain the quality information from the variant calls, in both cases making it difficult to filter variants by quality. Thus while a potential solution for individual laboratories to resolve this issue would be the integration of other tools within their NGS pipelines that deal with MNVs correctly this will present challenges integrating them. Additionally, this depends on laboratory awareness of this ongoing problem and the potential for patient harm that it presents.
In the current versions of the GATK best practices, phasing is performed by GATK HaplotypeCaller, so the ReadBackedPhasing software, which previously performed this role, is no longer being actively maintained. However, while HaplotypeCaller builds haplotypes we have demonstrated that it does not correctly utilise the information to call MNVs. ReadBackedPhasing calls MNVs but does not provide the quality score information for them that is produced for variants by HaplotypeCaller, which prevents them from being filtered by quality. Thus we suggest that the ideal solution would be for the features of software which enable correct calling of MNVs, namely the appropriate use of haplotype information, to be incorporated into HaplotypeCaller.
Adoption of a solution into the GATK best practices is the optimal solution as it does not require individual laboratories to be aware of the problem and adopt bespoke solutions. GATK is widely adopted for its ease of use: it provides an integrated suite of tools with inputs and outputs in standard formats, it has excellent documentation and a large user community solving shared problems.
Another important consideration to note is that publicly available online variant frequency resources such as gnomAD and ExAC are currently based on GATK best practices pipelines. These resources are critical to variant interpretation in rare genetic disorders as a key criterion for pathogenicity assignation is allele frequency 4 . Currently MNVs are flagged, but still represented as multiple separate SNVs within gnomAD and ExAC. This means that even where laboratories make changes to their local pipeline to correctly call MNVs, their local data for these variants will be incompatible with these public resources, with allele frequency information being unavailable for those MNVs.
In summary, the issue of MNVs being miscalled by the most commonly employed NGS analysis pipelines continues to be an important issue. Although there are a number of tools available that call MNVs correctly, these are not currently being widely adopted. Addressing this issue by implementing changes within GATK best practices would have the greatest impact on prevention of misdiagnoses resulting from MNV calling inaccuracies and also importantly provide compatibility with the online public variant frequency databases that are central to current diagnostic variant classification. The dataset of 1447 samples previously sequenced cannot be shared due to patient confidentiality issues, as the genotype data could be used to identify individuals and so cannot be made openly available. Requests for access to the anonymised data by researchers will be considered following an application to the Genetic Beta Cell Research Bank (https://www.diabetesgenes. org/current-research/genetic-beta-cell-research-bank/) with proposals reviewed by the Genetic Data Access Committee.
The article presents a less known problem of misannotation of nearby genomic variants. The authors suggest alternative methods to correct misannotated variants. According to their analysis, the problem seems to affect only some variant callers (detected only in GATK). Overall, the article is well written and contains important information for the genomics labs. I also appreciate provided data sets for internal laboratory validation, as well as the proposed solutions for the problem. On the other hand, some parts of the article should be further clarified to improve its quality.
The details of the implemented pipelines are not clear. Some tools (like Alamut) have a software version, others have detailed parameters (like ReadBackedPhaser). However, the rest of the featured tools have no execution details, so it's not clear which versions are affected. My main concern is the version of the GATK. It appears that the reported issues were found in an older version (3.?.?), according to the described methods (section GATK Pipeline for Best Practices), as IndelRealigner is no longer needed in current versions (as of 4.0.0). Therefore, it is important to identify if the problem persists or has been revised in newer releases. This is important information for genome laboratories, as the GATK upgrade should be less demanding than the implementation of an alternative software as proposed in the Key points.

Reviewer Expertise: Bioinformatics
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Dominic J. McMullan
West Midlands Regional Genetics Laboratories, Birmingham Women's and Children's NHS Foundation Trust, Birmingham, UK