HAPDeNovo: a haplotype-based approach for filtering and phasing de novo mutations in linked read sequencing data

Background: De novo mutations (DNMs) are associated with neurodevelopmental and congenital diseases, and their detection can contribute to understanding disease pathogenicity. However, accurate detection is challenging because true DNMs are few relative to the genome-wide false positives in next generation sequencing (NGS) data. Software such as DeNovoGear and TrioDeNovo has been developed to detect DNMs, but at good sensitivity these tools still produce many false positive calls.
Results: To address this challenge, we develop HAPDeNovo, a program that leverages phasing information from linked read sequencing to remove false positive DNMs from candidate lists generated by DNM-detection tools. Short reads from each phasing block are allocated to each of the two haplotypes, and a haploid genotype is then generated for each putative DNM. HAPDeNovo removes variants that are called as heterozygous in one of the haplotypes because they are almost certainly false positives. Our experiments on 10X Chromium linked read sequencing trio data reveal that HAPDeNovo eliminates 80 to 99% of false positives regardless of how large the candidate DNM set is.
Conclusions: HAPDeNovo leverages the haplotype information from linked read sequencing to remove spurious false positive DNMs effectively, and it increases the accuracy of DNM detection dramatically without sacrificing sensitivity.
Electronic supplementary material: The online version of this article (10.1186/s12864-018-4867-7) contains supplementary material, which is available to authorized users.


Background
De novo mutations (DNMs) have been shown to be a major cause of neurodevelopmental and other congenital diseases including autism [1], schizophrenia [2], intellectual disability [3], and congenital heart disease [4]. Next generation sequencing of nuclear families provides an unprecedented opportunity to investigate the de novo mutation spectrum of these diseases at single nucleotide resolution. Variant calling methods such as GATK [5] and SAMtools [6] implement a straightforward approach to explore DNMs by selecting the mutations that appear in the child but not the parents. Other approaches such as DeNovoGear [7] and PolyMutt [8] model the family relationship as a prior probability of mutation transmission to distinguish true DNMs from noise, dramatically improving performance. These programs assume a consistent mutation rate across all positions, which is not always the case. TrioDeNovo [9] was developed to address this issue by employing flexible priors. Nevertheless, most of the existing algorithms are still overwhelmed by an enormous number of false positives, which are probably caused by factors such as sequencing coverage bias, sequencing batch effects, and alignment artifacts on repetitive regions. There has been a lack of studies that investigate which factor has the most impact and how to correct biases in de novo mutation calling.
Knowing the phase of DNMs is critical in determining their parent-of-origin. Yet phasing DNMs from short reads alone, such as the typical ~500bp fragments of Illumina sequencing, remains challenging. The 10X Chromium system microfluidically partitions long DNA fragments, from which short fragments and, subsequently, Illumina reads are generated [10]. Thus, each original long DNA fragment generates a collection of short reads with a shared barcode (linked reads), enabling robust and accurate genome-scale variant genotyping and phasing. Phasing analysis reveals that linked read sequencing generates a very low overall long switch error (<0.03%) [11].
Here we developed a novel filtering and phasing toolkit for DNMs, HAPDeNovo, which takes full advantage of robust variant phasing from linked read sequencing to sift true DNMs from noise. We show that HAPDeNovo drastically reduces false positive DNMs without decreasing the detection rate of true positives. We identify allele-specific sequencing coverage bias as the main source of false positive calls.

HAPDeNovo Implementation
Linked read sequencing allows simultaneous variant calling and phasing by reconstructing the original long fragments from linked short reads. When reads with the same barcode align in proximity to each other in the genome, they can be assumed to originate from the same haplotype because the original template was a single DNA fragment. In general, HAPDeNovo is designed to re-calibrate the DNM quality based on read coverage and sequencing quality for each haplotype.
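The barcode-proximity idea described above can be illustrated with a minimal Python sketch. It is not part of HAPDeNovo or Long Ranger; the function name `group_into_molecules`, the `(barcode, chrom, pos)` input tuples, and the 50 kb gap cutoff are assumptions chosen only for illustration.

```python
from collections import defaultdict

def group_into_molecules(reads, max_gap=50_000):
    """Cluster reads that share a barcode and map close together.

    reads: iterable of (barcode, chrom, pos) tuples (hypothetical input format).
    max_gap: assumed cutoff; a larger gap starts a new inferred molecule.
    Returns a list of (barcode, chrom, start, end, n_reads) tuples.
    """
    positions_by_key = defaultdict(list)
    for barcode, chrom, pos in reads:
        positions_by_key[(barcode, chrom)].append(pos)

    molecules = []
    for (barcode, chrom), positions in sorted(positions_by_key.items()):
        positions.sort()
        start = prev = positions[0]
        count = 1
        for pos in positions[1:]:
            if pos - prev > max_gap:
                # Gap too large: close the current molecule and start a new one.
                molecules.append((barcode, chrom, start, prev, count))
                start, count = pos, 0
            prev = pos
            count += 1
        molecules.append((barcode, chrom, start, prev, count))
    return molecules

example_reads = [
    ("AAAC", "chr1", 1_000),
    ("AAAC", "chr1", 21_000),
    ("AAAC", "chr1", 900_000),  # same barcode, but far away: separate molecule
    ("GGTT", "chr1", 5_000),
]
for molecule in group_into_molecules(example_reads):
    print(molecule)
```

Reads grouped into one inferred molecule are the ones that can be treated as coming from a single haplotype; the distance cutoff guards against two unrelated fragments that happen to share a barcode.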
The reads from each phasing block are allocated to either of the two haplotypes, enabling HAPDeNovo to identify two haploid genotypes for each candidate DNM. Each putative de novo mutation (homozygous reference allele for both parents and heterozygosity for the child) is then evaluated against these haploid genotypes.
The input to HAPDeNovo is paired-end reads in FASTQ files generated by Illumina sequencing of 10X Chromium libraries for each individual of the trio. In the first step, HAPDeNovo aligns the reads to the reference genome and performs multi-sample variant calling on the trio with any available variant caller (FreeBayes by default). Because variant phasing is independent for each individual, HAPDeNovo separates the variants into three individual VCF files for variant phasing using barcode-aware haplotype assembly approaches such as Long Ranger or HapCUT2 [12]. The individual phased VCF files are then merged into a phased multi-sample VCF file. For each phase block, HAPDeNovo determines the haplotype that each read comes from and marks the read accordingly in the BAM file.
In the second step, the BAM file for each individual is divided into three according to the haplotype tags HP1, HP2 and HP0, which denote reads from the maternal haplotype, the paternal haplotype, and an undetermined haplotype within the phase block, respectively. Then, multi-sample variant calling is performed again on all nine BAM files, which identifies the specific alleles that comprise each individual's two haplotypes.
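The splitting step can be sketched on SAM-formatted text records as follows. This is an illustrative sketch rather than HAPDeNovo's own code: it assumes the haplotype assignment is stored as an integer `HP` optional field (`HP:i:1` or `HP:i:2`) and that reads without such a field belong in the HP0 bin. The function names are hypothetical.

```python
def haplotype_of(sam_line):
    """Return 'HP1', 'HP2' or 'HP0' for one SAM alignment line.

    Assumption: the haplotype is encoded as an HP:i:<n> optional field;
    reads lacking it (or carrying another value) are treated as HP0.
    """
    fields = sam_line.rstrip("\n").split("\t")
    # SAM has 11 mandatory columns; optional TAG:TYPE:VALUE fields follow.
    for tag in fields[11:]:
        if tag.startswith("HP:i:"):
            value = tag[len("HP:i:"):]
            if value in ("1", "2"):
                return "HP" + value
    return "HP0"

def split_by_haplotype(sam_lines):
    """Partition alignment lines into three per-haplotype bins."""
    bins = {"HP1": [], "HP2": [], "HP0": []}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header lines are not alignments
        bins[haplotype_of(line)].append(line)
    return bins

records = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\tBX:Z:AAAC-1\tHP:i:1",
    "r2\t0\tchr1\t120\t60\t4M\t*\t0\t0\tACGT\tFFFF\tBX:Z:GGTT-1\tHP:i:2",
    "r3\t0\tchr1\t140\t60\t4M\t*\t0\t0\tACGT\tFFFF\tBX:Z:CCAA-1",
]
for name, lines in split_by_haplotype(records).items():
    print(name, len(lines))
```

Applying such a split to each of the three individuals yields the nine read sets on which the second round of multi-sample calling is run. A real pipeline would operate on BAM files and copy the header into each output file.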
In the last step, putative DNMs are extracted into a VCF (Variant Call Format) file from the original multi-sample variant calls (FreeBayes by default), if the allele occurs in only one haplotype of the child (heterozygosity) and is absent in both parents (homozygous reference).
HAPDeNovo currently accepts multi-sample variant calls from GATK, TrioDeNovo and DeNovoGear as the source of the DNM candidate set; preprocessing scripts for these tools are included in HAPDeNovo. HAPDeNovo defines a variant site as high-confidence if all six genotypes called from the reads with HP1 and HP2 tags are homozygous (0|0 or 1|1, denoting reference and variant haploid genotypes) and the calls are supported by sufficient sequencing depth (>1X by default). A DNM is defined as high-confidence when it lies at a high-confidence site, all four parental haploid alleles are 0, the child's alleles are 0 and 1, and the DNM is phased.
Candidate DNMs are defined as low-confidence when one or more haplotypes are not covered by any reads but the variants are identified as putative DNMs in the original candidate set. These low-confidence variants are kept for further consideration, since HAPDeNovo is unable to determine on the basis of the haploid genotypes whether they are false positives.
The required sequencing depth per haplotype is a user-defined parameter. The output of HAPDeNovo is a flat file containing high-confidence DNMs annotated by H, and low-confidence DNMs annotated by L.
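The classification rules above can be summarized in a short Python sketch. It is a simplified illustration, not the HAPDeNovo source: the data layout (a dict keyed by sample and haplotype holding an allele call and a depth), the label `"filtered"`, and the function name are hypothetical. Three choices are also assumptions, since the text does not spell them out: a haplotype that is covered but at or below the depth threshold is treated like an uncovered one, an unphased candidate is labeled L, and any candidate whose haploid genotypes contradict a de novo pattern is filtered.

```python
SAMPLES = ("child", "father", "mother")
HAPLOTYPES = ("HP1", "HP2")

def classify_candidate(calls, depth_threshold=1, phased=True):
    """Label one candidate DNM from its six haploid genotype calls.

    calls: dict mapping (sample, haplotype) -> (allele, depth), where allele
           is "0" (reference), "1" (variant), "het" (both alleles seen within
           one haplotype) or None (no call). Hypothetical layout.
    depth_threshold: depth must exceed this value (>1X by default).
    Returns "H" (high-confidence), "L" (low-confidence) or "filtered".
    """
    keys = [(s, h) for s in SAMPLES for h in HAPLOTYPES]
    alleles = {k: calls[k][0] for k in keys}
    depths = {k: calls[k][1] for k in keys}

    # A heterozygous call within a single haplotype marks a false positive.
    if any(a == "het" for a in alleles.values()):
        return "filtered"

    # Uncovered haplotypes cannot be judged; keep as low-confidence.
    # Assumption: insufficient depth is handled the same way.
    if any(alleles[k] is None or depths[k] <= depth_threshold for k in keys):
        return "L"

    parents_ref = all(
        alleles[(p, h)] == "0" for p in ("father", "mother") for h in HAPLOTYPES
    )
    child_het = sorted(alleles[("child", h)] for h in HAPLOTYPES) == ["0", "1"]
    if parents_ref and child_het:
        # Assumption: an unphased but otherwise consistent candidate is L.
        return "H" if phased else "L"

    # Assumption: anything else (e.g. allele seen in a parent) is removed.
    return "filtered"

calls = {(s, h): ("0", 10) for s in SAMPLES for h in HAPLOTYPES}
calls[("child", "HP1")] = ("1", 12)   # variant allele on one child haplotype
print(classify_candidate(calls))
```

Keeping the depth threshold as a parameter mirrors the user-defined per-haplotype depth requirement described above.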

Results
The performance of HAPDeNovo was evaluated on the validated DNMs of the 1000 Genomes Project CEU trio (NA12878, daughter; NA12891, father; and NA12892, mother). There are 48 validated germline de novo mutations in NA12878 [13], serving as a gold standard (Supplementary Table S5). 44 of these were covered by 10X-based linked read sequencing; four de novo mutations could not be evaluated due to poor sequencing coverage.
We used Lariat [14] to align the reads from the trio against the reference genome (hg19), followed by variant calling and generation of a set of putative DNMs. We included the variant calls from four programs, two general-purpose callers (GATK and FreeBayes) and two DNM-specific callers (TrioDeNovo and DeNovoGear), to evaluate the impact of different inputs on HAPDeNovo performance. To incorporate as many DNMs in the gold standard as possible, we applied lenient parameters in variant calling. A depth threshold was applied to all four methods, along with an additional caller-specific threshold for each program: De Novo Quality (DQ) for TrioDeNovo, Posterior Probability (PP) for DeNovoGear, and Genotype Likelihoods (PL for GATK and GL for FreeBayes). We also varied these thresholds to examine their potential influence (Supplementary Table S1; Table 1). In each of these sets, HAPDeNovo identified approximately 22% of the candidates as high-confidence DNMs, which included a majority of true positives (33/44 for FreeBayes and TrioDeNovo and 32/43 for GATK and DeNovoGear). By increasing the stringency of thresholds, further false positive reduction was achieved at a small cost of sensitivity (Supplementary Tables S1-S4). Moreover, HAPDeNovo could phase and determine the parent-of-origin of all 44 validated de novo mutations.
To understand whether the increased specificity of HAPDeNovo is sensitive to read depth, we performed an extensive ROC analysis for each of the four variant callers with and without HAPDeNovo (Figure 2). We varied the read depth thresholds from 10 to 30, and found the optimal parameter settings for the four variant callers (the maximal number of true DNMs and minimal number of false positives; GL=-50 and PP=3E-5 for FreeBayes and DeNovoGear, DQ=7 and PL=450 for TrioDeNovo and GATK). Application of HAPDeNovo on top of the optimal parameter settings always generated the smallest set of false positives without losing any true positives. In general, 80% to 99% of false positives were eliminated by HAPDeNovo. Specifically, by using HAPDeNovo, the average false positive removal was 82.7% for FreeBayes, 82.7% for TrioDeNovo, 99.5% for GATK, and 98.8% for DeNovoGear.
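For readers who want to reproduce this kind of summary on their own call sets, the two quantities used here can be computed as in the sketch below. The function names are hypothetical. The 33/44 figure is the FreeBayes true positive count reported above, whereas the false positive counts (1000 before and 173 after filtering) are invented purely to illustrate the arithmetic and are not results from this study.

```python
def sensitivity(true_positives_recovered, true_positives_total):
    """Fraction of gold-standard DNMs retained in the call set."""
    if true_positives_total == 0:
        raise ValueError("gold standard is empty")
    return true_positives_recovered / true_positives_total

def false_positive_removal(fp_before, fp_after):
    """Fraction of false positives eliminated by filtering."""
    if fp_before == 0:
        raise ValueError("no false positives before filtering")
    return 1 - fp_after / fp_before

# 33 of 44 covered gold-standard DNMs retained (from the text above).
print(f"sensitivity: {sensitivity(33, 44):.1%}")

# Hypothetical counts, chosen only to show how a removal rate is derived.
print(f"FP removal: {false_positive_removal(1000, 173):.1%}")
```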
To ascertain whether haplotype information was generally beneficial for calling DNMs, we also analyzed results from Long Ranger, which, like HAPDeNovo, can allocate allele-specific reads to each haplotype. This boosts the power for detecting heterozygous variants, such as DNMs. We compared the performance of TrioDeNovo, Long Ranger and HAPDeNovo with respect to DNM calling. Both Long Ranger and HAPDeNovo performed better than TrioDeNovo, which is consistent with the idea that the accuracy of calling DNMs benefits from the haplotype information. Nevertheless, Long Ranger, which considers the individuals of a trio independently from one another, called many more false positives than HAPDeNovo. HAPDeNovo eliminated ~80% (Table 2) of false positives from Long Ranger, suggesting that HAPDeNovo's simultaneous consideration of all six haplotypes boosts accuracy in DNM detection.
Finally, we explored whether HAPDeNovo's consideration of reads that cannot be allocated to a specific haplotype (HP0) would affect the accuracy of DNM calling. We compared HAPDeNovo performance when only HP1 and HP2 BAM files are provided as input versus when all nine BAM files (including those of HP0) were considered. Accuracy without HP0 is lower than with HP0 (Table 3).

Discussion
Many diseases with an early age of onset are associated with de novo mutations. The extensive availability of next generation sequencing technology has encouraged the study of de novo mutations, which play an important role in explaining why diseases with critically decreased fitness occur frequently in the human population [12]. Barcode-based linked read sequencing, as an alternative to single-molecule long-read sequencing, enables high-quality haplotype phasing and structural variation analysis [11]. In this study, we developed HAPDeNovo, a flexible and efficient pipeline that benefits from variant phasing from linked read sequencing to improve de novo mutation calling. The assignment of reads to each haplotype (HP1, HP2) decreases the chance that a genotype is miscalled because one haplotype is dominant, such as the haplotype with the reference allele in the parents. This is the major cause of an inherited variant being called as a de novo mutation.
To date, methods that were developed to work on short reads alone have not achieved satisfactory performance for calling DNMs. For example, the GATK best practices workflow is highly effective in reducing the impact of sequencing and alignment artifacts on variant calls, but it is still challenged by the accurate detection of de novo mutations. Existing de novo mutation-specific callers such as DeNovoGear and TrioDeNovo perform better than general callers such as GATK and FreeBayes. Nevertheless, large numbers of false positives remain in their final results.
We showed HAPDeNovo to be superior in comparison because it explicitly leverages the haplotype-specific genotypes of the three individuals of a trio simultaneously.

Conclusions
Linked read sequencing is a powerful tool to phase the variants from a single person rather than by statistical inference from a population. This boosts our ability to identify the parent-of-origin and transmission of de novo mutations. HAPDeNovo introduces haploid genotyping to take advantage of physical phasing that benefits from linked read sequencing and to overcome sequencing coverage imbalance and alignment artifacts in detecting de novo mutations.
HAPDeNovo can be applied in conjunction with any variant caller to dramatically decrease false positive mutations. HAPDeNovo is user friendly and includes auxiliary scripts to process the results from other tools, and in the future, will be extended to detect inherited mutations in complex pedigrees and somatic mutations in tumor-normal pairs.

Availability and requirements
Project name: HAPDeNovo.