HaplotypeTools: a toolkit for accurately identifying recombination and recombinant genotypes

Background Identifying haplotypes is central to sequence analysis in diploid or polyploid genomes. Despite this, there remains a lack of research and tools designed for physical phasing and its downstream analysis. Results HaplotypeTools is a new toolset to phase variant sites using VCF and BAM files and to analyse phased VCFs. Phasing is achieved via the identification of reads overlapping ≥ 2 heterozygous positions and then extended by additional reads, a process that can be parallelized across a computer cluster. HaplotypeTools includes various utility scripts for downstream analysis including crossover detection and phylogenetic placement of haplotypes to other lineages or species. HaplotypeTools was assessed for accuracy against WhatsHap using simulated short and long reads, demonstrating higher accuracy, albeit with reduced haplotype length. HaplotypeTools was also tested on real Illumina data to determine the ancestry of hybrid fungal isolate Batrachochytrium dendrobatidis (Bd) SA-EC3, finding 80% of haplotypes across the genome phylogenetically cluster with parental lineages BdGPL (39%) and BdCAPE (41%), indicating those are the parental lineages. Finally, ~ 99% of phasing was conserved between overlapping phase groups between SA-EC3 and either parental lineage, indicating mitotic gene conversion/parasexuality as the mechanism of recombination for this hybrid isolate. HaplotypeTools is open source and freely available from https://github.com/rhysf/HaplotypeTools under the MIT License. Conclusions HaplotypeTools is a powerful resource for analyzing hybrid or recombinant diploid or polyploid genomes and identifying parental ancestry for sub-genomic regions.

physical phasing information in Pre-Implantation Genetic Testing (PGT) and Physical phasing ID information (PID) format fields of the VCF [13]. Other tools specifically designed to perform physical phasing for large and accurate haplotype construction include WhatsHap [14], HapCut2 [15], and SmartPhase [16]. The underlying algorithms of each method include weighted minimum error correction, maximum-likelihood, and read-based (either RNAseq or DNAseq) phasing respectively. HapCut2 works on a range of sequencing data including Hi-C and long read sequencing [15]. WhatsHap takes a dynamic programming approach that is both fast and more accurate than statistical phasers [14]. PoolhapX infers haplotypes across naturally pooled samples [17]. Accuracy for these tools has been determined by consensus to other methods or using simulated data. One potential drawback for each of these tools is that there are not easily (no options to) parallelize across multiple nodes on a computer cluster. Where such resources are available, this approach may decrease computational time.
Here, I present a toolset to phase diploid variant calls from whole genome sequencing data, validating phasing accuracy, phylogenetically placing haplotypes to other lineages or species, and identifying crossovers between pairs of phased VCFs. HaplotypeTools phasing performed better overall in terms of accuracy, at the cost of smaller haplotypes, in comparison to WhatsHap [14]. Both tools performed considerably better than GATK HaplotypeCaller physical phasing alone. HaplotypeTools was also used to identify the parental lineages and loci of crossovers for a hybrid fungal isolate belonging to the species Batrachochytrium dendrobatidis.

Benchmarking with simulated data
HaplotypeTools was benchmarked against WhatsHap using simulated reads from the genome of the fungal pathogen Batrachochytrium dendrobatidis (Bd) JEL423 (see methods), highlighting several differences between the tools. First, the accuracy of the variant caller Pilon [18] to call heterozygous positions from short (100nt) and long-read (10 kb) paired end alignments (20X depth) was assessed (Table 1) revealing high levels of sensitivity (> 0.91), specificity (> 0.99) and overall accuracy (> 0.98), which is suitable for testing downstream phasing.
Haplotypes defined by HaplotypeTools and WhatsHap were assessed for haplotype length, coverage, accuracy and computational time ( Table 2). HaplotypeTools outperformed WhatsHap in terms of phasing accuracy, while it underperformed in terms of haplotype length, genome coverage of those haplotypes, sensitivity and QAN50 values (an assessment of haplotype length and quality in terms of Switch Errors). For example, the longest haplotype block/pair from HaplotypeTools was ~ 11.7 kb, compared with ~ 874 kb for WhatsHap.
HaplotypeTools achieved higher accuracy overall than WhatsHap according to a range of metrics (Table 2). HaplotypeTools resulted in fewer (< 14%) Switch Errors (SE), and lower Switch Error Rate (SER) for every test, which had a value of between 0 and 0.0031 compared with 0.007 and 0.016 for WhatsHap. Indeed, for two of the tests (100nt reads with 1/kb heterozygosity and 10 kb reads with 1/kb heterozygosity), HaplotypeTools did not produce a single switch error (SER = 0), demonstrating the high accuracy achieved by HaplotypeTools even using default settings.
Lowering the minimum haplotype coverage parameter in HaplotypeTools achieved better SE and SER for one of the tests (10 kb reads for 100/kb heterozygosity). For the same test data, increasing the maximum phasing length resulted in longer haplotypes and reduced computational time, at the cost of a slightly decreased genome coverage (Table 2). Therefore, adjusting HaplotypeTools' parameters may achieve better results than the default settings depending on the use case (e.g. read length and heterozygosity level) and desired outcome (sensitivity vs specificity).
While optional, HaplotypeTools was designed to run in parallel across a computer cluster -first splitting up the VCF and BAM files into windows that can be processed in parallel. HaplotypeTools was scatter gathered across ~ 100 low-spec nodes (8 Gb RAM, Intel Xeon CPU E5-2680 v2 @ 2.80 GHz), which took between 23m13s and 32m59s till completion (Table 2). WhatsHap is not designed to run in parallel (although such a process could be achieved with a custom pipeline if desired). Thus, WhatsHap was tested locally on a single high-spec laptop (32 Gb RAM, Intel Core i9-9980HK CPU @ 2.40 GHz). WhatsHap required some preprocessing to run (e.g. removing reference bases from VCFs). After pre-processing, WhatsHap was overall computationally faster: taking between 95m56s for 100 nt reads with 100/kb heterozygous positions, to as quickly as just 24 s on 10 kb reads with 1/kb heterozygous positions. Table 1 Accuracy of heterozygous variant calling by Pilon was assessed Paired reads (100nt or 10 kb) were simulated at 20X depth from reference Bd JEL423 genome that was duplicated to create an in silico diploid. In silico mutations were then randomly introduced throughout (1/kb, 10/kb or 100/kb). Reads were aligned to the original reference sequence (non-duplicated, non-mutated version), and diploid variants called by Pilon. Counts of variants are shown including single nucleotide polymorphisms (SNP), heterozygous positions (HET), insertions (INS), deletions (DEL) and ambiguous (AMB). Accuracy was assessed according to Comparison of FDR tool [28], that calculated TN = true negatives (correct reference bases), TP = true positives (correct HET), FN = false negatives (incorrect reference bases) and

Results on real data
HaplotypeTools was used to phase real Illumina data (100nt paired reads, 43X depth) to determine the parental lineages of the hybrid Bd isolate SA-EC3 with several settings. GATK v4 HaplotypeCaller was used for variant calling, which includes its own physical phasing, and could therefore also be compared to the results from HaplotypeCaller and WhatsHap (Table 3). The five isolates representing each of the lineages were assessed for ploidy and aneuploidy. Non-overlapping windows presenting normalized depth of coverage revealed evidence for aneuploidies in all isolates apart from BdCH ACON (Fig. 1). Supercontig (sc) 1 is the largest supercontig and therefore the best evidence for the baseline ploidy from genomic data alone. Therefore, based on this depth of coverage and allele frequencies (percent of reads agreeing with the reference base), Asia1 KRBOOR323 is diploid Rows include the number of haplotypes produced by each tool, the total number of nucleotides included in those haplotypes, the maximum haplotype length found (Haplotype N max ), the Hap N 50 and N 90 (the shortest haplotype length that includes ≥ 50% and ≥ 90% of haplotype sequence, respectively.). The number of haplotypes (#haps) that overlap on the genome assembly with haplotypes produced by HaplotypeTools (HT), and the number of nucleotides (nt) in those haplotpes. Computational time is also given for HaplotypeTools and Whatshap, but omitted for GATK given its primary role was variant calling, which both HaplotypeTools and WhatsHap were also based on, and the time taken for phasing alone can neither be determined or distinguished from that process

GPL JEL423
Hybrid SA-EC3  Fig. 1 The five isolates representing each of the lineages were assessed for ploidy and aneuploidy (largest five supercontigs presented). A Non-overlapping 10 kb windows of the normalized depth of coverage (normalized by total sequencing depth across percentiles of GC content, and excluding ambiguous sites) shows evidence for aneuploidies (supercontig 2, 3 and 5) among each of the Bd isolates apart from BdCH ACON. B Allele frequencies (percent of reads agreeing with the reference base) are shown from 25% agree to 75% disagree, with red-dotted lines indicating greatest support for bi-allelic/diploidy between 47 and 53% and greatest support for tri-allelic/triploidy between 30-36% and 63-69% with sc2 and sc3 trisomies, Asia2 CLFT065 and CAPE TF5a1 are triploid with a sc2 tetrasomy, CH ACON is triploid with no aneuploidies, GPL JEL423 is diploid with a sc3 trisomy, and Hybrid SA-EC3 is diploid with possible sc2 and sc3 tetrasomy. WhatsHap and HaplotypeTools were therefore tested on polyploid and aneuploid genomes, which could impact the accuracy of phasing.
Phased SA-EC3 VCF's from each tool were compared to consensus genomes for each lineage (generated by a HaplotypeTools utility script). First, pairs of haplotypes from single locations were compared for length and sequence similarity ( Fig. 2A) demonstrating that most of the haplotypes in all tools were short and as a proportion of their total  length, contained a greater number of nucleotide differences between them. The longest and most divergent haplotypes are the most informative in terms of ancestry, and example haplotypes and their phylogenetic placement are shown for each tool for illustrative purposes only (Fig. 2B). One haplotype is phylogenetically closest to BdCAPE for all three haplotype-based trees chosen. For trees based on HaplotypeTools and Whats-Hap haplotypes, the second haplotype is closest to BdGPL, while the second haplotype is BdCH for HaplotypeCaller physical phasing alone. The HaplotypeTools utility script HaplotypePlacer iteratively constructs approximately-maximum-likelihood phylogenetic trees using FastTree for every haplotype (with a default 100 nt minimum haplotype length parameter) in order to identify overall trends in haplotype relatedness to other lineages or species. HaplotypePlacer also outputs a summary for the closest relative across all the haplotypes (Table 4) and generates non-overlapping window plots showing the genomic region for each haplotype pairs, which are colored according to their closest relative (Fig. 3). Using this iterative approach for every haplotype, the majority of haplotypes from hybrid Bd isolate SA-EC3 were phylogenetically clustered with BdGPL for one of the haplotypes and BdCAPE for the other, confirmed by each of the three tools tested (65% for HaplotypeCaller, 80% for HaplotypeTools and 91% for WhatsHap). Therefore, the parental lineages of Bd SA-EC3 are most likely BdGPL and BdCAPE.
To explore recombination in SA-EC3, phased regions between SA-EC3 and phased representatives for each of the lineages were compared using other HaplotypeTools utility scripts ( Table 4). The parental lineages identified by HaplotypePlacer (BdGPL and BdCAPE) had the highest number of overlapping phase groups compared with other lineages (1018-1344 compared to 758-940) and highest number of overlapping phased positions/nucleotides (OPP; 2487-3457 compared to 1661-2131), corroborating those lineages as parental lineages, given a greater sequence divergence Table 4 HaplotypeTools' utility script HaplotypePlacer constructs haplotype trees with FastTree and identifies the closest relative to each Hybrid Bd isolate SA-EC3 haplotypes from GATK v4 HaplotypeCaller physical phasing, HaplotypeTools and WhatsHap were analysed using HaplotypePlacer, finding that the majority of haplotypes from each of the three tools are closest in those trees to BdGPL (38-46%) and BdCAPE (27-45%). A HaplotypeTools utility script was used to compare phasing between SA-EC3 and each of the lineages. For each comparison, the script identified overlapping phase groups, comprising overlapping phased positions (OPP), which were either in the same phase, or showed evidence of crossovers result in fewer conserved heterozygous positions that can be phased. Only 20 crossovers were detected between SA-EC3 and BdGPL (0.8% of all OPP), and only 36 crossovers were detected between SA-EC3 and BdCAPE (1% of all OPP), compared with 9.43-10.54% for the other lineages, which again supports those relationships, given a greater divergence time may result in greater numbers of ancestral crossovers. Crossovers between SA-EC3 and its parental lineages were distributed across the genome. For example, SA-EC3 and BdGPL JEL423 had five OPP's in one overlapping phase group between supercontig 15 positions 473,179-477,410. These phased positions included the following haplotype variant positions for SA-EC3: C-C-A-A-A and T-A-G-G-C, and for BdGPL JEL423: C-C-G-A-A and T-A-A-G-C, indicating a crossover in the middle position (473,194), which is located in an intergenic region between hypothetical protein BDEG_28268 and hypothetical protein BDEG_28269 with PFAM Cytochrome P450. The very low levels of crossovers identified between either parent indicate that the parental haplotypes have remained physically separated, suggesting that SA-EC3 is a result of mitotic recombination/parasexuality i.e. genetic exchange without meiosis, and those few crossovers likely resulting from either (1) double mutations, and (2) mitotic gene conversion events.

Discussion
Correctly identifying haplotypes is central to understanding diploid organisms, including determining genotype-phenotype associations, identifying recombinant or hybrid isolates in microbial populations, determining parental ancestry, and a precursor to a range of population genetic tests. Here, I present a new toolset called HaplotypeTools that is able to accurately phase heterozygous positions from short or long whole genome sequencing data in a fungal genome, and perform a variety of processing steps to recover FASTA files of haplotypes, plot haplotype relatedness to other species across genomic windows, and identify loci of potential crossovers between isolates. HaplotypeTools achieved greater accuracy than two other tools tested (GATK v4 HaplotypeCaller physical phasing and WhatsHap [14]), while also highlighting room for further improvement including computational speed, haplotype length and benefiting from additional datatypes such as Hi-C. Currently, regions lacking alignment data or variant calls due to complex genomic regions (such as very repeat rich regions) are ignored by Haplotype-Tools, and present a further challenge and opportunity for development. HaplotypeTools was tested on the hybrid Bd isolate SA-EC3 from the Amahlathi Local Municipality of the Eastern Cape in South Africa [19]. Comparing the output of Haplo-typeTools to GATK HaplotypeCaller physical phasing revealed that HaplotypeTools was able to recover ~ 5X the total phased nucleotides, and > 4X the haplotype length in terms of haplotype N max and N 50 . HaplotypeTools phased regions were almost entirely contained within WhatsHap [14] phased regions, and based on simulation data, Haplotype-Tools phased regions were likely to include a greater number of true positives and fewer false positive phased sites. Phylogenetic placement of those haplotypes by HaplotypePlacer indicated that SA-EC3 is a recombinant of BdGPL and BdCAPE haplotypes, which was supported by all three tools tested. South Africa has endemic BdGPL and BdCAPE lineages present [20], thereby facilitating such an event. ~ 1% of overlapping phased positions indicated crossovers between SA-EC3 and either parental lineage, indicating that meiotic recombination has not occurred between the parental lineages, and the recombinant genotype is more likely to have arisen via mitotic recombination/parasexuality: a process characterised in disparate fungal relatives [21]. Bd recombining via parasexuality is parsimonious with polyploidy isolates commonly found [22], and has been hypothesized previously [23]. These results highlight the threat of emerging novel genotypes of pathogens following anthropomorphic spread [19].
HaplotypeTools is designed for phasing bi-allelic data, with tri-alleles phasing a possible upgrade route in the future. However, as shown in the real data experiments, Hap-lotypeTools works on polyploid genomes and over aneuploidies. HaplotypeTools also has no clear upper limit on genome size or depth of coverage. HaplotypeTools has been tested using variant calls from GATK v4 HaplotypeCaller [12] and Pilon [18], although additional variant callers such as FreeBayes [24] that output in standard VCF should also work, as should other alignment and sequencing strategies. Sequencing technologies Nanopore and PacBio have not been tested with HaplotypeTools, although simulated long reads have been, which showed a particular advantage in accuracy tests. Indeed, given the shorter haplotypes offered by HaplotypeTools, a particular strength may be high accuracy haplotypes stemming from such long-read data. Sequencing strategies that yield higher sequencing errors could be accommodated by adjusting the 'cut-off percent reads supporting phase group' parameter. Lower sequencing depth could be accommodated by adjusting minimum read depth. Where the use case is very different from those presented here, the tools to perform accuracy checks have been included in the HaplotypeTools toolset, and ideally will be used to validate phasing accuracy and thereby optimized for individual use cases.
The interpretation and usefulness of HaplotypePlacer will rely on the lineages or species that the phased isolate is compared to. For example, it is advisable that a comprehensive set of possible parental lineages are included in the analysis or HaplotypePlacer will be unlikely to yield a clear answer. The phylogenetic relationships from Haplotype-Placer are not currently tested for significance, and therefore for more robust results, haplotype trees should be examined individually and further phylogenetic tests and tools applied to the multiple alignments output. Future areas of development may include updates to efficiency and computational speed, as well as exploring where haplotypes could be extended further without impacting accuracy, and expanding the toolset to include new tools for population genetic tests such as Four-gamete tests.

Conclusions
HaplotypeTools is powerful resource that is able to accurately phase and extract haplotypes for population genetic tests and can determine parental ancestry for hybrid or recombinant diploid isolates or individuals. The toolset will be useful for benchmarking new tools or parameter space for phasing accuracy, and visualizing haplotype coverage across a genome and their phylogenetic placement. Therefore, HaplotypeTools should prove valuable for a range of research questions in model and non-model organism genomes.

HaplotypeTools algorithm
The algorithm for HaplotypeTools comprises on five steps. The first step splits the VCF into windows of a specified length (default 10 kb), and BAM files into windows of the same length. Step  , and then concatenates those into a final phased VCF. Splitting input data into windows allows steps 3 and 4 to be run in parallel on a cluster (Platform Load Sharing Facility (LSF), Sun GridEngine (SGE) or Univa GridEngine (UGE) currently supported). HaplotypeTools can also be run on an individual computer in serial at the expense of slower computational time.
Step 3 of HaplotypeTools assigns Phase Positions (PP) for all reads that overlap ≥ 2 heterozygous positions, which are separated by semicolons and stored in the ID column of the output. PPs consist of: Step 4 runs through pairs of consecutively found heterozygous positions named Previous Heterozygous Position (PHP) and Current Heterozygous Position (CHP), checking them for 5 conditions: 1. Check for ≥ 2 rGT's in CHP. 2. Check the 2 CHP rGT's with the highest depth > min. haplotype depth parameter. 3. Check the 2 CHP rGT combined depth (percent) > phase cutoff parameter. 4. Check PHP passed conditions 1-3. 5. Check for ≥ 2 haplotypes from PHP and CHP PPs.
If any of those 5 conditions are not fulfilled, the PHP ID column is replaced by a comment stating the sample number and the reason it was not phased. A Phase Block (PB) integer value (identifier for separate haplotypes) is also incremented. The following pair of PHP and CHP are then assessed. Providing all 5 conditions are met, the reads that match the two PHP rGT's and the two CHP rGT's are identified, and used to construct a new CHP phased genotype. In the case > 2 rGT's are found, the two with the highest depth are selected. A phase group (PG) is assigned to PHP (if not already assigned) and CHP, which is appended to the SAMPLEINFO column. The PG consists of the contig, sample number, start window and PB separated by dashed (e.g. supercont1.1-0-350,000-1), ensuring every PG is unique e.g. the same phase block identifier in the same window (350,000-360,000) for a 2nd sample in a multi sample VCF will be supercont1.1-1-350,000-1. A summary file for each window is printed including contig, position, ID and SAMPLEINFO, which is used to update the final phased VCF during concatenation in Step 5.

Benchmarking using simulated data
HaplotypeTools and WhatsHap (downloaded from https:// github. com/ whats hap on 1st March 2021) were benchmarked using simulated data from the Batrachochytrium dendrobatidis (Bd) JEL423 genome. First, 40 contigs each of < 10 kb were removed from the reference sequence, ensuring we could simulate 10 kb reads across the genome (updated reference = 29 contigs, 23.44 Mb, N50 = 1.7 Mb). Next, the Biscap utility script "Introduce Random Mutations (IRMS)" [28] was used with the heterozygous setting (HET), which duplicates every chromosome (homologous versions), followed by selecting random nucleotides to 'mutate' into other random nucleotides across both chromosomes and homologous chromosomes. Three such modified reference genomes were generated including 1 SNP/Kb (23,137 total), 10 SNP/Kb (231,370 total) and 100 SNP/Kb (2,313,700 total). Next, sequence reads were simulated from this duplicated and modified reference sequence using WGSim (https:// github. com/ lh3/ wgsim) to ~ 20X depth using either short (100 nt) paired reads (2,313,797 pairs) or long (10 kb) paired reads (23,138 pairs) with no introduced errors (-r 0). Aligning these reads back to the unduplicated and unmodified reference genome will then appear to contain heterozygous positions, for which each position changed is known (the truth set). Reads were aligned to the genome using BWA v0.7.4-r385 mem, and a clean BAM created using Samtools v1.8 view -b -h -f 0 × 2. For WhatsHap compatibility, Picard AddOrReplaceReadGroups was applied to the clean BAM files.
Variants were called from the simulated data alignments using Pilon v1.9 with the diploid flag [18]. For WhatsHap compatibility, reference bases were removed from the VCF. GATK v.4.1.2.0 [12] was not used for calling heterozygous positions from simulated data alignments owing to failing the StrandBiasBySample filter used by HaplotypeCaller. Accuracy of Pilon was assessed using Biscap utility script "Comparison of FDR (CFDR)" [28].
Phased VCF's from both HaplotypeTools and WhatsHap were assessed for accuracy using HaplotypeTools utility scripts. Specifically, Phased in Any (PIA) regions were identified (VCF_phased_to_PIA.pl), with parameter -t PS for WhatsHap and -t PID (default) for HaplotypeTools. FASTA sequences of haplotypes blocks/pairs were extracted using VCF_phased_and_PIA_to_FASTA.pl. Accuracy was assessed using Haplotype_FASTA_ files_to_compare_to_IRMS_het_sites.pl, which calculates for every haplotype block/ pair the number of sites that are correctly phased, sites that are incorrectly phased (False Positive type 1) and sites that have been incorrectly variant called and also been phased (False Positive type 2), false negatives within haplotype blocks (not presented), and Switch Errors (incorrect crossovers between haplotypes). To calculate Switch Errors, false negatives were ignored, while False Positive type 2 were considered as a switch error. Additionally, the script produces two summary statistics including overall Switch Error Rate, where the switch error is divided by the number of opportunities for switch errors. Finally, the Quality adjusted N50 (QAN50) was calculated for each test, where each haplotype block/pair is divided into sub-blocks with no switch errors, which are multiplied to the proportion of phased alleles inside that block (called an adjusted span), sorted from largest to smallest, and then the QAN50 is the size of the adjusted span that includes more than half of the total variants [29].

HaplotypeTools using real data
To test HaplotypeTools on real data, variant calling was first applied to a major fungal pathogen of amphibians, Batrachochytrium dendrobatidis, which has a 23 Mb diploid or triploid (with frequent aneuploid [22]) genome. Paired-end Illumina data from representatives of all five known lineages (BdGPL JEL423, BdCAPE TF5a1, BdCH ACON, BdAsia-1 KRBOOR_323, BdAsia-2 CLFT065, and a hybrid of unknown parentage SA-EC3) were obtained from the NCBI Sequence Read Archive (SRA) [19,22,30]. The Genome Analysis Toolkit (GATK) v.4.1.2.0 [12] was used to call variants. Our Workflow Description Language (WDL) scripts were executed by Cromwell workflow execution engine v.48 [31]. Briefly, raw sequences were pre-processed by mapping reads to the reference genome Bd JEL423 using BWA-MEM v.0.7.17 [32]. Next, duplicates were marked, and the resulting file was sorted by coordinate order. Intervals were created using a custom bash script allowing parallel analysis of large batches of genomics data. Using the scatter-gather approach, HaplotypeCaller was executed in GVCF mode with the diploid ploidy flag. Variants were imported to GATK 4 GenomicsDB and hard filtered (QD < 2.0, FS > 60.0, MQ < 40.0, GQ ≥ 50, AD ≥ 0.8, DP ≥ 10). HaplotypeTools and WhatsHap were used individually to phase each VCF with default parameters, and HaplotypeTools utility scripts to phylogenetically place and visualise haplotype placement across the genome, as well as explore crossovers between pairs of phased VCFs.