Haploflow: Strain-resolved de novo assembly of viral genomes

In viral infections often multiple related viral strains are present, due to coinfection or within-host evolution. We describe Haploflow, a de Bruijn graph-based assembler for de novo genome assembly of viral strains from mixed sequence samples using a novel flow algorithm. We assessed Haploflow across multiple benchmark data sets of increasing complexity, showing that Haploflow is faster and more accurate than viral haplotype assemblers and generic metagenome assemblers not aiming to reconstruct strains. Haplotype reconstructed high-quality strain-resolved assemblies from clinical HCMV samples and SARS-CoV-2 genomes from wastewater metagenomes identical to genomes from clinical isolates.

ability of Haploflow to resolve strains fast and accurately on multiple data sets, including a low complexity HIV strain mixture to a complex, simulated virome sample consisting of 572 viruses with substantial strain-level variation, varying abundances and genome sizes as well as two data sets of clinical human cytomegalovirus (HCMV) and SARS-CoV-2 data.

Results
We next describe the algorithm for creating and manipulating the assembly graph and the flow algorithm that gave Haploflow its name.

deBruijn and unitig graph creation
The input to Haploflow is a sequence file including read sequences and specifying the k-mer size for constructing the deBruijn graph. Optionally, the lowest expected strain abundance (or error rate ) can be specified, leading to removal of more rare kmers from the graph, for graph simplification. Setting the error-rate size too low possibly makes the unitig graph and subsequent assembly more complex, while a too high value will prevent low abundant strains from being assembled.
First, a deBruijn graph 21 is created from the reads, using ntHash 24  This unitig graph has the following properties: a) Every remaining vertex is a junction, having more than one ingoing or outgoing edge or being a source or sink. This means that all variation is found in vertices, all non-unique sequences (i.e. occurring in multiple haplotypes) are found in edges.
b) The unitig graph is a homeomorphic image of the input deBruijn graph, disregarding error correction. This means that no information is lost and the original deBruijn graph could be reconstructed.
When constructing this unitig graph, for each connected component, so-called junctions, vertices having a different in-from out-degree, or an in-or out-degree of more than one in the de Bruijn graph are identified with a depth-first search. These will be the vertices of the new unitig graph, and their kmers are maintained ( Supplementary Fig. S1). The sequence of all the traversed kmers is added to the connecting edge and we define the length of an edge as the length of this sequence in base pairs. Starting from any junction, the next junction in the deBruijn graph is searched, passing vertices with exactly one ingoing and one outgoing edge until the next junction is found. Since all junctions are guaranteed to be searched and the transformation is deterministic, the choice of starting junction does not matter. When the next junction is found, the coverage of all the traversed edges is averaged and checked versus a threshold based on the error rate ( Supplementary Fig. S2). If it is above, the target junction is also added as a vertex to the unitig graph and an edge with the (averaged) coverage value as the edges coverage is added between the two vertices. If the coverage is below the threshold, then neither the target vertex nor the edge are created and the next outgoing edge of the source is considered. This is repeated until all junctions have been searched, such that no vertices with in-degree = out-degree = 1 are remaining ( Figure S1). The resulting unitig graph is usually of drastically reduced size in comparison to the original graph, with sometimes less than 0.01% of vertices remaining. All linear paths of the original graph are condensed into single edges that represent stretches of unique contig sequences.
For every unitig graph a kmer coverage histogram is built (Fig. S2). These histograms reveal several key properties on our data sets: First, the coverage of reads belonging to one genome is approximately normally distributed around the "real" coverage of that genome 19,20 If multiple sufficiently distinct (in terms of average nucleotide identity) genomes are present in a single unitig graph, then all of them will have a corresponding peak in the histogram. The longer a genome, the more different kmers it includes, and accordingly, the higher the peak. If genomes are very closely related, then the peaks will consist of k -mers that are unique to the individual strains and there will be another peak for the common k -mers.
Haploflow uses these coverage histograms as indication of the putative number of genomes 26 and their size relation as well as for error correction. Every read error will create k erroneous kmer vertices in the deBruijn graph 27, 22 , with low coverage in comparison to the real coverage cov of the genomes. Since sequencing errors are rare in Illumina reads, most erroneous kmers will only appear once 28,29 , with fewer kmers appearing multiple times, creating an exponentially decreasing curve in the kmer histogram. This information is factored into the error correction with too rare k -mers being removed (red line, Figure S2). The exact method and values used for error correction can be customized by the user, but by default, all k -mers with a coverage less than the first inflection point of the coverage histogram are filtered and every k -mer which has less than 2% of the coverage of its neighbouring k -mers. This parameter can be increased when dealing with long read data to reflect the higher number of errors in current long read technologies.

Assembly using the flow algorithm
In the second stage the algorithm operates on the unitig graph. It infers and returns a set of contigs based on paths of similar coverages throughout the graph. The flow algorithm consists of three steps that are repeated until the whole graph has been resolved into contigs: (i) finding paths through the graph, (ii) assigning flow values to them, and (iii) determining the path sequence.
In the first step, the source vertex (with an in-degree of 0) with the highest coverage is selected from the unitig graph. Starting from this source, a modified Dijkstra's algorithm 30 is applied, which identifies the fattest path from a source to sink (a vertex having an out-degree of 0) based on edge coverages (Alg. 1, Fig. 1). The fatness of a path is defined by the minimal fatness of the edges on the path. The fatness of an edge is determined as the minimum of its coverage and the fatness of the path from the source until the current edge 31 and can also be called the "capacity" of the edge. The fattest path from a source to a sink is then determined by following the edges maximising fatness until the sink is found. All edges on this path are then marked with a path number. Subsequently, the coverage for all edges on this path are reduced by the path fatness, the next source is selected and the previous steps are repeated until no edges with coverage remain.
Likely due to technical issues, such as amplification biases 32 and read errors 33 , and biological structures such as genomic repeats 34 , coverages do not follow a normal distribution globally and consequently some consecutive edges in the assembly graph may exhibit steep changes in coverage. This is the reason why Haploflow uses a two-step procedure for path finding: First, paths are found through the graph as described before. But instead of directly returning contigs for these paths, these paths are only putative, meaning that all paths and changes to the graph are temporary first.
The algorithm of Haploflow is then able to handle heterogeneous coverages across genomes, e.g. highly pronounced in amplicon data or sequence data with high error rates, by using the local, not global coverage distribution, and not absolute coverage, but relative coverage, i.e.
the only assumption is that the ratio between haplotypes is somewhat conserved.
Additionally, putative paths can get removed, if too many of its edges are already part of a previous putative path (Supplementary Methods). If a path consists almost only of edges that have been used before, it is an indicator that these paths would lead to duplicated contigs.
Finally, this results in a graph where all edges are marked with one or more paths they are assumed to be on.

Alg. 1:
The adapted Dijkstra algorithm used in Haploflow to find fattest paths through the unitig graph. Instead of determining the shortest paths from the source to all vertices, this algorithm determines the fattest path. The fatness is initialised as 0 for all vertices but the source and then the graph is searched using a breadth-first search and based on the fact that the fattest path from a source s to a sink t is based on the edge with the lowest coverage along this path (lines 9 to 12).
In the second part of the path finding we start again from the source with the highest coverage. Since we have all edges marked with the path that they are on, we can select the edge on the same path which is farthest away from our source and calculate the fattest path from the source to this sink. If Haploflow is not able to resolve the fatness unambiguously, for example because two outgoing edges have almost the same fatness, then the path is terminated in this vertex. This is to prevent formation of chimeric contigs, because locally two strains might have similar coverages. For the final path, a corresponding contig is returned and the coverage reduced permanently (Supplementary Methods). Then all edges with capacity 0 and all vertices without any edges are removed and the flow algorithm started anew from the source vertex. This procedure is repeated until the graph does not have any edges remaining.
Haploflow has multiple parameters that can be set to improve the assembly, if more information is given. If no additional information is given, Haploflow has default settings that usually already provide high quality assemblies. All the evaluations in this article were performed using these default parameters, i.e. a value for k of 41, and an error-rate of 0.02.
The value of k = 41 was chosen since too small (in comparison to read lengths) values for k lead to more ambiguities and a higher k might lead to fragmented assemblies. If k does not exceed 50% of read-size, the assemblies are of comparable quality. The error-rate parameter was set to 0.02, because this is the value assumed to be the upper bound of errors in short-read sequencing 35 and can be increased when dealing with more error-prone reads like those from PacBio or Oxford Nanopore.
Additional parameters include a setting for detecting strains with very low absolute abundance ( strict ), for data sets with exactly two strains ( two-strain ), as well as an experimental mode for highly complex data sets with clusters containing five or more closely related strains.

SARS-CoV-2 clinical and wastewater metagenome data
We reconstructed viral haplotypes using Haploflow from 17 clinical SARS-CoV-2 samples sampled in Northrhine-Westphalia, Germany (DUS, 5 Illumina short-read samples) and Madison, Wisconsin (WIS, 6 Illumina short-read and 6 Oxford Nanopore long-read samples).  Table   S6). In addition, Haploflow identified three longer deletions. Five (19.2%) "unique" LoFreq variants are located in error-prone regions (homopolymeric or strand biased) or at the very end of the genome. Four further low frequency sites (<5%, 15.4%) were found by Haploflow and were also among low frequency Lofreq predictions. strains from GISAID 57 . Strains from the same sample are indicated by color, and "major" and "minor", based on their inferred abundances. Evolutionary events, including mutations and indels are shown on edges.
11 . CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in In a study of eight shotgun metagenome samples of sewage from the San Francisco Bay Area 43 , the authors manually assembled consensus SARS-CoV-2 genomes from seven samples and subsequently called variants with inStrain 44 . A comparison to common variants of clinical isolate genomes showed that most of the SNPs found in the data set could be detected in the isolate genomes, with the more (>10%) abundant ones found in strains from California or the US. This and the abundance distribution of some SNPs over time suggested that the data set captured real genomic variation and that different SARS-CoV-2 strains were present in this data set. Haploflow with the option strict 1 (reduced error correction threshold to account for shallow sequencing depth) and scaffolding ( Supplement) , assembled full-length SARS-CoV-2 genomes for the same seven samples, recovering two strains for six of them (Supplementary Table S8). Strikingly, for all assemblies identical genomes of clinical SARS-CoV-2 isolates were identified in the GISAID database using minimap 45 v2.17 (Supplementary Table S8), mostly from samples obtained in the U.S. (5), and California (3), highlighting the ability of Haploflow to recover high quality, strain-resolved viral haplotype genomes from metagenomic data.

Performance evaluation
We evaluated Haploflow on three simulated data sets with increasing complexity: a mixture of three HIV strains represented by error-free simulated reads, multiple in-vitro created mixtures with different proportions of two HCMV strains sequenced with Illumina HiSeq 46 , and a simulated virome 47,48 data set of 572 viruses, with 417 genomes in unique taxa and 155 genomes in common strain taxa with up to eleven closely related strains, to assess Haploflow's ability to assemble complex, larger data sets. Finally, we assembled HCMV genome data from clinical samples collected longitudinally over time from different patients 49 , to characterize the within-and across patient genomic diversity of viral strains, including also larger genomic differences between individual strains in mixed-strain infections, which has not been possible so far. The evaluation was performed using metaQUAST 50 v.5.0.2, which is commonly used to evaluate metagenome assemblies and provides useful metrics for measuring completeness (genome fraction), continuity (NGA50, largest alignment) and accuracy (mismatches per 100kb, duplication ratio) of assemblies and has specific options for analyzing strain-resolved assemblies. In addition, we calculated metrics for assessing strain-resolved assembly; the strain recall, specifying the fraction of correctly assembled strains (more than 90 (80)% genome fraction and less than 1 (5) mismatches/kb), the strain precision, specifying the fraction of correctly assembled strain genomes of all provided genome assemblies (true positives defined as in recall; total number of genome assemblies estimated as number of ground truth genomes with at least one mapping contig * duplication ratio), as well as the composite assembly quality score, we previously defined 14 . This composite score takes six common assembly metrics (genome fraction, largest alignment, duplication ratio, mismatches per 100 kb, number of contigs and NGA50), normalises them in the range of all results, such that for genome fraction, largest core(method) s = value(method) min(value(m ∈ methods)) − max(value(m ∈ methods)) min(value(m ∈ methods)) − alignment and NGA50 and for the core other metrics and then weighs with a weight of 0.3 for genome fraction and largest alignment, respectively and a weight of 0.1 for the other metrics.

HIV-3 in silico mixture
HIV, the human immunodeficiency virus, is a single-stranded RNA virus with an approximately 9.5 kb genome that infects humans, causing AIDS (acquired immunodeficiency syndrome). HIV evolves rapidly within the host and may also present as multi-strain infections 51,52 . The three HIV-1 strains 89.6, HXB2 and JR-CSF, which are commonly used to evaluate viral haplotype assemblers 53,54 , were downloaded from NCBI RefSeq 55 , mixed in the proportions 10:5:2 and error-free reads with a length of 150bp and depth of 20,000 created with CAMISIM 56 and the wgsim read simulator 57 . These genomes differ mainly by SNPs and have an average nucleotide identity (ANI) of ~95%. This threshold was chosen, because experiments on MEGAHIT and metaSPAdes showed that genomes more closely related than 95% will not be resolved 56 .
We benchmarked the quality of strain-resolved Haploflow assemblies for the three strain HIV data against five other de novo assemblers (SPAdes, metaSPAdes, megahit, PEHaplo, SAVAGE in de novo mode) with metaQUAST v.5.0.2, using multiple parameter settings, if defaults settings were undefined (QuasiRecomb, PEHaplo). Furthermore, we assessed five reference-based assemblers (GAEseq 58 , SAVAGE ref-based mode, PredictHaplo, QuasiRecomb and CliqueSNV), which were provided with one strain genome for assembly.
Of all evaluated de novo assemblers, Haploflow performed best across all metrics and the composite assembly score ( Figure S2 ) , assembling all three strains almost completely (more than 90%), with less than 1 mismatch/kb, providing no false positive strain assemblies -that for some methods (QuasiRecomb) reached several thousand strains -and with more than double the assembly contiguity (NGA50) than the second best method (PEHaplo). Haploflow was the only method assembling all strain genomes into complete contigs. Also in comparison to the reference-based assemblers, Haploflow performed best. SAVAGE in reference-based mode, run on a subsample of the data, performed similarly well in five of the eight metrics, however, provided a substantially more fragmented assembly (lower NGA50, more contigs) and a strain genome with more mismatches. Haploflow also closely estimated the true underlying strain proportions, with predicted coverages of 10,371 for HIV 89.6, 5,372 for HIV HXB2 and 1,745 for HIV JR-CSF.

HCMV in vitro mixtures
We next evaluated Haploflow on six lab-created mixtures of two HCMV strains sequenced with Illumina MiSeq 59 . HCMV is one of the largest human pathogenic viruses, causing severe illness in immunocompromised patients and infants, and possessing a double stranded DNA genome of more than 220 kb 60 . The data set includes two different strain mixtures, denoted "TA" (strains TB40 and AD169, 97.9% ANI) and "TM" (strains TB40 and Merlin, 97.7% ANI), with three different mixture ratios each (1:1, 1:10 and 1:50), allowing us to test the ability of assemblers to resolve strains at varying abundances. We ran Haploflow on these data and compared the results to those of twelve other assemblers. These include nine Assemblies were evaluated using metaQUAST v.5.0.2 with the benchmarking workflow QuasiModo 14 , based on common assembly metrics, the composite assembly score, recall and 14 . CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted January 26, 2021.
precision in strain-resolved genome assembly, as before, and the top performing methods falling in the 95-100% range of results identified for every metric.
Of the 12 evaluated de novo assemblers, Haploflow scored best in 5 of the 8 metrics , followed by metaSPAdes (best in 2 of 8: NGA50, duplication ratio), while PEHaplo, tadpole, IDBA, Vicuna and IVA each scored best for one metric, respectively (Supplementary Table   S2). Haploflow assemblies were of very high quality, recovering the most correct strain genomes (10 of 12), providing the best strain precision and composite assembly score (9.34 of 10), highest genome fraction (83.87%) and the most contiguous assemblies (NGA50 62,560). Interestingly, the similarly good NGA50 values of metaSPAdes and Haploflow were obtained in different ways, for the former due to a more contiguous assembly for the abundant strain, while only Haploflow and the haplotype assembler SAVAGE in reference-mode recovered more than 50% of the low abundant strain in several mixtures.

15
. CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in   CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in

Simulated virome data set
To test Haploflows ability to recover viral strain genomes from complex data sets, we evaluated Haploflow, MEGAHIT and metaSPAdes on the simulated virome data set from the Namib desert 47 , which includes short-read data simulated from an in-silico mixture of 572 viral genomes created to assess different assemblers 48 . It was not possible to run the reference-free haplotype-assemblers (SAVAGE, PEHaplo) on this data set. To assess the evolutionary divergence between the viral genomes, we identified clusters of similar genomes using dRep 68 , which resulted in 469 clusters total, out of which 52 clusters had at least two members with more than 95% ANI (average nucleotide identity), resulting in 417 "unique" genomes and 155 genomes in common strain clusters. The 95% threshold was chosen since MEGAHIT and metaSPAdes are only able to resolve genomes less similar than that 56 .
For the 155 common strain genomes, Haploflow correctly assembled 13-28.6% more sequence (62.85% genome fraction versus 55.58% and 48.88% for SPAdes and MEGAHIT, respectively). This was even more pronounced for clusters with genomes of at least eight-fold coverage, for which 19.8-37.5% more genome sequence was correctly assembled (89.37% versus 74.58% and 64.99% for SPAdes and MEGAHIT, respectively). For the less abundant strains from these clusters, 32.7-45.3% more genome sequence was correctly assembled (87.37% versus 65.85% and 60.12% genome fraction, respectively). Even for the complete data set with "unique" genomes and low abundant genomes, Haploflow reconstructed genome fractions similar to the MEGAHIT and metaSPAdes assemblers (72.2% and 68.6% versus 66.6% genome fraction; Table S5), which performed best in the original publication.

Analysis of clinical HCMV data
We used Haploflow with default parameters to reconstruct genomes from longitudinal clinical samples of eight HCMV positive patients, who had multi-strain infections 59 (Supplementary   Table S5). QUAST was used to map HaploFlow's contigs against the consensus strain of the first time point as reference genome, as the exact underlying strain genomes in the samples are unknown. Using the QUAST output, in particular the duplication ratio, the number of strains predicted by HaploFlow was determined by rounding the duplication ratio and then clustering the contigs into that many clusters using HaploFlow's predicted flow (using 17 . CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted January 26, 2021. ; https://doi.org/10.1101/2021.01.25.428049 doi: bioRxiv preprint python's sklearn 69 k-means method). For each of the clusters, QUAST was re-run, again using the consensus as reference genome. Since the resulting genomes, in particular the low abundant (minor) strains, will inherently be different to the consensus to some degree, only the genome fraction is considered a relevant metric here. Additionally, to confirm that HaploFlow created accurate strain-resolved contigs instead of consensus contigs, we compared clusters from the same patient at different time points with each other, finding that contigs from two clusters from consecutive time points showed ~99.9% ANI, while randomly matched clusters only had ~98% ANI.

Runtime and memory consumption
Haploflow's run time depends on the three main steps (Fig. 1 where k is the number of distinct k -mers. In practice, the number of paths is usually limited by the number of different strains, causing this step to also be linear time complexity. For runtime assessment we compared Haploflow to SAVAGE and PEHaplo, the only other haplotype assemblers able to process the HCMV data, though SAVAGE only in reference-based mode, as well as metaSPAdes and MEGAHIT, which performed closest to Haploflow in terms of the summary score or is a very fast metagenome assembler, respectively (Table 1). On the HIV data, Haploflow was more than twice as fast than SAVAGE. The running time and memory requirements of Haploflow and metaSPAdes were comparable, while MEGAHIT was most efficient.
On the HIV three strain and the HCMV two strain mixtures, building the deBruijn graph and creation of the unitig graphs from the reads dominated the overall running time. For the HIV data, building the deBruijn and unitig graphs took ~8 minutes on a laptop with 4 cores and 16 GB RAM. The resulting single unitig graph included 281 vertices and assembly finished after 0.6 seconds. For the HCMV data, assembly on the same laptop required ~100 minutes, of which 85 were used for building the deBruijn and unitig graphs from the reads. . CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted January 26, 2021. ; https://doi.org/10.1101/2021.01.25.428049 doi: bioRxiv preprint assemblers for small viral genomes of a few kb in size. It combines the best of both worlds for strain-resolved genome assembly, by using the fast algorithms of the metagenome assemblers, i.e. deBruijn graph based assembly, together with a specialised flow algorithm for capturing strain variation, which allows to link variants that do not co-occur on reads.
Taken together, our results demonstrate a substantial performance improvement in strain-resolved assembly for Haploflow in comparison to sixteen other metagenome and viral haplotype assemblers evaluated across different benchmark data sets. The benchmark experiments on data sets with varying numbers of strains and abundances demonstrated that Haploflow can handle data sets with substantial variation in genomic coverage introduced by amplicon sequencing and resolved strains at different degrees of evolutionary divergences well, ranging from 95% ANI (HIV), over 98% ANI (HCMV), to more than 99% ANI (SARS-CoV-2 data). On the six lab-generated HCMV mixed strain data sets, Haploflow was top scoring in the most metrics (5 of 8) in comparison to twelve other assemblers. This performance improvement in strain recall, strain precision, composite score, genome fraction and NGA50 was largely due to a better assembly of the less abundant strains. Except for Haploflow and SAVAGE no method assembled low abundant strains to 50% on average and Haploflow had a far higher NGA50, creating long contigs rather than a highly fragmented assembly. On the clinical HCMV data tested, Haploflow almost perfectly (91.7% recall and 93.6% precision) assembled strains with variants predicted by variant callers and very closely predicted the abundances of second and third strains. On a three strain HIV data set, Haploflow assembled all three genomes almost entirely, with very few mismatches. This is reflected in Haploflow scoring top in all eight metrics, with a composite assembly score of 9.66 (out of 10), in comparison to 8.02 for the best reference-based assembler PredictHaplo, and of 6.28 for the best reference-free assembler PEHaplo.
Benchmarking on a rather complex simulated virome data set with 417 taxa with unique genomes and 155 genomes in common strain taxa showed that Haploflow successfully assembled 2-3 strains for "common strain taxa" with 2-11 strains, substantially better so than the state-of-the-art metagenome assemblers able to process these data, that the tested haplotype assemblers could not. This effect was particularly pronounced for strain genome coverages within a favorable (>8) range for assembly. The abundance distribution of taxa in microbial communities is assumed to be oftentimes log-normal 13 , with only a few abundant and a long tail of very low abundant ones with consequently low coverages. This indicates 21 . CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted January 26, 2021. ; https://doi.org/10.1101/2021.01.25.428049 doi: bioRxiv preprint that Haploflow is suitable for processing many real world data sets and characterizing the more abundant strains, similar to the reference-based StrainPhlan strain-typing software 74 .
Finally, Haploflow reconstructed multiple, full length SARS-CoV-2 strains from a multi-sample wastewater metagenome data set with exact matches to clinical isolate genomes found in the GISAID database, highlighting the ability of Haploflow to recover high quality, strain-resolved viral haplotype genomes from metagenomic data.
In addition to short-read data, Haploflow also allows processing of long read data, which we demonstrated on the SARS-CoV-2 clinical data sets. For most applications dealing with low viral loads (e.g. the SARS-CoV-2 sequencing demonstrated in this article), PCR amplification is necessary to enrich viral reads. This naturally limits the possible maximum read length to the length of the PCR product, which is for those applications in the domain of short-read sequencing. The speed of the Haploflow algorithm principally also allows its extension to bacterial data, e.g. by adding multi-core and multi-k support and modules for handling differently sized and structured microbial genomes. Thus strain-resolved assembly from metagenome data for microbial taxa with several closely related strains could be a future application.

Availability of Data and Materials
The code of Haploflow is available on Github under https://github.com/hzi-bifo/Haploflow .

22
. CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted January 26, 2021. ; https://doi.org/10.1101/2021.01.25.428049 doi: bioRxiv preprint . CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted January 26, 2021. ; https://doi.org/10.1101/2021.01.25.428049 doi: bioRxiv preprint . CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in

30
. CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in

Exemplary clarification of path finding step realized in Haploflow
In the unitig graph there are multiple paths between a source and a sink which (sans sequencing errors) correspond to the different strains present in a sample. The choice of the correct path follows the fatness algorithm described before. There is another factor though, namely the length of the fattest path, which Haploflow also maximises. In Figure S1 there is exactly one source, the vertex ACTA, and one sink, the vertex ATGC, but there are infinitely many paths from ACTA to ATGC, since CTAT to TCTA and TCTA to CTAT form a loop. To prevent this, Haploflow allows every edge only to be used once in every path finding step. This makes the particular loop in Figure S1 "resolvable", the number of paths reduces to five: 1: with a fatness of 30 with a fatness of 45 with a fatness of 75 with a fatness of 25 with a fatness of 25 Just going by the fattest graph, path 3 would get selected, but this path is shorter than all other paths and thus only paths 2 and 5 can be selected, out of which path 2 has the higher fatness of 45 (the coverage of the first sequence). The next longest and fattest path is path 5 with a fatness of 25 (the coverage of the last sequence) and finally path 1 remains with a fatness of 30. Paths 3 and 4 do not exist at this point, since the capacity of all edges has been used.

32
. CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted January 26, 2021. ; https://doi.org/10.1101/2021.01.25.428049 doi: bioRxiv preprint Suppl. Figure S1: The deBruijn graph (1) and its corresponding Unitig graph (2) for three related sequences and their coverage (3). The red k -mers and edges between them are part of linear paths and are replaced by a single red edge in the unitig graph. The edges are labelled with the "capacity", the sum of the coverages of the sequences going over them, in the deBruijn graph and the average capacity of all smoothed edges in the Unitig graph -which in 33 . CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted January 26, 2021. ; https://doi.org/10.1101/2021.01.25.428049 doi: bioRxiv preprint this case is the same as the original capacity. Some of the edges represent one (capacities 25, 30, 45), some two (capacity 70 = 45 + 25) and some all (capacity 100 = 45 + 30 + 25) of the sequences.

Algorithmic details of the flow algorithm
The fatness of a path is defined by the lowest fatness value of any edge along this path. Since the fatness of an edge might be underestimated if the coverage dropped for edges occurring before this edge in the path, it is not sufficient to just remove the calculated fatness when reducing flow along a path. Instead, the coverage of the source is set to 0 and for every other edge on the path the flow is reduced to max(capacity -previously_removed_flow, 0) where previously_removed_flow is the flow removed from the last edge on the path. Since it is possible that edges are used multiple times, it is also possible that there are paths that have hardly any edges that are "unique" to that path. We call an edge unique , if it is part of exactly one path. If the fraction or length of unique edges of a path is too low, by default less than 500 bases, the path is removed for all edges on which it is not unique, to avoid overestimating the total number of paths in the graph. Edges with coverage of 0 will get removed, possibly producing new sources. If Haploflow crosses a junction with two or more outgoing edges with similar coverage values and cannot make an informed decision which is the higher abundant path, Haploflow will break the contig at this position. This happens either if multiple strains have very similar coverages or on genomic repeats. The exact threshold for this break is derived using the error_rate and strict/threshold parameter: If the difference is less than the percentage value given or the (either explicitly stated or derived from strict ) threshold, the contig is broken.
After the path has been found, the coverage of all unique edges on this path is reduced to 0, as no other path will be traversing this edge. If there is more than one path going over the edge, then the flow is reduced, corresponding to the expected coverage of the current edge.
This value is the flow removed from the last visited unique edge, meaning that local increases and decreases in coverage are also captured. If the coverage of an edge would be reduced to 0, even though there are still paths going over this edge, the coverage is set to a dummy value such that it can still be used. On the other hand, if a path consists solely of non-unique edges, a duplication is assumed and the current path is not considered.

34
. CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted January 26, 2021. ; https://doi.org/10.1101/2021.01.25.428049 doi: bioRxiv preprint When permanently reducing the flow, it is not sufficient to remove the (overall) fatness of the path, since the fatness can only decrease (or stay the same) along a path, while the coverage values might fluctuate, based on amplification and sequencing strategy. To circumvent this, the flow is reduced by a "local fatness": All unique edges are removed as described before, for all other edges either flow removed from the last edge or, if the value is higher, of the average per-base removed flow, is taken as a baseline and depending on the fact whether the flow decreased or increased within the last edge, the flow to be removed is decreased and increased accordingly. If there would not be any flow remaining, a minimal value is left over.
Suppl. Figure  CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted January 26, 2021. ; https://doi.org/10.1101/2021.01.25.428049 doi: bioRxiv preprint (234,127 bp), with a length ratio of 22:1, distinct peaks occur at coverages of ~45 and ~2500.
The first peak has 10,000 distinct kmers and the second one 400, indicating that the first genome might be around 25x as large as the second one.

36
. CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in

Reconstruction of full length SARS-CoV-2 sequences
In nine out of 17 SARS-CoV-2 samples and 6 out of 7 wastewater SARS-CoV-2 samples, quast reported a high duplication ratio for the Haploflow assembly; four out of five DUS and five out of twelve WIS samples. This can be explained by either artificially duplicating parts of the genome or the presence of two closely related strains. Since Haploflow did not construct single contig assemblies for all these strains, first a "scaffolding" step was performed: All contigs are clustered using k -means clustering on Haploflow's predicted abundance, the number of clusters depending on the duplication ratio. Then, using the 39 . CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted January 26, 2021. ;

Supplementary tables
Suppl. Table S1: Benchmark of Haploflow against five de novo assemblers and five reference-based assemblers (grey background) on the HIV-3 data set. For every metric, best performing methods (95-100% range of results) are indicated. Strain recall : fraction of correctly recovered high quality strain genomes ( 1 ( 5) mismatches per kb; more than ≤ ≤ 90% (80%) genome fraction 77 ); Strain precision : fraction of correctly recovered high quality strain genomes of all genome assemblies. Evaluation using metaQUAST results and derived strain assembly metrics with HIV reference genomes 89.6, HXB-2 and JR-SCF in "combined reference" mode. a Results for a 140 Mb subset of the 500 Mb dataset generated with BBnorm. *runs that did not complete after ten days or failed. **as being an outlier, QuasiRecomb results were excluded from composite score and radar plot calculation. Unique genomes refers to genomes for which no other genome with an ANI of >95% is in the data set, common strain genomes are ones for which at least one such genome is present. The coverage value was calculated by dividing the total number of base pairs in the reads belonging to the genome by its size. 45

SPAdes
MEGAHIT Haploflow . CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted January 26, 2021. ; https://doi.org/10.1101/2021.01.25.428049 doi: bioRxiv preprint Suppl. Table S6: Comparison of multiple strain infection labeling of samples by VATK 78 , the predicted relative abundance of the low abundant strain(s) and the predicted abundance by Haploflow (relative and absolute) as well as the genome completeness (genome fraction, mapped against the first sample consensus genome) of strains Haploflow reconstructed (Supplementary methods). A "-" denotes that no evidence of a second strain was found by either VATK (column 3) or Haploflow (column 4). Percentage values with a (*) denote problems in clustering, evident by a still high duplication ratio after clustering or the sum of genome fractions of two clusters summing up to ~1, indicating that underclustering or in the latter case overclustering took place. Three percentage values in the third column indicate that Haploflow predicted three strains being present. 46

Patient
Time . CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in . CC-BY 4.0 International license perpetuity. It is made available under a preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in The copyright holder for this this version posted January 26, 2021. ; https://doi.org/10.1101/2021.01.25.428049 doi: bioRxiv preprint