flopp: Extremely Fast Long-Read Polyploid Haplotype Phasing by Uniform Tree Partitioning

Resolving haplotypes in polyploid genomes using phase information from sequencing reads is an important and challenging problem. We introduce two new mathematical formulations of polyploid haplotype phasing: (1) the min-sum max tree partition problem, which is a more flexible graphical metric compared with the standard minimum error correction (MEC) model in the polyploid setting, and (2) the uniform probabilistic error minimization model, which is a probabilistic analogue of the MEC model. We incorporate both formulations into a long-read based polyploid haplotype phasing method called flopp. We show that flopp compares favorably with state-of-the-art algorithms—up to 30 times faster with 2 times fewer switch errors on 6 × ploidy simulated data. Further, we show using real nanopore data that flopp can quickly reveal reasonable haplotype structures from the autotetraploid Solanum tuberosum (potato).


INTRODUCTION
As genomic sequencing technologies continue to improve, we are increasingly able to resolve ever finer genomic details. Case in point, traditional genotyping only determines whether a particular allele is present in a genome (Scheben et al., 2017). However, when organisms are polyploid (and most eukaryotic organisms are), they have multiple copies of each chromosome. We are then additionally interested in the problem of resolving haplotypes, that is, determining the sequence of alleles on each specific chromosome and not just the presence of an allele within the genome. Phasing is the procedure of resolving the haplotypes by linking alleles within a chromosome (Browning and Browning, 2011).
We focus on phasing polyploid organisms by using third-generation sequencing data. Many plants have ploidy greater than two (i.e., have more than two copies of each chromosome), such as tetraploid potatoes (Solanum tuberosum) or hexaploid wheat and cotton. Haplotype phasing has been used to gain insights into evolution (Eriksson et al., 2018), breeding (Qian et al., 2017), and genome-wide association studies (Maldonado et al., 2019), among other applications.
The most common way of determining haplotypes is to use pooled genotyping information from a population to estimate haplotypes (Browning and Browning, 2011). For unrelated individuals, sophisticated statistical methods are used to determine the most likely haplotypes for each individual (Browning and Browning, 2007;Howie et al., 2009;Delaneau et al., 2019) in the population. For related individuals, identity-by-descent information can be used for haplotype phasing (Gao et al., 2009;Motazedi et al., 2018). However, these types of methods do not work on single individuals because they rely on having population data available.
Instead, in this work, we adopt the approach of single individual phasing by sequencing, which is now common in phasing human haplotypes (Choi et al., 2018). We focus on using sequencing information for phasing, which allows us to phase a single individual without population information or prior haplotype knowledge. This is closely related to genome assembly where overlapping reads are stitched together (Nagarajan and Pop, 2013); in our case, nearby heterozygous alleles are stitched together by read information. For the rest of the article, we use the term ''phasing'' to mean single individual phasing using sequencing information.

Related work
The first method for phasing polyploid genomes was HapCompass (Aguiar and Istrail, 2012), which uses a graphical approach. Popular methods that followed include HapTree (Berger et al., 2014, 2020), H-PoP (Xie et al., 2016), and SDhaP (Das and Vikalo, 2015). HapTree and H-PoP heuristically optimize a likelihood function and an objective function based on the minimum error correction (MEC) model, respectively, whereas SDhaP takes a semi-definite programming approach. HapTree-X (Berger et al., 2020) additionally incorporates long-range expression correlations to allow phasing even of pairs of variants that cannot be covered by a single read, overcoming some of the problems with short-read phasing.
Due to the increased prevalence of long-read data from Oxford Nanopore or PacBio, newer methods taking advantage of the longer-range correlations accessible through long-read data have been proposed (Schrinner et al., 2020; Abou Saada et al., 2021). Unfortunately, the error profiles of long-read technologies differ considerably from Illumina short-reads [e.g., a higher prevalence of indel errors compared with single nucleotide polymorphism (SNP) errors], so methods tailored to short-reads (Siragusa et al., 2019; Moeinzadeh et al., 2020) may be ineffective and altogether new paradigms are required.
At a more theoretical level, in the diploid setting, the standard MEC model (Bonizzoni et al., 2016) has proven to be quite powerful. It is known to be APX-Hard and NP-Hard, but it is heuristically solved in practice with relatively good results. Unfortunately, a good MEC score may not imply a good phasing when errors are present (Majidian et al., 2020). This shortcoming is further exacerbated in the polyploid setting: because the MEC model does not consider coverage, similar haplotypes may be clustered together, a phenomenon known as genome collapsing (Schrinner et al., 2020). Thus, although the MEC model can be applied to the polyploid setting, it may be suboptimal; however, there is yet to be an alternative, commonly agreed-upon formulation of the polyploid phasing problem. Indeed, this is reflected in the literature: the mathematical underpinnings of the various polyploid phasing algorithms are very diverse.

Contributions
In this article, we first address the theoretical shortcomings highlighted earlier by giving two new mathematical formulations of polyploid phasing. We adopt a probabilistic framework that allows us to (1) give a better notion of haplotype similarity between reads and (2) define a new objective function, the uniform probabilistic error minimization (UPEM) score. Further, we introduce the idea of framing the polyploid phasing problem as one of partitioning a graph to minimize the sum of the max spanning trees within each cluster, which, we show, is related to the MEC formulation in a specific case.
We argue that these formulations are better suited for polyploid haplotype phasing using long-reads. In addition to our theoretical justifications, we also implemented a method we call flopp (fast local polyploid phaser). flopp optimizes the UPEM score and builds up local haplotypes through the graph partitioning procedure described. When tested on simulated datasets, flopp produces much more accurate local haplotype blocks than other methods and also frequently produces the most accurate global phasing. flopp's runtime is additionally comparable to, and often much faster than, its competitors.
The code for flopp is available at https://github.com/bluenote-1577/flopp. flopp utilizes Rust-Bio (Köster, 2016), is written entirely in the Rust programming language, and is fully parallelizable. flopp takes as input either binary alignment map + variant call format (BAM + VCF) files, or the same fragment file format used by AltHap (Hashemi et al., 2018) and H-PoP.

Definitions
We represent every read as a sequence of variants (rather than as a string of nucleotides, which is commonly used for mapping/assembly tasks). Let R be the set of all reads that align to a chromosome and m be the number of variants in our chromosome. Assuming that tetra-allelic SNPs are allowed, every read $r_i$ is considered as an element of the space $r_i \in \{-, 0, 1, 2, 3\}^m$. A read in this variant space is sometimes called a fragment. Denoting $r_i[j]$ as the jth coordinate of $r_i$, we have $r_i[j] \in \{0, 1, 2, 3\}$ if the jth variant is contained in the read $r_i$, where 0 represents the reference allele, 1 represents the first alternative allele, and so forth; otherwise $r_i[j] = -$.
We note that flopp by default only uses SNP information, but the user may generate their own fragments, permitting indels and other types of variants to be used even if there are more than four possible alleles. The formalism is the same regardless of the types of variants used or the number of alternative alleles.
For any two reads $r_1, r_2$, let $d(r_1, r_2)$ and $s(r_1, r_2)$ stand for different and same, representing the number of different and same variants, respectively, between the two reads. We use k to denote the ploidy. Given a k-ploid organism, a natural approach to phasing is to partition R into k subsets where the cluster membership of a read represents which haplotype the read belongs to. Let $R_1, \dots, R_k$ be a partition of R. Given a partition $P = \{R_1, \dots, R_k\}$, we denote $P[i] = R_i$. Define the consensus haplotype $H(R_i) \in \{-, 0, 1, 2, 3\}^m$ associated to a subset of reads as follows. For all indices $l = 1, \dots, m$, let $H(R_i)[l] = \arg\max_{a \in \{0,1,2,3\}} |\{r \in R_i : r[l] = a\}|$ and break ties according to some arbitrary order. If only $-$ appears at position l over all reads, we take $H(R_i)[l] = -$. In our formalism, we can phrase the MEC model of haplotype phasing as the task of finding a partition $\{R_1, \dots, R_k\}$ of R such that $\sum_{i=1}^{k} \sum_{r_j \in R_i} d(r_j, H(R_i))$, which is called the MEC score, is minimized. For notational purposes, for a subset $R_i \subseteq R$, define $S(R_i) = \sum_{r \in R_i} s(H(R_i), r)$ and $D(R_i) = \sum_{r \in R_i} d(H(R_i), r)$; $\sum_{i=1}^{k} D(R_i)$ is just the MEC score for a particular partition.
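To make the formalism concrete, the sketch below (in Rust, the language flopp is written in; all names are illustrative and not flopp's actual API) implements fragments, the d/s counts, the consensus haplotype, and the MEC score of a partition:

```rust
// Illustrative sketch of the fragment formalism; not flopp's actual API.
// `None` encodes '-' (variant not covered); `Some(a)` encodes allele a in {0,1,2,3}.
type Fragment = Vec<Option<u8>>;

/// Counts of (same, different) variant calls over the overlap of two fragments.
fn s_d(r1: &Fragment, r2: &Fragment) -> (usize, usize) {
    let (mut s, mut d) = (0, 0);
    for (a, b) in r1.iter().zip(r2.iter()) {
        if let (Some(x), Some(y)) = (a, b) {
            if x == y { s += 1; } else { d += 1; }
        }
    }
    (s, d)
}

/// Consensus haplotype H(R_i): majority allele at each site, '-' where uncovered.
/// Ties are broken by an arbitrary (but fixed) order, as in the text.
fn consensus(reads: &[Fragment], m: usize) -> Fragment {
    (0..m)
        .map(|l| {
            let mut counts = [0usize; 4];
            for r in reads {
                if let Some(a) = r[l] {
                    counts[a as usize] += 1;
                }
            }
            let (best, n) = counts.iter().enumerate().max_by_key(|&(_, c)| c).unwrap();
            if *n == 0 { None } else { Some(best as u8) }
        })
        .collect()
}

/// MEC score of a partition: sum over clusters of d(r, H(R_i)).
fn mec(partition: &[Vec<Fragment>], m: usize) -> usize {
    partition
        .iter()
        .map(|cluster| {
            let h = consensus(cluster, m);
            cluster.iter().map(|r| s_d(r, &h).1).sum::<usize>()
        })
        .sum()
}
```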

Problem formulation
2.2.1. min-sum max tree partition model. Let $G(R) = (R, E, w)$ be an undirected graph where the vertices are R and an edge in E is present between two reads $r_1, r_2$ if $r_1, r_2$ overlap, that is, $d(r_1, r_2) + s(r_1, r_2) > 0$. Let the weight of $e = (r_1, r_2)$ be $w(e) = w(r_1, r_2)$ for some weight function w. We call G(R) the read-graph; see Figure 1. A similar notion is found in previous work (Mazrouee and Wang, 2014, 2020; Das and Vikalo, 2015; Schrinner et al., 2020; Abou Saada et al., 2021). For a partition of R into disjoint subsets $\{R_1, \dots, R_k\}$, we take $G(R_i)$ as defined earlier. We only consider partitions of vertices such that all $G(R_i)$ are connected, which we will denote as valid partitions. Let MST(G) be the maximum spanning tree of a graph G. Define
$$\mathrm{SMTP}^k_R(R_1, \dots, R_k) = \sum_{i=1}^{k} \sum_{e \in MST(G(R_i))} w(e).$$

FIG. 1. An example of a read-graph (without the weighting w specified) along with corresponding d, s values between reads. The colors represent a partition of the read-graph into two subsets $R_1, R_2$. The consensus haplotypes $H(R_1), H(R_2)$ are shown, where 0/1 indicates that either 0 or 1 is a valid allele for the consensus haplotype.
We formulate the min-sum max tree partition (MSMTP) problem as finding a valid partition $\{R_1, \dots, R_k\}$ of R such that $\mathrm{SMTP}^k_R(R_1, \dots, R_k)$ is minimized. We refer to the objective function being minimized as the SMTP score and the computational problem of minimizing the SMTP score as the MSMTP problem.
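As a concrete reference point, the SMTP score of a given valid partition can be computed directly, one maximum spanning tree per cluster. The sketch below (illustrative, not flopp's implementation) uses Prim's algorithm with inverted comparisons; `None` marks non-overlapping read pairs, and each cluster is assumed connected as required for a valid partition:

```rust
// Sketch: SMTP score of a partition, where each cluster is given by its own
// weight matrix. Maximum spanning tree via Prim's algorithm; `w[i][j] = None`
// means reads i and j do not overlap. Clusters are assumed connected.

fn max_spanning_tree_weight(w: &[Vec<Option<f64>>]) -> f64 {
    let n = w.len();
    if n == 0 { return 0.0; }
    let mut in_tree = vec![false; n];
    // best[v] = heaviest edge connecting v to the current tree
    let mut best = vec![f64::NEG_INFINITY; n];
    in_tree[0] = true;
    for v in 1..n {
        if let Some(x) = w[0][v] { best[v] = x; }
    }
    let mut total = 0.0;
    for _ in 1..n {
        // pick the untouched vertex with the heaviest connecting edge
        let u = (0..n)
            .filter(|&v| !in_tree[v])
            .max_by(|&a, &b| best[a].partial_cmp(&best[b]).unwrap())
            .unwrap();
        in_tree[u] = true;
        total += best[u];
        for v in 0..n {
            if !in_tree[v] {
                if let Some(x) = w[u][v] {
                    if x > best[v] { best[v] = x; }
                }
            }
        }
    }
    total
}

/// SMTP score: sum of maximum spanning tree weights over the clusters.
fn smtp(clusters: &[Vec<Vec<Option<f64>>>]) -> f64 {
    clusters.iter().map(|c| max_spanning_tree_weight(c)).sum()
}
```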
The MSMTP problem falls under a class of problems called graph tree partition problems (Cordone and Maffioli, 2004), most of which are NP-Hard. The following theorem shows that MSMTP is NP-Hard as well.

Theorem 1. The MSMTP problem is NP-Hard.

Proof. We take a reduction from graph coloring. Let $G = (V, E)$ be a connected, undirected graph.

Let $G' = (V, E', w')$ be a complete graph where the vertices of $G'$ are the same as G. Let the weight of an edge $e \in E'$ be 2 if $e \in E$, the edge set of the original graph, and let it be 1 otherwise.

Let $V_1, \dots, V_k$ be a (valid) partition of V that solves MSMTP for $G'$. Note that none of $V_1, \dots, V_k$ is empty; otherwise we could move a single vertex to the empty set and obtain a lower SMTP value for this graph. The total number of edges over all spanning trees of $G'(V_1), \dots, G'(V_k)$ is $|V| - k$, so $\mathrm{SMTP}^k_{G'}(V_1, \dots, V_k) \ge |V| - k$. If $\mathrm{SMTP}^k_{G'}(V_1, \dots, V_k) = |V| - k$, then the maximum spanning tree for each subset only contains edges with weight 1. In particular, this means that no subgraph $G'(V_i)$ has an edge with weight 2, so there are no edges between any vertices of $V_i$ in G; otherwise, the weight 2 edge would be included in the max spanning tree. Thus, the partition $V_1, \dots, V_k$ gives a k-coloring of G. For the other implication, clearly if a k-coloring exists for G, then we can find a partition $V_1, \dots, V_k$ such that $G'(V_i)$ only has edges of weight 1 between vertices; then $\mathrm{SMTP}^k_{G'}(V_1, \dots, V_k) = |V| - k$ follows. Therefore, any algorithm that solves MSMTP also decides whether G has a k-coloring, which is NP-Complete for $k \ge 3$.
Intuitively, assuming each G(R i ) is connected, a maximum spanning tree is a maximum measure of discordance along the entire haplotype. We prove next that under a specific constraint on the read-graph, the SMTP score for w(r 1 ‚ r 2 ) = d(r 1 ‚ r 2 ) is an upper bound for the MEC score.
Theorem 2. Suppose $w(r_1, r_2) = d(r_1, r_2)$. Let $a, b \in \mathbb{N}$ and let R be a set of fragments such that for every $r \in R$, $r[k] \ne -$ for all $k \in \{a, a+1, \dots, b\}$ and $r[l] = -$ for $l \notin \{a, a+1, \dots, b\}$. For any $R_i$ in a valid partition,
$$D(R_i) \le \sum_{e \in MST(G(R_i))} w(e) + \min_{r \in R_i} w(r, H(R_i)),$$
where $Star(H(R_i))$ is the star-graph having an internal node $H(R_i)$ and every $r \in R_i$ as a leaf node.

Proof. Note that $H(R_i)$ is constructed precisely as a sequence in $\{-, 0, 1, 2, 3\}^m$, non-$-$ at indices $a, a+1, \dots, b$, that minimizes the sum $\sum_{e \in Star(H(R_i))} w(e) = D(R_i)$. For any $r \in R_i$, r is also non-$-$ at the same indices by assumption, so $\sum_{e \in Star(r)} w(e) \ge \sum_{e \in Star(H(R_i))} w(e)$. Removing the node $H(R_i)$ from $Star(r)$ and the corresponding edge, we get a spanning tree of $G(R_i)$; we call this new graph $Star(r)'$. Thus, $\sum_{e \in Star(r)'} w(e) = \sum_{e \in Star(r)} w(e) - w(r, H(R_i)) \ge D(R_i) - w(r, H(R_i))$, and clearly $\sum_{e \in MST(G(R_i))} w(e) \ge \sum_{e \in Star(r)'} w(e)$. The inequality holds for any r, so we can choose r to minimize $w(r, H(R_i))$, completing the proof.

SHAW AND YU
The theorem just cited relies on the assumption that all reads in the set overlap exactly. Although this is obviously not true for the entire genome, flopp takes a local clustering approach where the entire read set R is partitioned into smaller local read sets that overlap significantly.
We verified experimentally that the SMTP score for partitions generated by our method has a strong linear dependence on the MEC score when w = d; see Section 3.3. These results justify that minimizing the SMTP score is worthwhile for this specific case. However, we do not have to necessarily use w = d. In Section 2.3, we opt for a more theoretically sound probabilistic weighting.
2.2.2. UPEM model. The SMTP score has problems with collapsing genomes in the same manner the MEC score does; it does not take into account the assumption that coverage should be uniform between haplotypes. Concretely, if a polyploid organism has two identical haplotypes, the reads from both haplotypes may be clustered together in the MEC model and a noisy read may instead be placed in its own cluster.
Let $\epsilon$ represent the probability that a variant is called incorrectly, let $\rho \in \mathbb{R}$ be a normalizing constant, and let $X_i \sim \mathrm{Binomial}(\lceil (D(R_i) + S(R_i))/\rho \rceil, \epsilon)$ be a binomial random variable. Then, define
$$\mathrm{UPEM}(R_1, \dots, R_k) = \log \chi^2\big(x_1, \dots, x_k\big) + \sum_{i=1}^{k} \log \Pr\big(X_i \ge \lceil D(R_i)/\rho \rceil\big).$$
The $\chi^2(x_1, \dots, x_k)$ term is the p-value for the $\chi^2$ test of uniformity applied to the cluster sizes $x_i = D(R_i) + S(R_i)$, whereas the binomial term is a sum of log one-sided binomial tests where the null hypothesis is that the error rate of a clustering is $\epsilon$. Therefore, the UPEM score is just a sum of log p-values.

The UPEM score is a probabilistic version of the MEC model under the hypothesis that the errors and coverage are uniform across haplotypes. The parameter $\rho$ is a normalizing constant and is important because if a specific genome has a high rate of heterozygosity and $\epsilon$ is slightly underestimated, then the sample size $D(R_i) + S(R_i)$ is large. The p-value associated with the binomial random variable will be extremely small and drown out the contributions from the $\chi^2$ term, so $\rho$ is a learned, dataset-specific constant used to keep the two terms balanced.
The UPEM maximization enforces a relatively uniform partition. Further, errors will be distributed among partitions equally due to the non-linearity of the UPEM score; if one cluster is extremely erroneous, the sum of the binomial terms may be higher for a more spread out error even if the overall MEC score is slightly higher. If error and coverage uniformity assumptions are satisfied, clearly these two properties are desirable in an objective function.

Local graph clustering
We now discuss the algorithms implemented in flopp. A high-level block diagram outlining flopp's main processing stages is shown in Figure 2. Importantly, flopp is a local clustering method, which means that we first attempt to find good partitions for smaller subsets of reads by optimizing the SMTP and UPEM functions and then join haplotype blocks together afterward. A graphic outlining the local clustering and linking procedure can be found in Figure 3.

2.3.1. Choice of edge weight function. Previous methods that use a read-graph formalism, such as SDhaP (Das and Vikalo, 2015), FastHap (Mazrouee and Wang, 2014), WhatsHap Polyphase (WHP) (Schrinner et al., 2020), and nPhase (Abou Saada et al., 2021), all define different weightings between reads. Previously used weight functions include
$$w_{error}(r_1, r_2) = \frac{s(r_1, r_2)}{s(r_1, r_2) + d(r_1, r_2)}$$
(nPhase; FastHap is similar). These weightings are quite simple and have issues when the lengths of the reads have high variance, as is the case with long-reads. A more sophisticated approach is to use a probabilistic model. Let $E(r_1, r_2) = 1 - w_{error}(r_1, r_2)$ be the average error rate between two reads. Assuming an error parameter $\epsilon$ as described in the UPEM formula, we define
$$D_{KL}(r_1, r_2 \| \epsilon) = E \log \frac{E}{2\epsilon(1-\epsilon)} + (1 - E) \log \frac{1 - E}{1 - 2\epsilon(1-\epsilon)}, \qquad E = E(r_1, r_2).$$
$D_{KL}(r_1, r_2 \| \epsilon)$ is the Kullback-Leibler divergence between an $E(r_1, r_2)$-coin and a $2\epsilon(1-\epsilon)$-coin. The reason we use $2\epsilon(1-\epsilon)$ is that, given two reads with error rate $\epsilon$, the probability that two variants from the same haplotype are different is $2\epsilon(1-\epsilon)$. Now let
$$w(r_1, r_2) = [s(r_1, r_2) + d(r_1, r_2)] \cdot D_{KL}(r_1, r_2 \| \epsilon). \quad (2)$$
$D_{KL}(r_1, r_2 \| \epsilon)$ is a measure of divergence between the error profile of the two reads and the expected error profile for two reads from the same haplotype. There is a slight issue: $D_{KL}(r_1, r_2 \| \epsilon)$ has a minimum at $E(r_1, r_2) = 2\epsilon(1-\epsilon)$ when $\epsilon$ is fixed.
We want $D_{KL}$ to be a monotonically increasing function with respect to E, so we flip the sign of $D_{KL}(r_1, r_2 \| \epsilon)$ if $E < 2\epsilon(1-\epsilon)$.
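As a sketch (illustrative names, not flopp's actual API; `eps` stands for the error-rate estimate $\epsilon$), the weighting of Equation (2) with the sign flip just described can be written as:

```rust
// Sketch of the KL-divergence edge weight (Equation 2) with the sign flip for
// concordant pairs. All names are illustrative, not flopp's actual API.

/// KL divergence between a q-coin and a p-coin.
fn kl_bernoulli(q: f64, p: f64) -> f64 {
    q * (q / p).ln() + (1.0 - q) * ((1.0 - q) / (1.0 - p)).ln()
}

/// Edge weight w(r1, r2) given the s, d counts and the error-rate estimate eps.
fn edge_weight(s: usize, d: usize, eps: f64) -> f64 {
    let n = (s + d) as f64;
    let e_obs = d as f64 / n; // E(r1, r2): observed disagreement rate
    let e_null = 2.0 * eps * (1.0 - eps); // expected disagreement within a haplotype
    // clamp away from 0 and 1 so the logarithms stay finite
    let q = e_obs.clamp(1e-9, 1.0 - 1e-9);
    let kl = n * kl_bernoulli(q, e_null);
    if e_obs < e_null { -kl } else { kl }
}
```

Pairs more discordant than the within-haplotype expectation get large positive weights; concordant pairs get negative weights after the sign flip.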
There is a nice interpretation of $w(r_1, r_2)$ when $E(r_1, r_2) > 2\epsilon(1-\epsilon)$, as given by the following theorem.

Theorem 3. Let $0 < p < a < 1$, $n \in \mathbb{N}$, and let $H = a \log\frac{a}{p} + (1-a)\log\frac{1-a}{1-p}$ be the Kullback-Leibler divergence between an a-coin and a p-coin. Then
$$\Pr(\mathrm{Binom}(n, p) \ge an) \le e^{-nH}. \quad (3)$$
In particular, this bound is asymptotically tight up to logarithms, that is,
$$\lim_{n \to \infty} \frac{\log \Pr(\mathrm{Binom}(n, p) \ge an)}{-nH} = 1.$$
The proof of the theorem is simple; it follows by an application of a Chernoff bound and Stirling's inequality. Arratia and Gordon (1989) show that this approximation is particularly good when $a \gg p$. Theorem 3 gives an interpretation of $w(r_1, r_2)$ as the negative log value of a one-sided binomial test where the null hypothesis is that $r_1, r_2$ come from the same haplotype, the number of trials is $s(r_1, r_2) + d(r_1, r_2)$, and the number of successes is $d(r_1, r_2)$.
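The bound of Equation (3) is easy to check numerically. The illustrative sketch below computes the exact binomial upper tail in log space alongside the Chernoff bound $e^{-nH}$:

```rust
// Numerical check of the tail bound Pr(Binom(n, p) >= an) <= e^{-nH}, where H
// is the KL divergence between an a-coin and a p-coin. Illustrative sketch.

/// KL divergence between an a-coin and a p-coin.
fn kl_coin(a: f64, p: f64) -> f64 {
    a * (a / p).ln() + (1.0 - a) * ((1.0 - a) / (1.0 - p)).ln()
}

/// Exact upper tail Pr(Binom(n, p) >= x), using log-factorials for stability.
fn binom_tail(n: u64, p: f64, x: u64) -> f64 {
    let mut lf = vec![0.0_f64; n as usize + 1]; // lf[i] = ln(i!)
    for i in 1..=n as usize {
        lf[i] = lf[i - 1] + (i as f64).ln();
    }
    (x..=n)
        .map(|k| {
            let lc = lf[n as usize] - lf[k as usize] - lf[(n - k) as usize];
            (lc + k as f64 * p.ln() + (n - k) as f64 * (1.0 - p).ln()).exp()
        })
        .sum()
}

/// Chernoff bound e^{-nH} on the same tail, with a = x / n.
fn chernoff_bound(n: u64, p: f64, x: u64) -> f64 {
    (-(n as f64) * kl_coin(x as f64 / n as f64, p)).exp()
}
```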
WHP (Schrinner et al., 2020) uses a log ratio of binomial probability densities as their edge weight. Mathematically, the resulting formula is almost the same as Equation 2, except they fix $E(r_1, r_2)$ to be the average error rate between reads from different haplotypes, a constant. They end up with both negative and positive edges of large magnitude, and Theorem 3 does not apply to their case. On the other hand, our edge weights are rigorously justified as negative log p-values for a binomial test, making them more interpretable.

2.3.2. MSMTP algorithm.
We devised a heuristic algorithm for local clustering inspired by the MSMTP problem. From now on, let $S \subseteq R$. The pseudocode is shown in Algorithm 1. In the diploid setting, the main idea behind this algorithm is similar to that of FastHap (Mazrouee and Wang, 2014), but the algorithm is quite different after generalizing to higher ploidies.
For the FindMaxClique method mentioned in Algorithm 1, we use a greedy clique-finding method that takes $O(k|S|\log|S| + k^2|S|)$ time. First, we take the heaviest edge between two vertices and make this our 2-clique. Then, we re-sort vertices by maximizing the minimum distance from the vertex to all vertices in the 2-clique, add a third vertex to get a 3-clique, and so forth until we get a k-clique.
Algorithm 1: Greedy min-max read partitioning
Input: Read-graph G(S), ploidy k, iterations n
Output: A partition $\{S_1, \dots, S_k\}$ of S
1: $\{v_1, \dots, v_k\} \leftarrow$ FindMaxClique(G(S), k)
2: for i = 1 to k do
3:   $S_i \leftarrow \{v_i\}$
4: end
5: for i = 1 to n do
6:   $V \leftarrow G(S) \setminus \bigcup_{i=1}^{k} S_i$
7:   Reverse-sort V by assigning to $v \in V$ the value
8:     $\min_{S_i \in \{S_1, \dots, S_k\}} \max_{r \in S_i} s(r, v) + d(r, v)$
9:   for $v \in V$ do
10:    add v to the cluster $S_j$ minimizing $\max_{r \in S_j} w(r, v)$
11:  end
12: end

The complexity of the local clustering algorithm is $O(n|S|^2 + k|S|\log|S| + k^2|S|)$. In practice, note that $|S| \gg k$. The parameter n is fixed to be 10. By iterating over n, we re-sort the vertices based on their overlaps to the new clusters, which have changed since the previous iteration. This improves the order in which we add vertices to the clusters.
The connection to MSMTP is at line 10. A priori, it is not obvious what metric to use to determine which cluster to put the vertex in. For Kruskal's algorithm, one starts by considering the heaviest edges, so we decided to minimize the maximum edge from the vertex to the cluster so that the heaviest edges are small. Intuitively, a maximum spanning tree is sensitive to a highly erroneous edge, so we prioritize minimizing the worst edge even if on average the vertex connects to the cluster well.
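The placement rule just described, minimizing the worst edge from a vertex to a cluster, can be sketched as follows (illustrative names; `w` is assumed to be a dense weight matrix over reads):

```rust
// Sketch of the min-max placement rule: put read v into the cluster whose
// heaviest edge to v is smallest. Illustrative, not flopp's implementation.

fn best_cluster(v: usize, clusters: &[Vec<usize>], w: &[Vec<f64>]) -> usize {
    clusters
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| {
            // heaviest edge from v into each candidate cluster
            let ma = a.iter().map(|&r| w[v][r]).fold(f64::NEG_INFINITY, f64::max);
            let mb = b.iter().map(|&r| w[v][r]).fold(f64::NEG_INFINITY, f64::max);
            ma.partial_cmp(&mb).unwrap()
        })
        .map(|(i, _)| i)
        .unwrap()
}
```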

2.3.3. Iterative refinement of local clusters.
We refine the output of the local clustering procedure by optimizing the UPEM score using a method similar to the Kernighan-Lin algorithm (Kernighan and Lin, 1970). Pseudocode can be found in Algorithm 2.
In lines 4-14, we check how swapping vertices between partitions affects the overall UPEM score and take a fraction of the best swaps in line 14. We then execute the swaps and check whether the UPEM score has increased. If it has not, we terminate the algorithm; otherwise, we continue until n iterations have passed. In practice, we set the parameter n = 10, and we note that the algorithm almost always terminates before 10 iterations pass. The time complexity of Algorithm 2 is $O(n|S|k \log(|S|k))$. The core of Algorithm 2 loops over each read and each alternative cluster: for each r in $S_i$ and each $S_j$ in P with $S_j \ne S_i$, we compute the change in UPEM, $\Delta(r, S_j)$, from moving r from $S_i$ to $S_j$; the best swaps are then applied and the refined partition P is returned.

2.3.4. Local phasing procedure. Note that Algorithms 1 and 2 work on subsets of reads, or subgraphs of the underlying read-graph. Let $b \in \mathbb{N}$ be a constant representing the length of a local block. We consider subsets $B_1, \dots, B_l \subseteq R$; the subsets are just all reads that overlap a shifted interval of size b, similar to the work done in Sankararaman et al. (2020). After choosing a suitable b, we run the read-partitioning and iterative refinement on all $B_1, \dots, B_l$ to generate a set of partitions $P_1, \dots, P_l$. We found that a suitable value of b is the 1/3-quantile of the read lengths. By read length, we mean the last non-$-$ position minus the first non-$-$ position of $r \in \{0, 1, 2, 3, -\}^m$.
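The windowing step can be sketched as follows, assuming for illustration that consecutive intervals are shifted by b (the exact shift in flopp may differ); each read is summarized by the half-open span of its covered variant positions:

```rust
// Sketch of local block construction: B_j collects the indices of all reads
// whose variant span overlaps the j-th interval of size b. The shift between
// consecutive intervals is assumed to be b here for simplicity.

fn local_blocks(spans: &[(usize, usize)], b: usize, m: usize) -> Vec<Vec<usize>> {
    (0..m)
        .step_by(b)
        .map(|t| {
            spans
                .iter()
                .enumerate()
                .filter(|&(_, &(start, end))| start < t + b && end > t)
                .map(|(i, _)| i)
                .collect()
        })
        .collect()
}
```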
It is important to note that, computationally, the local clustering procedure is easily parallelizable; the local clustering step therefore has a linear speedup in the number of threads.

2.4.1. UPEM outlier detection. After generating the partitions $P_1, \dots, P_l$ according to the local clustering procedure described earlier, we can identify partitions with low UPEM score and correct them. After computing the UPEM scores for every partition, we use a simple 3.0 interquartile range outlier detection method on the distribution of UPEM scores. For an outlying partition $P_i$ of the read set $B_i$, if the partition $P_{i-1}$ is not an outlier, we remove $P_i$ and extend $P_{i-1}$ to include $B_i$. To do this, we run a subroutine of Algorithm 1 where we skip the clique-finding procedure and instead treat the partition $P_{i-1}$ as the initial clusters. We then run lines 9-12 of Algorithm 1, where in line 9 we iterate over $v \in B_i \setminus B_{i-1}$ instead.
For genomes with large variations in coverage, we give the user the option to disable this procedure because variation in coverage may affect the UPEM score distribution.
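The interquartile-range rule above can be sketched as follows (illustrative; quantiles are computed by nearest-rank for simplicity, which may differ from flopp's exact convention):

```rust
// Sketch of the 3.0 x interquartile-range rule for flagging partitions whose
// UPEM score is anomalously low. Quantiles use nearest-rank for simplicity.

fn low_upem_outliers(scores: &[f64]) -> Vec<usize> {
    let mut sorted = scores.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let q = |f: f64| sorted[((sorted.len() - 1) as f64 * f).round() as usize];
    let (q1, q3) = (q(0.25), q(0.75));
    let cutoff = q1 - 3.0 * (q3 - q1);
    scores
        .iter()
        .enumerate()
        .filter(|&(_, &s)| s < cutoff)
        .map(|(i, _)| i)
        .collect()
}
```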
2.4.2. Local cluster merging. Let P represent the final partition of all reads R. We build P given $P_1, \dots, P_l$ as follows. Start with $P = P_1$. Let $S_k$ be the symmetric group on k elements, that is, the set of all permutations on k elements. At the ith step of the procedure, we choose a permutation $\sigma \in S_k$ that best matches the clusters of $P_i$ to the clusters of the running partition P, and merge each $P_i[\sigma(j)]$ into $P[j]$. We experimented with more sophisticated merging procedures, such as a beam-search approach, but did not observe a significantly better phasing on any of our datasets.
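One simple realization of this matching step brute-forces the k! permutations, which is cheap for realistic ploidies. The matching criterion here (maximize the number of reads shared between paired clusters) is assumed for illustration; flopp's actual scoring may differ:

```rust
// Sketch of cluster matching for merging: overlap[j][t] = number of reads
// shared between running cluster P[j] and local cluster P_i[t]. We brute-force
// S_k for the permutation maximizing total overlap; criterion is illustrative.

fn best_permutation(overlap: &[Vec<usize>]) -> (Vec<usize>, usize) {
    fn rec(perm: &mut Vec<usize>, i: usize, overlap: &[Vec<usize>], best: &mut (Vec<usize>, usize)) {
        let k = perm.len();
        if i == k {
            let score: usize = (0..k).map(|j| overlap[j][perm[j]]).sum();
            if best.0.is_empty() || score > best.1 {
                *best = (perm.clone(), score);
            }
            return;
        }
        // enumerate permutations by swapping each candidate into position i
        for j in i..k {
            perm.swap(i, j);
            rec(perm, i + 1, overlap, best);
            perm.swap(i, j);
        }
    }
    let k = overlap.len();
    let mut perm: Vec<usize> = (0..k).collect();
    let mut best = (Vec::new(), 0);
    rec(&mut perm, 0, overlap, &mut best);
    best
}
```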

2.4.3. Phasing output.
Once a final partition P has been found, we build the final phased haplotypes.
The output is k sequences in $\{0, 1, 2, 3, -\}^m$, where m is the number of variants. flopp can take fragment files as input, in which each line of the file describes a single read fragment in $\{0, 1, 2, 3, -\}^m$, or it can take a VCF file and a BAM file of aligned reads. In the former case, without a VCF file, we do not have genotyping information, so we simply output $\{H_1, \dots, H_k\}$ where $H_i = H(P[i])$ is the consensus haplotype. If a VCF file is present, flopp constrains the final haplotypes by the genotypes, that is, for every output variant, the number of reference and alternate alleles is the same as in the VCF file.
We constrain the haplotypes using the VCF as follows. For every variant indexed by $1 \le i \le m$, let $c(i, j, a) \in \mathbb{R}$ be a value representing the confidence of calling allele a at index i for the haplotype represented by P[j]. We produce k haplotypes according to Algorithm 3.

2.4.4. Parameter estimation. We already mentioned in previous sections how we set all parameters for our algorithms except for $\epsilon$ and the normalizing constant $\rho$ in the UPEM score. We set $\rho$ to be the median length of the reads divided by 25, which we empirically found to give a good balance between the binomial and chi-squared terms.
To estimate $\epsilon$, we start with an initial guess of $\epsilon = 0.03$. We then select 10 subsets $B_i \subseteq R$ at random and perform local clustering and refinement. We estimate $\epsilon$ from the $10 \cdot k$ resulting clusters by choosing $\epsilon$ to be the 1/10th quantile of the cluster error rates, where for each cluster $C \subseteq B_i$ the error rate is $D(C)/(S(C) + D(C))$. We pick a bottom quantile because we assume that there is some error in our method, so to get the true error rate we must underestimate the computed errors.
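The quantile step can be sketched as follows (illustrative; each cluster is summarized by its (D, S) counts, and the quantile is nearest-rank):

```rust
// Sketch of the error-rate estimate: take the 1/10th quantile of per-cluster
// error rates D(C) / (S(C) + D(C)). Clusters are given as (D, S) count pairs.

fn estimate_eps(clusters: &[(usize, usize)]) -> f64 {
    let mut errs: Vec<f64> = clusters
        .iter()
        .map(|&(d, s)| d as f64 / (d + s) as f64)
        .collect();
    errs.sort_by(|a, b| a.partial_cmp(b).unwrap());
    errs[errs.len() / 10] // nearest-rank 1/10th quantile
}
```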

Phasing metrics
There are a plethora of phasing metrics developed for diploid and polyploid phasing (Motazedi et al., 2017). We use three different metrics of accuracy, but argue that each individual metric can be misleading and that all three metrics should be used in unison before drawing conclusions on phasing accuracy.
For a global measure of goodness of phasing, we use the Hamming error rate. Given candidate haplotypes $H = \{H[1], \dots, H[k]\}$ and true haplotypes $H^* = \{H^*[1], \dots, H^*[k]\}$, define
$$HE(H, H^*) = \frac{1}{km} \min_{\sigma \in S_k} \sum_{i=1}^{k} d'(H[i], H^*[\sigma(i)]),$$
where $S_k$ is the set of permutations on k elements, m is the length of each H[i], k is the ploidy, and $d'$ is the same as the d function defined earlier except that we count the case where one haplotype has a $-$ at a coordinate as an error.
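A direct implementation of this metric (illustrative; permutations are brute-forced over $S_k$, which is cheap for realistic ploidies) looks like:

```rust
// Sketch of the Hamming error rate: minimum, over permutations of the truth
// haplotypes, of the per-allele mismatch count. d' counts a '-' (None) against
// a called allele as an error. Illustrative, not flopp's implementation.

type Hap = Vec<Option<u8>>;

fn d_prime(a: &Hap, b: &Hap) -> usize {
    a.iter().zip(b).filter(|(x, y)| x != y).count()
}

fn hamming_error_rate(h: &[Hap], truth: &[Hap]) -> f64 {
    let k = h.len();
    let m = h[0].len();
    fn rec(idx: &mut Vec<usize>, i: usize, h: &[Hap], t: &[Hap], best: &mut usize) {
        if i == idx.len() {
            let tot: usize = idx
                .iter()
                .enumerate()
                .map(|(j, &p)| d_prime(&h[j], &t[p]))
                .sum();
            *best = (*best).min(tot);
            return;
        }
        for j in i..idx.len() {
            idx.swap(i, j);
            rec(idx, i + 1, h, t, best);
            idx.swap(i, j);
        }
    }
    let mut idx: Vec<usize> = (0..k).collect();
    let mut best = usize::MAX;
    rec(&mut idx, 0, h, truth, &mut best);
    best as f64 / (k * m) as f64
}
```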
We define the switch error rate (SWER) similarly to WHP (Schrinner et al., 2020). Let $P_i \subseteq S_k$ be the set of permutations $\sigma$ such that $H[j][i] = H^*[\sigma(j)][i]$ for all $1 \le j \le k$. These are the mappings from the truth to the candidate haplotypes that preserve the alleles at position i. Then, we define the switch error rate as
$$SWER(H, H^*) = \frac{1}{m-1} \min_{\sigma_1 \in P_1, \dots, \sigma_m \in P_m} \sum_{i=1}^{m-1} \mathbf{1}_{\sigma_i \ne \sigma_{i+1}},$$
where $\mathbf{1}_{\sigma_i \ne \sigma_{i+1}} = 1$ if $\sigma_i \ne \sigma_{i+1}$ and 0 otherwise. The Hamming error, though easily interpretable, can be unstable: a single switch error can drastically alter the Hamming error rate. The SWER also has issues; for example, if two switches occur consecutively, the phasing is still relatively good, but the SWER is worse than if only one switch occurred.
We define a new error rate, called the q-block error rate.
The q-block error rate measures local goodness of assembly and interpolates between the Hamming error rate, when q is the size of the genome, and a metric similar to the switch error when q = 2.

Simulation procedure
We used the v4.04 assembly of S. tuberosum produced by the Potato Genome Sequencing Consortium (Hardigan et al., 2016) as a reference. We took the first 3.5 Mb of chromosome 1 and removed all "N"s, leaving about 3.02 Mb, and simulated bi-allelic SNPs by using the haplogenerator software (Motazedi et al., 2017), a script made specifically for realistic simulation of variants in polyploid genomes. We generated SNPs with a mean distance of 45 and 90 bp between SNPs. This is in line with the 42.5 bp average distance between variants seen in Uitdewilligen et al. (2013) for S. tuberosum; in that study, they observe that > 80% of variants are bi-allelic SNPs. It is also in line with the 58 bp mean distance between variants seen for the hexaploid sweet potato (Ipomoea batatas) in Yang et al. (2017b). The dosage probabilities provided to haplogenerator are the same parameters as used in Motazedi et al. (2017) for the tetraploid case. When simulating triploid genomes, we use the same dosage probabilities but disallow the case of three alternate alleles.
We used two different software packages for simulating reads. We used PaSS (Zhang et al., 2019) with the provided default error profiles for simulating PacBio RS reads. PaSS has higher error rates than other methods such as PBSIM (Ono et al., 2013), which tends to underestimate error (Zhang et al., 2019). We used NanoSim (Yang et al., 2017a) for simulating nanopore reads by using a default pre-trained error model based on a real human dataset provided by the software.
After generating the haplotypes and simulating the reads from the haplotypes, we obtain a truth VCF from the haplotypes. We map the reads by using minimap2 (Li, 2018) to the reference genome. The scripts for the simulation pipeline can be found at https://github.com/bluenote-1577/flopp_test.

SMTP versus MEC correlation
We ran flopp on four different simulated datasets and calculated the SMTP score with the weight function w(r 1 ‚ r 2 ) = d(r 1 ‚ r 2 ); see Equation 1, and the MEC score for each local partition before merging.
We varied the coverage between 10× and 20× for a simulated 4× ploidy genome. We also varied the length of the local partition blocks by manually changing the parameter b mentioned at the end of Section 2.3 over three different values (20, 50, and 80 SNPs) for each dataset to investigate how the size of the local clusters affects the SMTP and MEC relationship. The results are shown in Figure 4.

FIG. 4. A plot of the SMTP versus MEC score for every local partition (i.e., partition blocks before merging/linking) generated by flopp over three different coverages on 4× ploidy simulated data, as described in Section 3.2. There is a strong linear relationship between MEC and SMTP scores, and the upper bound of Theorem 2 holds surprisingly well, as seen by almost all points lying above the line y = x. SMTP, sum max tree partition; MEC, minimum error correction.

Results on simulated data set
We primarily test against H-PoPG (Xie et al., 2016), the genotype constrained version of H-PoP. Other methods such as HapTree and AltHap were tested, but we ran into issues with either computing time or poor accuracy due to the methods not being suited for long-read data. We did not test against nPhase, because the output of nPhase does not have a fixed number of haplotypes. We discuss WHP (Schrinner et al., 2020) at the end of this section.
The switch error and Hamming error rates are shown in Figure 5 for 45 bp average distance between SNPs. For 90 bp, the results are shown in Figure 6. The rest of the analysis in the section pertains to the 45 bp distance case.
For each test, we ran the entire pipeline three times; each run at high ploidies takes on the timescale of days to complete. The run times on PacBio reads for H-PoPG and flopp, as well as one instance of WHP on 3× ploidy data, are shown in Figure 7.
For the nanopore dataset simulated from NanoSim, the results for H-PoPG and flopp are very similar. The PaSS PacBio simulator outputs reads that are more erroneous, and we can see that flopp generally performs better than H-PoPG across ploidies except for the Hamming error rate when the coverage is relatively low; interestingly, the SWER is still lower in this case. flopp's SWER is consistently 1.5-2 times lower than H-PoPG for the simulated PacBio reads. On the low coverage datasets, H-PoPG's global optimization strategy leads to a better global phasing than flopp's local methodology.
Note that in these tests we phase a contig of length 3.02 Mb, whereas most genes of interest are smaller. In Figure 8, we plot the mean q-block error rates of the 5× and 6× ploidy phasings at 10× coverage; these runs have higher Hamming error rates for flopp than for H-PoPG. For blocks of up to around 850 · 45 = 38,250 bases, flopp outputs phasings with lower q-block error rates than H-PoPG despite a higher global error rate.
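Our reading of the q-block error rate can be sketched as follows (a hypothetical re-implementation, assuming each block of q SNPs is scored with its own best haplotype matching; the paper's evaluation scripts may differ in details):

```python
from itertools import permutations

def q_block_error_rate(pred, truth, q):
    """Split the phasing into consecutive blocks of q SNPs and compute the
    Hamming error rate within each block independently, allowing a
    different haplotype matching per block, so that a single switch does
    not penalize every downstream site."""
    n = len(truth[0])
    rates = []
    for start in range(0, n, q):
        block_pred = [h[start:start + q] for h in pred]
        block_true = [h[start:start + q] for h in truth]
        size = sum(len(h) for h in block_true)
        best = min(
            sum(p != t for ph, th in zip(perm, block_true)
                for p, t in zip(ph, th))
            for perm in permutations(block_pred)
        )
        rates.append(best / size)
    return sum(rates) / len(rates)

# A single switch at position 2 ruins the global matching, yet every
# length-2 block is still perfectly phased locally.
truth = ([0, 0, 0, 0], [1, 1, 1, 1])
pred = ([0, 0, 1, 1], [1, 1, 0, 0])
print(q_block_error_rate(pred, truth, q=2))  # -> 0.0
print(q_block_error_rate(pred, truth, q=4))  # -> 0.5
```

This illustrates why a phasing with a high global Hamming error rate can still have low q-block error rates for small q: within short blocks, the haplotypes are correct up to relabeling.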
FIG. 5. The mean SWER and Hamming error rate from testing on simulated data sets as described in Section 3.2 over a range of ploidies and coverages on two different types of read simulators, with SNPs 45 bp apart on average. The error bars represent the lowest and highest values for the metric over three iterations. The results on the nanopore simulated reads from NanoSim (Yang et al., 2017a) are similar for both methods. flopp achieves much better SWERs on the PacBio simulated reads from PaSS (Zhang et al., 2019), although for low coverage flopp sometimes has a higher Hamming error rate. SNPs, single nucleotide polymorphisms; SWER, switch error rate.

SHAW AND YU
Although flopp may sometimes give worse global phasings than H-PoPG, flopp can give extremely accurate local phasings. Computationally, Figure 7 shows that flopp is at least 3 times faster than H-PoPG on a single thread, and up to 30 times faster at 6× ploidy. The local clustering step sees a linear speedup from multi-threading; for most datasets, we get a 3-4 times speedup with 10 threads.
We tried testing against WHP (Schrinner et al., 2020), but we found that its accuracy was relatively poor across our datasets and that it took a long time to run; see Figure 7. Using the default block-cut sensitivity value, the N50 did not exceed 20 for the 67,000 variants on the simulated contig. We found that WHP takes a conservative approach, as >15% of the variants were not called by WHP, contributing to a high Hamming error rate. However, we noticed that on real data (discussed below), the runtime of WHP was much more reasonable at 11,492 seconds with 10 threads despite a much larger dataset, suggesting that on certain datasets its performance may be more reasonable.

FIG. 6. The mean SWER and Hamming error rate from testing on simulated datasets as described in Section 3.2. The same experiment as shown in Figure 5, except with 90 bp between SNPs on average.
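For reference, the N50 statistic quoted above (measured here in number of variants per phase block) can be computed with the following generic sketch; this is our own illustration, not WHP's or flopp's code:

```python
def n50(block_lengths):
    """N50 of phase-block lengths: the largest length L such that blocks
    of length >= L together cover at least half of the total length."""
    total = sum(block_lengths)
    covered = 0
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if 2 * covered >= total:
            return length
    return 0

# Hypothetical block lengths, in number of variants per block.
print(n50([80, 70, 50, 40, 30, 20, 10]))  # -> 70
```

An N50 of 20 over 67,000 variants therefore means that half of all phased variants lie in blocks of at most 20 variants, i.e., the phasing is highly fragmented.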

Phasing tetraploid potato using real nanopore data
We ran flopp on real nanopore data sequenced from an autotetraploid potato (S. tuberosum) specimen, generated by Schrinner et al. (2020). We used the DM v6.1 S. tuberosum assembly from Pham et al. (2020) as our reference, and we called SNPs using freebayes (Garrison and Marth, 2012) from short reads obtained from the same specimen. We aligned all reads with minimap2. We ran flopp on chromosome 2 of the assembly, which is 46 Mb long and has an SNP heterozygosity of 2.0%, that is, a 50 bp average distance between SNPs.
flopp took only 1050 seconds to complete on 10 threads, whereas WHP took 11,492 seconds on the same dataset and H-PoPG took 8321 seconds. We show flopp's phasings of two genes in Figures 9 and 10: the first is an example of a successful phasing, and the second illustrates the difficulties of phasing collapsed haplotypes.
In Figure 9, the clustered structural variants between haplotypes show that our phasings are quite good. Note that haplotypes 1 and 4 differed at only 13 out of 180 alleles; thus, we are able to separate haplotypes that look relatively similar. However, in Figure 10, three of the haplotypes look perfectly phased, whereas haplotype 3 has very low coverage and mostly consists of noisy supplementary alignments. It appears that the UPEM uniformity constraint was not strong enough to deter collapsing in this case because the reads in haplotype 3 were so noisy. We found that haplotype 2 had twice the coverage of haplotypes 1 and 4, suggesting that haplotype 3 should look like a copy of haplotype 2. Although collapsing is an issue for flopp, we have demonstrated that we can still obtain very sensible local haplotypes, even in the case of collapse. We believe that clever post-processing of flopp's output based on coverage can recover correct haplotypes in many instances of collapsing; we envision implementing such a post-processing step in future versions of flopp.

FIG. 8. The mean q-block error rates for the 5× and 6× ploidy datasets at 10× coverage per haplotype over three iterations. Each variant is spaced on average 45 bp apart, so each block spans approximately q · 45 bases. The upward trend indicates an inaccurate global phasing, but the q-block error rates for flopp are lower than those for H-PoPG in this regime.

CONCLUSION
In this article, we presented two new formulations of polyploid phasing: the MSMTP (min-sum max tree partition) problem and the UPEM (uniform probabilistic error minimization) model. The SMTP score is a flexible graphical interpretation of haplotype phasing that is related to the MEC score under a specific weighting of the read-graph, whereas the UPEM score is a superior version of the MEC score when uniformity assumptions are satisfied. Using our probabilistic formulation, we give a new notion of distance between read fragments based on the Kullback-Leibler divergence, which has a rigorous interpretation as a log p-value of a one-sided binomial test.
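The binomial-test interpretation can be checked numerically: by the Chernoff bound, log P(X >= d) <= -n * KL(d/n || p0) for X ~ Binomial(n, p0), so the KL-based quantity bounds the log p-value of the one-sided test. A minimal sketch with illustrative values (not flopp's internals):

```python
import math

def kl_bernoulli(q, p):
    """KL divergence KL(Bernoulli(q) || Bernoulli(p))."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(q, p) + term(1.0 - q, 1.0 - p)

def log_binom_tail(n, d, p):
    """log P(X >= d) for X ~ Binomial(n, p), by direct summation."""
    tail = sum(math.comb(n, k) * p**k * (1.0 - p)**(n - k)
               for k in range(d, n + 1))
    return math.log(tail)

# Two reads overlap at n SNP sites and disagree at d of them.  Under the
# null hypothesis that both reads come from the same haplotype,
# disagreements occur independently with probability p0 (a function of
# the per-base error rate).  The values below are illustrative only.
n, d, p0 = 100, 20, 0.05
exact = log_binom_tail(n, d, p0)          # exact log p-value of the test
chernoff = -n * kl_bernoulli(d / n, p0)   # Chernoff bound: exact <= chernoff
print(exact, chernoff)
```

A large KL-based distance (very negative log p-value) indicates that the observed disagreements are unlikely under the same-haplotype hypothesis, so the two fragments should be placed in different clusters.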
We implemented a fast, local phasing procedure based on these new formulations and showed that our software, flopp, is faster and more accurate on high-coverage data, while consistently producing extremely accurate local phasings across a range of error profiles, coverages, and ploidies.

AUTHOR DISCLOSURE STATEMENT
The authors declare they have no conflicting financial interests.

FUNDING INFORMATION
This work was supported by a faculty startup fund from the University of Toronto at Scarborough Department of Computer and Mathematical Sciences.

FIG. 10. The phasings for haplotypes 1, 2, and 4 look reasonable, whereas haplotype 3 has very low coverage, and most of its reads appear to be noisier supplementary alignments. On further inspection, haplotype 2 has twice the coverage of haplotypes 1 and 4, indicating haplotype collapsing and suggesting that haplotype 3 should look similar to haplotype 2.