On the transformation of MinHash-based uncorrected distances into proper evolutionary distances for phylogenetic inference

Recently developed MinHash-based techniques were proven successful in quickly estimating the level of similarity between large nucleotide sequences. This article discusses their usage and limitations in practice to approximating uncorrected distances between genomes, and transforming these pairwise dissimilarities into proper evolutionary distances. It is notably shown that complex distance measures can be easily approximated using simple transformation formulae based on few parameters. MinHash-based techniques can therefore be very useful for implementing fast yet accurate alignment-free phylogenetic reconstruction procedures from large sets of genomes. This last point of view is assessed with a simulation study using a dedicated bioinformatics tool.


Introduction
To estimate the level of proximity between two non-aligned genome sequences x and y, recent methods (e.g. 1-7) have focused on decomposing the two genomes into their respective sets K x and K y of non-duplicated nucleotide k-mers (i.e. oligonucleotides of size k). A pairwise similarity may then be easily estimated based on the Jaccard index j = |K x ∩ K y | / |K x ∪ K y | 8 . The Jaccard index between two sets of k-mers is a useful measure for two main reasons. First, it can be quickly approximated using MinHash-based techniques (MH 9 ), as implemented in e.g. Mash 2 , sourmash 3 , Dashing 4 , Kmer-db 6 , FastANI 5 , or BinDash 7 . Such techniques select a small subset (of size σ) of hashed and sorted k-mers (called sketch) from each K x and K y , and approximate j by comparing these two subsets (for more details, see 2, [9][10][11][12]. Second, the proportion p of observed differences between the two aligned genomes (often called uncorrected distance or p-distance) can be approximated from j (therefore without alignment) with the following formula (e.g. 13,14): provided that both sizes σ and k are large enough, and j is not too low (see below).
As a consequence, a pairwise evolutionary distance d can be derived from the Jaccard index j using transformation formulae of the following form: where p is obtained using Equation (1). Parameters b 1 and b 2 can be defined according to explicit models to estimate the number d of nucleotide substitutions per character that have occurred during the evolution of the sequences x and y, e.g. 15-24. When b 1 = b 2 = 1, Equation (2) corresponds to the Poisson correction (PC; e.g. 21) distance. Although it is based on a simplistic model of nucleotide substitution 1,16,25,26 , PC is the p-distance transformation implemented in many MH tools (e.g. Mash, Dashing, FastANI, Kmer-db, BinDash). However, more accurate distance estimates may be obtained by using substitution models based on more parameters. Among these models, equal-input (EI, sometimes called F81 18,19,24,27-29 ) takes into account the equilibrium frequency π r of each residue r in Σ = {A, C, G, T}. An EI distance can be estimated using Equation (2)  where π r x and π r y are the frequencies of r in the two sequences x and y, respectively 20 . Further assuming that the heterogeneous replacement rates among nucleotide pairs and sites can be modelled with a Γ distribution, an EI distance d can be derived from p using the following formula: where a > 0 is the (unknown) shape parameter of the Γ distribution, e.g. 22,24,30-33. It is worth noticing that when a is high, Equation (2) and Equation (3) yield very similar distance estimates (for any fixed b 1 and b 2 ).
The aim of this study is to assess the accuracy of Equation (2) and Equation (3)

Results and discussion
MinHash-based p-distance approximation Varying d from 0.05 to 1.00 (step = 0.05), a total of 200 nucleotide sequence pairs with d substitution events per character were simulated under the models GTR and GTR+Γ. Each model was adjusted with three different equilibrium frequencies: equal frequencies (f 1 ; π A = π C = π G = π T = 25%), GC-rich (f 2 ; π A = 10%, π C = 30%, π G = 40%, π T = 20%), and AT-rich (f 3 ; π A = π T = 40%, π C = π G = 10%). The GTR substitution rates and the Γ shape parameters were obtained based on a maximum likelihood (ML) analysis of 142 real-case phylogenomics datasets. Overall, ML estimates of Γ shape parameters were quite low (i.e. varying from 0.162 to 0.422, with an average of 0.314), confirming that the heterogeneity of the substitution rates across sites is a non-negligible factor when studying evolutionary processes. Every simulation was completed with indel events, resulting in sequences > 3 Mbs with relative lengths (i.e. longer/shorter) varying from 1.0196 (d = 0.05) to 1.1117 (d = 1.00), on average.
For each of the 2 (GTR, GTR+Γ) × 3 (f 1 , f 2 , f 3 ) × 20 (d = 0.05, 0.10, ..., 1.00) × 200 = 24,000 simulated sequence pairs x and y, the corresponding p-distance was estimated using three MH tools: Mash, BinDash and Dashing. Of note, the accuracy of a MH estimate ĵ of the Jaccard index between K x and K y is mainly dependent on two parameters: the k-mer size k and the sketch size σ. The size k should be large enough to minimize the probability q of observing a random k-mer shared by x and y by chance alone. Such a value can be obtained from q by k = ⌈log |Σ| (g(1−q)/q) − 0.5⌉, where g is the length of the largest sequence 2,35,29 . The size σ should be large enough to minimize the error bounds of ĵ 2 , but also to avoid the inconvenient estimate ĵ = 0. Following 29, σ was set by the proportion s of the average sequence length.
Two statistics were calculated to assess the linear relationship between the MH estimate p (derived from ĵ ≠ 0) and the 'true' p-distance p: the coefficient of determination R 2 and the slope β of the linear least-square regression p = βp. Let Ψ(p max ) be the subset of pairs (p,p) such that p ≤ p max . Varying p max from 0.10 to 0.55, R 2 and β were estimated from Ψ(p max ) ( Figure 1). The cumulative proportions ˆ0 j f = of MH Jaccard index ĵ = 0 within [0, p max ] were also measured ( Figure 1). Finally, every value p r>0.99 was estimated, where p r>0.99 is defined as the highest p-distance such that the subset Ψ(p r>0.99 ) provides a coefficient of correlation r > 0.99 (as assessed by a Fisher transformation z-test with p-value < 1%; Figure 1). The highest values p r>0.99 were obtained with parameters k = 26 (q ≤ 10 −9 ) and s = 0.8 (illustrated in Figure 2).
One important result ( Figure 1) is that current MH implementations return suitable estimates of p as long as p ≤ 0.25, provided that k is sufficiently large. Indeed, when k ≥ 21 (and any s ≥ 0.2), the statistics p r>0.99 are higher than 0.25 (Figure 1), therefore showing that p and p are highly linearly correlated when p ≤ 0.25 (see e.g. Figure 2). Interestingly, when p ≤ 0.25, the worthless estimate ĵ = 0 was almost never observed with the different selected parameters s and q (Figure 1).  Furthermore, when p > 0.25, large k-mers are required to obtain satisfactory estimates, i.e. k > 21 or q < 10 −6 ( Figure 1). However, dealing with k > 21 involves using large sketch sizes to minimize the cases ĵ = 0 (see ˆ0 j f = in Figure 1). Simulation results suggest that k = 26 (i.e. q = 10 −9 ) and s > 0.4 yield suitable estimates of p, obtained from sequences of lengths > 4 Mbs with pairwise p < 0.35 (see Figure 1 and Figure 2). Indeed, when p ranges between 0.25 and 0.35, small sizes k (e.g. k ≤ 21 or q ≥ 10 −6 ) always provide underestimated p (with any s). Large size k (e.g. k = 31 or q = 10 −12 ) results in the same trend, but also in high numbers of useless estimates ĵ = 0 (even with large σ; see ˆ0 j f = in Figure 1).
When p ≥ 0.35, MH tools always underestimate the p-distances between the sequences simulated for this study ( Figure 1 and Figure 2). One could suggest that more accurate MH estimates p will be expected with larger sketch sizes σ. Nevertheless, results represented in Figure 2 (i.e. q = 10 −9 and s = 0.8, providing the highest p r>0.99 ) are based on average σ varying from ∼2.7 × 10 6 (d = 1.00) to ∼3.6 × 10 6 (d = 0.35), which are larger than some real genomes.

Transformation of p-distances into evolutionary distances
When a pairwise p-distance p can be estimated from unaligned nucleotide sequences, it may be transformed into an evolutionary distance d, based on Equation (2) or Equation (3). The relationship between p and d was represented in Figure 3 for different distance estimators: PC transformation (2) (b 1 = b 2 = 1), and EI transformations (2) and (3) with equilibrium frequencies f 1 (b 1 = b 2 = 0.75 under homogeneous substitution PC and EI p-distance transformations (2) result in improper underestimates as the expected distance d increases. Indeed, when compared with realistic GTR-based distances d, PC and EI transformations (2) give distance estimates that are always lower than d, especially under GTR+Γ and when d is large (e.g. d > 0.1; Figure 3). This downward bias is somewhat expected, knowing that PC and EI transformations (2) are based on less parameters than both models GTR and GTR+Γ. However, the additional parameter a in Equation (3) may help dealing with heterogeneous substitution rates among residue pairs (e.g. 36). Hence, the relationship between GTR distances d and the corresponding p-distances p can be approximated by the EI transformation (3) with a = 4.590 ( Figure 3). Moreover, as d returned by Equation (3) is inversely proportional to a (for any fixed p), the relationship between d and p under the model GTR+Γ (with Γ shape parameter of 0.314, on average) can also be approximated by the EI transformation (3) with a = 0.291 ( Figure 3).
These results show that complex distance measures can be approximated by simple analytical formulae based on few parameters. In practice, nucleotide frequencies (four parameters) can be trivially computed and p-distances (a fifth parameter) can be estimated using MH tools (see above). Therefore, the evolutionary distance d between two sequences that have evolved under the parameter-rich model GTR+Γ can be approximated from these only five parameters using (3) with a ≤ 4.590 (Figure 3).
At this point, it should be stressed that MH p tends to be overestimated. Indeed, MH estimates are of the form p ≈ βp with slope β varying from 1.08 (BinDash, k = 31, s = 0.2) to 1.15 (Dashing, k = 26, s = 0.2) when p ≤ p r>0.99 (Figure 1). This has a direct impact on the derived distances: using PC and EI transformations (2)  The script JolyTree v2.0 was used to reconstruct phylogenetic trees from the simulated sequences. For each pair of unaligned sequences, this script estimates the MH p-distance using Mash, and transforms it into an evolutionary distance. Using these MH-based distances, JolyTree next reconstructs a minimum evolution phylogenetic tree with confidence supports at branches, based on a ratchet-based hill-climbing procedure (for more details, see 29). To obtain accurate MH p-distance estimates, JolyTree was run with parameters q = 10 −9 and s = 0.5 (see above). Evolutionary distances were estimated using the PC and EI transformations (2), as well as the EI transformation (3). To observe the impact of the parameter a, the EI transformation (3) was computed with a varying from 0.05 to 10.0. The accuracy of each p-distance transformation for phylogenetic inference was assessed by the percentage of recovered reference trees, i.e. identical topologies ( Figure 5).
Using JolyTree with EI transformations improves the percentage of recovered reference trees ( Figure 5). In spite of their limitations, PC distances result in the recovery of 75.3% of the 142 reference trees, but EI transformation (2) increases this percentage to 76.7% ( Figure 5). Furthermore, the EI transformation (3) generally provides better results in a large range of a, i.e. up to 83.1% of recovered reference trees ( Figure 5). Low a-values (e.g. a ≤ 0.3) translate into many incorrect tree topologies, whereas high ones (e.g. a > 6) tend to provide the same reference tree recovering percentage as the EI transformation (2) ( Figure 5). Most suitable values of a (corresponding to the highest reference tree recovering percentages, e.g. 80%) seem to range in the interval [1.0, 2.0] ( Figure 5).
These simulation results are consistent with two views which are somehow contradictory. On the one hand, accurate (parameter-rich) distance estimates are required, because biased ones (i.e. corresponding to a concave or convex function of the actual evolutionary distances) may result in incorrect phylogenetic trees 23,37 . On the other hand, simple (underparameterized) distance estimates should often be preferred, because they frequently result in more accurate tree topologies 21,38-42 .
Here, the simple PC and EI transformations (2) (one and five parameters, respectively) enable many reference trees to be recovered ( Figure 5). However, the EI transformation (3) is able to approximate realistic distance measures (e.g. GTR+Γ) by using only one supplementary parameter a (Figure 3). It therefore enables more reference trees to be recovered ( Figure 5).
In line with 43, most suitable values of a (e.g. between 1.0 and 2.0) are all higher than the Γ shape parameter values used for simulating the sequence datasets (i.e. varying from 0.162 provide distance estimates that are quite comparable to the ones returned by Equation (3) with a ranging from 1.000 to 4.590 (see Figure 4 for the equilibrium frequencies f 2 ; similar results were observed with f 1 and f 3 -not shown). The PC transformation (2) on the upward biased MH p returns distances that are then comparable to some complex distance measures (e.g. derived from a GTR model), therefore justifying its use by many MH tools. Nevertheless, the EI transformation (3) remains necessary when dealing with distantly related sequences (e.g. p > 0.2) and strong heterogeneity of the substitution rate across sites (e.g. often observed Γ shape parameter < 1.000). In such cases, the value of the parameter a should always be slightly increased to compensate the MH upward bias. For instance, EI transformation (3) on p with a = 0.291 (i.e. GTR+Γ distance least-square fitting in Figure 3) can be approximated by the same equation on p = βp with β = 1.15 and a = 0.431.

Phylogenetic reconstruction from MinHash-based evolutionary distances
To assess whether MH p-distance transformations may translate into reliable phylogenetic trees, additional simulations were performed. A total of 142 sets of sequences was simulated under the model GTR+Γ along reference phylogenetic trees. Representative GTR+Γ model parameters (same as above) and to 0.422, with an average of 0.314). This can be explained by the MH upward bias (see above), but also by the large variance of the estimate (3) when a becomes low. Reminding that the Γ shape parameters (and the reference trees) used in these simulations were inferred from real-case datasets, these results suggest that using EI transformation (3) with a ≈ 1.5 may be suitable to infer genus phylogenetic trees. In light of this, it should be stressed that the article 29 describing the first distributed version of JolyTree (v1.1) incorrectly stated that the Mash output is the MH estimate p (instead of its PC transformation). As JolyTree v1.1 uses the EI transformation (2), this misinterpretation translates into the odd transformation formula δ = −b 1 log e (1 + log e (1 − p) / b 2 ). However, as δ can be approximated on p ≤ 0.35 by the EI transformation (3) with a = 1.208, this explains the overall accuracy of JolyTree v1.1 despite its use of δ 29 .

Conclusions
Alignment-free phylogenetic inference from pairwise MH-based distance estimates is a promising approach. It enables phylogenetic trees to be quickly reconstructed from a large number of genomes without the burden of multiple sequence alignments (see e.g. 2,29,[44][45][46][47][48][49][50][51][52] Third, as proper evolutionary distances can be derived from MH p-distance estimates, their efficiency in phylogenetic inference was established using the dedicated tool JolyTree 29 . In particular, the EI transformation (3) with a ≈ 1.5 enables accurate phylogenetic trees to be inferred.

Model parameter estimation
To simulate the evolution of nucleotide sequences according to realistic substitution processes, the 187 genus datasets compiled in 29 (available at https://doi.org/10.3897/rio.5.e36178. suppl2) were first considered to infer a representative range of GTR parameter values. For each of the 187 genera, the associated genome assemblies were processed using Gklust v0.1 to obtain one representative genome assembly for each putative species. This analysis provided 142 sets of representative genome assemblies after discarding genera containing < 10 putative species. For each of these 142 sets, coding sequences were clustered using Roary v3.12 55 . Each cluster with at least four coding sequences was used to build a multiple amino acid sequence alignment using MAFFT v7.407 56 . Multiple sequence alignments were back-translated at the codon level and concatenated, leading to 142 supermatrices of nucleotide characters. A phylogenetic tree was inferred from each supermatrix of characters using IQ-TREE v1.6.7.2 57 with evolutionary model GTR+Γ. All data related to these analyses are publicly available as Extended data at https://doi.org/10.5281/zenodo.4034244 58 .

Sequence simulation
To assess the accuracy of different pairwise distance estimates, a simulation of sequence pairs was performed under both models GTR and GTR+Γ with three different sets (π A , π C , π G , π T ) of equilibrium frequencies: was used to simulate the evolution of 200 sequence pairs with d substitution events per character. Initial sequence length was 5 Mbs, and an indel rate of 0.01 was set with indel length drawn from [1,50000] according to a Zipf distribution with parameter 1.5. For each simulated sequence pair, model parameters (i.e. GTR: six relative rates of nucleotide substitution; GTR+Γ: six rates and one Γ shape parameter) were randomly drawn from the 142 sets of estimated ones (see above). All simulated sequences are publicly available as Extended data at https://doi.org/10.5281/zenodo.4034461 60 .
To compare the efficiency of p-distance transformations for phylogenetic reconstruction, the program INDELible v1.03 was also used to simulate the evolution of a sequence along each of the 142 phylogenetic trees previously inferred from different genera (see above). For each of the 142 genera, sequence evolution was simulated under the model GTR+Γ with the corresponding parameters (i.e. four nucleotide frequencies, six relative rates, and one Γ shape parameter). Sequence length and indel events were simulated as described above. First, what is the justification to the choice of a value equal to 1.5 for the shape of the Zipf distribution used to simulate the indels. As a user of INDELible myself I know very well how this value can be of importance so I would like to know how this value was chosen (arbitrary choice?). Second, I am not sure that the right reference for the computation of missing distances from the triangle inequality property is Guénoche and Grancolas (1999). The first paper I have seen on that topic is Lapointe and Kirsch (1995 1 ). Also, I have no concerns on the results presented and everything seems ok for me.
The conclusion of the manuscript presents the approach of phylogenetic inference from pairwise MH-based distance estimates as a possible alternative to classical methods of phylogenetic reconstruction (especially in bacteria). Indeed, a classical phylogenomic approach involves the chaining of a complex set of procedures: the identification of orthologous genes in multiple species; the alignment of these genes; and finally, the building of a tree from the concatenation of those alignments. I think that the inclusion in the paper of a comparison of two phylogenies obtained with: i) the MH-based distance and ii) a concatenation would strengthen this claim. A lot of concatenation datasets extracted from complete bacterial genomes are available and it would be not too difficult to do the experiment. I think this addition is of importance because I have no idea on the influence horizontal gene transfers can have on MH-based reconstructions when classical approaches try to minimize this influence.
In recent papers, MinHash techniques were proposed as an attractive way of estimating the pdistance between DNA sequences, i.e. the number of mismatches per position in an (unknown) alignment of the compared sequences.
The present paper shows how distances based on complex models of evolution (GTR, GTR + \Gamma) can be approximated using MinHash-based p-distances. The approach is carefully evaluated based on simulated sequences.
The proposed method is a welcome and useful addition to existing alignment-free methods, extending them to more realistic models of evolution. The approach is novel, the paper is very well written and clear, and suitable references are given to the literature for more details. Therefore, I support indexing of the manuscript.
A certain limitation is that the manuscript is restricted to MinHash distances. According to the author, these methods are accurate for p-distances roughly < 0.25 (for reasonable k-mer length), but are less accurate for larger distances. However, a number of other alignment-free methods have been proposed that accurately estimate distances for much larger distances, e.g. Kr . It should be straight-forward to transform these distances to more complex models, as done in the present paper, and to compare them to the MinHash methods evaluated in the paper. of pure k-mer counting a la Fan et al., rather than the MinHash approaches, which trade some accuracy of the Jaccard estimation for speed. This is still useful, but could be somewhat misleading for those that need the speed benefits of running MinHash with more typical parameters. Additionally, since the sweep of s is linear, but MinHash error bounds relate to an exponential of s, the sweep ends up being less informative than it could be, which is likely why the columns of Figure 1 are nearly identical.
As a simple remedy, I would suggest sweeping the parameter s exponentially rather than linearly, starting closer to the default sketch size for the tools. It could still end closer to the genome size to get a sense of the behavior of MinHash as it degenerates to the actual k-mer Jaccard score. Along these lines, it would also make sense for the later phylogenetic experiments to either use a smaller s or do another sweep of values.
As a minor point when listing relevant software in the introduction: Dashing is based on HyperLogLog sketching, which is similar in concept to MinHash but algorithmically distinct.

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes Are the conclusions drawn adequately supported by the results? Yes