Convergence of maximum likelihood supertree reconstruction

Supertree methods are tree reconstruction techniques that combine several smaller gene trees (possibly on different sets of species) to build a larger species tree. The question of interest is whether the reconstructed supertree converges to the true species tree as the number of gene trees increases (that is, the consistency of supertree methods). In this paper, we are particularly interested in the convergence rate of the maximum likelihood supertree. Previous studies on the maximum likelihood supertree approach often formulate the question of interest as a discrete problem and focus on reconstructing the correct topology of the species tree. Aiming to reconstruct both the topology and the branch lengths of the species tree, we propose an analytic approach for analyzing the convergence of the maximum likelihood supertree method. Specifically, we consider each tree as one point of a metric space and prove that the distance between the maximum likelihood supertree and the species tree converges to zero at a polynomial rate under some mild conditions. We further verify these conditions for the popular exponential error model of gene trees.


Introduction
High-throughput sequencing is making large collections of sequences available to researchers at a low cost. These genomic data represent a broad spectrum of life and motivates studies of the problem of reconstructing large phylogenetic trees using statistical methods. Those data, however, also come from different sources, cover different genomic regions on which the evolutionary processes happen very differently, and may not be collected on the same set of species. Thus, combining trees on different, overlapping sets of species into a "supertree" has become a popular approach for reconstructing large species trees. Over the years, several methods of supertree reconstruction have been developed (Cotton and Wilkinson, 2007), and analyses of supertrees continue to play more and more important roles in the search for answers of many fundamental evolutionary questions.
While supertree methods have been of great interest in phylogenetics, little is known about their theoretical properties, especially in non-asymptotic settings. In cases when the individual trees are gene trees and the set of taxa are the same across the input trees, several statistical consistent methods have been derived (Mossel and Roch, 2008;Heled and Drummond, 2009;Kubatko et al., 2009;Larget et al., 2010;Liu et al., 2010;Liu and Yu, 2011;Bryant et al., 2012;Chifman and Kubatko, 2014;Vachaspati and Warnow, 2015). However, most proofs of statistical consistency have been analyzed under the condition that each of the gene trees can be estimated accurately, and do not provide a guarantee of robustness to gene tree estimation errors (Roch and Warnow, 2015). This is a critical concern because when the species tree is constructed using sequences taken from large genomic regions, those regions have a high chance of involving some recombination, which violates one of the main assumptions of the multi-species coalescent model. On the other hand, limiting analyses to short regions increase gene tree estimation error (GTEE) and various summary methods had impaired accuracy when the error was high (Gatesy and Springer, 2014;. It is later proved in Roch et al. (2019) that when the sequence length of each locus is bounded and the gene tree cannot be estimated reliably, most summary methods that estimate the species tree by combining gene trees are not statistically consistent.
Because some coalescent-based summary methods sometimes produce less accurate estimates than concatenation (Bayzid and Warnow, 2013;Patel et al., 2013;, seemingly as a result of GTEE, the question of whether provable guarantees can be established in the presence of GTEE (for both species trees and supertree) naturally arises. In the context of species tree estimation, Roch and Warnow (2015) establish statistical consistency of the Rooted Triplet Consensus method and the Maximum Pseudo-likelihood for Estimating Species Trees method (Liu et al., 2010) under GTEE and provide bounds on the sampling complexity for these methods to construct the correct species tree with high probability.
The maximum likelihood (ML) supertree method is proposed by Steel and Rodrigo (2008) based on a probability model that permits "errors" in gene tree topologies and allows the species tree to be estimated even if there is topological conflict amongst gene trees. Steel and Rodrigo (2008) shows that ML estimate of the species tree is topologically consistent under fairly general conditions and also shows that the method of Matrix Representation with Parsimony (Baum, 1992) may be inconsistent under these same conditions. However, while the ML estimate is topologically consistent, no results on the convergence rate of the estimator are obtained. In this paper, we propose a new analytic approach to study the convergence of the ML supertree method. Based on embedding trees into a metric space, we establish the conditions for which the convergence rate of the ML supertree can be obtained. We verify these conditions for the popular exponential error model of gene trees, thus obtain a polynomial convergence rate for the ML supertree to the true species tree under this model.

Mathematical framework
In this paper, the term phylogenetic tree refers to a tree T with leaves labeled by a set of species.
Each branch of T is associated with a non-negative branch length. A tree is said to be resolved if it is bifurcating and all branch lengths are positive. Given an unrooted phylogenetic tree T on a finite set L of species, any subset L ′ of L induces a phylogenetic tree on L ′ , denoted T| L ′ , which is the subtree of T that connects the species in L ′ only.
We regard the tree space as a metric space (T , d) where T is the set of all phylogenetic trees with branch lengths bounded from above by a positive constant g 0 and d is a continuous metric.
For simplicity of presentation, we will assume that d is the BHV distance on the set of trees with n species (Billera et al., 2001), but note that the analysis of the paper can be extended to any continuous and locally-Euclidean distance, including the branch-score distance (Kuhner and Felsenstein, 1994) and the AGPS distance (Amenta et al., 2007).
To describe trees that are "near" to each other, the BHV distance use the class of nearest neighbor interchange (NNI) moves (Robinson, 1971). An NNI move is defined as a transformation that collapses an interior branch to zero and then expands the resulting degree 4 vertex into a branch in a different way. The BHV space models the set of trees T on n species as a cubical complex consisting of a collection of orthants, each isomorphic to R 2n−3 ≥0 . Each orthant of T corresponds uniquely to a tree topology, and the coordinates in each orthant parameterize the branch lengths for the corresponding tree. The adjacent orthants of the complex with the same dimension correspond to NNI-adjacent trees.
The BHV space is equipped with a natural metric distance: the shortest path lying in the BHV space between the points. If two points lie in the same orthant, this distance is the usual Euclidean distance. If two points are in different orthants, they can be joined by a sequence of straight segments, with each segment lying in a single orthant. We can then measure the length of the path by adding up the lengths of the segments. The distance between the two trees T 1 and T 2 on the BHV space is defined as the minimum of the lengths of such segmented paths joining the two points.
Throughout the paper, we assume that the evolution of the species has not involved reticulate processes, and there exists an underlying "true species tree" in T , denoted by T * . Furthermore, T * is a resolved tree with leaf set L * . In a supertree reconstruction problem, we observe a sequence of gene trees (T 1 , T 2 , . . . , T k ), where T i has leaf set L i , and wish to combine these trees into a phylogenetic treeT k on the union of the leaf setŝ In this paper, we consider a tree-generating probability model (P T ) T∈T such that for each set of species L ⊂ L * , P L T (·) is a distribution of trees forming by species in L . We assume that the respectively. In other words, the joint probability of the observed gene trees is Remark 2.1. We note that the tree-generating probability model P T * may not necessarily be nested. That is, given two nested sets of leaves L 1 ⊂ L 2 ⊂ L * , although the probability P L 2 T * (which is used to generate trees with leaf set L 2 ) also induces a natural distribution on the set of trees with leaf set L 1 , this probability P L 2 T * | L 1 may not be the same as P L 1 T * .
Given the observed gene trees (T 1 , T 2 , . . . , T k ), the ML supertree is defined aŝ where ℓ k (T) denotes the log-likelihood function To enable theoretical analyses of the ML supertree method, we make the following assumptions.
Assumption 2.1 (Weak covering property). The sequence of subsets of L * satisfies the weak covering property: there exists c 1 > 0, γ > 1/2, and K > 0 such that for each subset L of taxa Here, |A| denotes the number of elements in the set A.
We note that the weak covering property ensures that as the number of gene trees increases, all quartets (subtrees with 4 leaves) of T * are visited with enough frequency to enable a reliable estimate for each of the quartets. Assumption 2.1 is a direct generalization of the covering property introduced in Steel and Rodrigo (2008), which can be obtained from Assumption 2.1 by setting γ = 1.
Assumption 2.2 guarantees that it is at least possible to reconstruct the restriction T * | L of T * to the subset L with a complete knowledge of P L T * . This assumption is similar to, but distinct from, the condition of basic centrality introduced by Steel and Rodrigo (2008), which requires that for all subsets of L ⊂ L * of size 4, for all trees T ′ on leaf set L that are different from T * | L , and where η > 0: • On one hand, the basic centrality condition implies that if d(T * | L , T| L ) > 0, the modes of the distributions P L T * and P L T are different. From this, we can deduce that KL(P L T * , P L T ) > 0.
Thus, our identifiability assumption is somewhat weaker than this condition. Assumption 2.2 also does not impose any assumption on the family of distribution themselves and thus can be applied to a wider class of probabilistic models.
• On the other hand, the basic centrality condition only concerns leaf sets of size 4, while Assumption 2.2 impose restrictions on leaf sets of all sizes. However, we note that conditions on subset of leaves (such as the basic centrality condition) only work under the implicit assumption that there are some connection between P L 2 T | L 1 and P L 1 T for L 1 ⊂ L 2 . Since our framework does not assume any nested structure in the probability model, Assumption 2.2 is more appropriate.
Finally, we impose the following regularity conditions on the tree-generating probability model: Proof. Note that the function E P T * log P L T (T ′ ) (with respect to T) is analytic in a neighborhood U of T * . Define For all T ∈ A, we have By Assumption 2.2, we conclude that T| L = T * | L for all T ∈ A. Applying Łojasiewicz inequality (Ji et al., 1992, Theorem 1), we deduce that there exists c 3 > 0 and m ≥ 2 such that for any T in the neighborhood, This implies the result.
Throughout this paper, we will assume that Assumptions 2.1, 2.2, and 2.3 hold. In the next section, we will establish the following convergence rate of the ML supertree.
Theorem 1. Under Assumptions 2.1, 2.2, and 2.3, for any δ > 0, there exist constants m > 0, Theorem 1 establishes that under fairly mild regularity conditions, the ML supertree is consistent and has a polynomial convergence rate. Notably, the result holds for all identifiable family of locally-analytic distributions (Remark 2.2). The degree of this (polynomial) convergence rate depends on the sampling scheme of the leaves (characterized by the covering coefficient γ) and on the geometry of the probabilistic model (characterized by the constant m in Assumption 2.3(b)).
We further note that Assumption 2.3(b)) is only required for establishing the convergence rate of the ML estimator and the absence of this condition does not affect the proof of consistency.

Convergence of maximum likelihood supertree
To enable the analysis of convergence, we define We refer to R k and E P T * [R k (T)] as the empirical risk function and the expected risk function, respectively.
An intuitive argument for the consistency of the ML estimator can be described as follows. As the number of gene trees increases, we have with sufficiently high probability. Therefore, the ML estimator will converge to the optimal value of the risk function, which is attained at the true species tree T * . This simple argument is formalized by a lower bound of the expected risk function (Section 3.1) and a uniform concentration bound on the deviation of the empirical risk function and its expectation (Section 3.2).

Lower bound of the expected risk
Lemma 3.1. There exist a neighborhood U of T * and c 4 > 0 such that Proof. We note that Consider an arbitrary leaf set L ⊂ L * of size 4 and define I = {i ≤ k : L ⊂ L i }. By the weak covering property, we have |I| ≥ c 1 k γ . Thus, for all T in some neighborhood U around T * . Here, the second inequality comes from Assumption

2.3(b).
On the other hand, Dinh et al. (2018) (Lemma 6.2 (i)) proved that for some leaf set L ⊂ L * of size 4, we have We deduce that Lemma 3.2. For any leaf set L and any x > 0, there exists C x > 0 such that Proof. Assume that there exists a sequence of tree Since the tree space is compact, we can extract a sub-sequence {T i j } of {T i } that converges to some tree T 0 . We deduce that KL(P L T * , P L T 0 ) = 0 and T * | L = T 0 | L . This contradicts Assumption 2.2.
Proof. By Dinh et al. (2018) (Lemma 6.2 (i)), there exists L ⊂ L * of size 4 such that By the weak covering property, we have |I| ≥ c 1 k γ . Thus, which completes the proof.
Proof. Since the functions log P L i T (T i ) is locally Lipschitz with respect to T (Assumption 2.3(a)) and number of species is finite, there exists C 1 > 0 such that Since the tree space is compact, C 1 d(T, T ′ ) is bounded by a constant C 2 . Using Hoeffding's inequality (Hoeffding, 1963), we obtain On the other hand, we have . Note that the total number of balls of radius x/(4C 1 √ k) required to cover the tree space is bounded above by where g 0 is the upper bound of branch lengths and C n is a constant depending on the number of species. To obtain the desired inequality, we will chose x such that which can be done with x = C(δ, n, C 2 , g 0 ) log k.

Proof of Theorem 1
First, we establish that the ML estimator is consistent.
By Lemma 3.4, we have with probability at least 1 − δ, sinceT k is the maximizer of the empirical risk function.
On the other hand, let V be a neighborhood of T * , by Lemma 3.3, we have Since γ > 1/2, there exists K δ,V such that for k ≥ K δ,V , we have We deduce that for k ≥ K δ,V ,T k ∈ V with probability at least 1 − δ. This proves that the ML estimator is consistent.
Next, we will derive the convergence rate of the ML supertree. From Lemma 3.1, there exist a neighborhood U of T * and c 4 , m > 0 such that For k ≥ K δ,U , we haveT k ∈ U with probability at least 1 − δ. By Lemma 3.4, with probability 1 − 2δ, we obtain Here, becauseT k is the maximizer of R k . This completes the proof.

Applications: convergence under the exponential model
The exponential model is a simple model of gene tree estimation errors in which the probability of observing a given tree decreases exponentially with its distance from the species tree (Steel and Rodrigo, 2008). Suppose d is some metric on the set of trees, in the exponential model, the probability of reconstructing any tree T ′ with a leaf set L , when T * is the generating tree is proportional to an exponentially decaying function of the distance from T ′ to T * | L : Here β L is a constant that depends only on the set of leaves L , while α T * ,L is the normalizing constant to ensure that P L T * is a density function.
In this section, we verify the identifiability and regularity conditions for the exponential model when d is any continuous tree distance such that if T and T ′ have the same topology, then d(T, T ′ ) is the Euclidean distance.
Theorem 2. Under the exponential model with Assumptions 2.1, for any δ > 0, C δ > 0 and Proof. First, we note that for all T ′ with leaf set L , P L T (T ′ ) is a positive and locally Lipschitz function. To obtain an upper bound on the convergence rate for the ML supertree, we need to verify Assumption 2.2 and Assumption 2.3(b)/Remark 2.3.
Identifiability. Since it is sufficient to prove either KL(P L T * , P L T ) > 0 or KL(P L T , P L T * ) > 0, we can assume that α T * ,L ≥ α T,L .
We denote and pick c small enough such that β L d(T ′ , T| L ) ≥ 2c. We note that in A c , Since α T * ,L ≥ α T,L , we have By Pinsker's inequality, which establishes the identifiability of the exponential model. Here, µ is the Lebesgue measure.
Regularity. Consider a fixed tree T ∈ T and leaf set L , we first assume that α T * ,L ≥ α T,L .

If we define
Using the same argument as above, we have It can be verified that there exists C > 0 such that for all x ∈ [0, C], we have (1) Moreover since d is Euclidean inside each orthant, if d(T, T * ) is small enough, we have Thus, let W be a neighborhood in the same topology of T * such that and both Equations (1) and (2) are satisfied, we have for some constant C > 0 independent of T.
For the case when α T * ,L < α T,L we define

Discussion and Conclusion
In this paper, we propose a novel analytic approach to analyze the convergence of the ML supertree method. Instead of focusing on reconstructing the correct discrete topology of the species tree as in previous studies (e.g. Roch and Warnow, 2015;Steel and Rodrigo, 2008), we employ a continuous model of the tree space and analyze the ML estimator on this metric space, aiming at recovering both the topology and the branch lengths of the species tree. This framework enables us to use tools from statistical learning and information theory to establish the convergence rate of the ML estimator and at the same time, to weaken the conditions to obtain consistency and convergence of the estimator. Our weak covering property is an extension of the classical covering property (Steel and Rodrigo, 2008) and provides a considerable relaxation on the sampling schemes for supertree estimation. Our identifiability condition is also more intuitive and generalizable than the well-known basic centrality condition and does not impose constraints on the shape of the probabilistic model of gene tree estimation errors. Our information-theoretical approach to analyze statistical estimator on tree spaces is of independent interest and can be extended to other problems in phylogenetics.
There are several avenues for future directions for this work. The first direction is extending our results to other practical models of phylogenetic errors, including the multiple-coalescent model (along with a detailed model of the effects of short sequence length on the accuracy in estimating the individual gene trees). Second, while our result provides a polynomial bound on the convergence rate, the power of the convergence (characterized by the geometric constant m) is not sharp. A sharper bound of the convergence rate would be of great interest to the field (from both theoretical and applied perspective) and would require further understanding of the tree-generating probabilistic model.