Novel Algorithm for Comparing Phylogenetic Trees with Different but Overlapping Taxa

Abstract: Comparing phylogenetic trees is a prominent problem with applications such as clustering and building the Tree of Life. While there are many well-developed distance measures for phylogenetic trees defined on the same set of taxa, far fewer exist for trees defined on different but mutually overlapping sets of taxa. This paper presents a new polynomial-time algorithm for completing phylogenetic trees and computing the distance between trees defined on different but overlapping sets of taxa. This novel approach considers both the branch lengths and the topology of the phylogenetic trees being compared. We demonstrate that the distance measure applied to completed trees is a metric and provide several properties of the new method, including its symmetrical nature in tree completion.


Introduction
Phylogenetic trees serve as graphical representations aimed at depicting potential relationships among various species or groups of organisms over time. Despite their limitations, such as simplified interpretations, phylogenetic trees remain valuable in efforts to understand the diversity of life and find applications across disciplines such as taxonomy, comparative genomics, and evolutionary biology. Many practical tasks, such as tree clustering and classification, involve comparing phylogenetic trees by computing distances between them, which is a crucial part of the process.
Concurrently, there are applied problems that require comparing trees defined on different but overlapping sets of taxa. For instance, phylogenetic tree clustering [15–17], supertree construction [16,18,19], the Tree of Life construction [20,21], and phylogenetic database searching [22,23] demand computing distances between such trees. There are also related works on distance measures for phylogenetic trees with different numbers of overlapping leaves (see Table 1).
One approach to determine the distance between two phylogenetic trees defined on different but partially overlapping taxa sets is to prune the trees to their common taxa set. Specifically, the RF(−) approach starts from two trees defined on different but overlapping leaf sets, removes the leaves unique to each tree so that both trees are defined on a common set of taxa, and then computes the classical RF distance between the pruned trees [24]. This approach is known for its simplicity and relatively fast computation time. However, the pruning step may result in the loss of valuable topological information from both trees.
A more advanced RF-based approach involves adding the non-common leaves of one tree to the other tree being compared. This results in two trees defined on the same set of taxa, i.e., on the union of the original leaf sets of both trees. The methods of the RF(+) approach are described in [24] and developed in [25–29]. The completion-based RF(+) method described in [29] adds leaves to the trees so as to minimize the RF distance. Compared to RF(−), the RF(+) method processes more information about both trees after their completion and includes a wider set of possible distance values.
The Generalized Robinson-Foulds (GRF) distance [4] is another approach that can be applied to two trees not necessarily defined on the same set of taxa. This method utilizes symmetric differences for sets of clades (clusters) from both phylogenetic trees, accounting for both shared and non-shared clades. As a result, the GRF distance incorporates more tree information compared to the classical RF distance and can be computed in linear time. However, it primarily uses topological information and does not consider the length of tree branches.
The distance between trees defined on different but partially overlapping sets of taxa can be calculated using the Vectorial Tree Distance (VTD) [30]. VTD is a vector consisting of elements representing the difference in the number of branches at each tree level, starting from the root of the tree. It is based on a tree-alignment technique that maps the branches of one tree to the branches of another tree at the same level. However, VTD does not consider leaf names, is not a metric, and the final result is a vector rather than the scalar commonly used as a distance value (although the authors suggest a method for converting the vector to a single value).
The geodesic distance in the Billera-Holmes-Vogtmann (BHV) tree space [31] is a distance metric that considers both tree topology and branch lengths. This distance was originally developed for phylogenetic trees with the same set of taxa, but there are extensions of this approach to trees defined on different but overlapping sets of taxa [32,33]. In particular, in [32], the authors introduce the BHV connection cluster, the BHV connection space, and the BHV connection graph. They describe the process of constructing the BHV connection graph to move from lower dimensions to higher dimensions of the BHV tree space, which provides a way to compute distances for trees with different numbers of overlapping leaves. However, this process introduces additional computational complexity into the resulting geodesic distance calculation and makes this method slow for large trees with many non-common leaves.
These distance measures are compared based on their properties, such as the tree information used (e.g., binary nature, branch lengths, topology, and leaf names) and their computational complexity. The results are shown in Table 1.
Table 1. Comparison of distance measures for trees with non-identical taxa. A distance measure is recognized as a metric if it satisfies the four properties of a metric (non-negativity, identity of indiscernibles, symmetry, and triangle inequality). In the complexity expressions, $n$ is the total number of unique leaves in both trees, $n_1$ is the total number of nodes in the tree, $k$ is the number of maximal subtrees unique to an input tree, $|M(m)|$ is the number of minimal matchings of two $m$-dimensional vectors, $n_2$ is the maximum number of nodes, and $l$ is the number of leaves to be added to the tree.
[Table 1 body not recovered; surviving column headers: Measure, Metric.]

A number of research papers have addressed the issue of imputing missing taxa in phylogenetic trees, presenting a variety of techniques for this purpose. Yasui et al. [34] presented an optimization-based method using a mixed-integer non-linear programming model to handle missing pairwise distances in gene trees. This method involves a two-stage optimization process to assign individuals to hypothetical groups and estimate the missing distances. Yoshida [35] introduced a technique utilizing tropical geometry, leveraging tropical polytopes and max-plus algebra. This method projects the incomplete tree onto a constructed tropical polytope to estimate missing data, focusing on equidistant trees. Rabiee and Mirarab [36] developed INSTRAL, which integrates new species into existing trees by minimizing quartet discordance. Mai and Mirarab [37] proposed a method to complete gene trees independently by optimizing the quartet score and introduced quartet subsampling for better accuracy. Mahbub et al. [38] addressed missing data using QT-GILD, a deep learning approach that employs an autoencoder to generate and correct quartets in incomplete gene trees.
Problem statement. The primary problem is to complete phylogenetic trees defined on different but overlapping sets of taxa while preserving the evolutionary relationships and structural integrity of the original trees. The challenge is to maintain the accuracy of evolutionary distances and the topological information of phylogenetic trees despite differences in their original taxon composition.
Our contributions. In this work, we present a new algorithm for completing phylogenetic trees defined on different but overlapping sets of taxa. This algorithm utilizes branch lengths and pairwise distances between leaves of the considered trees. It applies branch adjustment rates, common leaf distances, temporary nodes, and a midnode approach to insert the distinct leaves of one tree into the other, so that both trees become defined on the same set of taxa. The insertion is guided by the common part of both trees (i.e., the set of common leaves) through the leaf distances associated with the common leaves. Specifically, the algorithm scales branch lengths with the adjustment rates and uses the adjusted distances, measured from the same common leaves, to find planting points for the distinct leaves in the other tree. The planting point is chosen as the position for inserting a new leaf (or leaves) with its adjusted terminal branch. Several important properties are formulated for the proposed approach.
The rest of this paper is organized as follows. Section 2.1 provides the necessary notation and preliminary information. Sections 2.2 and 2.3 outline the new phylogenetic tree completion algorithm, and Section 2.4 provides a practical example. Section 3 presents several properties of the proposed algorithm and discusses their importance.

Notation and Preliminaries
In a phylogenetic tree $T$, nodes (or vertices, $v$) represent taxonomic units, such as species. The set of nodes is denoted as $V(T)$. The root node is a special node that represents the most recent common ancestor of all taxa included in the tree. Leaves (or terminal nodes, $l$) of the tree $T$ are nodes that do not have any children (i.e., nodes that are not connected further downstream). These nodes represent individual species or taxa under study, and the set of leaves of the tree $T$ is denoted as $L(T)$. Internal nodes are nodes that have at least one child and represent ancestral taxa that have given rise to descendant taxa. Edges (or branches) in the phylogenetic tree $T$ represent evolutionary relationships between taxa; each branch connects two nodes and indicates a common ancestor, showing how species have evolved from their ancestors over time. The set of edges (branches) is denoted as $E(T)$. Branches can have a length associated with them. A terminal branch (or pendant branch) is a branch that ends in a leaf (terminal node) and does not give rise to any further branches or internal nodes. The length of the terminal branch of the leaf $l$ in the tree $T$ is denoted as $br^{(T)}(l)$. In this work, rooted phylogenetic trees with labeled branch lengths are considered.
Definition 1 (Distance between leaves). The distance between any two leaves $l_1$ and $l_2$ of the phylogenetic tree $T$ (denoted as $d^{(T)}(l_1, l_2)$) is the sum of branch lengths along the unique path from $l_1$ to $l_2$.
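The distance $d^{(T)}(l_1, l_2)$ can be expressed as follows:
$$d^{(T)}(l_1, l_2) = \sum_{e \in P^{(T)}(l_1, l_2)} \mathrm{length}(e), \tag{1}$$
where $e$ represents each branch (edge) along the path $P^{(T)}(l_1, l_2)$ and $\mathrm{length}(e)$ is the length of branch $e$.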
Similarly, the distance between any two nodes in the tree can be calculated.
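Definition 1 can be illustrated with a minimal Python sketch. The child-to-parent dictionary, the function names `distances_to_ancestors` and `leaf_distance`, and the example tree are assumptions introduced here for illustration; the branch lengths are hypothetical and are not taken from the paper's figures.

```python
def distances_to_ancestors(parent, node):
    """Map `node` and each of its ancestors to the distance from `node`."""
    dist = 0.0
    out = {node: 0.0}
    while node in parent:
        node, length = parent[node]
        dist += length
        out[node] = dist
    return out


def leaf_distance(parent, a, b):
    """Sum of branch lengths on the unique path between nodes a and b."""
    up_from_a = distances_to_ancestors(parent, a)
    node, dist = b, 0.0
    # Climb from b until reaching a node that is also an ancestor of a.
    while node not in up_from_a:
        node, length = parent[node]
        dist += length
    return dist + up_from_a[node]


# Hypothetical rooted tree ((A:0.1,B:0.2):0.25,(C:0.3,D:0.4):0.25),
# stored as child -> (parent, branch length).
tree = {
    "A": ("y", 0.1), "B": ("y", 0.2),
    "C": ("p", 0.3), "D": ("p", 0.4),
    "y": ("root", 0.25), "p": ("root", 0.25),
}

print(round(leaf_distance(tree, "A", "B"), 3))  # 0.3
print(round(leaf_distance(tree, "A", "D"), 3))  # 1.0
```

The same function works for any pair of nodes, not only leaves, because it only relies on the path through the lowest common ancestor.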
Definition 2 (Common and distinct leaves). For two phylogenetic trees $T_1$ and $T_2$, the leaves they share are called common leaves. The set of common leaves for trees $T_1$ and $T_2$ is denoted as $CL(T_1, T_2)$. If one tree contains leaves that are not included in the other tree, these leaves are called distinct leaves. The set of distinct leaves for tree $T$ is denoted as $DL(T)$.
Definition 3 (Maximal distinct-leaf subtree). A subtree $S$ of a tree $T$ is called a maximal distinct-leaf subtree if and only if all leaves in the subtree $S$ belong to the set $DL(T)$ and there is no other subtree of $T$ whose leaves all belong to $DL(T)$ and that has $S$ as a proper subtree.
The branch connecting the root of a maximal distinct-leaf subtree $S$ to its lowest ancestor node in the tree $T$ is called the root branch of that subtree. The length of the root branch of the maximal distinct-leaf subtree $S$ is denoted as $br^{(T)}(S)$, similar to how the length of the terminal branch for a leaf is denoted. The cutback distance between a leaf $l$ and the maximal distinct-leaf subtree $S$ is the distance in $T$ between the leaf $l$ and the lowest ancestor node of $S$, i.e., the node at which the root branch of $S$ attaches to the rest of the tree. For a tree $T$, the set containing all its maximal distinct-leaf subtrees and remaining distinct leaves is denoted as $SD(T)$ and can be found using Algorithm 1.
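The listing of Algorithm 1 is not reproduced in this text, so the following Python sketch is only one possible way to compute $SD(T)$ under stated assumptions. It uses a recursive post-order traversal, whereas the paper mentions a breadth-first traversal. The parent-to-children dictionary, the function name `maximal_distinct_subtrees`, and both example topologies are hypothetical.

```python
def maximal_distinct_subtrees(children, root, common):
    """Return the roots of maximal distinct-leaf subtrees and the
    remaining single distinct leaves of the tree rooted at `root`."""
    found = []

    def only_distinct_leaves(node):
        kids = children.get(node, [])
        if not kids:                      # a leaf: distinct if not common
            return node not in common
        flags = [only_distinct_leaves(k) for k in kids]
        if all(flags):                    # whole subtree is distinct so far
            return True
        # The node mixes common and distinct leaves, so each all-distinct
        # child is maximal and gets reported.
        found.extend(k for k, pure in zip(kids, flags) if pure)
        return False

    if only_distinct_leaves(root):        # degenerate case: no common leaf
        found.append(root)
    return found


common = {"A", "B", "D"}

# Hypothetical topology with one distinct leaf C.
t1 = {"root": ["y", "p"], "y": ["A", "B"], "p": ["C", "D"]}
# Hypothetical topology where E, F, G form one distinct subtree rooted at S.
t2 = {"root": ["u", "D"], "u": ["A", "v"], "v": ["B", "S"],
      "S": ["E", "w"], "w": ["F", "G"]}

print(maximal_distinct_subtrees(t1, "root", common))  # ['C']
print(maximal_distinct_subtrees(t2, "root", common))  # ['S']
```

Each node is visited once, so this sketch runs in time linear in the number of nodes.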
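Definition 4 (Branch adjustment rate). Given two phylogenetic trees $T_1$ and $T_2$ defined on different but overlapping taxa and their set of common leaves $CL(T_1, T_2)$, the branch adjustment rate is the ratio of the sum of pairwise (without repetitions) distances between common leaves in one tree to the corresponding sum in the other tree:
$$r(T_1, T_2) = \frac{\sum_{i=1}^{N_{CL}-1} \sum_{j=i+1}^{N_{CL}} d^{(T_1)}(l_i, l_j)}{\sum_{i=1}^{N_{CL}-1} \sum_{j=i+1}^{N_{CL}} d^{(T_2)}(l_i, l_j)}, \tag{2}$$
where $r(T_1, T_2)$ is the branch adjustment rate for tree $T_1$ related to tree $T_2$, $l_i, l_j \in CL(T_1, T_2)$, and $N_{CL}$ is the number of common leaves in $CL(T_1, T_2)$.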
The branch adjustment rate is used to adjust the terminal branch lengths for distinct leaves (and subtree branches, if applicable) in the tree completion process.
Definition 5 (Leaf-based adjustment rate). Given two phylogenetic trees $T_1$ and $T_2$ defined on different but overlapping taxa, their set of common leaves $CL(T_1, T_2)$, and a common leaf $l_c \in CL(T_1, T_2)$, the leaf-based adjustment rate is defined by the following equation:
$$r^{(l_c)}(T_1, T_2) = \frac{\sum_{l_i \in CL(T_1, T_2),\, l_i \neq l_c} d^{(T_1)}(l_c, l_i)}{\sum_{l_i \in CL(T_1, T_2),\, l_i \neq l_c} d^{(T_2)}(l_c, l_i)}, \tag{3}$$
where $r^{(l_c)}(T_1, T_2)$ is the $l_c$-based adjustment rate for tree $T_1$ related to tree $T_2$ and $l_c, l_i \in CL(T_1, T_2)$. Each leaf-based adjustment rate is calculated based on one common leaf relative to the other common leaves in the considered trees.
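The two rates of Definitions 4 and 5 can be sketched in Python as follows. The helper names and the dictionary layout are assumptions made for this illustration. The individual pairwise distances are hypothetical values chosen so that their sums agree with the sums 2.4, 5.8, and 3.7 reported in the worked example of this paper; they are not read from the paper's figures.

```python
from itertools import combinations


def pair(a, b):
    """Order-independent dictionary key for a pair of leaves."""
    return tuple(sorted((a, b)))


def branch_adjustment_rate(d1, d2, common):
    """Ratio of the summed pairwise common-leaf distances (tree 1 / tree 2)."""
    pairs = [pair(a, b) for a, b in combinations(common, 2)]
    return sum(d1[p] for p in pairs) / sum(d2[p] for p in pairs)


def leaf_based_rate(d1, d2, common, lc):
    """Same ratio, restricted to distances from the common leaf `lc`."""
    others = [leaf for leaf in common if leaf != lc]
    return (sum(d1[pair(lc, leaf)] for leaf in others)
            / sum(d2[pair(lc, leaf)] for leaf in others))


common = ["A", "B", "D"]
# Hypothetical pairwise distances between common leaves in each tree.
d_t1 = {("A", "B"): 0.3, ("A", "D"): 1.0, ("B", "D"): 1.1}
d_t2 = {("A", "B"): 2.1, ("A", "D"): 1.5, ("B", "D"): 2.2}

print(round(branch_adjustment_rate(d_t1, d_t2, common), 3))  # 0.414
print(round(leaf_based_rate(d_t1, d_t2, common, "D"), 3))    # 0.568
```

Swapping the two distance dictionaries gives the reciprocal rate, which is the one applied when elements move in the opposite direction.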
Definition 6 (Midnode). In the context of a phylogenetic tree $T$, the midnode between any two connected nodes $v_1$ and $v_2$ refers to a specific point along the path that connects these nodes, such that this point divides the total branch length of the path into two equal halves.
Formally, if $P^{(T)}(v_1, v_2)$ represents the unique path between $v_1$ and $v_2$, then the midnode $M$ is defined as the point on $P^{(T)}(v_1, v_2)$ such that
$$d^{(T)}(v_1, M) = d^{(T)}(M, v_2) = \frac{1}{2}\, d^{(T)}(v_1, v_2). \tag{4}$$
It is essential to note that the midnode represents a calculated position that may not coincide with the pre-existing nodes in the tree. This definition presupposes the existence of a unique path between any two nodes in $T$, a fundamental characteristic of phylogenetic trees, ensuring that each pair of nodes is connected by exactly one path. The midnode approach is employed in the tree completion process and can be found using Algorithm 2.
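A minimal Python sketch of locating a midnode is given below. It assumes the path between the two nodes has already been extracted as a list of nodes with the lengths of the edges between consecutive nodes; the function name `midnode_on_path`, this input format, and the example path are assumptions for illustration only.

```python
def midnode_on_path(nodes, lengths):
    """Locate the midnode of a path.

    nodes   -- [v1, ..., v2], the nodes along the path
    lengths -- lengths[i] is the length of the edge nodes[i]-nodes[i + 1]
    Returns (u, v, offset): the midnode lies on edge u-v, `offset` from u.
    """
    half = sum(lengths) / 2.0
    walked = 0.0
    for i, length in enumerate(lengths):
        if walked + length >= half:
            # The half distance is reached inside (or at the end of) this edge.
            return nodes[i], nodes[i + 1], half - walked
        walked += length
    raise ValueError("path must contain at least one edge")


# Hypothetical path A - y - root - p - D with total length 1.0.
u, v, offset = midnode_on_path(["A", "y", "root", "p", "D"],
                               [0.1, 0.25, 0.25, 0.4])
print(u, v, round(offset, 3))  # root p 0.15
```

Returning an edge together with an offset reflects the remark above that the midnode is a calculated position and need not coincide with an existing node.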

Distance Measure
The task of comparing phylogenetic trees with different but overlapping sets of taxa can be formulated as calculating the distance between them after completing the trees on the union of their taxa sets. Given two trees, $T_1$ defined on $L(T_1)$ and $T_2$ defined on $L(T_2)$, and the set of their common leaves $CL(T_1, T_2)$ containing at least two elements, the tree completion process produces the trees $T_1^{\uplus}$ and $T_2^{\uplus}$, both defined on the same set of taxa $L(T^{\uplus}) = L(T_1) \cup L(T_2)$. The distance between the completed trees is then calculated using the Branch Score Distance, denoted as BSD(+), which utilizes the difference in distances between the corresponding tree leaves [13]. The formula for the BSD(+) distance is as follows:
$$\mathrm{BSD}(+)(T_1, T_2) = \sqrt{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left( d^{(T_1^{\uplus})}(l_i, l_j) - d^{(T_2^{\uplus})}(l_i, l_j) \right)^2}, \tag{5}$$
where $l_i, l_j \in L(T^{\uplus})$ and $N$ is the size of the set $L(T^{\uplus})$.
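As a hedged illustration, the distance between two completed trees can be computed from their pairwise leaf-distance tables as sketched below. The sketch assumes the unnormalized form described in the metric proof (the square root of the sum of squared differences over all unordered leaf pairs). The function name `bsd_plus`, the dictionary layout, and the numbers are hypothetical.

```python
from itertools import combinations
from math import sqrt


def bsd_plus(d1, d2, leaves):
    """Square root of the summed squared differences between the pairwise
    leaf distances of two trees defined on the same leaf set."""
    pairs = combinations(sorted(leaves), 2)
    return sqrt(sum((d1[p] - d2[p]) ** 2 for p in pairs))


leaves = ["A", "B", "C"]
# Hypothetical leaf-to-leaf distances in two completed trees.
d_first = {("A", "B"): 1.0, ("A", "C"): 2.0, ("B", "C"): 3.0}
d_second = {("A", "B"): 1.0, ("A", "C"): 4.0, ("B", "C"): 3.0}

print(bsd_plus(d_first, d_second, leaves))  # 2.0
print(bsd_plus(d_first, d_first, leaves))   # 0.0
```

The double loop over leaf pairs is what gives the quadratic cost of the distance computation discussed in the complexity analysis.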

Tree Completion Algorithm
A novel tree completion algorithm based on the concepts of common leaf distances, adjustment rates, and midnodes is described in this subsection. The proposed algorithm adopts a procedural approach, leveraging domain knowledge and procedural logic to achieve the objectives outlined in the problem statement. Specifically, the algorithm seeks to preserve evolutionary information by adjusting branch lengths with calculated rates, aiming to maintain the proportional evolutionary distances present in the original trees. Additionally, the use of common leaves and the distances between them allows leaf insertion to be guided by the shared information of the trees being compared, ensuring that insertions reflect established phylogenetic relationships.
Let $T_1$ and $T_2$ be phylogenetic trees defined on different but overlapping sets of taxa. The phylogenetic tree completion algorithm includes the following main steps.
The first step consists of finding common leaves $CL(T_1, T_2)$, distinct leaves, and maximal distinct-leaf subtrees (sets $SD(T_1)$ and $SD(T_2)$). Finding the maximal distinct-leaf subtrees and the remaining single distinct leaves in a phylogenetic tree $T$ can be accomplished using Algorithm 1.
The process continues for each tree $T_i$ in $\{T_1, T_2\}$, where $T_{3-i}$ denotes the other tree. The second step calculates the branch adjustment rates $r(T_i, T_{3-i})$ (see Equation (2)) and the leaf-based adjustment rates $r^{(l_c)}(T_i, T_{3-i})$ (see Equation (3)). The third step consists of processing each distinct element $a \in SD(T_{3-i})$, performing the following substeps.
The first substep calculates the new branch length for element $a$ as the current branch length multiplied by the corresponding branch adjustment rate (Equation (6)):
$$br^{(T_i^{\uplus})}(a) = br^{(T_{3-i})}(a) \cdot r(T_i, T_{3-i}). \tag{6}$$
All branch lengths within maximal distinct-leaf subtrees should be adjusted using the appropriate adjustment rate when inserted into the corresponding tree.It is important to note that the initial branch lengths in the trees should be kept unchanged.
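The second substep calculates the cutback distances between each common leaf and that element, denoted as $dc^{(T_{3-i})}(l_c, a)$, using Equation (7). The cutback distance is measured in $T_{3-i}$ from the common leaf $l_c$ to the node at which the terminal (or root) branch of $a$ attaches to the rest of the tree.
The third substep multiplies these cutback distances by the corresponding leaf-based adjustment rates to obtain the distances $d_p(l_c, a) = dc^{(T_{3-i})}(l_c, a) \cdot r^{(l_c)}(T_i, T_{3-i})$ (see Equation (8)), which are used to find possible positions for adding temporary nodes in the next substep.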
A possible position for inserting temporary nodes is determined by traversing the branches of tree $T_i$ and identifying points where the distance from $l_c$ matches $d_p(l_c, a)$. This involves checking each branch and calculating the cumulative distance from the common leaf $l_c$. If the calculated cumulative distance matches $d_p(l_c, a)$, that point is considered a possible position for temporary nodes. Temporary nodes are auxiliary nodes that are introduced to facilitate finding insertion points for new leaves in the tree completion process. These temporary nodes do not represent actual biological entities and serve as placeholders.
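The search for candidate positions can be sketched in Python as a traversal that walks away from the common leaf and records every point lying exactly at the target distance. The edge list below is a hypothetical tree: its common-leaf distances are chosen to agree with the sums used in the worked example, and the internal lengths of the distinct subtree are invented. The function names and the `(u, v, offset)` output format are likewise assumptions, not the authors' implementation.

```python
def build_adjacency(edges):
    """Undirected adjacency lists from (u, v, length) triples."""
    adj = {}
    for u, v, length in edges:
        adj.setdefault(u, []).append((v, length))
        adj.setdefault(v, []).append((u, length))
    return adj


def temporary_positions(adj, start, target):
    """All points at distance `target` from `start`, each given as
    (u, v, offset): on edge u-v, `offset` away from u."""
    positions = []
    stack = [(start, None, 0.0)]
    while stack:
        node, came_from, dist = stack.pop()
        for nxt, length in adj[node]:
            if nxt == came_from:
                continue
            if dist < target <= dist + length:
                positions.append((node, nxt, target - dist))
            elif dist + length < target:
                stack.append((nxt, node, dist + length))
    return positions


# Hypothetical tree: common leaves A, B, D; distinct subtree rooted at S.
t2_edges = [("x", "A", 0.7), ("x", "D", 0.8), ("x", "s", 0.9),
            ("s", "B", 0.5), ("s", "S", 0.6),
            ("S", "E", 0.3), ("S", "w", 0.2), ("w", "F", 0.1), ("w", "G", 0.1)]
adj = build_adjacency(t2_edges)

for leaf, d_p in [("A", 1.661), ("B", 2.150), ("D", 0.705)]:
    print(leaf, len(temporary_positions(adj, leaf, d_p)))
```

With this hypothetical tree, the traversal finds two candidate positions from A and one each from B and D, the same counts as in the worked example; the actual positions depend on the trees in Figure 1.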
The fourth substep consists of adding temporary nodes in $T_i$ in all possible positions (only among the branches that were in the tree initially, not including newly added branches and nodes) at the appropriate calculated distances ($d_p(l_c, a)$).
The fifth substep involves finding the planting point among the temporary nodes (see Algorithms 2 and 3) and inserting the considered distinct element $a$ (leaf or maximal distinct-leaf subtree) with its adjusted branch length $br^{(T_i^{\uplus})}(a)$ at the planting point position.
Algorithm 2: Finding the farthest nodes
Input: Non-empty list of nodes
Output: Pair of the farthest nodes
Function FindFarthestNodes(nodes): [remaining steps not recovered]
Let $V$ be the set of temporary nodes in the phylogenetic tree, $M$ be the midnode (see Definition 6), and $\{v_1, v_2\}$ be the pair of farthest nodes (see Algorithm 2). The iterative process of identifying the planting point is described by Equation (9).
In Algorithm 3, the function TraverseTree traverses the tree from one node towards another to find the exact position of the midnode based on the calculated half distance. Starting at node $v_1$ with an initial distance of zero, the function iteratively moves to the next node towards $v_2$ while accumulating the distance. If the next step exceeds the half distance, the function interpolates the position between the current node and the next node to pinpoint the precise midnode position. This traversal continues until the cumulative distance equals the half distance, at which point the midnode position is returned.
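Since only the header of Algorithm 2 is available here, the following Python sketch should be read as one plausible realization rather than the authors' procedure. It assumes a brute-force comparison of all pairs and a caller-supplied distance function `dist`; the handling of a single-node list is also an assumption.

```python
from itertools import combinations


def find_farthest_nodes(nodes, dist):
    """Return the pair of nodes with the largest mutual distance.

    nodes -- non-empty list of node identifiers
    dist  -- function (u, v) -> distance between u and v in the tree
    """
    if not nodes:
        raise ValueError("the list of nodes must be non-empty")
    if len(nodes) == 1:
        # Assumption: a single node is treated as its own farthest pair.
        return nodes[0], nodes[0]
    return max(combinations(nodes, 2), key=lambda p: dist(p[0], p[1]))


# Hypothetical temporary nodes placed along a single line for simplicity.
position = {"c1": 0.0, "c2": 1.0, "c3": 5.0, "c4": 2.0}
print(find_farthest_nodes(list(position),
                          lambda u, v: abs(position[u] - position[v])))
# ('c1', 'c3')
```

For m temporary nodes this costs O(m^2) distance evaluations; in a tree, `dist` would be the path distance of Definition 1.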
As a result, two completed phylogenetic trees, $T_1^{\uplus}$ and $T_2^{\uplus}$, are obtained, both defined on the same set of taxa $L(T^{\uplus}) = L(T_1) \cup L(T_2)$. The distance between the completed trees is calculated using Equation (5).

Example
The following example illustrates the tree completion procedure within the proposed algorithm. Consider the trees $T_1$ and $T_2$ in Figure 1. The tree completion process can start with either $T_1$ or $T_2$ due to the symmetry property (see Proposition 1). In this example, the tree $T_2$ is completed first. All subsequent results are rounded to three decimal places. The process begins with identifying common leaves, distinct leaves, and maximal distinct-leaf subtrees in both trees. The common leaves in both trees are $A$, $B$, and $D$. Tree $T_1$ possesses one distinct leaf $C$, whereas tree $T_2$ includes distinct leaves $G$, $F$, and $E$, which together form a maximal distinct-leaf subtree, denoted as $S$.
The process continues by inserting the distinct leaf $C$ with its adjusted terminal branch length (0.725) into the tree $T_2$. This is achieved by calculating the cutback distances between each common leaf in tree $T_1$ and leaf $C$ as follows (see Equation (7)): $dc^{(T_1)}(A, C) = 0.6$, $dc^{(T_1)}(B, C) = 0.7$, and $dc^{(T_1)}(D, C) = 0.4$.
Subsequently, the distances for identifying possible locations for temporary nodes in tree $T_2$ are computed (see Equation (8)): $d_p(A, C) = 0.6 \times r^{(A)}(T_2, T_1) = 1.661$, $d_p(B, C) = 0.7 \times r^{(B)}(T_2, T_1) = 2.150$, and $d_p(D, C) = 0.4 \times r^{(D)}(T_2, T_1) = 0.705$. These distances are employed to integrate temporary nodes into tree $T_2$ at all possible positions (among the original branches) from the same common leaves $A$, $B$, and $D$. A traversal of tree $T_2$ at a distance of 1.661 from leaf $A$ identifies temporary nodes $c_1$ and $c_2$. Traversing tree $T_2$ at a distance of 2.150 from leaf $B$ identifies the temporary node $c_3$. Finally, the temporary node corresponding to common leaf $D$ at a distance of 0.705 is $c_4$. The results are presented in Figure 2.
The planting point for leaf $C$ in tree $T_2$ is determined using Algorithm 3 (this point is highlighted in blue in Figure 2b). After this insertion, the completion of tree $T_2$ is finished, as there are no further distinct leaves to incorporate. The completed tree $T_2^{\uplus}$ is shown in Figure 3b.
The branch adjustment rate for the next tree, $T_1$, is calculated to be $r(T_1, T_2) = 2.4/5.8 = 0.414$. The leaf-based adjustment rates $r^{(l_c)}(T_1, T_2)$ are computed from Equation (3); the last of them equals [...]/3.7 = 0.568. Completion of tree $T_1$ proceeds with the addition of subtree $S$ with its adjusted root branch length $br^{(T_1^{\uplus})}(S) = 0.248$. For subtree $S$, upon insertion into tree $T_1$, all branch lengths are adjusted using the rate $r(T_1, T_2) = 0.414$.
The distances for identifying potential locations of temporary nodes in tree $T_1$ are computed using Equation (8); for example, $d_p(B, S) = 0.5 \times r^{(B)}(T_1, T_2) = 0.163$. Utilizing these distances, temporary nodes in tree $T_1$ are identified (nodes $s_1$, $s_2$, $s_3$, and $s_4$ in Figure 2a). The planting point (indicated in blue in Figure 2a) is then found using these nodes.
Completed trees $T_1^{\uplus}$ and $T_2^{\uplus}$ defined on the same taxa are shown in Figure 3. Newly added internal nodes and leaves with their adjusted branch lengths are highlighted in blue. Finally, the distance BSD(+) between the completed trees $T_1^{\uplus}$ and $T_2^{\uplus}$ is calculated using Equation (5).

Results and Discussion
The properties of the described approach are formulated in the form of theorems and propositions.
Theorem 1 (Metric properties). Let $T_1$ and $T_2$ be phylogenetic trees. Then the $\mathrm{BSD}(+)(T_1, T_2)$ distance is a metric.
Proof. To be a metric, a distance measure has to satisfy the following properties for any three phylogenetic trees $T_1$, $T_2$, and $T_3$: non-negativity, identity of indiscernibles, symmetry, and triangle inequality.
These properties are first demonstrated for the general case where phylogenetic trees are defined on the same set of taxa, denoted as $L(T)$. It is then discussed how the proposed tree completion process affects these properties and why the metric characteristics are preserved. Let $N$ denote the size of the set $L(T)$.
Non-negativity. Squaring each term within the square root guarantees a non-negative sum of squared differences. Consequently, given that the square root of a non-negative value is non-negative, it follows that $\mathrm{BSD}(+)(T_1, T_2) \geq 0$.
Identity of indiscernibles. To prove this property, it is necessary to consider two cases. If $T_1$ and $T_2$ are identical, then for any pair of leaves $(l_i, l_j)$ in $L(T)$, the distances $d^{(T_1)}(l_i, l_j)$ and $d^{(T_2)}(l_i, l_j)$ are equal. Formally, $\forall (l_i, l_j) \in L(T)$, $d^{(T_1)}(l_i, l_j) = d^{(T_2)}(l_i, l_j)$. Consequently, each squared difference term in the BSD formula becomes zero. Therefore, the entire summation evaluates to zero, leading to $\mathrm{BSD}(+)(T_1, T_2) = 0$.
Conversely, if $\mathrm{BSD}(+)(T_1, T_2) = 0$, then the sum of squared differences is zero. Given that squared differences are always non-negative, this can only occur if each squared difference term is individually zero. Formally, $\forall (l_i, l_j) \in L(T)$, $(d^{(T_1)}(l_i, l_j) - d^{(T_2)}(l_i, l_j))^2 = 0$. This implies $d^{(T_1)}(l_i, l_j) = d^{(T_2)}(l_i, l_j)$ for every pair of leaves $(l_i, l_j)$, establishing that $T_1$ and $T_2$ are identical in terms of their leaf distances, and hence are identical trees.
Symmetry. The formula for BSD(+) is symmetric in its two arguments because each term is a squared difference, and $(d^{(T_1)}(l_i, l_j) - d^{(T_2)}(l_i, l_j))^2 = (d^{(T_2)}(l_i, l_j) - d^{(T_1)}(l_i, l_j))^2$. In addition, the leaf distances themselves are symmetric: $\forall (l_i, l_j) \in L(T)$, $d^{(T)}(l_i, l_j) = d^{(T)}(l_j, l_i)$.
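Triangle inequality. In order to establish the triangle inequality, it is necessary to demonstrate the following inequality for any triplet of phylogenetic trees $T_1$, $T_2$, and $T_3$:
$$\mathrm{BSD}(+)(T_1, T_3) \leq \mathrm{BSD}(+)(T_1, T_2) + \mathrm{BSD}(+)(T_2, T_3).$$
Let $a_{ij} = d^{(T_1)}(l_i, l_j) - d^{(T_2)}(l_i, l_j)$ and $b_{ij} = d^{(T_2)}(l_i, l_j) - d^{(T_3)}(l_i, l_j)$, so that $a_{ij} + b_{ij} = d^{(T_1)}(l_i, l_j) - d^{(T_3)}(l_i, l_j)$. Then $\mathrm{BSD}(+)(T_1, T_3)$ can be written as
$$\mathrm{BSD}(+)(T_1, T_3) = \sqrt{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left( a_{ij} + b_{ij} \right)^2}.$$
The triangle inequality for the square root of a sum of squares (the Minkowski inequality), applied to the double sum above, gives
$$\sqrt{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left( a_{ij} + b_{ij} \right)^2} \leq \sqrt{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} a_{ij}^2} + \sqrt{\sum_{i=1}^{N-1} \sum_{j=i+1}^{N} b_{ij}^2}.$$
The first square root on the right side is $\mathrm{BSD}(+)(T_1, T_2)$, and the second square root is $\mathrm{BSD}(+)(T_2, T_3)$:
$$\mathrm{BSD}(+)(T_1, T_3) \leq \mathrm{BSD}(+)(T_1, T_2) + \mathrm{BSD}(+)(T_2, T_3).$$
Therefore, the distance function BSD(+) satisfies the triangle inequality. Since the distance function BSD(+) satisfies all four properties, it is a metric.
Next, we discuss why the BSD(+) distance for the completed trees $T_1^{\uplus}$ and $T_2^{\uplus}$ is still a metric.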
Non-negativity. The BSD(+) distance is the square root of a sum of squared differences between leaf-pair distances. The introduction of additional leaves in the tree completion process only adds further squared terms, which are non-negative. Thus, $\mathrm{BSD}(+)(T_1^{\uplus}, T_2^{\uplus}) \geq 0$ is maintained.
Identity of indiscernibles. The tree completion process preserves the original distances between common leaves while adding new distances in a consistent manner across both trees. As a result, the distance between the completed trees is zero if and only if all corresponding pairwise distances are equal, including those involving newly added leaves, which confirms that the completed trees are identical.
Symmetry. Symmetry is inherent to the BSD(+) formulation, as it calculates the difference in distances between corresponding leaf pairs in both trees. This symmetry is not affected by the tree completion process, as the method of calculating distances between leaf pairs remains consistent, ensuring $\mathrm{BSD}(+)(T_1^{\uplus}, T_2^{\uplus}) = \mathrm{BSD}(+)(T_2^{\uplus}, T_1^{\uplus})$.
Triangle inequality. The addition of leaves maintains the integrity of original distances and adds new distances in a manner that respects the structure of the metric space. Thus, the aggregation of differences in distances, including those from newly added leaves, continues to satisfy the triangle inequality in the completed tree context.
The fact that BSD(+) is a metric is essential for maintaining mathematical consistency and validity in phylogenetic comparisons, ensuring reliable and interpretable results.
Theorem 2 (Computational complexity). Let $n$ be the total number of unique leaves in both trees and $k$ be the number of distinct elements (maximal distinct-leaf subtrees and remaining distinct leaves) to be inserted. The overall computational complexity of the proposed approach is $O(k \cdot n^2)$.
Proof. To establish the computational complexity of the proposed approach, it is necessary to analyze the complexity associated with each significant step in the completion and distance computation process. The computational complexity for determining common and distinct leaves is $O(n)$, utilizing efficient data structures for leaf comparison, such as hash sets.
Assume that the first tree has $V_1$ nodes and $E_1$ edges, and the second tree has $V_2$ nodes and $E_2$ edges. Identifying distinct-leaf subtrees via breadth-first traversal in both $T_1$ and $T_2$ has a complexity of $O(V_1 + E_1 + V_2 + E_2)$. This cost is at least linear in $n$ but remains below $O(n^2)$.
The calculations of branch adjustment rates $r(T_1, T_2)$ and $r(T_2, T_1)$, along with leaf-based adjustment rates $r^{(l_c)}(T_1, T_2)$ and $r^{(l_c)}(T_2, T_1)$, involve nested loops over the set of common leaves $CL(T_1, T_2)$. The number of common leaves is at most $n$, and for each pair of common leaves, the distances between them are calculated in each initial tree. The time complexity for this step is $O(n^2)$.
The completion process involves adding temporary nodes and determining the planting point for each element in $SD(T_1)$ and $SD(T_2)$. Each insertion involves traversing the trees, which has a complexity of $O(n^2)$. Since we perform this operation for $k$ elements, the overall complexity for this step is $O(k \cdot n^2)$.
The distance calculation between the completed trees $T_1^{\uplus}$ and $T_2^{\uplus}$ involves nested loops over all pairwise combinations of leaves in the union of the sets of leaves ($L(T_1) \cup L(T_2)$). For each pair of leaves, distances are calculated in both completed trees. The time complexity for this step can be estimated as $O(n^2)$.
Combining the complexities of all major steps, the overall computational complexity of the approach is the maximum of these complexities, which is $O(k \cdot n^2)$.
Understanding the computational complexity of the algorithm is important for assessing its feasibility and efficiency, especially when dealing with large datasets. Because each inserted element contains at least one distinct leaf, $k \leq n$, so the estimated complexity of $O(k \cdot n^2)$ is at most cubic in $n$. The algorithm therefore runs in polynomial time, which is important for practical applications in evolutionary biology and comparative genomics.
Proposition 1 (Symmetry in tree completion). Let $T_1$ and $T_2$ be phylogenetic trees defined on different but overlapping sets of taxa. The proposed tree completion algorithm is symmetric with regard to the input trees $T_1$ and $T_2$.
Proof. The symmetry property ensures that interchanging $T_1$ and $T_2$ in the tree completion process does not alter the resulting completed trees $T_1^{\uplus}$ and $T_2^{\uplus}$. To prove symmetry, it is necessary to demonstrate that the operations performed by the algorithm do not depend on the order of $T_1$ and $T_2$.
The first step involves identifying the common leaves, given by $CL(T_1, T_2) = L(T_1) \cap L(T_2)$. This operation is inherently symmetric because set intersection is commutative, ensuring $CL(T_1, T_2) = CL(T_2, T_1)$. The identification of distinct leaves is also order-independent because the same two set differences, $L(T_1) \setminus L(T_2)$ and $L(T_2) \setminus L(T_1)$, are computed regardless of which tree is processed first.
Next, for each pair of common leaves $(l_i, l_j)$, the distances $d^{(T)}(l_i, l_j)$ are computed within each tree. These distances are used for adjusting branch lengths and determining insertion points. Since distances are symmetric, $d^{(T)}(l_i, l_j) = d^{(T)}(l_j, l_i)$, this step does not introduce any asymmetry.
The selection of insertion points for distinct elements from $SD(T_1)$ into $T_2$, and from $SD(T_2)$ into $T_1$, employs temporary nodes and midnodes. This method applies the same criteria regardless of whether the tree is designated as $T_1$ or $T_2$.
Let $f(T_1, T_2)$ denote the function that produces the completed trees $T_1^{\uplus}$ and $T_2^{\uplus}$ through a series of operations $O$ involving the identification of common and distinct leaves, pairwise distance calculations, and distinct element integration. The symmetry of the algorithm implies that $f(T_1, T_2) = f(T_2, T_1)$. This indicates that the operations $O$ are commutative and order-independent.
In addition, the distance between the completed trees (see Equation (5)) is symmetric with respect to $T_1^{\uplus}$ and $T_2^{\uplus}$ because the squared differences satisfy $(d^{(T_1^{\uplus})}(l_i, l_j) - d^{(T_2^{\uplus})}(l_i, l_j))^2 = (d^{(T_2^{\uplus})}(l_i, l_j) - d^{(T_1^{\uplus})}(l_i, l_j))^2$. Therefore, the symmetry of the proposed tree completion algorithm guarantees that the completed trees $T_1^{\uplus}$ and $T_2^{\uplus}$ are structurally consistent and represent the evolutionary relationships accurately, regardless of the order in which $T_1$ and $T_2$ are processed.
Symmetry ensures that the algorithm treats the input trees $T_1$ and $T_2$ equally, without bias towards either tree. This property is important for the consistency of the algorithm, as it guarantees that the outcome does not depend on the order of the input trees, making the method robust and reliable.
Proposition 2 (Branch adjustment rates). Let T 1 and T 2 be phylogenetic trees defined on different but overlapping sets of taxa. The branch adjustment rates r(T 1 , T 2 ) and r(T 2 , T 1 ) are strictly positive. Furthermore, if the common leaves in both trees have identical pairwise distances, then r(T 1 , T 2 ) = r(T 2 , T 1 ) = 1.
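Proof. Branch adjustment rates are defined by Equation (2). All terms in the numerator and denominator of the equation are non-negative distances, and, assuming positive branch lengths, both sums are non-zero because each tree has at least one edge between any two common leaves. It follows that 0 < r(T 1 , T 2 ) and 0 < r(T 2 , T 1 ). If, moreover, d (T 1 ) (l i , l j ) = d (T 2 ) (l i , l j ) for all l i , l j ∈ CL(T 1 , T 2 ), then the numerator and the denominator coincide, so r(T 1 , T 2 ) = r(T 2 , T 1 ) = 1.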
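Now consider the scenario where d (T 1 ) (l i , l j ) ≥ d (T 2 ) (l i , l j ) for all l i , l j ∈ CL(T 1 , T 2 ), with strict inequality for at least one pair. In this case, r(T 1 , T 2 ) is greater than 1 and r(T 2 , T 1 ) is less than 1, indicating that distances between common leaves are longer overall in T 1 than in T 2 ; the reverse holds when the inequalities are reversed. A single pair with a longer distance in T 1 is not sufficient on its own, since the rate compares sums over all pairs of common leaves.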
The positivity of branch adjustment rates, along with their equality when common leaves have identical pairwise distances, ensures that the adjustments made to branch lengths are meaningful and preserve the evolutionary distances. This property is critical for maintaining the biological relevance and accuracy of the completed trees, ensuring that the algorithm reflects true evolutionary relationships.
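The behaviour described in Proposition 2 can be checked numerically with a small Python sketch. It assumes, following the description of the rates as a ratio of sums of pairwise distances between common leaves, that r(T 1 , T 2 ) divides the sum for T 1 by the sum for T 2 ; the function name `adjustment_rate` and the distance values are hypothetical.

```python
from itertools import combinations


def adjustment_rate(dist_a, dist_b, common):
    """Ratio of summed pairwise distances between common leaves (a over b).

    dist_a and dist_b map sorted leaf pairs to path lengths. This form of
    the rate is an assumption based on the textual description of Equation (2).
    """
    pairs = list(combinations(sorted(common), 2))
    total_a = sum(dist_a[p] for p in pairs)
    total_b = sum(dist_b[p] for p in pairs)
    if total_b <= 0:
        raise ValueError("sum of distances in the denominator must be positive")
    return total_a / total_b


common = {"A", "B", "D"}
# Hypothetical pairwise distances between common leaves in each tree.
d1 = {("A", "B"): 2.0, ("A", "D"): 4.0, ("B", "D"): 4.0}
d2 = {("A", "B"): 1.0, ("A", "D"): 2.0, ("B", "D"): 2.0}

r12 = adjustment_rate(d1, d2, common)   # distances in T1 are longer
r21 = adjustment_rate(d2, d1, common)   # reciprocal of r12
r11 = adjustment_rate(d1, d1, common)   # identical distances
```

Under this assumed form the two rates are reciprocals of each other, and identical pairwise distances give a rate of exactly 1.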
Proposition 3 (Preservation of leaf-leaf distances). Let T be a phylogenetic tree, and T ⊎ be its corresponding completed version. For any two leaves l i , l j ∈ L(T), the distance between them is preserved in the completed tree. That is, d (T ⊎ ) (l i , l j ) = d (T) (l i , l j ).
Proof. By definition, the tree completion algorithm adds distinct elements to the tree T to form T ⊎ without modifying the existing structure and distances among the initial leaves in L(T). This ensures that the paths and distances between original leaves l i and l j are unchanged.
Specifically, when a new leaf or subtree is added, it is appended in such a manner that it does not disrupt the pre-existing paths between any pair of leaves l i and l j in L(T).
The planting points for new elements are determined based on the common leaves and the calculated midnodes among temporary nodes, ensuring these new insertions do not shorten or lengthen the original distances between any two common leaves (l i , l j ).
The distance d (T) (l i , l j ) in the original tree T is defined as the sum of branch lengths along the unique path from l i to l j . If an insertion subdivides an edge on this path, the lengths of the two resulting edges sum to the original edge length, so the total length of the path is unaltered in T ⊎ . Thus, the distance d (T ⊎ ) (l i , l j ) remains the same as d (T) (l i , l j ).
Since the completion process adds new elements without changing the total branch length along any path between original leaves, it follows that for any two leaves l i , l j ∈ L(T), the distance between them is preserved in the completed tree T ⊎ . Thus, d (T ⊎ ) (l i , l j ) = d (T) (l i , l j ).
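The argument above can be illustrated with a short Python sketch. It assumes a tree stored as a nested adjacency dictionary and an insertion that subdivides an existing edge and hangs a new leaf at the subdivision point; the helper names `path_distance` and `attach_leaf`, the node labels, and the branch lengths are all hypothetical.

```python
def path_distance(tree, source, target):
    """Sum of branch lengths on the unique path between two nodes of a tree."""
    stack = [(source, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == target:
            return dist
        for nxt, length in tree[node].items():
            if nxt != parent:
                stack.append((nxt, node, dist + length))
    raise KeyError(target)


def attach_leaf(tree, u, v, offset, new_node, new_leaf, leaf_length):
    """Subdivide edge (u, v) at `offset` from u and hang a new leaf there."""
    length = tree[u].pop(v)
    tree[v].pop(u)
    tree[u][new_node] = offset
    tree[v][new_node] = length - offset
    tree[new_node] = {u: offset, v: length - offset, new_leaf: leaf_length}
    tree[new_leaf] = {new_node: leaf_length}


# Hypothetical tree: internal node x with leaves A, B and D.
tree = {
    "x": {"A": 1.0, "B": 2.0, "D": 3.0},
    "A": {"x": 1.0},
    "B": {"x": 2.0},
    "D": {"x": 3.0},
}
pairs = [("A", "B"), ("A", "D"), ("B", "D")]
before = {p: path_distance(tree, *p) for p in pairs}

# Insert a new leaf C on the edge between x and D.
attach_leaf(tree, "x", "D", 0.5, "p", "C", 1.5)
after = {p: path_distance(tree, *p) for p in pairs}
```

In this sketch the edge between x and D is replaced by two edges of lengths 0.5 and 2.5, so every path between original leaves keeps its total length.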
This proposition underscores the integrity of the original phylogenetic tree during the completion process. This aspect is important for preserving the biological significance of the tree. The preservation of the original leaf-to-leaf distances, both among the common leaves and between distinct leaves, ensures that the completed tree continues to reflect the evolutionary relationships and distances initially depicted in the original tree T.
Proposition 4 (Multifurcation). The BSD(+) distance can be applied to non-binary trees with multifurcations.
Proof. Consider T ⊎ 1 and T ⊎ 2 as two completed phylogenetic trees, potentially including multifurcations, and defined on the same set of taxa L(T ⊎ ). Internal nodes may bifurcate (in binary trees) or multifurcate (in non-binary trees). The BSD(+) distance between these trees is defined by Equation (5).
For any two leaves l i , l j ∈ L(T ⊎ ), the path P (T ⊎ k ) (l i , l j ) connecting them is unique, due to the acyclic nature of phylogenetic trees. This holds true for both binary and non-binary (multifurcated) structures.
The distance d (T ⊎ k ) (l i , l j ) is the sum of the lengths of the edges along the path P (T ⊎ k ) (l i , l j ) (see Equation (1)). This calculation depends solely on the path, not on the branching structure at each node, indicating that the presence of multifurcations does not alter the fundamental distance calculation between leaf pairs. Therefore, the computation of BSD(+)(T ⊎ 1 , T ⊎ 2 ), which aggregates the squared differences of these pairwise distances across T ⊎ 1 and T ⊎ 2 , remains valid and meaningful regardless of whether the trees are binary or multifurcating.
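The following Python sketch illustrates why multifurcations pose no obstacle: pairwise distances are obtained by summing edge lengths along paths, whatever the degree of the internal nodes. The function names and example trees are hypothetical, and the sketch returns the plain sum of squared differences; any square root or normalisation that Equation (5) may apply is not reproduced here.

```python
def leaf_distances(tree):
    """Pairwise path lengths between all leaves (nodes of degree 1)."""
    leaves = sorted(n for n, nbrs in tree.items() if len(nbrs) == 1)

    def distances_from(source):
        # Depth-first traversal accumulating branch lengths from `source`.
        seen, stack = {source: 0.0}, [source]
        while stack:
            node = stack.pop()
            for nxt, length in tree[node].items():
                if nxt not in seen:
                    seen[nxt] = seen[node] + length
                    stack.append(nxt)
        return seen

    result = {}
    for a in leaves:
        dist = distances_from(a)
        for b in leaves:
            if a < b:
                result[(a, b)] = dist[b]
    return result


def sum_squared_differences(tree_a, tree_b):
    """Sum over leaf pairs of squared differences in path length."""
    da, db = leaf_distances(tree_a), leaf_distances(tree_b)
    if da.keys() != db.keys():
        raise ValueError("trees must be defined on the same leaf set")
    return sum((da[p] - db[p]) ** 2 for p in da)


# Star tree: a single multifurcating internal node x.
star = {
    "x": {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0},
    "A": {"x": 1.0}, "B": {"x": 1.0}, "C": {"x": 1.0}, "D": {"x": 1.0},
}
# Binary tree ((A,B),(C,D)) with internal nodes u and v.
binary = {
    "u": {"A": 1.0, "B": 1.0, "v": 1.0},
    "v": {"C": 1.0, "D": 1.0, "u": 1.0},
    "A": {"u": 1.0}, "B": {"u": 1.0}, "C": {"v": 1.0}, "D": {"v": 1.0},
}
score = sum_squared_differences(star, binary)
```

Here four of the six leaf pairs differ by one unit of path length between the two trees, so the sum of squared differences is 4.0, and the same code handles the multifurcating and the binary tree without any special casing.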
The ability to apply the BSD(+) distance to non-binary trees with multifurcations increases the versatility and applicability of the algorithm. Many real-world phylogenetic trees are not strictly binary; thus, accommodating multifurcations allows the algorithm to be used in a broader range of scenarios, enhancing its utility and relevance in evolutionary studies. However, BSD(+) application requires careful consideration of how multifurcations are represented and interpreted within this framework. The utility and accuracy of using BSD(+) in this context will depend on the specific characteristics of the trees being compared and the biological implications of their multifurcating structures.
Proposition 5 (Consideration of topology and branch lengths). The proposed tree completion algorithm integrates considerations of both the topology (the arrangement and relationships between nodes) and the branch lengths (quantitative measures of evolutionary distance) of the original trees T 1 and T 2 to produce completed trees T ⊎ 1 and T ⊎ 2 .
Proof. The proposed tree completion algorithm inserts new leaves and subtrees at the planting points determined by midnodes and temporary nodes, which are positioned based on the distances to common leaves in the original trees T 1 and T 2 .
Let L(T ⊎ ) = L(T 1 ) ∪ L(T 2 ). For any new leaf l new added to the tree, its position is determined by calculating the planting point among the selected temporary nodes, which are based on distances related to common leaves. This process ensures that the topological arrangement of leaves in T ⊎ 1 and T ⊎ 2 respects the evolutionary relationships inferred from T 1 and T 2 .
In addition, the tree completion algorithm incorporates maximal distinct-leaf subtrees (if any) from one tree into the other. This process is critical for topology consideration, because these subtrees represent significant evolutionary branches that must be integrated while preserving the phylogenetic relationships.
Branch lengths in T ⊎ 1 and T ⊎ 2 are adjusted using rates r(T 1 , T 2 ) and r(T 2 , T 1 ), calculated as the ratio of sums of pairwise distances between common leaves in T 1 and T 2 . These rates reflect how branch lengths should be scaled to align the branch lengths of the two trees. For any branch b in T ⊎ 1 or T ⊎ 2 corresponding to a new insertion, the length is rescaled by the corresponding adjustment rate, as given in Equation (6). This ensures that the branch lengths in the completed trees reflect the evolutionary distances measured in the initial trees, adjusted for the context of the completion.
The BSD(+) distance between the completed trees T ⊎ 1 and T ⊎ 2 is calculated by considering the pairwise distances between leaves, which are the sums of branch lengths along the paths connecting the leaves. Although the BSD(+) distance does not directly quantify dissimilarities in tree topology as purely topological measures do, it involves topology indirectly through the paths between leaves and involves branch lengths directly through the sum of lengths along these paths. The topology involvement is indirect because BSD(+) calculates the sum of squared differences in pairwise distances between leaves across the two trees, and these pairwise distances are determined by the paths through the tree topology from one leaf to another and by the lengths of the branches that compose those paths. Thus, the BSD(+) formula encapsulates both topological and branch length considerations by evaluating the aggregate difference in these pairwise distances between T ⊎ 1 and T ⊎ 2 .
Remark 1. The use of branch lengths in the phylogenetic tree distance metric, instead of solely relying on tree topology, provides numerous advantages, especially in terms of biological insights.
Branch lengths in a phylogenetic tree often represent evolutionary time or genetic change. Incorporating branch lengths can provide a more nuanced and accurate picture of the evolutionary relationships between species, reflecting not just how they are related but also how much they have diverged from each other. In addition, when branch lengths are considered, the distance metric can better differentiate between trees that are topologically similar but differ significantly in how branch lengths are distributed. This can be crucial in scenarios where slight changes in branch length represent important evolutionary events. Two trees might have a similar structure (topology), but if the lengths of the corresponding branches are significantly different, it indicates a greater evolutionary divergence. Furthermore, distance measures that use branch lengths can be sensitive to more subtle evolutionary changes that are not apparent from topology alone. This sensitivity is critical in studies where small genetic differences are significant, such as in closely related species or in populations of the same species. Finally, using branch lengths can make comparisons between different phylogenetic trees more robust, especially when those trees are derived from different datasets or methods.
Considering the importance of using branch lengths in the distance metric between phylogenetic trees for gaining biological insights, the introduction of branch adjustment rates takes this a step further by ensuring that the evolutionary distances expressed in branch lengths are comparable between different trees. These rates standardize evolutionary rates across disparate datasets and methodologies, making it possible to compare trees more effectively. They adjust branch lengths to a common scale, allowing for meaningful evaluations of evolutionary time and genetic change, which is crucial when different trees may scale these distances differently. This is particularly significant in cases where trees might share similar topologies but differ markedly in their branch distributions, as it allows the distance metric to differentiate between trees that, while structurally similar, differ substantially in their evolutionary paths. Such differentiation is essential in scenarios where even slight variations in branch length can indicate important evolutionary developments. Practical examples of these adjustments include their use in supertree construction, where they help integrate multiple partial trees into a single coherent whole, and in comparative phylogenetics, where they enable more accurate analyses of evolutionary relationships and rates across different organisms or genes, facilitating deeper insights into biodiversity and evolution.

Conclusions
In conclusion, the proposed method for tree completion and distance calculation between phylogenetic trees defined on different but overlapping sets of taxa is designed to address the existing limitations of distance-related measures. Our approach exploits important properties of phylogenetic trees, such as branch length, topology information through the arrangement and relationships between nodes, and leaf names. Additionally, the proposed distance measure is a metric that can be applied to both binary and non-binary trees and is computed in polynomial time. Incorporating branch lengths into phylogenetic tree distance metrics enhances the biological relevance and interpretative power of phylogenetic analyses, providing deeper insights into evolutionary processes and relationships. The proposed tree completion algorithm is designed to respect evolutionary relationships and preserve the structural integrity of the compared phylogenetic trees.
Building on this foundation, the ability of our algorithm to integrate leaves from different trees into a unified taxonomic framework supports significant advancements in comparative genomics and phylogenetics. By enabling the insertion of leaves from one phylogenetic tree into another while maintaining a consistent taxonomic framework, our algorithm significantly aids in the synthesis of comprehensive phylogenetic trees, often referred to as 'supertrees'. These supertrees are particularly useful in scenarios where partial data from various studies need to be integrated to form a more complete evolutionary picture. Moreover, our approach is crucial in studying evolutionary alternatives among sets of genes, where clusters of genes may exhibit similar evolutionary trajectories but are often studied in isolation. Our algorithm allows for the integration of these clusters into alternative phylogenetic trees, providing a holistic view of potential evolutionary paths and helping researchers to understand complex genetic relationships.
Future work in a more advanced comparison framework includes comparing the proposed tree completion approach and the BSD(+) distance with the relevant phylogenetic tree distance measures, as well as the BSD(−) approach, which involves pruning non-common leaves from both trees before calculating the distance. Experiments with different scenarios will be conducted using biological and simulated data [39,40]. Furthermore, various modifications of the proposed tree completion algorithm can be implemented and evaluated. In particular, the number of common leaves involved in finding planting points for new leaves can be limited to a few common leaves closest to the distinct leaf or maximal distinct-leaf subtree. Additional strategies for determining a representative point among temporary nodes can also be investigated and tested.

Algorithm 1: Finding distinct elements
Input: Phylogenetic tree T and its set of distinct leaves DL(T)
Output: Set of distinct elements SD(T)
Function FindDistinctElements(T, DL(T)):
    SD(T) ← ∅;
    foreach leaf_start in DL(T) do
        if leaf_start is not marked as visited then
            current_subtree ← ∅;
            Mark leaf_start as visited;
            Initialize a queue Q for breadth-first traversal;
            Enqueue leaf_start into Q;
            while Q is not empty do
                current_leaf ← dequeue from Q;
                Add current_leaf to current_subtree;
                foreach neighbor_leaf of current_leaf in T do
                    if neighbor_leaf is in DL(T) and is not marked as visited then
                        Enqueue neighbor_leaf into Q;
                        Mark neighbor_leaf as visited;
            Add current_subtree to SD(T);
    return SD(T)
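The breadth-first grouping in Algorithm 1 can be sketched in Python as follows. The pseudocode does not spell out how the neighbor leaves of a leaf are obtained from T, so this sketch assumes the caller supplies that relation as a dictionary `leaf_neighbors`; this mapping, the function name, and the example labels (including the extra leaf H) are hypothetical.

```python
from collections import deque


def find_distinct_elements(leaf_neighbors, distinct_leaves):
    """Group distinct leaves into connected sets by breadth-first traversal.

    leaf_neighbors maps each leaf to the leaves treated as its neighbors in T
    (an assumed input; how neighbors are derived from T is not shown here).
    """
    visited = set()
    elements = []
    for start in sorted(distinct_leaves):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        current_subtree = set()
        while queue:
            leaf = queue.popleft()
            current_subtree.add(leaf)
            for neighbor in leaf_neighbors.get(leaf, ()):
                if neighbor in distinct_leaves and neighbor not in visited:
                    visited.add(neighbor)
                    queue.append(neighbor)
        elements.append(current_subtree)
    return elements


# Hypothetical neighbor relation: E, F and G sit together, H is isolated.
leaf_neighbors = {
    "E": {"F"},
    "F": {"E", "G"},
    "G": {"F", "A"},
    "H": {"B"},
    "A": {"G"},
    "B": {"H"},
}
distinct = {"E", "F", "G", "H"}
elements = find_distinct_elements(leaf_neighbors, distinct)
```

Each distinct leaf is enqueued at most once, so the traversal visits every distinct leaf and every listed neighbor relation a bounded number of times.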

Figure 1. Phylogenetic trees defined on different but mutually overlapping sets of taxa. Common taxa are colored red. Tree T 1 (a) has one distinct leaf (C). Tree T 2 (b) includes three distinct leaves (G, F, and E) that form a distinct-leaf subtree.

Figure 2. Temporary nodes and planting points. Temporary nodes are marked in black for both trees. Tree T 1 (a) has 4 temporary nodes (denoted as s 1 , s 2 , s 3 , and s 4 in (a)). Tree T 2 (b) also contains 4 temporary nodes (labeled as c 1 , c 2 , c 3 , and c 4 in (b)). The planting points found using Algorithm 2 are marked in blue.

Figure 3. Completed trees T ⊎ 1 and T ⊎ 2 . Newly added internal nodes and leaves with their adjusted branch lengths are colored in blue.

Theorem 2 (Computational complexity). Let T 1 and T 2 be phylogenetic trees defined on different but overlapping sets of taxa, T ⊎ 1 and T ⊎ 2 be their corresponding completed versions, n = |L(T 1 )| + |L(T 2 )|, and k = |SD(T 1 )| + |SD(T 2 )|. The computational complexity of the proposed algorithm for completing the phylogenetic trees T 1 and T 2 and computing the distance BSD