Family-Joining: A Fast Distance-Based Method for Constructing Generally Labeled Trees

The widely used model for evolutionary relationships is a bifurcating tree with all taxa/observations placed at the leaves. This is not appropriate if the taxa have been densely sampled across evolutionary time and may be in a direct ancestral relationship, or if there is not enough information to fully resolve all the branching points in the evolutionary tree. In this article, we present a fast distance-based agglomeration method called family-joining (FJ) for constructing so-called generally labeled trees in which taxa may be placed at internal vertices and the tree may contain polytomies. FJ constructs such trees on the basis of pairwise distances and a distance threshold. We tested three methods for threshold selection, FJ-AIC, FJ-BIC, and FJ-CV, which minimize Akaike information criterion, Bayesian information criterion, and cross-validation error, respectively. When compared with related methods on simulated data, FJ-BIC was among the best at reconstructing the correct tree across a wide range of simulation scenarios. FJ-BIC was applied to HIV sequences sampled from individuals involved in a known transmission chain. The FJ-BIC tree was found to be compatible with almost all transmission events. On average, internal branches in the FJ-BIC tree have higher bootstrap support than branches in the leaf-labeled bifurcating tree constructed using RAxML. 36% and 25% of the internal branches in the FJ-BIC tree and RAxML tree, respectively, have bootstrap support greater than 70%. To the best of our knowledge the method presented here is the first attempt at modeling evolutionary relationships using generally labeled trees.


Introduction
Phylogenetic trees are models of evolutionary relationships.The general approach in phylogenetics is to represent evolutionary relationships using bifurcating trees with sampled taxa (represented by so-called labeled vertices) placed at the leaves.Neighbor-joining (NJ) is a popular method for constructing such trees and uses distances between each pair of taxa.Such trees have the maximum number of unsampled ancestors (represented by so-called latent vertices), each ancestor corresponding to a vertex comprising a branching point in the tree.This approach does not allow the labeled vertices to share an ancestordescendant relationship, and thus may not be appropriate for data sets that have been densely sampled with respect to evolutionary time, for example, genomic sequences of pathogens that have been sampled from individuals who are part of the same transmission chain.
To account for ancestor-descendant relationships Jombart et al. (2011) model evolutionary relationships using a directed acyclic graph in which each edge is directed from a parent to its child.This graph does not contain any latent vertices and is not necessarily connected.In case the graph is disconnected, it is an incomplete representation of the evolutionary relationships among all the labeled vertices.
In related work Gavryushkina et al. (2014) provide a method for constructing so-called sampled ancestor (SA) trees in which labeled vertices come to be placed at internal vertices by contracting terminal branches.The authors do this in a Bayesian inference framework where trees are generated under a model that does not allow labeled vertices to have degree greater than two and, in addition, does not allow latent vertices to have degree greater than three.
Two distance-based algorithms, recursive grouping (RG) and Chow-Liu recursive grouping (CLRG), have been developed by Choi et al. (2011) for constructing trees which may contain latent vertices with degree greater than two and labeled vertices with degree greater than 0 (so-called generally labeled trees).
The authors additionally developed NJc, a method for constructing generally labeled trees by initially constructing a tree using NJ and subsequently contracting all branches that are incident to a latent vertex and are smaller than a preselected threshold.The performance of RG, CLRG, and NJc was compared on simulated data where only the tree topology was varied.In that study, no method clearly outperformed the others.
We developed a distance-based agglomeration method called family-joining (FJ).FJ iteratively identifies, on the basis of a distance threshold, vertices that are in a parent-child or sibling relationship, and introduces latent vertices if required.After inferring all the edges, the branch lengths are estimated using ordinary least-squares (OLS) regression.
RG, CLRG and FJ require the setting of a threshold that determines the model complexity (number of branches) of the output tree.We tested three approaches to threshold selection which minimized Bayesian information criterion (BIC), Akaike information criterion (AIC), and cross-validation (CV) error, respectively.
We compared the performance of FJ-BIC, FJ-AIC, FJ-CV with NJc-BIC, RG-BIC, CLRG-BIC and SA across diverse simulation scenarios.We applied FJ-BIC to an HIV-1 transmission chain data set (Vrancken et al., 2014) and checked if the known transmission events were compatible with the FJ-BIC tree.Additionally in the analysis of HIV-1 sequences, we compared the bootstrap support of branches in the FJ-BIC tree and the maximum likelihood tree constructed using RAxML (Stamatakis, 2006).

An overview of family-joining
The family-joining (FJ) method consists of a distance-based agglomeration algorithm for constructing generally labeled trees, and an efficient algorithm for computing ordinary least-squares (OLS) branch lengths.Trees are inferred using the following agglomeration procedure.We initialize a vertex set with all labeled vertices.At each iteration we select from the vertex set, the vertex pair that optimizes the neighbor-joining objective, as defined by Saitou and Nei (1987), see eq. ( 1) in Materials and Methods.We classify the selected vertex pair as being either parent-child or siblings on the basis of a threshold , see eq. ( 2) in Materials and Methods.If they are found to be siblings we check if there is another vertex that is the parent of both the siblings.If no such vertex is found, a latent vertex is introduced as the parent of both the siblings.The distance matrix is augmented by adding distances from the newly introduced latent vertex to each of the other vertices, obtained using the formula described in Studier and Keppler (1988), see eq. ( 5) in Materials and Methods.Rows and columns of the distance matrix corresponding to the children are removed, and the procedure is iterated until a connected graph is obtained.Subsequently, we estimate branch lengths using ordinary least-squares (OLS) regression.
For efficient calculation of OLS branch lengths we extended the algorithm by Bryant (1997), which was designed for leaf-labeled trees, to generally labeled trees.OLS branch lengths may be negative, which has no biological interpretation.To account for this, after estimating the branch lengths, all branches that are shorter than and are incident to a latent vertex are contracted.Overall, the procedure is similar to constructing the neighbor-joining tree followed by contracting short branches.We demonstrate FJ by applying it to a tree-additive distance matrix.A distance matrix is treeadditive if there exists a tree, in which the distance between each pair of labeled vertices is equal to the corresponding sum of lengths of the branches that lie along the unique path between the two vertices.

An example using tree-additive distances
We simulated a generally labeled tree and computed corresponding tree-additive distances.We applied FJ to the resulting tree-additive distance matrix and describe the major steps below.See Fig. 1 for an illustration.The first iteration identified O1 and O2 as neighbors that share a sibling relationship.No parent was found for these siblings and a latent vertex L1 was introduced.Distances between L1 and vertices O3 through O9 were calculated and the rows and columns corresponding to O1 and O2 were removed from the distance matrix.Edges were added between L1 and O1, and between L1 and O2.The second iteration found O4 and O5 as neighbors that share a parent-child relationship with O4 being the parent.An edge was added between O4 and O5, and O5 was removed from the distance matrix.The following two iterations identified neighbors that are siblings with no parent thus introducing two latent  vertices L2 and L3.The sibling pairs found in the third and fourth iteration are (L1, O3) and (L2, O4) respectively.The fifth iteration identified L3 and O6 as siblings, both of which are the children of O9.
Similarly, the next iteration found O9 to be the parent of both O7 and O8.The final step involved estimating branch lengths using ordinary least-squares.The estimated branch lengths are identical to the corresponding branch lengths in the simulated tree.
3 Results and Discussion

Simulated data
Simulated data sets were constructed by varying either the tree type, proportion of labeled internal vertices, type of contracted edge, number of labeled vertices, sequence length or branch length.Each of these parameters is described in detail below.An overview of the parameter settings is provided in Table 1.Three types of binary trees were generated: balanced, unbalanced and random.Unbalanced or ladderlike trees have the largest diameter among all the trees with the same number of vertices.The diameter of a tree is the number of edges that lie on the path in the tree with the maximum number of edges.We chose this tree type because it has been shown that the accuracy of the neighbor identification step (1), which forms a part of FJ, is inversely related to tree diameter St. John et al. (2003).A balanced tree is complementary to an unbalanced tree and has the smallest diameter possible.
The fraction of latent vertices ranges from zero to (n − 2)/(2n − 2) where n is the number of labeled vertices.We simulated trees by varying the fraction of latent vertices over this range in four equal steps.
Trees with the desired proportion of labeled vertices were constructed by contracting edges of a binary tree.Depending on the type of simulation experiment, the following edges were contracted: leaf/latent, labeled/latent, latent/latent, and any/latent.
For each setting of tree type, fraction of latent vertices, and edge type, we randomly generated corresponding types of binary trees and contracted randomly selected edges of the appropriate type, until the desired fraction of latent vertices was reached.Once the topology was generated, branches were assigned lengths by uniformly sampling numbers between 1 and 100, and scaling them such that the expected branch length was equal to a preselected branch length average.Branch length averages took values of 0.001, 0.004, 0.016, 0.064, and 0.256 subs/site.A vertex was randomly selected as the root and sequences were evolved along the branches according to a GTR+Γ model of substitution Lanave et al. (1984).The parameters of the GTR model were set using estimates from a real data set Waddell and Steel (1997).The parameters shape and scale of the Γ model were set to 1 which resulted in a moderate variation of substitution rate across sites.Seq-Gen was used for simulating sequence evolution Rambaut and Grassly (1997).Sequence lengths took values of 250, 500, 1000, 2000, and 4000 nt.The number of labeled vertices (taxa) took values of 20, 40, 80, 160, and 320.Simulation scenarios were defined by varying each parameter over its range while keeping the remaining parameters fixed at their default setting.The default settings for each parameter are described below.Note that this procedure would result in 22 different parameter combinations.We simulated the corresponding 22 scenarios.
For the categorical parameters tree type and contracted edge type, the respective default settings were random and any/latent.These settings were selected as the defaults as they do not restrict the generation of generally labeled trees.
For the continuous parameter, fraction of vertices that are latent, which has a bounded range the midpoint was considered as the default value.
For the following continuous parameters with no upper bound: number of labeled vertices, sequence length, and average branch length, we selected the appropriate range and default settings such that the trend in performance over each parameter range would be apparent.
The default setting for the number of labeled vertices was 160, for the sequence length it was 1000 nt, for the average branch length was 0.016 subs/site.
For each setting of parameter values, 100 trees and corresponding sequences were simulated.For distance-based methods we computed pairwise distances using ML distance estimates under a GTR+Γ model, computed using RAxMLv8.2.8 Stamatakis (2014).For SA which constructs rooted trees we provided sampling times for each labeled vertex.This was done by randomly selected a vertex as the root and defining the sampling time for each labeled vertex as the path length from the root.Note that this method of defining sampling times is equivalent to assuming a strict molecular clock with a clock rate of 1.0.When substitution rates (subs./site/time)follow a strict molecular clock, the distance from the root to each labeled vertex is proportional to the time elapsed since divergence from the root.SA recovers the correct clock rate of 1.0 under the strict molecular clock model in all scenarios except two where the average branch length is very small (0.001 and 0.004; see Supplementary Fig. 3)

Performance metrics
Precision and recall were used to quantify the accuracy of the various methods at reconstructing the simulated trees.These metrics are defined below.
where S and Ŝ are the set of splits corresponding to the simulated tree T and the reconstructed tree T , respectively.Please note that S contains the split of every branch in T , including the terminal branches.Precision and recall range from zero to one.Precision is equal to one only if all the splits in the reconstructed tree are present in the simulated tree.Similarly, recall is equal to one only if all the splits in the simulated tree are present in the reconstructed tree.Please note that we do not report Robinson-Foulds distance, which is popularly used for quantifying reconstruction accuracy, since it would be biased against methods that do not allow polytomies.Each of the reconstruction methods that we tested can achieve the highest and the lowest possible value of recall.Among the reconstruction methods that were compared, only SA can not achieve a precision of one if the simulated tree contains polytomies.We feel that both precision and recall are important measures of reconstruction accuracy.

Results of comparative study on simulated data
We present the results of applying FJ-BIC, NJc-BIC, RG-BIC, CLRG-BIC and SA to all simulated data sets.For methods which have the suffix BIC, we performed threshold selection by minimizing Bayesian information criterion (BIC).For FJ, we also tested FJ-AIC and FJ-CV which optimized Akaike information criterion (AIC), and cross-validation error (CV), respectively.As FJ-AIC and FJ-CV never performed higher than FJ-BIC in any simulation scenario we do not show the results in the main paper.These results are shown in Supplementary Fig. 4. A change in precision or recall is considered to be statistically significant if the corresponding Welch's t-test has a p-value that is smaller than 0.01.A method is said to have significantly high precision or recall if no other method has significantly higher precision or recall, respectively.

Type of tree
Precision/Recall 0.2 0.4 0.6 0.8 1.0 q q q q q q balanced random(d) unbalanced Type of contracted edge q q q q q q q q leaf/latent labeled/latent any/latent(d) latent/latent B Fraction of latent vertices Precision/Recall 0.2 0.4 0.6 0.8 1.0 q q q q q q q q q q 0.5 0.37 0.25(d) 0.12 0

C
Average branch length (subs./site)q q q q q q q q q q 0.001 0. 1.0 q q q q q q q q q q 20 40 80 160(d) 320

E
Sequence length (nt) q q q q q q q q q q 250 500 1000

Tree type
Both FJ-BIC and NJc-BIC have significantly higher precision and recall on balanced trees than on unbalanced trees.This behavior is expected, since the accuracy of the step of FJ, in which neighbors are identified, is inversely related to tree diameterSt.John et al. (2003).Even on unbalanced trees, which have large diameters, FJ-BIC and NJc-BIC have moderately large (median) precision/recall values of 0.79/0.81and 0.76/0.87respectively (see Fig. 2A).Similarly RG-BIC performs low on unbalanced trees than on balanced trees, which is in agreement with previous work Choi et al. (2011).RG iteratively partitions the entire vertex set into families.Balanced trees and unbalanced trees have n leaves /2 and two families of size two, respectively.This suggests that RG has a higher error rate for unbalanced trees than for balanced trees.In contrast, CLRG-BIC performs significantly higher on unbalanced trees than on balanced trees with median precision/recall values of 0.89/0.93 and 0.89/0.91,respectively.CLRG constructs the MST and then iteratively applies RG to the neighborhood of each internal vertex.The higher performance of CLRG-BIC on unbalanced trees is most likely due to the MST being topologically close to the unbalanced tree.SA has a median precision and recall of 0.77 and 0.96, respectively, across all tree types.The comparatively lower precision of SA is due to this methods constructing trees in which a labeled vertex can only have up to one descendant and all other internal vertices have degree three.Subsequently this results in trees with excess branches if the true tree contains polytomies.

Type of contracted edge
FJ-BIC has a significantly higher precision than other methods for all types of contracted edges, except latent/latent.SA has a high median recall of 0.96 for all types of contracted edges.However the recall values of SA are not significantly higher than those of FJ-BIC if the contracted edge is leaf/latent.This is due to a large variance in the performance of SA, quantified with an inter-quantile range of 0.26 (see Fig. 2B).SA has high median precision of 0.94 if the contracted edge is leaf/latent.Contracting leaf/latent edges results in trees in which a labeled vertex can have up to one descendant and all other internal vertices have degree three.The high performance of SA in this category is because these are the same type of trees which SA samples when optimizing tree topology.SA has lower performance when any other edge type is contracted.RG-BIC and CLRG-BIC have significantly higher precision and recall if latent/latent edges are contracted, when compared to precision and recall for other edge types.

Fraction of vertices that are latent
For leaf-labeled trees which have a maximal fraction (0.5) of latent vertices, all methods have a median precision higher than 0.95 (see Fig. 2C).In this simulation scenario, with a median recall of 0.97, SA has significantly higher recall than other methods, even though FJ-BIC also has a high median recall of 0.94.In general, precision reduces and recall rises with a decrease in the fraction of latent vertices.FJ-BIC has a median precision and recall that is greater than 0.89 across all settings of fraction of latent vertices.CLRG-BIC has a significantly higher precision and recall than other methods when all vertices are labeled.This is because the CLRG algorithm involves the construction of a MST which should be topologically similar to the completely labeled tree.
Average branch length (substitution rate) All methods perform badly on trees with short average branch lengths of 0.001 subs/site with median recall smaller than 0.8 each (see Fig. 2D).This is because a significant portion of the simulated sequences are identical.Thus, in FJ-BIC, NJc-BIC, RG-BIC, and CLRG-BIC there is a preference for choosing parent-child relationship ovSoundser siblings.CLRG-BIC has significantly higher precision than other methods if branch lengths are either very small or very large.FJ-BIC has high precision if the average branch length is between 0.004 and 0.064.In trees with larger branch lengths there is a high chance that sequences undergo multiple substitutions at the same site.This effect has been termed genetic saturation and results in an underestimation of the true evolutionary distance.Additionally, estimates of large distances are associated with large variance Hoyle and Higgs (2003) which results in the selection of wrong neighbors in the neighbor-joining step.CLRG-BIC has higher performance for trees with large branch lengths because the input to CLRG-BIC is the MST and the construction of the MST is probably robust to noise in distance estimates.The performance of SA is not greatly affected by long branches.

Number of labeled vertices (taxa)
The performance of all the methods is expected to worsen with increasing number of labeled vertices.RG shows significant change in precision/recall with corresponding median values changing from 0.88/0.75(5 labeled vertices) to 0.83/0.61(80 labeled vertices) (see Fig. 2E).The change in precision and recall shown by SA is not significant.FJ-BIC and CLRG-BIC show a significant drop in precision but not in recall.Even for trees with 320 taxa, FJ-BIC has high median precision and recall of 0.92 and 0.9 respectively.NJc-BIC shows significant change in both precision and recall with median precision/recall changing from 0.93/0.93 to 0.89/0.91.

Sequence length
The performance of all methods improves with increase in sequence length.For all settings of sequence length, FJ-BIC is among the methods with significantly high precision (see Fig. 2F).FJ-BIC is among the methods with significantly high recall for sequences of length 1000 nt to 4000 nt.For all settings of sequence length, SA is among the methods with significantly high recall.

Summary of performance
For the simulations performed at the default parameter settings, the methods listed in order of decreasing median precision are FJ-BIC (0.93), NJc-BIC (0.9), CLRG-BIC (0.89), RG-BIC (0.82), and SA (0.77), and the methods listed in order of decreasing median recall are SA (0.96), NJc-BIC (0.92), CLRG-BIC (0.92), FJ-BIC (0.91) and RG-BIC (0.63).In 15 out of the 22 simulated scenarios FJ-BIC is among the methods with significantly high precision.In 17/22 simulated scenarios SA is among the methods with significantly high recall.In 13/22 simulated scenarios NJc-BIC is among the methods with significantly high recall.FJ-BIC has a median recall that is greater than 0.9 in 16/22 simulated scenarios.The remaining scenarios are (i) trees with 20 taxa (recall of 0.89), (ii) trees in which branches are very short (0.001 and 0.004 subs/site; recall of 0.6 and 0.84 respectively), (iii) unbalanced trees (0.81), and (iv) trees constructed using short sequences (250 and 500 nt; recall of 0.77 and 0.85 respectively).

Comparison of time-complexities and run times
Clustering methods are deterministic procedures for which we report worst-case run times.Both FJ and NJ run in time O(n 3 ).RG runs in time O(n 4 ) which makes it infeasible to run on large datasets.CLRG runs in O(n 2 log n + niδ 3 max (MST)) where ni is the number of internal vertices of the MST and δmax(MST) is the largest vertex degree in the MST.Model selection with BIC or AIC requires the repeated optimization of the likelihood function with respect to parameters of the substitution model.Computing the likelihood with Felsenstein's dynamic programming algorithm Felsenstein (1981) takes O(nA 2 L) time where L is the sequence length and A is the size of the alphabet.A is four for genetic sequences and 20 for protein sequences.We used RAxML for computing and optimizing likelihoods; RAxML is highly optimized for this task.SA performs Bayesian inference by MCMC sampling, a stochastic procedure whose runtime depends on how easily the MCMC chain moves through the space of trees and model parameters.The observed run times (see Fig. 3) suggest that FJ-BIC and NJc-BIC are the fastest methods for trees containing up to 320 taxa, with both the methods having a median run time of 5.4 and 4.8 minutes respectively.CLRG-BIC took around 9.3 minutes to reconstruct trees containing 320 taxa and showed the slowest growth in run time.RG showed the largest growth in run time taking 4.8 hours for reconstructing trees with 320 taxa.SA was run with MCMC chain-length set to 10 8 states.SA took around two hours to construct trees containing 20 taxa and 30 hours for constructing trees containing 320 taxa.

HIV-1 transmission chain data
We applied FJ-BIC to a dataset of HIV-1 subtype C env gene sequences that were sampled from 11 hosts who are part of a partially known transmission chain Vrancken et al. (2014).We selected this dataset because it contains sequences from viruses that are evolutionarily closely related.We discarded 31 sequences which had gaps and analyzed the remaining 181 sequences of length 1376 nt.The hosts are labeled A, B, C, D, E, F, G, H, I, K, and L. Sequences from multiple time points are available for A, B, C, D, E, and H.The sampling times for all sequences are known.All the pairs of hosts who were involved in a transmission event are known and have been inferred by interviewing the hosts.The direction of transmission is known for all transmission events except for the transmission between A and

B.
Additionally we compared the bootstrap support of branches in the FJ-BIC tree with the branches in the maximum likelihood tree constructed using RAxMLv8.2.Stamatakis (2006).We first identified the most appropriate model of substitution using JModelTest2 Darriba et al. (2012).A BIONJ tree Gascuel (1997) was constructed with Jukes-Cantor distances and AIC was computed for the following models of substitution: JC, F81, K80, HKY, TrNef, TrN, TPM1, TPM1uf, SYM, GTR.Variants of all substitution models which included a parameter for invariant sites (I) and/or a Gamma model (Γ) for inter-site rate variation were also tested.GTR+Γ+I was the best model, i.e., the one with the smallest AIC score.We constructed a tree with RAXML using the original sequence alignment and the GTRCATI model of substitution, which we refer to as the RAxML tree.The CAT model approximates the Gamma model and enables fast computation Stamatakis (2006).
We inferred a generally labeled tree using FJ-BIC.Pairwise distances were computed using RAxML which included the following steps Stamatakis et al. (2005).First a maximum parsimony tree was constructed using stepwise addition and the parameters of the substitution model GTR+Γ were optimized.The optimized substitution model was used to compute maximum likelihood distances for all sequence pairs.For computing the likelihood of FJ trees at various values of the distance threshold we used RAxML as follows.FJ trees were converted to leaf-labeled trees by replacing each interior labeled vertex with a latent vertex and adding an edge of length zero between the newly added latent vertex and the former interior labeled vertex.This conversion was necessary since RAxML can only handle leaf-labeled trees.We then maximized the likelihood of the converted FJ tree by fixing the tree topology and branch lengths and optimizing the parameters of the substitution model GTR+Γ.The maximized log-likelihood was used for computing BIC.
The FJ-BIC tree was rooted assuming a strict molecular clock model.We define the optimal position of the root as that position which minimizes the RSS of regressing distances from the root to each labeled vertex against sampling times.We placed the root at the midpoint of each branch and computed the RSS for each branch.We then picked the branch that had the smallest RSS and searched along the branch for the position of the root with the smallest RSS.Subsequently this position was chosen as the root of the FJ-BIC tree.

Compatibility of the FJ-BIC tree with known transmission events
In order to check if the known transmission events are compatible with a rooted tree we needed to label all latent vertices with a host.Latent vertices were visually labeled with hosts using standard maximum parsimony.The labeling that we applied resulted in the minimum possible total cost of 10 (see Fig. 4).Given a rooted tree with all vertices labeled with a host, we define a transmission event (X → Y ) to be compatible with the tree if there is a directed edge from a vertex labeled X to a vertex labeled Y .9 out of 10 transmission events are compatible with the FJ-BIC tree.The direction of transmission between A and B is not known.The FJ-BIC tree suggests that A was infected by B. The branch of the FJ-BIC tree that suggests this transmission event has been labeled with the known transmission event A ↔ B. 8 out the remaining 9 transmission events are compatible with the FJ-BIC tree and branches indicative of these transmission events are labeled in Fig. 4. The transmission event B → I is not compatible with the FJ-BIC tree (red solid box in Fig. 4) which may be due to insufficient sampling; Only three sequences were available from host I.It is possible that the polytomy present inside the red dotted box in Fig. 4 may be resolved if more sequences from I were available, in such a way that the resulting tree would be compatible with the transmission B → I.

Branch support in the FJ-BIC tree and the RAxML tree
The bootstrap support of a branch is defined as the number of bootstrap replicate trees that contain this branch.The bootstrap support of branches in the FJ-BIC tree and the RAxML tree were computed using 1000 replicates.Since each labeled vertex is a leaf in all bootstrap RAxML trees, all terminal branches of the RAxML tree trivially have a support of one.The support of a terminal branch in the FJ-BIC tree is not necessarily one.
75 internal branches were common to both the FJ-BIC tree and the RAxML tree.The median(IQR) supports for the common branches were 0.73 (0.43) and 0.76 (0.38) in the FJ-BIC and the RAxML tree respectively.Supports for the common branches in both trees were strongly correlated (Pearson's ρ = 0.84, see Fig. 5).There are 44 and 103 internal branches that are present only in the FJ-BIC tree and subs/site 0.00 0.02 0.04 0.06 0.08 0.10 The FJ-BIC tree of 181 HIV-1 env gene sequences sampled from hosts involved in a known transmission chain.Each vertex is represented by a circle whose inner color is black if the vertex is labeled and white if the vertex is latent.The outer color of each circle indicates the host of the corresponding vertex.Branches reflecting transmission events have been labeled.9/10 transmission events are compatible with the FJ-BIC tree.The red box highlights the transmission event B → I which is not compatible with the FJ tree. the RAxML tree respectively with lower median (IQR) branch supports of 0.22 (0.28) and 0.18 (0.33) respectively (see Fig. 5).The 124 terminal branches in the FJ-BIC tree have a median(IQR) branch support of 0.95 (0.16).
On average an internal branch in the FJ-BIC tree has a higher support than an internal branch in the RAxML tree.36% of FJ-BIC branches and 25% of RAxML branches have supports greater than 0.7.

Summary and Outlook
In this paper, we present a fast distance-based agglomeration method called family-joining (FJ) for constructing generally labeled trees.A key feature of the algorithm presented here is its low worst case time complexity, O(n 3 ), where n is the number of taxa making it feasible for analyzing large data sets.For precomputed distances between 320 taxa, FJ-BIC took around 5.4 minutes (±0.76) to estimate a tree.At each agglomeration step FJ only adds branches (both internal and terminal) if there is sufficient data to support this move.The algorithm treats short branches as unreliable and identifies an optimal threshold by minimizing test error.We tested two methods, FJ-BIC and FJ-CV error, which minimize BIC and CV error, respectively.When compared with related methods FJ-BIC was best at reconstructing the correct tree across a wide range of simulation settings.FJ-BIC was applied to HIV sequences sampled from individuals involved in a known transmission chain.The FJ-BIC tree was compatible with ten out eleven transmission events.On average, internal branches in the FJ-BIC tree were found to have higher statistical support than internal branches in the tree constructed using RAxML.A method for reconstructing phylogenetic trees with high precision circumvents the need for a time-consuming bootstrap analysis.To the best of our knowledge the method presented here is the first attempt at modeling evolutionary relationships using generally labeled trees.

Summary and Outlook
In this paper, we present a fast distance-based agglomeration method called family-joining (FJ) for constructing generally labeled trees.A key feature of the algorithm presented here is its low worst case time complexity, O(n 3 ), where n is the number of taxa making it feasible for analyzing large data sets.For precomputed distances between 320 taxa, FJ-BIC took around 5.4 minutes (±0.76) to estimate a tree.At each agglomeration step FJ only adds branches (both internal and terminal) if there is sufficient data to support this move.The algorithm treats short branches as unreliable and identifies an optimal threshold by minimizing test error.We tested two methods, FJ-BIC and FJ-CV error, which minimize BIC and CV error, respectively.When compared with related methods FJ-BIC was best at reconstructing the correct tree across a wide range of simulation settings.FJ-BIC was applied to HIV sequences sampled from individuals involved in a known transmission chain.The FJ-BIC tree was compatible with ten out eleven transmission events.On average, internal branches in the FJ-BIC tree were found to have higher statistical support than internal branches in the tree constructed using RAxML.A method for reconstructing phylogenetic trees with high precision circumvents the need for a time-consuming bootstrap analysis.To the best of our knowledge the method presented here is the first attempt at modeling evolutionary relationships using generally labeled trees.

Terminology
A phylogenetic tree is an edge weighted undirected tree consisting of two types of vertices, labeled vertices (representing observed sequences) and latent vertices (representing unobserved sequences).Sequence information is present only at labeled vertices.Where appropriate, we refer to edges as branches and edge weights as branch lengths.A branch length quantifies the amount of expected change between the sequences corresponding to the respective incident vertices.Branch lengths are usually in units of substitutions per site.Labeled vertices and latent vertices have a minimum degree of one, and three respectively.For a tree consisting of n labeled vertices the number of latent vertices lies between zero and n − 2. For trees with a maximal number of latent vertices, all labeled vertices are leaves (degree one) and all latent vertices have degree three.Trees are leaf-labeled if all labeled vertices are leaves, else they are generally labeled.
A distance matrix d is tree-additive in a tree T if the distance between each pair of labeled vertices equals the corresponding path length (sum of branch lengths along the unique path between the two vertices) in T .Each branch partitions the set of all labeled vertices into two disjoint sets which are referred to as the split of the branch.The two sets of labeled vertices that are present in a split are referred to as the sides of the split.A split is compatible with a tree if there is any branch in the tree such that removing the branch bipartitions the set of labeled vertices as defined by the split.S(T ) denotes the set of splits that are defined by a branch in T .
A pair of vertices are siblings if both of them are leaves and are adjacent to the same vertex.A vertex pair is in a parent-child relationship if they are adjacent and one of them is a leaf.Thus we call siblings what in the context of the neighbor-joining algorithm is called neighbors.
A rooted tree is a tree with directed edges.In such trees there is one latent vertex (the root) which has indegree zero and outdegree greater than zero.All edges in a rooted tree are directed away from the root.
Edges incident to leaves are referred to as terminal edges while edges incident to internal vertices are referred to as internal edges.

Family-joining algorithm
Our objective is, given a tree-additive distance matrix d, to infer the respective tree To.To may be generally labeled and may contain latent vertices with degree greater than three.We assume that all branch lengths in T0 are strictly greater than zero.We provide a method which correctly infers To using entries in d.
Let Tmax be the set of all trees that satisfy the following criteria: (i) their set of leaves includes all the labeled vertices in To, (ii) they have the maximum number of latent vertices, (iii) and d is the tree-additive distance matrix in every tree in Tmax.All splits in S(To) are compatible with every tree in Tmax.If this were not true for some tree Tmax in Tmax then there would be two branches, bo in To and bmax in Tmax, such that labeled vertices {i, j} and {k, l} lie on different sides of b0 and {i, k} and {j, l} lie on different sides of bmax.Applying Buneman's 4-point condition Buneman (1971) Figure 6: An illustration of the family-joining algorithm.The main steps have been labeled with their time complexity.
The inequality is strict for b0 as all branch lengths in T0 are greater than zero.Thus any tree in Tmax can be constructed as follows.If there is a labeled vertex in T0 with degree greater than one replace this vertex with a latent vertex and add a branch of length zero between the labeled vertex and the newly added latent vertex.If there is a latent vertex with degree greater than three (v poly ) disconnect two randomly selected vertices adjacent to v poly and connect them to a new latent vertex with a branch of length zero.Lengths of branches between the newly added latent vertex and each adjacent vertex are the same as the length of the original branch between v poly and that vertex.Both of these augmentation operations are performed until all latent vertices have degree 3 and there are no labeled internal vertices.
Applying the neighbor-joining algorithm using distances in d yields a tree TNJ with the maximum number of latent vertices such that d is tree-additive in TNJ .Thus TNJ belongs to Tmax and consequently neighbors in TNJ are either parent-child or siblings in To.
NJ is an agglomerative clustering method that identifies, at each step, the pair of vertices to cluster by minimizing the following objective value Saitou and Nei (1987); Studier and Keppler (1988).
where n is the number of vertices that are yet to be clustered.Neighbors i and j can be classified as parent-child or siblings based on the following quantity.
It can be easily shown that: These criteria are shown to be true in the following statements.If i is the parent of j then the path from j to any vertex k = i, j, will visit i.Thus d jk = dji + d ik , which gives ∆ij = 0 and ∆ji = dij.
If i and j are siblings then d jk = dju + d uk where u is the vertex adjacent to both i and j.Similarly When distances are estimated from sequences we use a threshold for classifying the relationship as parent-child or sibling.Specifically i is the parent of j if |∆ij| < .The unordered vertex pair {i, j} are said to be in a parent-child relationship if min{|∆ij|, |∆ji|} < (2) The criterion for selecting the appropriate is discussed in detail later.When d is tree-additive any sufficiently small can be used for correctly classifying the vertices.The family-joining algorithm consists of two main parts: GetTreeTopology which infers the tree topology, and GetBranchLengths which estimates the branch lengths.We describe these two steps in detail below.

Inferring tree topology
An overview of GetTreeTopology is provided in Algorithm 1. GetTreeTopology initializes a so-called active vertex set Va with the set of all labeled vertices.It then performs agglomerative clustering where the following actions are performed at each step.
The pair {i, j} of vertices in Va that minimizes (1) is identified.i and j are then classified as parentchild or siblings using (2).If i is the parent of j, or vice-versa, an edge is added between them and all distances from the child are removed from d.If i and j are found to be siblings then we search for another vertex k in Va that minimizes the following quantity.
If |d ik + d kj − dij| < 2 then k is the parent of both i and j.Corresponding edges are added and all distances from i and j are removed from d. i and j are removed from Va.Note that there are alternate criteria for checking if there is a vertex k that is the parent of both i and j.One such criterion is to compute min{|∆ ki |, |∆ kj |}, (4) and consider k to be the parent of both i and j if min{|∆ ki |, |∆ kj |} < 2 .In the simulation study we found that reconstruction accuracy was higher when we used the quantity in eqn.(3) as opposed to eqn. ( 4) (see Supplementary Fig. 4).This is probably because the quantity in eqn.( 3) is more robust to noise in the estimates of large distances.If k is not the parent of both i and j, a latent vertex l is introduced as the parent of both i and j.Corresponding edges are added and distances from l to any vertex m in Va other than i and j are calculated using the following estimate by Studier and Keppler (1988).
i and j are removed from Va and all distances from i and j are removed from d. Distances from u are added to d and u is added to Va.The agglomeration step described above is repeated until the number of vertices in Va is less than four.After each iteration Va reduces by either one or two vertices.If Va has reached the size three, we check using (3) if there are vertices i, j, and k in Va such that k is the parent of both i and j.If we find such vertices, corresponding edges are added.Otherwise a latent vertex u is introduced and edges are added between u and the three remaining vertices.If Va has reached size two, an edge is added between the two remaining vertices.
GetTreeTopology returns the list of edges of the estimated tree T .T has the same topology as the true tree if distances are additive in the true tree.

Upper bound on the time complexity of GetTreeTopology
At first glance it appears that the neighbor identification step requires Ω(n 3 ) time.This can be reduced to O(n 2 ) with the observation that the neighbor-joining objective can be reformulated as follows Studier and Keppler (1988): From eq. ( 6) it is evident that initializing each row sum Ri with the original distances takes O(n) time.
Updating each Ri after each agglomeration step is done by subtracting distances from children and, if applicable, adding distances to the newly introduced latent vertices.Thus the process of updating each Ri takes O(1) time.Additionally, storing all the Ri in memory requires O(n) space which incurs very little memory overhead compared to the O(n 2 ) space required to store all the pairwise distances.If all distances and row sums are stored in memory then identifying the neighbors takes O(n 2 ) time.Note that ∆ij can also be reformulated for faster computation as follows.
Thus, once the neighbors {i, j} have been identified, it takes O(1) time to compute both ∆ij and ∆ji.It takes O(n) time to find the vertex k which minimizes -d ki + d kj − dij-.The overall time-complexity of GetTreeTopology is O(n 3 ).The time-complexities associated with the main steps of GetTreeTopology are shown in Fig. 6.

Efficient estimation of branch lengths
Branch lengths b of T are estimated by ordinary least squares.This is done by solving Ab = d where d is a column vector containing all those entries of d that are above or alternatively all those entries of d that are below the diagonal.A is the branch incidence matrix of T and is constructed as follows.If the m th entry of the d is dij, then ame = 1 if the path from i to j contains e 0 otherwise (7) A has the dimension n(n − 1)/2 × |E| where |E| is the number of branches in the tree, n is the number of labeled vertices, and b is the vector of branch lengths that we wish to estimate.The ordinary least squares (OLS) estimate of branch lengths is given by b For the estimation of OLS branch lengths we do not make the assumption that distances are treeadditive.For leaf-labeled trees there is a fast O(n 2 ) algorithm for computing the OLS branch lengths Bryant (1997).Any algorithm that estimates OLS branch lengths by performing the matrix operations that are defined in eqn.( 8) needs to use all entries of the distance vector, and thus must run in Ω(n 2 ) time (Bryant and Waddell, 1998).Thus the algorithm by Bryant (1997) is time-optimal.We show that this algorithm extends to generally labeled trees.The main steps involved in this computation are computing first A t d and then (A t A) −1 A t d, each in O(n 2 ) time.We describe both of these steps below.Computing A t d The i th entry of A t d, δ T i d, is the sum of all distances between labeled vertices a and b that lie on either side of edge ei.δi is the i th column of A. For efficient computation of A t d, edges are visited in order of increasing distance from leaves, keeping track of which edges have already been visited.
We first compute δ T i d for every terminal edge ei which is defined as follows.
Next we compute δ T i d for every internal edge ei which are visited in the order of increasing distance from leaves.Consider the internal vertex α with only one incident edge ei such that δ T i d has not been calculated.Let the edges incident to ei be ej 1 , . . ., ej m Let Ci be the side of the split of the edge ei that does not contain α.Similarly Cj k is the side of the split of ej k that does not contain α.
Depending on whether α is labeled or not labeled, δ T i d is computed as follows: Case 1: Vertex α is not labeled Bryant (1997).
Computing each element of A t d involves the summation of entries of the distance vector.Since each element of the distance vector is summed over just once, There is a closed-form solution for the OLS branch length b0 of any edge e0 which is formulated in terms of the splits, and the elements of A t d, that are defined by e0 and the edges adjacent to e0.A description of the branch length formula is given later.When computing branch lengths, edges can be visited in any order.We derive the branch length formula for an internal edge.Consider the internal edge e0 shown in Fig. 9 with adjacent edges e1, . . .e k , e k+1 . . .em. e0 is incident to {α, β}.The respective sizes of the parts of the split defined by e0 are nα and n β For each edge ei define Pi = x∈A i ,y∈B i pxy where Ai and Bi are the parts of the split defined by edge ei.Here pxy denotes the length of the path from x to y when branch lengths are determined by OLS.It turns out that Pi = δ T i d.For each edge ei let Ci be the side of the split that does not contain α and β. ni is the cardinality of Ci.Define For the case where both α and β are not labeled it can be shown that Bryant (1997) where N is the m × m diagonal matrix with (n1, n2, . . ., nm) on the diagonal, I is the identity matrix, Q = (Q1, Q2, . . ., Qm) T , U is the m × m matrix of ones, v is the vector with n β in positions 1 to k followed by nα in positions k + 1 to m, and P = (P1, P2, . . ., Pm) T .
Similarly for the internal edge e0 P0 = v T Q + nαn β b0 Letting X = (nN −1 − 2I + U ) and substituting Q gives the following branch length estimate.
For cases where only α and both α and β are labeled, respectively, the derivation of the above mentioned equations is similar to that described in Bryant (1997) and is provided in the supplementary material.The formula, eqn.( 13), for branch length is valid only when X −1 exists.Bryant (1997) showed that X is invertible as long as there is at most one zero on the diagonal of the matrix (nN −1 − 2I).The i th diagonal element is zero if ni/n = 2 which occurs if there is an edge where both parts of the split have equal size.Even in generally labeled trees there can be at most one such edge.
There are two cases to consider for external branches, one if α is not labeled and the other if α is labeled see Fig. 10.In both cases the derivation of the branch length formula is similar to what has been described earlier and is omitted.
The branch length formulae turn out to be identical in all cases.The reader is referred to the supplementary material for the proof.
For a more detailed description of the algorithm for computing OLS branch lengths, the reader is referred to Bryant (1997).
Once A t d has been computed, all branch lengths can be calculated in O(n) time.Since there are O(n) edges the time complexity of computing OLS branch lengths is O(n 2 ) The overall time complexity of FJ is O(n 3 ).This can be reduced further if heuristics are used at the neighbor identification step, eqn.(1).
OLS branch lengths may be negative which has no biological interpretation.After estimating the branch lengths all branches that are shorter than and are incident to a latent vertex are contracted.If there is a branch between two labeled vertices that has a negative length, its length is set to 10 −7 .10 −7 is smaller than the smallest non-zero distance estimate computed in any of the simulation scenarios.

Model selection
Values of are inversely related to the number of latent vertices and thus inversely related to model complexity.
We performed model selection using three estimates for test error, cross-validation error, Akaike information criterion (AIC) and Bayesian information criterion (BIC).In all cases, model selection is performed by identifying the value of that minimizes the estimate for test error.Please refer to the Supplementary material for a description on how cross-validation error is computed.
AIC and BIC are Taylor series approximations of the Kullback-Leibler distance between the generative model which one wishes to recover and the model that is obtained by maximum likelihood estimation.These are formulated as, AIC = −2 log likelihood + 2m BIC = −2 log likelihood + m log(n) Under the likelihood framework, phylogenetic trees are probabilistic graphical models which are completely described by tree topology and branch lengths.n denotes sample size and is given by sequence length.The number m of parameters equals the number of branches in the tree.
We use FJ branch lengths as approximations of the maximum likelihood branch lengths.Likelihood is computed using Felsenstein's pruning algorithm which is a dynamic programming algorithm that enables efficient calculation of the likelihood Felsenstein (1981).
The calculation of cross-validation error is described on page 6 of the supplement.Plummer et al. (2006).ESS were found to be greater than 200 for all parameters across all the MCMC chains indicating that the chains were sufficiently long.The trees that are produced by BEASTv2.3.0 are rooted and contain the maximum number of latent vertices.The sampled trees were post-processed by unrooting them and contracting all terminal edges of length zero.We reported the average precision and recall of the post-processed sampled trees from the true tree.

Recursive grouping and Chow-Liu recursive grouping
For assessing the performance of RG and CLRG we used the Matlab implementation that was provided by the authors.Both of these methods are distance-based.RG initially sets the active vertex set Va to the set of all labeled vertices.At each iteration Va is partitioned into so-called families using k-means clustering.For each family containing more than one vertex, a relationship test similar to the one used in FJ is performed.If there is a vertex that is the parent of all other vertices in the family then edges are added from the parent to each child.If no such parent is found then a latent vertex is introduced as the parent to all vertices of the family and corresponding edges are added.Va is reduced by removing all the children.This procedure is iterated until a connected graph is obtained.CLRG starts by constructing a minimum spanning tree over all the labeled vertices.For each internal vertex vi, the vertex set Vi comprising of vi and its neighbors is constructed and RG is applied to distances between vertices in Vi, producing the tree Ti.The subgraph in the minimum spanning tree that is induced by Vi is replaced by Ti.
Both RG and CLRG require the setting of two thresholds, and τ .The first threshold, is used for performing the relationship test.RG and CLRG additionally contract branches that are smaller than this threshold.The second threshold, τ is used to filter out large distances and only distances below this threshold are used when performing the relationship test.We optimized using BIC and set τ to a reasonably high value of 0.5.
We modified the implementation provided by the authors, in order to correctly evaluate distances of value zero.Such distance estimates were encountered, predominantly, when the average branch length was the shortest and when a large fraction of internal vertices were labeled.The modification is that all distances of value zero were changed to 10 −7 .

Neighbor-joining with edge contraction
We implemented NJc in Python.NJc involves two steps.The first step is the construction of a tree using NJ.Subsequently all branches that are incident to a latent vertex and are smaller than a preselected threshold are contracted.We optimized using BIC.

Acknowledgments
PK is partially supported by German Center for Infection Research, grant no.DZIF 80008023.This work has been performed in the context of the EuResist Network GEIE, and the project MASTER-HIV/HEP which is funded by the German Health Ministry.

Availability of code
A program that constructs generally labeled trees using FJ-BIC is provided at https://bioinf.mpiinf.mpg.de/publications/prabhavk/familyJoining.Case 2: α is labeled and β is not labeled After substituting Q and rearranging we get, Case 3: Both α and β are labeled Qi + nαn β b0 After substituting Q and rearranging we get, Consider the terminal edge e0 shown in Fig. 10 with adjacent edges e1, e2 . . .em. e0 is incident to the vertices α and β.The respective sizes of the sides of the split defined by e0 are nα and n β .Since e0 is a terminal edge the leaf β is labeled.There are two cases to consider depending on if α is labeled or not labeled.
If α is not labeled (Case 1 in Fig. 10), the branch length formula given by Bryant (1997) where nα = (n − 1), n β = 1 and k = m.If α is labeled (Case 2 in Fig. 10), the branch length formula can be derived as follows.

Comparison of various FJ-based methods
For computing cross-validation error the original sequence alignment with L columns was partitioned into K validation alignments by randomly sampling L/K columns without replacement.For each validation alignment, the corresponding training alignment was constructed using the complimentary set of L−L/K alignment columns.This procedure was repeated R times, giving RK training and validation alignments in total.ML distances were computed for all training and validation alignments.For a fixed value of , FJ trees were constructed for each training distance matrix.We set R to 10 and tried two values for K, i.e., 3 and 5. Test error was computed as the residual sum of squares between the fitted distances (path length on the tree) and the corresponding distances computed from the validation alignment.We then found the that minimized expected test error as this would yield the most generalizable model.

Figure 1 :
Figure 1: Panel A: The tree-additive distances used in this example.Labeled vertices are represented by solid circles and latent vertices by white circles with black border.Panels B to G: The agglomeration steps of FJ which identifies the correct tree topology.The edges that are inferred in each agglomeration step are shown as solid lines.The dotted lines connect the labeled and latent vertices that will be used in the next iteration.Panel H: The correct branch lengths estimated using OLS.

FFigure 2 :
Figure 2: A comparison of the reconstruction accuracy of all methods in six simulation categories.One parameter (x-axes) was varied in each category.The default parameter settings are denoted as parameter-Value(d) on each x-axis.For each parameter setting, 100 data sets were created.Precision is shown in blue and recall is shown in pink.

Figure 3 :
Figure 3: A comparison of run times of all methods in the scenario where the number of labeled vertices was varied.Run times are shown on a log-scale.

Figure 5 :
Figure 5: Left: Comparing the support of common branches in the FJ-BIC tree and the RAxML tree.Right: Supports for branches that are only present in either the FJ-BIC tree or the RAxML tree.

Figure 7 :
Figure 7: The three cases for the internal edge e 0 .Case 1: Both α and β are not labeled.Case 2: Only α is labeled.Case 3: Both α and β are labeled.The triangles represent subtrees.

Figure 8 :
Figure8: The two cases for the terminal edge e 0 .α is labeled in case 1 and not labeled in case 2. The triangles represent subtrees.

Figure 9 :
Figure 9: The three cases for the internal edge e 0 .Case 1: Both α and β are not labeled.Case 2: Only α is labeled.Case 3: Both α and β are labeled.The triangles represent subtrees.

ForFigure 10 :
Figure10: The two cases for the terminal edge e 0 .α is labeled in case 1 and not labeled in case 2. The triangles represent subtrees.

For edges ei incident to α weFFigure 11 :
Figure 11: Rate of the strict molecular clock that is estimated by SA.The true rate of the strict molecular clock is 1.0 subs./site/time in all simulation scenarios.
arg min k i,j (d T ( ,k) (i, j) distance in fitted tree − d V (k) (i, j)) 2 distance in validation set

Table 1 :
Simulated data sets were constructed by varying either the tree type, proportion of labeled internal vertices, type of contracted edge, number of labeled vertices, sequence length or branch length.All settings that were considered for each parameter are shown below.The default setting for each parameter is indicated with * .

Table 2 :
Methods with the significantly highest precision and recall are shown below.All methods that are not significantly worse than the best method are also shown.F, N, R, C, and S stand for FJ-BIC, NJc-BIC, RG-BIC, CLRG-BIC, and SA, respectively.Black and red indicate methods with the highest precision and recall, respectively.The default setting for each simulation parameter is indicated with * .
F, F, N, C, S F, N, C, F, N, C, S F, N, C, F, N, S would result in the following contradictory inequalities:dij + d kl < d ik + d jl for b0 dij + d kl ≥ d ik + d jl for bmax