Imputing missing distances in molecular phylogenetics

Missing data are frequently encountered in molecular phylogenetics, but there has been no accurate distance imputation method available for distance-based phylogenetic reconstruction. The general framework for distance imputation is to explore tree space and distance values to find an optimal combination of output tree and imputed distances. Here I develop a least-square method coupled with multivariate optimization to impute multiple missing distance in a distance matrix or from a set of aligned sequences with missing genes so that some sequences share no homologous sites (whose distances therefore need to be imputed). I show that phylogenetic trees can be inferred from distance matrices with about 10% of distances missing, and the accuracy of the resulting phylogenetic tree is almost as good as the tree from full information. The new method has the advantage over a recently published one in that it does not assume a molecular clock and is more accurate (comparable to maximum likelihood method based on simulated sequences). I have implemented the function in DAMBE software, which is freely available at http://dambe.bio.uottawa.ca.


INTRODUCTION
Distance-based phylogenetic methods, especially those based a local or global optimization criterion (Desper & Gascuel, 2002;Desper & Gascuel, 2004;Saitou & Nei, 1987), are widely used in studies on molecular phylogenetics and evolution. The least-square method for phylogenetic reconstruction is generally consistent when the distance is estimated properly (Felsenstein, 2004;Gascuel & Steel, 2006;Nei & Kumar, 2000), and is quite robust against over-or under-estimated distances (Xia, 2006). The popularity of the distancebased methods arises not only from their speed and performance which allows them to build super-trees (Criscuolo et al., 2006;Criscuolo & Gascuel, 2008), but also from their applicability to non-sequence data (Wayne, Van Valkenburgh & O'Brien, 1991). In particular, distance-based methods represent the only category of methods that can construct a phylogeny based only on pairwise alignment (Thorne & Kishino, 1992), which may be valuable in situations when reliable multiple alignment is difficult to obtain with highly diverged taxa. Such a phylogenetic method based on pairwise alignment has been implemented in DAMBE (Xia, 2013;Xia, 2017) for nucleotide, codon and amino acid sequences. There are cases where distance-based methods are the only option for building phylogenetic trees, such as those involving the new genome-based distances proposed in recent years. These include genome BLAST distances (Auch et al., 2006;Deng et al., 2006;Henz et al., 2005), breakpoint distances based on genome rearrangement (Gramm & Niedermeier, 2002;Herniou et al., 2001), distances based on the relative information between unaligned/unalignable sequences (Otu & Sayood, 2003), distances based on the sharing of oligopeptides (Gao & Qi, 2007), the composite vector distance (Xu & Hao, 2009), and composite distances incorporating several whole-genome similarity measures (Lin et al., 2009).
Distance-based methods may be the only way to build phylogenetic tree even with sequence data. For example, thousands of DNA transposons exist in Tasmanian devil (Gallus et al., 2015), but many have accumulated so many indels and substitutions that it is impossible to obtain a multiple alignment. The analysis is then limited to computing the distance between the consensus and each individual sequences (Gallus et al., 2015), without being able to have a phylogenetic tree. One can do pairwise alignment among most of the transposon sequences and compute their distances, but some transposon sequence pairs do not share homologous sites (Fig. 1, where a distance between Sp3 and Sp4 cannot be computed) and therefore cannot have their distances computed. If these missing distances can be imputed from those computable ones, then we have a method (and the only one) to build a phylogeny from such sequences. The same scenario is found in bacteriophage where (1) many do not share homologous genes and (2) high sequence divergence precludes multiple sequence alignment (but pairwise alignment using dynamic programming is often possible).
Note that missing data in this manuscript does not refer to indels in aligned sequences. In the distance matrix context, missing data means that some distances in the distance matrix are missing. In the sequence context, missing data means lack of homology between sequences to compute evolutionary distances. For sequences where a reliable multiple alignment can be obtained, likelihood-based methods are expected to have better phylogenetic accuracy than distance-based method, with or without imputed distances.

The statistical rational
Suppose we have N species with K possible pairwise distances, where K = N (N − 1)/2. Also suppose that M distances are missing and need to be imputed. The general framework for imputing missing distances is to find the M distances corresponding to the best tree based on certain criteria. There are two criteria used in choosing the best tree: the least-squares (LS) criterion (Beyer et al., 1974;Cavalli-Sforza & Edwards, 1967) and the minimum evolution (ME) criterion (Rzhetsky & Nei, 1992). I will show that only the LS criterion is appropriate for imputing missing distances. I will first outline the general approach, point out problematic cases where unique solution cannot be found, and then develop an efficient computational method which partially resembles the expectation-maximization (EM) algorithm. However, this approach is easily trapped in a local optimum and a downhill simplex method in multidimensions (Press et al., 1992, pp. 408-412) was implemented for imputing multiple missing distances. I illustrate the method by applying it to real data.
I will start with a simple illustrative example. Suppose we have four species (S1 to S4 in Fig. 2) with D 12 = 2, D 14 = 5, D 23 = 3, D 24 = 5, D 34 = 4 but with D 13 missing. One may take a wrong approach by thinking that, in this particular case, we have five unknowns and five equations and can solve for D 13 exactly. For example, given a topology in Fig. 2A, we can write the expected D ij values, i.e., E(D ij ), as: These E(D ij ) values are termed patristic distances in phylogenetics. If we replace E(D ij ) by the observed D ij values, we can indeed solve the simultaneous equations in Eq. (1), which give the solution as The missing D 13 given the tree in Fig. 2A, designated as D 13.A , can therefore be inferred, as: Thus, given the five known D ij values above, I obtain x 1 = x 2 = x 3 = x 5 = 1,x 4 = 3, D 13.A = 3. The tree length (TL), defined as TL = x i , is 7 for the tree in Fig. 2A, i.e., TL A = 7. TL is used in the ME criterion for choosing the best tree. The best tree is one with the shortest TL.
One might think of applying the same approach to the other two trees in Figs. 2B, 2C to obtain D 13.B and D 13.C as well as TL B and TL C , and choose as the best D 13 and the best tree by using either the LS criterion or the ME criterion (Rzhetsky & Nei, 1992;Rzhetsky & Nei, 1994), i.e., the tree with the shortest TL.
This approach has two problems. First, the approach fails with the tree in Fig. 2B where the missing distance, D 13 , involves two sister species. One can still write down five simultaneous equations, but will find no solutions for x i , given the D ij values above, because the determinant of the coefficient matrix is 0. For the tree in Fig. 2C, the solution will have x 5 = −1. A negative branch length is biologically meaningless and defeats the ME criterion for choosing the best tree and the associated estimate of D 13 . Second, in most practical cases where missing distances are imputed, there are more equations than unknowns, e.g., when we have five or more species with one missing distance.
The LS approach aims to find the missing distances and the topology that minimizes the residual sum of squared deviation (RSS): where D ij is the distance that can be computed from species i and j (i.e., not missing), E(D ij ) is specified in Eq. (1) for the tree in Fig. 2A, m is a constant typically with a value of 0 (ordinary least-squares, OLS), 1 (Beyer et al., 1974), or 2 (Cavalli- Sforza & Edwards, 1967). In the illustration below, I will take the OLS approach with m = 0. It has been shown before that OLS actually exhibits less topological bias than alternatives with m equal to 1 or 2 (Xia, 2006). Table 1 Estimation results from minimizing RSS, with Trees A, B, and C as in Fig. 2, and with the constraint of no negative branch lengths.

Site Tree A Tree B
Tree C x 1 1 0 1 x 2 1 1.5 1.5 x 3 1 0 1 x 4 3 3.5 3.5 x 5 1 1 0 Given the three tree topology, the results from the LS estimation are summarized in Table 1. Note that, for the tree in Fig. 2B, there are multiple sets of solutions of x i that can achieve the same minimum RSS of 1.
We see a conflict between the LS criterion and the ME criterion in choosing the best tree and the best estimate of D 13 . The ME criterion would have chosen Tree B with TL B = 6 and D 13 = 0 because TL B is the smallest of the three TL values. In contrast, the LS criterion would have chosen Tree A with RSS = 0 and D 13 = 3. There is no strong statistical rationale for the ME criterion, which is based on the assumption that substitutions are typically rare in evolution, so a tree with few substitutions is more likely than a tree requiring many substitutions. However, this criterion is logically inappropriate for imputing distances because it favors the distance that is the smallest. Phylogeneticists sometimes think that the ME criterion would be appropriate if the branch lengths are not allowed to take negative values (Desper & Gascuel, 2002;Desper & Gascuel, 2004;Felsenstein, 1997). The illustrative example in Table 1 shows that the ME criterion is problematic even when I do not allow negative branch lengths. In contrast, the LS-criterion (or the best-fit criterion) is well-established.
An earlier version of DAMBE implemented the LS approach above by using an iterative approach similar to the EM (expectation-maximization) algorithm as follows. For a given distance matrix with a missing distance D ij , I simply fill in the missing D ij by the smallest sum of D ik and D jk . For example, if D 25 is missing, but I have a {D 23 , D 35 } pair and a {D 27 , D 57 } pair with (D 23 + D 35 ) < (D 27 + D 57 ), then (D 23 + D 35 ) is used as the initial D 25 . According to triangular inequality, D 25 ≤ (D 23 + D 35 ). These initial D ij guesstimates are designated as D ij,m0 where the subscript ''m0'' indicates missing distances at step 0. I now build a tree from the distance matrix that minimizes RSS in Eq. (4). From the resulting tree I obtain the patristic distances E(D ij ) from the tree and replace D ij.m0 by the corresponding E(D ij ) values which are now designated as D ij.m1 . I now build a tree again, obtain the corresponding E(D ij ) to replace D ij.m1 , so now I have D ij.m2 . I repeat this process until RSS does not decrease any further. This process can quickly arrive at a local minimum. Unfortunately, different topologies have different minimums, and this approach is too often locked in a local minimum with a tree that does not achieve a global minimum RSS. I have implemented the LS criterion with a downhill simplex method in multidimensions (Press et al., 1992, pp. 408-412) when multiple distances are missing. With a single missing distance, the Brent's method (Press et al., 1992, pp. 402-408) is used. The optimization is run multiple times, with different initial values for the points in the simplex to increase the chance of finding the global RSS associated with the missing distances and the tree. While the simplex method is slow, it is good for proof-of-principle studies. The next version of DAMBE will replace the simplex method by the faster Powell's method (Press et al., 1992, pp. 412-419).

Comparison against the maximum likelihood method with simulated sequences
The pruning algorithm used for computing the likelihood can handle missing data, which was numerically illustrated in detail (Xia, 2014). While this method is intended in cases where a reliable multiple sequence alignment is difficult to obtain, i.e., when the maximum likelihood (ML) method is inapplicable, it is still of interest to gauge the performance of the distance imputation and phylogenetic reconstruction against the ML method.
The simulated sequences consist of 24 OTUs and 10 genes evolving in the HKY85 model (Hasegawa, Kishino & Yano, 1985) but with different transition bias (different κ values) varying from 2 to 6.5 (Fig. 3). The simulation was performed with INDELible 1.03 (Fletcher & Yang, 2009) with a symmetrical topology. I attach the supplemental control.txt file that specifies the specifics of the simulation including substitution models, nucleotide frequencies, indel rate and distribution, and phylogenetic tree with branch lengths. Each simulated set of sequences is aligned with MUSCLE (Edgar, 2004) with the default option (which is the slowest but most accurate). I have also used MAFFT (Katoh, Asimenos & Toh, 2009) with the LINSI option that generates the most accurate alignment ('-localpair' and '-maxiterate = 1,000'). Alignment from MAFFT generally contains more indels than MUSCLE but the phylogenetic results from the sets of alignments are almost identical.
Each of the 10 genes were simulated independently generating 1,000 sets of sequences with each set containing 24 OTUs (and 24 simulated sequences). They are then concatenated into 1,000 sets of sequences, with each sequence being a concatenation of 10 genes in the configuration shown in Fig. 3. The first 100 sets of sequences without gene deletion is designated Group0 (Fig. 3A). The next 100 sets of sequences with one gene randomly deleted (out of the 10 concatenated genes in Fig. 3) is designated Group1, and so on. The 100 sets with N genes randomly deleted is designated GroupN, where N varies from 0 to 9. The simulated data in 10 files named Group0.fas, Group1.fas, . . . , Group9.fas are in supplemental file Group0_9.fas.zip.
Maximum likelihood phylogenetic reconstruction was performed with PhyML (Guindon & Gascuel, 2003). The tree improvement option '-s' was set to 'BEST' (best of NNI and SPR search). The '-o' option was set to 'tlr' which optimizes the topology, the branch lengths and rate parameters. The distance imputation and distance-based phylogenetic analysis was done in DAMBE by choosing simultaneously estimated distance (Tamura, Nei & Kumar, 2004;Xia, 2009;Xia & Yang, 2011) and FastME as the tree building algorithm. To replicate results from this method in DAMBE, click 'File|Open file with multiple data sets'' to open a GroupX.fas file. Specify 24 as the number of sequences per set, and then choose ''Distance-based phylogenetics'' to perform phylogenetics analysis with DAMBE defaults on all data sets in the file. PhyML can be run in DAMBE in a similar way. The resulting trees are then compared against the ''true tree'' used in simulation. I used the bipartition-based Robinson-Foulds' method for tree comparison. Recovering a true tree also recovers all bipartitions in the true tree. A reconstructed tree that differs from the true tree also implies that some bipartitions in the true tree are not recovered. The Robinson-Foulds method of tree comparison can be accessed in DAMBE by clicking ''Phylogenetics|Robinson-Foulds dist between trees''. Figure 4 shows an illustrative example with seven OTUs (operational taxonomic units). The distance matrix in Fig. 4A is computed from aligned sequence data used before (Xia & Yang, 2011). Figure 4B is the phylogenetic tree built from this distance matrix. Suppose D gibbon,orangutan and D gorrila.chimpazee are missing (shaded in Fig. 4A) and need to be imputed. The method above yields D gibbon,orangutan = 1.3776 and D gorrila.chimpazee = 0.4600, which are close to the observed values (Fig. 4A). The final tree built from the distance matrix with the two missing distances is identical to Fig. 4B except for a negligible difference in branch lengths. Note that I have two distances missing out of a total of 21 possible pairwise distances, which suggests that phylogenetic reconstruction is possible with nearly 10% of distances missing. Sequence input I used a set of mitochondrial COI and CytB sequences from 10 Hawaiian katydid species in the genus Banza together with four outgroup species (Table 2) to test the performance of distance imputation by the method detailed above. Two sequence files are provided as supplemental file: (1) COI_CytB_aln.fas file that contains both COI and CytB sequences for each specimen, and (2) COI_CytB_aln_withMissing.fas that excluded sequences whose accession numbers are in strikethrough font in Table 2. There are 24 OTUs (Table 2), with 18 OTUs having COI sequences and 19 OTUs having CytB sequences. Out a total of 276 possible pairwise distances, 30 OTU pairs do not share homologous sites and need to have their distances imputed. This is a more dramatic example than before with more than 10% of the distances missing. The two sequences were read and analyzed in DAMBE with the simultaneously estimated distances based on the TN93 model (MLCompositeTN93). The missing distances are then imputed. The final output tree, based on the distance matrix with the imputed distances, is reconstructed with either the FastME method (Desper & Gascuel, 2002;Desper & Gascuel, 2004) or the neighbor-joining method (Saitou & Nei, 1987). The tree with 30 distances missing (Fig. 5B) is generally consistent with the tree with the full data set (Fig. 5A) except three minor misplacements of OTUs (shaded in Fig. 5B). Giving the missing sequences indicated in Table 2, I can only get a tree of 18 OTUs with the COI data, and a tree of 19 OTUs with CytB data. By imputing missing distances, I can obtain a tree with 24 OTUs that is almost as good as the tree with the full data set. Notes. a Two accession numbers are for partial COI and CytB sequences, respectively. One accession number is for the mitochondrial genomic sequence from which full-length COI and CytB sequences are extracted. Those with a strike-out font are ''missing'', i.e., removed from aligned sequences for testing the effect of missing data, so some OTUs, e.g., B. nihoa_B and B. kauaiensis_B, do not share homologous sites and need to have their distance imputed.

Distance matrix input
In addition to the purging of sequences shown in Table 2, I have also purged sequences in different ways. The result that distance imputation and phylogenetic reconstruction can be done satisfactorily with about 10% of distances missing is generally repeatable.

Contrasting performance against maximum likelihood method for aligned sequences
Simulated sequence data are grouped into Group0 to Group9. Each group contain 10 sets of aligned sequences, with no gene deletions in sequence sets in Group0, but with progressively more gene deletions from Group1 to Group9, leading to progressively more missing distances (Fig. 6). Sequence sets in Group0 to Group4 data do not have missing distance to impute, although deletion of gene sequences occur in sequences from Group1 to Group4. This is because sequences share at least one gene with either other for distance   Figure 7 Illustration of the Robinson-Foulds method of comparing phylogenetic accuracy. (A) A balanced tree with node labels, with 21 possible bipartitions, used in sequence simulation. A bipartition is created by cutting one internal branch, e.g., cutting the branch between nodes 24 and 25 (designated as 24..25) creates a bipartition with OTUs A1 to H1 in one partition and all other OTUs in the other. (B) % of bipartitions (created by cutting the branch specified in X -axis) recovered from simulated sequences, based on (1) distance-based method FastME in conjunction with distance imputation (FM+Distance imputation) and (2) the maximum likelihood method (DNAML), for the 100 data sets when seven of the 10 concatenated genes are randomly deleted (N_7_MF +Distance imputation versus N_7_DNAML).
Full-size DOI: 10.7717/peerj.5321/ fig-7 computation. Sequence sets in Group0 to Group5 always recover the true tree with either PhyML or the method of distance-imputation plus FastME reconstruction. The Robinson-Foulds method used here to assess phylogenetic accuracy is based on tree bipartitions. A bipartition is generated when a branch is broken to separate a tree into two subtrees. A tree with 24 OTUs have 21 bipartitions. Cutting the branch between nodes 24 and 25 (designated as 24..25) creates a bipartition with OTUs A1 to H1 in one partition and all other OTUs in the other. A tree with 24 OTUs as in Fig. 7A has 21 bipartitions. If a reconstructed tree has the same topology as the true tree, then all 21 bipartitions will be identical between the two trees. Thus, the percentage of bipartitions in a true tree (which is used to simulate the sequences) recovered from simulated sequences is a proxy of phylogeny accuracy. Figure 7B shows one special comparison between the distance-based method with imputed distances and the maximum likelihood method, with seven of the 10 concatenated genes randomly deleted from each sequence. The two approaches recovered a high and comparable percentage (93-100%) bipartitions in the true tree. The corresponding lines for data sets with fewer gene deletions are expected to be closer to 100%, and those for data sets with more gene deletions are expected to have lower percentages.
These expected patterns are empirically substantiated in Fig. 8 for Group5 to Group9. The percentage of recovered bipartitions decreases with increasing number of gene deletion (which is associated with an increasing number of missing distances). The distance-based method (FastME) based on imputed distances on average is worse than the likelihood-based method (represented by DNAML in Fig. 8), especially with more genes randomly deleted. The results from simulated data are consistent with the previous deletion involving the COI and CytB genes. Given the variation in the number of missing distances in Fig. 6, the distance imputation and phylogenetic reconstruction is comparable to that of the maximum likelihood ( Fig. 7 and Fig. 8), albeit slightly worse. The pattern suggests that the distance-based method with imputed distances should be used in cases involving up to about 10% of missing distances.

DISCUSSION
Imputing 30 missing distances does highlight the speed limitation of the simplex method of optimization, which is known to be the slowest (but simplest to implement) in multivariable optimization (Press et al., 1992, pp. 408-412). It takes almost 1.5 min to complete the distance imputation and phylogenetic reconstruction on my desktop PC with a i7-4770 processor clocked at 3.4 Ghz. If the number of missing distances is reduced to 15, then the computation is instantaneous.
There are cases where missing distances can only be determined approximately. For example, if our OTUs include avian species and mammalian species and if distances between mammalian species are missing, then there will be distances that have a narrow range of optimal values instead of a single optimal value. Any distance value within that range will lead to the same minimum RSS. The only way to eliminate this problem is not to have sister species with a missing distance.
There is another program, Lasso, for building phylogenetic trees from a distance matrix with missing values (Kettleborough et al., 2015). I found Lasso to be inaccurate. First, Lasso does not recover the tree in Fig. 4B. Second, the tree for the katydid species, when constructed with Lasso, differs in numerous ways from the tree in Fig. 5A. Lasso assumes a molecular clock, probably because it uses a UPGMA-type of phylogenetic reconstruction. I did not investigate whether Lasso's performance is limited by the assumption of molecular clock assumption or in distance imputation.
While missing data can be accommodated by the likelihood method with the pruning algorithm (Felsenstein, 1973;Felsenstein, 1981, pp. 253-255;2004), they can inflate branch lengths and introduce phylogenetic bias (Darriba, Weiss & Stamatakis, 2016;Xia, 2014). Some popular likelihood-based phylogenetic methods, e.g., PhyML (Guindon & Gascuel, 2003), optionally use distance-based methods to build the initial phylogenetic tree, which is then modified in various ways and evaluated in the likelihood framework to find the maximum likelihood tree. Distance-based methods are much faster than other phylogenetic methods such as maximum likelihood, Bayesian inference and maximum parsimony, and consequently are useful in constructing supertrees.

CONCLUSION
Distance imputation and phylogenetic reconstruction can be done with about 10% of distances missing, and the phylogenetic result is almost as good as that with full information. The method in the paper has an advantage over a previous method (Kettleborough et al., 2015) that assumes a rooted tree and a molecular clock for building a tree and for inferring missing distances. This assumption is not needed and is too restrictive in practice.

SOFTWARE AVAILABILITY
DAMBE is available free at http://dambe.bio.uottawa.ca. It can take two types of distance data. The first is the distance matrix data in PHYLIP format, but with missing distances represented by '.' (a period without quotation marks). One can access the function of distance imputation by clicking 'File|Open other molecular data|Distance matrix file with missing values', and open a distance matrix file. DAMBE will output the imputed distance together with the best tree.
The second input type sequence data, either aligned or unaligned. DAMBE reads and converts almost all currently used sequence formats. For aligned data, DAMBE will compute the distances between sequences sharing homologous sites, impute distances between sequence pairs that do not share homologous sites, and output the imputed distances and the associated optimal tree. For unaligned sequences, DAMBE will align homologous sequences, compute their pairwise distances, impute distances from those sharing no homologous sites, and output the imputed distances and the optimal tree. This function is accessed by clicking 'Phlogenetics|Sequence aligned' or 'Phylogenetics|Phylogenetics by pairwise alignment'.