Minimum contradiction matrices in whole genome phylogenies.

Minimum contradiction matrices are a useful complement to distance-based phylogenies. A minimum contradiction matrix represents phylogenetic information in the form of an ordered distance matrix $Y^n_{i,j}$. A matrix element corresponds to the distance from a reference vertex n to the path (i, j). For an X-tree or a split network, the minimum contradiction matrix is a Robinson matrix. It therefore fulfills all the inequalities defining perfect order: $Y^n_{i,j} \ge Y^n_{i,k}$, $Y^n_{k,j} \ge Y^n_{k,i}$, $i \le j \le k < n$. In real phylogenetic data, some taxa may contradict the inequalities for perfect order. Contradictions to perfect order correspond to deviations from a tree or from a split network topology. Efficient algorithms that search for the best order are presented and tested on whole genome phylogenies with 184 taxa including many Bacteria, Archaea and Eukaryota. After optimization, taxa are classified in their correct domain and phyla. Several significant deviations from perfect order correspond to well-documented evolutionary events.


Introduction
The discovery of the importance of lateral transfer, loss and duplication events in the evolution of genetic sequences has motivated the development of new approaches to graphically represent phylogenies. Methods like NeighborNet (Bryant and Moulton, 2004), T-Rex (Makarenkov et al. 2006), SplitTrees (Bandelt and Dress, 1992; Dress and Huson, 2004; Huson, 1998), Qnet (Grünewald et al. 2006), Pyramids (Bertrand and Diday, 1985) and Tree of Life (Kunin et al. 2005a) allow the visualization of deviations from a tree topology. All these methods have in common that they summarize the information in the form of a planar network. Deviations from an X-tree are often represented by supplementary edges (Makarenkov et al. 2006; Nakhleh et al. 2004) that create cycles in the graph.
Phylogenetic information can be represented by a distance matrix $Y^n_{i,j}$. For an X-tree, the elements of the distance matrix $Y^n_{i,j}$ correspond to the distance from a reference taxon n to the path (i, j). The taxa can be ordered through permutations, so that the distance matrix is a Robinson matrix (Bertrand and Diday, 1985), with values of both rows and columns decreasing away from the diagonal. The corresponding circular order is defined as a perfect order. We have shown with a probabilistic model that perfect order is quite robust against lateral transfer and crossover (Thuillard, 2007). The search for the order minimizing a measure of the deviation from perfect order can be efficiently done with a multiresolution algorithm (Thuillard, 2001, 2007). The method has been tested on SSU rRNA data for Archaea. The matrix with the best order corresponds quite well to a Robinson matrix. In this article, the minimum contradiction approach is further developed and applied to whole genome phylogenies. With the availability of complete genomes, many methods have been proposed to determine the evolution of whole genomes (for reviews see Galperin et al. 2006; Delsuc et al. 2005; Henz et al. 2005). The construction of trees from whole genomes has proved over recent years to be quite a difficult task. This is mainly because of the very limited number of genes shared by Archaea, Eukaryota and Bacteria. Furthermore, gene evolution can sometimes be very different from species evolution. The main difficulty consists in finding a good operator to estimate the distance between genomes. Distances have been estimated with measures based on gene order or arrangement (Wolf et al. 2002; Wang et al. 2006). Among genome distances obtained with Blast, the genome conservation (Kunin et al. 2005b) has furnished some of the best trees to date, if the quality of a whole genome phylogeny is measured by its concordance with broadly accepted classifications.
The genome conservation estimates the distance between two taxa using the sum of BlastP reciprocal best hits between two genomes. The method is capable of quite correctly recovering all main phyla. At the phylum level, the evolution of the different genes is sufficiently similar to form a distinct cluster. The main uncertainties in whole genome phylogenies concern the relationships between phyla. Different evolution rates of the genes, gene losses or duplications, and lateral gene transfer may result in large deviations of the distance matrix from a tree topology. In this context, minimum contradiction matrices can furnish information not contained in a single tree or a split network.
The paper is organized as follows. After introducing minimum contradiction matrices in section 2 and their connection to Robinson matrices and Kalmanson inequalities, section 3 explains why the identification of deviations from perfect order is a useful complement to phylogenetic studies. Section 4 presents an algorithm to search for the order minimizing a measure of the deviation from perfect order over all taxa. This order can be interpreted as an average best order over all reference taxa $Y^N_{i,j}$ (N = 1, …, n). The algorithm is applied in section 5 to distance matrices for whole genome phylogenies obtained with the genome conservation method.

Definitions
Let us start by recalling a number of definitions that are necessary to introduce the notion of circular order. A graph G is defined by a set of vertices V(G) and a set of edges E(G). Let us write e(x, y) for the edge between the two vertices x and y. In a graph G, a path P between two vertices x and y is a sequence of non-repeating edges $e(x, z_1), e(z_1, z_2), \ldots, e(z_i, y)$ connecting x to y. The degree of a vertex x is the number of edges e ∈ E(G) to which x belongs.
A leaf x of a graph is a vertex of degree one. A vertex of degree larger than one is called an internal vertex.
A valued X-tree T is a graph with X as its set of leaves and a unique path between any two distinct vertices x and y, with internal vertices of at most degree 3. The distance d between leaves satisfies the classical triangular inequality

$$d(x, y) \le d(x, z) + d(z, y) \qquad (1)$$

with d(x, y) representing the sum of the weights on the edges of T in the path connecting x and y. A central problem in phylogeny is to determine if there is an X-tree T and a real-valued weighting of the edges of T that fits a dissimilarity matrix δ. Typically, a dissimilarity matrix δ corresponds to an estimation of the pairwise distance $d(x_i, x_j)$ between all elements in X. A necessary and sufficient condition for the existence of a unique tree is that the dissimilarity matrix δ satisfies the so-called 4-point condition (Buneman, 1971). For any four elements $x_i, x_j, x_k, x_l$ in X, the 4-point condition requires that

$$d(x_i, x_j) + d(x_k, x_l) \le \max\bigl(d(x_i, x_k) + d(x_j, x_l),\; d(x_i, x_l) + d(x_j, x_k)\bigr) \qquad (2)$$
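The 4-point condition can be checked numerically on a small example. The sketch below is only an illustration: the function name and the two matrices are hypothetical (the first one holds additive distances on a 5-leaf tree ((A,B),C,(D,E)) chosen for this example), and the test uses the equivalent formulation that, for every quadruple, the two largest of the three pairwise sums must be equal.

```python
from itertools import combinations

def satisfies_four_point(d, tol=1e-9):
    """Check the 4-point condition on a symmetric distance matrix d (list of lists)."""
    for i, j, k, l in combinations(range(len(d)), 4):
        # The three ways of pairing up four elements.
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        # The condition holds for this quadruple iff the two largest sums are equal.
        if sums[2] - sums[1] > tol:
            return False
    return True

# Hypothetical additive distances for the tree ((A,B),C,(D,E)); leaves indexed 0..4.
TREE_D = [
    [0, 3, 5, 5, 6],
    [3, 0, 6, 6, 7],
    [5, 6, 0, 6, 7],
    [5, 6, 6, 0, 3],
    [6, 7, 7, 3, 0],
]

# Same matrix with one distance perturbed, so that it no longer fits a tree.
NOISY_D = [row[:] for row in TREE_D]
NOISY_D[0][2] = NOISY_D[2][0] = 7

print(satisfies_four_point(TREE_D), satisfies_four_point(NOISY_D))
```

The check enumerates all quadruples, so it costs O(n^4) operations and is meant for small illustrative matrices only.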

Circular order and Kalmanson inequalities
Consider a planar representation of a tree T or a split network S. A circular order corresponds to an indexing of the n leaves according to a circular (clockwise or anti-clockwise) scanning of the leaves (Barthélemy and Guénoche, 1991; Leclerc, 1997, 2000; Yushmanov, 1984).
In an X-tree, a circular order has the property that for any integer k (modulo n), all the branches on the path $P(x_k, x_{k+1})$ between $x_k$ and $x_{k+1}$ correspond to the left branch (or right branch if anti-clockwise). A circular order can be obtained by considering the distance matrix $Y^n_{i,j}$. As illustrated in Figure 1,

$$Y^n_{i,j} = \tfrac{1}{2}\bigl(d(x_i, x_n) + d(x_j, x_n) - d(x_i, x_j)\bigr)$$

corresponds to the distance between a reference leaf n and the path $P(x_i, x_j)$. A circular order can be computed by ordering the distance matrix $Y^n_{i,j}$ so that it fulfils the inequalities defining a perfect order:

$$Y^n_{i,j} \ge Y^n_{i,k}, \quad Y^n_{k,j} \ge Y^n_{k,i}, \qquad i \le j \le k < n.$$

The above inequalities also characterize a Robinson matrix (Christopher et al. 1996; Thuillard, 2007). Using the definition of $Y^n_{i,j}$, the inequalities become

$$d(x_i, x_k) + d(x_j, x_n) \ge d(x_i, x_j) + d(x_k, x_n), \quad d(x_i, x_k) + d(x_j, x_n) \ge d(x_k, x_j) + d(x_i, x_n).$$

These inequalities have a similar form to the 4-point condition (2) and are known as the Kalmanson inequalities.
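To make the construction concrete, the following sketch builds the matrix for one reference leaf and tests the perfect order inequalities. It assumes that the distance from the reference leaf n to the path between i and j is computed as half of d(i, n) + d(j, n) - d(i, j), which is exact on a tree. The function names and the 5-leaf example matrix are hypothetical.

```python
def y_matrix(d, order, ref):
    """Matrix Y for reference leaf `ref`, with rows and columns arranged as in `order`."""
    return [[0.5 * (d[i][ref] + d[j][ref] - d[i][j]) for j in order] for i in order]

def is_perfect_order(y, tol=1e-9):
    """True if Y[i][j] >= Y[i][k] and Y[k][j] >= Y[k][i] for all i <= j <= k."""
    m = len(y)
    for i in range(m):
        for j in range(i, m):
            for k in range(j, m):
                if y[i][j] < y[i][k] - tol or y[k][j] < y[k][i] - tol:
                    return False
    return True

# Hypothetical additive distances for the tree ((A,B),C,(D,E)); leaves indexed 0..4.
TREE_D = [
    [0, 3, 5, 5, 6],
    [3, 0, 6, 6, 7],
    [5, 6, 0, 6, 7],
    [5, 6, 6, 0, 3],
    [6, 7, 7, 3, 0],
]

circular = y_matrix(TREE_D, [0, 1, 2, 3], ref=4)   # A, B, C, D seen from E
shuffled = y_matrix(TREE_D, [0, 2, 1, 3], ref=4)   # A, C, B, D seen from E
print(is_perfect_order(circular), is_perfect_order(shuffled))
```

In this example the order A, B, C, D follows a circular order of the tree and satisfies all inequalities, while exchanging B and C separates the two neighbors A and B and breaks them.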

Minimum contradiction matrix
In real applications, the distance matrix $Y^n_{i,j}$ often fulfills the inequalities corresponding to a perfect order only partially. The contradiction C on the order of the taxa is defined as a measure of the violations of these inequalities. The best order of a distance matrix is, by definition, the order minimizing the contradiction. The ordered matrix $Y^n_{i,j}$ corresponding to the best order is defined as the minimum contradiction matrix for the reference taxon n.
For a perfectly ordered X-tree, the contradiction C is zero. A tree with a low contradiction value C is a tree that can be trusted, while a high contradiction value C indicates a distance matrix deviating significantly from an X-tree.
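A possible way to turn the violations of the perfect order inequalities into a single number is sketched below. The squared penalty on each violated inequality is an assumption made for this illustration and is not necessarily the exact expression used in this work. The function names and the example matrix are hypothetical.

```python
def y_matrix(d, order, ref):
    """Matrix Y for reference leaf `ref`, with rows and columns arranged as in `order`."""
    return [[0.5 * (d[i][ref] + d[j][ref] - d[i][j]) for j in order] for i in order]

def contradiction(y):
    """Sum of squared violations of Y[i][j] >= Y[i][k] and Y[k][j] >= Y[k][i], i <= j < k.

    Assumption: each violated inequality is penalized by the square of its violation.
    """
    m = len(y)
    total = 0.0
    for i in range(m):
        for j in range(i, m):
            for k in range(j + 1, m):
                total += max(y[i][k] - y[i][j], 0.0) ** 2
                total += max(y[k][i] - y[k][j], 0.0) ** 2
    return total

# Hypothetical additive distances for the tree ((A,B),C,(D,E)); leaves indexed 0..4.
TREE_D = [
    [0, 3, 5, 5, 6],
    [3, 0, 6, 6, 7],
    [5, 6, 0, 6, 7],
    [5, 6, 6, 0, 3],
    [6, 7, 7, 3, 0],
]

c_good = contradiction(y_matrix(TREE_D, [0, 1, 2, 3], ref=4))
c_bad = contradiction(y_matrix(TREE_D, [0, 2, 1, 3], ref=4))
print(c_good, c_bad)
```

With this penalty the contradiction vanishes exactly when the matrix is perfectly ordered, and it grows with the size of each violation rather than with their mere count.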

Why is Perfect Order an Important Property?
Kalmanson inequalities are at the center of a number of important results relating convexity (Kalmanson, 1975), the Traveling Salesman Problem (TSP) (Deineko et al. 1995; Korostensky and Gonnet, 2000), phylogenetic trees and networks (Christopher et al. 1996; Dress and Huson, 2004). Let us explain why perfect order is an important property.
- If the error on the distance in an X-tree is not greater than $x_{min}/2$, with $x_{min}$ the shortest edge on the tree, then the Neighbor-Joining algorithm will recover the correct tree topology and Kalmanson inequalities hold (Atteson, 1999).
- For a distance matrix fulfilling the Kalmanson inequalities, the solution to the TSP has the Master Tour property (Deineko et al. 1995). A Master Tour is a solution of the TSP with the property that the optimal tour restricted to a subset of points is also a solution of the reduced TSP. This result follows directly from the inequalities for perfect order $Y^n_{i,j} \ge Y^n_{i,k}$, $Y^n_{k,j} \ge Y^n_{k,i}$. Any restriction of a perfectly ordered distance matrix $Y^n_{i,j}$ to a subset of taxa is perfectly ordered and consequently is a solution to the reduced TSP.

In contrast to this result, numerical experiments show that, if the minimum contradiction matrix does not fulfill the inequalities for perfect order, the best order is not always preserved when a number of taxa are removed. The order minimizing the contradiction over n taxa does not always minimize the contradiction when restricted to a subset of taxa. It follows that one cannot exclude that the topology of a tree or a split network may change when taxa contradicting perfect order are removed. Deviations from perfect order correspond to problematic regions that have to be interpreted very carefully. For that reason we suggest that minimum contradiction matrices are a useful complement to any distance-based phylogeny.
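The stability of perfect order under the removal of taxa can be illustrated on a small tree. The sketch below restricts a perfectly ordered matrix to every subset of taxa, keeping the relative order, and checks that each restriction is still perfectly ordered. The function names and the example matrix are hypothetical, and the matrix is assumed to be computed as half of d(i, n) + d(j, n) - d(i, j).

```python
from itertools import combinations

def y_matrix(d, order, ref):
    """Matrix Y for reference leaf `ref`, with rows and columns arranged as in `order`."""
    return [[0.5 * (d[i][ref] + d[j][ref] - d[i][j]) for j in order] for i in order]

def is_perfect_order(y, tol=1e-9):
    """True if Y[i][j] >= Y[i][k] and Y[k][j] >= Y[k][i] for all i <= j <= k."""
    m = len(y)
    return all(
        y[i][j] >= y[i][k] - tol and y[k][j] >= y[k][i] - tol
        for i in range(m) for j in range(i, m) for k in range(j, m)
    )

def all_restrictions_perfect(d, order, ref):
    """Check perfect order on every sub-order obtained by removing taxa."""
    for size in range(2, len(order) + 1):
        # combinations() keeps the elements in the relative order of `order`.
        for subset in combinations(order, size):
            if not is_perfect_order(y_matrix(d, list(subset), ref)):
                return False
    return True

# Hypothetical additive distances for the tree ((A,B),C,(D,E)); leaves indexed 0..4.
TREE_D = [
    [0, 3, 5, 5, 6],
    [3, 0, 6, 6, 7],
    [5, 6, 0, 6, 7],
    [5, 6, 6, 0, 3],
    [6, 7, 7, 3, 0],
]

print(all_restrictions_perfect(TREE_D, [0, 1, 2, 3], ref=4))
```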

Fast algorithm to search for the best order
The choice of the reference taxon n in $Y^n_{i,j}$ can significantly influence the best order when the distance matrix cannot be perfectly ordered. For that reason, an average best order is determined by minimizing the contradiction over all reference taxa.
The contradiction over all n reference taxa is given by the sum of the contradictions computed for each reference taxon. The best order is the order $(1, \ldots, i_0, \ldots, j_0, \ldots, n_0)$ minimizing the contradiction. The computation of the contradiction requires $O(n^4)$ operations. For a large ensemble of taxa, the computational cost may become quite high. We will therefore introduce below an algorithm requiring only $O(n^3)$ operations to compute a (slightly different) measure of the contradiction.
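The cost of the direct computation can be seen in a straightforward implementation, sketched below under several assumptions: the contradiction for one reference taxon is a sum of squared violations, the total is the plain sum over all reference taxa, and the linear order used for a reference taxon is the circular order read from the position just after that taxon. All names and the example matrix are hypothetical.

```python
def y_matrix(d, order, ref):
    """Matrix Y for reference leaf `ref`, with rows and columns arranged as in `order`."""
    return [[0.5 * (d[i][ref] + d[j][ref] - d[i][j]) for j in order] for i in order]

def contradiction(y):
    """Assumed measure: sum of squared violations of the perfect order inequalities."""
    m = len(y)
    total = 0.0
    for i in range(m):
        for j in range(i, m):
            for k in range(j + 1, m):          # O(m^3) triplets per reference taxon
                total += max(y[i][k] - y[i][j], 0.0) ** 2
                total += max(y[k][i] - y[k][j], 0.0) ** 2
    return total

def total_contradiction(d, circular_order):
    """Sum the contradiction over every taxon used in turn as reference: O(n^4)."""
    total = 0.0
    for pos, ref in enumerate(circular_order):
        # Remaining taxa, read around the circle starting just after the reference.
        linear = circular_order[pos + 1:] + circular_order[:pos]
        total += contradiction(y_matrix(d, linear, ref))
    return total

# Hypothetical additive distances for the tree ((A,B),C,(D,E)); leaves indexed 0..4.
TREE_D = [
    [0, 3, 5, 5, 6],
    [3, 0, 6, 6, 7],
    [5, 6, 0, 6, 7],
    [5, 6, 6, 0, 3],
    [6, 7, 7, 3, 0],
]

print(total_contradiction(TREE_D, [0, 1, 2, 3, 4]))
print(total_contradiction(TREE_D, [0, 2, 1, 3, 4]))
```

The triple loop inside `contradiction` is executed once per reference taxon, which gives the fourth power of the number of taxa.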
Let us start by considering an X-tree and the 3 vertices i, j, k as in Figure 2. The distance matrix fulfills the inequalities for perfect order. The order between the vertices i, j, k is preserved for any reference vertex not in the interval (i, k), and the inequalities $Y^n_{i,j} \ge Y^n_{i,k}$, $Y^n_{k,j} \ge Y^n_{k,i}$ hold for each of these reference vertices. Summing $Y^n_{i,j}$ over the reference vertices n gives the value $S_{i,j}$, which is central to the NJ algorithm (Saitou and Nei, 1987; Gascuel and Steel, 2006). Two vertices i, j are joined by the NJ algorithm if they maximize S (i.e. $\max(S) = S_{i,j}$). From the above discussion, it seems natural to initialize the search for the best order on the NJ tree. The search for the best order of $Y^n_{i,j}$ is initialized with the NJ algorithm and a small supplementary procedure that we describe below. Consider two vertices a and b that are joined by the NJ algorithm, and the leaves $a_1, a_2, \ldots, a_i$ (resp. $b_1, b_2, \ldots, b_j$) that have the vertex a (resp. b) as first ancestor. The best order of the leaves is chosen so as to minimize the contradiction among 4 possibilities: $(ab, \bar{a}b, a\bar{b}, \bar{a}\bar{b})$, with ab the order $a_1, a_2, \ldots, a_i, b_1, b_2, \ldots, b_j$ and $\bar{a}$ the reversed order $a_i, a_{i-1}, \ldots, a_1$. Once the order is optimized over the NJ tree, the best order is refined with a multiresolution search algorithm (Thuillard, 2001, 2007).
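The choice among the four possible orientations when two groups of leaves are joined can be sketched as follows. This is a simplified illustration, not the full initialization procedure: the contradiction is evaluated for a single reference leaf with an assumed squared penalty, and the function names and the example matrix are hypothetical.

```python
def y_matrix(d, order, ref):
    """Matrix Y for reference leaf `ref`, with rows and columns arranged as in `order`."""
    return [[0.5 * (d[i][ref] + d[j][ref] - d[i][j]) for j in order] for i in order]

def contradiction(y):
    """Assumed measure: sum of squared violations of the perfect order inequalities."""
    m = len(y)
    total = 0.0
    for i in range(m):
        for j in range(i, m):
            for k in range(j + 1, m):
                total += max(y[i][k] - y[i][j], 0.0) ** 2
                total += max(y[k][i] - y[k][j], 0.0) ** 2
    return total

def best_orientation(d, a, b, ref):
    """Pick, among a+b, reversed(a)+b, a+reversed(b) and reversed(a)+reversed(b),
    the concatenation with the lowest contradiction (first one wins on ties)."""
    candidates = [a + b, a[::-1] + b, a + b[::-1], a[::-1] + b[::-1]]
    return min(candidates, key=lambda order: contradiction(y_matrix(d, order, ref)))

# Hypothetical additive distances for the tree ((A,B),C,(D,E)); leaves indexed 0..4.
TREE_D = [
    [0, 3, 5, 5, 6],
    [3, 0, 6, 6, 7],
    [5, 6, 0, 6, 7],
    [5, 6, 6, 0, 3],
    [6, 7, 7, 3, 0],
]

# Join the group (A, B) with the group (D, C), using E as reference leaf.
print(best_orientation(TREE_D, [0, 1], [3, 2], ref=4))
```

Only four candidate orders are evaluated per join, so the supplementary procedure adds little to the cost of building the NJ tree.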

Similarity matrix for whole genome phylogenies
For whole genome phylogenies, the search for appropriate measures to estimate the evolutionary distance between taxa is still the subject of significant research efforts (Korbel et al. 2002; Kunin et al. 2005b; Yang et al. 2005; Fukami-Kobayashi, 2007). Distance matrices obtained from BlastP scores have been quite successful in generating good trees. The similarity score obtained with BlastP programs can be given a probabilistic interpretation. The statistics of high scoring segments in the absence of gaps tends to an extreme value distribution (Karlin and Altschul, 1990). The probability P of finding at least one high scoring segment is well approximated, for small values of P, by the formula $P = m_1 \cdot m_2 \cdot 2^{-Score}$, with $m_1$, $m_2$ the lengths of the 2 sequences. It follows that $Score = -\log_2 P + \log_2(m_1 \cdot m_2)$. Defining the distance d between two sequences as d = −Score and assuming equal lengths m, one has $d = \log_2(P/m^2)$. Using that definition, the distance matrix $Y^n_{i,j}$ becomes, for 3 sequences,

$$Y^n_{i,j} = \tfrac{1}{2}\log_2\frac{P_{i,n}\,P_{j,n}}{m^2\,P_{i,j}}$$

with $P_{i,j}$ the probability associated with the pair of sequences (i, j). The log term has the form of a mutual information and furnishes a measure of the similarity of the genomes i and j in reference to the genome n.
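As a small numerical check of the relation between scores, probabilities and distances, the sketch below computes the matrix element for three sequences in two ways: directly from the distances d = −Score, and from the probabilities P. It assumes that the matrix element is half of d(i, n) + d(j, n) − d(i, j) and that all sequences have the same length m. The scores and the length used here are hypothetical values.

```python
import math

def probability(score, m):
    """P = m * m * 2**(-score) for two sequences of equal length m."""
    return m * m * 2.0 ** (-score)

def y_from_scores(s_ij, s_in, s_jn):
    """Matrix element computed from distances d = -Score."""
    d_ij, d_in, d_jn = -s_ij, -s_in, -s_jn
    return 0.5 * (d_in + d_jn - d_ij)

def y_from_probabilities(s_ij, s_in, s_jn, m):
    """Same element written as the logarithm of a ratio of probabilities."""
    p_ij, p_in, p_jn = (probability(s, m) for s in (s_ij, s_in, s_jn))
    return 0.5 * math.log2(p_in * p_jn / (m * m * p_ij))

# Hypothetical scores: i and j are closer to each other than to the reference n.
y_a = y_from_scores(120, 60, 50)
y_b = y_from_probabilities(120, 60, 50, m=300)
print(y_a, y_b)
```

Both routes give the same value because the length terms cancel except for one factor of m squared, which is absorbed in the ratio of probabilities.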
Different approaches have been proposed to normalize the distance matrix using the marginal entropy (Kraskov et al. 2005), the self-score (Kunin et al. 2005b), Korbel normalization (Korbel et al. 2002) or the average score. The normalization by the self-score in the genome conservation gives some of the best results. It is based on a nonlinear weighted sum of the BlastP scores. The genome conservation method computes the distance between two taxa by normalizing the sum of reciprocal best hits between genome i and j by the self-score. The effect of duplication is limited by using only reciprocal best hits. The normalization by the self-score is important to correct, at least partially, the effect of different genome sizes. The genome conservation similarity matrix $S_{i,j}$ is obtained from $\sum(i, j)$, the sum of reciprocal best hits between the genomes of the two taxa, normalized by the self-score.
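The normalization by the self-score can be sketched as follows. The exact normalization is an assumption here: the sum of reciprocal best hits between two genomes is divided by the smaller of the two self-scores, and the distance is taken as one minus this similarity. The function name and the input values are hypothetical.

```python
def conservation_distance(rbh_sum):
    """Distance 1 - S from a matrix of summed reciprocal best hit scores.

    rbh_sum[i][j] is the sum of reciprocal best hits between genomes i and j,
    and rbh_sum[i][i] is the self-score of genome i.
    Assumption: S[i][j] = rbh_sum[i][j] / min(rbh_sum[i][i], rbh_sum[j][j]).
    """
    n = len(rbh_sum)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            similarity = rbh_sum[i][j] / min(rbh_sum[i][i], rbh_sum[j][j])
            dist[i][j] = 1.0 - similarity
    return dist

# Hypothetical summed scores for three genomes of different sizes.
RBH = [
    [1000.0, 300.0, 200.0],
    [300.0, 600.0, 150.0],
    [200.0, 150.0, 800.0],
]

DIST = conservation_distance(RBH)
print(DIST[0][1], DIST[1][2])
```

Dividing by the smaller self-score keeps a small genome that is entirely contained in a larger one at distance zero, which is one way to reduce the effect of genome size.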

Search for the best average order
The algorithms described in section 4 have been used to search for the best order. The distance matrix was computed using the data furnished by the genome phylogeny server (Kunin et al. 2005b) obtained with an e-value cut-off set to $10^{-10}$. The contradiction is significantly lower with the score $(1 - S_{i,j})$ than with the logarithm of the score. Figure 3 shows the best order after optimization with the algorithms described in section 4 followed by 5000 steps of the multiresolution search algorithm using Eq. (7) to compute the contradiction. Table 1 gives the order of the different taxa corresponding to the best order. Archaea and Eukaryota are grouped into two adjacent clusters of taxa. One observes, for Bacteria, that all the members of a class or a phylum are neighbors. All proteobacteria (together with Aquifex?) are grouped together. The best order obtained with the minimum contradiction approach differs from the NJ tree in the following respect: all spirochetes and δ-proteobacteria form a cluster. This is not the case in the NJ tree.

Interpreting minimum contradiction matrices
This article focuses on the mathematical aspects of minimum contradiction matrices. We will limit the discussion to 3 examples showing how to interpret minimum contradiction matrices. The matrix $Y^n_{i,j}$ can be imaged for different reference taxa using the best order of Figure 3 given in the annex. Figure 4 shows the matrix $Y^n_{i,j}$ using Pirellula (taxon 177) as reference taxon. The scale on the right of the figure gives the color code used to represent $Y^n_{i,j}$ after rescaling. The minimum value of $Y^n_{i,j}$ corresponds to dark blue, while the largest values are coded red. Low values of $Y^n_{i,j}$ are associated with two vertices (i, j) having a first common ancestor vertex close to the reference taxon. A cluster of adjacent taxa with large values (red cluster) can be interpreted as a group of close taxa. One observes that Archaea and Eukaryota are not only adjacent but also form a cluster. The best order in Figure 3 is obtained by minimizing the contradiction using all taxa as reference vertex at least once. The best order is therefore a kind of "average" best order. The matrix $Y^n_{i,j}$ (resp. $\sum_{n = n_1, \ldots, n_k} Y^n_{i,j}$), with n corresponding to a unique taxon (resp. a group of taxa belonging to some phylum), allows the identification of large contradictions from the best order. These contradictions can often be specifically related to the reference taxon. A loss of a gene, a lateral gene transfer or a crossover in the reference taxon modifies all elements of the distance matrix $Y^n_{i,j}$. A similar perturbation on a taxon that is not a reference taxon affects at most the row and the column corresponding to that taxon.
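The matrix obtained with a group of reference taxa can be sketched as an element-wise sum of the matrices computed for each member of the group, with the remaining taxa kept in a common order. The function names and the 5-leaf example are hypothetical, and each matrix element is assumed to be half of d(i, n) + d(j, n) − d(i, j).

```python
def y_matrix(d, order, ref):
    """Matrix Y for reference leaf `ref`, with rows and columns arranged as in `order`."""
    return [[0.5 * (d[i][ref] + d[j][ref] - d[i][j]) for j in order] for i in order]

def y_group(d, order, refs):
    """Element-wise sum of the matrices Y over a group of reference taxa `refs`.

    `order` lists the non-reference taxa and must not contain any member of `refs`.
    """
    size = len(order)
    total = [[0.0] * size for _ in range(size)]
    for ref in refs:
        y = y_matrix(d, order, ref)
        for a in range(size):
            for b in range(size):
                total[a][b] += y[a][b]
    return total

# Hypothetical additive distances for the tree ((A,B),C,(D,E)); leaves indexed 0..4.
TREE_D = [
    [0, 3, 5, 5, 6],
    [3, 0, 6, 6, 7],
    [5, 6, 0, 6, 7],
    [5, 6, 6, 0, 3],
    [6, 7, 7, 3, 0],
]

# A, B, C seen from the group of reference taxa (D, E).
GROUP_Y = y_group(TREE_D, [0, 1, 2], refs=[3, 4])
print(GROUP_Y)
```

Since each summand is perfectly ordered on a tree, the summed matrix is perfectly ordered as well, so a contradiction that survives the summation points to a deviation shared by the whole reference group.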
Many contradictions in Figure 5 can be associated with well accepted endosymbiotic events (chloroplasts in plants or mitochondria in Eukaryota). Figure 5a shows $Y^n_{i,j}$ for Archaea, Eukaryota and some Bacteria (taxa 72-116) using Rickettsiales (taxa 1-4 in annex) as reference taxa. The average best order is used to order the taxa.
Here α denotes the proportion of the genome laterally transferred (α ≤ 1) from the reference taxa R, and $R_1$, $R_2$ denote the laterally transferred sequence after further evolution into the Eukaryota genomes $E_1$, $E_2$. The observed contradiction and the small values of $Y^n_{i,j}$ for Eukaryota are consistent with a lateral transfer between the reference taxa (Rickettsiales) and Eukaryota. Let us recall here that mitochondria are believed to be the result of an endosymbiotic event involving Rickettsia (Timmis et al. 2004), an event that also resulted in the transfer of some Rickettsia genes into the nucleus of the host. Figure 5b shows the distance matrix using all Cyanobacteria as reference taxa. The elements associated with Arabidopsis and Cyanidioschyzon have lower values than both adjacent lines (resp. columns). The observed contradictions for Arabidopsis and Cyanidioschyzon merolae (a plant and a red alga) may be explained by the many genes that are found in both Cyanobacteria and plants/red algae but are absent in other Eukaryota, a hypothesis that is supported by the small value of the distance between Cyanobacteria and (Arabidopsis, Cyanidioschyzon). Chloroplasts in plants and red algae are generally considered to have originated as endosymbiotic Cyanobacteria. The low values of $Y^n_{i,j}$ for i = Arabidopsis, Cyanidioschyzon are compatible with the hypothesis that some Cyanobacteria genes have been transferred into the host.

Conclusions
For an X-tree or a split network, the minimum contradiction matrix fulfills the inequalities for perfect order ($Y^n_{i,j} \ge Y^n_{i,k}$, $Y^n_{k,j} \ge Y^n_{k,i}$, $i \le j \le k \le n$). In real applications a number of taxa may typically be in contradiction to the inequalities for perfect order. In that case, the Master Tour property does not hold. It follows that the removal or the addition of taxa in contradiction to the inequalities may change the topology of the associated NJ tree or split network.
An average best order can be obtained by searching for the best circular order over $Y^N_{i,j}$ (N = 1, …, n). The matrix $Y^n_{i,j}$ can be used to localize a problematic taxon, as large deviations from the average best order are often related to the reference taxon n. This approach was applied to whole genome phylogenies using distances computed with the genome conservation method. Several large deviations from the average best order were found to correspond to well-documented evolutionary events.

Figure 5. b) Eukaryota using Cyanobacteria as reference taxa. The arrow points to Arabidopsis and Cyanidioschyzon.

Disclosure
The authors report no conflicts of interest.