Compromise or optimize? The breakpoint anti-median

Background: The median of k≥3 genomes was originally defined to find a compromise genome indicative of a common ancestor. However, in gene order comparisons, the usual definitions based on minimizing the sum of distances to the input genomes lead to degenerate medians reflecting only one of the input genomes. “Near-medians”, consisting of equal samples of gene adjacencies from all the input genomes, were designed to restore the idea of compromise to the median problem.
Results: We explore adjacency sampling constructions in full generality in the case k=3, with given overlapping sets of adjacencies in the three genomes, where all adjacencies in two-way or three-way overlaps are included in the sample. We require the construction to be maximal, in the sense that no additional proportion of adjacencies from any of the genomes may be added without violating the local linearity of the genome. We discover that in incorporating as many adjacencies as possible, evenly from all the input genomes, we are actually maximizing, rather than minimizing, the sum of distances over all other maximal sampling schemes.
Conclusions: We propose to explore compromise instead of parsimony as the organizing principle for the small phylogeny problem.


Background
In comparative genomics, a median genome m for a set of k ≥ 3 given genomes $g_1, \ldots, g_k$ in a metric space (G, d) minimizes the sum of distances

$$\sum_{i=1}^{k} d(m, g_i) \qquad (1)$$

over all m ∈ G [1]. This is meant to embody a compromise among the given genomes, usually as an inference of a common ancestor. While the simplicity of the median concept is appealing, and it has stimulated a large literature [2], it suffers from important shortcomings: it is hard to calculate [3][4][5] for almost all (G, d), and it is not a compromise in the most important contexts. For example, for k ≥ 3 random signed permutations of length n, and for d the "breakpoint distance", the median tends to one or more of the given permutations as n increases [6][7][8].
The "near median" was proposed to get around these difficulties [9]. For k random genomes, the same proportion of gene adjacencies is sampled from each one, in such a way that the union of the samples is compatible, meaning that an "end" of a gene is adjacent to no more than one other gene end. The proportion of the compromise genome remaining to be constructed can be filled by any matching of the unassembled gene ends, as in Fig. 1.
If comparable proportions of the constructed genome are contributed by each of the k genomes, the spirit of compromise is ensured. The sampling is rapidly carried out.
In the original paper [9], only the following highly symmetrical cases were studied for k = 3: three purely random genomes; three genomes all with common adjacencies forming a proportion ψ of their adjacencies; and three genomes all with a proportion ψ of common adjacencies and additional proportions $\omega_{1,2}$, $\omega_{1,3}$, $\omega_{2,3}$ of adjacencies in their pairwise intersections. We only investigated the maximum θ such that the same proportion θ could be sampled from the three input genomes.
In the present paper we extend our analysis to examine the entire set of compatible triples $(\theta_1, \theta_2, \theta_3)$.
In the process, we discover the surprising fact that not only does our sampling procedure fail to minimize the sum in (1), it actually maximizes it! In doing so, it illustrates that the search for optimality and the search for compromise are at cross-purposes. In conclusion, we suggest how the goal of compromise may be used as a criterion for the small phylogeny problem in place of optimality.

Definitions
Consider three signed genomes, $g_1$, $g_2$ and $g_3$, each consisting of one or more chromosomes (circular orderings) containing the same n genes, and each containing n gene adjacencies. Although we assume the chromosomes are circular for technical simplicity, the analysis is essentially the same for linear, circular, unichromosomal or multichromosomal genomes; the effect of allowing a bounded number > 1 of chromosomes would be O(n) as would be the differences between circular and linear models. We also assume n is large so that for an arbitrary proportion θ, the O(1) difference between θn and the nearest integer to θn may be neglected. The probabilistic justification behind these assumptions is discussed in [9].
That the genomes are "signed" means the genes have polarity, so the two ends of a gene have distinct labels. Each adjacency is thus an unordered pair of the 2n gene ends, chosen from among $\binom{2n}{2}$ possibilities. For a genome to be "compatible", no gene end may be part of more than one adjacency. There is no constraint involving the two ends of the same gene, other than that both ends of all genes must eventually be included in any genome we construct. For example, there is no constraint against the two ends of the same gene being adjacent, forming a minimal circular chromosome.
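The definitions above can be made concrete with a small sketch. The encoding used here (a gene end as a pair such as `(3, "h")`, a genome as a set of two-element frozensets) and the function names `adjacencies`, `is_compatible` and `possible_adjacencies` are illustrative assumptions, not notation taken from the paper.

```python
from math import comb


def gene_ends(gene):
    """Return (entry_end, exit_end) for a signed gene such as +3 or -3.

    A positive gene is read tail-to-head, a negative gene head-to-tail.
    """
    g = abs(gene)
    tail, head = (g, "t"), (g, "h")
    return (tail, head) if gene > 0 else (head, tail)


def adjacencies(chromosomes):
    """Adjacency set of a genome given as a list of circular signed chromosomes."""
    result = set()
    for chrom in chromosomes:
        for i, gene in enumerate(chrom):
            following = chrom[(i + 1) % len(chrom)]
            # Join the exit end of this gene to the entry end of the next one.
            result.add(frozenset((gene_ends(gene)[1], gene_ends(following)[0])))
    return result


def is_compatible(adjacency_set):
    """True if no gene end belongs to more than one adjacency."""
    seen = set()
    for adjacency in adjacency_set:
        for end in adjacency:
            if end in seen:
                return False
            seen.add(end)
    return True


def possible_adjacencies(n):
    """Number of unordered pairs among the 2n gene ends."""
    return comb(2 * n, 2)


genome = adjacencies([[1, -2, 3], [4]])
print(len(genome), is_compatible(genome), possible_adjacencies(4))
```

In this sketch the single-gene chromosome `[4]` yields the adjacency joining the two ends of gene 4, the minimal circular chromosome mentioned above, while the union of the adjacency sets of two different genomes will generally fail the compatibility check.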
The breakpoint distance between two genomes can be defined as d = n − a, where a is the number of adjacencies they contain in common. For example, $d(g_1, g_2) = n(1 - \psi - \omega_{1,2})$, since $g_1$ and $g_2$ share the $\psi n$ adjacencies common to all three genomes and the $\omega_{1,2} n$ adjacencies in their pairwise intersection. For a genome x, the sum of the normalized distances to the three input genomes,

$$s(x) = \frac{1}{n} \sum_{i=1}^{3} d(x, g_i),$$

is called its score. A sample is defined by a triple $(\theta_1, \theta_2, \theta_3)$, each between 0 and 1 and summing to less than $1 - \psi - \omega_{1,2} - \omega_{1,3} - \omega_{2,3}$, such that a random choice of $\theta_1 n$ adjacencies from $g_1$, $\theta_2 n$ from $g_2$, and $\theta_3 n$ from $g_3$ are compatible with each other and with the adjacencies in the overlaps. A sample is "randomly completed" to form a genome with n genes by the addition of $(1 - \psi - \omega_{1,2} - \omega_{1,3} - \omega_{2,3} - \theta_1 - \theta_2 - \theta_3)n$ adjacencies constructed by randomly pairing gene ends that are not in any of the adjacencies in the sample or in the overlaps. In other words, to focus on the purely statistical consequences of the sampling procedure, we do not consider the increment in the number of adjacencies obtainable in individual instances by the ad hoc matching algorithms developed in [9]. The random completion process does not add to the number of adjacencies in the sample in common with one, two or three of $g_1$, $g_2$ and $g_3$.
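As a minimal illustration of the breakpoint distance d = n − a and of the score as a normalized sum of distances, the following sketch uses toy genomes with four positively oriented genes. The helper `circular_genome` and the function names are hypothetical and chosen only for this example.

```python
def circular_genome(order):
    """Adjacency set of one circular chromosome of positively oriented genes.

    Each adjacency joins the head of a gene to the tail of the next gene.
    """
    n = len(order)
    return {
        frozenset(((order[i], "h"), (order[(i + 1) % n], "t")))
        for i in range(n)
    }


def breakpoint_distance(g, h, n):
    """d = n - a, where a is the number of adjacencies shared by g and h."""
    return n - len(g & h)


def score(x, inputs, n):
    """Sum of the normalized breakpoint distances from x to the input genomes."""
    return sum(breakpoint_distance(x, g, n) for g in inputs) / n


n = 4
g1 = circular_genome([1, 2, 3, 4])
g2 = circular_genome([1, 2, 4, 3])
g3 = circular_genome([1, 3, 2, 4])
print(breakpoint_distance(g1, g2, n), score(g1, [g1, g2, g3], n))
```

Taking one of the input genomes itself as x gives a zero term in the sum, which is why a degenerate median that copies one input can score well without reflecting the other two.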
A "maximal" sample is one where none of the θ i may be increased without causing a number (greater than O(n)) of incompatible adjacencies.

The construction
From the three input genomes, we construct a set containing adjacencies sampled in various proportions among $g_1$, $g_2$ and $g_3$ and including the adjacencies in the given two-way and three-way overlaps, randomly completed by pairs of gene ends matched from among the remaining unsampled ends. The only constraint in adding an adjacency is that it must have two "free ends"; i.e., no adjacency previously included, whether given or sampled, may contain either of these two ends.
Note that two random permutations can be expected to have virtually no adjacencies in common; the expected number of common adjacencies tends to a small constant as n increases [10].
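The claim that two random genomes share almost no adjacencies can be checked empirically. The sketch below assumes a specific model that the text does not spell out: each genome is a single circular chromosome given by a uniformly random order with independent random orientations, and gene ends are encoded as the integers 2g (tail) and 2g+1 (head). The function names are hypothetical.

```python
import random


def random_signed_genome(n, rng):
    """Adjacency set of one random signed circular chromosome on genes 0..n-1."""
    order = list(range(n))
    rng.shuffle(order)
    flipped = [rng.random() < 0.5 for _ in range(n)]
    adjacency_set = set()
    for i in range(n):
        j = (i + 1) % n
        # An unflipped gene is read tail-to-head, so it is left through its head.
        exit_end = 2 * order[i] + (0 if flipped[i] else 1)
        # The next gene is entered through its tail unless it is flipped.
        entry_end = 2 * order[j] + (1 if flipped[j] else 0)
        adjacency_set.add(frozenset((exit_end, entry_end)))
    return adjacency_set


def mean_common_adjacencies(n, trials, seed=0):
    """Average number of adjacencies shared by two independent random genomes."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += len(random_signed_genome(n, rng) & random_signed_genome(n, rng))
    return total / trials


print(mean_common_adjacencies(500, 200))
```

Under this particular model the average stays of order one, so the shared proportion vanishes as n grows, which is what justifies treating ψ and the ω's as zero for purely random genomes.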
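As an illustration, consider the case where $\psi = \omega_{1,2} = \omega_{1,3} = \omega_{2,3} = 0$. As a first step, we may select $\theta_1 n$ adjacencies from $g_1$, where $0 \le \theta_1 \le 1$. Then for $g_2$, the expected proportion of adjacencies with "two free ends", i.e., adjacencies where neither end appears in a previously selected $g_1$ adjacency, is $(1 - \theta_1)^2$. As long as $\theta_1 < 1$, we can pick $\theta_2 n$ adjacencies from genome $g_2$ that do not conflict with any of those selected from $g_1$, with $\theta_2 \le (1 - \theta_1)^2$. Similarly, having then selected $\theta_1 n$ pairs of gene ends from $g_1$ and $\theta_2 n$ pairs of gene ends from $g_2$, the expected proportion of pairs in $g_3$ with two free ends is $(1 - \theta_1 - \theta_2)^2$. As long as this quantity is greater than zero, we can choose some $\theta_3 n$ compatible pairs from $g_3$. For a maximal sample, $\theta_3$ takes its largest possible value, $\max \theta_3 = (1 - \theta_1 - \theta_2)^2$. Since the normalized distance from the randomly completed genome x to $g_i$ is then $1 - \theta_i$, its score is

$$s(x) = 3 - \theta_1 - \theta_2 - (1 - \theta_1 - \theta_2)^2. \qquad (4)$$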
The derivative of the expression in (4) with respect to either $\theta_1$ or $\theta_2$ is zero iff $\theta_1 + \theta_2 = 0.5$. The second derivatives are negative, so the surface is concave and these stationary points are maxima.
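The free-end proportions used in this argument, $(1 - \theta_1)^2$ for $g_2$ and $(1 - \theta_1 - \theta_2)^2$ for $g_3$, can be checked by simulation. The sketch below is an illustration under assumptions: it uses single-chromosome random signed genomes with no given overlaps, an integer encoding of gene ends, and hypothetical function names.

```python
import random


def random_signed_genome(n, rng):
    """List of adjacencies of one random signed circular chromosome."""
    order = list(range(n))
    rng.shuffle(order)
    flipped = [rng.random() < 0.5 for _ in range(n)]
    genome = []
    for i in range(n):
        j = (i + 1) % n
        exit_end = 2 * order[i] + (0 if flipped[i] else 1)
        entry_end = 2 * order[j] + (1 if flipped[j] else 0)
        genome.append(frozenset((exit_end, entry_end)))
    return genome


def with_two_free_ends(genome, used_ends):
    """Adjacencies of the genome that touch none of the already used gene ends."""
    return [a for a in genome if a.isdisjoint(used_ends)]


def free_end_proportions(n, theta1, theta2, seed=0):
    """Observed proportions of g2 and g3 adjacencies with two free ends."""
    rng = random.Random(seed)
    g1, g2, g3 = (random_signed_genome(n, rng) for _ in range(3))
    used = set()
    for adjacency in rng.sample(g1, round(theta1 * n)):
        used.update(adjacency)
    free2 = with_two_free_ends(g2, used)
    for adjacency in rng.sample(free2, min(round(theta2 * n), len(free2))):
        used.update(adjacency)
    free3 = with_two_free_ends(g3, used)
    return len(free2) / n, len(free3) / n


p2, p3 = free_end_proportions(20000, 0.3, 0.2)
print(p2, (1 - 0.3) ** 2)
print(p3, (1 - 0.3 - 0.2) ** 2)
```

For large n the observed proportions should lie close to the two squared terms, which is the sense in which the sampling bounds are statements about limiting behaviour rather than about individual instances.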
Examining some values of $\max \theta_3$ and s(x) in Table 1, we confirm that the maximum value of s(x) occurs for a genome x where $\theta_1 + \theta_2 = 0.5$ and $\theta_3 = 0.25$. By symmetry, we can obtain all of:

$$\theta_1 + \theta_2 = 0.5, \quad \theta_1 + \theta_3 = 0.5, \quad \theta_2 + \theta_3 = 0.5. \qquad (6)$$

The unique solution of all three equations is $\theta_1 = \theta_2 = \theta_3 = 0.25$.
Turning to the more general case where ψ and the $\omega_{i,j}$ are not required to be zero, as illustrated in Fig. 3, Eq. (4) and Eq. (6) generalize accordingly, the latter to the equations in (8), whose unique solution maximizes s(x) over all maximal samples.
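The location of the maximum in the overlap-free case can also be checked numerically. The sketch below assumes that the score of a maximal sample takes the form $s = 3 - \theta_1 - \theta_2 - (1 - \theta_1 - \theta_2)^2$ with $\theta_3$ set to its largest value $(1 - \theta_1 - \theta_2)^2$, as the free-end argument suggests when ψ and all the ω's are zero. The grid search and the function names are illustrative only.

```python
def max_theta3(theta1, theta2):
    """Largest proportion of g3 adjacencies with two free ends (no overlaps)."""
    return (1.0 - theta1 - theta2) ** 2


def score(theta1, theta2):
    """Assumed score of the randomly completed maximal sample."""
    return 3.0 - theta1 - theta2 - max_theta3(theta1, theta2)


def grid_maximizers(steps=100, tol=1e-9):
    """Best score on a grid of feasible (theta1, theta2) and the points attaining it."""
    points = []
    for i in range(steps + 1):
        theta1 = i / steps
        for j in range(steps + 1 - i):
            theta2 = j / steps
            # theta2 cannot exceed the proportion of g2 adjacencies with two free ends.
            if theta2 <= (1.0 - theta1) ** 2 + tol:
                points.append((score(theta1, theta2), theta1, theta2))
    best = max(p[0] for p in points)
    return best, [(t1, t2) for s, t1, t2 in points if s > best - tol]


best, maximizers = grid_maximizers()
print(best)
print(maximizers[:3])
```

Every grid point attaining the best value lies on the line $\theta_1 + \theta_2 = 0.5$, and among these the symmetric choice $\theta_1 = \theta_2 = 0.25$ also gives $\theta_3 = 0.25$.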
We might imagine that it would be "fairer" to distribute adjacencies among the θ's so that each genome contributes a number of adjacencies in proportion to the number it has already contributed in ψ and the ω's. However, this is not a solution of the equations in (8) for general values of $\omega_{1,2}$, $\omega_{1,3}$ and $\omega_{2,3}$, and upon reflection, there is no reason to consider this a better compromise than an equal division of adjacencies among the three genomes, beyond the imbalances already inherent in the pairwise overlaps.

Discussion
The breakpoint median minimizes the sum of the breakpoint distances to three given genomes but in doing so forgoes any property of "compromise" among the three, despite this being the original motivation for the median. The anti-median represents a complete emphasis on "compromise" instead of on shortest distances. Somewhat surprisingly, the anti-median actually maximizes the sum of the breakpoint distances to three given genomes, in the process assuring that none of the three input genomes is disproportionately represented, other than through its given overlap with the other two genomes.
Note that the anti-median genomes are constructed to have precise normalized distances from $g_1$, $g_2$, and $g_3$, in the sense of their limiting behaviour as n → ∞. This behaviour is predicated on the inclusion of all the adjacencies in the two-way and three-way overlaps, and the completion of the sampled genome by random matching of unpaired ends. These anti-medians contrast with arbitrary random genomes, whose scores (the normalized sums of distances to $g_1$, $g_2$, and $g_3$) approach 3. At the other extreme, they also contrast with the "near medians" [9] completed by maximum matching algorithms, whose scores are less than those of the randomly completed samples constructed here.
Fig. 3 Sampling scheme showing variable proportions $\theta_1$, $\theta_2$, $\theta_3$, and given two-way intersections $\omega_1$, $\omega_2$, $\omega_3$ and three-way intersection ψ. All these contributions lower s. White area in genome h represents the randomly completed portion.

Conclusions
Median constructions form the basis of the steinerization strategy for solving the small phylogeny problem, that is, finding the ancestral genomes to populate the ancestral nodes of a given phylogeny when the genomes at the leaf nodes are known. Each ancestral node in turn is subjected to a median search, based on its three neighbors, and this is iterated until convergence. This constitutes a search for a most parsimonious solution. But if we wish ancestral nodes to reflect all three neighboring nodes (in a binary tree), there is no obstacle to using anti-medians instead of medians, and actually searching for a least parsimonious solution, so that compromise becomes the organizing principle in the reconstruction. Exploring this becomes the most important project for future work on this subject.