Pinning down ploidy in paleopolyploid plants

Fractionation is the genome-wide process of losing one gene per duplicate pair following whole genome multiplication (doubling, tripling, …). This is important in the evolution of plants over tens of millions of years, because of their repeated cycles of genome multiplication and fractionation. One type of evidence in the study of these processes is the frequency distribution of similarities between the two genes, over all the duplicate pairs in the genome. We study modeling and inference problems around the processes of fractionation and whole genome multiplication focusing first on the frequency distribution of similarities of duplicate pairs in the genome. Our birth-and-death model accounts for repeated duplication, triplication or other multiplication events, as well as fractionation rates among multiple progeny of a single gene specific to each event. It also has a biologically and combinatorially well-motivated way of handling the tendency for at least one sibling to survive fractionation. The method settles previously unexplored questions about the expected number of gene pairs tracing their ancestry back to each multiplication event. We exemplify the algebraic concepts inherent in our models and on Brassica rapa, whose evolutionary history is well-known. We demonstrate the quantitative analysis of high-similarity gene pairs and triples to confirm the known ploidies of events in the lineage of B. rapa. Our birth-and-death model accounts for the similarity distribution of paralogs in terms of multiple rounds of whole genome multiplication and fractionation. An analysis of high-similarity gene triples confirms the recent Brassica triplication.

least four or five collinear pairs of highly related genesparalogs -in close succession in different regions of the genome, known as paralogous synteny blocks. Insofar as all the paralog pairs in a paralogous synteny block resemble each other to the same extent, this indicates that there was a duplication of the chromosomal region containing them, which can then be dated approximately according to the degree of DNA sequence divergence. If there are many syntenic blocks of the same age throughout the genome, this is suggestive of a whole genome duplication at that point in time.
In 2007 Jaillon et al. noted that syntenic regions in the genome of grape (Vitis vinifera) were distributed as triples, not just duplicates [1]. Much of the genome could be partitioned into seven sets of three syntenic regions, indicative of a whole genome triplication over 100 M years ago, producing a 21-chromosome grape ancestor from a 7-chromosome precursor. Of interest is that in each triplet of regions, forming three pairs of regions, there were many duplicate gene pairs -involving just two regions -but very few actual triples of three highly related genes, one in each region. In addition there were very few duplications within a single region, or between genes in two different triplets among the seven sets of grape triplets of regions. The 21-chromosome construct has since been widely recognized as the ancestor of the core eudicots. The principle of three-way similarities among syntenic regions, indicated by • some duplicated pairs between each two of the three regions, • with or without any triplicated genes, • no pairs within a single region and • no pairs between different triples of regions, is the signature pattern for ancient whole genome triplication, or paleohexaploidy. This may be generalized in straightforward ways to octoploidy and higher multiplicities of polyploidization. For example, an ancient octoploidization would be reflected in 4-tuples of regions, where the would be some duplicated gene pairs between each of the 4 2 = 6 pairs of regions, but no gene pairs within regions and no gene pairs between different 4-tuples.
Another type of important evidence in analyzing ancient polyploidization events is the distribution of coding sequence similarities between two paralogous genes. All flowering plants, and indeed most land plants, have at least one, and generally two, three or more polyploidizations in their history. The distribution of similarities is then a mixture of distributions, each of which is centered at a similarity value indicative of the age of one of the polyploidizations. We have developed a model for predicting the shape of these distributions based on the event times, the ploidy multiplicities of the events, rates of loss of duplicate genes from the genome (fractionation), and rates of sequence divergence [2]. This model produces a paralog tree in the form of a birth and death process with one biologically-motivated constraint, which remains mathematically tractable and whose parameters are well suited to statistical inference. Because of a tradeoff between ploidy and fractionation rates, however, in many instances the multiplicity of the various ploidy events in the evolution of a genome cannot be determined uniquely, which is a severe problem for understanding its history.
One goal of this paper is to remedy this shortcoming by combining the syntenic approach pioneered in [1] with the paralog tree model of [2] to produce a method capable of estimating the multiplicity of the polyploidization events, as well as the fractionation parameters.
The next section summarizes the general model for generating the distribution of paralog similarities. This is followed by a brief section describing the inference of the parameters. We then focus on two particular instances, one where a hexaploidization (whole genome triplication) precedes a tetraploidization (whole genome duplication), and the other where the triplication follows the duplication. The difficulty of ploidy inference is illustrated with data from the turnip, or Napa cabbage (Brassica rapa) genome, and investigated in algebraic detail. In a section entitled "Counting triples", we introduce the method inspired by [1] for distinguishing whole genome triplication from whole genome duplication, given the distribution of duplicate gene similarity, and we apply this to confirm the known sequence of events in the ancestral history of this species.

The general model
We summarize and correct a new and general model [3] for the repeated cycle of polyploidization events, each followed by fractionation. This model allows an arbitrary number of events and rates of fractionation of the progeny of any gene holding across the entire genome after each event. From this we calculate expected numbers of duplicate gene pairs, at the time of observation (i.e., the present time), originating at each of the historical polyploidization events, leading to the prediction of the entire distribution of similarities, using standard models of mutational processes.
The model is a continuous-time birth-and-death process with the entire population synchronized as to birth times and number of progeny, but with the number of deaths of the siblings in each individual "litter" determined probabilistically.

The birth-and-death process
The process starts with m 1 ≥ 1 genes at time t 1 ; at times t 1 < · · · < t n−1 for some n ≥ 1, each existing gene is replaced by r 1 , . . . , r n−1 progeny, respectively, where each r i ≥ 2. As illustrated in Fig. 1, for each gene's progeny, at least one and at most r i genes survive until time t i+1 , as governed by a probability distribution u The results are observed at time t n , namely a measure of similarity (e.g., coding sequence similarity) between all pairs of genes in the population, where the m 1 original genes are considered to be unrelated or too remotely related to be considered.
Let there be m i genes at time t i , let a (i) 1 , . . . , a (i) r i be the number of cases where 1, . . . , r i copies survive j is the number of times j progeny survive. In the diagram, thin solid lines represent individual progeny that survive, and thick grey lines represent the total progeny of a gene that do not survive. In this example the only gene, all of whose progeny survive, is g 4 . Here a (i) Note that there is no provision for g to have zero surviving descendants (i.e., a (i) 0 ≡ 0); these genes would be considered as leaving no evidence for their existence and are not counted in m i . Note that that m i+1 = r i j=1 ja j to represent the probability that j of the r i potential copies survive to time t i+1 , for j = 1, . . . , r i .
Thus the probability distribution of the evolutionary histories represented by The expected number of genes at time t n is then Similarly, we write for the probability measure over all events starting at time t j with m j genes, and preceding time t k . In this case the expected number of genes at time t k is

The paralog pairs
Having characterized the origin and survival of individual genes and their descendants in the environment of recurrent polyploidization and fractionation, we can now focus on the pairs of genes observed at time t n . Our discussion is illustrated by Fig. 2.
For each of the a (i) j genes with j surviving copies, j ≥ 2, there are j 2 surviving pairs of genes. If j = 1 there are

Fig. 2
Counting t i -pairs. The three unfractionated progeny of gene g define three t i -pairs, as indicated by three ovals. We follow the pair contained in the uppermost oval, as the two members at time t i+1 independently (shaded triangles) evolve into m n and m n genes, respectively, at time t n , defining m n m n t i -pairs at time t n no pairs. The total number of pairs created at time t i and surviving to time t i+1 is thus These are called the t i -pairs at time t i+1 . The expected number of such pairs is At time t j , for i + 1 ≤ j ≤ n, any two descendants of the two genes making up a t i -pair with no more recent common ancestor is also called a t i -pair (at time t j ). In other words, for any two genes at time t j , they form a t i -pair if their most recent common ancestor underwent polyploidization at time t i .
For a given t i -pair g and g at time t i+1 , where i < n − 1, the expected number of pairs of descendants d (i,n) having no more recent common ancestor than g and g , will be where m i+1 = 1. This follows from the independence of the fractionation process between time t i and time t i+1 and both parts of the process starting with g and g .
Not all the m n genes in Eq. (2) are in pairs. Because of fractionation after every polyploidization event, some genes will remain unpaired. We have where m * is the current number of unpaired genes.
The terms E d (i,i+1) and E (i+1,n) (m n ) in Eq. (7) both involve calculating probabilites with Eq. (3). As n and the r i increase, so do the m i , and this becomes computationally very expensive, due to the product of multinomial coefficients in Eq. (3) and the sum of many such probabilities in Eq. (4). Nevertheless, making use of the recursive nature of these calculations allows for more efficiency than the explicit generation of evolutionary histories and the counting of pairs within each one.

The Distribution of Similarities
Knowing the expected number of pairs of genes originating at each WGD in the past is the first step in predicting the full distribution F of similarities. The second step is to derive the actual distribution of gene pair similarities, or an appropriate approximation to it, for t i -pairs.
One way gene pair divergence may be measured is in terms of a probability p reflecting similarity -the proportion of nucleotide positions that are occupied by the same base in the two orthologs (or paralogs).
Besides p, the other important parameter is G, reflecting average gene length in terms of the number of nucleotides in the genes' coding region. Because this length varies greatly, in practice G needs to be estimated.
In the simplest case, the distribution of similarities is the binomial B(G, p i ), where and is related to the time t i ∈[0, ∞) elapsed since the event that gave rise to the pair. This distribution has The densities of similarities of t i -pairs can be approximated by a normal distribution N μ i , σ 2 i (as long as p i is not too close to 1.0), and the expected frequency by We can predict the entire frequency distribution over all t i as: keeping in mind that the model also predicts unpaired genes according to Eq. (8). Then the predicted relative frequencies become . (13)

Inference
The distribution of gene pair similarities is of the form f (k), where k = k min , . . . , k max . The data may also include f * , the frequency of unpaired genes. The value of k min is set to eliminate pairs due to noise or to polyploidization events earlier than those of immediate interest. At the other extreme, k max is set somewhat lower than 100%, in order to remove any effects of heterozygosity, whereby an apparent duplicate gene pair actually consists of two alleles of a single gene, rather than two genes at different positions in the genome.
The likelihood of a model, given some data set is where q depends only on the parameters of the model, and the maximum likelihood estimators of the parameters of a model can be found by maximizing with respect to these parameters.

Two models for one dataset
For a given instance of the above model, if we know some of the parameters, we can infer the others. This includes • the t i : the times of each event, • the fractionation rates, • λ: the divergence rate, and • G : the gene length parameter.
However, we cannot easily estimate the r i , the event ploidies, from the distribution of paralog pair similarities. To understand why, we consider an important example, namely the outcome of two events, a tetraploidy leading to a whole genome duplication and hexaploidy, leading to a whole genome triplication. We will illustrate with data on the Brassica rapa genome [4], member of the Brassica genus, known to have undergone a triplication after an earlier duplication shared with other Brassicales genera, such as Arabidopsis, as shown in Fig. 3.
The distribution of gene pair similarities derived from SYNMAP (on the COGE platform [5,6]) is shown in Fig. 4. Only the recent event, the Brassica triplication, is clearly visible as a distinct peak in the histogram, but the voluminous tail at early similarities attests to the effect of the earlier Brassicales duplication. Brassicales is a rosid order and as such also descends from the γ core eudicot triplication, which would have produced pairs with around 70% similarity, but very few remained in synteny blocks, so for the purposes of our subsequent analysis, we ignore this event. Indeed, we imposed no bounds k min or k max on the data produced by SYNMAP.
We can explore the discriminatory power of our method by fitting two models to these data, one where a whole genome duplication precedes a triplication, known to be true, and an incorrect one where the duplication follows the triplication.
The calculations leading to Eqs. (7) and (8) are not lengthy in the case of these two models, and are portrayed schematically in Figs. 5 and 6, where u, v, w, x, y and z are probabilities that can be fixed independently of each other, as long as u + v < 1 and x + y < 1. (To avoid trivial models in either case, u, w, x and z must be greater than zero and less than 1, while v and y must be greater or equal to zero and less than 1. We will term these valid models.) In Fig. 5, w = u (1) 2 , the probability that both offspring survive until time t 2 after the first duplication at time t 1 , so that 1 − w is the probability that only one survives. After the triplication at time t 2 , the probabilities are u = u (2) 3 and v = u (2) 2 that three offspring or two offspring, respectively, survive until time t 3 .
Looking at the second paralog tree in the left-hand column of Fig. 5, for example, Eq. (1) becomes The coefficient 2 in this expression corresponds to the two "versions" of the diagram with the colours of the gene pair and gene triple permuted. Note that the branches without coloured dots in most of the paralog trees are simply meant to be suggestive of the fractionation process, do not reflect anything in Eq. (1), and are not involved in the colour permutations in counting the number of versions. The number of t 1 and t 2 pairs at time t 3 , as calculated in Eqs. (5)(6)(7)(8), can be counted directly for each tree in the figure.
Turning to Fig. 6, z = u (2) 2 , the probability that both offspring survive until time t 3 after the second event (duplication) at time t 2 , so that 1 − z is the probability that only one survives. After the triplication at time t 1 , the probabilities are x = u (1) 3 and y = u (1) 2 that three offspring or two offspring, respectively, survive until time t 2 .
The expected number of pairs in the duplication precedes triplication model (Fig. 5) is given by: The same quantities in the triplication-first model (Fig. 6) are: Can the principle of maximum likelihood discriminate between the two models? The likelihood of either model depends only on the q(i) and q * in Eq. (14). Then the parameters of one model can be related to a set of parameter values in the complex field (i.e., not necessarily probabilities) with the same likelihood in the other model through the equations: This implies that all maximum likelihood solutions in both models have the same likelihood. Furthermore, for large enough samples we can expect at least one maximum likelihood solution in each valid model. Because the parameters are underdetermined, the likelihood depending only on the q(i) and q * and not the absolute frequencies f (i) and f * , there may be several solutions. In addition, if in one model the maximum likelihood solution involves parameters which satisfy the conditions of a valid model, this is not necessarily true of the corresponding parameters in the other model.
We may then ask, do Eqs. (17) and (18) determine a bijection between some valid model in (u, v, w) space and some valid model in (x, y, z) space? The answer is determined by the intersection of (u, v, w) ∈ [0, 1] 3 ∩ {u + v < 1} and (x, y, z) ∈[0, 1] 3 ∩ x + y < 1 and the algebraic variety determined by the system in Eq. (19).
By systematically exploring a three-dimensional grid in each cube, we located all points in the valid region of the (u, v, w) cube for which Eq. (19) produced points in the valid region of the (x, y, z) cube. This produced two surfaces as depicted in Fig. 7 between which the two models have a correspondence. Outside of this volume, Eq. (19) have only solutions that are complex or outside one or both valid regions.
In Fig. 7, we see that the multiple maximum likelihood solutions for the B. rapa data in (u, v, w) space form a linear subspace, only part of which also contains solutions for the (x, y, z) model.
To restrict the set of solutions, we may make use of f * , the observed number of unpaired genes, which is not directly involved in the likelihood maximization -only q * is. This number is 17,751. Unfortunately, we have no access to the number of genes in the ancestral genome preceding the two polyploidization events, but we can guess, based on core eudicots that have escaped polyploidization after γ , such as grape [1] and Robusta coffee [7], that 25,000 is a reasonable value. Of genomes that have polyploidized and fractionated, many, such as Utricularia, papaya, Mimulus or most pertinent, Arabidopsis, have returned currently to gene numbers less than 29,000 [8]. This means that of 25,000-29,000 ancestral genes, the average number of currently unpaired genes per original gene is 0.61-0.71. In Fig. 7, the grey dots at the extreme left represent valid solutions in the (u, w, v) space but not the (x, y, z) space, predicting 0.65-0.68 unpaired genes while for the blue dots corresponding to valid solutions in both spaces the predicted number is only 0.53-0.55, suggesting an ancestral gene complement of 32,000-34,000, which seems unlikely. These calculations thus lend more credence to the model where duplication precedes triplication.
In this model, we have used u and v as two independent parameters controlling fractionation for the triplication event, and similarly x and y in the triplication-first model. This may represent excessive parametrization, however, since there are very likely biological constraints on such pairs of parameters, though this has not yet been modeled or studied empirically. A reasonable way of modelling this is to postulate a constrained binomial process for the fractionation loss of one or two genes of each triple generated by the triplication event. Thus we may replace u and v by using a single parameter s and replace x and y by a single parameter h as follows: The investigation into the connection between the two models starts with There being only two parameters in each model, we can use only two of the Eqs. (21) to understand the correspondences between these models. In this case there is a bijection between (s, w) ∈[0, 1] 2 and (h, z) ∈[0, 1] 2 . However, corresponding points in the two spaces reflect different numbers of unpaired genes (and of t 1 pairs and t 2 pairs).
The vertical axis in Fig. 8 represents the difference between the predictions of unpaired genes in the two models, with the red line tracing the values where the difference is zero. The blue line represents the maximum likelihood solutions for the B. Rapa data. The yellow dots identify (s, w) solutions that predict 0.6-0.68 currently unpaired genes per ancestral gene. Far from the red line, we conclude that the triplication-first model does not represent reality.

Counting triples
We have seen that in contrast to the divergence and fractionation parameters, the ploidy of whole genome multiplication events is not easy to infer from the distribution of gene pair similarities. There are more direct ways, however, to establish the ploidy of these events. Most obvious in our case is the detection of highly similar, i.e., recently diverged, triples of genes or triples of chromosomal regions, as evidence of the late triplication model.
The total number of genes in the CoGe B. rapa data set is 41,020. The application of SYNMAP to compare B. rapa with itself (cf. Fig. 4) shows there are 14 triples of paralogous regions made of long synteny blocks with relatively little interruption. These regions cover 80% of the genome, as can be seen in Fig. 9. For some of these triples, one or more of the three regions are divided among two or three chromosomes, due to genome rearrangement processes. But they all display the signature pattern of recent triplication enunciated by Jaillon et al. [1], namely large numbers of highly similar gene pairs among the three regions and relatively few highly similar gene pairs within each region, or between the regions and other parts of the genome.
The total number of gene pairs detected by SYNMAP is 22,406, including 13,716 (61%) where the members are in two paralogous regions and have high similarity, defined as 81% or higher. Of these, for the overwhelming majority, both members are in the 14 triples of regions.
Looking at all the high-similarity pairs, a large number of these form gene triples, 2392 of them, that are not part of a 4-tuple or higher. The great majority of these gene triples, 2118, are located in the 14 triples of high-similarity regions.
We can contrast this situation with the predictions of the triplication-first model. If the time of the duplication corresponded to p = 0.87, then there would be many pairs with similarity > 0.81, but few triples where all the pairs satisfied > 0.81. There would, however, be many triples where all three pairs had similarity < 0.81. This is clearly not the case.

Conclusion
The decomposition of gene pair similarity distributions into a number of normal distributions has been staple of comparative genomics. Statistical mixture of distributions methods [9] have been used extensively, to detect the distributions, to find their means and to test their significance. Because these are general methods, they do not take into account the biological processes that gave Fig. 8 (s, w) surface showing the difference in the number of unpaired genes between corresponding points in the two models. The red line indicates points where the two models agree on the number of genes, but this does not intersect with the set of MLE solutions on the blue line). In particular, the yellow dots indicating realistic numbers of unpaired genes are far from the red line rise to the distribution and thus may lead to meaningless results. For example, they can find significance in a two-distribution decomposition, in which the one reflecting an earlier event has smaller variance than the most recent one, a biological impossibility. They can produce a decomposition where a peak with a large amplitude is succeeded by one with very small amplitude, again biologically implausible.
For distributions reflecting genuine events, mixture methods may provide accurate measures of the timing of these events, but offer little else of biological interest. Our models go further, allowing, for the first time, the estimation of fractionation rates from pair similarity distributions. We have proposed algebraic machinery for comparing competing models, and as an illustrative test, used it to confirm what was already well-known, that the B. rapa genome triplication is more recent than its duplication event.
At the end we must conclude, despite the unexpected insights provided by mathematically modeling the genome multiplication-fractionation cycle, that to decide on the ploidy of the multiplication events, the strongest evidence, at least for the most recent events, is found in the tabulation of high-similarity pairs, triples, or other multiples. If few of the high-similarity pairs are in triples or other tuples, then the most recent event is likely to have been a tetraploidization. If a large proportion of the pairs are in triples but not in higher tuples, the event must have been a hexaploidization.
By judiciously parsing the similarity axis using cut-off values between peaks of the distribution, or between the mean values of inferred normal components of the overall distribution, we might hope in some cases to extend this simple approach to find the multiplicity of the earlier events,