Peripheral structures in unlabelled trees and the accumulation of subgenomes in the evolution of polyploids

leaf-labelling


Introduction
Virtually all known flowering plants evidence at least one polyploidization event in their lineage, many of them two, three or more. These may have been autopolyploidizations, where a new genome is created consisting of two or more exact copies, or subgenomes, of the parent genome, leading to high rates of multivalent pairing of chromosomes during meiosis, or allopolyploidizations, where the subgenomes come from distinct parental species, and where normal bivalent pairing is more general. In both processes, the entire genome, including each chromosome and each gene, is present in two or more copies or versions. It is not uncommon for a present-day flowering plant to have undergone two, three or even more of these events. For example, the canola (Brassica napus) genome was formed by the tetraploidization of cabbage (Brassica oleracea) and Napa cabbage (Brassica rapa), which are divergent species originating in an earlier Brassica hexaploid. Furthermore, the preceding history of this lineage contains two successive tetraploidization events and a hexaploidization, known as α, β and γ, respectively, the latter reaching back some 120 Mya (Jiao et al., 2014). Maize (Zea mays) is the product of three or four tetraploidizations, the most recent one some 12 Mya (Gaut and Doebley, 1997), as well as the earlier ρ, σ and τ events dating as far back as 50 Mya (Bowers et al., 2003). These examples illustrate the important fact that two polyploids with $m \ge 1$ and $n \ge 1$ subgenomes may combine to form a polyploid with $m + n$ subgenomes.
As a consequence of polyploidization, the genomes of these plants will contain multiple versions of each of the n chromosomes. In autopolyploids, these "homeologous" chromosomes are initially identical copies, while in allopolyploids they will be similar but may have small differences. In any case the homeologous chromosomes eventually diverge due to random point mutation, gene loss or gain and intra- or inter-chromosomal rearrangements.
To study the evolution of a polyploid, it is important to understand the sequence of polyploidization events that gave rise to the current genome. This is not a trivial question. For instance, in the octoploid strawberry genome, the sequence of polyploidy events is controversial (Edger et al., 2019; Liston et al., 2019). In another example, the allopolyploid origin of the octoploid Alpine Grey Willow is a matter of speculation (Wagner et al., 2020). It is in the context of this general question that we investigate the evolutionary history of an octoploid species of sugarcane.
In an autopolyploid like sugarcane, where there is little or no exchange between non-homeologous chromosomes, each of the n sets of n homeologous chromosomes could be studied as if it were a set of n independent species related by a phylogenetic tree T, descending from a common ancestor and diverging through random mutation. For present purposes, we assume tree T has a binary branching form. And if we were to observe the n sets of n homeologous chromosomes, these could be studied as n independent observations of T. More specifically, for each of the genes contained in the n original chromosomes, the n present-day descendants of each of these genes could be considered to be related by the same phylogeny T.
A crucial difficulty in this approach is that there is usually no clear correspondence connecting the non-homeologous chromosomes deriving from each of the n constituent subgenomes: no labelling of the leaves in one set of n homeologs has a given relationship with a labelling in any of the other sets. Thus we can only compare unlabelled trees, and see whether they share common structural characteristics derived from a common evolutionary history.
As a result of substantial variability in the rates of evolutionary divergence of the chromosomes, the resemblance between two of the n homeologs no longer perfectly reflects the time elapsed since their common ancestor. Nevertheless, the effects of this divergence would affect the most closely related chromosomes the least. Even if the topology of a tree inferred from the resemblances among the n homeologous chromosomes no longer statistically reflected the earliest branching events, in particular the root of T, there might still be enough signal to detect the most recent events. In general the recent events should be reflected in the peripheral subtrees of T, such as the pairs or triples of the leaves. Our goal in this paper is to devise a test of whether the number of these subtrees differs from that predicted by a null hypothesis.
Perhaps the best known pathway to polyploidy of degree greater than tetraploid is the formation of a tetraploid followed closely in evolutionary time by the recruitment of a third subgenome. This appears to be the process underlying the γ hexaploidy as well as the Brassica hexaploidy mentioned above. And this may be a more general way for polyploids to accumulate subgenomes. Our statistical tests for the number of terminal pairs, triples, etc., are based on the null hypothesis that the subgenomes are added one at a time to a vertex inserted on some existing branch of the tree. Under this hypothesis, we derive a recurrence for $P_{2i,n}$, the probability of $i$ terminal pairs in a binary tree after $n$ generations. We explore a number of avenues to find a closed form for $P_{2i,n}$. Though in applications $n$ is not a very large number, there is mathematical interest in studying the asymptotic behaviour of the model. We compute the probabilities for a large range of $n$. It appears that the expected values, normalized by the number of terminals, are asymptotically equal to $1/4$. Furthermore, the normalized variances approach $1/16$, so that the limiting distribution is under-dispersed. In addition, we recursively calculate the probability distributions of other peripheral structures in a binary tree generated after $n$ steps under the null hypothesis. The peripheral structures explored in this paper are triples and the two possible types of quadruples. We conjecture that the normalized expected values of the number of triples, type I quadruples, and type II quadruples are asymptotically equal to $1/8$, … and $1/16$, respectively. As with the distribution of terminal pairs, the limiting distributions of the number of triples and of the numbers of the two types of quadruples are under-dispersed as well.
We apply our analysis to data on sets of homeologous genes from the sugarcane species Saccharum officinarum (Zhang et al., 2021), where $n = 8$ and $n = 10$. We find that the necessary resemblance data from the genes in sets of homeologous chromosomes are usually incomplete, either because one or more genes have been lost over the course of evolution, or because of other experimental and measurement difficulties. This leaves us with very few sets of genes as samples to infer a consensus T in any of the n subgenomes. Thus we adapt our methods to cases where there are only 7 or only 6 homeologous genes. We apply the same phylogenetic procedure and derive appropriate weightings for the tree bipartitions obtained as evidence for T. We discuss the implications for alternative hypotheses after rejecting the null hypothesis.

Distribution of paired terminals
Consider a stochastic process defined on full-binary, free, unlabelled trees in which the transition from one generation to the next is made by subdividing a randomly chosen edge and linking the new internal vertex to a new terminal vertex. In this section, the quantity of interest is the number of terminal pairs, and the initial state is the tree on $n = 4$ terminals. Let the random variable $\beta_n$ be the number of terminal pairs in the $n$th generation, with probability distribution $\{\Pr\{\beta_n = i\} := P_{2i,n},\ i = 2, 3, \ldots, \lfloor n/2 \rfloor\}$. The distributions can be obtained through the following procedure.
Proposition 2.1. For $i = 2, 3, \ldots, \lfloor n/2 \rfloor$ and $n = 4, 5, \ldots$, the probability distributions $P_{2i,n}$ satisfy the following bivariate recurrence relation:
$$P_{2i,n} = \frac{n+2i-4}{2n-5}\,P_{2i,n-1} + \frac{n-2i+1}{2n-5}\,P_{2i-2,n-1},$$
with $P_{4,4} = 1$ and $P_{2i,n} = 0$ if $i < 2$ or $i > n/2$.
Proof. This can be proved by induction. See Fig. 1 for an example. A tree $b_n$ with $2i$ paired terminals can be derived from a tree $b_{n-1}$ with either $2i$ or $2i-2$ paired terminals. Regardless of the number of paired terminals, a tree $b_{n-1}$ has $|E(b_{n-1})| = 2n-5$ and $|C| = n-4$.
The probability of having a tree $b_{n-1}$ with $2i$ paired terminals is $P_{2i,n-1}$. For this tree to lead to a tree $b_n$ with the same number of paired terminals, an edge labelled either $a$ or $c$ must be picked from all the edges in $b_{n-1}$ for subdivision and linking to a new terminal.
Thus, the probability of this case is $\frac{2i+n-4}{2n-5}P_{2i,n-1}$. On the other hand, a tree $b_{n-1}$ with $2i-2$ paired terminals has $(n-1)-(2i-2)$ edges labelled $b$ that can be picked to raise the number of paired terminals. Therefore the probability of this case is $\frac{n-2i+1}{2n-5}P_{2i-2,n-1}$. Overall, the probability of having $2i$ paired terminals in $b_n$ is $\frac{n+2i-4}{2n-5}P_{2i,n-1} + \frac{n-2i+1}{2n-5}P_{2i-2,n-1}$ (see Fig. 2). A tree on $n$ terminals has at least 2 and at most $\lfloor n/2 \rfloor$ terminal pairs. Therefore, if $i < 2$ or $i > n/2$, we have $P_{2i,n} = 0$. At the initiation of the process, the probability of having 2 terminal pairs in a tree on $n = 4$ terminals is one, since it is the only possible tree topology. This completes the proof. ∎
We computed the probabilities of the random variable $\beta_n$ for the first 5,000 generations. The results, as shown in Fig. 3, lead us to the following conjectures. Although we can use the recurrence relation to compute the probability distributions exactly, finding its closed-form solution is challenging. In what follows, we explain our attempt regarding its closed-form solution.
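The recurrence of Proposition 2.1 is straightforward to evaluate numerically. The following is a minimal sketch rather than the authors' implementation: the function names `pair_distribution` and `mean_and_variance` are ours, and exact rational arithmetic is used (an assumption made for illustration) so that small cases can be checked by hand.

```python
from fractions import Fraction


def pair_distribution(n_max):
    """Return {i: Pr(i terminal pairs)} for the tree on n_max terminals.

    Implements P_{2i,n} = (n+2i-4)/(2n-5) P_{2i,n-1}
                        + (n-2i+1)/(2n-5) P_{2i-2,n-1},
    starting from the single tree on 4 terminals (2 pairs, probability 1).
    """
    if n_max < 4:
        raise ValueError("the process starts from a tree on 4 terminals")
    dist = {2: Fraction(1)}
    for n in range(5, n_max + 1):
        edges = 2 * n - 5  # number of edges of the tree on n-1 terminals
        new_dist = {}
        for i in range(2, n // 2 + 1):
            stay = Fraction(n + 2 * i - 4, edges) * dist.get(i, 0)
            grow = Fraction(n - 2 * i + 1, edges) * dist.get(i - 1, 0)
            if stay + grow:
                new_dist[i] = stay + grow
        dist = new_dist
    return dist


def mean_and_variance(dist):
    """Mean and variance of the number of terminal pairs."""
    mean = sum(i * p for i, p in dist.items())
    second = sum(i * i * p for i, p in dist.items())
    return mean, second - mean * mean
```

For example, `pair_distribution(6)` gives probability 6/7 for two pairs and 1/7 for three pairs, and `mean_and_variance(pair_distribution(200))` can be divided by 200 to compare with the normalized limits discussed in the text.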

Distribution of triples
In this section, the quantity of interest is the occurrence of a substructure, which we refer to as a triple, consisting of a pair of terminals and an unpaired terminal in addition to two internal edges, as shown in Fig. 5.
Proof. Assume the tree $b_n$ at the $n$th generation contains $m$ triples and $k$ terminal pairs. This tree can be derived from a tree $b_{n-1}$ with $m+1$, $m$ or $m-1$ triples and with $k$ or $k-1$ terminal pairs.
If the tree $b_{n-1}$ also had $k$ terminal pairs, then either a paired terminal $a$ or an internal edge $c$ has been chosen for the subdivision. In this case, the tree $b_{n-1}$ could not have had $m+1$ triples, since the number of triples decreases only if a non-paired terminal $b$ of a triple is picked for subdivision. If the tree $b_{n-1}$ had $m-1$ triples, then there are $k-(m-1)$ substructures, each of which consists of a terminal pair $a$ in addition to an adjacent internal edge $c$, such that the substructure is not a subgraph of any triple in the tree. There are $3(k-(m-1))$ edges involved in these substructures. Hence, the probability of this case is $\frac{3k-3m+3}{2n-5}P_{m-1,k,n-1}$. However, if the tree $b_{n-1}$ had the same number $m$ of triples as the current tree $b_n$, then any edge of the tree $b_{n-1}$ could have been picked, provided that its subdivision and the linking to a new external vertex neither increases nor decreases the number of triples.
Hence, there would be $(2n-5) - \big((n-1-2k) + 3(k-m)\big) = n-4+3m-k$ edges which can be picked so that the number of triples remains the same. Thus, the probability of this case is $\frac{n-4+3m-k}{2n-5}P_{m,k,n-1}$.
In what follows, we analyze the distribution of another peripheral substructure of the trees. This substructure, which we refer to as a type I quadruple, comprises a pair of terminal pairs descending from a single ancestor, along with the two internal edges that are adjacent to the ancestor and to the pairs' ancestral nodes, as shown in Fig. 7.
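Let $\varphi_n$ be the random variable defined by the number of type I quadruples in the $n$th generation, with probability distribution $\{\Pr\{\varphi_n = f\},\ f = 0, 1, \ldots, \lfloor n/4 \rfloor\}$. By marginalizing over the number of terminal pairs and of triples, we have:
$$\Pr\{\varphi_n = f\} = \sum_{k}\sum_{m} \Pr\{\varphi_n = f,\ \tau_n = m,\ \beta_n = k\}$$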
edges, none of which is a non-paired terminal $b$ or one of the three edges of a substructure on which a subdivision would increase the number of triples; hence, the probability of this case is $\frac{n-4-k+3m}{2n-5}P_{f,m,k,n-1}$. On the other hand, if the number of terminal pairs has been increased by the transition, then the subdivision must have occurred on a non-paired terminal which is not involved in the construction of any triple. The number of such edges is $(n-1) - \big(2(k-1) + m\big)$. Therefore, the probability of this case is $\frac{n-2k-m+1}{2n-5}P_{f,m,k-1,n-1}$. This completes the proof of the proposition. □
We computed the probabilities of the random variable $\varphi_n$ for the first 500 generations. The results lead us to the following conjectures.
In this section, we shed light on the distribution of another type of quadruple, consisting of a terminal pair and two non-paired terminals in addition to three internal edges: one of these is adjacent to the terminal pair and to one of the non-paired terminals, and the other two are adjacent to the non-paired terminals in consecutive order, as shown in Fig. 9. We refer to this substructure as a type II quadruple.
Let $\theta_n$ be the random variable of the number of type II quadruples in the $n$th generation, with probability distribution $\{\Pr\{\theta_n = e\},\ e = 0, 1, \ldots, \lfloor n/4 \rfloor\}$. By marginalizing over the number of terminal pairs and of triples, we have (see Fig. 10):
If the tree $b_{n-1}$ had $m-1$ triples, then there are $k-(m-1)$ substructures, each of which consists of a terminal pair in addition to an adjacent internal edge, such that the substructure is not a subgraph of any triple in the tree. There are $3(k-(m-1))$ edges included in these structures that can be chosen for subdivision. Hence, the probability of this case is $\frac{3k-3m+3}{2n-5}P_{e,m-1,k,n-1}$. Otherwise, if the tree $b_{n-1}$ had $m$ triples, then, based on its number of terminal pairs, there are two possible cases. If the tree $b_{n-1}$ also had $k$ terminal pairs, then there are $(2n-5) - \big((n-1-2k) + 3(k-m) + 4(m-e)\big) = n-k-m+4e-4$ edges which can be picked so that the numbers of terminal pairs, triples, and type II quadruples remain the same. Therefore, the probability of this case is $\frac{n-k-m+4e-4}{2n-5}P_{e,m,k,n-1}$.
The probabilities of the random variable $\theta_n$ for the first 500 generations were computed. The results lead us to the following conjectures.
Conjecture 4.5. Let $\bar{\theta}_n$ be the expected value of the number of type II quadruples in the $n$th generation of the process of randomly subdividing an edge so as to link the new internal vertex to a new external vertex. Then $\bar{\theta}_n / n \to 1/16$ as $n \to \infty$.
Conjecture 4.6. Let $\bar{\theta}_n$ and $\sigma_n^2$ be respectively the expected value and variance of the number of type II quadruples in the $n$th generation of the process of randomly subdividing an edge so as to link the new internal vertex to a new external vertex. Then the index of dispersion $\sigma_n^2 / \bar{\theta}_n$ shows that the limiting distribution is under-dispersed as $n \to \infty$. (4.8)

Experimental results
The preceding combinatorial analyses were carried out in the service of a null hypothesis that each increment in genome ploidy is the result of the addition of a complete complement of distinct chromosomes to the $n-1$ copies already present, and that this addition is described by the attachment of a new edge to any branch of the existing phylogeny with equal probabilities.
Here we illustrate how to adapt our analysis to study the evolution of a variety of sugarcane, Saccharum officinarum (L.A. Purple) (Zhang et al., 2021). The genome has 80 chromosomes in all, partitioned into 10 sets $C_i$ ($1 \le i \le n = 10$), each of which consists of 8 homeologous chromosomes. There is no given or known subgenome structure, i.e., for any choice of 10 chromosomes, one from each set of 8 homeologs, there is no information on whether these chromosomes were, or were not, added to the genome during the same polyploidization event.
We first calculated the sequence similarities (percent identity) in the set of all homologous gene pairs found by SynMap in syntenic blocks (Lyons and Freeling, 2008; Lyons, 2008).
For any labelling of the 8 chromosomes in each set of homeologs, say A, B, ..., H, it would thus be futile to try to consider the 10 phylogenies constructed from the 10 sets as samples from some distribution of trees with labelled terminals, since there is no known 1-1 correspondence between the chromosomes in different sets. On the other hand, were there some well-supported, well-resolved phylogeny in common in some or all of the sets, though these consist of unlabelled trees, we might be justified in concluding that this gene tree represented the evolutionary history of the genome, solely on the basis of a common topology. Thus we will analyze each of the 10 sets of homeologous chromosomes, determine the tree that best fits the data on that set, and finally compare the 10 phylogenies, without labels, to see to what extent they share common structural features.
For each set of homeologous chromosomes, we first compile all gene families, sets of 8 paralogous genes one on each chromosome, that have no other paralogous links within the set or elsewhere.Because of the extensive pattern of missing genes and ambiguous paralogy, however, there are actually few complete gene families with 8 replicates.To address this problem, we extend our study to gene families with only 7 or 6 replicates.The numbers of families of each size are presented in Table 1.
Since our phylogenetic reconstruction in each gene family is based on bipartitions shared by the largest number of gene trees, we pay careful attention to the statistically fair weighting of bipartitions coming from the 6-, 7- and 8-replicate cases.
For each gene family, we calculate its maximum likelihood phylogeny via the RAxML wrapper contained in the Python package DendroPy (Sukumaran and Holder, 2010; Stamatakis, 2014). We carry out the rest of our analysis based on the non-trivial bipartitions of these phylogenies.
We define the compatibility of a pair of bipartitions to mean that the two bipartitions can co-exist in a tree. For instance, consider two bipartitions $\{X \mid R \setminus X\}$ and $\{Y \mid R \setminus Y\}$, where $R$ is the set of all the leaf labels. The two bipartitions are compatible if and only if at least one of the intersections $X \cap Y$, $X \cap (R \setminus Y)$, $(R \setminus X) \cap Y$, and $(R \setminus X) \cap (R \setminus Y)$ is empty. Constructing a tree on eight terminals entails inferring five compatible non-trivial bipartitions. While we are combining evidence from all the gene families in a homeology set, however, we should weight more heavily bipartitions that discriminate among the set of possible trees than those bipartitions that are compatible with a larger set of trees. Thus {A B C D}{E F G H}, which is compatible with only 225 unrooted binary branching trees, should be weighted more heavily than {A B}{C D E F G H}, which is compatible with 945 trees. Moreover, bipartitions derived from 7-gene families and 6-gene families will be compatible with (i.e. constitute evidence for) many more 8-terminal trees, and should thus carry much less weight in constructing the "consensus" tree of all the gene families in the homeology set.
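The compatibility criterion and the two tree counts quoted above can be checked with a short script. This is a sketch under stated assumptions: the function names are ours, a bipartition is represented by one of its two sides, and the count of rooted binary trees on $n$ labelled terminals is taken to be $(2n-3)!/(2^{n-2}(n-2)!)$.

```python
from math import factorial


def compatible(x, y, labels):
    """True if bipartitions {x | labels-x} and {y | labels-y} can co-exist.

    They are compatible iff at least one of the four pairwise
    intersections of their sides is empty.
    """
    x, y, labels = frozenset(x), frozenset(y), frozenset(labels)
    sides_x = (x, labels - x)
    sides_y = (y, labels - y)
    return any(not (a & b) for a in sides_x for b in sides_y)


def rooted_tree_count(n):
    """Number of rooted binary trees on n labelled terminals."""
    if n < 2:
        return 1
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def trees_compatible_with(side_size, n_terminals):
    """Unrooted binary trees on n_terminals containing a given bipartition.

    The bipartition is an edge; each side hangs from it as a rooted tree.
    """
    other = n_terminals - side_size
    return rooted_tree_count(side_size) * rooted_tree_count(other)
```

With eight terminals, `trees_compatible_with(4, 8)` and `trees_compatible_with(2, 8)` reproduce the counts 225 and 945 given in the text.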
For each homeologous chromosome set $C_i$, let $m^{(i)}$ be the number of gene families that have been extracted. Let the phylogenetic tree $T_{ij}$ correspond to gene family $G_{ij}$, where $1 \le i \le n = 10$ and $1 \le j \le m^{(i)}$. Suppose $R$ is the set of labels of the chromosomes of the $n$ replicates (here, $R = \{A, B, \ldots, H\}$), and $b^{ij}$ is the bipartition set of $T_{ij}$ on $n' \le n = 8$ terminals, for some $1 \le i \le n$ and $1 \le j \le m^{(i)}$. Let $R'$ be the set of missing terminal labels of $T_{ij}$, and let $\delta^{ij} = \{\delta^{ij}_1, \ldots, \delta^{ij}_{q_j}\}$ be the new non-trivial bipartition set, consisting of all non-trivial bipartitions of the $n$ terminals that are compatible with $T_{ij}$, indexed by $1 \le k \le q_j$, such that:
1. The new set of compatible non-trivial bipartitions must support all possible bipartitions fairly. We provide this balance through a criterion on the weights involving $K^{ij}_k$, the number of all full-binary trees with $n'$ terminals with which $\delta^{ij}_k$ is compatible, that is, the number of all full-binary trees with $n'$ terminals compatible with the corresponding bipartition on the $n'$ observed terminals, for $1 \le k \le q_j$. More explanation of this and of counting $K^{ij}_k$ is provided later.
2. The new set of bipartitions $\delta^{ij}$ must contribute a total weight of 1 to the formation of the consensus tree; the weights are normalized accordingly, where $\lambda^{ij}_k$ is the number of bipartitions in the set $\delta^{ij}$ that are compatible with the same number $K^{ij}_k$ of full-binary trees.
Hence, we have a system of q equations involving q weights that can be easily solved to get the weights.
$$K^{ij}_k = \frac{(2|R_1|-3)!\,(2|R_2|-3)!}{2^{(|R_1|+|R_2|-4)}\,(|R_1|-2)!\,(|R_2|-2)!} \qquad (5.3)$$
The bipartition $b^{ij}_{k'}$ corresponds to an edge linking $R_1$ and $R_2$. Therefore, the number of all full-binary trees compatible with bipartition $b^{ij}_{k'}$ is the product of the number of full-binary terminally labelled trees with labels in $R_1$ and the number with labels in $R_2$. Each of these two trees is rooted at one of the two vertices incident to the edge linking $R_1$ and $R_2$. There are $\frac{(2n-3)!}{2^{n-2}(n-2)!}$ full-binary rooted labelled trees with $n$ terminals. See Table 2.
Afterwards, for $1 \le i \le n = 10$, we let the non-trivial bipartition set $CST^{(i)}$ corresponding to the consensus tree of the $i$th homeologous chromosome set be a subset of $D_i$ with cardinality $n-3$ (here, 5) such that the bipartitions have the highest weights among the bipartitions in $D_i$ that are compatible with one another. This consensus was estimated by applying a greedy algorithm, at each step selecting the highest-weight remaining bipartition compatible with all the bipartitions already selected. As opposed to maximizing the sum over the weights, this approach promises a better consensus tree when the subsequent compatible bipartitions belong to the same former gene trees (e.g. in the worst case, all the compatible bipartitions belong to the gene trees corresponding to the bipartition with the highest weight in the new bipartition set). The consensus trees and their frequencies are shown in Table 3.
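The greedy selection described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function names, the representation of a bipartition by one of its sides, and the example weights are all hypothetical.

```python
def compatible(x, y, labels):
    """True if the bipartitions with sides x and y can co-exist in a tree."""
    x, y, labels = frozenset(x), frozenset(y), frozenset(labels)
    return any(not (a & b)
               for a in (x, labels - x)
               for b in (y, labels - y))


def greedy_consensus(weighted_bipartitions, labels):
    """Pick mutually compatible bipartitions in decreasing order of weight.

    weighted_bipartitions: iterable of (side, weight) pairs, where side is
    one side of a non-trivial bipartition of labels.
    Stops once n - 3 bipartitions (a fully resolved tree) are selected.
    """
    labels = frozenset(labels)
    target = len(labels) - 3
    chosen = []
    ranked = sorted(weighted_bipartitions, key=lambda item: item[1],
                    reverse=True)
    for side, _weight in ranked:
        side = frozenset(side)
        # Skip a bipartition already selected (possibly via its other side).
        if side in chosen or (labels - side) in chosen:
            continue
        if all(compatible(side, other, labels) for other in chosen):
            chosen.append(side)
            if len(chosen) == target:
                break
    return chosen


# Hypothetical weights for illustration only.
example = [("AB", 0.9), ("ABC", 0.8), ("BC", 0.7),
           ("EF", 0.6), ("GH", 0.5), ("EFGH", 0.4)]
consensus = greedy_consensus(example, "ABCDEFGH")
```

In this toy input, {B C} is skipped because it conflicts with the heavier {A B}, and the five remaining bipartitions define a fully resolved tree on eight terminals.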
We examined the probability of observing such phylogenetic trees under the null hypothesis, the assumption that the phylogeny was generated by the one-branch-at-a-time model. The results, as shown in Table 3, are statistically significant for all the peripheral substructures discussed in this paper based on the Multinomial Exact Test (P-value < 0.05). Thus, we reject the null hypothesis that the homeologous chromosomes were added one at a time.
In addition to testing the one-branch-at-a-time model on the inferred consensus trees derived from applying a greedy strategy to the new compatible bipartitions of full replicates, we applied ASTRAL III to our data (Zhang et al., 2018a) and re-examined our model. The results are shown in Table 4.
Although we justified our construction of unrooted trees rather than rooted trees in terms of lesser confidence in the "early" parts of the ten phylogenies, as well as a consequence of the type of information contained in the bipartitions, we can nonetheless ask what effect this choice has on the distribution of the number of peripheral structures of each type. To answer this, we investigate the effect of randomly inserting a root on some edge of the unrooted tree. For a tree with four pairs of terminals, the probability of destroying one pair by placing the root on a terminal branch, thus creating one singleton node on each side of the root, is 8/13. This will produce a tree with only three pairs. For three pairs in an unrooted tree, the corresponding probability is 6/13 that it becomes two pairs. For two pairs in an unrooted tree, the probability is 4/13 that it becomes one pair in a rooted tree.
Then the observed frequencies of four, three and two pairs in Table 3, namely 3, 7 and 0, respectively, become 15/13 = 1.15, 24/13 + 49/13 = 5.62 and 42/13 = 3.23, respectively. We carry out a parallel calculation on the expected numbers, 0.3, 4.8 and 4.8, to derive new values 0.11, 2.79, 5.59 and 1.49 for four, three, two and one pair, respectively. The preponderance of larger numbers of pairs in the observed data persists. In general, for any kind of peripheral structure, rooting can only decrease the number, and the effect on the observed numbers will parallel the effect on the expected numbers.
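The redistribution of the observed frequencies under random rooting is simple arithmetic and can be reproduced as below. This is a sketch with a hypothetical function name; it assumes, as in the text, that an unrooted binary tree on 8 terminals has 2·8 − 3 = 13 edges and that a root placed on either terminal branch of a pair destroys exactly that pair.

```python
from fractions import Fraction


def redistribute_after_rooting(frequencies, n_terminals=8):
    """Expected frequencies of pair counts after rooting on a random edge.

    frequencies: {number_of_pairs: observed or expected frequency}.
    A tree with p pairs loses one pair with probability 2p / (2n - 3).
    """
    edges = 2 * n_terminals - 3
    rooted = {}
    for pairs, freq in frequencies.items():
        lose = Fraction(2 * pairs, edges)
        rooted[pairs] = rooted.get(pairs, 0) + freq * (1 - lose)
        rooted[pairs - 1] = rooted.get(pairs - 1, 0) + freq * lose
    return rooted


# Observed frequencies of four, three and two pairs from Table 3.
observed = {4: 3, 3: 7, 2: 0}
rooted_observed = redistribute_after_rooting(observed)
```

The resulting values for four, three and two pairs are 15/13, 73/13 and 42/13, matching the figures 1.15, 5.62 and 3.23 quoted above.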

Discussion and conclusion
In this paper, we have introduced an approach to deciphering the history of multiple polyploidizations of crops and other flowering plants.This is particularly pertinent for autopolyploids and for allopolyploids derived from closely related genomes (Zhang et al., 2018b).
In contrast to most phylogenetic analyses, our calculations involved unlabelled unrooted trees, because there is no matching between the different sets of homeologous chromosomes. Enumeration involving unlabelled unrooted trees is notoriously difficult, but we have succeeded in developing easily computed recurrences for the expected number of occurrences of small structures on the periphery of the tree. Indeed, even though we have not specified any alternative hypothesis, the results are suggestive of one or two successive whole genome doublings underlying the gene similarity data. And although our null hypothesis is motivated by known examples where subgenomes were recruited one at a time, other null hypotheses may be equally reasonable in particular contexts.
Finally, we note that our data form but a very small subset of the homeologous gene sets in the S. officinarum genome, because of the stringent requirements on gene families for our tree construction method. Whether or not consideration of a larger data set may suggest different conclusions than ours is an open question. Our source code is publicly available at https://github.com/FatemehPouryahya/peripheral_structures.

Table 3
Consensus trees of homeologous chromosomes of S. officinarum and the test result of the one-branch-at-a-time model.

Table 4
Test result of the one-branch-at-a-time model on the trees inferred via ASTRAL III.
Conjecture 2.2. Let $\bar{\beta}_n$ be the expected value of the number of terminal pairs in the $n$th generation of the process of randomly subdividing an edge in order to link the new internal vertex to a new external vertex. Then $\bar{\beta}_n / n \to 1/4$ as $n \to \infty$.
Conjecture 2.3. Let $\bar{\beta}_n$ and $\sigma_n^2$ be respectively the expected value and variance of the number of terminal pairs in the $n$th generation of the stochastic process of randomly subdividing an edge in order to link the new internal vertex to a new external vertex. Then, based on the index of dispersion, the limiting distribution is under-dispersed.
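The limit 1/4 quoted in the Introduction for the normalized expected number of terminal pairs can be checked directly from the recurrence of Proposition 2.1. The following short calculation is a sketch added for illustration, writing $\bar{\beta}_n = \sum_i i\,P_{2i,n}$ and re-indexing the second sum with $j = i - 1$; it concerns only the mean and says nothing about a closed form for the full distribution.

```latex
\begin{align*}
(2n-5)\,\bar{\beta}_n
  &= \sum_i i\,(n+2i-4)\,P_{2i,n-1} + \sum_i i\,(n-2i+1)\,P_{2i-2,n-1} \\
  &= \sum_j \bigl[\, j\,(n+2j-4) + (j+1)(n-2j-1) \,\bigr]\, P_{2j,n-1} \\
  &= \sum_j \bigl[\, j\,(2n-7) + (n-1) \,\bigr]\, P_{2j,n-1}
   = (2n-7)\,\bar{\beta}_{n-1} + (n-1).
\end{align*}
% With the initial value \bar{\beta}_4 = 2, induction on n gives
\[
  \bar{\beta}_n = \frac{n(n-1)}{2(2n-5)},
  \qquad
  \frac{\bar{\beta}_n}{n} = \frac{n-1}{2(2n-5)} \longrightarrow \frac{1}{4}
  \quad (n \to \infty).
\]
```

As a check, the formula gives $\bar{\beta}_5 = 2$ and $\bar{\beta}_6 = 15/7$, in agreement with the distributions obtained directly from the recurrence for $n = 5$ and $n = 6$.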

Fig. 1. An illustration of the possible changes in the cardinality of subsets A, B, and C as a result of subdividing an edge labelled b, a, or c and linking the new internal vertex to a new terminal vertex. The binary tree in panel (a) has N = 6 terminals, and we have |A| = 4, |B| = 2, and |C| = 3. Panel (b) is the result of subdividing an edge labelled b, where |A| increases by 2 and |B| decreases by 1, while panels (c) and (d) are the result of subdividing an edge labelled a and c, respectively, where the cardinality of subset A remains the same and the cardinality of subset B increases by 1. In all of panels (b), (c) and (d), |C| increases by 1.

Fig. 2. Probabilities of number of terminal pairs in trees generated by the stochastic process of linking to a new terminal vertex.Trees are from 6th to 100th generation with two gaps in between generations.

Fig. 3. (a) Normalized expected value of number of terminal pairs, triples, and quadruples.(b) Indices of dispersion of number of terminal pairs, triples, and quadruples.

Fig. 4. Lattice paths of the recurrence relation in Proposition 2.1. Point $(2i, n)$ on the lattice corresponds to some tree at the $n$th generation with $i$ terminal pairs.
Let $\tau_n$ be the random variable defined by the number of triples in the $n$th generation of our one-branch-at-a-time model, with probability distribution $\{\Pr\{\tau_n = m\},\ m = 0, 1, \ldots, \lfloor n/3 \rfloor\}$. By marginalizing over the number of terminal pairs, we have (see Fig. 6):
$$\Pr\{\tau_n = m\} = \sum_{k=\max(m,2)}^{\lfloor n/2 \rfloor} \Pr\{\tau_n = m,\ \beta_n = k\} \qquad (3.1)$$
We denote $\Pr\{\tau_n = m,\ \beta_n = k\}$ by $P_{m,k,n}$. Through the following procedure, we obtain a recurrence relation for the number of triples:
Proposition 3.1. For $0 \le m \le \min(k, \lfloor n/3 \rfloor)$ and $2 \le k \le \lfloor n/2 \rfloor$, the following recurrence relation holds for the distribution $P_{m,k,n}$:
$$P_{m,k,n} = \frac{3k-3m+3}{2n-5}\,P_{m-1,k,n-1} + \frac{n-4+3m-k}{2n-5}\,P_{m,k,n-1} + \frac{m+1}{2n-5}\,P_{m+1,k-1,n-1} + \frac{n-2k-m+1}{2n-5}\,P_{m,k-1,n-1}$$
On the other hand, if the tree $b_{n-1}$ had $k-1$ terminal pairs, then a non-paired terminal $b$ was picked for subdivision. In this case, the tree $b_{n-1}$ could not have had $m-1$ triples, since the number of triples increases only if the subdivision occurs on any of the three edges forming a substructure, consisting of a terminal pair $a$ and the adjacent internal edge $c$, which is not a subgraph of a triple. If the tree $b_{n-1}$ had $m+1$ triples, then a non-paired terminal of a triple was picked for subdivision. Therefore, the probability of this case is $\frac{m+1}{2n-5}P_{m+1,k-1,n-1}$. However, if the tree $b_{n-1}$ had the same number $m$ of triples, then there are $(n-1) - \big(2(k-1) + m\big)$ edges that can be picked so that the number of triples does not change. Hence, the probability of this case is $\frac{n-2k-m+1}{2n-5}P_{m,k-1,n-1}$. This completes the proof. ∎
We computed the probabilities of the random variable $\tau_n$ for the first 1000 generations. The results lead us to the following conjectures.
Conjecture 3.2. Let $\bar{\tau}_n$ be the expected value of the number of triples in the $n$th generation of the process of randomly subdividing an edge so as to link the new internal vertex to a new external vertex. Then $\bar{\tau}_n / n \to 1/8$ as $n \to \infty$.
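We denote $\Pr\{\varphi_n = f,\ \tau_n = m,\ \beta_n = k\}$ by $P_{f,m,k,n}$. The following procedure yields a recurrence relation for the distribution $P_{f,m,k,n}$ (see Fig. 8).
Proposition 4.1. For $0 \le f \le \min(\lfloor k/2 \rfloor, \lfloor n/4 \rfloor)$, $0 \le m \le \min(k, \lfloor n/3 \rfloor)$, and $0 \le k \le \lfloor n/2 \rfloor$, the following recurrence relation holds for the distribution $P_{f,m,k,n}$:
$$P_{f,m,k,n} = \frac{m+1}{2n-5}\,P_{f-1,m+1,k-1,n-1} + \frac{6(f+1)}{2n-5}\,P_{f+1,m-1,k,n-1} + \frac{3k-6f-3m+3}{2n-5}\,P_{f,m-1,k,n-1} + \frac{n-4-k+3m}{2n-5}\,P_{f,m,k,n-1} + \frac{n-2k-m+1}{2n-5}\,P_{f,m,k-1,n-1}$$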

Proof.
Consider a tree $b_n$ containing $f$ type I quadruples, $m$ triples, and $k$ terminal pairs. A tree $b_n$ having $f$ type I quadruples can be derived from a tree $b_{n-1}$ with $f-1$, $f$, or $f+1$ type I quadruples. The number of type I quadruples increases only if the subdivision occurs on a non-paired terminal $b$ of a triple. As a result, an increment of the number of type I quadruples is always accompanied by a reduction of the number of triples and by an increment of the number of terminal pairs. Therefore, if the tree $b_{n-1}$ had $f-1$ type I quadruples, it must have had $m+1$ triples and $k-1$ terminal pairs, and the subdivision must have occurred on a non-paired terminal $b$ of a triple. The probability of this case is $\frac{m+1}{2n-5}P_{f-1,m+1,k-1,n-1}$. On the other hand, if the tree $b_{n-1}$ had $f+1$ type I quadruples, then any of the six edges involved in any of the type I quadruples of the tree $b_{n-1}$ could have been picked for the subdivision; thus, the probability of this case is $\frac{6(f+1)}{2n-5}P_{f+1,m-1,k,n-1}$. Finally, if the tree $b_{n-1}$ contained $f$ type I quadruples, then it must have contained either $m-1$ or $m$ triples. Provided that the tree $b_{n-1}$ had $m-1$ triples, the subdivision must have occurred on one of the three edges involved in a substructure consisting of a terminal pair and its adjacent internal edge, such that the substructure is not a subgraph of any triple or of any type I quadruple in $b_{n-1}$. In this case, there are $k-2f-(m-1)$ such substructures; therefore, with a probability of $\frac{3k-6f-3m+3}{2n-5}P_{f,m-1,k,n-1}$, the tree $b_{n-1}$ gives rise to the tree $b_n$. On the other hand, if the tree $b_{n-1}$ also had $m$ triples, then the number of terminal pairs either remained the same or increased through the transition. If the number of terminal pairs has remained at $k$, then there exist $(2n-5) - (n-1)$ …

Fig. 7.The layout of a type I quadruple.

Fig. 8. Probabilities of number of type I quadruples in trees generated by the stochastic process of linking to a new terminal vertex.Trees are from 7th to 100th generation with three gaps in between generations.
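$$\Pr\{\theta_n = e\} = \sum_{k}\sum_{m} \Pr\{\theta_n = e,\ \tau_n = m,\ \beta_n = k\} \qquad (4.5)$$
We denote $\Pr\{\theta_n = e,\ \tau_n = m,\ \beta_n = k\}$ by $P_{e,m,k,n}$.
Proposition 4.4. For $0 \le e \le \min(k, \lfloor n/4 \rfloor)$, $0 \le m \le \min(k, \lfloor n/3 \rfloor)$, and $0 \le k \le \lfloor n/2 \rfloor$, the following recurrence relation holds for the distribution $P_{e,m,k,n}$:
$$P_{e,m,k,n} = \frac{4m-4e+4}{2n-5}\,P_{e-1,m,k,n-1} + \frac{e+1}{2n-5}\,P_{e+1,m+1,k-1,n-1} + \frac{e+1}{2n-5}\,P_{e+1,m,k-1,n-1} + \frac{m+1-e}{2n-5}\,P_{e,m+1,k-1,n-1} + \frac{3k-3m+3}{2n-5}\,P_{e,m-1,k,n-1} + \frac{n-k-m+4e-4}{2n-5}\,P_{e,m,k,n-1} + \frac{n-2k-m+1-e}{2n-5}\,P_{e,m,k-1,n-1}$$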

Proof.
Suppose the tree $b_n$ contains $e$ type II quadruples, $m$ triples, and $k$ terminal pairs. A tree $b_n$ with $e$ type II quadruples can be derived from a tree $b_{n-1}$ with $e-1$, $e+1$, or $e$ type II quadruples. The tree $b_n$ can arise from a tree $b_{n-1}$ with $e-1$ type II quadruples only if a triple which is not a subgraph of a type II quadruple contributes to the formation of a type II quadruple. In other words, the tree $b_{n-1}$ must also have had $m$ triples and $k$ terminal pairs, and the subdivision must have occurred on any of the four edges, consisting of a terminal pair and two internal edges, involved in a triple which is not a subgraph of a type II quadruple; there are $4\big(m-(e-1)\big)$ edges which can be subdivided such that, by linking the new internal vertex to a new external vertex, the number of type II quadruples would increase; therefore, the probability of this case is $\frac{4m-4e+4}{2n-5}P_{e-1,m,k,n-1}$. On the other hand, if the tree $b_{n-1}$ had $e+1$ type II quadruples, the subdivision must have occurred on either of the two non-paired terminals $b$ of a type II quadruple, and therefore the tree $b_{n-1}$ must have had $k-1$ terminal pairs. If the subdivision has occurred on the non-paired terminal closest to the terminal pair of a type II quadruple, then the tree $b_{n-1}$ must have had $m+1$ triples. Thus, the probability of this case is $\frac{e+1}{2n-5}P_{e+1,m+1,k-1,n-1}$. Otherwise, if the subdivision has occurred on the second closest non-paired terminal to the terminal pair, then the tree $b_{n-1}$ must have had $m$ triples; thus, the probability of this case is $\frac{e+1}{2n-5}P_{e+1,m,k-1,n-1}$. Finally, suppose the number of type II quadruples has not been altered through the transition. The tree $b_{n-1}$ could have had $m+1$, $m-1$, or $m$ triples. If $b_{n-1}$ had $m+1$ triples, it must have had $k-1$ terminal pairs, and the subdivision must have occurred on a triple which is not a subgraph of a type II quadruple; the probability of this case is $\frac{m+1-e}{2n-5}P_{e,m+1,k-1,n-1}$. Alternatively, if the tree $b_{n-1}$ had $m$ triples and $k$ terminal pairs, so that none of the three counts changes, the probability of this case is $\frac{n-k-m+4e-4}{2n-5}P_{e,m,k,n-1}$.
However, if the tree $b_{n-1}$ had $k-1$ terminal pairs, then a non-paired terminal which is not involved in the formation of a triple or of a type II quadruple must have been subdivided to link to a new external vertex. There are $(n-1) - \big(2(k-1) + 2e + (m-e)\big)$ such edges. Hence, the probability of this case is $\frac{n-2k-m+1-e}{2n-5}P_{e,m,k-1,n-1}$. This completes the proof of the proposition. ∎

Fig. 10.Probabilities of number of type II quadruples in trees generated by the stochastic process of linking to a new terminal vertex.Trees are from 8th to 100th generation with three gaps in between generations.
for some $k'$, on $n'$ terminals corresponding to the phylogenetic tree of the gene family $G_{ij}$. The bipartition $\delta^{ij}_k$ can arise from all the binary trees with co-existing bipartition $b^{ij}_{k'}$. We know $K^{ij}_k$ is the number of all these trees. Now, suppose $b$

Funding
This research was funded by a Discovery grant to D.S. from the Natural Sciences and Engineering Research Council of Canada. D.S. holds the Canada Research Chair in Mathematical Genomics.

CRediT authorship contribution statement
Fatemeh Pouryahya: Conceptualization, Investigation. David Sankoff: Conceptualization, Investigation.

Table 1
The number of extracted gene families of homeologous chromosomes in Saccharum officinarum.

Table 2
An example of computed $\omega^{ij}$ and $K^{ij}$ for a 7-gene family. The original bipartition set corresponding to the phylogenetic tree of one of the gene families is given in the first column. The second column indicates the new non-trivial bipartitions compatible with the phylogeny shown in the first column. The computed $\omega^{ij}$ and $K^{ij}$, shown in the third and fourth columns, correspond to the new compatible bipartition in the same row of the second column.