Distributional analysis of the extra-clustering model with uniformly generated phylogenetic trees

The extra-clustering model for the group formation process of social animals was introduced by Durand, Blum and François. The model uses the relatedness of the animals, which is described by phylogenetic trees. For trees drawn from the Yule-Harding model, the extra-clustering model was analyzed in recent work. Here, we analyze it for the uniform model, which is the other widely studied model on phylogenetic trees. More precisely, we derive moments and limit laws for the number of groups, the number of groups of fixed size and the largest group size. Our results show that, independent of the probability of extra-clustering, there is on average only a finite number of groups, one of which is very large whereas all others are small. This behavior differs considerably from the Yule-Harding case, where the finiteness of the number of groups depends on the extra-clustering probability.


Introduction
The extra-clustering model, proposed by Durand et al. in [5], is a model for the group formation process of social animals. Under this model, the number of groups $N_n$ formed by $n$ animals satisfies the distributional recurrence: for $n \geq 2$,

$$N_n \stackrel{d}{=} \begin{cases} N_{I_n} + N^*_{n-I_n}, & \text{if } K_n = 0 \text{ and } I_n \notin \{1, n-1\}; \\ 1, & \text{if } K_n = 1 \text{ or } I_n \in \{1, n-1\}, \end{cases} \qquad (1.1)$$

where $N^*_n$ is an independent copy of $N_n$, the sequences of random variables $K_n$, $I_n$, $N_n$, $N^*_n$ are independent, $K_n$ is a Bernoulli random variable with $P(K_n = 1) = p$ and $0 \leq p < 1$, and $I_n$ has throughout this note the Catalan distribution:

$$P(I_n = j) = \frac{C_{j-1} C_{n-j-1}}{C_{n-1}}, \qquad 1 \leq j \leq n-1, \qquad (1.2)$$

where $C_n = \frac{1}{n+1}\binom{2n}{n}$ denotes the $n$-th Catalan number.

Figure 1: A phylogenetic tree representing the interrelationship of 5 animals (grey leaves; labels are omitted). (a) On the left, the two encircled nodes are the clade of the first or second leaf from the left; note that this clade is not maximal since it is strictly contained in the clade of the third leaf from the left. (b) On the right, the two maximal clades of the tree (which arise from the third and fourth or fifth leaf from the left). So, if the interrelationship of 5 animals is represented by this tree, then $N_n = 2$, $N^{[2]}_n = N$ [...]

We give a brief description of the aforementioned extra-clustering model, which will then also explain the above recurrence for the number of groups.
First, consider $p = 0$, where the model is called the neutral model. In this case, the model is based on the assumption that the main (and in fact only) driving force behind the group formation process is genetic relatedness; see [5]. Thus, we first need to understand the interrelationship between the $n$ animals, which is done via phylogenetic trees, i.e., rooted, binary, leaf-labeled trees, where we do not consider a left-right order of the children of nodes and leaves represent the $n$ animals; see Semple and Steel [9] for a comprehensive introduction to properties of such trees and Figure 1 for an example (where labels of leaves are omitted). A clade of a leaf of such a tree is the set of leaves contained in the tree which is rooted at the parent of the leaf; see Figure 1, (a). The reason for considering clades is that the leaves (resp. animals) from a clade can be considered to be all closely related. Of particular interest are maximal clades, i.e., clades which are maximal under set inclusion in the poset of all clades; see Figure 1, (b). The set of all maximal clades is taken to be the set of groups formed by the $n$ animals under the neutral model and its cardinality is denoted by $N_n$ (which so far is not random).
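The definition of maximal clades can be made concrete with a small sketch. The representation below (nested pairs with a placeholder string for leaves) and the function names are illustrative assumptions, not notation from the paper; the 5-leaf tree is one shape that is consistent with the description of Figure 1.

```python
# Illustrative sketch: a tree is a nested pair (left, right); a leaf is the
# placeholder string LEAF (labels are omitted, as in Figure 1).
LEAF = "leaf"

def size(tree):
    """Number of leaves of a tree."""
    if tree == LEAF:
        return 1
    left, right = tree
    return size(left) + size(right)

def maximal_clade_sizes(tree):
    """Sizes of the maximal clades (groups) of a tree with at least 2 leaves.

    A node with a leaf child is the parent of that leaf, so all leaves below
    it form one clade. Descending from the root, the first such node on each
    path gives a maximal clade, because none of its ancestors has a leaf child.
    """
    left, right = tree
    if left == LEAF or right == LEAF:
        return [size(tree)]
    return maximal_clade_sizes(left) + maximal_clade_sizes(right)

# Hypothetical 5-leaf tree matching the verbal description of Figure 1.
figure_1_like = (((LEAF, LEAF), LEAF), (LEAF, LEAF))
groups = maximal_clade_sizes(figure_1_like)
print(len(groups), sorted(groups))
```

For this tree the sketch reports two groups, in line with the value of the number of groups stated in the caption of Figure 1.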
Of course, we usually do not have the phylogenetic tree representing the interrelationship of the $n$ animals and thus we need to resort to probabilistic tree reconstruction methods. More precisely, we will consider random models on the set of all phylogenetic trees of size $n$. The simplest and most widely used models for such random phylogenetic trees are the Yule-Harding model and the uniform model (also called PDA model in the biological literature); see [9] and Aldous [1] for motivation and background on these models. Properties of the (now random) $N_n$ under the former model were studied by Durand and François [6] and Drmota et al. [3,4]. In this paper, we are interested in the uniform model, which assumes that each phylogenetic tree with $n$ leaves is equally likely.
Note that the distribution of $N_n$ for a random phylogenetic tree of size $n$ does not change if one considers a left-right order of the children of the nodes in trees and also if one ignores the labels of the leaves; see for instance the discussion in Blum et al. [2] where this was also used. Thus, we will from now on (with a slight abuse of notation) consider phylogenetic trees as rooted, binary trees with children of nodes having a left-right order and leaves having no labels. It is a basic combinatorial fact that the number of such trees with $n$ leaves is given by $C_{n-1}$. Thus, under the uniform model, each tree with $n$ leaves has probability $1/C_{n-1}$ and the probability that the left subtree has size $j$ is given by (1.2) since there are exactly $C_{j-1} C_{n-j-1}$ trees whose left subtree has size $j$.

ECP 25 (2020), paper 13.
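As a quick numerical illustration of the counting argument, the following sketch tabulates the probability that the left subtree has size $j$, namely $C_{j-1}C_{n-j-1}/C_{n-1}$, and checks that these values sum to 1. The function names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def catalan(n):
    """n-th Catalan number C_n = binom(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)

def split_probability(n, j):
    """Probability that the left subtree of a uniform tree with n leaves has j leaves."""
    return Fraction(catalan(j - 1) * catalan(n - j - 1), catalan(n - 1))

n = 6
distribution = [split_probability(n, j) for j in range(1, n)]
print(distribution)
print(sum(distribution))
```

The distribution is symmetric in $j$ and $n - j$ and puts most of its mass on very unbalanced splits, which is the source of the qualitative difference from the Yule-Harding model discussed later.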
(This now also shows that (1.2) is indeed a probability distribution.)

We can now explain the above distributional recurrence for the number of groups $N_n$. Recall that $N_n$ is the number of maximal clades of a random phylogenetic tree of size $n$ under the uniform model. It is immediate that this number can be computed as the sum of the numbers of maximal clades of the left and right subtree unless all leaves are in one maximal clade. This, together with the fact that if the left subtree size equals $j$, then left and right subtree are independent random phylogenetic trees of size $j$ and $n - j$, respectively, explains the above distributional recurrence for $p = 0$. (In particular, note that in this case, we have $K_n \equiv 0$.)

Next, we are going to explain the more general extra-clustering model. Recall that, as just explained, the neutral model is based on the assumption that the only reason for animals to form groups is genetic relatedness. Whereas for some types of social animals this assumption is reasonable, for others it is not; see [5] where this was discussed with real-world data. In order to take into account other factors which cause animals to form groups (and also in order to devise statistical tests of whether or not the neutral model is appropriate), the authors in [5] introduced the more general extra-clustering model. Here, one has in addition a probability $p$ which measures the degree to which other factors are decisive in the group formation process. According to this probability, in each step of the recursive procedure to compute $N_n$, it may happen, independently of everything else, that an extra-clustering event occurs, which means that all the remaining animals are in one cluster. These extra-clustering events are modeled via the random variable $K_n$ in (1.1), which was the last unexplained piece in (1.1). Thus, the distributional recurrence for $N_n$ is now fully explained.
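To see the recurrence at work, one can take expectations on both sides and compute the mean number of groups exactly for small $n$. The sketch below does this under the reading of (1.1) described in the text: one group if an extra-clustering event occurs or if one subtree is a single leaf, and otherwise the sum of the group counts of two independent subtrees. The helper names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def catalan(n):
    """n-th Catalan number."""
    return comb(2 * n, n) // (n + 1)

def split_probability(n, j):
    """Probability that the left subtree has j of the n leaves (uniform model)."""
    return Fraction(catalan(j - 1) * catalan(n - j - 1), catalan(n - 1))

def mean_groups(n_max, p):
    """Exact mean number of groups for 2 <= n <= n_max, from the recurrence."""
    q = 1 - p
    mean = {}
    for n in range(2, n_max + 1):
        inner = range(2, n - 1)  # left subtree sizes with no leaf child at the root
        split_mass = sum((split_probability(n, j) for j in inner), Fraction(0))
        # A single group arises unless there is no extra-clustering event
        # AND the root split avoids sizes 1 and n - 1.
        value = 1 - q * split_mass
        # Otherwise the groups of the two independent subtrees are added.
        value += q * sum(
            (split_probability(n, j) * (mean[j] + mean[n - j]) for j in inner),
            Fraction(0),
        )
        mean[n] = value
    return mean

means = mean_groups(30, Fraction(1, 4))
print(float(means[4]), float(means[30]))
```

Exact rational arithmetic is used so that small cases can be compared with a direct enumeration of trees; for instance, for $p = 0$ and $n = 4$ exactly one of the five trees has two groups.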
In [3,4,6], moments and limit laws of $N_n$ for the Yule-Harding model were studied. Here, we will prove corresponding results for $N_n$ as well as for more refined characteristics of the group formation process under the extra-clustering model with uniformly chosen random phylogenetic trees.
The paper is organized as follows. In the next section, we introduce cluster trees and associate two generating functions with them. These will then be used in Section 3 to derive limit distribution results for $N_n$ and the number of groups containing exactly $m$ animals, where $m \geq 2$. Finally, in Section 4 we will study moments and the limit distribution of the largest group size. We will conclude in Section 5 by comparing the results from this paper with those for the Yule-Harding model from previous works.

Cluster trees and weights
In order to find the limit distribution of $N_n$, one could work with the distributional recurrence (1.1). However, we will use a more combinatorial method which will turn out to be advantageous when dealing with more refined properties of the group formation process.
First, note that the definition of the extra-clustering model (with uniformly generated phylogenetic trees) can be broken into two probabilistic stages: (i) a phylogenetic tree of size $n$ is picked uniformly at random and (ii) the picked tree is traced (starting from the root and then recursively in the subtrees) and one stops if either a node is encountered whose left or right subtree is a leaf or an extra-clustering event has occurred. In the second step, we replace the subtrees at the places where one has stopped by leaves and call the resulting tree a cluster tree of the picked tree. Note that cluster trees are again rooted, binary trees with children having a left-right order and leaves not labeled. Moreover, note that they are not unique but rather depend on the outcome of the probabilistic procedure in Step (ii) above; see Figure 2 where the tree on the left produces two different cluster trees: either an extra-clustering event occurs at the root itself (which gives the cluster tree in (c) with probability $p$) or there is no extra-clustering event at the root but possibly at the gray nodes (this gives the cluster tree in (b) with probabilities $qp^2$ if extra-clustering events occurred at both gray nodes; twice $q^2 p$ if only one extra-clustering event took place at a gray node; or $q^3$ if no extra-clustering event took place at a gray node).

Now, in order to keep track of the probabilities attached to cluster trees, we associate two generating functions with them. First, since no extra-clustering event has occurred at any internal node of a cluster tree, we attach the probability $q := 1 - p$ to these nodes, i.e., we consider

$$G(z) = \sum_{n \geq 1} C_{n-1} q^{n-1} z^n = z C(qz),$$

where

$$C(z) = \sum_{n \geq 0} C_n z^n = \frac{1 - \sqrt{1 - 4z}}{2z}$$

is the ordinary generating function of the Catalan numbers (see, e.g., pages 34 and 35 in [7]).
Next, the leaves of the cluster tree either resulted from an extra-clustering event (in which case we have to attach the weight $p$ to them and there are $C_{n-1}$ possible trees) or they have been nodes in the original phylogenetic tree with either one or two children as leaves (in this case, we use the weight $q$ and the number of trees is either $2C_{n-2}$ in the former case or 1 in the latter case). Thus, for each single leaf of the cluster tree, we consider the generating function

$$H(z) = \sum_{n \geq 2} \big(p C_{n-1} + (2 - \delta_{2,n})\, q\, C_{n-2}\big) z^n,$$

where $\delta_{2,n}$ is the Kronecker delta. Now, by well-known facts about the combinatorics of generating functions (e.g., see Chapter I in [7]), the composition of these two generating functions, namely $G(H(z))$, generates for every phylogenetic tree all its associated cluster trees with their corresponding probabilities. In particular, since for each phylogenetic tree the probabilities of its cluster trees sum up to 1, we have

$$[z^n] G(H(z)) = C_{n-1}, \qquad n \geq 2,$$

and $[z^1] G(H(z)) = 0$ because all phylogenetic trees have at least two leaves. (Here, $[z^n] f(z)$ denotes the $n$-th coefficient of the Maclaurin series of $f(z)$.) We formulate this as a lemma.
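The identity behind the lemma can be checked coefficient by coefficient with truncated power series. The sketch below assumes the following explicit forms, which are read off from the verbal description of the weights and are assumptions of this example: $G(z) = \sum_{n \geq 1} C_{n-1} q^{n-1} z^n$ and $H(z) = \sum_{n \geq 2} (pC_{n-1} + (2 - \delta_{2,n}) q C_{n-2}) z^n$. All function names are illustrative.

```python
from fractions import Fraction
from math import comb

def catalan(n):
    """n-th Catalan number."""
    return comb(2 * n, n) // (n + 1)

def multiply(a, b, order):
    """Product of two coefficient lists, truncated after z^order."""
    out = [Fraction(0)] * (order + 1)
    for i, a_i in enumerate(a):
        if a_i == 0:
            continue
        for j, b_j in enumerate(b):
            if i + j > order:
                break
            out[i + j] += a_i * b_j
    return out

def g_coefficients(q, order):
    """Assumed form: [z^n] G(z) = C_{n-1} q^(n-1) for n >= 1."""
    return [Fraction(0)] + [catalan(n - 1) * q ** (n - 1) for n in range(1, order + 1)]

def h_coefficients(p, order):
    """Assumed form: [z^n] H(z) = p C_{n-1} + (2 - delta_{2,n}) q C_{n-2} for n >= 2."""
    q = 1 - p
    coefficients = [Fraction(0), Fraction(0)]
    for n in range(2, order + 1):
        shapes = 1 if n == 2 else 2
        coefficients.append(p * catalan(n - 1) + shapes * q * catalan(n - 2))
    return coefficients

def compose(g, h, order):
    """Coefficients of g(h(z)) up to z^order; requires h[0] == 0."""
    result = [Fraction(0)] * (order + 1)
    power = [Fraction(1)] + [Fraction(0)] * order  # h(z)^0
    for k in range(order + 1):
        result = [r + g[k] * c for r, c in zip(result, power)]
        power = multiply(power, h, order)
    return result

def composed_coefficients(p, order=12):
    """Coefficients of G(H(z)) for extra-clustering probability p."""
    return compose(g_coefficients(1 - p, order), h_coefficients(p, order), order)

coefficients = composed_coefficients(Fraction(1, 3))
print(coefficients[:8])
print([catalan(n - 1) for n in range(2, 8)])
```

Under these assumptions the coefficients of $z^n$ for $n \geq 2$ agree with $C_{n-1}$ regardless of $p$, which reflects the fact that the cluster-tree probabilities of each phylogenetic tree sum to 1.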

Number of groups and number of fixed-size groups
The two generating functions from the previous section become really useful only if one introduces a second variable, say $u$, which keeps track of the number of leaves of the cluster trees, which by definition is the number of groups under the extra-clustering model (see the introduction). More precisely, we now consider $G(uH(z))$. By the above description, this generating function is related to the distribution of $N_n$ via

$$P(N_n = k) = \frac{[z^n u^k]\, G(uH(z))}{C_{n-1}},$$

where the denominator incorporates the probability from Step (i) of the stochastic description of the extra-clustering model from the beginning of the last section and $G(uH(z))$ incorporates the probabilities from Step (ii).
Limit laws for random variables arising in the above way from a composition of generating functions have been studied before in the literature; see Flajolet and Sedgewick [7]. We recall here one such result which we will use in the sequel. To state the result, we need some notation.
Assume that $g(z)$ and $h(z)$ are two generating functions with non-negative coefficients and $h(0) = 0$. Denote the radii of convergence of $g(z)$ and $h(z)$ by $\rho_g$ and $\rho_h$, respectively, and set $\tau_h := h(\rho_h)$. Then, the following result holds.
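Theorem 1 (Proposition IX.1 in [7]). Assume that $\tau_h < \rho_g$. Moreover, assume that $\rho_h$ is finite and that $z = \rho_h$ is the only singularity of $h(z)$ on the circle of convergence. Finally, assume that

$$h(z) = \tau_h - c\,(1 - z/\rho_h)^{\lambda} + o\big((1 - z/\rho_h)^{\lambda}\big),$$

where $c$ is a positive real number, $0 < \lambda < 1$ and the above asymptotics holds for $z$ in

$$\Delta = \{ z : |z| < r\rho_h,\ z \neq \rho_h,\ |\arg(z - \rho_h)| > \varphi \},$$

where $r > 1$ and $0 < \varphi < \pi/2$.
Then, for the sequence of random variables $X_n$ defined by

$$P(X_n = k) = \frac{[z^n u^k]\, g(uh(z))}{[z^n]\, g(h(z))},$$

we have $X_n \xrightarrow{d} X$ with convergence of all moments, where $X$ is a discrete random variable with probability generating function:

$$E(u^X) = \frac{u\, g'(\tau_h u)}{g'(\tau_h)}.$$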
The proof of this result follows from singularity analysis, where the dominant singularity (the one closest to the origin) of $g(h(z))$ comes from the dominant singularity of $h(z)$ since $\tau_h < \rho_g$; for details see [7] and the proof of Theorem 3 below. The condition $\tau_h < \rho_g$ is the so-called subcritical case and one usually refers to $g(uh(z))$ as a subcritical composition schema.
In fact, $G(uH(z))$ is also a subcritical composition schema and thus the limit distribution of $N_n$ follows from the above result.
Here, NB(r, p) denotes the negative binomial distribution.
Remark 1. NB in the above theorem is more precisely the (standard) generalization of the negative binomial distribution to the case where the first parameter is allowed to be any positive real number.
Proof. First, note that $G(uH(z))$ is indeed a subcritical composition schema. Moreover, by a straightforward expansion,

$$H(z) = \frac{3+p}{16} - \frac{1+p}{4}\sqrt{1 - 4z} + \cdots$$

in a suitable $\Delta$-domain.
Thus, by applying the proposition, we obtain the claimed result with the probability generating function of $N$ given by

$$E(u^N) = \frac{(1+p)\,u}{2\sqrt{1 - \frac{(1-p)(3+p)}{4}\,u}}.$$

From this, it is clear that $N$ has the claimed distribution.
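The limit law can be explored numerically. The sketch below is a consistency check under stated assumptions, not the authors' formulation: it assumes that the limit $N$ has probability generating function $(1+p)u / \big(2\sqrt{1 - a u}\big)$ with $a = (1-p)(3+p)/4$, which is what the subcritical composition result yields when $G'(w) = (1 - 4qw)^{-1/2}$ and $H(1/4) = (3+p)/16$. Expanding the square root with the central binomial coefficients gives the probabilities; the function names are illustrative.

```python
from fractions import Fraction
from math import comb

def limit_pmf(p, k):
    """P(N = k) for k >= 1 under the assumed probability generating function.

    Uses (1 - a u)^(-1/2) = sum_j binom(2j, j) (a u / 4)^j.
    """
    a = (1 - p) * (3 + p) / 4
    j = k - 1
    return (1 + p) / 2 * comb(2 * j, j) * (a / 4) ** j

def limit_mean(p):
    """Mean of N: derivative of the assumed generating function at u = 1."""
    a = (1 - p) * (3 + p) / 4
    return 1 + a / (2 * (1 - a))

p = Fraction(1, 4)
total = sum(float(limit_pmf(p, k)) for k in range(1, 300))
mean = sum(k * float(limit_pmf(p, k)) for k in range(1, 300))
print(total, mean, float(limit_mean(p)))
```

In this parametrization $N - 1$ has the shape of a negative binomial variable with first parameter $1/2$, and the mean stays bounded for every $0 \leq p < 1$.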
Remark 2. The previous theorem can also be proved by deriving the asymptotics of all moments of $N_n$, which can be done in a recursive way since it follows from (1.1) that all moments satisfy the same type of recurrence. Then, the above result also follows since the negative binomial distribution is uniquely determined by its moment sequence; see, e.g., [3,4] where such a recursive approach was employed to prove limit distribution results (but with other limiting distributions).
As a corollary, we obtain the following.
Thus, on average, there are only a finite number of groups.
Next, we fix $m \geq 2$ and consider the number of groups of size $m$, which we denote by $N^{[m]}_n$; see the description of Figure 1 for an example. In order to understand the distribution of this random variable we can again use the two generating functions $G(z)$ and $H(z)$. However, this time we only mark with $u$ those leaves of the cluster tree which correspond to groups of size $m$, i.e., only the coefficient of $z^m$ in $H(z)$. Thus, we consider $G\big((pC_{m-1} + (2 - \delta_{2,m})\,qC_{m-2})(u-1)z^m + H(z)\big)$, where $\delta_{2,m}$ is the Kronecker delta function. Then, [...] In order to find the limit distribution of $N^{[m]}_n$, we cannot directly apply Theorem 1. However, the method of proof of Theorem 1 can be applied and yields the following result.

Theorem 3. We have the limit distribution result [...] with convergence of all moments, where [...] which has dominant singularity at $z = 1/4$. By a straightforward expansion, as $z \to 1/4$, [...] Note that for $u$ close to 1, we have $|c_m(u)| < \frac{1}{4q}$ and the upper bound is the dominant singularity of $G(z)$. Thus, $G(H_m(u, z))$ also has its dominant singularity at $z = 1/4$. Moreover, as $z \to 1/4$, [...] Now, by the transfer theorems of singularity analysis (see Chapter VI in [7]), [...] From this the claimed result follows by standard results from probability theory; see, e.g., Chapter IX in [7].

As a consequence, we again obtain the asymptotics of the mean.

Corollary 2.
We have [...]

Corollary 1 and Corollary 2 now imply the following proposition.

Proposition 1.
We have [...]

Proof. This is proved by a straightforward computation (probably best done with mathematical software such as Maple).
This suggests that there is only one big group and all other groups are small. That this is indeed the case will be proved in the next section.

Largest group size
Denote by $M_n$ the largest size of the groups (i.e., the largest size of the maximal clades) of a random phylogenetic tree of size $n$ under the uniform model; e.g., for the tree in Figure 1 we have $M_n = 3$. Due to the above observation that there should be one big group, we set $X_n := n - M_n$.
In order to find the distribution of $X_n$, we again make use of the above two generating functions for the cluster tree. The main observation is that for $0 \leq k < n/2$, we have

$$P(X_n = k) = \frac{[z^{n-k}] H(z)\; [z^k] G'(H(z))}{C_{n-1}},$$

which is explained as follows: since the largest group size is equal to $n - k$, we have to replace one leaf of the cluster tree by a group of size $n - k$ (this is the factor $[z^{n-k}] H(z)$), whereas all other leaves are replaced by arbitrary groups (this is the factor $[z^k] G'(H(z))$); note that the restriction $0 \leq k < n/2$ is essential here, because it ensures that all other groups are indeed of size smaller than $n - k$. Moreover, the range $0 \leq k < n/2$ is expected to be sufficient for our purpose since we expect that the largest group size is close to $n$.
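The product formula for the largest group can be compared with a direct recursive computation for small $n$. The sketch below makes the following assumptions, all of which are choices of this example: $[z^n]H(z) = pC_{n-1} + (2 - \delta_{2,n})qC_{n-2}$, $G'(w) = \sum_{m \geq 0}(m+1)C_m q^m w^m$, and the normalization by $C_{n-1}$. The direct computation uses only the recursive description of the model: the largest group is at most $m$ either trivially (if $m \geq n$) or if the root splits into two subtrees of size at least 2 without an extra-clustering event and both subtrees satisfy the same bound.

```python
from fractions import Fraction
from math import comb

def catalan(n):
    """n-th Catalan number."""
    return comb(2 * n, n) // (n + 1)

def h_coeff(n, p):
    """Assumed [z^n] H(z): weight of one cluster-tree leaf hiding a group of size n."""
    if n < 2:
        return Fraction(0)
    shapes = 1 if n == 2 else 2
    return p * catalan(n - 1) + shapes * (1 - p) * catalan(n - 2)

def gprime_of_h(order, p):
    """Coefficients of G'(H(z)) up to z^order (assumed form of G' in the lead-in)."""
    q = 1 - p
    h = [h_coeff(n, p) for n in range(order + 1)]
    result = [Fraction(0)] * (order + 1)
    power = [Fraction(1)] + [Fraction(0)] * order  # H(z)^0
    for m in range(order + 1):
        weight = (m + 1) * catalan(m) * q ** m
        result = [r + weight * c for r, c in zip(result, power)]
        nxt = [Fraction(0)] * (order + 1)  # next power of H, truncated
        for i, a in enumerate(power):
            if a:
                for j in range(2, order + 1 - i):
                    nxt[i + j] += a * h[j]
        power = nxt
    return result

def formula_pmf(n, k, p):
    """P(X_n = k) for 0 <= k < n/2 via the product formula."""
    return h_coeff(n - k, p) * gprime_of_h(k, p)[k] / catalan(n - 1)

def largest_group_cdf(n_max, p):
    """cdf[n, m] = P(M_n <= m), computed from the recursive model description."""
    q = 1 - p
    cdf = {}
    for n in range(2, n_max + 1):
        for m in range(n_max + 1):
            if m >= n:
                cdf[n, m] = Fraction(1)
                continue
            cdf[n, m] = q * sum(
                (Fraction(catalan(j - 1) * catalan(n - j - 1), catalan(n - 1))
                 * cdf[j, m] * cdf[n - j, m] for j in range(2, n - 1)),
                Fraction(0),
            )
    return cdf

def recursive_pmf(n, k, cdf):
    """P(X_n = k) = P(M_n = n - k) from the cumulative probabilities."""
    return cdf[n, n - k] - cdf[n, n - k - 1]

p = Fraction(1, 3)
n = 9
cdf = largest_group_cdf(n, p)
for k in range((n + 1) // 2):  # all k with 0 <= k < n/2
    print(k, formula_pmf(n, k, p), recursive_pmf(n, k, cdf))
```

The two columns agree for every admissible $k$, and the value for $k = 1$ is zero since every group contains at least two animals.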
We start with the following lemma.

Lemma 2.
Uniformly for $0 \leq k < n/2$, we have [...] The result follows from this by a standard computation using (3.1).
From the last lemma, we obtain the limit distribution of $X_n$.
Theorem 4. We have the limit distribution result [...] where $X$ is a discrete random variable with probability generating function [...] Here, [...] and the claimed form now follows by plugging into this the expressions for $G(z)$ and $H(z)$ and a straightforward computation.
Remark 4. Note that $F(u)$ has dominant singularity at $u = 1/4$. Moreover, as $u \to 1/4$, [...] From this, we obtain by the transfer theorems of singularity analysis, [...]

Remark 5. Note that all moments of $X$ are infinite. Thus, in contrast to Theorem 2 and Theorem 3, we do not have moment convergence in the above limit theorem for the largest group size.
Due to the latter remark, it is interesting to compute moments of $X_n$ (and thus of $M_n$). We will do this next with the help of Lemma 2, (4.2) and the Euler-Maclaurin summation formula (for the latter see, e.g., Chapter 9 of Graham et al. [8]). We first need the following (crucial) lemma.
For the first part, we have by Lemma 2, [...] where $p_k$ was defined in Theorem 4 and $\rho < 1/2$ so that the last equality holds. Note that [...] where we used (4.2) in the last step. Combining the two equations above, we get [...] The asymptotics of the sum on the right-hand side of the equation can be derived by using the Euler-Maclaurin summation formula: [...] where the last step holds whenever $\rho > 1/3$. The asymptotics of the $O$-term in (4.6) can be derived in a similar manner. Thus, we obtain that [...]

Now, we turn to the second part of the decomposition of (4.5), i.e., the sum over $n^{\rho} \leq k < n/2$, for which we use the expansions from Lemma 2 and (4.2): [...] and thus [...] Together with a similar treatment of the $O$-term in (4.8), we obtain that [...] Finally, substituting (4.7) and (4.9) into (4.5) gives the desired result.

Next, we proceed to the proof of (4.4). In a similar manner, we split the sum into [...] where $\rho$ is again chosen as the proof proceeds.
For the first term on the right-hand side of (4.10): [...] where the last step holds when $\rho < 1/2$.
For the second term on the right-hand side of (4.10), i.e., the sum over $n^{\rho} \leq k < n/2$, we again apply the expansions in Lemma 2 and (4.2): [...] Using once more the Euler-Maclaurin summation formula yields [...] The $O$-term in (4.11) is treated similarly.
Finally, substituting the above two equations into (4.10) gives the desired result.
From this lemma, we now obtain the asymptotics of all moments of $X_n$.
where d is as in Lemma 3.
we only need to show that the second term is $o(n^{-1/2})$. This follows directly from [...] where (4.3) is used in the last estimate.
As a corollary, we obtain the asymptotics of moments of the maximal group size $M_n$.
Corollary 3. We have [...] where $d$ is as in Lemma 3.

Conclusion
In this paper, we considered the number of groups, number of fixed-size groups and the largest group size of the extra clustering model with uniformly distributed phylogenetic trees. For all these random variables, we derived limit laws and computed moments. Our results show that on average, there is only a finite number of groups and that one of these groups contains almost all animals (and thus all the others are small).
This holds for all p with 0 ≤ p < 1.
Our results have to be compared with those for the extra-clustering model where the phylogenetic trees are generated by the Yule-Harding model; see [6] and [3,4]. In particular, in [6], the following asymptotics for the mean of the number of groups (again denoted by $N_n$) was proved: [...] Thus, for the Yule-Harding model, the number of groups is on average finite if and only if $p > 1/2$. In all other cases, the number of groups is growing as $n$ tends to infinity.
Higher moments and limit laws of $N_n$ were discussed in [3,4], where the authors proved that the limit law for $p = 0$ is continuous, for $0 < p < 1/2$ it is a mixture of a continuous and a discrete random variable and only for $p \geq 1/2$ it becomes discrete. On the other hand, for the uniform model we proved in this paper that it is always discrete. Moreover, one also has convergence of all moments, which in the Yule-Harding model was only the case for $0 < p < 1/2$ and $1/2 < p < 1$.
For the number of fixed-size groups in the Yule-Harding model, only the mean has been considered so far. For example, in [6], the authors showed that for $0 \leq p < 1/2$, the mean is again of order $n^{1-2p}$. Using the tools from [3,4], higher moments and limit laws for the number of fixed-size groups could be added as well (also for the range $p \geq 1/2$).
However, possibly of greater interest would be a study of the largest group size in the Yule-Harding model, in particular because it was claimed in [6] that the "typical" group size is of order $\log n$ in the neutral model ($p = 0$) and of order $n$ in the extra-clustering model with $p > 0$. Whether or not a similar sharp transition also holds for the maximal group size is an open problem.