Phylogenetic Diversity Theory Sheds Light on the Structure of Microbial Communities

Microbial communities are typically large, diverse, and complex, and identifying and understanding the processes driving their structure has implications ranging from ecosystem stability to human health and well-being. Phylogenetic data give us new insight into these processes, providing a more informative perspective on functional and trait diversity than taxonomic richness alone. But the sheer scale of high-resolution phylogenetic data also presents a new challenge to ecological theory. We bring a sampling theory perspective to microbial communities, treating a local community of co-occurring organisms as a sample from a larger regional pool, and apply our framework to make analytical predictions for the local phylogenetic diversity arising from a given metacommunity and community assembly process. We characterize community assembly in terms of quantitative descriptions of clustered, random, and overdispersed sampling, which have been associated with hypotheses of environmental filtering and competition. Using our approach, we analyze large microbial communities from the human microbiome, uncovering significant variation in diversity across habitats relative to the null hypothesis of random sampling.

We define the probability of finding a particular sampled tree as $P(g_1, g_2, \ldots, g_{k_{\max}})$, where $k_{\max}$ is the number of edges in the metacommunity tree. An edge contributes $S_i$ to the expectation value of total branch length if $g_i = 1$ and zero if $g_i = 0$, and so to obtain the expected phylogenetic diversity for a given sampling scheme we wish to compute the expectation value of the following variable:

$$H(g_1, g_2, \ldots, g_{k_{\max}}) = \sum_i h_i(g_i),$$

where

$$h_i(g_i) = \begin{cases} S_i & \text{if } g_i = 1, \\ 0 & \text{if } g_i = 0. \end{cases}$$

This expectation value is then (as for any arbitrary function of the variables $g_i$):

$$E(PD) = \sum_{g_1, g_2, \ldots, g_{k_{\max}}} H(g_1, g_2, \ldots)\, P(g_1, g_2, \ldots, g_{k_{\max}}) = \sum_{g_1, g_2, \ldots, g_{k_{\max}}} \Big[ \sum_i h_i(g_i) \Big] P(g_1, g_2, \ldots, g_{k_{\max}}) = \sum_i \sum_{g_1, g_2, \ldots, g_{k_{\max}}} h_i(g_i)\, P(g_1, g_2, \ldots, g_{k_{\max}}),$$

where all the sums over the variables $g_i$ are over $g_i = 0, 1$, and the sum over $i$ is over all edges in the metacommunity tree, i.e. all edges that could be in the sampled tree. In the third step we use the fact that the expectation value $\langle A + B \rangle$ is equal to $\langle A \rangle + \langle B \rangle$.
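For each edge $i$ we have a contribution to the expectation value which we can rewrite as:

$$\sum_{g_1, g_2, \ldots, g_{k_{\max}}} h_i(g_i)\, P(g_1, g_2, \ldots, g_{k_{\max}}) = \sum_{g_i = 0, 1} h_i(g_i)\, p_i(g_i) = S_i\, p_i(1),$$

where we have introduced the marginal distribution $p_i(g_i)$, with $p_i(1)$ the probability that a given edge appears in the sampled tree:

$$p_i(g_i) = \sum_{\{g_j\},\, j \neq i} P(g_1, g_2, \ldots, g_{k_{\max}}).$$

Finally, this gives us:

$$E(PD) = \sum_i S_i\, p_i(1).$$

For sampling schemes such that all edges with a given number of descendant tips, $k$, have the same marginal probability $p_i(1) = P(k)$, we can rewrite this as

$$E(PD) = \sum_k S(k)\, P(k).$$

The function $S(k)$ is the sum of edge lengths over all edges with $k$ descendant tips, which we term the Edge-length Taxa Distribution (ETD), or the Edge-length Abundance Distribution (EAD) when tips correspond to individuals rather than taxa.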
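As an illustration of the result above, the following minimal sketch tabulates $S(k)$ for a small tree and evaluates $E(PD) = \sum_k S(k) P(k)$ for an arbitrary marginal probability $P(k)$. The toy tree, the `EDGES` representation, and the function names are hypothetical choices made for this example, not data or code from the study.

```python
from collections import defaultdict

# Hypothetical toy metacommunity tree ((A,B),(C,D)), written as
# (edge_length, descendant_tips) pairs. Illustrative values only.
EDGES = [
    (1.0, {"A"}), (1.0, {"B"}), (2.0, {"C"}), (2.0, {"D"}),
    (0.5, {"A", "B"}), (1.5, {"C", "D"}),
]


def edge_length_distribution(edges):
    """S(k): total length of all edges with exactly k descendant tips."""
    s_of_k = defaultdict(float)
    for length, tips in edges:
        s_of_k[len(tips)] += length
    return dict(s_of_k)


def expected_pd(s_of_k, p_of_k):
    """E(PD) = sum_k S(k) * P(k) for a caller-supplied marginal P(k)."""
    return sum(s * p_of_k(k) for k, s in s_of_k.items())


S = edge_length_distribution(EDGES)
```

With $P(k) = 1$ for all $k$ (every tip sampled), `expected_pd(S, lambda k: 1.0)` recovers the total branch length of the toy tree, which is a quick sanity check on the tabulation.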

Sampling Schemes
A binomial sampling scheme with probability $q$ that a given tip appears in the sampled tree leads to the marginal probability that an edge with $k$ descendant tips appears in the sampled tree,

$$P(k) = 1 - (1 - q)^k,$$

which leads to the following expression for expected phylogenetic diversity:

$$E(PD) = \sum_k S(k) \left[ 1 - (1 - q)^k \right].$$

The EAD therefore plays a role analogous to that of the Species Abundance Distribution (SAD) in sampling theory based on species richness rather than phylogenetic diversity.
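The binomial result can be checked numerically on a small tree. The sketch below compares the closed form against a brute-force average over every possible subset of tips. The toy tree, the value of $q$, and the function names are hypothetical and serve only to illustrate the formula.

```python
from itertools import product

# Hypothetical toy tree ((A,B),(C,D)): (edge_length, descendant_tips).
EDGES = [
    (1.0, {"A"}), (1.0, {"B"}), (2.0, {"C"}), (2.0, {"D"}),
    (0.5, {"A", "B"}), (1.5, {"C", "D"}),
]
TIPS = ["A", "B", "C", "D"]


def p_binomial(k, q):
    """Probability that at least one of k independently sampled tips is present."""
    return 1.0 - (1.0 - q) ** k


def expected_pd_binomial(edges, q):
    """Closed form: sum over edges of S_i * [1 - (1 - q)^k_i]."""
    return sum(length * p_binomial(len(tips), q) for length, tips in edges)


def expected_pd_enumerated(edges, tips, q):
    """Brute force: average PD over all 2^n tip subsets, weighted by probability."""
    total = 0.0
    for mask in product((0, 1), repeat=len(tips)):
        sampled = {t for t, m in zip(tips, mask) if m}
        prob = 1.0
        for m in mask:
            prob *= q if m else 1.0 - q
        # An edge is in the sampled tree if any of its descendant tips is sampled.
        pd = sum(length for length, desc in edges if desc & sampled)
        total += prob * pd
    return total
```

The enumeration is exponential in the number of tips, so it is only useful as a check on small examples; the closed form needs one pass over the edges.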
Other common sampling schemes include Poisson sampling (random sampling with replacement) and negative binomial sampling. In the negative binomial scheme, the parameter $r$ represents the departure from random sampling: positive $r$ indicates clustered sampling and negative $r$ overdispersed sampling, while in the limit $r \to \infty$ negative binomial and Poisson sampling are equivalent.
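The explicit expressions for these two schemes are not given in the text here, so the following sketch assumes commonly used zero-class forms: $P(k) = 1 - e^{-qk}$ for Poisson sampling and $P(k) = 1 - (1 + qk/r)^{-r}$ for negative binomial sampling. These forms, and the function names, are assumptions made for illustration; they are chosen because they reproduce the stated behavior of $r$, including the Poisson limit as $r \to \infty$.

```python
import math


def p_poisson(k, q):
    """Assumed Poisson form: probability an edge with k tips is sampled."""
    return 1.0 - math.exp(-q * k)


def p_negative_binomial(k, q, r):
    """Assumed negative binomial form with clustering parameter r.

    Positive r corresponds to clustered sampling, negative r to
    overdispersed sampling. For negative r the base must stay positive.
    """
    base = 1.0 + q * k / r
    if base <= 0.0:
        raise ValueError("1 + q*k/r must be positive for this form")
    return 1.0 - base ** (-r)
```

Under these assumed forms, clustered sampling ($r > 0$) gives a lower probability of capturing an edge than Poisson sampling at the same $q$, and overdispersed sampling ($r < 0$) gives a higher one.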

Variance in Sampled Phylogenetic Diversity
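The variance of the sampled phylogenetic diversity follows from the second moment of the total branch length:

$$\mathrm{Var}(PD) = \sum_{i,j} S_i S_j \left[ p_{ij}(1,1) - p_i(1)\, p_j(1) \right],$$

where the joint probability $p_{ij}$ is defined by:

$$p_{ij}(g_i, g_j) = \sum_{\{g_m\},\, m \neq i,j} P(g_1, g_2, \ldots, g_{k_{\max}}).$$

Next, we note that for an edge $i$ downstream from an edge $j$ the hierarchical structure of a tree fixes $p_{ij}(1,1) = p_i(1)$, so that each such pair contributes $S_i S_j\, p_i(1)\left[1 - p_j(1)\right]$. When edges in disjoint clades are sampled independently, as under binomial sampling, their covariance vanishes and

$$\mathrm{Var}(PD) = \sum_i S_i^2\, p_i(1)\left[1 - p_i(1)\right] + 2 \sum_{i < j} S_i S_j\, p_i(1)\left[1 - p_j(1)\right],$$

where by $i < j$ we mean that $i$ is downstream of $j$, i.e. that there is a path from $j$ to $i$ moving in the direction of a tip of the tree. Finally, we again assume that the sampling scheme is such that $p_i(1)$, the marginal probability that edge $i$ appears in the sampled tree, depends only on the number of descendant tips downstream from $i$. Then:

$$\mathrm{Var}(PD) = \sum_k T(k)\, P(k)\left[1 - P(k)\right] + 2 \sum_k \sum_{l < k} U(k,l)\, P(l)\left[1 - P(k)\right],$$

where $T(k)$ is the sum of squared edge lengths over all edges with $k$ descendant tips, and $U(k,l)$ is the sum, over all pairs consisting of an edge with $k$ descendant tips and an edge with $l < k$ tips downstream of it, of the product of the two edge lengths.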
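To make the structure of the variance concrete, the sketch below evaluates it under binomial sampling as a sum of per-edge terms $S_i^2\, p_i(1)[1 - p_i(1)]$ plus twice the ancestor-descendant terms $S_i S_j\, p_i(1)[1 - p_j(1)]$, and compares the result with a brute-force enumeration of all tip subsets. It assumes binomial sampling, so that edges in disjoint clades are independent; the toy tree and function names are hypothetical.

```python
from itertools import product

# Hypothetical toy tree ((A,B),(C,D)): (edge_length, descendant_tips).
EDGES = [
    (1.0, frozenset("A")), (1.0, frozenset("B")),
    (2.0, frozenset("C")), (2.0, frozenset("D")),
    (0.5, frozenset("AB")), (1.5, frozenset("CD")),
]
TIPS = "ABCD"


def p_binomial(k, q):
    return 1.0 - (1.0 - q) ** k


def variance_formula(edges, q):
    """Per-edge terms plus twice the (downstream, upstream) pair terms."""
    var = 0.0
    for s_i, tips_i in edges:
        p_i = p_binomial(len(tips_i), q)
        var += s_i ** 2 * p_i * (1.0 - p_i)
        for s_j, tips_j in edges:
            if tips_i < tips_j:  # proper subset: edge i is downstream of edge j
                p_j = p_binomial(len(tips_j), q)
                var += 2.0 * s_i * s_j * p_i * (1.0 - p_j)
    return var


def variance_enumerated(edges, tips, q):
    """Brute force: Var(PD) = E(PD^2) - E(PD)^2 over all tip subsets."""
    mean = 0.0
    second = 0.0
    for mask in product((0, 1), repeat=len(tips)):
        sampled = {t for t, m in zip(tips, mask) if m}
        prob = 1.0
        for m in mask:
            prob *= q if m else 1.0 - q
        pd = sum(s for s, desc in edges if desc & sampled)
        mean += prob * pd
        second += prob * pd * pd
    return second - mean ** 2
```

The double loop over edges in `variance_formula` is what makes the exact variance expensive for large trees: the number of ancestor-descendant pairs grows much faster than the number of edges.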
For realistic trees, the cost of computing $U(k,l)$ grows faster with tree size than that of $T(k)$ or $S(k)$, by a factor of approximately the number of tips. To make computing the variance more tractable, we have used an approximation which serves as an upper bound on the variance:

$$\mathrm{Var}(PD) \leq \sum_k \left[ T(k) + 2 V(k) \right] P(k)\left[1 - P(k)\right], \tag{18}$$

where $V(k) = \sum_{i:\, k_i = k} S_i \sum_{j < i} S_j$ is the sum over $i$ of the product $S_i \sum_j S_j$, where $S_i$ is the length of any edge with $k$ downstream tips, and the sum over $j$ is over all edges in the clade downstream of edge $i$. In other words, for a given edge $i$, the sum over $j$ gives the total branch length of the corresponding downstream clade. We then use two facts:
1. $V(k) = \sum_{l < k} U(k,l)$, where $U(k,l)$ is defined as above as the product of edge lengths with $k$ descendant tips and downstream edge lengths with $l < k$ tips.
2. For the probabilities $P(k)$ that at least one tip is sampled from a clade with $k$ tips, and $P(l)$ that any subclade of this clade with $l < k$ tips has at least one tip sampled, $P(k) \geq P(l)$.
These give the inequality:

$$\sum_{l < k} U(k,l)\, P(l)\left[1 - P(k)\right] \leq V(k)\, P(k)\left[1 - P(k)\right],$$

and hence Eq. (18). Again, the sum over $l$ here is over all subclades with $l$ tips downstream of an edge with $k$ tips. Finally, we note that $V(k)$ is computationally much faster to obtain than $U(k,l)$.
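The bound can be illustrated on a small tree by tabulating $T(k)$ and $V(k)$ and comparing $\sum_k [T(k) + 2V(k)]\, P(k)[1 - P(k)]$ with the exact pairwise variance. This is a minimal sketch assuming binomial sampling; the toy tree and all function names are hypothetical. For brevity it finds downstream edges by a subset test, whereas a practical implementation would more likely accumulate downstream clade lengths in a single post-order traversal.

```python
from collections import defaultdict

# Hypothetical toy tree ((A,B),(C,D)): (edge_length, descendant_tips).
EDGES = [
    (1.0, frozenset("A")), (1.0, frozenset("B")),
    (2.0, frozenset("C")), (2.0, frozenset("D")),
    (0.5, frozenset("AB")), (1.5, frozenset("CD")),
]


def p_binomial(k, q):
    return 1.0 - (1.0 - q) ** k


def t_and_v(edges):
    """T(k): sum of squared edge lengths; V(k): sum of S_i times the total
    branch length of the clade downstream of edge i, grouped by tip count k."""
    t = defaultdict(float)
    v = defaultdict(float)
    for s_i, tips_i in edges:
        k = len(tips_i)
        t[k] += s_i ** 2
        downstream = sum(s_j for s_j, tips_j in edges if tips_j < tips_i)
        v[k] += s_i * downstream
    return dict(t), dict(v)


def variance_upper_bound(edges, q):
    t, v = t_and_v(edges)
    return sum(
        (t[k] + 2.0 * v[k]) * p_binomial(k, q) * (1.0 - p_binomial(k, q))
        for k in t
    )


def variance_exact(edges, q):
    """Exact pairwise form under binomial sampling, for comparison."""
    var = 0.0
    for s_i, tips_i in edges:
        p_i = p_binomial(len(tips_i), q)
        var += s_i ** 2 * p_i * (1.0 - p_i)
        for s_j, tips_j in edges:
            if tips_i < tips_j:  # edge i is downstream of edge j
                var += 2.0 * s_i * s_j * p_i * (1.0 - p_binomial(len(tips_j), q))
    return var


T, V = t_and_v(EDGES)
```

On this toy tree the bound is noticeably looser than the exact value, which is the price paid for replacing the two-index function $U(k,l)$ with the one-index function $V(k)$.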

Expected Phylogenetic Beta-Diversity
The expected shared branch length of two randomly drawn subtrees can be formulated in a similar way, but depends on the probability $\mathcal{P}(g_1, g_2, \ldots, g'_1, g'_2, \ldots)$ of two sets of variables, $\{g_i\}$ and $\{g'_i\}$, corresponding to the two trees:

$$E(\mathrm{Shared}) = \sum_{g_1, g_2, \ldots, g_{k_{\max}}} \; \sum_{g'_1, g'_2, \ldots, g'_{k_{\max}}} H(g_1, g_2, \ldots, g'_1, g'_2, \ldots)\, \mathcal{P}(g_1, g_2, \ldots, g'_1, g'_2, \ldots) = \sum_{g_1, g_2, \ldots, g_{k_{\max}}} \; \sum_{g'_1, g'_2, \ldots, g'_{k_{\max}}} \Big[ \sum_i h_i(g_i, g'_i) \Big] P_1(g_1, g_2, \ldots)\, P_2(g'_1, g'_2, \ldots),$$

where we have used the fact that the trees are drawn independently, so that $\mathcal{P}(g_1, g_2, \ldots, g'_1, g'_2, \ldots)$ factorizes into the probabilities defined in the previous sections. We have labeled these probabilities $P_1$ and $P_2$ to allow for the fact that, for example, the two trees may be of different sizes. The function $h_i(g_i, g'_i)$ is equal to $S_i$ if both $g_i$ and $g'_i$ are equal to one, i.e. if both trees contain edge $i$, and is zero otherwise.
We can similarly express this in terms of marginal probabilities $p_{\alpha i}(g_i)$ that edge $i$ is present in tree $\alpha$, where $\alpha$ corresponds to tree 1 or tree 2. Then:

$$E(\mathrm{Shared}) = \sum_i S_i\, p_{1i}(1)\, p_{2i}(1).$$

For sampling schemes such that all edges with a given number of descendant tips, $k$, have the same marginal probability $p_{1i}(1) = P_1(k)$, and likewise $p_{2i}(1) = P_2(k)$, we can rewrite this as

$$E(\mathrm{Shared}) = \sum_k S(k)\, P_1(k)\, P_2(k),$$

and under binomial sampling with probabilities $q_1$ and $q_2$ of each tip being sampled we have:

$$E(\mathrm{Shared}) = \sum_k S(k) \left[ 1 - (1 - q_1)^k \right] \left[ 1 - (1 - q_2)^k \right].$$

Phylogenetic Beta Diversity and the Impact of Differing Sample Sizes
This impact of sample size on studies of phylogenetic beta diversity points to a need to normalize measures of phylogenetic similarity. Taking two real communities containing $n_1$ and $n_2$ individuals, we can compute the expected shared and total branch length for two randomly sampled communities of equivalent sizes. This gives us a way to normalize both shared branch length and total branch length separately, providing a new kind of baseline for phylogenetic diversity. Our approach is to normalize UniFrac with respect to a pair of samples drawn according to a specified sampling scheme, and again we work with binomial sampling. Using this method, we cluster gut samples in Figure 7 (using Ward's criterion), and show that gut samples from the same subject, and in particular samples taken on consecutive days, are significantly more likely to have a normalized UniFrac score of less than 1; roughly speaking, only these consecutive samples from the same subject are more similar than random.
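The binomial expression for expected shared branch length, and one way it could feed into a normalized UniFrac score, can be sketched as follows. The baseline used here, $1 - E(\mathrm{Shared}) / E(\mathrm{Total})$ with $E(\mathrm{Total}) = E(PD_1) + E(PD_2) - E(\mathrm{Shared})$, is a ratio of expectations and is an assumption of this example rather than a formula stated in the text; the toy tree, the sampling probabilities, and the function names are likewise hypothetical.

```python
# Hypothetical toy tree ((A,B),(C,D)): (edge_length, number_of_descendant_tips).
EDGES = [(1.0, 1), (1.0, 1), (2.0, 1), (2.0, 1), (0.5, 2), (1.5, 2)]


def p_binomial(k, q):
    return 1.0 - (1.0 - q) ** k


def expected_pd(edges, q):
    """E(PD) = sum over edges of S_i * [1 - (1 - q)^k_i]."""
    return sum(s * p_binomial(k, q) for s, k in edges)


def expected_shared(edges, q1, q2):
    """E(Shared) for two independent binomial samples with probabilities q1, q2."""
    return sum(s * p_binomial(k, q1) * p_binomial(k, q2) for s, k in edges)


def baseline_unifrac(edges, q1, q2):
    """Assumed baseline: 1 - E(Shared) / E(Total), a ratio of expectations."""
    shared = expected_shared(edges, q1, q2)
    total = expected_pd(edges, q1) + expected_pd(edges, q2) - shared
    return 1.0 - shared / total


def normalized_unifrac(observed, edges, q1, q2):
    """Observed UniFrac divided by the assumed random-sampling baseline.

    Values below 1 indicate a pair more similar than the baseline.
    """
    return observed / baseline_unifrac(edges, q1, q2)
```

A simple consistency check is that when the second sample contains every tip ($q_2 = 1$), the expected shared branch length reduces to the expected phylogenetic diversity of the first sample.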