The conditioned reconstructed process

https://doi.org/10.1016/j.jtbi.2008.04.005

Abstract

We investigate a neutral model for speciation and extinction, the constant rate birth–death process. The process is conditioned to have n extant species today, and we study the distribution of the reconstructed trees, i.e. the trees without the extinct species. Whereas the tree shape distribution is well known and is in fact the same as under the pure birth process, no analytic results for the speciation times were known. We provide the distribution of the speciation times and calculate their expectations analytically. This characterizes the reconstructed trees completely. We show how the results can be used to date phylogenies.

Introduction

Phylogenetics is the science of reconstructing the evolutionary history of lineages (usually species). Besides providing data for systematics and for taxonomy, phylogenies record the pattern of past diversification and so can be analyzed to infer past macroevolutionary processes. The first common step is to compare the reconstructed trees with expectations from neutral models of diversification (Gould et al., 1977, Mooers and Heard, 1997, Nee et al., 1992, Raup et al., 1973). The simplest class of neutral model is entirely homogeneous and assumes that throughout time, whenever a speciation (or extinction) event occurs, each species is equally likely to be the one undergoing that event. Of course speciation is not just random: lineages will differ in their expected diversification rates because of both intrinsic and extrinsic factors (Mooers et al., 2007). However, a neutral model is often used as a null model to analyze the data, with departures pointing the way to more sophisticated scenarios (Harvey et al., 1994).

We investigate the constant rate birth–death process (Feller, 1968, Kendall, 1948) as it is probably the most popular homogeneous model. A birth–death process is a stochastic process which starts with an initial species. Each species gives birth to a new species after an exponential (rate λ) waiting time and dies after an exponential (rate μ) waiting time. Throughout this paper, we assume 0 ≤ μ ≤ λ. In the following, time 0 is today and t_or is the origin of the tree, so time increases going into the past. Special cases of the birth–death process are the Yule (1924) model, where μ = 0, and the critical branching process (Aldous and Popovic, 2005, Popovic, 2004), where μ = λ. When looking at phylogenies, we have a given number, say n, of extant taxa. We therefore condition the process to have n species today and call the result the conditioned birth–death process (cBDP). The age of the tree, i.e. the time since the origin of the birth–death process, is t_or; if t_or is not known, we assume a uniform prior on (0, ∞) for the time of origin, as has been done in Aldous and Popovic (2005) and Popovic (2004). Note that a tree of age t_or which evolved under a birth–death process includes extinct species; it is called the complete tree (Aldous and Popovic, 2005), see Fig. 1, left. From the complete tree, delete the extinct lineages and suppress degree-two vertices. The result is called the reconstructed tree shape, see Fig. 1, right. Label its leaves uniformly at random (since each species evolves in the same way). The resulting tree is called the reconstructed tree (this follows the notation in Nee et al., 1994; it is also called a lineage tree, Aldous and Popovic, 2005). Note that when reconstructing a phylogeny from (molecular) data, we see the reconstructed tree. Extinct lineages are only apparent when the fossil record is included.
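The dynamics described above can be illustrated with a short forward simulation. The following is a minimal sketch and not the PhyloTree implementation; the function name `simulate_extant_count`, its parameters and the `max_species` safety cap are assumptions made for this example. Note that the simulation runs forward from the origin, whereas the text measures time backward from today.

```python
import random


def simulate_extant_count(lam, mu, t_or, rng=None, max_species=100_000):
    """Simulate a constant rate birth-death process forward in time.

    Starts with one species at the origin and returns the number of
    species alive t_or time units later (i.e. "today").
    The name and signature are hypothetical, chosen for this sketch.
    """
    if lam < 0 or mu < 0 or lam + mu <= 0:
        raise ValueError("need lam >= 0, mu >= 0 and lam + mu > 0")
    rng = rng or random.Random()
    n, elapsed = 1, 0.0
    # Stop on extinction (n == 0) or when the safety cap is reached.
    while 0 < n < max_species:
        # With n species alive, the next event arrives at rate n * (lam + mu).
        elapsed += rng.expovariate(n * (lam + mu))
        if elapsed > t_or:
            break
        # The event is a speciation with probability lam / (lam + mu),
        # otherwise it is an extinction.
        if rng.random() < lam / (lam + mu):
            n += 1
        else:
            n -= 1
    return n


if __name__ == "__main__":
    rng = random.Random(42)
    counts = [simulate_extant_count(1.0, 0.5, 2.0, rng) for _ in range(10)]
    print(counts)
```

Conditioning on n extant species could be mimicked by keeping only the runs that return exactly n, although such rejection sampling becomes inefficient for large n.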

In Nee et al. (1994), the reconstructed tree of a birth–death process after time t is discussed. In this paper we additionally condition on having n extant species, since this allows us to compare the model with phylogenies on n extant species. We will obtain the probability for each speciation event in a tree with n species. This has been done for the Yule model and the conditioned critical branching process (cCBP) in Gernhard (2008) (note that the cCBP is the critical branching process conditioned on n extant species). For the general birth–death process, the joint probability for all speciation times and the shape as well as conditioning on the shape has been established in Thompson (1975); however, no individual probabilities have been established.

To establish the individual probabilities, we introduce the point process representation for reconstructed trees (Section 2). This has been done for the critical branching process in Aldous and Popovic (2005) and Popovic (2004). In Section 3, we calculate the probability distribution of the age of a given tree on n species, assuming a uniform prior on (0, ∞) for the age of the tree. This enables us to derive the density function for the time of the k-th speciation event in a tree with n extant species (Section 4) and its expectation (Section 5), either assuming a uniform prior or conditioning on the age of the tree. In Section 6, we discuss some further properties of reconstructed trees. We determine the point process when not conditioning the cBDP on the time of origin. We also describe the point process of the coalescent, the neutral model in population genetics. Further, we discuss the backward process of reconstructed trees, i.e. the process of coalescence of the extant species. Knowing the time of the k-th speciation event in a reconstructed tree with n species allows us to calculate the time of a given vertex in the reconstructed tree (Gernhard et al., 2006, Gernhard, 2008). This becomes useful for dating phylogenies. If we are able to reconstruct the phylogeny of extant species but do not obtain speciation times, we can use the expected time of a speciation event as an estimate for the speciation time. This estimate has been used for the undated vertices in the primate phylogeny (Vos, 2006), assuming the Yule model; there, simulations were used to obtain the expected speciation times. We provide analytic results for any constant rate birth–death model. The methods are implemented in Python as part of our PhyloTree package and can be downloaded at http://www-m9.ma.tum.de/twiki/pub/Allgemeines/TanjaGernhard/PhyloTree.zip.

Formally, a reconstructed tree (Nee et al., 1994) is a rooted, binary tree with unique leaf labels and ultrametric edge lengths assigned, i.e. the distance from any leaf to the root is the same, see Fig. 2, left tree. We denote the set of interior vertices by V˚. A ranked reconstructed tree is a reconstructed tree without edge lengths but with a rank function defined on the interior vertices. A rank function (Semple and Steel, 2003) is a bijection V˚ → {1, 2, …, |V˚|} where the ranks are increasing on any path from the root to the leaves. Note that a ranked reconstructed tree is also called a ranked phylogenetic tree in the literature (Semple and Steel, 2003). A rank function induces an order on V˚ which can be interpreted as the order of speciation events. A (ranked) reconstructed tree shape is a (ranked) reconstructed tree without leaf labels. A (ranked) oriented tree is a (ranked) reconstructed tree without leaf labels but where we distinguish between the two daughter edges of the interior vertices, w.l.o.g. label them l and r, see Fig. 2, middle tree. Note that a (ranked) oriented tree has n! possible labelings. We introduce the oriented tree to make the proofs clearer and the statements easier.

Remark 1.1

The cBDP induces a (ranked) reconstructed tree in the following way. Consider the complete tree which evolved under the cBDP. We delete the extinct lineages and label the n leaves uniformly at random with {1, 2, …, n} to obtain the reconstructed tree (there are n!·2^(-k) possible labelings, where k is the number of cherries in the reconstructed tree shape). The interior vertices are ordered according to the time of speciation; this defines the rank function. To make the reconstructed tree oriented, for each interior vertex we label the two daughter lineages with l and r uniformly at random; there are 2^(n-1) possibilities. We then ignore the leaf labelings (note that each labeling of the oriented tree is equally likely, since each labeling of the reconstructed tree was equally likely).

On the other hand, if we know the distribution on (ranked) oriented trees induced by the cBDP, we obtain the distribution on (ranked) reconstructed trees in the following way. We choose a labeling of the leaves with {1, 2, …, n} uniformly at random from the n! possible labelings. We then ignore the orientation. This gives us back the distribution on (ranked) reconstructed trees. Therefore, it is sufficient to determine the distribution on (ranked) oriented trees in order to determine the distribution on (ranked) reconstructed trees. Overall, let τ_r be a reconstructed tree, and let τ_o be an oriented tree which was induced by τ_r. Then P[τ_r] = P[τ_o]·2^(n-1)/n!, since an oriented tree has n! possible labelings and for the n-1 interior vertices, we have the distinction between the l and r daughter branches.
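As a small sanity check of the conversion between oriented and reconstructed trees, the relation can be evaluated exactly for n = 3. The sketch below is only an illustration; the helper name `reconstructed_from_oriented` is hypothetical. For n = 3 the second speciation event happens either on the l or on the r daughter of the root, which gives two ranked oriented trees, taken here to be equally likely by symmetry (an assumption of this example).

```python
from fractions import Fraction
from math import factorial


def reconstructed_from_oriented(p_oriented, n):
    """Apply P[tau_r] = P[tau_o] * 2^(n-1) / n! with exact arithmetic.

    Hypothetical helper for illustration only.
    """
    return Fraction(p_oriented) * Fraction(2 ** (n - 1), factorial(n))


# n = 3: two ranked oriented trees, each assumed to have probability 1/2.
p_reconstructed = reconstructed_from_oriented(Fraction(1, 2), 3)

# There are three labeled rooted binary trees on three leaves, so a
# uniform distribution over them assigns 1/3 to each.
print(p_reconstructed)
```

The value agrees with a uniform distribution over the three labeled trees on three leaves.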

The point process

In this section, we provide the density for the time of a speciation event in the reconstructed tree given n species today and the time of origin being at time t_or in the past. We do that using a point process representation. The following point process was first considered in connection with trees in Aldous and Popovic (2005) and Popovic (2004).

Definition 2.1

A point process for n points and of age t_or is defined as follows. Draw the n points on the horizontal axis at 1, 2, …, n. Now pick n-1 points to be

The time of origin

Suppose nothing is known about t, the time of origin of a tree. As in Aldous and Popovic (2005) and Popovic (2004), we then assume a uniform prior on (0, ∞), i.e. a tree is equally likely to originate at any point in time. Note that the prior does not integrate to 1: for any positive constant function, the integral over (0, ∞) is ∞. Therefore the prior is not a density. Such a prior is called improper; a discussion and justification is found e.g. in Berger (1980). Assuming the uniform prior, we will establish the

The time of speciation events

In this section, we calculate the density for the time of the k-th speciation event given we have n species today. Knowing that the distribution on ranked reconstructed trees is uniform (Theorem 2.3), this characterizes the reconstructed trees completely. These results allow us to calculate the density for the time of a given vertex in a reconstructed tree (Gernhard et al., 2006, Gernhard, 2008).

Expected speciation times

In this section, we calculate the expected time of the k-th speciation event in a reconstructed tree with n species analytically. Our Python implementation for dating trees uses the analytic results. Higher moments are calculated numerically.

More on the point process of cBDP

In Section 2, we showed that a reconstructed tree of a cBDP of age t can be interpreted as a point process on n-1 points which are i.i.d. We will see in this section that the same is not true if we do not condition on the age of the tree but assume a uniform prior. From Theorem 2.5 we obtain the density function for x = (x_1, …, x_{n-1}), the order statistic of the speciation times, conditioned on the time of origin, t,

\[
f(x \mid t, n) = (n-1)! \prod_{i=1}^{n-1} \left[ \frac{(\lambda-\mu)^2 e^{-(\lambda-\mu)x_i}}{\left(\lambda-\mu e^{-(\lambda-\mu)x_i}\right)^2} \cdot \frac{\lambda-\mu e^{-(\lambda-\mu)t}}{1-e^{-(\lambda-\mu)t}} \right].
\]

With the
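Because the n-1 speciation times are i.i.d. when conditioning on t, they can be drawn one at a time. The sketch below is not the PhyloTree code. It assumes 0 ≤ μ < λ, and the function names are hypothetical. It integrates the per-point factor of the density above, which gives the distribution function F(s | t) = G(s)/G(t) with G(s) = (1 - e^{-(λ-μ)s}) / (λ - μ e^{-(λ-μ)s}), and then inverts F to sample.

```python
import math
import random


def speciation_time_cdf(s, lam, mu, t):
    """Distribution function F(s | t) of one speciation time, 0 <= s <= t."""
    r = lam - mu

    def g(x):
        return (1.0 - math.exp(-r * x)) / (lam - mu * math.exp(-r * x))

    return g(s) / g(t)


def sample_speciation_times(n, lam, mu, t, rng=None):
    """Draw the n-1 speciation times of a tree of age t by inverse transform.

    Assumes 0 <= mu < lam (the critical case mu == lam is not handled).
    Returns the times in increasing order, i.e. most recent event first;
    this ordering is a choice made for this sketch.
    """
    if not (0 <= mu < lam) or t <= 0:
        raise ValueError("need 0 <= mu < lam and t > 0")
    rng = rng or random.Random()
    r = lam - mu
    g_t = (1.0 - math.exp(-r * t)) / (lam - mu * math.exp(-r * t))
    times = []
    for _ in range(n - 1):
        # Solve G(s) = y for s, where y is uniform on [0, G(t)).
        y = rng.random() * g_t
        times.append(math.log((1.0 - y * mu) / (1.0 - y * lam)) / r)
    return sorted(times)


if __name__ == "__main__":
    print(sample_speciation_times(5, lam=2.0, mu=1.0, t=3.0,
                                  rng=random.Random(0)))
```

Averaging the k-th entry over many draws gives a Monte Carlo estimate that could be compared against analytic expectations conditioned on t.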

Applications

Knowing the density and expectation of the k-th speciation time given we have n species today, we can obtain the density and expectation for the time of each interior node of a given tree. This can be used for dating phylogenies, if only the shape is inferred—missing dates in phylogenies could be due to supertree methods, morphological data or absence of a molecular clock. In earlier work (Gernhard et al., 2006, Gernhard, 2008), we gave the method and computer programs for dating phylogenetic

Results and outlook

The n-1 speciation points in the point process representation are i.i.d. if conditioning on the time of origin or the most recent common ancestor. As discussed, this allows us to calculate the speciation times in a reconstructed phylogeny. So far, we calculate the speciation time with only conditioning on the shape of the phylogeny. With the point process, one might be able to condition on the shape as well as on some known dates in the phylogeny. This would be valuable for dating supertrees,

Acknowledgments

The author thanks Mike Steel, Arne Mooers, Daniel Ford, Dirk Metzler, Anusch Taraz and the anonymous reviewer for very helpful comments and discussions. Financial support by the Deutsche Forschungsgemeinschaft through the graduate program “Angewandte Algorithmische Mathematik” at the Munich University of Technology and by the Allan Wilson Center through a summer studentship is gratefully acknowledged.

References (27)

  • Aldous, D., et al., 2005. A critical branching process model for biodiversity. Adv. Appl. Probab.
  • Aldous, D.J., 2001. Stochastic models and descriptive statistics for phylogenetic trees, from Yule to today. Statist. Sci.
  • Berger, J.O., 1980. Statistical Decision Theory: Foundations, Concepts, and Methods.
  • Dehling, H., et al., 2003. Einfuehrung in die Wahrscheinlichkeitstheorie und Statistik.
  • Edwards, A.W.F., 1970. Estimation of the branch points of a branching diffusion process. J. Roy. Statist. Soc. Ser. B.
  • Feller, W., 1968. An Introduction to Probability Theory and its Applications, third ed., vol. I, Wiley, New...
  • Gernhard, T., 2008. New analytic results for speciation times in neutral models. Bull. Math. Biol. 70 (4),...
  • Gernhard, T., et al., 2006. Estimating the relative order of speciation or coalescence events on a given phylogeny. Evol. Bioinformatics Online.
  • Gernhard, T., Hartmann, K., Steel, M., 2008. Stochastic properties of generalised Yule models, with biodiversity...
  • Gould, S.J., et al., 1977. The shape of evolution: a comparison of real and random clades. Paleobiology.
  • Harding, E.F., 1971. The probabilities of rooted tree-shapes generated by random bifurcation. Adv. Appl. Probab.
  • Hartmann, K., Gernhard, T., Wong, D., 2008. Sampling trees from evolutionary models, submitted for...
  • Harvey, P.H., et al., 1994. Phylogenies without fossils. Evolution.