The ancestral population size conditioned on the reconstructed phylogenetic tree with occurrence data

We consider a homogeneous birth-death process with three different sampling schemes. First, individuals can be sampled through time and included in a reconstructed tree. Second, they can be sampled through time and only recorded as a point ‘occurrence’ along a timeline. Third, extant individuals are sampled and included in the reconstructed tree with a fixed probability. We further consider that sampled individuals can be removed or not from the process, upon sampling, with fixed probability. Given an outcome of the process, composed of the joint observation of a reconstructed phylogenetic tree and a record of occurrences not included in the tree, we derive the conditional probability distribution of the population size any time in the past. We additionally provide an algorithm to simulate ancestral population size trajectories given the observation of a reconstructed tree and occurrences. This distribution can readily be used to perform inferences of the ancestral population size in the field of epidemiology and macroevolution. In epidemiology, these results will pave the way towards jointly considering data from case count studies and reconstructed transmission trees. In macroevolution, it will foster the joint examination of the fossil record and extant taxa to reconstruct past biodiversity.


Introduction
Since seminal papers by Yule (1925), Kendall (1948), and much later by Nee et al. (1994), birth-death models have become ubiquitous in evolutionary biology. They are used as an underlying tree prior, or as an underlying population dynamics prior in recent studies spanning fields as diverse as macroevolution, linguistics, or epidemiology (see e.g. Heath et al., 2014;Gray et al., 2009;Ratmann et al., 2016). Computing reliable estimates of the ancestral number of species, languages or infected individuals, i.e. estimates of past population size, is key to the understanding of past processes in these fields.
In both macroevolution and epidemiology, these inferences have initially relied on the fossil record and the case counts record, modeled as a sampling of individuals from the full process through time (Foote, 2000;Starrfelt and Liow, 2016).
In the recent years, impressive sequencing efforts targeting present-day species and pathogens have enabled the reconstruction of phylogenies, the branch-lengths of which can be used to infer ancestral population dynamics. Two main approaches have been introduced to estimate past population sizes from reconstructed evolutionary trees.
The first one, in a maximum likelihood framework, consists of estimating the birth and death rate of the population using all branch-lengths of the tree ahead. Analytical estimates of past population size are then computed in a second step, based on simplifications of the data, such as, e.g, taking only into account the number of individuals at the beginning (Morlon et al., 2011) or more recently at the beginning and end of the process (Billaud et al., 2019).
The second one is a variation of the first method in a Bayesian setting. Birth and death rates are supposed to be piecewise constant with a given a priori distribution, and their posterior is sampled using all branchlengths. Past population sizes are then computed by integrating the expected number of individuals over the rates posterior distribution .
While this second method accounts for uncertainty in the estimated population sizes, it also relies on numerical Monte-Carlo methods which make estimates much more computationally expensive to compute.
Additionally, the statistical signal in the number of lineages through time is only used to compute rate estimates, whereas it could in principle also more precisely inform the past population size distribution.
Statistical approaches stemming from the analysis of case count data or from the analysis of reconstructed evolutionary trees have been part of separate literature bodies for years, historically yielding conflict between biodiversity estimates based on the fossil record and estimates based on phylogenies of extant taxa (Quental and Marshall, 2010 but see also Morlon et al., 2011).
A first path towards merging these disparate data was introduced by the fossilized birth-death model of Stadler (2010), who considered a birth-death model with sampling and inclusion of individuals in the tree through time. This allowed taking into account infection trees reconstructed from pathogen sequences sampled throughout an epidemic (Stadler et al., 2011). In macroevolution, it paved the way to more precise phylogenetic dating using well-conserved fossil taxa which could be placed on a reconstructed phylogeny using morphological characters (Gavryushkina et al., 2016). Not so well-conserved fossils (i.e. occurrences) have also been used with this model, using a Markov Chain Monte Carlo (MCMC) scheme to integrate over all possible placements along a fixed tree (Heath et al., 2014).
Very recently, Vaughan et al. (2019) introduced another extension to the fossilized birth-death model, considering that individuals sampled through time could either be included or not included in the reconstructed evolutionary tree. This variation has profound consequences for data analysis, for it allows one to combine the occurrence record that cannot be placed along a tree (e.g. poorly preserved fossils, or case count epidemiological record) with samples for which molecular sequences or phenotypic measurements allow reconstructing a phylogenetic tree. They also provide a Monte-Carlo particle filtering algorithm to reconstruct past population sizes by sampling from the distribution of the number of ancestors conditioned on the joint observation of the reconstructed tree and the occurrence record. Analytical developments around this new model have further been made by Gupta et al. (2019), who derived an analytical formula for the probability density of an outcome of the process.
In this paper, we build on the analytical developments presented by Gupta et al. (2019), and aim at providing a more analytical, and faster, way to compute the same distribution as originally targeted by Vaughan et al. (2019). The efficiency of our method paves the way towards considering much bigger datasets, or towards extending the method to multi-type or density-dependent birth-death processes.
In section 2, we present the model, notation, and an overview of the strategy to express the targeted distribution. In section 3, we adapt the main results of Gupta et al. (2019) to compute the probability density of observations made after a given time, conditioned on the past population size. In section 4, we provide a way to compute the joint density of the past population size and observations made before a given time. Combining results of sections 3 and 4, we compute the distribution of past population sizes conditional on the full outcome of the process in section 5. We finally discuss applications and potential extensions of the model.

Parameters of the process
We consider a population of individuals, any of which can give birth to another individual at rate λ or die at rate µ. The process starts at time t or in the past with one individual, and unfolds until reaching present time 0, i.e. time is oriented from the present towards the past.
We superimpose to this background population dynamics three different sampling schemes. First, individuals can be ψ-sampled at rate ψ throughout their lifetime. When ψ-sampled, the individual will be included in the reconstructed tree. Second, individuals can be ω-sampled at rate ω throughout their lifetime. When ω-sampled, the individual is not included in the reconstructed tree, but its sampling time is nevertheless recorded and thereafter called 'an occurrence'. Last, the process is finished upon reaching the present time 0, and each extant individual at that time is ρ-sampled with fixed probability ρ, leading to their inclusion in the reconstructed tree. The sum of all per-capita rates will be called for short γ = λ + µ + ψ + ω.
Following Vaughan et al. (2019), we also include in the model an effect of the ψ-and ω-sampling through time on the population dynamics. We consider that, upon sampling, an individual is either removed from the process with probability r ∈ (0, 1), or is unaffected by the sampling with probability (1 − r). The overall number of individuals, denoted (I t ), thus follows a linear birth-death process with birth rate λ and death rate µ + (ψ + ω)r. Because this work aims at fitting the model to empirical data, we only consider here a supercritical process with λ > µ + (ψ + ω)r.

Introducing useful probabilities
This process has already been thoroughly studied. In the following sections, we will use two key probabilities which analytical solutions are well known. First, we will call u t the probability that a process starting at time t with only one individual is never sampled before reaching present time 0. We recall that u t satisfies the Master equation The solution of this for a particular initial condition z being the following where ∆ = γ 2 − 4λµ > (λ + µ) 2 − 4λµ ≥ (λ − µ) 2 > 0 and x 1 , x 2 are the two roots of the polynomial .
Second, we call p t the probability that a process starting at time t with one individual leads to one sampled particle at present time 0. Writing the Master equation governing the evolution of this quantity leads to The solution of this being the following These formulas are well known, and correspond respectively to quantities called p 0 (t) and p 1 (t) in Stadler (2010). When z = 1 − ρ, we will drop the dependence on z and use the shorter notation u t , p t . We recall standard ways to derive these expressions in Appendix A.

Strategy of the paper
The process with sampling leads to the observation of two distinct objects (T , O), as illustrated in Figure   1.
The reconstructed tree T , on the one hand, represents the evolutionary relationships between all ψ-sampled and ρ-sampled individuals. We further consider that ψ-sampled individuals are labeled either as 'removed' or 'non-removed'. All ψ-sampled removed individuals are necessarily leaves of T , whereas ψ-sampled nonremoved ones can either stand as leaves (when the descent of the individual is not sampled) or as vertices along a branch (when the descent of the individual is further sampled), in which case they are referred to as sampled ancestors.
The record of occurrences O, on the other hand, is an ordered list of all ω-sampling times. We also consider that these sampling times are labeled as either 'removed' or 'non-removed'.
In this paper, we are interested in computing the probability distribution of the number of individuals in the past, conditioned on the observed outcome (T , O) of the process. If k t denotes the number of sampled lineages in T at time t, we call our target distribution, We will refer to epochs as the maximal time slices within which no sampling event in O, nor branching event in T , happened. These epochs are delimited by the union of sampling times in O, branching times in the tree T , and sampling times of leaves and sampled ancestors in T . All pooled together, we call these ordered times (t h ) n h=0 , starting at present time t 0 = 0 and ending at the root time t n = t or .
At any time t ≥ 0 we also introduce: T ↑ t := the tree T cut at time t T ↓ t := the collection of trees (or forest) obtained by cutting T at time t, and considering all subtrees descending from cut lineages The general strategy -and outline -of the paper is the following. In a breadth-first backward traversal we will compute the probability density of observations made down any time t, conditioned on the population 5 1 2 3 1 2 3 Figure 1: General setting of the method. To the left in green, the full process with sampling. Red dots correspond to ω-sampling (sampling through time without sequencing), blue dots correspond to ψ-sampling (sampling through time with sequencing) and orange dots correspond to ρ-sampling at present. Filled or unfilled dots correspond respectively to sampling with or without removal. To the right, the observations: a reconstructed tree T together with the record of occurrences O.
size at that time, as time increases towards the past. We call this probability density, Provided we get expressions of (L t ) tor t=0 and (M t ) tor t=0 , our target distribution can then be expressed by combining both, noting that: where the last line holds because, conditionally on I t = k t + i, the future of the (Markov) process is independent of what happened before.
In the process of getting the likelihood of observations under the same model, Gupta et al. (2019) provided an analytical formula and an algorithm to compute the first ingredient L t in the case where all individuals are removed upon sampling (i.e. r = 1). We thus recall their main result, and adapt it to our slightly different framework, in the next section.

The density of observations below conditioned on past population size
We start this section by presenting the Master equations satisfied by the probability density L t . This provides us with a numerical algorithm to compute L t , which we subsequently simplify with analytical results for specific sets of parameters.

Master equation driving L t
One way to get the probability density L t consists in studying its evolution through time, relying on its Master equation. First, remark that we can express L 0 at present time 0. Indeed, provided we know the exact number of individuals living at time 0, the probability to see the tips of the tree is directly driven by the ρ-sampling, We now derive the Master equation driving the evolution of L t through time across any given epoch.
We consider an infinitesimal time step δt and list the events which could have happened in the full process between t and t + δt, leading to our observations. Suppose the number of observed lineages in this epoch is k, and the total number of individuals alive is k + i. We emphasize three cases, illustrated in Figure 2: 1. nothing happened with probability (1 − γ(k + i)δt)

a birth event happened
(a) among the k sampled lineages in T ↓ t , and it leads to an extinct or unsampled subtree to the left or to the right, with probability 2λδt.  3. a death event happened among the i particles, with probability µiδt.
These allow us to write, ∀i ∈ N, is not defined, but the term cancels out thanks to the factor i.
Letting δt → 0 and combining this with the initial condition on L 0 , we get the following set of ordinary differential equations driving the evolution of L t , Last, we need to study how L t changes at punctual events. There are 6 types of punctual events that we can come across at time t in the past, listed below and illustrated in Figure 3.
We denote L t + the probability just before (i.e. up) the punctual event and L t − the probability immediately after (i.e. down). We can either find, 1. a leaf of T ↓ t , labeled as removed. This is a ψ-sampling with removal event for which the number of unsampled lineages remains constant, and the number of sampled lineages increases by one. It thus gives, 2. a leaf of T ↓ t , labeled as non-removed. This is a ψ-sampling without removal event for which one of the unsampled lineage becomes a sampled one. It thus gives, 3. a sampled ancestor along a branch of T ↓ t , necessarily labeled as non-removed. This is a ψ-sampling without removal event, not impacting the number of sampled or unsampled lineages. It thus gives, labeled as removed. This is a ω-sampling with removal event, for which the number of unsampled lineages increases by one. It thus gives, Note that here also, for i = 0, L (−1) t is not defined but the term cancels out thanks to the factor i.

an occurrence in O ↓
t , labeled as non-removed. This is a ω-sampling without removal event, not impacting the number of sampled or unsampled lineages. It thus gives, 6. a branching event between two branches of T ↓ t . The number of sampled lineages decreases by one. It thus gives, sampled ancestor removed leaf non-removed leaf removed occurrence non-removed occurrence branching event This Master equation can be numerically approximated. To do so, we fix a finite upper bound N on the number of hidden individuals and numerically integrate a truncated ODE system. We detail this in the following algorithm to compute an approximation of L t at any time t.
Algorithm 1 Computes a numerical approximation of L t for a specific set of times Input: Observed tree and occurrence data (T , O), parameters (t or , λ, µ, ψ, ω, ρ, r), set of time points (τ j ) S j=1 for which we want to compute the density, and the truncation N setting the accuracy of the algorithm.
1: Pool all (τ j ) and all branching and sampling times of (T , O) in an ordered list (t h ) n h=1 2: Set j = 1 and initialize B as a S × (N + 1) empty matrix else if t h is a removed leaf then 13: else if t h is a sampled ancestor then 17: else t h is a branching event 23: We also define a slight variation of this algorithm, that we will refer to as algorithm 1', where no set of time points (τ j ) is required, and the values of L t are not recorded through time (i.e. matrix B disappears).
Instead, when reaching t n = t or we simply return L (0) t , which by definition is an estimate of the probability density of (T , O). Note that this strategy is very close to what has been used to compute the probability density of a reconstructed tree under a logistic birth-death process (Leventhal et al., 2013).
These two algorithms will prove useful to deal with the general case. However, we can further get analytical expressions for L t when ω = 0 as well as when r = 1 (Gupta et al., 2019). We expose these in the next two subsections.

Special case
where W t is a function of time only, and u t is defined as in equation 2.2. We first get, from the initialization in equation (3.10), that W 0 = ρ k0 . Moreover, Thus leading to the following ODE for W t , on any epoch (t h , t h+1 ) where the number of sampled lineages remains fixed and equal to k, This is very close to the Master equation (2.3) governing the evolution of p t , and it leads to (see derivation Last, because ω = 0, updates (3.11) to (3.16) simplify to only the following ψ-and λ-events, Combining these updates with equation (3.17) leads to the following proposition.
Proposition 3.1. At any time t across epoch (t h , t h+1 ), considering that we observed so far -i.e. on (0, t h+1 )v sampled ancestors, w removed leaves at times t j ∈ W, x branching events at times t j ∈ X , y non-removed leaves at times t j ∈ Y, we get, Proof We prove this proposition by induction across the epochs in Appendix D, using as the main arguments the equation updates (3.18) to (3.21), combined with equation (3.17).
Note that this proposition is very similar to what is presented in section 3 by Gupta et al. (2019). We nevertheless need to highlight two differences.
The first one is that we allow here for removal or not of the individual upon sampling, with a given probability r, whereas Gupta et al. (2019) considered that all individuals were removed upon sampling (r = 1), and Stadler (2010) considered that individuals were not removed upon sampling (r = 0).
The second difference concerns the underlying framework under which we derive our results. In Gupta et al.
(2019), individuals where distinguishable (say, each one is assigned a number and they can be ordered), whereas in the present paper they are not. When individuals are ordered, the probability density L (i) t is changed by a factor (k+i)! i! , which is the number of ways we can arrange k + i elements in a list of size k, i.e. the number of ordered configurations of hidden individuals.
Note that, when reaching the root of the tree, it reduces to a very similar formula for the probability density of T because i = 0 and k = 1. We summarize this as the following corollary.
Corollary 3.2. When ω = 0, the probability density of a reconstructed tree T with v sampled ancestors, w removed leaves at times t j ∈ W, y non-removed leaves at times t j ∈ Y, and branching events at times Proof It directly follows from Proposition 3.1, by noting that L(T ) = L tor . Note also that a rooted binary tree with w + y + k 0 leaves shows necessarily x = w + y + k 0 − 1 branching times.

Special case r = 1
When r = 1, only three kinds of punctual events, corresponding to updates (3.11), (3.14) and (3.16) need to be taken into account. Because the number of unsampled individuals i goes into formula (3.14), the simple expression L (i) t = u i t W t cannot be considered anymore, and one needs to find another expression. This has already been done in Gupta et al. (2019) and we only need to adapt here their result to our slightly different framework. For this we define the distinguishable version of the probability L ( 3.23) 12 We now derive the Master equation for L (i) t . Multiplying both sides of (3.10) by is the probability that a coalescing pair of randomly chosen lineages (from (k + i + 1) total lineages) does not consist of two sampled lineages. This shows that L With these transition conditions and initial condition L where q is the number of ω-sampling events in the time-interval [0, t) and the q + 1-dimensional time-varying t ) can be analytically computed following the approach in Gupta et al. (2019). Therefore from (3.23) we state the following proposition.
where W t is a q dimensional time-varying vector which can be computed following Algorithm 2 in Gupta et al. (2019).
Note that when there is no ω-sampling, then q = 0 for all times and W (0) t is the same as W t defined in the previous section.
This ends our section on the computation of L t . It thus remains to (i) present a way to compute M t and (ii) combine L t and M t to get the target distribution K t at any time t. We do this in turn in the next two sections.
13 4. The joint density of observations above and past population size Recall that we are now interested in computing, ∀i ∈ N, M (i) t We start by presenting the Master equation satisfied by M t , before turning to its resolution for specific parameter sets.
The approach is very similar to the one presented in the previous section to compute L t , with the slight difference that we will need to traverse the tree forward in time instead of backward in time.

Master equation driving M t
At the time of origin of the process t or , we only observe one starting lineage in T ↑ tor . This provides us with the following initialization condition on M , We then derive the Master equation driving the evolution of M t across an epoch on which the number of observed lineages is fixed and equal to k. Suppose we know M t , and we observe no punctual event on the infinitesimal time interval (t − δt, t). Unobservable events have already been illustrated in Figure 2. It allows us to get 2. a leaf of T ↓ t , labeled as non-removed. This is a ψ-sampling without removal event for which one sampled lineages becomes unsampled. This gives, 3. a sampled ancestor along a branch of T ↓ t , necessarily labeled as non-removed. This is a ψ-sampling without removal event which does not affect the number of lineages. It gives, 4. an occurrence in O ↓ t , labeled as removed. This is a ω-sampling with removal event, for which the number of unsampled lineages decreases by one. This gives, 4.29) 5. an occurrence in O ↓ t , labeled as non-removed. This is a ω-sampling without removal event which does not affect the number of lineages. It gives, 6. a branching event between two branches of T ↓ t . This is a λ-event increasing the number of sampled lineages by one. This gives, Finally, upon reaching present time 0, one needs to take into account the ρ-sampling, leading to the following update, (4.32) As already exposed for L t , we can build a similar algorithm to compute M t in the general case, relying on a numerical ODE solver for approximating equation (4.25). As for algorithm 1, a slight variation of this algorithm would allow one to compute an estimate of the probability density of (T , O). Note that this strategy is very close to what has been used to compute the probability density of a reconstructed tree under a logistic birth-death process (Etienne et al., 2012).
In the remainder of this section, we derive analytical results to avoid resorting to a numerical ODE solver.

The corresponding probability generating function
We introduce now the probability generating function corresponding to the density M t , which will prove useful to get analytical results.
The initial condition on M translates in M as, ∀z, M (t or , z) = 1. The ODE (4.25) furthermore translates as the following partial differential equation (PDE), Our target probability generating function M is thus the solution of the following PDE problem across a given epoch (t h−1 , t h ), on which the number of observed lineages remains constant and equal to k, Solving this PDE problem allows us to come with an analytical expression of M any time across an epoch, provided we now the expression of M (t h , z) at the end of the epoch.
Proposition 4.1. The solution to the PDE problem (4.33) is given by where we introduce R(t, z) = p(t, z)/(1 − z) to ease notation.
Proof We used the method of characteristics, see derivations in Appendix B.
Between epochs, one must also update M according to punctual events taking place. Previously presented updates of M (equations (4.26) to (4.31)) translate as the following for M , if t is a removed leaf, If we are interested in the distribution at some point, we can thus start the formula at t or with F (z) = 1, and then iteratively alternate between the updates at punctual events and the use of Proposition 4.1 over each epoch. When reaching present time 0, the step of ρ-sampling expressed in equation (4.32) moreover translates here as, While this procedure in theory allows us to get the analytical formula of M at any time, updates (4.37) and (4.38) require differentiating the probability generating function, greatly complicating the expression of the function after a few occurrences. When ω = 0, these two updates disappear and a nice recursion leads to a closed-form formula that we will detail in Proposition 4.3.
We also implemented this procedure in a programming language able to deal with symbolic calculus, but were not able to provide an implementation efficient enough to apply to standard datasets in the field.
Instead, when ω = 0, we suggest another strategy for computing the M (i) t 's, by approximating M across punctual events by a polynomial of order N , while still relying on Proposition 4.1 to drive the evolution of the probability generating function between events. This is a more efficient alternative to numerically solving the ODE system. We only need to derive the expression of the generating function at punctual events as given in the following proposition 4.2.
Proposition 4.2. The derivatives in z = 0 of a generative function which expression is can be numerically computed using the formula Proof The derivation is detailed in Appendix C.1.
This derivation is at the heart of algorithm 2, for it allows to drive the evolution of the M (i) t 's through each epoch, as well as at times when we want to record them.
We will refer to algorithm 2' as the slight variation of this algorithm aimed at computing the density of (T , O). No set of time points (τ j ) is required, and the values of M t are not recorded through time (i.e. matrix B disappears). Instead, when reaching t h = t 0 we simply return

Algorithm 2 Computes a numerical approximation of M t for a specific set of times Input:
Observed tree and occurrence data (T , O), parameters (t or , λ, µ, ψ, ω, ρ), set of time points (τ j ) S j=1 for which we want to compute the density, and the truncation N setting the accuracy of the algorithm. Output: A numerical approximation of M t at times (τ j ) S j=1 , ( M Compute the values right before the punctual event,

7:
if t h = τ j then One may wonder why we introduced a probability generating function in this section and not in the previous one to fasten Algorithm 1. It turns out that we tried to follow the same approach to derive L t , as described in Appendix E, but it leads to another PDE problem that will require further work to be solved.

Special case ω = 0
This corresponds to the special case leading to the observation of O = ∅. In that case, a nice recursion leads to a closed-form formula for M . Proposition 4.3. At any time t, considering that we observed so far -i.e. on (t, t or )v sampled ancestors, w removed leaves at times t j ∈ W, x branching events at times t j ∈ X , y non-removed leaves at times t j ∈ Y, we get, Proof Proposition 4.3 offers yet another proof of corollary 3.2 by noting that where the last equality follows from equation (4.40) taking into account the ρ-sampling at present.
When ω = 0, Proposition 4.3 also offers an alternative to algorithm 2 for deriving M t . Indeed, resorting to the probability generating function to get back the probability density, one can get the following corollary.
Corollary 4.4. At any time t, considering that we observed so far -i.e. on (t, t or )v sampled ancestors, w removed leaves at times t j ∈ W, x branching events at times t j ∈ X , y non-removed leaves at times t j ∈ Y, we can compute M (i) t using the following recursion, where we define Proof The probability density M (i) t can be found back by taking The result follows from the derivation of these derivatives in Appendix C.2.
This special case ends the section. In the next one, we will combine results from sections 3 and 4 and use our ability to compute L t and M t to compute K t , the probability distribution of the population size given (T , O).

The distribution at fixed times
In section 3, we exposed how to compute L t , the probability density of the observations below time t conditioned on the population size. This relies either on Algorithm 1 in the general case, or on the more optimized Proposition 3.1 in case ω = 0, or Proposition 3.3 in case r = 1.
In section 4, we exposed how to compute M t , the probability density of the observations above time t and the population size. This relies either on Algorithm 2 in the general case, or on the more optimized Corollary 4.4 when ω = 0.
We now combine our ability to compute L t and M t to derive the probability distribution of the population size given (T , O). Recall from the first section that, where ⊗ refers to the Hadamard product, and tr is the transpose of the vector.
Depending on the parameter space that one wants to consider, it thus remains to arrange pieces stemming from the previous sections. We provide a flowchart in Figure 4 to guide the reader to chose the most efficient path. Figure 4: The most efficient results depending on the parameter space considered. In red, results already described in Stadler (2010) and Gupta et al. (2019). In blue, the new contribution of this manuscript.

Generator of trajectories
The previous result gives us the distribution of the population size at any time in the past, but does not state anything about population size trajectories. We provide now an approximate way of simulating population size trajectories conditioned on (T , O).
Indeed, recall we have, We thus get, Where we introduced in the last line the following notation, Using these, we see that Q . This allows us to draw trajectories of the number of ancestors in the past as a continuous in time Markov process with the (inhomogeneous) rates Q t written above.
Remark that we could equally write these ODE coefficients using the M (i) t 's, depending on which traversal we prefer to perform prior to the simulation. This gives, where we introduced in the last line the following notation, This is a standard result for Markov chains that are conditioned on a final state, and the shape of the newly derived transition kernel is called a Doob's transform. Note that these transitions symplify for special cases when we have an analytical expression of either L

Numerical implementation
Results of this paper have been implemented numerically and the code is freely available on GitLab: https://gitlab.com/MMarc/popsize-distribution/.
We used the numerical implementation to verify the correctness of the results in several ways: 1. We verified that the values of the probability density of (T , O) computed using L t and M t (algorithms 1' and 2') were equivalent to values computed using already known formulas when ω = 0 (Stadler, 2010) or when r = 1 (Gupta et al., 2019). See result in Figure 5AB.
2. We verified that the values of the probability density of (T , O) computed using L t or M t (algorithms 1' and 2') were identical on examples for which no previous formula was known. See result in Figure   5C.
3. We assessed the distribution of the population size against the only numerical method performing the same goal, the particle filtering developed in Vaughan et al. (2019). We compared values of a few quantiles computed using the two methods, see result in Figure 5DEF).   Hints of the true population size (in green) under three different processes. A) A homogeneous birth-death with ρ-sampling at present; B) A homogeneous birth-death with ρ-sampling at present and ψ-sampling through time; C) A homogeneous birth-death process with ρ-, ψ-and ω-sampling.

Discussion
The results we have derived in this paper fit into two main categories.
The first category concerns results allowing one to compute the probability density L of (T , O). While results allowing to compute L when ω = 0 or r = 1 were already known, our two algorithms 1' and 2' have the potential to improve the computation time of L when ω = 0 and r = 1.
The second category concerns our main results, allowing one to compute the probability distribution of the population size in the past and to generate population size trajectories conditioned on (T , O) (section 5).
These constitute a dramatically more efficient alternative to the Monte-Carlo particle filtering algorithm developed by Vaughan et al. (2019) to perform the same task.
We discuss first how these results can be used in the context of current phylodynamics, before turning to future extensions.

Use for inference purposes
When applying phylodynamics methods to empirical data, it is common practice in the field to condition the model generating the phylogeny on either (i) the observation of at least one individual at present, descending from time t or , an event that we call S 1 tor ; or (ii) the observation of at least two individuals at present, descending from time t mrca , an event that we call S 2 tmrca , where mrca stands for most recent common ancestor. Indeed, we can study a dataset precisely because it survived to the present.
The probability density of (T , O) needs be conditioned on S 1 tor or S 2 tmrca when used in empirical applications, which can be done by dividing L by, respectively, (1 − u tor ) and (1 − u tmrca ) 2 .
However, since S 1 tor , S 2 tmrca are always satisfied on empirical data, and we already condition on the observation of empirical data, nothing more needs to be done to use the probability distribution K t to retrieve past population sizes.
If the tree is directly fixed and observed, results from the first category can be used to perform parameter inferences in a maximum likelihood framework. Taken as a function of the parameters, the probability density of (T , O) is called the likelihood, and can be numerically optimized to look for parameters maximizing its value (i.e. maximum likelihood estimators).
If one also knows the true parameter values or has inferred estimates of those, results from the second category can also be used to compute the distribution of the population size in the past. Any quantile of interest can then be numerically computed from this distribution. As an example, we chose to represent on Figure 6 the 95% envelope of the distribution, comprised between the level 2.5% and the level 97.5% quantiles. Alternatively, if one needs a point estimate of the population size, the median of the distribution is a good candidate.
If the tree is not directly observed, but sequencing data or any other set of phenotypic measurements, called A, is available for the set of ψ-sampled individuals, then one might consider analyzing those in a Bayesian framework. For instance, consider a model describing the evolution of molecular sequences or phenotypic measurements along a fixed tree is chosen, relying on parameters that we call θ. We are then interested in quantifying the posterior distribution f (T , θ, λ, µ, ρ, ψ, ω, t or | O, A) ∝ f (A | T , θ)f (O, T | λ, µ, ρ, ψ, ω, t or )f (λ, µ, ρ, ψ, ω, θ, t or ), with f (λ, µ, ρ, ψ, ω, θ, t or ) being the prior distribution on the model parameters. Standard MCMC algorithms can be used to sample from this posterior distribution. Results from the second category can then be used and applied to a sample from the posterior distribution to integrate the distribution of the population size over the posterior distribution of trees and parameters values.
In addition, the distribution K t of population sizes (or a trajectory sampled from this distribution using the results of section 5.2) can be computed directly for each sampled combination of tree and parameters, and thus used to compute the full joint posterior of both population size trajectories, trees and model parameters. While this is similar to the capability of the method of Vaughan et al. (2019), we expect the method presented here to be far more computationally efficient as it cuts out the necessity of simulating particle trajectories within the MCMC step. (The trajectory specific calculations or simulations in our method need only be done after the MCMC algorithm is complete, and only for a subsample of the full MCMC chain states.)

Future extensions
This process lends itself well for various biologically realistic extensions to allow closer fit to empirical data in a variety of situations.
The first extension that we envision is to relax the assumption of rate homogeneity and instead work with time-varying rates. This has already been considered in different studies relying on birth-death processes, either with exponentially varying functions (Morlon et al., 2011) or with piecewise constant rates (a model dubbed as skyline birth-death process, see Gavryushkina et al., 2016). Because most of what we wrote can be straightforwardly adapted to such a framework, this would not require much theoretical work. However, the challenge would be to do so without overfitting the data.
Another popular extension that has been described in the literature on birth-death processes for phylodynamics is to consider multi-type birth-death processes (Maddison et al., 2007). Each individual is assigned a type, which impacts its propensity to give birth to other types. All sampling-related parameters can also be considered type-dependent. The challenge here boils down to dealing with an increase of dimensionality, 27 where we would be interested in the joint distribution of subpopulation sizes. Recent work by Freyman and Höhna (2018) could first allow us to find the distribution of the types along sampled lineages in the tree. If we have j types, we would then be interested in where all quantities are type-dependent analogs of quantities defined in the current paper, and T c would be a type-coloured tree, i.e. a tree where each sampled lineage is assigned a type. This extension is particularly interesting for epidemiological applications, when different populations of infected individuals, clustered according to some characteristic (e.g. patient behaviour or geography) might have very different dynamics .
Finally, we are very hopeful that this piece of work could be applied as well to density-dependent birthdeath processes, also known as logistic birth-death models. Indeed, very similar ideas to the breadth-first forward and backward traversals as applied in algorithms 1' and 2' appear in the context of logistic birthdeath models (Etienne et al., 2012;Leventhal et al., 2013). Preliminary results obtained by adapting our numerical algorithms to this framework are very encouraging, and we are currently in the process of deriving as much analytical results as we can to speed up the method. We are hoping to present this in a subsequent paper.

Conclusion
This manuscript presents a way to efficiently compute the distribution of the number of ancestors in a linear birth-death process, conditioned on the observation of a reconstructed tree and a record of occurrences through time. The method paves the way to the consideration of more complex demographic scenarios, with either time-dependent or density-dependent parameters. We hope that it will foster important research advances for unravelling demographic histories in epidemiology, macroevolution, and any other fields where birth-death processes form a relevant model framework.
Integrating both sides between time 0 and t, we get This is the result stated in equation (2.2). Note that this quantity is called p 0 (t) in Stadler (2010), or E(t) in Maddison et al. (2007).
Appendix A.2. Probability to leave only one sampled descendent We aim here to integrate a slight variation of equation ( All these three terms can be integrated visually between some time t h and t, leading to, Leading to the final expression below Note that the case k = 1, t h = 0 and W 0 = 1 − z corresponds to the probability p t given as equation (2.4), while the general case can be expressed using function p as
• Solving the main PDE in Appendix B requires inverting u, using the first property with, z = u(t − t h , z 0 ) ⇐⇒ u(t h − t, z) = u(t h − t, u(t − t h , z 0 )) ⇐⇒ u(t h − t, z) = u(0, z 0 ) ⇐⇒ z 0 = u(t h − t, z) • The same Appendix section requires also composing function p and u, using • In the proof of Proposition 4.3, we switch to the notation R(t, z) = p(t, z)/(1 − z) and again compose R and u in the same way, ⇐⇒ R(t j − t h , u(t h − t, z)) = R(t j − t, z) R(t h − t, z)