Phase-type distributions in mathematical population genetics An emerging framework

A phase-type distribution is the time to absorption in a continuous-or discrete-time Markov chain. Phase-type distributions can be used as a general framework to calculate key properties of the standard coalescent model and many of its extensions. Here, the ‘phases’ in the phase-type distribution correspond to states in the ancestral process. For example, the time to the most recent common ancestor and the total branch length are phase-type distributed. Furthermore, the site frequency spectrum follows a multivariate discrete phase-type distribution and the joint distribution of total branch lengths in the two-locus coalescent-with-recombination model is multivariate phase-type distributed. In general, phase-type distributions provide a powerful mathematical framework for coalescent theory because they are analytically tractable using matrix manipulations. The purpose of this review is to explain the phase-type theory and demonstrate how the theory can be applied to derive basic properties of coalescent models. These properties can then be used to obtain insight into the ancestral process, or they can be applied for statistical inference. In particular, we show the relation between classical first-step analysis of coalescent models and phase-type calculations. We also show how reward transformations in phase-type theory lead to easy calculation of covariances and correlation coefficients between e.g. tree height, tree length, external branch length, and internal branch length. Furthermore, we discuss how these quantities can be used for statistical inference based on estimating equations. Providing an alternative to previous work based on the Laplace transform, we derive likelihoods for small-size coalescent trees based on phase-type theory. Overall, our main aim is to demonstrate that phase-type distributions provide a convenient general set of tools to understand aspects of coalescent models that are otherwise difficult to derive. Throughout the review, we emphasize the versatility of the phase-type framework, which is also illustrated by our accompanying R-code. All our analyses and figures can be reproduced from code available on GitHub.


Introduction
Phase-type distributions have successfully been applied in the actuarial sciences and queuing theory for more than 30 years.The pioneering work of A. K. Erlang at a Copenhagen telephone company, where he developed the method of stages for the duration of calls, is among the first systematic applications of phases in stochastic modeling.A fascinating recount of these developments can be found in Brockmeyer et al. (1948).Later, Jensen (1953) formalized the notion of stages, and defined a (univariate) phase-type distribution as it is known today.It would take another two decades before Neuts (1975) and co-workers provided a systematic development of the theory, with main examples and contributions in queuing theory.We also note that John distribution.Additionally, reward structures in the context of phasetype distributions were introduced by Kulkarni (1989), which allowed for the definition of multivariate phase-type distributions.
Much of the theory and methods for phase-type distributions have been developed for queuing and risk theory, but phase-type distributions are increasingly being applied in other disciplines (Hurtado and Richards, 2021).In particular, phase-type distributions are an emerging framework in mathematical population genetics, but it is still in its infancy and not yet fully explored.While most applications in queuing or risk theory employ phase-type distributions out of convenience, with phases being unobserved and without a physical interpretation, the situation in population genetics is somewhat unique in the sense that many models can be identified directly as being of phase-type form.
In queueing theory and insurance risk, the typical data are interarrival times, service times, and claim sizes.In general, there is no reason for these quantities to be phase-type distributed, but the assumption allows for expressing important properties like waiting-time distributions, queue lengths, or ruin probabilities by explicit formulas.From a statistical point of view, we are most often in the situation of incomplete data, since data are only times until absorption (modeling e.g.claim sizes) and not the whole underlying history of a Markov process.Estimation can be achieved by invoking an EM-algorithm (Asmussen et al., 1996) or a Markov chain Monte Carlo approach (Bladt et al., 2003).In population genetics, the data is even more incomplete: we only observe the present-day DNA sequences, and information about the underlying evolutionary tree is missing.
The usage of stochastic simulation is increasingly popular, easy, and efficient for statistical inference and model selection in population genetics (e.g.Baumdicker et al., 2022;Freund andSiri-Jégousse, 2021 andSchrider andKern, 2018).An advantage of the simulation-based likelihood-free approaches is that they are very flexible and general; the analyst 'just' needs to be able to simulate from the model.A disadvantage is the (often high) computational cost and lack of optimality for the estimation procedure (see e.g.Chapter 6 in Bijma et al. (2017) for an introduction to optimality theory in mathematical statistics and the motivation for using maximum likelihood estimators).
In this review, we show that the class of population genetic models that allows a detailed mathematical treatment is larger than one might perhaps expect, and we point to references for extending the class of analytically tractable models even further.We have developed PhaseTypeR, a software package in R that facilitates straightforward implementation and simulation of coalescent models.In particular, all analyses in this paper can easily be reproduced and are available on GitHub.The package is available on CRAN and is described in Rivas-González et al. (2023).Software for fitting both time-homogeneous and time-inhomogeneous phase-type distributions with unidentified phases, as in Asmussen et al. (1996) and Albrecher et al. (2022), is available through the R package matrixdist (Bladt and Yslas, 2021).
Applications of phase-type methods to population genetic data is still in its infancy.Statistical inference procedures and analysis of large sample sizes based on phase-type theory remains a challenge.In this review we also discuss a number of limitations of the current phasetype framework.The limitations include a fast increase in the size of the state space when the sample size increases, and lack of tractable expressions for reward transformations for inhomogeneous coalescent models.
The review is organized as follows: Sections 2 and 3 constitute a basic introduction to the advantages of using the phase-type framework in population genetics.In Section 2, we consider univariate statistics such as tree height, total tree length and external branch length, and show that they are phase-type distributed.In Section 3, we consider the joint distribution of phase-type distributed variables.Section 4 is concerned with the discrete phase-type distribution, and we show that the number of segregating sites follows such a distribution.The time to fixation in the Wright-Fisher model is also discrete phase-type distributed.In Section 5, we establish the connection between likelihood inference for coalescent models via the Laplace transform originally suggested by Lohse et al. (2011) and the phase-type framework.In particular, the Laplace transform is analytically tractable for multivariate phase-type distributed variables, and this feature facilitates likelihood inference.Finally, in Section 6, we discuss further applications, extensions, limitations, and perspectives on the use of phase-type theory in population genetics.All our analyses and figures are available at https://github.com/rivasiker/phasetype_review.

Phase-type distribution with a view towards coalescent theory
In this section, we describe why phase-type distributions are useful in population genetics.We first provide simple examples of key quantities in population genetics that are phase-type distributed.A phase-type distribution is the time to absorption in a continuous-time Markov chain and examples from population genetics include the tree height, total tree length, and external branch length in a coalescent tree.We then describe the general theory and most important formulae from continuous phase-type distributions and relate them to the classical first-step analysis of continuous-time Markov chains.Finally, we introduce reward-transformed phase-type distributions.Rewardtransformed phase-type distributions allow us to weigh the time spent in a state before absorption.For example, the total tree length is the weighted tree height where the weights correspond to the number of lineages in the different states, and the external tree length is the weighted tree height where the weights correspond to the number of lineages that are ancestral to exactly one leaf (see Fig. 1).The joint distribution of reward-transformed phase-type distributions is multivariate phase-type.This is the topic for Section 3 and allows us to jointly consider e.g. total tree length, external tree length, and internal tree length.

Tree height, total tree length, and external branch length: Classical approach
In the standard (Kingman) coalescent model, the time it takes for two sequences to coalesce is exponentially distributed with rate 1, i.e.,  2 ∼ exp(1).Generalizing this, the time   between coalescent events from  sequences to (−1) sequences is exponentially distributed, i.e.   ∼ exp(  ), with rate   = (  2 ) = (−1) 2 .Since these exponential distributions are independent, we can calculate the tree height  for a given number of starting sequences  ≥ 2 as  = ∑  =2   .In other words, the time  until the most recent ancestor of  sequences is the sum of independent exponential distributions (Fig. 1).
The mean and variance of  are well-known, and so is the probability density function   ().The calculation of   () requires the use of convolutions of exponential distributions.In the simple case when  = 3, we get the convolution of two exponential distributions, namely  2 ∼ exp( 2 ) and  3 ∼ exp( 3 ).Since  2 ≠  3 we get the probability density function (Wakeley, 2008, equation 2.63).By substituting  2 = 1 and  3 = 3, we get the density function of  when  = 3.For  > 3,   () requires a longer series of convolutions of exponential distributions, and then substituting the corresponding exponential rates.Many other quantities in population genomics are also sums of exponential distributions like, e.g., the total tree length  = ∑  =2   = ∑  =2   , which is similar to  but weighted by the number of ancestral branches in each state of the coalescent process.Recall that positive scalar multiples of exponential distributions remain exponential: if  is exponential with rate  and  > 0, then  is exponential with rate ∕.We can then apply the same approach as described above to get   (), by using that   ∼ exp(  ), where now (2) A. Hobolth et al.
Now consider the singleton branch length or -equivalently -the external branch length .In the case  = 3 we have  = 3 3 +  2 , and since 3 3 ∼ exp(1) and  2 ∼ exp(1) we now have to consider the sum of two exponential distributions with the same rate.The distribution is a so-called Erlang distribution, and we have that the sum of two exponential distributions with rate  is (3) By substituting  = 1 we get the density for the singleton branch length when  = 3.The above approach, while feasible, relies on daunting analytical formulations.Moreover, the convolution of exponential distributions as described above cannot be used in general for non-standard coalescent models, in which, unlike the Kingman coalescent, trees are not necessarily bifurcating, and the number of branches might be reduced by two or more after a coalescent event.This happens, for example, in multiple merger coalescent models such as the -coalescent (Pitman, 1999;Sagitov, 1999), the beta coalescent (Schweinsberg, 2003), the psi coalescent (Eldon and Wakeley, 2006), and the seed-bank coalescent (Lambert and Ma, 2015;Blath et al., 2016), where  ≥ 2 branches might coalesce to −1, −2, … , 1 branches.Additionally, when performing convolutions of exponential distributions, it is important to distinguish between the case when the rates are the same (as in Eq. ( 3)) and the case when they are different (as in Eq. ( 1)).Fortunately, the desired convolutions can easily be computed by formulating the coalescent as an absorbing Markov jump process and using phase-type distributions.

Properties of the coalescent tree: Phase-type distributions
Consider the standard coalescent when  = 3 as a Markov jump process {  } ≥0 , with  = 2 transient states and a single absorbing state.The transient states correspond to the stages in the coalescent tree where 3 or 2 lineages are present, while the absorbing state corresponds to the state when all of the sequences have coalesced.The coalescent can then be summarized using a transition rate matrix , such that where For  we have ( 3 ,  2 ) = (3, 1), for  we have ( 3 ,  2 ) = (1, 1∕2), and for  we have ( 3 ,  2 ) = (1, 1).
In general, a phase-type distributed random variable is defined as the time to absorption in a continuous-time Markov chain with an absorbing state.Let the first  states of the Markov chain be the transient states and state  + 1 be the absorbing state.We can then write the transition rate matrix for the Markov chain as in (4) where  is now a sub-intensity matrix of size  ×  which holds the transition rates between the transient states,  is a column vector of size  with the exit rates from the transient states into the absorbing state, and  is a row vector of zeros of size .By convention, the rows of  sum to zero, i.e.  =  with the column vector  = (1, … , 1) ′ , where the superscript denotes transposition.Therefore, the transition rate matrix for the absorbing Markov jump process is fully determined by  , since the exit rate vector is given by  = − .We also need to provide an initial state distribution  of starting in each of the  transient states.We write  ∼ PH(,  ) for a phase-type distributed random variable with parameters (,  ), and the probability density function is given by A main advantage of continuous phase-type distributions (PHs) is that they have well-defined formulas in matrix notation for calculating their basic properties.In (6) we provided the probability density function.Similarly, the mean, variance, moments, and cumulative distribution function are also available in analytically tractable expressions.The mathematical expressions are general, defined solely by  and  , and available in our R package PhaseTypeR.
In those cases where phase-type distributions correspond to convolutions of exponential distributions, the phase-type formulas using matrix notation naturally yield the same result.We consider the two examples of tree height and external branch length to demonstrate this agreement.

Example: Tree height distribution
We return to   () when  = 3 for an illustrative example.The tree height  is the time until absorption,  = inf{ ≥ 0 ∶   =  + 1}, of a continuous-time Markov jump process {  } ≥0 with  = 2 transient states.The initial distribution is given by the row vector  = (1, 0) and the  ×  = 2 × 2 sub-intensity matrix where  3 = 3 and  2 = 1.When the tree height  is described as a phase-type distribution we write  ∼ PH(,  ).In other words,  is the time until absorption of the sum of two sequentially occurring exponential distributions defined by a probability vector  of starting in a certain state, and a sub-intensity matrix  with the transition rates between the  = 2 transient states given by ( 7) with ( 3 ,  2 ) = (3, 1).
We now calculate the density function ( 6) for  = (1, 0),  given by (7) with  2 ≠  3 , and  = −  = (0,  2 ) ′ .The first term in (6) with  = 0 is zero because  0  =  =  = 0.The remaining terms with  ≥ 1 are determined by where in order to obtain the second equal sign we used that we have an alternating sum when multiplying by ( 3 −  2 ).In the end, we get which is the same result as in (1).

Example: External and total length distribution
For the external length distribution  we have a phase-type representation given by initial distribution  = (1, 0) and sub-intensity matrix  given by (7) with  3 =  2 = 1.In the following we let  3 =  2 =  to establish the connection with (3).Note that  = −  = (0, ) ′ .Let us again consider the density function (6).The first term with  = 0 is zero because  0  =  =  = 0.The remaining terms with  ≥ 1 are determined by and we get which is the same result as in (3).
For an arbitrary number of leaves , the tree height  is phase-type distributed with initial distribution  = (1, 0, … , 0) and sub-intensity matrix where   = (

2
) . The sub-intensity matrix for the total tree length  has the same structure, but with   = ( − 1)∕2 (recall Eq. ( 2)).Phase-type distributions with rate matrices given by ( 12) are called generalized Erlang distributions (Bladt and Nielsen, 2017), and if the starting state is  = (1, 0, … , 0) then this distribution corresponds to the distribution of the sum of exponential random variables with rates   ,  −1 , … ,  2 .In the special case when all the parameters   are distinct, i.e.   ≠   for all  ≠ , we have the closed form solution for the density (see e.g.equation (2.64) in Wakeley (2008)).This expression is a linear combination of exponential distributions with positive and negative coefficients.
Calculation of the mean, variance, density function, and cumulative distribution function for ,  or  for any sample size  ≥ 2 is straightforward because the phase-type representation with parameters  and  is available.In particular, the computational methods for calculating matrix exponentials (e.g.Moler and Van Loan (2003)) are stable and reliable; for instance, we have never encountered numerical problems with the expm package in R (Goulet et al., 2021).

Example: Probability of a number of lineages at a fixed time in the past
The distribution of the number of lineages at a past time point  can also be obtained by using phase-type results.Tavaré (2004) (p. 19) derived the distribution of the number of lineages   () at time  for a coalescent tree that starts with  lineages at time 0.More specifically, he showed that where and Subsequently, Blum and Rosenberg (2007) used this distribution to obtain the distribution of intercoalescence times conditional on   ().
We briefly explain how the distribution of   () can be obtained using a phase type approach.Indeed, let  () = ∑  =+1   be the time until the coalescent reaches a state with  lineages.Obviously, the event   () ≤  is equivalent to the event  () ≤  which can be computed by choosing '' lineages'' as the absorbing state.Thus, the computation of the distribution of  () is analogous to the computation for the tree height in the previous example, but with the reduced sub-intensity matrix  taken as the {1, … , ( − )} × {1, … , ( − )} sub-matrix of (12).

Mean and variance for phase-type distributions from first-step analysis
We now consider the mean, variance, and higher-order moments for a phase-type distribution with the sub-intensity matrix  , where   =   .As described above, a phase-type distribution corresponds to the first passage time into the absorbing state, and the traditional procedure for determining the mean and variance uses a first-step analysis (e.g.Section 5.4 in Ibe (2013)).An alternative procedure for deriving the higher-order moments is to differentiate the Laplace transform and evaluate at zero.We describe both procedures in this subsection.
Let   be the expected time to absorption given the initial state is .For any transient state , we have (see e.g.Section 5.4 in Ibe (2013) or Section 5.1.3in Wakeley (2008)) because the time before leaving state  is exponential with rate −  , and the probability of jumping from state  to another transient state  is   ∕(−  ).We can rearrange the equation to get Let  = ( 1 , … ,   )  be the vector of means.In matrix format, we write (14) as and we get  = (− ) −1 .We call  = (− ) −1 the Green matrix (Bladt and Nielsen, 2017).For  ∼ PH(,  ), the initial distribution is , and we get the mean Now let us derive an expression for the variance.Let be the second moment for the time to absorption given initial state .Wakeley (2008, Section 5.1.3, page 145) derives the equation Here,   is the exponentially distributed time before the jump from state  (with rate −  ), and we have and re-arranging we obtain Using Eq. ( 13), this formula amounts to A. Hobolth et al.Let  = ( 1 , … ,   ) ′ be the column vector with entries   ,  = 1, … , .
In matrix format, we have We conclude that E[ 2 ] = 2(− ) −2  In general, the higher-order moments for a PH(,  ) distributed random variable are given by as stated in, e.g., Corollary 3.1.18in Bladt and Nielsen (2017).This formula is general, has a simple implementation, and avoids calculating the moments using first-step analysis.Perhaps the easiest derivation of the general formula is to first calculate the Laplace transform Second, the higher-order moments for the PH(,  ) distribution can be derived by differentiating the Laplace transform with respect to  an appropriate number of times, and evaluate at zero.We end this subsection with a brief discussion of the evaluation of the matrix formulas in the spirit of Røikjer et al. (2022).Suppose we want to evaluate the first moment, i.e (− ) −1 .If we let  = (− ) −1  then we need to solve the linear equation system −  = , and multiply the solution  by  from the left.In population genetics, the rate matrices are often upper triangular, and in this case the linear equation system can be solved using the back solve algorithm (e.g.Arnold et al. (2019), Section 2.5).To illustrate how the back solve algorithm works we consider the total tree height with sample size  = 4.In this case the linear equation system .
Furthermore, the back solve methodology also works if the rate matrix is an upper block triangular matrix instead of a strictly upper triangular matrix; see Røikjer et al. (2022) for more information.Now consider the problem of evaluating the second moment 2(− ) −2 .We first find  1 = (− ) −1  as the solution to the linear equation system −  1 =  using the back solve algorithm as described above.Second we solve the linear equation system −  2 =   with respect to  2 (again using the back solve algorithm), and finally we compute the second moment from 2(− ) As a final remark we note that the rate matrices in population genetics are often sparse and large.Using R, the matrices could be stored in the compressed sparse row (CSR) format for efficient evaluation.Efficient implementation of the phase-type matrix formulas is an important future research area.

Reward-transformed phase-type distributions
Recall from Fig. 1 that quantities such as the total tree length, external tree length, or internal tree length are weighted versions of the time spent in the states of the ancestral process.In this subsection, we consider these so-called reward-transformed variables.The class of phase-type distributions is closed under reward-transformations, meaning that a reward-transformed phase-type distribution is, again, phase-type distributed.This is a huge advantage in order to obtain the probability distribution and moments of the reward-transformed variable.
Let  ∼ PH(,  ) and let {  } ≥0 be the underlying Markov jump process.We then define a vector  = ((1), (2), … , ()) consisting of non-negative rewards, so that a new reward-transformed random variable τ is given by If () = 1 for all , then τ = .If () = 1 and () = 0 for  ≠ , then τ is the time spent in state .If instead any of the entries in  is a non-negative number different from 1, then τ corresponds to the total reward accumulated until absorption.
An alternative but equivalent definition of the reward-transformed variable is where   is the total time spent in state  before absorption.
If all of the values in  are positive, then it can be shown that τ ∼ PH(, T ), where T =▵ () −1  (Bladt and Nielsen, 2017).Here, ▵ () denotes the diagonal matrix of size  ×  with  in the diagonal.Alternatively, if one or more of the rewards in  are equal to 0, then τ will also follow a PH, but the number of states in the new PH is  minus the number of zero-reward states.This reflects that even though transitions through a zero-reward state are possible, they should not be accounted for (they can be removed) in the distribution for τ.Theorem 3.1.33in Bladt and Nielsen (2017) provide general formulas for the initial distribution and sub-intensity matrix for a reward-transformed phase-type distribution, and these formulas are implemented in PhaseTypeR.
The ability to reward transform is extremely useful in population genetics, since, as explained above, many quantities are, in fact, weighted sums of exponential distributions.This is the case for the total tree length , which is a version of the tree height  weighted by the number of branches in each time interval between coalescent events.Since  ∼ PH(  ,   ), as explained in Sections 2.1 and 2.2, we can define a reward vector   = (,  − 1, … , 2), so that  ∼ PH(  ,   ), where   =▵ (  ) −1   and   =   =  1 .The sub-intensity matrix   will be the same as when calculated using Eq. ( 12).
Many other quantities in population genomics are linear combinations of the times spent in a state before absorption.This is the case for the total branch length leading to each of the elements of the site frequency spectrum (singletons, doubletons, etc.), the total length of the external branches, and the total length of the internal branches.Consider for example the total branch length  1 leading to singletons when  = 4.A naive first step would be to represent the total tree height  using phase-type theory.We could use Eq. ( 12) to calculate the sub-intensity matrix so  ∼ PH(  ,   ), where   = (1, 0, 0).However, we cannot directly reward transform   to get a PH-representation of the total singleton branch length.The reason for this is that we are now interested in modeling a specific type of branch, so we need to consider all possible trees for the coalescent process.For  = 4, there are two types of trees, depicted in Fig. 2A.Both trees start with a coalescence between two singleton branches forming a doubleton branch.From the remaining sequences, any of the two singleton branches will coalesce with the doubleton branch to form a tripleton branch with probability 2/3 (left tree), or the two singleton branches will coalesce with probability 1/3 to form a second doubleton branch (right tree).Finally, in both trees, the two remaining branches will find common ancestry and enter the absorbing state in the next coalescent.Following this tree pattern, we can re-formulate the PH representation of the tree height  by splitting the state in   with two branches into a state with one singleton branch and one tripleton branch and a state with two doubleton branches See Fig. 2B for the state transition diagram and Fig. 2C for the new phase-type formulation of the tree height distribution.Thus, the tree height can also be modeled as  ∼ PH( α , T  ), where α = (1, 0, 0, 0).This is an equivalent representation of  when using   and   , but with the added advantage of modeling all branch types explicitly.
We can now define reward vectors to obtain the distribution of total branch length  (Fig. 2D) and total singleton, doubleton and tripleton branch lengths (Fig. 2E).The reward vector   = (4, 2, 1, 0) correspond to the number of singleton branches in each of the states.After reward transforming T  with   to obtain   , we get that  1 ∼ PH(  ,   ), where   = (1, 0, 0).Note that the new sub-intensity matrix   is of size 3 since state 4 in T  is discarded due to it having a reward of 0 (no singleton branches).In a similar manner, we can get PH representations of the branch length leading to doubletons ( 2 ) and that leading to tripletons ( 3 ) by reward transforming T  by   = (0, 1, 0, 2) and   = (0, 0, 1, 0), respectively.Reward transformation is implemented in PhaseTypeR, and for  = 4 we get (see the accompanying R code) The fork tree does not contain any tripleton branches and occurs with probability 1/3.This is reflected in the initial distribution  3 that only adds to 2/3.We say that the tripleton branch length distribution is defective; with probability 1/3 the tree does not contain tripleton branches.
For any sample size , the reward vectors for all of these quantities can be automatically computed by careful consideration of the block counting process of the coalescent as described in Algorithm 4.2 in Hobolth et al. (2019) and Example 1 in Rivas-González et al. (2023).Recall from Fig. 2B that if the number of samples is 4, then the state space is of size 4, and the states are (4, 0, 0), (2, 1, 0), (1, 0, 1) and (0, 2, 0), where the first entry in each triplet is the number of singleton branches, the second entry is the number of doubleton branches, and the last entry is the number of tripleton branches.We now note that 4 = 1 + 1 + 1 + 1 = 1 + 1 + 2 = 1 + 3 = 2 + 2, and these are exactly the five possible ways of partitioning the number 4 into a sum of positive integers.In general, the size of the state space for the block counting process is given by the so-called partition function from number theory.
The partition function is sequence A00041 in the Online Encyclopedia Of Integer Sequences (OEIS) and can be found at https://oeis.org/A000041.In the second row in Table 1 we provide a selected number of values for the partition function.For a sample size  larger than 45, the size of the state space for the block counting process increases to more than 100,000 states.Fortunately, all hope is not lost.In Table 1 we show different measures of sparsity for the block counting process.In the third row we show the number of non-zero entries for the subintensity matrix.For example, the sub-intensity matrix for sample size  = 4 in Fig. 2C has seven non-zero entries.The number of out-going edges in Fig. 2B is five and the number of nodes is the partition number (4) = 5, so the average number of out-going edges is 1.Finally, the sparsity is calculated as one subtracted by the fraction between the number of non-zero entries and the size of the subintensity matrix (() − 1) 2 .For  = 4, this amounts to {1 − [7∕(5 − 1) 2 ]} = 0.5625.It is encouraging that the amount of sparsity grows with the sample size.For a sample size of  = 50 the average number of out-going edges for each node is smaller than 15 despite a total number of 204,226 nodes.

Moments for reward-transformed phase-type distributions
We now consider a first-step analysis for the mean reward before absorption.The reward transformation is given by be the mean accumulated reward given the initial state is , and let  1 be the time until the first jump.We always jump from  to some other state  ≠  and therefore By taking the mean on both sides of the previous equation we get We can rearrange this equation as Letting  = ( 1 , … ,   ) ′ be the column vector of   ,  = 1, … , , we obtain where ▵ () denotes the diagonal matrix with  as diagonal.If the initial probability vector is , we get In general, it holds that Note that in case of strictly positive rewards   > 0 for all  = 1, … , , this result is consistent with ( 16) and  being phase-type distributed with initial distribution  and sub-intensity matrix ▵ () −1  .

Multivariate phase-type distribution and joint branch lengths
A multivariate phase-type distribution is the joint distribution of two or more reward-transformed phase-type distributions.Let  ∼ PH(,  ) and let {  } ≥0 be the corresponding Markov jump process with  transient states.Consider  non-negative reward functions and let  = {  } be the  ×  matrix with entries   =   ().Hence, the 'th column of ,  ⋅ , consists of (  (1), … ,   ()).Let be the accumulated reward for reward function   (⋅).Then the random vector  = ( 1 , … ,   ) is said to be multivariate phase-type distributed with parameters ,  , and , and we write  ∼ MPH(,  , ).
Recall the tree height , total tree length , external tree length , and internal tree length  from Fig. 1.The joint distribution of these four stochastic variables is multivariate phase-type distributed with sub-intensity matrix  and reward-matrix  given by ) .
In Section 3.1, we show how to calculate the covariance between the variables in a multivariate phase-type distribution, and in Section 3.2 we describe how to obtain the covariance between the branch lengths (, , , ) in the standard coalescent model with sample size .

Covariances in the multivariate phase-type distribution
In this subsection, we use first-step analysis to derive the mixed moment of two reward-transformed variables.Bladt and Nielsen (2017) derive the mixed moment using the Laplace transform of the multivariate phase-type distribution.
Consider the two reward-transformed variables be the mixture moment of the accumulated rewards given that the initial state is .Let  1 be the time until the first jump.We have and we get where  1 () and  2 () are the mean accumulated rewards for each of the two variables given the initial state is .Multiplying by −  on both sides of the equation and applying Eq. ( 21) we get A. Hobolth et al.We rearrange the terms to get Letting  = ( 1 , … ,   ) ′ be the column vector of   ,  = 1, … , , and using Eq. ( 22) we get where  = (− ) −1 .If the initial probability vector is  we get which is in accordance with Bladt and Nielsen (2017, page 440).

Covariances of branch lengths in the n-coalescent
A main advantage of using MPHs is that we can now easily determine the covariance between the different reward-transformed phasetype variables.For example, we can straightforwardly obtain the variance-covariance matrix of the -ton branch lengths ( 1 , … ,  −1 ) for a sample of size , given that we know the initial state vector, sub-intensity matrix and reward matrix for the block counting process.
Using MPHs, we can also calculate the joint moments of different summary statistics of coalescent trees.For example, for a sample size of , we can define reward vectors for the total tree height , the total tree length , the total length of external branches , and the total length of internal branches .These rewards can be computed using the blockcounting process of the standard coalescent model (Hobolth et al., 2021).As an example, for  = 4, recall Fig. 2E and the corresponding MPH-formulation in Eq. ( 26).The reward vector for  is given by   =  1 +  2 +  3 = (4, 3, 2, 2), which corresponds to the sum of all branches.The reward vector for  is given by   =  1 = (4, 2, 1, 0), because the external branches correspond to the sum of all branches leading to singletons.The reward vector for  is   =   −   = (0, 1, 1, 2), because the total internal branch length is the sum of all branches not leading to singletons.Finally,   = (1, 1, 1, 1), corresponding to weighting each state equally.Collected into a reward matrix, these reward vectors can be used to build a MPH with the base sub-intensity matrix in (26).A similar procedure can be used for an arbitrary sample size .
After having defined the MPH, calculating the covariance between any pair of variables is straightforward using Eqs.( 23) and ( 27 In Fig. 3, we show the covariance and correlation between tree height , tree length , external branch length , and internal branch length  for varying sample sizes in the standard coalescent model.Our Fig. 3 is in accordance with Figure 2 of Alimpiev and Rosenberg (2022), where the authors provide a compendium for covariances and correlation coefficients between pairs of coalescent tree properties.The formulas in Alimpiev and Rosenberg (2022) are derived using careful bookkeeping.Phasetype theory, in contrast, offers an attractive alternative that operates on the matrix level, which avoids cumbersome mathematical derivations.

Beyond the n-coalescent
The application of the phase-type framework in mathematical population genetics extends well beyond the standard coalescent.The multiple merger coalescent models can also be defined within the phase-type framework (Hobolth et al., 2019), as well as the structured coalescent, the coalescent with recombination, and the seed bank coalescent.
For example, in the -coalescent,  sequences coalesce to  lineages ( = 1, … ,  − 1) with a rate of where  is a probability measure on [0, 1].The -coalescent corresponds to the Kingman coalescent when  =  0 , that is, the unit mass at zero.For any other , the sub-intensity matrix for the multiple merger coalescent can be calculated using where  , = (   )  , for  = 2, … ,  and  = 2, … , , and   = ∑  =2  , .The PH formulation of the tree height under the -coalescent is now straightforward, since  ∼ PH(,  ), where  = (1, 0, … , 0) and  is calculated from (29).We can now apply standard phase-type formulas, without the need for specific formulas for the -coalescent.For the definition of the sub-intensity matrix for the Bolthausen-Sznitman coalescent, we refer to Kersting et al. (2021), and for other multiple merger models, we refer to Birkner and Blath (2021).Blath et al. (2020) use phase-type theory to obtain the expected site frequency spectrum for the seed bank coalescent and the two-island structured coalescent.The correlation structure between tree heights in the coalescent with recombination with two samples and two loci (e.g.Wakeley (2008), Chapter 7.2) is described in Rivas-González et al. (2023).

The classical approach
The number of segregating sites  is the total number of mutations that have happened along the coalescent process.Using the infinitesites model, the mutation rate when  branches are present is   = ∕2.For  = 2, the mutation rate is  2 = .Mutations are allowed to occur until the two sequences find common ancestry, which happens with a coalescence rate of   = (  2 ) = (−1)∕2.The number of segregating sites  when  = 2 is therefore geometrically distributed  ∼ Geo( 2 ), with probability  2 that a mutation occurs before a coalescent event.Here, we define the geometric distribution  ∼ Geo() as P( = ) =   (1−) for  = 0, 1, …, and 0 <  < 1.To calculate  2 , we can view the situation as a competition between two independent events with exponential waiting times, namely a mutation event and a coalescent event.In such cases, the probability that a particular event happens before the other can be computed as the relative rate of the events of interest.Thus, The geometric distribution also applies for all other stages of the coalescent process, i.e., when  branches are present, the number of mutations that happen before coalescence into  − 1 sequences is   ∼ Geo(  ), where To obtain the full distribution of the total number of segregating sites with an initial sample size of , we thus need to sum the independent geometric distributions, so that  = ∑  =2   .Similar to how the distribution for the tree height  is calculated, the probability density function for , P( = ), can be calculated using convolutions of geometric distributions.
Another classical way of formulating the probability density function for  is by realizing that mutations happen following a Poisson process sprinkled over the evolutionary tree, i.e.,  |  ∼ Poisson(∕2), where  is the total tree length.By knowing the distribution of , we can use formulas for calculating the marginal distribution of  given the total tree length, and then integrate over all possible tree lengths such that We refer to Wakeley (2008, equation 4.3) for further details.The issue with these approaches is that they both require convolutions of either geometric distributions to directly calculate P( = ), or exponential distributions to calculate   ().This might become a problem, for example, when trying to calculate the distribution of  for non-standard coalescent models such as multiple merger models.An attractive alternative is to use phase-type distributions.

Using phase-type distributions
We can circumvent convolutions by embedding the number of segregating sites  into a discrete-time Markov chain {  } ∈N .For an initial sample size of  = 3, there are three possible states.The  = 2 transient states correspond to 3 branches in the ancestral process or 2 branches in the ancestral process, and the last state is the most recent common ancestor (the absorbing state).The probability of jumping from one state to another can be defined using a transition probability matrix .Because {  } ∈N only has a single absorbing state,  can be partitioned into where where   is given by (30).Here,  is a sub-transition matrix of size  ×  which holds the transition probabilities among the transient states,  is a column vector of size  with the exit probabilities from the transient states into the absorbing state, and  is a row vector of zeros of size .
Since   = P( +1 =  |   = ), each row in  sums to 1, i.e.  = .Thus, the exit probability vector is given by  =  −   = ( −  ), so the Markov chain can be defined solely by the sub-transition probability matrix  .
Let  be the number of jumps (or mutations) until absorption of the Markov chain so that  = inf{ ≥ 0 ∶   =  + 1}.If we define , a vector of initial probabilities, such that P( 0 = ) =   for  = 1, … , , then the total number of jumps until absorption follows a discrete phase-type distribution, i.e.  ∼ DPH(,  ).In other words,  is the total number of jumps (or mutations) of a series of sequentially occurring geometric distributions, which is a Markov chain defined by a starting probability vector  and a sub-transition matrix  holding the transition probabilities among the  transient states.For , when  = 3,  = ( 3 , (1 −  3 ) 2 ).Note that, similarly as in the continuous case,  does not necessarily need to sum to 1, since the Markov chain could also start directly in the absorbing state.In this case, this would correspond to all sequences having coalesced without any mutations.The probability of this happening is called the defect, which can be calculated as P( 0 =  + 1) = 1 − ∑  =1   .PhaseTypeR was used to create the graph.
An alternative procedure of arguing for the discrete phase-type distribution for  is by using Theorem 3.5 in Hobolth et al. (2021) An equivalent formulation is  ∼ DPH( ′ ,  ), where  is as in ( 34), but now  ′ =    .Applying this theorem avoids the need for keeping track of individual geometric distributions, and it allows for easily obtaining a discrete phase-type representation of  for an arbitrary number of initial sequences  or for non-standard coalescent models, given that a continuous phase-type representation of  is available.For example, obtaining a phase-type representation of the number of segregating sites for the -coalescent is straightforward by applying Eq. ( 34) to the sub-intensity matrix calculated from Eq. ( 29).
As for the continuous case, the main advantage of discrete phasetype distributions is that they have closed formulas for the mean, variance, moments, probability density function, and cumulative distribution function.These formulas are in matrix notation, so these properties can be calculated based solely on  and  .

Coalescence times for diploid consanguineous populations
Campbell (2015) considered the ancestral history of two lineages, in a model involving  diploid mating pairs resulting in a total of 2 individuals.In each generation  of the mating pairs are randomly chosen to be consanguineous (siblings).Two lineages are followed backwards in time, and the expected time until their common ancestor is derived.This model can be phrased in terms of a Markov chain consisting of three transient and one absorbing state.The absorbing state is the coalescence of the two lineages.The transient states arein this order -two lineages being in the same individual (1ind), two lineages in two individuals of the same mating pair (1mp), and two lineages being in two different mating pairs (2mp).
The adjacency graph can be found in Fig. 4, and the sub-transition probability matrix between the states is given by The time until absorption has a discrete phase-type distribution.We may therefore use Corollary 1.2.64 in Bladt and Nielsen (2017) to obtain the expected time until absorption as E() = ( −  ) −1  with starting distribution  and the vector of ones  = (1, … , 1) ′ .Using that .
The three entries in this vector correspond to the three possible starting states (1ind,1mp,2mp).This result reproduces findings in Campbell (2015), and can also be found as equations ( 4)-( 6) in Severson et al. (2019).Severson et al. (2021) also derive the variance of .Using a phasetype approach, the variance may be alternatively obtained using the expression for factorial moments provided by Theorem 1.2.69 in Bladt and Nielsen (2017).For the second factorial moments, this formula translates into In our case we get , with the components representing the three possible starting states (1ind,1mp,2mp).
The variance is given by E(( − 1)) + E() − [E()] 2 , and for the three different initial states we get Higher order moments can be obtained using similar computations.

The wright-Fisher process
Other classical quantities in population genetics apart from the number of segregating sites also follow discrete phase-type distributions.The time to fixation (in the number of generations) of the Wright-Fisher process is one example.Krukov et al. (2016) used classical absorbing Markov chain theory to determine the probability of fixation or extinction of an allele, the expected time to fixation or extinction, and other key properties of the Wright-Fisher process.Krukov et al. (2016) describes a very efficient and scalable solution for linear systems.Here, we provide the solutions of the linear systems using phase-type theory.
Given a panmictic, haploid population with a finite population size , let   denote the number of individuals carrying a certain allele in generation , and   =   ∕ the frequency of the allele.Because the population has a finite size, there are only  + 1 possible allele frequencies ranging from lost (  = 0) to fixed (  = 1).The frequency of an allele in the population depends on the frequency of that allele in the previous generation.This way, the probability of observing a certain frequency can be calculated through binomial sampling from the previous generation, by fixing the number of trials to  and the probability of success to   .The dynamics are thus given by  +1 |  ∼ Binom(,   ), where   =   ∕. ( Following the probability mass function of the binomial distribution, we can calculate the probability of observing any of the intermediate frequencies  +1 in the next generation for a certain   .We can then define  as the number of generations until the allele becomes either fixed or lost.This way,  can be described in the phase-type framework as  ∼ DPH(  ,   ) by letting the transient states  be the ( − 1) possible frequencies   between fixation or loss.The probabilities calculated through binomial sampling are the probabilities of jumping from a certain frequency   in generation  to a frequency  +1 in generation  + 1.Thus, these probabilities can be summarized in the sub-transition matrix   with entries where Additionally, all the starting probabilities in   are set to 0 except the entry corresponding to the starting frequency  0 , which is set to 1.
Instead of an allele drifting neutrally, the Wright-Fisher model can also be used to model alleles under selection.The time to fixation or loss is then  ′ ∼ DPH( ′  ,  ′  ), where where  is the selection coefficient for the focal allele (Etheridge, 2011).By plotting the density function of  ′ under different selective forces, we can observe a bimodal distribution for intermediate  values (see Fig. 5, black lines).This happens because the absorbing state of  ′  includes both instances when an allele is either lost or fixed.In order to distinguish these two processes, i.e., in order to know the distribution of the time to loss and the time to fixation separately, we must transform the underlying phase-type distribution based on the exit vectors of each of the two processes.
Let  be a transition probability matrix with  transient states and  absorbing states, such that where  is a sub-intensity matrix of size  and   is the exit rate vector into absorbing state  + .If we define  = inf{ ≥ 0 ∶   ∈ { + 1, … ,  + }}, then  ∼ DPH(,  ).The probability of exiting into absorbing state  +  is then given by Additionally, it has been shown through time reversal that a discrete phase-type distributed variable with more than one absorbing state is still discrete phase-type distributed conditioned on being absorbed in a certain absorbing state (Gardner et al., 2021), so  | (  =  + ) ∼ DPH(  ,   ).The conditional phase-type distribution can be obtained by modifying the original sub-probability matrix  and initial probability vector  using the exit probability vector   of the corresponding absorbing state  to obtain a new sub-probability matrix   and initial probability vector   for each of the absorbing states (Gardner et al., 2021, Section 3.2) Thus, one can obtain phase-type representations of the Wright-Fisher model with selection for the time until absorption conditional on loss or fixation.Let   =   and   =   be the exit vector rates for the loss and the fixation, respectively.We can define these to be It is now straightforward to obtain DPH representations conditional on these two absorbing states.Fig. 5 shows the conditional density for loss and fixation weighted by the probability of loss and fixation (see Eq. ( 39)), respectively, on top of the original phase-type distribution for different selection coefficients.
Apart from the discrete case, continuous phase-type distributions with multiple absorbing states can also be conditioned on a certain absorbing state, and the resulting distribution will also be a continuous phase-type (Andersen et al., 2000).

Reward transformation in discrete phase-type distributions
In addition to the continuous case, DPHs can also be rewardtransformed (Campillo Navarro, 2018).Let  ∼ DPH(,  ), where {  } ∈N is the underlying Markov chain.We can define a vector  = ((1), (2), … , ()) consisting of non-negative integer rewards so that a new random variable  ′ satisfies It can be shown that  ′ is also DPH (Campillo Navarro, 2018).This is particularly useful to study the elements of the site frequency spectrum, since singletons, doubletons, etc. are all reward-transformed versions of the total number of segregating sites (see Hobolth et al. (2021) for further details).

The probability of a configuration and statistical inference
Phase-type theory also provides a systematic approach to computing probabilities for observed mutational patterns.These probabilities may then be used for subsequent likelihood-based statistical inference.Due to a large number of possible tree topologies, likelihood computations are notoriously difficult for coalescent models.Therefore, statistical inference often relies on summary statistics.In this section, we explain how phase-type theory provides a principled approach to obtain likelihoods for small sample sizes, as well as distributions for summary statistics such as the total number of segregating sites and features of the site frequency spectrum (SFS).
As a first example, consider the estimation of the scaled mutation parameter .Different estimators of this parameter have been proposed, such as Watterson's estimator, Tajima's , or the estimate θ in Fay and Wu (2000).In Hobolth et al. (2021), it is explained how the distributions of these estimates can be obtained using a phase-type approach.Again, this provides a flexible and general tool to obtain both univariate and multivariate distributions.It can be applied to several features of the site frequency spectrum, such as linear combinations used with test statistics such as Tajima's .Indeed, Hobolth et al. (2021, page 15) explains how phase-type distributions are constructed by using a blockcounting approach.Their Figures 3 and 4 provide illustrative examples for samples of size  = 4 and  = 5.The obtained distributions are useful when constructing cut-off values for hypotheses testing, and calculating confidence intervals.Although these distributions can also be obtained directly, see for instance equation (4) in Griffiths and Tavaré (2018) or (Tavaré, 2004) for the probability generating function of the number of segregating sites, the phase-type approach is quite convenient in terms of its generality and flexibility.
In this section, we illustrate that phase-type theory also provides tools to compute full small sample likelihoods for subsequent statistical inference.We will also explore the connection to the generating function approach by Lohse et al. (2011).

Likelihoods via Laplace transforms
We consider full likelihood inference in the framework of the infinite sites model where data consists of homologous DNA sequences.In the standard coalescent model, the data can be used to infer the scaled Fig. 5. Wright-Fisher model with selection, for  = 100 and an initial frequency for the selected allele of 0.05.The selection coefficient  is shown on top of every panel.The (re-scaled) phase-type density of the number of generations to either loss or fixation is shown as a black line.The full path of the phase-type distribution was simulated 50 times for each , and each simulation is plotted as an allelic trajectory colored by whether the end state was loss (blue) or fixation (yellow).Blue and yellow lines and areas correspond to the conditional densities of the time to loss or fixation, respectively.These conditional densities are weighted by the probability of exiting into each absorbing state in order to match the phase-type density.mutation rate .Due to a large number of possible coalescent tree topologies, full likelihood inference eventually becomes computationally infeasible with an increasing number of sampled sequences.Therefore, importance sampling methods such as genetree have been proposed (see Griffiths (1989), Griffiths and Tavaré (1994)) to approximate the likelihood under an infinite sites model.
Nevertheless, progress has been made, and for small samples Lohse et al. (2011), Uyenoyama et al. (2019) and Uyenoyama et al. (2020) suggested full likelihood methods for inference also for demographic parameters.

Joint likelihood for mutation counts on coalescent branches
In Lohse et al. (2011), the authors use generating functions (Laplace transforms) to obtain joint probabilities for the number of segregating sites on the branches of a small genealogy.This work has subsequently been extended; see, for instance, Lohse et al. (2016), andBisschop (2022) for an efficient evaluation of the Laplace transform using a graph traversal algorithm.
The basic idea of their approach may be summarized as follows.Suppose there is a branch  in a standard coalescent tree and the mutation rate is  = ∕2 per unit length.A standard assumption is that the number of observed mutations   given the length   of lineage  is Poisson(  ).Then the probability of observing   mutations on branch  is where If   is exponentially distributed with rate parameter , then   () =  + .Applying this observation to a standard coalescent with two samples (denoted by  and ), we have  = 1 (see e.g.page 76 in Wakeley (2008)).Thus, for lineage .
By using the multivariate Laplace transform two lineages may be considered simultaneously.Notice that here   =   , since the two lineages share the time until their coalescence.Similar arguments as above may now be used to obtain the joint probability of observing   mutations on branch , and   mutations on .This gives This approach may be extended also to a few more lineages and to some basic demographic models.It also provides us with the likelihood when the resulting probabilities are viewed as a function of .We now explain how analogous results can be obtained via phase-type distributions.

General phase-type approach
To illustrate how results such as the one from the previous subsection can be obtained via a phase-type approach, we will use Theorem 2.7 in Hobolth et al. (2021).As explained above in Section 4.2, the theorem describes how to add Poisson-distributed mutations onto a (multivariate) phase-type distribution and provide the joint probability generating function of mutation counts on disjoint branches of the coalescent tree.More specifically, the result assumes multivariate phase-type distributed random variables ( 1 , … ,   ) that are defined using rewards.Conditional on ( 1 , … ,   ), the random variables ( 1 , … ,   ) are then taken as independent Poisson distributed with rates  which implies   |   ∼ Poisson(  ).In this setup, the probability generating function of ( 1 , … ,   ) is given as where ▵ (⋅) gives a diagonal matrix with entries specified by the argument.
We now explain this formula in a coalescent context.Assuming that the underlying Markov process has  states besides the absorbing state, the -dimensional vector  provides the starting distribution.In our context, it assigns probability one to the starting configuration before the first coalescence event.The reward matrix  ∈ R × + represents branches or sets of branches on which independent mutation counts ( 1 , … ,   ) occur.Due to the independence assumption, the same mutations should not contribute to different   's.Each column of  represents one mutation count.The × sub-intensity matrix  contains the transition rates between the transient states.Finally, the vector  = −  provides the absorption rate for each transient state.The vector  = ( 1 , … ,   )  contains the variables of the generating function for the mutation counts.
The joint distribution of the mutation counts of interest   (1 ≤  ≤ ) can now be obtained by taking suitable partial derivatives of the probability-generating function.More specifically, (42)

Phase-type for two samples
With two samples, there are two possible states: the initial state (with two lineages), and the absorbing state after coalescence.The transition rate to the absorbing state is 1.Therefore,  = 1 and  becomes the scalar −1.Furthermore, the reward matrix  = (1, 1) uses the same reward twice to capture both branches of equal length.Additionally, the initial state vector is  = 1 and the exit rate is  = 1, since there is only one transient state.Together this leads to Therefore, the probability of no mutations on either branch is According to Eq. ( 42), P(  =   , 1 ≤  ≤ 2) can be obtained by taking partial derivatives.Indeed, with  1 derivatives with respect to  1 , and  2 derivatives with respect to  2 , an evaluation at (0, 0) reproduces formula (41).
It is now easy to obtain the maximum likelihood estimate (MLE) of  by setting the first derivative of (41) to zero.This leads to λ = ( 1 +  2 )∕2 as our MLE.

Phase-type for three samples
We next extend to a coalescent model for three samples.Our model (see Fig. 6) may be represented by a Markov chain with two transient states (3,0,0) and (1,1,0) that represent the initial configuration with three singleton lineages, and the configuration with one singleton and one doubleton lineage after the first coalescence event.The absorbing state is denoted by (0, 0, 1) and represents the most recent common ancestor of our three samples.
As the first transition occurs at rate 3, and the second at rate 1, the sub-transition matrix for the transient states is given by ) .
We next derive the joint distribution of the mutation count vector ( 1 ,  2 ,  3 ,  4 ) that belong to the four branches as shown in Fig. 6.For this purpose, we use the four linear reward functions specified by the columns of  = ( 1 1 1 0 0 0 1 1 ) .
As with two branches, a general formula for the joint mutational patterns can be obtained by taking suitable partial derivatives.Indeed, The resulting derivatives may be rewritten in a simpler form as with the sum being over the possible allocations of  3 to the two branch parts with lengths  3 and  2 .As a numerical example, we obtain (3, 1, 4, 2) ≈ 0.000254 for  = 2.We next consider the labeled coalescent and assign samples , , and  to the leaves of the tree in Fig. 6.There are 3! = 6 possible such assignments that occur with equal prior probability.We want to compute probabilities for the mutational configurations (  ,   ,   ,   ,   ,   ).First, notice that under the infinite site model, only one of the doubleton mutation counts can be nonzero.Furthermore, since the configuration of doubleton mutations identifies For other nonzero doubleton mutation counts, the formula is analogous.
If there are no doubleton mutations, i.e.   =   =   = 0, then all six permutations lead to nonzero probabilities for corresponding mutational configurations.The six assignments of the label permutations to the tree in Fig. 6 can then be partitioned into three pairs of equal probability (again due to permutations within the shorter branches), and therefore ( (  ,   ,   , 0) + (  ,   ,   , 0) + (  ,   ,   , 0) ) .

Two sub-populations with three samples
We now consider a model with two subdivided populations (or demes) and migration at rate  between the populations.Likelihood inference for such models has been considered for instance by Costa andWilkinson-Herbots (2017, 2021).In this work, (composite) likelihoods for two samples are obtained via general Markov chain theory.Here we obtain likelihoods for three samples using phase-type theory.The mutation rate is assumed to be  in both populations, and the two populations are assumed to be of equal size.We assume that  and  are sampled in the first population and  is taken from the other.We denote this configuration by (, |).As mutation and coalescence rates are equal for both populations and the migration rate between the two populations is also equal, we have full symmetry.We, therefore, do not distinguish between configurations that are swapped across populations.For instance, (, |) and (|, ) are combined into a single state.Following this strategy, we have a total of 11 states, with (, |) being the initial state, and (|) being the final absorbing state.Samples that are not separated by a vertical bar or a comma are assumed to have coalesced.For instance,  represents the state where all three lines have coalesced.Table 2 summarizes the 11 states of our three-lineage model with two subdivided populations.
The sub-transition matrix  between the transient states is given in Box I.In Fig. 7, we show the adjacency graph for the three sample model with migration.
There are six types of lineages characterized by the labeled leaves that they subtend: singleton ,  and , and doubleton ,  and  (recall Table 2).The rewards are constructed to measure the contribution of the states to these lineages.They are represented by the following 10 × 6 reward matrix , with the rows representing the transient states ordered in the same way as in  .

𝑹
The column entries are either zero or one, depending on whether the lineage represented by a column is present in a given state.The columns of  are allocated to the lineages in the order singleton mutation in , singleton mutation in , singleton mutation in , doubleton mutation in  and , doubleton mutation in  and , and doubleton mutation in  and .As before, the generating function can be obtained as with  = ( 1 ,  2 , … ,  6 ).Probabilities for a given mutational pattern can again be computed by taking suitable partial derivatives of (⋅) and evaluating at  = (0, … , 0).The partial derivatives of (⋅) can either be computed directly or by using the matrix identity and the chain rule.By plugging in values for the mutation rate  and the migration rate , likelihoods can be obtained for arbitrary parameter values.Analytic computation of the likelihood is possible but leads to long formulas.We are currently exploring the possibility of a matrix analytic computation.

Likelihood based statistical inference
The multivariate probability distributions obtained in the above subsections can be used for statistical inference.As our first illustrative example, we consider a standard coalescent model for a sample of size three.Suppose for instance that (  ,   ,   ,   ) = (0, 1, 1, 3).Then Eq. ( 44) and its extension to the labeled coalescent may be used to obtain the likelihood function.Fig. 8 shows the maximum likelihood estimator (MLE) using our formula (44), as well as the Ewens-Watterson estimator proposed both by Ewens (1974)) and Watterson (1975).They are clearly different.
We next look at the situation where there are  = 20 independent loci each with a sample of size three, and  is estimated from their combined information.Fig. 9(a) displays the log-likelihood (summed across all loci) and shows that the MLE is more accurate than the Ewens-Watterson estimator.
In Fig. 9(b), we consider one locus, and a sample of size  = 20.Although there are methods such as genetree (see Bahlo and Griffiths (2000)) available that approximate the full likelihood for a coalescent with twenty lineages, we rely on our formula for three lineages and use a composite likelihood approach.For this purpose, we took 1000 subsamples from the original sample, each of size three without replacement.The likelihood is then computed pretending that these subsamples are independent.With 20 starting lineages, it would be easily possible to consider all ( 20 3 ) = 1140 possible configurations.We relied on subsampling, however, to illustrate that the approach can be easily generalized to any number of lineages.
To show that our illustrations indeed represent typical cases, we look at the average behavior over 100 simulation runs for the three scenarios considered above.Table 3 provides the root mean squared errors, i.e. the square root of the average squared differences between the estimates and the true parameter value.As expected from coalescent theory, data from 20 independent loci lead to the most accurate Box I. Fig. 7. Adjacency graph for the three sample model with migration.The states correspond to those specified in Table 2.The initial and the absorbing state are colored in green and red, respectively.Possible transitions, either due to migration or coalescence events, are represented by arrows.The transition rates can be found in the sub-transition matrix  .
PhaseTypeR was used to create the graph, and the R code can be found in the accompanying script.2016) also consider a composite likelihood approach, their focus is on multiple linked loci and not multiple subsamples at one single locus.It might therefore be interesting to explore our proposed composite likelihood approach further, for instance in terms of uncertainty quantification.We next provide an example for the model described in Section 5.2 involving both mutations and migration.For this purpose, we used ms (Hudson (2002)) to simulate two independent loci for a coalescent that starts with three sequences.We took  = 3 as our scaled mutation rate, and the migration rate  = 4. Notice that ms uses the scaled migration rate  = 4 m, with  being the population size, and m the per generation migration rate.It is related via  = 2 to our migration parameter.The resulting mutational pattern can be found in Table 4.
Using the generating function derived in Section 5.2, we computed the partial derivatives that correspond to the mutational patterns displayed in Table 4.More specifically, we took the derivatives of order 1, 6, and 10 with relation to  2 ,  3 , and  4 for locus  1 , and similarly, the required derivatives for  2 .This was done symbolically with Mathematica.The likelihood is then obtained by evaluating both derivatives at  1 =  2 = ⋯ =  6 = 0 and taking the product.For the likelihood computations, the normalizing constants ( 1 ! 2 !⋯  6 !) −1 may be omitted, as they do not affect the MLE.
Taking the logarithm gives us the log-likelihood.The expression for the likelihood function becomes rather long and can be found in our GitHub repository.Despite its length, however, we still obtained an explicit formula in terms of a function of the unknown parameters.Fig. 10 provides a plot of this composite likelihood function.We observe that the likelihood is much flatter in direction  than in direction .Thus, the data provide more information for estimating the mutation rate  than the migration rate .We also computed M = 4.50, and θ = 3.84 as the maximum likelihood estimates for our example.

Discussion: Extensions and further perspectives
We have demonstrated that phase-type theory provides a general framework to derive basic properties of coalescent trees.Whenever the quantities of interest can be interpreted as (possibly reward weighted) absorption times of a discrete or continuous time Markov chain, the theory provides matrix analytic formulas for computing properties of their distribution.In Section 2 we provided several examples.
In Section 3, we illustrated that joint distributions and their corresponding mixed moments can be obtained by specifying multiple reward patterns that represent quantities of interest such as tree height and length, as well as the external and internal branch lengths.Going beyond the standard coalescent model, phase-type results can be obtained in an analogous way for multiple merger models such as the -coalescent.
Section 4 explained how discrete phase-type distributions can be used to obtain properties of the number of segregating sites, the site frequency spectrum, and coalescence times under consanguineous mating.In a further classical population genetic application, namely fixation or loss in a discrete-time Wright-Fisher model with selection, we demonstrated how to treat a situation with more than one absorbing state.
Distributions and moments of summary statistics, such as the site frequency spectrum, can also be used for statistical inference.In Section 5, we explained how to use phase-type distributions to obtain full likelihoods for small size coalescent trees.Using a composite likelihood approach, we show how these results may be applied to carry out statistical inference under a simple demographic model.
All our applications and examples illustrate that phase-type theory provides a general set of tools that can be applied to a large variety of coalescent models in population genetics.
We end this paper with a short discussion on important extensions of the homogeneous coalescent models, challenges with large sample sizes, and further population genetic models where the phase-type framework seems particularly promising.

Extensions of the homogeneous coalescent model
Phase-type distributions in population genetics are still in their infancy.In this review, we have focused on time homogeneous coalescent models, but several important extensions are natural to consider.A crucial extension is to consider inhomogeneous coalescent models and the corresponding inhomogeneous phase-type distribution (Albrecher and Bladt, 2019).This model corresponds to a coalescent model with variability in population size (e.g.Wooding and Rogers (2002) and Polanski and Kimmel (2002)).Zeng et al. (2021) discretized time into homogeneous blocks of time with constant population size to analyze genetic diversity for balancing selection.Reward-transformation in inhomogeneous phase-type distributions still needs to be developed.

Large sample sizes and computational running time
Large sample sizes pose a challenge for the phase-type methodology because the state space increases very fast with the sample size (recall Section 2.4).In Fig. 11 we show the running time of PhaseTypeR for calculating the mean tree height (left) and variance-covariance matrix of the site frequency spectrum (right) for increasing sample size.In our implementation, the calculation of both quantities is cubic in the size of the state space, and the calculation of the variance-covariance matrix can take up to half a minute when the size of the state space is larger than 250.
A possible strategy to handle a large number of states is to exploit that the rate matrices are very sparse and have a block structure.Røikjer et al. (2022) and Bisschop (2022) represent the rate matrices as graphs and carry out the desired matrix operations on the graph.Røikjer et al. (2022) is mainly focused on calculating the moments, and Bisschop ( 2022) is mainly focused on calculating the likelihood.A natural next step would be to combine the two graph-based approaches in a single software program.
It could be important to have limit theorems for the multivariate phase-type distribution.Limit theorems for the site frequency spectrum are available for the standard coalescent (Dahmer and Kersting, 2015), but a general framework for reward-transformed statistics of homogeneous coalescent models is missing.

Discriminating between coalescent models
An application where phase-type theory could perhaps be rather easily applied is in the context of summary statistics for discriminating between coalescent models.Koskela (2018) suggest using summary statistics consisting of the normalized number of singletons and the cumulative number of -tons with  larger than or equal to 15. Koskela (2018) in his Figure 1 clearly shows that these joint statistics are able to discriminate between the Beta(1 − , ) − −coalescent and exponential or algebraic growth for certain values of sample size, number of loci, size of loci, mutation rates, the value of  and growth parameter.
The choice of singletons and cumulative tail probabilities of the site frequency spectrum as summary statistics discriminating between the

Fig. 2 .
Fig. 2. A. The two possible trees with a sample size of  = 4.The coloring of the branches is according to the number of descendant leaves.B. The states and state transition diagram.C. The corresponding sub-intensity matrix and phase-type distribution for the tree height .D. Rewards for total branch length .E. Rewards for total singleton branch length  1 , total doubleton branch length  2 , and total tripleton branch length  3 .

Fig. 3 .
Fig. 3. Covariance (left axis, blue) and correlation (right axis, red) between tree height , tree length , external branch length , and internal branch length  in the Kingman coalescent for varying sample size .

Fig. 4 .
Fig. 4. Adjacency graph for the sib-mating model for two lineages.The states are ''lineages within 1 individual'' (1ind), ''lineages within one mating pair'' (1mp), ''lineages in two mating pairs'' (2mp), and coaleascence (''coal'').The discrete time Markov chain may start in either of the transient states.Possible transitions are represented by arrows.The transition rates can be found in the sub-transition matrix  .

Fig. 6 .
Fig.6.Mutation patterns and Markov chain structure for a coalescent with  = 3 samples.The probability generating function of the four mutation counts  1 ,  2 ,  3 ,  4 can be obtained by specifying a multivariate reward matrix consisting of four suitable reward vectors.

Fig. 11 .
Fig. 11.Left: Running time of PhaseTypeR for calculating the mean tree height of the standard coalescent from the block counting process.Right: Running time of PhaseTypeR for calculating the variance-covariance matrix of the site frequency spectrum from the block counting process.

Table 1
Partition function () (second row), number of non-zero entries in the subintensity matrix of dimension (() − 1) × (() − 1) (third row), average number of out-going edges (fourth row), and sparsity for the block counting process (fifth row), with varying sample size  (first row).The average number of out-going edges is the accumulated total number of out-going edges for each node (number of states that have a non-zero rate for each state) divided by the total number of states (including the MRCA).See the main text for examples.

Table 2
State number and the corresponding configuration of the 11 states of a three-sample model with migration.The first 10 states are transient, and state 11 is absorbing and corresponds to the most recent common ancestor (MRCA).The first 4 states have three branches and states 5-10 have two branches.|)(,,|) (|, ) (|, ) (|) (, |) (|) (|) (, |) (, |) (|)the sequences belonging to the short branches under the infinite sites model, the probability of observing corresponding mutation counts is nonzero for only two out of six possible permutations of , , .These two permutations only differ with respect to the assignments within the two shorter branches and therefore have the same probability.Assume now that there is a nonzero number of doubleton mutations, and let for instance   > 0. In this case P(  =   ,   =   ,   =   ,   =   ) = 2 6 (  ,   ,   ,   ).

Table 3
Root mean squared error of the M(C)LE (maximum (composite) likelihood estimate) versus Ewens-Watterson estimate from 100 simulation runs.Scenarios: (1) MLE for sample of size  = 3 from  = 1 locus; (2) sample of size  = 20 from  = 1 locus, composite likelihood estimates based on sub-samples of size 3 drawn from 20 genes are used; (3) MLE from samples of size  = 3 from  = 20 independent loci.

Table 4
Sample from model with migration  = 4 and mutation rate  = 3. Mutation counts per lineage for three genes and two independent loci.