Mortality modeling and regression with matrix distributions

In this paper we investigate the flexibility of matrix distributions for the modeling of mortality. Starting from a simple Gompertz law, we show how the introduction of matrix-valued parameters via inhomogeneous phase-type distributions can lead to reasonably accurate and relatively parsimonious models for mortality curves across the entire lifespan. A particular feature of the proposed model framework is that it allows for a more direct interpretation of the implied underlying aging process than some previous approaches. Subsequently, towards applications of the approach for multi-population mortality modeling, we introduce regression via the concept of proportional intensities, which are more flexible than proportional hazard models, and we show that the two classes are asymptotically equivalent. We illustrate how the model parameters can be estimated from data by providing an adapted EM algorithm for which the likelihood increases at each iteration. The practical feasibility and competitiveness of the proposed approach, including the right-censored case, are illustrated by several sets of mortality and survival data.


Introduction
The statistical modeling of human lifetimes is a central topic in actuarial science, as it has immediate consequences for the pricing and management of a wide range of insurance and pension products (see, e.g., Olivieri and Pitacco (2015), Dickson et al. (2019)).Starting all the way back with Gompertz (1825), who proposed a mortality rate that increases exponentially over the lifetime, the field has seen a tremendous development over the last 200 years, and nowadays, sophisticated models are available that take into account the changing nature of mortality over time as well as common and individual trends across different (sub)populations (see, e.g., Pitacco (2004Pitacco ( , 2019) ) for surveys and Denuit and Trufin (2016), Dowd et al. (2020), Lin et al. (2021), Shapovalov et al. (2021), Zeddouk and Devolder (2020) and Renshaw and Haberman (2021) for some very recent developments in different research directions, and Barigou et al. (2021) for a proposition how to combine various alternative models into one).A detailed understanding of these trends is crucial for prudent management of longevity products (Barrieu et al. (2012), Sherris and Zhou (2014)) and the future of old-age provision on the society level in general (see, e.g., Albrecher et al. (2016)).At the same time, the statistical estimation of mortality models from given (often incomplete and intervalcensored) mortality data can be quite challenging; see, e.g., Macdonald et al. (2018) for a recent account.In the presence of the many available model options for a given purpose (e.g., constructing lifetables or making mortality projections), a number of countries nowadays rely on an accorded effort of academia and insurance practice for respective recommendations or guidelines (see, e.g., Antonio et al. (2017) for an excellent survey on the current approach of Belgium and the Netherlands).
Most of the models in practical use are based on a purely statistical fitting of observed mortality patterns and subsequently extrapolating the trends of the past into the future, which historically has sometimes led to an underestimation of the actual mortality improvement.In addition, the resulting models typically rely on an enormous number of parameters, like for instance, 300 in the very popular Lee-Carter model (Lee and Carter (1992)), which employs separate parameters for each year and each age, that then need to be estimated from the given data, cf.Li et al. (2009).This may not be considered an issue, as long as the values of these parameters remain reasonably stable during updates, and if for instance cohort effects are thought of dominating the stochastic nature of mortality.At the same time, it can be interesting to see whether, through other and possibly parsimonious approaches, one can also capture observed mortality patterns to a satisfactory degree.Motivated by corresponding approaches in medicine and early contributions by, e.g., Gutterman and Vanderhoof (1998) in the actuarial context, Lin and Liu (2007) proposed an interesting alternative approach for the modeling of human lifetimes that links its parameters to the biological and physiological mechanisms of aging.Concretely, one may imagine a lifetime as traversing through a number of stages that can be thought of as markers of the biological age, with the time that it takes an individual to go from one stage to the next being random, and an additional possibility to die 'prematurely' at each stage along the way (e.g., due to accidents or other external causes).If such a model can be calibrated to mortality data in a satisfactory way, then one could possibly attribute particular markers to each stage, linking the parameters of the model more directly to the physiological aging mechanism, and opening up the possibility to include expert opinions on particular mortality trends into the models and the resulting forecasts more directly.In addition, such an approach could also allow to compare the mortality patterns of different (sub)populations on the basis of changes of transition rates between those stages, which is a somewhat appealing alternative to the purely statistical approach that is typically followed today, see Lin and Liu (2007) and also Cheng et al. (2020) for a detailed discussion as well as Hassan Zadeh et al. (2014) for an application in a disability context.
A natural candidate for a model of the aging process of the human body is a finite-state homogeneous continuous-time Markov process with a single absorbing state (death).In the context of modeling stages of a disease, such a Markov approach can, e.g., be traced back to Kay (1986), Longini Jr et al. (1989), Guihenneuc-Jouyaux et al. (2000).For early contributions towards human lifetime modeling along these lines, see Gavrilov and Gavrilova (1991), and Aalen (1995) who particularly focussed on the modeling of hazard rates.The resulting class of distributions for the lifetime are known as phase-type distributions.Phase-type (PH) distributions are particularly tractable and have been systematically studied as matrix distributions over the years (see Neuts (1975Neuts ( , 1981) ) and Bladt and Nielsen (2017) for a survey).One of the most attractive features of PH distributions from a statistical perspective is that they are dense in the class of distributions on the positive real line in the sense of weak convergence, so that any given distribution on the positive half-line can be approximated arbitrarily well.However, for a good fit to a given histogram of data, often a large number of states (phases) are required, making the estimation procedure complex.Several estimation methods have been proposed in the literature, including the method of moments in Bobbio et al. (2005); Horváth and Telek (2007) and references therein, Bayesian inference in Bladt et al. (2003), and maximum likelihood estimation via the expectation-maximization (EM) algorithm in Asmussen et al. (1996) (for the rightcensored case see Olsson (1996)).The EM method considers the full likelihood arising from the path representation of a PH distribution and has been the most popular approach for many applications in applied probability.
In Lin and Liu (2007), the human aging process is modeled by such a PH distribution (concretely, a Coxian distribution).It turns out that for an adequate fit to data (the eventual lifetimes), about 200-250 phases are needed (see also Asmussen et al. (2019), where about 50 phases are suggested in the context of pricing equity-linked insurance products, when the focus is more on the stability of eventual prices rather than the accuracy of the fit).Cheng et al. (2020) introduce additional restrictions of the parameters for the fitting procedure.The main reason for the need of so many phases (here and in other applications) towards an adequate fit is typically due to the fact that all PH distributions have exponentially decaying tails, whereas the actual data do not have that property.Rather than using many phases to correct for that, Albrecher and Bladt (2019) proposed a framework of inhomogeneous Markov processes for the PH construction, where time evolves according to a deterministic transform of the original time, which can be targeted towards capturing the tail behavior of the resulting distribution via the transform rather than the introduction of additional states.The resulting class of inhomogeneous phase-type distributions (IPH) is still dense in the class of all distributions on the positive half-line and turns out to lead to adequate fits with substantially fewer phases (and consequently parameters) in various applications.The statistical estimation of IPH distributions was then subsequently developed in Albrecher et al. (2022b).A number of IPH distributions can be seen as matrix extensions of other base distributions, where the latter determine the tail behavior, and the matrix-valued parameters improve the fit of the formerly scalar counterparts, cf.Albrecher and Bladt (2019).
The purpose of this paper is to study the potential of IPH distributions for the modeling of human mortality.In particular, we will look into matrix versions of the Gompertz distribution, which will give the needed additional flexibility for fitting observed mortality rates (hazard functions), yet keeping the number of phases (and consequently parameters) of the model reasonably low.Importantly, the resulting model will allow for an interpretation in terms of the above aging process through phases, with time being transformed according to a deterministic mechanism.The focus in the modeling will be on an adequate representation of the hazard rate.In that way, we combine the original approach of hazard rate modeling with a generalized version of Lin and Liu (2007).The advantage of this combined approach is that the number of necessary phases for a visually acceptable fit goes down from 200-250 to 10-12, and the approach does not have to be restricted to the sub-class of Coxian PH distributions.
In a second step, we will then investigate (for the first time) regression of IPH distributions, which will lead a way to multi-population mortality modeling within the above framework of modeling the aging mechanism.In the life sciences, particular instances of PH distributions (namely Coxian distributions) were considered as survival regression models, within the accelerated failure time (AFT) specification, cf.McGrory et al. (2009); Rizk et al. (2021); Tang et al. (2012) and references therein.The popularity of these models for several decades is the product of their natural interpretation in terms of states and their statistical efficacy.Markov Chain Monte Carlo (MCMC) methods seem to be particularly popular for this type of application.In this paper, we propose survival regression models based on general IPH distributions.Concretely, we propose a proportional intensities model for the regression, which contains a number of classical methods as special cases (such as the proportional hazards model for any parametric family, the AFT model and Coxian regression).The combination of inhomogeneous time transformation together with general sub-intensity matrices makes our model significantly more flexible both in terms of tails and hazard function shapes.Regressing on the inhomogeneity parameters also leads to matrix generalizations of models such as the one of Hsieh and Lavori (2000).A particular case of our specification was studied in Bladt (2021) for severity modeling, where right-censoring, aggregation, and regressing on the inhomogeneity parameters were not considered.
To make the models practically useful, we adapt techniques developed in Asmussen et al. (1996); Olsson (1996) and Albrecher et al. (2022b) to obtain effective estimation procedures based on the EM algorithm.While the focus in this paper is on mortality modeling, the generality of the approach could, with corresponding adaptations of the techniques, also be of interest for other application areas.We would also like to stress that with its great flexibility comes a drawback, which is that our model does not and is not expected to outperform any individual survival regression model specification in the literature.In particular, the message that we aim to convey is that inhomogeneous Markov aging models are reaching a mature state where they actually are able to resemble many of the classical models when it comes to explaining real data.Prediction and other more delicate aspects of mortality modeling (in particular in connection with time effects and cohort effects) are supported but do not outperform the state-of-theart machine learning methods.Nonetheless, our models enjoy favorable estimation and closed-form mathematical formulas for the resulting lifetime distribution.The latter is important to build realistic insurance products that can be analyzed explicitly, and thus the contribution is also targeted to researchers who want to use our model as a building block in their own research (with the advantage that they may calibrate them on the entire lifetime distribution).
The structure of the remainder of the paper is as follows.In Section 2, we revisit the necessary existing background on IPH distributions with an emphasis on the intensity function, which plays a central role in the paper, and establish a number of explicit expressions in the context of mortality modeling with matrix distributions, addressing the main PH classes under consideration in the paper in more detail.Section 3 is devoted to regression and proposes a proportional intensities model.Main properties are derived and it is shown that it is asymptotically equivalent to the proportional hazards model, but more flexible for our purposes.Section 4 then develops the estimation methodology for the regression models via the EM algorithm.The presented methods and models are illustrated in Section 5 for mortality data from Denmark, Japan, and the USA, both for the univariate case and the regression.In particular, it is shown that for these data dimensions, around 10-12 in the matrix framework are already reasonably competitive when for instance, compared to the classical Lee-Carter model.For survival data, Section 6 implements the right-censored version of the main algorithm.Finally, Section 7 concludes.

Mortality rate modeling using matrix distributions
In this section, we recall some basic properties of inhomogeneous phase-type distributions that will be relevant later on.In the sequel, for a random variable X, we write X ∼ F with F a distribution function, density, or acronym, when X follows the distribution uniquely associated with F .For two real-valued functions, g, h the terminology g(t) ∼ h(t), as t → ∞ is defined as lim t→∞ g(t)/h(t) = 1.Unless stated otherwise, equalities between random objects hold almost surely.
We will denote by µ f (s) the hazard (or mortality) rate of a lifetime random variable X with density f , so that is the corresponding survival function, and µ f (s)ds can be interpreted as the probability of dying in the time interval [s, s + ds) given that the individual has survived up until time s.The variation of µ f (s) with s is linked to the aging process.
In this paper, we make the assumption that for some β > 0, that is, we have an underlying Gompertz hazard rate.For presentation purposes, and also because many of the formulas also work for other hazard rates, we will still often work abstractly with the symbol µ f (x) in the sequel.Now let (J t ) t≥0 denote a time-inhomogeneous Markov jump process on a state-space {1, . . ., p, p + 1}, where states 1, . . ., p are transient and state p + 1 is absorbing.In other words, (J t ) t≥0 has an intensity matrix of the form where T (t) is a p×p matrix function and t t t(t) is a p-dimensional column vector function.
Here, for any time t ≥ 0, t t t(t) = −T (t) e e e, where e e e is the p-dimensional column vector of ones.Let π k = P(J 0 = k), k = 1, . . ., p, π π π = (π 1 , . . ., π p ) be a probability vector corresponding to the initial distribution of the jump process.In particular, we have P(J 0 = p + 1) = 0. Then we say that the time until absorption has an inhomogeneous phase-type distribution with representation (π π π, T (t)).For the purpose of this paper, the following particular case is of interest: T (t) = λ(t) T , where λ(t) is some known non-negative real function, named the intensity function, and T = {t kl } 1≤k,l≤p is a sub-intensity matrix.Similarly, we let −T e e e = t t t = {t k } 1≤k≤p .
Note that for λ(t) ≡ 1 one returns to the time-homogeneous case, which corresponds to the conventional phase-type distributions with notation PH(π π π, T ); a comprehensive account of PH distributions can be found in Bladt and Nielsen (2017).It is not difficult to show that if Y ∼ IPH(π π π, T , λ), then there exists a function g such that where Z ∼ PH(π π π, T ).Specifically, g is defined in terms of λ by or, equivalently, To avoid degeneracies, we assume that g −1 (y) < ∞, ∀y ≥ 0, and lim For further properties on IPH distributions and motivation for their use in statistical modeling, we refer to Albrecher and Bladt (2019).See also Albrecher et al. (2022b) for extensions to the multivariate case.
The tail behavior of IPH distributions can be derived from the corresponding tail behavior of a PH distribution.Recall that the survival function of a PH distribution is given by where −η j < 0 are the eigenvalues of the Jordan blocks J j of T , with corresponding dimensions κ j , j = 1, . . ., m, and a jl and b jl are constants depending on π π π and T .If −η is the largest real eigenvalue of T and n is the dimension of the Jordan block of η, then it is easy to see from (2.4) that where c is a positive constant.That is, all PH distributions have exponential tails with Erlang-like second-order bias terms.The tail behavior of Y then follows from (2.2).
A number of IPH distributions can be expressed as classical distributions with matrixvalued parameters.For the representation of such distributions, we make use of functional calculus.If ν is an analytic function and A is a matrix, define where Γ is a simple path enclosing the eigenvalues of A; cf.(Bladt and Nielsen, 2017, Sec.3.4) for details.In particular, the matrix exponential can be defined in this way.
The matrix distribution of particular interest for the mortality modeling in this paper is defined through the following survival function: ds e e e = π π πe T (e βx −1)/β e e e .
Here, (π π π, T ) is a p-dimensional PH representation (so that the underlying Markov process has p transient states).In view of (2.3), this is the survival function of the time until absorption of a time-inhomogeneous Markov process initiating according to π π π and with intensity matrix This matrix-Gompertz distribution H ∼ mGompertz(β, π π π, T ) has tails much lighter than exponential.Observe that from an inhomogeneous Markov point of view, the sub-intensity matrix of the underlying jump process is given by T exp(βx), which for scalar T leads back to the exponential hazard function of the Gompertz distribution, the latter being the classical model for human mortality over long age ranges (see, e.g., Macdonald et al. (2018)).
The density function h(x) corresponding to H is given by Remark 2.1.One way to justify this choice of model is to argue that H(x) allows to improve a base model F (x) in a parsimonious way.Indeed, since IPH distributions are dense in the class of distributions on the positive real line (in the sense of weak convergence, cf.Albrecher and Bladt ( 2019)), one can obtain an arbitrarily close approximation to a given lifetime distribution by adding sufficiently many phases to the state space, and in contrast to the homogeneous case, this can typically be achieved with very few states when the base model is already a good first approximation.In addition, this modeling approach also gives rise to the additional interpretation that life (aging) goes through various phases, prone to an overall time-inhomogeneous variation represented by µ f (t).
Let (J t ) t≥0 be the underlying Markov jump process with intensity matrix (2.6).Its transition probabilities are given by for k, l = 1, . . ., p being transient (non-absorbing) states.Now let V l denote the time spent in state l during a person's life prior to death and let with U = {u kl } k,l=1,...,p denoting the corresponding matrix, which is referred to as the Green matrix.Then with τ ∼ H,

and
(2.9) Again, for the one-dimensional case with T = −1, we have that For a Gompertz hazard rate function µ f (s), define a function g in terms of its inverse by (2.10) or, equivalently, Note that g −1 (s) → +∞ as s → +∞.Then for a η > 0, using both substitution and partial integration, we get where L g denotes the Laplace transform of g, and where Ei Correspondingly, (2.9) reduces to which can then be written as since −T and L g (−T ) commute.More generally, U can be expressed in an explicit form whenever L g can (the matrix function L g (−T ) may be computed in different ways in practice, cf.Bladt and Nielsen (2017)).We then have the following consequence of the above formulas: Lemma 2.1.For X ∼ H we have Proof.The expected value of the distribution H can be obtained as the weighted sum of conditional expected times in the different states, where the weights correspond to the initial probabilities.Hence E(X) = π π πUe e e = π π πL g (−T )(−T e e e) = π π πL g (−T )t t t , and we may use the other expression for U to complete the proof.
Note that for X ∼ H, residual lifetimes given survival up to time x, say, are readily obtained by the survival function Hence the residual lifetime distribution given survival until time x is again IPH with initial vector , intensity matrix T , and hazard function µ x f (u) = µ f (x + u).We now establish a relationship between the intensity function of an IPH distribution and its hazard function.Note that the intensity function λ of a IPH(π π π, T , λ) distribution is not the same as its hazard function, µ.The latter is given by This distinction is subtle, but leads to increased flexibility in the interplay of hazards between subgroups, as we will discuss in Section 3. The cumulative hazard function of an IPH(π π π, T , λ) distribution reads The next result shows that λ is asymptotically equivalent to the hazard function in the upper tail.
Theorem 2.2.Let Y ∼ IPH(π π π, T , λ).Then the hazard function µ and cumulative hazard function M of Y satisfy, respectively, Proof.We have that if Z ∼ PH(π π π, T ), then its density f Z has a representation of the same form as (2.4) (cf.(Bladt and Nielsen, 2017, Sec.4.1.1)for details), which only differs on the multiplicative constants.Thus, as y → ∞, where −η is the largest real eigenvalue of T , n is the dimension of the Jordan block of η, and c 1 , c 2 are positive constants.The first expression then follows by recalling that d dy [g −1 (y)] = λ(y).The second one can be shown in an analogous manner.
2.1.Special Coxian sub-structures.The analysis becomes particularly tractable when the underlying PH distribution (π π π, T ) has a simple form.For example, a Coxian PH distribution can be written in the form , where all λ k , k = 1, . . ., p, are distinct.The phase diagram for such a distribution is given by That is, the states are traversed sequentially, and at each state, there is a probability to exit to the absorption state p + 1 (i.e., die) directly.In this case , so with ν p = 0 we get the density .
Remark 2.2.If π π π is relaxed to be any initial probability vector (i.e., the Markov jump process can also start in other states than the first one), then one speaks of a generalized Coxian distribution.While for the modeling of the aging process as discussed in Section 1 the classical Coxian construction is particularly appealing as it has the direct interpretation of traversing through the p stages of aging, but with a positive probability to die 'prematurely' during one of these states, the generalized Coxian construction is also of interest.In particular, if it leads to a much better fit than the classical Cox construction, it may suggest a significant heterogeneity in the population under consideration.

Regression with proportional intensities
In a next step, we now introduce regression in our PH modeling framework.Regression models for life tables were cast into the scene by the introduction of the proportional hazards models, cf.Cox (1972), and these models have remained popular, both in their parametric and non-parametric forms up to today.In the logarithmic scale, the specification implies, however, parallel log-mortality curves, which are typically not observed in datasets for which a joint mortality modeling is of interest.We, therefore, look for introducing regression in a less restrictive way, while still allowing for a direct interpretation of the regression coefficients.This is achieved by letting the intensity function of our IPH distribution (rather than the hazards) vary proportionally with the covariates.As we will see later, this leads to a more flexible model, which is asymptotically equivalent to the proportional hazards construction.A physical interpretation of this construction is that each population subgroup has a natural clock running at a proportional speed relative to other subgroups.
We, therefore, propose a regression model for IPH distributions with representation IPH(π π π, T , λ), where the intensity λ( • ; θ θ θ) is a non-negative parametric function depending on the vector θ θ θ and incorporate the predictor variables X X X = (X 1 , . . ., X d ) by specifying Here β β β is a d-dimensional column vector and m is any positive-valued and measurable function.We call this model the proportional intensities (PI) model.The interpretation in terms of the underlying inhomogeneous Markov structure is that time is changed proportionally for a given subgroup when time-transforming the associated PH distribution into an IPH one.In the sequel, we will indistinctly denote by X a vector of covariates at the population level, or a matrix of covariates for finite samples.In the latter case, X i denotes the i-th row of the matrix, corresponding to the i-th individual in the sample.
Note that the density f and survival function F of a random variable with distribution are given by Remark 3.1.Since the exponential function m(x) = exp(x) in the above specification is the most common choice (for instance, the one used in the original Cox proportional hazards model, Cox (1972)), the PI model draws its name from that particular example.Other choices also go under the proportional name, see for instance Royston and Parmar (2002).In terms of coefficients, the interpretation of β β β is akin to the parameters in a regular Cox regression, except that the effects here are on the intensity of a Markov process, which is only asymptotically equivalent (but not equal) to the hazard rate.
Remark 3.2.Within the above setup, it is possible to add dependence of g on X also.That is, we may consider the model where θ θ θ(Xγ γ γ) is a vector-valued function, mapping the score Xγ γ γ to the parameter space of λ.For instance, in the one-dimensional case, is a natural choice.In the sequel, any θ θ θ specification may be of the form θ θ θ(Xγ γ γ), and the notation may be used interchangeably, when there is no risk of confusion.In this specification, the role of θ θ θ is to regulate the Gompertz time-transformation of the underlying Markov dynamics.The incorporation of covariates thus means that shortening or elongating virtual times is necessary to describe the differences observed in the data for individuals of differing characteristics.
Remark 3.3.In the proportional hazards model, one assumes that the hazard function satisfies with µ(t) referred to as the baseline hazard function or the hazard function for a standard subgroup (with X X Xβ β β = 0).Common parametric models for µ(t) in the context of mortality modeling are the exponential function (referring to the Gompertz assumption) and other more refined approximations employed for the fitting of empirical mortality curves (cf.Kostaki (1992), Macdonald et al. (2018)).
One of the advantages of the PI model is that the implied hazard functions between different subgroups can deviate from proportionality in the body of the distribution (for dim(T ) > 1), as can be observed in Section 5.This addresses one of the common practical drawbacks of the proportional hazards model.If p = 1, we have that µ(t) = cλ(t) for a constant c > 0, so that the PI specification reduces to the proportional hazards model for this case.
Another important property of the PI model is that a random variable following this specification can be expressed as a functional of a PH random variable.More specifically, consider Y ∼ IPH(π π π, T , λ( • | X X X, γ γ γ)) and let g( • | X X X, γ γ γ) be defined in terms of its inverse function Then it is not hard to see that This distributional property is a key observation for the estimation procedure proposed in Section 4.
Next, we concentrate on two important distributions which, in the presence of covariates, generalize the commonly employed models for parametric proportional hazards modeling: the exponential and the Weibull baseline hazard models.
Example 3.1 (PH regression).Consider λ ≡ 1, thus λ(t | X X X) = m(X X Xβ β β) for all t > 0. The survival function takes the form Note also that Y d = Z/m(X X Xβ β β), where Z ∼ PH(π π π, T ).In particular, this implies that The survival function takes the form Note also that , where Z ∼ PH(π π π, T ).The expression for the expected value is then given by Remark 3.4.For p = 1, we recover the conventional proportional hazards models with exponential and Weibull baseline hazards.Moreover, both cases have the interpretation that the expected values are proportional among subgroups.That is, for arbitrary matrix dimension, when the baseline hazard is PH or matrix-Weibull, the PI model is equivalent to the accelerated failure time (AFT) model (cf.Pike (1966) for the Weibull case, and more generally, the log-location scale model in Lawless (2011)).It is also not hard to see that homogeneous functions g simultaneously satisfy both models.However, for our purposes the AFT model is not of primary interest, since it regresses proportionally on the means instead of on the underlying intensity, and the physical interpretation in terms of hidden Markov models would typically be lost.

Parameter estimation
To estimate the hazard rate of a distribution, there are several approaches one may take.It is common in mortality modeling to consider the problem as a regression on the mortality or log-mortality curve, as a function of age, and with an appropriate loss function.However, for PH distributions, such an approach can only be used if very good (or near-optimal) initial parameters are supplied.Otherwise, local maxima and boundaries of the parameter space are almost unavoidable, and fitting times may be very long.
With that in mind, we present in this section two estimation approaches for the PI model.The first one is the target approach, which deals with the mortality curve directly and is a general procedure, here developed without covariates.It can also be seen as an alternative to the parametric mortality curve fitting in the spirit of Heligman and Pollard (1980), Kostaki (1992), now for the hazard rate of a matrix-Gompertz distribution.The second is a maximum-likelihood approach via the EM algorithm which provides a fast way of obtaining good initial parameters for the first method 1,2 .The main conceptual difference between the two approaches is that the first one gives equal weight to all ages, while the second gives weights proportional to their abundance in the dataset.4.1.Direct mortality modeling.We define a global loss function as the aggregated age-specific losses L, which is a function of the observed and fitted mortality curves, 1 However, also note that if the modeler is interested in tail probabilities instead of mortality estimates, it is much preferable to exclusively use the EM algorithm to directly estimate the density function.The alternative, that is, obtaining survival estimates through the integrated hazard, will incur model errors that propagate from younger into older ages.
2 The initial parameters of the EM algorithm itself are of lesser concern since the routine is much faster and more robust.In all our examples, for the EM algorithm, we randomly simulated a valid PH structure (initial vector summing to one, and sub-intensity matrix having negative diagonal and rows summing to less than zero) as initial parameters, and the number of EM iterations was 2000, which was sufficient for the final changes in likelihood to be less than 0.01% in all cases.
as follows: where µ x is the observed mortality at age x and µ h is given by (2.8).Several choices for L have been proposed, arising either from graphical considerations (for instance, Euclidean or L p norms, absolute or relative differences) or from probabilistic considerations on the counts of the life table itself (for instance, Poisson, cf.Broström (1985), or other count distributions such as Binomial or Negative Binomial).The empirical work for the present manuscript yielded that the sturdiest choice for the PI model is given by L(µ, ν) = (log(µ) − log(ν)) 2 .In that case, we may apply the following decomposition: with the correction factor It is noteworthy to remark that although this method will yield visually better estimation of the mortality (or log-mortality) curves, the resulting PI model will have a lower likelihood than the one arising from the EM algorithm.4.2.Maximum likelihood estimation.We now present estimation via maximumlikelihood (ML) for the PI model.We consider the estimation procedure for the general setting where the parameters of the inhomogeneity transform are also regressed, incurring in no additional mathematical complexity.
In many applications, a large proportion of the data is not entirely observed, or censored.In this paper, we consider the case of right-censoring, since it arises naturally in the context of its applications in survival analysis3 The main difference is that we no longer observe exact data points Y = y, but only Y ∈ (v, ∞), conditionally on the corresponding covariate information.In particular, by monotonicity of g we have that g −1 (Y ; θ θ θ) ∈ (g −1 (v; θ θ θ), ∞), which can be interpreted as a censored observation of a random variable with conventional PH distribution.Some of the formulas from Olsson (1996) are useful to adapt an expectation-maximization (EM) algorithm for IPH distributions in the presence of covariates and censored data.
Let (z 1 , δ 1 ), . . ., (z N , δ N ) be a possibly right-censored i.i.d.sample from of a PH distributed random variable Z with parameters (π π π, T ), where the δ i , i = 1, . . ., N are binary indicators taking the value 1 if case of observation and 0 in case of rightcensoring.Now, for each k, l ∈ {1, . . ., p}, let B k be the number of times that the underlying jump-process (J t ) t≥0 initiates in state k, N kl the total number of jumps from state k to l, N k the number of times that we reach the absorbing state p + 1 from state k and let V k be the total time that the underlying Markov jump process spends in state k prior to absorption.These statistics are hidden, in the sense that Z alone cannot account for them.If they were not hidden, and given a sample of absorption times z z z, the completely observed likelihood L c can be written in terms of these sufficient statistics as follows: which can be brought into the canonical form of the exponential dispersion family of distributions, which possess explicit maximum likelihood estimators.
However, since the full data is not observed, we employ the EM algorithm to obtain the ML estimators.At each iteration the E-step consists in computing the conditional expectations of the sufficient statistics B k , N kl , N k and V k given that Z = z, for the fully observed case, and given that Z > z for the right-censored case.The M-step maximizes L c (π π π, T , z z z) using the estimates of the sufficient statistics from the previous step, obtaining updated parameters (π π π, T ).
Below we write the formulas needed for the E-and M-steps for a sample of size N .We denote by e e e k the column vector with all elements equal to zero besides the k-th entry which is equal to one, that is, the k-th element of the canonical basis of R d .
We now consider a sample of size N .For any two sets, A, B define their product with the definition extending inductively to the case of more than two sets.Define the following product set Z = Z(z z z) = N i=1 A i , where A i = {z i } if δ i = 1, and A i = (z i , ∞) otherwise.Then it can be shown that the conditional expectations can be written as follows: 1) E-step, conditional expectations: 2) M-step, explicit maximum likelihood estimators: We set π π π = (π 1 , . . ., πp ), T = { tkl } k,l=1,2,...,p , t t t = ( t1 , . . ., tp ) T .
Note that the contributions from the conditional expectation of N k given that Z > z i are zero, since absorption has not yet taken place.
For convenience, a summary of the entire procedure is provided in Algorithm 1.
Remark 4.1.Algorithm 1 generalizes the EM algorithm for IPH distributions in Albrecher et al. (2022b) in the following two settings: i) presence of censored observations; and ii) incorporation of covariate information.
Remark 4.2.The E-steps require the computation of matrix exponentials which can be performed in multiple ways (Moler and Van Loan, 1978).In Asmussen et al. (1996), this is done by converting the problem into a system of ODEs, which are then solved via a Runge-Kutta method of fourth-order (a C implementation, called EMpht, is available online Olsson (1998)).We employed a new implementation in Rcpp of such a method for the illustration here and everywhere else in the paper.An important consideration is that the Runge-Kutta method requires the sample to be ordered.This is relevant since the transformed data of the above algorithm, z i = g −1 (y i | x x x i , γ γ γ) will in general not be ordered anymore, given that covariate information varies across subjects.Thus, intermediate ordering steps have to be introduced before proceeding to the numerical implementation of the EM algorithm of Olsson (1996) by the Runge-Kutta method.
The re-ordering also modifies the ties in the data, which must be accounted for.Additionally, the evaluation of the likelihood in Step 2 may be done using the by-products of Step 1, which implies that re-ordering and re-weighting will also be necessary at each iteration within the optimization inside Step 2. These additional steps pose a significant computational burden relative to the non-covariate algorithm.The uniformization method (Albrecher et al., 2022b) is an alternative, but the comparison of the two methods' performance is beyond the scope of this paper.
The following assertion is a general property of any EM algorithm.
Proposition 4.1.Employing Algorithm 1, the likelihood function increases at each iteration.Since the likelihood is bounded for fixed p, it is guaranteed to converge to a (possibly local) maximum.

Examples
All examples in this section were preliminarily fitted with the c++ wrapper functions fit (for the no-covariates case) and reg (for the PI model), available in the matrixdist package in R, cf.Bladt and Yslas (2021a,b).Subsequently, the direct mortality modeling approach with L(µ, ν) = (log(µ)−log(ν))s 2 was implemented in the same language, and optimized with the fortran wrapper function optim.
The preliminary fitting was done using as log-likelihood weights the implied density from the mortality rates (death to exposure ratio), and using the midpoints between ages as the observed ages (corresponding to the data y y y).An interval-censored approach would be slightly more accurate, but is outside the scope of the current paper, since we are using the EM procedure only to identify good initial parameters.For numerical purposes, age was divided by 100 at the time of fitting, and all given parameters are on that scale.All plots are scaled back to the original size.All data was obtained from https://www.mortality.org/.In terms of running times, all analyses were completed individually within two minutes to 1.5 hours 4 , with the more complex examples being lengthier to optimize.5.1.Danish female mortality without covariates.We first consider the modeling of Danish females since the year 2000 using a generalized Coxian structure as well as a Coxian structure.The distinction is made since the former seems to provide better fitted models, while the latter has a possibly more intuitive interpretation in terms of state transitions (as all individuals start in the first state).
5 Increased performance in this context means a substantial decrease in loss function value upon convergence for each model of varying dimensions.Information criteria such as AIC and BIC are not helpful here, since the effective degrees of freedom of PH models are generally unknown (see also Albrecher et al. (2022a)).Thus we chose a relative decrease of less than 0.1% as the stopping rule.
6 Here, and in subsequently reported parameters, zeros are only exactly zero when the structure implies it (such as a Coxian structure).Else, they represent the rounding of a small number.t t t = (0, 0.42, 1.77, 0, 0.01, 0, 0, 0, 0, 10.16, 0.13, 492.2, 0, 0, 0.01) , β = 10.68.Note that for the classical Gompertz fit on the left-hand side one obtains β = 10.55, but one is not able to take into account the humps at the left end at all.On the other hand, the additional flexibility of the generalized Coxian can accomplish that, maintaining a similar value β = 10.68 for the transformation.
While the choice of the generalized Coxian is motivated by the tractability of the corresponding estimation procedure (it is much more tractable than fitting a general PH distribution) within the purpose of obtaining a good statistical fit, the restriction to a classical Coxian distribution (starting in state 1 with probability 1) opens up the interpretability in terms of stages of aging as discussed in Section 1.The right panel of Figure 5.1 depicts the resulting Coxian fit to the same dataset, together with the cumulative aggregated means of the sojourn times for each state of the underlying state space (here 10 states).In the Coxian construction, these are traversed sequentially, and thus the last vertical line is equal to the life expectancy.The parameters and line locations are given by π π π = e e e 1 , Diag(T ) = − (27.72, 5.95, 1.84, 0.29, 0.02, 2.32, 2.88 × 10 11 , 0.04, 0.03, 0.04), t t t = (0.03, 0, 0.01, 0, 0.01, 0, 189.94, 0.01, 0.01, 0.04) , lines = (2.92, 10.81, 22.20, 41.02, 73.30, 73.45, 73.45, 78.02, 80.93, 81.73), β = 7.85.
One can see that the wavy behavior of the log-mortality curve is also captured quite well with a pure Coxian PH distribution.Recall that the Coxian construction allows for the interpretation that life traverses through different states until death (absorption), staying at each one for a random duration, and with many individuals going through all the states, whereas some individuals die (go to absorption) from an earlier state directly.From the above figures, one readily computes the probabilities of dying while being in state i (i = 1, . . ., 10): (0.001, 0, 0.003, 0.006, 0.326, 0.001, 0, 0.217, 0.428, 1).
The probability of reaching state p = 10 before death (i.e., going through all states) here is 0.299.Observe that some states have a very short duration, which can serve as an artifact to generate possible deaths at these concrete points in time.Note also that the Coxian matrix-Gompertz leads to a nice fit, but 'needs' a different value β = 7.85 for the transformation than the classical Gompertz distribution in order to fit the humps at the left end.The generalized Coxian matrix-Gompertz then is flexible enough again to return to a similar β-value as the classical Gompertz, which is determined by the right end.From the concrete resulting parameters, one sees that there is a high probability to start in state 5, 7 or 8 only, so that the resulting model can be interpreted in terms of more flexible mixtures of Coxian distributions with lower state space.
The positive value of β β β indicates an increased underlying intensity (and thus shorter sojourn times) for the USA population, which is multiplicative with factor exp(0.91) ≈ 2.48.On the other hand, the parameters of θ θ θ indicate that there is a baseline Gompertz factor of size exp(2.28)≈ 9.78, and the USA population has a multiplicative decrease of this factor by exp(−0.07),or roughly 93%, incurring in an overall Gompertz factor of size 9.16.The latter means in terms of the PI model that there is an "environment" lifetime generating Gompertz process which is unfavorable for the Japanese female population, but their underlying slower traversing through the multi-state process (of their lives) makes up for it, to the degree that can be observed in Figure 5.2.
Theoretically, by Theorem 2.2, both curves are asymptotically parallel, and although we see a substantial difference at most ages, the gap closes and is quite small at very advanced ages.This observation can be made from the data itself, implying that old-age mortality in both countries seems to be quite similar.However, the takeaway from the hidden Markov point of view is that the flexibility gained from considering intensities instead of hazard rates is advantageous for modeling more delicate inter-curve features in regression analyses.The proportional hazards model is not well suited for this task (the resulting logmortality curves for different time periods would be parallel!),and so we use the Lee-Carter (LC, cf. Lee and Carter (1992)) model as a benchmark.The LC model specifies the log-mortality by where x,t are Gaussian random variables.The a x term is estimated as the average log-mortality over time at each age x, and then b x and k t are computed from a singular value decomposition of log(µ x,t ) − a x .The procedure is non-parametric and thus can capture very minute changes in individual mortality curves across different ages.
On the other hand, by Theorem 2.2 the PI model asymptotically satisfies, as x grows, where a = log(c) is a constant which depends on the parameters π π π and T , b x,t = log(λ(x; exp(θ 0 + θ 1 t * ))) = exp(θ 0 + θ 1 t * ) log(x) (in our case) is the logarithm of the time-dependent Gompertz intensity, and k t = β 1 t * (also in our case) is the logarithm of the proportionality factor between the underlying intensities.For smaller x (before convergence to the limit) the log-mortality does not have such a simple decomposition, and is instead given by log(µ x,t ) = b x,t + log π π π exp   1960, 1980, and 2000, where in both cases we observe a downward trend in mortality every 20 years.Both models have room for improvement, but overall show accurate descriptions.The PI model seems to not capture a surge in mortality for old ages in 1960, while the LC model does not seem to capture the decreases of mortality from 1980 to 2000 for young ages.The non-parametric nature of the LC model is sturdier to outliers as is the case for an old-age point of 1960, while the PI seeks to compromise large parts of the curve for this one observation.However, observe that the LC model has some wiggling occurring for ages 0 to 40, which may indicate a slight degree of overfitting, a known issue for smaller populations and one which the PI model seems to handle better in this case.
Figure 5.4 depicts the resulting log-mortality curves for 1960, 1980 and 2000 together with the empirical observations.Indeed, here the wiggling of the LC curves is considerably improved.One sees that in this case from a purely statistical point of view the LC model may be preferable over the PI model, but one should keep in mind that the PI model offers a more direct interpretation of the resulting model, with much fewer parameters, while still providing a visually acceptable statistical fit.
Remark 5.1.Forecasting with such a model is straightforward if we are only interested in non-autoregressive effects 7 , and the incorporation of cohort effects can also be done as with time, and with possible non-linear terms.However, the predictive performance falls short of models which are specifically designed for this task, and incorporating these temporal dependencies is an interesting subject of future research.

General survival modeling
This section illustrates the use of inhomogeneous Markov models for survival analysis, showcasing the implementation of Algorithm 1 in the right-censored case.The first is a simulated example, while the second is a Veterans' lung cancer dataset, both of which are subject to right-censoring and include covariate information, and thus the regression methods from this paper can be applied.
6.1.Simulated data.Consider the following simple two-group setup.We study two groups of equal size, 500 each.For the first group, Given a desired time t (and thus t * ) and age x, Equation (5.1) provides the corresponding mortality estimate.Moreover, all the implied distributional properties can also be extracted using the fitted parameters, using Equation (2.7).while for the second Subsequently, we randomly censor the Z i by 1000 exponential variables E i of rate 1/10 by leading to roughy 10% right-censored observations.Notice that in the above setting, the inhomogeneity function also changes with the covariates, such that we need to implement the fitting procedure associated with model (3.2).Additionally, for comparison, we fitted two misspecified models: a standard Weibull proportional hazards model and the scalar (dim(T ) = 1) version of model (3.2).The resulting empirical versus fitted survival curves for each group, as well as the hazard ratio between the two groups, are given in Figure 6.1.We observe that only the correctly specified model recovers the oscillating shape of the theoretical hazard ratio, and within each group, the survival curve is better estimated in the tails of the distribution.Additionally, the goodness of fit can be evaluated by considering a generalization of the Cox-Snell residuals given by ds T e e e , i = 1, . . ., 100 , which under the null hypothesis of correct specification of the model, should constitute a right-censored unit-exponential dataset.Figure 6.2 depicts the Kaplan-Meier and Nelson-Aalen non-parametric estimators of the survival curve of the residuals r i to assess standard exponential behavior visually.We observe that only the correctly specified model provides an adequate fit in the sense of staying within the confidence bounds implied by Greenwood's formula for the first approach, and by aligning itself with the identity for the second one.
6.2.Veterans data.We consider a well-known dataset from the Veterans' Administration Lung Cancer study (publicly available in the R package survival ).
The data consist of 137 observations and 12 variables coming from a randomized trial where two groups are given different treatment regimens for lung cancer.The variables of interest for our analysis are survival time, censoring status (binary), treatment (binary), prior therapy (binary), and Karnofsky performance score (from 0 to 100).Previous studies (cf.for instance, the vignette https://cran.rstudio.com/web/packages/survival/vignettes/timedep.pdf,pp.15-22) have shown that only the Karnofsky score violates the proportional hazards assumption.Common ways to address this issue are stratifying the dataset into time groups or considering continuous time-dependent regression coefficients.However, the coefficient for treatment becomes significant when improving the model's fit.Consequently, we presently focus only on the goodness of fit of the model introduced in the text.The latter is measured by its log-likelihood and residuals.We observe that model b) improves the fit to where the residuals are within bounds, and the log-likelihood, AIC and BIC reflect this.
The AIC and BIC computed here for the matrix-distributions works well for comparison purposes, given the low dimensions of the distributions.However, in general, one should proceed with caution (especially in high dimensions) due to the well-known identifiability issue for PH distributions, which carries over to IPH survival models.Namely, different dimensions and parameter configurations may lead to very similar density shapes.

Conclusion
In this paper we investigated the potential of matrix-Gompertz distributions as special cases of inhomogeneous phase-type distributions for the modeling of mortality across the entire lifespan as a parsimonious alternative to popular modeling approaches.The goal is not to replace the latter but rather to see how close in performance one can get with this type of models, as they also allow a causal interpretation in terms of biological aging, with lifetime traversing through different stages and a possible early death in each of them.In that latter context, the introduction of the time-transform inherent in the construction of inhomogeneous phase-type distributions allowed to reduce the necessary dimension for adequate modeling reported in previous studies from 200 -250 to 10 -12.In addition, we studied a regression approach that can be used for multi-population mortality modeling based on proportional intensities.As the first contribution on regressing IPH distributions, this part may also be of independent interest.We developed an estimation procedure and illustrated the models for female mortality data from Denmark, Japan and the USA.
Since matrix-Gompertz distributions are dense in the class of all lifetime distributions, from a general perspective we have illustrated that misspecified survival models can sometimes be just one matrix away from providing a good fit.The practical implementation of the methods in this paper relies on the EM algorithm, which for large matrices and datasets can become significantly slow.It is our experience that the use of an inhomogeneous intensity function in the underlying stochastic jump process can alleviate the need for a large number of phases, thus making estimation feasible.
Although we have used the midpoint approach, which is standard and works well for 1 × 1 resolution life tables, an interval-censored estimation procedure is statistically more accurate and has potential to be applied to data at lower resolutions, which will be the subject of future research.In the present paper we considered regression according to proportional intensities, as a first and workable approach for IPH regression.We also showcased how the right-censored version of our algorithm can be used in survival settings.In general, for multi-population mortality modeling it is of course restrictive to assume that the operational time for each population is scaled with a constant across the entire lifetime.Another promising direction is hence the regression of individual rates of the intensity matrix, as this would allow for interesting conclusions in terms of the aging interpretation underlying the IPH models.However, if applied to all possible parameters, the model will be highly overparametrized, and hence some constraints will have to be imposed, challenging the corresponding estimation methods.It will be an interesting subject of future research to identify workable compromises in this direction, which should increase the versatility of the resulting models.
e T k exp (T z i ) e e e π π π exp (T z i ) e e e , k T exp(T (z i − u))t t tπ π π exp(T u)e e e k du π π π exp(T z i )t t t +(1 − δ i ) z i 0 e e e k T exp(T (z i − u))e e eπ π π exp(T u)e e e k du π π π exp(T z i )e e e , E(N kl | Z ∈ Z) = N i=1 t kl δ i z i 0 e e e l T exp(T (z i − u))t t t π π π exp(T u)e e e k du π π π exp(T z i )t t t +(1 − δ i ) z i 0 e e e l T exp(T (z i − u))e e e π π π exp(T u)e e e k du π π π exp(T z i )e e e , e T k exp (T z i ) e e e π π π exp (T z i ) e e e ,E(V k | Z ∈ Z) k T exp(T (z i − u))t t tπ π π exp(T u)e e e k du π π π exp(T z i )t t t +(1 − δ i ) z i 0 e ee k T exp(T (z i − u))e e eπ π π exp(T u)e e e k du π π π exp(T z i )e e e , E(N kl | Z ∈ Z) e l T exp(T (z i − u))t t t π π π exp(T u)e e e k du π π π exp(T z i )t t t +(1 − δ i ) z i 0 e e e l T exp(T (z i − u))e e e π π π exp(T u)e e e k du π π π exp(T z i )e e e , E(N k | Z ∈ Z) = N i=1 δ i t k π π π exp(T z i )e e e k π π π exp(T z i )t t t .

5. 3 .
Danish female mortality with time as a covariate.As a second example of the PI model, let us consider the Danish female population during the interval 1950−2000, and examine the effect of time when considering it as a numerical covariate.For numerical convenience, we actually use the transformed covariate t * = (year − 1950)/1000.The same approach as in the two previous examples is used to select the size p of the underlying state space.
s,t + k t )ds T t t t (5.1) − log π π π exp x 0 exp(b s,t + k t )ds T e e e .

Figure 5 .
Figure5.3 shows the resulting fitted PI and LC models for the selected years1960, 1980,  and 2000, where in both cases we observe a downward trend in mortality every 20 years.Both models have room for improvement, but overall show accurate descriptions.The PI model seems to not capture a surge in mortality for old ages in 1960, while the LC model does not seem to capture the decreases of mortality from 1980 to 2000 for young ages.The non-parametric nature of the LC model is sturdier to outliers as is the case for an old-age point of 1960, while the PI seeks to compromise large parts of the curve for this one observation.However, observe that the LC model has some wiggling occurring for ages 0 to 40, which may indicate a slight degree of overfitting, a known issue for smaller populations and one which the PI model seems to handle better in this case.

Figure 5
Figure 5.3.PI model using time as a covariate, plotted for the years 1960, 1980 and 2000, and for the Danish females dataset.

Figure 5
Figure 5.4.PI model using time as a covariate, plotted for the years 1960, 1980 and 2000, and for the USA females dataset.

Figure 6 . 2 .
Figure 6.2.Simulated example goodness of fit based on the three models from Figure 6.1.Top: survival curves based on the Kaplan-Meier estimator of the residuals r i .Bottom: plot between r i and − log( F (r i )) based on the Nelson-Aalen estimator.

Figure 6 . 3 .
Figure 6.3.Veterans dataset.Goodness of fit diagnostics based on the Kaplan-Meier estimator of the residuals (top row), and on the residuals versus Nelson-Aalen implied cumulative hazard plot (bottom row).The corresponding models are: a) Weibull proportional hazards model; b) Matrix-Weibull PI model.
The results are summarized in the table below and in Figure6.3.