Empirical Transition Matrix of Multi-State Models: The etm Package

Multi-State models provide a relevant framework for modelling complex event histo-ries. Quantities of interest are the transition probabilities that can be estimated by the empirical transition matrix, that is also referred to as the Aalen-Johansen estimator. In this paper, we present the R package etm that computes and displays the transition probabilities. etm also features a Greenwood-type estimator of the covariance matrix. The use of the package is illustrated through a prominent example in bone marrow transplant for leukaemia patients.


Introduction
When dealing with complex event history data in which individuals may experience more than one single event type, multi-state models provide a relevant modelling framework. A multi-state model is a stochastic process that at any time occupies one of a set of discrete states, which can be in biomedical applications health conditions (e.g., healthy, diseased, dead) or disease stages (e.g., cancer grades or viral loads in AIDS studies). Data typically consist of times of occurrence of transitions between the states and the types of transitions that occur. Observation of the multi-state process is usually subject to right-censoring and/or left-truncation. A helpful way to approach multi-state models is through drawings, with boxes representing the states and arrows the possible transitions. For instance, the usual survival situation may be depicted in the multi-state framework as in Figure 1. Here patients enter the study in state 0 (e.g., alive without any disease) and can reach state 1 (e.g., death). State 1 is said to be absorbing, as no further transitions are modelled. States that are not absorbing  are said to be transient.
A first extension of the univariate survival data is the competing risks situation illustrated in Figure 2. This model has a transient state 0 and k absorbing states that could represent several possible causes of death. In general, the competing event states will be different types of some first event (Putter et al. 2007). Note that the present process description of competing risks does not impose an 'independent risks' assumption (Moon et al. 2002).
A well-known and more complex multi-state model is the illness-death model with recovery ( Figure 3). This model was used primarily for studying transitions between being healthy (state 0), diseased (state 1) and dead (state 2). In this model, the disease is curable as transitions between the states 0 and 1 are possible in both directions. More generally, such a model can be used to study the effect of any discrete, reversible time-dependent covariate on the transition toward the ultimate state.
Usually, a multi-state process is assumed to be time-inhomogeneous Markov. I.e., the future development of the process only depends on the current state and the time elapsed since time origin. The stochastic behaviour of such a model is completely described through the transition hazards. Of interest are also the transition probabilities. For example, the probability of a transition from state "alive" to "death" in the survival data situation displayed in Figure 1 may be estimated by one minus the Kaplan-Meier estimator. The Kaplan-Meier estimator has been generalised to arbitrary Markov processes with a finite number of states by Aalen and Johansen (1978), see also Aalen (1978), and Fleming (1978a,b) for related work. The estimator of the transition probability matrix is either denoted as the empirical transition matrix or the Aalen-Johansen estimator.
In R (R Development Core Team 2010), the survival package (Therneau and Lumley 2010) permits to compute the Kaplan-Meier estimator, while package cmprsk (Gray 2010) estimates transition probabilities in competing risks (also denoted cumulative incidence functions). The survival package allows for left-truncated data while cmprsk does not. Outputs of these functions can be used to compute the empirical transition matrix in some multi-state models like the illness-death model without recovery for which the elements of the matrix take an explicit form. However this is not generally the case. The changeLOS package (Wangler et al. 2006) permits to compute the Aalen-Johansen estimator for arbitrary multi-state models, but lacks variance estimation and cannot handle left-truncated data. Transition hazards in multi-state models, potentially subject to left-truncation and right-censoring, can be estimated in R using the mvna package (Allignol et al. 2008). The empirical transition matrix could be computed by hand as a matrix-valued function of the value returned by mvna, but estimation of the covariance matrix would still be lacking. These drawbacks are settled in the R package etm available from the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org/package=etm. etm computes the empirical transition matrix for any finite-state multi-state models, along with the covariance estimator described in Andersen et al. (1993, Equation 4.4.17). The package handles both left-truncated and right-censored data.
This paper proceeds as follow: Section 2 provides statistical details on estimation of the empirical transition matrix and the covariance matrix. Section 3 describes the etm package and illustrates its use with an example on bone marrow transplant for patients suffering from leukaemia. A discussion is in Section 4.

2.
The multi-state model framework 2.1. Data and notation Consider a stochastic process (X t ) t∈[0,+∞) with a finite state space S = {0, . . . , K} and right continuous sample paths. We further suppose that it is time-inhomogeneous Markov. As usual with survival data, individuals are generally followed over a certain period of time, providing right-censored and/or left-truncated observations. We further assume that truncation and censoring is independent (Andersen et al. 1993, Chapter III). Simple examples are random lefttruncation and random right-censoring. Also note that, e.g., the assumption of independent censoring allows the censoring process to depend on the currently occupied state (Aalen and Johansen 1978). The transition hazard from state i ∈ S to state j ∈ S, i = j is defined as where we write dt both for the length of the infinitesimal interval [t, t + dt) and the interval itself. The cumulative transition hazards are then defined as be the probability that an individual who is in state i at time s will have moved to state j at a later time t, and we write P(s, t) for the (K +1)×(K +1) matrix of those transition probabilities. P(s, t) can be recovered from the transition hazards through product integration (Aalen and Johansen 1978;Gill and Johansen 1990): where A(t) is the matrix of cumulative hazards.

The empirical transition matrix
Let N ij (t) be the number of observed direct transitions from state i to state j up to time t and Y i (t) be the number of individuals under observation in state i just before time t.
The non-diagonal entries of the matrix of cumulative transition hazards A(t) may be estimated by the Nelson-Aalen estimator (Andersen et al. 1993;Allignol et al. 2008) Then we plug in the estimator of the cumulative transition hazards matrix, leading toP (s, t) = (s,t] (I + dÂ(u)).
AsÂ(t) is a matrix of step-functions with a finite number of jumps on (s, t], the product integral can be written as a finite matrix product where the product is taken over all observed transition times in (s, t]. Note that the nondiagonal entries (i, j), i = j, of I + ∆Â t k are the number of observed direct i → j transitions, divided by the number of individuals under observation in state i just prior time t k . The diagonal entries of I + ∆Â t k are such that each row equals 1.

Covariance estimation
The covariance of the empirical transition matrix may be estimated by (Andersen et al. 1993, Equation 4.4.17), where denotes vector transpose and ⊗ the Kronecker product. This formulation enables integrated cumulative hazards to not be necessarily continuous, leading to an estimator of the covariance matrix of the Greenwood type. That is, (1) reduces to the Greenwood estimator for survival data (Figure 1). Moreover, (1) accounts for tied observations. Formula (1) may be used for computational purpose. However, computation is facilitated using the recursion formula (Andersen et al. 1993, Equation 4.4.19) cov(P(s, t)) ={(I + ∆Â(t)) ⊗ I} cov(P(s, t−)){(I + ∆Â(t)) ⊗ I} where cov(∆Â(t)) is a (K + 1) 2 × (K + 1) 2 matrix defined element-wise by where δ ln is a Kronecker delta, and N k. (t) is the number of transitions from state k up to time t. This matrix can be partitioned in blocks of (K + 1) × (K + 1) matrices, where only the diagonal elements are non-zero.
The covariance matrix cov(P(s, t)) is also a (K + 1) 2 × (K + 1) 2 matrix, whose (h, r)th block, h, r ∈ S, when cov(P(s, t)) is written as a partitioned matrix, is the (K + 1) × (K + 1) matrix of covariances between the elements of the hth and the rth column ofP(s, t). Using this block notation is useful for locating a specific covariance in the entire matrix. For example, cov(P kl (t),P mn (t)) is situated in the (l, n)th block, at the kth line and the mth column of this block. The etm package provides the trcov method to easily extract a specific covariance.

Package description
The etm package contains the main function etm that computes the empirical transition matrix, six methods (print, summary, plot, xyplot, trprob, trcov), the function etmprep that helps to transform a data set in the wide format (i.e., with one row per individual) into the long format, i.e., with one row per individual and per transition (Putter et al. 2007), suitable for using etm, and two data sets. The etm function takes seven arguments: data: A data frame in the long format, i.e., with one row per transition. The data frame must contain the following columns: id: The individual identification number.
from: State from where a transition occurs.
to: State to which a transition occurs.
time or entry and exit: Time when a transition occurs. If the data are lefttruncated, entry and exit times can be used, where entry is the entry time into a state and exit is the exit time from this state.
state.names: A character vector giving the state names.
tra: A quadratic matrix of logical specifying the possible transitions.
cens.name: A character giving how censored observations are identified. In case of complete data, cens.name may be set to NULL.
s and t: Times for computingP(s, t). Only s is mandatory, and t will be the last data time if not specified. s can take any value between 0 and t.
covariance: A logical specifying whether to compute the covariance matrix.
The etm function returns an object of class etm with the following components: est: The empirical transition matrix computed for each time in (s, t]. This is a 3dimensional array whose first dimension corresponds to the states from which transitions occur, the second dimension corresponds to the states to which transition occurs, and the third dimension gives the event times.
cov: The covariance matrix. Each cell of the matrix gives the estimated covariance between the transition probabilities indicated by the rownames and the colnames, respectively.
time: Times at which the empirical transition matrix is computed s and t: Same as in the function call.
trans: A data frame with two columns from and to giving the possible transitions.
state.names : Same as in the function call cens.name: ditto.
n.risk: A matrix giving the number of individuals at risk of experiencing a transition.
n.event: An array giving at each times in (s, t] the number of transitions. This is also a 3-dimensional array that is read in the same manner as est.
Six methods exist for etm objects: print: Prints details about the multi-state model, e.g., the number of absorbing and transient states, the possible transitions, and the estimates of P(s, t) and cov(P(s, t)).

summary:
The summary method returns a list of data frames, one for each transition. Each data frame contains the transition probability estimates, their variance, the number of individuals at risk, the number of events, and the lower and upper bounds of the pointwise confidence intervals.
xyplot: High level trellis function to display the transition probabilities over the course of time. By default, all transition probabilities are drawn, one per panel.
plot: A function that draws all the transition probabilities in a default scatterplot.
trprob and trcov: Functions used to extract a specific transition probability and covariance, respectively.
The first data set included in the etm package is a subset of the SIR-3 data set ). SIR-3 was a prospective cohort study on the effect of nosocomial infections on intensive care patients' survival. The subset present in the package focuses on the effect of ventilation on the length of stay in the intensive care unit. A description and a hazard-based analysis of this data subset is found in Allignol et al. (2008). The second data set contains data on pregnancy duration of women exposed to coumarin derivatives (Meister and Schaefer 2008;Allignol et al. 2010).

Illustration: The DLI data set
A random sample of the original DLI data set with added random noise will be used for illustration. It contains information on 304 patients who received allogeneic stem cell transplantation for chronic myeloid leukaemia at the Hammersmith Hospital in London between February 1981 and February 2002. Patients' health status can be described by the nine-state model depicted in Figure 4.
All the patients start the study being alive and in complete remission after the transplant (state 0). Patients can die while in complete remission due to treatment complications (state 1) or relapse (state 2). The patients who have relapsed are offered a salvage treatment of donor lymphocyte infusion (DLI), which is a very effective treatment for restoring complete remis- sion, though patients may die while waiting for the DLI (state 3). Once they receive the DLI (state 4), patients achieve second complete remission (state 6) or die (state 5). Finally, patients in their second complete remission may then die in remission (state 7) or relapse (state 8).
At the end of the follow-up, 81 patients were alive in remission (state 0). 100 patients died in remission and 123 relapsed. Among them, 18 were censored in state 2 at the end of the study, 40 died while waiting for DLI (state 3), and 65 were offered the salvage therapy. 11 patients were still in state 4 at the end of the follow-up, 10 died before remission (state 5), and 44 achieved complete remission, of whom 38 were still in complete remission at the end of the study, 2 died and 4 relapsed. The patients' course over time may now be studied through transition probabilities. This information is sometimes further condensed into summary quantities, that are functionals of the transition probabilities. A prominent example is the current leukaemia free survival function (Klein et al. 2000;Liu et al. 2008).
The current leukaemia free survival (CLFS) is defined as the probability that a patient is alive and leukaemia free, either in his first or second post-transplant remission. CLFS is the probability of being in state 0 or 6 at a certain time point t and may be estimated by where P 06 (0, t) is the probability that a patient will be alive and in remission after a DLI at time t. The variance of CLFS(t) may be estimated by var( CLFS(t)) = var(P 00 (0, t)) + var(P 06 (0, t)) + 2 cov(P 00 (0, t),P 06 (0, t)).
In the following, we will demonstrate how the CLFS can be easily computed using the etm package, assuming that the model in Figure 4 is time-inhomogeneous Markov.
Below is an excerpt of the data frame with one row per transition,   Rows of tra designate the states from which a transition may occur (e.g., the first row corresponds to state 0, the second row to state 1, and so on), whereas columns represent states to which a transition may occur. For instance, the only possible transitions out of state 0 are towards states 1 or 2. As a consequence, only the second and the third entry of the first row of tra are TRUE. Next, we compute the empirical transition matrix with arguments data frame, state names, tra, the censoring indicator and s. Without any specification, t will be the last data time: R> library("etm") R> dli.etm <-etm(dli.data, as.character(0:8), tra, "cens", s = 0) We can have a first look at the transition probabilities using the xyplot method that by default plots, along with pointwise confidence intervals, the transition probabilities for all possible direct transitions.
Another useful display that helps to understand how the patients' status develops over time is to plotP 0j (0, t), j = 1, . . . , 8, i.e., the probabilities of being in each state at a given time point (see Figure 7).

Discussion
The etm package enables to easily estimate and display the transition probabilities from any time-inhomogeneous Markov multi-state model without covariates. This may be further use-ful for summary quantities that are functionals of these transition probabilities, as illustrated in the DLI example. As stated above, the covariance estimator that is implemented in etm is of the Greenwood type. We just note that estimating the covariance matrix can be computationally expensive. We hope that this package will encourage applied researchers to use multi-state modelling more frequently, as it provides a relevant way to study complex event histories.
We also wish to mention that after etm was released on CRAN, the mstate package (de Wreede et al. 2011) has also been made available on CRAN. Among other things, the package also allows to compute the empirical transition matrix with covariances, both with and without covariates.
Finally, we note that the empirical transition matrix remains a consistent estimator of the state occupation probabilities P ii (s, t), i ∈ S, in a non-Markov model if censoring is entirely random (Datta and Satten 2001). However, variance estimation should probably be based on resampling in such a situation (Glidden 2002).