gems : An R Package for Simulating from Disease Progression Models

Mathematical models of disease progression predict disease outcomes and are useful epidemiological tools for planners and evaluators of health interventions. The R package gems is a tool that simulates disease progression in patients and predicts the eﬀect of diﬀerent interventions on patient outcome. Disease progression is represented by a series of events (e.g., diagnosis, treatment and death), displayed in a directed acyclic graph. The vertices correspond to disease states and the directed edges represent events. The package gems allows simulations based on a generalized multistate model that can be described by a directed acyclic graph with continuous transition-speciﬁc hazard functions. The user can specify an arbitrary hazard function and its parameters. The model includes parameter uncertainty, does not need to be a Markov model, and may take the history of previous events into account. Applications are not limited to the medical ﬁeld and extend to other areas where multistate simulation is of interest. We provide a technical explanation of the multistate models used by gems , explain the functions of gems and their arguments, and show a sample application. This manuscript was published in the Journal of Statistical Software Blaser et al. (2015).


Introduction
We present a simulation algorithm and the R package gems (Salazar Vizcaya et al. 2013) for simulating from a multistate model with arbitrary transition-specific hazard functions.
In epidemiology, mathematical models of disease progression are useful for predicting disease outcomes and for planning and evaluating interventions (Garnett et al. 2011).Disease progression is often characterized by a series of events, such as diagnosis, treatment and death.From this characterization, disease progression can be displayed in a directed acyclic graph (DAG) (Pearl 2009), where disease states are denoted by vertices and the directed edges connecting them correspond to the events.
Traditional compartmental models of infectious diseases assume that transition times between the different stages of a disease are exponentially distributed (Anderson and May 1992).The use of exponential transition times has the advantage that models can be formulated deterministically with ordinary differential equations.Exponential times can also be simulated using the Gillespie algorithm (Gillespie 1977).However, the distribution of transition times between states is often not exponential (Lloyd 2001).Although it is possible to divide states into substates, so that an exponential transition-specific hazard fits the data for those substates, this approach is inflexible.Typical model structures using non-exponential transition times are agent-based stochastic simulation models (Estill et al. 2012;Phillips et al. 2011).
For instance one study used history-dependent Weibull distributed transition times to investigate the effect on HIV transmission of bringing patients lost to follow-up back into care (Estill et al. 2014).This study found that 116 tracing efforts were needed to prevent one new infection.Agent-based models usually apply to one specific disease and include a limited number of interventions.We are not aware of any agent-based model structure that can be applied simultaneously to different diseases and interventions.We therefore propose a more flexible simulation algorithm that can simulate from any DAG.
We developed a multistate model that allows disease progression to be monitored in a cohort of individual patients, and takes into account the history of previous events.The R package gems allows simulation from a directed acyclic multistate model with general transitionspecific hazard functions.The package simplifies definition of the multistate model, its relevant transition-specific functions, its parameters, and their uncertainty.It also calculates transition probabilities and cumulative incidences, and thus facilitates analysis of the simulated cohorts.The R package gems is used for simulation and not parameterization of multistate models.To parameterize the transition-specific hazard functions, the R packages survival (Therneau 2014), mstate (de Wreede et al. 2011) and muhaz (Hess and Gentleman 2010) can be used.
In Section 2 we present a mathematical description of the multistate model.We present the simulation from this model and demonstrate the inclusion of parameter uncertainty.In Section 3, we describe the use of gems in detail, providing explanations for and examples of all the important package functions.In Section 4 we present a case study in cardiology.Finally, in Section 5, we discuss the strengths and limitations of the package.

Technical description of the simulation model
We describe a directed acyclic multistate model and the algorithm used in gems to simulate from it.For a general introduction to multistate models, see Putter et al. (2007).

General setup of the multistate model
A multistate model consists of a set of states and the transitions between them.The states can be divided into three groups: initial states, intermediate states and absorbing states.
gems only considers multistate models without loops, that is models, which can be written as a directed acyclic graph (DAG) (Pearl 2009).A DAG consists of states and the directed edges that connect them, so that no sequence of directed edges can connect back to a previous state.
Consider a directed acyclic multistate model with n states E 1 , . . ., E n , where a transition from state E i to state E j is only possible if i < j.Let (X t ) t≥0 be the stochastic process that describes the progression through the different states.It is an E = {E 1 , . . ., E n }-valued jump process with jump times given by for states E i that are visited, and S i = ∞ otherwise.Transition times to state E j from the previous state are defined by T j = S j − S max {k | k<j,S k <∞} , where S 0 = 0 by convention.
The entire process is determined by transition times T j to state E j , described by transitionspecific hazard functions h ij as where F ij is the cumulative distribution function of the transition time from state E i to state E j .See Figure 1 for a graphical representation of these hazard functions and transition times.Unless all hazards are constant, X does not have a Markovian structure.

Hazards and transition probabilities
Consider the relatively simple multistate model described in the DAG in Figure 1.
A B can then be calculated from the transition-specific hazard functions as follows.
The probabilities from state E 4 are p 44 (s, t) = 1 and p 4j (s, t) = 0 for all j = 4.
From states E 2 , E 3 , the only two possibilities are to remain in the current state or move to state E 4 , so the transition probabilities are The transition probabilities from the initial state are already difficult to solve analytically.
Assuming S 1 = 0, the transition probabilities can be calculated from the integrals Intuitively, these formulas express that the process X remains in state E 1 from time s to time u.Then it moves to state E 2 or E 3 respectively, where it remains until time t.The transition probabilities p 12 and p 13 can then be calculated as the integral over u.These integrals become increasingly difficult to solve when there are more states, and they cannot usually be solved analytically.The gems package uses Monte Carlo methods to simulate the transition times associated to those probabilities.

Simulating from hazard functions
In this Section we describe the methods used in the package gems to simulate from a transitionspecific hazard function for one agent.For each state E i , all transition-specific hazard functions and their parameters must be specified.For instance, an exponentially distributed transition with mean µ = 2 can be specified as a constant function h(t) = 1 µ with parameter µ = 2, or equivalently if specifying h(t) = r with parameter r = 1 2 .For the description in this Section, the choice of parameterization is arbitrary, but it will be relevant in Section 2.3 where we consider parameter uncertainty.
It is possible to simulate the times T ik from the hazard functions, as explained below.By taking the minimum over all k, we get the transition time T j , and the corresponding state E j .To simulate X, we therefore start by simulating from the initial state T 1k and calculate the first transition time by taking the minimum.Then we continue the simulation from the corresponding state E j .This procedure is iterated until an absorbing state is reached, at which point the simulation ends.
In order to simulate from a hazard function, we approximate the specified hazard function h(t) by a piecewise constant function h pc (t).Then we use the rpexp function of the msm package (Jackson 2011) to simulate from the piecewise constant approximation of the hazard function h pc (t).The rpexp function generates random variables from an exponential distribution with piecewise constant rates.
To calculate the transition probabilities from the initial state at time t = 0, the process X is simulated for N agents (i.e., patients in the context of a cohort).At each time point the proportion of simulated patients that are in any state at that time is calculated.For large enough N these proportions approximate the transition probabilities from the initial state at time t = 0.

Including uncertainty into the multistate model
The exact parameters of transition-specific hazard functions are often not known.This uncertainty should thus be included in the model's parameter values.Parameters estimated from data are often asymptotically normally distributed for a suitable parameterization.We therefore included parameter uncertainty in the model by sampling the parameters of the transition-specific hazard functions from a multivariate normal distribution.Therefore the transition-specific hazard functions need to be parameterized so that parameters are multivariate normally distributed.
For each simulated patient, all parameters are first drawn from the specified distribution.Then the simulation for this patient is performed, as described in Section 2.2.This procedure allows the direct inclusion of uncertainty in the estimated parameters into the model, and obtains confidence intervals in the statistical analyses of the hypothetical cohorts.These confidence intervals reflect both sampling and parameter uncertainty.
In order to include uncertainties in the transition probabilities, the N simulations are split into M groups.Then the above-mentioned proportions for each of these groups is calculated.Finally, the 2.5% and 97.5% quantiles are computed to get a 95% confidence interval for the transition probabilities at each time point.This procedure requires N to be fairly large.

Using gems
In this section we illustrate how to use gems (Salazar Vizcaya et al. 2013).Figure 2 shows a flowchart of the steps to take to use gems and indicates where these steps are described in detail.First, Section 3.1 shows how to specify all the necessary input (number of states, hazard functions and parameters) to run a simulation.Then Section 3.2 shows how to simulate from this input.Section 3.3 describes how to include parameter covariances in the model and Section 3.4 shows how to add baseline covariates.Finally Section 3.5 describes how to include history dependence in the hazard functions and Section 3.6 describes an alternative to using hazard functions for specifying the transitions.
The package is available at CRAN and can be loaded by

R> require("gems")
The package gems uses three classes to encode all model inputs and outputs.
1.A transition.structure contains the number of model states and a matrix with elements that are used to specify transition-specific hazard functions, their parameters and covariances.2.An ArtCohort contains all aspects of the simulated cohort, including the model input and a data.framewith the entry times for each patient into each of the states.
3. The PosteriorProbabilities contain the transition probabilities or cumulative incidence that can be calculated from the ArtCohort.
The model has six main functions.The first three are used to specify the model, the fourth is used for simulation and the last two are used to summarize the results.
1. generateHazardMatrix creates a template of class transition.structure that can be used to specify the transition-specific hazard functions.
2. generateParameterMatrix creates a template of class transition.structure that can be used to specify the parameters.
3. generateParameterCovarianceMatrix creates a transition.structure that can be used to specify the parameter covariance.
4. simulateCohort simulates the specified artificial cohort and returns an object of class ArtCohort.
5. transitionProbabilities returns an object of class posteriorProbabilities that contains the transition probabilities from the initial state over time.
6. cumulativeIncidence returns an object of class posteriorProbabilities that contains the cumulative incidence over time.

Specifying the model
Suppose we want to simulate a disease with initial state E 1 , intermediate state E 2 and absorbing state E 3 as depicted in the DAG in Figure 3.In order to fully specify the model, Figure 3: Model specification DAG.
the hazard functions, their parameters and the covariance structure of these parameters must be specified.The hazard functions are specified in a transition.structure of dimension states × states.
The function generateHazardMatrix can be used to specify such a transition.structure that contains only the model structure.The argument statesNumber specifies the number of states in the multistate model.The resulting transition.structureonly provides the basic structure for how hazard functions are specified and the desired hazard functions must be entered.

R>
For exponential, Weibull and Weibull mixture distributions, the built-in functions can be specified as "Exponential", "Weibull", and "multWeibull" respectively.Arbitrary continuous hazards can also be specified as functions.
For instance, assume that the transition times T 12 and T 13 are exponentially distributed and the transition time T 23 is Weibull distributed.Then the transition.structurecan be set up using double square brackets as follows.To show different ways of specifying time-toevent distributions, we used a user-supplied function for T 12 and a built-in function for T 13 , even though they are both exponentially distributed.For the first transition, the hazard function of an exponential is specified as a hazard function with its own parameterization.
The parameterization of the built-in functions is explained below.Note that user-supplied functions need to return a result of the same length as the time argument t.The required code to specify the hazard functions described above is When specifying a function as "Exponential" or "Weibull", the parameterization is the same as in the rexp or rweibull function; that is, h W eibull (t, shape, scale) = shape scale t scale shape−1 . ( For the Weibull mixture model, the parameterization is where f W and F W are the Weibull density and distribution functions in the same parameterization as before, where k i is the shape and λ i the scale of the i-th Weibull distribution.Here k and λ are n-dimensional vectors and ω is an (n−1) dimensional vector with ω n = 1− n−1 i=1 ω i being defined automatically.Mixed Weibull distributions can be used when there are multiple modes of failure that result in the same state and can be estimated using maximum likelihood or non-linear least squares methods (Ling et al. 2009).
Specifying built-in functions is more efficient than specifying the hazard function of a Weibull distribution, because the simulation internally uses rweibull instead of using piecewise approximation of the Weibull hazard function and rpexp to generate Weibull distributed random numbers.
Once all hazard functions are suitably specified, the parameter values must be determined.The easiest way to do this is by using the function generateParameterMatrix, with the hazard structure hf as an argument, The transition.structure generated by generateParameterMatrix is again only a framework; the specific parameter values need to be assigned afterwards.Note that the parameters need to be specified in the order in which they appear in the function.

Simulation and post-processing
Once the model is specified, the simulation can be invoked with the simulateCohort function.The arguments for simulateCohort are the previously specified transitionFunction, parameters, as well as the number of patients cohortSize to be simulated and the final time to of the simulation.For the transitionProbabilities function, the argument times specifies the timepoints at which the transition probabilities should be returned.The plot function admits the argument ci in order to add confidence intervals to the figure.Figure 4 shows the transition probabilities and Figure 5 shows the cumulative incidence in the above example.

Parameter uncertainty
Suppose that we want to include parameter uncertainty in the above example.For instance, we estimate the shape and scale parameters for the transition from E 2 to E 3 be distributed as follows: Then the covariance matrix can be specified using the generateParameterCovarianceMatrix function with the previously generated parameter transition.structure as an argument.

Baseline characteristics
Baseline characteristics can be included in the model by letting the hazard depend on the argument baseline; for example, if age and sex are important characteristics.Consider sex to be encoded as 0 for males and as 1 for females, and let the baseline age be the age in years.Suppose we want to simulate a cohort of 50% men and 50% women with ages distributed uniformly between 20 and 50.Baseline characteristics should be specified in a matrix or data.frameas follows.

History dependence
In many real-world applications, transitions between states may depend both on the current state, and on the history of previous events in the patient history.For instance, in an HIV treatment model, the immune system worsens between failure of first-line therapy and start of second-line treatment and the mortality hazard after starting a second-line therapy depends on how long a person spent on failing first-line treatment (Gazzola et al. 2009).
History-dependence of the model can be specified by letting the hazard function depend on the argument history.This history-argument is a vector indexed by the transition-number The history argument allows to use the clock-forward approach (time refers to the time since the patient entered the initial state) instead of the clock-reset approach (time refers to time since entry into current state) to multistate modeling (Putter et al. 2007).To use the clockforward approach, t can be replaced by t + sum(history).Note that the clock-forward approach does not support built-in function ("Exponential", "Weibull" and "multWeibull") and users have to supply their own functions.If the transition-specific hazard for transition 3 was estimated using the clock-forward approach instead of the clock-reset approach, the Weibull function could be specified as follows.

Time to transition functions
Sometimes it is easier not to specify transitions via their hazards, but to directly specify the time it takes until the transition occurs.For instance, in some cases test results need to be confirmed by a second test three months later (e.g., HIV treatment failure tests (Estill et al. 2012)).Then the time would be three months and the hazard function would be a function with infinite point mass in 3.An additional argument timeToTransition to the simulateCohort function would have to be given by a matrix; the position of this kind of transition would be TRUE and the rest (usual hazard functions) would be FALSE.This procedure and the specification of the transitionFunction is as follows: 4. Case study: Transcatheter aortic valve implantation

Introduction
Calcific aortic stenosis is a degenerative disease characterized by progressive narrowing of the aortic valve, which compromises oxygenated blood output from the heart.Medical therapy as a sole treatment option has not improved survival among patients with symptomatic severe aortic stenosis.Surgical aortic valve replacement (SAVR) is the treatment of choice and the gold standard for aortic valve disease treatment.In the presence of serious co-morbidities, and in patients considered to be at high-risk for SAVR, transcatheter aortic valve implantation (TAVI) techniques offer less-invasive treatment of valvular aortic stenosis.Older patients who have severe calcific aortic stenosis, characterized by the presence of co-morbidities and compromised left ventricular ejection fraction, have increased risk of complications from the surgical procedure itself.These high risk patients were managed medically until catheterbased treatment TAVI was introduced in 2002.During a TAVI implantation a bio-prosthetic valve is inserted and implanted within the diseased aortic valve through a catheter.The result of increased interest in this catheter-based approach is that this less invasive procedure is now used in patients with less severe disease (Pilgrim et al. 2012).

Statistical analysis
The tavi data set contains data on kidney injuries, bleeding complications and the combined endpoint of stroke or death for 194 patients.The variables kidney, bleeding, death are indicator variables that show if an event has occurred; the variables kidney.dur,bleeding.dur,death.durare the times at which the events occurred or the patients were censored.In the following discussion, the DAG depicted in Figure 6 is assumed.Since no patients experience both kidney injury and bleeding complications, we assume these events to be mutually exclusive.We then create the transition matrix using the mstate package.According to the DAG the transition matrix is given by In order to estimate the transition-specific hazard functions, we prepare the data using the mstate package.We use msprep to get the data into long format, and split to split the data according to the transition.
ds2 <-simulateCohort(transitionFunctions = hm2, + parameters = par2, + cohortSize = 100 * nrow(tavi), + to = maxtime) R> cinc2 <-cumulativeIncidence(ds2, 0:maxtime, colnames(tmat), M = 100) The plot function also admits an argument states, which can be used in order to only plot certain states as shown in the following example.Figure 7 shows how the simulated cumulative incidence depends on the statistical model.The package gems admits the choice of any transition-specific hazard function.We will now use the second model with piecewise constant hazard functions to estimate the effect of an intervention on mortality.

Intervention modeling
Suppose there is a new intervention that dramatically reduces the probability of getting bleeding complications, and we are interested in the impact of this intervention on mortality.For simplicity, we assume that the intervention reduces the transition-specific hazard of bleeding complications by 80%.Then   Figure 8 shows that reducing bleeding complications by 80% decreases three-year mortality by 18.5% from 43.3% to 35.3%.

Conclusion
In this paper we have presented the R package gems, which allows simulation from a directed acyclic multistate model.The R package gems is a flexible tool for investigating and evaluating health interventions.We have given detailed examples of the use of each function of gems, and an example of its use to evaluate the effect of reduced bleeding complications in TAVI patients on mortality.
Several packages estimate and simulate Markov models, but we are not aware of any other packages that allow simulation from a multistate model with arbitrary transition-specific hazard functions.This flexibility in hazard functions improves its fit to data and allows to more accurately estimate the effects of different interventions.This package's inclusion of history-dependent transitions is also a major improvement on many traditional model structures.
The gems package has some limitations, including the fact that a DAG is required.Sometimes it is useful to have models in which patients can return to a previous state.If this feature is not frequently required, the problem can be resolved by repeating the state in a DAG.Otherwise a different model structure is needed.A further limitation is that higher flexibility requires more intensive computation, compared to traditional models.Longitudinal processes could be incorporated in a joint model, and evaluations of the artificial cohorts could be further automated in future expansions.
The gems package has useful functions for simulating hypothetical cohorts of patients based on a multistate model with general transition-specific hazard functions, and is a flexible and user-friendly tool for planning and evaluating public-health interventions.

Figure 1 :
Figure 1: Sample DAG of states and transitions.Panel A shows the states and transition hazards.Panel B shows the potential transition times T ij and the transition times T j of a simulation where T 13 < T 12 .The red solid lines depict the path taken in this particular simulation and the gray circles represent the states visited by the individual.The blue dashed lines represent the potential transitions that never occurred.
transition time T i for the transitions that have occurred.For transitions that have not yet occurred, the history argument is 0.

Figure 6 :
Figure 6: DAG for case study on TAVI.

Figure 7 :
Figure 7: Cumulative incidence with constant and piecewise constant transition-specific hazard functions.

Figure 8 :
Figure 8: Effect of reducing bleeding complications on mortality.
The output ArtCohort contains the entry time into the different states for each patient.Next, we can calculate and plot the transition probabilities and cumulative incidence including 95% confidence intervals from the initial state using the functions transitionProbabilities and cumulativeIncidence respectively.
If there is a sex-specific rate, one option would be to record it as a numeric of length 2, with the first position describing the rate for male and the second position the rate for women.Suppose age is another risk factor, specified as a rate ratio per year.In this case the function would depend on the sex-specific rate rate, the rate ratio rr per year of age and the baseline characteristics bl.The model could then be specified as follows.In order to simulate from this model that includes baseline characteristics, an additional argument baseline is added to the simulateCohort function as follows.
As a first step we fit an exponential distribution to all transition times.For each transition, we estimate the rate and the variance.Figure7shows the overall mortality from the simulated cohort.Because the purpose of this study is to illustrate the use and flexibility of the package, we split time into monthly intervals and calculate piecewise constant hazard functions using the pehaz function from the muhaz package.