Understanding Dynamic Status Change of Hospital Stay and Cost Accumulation via Combining Continuous and Finitely Jumped Processes

The Coxian phase-type models and the joint models of longitudinal and event time have been extensively used in the studies of medical outcome data. Coxian phase-type models have the finite-jump property while the joint models usually assume a continuous variation. The gap between continuity and discreteness makes the two models rarely used together. In this paper, a partition-based approach is proposed to jointly model the charge accumulation process and the time to discharge. The key construction of our new approach is a set of partition cells with their boundaries determined by a family of differential equations. Using the cells, our new approach makes it possible to incorporate finite jumps induced by a Coxian phase-type model into the charge accumulation process, therefore taking advantage of both the Coxian phase-type models and joint models. As a benefit, a couple of measures of the “cost” of staying in each medical stage (identified with phases of a Coxian phase-type model) are derived, which cannot be approached without considering the joint models and the Coxian phase-type models together. A two-step procedure is provided to generate consistent estimation of model parameters, which is applied to a subsample drawn from a well-known medical cost database.


Introduction
Rising expenditures and constraints on health care budgets have prompted the development of a variety of methods for the analyses of hospital charge and length of stay (LOS) as discussed in Gold [1], Lipscomb et al. [2], and Lin et al. [3]. Correctly fitting the charge and LOS data is a critical step in optimizing the allocation of healthcare resources. But due to the protection of private information, the detailed information regarding the treatment process that patient experience in hospital is not available from many wellknown medical outcome databases, like the New York State's Statewide Planning and Research Cooperative System. The missing longitudinal information of the treatment process makes it more challenging to generate good fitting; meanwhile it becomes demanding to have a dynamic model, through which effective inference can be made against the hidden treatment process. To that goal, a bunch of stochasticprocess-based models have been well developed and applied to analyze the medical datasets.
The continuous-time Phase-Type (PH) model has been widely used in the study of hospital charge and LOS data. Many authors focus in particular on a special subclass of PH model/distribution, namely, the Coxian phase-type (CPH) model/distribution Tang [4]; Faddy et al. [5]; Marshall et al. [6][7][8]; Fackrell [9]. Unlike other popular theoretical distributions widely used in inpatient data, such as log-normal and gamma distribution, the CPH model/distribution not only provides a theoretical distribution that can be used to fit the empirical data, but also gives us a sketch of the treatment dynamics that patient experience in hospital. In fact, from CPH models, we can track the pathways that patient went through in different medical stages (characterized by the discrete set of phases in the PH model) during a hospital stay.

Computational and Mathematical Methods in Medicine
The pathway information makes it possible to cluster patients and facilitate the use of healthcare process improvement technologies, such as Lean Thinking or Six Sigma McClean et al. [10,11].
The other popular approach to study hospital charge and LOS is through dynamically modelling the charge accumulation process and the determination of the time to discharge, which belongs to a more general class of joint models of the longitudinal measurements and time to event, Ibrahim et al. [12]; Tsiatis and Davidian [13]; Henderson et al. [14]; Kim et al. [15]; Sousa [16]; Lawrence Gould et al. [17]. In medical cost studies, the charge accumulation is a monotonic nondecreasing process; the joint model used in this case is reduced to a class of random growth with random stopping time (RGRST) models.
Like CPH models, the RGRST models do also capture the treatment dynamics that patient experience in hospital. But in contrast to tracking the pathways of patient moving through different medical stages, the RGRST models focus more on describing how patient and/or doctor makes the discharge decision in reaction to the change of actual charge level and the length of time that patient has stayed in hospital. Therefore, the story of RGRST models is more about the behavioural patterns of patient/doctor behind the treatment dynamics, while the story of CPH models is more on the medical side.
It is natural in this paper to think of the possibility of combining CPH models and RGRST models together in order to extract more information regarding the discharge decision-making on different medical stages. However, there is a natural gap between the two models. The CPH model is a finitely jumped stochastic process in essence, while the charge accumulation in the RGRST model is continuous. It is not trivial to combine a jump process with a continuous process. To deal with that difficulty, we propose a partition-based approach with each partition cell determined by solving a boundary differential equation. These boundary differential equations are subtly designed to merge the continuous charge into discrete "phases" involved in a Coxian phase-type model. In sum, the main contributions of this paper are as follows: (i) We show that there is a natural way to convert a special subclass of RGRST models to CPH models.
(ii) We propose an algorithm to estimate the transition matrix of the CPH model converted from a given RGRST model and the parameters involved in that RGRST model.
(iii) Based on the correspondence between RGRST models and CPH models, we derive a variety of different measures of the "cost" of staying in a medical stage at each time. That "cost" information is important for the purpose of insurance payment and healthcare process improvement.
McClean et al. [11] tried a different way to incorporate the charge accumulation process into a CPH model. But in their work only the case that the charge accumulation process adopts a piece-wise linear form was discussed. It turns out that the piece-wise linear assumption is quite restrictive while crucial to their main result. Without it, the matrix technique in McClean et al. [11] is no longer applicable to achieve the th order moments of total charge for > 1, while our differential-equation-based approach does still work. In fact, we believe our method extends the work of McClean et al. [11] in the following two aspects.
(1) Instead of being piece-wise linear, we consider a much more general situation in which the charge accumulation process can take arbitrary forms as long as a conditional expectation function of that process satisfies a general regularity condition. In particular, within our framework, it is possible to consider the potential influence of the current charge level on the future charge accumulation which is neglected by the piece-wise linear assumption.
(2) In addition to the moments of total charge, it is derivable from our model of the joint distribution of the total charge and LOS, and the joint distribution of the costs and time being spent on every stage by every fixed time . Therefore, our model provides more detailed information of the treatment that the patient experiences in hospital.
Although the motivation of our work is the analysis of the charge accumulation and the determination of hospital length of stay, it turns out that the proposed method is useful for many other problems where the relation among the time to event and a hidden continuous process as well as a jump process is in interest. For example, in the field of investment risk management, it is always important to detect how the default probability of the corporate bond issued by a firm is affected by the growth stage and profitability (say measured by the flow of revenue) of that firm. In this case, our model can definitely provide some insights if we identify the default as the event in interest and consider the revenue flow as determined by a continuous process similar to the charge accumulation and the transition among different growth stages of the firm as described by a CPH process. In addition to problems of the survival-type, it is also natural to extend our work to the case of competing risks, of which every stage in our model can be identified with a type of risk. Although in competing risk models, the CPH transition matrix is no longer sufficient, it turns out that the partition-based technique introduced below is extendible to derive the joint distributions of a wide class of the competing risk models, the details of which will be discussed in a related paper by the authors. This paper is organized as follows. In Section 2, after a short review of the CPH models and RGRST models, we present the correspondence between them and briefly introduce the estimation algorithm. In Section 3, we conduct numerical studies to show the validity and usefulness of our model. A couple of interesting findings toward the medical outcome database, the New York State's Statewide Planning and Research Cooperative System 2013, are discussed. Section 4 concludes the paper.

Model
In this section, a new model (denoted as CPH-RGRST model) is constructed that connects the CPH models to RGRST models in the sense that (1) a CPH-RGRST model is a RGRST model; (2) charges in a CPH-RGRST model can be classified into a number of stages such that every stage is identified with a Computational and Mathematical Methods in Medicine 3 phase in a given CPH model in the sense that, at every time , the probability of staying in a stage is exactly given by the probability in the th phase of the CPH model. In particular, the marginal distribution of LOS induced by a CPH-RGRST model is a CPH distribution. We shall state the detailed construction of the CPH-RGRST models after a brief review of the definition and some basic properties of RGRST models and CPH models.

the Joint Model (RGRST) versus the CPH Model.
A RGRST model can be formally defined as follows as discussed in Gardiner et al. [18,19] and Polverejan et al. [20]: { ( )} is a nonnegative process characterizing the potential increment rate of charge per unit time provided that patient decides to stay, and 0 is a nonnegative random variable representing the charge at the initial time. We shall denote by ( ) = 0 + ∫ 0 ( ) the potential charge accumulation process in distinguishing the actual charge process { ( )}.
As shown in the supplementary materials (available here) note that a RGRST model can be completely specified by the initial probability density function (pdf), ( , 0), induced by the initial charge 0 and the following two conditional expectation functions: (2) And using (2), the joint probability density function (pdf) of the LOS ( ) and the total charge ( ) at the discharge time can be expressed as follows: where the function ( , ) in variable is the time-dependent pdf induced by ( ). The detailed derivation of (3) can be found in the supplementary materials. Expression (3) is useful in the estimation algorithm stated in the next section as it is the key component of the likelihood function.
To associate RGRST models with the CPH models, the hospital length of stay, represented as the random variable in (1), should induce a CPH distribution generated from a CPH model, which is a finite-state continuous-time stationary Markovian process with only one absorbing state/phase (we shall use the term "phase", by convention, in place of "state"). A CPH model is determined by an initial probability mass vector with ≥ 0 and ∑ =1 = 1, and the transition intensity matrix where , > 0 and the entry , of represents the transition intensity of a patient from phase to phase at every time > 0; formally: As suggested in McClean et al. [10], a phase in a CPH model can be identified with a treatment stage during hospital stay, such as diagnosis, acute care, assessment, rehabilitation, and long-stay care. The transition of patients among these stages characterizes the treatment progress.

Correspondence between CPH and RGRST Models.
The main result of this section is that there does exist a correspondence between CPH and RGRST models. The correspondence is built through converting the continuous variable, charge, in a RGRST model to finite many discrete states by partitioning the product space, R + × R + (representing charge and time, respectively), into a number of cells such that each cell corresponds to a phase in a CPH model, while the evolution of the probability of staying in those cells is exactly determined by the given CPH model. More precisely, we have the following theorem.
Set PDE by Eq. (6) subject to boundary condition on with in Eq. (6) replaced by ; is generated by a CPH model with the initial mass and its transition matrix is given as in (4) with ≡ for = 1, . . . , .
The proof of Theorem 1 is presented in the Appendix. From the proof, it is clear that the connection between RGRST models and CPH models is equivalent to a constraint put on the conditional probability function in (2) of the underlying RGRST model by the condition (6). In fact, the functional form of is completely determined by (6) and the function , which gives a first-order partial differential equation (PDE) of . This equation turns out to be solvable and has a unique solution for a given boundary condition. Therefore, using the characteristic method, Evans [21], we can solve (6) and express the function as follows: (8) where evaluated at = 0 is constantly 1 which means that all patients have to stay in hospital for a positive time before discharge; the form of the boundary and the value of the function on (denoted as ) are constructed by iteratively solving the Initial Value Problem (IVP) (A.11) in the proof; the details of the iteration are presented in Algorithm 1 of Corollary 2. * is the first time when the solution trajectory ( ) of IVP (A.11) (starting from ( , )) touches the boundary curve . Equation (8) is crucial to determining the parametric form of the joint pdf (3) and the likelihood function used for estimation.
The next corollary is a direct result of Theorem 1. It extends the construction in Theorem 1 to a more general situation where the transition intensity from different transient states to the absorbing state does not have to be identical; i.e., does not have to be equal to for different , . Therefore, it is always possible to achieve an arbitrary CPH model from a RGRST model satisfying a generalized version of condition (6) with replaced by for different . is very simple given that at time patients stay in the th-stage, : With the help of the conditional density (10), we can define a variety of measures of the "cost" of staying in a stage. For instance, fixing a stage and a time , we can think of the price as the amount that has been charged since patients arrived in that stage for the first time, the daily price as the amount being charged per day within that stage, and the time cost as the length of time that patients have spent in that stage by . Formally, the price ( ( )), daily price ( ( )), and the time cost ( ( )) for every stage and every time are defined as: where is the conditional mean of the first arrival time into the stage − 1 given the charge level and current time , and similarly, represents the conditional mean of the charge at the first arrival time to the stage − 1 given and .
Although the price, daily price, and the time cost are defined through the first-order moment, the availability of the conditional probability (10) enables us to define the quantile version of (11). When there are large parts of outliers, a quantile version of those "cost" measures turns out to be more useful.
The information regarding the price and time spent in every such stage, as defined above, is helpful in rationalising the care process, thus reducing waste, in terms of unnecessary or inappropriate treatment, and avoiding delay, often the result of batch and queue processes, in a similar fashion to that adopted for industrial processes (McClean et al. [10]).

A Two-Stage
Algorithm. Corollary 2 implies a two-step algorithm that uses the real hospital charge and LOS data as input to estimate the underlying CPH model and the RGRST model from which the CPH model is derived.
The use of FML guarantees that all estimators obtained from the two-stage algorithm are consistent and asymptotically normal-distributed.

Numerical Studies
In this section, we conduct the numerical studies to show the validity of our two-stage estimation procedure. We will apply our procedure to both of the real data and simulation sample.
Our data source is the medical outcome database, New York State's Statewide Planning and Research Cooperative System 2013 (SPARCS 2013). The histogram of the entire SPARCS 2013 indicates that the total charge approximately follows a log-normal distribution; therefore, we will take the following parametric form for the function : and the initial 0 is assumed to satisfy log 0 ∼ ( , ) .
It turns out that under (12), (13), and (6), the resulting marginal distribution of total charge is close to a log-normal distribution.
When covariates exist (denoted by ), we assume that the random vector ( , ) is independent from the covariate vector, and (exp( ), exp( )) follows the joint pdf given by (3) with the initial pdf, , and specified as in (13), (12), and (6), respectively. The covariates are linked with the total charge, , and LOS, , through the following regression equations: ) are the regression coefficient vectors.
As for the dimension of the underlying CPH model, we follow the convention in the previous studies of Faddy et al. [5]; Tang [4]; McClean et al. [10] and only consider the two cases where the number of nonabsorbing phases is 3 and 4. After a preliminary study, we select the 4-Phase CPH model as it can generate better fitting to the SPARCS data.
Under the specification above, there are three classes of parameters to estimate. They are (1) the parameter vectors , and involved in the CPH model, (2) the parameter ( , ) involved in the initial pdf, and (3) the regression coefficients ( + , + ). We call the parameters of types (1) and (2) as the dynamic parameters because they specify the joint model that generates the distribution of charge and LOS.  Table 1.
In the simulation study, we generate 5000 samples from a joint model without covariates and the true value of the dynamic parameters is taken as the estimated value from the real data study, which is given as in Table 2.
In both of the real data and simulation studies, the computer code is written in the language of Python 2.7 with python-scipy, python-numpy libraries being used.

Simulation Study.
The goodness of fit is measured through comparing the fitted curves and the empirical histogram (drawn from the simulation sample) for both of the marginal charge and LOS, as shown in first line of Figure 1. We conduct Pearson's 2 test; the value of the 2 statistics and the associated values are (0.0171, 1.0) for the marginal charge and (6.1054, 0.8662) for the marginal LOS. From both of the fitting plots and the results of 2 test, our fitting is fairly good.
We also evaluate the goodness of fit in terms of the joint distribution through Pearson's 2 test; the 2 statistics and its value are (0.1911, 1.0), which is consistent with Figure 1. Therefore, the simulation study verifies the effectiveness of our two-step estimation procedure. Table 3, from which both of the severity and mortality of illness have significantly positive effect on both of the total charge and LOS that is consistent with the intuition.

Regression Coefficients. The estimated regression coefficients are reported in
On the other hand, among all the MDC groups, the Newborn And Other Neonates (MDC 15) and the Diseases and Disorders of the Musculoskeletal System And Connective Tissue (MDC 8) has the greatest negative and positive effects on the total charge, respectively, which is also consistent with the intuition. In contrast, the MDC groups with greatest negative and positive effect on LOS are the Diseases and . Among them, except the MDC 20 group, all the other groups have a more expensive bill but shorter hospital stay, and therefore a higher daily charge. In contrast, patients with alcohol/drug abuse tend to pay less but stay in hospital longer.
Combining the estimated coefficients in Table 3 and dynamic parameter in Table 2, we can even identify the stage     from which a patient exits to discharge. The discharge stage encodes critical information of the treatment pathways and is important for management purpose. Table 4 reports the estimated conditional mean (log-)charge, LOS, severity, and mortality risk for patients who exit from every stage. It is clear that the value of all the four variables monotonically increases as patients get discharged from later stages. Especially for severity and LOS, there are clear jumps from stages 1 and 2 to stages 3 and 4. The mean severity is doubled when transiting from stage 2 to stage 3 while the mean LOS almost gets tripled. This gap suggests that patients who are diagnosed with more severe conditions during admission are more likely to go over the treatment in stage 3 or 4, while their rehabilitation usually takes more time and more medical resources (McClean et al. [10]). This observation is consistent with the intuition and, to some extent, verifies the viewpoint that interprets the entire treatment process as a series of transitions among multiple medical stages.

Cost.
As discussed in the end of Section 2.1, the CPH-RGRST model enables us to evaluate the "cost" of each medical stage in different manners. Using the estimation results provided in the previous section, we can numerically compute the "cost" for our SPARCS sample. In Figure 2, we plot the estimated mean price, mean daily price, and mean time of staying for each of the four stages of the CPH model, where the "mean" refers to the CPH-RGRST process that generates the mean charge and LOS, = exp( ) exp( (0+̂)), and = exp( ) exp( (0+̂)). In Figure 3, the probability of staying in every nonabsorbing stage is plotted against the time. There are the following three major findings.
(i) From the plot 2 in Figure 2, the daily price for stage 1 declines over time, which is caused by the fact that, for those long-stay patients, they must have already switched into the higher stage treatments after the preexam period (represented by stage 1), which is very well captured in plot 1 of Figure 3. In contrast, for all the stage 2, 3, and 4, the daily price inclines to grow up in long run, which rejects the piecewise linear assumption claimed in McClean et al. [11]. In fact, in contrast to the constant growth rate of charge within each   stage, the increasing growth rate tends to be more reasonable, because a longer stay usually implies a worse health condition for a patient, who, therefore, needs better care, including more expensive medicines, more frequent exams, and the like. These items lift up the cost of stay per day. The same reasoning also applies well to the observation that the time cost of all stages is increasing over time as shown in plots 3 and 4 in Figure 2.
(ii) Although the time cost is slightly lower in stage 3 than in stage 4, both of the two stages (by (11), the time cost for stage 1 is trivial and constantly equal to the total time in hospital, so we omitted it in Figure 2) have their time cost almost identical to the total time that patients spent in hospital since they were admitted. In contrast, the time cost of stage 2 displays quite different features, which is not only much lower than that of the other stages, but, within the first 13 days, its growth rate is also slower. The different features of stage 2 are consistent with the estimated dynamic parameters in Table 2 and plot 2 in Figure 3. From Table 2, it is clear that the intensity of switch-in and switch-to-discharge in stage 2 is significantly higher than in the other stages, which means that there are two factors that lower down the time cost at stage 2.
(1) There are a large portion of patients switching from stage 1 to stage 2 in the early time (<5 days, see plots 1 and 2 in Figure 3); in contrast there is almost no patient who could switch from lower stage to stage 3 or 4 (see plots 3 and 4 in Figure 3), which implies on average that the first arrival time to stage 2 is later than to stages 3 and 4. (2) The portion of patients switching out of stage 2 (mainly to discharge by plots 3 and 4 in Figure 3) is also high, which is not the case for stages 3 and 4 (reflected as the scale of plots 3 and 4 being much smaller than plots 1 and 2 of Figure 3). Therefore, the switch-out time from stage 2 is earlier than from stage 3 or 4 on average.
Factors (1) and (2) shown in Figure 2 and Table 2 indicate that stage 2 should associate to the major treatment procedures, like the main surgery, that most inpatients have to experience when staying in hospital. In fact, it is usual that patients need a couple of days as the preparation period before the main surgery, such as the period for preexams. This preparation period is exactly captured by factor (1) of stage 2. On the other hand, patients usually recovered soon after the main treatment procedure gets done, and then are discharged, which is reflected by factor (2) of stage 2.
(iii) The third interesting observation from Figure 2 is regarding the median price of stage 2. From plots 1 and 2 in Figure 2, we can see that only at stage 2 there is a clear deviation between the expectation version and the median version of the price. More precisely, at stage 2 the median price is significantly lower than the price defined through the first-order moment, and this fact holds at almost all the time and also holds for the daily price. It is well known that when the median of a distribution is below its first-order moment, there exists a group of outliers with extremely great value. In the other words, Figure 2 indicates that a portion of patients in stage 2 are charged much higher than the others in that stage and all the time. From the perspective of patient's welfare and the effective allocation of medical resources, it is meaningful to have some further researches in identifying the causes that make some patients in stage 2 being charged more. Figure 3, it is clear that the probability of staying in stages 3 and 4 is small over all time. This fact might be induced by model overfitting as pointed out by a referee, but it is not. In contrast, the low probability reflects a deep-level distributional property of SPARCS data. A very large proportion of inpatients recorded in SPARCS only have extremely short hospital stay and 99+ percent of them get discharged by day 10, while no more than 0.01 percent of patients can stay in hospital for more than 25 days. But at the meantime, there do exist a small group of patients who can live in hospital for a couple of months before discharge. The same pattern can be observed for the total charge; the most expensive expenditure can take million dollars while more than 99 percent of patients are charged no more than 100,000 dollars.

Remark 3. By
Based on the observation above and the nondecreasing design of the CPH-RGRST models that higher stages correspond to longer hospital stay and higher total costs, the low probability of staying in the highest two stages just reflects a fact that both of the charge and LOS data in SPARCS 2013 have a very long and thin tail to the right. This tail property may not be well fitted if a CPH-RGRST model with fewer phases is used, because there will not be enough freedom to distinguish the portion of patients in the tail from those whose charge and LOS stay around the mode.

Summary
We introduced a methodology whereby the widely used CPH models and RGRST models can be combined together and a variety of measures of the cost of phases in the CPH model can be defined. A two-step procedure is proposed to estimate the combined CPH-RGRST model and the simulation study is done to verify the effectiveness of the estimation procedure. With the data sampled from SPARCS 2013, we estimated a four-phase CPH-RGRST model and drew the cost curves for every phase. To distinguish the effect of different types of illness on the charge and LOS distribution, we incorporated MDC groups and the severity and mortality risk of illness as covariates into the estimation.
We found that the effect of illness on the total charge and LOS is not always homogeneous. In particular, there are five MDC groups that affect the charge and LOS in different direction. Among them, there is only one MDC group, representing the alcohol/drug abuse, which has the negative effect on the final charge while it lifts up the LOS drastically.
The daily charge for all the stages, 2, 3, and 4, is increasing over time. This fact implies a nonlinear charge accumulation process within every stage and therefore contradicts the piece-wise linear assumption used by the other authors, McClean et al. [11]. We believe that the increasing daily charge is more realistic and reflects dynamic interaction between the health condition of patients and the treatment they accept.
Among all the four stages, stage 2 shows quite different features in both the price measure and the time cost measure. In terms of the time cost, stage 2 is significantly lower than stages 3 and 4 almost all the time. This observation is consistent with the relatively high switch-in and switch-todischarge intensity that the stage 2 has and associates the stage 2 with the major treatment procedures that most patients need to experience when staying in hospital.
The median is much lower than the mean of both the price and daily price in stage 2, while this kind of deviation does not exist for the other stages, and it implies that there is a portion of outlier patients who are charged much more than the other patients in stage 2. We believe that further studies are needed to find out the causes of those patients being charged more in stage 2, since it matters to the efficiency of the allocation of medical resources.
Firstly, notice that, by condition (6) It is easy to check that condition (1) does always hold. Finally, thanks to the existence and uniqueness theorem of solutions to IVPs, the solution curves

Data Availability
The data used in this paper is sampled from New York State's Statewide Planning and Research Cooperative System 2013, which is publicly available online at the following url: https:// health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/npsr-cm47. The specific sample used in this study and the python code for processing the data and implementing the estimation and simulation study are available from the corresponding author on request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.