A scalable approach for continuous time Markov models with covariates

Abstract Existing methods for fitting continuous-time Markov models (CTMMs) in the presence of covariates suffer from scalability issues due to the high computational cost of the matrix exponentials calculated for each observation. In this article, we propose an optimization technique for CTMMs that combines a stochastic gradient descent algorithm with differentiation of the matrix exponential using a Padé approximation. This approach makes fitting large-scale data feasible. We present two methods for computing standard errors: a novel approach using the Padé expansion, and another using a power-series expansion of the matrix exponential. Through simulations, we find improved performance relative to existing CTMM methods, and we demonstrate the method on the large-scale multiple sclerosis NO.MS data set.


Introduction
Multiple sclerosis (MS) is a chronic neuroinflammatory and neurodegenerative disease that affects the central nervous system. Progression of the disease over time causes increasing disability, which is measured on the expanded disability status scale (EDSS) (Kurtzke, 1983). The EDSS measures several aspects of disability (such as walking) and reflects the functioning ability of an individual. The higher the EDSS score, the more disabled the individual: for instance, an EDSS score of 0 reflects a normal neurological examination, EDSS 6 indicates the use of a walking cane, and EDSS 10 is death. As part of clinical trials, the EDSS score is measured as an ordinal outcome approximately every 4-6 months in relapsing and progressive MS patients. To understand better how the accumulation of disability is related to demographic and disease-related variables, it is of interest to model how the progression of EDSS varies from subject to subject, for example, whether it remains unchanged, increases (disease worsening), or decreases (disease improvement). In addition, like many other neuroinflammatory and neurodegenerative diseases, MS is heterogeneous, i.e., patients' EDSS scores change at individually varying rates, with a trend to gradually accumulate disability over the years.
A critical question for MS researchers is identifying the prognostic factors that influence EDSS progression. These factors provide insights into how progression varies between patients and point to possible directions for treatment. Different factors have been reported to be associated with disease progression; for instance, Vukusic and Confavreux (2007) showed that age is a factor in progression, while Runmarker and Andersen (1993) showed that, 5 years after onset of MS, a low number of affected neurological systems, a low neurological deficit score, and a high degree of remission from the last bout were the most important prognostic factors.
The analysis of disease progression in MS clinical trials has typically used time-to-event analysis, which is the method recommended in regulatory guidelines for demonstrating a drug effect on the first clinically meaningful disability progression (Alvarez and others, 2015). Other works on exploring prognostic factors in MS (Kappos and others, 2015; Vukusic and Confavreux, 2003) have also relied on a time-to-event modeling framework. However, outside demonstrating a treatment effect on disease progression in randomized controlled trial settings, time-to-event models are not adequate for characterizing disease progression, for three reasons. First, time-to-event models consider unidirectional changes and hence ignore that patients can improve or recover from their current EDSS state (see Mandel and others, 2007 for a detailed discussion). Second, they do not account for repeated events of worsening, thereby not fully using the available longitudinal information. Third, these models cannot handle heterogeneity in EDSS transitions; i.e., time-to-event analysis treats worsening from EDSS 2 to 3 the same as worsening from EDSS 4 to 5, yet the factors that influence early transitions (e.g., from EDSS 2 to 3) may not be the same as the factors influencing transitions at a later stage of the disease (e.g., from EDSS 4 to 5). In addition, survival models for EDSS are subject to biases introduced by left censoring, i.e., they do not account for the fact that patients in the analysis may start at different EDSS scores. Markov models address all of these concerns, allowing any transition (deterioration and recovery) and flexible modeling of covariates, where the covariate effect is estimated separately for each possible transition. A Markov process is one where the outcome at a given time depends only on the outcome at the previous time and is independent of the earlier history.
In this article, we focus on the continuous-time Markov model (CTMM). Markov models can be defined in either discrete or continuous time. A discrete-time Markov model is one in which the system evolves through discrete time steps (e.g., every 1 month), while in a continuous-time Markov model transitions between states can happen at any time (e.g., 2 months, 1 year, 20 days, etc.). MS patients often have measurements taken at variable time-points because, in addition to regular check-ups, EDSS assessments are also conducted when patients relapse. An MS relapse is an acute neurological deterioration caused by a demyelinating event in the central nervous system, or in other words, a sudden worsening of MS symptoms, from which patients may or may not fully recover. These varying observation times therefore require a continuous-time model. A second consideration is whether a patient's observation time corresponds to the exact time when a change in status occurs. The exact change in disability (EDSS) may occur before a patient arrives at the clinic; therefore, our data are interval censored. Because the exact time of a transition from one disability state to another is not directly observed, the CTMM as outlined in Kalbfleisch and Lawless (1985) is a suitable choice for our EDSS data. This form of likelihood takes into consideration all possible intermediate state transitions and timings between observation time-points. Once it has been fitted to the data, the CTMM can generate the probability of moving from one EDSS state to another at any time, allowing one to estimate how factors are associated with faster or slower EDSS transitions. CTMMs are widely used to fit multistate longitudinal data. These models are applied in particular in the field of public health, where the states of a Markov chain may refer to worsening stages of a chronic disease, such as breast cancer (Hsieh and others, 2002). In longitudinal settings, individuals are followed at regular intervals, where the exact transition times between disease states are typically not observed. Because state transitions can happen at any time and data are collected in a nonequidistant longitudinal manner, the CTMM offers a more parsimonious approach than the discrete version of Markov models, and handles the underlying variable nature of the disease. Lastly, previous work has applied hidden Markov models to multistate data in continuous time for progression (Williams and others, 2020; Luo and others, 2021). Here, however, we choose to focus on the nonhidden Markov model, because in MS the current EDSS state, rather than a latent unobserved state, is generally the most prominent factor in determining a patient's future disability status.
A recent study (Lublin and others, 2022) showed that relapse and age are important factors in the accumulation of disability in MS, although there is continuing interest in which clinical factors are associated with EDSS transitions. Therefore, our goal here is to estimate the effect of different covariates on different EDSS transitions by applying a CTMM to a large longitudinal data set of MS patients. We are motivated by the large-scale NO.MS data set, which includes approximately 20,000 MS patients with longitudinal EDSS data followed for up to 15 years (Dahlke and others, 2021), along with a range of clinically relevant covariates. However, the large sample size and the nonequidistant time intervals between outcome measurements present a challenge for a CTMM, because the likelihood and score function involve the evaluation of a matrix exponential for each observation; as a result, model fitting and parameter estimation become burdensome, requiring numerical approximation methods (Moler and Van Loan, 2003; Kalbfleisch and Lawless, 1985). Previously published articles focus on models in which the number of EDSS states has been restricted, with three or fewer state transitions allowed (Mandel and others, 2013). In this article, we relax this assumption and allow for more granular transitions between different EDSS states. A large database allows us to explore the influence of covariates in specific stages of the MS disease course. Mandel and others (2007) developed a method for analyzing MS disease states using fixed-effects transition models, which, along with their first-order Markov assumption, suffered from lack of fit, partially due to the heterogeneous nature of the disease. NO.MS contains a large number of observations, allowing for more granular transitions between EDSS states; this results in a large number of model parameters, and the analysis becomes complex and computationally intense. In this article, we propose a mini-batch stochastic gradient descent optimization technique combined with differentiation of the matrix exponential, based on the approach presented in Van Loan (1978) and Wilcox (1967). This method has previously been applied in the context of hidden Markov models (Marshall and Jones, 1995), but to our knowledge has not been utilized for CTMMs. We show that the mini-batch stochastic gradient used in this article can accommodate large-scale data. While other studies have modeled the transitions between different EDSS states without investigating clinical prognostic factors (Zurawski and others, 2019), these factors or covariates affect the transition intensities between the different disease states. Indeed, the introduction of covariates should allow increased accuracy when predicting transition rates (Cook and others, 2002). We show that our model can handle different types of covariates, including baseline as well as time-varying covariates.
Section 2 presents the details of the model. Section 3 presents our approach to estimation and inference in the CTMM. Section 4 first describes how we simulate data from the proposed CTMM and then how we assess the performance of our models. Finally, Section 5 presents an application and discussion of the proposed model on the NO.MS.2 data set.

Continuous time Markov models
Consider a longitudinal study in which individuals can move among the states S = {1, 2, …, S}. Let M denote the number of subjects in the study and N_m the number of observation times for the m-th subject. The time of subject m's k-th observation is denoted t_mk, at which we observe the subject's state as a random variable s_mk ∈ S. States are assumed to follow a first-order continuous-time, discrete-state Markov process, that is, Pr(s_mk | s_m(k−1), s_m(k−2), …, s_m1) = Pr(s_mk | s_m(k−1)), and we denote the probability of subject m's transition from state i to state j between times t_1 and t_2 by p_ij(t_1, t_2) = Pr(s(t_2) = j | s(t_1) = i). Figure 1 shows an illustration of state transitions and their corresponding probabilities: after staying for some time in state i, an individual moves with some probability p_ij from state i to state j. The likelihood for each individual is the product of all such transitions across observation times. The Markov process is fully characterized by its transition intensities q_ij(t) = lim_{δ→0} Pr(s(t + δ) = j | s(t) = i)/δ for j ≠ i, where q_ii(t) = −Σ_{j≠i} q_ij(t) (Cox and Miller, 1977). For the matrix representation, let P(t_1, t_2) and Q(t) denote the S × S matrices of transition probabilities p_ij(t_1, t_2) and transition intensities q_ij(t), respectively (the transition intensity matrix is also known as the rate or infinitesimal generator matrix). Each entry q_ij(t) of Q(t) represents the rate of transition from state i to state j at time t.
Under time homogeneity, transition intensities are independent of time, i.e., q_ij(t) = q_ij, and transition probabilities depend only on the elapsed time between successive observations, i.e., p_ij(t_1, t_2) = p_ij(t_2 − t_1, 0) for i, j ∈ S. In this case, transition probabilities can be related directly to transition intensities through the so-called Kolmogorov equation, P(t_1, t_2) = Exp(Q (t_2 − t_1)), where Exp(·) denotes the matrix exponential (we denote the scalar exponential by exp(·)).
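As a concrete illustration of this relation, the sketch below (Python with NumPy/SciPy; the 3-state intensity matrix is hypothetical, not from the NO.MS analysis) computes transition probabilities from a time-homogeneous Q:

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical 3-state intensity matrix: off-diagonal entries are
# non-negative transition rates; each row sums to zero.
Q = np.array([
    [-0.30,  0.25,  0.05],
    [ 0.10, -0.20,  0.10],
    [ 0.00,  0.05, -0.05],
])

def transition_probabilities(Q, tau):
    """P(tau) = Exp(Q * tau): probability of occupying state j after
    elapsed time tau, starting from state i."""
    return expm(Q * tau)

P = transition_probabilities(Q, 0.5)
# Each row of P is a probability distribution over destination states.
```

Because each row of Q sums to zero, each row of P sums to one for any elapsed time.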

Incorporating covariates in the model
Suppose subject m has R covariates z_mr(t_mk), r = 1, …, R, measured at time t_mk (the k-th observation) and collected in the R-vector Z_m(t_mk). This vector can contain different types of covariates, such as time-independent covariates (e.g., sex) as well as time-varying ones (e.g., age). We further assume that each covariate remains constant at its present value until the next observation. Marshall and Jones (1995) described a form of proportional hazards model where, in the presence of covariates, the transition intensity matrix elements q_ij are replaced by q_ij,m(Z_m(t_mk)) = q0_ij exp(Σ_{r=1}^{R} z_mr(t_mk) β_ij,r), where β_ij,r is a regression coefficient that quantifies the impact of each covariate on the transition from state i to state j, and q0_ij is the baseline intensity; we write β for the R × S × (S − 1)-vector of all coefficients, and Q_0 is called the baseline intensity matrix, with entries q0_ij. Q_0 is interpreted as the transition intensity when all covariates are zero (or at the reference level for categorical covariates); as a result, all covariates should generally be centered. We call this version of our model a "transition-dependent regression," where the effect of each covariate can vary for each state transition. We can also consider another version where regressors have a common effect over all state transitions, i.e., q_ij,m(Z_m(t_mk)) = q0_ij exp(Σ_{r=1}^{R} z_mr(t_mk) β_r), which we call the transition-independent regression case. For the rest of the article, we focus on the transition-dependent regression, though the transition-independent regression is modeled similarly and is nested within the more flexible model.
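A minimal sketch of how the transition-dependent regression maps a covariate vector to an intensity matrix (Python; the function name and array shapes are our own conventions, not from the article):

```python
import numpy as np

def intensity_matrix(Q0, beta, z):
    """Transition-dependent regression:
    q_ij = q0_ij * exp(sum_r z_r * beta[i, j, r]) for i != j, with the
    diagonal then reset so each row sums to zero.
    Q0: (S, S) baseline intensities; beta: (S, S, R); z: (R,) centered covariates."""
    Q = Q0 * np.exp(np.einsum('r,ijr->ij', z, beta))
    np.fill_diagonal(Q, 0.0)          # ignore whatever was on the diagonal
    np.fill_diagonal(Q, -Q.sum(axis=1))  # enforce zero row sums
    return Q
```

With beta = 0, or all covariates at their centered reference values, this recovers the baseline matrix Q0, which is exactly why centering makes Q_0 interpretable.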
Let τ_mk = t_mk − t_m(k−1), k = 1, …, N_m, be the time lag between two consecutive observation times.
Denote by θ the parameters of interest, namely the elements of Q_0 and β (the latter of dimension R × S × (S − 1)). This allows us to formulate the time-invariant likelihood for subject m as

L_m(θ) = Π_k ( Exp(Q_mk τ_mk) )_{s_m(k−1), s_mk},   (2.3)

where (·)_{s_m(k−1), s_mk} refers to the corresponding entry of the matrix (row: state at time t_m(k−1); column: state at time t_mk). The complete likelihood is then the product over all subjects, L(θ) = Π_{m=1}^{M} L_m(θ). Equation (2.3) is the generalized version of the likelihood used by Kalbfleisch and Lawless (1985) when covariate vectors are measured at every observation time (requiring a canonical decomposition of Q at every observation).
The ML estimator of θ can be obtained through the maximization of L(θ). Although L(θ) has a simple form, its evaluation is computationally intensive because it involves a matrix exponential for each factor of the likelihood. In the next section, we introduce our approach to calculating the derivatives of L(θ) and develop a stochastic gradient descent approach to find the ML estimates.
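The per-subject likelihood can be sketched as follows (Python; `Q_list` holds the intensity matrix in force over each interval, reflecting the assumption that covariates are held constant between observations):

```python
import numpy as np
from scipy.linalg import expm

def subject_loglik(Q_list, tau_list, states):
    """log L_m(theta): sum over consecutive observation pairs of
    log (Exp(Q * tau))[s_prev, s_curr]."""
    ll = 0.0
    for Q, tau, s_prev, s_curr in zip(Q_list, tau_list,
                                      states[:-1], states[1:]):
        P = expm(Q * tau)          # one matrix exponential per interval
        ll += np.log(P[s_prev, s_curr])
    return ll
```

The full log-likelihood is the sum of `subject_loglik` over all M subjects; each call costs one matrix exponential per interval, which is exactly the bottleneck the stochastic gradient approach targets.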

Optimization with stochastic gradient descent
Combined, Wilcox (1967) and Van Loan (1978, Theorem 1) show that for any matrix function A(λ) = (a_ij(λ)) of a scalar λ, the derivative of Exp(A(λ)) with respect to λ can be found, via a Padé approximation, from the block matrix

C = [ A   Ȧ
      0   A ],   Exp(C) = [ Exp(A)   ∂Exp(A)/∂λ
                            0        Exp(A)     ],   (3.4)

where Ȧ = ∂A/∂λ. Our goal is to obtain the partial derivatives of L(θ) with respect to the elements of θ, with θ standing in for each q0_ij and β_ij,r, for all transitions i, j ∈ S such that i ≠ j and all covariates r. For subject m,

∂/∂θ log L_m(θ) = Σ_k ( ∂/∂θ Exp(Q_mk τ_mk) )_{s_m(k−1), s_mk} / ( Exp(Q_mk τ_mk) )_{s_m(k−1), s_mk},   (3.5)

where the final ratio of matrices is evaluated entry-wise. Thus, we need to calculate the derivatives for every subject m and at every observation time lag τ_mk. To use result (3.4), for each interval τ_mk we take A = Q_mk τ_mk, which gives Ȧ = Q̇_mk τ_mk, where Q̇_mk contains the partial derivatives of the entries of Q_mk with respect to q0_ij or β_ij,r. According to (3.4), having the matrices Ȧ for each observation time lag, we can form the block matrices C and use (3.5) to calculate ∂/∂θ log L_m(θ).
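The block-matrix identity can be sketched directly (Python; as a check, we compare it against SciPy's Fréchet-derivative routine, which computes the same directional derivative by other means):

```python
import numpy as np
from scipy.linalg import expm, expm_frechet

def expm_derivative(A, Adot):
    """Van Loan's identity: Exp([[A, Adot], [0, A]]) has Exp(A) on the
    diagonal blocks and the derivative of Exp(A) in direction Adot in
    the top-right block."""
    S = A.shape[0]
    C = np.zeros((2 * S, 2 * S))
    C[:S, :S] = A
    C[:S, S:] = Adot
    C[S:, S:] = A
    E = expm(C)
    return E[:S, :S], E[:S, S:]   # (Exp(A), dExp(A))
```

For the score, A = Q_mk τ_mk and Adot holds the entry-wise derivative of Q_mk τ_mk with respect to a single parameter.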
Finally, the full gradient is obtained as ∂/∂θ log L(θ) = Σ_{m=1}^{M} ∂/∂θ log L_m(θ). Having ∂/∂θ log L(θ), we can now use gradient descent to find ML estimates of the parameters θ in (2.3). However, standard gradient descent must be applied to the full data set, hence computing matrix exponentials and derivatives at each observation time lag τ_mk for all individuals; consequently, it is computationally expensive to scale to a large data set. Instead, we use a mini-batch stochastic version of gradient descent (Sakrison, 1965), which randomly partitions the data into mini-batches and performs the calculations on each mini-batch instead of the entire data set. This allows our method to scale to essentially arbitrarily large data sets. Let B_d be a randomly chosen subset of subjects (with or without replacement) at iteration step d (|B_d| < M). The updated parameters at step d + 1 of the mini-batch stochastic gradient descent are obtained via

θ_{d+1} = θ_d + λ_{d+1} Σ_{m ∈ B_d} ∂/∂θ log L_m(θ_d),   (3.7)

where λ_{d+1} is the learning-rate sequence (we use a decreasing learning rate, λ_{d+1} = (d + 1)^{−0.6}). This procedure iterates until the parameters in (3.7) converge (i.e., ‖θ_{d+1} − θ_d‖ < ε for a small ε). This results in ML estimators of θ = ({q0}_ij, {β}_ij,r), for all transitions i, j ∈ S, i ≠ j, and all covariates r. Moving forward, we refer to this optimization approach as the SCTMM.
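A minimal sketch of the mini-batch update loop (Python; the learning-rate schedule follows (d + 1)^{−0.6} as above, but we use the batch *mean* of the score rather than the batch sum so the step size is insensitive to |B_d| — a small deviation from the update written above, flagged here as our choice):

```python
import numpy as np

def minibatch_sgd(grad_fn, theta0, n_subjects, batch_size=500,
                  max_iter=2000, tol=1e-8, seed=0):
    """Ascend the log-likelihood using a random mini-batch of subjects
    per iteration, with decreasing learning rate (d + 1)**-0.6.
    grad_fn(theta, idx) returns the mean score over subjects idx."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for d in range(max_iter):
        idx = rng.choice(n_subjects, size=min(batch_size, n_subjects),
                         replace=False)
        step = (d + 1) ** -0.6 * grad_fn(theta, idx)
        theta = theta + step
        if np.linalg.norm(step) < tol:   # convergence check
            break
    return theta
```

In the CTMM application, `grad_fn` would evaluate (3.5) over the subjects in the mini-batch; the toy gradient in the usage test below merely exercises the update rule.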
3.1.1. Calculation of confidence intervals. Asymptotic standard errors and confidence intervals are computed from the Hessian matrix of the log-likelihood evaluated at the parameter estimates. There are different numerical approximations in the literature for computing the Hessian of θ. Since the calculation of the Hessian matrix requires computing (S × (S − 1) × R)² elements, existing methods (Hanks, 2018; Fleming and Calabrese, 2021; Jackson, 2011) are computationally prohibitive if M is large. Here, we propose two scalable approaches to calculate the Hessian matrix: the first calculates second-order derivatives through reapplication of the Padé approximation in (3.4); the second uses a second-order approximation based on the power-series definition of the matrix exponential (see Moler and Van Loan, 2003).

Padé expansion for Hessian
Let θ = (θ_1, θ_2), where θ_1 and θ_2 are the {q0}_ij and {β}_ij,r components. The Hessian then has a 2 × 2 block form, with blocks (ℓ, ℓ′), ℓ, ℓ′ ∈ {1, 2}. To obtain the Hessian matrix, we need the second derivative of the log-likelihood stated in (2.3),

∂²/∂θ_ℓ ∂θ_ℓ′ log L_m(θ) = Σ_k [ ( ∂²/∂θ_ℓ ∂θ_ℓ′ Exp(Q_mk τ_mk) )_{s_m(k−1), s_mk} / ( Exp(Q_mk τ_mk) )_{s_m(k−1), s_mk} − ( ∂/∂θ_ℓ Exp(Q_mk τ_mk) )_{s_m(k−1), s_mk} ( ∂/∂θ_ℓ′ Exp(Q_mk τ_mk) )_{s_m(k−1), s_mk} / ( Exp(Q_mk τ_mk) )²_{s_m(k−1), s_mk} ].   (3.8)

The terms in expression (3.8) can be found in Section 3.1, except for ∂/∂θ_ℓ (∂/∂θ_ℓ′ Exp(Q_mk τ_mk)), which is obtained by reapplying the block-matrix construction (3.4) to the first-derivative blocks, stacking them into a 4S × 4S matrix in which C_3, the lower-left block, is a 2S × 2S zero matrix and 0 is an S × S zero matrix. Specification of the Hessian is completed by considering the different possible combinations (ℓ, ℓ′) ∈ {1, 2}² of θ_ℓ and θ_ℓ′ for every parameter pair.

Power series expansion for Hessian
Using the Padé approximation to calculate confidence intervals is computationally expensive, especially when the number of states S is large, due to the costly computation of the Exp function. Thus, we also introduce a second approach using a truncated power-series approximation, which can be faster to compute at a slight cost in accuracy. Recalling (2.3) and using the power-series definition of the matrix exponential, we have

Exp(Q_mk τ_mk) = I + Q_mk τ_mk + (Q_mk τ_mk)²/2! + ⋯ ≈ I + Q_mk τ_mk + (Q_mk τ_mk)²/2,

where I is the identity matrix, and we truncate the power series after the third term (see the Appendix for more details). The trade-off in this method is that truncation leads to a slight underestimation of the variance in the resulting confidence intervals, since some of the terms in the power series are dropped. We explore the impact on coverage in Section 4. Having the second derivatives, we can simply form the Hessian matrix, and the calculation of confidence intervals then becomes straightforward.
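A sketch of the truncated expansion (Python; `n_terms=3` matches the truncation used here, and adding terms drives the series to the exact exponential):

```python
import numpy as np
from scipy.linalg import expm

def expm_truncated(A, n_terms=3):
    """Exp(A) ~ I + A + A @ A / 2! + ... truncated after n_terms terms.
    Accurate when the entries of A = Q * tau are small; the dropped
    terms are the source of the slight variance underestimation noted
    in the text."""
    out = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for n in range(1, n_terms):
        term = term @ A / n   # build A**n / n! incrementally
        out = out + term
    return out
```

For short observation gaps (small τ), the third-order remainder is tiny; for long gaps, the Padé-based route above is the safer choice.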

Evaluation methods
We compare our method to an existing and widely used CTMM tool. Jackson (2011) proposed a method that allows CTMMs to be fitted to longitudinal data and developed an accompanying R package, MSM (Jackson, 2011). MSM supports several approaches for the optimization step, and we evaluate each individually: • MSM_opt: the optim method, which uses the deterministic Nelder-Mead approach (R Core Team, 2021).

Simulation overview
We perform a simulation study under different scenarios to assess the performance of both the proposed SCTMM optimization method and the construction of confidence intervals outlined in Section 3. Let t_max be the maximum time for which subjects are followed up (for instance, 10 years).
Simulating data from the CTMM for one subject m is carried out via the following steps: 1. Choose an arbitrary baseline transition matrix Q_0 and values for the baseline covariates z_mr(t_m0) for every covariate r; then compute Q_m0 via equation (2.2). The interpretation of Q_0 depends on the centering of the covariates. To obtain the stationary (also called steady-state) probability vector π, solve the matrix equation π Q_m0 = 0 subject to Σ_i π_i = 1. This specifies the initial state occupied by subject m: we draw a state from the finite state space S with probabilities π. 2. To obtain the time of the next state transition t_m(k+1), draw one sample from an exponential distribution with rate −q_{s_mk s_mk}, where s_mk is the state occupied by subject m at observation k (note that for the very first step of the simulation algorithm, k = 0). Furthermore, randomly generate values from a uniform distribution for z_mr(t_m(k+1)) for every r. 3. Note that the time t_m(k+1) generated at Step 2 is the instantaneous transition time at which subject m moves to a state other than the currently occupied one, s_mk. Therefore, in order to simulate patients who remain clinically stable, we generate a number of dummy observations in which subject m is observed several times in the same state s_mk but with different (randomly drawn) covariate values. The number of dummy sojourn observations is arbitrary; however, we advise mimicking the observation frequency of the real data set at hand, for instance, every 4 months.
4. Choose the next state s_m(k+1) by drawing a sample from the set of states S with probabilities computed by equation (2.1). 5. If t_m(k+1) < t_max, go to Step 2; otherwise terminate the algorithm.
We then repeat the above steps for any given number of subjects M.
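The steps above can be sketched as follows (Python; for brevity we hold Q fixed rather than updating covariates at each event, and we draw the next state from the embedded jump chain q_ij/(−q_ii), a standard choice that stands in for equation (2.1)):

```python
import numpy as np

def stationary_distribution(Q):
    """Solve pi @ Q = 0 subject to sum(pi) = 1 (Step 1)."""
    S = Q.shape[0]
    A = np.vstack([Q.T, np.ones(S)])   # stack constraint row onto Q^T
    b = np.zeros(S + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def simulate_subject(Q, t_max, rng):
    """Exact event times for one subject: exponential holding time with
    rate -q_ss (Step 2), next state from the normalized off-diagonal row
    (Step 4), until t_max is exceeded (Step 5)."""
    S = Q.shape[0]
    state = rng.choice(S, p=stationary_distribution(Q))
    t, path = 0.0, [(0.0, state)]
    while True:
        t += rng.exponential(1.0 / -Q[state, state])
        if t >= t_max:
            return path
        jump = np.where(np.arange(S) != state, Q[state], 0.0)
        state = rng.choice(S, p=jump / jump.sum())
        path.append((t, state))
```

The dummy sojourn observations of Step 3 would then be interleaved between consecutive event times at the desired visit frequency.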

Simulation
We consider two main simulation scenarios: first, a null case where covariates have no effect on the transition rates (i.e., β_ij,r = 0 for every i, j, and r); and a second case where a covariate effect impacts the transition rates. In both cases, we try to mimic the real data set used in the next section and choose the number of states based on where the majority of the transition data lies, reducing the 20 EDSS states to eight states (S = {1, 2, …, 8}), with t_max = 15 years of follow-up. For the second case, where covariates impact the transition rates, we use 10 covariates and assess the estimation of the transition-dependent regression parameters β_ij,r. All 10 covariates are drawn from a uniform distribution and then centered at zero. We perform a sensitivity analysis over the sample size M in both scenarios and assess the performance of our SCTMM method. For both scenarios and each sample size, we perform a Monte Carlo simulation, generating 1000 realizations of the data set, and evaluate the operating characteristics of bias, standard error, coverage, and rejection rate. Estimation for each realization uses different initial values of β_ij,r for each i, j, and r, drawn from a normal distribution with mean 0 and standard deviation 1. The benchmarking platform used for this study ran R-4.0.0 to generate the data sets and perform the analysis on 7 Intel Ivy Bridge cores, each running at 1.15 GHz, with 16 GB of RAM in total.
There are many configuration choices when implementing mini-batch stochastic gradient optimization, and we take guidance from the existing literature (Franchini and others, 2020; Perrone and others, 2019). With large-scale data (M > 1000), it is advisable to set the mini-batch size |B_d| between 500 and 1000. In this article, we fix |B_d| = 500.
The CTMM likelihood is not globally concave, and thus multiple restarts are recommended to get as close to the global maximum as possible; the optimization run that produces the largest log-likelihood is taken as the optimal solution. We advise using between 1000 and 5000 restarts with random initial values for any number of parameters up to what we have here (we use 1000 restarts). Careful selection of initial values is required to reduce the chance of the optimizer arriving at saddle points or local optima. We suggest one of the following policies, where applicable, to set the starting parameters for the SCTMM method: • Apply SCTMM on a smaller subset of the data with no stochastic computation (for instance, M = 500), estimate the parameters, and then use these estimates as starting values for the restarts in the full SCTMM.
• Sample off-diagonal elements for row i and column j from positive values of a univariate normal distribution with mean and standard deviation 1/|i − j|.
The above suggestions aim at minimizing the number of restarts that lead to highly suboptimal solutions and at improving the accuracy of the estimates. We perform two analyses: one with no covariates in the model, where we estimate the transition intensities across all state transitions; and a second where we include covariates and remove from the model any state transition i → j that accounts for less than 1% of the transition data in the entire data set, estimating covariate effects for the remaining transitions. We use the Padé expansion method to calculate confidence intervals.

Simulation Results
To assess the estimation performance of the coefficients β_ij,r in the presence of covariate effects (i.e., the transition-dependent scenario), we tested the performance of SCTMM against MSM. Table 1 compares the optimization methods used in MSM to obtain maximum likelihood estimates with our proposed SCTMM method. These results show that in the small-scale setting (M ≤ 1000), SCTMM has equal or better performance than MSM in terms of bias, variance, coverage, and rejection rate. While all methods show some undercoverage and slight inflation of the null rejection rate, SCTMM is closest to nominal. In the large-scale setting (M ≥ 10 000), none of MSM's methods was available (due to numerical instability), but the SCTMM approach estimates the parameters with reasonably good performance, demonstrating the scalability of the SCTMM approach to large data sets. Table 2 compares the optimization methods used in the MSM software and our proposed SCTMM approach when no covariate effects are incorporated in the model (β_ij,r = 0 for every i, j, and r). Again, the results show that in the small-scale setting, SCTMM has equal or better performance than MSM, and in the large-scale setting SCTMM handles the scalability problem. Figure 2 compares the computation time of SCTMM and MSM for a single initialization (no restarts), showing that the two methods are on par in this case.

NO.MS.2 data
We utilize the novel Novartis-Oxford MS (NO.MS.2) data set in this article, formed by aggregating multiple sclerosis patient data from 34 MS clinical trials and their extensions (Dahlke and others, 2021; Lublin and others, 2022). Currently, NO.MS.2 is the largest and most comprehensive clinical trial data set in MS, spanning all MS phenotypes and containing data from over 27 000 patients with up to 15 years of follow-up visits, with regular monitoring of patients' neurological status by highly trained raters across all stages of MS disease. The earliest clinical trial in this data set began in 2003 and the latest measurement is from January 2020. These trials collected longitudinal data on disability, as measured by Kurtzke's EDSS (Kurtzke, 1983). In this analysis, EDSS was simplified as follows: state 0 remained 0; EDSS 1-1.5, 2-2.5, 3-3.5, … were rounded down to 1, 2, 3, …; and EDSS ≥ 8 was set to 8, resulting in 9 reduced states.
In this study, we used a cohort of M = 13 320 patients, with N = 170 628 observations in total, S = {0, 1, …, 8} EDSS states and a set of the following R = 7 covariates.
• ARR1: The annualized relapse rate, the number of relapses a patient experiences within the year prior to the current observation time (time-varying).
• Age: Age at event time divided by 10, so that coefficients are interpreted as the effect per decade of age and are more comparable with other variables (time-varying).
• DURFS: Duration of MS in years from first symptoms, as estimated at trial entry (baseline).
• MSTYPE: A three-level categorical variable indicating the phenotype of MS disease, with levels relapsing-remitting MS (RRMS, used as the reference level), secondary progressive MS (SPMS), and primary progressive MS (PPMS).
• Sex: Male or female (reference level female).
We perform two analyses: 1. First, an analysis with no covariates in the model, considering all EDSS transitions (i.e., keeping rare transitions). 2. Second, an analysis allowing only transitions that account for more than 1% of the transition data in the entire data set, which for this particular data set mostly excludes EDSS transitions with jumps of more than two states. For these transitions, we fit covariate effects.
All covariates are centered, and we apply SCTMM using mini-batches of size |B_d| = 500 at each iteration step d. We also use 1000 restarts with different initial values of θ, drawn from uniform distributions with (min = 0, max = 1) for {q0}_ij and (min = −1, max = 1) for {β}_ij,r.

Real Data Results
The right panel of Figure 3 shows the model-based estimates of the baseline transition probability matrix when no covariate data were included in the model parameterization. Patients are most likely to remain in their current state, and the probability of transitioning to a higher or lower EDSS state is greater in the early-to-mid stages of disability than in the later stages of the disease. The plausibility of these results is supported by the left panel of Figure 3, which displays a normalized empirical transition probability matrix calculated using adjacent clinical assessments in the entire data set. The graphs in Figure 4 illustrate transition-specific hazard ratios (exp(z_r β_ij,r)), with the covariate set to its mean value, and corresponding 95% confidence intervals (all covariates are included in the model at the same time). These graphs show the effect of each covariate on successive EDSS transitions. Figure 4a shows hazard ratios for ARR1, indicating that relapses have a significant association with the accumulation of EDSS disability, particularly in, but not limited to, the early stages of the disease. This is in concordance with the findings of a recent study by Lublin and others (2022). Based on these results, for instance, having a relapse in the past year for someone with an EDSS of 3 increases the risk of transition to EDSS 4 by approximately 40% (25%-50%), controlling for other covariates. Figure 4b shows that older age is associated with faster EDSS transitions, primarily but not exclusively early in the disease; for instance, a decade increase in age increases the risk of transitioning from EDSS 2 to 3 by approximately 19% (14%-24%). Figure 4c shows some association between the duration since first MS symptoms and EDSS transitions.
Figure 5 shows a more comprehensive set of heat-map plots of hazard ratios for the different covariates, displaying not only the successive transitions but also larger EDSS changes of up to two states.
Figures 5a and 5b show that having a relapse in the last year and older age affect the chance of EDSS worsening or improvement particularly at the early stage of the disease (most patients at this stage are of the relapsing-remitting MS sub-type, hence the name). Figure 5c shows that the longer the duration of the disease, the higher the chance of further deterioration (HR > 1) and the lower the chance of recovery (HR < 1), controlling for other covariates including age. One might expect covariates to have opposing effects on the risk of worsening versus improvement; however, Figures 5a and 5d show that this is not the case for having a relapse or being male. Having a relapse in the last year increases both the chance of deterioration and the chance of improvement (e.g., recovery from relapses), and males have a lower hazard of transitioning compared to females (this may be explained by male patients being more likely to belong to the progressive subtypes, where the female-to-male ratio is approximately 1:1, than to the relapsing-remitting subtype, where the corresponding ratio may be 2:1 or even 3:1). Figures 5e and 5f show that patients with progressive MS (SPMS or PPMS) have a higher chance of deterioration and a reduced chance of recovery compared to patients diagnosed with RRMS.
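The hazard ratios plotted in Figures 4 and 5 are of the form exp(z_r β_ij,r), with 95% intervals obtained by exponentiating a Wald-type interval on the log-hazard scale; a hedged sketch of that calculation (the coefficient and standard-error values in the test are invented for illustration, not estimates from the paper):

```python
import math

def hazard_ratio_ci(beta, se, z=1.0, crit=1.96):
    """Hazard ratio exp(z * beta) with a Wald-type 95% CI formed on the
    log scale as z * beta -/+ 1.96 * |z| * se, then exponentiated."""
    hr = math.exp(z * beta)
    half = crit * abs(z) * se
    return hr, math.exp(z * beta - half), math.exp(z * beta + half)
```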

Conclusion
We have proposed a method to overcome the problems associated with fitting a CTMM with covariates to large-scale data sets. We use a mini-batch stochastic gradient descent algorithm that works with a random subset of the data at each iteration, making it practical to fit large-scale data. Furthermore, we used the results of Wilcox (1967) and Van Loan (1978) to calculate the derivatives of the matrix exponential and, building on this, proposed a novel approach for computing confidence intervals via two applications of a Padé approximation to obtain the second derivatives. We also proposed an alternative method for computing confidence intervals based on a truncation of the power series definition of the matrix exponential; the latter is useful when the number of states in the CTMM is large (approximately S > 20). In a small-to-mid-scale setting (M ≤ 1000), our simulation studies show a slight out-performance of the proposed SCTMM over the MSM software. In a large-scale setting (M ≥ 10 000), where the MSM software is unable to estimate the parameters, the proposed SCTMM can be used and shows good performance. Some of the important findings from the analysis of NO.MS.2 are as follows:
• The number of relapses in the last year (ARR1) and older age increase the risk of deterioration (EDSS increase) and reduce the chance of recovery (EDSS decrease), particularly in the early stage of the disease.
• Disability accumulation is a slow process in MS: in our large MS data set, in which EDSS assessments occur on average approximately every 4 months, most patients remain in their EDSS state between consecutive visits. Transitions to a higher or lower EDSS state occur more frequently in the young and in RRMS than in the progressive subtypes of the disease (SPMS or PPMS), where patients are more likely to worsen and less likely to recover.
• Our proposed statistical method makes the fitting of Markov models with covariates feasible and scalable to large data sets. It allows the investigation of covariate effects on transition probabilities between (disease) stages, which may find application far beyond MS.
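The matrix-exponential derivative machinery summarized above rests on Van Loan's (1978) identity: exponentiating the augmented block matrix [[Q, E], [0, Q]] places the directional (Fréchet) derivative of exp(Q) in direction E in the upper-right block. A minimal dependency-free sketch, using a plain scaling-and-squaring Taylor evaluation of the matrix exponential in place of the paper's Padé approximant (the intensity matrix in the check is invented for illustration):

```python
import numpy as np

def expm_taylor(A, s=10, terms=20):
    """Matrix exponential by scaling and squaring: exp(A) = exp(A/2^s)^(2^s),
    with exp(A/2^s) evaluated by a truncated Taylor series."""
    n = A.shape[0]
    B = A / (2.0 ** s)
    X, term = np.eye(n), np.eye(n)
    for k in range(1, terms):
        term = term @ B / k
        X = X + term
    for _ in range(s):          # undo the scaling by repeated squaring
        X = X @ X
    return X

def expm_frechet_vanloan(Q, E):
    """Van Loan (1978): the upper-right block of exp([[Q, E], [0, Q]])
    is the Frechet derivative of exp at Q in direction E."""
    n = Q.shape[0]
    M = np.block([[Q, E], [np.zeros((n, n)), Q]])
    return expm_taylor(M)[:n, n:]
```

Comparing the returned block against a central finite difference of exp(Q + hE) provides a quick correctness check of the identity.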

Software
The code used to implement the approach outlined in the simulation study and data application is available on GitHub at the following link: https://github.com/farhad-hat/SCTMM. A sample input data set and complete documentation are available on request from the corresponding author, Thomas Nichols (thomas.nichols@bdi.ox.ac.uk).
2. If $i = s_{m(k-1)}$ and $j = s_{mk}$:
$$\frac{\partial}{\partial q^0_{ij}}(U) = \frac{\partial}{\partial q^0_{ij}}\left[\, q_{ij}q_{jj} + q_{ii}q_{ij} \,\right] = \left( q_{s_{m(k-1)}s_{m(k-1)}} + q_{s_{mk}s_{mk}} \right) H_{s_{m(k-1)}s_{mk}},$$
$$\frac{\partial^2}{\partial q^0_{ij}\,\partial\beta_{ab,c}}(U) = \begin{cases} z_{mc}(t_{mk})\, q_{s_{m(k-1)}s_{m(k-1)}}\, H_{s_{m(k-1)}s_{mk}}, & \text{if } a = b = s_{m(k-1)},\\ z_{mc}(t_{mk})\, q_{s_{mk}s_{mk}}\, H_{s_{m(k-1)}s_{mk}}, & \text{if } a = b = s_{mk},\\ \left( q_{s_{m(k-1)}s_{m(k-1)}} + q_{s_{mk}s_{mk}} \right) z_{mc}(t_{mk})\, H_{s_{m(k-1)}s_{mk}}, & \text{if } a = s_{m(k-1)} \text{ and } b = s_{mk}, \end{cases}$$
$$\frac{\partial}{\partial\beta_{ij,r}}(U) = \left( q_{s_{m(k-1)}s_{m(k-1)}} + q_{s_{mk}s_{mk}} \right) q_{s_{m(k-1)}s_{mk}}\, z_{mr}(t_{mk})\, H_{s_{m(k-1)}s_{mk}},$$
$$\frac{\partial^2}{\partial\beta_{ij,r}\,\partial q^0_{ab}}(U) = \begin{cases} H_{ab}\, z_{mr}(t_{mk})\, q_{s_{m(k-1)}i}, & \text{if } a = i \text{ and } b = s_{mk},\\ q_{i s_{mk}}\, z_{mr}(t_{mk})\, H_{ab}, & \text{if } a = s_{m(k-1)} \text{ and } b = i, \end{cases}$$
$$\frac{\partial^2}{\partial\beta_{ij,r}\,\partial\beta_{ab,c}}(U) = \begin{cases} q_{s_{m(k-1)}s_{m(k-1)}}\, q_{s_{m(k-1)}s_{mk}}\, z_{mc}(t_{mk})\, z_{mr}(t_{mk})\, H_{s_{m(k-1)}s_{mk}}, & \text{if } a = b = s_{m(k-1)},\\ q_{s_{mk}s_{mk}}\, q_{s_{m(k-1)}s_{mk}}\, z_{mc}(t_{mk})\, z_{mr}(t_{mk})\, H_{s_{m(k-1)}s_{mk}}, & \text{if } a = b = s_{mk},\\ 2\left( q_{s_{m(k-1)}s_{m(k-1)}} + q_{s_{mk}s_{mk}} \right) q_{s_{m(k-1)}s_{mk}}\, z_{mr}(t_{mk})\, H_{s_{m(k-1)}s_{mk}}, & \text{if } a = s_{m(k-1)} \text{ and } b = s_{mk}. \end{cases} \quad \text{(A.6)}$$
Notice that when $i \neq s_{m(k-1)}$ and $j \neq s_{mk}$, the derivatives with respect to both $q^0_{ij}$ and $\beta_{ij,r}$ are zero.

Fig. 2. Running time comparison between the MSM software and the proposed SCTMM method.

Table 1. Comparison of bias, variance, coverage, and rejection rate between the methods used in the MSM software and our proposed SCTMM method, when varying …

$$\frac{\partial^2}{\partial q^0_{ij}\,\partial q^0_{ab}}(U) = \begin{cases} H_{ab}\, H_{i s_{mk}}, & \text{if } a = s_{m(k-1)} \text{ and } b = i,\\ 0, & \text{otherwise}, \end{cases}$$
$$\frac{\partial^2}{\partial q^0_{ij}\,\partial\beta_{ab,c}}(U) = \begin{cases} q_{s_{m(k-1)}i}\, z_{mc}(t_{mk})\, H_{i s_{mk}}, & \text{if } a = s_{m(k-1)} \text{ and } b = i,\\ q_{s_{m(k-1)}i}\, z_{mc}(t_{mk})\, H_{i s_{mk}}, & \text{if } a = i \text{ and } b = s_{mk}, \end{cases}$$
$$\frac{\partial}{\partial\beta_{ij,r}}(U) = q_{ij}\, z_{mr}(t_{mk})\, q_{s_{m(k-1)}i},$$
$$\frac{\partial^2}{\partial\beta_{ij,r}\,\partial q^0_{ab}}(U) = \begin{cases} H_{ab}\, z_{mr}(t_{mk})\, q_{s_{m(k-1)}i}, & \text{if } a = i \text{ and } b = s_{mk},\\ q_{i s_{mk}}\, z_{mr}(t_{mk})\, H_{ab}, & \text{if } a = s_{m(k-1)} \text{ and } b = i, \end{cases}$$
$$\frac{\partial^2}{\partial\beta_{ij,r}\,\partial\beta_{ab,c}}(U) = \begin{cases} q_{i s_{mk}}\, z_{mr}(t_{mk})\, z_{mc}(t_{mk})\, q_{s_{m(k-1)}i}, & \text{if } a = i \text{ and } b = s_{mk},\\ q_{i s_{mk}}\, z_{mr}(t_{mk})\, z_{mc}(t_{mk}\, \ldots \end{cases}$$