Individualized survival predictions using a state space model with longitudinal and survival data

Monitoring disease progression often involves tracking biomarker measurements over time. Joint models (JMs) for longitudinal and survival data provide a framework to explore the relationship between time-varying biomarkers and patients’ event outcomes, offering the potential for personalized survival predictions. In this article, we introduce the linear state space dynamic survival model for handling longitudinal and survival data. This model enhances the traditional linear Gaussian state space model by including survival data. It differs from the conventional JMs by offering an alternative interpretation via differential or difference equations, eliminating the need for creating a design matrix. To showcase the model’s effectiveness, we conduct a simulation case study, emphasizing its performance under conditions of limited observed measurements. We also apply the proposed model to a dataset of pulmonary arterial hypertension patients, demonstrating its potential for enhanced survival predictions when compared with conventional risk scores.


Introduction
Safeguarding and enhancing patients' well-being is of paramount importance in healthcare, often necessitating ongoing monitoring. In recent years, numerous medical institutions have opted to store patients' information in electronic health records (EHRs) databases [1,2]. As time progressed, the potential inherent in this wealth of raw data, encompassing records from thousands of patients, became increasingly evident to researchers. These databases encompass a wide range of information, including diagnostic tests, laboratory results, medical procedures, their respective outcomes, and other medical occurrences that patients may encounter over their lifetimes [2-4]. The availability of such extensive healthcare data is progressively turning the aspiration of personalized treatment into a tangible reality.
Despite its advantages, EHR modelling is still relatively simplistic owing to the difficulty of applying conventional statistical methods [5,6]. Obtaining an accurate, parsimonious and explanatory model is hindered by many characteristics of EHRs, including heterogeneity, irregular timing of events, missing data and lack of standardization, among others [1,2,4-6]. That being said, these databases typically contain richer and more realistic symptom dynamics, additional biomarkers and more frequent visits when compared with clinical trials [1]. Despite the challenges posed by modelling such complex data, success in this endeavour holds the promise of enabling more precise outcome predictions and, consequently, the realization of personalized medicine.
Disease monitoring often involves tracking specific biomarkers that medical experts have identified as significant indicators, drawing on their experience to gauge a disease's progression [7]. In contemporary healthcare practices, these biomarker values are commonly employed in constructing risk scores, with the resultant score indicating the patient's present disease stage [8,9]. However, this approach typically overlooks the underlying dynamics and trajectory of biomarkers. When dynamics are considered, they are usually simplified, such as assessing changes in biomarker values over a 1-year period [10].
The state space model (SSM) serves as a versatile tool for capturing temporal evolution and measurement processes within a probabilistic framework. It consists of two key components: a dynamics equation, which describes how hidden states evolve over time, and an observation equation, illustrating the connection between these continuous-valued hidden states and the observed biomarkers. The strength of the SSM lies in its ability to employ first-order difference equations, making it computationally efficient and appealing [11,12]. Consequently, it has found success in various domains, including engineering [13], life sciences [14], social sciences [15], econometrics [16] and healthcare [17]. Interestingly, its efficacy has extended to movement ecology as well, where SSMs have gained more popularity than linear mixed effects (LME) models, becoming the preferred choice [18,19]. In the realm of healthcare, the SSM has been employed for diverse purposes, including the reconstruction of EEG and MEG signals [20,21], decoding ensemble spikes in neuroscience [22,23], classification and prediction of clinical data [24], and general clinical monitoring [25], among other applications. Notably, SSMs have primarily been applied to relatively short-term observational data, and their use in modelling irregular time series spanning months or years has been limited.
A common healthcare application involves time-dependent biomarkers for monitoring diseases, such as cluster of differentiation 4 (CD4) counts for human immunodeficiency virus [26-28], prostate-specific antigen for predicting prostate cancer recurrence risk [29-31], and echocardiogram variables for assessing cardiovascular diseases [32-34]. Early attempts to model survival in these scenarios often involved directly integrating these time-varying measurements into the time-dependent Cox proportional hazards model. However, it became evident that such an approach introduced bias [35,36]. Subsequently, researchers turned to a two-stage approach, initially modelling the longitudinal process and then integrating the output into a survival model [37]. Unfortunately, this did not eliminate bias, prompting further exploration of alternatives [26,38,39].
The joint model (JM) for longitudinal and survival data emerged as a solution to this bias issue [26]. The JM consists of two sub-models: the longitudinal process, often represented by a mixed effects model, and the survival process, typically modelled using the Cox proportional hazards model. These two processes are linked through random effects, with the key assumption that, given these random effects, the longitudinal and survival processes are conditionally independent [26]. These random effects are personalized for each patient, allowing deviations from the population trajectory. The JM has been successfully applied in various disease monitoring contexts, including those involving the biomarkers mentioned earlier.
In this study, we extend the SSM to incorporate survival data for patient health modelling and survivability assessment. We enhance the canonical SSM [40] by introducing a proportional hazards model influenced by the hidden states. Drawing inspiration from the JM expectation maximization (EM) algorithm [26], we present the linear state space dynamic survival model (LSDSM) with Gauss-Markovian assumptions. LSDSM diverges from the JM mainly in its representation of longitudinal biomarkers as a dynamic model, as opposed to relying on basis functions. Furthermore, LSDSM provides coefficients associated with previous hidden state values, in contrast to the time-dependent covariates employed in the LME model.
The advantages of SSMs are manifold: (i) they serve as generalizations of various time-series models such as autoregressive (AR) and autoregressive integrated moving average (ARIMA) models; (ii) they can model time series without necessitating covariates or design matrices, although these can be included seamlessly; (iii) they can accommodate expert knowledge by allowing the fixation of sub-structures; (iv) they maintain interpretability with appropriate matrix selection; and (v) they offer flexibility in altering the temporal order without changing the estimation procedure [41,42]. Furthermore, SSMs inherently distinguish between process variation and measurement error, aiding in the identification of true underlying processes [19]. They also naturally account for correlation structures between measurements and sequential time points, alleviating the need for precise pre-specification of such correlations, as is often required in LME models [12]. Using an SSM presents certain drawbacks when compared with LME models. Specifically, in scenarios where longitudinal observations are sparse, basis functions offer a natural interpolation capability, while the state space approach necessitates handling missing data. Additionally, the linear state space framework is constrained to specific longitudinal patterns, although its smoothing capabilities may alleviate this limitation when dealing with sparse data. However, it is worth noting that for certain applications, these drawbacks are relatively minor and can be addressed effectively, for example by incorporating smoothness priors and/or extending to nonlinear SSMs. Moreover, wearable technologies are gaining prominence in healthcare, enabling the collection of more frequent and regular longitudinal time-series data, a domain in which SSMs excel. Finally, it is worth noting that SSMs focus on the connection between the current and future time steps, simplifying the forecasting of future values even beyond the observation period [16]. In this paper, we harness these advantages, coupled with the computational efficiency of SSMs, to formulate an estimation framework and investigate the performance of LSDSM through simulations and a real-world application.
The remainder of the paper is structured as follows. Section 2 provides an overview of the notation and methodology underpinning the LSDSM framework. In §3, we delve into the estimation process and lay out the critical assumptions made. Performance metrics for assessing the predictive capabilities of the models, along with the procedure for individualized survival predictions, are detailed in §4. Moving on to §5, we present the results of our simulation studies. In §6, we pivot our focus to an application involving patients with pulmonary arterial hypertension (PAH), where we discuss the analysis conducted using LSDSM and compare its performance against the established risk score approach. The paper concludes with a summary of findings and a broader discussion in the final section.

Linear state space dynamic survival model
The linear state space dynamic survival model is constructed from two sub-processes: the longitudinal process and the survival process. The aim of the former is to reveal the true biomarker values in the presence of measurement error, while the objective of the survival process is to identify the relationship of the true biomarker values, and other covariates of interest, with the hazard of the patient. For ease of reference, the notation is listed in table 4 in appendix A.
In this work, we use a discrete-time SSM for the longitudinal sub-process, while also introducing the survival sub-process to the model through the form of a proportional hazards model. Thus, we are proposing a Markov-based dynamic model for the longitudinal process that has the potential to capture the rate of variation. In other words, the current true biomarker values are a function of the previous biomarker values, as an alternative to the design matrices that involve basis functions of time in the JM.
The hidden state trajectories $x_{i,j} \in \mathbb{R}^{m_x \times 1}$ include the true underlying biomarkers, and they are dictated by the Markovian assumption, i.e. that the current state values are a function of the state values at the previous time point. In this work, we shall make use of the linear Gaussian SSM with the following dynamic and observation equations:

$$x_{i,j+1} = A x_{i,j} + w_{i,j}, \qquad y_{i,j} = C x_{i,j} + v_{i,j}, \tag{2.1}$$

where $i$ and $j$ represent the $i$th patient and the $j$th time step, respectively, $A \in \mathbb{R}^{m_x \times m_x}$ is the transition matrix dictating the dynamics of the longitudinal hidden states in time, and $w_{i,j} \in \mathbb{R}^{m_x \times 1}$ is the disturbance term vector that allows variation from the population trajectory. $y_{i,j} \in \mathbb{R}^{m_y \times 1}$ is the observation vector containing a list of measurements for patient $i$ at time step $j$, $C \in \mathbb{R}^{m_y \times m_x}$ is referred to as the observation matrix, which provides a relationship between the observation and hidden state vectors, while $v_{i,j} \in \mathbb{R}^{m_y \times 1}$ is the measurement error vector, sampled from a zero-mean Gaussian distribution $\mathcal{N}(0, V)$.
Note that in the first expression of equation (2.1), $x_{i,j}$ and $x_{i,j+1}$ are separated by a constant time step $\Delta t$. The number of hidden states may exceed the number of available biomarkers, offering the potential to capture higher-order dynamics. In our model, we characterize the true biomarker trajectory using an AR($M$) process, where $M$ is the AR order, resulting in a total of $m_x = M \times m_y$ states. Consequently, we structure $C = [I \;\; 0] \in \mathbb{R}^{m_y \times m_x}$, where $I \in \mathbb{R}^{m_y \times m_y}$ represents an identity matrix. This construction allows for a direct association between the hidden states and the observed biomarkers. It is important to note that $C$ remains fixed, and therefore is not included in the set of parameters to be estimated. Complementing this structure in $C$, the $A$ matrix is chosen to maintain the SSM in a canonical form, ensuring identifiability and obtaining a unique solution [40]:

$$A = \begin{bmatrix} \bar{A} \\ I \;\; 0 \end{bmatrix}.$$

This special structure adopts simple AR behaviour. As an example, for an AR(2) process, $m_x = 2 \times m_y$, where $m_y$ is the number of unique biomarkers observed. Thus, for this $A$ matrix, $\bar{A} \in \mathbb{R}^{m_y \times m_x}$, $I \in \mathbb{R}^{(M-1)m_y \times (M-1)m_y}$ and $0 \in \mathbb{R}^{(M-1)m_y \times m_y}$. Note that an additional error term is introduced in the dynamics equation, $w_{i,j}$, typically referred to as the disturbance or uncertainty term; it captures the deviations from the dynamics $A$, which in turn is determined from the whole population. This is assumed to be sampled from a zero-mean normal distribution $\mathcal{N}(0, W)$. This random walk effect allows patients' trajectories to deviate from the population, and thus provides individualized longitudinal components for every patient. In the canonical representation specified above, this disturbance is assumed to influence solely the first $m_y$ states. This leads to $W = G \bar{W} G^\top$, where $G = [I \;\; 0]^\top \in \mathbb{R}^{m_x \times m_y}$ incorporates an $m_y \times m_y$ identity matrix $I$. Here, $\bar{W}$ denotes the reduced disturbance variance that impacts the first $m_y$ hidden states, while the trailing hidden states are assumed deterministic.
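As an illustration of this canonical construction, the following sketch (in Python, with hypothetical function names, restricted to a single biomarker so that $m_y = 1$) builds $A$ and $C$ for an AR($M$) process and simulates one noisy trajectory:

```python
import random

def canonical_matrices(ar_coeffs):
    """Canonical AR(M) state space matrices for a single biomarker (m_y = 1)."""
    M = len(ar_coeffs)
    A = [[0.0] * M for _ in range(M)]
    A[0] = list(ar_coeffs)        # A-bar: the AR coefficients fill the first row
    for r in range(1, M):
        A[r][r - 1] = 1.0         # remaining rows shift previous states down
    C = [[1.0] + [0.0] * (M - 1)] # C = [I 0]: observe only the leading state
    return A, C

def simulate(ar_coeffs, x0, n_steps, w_sd, v_sd, rng):
    """Simulate y_j = C x_j + v_j with x_{j+1} = A x_j + w_j."""
    A, C = canonical_matrices(ar_coeffs)
    M = len(ar_coeffs)
    x, ys = list(x0), []
    for _ in range(n_steps):
        ys.append(x[0] + rng.gauss(0.0, v_sd))   # observation with measurement error
        x_new = [sum(A[r][c] * x[c] for c in range(M)) for r in range(M)]
        x_new[0] += rng.gauss(0.0, w_sd)          # disturbance enters the first state only
        x = x_new
    return ys
```

Note how, for AR(2), the second state is simply a deterministic copy of the previous first state, matching the $[I \;\; 0]$ block in the canonical $A$.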
The survival model used for LSDSM is the proportional hazards model. The typical approach to include longitudinal information is to take the current value of the true biomarker trajectory, but many different associations may be implemented. For an extended list of these associations, the reader is referred to the review by Hickey et al. [43]. For this model, the hidden states representing the true underlying biomarker trajectories are used for the current hazard calculation. Hence, the survival sub-process is governed by the following hazard function:

$$h_i(t) = \exp\{\gamma^\top v_i + \alpha^\top H x_i(t)\},$$

where $h_i(t)$ is the hazard value for patient $i$ at time $t$, $v_i \in \mathbb{R}^{m_v \times 1}$ is the set of baseline covariates for patient $i$, and $\gamma \in \mathbb{R}^{m_v \times 1}$ and $\alpha \in \mathbb{R}^{m_\alpha \times 1}$ are the coefficients linking the baseline covariates and the hidden states to the hazard function, respectively.
The matrix $H \in \mathbb{R}^{m_\alpha \times m_x}$ permits a linear combination of the hidden states to influence the hazard function. This flexibility in the model structure allows for the incorporation of specific hidden states in modulating the hazard function. Additionally, the introduction of changes in the true biomarker value as potential time-dependent covariates within the hidden states can be achieved using the matrix $H$. The determination of $H$ is undertaken by the analyst, who relies on their expert insights to identify which associations are likely to exert an influence on the patient's probability of survival. The survival function is related to the hazard function as

$$S_i(t) = \exp\left\{-\int_0^t h_i(s)\,\mathrm{d}s\right\}.$$

This formulation allows for an individualization of the survival curves as a function of the baseline covariates ($v_i$) and the true biomarker trajectories ($x_{i,j}$) for patient $i$.
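To make the mapping from hazard to survival concrete, the following minimal Python sketch (hypothetical function names; the association term $\alpha^\top H x$ is reduced to a single scalar, and a constant baseline hazard is absorbed as an intercept in $v_i$) evaluates the piecewise-constant survival function implied by holding the states fixed between time steps:

```python
import math

def hazard(gamma, v, alpha_H_x):
    # Log-hazard is linear in the baseline covariates and the (H-projected)
    # hidden state; the baseline hazard enters via an intercept in v.
    return math.exp(sum(g * vi for g, vi in zip(gamma, v)) + alpha_H_x)

def survival(gamma, v, aHx_per_step, dt):
    # S(t) = exp(-integral of h); with states constant over each interval
    # the integral collapses to a sum of dt * h_j terms.
    cum_hazard = sum(hazard(gamma, v, aHx) * dt for aHx in aHx_per_step)
    return math.exp(-cum_hazard)
```

With $\gamma = 0$ and a zero association term, one step of unit length yields $S = e^{-1}$, as expected for a unit constant hazard.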
The assumptions represented by this model, along with additional assumptions required to facilitate the estimation procedure, are:
A.1 Markovian assumption for the longitudinal sub-process;
A.2 conditional independence, such that given the hidden state values, the longitudinal and survival processes are independent;
A.3 observation times and censoring are not affected by the hidden state values (i.e. they are observed or missing at random); and
A.4 the hidden state values remain constant between time steps, i.e. $x(j\Delta t) = x((j+1)\Delta t - \epsilon)$, where $\epsilon$ is some small number $\ll \Delta t$, leading to a simplification in the estimation procedure.

Estimation procedure
In our effort to integrate and monitor survival data within the extended SSM, we build upon the foundational work introduced by Dewar & Kadirkamanathan [40]. We also derive inspiration from the EM algorithm, a well-established maximum-likelihood approach repeatedly employed in JMs [26]. Consequently, we further develop the EM algorithm tailored for the SSM by seamlessly incorporating survival data as an additional set of accessible observations.

Expectation maximization algorithm
Our proposed model employs a discrete-time SSM to monitor the true underlying biomarkers. Here, the current states serve as a representation for predicting the patient's future health outcome. This framework introduces novel probability distributions within the observed likelihood expression. Similar to the conditional independence assumption commonly used in JMs, we maintain the independence between longitudinal and survival data given the values of the hidden states. Operating within the maximum-likelihood paradigm, our objective is to maximize the likelihood of the observed data:

$$p(\mathcal{D}; \theta) = \prod_{i=1}^{n} p(Y_i, T_i, \delta_i; \theta). \tag{3.1}$$

The above equation assumes that all patient data are independent of each other, and hence we can simply multiply their likelihoods. On the basis of assumption A.3, the hospital visiting process and censoring are assumed non-informative [38,44] and hence may be disregarded in equation (3.1). Using assumption A.4, the graphical model of the proposed framework is simplified as shown in figure 1, where $\delta_{i,j} = I(t_{i,j} < T_i \le t_{i,j+1}, \delta_i = 1)$, $I(\cdot)$ is equal to 1 if all conditions inside the brackets are satisfied and zero otherwise, and $x_{i,j}$ and $y_{i,j}$ are the hidden states and observed biomarkers vectors for patient $i$ at time step $j$, respectively. Assuming a linear Gaussian SSM and a proportional hazards survival model for the longitudinal and survival processes, respectively, the conditional probabilities used for this framework include the survival factor

$$p(t_{i,j+1}, \delta_{i,j} \mid x_{i,j}) = \exp\{\delta_{i,j}(\gamma^\top v_i + \alpha^\top H x_{i,j}) - \tau_{i,j} \exp\{\gamma^\top v_i + \alpha^\top H x_{i,j}\}\}, \tag{3.5}$$

royalsocietypublishing.org/journal/rsif J. R. Soc. Interface 21: 20230682

where $X_i$ and $Y_i$ are the sets of all values of the patient's true and observed biomarkers, respectively. Equation (3.7) shows the distributions for the general SSM. In using the outlined canonical representation of the SSM, this would require slight modifications, as shown in §1 of the electronic supplementary material, where we would require the probability distribution of the leading $m_y$ hidden values, labelled as $X_i^*$. Note that in using a discrete-time SSM, the biomarker measurements are structured to follow a regular time series with $\Delta t$ as the pseudo-sampling time, representing the chosen discrete-time points at which the true biomarkers are estimated. If the biomarker measurements are captured at irregular time intervals, these can be binned as observations at the chosen time steps. If no observations are made in a time interval, they are treated as missing measurements. Hence, for every patient, we observe $m_i = \lceil T_i / \Delta t \rceil$ measurement slots, some of which are typically missing.
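The binning of irregularly timed visits onto the regular $\Delta t$ grid can be sketched as follows (Python, hypothetical function name; the rule that a later visit overwrites an earlier one landing in the same bin is our simplifying assumption, not specified by the model):

```python
import math

def bin_observations(times, values, dt, T):
    """Map irregularly timed measurements onto a regular Delta-t grid.

    Visits falling in [j*dt, (j+1)*dt) become the observation at step j;
    empty intervals are treated as missing (NaN). If two visits land in
    the same bin, the later one overwrites the earlier.
    """
    m = math.ceil(T / dt)              # m_i = ceil(T_i / dt) measurement slots
    grid = [float('nan')] * m
    for t, y in zip(times, values):
        j = int(t // dt)
        if 0 <= j < m:
            grid[j] = y
    return grid
```

The resulting NaN entries are exactly the "missing measurements" handled by the missing-data modifications of the E step.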
The decomposition of equation (3.5) is made possible using assumption A.4, where the components of the survival probability distribution can be defined as shown in equations (3.8) and (3.9). A patient may have a maximum of one $\delta_{i,j}$ equal to 1, and it can be easily verified that the third line in equation (3.8) is equivalent to the first. Note that $T_i$ may be located between SSM time steps, and thus $\tau_{i,j} = \Delta t$ for all periods except the last one, where $\tau_{i,m_i} = T_i - m_i \Delta t$. An example diagram is shown in figure 2. Also, note that even though the time step for the last value is not regular, for convenience the final recorded time for patient $i$ is denoted as $t_{i,m_i+1} = T_i$.
This decomposition of the survival probability distribution can be explained as follows. If the survival data are split according to the respective time steps, then every element within the product of equation (3.5) identifies the distribution for the patient surviving or experiencing the event within that time frame. Hence, it can be expressed as the probability distribution including only survival data within the next time step, i.e. $p(t_{i,j+1}, \delta_{i,j} \mid x_{i,j})$. Note that this is a function of only the current hidden state values $x_{i,j}$. This decomposition is advantageous since, using the inherent properties of the SSM, the current observation also depends solely on the current hidden state values [11]. This leads to a simplification in the joint probability distribution.
Expectation step: the proposed observed data log likelihood is expressed as

$$\ell(\theta) = \sum_{i=1}^{n} \log \int p(X_i, Y_i^{(M)}, Y_i^{(O)}, T_i, \delta_i; \theta)\, \mathrm{d}X_i\, \mathrm{d}Y_i^{(M)}.$$

This is computationally difficult to maximize directly with respect to the parameters, owing to the integration being located inside the log function [45]. Thus, we resort to the EM algorithm, whose goal is to maximize a lower bound to the observed data log likelihood. This requires the expectation of the complete data log likelihood with respect to the unobserved parts of the complete dataset. Since some time steps may contain no biomarker measurements, some $y_{i,j}$ may be unobserved and should be treated as 'missing'. Let $Y_i^{(O)}$ represent the set of observed measurements, while $Y_i^{(M)}$ denotes the set of missing measurements for patient $i$ [11,46]. Then the required expectation is with respect to the distribution $p(X_i, Y_i^{(M)} \mid Y_i^{(O)}, T_i, \delta_i; \theta^{(k)})$ for every patient, where $k$ is the current iteration in the EM algorithm. From now on, we shall use the shorthand notation $E[\cdot]$ to represent $E_{X_i, Y_i^{(M)} \mid Y_i^{(O)}, T_i, \delta_i; \theta^{(k)}}[\cdot]$, unless stated otherwise. Also, all probability distributions are with respect to the unknown parameter values, and thus $p(\cdot\,; \theta)$ shall be expressed as $p(\cdot)$ unless stated otherwise. Hence, the expectation of the complete data log likelihood can be expressed using the distributions defined in equations (3.5)-(3.7); ignoring additive constants, maximizing it with respect to the parameters requires the evaluation of the expectations in equations (3.13)-(3.21), where $\hat{\mu}_{i,j}$ and $\hat{\Sigma}_{i,j}$ are the expected mean and variance of the hidden state values, $M_{i,j} = \hat{\Sigma}_{i,j} J_{i,j-1}^\top$, and $*$ here refers to the reduced vectors and matrices, retaining information only about the first $m_y$ states. These are used for the parameter updates of the canonical representation of the SSM. $I_{i,j}^{(M)}$ is the identity matrix with zeros corresponding to the rows and columns of observed measurements for time step $j$, and $r_{i,j} = I - V (V_{i,j}^{(O)})^\top (V^{(O,O)})^{-1} V_{i,j}^{(O)}$, where $V_{i,j}^{(O)}$ is a matrix extracting only the observed parts of the $y_{i,j}$ vector, and $V^{(O,O)}$ is the biomarker measurement error variance retaining the rows and columns corresponding to the observed parts of the vector $y_{i,j}$ [46].
The first five expectations in the list could have been derived using the Rauch-Tung-Striebel (RTS) smoother, had the observations been only longitudinal [47,48]. The presence of survival data demands the introduction of a suitable estimator to compute these expectations. Using Bayes' theorem, the posterior distribution of the hidden state at time $j$ given the observed data up to that time can be expressed as

$$p(x_{i,j} \mid y_{i,1:j}, t_{i,j+1}, \delta_{i,1:j}) \propto p(y_{i,j}, t_{i,j+1}, \delta_{i,j} \mid x_{i,j})\, p(x_{i,j} \mid y_{i,1:j-1}, t_{i,j}, \delta_{i,1:j-1}). \tag{3.23}$$

The probability distribution in the denominator may be ignored, as it is only dependent on the data and acts as a normalization constant, yielding the expression above, where the latter factor is the prediction distribution for the value of the hidden states at the current time step given all previous observations, while the former updates the posterior distribution of the current hidden states given new evidence. This correction term may be distributed into the product of $p(t_{i,j+1}, \delta_{i,j} \mid x_{i,j})$ and $p(y_{i,j} \mid x_{i,j})$ using the conditional independence assumption. The distributions for these factors can be extracted from equations (3.5) and (3.6), respectively. It was empirically observed that, for most cases, the output of equation (3.23) provides a distribution with a shape similar to a Gaussian distribution; hence, for computational efficiency, we approximate $p(x_{i,j} \mid y_{i,1:j}, t_{i,j+1}, \delta_{i,1:j})$ as a Gaussian distribution around its mode, with the variance computed from the Hessian matrix using the Newton-Raphson iterative procedure. Appendix B provides some examples of these empirical observations for a one-dimensional hidden state. These modifications can be incorporated into the RTS smoother with the slight amendments mentioned in the filtering steps to incorporate survival data, with the backward smoothing part of the algorithm remaining unchanged.
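This survival-corrected filter update can be sketched for a one-dimensional hidden state as follows (Python; function name and the scalar simplification are ours). The log-posterior combines a Gaussian one-step prediction $\mathcal{N}(m, P)$, a Gaussian observation likelihood, and the survival factor of equation (3.5); Newton-Raphson locates the mode, and the negative inverse Hessian supplies the variance of the Gaussian approximation:

```python
import math

def corrected_posterior_1d(m, P, y, V, alpha, log_h0, tau, delta, iters=25):
    """Mode and Hessian-based variance of the survival-corrected filter density.

    log p(x | data) = -(x-m)^2/(2P) - (x-y)^2/(2V)
                      + delta*alpha*x - tau*exp(log_h0 + alpha*x) + const.
    """
    x = m
    for _ in range(iters):
        h = math.exp(log_h0 + alpha * x)
        grad = -(x - m) / P - (x - y) / V + delta * alpha - tau * alpha * h
        hess = -1.0 / P - 1.0 / V - tau * alpha * alpha * h
        x -= grad / hess                  # Newton-Raphson step towards the mode
    return x, -1.0 / hess                 # Gaussian approximation (mean, variance)
```

A sanity check: with $\alpha = 0$ the survival factor is flat, and the result reduces to the exact product of the two Gaussians.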
Having dealt with expectations (3.13)-(3.17), we now turn to equation (3.18). This expectation cannot be computed in closed form, and thus a Laplace approximation can be employed [49, ch. 2]. The detailed derivations of expectations (3.13)-(3.18) are provided in §1 of the electronic supplementary material. The expectations in equations (3.19)-(3.21) are derived using the missing data modifications to the SSM, as explained by [11, ch. 4].
Maximization step: this involves finding the parameter values that set the gradient of the expectation of the complete data log likelihood to zero. The parameters of interest are $\theta = \{\bar{x}_1, W_1, A, W, V, \gamma, \alpha\}$. The first five parameters in $\theta$ are the state-space parameters, and they have closed-form solutions for their updates. Note that equations (3.25) and (3.27) require the updated solutions of (3.24) and (3.26) within their formulation, respectively [45]. Furthermore, equations (3.26) and (3.27) are tailored towards the canonical representation of the SSM [40]. Slight modifications are required for the general SSM, as shown in §1 of the electronic supplementary material. The updates for the survival parameters $\gamma$ and $\alpha$ have no closed-form solutions, and therefore we resort to the Newton-Raphson iterative procedure, which is formulated as

$$\begin{bmatrix} \gamma^{(K)} \\ \alpha^{(K)} \end{bmatrix} = \begin{bmatrix} \gamma^{(K-1)} \\ \alpha^{(K-1)} \end{bmatrix} - \left(H_f \big|_{\gamma^{(K-1)}, \alpha^{(K-1)}}\right)^{-1} \nabla f \big|_{\gamma^{(K-1)}, \alpha^{(K-1)}}, \tag{3.29}$$

where $K$ represents the iteration of the Newton-Raphson method, $\nabla f$ is the gradient vector and $H_f$ is the Hessian matrix. The derivations of all update equations are given in the first section of the electronic supplementary material. After the M step is completed, the algorithm restarts from the E step with the updated parameters, and the process repeats until convergence. The convergence criterion chosen for this algorithm is a difference of less than 5 × 10⁻⁴ across all parameters. If the algorithm executes 600 EM iterations without reaching this criterion, it is deemed to have failed to converge. Table 1 provides an overview of the estimation procedure for LSDSM. A more detailed summary of the EM algorithm for LSDSM is shown in table 6 in appendix C. The implementation of LSDSM was carried out using MATLAB (v.9.13.0 (R2022b)).
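Equation (3.29) is a standard Newton-Raphson update. The following toy sketch (Python, names and data ours) applies the same update to a single survival coefficient $\alpha$ under the piecewise-exponential likelihood $\sum_j [\delta_j \alpha x_j - \tau_j e^{\alpha x_j}]$, illustrating the iteration without the full LSDSM machinery:

```python
import math

def newton_raphson(grad, hess, theta0, iters=50, tol=1e-12):
    # theta_K = theta_{K-1} - hess^{-1} * grad, as in equation (3.29),
    # reduced here to a scalar parameter.
    theta = theta0
    for _ in range(iters):
        step = grad(theta) / hess(theta)
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Toy data (x_j, tau_j, delta_j): hidden-state value, interval length, event flag.
data = [(1.0, 1.0, 1), (0.5, 1.0, 0)]
# Gradient and Hessian of sum_j [delta_j*alpha*x_j - tau_j*exp(alpha*x_j)].
g = lambda a: sum(d * x - t * x * math.exp(a * x) for x, t, d in data)
h = lambda a: sum(-t * x * x * math.exp(a * x) for x, t, d in data)
alpha_hat = newton_raphson(g, h, 0.0)
```

The log-likelihood is concave in $\alpha$ (the Hessian is strictly negative), so the iteration converges from any reasonable starting point.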

Performance analysis
The predictive performance of survival models is typically assessed through two main criteria: calibration and discrimination. Calibration refers to the model's ability to represent and track the data appropriately, while discrimination focuses on concordance, i.e. the model's ability to distinguish patients who experience the event from those who do not. In this work, we focus on two performance metrics: the dynamic area under the receiver operating characteristic (ROC) curve (AUC) and the time-dependent Brier score (BS). These are formally explained by Blanche et al. [50]. The AUC is used for discrimination purposes, while the BS may capture both calibration and discrimination abilities.
Given that the proposed framework requires some dynamic data to make individualized predictions, these scores are not analysed at baseline. Hence, we require the use of landmarks and horizons. A landmark is a point in time at which the model makes a prediction, allowing the model to use all previous measurements. The horizon is the time span beyond the landmark at which the survival value of the patient is considered for predictions [50]. Landmark time and horizon time are denoted by $t_s$ and $t_h$, respectively. In the case of discrimination, the prediction is whether the patient is expected to experience the event or not by that time, based on some pre-defined threshold. For calibration, this is a simple comparison between the predicted survival value and the observed outcome of the patient at the time of interest.
When assessing the model's proficiency in tracking longitudinal data, the root mean square error (RMSE) is a widely used method. It provides insights into the average deviation of estimated trajectories from the true biomarker trajectories [51,52]. Hence, we employ the RMSE for evaluating longitudinal performance. Additionally, in simulated data, where the true survival curve for each patient is known, we employ this metric to assess the model's calibration capabilities. The equations for the area under the ROC curve, the BS and the RMSE are provided in §2 of the electronic supplementary material.
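For reference, simplified versions of two of these metrics can be written down directly (Python sketch; the inverse-probability-of-censoring weights that Blanche et al. use in the formal definitions are deliberately omitted here, so these are illustrative only):

```python
def brier_score(predictions, outcomes):
    """Time-dependent Brier score at a fixed landmark/horizon pair.

    predictions: conditional survival probabilities S(t_s + t_h | t_s)
    outcomes:    1 if the patient was event-free at t_s + t_h, else 0
    """
    return sum((o - p) ** 2 for p, o in zip(predictions, outcomes)) / len(predictions)

def rmse(estimates, truths):
    """Root mean square error between estimated and true trajectories."""
    n = len(estimates)
    return (sum((e - t) ** 2 for e, t in zip(estimates, truths)) / n) ** 0.5
```

A perfectly calibrated, perfectly sharp predictor attains a Brier score of zero; predicting 0.5 everywhere yields 0.25 regardless of the outcomes.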
A recursive solution for dynamic survival predictions using LSDSM is formulated in §3 of the electronic supplementary material. This approach draws upon concepts similar to those used in the E step of the estimation procedure. The following steps provide a concise overview of the solution. At each time step:

1. One-step forward prediction of the hidden states, $p(x_{i,m_i+j} \mid T_i > t_{m_i+j}, y_{i,1:m_i}, \delta_{i,1:m_i+j-1}; \theta)$.
2. Correction assuming the patient survived the current time step, $p(x_{i,m_i+j} \mid T_i > t_{m_i+j+1}, y_{i,1:m_i}, \delta_{i,1:m_i+j}; \theta)$.
3. One-step forward survival prediction, $p(T_i > t_{m_i+j+1} \mid T_i > t_{m_i+1}, y_{i,1:m_i}, \delta_{i,1:m_i}; \theta)$.

The described steps are reiterated until the intended horizon is attained. The initialization of this recursion relies on the final time step of the proposed RTS filter for each patient. It is important to highlight that steps 2 and 3 are approximated using the Laplace method. In the former, the distribution is approximated as a Gaussian, while in the latter, it is employed to approximate the integral. These approximations are akin to the procedures executed in the proposed modifications of the RTS filter, and can be executed concurrently.
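The flavour of this recursion can be sketched for a scalar AR(1) state (Python, names ours). Two simplifications relative to the paper should be flagged: the survival-conditioned correction of the state (step 2) is omitted, and the expected step hazard is computed with the log-normal moment identity $E[e^{\alpha x}] = e^{\alpha m + \alpha^2 P/2}$ instead of a Laplace approximation, so this is an illustrative approximation rather than the proposed estimator:

```python
import math

def predict_survival(a, w_var, alpha, log_h0, dt, x_mean, x_var, horizon_steps):
    """Approximate P(survive the next horizon_steps) given the filtered state.

    Per step: propagate the Gaussian state through the AR(1) dynamics,
    then accumulate the expected cumulative hazard over the interval.
    """
    log_S = 0.0
    m, P = x_mean, x_var
    for _ in range(horizon_steps):
        m, P = a * m, a * a * P + w_var          # one-step state prediction
        log_S -= dt * math.exp(log_h0 + alpha * m + 0.5 * alpha ** 2 * P)
    return math.exp(log_S)
```

With $\alpha = 0$ the state has no influence and the result reduces to the constant-hazard survival curve, a useful sanity check.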

Simulation study
This section focuses on creating simulations using the LSDSM framework, and assessing the model's capability to accurately estimate true parameters and effectively track the patient's true survival curve. The study encompasses an examination of the impact of missing observations.

Table 1. Overview of the estimation procedure using the EM algorithm for LSDSM.
For every patient i:
3. Predict the longitudinal biomarkers using the state space model and correct (filter) using the observed longitudinal and survival data.
4. Smooth the longitudinal biomarker trajectories based on the entire observation dataset.
5. Calculate the required expectations based on the smoothed results.
6. Update the model parameters using these expectations.

The selection of parameters for the true model, along with the parameters assumed fixed and known, is as follows. This setup incorporates a solitary biomarker trajectory, two states derived from recent history, an intercept value for the baseline hazard, a baseline covariate sampled from a normal distribution $\mathcal{N}(0, 1)$, and a direct influence of the current true biomarker value on the hazard function. It is noteworthy that, owing to the utilization of a canonical representation of the SSM, $\bar{W}$ is a scalar denoting the disturbance variance in the AR(2) model. The choice of variances ensures that both the disturbance and the measurement error possess zero mean, with standard deviations of 0.2 and 0.5, respectively.
This setup mirrors real-world scenarios where a patient's health might gradually decline over time, with occasional disturbances influencing either improvement or further deterioration in their health status, alongside potential measurement errors linked to recording biomarker observations.The trajectory dynamics resemble those observed in the walking distance trajectories of patients with PAH.
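As a concrete illustration, the generative side of such a simulation can be sketched as below. The AR(2) coefficients here are hypothetical placeholders (the paper's true parameter values are not reproduced); only the disturbance and measurement error standard deviations of 0.2 and 0.5, the canonical state space form, and the random masking of observations follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)
# canonical AR(2) state space form: x_t = 1.7 x_{t-1} - 0.72 x_{t-2} + w_t (assumed coefficients)
A = np.array([[1.7, -0.72], [1.0, 0.0]])
C = np.array([[1.0, 0.0]])                # observe only the current value
sd_w, sd_v = 0.2, 0.5                     # disturbance and measurement error sds (from the text)

def simulate_patient(n_steps=30, x0=(0.0, 0.0)):
    """Simulate one noisy biomarker trajectory over n_steps visits."""
    x = np.array(x0, dtype=float)
    ys = []
    for _ in range(n_steps):
        w = np.array([rng.normal(0.0, sd_w), 0.0])   # disturbance enters leading state only
        x = A @ x + w
        y = C @ x + rng.normal(0.0, sd_v)            # noisy biomarker observation
        ys.append(float(y[0]))
    return np.array(ys)

y = simulate_patient()
# randomly mask 50% of observations as missing (NaN), mimicking missed visits
mask = rng.random(y.size) < 0.5
y[mask] = np.nan
```

The masked trajectory then plays the role of the observed longitudinal data fed to the estimation procedure.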
To simulate scenarios resembling real-world data situations where observations are frequently missing (due to irregular clinic appointments, patient absenteeism, etc.), a certain percentage of observations are randomly assigned as not-a-number (NaN) values. The survival times and event indicators for each patient are calculated using the inverse transform sampling method as outlined by Walke [53]. This procedure emulates the piecewise exponential survival function, adhering to the stated assumptions. The survival status of each patient is assessed at every time step. Censoring times are drawn from a uniform distribution C*_i ~ U(10, 50). The maximum observational study duration for each patient is capped at 30 time steps, which can be likened to bi-monthly follow-ups over a span of 5 years. This simulation setup results in approximately 49% of patients experiencing the event, with an average survival time of around 21 time points.
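To make the survival side of the simulation concrete, a sketch of inverse transform sampling from a piecewise-constant hazard is given below, in the spirit of the scheme of Walke [53]. The hazard values are illustrative, not the ones used in the study.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_survival_time(hazards, dt=1.0):
    """Inverse transform sampling for a piecewise-constant hazard.

    `hazards` holds the hazard value on each interval [j*dt, (j+1)*dt).
    Solves S(T) = U, i.e. cumulative hazard H(T) = -log(U), by walking
    through the intervals until the target cumulative hazard is reached.
    """
    target = -np.log(1.0 - rng.uniform())        # exponential(1) draw
    cum = 0.0
    for j, h in enumerate(hazards):
        if cum + h * dt >= target:
            return j * dt + (target - cum) / h   # event occurs inside interval j
        cum += h * dt
    return len(hazards) * dt                     # survived all intervals (censored)

# illustrative hazard rising over 30 bi-monthly steps as the biomarker declines
hazards = 0.01 * np.exp(0.05 * np.arange(30))
times = np.array([sample_survival_time(hazards) for _ in range(1000)])
events = times < 30.0                            # event indicator vs administrative censoring
```

With these illustrative hazards, roughly half the simulated patients experience the event within the 30-step window, in line with the proportion quoted in the text.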
The training dataset comprises 500 simulated patients. The missing observation percentages are set at 0%, 25%, 50% and 75%. For each configuration, 100 simulations are generated. Table 2 presents summary statistics for the estimated parameters within this simulation study. This table records the average values across all simulations for each configuration, alongside the sample standard deviation. Notably, the model successfully converges in all runs during these simulations. On average, convergence is achieved within 26, 11, 17 and 50 EM iterations for the 0%, 25%, 50% and 75% missing observations configurations, respectively. It is observed that all true parameter values lie within 1 standard deviation of the estimated parameter samples, except for the survival parameters in the 75% missing observations configuration, where bias is evident. Therefore, it is advisable to select the Δt hyperparameter, signifying the time step for the SSM, such that the fraction of missing longitudinal observations remains within 50%. Histograms depicting the estimated parameters' distributions are presented in figure 7 in appendix D.
We also evaluate the RMSE of both the longitudinal and survival trajectories to assess the model's ability to accurately track these crucial indicators. These assessments are conducted on both the training and testing data to ensure that the estimated model does not suffer from overfitting, and can generalize to unseen data. Furthermore, we examine the first hidden state and the corresponding survival curve for each simulation. Additionally, we calculate the RMSE for the curves using the true and estimated model parameters for comparison. This is done to mitigate the RMSE variation observed across all simulations due to the inherent stochasticity. The results are summarized in table 3. It is evident that the RMSE values obtained using the estimated model closely resemble those obtained with the true model parameters when tracking longitudinal and survival data, and thus are within an acceptable level.
Notably, in the 75% missing observations scenario, even when the correct parameters are employed, the model still exhibits difficulty in accurately tracking the survival curve, with an average error exceeding 5% in survival probability. This is due to the larger uncertainty in the true hidden state values with more sequential missing observations. Despite the apparent bias in the survival parameters for the estimated model in this scenario, the RMSE values for the survival trajectories are nearly identical to those obtained using the true model. As evidenced in figure 8 in appendix E, with the limited number of observed measurements, multiple solutions exist with different parameter values providing similar results, despite the observed bias in the survival parameters.
These observations underscore the importance of selecting the hyperparameter Δt appropriately, such that the percentage of missing measurements remains at around 50% or below.

Analysis on a real dataset
This work is primarily motivated by the need for improved survival analysis for patients with PAH. PAH is a rare and life-threatening condition characterized by vascular proliferation that results in elevated pressure in the pulmonary artery [10]. This often leads to reduced cardiac output, which in turn manifests as limited exercise capacity in patients [9,10]. The dataset employed in this study was collected at the Sheffield pulmonary vascular disease unit. Patients at this unit undergo regular follow-ups, typically recommended at intervals of three to six months [54]. Among the various tests conducted, the exercise test is frequently administered, with a focus on either the 6-minute walk test or the incremental shuttle walk test. Our investigation primarily centres on the latter test, examining how the trajectories of walking distances across several exercise tests relate to the survival of PAH patients.
Currently, the prevailing method for assessing the risk of death in PAH patients involves using a risk score, such as the REVEAL 2.0 risk score and the European Society of Cardiology/European Respiratory Society (ESC/ERS) risk-assessment model [54]. However, these scores do not account for longitudinal trends in relevant biomarkers. Some initial efforts have been made to incorporate these trends, but they are inherently limited, often involving simple checks on changes in walking distance over the past year and defining thresholds to adjust patient risk [10]. To the best of our knowledge, there has been no prior study that jointly models longitudinal and survival data within this dataset.
The data were collected at the Sheffield pulmonary vascular disease unit and include a total of 5391 patients. This dataset encompasses patients with pulmonary hypertension beyond just those classified as having PAH. After specifically selecting PAH patients and ensuring that each patient had undergone at least two exercise tests, the dataset was narrowed down to 1105 patients. On average, this cohort was approximately 59.5 years old at the time of diagnosis (±15.5 years), with an average follow-up period of 4.25 years (±2.79 years). Among these patients, 376 (34%) experienced the event of interest, which in this case is death, during the observed 10-year period. The average number of measurements per patient in this dataset was approximately 6.6 (±4.1) observations. The median time between visits was six months, with a mean period of 7 (±5.9) months. The majority of patients were females, accounting for a total of 788 (71.3%). This is an expected observation given that prevalence in females is higher [54]. A large portion of the patients were classified in World Health Organization (WHO) functional classes III and IV (1043), leaving only 9.4% of the patients within classes I or II.
Our objective is to identify patients who face a higher risk of death. This enables clinicians to anticipate and provide personalized treatment to those at greater risk, potentially prolonging their survival time. In our model, we consider age, sex and a binary variable indicating whether patients belong to WHO functional classes I and II or III and IV as baseline covariates. We model the walking distance as the longitudinal trajectory using the AR(2) configuration described in previous sections. Upon initial inspection of the longitudinal data, the dynamics appear relatively straightforward, prompting the selection of a low AR order. Various AR orders were evaluated, and it was found that the second order yielded superior results while maintaining parsimony. The observed walking distance ranges from 0 to 1020 m, truncated at 1000 m for easier interpretability of coefficients, and normalized to a scale ranging from 0 to 1. This truncation affected only 21 data points. Hence, a normalized value of 0.5 now corresponds to 500 m in the response variable rather than 510 m. This simplifies interpretation when analysing hazard ratios. Our SSM consists of two states, with one tracking the other. We assume that the hazard function is influenced by the baseline covariates and only the current value of the true walking distance, with the lagging state not affecting a patient's survival, thus H = [1 0]. Alternative configurations were explored, yet none yielded improvements in the results. Therefore, for the sake of parsimony, this choice was ultimately selected. We divided the available data such that 70% of the patients are in the training dataset, resulting in 773 patients for training, while the remaining 332 patients constitute the testing dataset for evaluating the model's performance on unseen data.
The estimated parameters, denoted as θ, encompassed the following variables. The matrix C was held fixed at [1 0], indicating a direct linkage between the observed measurements and the true underlying biomarker value with some associated measurement error. The hyperparameter Δt was configured to a duration of six months, aligning with the median time between visits and falling within the recommended timeframe for regular follow-up of PAH patients. This resulted in approximately 45% missing observations and an average of five observations per patient, which falls within the recommendation outlined in the simulation study of retaining missing observations within 50%. The model achieved convergence within 23 EM iterations. From the deduced parameter values, we can infer that the population's walking distance tends to gradually decrease over time. The transition matrix further reveals that, on average, the rate of change decreases to approximately 0.4 times the rate of change at the previous time point. This suggests that while the walking distance is gradually declining, the rate of decline decelerates over time, indicating a trend towards more stability in the walking distance. The model estimates the process and measurement error standard deviations to be 0.023 and 0.041, roughly equivalent to 23 and 41 m, respectively. For a new patient with no initial values, the model assumes an approximate starting point of 190 m, with a standard deviation of 135 m. While this standard deviation is relatively large, it reflects the lack of prior information about this particular patient. Appendix G shows some examples of patient longitudinal observations together with their expected leading hidden state trajectories.
The patient's age (γ1 = 0.0337), sex (γ2 = 0.51) and membership in WHO functional classes III or IV (γ3 = 1.03) make proportionally high contributions to the hazard function. Specifically, a patient who is 10 years older has a hazard ratio of 1.4, while being male increases the hazard, with a hazard ratio of 1.67. Belonging to the more severe WHO functional classes results in a hazard ratio of 2.79 compared with those in lower functional classes. Additionally, we observe a notable effect of the true walking distance on the hazard function (α = −8.23), with a hazard ratio of 0.44 for a patient who walks an extra 100 metres. In summary, we illustrate the impact of baseline covariates and initial walking distance on a patient's survival with four plots in figure 3. These findings confirm the effects of covariates on the hazard, aligning with risk scores for PAH [8].
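These hazard ratios follow directly from exponentiating the estimated coefficients. The short check below reproduces them; small discrepancies in the last digit (e.g. 2.80 versus the quoted 2.79) stem from the published coefficients being rounded.

```python
import math

# estimated log-hazard coefficients reported in the text
gamma_age, gamma_sex, gamma_who = 0.0337, 0.51, 1.03
alpha = -8.23  # effect of the true (normalized) walking distance

hr_age10 = math.exp(10 * gamma_age)   # hazard ratio for being 10 years older
hr_male = math.exp(gamma_sex)         # male vs female
hr_who = math.exp(gamma_who)          # WHO class III/IV vs I/II
hr_100m = math.exp(alpha * 0.1)       # +100 m = +0.1 on the normalized distance scale
```

This also illustrates why the distance is normalized: a 100 m improvement maps cleanly to a 0.1 change in the covariate entering the hazard.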
Next, we assess the model's performance on the test dataset. We evaluate the BS and the AUC for various horizons and landmarks. Specifically, we examine landmarks at 1, 2, 3 and 4 years, and for each of these landmarks, we analyse evenly spaced horizons ranging from 6 months to 5 years. The results are presented in figure 4. From these plots, it is evident that as the landmark time increases, the model's accuracy also improves. Notably, we observe a slightly lower average BS and a higher AUC for the 4-year landmark compared with the 1- and 2-year landmarks. This implies that predictions become more accurate as patients are tracked over longer durations, which aligns with inherent expectations.
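For reference, the BS at a fixed horizon is the mean squared difference between the observed event indicator and the predicted event probability. The sketch below uses toy values and omits the inverse-probability-of-censoring weighting that a proper evaluation with censored data requires.

```python
import numpy as np

def brier_score(surv_prob, event_by_horizon):
    """Brier score at a fixed horizon (censoring weights omitted in this sketch)."""
    p_event = 1.0 - np.asarray(surv_prob, dtype=float)   # predicted risk of the event
    y = np.asarray(event_by_horizon, dtype=float)        # 1 if event occurred by the horizon
    return float(np.mean((y - p_event) ** 2))

# toy predicted survival probabilities and observed outcomes at the horizon
bs = brier_score([0.9, 0.8, 0.4, 0.3, 0.6], [0, 0, 1, 1, 0])
```

Lower values are better; a model that always predicts the event prevalence gives a natural baseline to compare against.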
Finally, we assess the model's performance in comparison to a risk score developed for risk stratification of PAH patients. Since our dataset predominantly employs the incremental shuttle walk test rather than the 6-minute walk test, we compare our model with the approach by Billings et al. [10], which uses only the walking distance as a prognostic covariate and employs landmarking techniques for evaluation at different times.
Billings et al. [10] also employ thresholding techniques to classify patients and evaluate sensitivity and specificity values for various thresholds, which are then used to compute the area under the ROC curve. To generate this curve, we consider threshold values at 10 m intervals, ranging from 0 to 1000 m. We perform these comparisons across the same landmarks as in the previous analysis, ensuring that only patients who underwent an exercise test within two months prior to the landmark time are included, reducing the effective test sample size. We evaluate the performance metric for horizons of 1, 2 and 3 years, as was done by Billings et al. [10]. This was performed using R (v. 4.3.1).
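The threshold-based construction of the ROC curve can be sketched as follows. The walking distances and outcomes are toy values, and a lower distance is treated as indicating higher risk.

```python
import numpy as np

def roc_auc_from_thresholds(values, labels, thresholds):
    """ROC curve from explicit classification thresholds, with trapezoidal AUC.

    `labels` is 1 for patients who died within the horizon; a value at or
    below the threshold is classified as high risk.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_pos, n_neg = labels.sum(), (labels == 0).sum()
    tpr, fpr = [], []
    for thr in thresholds:
        pred = values <= thr
        tpr.append(np.sum(pred & (labels == 1)) / n_pos)   # sensitivity
        fpr.append(np.sum(pred & (labels == 0)) / n_neg)   # 1 - specificity
    # sort lexicographically so ties in FPR are ordered by TPR
    order = np.lexsort((tpr, fpr))
    fs, ts = np.asarray(fpr)[order], np.asarray(tpr)[order]
    return float(sum((fs[k] - fs[k - 1]) * (ts[k] + ts[k - 1]) / 2.0
                     for k in range(1, len(fs))))

# toy data: shorter walking distances associated with death within the horizon
dist = [120, 200, 340, 410, 530, 660, 700, 850]
died = [1, 1, 1, 0, 0, 0, 1, 0]
auc = roc_auc_from_thresholds(dist, died, thresholds=range(0, 1001, 10))
```

Because the thresholds sweep the full 0 to 1000 m range at 10 m intervals, the resulting staircase AUC coincides with the rank-based (Mann-Whitney) AUC when there are no ties.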
To ensure a fair comparison, we retrain LSDSM without incorporating any baseline covariates. We also employ thresholding techniques to compute the AUC; however, these thresholds are based on the forecasted survival values at the horizon of interest rather than the actual walking distance. Furthermore, the model is limited to obtaining performance metrics solely for patients used in the comparison model. In order to prevent a significant reduction in the size of the testing sample, we divided the data to allocate 50% of the patients to the testing dataset, which necessitated retraining LSDSM on a smaller dataset. This led to minor changes in the parameter values.
The results are illustrated in figure 5. It is evident that in nearly all scenarios, both methods produce comparable outcomes, with a slight advantage in favour of LSDSM. Notably, LSDSM outperforms the risk score method in all configurations except one within the 1-year horizon. This is significant because the 1-year horizon is often the most clinically relevant for healthcare providers treating PAH patients [8,54]. This finding underscores the promising potential of employing a framework jointly modelling longitudinal and survival data for dynamic survival predictions in PAH patients, as opposed to static models that rely on landmarking. An additional advantage of the proposed model is its efficiency in generating results. Unlike the risk scores, which necessitated the creation of 12 separately trained models, our model achieved all results using a single integrated framework. However, it is crucial to exercise caution when interpreting these results due to the limited number of patients available under the stringent configuration made for a fair comparison. Specifically, there were only 117, 85, 67 and 40 patients for landmarks at 1, 2, 3 and 4 years, respectively, within the test dataset.

Discussion
The LSDSM presents a unique capability to concurrently monitor and forecast longitudinal and survival data using an SSM. It leverages system identification techniques offered by SSMs while integrating survival information into state and parameter estimation for effective hazard modelling. This framework provides interpretability for the dynamics through coefficients that describe the evolution of biomarkers and their interdependencies. The incorporation of SSMs into the longitudinal process can be especially advantageous when tracking physiological trajectories that follow differential or difference equations over time [55].
The simulation studies provided valuable insights into the performance of the proposed model under varying percentages of missing observations. The simulations were set up with different missing data configurations rather than adjusting time step values to understand bias effects without changing parameters. This is effectively equivalent to changing the time step hyperparameter, where increasing values of missing data may correspond to a shorter time step. Thus, the time step choice was also indirectly evaluated here. It was observed that accurately estimating survival parameters becomes challenging when dealing with 75% missing observations in the longitudinal trajectory. Therefore, it is advisable, when selecting the hyperparameter Δt, to keep the fraction of missing longitudinal observations within approximately 50%.

In the practical application of LSDSM to predict survival probabilities for patients with PAH, who typically undergo multiple exercise tests during their follow-up and treatment, the longitudinal data were derived from the walking distance recorded in these tests. The analysis revealed that, on average, the walking distance trajectory gradually decreased over time, resulting in a continuous increase in the hazard rate. Additionally, factors such as age, sex and WHO functional classes were found to influence the hazard, aligning with previous findings by other researchers in this field [8,54]. Furthermore, it was confirmed that longer observation periods for patients lead to improved survival predictions, supporting the intuitive notion that more extended patient histories yield better predictions.
Moreover, LSDSM was compared to a risk score that relied solely on walking distance as a prognostic factor. The results demonstrated that LSDSM holds the potential to enhance predictive accuracy compared with this risk score. Nevertheless, further research is warranted to validate this hypothesis, potentially through a comparison with a more comprehensive risk score that necessitates a richer dataset.
The data in clinical settings, such as those for patients with PAH, are often sparse due to infrequent patient visits. However, this sparsity does not necessarily imply missing information. Missing information arises as measurements within discretized time intervals are aggregated. In PAH datasets, instances where two measurements fall within the same time interval are rare, and when they do occur, they typically have similar values, minimizing information loss. Continuous-time survival models are often based on conventional basis functions, which can interpolate between observations. System dynamics can be represented either in basis function form or in state space parametrized form, and for linear dynamical systems there is a mathematical equivalence between continuous-time models and their discrete counterparts. The equivalence only applies at the discrete time points at which there are deemed to be observations or where predictions are to be made. This would be natural if the patients are typically monitored on a regular basis, as is suggested in the PAH guidelines for optimal care; hence the proposed time interval of six months within this application of LSDSM [54].
This model exhibits versatile applicability across various domains. It can serve as a potent tool for screening individuals in at-risk populations who undergo recurrent assessments over time. Furthermore, it finds utility in evaluating treatment responses to specific therapies, leveraging longitudinal follow-up data to gauge efficacy. Additionally, this model contributes to the fine-tuning of mortality predictions, which can be instrumental in guiding counselling and treatment decisions. Lastly, the LSDSM holds promise in the realm of patient prioritization for transplantation, exemplified in cases like PAH. By assessing the mortality risk of patients, it aids in the selection of candidates for urgent transplant procedures. This approach stands in contrast to existing empirical methods that rely on disease severity, often lacking a structured risk assessment.
If the AR order remains constant, then the number of parameters grows quadratically with an increasing number of biomarker trajectories. This implies that, as expected, a large dataset may be necessary for highly intricate models. However, it is worth noting that the computational time for the estimation procedure should remain relatively stable, owing to the computational efficiency of the RTS smoother, which is a significant advantage of LSDSM. The number of parameters is given in §4 of the electronic supplementary material. To reduce the number of estimated parameters, expert knowledge can be leveraged. If certain biomarkers are known to be statistically independent of others, a neighbourhood structure can be introduced. This allows the corresponding parameters to be held at zero throughout the estimation process, as demonstrated by Dewar & Kadirkamanathan [40].
The proposed model has a few limitations. First, LSDSM assumes a regular time series, and the approximations made may result in reduced accuracy. More specifically, since it bins observations into fixed intervals, occasional measurements falling between smaller time intervals may be disregarded. This also introduces the additional hyperparameter of choosing the time step Δt. Second, LSDSM as derived here assumes a constant baseline hazard function, restricting the flexibility granted by other functions. However, this variation is implicit through the variation of the hidden states. That being said, extensions to include time-varying population mean survival curves can be readily made. Third, the proposed model only accounts for linear dynamics, limiting the complexity that can be captured in the longitudinal process. In this regard, the estimation procedure was greatly simplified, allowing us to use a more computationally efficient estimation method. Nonlinear dynamics may be included in this established framework; however, linear models do retain the capability to capture the approximate dynamics in the sparse data with fewer parameters. Fourth, only controllable and observable biomarker dynamics can be captured by the LSDSM approach. Incorporating unobservable and/or uncontrollable biomarkers will require strong assumptions and a priori knowledge of the mechanistic model, as carried out by Desmée et al.
[56] with the JM. Fifth, this model assumes that all patients follow a single set of population dynamics with some disturbances. While individualized predictions are still possible for new patients, the heterogeneity within the patient dataset may not allow us to extract the exact disturbances within every patient that directly reveal the comprehensive deviations from the population. With limited data, this assumption helps with capturing the common effects that are seen across this cohort of patients. Lastly, another limitation emerges from the simulation study performed, which consists of a single experimental setup covering a single biomarker trajectory based on an AR(2) process, with a single baseline covariate.
Future endeavours could address several of these limitations. For instance, exploring alternative parametric functions, such as the Weibull distribution, for the baseline hazard function could be considered. This would involve adjusting the hazard function accordingly and determining equations that optimize the expectation of the complete data log likelihood with respect to these new parameters. Incorporating nonlinear longitudinal biomarkers might involve adapting the dynamics and observation equations to suit the anticipated distributions. This would allow the framework to model more complex patterns and perhaps improve longitudinal and survival predictions. Moreover, modelling population variation may be accomplished by introducing further random effects into the model. It is also worth noting that the current model naturally accommodates the inclusion of additional longitudinal biomarkers by adapting the model parameters. The LSDSM approach could uncover inherent correlations among these biomarkers. However, this study did not delve into the analysis of multiple biomarkers, which is an avenue for future exploration and can be pursued readily. Future work may also address the simulation study limitation by analysing more complex setups, including multiple longitudinal biomarkers and additional baseline covariates affecting the hazard function.
LSDSM presents a promising avenue for jointly modelling longitudinal and survival data. Exploring the application of LSDSM to diverse datasets, particularly those featuring regular time-series data from sources like wearable technology, holds substantial research potential and would be a compelling avenue for further investigation. The estimation procedure is developed within a maximum-likelihood framework. Our proposed model has been applied to both real-world and synthetic datasets, yielding encouraging results in terms of survival predictions for PAH patients. It exhibits an advantage over a conventional risk score, the prevalent method used for PAH patient survival analysis. Furthermore, LSDSM offers flexibility in terms of complexity, although this scalability may necessitate larger datasets for accurate parameter estimations. In conclusion, LSDSM provides an alternative approach to the JM for longitudinal and survival data, explaining dynamics as a function of past true biomarker values. This unique perspective opens up significant opportunities for further improvements, drawing from the extensive research available in SSMs.

Appendix A. Table of notation

See table 4.

Appendix B. Gaussian-approximated filter distribution
Table 5 illustrates four instances of varied configurations. Figure 6 displays the true distribution of equation (3.22) alongside a Gaussian approximation achieved through Laplace approximation using the Newton-Raphson method. It is evident that the approximation closely resembles the true distribution, exhibiting only slight differences, primarily noticeable near the tails of the distribution.
It was observed through empirical analysis that, in many instances, the outcome of equation (3.22) exhibits a distribution with a shape resembling that of a Gaussian distribution. Furthermore, on expansion, this equation bears a resemblance to a negative quadratic in the exponent, reinforcing the notion that this function behaves akin to a normal distribution. Therefore, for computational efficiency, we approximate p(x_{i,j} | y_{i,1:j}, t_{i,j+1}, d_{i,1:j}) as a Gaussian distribution centred around its mode, with the variance determined from the inverse of the Hessian matrix using the Newton-Raphson iterative method. Although this approximation involves a slight sacrifice in accuracy, it allows for the advantageous computational simplicity of working with Gaussian distributions, streamlining the state estimation process. With the one-step prediction m_{i,j} = A m̂_{i,j−1}, the relevant steps of the filtering algorithm are:

12. Find m̂_{i,j} = arg min_{x_{i,j}} g(x_{i,j}) and Ŝ_{i,j} = (H_g |_{x_{i,j} = m̂_{i,j}})^{−1}, where H_g is the Hessian matrix of g(x_{i,j}).

20. Here, g(x_{i,j}) = −d_{i,j} γᵀ v_i − d_{i,j} αᵀ H x_{i,j} + τ_{i,j} exp{γᵀ v_i} exp{αᵀ H x_{i,j}} + (1/2) (x_{i,j} − m_{i,j})ᵀ S_{i,j}^{−1} (x_{i,j} − m_{i,j}).
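A minimal numerical sketch of this Newton-Raphson mode search is given below, using the g(x_{i,j}) defined in step 20. All numerical values (predicted moments, association parameter, covariate effect) are illustrative assumptions, not the fitted ones.

```python
import numpy as np

m = np.array([0.5, 0.55])        # predicted state mean m_{i,j} (assumed)
S = 0.02 * np.eye(2)             # predicted state covariance S_{i,j} (assumed)
H = np.array([[1.0, 0.0]])       # links hidden states to the hazard
alpha = np.array([-3.0])         # association parameter (assumed)
gv = 0.2                         # gamma' v_i for this patient (assumed)
tau, d = 1.0, 0                  # interval length; d = 0 means the patient survived it

Si = np.linalg.inv(S)
x = m.copy()
for _ in range(25):              # Newton-Raphson iterations on the convex g
    cumh = tau * np.exp(gv) * np.exp(float(alpha @ (H @ x)))   # cumulative hazard term
    grad = -d * (H.T @ alpha) + (H.T @ alpha) * cumh + Si @ (x - m)
    hess = np.outer(H.T @ alpha, H.T @ alpha) * cumh + Si      # Hessian H_g
    x = x - np.linalg.solve(hess, grad)

mode, cov = x, np.linalg.inv(hess)   # Gaussian approximation N(mode, cov)
```

With a negative association (a higher walking distance lowers the hazard), surviving the interval pulls the corrected mean slightly above the predicted one, and the corrected variance shrinks relative to S.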

Figure 1 .
Figure 1. Graphical model for the state space model for longitudinal and survival data, representing the causal relations influencing the latent complete patient trajectories. White circles indicate latent variables, while shaded circles are observed variables.

Figure 2 .
Figure 2. An example showcasing how τ_{i,j} is calculated for every observation period.

Figure 3 .
Figure 3. Survival trajectories for patients with different age, sex, WHO functional class and initial walking distance.

Table 2 .
Parameter estimation for the simulation study, showing the average and sample standard deviation across several configurations.

Table 3 .
Average RMSE across all simulations for the training and testing datasets, for the estimated and the true models compared with the simulated true hidden values and survival trajectories, across several configurations.

Table 4 .
Table of notation for the proposed model framework.

m_x: number of hidden states considered in the state space model
m_y: dimension of the biomarker measurements vector
n: number of individuals in the study
t_i = {t_{i,j} : t_{i,j} ≤ T_i, j = 1, …, m_i}: set of timings of biomarker measurements for individual i
y_{i,j} = y_i(t_{i,j}) ∈ R^{m_y×1}: biomarker measurements vector for patient i at time t_{i,j}
Y_i = {y_{i,j} : j = 1, …, m_i}: set of all biomarker measurements for individual i
x_{i,j} = x_i(t_{i,j}) ∈ R^{m_x×1}: true biomarker values for patient i at time t_{i,j}
X_i = {x_{i,j} : j = 1, …, m_i}: set of true underlying biomarker values for individual i
Δt: fixed time step for the state space model such that jΔt = t_{i,j}
τ_{i,j}: time step for period j of patient i: τ_{i,j} = Δt for all j ≠ m_i, and τ_{i,m_i} = T_i − m_i Δt
w_{i,j} ∈ R^{m_x×1}: model errors vector for patient i at time t_{i,j}
v_{i,j} ∈ R^{m_y×1}: biomarker measurement errors vector for patient i at time t_{i,j}
A ∈ R^{m_x×m_x}: state matrix governing the dynamics in the state space model
C ∈ R^{m_y×m_x}: observation matrix dictating the observed states in the state space model
W ∈ R^{m_x×m_x}: covariance matrix of the disturbance term
V ∈ R^{m_y×m_y}: covariance matrix of the biomarker measurement error
x̄_1 ∈ R^{m_x×1}: initial conditions vector of the hidden states
W_1 ∈ R^{m_x×m_x}: covariance matrix for the initial conditions of the hidden states
h_i(t): hazard function for individual i
v_i ∈ R^{m_γ×1}: baseline covariates vector for individual i
γ ∈ R^{m_γ×1}: corresponding coefficients vector for the effects of baseline covariates on survival
α ∈ R^{m_α×1}: parameters relating the longitudinal and survival processes
H ∈ R^{m_α×m_x}: matrix taking a linear combination of the hidden states to relate to the hazard function
D_i^{(o)} = (T_i, d_i, Y_i, t_i, v_i): observed data for individual i
D_i^{(c)} = (T_i, d_i, Y_i, t_i, v_i, X_i): complete data of LSDSM for individual i
θ = {x̄_1, W_1, A, W, V, γ, α}: parameter set to be estimated for the proposed joint model
Appendix C. Expectation-maximization algorithm for LSDSM

See table 6.

Table 6 .
Summary of EM algorithm for LSDSM.