The Partial Least Squares Spline Model for Public Health Surveillance Data

Factor discovery in public health surveillance data is a crucial and scientifically challenging problem with wide applications in research studies. The main focus of this study is to introduce an improved survival regression technique for use in the presence of multicollinearity; to this end, a partial least squares spline modeling approach is proposed. The proposed method is compared with the benchmark partial least squares Cox regression model in terms of accuracy based on the Akaike information criterion. Further, the optimal model is applied to a real data set of infant mortality obtained from the Pakistan Demographic and Health Survey to assess the significant risk factors of infant mortality. The recommended features contain key information about infant survival and could be useful in public health surveillance-related research.


Introduction
Survival analysis is a common regression modeling approach used for prognostic analysis, as it examines the relationship between the covariates, the response, and the time until the occurrence of an event. The framework for survival analysis is commonly based on the Cox proportional hazard (PH) model because the hazard ratio (HR) can be computed without estimating the baseline hazard function. The Cox PH model maximizes the partial likelihood function, which estimates the regression parameters but not the baseline hazard function. Consequently, the survival probability and the hazard rates can be estimated only at the event times and not for long-term evaluations [1].
Parametric survival models specify a probability distribution to estimate the absolute measure of effect in a time-to-event response. A common specification in these models is the Weibull distribution for the baseline hazard $h_0(t)$. A parametric survival model with a scale parameter ($\lambda > 0$), a shape parameter ($\gamma > 0$), and time ($t$) is defined as $h_0(t) = \lambda \gamma t^{\gamma - 1}$. For the absolute measure of effect, the Weibull distribution can generally facilitate accurate predictions for constant, monotonically decreasing, or monotonically increasing hazards. However, for more complex hazard functions, a parametric survival model specifying a Weibull function will lead to inaccurate predictions [2].
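As a minimal sketch of the Weibull specification above, the following Python functions evaluate the baseline hazard together with the cumulative hazard and survival function that follow from it by integration. The function names are illustrative and not taken from any package.

```python
import math


def weibull_hazard(t, lam, gamma):
    """Baseline hazard h0(t) = lam * gamma * t**(gamma - 1)."""
    return lam * gamma * t ** (gamma - 1)


def weibull_cumulative_hazard(t, lam, gamma):
    """Cumulative hazard H0(t) = lam * t**gamma, the integral of h0 from 0 to t."""
    return lam * t ** gamma


def weibull_survival(t, lam, gamma):
    """Survival function S0(t) = exp(-H0(t))."""
    return math.exp(-weibull_cumulative_hazard(t, lam, gamma))
```

With gamma = 1 the hazard is constant, with gamma > 1 it increases monotonically, and with gamma < 1 it decreases monotonically, which are the only three shapes this specification can represent.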
The Royston and Parmar model is an advanced type of flexible parametric survival model featuring a restricted cubic spline to model more complex hazard shapes and to estimate a continuous function [3]. This model considers the baseline log cumulative hazard function on the log timescale. For the Weibull distribution, this function is $\ln H(t \mid z_i) = \ln(\lambda) + \gamma \ln(t) + \beta z_i$, where $\ln(\lambda) + \gamma \ln(t)$ represents the baseline log cumulative hazard with respect to log time and $\beta z_i$ denotes the linear predictor formed from the vector of predictors $z_i$. This function can be generalized as $\ln H(t \mid z_i) = \ln[H_0(t)] + \beta z_i$, where $\ln[H_0(t)]$ describes a general baseline log cumulative hazard function. Royston and Parmar used a restricted cubic spline to model this baseline function on the log timescale. A restricted or natural cubic spline has the additional restriction that the first and last subfunctions beyond the boundary knots are linear instead of cubic. A restricted cubic spline can be mathematically expressed as [15] $s(x) = \eta_0 + \eta_1 x_1 + \eta_2 x_2 + \cdots + \eta_{K-1} x_{K-1}$, where $K$ denotes the number of knots, $x_i$ represents the derived variables, and $\eta_i$ describes the coefficients for these variables. This spline can fit complex shapes of baseline log cumulative hazard functions, improving the stability of the function [4].
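The derived variables of a restricted cubic spline can be illustrated with a short Python sketch. It assumes the usual truncated-power construction with two boundary knots and any number of interior knots; the function names and the argument layout are hypothetical and chosen only for illustration.

```python
import math


def rcs_basis(x, knots):
    """Derived variables [x, v_1(x), ..., v_m(x)] of a restricted cubic spline.

    knots[0] and knots[-1] are the boundary knots; the rest are interior knots.
    Each v_j is built so that the spline is linear beyond the boundary knots.
    """
    k_min, k_max = knots[0], knots[-1]

    def pos3(u):
        # Truncated cubic: (u)_+^3
        return max(u, 0.0) ** 3

    basis = [x]
    for k_j in knots[1:-1]:
        lam_j = (k_max - k_j) / (k_max - k_min)
        basis.append(pos3(x - k_j)
                     - lam_j * pos3(x - k_min)
                     - (1.0 - lam_j) * pos3(x - k_max))
    return basis


def log_cumulative_hazard(t, eta, knots, linear_predictor=0.0):
    """ln H(t | z) = eta_0 + sum_i eta_i * x_i + beta*z, with x = ln(t)."""
    derived = rcs_basis(math.log(t), knots)
    return eta[0] + sum(e * d for e, d in zip(eta[1:], derived)) + linear_predictor
```

With only the two boundary knots, the basis reduces to the single variable ln(t), so choosing eta = [ln(lambda), gamma] recovers the Weibull log cumulative hazard given above.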
Multivariate survival regression models assume that there is no multicollinearity among covariates. Most survival methods are not appropriate for modeling large data sets with correlated covariates. Partial least squares (PLS) regression is considered a good alternative to traditional regression methods in the presence of multicollinearity [5,6].
Therefore, the partial least squares-Cox (PLS-Cox) regression model was developed to analyze survival systems in the presence of multicollinearity [7]. Due to several limitations of the PLS-Cox regression model, the PLS flexible parametric (PLS-FP) survival regression model was proposed to estimate smooth hazard ratios of predictors and the corresponding cumulative hazard functions and to extrapolate the survival model [2]. However, the major limitation of the PLS-FP model is that it is not appropriate for all complex shapes of the hazard function. The motivation of this research was to develop a survival model that can model complex shapes in the presence of multicollinearity. The proposed method is developed by integrating partial least squares with the Royston and Parmar restricted cubic spline model and is hence named the partial least squares spline (PLS-spline) model. This model can fit more complex shapes of baseline log cumulative hazard functions. The efficiency of the PLS-spline model is tested using simulated data by examining its performance on different scales with various spline knots. The proposed model is then applied to a real data set of infant mortality to estimate the hazard function and regression coefficients. The analyses based on different scales using the simulated and real data sets reveal the efficiency of these models in estimating baseline log cumulative hazard functions in the presence of multicollinearity.

Materials and Methods
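The Cox PH model specifies the hazard as $\lambda(t \mid X) = \lambda_0(t) \exp(X\beta)$, where $\lambda_0(t)$ is the baseline hazard function, $\beta$ is the vector of coefficients, and $X$ is an $(n \times p)$ matrix of covariates. In this model, the baseline hazard function is unspecified.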
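To illustrate why the baseline hazard can remain unspecified in the Cox model, the sketch below evaluates the log partial likelihood for a given set of linear predictors. It is a simplified illustration that assumes no tied event times; the function name is hypothetical.

```python
import math


def cox_partial_loglik(times, events, linear_predictors):
    """Log partial likelihood of a Cox model for fixed linear predictors x_i * beta.

    times: observed times; events: 1 for an event, 0 for censoring.
    Assumes no tied event times (illustrative simplification).
    """
    loglik = 0.0
    for t_i, d_i, lp_i in zip(times, events, linear_predictors):
        if d_i == 1:
            # Risk set: subjects still under observation at time t_i.
            risk = sum(math.exp(lp_j)
                       for t_j, lp_j in zip(times, linear_predictors)
                       if t_j >= t_i)
            loglik += lp_i - math.log(risk)
    return loglik
```

The baseline hazard cancels in every ratio, so it never enters the computation; adding a constant to all linear predictors leaves the value unchanged for the same reason.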

The Partial Least Squares-Cox (PLS-Cox) Regression Model. The partial least squares-Cox (PLS-Cox) regression model is used as a benchmark model in this study. Let $t$ represent the survival time and $X \in \mathbb{R}^{n \times p}$. The partial least squares model computes $k$ latent components for the $p$ correlated covariates; then, the Cox model specifies the hazard function as $\lambda(t \mid S) = \lambda_0(t) \exp(S\beta)$, where $\lambda_0(t)$ is the unspecified baseline hazard function, $\beta$ is the vector of coefficients, and $S$ is an $(n \times k)$ matrix of components. The hyperparameters are found by the maximum likelihood estimation method.

The Royston-Parmar Spline Model. In the context of the PH model, the Royston-Parmar (RP) model can be expressed as $\ln H(t \mid z_i) = s(\ln(t) \mid \eta, k_0) + \beta z_i$, where $s(\ln(t) \mid \eta, k_0)$ describes a restricted cubic spline that is a function of the coefficients $\eta$ of the derived variables and the number of knots $k_0$. Generally, three different scales (hazard, odds, or normal) are used to fit the RP spline model. When no knots are specified, the model reduces to the Weibull model if the scale is hazard. For the odds and normal scales, no knots give the log-logistic and lognormal models, respectively.
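The three scales can be made concrete with a small Python sketch that maps the value of the spline plus linear predictor to a survival probability. It assumes the standard link functions of the Royston-Parmar family (log cumulative hazard, log cumulative odds, and probit); the function name is hypothetical.

```python
import math
from statistics import NormalDist


def survival_from_predictor(eta, scale):
    """Convert eta = s(ln t) + beta*z into S(t) under a given scale.

    hazard: eta = ln H(t)            -> S = exp(-exp(eta))
    odds:   eta = ln[(1 - S) / S]    -> S = 1 / (1 + exp(eta))
    normal: eta = -Phi^{-1}(S)       -> S = Phi(-eta)
    """
    if scale == "hazard":
        return math.exp(-math.exp(eta))
    if scale == "odds":
        return 1.0 / (1.0 + math.exp(eta))
    if scale == "normal":
        return NormalDist().cdf(-eta)
    raise ValueError("scale must be 'hazard', 'odds', or 'normal'")
```

With no knots, eta is linear in ln(t), and substituting it into the three expressions gives the Weibull, log-logistic, and lognormal survival functions, respectively.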

Computational and Mathematical Methods in Medicine

Partial Least Squares Spline (PLS-Spline) Survival Regression Algorithm. Let $X \in \mathbb{R}^{n \times p}$ denote the matrix of $p$ correlated covariates $x_1, \cdots, x_p$ for a sample of size $n$. The algorithm executes the FP model based on the $C$ components (as $C \le p$) of PLSR computed with time $T$ as a response variable and $X$ as a matrix of covariates for $c = 1, 2, \cdots, C$. The pseudocode for the proposed PLS-spline model is expressed as follows.

function PLS model(X, t, e, c), where X is the covariate matrix, t is the time, e is the event, and c is the number of components
    $w_{(c)} = X_{(c-1)}^{t} T_{(c-1)}$ ⊳ loading weights
    $w_{(c)} \leftarrow w_{(c)} / \lVert w_{(c)} \rVert$ ⊳ normalized loading weights
    $s_{(c)} = X_{(c-1)} w_{(c)}$ ⊳ score vector
    $p_{(c)} = X_{(c-1)}^{t} \left( s_{(c)} / s_{(c)}^{t} s_{(c)} \right)$ ⊳ X-loadings
    $q_{c} = T_{(c-1)}^{t} \left( s_{(c)} / s_{(c)}^{t} s_{(c)} \right)$ ⊳ t-loadings
    ⊳ repeat the above steps for c = 1, ..., C
    for c = 1 to C do
        $\{\mathrm{RP}\{\mathrm{Surv}(t, e)\}\}_{c} \sim \sum_{c=1}^{C} s_{c}$ ⊳ Royston and Parmar (RP) restricted cubic spline model on PLSR components
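The component-extraction steps of the algorithm can be sketched in plain Python as follows. This is an illustrative reading of the pseudocode and not the authors' implementation: the column centring and the deflation of X and T after each component are assumptions taken from the standard single-response PLS algorithm, and the function name is hypothetical. The resulting score vectors would then be passed as covariates to an RP spline fit, which is not shown.

```python
import math


def pls_components(X, t, n_components):
    """Extract PLS score vectors from covariates X (n rows, p columns) and response t."""
    n, p = len(X), len(X[0])
    # Assumption: centre every column of X and the response before extraction.
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    t_mean = sum(t) / n
    tc = [v - t_mean for v in t]

    scores = []
    for _ in range(n_components):
        # Loading weights w = X' T, then normalised to unit length.
        w = [sum(Xc[i][j] * tc[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm == 0.0:
            break  # nothing left to explain
        w = [v / norm for v in w]
        # Score vector s = X w.
        s = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
        ss = sum(v * v for v in s)
        # X-loadings p = X' s / (s' s) and t-loading q = T' s / (s' s).
        p_load = [sum(Xc[i][j] * s[i] for i in range(n)) / ss for j in range(p)]
        q = sum(tc[i] * s[i] for i in range(n)) / ss
        # Assumption: deflate X and T so the next component is orthogonal.
        Xc = [[Xc[i][j] - s[i] * p_load[j] for j in range(p)] for i in range(n)]
        tc = [tc[i] - q * s[i] for i in range(n)]
        scores.append(s)
    return scores
```

Because of the deflation step, successive score vectors are mutually orthogonal, which is what removes the multicollinearity before the survival model is fitted.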
2.5. Data Simulation. Simulated data are generated using the simsurv R package to evaluate the efficiency of the existing and proposed survival models. The simulated data set is generated from the Weibull distribution with scale parameter $\lambda = 0.1$ and shape parameter $\gamma = 1.5$, with censoring at 5 years. The correlations among the 200 covariates ranged from 0 to 0.9 over 100 samples.
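For readers who want to see the mechanics behind such a simulation, the following Python sketch draws Weibull survival times by inverting the survival function and applies administrative censoring at 5 years. It is not the simsurv code; the function name, the seed, and the use of a precomputed linear predictor per subject are assumptions made for illustration.

```python
import math
import random


def simulate_weibull_survival(linear_predictors, lam=0.1, gamma=1.5,
                              max_time=5.0, seed=0):
    """Simulate (time, event) pairs under h(t | x) = lam * gamma * t**(gamma-1) * exp(x*beta).

    Solves S(t | x) = exp(-lam * t**gamma * exp(x*beta)) = u for t, with u uniform.
    Times beyond max_time are censored at max_time (event = 0).
    """
    rng = random.Random(seed)
    times, events = [], []
    for lp in linear_predictors:
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        t = (-math.log(u) / (lam * math.exp(lp))) ** (1.0 / gamma)
        events.append(1 if t <= max_time else 0)
        times.append(min(t, max_time))
    return times, events
```

Calling `simulate_weibull_survival([0.0] * 100)` returns 100 baseline subjects; correlated covariates would enter only through the linear predictors supplied by the caller.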
2.6. Real Data Set. This study used publicly available secondary data from the Demographic and Health Survey (DHS), collected during 2012-13 in Pakistan with the support of the United States Agency for International Development and ICF International. Therefore, there are no ethical concerns involved in this work, and no ethics review is required for this study [8]. The secondary data on infants from birth to 12 months of age born to ever-married women aged 15-49 years in Pakistan are used in this study. The outcome of interest was infant survival within 12 months after the first month of birth. The sample consists of 80 infants from Pakistan, and 86 covariates are included.

Simulation-Based Results. A high-dimensional simulated data set with multicollinearity is generated from the Weibull distribution. The constructed data are then split into training and test sets in a 70 : 30 ratio to train and evaluate the performance of the benchmark and proposed methods. The hazard, odds, and normal scales are each modeled with zero and one knot.
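A 70 : 30 split of this kind can be sketched in a few lines of Python. The random shuffling, the seed, and the function name are assumptions, since the text does not state how the split was drawn.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Randomly split rows into training and test sets (70 : 30 by default)."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_train = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test
```

Fixing the seed makes the partition reproducible, so that the benchmark and proposed models are compared on the same training and test subjects.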
The PLS-spline model with different knots measured on different scales is fitted to the simulated data set generated from the Weibull distribution to assess the performance of the models based on the Akaike information criterion (AIC) and Bayesian information criterion (BIC). Figure 1 shows the comparison between the standard PLS-Cox regression model and six PLS-spline models with different knots based on various scales. The proposed PLS-spline models based on the hazard scale with zero knots and one knot are symbolized as $RP_{plsH_0}$ and $RP_{plsH_1}$, respectively. Similarly, $RP_{plsO}$ and $RP_{plsN}$ stand for the odds and normal scales, respectively. Figure 1 shows that the PLS-spline model with one knot has the highest performance on all three scales compared to the PLS-Cox model and the PLS-spline models with zero knots. It is also clear from Figure 1 that the PLS-spline model with zero knots shows higher efficiency than the benchmark PLS-Cox method. Figure 2 shows the efficiency comparison based on the BIC, which supports the results based on the AIC. The presence of multicollinearity is evident in the heat map. Hence, the existence of multicollinearity among covariates in high-dimensional survival data is detected visually.
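Both criteria are simple functions of the maximized log-likelihood, as the short sketch below shows. Lower values indicate a better trade-off between fit and complexity. Using the number of observations (rather than the number of events) as the sample size in the BIC is an assumption, since the text does not say which was used.

```python
import math


def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * loglik


def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * loglik
```

Because the BIC penalty grows with ln(n), adding a knot has to improve the log-likelihood by more under BIC than under AIC whenever n exceeds about 7 observations.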

The high-dimensional infant survival data set with multicollinearity is used for the comparison of models and the identification of risk factors of infant mortality. The sample data are split into training and test sets in a 70 : 30 ratio to evaluate the efficiency of the PLS survival methods.
The PLS-spline models with zero and one knot are fitted to the real data set to assess the performance of the models on different scales using AIC and BIC. Figure 4 shows the comparison, presenting the higher efficiency of all proposed methods compared to PLS-Cox based on AIC. Also, the highest performance among the $RP_{pls}$ methods is observed for $RP_{plsO_1}$ in Figure 4. This result shows that the proposed PLS-spline model based on the odds scale with one knot is the optimal model for the observed data. Figure 5 shows the comparison of models based on BIC. The visual representation shows that the PLS-spline models based on the odds scale with zero and one knot have nearly the same efficiency. On the basis of both model assessment criteria, we may conclude that the PLS-spline model based on the odds scale is the best fitted model for the observed data. For identification of significant risk factors, the PLS-spline model based on the odds scale with one knot is executed as being best fitted. Table 1 presents the selected influential risk factors of infant mortality obtained by using $RP_{plsO_1}$ as the optimal model. After analysis, 27 influential factors are found to be significantly associated with infant mortality in Pakistan.
A positive association with infant mortality is found for mother's age, type of place of region, de facto place of residence, relationship of mother to household head, type of cooking fuel, number of births in the last five years, distance, transport, and accompaniment to a health facility, mother's occupation, the person who usually decides on the respondent's health care, the person who usually decides on visits to family or relatives, the person who usually decides what to do with the money the husband earns, succeeding birth interval, and blood relation with husband. Furthermore, a negative association is observed for region, selection for domestic violence, household ownership of a motorcycle/scooter, reading a newspaper or magazine, watching television, wealth index, awareness of tuberculosis and hepatitis, whether beating is considered justified if the wife neglects the children, argues with her husband, or burns the food, and preceding birth interval. Figure 6 shows the estimates of the baseline cumulative hazards from the PLS-spline model measured on the hazard, normal, and odds scales with zero and one knot for the infant survival data set. All six PLS-spline models produce smooth and consistent estimates of the baseline cumulative hazards extrapolated to 12 months. The PLS-spline model based on the odds scale with one knot is represented by the red line in Figure 6, showing the lowest cumulative hazard for the first 4 months after birth, a moderate increase in the fifth month, and a maximum at the sixth month.

Discussion
Alongside advances in statistical techniques, several modifications have been suggested for survival analysis to improve the efficiency of the model. Yang et al. [9] introduced Deep-CoxPH, an estimation strategy based on deep learning and the Cox model that is proposed to improve risk stratification for overall survival analysis. Rueda et al. [10] used discrete-time Markov chain theory and Cox regression to predict the survival function. The authors also employed a parametric analysis for comparison and variable selection. Another study developed an algorithm combining the parametric model and partial least squares in the presence of extreme observations to enhance model performance [2]. In this study, the PLS-spline model is proposed to treat a survival response with collinear predictors using the spline strategy based on different scales with various knots, with the aim of better model performance and superior interpretation potential. To examine the hazard function with higher accuracy, the PLS-spline model integrates PLS and the Royston and Parmar spline model in the presence of multicollinearity. The proposed model is compared with the PLS-Cox model using simulated and real data sets. The PLS-spline model with one knot on the hazard, odds, and normal scales turns out to be the best model for estimating cumulative hazards based on AIC and BIC for simulated data generated from the Weibull distribution. More importantly, for known simulated data, the PLS-spline model showed better performance than the PLS-Cox model. For the real data set of infant mortality, the PLS-spline model with one knot on the odds scale is observed to be the optimal model. The finally selected model is used to identify the influential risk factors of infant mortality in Pakistan. Maternal age, occupation, and place of residence are found to be significant predictors of infant mortality in the present study.
Previous studies observed that younger and older maternal ages are significantly associated with infant mortality [11]. Another study reported that the region of residence and the working status of the mother are statistically significant risk factors for stunted, underweight, and wasted children [12]. Consistent with the literature, domestic violence is found to be significantly associated with infant mortality [13]. The present study observed that an increase in media awareness (watching television and reading newspapers) and wealth level could decrease the rate of infant mortality. The literature describes media exposure and income level as being associated with maternal outcomes [12,14]. Availability and utilization of health facilities is found to be an important risk factor for the mortality rate among infants. Several earlier studies verified that health expenditure potentially reduces maternal and infant mortality across different countries [15,16]. In close agreement with previous literature, birth interval and consanguineous marriage showed a significant association with infant mortality [17,18]. The overall accuracy of the proposed algorithm improves model performance when covariates are collinear. This efficiency suggests that the survival function, hazard function, cumulative hazard function, and distribution parameters for survival time data with an unknown distribution can be estimated more efficiently as smooth curves. The PLS-spline model is viewed as a useful addition to the toolbox for estimation and prediction of survival time response, alongside the widely used PLS-Cox model in survival settings.

Conclusion
The proposed PLS-spline model based on different scales with various knots is shown to be a better choice than the PLS-Cox model in terms of model performance and interpretation potential. The influential factors identified as important predictors of infant mortality using the PLS-spline model based on the odds scale with one knot are in agreement with other studies. The PLS-spline model therefore has potential as a multivariate survival technique in scientific research for treating high-dimensional correlated survival time data more efficiently.

Data Availability
Data are freely available at http://www.dhs.org.

Conflicts of Interest
The authors declare that they have no conflicts of interest.