AFlexible andRobustApproach toAnalyze Survival Systems in the Presence of Extreme Observations

Survival systems are difficult to analyze in the presence of extreme observations and multicollinearity. Finding appropriate models that provide a robust description of such survival systems and that address the smooth hazards in the context of covariates can be challenging given the sheer number of possibilities. Survival time algorithms that evaluate the efficiency of models in the presence of extreme observations over different datasets provide an effective tool to identify robust systems. However, the existing algorithms addressing the analysis of survival systems are limited in long-term evaluations.(erefore, an algorithm that can analyze survival time response on high-dimensional complex survival systems having extreme observations is developed which explores large margins dynamically. (is algorithm is developed as a conjugate of flexible parametric models and partial least squares to estimate smooth, flexible, and robust functions to extrapolate the survival model in long-term evaluations in the presence of extreme observations. (e algorithm is tested and validated using four distributions based on a simulated dataset generated from the Weibull distribution and compared with partial least squares-Cox regression. (e comparison shows its flexibility and efficiency in handling different survival systems in the presence of extreme values. (e algorithm is also used to analyze four real datasets of breast cancer survival time, each containing seven gene signatures. (e coefficients of significant genes for each dataset are estimated. (e flexibility in handling various distributions as parametric survival models supports the application of the algorithm to a large variety of different survival problems and represents a robust statistical framework for survival analysis in the presence of extreme observations.


Introduction
"Time-to-event" responses have become progressively relevant in the context of research studies, where the response of interest could be based on time to remission, overall survival [1], or progression-free survival [2] following treatment intervention. e Kaplan-Meier product-limit method [3] is the popular nonparametric approach to analyze time to event response. Cox's proportional hazards (PH) regression model [4] is a widely used semiparametric survival approach when the proportional hazards assumption is valid [5].
Despite the popularity and advantages of nonparametric and semiparametric methods, parametric modeling procedures offer flexible, efficient, and robust alternatives. When the proportional hazards assumption is violated and the distributional assumption on the survival times is valid, flexible parametric models (FPM) are an ample alternate of nonparametric and semiparametric methods resulting in more efficient estimates having smaller standard errors even in the presence of extreme observations. Also, statistical inferences are more precise as the full likelihood is used and results are easy to interpret. To approximate survival data, numerous theoretical distributions have been employed so far in FPM. e exponential distribution assists as the primary model for survival time. Other commonly used distributions to modeled big survival data with extreme observations are Weibull, gamma, Gompertz, log-normal, loglogistic, generalized gamma, and generalized F-distribution. Also, the FPM can efficiently incorporate covariates to investigate the dependence of survival response in the context of estimated parameters, survival function, and cumulative hazard function [6][7][8][9][10]. Multivariate regression models assume that there is no multicollinearity among covariates of survival systems.
is assumption, however, is often violated specifically in the case of the curse of dimensionality which analyses a large number of covariates with limited observations. erefore, nonparametric and semiparametric survival methods are no longer appropriate to modeled large data with correlated covariates and extreme observations. erefore, within the framework of partial least squares (PLS) regression, the partial least squares-Cox regression (PLS-CoxR) model was developed. e basic PLS algorithm transforms the original correlated covariates of the system into uncorrelated components [11,12]. Hence, the PLS-CoxR model is used as a conjugate of Cox and PLS regression to model survival systems [13]. A major limitation of the nonparametric survival methods is that the determinants cannot be integrated into the analyses. Regarding the disadvantages of semiparametric survival methods, Cox regression is limited in the long-term evaluation as the cumulative hazard functions and survival functions cannot be extrapolated in the model as being incomplete. Because in long-term survival studies, the duration of the survey is often not sufficient to collect all the observations of every event that occurs. Another weakness of the Cox regression model is that it estimates survival function and cumulative hazards function as step functions. Hence, the PLS-CoxR model is limited and inflexible with unsmooth estimates. Alternatively, flexible parametric models (FPM) can be used to estimate smooth hazard ratios of predictors and corresponding cumulative hazard functions and hence to extrapolate the survival model. However, to use these models, the hazard function and survival function must follow a specific distribution. e FPM can estimate a continuous function instead of a step function due to its flexibility. Additionally, more complicated shapes can be modeled without sacrificing prediction accuracy and appropriate convergence is achieved. e generalization of several standard survival models is used as a series of models in FPM [14]. e motivation of this research was to develop a robust model that is specifically designed for the analysis of survival systems in the presence of extreme observations and is also able to handle various probability distributions. is robust algorithm, namely, Partial Least Squares Flexible parametric Models (PLS-FPM) supports the user defines its probability distribution, for estimating the survival, hazard, and cumulative hazard functions and returning a calculated model selection criterion.
After validating and testing the performance and efficiency of PLS-FPM using simulated data, as well as showing its flexibility to handle different distributions, the algorithm is applied to the analysis of four breast cancer survival datasets, and significant genes are estimated. e analyses based on different distributions using several datasets revealed the robustness of these models to estimate smooth survival and cumulative hazard functions in the presence of extreme values.

Materials and Methods
e Cox proportional hazards (PH) model is the most frequently used regression technique to address survival data. In the presence of multicollinearity, the Cox algorithm is integrated with PLS resulting in the PLS-Cox model.

e Partial Least Squares Cox (PLS-Cox) Regression
Model.
e PLS-Cox regression model is used as a reference model in this study. Let t represent survival time and let z j � z 1j , z 2j , . . . , z nj be the vector of p correlated explanatory factors over n samples. e PLS model computes k latent components for p factors then the baseline hazard function takes the form of where λ o (t) is the unspecified baseline hazard function, β is the vector of coefficients, and L is a (n * k) matrix of components.

Flexible Parametric Survival Model (FPSM).
Let T represent a survival time random factor and let Z denote the vector of covariates z 1 , . . . , z p for a sample of size n. Let T have the probability density function f(t) and cumulative distribution function F(t) � Pr(T ≤ t). e survival function S(t) � Pr(T > t) is the probability of being alive at time t. e hazard function is λ(t) � f(t)/S(t) and the risk function is Any distribution defined for t ∈ [0, ∞] can serve as a parametric function. In this study, the distributions integrated with PLS are as follows.

e Gamma Distribution.
e inclusion of incomplete gamma integral in the survival and hazard functions of gamma distribution limited its use in survival analysis. A survival time random variable T is gamma distributed with shape parameter k and rate parameter β; then, the survival finction is where I k (t) is the incomplete gamma function and the gamma hazard function is Complicated numerical calculations are required to estimate parameters as maximum likelihood estimation is difficult to exercise due to incomplete gamma integrals. e gamma hazard function may be constant, monotonically increasing, or monotonically decreasing.

e Weibull Distribution. Let the survival time T be
Weibull with scale parameter λ and shape parameter k; then, the survival function is Transforming survival function to the log cumulative hazard scale, FPM is formulated as the linear function of log time: Adding covariates, the log cumulative hazard model becomes us, ln(λ) + k ln(t) in the baseline log cumulative hazard function is combined with covariates. e Weibull distribution has constant, increasing, or decreasing hazard rates.

e Log-Logistic Distribution.
For T following the loglogistic distribution with scale parameter α and shape parameter β, the survival function is and the cumulative hazard function is e hazard function of log-logistic distribution may be increasing, decreasing, or hump-shaped.

Log-Normal Distribution.
e natural logarithm of the lifetime T is assumed to be normally distributed with location parameters μ and scale parameter σ. e log-normal survival function is e cumulative hazard function of the lognormal distribution is and Φ is the cumulative distribution function of the normal distribution. e log-normal hazard function is monotonically decreasing [15]. Various other standard and defined distributions including extreme value distributions can be used in FPM. In these models, proportional hazards imply proportional cumulative hazards; hence, covariates can be interpreted as hazard ratios similar to nonparametric models under PH assumption. e cumulative hazard, in FPM, is a more stable function compared to nonparametric models as being a function of a log time scale. For instance, the cumulative hazard function is a straight line in Weibull models. us, more stable-shaped functions are accurately captured. e PLSR model integrated with FPM using gamma, Weibull, log-logistic, and log-normal distribution is introduced in this study for robust estimation in the context of high dimensional survival data in the presence of extreme values.

Partial Least-Squares Flexible Parametric (PLS-FP) Survival
Regression Algorithm. Suppose Z denote the matrix of p correlated covariates z 1 , . . . , z p for a sample of size n.
e algorithm executes the FP model based on the K components (as k ≤ p) of PLSR computed with time as a response variable and Z as a matrix of covariates. Figure 1 expresses the main steps of the proposed model.
e PLS-FP model works in two steps. It computes PLS components at the first step and executes FP distribution at the second step.
is proposed model enhances model performance in terms of prediction and accuracy in the presence of extreme observations. Generation of simulated data for survival response from standard parametric distributions. e simsurv R package is used to generate simulated data to compare the performance of existing and proposed PLSbased models. e simulated dataset is generated from the Weibull distribution for scale parameter (λ � 0.1) and shape parameter (k � 1.5) over 5 years of censoring with extreme observations. e correlation structure between covariates is introduced as R � [0.5, 0.5, 0.3, 0.5, 0, 0, 0, 0, 0, 0] in the datasets over 100 samples with 100 covariates.

Breast Cancer Datasets.
e 10-year censored survival time datasets of breast cancer patients used in this study contain the seven-gene signature innovated by [16]. e seven genes included in the data are AURKA representing proliferation; PLAU representing tumor metastasis; STAT1 representing immune response; VEGF representing angiogenesis; CASP3 representing apoptosis phenotypes; ESR1 representing estrogen receptor (ER) signaling; and ERBB2 representing human epidermal growth factor receptor 2 (HER2) signaling. e clinical data from four different datasets each with a subset of the gene expression and annotations in the presence of extreme observations are analyzed. e following four datasets were used to compare the models: mainz7g (200 observations) [17], transbig7g (198 observations) [18], vdx7g (344 observations) [19,20], and upp7g (234 observations) [21]. e breast cancer datasets are freely available in an R package called survcomp and are available from http://www.ulb.ac.be/di/ map/bhaibeka/software/survcomp/. [22].

Results
e PLS-FP model parameterised with gamma, Weibull, loglogistic, and log-normal distributions are fitted over simulated dataset generated from Weibull distributions for comparison of five models with 100 correlated covariates. e results supported the application of proposed models over the traditional PLS-Cox model to deal with survival time response in the presence of collinear covariates and extreme observations. e model performances based on AIC and BIC for simulated data generated from the Weibull distribution are presented in Figure 2. Figure 2 shows a box Mathematical Problems in Engineering plot comparison of models which is drawn in R package, namely, lattice. e Weibull, gamma, log-logistic, and lognormal distributions are fitted and compared to examine the difference in performance in the presence of correlated covariates and extreme observations. Figure 2 shows performances of five models based on AIC and BIC indicating that incorporated with PLS, the Weibull survival model has the highest performance for simulated data with known Weibull distribution having defined correlated covariates and extreme observations. More importantly, all proposed methods conjugated with PLSR performed more efficiently than PLSR conjugated with Cox regression. e results of simulated data demonstrate that the proposed methods are accurate and robust as showing the highest performance using the corresponding distribution. e application of simulated data generated from the Weibull distribution recommended the practical implementation of the proposed methods to analyze high-dimensional survival time data in the presence of multicollinearity and extreme observations. Before analyzing real datasets, multicollinearity among covariates and the presence of extreme observations is verified. For this purpose, correlations among covariates for all breast cancer datasets are plotted. e correlation maps for breast cancer datasets presented in Figures 3 and 4 portray the high correlations between covariates for all datasets. Figure 3 presents the circle of correlations for mainz7g and transbig7g datasets.
e correlation maps showed that gene symbols VEGFA and PLAU represented by the probe names 211527 x at and 211668 s at, respectively, for the Affymetrix microarray are highly correlated for both datasets. is shows that the vascular endothelial growth factor A (VEGF-A) and Plasminogen Activator, Urokinase (PLAU) are highly correlated covariates in both datasets. Figure 4 visually displays the correlation strength among covariates of two datasets, namely, vdx7g and upp7g. e correlation plot shows that similar to mainz7g and transbig7g, the gene symbols Vegfa and Plau are highly correlated genes for the vdx7g dataset. e correlation maps showed that gene symbols AURKA and ERBB2 represented by the probe names 208079 s at and 216836 s at, respectively, for the Affymetrix microarray are highly correlated for the upp7g dataset. is means that the Aurora Kinase A (Aurka) and Erb-B2 Receptor Tyrosine Kinase 2 (ERBB2) have a high correlation for the upp7g dataset. Furthermore, gene symbol STAT1 and VEGFA represented by the probe names 209969 s at and 211527 x at, respectively, also showed correlation for upp7g dataset presenting association between signal transducer and activator of transcription 1 and vascular endothelial growth factor A (VEGF-A). e correlation maps evidence the presence of multicollinearity. Moreover, for the identification of extreme observations, starburst graphs (also called bagplot) are plotted, presented in Figures 5 and 6. e starburst plot is usually used as a robust method to visualize two or three-dimensional data. Detection of extreme observations is performed separately for each dataset. Figure 5 shows the bagplots for two breast cancer datasets, namely, mainz7g and transbig7g. e principal components are computed for the matrix of covariates and two-dimensional bagplots are constructed. In Figure 5, 16 outlying observations are detected for the mainz7g dataset and 26 outliers appeared for transbig7g data. Figure 6 represents the starburst plot for vdx7g and upp7g datasets. For vdx7g, 7 extreme observations are detected and 16 outliers are identified. e inner bags of the plot contain 50% of the observations. Hence, the presence of multicollinearity among covariates and extreme observations are detected visually. Regarding the parametric approach, the gamma, log-normal, and Weibull distributions are included addressing monotonically increasing, monotonically decreasing, arc-shaped, and bathtub hazards. Hence, incorporated with PLS, this family of distributions increases the efficiency of the model. e log-logistic distribution is included addressing only arc-shaped and decreasing hazards but still showed higher performance Each dataset is measured over seven genes and is randomly split into test and train components to increase reliability. e collinearity among genes is examined and verified. e existence of extreme observations is also detected. en, the PLS-FPM parameterised with Weibull, gamma, log-logistic, and log-normal distribution is considered for all four datasets and compared with the PLS-Cox model. Figure 7 represents the comparison of performance based on AIC for mainz7g and transbig7g datasets. For the mainz7g dataset, the PLS-FPM showed nearly similar performance for gamma, log-normal, Weibull, and log-logistic distribution. e graphical comparison evidence that PLS-FPM with all four distributions has higher efficiency compared to the standard PLS-Cox model. Figure 7 further displays the comparison of efficiency for transbig7g data. Regarding model comparability with PLS-Cox, the FPM-PLS shows higher performance integrated with all four distributions, while the comparison of PLS-FPMs demonstrates that log-normal distribution performs better among other distributions. Figure 8 illustrates the model comparison for Vdx7g and upp7g datasets. e box plot for vdx7g shows that PLS-FPM integrated with log-normal distribution has a minimum value of AIC representing the higher efficiency of this  Figure 9 shows the estimates of the baseline cumulative hazards from the PLS-Cox model and PLS-FPM based on gamma, Weibull, log-logistic, and log-normal distribution for mainz7g and transbig7g datasets of breast cancer. e  e PLS-Cox model resulted in an unsmooth baseline cumulative hazards curve with unusual fluctuations at certain time points. All the parametric models exhibit monotonically increasing smooth curves of the baseline cumulative hazards due to parametric flexibility and the unsmooth pattern is observed for the PLS-Cox model for all datasets. Figure 10 illustrates the estimates of the baseline cumulative hazards from the PLS-Cox model and PLS-FPM based on gamma, Weibull, log-logistic, and log-normal distribution for vdx7g and upp7g datasets of breast cancer. e most efficient model for the vdx7g dataset is PLS-FPM integrated with log-normal distribution which has lower estimates of the baseline cumulative hazards compared to other distribution at initial survival time and it gradually increases with the increase of survival time. Consistent with the vdx7g dataset, the lognormal distribution shows decreased estimates of the baseline cumulative hazards compared to other distribution at the first two years of survival time and this estimate is minimum for gamma distribution for further years.
All seven genes are found to be significantly associated with breast cancer in six datasets. e parameters of all genes are estimated for each survival dataset and presented in Table 1. e coefficients are estimated for the optimal model corresponding to each dataset. e PLS-FPM based on lognormal distribution is used for parameter estimation of mainz7g, transbig7g, vdx7g, upp7g, and nki7g datasets based on performance. For unt7g data, Weibull distribution is integrated with PLS-FPM for the estimation of coefficients having higher accuracy. Positive and negative associations are observed for different genes regarding various datasets of breast cancer.

Discussion
Efficient model selection with robust estimates remains a challenging and computationally intensive task, especially for survival systems in the presence of extreme observations. erefore, the number of candidate models for evaluations and comparisons is usually limited in studies. However, nonparametric and semiparametric survival methods can misappropriate model structures without considering specific probability distribution. e PLS-Cox regression model is used to deal with multicollinearity among covariates in survival time analysis. e new approach is proposed mainly for two disadvantages of Cox regression. Firstly, Cox regression is a semiparametric approach; hence, it produces unsmooth estimates which are limited in the long-term evaluations. Secondly, the standard Cox regression model is not robust to extreme values. In this article, PLS-FPM is proposed as a fully parametric survival technique to examine hazard function and efficiently estimate the parameters. e PLS-FPM was particularly projected to address survival systems in the presence of multicollinearity by using various distributions to produce smooth estimates to extrapolate the survival model. Since this approach is flexible enough to combine different distributions, it can produce a robust model in the presence of extreme observations by using a suitable probability distribution.
Overall, PLS-FPM compares favorably with the benchmark method on both simulated and real datasets in the presence of multicollinearity and extreme observations. e PLS-FPM using Weibull distribution turns out to be the best model to estimate cumulative hazards according to AIC and BIC over simulated data generated from Weibull distribution. More generally in the setting of simulated survival data, the fully parametric PLS-FPM had better performance than the semiparametric PLS-Cox regression model. e optimal model for each real dataset shows that seven genes are found significantly associated with each breast cancer survival dataset. Tumor cell proliferation is found to be one of the most significant predictors of breast cancer survival. Various previous studies investigated proliferation in tumor cells and found it a significant factor of breast cancer [23,24]. Supporting previous studies, tumor metastasis is found to be an important factor of breast cancer survival [25,26]. Previous literature reported that high immune scores are significantly associated with improved disease-free survival and overall survival in breast cancer patients [27]. e present study also finds the association of immune response with breast cancer. Like previous studies, it is found that the apoptosis phenotype in a breast tumor seems to predict survival rate [28,29]. Consistent with the present study, various previous studies have reported the association of angiogenesis and breast cancer to citelongatto2010angiogenesis, min2010tie2. e ESR1 is reported as the most attractive target for clinical therapy of ER-positive breast cancer [30]. e amplification of ERBB2 of clinical samples of breast cancers along with its association with aggressive disease and poor survival is discussed in literature [31]. Both the genes are found to be significantly associated with breast cancer survival for the present analysis supporting past evidence. e overall accuracy of these algorithms enhances the model performance to a higher extend, considering collinear covariates and extreme observations. is efficiency suggests that survival function, hazard function, cumulative hazard function, and parameters of distribution for the survival time data with unknown distribution can be estimated more efficiently in terms of smooth lines.

Conclusion
PLS-FPM not only extrapolates survival outcomes beyond the available follow-up data but also supports a wide range of hazard shapes including monotonically increasing, monotonically decreasing, arc-shaped, and bathtub-shaped hazards. In a word, PLS-FPM is viewed as a useful fully parametric addition to the toolbox of robust estimation and prediction of survival time approaches for the widely used PLS-Cox model in the survival settings.  Data Availability e breast cancer datasets are freely available in an R package called survcomp and are available at http://www. ulb.ac.be/di/map/bhaibeka/software/survcomp/.