Identification of influential observations in high-dimensional survival data through robust penalized Cox regression based on trimming

Abstract: Penalized Cox regression can efficiently be used to determine biomarkers related to disease prognosis in high-dimensional genomic data. However, its results are influenced by sample heterogeneity: some individuals exhibit a dependence structure between survival time and covariates that differs from that of most individuals. Such observations are called influential observations or outliers. A robust penalized Cox model (the Reweighted Elastic Net-type Maximum Trimmed Partial Likelihood estimator, Rwt MTPL-EN) is proposed to improve prediction accuracy and identify influential observations, together with a new algorithm, AR-Cstep, to solve it. The method was validated in a simulation study and applied to glioma microarray expression data. When there were no outliers, the results of Rwt MTPL-EN were close to those of the Elastic Net (EN). When outliers existed, the results of EN were distorted by them, whereas Rwt MTPL-EN performed better than EN whether the censoring rate was high or low, and resisted outliers in both the predictors and the response. In terms of outlier detection accuracy, Rwt MTPL-EN also clearly outperformed EN.


Introduction
The determination of biomarkers related to disease prognosis in high-dimensional genomic data can help elucidate the pathogenesis underlying prognosis, and thereby guide the search for new targeted drugs that improve patient outcomes. Biomarkers can also be used to predict a patient's prognosis and to provide individualized treatment. The discovery of prognosis-associated biomarkers has therefore become an active research area. The task poses two challenges. One is that prognostic data tend to contain censored survival times; the other is the high dimensionality of the data, in which the number of variables far exceeds the sample size. The Cox proportional hazards model is widely used to model censored survival times, screen for associated factors and establish prognostic prediction models, but it is not suitable for high-dimensional data. Penalized Cox regression can solve both the screening of prognostic factors and the construction of prediction models in high-dimensional data. For example, Liu, Z., M. Li, Q. Hua, Y. Li and G. Wang [1] used L1-penalized (i.e., LASSO-type) Cox regression to identify non-coding RNAs related to breast cancer prognosis, and stratified patients based on risk scores. Shen, X. Y., X. P. Liu, C. K. Song, Y. J. Wang, S. Li and W. D. Hu [2] also used a LASSO-type penalized Cox proportional hazards regression model, ultimately identifying two genes related to lung adenocarcinoma survival. Penalized Cox regression has also recently been used to identify driver genes for bladder cancer prognosis [3].
However, the prediction accuracy of these methods is often influenced by the heterogeneity of samples from cancer patients [4,5]. The main known source of this heterogeneity is genomic instability [6]. For example, genomic instability is a prominent source of genetic diversity within tumours, generating a diverse cell population that can be subject to selection in a given micro-environmental or therapeutic context. Some individuals have a different dependence structure between survival time and covariates, which means that these patients may follow mechanisms different from those of most individuals. Individuals with poor survival prediction under a fitted Cox regression "died too early" or "lived too long" compared with the survival probabilities estimated for their covariate pattern, composed of the associated factors revealed by most individuals [7]. These observations are called influential observations or outliers. Such outliers, especially long-term survivors, have a great impact on Cox regression [8]. On the other hand, detecting outliers in survival data is very important in its own right, because analysing individuals with unusually long or short survival can lead to the identification of new prognostic factors [7,9]. Peng, S., H. Dhruv, B. Armstrong, B. Salhia, C. Legendre, J. Kiefer, J. Parks, S. Virk, A. E. Sloan and Q. T. Ostrom [10] compared integrated genomic profiles of glioma "outliers" (patients with long-term and short-term survival) to discover molecular markers associated with different prognoses after standard treatment. Individualized treatment informed by such markers can then avoid treatment failure caused by inappropriate therapy.
So it is very important to identify outliers in survival data. On the one hand, removing the influence of outliers yields a robust model with better prediction accuracy. On the other hand, the identified outliers may reveal hidden information in the covariates and may be worth studying further.
Because outliers can affect parameter estimation, residual analysis cannot be used directly for outlier identification: there is a very high probability of masking, that is, the failure to identify true outliers [11]. Therefore, robust estimates are a prerequisite for distance-based outlier detection procedures [12]. For low-dimensional data, a robust estimation method, least trimmed squares (LTS), was proposed by Rousseeuw, P. J. [11]. LTS is highly robust to outliers in both the response and the predictors; it is effective for identifying outliers and can overcome the masking caused by the coexistence of multiple outliers. Farcomeni, A. and S. Viviani [12] proposed a robust Cox regression model based on trimming to analyze survival data with outliers. They fitted the Cox model by trimming the individuals with small contributions to the partial likelihood function, obtaining a robust estimate unaffected by outliers. However, there are few studies on robustness in high-dimensional survival analysis of omics data. Carrasquinha, E., A. Veríssimo, M. B. Lopes and S. Vinga [9] studied outliers in high-dimensional survival analysis: the elastic net (EN) and LASSO were used to reduce the high-dimensional data to low dimension, two low-dimensional robust Cox models were used to identify outliers, and the rank product test and ensembling were used to combine the outliers identified by the two methods. The disadvantage of this approach is that the dimension-reduction step itself may be affected by outliers, because EN and LASSO are not robust.
For high-dimensional survival analysis, a robust penalized Cox model based on trimming is proposed in this study. In contrast to Carrasquinha, E., A. Veríssimo, M. B. Lopes and S. Vinga [9], we build robustness into the high-dimensional model directly, avoiding the influence of outliers on the dimension reduction.
In this article, a robust EN-type penalized Cox model based on trimming is proposed, and the algorithm to solve it is described in Section 2. In Section 3, the results of simulation studies and the analysis of glioma gene expression data are described. We conclude with a discussion in Section 4 and a conclusion in Section 5.

Robust penalized Cox regression model based on trimming
Assume there are n observations in the follow-up study. Let δ_i denote the outcome of subject i, where δ_i = 1 indicates an event and δ_i = 0 indicates censoring. The ordered times at which the n subjects died or were censored are denoted t_1 < t_2 < ⋯ < t_n. Let R(t_i) denote the number of people still alive at time t_i, that is, the number at risk.
Let us consider a penalized Cox observation model with outliers. Let I be a clean set containing no outliers, so that

h(t | x_i) = h_0(t) exp(x_i^T β),  i ∈ I.  (1)

That is, individuals in the clean set I follow the Cox proportional hazards model, while the outliers outside of I obey an unknown, unspecified hazard function h̃_i(t). The outliers therefore provide no useful information for the estimation of β.

Let X denote the n × p matrix of predictors and β the coefficient vector, so that x_i^T β is the linear predictor of individual i. Let η be the retained proportion; the number of retained observations is h = ⌊nη⌋ accordingly, where ⌊·⌋ means round down. The MTPL-EN model maximizes the trimmed, EN-penalized log partial likelihood

max over subsets I with |I| = h of max_β { ℓ(β; I) − λ [ α‖β‖_1 + ((1 − α)/2)‖β‖_2^2 ] },  (2)

where ℓ(β; I) is the Cox log partial likelihood computed on the subset I, λ ≥ 0 is a penalty parameter, and α, the mixing proportion of the ridge and LASSO penalties, takes a value in [0, 1]. The EN penalty tends to select groups of correlated variables. The robust penalized Cox regression based on trimming thus seeks the subset of size h whose regularized partial likelihood is maximal; the corresponding regularized partial likelihood estimate on that subset is denoted β̂_MTPL-EN.
Generally, the proportion of outliers is unknown in practice, and 1 − η is chosen to be higher than the expected proportion of outliers. However, too high a trimming proportion inflates the asymptotic variance of the estimates and reduces estimation efficiency. We adopted the trimming ratio 1 − η = 0.25 in this article. After β̂_MTPL-EN was estimated, a reweighting step was used to detect outliers in the data, and the model was then re-estimated on the dataset with the detected outliers removed, to further improve efficiency.

Algorithm
Maximizing (2) is equivalent to finding, among all subsets of sample size h, the one with maximal regularized partial likelihood. An exhaustive search is impossible because the number of such subsets, the binomial coefficient (n choose h), is enormous and grows rapidly with the sample size, so the computational burden of finding the optimum by enumeration is prohibitive. This kind of optimization is common in robust statistics, and the usual method is a repeated concentration-step algorithm, also known as the C-step algorithm [13]. The C-step algorithm requires separating each individual's contribution to the objective function at every iteration. But the objective function of Cox regression is the partial likelihood, and the contribution of a single observation is difficult to decompose from the partial likelihood of the complete set, because the risk set of an observation depends on the survival-time ordering of the other individuals: including or excluding an observation changes the corresponding risk sets. For censored individuals in particular, whose contribution appears only in the denominators of the partial likelihood, the contribution is especially hard to separate. So β̂_MTPL-EN cannot be solved directly by the C-step algorithm. A further reason is that, in penalized regression, the regularization parameter λ must be re-determined at every C-step; the objective function is then no longer guaranteed to improve monotonically across iterations, so the algorithm need not converge to a subset with optimal objective value. Farcomeni, A. and S. Viviani [12] applied the acceptance-rejection algorithm (proposed by Chakraborty, B. and P. 
Chaudhuri [14]) to solve the robust Cox model based on trimming in the low-dimensional case. In the acceptance-rejection algorithm, at each iteration the candidate samples are drawn randomly from the remaining samples, so the direction of the iteration is random, which slows convergence. In this study, the residuals of the individuals are used in place of each individual's contribution to the partial likelihood function.
The martingale residual of individual i is

M̂_i = δ_i − Ĥ_0(t_i) exp(x_i^T β̂),  (3)

where δ_i is the censoring indicator and Ĥ_0(t_i) is the estimated baseline cumulative hazard. Martingale residuals can be interpreted as the difference between the observed number of events for an individual and the number expected under the Cox model, that is, the part that is not predicted by the model and exceeds the estimated number of events.
Martingale residuals can reveal observations that do not fit the model well compared with other observations sharing the same covariate pattern: those who lived too long (M̂_i is a large negative value) and those who failed too early (M̂_i is close to 1).
The martingale residual is asymmetric, with range (−∞, 1]. It is transformed to approximate normality, giving the deviance residual

d̂_i = sign(M̂_i) √( −2 [ M̂_i + δ_i log(δ_i − M̂_i) ] ).  (4)

Deviance residuals are more symmetric than martingale residuals. A large negative residual indicates fewer observed events than the model predicts, that is, an outlier that "lived too long"; a large positive residual indicates more observed events than predicted, that is, an outlier that "died too early", relative to other observations with the same covariates. Simulation studies have shown that "lived too long" outliers can be detected by both the deviance and martingale residuals, whereas "failed too early" outliers can be identified only by the deviance residuals [15,16].
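As an illustration, the martingale and deviance residuals of formulas (3) and (4) can be computed directly once an individual's cumulative hazard and censoring indicator are known. The sketch below is a minimal Python translation of those two formulas (the paper's own implementation is in R); `cum_hazard` stands for Ĥ_0(t_i)·exp(x_i^T β̂).

```python
import math

def martingale_residual(delta, cum_hazard):
    """Formula (3): M_i = delta_i - H0(t_i) * exp(x_i' beta).
    `cum_hazard` is the individual's estimated cumulative hazard
    H0(t_i) * exp(x_i' beta); `delta` is the censoring indicator."""
    return delta - cum_hazard

def deviance_residual(delta, m):
    """Formula (4): d_i = sign(M_i) * sqrt(-2 * [M_i + delta_i * log(delta_i - M_i)]).
    The delta_i * log(...) term vanishes for censored observations (delta = 0)."""
    inner = m + (delta * math.log(delta - m) if delta > 0 else 0.0)
    return math.copysign(math.sqrt(-2.0 * inner), m)
```

For example, a censored individual who lived much longer than predicted (M̂ = −1) gets a deviance residual of −√2 ≈ −1.41, while a well-fitted event (M̂ near 0) gets a residual near 0.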
Find regression estimates from subsamples, and obtain the residuals of all samples.
In the C-step algorithm, regression estimation is performed on the subset, and the estimated coefficients are then applied to the complete set to obtain each individual's contribution to the criterion function. In the penalized Cox model, the baseline hazard must be estimated before the residuals of all individuals can be obtained from the estimated coefficients β̂. However, our simulation experiments showed that the baseline hazard is greatly affected by outliers, so it must be estimated on a subset free of outliers and then extended to all individuals.
At the k-th iteration, let H_k denote the current subset of h observations, β̂_k the solution of the penalized Cox regression fitted on H_k, and ℓ(β̂_k, H_k) the corresponding partial likelihood. Under β̂_k, for each sample in the subset H_k the baseline cumulative hazard Ĥ_0(t_i) is estimated according to formula (5), i ∈ H_k. For the samples in the complement of H_k, the baseline hazard is estimated as follows. We assume the baseline hazard follows a Weibull distribution, so that the log baseline cumulative hazard is linear in log survival time:

log Ĥ_0(t) = γ_0 + γ_1 log t.  (6)

Using the samples in H_k, γ̂_0 and γ̂_1 are estimated from the linear model (6); the survival times t_i of the samples outside H_k are then substituted into (6) to obtain the corresponding baseline hazards Ĥ_0(t_i). To make the estimates of γ̂_0 and γ̂_1 robust, median regression (least median of squares) is used here: whereas ordinary least squares minimizes the sum of the squared residuals, this objective minimizes the median of the squared residuals, a robust estimate with a high breakdown point [11].
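A minimal sketch of this extrapolation step, assuming the log-linear Weibull relation (6) and using an exact pairwise search for the least-median-of-squares line (feasible for the small subsets involved; the paper's own R implementation is not shown):

```python
import math, itertools

def lms_line(xs, ys):
    """Least-median-of-squares line fit: among the lines through each pair
    of points, pick the one minimizing the median squared residual (a simple
    exact search, adequate for small subsets)."""
    best, best_med = None, float("inf")
    for i, j in itertools.combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue
        g1 = (ys[j] - ys[i]) / (xs[j] - xs[i])
        g0 = ys[i] - g1 * xs[i]
        res2 = sorted((y - (g0 + g1 * x)) ** 2 for x, y in zip(xs, ys))
        med = res2[len(res2) // 2]
        if med < best_med:
            best, best_med = (g0, g1), med
    return best

def extrapolate_cum_hazard(t, g0, g1):
    """Weibull assumption (6): log H0(t) is linear in log t, so
    H0(t) = exp(g0 + g1 * log t)."""
    return math.exp(g0 + g1 * math.log(t))
```

With the fitted slope and intercept from the subset, the baseline cumulative hazard of any sample outside the subset is obtained by plugging its survival time into `extrapolate_cum_hazard`; the pairwise LMS search ignores a gross outlier that would drag an ordinary least-squares line away.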
From the baseline hazard of each individual, the deviance residual d̂_i of each individual is obtained from formulas (3) and (4), i = 1, 2, ⋯, n. The h individuals with the smallest absolute residuals are selected to form the new subset H_{k+1}. To keep the censoring rate of the subset equal to that of the full set, H_{k+1} is composed of h_1 individuals with the smallest residuals among those who experienced the event, and h_0 individuals with the smallest residuals among the censored individuals. Let n = n_1 + n_2, where n_1 individuals experienced the event and the survival times of n_2 individuals are censored. Then h_1 = ⌊(n_1 + 1)η⌋ and h_0 = h − h_1, where ⌊·⌋ means round down and 1 − η is the trimming ratio.
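The censoring-stratified subset update can be sketched as follows (a simplified Python illustration of the rule above; the actual algorithm recomputes the residuals with a freshly fitted penalized Cox model at every iteration):

```python
import math

def trimmed_subset(residuals, deltas, eta=0.75):
    """Select h = floor(n * eta) indices while preserving the censoring rate
    of the full set: h1 events with the smallest |residual|, and h0 = h - h1
    censored observations likewise, with h1 = floor((n1 + 1) * eta)."""
    n = len(residuals)
    n1 = sum(deltas)                      # number of events
    h = math.floor(n * eta)
    h1 = min(math.floor((n1 + 1) * eta), h)
    h0 = h - h1
    events = sorted((i for i in range(n) if deltas[i] == 1),
                    key=lambda i: abs(residuals[i]))
    censored = sorted((i for i in range(n) if deltas[i] == 0),
                      key=lambda i: abs(residuals[i]))
    return sorted(events[:h1] + censored[:h0])
```

For instance, with six observations (three events, three censored) and η = 0.75, the rule keeps h = 4 indices: the three events and the one censored observation with the smallest absolute residuals.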
To help the algorithm reach the global optimum, multiple initial subsets are used. To make it likely that an initial subset contains no outliers, its sample size should be small; but too small a sample, in particular one with too few uncensored observations, makes it impossible to estimate the coefficients of the penalized Cox regression. In this study the initial subset size was 20, and 500 subsets were drawn at random. Two AR-Cstep iterations were executed on each of these 500 subsets, and the ten subsets with the largest partial likelihood after these iterations were retained.
The AR-Cstep algorithm was then run on these 10 subsets until convergence. Among the 10 converged subsets, the one with the largest partial likelihood is selected as the optimal subset, denoted H_opt, and a penalized Cox regression on H_opt yields the solution β̂_raw.
k indicates the number of iterations, and r indicates that the current maximum likelihood value has not changed for r consecutive iterations.
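The multi-start strategy above (500 random starts, two warm-up concentration steps, retain the ten best, iterate each to convergence) can be expressed as a generic skeleton. Here `step` and `score` are placeholders for one AR-Cstep concentration step and the penalized partial likelihood; both are illustrative assumptions, not the paper's R code:

```python
def ar_cstep_search(step, score, init_subsets, n_keep=10, warmup=2, max_iter=50):
    """Multi-start concentration search: run `warmup` steps on every start,
    keep the `n_keep` best-scoring subsets, iterate those until the subset
    stops changing, and return the best one (larger score = larger
    penalized partial likelihood)."""
    subsets = list(init_subsets)
    for _ in range(warmup):
        subsets = [step(s) for s in subsets]
    subsets.sort(key=score, reverse=True)
    best, best_score = None, float("-inf")
    for s in subsets[:n_keep]:
        for _ in range(max_iter):
            nxt = step(s)
            if nxt == s:          # converged: the subset no longer changes
                break
            s = nxt
        if score(s) > best_score:
            best, best_score = s, score(s)
    return best
```

The skeleton is agnostic to how subsets are represented; plugging in a concentration step that maps a subset to the h smallest-residual indices (as described above) reproduces the two-stage search.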

Reweighted step
In this article, we choose a subset of size h = ⌊nη⌋ with η = 0.75, so the trimming ratio 1 − η reflects an initial guess that the data contain fewer than 25% outliers. This is a rather conservative estimate; there may be far fewer outliers in the data. Therefore, a reweighting step is used to detect outliers via β̂_raw. The detected outliers are excluded, yielding a new subset H_rwt, and the EN-type penalized Cox regression is applied to H_rwt to obtain the solution β̂_rwt. Usually H_rwt is larger than h, and the additional samples improve the performance of β̂_rwt compared with β̂_raw.
To make the estimation of the baseline hazard H_0(t) in the deviance residuals accurate and unaffected by outliers, H_0(t) is estimated on H_opt. For samples outside H_opt, the baseline hazard is again estimated by Eq (6).
After obtaining the deviance residual d̂_i of each observation, a binary weight is defined for the i-th observation:

w_i = 1 if |d̂_i| ≤ Φ^{−1}(1 − δ), and w_i = 0 otherwise,

where Φ(x) is the distribution function of the standard normal distribution. We set δ = 0.005, so Φ^{−1}(1 − δ) = 2.57; observations with absolute residuals beyond 2.57 are regarded as outliers. The reweighted subset H_rwt is composed of the observations not flagged as outliers.
The reweighted estimator β̂_rwt is the solution of the penalized Cox regression based on H_rwt; it is called Reweighted MTPL-EN (Rwt MTPL-EN). To distinguish them, the unweighted β̂_raw is called Raw MTPL-EN.
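The binary-weight rule takes only a few lines; `NormalDist` from the Python standard library supplies the quantile Φ^{−1}(1 − δ) ≈ 2.576 for δ = 0.005 (a sketch; the paper's implementation is in R):

```python
from statistics import NormalDist

def outlier_weights(dev_residuals, delta=0.005):
    """w_i = 1 when |d_i| <= inverse-normal(1 - delta) (kept in H_rwt),
    w_i = 0 when the absolute deviance residual exceeds the cutoff and the
    observation is flagged as an outlier.  delta = 0.005 gives a cutoff of
    about 2.576."""
    cut = NormalDist().inv_cdf(1.0 - delta)
    return [1 if abs(d) <= cut else 0 for d in dev_residuals]
```

The subset H_rwt is then simply the set of indices with weight 1, and the penalized Cox regression is refitted on it.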

Choice of the regularization parameter and standardization of predictors
We select λ over a grid of values in the interval (0, λ_max] as discussed by Breheny and Huang [14].
Here t denotes the survival time. In the iteration steps of AR-Cstep, we take a grid with steps of size 0.05 λ̂_max and fix α = 0.5 to reduce the computational burden. In the reweighting step, we take a grid for λ with steps of size 0.01 λ̂_max to derive the solutions β̂_raw and β̂_rwt, and α is selected by cross-validation in the interval [0.1, 1] with a step size of 0.1. It is advisable to standardize predictors before applying penalized Cox regression; standardization removes the influence of a predictor's unit and scale. However, the mean and standard deviation computed from all samples are not robust to outliers. In the algorithm described above, penalized Cox regression is applied to a subsample at every iteration step of AR-Cstep, so we first compute the mean and standard deviation from the subsample, and then standardize all samples with them before applying the penalized Cox regression.
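Both ingredients of this subsection, the λ grid on (0, λ̂_max] and standardization with subset-based means and standard deviations, are easy to sketch. The helpers below are illustrative Python, not the paper's R code:

```python
def lambda_grid(lam_max, step_frac):
    """Equally spaced grid on (0, lam_max]: step_frac = 0.05 for the coarse
    grid inside AR-Cstep, 0.01 for the finer grid in the reweighting step."""
    k = round(1.0 / step_frac)
    return [lam_max * step_frac * i for i in range(1, k + 1)]

def subset_standardize(X, idx):
    """Standardize every row of X using the mean and standard deviation
    computed only on the subset rows `idx`, so that outliers outside the
    subset cannot distort the scaling."""
    p = len(X[0])
    means = [sum(X[i][j] for i in idx) / len(idx) for j in range(p)]
    sds = [(sum((X[i][j] - means[j]) ** 2 for i in idx) / (len(idx) - 1)) ** 0.5
           for j in range(p)]
    return [[(row[j] - means[j]) / sds[j] for j in range(p)] for row in X]
```

With `step_frac = 0.05` the grid has 20 candidate values ending exactly at λ̂_max; a gross outlier outside the subset leaves the subset-based scaling untouched.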

Simulation settings
In this section, we compare the accuracy of the elastic net, Raw MTPL-EN and Rwt MTPL-EN in variable selection, outlier identification and prediction, by simulating survival data with and without outliers, with symmetric and asymmetric outliers, with different censoring ratios, and with outliers in the response and in the predictors.

Simulation data generation
The simulation data in this study are generated following Bender, R., T. Augustin and M. Blettner [17]. Assuming that the baseline hazard h_0(t) follows an exponential distribution with λ = 1, the survival time T can be generated as

T = −log(U) / (λ exp(x^T β)),

where U follows the uniform distribution on [0, 1]. The censoring time is generated from an exponential distribution whose mean is determined by V, where V follows the uniform distribution on [CL, CU]; different values of CL and CU give rise to different censoring ratios.
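With a constant baseline hazard, the inverse-transform formula above is a one-liner. A minimal sketch (the optional `u` argument makes the draw reproducible for testing; `lin_pred` is x^T β):

```python
import math, random

def simulate_survival_time(lin_pred, lam=1.0, u=None):
    """Inverse-transform simulation of a Cox survival time with an
    exponential baseline hazard: T = -log(U) / (lam * exp(lin_pred)),
    with U ~ Uniform(0, 1)."""
    if u is None:
        u = random.random()
    return -math.log(u) / (lam * math.exp(lin_pred))
```

A larger linear predictor (higher risk) yields a shorter simulated survival time for the same uniform draw.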

Simulation scenario setting
Considering that omics data are usually high-dimensional and show high within-group correlation caused by gene interaction, the following settings are made. We set the sample size n = 300 and the number of independent variables p = 1000, where the independent variables X follow a p-dimensional multivariate normal distribution N(0, Σ). The correlation structure is assumed to be block correlation: each block contains 50 independent variables, the within-block correlation is corr(X_i, X_j) = ρ^{|i−j|}, i ≠ j, with ρ = 0.9, and there is no correlation between blocks. Considering that the effect of a single gene on prognosis is often small, the absolute values of the effect sizes are set between 0.3 and 0.8, with the non-zero regression coefficients placed in the first 12 blocks. Considering that the censoring rate of survival data is often large, a 35% censoring rate was set, close to the 37.58% censoring rate in the glioma data analysis. In addition, censoring rates of 15%, 25% and 45% were also examined to assess the effect of the censoring rate on the results.
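The block correlation structure can be built directly: within each block the correlation decays as ρ^|i−j| and is zero across blocks. An illustrative pure-Python helper (the simulation itself would then draw X from N(0, Σ) with this Σ):

```python
def block_cov(p, block=50, rho=0.9):
    """Block-diagonal correlation matrix: corr(X_i, X_j) = rho ** |i - j|
    when i and j fall in the same block of size `block`, else 0."""
    S = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i // block == j // block:
                S[i][j] = rho ** abs(i - j)
    return S
```

With p = 1000 and blocks of 50, this yields 20 independent blocks of strongly correlated variables, mimicking co-regulated gene groups.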
Following the outlier settings of Farcomeni, A. and S. Viviani [12], the maximum and minimum values of h(t_i) are denoted h_high and h_low respectively. Observations are then randomly selected from the data set with proportion ε, and their h(t_i) are changed to B_i h_low + (1 − B_i) h_high, where B_i is a random number following a Bernoulli distribution with parameter p. Two cases are set: symmetric outliers (p = 0.5) and asymmetric outliers (p = 0.9 and p = 0.1). Figure 1 shows the outlier settings in the simulation data. For the elastic net, the R package "glmnetUtils" is used, an extension of the R package glmnet in which the parameter α can also be cross-validated. The choice of λ uses the package default: 100 equal steps on a log scale from λ_min to λ_max, where λ_min = 0.01 λ_max and λ_max is the smallest λ that shrinks all regression coefficients to 0. The range of α is [0.1, 1], with 10 values generated at a step size of 0.1. For MTPL-EN, the trimming ratio is set to 25%.
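The contamination mechanism can be sketched as follows (illustrative Python; the paper's simulation is in R). The `rng` parameter accepts a seeded `random.Random` so the draw is reproducible:

```python
import random

def contaminate(hazards, eps=0.10, p=0.5, rng=random):
    """Replace a random fraction `eps` of the individual hazards h(t_i) with
    B*h_low + (1-B)*h_high, where B ~ Bernoulli(p); p = 0.5 gives symmetric
    outliers, p = 0.9 or p = 0.1 asymmetric ones.  Returns the contaminated
    hazards and the indices of the outliers."""
    h_low, h_high = min(hazards), max(hazards)
    out = list(hazards)
    idx = rng.sample(range(len(hazards)), round(len(hazards) * eps))
    for i in idx:
        b = 1 if rng.random() < p else 0
        out[i] = b * h_low + (1 - b) * h_high
    return out, sorted(idx)
```

Each contaminated hazard lands at one of the two extremes, so the affected individuals "live too long" (hazard pushed to h_low) or "die too early" (hazard pushed to h_high) relative to their covariates.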
(2) Symmetric outliers, that is, outliers that "lived too long" and outliers that "failed too early" relative to the prognostic index each account for 50%.
(3) A censoring rate of 35%. (4) A censoring rate of 45%.
Simulation Scenario 3: this scenario examines the performance of the elastic net and MTPL-EN when outliers deviate from the main data in the response only, or in the predictors as well. n = 300, p = 1000, ε = 10%, symmetric outliers, censoring ratio 35%.
Case A: 10% of individuals are randomly selected; each receives, with 50% probability, min(h(t)), the minimum value of the risk function, and with 50% probability, max(h(t)), the maximum value of the risk function.
Case B: As case A, but the outliers take min(h(t)) / exp(15) with 50% probability and max(h(t)) · exp(15) with 50% probability.
Case C: As case A, but the independent variables of the outliers follow independent N(3, 1) distributions.
Case D: As case B, but the independent variables of the outliers follow independent N(3, 1) distributions.
Compared with case A, in case B the survival times of the "lived too long" outliers are longer and those of the "died too early" outliers are shorter. Compared with case A, in case C the outliers are shifted to the right in the predictors. Compared with case A, in case D the outliers are shifted to the right and also deviate farther from the main data in the response. A graphical representation is shown in Figure S1 in the supplementary file.
Training and test data were generated according to the above sampling schemes: training data to fit the model and evaluate the accuracy of variable selection and outlier detection, and test data, generated without outliers, to evaluate prediction. For each setting we report the average of the performance measures over 100 simulation replicates, implemented in R. Code is available on Github (https://github.com/hwsun2000/MTPL-EN).

Results of scenario 1
Figure 2 shows the performance of EN, Raw MTPL-EN, and Rwt MTPL-EN when there were no outliers, and when there were 10% outliers in the data.
Here we used two screening-test indicators, Sn (sensitivity) and FPR (false positive rate) [18], treating the outliers to be identified as the patients to be detected in a screening test. Sn is the proportion of true outliers among those identified; FPR is the proportion of normal samples incorrectly flagged as outliers. PSR (positive selection rate) is the proportion of true disease-related biomarkers that are selected, and FDR (false discovery rate) is the proportion of selected biomarkers that are unrelated to the disease. We used the comprehensive indicator GM [19,20], the geometric mean of PSR and (1 − FDR), to measure the accuracy of variable selection: high PSR and low FDR give a high GM, indicating accurate variable selection. A detailed description of the indicators is provided in the supplementary file.

When there were no outliers, EN performed best. In terms of both variable selection accuracy and the log-likelihood, the results of Rwt MTPL-EN were close to those of EN, showing that Rwt MTPL-EN loses little efficiency on datasets without outliers. When outliers are present, the EN estimate is affected by them: compared with the outlier-free case, its FDR changes little but its PSR drops by about 60%, indicating that EN misses a significant percentage of the non-zero variables. MTPL-EN performs better than EN: compared with the outlier-free case, the FDR of Rwt MTPL-EN remains essentially unchanged and its PSR exceeds 50%, and the comprehensive indicator GM shows that its variable selection accuracy remains stable.
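The four indicators reduce to elementary set arithmetic; a small sketch (the helper names are ours, for illustration) makes the definitions concrete:

```python
def gm(psr, fdr):
    """GM = sqrt(PSR * (1 - FDR)): high PSR and low FDR both raise GM."""
    return (psr * (1.0 - fdr)) ** 0.5

def screening_metrics(true_outliers, flagged, n):
    """Sn = fraction of true outliers that were flagged;
    FPR = fraction of normal samples wrongly flagged as outliers."""
    tp = len(true_outliers & flagged)
    sn = tp / len(true_outliers)
    fpr = len(flagged - true_outliers) / (n - len(true_outliers))
    return sn, fpr
```

For example, flagging one of two true outliers plus one normal sample out of ten observations gives Sn = 0.5 and FPR = 1/8 = 0.125.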
As for the accuracy of outlier identification, Raw MTPL-EN flagged 25% of the samples as outliers, more than the percentage actually set in the simulation, so its FPR was high, at 18%. Rwt MTPL-EN refines the outlier set through the reweighting step, so it flags fewer outliers than Raw MTPL-EN, with an FPR below 7% and a sensitivity higher than EN's (Sn, 0.75 vs 0.42); see Figure 2. Taken together, EN detected the fewest outliers and had the smallest FPR, but also the lowest sensitivity, whereas Rwt MTPL-EN kept its FPR below 7% while its sensitivity exceeded 70%.
In terms of the log-likelihood, when there were outliers the log-likelihood of MTPL-EN was much larger than that of EN, and the log-likelihood of Rwt MTPL-EN was higher than those of Raw MTPL-EN and EN, indicating that Rwt MTPL-EN had the highest prediction accuracy in the presence of outliers.
As can be seen from Figure 2, EN behaved differently under asymmetric outliers compared with symmetric ones. When 90% of the outliers "lived too long" relative to their prognostic index, the variable selection accuracy of EN was worse (GM 0.28 vs 0.29) and its ability to identify outliers deteriorated (Sn 0.24 vs 0.42, FPR 0.03 vs 0.003). In contrast, the variable selection accuracy of Rwt MTPL-EN improved (GM 0.43 vs 0.39), as did its outlier identification (Sn 0.94 vs 0.75, FPR 0.07 vs 0.07).
When 90% of the outliers "died too early" relative to their prognostic index, the impact on EN was smaller than with symmetric outliers: its variable selection accuracy was higher (GM 0.40 vs 0.28) and its outlier identification remained almost unchanged (Sn 0.51 vs 0.42, FPR 0.001 vs 0.003). The effect on the robust Rwt MTPL-EN was similar to the symmetric case: variable selection accuracy improved (GM 0.44 vs 0.39) while outlier identification decreased (Sn 0.66 vs 0.75, FPR 0.03 vs 0.07).
Compared with outliers that "failed too early", outliers that "lived too long" degraded EN more. Rwt MTPL-EN identified "lived too long" outliers easily, and its variable selection performance was not affected by the outliers.
In short, when there were no outliers the results of MTPL-EN were close to those of EN. With 10% outliers, the variable selection and prediction accuracy of MTPL-EN exceeded those of EN. Rwt MTPL-EN identified more outliers and kept its FPR within 7%. Compared with "failed too early" outliers, "lived too long" outliers degraded EN more, and were easier for Rwt MTPL-EN to identify. In all cases, the performance of Rwt MTPL-EN remained stable under outliers.

Results of scenario 2
As can be seen from Figure 3, when the censoring rate was 15% or 25%, the variable selection accuracy of Rwt MTPL-EN was higher than at a 35% censoring rate, and the accuracy of outlier identification also improved.
When the censoring rate was 45%, the accuracy of variable selection and outlier identification decreased, but EN remained below Rwt MTPL-EN in variable selection, outlier identification and prediction accuracy. The gap in outlier identification between EN and Rwt MTPL-EN was large: at a 45% censoring rate, the outlier detection sensitivity of EN was 0.370 with an FPR of 0.007, while the sensitivity of Rwt MTPL-EN was 0.748 with an FPR of 0.088. This shows that whether the censoring rate is high or low, Rwt MTPL-EN identifies outliers accurately.

Results of scenario 3
As can be seen from Figure 4, in case B, where the outliers deviate farther in the response, the outlier detection results of Rwt MTPL-EN were better than in case A (Sn 0.83 vs 0.75, FPR 0.07 vs 0.07). The PSR and FDR of variable selection for Rwt MTPL-EN and EN changed little. The prediction accuracy of EN was lower than in case A (log-likelihood, -876 vs -804). Rwt MTPL-EN outperformed EN in outlier detection, variable selection and prediction.
In case C, where the outliers also deviate in the predictors compared with case A, the outlier detection accuracy of EN decreased (Sn 0.344 vs 0.417, FPR 0.001 vs 0.003), and its variable selection accuracy changed markedly (PSR 0.578 vs 0.261, FDR 0.806 vs 0.563); this is because the number of variables selected by EN increased from 47.9 to 185.6 on average, far more than the 60 true non-zero variables. The number of variables selected by Rwt MTPL-EN changed little, and its outlier identification and log-likelihood also remained stable. In case D, where the outliers deviate farther in the response than in case C, the results of EN, Raw MTPL-EN and Rwt MTPL-EN were similar to those in case C.
As seen from the above simulation experiments, the results of EN were impacted by outliers, whereas the outlier detection accuracy of Rwt MTPL-EN was much higher than that of EN, and its prediction and variable selection accuracy were also higher, showing that Rwt MTPL-EN can resist outliers in both the predictors and the response.
In addition, we simulated trimming ratios 1 − η of 5%, 15%, 25% and 35%; the results are shown in Figure S2 of the supplementary materials. Since the proportion of outliers was 10%, Rwt MTPL-EN showed little difference in outlier detection and variable selection accuracy at trimming ratios of 15%, 25% and 35%, although at 35% the prediction accuracy was slightly lower, and at 5% the outlier detection sensitivity decreased. Thus Rwt MTPL-EN remains stable when the trimming ratio exceeds the outlier proportion; when the trimming ratio is much higher than the outlier proportion, outlier detection and variable selection remain stable but prediction accuracy decreases slightly, and when the trimming ratio is below the outlier proportion, outlier detection suffers. In practice, the percentage of outliers is usually below 25%, so we recommend a trimming ratio of 25%, large enough that the result is not affected by outliers. If practical experience suggests that the data may contain more than 25% outliers, the trimming ratio should be increased.
We also simulated different sample sizes n = 300 and 500 and different dimensions p = 600 and 1000, as shown in Figure S3 of the supplementary material. The accuracy of variable selection and outlier detection was best when n = 500 and p = 600, and decreased as the sample size n decreased or the dimension p increased.

Glioma gene expression dataset
Example data were obtained from gene microarray expression data of 301 patients with glioma in China (CGGA, http://www.cgga.org.cn/); 298 patients were analyzed after removing 3 patients with missing survival time. Rwt MTPL-EN and EN were used to screen the genes that affect the prognosis of glioma and to detect possible outliers, and the results of the two methods were compared. The parameter settings of the two methods were the same as in the simulation evaluation.
The median survival time of the dataset was 38.3 months, the range was 0.7–158.7 months, and the censoring rate was 37.58%. There were 116 cases of WHO grade II glioma, 66 of grade III, 126 of grade IV, and 3 with missing WHO classification. The clinical variables included TCGA subtype, PRS type, histological type, grade, gender, age, radiation status, chemotherapy status, IDH mutation status, 1p/19q co-deletion status and so on. Gene expression profiles were measured on an Agilent Whole Human Genome Array and covered 19,416 genes.

Application of EN to glioma dataset
In this study, EN and Rwt MTPL-EN were applied to the glioma gene expression dataset (n = 298, p = 19,416), with the same parameter settings as in the simulation study.
Eighty-seven genes were identified by EN; they are listed in the supplementary file.
As can be seen from Table 2 and Figures 5 and 6, EN identified 12 outliers, only one of which "lived too long" relative to the prognostic index (PI) estimated by EN; the remaining 11 all "failed too early". As shown in Figure 5, these 11 outliers were also the deceased individuals with the shortest survival times in the whole sample. However, most of the outliers that "failed too early" had PI greater than 0, and 5 even had PI greater than 1, indicating a relatively high risk of death estimated from the gene expression data, consistent with their short survival times. So it is not appropriate to consider them "outliers" on the basis of the gene expression data. From Table 2, according to the clinical characteristics of these outliers, 8 of the 11 individuals were classified as WHO grade IV; in terms of age, six were over 50 years old and eight over 40; by histological type, 8 were glioblastomas (GBM or sGBM), which have a poor prognosis; and by IDH status, 7 were wild-type, also associated with poor prognosis. So, for most of the outliers that "failed too early", the illness was indeed serious, and it is also unreasonable to identify them as outliers from their clinical characteristic data.
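The deviance residuals plotted in Figure 6 follow the standard Cox-model definition built on martingale residuals. The sketch below illustrates that formula only; it assumes the baseline cumulative hazard at each subject's time is already available (e.g., from a Breslow estimate), and the two example subjects are hypothetical.

```python
import numpy as np

def deviance_residuals(event, cumhaz, pi):
    """Deviance residuals for a Cox model.

    event:  1 if the event was observed, 0 if censored
    cumhaz: estimated baseline cumulative hazard at each subject's
            time (assumed given, e.g. from a Breslow estimate)
    pi:     prognostic index x'beta for each subject
    """
    m = event - cumhaz * np.exp(pi)           # martingale residuals
    inner = m.astype(float).copy()
    obs = event == 1
    inner[obs] += np.log(1.0 - m[obs])        # log(event - m), events only
    return np.sign(m) * np.sqrt(-2.0 * inner)

# Two subjects with PI = 0: an observed event exactly at its expected
# time (residual 0) and a censored subject (negative residual)
d = deviance_residuals(np.array([1, 0]), np.array([1.0, 0.5]), np.array([0.0, 0.0]))
# d ≈ [0.0, -1.0]
```

A large positive residual corresponds to a subject who "failed too early" relative to the model, and a large negative residual to one who "lived too long".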

Results of Rwt MTPL-EN applied to glioma dataset
As can be seen from Table 3, EN identified 87 genes and 12 outliers; its log-likelihood was −833.4, its C-index was 0.842, and the AUC at the median survival time of 1164 days was 0.793. Rwt MTPL-EN screened 56 genes and identified 18 outliers, and its prediction indices on the full data were lower than those of EN. However, on the subset with the 18 outliers removed, the log-likelihood, C-index and AUC of Rwt MTPL-EN were all higher than those of EN.
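The C-index reported in Table 3 measures the concordance between the prognostic index and survival time. A simple O(n²) sketch of Harrell's estimator is given below for reference (tied times are ignored for brevity; the example data are made up):

```python
def c_index(time, event, pi):
    """Harrell's concordance index, written O(n^2) for clarity.

    A pair is usable when the subject with the shorter time had an
    observed event; it is concordant when that subject also has the
    higher prognostic index (ties in the index count 0.5)."""
    num = den = 0.0
    n = len(time)
    for i in range(n):
        for j in range(n):
            if time[i] < time[j] and event[i] == 1:
                den += 1.0
                if pi[i] > pi[j]:
                    num += 1.0
                elif pi[i] == pi[j]:
                    num += 0.5
    return num / den

# Perfect ranking: shorter survival always has the higher index
c = c_index([1, 2, 3, 4], [1, 1, 1, 1], [4, 3, 2, 1])   # c == 1.0
```

A value of 0.5 would indicate a prognostic index no better than chance, which is why C-indices above 0.8, as in Table 3, indicate good discrimination.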
As can be seen from Table 4 and Figures 7 and 8, Rwt MTPL-EN identified 18 outliers, of which 3 "failed too early" relative to the prognostic index estimated by Rwt MTPL-EN and 15 "lived too long". Of the 15 outliers that "lived too long", 13 had prognostic indices greater than 0, 10 greater than 1, and 5 greater than 2; for 8 of them the absolute value of the residual exceeded 4. This indicates that most outliers who "lived too long" had a high risk of death estimated from the gene expression data, i.e., the correlation pattern between prognosis and covariates in these outliers differed from that in most individuals. According to the clinical characteristics, 3 of the 15 outliers that "lived too long" were WHO grade IV and 6 were grade III, and in terms of IDH mutation type, 9 were wild-type with poor prognosis. This means that most of the 15 individuals who "lived too long" were obvious outliers: high risk, yet long survival. Among the 3 outliers that "failed too early" identified by Rwt MTPL-EN, one (CGGA) had a survival time of 21 days with death as the outcome but a very low prognostic index of only −2.41; judging from the covariate values, this individual should have survived longer, again showing a correlation pattern between survival time and covariates different from that of most individuals. The other two outliers that "failed too early", CGGA_1059 and CGGA_11, had short survival times of 21 days and 155 days respectively, but higher prognostic indices of 0.98 and 1.86, indicating that they were not obvious outliers.
Except for CGGA_640, EN did not identify any other outliers that "lived too long". As shown in the simulation experiments, outliers that "lived too long" were hardly detected by EN. […] (Hamilton, S. Bailey, A. Malovannaya, D. Chan, J. Qin and B. W. O'Malley [41]), CCDC34 [42], CKS2 [43], KIAA0141 [44] and USP34 [45]: as potential genes related to the prognosis of glioma, they provide reference information for subsequent experimental verification.
Through the analysis of the glioma gene expression data, only 28 genes selected by Rwt MTPL-EN and EN coincided, indicating that some samples in this dataset had a large impact on the EN estimate. The dependence structure between these patients' survival times and covariates differs from that of other patients; that is, they "failed too early" or "lived too long" relative to the model's estimated risk of death. After removing the outliers, the prediction accuracy of Rwt MTPL-EN was higher than that of EN. In terms of identified outliers, most of those identified by EN "failed too early", but most were not obvious outliers according to their PI and clinical variables, and only one outlier that "lived too long" was identified, indicating that EN's sensitivity in identifying outliers that "lived too long" was low. Most of the outliers identified by Rwt MTPL-EN "lived too long", and most of them were obvious outliers given their clinical characteristics and prognostic index. The simulation experiments show that outliers that "lived too long" have the greater impact on the accuracy of EN's variable selection; Rwt MTPL-EN has an advantage in identifying such outliers, so their influence on the EN estimate can be removed by Rwt MTPL-EN.

Discussion
It is a great challenge to find biomarkers related to prognosis in high-dimensional genomic data while resisting the influence of noise and sample heterogeneity introduced in the experimental process, so as to obtain a robust estimate. In this article, a robust penalized Cox model based on maximum trimmed partial likelihood estimation was established, and an AR-Cstep algorithm combining a Metropolis-type acceptance-rejection step with the C-step algorithm was proposed to solve the estimation of MTPL-EN. In simulated high-dimensional datasets with outliers, the robust MTPL-EN performed better than the non-robust EN-type penalized Cox regression in variable selection, outlier detection and prediction; moreover, Rwt MTPL-EN was better than Raw MTPL-EN. When outliers in the response deviated farther, the number of variables selected by EN decreased; when outliers also occurred in the predictors, the number of variables selected by EN far exceeded the number of true non-zero variables. Both situations decreased the variable selection accuracy of EN. Rwt MTPL-EN, however, remained stable under all these conditions, indicating that it can resist outliers in both the response and the predictors. In the analysis of the glioma gene expression data, the variables selected by Rwt MTPL-EN differed from those of EN, and a higher proportion of genes related to glioma were identified by Rwt MTPL-EN. After removing outliers, the prediction accuracy of Rwt MTPL-EN was higher than that of EN, and more outliers that "lived too long" relative to the prognostic index were identified.
The robust penalized Cox model based on trimming selects variables directly in high-dimensional datasets. Compared with robust estimation after dimension reduction, it avoids the influence of outliers on the accuracy of the dimensionality reduction; compared with residual analysis, it avoids the "masking" and "swamping" effects on the estimate; and compared with other robust methods such as Huber's and Tukey's loss functions, it can resist outliers in both the response and the predictors.
The AR-Cstep algorithm, which combines acceptance-rejection with the C-step algorithm, solves the problem that the C-step algorithm does not converge because the penalty parameter changes during the iteration. To reach the maximum of the trimmed likelihood function while avoiding local optima, a Metropolis-type probabilistic acceptance-rejection step was incorporated. This improvement allows the AR-Cstep algorithm to be generalized to other robust penalized regression models, such as robust Adaptive LASSO, Group LASSO, SCAD, MCP and so on. The improved AR-Cstep algorithm based on residuals no longer relies on separating individual contributions from the model's likelihood function; instead it uses residuals to measure the individual contributions. This idea can also be generalized to the robust versions of similar models, such as Cox regression in low-dimensional settings and conditional logistic regression, whose likelihood functions make it difficult to separate an individual's contribution to the objective function; the residual-based AR-Cstep algorithm can then be used.
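The control flow of AR-Cstep can be illustrated on a toy problem. The sketch below substitutes an ordinary least-squares fit for the penalized Cox fit, so it shows only the iteration scheme, not the paper's estimator: each C-step refits on the h observations with the smallest residuals, and a Metropolis-type rule occasionally accepts a worse, randomly perturbed subset to escape local optima. The values of h, n_iter, temp and p_perturb are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_and_residuals(X, y, idx):
    """Stand-in for the penalized Cox fit: ordinary least squares on
    the subset idx, returning absolute residuals for the full sample."""
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return np.abs(y - X @ beta)

def ar_cstep(X, y, h, n_iter=100, temp=1.0, p_perturb=0.2):
    """Sketch of the AR-Cstep control flow on a toy problem."""
    n = len(y)
    idx = np.sort(rng.choice(n, h, replace=False))   # random start
    loss = fit_and_residuals(X, y, idx)[idx].sum()
    best_idx, best_loss = idx, loss
    for _ in range(n_iter):
        res = fit_and_residuals(X, y, idx)
        cand = np.argsort(res)[:h]                   # C-step subset
        if rng.random() < p_perturb:                 # random exploration:
            out = np.setdiff1d(np.arange(n), cand)   # swap one index
            cand[rng.integers(h)] = rng.choice(out)
        cand = np.sort(cand)
        cand_loss = fit_and_residuals(X, y, cand)[cand].sum()
        # Metropolis-type rule: always accept improvements, sometimes
        # accept a worse subset
        if cand_loss <= loss or rng.random() < np.exp((loss - cand_loss) / temp):
            idx, loss = cand, cand_loss
        if loss < best_loss:
            best_idx, best_loss = idx, loss
    return best_idx

# Toy data: 40 observations, 3 gross outliers at indices 0-2
X = rng.normal(size=(40, 2))
y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=40)
y[:3] += 50.0
best = ar_cstep(X, y, h=30)    # trimmed subset excludes the outliers
```

Because the retained subset is re-derived from the residuals at every step, the scheme keeps working even when the refit changes the objective between iterations, which is the situation that breaks the plain C-step when the penalty parameter is re-tuned.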
In this paper, we found that outliers that "lived too long" and those that "failed too early" had different effects on EN, with outliers that "lived too long" having the greater impact on the EN estimate. The same holds in Cox regression. Valsecchi M et al. [8] explained why long-term survivors have a greater impact on Cox regression, and the explanation also applies to penalized Cox regression. First, long-term survivors belong to many risk sets (those of all individuals who fail before them). Second, the risk set of an early failure is usually very large, whereas individuals who fail at the end of the study correspond to very small risk sets. The risk sets of two groups with different exposure states are similar in size at the beginning; over time, as the failure rate of the high-risk group exceeds that of the low-risk group, the relative sizes of the two groups' risk sets change accordingly. In the end, the risk sets of the two groups are highly unbalanced, and the risk set of the high-risk group may contain only one or two individuals. So removing or adding such an individual has a great impact on the estimated hazard ratio.
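The risk-set argument above can be made concrete by counting risk-set sizes: a long-term survivor appears in the risk set of every earlier failure, while the last failures are compared within tiny risk sets. A minimal sketch, with made-up times:

```python
import numpy as np

def risk_set_sizes(time, event):
    """Size of the risk set at each observed event time: the number
    of subjects still under observation at that time."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event)
    return {float(t): int((time >= t).sum())
            for t in np.sort(time[event == 1])}

# Ten subjects who all fail, at times 1..10: the first failure is
# compared against all ten, the last against a risk set of one
sizes = risk_set_sizes(range(1, 11), [1] * 10)
# sizes[1.0] == 10, sizes[10.0] == 1
```

The subject failing at time 10 sits in all ten risk sets, so removing that one long-term survivor perturbs every partial-likelihood term, whereas removing the earliest failure perturbs only the first.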
How to follow up on the identified outliers needs to be explored in combination with the application. Peng S et al. [10] compared the integrated genomics of long-term and short-term survival glioma patients to discover molecular markers of different prognoses after standard treatment, so that individualized treatment can avoid treatment failure caused by an inappropriate therapy. According to Burrell, R. A., N. McGranahan, J. Bartek and C. Swanton [6], phenotypic heterogeneity is determined not solely by genetic distinctions between subclones, but also by stochastic events in gene expression and protein stability, epigenetic divergence and micro-environmental fluctuations. There is a crucial need to understand the mechanisms driving genomic instability so that therapeutic approaches to limit cancer diversity, adaptation and drug resistance can be developed.
In practice, when a penalized Cox model is used to screen prognostic biomarkers, it is often impossible to know whether there are outliers in the data. So both robust and non-robust models can be fitted. If the variable selection results of the two models are similar, the non-robust model is not affected by outliers and there are no outliers in the data; in this case the non-robust model can be used, because it is more efficient. However, if the results of the two models differ substantially, some individuals do not fit the model and the penalized Cox regression estimate is distorted; in that case the robust Cox model is more suitable for the data.
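The model-comparison check suggested above amounts to comparing the two selected gene sets. A trivial sketch (the gene names are hypothetical):

```python
def selection_overlap(genes_a, genes_b):
    """Compare the variable-selection results of two fits: return the
    shared genes and the Jaccard similarity of the two selected sets."""
    a, b = set(genes_a), set(genes_b)
    shared = a & b
    return sorted(shared), len(shared) / len(a | b)

# Hypothetical gene lists; a low similarity between the robust and
# non-robust fits suggests influential observations may be present
shared, jac = selection_overlap(["g1", "g2", "g3", "g4"], ["g2", "g3", "g5"])
# shared == ['g2', 'g3'], jac == 0.4
```

In the glioma analysis, only 28 of the genes selected by EN (87) and Rwt MTPL-EN (56) coincided, a low overlap of exactly this kind, which pointed to influential observations in the data.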
In this article, we assumed that the clean data without outliers satisfy the proportional hazards assumption of the Cox model. Time-dependent covariates often need to be specified according to practical experience, which requires a good understanding of the impact of the associated biomarkers on prognosis. At the stage of broad screening of candidate biomarkers from high-dimensional data, when the practical problem is not yet well understood, the proportional hazards model can serve as a preliminary choice: first estimate whether a variable is related to the prognosis, and then determine with a more sophisticated model whether its influence changes over time.

Conclusions
The robust penalized Cox model based on trimming (Rwt MTPL-EN) established in this paper can select variables more accurately than the non-robust EN model when outliers exist, and it can resist outliers in both the response and the predictors. Rwt MTPL-EN can also identify outliers more accurately, especially outliers that "lived too long", which have the greater impact on the variable selection accuracy of EN. The AR-Cstep algorithm established in this article solves the problem that the C-step algorithm does not converge when the penalty parameters of the penalized regression change, and it no longer depends on separating individual contributions from the model's likelihood function. This improvement allows the AR-Cstep algorithm to be generalized to more models.


Figure 1.
Figure 1. Graphical representation of outlier settings in simulated data (scatter plot of logarithmic survival time and prognostic index PI). (Notes: PI = x′β. Black solid dots: normal points with outcomes; black hollow dots: censored normal points; red solid triangles: outliers with outcomes; red hollow triangles: censored outliers.)

Figure 4.
Figure 4. Comparison of the results between EN and MTPL-EN when there were outliers in the response and/or the predictors (n = 300, p = 1000).

Figure 5.
Figure 5. Scatter plot of the prognostic index PI estimated by EN versus survival time.

Figure 6.
Figure 6. Deviance residuals corresponding to the model estimated by EN. (Note: black hollow dots: normal points; red hollow triangles: outliers.)

Figure 7. Figure 8.
Figure 7. Scatter plot of the prognostic index PI estimated by Rwt MTPL-EN versus survival time. (Note: black solid dots: normal points with outcomes; black hollow dots: censored normal points; red solid triangles: outliers with outcomes; red hollow triangles: censored outliers.)

Table 2.
Outliers identified by EN and their corresponding values on clinical variables.

Table 3.
Number of genes, outliers, and prediction indices estimated by EN and Rwt MTPL-EN.

Table 4.
Outliers identified by Rwt MTPL-EN and corresponding values in clinical variables.