Elsevier

Computers in Biology and Medicine

Volume 91, 1 December 2017, Pages 159-167

Comparison of variable selection methods for high-dimensional survival data with competing events

https://doi.org/10.1016/j.compbiomed.2017.10.021

Abstract

Background

In the era of personalized medicine, it is essential to identify gene signatures for each event type in the context of competing risks in order to improve risk stratification and treatment strategy. Until recently, little attention was paid to the performance of high-dimensional selection methods in deriving molecular signatures in this context. In this paper, we investigate the performance of two selection methods developed in the framework of high-dimensional data and competing risks: random survival forest and a boosting approach for fitting proportional subdistribution hazards models.

Methods

Using data from bladder cancer patients (GSE5479) and simulated datasets, the stability and prognostic performance of the two methods were evaluated using a resampling strategy. For each training sample size, the data set was split into 100 pairs of training and validation sets. Molecular signatures were developed in the training sets by the two selection methods and then applied to the corresponding validation sets.

Results

Random survival forest and the boosting approach have comparable performance for the prediction of survival data, with few selected genes in common. Nevertheless, many different sets of genes are identified by the resampling approach, and individual genes occur in the signatures with very low frequency. Also, the smaller the training sample size, the lower the stability of the signatures.

Conclusion

Random survival forest and the boosting approach give good predictive performance, but the gene signatures are very unstable. Further work is needed to propose adequate strategies for the analysis of high-dimensional data in the context of competing risks.

Introduction

Over the last decade, gene signatures based on microarray data have been on the rise in oncology [1], [2]. The main objective of gene signatures is to improve the management of cancer patients through prognostication and treatment prediction [3]. Several studies have demonstrated that gene signatures are not unique and depend strongly on both the selection of patients and the regression models used [4], [5], [6]. Gene signatures are generally developed and validated using time-to-event endpoints such as metastasis-free survival, disease-free survival, or overall survival. Because several event types are included in their definition, these endpoints can be considered composite [7]. In order to improve risk stratification and treatment strategy, it would be valuable to identify gene signatures for each event type in the context of competing risks [8], [9]. For example, loco-regional recurrence is becoming less common in breast cancer. To better guide optimal loco-regional treatment, it is important to identify gene signatures that specifically predict the risk of loco-regional recurrence. Breast cancer patients are also at risk of other event types, such as distant metastasis and death, which can preclude the occurrence of loco-regional recurrence. The management of various other cancers could also benefit greatly from the development of gene signatures for a given event type.

Recently, several regression methods for handling high-dimensional data have been extended to the competing risks setting. However, little attention has been paid to the performance of such methods in deriving molecular signatures for predicting cumulative incidence in this setting. One popular approach in the context of competing risks with high-dimensional data is cause-specific hazard modeling: a Cox proportional hazards model is fitted by penalized regression for the event of interest, treating individuals who fail from competing events as censored observations [10]. However, a covariate that reduces the cause-specific hazard of a competing risk can indirectly increase the cumulative incidence of the event of interest [11]. Indeed, the cumulative incidence represents the probability of the event of interest in the presence of competing risks. For low-dimensional data, the Fine and Gray model, which is an extension of the Cox model, has been proposed to model the subdistribution hazard [12]. In high-dimensional data (number of covariates >> number of observations), the Fine and Gray model cannot be fitted to identify the most predictive genes, and less traditional approaches are required. Methods based on random forests have recently been adapted for survival analysis in the presence of competing risks [13], with a modified weighted log-rank splitting rule modeled on Gray's test [14]. In addition, Binder et al. [15] have proposed a gene selection method based on the Fine and Gray model with a boosting approach. These methods, now implemented in statistical packages, are becoming increasingly popular for the analysis of competing risks data. But, to our knowledge and in contrast to classical survival methods, no previous work has compared these two methods on criteria such as stability and prognostic ability.
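The distinction between the two hazards, and the indirect effect of a competing risk on the cumulative incidence, can be made explicit with a standard textbook formulation. The notation below is an assumption introduced here for illustration and is not taken from the paper: T is the time to the first event, Δ its type, λ_j the cause-specific hazard of event type j with cumulative hazard Λ_j, F_k the cumulative incidence of event type k, γ_k its subdistribution hazard, and x a covariate vector.

$$
\lambda_j(t) = \lim_{dt \to 0} \frac{P(t \le T < t + dt,\ \Delta = j \mid T \ge t)}{dt},
\qquad
F_k(t) = P(T \le t,\ \Delta = k) = \int_0^t \exp\Big(-\sum_{j} \Lambda_j(u)\Big)\, \lambda_k(u)\, du
$$

$$
\gamma_k(t) = -\frac{d}{dt} \log\{1 - F_k(t)\},
\qquad
\gamma_k(t \mid x) = \gamma_{k,0}(t)\, \exp(\beta^\top x)
$$

Because F_k(t) depends on every cause-specific hazard through the overall survival term exp(-Σ_j Λ_j(u)), a covariate that lowers λ_j for a competing event j ≠ k raises that term and therefore F_k(t), even when λ_k itself is unchanged. The Fine and Gray model instead places the proportional hazards assumption on γ_k, so that the sign of each coefficient in β translates directly into the direction of the effect on F_k.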

The main objective of this publication is to compare different selection methods for high-dimensional time-to-event data in the context of competing risks, using a published data set on bladder cancer and simulated datasets. After presenting an example application of these methods to the bladder cancer data, we used a resampling strategy to evaluate both gene selection and predictive accuracy and to explore the effect of the training set sample size on performance.
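The resampling idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: `top_k_by_correlation` is a hypothetical stand-in selector (it is neither random survival forest nor boosting), the outcome is a toy continuous variable rather than competing risks data, and the function names, toy dimensions, and seeds are all invented for the example. What it shows is how the selection frequency of each gene can be tabulated over repeated training subsets of a given size.

```python
import numpy as np


def top_k_by_correlation(X, y, k=10):
    """Hypothetical stand-in selector: keep the k features whose absolute
    Pearson correlation with y is largest. NOT RSF or boosting."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = np.abs(Xc.T @ yc) / np.where(denom == 0, 1.0, denom)
    return set(np.argsort(corr)[-k:].tolist())


def selection_frequency(X, y, select, n_train, n_splits=100, seed=0):
    """Fraction of the n_splits random training subsets (drawn without
    replacement, size n_train) in which each feature is selected."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_splits):
        train = rng.choice(n, size=n_train, replace=False)
        for g in select(X[train], y[train]):
            counts[g] += 1
    return counts / n_splits


# Toy data (assumption): 300 samples, 500 "genes", only the first two informative.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 500))
y = X[:, 0] + X[:, 1] + rng.normal(size=300)

freq_small = selection_frequency(X, y, top_k_by_correlation, n_train=50)
freq_large = selection_frequency(X, y, top_k_by_correlation, n_train=200)
```

Comparing `freq_small` and `freq_large` gene by gene gives a simple view of how stability changes with the training sample size; a real analysis would replace the stand-in selector and outcome with the competing risks methods and data under study.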

General principles: competing risks setting

Fundamentals of competing risks have been extensively reviewed in the literature [11], [16], [17]. In a competing risks setting, patients are at risk of different event types (say, k of them). We only observe the pair of variables (Y, Δ), where Y is the time to the first event (or to last follow-up) and Δ is the type of the first event:

$$
\Delta =
\begin{cases}
0, & \text{censored} \\
1, & \text{event of type } 1 \\
\;\vdots & \\
k, & \text{event of type } k
\end{cases}
$$

One quantity of interest is the cumulative incidence of event k, denoted Fk(t), which corresponds to the
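As an illustration of how the cumulative incidence of a given event type can be estimated from observed pairs (Y, Δ), the following Python sketch implements the usual nonparametric estimator, which accumulates, at each event time, the overall survival just before that time multiplied by the proportion of at-risk patients failing from the cause of interest. The function name `cumulative_incidence` and the toy data are assumptions made for the example and do not come from the paper.

```python
import numpy as np


def cumulative_incidence(times, events, cause):
    """Nonparametric cumulative incidence estimate for one event type.

    times  : observed times Y
    events : event indicator (0 = censored, 1..k = type of first event)
    cause  : event type of interest
    Returns the distinct event times and the estimate at each of them.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events > 0])

    surv = 1.0  # overall (any-cause) survival just before the current time
    cif = 0.0
    estimates = []
    for t in event_times:
        at_risk = np.sum(times >= t)
        d_cause = np.sum((times == t) & (events == cause))
        d_any = np.sum((times == t) & (events > 0))
        cif += surv * d_cause / at_risk   # increment for the cause of interest
        surv *= 1.0 - d_any / at_risk     # Kaplan-Meier update, all causes
        estimates.append(cif)
    return event_times, np.array(estimates)


# Toy data (assumption): five patients, two event types, two censored.
toy_times = [1, 2, 3, 4, 5]
toy_events = [1, 2, 0, 1, 0]
grid, cif_1 = cumulative_incidence(toy_times, toy_events, cause=1)
_, cif_2 = cumulative_incidence(toy_times, toy_events, cause=2)
```

A useful sanity check is that, at any time, the estimates for all event types sum to one minus the any-cause Kaplan-Meier survival. Treating competing events as censored in a Kaplan-Meier analysis of a single cause does not have this property and overestimates the probability of that cause.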

Application on bladder cancer data

To illustrate their use, the selection methods were applied to the bladder cancer data set with a training sample size of 100 (1/3 of the overall data set). Twenty-six patients experienced an event of interest (23 before 5 years), and a competing event occurred in 16 patients. RSF and boosting identified 68 and 4 genes, respectively, with none in common. Both the risk score and the risk groups were then tested in the validation set. The C-index and the Brier score for risk scores were

Discussion

The main objective of this work was to compare the ability of existing methods to generate stable gene signatures and their prognostic capacity, in order to highlight the main problems encountered when developing gene signatures for high-dimensional survival data with competing events. Using a real case study and 4 simulated data sets, we have discussed the main advantages and limitations of each approach. The main conclusion of this work is that RSF and boosting give correct predictive performance when the

Conflict of interest

The authors have no conflict of interest to disclose regarding this work.

Acknowledgements

This project was supported by Institut National Du Cancer (France): (Award code: INCa_081,2012). JP Delord and T Filleron were partly supported by the CAPTOR academic project: ANR-11-PHUC-0001.

References (43)

  • C.A. Drukker et al. Gene expression profiling to predict the risk of locoregional recurrence in breast cancer: a pooled analysis. Breast Cancer Res. Treat. (2014 Dec)
  • A.P. Mitra et al. Discovery and validation of novel expression signature for postcystectomy recurrence in high-risk bladder cancer. J. Natl. Cancer Inst. (2014 Nov)
  • J.J. Dignam et al. Choice and interpretation of statistical tests used when competing risks are present. J. Clin. Oncol. (2008 Aug 20)
  • J.P. Fine et al. A proportional hazards model for the subdistribution of a competing risk. J. Am. Stat. Assoc. (1999)
  • H. Ishwaran et al. Random survival forests for competing risks. Biostatistics (2014)
  • R.J. Gray. A class of K-sample tests for comparing the cumulative incidence of a competing risk. Ann. Stat. (1988)
  • H. Binder et al. Boosting for high-dimensional time-to-event data with competing risks. Bioinformatics (2009 Apr 1)
  • J.J. Dignam et al. The use and interpretation of competing risks regression models. Clin. Cancer Res. (2012 Apr 15)
  • R.L. Prentice et al. The analysis of failure times in the presence of competing risks. Biometrics (1978 Dec)
  • L. Dyrskjøt et al. A molecular signature in superficial bladder carcinoma predicts clinical outcome. Clin. Cancer Res. (2005)
  • H. Ishwaran et al. Random survival forests. Ann. Appl. Stat. (2008)