Comparison of variable selection methods for high-dimensional survival data with competing events
Introduction
Over the last decade, gene signatures based on microarray data have been on the rise in oncology [1], [2]. The main objective of gene signatures is to improve the management of cancer patients through prognostication and treatment prediction [3]. Several studies have demonstrated that gene signatures are not unique and depend strongly on both patient selection and the regression models used [4], [5], [6]. Gene signatures are generally developed and validated using time-to-event endpoints such as metastasis-free survival, disease-free survival, or overall survival. Because several event types are included in their definition, these endpoints can be considered composite [7]. To improve risk stratification and treatment strategy, it would be valuable to identify gene signatures for each event type in the context of competing risks [8], [9]. For example, loco-regional recurrence is becoming less common in breast cancer. To better guide loco-regional treatment, it is important to identify gene signatures that specifically predict the risk of loco-regional recurrence. Breast cancer patients are also at risk of other event types, such as distant metastasis and death, which can preclude the occurrence of loco-regional recurrence. The management of various other cancers could also benefit greatly from the development of gene signatures for a given event type.
Several regression methods for handling high-dimensional data have recently been extended to the competing risks setting. Until now, however, little attention has been paid to the performance of such methods in deriving molecular signatures for predicting cumulative incidence in that setting. One popular approach to competing risks with high-dimensional data is cause-specific hazard modeling: a Cox proportional hazards model is fitted with a penalized regression for the event of interest, treating individuals who fail from competing events as censored observations [10]. However, a covariate that reduces the cause-specific hazard of a competing risk can indirectly increase the cumulative incidence of the event of interest [11]. Indeed, the cumulative incidence represents the probability of the event of interest in the presence of competing risks. For low-dimensional data, the Fine and Gray model, an extension of the Cox model, has been proposed to model the subdistribution hazard [12]. With high-dimensional data (number of covariates >> number of observations), the Fine and Gray model cannot be fitted to identify the most predictive genes, and less traditional approaches are required. Methods based on random forests have recently been adapted to survival analysis in the presence of competing risks [13], with a modified weighted log-rank splitting rule modeled on Gray's test [14]. In addition, Binder et al. [15] proposed a gene selection method based on the Fine and Gray model with a boosting approach. These methods, now implemented in statistical packages, are becoming increasingly popular for the analysis of competing risks data. However, to our knowledge, and in contrast to classical survival methods, no previous work has compared these two methods on criteria such as stability and prognostic ability.
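To make the difference between censoring competing events and estimating the cumulative incidence concrete, the following is a minimal sketch in Python; it is not the method used in this study. It computes the standard nonparametric (Aalen-Johansen) cumulative incidence estimate alongside the naive 1 - Kaplan-Meier curve obtained when competing events are treated as censored. The function name `incidence_curves`, the event coding (0 = censored, 1 = event of interest, 2 = competing event), and the toy data are assumptions introduced for illustration only.

```python
from itertools import groupby


def incidence_curves(times, events, cause=1):
    """Return (time, cumulative incidence, naive 1 - KM) at each event time.

    events: 0 = censored, any other integer = an event type (assumed coding).
    The cumulative incidence is the nonparametric Aalen-Johansen estimate.
    The naive curve is 1 - Kaplan-Meier with competing events censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv_all = 1.0   # probability of being free of ANY event so far
    km_cause = 1.0   # Kaplan-Meier with competing events treated as censored
    cif = 0.0
    rows = []
    for t, group in groupby(data, key=lambda pair: pair[0]):
        codes = [code for _, code in group]
        d_cause = sum(1 for c in codes if c == cause)
        d_any = sum(1 for c in codes if c != 0)
        if d_any:
            # An event of type `cause` at t requires being event-free before t.
            cif += surv_all * d_cause / at_risk
            surv_all *= 1.0 - d_any / at_risk
            km_cause *= 1.0 - d_cause / at_risk
            rows.append((t, cif, 1.0 - km_cause))
        at_risk -= len(codes)
    return rows


# Toy data (illustrative only): 1 = event of interest, 2 = competing event.
toy_times = [1, 2, 3, 4, 5, 6]
toy_events = [1, 2, 1, 2, 1, 2]
for t, cif, naive in incidence_curves(toy_times, toy_events):
    print(f"t={t}  cumulative incidence={cif:.3f}  naive 1-KM={naive:.3f}")
```

On data with competing events, the naive curve lies on or above the cumulative incidence curve, which is why censoring competing events overstates the probability of the event of interest.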
The main objective of this publication is to compare different selection methods for high-dimensional time-to-event data in the context of competing risks, using a published data set on bladder cancer and simulated data sets. After an example application of these methods to the bladder cancer data, a resampling strategy is used to evaluate both gene selection and predictive accuracy and to explore the effect of the training set sample size on performance.
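As a rough illustration of how a resampling strategy can quantify the stability of gene selection, the sketch below repeatedly draws training subsets, applies a selection method to each, and summarizes agreement between the resulting signatures. It is a sketch under stated assumptions, not the procedure of this study: the Jaccard index and selection frequency are common stability summaries chosen here for illustration, the `select` callable is a hypothetical placeholder for any selection method, and `toy_select` and the sample sizes are invented.

```python
import random
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two gene sets (defined as 1.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def resample_signatures(n_patients, train_size, n_resamples, select, seed=0):
    """Draw random training subsets and record the genes selected on each.

    `select` is a hypothetical callable: it receives a list of patient
    indices and returns an iterable of selected gene identifiers.
    """
    rng = random.Random(seed)
    signatures = []
    for _ in range(n_resamples):
        train_idx = rng.sample(range(n_patients), train_size)
        signatures.append(set(select(train_idx)))
    return signatures


def stability_summary(signatures):
    """Return the mean pairwise Jaccard index and per-gene selection frequency."""
    pairs = list(combinations(signatures, 2))
    mean_jaccard = (
        sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0
    )
    counts = Counter(gene for sig in signatures for gene in sig)
    frequency = {gene: n / len(signatures) for gene, n in counts.items()}
    return mean_jaccard, frequency


def toy_select(train_idx):
    """Invented stand-in selector: two stable genes plus sample-dependent noise."""
    noisy = {f"noise_{i % 5}" for i in train_idx[:3]}
    return {"gene_A", "gene_B"} | noisy


# Toy sizes, chosen only for illustration.
signatures = resample_signatures(
    n_patients=300, train_size=100, n_resamples=50, select=toy_select
)
mean_jaccard, frequency = stability_summary(signatures)
print(f"mean pairwise Jaccard index: {mean_jaccard:.3f}")
print("selection frequency:", dict(sorted(frequency.items())))
```

Repeating the same loop for several values of `train_size` gives a simple way to examine how the size of the training set affects the stability of the selected signature.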
General principles: competing risks setting
Fundamentals of competing risks have been extensively reviewed in the literature [11], [16], [17]. In a competing risks setting, patients are at risk of several different event types. For each patient, only a pair of variables is observed: Y, the time to the first event (or to the last follow-up), and the type of that first event.
One quantity of interest is the cumulative incidence of a given event type, which corresponds to the
Application on bladder cancer data
To illustrate the use of these selection methods, we applied them to the bladder cancer data set with a training sample size of 100 (1/3 of the overall data set). Twenty-six patients experienced an event of interest (23 before 5 years), and a competing event occurred in 16 patients. RSF and Boosting identified 68 and 4 genes, respectively, with none in common. Both the risk score and the groups were then tested in the validation set. The C-index and the Brier score for risk scores were
Discussion
The main objective of this work was to compare the ability of existing methods to generate stable gene signatures, as well as their prognostic capacity, in order to highlight the main problems encountered when developing gene signatures for high-dimensional survival data with competing events. Using a real case study and four simulated data sets, we have discussed the main advantages and limitations of each approach. The main conclusion of this work is that RSF and Boosting give acceptable predictive performance when the
Conflict of interest
The authors have no conflict of interest to disclose regarding this work.
Acknowledgements
This project was supported by Institut National Du Cancer (France): (Award code: INCa_081,2012). JP Delord and T Filleron were partly supported by the CAPTOR academic project: ANR-11-PHUC-0001.
References (43)
- et al. Statistical controversies in clinical research: prognostic gene signatures are not (yet) useful in clinical practice. Ann. Oncol. (2016 Dec)
- et al. Prediction of cancer outcome with microarrays: a multiple random validation strategy. Lancet Lond Engl. (2005 Feb 5)
- et al. Guidelines for the definition of time-to-event end points in renal cell cancer clinical trials: results of the DATECAN project. Ann. Oncol. (2015 Dec 1)
- et al. Competing risks data analysis with high-dimensional covariates: an application in bladder cancer. Genomics Proteomics Bioinforma. (2015)
- et al. An R function to non-parametric and piecewise analysis of competing risks survival data. Comput. Methods Programs Biomed. (2010 Oct 1)
- et al. Assessment of performance of survival prediction models for cancer prognosis. BMC Med. Res. Methodol. (2012)
- et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature (2002 Jan 31)
- et al. Comparison of PAM50 risk of recurrence score with oncotype DX and IHC4 for predicting risk of distant recurrence after endocrine therapy. J. Clin. Oncol. Off. J. Am. Soc. Clin. Oncol. (2013 Aug 1)
- et al. Impact of bioinformatic procedures in the development and translation of high-throughput molecular classifiers in oncology. Clin. Cancer Res. Off. J. Am. Assoc. Cancer Res. (2013 Aug 15)
- et al. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics (2005)
- Gene expression profiling to predict the risk of locoregional recurrence in breast cancer: a pooled analysis. Breast Cancer Res. Treat.
- Discovery and validation of novel expression signature for postcystectomy recurrence in high-risk bladder cancer. J. Natl. Cancer Inst.
- Choice and interpretation of statistical tests used when competing risks are present. J. Clin. Oncol.
- A proportional hazards model for the subdistribution of a competing risk. J. Am. Stat. Assoc.
- Random survival forests for competing risks. Biostatistics
- A class of K-Sample tests for comparing the cumulative incidence of a competing risk. Ann. Stat.
- Boosting for high-dimensional time-to-event data with competing risks. Bioinformatics
- The use and interpretation of competing risks regression models. Clin. Cancer Res. Off. J. Am. Assoc. Cancer Res.
- The analysis of failure times in the presence of competing risks. Biometrics
- A molecular signature in superficial bladder carcinoma predicts clinical outcome. Clin. Cancer Res.
- Random survival forests. Ann. Appl. Stat.