The Concordance Index decomposition: A measure for a deeper understanding of survival prediction models

The Concordance Index (C-index) is a commonly used metric in Survival Analysis for evaluating the performance of a prediction model. In this paper, we propose a decomposition of the C-index into a weighted harmonic mean of two quantities: one for ranking observed events versus other observed events, and the other for ranking observed events versus censored cases. This decomposition enables a finer-grained analysis of the relative strengths and weaknesses of different survival prediction methods. The usefulness of this decomposition is demonstrated through benchmark comparisons against classical models and state-of-the-art methods, together with the new variational generative neural-network-based method (SurVED) proposed in this paper. The performance of the models is assessed using four publicly available datasets with varying levels of censoring. Using the C-index decomposition and synthetic censoring, the analysis shows that deep learning models utilize the observed events more effectively than other models, which allows them to maintain a stable C-index across different censoring levels. In contrast, classical machine learning models deteriorate when the censoring level decreases because they fail to improve at ranking events versus other events.


Introduction
More and more data is being collected to improve the estimation of the probability of survival and the expected remaining lifetime, for humans as well as equipment. Making such estimates is the purpose of Survival Analysis: the analysis of the time to an event, e.g., an individual's death or the breakdown of a piece of equipment. While several statistical methods for survival analysis have been developed [1], the availability of large quantities of data has spurred the development of machine learning (ML) based approaches that consider more intricate covariate effects [2].
An important aspect of survival analysis is handling censored cases, e.g., hospitalized patients who do not experience a relapse before the end of a study, equipment that is replaced before a breakdown, or equipment that has not experienced a breakdown yet. Censoring is very common in clinical studies and can occur for various reasons. A patient may not experience the event of interest (for example, death or relapse) during the study's timeframe. A patient might also experience a different event, making it impossible to follow up on the event of interest.
Censoring also makes it more difficult to evaluate goodness-of-fit, since the target variable is not fully observed. Several evaluation metrics have been proposed to assess various aspects of a model's performance [3]. However, the Concordance Index (C-index) is one of the most used metrics as it encompasses both observed events and censored cases. In doing so, it quantifies the rank correlation between actual survival times and a model's predictions. Multiple C-index estimators have been proposed, like Harrell's C-index [4], Uno's C-index [5] (a modified weighted version of Harrell's C-index), and Gönen and Heller's measure [6]. The latter serves as an alternative estimator based on the reversed definition of concordance. Finally, a time-dependent version of the C-index was proposed in [7], which takes the whole survival function into consideration.
Harrell's C-index, the focus of this study, is perhaps the most often used index and has an intuitive and straightforward interpretation. It measures the ability of a predictor to order subjects by estimating the proportion of correctly ordered pairs among all comparable pairs in the dataset. In the presence of censoring, there are two types of times: event times and censoring times. This results in two types of comparable pairs: event vs. event (ee) and event vs. censored (ec). A predictor may not perform equally well in ranking both types of comparable pairs. Comparisons of models' performance using the C-index tend to show few significant differences on datasets with a high ratio of censored cases. More significant differences, however, appear on datasets with low censoring ratios. This phenomenon can be attributed to unseen differences in the models' abilities to rank the two types of pairs, (ee) and (ec).
We therefore propose a decomposition of the C-index into a weighted harmonic mean of two quantities: the C-index for ranking observed events (CI_ee), and a C-index for ranking observed events versus censored cases (CI_ec), weighted by α ∈ [0, 1]. This decomposition makes it easier to understand an algorithm's strengths and weaknesses under different censoring levels. As such, the role of the weighting factor α in assessing the balance of a predictor when dealing with the two categories of pairs, namely (ee) and (ec), becomes clearer.
From a modeling perspective, the primary outcome of such survival analyses is the Survival Function, denoted S(t) = P(T > t), which represents the probability of surviving beyond time t, where T is the event time. Over time, a number of classical statistical and machine learning models have been developed to estimate the survival function S(t) in a non-parametric, semi-parametric, or parametric way [8,9,10,11,12,13]. More recently, however, deep learning models have been introduced for survival time modeling [14,15,16,17,18,19,20,21,22,23]. DeepSurv [15], for example, is a direct extension of the Cox Proportional Hazards (CPH) model [10] that employs a deep neural network in place of the CPH linear predictor. As such, DeepSurv maintains the constraint of the proportional hazards assumption. Unlike DeepSurv, however, some deep learning models discretize the survival timeline. Most notably, DeepHit [16] estimates the probability mass function based on a discrete output. Predictions from such discrete-time models, in contrast to continuous-time models, are constrained by the choice of the upper limit of the output timeline.
Deep generative models facilitate the estimation of data distributions. In the case of survival analysis, deep generative models can be utilized to estimate the distribution of event times in both parametric and non-parametric ways [14,18]. The Deep Adversarial Time-to-Event model (DATE) [17], for example, is a survival model based on Generative Adversarial Networks (GANs) [24]. DATE estimates the event distribution in a non-parametric manner using adversarial training and is trained to generate p(t|x) while penalizing fake samples (x, t).
However, such GAN models suffer from instability issues, such as mode collapse and non-convergence, making them challenging to train and potentially leading to a poor local equilibrium [25,26].
Recently, the Variational Survival Inference (VSI) model [20] was introduced, adopting variational inference to approximate p(t|x). VSI is a discrete-time model that employs two encoders, p(z|x) and q(z|x, t), and encourages these two distributions to be similar using the Kullback-Leibler divergence, which allows the model to better account for interactions between covariates and event times. In addition, VSI's discretized output constrains the prediction timeline to be limited by the maximum time in the training data. To highlight the importance of the interactions between the covariates and the event times captured by the q branch, the authors of the VSI model developed a variant, labeled VSI-NoQ, which lacks the encoder's q branch. It is worth noting that although VSI performs significantly better than VSI-NoQ, the role of the q(z|x, t) branch remains unclear.
In this work, a new survival model is proposed: SurVED (Survival Variational Encoder-Decoder). SurVED is essentially a translation of the Variational Auto-Encoder (VAE) [27] into the field of survival analysis. It is a conditional generative model with a single encoder and a single decoder, which learns to model the distribution of events conditioned on the covariates x.
SurVED and VSI are both variational-inference-based models. However, SurVED derives its objective function from the DATE model [17]. This adaptation enables SurVED to deal with continuous time: unlike the VSI model, no discretization is required. Moreover, SurVED does not impose any upper-limit constraint on the timeline of the model predictions. The loss function has separate terms with different weights for censored and non-censored samples.
Additionally, SurVED and VSI differ in terms of architecture. Specifically, while VSI comprises two encoders, p(z|x) and q(z|x, t), where q is utilized to capture the interactions between the covariates and the event times, SurVED uses only one encoder. This makes SurVED more similar to the variant VSI-NoQ, albeit with additional regularization on the latent space, a continuous output, and a different loss function.
In summary, this work presents two contributions. Firstly, it derives a decomposition of the concordance index which provides insights into the distinctions between seemingly similar-performing models. It also helps to explain why there are larger-magnitude differences between classical and deep learning models in the case of low censoring. Ultimately, by exposing areas of strength and weakness, the C-index decomposition has the potential to serve as a guide in the development of new survival models and offers insights to enhance existing ones. Secondly, this work introduces a new continuous-time variational model that overcomes the limitations of its predecessors, DATE and VSI, and achieves a ranking performance comparable to the state of the art.

Method
In this section, we introduce the Concordance Index Decomposition as a new approach to highlight the differences between survival models. Additionally, we present the SurVED model (Survival Variational Encoder-Decoder) and provide an overview of the four datasets used for numerical tests and comparisons.

The Concordance Index Decomposition
The C-index is a measure of the probability that the predicted event times (t̂_i, t̂_j) of two randomly selected subjects maintain the same relative order as their true event times (t_i, t_j), i.e., P(t̂_i > t̂_j | t_i > t_j). It is important to note that not all pairs can be compared when censoring is present; a pair (x_i, x_j) is comparable (usable) if the earliest time represents an event, or if both times are events. Conversely, a pair is deemed not comparable if the earliest time is censored or if both cases are censored [28].
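The comparability rule above can be expressed as a small predicate. This is a minimal sketch for illustration only; the function name and the convention for tied times are our own assumptions, not taken from the paper:

```python
def is_comparable(t_i: float, t_j: float, e_i: int, e_j: int) -> bool:
    """Harrell-style comparability of a pair of subjects.

    t_i, t_j: follow-up times; e_i, e_j: 1 for an observed event, 0 for
    censoring. A pair is usable if the subject with the earlier time had
    an event (this covers the event-event and event-censored cases)."""
    if t_i == t_j:
        # Assumed convention for tied times: comparable only if exactly one
        # of the two is an event; implementations differ on this point.
        return e_i != e_j
    earlier_event = e_i if t_i < t_j else e_j
    return earlier_event == 1
```

With this convention, an event followed by a later censoring time is usable, while a censored case followed by a later event is not, since the censored subject's true event time is unknown.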
The C-index can be decomposed into two parts: one measuring the relative ordering of cases with observed events, and another measuring the ordering of cases with observed events relative to censored cases. This decomposition is useful when comparing how methods perform in situations with a high proportion of censored cases versus situations with a low proportion of censored cases.
We define the random variable o_ij, which takes the value 1 if the (ij) pair is ordered (concordant) and 0 if it is discordant. We also define the random variable k_ij, which takes the value 1 if the (ij) pair is an event-event (ee) pair and 0 if it is an event-censored (ec) pair. To simplify the notation, P(o) represents P(o_ij = 1), P(ee) represents P(k_ij = 1), and P(ec) represents P(k_ij = 0). Note that P(ee) + P(ec) = 1. With these definitions, the C-index can be written as CI = P(o), and hence:

    CI = P(o) = P(o|ee) P(ee) + P(o|ec) P(ec).    (1)

We define CI_ee as a C-index for event-event cases and CI_ec as a C-index for event-censored cases,

    CI_ee ≡ P(o|ee),    CI_ec ≡ P(o|ec),    (2)

and we introduce the notation α for the conditional probability that the pair is an event-event pair (ee) given that it is a correctly ordered pair:

    α ≡ P(ee|o) = P(o|ee) P(ee) / P(o).    (3)

From expressions (1), (2), and (3) we thus have the following relationship, which shows that the full C-index (CI) is a weighted harmonic mean of the C-indices defined for the subsets ee and ec:

    CI = ( α / CI_ee + (1 − α) / CI_ec )^(−1).    (4)

The factor α weights the contribution of the correct ordering of event-event pairs relative to the correct ordering of event-censored pairs in the C-index. Changes in α are directly associated with variations in the model's performance in accurately ordering pairs, and indirectly related to the ratio of observed events to censored cases in the dataset. A predictor that can order all events and censored cases correctly will have an α value equal to the fraction of event-event pairs within the comparable pairs, a value we denote α*. However, even an imperfect predictor can have α = α* as long as it scores equally on event-event pairs and event-censored pairs in proportion to their percentages; such a predictor can be called a "balanced" predictor.

The C-index and its decomposed parts CI_ee, CI_ec, and α can be estimated from the number of correctly ordered pairs N^+, incorrectly ordered pairs N^−, and tied pairs N^=. Since there are two kinds of comparable (usable) pairs, event-event (ee) pairs and event-censored (ec) pairs, we have:

    CI_ee = (N^+_ee + N^=_ee / 2) / (N^+_ee + N^−_ee + N^=_ee),    (5)
    CI_ec = (N^+_ec + N^=_ec / 2) / (N^+_ec + N^−_ec + N^=_ec),    (6)
    α = (N^+_ee + N^=_ee / 2) / ((N^+_ee + N^=_ee / 2) + (N^+_ec + N^=_ec / 2)).    (7)

There are multiple ways to handle ties; we use Somers' d measure [29], which considers ties in the event times to be incomparable pairs and treats ties in the predicted values as binary random guesses, so that half of them are counted as correctly ordered.
The α-Deviation is defined as the difference between α and α*:

    α-Deviation ≡ α − α*,    (10)

where

    α* = N_ee / (N_ee + N_ec),

and N_ee and N_ec are the numbers of comparable (ee) and (ec) pairs in the dataset. A predictor that excels at ordering event-event (ee) pairs more than event-censored (ec) pairs will have α > α*, resulting in a positive α-Deviation. On the other hand, a predictor that is better at ordering event-censored (ec) pairs than event-event (ee) pairs will have α < α*, leading to a negative α-Deviation. In this paper, we study the absolute value of the α-Deviation, which measures how unbalanced the predictor is when making mistakes.
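Putting the definitions together, the decomposition can be estimated with a short script. This is an illustrative sketch under our own naming (not the paper's code): it counts correct, incorrect, and tied pairs separately for (ee) and (ec) pairs, applies the Somers'-d tie conventions described above, and recovers the total C-index through the weighted harmonic mean identity. Predictions are assumed to be predicted times, so larger values should accompany longer survival:

```python
import numpy as np

def cindex_decomposition(time, event, pred):
    """Return (CI, CI_ee, CI_ec, alpha, alpha_star); O(n^2) sketch."""
    time, event, pred = map(np.asarray, (time, event, pred))
    # counters indexed by pair type: 0 -> (ee), 1 -> (ec)
    n_pos = np.zeros(2); n_neg = np.zeros(2); n_tie = np.zeros(2)
    n = len(time)
    for i in range(n):
        for j in range(n):
            # orient pairs so subject i has the strictly larger true time;
            # usable only if the earlier subject j had an observed event
            # (tied event times are skipped, as in Somers' d)
            if time[i] <= time[j] or event[j] == 0:
                continue
            k = 0 if event[i] == 1 else 1
            if pred[i] > pred[j]:
                n_pos[k] += 1          # concordant
            elif pred[i] < pred[j]:
                n_neg[k] += 1          # discordant
            else:
                n_tie[k] += 1          # prediction tie: counts as half

    def ci(p, m, t):                    # Somers' d style estimate
        return (p + 0.5 * t) / (p + m + t)

    ci_ee = ci(n_pos[0], n_neg[0], n_tie[0])
    ci_ec = ci(n_pos[1], n_neg[1], n_tie[1])
    correct = n_pos + 0.5 * n_tie
    alpha = correct[0] / correct.sum()              # P(ee | correctly ordered)
    total = n_pos + n_neg + n_tie
    alpha_star = total[0] / total.sum()             # fraction of (ee) pairs
    # harmonic-mean identity; equals the direct count-based C-index
    ci_total = 1.0 / (alpha / ci_ee + (1.0 - alpha) / ci_ec)
    return ci_total, ci_ee, ci_ec, alpha, alpha_star
```

The α-Deviation is then simply `alpha - alpha_star`. Note that the identity can be checked numerically: `ci_total` always equals the plain ratio of correctly ordered pairs to comparable pairs.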

SurVED: Survival Variational Encoder-Decoder
Our model, SurVED, employs a conditional generator G_θ to estimate f(t|x), the distribution of death conditioned on the covariate vector x, with θ representing the parameters of the model. This generative model can be sampled to produce the conditional death density f(t|x), from which the conditional cumulative death distribution function (F) and the conditional survival function (S) can be computed:

    F(t|x) = ∫_0^t f(s|x) ds,    S(t|x) = 1 − F(t|x).

The model comprises two components: an Encoder E_θ1(z|x), which encodes the input x into a multi-dimensional Gaussian latent space represented by (μ_z, σ_z), and a Decoder D_θ2(t|z), responsible for decoding a sample z from the latent space and generating a sample t from the conditional distribution f(t|x).
Here θ_1 and θ_2 together constitute θ, the full set of parameters of G_θ. For each input x, n values t_i (i = 1, . . ., n) are sampled from f(t|x). The survival function can then be estimated using the Kaplan-Meier estimator, treating the sampled times t_i as observed event times. These n samples (t_i) are also utilized to estimate the expected value E_{t∼f(t|x)}[t] for the purpose of model evaluation.
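This evaluation step can be sketched as follows (our own minimal illustration, not the paper's code): when every sampled time is treated as an observed event, the Kaplan-Meier estimator collapses to the empirical survival function of the samples, and the sample mean is a Monte Carlo estimate of E_{t∼f(t|x)}[t]:

```python
import numpy as np

def survival_from_samples(samples, t_grid):
    """Estimate S(t|x) and E[t|x] from n generator draws t_1..t_n.

    With no censoring among the samples, Kaplan-Meier reduces to the
    empirical survival function S(t) = #{t_i > t} / n."""
    samples = np.asarray(samples, dtype=float)
    surv = np.array([(samples > t).mean() for t in t_grid])
    expected = samples.mean()  # Monte Carlo estimate of E[t | x]
    return surv, expected
```

In practice one would draw the samples by encoding x, sampling z from the latent Gaussian, and decoding; the function above only covers the aggregation step.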

The Objective Function
The objective function of the generative model G_θ consists of four parts: L_e, L_c, L_KL, and C_lb. The first two, L_e and L_c, are reconstruction losses evaluated separately for event cases and censored cases. These losses are designed to optimize the balance between events and censored cases. The third term, L_KL, originates from the VAE formulation and is the Kullback-Leibler divergence, serving as a regularization term. The first three terms are:

    L_e = E_{(x,t)∼P_e} [ |t − t̂| ],    (14)
    L_c = E_{(x,t)∼P_c} [ max(0, t − t̂) ],    (15)
    L_KL = E_x [ KL( E_θ1(z|x), N(0, I) ) ],    (16)

where t̂ = D_θ2(z), z ∼ E_θ1(z|x), is a generated event time, and the subscripts e and c indicate that the terms exclusively involve event cases or censored cases, respectively. The notation P_e(x) denotes that x was drawn from the event cases, while P_c(x) indicates that x was drawn from the censored cases. Additionally, KL(p, q) represents the Kullback-Leibler divergence between the two distributions p and q. The fourth term is a differentiable lower bound for the C-index [30]:

    C_lb = (1/|ε|) Σ_{(i,j)∈ε} σ( t̂_i − t̂_j ),  with pairs oriented so that t_i > t_j,    (17)

where ε is the set of comparable pairs, the symbol σ is the standard sigmoid function, and |ε| denotes the cardinality of the set ε. Adding the C_lb term to the loss function enables the model to directly optimize the C-index, encouraging concordance in the model predictions. The SurVED model aims to minimize the total loss:

    L = λ_e L_e + λ_c L_c + λ_KL L_KL − λ_lb C_lb,    (18)

where λ_e, λ_c, λ_KL, and λ_lb are tunable weights.
These objective terms have been used previously in the literature in different settings. The L_e and L_c terms, eqs. (14) and (15), match the ℓ_2 and ℓ_3 terms used in the DATE loss function [17]. However, they can be traced back to earlier work by Van Belle et al. [12]. The fourth objective term, eq. (17), was suggested for the DATE model [17] as well.
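Our reading of the four terms can be sketched in NumPy. This is a hedged re-implementation under our own assumptions (an L1 reconstruction term for events, a hinge penalty for predicting death before a censoring time, a diagonal-Gaussian KL term, and a sigmoid-relaxed concordance bound); the published code may differ in detail, and all names here are ours:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def surved_style_loss(t, e, t_hat, mu_z, log_var_z,
                      lam=(1.0, 1.0, 0.01, 1.0)):
    """Sketch of L = lam_e*L_e + lam_c*L_c + lam_kl*L_KL - lam_lb*C_lb
    (C_lb is to be maximized, hence the minus sign)."""
    t, e, t_hat = map(np.asarray, (t, e, t_hat))
    ev, ce = e == 1, e == 0
    L_e = np.abs(t[ev] - t_hat[ev]).mean()            # fit observed times
    L_c = np.maximum(0.0, t[ce] - t_hat[ce]).mean()   # want t_hat >= censor time
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), averaged over the batch
    L_kl = 0.5 * np.sum(mu_z**2 + np.exp(log_var_z) - log_var_z - 1.0,
                        axis=1).mean()
    # sigmoid relaxation of the C-index over comparable pairs
    # (t_i > t_j with subject j an observed event)
    pairs = [(i, j) for i in range(len(t)) for j in range(len(t))
             if t[i] > t[j] and e[j] == 1]
    C_lb = float(np.mean([sigmoid(t_hat[i] - t_hat[j]) for i, j in pairs]))
    le, lc, lkl, llb = lam
    return le * L_e + lc * L_c + lkl * L_kl - llb * C_lb
```

A training step would compute `t_hat`, `mu_z`, and `log_var_z` from the encoder and decoder and backpropagate through this quantity; the NumPy version is only meant to make the four terms concrete.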

Description of Datasets
The SurVED method has been evaluated against the reference methods on four publicly available medical datasets. The datasets are all fairly large and cover different censoring levels, numbers of samples, and numbers of features; see Table 1. They have also been used in several previous benchmark studies.
FLCHAIN: A dataset used in a study [31] to determine whether the free light chain (FLC) assay is a predictor of better/worse survival for the general population.The study showed that a high FLC was significantly predictive of worse overall survival.
METABRIC: The Molecular Taxonomy of Breast Cancer International Consortium dataset [32].This dataset is used to predict the survivability of breast cancer patients using gene expression profiles and clinical data.
NWTCO: Data from the US National Wilms Tumor Study to predict survival based on tumor histology [33]. This data is available in the package survival in R [34].
SUPPORT: This data comes from the Study to Understand Prognoses and Preferences for Outcomes and Risks of Treatment [35]. This study aimed to understand the survival of seriously ill hospitalized patients and to validate the predictions of a new prognostic model against an existing prognostic model and predictions by physicians. The SUPPORT data is sometimes split into subsets since there is more than one diagnosis, but it is used as one dataset here.

The models were first compared based on the C-index performance and then analyzed further using the C-index Decomposition.
The same sampling scheme was applied to all the experiments: 30% of the data was used as a hold-out test set, and the remaining 70% was used for hyperparameter tuning and training.

Although other models were better in CI_ee on these three datasets, SurVED was better in terms of CI_ec. Additionally, due to its higher α-Deviation, SurVED placed higher weight on the CI_ec term, resulting in a higher overall CI performance.
Looking at the full list of results in Tables 2, 3, 4, and 5, we see that in cases where there are no significant differences between the models in the C-index, they show significant differences in the decomposition terms CI_ee and CI_ec.
For example, comparing RSF and DeepHit on the NWTCO dataset shows that RSF has a significantly better CI_ee, with no significant difference observed in CI_ec. However, because DeepHit has a higher α-Deviation, it places more weight on CI_ec, resulting in no significant difference in the overall C-index. A similar scenario unfolds when comparing SurVED and CPH on the FLCHAIN dataset.
More interesting cases show contrasting differences in the decomposition terms leading to an insignificant difference in the C-index due to the weighted averaging. For instance, on the NWTCO dataset, DeepHit exhibits a higher CI_ee while CPH outperforms it in CI_ec. Consequently, the total C-index shows no significant difference. A similar phenomenon is observed on the FLCHAIN dataset when comparing RSF with DeepHit and DeepSurv, where RSF excels in CI_ee while DeepHit and DeepSurv demonstrate better performance in CI_ec, thereby diminishing the difference in the total C-index. This pattern is also observed in the comparison between DeepHit and DeepSurv on the FLCHAIN and METABRIC datasets.
Contrasting differences in the decomposition terms do not always diminish the difference in the total C-index. In some cases, a higher α-Deviation can tip the balance in favor of one model over another. For example, consider the comparison of SurVED and DeepSurv on NWTCO, where DeepSurv exhibits a higher CI_ee while SurVED has a higher CI_ec. Nevertheless, SurVED's higher α-Deviation shifts the balance in favor of the CI_ec term, resulting in a higher C-index. Similar scenarios arise in the comparisons of CPH with RSF, DATE, VSI, and DeepSurv on NWTCO. In all these cases, CPH demonstrates a lower CI_ee but a higher CI_ec and a higher α-Deviation, resulting in a higher C-index.
Occasionally, the extra weight on one term does not compensate for the differences in the terms, especially when the difference is substantial. For example, consider CPH compared to DATE, DeepHit, and DeepSurv on the METABRIC dataset. While CPH has a higher CI_ec and a higher α-Deviation, it has a much lower CI_ee. In this scenario, upweighting the CI_ec term does not compensate for the considerable gap in the CI_ee term, resulting in CPH having a significantly lower total C-index.
Poor performance on the METABRIC dataset was observed for the DeepHit model. This is similar to the VSI model, which shares the discrete-time property with DeepHit. It is worth noting that this result cannot be compared to the result reported in the DeepHit paper [16], as they used a different version of the METABRIC dataset, in which the time step was re-scaled to a month instead of a day as in our case. Additionally, they used the time-dependent C-index (C_td) as an evaluation measure.
Overall, the results indicate that classical models either outperformed or performed on par with deep learning models on the smaller datasets with higher censoring levels. RSF was the best on METABRIC, while CPH was the best on NWTCO. On FLCHAIN, RSF shares the best performance with DeepSurv and DeepHit. However, deep learning models have a clear advantage on SUPPORT, the largest dataset with the lowest censoring level.
To assess the models comprehensively, pair-wise comparisons were performed between the seven models on the four datasets. Each model was compared against the other six models on each dataset, resulting in 24 comparisons for each model. The results are summarized in Figure 2, treating draws as a 50% chance of winning or losing.

Regarding the C-index performances, SurVED, DeepHit, RSF, and DeepSurv show similar performance, whereas CPH, DATE, and VSI lag behind. However, analyzing the other C-index decomposition terms reveals more interesting insights. For example, DATE performs excellently in terms of CI_ee but falls short in CI_ec, which impacts its overall C-index. In contrast, the VSI model shows poor performance in both terms. The results also show that the main differences between the models stem from the CI_ee part, while all models except DATE and VSI exhibit similar overall CI_ec performance.
The deep learning models outperformed classical models by a substantial margin on the SUPPORT dataset. To understand this notable difference, and to explore how the models behave under different levels of censoring and dataset sizes, the following section employs the C-index decomposition to investigate the models' performances across various conditions simulated using the SUPPORT dataset.

The Effect of Censoring and Size
Among the datasets utilized in this paper, the SUPPORT dataset is the largest and has the highest proportion of events. This characteristic allowed us to investigate the impact of varying the censoring and the dataset size along three different dimensions. Originally, the dataset contained 9,105 examples, with 6,201 observed events and 2,904 censored cases, i.e., 68% events and 32% censored cases. In the first experiment (Size Only), we varied the dataset size by randomly removing examples while keeping the censoring level fixed. This resulted in four datasets with different sizes (3,642, 4,462, 5,828, and 9,105) and approximately the same event percentage of 68%. In the second (Censoring Only), we varied the censoring level by randomly censoring observed events while maintaining the size. This resulted in four datasets of the same size (9,105) with varying event percentages (20%, 35%, 50%, and 68%). Lastly, in the third experiment (Size and Censoring), we simultaneously varied both the dataset size and the censoring level by randomly dropping observed event examples.
This resulted in four datasets with different censoring levels (event percentages of 20%, 35%, 50%, and 68%) and different sizes (3,630, 4,467, 5,808, and 9,105), respectively. The models were trained and tested on each of the four datasets in each experiment, and Fig. 3 illustrates how the C-indices for the models changed with varying dataset sizes and fractions of event cases (different levels of censoring). It is worth noting that the right-hand side of Figures 3a, 3b, and 3c shows the performance of the models on the original SUPPORT dataset.
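The three resampling schemes can be reproduced in a few lines of NumPy. This is our own re-implementation sketch (the paper's sampling code is not shown here), but the event counts it produces match the dataset sizes quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def size_only(time, event, n_keep):
    """Shrink the dataset uniformly at random; the event ratio is
    preserved in expectation (Size Only)."""
    idx = rng.choice(len(time), size=n_keep, replace=False)
    return time[idx], event[idx]

def censoring_only(time, event, target_event_frac):
    """Keep the size fixed; randomly relabel events as censored until the
    target event fraction is reached (Censoring Only)."""
    event = event.copy()
    ev_idx = np.flatnonzero(event == 1)
    n_to_censor = int(round(len(ev_idx) - target_event_frac * len(time)))
    event[rng.choice(ev_idx, size=n_to_censor, replace=False)] = 0
    return time, event

def size_and_censoring(time, event, target_event_frac):
    """Drop event cases entirely until the target event fraction is
    reached, shrinking the dataset as a side effect (Size and Censoring).
    Keep k events with n_cens censored so that k / (k + n_cens) == target."""
    ev_idx = np.flatnonzero(event == 1)
    n_cens = len(time) - len(ev_idx)
    n_keep_ev = int(round(target_event_frac * n_cens / (1.0 - target_event_frac)))
    drop = rng.choice(ev_idx, size=len(ev_idx) - n_keep_ev, replace=False)
    keep = np.setdiff1d(np.arange(len(time)), drop)
    return time[keep], event[keep]
```

Applied to SUPPORT's 6,201 events and 2,904 censored cases with a 20% target, `censoring_only` leaves 1,821 events in 9,105 examples, while `size_and_censoring` leaves 726 events in 3,630 examples, matching the sizes above.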
Two distinct types of behavior can be observed in these experiments (see Fig. 3): one for the group {SurVED, DeepSurv, DeepHit, VSI}, i.e., the deep learning models except for DATE, and one for the group {DATE, CPH, RSF}, i.e., the classical models plus DATE. In the first experiment, Figure 3a, where only the dataset size was changed, all the models improved in C-index performance as the dataset size increased. However, they maintained the relative differences between the two groups. In the second experiment, Figure 3b, where only the censoring level was varied, the models' performances remained relatively constant, with DATE and the classical models exhibiting a slight drop in C-index performance.
The most intriguing result was obtained in the third experiment, Figure 3c, where the classical models behaved unexpectedly when both the size and the censoring level of the dataset were varied. The deep learning models maintained a constant C-index performance as the dataset size and the percentage of observed events both decreased (reading Figure 3c from right to left). In contrast, the performance of DATE and the classical models improved, eventually reaching a point where, in the extreme case of the smallest dataset and the lowest event percentage (the left-hand side of Figure 3c), all models performed similarly.

To better understand these trends with respect to changes in censoring levels and dataset size, the performance of the models was further examined using the C-index Decomposition, with the aim of shedding light on the underlying reasons behind such differences in behavior. In the first experiment (the leftmost column in Figure 4), increasing the size of the dataset led to an increase in both CI_ee and CI_ec. Furthermore, keeping the percentage of events fixed maintained similar values of the α term in the decomposition across the four datasets (approximately 0.5). This balance in α gave equal weight to the two terms in the C-index decomposition, resulting in an improvement in the total C-index for all models with increased dataset size.
In the second experiment (the middle column in Figure 4), keeping the size fixed and decreasing the censoring level (increasing the event percentage) slightly increased the CI_ee performance for deep learning models and, to a lesser extent, for classical models. On the other hand, CI_ec stayed almost constant for deep learning models, with a slight increase for classical models. Nevertheless, changing the censoring level affected α, changing the weighting of the two decomposition terms across the four datasets. As a result, with smaller α, the total C-index was mainly influenced by CI_ec at the high censoring level (low event percentages, to the left side of the figure), whereas α increases (and hence the weight on CI_ee) as the event percentage increases. This caused the total C-index to stay constant for deep learning models but to decrease slightly for classical models.
In the third experiment, when changing the dataset's size and the censoring level (the column to the right in Figure 4), the impact became more pronounced.
All the methods essentially achieved high C-indices at a high censoring level (low percentage of events) and a smaller dataset, resulting in very similar performances with respect to CI, CI_ee, and CI_ec. However, at such a high censoring level, the α term of the C-index is relatively small, which makes the C-index primarily influenced by the CI_ec term, with minimal contribution from the CI_ee term. As the size increases and censoring decreases, the α value increases, giving more weight to the CI_ee term. Since the classical models did not exhibit improvements in CI_ee, which remained almost the same as more events were added to the dataset, their total C-index decreased as the weight on this term grew. In contrast, the deep learning models exhibited an increase in CI_ee, which kept their total C-index the same across all levels of censoring.
The main difference between the second and the third experiments lies in how censoring is introduced. In the second experiment (Censoring Only), a fraction of the observed event examples are censored, while in the third experiment (Size and Censoring) those observed event examples are entirely removed from the dataset. To achieve the same censoring percentage in the two scenarios, more event cases need to be removed in the third experiment than need to be censored in the second. As a result, a dataset with 20% events in the second experiment has 1,821 event cases, compared to 726 event cases in a dataset with a similar event percentage in the third experiment. This explains the larger drop in CI_ee performance in the third experiment, which has fewer observed event cases.
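The quoted event counts follow from simple algebra, reproduced here as a check (the numbers are SUPPORT's 6,201 events and 2,904 censored cases, as described earlier):

```python
n_events, n_censored = 6201, 2904  # original SUPPORT composition
target = 0.20                      # desired fraction of events

# Experiment 2 (Censoring Only): relabel r events as censored so that
# (n_events - r) / (n_events + n_censored) == target; size is unchanged.
r = n_events - target * (n_events + n_censored)
events_censoring_only = n_events - r              # 1,821 events remain

# Experiment 3 (Size and Censoring): drop d events entirely so that
# (n_events - d) / (n_events - d + n_censored) == target.
d = n_events - target * n_censored / (1.0 - target)
events_size_and_censoring = n_events - d          # 726 events remain
size_size_and_censoring = events_size_and_censoring + n_censored  # 3,630
```

At the same 20% event fraction, the third experiment thus retains 726 events against 1,821 in the second, which is exactly the gap discussed above.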

Conclusion
In this work, we derived a decomposition of the C-index, separating it into two terms: one for ranking observed events, and another for ranking observed events versus censored cases.These terms are weighted by the parameter α.
The α factor expresses the contribution of the two parts to the total C-index and can be interpreted as the conditional probability that a pair is an event-event pair given that it is correctly ordered, P((ee) pair | ordered pair). A model that perfectly orders the two types of pairs will have an optimal α factor (α*). Unbalanced models, i.e., models that are not equally good at ranking event-event pairs and event-censored pairs, will deviate from this value. Based on this deviation from α*, the α-Deviation measure can assess how balanced a model is with respect to the ranking of the two groups of pairs. SurVED is also proposed: a new approach for estimating the time-to-event distribution using a variational encoder-decoder with a Gaussian latent layer.
In benchmark tests, SurVED performs significantly better than the two closely related methods, DATE and VSI, and achieves a comparable overall performance to DeepSurv and DeepHit.
Using the C-index decomposition, it was shown that when models perform differently in terms of CI_ee and CI_ec, such differences often go unnoticed when evaluating the total C-index due to the averaging. Furthermore, it was demonstrated, using the SUPPORT dataset with varying censoring levels and dataset sizes, that all methods benefitted from increasing the dataset size. It was also shown that all methods have comparable performance in terms of the total C-index at a high censoring percentage and smaller dataset size, and that all methods do better at ranking event-censored pairs than at ranking event-event pairs. However, as the number of events grows, SurVED and the other deep learning models VSI, DeepSurv, and DeepHit are better than the other algorithms at improving their ranking of event-event pairs. This helped the deep learning models maintain a constant C-index performance across different censoring levels, in contrast to the classical models, which suffered from a drop in the C-index. This explains the large magnitude of the difference between the deep learning models and the classical models on the SUPPORT dataset.
This work focuses on analyzing the ranking performance of survival models using the C-index decomposition, aiming at a better understanding of their strengths and weaknesses with respect to the different types of events and censored observations. Such understanding can help in designing better objective functions for survival models, which we leave for future work. Moreover, studying the relation between the decomposition terms and other evaluation metrics can potentially give further insights for developing better survival models; this, too, is left for future work.

Declarations of interest
The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper.

Acknowledgement
This research was performed under the CAISR+ project funded by the Swedish Knowledge Foundation.
hyperparameter tuning and training. The models were tuned using five-fold cross-validation, maximizing the C-index. At each fold, three sets were used for training, one set for early stopping of the deep learning models, and the last set for validation. The early-stopping set was not used when optimizing RSF. In the final testing phase, 100-fold testing on the hold-out test set was performed, varying the training data. At each fold, 90% of the training data was used to train the models, keeping 10% as a validation set for the deep learning models.

The categorical features were one-hot encoded, and the numerical features were standardized to zero mean and unit variance. The target variable was scaled by the maximum value of the training set and power transformed. Moreover, missing values were filled with the training-data median and mode for numerical and categorical features, respectively.

The deep learning models were configured with a common architecture of two hidden layers with 32 nodes each, a hyperbolic tangent activation function, and a 0.5 dropout rate on the first hidden layer. For the models with special structures (DATE and VSI), we used the structures suggested in their repositories. SurVED has a latent size of four nodes and a single-layer linear perceptron as its decoder. Details about the network structures, data standardization, and transformation are available in our GitHub repository 1. DATE's implementation from the authors' GitHub repository 2 was used, while the Scikit-Survival library [36] was used for the CPH and RSF models. Moreover, the VSI model implementation provided by the authors on GitHub 3 was used. For DeepHit and DeepSurv, the PyCox library [37] was used. A random search was performed to optimize the weights of the loss functions for DeepHit and SurVED. The number of output bins for the two discrete models, VSI and DeepHit, was optimized over the choices [100, 200, 400, 1000]. Additionally, a random search was conducted for RSF to optimize parameters such as max depth, min samples split, min samples leaf, and max features.
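The preprocessing steps described above can be sketched with scikit-learn. This is our illustrative reconstruction, not the repository code: the column indices, the sample arrays, and the choice of yeo-johnson power transform are assumptions.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PowerTransformer, StandardScaler

# Hypothetical column split; the real feature names depend on the dataset.
numeric_cols = [0, 1]
categorical_cols = [2]

preprocess = ColumnTransformer(
    [
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),        # training median
            ("scale", StandardScaler()),                         # zero mean, unit variance
        ]), numeric_cols),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),  # training mode
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ],
    sparse_threshold=0,  # force a dense output array
)

# Toy training data: two numeric features (with missing values) and one categorical.
X_train = np.array([[1.0, 2.0, 0.0],
                    [np.nan, 3.0, 1.0],
                    [2.0, np.nan, 0.0]])
Xt = preprocess.fit_transform(X_train)

# Target: scale by the training-set maximum, then power transform.
t_train = np.array([[10.0], [25.0], [40.0]])
t_scaled = t_train / t_train.max()
t_transformed = PowerTransformer().fit_transform(t_scaled)
```

Fitting the transformers on the training folds only, and reusing them on the validation and test folds, keeps the imputation and scaling statistics free of test-set leakage.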

3. Results and Discussion

3.1.

Figure 1. This instability may be attributed to the fact that METABRIC is the smallest dataset with the largest time horizons, spanning over 9,200 days. In such cases, time discretization can lead to information loss. Remarkably, although SurVED outperformed DATE in the C-index on NWTOC, FLCHAIN, and METABRIC, they demonstrated contrasting behaviors regarding CIee, CIec, and α-Deviation. While DATE showed better performance

Figure 2: The Win/Lose/Draw comparison based on CI, CIee, CIec, and α-Deviation in eq. (4) of the compared models on the four datasets.

Figure 3: The change of CI as the dataset size and the ratio of events change. The x-axis shows the dataset sizes and the percentages of events (for the SUPPORT dataset) in the three experiments.

Figure 4 shows the C-index decomposition of the seven models on the SUPPORT dataset in the three experiments (varying the dataset size only, varying the censoring level only, and varying both the size and the censoring level). Two distinct trends are observed: one corresponding to the classical models, CPH and RSF, and the other to the deep learning models, except for DATE, which followed the classical models' behavior. Hence, DATE is included with the classical models when referring to the classical models' behavior below.

Figure 4: The change of CI, CIee, CIec, and α in eq. (4) as the ratio of events changes.

Table 2: The C-index (CI) values (%) of the compared models on the four datasets. Numbers show the median, the 2.5%, and the 97.5% quantiles over 100 folds. The highest numerical value in each dataset is boldfaced.

Table 3: The CIee values (%) of the compared models on the four datasets. Numbers show the median, the 2.5%, and the 97.5% quantiles over 100 folds. The highest numerical value in each dataset is boldfaced.

Table 4: The CIec values (%) of the compared models on the four datasets. Numbers show the median, the 2.5%, and the 97.5% quantiles over 100 folds. The highest numerical value in each dataset is boldfaced.

Table 5: The α-Deviation values of the compared models on the four datasets. Numbers show the median, the 2.5%, and the 97.5% quantiles over 100 folds. All values are scaled by a factor of 10^2. The lowest numerical value in each dataset is boldfaced.