Interpretable machine learning model to predict survival days of malignant brain tumor patients

An artificial intelligence (AI) model's performance is strongly influenced by its input features, so it is vital to find the optimal feature set. This is especially crucial for survival prediction in the glioblastoma multiforme (GBM) type of brain tumor. In this study, we identify the best feature set for predicting the survival days (SD) of GBM patients, one that outranks the current state-of-the-art methodologies. The proposed approach is an end-to-end AI model. The model first segments tumors from healthy brain tissue in patients' MRI images, extracts features from the segmented results, performs feature selection, and predicts patients' SD based on the selected features. The extracted features are primarily shape-based, location-based, and radiomics features; patient metadata is also included as a feature. The selection methods include recursive feature elimination (RFE), permutation importance (PI), and correlation analysis between the features. Finally, we examine feature behavior at the local (single sample) and global (all samples) levels. We find that, out of 1265 extracted features, only 29 dominant features play a crucial role in predicting patients' SD. Among these 29 features, one is metadata (patient age), three are location-based, and the rest are radiomics features. Furthermore, we explain these features using post-hoc interpretability methods to validate the model's predictions and understand its decisions. Finally, we analyze the behavioral impact of the top six features on survival prediction; the findings drawn from the explanations are coherent with the medical domain. We find that after the age of 50 years, a patient's likelihood of survival deteriorates, and survival beyond 80 years is rare. Again, for location-based features, the SD is lower if the tumor is located in the central or back part of the brain.
All these trends derived from the developed AI model are in sync with medically proven facts. The results show an overall 33% improvement in the accuracy of SD prediction compared to the top-performing methods of the BraTS-2020 challenge.


Introduction
Brain cancer patients' survival rate is lower than that of other cancer types. Glioblastoma multiforme (GBM), or simply glioblastoma, is the most invasive and frequently diagnosed type of brain tumor [1,2]. Due to its infiltrative and diffuse characteristics, the World Health Organization (WHO) has categorized it as a grade-4 tumor [3]. According to the Central Brain Tumor Registry of the United States (CBTRUS) 2021 report, there were a total of 83 029 deaths in the USA alone between 2014 and 2018 due to malignant brain and other central nervous system (CNS) tumors [2]. Interpretability methods such as SHAP and partial dependence plots (PDPs) help to visualize how a particular radiomic feature affects prediction across different patients. This establishes the relationship between radiomic features and prediction and also reveals nonlinear dependencies amongst features. Thus, both can help make informed decisions and offer valuable insights into GBM prognosis.
In summary, our work focuses on the following points:
• Finding an optimal feature set that augurs well for SD prediction.
• Validation of SD prediction on the BraTS-2020 dataset.
• Providing detailed explanations and rationale for the selection of the dominant features set.
• Interpretation of the model behavior and biomedical inference of the top six most important features.
All the obtained results are validated through the BraTS-Challenge-2020 evaluation platform [29].

End-to-end approach for SD prediction
The structural diagram of the proposed end-to-end approach is shown in figure 1. The inputs to the model are multi-parametric MRI images: T1-weighted (including a contrast agent), T2-weighted, and fluid-attenuated inversion recovery (T2-FLAIR) images. The segmentation model is built on a 3D U-Net architecture, known as the 'No-new Network' [28]. The architecture relies on 3D U-Net, a well-proven architecture for biomedical segmentation tasks that is robust for tumor segmentation.
The network consists of a symmetric five-level encoder-decoder structure. It is a simple, easy-to-implement architecture with 8.3 million parameters, which makes it suitable for resource-constrained environments. The segmentation results are shown in figure A1(a) of the supplementary section. In addition, figures A1(b)-(d) exhibit a qualitative comparison between the given input (T2-FLAIR) MRI image, the predicted segmentation, and the ground truth. The SD predictor model was trained using the segmented results and ground truth of the training dataset, and tested on the features extracted from the segmented results of the validation set. The feature selection module finds the best group from these extracted features, which are then used to predict SD. Finally, the SD prediction module is investigated for its decision, to understand its generic (global/overall) and specific (local/sample-wise) behavior on SD. The details of the feature extraction module, the feature selection module, the survival prediction model, and its interpretability are discussed in the subsections below.

Feature extraction module
The feature extraction module obtains image-based features [25] and radiomics-based features [35] (table 1 lists the specifics of the features).
Here, the image-based features are extracted by determining the tumor's shape and location. In contrast, radiomics-based features are extracted from necrotic and non-enhanced tumor regions using wavelet and Laplacian of Gaussian (LoG) filters (with σ values from 1 to 5). A lower value of σ highlights fine textures, while a higher σ focuses on coarse textures. The wavelet filters denoise the images and capture spatial and global signals [36]. The LoG filter pinpoints blob centers and approximates their size, shape, and orientation [37]. In total, we obtained 1264 features (1225 radiomics-based + 39 image-based). We also considered metadata, i.e. the age of the patient, as a feature. As a result, 1265 features in total are taken into account for the evaluation. Since some of these features can be redundant or may not contribute to the prediction, a feature selection procedure is essential.
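As an illustration of the LoG filtering described above, the sketch below applies LoG filters at σ = 1 to 5 to a synthetic volume using SciPy. This is a minimal stand-in for the radiomics extraction pipeline, not the study's actual code; the volume size and random data are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

rng = np.random.default_rng(0)
volume = rng.random((32, 32, 32))  # synthetic stand-in for a tumor sub-volume

# LoG responses at sigma = 1..5: small sigma highlights fine texture,
# large sigma responds to coarse texture (blob-like structures)
log_responses = {s: gaussian_laplace(volume, sigma=s) for s in range(1, 6)}
```

In a full radiomics pipeline these filtered volumes would each be summarized into first- and second-order statistics (e.g. kurtosis, GLCM correlation), yielding the filter-prefixed feature names used later in the paper.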

Feature selection module
The primary goal of feature selection methods is to eliminate unimportant or repetitive features. Here, we employed RFE [38] and PI [39] as feature selection methods. RFE is a backward feature selection method that iteratively ranks the features according to their importance, eliminates the least important ones, and re-fits the model. The description of the dominant features identified by RFE is shown in supplementary table A1. On the other hand, PI measures a feature's influence on the model score by randomly shuffling that feature's values; the pseudo-code of PI is shown in algorithm 1. This shuffling breaks the connection between the feature and the target, and the resulting decline in the model's score demonstrates how strongly the model depends on that feature. Thus, we weighted the features according to their importance. In general,

Image-based features
Shape-based features (27) Surface area of ROIs, the volume of ROIs, proportion of ROIs, proportion ratio between each ROI, the area-to-volume ratio of ROIs, and amount of tumor.
Location-based features (12) Centroid of ROIs, the distance between the center of ROIs and the center of the brain.
Note: The values within the parenthesis represent the number of features extracted.
dominating features are given greater weight than other features. Zero or negative weights indicate that a feature does not contribute to the prediction. Therefore, we removed them, bringing the set down to 180 features. These 180 features were further examined using the Spearman correlation coefficient (SpearmanR) with an absolute cutoff value of 0.5. It is clear from the correlation values that the necrotic, active, and WT centroids are firmly connected, given that they share similar characteristics. As a result, we narrowed the set down to 29 features by eliminating redundant ones. The pipeline for feature selection is as follows:
• We eliminated features based on their PI weights (which define their contribution to the outcome). The threshold value of the weights is 100: any feature with a PI weight <100 is eliminated. This results in 180 prominent features.
• We then eliminated the weaker features from these 180 by computing SpearmanR values and sorting. For this, (a) we take the features one by one (from the 180-feature set), starting with the feature having the least PI weight, and find its SpearmanR with the remaining 179 features; (b) we select the features with correlations less than 0.5; (c) from these selected features, we identify the one with the highest PI weight and use it to replace the feature with the least PI weight (chosen in step (a)). This process is repeated for each feature in the 180-feature set, i.e. the loop runs 179 times. Lastly, we obtain 29 dominant features (with low mutual correlation) out of the 180.
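The RFE and PI steps above can be sketched with scikit-learn on a synthetic feature matrix. This is a hedged illustration: the estimator, data, and parameter values are placeholders, not the study's actual configuration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.inspection import permutation_importance

# synthetic stand-in for the extracted feature matrix and SD targets
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       random_state=0)

est = RandomForestRegressor(n_estimators=50, random_state=0)

# RFE: iteratively rank features, drop the least important, and re-fit
rfe = RFE(est, n_features_to_select=5).fit(X, y)
kept = np.flatnonzero(rfe.support_)  # indices of the surviving features

# PI: score drop when a single feature column is randomly shuffled
est.fit(X, y)
pi = permutation_importance(est, X, y, n_repeats=5, random_state=0)
```

Features whose `pi.importances_mean` is zero or negative would be discarded, mirroring the weight-threshold step described in the text.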
A detailed description of the selected dominant features using PI is shown in supplementary table A2.
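The SpearmanR-based redundancy pruning can be sketched as a greedy filter that keeps higher-PI-weight features and drops any feature correlating above the cutoff with an already-kept one. This is a simplified variant of the loop described above; the toy data and weights are assumptions for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

def prune_correlated(X, weights, cutoff=0.5):
    """Visit features by descending PI weight; keep a feature only if its
    absolute SpearmanR with every already-kept feature is below cutoff."""
    rho = np.abs(spearmanr(X)[0])          # pairwise |SpearmanR| matrix
    order = np.argsort(weights)[::-1]      # highest PI weight first
    kept = []
    for j in order:
        if all(rho[j, k] < cutoff for k in kept):
            kept.append(int(j))
    return sorted(kept)

rng = np.random.default_rng(0)
base = rng.normal(size=(60, 3))
# 4th column is a near-duplicate of the 1st (highly correlated)
X = np.column_stack([base, base[:, 0] + 0.01 * rng.normal(size=60)])
weights = np.array([10.0, 5.0, 3.0, 1.0])

selected = prune_correlated(X, weights)
print(selected)  # feature 3 (near-duplicate of feature 0, lower weight) is pruned
```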

Survival prediction module
The random forest regressor (RFR) [40] is based on ensemble learning, where decision trees (DT) are fundamental building blocks. Each DT was created using random samples from the training set; hence it is called a random forest. This method is widely used as it has been proven accurate and robust [40] across multiple complex problems, including SD prediction [41,42]. The RFR model is often more successful than other models because the outcomes obtained by averaging the prediction from each tree result in lower variability. Additionally, randomization during tree growth and splitting helps prevent overfitting [43].
Hence, the RFR model is robust for predicting brain tumor patients' survival [44]. Here, a five-fold cross-validation technique was used to train the RFR model, and the hyper-parameters of the model were fine-tuned using grid search. The fine-tuned parameters are the maximum tree depth, the maximum number of features considered at each split, the number of trees, and the minimum number of samples required at a leaf node.
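The grid search with five-fold cross-validation described above can be sketched with scikit-learn. The grid values and synthetic data below are illustrative assumptions; only the parameter names follow the text.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=80, n_features=10, noise=5.0, random_state=0)

# Grid over the hyper-parameters named in the text (illustrative ranges)
param_grid = {
    "max_depth": [3, None],         # maximum tree depth
    "max_features": ["sqrt", 1.0],  # features considered at each split
    "n_estimators": [50, 100],      # number of trees
    "min_samples_leaf": [1, 3],     # minimum samples at a leaf node
}
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, cv=5)  # five-fold cross-validation
search.fit(X, y)
print(search.best_params_)
```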

Algorithm 1. Calculating permutation importance (PI).
Input: fitted model f, data (X, y), number of repeats K.
Output: importance I_j for each feature j.
Compute the reference score s of f on (X, y)
for each feature j do
    for k = 1 to K do
        Randomly permute the values of feature j in X to obtain X̃_{k,j}
        Compute the score s_{k,j} of f on (X̃_{k,j}, y)
    end for
end for
Measure importance I_j of each feature j, defined as: I_j = s − (1/K) Σ_{k=1}^{K} s_{k,j}

Interpretability methods for the proposed SD module
Understanding the decisions taken by AI or machine learning (ML) models is essential. Especially in the medical domain, the interpretability of an AI model is vital to increase its reliability. Generally, the non-linearity in AI models makes them hard to decipher. That is why we use model-agnostic methods like SHAP [32,33] and PDP [30,31] to interpret the proposed model.
The primary objective of the SHAP method is to determine how much each feature impacts the prediction for a given instance. The SHAP value of a feature is the average marginal contribution (MC) of that feature, over all possible permutations of the feature set, to the value of its predecessor set. It can be expressed as in equation (1) [45]:

Φ_j = (1/|Π(N)|) Σ_{π ∈ Π(N)} [ v(P_j^π ∪ {j}) − v(P_j^π) ]    (1)

where Φ_j is the SHAP value of the feature of interest j, Π(N) is the set of all possible orderings (coalitions) of the feature set N, π is a specific ordering, v is the contribution of a feature set, P_j^π ∪ {j} is the predecessor set of feature j in ordering π including feature j, and P_j^π is the predecessor set of feature j in that ordering excluding feature j (please see supplementary table A4 for a more detailed worked example). Algorithm 2 displays the pseudo-code to find the SHAP value for a feature.
The PDP displays the global effect of a feature on the target. The PDP considers all the samples and can show and examine the global association between SD and the input variables. The partial dependence function is represented as:

f̂_s(x_s) = E_{x_c}[ f̂(x_s, x_c) ]    (2)

where x_s are the desirable feature(s) for which we want to plot the partial dependency (PD), x_c are the remaining features used to train the model, and X = x_s ∪ x_c is the whole feature set. In a PDP, we assume that the feature subsets x_s and x_c are uncorrelated with each other, and hence the function can be estimated using the average interaction effect [31] as:

f̂_s(x_s) ≈ (1/n) Σ_{i=1}^{n} f̂(x_s, x_c^{(i)})    (3)

where x_c^{(i)} are the observed values of x_c for the ith of n training samples. Algorithm 3 displays the pseudo-code to find the samples' PD values.
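The average-interaction-effect estimate above can be sketched directly in NumPy: fix the feature of interest at each grid value, average the model's predictions over all samples. The model and synthetic data here are stand-ins, not the study's trained RFR.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=100, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

def partial_dependence(model, X, j, grid):
    """PD of feature j: fix column j at each grid value and average the
    predictions over all samples (the average-interaction-effect estimate)."""
    pd_vals = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v                       # clamp feature j to grid value v
        pd_vals.append(model.predict(Xv).mean())
    return np.array(pd_vals)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 20)
pd_curve = partial_dependence(model, X, 0, grid)
```

Plotting `pd_curve` against `grid` reproduces the kind of marginal-effect curves shown in figure 6 (scikit-learn's `PartialDependenceDisplay` offers an equivalent built-in).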

Algorithm 2.
Calculating SHAP-value for a feature.
Input: the number of features N and their respective real-valued contributions v. The contribution vector v of a particular feature is calculated by perturbing the feature values of coalition π; more details can be found in [46]. k is the number of sampled permutations. Output: SHAP value Φ_j for the feature j ∈ N.
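The permutation-sampling Shapley estimate of algorithm 2 can be sketched in pure Python/NumPy. The toy additive value function below is an assumption chosen so the exact Shapley value is known; it is not the study's model-based contribution function v.

```python
import numpy as np

def shap_value(v, n_features, j, k, rng):
    """Monte-Carlo Shapley estimate for feature j: average the marginal
    contribution v(P ∪ {j}) - v(P) over k random permutations, where P is
    the predecessor set of j in each permutation."""
    total = 0.0
    for _ in range(k):
        perm = rng.permutation(n_features)
        pos = int(np.where(perm == j)[0][0])
        pred = frozenset(int(i) for i in perm[:pos])  # predecessors of j
        total += v(pred | {j}) - v(pred)
    return total / k

# Toy additive game: v(S) = sum of fixed per-feature payoffs, so the exact
# Shapley value of feature j is simply payoff[j]
payoff = [3.0, 1.0, 2.0]
v = lambda S: sum(payoff[i] for i in S)
rng = np.random.default_rng(0)
phi = shap_value(v, 3, 0, 200, rng)
print(phi)  # equals payoff[0] = 3.0 for this additive game
```

For an additive game every marginal contribution equals the feature's payoff, so the estimate is exact; for a real model, v is non-additive and k controls the estimation error.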

Dataset BraTS-2020
The BraTS 2020 [8-10] training dataset includes 369 3D MRI samples for segmentation and metadata (resection status, age, and SD). Of these, 236 patients' metadata are provided for the SD prediction task. The BraTS 2020 validation dataset contains 125 MRI samples and metadata for 29 patients. Each sample instance includes preoperative T1-weighted (T1), post-contrast T1-weighted (T1-ce), T2-weighted, and fluid-attenuated inversion recovery (T2-FLAIR) MRI images, with corresponding ground truth. In addition, the dataset is skull-stripped, aligned to an identical anatomical template, and re-sampled to an isotropic resolution. The segmentation class labels, as defined in BraTS-2020, are label-0 for background voxels, label-1 for necrotic and non-enhancing tumor voxels (NCR/NET), label-2 for edema voxels (ED), and label-4 for enhancing tumor voxels (ET).

Results and discussion
In this work, the prediction model was evaluated through the BraTS evaluation platform [29]. In addition, we have used the BraTS-2020 top-performing models as benchmarks to compare our results. Finally, this section discusses the results of the proposed end-to-end model for its performance and interpretability.

Correlation study of dominant features
To gain better insight, we plotted the correlation matrix of the features, as shown in figure 2 (refer to supplementary table A3 for the annotation of the features). The plot shows that most features are largely uncorrelated, which signifies that they have captured distinct properties of the phenotypes. Furthermore, the histogram in figure 2 validates that most of the selected features have correlations between −0.13 and +0.17, suggesting they are uncorrelated, which justifies the merit of our chosen features.

SD prediction results
The comparison of our SD prediction results with the top-ranking methods of BraTS 2020 is shown in table 2. A robust method must perform well on multiple performance metrics apart from accuracy, as each quantifies the model on different parameters. Hence, we compared the proposed model with the benchmark models [7] and reported the improvement computed using equation (4). Here, the percentage improvement φ for a performance metric x of our proposed model P over a benchmark B is given by:

φ(x) = ((x_P − x_B) / x_B) × 100    (4)

(for error-type metrics, where lower is better, a reduction is counted as an improvement). With this, the survival prediction result of the proposed method shows a 33.33% improvement in accuracy.
There is a 19.13% improvement in MSE, which measures the variance around the fitted regression and indicates the deviation of the model prediction from the actual value; however, it is sensitive to outliers [49]. In the case of median SE, there is a 60.80% improvement; this metric uses the median value of the residuals and is unaffected by outliers. All these results obtained using various metrics indicate the robustness of the prediction [49]. We also see a 2.62% improvement in stdSE and a 181.03% improvement in the SpearmanR coefficient, which is often used to measure the relation between therapy response and SD [48]. As shown in table 2, our model has performed consistently on all the standard metrics used for SD prediction.
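The percentage-improvement measure of equation (4) can be sketched as below. The sign convention for error-type metrics and the sample numbers are illustrative assumptions, not values taken from table 2.

```python
def improvement(proposed, benchmark, higher_is_better=True):
    """Percent improvement of the proposed model over a benchmark on one
    metric (equation (4)); the sign flips for error-type metrics, where a
    reduction counts as an improvement."""
    if higher_is_better:
        return (proposed - benchmark) / benchmark * 100.0
    return (benchmark - proposed) / benchmark * 100.0

# hypothetical values, for illustration only
print(improvement(0.60, 0.45))                          # accuracy-style metric
print(improvement(80.0, 100.0, higher_is_better=False)) # MSE-style metric
```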

Interpretability of SD prediction model
This section presents a detailed analysis of the influence of features on SD prediction. The SHAP results provide local and global impact details, whereas PDP helps analyze features' global impact.

SHAP analysis results
SHAP depicts the importance of features in predicting a sample by calculating SHAP values. It shows the contribution of features to the expected prediction among all the feature combinations. The SHAP value shows how much a single feature affects the prediction, whereas its sign indicates whether the impact on the prediction outcome is positive or negative. Figures 3 and 5 show the SHAP summary and waterfall plots, respectively. The SHAP summary plot helps us visualize the global (generalized) and local (per-sample) impact of features on the model. In contrast, the waterfall plot allows us to visualize and study the features' impact on an individual sample. It enables us to explore the role of features and their values in a particular prediction, where we can minutely examine each feature's behavior for any desired sample.
In the SHAP-summary plot, X-axis displays the SHAP value, which signifies the impact of features on the target feature (here, the target feature is the SD). The greater the value (absolute), the more significant the effect on the target component, whereas the sign (+/−) indicates whether that impact is positive or negative.
In the Y-axis, features are listed in the order of importance (from top to bottom). Each point on the summary plot represents a sample, and the point's color represents the value of the corresponding instance.
Here, blue denotes a low feature value, and red a high one. From figure 3, we observe that the Wavelet-LLL_firstorder_InterquartileRange (WIR) feature has the highest importance. It is a first-order radiomic feature extracted using the wavelet low-pass filter and depicts the distribution of specific pixel values; WIR measures the pixel intensity within the 25th to 75th percentile range. From the plot, we can observe that samples with intermediate or high feature values (purple and red) of WIR contribute positively to the prediction, with a maximum positive SHAP value. In other words, an intermediate or high value of the WIR feature increases the SD of patients. It is also apparent that there is a large aggregation of samples (in blue) within the SHAP value range of −15 to −25 (refer to the WIR feature row listed on the Y-axis). This signifies that the majority of the samples fall into this SHAP range and that these samples are responsible for reducing patients' SD. It also shows that the tumor intensity (pixel value) information in this range reduces the SD, signifying that the intensity of tumor pixels in an MRI plays a significant role. Both Aboussaleh et al and Bae et al mention this fact [51-53].
The second most crucial feature is Age. From figure 3, it is clear that samples with the lower feature value of age have positive SHAP values. In other words, the lower age increases the SD of patients. This observation aligns with medical science inference, i.e. the age of GBM patients is crucial in determining SD, i.e. the lower the age, the more the survivability [54].
The third most crucial feature is cent_wb_x, shown in figure 3. It is a location-based feature representing the centroid coordinate of the WT along the X-axis of an MRI image (a physical coordinate). The plot shows that this feature negatively impacts the prediction at intermediate and higher feature values. That means a higher feature value is responsible for reducing the SD of patients. Here, the X-axis represents the axial view [55], and higher feature values represent the physical coordinates of the central part of the brain. This plot signifies that tumors in the brain's central and latter-mid parts will reduce patients' SD [56]. A similar trend can be observed for the cent_at_x and cent_nec_x features, which are the centroids of the active tumor and the necrosis, respectively.
The fourth most important feature is log-sigma-2-0-mm3D_firstorder_Kurtosis (LFK). It is a first-order radiomic feature extracted using a LoG filter, which signifies the distribution of voxels without considering their spatial relations [57]. This feature measures the tailedness (outliers) of the data distribution. From the plot, we can observe that low kurtosis values increase SD, and higher kurtosis values reduce the SD of patients. Most samples fall within the SHAP-value range of −20 to 60.
The fifth most crucial feature is log-sigma-2-0-mm-3D_glcm_Correlation. It is a second-order radiomic feature extracted using a LoG filter, which measures the inter-relationship of intensity between neighboring voxels [57]. The plot shows that higher feature values are responsible for increasing SD, and lower feature values reduce SD. In other words, a higher correlation between voxel values increases SD, and a low correlation reduces SD.
The sixth most important feature is wavelet-HHH_firstorder_Kurtosis. It is a first-order radiomic feature extracted using a wavelet filter that applies high-pass filters along the z, y, and x directions in series. Like the fourth most important feature, it describes the distribution of voxels independently of their spatial relations. Here, the plot shows that lower feature values are responsible for increasing SD (for more information, see the SHAP-distribution plot in figure 4).
In summary, comparing all the features, the range of SHAP values is −40 to +40 (on the X-axis). Also, as feature importance decreases, the range of SHAP values decreases; that is, features with a low SHAP range have a lower impact on the SD.
Note: most samples and their SHAP-value can also be verified through figure 4, which shows the respective features' SHAP-value and feature value distributions.
Again, the SHAP waterfall plot provides a visual interpretation of the features' contributions to a single prediction. Figure 5 is a SHAP waterfall plot for a single sample. The average SD is shown on the X-axis, and the features are arranged on the Y-axis in descending order of their SHAP values (from top to bottom). From this plot we can analyze how much each feature impacts the prediction negatively (blue) or positively (red), thereby shifting it from the expected outcome E[f(x)], which is the average outcome over all the samples. We observe that for our example sample (for which the plot is generated), the model output is f(x) = 331.732, while the expected output is E[f(x)] = 478.91. This deviation in the model outcome can be understood by quantifying the influence of each of the features.
The SHAP value of each feature in figure 5 depicts this quantification. By adding the SHAP values of all the features, it is possible to determine how the features together move the model output away from the expectation: f(x) = E[f(x)] + Σ_j Φ_j, where Σ_j Φ_j is the sum of the SHAP values of all the features. From this analysis, we can again see that the feature Age has a high impact on the model outcome. For this sample, the Age value is 71.37, and it reduces the average SD by 39.33 days (a minus sign indicates a reduction in SD). Similarly, the cent_wb_x value is 164.651, which also reduces the SD, by 28.22 days. Mapping the Age and cent_wb_x features onto the SHAP summary plot (figure 3) or the SHAP-distribution plot (figure A3), which show the global impact, we can extract similar observations: higher values of these features reduce the SD, so the local explanation is consistent with the global behavior.

PDP analysis
A PDP shows the marginal effect of a desirable feature on the target feature (SD) [30]. It shows how the dependent variable changes when an explanatory variable changes, provided all other variables remain constant. If changing the value of a particular feature creates more variation in the average SD, the feature is crucial. In this analysis, we consider the top six features according to their importance (with respect to their absolute SHAP values). These are WIR, Age, cent_wb_x, LFK, log-sigma-1-0-mm-3D_glcm_Correlation, and wavelet-HHH_firstorder_Kurtosis. The PDPs of the six dominant features are shown in figure 6 (plots of the remaining features are shown in figure A2 of the supplementary section). The PDPs are arranged in the order of importance (higher to lower) obtained through the SHAP summary plot (figure 3).
Furthermore, visualizing the PDPs, we found that the marginal impacts are in line with the order of importance of the features, which supplements the SHAP-value analysis. The detailed analysis of the top six features follows. The marginal effect of the WIR feature on the SD is shown in figure 6. The trend shows that the value sharply increases within the range 100 to 300, reduces within the 300-350 range, and remains saturated within the 350-800 intensity range. It indicates that intensity heterogeneity is very high in the range from 100 to 300, which causes a sharp increase in marginal impact; this suggests that intra-tumor tissues are highly heterogeneous. Comparing the PDP (figure 6(a)) with SHAP (figure 3) and its distribution plots (figure A3(a)), we can conclude that the intensity range between 100 and 300 decreases SD (testified by a decrease in SHAP value). Hence, tumor pixel intensity within this range is detrimental to a patient's survival. Also, as mentioned in other studies, higher tumor heterogeneity is associated with increased malignancy [58], and wavelet filters help capture enhanced texture features [58,59].
Similarly, from the PDP of the Age feature (figure 6(b)), we can observe the trend of the marginal effect, which shows a maximum deviation for younger patients, signifying maximum impact on SD; the marginal effect then reduces with increasing patient age. Comparing with the SHAP summary and SHAP-distribution plots, we can observe that after 60 years of age there is a decrease in SD (as there is a decrease in SHAP value beyond this range). The PD plot for the cent_wb_x feature (shown in figure 6(c)), the physical coordinate of the WT, shows that the marginal effect is more significant if the centroid is within the value range 75-112 (approx.) and less significant for the 113-160 range. Comparing these ranges to the SHAP distribution plot, we observe that the former range of values increases the SD and the latter reduces it, which signifies that tumor lesions in the central or latter part are detrimental to patients. Meanwhile, the log-sigma-2.0-mm-3D_firstorder_Kurtosis feature is first-order statistical radiomic information that measures the peakedness of the data distribution. For a normal distribution, kurtosis (k) is 3; if k > 3, the dataset tends to have significant outliers, and if k < 3, the dataset has few or no outliers. The PD plot (figure 6(d)) shows that for k = 3 there is a higher marginal impact on SD. Comparing the PDP with the SHAP and SHAP-distribution plots, we can conclude that for most samples k is 3, which increases SD, while there are also enough samples with k > 3, which decrease SD. This signifies that there are considerable amounts of outliers or intra-heterogeneity among samples. As mentioned by Steven et al [60], diffusion kurtosis imaging works on a similar principle of capturing non-normal distribution behavior, which signifies tissue heterogeneity. It is also observed that the SD values are positively skewed [61].
The log-sigma-1-0-mm-3D_glcm_Correlation is a radiomics feature that calculates the joint likelihood of occurrence of given pixel pairs with specified intensity values; the gray-level co-occurrence matrix explores spatial relationships between pixels at a specific distance and direction. From the PDP (figure 6(e)), we can observe that if the pixel-pair correlation is greater than or equal to 0.6, it impacts the SD more. Similarly, observing the 0.6 correlation threshold in the SHAP and SHAP-distribution plots, one can see that it has a positive impact (a positive SHAP value). This agrees with Sanghani et al [62], whose study shows that texture features play a crucial role in SD prediction.
Further, wavelet-HHH_firstorder_Kurtosis is a first-order statistical feature like log-sigma-2.0-mm-3D_firstorder_Kurtosis (the fourth most important feature), but it is extracted using wavelet filters. Comparing their PDPs (figures 6(d) and (f)) shows that both capture kurtosis information but in different dimensions. From the PDP (shown in figure 6(f)), we can observe that the kurtosis value increases steeply between 0 and 100 and then stagnates for the rest of the values. Further, observing the SHAP-distribution, we can conclude that most samples lie in this range, with samples near values of 1-10 decreasing the SD and the rest increasing it. However, some samples are sparsely distributed, signifying that they are outliers.
All these signify the importance of these features in determining the SD. With the above analysis, we find that Age, WIR, cent_wb_x, LFK, log-sigma-1-0-mm-3D_glcm_Correlation, and wavelet-HHH_firstorder_Kurtosis play a crucial role in determining a patient's SD. Similarly, we can analyze the other remaining features. Finally, we conclude that the WIR feature can tell us about tumor heterogeneity associated with high malignancy. Again, the Age feature showed us the trend of survivability, where the survival chances decrease with the patient's increasing age (this is further validated by the KM [34] plot shown in supplementary figure A4). At the same time, the centroids of tumors enabled us to locate tumors in the central or latter-central part, which are detrimental for patients. All these analyses using SHAP and PDPs are analogous to medical findings and related studies. This signifies the model's reliability and validates the explainability methods, SHAP and PDP.
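The KM estimate used for the age-based validation above can be sketched in NumPy: at each distinct event time, the survival probability is multiplied by the fraction of at-risk patients who survive that time. This is a minimal sketch on a toy cohort, not the BraTS metadata or any survival library.

```python
import numpy as np

def kaplan_meier(times, events):
    """KM survival curve: at each distinct event time t_i,
    S(t_i) = S(t_{i-1}) * (N_i - D_i) / N_i, where N_i is the number
    at risk just before t_i and D_i the number of deaths at t_i.
    events: 1 = death observed, 0 = censored."""
    order = np.argsort(times)
    times, events = np.asarray(times)[order], np.asarray(events)[order]
    s, curve = 1.0, []
    for t in np.unique(times[events == 1]):
        n_at_risk = int(np.sum(times >= t))
        d = int(np.sum((times == t) & (events == 1)))
        s *= (n_at_risk - d) / n_at_risk
        curve.append((int(t), float(s)))
    return curve

# toy cohort of survival days (hypothetical values)
curve = kaplan_meier([100, 200, 200, 400, 500], [1, 1, 0, 1, 0])
print(curve)  # step-wise survival estimates at each observed event time
```

The resulting step function stays constant between event times, matching the cumulative behavior of the KM curve in figure A4.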

Limitations of the proposed approach and future prospect
The SHAP and PDP techniques are post-hoc methods that interpret the model after training is complete. However, for further understanding of a model, studying its intrinsic characteristics may help to an extent. Functional imaging such as positron emission tomography (PET), functional magnetic resonance imaging (fMRI), and magnetic resonance spectroscopy (MRS) can provide insights into GBM by capturing molecular or physiological information not captured by conventional MRI or CT scans. Methods like the neural ordinary differential equation (NODE) model can expose the learning behavior of a model, especially for understanding the spatiotemporal deep feature extraction of a segmentation model [63]. Further, diffusion imaging modalities such as diffusion kurtosis imaging [64] may help us understand the underlying biological and pathological characteristics of GBM. However, these kinds of functional imaging are more complex to analyze, have high variability across imaging sessions, are more susceptible to noise, and are also expensive; in short, they face several challenges for routine GBM prognosis [65,66]. Still, we believe that integrating these modalities with conventional MRI techniques will enhance the understanding of GBM with added model transparency and interpretability.

Conclusion
We have proposed an end-to-end approach for the SD prediction task. We identified the 29 most dominant features that help predict SD accurately and validated the optimality of these features using correlation and histogram plots. The trained model performs well on multiple performance metrics and predicts SD more accurately than the top-ranking methods of the BraTS-2020 competition. Further, we explored the interpretability of the model to understand its decisions globally and locally using post-hoc methods, i.e. SHAP and PDP. Observing these plots, we found that first-order statistical features, Age, location-based features, and texture features play a crucial role in the prediction. These interpretability methods can provide valuable insights into the model and give human-understandable inferences. The inferences obtained for the six dominant features using these methods were in line with medical facts. We also found that WIR, Age, and location-based features have the greatest influence on SD prediction, and we further verified this conclusion using the KM estimation method on the metadata available with the BraTS dataset. Thus, the model is robust in predicting brain tumor patients' survivability. In addition, the interpretability methods can help us understand model behavior at multiple levels, which will ultimately help develop trust between medical experts and ML models and support their incorporation into clinical practice.

Data availability statement
The data cannot be made publicly available upon publication because they are not available in a format that is sufficiently accessible or reusable by other researchers. The data that support the findings of this study are available from the authors upon reasonable request.

Declarations

Figure A3. SHAP value and its distribution for the dominant features. The X-axis shows the feature value of the respective feature, whereas the Y-axis shows the SHAP value of the respective instances. The shaded region shows the distribution of instances, and each dot is an instance from the training dataset. This helps us visualize where the majority of SHAP feature values lie, how individual instances impact the target feature (the instance-wise range of impact), and its distribution, which testifies to how these values play a role in defining the important features. Across all the SHAP plots, we can observe that the magnitude of the SHAP value reduces with the order of importance of the features (high to low). Note: the SHAP value quantifies how much a feature value shifts the model's prediction away from the average prediction.

The KM estimator measures the percentage of patients who have survived over a certain period after treatment or surgery. It computes the probability of the event occurring over a duration of time by dividing the duration into small intervals and re-estimating the probabilities to obtain the final estimate. The survival probability is computed as follows:

S(t_i) = S(t_{i-1}) × (N_i − D_i) / N_i,

where N denotes the number of people at risk, D denotes the number of people who died, and t_i is the i-th time interval. The KM survival curve is shown in figure A4. It is a cumulative measure, and the survival estimate remains the same until another individual encounters the event. From this plot, we observed that the survival probability of older patients is low. Survivability reduces almost linearly after the age of 50, almost exponentially after the age of 70, and is very low beyond the age of 80. This KM analysis of the metadata supports the PDP analysis, which shows an exponential decay of survivability from around the age of 70.

LoG-sigma-1-0-mm-3D-glcm-Correlation (0722.95): it assesses the association between voxel pairs and their respective intensity values after applying the LoG filter with sigma value 1.
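A minimal sketch of the KM computation described above, in pure NumPy with illustrative survival times and event indicators (not the BraTS metadata):

```python
import numpy as np

def kaplan_meier(times, events):
    """Return (unique event times, KM survival estimates S(t_i)).

    times  : follow-up time of each patient
    events : 1 if the patient died at times[i], 0 if censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    t_uniq = np.unique(times[events == 1])   # intervals where deaths occur
    surv, estimates = 1.0, []
    for t in t_uniq:
        n_at_risk = np.sum(times >= t)                    # N: still followed at t
        n_deaths = np.sum((times == t) & (events == 1))   # D: deaths at t
        surv *= (n_at_risk - n_deaths) / n_at_risk        # S(t_i) = S(t_{i-1})(N-D)/N
        estimates.append(surv)
    return t_uniq, np.array(estimates)

# Illustrative data: 5 patients, one censored at t=3.
t, s = kaplan_meier([1, 2, 3, 4, 5], [1, 1, 0, 1, 1])
# s -> [0.8, 0.6, 0.3, 0.0]: the estimate stays flat between events and
# the censored patient leaves the risk set without forcing a drop.
```

The cumulative product mirrors the recurrence above: each interval's conditional survival (N − D)/N multiplies the estimate carried over from the previous interval.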
Wavelet-HHH-Gldm-DependenceVariance (0669.58): it calculates the variance in the dependence matrix of the image after applying the Wavelet HHH band filter.

LoG-sigma-2-0-mm-3D-firstorder-Kurtosis (0555.77): it assesses the peakedness of the intensity distribution of a given image after applying the LoG filter with sigma value 2.

Wavelet-HLH-Gldm-SmallDependence-LowGrayLevelEmphasis (0464.70): it assesses the combined spread of small dependence with lower gray-level values after applying the Wavelet HLH band filter.

Wavelet-LLL-firstorder-InterquartileRange (0438.99): it assesses the difference between the 75th and 25th percentiles of the image array after applying the Wavelet LLL band filter.

Wavelet-LHH-firstorder-RootMeanSquared (0420.35): it assesses the root mean square of the intensity values of an image after applying the Wavelet LHH band filter.

LoG-sigma-4-0-mm-3D-Glcm-SumAverage (0406.08): it assesses the relationship between pair occurrences with lower intensity values and pair occurrences with higher intensity values after applying the LoG filter with sigma value 4.

Wavelet-HHH-Glrlm-RunLengthNonUniformity (0395.63): it assesses the homogeneity between the different run lengths of the image.

LoG-sigma-5-0-mm-3D-Glszm-SmallAreaEmphasis (0357.96): it assesses the spread of small size-zones, i.e., the number of connected voxels that share the same gray-level intensity value, after applying the LoG filter with sigma value 5.

Wavelet-LLH-Ngtdm-Coarseness (0357.51): it assesses the spatial rate of change in the intensity values after applying the Wavelet LLH band filter.

Wavelet-LLH-firstorder-InterquartileRange (0340.11): it assesses the difference between the 75th and 25th percentiles of the image array after applying the Wavelet LLH band filter.

cent-at-x (0282.22): centroid of the active tumor across the x-axis.

Wavelet-HHH-firstorder-Kurtosis (0247.80): it assesses the peakedness of the spread of the image's intensity values after applying the Wavelet HHH band filter.

Wavelet-HHH-Glcm-DifferenceAverage (0247.36): it assesses the relationship between the occurrences of pairs with similar intensity values and those with different intensity values after applying the Wavelet HHH band filter.

cent-wb-x (0229.91): centroid of the whole tumor across the x-axis.

LoG-sigma-1-0-mm-3D-firstorder-Variance (0217.80): it measures the spread of the distribution about the mean intensity value after applying the LoG filter with sigma value 1.

LoG-sigma-2-0-mm-3D-Glszm-LargeAreaHighGrayLevelEmphasis (0183.47): it assesses the combined spread of larger size-zones with higher gray-level values after applying the LoG filter with sigma value 2.

Wavelet-LLH-firstorder-Range (0131.10): it assesses the range of gray-level values of an image after applying the Wavelet LLH band filter.

Wavelet-LHH-Gldm-DependenceEntropy: it assesses the randomness in the dependencies of an image after applying the Wavelet LHH band filter.
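Several of the dominant features above are simple first-order statistics. A minimal NumPy sketch of a few of them (InterquartileRange, Range, RootMeanSquared, Kurtosis) follows; note this is an illustration on a raw intensity array, whereas the features in the table apply these statistics after LoG or Wavelet filtering, as computed by a radiomics library:

```python
import numpy as np

def first_order_features(intensities):
    """A few first-order statistics of an intensity array.

    Illustrative only: the listed features compute these on LoG- or
    Wavelet-filtered images, not on the raw array used here.
    """
    x = np.asarray(intensities, dtype=float)
    q75, q25 = np.percentile(x, [75, 25])
    mean, var = x.mean(), x.var()
    return {
        "InterquartileRange": q75 - q25,            # spread of the middle 50%
        "Range": x.max() - x.min(),                 # full spread of intensities
        "RootMeanSquared": np.sqrt(np.mean(x**2)),  # magnitude of intensities
        # Non-excess kurtosis, E[(x - mu)^4] / sigma^4: peakedness of the
        # intensity distribution.
        "Kurtosis": np.mean((x - mean) ** 4) / var**2,
    }

feats = first_order_features([1, 2, 3, 4, 5, 6, 7, 8])
```

For this flat toy distribution the kurtosis is well below 3 (the Gaussian value), matching the intuition that a flat histogram has no intensity peak.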