Integrative radiogenomics analysis for predicting molecular features and survival in clear cell renal cell carcinoma

Objectives: To assess the feasibility of predicting molecular characteristics by computed tomography (CT) radiomics features, and predicting overall survival (OS) using combination of omics data in clear cell renal cell carcinoma (ccRCC). Methods: Genetic data of 207 ccRCC patients was retrieved from The Cancer Genome Atlas (TCGA) and matched contrast-enhanced CT images were obtained from The Cancer Imaging Archive (TCIA). Another cohort of 175 ccRCC patients from West China Hospital was used as external validation. We first applied radiomics features and machine learning algorithms to predict genetic mutations and mRNA-based molecular subtypes. Next, we established predictive models for OS based on single omics, combined omics (radiomics+genomics, radiomics+transcriptomics, radiomics+proteomics) and all features (multi-omics). Results: Using radiomics features, random forest algorithm showed good capacity to identify the mutations VHL (AUC=0.971), BAP1 (AUC=0.955), PBRM1 (AUC=0.972), SETD2 (AUC=0.949), and molecular subtypes m1 (AUC=0.973), m2 (AUC=0.968), m3 (AUC=0.961), m4 (AUC=0.953). The TCGA cohort was divided into training (n=104) and validation (n=103) sets. The radiomics model had promising prognostic value for OS in validation set (5-year AUC=0.775) and external validation set (5-year AUC=0.755). In the validation set, the radiomics+omics models enhanced predictive accuracy than single-omics models, and the multi-omics model made further improvement (5-year AUC=0.846). High-risk group of validation set predicted by multi-omics model showed significantly poorer OS (HR=6.20, 95%CI: 3.19-8.44, p<0.0001). Conclusions: CT radiomics might be a feasible approach to predict genetic mutations, molecular subtypes and OS in ccRCC patients. Integrative analysis of radiogenomics may improve the survival prediction of ccRCC patients.


INTRODUCTION
Renal cell carcinoma (RCC) is a diverse group of carcinomas including three common histopathological subtypes (clear cell, papillary, chromophobe) and other rare subtypes, which resulted in 403,300 new cases and 175,100 deaths worldwide [1,2]. Clear cell renal cell carcinoma (ccRCC) is the most prevalent type (70-75%), and more prone to advanced T stage, higher nuclear grade, metastatic lesions and poorer prognosis than papillary and chromophobe RCC [3,4]. Patient status, tumor stage, nuclear grade, coagulative necrosis and proinflammatory markers are commonly used prognostic indicators in clinical practice [5]. However, precise treatments of ccRCC are developing rapidly, thus it is necessary to AGING continuously evaluate the emerging prognostic factors to enable clinicians to individually risk-stratify and select the optimal therapeutic strategy for patients.
Recent advances in genetics performed extensive genomics and transcriptomics profiling to reveal the underlying molecular mechanism of ccRCC [6]. The characteristic of ccRCC is the loss of chromosome 3p, including mutations of genes encoding von Hippel-Lindau tumor suppressor (VHL), polybromo-1 (PBRM1), BRCA1-associated protein 1 (BAP1) and SET domain containing 2 (SETD2) [6]. The VHL pathway is essential for adaptation to hypoxia. Loss of VHL function and dysregulated hypoxia lead to the activation of downstream pathways associated with angiogenesis [7]. Several targeted blockers of vascular endothelial growth factor pathway (e.g., sorafenib, axitinib) have been approved for treatment of metastatic RCC [8]. However, the capability of VHL mutation as a prognostic factor was not significant in ccRCC [9]. Conversely, PBRM1 mutation was associated with shorter survival, early tumor progression, and response to immune checkpoint therapies in ccRCC patients [10,11]. Patients with BAP1 or SETD2 mutation were also more likely to have worse prognosis [12], and BAP1 mutation was related with poor response to everolimus [13]. Furthermore, the four molecular subtypes obtained from gene expression pattern analysis showed different genetic alterations and survival probabilities [6]. Taken together, these molecular features have implications to prognosis prediction and treatment strategies for ccRCC patients. Given the invasive and high-cost molecular assays, developing non-invasive and cost-effective biomarkers for these mutations and molecular subtypes would be important.
Quantitative image analysis can capture distinct imaging phenotypes that cannot be recognized by naked eye, and reveal the underlying pathophysiology of biomedical images [14]. Radiomics is the practice of converting digital images into mineable quantitative features, then performing subsequent analyses to improve accuracy of differentiating tumor types and grade, predicting prognosis and therapeutic response [15]. Radiogenomics refers to the investigation of connection between radiomics and genomics data, which has extended to link radiomics to broader biological features such as proteomics and metabonomics in recent years [16]. Previous studies have used radiomics features to non-invasively identify gene expression, mutations, molecular subtypes and methylation status within the tumors [17][18][19][20]. In addition, the integration of radiomics and omics has been reported to achieve a more accurate prediction of survival in several cancers [21][22][23]. Therefore, the radiogenomics analysis might offer insights into the molecular phenotypes of cancer, and provide predictive biomarkers for personalized management of cancer patients.
In ccRCC, computed tomography (CT) is routinely used to capture the physical characteristics of the whole tumor volume. Radiomics features extracted from CT scans have showed significant ability to predict mutation status of VHL, BAP1 and PBRM1 in ccRCC [24,25]. To our knowledge, no study has applied CT image features to identify molecular subtypes, and performed multi-omics analysis to predict prognosis of ccRCC patients. Therefore, we first aimed to comprehensively evaluate the potential value of CT radiomics features in classifying mutations and molecular subtypes of ccRCC, with the use of multiple machine learning algorithms. Next, the present study established and validated various predictive models (radiomics, genomics, transcriptomics, proteomics, multi-omics) to enhance the prediction of overall survival in patients with ccRCC.

Predicting mutations and molecular subtypes from radiomics features
In this study, we included 207 ccRCC patients from TCGA dataset (Table 1). To present the clinical utility of radiomics features in ccRCC, we first estimated the power of radiomics features to predict four commonly mutated genes (VHL, BAP1, PBRM1 and SETD2) of 137 patients and four molecular subtypes (m1-m4) reflected by mRNA patterns of 180 patients (Table 1). We used four algorithms (GBDT, LASSO, RF, XGBoost) for feature selection, and eight algorithms (RF, GBDT, AdaBoost, LR, DT, SVM, NB, KNN) as classifiers. Among the 32 combinations of two algorithms, the RF achieved the best performances on the test set no matter which feature selection method was adopted ( Figure 1A). The radiomics models based on RF were able to distinguish the mutations VHL (AUC=0.971), BAP1 (AUC=0.955), PBRM1 (AUC=0.972), SETD2 (AUC=0.949), and subtypes m1 (AUC=0.973), m2 (AUC=0.968), m3 (AUC=0.961), m4 (AUC=0.953) in the test set (Supplementary Table 1). The models built by GBDT and AdaBoost also obtained comparable accuracy to that of RF model, suggesting that radiomics features were capable of identifying somatic mutations and molecular subtypes of ccRCC through machine learning.

Prognostic value assessment of radiomics features
According to median value of radiomics features, the patients were divided into two groups (higher than  Table 2), most of which were adverse prognostic factors (p<0.05, Figure 1B). Moreover, we performed multivariate LASSO-Cox regression analysis, and selected four features with non-zero coefficients, including sphericity, surface-to-volume ratio, GLCM_correlation and GLDM_gray-level nonuniformity (GLNU). Kaplan-Meier curves showed the survival differences between groups with high-value or low-value features (Supplementary Figure 1). Except surface-to-volume ratio, other three features were significantly predictive of OS. Taken together, it indicated that individual radiomics features had the potential to predict OS of ccRCC patients.

Integrating radiomics with genomics features to predict survival
Previous studies have found that mutation status has an important impact on the prognosis of cancer patients. Therefore, we compared the prognostic role of radiomics and genomics features, and combined two omics, hoping to provide more accurate stratification of patients' prognosis. We randomly AGING divided the TCGA cohort into training (n=104) or validation (n=103) sets ( Table 1). The 100 most common somatic mutations of training set were used for models (Supplementary Table 3 and Figure 2A). In the training set, we established models to predict OS based on 107 radiomics features and 100 mutations. Time-dependent ROC is more suitable for time-to-event results than classical ROC, thus we used it to obtain dynamic AUC values throughout these models in the validation set [26].
In the validation set, the AUC of radiomics model (R) were significantly higher than genomics model (G) at 5year (0.775 vs. 0.684, p=0.030) time point ( Figure 2B and Supplementary Table 4). Integrative model of radiomics and genomics (RG) achieved better predictive performance, the 1-year, 3-year and 5-year AUCs of which were 0.807, 0.814 and 0.784. The increase in AUC was obvious compared with radiomics and genomics models (all p<0.05; Figure 2B). We then defined high-risk and low-risk groups by median risk score calculated from each model and drew the Kaplan-Meier curves ( Figure 2C). The high-risk groups in radiomics model (HR=3.62, 95%CI: 2.04-6.73, p=0.002) and radiomics+genomics model (HR=3.93, 95%CI: 2.27-12.12, p=0.017) were significantly related with poorer survival.

Integrating radiomics with transcriptomics features to predict survival
Similarly, we selected a part of whole expressed genes to reduce the dimensionality. Patients of training set were assigned into short-term (deceased, 12 months≥OS≥1 month) and long-term (OS≥60 months) survivors. We used the GSEA to reveal enriched KEGG pathways in the short-term survival group ( Figure 3A). NOD-like receptors (NLRs) are closely related to autoimmune and inflammatory responses, and chronic inflammation caused by abnormal NLR signaling can promote tumorigenesis and progression, especially in gastric and colorectal cancers [27]. However, the role of NLR pathway in renal cancer remains unclear and needs further investigation.

Integrating radiomics with proteomics features to predict survival
The protein expression data was also considered in this study, expecting to improve prognosis prediction. We performed the reverse phase protein array (RPPA) analysis, and included all 131 proteins in the models (Supplementary Table 6). In the validation set, the prediction accuracy of proteomics model (P) was similar to that of radiomics model (all p>0.05; Figure Figure 4D).

External validation of radiomics model to predict survival
We additionally included 175 ccRCC patients from West China Hospital to verify the prognostic performance of radiomics model. The age of onset was younger and the rate of high-grade tumor was lower in the external validation set (Table 1). However, the radiomics model also successfully predicted patient survival in external validation set, which reached AUCs of 0.726, 0.781, 0.755 for 1-year, 3-year and 5-year OS, respectively ( Figure 4E). In Kaplan-Meier survival curves, the high-risk group predicted by radiomics model showed significantly shorter survival time (HR=2.08, 95%CI: 1.09-3.96, p=0.026; Figure 4F).

Integrating multi-omics features to predict survival
Previous results suggested the feasibility of integrating radiomics with genomics, transcriptomics or proteomics to improve survival prediction. Finally, we investigated whether incorporation of all features (radiomics, genomics, transcriptomics and proteomics) could make AGING further improvement on the prognostic model. Tested by the validation set, the 1-year, 3-year and 5-year AUC of the multi-omics model were 0.871, 0.873 and 0.846 ( Figure 5A), which were significantly increased compared to the models formed by single omics (Supplementary Table 4). We also found a significant difference between high-risk and low-risk groups' survival (HR=6.20, 95%CI: 3.19-8.44, p<0.0001; Figure 5B). In addition, DCA curves demonstrated that the multi-omics model added more net benefit than radiomics and other single-omics models when used to predict survival ( Figure 5C).

DISCUSSION
Radiomics analysis offers quantitative image features that can be correlated with clinical or molecular characteristics of cancer [28]. In this study, we mined radiomics features from contrast-enhanced CT, and built various machine learning classifiers based on radiomics features to predict different somatic mutations and molecular subtypes of ccRCC. In addition, we presented a comprehensive framework to evaluate the prognostic value of individual radiomics features and omics-based predictive models. The results demonstrated that the mutations, molecular subtypes and prognosis of ccRCC patients could be predicted from CT radiomics features. Furthermore, the integrative models combining radiomics with gene mutation, gene expression and protein expression features improved the predictive power of overall survival than models constructed by any omics alone. The median risk scores generated by models were feasible to separate patients into high-risk and low-risk groups with significantly different survival outcomes.
As a hallmark of cancer, genetic instability and mutations can lead to uncontrolled cell proliferation [29]. Genomics profiles are widely used as biomarkers to predict survival and therapeutic response, helping to make medical decision, especially the targeted therapy selection [30]. Considering the invasive procedures of genomics testing, radiomics is rapidly emerging as a translational approach to non-invasively identifying mutations and expression patterns [28]. In this context, many efforts have been made to improve radiogenomics analysis in ccRCC [16]. For example, random forest classifier of CT radiomics features correctly identified mutations in PBRM1 (AUC=0.987) and BAP1 (AUC=0.897) of ccRCC [31,32]. In this study, we presented a more comprehensive assessment including four common mutations and 32 combinations of  AGING machine learning algorithms. Meanwhile, we applied radiomics features to predict mRNA-based molecular subtypes (m1-m4) of ccRCC, which has not been reported before. Clustering of mRNA expression provides a molecular classification of individual tumors, which facilitates the understanding of underlying tumor heterogeneity [33]. For instance, m1 subtype tumors had significant survival advantages and more frequent PBRM1 mutation than m2 and m3, while m3 subtype was enriched for CDKN2A deletion and PTEN mutation [6]. The m4 subtype was related with higher mutation frequency of BAP1 and mTOR [6]. Our results showed that CT radiomics features could be used to develop a non-invasive, convenient and effective tool to predict mutations and molecular subtypes of ccRCC, and random forest algorithm performed best among the eight classifiers.
Timely and accurate identification of patients with highrisk of developing worse outcomes is crucial for their clinical decision-making. For renal cancer, radiomics has been widely applied to predict patient survival and cancer progression [34]. In this study, several radiomics features had prognostic value for OS, including sphericity, GLCM_correlation and GLDM_GLNU. Sphericity (regularity of shape) is a shape-based feature, GLCM_correlation reflects the linear dependency of graylevels, and GLDM_GLNU describes the non-uniformity of gray-levels of texture [35]. Previous studies also reported that sphericity and GLCM_correlation were associated with survival outcomes in cancer patients [36,37]. However, evaluation of individual radiomics features was not enough to reflect the whole tumor properties. Therefore, we imputed all radiomics features into the random forest classifier to build a radiomics model, which obtained good predictive capability for OS in both internal validation (5-year AUC=0.775) and external validation sets (5-year AUC=0.755).
Furthermore, we comprehensively estimated the predictive models integrating radiomics with genomics, transcriptomics and proteomics in ccRCC. The results demonstrated that the predictive accuracy of models using two omics was higher than that of single-omics models, and the multi-omics model reached highest accuracy and clinical net benefit. Study for glioblastoma also showed the enhanced predictive capacity of models combining genomics and radiomics features [21]. Radiomics offers important strengths for tumor assessment, such as large data sources because radiologic images are available in almost all cancer patients, measurement of heterogeneity of entire tumor, and longitudinal use in treatment monitoring [14,38]. In our results, radiomics features were not correlated with genomics data especially in Kaplan-Meier survival curves, indicating that radiomics may provide additional information on cancer characterization that are different from genomics. The integration of radiomics and genomics data could be valuable and enable more precise survival prediction. However, the prognostic role of radiomics and other omics features is still controversial, for example, few studies demonstrated that VHL mutation could predict patient survival [9], and many genes and pathways still lack related evidence. Moreover, the correlation between radiomics and other omics features is complex. Therefore, further investigation to reveal the causal relationship between these features is needed.
There were several limitations in this study, including the retrospective study design and limited sample size. We randomly divided TCGA patients into training and validation sets to verify the robustness of predictive models in validation set. However, the quality of images from public dataset showed great differences, which may influence the radiomics analysis. Thus, we further estimated the generalizability and reliability of radiomics model on an external validation set. However, the genomics, transcriptomics and proteomics data were lacked in external validation set. It was necessary to assess our integrative models in other datasets and populations. In addition, small patient populations with BAP1 or SETD2 mutations might cause potential bias, which was a common problem in other radiogenomics studies. Therefore, before clinical application, the findings based on the limited dataset need to be prospectively validated on multi-center and large-scale studies.

CONCLUSIONS
Radiomics features of contrast-enhanced CT had promising potential in predicting mutation status, molecular subtypes and overall survival of patients with ccRCC. Moreover, integrative models of radiomics, genomics, transcriptomics and proteomics could improve prognostic prediction compared with radiomics or other omics alone, which may help clinicians to perform better risk stratification and decision-making for ccRCC patients.

Study design and data sources
In this retrospective study, radiomics features were applied to predict somatic mutations and molecular subtypes, and to form integrative models for survival prediction in ccRCC patients ( Figure 6). One cohort of ccRCC patients with available CT images was downloaded from The Cancer Imaging Archive (TCIA) data portal (http://www.cancerimagingarchive.net/). The matched clinical data, genomics, transcriptomics and AGING proteomics profiles were obtained from The Cancer Genome Atlas (TCGA) (https://portal.gdc.cancer .gov/) and The Cancer Proteome Atlas (TCPA) (http://tcpaportal.org/tcpa/). The inclusion criteria were pathologically confirmed ccRCC patients with preoperative contrast-enhanced CT for better tumor visualization. Patients without preoperative CT (n=16) or without contrast-enhanced CT (n=14) were excluded, thus 207 patients were eligible for radiomics analysis.
Another ccRCC cohort was diagnosed in the West China Hospital from January 2011 to December 2018. We excluded the patients without preoperative contrastenhanced CT (n=33) or lost to follow-up (n=21). Finally, a total of 175 patients with ccRCC were enrolled in this study. All these patients were followed up until death or last follow-up of December 2020. The Ethics committee of West China Hospital approved this study. All personal information was de-identified, thus the patient informed consents were waived.

Image labeling and feature extraction
Before feature extraction, an expert oncologist (X.M., 10 years of experience) that blinded to patients' information manually draw the region of interest (ROI) of the whole tumor. We used the 3D Slicer software 4.10.2 (http://www.slicer.org) for both ROI labeling and feature extraction. Radiomics features can be divided into shape-based, first-order, secondorder and higher-order statistical metrics. Shape-based features measure the three-dimensional shape and size of tumor, including surface area, volume, sphericity and others. First-order features are generally based on histogram-related methods, which represent the value distribution of individual voxels without considering the spatial relationship. For example, skewness describes the degree of asymmetry of histogram distribution, and kurtosis refers to the peakedness of histogram. Second-order features evaluate the spatial relationship between voxels with similar contrast levels to reveal the intratumoral heterogeneity, including gray-level co-occurrence matrix (GLCM), gray-level dependence matrix (GLDM), gray-level run length matrix (GLRLM), gray-level size zone matrix (GLSZM), neighborhood gray tone difference matrix (NGTDM) and so on. Finally, there were 107 quantitative features for each patient, which contained 14 shape-based features, 18 histogram-based features, 24 GLCM features, 14 GLDM features, 16 GLRLM features, 16 GLSZM features and 5 NGTDM features.

Statistical analysis
1. Prediction of mutations and subtypes: We divided the TCGA cohort into training and test sets with the ratio of 7:3. Considering the risk of over-fitting due to data complexity, we reduced the dimensionality of radiomics features by machine learning algorithms including gradient boosting decision tree (GBDT) [39], least absolute shrinkage and selection operator (LASSO) [40], random forest (RF) [41] and extreme gradient boosting (XGBoost) [42]. Afterwards, eight machine learning algorithms, namely RF, GBDT, adaptive boosting (AdaBoost) [43], logistic regression (LR) [43], decision tree (DT) [44], support vector machine (SVM) [45], naive Bayesian (NB) [46] and K-nearest neighbor (KNN) [47] were utilized to establish binary classifiers on remained radiomics features to predict somatic mutations (VHL, BAP1, PBRM1 and SETD2) and mRNA-based molecular subtypes (m1-m4). The reason for using multiple algorithms was to indicate the feasibility of this method in different algorithms. The 5-fold cross-validation was used to assess classifiers in the training set. Finally, the performances of trained classifiers were independently validated by the area under the receiver operating characteristic (ROC) curve (AUC) in the test set.
2. Survival analysis: According to median value of radiomics features, TCGA patients were divided into high-value and low-value groups. Compared with progression-free survival, the evaluation of overall survival (OS) is more objective and less affected by artificial factors. We chose the OS as the endpoint to better explore the prognostic performance of features. Univariate Cox regression analysis calculated the hazard ratio (HR) and 95% confidence interval (CI) for OS between two groups. Then we used the multivariate Cox regression with LASSO to shrink the regression coefficients of uncorrelated features to zero, and obtained features with non-zero coefficients. Kaplan-Meier survival curve and log-rank test were analyzed. The p<0.05 was considered as statistically significant.
3. Model features pre-selection: To comprehensively present the prognostic value of each omics, we estimated prognostic models based on one omics data (radiomics, genomics, transcriptomics, proteomics) and combinations (radiomics+genomics, radiomics+transcriptomics, radiomics+proteomics, multi-omics). The TCGA cohort was randomly divided into training (n=104) or validation (n=103) sets. To reduce the dimensionality of genomics data, 100 most common mutations of training set were used for building models. For transcriptomics, we first assigned the training set into short-term (deceased, 12 months≥OS≥1 month) and long-term (OS≥60 months) survivors. Differently expressed genes (DEGs) between two groups were further involved in models. In addition, we applied the Gene Set Enrichment Analysis (GSEA) to show enriched Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Gene sets were significant if the p<0.05 or false discovery rate (FDR)<0. 25. 4. Prognostic model construction and validation: In the training set, random forest (RF) was used to establish prognostic models, because of its excellent performance of handling numerous inputs and selecting relevant features. The 1000 decision trees of RF and 5-fold cross-validation were used. In the validation set, we verified the robustness of these models using the timedependent ROC curve and compared AUC values through the built-in comparison function of timeROC package. Patients in validation set were then separated into high-risk and low-risk groups according to median value of risk score estimated by models. In addition, decision curve analysis (DCA) was applied to compare the net benefits on threshold probabilities of models [48]. Statistical analyses were conducted using R version 3.6.1.

AUTHOR CONTRIBUTIONS
XM and HZ had contributions to the conception and design of the work; HZ, LC and MW performed the data analysis, interpretation and manuscript drafting; YH and YL contributed to the data acquisition; all authors have revised the manuscript, approved the submitted version, and agreed to be accountable for all aspects of the work.  Please browse Full Text version to see the data of Supplementary Tables 2, 3