Deep learning and radiomic feature-based blending ensemble classifier for malignancy risk prediction in cystic renal lesions

Background The rising prevalence of cystic renal lesions (CRLs) detected by computed tomography necessitates better identification of the malignant cystic renal neoplasms since a significant majority of CRLs are benign renal cysts. Using arterial phase CT scans combined with pathology diagnosis results, a fusion feature-based blending ensemble machine learning model was created to identify malignant renal neoplasms from cystic renal lesions (CRLs). Histopathology results were adopted as diagnosis standard. Pretrained 3D-ResNet50 network was selected for non-handcrafted features extraction and pyradiomics toolbox was selected for handcrafted features extraction. Tenfold cross validated least absolute shrinkage and selection operator regression methods were selected to identify the most discriminative candidate features in the development cohort. Feature’s reproducibility was evaluated by intra-class correlation coefficients and inter-class correlation coefficients. Pearson correlation coefficients for normal distribution and Spearman's rank correlation coefficients for non-normal distribution were utilized to remove redundant features. After that, a blending ensemble machine learning model were developed in training cohort. Area under the receiver operator characteristic curve (AUC), accuracy score (ACC), and decision curve analysis (DCA) were employed to evaluate the performance of the final model in testing cohort. Results The fusion feature-based machine learning algorithm demonstrated excellent diagnostic performance in external validation dataset (AUC = 0.934, ACC = 0.905). Net benefits presented by DCA are higher than Bosniak-2019 version classification for stratifying patients with CRL to the appropriate surgery procedure. Conclusions Fusion feature-based classifier accurately distinguished malignant and benign CRLs which outperformed the Bosniak-2019 version classification and illustrated improved clinical decision-making utility. Supplementary Information The online version contains supplementary material available at 10.1186/s13244-022-01349-7.

He et al. Insights into Imaging (2023) 14:6 Background The detection rate of cystic renal lesions (CRLs) is rising quickly as computed tomography (CT) becomes increasingly prevalent. A minority of CRLs are malignant renal neoplasms requiring surgical intervention. Cystic renal neoplasms present a broad category of kidney tumors with a wide range of biological profiles according to the WHO kidney tumor classification and the necessity of early surgical treatment for malignant CRL cannot be overstated [1]. However, the majority of CRLs are simple renal cysts or benign cystic renal neoplasms, which do not necessitate a radical surgery procedure like partial or radical nephrectomy. Since the components of CRL must be accurately identified in order to determine the appropriate treatment strategies, CT imaging is commonly utilized to differentiate CRL. Meanwhile, malignant CRL are difficult to diagnose and manage, especially in the early stage, due to their complex pattern on CT images including thickness of septation, enhancement of the mural nodule, calcifications, and etc. [2]. In an effort to identify malignant CRL at an early stage, standardize the terminology explaining complicated renal cysts, and offer criteria for classifying radical surgery-required malignant CRL, the Bosniak classification system was established [3,4]. The updated 2019 version of the Bosniak classification system introduced more discriminative and quantitative criteria to improve the specificity in identifying higher risk CRL categories. In addition, it explicated detailed meanings of key terms to promote agreement and consistency among different readers. Based on the updated Bosniak-classification, one or more enhancing nodules in the CRL with obtuse margins (more than 4 mm) or with acute margins indicate a malignant renal neoplasm. Thickened wall or septa with enhancement in CRL also indicate the possibility of malignancy. However, these high-risk CRLs (IIF, III, IV) according to Bosniak classification could still be benign CRLs rather than malignant neoplasms. Inaccurate treatment and associated diagnostic errors caused by the misapplication of Bosniak categorization may lead to excessive medical care following adverse results like renal function impairment, re-operation surgery and medical disputes [5,6]. It has been demonstrated that the diagnostic performance of 2019-Bosniak classification criteria do not significantly improve over its previous version [7][8][9]. According to the 2019-Bosniak version, a considerable proportion of previously diagnosed Class III lesions will be reclassified as IIF, resulting in lower sensitivity [10,11]. Bosniak grades I and II are most commonly renal cysts, while grades IIF, III, and IV are more frequently malignant renal neoplasms. The latest study concluded that approximately 10%-20% of Bosniak IIF lesions, 50% of Bosniak III lesions, and 90% of Bosniak IV lesions were malignant renal neoplasms [12]. To improve diagnostic sensitivity and overcome the limitations of biased visual image evaluation, quantitative image analysis techniques, also known as radiomics, combined with machine learning methods have gained popularity in recent years [13,14]. The purpose of this research is to develop and validate a blending ensemble machine learning algorithm for stratifying malignant and benign CRLs with the combination of deep learning and radiomic features.

Enrollment criteria and characteristic distribution
This retrospective analysis was approved by each hospital's ethics committees, and all patient information was anonymized. In the training cohort, required CT scans were obtained from 128-slice spiral CT scanners (Siemens Healthcare, Germany) or 64-slice spiral CT scanners (General Electric, USA). In the testing cohort, required CT scans were obtained from a 128-slice spiral CT scanner (LightSpeed VCT, GE Medical Systems, USA). CT data were generated from the standardized scanning protocols. Details are as follows: CT-tube voltage (120-140 kv), CT-tube current (125-300 mAs), scanning matrix (512*512 pixels), body reconstruction kernel and slice thickness (ranging from 1 to 5 mm). After intravenous administration of iohexol (300 mg/ mL at a rate of 3.0 mL/s, followed by a 30-mL saline flush), contrast-enhanced CT samples were captured. We retrieved CT images from the corresponding picture archiving and communication systems (Vue PACS, Carestream Health Inc & General Electric Advantage Workstation). Candidate participants included those with renal cysts exceeding 1 cm, without surgery history (renal needle biopsy, nephrolithotomy, nephrectomy or partial nephrectomy), without conditions associated with multiple renal cysts like poly-cystic disease, Von Hippel-Lindau syndrome (VHL) or Autosomal dominant polycystic kidney disease (ADPKD), and less than 25% solid portion in CRL. Each participant in this study could only include CRL confirmed by final pathology results, ensuring a realistic and reliable model's presentation. Figure 1 depicts the detailed selection method and pathological results in two cohorts. 103 participants in the development cohort were diagnosed with benign CRL and 56 participants were diagnosed with malignant CRL. In the testing cohort, 10 participants were identified to have malignant CRL and 53 participants were identified to have benign CRL according to the pathological results. Table 1 shows detailed characteristic distributions in the training and testing cohort.

Handcrafted radiomic features extraction
All ROI labeling of CRLs was completed by two senior radiologists using ITK-SNAP software. When labeling the tumor margin, the radiologists will integrate image information in 3 different planes: the axial, sagittal, and coronal planes. In the case of contentious CRL sketching, another senior radiologist will participate in the discussion and help develop the final sketching results together. Radiomic features can be separated into three classes: (1) first-order statistics, (2) shape features, and (3) secondorder features. Image types of radiomic Features can be classified into three categories: (1) Original, (2) Log, and (3) wavelet. Using the default parameters setting provided in the official Pyradiomics yaml file, we extracted 1231 radiomic features from each individual.

Deep learning features extraction
For extracting deep learning features, we defined a 3D-cropbox to contain CRL area. The width and length of 3D-cropbox correspond to the maximum cross-sectional area of the CRL, while the height of 3D-cropbox corresponds to the dimensions containing the CRL region in the Z-axis. In the 3D-cropbox, NumPy array values outside ROI areas will be assigned to 0. Figure 2 displays the detailed 3D-cropbox workflow. The 3D-cropbox region will be transferred into a 3DResnet50 model with pretrained weights. We extracted 2048 deep learning features from each individual by removing the last layer of the pre-trained model, disabling gradient updates and adding a 3D maximum pooling layer. The detailed 3DResnet50 structure is depicted in the Additional file 1: table2.

Radiomic features harmonization
Genomic related research has widely adopted combat methods to deal with the batch effect. CT acquisition and reconstruction parameters have a direct impact on handcrafted radiomic features [15]. However, it is not realistic to standardize platforms and parameters in advance across different institutions. There is mounting evidence that radiomics research requires the same strategy [16,17]. In this study, combat harmonization methods were adopted to address the difference in extracted radiomic features originated from different image acquisition procedures.

Correlation coefficients test
To verify whether the selected features are highly reproducible and reliable, the intra-class correlation CRLs were classified as benign or malignant CRLs based on pathological results. Following that, training cohort were adopted to build machine learning classifier and testing cohort were used to evaluate model performance compared with Bosniak-2019 version coefficients and inter-class correlation coefficients were employed. Results of the inter-class correlation coefficients were originated from two independent readers who re-labeled 25% participants CRLs in the training and testing cohorts. These re-labeled participants are randomly picked by an additional independent radiologist. Results of the intra-class correlation coefficients were estimated by one reader who randomly outlined same participants in the enrolled datasets at different times (1 month interval) [18].

Quality control procedures
The quality control process for fusion features extraction and model construction consists of five steps: (1) Quality control of images; (2) quality control of ROI; (3) quality Table 1 Detailed distribution of Bosniak-2019 classification and pathology results in the training cohort and external validation cohort  (1) Bosniak IV (n = 7) Cystic nephroma (1) Clear cell renal cell carcinoma (5) Multilocular cystic renal neoplasm of low malignant potential (1) Fig. 2 Detailed workflow of the 3D-cropbox. 3D-cropbox consists of four parts: CT HU conversion, ROI area cropping, region background filling, and input size tuning. The area outside the ROI will be filled with black to assure that deep learning features retrieved from the 3D-cropbox are entirely from CRL control of feature extraction; (4) quality control of feature selection; and (5) quality control of machine learning methods. We followed by the advice provided by the Image Biomarker Standardization Initiative (IBSI) [19]. Radiomics quality score (RQS) was adopted to assess the reliability in this research [20]. In the Additional file 1, detailed quality control procedures and RQS calculation results were presented.

Statistical analysis
ITK-SNAP (version 3.6.0) was used to generate ROI. Pyradiomics package (version 3.0.1) was used to extract handcrafted radiomic features. The pretrained weights in 3DResnet50 model are from 23 medical datasets (including brain MR images and lung CT images, etc.  [21,22]. Pearson correlation coefficients for normal distribution and Spearman's rank correlation coefficients for non-normal distribution were utilized to check for redundancy in the primary selected handcrafted radiomics features and deep learning features. Figure 3 depicts the entire procedure for model construction. The Receiver Operating Characteristics (ROC) curve and the accuracy score (ACC) are utilized to evaluate the final model's performances. DeLong test is used to determine whether there was statistically considerable heterogeneity in the area under the receiver operating characteristic curve (AUC). Calibration curve is adopted to evaluate consistency performances of the final model in the external validation dataset. Decision curve analysis (DCA) is adopted to assess the clinical applicability compared with Bosniak-2019 version. The level of statistical significance is determined by two-sided p value of less than 0.05. The scikit-learn package and "Pycaret" package are adopted to create the final machine learning model. All model construction and plot drawing are developed in python environment (3.9 version) and R software (4.0.5 version).  Table 2.

Blending ensemble classifier performances in CRL classification
Meanwhile, the fusion-feature-based machine learning algorithm displays strong calibration performance in Fig. 5.

Clinical impact of blending ensemble classifier compared with Bosniak-2019 classification
The decision curve analysis in external validation datasets for the final model find that, in any threshold probabilities, the fusion-features machine learning model will outperform "none" and "all" treatment strategies and deliver higher net benefit (Fig. 6). Figure 7 exemplifies the performance of the final model compared with Bosniak classification in testing dataset. The final model exceeds the management guideline based on the Bosniak 2019 classification in correctly classifying cystic renal lesions into malignant CRLs and benign CRLs in testing dataset. This suggests that using machine learning algorithm could provide better clinical decision support. Detailed confusion matrix for four models is displayed in the Additional file 1.

Discussion
Although there is a strong association between the updated low-level Bosniak classification (Bosniak I, II) and benign CRL, it has limitations when evaluating the pathological results of Bosniak IIF, III, and IV labeled CRLs, which might result in unnecessary surgical procedures and excessive follow-up costs. Recognizing low malignant risk Bosniak classified high-level CRL can help avoid unnecessary treatment and increasing healthcare expenses [23,24]. According to prior research, the progression of Bosniak IIF cystic renal masses is four years, which indicates a four years follow-up is inevitable [25]. Rapid progression of the high-risk Bosniak CRL necessitates radical nephrectomy rather than ineffective surgical procedures like renal cyst decortication [26,27]. In this retrospective study, we employed a blending ensemble machine learning model to stratify malignant and benign CRLs in cystic renal masses, which outperformed the Bosniak classification system. We employed 3 deep learning features and 16 radiomic features in the final model which are reliable and discriminatory and performed robustly and consistently across internal validation and testing datasets. The reliability of blending ensemble model is determined by the following key elements:  16, which demonstrates that this study's quality is trustworthy and repeatable. The updated 2019 version of the Bosniak classification intends to address inter-reader variability and improve diagnostic performance in predicting malignancy CRL. However, the proposed classification ability has yet to be confirmed [28]. Taking into account the pathologic reference standard, recent research indicates that Bosniak-2019 version IIF CRL have a higher malignancy risk than the previous Bosniak classification. Meanwhile, there are no variations in the proportion of malignancy when compared class III CRL with irregularities to class IV CRL with acute or obtuse nodules [29]. Nevertheless, there are still some disagreements between two welltrained radiologists in CRL Bosniak classification, which required an additional radiologist for help. As opposed to this, the blending decision algorithm performed well and consistently without the need for subjective evaluation across the testing dataset. Previous researches have demonstrated that machine learning approaches can be adopted for CRL malignancy stratification [30,31]. Adopting first-order texture features (Mean, Entropy, Skewness and Kurtosis), Miskin et al. developed a radiomic-based machine learning method to classify cystic renal masses as benign cysts and potentially malignant cysts based on the Bosniak 2019 version reclassification [32]. However, they did not rely on pathology as the diagnostic criteria [33]. Bosniak classification is not as accurate as a pathological criterion and the Class IIF, III and IV CRLs could still be benign neoplasm, which means   [35,36]. In the follow-up study, to minimize the burden of radiologists and expand the applicability of the machine learning model, we will attempt to apply automated segmentation models like 3D-Unet or nn-Unet. Second, although external validation datasets were used in this work, the diagnostic performance of our machine learning model in large samples still has to be confirmed. Third, rather than employing Triple-phase CT scans, we adopted arterial phase CT images to build the machine learning method. Previous study adopted CNN and gated RNN model to distinguish malignant hepatic tumors based on multi-phase Contrast-enhanced computed tomography (CECT). The SpatialExtractor-TemporalEncoder-Integration-Classifier (STIC) successfully extracted the changing pattern across different CECT phases [37]. In Bosniak 2019 version, MRI standard criteria were formally introduced while there are very little researches focus on renal cysts textural features in MRI scans [38,39]. Future researches could attempt to integrate the Triple-phase CT images and MRI images by sequence-to-sequence  [40]. In fact, cystic nephroma is more common in females aged 50-60 years, which indicates that clinical characteristic such as age and gender may be a possible predictor, and a mixture model that combines radiomics data with clinical features like STIC model may boost diagnostic model performance even further.

Conclusion
In conclusion, a blending radiomics machine learning model demonstrated well discrimination capability in stratifying malignant and benign CRLs across testing datasets, which will benefit in diagnosing malignant CRL at an early stage and reducing overdiagnosis and overtreatment in CRL.