Diagnostic accuracy of deep learning in detection and prognostication of renal cell carcinoma: a systematic review and meta-analysis

Introduction The prevalence of renal cell carcinoma (RCC) is increasing among adults. Histopathologic samples obtained after surgical resection or from biopsies of a renal mass require subtype classification for diagnosis, prognosis, and determination of surveillance. Deep learning in artificial intelligence (AI) and pathomics is rapidly advancing, leading to numerous applications such as histopathological diagnosis. In our meta-analysis, we assessed the pooled diagnostic performance of deep neural network (DNN) frameworks in detecting RCC subtypes and predicting survival. Methods A systematic search was done in PubMed, Google Scholar, Embase, and Scopus from inception to November 2023. The random effects model was used to calculate the pooled percentages, mean, and 95% confidence interval. Accuracy was defined as the number of cases correctly identified by AI out of the total number of cases, i.e. (True Positive + True Negative)/(True Positive + True Negative + False Positive + False Negative). The heterogeneity between study-specific estimates was assessed by the I2 statistic. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were used to conduct and report the analysis. Results The search retrieved 347 studies; 13 retrospective studies evaluating 5340 patients were included in the final analysis. The pooled performance of the DNN was as follows: accuracy 92.3% (95% CI: 85.8–95.9; I2 = 98.3%), sensitivity 97.5% (95% CI: 83.2–99.7; I2 = 92%), specificity 89.2% (95% CI: 29.9–99.4; I2 = 99.6%) and area under the curve 0.91 (95% CI: 0.85–0.97.3; I2 = 99.6%). Specifically, their accuracy in RCC subtype detection was 93.5% (95% CI: 88.7–96.3; I2 = 92%), and the accuracy in survival analysis prediction was 81% (95% CI: 67.8–89.6; I2 = 94.4%). Discussion The DNN showed excellent pooled diagnostic accuracy rates to classify RCC into subtypes and grade them for prognostic purposes. Further studies are required to establish generalizability and validate these findings on a larger scale.


Introduction
Renal cell carcinoma (RCC) is the most common primary renal neoplasm, affecting nearly 300,000 individuals worldwide annually, and it is responsible for more than 100,000 deaths each year (1). RCC is a heterogeneous group of cancers with distinctive molecular characteristics, histology, clinical outcomes, and therapy response. RCC arises from the renal parenchyma and, according to the World Health Organization (WHO), has three main subtypes: clear cell RCC (ccRCC), papillary RCC (pRCC), and chromophobe RCC. The remaining subtypes are rare, each occurring with a total incidence of ≤1%. Each type has different histologic features, distinctive genetic and molecular alterations, clinical courses, and responses to therapy (2).
The ccRCC type accounts for 70-90% of cases. It is named for its clear cells, whose appearance results from lipid- and glycogen-rich cytoplasmic content. ccRCC has the worst prognosis among the RCC subtypes, with a 5-year survival rate between 50 and 69%. When metastasis occurs, the 5-year survival decreases further to about 10%. The pRCC type has a spindle-shaped pattern of cells with areas of hemorrhage and cysts. It accounts for about 14-17% of cases, and pathologists further classify it into two subtypes based on the lesion's histological appearance and biological behavior. The subtypes, pRCC type 1 (basophilic) and pRCC type 2 (eosinophilic), differ in their prognostic significance, with type 2 having a poorer prognosis. Chromophobe RCC is common in adults over the age of 60 years. Histologically described as a mass formed of large pale cells with reticulated cytoplasm and perinuclear halos, it carries the best prognosis among the RCC types in the absence of sarcomatoid changes. If sarcomatoid transformation occurs, it tends to be more aggressive with worse survival (3).
Due to its relevance and applicability, the Fuhrman nuclear grading method is commonly used to determine prognostic significance. Using nuclear morphology and characteristics, it designates a prognostic indicator grade (4). The histological classification of RCC is of great importance in patient care, as RCC subtypes have significant implications in the prognosis and treatment of renal tumors. The incidence of RCC has increased, likely due to the increased detection of incidental renal masses on abdominal imaging (5). Around 60% of RCCs are detected incidentally (6). The inspection of complex RCC histologic patterns is time consuming due to tumor heterogeneity. There is also moderate interobserver and intraobserver variability due to the absence of a defined threshold for determining the minimum percentage of an area with high nuclear grade (7).
With the advancement of whole-slide images in digital pathology, automated histopathologic image analysis systems have shown great potential for diagnostic purposes (8)(9)(10). Computerized image analysis has the advantage of providing a more efficient, less subjective, and consistent diagnostic methodology to assist pathologists in their medical decision-making processes. In recent years, significant advancement has been made in understanding and applying deep neural network (DNN) frameworks, especially convolutional neural networks (CNNs), to a wide range of biomedical imaging analysis applications. These CNN-based models can process digitized histopathology images and learn to diagnose cellular patterns associated with tumors (11,12). In our systematic review and meta-analysis, we provide a comprehensive assessment of the existing literature and present the pooled diagnostic performances of DNN frameworks in detecting RCC and predicting outcomes.

Data sources and search strategy
The literature search was conducted from inception through December 2023 in the following electronic databases: PubMed, Embase, Web of Science, Cochrane Library, and Google Scholar. The following terms were used: "Renal Cell Carcinoma" OR "RCC" OR "Kidney Cancer" AND "Histopathology" OR "Histological Analysis" OR "Tissue Histopathology" AND "Deep Neural Network" OR "DNN" OR "Deep Learning." Additional pertinent studies were added by searching the bibliographic section of the articles of interest. The search strategy is shown in the Supplementary data section.

Study selection
The studies retrieved from the search were screened by two authors (D.C. and P.S.). Abstracts of the studies were initially screened, followed by full-text screening to include studies based on prespecified inclusion and exclusion criteria. Any disagreements between authors were resolved through consensus. The Checklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS) was followed (13), and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were used to select the final articles (14). The CHARMS and PRISMA checklists are shown in the Supplementary data section. The study protocol was registered in PROSPERO, a database of systematic reviews, with registration number CRD42024497980.
The inclusion criteria were as follows: (1) studies reporting the histopathological diagnosis of RCC using DNN; (2) studies reporting detection of RCC using DNN models after validation. The exclusion criteria were as follows: (1) studies lacking sufficient data on reported accuracy, sensitivity, specificity, positive predictive value, negative predictive value, or area under the curve of DNN models; (2) review articles, conference abstracts, and case reports; (3) studies conducted on animal models; (4) studies not published in English; (5) studies reporting data on DNN models predicting RCC based on imaging; (6) studies reporting only the mathematical development of DNN models without internal or external validation; and (7) studies that reported RCC detection using methods other than DNN. Ethics approval was not required for our meta-analysis because the data were accessible to the public.

Outcomes assessed
The outcomes assessed were accuracy, sensitivity, specificity, and area under the curve (AUC) of the DNN models in subtype detection of RCC and grading them for prognostication.
We defined true positive (TP) as the number of cases correctly identified as RCC by the models. True negative (TN) was the number of cases correctly identified as non-RCC. False positive (FP) was the number of cases incorrectly identified as RCC, and false negative (FN) was the number of cases incorrectly identified as non-RCC. Accuracy was defined as the ability to detect the presence or absence of RCC and calculated as (TP + TN)/(TP + TN + FP + FN). Sensitivity was the ability to detect RCC cases correctly, calculated as TP/(TP + FN). Specificity was the ability to detect non-RCC cases correctly, calculated as TN/(TN + FP). These definitions were derived from the existing literature (15,16).
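As an illustration of these definitions, the following minimal Python sketch computes the three measures from confusion-matrix counts. The function name and the counts are hypothetical and are not drawn from any included study.

```python
def diagnostic_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity, and specificity from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,    # (TP + TN)/(TP + TN + FP + FN)
        "sensitivity": tp / (tp + fn),    # TP/(TP + FN)
        "specificity": tn / (tn + fp),    # TN/(TN + FP)
    }


# Hypothetical counts, for illustration only.
metrics = diagnostic_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the parentheses matter: without them, TP + TN/TP + TN + FP + FN would be evaluated term by term and would not yield a proportion.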
Outcomes were recorded only if the studies had reported them; they were not calculated from other reported data.

Data extraction
The retrieved articles were checked for duplicates using the EndNote 21 reference manager (17), and duplicates were removed. Data were extracted using the CHARMS spreadsheet (18). All the authors extracted the data. Author information, country, total number of patients, and histopathology slides were extracted. The accuracy, sensitivity, specificity, and AUC of the models on the external dataset were collected. The author, D.C., verified the extracted data.

Statistical analysis
Mean ± standard deviation was used to express continuous variables, and percentages to express categorical variables. The pooled rates, mean estimates, and 95% confidence intervals (CI) were calculated using the random effects DerSimonian-Laird method (19). We used the random effects model because we assumed that the included studies represent a random sample of studies and that their effect sizes vary (20).
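For readers unfamiliar with the DerSimonian-Laird estimator, a minimal Python sketch is given below. It is not the implementation of the software used for this review. It assumes that proportions are pooled on the logit scale, which is an assumption made here for illustration, and the study counts are hypothetical.

```python
import math


def logit_and_variance(events, n):
    """Logit-transformed proportion and its approximate variance (assumed transform)."""
    p = events / n
    return math.log(p / (1 - p)), 1 / events + 1 / (n - events)


def dersimonian_laird(effects, variances):
    """Pooled random-effects estimate, its standard error, and tau^2."""
    w = [1 / v for v in variances]                      # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                       # between-study variance
    w_star = [1 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return pooled, se, tau2


def inv_logit(x):
    return 1 / (1 + math.exp(-x))


# Hypothetical (correctly classified, total) counts, for illustration only.
studies = [(180, 200), (95, 100), (400, 500)]
effects, variances = zip(*(logit_and_variance(e, n) for e, n in studies))
pooled, se, tau2 = dersimonian_laird(effects, variances)
pooled_accuracy = inv_logit(pooled)
ci_low = inv_logit(pooled - 1.96 * se)
ci_high = inv_logit(pooled + 1.96 * se)
print(f"pooled accuracy {pooled_accuracy:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```

The between-study variance tau^2 is added to each study's own variance, so that no single large study dominates the pooled estimate when the studies disagree.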
Heterogeneity was evaluated by two methods. First, we used the Cochran Q statistic, which tests the null hypothesis that the included studies share the same effect size. A p-value of <0.05 was considered significant. We then utilized the I2 statistic to detect and quantify the heterogeneity. Low, moderate, substantial, and considerable heterogeneity correspond to values of <30%, 31 to 60%, 61 to 75%, and >75%, respectively (21).
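The relationship between the two statistics can be sketched in a few lines of Python. This is an illustrative sketch with hypothetical inputs. The category boundaries follow the thresholds above, with values falling exactly on 30, 60, or 75 assigned to the lower category, which is an assumption made here.

```python
def cochran_q_and_i2(effects, variances):
    """Return Cochran's Q, its degrees of freedom, and I2 (as a percentage)."""
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    # The p-value for Q comes from a chi-square distribution with df degrees
    # of freedom, e.g. scipy.stats.chi2.sf(q, df) if SciPy is available.
    return q, df, i2


def heterogeneity_category(i2):
    """Map I2 (percent) to the categories used in this review."""
    if i2 <= 30:
        return "low"
    if i2 <= 60:
        return "moderate"
    if i2 <= 75:
        return "substantial"
    return "considerable"


# Hypothetical effect sizes and variances, for illustration only.
q, df, i2 = cochran_q_and_i2([0.0, 2.0], [0.5, 0.5])
print(q, df, i2, heterogeneity_category(i2))
```

I2 expresses the share of the observed variation that exceeds what sampling error alone would produce, which is why it is reported alongside Q.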
Publication bias was initially evaluated by visually examining the funnel plots and later by Egger's test. A cut-off p-value of <0.05 was considered significant for Egger's test (22). When there was an indication of publication bias, we utilized Duval and Tweedie's 'Trim and Fill' method to examine the difference in the effect size after the imputation of studies using computer software (23). The statistical analyses were conducted using the Comprehensive Meta-Analysis software, version 4 (Biostat, Englewood, NJ, USA) (24).
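Egger's test regresses each study's standardized effect (effect divided by its standard error) on its precision (the reciprocal of the standard error) and examines whether the intercept differs from zero. A minimal Python sketch under these assumptions follows. The inputs are hypothetical, and the p-value, which requires the t distribution with k − 2 degrees of freedom, is left to a statistics package.

```python
import math


def egger_intercept(effects, std_errors):
    """Ordinary least squares intercept, its standard error, t statistic, and df."""
    x = [1 / se for se in std_errors]                      # precision
    y = [e / se for e, se in zip(effects, std_errors)]     # standardized effect
    k = len(x)
    mean_x = sum(x) / k
    mean_y = sum(y) / k
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    df = k - 2
    residual_var = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)) / df
    se_intercept = math.sqrt(residual_var * (1 / k + mean_x ** 2 / sxx))
    return intercept, se_intercept, intercept / se_intercept, df


# Hypothetical effect sizes and standard errors, for illustration only.
effects = [0.2, 0.5, 0.3, 0.8, 0.6]
std_errors = [0.1, 0.2, 0.15, 0.3, 0.25]
intercept, se_intercept, t_stat, df = egger_intercept(effects, std_errors)
print(f"intercept {intercept:.3f}, t = {t_stat:.3f}, df = {df}")
```

With 13 studies, this regression has 11 degrees of freedom.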

Quality assessment and risk of bias
The assessment of each individual study's quality and risk of bias was done using the Prediction model Risk of Bias Assessment Tool (PROBAST). It contains four domains: participants, predictors, outcomes, and analysis to assess the risk of bias and applicability. A total of 20 signaling questions were used to determine if a domain was low or high risk (25). The assessment was done independently by two authors (D.C. and P.S.).

Quality assessment and risk of bias
Most of the studies showed a high risk of bias in the selection of study participants. Figure 4A shows the results of the PROBAST scoring of individual studies. Figures 4B,C show the summary of the risk of bias and applicability across all studies.

Heterogeneity
Both the Q statistic and the I2 statistic were utilized to assess heterogeneity. Upon quantification, we concluded that the degree of heterogeneity was considerable, as the I2 values exceeded 75%.

Sensitivity analysis
Sensitivity analysis was performed by eliminating one study at a time to determine whether the effect sizes changed. We found no significant differences except in the analysis of pooled specificity. This was due to the specificity of 44.9% reported by Wessels et al., which was lower than that of the other studies. The sensitivity analysis of all the outcomes is shown in the Supplementary material.
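The leave-one-out procedure can be sketched as follows. This minimal Python example re-pools the remaining studies with a DerSimonian-Laird estimator each time one study is omitted. The helper names and the effect sizes are hypothetical, and the sketch is not the implementation of the software used for this review.

```python
def pool_random_effects(effects, variances):
    """DerSimonian-Laird pooled estimate (point estimate only)."""
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # With a single remaining study c is 0, so tau^2 is set to 0.
    tau2 = max(0.0, (q - (len(effects) - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]
    return sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)


def leave_one_out(effects, variances):
    """Pooled estimate obtained after omitting each study in turn."""
    results = []
    for i in range(len(effects)):
        kept_effects = effects[:i] + effects[i + 1:]
        kept_variances = variances[:i] + variances[i + 1:]
        results.append(pool_random_effects(kept_effects, kept_variances))
    return results


# Hypothetical effect sizes; the last study is a deliberate outlier.
effects = [2.0, 2.2, 2.1, -0.2]
variances = [0.05, 0.05, 0.05, 0.05]
full = pool_random_effects(effects, variances)
omitted = leave_one_out(effects, variances)
shifts = [abs(estimate - full) for estimate in omitted]
most_influential = shifts.index(max(shifts))
print(f"full estimate {full:.3f}; most influential study index {most_influential}")
```

A study whose removal shifts the pooled estimate far more than any other, as the outlier does here, is the kind of influence this analysis is designed to reveal.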

Publication bias
Analysis of publication bias was done initially by visual inspection, and it showed a potential publication bias due to the presence of asymmetry. Therefore, an Egger's test was performed, and the regression intercept gave a 1-tailed p-value of 0.28, indicating the lack of publication bias. The funnel plot with the observed and imputed studies is shown in Figure 5.

Discussion
Our systematic review and meta-analysis demonstrate that deep learning can be utilized to diagnose renal cell carcinoma, classify subtypes, and grade RCC. Based on our analysis, the DNN models had excellent performance. The pooled accuracy was 92.3%, sensitivity was 97.5%, specificity was 89.2%, and area under the curve was 0.91.
Artificial intelligence (AI) in pathology, or computational pathology, referred to as pathomics, is a rapidly developing field. Whole-slide imaging (WSI) technology has allowed the capture and storage of histopathologic images as high-resolution virtual slides, which are used to train deep learning algorithms (38).
At present, deep learning methods are the most successful among machine learning types in detecting abnormalities in histopathologic images (27). CNNs, by their design, can detect spatial information and compare images (39). These can then be used for deep feature extraction in a weakly supervised or unsupervised learning setting to identify relationships between random variables in a large dataset (36). In a supervised approach, the WSIs carry annotations marking the histopathologic irregularity, which the machine learning model then uses as representative examples to learn from (2). Similarly, the multimodal deep learning model (MMDLM) uses clinical, radiologic, and histopathological data to train its algorithm and a "fusion" approach to reach a conclusion. Schulz et al. used MMDLM to predict the prognosis and survival among patients with ccRCC (32). Big data is essential to develop and train such deep learning algorithms. In the field of renal malignancies, The Cancer Genome Atlas (TCGA) dataset is an excellent resource for genetic, pathologic, molecular, and clinical data that could be used to train and validate these models (1). Various architecture frameworks have been used to construct a CNN model. These networks comprise several interconnected layers composed of several blocks (30). One of the more commonly used architectures is the ResNet (residual network), which allows deeper layers to be created and reduces errors (39). ResNet architecture based CNN has been found to have better performance than the Inception-v3 and VGG-16 (visual geometry group) (29).
Figure: Study selection process according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement.
Typically, in oncology, clinical decision-making involves multiple data points such as biomarkers, gene expression profiling, and radiology imaging. Machine learning algorithms can help in combining various data to improve detection. Approaches that combine eigengene extraction with radiomics, in which CNNs extract genetic and radiologic information to augment prediction accuracy, have shown good outcomes (30). The relationship between copy number alterations (CNAs), a common cause of gene alterations in malignancies, and histopathology can also be elucidated using machine learning. Marostica et al. demonstrated that their model recognized histopathological changes associated with CNAs involving the VHL (von Hippel-Lindau), EGFR (epidermal growth factor receptor), and KRAS (Kirsten rat sarcoma viral oncogene homolog) genes. Their model also distinguished between low and high-risk RCC and predicted overall survival (29).
Another study by Ning et al. used a combination of features extracted from computed tomography (CT) and histopathology added to eigengenes to create a prognostic model for ccRCC (30).
A high percentage of patients with RCC face recurrence after surgical resection, and current predictive models lack the ability to predict recurrence accurately. DNNs can assist in prognostication and determine survival (30,32,40). The model used by Wessels et al. was able to predict the 5-year overall survival (OS) with an AUC of 0.78. The model's accuracy increased when other data points, such as age, tumor size, and metastasis, were added (34). Ohe et al. (31) used their CNN model based on AlexNet to grade ccRCC into clear and eosinophilic types according to the WHO/ISUP system to predict prognosis. When evaluating survival analysis, the concordance index (C-index) is used to determine how well a model ranks patients according to their risk. The studies by Ning, Ohe, and Schulz et al. reported good C-index performance for their models (30)(31)(32). The ability to integrate different datasets and perform large quantities of tasks demonstrates that such models could be utilized in the near future to complete large-scale histopathological tasks without compromising diagnostic accuracy (41).
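For orientation, the C-index can be computed as the proportion of comparable patient pairs in which the patient with the shorter observed survival was assigned the higher predicted risk. The following minimal Python sketch follows Harrell's formulation for right-censored data. It is not taken from any of the included studies, and the follow-up times, event indicators, and risk scores are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index; ties in predicted risk count as half concordant."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had the event
            # and a shorter follow-up time than patient j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical follow-up times (years), event indicators (1 = death), and risks.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
c_perfect = concordance_index(times, events, [4, 3, 2, 1])
c_reversed = concordance_index(times, events, [1, 2, 3, 4])
c_constant = concordance_index(times, events, [1, 1, 1, 1])
print(c_perfect, c_reversed, c_constant)
```

A value of 0.5 corresponds to ranking no better than chance, and a value of 1.0 to perfect ranking of patients by risk.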
Our study has some limitations. First, all the studies were retrospective, and the data depended on the accuracy of the collection process. Second, there is a possibility of selection bias when datasets were accessed to include patients with RCC or a particular subtype of RCC. Third, although most of the models included in the study were CNN-based, differences exist in the structure and construct of these models. Lastly, heterogeneity was noted in our analyses due to these differences in the models. Therefore, caution must be observed while interpreting these results.
To our knowledge, this is the first meta-analysis to assess the performance of machine learning models in the diagnosis, subtyping, and prognostication of RCC using histopathology. Histopathologic classification of renal cell carcinoma into its subtypes and grading is a challenging task. Deep learning can help fill a large void in the early detection of RCC as well as accurate determination of its subtypes. Although it cannot replace the skill and experience of a pathologist or radiologist, it can decrease their workload and improve efficiency.

FIGURE 2
FIGURE 2 Forest plots showing (A) accuracy in renal cell carcinoma subtype detection, (B) accuracy in renal cell carcinoma survival analysis, and (C) overall accuracy of deep neural network in detection and prognostication of renal cell carcinoma.

FIGURE 4
FIGURE 4 Risk of bias assessment of studies by Prediction model Risk of Bias Assessment Tool (PROBAST). (A) Assessment of individual studies, (B) summary of risk of bias assessment for all studies, (C) summary of applicability for all studies.

FIGURE 5
FIGURE 5 Analysis of publication bias by funnel plot showing the effect size of the total number of patients and the total number of histopathology slides/whole slide images. Egger's test for a regression intercept gave a 1-tailed p-value of 0.284, indicating no publication bias. The intercept (B0) is 1.942, 95% confidence interval (−5.326 to 9.211), with t = 0.588, df = 11.

TABLE 1
Summary of the included studies.

TABLE 2
Deep neural network model characteristics.