Performance of deep learning algorithms to distinguish high-grade glioma from low-grade glioma: A systematic review and meta-analysis

Summary This study aims to evaluate deep learning (DL) performance in differentiating low- and high-grade glioma. We searched online databases for studies published from 1 January 2015 to 16 August 2022. A random-effects model was used for synthesis, based on pooled sensitivity (SE), specificity (SP), and area under the curve (AUC). Heterogeneity was estimated using the Higgins inconsistency index (I2). 33 studies were ultimately included in the meta-analysis. The overall pooled SE and SP were 94% and 93%, with an AUC of 0.98. Heterogeneity in this field was substantial. Our evidence-based study shows DL achieves high accuracy in glioma grading. Subgroup analysis reveals several limitations in this field: 1) diagnostic trials lack a standard method for merging multiple datasets for AI; 2) small sample sizes; 3) poor-quality image preprocessing; 4) non-standardized algorithm development; 5) non-standardized data reporting; 6) inconsistent definitions of HGG and LGG; and 7) poor extrapolation.


This study aimed to evaluate deep learning performance in glioma classification
We found DL achieved high accuracy in glioma grading
Heterogeneity was found in the pooled analysis and remained in subgroup analysis
Across subgroups, DL performance revealed major limitations in this field

INTRODUCTION
Glioma originates in the glial cells surrounding and supporting neurons in the brain and is the most common type of malignant brain tumor, representing approximately 80% of all cases. 1 The estimated annual incidence of glioma is about 6 per 100,000 worldwide. 2 Although relatively rare compared to other malignant tumors, glioblastoma, the most common and deadliest form of glioma, carries a remarkably high mortality rate: the median overall survival is only approximately 19 months regardless of care. 3 The World Health Organization (WHO) categorizes glioma into four subtypes, grades I to IV, based on aggressiveness. 4 Clinically, gliomas are typically grouped into low-grade glioma (LGG) and high-grade glioma (HGG).
Accurate categorization of LGG and HGG is indispensable for determining the treatment option and the prognosis of patients. Histopathological characterization following biopsy is the routine procedure to diagnose and grade glioma in clinical practice. However, the procedure is expertise-demanding, labor-intensive, and time-consuming. 5 To address this gap, state-of-the-art medical imaging techniques, especially magnetic resonance imaging (MRI), are widely applied to identify and classify glioma non-invasively, yet inter- and intra-operator variability cannot be fully avoided. The interpretation of medical images also depends heavily on the experience and skills of clinicians.
To overcome the aforementioned drawbacks, deep learning (DL), a subset of artificial intelligence (AI), has shown great promise in the automatic classification of medical images. 6,7 For instance, recent advances in DL algorithms have led the Food and Drug Administration (FDA) to approve several diagnostic tools for clinical practice. 8 In our context, numerous independent studies worldwide have investigated the performance of DL in glioma classification. To date, however, there has been no systematic review and meta-analysis assessing the diagnostic performance of DL algorithms in grading glioma. This evidence-based study is expected to contribute to the further implementation of DL-based models in routine clinical practice.
Studies were excluded for reasons such as no relevant outcome, no target disease, or non-English publication. Finally, we included 49 articles that met our inclusion criteria for the systematic review, of which 33 provided sufficient data for meta-analysis (Figure 1).
Considering the problem of sample reuse, we also used the highest accuracy as the criterion to select a single reported performance from each study. The pooled highest-accuracy SE and SP were 94% (95% CI: 90-96%) and 94% (95% CI: 90-96%) (Figure 3), with an AUC of 0.98 (95% CI: 0.96-0.99) (Figure 4B). To explore the causes of heterogeneity, we applied meta-regression with the following candidate variables: 1) sample size; 2) data sharing; 3) type of internal validation; 4) transfer learning applied; 5) data unit; 6) classification; and 7) type of validation. Among the first 5 variables, data sharing showed no statistical significance (p = 0.39 for SE, p = 0.91 for SP), but the other 4 were significant for at least one of SE or SP, indicating heterogeneity. Both classification and type of validation were statistically significant for SE and SP, also revealing heterogeneity (Table S1).
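The highest-accuracy selection rule described above can be sketched in Python. This is a minimal illustration only; the tuple layout (TP, FP, FN, TN) is our own convention for a 2x2 contingency table, not a format used by the included studies.

```python
def accuracy(table):
    """Overall accuracy of a 2x2 contingency table (TP, FP, FN, TN)."""
    tp, fp, fn, tn = table
    return (tp + tn) / (tp + fp + fn + tn)

def select_highest_accuracy(tables):
    """Keep only the single best-performing table reported by a study,
    so each study contributes one set of samples to the pooled analysis."""
    return max(tables, key=accuracy)

# Example: a study reporting two DL models; only the better one is kept.
tables = [(80, 10, 20, 90), (95, 5, 5, 95)]
best = select_highest_accuracy(tables)
```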

Subgroup analysis
Each variable included in the meta-regression was split into two groups for subgroup analysis.

Publication bias evaluation
In the overall pooled analysis, the p value of Deeks' funnel plot was 0.873; in the highest-accuracy pooled analysis, it was 0.493. Neither analysis indicated publication bias (Figure S15).

Quality assessment
The quality of the 48 included studies was assessed using QUADAS-2; a summary of the risk of bias and applicability concerns is provided in Figure S16, with detailed results in Figure S17. In the patient selection domain of risk of bias, 35 studies were deemed at high or unclear risk due to unreported inclusion and exclusion criteria or an unknown patient enrollment procedure. For the index test, 35 studies were considered at unclear risk because of the lack of a pre-specified threshold. Regarding applicability concerns, 25 studies were considered at high or unclear applicability in the patient selection domain and 11 studies at unclear applicability in the index test domain, while low applicability concerns were observed for all studies in the reference standard domain.

DISCUSSION
Up to now, previous systematic reviews and meta-analyses on AI applied to glioma have focused on the following topics: prediction of the molecular classification of glioma, 57-59 prediction of prognosis, 60 differential diagnosis between glioma and other brain tumors, 61,62 glioma image segmentation, 63 and grading of glioma. 64-66 Regarding glioma grading, two studies pointed out the current obstacles to AI deployment, 64,65 and one conducted a meta-analysis on machine learning (ML) for grading. 66 However, although DL has shown clear advantages in other cancers, such as cervical and breast cancer, 67 no comparable synthesis existed for DL in glioma grading. Moreover, it is notable that glioma grading combined with AI exhibits traits not present in other cancers, such as extensive use of public databases and the importance of classification for prognosis. 22 Tumor grading is critical for glioma progression and prognosis; glioblastoma multiforme (grade IV) has a 5-year survival rate of less than 5%, 68 while the 15-year survival rate for grade II glioma is 86%. 68,69 Compared with traditional diagnostic methods, DL offers advantages such as shorter diagnostic time, labor savings, and the ability to improve cancer screening in low-resource areas. 68 Thus, DL performance in glioma grading deserves considerable attention.
In our study, the SE and SP were 94% (95% CI: 90-96%) and 94% (95% CI: 90-96%), respectively. A pertinent systematic review and meta-analysis focused on ML pooled 5 studies and reported an SE of 96% (95% CI: 93-98%) and an SP of 90% (95% CI: 85-94%). 66 From these results, we cannot establish the superiority of DL over ML. A potential explanation is that DL outperforms ML only when the sample size is large. 70 In our study, however, the median sample size was 130, indicating that most eligible studies relied on small samples; DL performance might thus be hindered by data size limitations. Moreover, DL automatically extracts image features, whereas ML mainly relies on images whose features have been extracted beforehand, usually by clinicians or other experts. 64 This trait makes DL strongly dependent on image quality. In our study, only 16 of 33 studies excluded poor-quality images before processing. However, a DL algorithm developed after excluding poor-quality images will hardly reflect the real clinical setting; DL models should therefore limit the exclusion of images.
It is noteworthy that we assessed DL using two different criteria: one used all available contingency tables; the other used only the contingency table reporting the highest accuracy from each article. The overall pooled results (SE: 94% (95% CI: 91-95%), SP: 93% (95% CI: 91-95%), and AUC: 0.98 (95% CI: 0.96-0.99)) were modestly worse than the highest-accuracy results (SE: 94% (95% CI: 90-96%), SP: 94% (95% CI: 90-96%), and AUC: 0.98 (95% CI: 0.96-0.99)). Besides, confidence intervals narrow as sample size increases, which explains why the CIs of the overall dataset were narrower than those of the highest-accuracy dataset: repeated use of samples in the overall analysis artificially inflated the sample volume of articles reporting multiple algorithms. A single article containing multiple DL algorithms is commonplace in the oncology field, 9,22,71 which calls for future meta-analyses of diagnostic evaluations to adopt methods for merging multiple datasets within each study. Such an approach has already been used in clinical trials but remains absent in diagnostic trials. 72 In the sample size subgroup analysis (≤130 or >130), we did not find the expected result that the larger sample size group performed better than the smaller one. From the forest plot and original data, the >130 group had narrower confidence intervals than the ≤130 group but still incorporated poor results, such as Shen et al. with 296 images (only 60.6% SE). 27 In the data sharing subgroup, much of the open data came from public databases associated with BraTS, the top academic competition, which plays a cardinal role in AI. In these databases, the images were labeled and quality-checked by experienced clinical specialists and had been processed with standardization. In contrast to private data, the images in open data were of higher quality, which led to better results in this subgroup. Here, our results once again emphasize the great importance of data preprocessing.
Besides, some recent efforts devoted to standardizing preprocessing process datasets in the same way as BraTS, so data processed with these tools can be used alongside BraTS data. 76 Even so, in this fast-moving field there are no regulations ensuring uniformity and high quality of preprocessing. Recently, the US Food and Drug Administration (FDA) has approved a series of available AI/ML-based medical devices and algorithms, standardizing the process of AI tool development, which means that algorithm developers go through rigorous evaluation before launching their programs. 8 As for the type of internal validation, k-fold cross-validation outperformed random split-sample validation (SE: 96% vs. 92%; SP: 97% vs. 92%; AUC: 0.99 vs. 0.97). K-fold cross-validation suits small-sample data and allows parameter tuning through multiple rounds of training/testing segmentation within the same database, 77 improving the efficiency of data utilization. In contrast, random split-sample validation performs only a single training/test split, which carries large uncertainty and can hardly achieve true randomization. 63 Since only 27% of the included research used k-fold cross-validation, we call for its wider use in this field in the future.
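The two validation schemes compared above can be sketched in plain Python (index bookkeeping only; model training is omitted). Random split-sample validation produces a single train/test partition, whereas k-fold cross-validation rotates every sample through the test role exactly once:

```python
import random

def split_sample(n, test_frac=0.2, seed=0):
    """Random split-sample validation: one shuffled train/test partition."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * (1 - test_frac))
    return idx[:cut], idx[cut:]

def kfold(n, k, seed=0):
    """K-fold cross-validation: k rotating train/test partitions,
    so every sample is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

With k-fold, pooled metrics are averaged over k test folds rather than estimated from one possibly unlucky split, which is why it tends to be more stable on small samples.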
Another DL-related item was transfer learning, but in this study we could not establish its superiority. Transfer learning enables a previously trained model to be reused in another domain, skipping the effort required to collect training data. 78 One article indicated that, due to differences in demographic characteristics, transfer learning applied to underrepresented patients might negatively influence the integration of AI with oncology. 79 In our study, apart from transfer learning within open data, many studies transferred models from open data to private data despite discrepancies in patient characteristics; one such study 27 reported a poor SE of 60.6%. Therefore, our result does not mean that transfer learning is unsuitable in this domain, since transfer learning across different populations may yield unsatisfactory results because of the populations rather than the AI algorithm itself. We hope that researchers applying transfer learning in the future will take population heterogeneity into consideration.
With respect to using images or patients as the data unit, we concluded that image-based datasets showed better SE (95% vs. 92%) and SP (96% vs. 88%). Notably, whether a study reports image counts or patient counts, the AI process is still image-based; articles reporting only patient counts therefore lack precision. In particular, if an article provides only patient numbers rather than image numbers in its contingency tables, we inevitably underestimate its sample size, since each patient usually generates more than one image. To obtain high-quality results, articles should report both patient and image numbers, and preferably report image-based results in contingency tables.
In the classification of HGG and LGG, 56% of the contingency tables (30/54) included in our study defined grade IV as HGG, such as Luo et al. 31 and Decuyper et al., 32 while the others (24/54) defined grades III and IV as HGG, such as Danilov et al. 11 and Li et al. 16 We found that defining HGG as grade IV alone performed similarly to grade III+IV (SE: 94% vs. 93%; SP: 93% vs. 94%; AUC: 0.98 vs. 0.98). In the WHO glioma classification, diffuse glioma is defined as WHO grade II, anaplastic glioma (or, in the case of 1p/19q non-codeleted tumors, grade III), and glioblastoma as grade IV. 80 In image diagnosis, glioblastoma has the most invasive features and can be distinguished from diffuse astrocytoma and anaplastic astrocytoma. 81 However, distinguishing anaplastic astrocytoma from diffuse astrocytoma on images is another matter: if researchers deem III+IV as HGG, they must distinguish grade III from grade II, which is not easy to achieve. 82 Recent studies indicate that molecular profiling differences between these two grades might be used in classification. Important molecular diagnostic markers, such as isocitrate dehydrogenase (IDH) mutation, 83 1p/19q co-deletion, 84 and O-6-methylguanine-DNA methyltransferase promoter methylation, 85 have been included in the guideline since the 2016 WHO glioma classification. 80 Therefore, future evaluations of DL algorithms for image-based glioma grading should also take molecular diagnostic markers into consideration, especially for distinguishing grade II from grade III glioma.
As for internal versus external validation, the internal subgroup was superior to the external subgroup in our study (SE: 94% vs. 92%; SP: 94% vs. 82%; AUC: 0.98 vs. 0.94). In internal validation, the testing set is separated from the original dataset, whereas external validation uses a completely independent dataset. 86 Although DL performed worse in the external group, we still urge further studies to apply external rather than internal validation. A major limitation of the included studies is that the majority did not implement external validation, making them hard to generalize and reproduce. DL development should consider data extrapolation: a DL algorithm should generalize to real-world usage, performing well not only on online databases but also showing acceptable quality in clinical practice, for example as an aid to hospital clinicians or community healthcare workers. Xian et al. used DL in near-infrared fluorescence imaging to support intraoperative diagnosis, 86 which demands not only accuracy but also speed. Besides, for the application of AI technology in low-resource areas, acceptance by healthcare workers also needs to be considered. 87 Therefore, to facilitate the practical application and promotion of DL in glioma classification, not only algorithm optimization but also diagnostic time, operability, protection of patient information, etc., should be under rigorous design.
To improve DL algorithms for glioma, based on the preceding analysis, we summarize the limitations in this field: 1) diagnostic trials lack a standard method for merging multiple datasets for AI; 2) small sample sizes; 3) poor-quality image preprocessing; 4) non-standardized algorithm development; 5) non-standardized data reporting; 6) inconsistent definitions of HGG and LGG; and 7) poor extrapolation. Besides, we expect diagnostic trials to provide normative guidelines for data fusion, and top institutes to convene specialists from the relevant professions, such as clinicians, AI engineers, and pathologists, to standardize image preprocessing and AI development.
Our study used meta-analysis to integrate articles on the performance of DL algorithms in image-based glioma grading. To the best of our knowledge, this is the first meta-analysis to explore DL performance in this field. When analyzing the full data, we considered both the full use of data and the selection of representative data (the highest accuracy) from each article, which might serve as a reference approach in the absence of a standard for combining multiple datasets in diagnostic tests. We further used meta-regression to explore the sources of heterogeneity, which indicated that sample size, data sharing, type of internal validation, transfer learning application, classification, and type of validation played important roles. In subgroup analysis, we found that DL performance differed across subgroups. In explaining these differences, we identified conditions under which DL performs better. More importantly, based on the dilemmas of DL development evident in our results, we provided suggestions on how DL should be standardized in glioma grading in the future.
However, there are still some limitations in this study. Our results showed high heterogeneity, which was not significantly reduced in subgroup analysis. The items used in subgroup analysis were shown to affect heterogeneity in meta-regression and were considered possible heterogeneity sources in previous studies. Liu et al. conducted a pooled analysis evaluating healthcare workers versus DL, which implied that DL and clinician data should be analyzed separately. 88 Besides, DL-related items also vary widely, such as external versus internal validation, 89 use of open-access datasets or not, 90 application of transfer learning or not, and the validation type. 91 The items included in this study were therefore scientifically grounded and have been shown to contribute to heterogeneity. Moreover, high heterogeneity is common in studies at the convergence of AI and medicine, such as DL studies of breast and cervical cancer, 67 glioma segmentation, 63 and gastrointestinal cancer classification and prognostication. 92 Admittedly, the unreduced heterogeneity might also be explained by other factors, such as prospective versus retrospective design and DL versus clinician diagnosis. 67 Due to the scarcity of articles with prospective designs (2/33) or with a DL-versus-clinician comparison (3/33), we could not perform meta-regression on these factors. Another limitation is that we failed to incorporate molecular information in glioma classification, which is becoming increasingly important, since it marks a more refined classification of patients and is critical for treatment choice and prognosis. 64 In addition, the QUADAS-2 assessment was not tailored to AI-based studies, which entailed risk-of-bias and applicability concerns.
In conclusion, although the SE, SP, and AUC of DL algorithms are high in glioma grading, we still could not prove the superiority of DL over ML. In the whole-dataset pooled analysis, we considered both the full use of data and the selection of representative data (the highest accuracy) for each article. Our study suggested that the results were highly heterogeneous, with sample size, data sharing, type of internal validation, transfer learning application, classification, and type of validation as possible reasons. In subgroup analysis, the larger sample size group did not perform better than the smaller one. DL on open data appeared superior to private data. As for the type of internal validation, k-fold cross-validation outperformed random split-sample validation. For transfer learning, we could not establish its superiority compared with non-use. Image-based datasets showed better results than patient-based ones. In the classification of HGG and LGG, defining HGG as grade IV alone outperformed grade III+IV. As for internal versus external validation, the internal subgroup was superior to the external. From the overall results of our study, we strongly recommend that future research: 1) use open databases; 2) obtain approval from the FDA or other authoritative institutions before disclosure; 3) embrace big data; 4) use k-fold cross-validation; 5) consider the consistency of characteristics between the two study populations when using transfer learning; 6) report the number of images in contingency tables; 7) include molecular typing results to assist diagnosis if grade III is incorporated into HGG; 8) perform external validation; and 9) incorporate AI-based quality-of-reporting tools (such as QUADAS-AI, PROBAST-AI, or TRIPOD-AI). Moreover, we cannot overemphasize the normalization of image extraction, preprocessing, and algorithm development in this field.
However, since heterogeneity remained within subgroups, these recommendations should be interpreted cautiously.

ACKNOWLEDGMENTS
We appreciate Professor Yu Jiang for teaching meta-analysis courses. We are also grateful to the researchers of the included articles for offering detailed diagnostic test results. We would like to thank the Peking Union Medical College Education Foundation (No. B0202023F-11) for funding.

AUTHOR CONTRIBUTIONS
W.Y.S. carried out data analysis and article writing. C.S. was responsible for literature retrieval. C.T. and C.H.P. took part in figure visualization and table compilation. J.H.F. and P.X. conceived this research and supervised the work. Y.L.Q. reviewed the article and provided suggestions for revision.
Two independent researchers (W.S. and C.S.) reviewed the full-text articles (and supplemental information if available) and extracted study characteristics (patient information, imaging modality, DL algorithms, etc.) and the diagnostic performance of DL (true positives, false positives, true negatives, and false negatives) into a predetermined data extraction form. Conflicts were resolved through team discussion and consensus. All classifications other than HGG and LGG were converted into an exclusive binary classification of HGG and LGG to generate contingency tables for meta-analysis. The extracted data were used to calculate the pooled sensitivity, specificity, and area under the curve (AUC).
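From each extracted contingency table, sensitivity and specificity follow directly. A minimal Python sketch, with HGG treated as the positive class as in our analysis:

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Sensitivity and specificity from a 2x2 contingency table,
    with HGG as the positive class (TP = correctly graded HGG,
    TN = correctly graded LGG)."""
    se = tp / (tp + fn)  # proportion of HGG cases correctly identified
    sp = tn / (tn + fp)  # proportion of LGG cases correctly identified
    return se, sp

# Example table: 94/100 HGG and 93/100 LGG correctly graded.
se, sp = sensitivity_specificity(94, 7, 6, 93)
```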

QUANTIFICATION AND STATISTICAL ANALYSIS
To assess the performance of DL algorithms in differentiating HGG from LGG, true positive (TP) was defined as HGG and true negative (TN) as LGG. Included studies with inconsistent definitions were redefined for our calculation. A hierarchical summary receiver operating characteristic (HSROC) curve with 95% confidence intervals (CI) and 95% prediction regions was employed to assess the overall performance of DL algorithms, along with diagnostic parameters including pooled AUC, sensitivity, and specificity. 93-95 Given the inherent differences among the included studies, a bivariate random-effects model was implemented. Heterogeneity was estimated using the Higgins inconsistency index (I2), with 50% defined as moderate and above 75% as high. Important variables affecting heterogeneity were assessed using meta-regression. The variables finally included in the meta-regression analysis were: 1) sample size (≤130/>130; 130 is the median sample size); 2) data sharing (open data/private data); 3) type of internal validation (random split-sample validation/k-fold cross-validation); 4) transfer learning (applied/not applied); 5) data unit (image/patient); 6) classification (grade IV represented HGG/grades III+IV represented HGG, according to the WHO classification standard and the actual classification in the articles); and 7) type of validation (internal/external). The first 5 variables did not vary across multiple DL models within a single study, so meta-regression for these factors was based on the highest-accuracy pooled analysis; the remaining 2 variables could differ within a study and therefore used the overall pooled analysis. Further subgroup analysis was performed on variables with a statistically significant contribution to heterogeneity. Meta-analysis was conducted only when the number of studies was three or more.
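As a minimal illustration of random-effects synthesis and the I2 heterogeneity statistic described above, the following sketch applies the DerSimonian-Laird estimator to generic per-study effects and variances. This is a simplification for exposition only; the actual analysis fitted a bivariate mixed-effects (HSROC) model via the STATA MIDAS module, which jointly models sensitivity and specificity.

```python
def random_effects_pool(effects, variances):
    """DerSimonian-Laird random-effects pooling with Cochran's Q and I^2.
    effects/variances are per-study point estimates and their variances."""
    w = [1.0 / v for v in variances]           # inverse-variance weights
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0  # Higgins I^2 (%)
    # Between-study variance tau^2 (DerSimonian-Laird estimator)
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Re-weight with tau^2 added to each study's variance
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * e for wi, e in zip(w_star, effects)) / sum(w_star)
    return pooled, i2
```

For identical study effects, I2 is 0; for two sharply conflicting studies of equal precision, I2 approaches 100% and the random-effects pooled estimate sits between them with a widened variance.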
Data analysis was conducted with STATA software (version 15.1) using the MIDAS module. A p value less than 0.05 was considered statistically significant. The original data and code were deposited at Science Data Bank and are publicly available (key resources table).

Quality assessment
The risk of bias and applicability concerns of the included studies were assessed by two researchers (W.S. and C.S.) using the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool, 96 which allows a more transparent rating of bias and applicability in diagnostic accuracy studies. The QUADAS-2 tool consists of four key domains: patient selection, index test, reference standard, and flow and timing. Publication bias was assessed with a funnel plot.

ADDITIONAL RESOURCES
The study was registered with the open-access PROSPERO International prospective register of systematic reviews (CRD42022360385). The study was performed strictly following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 statement. 97 Neither ethical approval nor informed consent was applicable since this study was a secondary analysis of publicly available data.