Esophageal cancer is an aggressive cancer with a mean estimated 5-year survival rate of 35–45%, even after treatment with curative intent.1, 2 The reported survival rate in advanced-stage disease drops further to 5–10% and can be attributed to the malignancy’s insidious onset and aggressive tumor biology that often favors recurrence.3,4,5 Similarly, gastric cancer has a poor 5-year survival rate and is still the third leading cause of malignancy-related death worldwide.6 A number of investigations, such as computed tomography (CT) scans, positron emission tomography (PET) scans, endoscopic ultrasound (EUS), and endobronchial ultrasound (EBUS), are utilized in the diagnostic and staging pathway of esophagogastric (EG) malignancy, with CT being the most commonly used of those that are noted.7 Unlike colorectal, hepatocellular, and pancreatic cancers, there is no reliable biomarker that can be tested and tracked non-invasively for diagnostic or surveillance purposes in esophageal and gastric cancers.8,9,10 Consequently, patients are often reliant on radiological investigations for diagnosis with staging, detection of recurrence, and monitoring response to treatment.7 These workflows necessitate both timely and expert radiological interpretation, a requirement that is often difficult to achieve given busy clinical work schedules and a lack of expertise outside tertiary oncological centers. As such, there has been increasing calls to explore the use of AI-centred diagnostic systems to alleviate this issue.

In the context of medical diagnostics, AI is the use of a system to mimic human cognition in the comprehension, analysis, and presentation of medical data.11,12,13 This is often achieved using machine learning (ML), which is a specialized sub-field within AI that improves the performance of systems through repetitive experience. For example, in EG cancers, ML has been used extensively by AI systems to understand endoscopy images and enhance the interpretation of solely operator-dependent endoscopy.14,15,16 Naturally, the next step will be the integration of AI into the major imaging modalities used in the management of EG cancers, specifically CT scans. Typically, this involves the high-throughput extraction of large quantities of data from the images and is a technique termed as radiomics. Radiomics is an emerging field using a non-invasive approach to extract numerous quantitative features from medical images, especially parameters not visible to the naked human eye or quantifiable by routine analysis.17, 18 Specifically, with CT scans, radiomics offers the unique advantage of combining ML to acquire images; segment images into regions of interest (ROIs) or volumes of interest (VOIs); extraction of quantitative imaging features from ROIs and VOIs; and, lastly, constructing and validating models. Recently, there has been an increase in work reporting on the combined or individual use of AI or radiomics to diagnose or monitor EG cancers. This review aims to summarize the potential applicability of AI diagnostic systems in the diagnosis and surveillance of esophageal and gastric cancers.

Methods

Literature search methods, inclusion and exclusion criteria, outcome measures, and statistical analysis were defined according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.19 Patients were not involved in the conception, design, analysis, drafting, interpretation, or revision of this research, hence ethical approval was not required and was thus not sought for this study.

Literature Search

The following databases were searched: MEDLINE (from 1946 until the first week of April 2021) via OvidSP; MEDLINE In-Process and other non-indexed citations (latest issue) via OvidSP; Ovid EMBASE (from 1974 to the latest issue); and Scopus (from 1996 until the present). The last search was performed on 15 April 2021. Search terms used several strings that were linked by standard modifiers in the following order: ‘machine learning’, ‘artificial intelligence’, ‘radiomic’, ‘AI’ OR ‘ML’, as well as ‘esophageal cancer’, ‘esophageal squamous cell cancer’, ‘esophageal adenocarcinoma’, ‘ESCC’, ‘EAC’, ‘esophageal malignancy’, ‘upper gastrointestinal cancer’, OR ‘upper GI cancer’. Additionally, the references of included articles were hand-searched to identify any additional studies.

Selection and Quality Assessment of Studies

Articles were screened for eligibility by SC and VS, and, where conflict arose, a third co-author (SRM) was consulted. Studies were included if they had incorporated the use of AI-centred systems in CT imaging for evaluating both esophageal and gastric cancers. Studies with diagnostic, prognostic, and monitoring intents were included. Studies were excluded if they did not evaluate ML, used imaging modalities other than CT, did not include patients with esophageal or gastric cancers, had incomplete data on outcome measures, were not written in the English language, had sample sizes fewer than 30 patients, or had incompatible designs, including letters, comments and reviews. Studies were assessed for robustness of methodology using the Quality Assessment Tool for Diagnostic Accuracy Studies 2 (QUADAS-2), which comprises four domains covering patient selection, index test, reference standard, and flow of patients through the study and timing of the index test(s) and reference standard. Each domain is evaluated in terms of the risk of bias, and the first three domains are also assessed for any concerns regarding applicability. In doing so, this highlights aspects of the study design that may be exposed to bias.

Statistical Analysis

All statistical analyses were performed using STATA/SE version 16.0 (StataCorp LLC, College Station, TX, USA). The overall pooled estimate of sensitivity and specificity, with their corresponding 95% confidence intervals (CIs), was calculated using the random-effects model with the metandi command in STATA/SE. Sensitivity was defined as the proportion of patients with esophageal cancer who were correctly confirmed by AI, while specificity was defined as correctly identifying patients without the disease. Forest plots were used to visualize the variation of the diagnostic parameter effect size estimates with 95% CI and weights from the included studies.

Results

Study Selection

The database search yielded a total of 1439 studies, of which 137 duplicates were removed. Titles and abstracts of the remaining 1302 studies were screened for eligibility and 648 studies were removed. A further 617 studies were excluded after full-text review due to incompatible outcome measures, study design, or small sample sizes of fewer than 30 patients (Fig. 1). Thirty-seven studies that described the use of ML (a branch of AI) platforms for the diagnosis and surveillance of esophageal and gastric cancers were included in this study (Table 1).

Fig. 1
figure 1

PRISMA diagram showing the sequence of the study screening and selection process. PRISMA Preferred Reporting Items for Systematic Reviews and Meta-Analyses

Table 1 Characteristics of the included studies

Quality Appraisal

Assessment of studies using the QUADAS-2 tool showed a low level of bias among the studies (Table 2). The risk of bias and concerns on their applicability was low across most domains. Some risk of bias was present due to the heterogeneity of the patients included; however, in most studies, there was little reporting of the sensitivity and specificity of the ML algorithms used.

Table 2 Characteristics of image acquisition and processing using AI and/or radiomic approaches
Table 3 QUADAS assessment of studies included for risk of bias and applicability

Use of Machine Learning and Radiomics in the Management of Esophageal Cancer

Takeuchi et al. reported diagnostic accuracy of 84% (sensitivity 71.7%; specificity 90.0%) in detecting stage T1–T5 esophageal cancer in 46 patients.20 One study looked at the prognosis of patients with esophageal cancer, in which Foley et al. reported six variables to be predictive of overall survival in their work of 405 patients.21 Two studies evaluated the use of ML to assess response to chemoradiotherapy for esophageal cancers. The model developed by Wang et al. to evaluate the scan of 131 patients who underwent neoadjuvant chemotherapy diagnosed lymph node metastasis better than the preoperative short axis size of the largest lymph node on CT, with an area under the curve (AUC) of 0.887.22 Jin et al. combined a radiomics and dosimetric approach and reported an AUC of 0.708 in predicting the treatment response of patients with esophageal cancer who underwent chemoradiotherapy.23

Use of Machine Learning and Radiomics in the Management of Gastric Cancer

Two studies investigated the use of radiomics in diagnosing gastric cancer, specifically in differentiating gastric cancer from other gastric lesions.24, 25 In their study evaluating VOI-based textural features on preoperative arterial phase and portal phase scans of 95 patients, Ba-Ssalamach et al. differentiated gastric adenocarcinoma with an error rate as low as 3.1%.24 Two studies reported that there was little correlation between radiomic features and histological grades, with AUCs below 0.7,9, 10 while five studies evaluated images for lymph node status, vascular invasion, and occult peritoneal metastasis, with AUCs as high as 0.941.11,12,13,14,15 Of the included studies, two studies evaluated the use of AI for prognosis after surgical resection for gastric cancers. Li et al. extracted 273 features from each ROI and 485 features from each VOI, and used the least absolute shrinkage and selection operator (LASSO) method to predict overall survival, although the results were not promising in their test set.26 In contrast, Giganti et al. extracted 107 features from each VOI that were significantly associated with a negative overall survival in patients with resectable gastric cancer.27 Four studies also investigated the use of AI for predicting response to neoadjuvant chemotherapy. Giganti et al. determined 14 features in pretreatment arterial phase images that were significantly different between responders and non-responders, while another study by Li et al. showed similar results with portal venous phase images.28, 29 In their multicenter study, Jiang et al. identified potential predictors from portal venous phase scans of 1591 patients that were significantly different between responders and non-responders to neoadjuvant chemotherapy and predictive of disease-free survival.30, 31 Two studies evaluated the response to targeted immunotherapy with trastuzumab or radiotherapy. Hou et al. showed that radiomic signatures can predict response to radiotherapy with an AUC of 0.749, while Yoon et al. reported AUCs of 0.75–0.77 in their small pilot study of 26 cases of HER2-positive gastric cancer treated with trastuzumab.32, 33

Artificial Intelligence as a Diagnostic and Monitoring Tool: Quantitative Analysis

Six studies involving 1352 patients provided sufficient data of true positive, true negative, false positive, and false negative rates for the calculation of sensitivity and specificity. Of these studies, four studies assessed its utility in gastric cancer diagnosis, one study assessed its utility for diagnosing esophageal cancer, and one study assessed its utility for surveillance (Table 1). The pooled sensitivity and specificity were 73.4% (64.6–80.7) and 89.7% (82.7–94.1), respectively, as visualized on the forest plot and summary receiver operating characteristic curve (Figs. 2 and 3).

Fig. 2
figure 2

Forest plot of diagnostic accuracy for machine learning platforms. TP true positive, FP false positive, FN false negative, TN true negative, CI confidence interval

Fig. 3
figure 3

Summary receiver operating characteristic curve for diagnostic accuracy for machine learning platforms

Discussion

Our systematic review shows that the application of radiomics and AI for the diagnosis and surveillance of upper gastrointestinal tract malignancies is promising, despite being in its nascency. The included radiological studies show that AI can be potentially used to diagnose cancers, differentiate malignancies from benign lesions, and detect occult disease. AI systems may also be used for staging disease, determining if surgery will improve survival outcomes in patients with resectable disease, and in predicting whether patients will respond to adjuvant or neoadjuvant chemoradiotherapy. Our paper also highlights the different AI platforms available for these purposes and captures their breadth.

The typical patient undergoes several CT scans during their journey, with diagnosis as the primary aim. Combining radiomics and AI to current scans will enable clinicians to simultaneously predict how they will respond to treatment and also assess how they have responded to treatment. In other cancers, radiomic data have provided support to genomic data in generating a prognostic signature that exceeds the accuracy of traditional TNM staging.34 Given that there is a direct correlation between histopathological response of patients who underwent chemoradiotherapy and the overall survival rate, the ability to assess clinical response will be useful in adjusting the dose and regimes of chemoradiotherapy.35, 36 Our paper has included at least one study using radiomics or AI to assess the response to surgery, chemotherapy, radiotherapy and immunotherapy, and all report high performance; however, there is still scarce evidence to add support to existing studies described here.

AI can also help in overcoming any technical limitations faced by traditional imaging. For example, Jin et al. combined radiomic and dosimetric analyses to overcome the artefacts in wall thickness created by the regular peristaltic waves of contraction.23 In another study, Ding et al. showed that their models detected occult peritoneal metastasis more accurately than conventional CT scans. Previous studies including the Worldwide Esophageal Cancer Collaboration have reported that survival decreases with the presence of lymph node metastases, and imaging examinations are often the first-line investigations for assessing most lymph node statuses in esophageal cancer.37,38,39 However, the accuracy of CT in diagnosing the N stage of esophageal cancer was just 59%.40 Most clinicians use a size criterion of 1 cm to differentiate between benign and malignant enlargement of lymph nodes but this only has a sensitivity of 30–60% and a somewhat higher specificity of 60–80%.41,42,43 In their study, Wang et al. showed that support vector machine (SVM) models have better diagnostic capability for lymph node metastasis than the traditional LN size criteria.22 Furthermore, Bollschweiler et al. used a different ML methodology, termed artificial neural network (ANN), and reported a diagnostic accuracy of 79% in predicting LN metastasis in esophageal cancer.44

Strength and Limitations

The strength of our systematic review lies in its up-to-date unified analysis of esophageal and gastric cancers in different countries. We also identified challenges that will need to be overcome for the technology to be implemented into daily clinical practice. Our study has several weaknesses. First, most of the articles included in the study did not report the specificity or sensitivity of their AI technologies, which prevented a more comprehensive quantitative analysis to achieve a pooled statistic for the diagnostic accuracy of AI. This also prevented the stratification of pooled data based on study intent (diagnostic vs. prognostic). Furthermore, the diagnostic or predictive accuracy of AI depends on several parameters, including the specific AI program or model developed, scanning equipment, image preprocessing, acquisition protocols, and image reconstruction algorithms.

Although there is heterogeneity between the studies, most of the work is limited to a few specific groups that have taken an interest in this field. The majority of the studies are based in Asia, and several of the included papers stem from the work of the same group. Hence, within the same group, the data acquisition and processing techniques are identical but the aims of the study were different and hence merited inclusion. For example, in the studies by Jiang et al., the first study evaluated the use of radiomics and AI in characterizing the tumor microenvironment, while the other study focused on identifying occult metastasis.30, 31 Another example are the smaller studies by Giganti et al., each of which separately investigate the response to curative resection and chemotherapy.27, 28 Together, these studies shed light on a different aspect of the tumor biology of gastric cancers. In the same vein, we also included some studies with a sample size that was <100. Although small sample sizes lend to a greater degree of variation on the quantitative analysis, these studies were relevant in studying a niche area of treatment response. Larger studies have previously tended to focus on the diagnostic aspects, while other facets such as monitoring for recurrence, response to curative resection and chemotherapy, and tumor heterogeneity are areas that are still in their infancy and hence studied at a smaller level. Furthermore, this emphasizes the paucity of studies of large sample sizes and hints at areas that need further work within the field of AI in esophagogastric cancers.

Future Directions

Future work should be aimed at the ‘in silico’ bench to bedside translation of these technologies. Although we highlight much promise in these technologies, several factors require evaluation prior to these technologies being employed in routine upper gastrointestinal oncological care:

  1. (1)

    Use case: There needs to be early clarification in the lifecycle of these AI devices as to (1) their specific clinical task; (2) potential risk and benefits; (3) whether they are used within either new or existing clinical workflows; and (4) whether they are used independently to diagnose disease/recurrence or as a ‘second reader’ alongside a human clinician. Downstream validation of these systems is dictated by many of these early decisions.

  2. (2)

    Model development: The development of these systems are reliant on diverse, large-scale, and well-maintained datasets that are accurately labeled for the purposes of model training and internal validation. Systems created upon small single-center datasets with post hoc labeling rarely perform well when subjected to out-of-set testing.

  3. (3)

    Validation: Independent validation of AI systems is crucial, with comparison against expert clinicians to demonstrate either non-inferiority or superiority in diagnostic performance to be undertaken when feasible. Such evaluations require careful study planning, with the need for diverse demographic representation in test datasets in order to assess for bias.

  4. (4)

    Infrastructural requirements: Aside from developer considerations, the bottleneck for many contemporary AI products is the end-user adoption and experience. There needs to be careful consideration of the IT infrastructural requirements at hospitals in which these technologies may be reasonably deployed.

  5. (5)

    Cost effectiveness: Lastly, although it is assumed that the introduction of AI systems will lead to cost saving across health systems, this requires formal quantification. If deemed not to be financially beneficial, it may be more cost effective to hire diagnostic clinicians, which is the focus on current large-scale studies.

Furthermore, the power of these models is dependent on a large and diverse diet of datasets. At present, the retrospective single-center work available is insufficient and is limited in size, scope and variety. Given that the largest advances in esophagogastric surgery have occurred based on large prospective studies, the advent of ML only calls for further collaborative efforts at an international level to fully reap the potential of this technology.

Conclusion

AI and radiomics have a huge potential for diagnostic and surveillance of esophageal and gastric cancers. There is currently a paucity of large-scale studies evaluating the usefulness of AI and radiomics in esophageal cancer and the evidence is limited to retrospective studies of small sample sizes. Further progression of its clinical application will require collaborative efforts to generate a large and diverse dataset that can produce an accurate model. This relies on determining the best and most feasible methodology for ML and standardizing this across centers. Hence, further work should focus on these areas.