Artificial intelligence-aided ultrasound imaging in hepatopancreatobiliary surgery: where are we now?

Background Artificial intelligence (AI) models have been applied across various medical imaging modalities and surgical disciplines; however, the current status and progress of ultrasound-based AI models within hepatopancreatobiliary surgery have not been evaluated in the literature. Therefore, this review aimed to provide an overview of ultrasound-based AI models used in hepatopancreatobiliary surgery, evaluating current advancements, validation, and predictive accuracies. Method The databases PubMed, EMBASE, Cochrane, and Web of Science were searched for studies applying AI models to ultrasound imaging in patients undergoing hepatopancreatobiliary surgery. To be eligible for inclusion, studies needed to apply AI methods to ultrasound imaging for patients undergoing hepatopancreatobiliary surgery. The PROBAST risk of bias tool was used to evaluate the methodological quality of the AI methods. Results Within hepatopancreatobiliary surgery, AI models have primarily been used to predict tumor recurrence, differentiate between tumoral tissues, and identify lesions during ultrasound imaging. Most studies combined radiomics with convolutional neural networks, with AUCs up to 0.98. Conclusion Ultrasound-based AI models have demonstrated promising accuracies in predicting early tumoral recurrence and even differentiating between tumoral tissue types during and after hepatopancreatobiliary surgery. However, prospective studies are required to evaluate whether these results will remain consistent and externally valid. Supplementary Information The online version contains supplementary material available at 10.1007/s00464-024-11130-0.

Ultrasound imaging has been key in detecting gastrointestinal pathologies for many years. In particular, many cases of cholecystitis, appendicitis, pancreatitis, and gallstones have been successfully diagnosed with the use of ultrasound [1]. Compared to other imaging modalities, ultrasound involves no radiation, does not require patient preparation, and can be performed quickly. In addition, ultrasound can be used not only to provide diagnostic information but also to serve as a screening instrument [2]. The introduction of endoscopic ultrasound has enabled even more treatment opportunities within gastrointestinal surgery, such as applying angiotherapy and collecting abdominal fluids [3].
However, challenges of ultrasound imaging remain, such as the inability to detect small tumors or fistulas within the gastrointestinal tract. Visualization of anatomical structures such as the bile or pancreatic ducts is also difficult with ultrasound alone, as such structures are overshadowed by luminal gas bubbles or liquids in the gastrointestinal tract [4]. Furthermore, the accuracy of ultrasound imaging is highly dependent on the clinician's experience and the progression of the pathology [5]. There are many cases in which misdiagnosis has occurred [6]; consequently, the diagnostic accuracy of ultrasound imaging has been reported to fluctuate between 50 and 90% [7].
These challenges could be overcome by applying AI models to ultrasound imaging. In a typical medical image-based AI model, ultrasound images are exported and converted into specific formats such as JPG or DICOM (Digital Imaging and Communications in Medicine). Using software tools, a region of interest (ROI) can be drawn to segment the relevant anatomical structure manually or automatically, depending on the tool [8]. Within this ROI, tumoral features are extracted using radiomics. To build the AI model, machine learning or deep learning methods are selected that combine the extracted features and clinical variables of patients to make predictions. By training on large datasets and recognizing patterns within the data, AI models can efficiently increase the accuracy of the prediction model [9]. Applying this ability to ultrasound imaging could optimize the diagnostic accuracy of ultrasonography. Along with the benefits of being easily applicable and quickly accessible, combining AI models and ultrasound might be very promising for detecting and predicting gastrointestinal diseases that require surgery.
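As an illustration of the feature-extraction step described above, the following sketch computes a few first-order, radiomics-style statistics (mean, variance, histogram entropy) from a toy ROI. The pixel values and bin count are hypothetical, and real pipelines rely on dedicated packages such as PyRadiomics rather than hand-rolled code.

```python
import math

def first_order_features(roi):
    """Compute simple first-order, radiomics-style features from a 2-D ROI
    given as a list of pixel-intensity rows. Illustrative only."""
    pixels = [p for row in roi for p in row]
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((p - mean) ** 2 for p in pixels) / n
    # Shannon entropy over a coarse 8-bin intensity histogram
    bins = [0] * 8
    lo, hi = min(pixels), max(pixels)
    width = (hi - lo) / 8 or 1
    for p in pixels:
        bins[min(int((p - lo) / width), 7)] += 1
    entropy = -sum((c / n) * math.log2(c / n) for c in bins if c)
    return {"mean": mean, "variance": variance, "entropy": entropy}

# A 4x4 "ROI" cropped from a hypothetical grayscale ultrasound frame
roi = [[10, 12, 11, 13],
       [50, 52, 51, 53],
       [10, 11, 12, 13],
       [90, 91, 92, 93]]
features = first_order_features(roi)
```

The resulting feature dictionary would, in a full pipeline, be concatenated with clinical variables and passed to the downstream machine learning model.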
Although AI models have been frequently applied within hepatopancreatobiliary surgery, the current status and progress of ultrasound-based AI models have not been evaluated in recent literature. Therefore, this review aims to evaluate the development, validation, and predictive performance of ultrasound-based AI models that have been used in hepatopancreatobiliary surgery.

Materials and methods
Literature was retrieved and systematically reviewed in conformity with the Cochrane Handbook for Systematic Reviews of Interventions version 6.0 and the PRISMA guidelines. This review was registered in the PROSPERO database (CRD42024525032).

Literature search
A systematic search was conducted in the following databases: PubMed, Embase.com, Clarivate Analytics/Web of Science Core Collection, and the Wiley/Cochrane Library. The search period ran from inception of each database to the 15th of October 2023. The literature search was performed by G.L.B. and M.B. The search included keywords and free-text terms for (synonyms of) 'artificial intelligence' along with (synonyms of) 'digestive system surgical procedures' and 'ultrasound.' Using the PRESS checklist [10], this search strategy was peer-reviewed by an information specialist (G.L.B.). A complete overview of search terms per database can be found in the supplementary information (see Online Appendix 1). No date limitations were applied within this search. Studies reporting on conference proceedings, book chapters, editorials, notes, errata, letters, tombstones, or surveys were excluded from the search.

Eligibility criteria
Studies were only considered eligible if they met the following criteria: (I) described artificial intelligence methods, (II) involved patients undergoing any type of hepatopancreatobiliary surgery, (III) involved the use of any type of ultrasound imaging (preoperative or intraoperative), and (IV) were clinical studies. As this review only targets new machine learning and deep learning techniques, studies that included only statistical models without the use of artificial intelligence were excluded. Studies were also excluded if they: (I) were not written in English, (II) reported on reviews, letters, editorials, or study abstracts, or (III) included children as patients. No particular study setting or design was favored in the inclusion criteria.

Methodology assessment
The PROBAST risk of bias tool was independently applied by two assessors (M.B. & C.M.C.) to evaluate the methodological quality of the included AI methods [11]. This tool evaluates the overall risk of bias based on four domains: participant selection, predictors, outcomes, and analysis.

Data extraction
Data collection was performed according to the Cochrane guidance for data collection and the CHARMS checklist [12]. The following data items were extracted from each study: first author, publication year, country of research, number of patients, mean age, benefited surgical procedures, purpose of study, ultrasound modality, data type, region of interest in the data set, segmentation tool, annotation, AI subfield, internal validation, external validation, study outcomes, and predictive performance (discrimination and calibration). All data items were independently extracted by two assessors (M.B. & C.M.C.). Conflicts were resolved by consensus between the two assessors.

Data synthesis
Descriptive summaries were used to illustrate the applied AI subfields, benefited surgical outcomes, risk of bias assessment, model development, model performance, and validation. The discriminative abilities of AI models were presented as the mean accuracy (ACC) or area under the curve (AUC).

Results
The results of the search and screening process are presented in the PRISMA flowchart (Fig. 1). Of the 348 identified records, 89 were removed as duplicates. Subsequently, 259 records were screened on title and abstract using the Rayyan platform [13], where 229 records were excluded. After full-text screening of the remaining 30 records against the eligibility criteria, 11 were included in the review.

Surgical outcomes
The three main areas of hepatopancreatobiliary surgery benefiting from ultrasound-based AI methods were liver cancer (n = 6), pancreatic cancer (n = 4), and biliary cancer (n = 1). Additionally, three major categories of surgical outcomes were reported: prediction of prognosis; differentiation between two or more classes of tumoral tissue; and identification of lesions within the field of view during ultrasound imaging (Table 1).
Of the six studies involving liver malignancies, AI models were used for predicting early recurrence of HCC (n = 4), differentiating between liver cancer and metastatic cancer (n = 1), and detecting focal liver lesions (n = 1). In the four studies concerning pancreatic cancer, one AI model was developed to differentiate between benign and malignant intraductal papillary mucinous neoplasms (IPMN), while another algorithm was used to differentiate between pancreatitis and pancreatic cancer. Furthermore, one AI model aimed to predict malignant IPMNs, and one model was designed to identify pancreatic ductal adenocarcinomas versus chronic pancreatitis regions. Finally, one AI model was developed to differentiate between malignant and benign lesions of the gallbladder.

Model development
Within the included studies, the following trend was observed in the process of model development: purpose scoping, data handling, AI modeling, and evaluation. The framework depicted in Fig. 2 summarizes the workflows applied within the included studies. A nomenclature of AI processes is shown in Table 2.

Radiomics
Apart from two studies that used a combination of image features and clinical factors [15][16][17][18][19][20][21][22], all other studies used image data containing a region of interest (ROI) as the source of predictors for model training. Except for three studies that used full image data after size reduction steps (such as cropping [18], resizing [17,19,21], or downsampling [16]), the remaining studies reduced the acquired image data to the ROI via image segmentation [14,15,20,22,23]. Two studies applied data augmentation in addition to the mentioned pixel size reduction [19,23]. The image segmentation steps were mainly performed using open-source software. In the study of Norton et al. [18], the neural network was trained on features extracted from a single image.

Model training and validation
Most studies in this review divided the dataset into training, validation, and/or test sets for training and evaluating the AI models. Convolutional neural networks were the most commonly applied artificial neural network algorithms in this review, used with different backbones such as FishNet-150 [16], ResNet-50 [17,21,23], or EfficientNet [19]. Other algorithms used included Random Forest, Support Vector Machine, and Gradient Boosting. In two specific studies [14,24], several algorithms were trained on the same training cohort to select the algorithm with the highest accuracy.
Validation of the models was observed in all studies, as either internal validation only or a combination of internal validation, testing, and/or external validation. For internal validation of the AI models, the following methods were used: cross-validation (n = 7), random splitting (n = 3), and bootstrapping (n = 1). Independent testing and external validation were performed in five and three studies, respectively.
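The internal validation methods named above can be illustrated with a minimal sketch: k-fold cross-validation partitions the sample indices into folds and rotates the validation fold, while bootstrapping resamples with replacement and uses the out-of-bag indices for validation. The sample size and random seed here are hypothetical.

```python
import random

def kfold_indices(n_samples, k):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    idx = list(range(n_samples))
    folds = [idx[i::k] for i in range(k)]  # interleaved, equally sized folds
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

def bootstrap_sample(n_samples, rng):
    """Draw one bootstrap resample; out-of-bag indices serve as validation."""
    boot = [rng.randrange(n_samples) for _ in range(n_samples)]
    oob = [i for i in range(n_samples) if i not in set(boot)]
    return boot, oob

splits = list(kfold_indices(10, 5))          # 5 train/validation splits
boot, oob = bootstrap_sample(10, random.Random(42))
```

Random splitting, the third method, corresponds to using just one such train/validation partition instead of rotating through all folds.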

Model performance
The discriminative abilities of the ultrasound-based AI models varied between 0.8 and 0.99. In predicting microvascular invasion of HCC, AI models achieved AUCs ranging between 0.8 and 0.84. Differentiation between primary and metastatic liver cancer was performed with an AUC of 0.82. Additionally, for the intraoperative identification of focal liver lesions, an AUC of 0.8 was reported. AI models for the differentiation between benign and malignant IPMNs showed AUCs from 0.91 to 0.99. An AUC of 0.98 was reported for the identification of PDAC versus chronic pancreatitis. In addition, differentiation between pancreatitis and pancreatic cancer was performed with an AUC of 0.80. Differentiation between malignant and benign gallbladder masses was achieved with an AUC of 0.91. Three studies extended their evaluation by comparing predicted and observed accuracy, reported in the form of a calibration plot [21] or decision curve analysis [22,23].
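The AUC values reported above have a direct probabilistic reading: the AUC equals the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case (the Mann-Whitney U interpretation). A minimal sketch, with hypothetical labels and scores:

```python
def auc(labels, scores):
    """AUC via the Mann-Whitney U statistic: the fraction of
    positive/negative pairs in which the positive case outscores the
    negative one (ties count as half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical model scores for 4 malignant (1) and 4 benign (0) lesions
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.4, 0.2, 0.1]
model_auc = auc(labels, scores)
```

Under this reading, an AUC of 0.98 means the model ranks a malignant case above a benign one in 98% of such pairs, which is why values near 1.0 indicate strong discrimination.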

Methodological quality assessment
All included studies involving AI models were based on a retrospective study design. According to the PROBAST risk of bias tool for AI models, the majority of studies received a low risk of bias score for the domains 'participants,' 'predictors,' and 'outcomes.' For most studies, the 'analysis' domain received an unclear risk of bias score due to improper measures to adjust for missing data and overfitting. Overall, a low risk of bias was reported for 30% of the studies, whereas 30% received a high and 40% an unclear overall risk of bias (Table 3).

Discussion
This review addresses the current AI models applied to ultrasound imaging within hepatopancreatobiliary surgery. All studies used similar approaches toward AI model development, combining radiomics and AI models to preoperatively predict tumor recurrence and differentiate between tumoral tissue types. These ultrasound-based AI models show promising AUCs regarding the prognosis of hepatic malignancies. In clinical practice, preoperatively understanding tumoral behavior could guide surgeons in deciding between conservative and surgical treatment plans. As radiomics features are closely related to tumoral microstructures and reveal intratumoral heterogeneity, these features could be extracted to recognize tumoral behavior based on ultrasound images [25]. As an example, Dong et al. [15] demonstrated that, by using radiomics, microvascular invasion of HCC can be predicted without the need for postoperative biopsy specimens and histological examination. The ability of AI models to differentiate between various tumoral tissues has been emphasized in several studies. Differentiations have been made between primary and metastatic cancer [14], benign and malignant IPMNs [19], pancreatitis and pancreatic cancer [18,21], and malignant and benign gallbladder masses [24], showing AUCs of 0.82, 0.99, 0.95, and 0.91, respectively. Using this ability can prevent both overtreatment and missed invasive carcinomas, thereby enhancing personalized treatment strategies. Additionally, a convolutional neural network model was applied to intraoperative ultrasound images to differentiate between normal liver tissue and focal lesions, showing an AUC of 0.80 [16]. As such an application is rarely reported, this study illustrates that surgeons could even be assisted intraoperatively to avoid overlooking lesions and possible metastases during the surgical procedure. Ultimately, ultrasound-based AI models could have the largest impact in clinical practice due to their ability to differentiate accurately between benign and malignant lesions. Tumoral lesions are often found incidentally on CT scans, but this modality lacks the resolution to characterize specific tumoral features, especially when tumors are small. With ultrasound-based AI models, morphological features could be extracted to reveal tumors at an early stage and facilitate optimal treatment plans for patients.
As all ultrasound-based AI models demonstrated AUCs above 0.8, the discriminative ability of these models can be described as promising. This ability was emphasized by the study of Kuwahara et al. [17], in which the ultrasound-based AI model even surpassed human diagnosis in predictive accuracy. In addition, compared to other imaging modalities, ultrasonography is easily applicable, quickly accessible, and does not expose patients to radiation.
In real-world settings, AI models have been implemented to detect safe dissection planes during robotic surgery, in which AI was able to provide accurate intraoperative feedback during the procedure [26]. Additionally, automatic visualization of structures such as nerves has been accomplished during laparoscopic gastrointestinal surgery [27]. Within hepatopancreatobiliary surgery, AI models have demonstrated AUCs up to 0.85 in predicting surgical outcomes such as early HCC recurrence, response to chemotherapy, and postoperative complications [28][29][30]. AI models have shown superior predictive capabilities compared to conventional logistic regression models and risk calculators. However, to preoperatively predict such surgical outcomes in daily practice, large-scale datasets are required to improve the performance of the models. In addition, multicenter studies are required to externally validate the accuracies of these models. It is vital to maintain an identical workflow for the AI models, including imaging formats, data variables, feature extraction methods, and the number of patients within datasets.
In terms of clinical practicality, AI models may also encounter a few challenges. Inter- and intra-observer variability may influence the generalizability of the predictions, as the quality of ultrasound images is highly dependent on the operator. Another challenge is the dynamic nature of intraoperative ultrasound imaging: within the operating room, ultrasound is analyzed on-site, whereas radiomics requires captured and stored images to extract quantitative features.
To demonstrate the potential of ultrasound-based AI models in clinical practice, the implementation steps are essential. Using Fig. 2, which summarizes the most adopted workflow in designing an AI solution within this review, the critical steps and suggestions for implementing an ultrasound-based AI solution for a selected GI surgical procedure are elaborated. The first key step is defining a task type that can be solved using ultrasound image features. Common task types and their example applications are classification (e.g., categorizing lesions into tissue types), regression (e.g., tumor staging), or clustering (e.g., automatic image segmentation of lesions). This step is frequently formulated toward imitating a current clinical procedure, with the intention of automating or standardizing the manual procedure, for example, classifying a marked lesion into the appropriate tissue category based on radiological interpretation [14,15,19]. Another observed approach in purpose scoping is relating postoperative data (e.g., pathology reports) to preoperative ultrasound images, in order to deliver diagnoses that would otherwise not be possible following conventional clinical procedures [15,22].
The next step involves extracting features from the ultrasound images. As image quality has an impact on model performance [31], it is recommended to prepare the images in a lossless compression format (e.g., DICOM with JPEG 2000 compression). Preprocessing is commonly applied prior to image feature extraction to reduce the effect of outliers and biases that could result from the acquisition of the images. In practice, there are generally two approaches to feature extraction: using a standalone software package (e.g., PyRadiomics, MaZda, Ultrasomics) that outputs extracted image features to the model, or using an integrated feature extraction network (commonly known as a backbone) in a deep learning model. It is recommended to refer to the Image Biomarker Standardization Initiative standard [32] when designing the radiomics workflow. To avoid model overfitting, datasets should generally be divided into a training set and an independent test set using cross-validation or a random-split approach [33]. A decision curve analysis should be used to report the comparison between model performance and observed clinical risks, indicating the usability of the AI model in clinical settings [34].
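The decision curve analysis recommended above compares a model's net benefit against default strategies (treat all, treat none) across risk thresholds. A minimal sketch of the standard net-benefit formula, NB = TP/n - (FP/n) * (pt / (1 - pt)), evaluated at one threshold with hypothetical labels and predicted probabilities:

```python
def net_benefit(labels, probs, threshold):
    """Net benefit of a model at a given risk threshold pt, as used in
    decision curve analysis: NB = TP/n - (FP/n) * (pt / (1 - pt))."""
    n = len(labels)
    tp = sum(1 for l, p in zip(labels, probs) if p >= threshold and l == 1)
    fp = sum(1 for l, p in zip(labels, probs) if p >= threshold and l == 0)
    return tp / n - (fp / n) * (threshold / (1 - threshold))

# Hypothetical predicted malignancy probabilities for eight lesions
labels = [1, 1, 1, 0, 0, 0, 1, 0]
probs = [0.85, 0.75, 0.60, 0.55, 0.30, 0.20, 0.40, 0.10]
nb_model = net_benefit(labels, probs, 0.5)
# "Treat all" baseline: every lesion is classified as malignant
nb_all = net_benefit(labels, [1.0] * len(labels), 0.5)
```

A full decision curve repeats this comparison over a range of thresholds; the model is clinically useful at the thresholds where its net benefit exceeds both the treat-all and treat-none (zero) baselines.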
The results of this review should be interpreted in light of several limitations. First, only studies involving hepatopancreatobiliary oncology were found with the current search strategy; this might indicate a limited extent of the search strategy, as only general terms and synonyms were used for AI, ultrasonography, and hepatopancreatobiliary surgery. Second, a comparative meta-analysis of AI models was not achievable due to the heterogeneous methodologies of the AI studies. Third, as most AI models were based on retrospective studies, none of the models have been implemented in clinical practice. Therefore, the accuracies of the AI models should be considered performances specific to the respective reported studies.
Future studies should focus on new task types, external validation, and reproducibility of the developed AI models. Classification was the major task type in all the articles, in which models were trained to classify new input ultrasound images into one of the trained classes. Other AI task types, including regression and clustering, could be explored to train models that predict continuous annotations or perform automatic image segmentation. For instance, ultrasound as both a qualitative and quantitative imaging modality [35] could provide tissue mechanical property data (e.g., elastography) in addition to the spatial image data, for staging of tumor grade [36] and automatic identification of tumor boundaries on ultrasound images.
As external validation was missing in most studies, independent datasets should be used to validate the usability, performance, and risks of the AI models. External validation is essential to support the generalizability of AI models and, eventually, to facilitate clinical implementation [37]. Before clinical implementation can occur, large datasets need to become available to enable external validation and improve the predictive accuracies of ultrasound-based AI models. In addition, the development of AI models is usually not properly documented, resulting in a lack of reproducibility and transparency. Collective initiatives, such as the FAIR (Findability, Accessibility, Interoperability, Reusability) principles and guideline [38], could help overcome this problem [39]. These guidelines could help authors by providing a general checklist on how to properly secure reproducibility, transparency, and reusability of an AI model. Prospective studies, using large datasets from several hospitals, should be established to evaluate the definitive validation and predictive performance of ultrasound-based AI models within hepatopancreatobiliary surgery.

Conclusion
In conclusion, this review illustrates the promising performance of AI models in predicting early tumoral recurrence and differentiating between tumoral tissue types for patients undergoing hepatopancreatobiliary surgery. Despite the promising accuracies reported in this review, the current literature covers only early-phase applications of ultrasound-based AI models, involving mostly retrospective studies without external validation. Whether these results will remain valid in prospective studies is yet to be determined.

Fig. 1
Fig. 1 PRISMA flow chart of the search strategy and study selection

Fig. 2
Fig. 2 Common workflow of AI models observed within the included studies

Table 1
General characteristics of included studies

Table 2
Nomenclature of AI processes
Machine learning models: Algorithms that process input data and train on it to acquire a certain task until an optimal accuracy is achieved. Examples include Random Forest, Decision Tree, SVM, and Gradient Boosting models
Radiomics: Algorithms that extract quantitative features of images representing specific components of pathologies
Neural networks: Algorithms that process input data through multiple layers for pattern detection, in which the model itself assigns weights to the features within the data
Image acquisition formats: DICOM (Digital Imaging and Communications in Medicine), a high-quality standard format to digitally store medical images; JPG (Joint Photographic Experts Group), a general low-quality format to store images
Image segmentation: Drawing a region of interest on images to extract features from the selected region, or using the complete image for feature extraction
Data labeling: The process of identifying data (images or text files) and adding informative labels to provide context for the model
Software for feature extraction: Software packages with installable modules for feature extraction. Examples include PyRadiomics, Ultrasomics, MaZda, ResNet-50, EfficientNet B5, FishNet-150, and IFoundry
Internal validation methods: Cross-validation (dividing datasets into multiple subsets, using one subset for validation and the remaining subsets for training, repeated so the prediction is validated on multiple subsets); random splitting (randomly selecting a percentage of the dataset as the validation set); bootstrapping (resampling datasets multiple times to enable variability in validation sets)