Application of artificial intelligence for overall survival risk stratification in oropharyngeal carcinoma: A validation of ProgTOOL

Background: In recent years, there has been a surge in machine learning-based models for diagnosis and prognostication of outcomes in oncology. However, there are concerns relating to the model's reproducibility and generalizability to a separate patient cohort (i.e., external validation).


Introduction
The incidence of oropharyngeal squamous cell carcinoma (OPSCC), one of the most common carcinomas of the head and neck, has increased in recent years, with significant geographical variation in incidence and mortality for males and females [1]. OPSCC has an annual incidence of about 100,000 new cases and accounts for approximately 48,143 deaths globally [1]. The risk factors can be considered from two causal mechanisms [1]: firstly, the typical etiological agents for head and neck squamous cell carcinoma (HNSCC), namely heavy alcohol consumption and smoking, and secondly, human papillomavirus (HPV) infection [1][2][3].
In the past decade, oncogenic HPV has emerged as the main causative agent for OPSCC in many countries [4]. Recently, HPV-related OPSCC surpassed cervical cancer as the most common HPV-caused cancer [5]. In the absence of available screening methods for OPSCC, most cases are detected at an advanced disease stage, which warrants the use of combined treatment approaches [5]. Delayed diagnosis continues to influence management outcomes. As all therapeutic modalities have the potential to cause side effects that affect quality of life, careful decision-making is crucial. Thus, an assistive tool for survival risk stratification of OPSCC patients can help in targeted treatment planning and optimal care to meet their psychosocial needs and improve their quality of life [6].
In recent years, artificial intelligence (AI) and its subfield, machine learning (ML), have been touted to contribute to improved clinical decision-making and proper management of cancer [7]. Many cancer centers and medical institutions are poised to embrace AI-based technologies targeted at improving clinical efficiency and providing safe and valuable oncological care, which are necessary for achieving improved quality of patient health [8]. Despite these suggested benefits, only a few of these developed models and AI-based tools have made it to actual medical practice or patient care [9].
There are several barriers to the implementation of these models or AI-based tools in clinical practice [9]. These barriers have been broadly divided into two categories: barriers that are inherent to the science of ML itself and barriers that relate to the clinical concerns of these models in healthcare [9]. Examples of barriers inherent to the science of ML include black-box concerns; data, result, and model interpretability; and generalizability of the models [9][10][11][12][13][14]. Barriers related to the clinical concerns include explainability, usability, clinical validation, ethical and moral concerns, and regulatory frameworks [9,10,12] (Fig. 1).
Of note, the generalizability concern is closely related to the clinical validation concerns of clinicians. Generalizability signifies the ability of an ML model to perform well on data other than those on which it was originally trained [15]. This is generally known as the external validation approach for developed models. It provides an impression of how the model would perform in actual clinical use. Therefore, once the model is considered generalizable, it implies that the model may be ready for clinical validation. Clinical validation seeks to further reassure clinicians about the performance of the model.
The performance of most of these models has mainly been evaluated using unseen data from the same initial longitudinal dataset set aside for the performance evaluation [16]. These set-aside data are usually known as a test set and the evaluation approach, in this case, is known as internal validation. Remarkably, most of these models, although internally validated, have shown considerably lower performance accuracy when externally validated with datasets from different institutions [16][17][18][19][20]. Therefore, the lack of external validation raises a concern about the true performance of these models. The inability to understand the true performance may impede the model's progression to clinical validations. According to the published secondary research regarding the application of AI/ML models in clinical decision-making, several factors may be responsible for the lack of external validation. For instance, the model may not be publicly available for others to conduct an independent external validation study [21].
This study aims to provide a validation study for a recently published ML model for overall survival risk stratification in OPSCC [3]. Recently, this model has been integrated as a web-based prognostic tool (ProgTOOL) for overall survival risk stratification of OPSCC. In line with addressing the limitations of AI in oncology [8,22], the clinical significance of a validly tested and evaluated web-based AI prognostic tool lies in serving as an ancillary tool for estimating the chance of survival and, consequently, supporting informed decision-making regarding the proper management of OPSCC patients. This can enhance effective treatment planning for OPSCC and enable lower costs, time savings, and better quality of life for the patients. In addition, this study aims to systematically review the published studies that have applied ML to aid the prognostication of OPSCC. The essence of the review is to examine how many of the models in published studies were externally validated. Additionally, we examine the current methods for external validation of these models and compare the reported performance of internal validation to that of external validation. These considerations are necessary to assess the readiness of the models for clinical validation, consequently paving the way for the implementation of these models in daily practice. To the best of our knowledge, this is the first study to evaluate current applications of external validation of ML models for OPSCC prognostication. OPSCC was considered in this study as it constitutes a frequent tumor type of the upper aerodigestive tract and has been characterized by increasing incidence in recent decades [23].

Research questions
The research question was: "What is the predictive performance of a web-based prognostic tool for oropharyngeal squamous cell carcinoma (OPSCC) patients when externally validated with new cases?" To answer this main research question, we externally validated a publicly available web-based tool (https://oncotelligence.com/default.aspx) for risk stratification of OPSCC patients (sub-section 2.2). In addition, we reviewed the published studies to evaluate how many of them were externally validated (sub-section 2.3). Furthermore, we evaluated the modalities of external validation of these models. We explored external validation as an approach that can influence the integration of these models into daily clinical practice. The performance of the model following the external validation procedure was reported using the checklist for assessment of medical AI (ChAMAI, formerly IJMEDI) [Supplementary I] [24].

Data description, inclusion, and exclusion
A total of 224 retrospective cases of OPSCC from 2012 to 2016 were extracted from the electronic patient records at Helsinki University Hospital (Finland) to externally validate our recently introduced machine learning-based web tool for overall survival risk stratification in OPSCC patients. The inclusion criteria for this study were cases with recorded clinical and pathologic characteristics, namely gender, age at diagnosis, stage, TNM class, grade of differentiation, marital status, human papillomavirus status, treatment modalities (surgery and radiotherapy), disease-free survival months, and overall survival status. Of note, marital status has long been recognized as an important prognostic factor for many cancers [25,26]. More specifically, there are reports that demonstrate a survival benefit for married cancer patients, especially for human papillomavirus-related cancers such as oropharyngeal or cervical cancer [27][28][29][30]. All these extracted parameters corresponded to the parameters contained in the web-based tool (ProgTOOL). The included patients were predominantly of Caucasian origin. As this is a validation study, patients with missing values in any of the above-mentioned clinical and pathologic characteristics were excluded.

Data preparation
Following the application of the exclusion criteria, that is, cleaning and pre-processing of the data for missing values (deletion of rows with missing entries), a total of 163 cases were finally used to validate the tool (Table 1). There were no outliers in the dataset, and since incomplete rows were removed, no imputation analysis for missing values was needed in this validation study. As this study aimed at validating an already deployed machine learning model, neither feature preprocessing nor target variable balancing was performed. However, standardization of the dataset was carried out by converting all variables, except patient age, into categorical variables as shown in Table 1. In addition, the imbalanced nature of the dataset used for this external validation is mentioned as one of the limitations of this study.
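For illustration, a minimal sketch of the complete-case filtering and categorization steps described above is given below (Python/pandas). The file name, column names, and category coding are hypothetical placeholders and do not reflect the actual variable coding of the validation dataset.

```python
import pandas as pd

# Load the extracted cohort (hypothetical file name and column names).
df = pd.read_csv("opscc_validation_cohort.csv")

# Complete-case filtering: drop rows with missing values in any of the
# clinicopathological variables used by the web-based tool (exclusion criterion).
required = ["age", "gender", "stage", "t_class", "n_class", "m_class",
            "grade", "hpv_status", "surgery", "radiotherapy", "overall_survival"]
df = df.dropna(subset=required)

# Standardization step: all variables except patient age are treated as
# categorical, mirroring the categorization used in the development study.
for col in required:
    if col != "age":
        df[col] = df[col].astype("category")

print(f"{len(df)} complete cases retained for external validation")
```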
Information regarding marital status was missing from the validation dataset. Thus, we assumed two scenarios (married or single) for the patients to externally validate the web-based tool. The clinical and pathologic parameters included in this validation study were categorized in the same manner as in the original study used to develop the web-based prognostic tool [3]. All patients included in this study were treated with curative intent. The study was approved by the Research Ethics Board at the Helsinki University Hospital, and institutional permission for the study was granted (Dnr: 51/13/03/02/2013). A detailed description of the development of the ML model has been reported and published by our group [3].

Geographic external validation approach
We used the geographical external validation approach recommended by Ramspek et al. for a publicly available machine learning model [31]. This ensures that the external validation cohort used in this study is structurally different from the development cohort. These differences lie in the region or country and the source of the data, which may reflect a different type of care setting. In our previous study, the development cohort was obtained from one of the largest publicly available databases, managed by the National Cancer Institute (NCI) through the Surveillance, Epidemiology, and End Results (SEER) Program of the National Institutes of Health (NIH) in the United States, while the external validation cohort was obtained from Helsinki University Hospital (Finland). The former was registry data while the latter was hospital data. Therefore, these two datasets are structurally different, and we present the area under the receiver operating characteristic curve (AUC), Net Benefit, and Brier score to evaluate the model's discrimination power, utility, and calibration, respectively.

Evaluation metrics from external validation approach
The performance of the web-based tool in the external validation process was evaluated using accuracy (weighted and balanced accuracy), Matthews correlation coefficient, weighted AUC, Net Benefit, and Brier score. The Brier score was calculated by considering the prediction of the model at every 10th instance within the external validation data (i.e., a total of 16 instances, since the data contained 163 cases). Similarly, we used the formula proposed by Vickers et al. to calculate the Net Benefit of the model at 10%–50% threshold probabilities [32,33].
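As a rough illustration of how these metrics can be obtained, the sketch below uses scikit-learn together with the Net Benefit formula of Vickers et al. The arrays `y_true` and `y_prob` are placeholders for the observed outcomes and the probabilities returned by the web-based tool, and the every-10th-instance subsampling for the Brier score is only an approximation of the procedure described above.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             matthews_corrcoef, roc_auc_score, brier_score_loss)

def net_benefit(y_true, y_prob, threshold):
    """Net Benefit at a probability threshold pt (Vickers et al.):
    NB = TP/n - FP/n * (pt / (1 - pt))."""
    y_pred = (y_prob >= threshold).astype(int)
    n = len(y_true)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return tp / n - fp / n * (threshold / (1 - threshold))

# Placeholders: observed overall survival class (0/1) and predicted
# probability of the positive class from the web-based tool.
y_true = np.random.randint(0, 2, 163)
y_prob = np.random.rand(163)
y_pred = (y_prob >= 0.5).astype(int)

print("Accuracy:          ", accuracy_score(y_true, y_pred))
print("Balanced accuracy: ", balanced_accuracy_score(y_true, y_pred))
print("MCC:               ", matthews_corrcoef(y_true, y_pred))
print("Weighted AUC:      ", roc_auc_score(y_true, y_prob, average="weighted"))
# Brier score over every 10th instance (16 of 163 cases), approximating the
# subsampling described in the text.
print("Brier score:       ", brier_score_loss(y_true[9::10], y_prob[9::10]))
for t in np.arange(0.10, 0.51, 0.10):
    print(f"Net Benefit @ {t:.0%}:", round(net_benefit(y_true, y_prob, t), 3))
```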

Summary of the web-based prognostic tool model development and performance
The web-based tool was developed to classify OPSCC patients into chance-of-survival groups (low chance or high chance) for overall survival (binary classification task). A trained voting ensemble machine learning algorithm was integrated as the web-based prognostic tool. The model was trained to classify OPSCC patients into the chance of overall survival (model output). It was trained using 5-fold cross-validation with hyperparameter tuning (extreme gradient boosting [XGB]; standardization: standard scaler wrapper; colsample_bytree: 0.9; eta: 0.3; gamma: 1; max_depth: 10; n_estimators: 25). Each of the input variables, with the exception of patient age, was categorized (Table 1) before cross-validation [3]. The performance of the model in the training phase showed 89.8% accuracy and 86.3% balanced accuracy [3]. Additionally, the Matthews correlation coefficient and weighted area under the curve were 0.77 and 0.929, respectively.
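For readers interested in the reported configuration, a minimal sketch of such a training setup is shown below. The feature matrix `X` and labels `y` are placeholders, `learning_rate` stands in for the reported `eta`, and the sketch uses the XGBoost component alone rather than the full voting ensemble that was actually deployed.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Placeholder feature matrix (categorical variables already encoded) and labels.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, 1000)

# Reported configuration: standard scaler wrapper + XGBoost with the published
# hyperparameters (learning_rate corresponds to eta).
model = Pipeline([
    ("scaler", StandardScaler()),
    ("xgb", XGBClassifier(colsample_bytree=0.9, learning_rate=0.3, gamma=1,
                          max_depth=10, n_estimators=25,
                          eval_metric="logloss")),
])

# 5-fold cross-validation, as used in the development study.
scores = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy")
print("Balanced accuracy per fold:", scores.round(3))
```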
Following the training phase, the model was temporally validated prior to integration as a web-based prognostic tool. The model showed an accuracy of 88.3% and a Matthews correlation coefficient of 0.72 in this temporal validation [3]. The feature importance analysis showed that human papillomavirus (HPV) status, patient age, T stage, marital status, N stage, and the treatment modality (surgery with postoperative radiotherapy) had significant effects on the predictive performance of the model. In addition, the Local Interpretable Model-Agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) frameworks were used to examine the effects of each variable on the outputs predicted by the model. As this is a validation study, the details of the model development and performance metrics have been previously published [3].
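As an illustration of how such post-hoc explanations can be generated, a minimal SHAP sketch for a tree-based classifier is given below; a comparable local explanation could also be produced with LIME's tabular explainer. The data, feature names, and model are placeholders and do not reproduce the original analysis.

```python
import numpy as np
import pandas as pd
import shap
from xgboost import XGBClassifier

# Placeholder data and model; in practice these would be the trained classifier
# and the categorized OPSCC feature frame.
feature_names = ["age", "hpv_status", "t_stage", "n_stage",
                 "marital_status", "surgery_rt"]
X = pd.DataFrame(np.random.rand(200, len(feature_names)), columns=feature_names)
y = np.random.randint(0, 2, 200)
model = XGBClassifier(n_estimators=25, max_depth=10).fit(X, y)

# SHAP explanations for the predicted survival class.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)                                     # global feature impact
shap.force_plot(explainer.expected_value, shap_values[0], X.iloc[0])  # one patient
```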
Due to the absence of a totally independent new dataset at the time, the model was initially validated using a temporal validation method [31]. A temporal validation method lies between internal and external validation methods. It is considered the simplest form of external validation, in which a small cohort from the same data source is reserved for temporal validation [31]. This cohort was used neither for training nor for testing [3,31]. Hence, although it is the simplest form of external validation, it is more robust and stronger than internal validation for prediction model reproducibility and generalizability [31,34]. To further demonstrate the viability and predictive performance of the model, the present study aims at validating the model with a totally new dataset (external validation) from a new geographic location. The model was trained using data obtained through the Surveillance, Epidemiology, and End Results (SEER) program, United States, whereas the external validation dataset was obtained from Helsinki University Hospital, Finland.

Search protocol and strategy
The search protocol was developed by combining the search keywords: [('deep learning' OR 'machine learning') AND ('oropharyngeal')]. This search string was used to query PubMed, Ovid Medline, Scopus, and Web of Science to retrieve relevant articles from inception until September 15, 2022. The retrieved articles were systematically reviewed for relevance in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Fig. 1). The hits retrieved from this search were exported to EndNote software for further analysis. To avoid selection bias and reduce research waste, the reference lists of relevant papers were manually searched. In addition, Google Scholar and preprint servers were searched for possible gray literature.

Study selection, inclusion, and exclusion criteria
Two independent reviewers (R.A. and A.S.) screened the titles and abstracts of potentially relevant articles for initial inclusion. Subsequently, all potential abstracts were subjected to full-text review by these two independent reviewers. Possible disagreements were resolved through discussion and meeting for consensus. We involved a third reviewer (A.A.) to overcome inclusion discrepancies and to foster agreement on the potentially relevant articles.
Studies were included if they met the following inclusion criteria: (1) original studies conducted in the English language and (2) studies that focused on the application of artificial intelligence or its subfield, machine learning for diagnosis and prognostication of outcomes in oropharyngeal cancer. The following were the exclusion criteria: (i) editorials, commentary articles, conference extracts, and narrative reviews were excluded, and (ii) studies on AI/ML in prognostication of head and neck cancer other than oropharyngeal regions, and of nonmalignant oral pathologies were excluded.

Study extraction
A data extraction sheet was used to compile the list of eligible studies. Relevant information relating to the study author, year, country of publication, number of training data, data types, subfield of AI used, algorithm used, performance metrics, and conclusion was extracted from these studies (Table 2). As this review primarily aims at examining how many of the included studies were externally validated and the methodology of external validation of ML models, we specifically extracted those studies where external validation was performed (Table 3).

Quality assessment of included studies
We examined the risk of bias and applicability of the studies where external validation was performed (Table 3) using the Prediction model Risk Of Bias Assessment Tool (PROBAST) (Table 4). Further quality assessment of these studies was completed using the guidelines for developing and reporting ML predictive models in biomedical research [35]. The guideline was summarized as previously reported in other studies, whereby a mark was given for each of the guideline topics (Supplementary II) [9,36]. The threshold was set at half of the maximum marks, and the scores are presented in Table 5.

Characteristics of the study population for validation study
The detailed characteristics of the external validation data (n = 163) are presented in Table 1.

Performance metrics from the validation study
The web-based tool produced an accuracy of 90.2%, a Matthews correlation coefficient of 0.78, and an F1 score of 0.84. The model showed comparable performance between the temporal and external validations. The weighted area under the receiver operating characteristic curve was 0.94 (Fig. 2). In terms of predictive values, the positive predictive value (precision) was 0.93 while the negative predictive value was 0.89. The specificity and sensitivity (recall) were 0.97 and 0.76, respectively. A similar result was obtained when the patients were considered single (in terms of marital status). Other performance metrics relating to the confusion matrix and the predictive rates are given in Table 6. Furthermore, the sensitivity and specificity for the 1-, 3-, and 5-year time points were 1.00 and 0.50; 0.97 and 0.40; and 0.74 and 0.96, respectively (Table 7). The calibration curve for this validation study is presented as Supplementary III.
The calculated Brier score was 0.06 (the Brier score ranges between 0.0 and 1.0, where 0.0 indicates a perfect model and 1.0 the worst). Similarly, the Net Benefit value of the model was approximately 0.7 (i.e., acceptable) at the 10%–50% probability thresholds.
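For reference, the two measures reported above follow their standard definitions, where N is the cohort size, p_i the predicted probability, o_i the observed outcome (0 or 1), and TP and FP the true and false positives at the threshold probability p_t:

```latex
\text{Brier score} = \frac{1}{N}\sum_{i=1}^{N}\left(p_i - o_i\right)^2,
\qquad
\text{Net Benefit}(p_t) = \frac{TP}{N} - \frac{FP}{N}\cdot\frac{p_t}{1 - p_t}
```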

Results of the database search
A total of 549 hits were retrieved. After removing duplicates (N = 245), irrelevant papers (N = 256), and exclusions (N = 20), 31 studies were found eligible for inclusion in this study, as shown in Fig. 1 [23,…]. The inter-observer reliability between the independent reviewers for data extraction showed a Cohen's kappa coefficient of κ = 0.90 (Table 2).

Summary of studies that performed external validation dataset
Out of the 31 studies found eligible (sub-section 3.3) (Table 2), only 7 (22.5%) studies performed external validation (Table 3) [37,39,43,46,54,56,60]. These were the studies that were further evaluated and discussed in our analysis. In terms of ML application, 3 (42.9%) of these studies emphasized the potential of the models in diagnosis [46,56,60] while 4 (57.1%) reported the prognostication abilities of the ML-based models [37,39,43,54]. Regarding the types of external validation, 3 (42.9%) studies each used temporal external validation [39,43,56] or geographic external validation [37,54,60], while a single study (14.3%) employed expert opinion as a form of external validation to examine the potential of the model in terms of generalizability [46].

Importance of external validation
The need for external validation of ML models was emphasized in a few of the included studies [40–42,50,51,55,66]. Despite the promising results from internal validation of these models, it was suggested that external validation of the trained model can enhance transportability (the ability of an ML model to produce comparable performance when tested with an independent dataset that is different from the dataset on which it was trained) [56,61]. In addition, it was equally observed that the reproducibility and generalizability of an ML model can be guaranteed through external validation [40–42,50,51,55,66].

Quality assessment
According to the PROBAST assessment, all of the studies that performed external validation showed an overall low risk of bias and low applicability concerns. Similarly, the quality assessment showed that 5 (71.4%) studies had extremely high quality [37,43,54,56,60] while 2 (28.6%) showed high quality [39,46]. This high quality indicates that these studies are well positioned to answer the research questions of this study.

Discussion
This study highlights the importance of a thorough and independent external validation of a machine learning model that has been integrated as a web-based prognostic tool, both to ensure the model's generalizability and to facilitate its clinical evaluation. This important step of external validation is necessary to demonstrate the readiness of such models to move from the developmental stage to real-world evaluation using a completely different dataset. Several studies have emphasized the significance of validation studies through external validation prior to clinical evaluation to guarantee reproducibility and generalizability [9,15,31]. Traditionally, a model's performance is usually reported based on internal validation. However, it has been reported that judging the potential real-world performance of ML models based solely on internal validation of the underlying data may produce over-optimistic, misleading, and unrealistic expectations of the model's accuracy [31], with consequent implications for clinical applications.
The external validation of the web-based prognostic tool examined in this study demonstrated that the model performs very promisingly in its ability to stratify OPSCC patients into risk groups in terms of their chance of overall survival. The model showed higher performance in terms of accuracy, AUC, Matthews correlation coefficient, and Brier score in the external validation compared to the temporal validation. Presently, there is arguably no standardized approach for the external validation procedure. For example, the study by Ramspek et al. emphasized the use of a geographic external validation dataset to ensure a cohort that is structurally different from the development cohort [31]. Cabitza et al. proposed a two-step (meta-validation) approach for evaluating an external validation process [67]: the first step involves building a linear regression model over the training dataset, while the second step evaluates data similarity and cardinality for discrimination, calibration, and utility using AUC, Net Benefit, and Brier score. In our study, we externally validated an already developed and web-integrated model using a structurally different geographic dataset, as recommended by Ramspek et al. [31]; moreover, the model may not be suitable for linear regression since it was aimed at a classification task. Therefore, we combined the suggestions of Ramspek et al. and Cabitza et al. by using a structurally different dataset (i.e., geographic external validation) to cater for dataset similarity, while examining the performance of the external validation procedure over an array of evaluation metrics, specifically AUC, Net Benefit, and Brier score, for the model's discriminating power, utility, and calibration in the overall survival risk stratification of OPSCC patients. Thus, the external validation procedure showed that the model is excellent based on the AUC and Brier score and may be acceptable based on the Net Benefit score.
With such a model, information regarding the estimated overall survival of OPSCC patients can help guide clinicians in clinical decision-making. For example, clinicians can carefully examine patients stratified as having a high chance of poor survival for individualized treatment planning. Remarkably, the initial model integrated as a web-based tool was trained using a dataset from one of the largest publicly available databases, managed by the National Cancer Institute (NCI) through the Surveillance, Epidemiology, and End Results (SEER) Program of the National Institutes of Health (NIH). The validation study was done using data obtained from Helsinki University Hospital, Finland. Therefore, the dataset used for this validation study was from a different geographical location. Such an approach is posited to address concerns relating to the generalizability of this web-based tool. Consequently, further validation studies from other geographical locations are important prior to recommending the web-based tool for clinical evaluation.
Remarkably, for any potential ML model to be transferred to clinical evaluation, two important characteristics should be present: the model's reproducibility (validity) and generalizability (transportability) [68]. These two characteristics serve as important cornerstones for the viability of an ML model in clinical practice. While reproducibility ensures that the model shows considerably similar performance when tested with patients similar to the development population, generalizability explores whether the model performs as expected when exposed to a separate population with relatively similar patient characteristics to the development population. External validation may be used to establish both characteristics.
However, as shown in this study, only a few studies have performed external validation of their reported models for outcomes prognostication in OPSCC. This raises the concern of generalizability as it is unclear how these models would perform in a newly collected, unseen, and balanced external validation data. The lack of external validation practices in these published studies may have limited the translation of these models from development environments to clinical consideration and evaluation. Without adequate and standard clinical evaluation of these models, the path to implementation in daily clinical practice remains vague. To the best of our knowledge, this may justify why none of these models are presently used in daily clinical practices for OPSCC management.
It is common practice to validate ML models at the technical level using any of the widely accepted internal validation methods such as cross-validation (k-fold or Monte Carlo cross-validation), bootstrapping, or the hold-out method (training, validation ± testing). This makes internal validation the most widely used approach, as it is tuneable and highly heterogeneous [69,70]. However, the internal validation approach has been criticized because it involves significant parameter tuning. In addition, if the original input data are biased, the internal validation becomes a biased evaluation [69]. It is not uncommon for an ML model to show promising results during training and perform poorly when externally validated [18,34,71,72]. Several reasons, such as underfitting, overfitting, or an imbalanced dataset, may account for this. To overcome this concern, external validation is warranted.
Conducting an evaluation process such as an external validation study will allow stakeholders to assess and understand the true performance of these models. Validation of a standalone model, or of a model integrated as a web-based tool as demonstrated in this study, is one of the crucial steps to secure generalizability prior to clinical evaluation and, consequently, to follow the other recommended paths to implementation in clinical practice. External validation, also known as a standalone test, is thought to be the approach of testing the predictive performance of a trained ML model with an entirely different set of new patients in order to ascertain that the model works as expected [31].

Types of external validation
As shown in this study, three forms of external validation have been used apart from the traditional internal validation methods: temporal, geographical, and expert external validation (Table 3, Fig. 3). In temporal external validation, a certain portion of the same data is reserved [73]. This reserved portion is used neither in training, validation, nor testing; it is instead used to evaluate the performance of the trained ML model [73]. The reservation can be done randomly or by reserving a certain set of the same data from the first, middle, or last portions. In some cases, cohorts from a certain year within the same dataset may be reserved for temporal external validation, as illustrated in the sketch below.
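As a simple illustration of the temporal reservation described above, the sketch below holds out the most recent diagnosis years of a cohort as the temporally reserved set; the file name, the `year_of_diagnosis` column, and the cut-off year are hypothetical.

```python
import pandas as pd

# Hypothetical cohort with a year-of-diagnosis column.
df = pd.read_csv("cohort.csv")

# Temporal external validation: cases diagnosed up to the cut-off year are used
# for model development; later cases are reserved and never seen during
# training, validation, or testing.
cutoff = 2015
development = df[df["year_of_diagnosis"] <= cutoff]
temporal_validation = df[df["year_of_diagnosis"] > cutoff]

print(len(development), "development cases;",
      len(temporal_validation), "temporally reserved cases")
```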
Considering the daunting process involved in setting up standard external validation procedures (a new set of data and the model tested by different researchers) [74], the temporal validation approach may be posited as a viable validation process for prediction model reproducibility and generalizability [74]. It is considered the simplest form of external validation, which is more robust and stronger than internal validation [34], even though the subsequent cohorts used for temporal validation are recruited from the same data source [73]. This type of external validation has been criticized for several reasons, such as the fact that the patients are not structurally different [34,75]. In addition, the process falls between the internal and external validation approaches [75,76] (Fig. 3).
Therefore, geographical external validation is posited to address the concerns associated with temporal validation. Geographical external validation ensures that the developed model is validated with a dataset that is structurally different from the development cohort, as demonstrated in this study [71]. These differences may be in terms of region, country, type of care, or treatment protocols [75,76]. The use of a geographical or totally independent dataset from another geographic location has been considered reliable [74]. Despite this, ethical concerns regarding the transfer or sharing of data, the lack of an external validation dataset, and the unavailability of the model for independent external validation are some of the reasons for the dearth of external validation. Due to these factors, experts have been saddled with the responsibility of monitoring the performance of these models as a form of external validation. The concern with this approach is that these models are supposed to provide a second opinion to the experts, not the other way around. Hence, geographical external validation remains the widely favored approach [74]. To summarize, neither internal validation nor temporal validation should be misconstrued as an adequate form of independent external validation [31]. If a model exhibits comparable performance in rigorous external validation, this may be a strong indication that the model works as expected and may be ready for further clinical evaluation by experts (Fig. 4). Similarly, a clearly low performance when externally validated may indicate that the model does not generalize (Fig. 4).

Concerns relating to external validation
The modalities (how, who, number of cases, effectiveness, and evaluation) of performing external validation are sources of significant concern. For example, as shown in the previously published studies (Table 3), external validation was done by the developers of the model. There is a concern that the developer may fine-tune the model during external validation (i.e., the potential to introduce bias into the external validation process) [18,71,77]. There is a growing debate that external validation should be done by separate researchers [77]. The main question is how many of these developed models are publicly available in order to be externally validated by independent researchers. This may explain why very few of these models have been externally validated by different authors [18].
Considering these concerns, we would like to emphasize that an adequate framework of external validation may be warranted. This framework should define important parameters such as sample size for external validation, methodological caveats, minimum acceptable differences between the performance of the model during internal and external validations, and reporting standards for the validation process. Concerted efforts are required from different stakeholders to modify the framework to bridge the gap between the development and implementation of prediction models in daily clinical practices.
Of note, it has been discussed whether or not to include the process of external validation in routine ML model development pipelines [71,77]. External validation is thought to be a sequel to internal validation, as it addresses transportability (use in a different environment from that in which the model was initially developed) and generalizability (real-world performance), rather than reproducibility [78]. Therefore, considering the ethical and legal concerns relating to data sharing, we opine that the process of external validation should not be integrated as part of the ML development pipeline, because data cannot be freely shared for geographical external validation owing to the ethical permissions governing the use of data. However, it may be feasible to include temporal validation as part of the routine ML development process (Fig. 4). Integrating the model as a web-based tool [21] or distributing it as a standalone package seeks to address the concerns relating to data sharing and to facilitate external validation.
Remarkably, the fact that a model has been externally validated does not imply straight acceptance into daily clinical practice. Other important aspects should be considered prior to recommending the model for clinical evaluation, for example, the quest for standardized reporting of ML models in medical oncology. Recently published studies have provided standardized guidelines for reporting AI- or ML-based models [34,79,80]. Furthermore, the performance metrics should be standardized such that an array of different metrics is used to demonstrate the performance of these models.

Limitations, conclusions, and future research
There are limitations to be considered in this study. First, there was an imbalance in the target variable of overall survival in the external validation data, which may introduce a certain level of model bias. Furthermore, the web-based tool was developed by our group, although it was externally validated with a dataset from a different geographic location as shown in this study. Therefore, this validation study was not completely independent, even though it fulfilled the requirements for external validation. In addition, the web-based tool was validated with a relatively small number of cases.
In conclusion, it is good practice to repeatedly evaluate the trained model with an external geographic validation approach, in tandem with the internal validation that must already have been performed. If the model persistently shows good performance after being repeatedly validated externally, this may imply that the model is robust, generalizable, and ready for clinical evaluation. Therefore, we recommend that developed models undergo external validation, followed by clinical evaluation and a randomized comparative impact assessment. This carries the potential to increase the likelihood of the utilization of these models in daily clinical practice in the future.
For future research, it is necessary to increase external geographic validation by other researchers to further guarantee independent external validation and to further assess the generalizability of the model. Similarly, a defined path to the implementation of ML models should be stated by stakeholders to enable the transition of these models from development to bedside. To enhance model maintenance, performance improvement, and updating, we propose that the integration of ML models as web-based tools should follow modern approaches, such as federated learning and particle swarm optimization paradigms, that preserve data privacy without requiring data sharing. This will ensure that the model's quality and performance are continuously improved without any data privacy concerns.
Summary points
• A publicly available ML web-based prognostic tool (ProgTOOL) was externally validated using a geographic dataset for survival risk stratification of oropharyngeal squamous cell carcinoma (OPSCC).
• A considerable number of published studies on the application of ML models for OPSCC outcome prognostication were not externally validated.
• External validation of ML models has traditionally been done using temporal, expert, or geographic external validation paradigms.
• Geographical external validation is considered the most reliable form of external validation.
• An external validation approach through a validation study is important for model generalization and for the models' progression to clinical evaluation and consideration.
Code availability and daily clinical use of the model (web-based tool)
This study is a validation study of a model that has been integrated as a web-based tool that is freely, openly, and publicly available. Thus, code availability is not needed. Additionally, the web-based tool is not presently used in daily clinical practice. However, a detailed description of the development of the model is available in our previous study [3]. This study also highlights some of the requirements that can facilitate the implementation of ML models in clinical practice. ProgTOOL will be hosted freely for one year (until the end of 2023). It is intended for research purposes only.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.