Preoperative Risk Stratification in Esophageal Cancer Surgery: Comparing Risk Models with the Clinical Judgment of the Surgeon

Numerous prediction models estimating the risk of complications after esophagectomy exist but are rarely used in practice. The aim of this study was to compare the clinical judgment of surgeons using these prediction models. Patients with resectable esophageal cancer who underwent an esophagectomy were included in this prospective study. Prediction models for postoperative complications after esophagectomy were selected by a systematic literature search. Clinical judgment was given by three surgeons, indicating their estimated risk for postoperative complications in percentage categories. The best performing prediction model was compared with the judgment of the surgeons, using the net reclassification improvement (NRI), category-free NRI (cfNRI), and integrated discrimination improvement (IDI) indexes. Overall, 159 patients were included between March 2019 and July 2021, of whom 88 patients (55%) developed a complication. The best performing prediction model showed an area under the receiver operating characteristic curve (AUC) of 0.56. The three surgeons had an AUC of 0.53, 0.55, and 0.59, respectively, and all surgeons showed negative percentages of cfNRIevents and IDIevents, and positive percentages of cfNRInonevents and IDIevents. This indicates that in the group of patients with postoperative complications, the prediction model performed better, whereas in the group of patients without postoperative complications, the surgeons performed better. NRIoverall was 18% for one surgeon, while the remainder of the NRIoverall, cfNRIoverall and IDIoverall scores showed small differences between surgeons and the prediction models. Prediction models tend to overestimate the risk of any complication, whereas surgeons tend to underestimate this risk. Overall, surgeons’ estimations differ between surgeons and vary between similar to slightly better than the prediction models.

The primary aim of this study was to assess whether the available prediction models are superior to the clinical judgment of the surgeon with regard to predicting the risk of any postoperative complication, while the secondary outcome was to assess how well surgeons can predict major (Clavien-Dindo grade IIIA or higher) postoperative complications.

Study Design
A prospective, single center, observational cohort study was conducted at a tertiary referral hospital (Amsterdam UMC). Ethical approval was waived by the Ethical Committee. Written informed consent was obtained from all patients in this study for use of their patient data. The TRIPOD and STROBE guidelines were consulted to ensure the correct reporting of the results. 24,25 Study Population Eligible patients were aged 18 years or older with resectable esophageal or gastroesophageal junction carcinoma (cT0-4aN0-3M0), and scheduled to undergo a minimally invasive transthoracic esophageal resection by one of the three surgeons. Patients were excluded in cases of a salvage esophagectomy, an esophagectomy for recurrent disease, or if nonresectable disease was found during esophagectomy.

Treatment of Patients
All patients were treated according to the Dutch guideline. 26 Patients were generally treated with neoadjuvant chemoradiotherapy followed by a minimally invasive transthoracic esophagectomy with a two-field lymphadenectomy. A gastric conduit reconstruction was performed with a cervical or intrathoracic anastomosis, depending on tumor characteristics and the extent of the radiotherapy field.

Selection of Prediction Models from the Literature
A systematic search of the available literature in the PubMed and Embase databases was performed in order to identify relevant studies describing prediction models for postoperative complications after esophageal surgery; the search strategy is described in electronic supplementary Table SDC1. Studies were eligible if they described the establishment of a prediction model that predicts the occurrence of postoperative complications after an esophageal resection with gastric conduit reconstruction; however, studies were excluded if they only described a prediction model that predicts specific (i.e. only pulmonary complications) complications or only mortality after esophageal surgery.
The literature search resulted in 123 studies. Two prediction models, one by Reeh

Clinical Judgment of the Surgeon
Three surgeons (IvBH, SSG, and WJE) were asked to estimate the probability of a consecutive series of patients to develop a postoperative complication (any complication and a major complication, i.e., Clavien-Dindo grade IIIA or higher). Surgeon 1 had 14 years of experience, whereas surgeons 2 and 3 had 9 and 2 years of experience, respectively. All surgeons were blinded from the other surgeons' response and outcome of the prediction models. One day prior to surgery, surgeons completed the Preoperative Risk Score Form (electronic supplementary Table SDC4) after studying the patient file and clinically evaluating the patient, regardless if they were operating themselves or not. The surgeons were not blinded from who was the operating surgeon. On this form, the surgeons could indicate their estimation on a 10-point scale with percentage categories for the patients to develop any or a major complication (Clavien-Dindo grade IIIA).

Study Outcomes
The primary aim of this study was to investigate the discriminative ability of the existing prediction models compared with the accuracy of the clinical judgment of the surgeon with regard to predicting the risk of any postoperative complication, while secondary outcomes were the performance of the selected prediction models in this cohort, the performance of the surgeons, and to describe how well surgeons can predict major (Clavien-Dindo grade IIIA or higher) postoperative complications. Complications were identified and collected until 30 days post-surgery and graded according to the classification by the Esophageal Complications Consensus Group (ECCG) and the Clavien-Dindo classification. 28,29

Reclassification Measures
Reclassification measures were used to describe the difference between the ability of the surgeon and prediction models to predict postoperative complications.
Comparing the area under the receiver operating characteristic curves (AUCs) is the most common strategy to compare prediction models; 30 however, comparing AUCs has proven to be insensitive to important changes in absolute risk. 31 Therefore, reclassification measures are recommended. Using these models, patients are stratified into clinical categories based on risk in the first (reference) model, then the ability of the second (the surgeon) to more accurately reclassify individuals into higher or lower risk strata is quantified. 32,33 Three reclassification measures were used in the current study: the net reclassification improvement (NRI), category-free NRI (cfNRI), and integrated discrimination improvement (IDI) indexes.
The NRI attempts to quantify how well a second model (in this case the surgeon) reclassifies subjects to a more appropriate risk category. The overall NRI is the sum of NRI events and NRI nonevents . NRI ranges from −1 to 1, where 0 indicates no difference. An NRI closer to 1 correlates with a better prediction of the second model (the surgeons) and an NRI closer to −1 correlates with a better prediction by the first model. The NRI requires a threshold in risk score in order to be able to categorize patients. In this study, a threshold of 60% was chosen since the incidence of complications after esophagectomy is reported to be around 60% in Dutch centers. 34,35 A probability of over 60% represents an increased risk. The cfNRI counts the direction of change for every individual instead of the crossing from a higher-risk group to a lower-risk group and vice versa. cfNRI values above 60% should be interpreted as a strong improvement in comparison with the reference model; those around 40% should be considered intermediate improvement and those below 20% should be considered weak improvement. 36 Finally, IDI counts the actual change in calculated risk for each subject instead of only the direction of change, as the cfNRI does. A higher IDI correlates with better estimation by the surgeon and a negative IDI indicates a better prediction by the prediction model.

Sample Size Calculation
To calculate the number of patients necessary for this study, the number of patients needed to develop a prediction model was used. This calculation was based on the number of degrees of freedom in the largest prediction model is this study, i.e. the study by Lagarde et al. This model includes six variables (either dichotomous or continuous). It is desirable to include a representative sample with at least 10 events and 10 nonevents per variable. Since more than half of the patients usually develop a complication, at least 60 patients without a complication were needed. 37 Internationally, the incidence of patients with one or more postoperative complications after an esophageal resection for surgery is 60%. 34 Therefore, the total sample size was set at 150 patients.

Statistical Analysis
All risk factors included in the prediction models are displayed in the baseline table. To compare categorical data, the Pearson Chi-square test or Fishers exact test were used, as appropriate. The independent samples t-test was used for continuous data with normal distribution. The Mann-Whitney U test was used to compare continuous data with nonnormal distribution.
The performance of the prediction models was estimated in the current dataset. For each surgeon and prediction model, the calibration (the ability to quantify the observed absolute risk) and discrimination (ability to discriminate between patients with and without an event) were described. Discrimination was examined with the AUC, and calibration was examined with the observed/expected ratio and calibration intercept and slope.
Based on calibration and discrimination, the best performing model was chosen and then compared with the clinical judgment of the surgeons using reclassification measures (NRI, cfNRI, and IDI) in addition to quantifying the difference between AUCs. Missing data were handled with single imputation. All p-values were based on a two-sided test and a p-value <0.050 was considered statistically significant. Data were analyzed using SPSS for windows, version 25 (IBM Corporation, Armonk, NY, USA) and R version 3.3.3 (R Foundation for Statistical Computing, Vienna, Austria).

RESULTS
A total of 208 patients underwent esophagectomy between March 2019 and July 2021. A total of 49 patients were excluded: 18 patients because none of the surgeons had estimated the postoperative complication risk, 17 patients underwent salvage esophagectomy, 5 were intraoperatively found to have nonresectable disease, and 9 underwent an open esophagectomy. Therefore, in total 159 patients were included in the present study. Overall, 88 of 159 patients (55%) developed postoperative complications within the first 30 days. Of those, 48 patients (55%) developed a minor complication (Clavien-Dindo lower than grade III), whereas 40 patients (45%) developed a major complication (Clavien-Dindo grade IIIA or higher). Clinicopathological characteristics were comparable between patients with and without complications (Table 1). Table 2 shows the incidence of specific postoperative complications and severity.

Performance of Prediction Models
Of the 106 patients with a PER score classified as 'low risk' by Reeh et al., 56 (52%) developed a complication; of the 35 patients with a PER score of 'medium risk', 20 (57%) developed a complication; and of the 18 patients with a PER score of 'high risk', 12 patients (67%) developed a complication. Median PER scores for patients with and without complications are shown in Table 3.
Using the model by Lagarde and colleagues, patients who did not develop complications had a median score of 22 (interquartile range [IQR] [19][20][21][22][23][24][25], and those who did develop complications had a median score of 23 (IQR 20-26) ( Table 3). The performance of both models to predict any complication are displayed in Table 4. The prediction model by Lagarde et al. had better performance and was therefore compared with the risk estimation by the surgeons.

Risk Estimation by the Surgeons
Risk estimations were made by three surgeons using the preoperative form for risk assessment. For surgeon 1, patients with a complication had a median risk score of 55% (IQR 25-65%) and patients without a complication had a median risk score of 45% (IQR 5-55%); for surgeon 2, patients with and without complications had a median risk score of 55% (IQR 35-75%) and 55% (IQR 35-75%), respectively; and for surgeon 3, patients with and without complications had a median risk score of 65% (IQR 45-65%) and 55% (IQR 35-65), respectively ( Table 3). The discriminative ability and calibration measures for each surgeon are displayed in Table 4.

Comparison of the Prediction Model and Risk Estimation by the Surgeon in Predicting Any Complication
Surgeons 1 and 2 had a lower AUC and surgeon 3 had a higher AUC than the prediction model by Lagarde et al., although these differences were small and were not statistically significant ( Table 5). The observed/expected ratio was 0.69 for the model by Lagarde et al., indicating an overestimation of the risk of complications, whereas the observed/expected ratios for the surgeons where all >1, indicating an underestimation of the risk of complications (Table 4).
This was reflected in negative NRI events , cfNRI events and IDI events scores and positive NRI nonevents , cfNRI nonevents and IDI nonevents scores (Table 5).
Overall, the NRI for surgeons 1, 2, and 3 were −8%, 1%, and 18%, respectively, indicating improvement for surgeon 3 compared with the prediction model, and a similar estimation compared with the model for surgeons 1 and 2. The cfNRI showed no improvement in the estimation from all surgeons (−6%, −3%, and 0% for surgeons 1, 2, and 3, respectively). The IDI showed small differences between the surgeons and the prediction model (−3%, 1%, and 1%, respectively). All reclassification measures are detailed in Table 5.

DISCUSSION
Risk stratification has been a hot topic in esophageal cancer surgery for many years. We investigated whether available prediction models are superior compared with the clinical judgment of the surgeon with regard to predicting the risk of postoperative complications after esophagectomy. Surgeons 1 and 2 performed similar to the prediction models and surgeon 3 performed slightly better. Moreover, the prediction models tended to overestimate the risk of any complication, whereas all surgeons tended to underestimate the risk of any complication. When estimating the risk for major complications (Clavien-Dindo grade IIIA or higher), three surgeons had an AUC of around 0.5 and poor calibration. It is of clinical relevance to identify high-risk patients in order to improve informing about their risks for complication. High-risk patients could be monitored more thoroughly postoperatively and perhaps a lower threshold for postoperative diagnostics or treatment would be justified. To our knowledge, this is the first study evaluating prediction models in comparison with the clinical judgment of specialized surgeons with regard to predicting the risk of postoperative complications after esophagectomy.
The performance of the model by Reeh and colleagues differed considerably between the current study and the original study. 14 The poor predictive performance could be explained by the differences in patient groups and treatment characteristics between both cohorts. Almost all patients in the current study received neoadjuvant chemotherapy or chemoradiotherapy, whereas none of the patients in the study by Reeh et al. were treated neoadjuvantly. Moreover, every patient in the current study underwent a transthoracic     40 Patients in the study by Lagarde et al. were all operated by the open approach and were not treated with neoadjuvant therapy, whereas in the current study, all patients were treated by a minimally invasive procedure and treated neoadjuvantly. These differences between the developmental cohort and the current study cohort might explain the differences in performance.
Clinical judgment, considering comorbidities, preoperative tests, and an estimation of a patient's ability to withstand the physical damage of surgery, are all essential in patient selection for esophagectomy. Multiple studies have demonstrated that the surgeons' clinical assessment is a good predictor of postoperative complications in major gastrointestinal surgery and is even more accurate than the POSSUM score. 41,42 Our results revealed heterogeneity between the surgeons' clinical judgment. All surgeons underestimated the risk of complications, which is also seen in another study evaluating surgeons' assessment in different types of surgery. 43 One surgeon had an acceptable AUC of 0.59, whereas others performed poorly, with an AUC ranging from 0.53 to 0.59. The lack of agreement could be explained by different factors on which surgeons base their clinical judgment or the difference in years of experience, since the most experienced surgeon had the best AUC.
Of all patients, 25% developed a major complication. Surgeons 1 and 3 performed better than surgeon 2 in predicting major complications, with an AUC of 0.59, and higher median risk scores for patients with major complications compared with patients without. Calibration also varied widely between surgeons, yet all surgeons had poor calibration measures. No other studies have compared surgeons' assessment with prediction models in assessing major complications after esophagectomy. D'Journo et al. developed and validated a risk prediction model of death within 90 days after esophagectomy. 44 This model showed good discriminative ability, with an AUC of 0.64 in the validation cohort. Future studies could compare surgeons' assessment with this model or develop a new prediction model specifically focusing on major complications.
All surgeons had negative NRI events , cfNRI events and IDI events , and positive NRI nonevents , cfNRI nonevents and NRI nonevents percentages. This indicates that surgeons underestimate the risk of a complication compared with the prediction model. The overall NRI percentage was similar to 0 for surgeons 1 and 2, and significantly positive for surgeon 3, indicating that only surgeon 3 was better at predicting overall complications than the prediction model, when utilizing a threshold of 60% risk. However, when comparing the surgeons with the prediction model without a threshold (cfNRI), one surgeon showed a weak improvement and two surgeons showed a small diminishment. When quantifying the improvement of reclassification by surgeons compared with the prediction models, the IDI scores were positive for two surgeons and negative for one surgeon, but all were close to 0%. Thus, surgeons perform similar to the prediction model in predicting overall complications based on IDI and cfNRI. To date, no studies exist that evaluate the clinical judgment of surgeons in predicting postoperative morbidity after esophagectomy to compare our results with. However, a systematic review assessing the accuracy with which surgeons can predict outcomes following many different types of surgery, including gastrointestinal, found that the surgeons' prediction of general morbidity was good and was equivalent to or better than pre-existing prediction models. 45 Evaluating the relationship between experience and accuracy in predicting complications showed that surgeon 3, the most experienced surgeon, had a higher median risk score for patients with complications, the highest AUC, and overall better calibration measures compared with surgeons 2 and 3. Surgeon 3 with the least experience has less favorable outcomes than surgeon 2. These data show that there is a trend of surgeons with more experience in predicting complications more accurately than less experienced surgeons, although no statistical tests could be performed reliably due to the low number of participants. These results are in line with another study that showed that senior surgeons were superior in predicting outcomes. 46 The present study has some limitations. First, there were few prediction models designed specifically for predicting postoperative complications after esophagectomy. Furthermore, our single-institution study provides less generalizable results. Moreover, most of these models were constructed before the implementation of minimally invasive transthoracic surgery and/or neoadjuvant therapy, which makes these models less generalizable to current practice. This also indicates that there is a need for an up-to-date prediction model. Furthermore, surgeons were not able to be blinded as to who the operating surgeon was since this team of surgeons always operates together and the scores were completed 1 day before the actual surgery. Additionally, the 10-point scale for the surgeons to indicate their estimated risk for a patient to develop a postoperative complication is a nonvalidated tool and was chosen because it is straightforward but still enables reclassification. However, this form can feel counterintuitive to some who would prefer a dichotomous scale or a visual analogue scale. The risk of complications should be weighed against the 'risk' of a complete pathological response, in which case surgery could be omitted. In this study, complication and pathological complete response (pCR) rates were 25% and 26%, respectively. Unfortunately, it is still not possible to reliably predict a cPR, as 60% of patients in a recent study still had vital tumor even though the clinical response evaluation was negative. 47 Perhaps in the future, with the evolving of (imaging) techniques, this rate will improve and patients can be safely offered active surveillance. This study has found that generally, surgeons underestimate the risk of complications and the prediction models overestimate the risk of complications. When comparing both, two surgeons predicted complications similar to the prediction model and one surgeon predicted complications slightly better than the prediction model. The surgeon's assessment is therefore important when counseling patients about the risks of esophageal surgery in addition to prediction models. However, there was a large heterogeneity between the risk estimations between surgeons. This implicates that both prediction models and the clinical judgment of the surgeons are equally useful, and possibly combining both might lead to the best risk assessment. Discussing patients in multidisciplinary teams with multiple surgeons and other specialists might benefit the risk estimation of the surgeon. One study evaluating the ability of the surgeon to predict complications among different types of surgery incorporated the surgeons' assessment in a previously developed multifactorial model and found an improved discriminative ability. 43 More so, evidence suggests that exposure to pre-existing prediction models leads to less varied and more accurate judgments of operative risk among surgeons and thus should be used in tandem with their gut feeling. 45,48 Therefore, further studies could validate our findings and incorporate surgeons' assessment in prediction models, or combine prediction models specifically for esophageal surgery with the clinical judgment of the surgeon. Given the finding that surgeons generally underestimated the risk of postoperative complications, it would be valuable to assess if providing feedback to the surgeons would help improve their estimation. Another interesting endpoint could be if the clinical judgment of the surgeon directly after surgery (incorporating blood loss, quality of the gastric conduit) changes their estimation. This does not facilitate the possibility to better inform patients about the risk of complications, but has the benefit of identifying high-risk patients. In addition, since major complications result in more postoperative mortality and decreased quality of life, future studies should therefore not only focus on predicting any complication but also on predicting major complications. We aim to conduct a follow-up study developing a new prediction model that takes into account the current treatment of neoadjuvant therapy and minimally invasive surgery. We will consecutively analyze if incorporation of the estimation by the surgeon would benefit this model. One of the endpoints in this follow-up study would be the inter-surgeon variability in predictions, and also to identify factors that contribute to a correct prediction by surgeons.

CONCLUSION
This study demonstrated that surgeons' assessment differs between surgeons and varies between similar to slightly better than the prediction models in predicting the risk of postoperative complications after esophageal cancer surgery. Prediction models could be used in tandem with surgeons' own risk estimation. Future studies are required in order to assess the benefit of incorporating the surgeon's assessment in prediction models to reach a higher level of predicting outcomes in this patient group with high chances of postoperative complications.

FUNDING None
DISCLOSURE No sources of support such as grants, equipment, and drugs were granted for this manuscript.
OPEN ACCESS This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.