Development and external validation of clinical prediction models for pituitary surgery

Introduction Gross total resection (GTR), biochemical remission (BR), and restitution of a previously disrupted hypothalamic-pituitary axis (new improvement, IMP) are important outcome measures in pituitary adenoma (PA) resection surgery. Prediction of these metrics using simple, preoperatively available data might help improve patient care and contribute to more personalized medicine. Research question This study aims to develop machine learning models predicting GTR, BR, and IMP in PA resection surgery using preoperatively available data. Material and methods With data from patients undergoing endoscopic transsphenoidal surgery for PAs, machine learning models for prediction of GTR, BR, and IMP were developed and externally validated. Development was carried out on a registry from Bologna, Italy, while external validation was conducted using patient data from Zurich, Switzerland. Results The model development cohort consisted of 1203 patients. GTR was achieved in 207 (17.2%; 945 (78.6%) missing), BR in 173 (14.4%; 992 (82.5%) missing), and IMP in 208 (17.3%; 167 (13.9%) missing) cases. The external validation cohort included 206 patients; GTR was achieved in 121 (58.7%; 32 (15.5%) missing), BR in 46 (22.3%; 145 (70.4%) missing), and IMP in 42 (20.4%; 7 (3.4%) missing) cases. The AUC at external validation amounted to 0.72 (95% CI: 0.63–0.80) for GTR, 0.69 (0.52–0.83) for BR, and 0.82 (0.76–0.89) for IMP. Discussion and conclusion All models showed adequate generalizability, performing similarly in training and external validation, confirming the potential of machine learning to help adapt surgical therapy to the individual patient.


Introduction
Pituitary adenomas (PA) account for roughly 15% of intracranial tumors and can be resected by transsphenoidal surgery (TSS) in the majority of cases (Thapar et al., 2001). TSS has been adopted as the 'gold standard' approach due to its minimal invasiveness with concomitantly low morbidity and mortality (Kanter et al., 2005).
Many variables play a role in determining the surgical and endocrinological outcomes of pituitary surgery, as analyzed in several publications (Lobatto et al., 2018; Zhou et al., 2017; Braileanu et al., 2019; Hensen et al., 1999). This burdens clinicians with a multitude of variables to consider in surgical decision-making. In times when "big data" is easily accessible, machine learning (ML), at least in theory, promises the ability to integrate all of these factors to better guide clinicians based on the individual characteristics of the patient (Obermeyer and Emanuel, 2016; Stumpo et al., 2022).
Using ML to predict the likelihood of endocrinological endpoints, such as biochemical remission (BR) or the restitution of a previously disrupted hypothalamic-pituitary (HP) axis (new improvement, IMP), as well as gross total resection (GTR), from simple preoperative data would benefit clinicians and patients by improving clinical decision-making and thereby patient outcomes.
To date, none of the clinical prediction models for pituitary surgery outcomes have been externally validated (Stumpo et al., 2022). External validation is a critical step in evaluating the applicability of any model before introduction into clinical practice, as internal validation and resampling techniques allow only very limited conclusions about the ultimate performance on new patients (generalization) (Collins et al., 2014). Therefore, we aimed to create externally validated, clinically applicable prediction models for the aforementioned outcomes after TSS for PAs.

Overview
Prediction model development was carried out using data of patients treated with endoscopic TSS at the Department of Neurosurgery, IRCCS Institute of Neurological Sciences of Bologna. The models were trained to predict GTR, BR, and IMP, respectively. External validation of the trained models was subsequently performed with patient data provided by the Department of Neurosurgery, University Hospital Zurich. In this study we adhere to the methodology described in a previous publication (Zanier et al., 2021) but apply it to additional new data. All examinations were conducted adhering to the principles of transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) (Collins et al., 2015).

Ethical considerations
Patient data were treated according to the ethical standards of the Declaration of Helsinki and its amendments as approved by our institutional committee (Cantonal Ethics Committee Zürich, KEK St-V-Nr, 2015-0242) and the interhospital Ethical Committee of Bologna City (protocol CE17143, February 2018).

Data sources
This study was conducted using data from two centers, one of which was used for model training while data from the other center was used to externally validate all models. Patients who underwent endoscopic TSS for PA from August 1998 to January 2020 in Bologna and from July 2013 to May 2020 in Zurich were included in this study. All pre- and postoperative assessments as well as intraoperative procedures were carried out as specified in earlier publications (Maldaner et al., 2018; Serra et al., 2016). All included patients had to have information available for at least one of the three outcome measures, and patients who underwent transcranial or combined procedures were excluded.

Outcomes
Classification machine learning models were trained for the prediction of GTR, BR, and IMP as endpoints. GTR was strictly defined as an extent of resection of 100%. The evaluation of the extent of resection was based on MRI acquired three months after surgery and was conducted by a board-certified neurosurgeon with extensive experience in pituitary imaging, in collaboration with an experienced neuroradiologist. Furthermore, BR was defined as the reduction of hormonal levels back into reference ranges, while IMP was defined as the restoration of one or more previously disrupted hypothalamic-pituitary axes into the normal reference range of the respective hormones, as specified by international guidelines (Giustina et al., 2020). Note that when calculating BR and IMP, additional treatment modalities such as medical and radiation therapy were taken into account, as withholding these essential treatments would be unethical.
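For illustration, the three endpoint definitions above can be written as simple predicates. The field names below are hypothetical, not those of the study registry; this is a minimal sketch of the definitions only, not of the actual data pipeline:

```python
def label_outcomes(eor_percent, postop_hormones, preop_deficient_axes, postop_normal_axes):
    """Derive the three binary endpoints from hypothetical chart fields.

    eor_percent: radiologically assessed extent of resection (0-100)
    postop_hormones: list of (value, (low, high)) postoperative levels with reference ranges
    preop_deficient_axes / postop_normal_axes: sets of HP axis names
    """
    gtr = eor_percent == 100.0  # strict definition: nothing short of 100% counts
    br = all(low <= value <= high for value, (low, high) in postop_hormones)
    imp = len(preop_deficient_axes & postop_normal_axes) >= 1  # at least one axis restored
    return gtr, br, imp
```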

Model development
As previously described, the prediction models were derived using data from Bologna and then externally validated on patient data from Zurich. Both datasets were shuffled randomly before being checked for equal class distribution. In a next step, recursive feature elimination was applied to all initially available variables in order to arrive at a sparse model (Staartjes et al., 2022). In terms of model architecture, we applied support vector machines (SVMs), random forests (RFs), and bagged classification and regression trees (CARTs). The models were then trained and selected based on the area under the receiver operating characteristic curve (AUC) through 10 iterations of 10-fold cross-validation. In parallel, we also trained a k-nearest neighbour (KNN) algorithm, which allowed for any current and future imputation of missing data (Batista and Monard, 2003). Since our models provide continuous probabilities, we binarized the results based on the closest-to-(0, 1) criterion for model performance evaluation (Perkins and Schisterman, 2006). To evaluate discrimination, several metrics were employed, including the AUC, accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). Nonparametric 95% confidence intervals (CI) of these metrics were computed. We also evaluated model calibration using the calibration curve intercept and slope. Finally, variable importance was computed for each of the models in a universal AUC-based approach before being scaled from 0 to 100 (Kuhn, 2008). We carried out all our examinations using R version 4.0.2 (R Core Team, 2017).
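The closest-to-(0, 1) criterion picks the probability cutoff whose point on the ROC curve lies nearest to the ideal corner of perfect sensitivity and specificity. A minimal sketch (in Python rather than the R used in the study):

```python
import numpy as np

def closest_to_01_threshold(y_true, y_prob):
    """Pick the cutoff whose ROC point (1 - specificity, sensitivity)
    lies nearest to the ideal corner (0, 1)."""
    best_t, best_d = 0.5, np.inf
    for t in np.unique(y_prob):  # every observed probability is a candidate cutoff
        y_pred = (y_prob >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        d = np.sqrt((1 - sens) ** 2 + (1 - spec) ** 2)  # Euclidean distance to (0, 1)
        if d < best_d:
            best_t, best_d = t, d
    return best_t
```

With well-separated classes, the selected cutoff sits between the two probability clusters, where both sensitivity and specificity are maximal.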

Model performance
An overview of model performances, including calibration metrics and training performance, is supplied in Table 2 and the related calibration plots are provided in Fig. 1.
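The nonparametric confidence intervals reported alongside these metrics can be obtained with a percentile bootstrap. The following sketch (Python, whereas the study used R) illustrates the idea for the AUC:

```python
import numpy as np

def auc(y_true, y_prob):
    """AUC via the Mann-Whitney U formulation: probability that a random
    positive case is ranked above a random negative case."""
    pos, neg = y_prob[y_true == 1], y_prob[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def bootstrap_ci(y_true, y_prob, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for any metric(y_true, y_prob)."""
    rng = np.random.default_rng(seed)
    n, stats = len(y_true), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample patients with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue  # skip resamples lacking one of the two classes
        stats.append(metric(y_true[idx], y_prob[idx]))
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```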

Improvement of one or more HP axes (IMP)
During external validation, the bagged CART model trained to predict IMP attained an AUC of 0.82 (0.76–0.89), with a sensitivity of 0.88 (0.78–0.97) and a specificity of 0.72 (0.65–0.79). An NPV of 0.96 (0.92–0.99) was achieved.

Variable importance
Fig. 2 and Table 3 provide a synopsis of the variable importance for each of the developed models. Knosp classification and patient age contributed most to the prediction of GTR, while preoperative deficits of ACTH and TSH contributed most to the prediction of BR. Lastly, the number of disrupted hypothalamic-pituitary axes and a GnRH deficit had the greatest influence on the prediction of improvement of one or more HP axes (IMP).
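The universal AUC-based importance measure (Kuhn, 2008) scores each variable by how well it separates the two classes on its own and then rescales the scores to 0–100. A sketch of this filter approach in Python (the study used R):

```python
import numpy as np

def auc_importance(X, y):
    """Filter-style variable importance: per-variable ROC AUC,
    folded around 0.5 and rescaled to the range 0-100."""
    def auc(scores):
        pos, neg = scores[y == 1], scores[y == 0]
        wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
        return wins / (len(pos) * len(neg))
    raw = np.array([auc(X[:, j]) for j in range(X.shape[1])])
    raw = np.maximum(raw, 1.0 - raw)  # direction of the association does not matter
    return 100.0 * (raw - raw.min()) / (raw.max() - raw.min())
```

Because the scores are rescaled, the least informative variable always maps to 0 and the most informative to 100, which is how Fig. 2 should be read: as relative, not absolute, importance.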

Deployment
We integrated our prediction models into a freely accessible web application at https://neurosurgery.shinyapps.io/pituicalc.

Discussion
With multicenter data of between roughly 250 and 1200 patients, depending on the model, we developed and rigorously externally validated clinical prediction models that demonstrated moderate ability to predict GTR, BR, and IMP following TSS for PA. Calibration performance was adequate. Although generalizable models were derived, their added value in clinical practice needs to be critically evaluated, and their performance compared to that of human experts.
Surgical and especially endocrinological outcomes after pituitary surgery are notoriously hard to predict preoperatively (Sorba et al., 2021; Fatemi et al., 2008; Dhandapani et al., 2016). Currently, physicians approach questions like "How likely is it that you can remove the tumor completely?" by citing numbers derived from case series in the literature or from their own experience. However, this can hardly be considered "precision medicine". Existing approaches to make prediction of surgical and endocrinological outcomes more individualized include the use of classifications, e.g., the Knosp classification or the Zurich Pituitary Score for GTR, to stratify patients into large groups with different risk-benefit profiles (Knosp et al., 1993; Serra et al., 2018). Still, this does not allow any statement about a particular patient's risk-benefit profile and can hardly be considered precise on an individual level. It is exactly here that clinical prediction models, including ML techniques, have promised improvements by integrating relatively complex sets of variables, enabling personalized predictions for each individual patient (Obermeyer and Emanuel, 2016; Stumpo et al., 2022). In some cases, models have even been shown to outperform human medical experts (Senders et al., 2018). Nevertheless, some factors obviously cannot be accounted for by any model or would simply be too cumbersome to collect, thus preventing efficient clinical use. Clinical prediction models should therefore be considered merely as assistive tools that may aid physicians' decision-making, but should never replace the literature, imaging, and medical expertise.
Using data from a single center and externally validating our models on data from a separate center, we have generated clinical prediction models for GTR, BR, and IMP. A specific goal of our study was to keep the number and complexity of preoperative variables required as inputs to a clinically applicable minimum. On the one hand, inclusion of actual medical imaging files or sophisticated measurement methods might increase predictive performance; on the other hand, this would likely preclude any widespread clinical application.
Predicting endocrinological outcomes or GTR is only an initial step that shows the potential machine learning offers and its possible impact on clinical decision-making. Important complicating factors like pituitary apoplexy could be included in the prediction models, or risks of postoperative complications like transient hyponatremia, a primary cause of recurrent hospitalization after TSS for PA, could be predicted (Zoli et al., 2016, 2017). To date, these remain immensely difficult to foresee, but the power of machine learning to distill simple information from complex data might help with the management of such complications.
The performance measures obtained at external validation show that predicting postoperative results based on simple preoperative data is quite challenging. Nonetheless, our models predicted the outcomes mentioned above with acceptable calibration and adequate discrimination.
There have been previous attempts to derive prediction models for outcomes after pituitary surgery. However, to the best of the authors' knowledge, almost none of the models for outcomes after surgery for PA have been externally validated. Only two externally validated prediction models exist, and both target only surgery for GH-secreting PA. Qiao et al. (2021) predicted early BR after surgery in patients with acromegaly, with externally validated AUCs of 0.77–0.85. Likewise, we previously predicted BR, cerebrospinal fluid leaks, and GTR with AUCs of 0.63–0.77 at external validation (Zanier et al., 2021). Compared to these studies, our model performances lie within a similar range. While these two models are the only ones externally validated, multiple models without external validation have indeed been published for pituitary surgery in general. Stumpo et al. (2022) recently reviewed the literature on machine learning in pituitary surgery. A model developed by Hollon et al. (2018) demonstrated AUCs ranging between 0.80 and 0.85 based on internal validation. Fan et al. (2019) used radiomic data to predict BR among hormone-secreting adenomas, demonstrating an internal validation AUC of 0.81. Staartjes et al. (2018, 2019) demonstrated that neural networks classified GTR likelihood more accurately than the Knosp classification or a logistic regression model, with an internally validated AUC of 0.87, and that even intraoperative cerebrospinal fluid leaks can be predicted with comparable performance.
Without external validation, achieving relatively good performance in cross-validation or in a held-out internal validation cohort is fairly straightforward, but this does not in any way imply similar model performance when applied to other cohorts (Staartjes and Kernbach, 2020).
The performance of human experts in predicting these outcomes has not yet been evaluated systematically, but it is likely to be inferior to the performance measures observed in this study. In other domains of neurosurgery, it has been shown that, e.g., new neurological deficits or outcomes after spine surgery are only poorly predicted by neurosurgeons (Sagberg et al., 2017). Compared to current approaches, our models are at least able to provide an objective benchmark of expected outcomes on an individual level. Such objective benchmarks can be useful when comparing quality between centers, when evaluating scientific research, or simply as a "second opinion". In patient cases where the indication for surgery is not clear-cut, a model predicting a rather high chance of GTR and BR might help strengthen the decision to follow through with surgical intervention. In contrast, a predicted low chance of GTR, considered together with other established sources of information, might lead to a change in surgical indication. Taking these points into account, we do not recommend using clinical prediction models on their own as decisive components of clinical decision-making.
Our models have been integrated into a freely available web application. We encourage physicians to attempt implementation into clinical practice, taking into account, however, the developmental stage and limitations of the models.

Limitations
There are limitations arising from any prediction model, including the ones built in this study: countless measurable and immeasurable factors such as surgeon experience, caseload, postoperative management protocols, and others limit the generalizability of any parsimonious prediction model (Sorba et al., 2021; Barker et al., 2003). This means that our models may be ill-suited for institutions that use wholly different treatment and postoperative management methods and may give erroneous results there. Furthermore, if the input data fall outside the range of the data on which the model was trained, it is unlikely to provide accurate information (extrapolation).
Despite having externally validated our models, which shows that they already generalize, a multicenter training dataset could further improve the generalizability of our predictive models with respect to factors such as those mentioned previously. The availability of more preoperative data, such as sodium and potassium concentrations, peripheral hormone levels, or surrogates like the fT4/TSH quotient, could also improve performance, but collecting and entering more data into the web application may be cumbersome.
While the Knosp and Hardy scores are still commonly determined in clinical practice, they have rather low interrater reliability for intermediate scores (Mooney et al., 2017a, 2017b). Inclusion of simpler and more dependable scoring systems could potentially lead to additional enhancements in model performance.
Finally, although our cohort included over 1200 PA patients, one of the largest cohorts in the current literature, the retrospective nature of this study means it encompasses a significant number of patients with missing information for some of the outcomes. In that respect, a higher number of training samples would likely further improve model performance and should be considered in future development.

Conclusion
Based on a large cohort of patients with PAs, prediction of GTR, BR, and IMP was feasible with moderate to good performance at external validation, thereby confirming generalizability. Although outcomes after pituitary surgery, especially endocrinological ones, are hard to predict, our results reinforce the role of clinical prediction models as assistive tools in surgical decision-making.

Fig. 1 .
Fig. 1. Calibration curves of the prediction models during training (A) and external validation (B). Within each row, gross total resection (GTR) is shown on the left, followed by biochemical remission (BR) in the middle and improvements (IMP) on the right. The predicted probabilities for the outcomes are distributed into five equally sized groups and contrasted with the observed frequencies of the outcomes. Calibration intercept and slope are then calculated. A perfectly calibrated model has a calibration intercept of 0 and a slope of 1. The calibration intercept is influenced by the frequency of the outcome of interest in a certain population. Metrics are provided with bootstrapped 95% confidence intervals.
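The grouping described in the caption can be sketched as follows: sort the predicted probabilities into five equally sized quantile groups and compare each group's mean prediction with the observed outcome frequency (Python sketch; the study used R):

```python
import numpy as np

def calibration_bins(y_true, y_prob, n_bins=5):
    """Quantile-based calibration curve: mean predicted probability vs.
    observed outcome frequency in equally sized groups."""
    order = np.argsort(y_prob)              # sort patients by predicted risk
    groups = np.array_split(order, n_bins)  # five equally sized groups
    pred_mean = np.array([y_prob[g].mean() for g in groups])
    obs_freq = np.array([y_true[g].mean() for g in groups])
    # a perfectly calibrated model has pred_mean close to obs_freq in every group
    return pred_mean, obs_freq
```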

Fig. 2 .
Fig. 2. Variable importance based on AUC for the three models, with importance values scaled from 0 to 100: gross total resection (A), biochemical remission (B), and new improvements (C).

Table 1
Patient characteristics and incidence of outcomes.

Table 2
Quantitative evaluation of discrimination and calibration of the prediction models.

Table 3
AUC-based relative variable importance of the prediction models.