Development and validation of a multivariable risk factor questionnaire to detect oesophageal cancer in 2-week wait patients

Introduction Oesophageal cancer is associated with poor health outcomes. Upper GI (UGI) endoscopy is the gold standard for diagnosis but is associated with patient discomfort and a low yield for cancer. We used a machine learning approach to create a model which predicted oesophageal cancer based on questionnaire responses. Methods We used data from two separate prospective cross-sectional studies: the Saliva to Predict rIsk of disease using Transcriptomics and epigenetics (SPIT) study and the predicting RIsk of diSease using detailed Questionnaires (RISQ) study. We recruited patients from National Health Service (NHS) suspected cancer pathways as well as patients with known cancer. We identified the patient characteristics and questionnaire responses most associated with oesophageal cancer. Using the SPIT dataset, we trained seven different machine learning models, selecting the one with the best area under the receiver operating characteristic curve (AUC) to create our final model. We further applied a cost function to maximise cancer detection. We then independently validated the model using the RISQ dataset. Results 807 patients were included in model training and testing, split in a 70:30 ratio. 294 patients were included in model validation. The best model during training was regularised logistic regression using 17 features (median AUC: 0.81, interquartile range (IQR): 0.69–0.85). For the testing and validation datasets, the model achieved an AUC of 0.71 (95% CI: 0.61–0.81) and 0.92 (95% CI: 0.88–0.96) respectively. At a set cut-off, our model achieved a sensitivity of 97.6% and a specificity of 59.1%. We additionally piloted the model in 12 patients with gastric cancer; 9/12 (75%) were correctly classified. Conclusions We have developed and validated a risk stratification tool using a questionnaire approach. This could aid the prioritisation of patients at high risk of having oesophageal cancer for endoscopy.
Our tool could help address endoscopic backlogs caused by the COVID-19 pandemic.


Introduction
Oesophageal cancer represents the seventh most common cause of cancer morbidity and the sixth most common cause of cancer-related death worldwide [1]. There are two major histological subtypes: oesophageal squamous cell carcinoma (OSCC) and oesophageal adenocarcinoma (OAC). OSCC comprises 90% of oesophageal cancer cases worldwide and is predominantly found in Central Asia, East Asia and East Africa. Conversely, OAC makes up the remainder but is the dominant histological subtype in the Western world, including the United Kingdom (UK) [2–4]. Crucially, oesophageal cancer is often diagnosed late; in the UK, 48% of cases with available staging information are diagnosed at stage IV, while 10-year survival stands at 12%, significantly worse than for other cancer types [4,5]. Although gastro-oesophageal junction (GOJ) cancers represent a heterogeneous entity, they share common risk factors with oesophageal cancers. As such, they have historically been included in studies with oesophageal cancer [6].
While upper gastrointestinal (UGI) endoscopy remains the gold standard in the diagnosis of oesophageal and GOJ cancers, it is expensive, uncomfortable for the patient and has a low yield for cancer [7]. A UK series of over 580,000 patients demonstrated that only 2.1% of patients undergoing UGI endoscopy were subsequently found to have cancer, while other serious pathology such as peptic ulceration was found in a further 11.6% [7]. These figures suggest that better selection of patients who are likely to have oesophageal cancer could help to prioritise higher-risk patients, better manage demand for endoscopy and improve overall patient experience. Furthermore, previous work from our group demonstrated a nationwide endoscopic procedural backlog as a result of the COVID-19 pandemic [8]. In addition, during the first 6 months of the pandemic there were decreases in pathological diagnoses of Barrett's oesophagus and OAC [9]. There is potential for worse patient outcomes as a direct consequence of the disruption of endoscopy services, on top of the already poor outcomes for oesophageal cancer. Modelling studies have demonstrated that disruption to National Health Service (NHS) cancer pathways could lead to excess deaths and life years lost due to delays in diagnosis [10,11].
Several research groups have created scoring systems to try to improve the detection of oesophageal cancer. The Edinburgh Dysphagia Scale (EDS) categorises patients into low- or high-risk groups using a combination of patient characteristics and symptoms. In a validation cohort it achieved a sensitivity of 98.4%, but specificity was low at 9.3% [12,13].
Increasingly, machine learning (ML) methods, which apply mathematical approaches to generate computerised algorithms, have been used to develop triaging models which could optimise the use of resources. These models can calculate an individual's risk of having a disease [14]. Our group previously used an ML approach to develop a risk prediction model for Barrett's oesophagus, with the aim of improving the selection of patients who should be referred for UGI endoscopy for confirmation [14]. Similar approaches have been used for risk stratification in acute UGI bleeding, where they even outperformed standard-of-care clinical scoring systems [15].

Aims and objectives
The aim of this study was to use an ML approach to train, test and then independently validate a risk stratification tool which could be used to predict the risk of detecting oesophageal and GOJ cancers in unselected patients referred through NHS 2-week wait (2WW) suspected UGI cancer pathways. In addition, we aimed to trial the tool in a limited number of patients with gastric cancer, as both cohorts of patients require the same endoscopic investigation for diagnosis.

Participant selection and dataset description
Patients were selected from two separate prospective cross-sectional studies: the Saliva to Predict rIsk of disease using Transcriptomics and epigenetics (SPIT) study (ISRCTN: 11921553) and the predicting RIsk of diSease using detailed Questionnaires (RISQ) study (ISRCTN: 74930639). For the SPIT study, participants were recruited from those referred for UGI endoscopy through the NHS 2WW suspected UGI cancer pathway at 19 UK hospitals between September 2017 and May 2022. Major inclusion criteria were age over 18 and the ability to provide informed consent. The exclusion criterion was pregnancy. For the RISQ study, participants were recruited from 2 UK hospitals between January 2020 and May 2022 using identical inclusion and exclusion criteria. Patients in both studies completed a symptom and risk factor questionnaire, either independently or with support from a research nurse, immediately before undergoing diagnostic endoscopy. The SPIT study questionnaire consisted of 209 questions; it was multidimensional, with different domains that are known to impact disease risk (Supplementary Table 1). Major symptoms of oesophageal cancer such as dysphagia, odynophagia and weight loss were included [16]. Questions on the duration and severity of acid reflux symptoms were adapted from both the gastro-oesophageal reflux disease (GORD) impact scale and the GORD questionnaire [17–19]. We included wider questions on medication use, food intake, anxiety and depression, loneliness and local engagement, as these have been found to affect health outcomes; we were interested to see whether these factors might add additional richness to our model [20–22]. We used existing validated questionnaires in several sections, including the Hospital Anxiety and Depression Scale (HADS), the University of California Los Angeles (UCLA) 3-item loneliness scale and the dysphagia score by Mellow and Pinkas [23–25]. The RISQ questionnaire was a shortened version of the SPIT questionnaire with 17 questions (Supplementary Table 1), which were selected following initial model training [26]. Subsequent endoscopic and histological results were linked to the patient's questionnaire responses. As participants underwent endoscopic investigation as part of their routine care, endoscopists and histopathologists were both blinded to questionnaire responses. Questionnaires and outcomes were recorded using a bespoke electronic software programme (TrialSense, London, UK).

Data handling
All data handling and analysis was completed using R software version 4.1.2 (R Core Team, Vienna, Austria) [27]. Prior to data analysis, the dataset was manually cleaned for data input issues and errors. We excluded any fields which had greater than 20% of responses missing. We performed missing data imputation using the 'missForest' package [28]. The process flow for our study is shown in Fig. 1.
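The cleaning and imputation steps above were performed in R with 'missForest'; as an illustration of the same idea, a minimal Python sketch using scikit-learn's iterative imputer with a random-forest estimator (an analogue of missForest, not the study's actual code, with made-up toy data) might look like:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

def clean_and_impute(df: pd.DataFrame, max_missing: float = 0.20) -> pd.DataFrame:
    """Drop columns with >20% of responses missing, then impute the
    remainder with an iterative random-forest imputer (missForest analogue)."""
    kept = df.loc[:, df.isna().mean() <= max_missing]
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=50, random_state=0),
        max_iter=10,
        random_state=0,
    )
    return pd.DataFrame(imputer.fit_transform(kept), columns=kept.columns)

# Toy example: one column exceeds the missingness threshold and is dropped.
df = pd.DataFrame({
    "age": [55, 63, np.nan, 71, 48],
    "weight_loss": [1, 0, 1, np.nan, 0],
    "mostly_missing": [np.nan, np.nan, np.nan, 1, np.nan],
})
out = clean_and_impute(df)
print(list(out.columns))  # 'mostly_missing' (80% missing) is excluded
```

The 20% threshold mirrors the field-exclusion rule described above; the remaining missing values are then filled in rather than dropping whole patients.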

Feature selection
We used feature analysis to determine the most important predictors of oesophageal cancer. We used both information gain and chi-squared correlation-based feature selection from the 'FSelector' package, as per our previous methodology, with a 50:50 weighting given to each method before producing a final ranking [14,29]. In brief, information gain is an ML method in which each feature is assessed separately for its correlation with the variable of interest. Correlation-based feature selection assesses multiple features together and removes any features which are highly correlated with each other [14]. We were then able to select the most discriminating features to predict the presence of oesophageal cancer.
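The ranking itself was done in R with the 'FSelector' package; the sketch below illustrates the general approach in Python, substituting scikit-learn's mutual information (an information-gain analogue) and chi-squared scores and averaging the two ranks 50:50. This is an illustrative stand-in on synthetic data, not the study's code:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, chi2
from sklearn.preprocessing import MinMaxScaler

def rank_features(X, y, n_top=17):
    """Rank features by averaging (50:50) their ranks under an
    information-gain analogue (mutual information) and a chi-squared
    score, then keep the top n."""
    X_pos = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative inputs
    mi = mutual_info_classif(X, y, random_state=0)
    chi_scores, _ = chi2(X_pos, y)
    # Rank 0 = highest score under each criterion.
    mi_rank = np.argsort(np.argsort(-mi))
    chi_rank = np.argsort(np.argsort(-chi_scores))
    combined = 0.5 * mi_rank + 0.5 * chi_rank
    return np.argsort(combined)[:n_top]

# Synthetic stand-in for the questionnaire feature matrix.
X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           random_state=0)
top = rank_features(X, y, n_top=17)
print(len(top))  # 17 feature indices, best combined rank first
```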

Model training and testing
We split the SPIT dataset in a 70:30 ratio for model training and testing respectively. We performed 10-fold cross-validation during model training. We used seven supervised ML methods from the R software 'caret' package [30]: linear discriminant analysis (caret function: lda), classification and regression tree (cart), k-nearest neighbour (knn), support vector machines (svm), random forest (rf), logistic regression (glm) and regularised logistic regression (glmnet). In particular, regularised logistic regression applies either ridge or lasso regularisation, which helps to simplify any generated models and prevent overfitting of the data [31]. This is denoted by an elastic net mixing parameter termed alpha [34]. A regularisation coefficient (lambda) is applied to the overall model, which also penalises more complex models [31]. We assessed the performance of the model using receiver operating characteristic (ROC) curves and calculated the area under the ROC curve (AUC) using the 'pROC' package [32]. Our method automatically selects the optimal alpha and lambda values to maximise the AUC. Finally, to ensure that the model was weighted such that there was an increased penalty for the misclassification of cancers, we applied a cost function to our final model using the 'ROCR' package [33]. The penalty for false negatives (i.e., missing a cancer case) was set at 50 times that for false positives. This was then used to determine the ideal threshold above which the model would predict the presence of cancer. We assessed the performance of the cost function and the associated threshold using sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV).
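The study's pipeline used R's caret, pROC and ROCR packages; the following scikit-learn sketch shows the equivalent moving parts — an elastic-net-regularised logistic regression, AUC evaluation, and a threshold chosen so that a false negative costs 50 times a false positive. The synthetic data and hyperparameter values here are illustrative assumptions, not the study's actual settings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic stand-in for the 17-feature questionnaire matrix (~10% cancers).
X, y = make_classification(n_samples=800, n_features=17, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=0)

# Elastic-net logistic regression: l1_ratio plays the role of alpha,
# and 1/C the role of lambda in the glmnet formulation.
model = LogisticRegression(penalty="elasticnet", solver="saga",
                           l1_ratio=0.1, C=1.0, max_iter=5000)
model.fit(X_tr, y_tr)
probs = model.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, probs)

# Cost-weighted threshold: a false negative (missed cancer) costs 50x
# a false positive, so sweep the ROC thresholds and minimise total cost.
fpr, tpr, thresholds = roc_curve(y_te, probs)
n_pos, n_neg = int(y_te.sum()), int((y_te == 0).sum())
cost = 50 * (1 - tpr) * n_pos + 1 * fpr * n_neg
best_threshold = thresholds[np.argmin(cost)]
print(f"AUC={auc:.2f}, threshold={best_threshold:.3f}")
```

Because missed cancers dominate the cost, the selected threshold sits well below 0.5, trading specificity for near-complete sensitivity, which matches the behaviour reported in the Results.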

Model validation
Patients in the validation dataset were from the RISQ study. We additionally enriched this cohort with further patients with confirmed oesophageal cancer. We also collected data from patients with confirmed gastric cancer to assess the performance of the model in this group.

Statistical analysis
There is no generally accepted approach to estimating sample size requirements for the derivation and validation of risk prediction models. Discrete variables are presented as numbers and percentages, while continuous variables are presented as mean and standard deviation (SD) or median and inter-quartile range (IQR). To compare the normal and cancer groups, we used t-tests or chi-squared (χ2) tests depending on the variable. A p-value below 0.05 was taken as significant. We followed TRIPOD guidelines in the preparation of this manuscript [34].
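As a concrete illustration of these group comparisons (with made-up numbers, not the study's data), a t-test for a continuous variable and a chi-squared test for a categorical one can be run in Python with SciPy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical data: ages in the cancer and non-cancer groups.
age_cancer = rng.normal(72, 9, 40)
age_control = rng.normal(63, 12, 250)
t_stat, p_age = stats.ttest_ind(age_cancer, age_control)

# Hypothetical 2x2 contingency table: weight loss (rows) by cancer status.
table = np.array([[28, 60],    # weight loss: cancer, non-cancer
                  [12, 190]])  # no weight loss
chi2_stat, p_wl, dof, _ = stats.chi2_contingency(table)
print(f"t-test p={p_age:.4f}, chi-squared p={p_wl:.4f}")
```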

Demographics
A total of 807 patients were included in model training (566 patients) and testing (241 patients), while 294 patients were included in model validation. Full demographic information and breakdown is presented in Table 1. A total of 65, 27 and 42 oesophageal and gastro-oesophageal junction (GOJ) cancer cases were included in the training, testing and validation datasets respectively. TNM staging information, where available, is presented in Supplementary Table 2. 80%, 93% and 83% of cancers in the training, testing and validation datasets respectively were OACs. There was no statistically significant difference in the distribution of either cancer site (χ2 = 2.21, p = 0.33) or histology (χ2 = 5.06, p = 0.28) across the three datasets. Notably, cancer patients were older and more likely to have had some quantified weight loss across the three datasets. Cancer patients across all three datasets were less likely to suffer from psychological disorders compared to non-cancer patients.

Selection of features
251 features were available for analysis, including data from the questionnaire and the endoscopy and histology results. Prior to imputation, we removed 101 features which had data missing in more than 20% of patients. We subsequently selected the 17 features which were top ranked for the prediction of oesophageal and GOJ cancers (Supplementary Table 3). Top-ranked features were multidimensional and included demographic information, symptoms, and psychological and food-related variables. These top-ranked features were selected for ML model development.

Development of ML model
Fig. 2 demonstrates the distribution of the area under the receiver operating characteristic curve (AUC) after 10-fold cross-validation in the training dataset, as well as the median and IQR for AUC, sensitivity and specificity for the seven different ML methods. The best performing model, with the highest median AUC, was regularised logistic regression (AUC: 0.81, IQR: 0.69–0.85). The regularised logistic regression model was associated with an alpha value of 0.1 and a lambda value of 0.0149. We selected this model for testing and validation and for determining appropriate cut-offs in the cost function.

Testing and validation of model
Fig. 3 demonstrates the receiver operating characteristic (ROC) curve for regularised logistic regression, with the final model reapplied to the training dataset for reference purposes and tested on both the testing and independent validation datasets. For the testing dataset the model achieved an AUC of 0.71 (95% CI: 0.61–0.81). For validation, the model achieved an AUC of 0.92 (95% CI: 0.88–0.96). Table 2 demonstrates the sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) for each dataset with the cost function applied. Our final model achieved a sensitivity and specificity of 96.3% and 20.1% respectively for the testing dataset. For the validation dataset, the sensitivity and specificity were 97.6% and 59.1% respectively. The final model with coefficients is presented in Supplementary Table 4.
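For reference, sensitivity, specificity, PPV and NPV all follow directly from a 2×2 confusion matrix. The sketch below uses counts back-calculated from the reported validation figures (42 cancers among 294 patients); these reconstructed counts are approximate illustrations, not taken from the paper's tables:

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard diagnostic accuracy metrics from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),  # proportion of cancers flagged
        "specificity": tn / (tn + fp),  # proportion of non-cancers cleared
        "ppv": tp / (tp + fp),          # probability a positive call is cancer
        "npv": tn / (tn + fn),          # probability a negative call is not cancer
    }

# Approximate validation counts: 41 true positives, 1 missed cancer,
# 103 false positives, 149 true negatives (42 cancers, 294 patients).
m = diagnostic_metrics(tp=41, fp=103, fn=1, tn=149)
print({k: round(v, 3) for k, v in m.items()})
```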

Performance of model by histological subtype
We assessed the performance of our model on the different histological subtypes of cancer. The model correctly predicted as cancer all 3 OSCC cases in the testing dataset and all 5 OSCC cases in the validation dataset. For OAC, all 21 cases were correctly predicted in the testing dataset,

Pilot testing with gastric cancer cases
In a further sub-analysis, we assessed whether our model was able to predict the presence of gastric cancers. Demographic information for this subgroup of patients is presented in Supplementary Table 5 and includes 12 gastric cancer cases with the same controls as the original validation dataset. Supplementary Figure 1 demonstrates the ROC curve for validation, where the model achieved an AUC of 0.78 (95% CI: 0.62–0.95). At the same cut-off level, this led to a sensitivity of 75.0% (Table 2).

Discussion
Using an ML approach, we have created a risk stratification tool which could be used as a diagnostic aid to identify patients who may have oesophageal or gastro-oesophageal junction (GOJ) cancer. Our model is based on demographic factors such as age and sex, which are routinely captured, as well as alarm features such as dysphagia and unintentional weight loss, which are currently included in National Institute for Health and Care Excellence (NICE) UGI cancer referral guidelines [35]. Our model further incorporates additional features, such as known psychological disorders, which we believe give additional richness to the model and help improve its performance. In particular, we have demonstrated the robustness of the model by both testing and independently validating it. We were able to achieve a sensitivity of over 96% for detecting cancer, while specificity ranged from 20% in the testing dataset to 59% in the validation dataset. This wide range in specificity likely reflects both the relatively small numbers of cancers in the two datasets and the unequal distribution of cancer and non-cancer patients between them.
In addition, we trialled our model on gastric cancer patients. Although the numbers were small, it correctly identified 9/12 (75%) gastric cancer patients. It may be possible to enhance accuracy for this condition by capturing symptoms specific to gastric cancer, such as early satiety and symptoms of anaemia [36]. Incorporating these features when creating a model may further improve the sensitivity for detecting gastric cancers. Our work adds to previous research; ML algorithms have previously been developed in a Chinese population to predict the risk of UGI lesions [37]. However, that study incorporated a combined endpoint of UGI pathology, which included both gastric cancer, with a prevalence of 0.3% in the study population, and more benign pathology such as gastric erosions, with a prevalence of 22% [37]. Our model differs by emphasising cancers, which we believe will have greater utility, especially in triaging suspected cancer 2WW referrals and prioritising procedures which are likely to have a greater yield for serious pathology.
Our model appears to perform well against currently used triaging methods. In our validation dataset we were able to achieve a PPV of 28.5% and an NPV of 99.3%. This improves upon the use of alarm features alone, which have PPVs ranging from 0 to 11% for oesophageal and gastric cancers, with considerable heterogeneity between studies [38]. Even when a combination of alarm features is used, PPVs still remain at around 10% [38]. In addition, our work also improves on the EDS [12,13]. During the COVID-19 pandemic the EDS has been used for triaging urgent UGI referrals, and a prospective series demonstrated a sensitivity of 96.7% and a specificity of 32.6%, although the authors also included 10 gastric and one duodenal cancer in the outcome [39]. This equated to a PPV of 9.7% and an NPV of 99.3% for the detection of UGI cancer [39]. One major advantage of our approach is its higher PPV compared to the EDS, which could be especially advantageous in reducing the number of normal procedures being classed as urgent or being performed. Our model also compares favourably with models recently validated for the detection of incident cases of OAC and GOJ adenocarcinoma; the best of those models achieved an AUC of 0.73 in validation, compared to 0.92 in our study [40,41].

Limitations of the study
First, some patients were recruited after the diagnosis of cancer was known. Such patients may be subject to recall bias and report higher rates of symptoms. While it would be preferable to perform fully prospective recruitment, there are challenges associated with this, as UGI cancer has a low incidence in the UK [7]. Second, we would have preferred to directly test the performance of the EDS on our cohort of patients and thereby create head-to-head comparisons. However, we were unable to extrapolate one of the elements of the score (dysphagia localising to the neck) from our existing data. Third, as this study is based in the UK, where OAC is the predominant subtype, our scoring system may be less applicable to other nations or territories where other histological subtypes are more common. Fourth, our model incorporates a large number of features, which increases its complexity during clinical use, although with the rise of electronic health record systems patients are increasingly prepared to complete health questionnaires online. This can rapidly generate rich datasets for the clinician to review easily and is starting to make its way into routine clinical use. Finally, this model needs further validation in larger cohorts to ensure that its performance remains consistent, and slight adjustment to enhance its performance against gastric cancers. This is especially important to ensure greater consistency in the calculation of model performance metrics.

Conclusions
Using ML methods, we have created and validated a tool for the prediction of oesophageal and GOJ cancers.This could have real impact in helping to prioritise patients for urgent investigation, particularly at a time of COVID-19 related backlogs.

Fig. 1
Fig. 1 Process flow for model training, testing and validation.

Fig. 2
Fig. 2 Boxplot of the spread of the area under the receiver operating characteristic curve (AUC) for each model. The table demonstrates the median AUC, sensitivity and specificity, with their respective inter-quartile ranges (IQR), during model development using 10-fold cross-validation.

Fig. 3
Fig. 3 ROC curve for training, testing and validation datasets for regularised logistic regression for predicting oesophageal and GOJ cancer.
Ethical approval was obtained for both the SPIT study (Coventry and Warwickshire Research Ethics Committee: 17/WM/0079) and the RISQ study (South Central – Oxford B Research Ethics Committee: 19/SC/0382).

Table 1
Demographic and questionnaire responses for included patients in the training, testing and validation datasets of top-ranked features. Numbers in brackets following a feature (e.g. no, never (0)) denote the coding used for model development. p-values are for chi-squared tests unless otherwise stated. FET = Fisher's Exact Test. KW = Kruskal-Wallis Test.