Introduction

Colorectal cancer (CRC) is the second leading cause of cancer-related death and the third most common malignant neoplastic disease worldwide, with an annual incidence of 1.8 million new cases [1, 2]. The cornerstone of curative treatment of CRC is surgery, which carries a risk of postoperative complications of up to 47.4% (14.4% major complications) and a postoperative mortality of 3.3% [3]. The past decades have seen vast progress in reducing mortality after colorectal cancer surgery [4]; outcomes and complication rates may improve further by personalizing treatment based on each patient's individual challenges.

Approximately one third of patients diagnosed with CRC can be considered frail due to age, comorbidity, functional capacity, and lifestyle factors [5, 6], and these patients face a higher risk of postoperative mortality [7]. Several studies suggest that frailty is an independent factor of increased mortality and complications after surgery and that identifying frail patients and optimizing their trajectory in relation to surgery can reduce the risk of postoperative adverse outcomes [5, 8, 9]. Similar to the variation in frailty, there is large heterogeneity in age, comorbidities, nutritional deficiencies, and histopathological tumor characteristics between patients with CRC [3, 10, 11]. Several interventions exist to target these challenges, such as iron infusions for anemic patients [12, 13], prehabilitation [14,15,16], medical nutrition therapy [17], and optimization of medication; different neoadjuvant treatment strategies depending on the tumor are also under development [18]. To individualize a patient's treatment based on their risk profile, a clinician requires knowledge of that patient's personalized risk. A tool that can provide such estimates would therefore aid the process of treatment planning.

With this study, we aimed to develop prediction models using a data-driven approach on preoperative data from multiple nationwide health databases focusing on mortality within 30 days, 90 days, and 1 year after CRC surgery, which in the future could be used for patient stratification and personalized treatment.

Methods

Data sources

Data were retrieved from four Danish observational health databases with nationwide coverage. These are the Danish Colorectal Cancer Group database (DCCG), which includes information about the patient's CRC history with 95–99% coverage from 2001 to 2019 [19, 20]; the Danish National Patient Register (DNPR), with trajectory data from all contacts with the secondary healthcare sector from 1976 to 2019 [21]; the Register of Laboratory Results for Research (RLRR), containing biochemical and microbiological laboratory results from 2013 to 2019 [22]; and the National Prescription Registry (NPR), containing information about claimed drug prescriptions from 1994 to 2019 [23].

Access to Danish observational health databases for research does not require ethical approval. This study was approved for Region Zealand under the record number REG-102-2020 [24].

Study population and outcomes

We included patients above 18 years of age with a CRC diagnosis and a date of surgery between May 1, 2001, and December 31, 2019, recorded in both DCCG and DNPR, where the surgery date in DCCG was matched by the date in DNPR. The outcome of interest was all-cause mortality within 30 days, 90 days, and 1 year after CRC surgery.
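As a minimal sketch of this inclusion logic (the data frames dccg and dnpr and all column names are hypothetical stand-ins for the registry extracts), the date-matching requirement can be expressed as an inner join on patient and surgery date:

```r
# Sketch of the inclusion logic; dccg, dnpr, and the column names are
# hypothetical stand-ins for the registry extracts described above.
library(dplyr)

cohort <- dccg %>%
  filter(age_at_surgery > 18,
         surgery_date >= as.Date("2001-05-01"),
         surgery_date <= as.Date("2019-12-31")) %>%
  # Keep only patients whose DCCG surgery date is matched in DNPR
  inner_join(dnpr, by = c("patient_id", "surgery_date"))
```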

Statistical analyses

Each data source was transformed into the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM) v5.3 and subsequently merged [25]. The OMOP-CDM structure enables the use of open-source analysis tools provided by the Observational Health Data Sciences and Informatics (OHDSI) community [26].

The index date was set to the date of surgery, and times-at-risk were set to 30 days, 90 days, and 1 year in order to assess different parts of the patient trajectory, covering both the immediate and the more prolonged impact of surgery on survival.
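In the OHDSI framework, each time-at-risk corresponds to its own study population. The sketch below shows how this can be expressed with the PatientLevelPrediction package used in the study; argument names assume the v4.x API, and plpData and outcomeId are placeholders for the extracted OMOP-CDM data and the mortality outcome cohort:

```r
# Sketch assuming the v4.x API of the OHDSI PatientLevelPrediction package.
library(PatientLevelPrediction)

# One study population per time-at-risk, anchored at the surgery (index) date.
populations <- lapply(c(30, 90, 365), function(days) {
  createStudyPopulation(plpData         = plpData,    # placeholder: OMOP-CDM extract
                        outcomeId       = outcomeId,  # placeholder: all-cause mortality
                        riskWindowStart = 0,
                        riskWindowEnd   = days)
})
```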

Data were divided into a training set consisting of a random 75% of patients for model development, while the remaining 25% of patients were used for internal validation. Threefold cross-validation was used for hyperparameter optimization. Regarding missing data, the OMOP-CDM requires sex and age, which were present for all records; all other fields may contain missing data. In DCCG, some covariates are mutually exclusive and can thus be separated from missing values, such as deficient mismatch repair (dMMR) and proficient mismatch repair (pMMR). In the remaining data sources, however, distinguishing between the absence of a record and a missing record is not possible due to the nature of EHR data capture and longitudinal data collection; for instance, if there was no record of a specific diagnosis, the diagnosis was treated as not present in that patient. Categorical variables such as procedures and diagnoses were stored by one-hot encoding, causing both negative findings and missing data for categorical variables to be interpreted as zero by the model. Missing continuous values were interpreted as NA, and as such a missing continuous variable was not taken into consideration [25, 27, 28]. Apart from age, continuous variables were almost exclusively biochemical results, especially from blood samples.
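To make this encoding concrete, a small synthetic illustration (not the study code): absent and missing categorical records both become zeros in the one-hot design matrix, while missing continuous values remain NA and are simply not used:

```r
# Synthetic illustration of the covariate encoding described above.
patients <- data.frame(
  id         = 1:4,
  diabetes   = c(1, 0, NA, 1),       # diagnosis indicator; NA = no record in source
  hemoglobin = c(7.8, NA, 8.4, 6.9)  # continuous lab value (mmol/L)
)

# Categorical covariates: "no record" and "negative" both collapse to 0.
patients$diabetes_onehot <- ifelse(is.na(patients$diabetes), 0, patients$diabetes)

# Continuous covariates: missing values stay NA and are left out of the fit.
patients
```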

We trained a Least Absolute Shrinkage and Selection Operator (LASSO) logistic regression model for 30-day, 90-day, and 1-year postoperative mortality [29,30,31,32,33]. LASSO logistic regression generally performs well with rare outcomes or sparse data. Its advantage is that the model is provided with all covariates present in at least a predefined minimum fraction of records, which we set to 0.1%, and automatically selects the covariates associated with the outcome, thus shrinking tens of thousands of candidate covariates to only a fraction of relevant ones. Model performance was assessed using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) for discrimination, including the ratio between incidence and AUPRC, and the calibration intercept, calibration slope, and calibration-in-the-large for calibration. Receiver operating characteristic curves, precision-recall curves, and calibration plots are reported and were assessed visually [34].
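The study used the PatientLevelPrediction implementation; the stand-alone sketch below illustrates the same steps with glmnet, pROC, and PRROC under assumed inputs (sparse one-hot matrices X_train/X_test and 0/1 outcome vectors y_train/y_test), including the 0.1% prevalence filter, threefold cross-validation, and the reported discrimination and calibration metrics:

```r
# Stand-alone sketch of the modeling steps; X_train/X_test (sparse one-hot
# covariate matrices) and y_train/y_test (0/1 mortality) are assumed inputs.
library(glmnet)  # LASSO logistic regression
library(pROC)    # AUROC
library(PRROC)   # AUPRC

# Prevalence filter: keep covariates present in at least 0.1% of records.
keep <- Matrix::colMeans(X_train != 0) >= 0.001

# LASSO (alpha = 1) with threefold cross-validation for the penalty lambda.
fit <- cv.glmnet(X_train[, keep], y_train,
                 family = "binomial", alpha = 1, nfolds = 3)

# Predicted risks on the held-out 25%.
p <- predict(fit, X_test[, keep], s = "lambda.min", type = "response")[, 1]

# Discrimination: AUROC and AUPRC (with the outcome incidence as baseline).
auroc     <- pROC::auc(y_test, p)
auprc     <- PRROC::pr.curve(scores.class0 = p[y_test == 1],
                             scores.class1 = p[y_test == 0])$auc.integral
incidence <- mean(y_test)

# Calibration: slope and calibration-in-the-large on the logit scale.
lp    <- qlogis(p)
slope <- coef(glm(y_test ~ lp, family = binomial))[2]
citl  <- coef(glm(y_test ~ offset(lp), family = binomial))[1]
```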

To assess the performance of the model in different clinical scenarios related to the acuteness and intent of surgery, we performed subgroup analyses by testing the model on subsets of patients who underwent curative, palliative, emergency, and elective surgery, respectively. In the subgroup analyses, the model was trained on the full training set and subsequently validated on each subset of patients.
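Continuing the sketch above (with a hypothetical test data frame carrying subgroup indicator columns), the subgroup analysis amounts to evaluating the single fixed model on each subset:

```r
# Continuation of the sketch: evaluate the one trained model on clinically
# defined subsets of the test set (hypothetical indicator columns on `test`).
subgroups <- list(elective   = test$elective   == 1,
                  emergency  = test$emergency  == 1,
                  palliative = test$palliative == 1,
                  curative   = test$curative   == 1)

sapply(subgroups, function(idx) as.numeric(pROC::auc(y_test[idx], p[idx])))
```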

The tool used for covariate selection and model settings was ATLAS version 2.9.0. For model training, R v. 4.0.3 was used with the “PatientLevelPrediction” package v. 4.3.7. Reporting of outcomes of this study adheres to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines [35].

Results

Participants

A total of 65,612 patients underwent surgery for CRC from 2001 to 2019 in DCCG. Of these, 57,521 patients were included in the study cohort (Fig. 1). The cohort consisted of 48.9% female patients, with a median age of 72 years. The remaining patient characteristics are found in Table 1, and boxplots showing the predicted risk in the mortality and no-mortality groups are found in Figs. 2A, 3A, and 4A. Incidences of mortality were 4.48%, 6.64%, and 12.8% for 30-day, 90-day, and 1-year mortality, respectively.

Fig. 1 Flowchart of patient inclusion. DCCG, Danish Colorectal Cancer Group database; DNPR, Danish National Patient Register

Table 1 Demographic and clinical parameters for patients included in the cohort
Fig. 2 Study outcomes for 30-day mortality. A Boxplot of predicted risk for patients who died (blue) and patients who did not die (red). B Receiver operating characteristic (ROC) curve for 30-day mortality. C Precision-recall curve for 30-day mortality. D Calibration plot for 30-day mortality. Blue indicates cases (mortality within the time at risk); red indicates patients who did not die within the time at risk

Fig. 3 Study outcomes for 90-day mortality. A Boxplot of predicted risk for patients who died (blue) and patients who did not die (red). B Receiver operating characteristic (ROC) curve for 90-day mortality. C Precision-recall curve for 90-day mortality. D Calibration plot for 90-day mortality. Blue indicates cases (mortality within the time at risk); red indicates patients who did not die within the time at risk

Fig. 4 Study outcomes for 1-year mortality. A Boxplot of predicted risk for patients who died (blue) and patients who did not die (red). B Receiver operating characteristic (ROC) curve for 1-year mortality. C Precision-recall curve for 1-year mortality. D Calibration plot for 1-year mortality. Blue indicates cases (mortality within the time at risk); red indicates patients who did not die within the time at risk

Model performance

The models contained 419, 561, and 581 covariates for 30-day, 90-day, and 1-year mortality, respectively. Full model details can be viewed in Supplementary Table 1.

The models showed good discrimination, with AUROCs of 0.88, 0.878, and 0.861 (Table 2, Figs. 2B, 3B, and 4B) and AUPRCs of 0.353, 0.431, and 0.559 for 30-day, 90-day, and 1-year mortality, respectively, against incidences of 0.0448, 0.0665, and 0.128 (Table 2, Figs. 2C, 3C, and 4C). In terms of calibration, the models had calibration slopes of 0.96, 1.02, and 1.04, calibration intercepts of 0.02, 0.01, and 0.01, and calibration-in-the-large of 1.01, 0.99, and 0.99 for 30-day, 90-day, and 1-year mortality (Table 2, Figs. 2D, 3D, and 4D). Visual assessment of the smoothed calibration plots likewise showed excellent calibration for 30-day, 90-day, and 1-year mortality (Figs. 2D, 3D, and 4D).

Table 2 Cohort, calibration, and discrimination metrics for 30-day, 90-day, and 1-year postoperative mortality using LASSO logistic regression

In the subgroup analysis, the results from the general model including all patients operated on for CRC were compared to smaller subgroups of patients undergoing elective surgery (n = 30,167), emergency surgery (n = 4279), palliative surgery (n = 1829), and curative surgery (n = 24,598). A comparison of all performance evaluation measures in each subgroup can be viewed in Table 3.

Table 3 Performance shown with discrimination and calibration evaluation methods in subgroup analysis

Discussion

In summary, we used observational health data from national registers to train machine learning models that identify the most important predictors of 30-day, 90-day, and 1-year mortality following colorectal cancer surgery. The models showed overall good performance, especially for 1-year mortality (Fig. 4). In the subgroup analysis comparing the general population with subgroups of patients undergoing curative, palliative, emergency, or elective surgery, performance in each subgroup was consistently comparable to that in the larger study population, showing that the model can be used in multiple circumstances.

The aim of the study was to create a model for binary classification of mortality within different time windows based on a highly granular national dataset. The high number of candidate features relative to the number of patients can cause a regression model to have high variability when fitted with an ordinary least squares procedure; however, statistical learning methods within ML can often decrease the overall prediction error by reducing this variability at the cost of a negligible increase in bias [36]. A common concern when creating such prediction models is a "black box" effect, where the relationship between the input parameters and the output becomes opaque, which might make clinicians reluctant to trust and act on the output [37]. This study uses LASSO logistic regression, a statistical learning method that imposes a restriction on the coefficient estimates but is otherwise identical to an ordinary unpenalized fit. This method keeps the intrinsic interpretability of regression-based models and performs variable selection, resulting in sparse models, which further aids interpretability [36, 38]. While the model itself is interpretable and contains no higher-order polynomial or interaction terms, given the number of covariates included in all models, clinical implementation might benefit from solutions designed to communicate the relationship between the input and output of the model [38].
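Concretely, the LASSO fits ordinary logistic regression with an added L1 penalty on the coefficients, so that coefficients of uninformative covariates are shrunk exactly to zero and drop out of the model:

$$\hat{\beta} = \underset{\beta_0,\,\beta}{\arg\min} \left\{ -\sum_{i=1}^{n} \Big[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \Big] + \lambda \sum_{j=1}^{P} |\beta_j| \right\}, \qquad p_i = \frac{1}{1 + e^{-(\beta_0 + x_i^{\top}\beta)}},$$

where the penalty weight $\lambda$ governs how aggressively coefficients are shrunk and was chosen here by cross-validation.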

Other risk assessment tools exist but are not routinely used in the clinical setting and tend to perform less impressively on internal validation sets [39,40,41]. Additionally, most prediction studies tend to report mainly AUROC and omit other performance metrics, especially calibration metrics [42]. In this study, we report a range of graphical and numerical discrimination and calibration measures with full transparency, in line with the reporting guidelines of the TRIPOD statement [35].

Using prediction models for clinical risk assessment is known to involve implementation challenges, partly because of suboptimal introduction to clinicians, which in some cases has been seen to cause a lack of trust in the predictions [43, 44]. When interpretability comes into question, it is, however, important to note that increasing the explainability of a model may reduce its performance and clinical applicability [45]. The data-driven covariate inclusion means that some included covariates may seem clinically unrelated to the outcome. The more traditional approach is pre-selection of variables based upon clinical knowledge and known causal relationships. Pre-selection vastly reduces the potential search space for the model but removes the potential for finding new associations that can be predictive of the outcome. As such, this method, while providing models with high explainability, may sacrifice the potential for better performance. This study does not report positive predictive value, negative predictive value, sensitivity, or specificity, since these require a threshold defining high risk. However, what constitutes high risk is greatly dependent on the patient phenotype, and presenting a high/low-risk label rather than a number as the result of a prediction may lead to the model being used as a deciding factor rather than as clinical decision support.
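To illustrate why these metrics were omitted, a small synthetic sketch (not the study code) shows how sensitivity, specificity, PPV, and NPV all change with the chosen risk cutoff:

```r
# Synthetic illustration: threshold-dependent metrics have no single value.
threshold_metrics <- function(risk, died, cutoff) {
  pred_high <- risk >= cutoff
  tp <- sum(pred_high & died == 1);  fp <- sum(pred_high & died == 0)
  fn <- sum(!pred_high & died == 1); tn <- sum(!pred_high & died == 0)
  c(sensitivity = tp / (tp + fn), specificity = tn / (tn + fp),
    ppv = tp / (tp + fp), npv = tn / (tn + fn))
}

set.seed(1)
risk <- runif(1000)            # hypothetical predicted risks
died <- rbinom(1000, 1, risk)  # hypothetical outcomes
rbind(cutoff_0.05 = threshold_metrics(risk, died, 0.05),
      cutoff_0.20 = threshold_metrics(risk, died, 0.20))
```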

One of the main strengths of this study is the use of high-quality observational health databases with national coverage, a high rate of completeness [19,20,21], high granularity, and a combination of differently focused data sources. This yields a large cohort, which provides more data for model training and also reduces selection bias from variation in local practices. Merging health databases with different focuses, cancer-specific trajectory data, admission-specific data, prescription medicine data, and laboratory results, also provides the most comprehensive data coverage for each patient, which is necessary in order to create full patient phenotypes. An additional strength is the testing of the model in subgroup analyses of different surgical settings (acute, elective, palliative, and curative), where the model showed comparable performance in all subgroups. This additional internal validation step supports that the model provides reliable risk predictions for patients in multiple different settings.

Denmark is known for observational health databases with high coverage, but other countries, such as the Netherlands (for instance, the Netherlands Cancer Registry provided by the Netherlands comprehensive cancer organization [46]) and Norway (for instance, the Cancer Registry of Norway [47]), maintain datasets with similar content and coverage. This means that the models can likely be used in other countries with similar datasets. Additionally, the models are trained to use whichever covariates are provided, meaning they can also be applied to incomplete datasets.

This study evaluates the performance of prediction models developed on data from several national Danish registers harmonized to the OMOP-CDM. The use of open-source tools and the overall model development approach allows other data holders to convert their data to the OMOP format and train their own prediction models. The covariates with the largest positive and negative coefficients are presented in the supplementary material.

The data-driven approach meant that all available covariates could potentially be included in model development. LASSO logistic regression selects covariates based on their impact on the model and excludes covariates with no effect on the prediction. This data-driven covariate selection often leads to many included covariates; here, the models utilized 419, 561, and 581 one-hot encoded covariates for prediction, which is infeasible for clinical implementation. A step toward reducing the number of variables is to group them into categories or phenotypes, making the model more clinically manageable, a so-called parsimonious model [48, 49]. This could, however, decrease model performance or introduce bias when covariates are selectively grouped. The prospect of parsimonious models is interesting for the next steps of clinical implementation but will require further research into how the number of covariates can be reduced with minimal impact on the model. In addition, external validation with other colorectal cancer research groups using the OMOP-CDM would be an important future step to ensure that the models are clinically applicable internationally.

The limitations of this study include the fact that external validation was not yet possible and that the models contain many covariates; both points require further research.

However, the strengths of the study were also considerable, the two largest being the use of national health databases with good coverage, rather than data from one or a few centers, and the very large sample size of over 50,000 patients. Additional strengths worth mentioning are the use of only preoperative covariates; internal validation, in which the model was tested on a separate group of records not used for training; and the reporting of multiple performance measures, including calibration and discrimination, which is rarely provided in similar detail.

Conclusion

We found that combining multiple nationwide databases for patients with CRC allowed for the development of high-performing mortality prediction models with excellent calibration and discrimination. By including only preoperative covariates, the models are usable in the treatment planning phase and may assist clinicians in optimizing an individualized approach to colorectal cancer treatment.