Maximizing Interpretability and Cost-Effectiveness of Surgical Site Infection (SSI) Predictive Models Using Feature-Specific Regularized Logistic Regression on Preoperative Temporal Data

This study describes a novel approach to solve the surgical site infection (SSI) classification problem. Feature engineering has traditionally been one of the most important steps in solving complex classification problems, especially in cases with temporal data. The described novel approach is based on abstraction of temporal data recorded in three temporal windows. Maximum likelihood L1-norm (lasso) regularization was used in penalized logistic regression to predict the onset of surgical site infection occurrence based on available patient blood testing results up to the day of surgery. Prior knowledge of predictors (blood tests) was integrated in the modelling by introduction of penalty factors depending on blood test prices and an early stopping parameter limiting the maximum number of selected features used in predictive modelling. Finally, solutions resulting in higher interpretability and cost-effectiveness were demonstrated. Using repeated holdout cross-validation, the baseline C-reactive protein (CRP) classifier achieved a mean AUC of 0.801, whereas our best full lasso model achieved a mean AUC of 0.956. Best model testing results were achieved for full lasso model with maximum number of features limited at 20 features with an AUC of 0.967. Presented models showed the potential to not only support domain experts in their decision making but could also prove invaluable for improvement in prediction of SSI occurrence, which may even help setting new guidelines in the field of preoperative SSI prevention and surveillance.


Introduction
Surgical site infections (SSIs) are the most common type of nosocomial infections and a major cause of morbidity among surgical patients, especially following abdominal and colorectal [1], cardiovascular [2], oncological [3], and trauma or orthopaedic surgeries [4,5]. An empirical surveillance study [6] for patients undergoing surgical procedures (SP) that encompassed data from 82 hospitals in 30 countries confirmed that the highest SSI rates from SP were from the aforementioned surgeries, where ventricular shunt had the highest rate with 12.9%, followed by colon surgery with 9.4%, bile duct, liver, or pancreatic surgery with 9.2%, abdominal aortic aneurysm repair with 7.7%, and thoracic surgery with a 6.1% rate. The best strategy in SSI prevention lies in effective guidelines, and most are specifically tailored to meet the needs of the countries in which they were published [7].
The World Health Organization (WHO) in 2016 published global guidelines on the prevention of SSI [8], which included conducting specific tests per patient prior to a certain medical procedure. There are several approaches to predict occurrence and reduce the incidence of SSIs, such as different risk models and prevention strategies. For instance, high-income countries have realized that collecting data through centralized surveillance systems is an essential component of SSI prevention [7,9-11]. Risk models aim at predicting SSIs and guiding further action to prevent more serious outcomes.
In addition to the health risk for the patient, SSI usually implies longer postoperative hospital stays (on average by 9.7 days), considerably increased postoperative costs (on average by $20,842 per admission), and often a higher mortality [12]. The economic costs of SSIs alone are substantial [13]. For example, from the national perspective (in the US), SSI cases were associated with an additional 406,730 hospital-days and hospital costs exceeding $900 million. An additional 91,613 readmissions for treatment of SSI accounted for a further 521,933 days of care at a cost of nearly $700 million [14].
Several researchers have already focused on predicting SSI based on electronic health records (EHRs), for example, by using patient demographics, past medical history, and surgical information [15]. In recent studies, ranging from model evaluation methods, such as ROC analysis [16,17], to data-driven modelling approaches, such as linear regression models [1,18,19] and Support Vector Machines [20], the attention has shifted to results of blood tests before and after surgery. One of the main reasons is the ease of access to and the extent of this type of data. Specifically, blood test results of C-reactive protein (CRP) have been associated with a high predictive power [1].
Interpretability of the current practices and models therefore starts with CRP, a known indicator of inflammation, and is the first in line of predictors associated with the presence of inflammation of any kind, but it also has its drawbacks. Among these is the fact that if a high CRP is measured in a patient prior to surgery, the surgery will most likely get postponed. Consequently, CRP is not a predictor, but more likely a filter of having a surgery in a specific patient at all. However, it is of course also clear that if surgery gets underway in a patient with an already high mean CRP value, the probability that the surgical site gets infected increases.
Several studies have taken a classical (knowledge-based) approach and handcrafted features by setting cutoff values of CRP after the surgery, where the term of postoperative day (POD) is commonly used. For example, Angiolini et al. [19] evaluated the diagnostic accuracy as an early predictor of SSIs after pancreaticoduodenectomy and showed that CRP on POD 3, with a cutoff of 17.27 mg/dl, predicted the postoperative course in 78.2% of patients, whereas a CRP cutoff of 14.72 mg/dl on POD 4 predicted the postoperative course in 80.2% of patients. A systematic review of studies on diagnostic value of CRP after major abdominal surgery for predicting SSI [21] showed that CRP > 15.9 mg/dl on POD 3 increases the risk of SSI [10].
An alternative data-driven approach for the SSI problem was demonstrated by Ke et al. [22], where the focus was on dynamic wound data (e.g., using mHealth tools, which include self-reported symptoms of pain, body temperature, wound features, and patient- or caregiver-generated images of the wound) for SSI prediction. Time to SSI onset was predicted from spatial-temporal data via a bilinear formulation, further enhanced with automatic missing data imputation by the matrix completion technique for data from POD 2 until discharge or POD 21, whichever was earlier. This approach showed superior performance on a real-world SSI data set in terms of mean absolute error (MAE) compared to linear regression and support vector regression (SVR) [22]. The aforementioned studies have all focused on postoperative data. However, as Silvestre et al. [18] noted, CRP concentrations were already significantly higher prior to surgery in patients who developed infections postoperatively than in those who did not develop complications.
This suggests that there is a high risk that an underlying infection was already present prior to surgery. Hence, if we detect those patients prior to surgery, preventive interventions can be performed, minimizing the risk of SSI. We believe that a data-driven approach that leverages the vast amounts of information stored in the EHRs, and in particular in blood samples, is valuable for predicting SSI prior to surgery. Such an approach already achieved remarkable success rates at predicting development of SSIs following gastrointestinal surgery [20].
Predicting SSI after surgery gives us additional intra- and postoperative risk factors such as surgery duration [23][24][25], treatment complexity [25], blood loss during surgery [24], administration of supplemental oxygen [26], and higher intraoperative lactate levels [23], which in turn can improve the SSI prediction or augment an existing preoperative data SSI prediction.
We further summarized our findings of studies focusing on pre- and/or postoperative data in the prediction of SSI in Table 1, where we compared tasks, data, and methods used.
We conjecture that a higher predictive power lies in the expansion of the preoperative blood test results from the mean value to more complex parameters (e.g., slope of linear regression line for a fixed temporal window and number and proportion of low/high abnormal values of tests for a fixed temporal window). This presents a unique approach, which, to the best of our knowledge, has not yet been used in SSI prediction.
Considering the available models for predicting SSI based on preoperative data, our motivation was to improve their predictive potential, while at the same time maximizing their interpretability and final diagnostic cost-effectiveness. As such, this study presents a novel approach to SSI prediction among patients undergoing gastrointestinal surgery based on temporal data from blood test results. Our solution is built on the abstraction of preoperative blood test data in three different temporal windows. In order to maintain interpretability, penalized logistic regression was used as a core classification method since it allows tuning of the model complexity to avoid overfitting and still maintains a high level of predictive performance. Moreover, clinical knowledge was incorporated into the model, and features that account for the amount of missing data were proposed. Additional solutions were developed to produce a more cost-effective model by introducing a penalization based on the price of the respective blood tests. Using this approach, it is possible to achieve better economic efficiency of the models in practice as our approach aims to reduce the costs of laboratory tests by eliminating more expensive tests in cases where similar results can be achieved by combining tests costing less.

Table 1 (recovered row): Moyes et al. [30]. Task: to examine the relationship between the preoperative mGPS (modified Glasgow Prognostic Score) and the development of postoperative complications in patients undergoing potentially curative resection for colorectal cancer. Data: blood tests (white cell count, albumin, and C-reactive protein) and clinicopathological characteristics such as age, gender, tumor site, and nodal involvement, among others. Period: pre- and postoperative. Methods: Mantel-Haenszel (χ²) test for trend, logistic regression analysis.

Data Set Description.
The data used in this work were previously explored and analyzed by Soguero-Ruiz et al. [20]; they were also used in the American Medical Informatics Association Knowledge Discovery and Data Mining 2016 data competition. The data set we considered consists of 7725 patients who underwent a gastrointestinal surgical procedure at the University Hospital of North Norway in the years 2004-2012. Since SSI-persistent in-hospital morbidity is particularly associated with colorectal cancer surgery [33], patients who did not undergo this type of surgery were excluded, reducing the size of the cohort to 1137 patients. Guided by input from clinicians, the International Classification of Diseases (ICD10) and NOMESCO Classification of Surgical Procedures (NCSP) codes related to severe postoperative complications, and in particular to SSI, were considered to identify patients with SSI. Patients who did not have these codes or the word "infection" in any of the postoperative text documents were considered as controls. 80% of the cohort (909 patients) was used for model development, and the remaining 20% of the cohort was set aside for model testing. In the model development set, 183 out of 909 patients (20.13%) developed SSI, whereas in the test set 50 out of 228 (21.93%) patients developed SSI.
Data included information from various blood tests (811 different blood tests at different points in time). The data ranging from preoperative day 5393 up to the day of surgery were used, while data collected postoperatively were not used in this study. For each blood test, the mean number of blood test values recorded in the last 30 days before the surgery was calculated. In total, the 14 most frequent blood tests (with a mean number of measurements above one) were used: hemoglobin (5.37 measurements), leukocytes (4.43), sodium (4.24), CRP (4.11), potassium (3.97), albumin (2.87), creatinine (1.94), thrombocytes (1.53), alanine aminotransferase (ALT, 1.23), total bilirubin (1.22), aspartate aminotransferase (AST, 1.14), glucose (1.06), amylase (1.06), and alkaline phosphatase (ALP, 1.04). From a medical point of view, the abovementioned blood tests can have a specific or a more general role in the context of gastrointestinal surgery. ALT, AST, ALP, total bilirubin, and albumin serve to specifically assess hepatic functional capacity, inflammation, or biliary tract obstruction, and amylase serves to assess pancreatic ductal obstruction and inflammation. Glucose is a much more general metabolic marker, depending heavily on insulin sensitivity in peripheral tissues, but may be of special relevance in gastrointestinal patients with liver or pancreatic disease, due to deranged central insulin sensitivity or reduced gluconeogenesis, and diminished insulin secretion, respectively. The remaining 7 tests are even less specific for gastrointestinal conditions. Reduced hemoglobin may indicate chronic gastrointestinal bleeding, and elevated CRP and leukocytes may indicate inflammation. Sodium, potassium, and creatinine levels help to assess water and electrolyte balance and kidney function, and thrombocytes help to assess the risk for bleeding or thrombosis [34][35][36].
Some of these represent parameters obtained in routine blood tests, while others are more specifically aimed at detecting inflammation or infection, making them easier to interpret in the context of SSI model prediction.
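The frequency-based selection of blood tests described in this section can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the function names, the toy data, and the choice to average over all patients (rather than only over tested patients) are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def mean_measurements_per_patient(records, n_patients, window_days=30):
    """Mean number of values per patient for each test in the last `window_days`.

    `records` holds (patient_id, test_name, days_before_surgery) tuples.
    Assumption: the mean is taken over all patients, tested or not.
    """
    counts = defaultdict(int)
    for _patient_id, test_name, days_before in records:
        if 0 <= days_before <= window_days:
            counts[test_name] += 1
    return {test: total / n_patients for test, total in counts.items()}


def select_frequent_tests(records, n_patients, min_mean=1.0, window_days=30):
    """Keep tests whose mean number of measurements is above `min_mean`."""
    means = mean_measurements_per_patient(records, n_patients, window_days)
    kept = [test for test, mean in means.items() if mean > min_mean]
    return sorted(kept, key=lambda test: means[test], reverse=True)


# Toy data for two hypothetical patients (not study data).
toy_records = [
    (1, "hemoglobin", 0), (1, "hemoglobin", 3), (1, "CRP", 1),
    (2, "hemoglobin", 2), (2, "CRP", 5), (2, "CRP", 40),  # day 40 is outside the window
    (2, "amylase", 10),
]
frequent = select_frequent_tests(toy_records, n_patients=2)
```

In this toy example only hemoglobin exceeds a mean of one measurement per patient, so it is the only test kept.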

Feature Representation.
Following an initial exploratory analysis of the model development data set using different visualization techniques (observing different patterns with regard to SSI), we set the observation interval to 60 days before the surgery for all selected blood test based features. The initial data set of feature representation thus consisted of the 14 most frequent blood tests for an interval of 60 days before surgery on a daily basis, where the mean values of blood tests were used if more than one value was recorded in a day. Since not all tests were available on a daily level, with a large percentage of missing values ranging from 90.25% for hemoglobin to 98.19% for amylase, imputation using three approaches was used: (a) last observation carried forward (LOCF) [37], (b) k-nearest neighbours (KNN), and (c) a combination of LOCF and KNN. Additionally, three temporal windows (S, short; M, medium; and L, long) were manually defined for each feature based on the observed patterns (e.g., peaks or changes in trends) of feature values prior to surgery (Figure 1). The S window included measurements from days 2 to 0 (depending on observed pattern for specific blood test) prior to surgery, in which the most tests per day were performed. The M window included data from days 18 to 13 prior to surgery up to the start of the S window. The L window encompassed the period from preoperative day 60 to the upper limit of window M.
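Last observation carried forward on a daily series, as mentioned above, can be illustrated with a minimal sketch. The function names and the toy series are hypothetical, and the decision to leave leading gaps unfilled is an assumption, since the text does not say how days before the first measurement were handled.

```python
def locf(daily_values):
    """Fill missing days (None) with the most recent earlier observation.

    Days before the first observation stay None (an assumption of this sketch).
    """
    filled, last_seen = [], None
    for value in daily_values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


def missing_share(daily_values):
    """Fraction of days without a recorded value."""
    return sum(value is None for value in daily_values) / len(daily_values)


# Oldest day first, day of surgery last (toy values, not study data).
series = [None, 5.0, None, None, 12.0, None]
filled = locf(series)  # [None, 5.0, 5.0, 5.0, 12.0, 12.0]
```

With more than 90% of daily values missing, most entries of a real series would be carried-forward copies, which is one reason to compare LOCF against other imputation schemes.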
Since the length of the observation interval could influence the SSI prediction results, we included shorter observation intervals, more specifically a 30-day and 15-day observation interval. In the 30-day observation interval, the L window encompassed a shorter temporal window from preoperative day 30 to the upper limit of window M. The 15-day observation interval had 1 temporal window from 15 to 0 days before surgery.
While changes in blood test values in the short temporal window can be associated with acute infections in patients prior to surgery (that can clearly lead to SSI after surgery), the longer temporal windows (medium and long) are more indicative of some underlying (maybe even not properly treated) pathophysiological changes that could have an indefinite influence on the SSI outcome. Looking at the parameters that were chosen by our model to predict SSI, it is clear that the different temporal windows play an important part in the battery of predictors for SSI. The following features were extracted for each temporal window separately: introduced as an early indicator for developing SSI, with our underlying assumption that patients with a higher proportion of abnormal blood tests develop SSI more frequently.
Since there were 14 blood tests and 3 temporal windows available for each of the three types of generated features, with 6 features generated in each instance, the final datasets consisted of 252 features for the 60-day and 30-day observation interval and 84 features for the 15-day observation interval.
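The feature counts above follow from 14 tests × 3 windows × 6 features = 252 and 14 × 1 × 6 = 84. The sketch below checks this arithmetic and shows how window-level quantities of the kind named in the text (mean value, slope of the regression line, number and proportion of abnormal values, number of tests) could be computed. The function names, the reference range, and the toy CRP values are hypothetical, and the exact definition of the six features per window is not recoverable from the text.

```python
N_TESTS, N_FEATURES_PER_WINDOW = 14, 6
TOTAL_THREE_WINDOWS = N_TESTS * 3 * N_FEATURES_PER_WINDOW  # 60-day and 30-day intervals
TOTAL_ONE_WINDOW = N_TESTS * 1 * N_FEATURES_PER_WINDOW     # 15-day interval


def ls_slope(xs, ys):
    """Slope of the least-squares regression line; 0.0 if it is undefined."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return 0.0
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx


def window_features(measurements, first_day, last_day, low, high):
    """Summarize (days_before_surgery, value) pairs inside one temporal window.

    The window runs from `first_day` (further from surgery) to `last_day`.
    `low` and `high` bound the normal range (hypothetical values in the example).
    """
    inside = [(-d, v) for d, v in measurements if last_day <= d <= first_day]
    if not inside:
        return None
    times = [t for t, _ in inside]    # negative days, so time increases toward surgery
    values = [v for _, v in inside]
    n_abnormal = sum(1 for v in values if v < low or v > high)
    return {
        "mean": sum(values) / len(values),
        "slope": ls_slope(times, values),
        "n_abnormal": n_abnormal,
        "prop_abnormal": n_abnormal / len(values),
        "n_tests": len(values),
    }


# Toy CRP series and a hypothetical medium window from day 13 to day 3.
crp = [(10, 4.0), (6, 8.0), (4, 12.0), (1, 30.0)]
medium = window_features(crp, first_day=13, last_day=3, low=0.0, high=5.0)
```

A positive slope here means the value was rising as the day of surgery approached, because days are converted to negative offsets before fitting.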

Predictive Modelling.
As maximizing interpretability was one of our goals, we restricted modelling to linear models with an additional model based on an ensemble of boosted decision trees serving as a nonlinear comparison. We further restricted linear models to regularized linear models allowing the complexity of the model to be tuned for a better predictive performance, thus avoiding overfitting.
A generalized linear model via penalized maximum likelihood with L1-norm (lasso) regularization was used as defined by Friedman et al. [40]:

\[ \min_{\beta_0, \beta} \; \frac{1}{N} \sum_{i=1}^{N} w_i \, l\left(y_i, \beta_0 + \beta^{T} x_i\right) + \lambda \lVert \beta \rVert_1 , \]

where i = 1, ..., N indexes the observations, l(y, η) denotes the negative log-likelihood contribution of an observation, w_i are the observation weights, and the tuning (shrinkage) parameter λ controls the overall strength of the penalty. We excluded a broader elastic net regularization, as it did not show any significant gain in our initial experiments and added complexity to the model, making it less interpretable. Due to the class imbalance with only 20.13% positive cases in the development set, the random oversampling examples (ROSE) [41] technique was used.
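To make the penalized objective concrete, the sketch below evaluates a weighted logistic negative log-likelihood averaged over observations plus λ times the L1 norm of the coefficients. It only evaluates the objective and does not fit a model; the function names and toy data are hypothetical, and the 1/N scaling is an assumption that follows the usual formulation of this kind of penalized regression.

```python
import math


def logistic_nll(y, eta):
    """Negative log-likelihood of one binary outcome y in {0, 1}.

    With p = 1 / (1 + exp(-eta)), this equals -[y*log(p) + (1-y)*log(1-p)].
    """
    return math.log1p(math.exp(eta)) - y * eta


def lasso_objective(beta0, beta, X, y, lam, weights=None):
    """Average weighted loss plus lam * sum(|beta_j|); the intercept is not penalized."""
    n = len(y)
    weights = weights or [1.0] * n
    loss = sum(
        w * logistic_nll(yi, beta0 + sum(b * x for b, x in zip(beta, xi)))
        for w, xi, yi in zip(weights, X, y)
    ) / n
    penalty = lam * sum(abs(b) for b in beta)
    return loss + penalty


# Toy data: two observations, two features.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, 0]
null_value = lasso_objective(0.0, [0.0, 0.0], X, y, lam=0.5)  # equals log(2)
```

With all coefficients at zero the penalty vanishes and the objective reduces to log 2, the loss of a model that predicts 0.5 for everyone; larger λ makes any non-zero coefficient more expensive.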
Additionally, prior knowledge of predictors (blood tests) was integrated in the modelling by introduction of penalty factors for each coefficient β_j, j = 1, ..., p, which depended on blood test prices (these vary from test to test, with the most expensive costing more than twice the price of the cheapest one; Table S1). The penalty term for the coefficients β_j, j = 1, ..., p, can be described by the following equation:

\[ \lambda \sum_{j=1}^{p} v_j \lvert \beta_j \rvert , \]

where v_j represents the penalty factor of coefficient j. An additional user-defined parameter p_max was used in the above-described framework to limit the parameter λ in a way that the maximum number of selected features in the model cannot exceed p_max. This parameter can be seen as an "early stopping" parameter, as it stops the λ cross-validation tuning as soon as the number of selected features exceeds p_max, thus providing a much higher level of interpretability and generalizability of the model. To compare results with a nonlinear-based solution, an optimized tree learning-based distributed gradient boosting framework called XGBoost [42] was used with the full set of 252 features. Its parameters were selected by choosing the best performing set of parameter values from 100 repeated evaluations of randomly sampled parameter values on a fixed training and validation set. It has to be noted that ensemble-based models are less interpretable and are not preferred in cases where model explanation can be of practical use (in our case for the clinician treating the GIT surgery patient or for extracting new knowledge that could lead to new guidelines).
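The "early stopping" role of p_max described in this section can be sketched as a filter on the regularization path: candidate λ values are visited from the strongest to the weakest penalty, and the search stops once a model would select more than p_max features. The function name and the toy path are hypothetical; in practice the feature counts would come from fitted models.

```python
def admissible_lambdas(path, p_max):
    """Return the lambdas whose models select at most `p_max` features.

    `path` is a list of (lam, n_selected) pairs ordered from the largest
    lam (fewest features) to the smallest; the scan stops at the first
    model that exceeds p_max.
    """
    kept = []
    for lam, n_selected in path:
        if n_selected > p_max:
            break
        kept.append(lam)
    return kept


# Toy regularization path (not study results).
toy_path = [(0.20, 0), (0.10, 4), (0.05, 12), (0.02, 23), (0.01, 41)]
candidates = admissible_lambdas(toy_path, p_max=20)  # [0.20, 0.10, 0.05]
```

Cross-validation would then choose the best λ among the remaining candidates only, which caps the model size regardless of where the unrestricted optimum lies.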

Cost-Efficient Feature Penalization.
The prices for each blood test were obtained from the Department of Medical Biochemistry, Oslo University Hospital. The blood tests with lower prices were assigned lower feature-specific penalization coefficients v_j, calculated as v_j = r_j / r_max, with r_max representing the price of the most expensive blood test (leukocytes and thrombocytes at 58 NOK) and r_j representing the price of blood test j. The lowest value of v_j was calculated for the group of glucose blood test-related features (23 NOK).
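A minimal sketch of price-based penalty factors follows, using only the three prices quoted in this section. It assumes the orientation in which cheaper tests receive smaller factors, that is, each factor is the test price divided by the highest price, because this is the reading consistent with glucose (the cheapest test quoted) receiving the lowest value. The function names and the example coefficients are hypothetical.

```python
# Prices quoted in the text, in NOK; other tests are omitted because
# their prices are not given here.
prices_nok = {"leukocytes": 58, "thrombocytes": 58, "glucose": 23}


def penalty_factors(prices):
    """Scale each price by the highest price (assumed orientation: cheap = small)."""
    r_max = max(prices.values())
    return {test: price / r_max for test, price in prices.items()}


def weighted_l1_penalty(beta, factors, lam):
    """lam * sum_j v_j * |beta_j| for coefficients keyed by test name."""
    return lam * sum(factors[name] * abs(b) for name, b in beta.items())


factors = penalty_factors(prices_nok)
cheapest = min(factors, key=factors.get)  # "glucose"

# Equal-sized coefficients cost less on the cheaper test.
example_beta = {"leukocytes": 0.5, "glucose": -0.5, "thrombocytes": 0.0}
penalty = weighted_l1_penalty(example_beta, factors, lam=0.1)
```

Under this scaling a coefficient on a glucose feature is penalized at roughly 40% of the rate applied to a leukocyte feature, so the lasso prefers the cheaper test when both carry similar information.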

Experimental Setup.
A repeated holdout cross-validation approach on the model development data set, using 80% of the data for training and 20% for validation, was used in order to ensure the generalizability of the predictive model results.
The holdout cross-validation was repeated 100 times in each experiment to obtain mean values and 95% confidence intervals (CI) for each performance metric (Table 2). The following widely used performance metrics were used in all experiments (Tables 2 and 3): area under the ROC curve (AUC) as the primary evaluation metric; area under the precision-recall curve (AUPRC) as a secondary evaluation metric, since it summarizes the PPV (i.e., the ratio of correctly classified positive instances to the number of all instances classified as positive) over sensitivity into one number and can often be more informative than AUC in cases of unbalanced datasets [43]; and threshold, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). The same performance metrics were also used on the test set (n = 228) to evaluate the final model built on the model development set (n = 909).
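The repeated holdout procedure above can be sketched with the standard library alone. The sketch is illustrative: the model-fitting step is omitted (fixed scores stand in for model predictions), the AUC is computed through its rank interpretation, and the normal-approximation confidence interval is an assumption, since the text does not state how the 95% CI was derived. All function names are hypothetical.

```python
import random
import statistics


def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def holdout_split(n, train_share, rng):
    """Shuffle indices 0..n-1 and split them into training and validation parts."""
    idx = list(range(n))
    rng.shuffle(idx)
    cut = int(round(train_share * n))
    return idx[:cut], idx[cut:]


def mean_and_ci95(values):
    """Mean with a normal-approximation 95% interval (assumed CI method)."""
    m = statistics.mean(values)
    half = 1.96 * statistics.stdev(values) / len(values) ** 0.5
    return m, (m - half, m + half)


def repeated_holdout(labels, scores, repeats=100, train_share=0.8, seed=0):
    """Average validation AUC over repeated random 80/20 splits."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(repeats):
        _train, valid = holdout_split(len(labels), train_share, rng)
        # A real pipeline would fit the model on `_train` here.
        y_valid = [labels[i] for i in valid]
        if len(set(y_valid)) < 2:
            continue  # AUC is undefined without both classes
        aucs.append(auc(y_valid, [scores[i] for i in valid]))
    return mean_and_ci95(aucs)


train_idx, valid_idx = holdout_split(909, 0.8, random.Random(0))
```

For the development set of 909 patients an 80/20 split leaves 727 patients for training and 182 for validation in each repetition.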

Results Using Penalized Logistic Regression.
Initially, experiments using L1-penalized logistic regression (lasso) were conducted without imposing a limit p_max on the number of selected features in the λ cross-validation tuning step, with different imputation methods and different observation intervals. The results showed that, for each observation interval (60, 30, or 15 days), the three imputation methods (LOCF, KNN, and the combination of LOCF and KNN) performed similarly; more precisely, AUC, AUPRC, and PPV were generally less than 1% apart within the same observation interval. Considering that the combined LOCF and KNN imputation method often performed better in terms of PPV, we chose it for use in further experiments. These experiments were followed by experimental runs with different p_max values ranging from 10 to 100 in steps of 10 to find the best balance between the interpretability and predictive performance of the model. When observing the influence of relaxing the restriction on the maximum number of features on the mean AUC, stabilization of the mean AUC was observed at a p_max of 50 (Table 2). Initial experiments included the use of the ROSE algorithm to oversample the minority class because of the class imbalance (20.13% positive cases), but this led to an average loss in AUC of 0.5%; therefore, we did not include rebalanced data in further modelling. The first results (CRP), with an AUC of 0.797, were obtained from a very simple (baseline) model, where only three features representing the mean CRP value in the L, M, and S windows were used together with age and sex variables. In the basic lasso model (BLM), only the values of the recorded blood tests were used (mean value, slope, number, and proportion of abnormal values for each time frame interval). Since there were 14 blood tests and 3 time windows available for each of the three types of generated features, the data set consisted of 126 features. Together with age and sex, there were 128 features available to build the BLM.
As can be observed from Table 2, the mean AUC value for the BLM with p_max restricted to 20 features decreased only slightly, from 0.947 to 0.943, compared to the unrestricted model optimized for maximal AUC.
In the next step, the models were built using additional features (e.g., number and window-specific proportion of blood tests performed in a specific window), where a significant increase in all evaluation metrics can be observed, indicating that the features describing the number and proportion of blood tests in a specific time window represent an important contribution to the model. The mean AUC value of the full lasso model (FLM) increased to 0.954 and, surprisingly, was slightly higher (0.956) for p_max set to 50, whereas AUPRC increased from 0.819 to 0.821 in the same way. This represents a gain (∼1%) in AUC compared to the BLM.

Results Using the Price Penalized Model.
The next experiment, the price penalized model (PPM), aimed at producing a more cost-effective model by taking into account the price of blood tests. When comparing the unrestricted PPM and the model restricted to a maximum of 20 features, a slight decrease in all evaluation metrics can be observed. Interestingly, the mean number of features included in the unrestricted PPM also decreases, from 32.6 features in the FLM to 25.2 features in the PPM, which points to the fact that two or more tests might have been replaced by single more-expensive tests.
The restriction of the model to a maximum number of features is even more reasonable, although a slight decrease

Results Using Extreme Gradient Boosting of Decision Trees
Finally, the extreme gradient boosting of decision trees was tested on the full set of features. The best results were obtained using an ensemble of 20 decision trees with a maximum depth of 10. The mean AUC value of 0.954 is comparable to that of the FLM with a mean number of features at 32.6.

Model Selection.
Despite the comparable performance of the boosted decision trees, the FLM was chosen on the basis of lower complexity and higher interpretability. The most frequent features in 100 iterations of the FLM evaluation and the signs of their regression coefficients are shown in Table 4. It can be observed that 8 features are included in the model for most of the iterations (above 95%). More precisely, CRP in the medium window is the only feature for which the mean parameter value was selected. This makes sense also from a medical point of view, since CRP is one of the most common indicators of systemic inflammation and is often increased due to infection. Three other selected features from the medium window are related to the number of performed tests (leukocytes and sodium), five features present the proportional number of respective tests (short window hemoglobin, medium window thrombocytes, hemoglobin, and long window leukocytes), and the final one presents the slope of albumin in the long window. The number and proportion of performed tests probably relate to the attending physician's instinct or experience. More specifically, the number of tests for counting leukocytes could indicate the attending physician's assumption of a possible infection.

Model Testing.
The final models, built on the model development data set (n = 909), were evaluated using the test set (n = 228) with 50 positive cases for the FLM using different p_max values. The optimal result in terms of AUC/AUPRC was, surprisingly, obtained at p_max = 10 and p_max = 20: some performance metrics were better with p_max = 10, such as an AUPRC of 0.882, a specificity of 0.971, and a PPV of 0.667, while others were better with p_max = 20, such as an AUC of 0.967 (Table 3). Additionally, it can be seen that, at p_max = 20, six more predictors were selected than at p_max = 10, for a total of 10 predictors, and more cases were predicted as positive, for a total of 70 (Table 3). A graphical representation of the test set evaluation in terms of AUC and PPV at different p_max values is also shown (Figure 2). When looking at the selected features, we can see that four positive predictors were selected in all models: hemoglobin and leukocytes number of tests in the medium window, potassium number of tests in the medium window, and thrombocytes proportional number of tests in the medium window.

Discussion
Norway has one of the highest rates of SSIs in the gastrointestinal tract (GIT), which is consistent with a very high incidence of colorectal cancer that is the most common cause of GIT surgery [12]. Our results support the practical applicability of a combination of blood test results and the temporal testing pattern in predicting SSIs in a clinical setting. More specifically, in addition to the values of a given blood test per se (e.g., CRP), our findings demonstrate that additional extracted features (e.g., number and window-specific proportion of blood tests performed in a specific window) can be very informative with regard to predicting SSIs.
We are aware that the manual selection of temporal windows may have introduced some bias in evaluation of the solutions and is a limitation of the study, but we believe that the sample is big enough to reflect the trends present also in the test set. Future work will replace the manual selection with an automatic selection process. One general algorithm considered is the maximum distance between windows approach. It is a two-step approach, where we firstly select the best partitioning into k temporal windows in terms of the highest score function, where the f function could be as simple as multiplication of distance functions. For all the k-partitionings of a fixed k ≤ n, we then select the k-partitioning with the highest score function value. The second step is using the first step for each of 1 ≤ k ≤ n partitionings, giving us the optimal partitioning. One drawback of such an approach could be computational intensity, since for an n = 60-day window and k = 3 partitions, a total of 34,220 partitions of a specific blood test are possible. Since testing is mostly guided by judgment of clinicians, which is in turn based on patient history, clinical signs, previous blood tests, results of diagnostic imaging, general patient observation, etc., we believe that our choice to include information on testing patterns enabled us to indirectly include judgment by expert clinicians into our predictive model. Moreover, we believe that, due to its intuitive nature, clinicians could easily relate to and accept our predictive model.
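The figure of 34,220 quoted above matches the binomial coefficient for choosing 3 out of 60 days, which the short check below reproduces. The reading of the count as "choose k days out of n" is an inference from the number itself, and the multiplication by 14 blood tests is an added illustration of how the search space would grow if every test were partitioned separately.

```python
import math

n_days, k_windows, n_tests = 60, 3, 14

# Number of ways to choose 3 of the 60 days: 60 * 59 * 58 / 6.
per_test = math.comb(n_days, k_windows)

# Candidate partitionings to score if each of the 14 tests is handled separately.
all_tests = n_tests * per_test
```

Even at this modest size, repeating the search for every k from 1 to n multiplies the work further, which is the computational drawback noted in the text.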
With regard to values, among all tests, CRP values above normal played a predominant role in our model. From a mechanistic point of view, increased CRP is highly suggestive of an underlying inflammatory process. Our findings indicate that, also in the period preceding operation, a higher-than-normal CRP value raises the probability that a patient will develop SSI. The mechanistic substrate for this observation might be that the underlying inflammatory process indicates an increased susceptibility toward infections or even contributes to their development.
Recently, it has been suggested that signs of preoperative inflammation may predict postoperative infectious complications, specifically in patients undergoing colorectal surgery [30,44,45]. At least 4 out of our 6 positive predictors may be viewed as markers of inflammation, namely, the mean CRP value in the medium window, leukocytes with high proportion of abnormal values/number of tests in the long window, and leukocytes number of tests in the medium window. Local inflammation impairs the healing process, and systemic inflammation interferes with the immune response. The fact that our positive predictors pointed to inflammation or a suspected inflammation in the long or medium period might suggest that, during this time, they were a better marker of chronic inflammation than just shortly before the procedure. The predictive roles of thrombocytes and potassium remain to be confirmed and explained in future studies. However, it is tempting to speculate that low thrombocyte counts may directly impede the healing process, whereas high numbers may indicate or even modulate inflammation [46], and that the number of potassium concentration tests reflects more fragile patients with disturbed water and electrolyte homeostasis, e.g., due to abnormal ADH secretion or kidney disease.
Our study included only GIT surgery patients, mainly due to the high risk of SSIs in this group [6]. It is reasonable to speculate that a similar predictive model could be developed and used for other types of surgeries. A recent study indeed found that CRP, together with preoperative levels of albumin, hemoglobin, and signs of mild or moderate kidney failure, was significantly associated with the odds of SSI in a diverse group of patients undergoing general, oncologic, trauma, and vascular surgery procedures [32]. Even so, it remains to be investigated systematically whether in these and other groups of patients, inclusion of information on testing patterns will result in better predictive models. Let us now discuss the wider importance of studies similar to ours that try to shed more light on the predictive power of preoperative blood tests in prediction of SSI. It is generally accepted that effective guidelines are key in prevention of SSI. On one hand, clinical guidelines pave the way for specific tests that need to be conducted in a certain patient prior to a specific surgery. On the other hand, the demand for specific tests, which are not always even conducted by the hospital where the patient is hospitalized, mandates the overall cost for a battery of tests. In the case of preoperative blood tests, of which none costs more than 2€ per measurement per patient, we can easily calculate that even a battery of 10 tests conducted three times prior to surgery (which in most cases costs at least a couple of 1000€, depending on surgery type) would cost only 60€.
This means we can probably perform as many blood tests as we want and still provide a cheaper solution for the insurance companies (and hospitals) compared to the complications and increasing costs connected with SSI. On the other hand, by preventing SSI preoperatively, we at the same time also prevent unnecessary patient complications, lowered quality of life, and, finally, prolonged stays away from the respective patients' socioeconomic environment. Longer hospitalization times on one hand tend to induce longer socioeconomic reintegration times, especially in the elderly, while on the other hand, they lead to significantly increased overall treatment costs.

Conclusions
Our solution is based on abstraction of preoperative blood test data and use of L1-penalized logistic regression to predict SSI. Additional solutions were developed to improve the model's practical interpretability (for the treating clinician as well as for the medical informatician) and to include the cost-effectiveness aspect. The model with the best solution clearly indicates that a specific and easily interpretable SSI-related feature (CRP) has to be paired with rather indirect features related to the available temporal data (number, proportion of tests, etc.). In our opinion, this nicely captures the "treating clinician's suspicion" based on the overall patient evaluation over the course of the period preceding surgery. The model that included CRP only resulted in an AUC of ∼80%, whereas models that additionally included the indirect features reached an AUC of ∼95%.
Considering the high incidence of SSI (especially related to GIT surgery) and the continuous efforts of EU officials to decrease this value, our model could contribute to new guidelines in the field of preoperative SSI prevention and surveillance.
Data Availability
The surgical site infection data recorded in the EHR at the Gastrointestinal Surgery Department in the University Hospital of North Norway used to support the findings of this study have not been made available because of the sensitive nature of data used in this study.

Conflicts of Interest
The authors declare that they have no conflicts of interest.