Reliability and validity of multicentre surveillance of surgical site infections after colorectal surgery

Surveillance is the cornerstone of surgical site infection (SSI) prevention programs. Valid data collection and awareness of vulnerability to inter-rater variation are crucial for correct interpretation and use of surveillance data. The aim of this study was to investigate the reliability and validity of SSI surveillance after colorectal surgery in the Netherlands. In this multicentre prospective observational study, seven Dutch hospitals performed SSI surveillance for colorectal surgeries performed in 2018 and/or 2019. Alongside the surveillance, a local case assessment was performed to calculate the overall percentage agreement between raters within hospitals. Additionally, two case-vignette assessments were performed to estimate intra-rater and inter-rater reliability by calculating weighted Cohen’s Kappa and Fleiss’ Kappa coefficients. To estimate validity, answers to the two case-vignette questionnaires were compared with those of an external medical panel. In total, 1111 colorectal surgeries were included in this study, with an overall SSI incidence of 8.8% (n = 98). From the local case assessment, the overall percent agreement between raters within a hospital was good (mean 95%, range 90–100%). The Cohen’s Kappa estimates for the intra-rater reliability of case-vignette review varied from 0.73 to 1.00, indicating substantial to almost perfect agreement. The inter-rater reliability within hospitals showed more variation, with Kappa estimates ranging from 0.61 to 0.94. Overall, 87.9% of the answers given by the raters were in accordance with the medical panel. This study showed that raters were consistent in their SSI ascertainment (good reliability), but that accuracy can be improved (moderate validity). Accuracy of surveillance may be improved by providing regular training, adapting definitions to reduce subjectivity, and by supporting surveillance through automation.


Introduction
Surgical site infections (SSIs) are among the most common healthcare-associated infections (HAIs) [1] and are associated with substantial morbidity and mortality, increased length of hospital stay and costs [2][3][4][5][6]. The highest SSI incidences are reported after colorectal surgeries, possibly due to the risk of (intra-operative) bacterial contamination and post-operative complications [7][8][9]. Worldwide, incidence rates range from 5 to 30% and are affected by several risk factors, including the type of surgery, age, sex, underlying health status, diabetes mellitus, blood transfusion, ostomy creation and prophylactic antibiotic use [10][11][12], as well as by the definition used to identify SSIs [4,13]. Surveillance is an important component of prevention initiatives, and most surveillance programs include colorectal surgeries [14]. Large variability in SSI rates between centres remains, even after correction for factors that increase the risk of SSIs. Previous studies reported significant variability in surveillance methodology and in inter-rater agreement, introducing uncertainty as to whether observed differences in colorectal SSI rates reflect real differences in hospital performance [15][16][17][18][19][20][21].
For the purpose of comparing SSI rates between hospitals, strict adherence to standardized surveillance protocols is required. Furthermore, case definitions should be unambiguous to avoid subjective interpretation. To reduce subjectivity, the Dutch national surveillance network (PREZIES) has modified the case definition on two criteria compared with the definitions set out by the US Centers for Disease Control and Prevention (CDC) and the European Centre for Disease Prevention and Control (ECDC) [22][23][24][25]. First, a diagnosis of an SSI made solely by a surgeon or attending physician is not incorporated in the Dutch definitions. Second, in case of anastomotic leakage or bowel perforation, a deep or organ-space SSI can only be scored on the basis of purulent drainage from the deep incision, or when an abscess or other evidence of infection involving the deep soft tissues is found on direct examination; a positive culture obtained from the (deep) tissue is not applicable in case of anastomotic leakage. Moreover, to increase standardization, the Dutch surveillance only includes primary resections of the large bowel and rectum, in contrast to the (E)CDC definitions, which also allow biopsy procedures, incisions, colostomies and secondary resections.
Awareness of the correctness of applying the definition and of vulnerability to inter-rater variation is crucial for correct interpretation and use of surveillance data. The aim of this study was to investigate the reliability and validity of SSI surveillance after colorectal surgery using the Dutch (PREZIES) SSI definitions and protocol. Secondary aims were to report the accuracy of determining anastomotic leakage and to provide insight into the SSI incidence and epidemiology in the Netherlands.

Study design
In this multicentre prospective observational study, seven Dutch hospitals (academic (tertiary referral university hospital) n = 2; teaching n = 3; general n = 2) collected surveillance data on the occurrence of SSI after colorectal surgeries performed in 2018 and/or 2019, according to the Dutch PREZIES surveillance protocol [23,25,26]. Three hospitals had no prior experience in performing SSI surveillance after colorectal surgeries; four hospitals had already performed this surveillance for more than five years as part of their quality program. Participation in SSI surveillance after colorectal surgery is voluntary; hence, not all hospitals include it in their surveillance programme. Alongside the surveillance, intra- and inter-rater reliability and validity were determined by two case-vignette assessments and a local case assessment. Reliability refers to the consistency and reproducibility of SSI ascertainment and was determined by three agreement measures: (1) intra-rater reliability, reflecting the agreement within a single rater over time; (2) inter-rater reliability, the agreement between the two raters within one hospital; and (3) overall inter-rater reliability among all 14 raters of the seven hospitals [27,28]. Validity refers to how accurately the surveillance definition is applied and was determined by comparing case ascertainment with the judgement of a medical panel, as described in detail below. The Medical Ethical Committee of the University Medical Centre Utrecht approved this study and waived the requirement of informed consent (reference number 19-493/C). All data were processed in accordance with the General Data Protection Regulation. Hospitals were randomly assigned the letters A-G for reporting of the results.

SSI surveillance after colorectal surgery
All hospitals included all primary colorectal resections of the large bowel and rectum performed in 2018 and/or 2019 in patients above the age of 1 year. Per hospital, two raters, mostly infection control practitioners (ICPs), retrospectively reviewed the electronic medical records for all included procedures and classified each procedure into one of three categories: (1) no SSI, (2) superficial SSI or (3) deep or organ-space SSI within a follow-up period of 30 days post-surgery. SSIs were registered in each hospital's own surveillance registration system. All identified SSIs and questionable cases were validated and discussed with each facility's medical microbiologist or surgeon after completion of the assessments described below.

Case-vignette assessment
Case-vignettes were used to assess the validity, intra-rater and inter-rater reliability. Four medical doctors developed standardised case-vignettes in Dutch, based on 20 patients selected from a previous study [29]. Each vignette described demographics, the medical history, the type of surgical procedure and the postoperative course. An external medical panel of seven experts in the field of colorectal surgery and surveillance classified the case-vignettes as a superficial SSI, deep SSI, or no SSI according to the Dutch SSI definition, and indicated the presence or absence of anastomotic leakage. Their conclusion was considered the reference standard. Each rater who performed surveillance completed the case-vignettes individually through an online questionnaire. Three months later, the same vignettes were judged once more by the same raters, but presented in a different random order.

Local case assessment
The reliability of surveillance data also depends on the ability to find the information necessary for case-ascertainment in the medical records. As this is not measured by the case-vignettes, we additionally performed a local case assessment: within each hospital, 25 consecutive colorectal surgeries included in surveillance were scored independently by the two raters, on separate digital personal forms. After sending the completed forms to the research team, raters discussed the results and entered the final decision into their hospital's surveillance registration system.
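The overall percentage agreement from such a paired assessment is simply the share of cases on which both raters reached the same conclusion. A minimal sketch, written in Python (the study's own analyses were performed in R) with hypothetical data and the study's three outcome categories:

```python
# Illustrative sketch (not the study's code): overall percent agreement
# between two raters who independently classified the same procedures.
# The example ratings below are hypothetical.

def percent_agreement(rater1, rater2):
    """Proportion of cases on which both raters gave the same classification."""
    if len(rater1) != len(rater2):
        raise ValueError("Both raters must score the same set of cases")
    matches = sum(a == b for a, b in zip(rater1, rater2))
    return matches / len(rater1)

# Categories as in the study: no SSI, superficial SSI, deep/organ-space SSI
rater_a = ["none"] * 8 + ["superficial", "deep"]
rater_b = ["none"] * 8 + ["superficial", "none"]
print(f"{percent_agreement(rater_a, rater_b):.0%}")  # 9 of 10 agree -> 90%
```

Note that percent agreement ignores agreement expected by chance, which is why the case-vignette assessments additionally use Kappa statistics.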

Training
Before starting the surveillance activities, a training session was organized to ensure the quality of the data collection and to practice SSI case ascertainment. In addition, before starting the reliability assessments, each rater had to complete at least 20 inclusions for surveillance to ensure familiarity with the surveillance procedure. The research team was available to assist with any questions.

Statistical analyses
Descriptive statistics were generated to describe the surveillance period, the number of inclusions and the epidemiology. The number of SSIs per hospital was reported and displayed in funnel plots. The primary outcomes of this study were the reliability and validity of the surveillance. From the case-vignette assessments, intra-rater and inter-rater reliability were analysed by calculating a weighted Cohen's Kappa coefficient (κ). The scale used to interpret the κ estimates was as follows: ≤ 0, no agreement; 0.01-0.20, slight agreement; 0.21-0.40, fair agreement; 0.41-0.60, moderate agreement; 0.61-0.80, substantial agreement; 0.81-1.00, almost perfect agreement [27]. For the inter-rater reliability within a hospital, we used the second questionnaire round of the case-vignettes to account for a possible learning curve over time. The overall inter-rater reliability among all 14 raters was estimated using a weighted Fleiss' Kappa. For all Kappas, 95% confidence intervals were estimated using bootstrapping (1000 repetitions). Inter-rater reliability was also measured in the local case assessment, from which the overall percentage agreement was calculated per hospital. Validity was determined by comparing the answers to the two case-vignette questionnaires with the answers of the medical panel. The same comparison was performed to investigate the accuracy of determining anastomotic leakage. Analyses were performed with R version 3.6.1 (R Foundation for Statistical Computing, Vienna, Austria) [30], using the irr package [31] for inter-rater reliability and the boot package [32] for bootstrapping.
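To illustrate the analysis approach, the sketch below computes a weighted Cohen's Kappa for two raters scoring the ordered categories (no SSI < superficial < deep/organ-space), together with a percentile bootstrap confidence interval. It is a simplified re-implementation in Python with hypothetical data, not the study's R code (which used the irr and boot packages); the linear weighting scheme is an assumption for illustration.

```python
# Illustrative sketch: linearly weighted Cohen's kappa and a percentile
# bootstrap CI. Categories: 0 = no SSI, 1 = superficial, 2 = deep/organ-space.
# All data below are hypothetical; the study itself used R (irr, boot).
import random

def weighted_kappa(r1, r2, k=3):
    """Linearly weighted Cohen's kappa for two equal-length rating vectors."""
    n = len(r1)
    # Observed disagreement, weighted by the category distance |i - j|
    obs = sum(abs(a - b) for a, b in zip(r1, r2)) / n
    # Expected disagreement under independence of the two raters
    p1 = [r1.count(c) / n for c in range(k)]
    p2 = [r2.count(c) / n for c in range(k)]
    exp = sum(abs(i - j) * p1[i] * p2[j] for i in range(k) for j in range(k))
    return 1.0 if exp == 0 else 1 - obs / exp

def bootstrap_ci(r1, r2, reps=1000, alpha=0.05, seed=1):
    """Percentile bootstrap CI, resampling the rated cases with replacement."""
    rng = random.Random(seed)
    n = len(r1)
    stats = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(weighted_kappa([r1[i] for i in idx], [r2[i] for i in idx]))
    stats.sort()
    return stats[int(reps * alpha / 2)], stats[int(reps * (1 - alpha / 2)) - 1]

# Two raters scoring the same 20 case-vignettes (hypothetical ratings)
rater1 = [0] * 12 + [1, 1, 1, 2, 2, 2, 0, 1]
rater2 = [0] * 12 + [1, 1, 2, 2, 2, 0, 0, 1]
kappa = weighted_kappa(rater1, rater2)
lo, hi = bootstrap_ci(rater1, rater2)
print(f"weighted kappa = {kappa:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

In practice the R functions `irr::kappa2` (two raters, weighted) and `irr::kappam.fleiss` (all 14 raters) compute these coefficients directly; the sketch only makes the underlying calculation explicit.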

Reliability and validity
All 14 raters completed both rounds of the online case-vignette questionnaire. Of these, two had less than one year of experience with HAI surveillance, six had 2-5 years, five had 6-15 years and one had more than 25 years. The Cohen's Kappa estimates for agreement within a rater (intra-rater reliability), calculated from the case-vignette assessment, varied from 0.73 to 1.00, indicating substantial to almost perfect agreement (Table 3). The inter-rater reliability within hospitals showed more variation, with the lowest estimate in hospital A (κ = 0.61, 95%-CI 0.23-0.83) and the highest in hospital C (κ = 0.94, 95%-CI 0.75-1.00). The overall inter-rater agreement of all 14 raters in the second case-vignette round was 0.72 (95%-CI 0.59-0.83). From the local case assessment, the overall percent agreement between raters within a hospital was high (mean 95%, range 90-100%).
Regarding the accuracy of determining SSIs, 87.9% (range 70%-95%) of the answers given by the raters were in accordance with the medical panel: three raters had SSI rates similar to the medical panel, five raters underestimated the number of SSIs, four had higher SSI rates because of incorrect ascertainment, and two raters overestimated SSIs in the first round but underestimated them in the second. The presence of anastomotic leakage was accurately scored in the vignettes where it was present, but was misclassified in some cases where anastomotic leakage was absent (Table 3).

Discussion
In this study we observed good reliability of SSI surveillance after colorectal surgeries in seven Dutch hospitals. Based on the case-vignette assessment, intra-rater reliability was substantial to almost perfect (κ = 0.73-1.00), and inter-rater agreement within hospitals was substantial but varied between hospitals (κ = 0.61-0.94). The local case assessment showed 95% agreement within hospitals. Although individual raters were consistent in their scoring, validity was moderate: in 12.1% of cases (range 5%-30%), the case ascertainment did not match the conclusions of the medical panel. The SSI rate determined by surveillance may therefore be under- or overestimated.
To the best of our knowledge, only one other study has explicitly assessed inter-rater reliability for SSI after colorectal surgeries. Hedrick et al. [18] concluded from their results that SSIs could not be reliably assigned and reproduced: they demonstrated large variation in SSI incidence between raters, with only modest inter-rater reliability (κ = 0.64). They therefore advocated alternative definitions such as the ASEPSIS score [33]. In the present study, similar estimates for inter-rater reliability were found in two of the seven hospitals (κ = 0.61 in hospital A and κ = 0.65 in hospital E); for the other five hospitals we found estimates above 0.69. The higher reliability estimates found in the present study may be explained by several factors. First, the definitions and methods used in the Netherlands aim to be more objective: a previous study has shown that a surgeon's diagnosis (not included in the Dutch definition) leads to biased results [34,35]. Another factor that may influence reliability is the raters' years of surveillance experience and their ability to find the information needed for case ascertainment in the electronic health records [36]. Table 3 suggests that more experienced raters produce more consistent results; however, the design of this study did not allow investigation of such causal relationships.
The reliability estimates of this study show that SSIs after colorectal surgery are an appropriate measure for surveillance: the same result can be achieved consistently, making the measure reproducible and suitable for monitoring trends and detecting changes in SSI rates within a hospital. However, at this moment, using SSI incidence as a quality measure for benchmarking may be hampered for three reasons. First, we found that on average 12.1% of patients in the case-vignettes were misclassified: one rater misclassified 6 of 20 vignettes, while another had only one misclassification. This will lead to unreliable comparisons of SSI rates, although in practice difficult cases may be discussed in a team, thereby improving accuracy. As superficial SSIs rely on more subjective criteria, focusing on deep SSIs may improve accuracy and comparability. Additionally, we observed that anastomotic leakage was too often recorded when it was actually absent. This may lead to an underestimation, as such cases can no longer be scored on the basis of a positive culture under the Dutch definition (as explained in the introduction). Second, Kao et al. [16] and Lawson et al. [15] investigated whether SSI surveillance after colorectal surgeries can reliably differentiate high- and low-quality performance (i.e. the statistical reliability of SSIs). Both concluded that the measure can only be used as a hospital quality measure when an adequate number of cases has been reported, which can be challenging for some hospitals, as shown in Table 1. Third, another challenge in using SSI rates for interhospital comparisons is the lack of a sufficient method for risk adjustment. To obtain valid SSI comparisons, one must correct for differences in the surveillance population and its risk factors; however, to date no method has been proven generalizable and appropriate [12,37].
The points raised above show that the overall SSI incidence of 8.8% in this study is difficult to compare with other reports. Overall, the SSI incidence was lower than in other studies, but in line with numbers previously reported to the Dutch national surveillance network [13,38,39]. When SSIs after colorectal surgery are used for monitoring and perhaps benchmarking, continuous training of raters is required to ensure correct use and alignment of surveillance definitions and methodology. Reliability and validity of surveillance may be improved by automation, which can help support case-finding [40][41][42]. Furthermore, hospitals should perform a sufficient number of colorectal surgeries to generate representative estimates of performance. In the absence of appropriate case-mix correction, comparisons should be made with caution, preferably between similar types of hospitals with comparable patient groups.

Strengths and limitations
This study was performed in multiple Dutch centres, including different types of hospitals. The 14 raters were trained according to standardized methods to minimize differences between hospitals possibly caused by varying years of surveillance experience. However, this design was not suitable for explaining which factors enhance SSI ascertainment or improve reliability and validity estimates. Second, we aimed to produce Cohen's Kappa coefficients from the local case assessment as well; however, there was too little variation in outcomes, and too few cases, to permit this calculation.

Conclusion
Awareness of the validity of surveillance and of vulnerability to inter-rater variation is crucial for correct interpretation and use of surveillance data. This study showed that raters were consistent in their SSI ascertainment, but improvements can be made regarding accuracy. Hence, SSI surveillance results for colorectal surgery are reproducible and thus suitable for monitoring trends, but not necessarily correct and therefore less adequate for benchmarking. Based on prior literature, the accuracy of surveillance may be improved by providing regular training, adapting definitions to reduce subjectivity, and by supporting case-finding through automation.