Introduction

Worldwide, acute appendicitis is one of the most common surgical emergencies [1]. The diagnostic goal is shifting from correctly identifying patients with acute appendicitis to differentiating between complicated and uncomplicated appendicitis. In contrast to complicated appendicitis, uncomplicated appendicitis does not require emergency surgery, since it can be treated with antibiotics [2,3,4,5] or semi-emergent appendectomy, or may even resolve spontaneously [6]. Therefore, reliable exclusion of complicated appendicitis is of paramount importance when assessing appendicitis severity [7].

Several methods have been described to distinguish between complicated and uncomplicated appendicitis [7]. While imaging modalities have demonstrated efficacy in the diagnosis of acute appendicitis, imaging alone falls short in ruling out complicated appendicitis [7, 8]. Models that combine imaging findings with clinical variables seem more promising [7, 9, 10]. However, these models do not take into account the doctor’s subjective judgement when interpreting laboratory and imaging findings. This judgement, based on experience, clinical perception, and intuition, has shown high negative predictive values (NPVs, 96–99%) for the diagnosis of acute appendicitis in children or adults before imaging [11,12,13]. Nevertheless, none of these studies investigated this judgement in the context of distinguishing between complicated and uncomplicated appendicitis, let alone the doctor’s final judgement after integrating their interpretation of laboratory and imaging findings.

Kim et al. explored the accuracy of radiologists’ judgement for the differentiation between complicated and uncomplicated appendicitis. They reported a pooled sensitivity and specificity for complicated appendicitis of 64% and 76%, respectively [14]. However, radiologists typically do not evaluate the patient directly. It could be argued that the judgement of the attending doctor at the emergency department (ED), who has extensively examined the patient, may be more accurate with respect to the differentiation between complicated and uncomplicated appendicitis, but literature on this subject is currently lacking.

The aim of the present study was to determine the accuracy of the doctor’s judgement integrating all available information including clinical, laboratory, and imaging results for diagnostic differentiation between complicated and uncomplicated appendicitis in the ED. The second objective was to determine the accuracy of this differentiation for the radiologist when specifically asked to indicate appendicitis severity — complicated or uncomplicated — based on imaging findings.

Methods

A prospective, observational study, termed the Scoring system of Appendicitis Severity (SAS) study, was conducted between January 2020 and August 2021 in 11 Dutch hospitals. The primary aim of the main study was to externally validate an objective scoring model designed for differentiation between complicated and uncomplicated appendicitis [9], and to develop the SAS [15]. The present study was a predefined sub-analysis of this SAS study [15]. Informed consent was obtained from all participating patients. Diagnostics and treatment were performed according to national or local guidelines. The updated list of the STARD 2015 guidelines was used in the design and implementation of the study [16].

Study population

Consecutive adult patients (≥ 18 years) with an imaging-confirmed diagnosis of acute appendicitis who underwent surgery with the intention of performing an appendectomy were included in the SAS study cohort. Conservatively treated patients were excluded. Clinical practice dictated the type of imaging. For each patient in the SAS study cohort, the first attending doctor at the ED was asked to provide an independent assessment of the appendicitis severity, in terms of a subjective final judgement of “uncomplicated” or “complicated” appendicitis. These doctors were consultant surgeons, surgical trainees, consultant emergency doctors, or emergency trainees. They had access to clinical, laboratory, and imaging results, and the assessment was completed prior to any surgery. Similarly, the radiologist of each case was asked to evaluate the severity of appendicitis as “uncomplicated” or “complicated,” relying solely on imaging findings. The current sub-analysis of the SAS study excluded participants from the original cohort for whom neither a doctor’s nor a radiologist’s judgement of appendicitis severity was available.

Data collection

Data regarding clinical, laboratory, and imaging findings were prospectively gathered through standard reports in the electronic health record by the attending doctor at the ED, the radiologist, the operating surgeon, and the pathologist. In addition to the assessment of appendicitis severity, doctors at the ED as well as radiologists were asked about their years of experience and the level of confidence in their appendicitis severity judgement. This confidence level was evaluated according to an 11-point Likert scale (score 0–10) and categorized as follows: a score of 7 or higher was defined as “certain” and a score of 6 or lower was interpreted as “uncertain.” All data were prospectively collected into the data collection program CASTOR EDC [17].
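The certainty categorization described above can be written as a one-line rule. The following is a minimal sketch in Python, not the study's own code; the function name `certainty_category` is hypothetical, and only the cut-off (7 or higher is "certain") comes from the text.

```python
def certainty_category(score: int) -> str:
    """Map an 11-point Likert score (0-10) to the two categories used in the text.

    A score of 7 or higher is "certain"; 6 or lower is "uncertain".
    """
    if not 0 <= score <= 10:
        raise ValueError("Likert score must be between 0 and 10")
    return "certain" if score >= 7 else "uncertain"


if __name__ == "__main__":
    # Print the category for every possible score on the scale.
    for s in range(11):
        print(s, certainty_category(s))
```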

Test definitions

The index tests were appendicitis severity according to the final judgement of the attending doctor at the ED and the assessment of the radiology imaging reader, collected as described above. The reference standard was the final diagnosis, classified as uncomplicated or complicated appendicitis, assigned by an adjudication committee based on all available data, including intraoperative and histopathological findings. The committee consisted of two surgeons, two radiologists, one pathologist, one surgeon-in-training, and one research fellow. Uncomplicated appendicitis was defined as inflammation or ulceration of the appendix or periappendix without obvious signs of necrosis or perforation [18]. Complicated appendicitis was defined as appendiceal inflammation with signs of gangrene or a perforation, large intraperitoneal infiltration, or abscess [18]. In case of discrepancy between intraoperative and histopathological findings, surgical findings were decisive, with the exception of malignancies as found at pathology examination.

The purpose of this study was not to diagnose appendicitis, but to estimate the severity of the appendicitis in patients diagnosed at the ED with acute appendicitis. Therefore, intraoperative finding of a normal appendix was categorized under the heading of uncomplicated appendicitis, whereas any urgent disease other than complicated appendicitis in need of surgery was categorized under complicated appendicitis, including appendiceal malignancy.

Outcomes

The primary outcome of this study was the accuracy of the doctor’s final judgement, based on subjective interpretation of clinical, laboratory, and imaging results, for the diagnosis of complicated appendicitis in terms of sensitivity, specificity, positive predictive value (PPV), and NPV. The secondary outcome was the accuracy of the radiologist for the diagnosis of complicated appendicitis based on imaging. Additionally, all assessment outcomes were stratified by years of experience and level of certainty of the judgement. The doctors’ outcomes were further stratified by field of expertise and by whether or not a surgeon was consulted; the radiologists’ outcomes were further stratified by imaging modality.

Statistical analyses

Normally distributed data were shown as mean with standard deviation (SD) and non-normally distributed data as median with interquartile range (IQR). A p-value < 0.05 was considered statistically significant. Inter-observer agreement between doctors and radiologists on the severity diagnosis was expressed as Cohen’s κ coefficient. This coefficient was interpreted as follows: ≤ 0.20 as none to slight agreement, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1.00 as (almost) perfect. Statistical analyses were performed using SPSS® version 26.0.
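For readers unfamiliar with Cohen’s κ, the following is a minimal sketch in Python of how the coefficient and the interpretation bands above could be computed from a cross-table of two raters. It is not the SPSS procedure used in the study. The function names and the 2 × 2 counts in the example are hypothetical and chosen only for illustration.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square agreement table (rows: rater A, columns: rater B)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement: proportion of cases on the diagonal.
    observed = sum(table[i][i] for i in range(k)) / n
    # Agreement expected by chance, from the marginal proportions.
    row_props = [sum(table[i]) / n for i in range(k)]
    col_props = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    expected = sum(r * c for r, c in zip(row_props, col_props))
    return (observed - expected) / (1 - expected)


def interpret_kappa(kappa):
    """Apply the interpretation bands listed in the text (upper bounds inclusive)."""
    if kappa <= 0.20:
        return "none to slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "(almost) perfect"


if __name__ == "__main__":
    # Hypothetical counts, NOT study data:
    # rows = rater A (uncomplicated, complicated), columns = rater B.
    example = [[40, 10], [5, 45]]
    kappa = cohens_kappa(example)
    print(round(kappa, 2), interpret_kappa(kappa))
```

With these illustrative counts, the raters agree in 85 of 100 cases while 50% agreement would be expected by chance, which is why κ is lower than the raw agreement percentage.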

Results

A total of 1371 adult patients with an imaging-confirmed diagnosis of acute appendicitis who underwent surgery with the intention of performing an appendectomy were included in the original prospective SAS cohort (Fig. 1, flowchart). In 1222 patients, a diagnostic assessment of the appendicitis severity was made. In 1070 patients, this assessment was made by the doctor at the ED based on clinical, laboratory, and imaging findings, and in 941 patients by the radiologist based on imaging findings only. For 789 patients, both the emergency doctor and the radiologist assessed the appendicitis severity preoperatively.

Fig. 1 Flowchart of included patients

Patients’ characteristics and reference standard diagnosis

Patients’ characteristics are summarized in Table 1. More than half of the patients were diagnosed with ultrasound (US) as the only imaging modality (62.1%). According to the reference standard, 805 (65.9%) patients had a non-urgent disease, of whom 790 (64.6%) had uncomplicated appendicitis and 15 (1.2%) a normal appendix. The remaining 417 (34.1%) patients had an urgent disease, of whom 150 (12.3%) had gangrenous and 228 (18.7%) perforated appendicitis. In 21 (1.7%) patients, a malignancy was found. Baseline characteristics of patients with uncomplicated versus complicated appendicitis are described in Table S2.

Table 1 Baseline characteristics of 1222 patients with an assessment of the appendicitis severity by the emergency department doctor and/or radiologist

Assessment at the emergency department

The assessment of “uncomplicated” or “complicated” appendicitis at the ED was mostly made by doctors from the surgical department (70.0%) or emergency doctors (25.9%); these data were missing for 4.1%. The majority were novice doctors with 0–1 (45.2%) or 1–3 (31.1%) years of experience, whereas 8.3% had 3–5 years of experience and 9.9% had more than 5 years; these data were missing for 5.4%. About half of the assessments of appendicitis severity were made in consultation with the surgeon on call (48.3%). The majority (81.1%) of the assessments were a “certain” decision, meaning 7 points or higher in certainty on an 11-point Likert scale (0 to 10), with a median certainty of 8 (IQR 7–9). In 14.2% of patients, the assessment was “uncertain,” and in 4.7%, this variable was missing.

At the ED, the severity of acute appendicitis was assessed by the attending doctor as “uncomplicated” in 862 patients (80.6%) versus “complicated” appendicitis in 208 patients (19.4%). Overall, these judgements were correct in 76.4%. Almost all patients with true uncomplicated appendicitis were accurately identified (656 of 701 (93.6%)), in contrast to only 163 of 369 (44.2%) patients with true complicated appendicitis (Fig. 1). This resulted in a sensitivity of 44.2%, a specificity of 93.6%, a PPV of 78.4%, and an NPV of 76.1% for the final doctor’s judgement of complicated appendicitis (Table 2).
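The four accuracy measures above follow directly from the 2 × 2 cross-table of the doctor's judgement against the reference standard. The sketch below, in Python, shows the arithmetic; it is an illustration, not the study's analysis code. The function name is hypothetical, and the false-negative and false-positive counts are derived here from the reported totals (369 − 163 and 701 − 656).

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Return sensitivity, specificity, PPV, and NPV as percentages."""
    return {
        "sensitivity": 100 * tp / (tp + fn),  # true complicated called "complicated"
        "specificity": 100 * tn / (tn + fp),  # true uncomplicated called "uncomplicated"
        "ppv": 100 * tp / (tp + fp),          # "complicated" calls that were correct
        "npv": 100 * tn / (tn + fn),          # "uncomplicated" calls that were correct
    }


# Counts for the ED doctor's judgement, taken or derived from the text.
TP = 163          # complicated, judged "complicated"
FN = 369 - 163    # complicated, judged "uncomplicated"
TN = 656          # uncomplicated, judged "uncomplicated"
FP = 701 - 656    # uncomplicated, judged "complicated"

if __name__ == "__main__":
    for name, value in diagnostic_accuracy(TP, FP, FN, TN).items():
        print(f"{name}: {value:.1f}%")
```

The derived counts are consistent with the totals reported above: TP + FP gives the 208 patients judged “complicated,” and TN + FN gives the 862 patients judged “uncomplicated.”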

Table 2 Accuracy of appendicitis severity as assessed by the doctor at the emergency department

Severity assessments that were indicated as “certain” scored higher in specificity and PPV than those of lower certainty (Table 2). Furthermore, doctors with more than 5 years of experience achieved the highest specificity and PPV for complicated appendicitis; accuracy was comparable among patients for whom the surgeon on call had been consulted. Counterintuitively, doctors with ≤ 1 year of experience had higher specificity than those with 1–5 years of experience. Finally, no differences were observed between surgery and emergency medicine doctors, although consultant doctors scored better than trainees, in particular for PPV of complicated appendicitis (Table 2). Sensitivity exceeded 50% only in the eight patients who had been assessed directly by the consultant surgeon.

Assessment by the radiologist

For the 941 patients with a registered appendicitis severity assessment by the radiologist, US was the imaging modality in 615 (65.4%), CT in 320 (34.0%), and MRI in 6 patients (0.6%). Differentiation between “uncomplicated” and “complicated” appendicitis was made by radiology trainees in 218 patients (23.2%), by consultant general radiologists in 578 patients (61.4%), and by consultant abdominal radiologists in 101 patients (10.7%). The vast majority of assessments, for which a level of certainty was indicated, were rated as “certain” (716 of 740; 96.8%).

Based on the radiologist’s assessment, 747 of 941 patients (79.4%) were classified as having uncomplicated appendicitis and 194 (20.6%) patients as having complicated appendicitis. These severity assessments were correct in 77.2% when compared to the reference standard. Almost all patients with true uncomplicated appendicitis (581 of 630 (92.2%)) but only 145 of 311 (46.6%) patients with true complicated appendicitis were accurately classified. This resulted in a sensitivity of 46.6%, a specificity of 92.2%, a PPV of 74.7%, and an NPV of 77.8% for the radiologist’s final imaging judgement of complicated appendicitis (Table S1).

The proportion of patients with complicated appendicitis in the US subgroup was 24.6% compared to 49.4% in the CT subgroup. In patients with US imaging, sensitivity, specificity, PPV, and NPV were 39.1%, 91.6%, 60.2%, and 82.2%, respectively, versus 53.2%, 93.8%, 89.4%, and 67.3% for patients with CT imaging.
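The contrast between the US and CT subgroups illustrates how PPV and NPV depend on the proportion of complicated appendicitis in each subgroup, even when sensitivity and specificity are broadly similar. The following minimal sketch in Python applies Bayes' theorem to the rounded percentages reported above; the function name is hypothetical, and because the inputs are rounded, the results may differ from the reported values in the first decimal.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) as proportions; all inputs are proportions in [0, 1]."""
    tp = sensitivity * prevalence              # true positives per patient
    fn = (1 - sensitivity) * prevalence        # false negatives per patient
    tn = specificity * (1 - prevalence)        # true negatives per patient
    fp = (1 - specificity) * (1 - prevalence)  # false positives per patient
    return tp / (tp + fp), tn / (tn + fn)


# Rounded values reported in the text for each imaging subgroup.
us_ppv, us_npv = predictive_values(0.391, 0.916, 0.246)
ct_ppv, ct_npv = predictive_values(0.532, 0.938, 0.494)

if __name__ == "__main__":
    print(f"US: PPV {100 * us_ppv:.1f}%, NPV {100 * us_npv:.1f}%")
    print(f"CT: PPV {100 * ct_ppv:.1f}%, NPV {100 * ct_npv:.1f}%")
```

The higher proportion of complicated appendicitis in the CT subgroup raises the PPV and lowers the NPV relative to the US subgroup, which matches the direction of the reported figures.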

Inter-observer agreement

In 789 patients, both the attending doctor at the ED and the radiologist made an assessment for appendicitis severity preoperatively. These estimates were in agreement between both assessors in 91.3% of the cases, with a substantial inter-observer agreement (Cohen’s κ = 0.73). Also, a substantial agreement was found within the subgroups of patients diagnosed with US or CT (Cohen’s κ = 0.70 and 0.74, respectively).

Patient characteristics of correct vs incorrect severity assessments

Among 369 patients with complicated appendicitis according to the reference standard, 163 (44.2%) were correctly identified as such at the ED. These correctly classified patients were significantly older than patients wrongly classified as having uncomplicated appendicitis; the median age was 54 (IQR 38–65) compared to 49 years (IQR 37–60), respectively (Table 3). Moreover, patients correctly identified as having complicated appendicitis had higher CRP levels (median 139 (IQR 70–217) versus 69 mg/L (IQR 35–134), respectively) and a longer duration of symptoms before presentation (proportion of patients with complaints ≥ 3 days: 50.9% vs 23.3%, respectively) than those wrongly classified.

Table 3 Characteristics of all 369 patients with complicated appendicitis as reference standard diagnosis assessed by the doctor at the ED: correct vs incorrect severity assessment

Discussion

This large, prospective, multicenter study demonstrated that final diagnostic judgements of doctors at the ED, integrating all available clinical, laboratory, and imaging results, largely underestimate the number of patients with complicated appendicitis. More than half of all patients with true complicated appendicitis were incorrectly classified as having uncomplicated appendicitis according to these judgements, even if these decisions were marked as “certain” by the assessors. Similar results were seen among radiologists who based their diagnostic judgement on imaging alone. Ruling out complicated appendicitis on the doctor’s final judgement or final imaging interpretation remains unreliable for the selection of patients with uncomplicated appendicitis who may be treated without surgery.

Differentiation between complicated and uncomplicated acute appendicitis is important for the choice of treatment. Several large trials have shown that antibiotic treatment of uncomplicated appendicitis in adults is effective and safe [2,3,4,5]. It is important to exclude complicated appendicitis when selecting patients eligible for antibiotic treatment. An NPV of 75.8% in patients with a “certain” judgement of uncomplicated appendicitis means that 24.2% of patients who were thought to have “uncomplicated appendicitis” based on a confident doctor’s judgement turned out to have “complicated appendicitis.” The studies that randomized between antibiotics and surgery used objective inclusion criteria. Among the patients randomized to surgery, they described a proportion of patients with complicated appendicitis of 1.5–18% [2, 3]. This suggests that objectively measurable variables work better than subjective judgement for making an accurate assessment of appendicitis severity.

Several objective variables are known to be predictive for differentiation between complicated and uncomplicated appendicitis, e.g., age [19] and CRP levels [20]. These characteristics indeed differed significantly between patients with uncomplicated and complicated appendicitis in our cohort (Table S2). When comparing patients with complicated appendicitis who were correctly identified at the ED by the doctor to patients wrongly classified as having uncomplicated appendicitis, these previously published distinctive variables (age and CRP level) also differed significantly. This suggests that these variables contributed, consciously or unconsciously, to the doctor’s final diagnostic judgement. Moreover, the doctor’s interpretation of objective findings decreases the sensitivity of these distinctive variables for ruling out complicated appendicitis. This means that adding the subjective judgement does not guarantee more accurate identification of complicated appendicitis. Prior to this study, we hoped that subjective aggregated assessment would add value in patients in whom objective variables were deficient in differentiation of appendicitis severity. Unfortunately, this does not appear to be true.

Only a small proportion of patients were assessed by a consultant surgeon (0.8%) or an emergency medicine consultant (5.8%). The accuracy of these consultant assessments surpassed that of other doctors, with sensitivities of 43.5–66.7% vs 35.1–44.9% and specificities of 100.0% vs 93.4–94.5%, respectively. This suggests that overall accuracy could increase if the risk assessments were conducted by consultants for all patients. However, a prior study, in which both surgical trainees and consultant surgeons made an estimate of the diagnosis in all patients with acute abdominal pain, revealed that the diagnostic accuracy of clinical assessments does not improve when a surgeon, rather than a surgical trainee, conducts the assessments [21].

Although sensitivity and NPV did not achieve high values, a high specificity was found. This means that the false positive rate of the subjective diagnostic judgement of the doctor at the ED is low. Unfortunately, this has limited clinical utility. When treatment options must be considered, it is precisely the exclusion of complicated appendicitis that is crucial, meaning that sensitivity and NPV are important rather than specificity and PPV.

In most patients, both the radiologist and the doctor at the ED made an assessment for “uncomplicated” or “complicated” acute appendicitis. There was a substantial inter-observer agreement among cases. A possible explanation for this is that the doctor at the ED, integrating available findings including imaging results, likely was influenced by the radiologist’s final assessment based on imaging that was stored in the electronic patient record.

The reference standard used in this study deviated slightly from the previously published study protocol [15]. Initially, any sign of gangrene on histopathology led to a diagnosis of complicated appendicitis. However, advancing insights led to a more prominent role for the intraoperative findings when these differed from histopathology results: in cases with microscopic signs of necrosis but no macroscopic necrosis, the clinical relevance of the microscopic findings is unclear.

Limitations

The study protocol was published before the final analysis was executed, and all predesigned analyses were performed [15]. This study was prospective and multicenter. All patients underwent diagnostic imaging and thereby had imaging-confirmed acute appendicitis. For each included patient, a reference standard from an adjudication committee was available. For most included patients, both the ED doctor’s and the radiologist’s final diagnostic judgements were available. Apart from these strengths, the study has several limitations. First, the present study describes an analysis of patients included in the SAS study [15], with the primary aim to validate the Atema score [9], which is an objective scoring model for appendicitis severity. Although the individual patient’s score result was not made available during the study, the doctors who estimated the appendicitis severity were aware of the included objective variables, which may have led to their judgement being influenced by conscious or subconscious emphasis on the results of those variables. Second, as shown in the flowchart, more than half of the eligible patients were not included, which may have contributed to a relatively high rate of complicated appendicitis and may have affected the results in other ways. Third, although clearly stated in the questionnaire, the question about the level of certainty of the diagnosis may have been wrongly interpreted as certainty about the diagnosis of acute appendicitis itself rather than about the specification “uncomplicated” or “complicated.” Moreover, this level of certainty, like other variables, was noted in the electronic patient record, to which patients themselves have access when they register to “Mychart.” This may have led doctors to prefer not to indicate doubt and to fill in a higher level of certainty than they felt.

Conclusions

More than half of all patients with true complicated appendicitis are incorrectly classified as having uncomplicated appendicitis according to the final diagnostic judgements of doctors at the ED, integrating all available clinical, laboratory, and imaging results. Comparable accuracy is found for radiologists assessing diagnostic imaging. Therefore, these diagnostic judgements are not sufficiently reliable in ruling out complicated appendicitis. With respect to selection of patients with true uncomplicated appendicitis for antibiotic treatment without appendectomy, subjective final judgements of doctors at the emergency department or radiologists’ interpretation of imaging results are still far from perfect. Scoring systems for appendicitis severity that compile results from objective variables into a final probability of complicated appendicitis among patients with acute appendicitis, as a percentage with a confidence interval, may improve accuracy of severity assessment.