Introduction

Lung disease is the main manifestation of coronavirus disease 2019 (COVID-19) [1], with clinical presentation ranging from asymptomatic to fever, dry cough, fatigue, and dyspnea, up to respiratory failure in severe cases [2]. The standard of reference for diagnosis is the reverse transcriptase polymerase chain reaction (RT-PCR) test, using nasopharyngeal swabs or lower respiratory tract specimens [3].

Chest high-resolution computed tomography (HRCT) plays an important role in the detection of COVID-19 pneumonia, with reported sensitivity ranging from 61% [4] to 99% in a study performed in a setting with high disease prevalence [5]. Typical findings include ground-glass opacities (GGO) and consolidation, involving multiple lobes of both lungs [6], as well as an organizing pneumonia (OP) pattern [7]. Despite its low specificity (about 25–33%) [8, 9], HRCT with typical appearance may be of help when there is diagnostic uncertainty in a patient with high pretest probability for the disease [6]. In this light, an expert consensus statement from the Radiological Society of North America (RSNA) provided a system for categorizing HRCT findings based on the likelihood that they represent COVID-19 pneumonia [7].

Another research question is whether HRCT can predict unfavorable clinical outcome, which has been variably defined as progression to severe disease, Intensive Care Unit (ICU) admission, or death [10,11,12,13]. Previous authors [10,11,12,13] found that qualitative and semi-quantitative indexes expressing the amount of lung involvement are associated with disease worsening.

One might assume that the prerequisite for using HRCT as a diagnostic and predictive tool is adequate inter-reader agreement in assessing COVID-19 pneumonia-related HRCT features. To our knowledge, only a few studies have investigated this topic [14,15,16,17]. Thus, it is uncertain whether interpretation and quantification of lung involvement can be reliably provided across different readers and different geographical areas affected by the pandemic [18]. Moreover, adequate inter-reader agreement may further support the use of HRCT in many COVID-19-related clinical situations, e.g., when swab or clinical data are doubtful, when the disease evolves or worsens, and in outcome evaluation.

The aim of the study was to investigate the inter-reader agreement in assessing HRCT features of COVID-19 pneumonia.

Material and methods

Patient population

The Ethical Committee approved the study protocol and waived the requirement for informed consent, given the retrospective design.

By searching the database of our COVID-19 center, we identified all consecutive adult patients with suspected COVID-19 pneumonia who underwent chest HRCT examination in the period March–April 2020. Before HRCT, all patients underwent RT-PCR testing for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) on nasopharyngeal swabs, and were categorized according to the Italian Society of Emergency Medicine (SIMEU) classification of clinical phenotypes [19]. The latter included: (i) phenotype 1: fever without respiratory failure and normal chest X-ray; (ii) phenotype 2: fever with chest X-ray and arterial blood gas test indicating lung focus and/or mild respiratory failure [partial pressure of arterial blood oxygen (PaO2) > 60 mmHg]; (iii) phenotype 3: fever with moderate-severe respiratory failure (PaO2 < 60 mmHg in room air); (iv) phenotype 4: respiratory failure with suspected initial acute respiratory distress syndrome (ARDS) or complicated pneumonia; and (v) phenotype 5: overt ARDS [18]. Oxygen therapy and/or continuous positive airway pressure (CPAP) ventilation were indicated in patients with SIMEU phenotypes 3–4 pneumonia, while orotracheal intubation with invasive ventilation was the treatment for SIMEU phenotypes 4–5 [19].

Of the 192 eligible subjects, we excluded 104 patients with negative RT-PCR test, and 11 patients with clinical phenotypes 3–5 at the time of HRCT. Therefore, the final population consisted of 77 patients (40 men and 37 women, mean age 64 ± 15 years) with mild COVID-19 pneumonia (i.e., SIMEU clinical phenotypes 1–2). When a patient had undergone several HRCT examinations, only the baseline one was included in the analysis.

HRCT examinations

HRCTs were performed on a 64-row Computed Tomography (CT) scanner (LightSpeed, General Electric, Milwaukee, Wisconsin, USA), by means of volumetric acquisition with the patient in the supine position, at suspended full inspiration. Image acquisition parameters were as follows: 0.6 s gantry revolution time, 100–350 mA tube current modulation range, 120 kV tube potential, 64 mm × 0.625 mm detector configuration, 1.25 mm reconstructed section thickness and interval. In 4/77 patients (5.2%) iodinated contrast medium [iomeprol 350 mgI/mL (Iomeron, Bracco Imaging, Milan, Italy)] was intravenously injected before scanning. Two image sets were reconstructed and displayed, including one with high-spatial-frequency algorithm and pulmonary parenchyma windowing (level, −500 HU; width, 1700 HU), and the other with soft tissue algorithm and windowing (level, 50 HU; width, 350 HU).

Image analysis

For each patient, three readers recorded the presence of GGO, consolidation, and crazy-paving pattern, as defined by the glossary of terms for thoracic imaging from the Fleischner Society [20]. Readers included two radiologists devoted to thoracic imaging, namely reader 1 (R1) and reader 2 (R2), with 10 and 3 years of experience, respectively, and one generalist radiologist (R3) with 20 years of experience in body imaging. Readers also assessed whether GGO and/or consolidation presented with an OP pattern, i.e., whether they showed triangular or polygonal shape, or were associated with perilobular pattern, bronchial dilatation, reverse halo sign, linear and band-like opacities, and signs of fibrosis [21]. On a per-examination basis, six lung zones were identified (3 per lung), i.e., two upper zones (above the carina), two middle zones (from the carina to the inferior pulmonary veins), and two lower zones (below the inferior pulmonary veins). Readers then assessed the zonal extent of GGO, consolidation, and crazy-paving pattern, using a previously reported semi-quantitative score [11, 22]. The score was 0 if the feature was not present, 1 if it was present with <25% zonal involvement, 2 for ≥25% to <50% involvement, 3 for ≥50% to <75% involvement, and 4 for ≥75% involvement. Therefore, the per-patient total score for each pulmonary feature (TFS) ranged from 0 (i.e., a certain feature was scored 0 in each of the six lung zones) to 24 (i.e., a certain feature was scored 4 in each of the six lung zones). TLS was defined as the sum of the GGO, consolidation, and crazy-paving pattern TFSs, thus ranging from 0 to 72.
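The scoring scheme described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: in the study the zonal scores were assigned visually by the readers, so the percentage inputs, the function names, and the example patient below are hypothetical.

```python
# Illustrative sketch of the semi-quantitative score (not the authors' software).
# Percent-involvement inputs and all names are hypothetical assumptions.

FEATURES = ("ggo", "consolidation", "crazy_paving")
N_ZONES = 6  # upper, middle, and lower zone of each lung


def zone_score(percent_involved: float) -> int:
    """Map the zonal involvement of one feature (0-100%) to a 0-4 score."""
    if not 0 <= percent_involved <= 100:
        raise ValueError("involvement must be between 0 and 100")
    if percent_involved == 0:
        return 0  # feature absent in this zone
    if percent_involved < 25:
        return 1
    if percent_involved < 50:
        return 2
    if percent_involved < 75:
        return 3
    return 4


def total_feature_score(zone_percents) -> int:
    """TFS: sum of the six zonal scores of one feature (range 0-24)."""
    if len(zone_percents) != N_ZONES:
        raise ValueError("expected one value per lung zone")
    return sum(zone_score(p) for p in zone_percents)


def total_lung_score(patient) -> int:
    """TLS: sum of the GGO, consolidation, and crazy-paving TFSs (range 0-72)."""
    return sum(total_feature_score(patient[f]) for f in FEATURES)


# Hypothetical patient: percent involvement per zone for each feature.
example_patient = {
    "ggo": [10, 30, 60, 0, 25, 80],
    "consolidation": [0, 0, 10, 0, 0, 30],
    "crazy_paving": [0, 0, 0, 0, 0, 0],
}
example_tls = total_lung_score(example_patient)  # 12 + 3 + 0 = 15
```

With these hypothetical inputs the GGO TFS is 12, the consolidation TFS is 3, and the TLS is 15, i.e., above the cut-off of 6 used in the analysis.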

Clinical data analysis

For all patients, comorbidities and time from symptom onset to HRCT examination were recorded. We recorded the patients’ SIMEU phenotype twice, i.e., the one observed at the time of HRCT examination, and the worst one noticed in the 15-day period following HRCT. For the purpose of analysis, SIMEU phenotypes recorded during the follow-up period were dichotomized into mild disease group [including patients with no or mild respiratory failure (SIMEU phenotype 1 or 2)] versus severe disease group [including patients with moderate-to-severe respiratory failure or ARDS (SIMEU phenotype 3–5)].

On this basis, the study outcome was defined as the development of severe disease in the 15-day period following HRCT, i.e., a shift from SIMEU phenotype 1–2 to SIMEU phenotype 3–5.
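As a minimal sketch, the outcome definition amounts to the following rule. The function names and the integer coding of the phenotypes are assumptions made for illustration only.

```python
# Illustrative coding of the study outcome; names are hypothetical.

SEVERE_PHENOTYPES = {3, 4, 5}  # moderate-to-severe respiratory failure or ARDS


def is_severe(phenotype: int) -> bool:
    """Return True for SIMEU phenotypes 3-5, False for phenotypes 1-2."""
    if phenotype not in (1, 2, 3, 4, 5):
        raise ValueError("SIMEU phenotype must be an integer from 1 to 5")
    return phenotype in SEVERE_PHENOTYPES


def outcome_reached(baseline: int, worst_followup: int) -> bool:
    """True when a patient with mild disease at HRCT (phenotype 1-2)
    shows phenotype 3-5 as the worst one in the 15-day follow-up."""
    if is_severe(baseline):
        raise ValueError("only patients with baseline phenotype 1-2 were included")
    return is_severe(worst_followup)
```

For instance, `outcome_reached(2, 3)` is true, whereas `outcome_reached(1, 2)` is false.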

Statistical analysis

We used descriptive statistics to summarize HRCT findings, and coupled relevant proportions with 95% confidence intervals (95%CI). After checking whether continuous data showed normal distribution, we described them as means ± standard deviation or medians with the interquartile range (IQR). The Cochran’s Q test was used to determine whether there was any significant difference in the prevalence of each HRCT feature among the three sets of readings. Pairwise comparisons were performed with the McNemar test.
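For readers unfamiliar with these tests, the sketch below shows how Cochran’s Q and a pairwise McNemar test can be computed from binary readings (1 = feature present, 0 = absent). It is not the MedCalc implementation: the exact binomial form of the McNemar test is an assumption (the software may use a chi-square variant), the p value shortcut for Cochran’s Q holds only for three readers (2 degrees of freedom), and the six example patients are invented.

```python
from math import comb, exp


def cochran_q(ratings):
    """Cochran's Q for binary readings.

    ratings: list with one tuple per patient, holding one 0/1 value per reader.
    """
    k = len(ratings[0])  # number of readers
    col_totals = [sum(row[j] for row in ratings) for j in range(k)]
    row_totals = [sum(row) for row in ratings]
    total = sum(row_totals)
    denom = k * total - sum(r * r for r in row_totals)
    if denom == 0:
        return 0.0  # convention: readers agree on every patient
    return (k - 1) * (k * sum(c * c for c in col_totals) - total ** 2) / denom


def cochran_q_pvalue_three_readers(q):
    """Chi-square survival function with 2 degrees of freedom (3 readers only)."""
    return exp(-q / 2)


def mcnemar_exact(a, b):
    """Two-sided exact (binomial) McNemar test on paired 0/1 readings."""
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    n = n10 + n01  # discordant pairs
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(n10, n01) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical readings of one feature by R1, R2, R3 in six patients.
example_ratings = [(1, 1, 1), (1, 1, 0), (1, 0, 0), (0, 0, 0), (1, 1, 0), (1, 1, 1)]
example_q = cochran_q(example_ratings)  # 28 / 6, about 4.67
r1 = [row[0] for row in example_ratings]
r3 = [row[2] for row in example_ratings]
example_p_r1_r3 = mcnemar_exact(r1, r3)  # 3 discordant pairs, all R1+/R3-: p = 0.25
```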

To determine the inter-reader agreement in assessing HRCT features, we used percent agreement (PA) and Cohen’s kappa (k), as recommended by the Guidelines for Reporting Reliability and Agreement Studies (GRRAS) [23], as well as free-marginal multirater k. For the analysis, HRCT features were dichotomized using a TFS cut-off of >2 and a TLS cut-off of >6. When a paradoxical k was observed [i.e., unacceptable kappa value (k ≤ 0.41) and acceptable percent agreement (PA ≥ 0.80) with Prevalence Index and Bias Index different from zero], the imbalance was corrected by using the prevalence-adjusted bias-adjusted kappa (PABAK) statistic [24, 25]. Interpretation of the k and PABAK coefficients was as follows: <0.00, poor; 0.00–0.20, slight; 0.21–0.40, fair; 0.41–0.60, moderate; 0.61–0.80, substantial; 0.81–1.00, almost perfect [26].
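The agreement statistics named above follow standard formulas, which the sketch below reproduces for binary readings: PA, Cohen’s k for a reader pair, PABAK (which for two categories reduces to 2 × PA − 1), free-marginal multirater k, and the interpretation scale. Function names and example readings are hypothetical; the study itself relied on MedCalc and the Online Kappa Calculator.

```python
# Illustrative agreement statistics for binary (0/1) readings; names are hypothetical.


def percent_agreement(a, b):
    """Proportion of patients on which two readers give the same reading."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa for two readers and two categories."""
    n = len(a)
    po = percent_agreement(a, b)
    pa1, pb1 = sum(a) / n, sum(b) / n  # prevalence of "present" per reader
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)  # agreement expected by chance
    if pe == 1:
        return float("nan")  # undefined when both readers use a single category
    return (po - pe) / (1 - pe)


def pabak(a, b):
    """Prevalence-adjusted bias-adjusted kappa for two categories."""
    return 2 * percent_agreement(a, b) - 1


def free_marginal_kappa(ratings, n_categories=2):
    """Free-marginal multirater kappa (chance agreement fixed at 1/n_categories)."""
    n_raters = len(ratings[0])
    pair_agreement = 0.0
    for row in ratings:
        counts = [row.count(c) for c in set(row)]
        pair_agreement += sum(c * (c - 1) for c in counts) / (n_raters * (n_raters - 1))
    po = pair_agreement / len(ratings)
    pe = 1 / n_categories
    return (po - pe) / (1 - pe)


def interpret(k):
    """Interpretation scale used in the text."""
    if k < 0:
        return "poor"
    for upper, label in ((0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")):
        if k <= upper:
            return label
    return "almost perfect"


# Hypothetical "paradox" case: high prevalence, 18/20 concordant readings.
reader_a = [1] * 18 + [1, 0]
reader_b = [1] * 18 + [0, 1]
paradox_pa = percent_agreement(reader_a, reader_b)  # 0.90
paradox_kappa = cohen_kappa(reader_a, reader_b)     # slightly below zero
paradox_pabak = pabak(reader_a, reader_b)           # about 0.80
```

In this invented example PA is 0.90 while Cohen’s k is negative, because almost all readings fall in one category; PABAK of about 0.80 corrects for that imbalance.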

On a per-reader basis, we then performed a logistic regression analysis with the stepwise approach to assess whether HRCT presentation could predict the occurrence of the study outcome as defined above. The model included TLS > 6, consolidation > 2, crazy-paving pattern > 2, and presence of OP pattern. Preliminary univariable analysis was performed with the chi-square test.
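The preliminary univariable step can be illustrated with a Pearson chi-square test on a 2 × 2 table (feature present/absent versus severe/mild disease). The sketch covers only this step, not the stepwise logistic regression; it uses no continuity correction, which is an assumption, and the counts in the example are invented and do not come from the study tables.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom, no continuity correction).

    a: feature present, severe disease    b: feature present, mild disease
    c: feature absent, severe disease     d: feature absent, mild disease
    Returns (statistic, p_value).
    """
    n = a + b + c + d
    margins = (a + b, c + d, a + c, b + d)
    if 0 in margins:
        raise ValueError("every row and column must contain at least one patient")
    stat = n * (a * d - b * c) ** 2 / (margins[0] * margins[1] * margins[2] * margins[3])
    p_value = erfc(sqrt(stat / 2))  # chi-square survival function with 1 df
    return stat, p_value


def odds_ratio(a, b, c, d):
    """Crude odds ratio of severe disease for feature present versus absent."""
    return (a * d) / (b * c)


# Hypothetical counts for one reader (77 patients, 38 with severe disease).
example_stat, example_p = chi_square_2x2(30, 20, 8, 19)
example_or = odds_ratio(30, 20, 8, 19)  # 570 / 160
```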

Analyses were performed using MedCalc statistical software (MedCalc Software bvba, version 18.11.6, Ostend, Belgium), and Online Kappa Calculator (Computer Software, retrieved from http://justus.randolph.name/kappa). The reference alpha value was 0.05. When appropriate, the Bonferroni correction was used (0.05/3 pairwise comparisons = 0.017).
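The Bonferroni-corrected threshold is simply the reference alpha divided by the number of pairwise comparisons. A minimal sketch follows; the function names and the p values in the example are hypothetical.

```python
ALPHA = 0.05  # reference alpha value


def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison significance threshold after Bonferroni correction."""
    return alpha / n_comparisons


def significant_after_bonferroni(p_values, alpha=ALPHA):
    """Flag each pairwise comparison whose p value is below the corrected threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return {pair: p < threshold for pair, p in p_values.items()}


# Hypothetical p values for the three reader pairs.
example_flags = significant_after_bonferroni(
    {"R1 vs R2": 0.20, "R1 vs R3": 0.0005, "R2 vs R3": 0.014}
)
```

With three comparisons the threshold is 0.05/3, about 0.017, so in this invented example only the last two pairs would be flagged as significantly different.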

Results

Study population and HRCT findings

Patients showed at least one comorbidity in 57% (44/77) of cases, and ≥2 comorbidities in 26% (20/77) of cases. Cardiovascular, oncological, and respiratory diseases were the most frequent ones, reported in 42% (32/77), 13% (10/77), and 12% (9/77) of patients, respectively. The median time period from the onset of symptoms to HRCT was 5 days (IQR, 2–9 days). Thirty-eight of 77 patients (49%) developed severe disease during the 15-day period following the HRCT examination, after a median time of 1 day (IQR, 1–2 days).

The per-reader distribution of HRCT findings is shown in Table 1. Regardless of the reader, the most frequent features were GGO > 2 (74–83% of patients) and OP pattern (38–68% of patients), while a TLS > 6 was found in 65–69% of patients. Overall, the prevalence of HRCT features was not significantly different among readers, except for OP pattern, which was more frequently reported by R1 than R3 (52/77 versus 29/77 patients, p < 0.001), and by R2 than R3 (43/77 versus 29/77 patients, p = 0.014). Example cases are shown in Figs. 1 and 2.

Table 1 Per-reader distribution of HRCT findings (n = 77). The “difference in prevalence” columns report the p values expressing whether the prevalence of the detected HRCT features was significantly different among the three readers and on a pairwise basis (R1 versus R2, R1 versus R3, R2 versus R3)

Fig. 1 A 48-year-old man with confirmed COVID-19 pneumonia. At hospital admission, HRCT images on axial (a) and coronal (b) planes showed bilateral, mostly peripheral GGO and consolidations. The two horizontal white lines in (b) delimit the upper, middle, and lower lung zones, which were identified to apply the semi-quantitative score (see the text for details). The scheme in (c) summarizes how each of the three readers (R1, R2, and R3) assigned the TFS for GGO, consolidation, and crazy-paving pattern, thus allowing the calculation of TLS as the sum of all the TFSs. For all the readers, TLS was >6, a feature we found to be predictive of short-term occurrence of severe disease. After one day, the patient developed respiratory failure [Italian Society of Emergency Medicine (SIMEU) phenotype 3 disease]

Fig. 2 A 61-year-old woman with confirmed COVID-19 pneumonia. At hospital admission, HRCT images on axial (a) and sagittal (b) planes showed bilateral, peripheral GGO and band-like opacities with a perilobular distribution, resembling an OP pattern. OP pattern was deemed present by all readers. After 5 days, the patient developed respiratory failure [Italian Society of Emergency Medicine (SIMEU) phenotype 4 disease]

Inter-reader agreement in assessing HRCT features

Table 2 shows the results of the inter-reader agreement analysis. When comparing the three radiologists at the same time, we found that they agreed to a substantial extent in assessing HRCT features (k values ranging from 0.65 to 0.74). The highest agreement was observed in the case of GGO > 2 (k = 0.74) and TLS > 6 (k = 0.69). Exceptions were consolidation and OP pattern, for which the agreement was moderate (k = 0.60) and fair (k = 0.32), respectively.

Table 2 Inter-reader agreement in assessing HRCT features. The “inter-reader agreement” columns express the magnitude of the agreement in assessing a certain HRCT feature among the three readers, and on a pairwise basis (R1 versus R2, R1 versus R3, R2 versus R3)

When comparing the radiologists on a pairwise basis, the agreement was moderate to substantial for most HRCT features. The only exception was the OP pattern, which was scored with moderate agreement by R1 versus R2, and with fair agreement by both R1 versus R3 and R2 versus R3. Of note, the inter-reader agreement between the more experienced readers (R1 and R2) was substantial for consolidation and moderate for OP pattern.

Online Resource 1 shows the PA values we used to verify whether the prerequisites for using PABAK were met, as described above. PA values do not represent primary measurements of agreement, as they do not account for the effect of chance.

Prediction of unfavorable outcome

Table 3 shows the results of univariable analysis and multivariable analysis, including a model built upon each reader. On multivariable analysis, independent predictors of severe disease were TLS > 6 (for all readers), OP pattern (for R1) and consolidation > 2 (for R2).

Table 3 Results from the logistic regression model (outcome, development of severe disease within 15 days from HRCT)

Discussion

Chest HRCT represents a valuable imaging tool both in diagnosis and management of patients with COVID-19 [27], through suggesting a possible diagnosis of COVID-19 in a high suspicion clinical setting and indicating a progression in disease severity at follow-up (e.g., signs of disease progression such as consolidation or crazy-paving pattern, or bacterial superinfection) [28]. In this study, the agreement in assessing HRCT features of mild COVID-19 pneumonia among the three readers with different experience in thoracic imaging was fair in the case of OP pattern, moderate in the case of consolidation > 2, and substantial in the case of GGO > 2, crazy-paving pattern > 2, and TLS > 6. Of note, the latter feature was found to be an independent predictor of short-term onset of severe disease at multivariable analysis, regardless of readers’ experience. When considering experienced readers only, severe disease was independently predicted also by OP pattern (in the case of R1), and consolidation > 2 (in the case of R2), in accordance with higher pairwise agreement on those two features (k = 0.59 and 0.64, respectively). Our findings are in line with previous studies showing substantial-to-almost perfect inter-reader agreement for most CT features [14, 16], and the capability of semi-quantitative or quantitative evaluation of lung involvement [10, 12] to predict disease worsening. Overall, our results provide reliable potential markers for disease progression.

Concerning the estimation of COVID-19 pneumonia extent, Cozzi et al. recently proposed a quantitative method based on chest X-ray performed in an emergency setting, which correlated with an increased risk of admission to ICU [29]. In parallel, a few authors [11,12,13,14] evaluated lung involvement from COVID-19 pneumonia in the form of CT severity score [14], CT score [11], total lung involvement [13], and, conversely, total extent of well-aerated lung parenchyma [12]. Computer-aided methods for the quantification of lung involvement in COVID-19 pneumonia were also investigated [30]. We used TLS for this purpose. This makes our results difficult to compare with those of previous studies, though our approach shows the potential advantage of accounting for different HRCT manifestations when quantifying pulmonary involvement. Of note, we found this parameter to be reliable, showing substantial inter-reader agreement for a cut-off >6, regardless of readers’ experience in thoracic imaging. One can assume this is a potentially relevant result, since non-thoracic radiologists can be involved in reporting in the pandemic-related scenario. Since TLS was an independent predictor of clinical worsening, a reliable quantification of this feature, even by non-thoracic radiologists, might affect patient management (e.g., in terms of patients’ allocation to the ICU).

Concerning other HRCT features, Zhang et al. [16] found excellent inter-reader agreement in assessing consolidation (k = 0.983) and crazy-paving pattern (k = 0.978) between experienced readers. Unlike them, we observed lower (even if substantial) agreement. A potential explanation for the discrepancy is that our assessment included not only the presence but also the extent of those HRCT features. While this can expectedly lead to lower agreement, our approach has the potential advantage of adding a semi-quantitative evaluation of the amount of lung involvement.

The OP pattern is part of typical COVID-19 pneumonia appearance (type 1 category), according to the RSNA categorization system [7], as well as a marker of other coronavirus-related lung diseases [i.e., severe acute respiratory syndrome (SARS), and Middle East respiratory syndrome (MERS)] [31, 32]. Consolidation has been identified as a potential marker of disease progression, reflecting cellular fibromyxoid exudates in alveoli [33, 34]. When analyzing readings from more experienced readers, both features were significantly associated with severe disease on univariable analysis (R1 and R2), with OP pattern and consolidation > 2 representing independent predictors of the outcome for R1 and R2, respectively. However, k values for those features were disappointing, both overall and when comparing R1 or R2 versus R3. On the other hand, the agreement rose when comparing R1 and R2. Those results suggest that OP pattern and consolidation are more reliable and can represent markers for clinical evolution only when provided by experienced readers.

In the case of OP pattern, this may be due to its composite definition, which includes many HRCT findings at a time [21]. As supported by the lower prevalence in R3 readings, the OP pattern was presumably more difficult to assess for the less experienced reader. Our results concerning consolidation can be explained by the expectedly lower accuracy in assessing the extent of this otherwise easier-to-interpret finding.

Some study limitations warrant mention. First, this is a single-center study, and our findings should be validated by multi-institutional trials. However, our results are overall in line with previous works performed both in Western [12] and in Eastern [10, 11] scenarios of the pandemic, suggesting they are reasonably generalizable. Second, we limited our models to imaging findings, excluding patients’ comorbidities and other clinical factors. On the other hand, the definition of clinically predictive models was beyond the purpose of the study. We focused on inter-reader agreement assuming that, in an emergency scenario, reliable imaging findings can represent more objective data for managing patients with known mild COVID-19 pneumonia than clinical features that can be incompletely known at the time of HRCT. Finally, we did not include the whole spectrum of HRCT features in the analysis. This choice was made to avoid asking the less experienced reader to interpret subtler findings, for which the agreement would be expectedly low.

In conclusion, we observed that, regardless of radiologists’ experience in chest imaging, there was moderate-to-substantial agreement for most HRCT features of COVID-19 pneumonia, in terms of semi-quantitative assessment of lung involvement. Some of the features were predictive for short-term evolution to severe disease, namely: (i) TLS > 6, showing substantial inter-reader agreement regardless of readers’ experience; (ii) consolidation > 2 and OP pattern, which showed acceptable inter-reader agreement between more experienced readers only. Thus, the agreement on different HRCT features apparently depends on readers’ experience.