A head-to-head comparison of the intra- and interobserver agreement of COVID-RADS and CO-RADS grading systems in a population with high estimated prevalence of COVID-19

Objective: To evaluate the inter- and intraobserver agreement of COVID-RADS and CO-RADS reporting systems among differently experienced radiologists in a population with high estimated prevalence of COVID-19. Methods and materials: Chest CT scans of patients with clinically–epidemiologically diagnosed COVID-19 were retrieved from an open-source MosMedData data set, randomised, and independently assigned COVID-RADS and CO-RADS grades by an abdominal radiology fellow, thoracic imaging fellow and a consultant cardiothoracic radiologist. The inter- and intraobserver agreement of the two systems were assessed using the Fleiss’ and Cohen’s κ coefficients, respectively. Results: A total of 200 studies were included in the analysis. Both systems demonstrated moderate interobserver agreement, with κ values of 0.51 [95% confidence interval (CI): 0.46–0.56] and 0.55 (95% CI: 0.50–0.59) for COVID-RADS and CO-RADS, respectively. When COVID-RADS and CO-RADS grades were dichotomised at cut-off values of 2B and 4 to evaluate the agreement between grades representing different levels of clinical suspicion for COVID-19, the interobserver agreement became substantial with κ values of 0.74 (95% CI: 0.66–0.82) for COVID-RADS and 0.73 (95% CI: 0.65–0.81) for CO-RADS. The median intraobserver agreement was considerably higher for CO-RADS, reaching 0.81 (95% CI: 0.66–0.95) compared with 0.60 (95% CI: 0.43–0.76) for COVID-RADS. Conclusions: COVID-RADS and CO-RADS showed comparable interobserver agreement, which was moderate when grades were compared head-to-head and substantial when grades were dichotomised to better reflect the underlying levels of suspicion for COVID-19. The median intraobserver agreement of CO-RADS was, however, considerably higher compared with COVID-RADS.
Advances in knowledge: This paper provides a comprehensive review of the newly introduced COVID-19 chest CT reporting systems, which will help radiologists of all sub-specialties and experience levels make an informed decision on which system to use in their own practice.


INTRODUCTION
According to the latest WHO figures, the COVID-19 pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has been associated with more than 25,000,000 confirmed cases and 848,000 deaths worldwide. 1 A series of unprecedented large-scale nonpharmaceutical interventions introduced across Europe has reportedly led to 3,100,000 deaths being averted. 2 Despite the gradual easing of lockdown measures, health-care services across Europe remain under continuous pressure with radiology departments remaining on the frontline of the COVID-19 diagnostic pathway. [3][4][5] Although major international institutions, including the British Society of Thoracic Imaging (BSTI) and the American College of Radiology, argue against the routine use of chest CT for diagnosis and triage of patients with suspected COVID-19, the advantage of unenhanced chest CT over real-time polymerase chain reaction (RT-PCR) as a rapid and prognostically valuable first-line investigation in symptomatic and comorbid patients, particularly in high-prevalence areas, has been highlighted in several studies. [6][7][8][9][10][11] The recent multinational consensus statement from the Fleischner Society, however, highlights the greater sensitivity of chest CT to early pneumonic changes compared to chest X-ray and acknowledges the preferred use of the former modality in severely affected areas, where the reliability of RT-PCR testing is limited and turnaround times are long. 12 Although an extensive body of literature has emerged describing characteristic CT features of COVID-19 at different stages of the disease, considerable differences in reporting practices have been highlighted. [13][14][15][16] With repeated waves of the pandemic being forecast, 2 radiologists of all experience levels and subspecialties will be expected to contribute to the effective triage of patients with suspected COVID-19.
To ensure optimal results, this requires the development of standardised reporting systems with high intra- and interobserver agreement.
With this in mind, two grading systems for standardised assessment of unenhanced chest CT in patients with suspected COVID-19 were independently proposed in late April 2020: CO-RADS and COVID-RADS. 17,18 COVID-RADS represents a 5-point scale with a supporting lexicon that clearly defines the specific findings one needs to observe in order to assign a score indicating a low, moderate or high level of suspicion for COVID-19 pneumonia. Rather than providing a list of specific features needed to assign each individual grade, CO-RADS combines them into patterns indicating five different levels of suspicion ranging from very low to very high, and also incorporates grades 0 and 6, which are assigned when a study is of insufficient quality or accompanied by a positive RT-PCR test, respectively. CO-RADS has been internally validated against RT-PCR (AUC 0.91, 95% CI 0.85-0.97), and its interobserver agreement among eight radiologists with different experience in reading chest CTs corresponded to a Fleiss' κ of 0.47 (95% CI 0.45-0.49). 17 Conversely, there are no published reports of a similar validation of COVID-RADS, and no attempts have been made to conduct a direct comparison between the two systems when applied to a population with high estimated COVID-19 prevalence by radiologists with different experience levels.
In this study, chest CT scans of patients with clinical-epidemiological diagnosis of COVID-19 were reviewed by differently experienced radiologists from three European countries with the objective of comparing the intra- and interobserver agreement of COVID-RADS and CO-RADS grading systems in a high-prevalence setting.

Data set description
In this retrospective study, we used anonymised unenhanced chest CT images obtained from the open-source MosMedData data set published by the Research and Practical Clinical Center for Diagnostics and Telemedicine Technologies of the Moscow Health Care Department. The data set includes 1110 individual chest CT studies (slice thickness 1.0-1.5 mm) of patients with clinical-epidemiological diagnosis of COVID-19 (ICD-10 code U07.2) performed in municipal hospitals in Moscow, Russian Federation, between 1 March and 25 April 2020. The studies, stored in the NIfTI format, were categorised by the authors of the data set into five groups depending on the degree of pulmonary involvement, ranging from scans representing normal chest (CT-0) to those with detected ground glass opacifications (GGOs), regions of consolidation, reticular changes and hydrothorax with more than 75% of lung tissue involved (CT-4) as per Russian national guidelines. The data set is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0) license and is available via a permanent link https://mosmed.ai/datasets/covid19_1110.

Patient selection process
To maximise the use of the available imaging data while also ensuring adequate distribution of studies with different predefined degrees of pulmonary involvement, all cases from the CT-3 and CT-4 groups were combined and matched by randomly selected studies from the remaining three groups CT-0, CT-1 and CT-2 to produce an overall proportion of 1:1:1:1 with a total sample size of 200 studies, the order of which was then randomised (Figure 1).
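The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used in the study: the function name `build_reading_list`, the dictionary layout keyed by directory name, and the `seed` parameter are all hypothetical.

```python
import random


def build_reading_list(groups, seed=0):
    """Return a shuffled, balanced list of study IDs.

    `groups` is a hypothetical mapping from directory name
    ("CT-0" ... "CT-4") to a list of study IDs.
    """
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable

    # All CT-3 and CT-4 studies are pooled into a single stratum.
    pooled = list(groups["CT-3"]) + list(groups["CT-4"])
    per_group = len(pooled)

    selected = list(pooled)
    for name in ("CT-0", "CT-1", "CT-2"):
        # Draw the same number of studies, without replacement,
        # from each of the remaining groups (1:1:1:1 proportion).
        selected += rng.sample(list(groups[name]), per_group)

    # Randomise the reading order so readers cannot infer the group.
    rng.shuffle(selected)
    return selected
```

Calling the function again with a different seed would give a re-randomised order of the kind used for the repeat readings.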
Chest CT interpretation using COVID-RADS and CO-RADS grading systems
Three radiologists from different tertiary referral centres from three different countries independently reviewed the studies and assigned COVID-RADS and CO-RADS scores using the originally described scoring systems as a reference (Supplementary Material 1). 16,17 Given the nature of the selected data set, there were no studies for which CO-RADS grades 0 (not interpretable) or 6 (RT-PCR positive for SARS-CoV-2) could be assigned. Although the readers were aware of the clinical-epidemiological diagnosis of COVID-19 in the selected patients, they were blinded to the predefined CT category. Before completing the reading process as illustrated in Figure 1, all readers completed a training set of cases that they had selected randomly from the original data set. The first two readings, the outcomes of which were used to calculate the interobserver agreement of COVID-RADS and CO-RADS, respectively, took place within a week from one another with cases being re-randomised for the second reading. Readings 3 and 4, which took place within a week of reading 2, were used to calculate the intraobserver agreement of the two systems and each consisted of 50 re-randomised cases. The images were viewed using the open-source software ITK-SNAP, with readers being able to modify the window settings. 19 Reader 1 (VB) worked at an emergency medicine hospital and was a senior abdominal radiology fellow with no experience of routinely reporting chest CTs of patients with suspected COVID-19. Reader 2 (MK) was a thoracic imaging fellow at a large regional heart and lung hospital with 4 years' overall experience reporting chest CTs and 3 months' experience reporting chest CTs of patients with suspected COVID-19. Reader 3 (GS) was a consultant cardiothoracic radiologist at a regional COVID-19 referral centre with 11 years' experience of reporting chest CTs.

Statistics
The outcomes of Readings 1 and 2 were analysed to calculate the interobserver agreement of COVID-RADS and CO-RADS grading systems among the three readers using the Fleiss' κ with 95% confidence intervals (CIs). The analysis was repeated after COVID-RADS and CO-RADS scores were dichotomised at cutoff values of 2A and 3, respectively, to evaluate the interobserver agreement between the COVID-19 levels of suspicion that may directly impact clinical decision-making (low and moderate for COVID-RADS vs equivocal, high and very high suspicion for CO-RADS). Additional dichotomisation was performed at cutoff values of 2B and 4 in order to assess the agreement between the grades that include typical COVID-19 features and therefore represent the highest levels of clinical suspicion the two systems can offer. Linearly weighted Cohen's κ was calculated to assess the interobserver agreement between individual readers. The intraobserver agreement of COVID-RADS and CO-RADS was calculated using Cohen's simple κ using the outcomes of Readings 3 and 4, respectively. 20 The κ values were interpreted as follows: values ≤ 0 as indicating less than chance agreement, 0.01-0.20 as slight, 0.21-0.40 as fair, 0.41-0.60 as moderate, 0.61-0.80 as substantial and 0.81-0.99 as almost perfect agreement.
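For readers less familiar with these statistics, the following Python sketch shows one way the multi-reader Fleiss' κ, the dichotomisation step and the interpretation bands above could be computed. It is a minimal illustration, not the software used in the study: the function names and the `COVID_RADS_ORDER` / `CO_RADS_ORDER` lists are hypothetical, confidence intervals are not computed, and treating grades at or above the cut-off as the "positive" class is an assumption about how the cut-offs were applied.

```python
from collections import Counter

# Ordered grade scales (hypothetical encoding of the two systems).
COVID_RADS_ORDER = ["0", "1", "2A", "2B", "3"]
CO_RADS_ORDER = ["1", "2", "3", "4", "5"]


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of studies, each a list of one grade per reader.

    Every study must be graded by the same number of readers. The value is
    undefined (division by zero) if all grades fall into a single category.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    categories = {grade for row in ratings for grade in row}
    counts = [Counter(row) for row in ratings]

    # Observed agreement: mean proportion of agreeing reader pairs per study.
    p_i = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement from the overall share of each category.
    p_j = [sum(c[k] for c in counts) / (n_subjects * n_raters) for k in categories]
    p_e = sum(p ** 2 for p in p_j)

    return (p_bar - p_e) / (1 - p_e)


def dichotomise(grade, order, cutoff):
    """True if `grade` is at or above `cutoff` on the ordered scale (assumption)."""
    return order.index(grade) >= order.index(cutoff)


def interpret_kappa(kappa):
    """Map a kappa value to the agreement bands quoted in the text."""
    if kappa <= 0:
        return "less than chance"
    for upper, label in ((0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"
```

To reproduce a dichotomised analysis, each grade would first be mapped through `dichotomise` (for example with cut-off "2B" for COVID-RADS or "4" for CO-RADS) and the resulting True/False values passed to `fleiss_kappa`.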
Interobserver agreement of COVID-RADS and CO-RADS grading systems
For COVID-RADS and CO-RADS, the Fleiss' κ values were 0.51 (95% CI: 0.46-0.56) and 0.55 (95% CI: 0.50-0.59), respectively, indicating moderate agreement for both systems, with κ values for individual scores presented in Table 1. At an individual level, the agreement of COVID-RADS was highest between Readers 1 and 2 and lowest between Readers 2 and 3. For CO-RADS, the highest agreement was again observed between Readers 1 and 2 and the lowest agreement was noted between Readers 1 and 3 (Table 3).
The outcomes of a direct comparison of the interobserver agreement of individual COVID-RADS and CO-RADS grades are illustrated in Figure 2. Despite the intrinsic differences in the definition of each grade, the two systems showed a moderate interobserver agreement with a Cohen's κ of 0.51 (95% CI: 0.46-0.56).
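The pairwise agreement between individual readers was assessed with a linearly weighted Cohen's κ, which penalises a one-grade disagreement less than a disagreement across the whole scale. A minimal Python sketch of that statistic follows; the function name `weighted_cohen_kappa` and the `order` argument (the ordered list of grades) are illustrative assumptions, and no confidence interval is computed.

```python
def weighted_cohen_kappa(reader_a, reader_b, order):
    """Linearly weighted Cohen's kappa for two readers on an ordered scale.

    `reader_a` and `reader_b` hold one grade per study, in the same study
    order; `order` lists the grades from lowest to highest.
    """
    k = len(order)
    index = {grade: i for i, grade in enumerate(order)}
    n = len(reader_a)

    # Joint distribution of the two readers' grades.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(reader_a, reader_b):
        observed[index[a]][index[b]] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # Linear disagreement weight: 0 on the diagonal, 1 at the extremes.
        return abs(i - j) / (k - 1)

    d_observed = sum(weight(i, j) * observed[i][j]
                     for i in range(k) for j in range(k))
    d_expected = sum(weight(i, j) * row[i] * col[j]
                     for i in range(k) for j in range(k))
    return 1.0 - d_observed / d_expected
```

With only two categories the linear weights reduce to 0 and 1, so the result coincides with the simple (unweighted) Cohen's κ.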

Intraobserver agreement of COVID-RADS and CO-RADS grading systems
The median intraobserver agreement of COVID-RADS among the three readers was 0.60 (95% CI: 0.43-0.76) whilst CO-RADS demonstrated a considerably higher median intraobserver agreement of 0.81 (95% CI: 0.66-0.95). Individual Cohen's κ values for the three readers are presented in Table 4. At an intraobserver level, 60% and 47% of discrepancies occurred between COVID-RADS Grades 2B and 3 and CO-RADS Grades 4 and 5, respectively.
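As an illustration of how a median intraobserver κ of this kind can be obtained, the sketch below computes Cohen's simple (unweighted) κ for each reader and takes the median across readers. It assumes that each reader's repeat grades are compared with that reader's original grades for the same studies; the function names and the `readings` dictionary layout are hypothetical, and confidence intervals are omitted.

```python
from statistics import median


def cohen_kappa(first_pass, repeat_pass):
    """Cohen's simple kappa between two sets of grades for the same studies."""
    n = len(first_pass)
    categories = set(first_pass) | set(repeat_pass)

    # Observed proportion of studies given the same grade both times.
    p_observed = sum(a == b for a, b in zip(first_pass, repeat_pass)) / n

    # Agreement expected by chance from the marginal grade frequencies.
    p_expected = sum(
        (first_pass.count(c) / n) * (repeat_pass.count(c) / n)
        for c in categories
    )
    return (p_observed - p_expected) / (1 - p_expected)


def median_intraobserver_kappa(readings):
    """Median kappa over readers.

    `readings` maps a reader label to a (first_pass, repeat_pass) pair of
    equally long grade lists (hypothetical layout).
    """
    return median(cohen_kappa(a, b) for a, b in readings.values())
```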

Distribution of COVID-RADS and CO-RADS grades among patients with different degree of pulmonary involvement
The distribution of COVID-RADS and CO-RADS grades among patients with different pre-defined CT features is summarised in Supplementary Material 1, respectively. On average, 91% of both COVID-RADS grades 0 and CO-RADS grades 1 were assigned by the readers to CT-0 directory studies, defined by the authors of the data set as containing CTs inconsistent with viral pneumonia including COVID-19. Conversely, COVID-RADS grades 3 and CO-RADS grades 5 (and to a slightly lesser extent 2B and 4) were proportionately distributed by the three readers among studies from CT-1, CT-2 and CT-3/4 directories, suggesting equal occurrence of findings consistent with COVID-19 regardless of the overall degree of pulmonary involvement.

DISCUSSION
In this study, three differently experienced radiologists from different countries independently assigned COVID-RADS and CO-RADS grades to 200 chest CTs of patients with clinically-epidemiologically diagnosed COVID-19. When applied to a population with high pre-test probability of COVID-19, COVID-RADS and CO-RADS demonstrated comparable interobserver agreement with Fleiss' κ values of 0.51 and 0.55, respectively. However, the median intraobserver agreement was considerably higher for CO-RADS.
In this study, the overall interobserver agreement of CO-RADS was broadly in agreement with that reported originally by Prokop et al. (0.55 vs 0.47), with slight differences expected due to the differences in the study populations and the number of readers. 17 The agreement between individual readers was also in line with that recently reported by de Jaegere et al. 21 The marginally higher agreement of CO-RADS may be explained by several intrinsic differences between the two systems that are clearly visible when κ values of individual scores are compared with each other. It is of note that COVID-RADS had higher interobserver agreement between less experienced readers, which might be explained by its "feature-centric" nature rather than a "pattern-centric" structure of CO-RADS. In other words, COVID-RADS provides a well-defined lexicon of specific features, combinations of which comprise individual scores, thereby offering a more structured approach that may be more appreciated by less experienced readers. Conversely, CO-RADS provides higher flexibility that makes it easier to account for the "bigger picture," thereby requiring a certain degree of experience to be used confidently. This point can be supported by higher interobserver agreement of CO-RADS between more experienced readers that was reported in this study. Finally, in contrast to a subtle difference in the overall interobserver agreement between the two systems, intraobserver agreement of CO-RADS was considerably higher for all readers. A possible explanation is that once readers get used to the patterns described in CO-RADS, their subsequent use then becomes less dependent on subtle features that make more difference for the "feature-centric" COVID-RADS.

BJR|Open
Original research: COVID-RADS vs CO-RADS agreement study
There are certain characteristic features of each system that require particular attention. CO-RADS 1 combines features consistent with both normal findings and those of unequivocal non-infectious aetiology, which in COVID-RADS are represented by two different grades, 0 and 1, likely contributing to the lower intra- and interobserver agreement of COVID-RADS. Furthermore, CO-RADS 3 includes GGOs that do not have an appearance typical for COVID-19, e.g. perihilar GGOs. 22 In contrast, COVID-RADS 2A allows for the presence of only a single area of GGO, whereas peribronchovascular GGOs fall into the COVID-RADS 1 category, again providing the basis for some clinically significant interobserver variation since COVID-RADS 1 and 2A imply different levels of suspicion for COVID-19. Furthermore, the flexibility of CO-RADS 3 in relation to the number and localisation of GGOs makes it easier to assign in cases when false-positive GGOs related to motion, hypoventilation or air trapping are suspected (as illustrated in Figure 3). Conversely, COVID-RADS has a clearer definition of typical findings (score 3) that in CO-RADS are more cautiously distributed between scores 4 and 5, e.g. making a distinction between unilaterally or bilaterally located GGOs, thereby leading to an overlap with COVID-RADS grades 2A, 2B and 3 as evidenced in Figure 2. In addition to the higher κ for COVID-RADS 3 compared to CO-RADS 5, this trend is clearly evidenced by the fact that the overwhelming majority of single-grade discrepancies of CO-RADS occurred between scores 4 and 5, resulting in a substantial improvement in the interobserver agreement of the two systems when scores were dichotomised at cut-off levels of 2B and 4. It should also be stressed that neither system takes into account the overall degree of pulmonary involvement, which was confirmed by the equal distribution of studies with the highest COVID-RADS and CO-RADS grades among patients with different predefined CT groups.
As illustrated in Figure 4, another potential benefit of COVID-RADS is the presence of score 2B, which allows readers to highlight cases where typical COVID-19 features are mixed with atypical findings. This core difference between the two systems is further reflected in Figure 2, where 72 CO-RADS 5 cases were assigned COVID-RADS 2B. It is of note, however, that the majority of single-grade discrepancies

This study has several limitations. Validation of the two grading systems against RT-PCR was not possible due to the unavailability of this information in the original data set; however, evaluating the diagnostic utility of COVID-RADS and CO-RADS was not the aim of this study. Moreover, as pointed out by the authors of both systems, they are primarily applicable to epidemic areas with high estimated prevalence of the disease, and were themselves developed in high-prevalence settings, in which CT allows for a faster and more accurate triage of patients at initial presentation. 17,18 Furthermore, as pointed out by Chen et al., long detection time and dependence on adequate sampling make RT-PCR less adaptable to the clinical workflow and decision-making during an outbreak, implying the need to isolate patients with positive CT findings even in the presence of a negative RT-PCR. 15 In addition, two recently published studies investigating the diagnostic accuracy of CO-RADS in RT-PCR-confirmed cohorts demonstrated its good performance in symptomatic individuals, thereby further supporting its application for triage. 23,24 Moreover, the readers were not blinded to the presence of clinical diagnosis of COVID-19 in the included patients, which may have artificially increased the interobserver agreement of the two systems due to the introduced bias.
This, however, is also representative of a real-life clinical scenario during an epidemiologically severe situation where high pretest prevalence of COVID-19 is estimated and supporting clinical information is almost always available to the reporting clinicians, which was the case in a recent study evaluating the agreement of the RSNA COVID-19 chest CT classification scheme. 21 Furthermore, high vigilance for COVID-19 is likely to remain even between the repeated waves of the epidemic, which makes a certain degree of bias inevitable. Finally, clinical suspicion for COVID-19, which is essentially a clinical-epidemiological diagnosis, is imperative for requesting imaging studies in real-life practice as mandated by the BSTI guidelines. 25 The reading sessions were relatively close together in time, which may have increased the reported intraobserver agreement. The distribution of cases with predefined CT groups in the study group differed considerably from the original data set; however, this was done to avoid bias associated with overrepresentation of the CT-1 group (<25% pulmonary involvement) that could have artificially increased both intra- and interobserver agreement reported in this study. Moreover, there were no studies with CO-RADS scores 0 and 6; however, this did not make the data set non-comparable against COVID-RADS. In turn, the presence of scores 0 and 6 could artificially increase the κ values for CO-RADS as these scores are likely to have almost perfect interobserver agreement by their nature. Finally, we acknowledge the recent development of other standardised reporting systems such as the aforementioned RSNA scheme, chest CT patterns of BSTI guidelines, COVID-19 S, etc., that all have their comparative advantages, evaluating which, however, is beyond the scope of this study. 21,25,26

Figure 4. An example single-grade discrepant case between COVID-RADS and CO-RADS. All readers noted multiple bilateral GGOs (black arrows) located posteriorly at lung bases in the presence of a marked nodular pattern, which gave this case an atypical appearance. Multiple GGOs located close to the visceral pleura represent a mandatory feature of CO-RADS 5, automatically leading to the highest level of suspicion for COVID-19 without the opportunity to somehow highlight their atypical nodular appearance. However, COVID-RADS specifically lists nodular pattern as an atypical finding (Grade 1) that in combination with multiple GGOs (Grade 3) comprises a combined score of 2B, providing readers with an opportunity to highlight cases where certain combinations may slightly reduce the overall level of suspicion for COVID-19. GGO, ground glass opacification.

Table 4. Cohen's κ with 95% CIs demonstrating the intraobserver agreement of COVID-RADS and CO-RADS grades for Reader 1 (abdominal radiology fellow), Reader 2 (thoracic imaging fellow) and Reader 3 (cardiothoracic radiologist)

In conclusion, both COVID-RADS and CO-RADS represent reproducible reporting systems of chest CTs of patients with suspected COVID-19 and can be confidently used by differently experienced radiologists in a population with high estimated prevalence of the disease, which is further supported by other studies suggesting high diagnostic performance of the two systems compared to RT-PCR testing.
Whilst CO-RADS has a considerably higher intraobserver agreement and a slightly higher overall interobserver agreement across all readers, this is counterbalanced by a higher interobserver agreement of COVID-RADS between less experienced readers and a similar interobserver agreement when it comes to scores representing high clinical suspicion of the disease, thereby suggesting possible interchangeability of the two systems depending on the individual reader's preferences and experience level, with factors informing the final choice summarised in this study.