Comparing Prognostic Factors of Cancers Identified by Artificial Intelligence (AI) and Human Readers in Breast Cancer Screening

Simple Summary Current AI algorithms show breast cancer detection rates that are comparable to human readers, but it is not clear whether AI and humans detect cancers with similar characteristics. As these factors will influence survival, we aimed to compare the invasiveness status, histological grade, lymph node stage, and tumour size of cancers. Women diagnosed with breast cancer between 2009 and 2019 from three UK double-reading sites were included in this retrospective cohort evaluation. From 1718 screen-detected cancers (SDCs) and 293 interval cancers (ICs), AI indicated 85.9% and 31.7%, respectively, as suspicious. The first human reader detected 90.8% of SDCs and 7.2% of ICs. There were no differences in the detected proportion for any of the investigated prognostic factors. The AI algorithm detected more ICs. These findings imply that using AI has limited or no downstream effects on screening programmes, supporting its potential role in the double-reading workflow. Abstract Invasiveness status, histological grade, lymph node stage, and tumour size are important prognostic factors for breast cancer survival. This evaluation aims to compare these features for cancers detected by AI and human readers using digital mammography. Women diagnosed with breast cancer between 2009 and 2019 from three UK double-reading sites were included in this retrospective cohort evaluation. Differences in prognostic features of cancers detected by AI and the first human reader (R1) were assessed using chi-square tests, with significance at p < 0.05. From 1718 screen-detected cancers (SDCs) and 293 interval cancers (ICs), AI flagged 85.9% and 31.7%, respectively. R1 detected 90.8% of SDCs and 7.2% of ICs. Of the screen-detected cancers detected by the AI, 82.5% had an invasive component, compared to 81.1% for R1 (p-0.374). For the ICs, this was 91.5% and 93.8% for AI and R1, respectively (p = 0.829). For the invasive tumours, no differences were found for histological grade, tumour size, or lymph node stage. The AI detected more ICs. In summary, no differences in prognostic factors were found comparing SDC and ICs identified by AI or human readers. These findings support a potential role for AI in the double-reading workflow.


Introduction
The positive effect of breast cancer screening on survival rates is mainly due to the early detection of invasive cancers [1,2]. Recently, the relative sensitivity of human readers for grade 3 invasive cancers compared to grade 1 invasive cancers was estimated as low as 52% [3]. Although this indicates that there might be room for improvement, it is even more important to ensure that there is no decrease in the detection of clinically important diseases when introducing new workflows or new technologies. Studies have reported that AI performance is comparable to or might even outperform humans in the interpretation of screening studies [4][5][6][7][8]. In these studies, performance is often assessed in terms of general outcome metrics such as cancer detection rate and recall rate [9], but information on the prognostic features of screen-detected breast cancers is frequently not provided. However, patient outcomes, including survival, are dependent on the characteristics of the detected tumours [10,11]. There are two clinical issues of importance: the first is to ensure that AI detects aggressive, potentially fatal cancers early; the second is to check whether AI detects an excess of in situ or otherwise "minimal" cancers, in which case it might be contributing to overdiagnosis. The need to assess the downstream effects of using AI is also emphasised by the UK National Screening Committee, which recently published their approach for reviewing AI evidence [9].
While the topic is considered important, original research on it is scarce. Lee et al. reported that the AI gave higher malignancy scores to invasive cancers compared to noninvasive cancers but did not compare the proportion found by the AI to that of human readers [12]. McKinney et al. reported in a supplementary file that there were no differences in characteristics of tumours detected by AI or by human readers, but on a limited set of screen-detected cancers [13].
The American Joint Committee on Cancer now incorporates biomarkers such as hormone receptor status and HER2 status into the 8th edition of its cancer staging system [14], but the traditional prognostic features of tumour size, lymph node involvement, and histological grade are still widely and consistently used to assist in therapeutic decision-making as they confer strong, independent associations with breast cancer survival [15,16]. A comparison of AI and human performance for subgroups based on invasive/in situ status of disease, histological grade, lymph node stage, and tumour size would provide useful information for the way that AI could be integrated in screening programmes in order to support sustainability by presenting opportunities to save workload while maintaining or even improving screening performance. The aim of this analysis was to compare the characteristics of cancers detected by artificial intelligence (AI) and human readers using these important prognostic features.

Study Populations and Samples
A retrospective study was carried out at four centres: three in the UK and one in Hungary. The trial was registered at ISRCTN (https://www.isrctn.com (accessed on 15 March 2023), trial registration number ISRCTN18056078). The current analysis is based on data from the UK sites only, which all adhere to a three-year screening interval. The study population consisted of women who participated in breast cancer screening and were diagnosed with breast cancer. The study period ranged from 2009 to 2019. All cancers detected in this consecutive ten-year historical cohort of de-identified cases were included. The three UK centres included Leeds Teaching Hospital NHS Trust (LTHT), Nottingham University Hospitals NHS Trust (NUH), and United Lincolnshire Hospitals NHS Trust (ULH). Details of the cohort have been published elsewhere [17].
Mammograms were acquired on equipment from a single vendor at each screening site (Hologic at LTHT, GE Healthcare at NUH, and Siemens at ULH). In the UK, women are routinely invited between the ages of 50 and 70, with women over 70 years of age able to self-refer. The screening programmes used double human reading as the standard of care throughout the study period. Discrepancies between readers in the double-reading workflow were resolved by arbitration using an independent third reader or consensus, depending on local screening centre protocols. When readers agreed to recall a case, either a "recall" decision was reached or an arbitration performed by a single or group of radiologists made the definitive "recall" or "no recall" decision, also depending on the local screening centre protocols. For the current analysis, human reader performance was based on the first reader's opinion (R1). At all sites, the historical first reader's opinion was made in isolation, and the second reader had access, at their discretion, to the opinion of the first. Therefore, the first human reader's opinion is a more reliable indicator of individual human performance.

AI System
All study cases were analysed by the Mia TM version 2.0 'AI system', developed by Kheiron Medical Technologies. The AI system works with standard DICOM (Digital Imaging and Communications in Medicine) cases as inputs, analyses four images with two standard full field digital mammography (FFDM) views per breast (i.e., CC and MLO), and generates a binary suggestion of "recall" (for further assessment due to suspected malignancy) or "no recall" (until the next screening interval). The AI system's output is deterministic and is based on a single prediction per case. The system used pre-defined thresholds for "recall" or "no recall".
The AI software version was fixed prior to the study. All study data came from participants, whose data were never used in any aspect of algorithm development.

Data Collection
All positive cases, both screen-detected and interval cancers, were pathology proven malignancies. The presence of an invasive component, histological grade, tumour size, and lymph node status were retrieved from the NHS National Breast Screening Service (NBSS) database, including cancer registry information. In addition, the human readers' recall decisions were also retrieved from the NBSS database.

Statistical Analysis
The screen-detected cancers and interval cancers (IC) were analysed separately. For all tumour characteristics, numbers, percentages, and 95% Wilson score confidence intervals were reported [18,19]. The proportion detected by the AI system was compared to R1 using the chi-square test. If subcategories contained too few cases, Fisher's exact test was applied. Statistically significant differences in the detection of cancers with specific prognostic factors were concluded if the chi-square test resulted in a p-value of <0.05. Invasive tumour size was classified as ≤20 mm or >20 mm in accordance with the TNM staging system. In addition, a refined categorisation into 5 categories was analysed: ≤5 mm, 5-10 mm, >10-20 mm, >20-50 mm, and >50 mm.
The sensitivity of AI and R1 for each prognostic subgroup was compared using McNemar's test [20]. Point estimates and 95% confidence intervals for sensitivity were reported. The relative sensitivity of the AI, defined as the percentage of cancers that the AI detected from the group of R1-detected cancers, and the relative sensitivity of R1, defined as the percentage of cancers that R1 detected from the group of AI-detected cancers, were also reported.
Commonly used outcome metrics for breast screening programmes have recently been reported for the study population [17]. These metrics include recall rate (RR; number of cases suggested to call back for further assessment), cancer detection rate (CDR; number of cancers detected per 1000 screens), sensitivity (SEN), specificity (SPEC), and positive predictive value (PPV; number of cancers detected divided by the number of cases suggested to call back for further assessment). The confidence intervals for the sensitivity in the prognostic subgroups, the relative sensitivity, and the screening outcome metrics are based on 4000 bootstraps. All analyses were performed using Python 3.8 [21].

Characteristics of the Study Sample
Data from a previously performed study was used [17]. For the current analysis, we focused on the patients diagnosed with cancer. In addition, we only included the UK population, as this data was more complete. While the original study excluded patients with a history of breast cancer, they were included in this analysis. This resulted in a study sample of 194,145 UK cases. Figure 1 shows the study flow diagram. There were 1718 screendetected cancers and 293 interval cancers among 194,145 screening examinations.

Characteristics of the Study Sample
Data from a previously performed study was used [17]. For the current analysis, we focused on the patients diagnosed with cancer. In addition, we only included the UK population, as this data was more complete. While the original study excluded patients with a history of breast cancer, they were included in this analysis. This resulted in a study sample of 194,145 UK cases. Figure 1 shows the study flow diagram. There were 1718 screen-detected cancers and 293 interval cancers among 194,145 screening examinations.

General Performance Metrics
The AI system detected 85.9% (1475) of the screen-detected cancers and indicated 31.7% (93) of the original screening mammograms of ICs as suspicious. R1 detected 90.8% (1560) of the screen-detected cancers and 7.2% (21) of the original screening mammograms of ICs (Table S1, Figure 2). General outcome metrics for the initial study have been reported elsewhere [17].

General Performance Metrics
The AI system detected 85.9% (1475) of the screen-detected cancers and indicated 31.7% (93) of the original screening mammograms of ICs as suspicious. R1 detected 90.8% (1560) of the screen-detected cancers and 7.2% (21) of the original screening mammograms of ICs (Table S1, Figure 2). General outcome metrics for the initial study have been reported elsewhere [17].

Histological Grade
The distribution of the tumour characteristics for the invasive cancers is shown in Tables 2 and 3, for the screen-detected cancers and ICs, respectively, as well as Figure 3a-d. A refined categorisation for whole tumour size and invasive tumour size can be found in Table S1. The 95% confidence intervals for the histological grade distribution showed large overlap, for screen-detected as well as ICs. The screen-detected invasive cancers consisted of 17.2% and 17.3% of grade 3 tumours, for the AI and R1, respectively (p = 0.981). For the invasive ICs, the percentage of grade 3 cancers was 35.9% for the AI and 33.3% for R1 (p = 0.851). The absolute number of detected ICs was much lower than the number of screen-detected invasive cancers, resulting in wider confidence intervals (Table 3).

Tumour Size
There were no differences in invasive tumour size for screen-detected cancers identified by AI or R1: 83.1% (907/1091) and 83.1% (941/1133) were ≤20 mm, respectively, p = 0.995. A comparison of the whole tumour size, which includes the size of invasive and non-invasive disease, showed no difference between AI and R1: 66.9% (723/1080) and 66.1% (742/1122). For the invasive tumour size of the ICs, the proportion of small cancers for the AI (56.7%) was lower than for R1 (64.3%), but the chi-square test resulted in a non-significant p-value of 0.826 (Table 3). For invasive as well as whole tumour size, the 95% confidence intervals largely overlapped. Results for a refined categorisation into 5 categories (≤5 mm, 5-10 mm, 10-20 mm, 20-50 mm, and >50 mm) can be found in Supplementary Table S2.

Lymph Node Stage
The proportion of screen-detected cancers with a positive lymph node status was 20.0% and 20.1% for the AI and R1, respectively (p = 0.959), while these proportions were 36.4% and 40.0% for the ICs (p = 0.999). Additionally, for this tumour characteristic, the overlap of the 95% confidence intervals was large.

Sensitivity and Relative Sensitivity per Prognostic Subgroup of R1 and AI
The sensitivity of R1 and AI for cancers with specific characteristics is shown in Table 4. While R1 has a significantly higher sensitivity for non-invasive cancers, the AI shows a somewhat higher sensitivity for cancers with a large invasive tumour size and for highgrade tumours. This finding is mainly due to the higher proportion of ICs that the AI detects, as shown in Table S3a,b for screen-detected and interval cancer, respectively. The 95% confidence intervals show again a large overlap for most characteristics, except for non-invasive cancers, where R1 s sensitivity is higher than the AI. Table S4 displays the relative sensitivity per prognostic subgroup. For most subgroups, the relative sensitivities of the AI and R1 are in line with each other, indicating that the number of cancers detected by R1 from the group of cancers detected by the AI is comparable to the number of cancers detected by the AI from the group of cancers detected by R1.

Discussion
The aim of this large-scale evaluation was to compare the prognostic features of cancers detected by AI and human readers. The proportion of invasive cancers detected by AI was comparable to the proportion detected by R1. Additionally, for the tumour characteristics of the invasive cancers, no differences were found between the AI and R1. Results for screen-detected cancers and ICs were similar, although the number of ICs in the dataset was lower, which resulted in broader confidence intervals. In addition, the AI detected a higher number of ICs than the human readers (31.7% versus 7.2%, respectively), indicating the complementary value of the AI to humans. Our results indicate that using AI in breast screening will not have downstream effects on screening programmes. Moreover, the findings suggest that cancers detected by AI and human readers are likely to have a similar clinical course and outcome, supporting the potential role of AI as a reader in the double-reading workflow. As such, these results pave the way for prospective assessment of AI, either in clinical studies or in service evaluations, as a safe next step.
Very few previous studies have assessed the prognostic features of cancers detected by AI. McKinney et al. compared the histological features of cancers detected by the first human reader and the AI system and found no difference in the presence of invasive disease, histological grade, or size, but their analysis only included 414 cancers [13]. By contrast, our analysis is based on 1718 screen-detected cancers and 293 interval cancers. Lee et al. reported that the AI gave higher malignancy scores to invasive cancers compared to non-invasive cancers but did not compare the proportion found by the AI to that of human readers [12]. Leibig et al. reported the sensitivity of a large number of screen-detected cancers for AI versus radiologists, but the operating point was not pre-specified, making it difficult to interpret their results in terms of real-world performance [22].
The study of Leibig et al. only included screen-detected cancers, while we analysed screen-detected as well as interval cancers [22]. This gives a more complete picture of the cancers that can be detected with AI. We have shown that the sensitivity for prognostic subgroups of R1 and AI is comparable, although AI had a somewhat higher sensitivity for cancers with a larger invasive tumour size and a higher grade, while R1 was more sensitive in finding non-invasive cancers. These differences relate to the fact that the AI detected more ICs than R1.
It is well known that interval cancers have less favourable prognostic features compared to screen-detected cancers [23]. Detecting these cancers earlier has the potential to reduce breast cancer mortality. In our study, AI was able to detect 31.7% of screens that subsequently presented as interval cancers, and it is reassuring to note that grade 3 tumours were overrepresented within the group of interval cancers detected by AI, with 35.9% being histological grade 3 compared to 17.2% of screen-detected cancers flagged by AI. Our findings are very similar to those of Larsen et al., who in a recent large retrospective study found that an AI system set at a threshold to mirror the average human reader's rate of positive interpretations was able to flag 30.7% of interval cancers, with 33% being grade 3 [5]. In both studies, the absolute number of interval cancers is low compared to the number of screen-detected cancers, so confidence intervals are wider and differences are not statistically significant, but the ability of an AI algorithm to identify clinically significant cancers sooner would be beneficial to breast cancer screening programmes.
In the last decades, there has been an ongoing debate about the benefits and harms of breast cancer screening in terms of overdiagnosis, whether of ductal carcinoma in situ (DCIS) or invasive cancer [24][25][26][27]. It is reassuring that in our study, the AI detected a similar pattern of cancers as the human reader. This implies that expected changes in the downstream impact, in terms of clinical outcomes as well as health economics, will be minimal if AI is used as an independent reader. As such, these expected minor changes strengthen our opinion that it is safe to initiate prospective studies and service evaluations where AI is actually integrated into the breast screening workflow. At the moment, these studies and evaluations are lacking [28][29][30].
The strength of our study is that the data has been acquired from large real-world screening populations with mammograms acquired from multiple mammography equipment vendors. However, the missing data for some cancer cases might be seen as a limitation. In addition, cancers were detected several months to three years after the initial screening. This delay makes it difficult to interpret the tumour characteristics, as it is unknown whether a biopsy and subsequent surgery performed at the time of the initial screening would have resulted in a smaller or non-invasive cancer. In some cases, cancer might not have been present at all on the initial screen. Additionally, interval cancer data is incomplete for the later part of the ten-year screening period as not enough time has elapsed for all interval cancer cases to have presented or been notified. As the AI system has been shown to be more sensitive than R1 on interval cancers, it is expected that with a more complete collection of interval cancers, the relative sensitivity of the AI system to R1 would be revealed to be greater. For the same cohort and AI system as in this study, this was reported to be the case where the relative difference in sensitivity of the AI system to R1 ranged from −1.4 to 0.8% across UK sites for the whole study time period, including the later years with missing interval cancer data, but was revealed to be greater, ranging from 2.5% to 7.6% for a study time period with more complete interval cancer data [17]. Finally, integrating AI into the screening workflow can be carried out in multiple ways, and the interplay between human readers and AI is complex. Ultimately, prospective data will be needed to determine the detected cancer spectrum of a breast screening programme where AI has been implemented.
Despite these limitations, our study, based on a large number of screen-detected breast cancers and interval cancers, provides valuable insights into the cancer spectrum detected by AI compared to human readers. The ability of AI to detect breast cancers with similar prognostic features to human readers suggests that cancers detected by AI and human readers will have a comparable clinical course and outcome, including survival. In addition, AI also has the potential to identify interval cancers earlier, which is likely to be of benefit to individual women and the wider screening programme. Prospective studies combining an AI opinion and a human reader are currently lacking and will be needed to better understand the effect on cancer detection, interval cancer rates, and prognostic features.

Conclusions
The results of this analysis are based on a large number of screen-detected and interval cancers. The AI showed that it can potentially detect interval cancers earlier, which could be beneficial for individual women and the wider screening programme. The cancers detected by the AI and by human readers were comparable in terms of invasiveness, histological grade, tumour size, and lymph node status. Therefore, cancers detected by AI are expected to have a similar clinical course and outcome, including survival. This implies that using AI in a double-reading workflow will have no or limited effects on screening programmes.
These findings indicate that it is safe to integrate AI into breast cancer screening programmes. As such, it paves the way for initiating prospective studies and service evaluations.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/cancers15123069/s1: Table S1. Agreement between human reader and AI for detection of cancers; Table S2a. Size of invasive cancers detected on screen by human reader or AI; Table S2b. Size of invasive interval cancers detected by human reader or AI; Table S3a. Sensitivity per prognostic subgroup for screen-detected cancers; Table S3b. Sensitivity per prognostic subgroup for interval cancers; Table S4. Relative Sensitivity of AI and R1 per subgroup for screen-detected and interval cancers combined; Figure S1a  The study was performed in accordance with the principles outlined in the Declaration of Helsinki.

Informed Consent Statement:
The need for consent to participate was reviewed by HRA and ETT-TUKEB and confirmed to not be required as the study involved secondary use of retrospective and pseudonymised data.

Data Availability Statement:
Restrictions apply to the availability of these data. Data generated or analysed during the study are available from the corresponding author by request. The imaging datasets are obtained under licences for this study and are not publicly available.

Conflicts of Interest:
All authors have completed the MDPI disclosure form. The following authors are paid employees of Kheiron Medical Technologies: Cary J.G. Oberije, Annie Y. Ng, Jonathan Nash, and Peter D. Kecskemethy. All other authors received no payment for this work. There are no other relationships or activities that could appear to have influenced the submitted work.