Performance of ultrasonography screening for breast cancer: a systematic review and meta-analysis

We investigated the performance of primary ultrasound (P-US) screening for breast cancer and that of supplemental ultrasound (S-US) screening for breast cancer after negative mammography (MAM). Electronic databases (PubMed, Scopus, Web of Science, and Embase) were systematically searched to identify relevant studies published between January 2003 and May 2018. Only high-quality or fair-quality studies reporting any of the following performance values for P-US or S-US screening were included: sensitivity, specificity, cancer detection rate (CDR), recall rate (RR), biopsy rate (BR), proportion of invasive cancers among screening-detected cancers (ProIC), and proportion of node-negative cancers among screening-detected invasive cancers (ProNNIC). Twenty-three studies were included: 12 studies in which S-US screening was used after negative MAM and 11 joint screening studies in which both primary MAM (P-MAM) and P-US were used. Meta-analyses revealed that S-US screening could detect 96% [95% confidence interval (CI): 82 to 99%] of occult breast cancers missed by MAM and identify 93% (95% CI: 89 to 96%) of healthy women, with a CDR of 3.0/1000 (95% CI: 1.8/1000 to 4.6/1000), RR of 8.8% (95% CI: 5.0 to 13.4%), BR of 3.9% (95% CI: 2.7 to 5.4%), ProIC of 73.9% (95% CI: 49.0 to 93.7%), and ProNNIC of 70.9% (95% CI: 46.0 to 91.6%). Compared with P-MAM screening, P-US screening led to the recall of significantly more women with positive screening results [1.5% (95% CI: 0.6 to 2.3%), P = 0.001] and detected significantly more invasive cancers [16.3% (95% CI: 10.6 to 22.1%), P < 0.001]. However, there were no significant differences between the two screening methods in the other performance measures, including sensitivity, specificity, CDR, BR, and ProNNIC. Current evidence suggests that S-US screening could detect occult breast cancers missed by MAM.
P-US screening has been shown to be comparable to P-MAM screening in women with dense breasts in terms of sensitivity, specificity, cancer detection rate, and biopsy rate, but with higher recall rates and higher detection rates for invasive cancers.


Background
Cancer is a global public health issue. In 2016, an estimated 17.2 million cancer cases and 8.9 million cancer deaths occurred worldwide [1]. For women, breast cancer was both the most commonly occurring cancer and the leading cause of cancer deaths and disability-adjusted life-years (DALYs) (1.7 million incident cases, 535,000 deaths, and 14.9 million DALYs) [1]. Over the years, the burden of cancer has shifted from more developed countries to less developed countries [2]. Moreover, the burden is expected to grow worldwide due to the aging of the population and the adoption of lifestyle behaviors such as smoking, poor diet, physical inactivity, and reproductive changes (including lower parity and later age at first birth), particularly in less developed countries [3]. Therefore, broad prevention measures, such as cancer screening, are urgently needed to control this increasing burden, especially in less developed countries.
Mammography (MAM) has been used to screen for breast cancer since the 1970s and is now widely available in developed countries. However, in less developed countries, such as China, MAM is not easily accessible due to several barriers, including insufficient MAM equipment, inadequate insurance coverage for MAM, and widely dispersed populations [3]. Moreover, MAM has a low sensitivity in women with dense breasts [4], who may be at higher risk of breast cancer than those without dense breasts [5]. Worryingly, studies from Denmark and the Netherlands showed that nearly one in three and nearly one in two screening-detected breast cancers, respectively, represent overdiagnosis [6,7].
Recent data indicate that supplemental ultrasonography (S-US) screening could detect occult breast cancers missed by MAM, and that primary ultrasonography (P-US) screening seems to perform comparably to primary MAM (P-MAM) screening [8][9][10][11]. However, few systematic reviews of the performance of S-US or P-US screening have been published. Moreover, in screening studies in which both P-MAM and P-US were used, researchers focused only on the performance differences between joint screening and P-MAM screening alone, and few studies investigated the independent performance of P-US screening. Therefore, we conducted this systematic review and meta-analysis to provide a global profile of S-US screening after negative MAM screening and of P-US screening for breast cancer.

Methods
This meta-analysis was reported in line with the Preferred Reporting Items for a Systematic Review and Meta-analysis of Diagnostic Test Accuracy Studies (PRISMA-DTA) statement [12].

Types of studies and participants
Randomized controlled trials (RCTs) and prospective or retrospective screening cohort studies focusing on the performance of P-US screening for breast cancer, or on the performance of S-US screening for breast cancer after negative MAM, were included. The screening performance included the following indicators: sensitivity, specificity, cancer detection rate (CDR), recall rate (RR), biopsy rate (BR), proportion of invasive cancers among screening-detected cancers (ProIC), and proportion of node-negative invasive cancers among screening-detected invasive cancers (ProNNIC). The types of ultrasonography (US) included were hand-held ultrasonography (HHUS) and automated whole breast ultrasonography (ABUS). Diagnostic studies of patients with histopathologically proven breast cancer or of women with suspicious findings after initial screening were excluded. Screening studies for second cancers among women previously diagnosed with breast cancer were also excluded.

Searching strategies
A comprehensive search was conducted according to the Cochrane handbook guidelines. Because the American College of Radiology (ACR) introduced the Breast Imaging Reporting and Data System (BI-RADS) classification for breast ultrasonography examinations in 2003 [13], the search period started in that year. Electronic databases (PubMed, Scopus, Web of Science, and Embase) were systematically searched to identify relevant studies published in English between January 2003 and May 2018. Five groups of key words were used in the search strategies: (1) breast neoplasm, breast cancer, breast carcinoma; (2) ultrasound, ultrasonography; (3) screening; (4) supplemental, supplementary, adjunct, adjunctive, combined, joint, primary, single, alone; (5) sensitivity, specificity, detection rate, recall rate, biopsy rate. Reference lists from retrieved articles were also reviewed. Detailed search strategies are provided in Supplementary S1.

Selection of studies
Two authors independently screened the titles and abstracts of all selected articles to confirm their eligibility. All selected articles were managed in EndNote, which allows reviewers to organize articles and detect duplicate publications. When two or more articles from the same trial were selected, the article with the larger sample size, longer duration of follow-up, or the latest results was included. Any disagreement on the selection of articles was discussed and arbitrated by a third author. Details of the selection process are provided in Supplementary S2.

Data extraction
Two authors independently extracted the following data from the qualifying studies: general information (name of first author, year of publication, and country or countries where the study was performed), study design (sample size, median age, percentage of women with dense breasts in the whole population, type of US, screening mode), performance of US, and information for the risk-of-bias assessment (described in the following section). Because there is no consensus on whether dense breasts are an independent risk factor for breast cancer [5,14], and to avoid labeling women with dense breasts as 'high risk', we recorded breast density as an attribute of average-risk women. All data were entered into STATA 14.0 software for analysis. Any disagreements on the extracted data were also discussed and arbitrated by the same third author.

Risk assessment of bias in included studies
Two investigators independently and critically appraised all included studies according to pre-specified criteria, which were adapted from the USPSTF's design-specific criteria and the STARD checklist for reporting diagnostic accuracy studies [15,16]. The adapted criteria included 7 items: source of population, sample size, inclusion and exclusion criteria, blinding of test, data completeness, BI-RADS criteria, and reference standards. The result for each item was classified as high-risk or low-risk. Detailed information on the adapted risk-of-bias assessment criteria is provided in Supplementary S3.
According to the above-mentioned criteria, high-quality studies were defined as those meeting at least six low-risk items for joint screening studies and at least five low-risk items for S-US screening studies. Fair-quality studies met four or five low-risk items for joint screening studies and three or four low-risk items for S-US screening studies. Poor-quality studies were defined as those meeting fewer than four low-risk items for joint screening studies and fewer than three low-risk items for S-US screening studies. Poor-quality studies were excluded from the review.
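The grading rule above can be written as a small lookup. The following Python sketch is only an illustration of the thresholds as stated in the text; the function name `classify_quality` and the labels `"joint"` and `"s_us"` are hypothetical and do not come from the original analysis.

```python
def classify_quality(low_risk_items: int, study_type: str) -> str:
    """Grade a study from its number of low-risk items (0 to 7).

    study_type is "joint" (P-MAM plus P-US) or "s_us" (S-US after
    negative MAM). Both labels are illustrative assumptions.
    """
    if not 0 <= low_risk_items <= 7:
        raise ValueError("low_risk_items must be between 0 and 7")

    # (minimum for high quality, minimum for fair quality) per study type
    thresholds = {"joint": (6, 4), "s_us": (5, 3)}
    if study_type not in thresholds:
        raise ValueError("study_type must be 'joint' or 's_us'")

    high_min, fair_min = thresholds[study_type]
    if low_risk_items >= high_min:
        return "high"
    if low_risk_items >= fair_min:
        return "fair"
    return "poor"  # poor-quality studies were excluded from the review
```

For example, a joint screening study with five low-risk items would be graded as fair quality, whereas an S-US study with the same count would be graded as high quality.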

Data synthesis and analysis
All data were extracted with pre-specified uniform tables and recalculated with uniform methods. The corresponding authors were contacted to obtain any missing information from their studies. Some studies used the number of 'examinations' rather than the number of 'women' as the denominator when calculating the detection rate of breast cancer, because each woman was followed up several times and was examined at each follow-up. If we had instead used the number of 'women' as the denominator for these studies, the detection rate would have been substantially overestimated, since the number of 'women' was much smaller than the number of 'examinations'. Therefore, to follow the analysis protocol of the original studies and avoid overestimating the detection rate, we treated each examination as an independent woman. However, this approach could bias the estimates because repeated observations within a woman are not independent.
The cancer detection rate was defined as the number of cancers detected (including carcinoma in situ and invasive cancer, but not high-risk precancerous lesions) among all examinations or participants. The recall rate was calculated as the number of women recalled for further diagnostic examinations divided by the total number of women who participated in the screening. If the number of women recalled for further diagnostic examinations was not available, the number of women with a positive result on the index screening modality was used instead. The biopsy rate was calculated as the number of women recalled for pathological examination divided by the total number of women who participated in the screening.
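The three rate definitions above reduce to simple ratios over the number screened. The Python sketch below illustrates them; the function name `screening_rates` and the counts in the usage note are made up for illustration and are not data from any included study.

```python
def screening_rates(n_screened: int, n_cancers: int,
                    n_recalled: int, n_biopsied: int) -> dict:
    """Compute CDR, recall rate, and biopsy rate for one study.

    n_screened is the denominator: women, or examinations when a
    study reports per-examination results. All argument names are
    illustrative assumptions.
    """
    if n_screened <= 0:
        raise ValueError("n_screened must be positive")
    return {
        # cancers (in situ or invasive) per 1000 screened
        "cdr_per_1000": 1000 * n_cancers / n_screened,
        # percentage recalled for further diagnostic examination
        "recall_rate_pct": 100 * n_recalled / n_screened,
        # percentage referred for pathological examination
        "biopsy_rate_pct": 100 * n_biopsied / n_screened,
    }
```

With hypothetical counts of 10,000 examinations, 30 cancers, 880 recalls, and 390 biopsies, `screening_rates(10000, 30, 880, 390)` gives a CDR of 3.0 per 1000, a recall rate of 8.8%, and a biopsy rate of 3.9%.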
The variation in screening performance attributable to heterogeneity was measured with I². If the P value for heterogeneity was less than 0.1, significant heterogeneity among the included trials was indicated and the random-effect model was used to combine screening performances [17]. Otherwise, the fixed-effect model was used. To search for sources of heterogeneity and obtain clinically meaningful estimates, subgroup analyses were conducted, whenever possible, according to study characteristics such as sample size > 1000 (Yes/No), all women with dense breasts (Yes/No), type of US (HHUS/ABUS), and quality assessment (Yes/No).
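The model-selection rule above can be illustrated with a generic inverse-variance calculation. The Python sketch below computes Cochran's Q, I², and the chi-square P value for a set of study estimates and then applies the P < 0.1 rule. It is a simplified illustration under stated assumptions, not the procedure implemented in the STATA commands used in this review (which may transform proportions before pooling); the function names and the example inputs are hypothetical.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square distribution.

    Uses the power series for the regularized lower incomplete gamma
    function, which is adequate for the small Q values seen here.
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def heterogeneity(estimates, variances, alpha=0.1):
    """Return Q, I² (percent), P value, and the model implied by the rule."""
    if len(estimates) != len(variances) or len(estimates) < 2:
        raise ValueError("need at least two estimates with matching variances")
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    p = chi2_sf(q, df)
    model = "random" if p < alpha else "fixed"
    return {"Q": q, "I2": i2, "p": p, "model": model}
```

For three hypothetical estimates of 0.1, 0.5, and 0.9, each with variance 0.01, Q is 32 on 2 degrees of freedom and I² is 93.75%, so the rule selects the random-effect model. Identical estimates give Q = 0 and the fixed-effect model.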
The package "midas" was used to combine sensitivity and specificity, to investigate whether there were potential publication biases among included studies, and to plot the summary receiver operating characteristic (SROC) curve with its 95% confidence and prediction contours [18]. The package "metaprop" was used to combine CDR, RR, BR, ProIC, and ProNNIC [19]. In addition, the package "metan" was used to compare the performances between MAM and US [20].
All meta-analyses were conducted with STATA software (version 14.0). All tests were two-sided, and P values of less than 0.05 for all meta-analyses indicated statistical significance.

Results
Supplementary S2 shows a flowchart of the study selection procedure. The electronic searches yielded 1162 potentially relevant studies, of which 23 eligible studies were included in the final review [9][10][11][21][22][23][24][25][26][27][28][29][30][31][32][33][34][35][36][37][38][39][40], including 12 studies in which S-US screening was used after negative MAM and 11 joint screening studies in which both P-MAM and P-US were used. Table 1 shows the baseline characteristics of the 23 studies. Twelve studies were conducted among women with dense breasts. Twenty studies screened women with HHUS. Twelve studies were conducted among general

Screening accuracy for S-US and P-US screening
Table 2 shows the original data of screening accuracy for S-US and P-US screening among the included studies. Based on meta-analyses, S-US screening could detect (Fig. 1a, Supplementary S4).

Subgroup analyses
Subgroup analyses yielded results very similar to those of the primary analyses (Supplementary S7 and S8). In addition, lower sensitivity, higher specificity, higher cancer detection rate, and higher biopsy rate were found for S-US screening among women with dense breasts compared with those without dense breasts (Supplementary S7). Moreover, the differences in sensitivity, specificity, and cancer detection rate between P-MAM screening and P-US screening were larger among women with dense breasts than among those without dense breasts (Supplementary S8).

Discussion
The U.S. Preventive Services Task Force (USPSTF) previously reviewed the performance and clinical outcomes of S-US screening in women with dense breasts or negative mammography [15]. However, only two studies were included. The authors concluded that the effects of S-US screening on breast cancer outcomes remain unclear because good evidence is sparse [15]. In addition, Gartlehner systematically reviewed the evidence on the joint effectiveness of screening with P-MAM and P-US compared with MAM screening alone [41]. However, that review did not investigate the performance of P-US screening. Our study is the first systematic review and meta-analysis to investigate the performance of P-US screening for breast cancer, and it is also an important up-to-date systematic review and meta-analysis of the performance of S-US screening.
The role of S-US screening was first addressed in ACRIN 6666 by Berg in 2008 [4]. Berg concluded that adding S-US screening to P-MAM screening would yield an additional 1.1 to 7.2 cancers per 1000 high-risk women [4]. Our analyses found a comparable range of 0.4 to 22.4 additional cancers per 1000 examinations. Moreover, after reanalysis of ACRIN 6666, Berg concluded that ultrasound could be used as the primary screening method for breast cancer [11]. However, up to now, there have been no consistent conclusions concerning whether US screening should be recommended as the primary  [46][47][48][49]. Several reasons could explain these inconsistent recommendations among current guidelines. As argued by the USPSTF, the sparsity of good evidence would be the major reason. However, as shown in our study, several high-quality and fair-quality studies have been conducted since 2003. Although EUSOBI supported S-US screening after P-MAM, it also raised the concern that breast US was inappropriately suggested as a primary screening method, since P-US screening had not been shown to reduce breast cancer mortality in the general female population. Moreover, US would lead to more biopsies and recalls than MAM [45]. In this systematic review, we did observe higher recall rates for US compared with MAM. We also observed higher biopsy rates for US compared with MAM; however, the difference was nonsignificant. This nonsignificant difference in biopsy rates between US and MAM may be due to small sample sizes, but it may also reflect no actual difference. In addition, several limitations of breast ultrasound would make it inappropriate as a screening test: US cannot image the whole breast at once as MAM does; US cannot show microcalcifications, which would be the most common feature of tissue around a tumor; and the skill level of the US operator makes a great difference to the screening results.
However, as shown in our study, these concerns did not seem to cause significant differences in sensitivity and specificity, or even in cancer detection rates and cancer characteristics (such as the proportion of node-negative invasive cancers), between P-US screening and P-MAM screening. Moreover, lower price, larger coverage, absence of radiation effects, and lower overdiagnosis rates for US compared with MAM make US more easily accepted in China and other countries [3,9,50]. Therefore, the Chinese Anti-Cancer Association and other societies supported S-US screening in their guidelines. Lastly, the following results are noteworthy. First, we observed significantly higher RR and ProIC for P-US screening compared with P-MAM screening. Higher recall rates would be an important barrier to promoting US screening. More studies are needed to investigate the factors associated with the higher recall rates of US screening in order to reduce unnecessary recalls. Second, as shown in Supplementary S7, subgroup analyses did not find obvious differences in sensitivity, specificity, or cancer detection rate for S-US screening after negative MAM screening between women with and without dense breasts. These results suggest that the effect of dense breasts on the performance of S-US after negative MAM may depend on other factors. Moreover, as shown in Supplementary S8, subgroup analyses also did not find significantly higher sensitivity for P-MAM compared with P-US among women with dense breasts. Small sample size could be an important factor, since only 3/11 exclusively recruited women with dense breasts (a proportion of 100% dense

Limitations
First, due to the lack of evidence for reduced mortality from breast cancer, we cannot conclude that US screening would lead to a long-term benefit. More studies with sophisticated designs and long-term follow-up are needed to investigate the long-term benefits and potential risks (including false positivity, "unnecessary" recalls, and overdiagnosis) of P-US screening. Second, apart from breast density, no studies investigated whether other risk factors (such as obesity) influenced the differences in screening performance between US and MAM. Therefore, we cannot conclude whether these differences in performance derived from confounding effects or from actual differences between US and MAM. Third, as shown in Table 3 and Table 4, not all included studies reported all screening performance indexes (such as biopsy rate, proportion of invasive cancers, and proportion of node-negative invasive cancers). Therefore, meta-analysis results based only on the studies that reported a given index could be biased, and complete reporting of screening performance in US and MAM screening studies is needed to improve the current results. Fourth, combining data from repeated (longitudinal) US screening of a woman with data from an initial screening could also lead to biased results. Fifth, meta-analyses under the criterion of P < 0.05 could potentially overestimate the performance of US even though the random-effect model was used. More real-world studies with large sample sizes are needed in the future.

Conclusions
Current evidence suggests that S-US screening could detect occult breast cancers missed by MAM. P-US screening has been shown to be comparable to P-MAM screening in women with dense breasts in terms of sensitivity, specificity, cancer detection rate, and biopsy rate, but with higher recall rates and higher detection rates for invasive cancers. More studies are needed to investigate the long-term benefits and potential risks (including false positivity, "unnecessary" recalls, and overdiagnosis) of P-US screening. Moreover, we believe that US screening for breast cancer deserves more attention in the future, not only because US is comparable to MAM in women with dense breasts in terms of sensitivity, specificity, cancer detection rate, and biopsy rate, but also because ultrasound involves no radiation and is easier to access in low-resource areas, such as rural areas of China.