Mammographic density assessed on paired raw and processed digital images and on paired screen-film and digital images across three mammography systems

Background Inter-women and intra-women comparisons of mammographic density (MD) are needed in research, clinical and screening applications; however, MD measurements are influenced by mammography modality (screen film/digital) and digital image format (raw/processed). We aimed to examine differences in MD assessed on these image types. Methods We obtained 1294 pairs of images saved in both raw and processed formats from Hologic and General Electric (GE) direct digital systems and a Fuji computed radiography (CR) system, and 128 screen-film and processed CR-digital pairs from consecutive screening rounds. Four readers performed Cumulus-based MD measurements (n = 3441), with each image pair read by the same reader. Multi-level models of square-root percent MD were fitted, with a random intercept for woman, to estimate processed–raw MD differences. Results Breast area did not differ in processed images compared with that in raw images, but the percent MD was higher, due to a larger dense area (median 28.5 and 25.4 cm2 respectively, mean √dense area difference 0.44 cm (95% CI: 0.36, 0.52)). This difference in √dense area was significant for direct digital systems (Hologic 0.50 cm (95% CI: 0.39, 0.61), GE 0.56 cm (95% CI: 0.42, 0.69)) but not for Fuji CR (0.06 cm (95% CI: −0.10, 0.23)). Additionally, within each system, reader-specific differences varied in magnitude and direction (p < 0.001). Conversion equations revealed differences converged to zero with increasing dense area. MD differences between screen-film and processed digital on the subsequent screening round were consistent with expected time-related MD declines. Conclusions MD was slightly higher when measured on processed than on raw direct digital mammograms. Comparisons of MD on these image formats should ideally control for this non-constant and reader-specific difference. 
Electronic supplementary material The online version of this article (doi:10.1186/s13058-016-0787-0) contains supplementary material, which is available to authorized users.


Background
Mammographic density (MD), a measure of the radiodense tissue in the breast, is a strong marker of breast cancer (BC) risk [1]. MD is increasingly being incorporated into BC research and clinical practice, for example in BC risk prediction models [2], as a marker for the effectiveness of therapeutic drugs mediated through MD [3], and in risk-based stratification for tailored BC screening regimens [4]. To enable these applications, estimates of differences in MD between women and within women over time are needed. However, obtaining directly comparable MD measurements is challenged by the fact that no single MD measurement tool is used universally; there are more than 10 quantitative methods currently in use [5-8]. Further, for the widely used threshold method, MD measurements are affected by well-documented reader variability [9,10]. Less studied is the influence of the type of mammogram used for MD measurements. Images originate from a variety of imaging modalities and mammography systems; that is, from older screen-film mammography (SFM) or more recently from digital mammography.
Image quality differs between SFM and digital mammography (for example, in terms of object visibility and spatial resolution [11]), and thus a reader's assessment of threshold-based MD may also differ between these modalities. Further, digital images are acquired in a raw ('for processing') format, in which the greyscale is proportional to X-ray attenuation. The processed ('for presentation') image is a manipulation of the raw image to aid tumour detection, based on manufacturer-specific algorithms which are generally unspecified and thus irreversible. Because processing may suppress or enhance image features such as dense tissue, MD measurements may systematically differ between the original raw and the processed images. The raw image is often deleted, so that only a processed format is available for MD measurements. Further, differences in MD between raw and processed images may vary by the type of digital mammography; that is, computed radiography (CR, a digital extension of screen film) or direct digital.
Two previous studies of MD in raw-processed pairs showed different results. From a General Electric (GE) Senographe 2000D model, percent MD (PMD) was higher in processed than in raw images [12]; whereas on images captured on a GE Senographe DS model [10], PMD was lower in processed than in raw images for one reader, but not different for another reader. We are not aware of raw-processed MD comparisons for other mammography systems.
In the present study, we extended the examination of MD across three widely used digital mammography systems (GE and Hologic, both direct digital, and Fuji, a CR system) by comparing threshold-based MD measurements for the same mammogram saved in both raw and processed formats and estimating MD conversion equations between these formats. In a similar fashion, we examined differences in MD between digitized SFM and processed CR-digital images taken from the same woman during consecutive screening rounds.

Source of images
For raw-processed MD comparisons, we included women who had both raw and processed image pairs available; that is, the same mammogram from a single screening session was saved in both formats. To examine different digital mammography system manufacturers (hereafter 'systems') we acquired six sets from three systems (  [12] and the East London Breast Screening Programme, UK (set G2) [7]. These five sets reflect populations with nearly 3-fold differences in BC incidence rates [15]. In contrast, set F1 is a pooled resource of anonymized Fuji CR images taken for 100 women in 2008, on which both right craniocaudal (CC) and left CC images were saved in both formats (400 images). Apart from age, which was available for 47 women, no information was known about these women. Thus, whilst all other sets were from BC-free women, we cannot guarantee this status for set F1. All mammograms were taken between 2007 and 2013. Two sets, H1 and G2, also contributed to the International Consortium on Mammographic Density (ICMD) [16]. For the comparison of MD assessed on SFM and digital mammography (Table 1, set F2, BreastScreen Victoria, Australia), we obtained pairs of view-matched and laterality-matched films for the same 139 women who were screened on SFM at one screening round and on a digital CR Fuji system at the next, a median of 2.1 years later (range 1.2-2.5 years).
Ethics approvals were obtained from IARC (IEC 12-34 for the ICMD) and from contributing studies.

MD measurements
To improve readability of raw images, greyscale levels were transformed using a log-inversion implemented in Niftyview [17]. This process creates a 'positive' image out of the raw 'negative' and restores the approximately linear relationship between image intensity and tissue density exhibited by SFM. MD was measured in Cumulus version 3 or 6, in which the reader selects the threshold to dichotomize dense and non-dense pixels. These versions give equivalent MD measurements, but differ in ease of use for the reader. Measures obtained are areas (cm²) of the breast, the dense area (DA) and the non-dense area, and PMD, calculated as:
$$\mathrm{PMD} = 100 \times \frac{\mathrm{DA}}{\text{breast area}}$$
Image sets were read by four experienced readers (VAM, Id-S-S, NFB and JH) in combinations dependent on permissions for inter-institutional image transfers. Sets H1, H2 and G2 were distributed randomly into 12 batches of 100 images (six raw and six processed batches) and allocated randomly to three readers. Each pair was read by the same reader. Each batch included three within-batch repeats and five images from each batch were repeated in the other two readers' batches. The Fuji images (F1) and the SFM-digital image set (F2) were mainly read by a single reader. Sets H3 and G1 were not transferred between institutions, but had been measured by one reader, as published previously [12].
Twelve image pairs were excluded because one or both images were indicated for exclusion upon MD measurement (e.g. due to low image quality, breast implants).

Statistical methods
The primary outcome is PMD (%), and secondary outcomes are DA and breast area. For each of these, we used a square-root transformation (e.g. √PMD) to normalize distributions [18]. The interpretation of these measures can be aided by considering each area as a square, thus √DA and √breast area are the width in centimetres of the square. Similarly, √PMD can be thought of as the width of the dense square for a 10 cm × 10 cm breast area.
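The square-root scale described above can be illustrated with a minimal Python sketch. The function names and the example areas (25 cm² dense area in a 100 cm² breast) are illustrative assumptions, not values or code from the study.

```python
import math


def percent_density(dense_area_cm2, breast_area_cm2):
    """PMD (%) as dense area divided by total breast area."""
    if breast_area_cm2 <= 0:
        raise ValueError("breast area must be positive")
    return 100.0 * dense_area_cm2 / breast_area_cm2


def sqrt_measures(dense_area_cm2, breast_area_cm2):
    """Square-root transformed outcomes, as used for the analyses.

    sqrt(DA) and sqrt(breast area) are the side lengths (cm) of squares
    with the same areas; sqrt(PMD) is the side of the dense square for a
    notional 10 cm x 10 cm breast.
    """
    pmd = percent_density(dense_area_cm2, breast_area_cm2)
    return {
        "sqrt_da_cm": math.sqrt(dense_area_cm2),
        "sqrt_breast_cm": math.sqrt(breast_area_cm2),
        "sqrt_pmd": math.sqrt(pmd),
    }


# Illustrative values only: a 25 cm^2 dense area in a 100 cm^2 breast.
example = sqrt_measures(25.0, 100.0)
print(example)
```

In this illustration the breast is exactly the 10 cm × 10 cm reference square, so √PMD and √DA coincide.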
For each image format, within-reader reliability of √MD was assessed using the intraclass correlation coefficient:
$$\mathrm{ICC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}$$
Between-women variance ($\sigma_b^2$) and within-reader variance ($\sigma_w^2$) were estimated in ANOVA models fitted on sets H1, H2, G2, F1 and F2 and all of the ICMD measurements combined. Sets H3 and G1 did not have within-reader repeats.
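As a hedged illustration of how an intraclass correlation coefficient can be obtained from between-women and within-reader variance components, the sketch below uses the standard one-way random-effects ANOVA estimators. It assumes a balanced design (the same number of repeat readings per image), which is a simplification; the exact ANOVA specification used in the study is not given in the text, and the readings shown are made up.

```python
from statistics import mean


def icc_oneway(readings):
    """One-way random-effects ICC from repeated readings.

    `readings` is a list of lists: one inner list of repeat readings
    (e.g. sqrt PMD) per image, all of equal length k >= 2.
    """
    n = len(readings)
    k = len(readings[0])
    if n < 2 or k < 2 or any(len(r) != k for r in readings):
        raise ValueError("need >= 2 images with the same number (>= 2) of repeats")
    image_means = [mean(r) for r in readings]
    grand_mean = mean(image_means)
    # Between-image and within-image mean squares.
    msb = k * sum((m - grand_mean) ** 2 for m in image_means) / (n - 1)
    msw = sum((x - m) ** 2 for r, m in zip(readings, image_means) for x in r) / (n * (k - 1))
    var_within = msw
    var_between = max((msb - msw) / k, 0.0)  # truncate negative estimates at zero
    return var_between / (var_between + var_within)


# Hypothetical repeat readings of sqrt PMD for three images.
repeats = [[5.0, 5.2], [3.0, 3.1], [7.1, 6.9]]
print(round(icc_oneway(repeats), 3))
```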
To estimate within-pair raw-processed differences in MD, we fitted multi-level normal-error regression models of √MD, where the fixed effect of image format was level 1 and a random intercept for woman was level 2. The assumption of a constant difference in √MD across the MD range was examined using Bland-Altman plots. Subgroup analyses were conducted by reader, system, model and processing software version, and by PMD and breast area categories, and possible effect modification was tested using likelihood ratio tests. These potential effect modifiers are features of the image or of the imaging process; woman-level characteristics such as body mass index (BMI) or age were not investigated, because potential effect modification would be mediated through image characteristics.
A similar approach was used to compare SFM and digital processed images for set F2.
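The paired comparison can be sketched in simplified form. This is not the multi-level model itself: it relies on the fact that, when every woman contributes exactly one complete pair, the fixed effect of image format in a random-intercept model should coincide with the mean within-pair difference. The normal-approximation confidence interval, the function names and the data are all illustrative assumptions. The second function returns the points of a Bland-Altman plot (pair mean against pair difference).

```python
import math
from statistics import mean, stdev


def paired_difference(processed, raw):
    """Mean processed-minus-raw difference with approximate 95% CI
    and Bland-Altman 95% limits of agreement."""
    if len(processed) != len(raw) or len(raw) < 2:
        raise ValueError("need at least two complete pairs")
    diffs = [p - r for p, r in zip(processed, raw)]
    d_bar = mean(diffs)
    sd = stdev(diffs)
    se = sd / math.sqrt(len(diffs))
    return {
        "mean_diff": d_bar,
        "ci95": (d_bar - 1.96 * se, d_bar + 1.96 * se),
        "limits_of_agreement": (d_bar - 1.96 * sd, d_bar + 1.96 * sd),
    }


def bland_altman_points(processed, raw):
    """(pair mean, pair difference) coordinates for a Bland-Altman plot."""
    return [((p + r) / 2.0, p - r) for p, r in zip(processed, raw)]


# Hypothetical sqrt DA readings (cm) for three image pairs.
sqrt_da_processed = [5.0, 4.0, 6.0]
sqrt_da_raw = [4.5, 3.6, 5.4]
summary = paired_difference(sqrt_da_processed, sqrt_da_raw)
points = bland_altman_points(sqrt_da_processed, sqrt_da_raw)
print(summary["mean_diff"], summary["ci95"])
```

A trend in the plotted differences against the pair means would indicate that the processed-raw difference is not constant across the MD range.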
Calibration equations for conversion between MD measured on raw and processed images, and vice versa, were based on √DA because all √PMD differences were driven through √DA whilst the change in √breast area was negligible (<1 mm). Standard regression models were not used as they assume error only in the dependent variable, which results in a fitted model that is not reversible (i.e. predicting raw from processed would give a different outcome to predicting processed from raw). Because there is measurement error in MD assessment on both raw and processed images, we applied a reversible conversion method. The principle of this calibration method was to maintain, for each reader and system combination, equality of the standard normal z scores of √DA whether they were assessed on a processed image ($z_p$) or a raw image ($z_r$):
$$z_p = z_r, \quad \text{that is,} \quad \frac{\sqrt{\mathrm{DA}_p} - \bar{x}_p}{s_p} = \frac{\sqrt{\mathrm{DA}_r} - \bar{x}_r}{s_r}$$
where $\bar{x}$ and $s$ are the mean and standard deviation of √DA for the image type respectively. This method yields the following conversion equation:
$$\sqrt{\mathrm{DA}_p} = \bar{x}_p + \frac{s_p}{s_r}\left(\sqrt{\mathrm{DA}_r} - \bar{x}_r\right)$$
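The z-score matching principle can be sketched as follows. This is a minimal illustration under the assumption that the calibration constants are simply the sample mean and standard deviation of √DA in each image type for one reader and system combination; the function names and data are hypothetical. It also shows the reversibility that motivated the method: converting raw to processed and back returns the starting value.

```python
from statistics import mean, stdev


def fit_calibration(sqrt_da_raw, sqrt_da_processed):
    """Summary statistics needed for z-score matching (one reader-system combination)."""
    return {
        "mean_r": mean(sqrt_da_raw),
        "sd_r": stdev(sqrt_da_raw),
        "mean_p": mean(sqrt_da_processed),
        "sd_p": stdev(sqrt_da_processed),
    }


def raw_to_processed(x_raw, cal):
    """Processed-equivalent sqrt DA with the same z score as the raw value."""
    return cal["mean_p"] + (cal["sd_p"] / cal["sd_r"]) * (x_raw - cal["mean_r"])


def processed_to_raw(x_processed, cal):
    """Exact inverse of raw_to_processed."""
    return cal["mean_r"] + (cal["sd_r"] / cal["sd_p"]) * (x_processed - cal["mean_p"])


# Hypothetical paired sqrt DA readings (cm).
raw_vals = [3.2, 4.1, 4.8, 5.5, 6.3]
proc_vals = [4.0, 4.6, 5.2, 5.6, 6.4]
cal = fit_calibration(raw_vals, proc_vals)
converted = raw_to_processed(4.0, cal)
print(converted, processed_to_raw(converted, cal))
```

By contrast, two ordinary least-squares fits (processed on raw, and raw on processed) would generally not be inverses of each other.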

Results
In total, 1294 raw-processed digital image pairs ( (Table 2) and reader-specific median measures are given in (Additional file 1: Table S1). Visual examination of sample raw-processed image pairs shows different degrees of accentuation of breast features and of the skin edge (Fig. 1).
Within-reader reliability of PMD was slightly higher in SFM (ICC 0.94, 95% confidence interval (CI): 0.93, 0.95) than in raw digital (ICC 0.91, 95% CI: 0.89, 0.93) or processed digital (ICC 0.89, 95% CI: 0.88, 0.91) images. This difference generally held across readers (Table 3) and was driven by higher within-reader repeatability on SFM than on digital images. In contrast, whilst readers 1 and 3 had higher ICCs for PMD and DA assessed on raw images than on processed images, this was reversed for reader 2. Reader 1 ICCs for PMD and DA did not differ between image formats for the Fuji CR or Hologic systems, whereas for GE images the ICCs were lower on processed than on raw images. Throughout, ICCs for PMD predominantly reflected those for DA because breast area ICCs were near 100% for all image formats, readers and systems (Table 3). Based on the subset of images that were read by all readers, mean raw-processed MD measures and    Table S3.
For processed-raw digital image pairs, the median PMD was higher when measured on processed images than on raw images, by 1.7-3.3 absolute percentage points depending on the system (Table 2). Similarly, the median DA was larger by 1.6-4.6 cm², whereas the median breast area was similar. Regression results were similar: √PMD was 0.34 cm (95% CI: 0.28, 0.40) larger in processed images than in raw images, whilst √DA was 0.44 cm (95% CI: 0.36, 0.52) larger and √breast area did not differ (0.01 cm; 95% CI: −0.01, 0.02) (Table 4). These differences in PMD were approximately one-fifth of the between-women SD (Table 3). For a given reader, PMD and DA differences varied in magnitude between systems (heterogeneity p < 0.01 for readers 1-3, p = 0.21 for reader 4), and for a given system the differences varied in both magnitude and direction between readers (p < 0.001 for each system).
Specifically, for readers 1, 3 and 4, √PMD was larger in processed than in raw images by 0.4-0.9 cm (reader 1), 0.1-0.7 cm (reader 3) and 0.4-0.6 cm (reader 4), depending on the system. In contrast, √PMD in processed compared with raw images for reader 2 was either not different (GE) or was smaller (Fuji CR system and Hologic). Mean √DA from processed images was 0.9 (95% CI: 0.7, 1.1) higher for reader 2 and 0.9 (95% CI: 0.7, 1.1) higher for reader 3 compared with reader 1. Between-reader differences were larger for raw images; mean √DA was 2.3 (95% CI: 1.9, 2.8) higher for reader 2 and 1.9 (95% CI: 1.4, 2.3) higher for reader 3 compared with reader 1. For SFM, between-reader differences were slightly smaller; mean √DA was 1.3 (95% CI: 1.1, 1.4) higher for reader 2 and 0.7 (95% CI: 0.5, 0.8) higher for reader 3 compared with reader 1. Breast area differences also varied between system-reader combinations, but average differences were extremely small in magnitude (<1.2 mm √breast area). Differences by model or processing software within a system were not significant (data not shown). Effect modification of DA and PMD differences by categories of PMD or of breast area (categories defined by the raw image) was significant (p < 0.001 for both). The differences tended to decrease with increasing PMD, but they increased with increasing breast area (Additional file 4: Table S4).
Table 4 Mean differences in MD measures between processed images and the corresponding raw digital image, by reader and mammography system. p for heterogeneity <0.001 between readers for each of the Hologic, GE and Fuji systems, for both percent density and dense area. For breast area, p for heterogeneity <0.001 also between readers on the Hologic system, and no difference between readers for breast area was found for GE (p = 0.07) and Fuji (p = 0.08). a Differences are processed-raw images. b p value for heterogeneity between systems, for a given reader. CI confidence interval, MD mammographic density, GE General Electric, PMD percent mammographic density
Most scatter plots (Fig. 2) showed that differences in DA on processed images compared with raw images are larger at lower DAs, and converge towards no difference in breasts with a √DA of ≥5 cm. Bland-Altman plots also revealed that processed-raw differences in √PMD and √DA (Additional file 5: Figure S1) were not constant across the underlying MD range. However, differences were constant on the standardized scale (shown for DA in Additional file 6: Figure S2), and thus calibration equations were based on standardized values of DA in the two image types. Figure 2 (Additional file 7: Information 1) presents these reader-specific and system-specific calibration equations for DA. Differences were very small for the Fuji CR and were larger and of a similar magnitude between the direct digital systems. For all readers combined, conversion equations from raw DA to their processed equivalent are as follows:
Hologic: processed √DA = 5.252 + 0.719 (raw √DA − 4.751)
GE: processed √DA = 5.081 + 0.872 (raw √DA − 4.523)
Fuji: processed √DA = 5.694 + 1.107 (raw √DA − 5.633)
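After correcting DA, the corrected non-dense area and PMD would then be calculated using the original breast area and preserving the original definitions:
$$\text{corrected non-dense area} = \text{breast area} - \text{corrected DA}$$
$$\text{corrected PMD} = 100 \times \frac{\text{corrected DA}}{\text{breast area}}$$
Equations to generate √DA, as if measured on a raw image, from DA measured on a processed image are provided in (Additional file 7: Information 1).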
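To make the use of the pooled (all readers combined) equations concrete, a short sketch follows. The coefficients are those quoted in the text; the function names, the example areas and the truncation of negative converted values at zero are illustrative assumptions. The processed-to-raw function is obtained here by algebraically inverting the quoted equations, which the reversible method permits; the equations published in Additional file 7 may be presented differently.

```python
# Pooled coefficients quoted in the text, in the form
# processed sqrt(DA) = a + b * (raw sqrt(DA) - c)
POOLED = {
    "Hologic": (5.252, 0.719, 4.751),
    "GE": (5.081, 0.872, 4.523),
    "Fuji": (5.694, 1.107, 5.633),
}


def raw_to_processed_sqrt_da(sqrt_da_raw, system):
    a, b, c = POOLED[system]
    return a + b * (sqrt_da_raw - c)


def processed_to_raw_sqrt_da(sqrt_da_processed, system):
    # Algebraic inverse of the equation above (assumption: derived here,
    # not copied from the supplementary material).
    a, b, c = POOLED[system]
    return c + (sqrt_da_processed - a) / b


def corrected_measures(sqrt_da_corrected, breast_area_cm2):
    """Corrected DA, non-dense area and PMD using the original breast area."""
    # Truncating at zero is an assumption to avoid squaring a negative value.
    da = max(sqrt_da_corrected, 0.0) ** 2
    return {
        "da_cm2": da,
        "non_dense_cm2": breast_area_cm2 - da,
        "pmd": 100.0 * da / breast_area_cm2,
    }


# Illustrative case: raw DA of 25 cm^2 (sqrt DA = 5 cm) in a 100 cm^2 breast, Hologic.
sqrt_p = raw_to_processed_sqrt_da(5.0, "Hologic")
print(sqrt_p, corrected_measures(sqrt_p, 100.0))
```

These pooled equations average over readers; given the reader-specific differences reported above, reader-specific and system-specific equations would be preferable where available.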
For the processed-SFM set (F2), comparing MD measured on the processed digital image with that on the earlier SFM, √breast area was 0.17 cm larger (95% CI: 0.06, 0.28) and √DA was 0.17 cm smaller (95% CI: 0.01, 0.33).
Fig. 2 Scatter plot of paired √DA readings measured on processed (y axis) vs raw (x axis) digital images, by reader and system. Dashed lines, equality (if DA from processed images was read identically to raw images); blue dots, modelled linear conversion. Reader-specific and system-specific calibration equations for the conversion of raw √DA to processed √DA are supplied in (Additional file 7: Information 2). √DA square root of dense area, GE General Electric

Findings
In the present study, we compared Cumulus-assessed MD measures (PMD, breast area and DA) on the same digital mammograms saved in processed and raw formats. Overall, we observed higher MD in the former image type, a difference that was not entirely consistent either in magnitude or direction across four readers for a given mammography system. Differences in MD assessed on raw and processed images were small for the CR system, but larger for direct digital systems. Differences between SFM and CR-digital images appeared to be small, although the latter were not time-matched comparisons. Readers had higher MD repeatability for SFM images than for raw or processed digital images. This may be because readers had more experience of reading from SFM images, or because density is more easily visualized in SFM images.

Comparison and plausibility
Readers noted several appearance qualities of processed images that may affect the MD assessment, such as 'thickened breast edge' or 'faded parenchyma'. Processing algorithms involve multiple steps designed to clarify the image, enhance suspected lesions and reduce noise. Because this noise may be dense tissue, it has been hypothesized that density would be lower in processed images. However, this and similar studies generally found higher MD on processed images, particularly at lower density levels. Enhancement of light/dark transitions and accentuation of the breast edge may contribute to this increase. Differences in PMD were almost entirely driven by changes in the DA because breast area altered minimally. Our results are also consistent with those of Keller et al. [10] and Martin et al. [19], who reported that differences were highly reader dependent. Unsurprisingly, Vachon et al.'s results [12], which comprised 14% of our raw-processed pairs, also showed that PMD was overestimated in less dense breasts in processed compared with raw GE images. Studies that compared MD using the BIRADS classification did not find differences by image type [20], but differences may be too small to be detected using a broad categorical classification.
Differences in MD assessment between SFM and Fuji CR were not assessed optimally, because they were based on films taken 2 years apart. While there was no breast area difference in the time-matched images, over this time interval the breast area increased, indicating measurable age-related changes. The magnitude of this increase (0.17 cm √breast area) was consistent with the expected within-woman changes (0.16 cm over 2 years) found in a previous SFM-only longitudinal study [21]. Similarly, the decline in DA was only slightly larger than would be expected from age-related changes (−0.13 cm √DA), suggesting that any differences due to image formats were small (at most 0.04 cm). However, similar studies comparing PMD in SFM and digital mammography reported that PMD was higher in SFM images than in raw or processed digital images [22], including one in which the digital and the SFM were taken on the same day [19]. In both studies the differences were larger than for the present study, possibly because they were comparing SFM with direct digital and not with CR as in the present study. Breast area was also higher in digital images taken on the same day as SFM images, indicating that lower PMD assessment may be a product of both underestimation of DA and overestimation of breast area in digital images compared with SFM images. Harvey [22] hypothesized that more subcutaneous fat is included in digital measurements because the breast edge can be seen and delimited more precisely, but only PMD was reported in that study. In the present study, small differences between SFM and CR may reflect these closely related imaging technologies; CR systems are additions to SFM systems, using phosphor plates and a separate reader to create digital images, whereas the direct digital image is created at the point of image capture [23]. Thus, CR images have lower spatial resolution and more image noise than direct digital images [24]. 
The improved image quality in direct digital allows for more complex multi-functional processing algorithms, which may account for the larger raw-processed differences in direct digital images compared with CR images.

Strengths and limitations
This is the first study to compare raw and processed images, using the same design and analytic approach, captured on several widely used mammography systems. Comparisons of MD across multiple systems are important because it is unlikely that all women in a study, or the same woman followed for several years, will be screened on the same mammography machine. Nevertheless, several design features would have improved the study, namely including CC views alongside MLO for all images, and including other widely used mammography systems such as Siemens, and other CR systems. We were limited by the lack of information on manipulations performed by processing algorithms, which are proprietary to manufacturers. Multiple readers are a further strength, being reflective of clinical and research settings; between-reader differences in raw-processed calibration highlight the need to recognize and quantify these differences where possible. Further, we used a reversible statistical method for processed-raw MD conversions; that is, neither raw nor processed MD is considered the error-free independent variable, which would not have been the case had a simple regression method been used. Finally, the women included in this study came from countries with a wide range of BC incidence rates, and thus the results should be generalizable to women across the BC risk spectrum.

Relevance and implications
The potential impact of raw-processed differences in MD from direct-digital systems (3.3 percentage points) will depend on the application. When investigating MD as a predictor of BC risk, differences are unlikely to introduce substantial misclassification between very low density (<10%) and very high density (e.g. >50%) and would thus have a small impact on relative risk estimates. For investigations of determinants of MD or changes in MD, raw-processed differences are of a magnitude similar to 10 years of aging or the menopause-related PMD change (as assessed within ICMD) and depend greatly on the reader. Thus, in the screening or clinical setting when assessing MD change over time for the same woman, it is important that the same reader reads the woman's repeat mammograms. If the calibration equations presented in this article are to be used in the screening or clinical settings, they will need to be validated, particularly for different readers. In studies comparing PMD across raw and processed image types, correcting for these differences is thus important and would ideally be made using reader-specific and system-specific calibrations. Even if all images are of the same type (raw or processed) it is necessary to calibrate between readers. Comparability of raw images between systems has not been assessed and differences in acquisition between systems may be present. The repeated finding across studies of large between-reader differences in MD, in addition to the time-intensive nature of reader-based measurements, again emphasizes the need for fully automated methods of MD measurement. Four such fully automated quantitative methods were recently evaluated for BC risk prediction, alongside Cumulus [7]. Such methods eliminate between-reader variations in readings; however, many only work on a single image type (often raw digital images [25]), whereas others can be applied across multiple types [8,26]. 
It is possible that there would be between-system differences in automated measures, particularly volumetric measures due to differences in breast positioning and therefore breast thickness [27], but not all studies have found this [28]. In the future, as further processing algorithms are developed, MD differences between raw and processed images are likely not only to persist but also to change. However, as digital storage becomes cheaper and faster, such problems may be overcome if raw images are systematically stored and MD is consistently measured on them. In a similar fashion, a consistent and fully-automated MD measurement tool could be applied to the raw image bank to provide MD data in an efficient and systematic manner.