Explainability for deep learning in mammography image quality assessment

The application of deep learning has recently been proposed for the assessment of image quality in mammography. It was demonstrated in a proof-of-principle study that the proposed approach can be more efficient than currently applied automated conventional methods. However, in contrast to conventional methods, the deep learning approach has a black-box nature and, before it can be recommended for routine use, it must be understood more thoroughly. For this purpose, we propose and apply a new explainability method: the oriented, modified integrated gradients (OMIG) method. The design of this method is inspired by the integrated gradients method but adapted considerably to the use case at hand. To further enhance this method, an upsampling technique is developed that produces high-resolution explainability maps for the downsampled data used by the deep learning approach. Comparison with established explainability methods demonstrates that the proposed approach yields substantially more expressive and informative results for our specific use case. Application of the proposed explainability approach generally confirms the validity of the considered deep learning-based mammography image quality assessment (IQA) method. Specifically, it is demonstrated that the predicted image quality is based on a meaningful mapping that makes successful use of certain geometric structures of the images. In addition, the novel explainability method helps us to identify the parts of the employed phantom that have the largest impact on the predicted image quality, and to shed some light on cases in which the trained neural networks fail to work as expected. While tailored to assess a specific approach from deep learning for mammography IQA, the proposed explainability method could also become relevant in other, similar deep learning applications based on high-dimensional images.


Introduction
The evaluation of image quality, otherwise known as image quality assessment (IQA) [1], plays an important role in various image-processing applications. Especially in safety-relevant use cases such as medical applications, a reliable and robust assessment of image quality is of particular relevance. In mammography, the main goal of IQA is to ensure that the image shows the breast anatomy and lesions compatible with breast cancer. This main goal must be achieved with the lowest possible radiation dose. A variety of phantoms for IQA in mammography have been proposed [2]. The method we take as a reference here was recommended by the European Guidelines [3,4] and is based on the automated readout of the contrast-detail curves (CDCs) from a set of recorded images of a CDMAM phantom [3]. A method recently proposed for the CDC determination involved using a deep neural network [5,6]. This method, which requires only a single image and allows the calculation of uncertainties, is considered in this work. The approach is one of an increasing number of deep learning-based methods in the field of medical IQA [7][8][9][10][11]. Unfortunately, neural networks suffer from a lack of explainability due to their black-box nature [12], which is of particular relevance in safety-critical areas such as medicine. The goal of this paper is to provide explainability for the deep learning approach from [5,6] for IQA.
While a variety of methods exists for the explainability of neural networks [13,14], we found none of them to yield satisfactory results for the deep learning-based IQA in mammography. We therefore propose a new method specifically tailored to the use case, which we call the oriented, modified integrated gradients (OMIG) method. The resulting explainability map supports users of a deep learning approach in mammography IQA in various ways:
• It makes it possible to understand which parts of the CDMAM phantom have a significant impact on the prediction.
• This, in turn, allows the trustworthiness of the employed neural network to be assessed.
• Ascertaining which parts of the phantom are the most informative for a deep learning-based IQA in mammography might inspire the design of new phantoms for this application.
Although the proposed explainability method is tailored to the particular task of IQA in mammography, we believe that its ideas could also be beneficial in other applications of deep learning. However, exploring the potential of the proposed approach for other cases is beyond the scope of this article. Nevertheless, we will still provide a short discussion on this topic.
The paper is structured as follows: section 2 briefly recapitulates a few aspects of IQA that are important for mammography, introduces the employed approach based on deep learning, and finally presents our explainability approach. This includes a technique for determining the important input features from the downsampled data, which will be referred to as 'upsampling' within this work. Results are then discussed in section 3. Finally, we discuss limitations and give an outlook and conclusions.

A recap of IQA for mammography
As indicated in the introduction, we here follow the European Guidelines that recommend the use of the CDMAM phantom for IQA in mammography. This phantom, which is shown in figure 8 in the appendix, is a polymethyl methacrylate-coated aluminum plate with gold disks of various diameters and thicknesses placed on it. The disks, which represent breast lesions such as microcalcifications [15], are inserted in a matrix (grid) of square cells. The CDMAM phantom includes disks with 16 diameters whose detectability depends strongly on the mammography system together with the radiographic factors. A CDC is determined from images of the CDMAM phantom and describes the minimum threshold gold thickness, for various diameters, necessary for the corresponding disks to be detected.
We will here use the method of automated readout, as first introduced in [16] and recommended in [3], which provides the results for the 12 disks with diameters between 1 and 0.08 mm [17], since the smallest diameters (smaller than 0.1 mm) are harder to detect. The 12 values of the CDC are then rescaled via an experimentally found relation [3,18] in order to be comparable to human readout. From the point of view of IQA, each disk diameter is relevant, although the values of the limiting curve are defined only for five diameters. Four of these five diameters are contained in the CDC range obtained by the automated readout.
Given images of the CDMAM phantom, the CDC can be computed by the CDMAM Analyser software [19,20], which requires at least 16 recordings of the phantom. Following the approach in [5,6], we will here train a deep neural network (to be more specific, an ensemble of networks, cf section 2.2 below) on images labeled with the ground truth CDCs to predict CDCs using a single image. The CDCs determined by the CDMAM Analyser software serve as the ground truth. Image quality is considered better if the CDC has smaller values, because a smaller CDC characterizes a smaller minimum threshold gold thickness. We will make extensive use of this fact below.

Details on training and used data
We trained an ensemble of M = 10 neural networks (a so-called deep ensemble [21]) that had the same architecture as in [5], using 52 800 simulated images and 4800 real images (constructed from 48 recordings via augmentation [5,6]). The neural networks had 12 output neurons that correspond to the 12 points of the CDC. For training we used the negative log-likelihood loss [21] assuming homoscedastic noise, whose variance is learned during training. We used the Adam optimizer [22] with a learning rate of 10^-5. The training lasted 300 epochs with a batch size of 50 images. For regularization purposes, we used an L2 regularization of 10^-4 and a dropout layer with a dropout rate of 0.2. Details of the network architecture can be found in figure 10 and table 2 in appendix C, together with an illustration of the training procedure, which uses incremental learning, in figure 11.

Figure 1. Example of the downsampled images from the test data set: a recorded image (a) and two simulated images from scenario #1 (b) and the OOD set (c), cf section 2.2 for details. Note that for the two simulated images (b) and (c) we chose images without a margin for the purpose of presentation. For simulated examples with a margin, as in the recorded image (a), see figure 9 in the appendix.

Table 1. The simulation parameters used by the software for the virtual mammography [29]. The software considers the tube voltage, the current-exposure time product and the noise level, which is defined as the standard deviation of the Gaussian white noise that is fed into a Gaussian filter simulating the noise in mammography images. (The table lists these parameters per scenario; its values are not reproduced here.)

Using an ensemble instead of a single network allows the predictions to be improved and makes them more robust [23][24][25][26]. Furthermore, uncertainties can be evaluated in this way, which is of importance for making deep learning usable in a medical context [27,28]. We follow [5,6] in downsampling the images in the training data to a resolution of 250 × 250, which was observed in those works to reduce the required computational resources and to improve the convergence of the neural network.
To evaluate the trained neural networks, we used data sets consisting of 24 recorded images and 5000 simulated images [5,6]. These test images were not part of the training data. While the recorded images originated from the same device with the same anode/filter combination, tube voltage and current-exposure time product, the simulated images split into groups whose simulation parameters are described in table 1. The training and test data sets were normalized by the mean and standard deviation obtained from the pixel intensities of the whole images of the training data set. The four groups, of 1200 test images each, were simulated using the simulation parameters from table 1 and six different margin sizes. We refer to these groups as 'scenario #1-#4'. Additionally, a single set of 200 out-of-distribution test images was simulated with a choice of simulation parameters that fundamentally differs from those used for training and for scenarios #1-#4. We will refer to this group as 'OOD'.
For further details on the data, see [5,6]. To evaluate the trained neural networks, the test images must be downsampled to a resolution of 250 × 250. Some downsampled test images are shown in figure 1, including a real (i.e. recorded) test image. Figure 1(b) shows a simulated image from scenario #1. Note that this particular example has no margin, in contrast to the recorded example in (a). The training and test sets contain simulated images with and without margins of various sizes, cf [5,6] and figure 9 in the appendix for an example including a margin. Figure 1(c) shows an image from the OOD data set (again an image without a margin). Figure 2 illustrates the performance of our trained neural networks on the test data set. For the specific images from figure 1, the predicted CDCs, together with their uncertainty multiplied by a factor of 1.96, are shown in figure 2(a). The plotted uncertainty is the total uncertainty, whose square is the sum of the squared epistemic and aleatoric uncertainties; see [21,30] for details. The factor 1.96 is chosen so that the corresponding interval estimate has a coverage probability of 95% under a Gaussian assumption. The corresponding ground truths for the three images, which were obtained by the CDMAM Analyser software, are depicted by the dashed lines. The black solid line connects the four values of the limiting curve described in section 2.1. For all data points in the test data set, the relationship between the prediction error and the predicted uncertainty is shown in figure 2(b). By far the highest errors were recorded for scenario #4, which is set apart from the other groups in the test set by its high noise level, cf table 1. In addition, the expanded total uncertainty (total uncertainty times 1.96) covered the ground truth in only 61% of the cases. The explainability method presented in this work allows us to investigate this point further and to study why the predictions are worse for scenario #4.
For all other groups in the test data set (i.e. the recorded images, scenarios #1-#3 and the OOD set), the error was substantially smaller and the observed coverage rates of the interval estimates were rather conservative, namely 100% (recorded), 97.1% (sc. #1), 99.1% (sc. #2), 96.9% (sc. #3) and 98.7% (OOD).
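The combination of the ensemble mean with the epistemic and aleatoric variances described above can be sketched as follows. This is an illustrative NumPy sketch, not the authors' code; the array layout (M ensemble members × K CDC points) and the function name are assumptions made for the example:

```python
import numpy as np

def ensemble_predict(member_means, member_vars):
    """Combine M ensemble members into a prediction with total uncertainty.

    member_means: (M, K) array of predicted CDC points, one row per member.
    member_vars:  (M, K) array of the learned (aleatoric) noise variances.
    """
    mu = member_means.mean(axis=0)            # ensemble mean, K CDC points
    epistemic_var = member_means.var(axis=0)  # spread across the members
    aleatoric_var = member_vars.mean(axis=0)  # averaged learned noise variance
    # total uncertainty: square is the sum of the squared components
    total_sigma = np.sqrt(epistemic_var + aleatoric_var)
    # 95% interval under a Gaussian assumption (hence the factor 1.96)
    lower = mu - 1.96 * total_sigma
    upper = mu + 1.96 * total_sigma
    return mu, total_sigma, lower, upper
```

The observed coverage rates quoted above are then simply the fraction of test points whose ground-truth CDC values fall inside these intervals.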

Explainability
Explainability for neural networks is often done by analyzing the effect of the input features, which in our case are the image pixels, on the output [31] (i.e. the 12 points of the CDC). Such an explanation is often presented by a heatmap that highlights which features of the input image affect the output of the network to the greatest extent. While there are several generally applicable methods [13,14], none of the established methods we explored yielded expressive results for the mammography example studied in this work. Moreover, the computations of some methods are too complex and time-consuming to be combined with the upsampling procedure, which we explain below and which is required in order to obtain informative, high-resolution results. We therefore propose a new method, OMIG, that has been tailored to the needs of the CDMAM evaluation.
Before presenting a detailed analysis, let us summarize several factors unique to deep learning-based IQA in mammography: (a) Recall that a CDC expresses a higher image quality if its threshold values are lower. This gives the problem at hand a natural direction that we can exploit. (b) We want to use a deep ensemble as this allows the uncertainty to be quantified, which is of great relevance in a medical case such as mammography. (c) While the network was trained with downsampled inputs to ensure scalability and convergence, what is ultimately required is a heatmap that corresponds to the original high-resolution image, as this will allow finer patterns to become visible.

OMIG explainability
The motivation behind the explainability method introduced in this work is the following question: How can we modify a given input image x to a new image x ′ in such a way that for x ′ the neural network is expected to predict a better image quality?
Here, the intention is that a quantitative answer to this question will give us information about the importance of the features in x for the prediction. The goal is to construct a modified hypothetical image x′ that is somehow close to the original image x. We then expect that the image x′ yields a CDC that lies lower than the CDC obtained from the original image x.
Let us introduce some notation. We denote by μ_m(x), for m = 1, …, M, the output of the M ensemble members (in our case M = 10) for the input image x. Recall that we predict K = 12 points of the CDC, so that each μ_m(x) is in fact K-dimensional. Let us denote the (again K-dimensional) prediction of the deep ensemble as μ̄(x) = (1/M) ∑_{m=1}^{M} μ_m(x) and the K components of μ̄(x) by μ̄_k(x). We now summarize the K predicted points of μ̄(x) into a scalar by taking the 'midpoint' of the CDC, that is, (1/K) ∑_{k=1}^{K} μ̄_k(x). We will use this midpoint as a scalar measure for the position of the CDC. More precisely, we will say that the CDC is lowered when this is true merely for the midpoint. It turns out that this somewhat simplified interpretation is completely sufficient for our purposes. This reduces the question above to the task of obtaining from x an x′ such that (1/K) ∑_{k=1}^{K} μ̄_k(x′) < (1/K) ∑_{k=1}^{K} μ̄_k(x). To construct x′ we proceed as follows: given x, compute μ̄(x) and the gradient of the midpoint,

R(x) = ∇_x (1/K) ∑_{k=1}^{K} μ̄_k(x). (1)

Then, use R(x) to successively update x, creating a modification x′ by moving x in the direction of the negative gradient. More precisely, set x_0 = x and, for some positive c > 0 and integer L, set ε = c/L and compute

x_{l+1} = x_l − ε R(x_l), for l = 0, …, L − 1. (2)

Finally, set x′ = x_L. The motivation behind choosing ε ∝ 1/L is that, for large L, the discrete steps from (2) form a continuous path whose length depends on c and whose endpoint is given by x′. We then define the OMIG explainability for x as the difference between the end and the starting point of this path,

Δx = x′ − x. (3)

In the experiments, we use L = 50 and c = 1, which we found to perform well for the considered use case. L should be large enough that the resulting step size ε = c/L can be expected to lower the midpoint in the update (2). The parameter c determines the length of the path leading to x′.
Choosing a large c (and scaling L accordingly) will push x′ toward a local minimum, while choosing a small c will reduce the OMIG explainability to that of sensitivity analysis. As we want an x′ that is close to x but still allows for a modification Δx = x′ − x containing sufficient information, c should be chosen neither too large nor too small. For the results in section 3 below, we included a comparison of various choices of c in figure 14 in the appendix.
The visual quality of Δx can be substantially enhanced by removing 'outlier components' of the gradient R(x_l) for l ⩾ 1 before performing the iteration step (2). To do this, we compute for each component R(x_l)_i of the gradient R(x_l) the Z-score

Z_i = (R(x_l)_i − μ(R(x_l))) / σ(R(x_l)),

where μ(R(x_l)) and σ(R(x_l)) denote the mean and the standard deviation of the components of R(x_l), respectively. A component R(x_l)_i is then set to zero during the lth iteration if the absolute value of its Z-score exceeds a predefined threshold t. In this work, we used t = 3, which excludes, on average, about 1%-2% of the components in our case. The full algorithm for computing the OMIG explainability is summarized in algorithm 1.
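The OMIG iteration together with the Z-score outlier removal can be sketched as follows. This is a hedged illustration, not the authors' implementation: `grad_midpoint` is a placeholder callable standing for R(x), the gradient of the ensemble's CDC midpoint, which in practice would come from automatic differentiation of the deep ensemble; for simplicity the sketch applies the outlier removal in every iteration:

```python
import numpy as np

def omig(x, grad_midpoint, c=1.0, L=50, t=3.0):
    """Sketch of the OMIG explainability map Delta x = x' - x.

    grad_midpoint(x) must return R(x), the gradient of the CDC midpoint
    (1/K) * sum_k mu_k(x) with respect to the input image x.
    """
    eps = c / L                      # step size chosen proportional to 1/L
    x_l = x.astype(float).copy()
    for _ in range(L):
        r = grad_midpoint(x_l)
        # Z-score outlier removal: zero out components far from the mean
        z = (r - r.mean()) / (r.std() + 1e-12)
        r = np.where(np.abs(z) > t, 0.0, r)
        # move against the gradient, i.e. toward a lower predicted midpoint
        x_l = x_l - eps * r
    return x_l - x                   # the OMIG explainability map
```

With a toy quadratic "midpoint" whose gradient is the input itself, the returned map points opposite to x, i.e. toward the midpoint's minimum, as the construction intends.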
Note that the dimension of Δx coincides with that of x; thus, we can plot it as an image (a 'heatmap'). From the construction of the OMIG explainability, we have the following interpretation of the sign of the pixels of Δx: given an image x and its OMIG explainability Δx, we expect to obtain an image with a better predicted image quality, i.e. a lower CDC, if
• for pixels with a positive value in Δx, we increase the intensity of the corresponding pixel in x according to this value, and
• for pixels with a negative value in Δx, we decrease the intensity of the corresponding pixel in x according to this value.
In other words, we expect image quality to improve if the following is done:
• pixels of Δx with positive values indicate pixels of the original x that should be made more visible, and
• pixels of Δx with negative values indicate pixels of the original x that should be made less visible.
This information will be helpful in interpreting the results in section 3 below. Note, however, that the design of the proposed OMIG procedure means we only expect the predicted CDC to decrease if we modify all pixels in x according to their values in Δx at the same time; thus, one should be careful not to interpret local structures in Δx without taking the entire image into account. Moreover, our interpretation is restricted to the midpoint of the CDC only, rather than using the whole curve.

Algorithm 1. Computation of the OMIG explainability.
Input: input image x, ensemble of neural networks, parameters ε > 0, t > 0 and an integer L > 0.
1: x_0 = x
2: obtain initial gradient R(x_0) with (1)
3: for each l in 0, …, L − 1 do
4:   if l ⩾ 1, set components of R(x_l) whose |Z-score| exceeds t to zero
5:   x_{l+1} = x_l − ε R(x_l)
6:   obtain gradient R(x_{l+1}) with (1)
7: end for
Output: Δx = x_L − x

In section 3 below, we will use for comparison two other methods from the literature that are conceptually close to the presented method, namely occlusion sensitivity [32] and sensitivity analysis [33]. For occlusion sensitivity, the explainability value (with respect to the CDC midpoint) for an input x at a pixel i is given by −(1/K) ∑_{k=1}^{K} (μ̄_k(x^(i)) − μ̄_k(x)), where x^(i) is constructed from x by setting its value at i to 0 and where, similarly to OMIG, we summarize the CDC by its midpoint. Occlusion sensitivity measures, by construction, the impact of single pixels. Sensitivity analysis, on the other hand, shows the gradient of −(1/K) ∑_{k=1}^{K} μ̄_k(x) with respect to x, i.e. with the notation from (1), −R(x). Sensitivity analysis therefore shows the effect of changing all pixels at once. Note that, due to (2), Δx = −ε ∑_{l=0}^{L−1} R(x_l), so that the OMIG method can be seen as a way to enrich the information of the sensitivity analysis gradient R(x) by 'integrating' it along a path. In addition to the methods presented in figure 3, we studied layerwise relevance propagation [34] but found it, as in [6], unsatisfactory for our use case.
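For comparison, the occlusion-sensitivity baseline scored against the CDC midpoint can be sketched as follows. This is a minimal illustration under assumptions: `midpoint` is a placeholder for the ensemble's scalar midpoint prediction, and the per-pixel loop is written for clarity rather than speed:

```python
import numpy as np

def occlusion_map(x, midpoint):
    """Occlusion sensitivity with respect to the CDC midpoint.

    midpoint(x) returns (1/K) * sum_k mu_k(x). Pixel i is scored by the
    (negated) change of the midpoint when that single pixel is set to 0,
    so a higher score means the pixel lowers the CDC, i.e. helps quality.
    """
    base = midpoint(x)
    out = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        x_occ = x.astype(float).copy()
        x_occ.flat[i] = 0.0                    # occlude a single pixel
        out.flat[i] = -(midpoint(x_occ) - base)
    return out
```

The quadratic cost of this loop in the number of pixels is the reason, noted below, that upsampling occlusion sensitivity is computationally impractical.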

Upsampling explainability maps
As explained in section 2.2, the data used to train the neural networks are downsampled to a resolution of 250 × 250 by randomly selecting pixels from the original images while keeping track of their original order [6]. As a result of this procedure, 99% of the original image is omitted. While this leads to a better convergence of the training and requires fewer computational resources, it raises the problem that the trained neural network can only process downsampled images. Specifically, any OMIG explainability Δx as in (3) will have the dimensions of the downsampled input. For the sake of interpretability, it would be more useful to have an explainability map in the resolution of the original, not downsampled, image instead, as this would allow us to study the impact of finer structures such as individual disks. To this end, we here propose an upsampling of the explainability maps. The upsampling procedure is described in detail in algorithm 2. Here, the image is downsampled several times and the results of the OMIG explainability are collected in a single explainability map of the size of the original, not downsampled, image. In each iteration, the indices I of the 250 × 250 randomly chosen pixels are saved. After computing the explainability map Δx as in (3), the values of Δx are assigned to a high-resolution map δ, with the same dimensions as the original, not downsampled, image, according to the pixel indices stored in I. All pixels of δ that are not listed in I are filled with zeros. This procedure is repeated N times and the obtained δ's are averaged to yield an upsampled explainability map (Δx)_upsampled that has the same dimensions as the original, not downsampled, image.
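The upsampling loop of algorithm 2 can be sketched in NumPy as follows. This is a simplified illustration: it assumes, for concreteness, that downsampling selects random rows and columns (the exact sampling scheme of [6] may differ), and `omig_map_fn`, `n_repeats` and `low_res` are placeholder names introduced for the example:

```python
import numpy as np

def upsample_explainability(image, omig_map_fn, n_repeats=5000, low_res=250):
    """Average OMIG maps over many random downsamplings into one
    high-resolution explainability map of the original image size.

    omig_map_fn(x_small) must return the OMIG map for a downsampled image.
    """
    H, W = image.shape
    acc = np.zeros((H, W))
    rng = np.random.default_rng(0)
    for _ in range(n_repeats):
        # pick low_res rows/columns at random, keeping their original order
        rows = np.sort(rng.choice(H, size=low_res, replace=False))
        cols = np.sort(rng.choice(W, size=low_res, replace=False))
        x_small = image[np.ix_(rows, cols)]
        dx = omig_map_fn(x_small)
        # scatter the low-resolution map back; unsampled pixels stay zero
        delta = np.zeros((H, W))
        delta[np.ix_(rows, cols)] = dx
        acc += delta
    return acc / n_repeats
```

Since each repetition is independent, the loop parallelizes trivially, which matches the remark in section 3 that the roughly 4.5 h runtime for N = 5000 could in principle be reduced.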

Results
The right column of figure 3 shows the resulting explainability maps obtained by the proposed OMIG method. The rows of figure 3 represent the explainability maps for three different images: the recorded image from figure 1(a) in the top row, the simulated image from scenario #1 from figure 1(b) in the middle row and the simulated OOD image from figure 1(c) in the lower row. As seen in the OMIG explainability maps, the OMIG method highlights a large area of the input image and successfully detects margins. Most importantly, however, when compared to the sensitivity analysis whose results are presented in the middle column and the occlusion sensitivity from the left column, the OMIG method is the only method presented that actually highlights expressive patterns. The CDMAM phantom grid is clearly visible on the OMIG heatmaps, but due to the downsampling procedure it is impossible to determine if any other phantom features besides the grid are highlighted.
It is evident that each of the four methods studied produces distinct results. Occlusion sensitivity and sensitivity analysis put a particular emphasis on the middle region. However, the highlighted region is blurry and does not reveal any more specific structures. Moreover, for occlusion sensitivity, this region is invariant to the position of the actual phantom, as can be seen by comparing its performance on the recorded image with a margin (figure 1(a)) with that on the image from scenario #1 (figure 1(b)) and the image from the OOD set (figure 1(c)), both of which have no margin.
Because the heatmaps in figure 3 were obtained from highly downsampled image data, as explained in section 2.2, the explainability maps on the downsampled images are not particularly useful for analyzing the impact of finer structures in the phantom such as individual disks. For this reason, we now employ the upsampling procedure from algorithm 2 to construct high-resolution explainability maps. This involves evaluating the OMIG algorithm from section 2.3.2 N times, where we chose N = 5000. While an evaluation of the OMIG explainability for a single downsampled image takes only a few seconds, executing a loop as in algorithm 2 takes around 4.5 h. In principle, this process is parallelizable and can thus be sped up. For occlusion sensitivity, the upsampling consumes too many computational resources to be of any practical use. In figure 12 in the appendix, we included the upsampled heatmaps for the sensitivity analysis method: here, in contrast to the OMIG method, no finer details become visible through upsampling. We will therefore restrict ourselves to the OMIG method from this point on.
The upsampled explainability maps for the recorded image from figure 1(a) and the simulated scenario #1 image from figure 1(b) are presented in figure 4. The left column of figure 4 shows the entire, upsampled heatmap for these two images (top: recorded image, bottom: simulated scenario #1 image). In the upper part of both heatmaps, individual disks are clearly visible. These parts are shown enlarged in the second column of figure 4. The corresponding area is marked by a white rectangle in the left column. As explained in section 2, the presented OMIG heatmaps were produced with c = 1 and L = 50.
The maps in figure 4 indicate that the neural networks trained for mammography IQA learned a 'meaningful' mapping, that is, a mapping that uses the geometrical structures of the phantom and does not, or does not only, rely on structure-agnostic aspects such as the noise level or on a 'Clever Hans' effect [35]. The color chosen for the pixels in the explainability map in figure 4 is red for pixels i with Δx_i > 0, blue for those with Δx_i < 0 and black for Δx_i = 0. For the interpretation of these colors, let us recall the logic behind the OMIG method: if we are given an image x and if, for each pixel i that is
• red (i.e. Δx_i > 0), the intensity of x_i is increased, and, for each pixel i that is
• blue (i.e. Δx_i < 0), the intensity of x_i is decreased,
then we expect the deep ensemble to predict a better image quality, i.e. a lower CDC. Figure 13 in the appendix shows, for the images considered in figures 4 and 6 below, the effect of the corresponding pixel modification on the predicted CDC. This illustrates that the OMIG heatmap does indeed highlight those features that are expected to move the predicted CDC downwards.
Looking at the cutouts in figure 4, we observe that almost all of the disks in the cutouts are red. Broadly speaking, this means that increasing the intensity of the disks is expected to increase the predicted image quality. This is to be expected, as the detectability of single disks is what the CDC is designed to quantify. However, note that only a part of the disks, and also other structures, influence the prediction. Figure 5 shows further cutouts of the explainability maps from figure 4. Apparently, the disks in these cutout regions do not contribute much to the prediction of the neural network.
The difference in the highlighting of disks between the cutouts in figures 4 and 5 is easy to understand. The cutout in figure 4 shows an area of the phantom with the disks of larger diameters, whereas the region depicted in the cutout in figure 5 contains the disks of smaller diameters (cf figure 8 in the appendix). The disks with larger diameters are more likely to be chosen during the downsampling procedure and therefore, on average, have a higher impact on the prediction of the trained neural network. This shows that the algorithmic choice to downsample images before training the networks has a noticeable effect on their 'functional principle' .
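The size effect just described can be made plausible with a back-of-the-envelope calculation. Under the simplifying assumption that downsampling keeps roughly 1% of the pixels independently at random (an approximation of the sampling scheme actually used), a disk covering d pixels retains at least one pixel with probability about 1 − 0.99^d; the function name below is introduced purely for illustration:

```python
def survival_probability(d_pixels, keep_rate=0.01):
    """Probability that a disk covering d_pixels keeps at least one pixel
    when each pixel independently survives downsampling with keep_rate."""
    return 1.0 - (1.0 - keep_rate) ** d_pixels

# larger disks are far more likely to remain visible after downsampling:
# a 50-pixel disk survives with probability ~0.4, a 2000-pixel disk almost surely
small_disk = survival_probability(50)
large_disk = survival_probability(2000)
```

This simple model is consistent with the observation that the highlighted disks are precisely those with the larger diameters.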
The pattern behind the explainability values for the grid is more complex. For the examples used in figures 3 and 4, we observe a vertical strip of the grid, colored in blue, that contains the area where disks are highlighted in red. A possible explanation for this pattern is as follows: in figure 1, the lines of the grid become randomly corrupted due to the downsampling. These random shifts could make it harder to detect disks in front of their noisy background. Thus, in the areas where the disks are typically detected, such as the upper part of the images in figure 4, it may be possible to improve the detection performance by lowering the intensity of the surrounding grid while increasing the disk intensity: this could explain the blue-red coloring we see in the cutout. The images from the training data set are randomly mirrored horizontally and vertically, but are not rotated. Since the explainability map depends not only on the input image but also on the trained parameters, this might lead to a certain symmetry, as reflected in the blue coloring of the grid expanding to the bottom.
Note also that, for the recorded image in the first row of figure 3, both the grid and the writing on the CDMAM phantom that surrounds the grid are highlighted. The explainability coloring of this writing follows a similar scheme and could thus be explained along the lines of the hypothesis mentioned above. This shows that these features also have an influence on the prediction, and it suggests that the network performance could be improved if such artifacts were included in the simulation.
We can thus summarize that the prediction of the trained neural network is influenced by the disks that have a large enough diameter to 'survive' the downsampling, but also by the grid and the writing on its sides. These findings exemplify that a data-driven model typically does not follow the mechanics that might seem most natural to us [12] and show the importance of explainability methods, such as OMIG, that tell us which features such a model actually relies on.
An explainability map allows an understanding not only of how a neural network creates a certain prediction but also of why it fails to work as expected in certain cases. We saw in section 2.2 that simulation scenario #4, which has a particularly high noise level, could be considered such a case, since it leads to substantially increased errors. These errors are not properly covered by the predicted uncertainty of the deep ensemble, but can be better understood by means of the OMIG explainability. Figure 6 shows the OMIG explainability heatmap and a cutout similar to those in figure 3, but for an image from scenario #4. We observe that, while the grid seems to have an influence on the prediction that is similar to the influence of the grid shown in figure 4, the disks do not appear in the explainability map. This indicates that the high noise level hinders the neural networks in detecting the disks of the phantom; thus, the networks can only base their prediction on the grid, which leads to a substantially worse performance, as seen in figure 2.
As a final example, we study the effect of a corrupted input image. The left column of figure 7 shows the recorded image from figure 1(a), but rotated by 90°. The middle column of figure 7 contains the corresponding OMIG explainability map. The right column then shows a cutout (the enlarged white rectangle in the middle column). For the heatmap, we observe that the grid coloring largely resembles that of the unrotated image in figure 3. This is reasonable because no rotated images were considered in the training of the neural networks; thus, the networks 'assume' the grid is oriented as in figure 1. Note that, even in the rotated image, some disks are highlighted; this is shown in the cutout region in the right column. Even though this region is now located in an area different from that of figure 3, we still observe the familiar red-colored disks surrounded by a blue-colored grid. In other words, by increasing the intensity of the disks and decreasing the intensity of the surrounding grid, we expect to increase the predicted image quality. This indicates that at least some of the networks use the shape of geometrical structures for their prediction, which suggests a certain robustness of the learned mapping.
Let us summarize our observations:
• The introduced OMIG explainability method reveals that the learned neural networks use geometrical patterns in the input image, such as the disks of the CDMAM phantom. This indicates that the deep ensemble seems to have learned a meaningful mapping, although it uses different information than a human observer or the CDMAM Analyser software. The disks which have the greatest impact are those that have the highest chance of surviving the downsampling.
• However, the highlighted patterns include not only the disks but also the grid itself and the writing on its sides. For the grid, OMIG reveals a rather complex pattern behind the prediction; this pattern could be the subject of future work.
• The patterns still appear in corrupted (i.e. rotated) images, which suggests a certain robustness of the learned neural networks.
• For cases in which the neural networks fail to work as expected, such as simulation scenario #4, the OMIG explainability helps to explain this poor performance.
These observations provide a rationale for the use of the OMIG method for a deep learning-based IQA in mammography, as it mitigates the black-box nature of the deep learning approach and allows the limits of the learned networks to be studied, thereby revealing opportunities to improve the approach and enhance its trustworthiness.

Limitations and outlook
In section 3, we observed that the OMIG explainability maps mark certain disks as being of particular relevance for the prediction of the CDCs. This reveals two insights: first, as explained above, we can deduce that the employed downsampling procedure directly affects the learned mapping. Second, it follows that the neural network has learned a mapping that, while still using geometric structures in the phantom, is not identical to the one used by a human observer or the CDMAM Analyser software. The neural network uses only a part of the disks and not necessarily those that are most relevant for a non-deep-learning-based prediction. In addition, the explainability maps in figures 4, 6 and 7 show that other structures such as the grid and the writing also have an impact on the prediction of the CDCs made by the neural networks. In particular, we see that not all disks are necessary and that other structures can be used as well to correctly predict the image quality if one is willing to use a data-driven model. A more in-depth study of such findings may be helpful for designing alternatives to the CDMAM phantom such as anthropomorphic breast phantoms [36][37][38].
The method in this work builds on the integrated gradients (IG) approach from [39] but modifies it by removing the need for a baseline and instead exploiting the natural direction toward an expected better image quality. This leads to a rather complex path connecting the input image x and its updated version x′, in contrast to the path considered in the IG approach, which is merely a straight line connecting x and the baseline. While the 'IG path' will therefore always end at the baseline, the 'OMIG path' will end at different endpoints x′ that depend on x and the chosen hyperparameters c and L. In this work, we used a large L; this is reasonable because, in the limit L → ∞, the updates from (2) are expected to describe a continuous path in the input space. However, no such unique reasonable choice is possible for the proportionality factor c. A comparison of various c is presented in figure 14 in appendix E. We observed, empirically, that c = 1 works quite well for our use case; however, other factors may be more effective for other applications. The question of how to choose this parameter may, along with a study of the performance of the OMIG method for different tasks, be an interesting topic for future research.
From the explanation above, one can see that the OMIG approach is, in principle, applicable to any problem (e.g. in IQA but also in a broader context) that possesses a natural 'direction'. Given such a direction, one can choose a value for R similar to that in (1) in order to point in (the inverse of) this direction and then update x to create x′ according to (2). Finally, we can define the OMIG explainability via (3), which gives information on how x must be modified to shift the prediction toward the natural direction. For other IQA problems, this direction might again be expected to describe an improvement of the image quality. In a classification problem, on the other hand, such a direction might be chosen to make a certain class more likely. An investigation of these questions is reserved for future work.
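The general recipe above can be sketched in a few lines of code. The following is a minimal illustration, not the actual implementation: equations (1)-(3) are not reproduced here, so we assume a simple reading in which the model's quality score is increased by L small gradient steps of size proportional to c, and the explainability map is the accumulated change x′ − x. The function `grad_fn` is a hypothetical placeholder for the gradient of the predicted quality with respect to the input image; the toy template-matching "model" below exists only to make the sketch runnable.

```python
import numpy as np

def omig_explainability(x, grad_fn, c=1.0, L=100):
    """Sketch of the OMIG idea: starting from input image x, take L small
    steps along the gradient of the predicted quality (the 'natural
    direction'), with step size proportional to c, then return the
    accumulated change x' - x as the explainability map."""
    x_prime = x.astype(float).copy()
    for _ in range(L):
        g = grad_fn(x_prime)       # gradient of predicted quality w.r.t. image
        x_prime += (c / L) * g     # small update along the natural direction
    return x_prime - x             # how x must change to improve quality

# Toy stand-in for a quality-predicting network: quality is highest when the
# image matches a fixed template, so the gradient points toward the template.
template = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
grad_fn = lambda img: 2.0 * (template - img)  # grad of -||img - template||^2

x = np.zeros((2, 2))
heatmap = omig_explainability(x, grad_fn, c=1.0, L=200)
print(heatmap.round(2))  # large positive values where intensity should rise
```

In this toy setting, the map is positive exactly where raising the pixel intensity would improve the score, mirroring the red-colored disks in the heatmaps discussed above.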

Conclusion
We introduced a new explainability method (OMIG) for a deep learning-based IQA in mammography. Compared to other established explainability methods, OMIG yields more expressive results for this use case. To address the issue of the downsampled data, we proposed a simple upsampling algorithm that provides high-resolution explainability maps, based on which we were able to study the impact of fine structures such as individual disks. These maps show that the trained neural networks use the disks of the CDMAM phantom and indicate that the networks have learned a meaningful relationship. The explainability maps also reveal which of these disks and which other structures in the phantom have an impact on the prediction. This allows not only the mechanisms behind the neural networks' predictions to be analysed but also cases in which the networks fail to perform as expected to be explained. The presented OMIG approach could also be useful for other problems that possess a natural direction.

Data availability statement
No new data were created or analysed in this study.

Appendix D. Training details
The training followed an incremental learning strategy based on the strategy from [5] and is depicted in figure 11 below. In the first training step, only the simulated images with no margins were used. The smaller margins were added in the second training step, while the largest margins were applied in the third step. In the last training step, the recorded and the simulated images of the CDMAM phantom were included.
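The staged schedule described above can be illustrated with a small sketch. This is not the actual training code from [5]: the stage names follow the description in the text, but the linear model, the synthetic data, and all hyperparameters are illustrative placeholders standing in for the neural network and the CDMAM images. The point is only the curriculum structure, in which each stage extends the training set seen so far.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])       # ground-truth parameters of the toy task

def make_stage(n):
    """Stand-in for one group of training images (here: a toy regression set)."""
    X = rng.normal(size=(n, 2))
    return X, X @ w_true

# Stages mirror the schedule in the text: no margins, then small margins,
# then large margins, and finally the recorded images.
stages = [
    ("no margins",      make_stage(50)),
    ("small margins",   make_stage(50)),
    ("large margins",   make_stage(50)),
    ("recorded images", make_stage(50)),
]

w = np.zeros(2)                      # model parameters, trained incrementally
X_seen = np.empty((0, 2))
y_seen = np.empty(0)
for name, (X, y) in stages:
    # Extend the training set with the new stage, keeping all earlier data.
    X_seen = np.vstack([X_seen, X])
    y_seen = np.concatenate([y_seen, y])
    for _ in range(500):             # gradient descent on all data seen so far
        grad = 2 * X_seen.T @ (X_seen @ w - y_seen) / len(y_seen)
        w -= 0.1 * grad
    print(f"after stage '{name}': w = {w.round(3)}")
```

Because each stage re-trains on the union of all data seen so far, earlier (easier) examples are not forgotten when harder ones are added, which is the essential property of the incremental strategy.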