Explainable artificial intelligence (XAI) in deep learning-based medical image analysis

With the increase in deep learning-based methods, the call for explainability of such methods grows, especially in high-stakes decision making areas such as medical image analysis. This survey presents an overview of eXplainable Artificial Intelligence (XAI) used in deep learning-based medical image analysis. A framework of XAI criteria is introduced to classify deep learning-based medical image analysis methods. Papers on XAI techniques in medical image analysis are then surveyed and categorized according to this framework and according to anatomical location. The paper concludes with an outlook on future opportunities for XAI in medical image analysis.


Introduction
Deep learning has brought tremendous progress in automated image analysis. Before deep learning, image analysis was commonly performed using systems fully designed by human domain experts. Such a system could, for example, consist of a statistical classifier that used handcrafted properties of an image (i.e., features) to perform a certain task. Features included low-level image properties such as edges or corners, but also higher-level image properties such as the spiculated border of a cancer. In deep learning, these features are not handcrafted but learned by a neural network, such that the network optimally maps an input to a result (or output). For example, a deep learning system may produce the output 'cancer' given an input image showing a cancer.
Neural networks typically consist of many layers connected via many nonlinear, intertwined relations. Even if one were to inspect all these layers and describe their relations, it is infeasible to fully comprehend how the neural network came to its decision. Therefore, deep learning is often considered a 'black box'.
Concern is mounting in various fields of application that these black boxes may be biased in some way, and that such bias goes unnoticed. Especially in medical applications, this can have far-reaching consequences.
There has been a call for approaches to better understand the black box. Such approaches are commonly referred to as interpretable deep learning or eXplainable Artificial Intelligence (XAI) (Adadi and Berrada (2018); Murdoch et al. (2019)). These terms are commonly interchanged; we will use the term XAI. Some notable XAI initiatives include those from the United States Defense Advanced Research Projects Agency (DARPA), and the conferences on Fairness, Accountability, and Transparency by the Association for Computing Machinery (ACM FAccT).
The stakes of medical decision making are often high. Not surprisingly, medical experts have voiced their concern about the black box nature of deep learning, which is the current state of the art in medical image analysis (Litjens et al. (2017); Meijering (2020); Shen et al. (2017)). Furthermore, regulations such as the European Union's General Data Protection Regulation (GDPR, Article 15) grant patients the right to receive meaningful information about how a decision was rendered.
Researchers in medical imaging are increasingly using XAI to obtain insight into their algorithms. In this survey, we aim to give a comprehensive overview of papers using XAI in medical image analysis. We chose to focus solely on papers that used deep learning-based XAI in medical image analysis. The search strategy for inclusion of papers is detailed in Appendix 1. In short, it followed a systematic review procedure, discussion with colleagues, and a snowballing approach (investigating papers referenced by the included papers and papers that refer to the included papers) to come to the final list of surveyed articles.
The survey is structured as follows: we will first introduce the taxonomy of XAI and describe a framework to classify XAI techniques in Section 2. In Section 3, the discussed papers are characterized according to this XAI framework. We will discuss applications of XAI techniques in medical image analysis. When multiple papers use the same technique, we will discuss some early adopters and summarize the remaining papers in the tables. Since XAI techniques often originate from computer vision, we will elaborate on papers that adapted XAI techniques from computer vision by adding domain knowledge from the medical imaging field. The papers are grouped in the tables according to explanation method and according to anatomical location. This survey adds to the review of Reyes et al. (2020), who mainly discussed techniques from computer vision without extensively evaluating the adaptation of such techniques throughout medical image analysis; in contrast, we describe if and how techniques from computer vision have been adapted specifically for medical image analysis. This survey also adds to the review of Huff et al. (2021), who mostly focused on examples of visual explanation, while our survey aims for a more comprehensive overview that also covers textual and example-based explanation.

Visual explanation
Visual explanation, also called saliency mapping, is the most common form of XAI in medical image analysis. Saliency maps show the important parts of an image for a decision. Most saliency mapping techniques use backpropagation-based approaches, but some use perturbation-based or multiple instance learning-based approaches. These approaches will be discussed below. An overview of papers using saliency maps in medical imaging is shown in Table 2.

(Guided) backpropagation and deconvolution
Some of the earliest techniques to create saliency maps highlighted the pixels that had the highest impact on the analysis output. Examples included visualization of partial derivatives of the output at the pixel level (Simonyan et al. (2013)), deconvolution (Zeiler and Fergus (2014)), and guided backpropagation (Springenberg et al. (2014)). These techniques provided local, model-specific (only for CNNs), post hoc explanation, and have been used in medical image analysis. For example, de Vos et al. (2019) estimated the amount of coronary artery calcium per cardiac or chest computed tomography (CT) image slice, and used deconvolution to visualize on which parts of the slice the decision was based.
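To make the principle concrete, the following is a minimal sketch of a gradient-based saliency map in PyTorch; `model`, the input tensor, and the class index are illustrative placeholders, not any specific published implementation.

```python
# Minimal sketch of a gradient-based saliency map (cf. Simonyan et al., 2013)
# in PyTorch. `model` is any trained CNN classifier.
import torch

def gradient_saliency(model, image, target_class):
    """Return |d(class score)/d(pixel)| as a saliency map."""
    model.eval()
    image = image.clone().requires_grad_(True)   # track gradients w.r.t. pixels
    score = model(image)[0, target_class]        # scalar class score (logit)
    score.backward()                             # backpropagate to the input
    # Max over color channels gives one importance value per pixel.
    return image.grad.detach().abs().max(dim=1)[0]

# Usage (shapes are illustrative):
# saliency = gradient_saliency(model, ct_slice.unsqueeze(0), target_class=1)
```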

Class Activation Mapping (CAM)
Zhou et al. (2016) introduced Class Activation Mapping (CAM). They replaced the fully connected layers at the end of a CNN with global average pooling on the last convolutional feature maps. The class activation map was a weighted linear sum of the presence of visual patterns (captured by the filters) at different spatial locations. This technique provided local, model-specific, post hoc explanation, and several researchers have used it in medical imaging (Table 2).
CAMs have also been used in medical image analysis in ensembles of CNNs. For example, Jiang et al. (2019) constructed an ensemble of Inception-V3, ResNet-152, and Inception-ResNet-V2 to distinguish fundus images of healthy subjects or patients with mild diabetic retinopathy from those with moderate or severe diabetic retinopathy, and provided a weighted combination of the resulting CAMs for localization of diabetic retinopathy. Lee et al. (2019b) constructed CAMs from the output of an ensemble of four CNNs (VGG-16, ResNet-50, Inception-V3, and Inception-ResNet-V2) for the detection of acute intracranial hemorrhage.
Since medical images often contain information at multiple scales, multi-scale CAMs have also been proposed. Liao et al. (2019) concatenated feature maps at three scales, which were subsequently provided as input for the global average pooling. The resulting activation maps showed higher resolution than single-scale maps and were better at identifying small structures on fundus images of the retina. Shinde et al. (2019a) concatenated the feature maps of each layer before max-pooling and also gave those as input to a global average pooling layer. Their 'High Resolution' CAMs provided accurate localizations of brain tumors on MRI. García-Peraza-Herrera et al. (2020) proposed extracting CAMs at multiple resolutions. They showed that the CAMs at high resolution accurately highlighted interpapillary capillary loop patterns in endoscopy images, which are relatively small compared to the entire image. Selvaraju et al. (2017) introduced Gradient-weighted Class Activation Mapping (Grad-CAM), a generalization of CAM. Grad-CAM can work with any type of CNN to produce post hoc local explanation, whereas CAM specifically requires global average pooling. The authors also introduced guided Grad-CAM, which combines Grad-CAM with guided backpropagation. Grad-CAM has, for example, been used to show which areas of brain MRI made the classifier decide on the presence of a tumor.
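A minimal Grad-CAM sketch in PyTorch is given below, assuming a generic classifier `model` with a final convolutional layer `last_conv`; both names are placeholders.

```python
# Minimal Grad-CAM sketch (cf. Selvaraju et al., 2017) using forward and
# backward hooks on the last convolutional layer.
import torch
import torch.nn.functional as F

def grad_cam(model, last_conv, image, target_class):
    acts, grads = {}, {}
    h1 = last_conv.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = last_conv.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    model.eval()
    model(image)[0, target_class].backward()
    h1.remove(); h2.remove()
    # Channel weights: global-average-pooled gradients.
    w = grads['g'].mean(dim=(2, 3), keepdim=True)
    # Weighted sum of feature maps; ReLU keeps positive evidence only.
    cam = F.relu((w * acts['a']).sum(dim=1, keepdim=True))
    # Upsample the coarse map to the input resolution.
    return F.interpolate(cam, size=image.shape[-2:], mode='bilinear',
                         align_corners=False)
```

For an architecture that ends in global average pooling, these pooled-gradient weights reduce to the fully connected weights of CAM, which is why Grad-CAM is a generalization.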

Layer-wise relevance propagation (LRP)
Bach et al. (2015) introduced layer-wise relevance propagation (LRP). LRP takes the output of the neural network, e.g. a classification score between 0 and 1, and iteratively backpropagates it through the network. In each iteration (i.e., each layer), LRP assigns a relevance score to each of the input neurons of that layer. According to the conservation law, the relevance scores distributed to the input neurons must sum to the relevance score of their source neuron. LRP has been used in medical image analysis. For example, Böhle et al. (2019) used LRP for identifying regions responsible for Alzheimer's disease classification on brain MR images. They compared the saliency maps provided by LRP with those provided by guided backpropagation, and found that LRP was more specific in identifying regions known to be affected in Alzheimer's disease.
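The following is a minimal numpy sketch of the LRP epsilon-rule for a single fully connected layer; variable names are illustrative, and a full implementation would apply such a rule layer by layer from the output back to the input.

```python
# Minimal sketch of the LRP epsilon-rule for one fully connected layer
# (cf. Bach et al., 2015). `a` are the layer inputs (activations), `W`/`b`
# its weights and biases, and `R_out` the relevance arriving from above.
import numpy as np

def lrp_epsilon(a, W, b, R_out, eps=1e-6):
    z = a @ W + b                    # forward pre-activations
    z = z + eps * np.sign(z)         # stabilizer avoids division by zero
    s = R_out / z                    # relevance per unit of pre-activation
    return a * (s @ W.T)             # relevance of each input neuron

# Conservation check (holds up to the epsilon stabilizer and bias terms):
# R_in = lrp_epsilon(a, W, b, R_out);  R_in.sum() is approximately R_out.sum()
```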

Deep SHapley Additive exPlanations (Deep SHAP)
Lundberg and Lee (2017) introduced SHapley Additive exPlanations (SHAP) and its deep-learning variant Deep SHAP, which attribute a model's output to its input features using approximated Shapley values. Attention mechanisms have also been used for visual explanation, for example in a U-Net (Ronneberger et al. (2015)) and a variant of VGG (Simonyan and Zisserman (2014)); the attention coefficients were used to explain on which areas of the image the network focused.
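As an illustration, Deep SHAP is available in the open-source `shap` package; a hedged usage sketch is shown below, where `model`, `background`, and `test_images` are placeholders.

```python
# Hedged usage sketch of Deep SHAP via the `shap` package (Lundberg and
# Lee, 2017). `model` is a trained deep classifier; `background` is a small
# batch of reference images used to approximate expectations.
import shap

explainer = shap.DeepExplainer(model, background)   # e.g. 50-100 images
shap_values = explainer.shap_values(test_images)    # per-class attributions
# For a multi-class model, `shap_values[c]` has the shape of the input: a
# signed, per-pixel contribution to class c, which can be rendered as a map.
```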

Local Interpretable Model-agnostic Explanations (LIME)
Ribeiro et al. (2016) introduced Local Interpretable Model-agnostic Explanations (LIME). LIME provides local explanation by locally replacing a complex model with simpler models, for example by approximating a CNN with a linear model. By perturbing the input data, the output of the complex model changes. LIME uses the simpler model to learn the mapping between the perturbed input data and the change in output. The similarity of the perturbed input to the original input is used as a weight, to ensure that explanations provided by the simple models with highly perturbed inputs have less effect on the final explanation. For images, Ribeiro et al. (2016) implemented the perturbations using superpixels (Achanta et al. (2012)) rather than individual pixels, to show which regions were important for explaining a classification. LIME has been used by several researchers in medical image analysis. For example, Malhi et al. (2019) used LIME to explain which areas in gastral endoscopy images contained bloody regions.

Meaningful perturbation

Fong and Vedaldi (2017) introduced meaningful perturbation, where they perturbed the input image to detect changes in the predictions of a trained model. Rather than using perturbations such as occlusion sensitivity that block out parts of the image, they suggested simulating naturalistic or plausible effects, leading to more meaningful perturbations and consequently to more meaningful explanations. They opted for three types of local perturbations, namely a constant value, noise, or blurring. Uzunova et al. (2019) argued that the perturbations of Fong and Vedaldi (2017) were not suited for medical images: replacing areas of a medical image with a constant value is implausible, and medical images naturally tend to be noisy and blurry. They proposed to replace pathological regions with a healthy tissue equivalent using a variational autoencoder (VAE). They showed that the perturbations by the VAE pinpointed pathological regions in diverse imaging studies, such as optical coherence tomography images of the eye (where pathology consisted of intraretinal fluid, subretinal fluid, and pigment epithelium detachments) and MRI of the brain (where pathology consisted of stroke lesions). Furthermore, they showed that using a VAE yielded better localization of pathology than simple blurring or constant-value perturbations. Lenis et al. (2020) used similar reasoning as Uzunova et al. (2019), and used inpainting to replace pathological regions with healthy tissue equivalents. They showed that the perturbations created by inpainting outperformed backpropagation and Grad-CAM in pinpointing masses on breast mammography and tuberculosis on chest X-rays, based on the Hausdorff distance between thresholded heatmaps derived from the saliency maps and the ground truth labels at pixel level.

Prediction difference analysis

Zintgraf et al. (2017) adapted prediction difference analysis (Robnik-Šikonja and Kononenko (2008)) for generating saliency maps. If each pixel in an image is considered a feature, prediction difference analysis assigns a relevance value to each pixel by measuring how the prediction changes if the pixel is considered unknown. Zintgraf et al. (2017) expanded this with conditional sampling, meaning that they only analyzed pixels that are hard to predict from neighboring pixels, and with multivariable analysis, meaning that they analyzed patches of connected pixels instead of single pixels. They included an analysis of brain MRI of patients with HIV versus healthy controls, yielding explanations of the classifier's decisions. Seo et al. (2020) used prediction difference analysis in combination with superpixels (or supervoxels for 3D) at multiple scales. These multiscale supervoxel-based saliency maps provided explanations that the authors described as visually pleasing, since they follow image edges. The saliency maps explained which regions were informative for a classifier distinguishing Alzheimer's disease patients from normal controls.
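The perturb-and-measure principle shared by these methods can be illustrated with a minimal occlusion-sensitivity sketch, the simple baseline mentioned above; the patch size and fill value are illustrative choices.

```python
# Minimal occlusion-sensitivity sketch: slide a constant patch over the
# image and record how much the class probability drops. Assumes a single
# image of shape [1, C, H, W]; names and patch size are illustrative.
import torch

@torch.no_grad()
def occlusion_map(model, image, target_class, patch=16, fill=0.0):
    model.eval()
    _, _, H, W = image.shape
    base = torch.softmax(model(image), dim=1)[0, target_class]
    heat = torch.zeros(H // patch, W // patch)
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            occluded = image.clone()
            occluded[..., i:i + patch, j:j + patch] = fill
            p = torch.softmax(model(occluded), dim=1)[0, target_class]
            heat[i // patch, j // patch] = base - p   # large drop = important
    return heat
```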

Multiple instance learning-based approaches
Multiple instance learning can be used for visualizing explanations. In multiple instance learning, training sets consist of bags of instances (Dietterich et al. (1997)). These bags are labeled, but the instances are not. In medical image analysis, multiple instance learning can for example be done using a patch-based approach: An image represents the bag, and patches from that image represent the instances (Cheplygina et al. (2019)).
Several researchers have used this approach to pinpoint which instances in the bag are responsible for the classification. For example, Schwab et al. (2020) localized critical findings in chest X-rays using such a patch-based approach. Each image patch received a prediction, and the predictions were overlaid on the image to visualize on which areas the classifier based its decision. Araújo et al. (2020) used multiple instance learning to explain which areas of a fundus photograph were important for grading diabetic retinopathy.
They assessed the severity of the disease on an ordinal scale with grades from 0 to 5. Using a patch-based approach, they provided visual explanation maps for each diabetic retinopathy grade.
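A minimal sketch of this patch-based multiple instance learning setup is given below; `patch_cnn` and the max-pooling aggregation are illustrative choices, not a specific published architecture.

```python
# Minimal sketch of patch-based multiple instance learning: an image is a
# bag, its patches are instances, and max-pooling over patch scores gives
# the bag prediction.
import torch
import torch.nn as nn

class MILMaxPooling(nn.Module):
    def __init__(self, patch_cnn):
        super().__init__()
        self.patch_cnn = patch_cnn          # maps one patch to one logit

    def forward(self, patches):             # patches: [n_patches, C, h, w]
        scores = self.patch_cnn(patches).squeeze(-1)  # one score per patch
        bag_logit = scores.max()            # bag is positive if any patch is
        return bag_logit, scores            # patch scores double as the map

# The per-patch `scores`, reshaped to the patch grid and overlaid on the
# image, visualize which regions drove the bag-level decision.
```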

Textual explanation
Textual explanation is a form of XAI that accompanies the model's decision with textual descriptions. Such descriptions range from relatively simple characteristics (e.g. 'spiculated mass') to entire medical reports. We will describe three types of textual explanation: image captioning, image captioning with visual explanation, and testing with concept activation vectors.
An overview of papers using textual explanation in medical imaging is shown in Table 3.

Image captioning
Vinyals et al. (2015) provided textual explanation for images using an end-to-end image captioning framework. They coupled a convolutional neural network for encoding the image with a recurrent neural network, specifically a long short-term memory (LSTM) network (Hochreiter and Schmidhuber (1997)), for textual encoding. They used human-generated sentences as ground truth for training, and used the bilingual evaluation understudy (BLEU) metric for evaluation. The BLEU metric describes the precision of word N-grams, i.e. sequences of N words, between generated and reference sentences (Papineni et al. (2002)). Singh et al. (2019) used an image captioning framework to provide textual explanation for chest X-rays. They used the word-embedding databases Global Vectors (GloVe) (Pennington et al. (2014)) and its radiology variant RadGloVe to train the LSTM, and used the aforementioned BLEU metric as well as the variants METEOR, CIDEr, and ROUGE (Banerjee and Lavie (2005); Lin (2004); Vedantam et al. (2015)). As expected, the generated radiology reports scored higher when both RadGloVe and GloVe were used instead of GloVe alone.
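To make the BLEU metric concrete, the following is a minimal sketch of the modified n-gram precision it is built on; full BLEU additionally combines several n-gram orders and a brevity penalty (Papineni et al. (2002)).

```python
# Minimal sketch of BLEU-style modified n-gram precision for one sentence
# pair; the example sentences are illustrative.
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = Counter(tuple(reference[i:i + n])
                  for i in range(len(reference) - n + 1))
    # Clip candidate counts by reference counts ("modified" precision).
    matched = sum(min(c, ref[g]) for g, c in Counter(cand).items())
    return matched / max(len(cand), 1)

gen = "there is a mass in the left lung".split()
ref = "a mass is seen in the left lung".split()
print(ngram_precision(gen, ref, n=2))   # fraction of matching bigrams
```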

Testing with Concept Activation Vectors (TCAV)
Kim et al. (2018) introduced Testing with Concept Activation Vectors (TCAV). Concept attributions provide explanation corresponding to high-level concepts that humans find easy to understand. Graziani et al. (2020) extended this idea with regression concept vectors and showed that they could, for example, explain why the network classified one area of a breast histopathology image as cancer and another as healthy: both areas of the image scored high on the concept 'contrast', but the concept 'nuclei area', referring to a clinically used system for evaluating cell size, differed between healthy and cancerous regions. Shen et al. (2019) used what they called a hierarchical semantic CNN to predict the malignancy of lung nodules on CT. They classified five textual descriptions of image characteristics representative of lung nodule malignancy that are typically assessed by a radiologist. The task of finding textual descriptions was combined with the main task of classifying lung nodule malignancy. Although their hierarchical semantic CNN did not significantly outperform a regular CNN in predicting nodule malignancy, it did provide human-interpretable characteristics of the nodules.
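A minimal TCAV-style sketch is shown below, assuming precomputed activations of an intermediate layer for concept images and random images; all array names are placeholders.

```python
# Minimal TCAV-style sketch (cf. Kim et al., 2018): fit a linear classifier
# that separates layer activations of concept images from random images;
# its normal vector is the concept activation vector (CAV).
import numpy as np
from sklearn.linear_model import LogisticRegression

def concept_activation_vector(acts_concept, acts_random):
    X = np.vstack([acts_concept, acts_random])
    y = np.r_[np.ones(len(acts_concept)), np.zeros(len(acts_random))]
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    v = clf.coef_.ravel()
    return v / np.linalg.norm(v)

# TCAV score: the fraction of inputs whose class score increases along the
# CAV direction, i.e. the fraction of positive directional derivatives
# grad_h(class score) . cav, where grad_h is the gradient w.r.t. the layer.
def tcav_score(layer_grads, cav):
    return float(np.mean(layer_grads @ cav > 0))
```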

Example-based explanation
Example-based explanation is an XAI technique that provides examples relating to the data point that is currently being analyzed. This can be useful when trying to explain why a model came to a decision, and is related to how humans reason. For example, when a pathologist examines a biopsy of a patient that shows similarity with an earlier patient examined by the pathologist, the clinical decision may be enhanced by knowing the assessment of that earlier biopsy.
Example-based explanation often optimizes the hidden layers deep in the neural network (i.e., the latent space) such that similar points lie close to each other in this latent space, while dissimilar points lie farther apart.
An overview of papers using example-based explanation in medical imaging is shown in Table 4.

Triplet network
Several papers provided example-based explanation using a triplet network (Hoffer and Ailon (2015)). A triplet network consists of three identical networks with shared parameters. By feeding these networks three input samples, the network calculates two values: the L2 distances between the latent (i.e., embedded) representations of these input samples. This allows the network to learn a latent space in which similar samples lie close together and dissimilar samples lie far apart. This technique has, for example, been demonstrated on dermatology images of melanoma.
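A minimal sketch of the triplet loss that such a network minimizes is given below; `encoder` and the margin are illustrative.

```python
# Minimal triplet-loss sketch (cf. Hoffer and Ailon, 2015). The same
# `encoder` embeds anchor, positive (same class), and negative (other
# class); the loss pulls the positive closer than the negative by a margin.
import torch
import torch.nn.functional as F

def triplet_loss(encoder, anchor, positive, negative, margin=1.0):
    za, zp, zn = encoder(anchor), encoder(positive), encoder(negative)
    d_pos = F.pairwise_distance(za, zp)   # L2 distance in latent space
    d_neg = F.pairwise_distance(za, zn)
    return F.relu(d_pos - d_neg + margin).mean()

# PyTorch also ships this loss as torch.nn.TripletMarginLoss.
```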

Influence functions
Wei Koh and Liang (2017) proposed to use influence functions to explain on which inputs from the training set a model based its decision. They did so by investigating what would happen if an input from the training set were not available or were changed. Since it is expensive to assess this by perturbation, they provided an efficient approximation using influence functions (Cook and Weisberg (1980)). C. J. Wang et al. (2019) used influence functions to explain which classifications of liver lesions on multiphase MRI were associated with which radiological characteristics. This global explanation provided insight into the neural network's behavior. For example, the class 'benign cyst' was most often associated with the radiological finding 'thin-walled mass'. Since the network output not only the class label but also the corresponding radiological characteristics, this explanation could enhance user trust in the output of the network.
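The following is a conceptual sketch of the influence formula for a small model where the Hessian can be formed explicitly; Wei Koh and Liang (2017) instead use efficient approximations, so this is illustrative only.

```python
# Conceptual sketch of the influence of a training point on a test
# prediction: influence(z, z_test) = -grad L(z_test)^T H^{-1} grad L(z).
# Explicitly inverting the Hessian is only feasible for tiny models.
import numpy as np

def influence(grad_test, grad_train, hessian, damping=1e-3):
    # Damping keeps a possibly ill-conditioned Hessian invertible.
    H = hessian + damping * np.eye(hessian.shape[0])
    return -grad_test @ np.linalg.solve(H, grad_train)

# A large negative value means removing the training point would increase
# the test loss: that point was influential in supporting the prediction.
```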

Prototypes
C. Chen et al. (2019) proposed to use typical examples as explanation (i.e., prototypes), an approach they described as 'this-looks-like-that'. The method reflects the case-based reasoning that humans perform. For example, when a person explains why a picture contains a car, they can internally reason that this is a car because it looks like a car they have seen before. A prototype layer was added to the neural network, which grouped training inputs according to their classes in the latent space. For each class, a prototype was picked, consisting of a typical example of that class. During testing, the method utilized parts of the test image that resembled these trained prototypes. The output was a weighted combination of the similarities to these prototypes. Hence, the explanation was an actual computation of the model, not a post hoc approximation.
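A minimal sketch of such a prototype layer is given below; the sizes, similarity function, and linear read-out follow the general 'this-looks-like-that' idea but are illustrative rather than the authors' exact implementation.

```python
# Minimal sketch of a prototype layer: similarity of a latent vector to
# learned prototypes, combined linearly into class scores.
import torch
import torch.nn as nn

class PrototypeLayer(nn.Module):
    def __init__(self, n_prototypes, latent_dim, n_classes):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, latent_dim))
        self.classifier = nn.Linear(n_prototypes, n_classes, bias=False)

    def forward(self, z):                    # z: [batch, latent_dim]
        d = torch.cdist(z, self.prototypes)  # distance to every prototype
        sim = torch.log((d ** 2 + 1) / (d ** 2 + 1e-4))  # high when close
        return self.classifier(sim), sim     # sim is the explanation itself

# The similarity scores `sim` are part of the model's actual computation,
# so the explanation is not a post hoc approximation.
```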

Examples from the latent space
Sarhan et al. (2019) proposed learning disentangled representations of the latent space using a residual adversarial VAE with a total correlation constraint. This adversarial VAE enhanced the fidelity of the reconstruction and provided more detailed descriptions of the underlying generative characteristics of the data. By traversing the latent space and analyzing the resulting reconstructions, they showed that their method yielded reconstructions that were more true to human-interpretable concepts such as lesion size, lesion eccentricity, and skin color than those of a regular VAE. Biffi et al. (2020) provided a framework for explainable anatomical shape analysis using a ladder VAE (Sønderby et al. (2016)). They coupled this ladder VAE with a multi-layer perceptron, enabling the network to be trained end-to-end for classification tasks. The highest level of the latent space was thereby enforced to be low-dimensional (2D or 3D), which meant that the learned latent spaces could be visualized directly, without further dimensionality reduction after training. They provided dataset-level explanation using these low-dimensional latent spaces to visualize differences in shape for hypertrophic cardiomyopathy versus healthy controls on cardiac MRI, and for Alzheimer's disease versus healthy controls on brain MRI by visualizing the shape of the hippocampus. Silva et al. (2018) proposed example-based explanation that showed similar and dissimilar cases for aesthetic results of breast surgery on photographs, and for skin lesions on dermoscopy. They identified these examples using a nearest neighbor search in the latent space: the nearest neighbor of the same class was considered the most similar case, and the nearest neighbor of the other class the most dissimilar case. Their explanation also included rule extraction from meta-features (e.g. the color of a skin lesion or the visibility of scars). They proposed three criteria to measure the validity of the rule-extracted explanation: 1) completeness, i.e. the explanation should be general enough to apply to more than one observation; 2) correctness, i.e. if the explanation itself were considered a model, it should correctly identify which class it belongs to; and 3) compactness, i.e. the explanation should be succinct. In later work, Silva et al. combined example-based explanation with saliency mapping. First, they trained a baseline CNN to classify chest X-rays into pleural effusion versus non-pleural effusion. After that, the CNN was fine-tuned on saliency maps. During testing, a nearest neighbor search was performed between the latent representation of the test image and a curated 'catalogue' set of images. Adding the saliency map yielded more consistent examples than extracting examples without the saliency map (i.e., with the baseline CNN).
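A minimal sketch of this nearest-neighbor retrieval in latent space is given below; `encoder`, `catalogue`, and `labels` are placeholders.

```python
# Minimal sketch of example-based retrieval in latent space: embed a test
# image and return its nearest catalogue neighbor of the same class (most
# similar case) and of the other class (most dissimilar case).
import torch

@torch.no_grad()
def similar_and_dissimilar(encoder, test_image, catalogue, labels, test_label):
    z = encoder(test_image)             # [1, latent_dim]
    Z = encoder(catalogue)              # [n, latent_dim]
    d = torch.cdist(z, Z).squeeze(0)    # L2 distance to every example
    same, other = labels == test_label, labels != test_label
    most_similar = torch.argmin(d.masked_fill(other, float('inf')))
    most_dissimilar = torch.argmin(d.masked_fill(same, float('inf')))
    return most_similar, most_dissimilar   # indices into the catalogue
```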
Capsule networks

Sabour et al. (2017) showed that by replacing the scalar feature maps of convolutional neural networks with vectorized representations (i.e., capsules), they were able to encode high-level features of images. Capsules were essentially subcollections of neurons in a layer. These were linked to subcollections of neurons in subsequent layers, forming a capsule network. This capsule network was optimized using dynamic routing. In short, higher-level capsules were activated if their corresponding lower-level capsules were active. This correspondence was described by routing coefficients, which summed to one for each capsule. The coefficients were updated iteratively (i.e., dynamically) when the capsule network received new input data. For the MNIST digits dataset, Sabour et al. (2017) found that these capsules learn human-interpretable features such as scale, thickness, and skew.
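A minimal numpy sketch of dynamic routing is given below; array shapes are illustrative, and real capsule networks embed this routing between convolutional capsule layers.

```python
# Minimal sketch of dynamic routing between capsules (cf. Sabour et al.,
# 2017). `u_hat[i, j]` is the prediction of lower capsule i for higher
# capsule j; routing coefficients start uniform and are refined iteratively.
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    n2 = (s ** 2).sum(axis=axis, keepdims=True)
    return (n2 / (1 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, iterations=3):   # u_hat: [n_in, n_out, dim]
    b = np.zeros(u_hat.shape[:2])           # routing logits
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coefficients
        s = (c[..., None] * u_hat).sum(axis=0)   # sum to one per lower capsule
        v = squash(s)                        # [n_out, dim] output capsules
        b += (u_hat * v[None]).sum(axis=-1)  # agreement updates the logits
    return v
```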
LaLonde et al. (2020) used capsules for lung cancer diagnosis, while also predicting visual attributes such as sphericity, lobulation, and texture. Since these visual attributes were not necessarily mutually exclusive, as was the case in MNIST (a digit cannot be a two and a nine at the same time), they adapted the dynamic routing algorithm accordingly. Specifically, the routing coefficients did not have to sum to one in their implementation. LaLonde et al. (2020) showed that their implementation was indeed able to predict these visual attributes as well as lung nodule malignancy.

Overview
We have discussed 223 papers on eXplainable Artificial Intelligence (XAI) for deep learning in medical image analysis. We categorized the papers based on the XAI-frameworks proposed by Adadi and Berrada (2018) and Murdoch et al. (2019). Some trends were noticeable in the surveyed papers. The majority of the papers used post hoc explanation as contrasted with model-based explanation, i.e., the explanation was provided on a model that had already been trained, instead of being incorporated in model training.
Both model-specific (e.g., specifically designed for CNNs) and model-agnostic explanation methods were used. Furthermore, most of the papers investigated provided local explanation rather than global explanation, i.e., the explanation was provided per case (e.g. per patient), rather than on a dataset-level (e.g. for all patients). Since we focus on deep learning in medical image analysis, these trends were to be expected. Most readily available XAI methods suitable for CNNs are saliency mapping techniques, which often provide post hoc, model-specific, and local explanation. Furthermore, post hoc XAI methods can be used after a neural network has been trained, making them more accessible than model-based XAI.
We categorized the papers based on anatomical location and modality of medical imaging. We found that most papers focus on chest or brain and on X-ray or MRI (Figure 3). This is comparable to what Litjens et al. (2017) found for deep learning methods in medical imaging in general.

Evaluation of XAI
We have described several XAI techniques and their applications in medical image analysis, but how does one evaluate whether an XAI technique provides good explanation? Unlike measures of performance commonly used in medical image analysis, such as accuracy, the Dice coefficient, or ROC analysis, success criteria for explanation are difficult to define. Doshi-Velez and Kim (2017) proposed a framework for the evaluation of explainability, consisting of three evaluation methods: application-grounded evaluation, human-grounded evaluation, and functionally-grounded evaluation.

Application-grounded evaluation
Application-grounded evaluation uses human experiments within a real application. In other words, let domain experts test the explanation. In medical image analysis this might involve a radiologist inspecting whether example-based explanations are actually good examples based on the many images the radiologist has seen in their many years of experience. The advantage of application-grounded evaluation is that it directly tests the objective that the system was built for. The disadvantage is that it is a costly evaluation.

Human-grounded evaluation
Human-grounded evaluation uses simpler human experiments that maintain the essence of the target application. In other words, let laypersons test the explanation or a proxy of the explanation. For example, when explaining the location and size of a cancer, this might involve a crowdsourcing project where laypersons judge the quality of saliency maps. Since it uses laypersons instead of highly trained domain experts, the advantage of human-grounded evaluation is that it is less costly, while still receiving general notions of the quality of an explanation. The disadvantage is that the assessment of the quality of an explanation is a proxy of the actual quality.

Functionally-grounded evaluation
Functionally-grounded evaluation does not use human experiments, but uses other proxies to assess the quality of the explanation. These proxies may include measurements that have already been validated using human users. In our example of explaining the location and size of a cancer, this might involve comparing the explanation with tumor delineations manually drawn by a radiologist. The advantage of functionally-grounded evaluation, as stated by Doshi-Velez and Kim (2017), is that it is relatively cheap. This is, however, not necessarily the case in medical image analysis, since acquiring, for example, manual annotations is a very resource-intensive process. When such manual annotations already exist, e.g. when using curated data from a challenge, evaluations of explanations are easy to obtain and can be extracted automatically and repeatedly. This can be useful, for example, in the development phase of explanation methods.
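As an illustration of such a functionally-grounded proxy, the following minimal sketch computes the Dice overlap between a thresholded saliency map and a manual delineation; the arrays and threshold are illustrative.

```python
# Minimal sketch of a functionally-grounded check: Dice overlap between a
# thresholded saliency map and a manual annotation of equal shape.
import numpy as np

def dice(saliency, annotation, threshold=0.5):
    pred = saliency >= threshold            # binarize the heatmap
    gt = annotation.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2 * inter / (pred.sum() + gt.sum() + 1e-9)

# Sweeping the threshold and reporting the best or average Dice makes the
# comparison less sensitive to an arbitrary cut-off.
```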

Outlook
Since high-stakes decision making is intertwined with medicine, we are convinced that XAI will become increasingly important. We have investigated the trends and noticed that an increasing number of papers take a holistic approach, combining multiple forms of explanation. Examples of such holistic approaches include combinations of textual explanation and visual explanation (e.g. Graziani et al. (2020)).

Other directions for XAI in medical image analysis may include the link between causality and XAI. Typical medical image analysis captures correlation rather than causation. Causality describes the relation between cause and effect, and can be described mathematically (Pearl (2009)).

There is no consensus on a priori estimations of the required sample size for XAI, or for deep learning in medical imaging in general (Balki et al. (2019)). Given the costly nature of acquiring medical imaging datasets in terms of money, time, and patient burden, guidelines describing the minimum sample sizes required for particular XAI techniques would be desirable.

Limitations
We derived our XAI framework from the frameworks of Adadi and Berrada (2018) and Murdoch et al. (2019). Other frameworks also exist, such as the framework by Kim et al. that divides XAI into pre-, during-, and post-model explanation. During- and post-model explanation are captured by our XAI framework as model-based and post hoc explanation. Pre-model explanation mainly focuses on the structure of a dataset, such as inspecting outliers. One could argue that an example-based explanation that utilizes the latent distributions of a dataset could be perceived as pre-model explanation. We have, however, not made this distinction, since in deep learning these latent distributions are discovered by training a neural network.
We tried to be as comprehensive as possible in the inclusion of papers in our survey. However, XAI is often a technique used to support other methods, and XAI-related keywords are often not mentioned in the title or body of papers (Rudin (2019)). Therefore, we cannot guarantee that we covered all work in the field.
Nevertheless, we provided the search strategy in the appendix to be as transparent as possible about the selection of papers.

Conclusion
This paper surveyed 223 papers using explainable artificial intelligence (XAI) in deep learning-based medical image analysis, classified according to an XAI framework, and categorized according to anatomical location and imaging technique. The paper discussed how to evaluate XAI, current critiques of XAI, and future perspectives for XAI in medical image analysis.

Additional information
This work was partially funded by the Dutch Cancer Society (KWF) grant number: 10755. We have no conflicts of interest.

Appendix
We used the search query "(explainable deep learning OR interpretable deep learning OR XAI OR interpretable machine learning OR explainable machine learning) AND (medical imaging OR medical image analysis)" in SCOPUS. We analyzed the query results using the Active learning for Systematic Reviews toolbox. This toolbox uses active learning to sort papers from most relevant to least relevant, while being updated by user input. Furthermore, we had discussions with colleagues, and used a snowballing approach (investigating papers referenced by the included papers and papers that refer to the included papers). We read the title and abstract of each of these papers, and browsed the paper content if we were not sure whether to include a paper. In case of multiple publications by the same authors on the same subject, we chose the journal publication, or the most recent publication in case of multiple conference publications. We included peer-reviewed journal papers and conference proceedings. Papers up to October 2020 are included in the survey.

References