The enlightening role of explainable artificial intelligence in medical & healthcare domains: A systematic literature review

to explore the use of these techniques in the medical domain.


Introduction
Despite our tendency to hold unrealistic short-term expectations for Artificial Intelligence (AI), the future looks promising. Recent advancements in different fields of AI, especially in Machine Learning, are a major reason why AI is poised to take a central role in our lives. We are just scratching the surface in utilizing deep learning to solve major issues in areas such as e-commerce, the airline industry, warfare, medical diagnoses, and almost all other aspects of human life. AI has made remarkable advancements in the past decade, largely due to unprecedented funding, as well as AI experts' promises to convert narrow AI to artificial general intelligence which can pass the Turing test in every routine task that humans can do seamlessly.
Since the emergence of AI, humans have been fearful that it could take full control and dominate us. This fear is compounded by the fact that it is often difficult to fully understand how AI algorithms operate. The recent revival of neural networks has shown remarkable results, but they function like a black box. A well-trained neural network can mimic human behavior, but how the weights and biases it learns through gradient descent combine to produce a particular output is not fully understood, leading to limited control over the algorithm. This is a concerning issue, as we may know what the algorithm is doing, but we cannot explain how it is doing it.
To address the concerns about the opacity of AI algorithms, a new field called Explainable Artificial Intelligence (XAI) has emerged. It encompasses a range of tools and frameworks aimed at helping humans understand and interpret the workings of AI models. The insight that XAI provides into the workings of AI algorithms is valuable across all fields, but it is especially crucial in the medical and healthcare domain, where human lives are at stake.

https://doi.org/10.1016/j.compbiomed.2023.107555
Received 7 May 2023; Received in revised form 13 August 2023; Accepted 28 September 2023
S. Ali et al.

Background
AI and XAI have made great advancements in the medical and healthcare domains. Contributions are being made at a fast pace in the XAI field as a whole, as well as in XAI for the medical and healthcare domain. When used in the medical domain, XAI can help remove the barrier of distrust between clinicians and AI results.
XAI is a field that provides explanations for the results derived from AI models, or for the way a model reached a result or decision. The goal of XAI is to create transparent and trustworthy AI systems that can be integrated into human decision-making in a supplementary way. Although XAI is a relatively new field, it has gained a lot of attention in recent years due to the need for AI, and more specifically transparent AI, in various fields.
Healthcare and medicine are broad categories, as they include the diagnosis, prevention, and treatment of individuals with diseases. There are several domains where it becomes tedious for a clinician to examine the results manually, e.g., examinations of X-rays, Magnetic Resonance Imaging (MRI), Computed Tomography (CT) scans, and ultrasounds. Diagnosis is not limited to image data but extends to text data as well. For mental health problems, many studies have used textual data to diagnose depression and other mental health issues [1][2][3][4]. Similarly, prevention and treatment also become laborious for healthcare practitioners. Prevention requires an early diagnosis and treatment requires an accurate diagnosis, both of which can be achieved if trustworthy and transparent AI models are used to support the diagnosis.
The major contributions of this article are as follows:
• Conducted an extensive Systematic Literature Review (SLR) of published articles on XAI for the medical and healthcare domains.
• Identified widely used models and a taxonomy of the datasets of the domain.
• Reported the literature of 93 studies, employing rigorous filtering criteria following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework.
• Discussed limitations, advantages, and future research directions.
The rest of the article is structured as follows: after an introduction to XAI in the medical domain, Section 2 discusses recent surveys in the domain, followed by Section 3, which discusses the material and methodology of the manuscript. Section 4 presents the results of the study, and finally Section 7 concludes the study.

Uncertainty of CNN model predictions
In the last few years, Convolutional Neural Networks (CNNs) have shown remarkable performance in several medical and healthcare applications, including the classification of COVID-19 X-ray images and the diagnosis of COVID-19. However, it is essential to consider the uncertainty associated with CNN predictions to ensure reliable and trustworthy results. This section explores the role of uncertainty-aware CNN models in the medical and healthcare domain.
One notable study, conducted by Gour et al. [gour2022uncertainty], proposes an uncertainty-aware CNN model specifically designed for COVID-19 X-ray image classification. The authors recognized the importance of uncertainty estimation in this critical task and developed a framework that not only focuses on accurate predictions but also quantifies the uncertainty associated with each prediction. By incorporating uncertainty into the model, they aimed to provide more reliable and interpretable results for medical professionals.
Another related paper, by Shamsi et al. [ss2021khosravi], presents an uncertainty-aware transfer learning-based framework for COVID-19 diagnosis. The authors identified the challenges associated with limited labeled data and leveraged transfer learning techniques to improve the model's performance. Additionally, they included uncertainty estimation in order to provide insights into the reliability of the model's predictions. The framework aimed to assist healthcare professionals by providing not only accurate diagnoses but also information about the uncertainty associated with those diagnoses.
Both studies highlight the significance of considering uncertainty in CNN models for medical and healthcare applications. Uncertainty estimation allows for a more comprehensive understanding of the model's predictions, helping healthcare professionals make informed decisions. In particular, the uncertainty attached to an individual prediction indicates how reliable that result is and how much confidence can be placed in it.
Uncertainty in CNN models' predictions can be attributed to various factors. One factor is the inherent complexity of medical data, including variations and overlaps between different diseases or conditions. Additionally, limited or imbalanced training data can contribute to uncertainty, as the model may encounter instances that differ significantly from the training distribution. Furthermore, ambiguity or noise in the input data can also introduce uncertainty into the predictions.
To address the uncertainty associated with CNN models' predictions, several approaches have been proposed. These include Bayesian neural networks, Monte Carlo dropout, and ensemble methods. These techniques allow the model to generate multiple predictions or probability distributions, providing a measure of uncertainty along with the final prediction. By considering the uncertainty, medical professionals can make more informed decisions and better understand the limitations of the model.
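As a minimal sketch of how multiple stochastic predictions can be turned into an uncertainty score, the example below averages class probabilities over several dropout-perturbed forward passes and reports the predictive entropy. The toy linear classifier, the dropout rate, and the name `mc_dropout_predict` are illustrative assumptions; they are not the models used in the studies discussed above.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_predict(x, weights, n_passes=50, drop_rate=0.5, seed=0):
    """Average class probabilities over stochastic forward passes.

    x: (n_features,) input; weights: (n_features, n_classes) toy linear model.
    Returns the mean class probabilities and their predictive entropy.
    """
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_passes):
        # Randomly drop input features and rescale the kept ones (inverted dropout).
        mask = rng.random(x.shape) >= drop_rate
        x_dropped = x * mask / (1.0 - drop_rate)
        probs.append(softmax(x_dropped @ weights))
    mean_probs = np.mean(probs, axis=0)
    # Higher entropy means the averaged prediction is less certain.
    entropy = -np.sum(mean_probs * np.log(mean_probs + 1e-12))
    return mean_probs, entropy

# Hypothetical input and weights, chosen only for illustration.
x = np.array([1.0, 0.5, -0.3, 2.0])
weights = np.array([[0.8, -0.2], [0.1, 0.4], [-0.5, 0.3], [0.6, -0.1]])
mean_probs, entropy = mc_dropout_predict(x, weights)
print("mean class probabilities:", mean_probs)
print("predictive entropy:", entropy)
```

In a clinical setting, a prediction whose entropy exceeds some agreed threshold could be flagged for manual review rather than reported directly; the choice of threshold is outside the scope of this sketch.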
In conclusion, incorporating uncertainty-aware CNN models in medical and healthcare domains is crucial for reliable and trustworthy predictions. The studies discussed, [gour2022uncertainty] and [ss2021khosravi], exemplify the efforts made to quantify uncertainty in the context of COVID-19 X-ray image classification and diagnosis. Uncertainty estimation in CNN models' predictions helps healthcare professionals interpret the results, make informed decisions, and understand the limitations and confidence level associated with the model's outputs. By incorporating uncertainty-aware approaches, the medical and healthcare community can leverage the benefits of CNN models while ensuring the reliability and transparency of their predictions.

Related surveys
XAI and healthcare are both hot topics for research nowadays. Since researchers are actively contributing to both fields, there are numerous surveys that fall under the domain of XAI for medical and healthcare. Of the 11 surveys that we found, four belonged to sub-fields of the healthcare domain: a survey on epilepsy detection [5], X-ray image analysis [6], predictive modeling in healthcare [7], and clinical decision support systems [8]. Three of them discussed the benefits and/or applications of XAI in medical and healthcare but were not directly linked with the healthcare domain: [9] discussed augmentation approaches used in XAI for medical informatics, [10] investigated interactive visualization, which can be beneficial for XAI in different domains such as medicine and agriculture, and [11] examined the application of XAI in different fields, i.e., Natural Language Processing (NLP), biomedicine, and malware classification. Lastly, a mapping study identified the interpretability techniques used in medicine [12].
To the best of our knowledge, three surveys relate directly to XAI for the medical or healthcare domain [13][14][15]. Korica et al. [13] presented a synthesized taxonomy for categorizing explainability methods and a summary of gaps, challenges, and opportunities for applying XAI in the medical industry, based on a field survey. Chakrobartty et al. [14] conducted a literature survey on the same topic; they covered 22 studies published during the 2008-2020 period and searched PubMed only. They tried to identify the existing techniques and methods used in the medical domain. A limitation of their work is that it relied on a single database, PubMed, to retrieve papers. Nazar et al. [15] conducted an SLR on the use of AI, XAI, and Human-Computer Interaction (HCI) in the medical domain. They covered 135 publications, published during the 2016-2021 timeframe and retrieved from various search engines. Their work [15] collectively examines the applications and challenges of XAI, AI, and HCI for medical and healthcare.
It is interesting to note that all the surveys considered for this study have been conducted in a specific domain related to medicine and XAI. In contrast, the survey that we are conducting covers almost all medical domains as well as all the XAI algorithms that are relevant to medical and healthcare.
The XAI algorithms used in the publications are explained in detail in the following subsections.

LIME (local interpretable model-agnostic explanations)
LIME was introduced in 2016 by Ribeiro et al. [16]. This methodology serves as a model-agnostic XAI algorithm. The term ''model-agnostic'' implies that LIME can be applied to explain the workings of a wide array of machine learning or deep learning algorithms, regardless of their specific characteristics or complexity.
LIME operates by generating localized explanations centered on individual predictions, using interpretable models. These explanations shed light on the factors contributing to a particular prediction made by a given machine learning or deep learning model. The underlying concept is to approximate the behavior of the complex original model with a smaller, more comprehensible model, known as an interpretable model, that is fitted specifically in the neighborhood of the prediction in question.
This makes LIME highly versatile and adaptable: it can provide explanations not only for prevalent models like deep neural networks, random forests, and gradient boosting but also for any other conceivable machine learning model, which is why it is referred to as ''model-agnostic''. For a practical illustration of LIME's utility, consider the scenario in Fig. 1. Here, the primary model has predicted that a patient has the flu. LIME then explains that symptoms such as sneezing and headache, taken from the patient's medical history, contributed to the model's ''flu'' diagnosis. This illustrates how a medical professional can make well-informed decisions by using the insights provided by XAI in the context of a single prediction.
It is worth noting that LIME is intentionally designed to explain individual predictions. The key advantage of its model-agnostic nature is that it can be combined with a very wide range of models, even when dealing with intricate predictions in high-dimensional feature spaces.
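The local-surrogate idea can be sketched in a few lines: perturb the instance, query the black-box model, weight the perturbed samples by proximity, and fit a weighted linear model whose coefficients act as the explanation. This is a simplified illustration under stated assumptions, not the reference LIME implementation; the Gaussian perturbation, the kernel width, and the toy `black_box` function are all hypothetical choices.

```python
import numpy as np

def local_surrogate(predict_fn, instance, n_samples=2000, kernel_width=0.75, seed=0):
    """Fit a proximity-weighted linear model around one instance.

    Returns one coefficient per feature; larger magnitude means more local influence.
    """
    rng = np.random.default_rng(seed)
    # Sample the neighbourhood of the instance with Gaussian noise (assumed scheme).
    samples = instance + rng.normal(scale=1.0, size=(n_samples, instance.size))
    targets = predict_fn(samples)
    # Exponential kernel: samples close to the instance get larger weights.
    dists = np.linalg.norm(samples - instance, axis=1)
    w = np.exp(-(dists ** 2) / (kernel_width ** 2))
    # Weighted least squares with an intercept column.
    X = np.hstack([np.ones((n_samples, 1)), samples])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], targets * sw, rcond=None)
    return coef[1:]  # drop the intercept

def black_box(X):
    # Hypothetical model: strong dependence on feature 0, weak on feature 1,
    # none on feature 2.
    return 1.0 / (1.0 + np.exp(-(3.0 * X[:, 0] + 0.5 * X[:, 1])))

instance = np.array([0.1, -0.2, 1.0])
local_weights = local_surrogate(black_box, instance)
print("local feature weights:", local_weights)
```

The surrogate only needs the black box's predictions, never its internals, which is what makes the approach model-agnostic.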

SHAP (SHapley additive exPlanations)
Lundberg et al. [17] introduced SHAP, a methodology aimed at unifying approaches to model interpretability. The primary objective of SHAP is to provide a comprehensive way of rendering complex models interpretable, thereby helping a broader community of researchers comprehend the inner workings of machine learning or deep learning models.
Within XAI, selecting the most suitable algorithm for a specific model type has proved to be a formidable task. To overcome this hurdle, Lundberg and colleagues devised SHAP, a framework that assigns an importance value to each feature for a particular prediction [17]. By quantifying the importance of each feature, SHAP helps identify the key factors exerting the most substantial influence on a given prediction.
SHAP is a versatile XAI algorithm that can interface with a diverse array of deep learning or tree-based ML algorithms. Its applicability does not depend on the complexity or type of the model, which makes it usable in a wide spectrum of scenarios.
Moreover, SHAP remains effective in scenarios where a large number of features interact. It has been reported that SHAP can yield better results than alternative methods in such scenarios. This highlights SHAP's ability to disentangle intricate relationships and to support a more nuanced comprehension of the factors driving predictions.
In summary, the introduction of SHAP by Lundberg and collaborators addressed a critical need within the XAI landscape. By providing a unified approach to model interpretability, SHAP enables researchers to examine the inner workings of complex machine learning or deep learning models beyond the limitations of conventional XAI methods.
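The importance values that SHAP assigns are Shapley values, which can be computed exactly for a very small model by enumerating every feature subset. The sketch below does this by brute force; it is not the SHAP library, which relies on approximations to stay tractable. The toy `model`, the patient vector `x`, and the all-zero `baseline` used for absent features are assumptions made for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values; value_fn maps a set of present feature indices to a payoff."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley formula.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                present = set(subset)
                # Marginal contribution of feature i when added to the coalition.
                phi[i] += weight * (value_fn(present | {i}) - value_fn(present))
    return phi

# Hypothetical toy model with one interaction term between features 0 and 2.
x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]

def model(v):
    return 2.0 * v[0] + 1.0 * v[1] + v[0] * v[2]

def value_fn(present):
    # Features outside the coalition are replaced by their baseline value.
    v = [x[j] if j in present else baseline[j] for j in range(len(x))]
    return model(v)

phi = shapley_values(value_fn, len(x))
print([round(p, 6) for p in phi])  # approximately [3.5, 2.0, 1.5]
```

The values add up to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is split evenly between the two features involved.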

CAM (class activation mapping)
CAM is a specialized XAI technique for interpreting deep learning-based computer vision models. It offers insight into the decision-making processes of neural networks, particularly Convolutional Neural Networks (CNNs) [18]. At its core, CAM generates class activation maps by means of global average pooling: the feature maps of the last convolutional layer are combined using the weights that connect their pooled values to the predicted class. The resulting map reveals the regions of an image that contributed most to the network's prediction, shedding light on the salient features and regions that the network has homed in on during its decision-making process.
CAM's usefulness extends beyond general computer vision. It is also applied in medical imaging, where deep learning-based models hold immense promise. With CAM, the predictions of deep learning models used in medical imaging can be examined and put in context. This application strengthens the understanding of predictive outcomes and supports informed medical decision-making.
In a nutshell, CAM is an XAI technique that caters specifically to computer vision models. Its use for interpreting the predictions of deep learning-based models, both in general computer vision and in medical imaging, underscores its role in exposing what modern neural networks rely on.
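A class activation map is a weighted sum of the last convolutional layer's feature maps, using the weights of the linear layer that follows global average pooling. The sketch below computes this on random arrays standing in for a trained network; the shapes, the random values, and the helper names are illustrative assumptions.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """M_c(x, y) = sum_k w_k^c * f_k(x, y).

    feature_maps: (K, H, W) outputs of the last convolutional layer.
    fc_weights: (n_classes, K) weights of the linear layer after global average pooling.
    """
    return np.tensordot(fc_weights[class_idx], feature_maps, axes=1)

def normalize(cam):
    # Rescale to [0, 1] so the map can be overlaid on the image as a heatmap.
    span = cam.max() - cam.min()
    return (cam - cam.min()) / span if span > 0 else np.zeros_like(cam)

# Random stand-ins for a trained network's activations and weights.
rng = np.random.default_rng(0)
feature_maps = rng.random((4, 7, 7))
fc_weights = rng.normal(size=(2, 4))

cam = class_activation_map(feature_maps, fc_weights, class_idx=1)
heatmap = normalize(cam)

# The class score (before softmax, ignoring any bias) is the spatial mean of the map.
class_score = fc_weights[1] @ feature_maps.mean(axis=(1, 2))
print(np.isclose(cam.mean(), class_score))
```

In practice the map is upsampled to the input resolution before being overlaid on the image; that step is omitted here.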

Grad-CAM (gradient-weighted class activation mapping)
Grad-CAM is a direct descendant of the CAM methodology and extends its interpretability capabilities for the intricate predictions produced by deep learning-based models. The concept underlying CAM serves as the foundation on which Grad-CAM is built, with the novel twist that it leverages gradients [19]. By using gradient information, Grad-CAM sheds light on the decision-making processes behind the complex predictions of deep learning models.
In practice, Grad-CAM produces coarse localization maps that delineate the critical regions within an image. These maps direct the observer's attention to the areas of the image that contributed substantively to the model's prediction for a specific concept. This marks a significant step forward compared to its predecessor CAM, as Grad-CAM refines the visualization, offering a more nuanced and accurate representation of the regions with the greatest influence on the prediction.
One of Grad-CAM's notable attributes is its aptitude for dealing with high-resolution images. The technique performs well on these information-rich images, making it a useful tool for scenarios that demand a granular understanding of complex predictions.
Worth noting is the common thread linking CAM and Grad-CAM: both weight the feature maps of a convolutional layer to highlight the regions of an input image that drove the model's prediction. They differ in where the weights come from. CAM takes them from the layer that follows global average pooling, whereas Grad-CAM derives them from the gradients of the class score with respect to the feature maps. This underscores the role of gradients in Grad-CAM's account of the decision-making processes of deep learning-based models.
In summary, Grad-CAM represents an evolution of the CAM lineage, empowered by gradients. By combining visualization and gradient analysis, Grad-CAM enables a more profound and accurate comprehension of complex deep learning-based model predictions. Its utility on high-resolution images gives it an important position within the ever-expanding toolkit of eXplainable Artificial Intelligence.
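The Grad-CAM computation itself is short: average the gradients of the class score over each feature map to obtain one weight per map, take the weighted sum of the maps, and keep only the positive part. The sketch below assumes the activations and gradients have already been obtained from a network. Here they come from a hypothetical toy score whose gradients are known in closed form, so no deep learning framework is needed.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Coarse localization map from activations and class-score gradients.

    activations, gradients: arrays of shape (K, H, W).
    """
    # One importance weight per feature map: spatially averaged gradient.
    alphas = gradients.mean(axis=(1, 2))
    cam = np.tensordot(alphas, activations, axes=1)
    # ReLU keeps only regions with a positive influence on the class score.
    return np.maximum(cam, 0.0)

rng = np.random.default_rng(0)
activations = rng.random((4, 7, 7))
H, W = activations.shape[1:]

# Toy class score y = sum_k w_k * mean(A_k), so dy/dA_k[i, j] = w_k / (H * W).
w = np.array([0.9, -0.4, 0.2, 0.5])
gradients = np.broadcast_to((w / (H * W))[:, None, None], activations.shape)

heatmap = grad_cam(activations, gradients)
print(heatmap.shape, float(heatmap.max()))
```

For this toy score, which mimics a network ending in global average pooling and a linear layer, the Grad-CAM weights are the linear weights scaled by a constant, so the map matches the rectified CAM up to that constant.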

Counterfactual explanations
Counterfactual explanations, also known as ''what-if'' explanations, are a type of explanation used to understand the result, prediction, or outcome of an AI or ML algorithm. As the name suggests, they analyze alternative scenarios, referred to as counterfactuals: hypothetical situations in which some input variables/features are changed while the rest of the model or system is kept unchanged. By systematically altering the input features and observing the resulting changes in the output, counterfactual explanations help to identify the key features or factors that influenced the prediction. The paper titled ''Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR'' by Wachter, Mittelstadt, and Russell, published in 2017 [20], discusses the importance of counterfactual explanations for automated decision-making systems and their relevance in the context of the European Union's General Data Protection Regulation (GDPR).
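The ''what-if'' procedure can be sketched as a search that changes one feature step by step until the prediction flips. The scoring rule, the threshold, and the patient record below are hypothetical and serve only to show the mechanics; practical methods search over several features at once and minimize the size of the change.

```python
def predict(features):
    # Hypothetical integer risk score; not a clinical rule.
    score = 3 * features["age"] + 2 * features["systolic_bp"] + 50 * features["smoker"]
    return "high risk" if score >= 500 else "low risk"

def find_counterfactual(features, name, step, max_steps=100):
    """Change one feature in fixed steps until the prediction flips.

    Returns the modified record, or None if no flip occurs within max_steps.
    """
    original = predict(features)
    candidate = dict(features)  # copy, so the patient record is left untouched
    for _ in range(max_steps):
        candidate[name] += step
        if predict(candidate) != original:
            return candidate
    return None

patient = {"age": 60, "systolic_bp": 150, "smoker": 1}
print(predict(patient))                                    # high risk
cf_bp = find_counterfactual(patient, "systolic_bp", step=-1)
print(cf_bp)       # same patient with systolic_bp lowered to 134
cf_smoker = find_counterfactual(patient, "smoker", step=-1)
print(cf_smoker)   # same patient with smoker set to 0
```

Each result reads as a statement of the form ''had this feature taken this value, the prediction would have been different'', and it requires no access to the model's internals.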

Anchors
Anchors [21] are rule-based explanations that highlight the features most important to a prediction made by an ML algorithm. The main idea behind anchors is to find a minimal set of conditions that, when satisfied, are very likely to yield a specific prediction. These conditions are expressed as simple and intuitive rules, making them accessible to non-experts. Anchors work by generating concise and interpretable ''if-then'' rules that explain why a machine learning model made a specific prediction for a given instance. The process involves iteratively perturbing features and observing the model's response to identify a minimal set of conditions that is sufficient, with high probability, for the prediction. Anchors thus provide local explanations: they focus on individual predictions rather than the overall behavior of a model, offering a transparent way to understand the model's decision-making process.
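The search for such a rule can be sketched as a greedy loop that keeps fixing features of the instance until the prediction stays the same for nearly all perturbations of the remaining features. This is a simplified illustration: it enumerates every binary perturbation exactly, whereas practical anchor methods estimate the precision from samples. The toy `model` and its symptom features are hypothetical.

```python
from itertools import product

def model(x):
    # Hypothetical classifier over (fever, cough, headache): predicts flu (1)
    # only when fever and cough are both present.
    return int(x[0] == 1 and x[1] == 1)

def precision(anchor, instance):
    """Share of perturbed inputs that satisfy the anchor and keep the instance's prediction."""
    target = model(instance)
    matches = [x for x in product([0, 1], repeat=len(instance))
               if all(x[i] == instance[i] for i in anchor)]
    return sum(model(x) == target for x in matches) / len(matches)

def greedy_anchor(instance, threshold=0.95):
    """Add the feature that raises precision most until the threshold is met."""
    anchor = set()
    while precision(anchor, instance) < threshold:
        remaining = [i for i in range(len(instance)) if i not in anchor]
        best = max(remaining, key=lambda i: precision(anchor | {i}, instance))
        anchor.add(best)
    return anchor

instance = (1, 1, 0)  # fever, cough, no headache
anchor = greedy_anchor(instance)
print(sorted(anchor), precision(anchor, instance))  # [0, 1] 1.0
```

The result corresponds to the rule ''if fever and cough are present, then predict flu'', which holds regardless of the headache feature.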

Materials and methodology
In this study, we aim to investigate the state of the art of XAI for the medical and healthcare domains. We perform a systematic review using the mapping study or scoping study technique [23]. We have conducted a comprehensive review of the literature in the research domain and have identified the techniques, datasets, performance metrics, and algorithms used in the literature. This study follows the guidelines proposed by Kitchenham et al. [24] and includes the following phases:
1. Specifying research questions
2. Search strategy
3. Identification of primary studies
4. Data extraction
5. Threats to validity

Research questions
The key research question for this study was to find the state-of-the-art technologies, algorithms, datasets, and evaluation metrics for XAI in the medical and healthcare domain. To allow an in-depth analysis in this systematic mapping review, the key question is further divided into four research questions, which are listed in Table 1. These research questions set the direction and road map for this study and will help readers understand the structure of this work.

Search strategy
To find literature on XAI for the medical and healthcare domains, we searched four well-known online databases, i.e., Web of Science, Scopus, IEEE Xplore, and ACM Digital Library. To form our search string, we used three groups of keywords, as listed in Table 2. In group A we initially also considered the keywords ''explainability'' and ''interpretable'', but these keywords did not return good-quality works and we therefore dropped them from the query. To be inclusive and structured, we used the same keywords on all four databases and searched for them in the title, abstract, and author keywords. The search returned 454 results, the distribution of which can be seen in Fig. 2.
We got 231 papers from Scopus which is the highest number, 107 from Web of Science, 97 from IEEE Xplore, and 19 papers from ACM Digital Library.

Identification of primary studies
The search string was applied to the different digital databases to fetch the relevant results. The data was extracted by applying the search string to the title, keywords, and abstract, along with filters for year, article type (only conference papers and journal articles), and language; as a result, we obtained 454 articles. After removing duplicates, we were left with 239 articles. We then applied the exclusion criteria listed in Table 3 while reading the title and abstract of each paper; after removing 87 papers, including 11 survey papers, 152 papers were selected for full-text screening. During full-text screening, we also applied the inclusion criteria listed in Table 3 and included papers with scientific rigor, credibility, and relevance. There was doubt about including 17 papers, so three authors took part in an anonymous majority vote, on the basis of which two papers were removed. After majority voting and full-text screening, 93 studies were selected for the systematic review using the PRISMA protocol. Fig. 3 is the PRISMA diagram, which shows the process from fetching all the data to filtering the relevant data.
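The counts reported for the selection process can be checked with simple arithmetic. The snippet below only restates figures given in the text (the per-database counts come from the search strategy); the variable names are illustrative.

```python
# Records retrieved per database, as reported in the search strategy.
retrieved = {
    "Scopus": 231,
    "Web of Science": 107,
    "IEEE Xplore": 97,
    "ACM Digital Library": 19,
}
total_retrieved = sum(retrieved.values())

after_duplicates = 239        # records left after duplicate removal
excluded_on_abstract = 87     # removed at title/abstract screening (incl. 11 surveys)
full_text = after_duplicates - excluded_on_abstract
included = 93                 # studies in the final review
removed_at_full_text = full_text - included

print(total_retrieved)                     # 454
print(total_retrieved - after_duplicates)  # 215 duplicates removed
print(full_text)                           # 152
print(removed_at_full_text)                # 59
```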

Identification
We retrieved 454 studies in total when we searched four well-known databases (Scopus, Web of Science, IEEE Xplore, and ACM Digital Library). Fig. 4 shows the number of studies found initially and the number selected from each database. Scopus was the main contributor, with 66 of its 231 publications selected. Similarly, 19 of the 97 studies from IEEE Xplore were considered, and only six studies from WoS were considered, as most of its studies were removed as duplicates of Scopus records. Lastly, we selected two studies from ACM that were relevant to our topic.

Screening
In this step, we filtered out studies according to our inclusion and exclusion criteria, as shown in Table 3. We first discarded the duplicate papers, then the papers that were not in English, and the papers that were outside our time span (2018-2022). Further, we discarded the papers that were not related to the medical domain and the papers that were not related to XAI, Machine Learning (ML), Deep Learning (DL), and AI. We also excluded the survey papers, 11 in total, as well as the papers that did not meet the quality assessment criteria. The assessment criteria comprise three factors, which are defined below.

Scientific rigor. If an appropriate research methodology has been applied in the paper, it is considered to be scientifically rigorous.
Credibility. If the research is believable and the findings are accurate and well presented, the paper is considered credible.
Relevance. If the findings of a paper are relevant to the academic community and actors in the medical/healthcare domain, it is considered to be relevant.

Eligibility
Applying the exclusion criteria removed 87 papers. We then included studies based on scientific rigor, credibility, and relevance. Out of the 454 fetched articles, 93 studies passed all the phases and were therefore included in this study.

Included studies
Finally, 93 studies were selected for the review. We had 152 studies for full-text screening; during full-text screening, we extracted data and also checked whether the studies met the quality assessment criteria defined in 3.3.2. We removed those studies that failed any of the three criteria (i.e., scientific rigor, credibility, and relevance). Also, we found 11 survey papers that were later excluded from this work. Three papers were removed because they were behind a paywall. Moreover, one paper was available only as an abstract, so we removed it as well. There were many studies that were related to XAI, but we excluded them because they did not belong to the medical domain. Some were in the medical domain but were not really related to XAI; they had merely used the keyword XAI in the author keywords or abstract. Tables 4-7 provide a detailed summary of all studies included.

Table 3
Inclusion and exclusion criteria.

Among the selected 93 publications, 59 are journal articles while 34 are conference or proceedings papers. This information is visualized in Fig. 5.
Fig. 6 shows the papers published per year in journals and conferences, respectively. We can see that 35 journal articles and 12 conference papers were published in 2022, 18 journal articles and 10 conference papers in 2021, and 6 journal articles and 9 conference papers in 2020, while in 2019 only 3 conference papers were published and no journal articles. This figure also shows an increasing trend in the number of publications, from which we deduce that XAI is gaining popularity over time. Fig. 7 shows the proportion of papers published by different publishers. IEEE is evidently the leading publisher for research in this domain: 40 papers were published by IEEE, while Springer, Elsevier, and MDPI were the second, third, and fourth choices of the researchers, respectively. Fig. 8 shows the proportion of papers published in journals and conferences with respect to publishers. From Fig. 8 we can see that 23 of the 40 papers published by IEEE were journal articles and 17 were conference papers. The information about other publishers is also available in this figure to help future researchers select a venue for their publication. It can also be deduced from this figure that researchers who wish to publish at conferences can keep IEEE, Springer, and International Society for Optics and Photonics (SPIE) conferences as their first choice, as the majority of conference papers are published in these venues.

Data extraction
In this section, we explain the data extraction process. The data was extracted from the 93 papers and logged in a tabulated Microsoft Excel spreadsheet. A unique identifier was assigned to each article, made up of the initial letter of its source followed by a serial number. We extracted the ''Datasets used'' in the study; the dataset link and source code link were also logged if they were mentioned in the paper. In addition, we extracted the ''Algorithms used'' in the study, the ''Impact factor'' of the publishing source for the year of publication, the ''Evaluation metrics'' in case any experiments were performed and assessed, the ''Application area'' that defines the medical domain in which the work is applicable, the ''Limitations'' of the proposed study, and the ''Tools used'', which represents any additional tools used alongside the XAI models. Table 8 provides a description of each element.

Threats to validity
Search String: The original query included the words ''XAI'', ''interpretability'', ''explainability'', and ''medical and healthcare''. As a result, we got around 1600 articles, which were too many and most of which were not relevant to the topic under study. Therefore, we omitted the words ''explainability'' and ''interpretability'' from our search query and went ahead with the remaining words. This omission might have caused us to miss some valuable articles related to our domain of study.

Selection of databases: The databases from which we selected articles for our study were Web of Science, Scopus, ACM Digital Library, and IEEE Xplore, for the sake of credibility and quality. Since our domain is medical and healthcare, it is possible that PubMed (a well-known database for medical research), which was not covered in this study, contains some valuable research.
Language Barrier: The papers were also filtered on the basis of language, and only papers written in English were selected for this study. We might have lost some valuable research due to the language barrier as well.
Time frame of the studies selected: We have only considered studies from the past five years; there may have been studies before that time that could make a beneficial contribution to our domain of study.

Results and discussions
In this section, we discuss our answers to the RQs presented in Table 1.

RQ1. What are the most common XAI algorithms/methods/tools used by researchers for the medical and healthcare domains?
The systematic literature review of XAI in the medical and healthcare domain identified multiple algorithms. We have divided these algorithms into two groups: XAI algorithms and ML algorithms.

XAI algorithms
LIME and SHAP are the two most used algorithms in XAI [117]. Although LIME and SHAP can produce similar results, the combination we saw most often (in 13 papers) uses both methods. The purpose of using LIME and SHAP together is to validate explanations.
SHAP and Grad-CAM are the second most used methods in these papers. Both SHAP and LIME are used to explain the predictions made by models that operate on image, tabular, or textual data [118]. Nine studies used only LIME, while SHAP and Grad-CAM were each used on their own in ten studies. We also observed that LIME is fairly flexible to implement: it generates a noisy dataset around the instance being explained, so it needs only one observation to compute an explanation [16]. SHAP, on the other hand, is more structured: it needs multiple observations, drawn from an entire sample, to compute its result [17]. A summary of the XAI algorithms used can be seen in Table 9.
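The point that SHAP draws on a whole sample rather than a single observation can be illustrated with a minimal sketch of the exact Shapley computation that SHAP builds on. This is not the SHAP library's API: the linear `model`, the three-row `background` sample, and the instance `x` are hypothetical values chosen for illustration, and enumerating every feature subset is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def model(features):
    # Hypothetical linear scoring model, used only for illustration.
    return 2.0 * features[0] + 3.0 * features[1] - 1.0 * features[2]


def coalition_value(predict, x, background, subset):
    # Average prediction when the features in `subset` take their values from x
    # and every other feature is taken from each background row in turn.
    total = 0.0
    for row in background:
        mixed = [x[j] if j in subset else row[j] for j in range(len(x))]
        total += predict(mixed)
    return total / len(background)


def shapley_values(predict, x, background):
    # Exact Shapley values: enumerate every subset of the remaining features.
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = coalition_value(predict, x, background, set(subset) | {i})
                without_i = coalition_value(predict, x, background, set(subset))
                phi[i] += weight * (with_i - without_i)
    return phi


# Hypothetical background sample (three observations) and instance to explain.
background = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [2.0, 4.0, 4.0]]
x = [4.0, 1.0, 5.0]

phi = shapley_values(model, x, background)
baseline = sum(model(row) for row in background) / len(background)
print("attributions:", phi)
print("prediction minus baseline:", model(x) - baseline)
```

The attributions sum to the difference between the prediction for `x` and the average prediction over the background sample, which is why a representative sample is needed before any single prediction can be explained this way.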

Machine learning & deep learning algorithms
Table 10 provides information about the machine learning and deep learning models used with XAI algorithms for research in medical and healthcare. From this table, we can infer that deep learning algorithms such as CNN, Deep Neural Network (DNN), VGG-16, and ResNet are widely used for building models, while XAI is used in conjunction with these deep learning models for better interpretation or explainability. This information can help future researchers decide which ML/DL models provide better performance when combined with XAI algorithms for applications in the medical domain.
Along with the XAI algorithms explained above, we found that a portion of researchers used additional tools to help present the outcomes of the XAI algorithms and to better explain their research. Since the domain here is medical and healthcare, most of the studies aimed to detect medical abnormalities, and many detections are made from X-rays or some other visual representation of the affected area of the human body. Not all of the studies used, or at least mentioned, additional tools, but most of those that did used heat maps, attention maps, or activation maps to visually highlight the areas that are important for the abnormality under study.

RQ2. What challenges and limitations have been faced or undertaken by researchers in these domains?
Most of the studies did not mention challenges and limitations; only 19 studies reported any. The most common challenge we observed concerned data. Some studies mentioned that the data was not sufficient to improve the performance of the model. The authors of [30] reported the challenge of large-scale data, as they were unable to train and test their model with a large-scale dataset acquired from different cohorts. The authors of [56] faced reduced accuracy of the aggregated model; because they used multiple devices for the collection and analysis of data, the chances of data poisoning attacks increased. For one study, language was a major challenge, as it used a dataset of Norwegian text only. In some studies, however, evaluation was considered the biggest challenge, because the authors lacked experts in the XAI and medical fields.

RQ3. Which datasets are being prominently and mostly used in research on explainable artificial intelligence for medical and healthcare?
In this literature review, the datasets used in the various studies play a pivotal role in driving advancements in AI research for medical applications. Among the extensive list of datasets presented in Tables 11-13, three datasets stand out as the most commonly used. Fig. 9 provides a taxonomy of the dataset types used in the reviewed studies. Below, we look at the dimensionality and quality of these three datasets, which is essential for assessing the potential benefits and challenges associated with explainability in medical AI research.

MIMIC dataset
The MIMIC dataset stands out as a widely used and versatile dataset in the medical domain, known for its comprehensive electronic health records (EHRs) of intensive care unit patients. To achieve explainability in AI models using MIMIC, the dimensionality of the data needs to be considered carefully. Analyzing the multitude of patient attributes present in each record provides insight into the complexity of the medical data handled by XAI techniques. Ensuring the quality of MIMIC is crucial for building reliable and transparent models. With EHRs, data accuracy and completeness are critical factors that can affect the explainability of AI algorithms. Furthermore, addressing potential biases present in the EHRs, such as demographic variations or the prevalence of particular medical conditions, is essential for developing fair and interpretable models in the medical XAI context.

Chest X-ray datasets
The chest X-ray datasets, such as CheXpert, are extensively used in medical XAI research for diagnosing various pulmonary anomalies. Understanding the dimensionality of the radiographic features and the richness of expert annotations in these datasets is essential for designing explainable AI models for chest X-ray analysis. In medical diagnosis, explainability is crucial to gaining trust and acceptance from medical professionals. Ensuring the quality of the chest X-ray datasets, including accurate annotations and addressing potential confounding factors, will pave the way for interpretable AI models that can provide meaningful insights to radiologists and aid in early disease detection and diagnosis.

Kvasir dataset
The Kvasir dataset is prominently used in medical XAI research for analyzing gastrointestinal endoscopy images. Understanding the dimensionality and diversity of endoscopic findings present in the Kvasir dataset is vital for building explainable AI models for gastroenterological applications. Interpretable AI in gastroenterology can assist clinicians in making informed decisions and improving patient care. Ensuring the quality of the Kvasir dataset, including reliable annotations and addressing potential challenges in endoscopic imaging, will enable the development of transparent and trustworthy AI systems for diagnosing gastrointestinal conditions.

[43] 0.01% TabNet, DFS, Bayesian Network [48] 0.01% ChexNet [70] 0.01% CIU [113] 0.01%
Ref | Dataset | Availability | Type
[38] | Artificial Simulacrum health dataset | Link | Medical records
[38] | UK Covid-19 patients data | Closed | Not specified
[39] | Data from the Simulacrum dataset from NCRAS, England was used | Link | Medical records
[51] | Toddler Dataset | Link | Medical records
[109] | Retinal Fundus Image Quality Assessment (RFIQA) dataset | Link | Image
[109] | EyePacs dataset | Link | Image
[28] | Pathological voice samples of people with vocal cord polyp and paralysis were obtained from an unknown source | Closed | Audio
[54] | Dataset of Heart Failure Patients from UCI | Link | Medical records
[67] | Cervical Cancer Risk Factors | Link | Medical records
[112] | Two in-house mammogram datasets: ''Data A'' and ''Data B''. Data A was gathered from four medical centers and Data B was acquired from a separate single medical center. | Closed | Image
[110] | UCI's Cleveland heart disease database and the Framingham Heart Study Repository | Link | Medical records
[111] | Cell Images for detecting Malaria | Link | Image
[99] | T1-weighted DCE-MRI scans from six institutions were collected and used [120] | | Image
[40] | Cholec80 | Link | Video
[43] | LIDC-IDRI dataset | Link | Image
[62] | Chest Xray Dataset collected by [121] | Link | Image
[76] | CT volume data from four hospitals in China (Private), CC-CCII | Open | Image
[78] | MRI Alzheimer brain image dataset | Open | Image
[33] | SPECT image dataset | Open | Image
[88] | EEG dementia diagnosis dataset by [122] | Open | Medical records
[79] | sEMG Dataset | Closed | Image
[57] | Breast Cancer Dataset by the University of California | Link | Medical records
[81] | T1 weighted MRI Dataset | Link |
[37] | An automated regular expression based searching was used to find potential veterans with PTSD from Twitter | |
[58] | A web-based survey conducted from July 13 to July 17, 2020 was used to collect data that is available on Kaggle | Link | Image
[82] | MIT-BIH Arrhythmia database | Open | Medical records
[94] | Infants' Functional near-infrared spectroscopies (fNIRS) were used | Link | Image
[83] | National Health and Nutrition Exam Survey (NHANES) III | Link | Text
[89] | MRI images of AD and microarray gene expression were used | Link | Image
[25] | Alzheimers Dataset | Link | Image
[32] | Breast Cancer Wisconsin (Diagnostic) Dataset | Link | Medical records
[35] | Patient clinical information for TCGA breast-invasive carcinoma cohort (BRCA) from two projects on the cbioPortal were used. | Link | Medical records
[35] | Clinical information for 1101 patients from Firehouse Legacy | Link | Medical records
[96] | APTOS 2019 Blindness Detection Dataset | Link | Medical records
[55] | Dementia Prediction w/ Tree-based Models Dataset | Link | Medical records

used for AI/ML algorithms used in combination with XAI for research in the medical/healthcare domain. This information can help researchers interested in XAI decide on evaluation metrics to use for benchmarking in their future research. These evaluation metrics can be used to check the performance of AI models accompanied by XAI algorithms. Furthermore, researchers can refer to the related studies to see how these performance metrics were used to evaluate experiments and how experiments in this direction can be improved with their help. Fig. 10 shows the frequency of performance metrics used to evaluate three Explainable Artificial Intelligence (XAI) techniques (LIME, SHAP, and Grad-CAM) in the context of medical and healthcare applications. The numbers represent the count of papers that employed each specific metric for evaluation. Accuracy was the most commonly used metric, with 15 papers assessing it for LIME, 10 for SHAP, and 3 for Grad-CAM. Precision and Recall were also frequently used, with 5 papers using Precision for LIME, 3 for SHAP, and 1 for Grad-CAM, and 5 papers employing Recall for LIME, 3 for SHAP, and 2 for Grad-CAM. F1-Score was used in 4 papers for LIME, 3 for SHAP, and 2 for Grad-CAM. Area Under Curve (AUC), a metric suitable for binary classification tasks, was assessed in 1 paper for LIME, 3 for SHAP, and 3 for Grad-CAM. Sensitivity and Specificity were both used in 2 papers for LIME, 3 for SHAP, and 3 for Grad-CAM. These numbers show that a range of performance metrics has been applied to evaluate the XAI techniques in medical and healthcare scenarios.

Metrics discussion:
In this section, we provide a comprehensive discussion on the performance metrics utilized to evaluate the Explainable Artificial Intelligence (XAI) techniques in the context of medical and healthcare applications. The selection of appropriate performance metrics is crucial in assessing the effectiveness and interpretability of AI models, which play a vital role in critical domains like healthcare.
Firstly, the metrics of Accuracy, Precision, Recall, and F1-Score have been widely used in evaluating the XAI techniques' performance. Accuracy measures the overall correctness of predictions, while Precision quantifies the ratio of true positive predictions to total positive predictions. Recall, also known as Sensitivity, represents the ability to correctly identify positive instances. The F1-Score is the harmonic mean of Precision and Recall, providing a balanced performance measure. These metrics are essential in evaluating the AI models' interpretability and their capability to identify relevant features in medical data, thereby aiding in informed decision-making by healthcare professionals.
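The four definitions above can be written out as a short sketch computed from confusion-matrix counts. The function name `classification_metrics` and the ten example labels are hypothetical; returning 0.0 when a denominator is zero is a convention chosen here, not something the reviewed studies specify.

```python
def classification_metrics(y_true, y_pred, positive=1):
    # Count the four confusion-matrix cells for a binary problem.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Precision: true positives over all positive predictions.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall (sensitivity): true positives over all actual positives.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 4 positive cases and 6 negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Note that all four quantities describe the predictive model; none of them measures the quality of an explanation directly.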
Secondly, the Area Under the Curve (AUC) has been employed as a performance metric, particularly in binary classification problems. AUC represents the ability of AI models to distinguish between positive and negative instances. In the medical context, where identifying critical conditions accurately is crucial, AUC serves as a valuable metric to gauge the model's effectiveness in making accurate predictions and achieving high interpretability.
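One way to make the "distinguish between positive and negative instances" reading concrete is the pairwise formulation of the ROC AUC: the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. The sketch below assumes binary labels coded as 1 and 0; the function name `roc_auc` and the example scores are hypothetical.

```python
def roc_auc(y_true, scores):
    # Split the scores by true class (labels assumed to be 1 or 0).
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")

    # Count pairs where the positive case is ranked above the negative one.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical scores: one positive case (0.4) is ranked below one negative case (0.5).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
print(roc_auc(y_true, scores))
```

A value of 1.0 means every positive case is scored above every negative case, while 0.5 corresponds to scores that carry no ranking information.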
Thirdly, the discussion extends to Sensitivity and Specificity, which are vital in evaluating the XAI techniques' ability to correctly identify
In summary, the selection of performance metrics is critical in evaluating the effectiveness and interpretability of XAI techniques in medical and healthcare applications. The metrics of Accuracy, Precision, Recall, F1-Score, AUC, Sensitivity, and Specificity provide valuable insights into the models' reliability and their ability to make accurate predictions while being interpretable. Acknowledging potential issues and limitations associated with certain metrics ensures a comprehensive evaluation, enabling the effective deployment of XAI in critical healthcare scenarios.

Open challenges and future directions
From the literature, we have found that most papers evaluate their models with the performance metrics commonly used for AI models, i.e., accuracy, recall, precision, and F1-Score, but there are no performance metrics specifically designed to evaluate the results of XAI algorithms. Because researchers have used AI performance metrics to evaluate XAI results, proposing evaluation metrics specific to XAI is an interesting direction for future researchers. Moreover, an XAI performance evaluation method dedicated to the medical and healthcare field could also be proposed; this should be done in collaboration with medical experts.

Another promising future direction is to assess the individual contribution of each XAI technique when more than one technique is used. When multiple explanation techniques are used in conjunction, they do not necessarily contribute equally to the final result. Each XAI technique should therefore be assigned a weight in the explanation: the weighted combination helps differentiate the contribution of each technique, and the amount a technique contributes to the final explanation determines its weight. Investigating the individual contributions of XAI techniques used together can thus be an interesting topic to work on.
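As a rough illustration of what such a weighted combination might look like, the sketch below scales each technique's feature attributions to a common range and then mixes them with per-technique weights. Everything here is an assumption made for illustration: the reviewed studies do not prescribe a scheme, the normalization by total absolute attribution is one choice among many, and the attribution values and weights are hypothetical.

```python
def normalize(attribution):
    # Scale one technique's attributions so their absolute values sum to 1,
    # making scores from different techniques comparable.
    total = sum(abs(v) for v in attribution.values())
    if total == 0:
        return {feature: 0.0 for feature in attribution}
    return {feature: v / total for feature, v in attribution.items()}


def combine_explanations(attributions, weights):
    # attributions: {technique: {feature: score}}
    # weights: {technique: non-negative weight}, rescaled here to sum to 1.
    weight_sum = sum(weights[name] for name in attributions)
    combined = {}
    for name, attribution in attributions.items():
        share = weights[name] / weight_sum
        for feature, score in normalize(attribution).items():
            combined[feature] = combined.get(feature, 0.0) + share * score
    return combined


# Hypothetical attributions for one prediction from two techniques.
attributions = {
    "LIME": {"age": 0.6, "bp": 0.2, "chol": -0.2},
    "SHAP": {"age": 2.0, "bp": 1.0, "chol": -1.0},
}
# Hypothetical weights: SHAP is trusted three times as much as LIME.
weights = {"LIME": 1.0, "SHAP": 3.0}

combined = combine_explanations(attributions, weights)
print(combined)
```

How the weights themselves should be chosen, for example from agreement with clinical experts, is exactly the open question raised above.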
Explaining the features and their importance through XAI could also help in understanding the models used in the medical and healthcare domains, but investigating the process behind the decision-making of these models is also a potential future direction. Moreover, thanks to XAI explanations, it is easy to determine the region of interest in different medical imaging problems. Medical experts can combine such explanations with their field knowledge to propose alternative diagnosis methods for different diseases.
Most pre-trained models, such as Inception and VGG16, require input images of a certain size in order to be used for classification. Medical images such as X-rays, scans, and MRIs, however, lose their quality and meaning when resized. A pre-trained model could be proposed that does not require images to be resized in order to classify them.
Also, when the dataset is small, there is a chance of the model overfitting. XAI explanations can help in understanding the important features underlying a particular prediction. XAI can help identify features that are unimportant and those that introduce noise into the training data, so that these problematic features can be removed to avoid model overfitting. Future researchers can study this area in depth to draw accurate conclusions about the contribution of XAI to avoiding model overfitting.
Labeling accuracy and efficiency are directly related to the quality of the initial training set. Some studies faced limitations in the feature selection method, which was found to be slightly more time-consuming, impacting the overall efficiency of the methodology. Additionally, the lack of prospective quality-of-life data in certain studies posed a limitation in understanding the long-term impacts of AI applications. One common limitation across several studies was the reliance on specific types of data, such as 2-D MRI images or solely using GRF signals for classification. This limited data variety may restrict the models' generalizability and applicability to a broader range of medical scenarios. However, some studies attempted to mitigate this limitation by training and testing their models on large-scale datasets acquired from different cohorts.
The use of federated learning raised concerns about data integrity and authentication due to the distributed nature of data collection and analysis across multiple devices. Furthermore, the assumption of homogeneous data and devices in some proposed frameworks may not hold true in real-world scenarios, potentially reducing the accuracy of aggregated models. For applications like telesurgical operations, researchers faced challenges in minimizing errors during virtual surgery control, ensuring real-time operation with minimal latency, and maintaining privacy and security during remote surgical procedures.
In certain studies, models encountered difficulties in generating captions for some words due to data preprocessing and the presence of unknown words, leading to limitations in caption generation capabilities. In the context of acute kidney injury (AKI) prediction, limitations arose from missing laboratory parameters, the absence of recorded etiology in the databases used, and the exclusive use of a single-centered dataset without external validation. An external multicenter validation is suggested to enhance the reliability of the models. While some studies claimed the adaptability of their models to other languages using translated features, it remains essential to consider potential limitations and differences in language-specific data.
Restrictions imposed by pre-trained models, such as fixed input sizes, required resizing of input data and may have impacted model performance in certain applications. Proposed solutions for foreground/background separation were observed to be limited to binary or multilabel classification, presenting challenges in directly applying them to multiclass classification scenarios. To improve model performance, researchers suggested utilizing larger training datasets and fine-tuning hyperparameters. However, the interpretability of models could be limited by equivalent weights assigned to explanations in certain combination frameworks. Scalability issues were observed with certain techniques, such as LORE, which hindered their application in larger-scale experiments. The data-hungriness of some medical AI models presented a significant challenge, making it difficult to transfer learned models from one task to another.
Lastly, limitations in literature coverage and the lack of proper evaluation of explainability in many medical XAI applications were noted, potentially hindering the adoption and understanding of these models by medical experts. The field of XAI currently lacks benchmark datasets. In other AI fields, such as classification, clustering, and segmentation, benchmark datasets are readily available, facilitating progress. However, in XAI, there is currently no benchmark dataset that researchers can use to establish a common platform and agreed-upon performance metrics. This absence of a benchmark dataset impedes the progress of the field. Therefore, the development of a proper benchmark dataset could prove to be a watershed moment in the field of XAI.

Practical implications of XAI in healthcare
Explainable Artificial Intelligence (XAI) has gained increasing attention in the healthcare domain due to its potential to improve patient outcomes and enhance medical decision-making. In this section, we discuss the practical implications of employing XAI techniques in healthcare settings, highlighting how its implementation has positively impacted medical practices.

Improved patient outcomes and informed decision-making:
One of the key practical implications of XAI in healthcare is its ability to improve patient outcomes by providing transparent insights into the decision process of AI models. XAI techniques, such as LIME, SHAP, and Grad-CAM, have enabled healthcare professionals to gain a deeper understanding of the features contributing to model predictions. This interpretability empowers clinicians to make more informed and confident decisions, leading to accurate diagnoses, optimized treatment plans, and better patient care.

Bias detection and mitigation:
Another practical implication of XAI in healthcare is its role in identifying potential biases in AI models. By revealing biases in the data and model outputs, XAI allows healthcare systems to address issues related to fairness and equity. This ensures that AI-powered healthcare interventions are more inclusive and provide equitable treatment for all patient groups.

Personalized medicine and tailored treatment plans:
XAI has facilitated the adoption of personalized medicine in healthcare. Through the transparent and interpretable nature of AI models, XAI enables healthcare professionals to tailor treatment plans to individual patients based on their unique medical history and characteristics. This personalized approach has led to improved treatment efficacy and patient satisfaction.

Reduced medical errors and early disease detection:
Documented cases and empirical evidence from various healthcare institutions indicate that the implementation of XAI has contributed to reducing medical errors and early detection of diseases. The interpretability of AI models allows healthcare providers to catch potential errors and identify anomalies in medical data, leading to timely interventions and better patient outcomes.

Case studies and empirical evidence:
A wealth of case studies and empirical evidence exists, demonstrating the practical impact of XAI in healthcare. These documented cases showcase instances where the implementation of XAI has significantly improved medical decision-making, diagnostic accuracy, and patient care.
The practical implications of XAI in healthcare are far-reaching and hold immense potential for transforming medical practices. The transparency and interpretability of AI models provided by XAI techniques have improved patient outcomes, facilitated personalized medicine, and reduced medical errors. This section highlights the real-world benefits of incorporating XAI in healthcare settings and underscores its significance in revolutionizing medical practices.

Conclusion
AI, and ML in particular, has been advancing rapidly over the last decade; however, relying completely on the results of AI-based algorithms is difficult in sensitive fields, especially the medical field. Thus, the field of XAI was introduced, which aims to explain the results derived by AI or ML models, that is, how a model has reached a certain conclusion. This increases the credibility of the models so that medical practitioners can use them to aid their manual practices. In this SLR, we targeted articles from the last five years that discussed or used XAI for this domain. We ended up with a total of 93 studies after a thorough selection process.
We extracted information such as the most common algorithms used in the domain of XAI for medical and healthcare, covering both ML and XAI algorithms. LIME was the most discussed algorithm and was used in most of the studies; since it was so prominent, we have also discussed how it works. After LIME came SHAP, CAM, and Grad-CAM, which we have also discussed in the Related Surveys section (Section 2). In addition, we observed the limitations and challenges reported in the studies. Moreover, we set out to find the datasets most commonly used in these studies and discovered that little can be concluded, as there was a lot of variation. However, COVID-19 X-rays from different regions of the world were used more commonly.
Only a few hospitals or clinics make their data available for research, and even when they do, it is shared only privately with the researchers. As a result, researchers in the medical domain commonly face a lack of medical image data. They have to use various techniques to combat this, such as using pre-trained models and generating synthetic images.

Fig. 1. This is an example of how LIME can help doctors make decisions. When explaining individual predictions, LIME can help identify the important features considered in the predictions so that doctors can make appropriate decisions.
Influence functions are statistical tools used for finding the influence of individual training samples or parameters on the outcome of a machine learning model. Influence functions measure how sensitive a model's predictions or parameters are to changes in the training data. They enable the identification of influential training examples that have a significant impact on the model's behavior. By quantifying the influence of each example, influence functions provide insights into which training instances contribute the most to the model's decision-making process. The calculation of influence functions involves computing the derivatives of the model's predictions or parameters with respect to changes in individual training examples. These derivatives capture how small perturbations in the training data affect the model's output. By analyzing these derivatives, one can determine the influence of each example on the model's behavior.
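The quantity being described can be made concrete with a brute-force sketch: remove each training example in turn, refit the model, and record how much a chosen prediction changes. Influence functions estimate this effect from derivatives instead of retraining; the code below performs the retraining directly, which is only practical for tiny models. The one-feature least-squares model, the five training points, and the test input are hypothetical.

```python
def fit_line(points):
    # Ordinary least squares for y = slope * x + intercept (closed form).
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_out_influence(points, x_test):
    # Change in the prediction at x_test when each training point is removed.
    slope, intercept = fit_line(points)
    base_prediction = slope * x_test + intercept
    influences = []
    for i in range(len(points)):
        reduced = points[:i] + points[i + 1:]
        s, b = fit_line(reduced)
        influences.append((s * x_test + b) - base_prediction)
    return influences


# Hypothetical training set: four points on the line y = x plus one outlier.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 10.0)]
influences = leave_one_out_influence(points, x_test=5.0)
most_influential = max(range(len(points)), key=lambda i: abs(influences[i]))
print(influences, most_influential)
```

In this toy example the outlier is the training point whose removal changes the test prediction the most, which is the kind of example an influence analysis is meant to surface.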
Inclusion criteria
• Papers having scientific rigor, credibility, and relevance
Exclusion criteria
• Not in the medical sector
• Not about explainable AI
• Not related to ML/DL/AI
• Excluded editorial materials, book chapters and reviews, letters, and retracted publications
• Survey Papers (n = 11)

Fig. 9. Taxonomy for the types of datasets mostly used. Numbers in brackets represent the articles where those datasets are used.

Table 1
Research questions.

Table 8
Elements of the study.
Elements | Description
Study ID | Source of paper and serial number
Impact factor | Impact factor of the journal if published in one.
Objectives | Objectives of the conducted study or experiment
Algorithms used | Which AI or XAI algorithms were used?
Tools used | Any additional tools that were used after XAI methods.
Application area | In which medical domain does the study intend to be applicable
Evaluation metrics | Which evaluation metrics were used if any experiments were conducted
Limitations | Any limitations of the study

Table 9
XAI algorithms used by publications.

Table 10
ML/DL algorithms used in publications.

Table 14
Performance Metrics and their usage.