Information fusion as an integrative cross-cutting enabler to achieve robust, explainable, and trustworthy medical artificial intelligence

Medical artificial intelligence (AI) systems have been remarkably successful, even exceeding human performance at certain tasks. There is no doubt that AI is important for improving human health in many ways and will disrupt various medical workflows in the future. To use AI to solve problems in medicine beyond the lab, in routine environments, we need to do more than just improve the performance of existing AI methods. Robust AI solutions must be able to cope with imprecise, missing, and incorrect information, and must explain both the result and the process by which it was obtained to a medical expert. Using conceptual knowledge as a guiding model of reality can help to develop more robust, explainable, and less biased machine learning models that can ideally learn from less data. Achieving these goals will require an orchestrated effort that combines three complementary Frontier Research Areas: (1) Complex Networks and their Inference, (2) Graph Causal Models and Counterfactuals, and (3) Verification and Explainability Methods. The goal of this paper is to describe these three areas from a unified view and to motivate how information fusion, applied in a comprehensive and integrative manner, can not only help bring these three areas together, but also play a transformative role by bridging the gap between research and practical applications in the context of future trustworthy medical AI. This makes it imperative to include ethical and legal aspects as a cross-cutting discipline, because all future solutions must be not only ethically responsible, but also legally compliant.


Introduction and motivation
Artificial intelligence in medicine is on everyone's lips. Politicians around the world have declared its use a worthy goal. Industry sees its use as a huge driver of growth, and medicine sees it as a great opportunity for solving medical problems and providing new insights. In order to use AI to solve problems in medicine, biology, and life sciences outside of our labs and in routine settings, there is an urgent need to move beyond mere benchmarking and improve the performance of methods that work only with independent and identically distributed (i.i.d.) data. Even the best current machine learning models do not generalize well, have difficulty with small training datasets [10], and are sensitive to even small perturbations [11][12][13]. Moreover, the most promising approaches are difficult for human experts to interpret and, most importantly, are not able to infer causal relationships. Therefore, explainability and robustness have been declared by the European Union as the most important properties for successful medical AI. Robustness and explainability are also important prerequisites for discovering causal relationships and enabling the verifiability of machine decisions by a human expert in a given context. This cannot be achieved by a single approach, but requires concerted action from complementary disciplines.
Research and teaching are trying to keep up with these trends in order to meet these functional requirements for medical AI. A systematic and up-to-date treatment of the topic in research-led teaching is therefore not only necessary but crucial for the practical and effective implementation of AI in the future, in order to meet the increasing demand for highly qualified specialists in Europe and worldwide. The task of this new generation of experts will be to bring the latest developments into daily application.
The goal of this position paper is to identify the most relevant pioneering frontier research areas and make the case for why and how they can contribute to a concerted integrative effort to make future medical AI efficient and effective in practice. Specifically, we discuss three Frontier Research Areas (FRA): (1) Complex Networks and their Inference (CNI); (2) Graph Causal models and Counterfactuals (GCC); and (3) Verification and Explainability Methods (VEM). Throughout these FRA, we advocate for information fusion as the integrative cross-cutting catalyst that offers a great chance to unify and synergize the three areas. The new ''AI spring'' is causing an exponential increase not only in interest in AI, but also in the actual use of AI in all areas of life, including medicine. This inevitably raises questions of reliability, safety, fairness, as well as moral and ethical integrity [14], in addition to questions of robustness and explainability. Therefore, ethical and legal aspects must always be included. All future solutions must be not only ethically responsible [15], but also legally compliant [16]. The European Union has taken a clear stance on AI: AI must be human-centered and trustworthy. To be trustworthy, any AI must comply with applicable rules and regulations, adhere to ethical principles [17], and be implemented in a secure and robust manner, as defined by the EU High-Level Expert Group on AI. To this end, and following 1, a cyclic, iterative, agile human-centered AI redesign process, based on agile user-centered design methods [18], is needed to intertwine the proposed frontier topics with respect to the proposed information fusion approach, eventually reaching the degrees of trustworthiness, robustness and explainability required to fully harness the potential of medical AI.
This paper is organized as follows: for each Frontier Research Area, we begin with a few selected specific problems to show what problems each FRA addresses. We proceed by describing why the topic studied by each FRA is a problem for medical AI, and the extent to which the current state of the art falls short of what is needed to solve the problem. We then describe how the problem can be addressed, and present some promising work in the literature that goes in the right direction for this FRA. In a subsequent section, which we refer to as ''Desiderata'', we list some general characteristics and features that future technical achievements in this FRA should have for this application domain. We conclude each section by highlighting the practical benefits of realizing this FRA and how it will help bridge the gap between scientific achievements and their practical implementation in the medical domain.

What: Fighting complex diseases poses many problems in the integration and scalability of machine learning methods
Exploring and researching complex diseases such as arthritis, brain disorders, cancer, or infectious diseases such as COVID-19 requires novel medical decision support systems that not only incorporate humans into the loop, but also support integrative analyses combining diverse omics datasets with clinical information from a wide variety of modalities [19]. Such systems rely on scalable methods for data fusion and mining [20], machine learning, statistics, graph theory, and graph visualization, and must map results into low-dimensional representations because human cognition is not optimized to work well in high-dimensional spaces. Among the myriad properties describing genome, epigenome, transcriptome, microbiome, phenotype, lifestyle, etc., however, no single data type can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. A key challenge is the identification of effective models to provide a relevant systems view [21].
Additional insights can be gained, and in vivo validations better planned, by trying to understand the conservation of deregulated genes, networks, and pathways across organisms [22,23], which is a major and, to date, unsolved problem.

Why: Integrating data with networks makes it possible to identify novel relationships between data silos
Currently, explainable AI (XAI) developments are mostly uni-modal. However, enriched, more feasible explanations in the medical domain can be achieved if they consider multimodality. Integrating data with networks (protein interaction networks, transcription regulatory networks, microRNA-gene networks [41], metabolic and signaling pathways) makes it possible to identify relationships among data silos [42]. Further analyzing these annotated networks with graph theory algorithms or knowledge engineering tools provides insights into their structure [43,44], which, in turn, can characterize the function of these proteins, transcription factors and microRNAs [45]. Combining machine learning, data mining and graph theory is difficult, but critical to maximize the impact on translational research [46], enable more accurate and explainable modeling, increase our understanding of complex diseases [47,48] and, ultimately, support P4 (precision, personalized, participatory, preventive) medicine [49][50][51].
Challenges at the intersection of machine learning and network biology for Next-Generation Machine Learning for Biological Networks, which could impact disease biology, drug discovery, microbiome research, and synthetic biology are discussed in Camacho et al. (2018) [52].

How: Quantitative graph theory can help interpret integrated omics data within diseases
Graphs have been used in the life sciences for a long time. In recent years, there has been a growing trend to combine elements of graph theory, machine learning, and statistical data analysis, which offers tremendous opportunities, especially to support interactive knowledge discovery for personalized medicine [53]. In network analysis, complex biomedical graph data is examined, and the increasingly easy generation of large amounts of genomics, proteomics, metabolomics, etc., and signaling data enables the construction of large networks that provide a framework for understanding the molecular basis of physiological and pathological conditions. Such complex networks have been investigated extensively for several purposes [54,55]. On the one hand, networks have been explored in the context of studying complex systems by means of graphs. Examples thereof are biological, linguistic, chemical and technical networks [56]. Other contributions in this area relate to the study of motifs and modules within complex networks [55]. On the other hand, many quantitative analyses of networks have been performed [57].
To shed light on this problem, we briefly sketch Quantitative Graph Theory, introduced by Dehmer and Emmert-Streib [58]. Quantitative Graph Theory can be divided into three major categories, namely Comparative Network Analysis, Network Characterization, and networks explainable by design. Comparative Network Analysis relates to measuring the structural similarity between networks [59]. This can be done by using so-called exact or inexact graph matching, see [60,61]. Exact graph matching is based on the concept of graph isomorphism. Inexact graph matching relates to determining gradual changes in the similarity between graphs by utilizing graph invariants. Another approach for measuring the similarity between graphs is based on utilizing topological indices as an input when using distance or similarity measures for real numbers [62].
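The index-based route to graph similarity can be illustrated with a minimal sketch. It assumes small, connected, undirected graphs stored as adjacency dictionaries, and it picks the Wiener index (the sum of all pairwise shortest-path distances) as the topological index; the choice of index, the function names, and the two toy graphs are illustrative assumptions rather than anything prescribed in [62].

```python
from collections import deque

def shortest_path_lengths(adj, source):
    """Breadth-first search distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def wiener_index(adj):
    """Sum of shortest-path distances over all unordered vertex pairs.

    Assumes the graph is connected and undirected.
    """
    total = sum(sum(shortest_path_lengths(adj, node).values()) for node in adj)
    return total // 2  # every pair was counted from both endpoints

def index_distance(adj_a, adj_b):
    """Compare two graphs through a distance between their real-valued indices."""
    return abs(wiener_index(adj_a) - wiener_index(adj_b))

# Two hypothetical four-vertex graphs: a path and a star.
path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star4 = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(wiener_index(path4), wiener_index(star4), index_distance(path4, star4))
```

A small index distance is only a necessary signal of similarity: non-isomorphic graphs can share the same index value, which is one reason such measures belong to inexact rather than exact matching.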
Next, Network Characterization using quantitative graph complexity measures can be employed. A network measure is a function that maps network instances to positive real numbers. In mathematical chemistry, such measures are often referred to as topological indices [63]. Many complexity measures for graphs have been developed, e.g., based on distances, vertex degrees, graph automorphisms and so forth. We refer to [63][64][65] for more details. One promising domain for the future is the emerging field of geometric deep learning, which is an umbrella term for new techniques that attempt to generalize (structured) deep neural models to non-Euclidean domains, such as graphs and manifolds [66]. Machine learning on networks is promising and has recently been used very successfully to fight COVID-19 [67].
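To make the notion of a network measure concrete, the following sketch computes two well-known degree-based topological indices, the first Zagreb index and the Randić index. It assumes undirected graphs without isolated vertices, given as adjacency dictionaries with comparable vertex labels; the toy graphs are hypothetical, and the two indices are chosen only as familiar examples of the degree-based family mentioned above.

```python
import math

def degrees(adj):
    """Vertex degrees of an undirected graph given as an adjacency dictionary."""
    return {node: len(neighbors) for node, neighbors in adj.items()}

def first_zagreb_index(adj):
    """Sum of squared vertex degrees."""
    return sum(d * d for d in degrees(adj).values())

def randic_index(adj):
    """Sum over edges (u, v) of 1 / sqrt(deg(u) * deg(v))."""
    deg = degrees(adj)
    total = 0.0
    for u, neighbors in adj.items():
        for v in neighbors:
            if u < v:  # visit each undirected edge once
                total += 1.0 / math.sqrt(deg[u] * deg[v])
    return total

# Hypothetical four-vertex graphs: a path and a star.
path4 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star4 = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

for name, graph in (("path", path4), ("star", star4)):
    print(name, first_zagreb_index(graph), round(randic_index(graph), 4))
```

Both functions map a network to a single positive real number, so they can be used either to characterize one network or as inputs to a distance between networks.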
With respect to networks explainable by design, compositional part-based object detecting and classifying neural-symbolic explainable models [68] can support explanations based not only on coarse-grained labels, but also on more fine-grained findings, and provide a wider provenance that traces the explanation to the very source, i.e., the data acquisition stage. This goes beyond current XAI techniques, which limit their explanations to providing a rationale only for a given input and output sample [69][70][71]. Going beyond uni-modal explanations makes the information fusion aspect of paramount importance in the explanation process, allowing traceability from data collection to the output explanation interfaces for the diverse audience profiles that participate in medical and clinical processes and that are characterized by different backgrounds and expertise.
Apart from the methods sketched above, networks have also been used in other areas including data mining, machine learning, lexical semantics, information fusion [72][73][74][75] and integrative computational biology, such as cell differentiation [44].

Desiderata: Fusing machine learning with systematic graph theory promotes the knowledge gain of multi-modal data and their interrelationships
Many interactions are transient, so networks change in different tissues or under different stimuli [37][94][95][96]. Studying the dynamics of these networks is an exponentially complex task. Many stable complexes show strong co-expression of corresponding genes, whereas transient complexes lack this support [97,98]. These contextual network dynamics must be considered when linking interactions to phenotypes and when studying the network topology. When such insights on network dynamics are used to identify and minimize the different biases of individual detection methods, it should be kept in mind that the simple intersection of results achieves high precision only at the cost of low recall.
Systematic graph theory analyses of dynamic changes in interaction networks, combined with probabilistic modeling [99], and integrated with gene and protein cancer profiles enable comprehensive analyses of complex diseases such as cancer [100][101][102], generating new insights [42,51], robust biomarkers [90,91,103] and models that explain causal relationships through network inference [104,105]. Implementing algorithms using heuristics fine-tuned for interaction networks [106][107][108] will ensure their scalability. Finally, we also highlight achievements reported lately on the use of Deep Learning methods to undertake modeling problems formulated over interaction networks, which have so far elicited promising results [7,109].
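The precision and recall trade-off of intersecting results can be seen in a small worked example. The sketch below assumes a hypothetical reference set of four true interactions and two hypothetical detection methods, each with its own false positive; all edge names are invented for illustration.

```python
def precision_recall(predicted, truth):
    """Precision and recall of a predicted edge set against a reference set."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical reference interactions and outputs of two detection methods.
truth = {("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")}
method_1 = {("A", "B"), ("B", "C"), ("A", "E")}
method_2 = {("A", "B"), ("C", "D"), ("B", "E")}

intersection = method_1 & method_2  # only edges both methods agree on
union = method_1 | method_2         # edges reported by at least one method

print("intersection:", precision_recall(intersection, truth))
print("union:", precision_recall(union, truth))
```

In this toy case the intersection discards every method-specific false positive but also most true interactions, whereas the union recovers more interactions while keeping both false positives.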

What for: Pushing the boundaries in this FRA will help understanding complex diseases
There are many benefits emerging from early steps taken along this FRA. For instance, some of the most successful network-based methods of gene group identification for class prediction have been the score-based sub-network markers [110][111][112][113]. Sub-networks identified using these approaches were recently shown to be highly conserved across studies and to perform better than individual genes or pre-defined gene groups at predicting breast cancer metastasis [111]. Improving these methods by considering network modularity results in better prediction of aging [45]. Combining existing known and predicted interactions with novel local co-expression annotation of existing edges will elucidate disease-specific dynamics and identify local network structures (graphlets [107,114]) that are the most aberrant components in the cancer network, as compared to a normal control case. Network dynamics [94], in turn, enable explainable modeling of healthy and disease signaling cascades [115], or modeling cancer progression [50].
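As a rough illustration of comparing local network structures between a disease network and a normal control, the sketch below counts the two connected three-node graphlets (open paths and triangles) as induced subgraphs. It is a brute-force toy, assuming small undirected networks given as edge lists; the networks and node names are hypothetical, and practical graphlet methods [107,114] use larger graphlets and far more scalable counting.

```python
from itertools import combinations

def three_node_graphlets(edges):
    """Count induced three-node graphlets (open paths and triangles)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = {"path": 0, "triangle": 0}
    for a, b, c in combinations(sorted(adj), 3):
        present = sum((b in adj[a], c in adj[a], c in adj[b]))
        if present == 3:
            counts["triangle"] += 1
        elif present == 2:
            counts["path"] += 1
    return counts

# Hypothetical control network and a disease network with one extra interaction.
normal = [("A", "B"), ("B", "C"), ("C", "D")]
disease = normal + [("A", "C")]

normal_counts = three_node_graphlets(normal)
disease_counts = three_node_graphlets(disease)
difference = {k: disease_counts[k] - normal_counts[k] for k in normal_counts}
print(normal_counts, disease_counts, difference)
```

The per-graphlet difference is a crude signal of where the local structure of the two networks departs; it says nothing by itself about statistical significance.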

What: Causal learning from observational data is a central problem relevant to many application domains
Causal learning from purely observational data and predictive modeling is a general problem relevant to many application domains. It has recently gained much interest and has been largely tackled by the AI community [116,117]. There are a number of fundamental problems that have existed for a long time and have not yet been solved. The renowned American philosopher Charles S. Peirce argued that human induction must be guided by special aptitudes for guessing right, which led to the challenge of simplicity or parsimony, which itself goes back to Occam's razor. The concept of simplicity alone poses a lot of problems for both causal machine learning [118] and causal human learning [119]. If causal inference has a rational basis, we would expect the resulting causal knowledge to allow the formulation of coherent answers to a variety of causal questions.
Two main problems about causal relationships can be distinguished in the literature: (1) ''What is the probability that a cause causes (or prevents) an effect?'' and (2) ''What is the probability that a causal relationship exists between these two variables?'' Or, put another way, ''Does the cause have a nonzero probability of producing (or preventing) the effect?'' [120]. The generality and wide spectrum of practical scenarios in which such questions can be formulated make the discovery of causal relationships from data a subject of vibrant study in diverse fields and disciplines. AI-based medicine is no exception, with specific tasks such as diagnosis and treatment calling for further advances in causal inference that unveil novel interventional and prescriptive strategies from medical data.

Why: Typically, the underlying causal model that accounts for all factors affecting an outcome variable of interest is missing
A common challenge in applying causal analysis is the lack of an underlying causal model that can account for all factors influencing an outcome variable of interest. Recent progress has been made on causal signal extraction from images [121,122]. Causality has also been applied to generative neural networks and proxy variables in an attempt to better deal with the kind of data used by Deep Learning [123,124]. Nevertheless, the international research community agrees that there are many shortcomings and open problems to be solved, for instance, dealing with all the possible underlying, and often unknown, factors of variation and variables for which causality is feasible to study in practice [117,125,126].

How: XAI with counterfactual explanations and causal algorithmic recourse can help determine what is causally related
Formal reasoning about causal relations between features $X = [X_1, \ldots, X_d]$ can be done by using a structural causal model, i.e., a non-parametric model with independent errors according to Judea Pearl [127,128]. In the following we introduce some basics to show how this can be helpful. For more extensive introductions, please refer to [120,129]. The data-generating process of $X$ is described by an (unknown) underlying structural causal model $\mathcal{M}$ of the general form

$$\mathcal{M} = (S, P_U), \qquad S = \{X_r := f_r(X_{\mathrm{pa}(r)}, U_r)\}_{r=1}^{d}, \qquad P_U = P_{U_1} \times \cdots \times P_{U_d}.$$

The structural equations $S$ are a set of assignments generating each observed variable $X_r$ as a deterministic function $f_r$ of its causal parents $X_{\mathrm{pa}(r)} \subseteq X \setminus X_r$ and an unobserved noise variable $U_r$. Here it is important to note that $P_U$ is a factorizing joint distribution over background variables which introduces uncertainty due to the lack of observations. The assumption of mutually independent noises (i.e., a fully factorized $P_U$) entails that there is no hidden confounding and is referred to as causal sufficiency. For an experimental proof, we refer to Karimi et al. (2020) [129].
Structural causal models are often represented by a so-called causal graph $\mathcal{G}$. Such causal graphs can be obtained by drawing a directed edge from each node in $X_{\mathrm{pa}(r)}$ to $X_r$ for $r \in \{1, \ldots, d\}$.
Figs. 2(b) and 2(c) show a typical textbook example. We assume henceforth that $\mathcal{G}$ is acyclic. In this case, the data-generating process $\mathcal{M}$ implies a unique observational distribution $P_X$, which factorizes over $\mathcal{G}$, defined as the push-forward of $P_U$ via $S$.
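The construction of the causal graph and of the observational distribution as a push-forward of the noise distribution can be sketched in a few lines. The three-variable model below is purely hypothetical: the variable names, the linear coefficients, and the standard Gaussian noises are assumptions made for illustration and are not the example of Fig. 2.

```python
import random

# Hypothetical causal parents of each variable (defines the causal graph).
PARENTS = {"X1": [], "X2": ["X1"], "X3": ["X1", "X2"]}

# Hypothetical structural equations: each variable is a deterministic
# function of its parents' values and of its own noise term.
STRUCTURAL_EQUATIONS = {
    "X1": lambda parents, noise: noise,
    "X2": lambda parents, noise: 2.0 * parents["X1"] + noise,
    "X3": lambda parents, noise: parents["X1"] - parents["X2"] + noise,
}

def causal_graph_edges(parents):
    """Draw a directed edge from every causal parent to its child."""
    return [(p, child) for child, ps in parents.items() for p in ps]

def topological_order(parents):
    """Order variables so that parents precede children (requires acyclicity)."""
    order, placed = [], set()
    while len(order) < len(parents):
        ready = [v for v in parents
                 if v not in placed and all(p in placed for p in parents[v])]
        if not ready:
            raise ValueError("the causal graph contains a cycle")
        order.extend(ready)
        placed.update(ready)
    return order

def push_forward(noise):
    """Map one noise realization to observed values via the structural equations."""
    values = {}
    for name in topological_order(PARENTS):
        parent_values = {p: values[p] for p in PARENTS[name]}
        values[name] = STRUCTURAL_EQUATIONS[name](parent_values, noise[name])
    return values

def sample_observation(rng):
    """Draw mutually independent noises, then push them through the equations."""
    noise = {name: rng.gauss(0.0, 1.0) for name in PARENTS}
    return push_forward(noise)

print(causal_graph_edges(PARENTS))
print(sample_observation(random.Random(0)))
```

Repeated calls to `sample_observation` draw from the observational distribution of this toy model; acyclicity is what guarantees that a valid evaluation order exists.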
The structural causal model framework allows for the study of interventional distributions, describing a situation in which some variables are manipulated externally. The structural causal model also implies distributions over counterfactuals, i.e., statements about (hypothetical) interventions made with all else being equal (ceteris paribus, namely, the analysis of the effect of one variable on another, assuming that all other variables remain the same).
When formulated in the context of classification via a model $h$, a popular approach to the study of counterfactuals is to find so-called (nearest) counterfactual explanations [130], where the term ''counterfactual'' is meant in the sense of the closest possible ''fact'' with a different outcome. Counterfactual predictions consist of asking ourselves what would have been the effect of something if we had not taken an action, i.e., alternative scenarios [131], or modifications of the input data that could eventually alter the original prediction of the model $h$ and help the user understand the performance boundaries of the model for improved trust and informed criticism. Interventional clinical predictive models require the calculation of counterfactuals, apart from the correct specification of cause and effect [131]. To give an example, to analyze counterfactuals based on the structural causal model $\mathcal{M}$, an intervention (also known as the do operator) can be used to indicate that a set of variables $X' \subseteq X$ is fixed to $\theta$, which is often denoted as $do(X' = \theta)$. The corresponding distribution of the remaining variables $X \setminus X'$ can be computed from $\mathcal{M}$ by replacing the structural equations for $X'$ in $S$ to obtain the new set of equations $S^{do(X' = \theta)}$. The interventional distribution $P_{X \setminus X' \mid do(X' = \theta)}$ is then given by the observational distribution implied by the manipulated structural causal model $(S^{do(X' = \theta)}, P_U)$.
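The effect of an intervention on the structural equations can be sketched as follows. The chain-shaped model, its coefficients, the Gaussian noises, and the helper names `do`, `sample` and `mean_of` are all hypothetical choices for illustration; the only step taken from the text is that the equations of the intervened variables are replaced by constants while the noise distribution is left untouched.

```python
import random

ORDER = ["X1", "X2", "X3"]  # a valid evaluation order for the chain X1 -> X2 -> X3

def make_equations():
    """Hypothetical structural equations of a three-variable chain."""
    return {
        "X1": lambda values, noise: noise,
        "X2": lambda values, noise: 2.0 * values["X1"] + noise,
        "X3": lambda values, noise: values["X2"] + noise,
    }

def do(equations, assignment):
    """Return manipulated equations in which intervened variables are constants."""
    manipulated = dict(equations)
    for name, value in assignment.items():
        manipulated[name] = lambda values, noise, value=value: value
    return manipulated

def sample(equations, rng):
    """Draw one joint sample; noises stay mutually independent standard Gaussians."""
    values = {}
    for name in ORDER:
        values[name] = equations[name](values, rng.gauss(0.0, 1.0))
    return values

def mean_of(name, equations, n=20000, seed=0):
    """Monte Carlo estimate of the mean of one variable under the given equations."""
    rng = random.Random(seed)
    return sum(sample(equations, rng)[name] for _ in range(n)) / n

observational = make_equations()
interventional = do(observational, {"X2": 3.0})

print("E[X3]            ~", round(mean_of("X3", observational), 2))
print("E[X3 | do(X2=3)] ~", round(mean_of("X3", interventional), 2))
print("E[X1 | do(X2=3)] ~", round(mean_of("X1", interventional), 2))
```

In this toy chain, fixing X2 shifts the distribution of its descendant X3 but leaves the upstream variable X1 unchanged, which is the asymmetry that distinguishes intervening from conditioning.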
Given observations $x^{F}$, the definition of the $do(\cdot)$ interventional operator permits, for example, asking what would have happened if $X'$ had instead taken the value $\theta$.
An answer to this question departs from the definition of the counterfactual variable $X(do(X' = \theta)) \mid x^{F}$, and the distribution of this counterfactual variable can be computed in three steps [128][131]. In medical applications, some of the tests for measuring the robustness of estimated effects of non-pharmaceutical interventions include intervention models that make different structural assumptions, and validation of such assumptions when they may not hold. An example of such interventions against COVID-19 includes the generalization over countries presented in [132]. In cases where causal effect estimation is aimed at individual-level recommendations, alerting decision makers when predictions are not to be trusted is crucial. Therefore, identifying failure with uncertainty-aware models (e.g., when covariate shift makes training and test datasets differ), as proposed in [133], facilitates uncertainty communication to decision-makers. Generally, quantifying uncertainty helps deep learning methods to be adopted into clinical workflows [134].
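The three-step computation of a counterfactual is commonly described, following Pearl, as abduction (infer the noise from the factual observation), action (apply the intervention), and prediction (propagate the inferred noise through the manipulated equations). The sketch below walks through these steps for a hypothetical additive-noise chain in which the noise can be recovered exactly from a fully observed case; the model, its coefficients and the numbers are assumptions made for illustration.

```python
# Hypothetical additive-noise model:
#   X1 := U1,   X2 := 2 * X1 + U2,   X3 := X2 + U3

def abduction(observed):
    """Step 1: recover the noise values consistent with the factual observation."""
    return {
        "U1": observed["X1"],
        "U2": observed["X2"] - 2.0 * observed["X1"],
        "U3": observed["X3"] - observed["X2"],
    }

def predict(noise, intervention):
    """Steps 2 and 3: fix the intervened variables, then propagate the noise."""
    x1 = intervention.get("X1", noise["U1"])
    x2 = intervention.get("X2", 2.0 * x1 + noise["U2"])
    x3 = intervention.get("X3", x2 + noise["U3"])
    return {"X1": x1, "X2": x2, "X3": x3}

def counterfactual(observed, intervention):
    """What the observed case would have looked like under the intervention."""
    return predict(abduction(observed), intervention)

factual = {"X1": 1.0, "X2": 2.5, "X3": 3.0}
print(counterfactual(factual, {"X2": 0.0}))
```

Because the noise is invertible here, the counterfactual is a single point; with partially observed cases or non-invertible equations, abduction yields a posterior distribution over the noise and the counterfactual becomes a distribution as well.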
A different but intuitively similar concept related to the characterization of counterfactuals is that of contrastive explanation [135], which consists of explaining not only why an event occurred, but also why it occurred as opposed to some alternative event. Contrastive explanations are considered necessary for agents to achieve moral responsibility, although a debate exists on whether contrastive explanations entail causal determinism [136,137]. Approaches producing contrastive explanations serve to learn more efficiently from data. For example, using pertinent negatives [138] is one such approach, and relates to learning structural descriptions from examples. Another example is using active learning, which can help select the most informative pairs of labels to elicit contrastive natural language explanations from experts, while dynamically changing the model [139].
Equally important is the integration of ''Big Data'' methods with explanations that involve a causal analysis. This integrated analysis is key, especially in omics and imaging for causal inference [93]. An example of such tight integration is the use of deep feature selection for causal analysis in Alzheimer's Disease [140]. Another example is the alignment of domain expert knowledge with Deep Learning models in order to achieve more expert-compatible explainability. Neural-symbolic learning and reasoning systems can be used for this purpose with different kinds of integration schemes [68,141].

Desiderata: Disentangling influential factors from multivariate observations and plausible yet diverse counterfactuals
A concern with causal AI in medicine is how to disentangle correlated factors of influence in high-dimensional settings. One way to deal with the independent manipulation of a set of correlated factors is to disentangle the influence of correlated factors from multivariate observations with interventions. An example is Back-to-Back regression [142], which helps to identify the causal contributions of collinear factors in multivariate and multi-dimensional magnetic resonance imaging observations. Back-to-Back regression produces an interpretable scalar estimate for each factor from a set of correlated factors to estimate those that most plausibly account for multidimensional observations. As a result, this method disentangles the respective contributions of collinear factors to identify the causal contribution of covarying factors.
With regard to counterfactual explanations, the plausibility, feasibility, and diversity of the obtained counterfactual explanations (whether they are contrastive or not) are particularly relevant aspects that should be considered in the medical domain. In this regard we advocate for an increasing prevalence of modern generative learning approaches applied to the discovery of counterfactuals. The capability of such methods to model the distribution of existing multi-dimensional data yields a proxy generator of plausible hypotheses that can be of utmost help to ensure that counterfactual instances can occur in reality. Further along this line, the diversity of counterfactuals can be a conflicting objective with their plausibility as per $P_X$, hence counterfactual generation methods should also properly balance among such objectives [143].
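The tension between proximity, plausibility and diversity can be illustrated with a deliberately simple stand-in for the generative approaches advocated above. In the sketch below, plausibility is approximated by restricting candidates to already observed instances, proximity by Euclidean distance, and diversity by a greedy minimum-gap rule; the classifier `h`, the data, and the parameters `k` and `min_gap` are all hypothetical.

```python
import math

def h(x):
    """Hypothetical binary classifier: flags a case when a weighted score exceeds 1."""
    return int(0.5 * x[0] + 0.5 * x[1] > 1.0)

def diverse_counterfactuals(query, observed, model, k=2, min_gap=0.5):
    """Pick up to k observed instances with a different outcome than `query`.

    Candidates are visited from nearest to farthest; a candidate is kept only
    if it lies at least `min_gap` away from every counterfactual kept so far.
    """
    target = 1 - model(query)
    candidates = sorted(
        (x for x in observed if model(x) == target),
        key=lambda x: math.dist(query, x),
    )
    chosen = []
    for x in candidates:
        if len(chosen) == k:
            break
        if all(math.dist(x, c) >= min_gap for c in chosen):
            chosen.append(x)
    return chosen

# Hypothetical observed cases (two features each) and a query case.
observed = [(0.5, 0.5), (1.2, 1.0), (1.3, 1.0), (0.4, 2.0), (2.0, 2.0)]
query = (0.9, 0.9)

print(diverse_counterfactuals(query, observed, h))               # proximity and diversity
print(diverse_counterfactuals(query, observed, h, min_gap=0.0))  # proximity only
```

With the diversity constraint switched off, the two returned counterfactuals are near-duplicates; with it switched on, the second one is farther from the query but suggests a genuinely different change, which makes the trade-off explicit.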

What for: Causality and counterfactual generation may reduce diagnostic errors, increase quality of care and life, reduce overall costs, and free up clinicians' time
As in other fields with strong human interaction, in designing a medical AI system it is critical to consider who will use it. Furthermore, when the system is used for diagnostics, it is also crucial to ensure a proper balance between sensitivity and specificity, and to optimize the user interface and workflow integration. There are numerous examples that support these claims from pathology, radiology and dermatology, e.g., a smartphone-based melanoma classifier would likely be used by the general public as a first step in screening for skin diseases.
Here the main goal, especially when the treatment for the disease to be diagnosed is invasive or has serious side effects for the health of the patient, is to maintain a low false negative rate. On the other hand, a system for radiologists should automatically classify common cases, and leave the decision on more complex cases for the expert, aiming at a high true positive rate. Properly using such systems would reduce false negatives and false positives, increase the quality of treatments and the quality of life of patients, decrease the overall cost and free up clinicians' time, which becomes more critical as decision-making situations become more patient-centered [144].
Advances in graph causal modeling and counterfactuals can be a major step towards realizing such objectives. On the one hand, interventional clinical studies can be driven by the results of causal analysis of multi-dimensional medical data, thereby eliciting new diagnostic and treatment criteria, which in turn produce data from such new cases that can be fed back to the AI-based models. On the other hand, counterfactuals can increase the trust of the medical expert in the decisions issued by the AI model, by indicating when the model must not be fully relied upon because a counterfactual lies close to the case to be diagnosed or treated. This augmented information offered to the expert could reduce the number of false positives, thereby favoring the aforementioned decrease of costs and efforts.
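The balance between sensitivity and specificity depends on the decision threshold applied to a classifier's score, which a short example makes tangible. The scores, labels and thresholds below are hypothetical, and the sketch assumes that both classes are present so that neither ratio has a zero denominator.

```python
def confusion_counts(scores, labels, threshold):
    """Count true/false positives and negatives at a given decision threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label:
            tp += 1
        elif predicted_positive and not label:
            fp += 1
        elif not predicted_positive and label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity (true positive rate) and specificity (true negative rate)."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical classifier scores and ground-truth labels (1 = disease present).
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]

for threshold in (0.3, 0.65):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, "
          f"specificity={spec:.2f}, false negative rate={1 - sens:.2f}")
```

A screening tool for the general public would lean towards the lower threshold, accepting more false positives in exchange for a low false negative rate, whereas a triage tool for specialists may tolerate the opposite trade-off.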

What: The use of AI requires the ability to verify correctness and causal accuracy
In the medical domain, the use of AI and machine learning models that are explainable and verifiable by human medical experts is an absolute necessity, primarily for legal reasons [145]. The central problem is that no AI method will be deployed if its results cannot pass a verification process for correctness and causal accuracy by a human expert on demand. Making these assessments is difficult if the AI methods in question do not provide explanations to users. The problem becomes clear when we consider the classic problem described by Caruana et al. (2015) [146], where an AI system trained to predict a person's risk of pneumonia came to incorrect conclusions, and applying this model would have increased, not reduced, the number of patient deaths. At the same time, this is also a good example of the usefulness of having a human-in-the-loop [147], because physicians can easily verify the results based on their experience and recognize that such results of an AI system are not correct after all. Moreover, a human-in-the-loop approach can bring contextual understanding, implicit knowledge and experience to statistical machine learning methods, and consequently provide prior knowledge. However, one core open problem remains, namely, how to integrate this knowledge into the machine learning pipeline.
The term verification comes from both software engineering and medicine and has been used in AI as well [148]. The term explainability is used to technically highlight decision-relevant parts of machine representations, i.e., parts that contributed to the accuracy of a particular prediction. However, such a technical explanation does not refer to a human model. For this, explainability must be extended to include the concept of causability [149], which refers to a human model. Causability was introduced in reference to the well-known term usability [150]. While explainability is about implementing transparency and traceability, causability is about measuring the quality of explanations, i.e., the measurable extent to which an explanation of a statement achieves a certain level of causal understanding for a user with effectiveness, efficiency, and satisfaction in a given context of use [151]. In other words, causability measures whether an explanation achieves a given level of causal understanding for a human. This is a major challenge in the medical field, as many different modalities contribute to a single outcome, requiring multimodal causability [19].

Why: The best machine learning methods to date lack robustness and are difficult to interpret
Currently, the most important and most lacking aspect of AI in general, and in medical AI in particular, is robustness.Recent success in machine learning has led to an explosion of AI applications, resulting in high expectations being placed in autonomous systems, such as autonomous vehicles [152,153], medical diagnosis [154,155], industrial prognosis [156,157], or cybersecurity [158].These developments require that we recognize and understand the fundamental limitations of current intelligent systems, which often apply across many different application areas.This crucial deficit of robustness of current systems concretely relates to their lack of ability to adapt to changes in the environment.In medicine, this is even more profound, as data changes because of changes in patient cohorts, due to advancements of instruments and assays that generate images and omics data, and as a result of changes of treatment modalities and our understanding of health and disease states at physiological and molecular levels.
The field of machine learning deals with the development of successful adaptation strategies and attempts to enable machines to recognize or respond to changing conditions for which they have not been specifically programmed or trained. So far, however, most work in machine learning has been based on the ''independent and identically distributed'' (i.i.d.) assumption. That is, the machine must be able to process new input data that have not been seen during training, but that conform to the same statistical distribution. As the i.i.d. assumption is a strong assumption that is rarely met in practice, the field of machine learning is currently working extensively on theoretical and empirical approaches to develop learning strategies that do not require this assumption to hold. These efforts are particularly related to the concepts of ''transfer learning'' [159,160], ''domain adaptation'' [161][162][163], ''adversarial training'' [164][165][166][167] and ''lifelong'' or ''continual learning'' [168,169].
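To make the i.i.d. concern concrete, the following minimal sketch flags features whose distribution differs between training data and deployment data. It is an illustration only: the function name `standardized_mean_shift`, the simulated drift, and the threshold of 0.25 are assumptions introduced here, and a standardized mean difference is just one simple drift indicator among many.

```python
import numpy as np

def standardized_mean_shift(train, deploy):
    """Per-feature |mean difference| divided by the pooled standard deviation."""
    train = np.asarray(train, dtype=float)
    deploy = np.asarray(deploy, dtype=float)
    pooled_std = np.sqrt(0.5 * (train.var(axis=0, ddof=1) + deploy.var(axis=0, ddof=1)))
    return np.abs(train.mean(axis=0) - deploy.mean(axis=0)) / pooled_std

# Synthetic data (hypothetical): two features, both standard normal at training time.
rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 2))
deploy = rng.normal(size=(2000, 2))
deploy[:, 0] += 1.0  # simulated drift, e.g. a recalibrated instrument, in feature 0 only

shift = standardized_mean_shift(train, deploy)
flagged = shift > 0.25  # illustrative threshold, not a recommended clinical value
print(shift, flagged)
```

A check of this kind only detects shifts in marginal feature distributions; changes in the relationship between features and outcome require labeled deployment data or stronger assumptions.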
Even if non-i.i.d. issues are circumvented or simply do not occur, an obstacle to reaching fully actionable medical AI is the lack of explainability. In particular, modern Deep Learning models that nowadays monopolize modeling approaches for medical imaging usually remain ''black-boxes'' [69,170,171] that are unable to explain the reasons for their predictions or recommendations. This property largely precludes the diagnosis and correction of defects, and only favors conservative safety assessments of the behavior of a learning model. Both problems are very much related to a lack of understanding of cause-effect relationships. This hallmark of human cognition is a necessary (though not sufficient) component for machine learning methods achieving human-like intelligence, which would provide the basis for a much broader application of AI in industry and business. A grand issue in the task of learning from a set of observed samples is to estimate the generalization error of learning algorithms. The problem with typical measurements, e.g., the training error, is that they are biased, particularly if the available amount of data is small. Traditionally, the generalization error is characterized through complexity measures such as the Vapnik-Chervonenkis (VC) dimension [172,173], or through stability [174].
In the race towards properly characterizing and understanding medical AI-based models, one cannot ignore the importance of identifying important features for explainable models, which becomes particularly essential for image processing algorithms [140]. Furthermore, these systems need to be integrated with existing research and clinical workflows. Importantly, proper independent verification and explainability methods may confirm that well-performing AI systems are indeed superior to humans in some clinical systems (or, e.g., at radiologist level [175]), and unveil the reasons why their performance can degrade severely in other healthcare systems as a result of potentially non-identically distributed data resulting from a context-induced bias [176].

How: Causal approaches and explainability methods can contribute to achieving target trials, transportability, and predictive invariance
From the previous section it is clear that robustness is a key aspect to be addressed in medical AI-based systems. Performance guarantees can only be given if models are proven to be robust against different phenomena that compromise their generalization capability. An interesting approach to study generalization of learning algorithms from the perspective of robustness was presented in [177], which derived generalization bounds for learning algorithms based on their algorithmic robustness. The assumption is that if a testing sample is ''similar'' to a training sample, then the testing error is close to the training error, which is different from the traditional complexity or stability arguments mentioned earlier that concentrate solely on optimizing pure performance measurements. Indeed, in the machine learning community the overall trending goal seems to be maximizing standard accuracy, and many papers from the biomedical domain report increasing accuracy levels for different medical diagnostic tasks by virtue of models of increasing complexity and sophistication. However, such models still yield erroneous cases, which should motivate doctors to retrace and find the root cause of such errors. A non-automated inspection and verification of such cases is often infeasible due to the multi-modality of data and the effort it requires from the medical expert. At this point a new opportunity arises for causality and explainability as enablers to automate this medical verification process.
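For readers who want the shape of such a robustness-based bound, the block below states it in the form we recall from the algorithmic-robustness literature. This is a sketch under assumptions: we assume [177] follows the framework in which an algorithm $\mathcal{A}$ is called $(K, \epsilon(s))$-robust, and the symbols $K$, $M$, $\epsilon(s)$, $\delta$ and $n$ are notation introduced here, so the exact conditions and constants should be checked against [177].

```latex
% Assumptions: loss bounded by M; s is a training set of n i.i.d. samples.
% (K, \epsilon(s))-robustness: the sample space can be partitioned into K
% disjoint sets C_1, \dots, C_K such that, for every training sample s_j
% and every test sample z,
\[
  s_j, z \in C_i \;\Longrightarrow\;
  \bigl| \ell(\mathcal{A}_s, s_j) - \ell(\mathcal{A}_s, z) \bigr| \le \epsilon(s).
\]
% Resulting generalization bound, holding with probability at least 1 - \delta:
\[
  \bigl| L(\mathcal{A}_s) - \ell_{\mathrm{emp}}(\mathcal{A}_s) \bigr|
  \;\le\; \epsilon(s) + M \sqrt{\frac{2K \ln 2 + 2 \ln(1/\delta)}{n}} .
\]
```

Read this way, the bound formalizes the ''similar test sample, similar error'' intuition: a finer partition (larger $K$) typically allows a smaller $\epsilon(s)$ but loosens the second term unless $n$ grows accordingly.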
Unfortunately, observational biomedical studies are affected by confounding and selection biases, among other biases [178], which makes causal inference infeasible unless robust assumptions are made. These assumptions require a priori domain knowledge: data-driven predictive models can be used to infer causal effects, but neither their parameters nor their predictions necessarily have a causal interpretation.
Consequently, we firmly call for the use of causal approaches and learning causal structures by using certain linchpins to develop and test intervention models [131], namely: (1) target trials, (2) transportability, and (3) prediction invariance. To begin with, target trials refer to algorithmic emulation of randomized studies. Transportability [179] is a license to ''transfer causal effects learned in experimental studies to a new population, in which only observational studies can be conducted''. Akin to transportability is prediction invariance, where a ''true causal model is contained in all prediction models whose accuracy does not vary across different settings''. When a causal structure is available or a target trial design can be devised, the evaluation of model transportability for a given set of action queries (e.g., treatment options or risk modifiers) is recommended; for exploratory analyses where causal structures are to be discovered, prediction invariance could be used. In this way, as advocated by Prosperi et al. (2020) [131], transportability and prediction invariance could become core guideline tools and part of reporting protocols for intervention models, for a better alignment with the standards for prognostic and diagnostic models of medicine and biomedical practice today.
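As a rough illustration of what checking prediction invariance could look like in practice, the sketch below compares the accuracy of one model across settings and reports the spread. It is a crude proxy under stated assumptions, not the formal statistical procedure: the environment names, the labels, and the helper `invariance_gap` are all hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def invariance_gap(predictions_by_env):
    """Map environment -> (y_true, y_pred); return per-environment accuracy and max-min gap."""
    per_env = {env: accuracy(y, p) for env, (y, p) in predictions_by_env.items()}
    return per_env, max(per_env.values()) - min(per_env.values())

# Hypothetical predictions of the same model in two settings.
predictions_by_env = {
    "hospital_a": ([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 1, 1, 0, 0, 1, 1]),
    "hospital_b": ([1, 1, 0, 0, 1, 0, 1, 0], [1, 0, 0, 1, 0, 0, 1, 1]),
}
per_env, gap = invariance_gap(predictions_by_env)
print(per_env, gap)
```

A large gap suggests that the model relies on setting-specific associations rather than on an invariant (candidate causal) mechanism; a small gap is necessary but not sufficient evidence of invariance.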
Another phenomenon placing at risk the trustworthiness and verification of medical AI models is their lack of robustness to adversarial attacks. Technically, we assume a model processing unseen examples drawn from the underlying data distribution. In general, the goal of model training is to reach a minimum of an expected loss function [180]. However, many machine learning models, particularly deep neural networks [181], are susceptible to being deceived by adversarial examples [182]. Adversarial examples can be conceived as modified data instances resulting from small yet intelligently tailored perturbations made to original examples. Even if they are not visible to the human eye, such perturbations yield dramatic effects when processed through the machine learning model, provoking a wrong output with high confidence.
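The mechanism can be illustrated with a minimal sketch of a gradient-sign perturbation (in the style of the fast gradient sign method) applied to a toy logistic-regression classifier. The weights, the input, and the perturbation budget `eps` are hypothetical values chosen for illustration; real attacks on deep networks follow the same principle but use automatic differentiation.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def predict_proba(w, b, x):
    """Probability of the positive class under a logistic-regression model."""
    return sigmoid(np.dot(w, x) + b)

def gradient_sign_attack(w, b, x, y, eps):
    """One gradient-sign step on the cross-entropy loss with respect to the input x."""
    grad_x = (predict_proba(w, b, x) - y) * w  # d(loss)/dx for logistic regression
    return x + eps * np.sign(grad_x)           # each feature moves by at most eps

# Hypothetical model and input, correctly classified as positive (y = 1).
w = np.array([2.0, -3.0, 1.0])
b = 0.0
x = np.array([0.2, -0.1, 0.1])
y = 1

x_adv = gradient_sign_attack(w, b, x, y, eps=0.2)
print(predict_proba(w, b, x), predict_proba(w, b, x_adv))
```

In this toy case a perturbation bounded by 0.2 per feature moves the prediction across the decision threshold; in high-dimensional inputs such as images, far smaller per-pixel budgets can suffice because many small changes add up.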
Fig. 3 depicts a schematic diagram showing the different reasons why model verification and robustness assessment are of utmost necessity in the medical domain. XAI methods can help determine what a model observes in an input when predicting its output, ascertaining the presence of biases inherited from data or purposely inserted by adversarial attacks. Likewise, counterfactual explanations can also benefit from stronger input-output causal relationships discovered from data, stepping beyond the production of correlation-based counterfactuals to the generation of interventional what-if stories. This might be a major step in the medical AI field to transcend from verifiable models for diagnosis towards verifiable AI-based solutions for medical prescription and treatment.
Before proceeding further, we pause to highlight, once again, the importance of having a human-in-the-loop as the ultimate stakeholder to decide whether an AI-based model is robust enough [147]. Even if the verification process can be partly automated by XAI and causal inference methods, trustworthiness always requires a qualitative assessment of the overall verification process, both in terms of its starting assumptions (e.g., is a certain adversarial attack strategy for a medical AI-based model plausible and likely to occur in the context in which data are produced?) and the results it conveys (correspondingly, is the detected bias inherited from data? Can we reduce this bias by preprocessing or by improving the data collection process?). All in all, humans, even if we make mistakes, can be considered a robust proxy in decision making when informed with quantitative and well-summarized measures of algorithmic robustness.

Desiderata: Adversarial training can contribute to better robustness and explainability
A very different use of adversarial training is to make models more robust and interpretable. The work in [183] shows that adversarial training improves the interpretability of gradient-based saliency maps in medical imaging diagnosis of skin cancer. In particular, the saliency maps of adversarially trained convolutional neural networks (CNNs) are significantly sharper and more visually coherent than those of traditionally trained, non-adversarial CNNs. What many of these robustness tests highlight is the need for verification and validation methods for deep learning techniques beyond academic toy datasets. It is clear that much of the research effort has focused on overfitting deep learning models with ever-increasing numbers of parameters to a small selection of research benchmark datasets [184].
Even results reported in carefully curated international challenges such as PASCAL VOC [185] later turned out to be largely based on spurious correlations (e.g., ships were classified by the presence of water, or horses were linked to copyright watermarks). In a similar vein, popular text classification datasets have been shown to contain biases, meaning that only parts of the input are needed to make the correct predictions [186]. This type of cheating is also referred to as the ''Clever Hans effect'' [187].
Although natural image datasets have permitted incremental improvements and incredible advances in the field, they can be very different from real-life datasets, which are sparser, noisier and collected in uncontrolled settings. Language differences aside, similar conclusions can be derived from medical text data collected in diverse environments, which reflect cultural, geographical or individually-induced biases present in such data. Generalizing to real-life datasets is thus part of the desiderata of having robust machine learning models for medical application. For this to occur, we envision that explainability tools will become increasingly relevant, becoming a core part of prospective studies reporting successful real-world cases.

What for: The most important practical benefit of implementing this FRA theme is maintaining trust
In the medical domain, the use of AI methods that are verifiable, comprehensible and interpretable by human experts will not only be mandatory for legal reasons in the future, but also offers a number of other technical and non-technical advantages. From the technical point of view, developers gain a better understanding of the medical system endowed with AI-based functionalities, and are thus able to improve existing methods (e.g., by reducing complexity or model size) with increased knowledge about the niches and directions along which such improvements can be attained. Bias identification [188] and adversarial attack detection [189] are arguably the most evident examples of technical advantages granted by XAI methods for model verification.
Above all, the big advantage for the medical expert and the end user affected by decisions issued by verified medical AI models lies in the increased trust in their outcomes, the remaining responsibility of the human being (human-in-control) and the avoidance of bias and discrimination. Medical decisions can pose a turning point in the life of a patient, so trustworthiness regarding the suitability of decisions issued by such models is a must at many different levels of the medical workflow, from the diagnosis (confidence of predictions), to the design of the treatment (suitability of therapies or medication prescribed by a model) and the acceptance by the patient (causability to ensure that he/she understands that the AI-informed decisions are the best ones for his/her disease). When understanding this need for trustworthiness at multiple levels of the medical workflow, one can realize the enormous relevance of AI verification and explainability in the medical realm.

A unified view on the integrative role of information fusion in medical AI
Encoding multidimensional data, as well as tabular data and data of a temporal, sequential nature, remains an open challenge for the latest DL models, which must assimilate incomplete and irregular healthcare data. Reinforcement learning, together with explainable models to fully control this family of black-box AI models [190], can make better use of such data for sequential decision making from observational multi-modal data if meaningful representations are learned and used to represent a patient state [191].
In this context, local and global explanations are equally important, i.e., assessing machine learning model output with respect to a single input data point, also called ''decision understanding'' (e.g., as done by methods such as Local Interpretable Model-Agnostic Explanations, LIME [192], or Layer-wise Relevance Propagation, LRP [193]), but also verifying and certifying the full model at a global scale, also called ''model understanding'' [194]. Likewise, [195] advocates for explanations in cooperative decision making in medicine to be mutual, implying a continual fusion of explanations. Mutual explanations [196] are introduced in a context of transparent expert companions towards medical decision support systems where interactive and explainable HRI [197] machine learning plays a key role. Mutual explanations naturally rely on verbal explanations, i.e., on incremental dialog processes that provide human users of machine learning with trust and deeper involvement in the learning process. When explanations are not accepted, the human can not only ask for them but also correct them. This way, expert domain knowledge is used in learning and inference through explanation sketches that are applied as constraints for the inductive logic programming system Aleph.
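To illustrate what ''decision understanding'' for a single data point involves, the sketch below fits a distance-weighted linear surrogate around one input of an opaque model. It follows the general idea behind local surrogate explanations but is not the LIME library or its exact algorithm; `black_box`, `local_surrogate`, and all parameter values are hypothetical.

```python
import numpy as np

def black_box(X):
    """Hypothetical opaque model: uses features 0 and 1 and ignores feature 2."""
    return 1.0 / (1.0 + np.exp(-(3.0 * X[:, 0] - 2.0 * X[:, 1])))

def local_surrogate(predict, x, n_samples=500, scale=0.1, kernel_width=0.25, seed=0):
    """Fit a weighted linear model to predict() in a neighborhood of x."""
    rng = np.random.default_rng(seed)
    Z = x + scale * rng.standard_normal((n_samples, x.size))  # perturbed neighbors of x
    y = predict(Z)
    dist = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)         # closer samples count more
    A = np.hstack([Z - x, np.ones((n_samples, 1))])            # local offsets plus intercept
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[:-1], coef[-1]                                 # feature weights, intercept

coef, intercept = local_surrogate(black_box, np.zeros(3))
print(coef, intercept)
```

The signs and magnitudes of the returned weights indicate which features drive the prediction near this particular input; they say nothing about the behavior of the model elsewhere, which is why global ''model understanding'' is needed in addition.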
The verbal interpretability perspective [198] is achieved by ensuring that the model is capable of providing humanly understandable statements, e.g., logical relations, positive words leading to a conclusion, or verbal chunks or sentences [199] that indicate causality, and that the model produces explanations which are non-contradictory, non-redundant, fluent and cover all important aspects related to the prediction [200].
Also related to human expert alignment is the need to develop models for clinical acceptance. An example of such good practice is shown in [201], where the acceptance test is done through ratings by ophthalmologists on the correlation of the attribution method scores with diagnostic features. In this context, in addition to local explanations of a single sample, approaches to test global explanations such as TCAV (Testing with Concept Activation Vectors) [202] or SpRAy (Spectral Relevance Analysis) [187] are desired in order to explain beyond a single data point. However, they may not be fully considered global methods, as they only consider the set of all training examples from a given class [198]. Another critique of current Natural Language Processing (NLP) models provided with verbal interpretability is that they do not expose the actual underlying mechanisms used to generate texts. Generating free text explanations is often framed as a summarization task, either in extractive settings, where salient sentences from provided evidence documents are selected as explanations [200], or in abstractive settings, where, given evidence documents, the explanation is produced from scratch using a generative model [203]. While the latter (used, for example, for EHR generation from conversations in [204]) can result in more fluent explanations and incorporate further background knowledge not explicitly present in the evidence documents, neural generators are known to hallucinate fake facts [205]. Yet other works rely on hybrid approaches, where extractive summarization is followed by abstractive summarization [206,207]. However, as also advocated by [198], further work on providing explanations of the process and shape of the embedding optimization is needed.
The role of natural language in information fusion and XAI is twofold: on the one hand, language is one of the data modalities, in which complex facts and relationships are expressed, e.g., in electronic health records (EHRs) or medical literature. On the other hand, language is the prime channel of explanation: verbalizing the algorithmic reasoning enables the health practitioner to easily detect whether the reason for the algorithmic decision is acceptable.
For both variants, the use of cross-modal representations that link, e.g., textual, image and omics data will be crucial for AI on multimodal data, which are widely present in the medical domain. Challenges lie in the harmonization and curation of cross-modal datasets aligned across two or more modalities enabling the cross-modal transfer, either by learning a common subspace via methods such as DCCA [208] or by projection learning [209]. While suitable datasets are becoming available in the public domain, they are yet to be constructed for medical data.
For processing and generating language in a transparent way, future work will have to concentrate on NLP models with provenance, i.e., models that provide the data on which their output is based. In the case of automatic summarization, for example, this would be the statements that lead to the formulation of a summarizing sentence; for semantic processing it could be the use of hybrid models that combine sparse representations [74] with dense representations, e.g., [210]. For Transformer-based architectures (e.g., [159]), in the absence of human rationales to train a model to generate explanations, this could be realized with attention scores, although they only loosely correspond to human-acceptable explanations [13,211,212]. An alternative could be to investigate the utility of diagnostic properties, such as Faithfulness, Dataset Consistency and Confidence Indication [71]. These have been shown to be useful for automatically evaluating the quality of explanations, and might be suitable as objectives for generating explanations in an unsupervised way. Another option is the use of (non-transparent) NLP technologies to identify and extract information with provenance, as for example done in [213] for metadata extraction from biomedical literature to increase reproducibility of studies.
Metrics worth assessing beyond model understanding through subspace explanation (MUSE) include fidelity (based on instances of disagreement between model and explanation), unambiguity (in terms of rule overlap and cover), or interpretability (in terms of triple rule set size, width, and predicate size) [214].
While one strand of future methods strives for high-quality data in order to produce better predictions, the requirements to deploy AI systems in medicine also call for natural handling of noisy and incomplete data, which is much more realistic in healthcare, where many information silos exist due to the distributed nature of domain expert knowledge bases and their respective EHRs. Along this line, techniques that complete partial data from missing sensor readings through data-level and feature-level information fusion to improve the overall data quality include, for instance, kernel random forests in fog computing for heart disease prediction [215]. Another example showing improved results with extra fused data is the use of self-attention architectures on CT image and non-visual features for immunotherapy treatment response prediction [216]. In fog computing, an approach similar to federated learning in terms of data decentralization, it is not possible to access all data at once. However, fusing the different sensors available for different users makes all data actionable [131], and the full set richer and of better quality. Recent work showed that it is even possible to train largely personalized models in such distributed settings [217]. Another strand advocates for approaches that incorporate natural handling of anomalies and outliers [218], as well as of incomplete, dirty and irregular datasets, as a common feature of medical AI systems [219]. The latter work also warns of the potentially large impact of unintended consequences of machine learning in medicine from an empirical and technical viewpoint. These and other pitfalls in data-driven decision making [131] are to be considered in the development of the frontier topics discussed in this paper, hand in hand with experts-in-the-loop.
Integrative computational biology and AI algorithms play a central role in precision medicine. Individual analyses can be combined using multiple networks, including transcription regulatory, microRNA-gene, physical protein interactions, metabolic and signaling pathways [220]. Such analyses help identify better prognostic and predictive signatures, drug mechanisms of action, combination therapies, and possible novel drug targets. These networks can be further annotated with tissues and diseases to form richly-annotated typed graphs, which in turn can be analyzed with graph theory algorithms to form explainable models. For example, Bhattacharyya and colleagues integrated a pathway-based patient model with a multi-scale Bayesian network to predict specific treatment options [221]. Similarly, exploring the possible links between AKT1 (Akt is a Protein kinase B that plays a key role in glucose metabolism, apoptosis, cell proliferation, transcription and cell migration) and BTK (Bruton's tyrosine kinase that plays a crucial role in B cell development and signaling), we obtain 1,862 proteins connected by 2,324 edges (i.e., direct physical protein interactions, 437 unidirectional, 84 bi-directional and the rest non-directional), as shown in Fig. 4. The network in this figure highlights which of the interactions are relevant to arthritis, neuro-degenerative diseases, or cognitive disorders.
Importantly, once a hypothesis and model are created from an integrative analysis, such as the one highlighted in Fig. 4, one would need to select the most appropriate, and ideally the least costly, organism to act as the model for further functional studies and validation. Considering this network, the mouse would be the best model organism, as about 98% of all interactions in the network are conserved from human to mouse, while the rabbit has only 33% of the network conserved, and fly, worm and yeast have none of these interactions present (Fig. 4, a). Using analogous selection, the most relevant tissues for functional validation include adipose, lung, spleen and bone (81%-85%), falling to just around 50% for heart and brain (Fig. 4, b). Considering diseases, only cancer has a substantial set of annotated interactions in this network, with almost 60% of the network being annotated to diverse cancers (Fig. 4, c). (See Fig. 5.)
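The selection logic described above can be sketched as a simple ranking of candidate organisms by the fraction of network interactions they conserve. The sketch below is a toy illustration under a simplifying assumption: an interaction counts as conserved when both partner proteins have an ortholog in the candidate organism. The intermediate proteins `P1` and `P2`, the organism names, and the ortholog sets are hypothetical placeholders, not data from Fig. 4.

```python
def conserved_fraction(edges, orthologs):
    """Fraction of interactions whose two proteins both have an ortholog in the organism."""
    kept = [e for e in edges if e[0] in orthologs and e[1] in orthologs]
    return len(kept) / len(edges)

# Toy network linking AKT1 and BTK through hypothetical intermediates P1 and P2.
edges = [("AKT1", "P1"), ("P1", "BTK"), ("AKT1", "P2"), ("P2", "BTK")]

# Hypothetical ortholog sets for two candidate model organisms.
orthologs_by_organism = {
    "organism_x": {"AKT1", "BTK", "P1", "P2"},
    "organism_y": {"AKT1", "BTK", "P1"},
}

scores = {org: conserved_fraction(edges, orth) for org, orth in orthologs_by_organism.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

The same scoring function can be reused for tissue or disease annotations by replacing the ortholog sets with the sets of proteins (or interactions) annotated to each tissue or disease.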

Discussion
As we have seen in previous sections, for AI models in medicine, there are several concerns with respect to the development of these frontier topics. Besides them, another dimension of great concern in medicine, whose importance can be exacerbated upon the fusion of multimodal data, is the privacy and confidentiality awareness of medical AI-based models. Indeed, compliance with patient privacy normally hinders medical AI methods from excelling in practical settings for a variety of reasons, such as the increased difficulty of collecting data, restrictions on their use following ethical and legal constraints, or the potential performance penalty incurred when data are encoded prior to modeling. Ideas using the concept of differential privacy [222], privacy-preserving representations [223] or along the lines of privacy distillation [224] are key to further develop this line of work. Privacy distillation [225] allows patients to decide the type and amount of information they disclose to healthcare information systems while retaining the model accuracy under a sufficient subset of original privacy-relevant features. The idea behind this model-agnostic mechanism is to balance accuracy of the model with the redacted inputs of users. An example of application in a DL regression setting for dose prediction is given in [225], where it is shown to reduce the amount of over-prescriptions and under-prescriptions of warfarin. To sum up, we foresee that the growing amount and diversity of patient, medical and clinical information combined and flowing together into medical processes relying on AI-based models will give rise to unprecedented challenges regarding the privacy of sensitive data, calling for overarching strategies that maintain the confidentiality of protected patient information throughout the process.
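As a minimal illustration of the differential privacy idea mentioned above, the sketch below releases a patient count with Laplace noise calibrated to the privacy parameter epsilon. It shows the textbook Laplace mechanism for a counting query (sensitivity 1), not the specific methods of [222]-[225]; the count of 120 and the value of epsilon are hypothetical.

```python
import numpy as np

def laplace_count(true_count, epsilon, rng):
    """Release a count with Laplace noise; a counting query has sensitivity 1."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(0)
true_count = 120   # hypothetical number of patients with a given diagnosis
epsilon = 1.0      # smaller epsilon means stronger privacy and more noise

# Repeated releases are drawn here only to inspect the noise distribution;
# in practice each release consumes privacy budget.
releases = np.array([laplace_count(true_count, epsilon, rng) for _ in range(20000)])
print(releases[:3], releases.mean(), releases.std())
```

The noise is unbiased, so aggregate statistics remain usable, while any single released value reveals little about whether one particular patient is in the dataset. This is the accuracy versus privacy trade-off discussed in the text.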
One size does not fit all. While AI can solve standard cases with similar accuracy to human experts, it cannot yet beat human specialists. However, we rather stand with the synergy that flourishes when AI and the specialist collaborate, feeding each other with knowledge that allows them to perform better, more robustly and more reliably in their respective tasks. Human-in-the-loop systems would benefit from AI approaches, and even more from an ensemble of AI systems, implemented using different approaches and algorithms, and trained and validated on different patient cohorts. Conversely, AI-based systems can leverage the qualitative verification of the knowledge captured from data, as well as the conformity of explanations with the medical expertise and the evidence recorded over the medical workflow.
Ignoring the implications of improper usability planning may lead to incorrect results and reduced applicability. This requires one to weigh sensitivity against specificity according to the use case, e.g., specialist vs. general use or screening vs. treatment planning. It is also important to ensure a clear understanding of limitations based on validation, i.e., which patient cohorts may or may not be appropriate for a given trained model. Besides explicitly acknowledging and recognizing the limitations of these AI models and resulting systems, patient-centric medicine requires models to provide specific confidence and uncertainty estimates on the recommendation for each patient, rather than simply provide broad accuracy measures across cohorts.
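The sensitivity versus specificity trade-off can be made concrete with a small sketch that evaluates the same risk scores at two decision thresholds. The scores, labels, and thresholds below are hypothetical and chosen only to show how a low threshold suits screening while a high threshold suits confirmatory use.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical ground truth (1 = disease) and model risk scores for ten patients.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.35, 0.65, 0.4, 0.3, 0.2, 0.1, 0.05]

screening = sensitivity_specificity(labels, scores, threshold=0.3)     # miss no cases
confirmatory = sensitivity_specificity(labels, scores, threshold=0.7)  # avoid false alarms
print(screening, confirmatory)
```

Cohort-level numbers of this kind still say nothing about how certain the model is for one specific patient, which is why per-patient confidence or uncertainty estimates are needed in addition.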
To realize this holistic vision, it is important that ongoing studies dealing with medical AI are verified swiftly, providing informed evidence that AI-based models for medical practice can be trusted. On the other side of the coin, research retractions should be managed and resolved quickly, as done in recent COVID-19 related research contributions (e.g., Mehra et al. (2020) [226] in The New England Journal of Medicine and Lancet, Mulvey et al. (2020) [227] in Annals of Diagnostic Pathology, and Zeng et al. (2018) [228] in Lancet-Global Health).
However, the process takes a long time: mistakes are usually detected and retracted within months, but fraud often takes years [229]. This has direct, negative implications for evidence-based medicine, and a significant impact on computational biology and AI. Considering requirements for training and validation of AI systems, data from retracted papers may affect a large number of workflows and analyses, leading to incorrect models and interpretations. That AI systems were trained or validated on flawed data may not be obvious immediately, and even when the paper is retracted, the data will likely exist in multiple forms on the Web for years after.
To circumvent this latter issue, online data repositories are crucial, but stringent curation processes are essential to ensure high quality, reliable and properly annotated data. For example, the IMEx consortium [24,25][230][231][232] curates interaction data from published literature to enable integrative computational biology analyses, and to ensure the implementation of data-driven medicine and the correct analysis and interpretation of model results. The availability of such curated repositories, and evidence of real-world AI-based models that largely rely on advances over the frontier topics reviewed in this position paper, would free up specialists by solving straightforward cases automatically, and comprehensively characterizing complex cases for further consideration and inspection.
Finally, the important element of visualization must be included, because it is ultimately what is presented to the expert end user [233]. The findings and knowledge from the long-established domain of visual analytics [234] must therefore be comprehensively taken into account and integrated into future overall solutions [235] to build new human-AI interfaces supporting explainability and causability [236].

Conclusion
From our experience, we have identified and outlined three key Frontier Research Areas that need to be developed hand-in-hand within AI and the application fields. These frontier topics would benefit enormously from a Frontier Development Lab, an example of successful implementation being the SETI-NASA-ESA FDL program for AI, Space and Earth Sciences, which benefits from a catalytic environment for tackling some of the most challenging interdisciplinary research problems. In similar synergy, future biomedical AI would benefit from cross-domain research teams solving challenges in the context of cross-science problems. New additional PhD schools that take such a research-based approach can be of help.
Experts at the seam between AI and medicine are urgently needed worldwide. In the European Union, there is a dramatic shortage of qualified experts who understand both domains. Industry is desperate for properly trained professionals [237]. Additionally, these experts also need to understand ethical and legal issues, and how and by whom AI is used. This requires that future experts are not only theoretically educated in ethical and legal aspects, but also given the opportunity to put them into practice in both healthcare institutions and industry. This is where agile, human-centered AI design methods can be beneficial (refer to Fig. 1).
In our holistic vision of medical AI, we highlight the cohesive role of information fusion as a technology to transport all medical data modalities through the frontier research areas. New challenges around multi-modal explanations, causality (cause-effect) and causability (quality of explanations) analysis are still to be addressed by the research community in order to achieve fully trustworthy and robust medical AI-based systems, together with new types of human-AI interfaces and supportive visualizations.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 3. Schematic diagram exemplifying the different circumstances under which the robustness of a medical AI-based system (in this case, for diagnosing a melanoma) must be verified: adversarial attacks, counterfactual explanations and biases. Causal inference and explainability methods can enable automated means to perform such a verification procedure.

Fig. 4. Exploring the connection between AKT1 and BTK. The physical protein interaction network from the Integrated Interactions Database (IID v.2020-05) [37] highlights the Gene Ontology biological process (node color) and disease annotation from DisGeNET (edge color); specifically, arthritis, neurodegenerative diseases, cognitive disorders, and their overlap (thicker, darker color edges).
Fig. 2. Counterfactual explanations (a) treat features as independently manipulable inputs to a given fixed and deterministic classifier $h : \mathcal{X} \to \{1, \dots, K\}$ trained to make decisions about i.i.d. samples from the data distribution $P_X$. In the causal approach to algorithmic recourse taken in this work, we instead view variables as causally related to each other through a structural causal model $\mathcal{M}$ (in (b)) with associated causal graph $\mathcal{G}$ (c) [129]. Causal inference and counterfactual prediction for actionable healthcare are discussed in Prosperi et al. (2020):
1. Abduction: first compute the posterior distribution over the background variables given the factual observation;
2. Action: perform the intervention to obtain the new structural equations; and
3. Prediction: then compute the counterfactual distribution induced by the resulting structural causal model and the posterior over the background variables.
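The abduction, action, and prediction steps named in this caption can be sketched on a toy structural causal model. The model below (X1 := U1, X2 := 2*X1 + U2) and the observed values are hypothetical and chosen so that abduction is exact; in realistic models the posterior over background variables is a distribution rather than a single value.

```python
def abduct(x1_obs, x2_obs):
    """Abduction: invert the structural equations X1 := U1, X2 := 2*X1 + U2."""
    u1 = x1_obs
    u2 = x2_obs - 2.0 * x1_obs
    return u1, u2

def counterfactual_x2(x1_obs, x2_obs, x1_new):
    """Counterfactual value of X2 had X1 been set to x1_new for this individual."""
    _, u2 = abduct(x1_obs, x2_obs)  # 1. abduction: recover the background variables
    x1_cf = x1_new                  # 2. action: replace the equation for X1 by do(X1 = x1_new)
    return 2.0 * x1_cf + u2         # 3. prediction: propagate through the modified model

# Hypothetical individual observed with X1 = 1 and X2 = 3 (so U2 = 1).
x2_cf = counterfactual_x2(x1_obs=1.0, x2_obs=3.0, x1_new=2.0)
print(x2_cf)  # 5.0
```

The counterfactual differs from a purely predictive estimate because the individual's own background variable U2 is carried over from the factual observation instead of being averaged out.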