A systematic review of biologically-informed deep learning models for cancer: fundamental trends for encoding and interpreting oncology data

Background There is an increasing interest in the use of Deep Learning (DL) based methods as a supporting analytical framework in oncology. However, most direct applications of DL will deliver models with limited transparency and explainability, which constrain their deployment in biomedical settings.
Methods This systematic review discusses DL models used to support inference in cancer biology with a particular emphasis on multi-omics analysis. It focuses on how existing models address the need for better dialogue with prior knowledge, biological plausibility and interpretability, which are fundamental properties in the biomedical domain. For this, we retrieved and analyzed 42 studies focusing on emerging architectural and methodological advances, the encoding of biological domain knowledge and the integration of explainability methods.
Results We discuss the recent evolutionary arc of DL models in the direction of integrating prior biological relational and network knowledge (e.g. pathways or Protein-Protein-Interaction networks) to support better generalisation and interpretability. This represents a fundamental functional shift towards models which can integrate mechanistic and statistical inference aspects. We introduce a concept of bio-centric interpretability and, according to its taxonomy, we discuss representational methodologies for the integration of domain prior knowledge in such models.
Conclusions The paper provides a critical outlook into contemporary methods for explainability and interpretability used in DL for cancer. The analysis points in the direction of a convergence between encoding prior knowledge and improved interpretability. We introduce bio-centric interpretability, which is an important step towards the formalisation of biological interpretability of DL models and towards developing methods that are less problem- or application-specific.
Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05262-8.


Background
There is an increasing interest in the use of Deep Learning (DL) based methods as a supporting analytical framework in oncology. Recent works have articulated the potential applied impact of DL-based methods in oncology including drug response prediction [1,2], cancer diagnosis or prognosis [3][4][5][6][7][8] and the overall impact of this emerging analytical substrate to deliver the vision of precision and personalised medicine [4,9]. Despite not being mainstream methods at this point, these architectures point in the direction of addressing existing paradigmatic analytical gaps currently faced by more traditional inference frameworks, including the tension between small study cohorts and the increasingly large and complex sets of features available per patient (p >> n).
However, most direct applications of DL will deliver models with limited transparency and explainability, which constrain their deployment in biomedical settings. In this systematic analysis we tackle an aspect commonly acknowledged but left almost untouched, namely how authors understand and use the definition of biological interpretability, and how it relates to the growing spectrum of biologically-informed models, which integrate prior biological knowledge within existing DL frameworks. This paper provides a systematic review focused on omics-based DL models used in cancer biology, highlighting the dialogue and convergence between biologically-informed models, explainable AI (XAI) and biological interpretability. In this sense, it complements recent surveys on Machine Learning (ML) methods [10][11][12] and DL methods [13,14] developed for biomarker identification. Moreover, it supports the argument in favour of integrating multi-omics data in AI pipelines, which is already regarded as important and advantageous over single-omic data (more on this topic in [15][16][17][18]). We perform a systematic review, identifying the motifs within emerging architectures: the domain knowledge which is integrated in the design of the models, data representation aspects and emerging architectures, ranging from biological networks and graphs to embedding models. Finally, we introduce the concept of bio-centric interpretability in DL models, which augments the contemporary Explainable AI (XAI) taxonomies and emerges as a fundamental property and desideratum of biologically-informed DL.
Addressing the above-mentioned gaps, we defined the following research questions:
1. What are the perspectives of interpretability across different DL-based frameworks within the cancer research domain?
2. What are the methods that deliver biological interpretability?
3. What are the desirable approaches to integration of domain knowledge in the models' architecture?
4. What are the emerging representation paradigms within these models?
Recent works in the area of XAI provide an extensive discussion on the properties and desiderata of explainability methods [38][39][40][41][42][43][44]; however, they do not discuss their reception by a specific user, the biomedical expert. In the area of medical XAI, according to Holzinger et al. [45], in order to satisfy the need for trustworthiness at multiple levels of the medical workflow, the main frontier topics are: verification and explainability methods, inference of complex networks, and graph causal models and counterfactuals. Our proposed concept of bio-centric interpretability encapsulates all of them.
This systematic review is restricted to the context of multi-omics based DL in cancer biology, excluding papers from the computer-vision subarea. The sub-field of ML and AI in biomarker identification is discussed in [10][11][12][13]46]. A recent study by Dhillon et al. [14] examines state-of-the-art feature selection, ML and Deep Learning approaches to uncover markers in single and multi-omics data. Zhao et al. [47] investigate the aspect of reproducibility in models applied to transcriptomics data. For a review of sequence-to-activity and sequence-based DL models, the reader is referred to [29]. For a critical introduction to the application of interpretable genomics, see [48]. Another subfield, Deep Learning in drug response prediction, is discussed in [49][50][51][52].
The importance and advantages of the integration of multi-omics in AI algorithms over single omics are presented in [15][16][17][18]. A summary of recent data integration methods and frameworks is available in [53]. Alharbi and Rashid [54] catalogue different DL tools/software in different subareas of genomics for various predictive tasks and discuss the data types in genomics assays, providing guidance on which DL architecture to use. Mo et al. [55] discuss data integration and contrast DL methods with mechanistic modelling.
These reviews provide a comprehensive overview of current DL modelling techniques and their existing genomic applications. However, they do not elaborate specifically on the integration of domain knowledge into the model and its impact on interpretability. In this review, we focus on the dialogue between post-hoc explainability (regarding its internal mechanisms and the interpretation of the model's output) and the encoding of prior biomedical knowledge, thus discussing the contribution of AI for supporting the understanding of oncogenic processes, in particular, the methods for integrating existing domain knowledge (DK) into DL models (Additional file 1: Table S1). We highlight the dialogue between explicit and latent representations.
The paper is organized as follows: First, we substantiate the concept of bio-centric interpretability and explain its three key aspects and four main components. Second, we define a taxonomy for the integration of domain knowledge into models, which is specific for biologically-informed DL models. Then, we provide a detailed review of the DL models for cancer: their architectural patterns, methods of integrating domain knowledge and interpretability, and observed trends. We describe 42 selected papers divided into thematic blocks that correspond to the new concept and proposed taxonomy. In the Discussion we highlight the prevalence of graph representation, sparse connections as a key design feature and improved support for biomarker discovery. We then summarize the findings in the context of the four research questions. The paper concludes with a summary of the main findings and future perspectives in the field of DL models for cancer. The last section, Methods, contains the details of the paper selection criteria and the data extraction form. A diagrammatic outline of the discussion is depicted in Fig. 1.

Results
The search of electronic bibliographic databases (PubMed and Web of Science) identified 661 records, which were reduced to 591 after removing duplicates. The 591 records were screened on the basis of prespecified inclusion criteria, resulting in 176 records. All these potentially relevant articles were read in full text. The reasons for the exclusion of papers were as follows: papers provided methods that are not directly linked to cancer and functional analysis/insights on biological processes; papers provided models based on DL and ML using clinical/laboratory data alone; or papers were based on microarray data or developed a sequence-based algorithmic framework. A list of eligible studies was created and resulted in 42 studies (Additional file 1: Table S1). The PRISMA checklist is provided as Additional file 2: Table S2.

Emerging methodological paradigm: bio-centric model interpretability
Explainability and interpretability are considered key desiderata of machine learning (ML) models (e.g. [22]). They are thought to prevent the risks of misuse of machine learning models embedded in healthcare applications. Model transparency and explainability are required to deploy AI-derived biomarkers in clinical settings. In addition, the transparency of interpretable methods can minimise the risks in AI-based decision-making in healthcare applications. It is by definition impossible to appeal to decisions resulting from a DL model that are not presented in an understandable manner and cannot be explained in biomedical terms and grounded in current biomedical reasoning. In biomedicine, the predictions and metrics calculated from these predictions alone are insufficient to characterise the model. The existence of multiple types and definitions of models' interpretability makes it difficult to formulate a precise definition of biological interpretability in a cancer biology setting. When is it valid to say that the ML model used in cancer biology is interpretable? The lack of a formal definition needs to be addressed and points to an unmet research gap. Benk and Ferrario [56] introduced three different dimensions of the need for interpretation: epistemic, pragmatic and ethical. In biology, the impact of these models from a scientific epistemology setting needs to be considered as, at their limit, emerging AI methods bring the promise of integrating heterogeneous evidence and mechanistic and statistical inference paradigms. These methods can ultimately impact fundamental notions of what constitutes a valid scientific argument, bringing alternative perspectives to the notion of statistical significance.
Despite high demand, interpretability remains one of the biggest challenges for bringing these models into a real-world setting. In the AI and ML fields, there is a well-known trade-off between how well the model performs and how well people are able to interpret it [40,56,57]. Additionally, there is no consistent agreement on definitions of interpretability. One of its definitions directly refers to the components of interpretable models such as transparency ('how does the model work?') and post-hoc explanations ('what else can the model tell me?') [40]. It identifies two main objects for interpretation: i) the internal mechanisms, i.e. how the models compute their outcomes, and ii) the outcomes generated by the model. Similarly, according to known taxonomic accounts [32], interpretability can be: algorithmic-centric, focusing on the inner-working of the model; or output-centric, highlighting the model agnostic post-hoc analysis.
In the context of DL, we replace algorithmic-centric with architecture-centric interpretability. We argue that more emphasis and inference is put on the structure of the model rather than the learning process of the DNN (via backpropagation algorithm).
In order to derive biological insights from the model, an interpretation by a biological expert is required regardless of whether the approach is architecture-centric or output-centric. Both of them need to favor mapping the biological mechanisms to the models' components, aiming at delivering an interpretation for the intended end user (i.e. biologist, oncologist) which relies more on biological knowledge than on DL or mathematical knowledge. More specifically, a preferable format of model transparency would be a biological mechanism integrated in the model's architecture (e.g. gene activation pathway) or calculations mimicking biological processes (e.g. mimicking typical molecular biology assays that study functional genomics), in complement to state-of-the-art explainability methods borrowed from other fields. Some formats have already been successfully applied to transcriptomic data, such as the integration of DK of gene modules, or the integration of hierarchical information about molecular subsystems involved in cellular processes. Such models provide informative biological interpretation of the predictions by studying the activation of the various subsystems embedded in the model architecture and, moreover, they can make it possible to infer the activity of latent factors as a priori characterized gene modules. The interpretation of the biological expert allows for evaluation of biological plausibility and satisfiability of biological constraints.
Hence, in this paper we revisit the notion of interpretability to ground it in a biomedical context, introducing the concept of bio-centric interpretability. It encompasses three key aspects which lead to biological understanding of the investigated problem and new insights:
• Architecture-centric interpretability
• Output-centric interpretability
• Post-hoc evaluation of biological plausibility
We argue that evaluation of the DL model regarding bio-centric interpretability requires an analysis of all these aspects at once. These three aspects are evaluated via the analysis of the four bio-centric interpretability components:
• The integration of different data modalities
• The schema level representation of the model
• The integration of domain knowledge
• Post-hoc explainability methods

The integration of different data modalities.
Cancer is a complex and multi-faceted disease with a landscape of features that can separately or together influence treatment responses and patient prognosis. Important biological relations can be expressed in more than one data modality, e.g. potential cancer driver genes can be represented through integration of copy number, DNA methylation and gene expression data. Therefore, combining different data modalities in the DL model, including different types of omics data, is imminent as the field evolves and inherent if biological processes are modelled. Only then can the biologically-informed model reveal both established and novel molecularly altered candidates which can be implicated in predicting advanced disease.

Schema level representation of the model.
Understanding the data flow in the model is crucial for the post-hoc interpretation by an expert user. This is affected by how the data is represented in subsequent components of the model. Usually, collected multi-omics data is stored in tables (matrices). However, over a series of computational steps, the representation can change into graphs, networks, eigenvalues or eigenvectors, among others. Each representation has its own specific properties and is processed by specific architectural elements in the model, e.g. Graph Neural Networks and Graph Convolution Networks (for graphs). Thus, in the context of bio-centric interpretability, it is crucial to understand these representations, how they transform and how to communicate such transformation during the post-hoc inference. The underlying dialogue between the input data model and the architectural structure of the model requires a schema level representation, which then allows for domain expert interpretation and inference.

The integration of domain knowledge.
A key aspect, which significantly impacts all three components, is the integration of domain knowledge into the model. A biologically-informed DL model can and should make use of databases that contain an abundance of known biological relations. Later in the paper, we compare and contrast emerging approaches to DK integration and its close dialogue with DL architectures, indicating which model has the highest potential for improving bio-centric interpretability.

Post-hoc explainability methods.
An inherent property of a DL model is its ability to derive latent features reflected in a large space of weighted connections between neurons. Even when the model's architecture resembles biological relations, post-hoc explainability methods must be applied to allow for tracing back the information flow, highlighting the importance (and unimportance) of the model's components. More specifically, when investigating an individual output it is necessary to identify the key neurons, connections or layers that most impact the prediction, as well as those that do not.

Encoding domain priors: Improving bio-centric interpretability and integrating relational knowledge
In cancer, AI/ML is emerging as a methodological enabler to transform omics data into biomarker panels that can diagnose, predict or report on the effectiveness of interventions in the disease. More recently, some of these methods have concentrated on the integration of symbolic-level, explicit domain knowledge into the models. Domain knowledge can be understood as the information so far accumulated in a given field (here: pathways, PPI networks, Gene Ontology), usually expressed as known relational knowledge. In many cases, this knowledge is available in well-known curated databases and expressed in canonical data models that can be integrated in a computational pipeline. The taxonomy for explicit knowledge integration within the informed ML framework proposed by von Rueden et al. [58] includes: (i) source of knowledge; (ii) representation of knowledge; and (iii) integration of knowledge in the ML pipeline. Each dimension contains a set of elements showing different approaches that can be observed in previous literature. Knowledge sources can be classified according to their degree of formality. They range from rigorously expressed scientific knowledge (derived from any scientific discipline) to expert-derived statements (mapping, for example, clinical experience). A further source is more or less formalised, more general scientific knowledge (also known as world knowledge), situated at a basic expertise level within that domain (e.g. that the body is composed of cells; that there is DNA inside the cell nucleus; that cancer is a disease of the genome, etc.); we found this general scientific knowledge not relevant in the context of this work.
Domain knowledge can be integrated into the model to improve its consistency, reliability and biological plausibility as well as for supporting better generalisation. As proposed by von Rueden et al. [58] this can be done in a variety of ways, such as incorporating DK into basic training data (e.g. pre-processing), hypothesis set (e.g. sparse connections between neurons), established relational data, learning algorithm (e.g. cost function), and final hypothesis (e.g. model's architecture). On the other hand, DK is needed in order to extract the scientific outcome from the model or from individual elements of the model, and/or to explain such outcome. For example, based on DK, the contributions of specific model components can be better localised and investigated.
In addition, DK can be used in a post-hoc setting, where the scientific credibility and consistency of the results are cross-validated within existing knowledge. Results that do not match the existing knowledge can be rejected or flagged as incorrect or suspicious, so that the final result is consistent with prior knowledge.
In this paper we define a taxonomy which is more specific for biologically-informed DL models (inspired by von Rueden [58]). We suggest three main categories of DK integration:
• Input data pre-processing (PRE): DK is used to enrich or augment the input data, which results in a change of data representation. Scaling or normalisation is excluded from this category.
• Architecture definition (ARCH): DK explicitly impacts the model architecture, such as connections between neurons and layers.
• Post-hoc comparison (POSTHOC): DK is used to investigate and explain the outcome of the model. The DK is used to process the outcome and compare it to current, known biological relations.
Multiple types of DK integration can be observed in a single model. Of note, a pre-requisite of developing any DL model in cancer biology is to understand the target domain, needed at least to define the input and output, and to qualitatively or quantitatively evaluate this output. Despite acknowledging the expert knowledge of the authors of the models, we do not consider it as explicitly integrated domain knowledge. We consider DK integration to be post-hoc when the output is compared with information derived from external knowledge, or the representation of the output is changed (e.g. a vector to a graph) by using DK, so that the biological plausibility can be validated.
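As a minimal sketch of the ARCH category, assuming a hypothetical gene-to-pathway membership table, prior knowledge can be turned into a binary mask that removes every gene-to-pathway connection not supported by the DK. The names `pathway_membership`, `build_mask` and `masked_forward`, as well as the toy pathways, are illustrative assumptions and are not taken from any of the reviewed models.

```python
import numpy as np

# Hypothetical prior knowledge: which genes belong to which pathway.
pathway_membership = {
    "pathway_A": ["TP53", "MDM2"],
    "pathway_B": ["EGFR", "KRAS", "TP53"],
}
genes = ["TP53", "MDM2", "EGFR", "KRAS", "BRCA1"]


def build_mask(genes, membership):
    """Binary (n_genes x n_pathways) mask: 1 where DK links a gene to a pathway."""
    pathways = list(membership)
    mask = np.zeros((len(genes), len(pathways)))
    for j, pathway in enumerate(pathways):
        for gene in membership[pathway]:
            mask[genes.index(gene), j] = 1.0
    return mask


def masked_forward(x, weights, mask):
    """Sparse layer: weights outside the mask are zeroed before the product."""
    return np.tanh(x @ (weights * mask))


rng = np.random.default_rng(0)
mask = build_mask(genes, pathway_membership)
weights = rng.normal(size=mask.shape)
x = rng.normal(size=(3, len(genes)))  # 3 samples x 5 genes (toy data)
pathway_activations = masked_forward(x, weights, mask)
```

In this toy setup the fifth gene belongs to no pathway, so changing its value leaves `pathway_activations` unchanged; each hidden unit can therefore be read as one pathway fed only by its member genes.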
An outline of the three categories of DK integration is shown in Fig. 2. The results for selected papers according to proposed taxonomy are summarized in Additional file 1: Table S1.

Trends in DL models for cancer
The prominent explanation for the high heterogeneity observed in cancer may be the organisation of genes in various signalling/regulatory pathways and protein complexes. Cellular-level processes and responses are carried out by spatially and temporally organized sets of interacting entities such as proteins or RNA molecules. It is fundamental to understand how these interactions lead to biological processes. The conventional approach to studying biological processes is based on molecular interaction networks between individual biological molecules, represented as nodes with edges describing the interactions between a pair of nodes [59,60]. There are multiple types of biological interaction networks that represent different biological mechanisms and are based on different types of interaction [61]. Many of these biological interactions are publicly available through various specific databases such as KEGG [62], Reactome [63], among others.
They can be leveraged as DK to deliver a mechanistic and relational inference component which can be integrated into a statistical-probabilistic framework (Fig. 3).
In pathway-level representations, sets of pathway genes are subsumed into pathway nodes, and the interactions between the individual genes are collectively involved in biological processes such as cell proliferation and death. Thus, malfunction of the pathways can lead to disease. Taking into account the topology of gene interactions as prior knowledge may further help to characterise new genes or disease modules. Many network models have been developed to use known gene-gene interactions for prediction, based on the assumption that interacting genes tend to produce similar phenotypes. New biomarkers discovered by the DL model can be tracked inside the model more easily when the model's design conforms to biological relations.
The biological pathways can be integrated as curated knowledge on molecular relation, reaction and interaction networks, covering metabolism, cellular processes, organismal systems and human diseases, and they are widely used to analyse omics data. The pathway construction function can be either a data-driven objective (DDO) or a knowledge-driven objective (KDO) [64]. The former is used to establish gene or protein associations identified in a particular experiment. Knowledge-driven pathway construction is associated with the development of a detailed knowledge base for specific areas of interest. There are various approaches to mapping the organisation of cellular functions using molecular interaction networks in which the edges represent interactions between genes, proteins or metabolites. Protein-protein interaction (PPI) data are used to construct networks of reactions important for the regulation and implementation of most biological processes in which proteins have been shown to interact with functionally related proteins. Such an organisation results in the emergence of 'functional modules', i.e. functionally related sub-networks in which there is a statistically significant aggregation of nodes with an associated cellular function. Co-expression data, genetic interaction data, and combined data types have also been used to generate similar molecular interaction networks.

Fig. 2 Bio-centric interpretability scheme in the overview of a biologically-informed DL model. Grey boxes: three interpretability aspects

Data augmentation with domain knowledge
In this subsection we focus on domain knowledge being used to pre-process the input data in order to change its representation by enrichment or augmentation: from measured omics values as matrices into pathways, networks and graphs (Fig. 3A). First, we discuss how the knowledge of pathways derived from databases was integrated into the model in the reviewed studies.
Fig. 3 Data representation paradigms and the impact of the integration of domain knowledge. Domain knowledge (DK) can be derived from a database (blue blocks) or expert DK (yellow blocks). DK can be used in pre-processing and data augmentation before the training process. DK from databases can be represented in two ways: (A) as a step in the pre-processing of input data, before the training process. This first paradigm has emerged for the representation of multi-omic data, which are transformed into graphs or a network and fed into GNN or GCN. This paradigm has been applied to DL models such as: struc2vec, GLUE, several GCN and CNN models; (B) as inductive bias when creating the neural network architecture, defining the connections between nodes in layers. In this case, DK impacts the training process as it affects the back-propagation. This paradigm has emerged mainly for the representation of multi-omic data, which are fed into sparsely connected Deep Neural Network, where connections are defined by biological relations. This paradigm has been applied to DL models such as: VNN, PNET, KPNN, VAE, CNN

At an input level, pathways are mapped to scores, graphs or images. In the approach of Oh et al., multi-omics data at a gene level is converted into pathway-level profiles, and then Principal Component Analysis is applied to extract 3 principal components (PCs). Then, vectors containing the PCs of all pathways are represented as a pathway image of a sample (a set of pixels) combining all multi-omics data. Images are the input to the CNN model. As an explainability method, Grad-CAM [66] was used to identify pathways impacting cancer survival predictions by identifying the parts of an image that are most discriminative. The authors assumed that relevant pathways were more likely to be detected if they are grouped together on the pathway images. They managed to highlight the pathways ('pixels') that were of importance for the prediction of long-term survival of glioblastoma patients.
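The conversion from gene-level values to a pathway image can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses a single omics type, random toy data, hypothetical gene indices per pathway, and PCA computed via SVD. The function names `pathway_pcs` and `pathway_images` are invented for this example.

```python
import numpy as np


def pathway_pcs(expr, gene_idx, n_pcs=3):
    """Project samples onto the top principal components of one pathway's genes."""
    sub = expr[:, gene_idx]
    sub = sub - sub.mean(axis=0)                 # centre each gene
    u, s, _ = np.linalg.svd(sub, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]              # PC scores, (n_samples, n_pcs)


def pathway_images(expr, pathways, n_pcs=3):
    """Stack per-pathway PC scores into one (n_pathways x n_pcs) image per sample."""
    scores = [pathway_pcs(expr, idx, n_pcs) for idx in pathways.values()]
    return np.stack(scores, axis=1)              # (n_samples, n_pathways, n_pcs)


rng = np.random.default_rng(0)
expr = rng.normal(size=(10, 8))                  # 10 samples x 8 genes (toy data)
# Hypothetical pathway definitions given as column indices into `expr`.
pathways = {"pathway_A": [0, 1, 2, 3], "pathway_B": [2, 4, 5, 6, 7]}
images = pathway_images(expr, pathways)
```

Each row of a sample's image corresponds to one pathway, which is what allows a saliency map over the image to be read back as pathway importance. With several omics types, one plausible extension would be to concatenate the per-omics score blocks along the last axis.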
Another model which allows for the integration of multi-omics data at the pathway level was proposed by Lemsara et al. [67]. In the multi-modal sparse denoising autoencoder model, multi-omics features are mapped to NCI pathways. Each pathway is represented as a score obtained via an autoencoder, then bi-clustering is applied. The model clusters patients based on multiple omics data types, including gene expression, miRNA expression, DNA methylation and CNV data. The SHAP method is used 'to understand the impact of individual omics modalities and features on the autoencoded score [...] learned for each pathway' [67].
Lee et al. [68] proposed a DL model for cancer subtype classification, which used 287 pathways retrieved from the KEGG database. Pathways were used to build a graph in which a set of nodes represents genes and a set of edges represents molecular interactions between genes in the pathway. Gene expression profiles from RNA-seq were mapped to nodes represented as a vector. To model each pathway, they used a graph convolutional neural network (GCN), which can capture localised patterns in data and consider interactions among genes. In this way, they built multiple GCNs, one for each of the 287 pathways. Then, a multi-attention based ensemble combines all the pathway models into a single one through two attention levels (pathway-level and ensemble-level). This is followed by a multi-layer perceptron (MLP) for a cancer subtype classification task. The attention mechanism allows for highlighting pathways that are important for the classification, and falls into the ARCH category as the notion of pathways directly impacts the model's architecture. In addition, DK is used POSTHOC to explain the differences in gene expression and interactions between different subtypes in terms of pathways. The authors used the network propagation method on a pathway-PPI network, where the PPI was derived from the BIOGRID database.
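To make the idea of a graph convolution over a pathway graph concrete, the sketch below implements one layer using the widely used symmetric-normalisation propagation rule (adjacency with self-loops, scaled by node degree). Whether the reviewed model uses exactly this GCN variant is an assumption; the four-gene toy pathway, the expression values and the weights are invented for illustration.

```python
import numpy as np


def normalized_adjacency(adj):
    """Symmetric normalisation with self-loops: D^-1/2 (A + I) D^-1/2."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(adj, features, weights):
    """One graph convolution: aggregate neighbours, transform, apply ReLU."""
    return np.maximum(normalized_adjacency(adj) @ features @ weights, 0.0)


# Toy pathway with 4 genes; edges stand in for molecular interactions.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 0],
                [0, 1, 0, 0]], dtype=float)
expression = np.array([[2.0], [0.5], [1.0], [3.0]])  # one value per gene node
weights = np.array([[1.0, -1.0]])                     # 1 input -> 2 hidden features
hidden = gcn_layer(adj, expression, weights)
```

After one layer, each gene's hidden representation mixes its own expression with that of its direct interaction partners, which is how the graph structure taken from the pathway database enters the computation.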
PPI networks as a prevalent type of graph-based input. An example of a DL model for the integration and analysis of multi-omics data is DeepMOCCA [69]. DeepMOCCA is a survival prediction model, which integrates DK using PPI networks to transform the input data representation into a graph. The PPI networks are obtained from the STRING database. The multi-omics data is mapped onto the nodes, which represent a combination of genes, transcripts and proteins. The edges reflect physical and other functional interactions between them. Then, the graph is the input to a GCN with a graph attention mechanism. Additionally, as POSTHOC DK integration, cancer driver genes listed in the COSMIC database [70] are used to interpret the averaged rank derived from the attention mechanism. By looking at genes with repeatedly high scores across samples but not yet reported as cancer genes, the attention mechanism allows for the generation of new hypotheses. Therefore, DeepMOCCA allows for the identification of prognostic markers and cancer driver genes. The authors of DeepMOCCA [69] also investigated the sample representations in the hidden layer of the network (before the Cox regression) with t-SNE visualization [71] and compared their similarity between cancer types. They suggested that this kind of analysis in reduced dimensional spaces could support patient stratification.
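The post-hoc comparison described above can be illustrated with a small sketch: per-sample importance scores are converted to ranks, averaged across samples, and the top-ranked genes are split into those already present in a curated driver list and those that are not. The gene names, scores and the `known_drivers` set are placeholders, and the two helper functions are hypothetical, not part of DeepMOCCA.

```python
from statistics import mean


def average_ranks(scores_per_sample):
    """Average each gene's rank (1 = highest score) across samples."""
    ranks = {}
    for scores in scores_per_sample:
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, gene in enumerate(ordered, start=1):
            ranks.setdefault(gene, []).append(rank)
    return {gene: mean(r) for gene, r in ranks.items()}


def split_by_prior_knowledge(avg_ranks, known_drivers, top_k):
    """Split the top-ranked genes into known drivers and candidate hypotheses."""
    top = sorted(avg_ranks, key=avg_ranks.get)[:top_k]
    confirmed = [g for g in top if g in known_drivers]
    candidates = [g for g in top if g not in known_drivers]
    return confirmed, candidates


# Toy attention scores for three samples; gene names are placeholders.
attention = [
    {"GENE_A": 0.9, "GENE_B": 0.7, "GENE_C": 0.1, "GENE_D": 0.3},
    {"GENE_A": 0.8, "GENE_B": 0.6, "GENE_C": 0.2, "GENE_D": 0.1},
    {"GENE_A": 0.7, "GENE_B": 0.9, "GENE_C": 0.3, "GENE_D": 0.2},
]
known_drivers = {"GENE_A"}  # stands in for a curated list such as COSMIC
confirmed, candidates = split_by_prior_knowledge(
    average_ranks(attention), known_drivers, top_k=2)
```

Genes in `confirmed` support the biological plausibility of the model, while genes in `candidates` are the ones that would be put forward as new hypotheses.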
Similarly, Chuang et al. [72] used the PPI network to change the input representation. However, their model maps the PPI network into 2D space by using spectral clustering and combines it with the gene expression data to generate images of cancer-related networks of different types of cancer for a CNN model. More specifically, the adjacency matrix (from the PPI network) is reduced to 2 eigenvalues and represented as 2D images. Then a CNN model is trained for cancer type classification. Unfortunately, spectral clustering renders tracing the signal back to individual input features very difficult. This computational step makes significantly reduces the model's interpretability.
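A generic spectral embedding gives a feel for this kind of mapping. The sketch below is an assumption about the general technique rather than a reproduction of the authors' pipeline: it places each node of a toy network in 2D using eigenvectors of the unnormalised graph Laplacian, and the function name `spectral_embedding_2d` and the six-node network are invented for illustration.

```python
import numpy as np


def spectral_embedding_2d(adj):
    """2D coordinates from the graph Laplacian's 2nd and 3rd eigenvectors."""
    laplacian = np.diag(adj.sum(axis=1)) - adj
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # eigenvalues in ascending order
    return eigvecs[:, 1:3]                        # skip the constant eigenvector


# Toy PPI network: two triangles joined by a single edge (2-3).
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0
coords = spectral_embedding_2d(adj)
```

In this toy network the first coordinate separates the two triangles by sign. Every coordinate is computed from the whole adjacency matrix, so a pixel position in the resulting image no longer corresponds to a single interaction, which is consistent with the loss of traceability noted above.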
Another DL model integrating PPI networks was developed by Chereda et al. [73]. They use the PPI network from the HPRD database [74,75] to structure the gene expression data. Input data is transformed into a graph and used in a GCN model, which is trained to classify expression profiles from breast cancer patients as metastatic or non-metastatic. They developed Graph Layer-wise Relevance Propagation to interpret the outputs of the GCN, and used this explainability method to build a patient-specific subnetwork containing the genes that contribute the most to a prediction.
Ramirez et al. [76] investigated four GCN-based models for expression-based cancer type classification (into a tumor type or normal tissue). The input graphs were generated based on: the co-expression (using Spearman correlation), the co-expression+singleton, the PPI, and the PPI+singleton networks from the STRING database [77]. As an interpretability method, they use an in silico perturbation procedure. The expression of each gene is successively set to 0 or 1 before the data are passed through the model, and the effect of this manipulation on the prediction accuracy is examined. The more important a gene is for the classification, the greater the observed change in accuracy. This effect is captured with what the authors called a gene-effect or contribution score, defined as 'the larger prediction accuracy change of the labeled cancer type', and calculated for each gene for all classification labels (33 tumor types plus normal).
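The perturbation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Ramirez et al. [76]: the function name `gene_effect_scores`, the toy threshold classifier and the random data are hypothetical, and expression values are assumed to be scaled to [0, 1]. For each gene, expression is set to 0 and then to 1, and the larger per-class change in accuracy is kept as the score.

```python
import numpy as np

def gene_effect_scores(predict, X, y, values=(0.0, 1.0)):
    """Return a (genes x classes) matrix of accuracy changes under perturbation.

    predict: callable mapping an expression matrix to predicted labels
             (stands in for a trained classifier; hypothetical here).
    """
    classes = np.unique(y)
    base_pred = predict(X)
    base_acc = np.array([(base_pred[y == c] == c).mean() for c in classes])
    scores = np.zeros((X.shape[1], classes.size))
    for g in range(X.shape[1]):
        for v in values:
            Xp = X.copy()
            Xp[:, g] = v                      # overwrite one gene in every sample
            pred = predict(Xp)
            acc = np.array([(pred[y == c] == c).mean() for c in classes])
            # keep the larger of the two accuracy changes (value 0 vs value 1)
            scores[g] = np.maximum(scores[g], np.abs(base_acc - acc))
    return scores

# Toy data: the label depends only on gene 0.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = (X[:, 0] > 0.5).astype(int)
predict = lambda A: (A[:, 0] > 0.5).astype(int)

scores = gene_effect_scores(predict, X, y)
```

In this toy setting only gene 0 receives a non-zero score, which is the behaviour one would hope for from a contribution score of this kind.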
Schulte-Sasse et al. [78] combined three omics data types with a gene-gene interaction network and a PPI network from Consensus Path DB (CPDB). DK was integrated both in PRE and to assign labels in the dataset. First, a gene-gene interaction network is created, where some weak correlations are discarded based on DK from PPI. This graph is the input to a GCN, which is trained to predict whether a gene is associated with the disease or not. To derive a collection of positive and negative labels for genes in the dataset (ytrue labels), the network of cancer genes (NCG), COSMIC, OMIM and KEGG are used. As the output of the model and the true labels depend on the integrated DK, the POSTHOC category is also assigned to this model. The authors demonstrate that including the interaction networks with a GCN classifier helps to classify and predict novel genes as well as entire disease modules. Using Layer-Wise Relevance Propagation (LRP) [79], they are able to dissect which features drive the classification of a gene as a driver gene or not and to identify, for each gene, the neighboring interacting genes that most influence its classification. This results in sub-modules consisting of a directed graph of gene-gene LRP contributions. As an illustration, this revealed that important neighboring genes of the cancer gene SAPCD2 are enriched for other drivers, suggesting that PPIs between these genes are important for the classification.
Liu et al. [80] developed a network-embedding based stratification method (NES). The method constructs the patient vectors based on the network embedding of the PPI network. More specifically, the struc2vec [81] network embedding approach is used. Although this provides relatively good performance in the classification of patient subtypes from large-scale somatic mutation profiles, the method lacks interpretability. The authors do not attempt to analyse the inner workings of the model, which may be due to the struc2vec embedding of the input graph, which makes the inference very difficult.
Liu and Xie [82] developed TranSynergy to predict synergistic drug combinations for cancer therapy. Information from the PPI network, gene dependency, and drug-target association is integrated into the model. They proposed a Shapley Additive Gene Set Enrichment Analysis (SA-GSEA) with the aim of deconvoluting 'genes that contribute to the synergistic drug combination'. Their SA-GSEA method proceeds by ranking the features (i.e. genes) based on their Shapley values and then conducting a gene set enrichment analysis. This approach offers a perspective for therapeutic approaches and decisions in the context of personalized medicine.
Data enrichment and augmentation driven by relations in the input data. Apart from DK extracted explicitly from knowledge bases (e.g. specific pathways), the multi-omics data can be enriched or augmented by using relations derived from the input data, for instance by calculating correlations between gene expression. The studies described below utilise such data enrichment via: co-expression networks, co-expression eigengene matrices, sample similarity networks or guidance graphs with GLUE (graph-linked unified embedding). Of note, expert knowledge is required to define or select an appropriate method.
Huang et al. [83] proposed SALMON (Survival Analysis Learning with Multi-Omics Neural Networks). The input to the model consists of mRNA- and miRNA-seq co-expression eigengene matrices, which are derived with the lmQCM algorithm [84] in a PRE step. Patient features (diagnosis age, ER and PR status, copy number and tumor mutation burdens) are integrated at a later stage. The model predicts the Cox proportional hazard ratio (survival) for the TCGA breast cancer dataset. As an interpretation method, a perturbation procedure measures the importance of each input variable for survival prognosis. Features are ranked according to how much the concordance index (a metric for quantifying how survival prognosis models perform) is decreased. In this POSTHOC interpretation, the authors performed Gene Ontology (GO) and cytoband enrichment with the ToppGene Suite to infer the biological implications of the feature ranking. In this way, Huang et al. [83] identified diagnosis age and PR status, along with five mRNA-seq co-expression modules, as the most determinant features. Genes belonging to these leading co-expression modules were further functionally assessed with gene set enrichment analysis.
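Ranking features by the decrease in concordance index can be illustrated with a short sketch. It is not the SALMON code: the perturbation used here (replacing a feature by its cohort mean), the linear risk function and the simulated survival times are assumptions made for illustration, and the names `concordance_index` and `c_index_drops` are hypothetical.

```python
import numpy as np

def concordance_index(time, event, risk):
    """Harrell's C: share of comparable patient pairs ordered correctly by risk."""
    concordant, comparable = 0.0, 0
    for i in range(len(time)):
        if event[i] != 1:
            continue  # a censored patient cannot be the earlier member of a pair
        for j in range(len(time)):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied risks count as half
    return concordant / comparable

def c_index_drops(risk_fn, X, time, event):
    """Decrease in C-index when each feature is replaced by its mean."""
    base = concordance_index(time, event, risk_fn(X))
    drops = np.zeros(X.shape[1])
    for k in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, k] = X[:, k].mean()       # assumed perturbation: remove variation
        drops[k] = base - concordance_index(time, event, risk_fn(Xp))
    return drops

# Toy cohort: survival time is driven by feature 0 only, no censoring.
rng = np.random.default_rng(0)
X = rng.random((60, 3))
time = np.exp(-X[:, 0])
event = np.ones(60, dtype=int)
risk_fn = lambda A: A @ np.array([1.0, 0.0, 0.0])   # stand-in for a trained model

drops = c_index_drops(risk_fn, X, time, event)
ranking = np.argsort(-drops)            # most important feature first
```

Features whose perturbation lowers the concordance index the most appear first in `ranking`; these are the candidates that would then be passed to an enrichment analysis.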
A similar way of determining the contribution of input features can be used to identify biomarkers, as illustrated by Wang et al. and their MOGONET model [85]. In MOGONET, DNA methylation, mRNA- and miRNA-seq data are transformed into sample similarity networks. Each network enters a separate GCN. The omic-specific label distributions are then concatenated and integrated with a view correlation discovery network (VCDN), which 'can exploit the higher-level cross-omics correlations in the label space' [85]. They identified distinct biomarkers for each of the investigated diseases and performed gene set enrichment analysis yielding results consistent with previous studies.
Another graph embedding of the input was proposed by Cao and Gao [86] in a modular framework called GLUE (graph-linked unified embedding). GLUE utilizes prior knowledge via a knowledge-based graph, called the 'guidance graph', which models regulatory interactions across omics layers. The method combines omics-specific variational autoencoders with this guidance graph, and was used to integrate unpaired single-cell triple-omics data. The nodes in the guidance graph correspond to the features of each omics layer, and edges represent signed regulatory interactions.
Xing et al. [87] proposed a multi-level attention graph neural network (MLA-GNN) for multi-task prediction. As a first step in the model, the omics data (unimodal, e.g. proteomics or transcriptomics) are converted into a weighted correlation matrix (WGCNA; [88]). Built for the full dataset, the WGCNA represents a co-expression network, from which an edge matrix is derived. Next, a patient-specific graph can be constructed, where the node values are given by the gene expression level in a given sample, and edges between nodes are drawn according to the WGCNA analysis. The graph serves as input to the first of three graph attention (GAT) layers of the DL model. Features from these three GAT layers are vectorised after a linear projection and fused into a single vector, which then passes through sequential fully connected layers in the prediction module. Finally, a full-gradient graph saliency (FGS) mechanism is implemented to interpret the predictions.
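The construction of a patient-specific graph from a cohort-level co-expression network can be sketched as below. This is a simplified stand-in for a WGCNA-style analysis, not the MLA-GNN pipeline: the soft-thresholding power `beta`, the `cutoff` and the function names are illustrative assumptions, and the expression data are simulated.

```python
import numpy as np

def coexpression_adjacency(expr, beta=6, cutoff=0.1):
    """Unsigned soft-thresholded co-expression adjacency for a cohort.

    expr: (samples x genes) expression matrix.
    beta, cutoff: illustrative values, not taken from the reviewed study.
    """
    corr = np.corrcoef(expr, rowvar=False)      # gene-gene Pearson correlation
    adj = np.abs(corr) ** beta                  # soft thresholding
    np.fill_diagonal(adj, 0.0)                  # no self-loops
    adj[adj < cutoff] = 0.0                     # drop weak edges
    return adj

def patient_graph(expr, adj, patient):
    """Shared edges from the cohort, node values from one patient's profile."""
    src, dst = np.nonzero(adj)
    return expr[patient], np.stack([src, dst]), adj[src, dst]

# Simulated cohort in which genes 0 and 1 are strongly co-expressed.
rng = np.random.default_rng(1)
expr = rng.normal(size=(100, 4))
expr[:, 1] = expr[:, 0] + 0.01 * rng.normal(size=100)

adj = coexpression_adjacency(expr)
node_values, edge_index, edge_weight = patient_graph(expr, adj, patient=0)
```

The edge structure is computed once for the full dataset, whereas `node_values` changes from patient to patient, which mirrors the separation between the cohort-level network and the sample-level graph described above.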
Mapping Domain Knowledge as a direct input to DL models. The degree to which a gene is essential for cancer cell proliferation is defined as gene dependency [89]. Chiu et al. [90] proposed the DeepDEP autoencoder (AE) to predict the gene dependency profile based on the representations learned from high-dimensional genomic data, including DNA mutation, gene expression, DNA methylation, and copy number alteration (CNA). The model includes molecular signatures of the chemical and genetic perturbations from MSigDB as unique functional fingerprints of a gene dependency of interest. First, five AEs (one for each type of input data) are trained on unlabeled tumor data, then the outputs from the five encoders are combined and passed to a DNN. As one of the AEs is trained on fingerprints from MSigDB, which constitute DK, we considered the integration as PRE. Based on DeepDEP, the authors performed a detailed post-hoc analysis including input data perturbation, exploration of the latent layers, signature scores and multi-variable linear regression.

Explicitly defined architecture
In this section we discuss DL models that use domain knowledge to modify a standard densely connected DL model's architecture in order to improve both biological plausibility and interpretability (Fig. 3B).
Pathways are used to define connections. Elmarakeby et al. [91] combined ex ante and ex post interpretability approaches, proposing a novel neural network architecture, the pathway-aware multi-layered hierarchical network (P-NET). It was built using a set of 3,007 curated biological pathways from the Reactome database. The model predicts disease state in prostate cancer patients on the basis of somatic mutations and copy number alterations data. Encoding the relationships that exist in the Reactome dataset focuses the network on interpretability at the design stage (ARCH).
P-NET comprises one layer to encode the genes and five for the pathways. The input layer corresponds to the features that can be quantified and passed through the network. Three nodes from this layer (representing mutations, copy number amplification and copy number deletion) are connected to one node in the subsequent layer. The connections of the second layer reflect gene-pathway relationships whereas those of the next layers are arranged according to parent-child relationships borrowed from Reactome. For a given patient, the trained NN returns the probability of the patient having metastatic cancer. For each sample, features can be ranked by importance score in a layer-wise manner using DeepLIFT, where sample-level scores are aggregated to obtain the global importance [92]. To gain additional insights into the information flow inside P-NET, the authors evaluated how a change in input sample label affects the activation of a node.
Deng et al. [93] proposed a pathway-guided deep neural network (DNN) framework to predict drug sensitivity in cancer cells, using known biological signaling pathways, the expression profiles of cancer cell lines, drug-protein interactions, and drug sensitivity datasets. The pathway maps were obtained from the KEGG database. DK was integrated into the DNN model via the layer of pathway nodes and their connections to input gene nodes and drug target nodes.
Zhao et al. [94] proposed a scalable and interpretable DL model, called DeepOmix, for multi-omics data integration and survival prediction. DeepOmix incorporated prior biological knowledge defined by users as the functional module input (such as signaling pathways in this analysis). The pathway gene sets were downloaded from the Molecular Signatures Database (MSigDB) (KEGG and Reactome). DeepOmix integrated multi-omics data as an input gene layer, where nodes of the gene layer are connected with a functional module layer based on the DK. Again, the pathways defined whether there is a connection between nodes.
Feng et al. [95] proposed a DL model, called DeepSigSurvNet, based on a set of 46 selected signaling pathways from the KEGG database for cancer patients' survival prediction and outcome. The model identifies the individual patterns of these signaling pathways in four types of cancer using gene expression and copy number data (multi-omics data and clinical factors are integrated into the model). Sparsely connected layers are followed by a CNN with inception modules. For interpretability, Smoothgrad [96] is used to assess how perturbation added to the signaling pathways affects the model's predictions. This allows scoring the relevance of each pathway for each cancer type. The distributions of the relevance scores of each pathway are then compared between cancer types. The authors noted that striking discrepancies arise among the cancer types and also that for a given cancer type only a small subset of the pathways have high relevance scores. This latter observation could be of interest for prioritising drugs or drug combinations that target these driver pathways.
Zhang et al. [97] used a DL architecture constrained by the 46 pathways, with a pathway layer that follows the gene layer. Similarly to Feng et al. [95], connections between the two layers are sparse, and connect genes only to pathways to which they belong. They trained the model ('consDeepSignaling') for predicting drug responses in cancer cell lines from the data of dose response and multi-omics (gene expression and copy number). The output from the last layer represents the predicted area under the experimental dose-response curve value of the drug effect on a given cancer cell line. By using Smoothgrad, they analyze the distributions of the importance scores of the signaling pathways from all samples and highlight those important for drug response prediction.
Hao et al. [98] proposed a Pathway-Associated Sparse Deep Neural Network (PASNet) to accurately predict patient prognosis and describe complex biological processes related to prognosis by incorporating curated biological pathways from the MSigDB (Reactome). The sparse DL architecture of PASNet modeled a multilayered, hierarchical biological system of genes and pathways, enabling model interpretability. PASNet included a pathway layer, where each node indicates an individual biological pathway (linked with input genes), and a hidden layer which represented hierarchical nonlinear relationships between biological processes. The associations between the gene layer and the pathway layer were established by well-known pathway databases (e.g., Reactome and KEGG).
Another sparsely connected DL model is a sparse Variational Autoencoder architecture, VEGA (VAE Enhanced by Gene Annotations) proposed by Seninge et al. [99]. The decoder connections are informed by user-provided biological networks based on gene annotation databases (e.g., Reactome). VEGA performance was tested using pathways, gene regulatory networks and cell type marker sets as the gene modules that define its latent space. VEGA was shown to be useful in understanding the response of a population of a specific cell type to a variety of perturbations.
To predict cell states from gene expression profiles, Fortelny and Bock [100] proposed Knowledge-Primed Neural Networks (KPNNs) aiming at providing a biologically interpretable DL model. Their approach combines ex ante and ex post explainability methods. The fully connected NNs were replaced by networks derived from prior knowledge of biological networks, including the signaling pathways and gene-regulatory networks. To do this, the authors assumed that most of the regulatory relationships important for the biological system of interest had already been discovered in other contexts. In KPNNs, each node corresponds to a protein or a gene, and each network edge corresponds to a regulatory relationship that has been documented and annotated in biological databases. The model was trained based on single-cell RNA-seq data. Of note, contrary to previously described models, the KPNN architecture allows for skipping layers. As for the post-hoc analyses, they focused on the node weights applying a perturbation procedure. It quantifies, for each node, how the addition of small noise is reflected in changes in the outputs. In this way, they evaluated the global importance of the node. These informative weights (in absolute value) can therefore be used to identify likely relevant transcription factors and/or signaling proteins.
Gene Ontology used to define architectural constraints. Based on terms extracted from Gene Ontology (GO), the system hierarchy can be structured. Each GO term is associated with a number of genes and gene products, hence genes can be organised into a hierarchy of nested gene sets. Multi-scale hierarchical interactions among biological entities such as GO terms and genes can be encoded as a list of relations. Below, we describe two studies that explicitly integrate GO into ARCH.
The response of cancer cells to therapy depends on biological as well as chemical factors [101]. To predict drug responses, Kuenzi et al. [102] developed a DL model, called DrugCell, a modular neural network with two branches. The model combines a conventional DNN that processes compound chemical structures with a Visible Neural Network (VNN) processing binary encodings of individual genotypes. The DrugCell system hierarchy was structured from a literature-curated database. The VNN was guided by a hierarchy of molecular human cell subsystems, taken from 2,086 biological processes from the GO database. In DrugCell, RLIPP [103] analysis leads to the identification of the gene embedding network subsystems that most contribute to the cell response prediction. Interestingly, Kuenzi et al. [102] further exploited their approach and confirmed the validity of the hypotheses derived from it. Using cell line data, they demonstrated that subsystems identified as important (as evaluated with the RLIPP scores) for the response to a given drug can reveal synergy of drug combinations. In addition, they showed, using patient-derived xenograft (PDX) model data from a public database, how DrugCell can be used to suggest drug combination treatments. DrugCell constitutes a promising example of how analysis of the inner workings of a DL model could translate into therapeutic recommendations.
Another model using GO is the Factor Graph Neural Network model proposed by Ma and Zhang [104]. Each node in the model corresponds to a biological entity such as a gene or a GO term (i.e., gene nodes and GO nodes), which together form a bipartite graph. The model is based on the RLIPP analysis ('relative local improvement in predictive power') and is used to predict tumor stages of kidney and lung cancers and also to classify kidney samples as normal or tumor tissue. The method calculating attention matrices allows 'capturing multi-scale hierarchical interactions [by assigning] weights to connections between different layers'. By investigating the weights in the last hidden layer, the authors retrieved e.g. the gene ontologies that contribute most to sample classification.
Gene Regulatory Networks used as constraints for VAEs. In [105], Shu et al. developed DeepSEM, a VAE-based model which contains a Gene Regulatory Network (GRN) layer in the encoder and an Inverse GRN layer in the decoder. Of note, the weights are shared between these layers. The GRN consists of target genes and transcription factors and can be reconstructed based on the representation learnt by the model. DeepSEM is an example of nonlinear mapping from the gene expression to GRN activities. Although no database is used as DK, the GRN layers added to a VAE architecture can certainly be considered as a step towards bio-centric interpretability.

POSTHOC explanations
Although in previous sections we already described models that use DK both in ARCH and in POSTHOC phases, here we provide examples that integrate DK only for POSTHOC purposes, not impacting the model's design.
Cox-nnet [106] is an example of an attempt to link biological features or functions to the (hidden) nodes of a DNN model solely via POSTHOC analysis. DK is not used in ARCH. Cox-nnet uses a Cox regression as the output layer, extending the Cox-PH model [107]. The interpretation of the output includes mapping the nodes' weights to regression coefficients, t-SNE, gene set enrichment analysis with KEGG pathways and computation of partial derivatives of the output. Results from Cox-nnet compared favourably with those from Cox-PH from a biological perspective, revealing for example the importance of the BAI1 gene in the p53 pathway or MAPK1 in several cancer-related pathways. Importantly, POSTHOC interpretation is executed not only via expert (author) evaluation, but systematically using DK about known relations extracted from a database.
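A Cox regression output layer is typically trained by minimising the negative Cox partial log-likelihood of the network's risk scores. The sketch below shows that loss in its standard textbook form; it is not taken from the Cox-nnet code, the function name is hypothetical, and tied event times are assumed to be absent.

```python
import numpy as np

def neg_cox_partial_loglik(theta, time, event):
    """Negative Cox partial log-likelihood.

    theta: risk score per patient (e.g. output of the last network layer).
    time:  observed follow-up time per patient.
    event: 1 if the event was observed, 0 if censored.
    Assumes no tied event times (no Breslow or Efron correction).
    """
    theta = np.asarray(theta, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event)
    order = np.argsort(-time)                   # longest follow-up first
    theta_s, event_s = theta[order], event[order]
    # Running log-sum-exp gives log of the summed hazards over each risk set,
    # i.e. over all patients still under observation at that time.
    log_risk_set = np.logaddexp.accumulate(theta_s)
    return -np.sum((theta_s - log_risk_set)[event_s == 1])

# Two patients with equal risk scores: the loss equals log(2).
loss = neg_cox_partial_loglik(theta=[0.0, 0.0], time=[1.0, 2.0], event=[1, 1])

# Assigning the higher risk to the patient who dies first lowers the loss.
good = neg_cox_partial_loglik([2.0, 0.0], [1.0, 2.0], [1, 1])
bad = neg_cox_partial_loglik([0.0, 2.0], [1.0, 2.0], [1, 1])
```

Because the loss depends on the hidden representation only through `theta`, the weights feeding this output can be read as regression coefficients, which is what makes the POSTHOC mapping described above possible.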
A frequently used POSTHOC interpretation method is the exploration of the association between latent representations and input covariates (e.g. phenotypic features of the patients) [108]. This approach is of particular interest for models such as autoencoders (AEs) and variational autoencoders (VAEs). In these models the input data is compressed into a reduced (latent) representation and then reconstructed back from the encoded representation with the least possible error. Due to appealing dimensionality reduction abilities, AEs and VAEs are frequently used within the oncology domain (e.g. [109][110][111]). They can be used together with PCA, UMAP [112], t-SNE [113] or other algorithms [114] for data visualization, and various clustering methods can be used on top of that. POSTHOC analyses can then be performed on the weight parameters and/or on the compressed data for gaining biological insights on what the model learned.
As an example, XOmiVAE was developed to solve supervised and unsupervised tumour classification tasks [115]. It uses DeepSHAP [116] to explain novel clusters generated by VAEs. Results are compared with DK derived from, among others, Reactome and GO.
Similarly, Kinalis et al. [117] propose an AE for clustering analysis of scRNA-seq data. They used guided backpropagation (only positive gradients used for the backpass) for computing saliency maps. In their model, saliency values are obtained for each cell and each gene. Gene and gene set importance scores are then computed by averaging across the cells or the corresponding genes, respectively. They use DK in POSTHOC to investigate the latent space of the AE, comparing obtained representation with the pathways (i.e. hematopoietic signatures derived from the DMAP study [118]).
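The aggregation of per-cell, per-gene saliency values into gene and gene-set scores can be illustrated as follows. This is one possible reading of the averaging step, not the code of Kinalis et al. [117]: the saliency matrix is made up, and the gene list and the 'erythroid' signature are toy assumptions rather than values from the DMAP study.

```python
import numpy as np

def gene_importance(saliency):
    """Average saliency across cells: one score per gene."""
    return saliency.mean(axis=0)

def gene_set_importance(saliency, genes, gene_set):
    """Average saliency across the genes of a set: one score per cell."""
    cols = [i for i, g in enumerate(genes) if g in gene_set]
    return saliency[:, cols].mean(axis=1)

# Toy saliency matrix (cells x genes), e.g. as produced by guided backpropagation.
genes = ["GATA1", "KLF1", "CD19", "PAX5"]
saliency = np.array([
    [0.9, 0.7, 0.0, 0.1],
    [0.8, 0.6, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.8],
])
erythroid = {"GATA1", "KLF1"}       # hypothetical signature for illustration

per_gene = gene_importance(saliency)
per_cell_erythroid = gene_set_importance(saliency, genes, erythroid)
```

Here the first two cells score highly for the toy signature and the third does not, which is the kind of contrast that can then be compared against known signatures in a POSTHOC step.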
In contrast, some AE-based models are being developed without DK being used in PRE, ARCH or POSTHOC [119][120][121][122]. The architecture proposed by Hira et al. [111] can integrate multi-omics data (genomics, epigenomics, transcriptomics). Patient subtyping is obtained first by applying a clustering algorithm on the learned latent features. Clinically relevant latent dimensions are identified by building a univariate Cox proportional hazards (Cox-PH) model for each of them and clustered into survival subgroups. Based on these labels, a Support Vector Machine was trained to allow survival subgroup classification for new samples. With the aim of identifying biomarkers, a linear model (correlations) is used to map the embeddings of clinical relevance into the gene space.

Prevalence of graph representations
Recent years have brought an increasing number of specialised DL architectures which encode the structure of biological relations (Fig. 4A, B). DL supports non-linear modelling, while encoding complex structures and relationships, in order to learn informative representations at multiple levels of abstraction [123]. Architectures based on Graph Neural Networks (GNNs) and Graph Convolution Networks (GCNs) provide universal support for encoding structural biological knowledge into neural representations. In general, GNNs are a spectrum of models which capture graph dependency by passing information between nodes, simultaneously taking into account the scale, heterogeneity, and deep topological information of the input data (Fig. 3A). In a biomedical setting, GNNs demonstrate their applicability in encoding topological relations and mapping them into a high-dimensional embedding space [124]. Compared to other DL models, the advantage of GNNs is the ability to integrate relational data into the inference. With the increasing interest in GNNs, we observed a spectrum of new models which are combined with explainability methods (Figs. 4C, D and 5).

Upward trend of graph representations
Many models were developed to use known gene-gene interactions for prediction, based on the assumption that interacting genes tend to produce similar phenotypes. This resonates with the development in the field of graph neural networks. We observe an increase of GCN/GNN application (1 in 2019, 3 in 2020 and 7 in 2021, Fig. 4D), which is associated with integration of PPI networks as DK (1 model in 2020, 4 models in 2021, Fig. 4B).

Fig. 4 The trends in DL models for cancer. There is an upward trend in using multi-omics data (blue) compared to single-omic data (orange) (A) and in the integration of domain knowledge (DK) (orange, green, red) (B) based on recent studies for DL in cancer biology. The most frequently integrated domain knowledge are pathways (orange) and other DK (red) like functional modules with recent increase in the usage of PPI networks. C There are three main categories of DK integration as: input data pre-processing (PRE) (blue), architecture definition (ARCH) (orange) and post-hoc comparison (POST-HOC) (green). There is a trend in the use of DK in PRE step, i.e. DK is used to enrich or augment the input data, which results in a change of data representation; D In recent years, there is an increasing number of specialised DL architectures which encode the structure of biological relations. Graph Neural Networks (GNNs, and Graph Convolution Networks -GCNs) based architectures were the most prevalent used (green). There is an increase in the number of sparse DNN (red) and sparse AE/VAE (blue) models.

32% of the models which used prior knowledge are GCN models
GNN and GCN models are able to combine heterogeneous omics data types with graph data representations into a predictive model and learn abstract features from both data types. Based on our study, it can be observed that GCNs are the prevalent architectural choice (Fig. 4D). This is due to the fact that the DK is usually represented as a graph (as the phenotype correlates with modules constituting a graph, i.e., sets of related nodes).

60% of the GNN/GCNs used PPIs as a DK
Due to non-reticular structure data such as graphs, GCNs are successfully used to encode protein-protein interaction networks (PPIs) to predict cancer subtypes, to identify and classify normal tissue and tumour samples for many types of cancer (60% of the GCNs used PPIs as a DK, Fig. 5). GCNs can systematically determine which part of the pathway is useful for characterising the tumour. Whether neural network encoding of biological relations as prior knowledge can accelerate biological discoveries remains largely unknown.

Fig. 5 Network of relations between key components of the bio-centric interpretability. Network representing the relations between domain knowledge (red nodes), DK databases (orange), DK integration type according to the proposed taxonomy of bio-centric interpretability (purple nodes), DL models (blue nodes) and explainability methods (green nodes). Node size is proportional to the no. occurrences of the entity, edge width is proportional to no. pairs observed in the reviewed papers. We observe strong connections between: ARCH-pathways-sparse DNN; VAE-latent space exploration; PRE-PPI network-GCN; KEGG-pathways.

PPI networks used in PRE to obtain input to GNN
We observe a pattern that GNN/GCN models are associated with the PPI network application in the pre-processing stage (PRE). Tabular data containing measured multi-omics features are transformed into graphs and then fed into the model. PPI networks are derived from databases such as: STRING, CPDB, HPRD, BioGRID (Fig. 5).
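The transformation of tabular omics features into a graph input can be sketched as below. The edge list, gene identifiers and feature columns are hypothetical placeholders rather than entries from STRING, CPDB, HPRD or BioGRID, and the propagation rule is the commonly used symmetric-normalised graph convolution, shown here as one possible choice rather than the layer used by any particular reviewed model.

```python
import numpy as np

def normalized_adjacency(genes, edges):
    """Symmetric normalisation D^(-1/2) (A + I) D^(-1/2) of an undirected PPI graph."""
    idx = {g: i for i, g in enumerate(genes)}
    A = np.eye(len(genes))                      # self-loops keep each node's own signal
    for u, v in edges:
        A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1.0
    degree = A.sum(axis=1)
    return A / np.sqrt(np.outer(degree, degree))

def gcn_layer(A_hat, H, W):
    """One graph convolution: aggregate neighbour features, project, apply ReLU."""
    return np.maximum(A_hat @ H @ W, 0.0)

# Hypothetical PPI edges; G4 has no interaction partner.
genes = ["G1", "G2", "G3", "G4"]
ppi_edges = [("G1", "G2"), ("G2", "G3")]

# Tabular multi-omics features, one row per gene
# (columns could be e.g. expression, methylation, copy number).
rng = np.random.default_rng(0)
X = rng.random((4, 3))
W = rng.normal(size=(3, 2))                     # untrained weights, illustration only

A_hat = normalized_adjacency(genes, ppi_edges)
H = gcn_layer(A_hat, X, W)
```

The isolated node G4 is updated from its own features only, whereas G1 to G3 mix information along the PPI edges, which is how the prior knowledge changes the data representation in the PRE step.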

Sparse connections as a key design feature
Encoding pathways via sparse connections is an emerging architectural pattern. We observed a pattern of approaches which employ sparse connections that map to layers and nodes with a grounded biological meaning. Domain knowledge integration allows for explicit definition of connections between nodes of a DNN. To achieve this, the relational biases of pathways are exploited, where relations are obtained from knowledge bases (KEGG, Reactome, SIGNOR) and used as a mask within the model for removing connections which are not represented. Thus, DK integration in ARCH allows for better, more efficient and meaningful POSTHOC interpretation (Fig. 5) as well as biological plausibility. As a result, a new architectural paradigm emerges (Fig. 3B), in which the architecture is shaped to reflect biological relations.
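The masking mechanism can be sketched with a single gene-to-pathway layer. The gene list, pathway names and memberships below are toy assumptions for illustration and are not extracted from KEGG, Reactome or SIGNOR; the class `MaskedDense` is a hypothetical helper, and the weights are random rather than trained.

```python
import numpy as np

def membership_mask(genes, pathways):
    """Binary (genes x pathways) mask: 1 where a gene belongs to a pathway."""
    index = {g: i for i, g in enumerate(genes)}
    mask = np.zeros((len(genes), len(pathways)))
    for j, members in enumerate(pathways.values()):
        for g in members:
            mask[index[g], j] = 1.0
    return mask

class MaskedDense:
    """Dense layer whose weights are multiplied by a fixed binary mask."""

    def __init__(self, mask, rng):
        self.mask = mask
        self.W = rng.normal(scale=0.1, size=mask.shape)
        self.b = np.zeros(mask.shape[1])

    def forward(self, x):
        # Applying the mask at every forward pass means that connections absent
        # from the knowledge base contribute nothing, whatever W contains.
        return np.tanh(x @ (self.W * self.mask) + self.b)

# Toy memberships (illustrative only).
genes = ["TP53", "MDM2", "KRAS", "BRAF"]
pathways = {"p53_signalling": ["TP53", "MDM2"], "MAPK_signalling": ["KRAS", "BRAF"]}

rng = np.random.default_rng(0)
mask = membership_mask(genes, pathways)
layer = MaskedDense(mask, rng)

x = rng.random((1, 4))
x_kras = x.copy()
x_kras[0, 2] += 1.0                              # perturb KRAS only
out, out_kras = layer.forward(x), layer.forward(x_kras)
```

Perturbing KRAS changes only the activation of the pathway node it belongs to, so each hidden node keeps a direct biological reading, which is what makes the subsequent POSTHOC ranking of nodes meaningful.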
As the organisation of genes in pathways shapes the high heterogeneity of cancers, taking into account the topology of gene interactions may further help to characterize new gene or disease modules. This is reflected in the prevalence of pathways in DL models: 48.4% of the models that integrated DK used pathways (Fig. 4B, C). This corresponds to an increase in the popularity of sparse DNN and sparse AE/VAE models, as the sparsity comes from limited connections between layers defined by the pathways (4 in 2020, 5 in 2021, Fig. 4D).

Improved support for biomarker discovery
From a machine learning (ML) perspective, predicting clinical outcome can be framed as a classification or regression task, and patient or tumor specific subnets can be identified as distinguishing features. However, the high dimensionality of multi-omics data drives an instability in the feature selection process. In this context, stability means that with minor data perturbations, the process is able to preserve the same features [125,126]. Thus, for minor changes in samples, the biomarker detection method should select a consistent/similar gene set. Ideally, the biomarkers can be applied to any sample in the dataset. In general, finding the relevant features remains a major challenge in the high-dimensional, low sample-size setting, in which features are correlated, either by nature (and this is the case in most molecular datasets) or merely by chance (as the number of samples is relatively small). Finding these truly relevant features is significantly more challenging than finding features that provide optimal predictivity. In practice, current algorithms tend to focus on the prediction error of the models and are usually highly unstable, which limits their applicability in a clinical setting and creates barriers for the interpretation of biomedical insight. The stability of biomarker discovery can be improved by including prior knowledge (i.e. DK) of molecular networks (e.g., pathways or PPI networks; [125,126]).
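One way to quantify the stability notion described above is to repeat the selection on resampled data and measure how much the selected sets overlap. The sketch below uses bootstrap resampling, a simple correlation filter and the mean pairwise Jaccard index; these are assumptions chosen for illustration, not the procedure of [125,126], and the data are simulated.

```python
import numpy as np
from itertools import combinations

def top_k_features(X, y, k):
    """Select the k features with the largest absolute Pearson correlation to y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    return frozenset(np.argsort(-np.abs(corr))[:k].tolist())

def jaccard(a, b):
    """Overlap of two feature sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def selection_stability(X, y, k, n_boot=20, seed=0):
    """Mean pairwise Jaccard index of the sets selected on bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    selections = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # minor perturbation of the samples
        selections.append(top_k_features(X[idx], y[idx], k))
    return float(np.mean([jaccard(a, b) for a, b in combinations(selections, 2)]))

# Simulated data: 3 informative features among 23.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 23))
y = X[:, :3].sum(axis=1) + 0.1 * rng.normal(size=300)

stability = selection_stability(X, y, k=3)
```

A value close to 1 indicates that nearly the same features are selected under resampling; in the high-dimensional, low sample-size setting discussed above the value is typically much lower, which is the gap that network-based prior knowledge is meant to narrow.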

What are the perspectives of interpretability across different DL-based frameworks within the cancer research domain?
Based on our proposed taxonomy we argue that to provide biological interpretability to a DL model used in cancer biology is to enable the domain expert to contemplate the data flow in the entire model and decompose its architecture into elements which map to biological reasoning and to the structure of the underlying biological mechanisms. We argue that the key explainability property for this class of models is decomposability. Each component can also be viewed as a computational step which transforms the data representation, e.g. in either an explicit or a latent form. Although individual computational steps may be mathematically complex, which is inherent to modelling a biological system, they should be organised in the model's architecture in a way that supports the decomposability of the inference process. This will build the representational foundations to deliver bio-centric interpretability.

What are the methods that deliver biological interpretability?
We argue that a promising category of methods is grounded on sparse connections between neurons (e.g. KPNN), including skip-connections between hidden layers, and that this mechanism both supports bio-centric interpretability and improves the biological plausibility of the inference. Such an architecture, combined with state-of-the-art DL explainability methods, allows the contribution of biologically grounded components to individual outputs to be traced back through the network. We argue that designing for bio-centric interpretability, i.e. making architectural choices which minimise the construction of latent representations that are not easily linked to biological primitives, should be at the center of any application of DL for cancer.

What are the desirable approaches to integration of domain knowledge in the models' architecture?
DL models can induce a lack of parsimony in data representation (excessive latent features), delivering models which are intrinsically opaque. The application of explainability methods cannot fully circumvent this limitation, which restricts the ability of these models to deliver meaningful biological insights. Post-hoc interpretation often leads to the confirmation of known relations, which is then presented as evidence of the biological plausibility of a model. However, it has been documented that even untrained neural networks can produce saliency maps that appear meaningful [127]. Thus, we argue that bio-centric interpretability may manifest as the ability of the model's architecture to reflect an isomorphism with regard to known biological structures and processes, so that these can be explicitly investigated. The integration of DK allows for the definition of these architectures. These elements allow for a better use of explainability methods, which can rank network components (e.g. node activations or edge weights) and return references to biologically grounded elements.

What are the emerging representation paradigms within these models?
Based on our review we identify two main trends:
• Input data is transformed into graphs or networks and fed into a GNN or GCN
• Input data is fed into a sparsely connected Deep Neural Network, where connections are defined by biological relations
We observed that the multi-omics data is frequently transformed prior to the model input. The transformation extends beyond computational techniques such as enrichment analysis, and impacts the data representation: tabular data becomes a graph or network (Fig. 3). These graphs can be constructed in a data-driven manner, e.g. based on correlations within the data, like gene-gene interaction networks, or through the integration of DK from databases, e.g. input data expressed as nodes and edges of known PPI networks (Fig. 3A). Then, the graph representation is processed in a GNN or GCN. We observe an upward trend in the usage of such models, in most cases using PPI networks as DK.
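The two graph-construction routes described above can be sketched as follows. This is a minimal illustration and not the procedure of any particular reviewed study: the gene identifiers, the correlation threshold of 0.8 and the function names are assumptions made for the example. The data-driven route links genes whose expression profiles are strongly correlated across samples, while the knowledge-driven route keeps only the edges of a known interaction list whose endpoints were both measured.

```python
from itertools import combinations
from math import sqrt


def pearson(u, v):
    """Pearson correlation of two equally long profiles (0.0 if one is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    if su == 0 or sv == 0:
        return 0.0
    return cov / (su * sv)


def correlation_graph(expression, threshold=0.8):
    """Data-driven graph: connect genes whose absolute correlation reaches the threshold.

    `expression` maps a gene identifier to its values across samples.
    Edges are returned as alphabetically ordered (gene, gene) tuples.
    """
    edges = set()
    for g1, g2 in combinations(sorted(expression), 2):
        if abs(pearson(expression[g1], expression[g2])) >= threshold:
            edges.add((g1, g2))
    return edges


def knowledge_graph(expression, known_interactions):
    """Knowledge-driven graph: keep known interactions between measured genes."""
    measured = set(expression)
    edges = set()
    for a, b in known_interactions:
        if a != b and a in measured and b in measured:
            edges.add(tuple(sorted((a, b))))
    return edges


# Hypothetical toy input: three genes measured in four samples.
toy_expression = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.0],
    "G3": [4.0, 1.0, 3.0, 2.0],
}
data_driven_edges = correlation_graph(toy_expression, threshold=0.8)
prior_knowledge_edges = knowledge_graph(toy_expression, [("G3", "G1"), ("G1", "G9")])
```

In both cases the resulting edge set, together with the per-gene measurements as node features, is the kind of input that a GNN or GCN would then consume.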
The second trend focuses more on the architecture of the model, i.e. on the connections between neurons in the network. The input data can still have a tabular representation and, because the bio-interpretability comes from a carefully crafted architecture, the ability to trace the signal back from output to input is not lost. Intuitively, the more times the representation of the data changes in the model, the less interpretable the data flow appears to be. Despite the advantages of graphs in describing biological relations, they might not be the best solution for a DL model, because transforming input data into a graph makes the data flow less transparent (e.g. graph to PCA, then to 2D image; convolutions on graphs). Preserving the tabular representation of the input data may allow for more transparent post-hoc explanations, provided that the model's architecture reflects biological relations. For such models, pathways and functional modules derived from knowledge bases are used for defining the sparse connections (Fig. 3B).
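A minimal sketch of this second trend is given below, assuming a single gene-to-pathway layer. It is written in plain Python for readability and is not the implementation of KPNN or of any other reviewed model: the pathway names, the weights and the function names are hypothetical. A binary mask derived from pathway membership zeroes every connection that has no biological counterpart, so each hidden node corresponds to one pathway and its activation decomposes into contributions of its member genes.

```python
def build_mask(genes, pathways):
    """Return pathway names and a genes x pathways mask of 0.0/1.0 entries.

    mask[i][j] is 1.0 only when gene i is annotated as a member of pathway j.
    """
    names = sorted(pathways)
    mask = [[1.0 if gene in pathways[name] else 0.0 for name in names]
            for gene in genes]
    return names, mask


def masked_forward(x, weights, mask):
    """Pathway activations: ReLU of the masked weighted sum of gene inputs."""
    activations = []
    for j in range(len(mask[0])):
        z = sum(x[i] * weights[i][j] * mask[i][j] for i in range(len(x)))
        activations.append(max(0.0, z))
    return activations


def gene_contributions(x, weights, mask, j):
    """Per-gene terms of the pre-activation of pathway node j."""
    return [x[i] * weights[i][j] * mask[i][j] for i in range(len(x))]


# Hypothetical example: three genes, two pathways, all weights set to 1.0.
genes = ["G1", "G2", "G3"]
pathways = {"P_a": {"G1", "G2"}, "P_b": {"G3"}}
pathway_names, mask = build_mask(genes, pathways)
weights = [[1.0, 1.0] for _ in genes]
x = [1.0, 2.0, 3.0]
pathway_activations = masked_forward(x, weights, mask)
contributions_to_first_pathway = gene_contributions(x, weights, mask, 0)
```

Because the tabular input is never re-encoded, the contribution of gene G3 to pathway P_a is exactly zero by construction, which is what allows a post-hoc attribution to be read directly in terms of genes and pathways. In a trainable model the same mask would typically be applied to the weight matrix of a dense layer at every forward pass.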

Conclusions
In this systematic review we focused on the biological interpretability of Deep Learning models that target omics data developed in the domain of cancer biology. We introduced the new concept of bio-centric interpretability and defined its key properties and components. According to a taxonomy centered around this notion, we critically reviewed recent studies in the context of model architecture, domain knowledge integration and biological interpretability methods.
We found that the convergence between the use of external domain knowledge and the design of architectures which reflect the structure of known biological mechanisms can deliver: (i) the model explainability required by domain experts, (ii) the improvement of the biological plausibility of these models, (iii) the improvement of the explanation quality delivered by post-hoc methods and, more fundamentally, (iv) the repositioning of DL models from opaque pure-predictors to explainable models which can support new biological insight. The two most common approaches to incorporating DK into the model are to use pathways or PPI networks (Figs. 3 and 4). They can be used (i) for data augmentation (tabular mRNA data → graphs based on gene interactions) and (ii) to biologically ground the architecture of the model (e.g. mapping the connections between nodes). These pathways and PPI networks are derived from public databases, such as KEGG and Reactome, exploiting the existing curated biological knowledge. The vast majority of reviewed models attempt to interpret the output by post-hoc analyses, with a clear pattern: the more domain knowledge is reflected in the model design, the more interpretable the post-hoc analysis is. Although expert knowledge is always required to interpret the results, we assert that only the integration of explicit domain knowledge in the model design may lead to an improved understanding of the underlying biological mechanisms. As the notion of biological interpretability is still largely unformalised, we highlight the need for universal bio-centric interpretability methods, so that the developed methods are less problem- or application-specific.
In recent years we have observed a significant increase in the number of DL models developed for cancer research. Gradual improvement of their performance and better interpretability will facilitate the adoption of these models to support biomedical inference. Still, there are challenges that need to be systematically addressed. First, as the cost of acquiring molecular-level data decreases and the accessibility of patient screening improves, more data will become available, most likely as multi-omics. Commonly, there will be an imbalance between the number of features p and the sample size n (high dimensionality, low sample size). DL models in oncology will in many cases need to integrate various data modalities in an efficient and traceable manner, while handling the p >> n regime.
Second, we emphasise the need for reproducibility and benchmarking for DL models in cancer. Although publicly available datasets are often used, the selection of subdatasets (e.g. only one tumor type selected for modelling), modelling approaches and explainability methods vary. As a consequence, the biomarkers and biological relations that the models indicate as predictive or important are inconsistent, comprising well-known biological facts, potential new discoveries and spurious, false biomarkers. At this point, it is challenging to distinguish between the last two. Benchmarks, i.e. datasets with expected interpretations, will allow for model verification and reliable comparison between developed models.
Third, a clear direction is set towards the integration of prior domain knowledge. All the studies we reviewed attribute the improved interpretability to the incorporation of some form of biological knowledge into the model. We anticipate that future models will exploit known biological relations to a greater extent by combining DL expressivity and flexibility with mechanistic modelling methods.

Methods
In this systematic review, we summarise emerging DL models in cancer biology covering the representation of biological processes, diagnosis and prognosis, and recent progress in biologically informed models. To this end, we started by searching electronic bibliographic databases (PubMed and Web of Science) for relevant studies published between Jan 1, 2018, and Jan 1, 2022. We used the following terms: multi-omics and deep learning or computer science or neural networks or network analysis or machine learning and cancer or cancer biology. The same search was repeated just before the final experimental analysis for completeness (Mar 1, 2022).
We concentrated on deep learning methods applied to cancer, or at least on those with straightforward applications in cancer biology, using multi-omics data from humans or human cell lines. Furthermore, we also searched the reference lists of published trials and the relevant review articles. Finally, we concentrated only on DL applications for omics data, including genomics, transcriptomics and epigenomics data from cancer in humans. We excluded studies published in languages other than English, studies with insufficient data (i.e., studies where full texts were not available or irrelevant studies), case reports, editorial materials, comments and meeting abstracts. Similarly, pre-clinical studies conducted in animals, animal cell lines or murine models, as well as review articles and meta-analyses, were excluded. Papers providing methods that are not directly linked to cancer and to functional analysis of, or insights into, biological processes were excluded. Papers centered around medical imaging (e.g. histopathology and computed tomography) were excluded, as were papers providing DL and ML models based on clinical/laboratory data alone. Studies that were based on microarray data or that developed a sequence-based algorithmic framework were excluded as well.
Using the search strategy, we obtained the titles and abstracts of the retrieved studies and imported them into EndNote. Two authors independently screened the identified studies on the basis of prespecified inclusion criteria. All potentially relevant articles were read in full text and a list of eligible studies was created. Data were manually extracted using a structured template, and any disagreements were resolved by mutual agreement between these two authors during the process of screening and data extraction, or by the intervention of a third author. A standardised data extraction form was used to extract the following fields: authors' names, year of publication, type of omic data, model's output, type of prior knowledge, prior knowledge databases, type of domain knowledge integration, type of deep learning model/architecture and interpretability method used.
The multi-omics data can be represented in various ways in the subsequent components of the model. This representation can be changed into another representation in a series of computational steps. It is crucial to understand these representations, how they are transformed, and how to communicate such transformations during post-hoc inference. We distinguished four bio-centric interpretability components: the integration of different data modalities, the schema-level representation of the model, the integration of domain knowledge, and post-hoc explainability methods. We took the concept of interpretability and distinguished three categories: architecture-centric interpretability, output-centric interpretability, and post-hoc evaluation of biological plausibility. The assignment of each model to the identified groups was done manually, based on the authors' expertise.
The selection criteria resulted in 42 studies (see footnote 1). We elaborate on the components involved in bio-centric interpretability within DL models, focusing on emerging architectural and methodological advances, the encoding of biological domain knowledge and the integration of explainability methods. The dimensions of bio-centric interpretability for recent studies are presented in Additional file 1: Table S1.