A framework for considering prior information in network‐based approaches to omics data analysis

For decades, molecular biologists have been uncovering the mechanics of biological systems. Efforts to bring their findings together have led to the development of multiple databases and information systems that capture and present pathway information in a computable network format. Concurrently, the advent of modern omics technologies has empowered researchers to systematically profile cellular processes across different modalities. Numerous algorithms, methodologies, and tools have been developed to use prior knowledge networks (PKNs) in the analysis of omics datasets. Interestingly, it has been repeatedly demonstrated that the source of prior knowledge can greatly impact the results of a given analysis. For these methods to be successful it is paramount that their selection of PKNs is amenable to the data type and the computational task they aim to accomplish. Here we present a five‐level framework that broadly describes network models in terms of their scope, level of detail, and ability to inform causal predictions. To contextualize this framework, we review a handful of network‐based omics analysis methods at each level, while also describing the computational tasks they aim to accomplish.

The concept of a network model, where genes and gene products are linked by molecular processes, was present from the very first days of molecular biology. Due to the sheer complexity of biological systems, biologists have traditionally employed reductionist approaches where different fragments of cellular processes are isolated and identified. An implicit goal of this approach has been to assemble a network model in a piecemeal fashion from these reductionist findings, which will eventually be able to explain and predict the behavior of the biological system at large. More than half a century later, tens of millions of such reductionist findings have accumulated in the literature. Multiple databases and information systems have been developed to capture the pathway information accumulated in scientific literature and present it in a computable format [1]. Some of these databases include Reactome [2], Kyoto Encyclopedia of Genes and Genomes (KEGG) [3], the SIGnaling Network Open Resource (SIGNOR) [4], Pathway Commons [5], Disease Maps [6], and OmniPath [7]. Millions of interactions, molecular processes, and relationships are curated as networks, including metabolic pathways, signaling pathways, gene regulatory networks, molecular interaction networks, and genetic interaction networks. When these networks are used as prior knowledge for data analysis, we refer to them as prior knowledge networks (PKNs).
In parallel, our ability to systematically profile cellular processes has grown with the development of modern omics technologies. We can now use a range of genomic, transcriptomic, metabolomic, and proteomic techniques to profile cellular systems in a given context, with an increasing ability to do so spatially and at the level of single cells. These technologies allow us to generate system-scale profiles without necessarily starting with a specific hypothesis or isolating a specific component, challenging the traditional piecemeal approach. These strategies are "data driven" in that they do not seek explicit biological grounding of their findings: clusters, subtypes, and signatures replace mechanisms and pathways. The perceived incompatibility between the "hypothesis-driven, reductionist" and "data-driven, system-scale" camps led to one of the most polarizing epistemological debates in modern molecular biology [8-10].
Is this truly a fundamental divide, or can we have our cake and eat it too? To bridge this gap, we need to computationally combine these prior information fragments with omics profiles to generate and test mechanistic, falsifiable conjectures at scale. Over the last two decades, thousands of algorithms and methods have been created in the field of network biology to address various sub-problems of this grand challenge, using diverse types of omics data and prior knowledge.
Given multiple data modalities, prior information sources, and tasks, it is often difficult to assess which algorithms are suited to which biological questions and how they relate to each other. Here we present a framework that organizes these methodologies into broad categories based on their use of prior information and the computational task they target. We also review a few examples from each category. Our goal in this review is to give readers a foundational understanding of the different types of networks, and a mental map to help match their needs with the available tools and algorithms.

NETWORKS ARE MODELS OF BIOLOGICAL SYSTEMS
A biological model is an idealized construct, a simulacrum, that allows us to understand, explain, and predict biological phenomena. A network, in the current context, is a graph model of a biological system which depicts molecular entities and the interactions between them. Mathematically, a network is a graph G(V,E), where a set of vertices or "nodes" (V), connected by a set of ordered pairs or "edges" (E), represent the biological entities and the relationships between them, respectively (Figure 1A).

FIGURE 1 Terminology of network models. (A) A node (orange) represents a biological entity, like a protein, gene, transcript, small molecule, or phenotype. In a network model, nodes are connected by edges (black) which represent the relationships between them. (B) Edges can be directed/undirected, and can be signed/unsigned depending on the type of relationship they represent. Undirected edges (blunt lines) can represent relationships where neither entity necessarily acts upon the other, like two proteins in a complex, or they may be used to represent relations where the directionality and/or sign is unknown. Directed edges (arrows) indicate that one entity acts upon another, like a kinase phosphorylating its substrate. Edges which are signed and directed incorporate the consequential nature of the relationship, such as a transcription factor activating the expression of its target gene (green arrow).
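To make this terminology concrete, here is a minimal sketch of a signed, directed network using networkx; the entities and signs are illustrative, not curated interactions:

```python
import networkx as nx

# Build a small directed graph G(V, E); edge attributes carry the
# sign (activation/inhibition) discussed above.
pkn = nx.DiGraph()
pkn.add_edge("EGFR", "MAPK1", sign=+1)   # activation
pkn.add_edge("MDM2", "TP53", sign=-1)    # inhibition
pkn.add_edge("TP53", "CDKN1A", sign=+1)  # TF activates target gene

for u, v, data in pkn.edges(data=True):
    relation = "activates" if data["sign"] > 0 else "inhibits"
    print(f"{u} {relation} {v}")
```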
Network models vary greatly in their coverage of established biological knowledge, level of detail and interoperability with other networks.
A network model can be as simple as the interaction "MDM2 binds to TP53" or can cover a system-level map that encompasses all known cellular processes. In some models, a single node may represent one entity, whereas others may have multiple nodes corresponding to the same entity but representing different states. Similarly, edges may have directionality that indicates the flow of cause and effect from reactant to product in an interaction, or be undirected. They may also be signed, meaning they describe the nature of the reaction (e.g., activation/inhibition), or unsigned (Figure 1B). Some highly complex network models even account for stoichiometric ratios and reaction dynamics equations in their construction.

FIGURE 2 Domains of biological systems described by networks. (A) Metabolic pathways are typically described as a series of sequential reactions involving small molecules and catalytic enzymes. (B) Signaling pathways describe the passage of signaling events, often triggered by a ligand/receptor combination. (C) Gene regulatory networks are directed graphs that describe the circuitry of regulatory effects exerted from one gene to another. (D) Molecular interaction networks are typically unsigned diagrams that represent uncharacterized interactions between a suite of molecules, often proteins. (E) Genetic interactions describe the relationships between genes. Edges here describe the nature of the relationship, as opposed to the mechanisms involved in it.
Several domains of biology are modeled by networks, including metabolic pathways, signaling pathways, gene regulatory networks, molecular interactions, and genetic interactions. These domains are described herein with references to existing databases. (i) Metabolic pathways (Figure 2A) are usually characterized by the abstraction of enzymes, substrates, and products. Typically, these pathways depict the stepwise processing of small molecules. The series of reactions is catalyzed by a suite of enzymes (often proteins). Inhibitors and activators can also modulate the events. The Reactome [2] database contains numerous detailed metabolic pathways, for example, the metabolism of glucose via glycolysis. (ii) Signaling pathways (Figure 2B), on the other hand, encompass a range of biochemical reactions, including binding, transportation, and catalysis events involving molecules and complexes. These pathways may describe molecular states such as cellular location, covalent and non-covalent modifications, and sequence fragments. Many diseases, like cancer, involve the perturbation of signaling pathways. Reactome [2] also contains a great deal of signaling pathways, for example, overactive human epidermal growth factor receptor 2 (HER2) signaling in cancer. (iii) Gene regulatory networks (Figure 2C) involve transcription and translation events, along with their control mechanisms. Databases like TRRUST [11], GRNdb [12], and GRAND [13] are collections of these interactions, illustrating expressional control relationships. (iv) Molecular interaction networks (Figure 2D) are non-mechanistic graphs depicting relationships between molecular entities, such as those captured by high-throughput co-precipitation experiments. In these graphs, each entity is represented by a node, while an edge represents the type of interaction between them, for example, phosphorylation or non-covalent binding events. Protein-protein interaction (PPI) networks are an intuitive representation of this domain. Databases like STRING [14], IntAct [15], and MINT [16] contain large numbers of these interactions.
(v) Genetic interactions (Figure 2E) capture relationships between genes where the observed phenotypic consequence of perturbing both genes is different from what is expected given the phenotypes of each single gene perturbation. Common edge types in these graphs are epistasis, mutual exclusivity of mutation, and synthetic lethality. In sum, genetic interactions are not physical interactions, but rather phenomenological relationships. SLOAD [17] and SynLethDB [18] are both examples of existing repositories of synthetic lethal relationships.

Creating networks
Network representations of biological systems have been around for decades. Reconstruction of metabolic maps from early biochemical experiments started in the 1950s with the Boehringer Mannheim charts.
Modern reconstruction efforts encompass hundreds of thousands of reactions, curated from scientific publications. Despite this herculean effort, these manually curated databases cannot keep up with the rate of scientific production given the available resources. To support manual curation efforts, multiple natural language processing and crowdsourcing approaches to extract computable models from scientific literature have been developed [19], and recent large language models offer great promise in expanding these efforts [20].
In cases where there is very little existing literature about a system, networks can be inferred from existing network models. This approach was particularly popular in the early phases of the Coronavirus Disease 2019 (COVID-19) pandemic. Finally, some high-throughput modalities, such as protein co-precipitation experiments, can be readily expressed as networks without referring to curated sources of prior knowledge. In this example, the network's edges would be informed by co-precipitation events. These identified interactions can be quantified based on confidence, then filtered using a cut-off score to lessen any noise introduced by the mode of collection. The chosen filter, which may be empirically or statistically informed, can have a significant impact on the rate of false positives and negatives in the resulting network [23]. Additional layers such as drug-target relationships can then be mapped to these interaction networks, as was done during the COVID-19 pandemic to nominate targets for drug repurposing [24].
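As a toy illustration of this confidence-filtering step (the column names, scores, and cutoff below are invented for the example):

```python
import pandas as pd

# Hypothetical co-precipitation results: each row is a detected
# interaction with a confidence score.
edges = pd.DataFrame({
    "bait":  ["P1", "P1", "P2", "P3"],
    "prey":  ["P2", "P3", "P4", "P4"],
    "score": [0.92, 0.41, 0.73, 0.15],
})

# An empirically chosen cutoff; raising it trades false positives
# for false negatives, as discussed above.
CUTOFF = 0.5
filtered = edges[edges["score"] >= CUTOFF]
print(filtered)
```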

Networks and context
The fragments which make up a network often come from different biological contexts. Here, context is an umbrella term that covers different models, diseases, conditions, observation modalities, and perturbations. Consider the following sequence of events: a group of researchers elucidates the phosphorylation event that drives a signaling cascade using an array of molecular techniques. Another research group identifies an inhibitor of this phosphorylation event, and another identifies a handful of transcription factors which assemble to produce this inhibitor, and so on until a pathway model starts to take shape. An important consideration in implementing this pathway is the context from which each component arose. If each of these groups were working with a different model system, say cell lines derived from different tissues, treated with different perturbing agents, or grown under different environmental conditions, could their results be stitched together into a common network? How would one assemble these fragments properly, and when and which type of context restrictions should be used for a particular problem? These are complicated questions that researchers should consider carefully when embarking on network-based analysis of their data.
Omics studies using a shotgun approach require a large, more generalist network to fully capture the biology in the data. To achieve a sizeable enough model to be informative, we often need to assemble a larger network from elements across multiple contexts. For example, a researcher comparing phosphorylation patterns between two samples using a large phospho-proteomics dataset will likely need a large network aggregating interactions which were discovered under many different biological contexts, though they may be able to constrain their network to include only the interactions from their model organism.
Alternatively, for datasets with a very narrow scope, it may be possible to find a very detailed, even quantitative, model whose components are derived from a consistent context. For example, a researcher studying glucose metabolism in diabetic mice with a targeted metabolomics assay may be able to find a curated quantitative model of glucose metabolism where all components of the model are informed by studies in diabetic mice.
In either case, the selection of an appropriate prior network is critical to efficient and informative data analysis. Biological context is one of many such parameters that researchers must consider when conducting network-based analysis and interpreting the results thereof.

Utility of networks
Omics profiles offer a molecular snapshot of a biological system under a set of conditions [25]. Molecular structures commonly profiled by omics techniques include the genome (genomics), RNA (transcriptomics), proteins and their post-translational modifications (proteomics), metabolites (metabolomics), and the epigenome (epigenomics) [26]. Some modalities can even be profiled at the level of a single cell, giving much deeper resolution. Using networks, we can generate conjectures about the patterns in these highly complex datasets and understand which observed relationships can be explained by existing knowledge, and which relationships point to novel findings.

Modern omics technologies and the interpretation of their results with network-based methods have been integral to several great advancements in our understanding of biological systems, particularly those related to disease. Network analysis has provided a platform for the discovery of disease mechanisms, like the identification of causal genes and pathways in disease pathogenesis [27-29]. Disease-specific network modules have been used to predict clinically relevant phenotypes, like metastasis in breast cancer tumors [30], or to detect signatures of tumor evolution in response to therapy [31].
Mechanistic explanations for observed phenomena are key to the iterative process of scientific discovery. If an algorithm can make somewhat accurate predictions about the behavior of a system but cannot point to the components that are likely to drive the observed behavior, then the predictions can only be tested phenomenologically and not mechanistically. There is no better example of this claim than commercial drug discovery, which relies on very large phenomenological screens for clinical trials. Despite substantial efficiency gains, between 1950 and 2010 the costs of research and development per approved drug approximately doubled every 9 years [32] as we try to tackle increasingly complex diseases. Drug-target networks have been used to predict off-target interactions or to nominate new alternative applications for therapeutic agents [33,34], allowing for mechanistic interrogation of how and why a given drug elicits its effects.
A related benefit of a grounded, mechanistic inference is the ability to "reason" about the system's response to a previously unknown perturbation, such as a new drug combination or a mutation. This is an extension of a biologist's intuition (for example, inhibiting the inhibitor of a target protein will activate it) but can be done at scale. Additionally, it enables us to identify the reasons behind the failure of our predictions.
A perhaps less appreciated aspect of network-based approaches is the use of networks as prior information to restrict the search space of statistical algorithms. When evaluating potential network models in the context of their ability to explain or fit a certain dataset de novo, the number of possible network models grows exponentially, O(2^(n^2)), as a function of the number of nodes n [35]. This leads to substantial problems with model overfitting, multiple hypothesis testing correction, and model degeneracy. Multiple hybrid methods have been developed that use prior information probabilistically along with de novo inference to center the inferred/evaluated models around known biology, which can restrict the search space substantially [22]. Another example of this is the detection of synthetic lethal gene sets, which is typically a computationally intensive exercise requiring exhaustive searching across huge volumes of possible combinations. Synthetic lethal pairs, triplets, and even quadruplets can now be detected with greater ease by using PKNs to iteratively prune the search space [36].
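As a worked illustration of this growth (counting directed graphs with self-loops, so each of the n^2 ordered node pairs may or may not carry an edge):

```python
# Each of the n*n ordered node pairs (self-loops included) either
# carries an edge or not, giving 2**(n*n) candidate directed graphs.
for n in (3, 5, 10):
    print(f"n={n}: {2 ** (n * n):.3e} possible networks")
# n=3: 5.120e+02; n=5: 3.355e+07; n=10: 1.268e+30
```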
The combined effect of these advantages is an incremental, iterative discovery process that can be done at scale. This is crucial, given the rapid evolution of omics technologies and the ever-increasing volume of omics data.

A FRAMEWORK FOR CATEGORIZING AND CLASSIFYING NETWORK BIOLOGY APPROACHES
When conducting network-based omics analysis, the choice of PKN can impact the results of the analysis [37]. Given the multitude of network databases available, it is useful to have a framework that can guide researchers to make informed decisions. Herein we define three "tasks" which describe the overarching goal(s) of network-based approaches to omics data analysis. These tasks are network inference, explanation extraction, and phenotype prediction. Additionally, we define a framework for classifying network models into five levels of increasing detail: Gene Sets, Interaction Networks, Activity Flow, Process Description, and Quantitative Models (Figure 3). Finally, we review a sampling of network-based approaches at each of these five levels to contextualize the framework. Our classification of networks and approaches is intentionally broad, providing a high-level organization that allows for nuances in this rapidly evolving area of research.

Computational tasks
Networks can be combined with omics data to achieve a wide range of computational tasks. Below we define some broad categories that describe these computational tasks. These categories are not mutually exclusive, as many computational methods have the capacity to perform multiple tasks or hybrids of them. For example, methods which "upscale networks", meaning they output a higher-level network from a lower-level PKN, typically perform both network inference and explanation extraction, as they select a small subset of the input PKN that can explain the correlations in the data and then modify it to infer a new, higher-level network. It is also common to use an explanation extraction or network inference task as a precursor to phenotype prediction [38], especially in clinical applications [39].
Explanation extraction aims to interpret patterns found within an omics profile and contextualize them using prior information about the system. It addresses hypotheses around system changes, such as differential expression or altered interaction strengths, to elucidate the mechanisms involved [40], or to recognize parallel mechanisms that unify multiple datasets in a novel way [22,35].
Network inference tasks produce a network model based on the input data. Some network inference approaches construct an entirely new model representative of their data [39], while others aim to expand on established networks [41]. In either case, the goal is to generate new mechanistic hypotheses. Due to the combinatorial complexity of the model space and the inherent stochasticity of biological systems, inference is always an underdetermined problem, and the correspondence between inferred networks and actual biological reality may be low, independent of the performance of the model. Constraining inference to at least partially conform with known biology can help by "anchoring" inferred networks. Another option is to use many biological models in an ensemble learning strategy to reduce bias.
Upscaling algorithms are a common example of network inference. These approaches infer a higher-level representation (e.g., Activity Flow) from a lower-level prior network (e.g., protein-protein interactions) using omics profiles. Upscaling can also be used to assign weights, direction, sign, and rate constants to edges on a graph.
Phenotype prediction aims to predict how an organism or system responds to disease states and perturbations. These methods may be applied at a cellular level to project signaling events and transformations, as well as broad phenomena like cell proliferation and survival, but they can also be extended to a network medicine approach, where predictions are made at a patient level to inform diagnosis, prognosis, or treatment response [42,43]. An excellent example of such an analysis is the use of co-phosphorylation networks to predict breast cancer subtypes with the CoPPNet [44] algorithm.
Effective phenotype prediction is arguably more difficult than the prior two tasks. Phenotype is a function of the whole system, which often contains feedback loops and other non-linear response circuitry. It is also inherently multimodal: at a minimum, it requires one omics measurement and one phenotype measurement modality, for example, disease-free survival. Each of these factors can confound phenotype prediction tools.

Levels of prior knowledge networks
There are many ways of representing molecular processes in a graph model. The choice of representation is often dictated by the volume of experimental data informing different parts of the model. As models increase in their level of mechanistic detail, they also decrease in the scope of the biology they can cover. For example, a simple PPI network (level 2) with hundreds of proteins may be constructed from a single Co-IP assay, while just one relationship in a quantitative model (level 5) may synthesize the results of many separate experiments. The 5 levels are expressed visually in Figure 3 and are explained in further detail below.

FIGURE 3 The 5 levels of network models. Scope refers generally to the size of networks and the volume of interactions recorded at that level. Mechanistic detail refers to whether the stepwise processes of a reaction are explicitly given in the network model. Causality refers to whether the network model can be used to make causal inferences that can be statistically interrogated.

Level 1: Gene Sets, as the name implies, are curated lists of genes grouped by association with a particular phenotypic outcome, molecular pathway, or cellular event. Gene sets, although not networks per se, are often derived from network representations, such as boundaries of KEGG [3] pathways. Pathway boundaries are fiat boundaries [45], induced primarily through human demarcation. For example, despite covering the same biological processes, KEGG [3] pathways contain 4 times more entities on average compared to BioCyc [46] pathways, primarily due to differences in curation guidelines. They also produce substantially different results when these fiat boundaries are used as input for gene set enrichment tasks [47]. Although they describe well-studied biological mechanisms, gene sets do not contain any detail in the form of directed and/or signed edges.
Network-based analysis using gene sets typically performs an explanation extraction task, which involves testing for statistical enrichment of gene sets or their components to propose explanations for observed cellular behavior, for example, highlighting the most dramatically enriched pathway in a cancer biopsy to determine possible therapeutic targets. Some gene sets, like MSigDB's hallmark gene sets [48], describe phenotypes, and can therefore be used for phenotype prediction tasks.
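The core statistic behind many such enrichment tests is over-representation under the hypergeometric distribution; a minimal sketch with illustrative counts:

```python
from scipy.stats import hypergeom

# Over-representation test for one gene set (numbers are invented).
N = 20000   # genes in the background
K = 150     # genes in the pathway-derived gene set
n = 500     # differentially expressed genes
k = 12      # overlap between the two lists

# P(overlap >= k) under random sampling without replacement
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"enrichment p = {p_value:.3g}")
```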
Level 2: Interaction Networks represent interactions between biological entities by unsigned, undirected edges. These edges do not contain any cause/effect semantics and therefore cannot be used to make causal predictions. These simple interactions can be detected in large quantities through high-throughput methods; hence there are millions of interactions present in existing data sources, an order of magnitude more than subsequent levels. Additionally, interaction networks are simple to align and integrate with one another, as each entity is typically represented by only one node in the graph. They are commonly used as a starting point in untargeted high-throughput assays where quantitative measurements are recorded for many entities and the researcher wants to look broadly at their data without necessarily seeking causal explanations.
Level 3: Activity Flow networks, like interaction networks, typically contain one node for a given entity, allowing for easy integration of multiple networks so long as naming conventions for entities are consistent. In contrast, activity flow networks add a layer of cause/effect semantics in the form of directed/signed edges. For this reason, activity flow networks can be used for making causal predictions, and while these networks are considerably smaller than level 2, they are expansive enough that they can still be used for interrogating untargeted high-throughput datasets.
Level 4: Process Description networks illustrate the mechanistic detail of how a reaction occurs. Because these models describe the stepwise events in a reaction, it is not uncommon that one edge is informed by multiple sources, making them very well grounded in the literature. They are considerably smaller than the preceding levels, given that most, if not all, of their curation must be done by hand. Unlike prior levels, these diagrams represent the same entity with multiple nodes, corresponding to each of that entity's states through a sequence of events, including covalent modifications, cellular/subcellular locations, and/or complex memberships. This makes the integration of multiple process description networks a considerably more intensive exercise relative to levels 2 and 3, often requiring manual curation. Some networks and models fall into two consecutive categories. For example, the networks used in PhosphoSitePlus [49] and CausalPath [50] are represented as activity flow networks, however both describe posttranslational modifications, which lends to the mechanistic detail of a process description network. Molecular Interaction Maps [51] are equivalent in semantic detail to process description but keep an activity flow-like visualization. Finally, some large process description databases curate quantitative values such as enzymatic constants to allow for construction of quantitative models [52].

Level 5: Quantitative Models were originally derived from canonical chemical equations. These models are like process description networks in that they explicitly model the stepwise process of a reaction, but they are expanded to include quantitative factors like concentrations, stoichiometry, and rate constants. They are often used to describe systems which are very intensively studied, and they are typically small compared to the preceding levels due to the volume of research and the manual curation they require. Metabolic pathways are the most common representations at this level, as their enrichment for small molecules makes their reaction dynamics more easily characterized.
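For instance, a single step in such a model might be an enzymatic reaction with Michaelis-Menten kinetics; a minimal sketch with illustrative (not curated) parameters:

```python
from scipy.integrate import solve_ivp

# Toy level 5 model: one enzymatic reaction S -> P with
# Michaelis-Menten kinetics and illustrative constants.
Vmax, Km = 1.0, 0.5   # maximal rate and Michaelis constant

def rates(t, y):
    s, p = y
    v = Vmax * s / (Km + s)   # instantaneous reaction rate
    return [-v, v]            # substrate consumed, product formed

sol = solve_ivp(rates, (0.0, 10.0), [2.0, 0.0])
print(sol.y[:, -1])  # substrate and product concentrations at t = 10
```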

Classifying methods within the framework
To contextualize the above framework, we conducted a limited survey of algorithms and software tools which use networks as prior information in the analysis of omics data. We categorized these methods based on the level of network they employ and the computational task(s) they accomplish. Given that hundreds of new algorithms and approaches are published every year, an exhaustive survey is not feasible for the present review. Methods are extremely diverse in their input, operations, and output, but in any case, the overarching goal of these approaches is to produce something that can be perceived and/or interpreted by a human user. We do not include cross-method comparison of features and performance. For each method we give a brief synopsis, discuss key aspects of the method, and finally summarize any real-world applications or validation in a biological system that the authors describe in their manuscript.

Level 1: Gene Sets
ReactomeGSA [53]
Reactome Gene Set Analysis (ReactomeGSA) is an explanation extraction tool for comparative pathway-based gene set analysis. ReactomeGSA defines its gene sets from the pathways curated in the Reactome [2] database, then conducts a comparative gene set analysis at a pathway level to explain and biologically ground the differences between omics datasets, making it a quintessential explanation extraction tool with some phenotype prediction applications.
ReactomeGSA performs a differential expression analysis on a pathway scale for five quantitative omics data types, including microarray intensities, transcriptomics counts (raw or normalized), and proteomics data (spectral counts or intensity-based quantification). ReactomeGSA is also capable of analyzing single-cell RNAseq (scRNAseq) datasets by calculating the mean expression for genes in a cluster and using this as "pseudo-bulk" RNAseq to describe the cluster. For the analysis, the user selects an appropriate methodology depending on their datatype and computational capacity. ReactomeGSA currently accommodates three gene set analysis methodologies: PADOG [54], Camera [55], and single-sample Gene Set Enrichment Analysis via GSVA [56]. The results of the analysis are mapped to the complete pathway browser database, where the user can view the pathway-level enrichment scores in the hierarchical "tree view" and descend into individual pathways to view the differential gene expression values mapped to the corresponding genes in each pathway.
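The pseudo-bulk step can be illustrated in a few lines (toy matrix; real input would be a cells-by-genes expression matrix with cluster labels):

```python
import pandas as pd

# Average scRNAseq expression per cluster to obtain "pseudo-bulk"
# profiles (values are illustrative).
expr = pd.DataFrame(
    {"GeneA": [5, 3, 0, 1], "GeneB": [0, 1, 8, 9]},
    index=["cell1", "cell2", "cell3", "cell4"],
)
clusters = pd.Series(["c1", "c1", "c2", "c2"], index=expr.index)

pseudo_bulk = expr.groupby(clusters).mean()
print(pseudo_bulk)
```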
To demonstrate the clinical applications of ReactomeGSA, the authors conducted a comparative pathway analysis of tumor-induced plasmablast-like B-cell (TIPB) signaling across five cancer cohorts from The Cancer Genome Atlas (TCGA) [57]. These included melanoma, breast cancer, ovarian cancer, lung adenocarcinoma, and lung squamous cell carcinoma. The authors compared TIPB-high versus TIPB-low samples in each cohort, in addition to some cross-cohort comparisons. They found that pathway-based gene sets describing B-cell receptor signaling and apoptosis were enriched for TIPB-high melanoma and ovarian cancer samples, which they later correlated with improved survival in these groups. When compared to melanoma, lung adenocarcinoma samples with high TIPB retained a unique signaling phenotype. These samples exhibited downregulation of the pathway-based gene sets describing B-cell receptor signaling, NF-kB signaling, p53-associated DNA damage repair, cell cycle, and apoptosis.

Level 2: Interaction networks
SWAN [58]
SWAN (Shifted Weight Annotation Network analysis) is an explanation extraction tool that uses copy number alteration (CNA) data to predict the holistic genetic impact of gene-level changes. SWAN uses PPI networks extracted from MSigDB's Hallmark gene sets [48], GO [59], KEGG [3], and Reactome [2], as well as haploinsufficiency data, as prior information.

First, SWAN assigns genes represented in the dataset to their respective pathways based on prior knowledge from the aforementioned databases. It then uses known PPIs to build a network model within each pathway and scores each pathway by calculating a 'weighted network shift' based on comparison of the input data to a randomly generated null distribution. The weighted network shift is then used to calculate a P-value for each pathway. Individual genes which have a large influence on the network shift are also given as output.

For validation, the authors tested SWAN's ability to appropriately identify suppressed and elevated pathways and identify the drivers in their respective directions. A list of known oncogenes (OGs) and tumor suppressor genes (TSGs) from the COSMIC Cancer Gene Census [60] was used as a positive control, where OGs were expected to be identified as drivers in elevated pathways, while TSGs were expected to be identified as drivers in suppressed pathways. This test was carried out with CNA data from 26 tumor types in TCGA [57]. In 23 of the 26 cancer types assessed, TSGs were consistently found to be drivers in suppressed pathways, while in 22 of the 26 cancer types, OGs were found to be drivers in elevated pathways. SWAN was also evaluated for its performance with race-specific CNA patterns. The authors compared ovarian cancer samples from an African American population to samples from a non-Hispanic white population. SWAN identified that the cytokine pathway was elevated in the former population, which the authors connected to the overall poor prognosis in these patients.
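A generic sketch of the permutation-style significance test described above (the actual weighted network shift statistic is more involved; all values here are synthetic):

```python
import numpy as np

# Empirical P-value: how often does a random null score meet or
# exceed the observed pathway score?
def empirical_p(observed, null_scores):
    return (np.sum(null_scores >= observed) + 1) / (len(null_scores) + 1)

rng = np.random.default_rng(0)
null = rng.normal(size=10_000)   # randomly generated null distribution
print(empirical_p(observed=3.2, null_scores=null))
```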
GLRP [61]
GLRP (Graph Layer-wise Relevance Propagation) is a hybrid phenotype prediction and explanation extraction algorithm that aims to ground predicted graphs to known molecular networks such as PPIs. It extends the Layer-wise Relevance Propagation technique, which explains the decisions made by deep learning models, to graph convolutional neural networks.
The primary goal of GLRP is to explain the classification results of various omics data and molecular networks, which could facilitate decision-making processes in personalized medicine. GLRP interprets the classification output by leveraging the molecular network and also produces patient-specific subnetworks that can be used to explain clinical outcomes and therapeutic vulnerabilities.
GLRP was trained on gene expression datasets of breast cancer and human umbilical vein endothelial cells and used a PPI network from the Human Protein Reference Database [62] to structure the gene expression data. Predictive performance was evaluated using 10-fold cross-validation. In the breast cancer study, GLRP was used to classify patients into metastatic and non-metastatic groups. The results were compared with the classification performance of random forest and glmgraph [63] models, as well as weighted gene co-expression network analysis. GLRP outperformed the other models at the classification task, while also producing patient-specific subnetworks that could be used to better understand tumor biology beyond phenotype (metastatic/non-metastatic) classification.

Level 3: Activity flow
CausalPath [50]
CausalPath is an explanation extraction algorithm which uses causal relationships from Pathway Commons [5] as priors to extract a mechanistic explanation for the patterns in proteomics, phospho-proteomics, and transcriptomics datasets. CausalPath produces causal hypotheses about the differences between comparable datasets, for example, biopsies from different conditions or timepoints, or the covariance across a cohort. These explanations are presented as an activity flow sub-network, which can also be expanded as a more detailed process description network. The method mimics a biologist's traditional approach of explaining changes in data using prior knowledge, but does this at the scale of hundreds of thousands of reactions.
CausalPath employs 12 pre-defined patterns that describe causal relationships between biological entities in the network; for example, a kinase phosphorylating another protein implies an expected correlation between the kinase's abundance or activating phosphorylation and the phosphorylation of the target protein. Using these pre-defined patterns, CausalPath assembles an activity flow network showing the causal relationships supported by the proteomic, phospho-proteomic, and transcriptomic data.

CausalPath was applied to several publicly available datasets covering a wide range of scenarios and biological questions. In a set of time-resolved EGF stimulation experiments, CausalPath detected EGFR activation via downstream signaling of MAPK, including feedback inhibition on EGFR. From ligand-induced and drug-inhibited cell-line experiments, CausalPath estimated the precision of its predictions. Using MS datasets from CPTAC (Clinical Proteomic Tumor Analysis Consortium) [64] ovarian and breast cancer cohorts, CausalPath elucidated general and subtype-specific signaling, as well as regulators of well-known cancer proteins. In phospho-proteomics datasets of 32 TCGA [57] cancer studies, CausalPath found a core signaling network that is recurrently identified across many cancer types.
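As a sketch of what evaluating one such pattern against data might look like (synthetic numbers; this illustrates the correlation check, not CausalPath's implementation):

```python
import numpy as np
from scipy.stats import pearsonr

# A kinase -> substrate edge from the PKN is deemed supported if
# kinase abundance correlates with substrate phosphorylation.
rng = np.random.default_rng(1)
kinase_abundance = rng.normal(size=30)  # 30 samples
substrate_phospho = 0.8 * kinase_abundance + rng.normal(scale=0.5, size=30)

r, p = pearsonr(kinase_abundance, substrate_phospho)
if r > 0 and p < 0.05:
    print(f"edge supported by the data (r={r:.2f}, p={p:.2g})")
```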
CoPPNet [44]
CoPPNet is a network inference tool which uses known functional associations between phosphosites, PPIs, and kinase-substrate associations as prior knowledge to construct co-phosphorylation (Co-P) modules representing the patterns in phospho-proteomics data. The Co-P modules created by CoPPNet can be used for phenotype prediction tasks.
CoPPNet first constructs a PhosphoSite Functional Association (PSFA) network that models potential functional relationships between phosphosite pairs. Edges are informed by existing databases: PTMCode [65] is used for functional, structural, and evolutionary associations, PhosphoSitePlus [49] for kinase-substrate associations and inferring shared-kinase pairs, and BioGRID [66] for PPIs.
Data from MS-based phospho-proteomics assays are then incorporated using biweight midcorrelation to assess co-phosphorylation of phosphosite pairs connected in the PSFA network, resulting in a weighted PSFA network. Finally, subnetworks enriched in highly co-phosphorylated phosphosite pairs are extracted. To achieve this, the weighted PSFA network is searched for subnetworks using a greedy algorithm to maximize the Co-P score, resulting in a list of ranked subnetworks referred to as Co-P modules. Modules are then assessed for statistical significance, subtype specificity, predictive ability, and reproducibility.
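A greedy subnetwork search of this flavor can be sketched as follows (a generic illustration, not CoPPNet's exact scoring or procedure; edge weights stand in for Co-P scores):

```python
import networkx as nx

# Grow a module from a seed node by repeatedly adding the neighbor
# with the best mean edge weight into the module.
def greedy_module(g, seed, size):
    module = {seed}
    while len(module) < size:
        candidates = {n for m in module for n in g.neighbors(m)} - module
        if not candidates:
            break
        def mean_weight(n):
            w = [g[m][n]["weight"] for m in module if g.has_edge(m, n)]
            return sum(w) / len(w)
        module.add(max(candidates, key=mean_weight))
    return module

g = nx.Graph()
g.add_weighted_edges_from([("s1", "s2", 0.9), ("s2", "s3", 0.7), ("s1", "s4", 0.2)])
print(greedy_module(g, "s1", 3))  # {'s1', 's2', 's3'}
```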
The authors applied CoPPNet to two independent breast cancer phospho-proteomics datasets to understand whether Co-P modules could be used to differentiate breast cancer subtypes. The top scoring modules generated for both independent datasets exhibited a significant overlap, lending confidence that the modules represented biologically meaningful patterns. These modules were also enriched for phosphosites that exhibited significantly different phosphorylation between Basal and Luminal samples. They trained a classifier on the first dataset, using the phosphosites in the Co-P modules as features, and asked the model to identify the subtypes of samples in the second dataset. They then compared the model's performance when other sets of phosphosites were used as features, including (a) all sites represented in both datasets, (b) all sites with significantly different phosphorylation between the two datasets, or (c) the top 74 sites with the highest fold change between the two datasets. The model's performance was greatly improved using the Co-P phosphosites as features, supporting that the Co-P modules are representative of subtype-specific phosphorylation patterns.

IntOMICS [41]
IntOMICS is a network inference algorithm which reconstructs gene regulatory networks using regulatory relationship information from KEGG [3] as prior knowledge. It also integrates gene expression, DNA methylation, and copy number variation data, as well as target gene-transcription factor associations from ENCODE (The Encyclopedia of DNA Elements) [67].
IntOMICS is a Bayesian framework based on the Werhli and Husmeier (W&H) algorithm [68], which encodes each omics data source into a separate energy function. IntOMICS integrates the omics data by encoding the energy functions into a Gibbs distribution, where the effects of multiple upstream controllers are additive. The inverse temperature hyperparameters for each source are tuned by sampling from the posterior distribution with Markov chain Monte Carlo (MCMC). Unlike the original W&H algorithm, IntOMICS uses an adaptive MCMC simulation and Markov blanket resampling to improve MCMC convergence speed.

For validation and comparison, the authors used IntOMICS to understand the mechanism of chemoresistance using primary colon cancer samples from a randomized Phase III clinical trial. Their goal was to identify downstream mediators of ABCG2, a gene which has been shown to contribute to chemoresistance. They compared the network generated by IntOMICS to those from an unaltered implementation of the W&H algorithm, as well as two other multi-omics integration frameworks, RACER [69] and KiMONo [70]. IntOMICS nominated more downstream mediators of ABCG2, which may be important for chemoresistance in colon cancer and survival.
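In general form, a Gibbs prior over candidate networks G that combines per-source energy functions E_i weighted by inverse temperatures β_i can be written as follows (a generic sketch; the notation is ours, not taken from the IntOMICS paper):

```latex
P(G \mid \beta) = \frac{\exp\!\big(-\sum_i \beta_i \, E_i(G)\big)}{Z(\beta)},
\qquad
Z(\beta) = \sum_{G'} \exp\!\big(-\sum_i \beta_i \, E_i(G')\big)
```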

Level 4: Process description
scFEA/FLUXestimator [71]
scFEA (single-cell Flux Estimation Analysis) is a phenotype prediction tool that infers metabolic flux from scRNAseq data using hand-curated metabolic pathways from KEGG [3], as well as some hand-curated mechanisms, as prior knowledge. In the web application of scFEA, FLUXestimator, metabolic pathways from Recon3D [72] are also available.

scFEA constructs a reduced network based on the prior network topology, genes with significant non-zero expression, and any preferred sub-network specifications from the user. This reduced network, termed a factor graph, is composed of metabolic modules (variables), representing groups of connected reactions, linked by intermediate metabolites (factors). For estimation, scFEA combines traditional flux balance analysis with an optimization goal of minimizing influx/outflux imbalances, while also incorporating enzyme transcript levels as a proxy for enzyme activity to further constrain the model search space.

scFEA was validated experimentally using matched scRNAseq and targeted metabolomics data collected from cells exposed to hypoxia and/or APEX1 knockdown. The authors observed that the predicted flux changes were consistent with the observed changes in the metabolomics data, supporting the method's accuracy.
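A minimal sketch of the imbalance-minimization idea, assuming a toy stoichiometric matrix and using bounded least squares in place of scFEA's actual optimization:

```python
import numpy as np
from scipy.optimize import lsq_linear

# Toy flux-balance sketch: three reactions around one metabolite M
# (v0 produces M; v1 and v2 consume it). S @ v is the net imbalance.
S = np.array([[1.0, -1.0, -1.0]])

# Enzyme transcript levels act as soft capacity proxies (illustrative).
expr = np.array([1.0, 0.7, 0.3])

# Fix the input flux v0 ~= 1 by augmenting the system, then solve
# min ||A v - b||^2 with fluxes bounded by expression-derived caps.
A = np.vstack([S, [1.0, 0.0, 0.0]])
b = np.array([0.0, 1.0])
res = lsq_linear(A, b, bounds=(0.0, 10.0 * expr))
print(res.x)  # fluxes that roughly balance metabolite M
```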
Fast-SL [36]
Fast-SL is an inference tool which uses iterative search space reduction for rapid prediction of synthetic lethal gene sets from large metabolic networks. The overarching goal of this algorithm is to improve the computational efficiency and speed of synthetic lethality prediction.
Fast-SL predicts synthetic lethal gene sets of up to four genes. Synthetic lethality prediction is an example of network inference for genetic interactions. The edges predicted in this inference task do not represent physical interactions, but rather phenomenological relationships between genes.
Generally, for the deletion of a gene/reaction to be considered lethal, the maximum growth calculated by flux balance analysis (FBA) must be smaller than a specified cutoff (v_co), typically 1% of the wild-type growth rate. Fast-SL calculates the lethality cutoff v_co as 1% of the 'minimum norm', which corresponds to the maximum wild-type growth rate. Fast-SL reduces its search space by iteratively removing synthetic lethal sets of lower order from the search space for higher-order sets. Beginning with single-lethal (first order) reactions, the search space is constrained to all reactions in the system with a non-zero flux in the distribution from the prior step. These reactions are exhaustively tested for single-lethality by setting the flux of each individual reaction to zero, calculating the biomass flux, and comparing it to the cutoff, v_co. If the biomass flux is less than the cutoff, the reaction is considered lethal and added to the set of single-lethal reactions. Single-lethal reactions are then pruned from the search space for double-lethal (second order) reactions. When calculating third order lethal reactions, the search space is further reduced by removing double-lethal reactions. The result is a search space which becomes smaller with increasing order of lethal gene sets, improving the efficiency of the algorithm.
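A simplified sketch of this pruning scheme for orders one and two (fba_growth is a hypothetical stand-in for an FBA solver; Fast-SL's additional flux-based restriction of the search space is omitted):

```python
from itertools import combinations

# fba_growth(knockouts=...) returns the maximal growth rate after
# removing the given reactions.
def find_lethal_sets(reactions, fba_growth, wild_type, frac=0.01):
    cutoff = frac * wild_type
    single = {r for r in reactions if fba_growth(knockouts={r}) < cutoff}
    # prune single-lethal reactions before searching for pairs
    survivors = [r for r in reactions if r not in single]
    double = {
        frozenset(pair)
        for pair in combinations(survivors, 2)
        if fba_growth(knockouts=set(pair)) < cutoff
    }
    return single, double

# toy stand-in: growth collapses only when r1 and r2 are both removed
toy = lambda knockouts: 0.0 if {"r1", "r2"} <= knockouts else 1.0
print(find_lethal_sets(["r1", "r2", "r3"], toy, wild_type=1.0))
```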
Using Fast-SL, the authors successfully identified lethal gene sets up to an order of four in E. coli, S. Typhimurium, and M. tuberculosis. They validated these results with an exhaustive search for first, second, and third order lethal gene sets. The authors reported an "exact match" between the number of lethal sets identified in the exhaustive search and those identified by Fast-SL. The authors also compared Fast-SL to another algorithm, SL Finder [73], which is also intended to reduce the computational intensity of identifying synthetic lethal gene sets.
Fast-SL identified 127 novel triplets in E. coli which were not found by SL Finder. These novel triplets were predominantly involved in central carbon metabolism and amino acid synthesis.

SUMMER [75]
SUMMER (Shiny Utility for Metabolomics and Multiomics Exploratory Research) is an explanation extraction tool that conducts metabolic pathway enrichment analysis to identify metabolic 'hotspots' for a given sample. SUMMER uses metabolic networks from KEGG [3] upscaled with predicted reaction potentials.

SUMMER infers the catalytic activity of each enzyme from transcriptomics or proteomics data. The estimated enzymatic activity is integrated with metabolomics data to model the change in reaction rate potentials between a perturbed condition and a reference condition. This ratio of reaction rate potentials is then bootstrapped to calculate a ranking score for each reaction. Using the rank scores, SUMMER identifies the 'hotspot' reactions in the network, representing the metabolic reactions most affected under the perturbed condition.

The authors applied SUMMER to metabolomics and transcriptomics datasets generated from a study using a mouse model of accelerated aging and dementia to examine the effects of a neuroprotective compound. Drug-treated mice and young mice were compared to untreated mice to identify the metabolic patterns associated with the drug's neuroprotective effects. Using this strategy, they found that a reaction catalyzed by one of the drug's enzymatic targets in acetyl-CoA metabolism was upregulated in the old untreated mice, while the drug-treated old mice exhibited similar activation to the young mice. SUMMER also identified the tricarboxylic acid cycle as upregulated in the old mice relative to the young and drug-treated groups.

DISCUSSION
Pathway or network analysis is often viewed as a one-size-fits-all approach that can be applied universally to any dataset. However, as our review demonstrates, network analysis encompasses a broad range of approaches with unique data requirements and diverse PKN sources. Any new project or program incorporating network analysis should carefully define the task at hand, explore the available prior information sources, and consider the integration and scalability challenges associated with each resource.

Networks and network-based methods are invaluable tools for the analysis of omics data. It is widely recognized that the selection of PKN can influence the outcome of analysis; therefore, selection of an appropriate PKN is key to producing reliable results. With such an enormous suite of network resources available, it can become overwhelming to select an appropriate model. To address this challenge, we present a framework for classifying PKNs and network-based methods. This framework characterizes PKNs in terms of their scope, mechanistic detail, and ability to inform causal predictions. We also outline some common computational tasks to describe the aims of network-based analyses. To contextualize the framework, we sampled a handful of published network-based methods and discussed their PKN selection, the tasks they aim to accomplish, their approach to analysis, and their real-world applications.
Looking ahead, we anticipate network analysis to gain even greater prominence, shifting towards more detailed approaches, for two reasons. First, rapid advancements in multi-modal, spatial, and single-cell modalities have enabled the measurement of subcellular protein localization changes, post-translational modifications, and molecular complexes at a single-cell scale using imaging modalities [76]. This wealth of information primarily resides in level 4 networks and, to a lesser extent, in level 3 networks. Effectively harnessing these rich datasets will necessitate the utilization of more detailed PKNs. Second, recent breakthroughs in large language models [77] have significantly enhanced our ability to extract knowledge from literature. Combining this capability with crowd-sourcing [78] and human-in-the-loop systems [79] holds the potential to reduce curation costs by two orders of magnitude [78], enabling near-complete curation of the entire biomedical literature on biological molecular processes. The increased completeness of PKNs, along with improved and larger datasets, will unlock extensive application areas for increasingly sophisticated network models.