Prior knowledge guided active modules identification: an integrated multi-objective approach

Background Active module, defined as an area in biological network that shows striking changes in molecular activity or phenotypic signatures, is important to reveal dynamic and process-specific information that is correlated with cellular or disease states. Methods A prior information guided active module identification approach is proposed to detect modules that are both active and enriched by prior knowledge. We formulate the active module identification problem as a multi-objective optimisation problem, which consists two conflicting objective functions of maximising the coverage of known biological pathways and the activity of the active module simultaneously. Network is constructed from protein-protein interaction database. A beta-uniform-mixture model is used to estimate the distribution of p-values and generate scores for activity measurement from microarray data. A multi-objective evolutionary algorithm is used to search for Pareto optimal solutions. We also incorporate a novel constraints based on algebraic connectivity to ensure the connectedness of the identified active modules. Results Application of proposed algorithm on a small yeast molecular network shows that it can identify modules with high activities and with more cross-talk nodes between related functional groups. The Pareto solutions generated by the algorithm provides solutions with different trade-off between prior knowledge and novel information from data. The approach is then applied on microarray data from diclofenac-treated yeast cells to build network and identify modules to elucidate the molecular mechanisms of diclofenac toxicity and resistance. Gene ontology analysis is applied to the identified modules for biological interpretation. Conclusions Integrating knowledge of functional groups into the identification of active module is an effective method and provides a flexible control of balance between pure data-driven method and prior information guidance.


Background
With the development of high-throughput data collection technologies, vast amounts of omics data that cover different species and different levels of biological activities have accumulated exponentially. These varied omics data, including the genome sequencing data (genomics), genome-wide expression profiles (transcriptomics), and protein abundances data (proteomics), provide valuable information concerning the intrinsic mechanisms underlining biological processes. With the accumulation of large datasets, one of the most essential challenges for researchers is that how to properly interpret these data. Take gene expression data analysis as an example, methods have evolved from the simple single or multivariate statistical analysis, e.g., calculation of fold-change, identification of differential expressed genes, to integrated approaches that integrate prior knowledge and different datasets [1]. As a research field driven by those integrated approaches, network biology has gained popularity recently years.
Network biology offers a highly abstract model of networks to characterize various levels of biological systems and provides insights into those system by taking advantages of network theory [2,3]. Although currently it's not able to fully capture the diversity and dynamics of complex biological system [4], it is still one of the most promising and fast developing research area in modern biology. Many studies have been performed on the construction of networks from biological systems and the structural and functional features that may respond to related biological information. Network construction methods are varied from calculating pair-wise correlation coefficient of expression data (correlation network [5]), filtering from existing interaction database (protein-protein interaction network [6][7][8][9]), or integrated approaches based on both expression data and metabolic models (tissue specific metabolic network [10]). Structural features includes degree distribution, clustering coefficient, scalefree property [11], modularity [12] and network robustness [13]. One of the most studied features is modular structure.
Modular structure is one of the essential characteristics that reveal information about the relationship and interaction among components in the network. In biological networks, modules are considered as the functional units of cellular process and organization [14]. Varied definitions of module have been proposed and numerous methods have been developed to identify those modules [15,16], all aiming to reveal essential biological mechanisms [17,18]. Among them, active module detection is a successfully applied integrative approach. Active module is a densely connected area in network that shows striking changes in molecular activity or phenotypic signatures, which is often associated with a given cellular response. Active module is expected to reveal dynamic and process-specific information that is correlated with cellular or disease states.
A typical active module detection algorithm takes gene expression data, calculates statistical values indicating differential expression level, and searches in corresponding network to identify modules inside which gene activity changes significantly. The jActiveModule [19] method proposed by Ideker in 2002 is considered as the first to formulate active module detection into an optimization problem. It uses the standard normal inverse of single gene's p-value to measure the activity of one gene, aggregates the node scores for a given module with adjustment and background correction, and finally searches for high-scoring modules via simulated annealing. Many following methods adopt this framework of significant-area-search method. One representative research for identifying condition-responsive protein-protein interaction module used edge-based scoring method [6]. There are also formulations that combine both node and edge score [7,9]. As the problem of finding the maximal-scoring connected subgraph is NP-hard (non-deterministic polynomial-time hard) [19], heuristic algorithms are broadly used, e.g. simulated annealing [19], greedy search [20], and evolutionary algorithm [8,21]. Exact approaches via integer linear programming are also developed [22].
In this paper we propose a novel multi-objective active module identification algorithm. We first formulate the active module identification problem as a multi-objective problem, which not only maximises the activity score as defined by Dittrich and Klau [22] but also maximises the prior knowledge contained in the active module. The intuition behind this multi-objective formulation is to use prior knowledge to guide the search of novel information from data, i.e., active modules. The Pareto solutions from this multi-objective optimisation problem are then the optimal trade-off between known knowledge and novel information.
In order to solve this multi-objectie problem, we proposed a modified multi-objective evolutionary algorithm. One of the important details omitted in many papers of active module identification is how to ensure the connectedness of the solutions. Without this connectedness constraint, the optimal solution is trivial, i.e., the top genes with largest node scores. In order to ensure the connectedness of the identified active modules, we design a novel constraints based on algebraic connectivity. The algorithm is applied to a small molecular interaction network that was used by Ideker [19] and then applied to a large Protein-Protein Interaction (PPI) network constructed from microarray data on drug toxicity and resistance.

Problem formulation
The network G is represented as G = (V , E) with p v ∈ (0, 1) for v ∈ V where V is the set of nodes, E the set of edges, and p v the assigned p-value from differential expression analysis for each node v. In the proposed algorithm there are two objectives and one constraint for a given module A: • Active module score S A indicating significant changes in gene expression for a given module, to be maximized during search. • KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway coverage score R A for the number of covered metabolic pathway by genes in module, to be maximized. • Algebraic connectivity to check whether a given subgraph is connected or not, used as a constraint to ensure connectedness.

Active module score
Microarray analysis studies showed that expression data can be effectively estimated by many mixture-model methods that divide genes into two or more groups, one group contains genes that are differentially expressed, and other(s) not differentially expressed [1]. Among those many methods, Pounds and Morris proposed a beta-uniform mixture (BUM) model that very accurately describes the distribution of a large set of p-values produced from an microarray experiment [23]. The BUM model considers the distribution of p-values as a mixture of a special case of beta distribution (b = 1) and a uniform(0, 1) distribution, with a mixture parameter λ. The p-values under the null hypothesis are assumed to have a uniform distribution. Under the alternative hypothesis the distribution of p-values will have a high density for small p-values and can be described by B(a, 1). A general beta distribution B(a, b) is given by where (.) denotes the gamma function. As (1) = 1, the probability density function of BUM model is then reduced to for 0 < x ≤ 1, 0 < λ < 1 and 0 < a < 1. Given a set of p-values the two parameters of BUM distribution λ and a can be calculated by maximum likelihood estimation. Following the idea of Dittrich and Klau [22] to decompose signal component from background noise, an additive score to measure the significance of gene's differential expression is calculated as where τ is a threshold to determine the significance of a p-value. In order to control the estimated upper bound of the false discovery rate (FDR) introduced by Benjamini and Hochberg [24], τ could then be selected to ensure that FDR ≤ α for some predefined α using the following equation meaning the maximum proportion of the set of p-values that could arise from the null hypothesis.
After assigning score to each of the genes, the overall score for a given module A is then the summation of all genes' scores in it, given by

KEGG pathway coverage
KEGG is an integrated database of high level functions and utilities of biological systems [25]. KEGG PATH-WAY is a collection of manually drawn pathway maps representing the knowledge on the molecular interaction and reaction networks. Mapping of pathway information mainly relies on molecular datasets, especially large-scale datasets such as genomics, transcriptomics, proteomics, and metabolomics. Genes involved in the same KEGG pathway are considered as functionally related to each other. In the experiment KEGG pathway coverage score R A is formulated as the second objective to measure the enrichment of functional groups in a given module A.
The KEGG pathway information is retrieved from the KEGG REST-style entry for Saccharomyces cerevisiae (yeast) [26]. Each entry of the mapping data records one gene and its corresponding pathway. The records are then split into different groups labeled by the pathways. For the i-th pathway, V i stands for the set of genes it contains. Given a module A with V A as the set of genes, its KEGG pathway cover rate R i over the i-th pathway is calculated as meaning the percentage this pathway is covered by given module. The cover rate R i is then compared with a threshold R ratio to determine whether this pathway can be considered as enriched in the given module. The threshold shall be selected carefully. A too high value of R ratio leads to a tiny group of connected pathways genes with positive active module score as the search could not expand to other area under such stringent condition. On the contrary, a very low R ratio could not reflect the meaning for the second objective. In practice, R ratio is set to a series of values for preliminary experiment. The results are analyzed and compared to decide a suitable value. The total enriched pathway count R A is given by where P stands for total number of pathways.

Algebraic connectivity
The algebraic connectivity of a graph G, denoted as α(G), is the second-smallest eigenvalue of the Laplacian matrix of G. It serves as a good parameter to measure how well a graph is connected. α(G) is greater than zero if and only if G is a connected graph. The Laplacian matrix L of a simple graph G is calculated as where D is the degree matrix and A the adjacency matrix. The eigenvector ν of the square matrix L is the non-zero vector that satisfies λ is a scalar known as the eigenvalue associated with the eigenvector ν. Algebraic connectivity α(G) is the second smallest eigenvalue of the Laplacian matrix L.

Multi-objective optimization algorithm
In order to perform multi-objective optimization to maximize both active module score and KEGG pathway coverage score simultaneously, a multi-objective evolutionary algorithm modified from NSGA-II (non-dominated sorting genetic algorithm II, see [27]) is applied as search strategy for module detection.

Solution representation
A solution is represented as a binary vector of length |V |, where |V | is the size of network, i.e. total number of nodes. Adding or deleting a node is performed through simply flip one bit of the vector at corresponding site.

Fitness function
Active module score S A and KEGG coverage score R A are used as two objectives. As the implementation of the algorithm is aimed at minimization both objectives, scores calculated from above equations would be given an extra negative sign.

Initialization
The search starts by randomly initializing a group of small cores in network. Nodes with high S FDR (x) scores are selected as seeds of potential modules to begin with. Number of seed nodes is decided by the population parameter for evolutionary algorithm. For instance, if population is set to 50, nodes with top 50 S FDR (x) scores are selected as seeds. In the case when the population size exceeds network size, every node will be selected as a seed.
In initialization stage, neighboring nodes of a seed with positive scores are added to the module in which the seed represents.

Parent selection
Binary tournament selection is applied for selecting parents to reproduce. In some cases when the population converges too fast, this step is skipped to decrease selection pressure, thus the whole population would be used for reproduction.

Reproduction
Single point crossover is applied to selected parents. Mutation is performed by adding neighboring nodes with positive S FDR (x) score or in a pathway into current module each time. Offspring generated is added to parental population to form a combined population with twice the size, waiting to be sorted and selected.

Clearing procedure
An extra clearing procedure is applied after reproduction and before non-dominated sorting. The step is introduced because in practise the algorithm tends to generate a number of replicated solutions when converging towards global optima. However, considering the natural property of our optimization problem, it is reasonable to obtain multiple optima, both those global on the non-dominated Pareto front and those dominated local optima, each representing the most significantly changed modules and modules that do not change that significantly, but still worth looking into. This procedure, inspired and simplified from Petrowski [28], detects replicated solution groups, preserves one copy, and resets all other individuals as infeasible solutions which will soon be eliminated after soring and replacement.

Sorting and replacement
The algorithm uses fast non-dominated soring and crowding distance assignment as detailed in Ref [27] to generate new population from the combined population efficiently and preserve solution diversity.

Constraint handling
To ensure the connectivity of detected module after crossover, algebraic connectivity α(G) is used as a constraint. Solution with non-positive algebraic connectivity violates the constraint, indicating itself a disconnected subgraph and thus an infeasible solution. Replicated solutions are also marked infeasible in the clearing procedure.

Network construction Network 1: a small molecular interaction network on galactose utilization pathway
A small molecular interaction network once used by Ideker [19] is used as a test network. The molecular interaction networks visualization software Cytoscape [29] provides jActiveModule as a plugin to find expression activated modules. The tutorial in Cytoscape App Store [30] provides samples data consists of a network file as a model of the galactose utilization pathway in yeast and a companion expression file contains p-values to describe the significance of each observed change in expression. pvalues under condition labeled as GAL80R are extracted and overlaid to network file, resulting in a network with 330 genes.

Network 2: yeast drug reaction network constructed from differential analysis and interactome mapping
Gene expression data on yeast's reaction to diclofenac is downloaded from GEO (NCBI Gene Expression Omnibus) database [31]. Diclofenac is a widely used analgesic drug that can cause serious adverse drug reactions [32]. Yeast is used as model eukaryote to capture the cellular changes under the treatment of diclofenac. The data provides the microarray expression for diclofenac-treated yeast cells and control cells, each with 5 replicates. Differential expression analysis between diclofenac-treated group and control group is performed using the online tool GEO2R [33], with p-value adjustment set to Benjamini and Hochberg false discovery rate control. After deleting genes with adjusted p-value larger than 0.05, a set of differentially expressed genes is generated for interactome mapping.
Protein-protein interaction data is download from BioGRID [34], an integrated and up-to-date public database that archives and disseminates genetic and protein interaction data from model organisms and humans. To be specific, the downloaded file is BIOGRID-ORGANISM-LATEST.tab2.zip that separates interactions into distinct files based on Organism and was released on June 30, 2016. File for interactions of Saccharomyces cerevisiae is extracted for use. As the whole interaction data contains tens of thousands of proteins and millions of interaction records, a considerable amount of proteins have no corresponding records in given expression data or show no differential expression. Those proteins shall be excluded from the final network in order to avoid the waste of both computational resource and analysis attention. According to the filtering method applied by Muraro and Simmons [8], interactions containing at least one differentially expressed gene are selected as an attempt to include indirect interactions. The resulting network concerning yeast cellular reaction to diclofenac consists of 1803 nodes and 3356 edges.

Analysis of network 1
To estimate distribution for p-values, the parameters of BUM model a and λ are estimated by R package BioNet [35]. Figure 1 shows the fitted model. As the majority of genes in yeast network have a very significant p-value, threshold τ is calculated at an extremely stringent FDR level as an attempt to control the size of detected module. Parameter details are shown in Table 1.
As a benchmark, the jActiveModule method is applied to the network via Cytoscape plugin, generating 5 active modules by default. Figure 2 gives a visualization of the network by Cytoscape, with detected active modules mapped on it. To understand the biological function of modules, gene ontology (GO) annotation for biological process is applied to modules by enrichment analysis tools provided on Gene Ontology Consortium [36]. The tool only asks for a submission of gene list, GO type (biological process, molecular function, cellular component) and species. The results is shown in Table 2. Among the 5 modules, 3 modules are enriched in the GO term galactose catabolic process via UDP-galactose with p-values  The proposed algorithm is applied to network 1 with threshold R ratio = 0.6 for KEGG pathway coverage score, resulting in a set of 13 Pareto solutions. As a feature for multi-objective optimization, all the modules in the same Pareto front are equally good. No one out performs another. In order to show the difference of those modules in trade-offs between two objectives, we selected 3 modules from the 13 Pareto solutions: • Module 1: the extreme point on the Pareto front with maximum active module score S A = 393.41. GO analysis for biological process is performed on the three modules. The results together with the objective function values are tabulated in Table 3. We also visualize Modules 1 and 2 in Figs. 3 and 4, respectively.
By comparing the results in Table 3 with those in Table 2, we found that Module 1 identified by the proposed algorithm have better active module score (S A ) and KEGG pathway coverage score (R A ) than all the modules found by jActiveModule algorithm. Such results indicate that by incorporating the prior knowledge, we can guide the algorithm to search areas in the network with more significant activity.
From these two figures and Table 3, we found that compared with jActiveModule that searches for small and separated modules, the proposed algorithm tends to identify Fig. 3 Visualization of Module 1 with maximized active score S A detected by the proposed algorithm in network 1. Node color and border are set the same as Fig. 2. Module contains the majority of red nodes that are connected densely, indicating high activity. Notice that compared to small separated modules identified by jActiveModule shown in Fig. 2, this module tends to connect small areas of red nods by including linkage nodes with white or light green color. Although these intermediate nodes shows only modest changes in expression, they serve as bridges for cross-talk between functional groups, or as transcription factors that regulate other genes Fig. 4 Visualization of Module 2 which is the knee point of the Pareto front with optimal trade-off between S A and R A detected by the proposed algorithm in network 1. Node color and border are set the same as Fig. 2. Compared to Fig. 3, this module expands broader as R A gets higher a large connected subgraph. Even for Module 1 where the active module score is maximised, because of the integration of the prior knowledge, highly active areas are more likely to be connected together by intermediate nodes that might not be significantly differential expressed, but serve as a bridge for cross-talk between neighboring functional areas.
By visualisation of those Pareto solutions (figures not shown), we found that as the solution on Pareto front moves from maximum active score to maximum pathway coverage score, such intermediate nodes appear with higher frequency. We also found that, as R A gets higher, detected module expands from a small core area with high activity to a broad area with more varied functional groups while still keeping overall activity. This result indicates that by using prior knowledge, we are able to reveal underlying mechanisms that link different activities in the network.
While all the three modules are significantly enriched in the GO term "galactose catabolic process via UDPgalactose" (corresponding p-value 5.15 × 10 −03 , 1.63 × 10 −02 and 4.48 × 10 −02 , respectively), annotations for Module 1 (the extreme point with maximum activity score S A ) are more densely related with galactose metabolic process. On the other hand, for Module 3 with maximum KEGG pathway coverage score R A , core annotations remain the same while additional annotations concerning essential biological processes increases. However, it is worth noting that, all the additional annotations can be reasonably related to the cellular response to disturbance in galactose utilization pathway.
The most interesting module is Module 2, which represents the optimal trade-off between prior knowledge and novel information from data. It is worth noting from Tables 3 and 2 that, even it is a knee point solution, Module 2 has a slightly worse S A but much higher R A than all the modules identified by JActiveModule. We can also observe from Table 3 that, module 2 has a range of slightly broader annotations concerning metabolic process of galactose, pyruvate and gluconeogenesis, which are highly relevant to galactose untilization pathways [37].

Analysis of network 2
Parameters of BUM model a and λ to fit p-value distribution are estimated as shown in Fig. 5. Threshold τ is calculated at given FDR level. See Table 1 details of parameters.
The proposed algorithm is applied to network 2 with threshold R ratio = 0.8 for KEGG pathway coverage score, resulting in a set of 12 Pareto solutions. Solutions on the Pareto front are chosen for gene ontology analysis on biological process. Surprisingly, all identified modules shows a high consistency in the annotation on drug reaction, which exactly reflects the cellular response for yeast under the diclofenac treatment. Three genes (YDR406W, YOR153W and YOR153W, all act as ATP-binding transporter, for detailed functional explanation, see caption in Fig. 6) that play an important role in the cellular reaction and resistance to drug treatment are discovered in all the 12 modules, indicating the accuracy and robustness of searching algorithm.
Similar to the analysis methods for results in network 1, 3 representative modules on Pareto front with different trade-off between active score S A and pathway coverage score R A are select for gene ontology annotation (see Table 4) and visualization (Fig. 6). From Table 4 we can see that as pathway score R A increases, size of module increases and the annotation includes a larger range of biological processes. As drug reaction is considerably complicated response that involves a series of up or down regulation in related function groups such as protein kinase pathway, ribosome biogenesis, rRNA processing and zinc-responsive genes [32], the enriched annotation in modules with higher R A provides a guidance of deciding which functional groups to look into as it combines both prior knowledge from existing interaction database and novel information from gene expression data specific for given experimental conditions.

Conclusion
An integrated multi-objective approach has been proposed for active module identification. The algorithm is motivated by the idea that incorporating prior information into data-driven method would provide new insights into sophisticated biological processes. We also designed an constraint based on algebraic connectivity to ensure the connectedness of the identified active modules.
We first applied our algorithm on a small molecular interaction network, which identified a set of Pareto solutions that represents different trade-off between prior knowledge and novel information from data. Gene Ontology analysis results show that it successfully identifies modules with relevant and reasonable biological  Each node represents for a gene. The setting for node color is the same with Fig. 2. The turning point between red and green is set to the value τ = 7.71 × 10 −6 . Three rectangle shaped nodes with black border are genes involved in drug export and are highly consistent in all modules. YDR406W is an ATP-binding cassette multidrug transporter. YDR011W is a ATP-binding cassette transporter. YOR153W is also an ATP-binding cassette multidrug transporter. The three genes serve as an important role in yeast's resistance to diclofenac interpretations. The algorithm was applied to the second network, The approach is then applied on a microarray dataset from diclofenac-treated yeast cells and identify modules to elucidate the molecular mechanisms of diclofenac toxicity and resistance. The algorithm identifies accurate and consistent modules with biological function densely related to given cellular response, proving that the integrated approach for network construction is feasible and that the proposed algorithm is able to identify biologically meaningful modules in large scale network.

Availability of data and materials
The data and source code generated and analysed during the current study are available in the author's GitHub repository. https://github.com/ WeiqiChen/MOEA-active-module-identification.