Calculating Biological Module Enrichment or Depletion and Visualizing Data on Large-scale Molecular Maps with ACSNMineR and RNaviCell R packages

Biological pathways or modules represent sets of interactions or functional relationships occurring at the molecular level in living cells. A large body of knowledge on pathways is organized in public databases such as the KEGG, Reactome, or in more specialized repositories, such as the Atlas of Cancer Signaling Network (ACSN). All these open biological databases facilitate analyses, improving our understanding of cellular systems. We hereby describe the R package ACSNMineR for calculation of enrichment or depletion of lists of genes of interest in biological pathways. ACSNMineR integrates ACSN molecular pathways, but can use any molecular pathway encoded as a GMT file, for instance sets of genes available in the Molecular Signatures Database (MSigDB). We also present the R package RNaviCell, that can be used in conjunction with ACSNMineR to visualize different data types on web-based, interactive ACSN maps. We illustrate the functionalities of the two packages with biological data taken from large-scale cancer datasets.


Introduction
Biological pathways and networks comprise sets of interactions or functional relationships, occurring at the molecular level in living cells [1,2]. A large body of knowledge on cellular biochemistry is organized in publicly available repositories such as the KEGG database [3], Reactome [4] and MINT [5]. All these biological databases facilitate a large spectrum of analyses, improving our understanding of cellular systems. For instance, it is a very common practice to cross the output of high-throughput experiments, such as mRNA or protein expression levels, with curated biological pathways in order to visualize the changes, analyze their impact on a network and formulate new hypotheses about biological processes. Many biologists and computational biologists establish list of genes of interest (e.g. a list of genes that are dierentially expressed between two conditions, such as normal vs disease) and then evaluate if known biological pathways have signicant overlap with this list of genes.
We have recently released the Atlas of Cancer Signaling Network (ACSN), a web-based database which describes signaling and regulatory molecular processes that occur in a healthy mammalian cell but that are frequently deregulated during cancerogenesis [6]. The ACSN atlas aims to be a comprehensive description of cancer-related mechanisms retrieved from the most recent literature. The web interface for ACSN is using the NaviCell technology, a software framework dedicated to web-based visualization and navigation for biological pathway maps [7]. This environment is providing an easy navigation of maps through the use of the Google Maps JavaScript library, a community interface with a web blog system, and a comprehensive module for visualization and analysis of high-throughput data [8].
In this article, we describe two R packages related to ACSN analysis and data visualization.
The package ACSNMineR is designed for the calculation of gene enrichment and depletion in ACSN maps (or any user-dened gene set via the import function), while RNaviCell is dedicated to data visualization on ACSN maps. Both packages are available on the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/ACSNMineR/ and https://cran.r-project.org/web/packages/RNaviCell/), and on the GitHub repository (https://github.com/sysbio-curie/ACSNMineR and https: //github.com/sysbio-curie/RNaviCell). For   The statistical signicance of the counts in the modules is assessed by using either the Fisher exact test [9,10] or the hypergeometric test, which are equivalent for this purpose [11].
The current ACSN maps are included in the ACSNMineR package, as a list of character matrices.
> length(ACSN_maps) [1] 6 > names(ACSN_maps) [1] "Apoptosis" "CellCycle" "DNA_repair" "EMT_motility" "Master" [6] "Survival" For each matrix, rows represent a module, with the name of the module in the rst column, followed by a description of the module (optional), and then followed by all the gene symbols of the module. The maps will be updated according to every ACSN major release.
The main function of the ACSNMineR package is the enrichment function, which is calculating over-representation or depletion of genes in the ACSN maps and modules. We have included a small list of 12 Cell Cycle related genes in the package, named genes_test that can be used to test the main enrichment function and to get familiar with its dierent options. > genes_test [1] "ATM" "ATR" "CHEK2" "CREBBP" "TFDP1" "E2F1" "EP300" [8] "HDAC1" "KAT2B" "GTF2H1" "GTF2H2" "GTF2H2B" The example shown below is the simplest command that can be done to test a gene list for over-representation on the six included ACSN maps. With the list of 12 genes mentionned above and a default p-value cuto of 0.05, we have a set of 36 maps or modules that are signicantly enriched. The results are structured as a data frame with nine columns displaying the module name, the module size, the number of genes from the list in the module, the names of the genes that are present in the module, the size of the reference universe, the number of genes from the list that are present in the universe, the raw p-value, the p-value corrected for multiple testing and the type of test performed. The module eld in the results data frame indicate the map name and the module name separated by a column character. If a complete map is signicantly enriched or depleted, then only the map name is shown, without any module or column character.
For instance, the third line of the results object below concern the E2F1 module The enrichment function can take up to eight arguments: the gene list (as a character vector), the list of maps that will be used to calculate enrichment or depletion, the type of statistical test (either the Fisher exact test or the hypergeometric test), the module minimal size for which the calculations will be done, the universe, the p-value threshold and the alternative ("greater" for calculating over-representation, "less" for depletion and "both" for both tests).
Only the gene list is mandatory to call the enrichment function, all the other arguments have default values.
The maps argument can either be a dataframe imported from a gmt le with the format_from_gmt function or a list of dataframes generated by the same procedure. By default, the function uses the ACSN maps previously described. The correction for multiple testing is set by default to use the method of Benjamini & Hochberg, but can be changed to any of the usual correction methods (Bonferroni, Holm, Hochberg, Holm, or Benjamini & Yekutieli [12]), or even disabled . The minimal module size represents the smallest size value of a module that will be used to compute enrichment or depletion. This is meant to remove results of low signicance for module of small size. The universe in which the computation is made by default is dened by all the gene symbols contained in the maps. All the genes that were given as input and that are not present on the maps will be discarded. To keep all genes, the user can change the universe to HUGO, and in that case, the complete list of HUGO gene symbols will be used as the reference (> 39,000 genes). The threshold corresponds to the maximal value of the corrected p-value (unless the user chose not to correct for multiple testing) that will be displayed in the result table.
It may be of interest to compare enrichment of pathways in dierent cohorts or experiments. For example, enrichment of highly expressed pathways can reveal dierences between two cancer types or two cell lines. To facilitate such comparisons, ACSNMineR provides a multisample_enrichment function. It relies on the enrichment function but takes a list of character vector genes. The name of each element of the list will be assumed to be the name of the sample for further analysis. Most of the arguments given to multisample_enrichment are the same as the ones passed to enrichment. However, the cohort_threshold is designed to lter out modules which would not pass the signicance threshold in all samples.
Finally, to facilitate visualization of results, ACSNMineR integrates a representation function based on ggplot2 syntax [13]. It allows representation of results from enrichment or multisample_enrichment with a limited number of parameters. Two types of display are available: heat-map tiles or bars. For multiple samples using a barplot representation, the number of rows used can be provided, otherwise all plots will be on the same row. For the heatmap, the color of the non-signicant modules, and boundaries of the gradient for signicant values can also be tuned.
We previously computed the p-value of the genes_test list with default parameters. The number of modules which have a p-value below 0.05 was 36, that can be compared to the 49 obtained without correction with the simple command shown below (some of the results are displayed in table 2). enrichment(genes_test,correction_multitest = FALSE)

RNaviCell
The NaviCell Web Service provides a server mode, which allows automating visualization tasks and retrieving data from molecular maps via RESTful (standard http/https) calls. Bindings to dierent programming languages are provided in order to facilitate the development of data visualization workows and third-party applications [8]. RNaviCell is the R binding to the NaviCell Web Service. It is implemented as a standard R package, using the R object-oriented framework known as Reference Classes [14]. Most of the work done by the user using graphical point-and-click operations on the NaviCell web interface are encoded as functions in the library encapsulating http calls to the server with appropriate parameters and data. Calls to the NaviCell server are performed using the library RCurl [15], while data encoding/decoding in JSON format is performed with the RJSONIO library [16].
Once the RNaviCell library is installed and loaded, the rst step is to create a NaviCell object and launch the browser session. This will automatically create a unique session ID with the NaviCell server. Once the session is established, various functions can be called to send data to the web session, set graphical options, visualize data on a map or get data from the map. There are 125 functions available in the current version of RNaviCell. All of them are described with their dierent options in the RNaviCell documentation, and we provide a tutorial on the GitHub repository wiki (https://github.com/sysbio-curie/ RNaviCell/wiki/Tutorial).
In the simple example detailed below, we create a NaviCell session, then load an expression data set from a local (tab-delimited) le. The data represent gene expression measured in a prostate cancer cell line resistant to hormonal treatment (agressive), and is taken from the Cell Line Encyclopedia project [17].
We The dataset corresponding to this study is available as a Bioconductor package. The code shown below is creating a list of dierentially expressed genes between ER positive and ER negative samples, and calculates the enrichment in ACSN maps from this list of genes. As seen in Table 3 The Molecular Signatures Database (MSigDB) is one of the most widely used repository of well-annotated gene sets representing the universe of biolog- # create a NaviCell session, import the expression matrix on the map and create # heatmaps to represent the data points.

Analysis of glioblastoma mutation frequencies
Recent years have witnessed a dramatic increase in new technologies for interrogating the activity levels of various cellular components on a genome-wide scale, including genomic, epigenomic, transcriptomic, and proteomic information [20].
Integrating these heterogeneous datasets provides more biological insights than performing separate analyses. For instance, international consortia such as The Cancer Genome Atlas (TCGA) have launched large-scale initiatives to characterize multiple types of cancer at dierent levels on hundreds of samples. These integrative studies have already led to the identication of novel cancer genes [21].
Malignant gliomas, the most common subtype of primary brain tumors, are aggressive, highly invasive, and neurologically destructive tumors considered to be among the deadliest of human cancers. In its most aggressive manifestation, glioblastoma (GBM), median survival ranges from 9 to 12 months, despite maximum treatment eorts [22]. In this study we have analyzed whole-genome mutation data generated by the TCGA project on hundreds of patients. More specically, we parsed the MAF (Mutation Annotation Format) GBM les produced by dierent sequencing centers to count and calculate gene mutation frequencies. We kept the mutations having a status likely to disturb the target protein's function (i.e, Frame_Shift_Del, Nonstop_Mutation, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Splice_Site, Trans-lation_Start_Site). In total, we collected mutations for more than 13,000 genes in a total of 379 mutated samples. In order to retain the most frequently mutated genes, we calculated frequencies across all mutated samples, and kept genes having a frequency greater than 1% (3,293 genes). We further labelled genes having a frequency greater than 1% and less than 5% as "1" and genes highly mutated (frequency higher than 5%) as "2".
Of course, ACSNMineR is not the only R package for enrichment calculations. For instance, GOstats [23] is probably one of the rst packages that was created to calculate enrichment for Gene Ontology categories. GOstats can also be used to calculate enrichment for other biological pathways categories, such as KEGG pathways (by using an instance of the class KEGGHyperGParams) or PFAM protein families (using PFAMHyperGParams). However, its usage might not be as straightforward as ACSNMineR, and it does not seem possible to test user-dened biological pathways. Furthermore, other authors have pointed out that the KEGG database used by this package has not been updated since 2012. clusterProfiler is a recent R package released for enrichment analysis of Gene Ontology and KEGG with either hypergeometric test or Gene Set Enrichment Analysis (GSEA) [24]. Via other packages, support for analysis of Disease Ontology and Reactome Pathways is possible. Interestingly, this package also oers the possibility to import user-dened gene set, through tab-delimited pairwise denition les.
In order to improve ACSNMineR, we may in the near future try to improve the speed of calculation, which might be a problem if a very large number of samples or experiments have to be analyzed rapidly. For instance, we could use the foreach and \%dopar\% operator to parallelize the most computationally demanding operations. It could also be useful to implement more sensitive methods of gene set enrichment measures, such as the Gene Set Enrichment Analysis (GSEA) method.