ESCAPE: database for integrating high-content published data collected from human and mouse embryonic stem cells

High content studies that profile mouse and human embryonic stem cells (m/hESCs) using various genome-wide technologies such as transcriptomics and proteomics are constantly being published. However, efforts to integrate such data to obtain a global view of the molecular circuitry in m/hESCs are lagging behind. Here, we present an m/hESC-centered database called Embryonic Stem Cell Atlas from Pluripotency Evidence integrating data from many recent diverse high-throughput studies including chromatin immunoprecipitation followed by deep sequencing, genome-wide inhibitory RNA screens, gene expression microarrays or RNA-seq after knockdown (KD) or overexpression of critical factors, immunoprecipitation followed by mass spectrometry proteomics and phosphoproteomics. The database provides web-based interactive search and visualization tools that can be used to build subnetworks and to identify known and novel regulatory interactions across various regulatory layers. The web-interface also includes tools to predict the effects of combinatorial KDs by additive effects controlled by sliders, or through simulation software implemented in MATLAB. Overall, the Embryonic Stem Cell Atlas from Pluripotency Evidence database is a comprehensive resource for the stem cell systems biology community. Database URL: http://www.maayanlab.net/ESCAPE


Introduction
Embryonic stem cells (ESCs) are pluripotent cells characterized by their capability to self-renew and differentiate into all adult cell types. Recent efforts in systematically profiling ESCs have yielded a wealth of high-throughput data. Highthroughput technologies including gene expression microarrays, RNA sequencing, chromatin immunoprecipitation followed by deep sequencing (ChIP-chip/seq), genomewide inhibitory RNA (RNAi) screens, immunoprecipitation followed by mass spectrometry (IP-MS) proteomics and phosphoproteomics, as well as other emerging technologies have been applied to profile the same cell type by many laboratories across the world in the past decade. Several databases and tools have been published to facilitate the integration of such data (1)(2)(3)(4)(5)(6)(7)(8)(9), and such efforts pave the way toward an in silico reconstruction of the gene and protein regulatory networks that regulate selfrenewal and pluripotency of these important cells. For example, Plurinet (2), FunGenES (4), StemBase (5), SyStemmCell (10), iScMiD (9) and PluriNetWork (1) incorporate stem cell data from several studies and provide web-based interfaces for data query and visualization. However, in general, these databases contain information from a single regulatory layer, mostly transcriptome measurements, and thus overlook other important layers as well as cross-layer interactions. To address the need for further data integration in the field, we constructed a more inclusive database called Embryonic Stem Cell Atlas from Pluripotency Evidence (ESCAPE). This database integrates numerous additional types of data ranging from epigenetics, transcriptomics, to proteomics and phosphoproteomics. These data sets are processed into gene lists, gene-gene and protein-protein interactions, and data tables for easy download and manipulation. In addition, a rich-content web-based application has been developed to enable users to interact with the various types of data in the ESCAPE database. These tools enable users to construct subnetworks, perform enrichment analyses visualized on a canvas and predict lineage specification based on in silico gene KDs or overexpressions.

Results
A comprehensive embryonic stem cell database constructed from published high-throughput studies Results from numerous published mouse and human embryonic stem cells (m/hESC) genome-wide profiling studies, as well as loss-of-function/gain-of-function (LOF/GOF) studies, were systematically collected and processed to construct the ESCAPE database. Most data sets are from mouse with several from human embryonic stem cells. In its current version, ESCAPE contains (i) 206 521 documented protein/DNA interactions from ChIP-chip/seq studies, connecting 61 transcription factors (TFs) to their putative target genes; (ii) 153 920 LOF/GOF interactions connecting 28 TFs from LOF KD/knockout studies followed by genomewide expression, and 55 TFs from GOF overexpression studies followed by genome-wide expression. These interactions directly or indirectly connect a target gene to an upstream TF regulator. These interactions are directed (arrow from the factor to the target) and signed (activation/inhibition); (iii) 1037 protein-protein interactions from IP-MS interactome studies centered on 16 bait proteins, as well as from smaller-scale studies; (iv) 813 gene-products functionally identified in five large-scale RNAi screens as key regulators of mESC pluripotency; (v) 19 801 m/hESC and differentiating progeny-specific nuclear proteins from whole nuclear MS proteomic analyses; (vi) 8323 ESC and differentiating progeny-specific phosphoproteins with identified phosphosites extracted from four studies; (vii) three genome-wide microarray mRNA time courses collected during mESC differentiation from one study; (viii) one genome-wide microRNA (miR) expression data set collected from mESCs; and (xi) 18 genome-wide ChIP-chip/seq histone modification studies in ESCs and early differentiated cells. The ESCAPE database descriptive statistics are provided in Table 1. The references are also listed in Table 2. The entity relationship diagram of the database design is shown in Figure 1. Data sets to construct the ESCAPE database are freely downloadable and searchable online. The ESCAPE database is stored as a MySQL relational database. The web interface is implemented as a set of PHP scripts running under Apache as well as a set of Java Servlets running under Tomcat all interacting with the database using SQL. The network viewer used in the network generator page is Cytoscape Web (11) implemented in Flash. The canvas visualization within the enrichment analysis page is implemented with the JavaScript library D3 (12). JavaScript and AJAX are implemented throughout the site for improving user experience (UX) (13). The web interface contains several modules: (i) an interface to browse and query the data; (ii) an interface to download the data; (iii) a tool to generate subnetworks from an input list of genes using background networks generated from the database; (iv) a tool to perform enrichment analysis on user entered gene lists using background lists of genes generated from the database and visualized on a canvas, as well as enrichment analysis of user inputted lists using Enrichr, a tool to visualize enrichment results against 35 gene set libraries (14); (v) an interface to predict lineage commitment on gene KDs or overexpressions; (vi) a downloadable MATLAB software with a graphical user interface for learning Boolean functions and simulating subnetwork dynamics given a prior subnetwork topology and experimental measurements of subnetwork node expression levels across many conditions ( Figure 2). Details of the modules are described in the following sections.

Browsing and querying data sets within the ESCAPE database
The ESCAPE database provides web-based user interface to allow easy browsing and querying. From the Browse page of the web interface, users can click on one of the tables listed on the left, and then the contents of the selected table are displayed in the center of the page. The contents of the table can be sorted by clicking the name of the column. In addition, information about the methods used to generate the table and the number of entries are displayed above each table. There are two ways to search the ESCAPE database: (i) a general search for a gene using the search bar displayed on top of any web page of the ESCAPE web interface or (ii) a detailed search within a selected table. The detailed search is provided under the Browse section of the website. In the case of looking for a specific gene name using the global search, a list of all the tables where the gene appears is displayed in the search results page, and direct   (1) Operator = and 'NANOG, ESRRB, SOX2' listed in the GeneName column. is written in the PerturbType column.
All the tables of the ESCAPE database can be freely downloaded from the Download page of the website. The tables are provided as either flat tab-separated text files or as mySQL files.

Enrichment analysis with ESCAPE
Another function of the ESCAPE web interface is the ability to perform enrichment analyses ( Figure 4A). The enrichment analysis tool performs gene list enrichment analysis using the various experimental modalities that produced gene lists. These include candidate genes from RNAi screens, protein lists from IP-MS pull-downs, genes differentially expressed after KD or overexpression, and target genes for TFs and histone modifications as determined by ChIP-seq/chip. In this web application portion of the site, users can query their own gene lists for overlap with gene lists from the ESCAPE database or analyze their gene list with another external gene list enrichment analysis tool called Enrichr (14). On the left, users can cut and paste lists of Entrez gene symbols and then press Submit to perform the enrichment analysis. In the middle, most of the lists from the ESCAPE database are visualized as a canvas. Each square represents a list. The color indicates the experiment type, and the brightness indicates the level of local similarity among the lists. We use simulated annealing to arrange the lists from the ESCAPE database by their gene content similarity using the Sets2Networks algorithm (17). The enriched terms appear as circles on top of the colored squares representing the gene lists from the ESCAPE database on the canvas: the brighter the circle, the more significant the overlap with the input list. The results are also available in a table with the associated p-values on the right. To compute statistical enrichment, the Fisher exact test is implemented. The resulting lists of enriched experiments only show the enriched terms determined by a cutoff threshold P-value of P < 0.05.
We created two examples to demonstrate how the enrichment analysis with the canvas visualization can be informative for obtaining new insights. We took two lists of genes that when knocked out in mice are causing the phenotypes of 'embryonic growth arrest' and 'abnormal kidney physiology' based on the MGI-MP ontology (18) terms 1730 and 2136 respectively. The enrichment results for 'embryonic growth arrest' show that the enriched terms are clustered in few specific areas on the grid ( Figure 4B). The clustering of enriched terms is clearly not random. The input genes contain H3K36ME targets that are also Oct4 interacting proteins. Interestingly, there is also high overlap with TCFC2L1 interacting proteins as determined by proteomics and target genes of TCFC2L1 as determined by ChIP-seq. The enrichment results for 'abnormal kidney morphology' are all clustered in the same area, which mostly represents the PRC2 complex members, known to suppress the expression of genes required for terminal differentiation, including those critical genes for the maintenance of kidney morphology (19) (Figure 4C). Overall, such analyses can be used to link relevant phenotypes to specific regulatory mechanisms in embryonic stem cells, as well as help experimental stem cell biologists who perform high throughput experiments to place their results in context of prior studies.

Lineage specification prediction with ESCAPE
The next function of the ESCAPE web interface is a tool to predict lineage-propensity differentiation outcome on single or combinatorial KD of multiple pluripotency factors ( Figure 5). The tool considers the target genes of knocked-down pluripotency factors and predicts the additive expression of lineage markers based on the combinatorial additive predicted levels of these factors. Specifically, effects of gene KDs on lineage commitment are dynamically computed by enrichment analysis for targets of knocked-down factors against lists of lineage-specific marker genes using the Fisher's exact test. Targets of KD factors were first identified from the LOF/GOF table, and lineage specific components were assembled manually from literature as follows: (i) Trophectoderm: the gene expression data set (GSE11523) reported trophectodermlike state after depletion of Oct4/Pou5f1 in mESCs. Gene expression was profiled at six time points. Genes were sorted according to average fold change of expression on differentiation related to time point 0. The top 5% of genes with an average fold change of at least two and with a monotone increase in expression at each time point upon differentiation were considered as trophectoderm markers. (ii) Primitive endoderm: the same set of experiments and data processing as described for (i) were conduct after overexpression of Gata6 in mESCs. (iii) Neuroectoderm: the gene expression dataset (GSE12982) isolated Sox1-GFP positive cells from mESCs where Ezh1 and Ezh2 were knocked-down. Genes were sorted according to fold change increase in expression comparing differentiated cells to mESCs. The top 10% genes with a monotonic increase and fold change of at least 1.5 were considered as neuroectoderm markers. (iv) Mesendoderm: the same set of experiments and data processing as described for (iii) were conduct after isolation of T-GFP positive cells (T stands for the gene brachyury). By sliding the bars on the web interface, users can choose the components and level of knockdown of 14 pluripotency factors. Corresponding positive and negative targets of each specific pluripotency factor were extracted from the LOF table within the ESCAPE database. As a result, the enrichment P-values reflecting the significance of differentiation potential toward each specific lineage on knockdown(s) are displayed on top. In addition, the up and down genes are provided in two text boxes below the lineage prediction display. Such lists can be further analyzed using the external enrichment analysis tool Enrichr (14) or any other tool available within ESCAPE or beyond.

Functional associations among 15 pluripotency regulators and 15 lineage markers
The aggregated data in ESCAPE can be used to elucidate functional associations among pluripotency and differentiation components across various regulatory layers. Specifically, to demonstrate the usefulness of the compiled ESCAPE database to dissect the pluripotency machinery, we examined functional correlations among 15 pluripotency factors and 15 differentiation markers selected (20). Heatmaps of degree of similarity were constructed ( Figure 6) where we scored pair-wise similarity distance between the components as follows: (i) Shared targets from the ChIP-chip/seq experiments; (ii) Co-expression similarities based on global mRNA measurements after pluripotency TF LOF or GOF; (iii) Histone modification target gene similarities analyzed in mESCs and differentiated cells; (iv) Protein co-occurrence measured after pull-downs of pluripotency TFs followed by MS proteomics; (v) Similarities of miR targets predicted computationally and limited to miRs preferentially expressed in mESCs; and (vi) co-expression similarities during embryoid body differentiation. Additionally, a multi-layer heatmap integrating all six layers was created. As expected, pluripotency regulators and differentiation markers generally cluster into two separate groups. A previous attempt to generate a heatmap for 13 pluripotency regulators based solely based on genomic target binding similarities resulted in slightly different clusters (21). Here, Oct4/Pou5f1 shares greatest functional similarity with Sall4 and Zfp42 (also called Rex1) ( Figure 2G). This is consistent with a report that Sall4 and Oct4/Pou5f1 form a regulatory feedback loop (22). In addition, Rex1 is a known target of Oct4/Pou5f1. However, it is interesting that Rex1 is so closely associated with Oct4/Pou5f1 across several layers. Surprisingly, Gli2, a known ectoderm marker, is highly correlated with pluripotency components across numerous layers, suggesting a function in the pluripotent state for this gene. Gli2 is a downstream TF effector of Hedgehog signaling (23), and thus, potentially linking this pathway to pluripotency. Binding of Gli1 and Gli2 to the Nanog regulatory sequences in neural stem cells has been reported (24). Based on a recent genome-wide RNAi screen, another member of the Gli family, Gli3, was among the hits of genes involved in mESC early differentiation (25).