MicroScope: ChIP-seq and RNA-seq software analysis suite for gene expression heatmaps

Background: Heatmaps are an indispensable visualization tool for examining large-scale snapshots of genomic activity across various types of next-generation sequencing datasets. However, traditional heatmap software does not typically offer multi-scale insight across multiple layers of genomic analysis (e.g., differential expression analysis, principal component analysis, gene ontology analysis, and network analysis) or multiple types of next-generation sequencing datasets (e.g., ChIP-seq and RNA-seq). As such, it is natural to want to interact with a heatmap’s contents using an extensive set of integrated analysis tools applicable to a broad array of genomic data types.
Results: We propose a user-friendly ChIP-seq and RNA-seq software suite for the interactive visualization and analysis of genomic data, including integrated features to support differential expression analysis, interactive heatmap production, principal component analysis, gene ontology analysis, and dynamic network analysis.
Conclusions: MicroScope is hosted online as an R Shiny web application based on the D3 JavaScript library: http://microscopebioinformatics.org/. The methods are implemented in R, and are available as part of the MicroScope project at: https://github.com/Bohdan-Khomtchouk/Microscope.


Background
Most existing heatmap software produces static heatmaps [21,25,26,29,32,35,36,44], without features that would allow the user to dynamically interact with, explore, and analyze the landscape of a heatmap via integrated tools supporting user-friendly analyses in differential expression, principal components, gene ontologies, and networks. Such features would allow the user to engage the heatmap data visually and analytically in real time, thereby allowing for a deeper, quicker, and more comprehensive data exploration experience.
An interactive, albeit non-reproducible, heatmap tool was previously employed in the study of the transcriptome of the Xenopus tropicalis genome [41]. Likewise, manual clustering of dot plots depicting RNA expression is an integral part of the Caleydo data exploration environment [42]. Chemoinformatic-driven clustering can also be toggled in the user interface of Molecular Property Explorer [27]. Furthermore, an interactive heatmap software suite was previously developed with a focus on cancer genomics analysis and data import from external bioinformatics resources [31]. Most recently, a general-purpose heatmap software providing support for transcriptomic, proteomic and metabolomic experiments was developed using the R Shiny framework [1].
*Correspondence: b.khomtchouk@med.miami.edu. 1 Center for Therapeutic Innovation and Department of Psychiatry and Behavioral Sciences, University of Miami Miller School of Medicine, 1120 NW 14th ST, 33136 Miami, FL, USA. Full list of author information is available at the end of the article.
Moreover, an interactive cluster heatmap library, InCHlib, was previously proposed for cluster heatmap exploration [39], but did not provide built-in support for gene ontology, principal component, or network analysis. InCHlib concentrates primarily on chemoinformatic and biochemical data clustering analysis, including the visualization of microarray and protein data. In contrast, MicroScope is designed specifically for ChIP-seq and RNA-seq data visualization and analysis in the differential expression, principal component, gene ontology, and network analysis domains. In general, prior software has concentrated primarily on hierarchical clustering, searching gene texts for substrings, and serial analysis of genomic data, with no integrated support for the aforementioned analyses [3,37,46].
To date, no free, open-source heatmap software has been proposed that explores heatmaps across this many levels of genomic analysis together with interactive visualization. Here we propose a user-friendly genome software suite designed to handle dynamic, on-the-fly JavaScript visualizations of gene expression heatmaps as well as their respective differential expression analysis, principal component analysis, gene ontology analysis, and network analysis of genes.
MicroScope employs the Bioconductor package edgeR [34] to provide a one-click, built-in differential expression analysis of gene expression data based on the quantile-adjusted conditional maximum likelihood (qCML) procedure and the Benjamini & Hochberg correction. edgeR is a count-based statistical method that expects input data in the form of a matrix of integer values. The value in the i-th row and the j-th column of the matrix tells how many reads (or fragments, for paired-end RNA-seq) have been unambiguously assigned to gene i in sample j [28]. Analogously, for other types of assays, the rows of the matrix might correspond to, e.g., binding regions (with ChIP-seq), species of bacteria (with metagenomic datasets), or peptide sequences (with quantitative mass spectrometry). In general, the values in the matrix must be raw counts of sequencing reads/fragments. This is important for the statistical model to hold, as only the raw counts allow the measurement precision to be assessed correctly. Counts that were pre-normalized for sequencing depth/library size should never be provided, as the statistical model is most powerful when applied to raw counts and is designed to account for library size differences internally via a series of built-in normalization procedures.
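The raw-count requirement described above can be illustrated with a minimal Python sketch. It is not MicroScope's or edgeR's code: the function names and the toy matrix are hypothetical, and the counts-per-million calculation uses plain library sizes (column sums) rather than edgeR's built-in normalization procedures.

```python
def validate_raw_counts(counts):
    """Raise ValueError unless every entry is a non-negative integer."""
    for gene, row in counts.items():
        for value in row:
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, int) or value < 0:
                raise ValueError(f"{gene}: {value!r} is not a raw read count")


def library_sizes(counts):
    """Column sums: total reads assigned to genes in each sample."""
    return [sum(column) for column in zip(*counts.values())]


def counts_per_million(counts):
    """Scale each count by its sample's library size (plain CPM)."""
    sizes = library_sizes(counts)
    return {
        gene: [1e6 * value / size for value, size in zip(row, sizes)]
        for gene, row in counts.items()
    }


# Hypothetical toy matrix: rows are genes, columns are two samples.
toy_counts = {"geneA": [10, 20], "geneB": [30, 60], "geneC": [60, 120]}
validate_raw_counts(toy_counts)
print(library_sizes(toy_counts))
print(counts_per_million(toy_counts)["geneA"])
```

In this toy example the second sample was sequenced twice as deeply as the first, so the raw counts differ while the counts per million agree. A matrix containing fractional (pre-normalized) values would be rejected by the validation step.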
The edgeR results supply the user with rank-based information about nominal p-value, false discovery rate, fold change, and counts per million in order to establish which specific genes in the data are differentially expressed with a high degree of statistical significance. This information, in turn, is used to investigate the top gene ontology categories of differentially expressed genes, which can then be conveniently visualized as interactive network graphics. Finally, MicroScope provides user-friendly support for principal component analysis (PCA) via the generation of biplots, screeplots, and summary tables. PCA is supported for both covariance and correlation matrices via R's prcomp() function in the stats package.
Fig. 1 MicroScope user interface. MicroScope UI shown at login, showcasing the Instructions tab and differential expression analysis feature, as well as features such as: sample file download, input file upload, 'Run Statistics' widget, and 'Download Stats Table' widget. Additional UI features are sequentially unlocked as the user progresses through the MicroScope software suite.

Results and discussion
Figure 1 shows the MicroScope user interface (UI) upon login. After a user inputs an RNA-seq/ChIP-seq data file containing read counts per gene per sample, the user is guided through the differential expression analysis (Fig. 2) which, in turn, leads to the heatmap visualization stage of differentially expressed genes at user-specified statistical cutoff parameters (Fig. 3). Heatmaps visualizing statistically significant genes, as determined by the differential expression analysis, can be customized in a variety of ways, through user-friendly methods such as:
• Statistical parameters visualization cutoff widget (p-value and/or FDR)
• log2 data transformation widget
• Multiple heatmap color schemes widget
• Hierarchical clustering widget
• Row/column dendrogram branch coloring widget
• Row/column font size widget
• Heatmap download widget
MicroScope allows the user to magnify any portion of a heatmap by a simple click-and-drag feature to zoom in, and a click-once feature to zoom out. MicroScope is designed with large gene expression heatmaps in mind, where individual gene labels overlap and render the text unreadable. To address this, MicroScope allows the user to repeatedly zoom in to any sector of the heatmap to investigate a region, cluster, or even a single gene. MicroScope also allows the user to hover the mouse pointer over any specific gene to show gene name, expression level, and column ID. It should be noted that specifying the heatmap statistical parameters impacts the contents of the heatmap visualization itself, as stringent cutoffs will naturally result in fewer genes displayed. However, the downstream PCA, gene ontology, and network analyses are not impacted by these heatmap visualization settings. In other words, all downstream analyses are performed on the entire input dataset.
Fig. 2 Differential expression analysis tabulated results. Once the input data is uploaded, a quantile-adjusted conditional maximum likelihood (qCML) procedure and the Benjamini-Hochberg correction are used to supply the user with information about the nominal p-value, false discovery rate, fold change, and counts per million calculations for differentially expressed genes. The edgeR package is used.
Fig. 3 Interactive heatmap visualization. MicroScope heatmap options showcasing the magnification feature as well as features such as: statistical parameter settings, log2 data transformation, multiple heatmap color schemes, hierarchical clustering, row/column dendrogram branch coloring, row/column font size, and heatmap download button.
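The effect of the statistical cutoff and log2 transformation settings on the heatmap contents can be sketched in a few lines of Python. This is an illustration under stated assumptions, not MicroScope's implementation: the function names, the toy statistics table, and the pseudocount of 1 (added so that zero counts can be log-transformed) are all hypothetical.

```python
import math


def select_for_heatmap(stats, counts, fdr_cutoff=0.05):
    """Keep genes whose FDR is at or below the cutoff, in input order."""
    return {gene: counts[gene] for gene in counts if stats[gene]["fdr"] <= fdr_cutoff}


def log2_transform(matrix, pseudocount=1.0):
    """Apply log2(value + pseudocount) to every cell (pseudocount is an assumption)."""
    return {
        gene: [math.log2(value + pseudocount) for value in row]
        for gene, row in matrix.items()
    }


# Hypothetical differential expression results and counts for three genes.
toy_stats = {"geneA": {"fdr": 0.01}, "geneB": {"fdr": 0.20}, "geneC": {"fdr": 0.04}}
toy_counts = {"geneA": [0, 3], "geneB": [7, 7], "geneC": [1, 15]}

lenient = select_for_heatmap(toy_stats, toy_counts, fdr_cutoff=0.05)
strict = select_for_heatmap(toy_stats, toy_counts, fdr_cutoff=0.02)
print(sorted(lenient), sorted(strict))
print(log2_transform(lenient))
```

The stricter cutoff displays fewer genes, while `toy_counts` itself is left unmodified, mirroring the point that downstream analyses still operate on the entire input dataset.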
It should also be noted that prior to visualizing heatmaps in MicroScope, experiment-specific data normalization procedures are left to the discretion of the user [2,22,38,40], depending on whether the user wants to visualize differences in magnitude among genes or see differences among samples.
One of the user-friendly features of MicroScope is that it responds to the user's progress through the analysis. For example, gene ontology analysis buttons are not provided in the UI until the user runs differential expression analysis, which is a prerequisite for a successful gene ontology analysis. In other words, MicroScope automatically unlocks new features only as they become needed when the user progresses through successive stages of the software. Furthermore, MicroScope provides short written guidelines directly in the UI to guide the user to the next steps. As such, complex analytical operations can be performed in a friendly, step-by-step fashion. It should be noted that the differential expression analysis in MicroScope (qCML and Benjamini & Hochberg correction) is designed to be applicable to any ChIP-seq or RNA-seq data supplied by the user.
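For readers unfamiliar with the Benjamini & Hochberg correction mentioned above, the following Python sketch shows the standard step-up adjustment that converts nominal p-values into false discovery rates. It is a textbook formulation written for illustration; MicroScope obtains these values through edgeR, and the function name and example p-values here are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical nominal p-values for four genes.
nominal = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(nominal)
print([round(q, 3) for q in fdr])
```

Each p-value is multiplied by the number of tests divided by its rank, and the running minimum taken from the largest p-value downward ensures that adjusted values never decrease as the nominal p-values decrease.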
Following the successful completion of the differential expression analysis and interactive heatmap visualization, a user is automatically supplied a suite of UI widgets to perform principal component analysis. The user is given the choice to specify the matrix type (i.e., covariance or correlation matrix) in the sidebar panel marked 'Choose PCA Option'. After the PCA is completed, the user is supplied with a biplot and screeplot to visualize the results, as well as tabulated information showing the relative importance of each principal component. Following the successful completion of the PCA (Fig. 4), the user is prompted with more UI widgets to proceed to the gene ontology analysis. Specifying values for these widgets and clicking the Do Gene Ontology Analysis button returns a list of the top gene ontology (GO) categories according to the exact specifications set by the user (Fig. 5). To perform the gene enrichment analysis, MicroScope uses the Wallenius non-central hypergeometric distribution to retrieve p-values for each GO category analyzed. Specifically, the goseq package implements a default option to use the Wallenius distribution to approximate the true null distribution, without any significant loss in accuracy [47]. After a null distribution is established, each GO category is then tested for over- and under-representation amongst the set of differentially expressed genes, and the null is used to calculate a p-value for over- and under-representation.
Fig. 6 Network graphics visualizations of top gene ontology categories. Differentially expressed genes belonging to the respective gene ontology categories are automatically displayed during the network analysis of the data. Top ten gene ontologies (and their respective genes) are shown here. Networks are zoomable and dynamically interactive, allowing the user to manually drag nodes across the screen to explore gene_name-gene_ontology interconnectivity and network architecture.
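The over- and under-representation test for a single GO category can be sketched as follows. Note the substitution: this sketch uses the standard (central) hypergeometric distribution, which is the special case of the Wallenius distribution in which every gene has equal weight. The weighting that goseq applies is not reproduced here, and the function names and toy numbers are hypothetical.

```python
from math import comb


def over_representation_p(n_genes, n_in_category, n_de, n_de_in_category):
    """P(X >= observed) when n_de genes are drawn without replacement."""
    upper = min(n_in_category, n_de)
    tail = sum(
        comb(n_in_category, k) * comb(n_genes - n_in_category, n_de - k)
        for k in range(n_de_in_category, upper + 1)
    )
    return tail / comb(n_genes, n_de)


def under_representation_p(n_genes, n_in_category, n_de, n_de_in_category):
    """P(X <= observed) under the same equal-weight null distribution."""
    tail = sum(
        comb(n_in_category, k) * comb(n_genes - n_in_category, n_de - k)
        for k in range(0, n_de_in_category + 1)
    )
    return tail / comb(n_genes, n_de)


# Hypothetical numbers: 10 genes in total, 4 annotated to the category,
# 3 differentially expressed, 2 of which fall in the category.
p_over = over_representation_p(10, 4, 3, 2)
p_under = under_representation_p(10, 4, 3, 2)
print(p_over, p_under)
```

With these toy numbers, 40 of the 120 possible draws contain at least two category genes, so the over-representation p-value is one third. The two tails overlap at the observed value, which is why they sum to more than one.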
Supported organisms for GO category analysis include: human [5], mouse [6], rat [7], zebrafish [8], worm [9], chimpanzee [10], fly [11], yeast [12], bovine [13], canine [14], mosquito [15], rhesus monkey [16], frog [17], and chicken [18].
The successful completion of this step can be followed up by running a network analysis on the top GO categories, thereby generating network graphics corresponding to the number of top gene ontology categories previously requested by the user (Fig. 6). Nodes represent either gene names or gene ontology identifiers, and links represent direct associations between the two entities. In addition to serving as a visualization tool, this network analysis automatically identifies differentially expressed genes that are present within each top gene ontology, which is a level of detail not readily available by running gene ontology analysis alone. By immediately extracting the respective gene names from each top gene ontology category, MicroScope's network analysis features serve to aid the biologist in identifying the top differentially expressed genes in the top respective gene ontology categories. Figure 7 compares interactive network visualizations of the first-ranked gene ontology and the top two gene ontologies, thereby demonstrating the immediate responsiveness of MicroScope's network graphics to user-specified settings (e.g., number of top gene ontologies to display widget).
Fig. 7 Network analysis visualizations of first ranked gene ontology vs. top-two ranked gene ontologies. Comparison of dynamically interactive network graphics at various user-specified gene ontology settings (e.g., 'Choose How Many Top Gene Ontologies to Display' button in the UI). Clearly, the GO category "membrane-bounded organelle" contains two unique genes, while the rest are (perhaps unsurprisingly) shared in common with the GO category "intracellular membrane-bounded organelle".
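The gene-to-ontology network described above is a bipartite graph, and the comparison of two GO categories reduces to set operations on their gene lists. The Python sketch below illustrates this under stated assumptions: the function names are hypothetical, and the gene identifiers are placeholders chosen only to mirror the pattern of two unique genes and several shared ones, not the genes in the paper's dataset.

```python
def build_edges(go_to_genes):
    """One (GO category, gene) link per association, as in a bipartite graph."""
    return [
        (category, gene)
        for category, genes in go_to_genes.items()
        for gene in sorted(genes)
    ]


def unique_and_shared(go_to_genes, category_a, category_b):
    """Split the genes of two categories into unique and shared sets."""
    genes_a = set(go_to_genes[category_a])
    genes_b = set(go_to_genes[category_b])
    return {
        "only_a": genes_a - genes_b,
        "only_b": genes_b - genes_a,
        "shared": genes_a & genes_b,
    }


# Placeholder gene identifiers (hypothetical).
toy_network = {
    "membrane-bounded organelle": ["gene1", "gene2", "gene3", "gene4", "gene5"],
    "intracellular membrane-bounded organelle": ["gene3", "gene4", "gene5"],
}

edges = build_edges(toy_network)
comparison = unique_and_shared(
    toy_network,
    "membrane-bounded organelle",
    "intracellular membrane-bounded organelle",
)
print(len(edges), sorted(comparison["only_a"]), sorted(comparison["shared"]))
```

An edge list of this kind is the usual input for an interactive network renderer, and a gene that appears in several links is one shared between categories.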

Conclusion
We provide access to a user-friendly web application designed to visualize and analyze dynamically interactive heatmaps within the R programming environment, without any prerequisite programming skills required of the user. Our software tool aims to enrich the genomic data exploration experience by providing a variety of complex visualization and analysis features to investigate gene expression datasets. MicroScope couples a built-in analytics platform to pinpoint statistically significant differentially expressed genes, an interactive heatmap production platform to visualize them, a principal component analysis platform to investigate variation and patterns in gene expression, a gene ontology platform to identify the top gene ontology categories, and a network analysis platform to dynamically visualize gene ontology categories at the gene-specific level. By integrating these features, MicroScope presents a significant advance in heatmap technology over currently available software.