BingleSeq: a user-friendly R package for bulk and single-cell RNA-Seq data analysis

Background RNA sequencing is an indispensable research tool used in a broad range of transcriptome analysis studies. The most common application of RNA Sequencing is differential expression analysis and it is used to determine genetic loci with distinct expression across different conditions. An emerging field called single-cell RNA sequencing is used for transcriptome profiling at the individual cell level. The standard protocols for both of these approaches include the processing of sequencing libraries and result in the generation of count matrices. An obstacle to these analyses and the acquisition of meaningful results is that they require programing expertise. Although some effort has been directed toward the development of user-friendly RNA-Seq analysis analysis tools, few have the flexibility to explore both Bulk and single-cell RNA sequencing. Implementation BingleSeq was developed as an intuitive application that provides a user-friendly solution for the analysis of count matrices produced by both Bulk and Single-cell RNA-Seq experiments. This was achieved by building an interactive dashboard-like user interface which incorporates three state-of-the-art software packages for each type of the aforementioned analyses. Furthermore, BingleSeq includes additional features such as visualization techniques, extensive functional annotation analysis and rank-based consensus for differential gene analysis results. As a result, BingleSeq puts some of the best reviewed and most widely used packages and tools for RNA-Seq analyses at the fingertips of biologists with no programing experience. Availability BingleSeq is as an easy-to-install R package available on GitHub at https://github.com/dbdimitrov/BingleSeq/.

Following Quality Control, data normalization was performed using the "LogNormalize" method with a scale factor of 10,000. Subsequent to normalization, feature selection was performed using the "vst" procedure and the top 2000 most variable genes were selected for downstream analysis with Seurat's method (Fig. 2). Subsequently, the data was scaled, linear dimensional reduction was performed, and the true dimensionality of the dataset was determined using an elbow plot (Fig. 3A). In an analogous manner to the tutorial, the elbow was observed around the 9-10 th PC. Hence, this was the dimensionality used in unsupervised clustering. See Fig. 3B-C for further exploration of the dimensionality of the dataset using PC heatmap.

Figure 3. A)
Elbow plot produced for the ~2700 PBMCs dataset and subsequently used to determine the its true dimensionality. B) 1 st PC Heatmap with the top 10 most variable Genes which is very likely to represent the true dimensionality of the dataset. In contrast, C) is the 15 th PC Heatmap which is unlikely to represent true dimensionality.
Unsupervised clustering was performed with Seurat, monocle, and SC3. Clustering with Seurat was performed with dimensionality and resolution parameters identical to those used in the tutorial and yielded analogous results ( Fig. 4A and 4D). Unsupervised clustering with monocle and SC3 was performed by explicitly setting the number of clusters to 9 which resulted in cells being clustered in a highly analogous way to Seurat's results ( Fig. 4B-C). Furthermore, a similar analysis was conducted using a larger 10x Genomics dataset composed of ~5400 PBMCs with a mean sequencing depth of ~28,000 reads per cell (https://support.10xgenomics.com/single-cell-gene-xpression/datasets/1.1.0/pbmc6k).
The obtained clustering results (Fig. 5) were highly similar to the ones observed in the smaller PBMCs dataset. Thus, further confirming the applicability of BingleSeq's scRNA-Seq pipeline.

Figure 5. A) Seurat B) SC3, C) monocle unsupervised clustering results for the 10x Genomics dataset composed of ~5400 PBMCs. Note that cluster number was explicitly set to 9 for SC3 and monocle.
Subsequent to clustering, DE analysis of all clusters was performed in an analogous manner to the tutorial using the Wilcoxon rank sums test. In turn, this yielded matching results (Fig. 6A). MS4A1 is a B lymphocyte marker gene and was hence chosen to pinpoint the cluster corresponding to B cells (Fig. 6B-D); thus, confirming BingleSeq's applicability to yield meaningful DE results and their subsequent use in identifying cluster identity. Following DE analysis, BingleSeq's 'Functional Annotation' tab was used to gain further insight about the clusters. The functional annotation analysis further confirmed that cluster 3 corresponds to B lymphocytes, as its most significant GO Term as well as 3 of its top 20 Biological processes were specifically associated with B lymphocytes (Fig. 7). Thus, serving as proof for the applicability of BingleSeq's Functional Annotation pipeline in revealing crucial phenotypic insight. Finally, BingleSeq's 'DE Comparison' tab was used to assess the agreement between the different differential gene expression (DE) analysis methods implemented within Seurat's pipeline. The results showed a high-level of overlap and agreement between the different DE methods (Fig 9A), with Wilcoxon and MAST having a particularly high agreement. Furthermore, the interactive Rank-based consensus table can be used to obtain further confidence for the significance of specific features of interest (Fig 8B).