Interactive single-cell data analysis using Cellar

Hasanaj, Euxhen; Wang, Jingtao; Sarathi, Arjun; Ding, Jun; Bar-Joseph, Ziv

doi:10.1038/s41467-022-29744-0

Open access
Published: 14 April 2022

Nature Communications volume 13, Article number: 1998 (2022) Cite this article

Abstract

Cell type assignment is a major challenge for all types of high throughput single cell data. In many cases such assignment requires the repeated manual use of external and complementary data sources. To improve the ability to uniformly assign cell types across large consortia, platforms and modalities, we developed Cellar, a software tool that provides interactive support to all the different steps involved in the assignment and dataset comparison process. We discuss the different methods implemented by Cellar, how these can be used with different data types, how to combine complementary data types and how to analyze and visualize spatial data. We demonstrate the advantages of Cellar by using it to annotate several HuBMAP datasets from multi-omics single-cell sequencing and spatial proteomics studies. Cellar is open-source and includes several annotated HuBMAP datasets.

Introduction

A number of large consortia, including the Human BioMolecular Atlas Program (HuBMAP)¹, are focused on profiling tissues, organs, and the entire human body at the single-cell level. These consortia use several different technologies for studying the molecular composition of single cells, including single-cell RNA sequencing, single-cell ATAC sequencing², single-cell spatial transcriptomics³, and single-cell spatial proteomics⁴. In addition to these large consortia, individual labs also generate data using some or all of these modalities.

Over the last few years, a number of methods have been developed for the assignment of cell types in single-cell data⁵,⁶,⁷,⁸,⁹,¹⁰. In most cases, different groups within the same consortium, and even the same group when processing multiple types of single-cell data, rely on different sets of tools. This makes it hard to integrate and compare data from these groups, since researchers often use different assignment techniques, markers, and even cell-type naming conventions.

To enable large-scale collaborations, integration, and comparisons across many different single-cell omics platforms and modalities, we developed Cellar, an interactive and graphical cell-type assignment web server. Cellar implements a comprehensive set of methods, both existing and new, which cover all steps involved in the cell-type assignment process. These include methods for dimensionality reduction and representation, clustering, reference-based alignment, identification of differentially expressed genes, intersection with functional and marker sets, tools for managing sessions and exporting results, as well as a dual mode for analyzing and comparing two datasets simultaneously. As cell-type assignment often requires user input in the form of domain knowledge, Cellar adopts a semi-automatic solution that permits users to intervene and modify each processing step as needed. To enable such interactive analysis, Cellar provides methods for semi-supervised clustering and projection of expression clusters in spatial single-cell images. Figure 1 provides an overview of Cellar’s workflow. Cellar was tested by members of HuBMAP over the last year and used to annotate several single-cell datasets from different organs, platforms, and modalities.

Results

Analysis of scRNA-seq data

We used Cellar to analyze 11 HuBMAP scRNA-seq datasets (10x Genomics) with an average of 7500 cells from five different tissues (kidney, heart, spleen, thymus, lymph node)¹¹, all of which are available in Cellar. Cellar first performs quality control by removing unreliable cells and low-count genes. Additional normalization and scaling are applied based on user criteria. Cellar then clusters a lower-dimensional representation of the data and further reduces the dimension for visualization purposes. We demonstrate this basic pipeline by analyzing a spleen dataset with 5273 cells (Cellar ID: HBMP3-spleen-CC2). We used PCA, followed by UMAP¹² for dimensionality reduction and the Leiden algorithm¹³ for clustering, to obtain a total of 16 clusters (Supplementary Fig. 1a). For each cluster, Cellar identified the top differential genes. Using the top 500 differential genes, functional enrichment analysis (GO, KEGG¹⁴, MSigDB¹⁵) identified cluster 0 as B cells (for example, “B-Cell Activation” (q value = 0) and “B-Cell Receptor Signaling Pathway” (q value = 0) were the top categories for GO and KEGG, respectively). This assignment is further supported by visualizing the concurrent expression of two known B-cell markers, CD79A and TNFRSF13C¹⁶.

In addition to unsupervised clustering, Cellar also implements methods for supervised assignment based on a reference dataset. These can directly utilize the dual mode and other methods implemented in Cellar. For example, this form of assignment can be used in conjunction with Cellar’s semi-supervised clustering option to correct noise during the label transfer process. To illustrate such use, we applied Scanpy’s Ingest function¹⁷, which is available in Cellar, to integrate two expert-annotated spleen datasets (Cellar IDs: HBMP2-spleen-2 and HBMP3-spleen-CC3). We used HBMP3-CC3 as ground truth and transferred labels from it to HBMP2-2. We then compared the results of label transfer with the ground truth annotations for HBMP2-2 and observed an adjusted Rand index (ARI) of 0.39. In contrast, running Leiden clustering on HBMP2-2 leads to a much lower ARI of 0.27. We then refined the results of label transfer by using a semi-supervised adaptation of Leiden in which the least noisy clusters were chosen as constraints and not allowed to change during the iterations of the algorithm. This led to a much better ARI of 0.66, demonstrating the benefits of label transfer and semi-supervised clustering. These results are shown in Supplementary Fig. 2.

Analysis of scATAC-seq data

While scRNA-Seq is currently the most widely used data modality, several other molecular data types are also being profiled at the single-cell level. To illustrate the use of Cellar for such data we used it to annotate scATAC-seq². Cellar can handle scATAC-seq data in two different ways: cell-by-gene and cell-by-cistopic. The former is based on the open chromatin accessibility associated with the nearby region of all genes while the latter relies on cisTopic¹⁰ which uses Latent Dirichlet Allocation¹⁸ to model cis-regulatory topics. The resulting cell-by-gene or cell-by-cistopic matrix is used for downstream analysis such as visualization and clustering. We used Cellar to annotate a scATAC-seq dataset profiling Peripheral Blood Mononuclear Cells¹⁹ (Cellar ID: PBMC 10k Cell-By-Gene) using the cell-by-gene representation. Results are presented in Supplementary Fig. 3. DE analysis for clusters 0 and 4 identified the KLRD1 marker for natural killer (NK) cells²⁰.

Analysis of spatial proteomics data (CODEX)

In addition to sequencing assays, recent imaging assays can also provide information on the expression of genes or proteins at the single-cell level. Cellar can be used to analyze such data by providing a side-by-side view of the expression clusters and spatial organization. To illustrate this, we analyzed CO-Detection by indEXing (CODEX)²¹ spatial proteomics data. We used a lymph node dataset that contains 46,840 cells (Cellar ID: 19-003 lymph node R2). The clustering results are shown in Fig. 2 along with the corresponding tile for these cells with the projected cluster annotations. Given the small number of proteins profiled in this dataset (19), not all clusters could be assigned to unique types, though several have been assigned based on DE gene analysis in Cellar. Cellar matches the cell colors in the clustering and spatial images, making it easier to identify specific organizational principles and their relationship to the profiled cell types. The spatial tile in Fig. 2 shows that B cells cluster tightly together and are surrounded by T cells and other cell types in the lymph node. The B-cell clusters also contain a subset of proliferating cells.

**Fig. 2: CODEX data analysis in Cellar.**

Joint analysis of multiple modalities

Finally, we used Cellar to jointly analyze data from two different modalities. For this, we used a SNARE-seq²² kidney dataset which profiled both the transcriptome and chromatin accessibility of 31,758 cells (Cellar IDs: kidney SNARE ATAC/RNA 20201005). Here we first ran cisTopic on the chromatin modality and determined cluster assignments by running Leiden on the inferred cis-regulatory topics (Fig. 3a). We used these labels to visualize the expression data in Fig. 3b. This can be easily achieved using Cellar’s dual mode, which allows a cell ID-based label transfer from one modality to the other. Cellar identified differential genes, and we used these to map cell types. For example, cluster 1 was assigned based on both known markers (SLC5A12, p value = 0) and GO term analysis (“Apical Plasma Membrane”, p value = 1e-4), which signify the presence of Proximal Tubule Cells²³,²⁴.
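The cell ID-based label transfer described above can be illustrated with a minimal sketch. This is not Cellar’s implementation: the function name `transfer_labels`, the `missing` placeholder, and the example barcodes are hypothetical, and the only idea taken from the text is that labels are copied between modalities by matching shared cell IDs.

```python
def transfer_labels(source_labels, target_cell_ids, missing="unassigned"):
    """Copy cluster labels between modalities by matching shared cell IDs.

    source_labels: dict mapping cell ID -> label (e.g. clusters from ATAC).
    target_cell_ids: cell IDs of the other modality (e.g. RNA), in order.
    Cells absent from the source receive the `missing` placeholder.
    """
    return [source_labels.get(cell_id, missing) for cell_id in target_cell_ids]


# Hypothetical barcodes: labels found on the chromatin modality ...
atac_clusters = {"AAAC-1": "1", "AAAG-1": "3"}
# ... are looked up for the cells of the expression modality.
rna_cells = ["AAAG-1", "TTTT-1", "AAAC-1"]
rna_labels = transfer_labels(atac_clusters, rna_cells)
```

Because SNARE-seq profiles both modalities in the same cell, a lookup by cell ID is sufficient here; no alignment of feature spaces is needed.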

**Fig. 3: SNARE-seq data analysis in Cellar.**

Discussion

To conclude, Cellar is an easy-to-use, interactive, and comprehensive software tool for the assignment of cell types in single-cell studies. Cellar is written in Python using the Dash framework and includes efficient operations and data structures for dealing with large datasets. These include using the Annotated Data object¹⁷ in memory-mapping mode which allows the analysis of large datasets by using little system memory, approximate nearest neighbors based on faiss²⁵ to speed up neighbors graph construction for Leiden clustering, as well as several interactive components for maximum flexibility. Cellar supports several types of molecular sequencing and imaging data and implements several popular methods for visualization, clustering, and analysis. Cellar has already been used to annotate single-cell data from multiple platforms and tissues. These annotated datasets (mostly from HuBMAP) can serve as a reference for transferring labels to other datasets. For tissues not currently supported by our HuBMAP annotated datasets, Cellar provides several external functional enrichment datasets that, combined with the user’s knowledge about specific markers, help in assignment decisions. We hope that Cellar will improve the accuracy and ease of cell-type assignment in single-cell studies. A web server running Cellar can be accessed at https://cellar.cmu.hubmapconsortium.org/app/cellar.

Methods

Complete details on all the methods used to process, analyze, visualize and integrate the data are available in Supporting Methods.

Preprocessing

Preprocessing of the data was done via scanpy¹⁷. For all scRNA-seq data we filtered out cells with fewer than 50 or more than 3000 expressed genes. We also filtered out genes expressed in fewer than 50 or more than 3000 cells. The data matrix was then CPM total count normalized (total count = 1e5) and log1p-transformed. Finally, we scaled the data to zero mean and unit variance.
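The following NumPy sketch mirrors the preprocessing steps listed above on a dense cells-by-genes count matrix. It is an illustration only and not the scanpy code used in the study. The function name `preprocess`, the order of the two filters (cells first, then genes), and the use of the population standard deviation for scaling are assumptions; the thresholds and the target total count of 1e5 are taken from the text.

```python
import numpy as np


def preprocess(counts, min_genes=50, max_genes=3000,
               min_cells=50, max_cells=3000, target_sum=1e5):
    """Filter, normalize, log-transform and scale a cells x genes matrix."""
    counts = np.asarray(counts, dtype=float)

    # Keep cells whose number of expressed genes is in [min_genes, max_genes].
    genes_per_cell = (counts > 0).sum(axis=1)
    cell_mask = (genes_per_cell >= min_genes) & (genes_per_cell <= max_genes)
    counts = counts[cell_mask]

    # Keep genes expressed in [min_cells, max_cells] of the remaining cells.
    cells_per_gene = (counts > 0).sum(axis=0)
    gene_mask = (cells_per_gene >= min_cells) & (cells_per_gene <= max_cells)
    counts = counts[:, gene_mask]

    # Total-count normalization: rescale every cell to sum to target_sum.
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # guard against division by zero
    x = np.log1p(counts / totals * target_sum)

    # Scale each gene to zero mean and unit variance.
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # constant genes stay at zero after centering
    return (x - mean) / std, cell_mask, gene_mask


# Toy usage with small thresholds so that the random data pass the filters.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(1.0, size=(200, 30))
scaled, kept_cells, kept_genes = preprocess(
    toy_counts, min_genes=5, max_genes=30, min_cells=5, max_cells=200)
```

The two boolean masks are returned so that cell and gene annotations can be subset in the same way as the matrix.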

The PBMC scATAC-seq dataset was converted to a gene activity score matrix by summing peaks which intersect the nearby region of all genes as listed in GENCODE v35²⁶. The gene ranges were extended with 5000 base pairs downstream and 1000 base pairs upstream. The resulting cell by gene matrix was then normalized and log1p-transformed as explained above.
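As a rough illustration of the gene activity score, the sketch below sums, for a single cell, the counts of all peaks that overlap a gene’s extended range. The function name `gene_activity`, the tuple layouts, the half-open coordinate convention, and the strand-aware handling of “upstream” and “downstream” are assumptions made for this example; only the extension sizes (1000 bp upstream, 5000 bp downstream) come from the text.

```python
def gene_activity(peaks, genes, upstream=1000, downstream=5000):
    """Sum peak counts over each gene's extended range for one cell.

    peaks: list of (chrom, start, end, count) tuples, half-open intervals.
    genes: dict mapping gene name -> (chrom, start, end, strand).
    """
    scores = {}
    for name, (chrom, start, end, strand) in genes.items():
        # Upstream lies before the start on '+' and after the end on '-'.
        if strand == "+":
            lo, hi = start - upstream, end + downstream
        else:
            lo, hi = start - downstream, end + upstream
        lo = max(lo, 0)
        # A peak contributes if it overlaps the extended range by >= 1 bp.
        scores[name] = sum(
            count for (p_chrom, p_start, p_end, count) in peaks
            if p_chrom == chrom and p_start < hi and p_end > lo
        )
    return scores


# Hypothetical genes and peaks for a single cell.
toy_genes = {"G1": ("chr1", 10000, 12000, "+"),
             "G2": ("chr1", 50000, 52000, "-")}
toy_peaks = [("chr1", 8500, 9100, 2), ("chr1", 16900, 17500, 3),
             ("chr1", 17000, 17200, 5), ("chr1", 44000, 45001, 1),
             ("chr2", 10000, 11000, 7)]
toy_scores = gene_activity(toy_peaks, toy_genes)
```

Repeating this for every cell yields the cell-by-gene matrix that is then normalized and log1p-transformed as described above.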

We did not normalize any of the CODEX data.

Clustering, visualization, and functional analysis

scRNA-seq and gene activity matrices were reduced to a 40-dimensional space via PCA. We used the PCA implementation of the scikit-learn package with a randomized SVD solver. The lymph node CODEX data were reduced to 10 dimensions via UMAP¹² using the Python package umap-learn. The embeddings were then used to construct an approximate neighbors graph using faiss²⁵ with 15 neighbors, and then clustered using the Leiden community detection algorithm for graphs¹³ with a default resolution of 1. For the lymph node CODEX data only, we used a smaller resolution of 0.1 in order to obtain a reasonable number of clusters. All data were then reduced from these embeddings to 2 dimensions using UMAP for visualization purposes.
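The first two stages of this pipeline (a 40-dimensional PCA embedding followed by a 15-nearest-neighbor graph) can be sketched in plain NumPy. This is a simplified stand-in, not the code used in the study: it uses an exact SVD where the paper uses scikit-learn’s randomized solver, and an exact brute-force neighbor search where the paper uses approximate search with faiss. The function names `pca_embed` and `knn_graph` are hypothetical, and the Leiden clustering step is not shown.

```python
import numpy as np


def pca_embed(x, n_components=40):
    """Project the rows of x onto the top principal components (exact SVD)."""
    centered = x - x.mean(axis=0)
    # Rows of vt are principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def knn_graph(embedding, n_neighbors=15):
    """Return an (n_cells, n_neighbors) array of exact neighbor indices."""
    sq_norms = (embedding ** 2).sum(axis=1)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * embedding @ embedding.T
    np.fill_diagonal(dists, np.inf)  # exclude each cell from its own list
    return np.argsort(dists, axis=1)[:, :n_neighbors]


# Toy usage on random data standing in for a scaled expression matrix.
rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(100, 60))
toy_embedding = pca_embed(toy_matrix, n_components=40)
toy_neighbors = knn_graph(toy_embedding, n_neighbors=15)
```

The brute-force search needs memory quadratic in the number of cells, which is one reason an approximate index such as faiss is preferable for large datasets.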

Differential gene expression analysis was performed with diffxpy (https://github.com/theislab/diffxpy) by using a Welch’s t-test. The 500 DE genes with the greatest fold-change values were selected for enrichment analysis via the package gseapy (https://github.com/zqfang/GSEApy) which uses the GSEA method²⁷. Only for the CODEX data, where the number of channels was small (<20), we used all differentially expressed proteins found.
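For reference, Welch’s t-test compares the mean expression of a gene in one cluster against the remaining cells without assuming equal variances. The sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom with the standard library only, and ranks genes by fold change. It is not the diffxpy implementation, and the function names `welch_t` and `top_genes` and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard errors of the two group means (sample variance / n).
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def top_genes(fold_change, k=500):
    """Return the k gene names with the greatest fold-change values."""
    return sorted(fold_change, key=fold_change.get, reverse=True)[:k]


# Toy expression values of one gene inside and outside a cluster.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
ranked = top_genes({"A": 2.0, "B": 5.0, "C": 0.5}, k=2)
```

Each group needs at least two observations with non-zero combined variance; a p value would follow from the t distribution with `dof` degrees of freedom.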

Label transfer and semi-supervised clustering

Label transfer between HBMP2-spleen-2 and HBMP3-spleen-CC3 was performed using scanpy’s Ingest (https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.ingest.html). Ingest projects the query dataset to a latent space fit on reference data using PCA with 40 components. We only considered overlapping genes between the two datasets. Following label transfer, we used semi-supervised Leiden (resolution = 1) to refine the cluster assignments, where clusters 0, 4, 9, and 10 were “frozen” (see Supplementary Fig. 2c for a scatter plot of these clusters). The ARI was computed on ground truth annotations assigned by a human expert. For the unconstrained version of Leiden used in the experiment we also set a default resolution of 1.
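The ARI used above compares two labelings of the same cells while correcting for agreement expected by chance; it equals 1 for identical partitions (up to renaming of labels) and is close to 0 for random ones. A minimal standard-library sketch of the usual contingency-table formula follows. The function name `adjusted_rand_index` is chosen for this example, and the paper does not state which implementation was used.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs of items that share a cluster in both labelings.
    pairs = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in pairs.values())
    # Pairs that share a cluster in each labeling separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both
    return (sum_cells - expected) / (max_index - expected)


# Same partition with renamed labels, and a maximally discordant one.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
discordant = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index depends only on which cells are grouped together, cluster numbers produced by Leiden can be compared directly with expert cell-type names.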

Joint analysis and cisTopic

The SNARE-seq data was formed by combining four separate kidney SNARE-seq datasets. We removed cells for which no annotations were found. The chromatin modality was processed using cisTopic¹⁰ to discover 40 topics. This number was selected via cisTopic’s log-likelihood model selection method. These topics were then treated as a reduced version of the data and used for clustering and visualization in the same way as described earlier for scRNA-seq data.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

All data analyzed in this study are available for download from the application’s web server as well as the HuBMAP portal at https://portal.hubmapconsortium.org with access codes:

Code availability

Code is available from the GitHub repository: https://github.com/euxhenh/cellar/²⁸. Full documentation is available at https://euxhenh.github.io/cellar/.

References

1. Consortium, H. et al. The human body at cellular resolution: the NIH human biomolecular atlas program. Nature 574, 187 (2019).
2. Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J. Transposition of native chromatin for multimodal regulatory analysis and personal epigenomics. Nat. Methods 10, 1213 (2013).
3. Rodriques, S. G. et al. Slide-seq: a scalable technology for measuring genome-wide expression at high spatial resolution. Science 363, 1463–1467 (2019).
4. Schiller, H. B. et al. The human lung cell atlas: a high-resolution reference map of the human lung in health and disease. Am. J. Respir. Cell Mol. Biol. 61, 31–41 (2019).
5. Pliner, H. A., Shendure, J. & Trapnell, C. Supervised classification enables rapid annotation of cell atlases. Nat. Methods 16, 983–986 (2019).
6. Hou, R., Denisenko, E. & Forrest, A. R. scMatch: a single-cell gene expression profile annotation tool using reference datasets. Bioinformatics 35, 4688–4695 (2019).
7. Wang, C. et al. Integrative analyses of single-cell transcriptome and regulome using MAESTRO. Genome Biol. 21, 1–28 (2020).
8. Schep, A. N., Wu, B., Buenrostro, J. D. & Greenleaf, W. J. chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data. Nat. Methods 14, 975–978 (2017).
9. Zhang, A. W. et al. Probabilistic cell-type assignment of single-cell RNA-seq for tumor microenvironment profiling. Nat. Methods 16, 1007–1015 (2019).
10. González-Blas, C. B. et al. cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data. Nat. Methods 16, 397–400 (2019).
11. The Human Body at Cellular Resolution: The NIH human biomolecular atlas program. https://portal.hubmapconsortium.org/.
12. McInnes, L., Healy, J., Saul, N. & Grossberger, L. UMAP: uniform manifold approximation and projection. J. Open Source Softw. 3, 861 (2018).
13. Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 1–12 (2019).
14. Kanehisa, M. & Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).
15. Liberzon, A. et al. The molecular signatures database hallmark gene set collection. Cell Syst. 1, 417–425 (2015).
16. Smulski, C. R. & Eibel, H. BAFF and BAFF-receptor in B cell selection and survival. Front. Immunol. 9, 2285 (2018).
17. Wolf, F. A., Angerer, P. & Theis, F. J. Scanpy: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
18. Blei, D. M., Ng, A. Y. & Jordan, M. I. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003).
19. 10x Genomics. Peripheral Blood Mononuclear Cells (PBMCs) from a healthy donor (v1). Single Cell ATAC Dataset by Cell Ranger ATAC 1.1.0, accessed 25 December 2020. https://www.10xgenomics.com/resources/datasets/10-k-peripheral-blood-mononuclear-cells-pbm-cs-from-a-healthy-donor-1-standard-1-1-0.
20. Bongen, E., Vallania, F., Utz, P. & Khatri, P. KLRD1-expressing natural killer cells predict influenza susceptibility. Genome Med. 10, 45 (2018).
21. Goltsev, Y. et al. Deep profiling of mouse splenic architecture with CODEX multiplexed imaging. Cell 174, 968–981 (2018).
22. Chen, S., Lake, B. B. & Zhang, K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat. Biotechnol. 37, 1452–1457 (2019).
23. Gopal, E. et al. Cloning and functional characterization of human SMCT2 (SLC5A12) and expression pattern of the transporter in kidney. Biochim. Biophys. Acta 1768, 2690–2697 (2007).
24. Molitoris, B. A. & Wagner, M. C. Surface membrane polarity of proximal tubular cells: alterations as a basis for malfunction. Kidney Int. 49, 1592–1597 (1996).
25. Johnson, J., Douze, M. & Jégou, H. Billion-scale similarity search with GPUs. IEEE Trans. Big Data 7, 535–547 (2019).
26. Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 47, D766–D773 (2019).
27. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
28. Hasanaj, E. & Wang, J. Cellar: interactive single-cell data annotation tool. https://github.com/euxhenh/cellar (2022).

Acknowledgements

This work was partially supported by NIH grants OT2OD026682, 1U54AG075931, and 1U24CA268108 to Z.B.J. J.D. was supported by Fonds de recherche du Québec - Santé (FRQS) - Junior 1. The results here are in whole or part based upon data generated by the NIH Human BioMolecular Atlas Program (HuBMAP).

Author information

Authors and Affiliations

Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
Euxhen Hasanaj & Ziv Bar-Joseph
Meakins-Christie Laboratories, Department of Medicine, McGill University Health Centre, Montreal, QC, H4A 3J1, Canada
Jingtao Wang & Jun Ding
Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
Arjun Sarathi & Ziv Bar-Joseph

Contributions

Z.B.J., E.H., and J.D. designed the software. E.H. developed and implemented the back-end, including dimensionality reduction, clustering, cell-type annotation, and data integration methods. E.H., J.W., and A.S. contributed to the implementation of the front-end interactive visualizations and of the enrichment analysis of identified signature genes. All authors contributed to writing the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Jun Ding or Ziv Bar-Joseph.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Nikolay Samusik, Ming Tang, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Hasanaj, E., Wang, J., Sarathi, A. et al. Interactive single-cell data analysis using Cellar. Nat Commun 13, 1998 (2022). https://doi.org/10.1038/s41467-022-29744-0

Received: 08 March 2021
Accepted: 25 March 2022
Published: 14 April 2022
DOI: https://doi.org/10.1038/s41467-022-29744-0
