Building and analyzing metacells in single-cell genomics data

The advent of high-throughput single-cell genomics technologies has fundamentally transformed biological sciences. Currently, millions of cells from complex biological tissues can be phenotypically profiled across multiple modalities. The scaling of computational methods to analyze and visualize such data is a constant challenge, and tools need to be regularly updated, if not redesigned, to cope with ever-growing numbers of cells. Over the last few years, metacells have been introduced to reduce the size and complexity of single-cell genomics data while preserving biologically relevant information and improving interpretability. Here, we review recent studies that capitalize on the concept of metacells—and the many variants in nomenclature that have been used. We further outline how and when metacells should (or should not) be used to analyze single-cell genomics data and what should be considered when analyzing such data at the metacell level. To facilitate the exploration of metacells, we provide a comprehensive tutorial on the construction and analysis of metacells from single-cell RNA-seq data (https://github.com/GfellerLab/MetacellAnalysisTutorial) as well as a fully integrated pipeline to rapidly build, visualize and evaluate metacells with different methods (https://github.com/GfellerLab/MetacellAnalysisToolkit).

construction tools at a graining level of 30. Compactness and separation were computed in the diffusion component space.

Datasets and methods used throughout the Review
Dataset for illustrating metacell, sketching and clustering representations (Fig. 2A; Fig. 3F; Fig. 4; Fig. 7C,D; Fig. 8)
PBMC data from 10X Genomics, downloaded from the SeuratData package (Satija Lab, 2020), were used for the majority of illustrations (Fig. 2A-8). The 2D representation of clusters (cell types) was obtained by averaging the tSNE coordinates of the single cells within each cluster (Fig. 2A; Fig. 8). The size of each metacell (respectively cluster) in the tSNE representation is proportional to the number of single cells it contains.
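The averaging step described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the figures; the function name `summarize_groups` and the toy coordinates and labels are hypothetical.

```python
from collections import defaultdict

def summarize_groups(coords, labels):
    """Return {label: (mean_x, mean_y, n_cells)} for 2D coordinates.

    The mean position places a cluster (or metacell) in the 2D plot,
    and n_cells can be used to scale the size of the plotted point.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x, y), label in zip(coords, labels):
        acc = sums[label]
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    return {label: (sx / n, sy / n, n) for label, (sx, sy, n) in sums.items()}

# Hypothetical tSNE coordinates and cluster labels for five cells.
tsne = [(0.0, 0.0), (2.0, 2.0), (10.0, 10.0), (12.0, 8.0), (11.0, 9.0)]
clusters = ["B", "B", "T", "T", "T"]
summary = summarize_groups(tsne, clusters)
```

The same function applies unchanged to metacells by passing metacell membership instead of cluster labels.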

Computing compactness and separation from different latent spaces (Appendix Fig. S2)
To compute compactness and separation for different metacell construction tools from different latent spaces (Appendix Fig. S2), the PBMC dataset from 10X Genomics was used as is, without filtering cells or genes. For MC2 (metacells v.0.8.0), divide_and_conquer_pipeline() was used with target_metacell_size set to 76'100 UMIs (to obtain the requested graining level of 30).
For SuperCell (v.0.1), we used 10 principal components computed from the top 1'000 variable genes, with a graining level of 30. The same PCA embedding and graining level were used for SEACells (v.0.2.0) by requesting 87 metacells, initializing the algorithm with 10 eigenvalues and fitting for 25 iterations with a convergence tolerance of 1e-5. Compactness and separation were computed using the corresponding functions from the SEACells package for different latent spaces (Appendix Fig. S2A,B). The different latent spaces correspond to diffusion component embeddings with different numbers of dimensions (from 8 to 26).
For the correlation between compactness and separation (Appendix Fig. S2C), the values computed on 10 diffusion components were used.
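To make the two metrics concrete, here is a minimal sketch under stated assumptions. It is not the SEACells implementation. Compactness is assumed to be the variance of the embedding coordinates across the cells of a metacell, averaged over dimensions (lower is more compact). Separation is assumed to be the Euclidean distance from a metacell centroid to the nearest other centroid (higher is better separated). All function names and the toy embedding are hypothetical.

```python
import math
from collections import defaultdict

def _group(embedding, labels):
    """Collect the embedding vectors of the cells of each metacell."""
    groups = defaultdict(list)
    for vec, label in zip(embedding, labels):
        groups[label].append(vec)
    return groups

def compactness(embedding, labels):
    """Per metacell: sample variance across its cells, averaged over dimensions."""
    out = {}
    for label, cells in _group(embedding, labels).items():
        n, dims = len(cells), len(cells[0])
        if n < 2:
            out[label] = float("nan")  # variance is undefined for a single cell
            continue
        per_dim = []
        for d in range(dims):
            mean = sum(c[d] for c in cells) / n
            per_dim.append(sum((c[d] - mean) ** 2 for c in cells) / (n - 1))
        out[label] = sum(per_dim) / dims
    return out

def separation(embedding, labels):
    """Per metacell: distance from its centroid to the nearest other centroid.

    Requires at least two metacells.
    """
    centroids = {}
    for label, cells in _group(embedding, labels).items():
        n = len(cells)
        centroids[label] = [sum(c[d] for c in cells) / n for d in range(len(cells[0]))]
    return {
        a: min(math.dist(ca, cb) for b, cb in centroids.items() if b != a)
        for a, ca in centroids.items()
    }

# Hypothetical 2-dimensional embedding (e.g. two diffusion components).
embedding = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
labels = ["m1", "m1", "m2", "m2"]
comp = compactness(embedding, labels)
sep = separation(embedding, labels)
```

Because both metrics are computed in a chosen latent space, their values depend on the number of dimensions retained, which is why the comparison above is repeated for embeddings of different dimensionality.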
Correlation between metacell size and number of detected genes (Fig. 5B)
The correlation between the metacell size and the number of detected genes was computed for the PBMC dataset by constructing metacells using SuperCell with a graining level of 10. The metacell size corresponds to the number of single cells in a metacell; the number of detected genes corresponds to the number of genes with at least 1 UMI count in a metacell profile.
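The two quantities defined above can be computed as in the following minimal sketch. The text does not state which correlation coefficient was used, so the Pearson coefficient here is an assumption. The metacell sizes and count profiles are hypothetical toy values.

```python
import math

def detected_genes(profile):
    """Number of genes with at least 1 UMI count in a metacell profile."""
    return sum(1 for count in profile if count >= 1)

def pearson(xs, ys):
    """Pearson correlation coefficient (an assumed choice of coefficient)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: three metacells profiled over four genes.
sizes = [2, 5, 9]                                       # single cells per metacell
profiles = [[3, 0, 0, 1], [4, 2, 0, 1], [9, 5, 1, 2]]   # summed UMI counts per gene
n_detected = [detected_genes(p) for p in profiles]
r = pearson(sizes, n_detected)
```

A positive correlation is expected because a larger metacell aggregates more UMIs and therefore has more chances to capture lowly expressed genes.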

Illustrating datasets of different complexity and size (Fig. 2D-G)
To illustrate datasets of different complexity (Fig. 2D), three single-cell RNA-seq datasets of the same size were used, consisting of T cells, Cord Blood Mononuclear Cells (CBMCs) and Bone Marrow (BM) cells. We used the filtered and annotated BM dataset (Stuart et al, 2019), derived from GEO: GSE128639, and the CBMC dataset (Stoeckius et al, 2017), derived from GEO: GSE100866. The T cell dataset was obtained by selecting cells from the BM dataset annotated as CD4 and CD8 naive and mature cell types. The three datasets were randomly downsampled to 5'000 cells each, and a standard Seurat pipeline was applied to obtain a UMAP (T cell and BM datasets) or tSNE (CBMC dataset) representation of the data.
For each dataset, we used SuperCell to build metacells at graining levels (gamma) ranging from 1 (single cells) to 200 in steps of 5, from the 2'000 variable genes identified by Seurat for each dataset. We used 10, 20 and 50 principal components for the T cell, CBMC and BM datasets, respectively.
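As a reminder of what the graining level implies, the sketch below converts a graining level into an approximate number of metacells, taking gamma as the ratio of the number of single cells to the number of metacells. The rounding rule and the helper name `n_metacells` are assumptions for illustration; individual tools may round differently.

```python
def n_metacells(n_cells, gamma):
    """Approximate number of metacells for gamma = n_cells / n_metacells."""
    if gamma < 1:
        raise ValueError("graining level must be >= 1")
    # Rounding to the nearest integer is an assumption; never return fewer than 1.
    return max(1, round(n_cells / gamma))

# A dataset downsampled to 5'000 cells, at a few illustrative graining levels.
counts = {gamma: n_metacells(5000, gamma) for gamma in (1, 10, 50, 200)}
```

At gamma = 1 every cell is its own metacell, whereas at gamma = 200 a dataset of 5'000 cells is summarized by only 25 metacells.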
Using the single-cell annotation, we annotated each metacell according to the most abundant cell type within it, allowing us to analyze the number of recovered cell types at increasing graining levels. A cell type is considered recovered if at least one metacell is annotated to it.
Similarly, we analyzed the influence of the input size of the single-cell data on the number of cell types retrieved at increasing graining levels. To do this, we used the BM dataset with all 30'000 annotated cells and random subsamples of 5'000 and 1'000 cells.
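The majority-vote annotation and the definition of a recovered cell type can be sketched as follows. This is a minimal illustration in plain Python; the function names, the metacell membership vector and the cell type labels are hypothetical, and ties are broken arbitrarily (by first occurrence).

```python
from collections import Counter, defaultdict

def annotate_metacells(membership, cell_types):
    """Map each metacell to the most abundant cell type among its cells."""
    counts = defaultdict(Counter)
    for metacell, cell_type in zip(membership, cell_types):
        counts[metacell][cell_type] += 1
    return {mc: c.most_common(1)[0][0] for mc, c in counts.items()}

def recovered_cell_types(membership, cell_types):
    """Cell types to which at least one metacell is annotated."""
    return set(annotate_metacells(membership, cell_types).values())

# Hypothetical example: eight cells assigned to three metacells.
membership = [0, 0, 0, 1, 1, 2, 2, 2]
cell_types = ["CD4 T", "CD4 T", "NK", "B", "B", "CD4 T", "CD4 T", "NK"]
annotation = annotate_metacells(membership, cell_types)
recovered = recovered_cell_types(membership, cell_types)
```

In this toy example the NK cells are never the majority in any metacell, so the NK cell type is not recovered. This is the effect that becomes more frequent for rare cell types as the graining level increases or the input size decreases.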

Benchmarking computational cost of metacell construction tools (Fig. 5C-E)
To evaluate the computational cost of different metacell construction tools, we used the mouse organogenesis atlas (MOCA) data generated by Cao et al (2019). This atlas contains scRNA-seq data from 61 embryos, representing a total of around 2 million cells. We first assessed the computational resources (i.e. CPU time and memory) needed to construct metacells using SuperCell, SEACells and MC2 on multiple datasets containing an increasing number of embryos (from 3 to 50 embryos, i.e. from 10'242 to 900'078 cells) (Fig. 5C). SEACells and SuperCell reach their limits on datasets larger than 75'000 and 450'000 cells, respectively. However, each algorithm offers a different approach to accelerate metacell construction: i) SuperCell can construct metacells for a subset of cells and project the remaining cells onto the constructed metacells, ii) SEACells can use GPUs to accelerate the process and iii) MC2 can construct metacells within randomly defined piles using parallel computing. We used these approaches on the MOCA datasets and show that the computational resources needed decrease significantly. Note that all jobs were run on a machine with 500 GB of memory and a time limit of 20 hours, using 1 CPU, except for MC2, which supports multithreading (10 CPUs were used in that case).
Alternatively, for large datasets containing multiple samples, the metacell construction algorithms can be applied to each sample separately, followed by sample integration at the metacell level for downstream analyses (Fig. 5D). We used this approach on the MOCA datasets and show in Fig. 5E that the computational resources needed to build the metacells and to perform standard downstream analyses are substantially lower than those needed to perform similar analyses at the single-cell level. By standard downstream analyses we mean data normalization, feature selection, data scaling, PCA, clustering, UMAP and differential expression analyses performed with the Seurat framework (Hao et al, 2023).
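For readers who wish to run a similar comparison, the sketch below shows one simple way to record CPU time and peak memory for a single step from within Python. It is an assumption for illustration only, since the measurement tooling used for the benchmark is not described in this section. `measure` and `toy_workload` are hypothetical names, and `tracemalloc` only tracks memory allocated by the Python interpreter, not by native libraries.

```python
import time
import tracemalloc

def measure(func, *args, **kwargs):
    """Run func and return (result, cpu_seconds, peak_python_bytes)."""
    tracemalloc.start()
    start = time.process_time()  # CPU time of the current process
    try:
        result = func(*args, **kwargs)
        cpu_seconds = time.process_time() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, cpu_seconds, peak_bytes

def toy_workload(n):
    """Stand-in for a metacell construction or downstream analysis step."""
    return sum([i * i for i in range(n)])

result, cpu_seconds, peak_bytes = measure(toy_workload, 100_000)
```

For tools that rely heavily on compiled code or run as separate jobs, resource usage reported by the job scheduler or the operating system is likely to be a more faithful measure than in-process tracing.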