Skip to main content
Advertisement
  • Loading metrics

Curated single cell multimodal landmark datasets for R/Bioconductor

  • Kelly B. Eckenrode ,

    Roles Conceptualization, Data curation, Investigation, Project administration, Visualization, Writing – original draft, Writing – review & editing

    ‡ KBE, DR, MR share first authorship on this work. RA, CV, LG share second authorship on this work.

    Affiliations Graduate School of Public Health and Health Policy, City University of New York, NY, NY, United States of America, Institute for Implementation Science in Public Health, City University of New York, NY, NY, United States of America

  • Dario Righelli ,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    ‡ KBE, DR, MR share first authorship on this work. RA, CV, LG share second authorship on this work.

    Affiliation Department of Statistical Sciences, University of Padova, Padova, Italy

  • Marcel Ramos ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Writing – original draft

    ‡ KBE, DR, MR share first authorship on this work. RA, CV, LG share second authorship on this work.

    Affiliations Graduate School of Public Health and Health Policy, City University of New York, NY, NY, United States of America, Institute for Implementation Science in Public Health, City University of New York, NY, NY, United States of America, Roswell Park Comprehensive Cancer Center, Buffalo, New York, United States of America

  • Ricard Argelaguet ,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    ‡ KBE, DR, MR share first authorship on this work. RA, CV, LG share second authorship on this work.

    Affiliation European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, United Kingdom

  • Christophe Vanderaa ,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Resources, Software, Writing – original draft, Writing – review & editing

    ‡ KBE, DR, MR share first authorship on this work. RA, CV, LG share second authorship on this work.

    Affiliation de Duve Institute, Université catholique de Louvain, Brussels, Belgium

  • Ludwig Geistlinger ,

    Roles Conceptualization, Data curation, Formal analysis, Methodology, Resources, Software, Validation, Writing – original draft, Writing – review & editing

    ‡ KBE, DR, MR share first authorship on this work. RA, CV, LG share second authorship on this work.

    Affiliation Center for Computational Biomedicine, Harvard Medical School, Boston, Massachusetts, United States of America

  • Aedin C. Culhane,

    Roles Conceptualization, Supervision, Writing – original draft, Writing – review & editing

    Affiliation School of Medicine, University of Limerick, Limerick, Ireland

  • Laurent Gatto,

    Roles Conceptualization, Supervision, Writing – review & editing

    Affiliation de Duve Institute, Université catholique de Louvain, Brussels, Belgium

  • Vincent Carey,

    Roles Supervision, Writing – review & editing

    Affiliation Channing Division of Network Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, Massachusetts, United States of America

  • Martin Morgan,

    Roles Conceptualization, Supervision

    Affiliation Roswell Park Comprehensive Cancer Center, Buffalo, New York, United States of America

  • Davide Risso ,

    Roles Conceptualization, Supervision, Writing – review & editing

    davide.risso@unipd.it (DR); levi.waldron@sph.cuny.edu (LW)

    Affiliation Department of Statistical Sciences, University of Padova, Padova, Italy

  • Levi Waldron

    Roles Conceptualization, Supervision, Writing – review & editing

    davide.risso@unipd.it (DR); levi.waldron@sph.cuny.edu (LW)

    Affiliations Graduate School of Public Health and Health Policy, City University of New York, NY, NY, United States of America, Institute for Implementation Science in Public Health, City University of New York, NY, NY, United States of America

Abstract

Background

The majority of high-throughput single-cell molecular profiling methods quantify RNA expression; however, recent multimodal profiling methods add simultaneous measurement of genomic, proteomic, epigenetic, and/or spatial information on the same cells. The development of new statistical and computational methods in Bioconductor for such data will be facilitated by easy availability of landmark datasets using standard data classes.

Results

We collected, processed, and packaged publicly available landmark datasets from important single-cell multimodal protocols, including CITE-Seq, ECCITE-Seq, SCoPE2, scNMT, 10X Multiome, seqFISH, and G&T. We integrate data modalities via the MultiAssayExperiment Bioconductor class, document and re-distribute datasets as the SingleCellMultiModal package in Bioconductor’s Cloud-based ExperimentHub. The result is single-command actualization of landmark datasets from seven single-cell multimodal data generation technologies, without need for further data processing or wrangling in order to analyze and develop methods within Bioconductor’s ecosystem of hundreds of packages for single-cell and multimodal data.

Conclusions

We provide two examples of integrative analyses that are greatly simplified by SingleCellMultiModal. The package will facilitate development of bioinformatic and statistical methods in Bioconductor to meet the challenges of integrating molecular layers and analyzing phenotypic outputs including cell differentiation, activity, and disease.

Author summary

Experimental data packages that provide landmark datasets have historically played an important role in the development of new statistical methods in Bioconductor by lowering the barrier of access to relevant data, providing a common testing ground for software development and benchmarking, and encouraging interoperability around common data structures. In this manuscript, we review major classes of technologies for collecting multimodal data including genomics, transcriptomics, epigenetics, proteomics, and spatial information at the level of single cells. We present the SingleCellMultiModal R/Bioconductor package that provides single-command access to landmark datasets from seven different technologies, storing datasets using HDF5 and sparse arrays for memory efficiency and integrating data modalities via the MultiAssayExperiment class. We demonstrate two integrative analyses that are greatly simplified by SingleCellMultiModal. The package facilitates development and benchmarking of bioinformatic and statistical methods to integrate molecular layers at the level of single cells with phenotypic outputs including cell differentiation, activity, and disease, within Bioconductor’s ecosystem of hundreds of packages for single-cell and multimodal data.

This is a PLOS Computational Biology Benchmarking paper.

Introduction

Understanding the quantitative relationship between molecules and physiology has motivated the development of quantitative profiling techniques, especially for single-cell sequencing [1]. Single-cell multimodal omics technologies (Nature Method of the Year 2019 [2]) couple single-cell RNA sequencing with other molecular profiles such as DNA sequences, methylation, chromatin accessibility, cell surface proteins, and spatial information, simultaneously in the same cell. Integrative analysis of multiple molecular measurements from the same cell has enabled, for example, discovery of rare cell types by defining subpopulations based on surface markers with CITE-Seq [3] and ECCITE-Seq [4] (Cellular Indexing of Transcriptomes and Epitopes by sequencing, Expanded CRISPR CITE-Seq), of epigenetic regulation and cell differentiation lineage with scNMT-seq [5] (single-cell nucleosome, methylation, and transcriptome sequencing), a high resolution commercial version of single cell chromatin accessibility with 10X Multiomics [6], understanding of spatial patterns of gene expression with seq-FISH [7], and correlation of genotype-phenotype in healthy and disease states with G&T-seq [8] (parallel Genome and Transcriptome sequencing). Other single-cell multimodal datasets take measurements from separate cells due to the technical constraints, like mass-spectrometry based proteomic methods including SCoPE2 [9] (single-cell protein analysis by mass spectrometry).

Capturing and integrating an array of different molecular signals at the single-cell level poses new analytical challenges. Single-cell multimodal experiments generate multidimensional and high volume datasets, requiring distinct informatic and statistical methods to store, process and analyze data. Integrating different molecular layers to provide biologically meaningful insight is an active area of development in R/Bioconductor due to the availability of data containers and analysis toolkits for single-cell analysis. R/Bioconductor is an open development and open source platform for analyzing biomedical and genomic data with dedicated data structures such as the SingleCellExperiment class [10] for single-cell data and the MultiAssayExperiment class [11] for multi-omics data. Both are designed based on the SummarizedExperiment class [12], the central Bioconductor data structure for storing, manipulating, and analyzing high-throughput quantitative omics data. Relative to analysis platforms within and outside of the R programming language (e.g. GATK, Seurat [13], mixOmics [14], MOFA+ [15], CiteFuse package [16], ScanPy for CITE-Seq [17], Conos for SCoPE2 [18]), Bioconductor provides the broadest range of interoperable data structures and packages for statistical analysis and visualization of single-cell multimodal data.

Easy availability of publicly available experimental data using standardized data classes has long played an important role in the development of interoperable software packages for the analysis of data from new technologies, helping to coalesce development efforts around shared datasets and commonly used data classes such as ExpressionSet [19] and then (Ranged)SummarizedExperiment [20] and SingleCellExperiment [10]. We therefore introduce a suite of single-cell multimodal landmark datasets for benchmarking and testing multimodal analysis methods via the Bioconductor ExperimentHub package SingleCellMultiModal (Fig 1A). The scope of this package is to provide efficient access to a selection of curated, pre-integrated, publicly available landmark datasets for methods development and benchmarking within the Bioconductor ecosystem. Also, we included cell type labels using ontology terms for each experimental database called ontomap (see Methods). Cell labels metadata helps users develop a common ground truth. Some such methods and code for analysis workflows are reviewed by Lê Cao et al. [21]. Users can obtain integrative representations of multiple modalities as a MultiAssayExperiment, a common core Bioconductor data structure relied on by dozens of multimodal data analysis packages. Each dataset was quality controlled; either by the original authors during publication, or we implemented a quality control pre-filtered for high quality cells. SingleCellMultiModal uses Bioconductor’s ExperimentHub package and service to host, coordinate, and manage the data from the cloud. We plan to update the package as new datasets and technologies become available and we welcome community contributions. This manuscript serves as a review of essential aspects of these technologies suitable for developers of bioinformatic and statistical software, and as a description of the SingleCellMultiModal data package.

thumbnail
Fig 1. Representation of modalities included in the SingleCellMultiModal package.

(A) a Venn diagram representation of the modalities collected by each different technology, including: RNA (center), surface proteins (top left), spatial information (bottom left), methylation and open chromatin (bottom right), and peptides (top right). (B) The number of features and cells collected for each data modality by each technology.

https://doi.org/10.1371/journal.pcbi.1011324.g001

Results

Summary of landmark datasets in SingleCellMultiModal

To evaluate and design new statistical methods that accompany experimental single-cell multimodal data, it is important to establish landmark datasets. The goal of this section is to provide an overview of the landmark datasets currently in SingleCellMultiModal as well as to introduce the experimental and technological context for each experimental assay (Table 1). For more information concerning the details of the technologies, consult [22]. We briefly describe each landmark experiment including context, major findings from the publication, and challenges in its analysis, then summarize its accompanying dataset in SingleCellMultiModal including number of cells and features (Fig 1B).

thumbnail
Table 1. Single-cell multimodal datasets included SingleCellMultiModal package.

Modalities refer to the molecular feature measured in the experimental assay. Cell/process type provides information on the type of material or development event data was collected. Datatype name column refers to the dataset name in SingleCellMultiModal.

https://doi.org/10.1371/journal.pcbi.1011324.t001

RNA and protein: antibody tagged cell surface markers.

Purpose and goals: Traditionally, protein expression in cell populations are measured using flow cytometry. With the advent of single-cell multimodal methods cell surface proteins are measured with higher resolution with simultaneous measurements of mRNA abundance, which enhances the ability to identify new cell subpopulations in heterogeneous samples. Cellular Indexing of Transcriptomes and Epitopes by sequencing (CITE-Seq) measures protein cell surface markers and gene expression in the same cell. An extension of CITE-Seq is ECCITE-Seq, Expanded CRISPR-compatible CITE-Seq, which allows for the capture of sgRNA from CRISPR mediated screens. Collectively, these technologies provide a high-throughput method for single-cell immunophenotyping and transcriptome analysis.

Technology: CITE-Seq relies on antibodies conjugated to DNA barcodes to infer protein levels, and in tandem count DNA handles from PCR amplification of mRNA transcripts. Inside the droplet contains mRNA transcripts, proteins conjugated with antibody derived tags (ADTs), beads decorated with oligo-dT, reverse transcriptase and primers for cDNA amplification. The use of DNA barcodes is a departure from traditional fluorescence labels, which are limited in number because of the overlaps in spectral detection, excitation and emission frequencies [25].

A variation of CITE-Seq is ECCITE-Seq that can track single-cell CRISPR screens using sgRNA sequencing capture [4]. The CRISPR-Cas9 system is used to generate targeted gene knockout/mutants by using two components: sgRNA (single guide RNA for gene of interest) and Cas9 (endonuclease for cleaving double DNA strand breaks). sgRNA are composed of custom crRNA 17-22nt with a scaffold tracrRNA, which means the sgRNA are composed of two RNA pieces: one is customizable and the other is not. The sgRNA targets the gene of interest and orchestrates the Cas9 enzyme to gene location to insert a variety of mutations or full gene knock-outs. The CRISPR-Cas9 system introduces targeted gene mutations with greater ease at the bench, plus it is easier to scale up to many more experimental samples than previous approaches.

Antibody oligo counts are listed in the ADT and HTO (hashtag oligo) tables and sgRNAs counts in the GDO tables. After cell perturbations via CRISPR screens, cells are collected and prepared with 10X Genomics V(D)J solution which incorporates Single-cell RNA-seq with additional profiling of protein surface markers and sgRNAs (when applicable). The molecular contents, mRNA and DNA-tagged proteins, will hybridize to the decorated beads. The benefit of adding barcoding to cells is that it allows for tracking of doublets (two cells in one droplet).

Landmark data: There are several experimental datasets derived from the original CITE-Seq landmark paper. Among them we selected the cord blood dataset where the cells have been incubated with CITE-seq antibody conjugates and fluorophore-conjugated antibodies. This cord_blood dataset has two different assays. The scADT assay is a matrix indicating the 13 proteins surface abundance for each of the 8617 cells, while the scRNA assay is a matrix of 20400 human genes and 15880 mouse genes where each entry contains the expression abundance in each of the 8617 cells (Table 2).

thumbnail
Table 2. CITE-Seq dataset description, with assay types, molecular modes, number of specimens, number of features and number of cells.

ADTs, antibody derived tags.

https://doi.org/10.1371/journal.pcbi.1011324.t002

The package also includes an ECCITE-Seq dataset aimed at characterizing immune subpopulation cell types after an experimental perturbation. The peripheral_blood dataset is organized in two different conditions: the control (CTRL) and the cutaneous T-cell lymphoma (CTCL). For both conditions the ECCITE-Seq protocol has been performed to produce transcripts (RNA-seq), proteins (ADT) and cell tracking (HTO) abundance. All these modalities are collected as separated assays into the MultiAssayExperiment, where a sparse matrix is used to store the RNA-seq counts. The modalities are collected from the same cells, but not all the cells are entirely profiled by the same modalities. Of the total 36248 cells, 4190 cells from the CTCL and 4292 cells from the CTRL are matched with all modalities (Fig 2). sgRNA data is stored in long format providing access through the metadata data structure of the MultiAssayExperiment. The CITE-Seq dataset is accessible via the SingleCellMultiModal package by using the CITEseq(DataType="cord_blood") function call, while for the ECCITE-Seq data it’s sufficient to change the identifier as follow CITEseq(DataType="peripheral_blood"). Both function calls return a MultiAssayExperiment object with matrices or sparse matrices as assays (Table 3).

thumbnail
Fig 2. Upset plot [26] of the overlap of modalities on the same cells in the control sample of the ECCITE-Seq “peripheral blood” dataset.

8482 cells are assayed in all three modes (ADT, HTO, RNA), 3105 cells are assayed by HTO and ADT only, etc. RNA data are available for ~10248 cells, whereas HTO and ADT data are each individually available for ~13000 cells across both conditions. This plot is produced by the upsetSamples function of the MultiAssayExperiment package, and can be applied directly to all datasets produced by SingleCellMultiModal.

https://doi.org/10.1371/journal.pcbi.1011324.g002

thumbnail
Table 3. ECCITE-Seq dataset description: assay types, molecular modes, number of specimens, number of features and number of cells.

ADTs, antibody derived tags; HTO, Hashtagged oligos; sgRNAs, CRISPR V(D)J’s.

https://doi.org/10.1371/journal.pcbi.1011324.t003

RNA and protein: mass spectrometry-based.

Purpose and goals: CITE-Seq offers valuable information about the expression of surface proteins. However, the acquisition is limited to tens of targets as the identification relies on antibodies. Furthermore, it cannot provide information on intracellular markers. Mass spectrometry (MS)-based single-cell proteomics (SCP) provides a means to overcome these limitations and to perform unbiased single-cell profiling of the soluble proteome. MS-SCP is emerging thanks to recent advances in sample preparation, liquid chromatography (LC) and MS acquisition. The technology is in its infancy and protocols still need to be adapted in order to acquire multiple multimodalities from a single-cell. In this section the multimodality is achieved by subjecting similar samples to MS-SCP and Single-cell RNA-seq.

Technology: The current state-of-the-art protocol for performing MS-SCP is the SCoPE2 protocol [9]. Briefly, single-cells are lysed, proteins are extracted and digested into peptides. The peptides are then labeled using tandem mass tags (TMT) in order to multiplex up to 16 samples per run (Fig 3A). The pooled peptides are then analysed by liquid chromatography-tandem mass spectrometry (LC-MS/MS). LC separates the peptides based on their mass and affinity for the chromatographic column. The peptides are immediately ionized as they come out (Fig 3B) and are sent for two rounds of MS (MS/MS, Fig 3C). The first round isolates the ions based on their mass to charge (m/z) value. The isolated ions are fragmented and sent to the second round of MS that records the m/z and intensity of each fragment. The pattern of intensities over m/z value generated by an ion is called an MS2 spectrum. The MS2 spectra are then computationally matched to a database to identify the original peptide sequence from which they originated. The spectra that were successfully associated to a peptide sequence are called peptide to spectrum matches (PSMs, Fig 3D). Next to that, a specific range of the MS spectrum holds the TMT label information where each label generates a fragment with an expected m/z value. The intensity of each label peak is proportional to the peptide expression in the corresponding single cell and this allows for peptide quantification (Fig 3D). Finally, the quantified PSM data go through a data processing pipeline that aims to reconstruct the protein data that can be used for downstream analyses (Fig 3E).

thumbnail
Fig 3. SCoPE2 workflow.

The workflow consists of 4 main steps. (A) Sample preparation extracts and labels peptides from single-cells. (B) LC separates the peptides based on their mass and affinity for the column. Note that the TMT tag does not influence those properties. Peptides that are eluting are ionised thanks to an electrospray. (C) MS/MS performs an m/z scan of the incoming ions to select the most abundant ones that are then fragmented separately. A second round of MS acquires the spectrum generated by the ion fragments. (D) Each spectrum is then computationally processed to obtain the cell-specific expression values and the peptide identity. (E) The data processing pipeline reconstructs the protein data from the quantified PSMs. Abbreviations: TMT: tandem mass tags; LC: liquid chromatography; MS: mass spectrometry; MS/MS: tandem MS; m/z: mass over charge; PSM: peptide to spectrum match.

https://doi.org/10.1371/journal.pcbi.1011324.g003

The major challenge in MS-SCP is to recover sufficient peptide material for accurate peptide identification and quantification. SCoPE2 solves this issue by optimizing the sample preparation step to limit samples loss, by providing analytical tools to optimize the MS/MS settings, and most importantly by introducing a carrier sample into the pool of multiplexed samples. The carrier is a sample that contains hundreds of cells instead of a single-cell and allows to boost the peptide identification rate by increasing the amount of peptide material delivered to the MS instrument.

Parallel to SCoPE2, other groups have developed a label-free MS-SCP, where each LC-MS/MS run contains unlabelled peptides from a single cell [27]. Although it allows for more accurate quantifications, it suffers from low throughput. The current methodological advances in MS-SCP have extensively been reviewed elsewhere [28].

Landmark data: The SCoPE2 dataset we provide in this work was retrieved from the supplementary information of the landmark paper [9]. This is a milestone dataset as it is the first publication where over a thousand cells are measured by MS-SCP. The research question is to understand whether a homogeneous monocyte population (U-937 cell line) could differentiate upon PMA treatment into a heterogeneous macrophage population, namely whether M1 and M2 macrophage profiles could be retrieved in the absence of differentiation cytokines. Different replicates of monocyte and macrophage samples were prepared and analyzed using either MS-SCP or Single-cell RNA-seq. The MS-SCP data was acquired in 177 batches with on average 9 single-cells per batch. The Single-cell RNA-seq data was acquired in 2 replicates with on average 10,000 single-cells per acquisition using the 10x Genomics Chromium platform. Cell type annotations are only available for the MS-SCP data. Note also that MS-SCP data provides expression information at protein level meaning that the peptide data has already been processed. The processing includes filtering high quality features, filtering high quality cells, log-transformation, normalization, aggregation from peptides to proteins, imputation and batch correction (Fig 3E). More details on the protein data processing can be found in the original paper or in the paper that reproduced that analysis [29]. Count tables were provided for the Single-cell RNA-seq dataset with no additional processing.

The data can be accessed in the SingleCellMultiModal package by calling SCoPE2("macrophage_differentiation") (Table 4). Relevant cell metadata is provided within the MultiAssayExperiment object. The MS-SCP dataset contains expression values for 3,042 proteins in 1,490 cells. The Single-cell RNA-seq contains expression values for 32,738 genes (out of which 10,149 are zero) for 20,274 cells.

thumbnail
Table 4. SCoPE2 dataset descriptions, with assay types, molecular modes, specimens, dataset version provided, number of features and number of cells.

https://doi.org/10.1371/journal.pcbi.1011324.t004

Single-cell nucleosome, methylation and transcription sequencing (scNMT-seq).

Purpose and goals: The profiling of the epigenome at single-cell resolution has received increasing interest, as it provides valuable insights into the regulatory landscape of the genome [30,31]. Although the term epigenome comprises multiple molecular layers, the profiling of chromatin accessibility and DNA methylation have received the most attention to date.

Technology: DNA methylation is generally measured using single-cell bisulfite sequencing (scBS-seq) [32]. The underlying principle of scBS-seq is the treatment of the DNA with sodium bisulfite before DNA sequencing, which converts unmethylated cytosine (C) residues to uracil (and after retro-PCR amplification, to thymine (T)), leaving 5-methylcytosine residues intact. The resulting C→T transitions can then be detected by DNA sequencing. Further methodological innovations enabled DNA methylation and RNA expression to be profiled from the same cell, demonstrated by the scM&T-seq assay [33].

Chromatin accessibility was traditionally profiled in bulk samples using DNase sequencing (DNase-seq) [34]. However, in recent years, transposase-accessible chromatin followed by sequencing (ATAC-seq) has displaced DNase-seq as the de facto method for profiling chromatin accessibility due to its fast and sensitive protocol, most notably in single-cell genomics [35]. Briefly, in ATAC-seq, cells are incubated with a hyperactive mutant Tn5 transposase, an enzyme that inserts artificial sequencing adapters into nucleosome-free regions. Subsequently, the adaptors are purified, PCR-amplified and sequenced. Notably, single-cell ATAC-seq has also been combined with Single-cell RNA-seq to simultaneously survey RNA expression and chromatin accessibility from the same cell, as demonstrated by SNARE-seq [36], SHARE-seq [37] and the recently commercialized Multiome Kit from 10x Genomics [6]. Finally, some assays have been devised to capture at least three molecular layers from the same cell, albeit at a lower throughput than SNARE-seq or SHARE-seq. An example is scNMT-seq (single-cell nucleosome methylation and transcriptome sequencing) [5]. scNMT captures a snapshot of RNA expression, DNA methylation and chromatin accessibility in single-cells by combining two previous multi-modal protocols: scM&T-seq [33] and Nucleosome Occupancy and Methylation sequencing (NOMe-seq) [38]

In the first step (the NOMe-seq step), cells are sorted into individual wells and incubated with a GpC methyltransferase. This enzyme labels accessible (or nucleosome depleted) GpC sites via DNA methylation. In mammalian genomes, cytosine residues in GpC dinucleotides are methylated at a very low rate. Hence, after the GpC methyltransferase treatment, GpC methylation marks can be interpreted as direct readouts for chromatin accessibility, as opposed to the CpG methylation readouts, which can be interpreted as endogenous DNA methylation. In a second step (the scM&T-seq step), the DNA molecules are separated from the mRNA using oligo-dT probes pre-annealed to magnetic beads. Subsequently, the DNA fraction undergoes scBS, whereas the RNA fraction undergoes Single-cell RNA-seq.

Landmark data: The scNMT landmark paper reported simultaneous measurements of chromatin accessibility, DNA methylation, and RNA expression at single-cell resolution during early embryonic development, spanning exit from pluripotency to primary germ layer specification [23]. This dataset represents the first multi-omics roadmap of mouse gastrulation at single-cell resolution. Using multi-omic integration methods, the authors detected genomic associations between distal regulatory regions and transcription activity, revealing novel insights into the role of the epigenome in regulating this key developmental process.

One of the challenges of this dataset is the complex missing value structure. Whereas RNA expression is profiled for most cells (N = 2480), DNA methylation and chromatin accessibility is only profiled for subsets of cells (N = 986 and N = 1105, respectively). This poses important challenges to some of the conventional statistical methods that do not handle missing information.

The output of the epigenetic layers from scNMT-seq is a binary methylation state for each observed CpG (endogenous DNA methylation) and GpC (a proxy for chromatin accessibility). However, instead of working at the single nucleotide level, epigenetic measurements are typically quantified over genomic features (i.e. promoters, enhancers, etc.). This is done assuming a binomial model for each cell and feature, where the number of successes is the number of methylated CpGs (or GpCs) and the number of trials is the total number of CpGs (or GpCs) that are observed. Here we provide DNA methylation and chromatin accessibility estimates quantified over CpG islands, gene promoters, gene bodies and DNAse hypersensitive sites (defined in Embryonic Stem Cells).

The pre-integrated scNMT dataset is accessed from the SingleCellMultiModal package by calling e.g. scNMT("mouse_gastrulation", version = "1.0.0") (Table 5). Relevant cell metadata is provided within the MultiAssayExperiment object. The overall dataset is 277MB.

thumbnail
Table 5. scNMT-seq dataset description, with of assay types, molecular modes, number of specimens, number of features and number of cells.

https://doi.org/10.1371/journal.pcbi.1011324.t005

Chromium Single-cell Multiome ATAC and gene expression.

Purpose and goals: A new commercial platform introduced in late 2020 by 10X Genomics, the Chromium Single Cell Multiome ATAC and gene expression (10x Multiome), provides simultaneous gene expression and open chromatin measurements from the same cell at high throughput. This technology is well suited to identify gene regulatory networks by linking open chromatin regions with changes in gene expression, a task which is harder to perform when the two modalities are derived from separate groups of cells. However, very few datasets have been published to date using the 10x Multiome technology, and so how much information can be obtained by simultaneously profiling both modalities in the same cell remains an open question.

Technology: First, cells are purified and single nuclei are isolated, chromosomes are transpositioned. Next, ATAC and mRNA sequencing libraries are prepared with 10X Genomics Chromium microfluidic controller device where nuclei are partitioned and embedded in a droplet with a decorated gel bead with DNA 16nt 10X barcode that allows for pairing ATAC and mRNA signals to the same nuclei. mRNA is tagged with an 12nt Unique Molecular Identifier sequence (UMI), and a poly(dT)VN for poly-adenylated 3’ends. ATAC fragments are tagged with a Illumina primer sequence and an 8nt space sequence. All barcoded products are amplified in two rounds of PCR and then processed for sequencing. According to the Chromium Single-Cell Multiome ATAC and gene expression assay product information, it has a flexible throughput of 500–10,000 nuclei per channel and up to 80,000 per run with a 65% recovery rate and low multiplet rate of <1% per 1000 cells (10Xgenomics.com).

Landmark data: 10X genomics has released a dataset of ~10k peripheral blood mononuclear cells (PBMCs) from a human healthy donor. Here we provide the RNA expression matrix and the binary matrix of ATAC fragments for each cell, quantified over a set of pre-computed peaks (Table 6). To access data in the SingleCellMultiModal package, call the scMultiome("pbmc_10x") command. Relevant cell metadata is provided within the MultiAssayExperiment object. The overall dataset is 1.1 GB.

thumbnail
Table 6. 10X Multiome dataset descriptions, with assay types, molecular modes, number of features and number of cells.

https://doi.org/10.1371/journal.pcbi.1011324.t006

RNA and spatial sequencing assays.

Purpose and goals: The power of microscopy to resolve spatial information has been paired with single-cell sequencing to measure transcriptomic activity. These microscopy-based sequencing technologies capture a cell population’s heterogeneous gene expression typically lost in bulk assays. Technologies like seqFISH(+) (sequential Fluorescence In Situ Hybridization), fluorescence in situ hybridization sequencing [7], Multiplexed error-robust fluorescence in situ hybridization (MERFISH) [39], Slide-seq [40,41] combine sequential barcoding with in situ molecular fluorescence probing, allowing the identification from tens to thousands of mRNAs transcripts while preserving spatial coordinates at micrometer resolution. We refer to this family of technologies as molecular-based spatial transcriptomics. Another family of spatial omics technologies can be described as spot-based; it includes the 10x Visium Spatial Gene Expression and Slide-seq [40]. In this family, the spatial coordinates are typically associated with barcoded spot-like identities, where the transcripts are amplified and sequenced. Currently, our package does not include any spot-based spatial transcriptomics dataset. The TENxVisiumData package [42] (available at https://github.com/HelenaLC/TENxVisiumData) contains several such datasets. See [43] for a comprehensive review of spatial transcriptomics technologies.

Technology: The seqFISH technology makes use of temporal barcodes to be read in multiple rounds of hybridization where mRNAs are labeled with fluorescent probes. During the hybridization rounds, the fluorescent probes are hybridized with the transcripts to be imaged with microscopy. Then they are stripped to be re-used and coupled with different fluorophores, during further rounds. In this case, the transcript abundance is given by the number of colocalizing spots per each transcript. The main differences between the technologies are due to the barcoding of RNAs. In seqFISH they are detected as a color sequence while in MERFISH the barcodes are identified as binary strings allowing error handling but requiring longer transcripts and more rounds of hybridizations [44].

Landmark data: The provided seqFISH dataset is designed on a mouse visual cortex tissue and can be retrieved in two different versions. Both versions include Single-cell RNA-seq and seqFISH data. Single-cell RNA-seq data in version 1.0.0 are part of the original paper [24] of 24057 genes in 1809 cells, while version 2.0.0 is a pre-processed adaptation of version 1.0.0 [22] where the authors analyzed it in order to provide the 113 genes in common with seqFISH data in 1723 cells. The provided seqFISH data are the same for both versions as part of their original paper [45,46] made of 1597 cells and 113 genes. The dataset is accessible via the SingleCellMultiModal Bioconductor package by using the seqFISH(DataType="mouse_visual_cortex", version = "1.0.0") function call, which returns a MultiAssayExperiment object with a SpatialExperiment object for the seqFISH data and a SingleCellExperiment object for the Single-cell RNA-seq data (Table 7).

thumbnail
Table 7. seqFISH dataset descriptions, with assay types, molecular modes, specimens, dataset version provided, number of features and number of cells.

https://doi.org/10.1371/journal.pcbi.1011324.t007

RNA and DNA sequencing assays.

Purpose and goals: Parallel genome and transcriptome sequencing (G&T-seq) of single-cells [8] opens new avenues for measuring transcriptional responses to genetic and genomic variation resulting from different allele frequencies, genetic mosaicism [47], single nucleotide variants (SNVs), DNA copy-number variants (CNVs), and structural variants (SVs). Although current experimental protocols are low-throughput with respect to the number of cells, simultaneous DNA and RNA sequencing of single-cells resolves the problem of how to associate cells across each modality from independently sampled single-cell measurements [48].

Technology: Following cell isolation and lysis, G&T-seq measures DNA and RNA levels of the same cell by physically separating polyadenylated RNA from genomic DNA using a biotinylated oligo-dT primer [49]. This is followed by separate whole-genome and whole-transcriptome amplification. Whole-genome amplification is carried out via multiple displacement amplification (MDA) or displacement pre-amplification and PCR (DA-PCR) for DNA sequencing, providing targeted sequencing reads or genome-wide copy number estimates. Parallel Smart-seq2 whole-transcriptome amplification is used for Illumina or PacBio cDNA sequencing, providing gene expression levels based on standard computational RNA-seq quantification pipelines. While pioneering technologies such as G&T-seq [8] and DR-seq [50] sequence both the DNA and RNA from single-cells, they currently measure only few cells (50–200 cells [51]) compared to assays that sequence DNA or RNA alone (1,000–10,000 cells [51]) such as Direct Library Preparation [52] or 10x Genomics Single-cell RNA-seq [53].

Landmark data: G&T-seq has been applied by Macaulay et al. [8] for parallel analysis of genomes and transcriptomes of (i) 130 individual cells from breast cancer line HCC38 and B lymphoblastoid line HCC38-BL, and (ii) 112 single cells from a mouse embryo at the eight-cell stage. Publicly available and included in the SingleCellMultiModal package is the mouse embryo dataset, assaying blastomeres of seven eight-cell cleavage-stage mouse embryos, five of which were treated with reversine at the four-cell stage of in vitro culture to induce chromosome mis-segregation. The dataset is stored as a MultiAssayExperiment [11] consisting of (i) a SingleCellExperiment [10] storing the single-cell RNA-seq read counts, and (ii) a RaggedExperiment [54] storing integer copy numbers as previously described [55] (Table 8). Although assaying only a relatively small number of cells, the dataset can serve as a prototype for benchmarking single-cell eQTL integration of DNA copy number and gene expression levels, given that Macaulay et al. [8] reported copy gains or losses with concomitant increases and decreases in gene expression levels.

thumbnail
Table 8. G&T-seq dataset description, with assay types, molecular modes, number of specimens, number of features and number of cells.

https://doi.org/10.1371/journal.pcbi.1011324.t008

Integrative analysis across modalities using data from SingleCellMultiModal.

Existing methods of integrative analysis of single-cell multimodal data have been recently reviewed [21]. Very briefly, some of the most popular current implementations are 1) the Seurat V4 R package which aims at vertical integration across several modal data types [56], 2) mixOmics [14] provides an extensive framework for data integration at molecular (P-integration, MINT [57]) and sample levels (N-integration, DIABLO [58]), and 3) Multi-Omics Factor Analysis, MOFA+ [15], a generalisation of Principal Components Analysis for inferring low-dimensional representation of multimodal data. Datasets provided by SingleCellMultiModal can be readily reshaped as input to any of these packages. We provide novel examples of such integrative analysis for exploratory visualization using SingleCellMultiModal datasets, produced within package documentation: MOFA+ [15] on the 10X Multiome dataset (Fig 4). For more information, see Data Integration Methods. In addition, we provide a sample analysis on the SCoPE2 dataset, which can be found in SingleCellMultimodal’s package vignette.

thumbnail
Fig 4. Summary of example integration using the 10X Genomic Multiome data from the SingleCellMultiModal package.

A): (left) from the 10X Genomic Multiome data resource, a sparse matrix and FASTA datasets provided, (right) pre-processing steps of source data required to work with of Multiome dataset provided in SingleCellMultiModal B): MultiAssayExperiment (MAE) object returned when called C): Bar graph of the alternative data import sizes D), E) data integration of 10x Genomic Multiome dataset, combining the chromatin accessibility data with the transcriptome data using MOFA+. D): RNA-seq and ATAC-seq matrices used for weight factored analysis, E): UMAP cluster of cell types based on factor analysis. For more detail on the analysis, see the Methods and the SingleCellMultiModal package vignette. Other datasets can be represented similarly: raw data processing and integration of data modes occur upstream of the SingleCellMultiModal package, users invoke a single command that creates a MultiAssayExperiment integrating appropriate memory-efficient objects, which are applied directly to downstream R/Bioconductor analyses.

https://doi.org/10.1371/journal.pcbi.1011324.g004

Methods

SingleCellMultiModal data package

All datasets are distributed through the SingleCellMultiModal experimental data package in Bioconductor. This package employs ExperimentHub [59] for robust Cloud-based data download from AWS S3 buckets, with automatic local caching to avoid repetitive downloads. These methods are described in detail elsewhere for application to The Cancer Genome Atlas and cBioPortal [60]. Briefly, metadata and individual omics datasets are stored in ExperimentHub as simple core Bioconductor objects such as matrix, SparseMatrix, SingleCellExperiment, and RaggedExperiment. A simple user-facing convenience function is provided for each dataset that retrieves all necessary individual components, assembles a MultiAssayExperiment object [11], and returns this to the user. For very large matrices we employ HDF5 and MTX on-disk representation. Methods for users to access these datasets are documented in the SingleCellMultiModal package vignette and functions manual.

All datasets provide a cell-type annotation either provided by the authors of the original publications or based on external knowledge, except for ECCITEseq and G&T-seq for which none was available (Table 9). In particular, the SCoPE2, seqFISH, and scNMT datasets include the cell type labels provided by the authors of the original publication, while scMultiome includes a cell type label inferred by manually inspecting cluster markers. While the CITEseq dataset does not formally have cell type labels, the ADT data for cell surface markers typically used in immunology allow for manual gating of cell types (see e.g., S3 Fig of [3]). We used this additional information to define cell type labels that are independent of scRNA-seq data. We note that this dataset has been previously used for method benchmarking by looking for maximal correspondence between scRNA-seq derived cell type labels and ADT surface markers, which can be considered as gold standard (see e.g. [61]). A dedicated function (getCellGroups) has been added to the SingleCellMultiModal package to help the user to add their own cell annotations based on manual gating. An overview of the cell type annotation for all the dataset can be easily retrieved through the ontomap function, which also standardizes cell names from the Cell Ontology [62] (first preference) or NCI Thesaurus [63] to facilitate cross-dataset comparisons.

thumbnail
Table 9. Dataset information on cell filtering and annotations.

Specifies if the quality controls on the cells of each dataset have been already performed or if it is present in the retrieved object. Column Version QC indicates the version of the dataset where QC is available. Cell Annotation column indicates which datasets have a ground-truth available.

https://doi.org/10.1371/journal.pcbi.1011324.t009

Furthermore, each dataset includes information on the quality of the cells: for some datasets, the package provides a pre-filtered object, in which low-quality cells were already excluded by the original authors; for others, we provide an option to retrieve either the filtered dataset, which include only cells that pass a quality control step, or the full dataset along with an indication, for each cell, of whether it passes the QC filters, defined following state-of-the-art best practices (Table 9).

CITE-Seq and ECCITE-Seq dataset

The CITE-Seq contains two modalities of cord blood mononuclear cells, the transcripts (scRNA) and the cell surface proteins (scADT) measured and preprocessed as described in the CITE-Seq landmark paper [3]. The PBMC UMI counts together with the Centered Log Ratio (CLR) transformed counts for the scADT were retrieved from the GEO repository with accession number GSE100866 and then loaded in R to be transformed in matrix format and then be loaded as separate assays of a MultiAssayExperiment object. This latter object can be retrieved from the SingleCellMultiModal package with the function call: CITEseq(DataType = "cord_blood", dry.run = FALSE).

Because this dataset provides a mixture of human and mouse cells, we applied k-means clustering combined with hierarchical clustering on the log transformed scRNA counts to identify then classify human cells, approximately reconstructing the population of cells that the landmark paper reported through manual gating on the ADT assay. Despite a few unidentified cells, most of the subgroups of cells are identified, i.e., Natural Killers, Precursors, CD4 T-cells, CD8 T-cells, B-cells, Monocytes CD14+ and Monocytes CD16+. The code for cell filtering and cell manual gating has been released inside the script folder of the SingleCellMultiModal package. Additionally, two functions (getCellGroups and addCTLabels) for the manual gating are now exported by the package to help users identify their own cell populations.

The ECCITE-Seq has three modalities of the of peripheral blood mononuclear cells, (scRNA), the cell surface proteins (scADT) and the Hashtagged Oligo (scHTO) measured and preprocessed as described in the ECCITE-Seq landmark paper [4].

The PBMC modalities for the cutaneous T-cell lymphoma (CTCL) and controls (CTRL) were retrieved in TXT format from the GEO repository with accession number GSE126310 and then loaded in R to be transformed in matrix and data.frame format and then be loaded as separate assays of a MultiAssayExperiment object. The CRISPR perturbed scRNAs data are stored as data.frame in the object metadata to keep their original long format. This latter object can be retrieved from the SingleCellMultiModal with the function call:

CITEseq(DataType = "peripheral_blood", dry.run = FALSE).

Visual cortex seqFISH dataset

The seqFISH dataset has two different modalities, the spatial transcriptomics (seqFISH) and the single-cell RNA-seq, in two different versions. The main difference between the two versions are in the Single-cell RNA-seq counts data which in version 1.0.0 are provided as downloaded in CSV format from the GEO repository with accession number GSE71585, while the version 2.0.0 is a processed dataset [46] where only the genes with correspondence in the seqFISH dataset have been preserved. Methods of pre-processing are described at https://github.com/BIRSBiointegration/Hackathon. In both versions the seqFISH dataset is the processed version [46] as downloaded from https://cloudstor.aarnet.edu.au/plus/s/ZuBIXuzuvc9JMj3. Processed version of the seqFISH data were downloaded as TXT format for the coordinates (fcortex.coordinates.txt), as TSV format for the cell annotated labels (seqfish_labels.tsv) and TXT format for the counts (seqfish_cortex_b2_testing.txt). The data constitute a SpatialExperiment object with the counts as assay, the cell labels as colData and the coordinates stored as spatialData.In the same way, the processed Single-cell RNA-seq data were downloaded as TXT format for the counts (tasic_training_b2.txt), as TSV format for the cell annotated labels (tasic_labels.tsv) to build a SingleCellExperiment object with the counts as assay and the cell labels as colData.

Finally, the SingleCellExperiment and the SpatialExperiment have been loaded into a MultiAssayExperiment object as two different assays. The MultiAssayExperiment object can be retrieved with the function call, for example:

seqFISH(DataType = "mouse_visual_cortex",dry.run = FALSE, version = "2.0.0")

Mouse gastrulation scNMT dataset

Preprocessing methods are described in full by Argelaguet et al. [23]. Briefly, RNA-seq libraries were aligned to the GRCm38 mouse genome build using HiSat235 (v.2.1.0). Gene expression counts were quantified from the mapped reads using featureCounts [66] with the Ensembl 87 gene annotation [67]. The read counts were log-transformed and size-factor adjusted using scran normalization [68]. Bisulfite-seq libraries were aligned to the bisulfite converted GRCm38 mouse genome using Bismark [69]. Endogenous CpG methylation was quantified over ACG and TCG trinucleotides and GpC chromatin accessibility over GCA, GCC and GCT trinucleotides. Note that for GCG trinucleotides it is not possible to distinguish endogenous CpG methylation from induced GpC methylation. In addition, CGC positions were discarded because of off-target effects of the GpC methyltransferase enzyme [70].

For each CpG site in each cell we obtained binary methylation calls and for each GpC site in each cell we obtained binary accessibility calls. Notice that binary readouts is an exclusive property of single-cell bisulfite sequencing data, as for the vast majority of sites only one allele is observed per cell. This contrasts with bulk bisulfite sequencing data, where each dinucleotide typically contains multiple reads originating from different cells.

Finally, we quantified DNA methylation and chromatin accessibility over genomic features by assuming a binomial model is assumed for each cell and feature, where the number of successes is the number of methylated CpGs (or GpCs) and the number of trials is the total number of CpGs (or GpCs) that are observed within the specific cell and genomic feature. Here, We quantified DNA methylation and chromatin accessibility rates over CpG islands, gene promoters, gene bodies and DNAse hypersensitive sites. All these data modalities were compiled together with the RNA expression into a MultiAssayExperiment object. The dataset can be loaded from within the SingleCellMultiModal package by the function call scNMT("mouse_gastrulation", dry.run = FALSE, version = "2.0.0"). Code with the data processing pipeline is available in https://github.com/rargelaguet/scnmt_gastrulation.

10X multiome dataset

PBMCs were extracted from a healthy donor after removing granulocytes through cell sorting. The dataset was downloaded as a CellRanger ARC output from https://support.10xgenomics.com/single-cell-multiome-atac-gex/datasets/1.0.0/pbmc_granulocyte_sorted_10k, which includes the gene expression matrix and the chromatin accessibility matrix quantified over ATAC peaks. The dataset included 11,909 cells with a median of 13,486 high-quality ATAC fragments per cell and a median of 1,826 genes expressed per cell. Data processing details, including the peak calling algorithm, can be found in https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/pipelines/latest/what-is-cell-ranger-arc. The dataset is provided as a MultiAssayExperiment [11] consisting of two SingleCellExperiment [10], one containing the single-cell RNA-seq read counts, and the other containing the binary ATAC peak matrix. The dataset can be loaded from within the SingleCellMultiModal package by the function call scMultiome("pbmc_10x", dry.run = FALSE).

Macrophage differentiation SCoPE2 dataset

The macrophage differentiation project contains two datasets: single-cell RNA-seq data and MS-SCP data. Upstream processing is described in detail in the SCoPE2 landmark paper [9]. Briefly, for the Single-cell RNA-seq dataset, the authors used CellRanger to align the reads and to build the UMI count matrices. Based on cell QC and manual inspection, they discarded cells containing less than 104 UMI barcodes. The resulting tables for two technical replicates were deposited in a GEO repository with accession GSE142392. For the MS-SCP dataset, the authors followed the workflow described in Fig 3, with identification and quantification steps performed using the MaxQuant software and additional protein quantification using a custom R script available on GitHub (https://github.com/SlavovLab/SCoPE2).

We retrieved the single-cell RNA-seq dataset from the GSE142392 repository. The MS-SCP data and annotations were retrieved from CSV files available at the authors’ website (https://scope2.slavovlab.net/docs/data). We formatted the Single-cell RNA-seq and the MS-SCP data as two separate SingleCellExperiment objects without further processing. Because the Single-cell RNA-seq data is relatively large, it is stored as a sparse matrix using the HDF5 data format. We combined the two data objects in a single MultiAssayExperiment object. This latter object can be queried from the SingleCellMultimodal package with the function call SCoPE2("macrophage_differentiation", dry.run = FALSE).

G&T-seq dataset

Raw sequencing data was obtained from the European Nucleotide Archive (ENA [71], accession PRJEB9051). The data was downloaded in fastq files for whole-genome and whole-transcriptome paired-end sequencing data for 112 mouse embryo cells. The data was processed as described in the step-by-step protocol of Macaulay et al. [49]. Preprocessing and mapping of genome sequencing data was carried out following steps 78–84 of the protocol of Macaulay et al. [49], using Rsubread [72] for read trimming, alignment to the mm10 mouse reference genome, and removal of PCR-duplicate reads. DNA copy-number profiling was carried out following steps 85–87, using bedtools [73] to convert BAM to BED files, and subsequently applying Ginkgo [74] for copy number determination. Preprocessing and mapping of transcriptome sequencing data was carried out following steps 94–96, using Rsubread [72] for read trimming and alignment to the mm10 mouse reference genome. Read counts for each gene were obtained using the featureCounts [66] function of the Rsubread package. The dataset is provided as a MultiAssayExperiment [11] consisting of (i) a SingleCellExperiment [10] storing the single-cell RNA-seq read counts, and (ii) a RaggedExperiment [54] storing integer copy numbers as previously described [55]. The dataset can be loaded from within the SingleCellMultiModal package by the function call GTseq(dry.run = FALSE).

Data integration of the 10x multiome data set

For the integration of the 10x Multiome dataset we used MOFA+ [15] to obtain a latent embedding with contribution from both data modalities. The RNA expression was normalised using scran [68], followed by feature selection of the top 2000 most variable genes. The chromatin accessibility was normalised using TFIDF, followed by feature selection of the top 10,000 peaks with the highest mean accessibility. The MOFA model was trained with K = 15 factors using default options. To obtain a non-linear embedding we applied UMAP [75] on the MOFA factors.

Discussion

Experimental data packages providing landmark datasets have historically played an important role in the development of new statistical methods in Bioconductor, from the classic acute lymphocytic leukemia (ALL) microarray dataset [76] to the HSMMSingleCell single-cell RNA-seq dataset [77], as well as packages providing more extensive curated selections of standardized datasets in a specific realm [78]. Such packages greatly lower the barrier of access to relevant data for developers of scientific software, and provide a common testing ground for development and benchmarking. We present the SingleCellMultiModal Bioconductor experimental data package, to distribute landmark single-cell multimodal datasets in pre-integrated immediately usable forms, utilizing standard Bioconductor data structures. Multimodal datasets are serialized as a MultiAssayExperiment object by a single command, without requiring users to perform data wrangling to link multiple ‘omics profiles or to manage cells with incomplete data. We provide curated landmark datasets for a selection of key single-cell multimodal assays that will serve as benchmarks for the development and assessment of appropriate analysis methods in R/Bioconductor. We provide a brief review of the assays provided for the purpose of providing essential background to developers of statistical and bioinformatic methods, a summary of the data contained in each dataset, and examples of minimal code needed to access each dataset in an R/Bioconductor session. Methods of statistical analysis are reviewed in a recent complimentary paper [22].

Single-cell RNA-seq analysis methods in Bioconductor are well developed and widely used [10], setting the stage for new development in single-cell multimodal data analysis that will be facilitated by the SingleCellMultiModal experimental data package. Areas of active research include integrative systems biology across data modes, spatial statistics on high-dimensional data, dimension reduction and clustering [14], cell identification, multimodal batch correction, and new data structures for representation and analysis of large and spatially resolved single-cell multimodal data. These areas of research and their software products will be facilitated and made more interoperable by the easily accessible and uniformly represented data provided by this work.

References

  1. 1. Schier AF. Single-cell biology: beyond the sum of its parts. Nat Methods. 2020;17: 17–20. pmid:31907464
  2. 2. Method of the Year 2019: Single-cell multimodal omics. Nat Methods. 2020;17: 1.
  3. 3. Stoeckius M, Hafemeister C, Stephenson W, Houck-Loomis B, Chattopadhyay PK, Swerdlow H, et al. Simultaneous epitope and transcriptome measurement in single cells. Nat Methods. 2017;14: 865–868. pmid:28759029
  4. 4. Mimitou EP, Cheng A, Montalbano A, Hao S, Stoeckius M, Legut M, et al. Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nat Methods. 2019;16: 409–412. pmid:31011186
  5. 5. Clark SJ, Argelaguet R, Kapourani CA, Stubbs TM, Lee HJ, Alda-Catalinas C, et al. ScNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells e. Nat Commun. 2018;9. pmid:29472610
  6. 6. Shi M, Annika K, Michael P. Nuclei Isolation from Tissue for 10x Multiome v1. protocols.io.
  7. 7. Lubeck E, Coskun AF, Zhiyentayev T, Ahmad M, Cai L. Single-cell in situ RNA profiling by sequential hybridization. Nature methods. 2014. pp. 360–361. pmid:24681720
  8. 8. Macaulay IC, Haerty W, Kumar P, Li YI, Hu TX, Teng MJ, et al. G&T-seq: parallel sequencing of single-cell genomes and transcriptomes. Nat Methods. 2015;12: 519–522.
  9. 9. Specht H, Emmott E, Petelski AA, Huffman RG, Perlman DH, Serra M, et al. Single-cell proteomic and transcriptomic analysis of macrophage heterogeneity using SCoPE2. Genome Biol. 2021;22: 50. pmid:33504367
  10. 10. Amezquita RA, Lun ATL, Becht E, Carey VJ, Carpp LN, Geistlinger L, et al. Orchestrating single-cell analysis with Bioconductor. Nat Methods. 2020;17: 137–145. pmid:31792435
  11. 11. Ramos M, Schiffer L, Re A, Azhar R, Basunia A, Rodriguez C, et al. Software for the Integration of Multiomics Experiments in Bioconductor. Cancer Res. 2017;77: e39–e42. pmid:29092936
  12. 12. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, et al. Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods. 2015;12: 115–121. pmid:25633503
  13. 13. Stuart T, Butler A, Hoffman P, Hafemeister C, Papalexi E, Mauck WM, et al. Comprehensive Integration of Single-Cell Data. Cell. 2019;177: 1888–1902.e21. pmid:31178118
  14. 14. Rohart F, Gautier B, Singh A, Lê Cao K-A. mixOmics: An R package for ‘omics feature selection and multiple data integration. PLoS Comput Biol. 2017;13: e1005752. pmid:29099853
  15. 15. Argelaguet R, Arnol D, Bredikhin D, Deloro Y, Velten B, Marioni JC, et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 2020;21: 111. pmid:32393329
  16. 16. Kim HJ, Lin Y, Geddes TA, Yang JYH, Yang P. CiteFuse enables multi-modal analysis of CITE-seq data. Bioinformatics. 2020;36: 4137–4143. pmid:32353146
  17. 17. Wolf FA, Angerer P, Theis FJ. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 2018;19: 15. pmid:29409532
  18. 18. Barkas N, Petukhov V, Nikolaeva D, Lozinsky Y, Demharter S, Khodosevich K, et al. Joint analysis of heterogeneous single-cell RNA-seq dataset collections. Nat Methods. 2019;16: 695–698. pmid:31308548
  19. 19. Gentleman R, Carey V, Huber W, Irizarry R, Dudoit S. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer Science & Business Media; 2006.
  20. 20. Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, Gentleman R, et al. Software for computing and annotating genomic ranges. PLoS Comput Biol. 2013;9: e1003118. pmid:23950696
  21. 21. Lê Cao K-A, Abadi AJ, Davis-Marcisak EF, Hsu L, Arora A, Coullomb A, et al. Community-wide hackathons to identify central themes in single-cell multi-omics. Genome Biol. 2021;22: 220. pmid:34353350
  22. 22. Al J Abadi Emily F. Davis-Marcisak Lauren Hsu Arshi Arora Alexis Coullomb Atul Deshpande Yuzhou Feng Melanie Loth Vera Pancaldi Kris Sankaran Amrit Singh Joshua S. Sodicoff Genevieve L. Stein-O’Brien Ayshwarya Subramanian Joshua D. Welch Yue You Ricard Argelaguet Vincent J. Carey Ruben Dries Casey S. Greene Susan Holmes Michael I. Love Matthew E. Ritchie Guo-Cheng Yuan Aedin C Culhane and Elana Fertig K-ALC. “Community-wide hackathons establish foundations for emerging single cell data integration.” In submission.
  23. 23. Argelaguet R, Clark SJ, Mohammed H, Carine Stapel L, Krueger C, Kapourani C-A, et al. Multi-omics profiling of mouse gastrulation at single-cell resolution. Nature. 2019;576: 487–491. pmid:31827285
  24. 24. Tasic B, Menon V, Nguyen TN, Kim TK, Jarsky T, Yao Z, et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat Neurosci. 2016;19: 335–346. pmid:26727548
  25. 25. Schepers K, Swart E, van Heijst JWJ, Gerlach C, Castrucci M, Sie D, et al. Dissecting T cell lineage relationships by cellular barcoding. J Exp Med. 2008;205: 2309–2318. pmid:18809713
  26. 26. Conway JR, Lex A, Gehlenborg N. UpSetR: An R Package For The Visualization Of Intersecting Sets And Their Properties. bioRxiv. 2017. p. 120600.
  27. 27. Zhu Y, Piehowski PD, Zhao R, Chen J, Shen Y, Moore RJ, et al. Nanodroplet processing platform for deep and quantitative proteome profiling of 10–100 mammalian cells. Nat Commun. 2018;9: 882. pmid:29491378
  28. 28. Kelly RT. Single-cell Proteomics: Progress and Prospects. Mol Cell Proteomics. 2020;19: 1739–1748. pmid:32847821
  29. 29. Vanderaa C, Gatto L. Utilizing Scp for the analysis and replication of single-cell proteomics data. bioRxiv. 2021. p. 2021.04.12.439408.
  30. 30. Kelsey G, Stegle O, Reik W. Single-cell epigenomics: Recording the past and predicting the future. Science. 2017;358: 69–75. pmid:28983045
  31. 31. Klemm SL, Shipony Z, Greenleaf WJ. Chromatin accessibility and the regulatory epigenome. Nat Rev Genet. 2019;20: 207–220. pmid:30675018
  32. 32. Smallwood SA, Lee HJ, Angermueller C, Krueger F, Saadeh H, Peat J, et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat Methods. 2014;11: 817–820. pmid:25042786
  33. 33. Angermueller C, Clark SJ, Lee HJ, Macaulay IC, Teng MJ, Hu TX, et al. Parallel single-cell sequencing links transcriptional and epigenetic heterogeneity. Nat Methods. 2016;13: 229–232. pmid:26752769
  34. 34. Song L, Crawford GE. DNase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harb Protoc. 2010;2010: db.prot5384. pmid:20150147
  35. 35. Buenrostro JD, Wu B, Litzenburger UM, Ruff D, Gonzales ML, Snyder MP, et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature. 2015;523: 486–490. pmid:26083756
  36. 36. Chen S, Lake BB, Zhang K. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nat Biotechnol. 2019;37: 1452–1457. pmid:31611697
  37. 37. Ma S, Zhang B, LaFave LM, Earl AS, Chiang Z, Hu Y, et al. Chromatin Potential Identified by Shared Single-Cell Profiling of RNA and Chromatin. Cell. 2020;183: 1103–1116.e20. pmid:33098772
  38. 38. Pott S. Simultaneous measurement of chromatin accessibility, DNA methylation, and nucleosome phasing in single cells. Elife. 2017;6. pmid:28653622
  39. 39. Chen KH, Boettiger AN, Moffitt JR, Wang S, Zhuang X. RNA imaging. Spatially resolved, highly multiplexed RNA profiling in single cells. Science. 2015;348: aaa6090. pmid:25858977
  40. 40. Rodriques SG, Stickels RR, Goeva A, Martin CA, Murray E, Vanderburg CR, et al. Slide-seq: A scalable technology for measuring genome-wide expression at high spatial resolution. Science. 2019;363: 1463–1467. pmid:30923225
  41. 41. Eng C-HL, Lawson M, Zhu Q, Dries R, Koulena N, Takei Y, et al. Transcriptome-scale super-resolved imaging in tissues by RNA seqFISH. Nature. 2019;568: 235–239. pmid:30911168
  42. 42. Righelli D, Weber LM, Crowell HL, Pardo B, Collado-Torres L, Ghazanfar S, et al. SpatialExperiment: infrastructure for spatially resolved transcriptomics data in R using Bioconductor. Cold Spring Harbor Laboratory. 2021. p. 2021.01.27.428431.
  43. 43. Moses L, Pachter L. Museum of Spatial Transcriptomics. bioRxiv. 2021. p. 2021.05.11.443152.
  44. 44. Shah S, Lubeck E, Zhou W, Cai L. In Situ Transcription Profiling of Single Cells Reveals Spatial Organization of Cells in the Mouse Hippocampus. Neuron. 2016;92: 342–357. pmid:27764670
  45. 45. Zhu Q, Shah S, Dries R, Cai L, Yuan G-C. Identification of spatially associated subpopulations by combining scRNAseq and sequential fluorescence in situ hybridization data. Nat Biotechnol. 2018. pmid:30371680
  46. 46. birsauthors. White Paper for the Mathematical Frameworks for Integrative Analysis of Emerging Biological Data Types Workshop. Available from: https://github.com/BIRSBiointegration/whitePaper.
  47. 47. Taylor TH, Gitlin SA, Patrick JL, Crain JL, Wilson JM, Griffin DK. The origin, mechanisms, incidence and clinical consequences of chromosomal mosaicism in humans. Hum Reprod Update. 2014;20: 571–581. pmid:24667481
  48. 48. Campbell KR, Steif A, Laks E, Zahn H, Lai D, McPherson A, et al. clonealign: statistical integration of independent single-cell RNA and DNA sequencing data from human cancers. Genome Biol. 2019;20: 54. pmid:30866997
  49. 49. Macaulay IC, Teng MJ, Haerty W, Kumar P, Ponting CP, Voet T. Separation and parallel sequencing of the genomes and transcriptomes of single cells using G&T-seq. Nat Protoc. 2016;11: 2081–2103.
  50. 50. Dey SS, Kester L, Spanjaard B, Bienko M, van Oudenaarden A. Integrated genome and transcriptome sequencing of the same cell. Nat Biotechnol. 2015;33: 285–289. pmid:25599178
  51. 51. Stuart T, Satija R. Integrative single-cell analysis. Nat Rev Genet. 2019;20: 257–272. pmid:30696980
  52. 52. Zahn H, Steif A, Laks E, Eirew P, VanInsberghe M, Shah SP, et al. Scalable whole-genome single-cell library preparation without preamplification. Nat Methods. 2017;14: 167–173. pmid:28068316
  53. 53. Zheng GXY, Terry JM, Belgrader P, Ryvkin P, Bent ZW, Wilson R, et al. Massively parallel digital transcriptional profiling of single cells. Nat Commun. 2017;8: 14049. pmid:28091601
  54. 54. Ramos M, Morgan M. RaggedExperiment: Representation of Sparse Experiments and Assays Across Samples.(2017). bioc. RaggedExperiment.
  55. 55. da Silva V, Ramos M, Groenen M, Crooijmans R, Johansson A, Regitano L, et al. CNVRanger: association analysis of CNVs with gene expression and quantitative phenotypes. Bioinformatics. 2020;36: 972–973. pmid:31392308
  56. 56. Hao Y, Hao S, Andersen-Nissen E, Mauck WM 3rd, Zheng S, Butler A, et al. Integrated analysis of multimodal single-cell data. Cell. 2021. pmid:34062119
  57. 57. Rohart F, Eslami A, Matigian N, Bougeard S, Lê Cao K-A. MINT: a multivariate integrative method to identify reproducible molecular signatures across independent experiments and platforms. BMC Bioinformatics. 2017;18: 128. pmid:28241739
  58. 58. Singh A, Shannon CP, Gautier B, Rohart F, Vacher M, Tebbutt SJ, et al. DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics. 2019;35: 3055–3062. pmid:30657866
  59. 59. Pasolli E, Schiffer L, Manghi P, Renson A, Obenchain V, Truong DT, et al. Accessible, curated metagenomic data through ExperimentHub. Nat Methods. 2017;14: 1023–1024. pmid:29088129
  60. 60. Ramos M, Geistlinger L, Oh S, Schiffer L, Azhar R, Kodali H, et al. Multiomic Integration of Public Oncology Databases in Bioconductor. JCO Clin Cancer Inform. 2020;4: 958–971. pmid:33119407
  61. 61. Risso D, Pagnotta SM. Per-sample standardization and asymmetric winsorization lead to accurate clustering of RNA-seq expression profiles. Bioinformatics. 2021;37: 2356–2364. pmid:33560368
  62. 62. Diehl AD, Meehan TF, Bradford YM, Brush MH, Dahdul WM, Dougall DS, et al. The Cell Ontology 2016: enhanced content, modularization, and ontology interoperability. J Biomed Semantics. 2016;7: 44. pmid:27377652
  63. 63. Sioutos N, de Coronado S, Haber MW, Hartel FW, Shaiu W-L, Wright LW. NCI Thesaurus: a semantic model integrating cancer-related clinical and molecular information. J Biomed Inform. 2007;40: 30–43. pmid:16697710
  64. 64. Lun ATL, Riesenfeld S, Andrews T, Dao TP, Gomes T, participants in the 1st Human Cell Atlas Jamboree, et al. EmptyDrops: distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol. 2019;20: 63. pmid:30902100
  65. 65. Griffiths JA, Richard AC, Bach K, Lun ATL, Marioni JC. Detection and removal of barcode swapping in single-cell RNA-seq data. Nat Commun. 2018;9: 2667. pmid:29991676
  66. 66. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30: 923–930. pmid:24227677
  67. 67. Yates A, Akanni W, Amode MR, Barrell D, Billis K, Carvalho-Silva D, et al. Ensembl 2016. Nucleic Acids Res. 2016;44: D710–6. pmid:26687719
  68. 68. Lun ATL, McCarthy DJ, Marioni JC. A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Research. 2016. p. 2122. pmid:27909575
  69. 69. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics. 2011;27: 1571–1572. pmid:21493656
  70. 70. Kelly TK, Liu Y, Lay FD, Liang G, Berman BP, Jones PA. Genome-wide mapping of nucleosome positioning and DNA methylation within individual DNA molecules. Genome Res. 2012;22: 2497–2506. pmid:22960375
  71. 71. Harrison PW, Ahamed A, Aslam R, Alako BTF, Burgin J, Buso N, et al. The European Nucleotide Archive in 2020. Nucleic Acids Res. 2021;49: D82–D85. pmid:33175160
  72. 72. Liao Y, Smyth GK, Shi W. The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Res. 2019;47: e47. pmid:30783653
  73. 73. Quinlan AR. BEDTools: The Swiss-Army Tool for Genome Feature Analysis. Curr Protoc Bioinformatics. 2014;47: 11.12.1–34. pmid:25199790
  74. 74. Garvin T, Aboukhalil R, Kendall J, Baslan T, Atwal GS, Hicks J, et al. Interactive analysis and assessment of single-cell copy-number variations. Nat Methods. 2015;12: 1058–1060. pmid:26344043
  75. 75. McInnes L, Healy J, Melville J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv [stat.ML]. 2018. Available from: http://arxiv.org/abs/1802.03426.
  76. 76. Li X. ALL Bioconductor package. Bioconductor; 2004.
  77. 77. Trapnell C. HSMMSingleCell: Single-cell RNA-Seq for differentiating human skeletal muscle myoblasts (HSMM). R package version. 2014;1140.
  78. 78. Ganzfried BF, Riester M, Haibe-Kains B, Risch T, Tyekucheva S, Jazic I, et al. curatedOvarianData: clinically annotated data for the ovarian cancer transcriptome. Database. 2013;2013: bat013. pmid:23550061