ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Research Article

Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data

[version 1; peer review: 1 approved, 2 approved with reservations]
PUBLISHED 15 Aug 2018
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

This article is included in the Bioinformatics gateway.

Abstract

Background: The commercially available 10x Genomics protocol to generate droplet-based single-cell RNA-seq (scRNA-seq) data is enjoying growing popularity among researchers. Fundamental to the analysis of such scRNA-seq data is the ability to cluster similar or same cells into non-overlapping groups. Many competing methods have been proposed for this task, but there is currently little guidance with regards to which method to use.
Methods: Here we use one gold standard 10x Genomics dataset, generated from the mixture of three cell lines, as well as three silver standard 10x Genomics datasets generated from peripheral blood mononuclear cells to examine not only the accuracy but also robustness of a dozen methods.
Results: We found that some methods, including Seurat and Cell Ranger, outperform other methods, although performance seems to be dependent on the complexity of the studied system. Furthermore, we found that solutions produced by different methods have little in common with each other.
Conclusions: In light of this, we conclude that the choice of clustering tool crucially determines interpretation of scRNA-seq data generated by 10x Genomics. Hence practitioners and consumers should remain vigilant about the outcome of 10x Genomics scRNA-seq analysis.

Keywords

Clustering, Single-Cell RNA-seq, Benchmarking, 10x Genomics

Introduction

Single-cell RNA-sequencing (scRNA-seq) studies have opened the way for new data-driven definitions of cell identity and function. No longer is a cell’s type determined by arbitrary hierarchies and their respective predefined markers. Instead, a cell’s transcriptional and epigenomic profile can now be used1 to accomplish this task. This is achieved using computational methods for scRNA-seq that characterize cells into novel and known cell types. Characterization consists of two steps: (i) unsupervised or semi-supervised clustering of same or similar cells into non-overlapping groups, and (ii) labeling clusters, i.e. determining the cell type, or related cell types, represented by the cluster. Here, we focus on the first step of this process.

Research into clustering has produced many algorithms for the task, including over 60 tools specifically designed for scRNA-seq2. Due to the relative youth of the field, there are currently no rules guiding the application of these clustering algorithms. If tools’ performances have been tested outside synthetic scenarios, testing seems to be confined to scenarios with limited biological variability. Furthermore, most tools were developed and consequently tested only on the Fluidigm C1 protocol, despite considerable differences in throughput capabilities and sensitivities3 in the different scRNA-seq platforms. Here we focus solely on clustering performance on medium-sized scRNA-seq data generated by 10x Genomics as it is currently the most widely used platform. Commercially available scRNA-seq platforms, like 10x Genomics’ Chromium, are being widely adopted due to their ease of use and relatively low cost per cell4. The 10x Genomics protocol uses a droplet-based system to isolate single cells. Each droplet contains all the necessary reagents for cell lysis, barcoding, reverse transcription and molecular tagging. This is followed by pooled PCR amplification and 3’ library preparation, after which standard Illumina short-read sequencing can be applied5. Unlike other commercially available scRNA-seq protocols, like Fluidigm C1, 10x Genomics allows for sequencing of thousands of cells albeit at much shallower read depths per cell, and without allowing the use of fluorescence markers to establish cell identity. As such the 10x Genomics platform is particularly suited to detailed characterization of heterogeneous tissues.

Methods

In this study, we performed comprehensive evaluation of a dozen clustering methods (Table 1). We focused on analysis methods available in the R language, as this is one of the most commonly used programming languages for scRNA-seq data analysis. The exception to this is the 10x Genomics software Cell Ranger. Our evaluation comprised four core aspects: (i) accuracy of clustering solutions compared to a gold standard (near absolute truth, limited variability and complexity), (ii) performance of clustering methods using silver standard data (no absolute truth, realistic variability and complexity), (iii) stability of clustering solutions, and (iv) miscellaneous characteristics, such as time and practicality.

Table 1. Overview of the clustering tools included in this study, and several characteristics thereof.

SoftwareYearDescriptionPropertiesRef
ascend version
0.4.0
2017Adjusted RLE normalization followed by hierarchical
clustering and merging
Gene filtering, limited documentation, medium
user interaction
6
Cell Ranger
version 2.0.0
2016Graph-based clustering on the first 10 principal
components
Gene filtering, detailed documentation, no user
interaction
CIDR version 0.1.52017Imputation of potential dropout genes followed
by hierarchical clustering on first 4 principal
components
Gene filtering, imputation good documentation,
high user interaction
7
countClust
version 1.4.1
2014Likelihood models to estimate a specified number of
multinomial distributions
Requires number of clusters, limited
documentation, medium user interaction
8
RaceID2015Two iterations of k-means clustering with merging of
outlier cells and identification of rare cell types in last
step
Gene filtering, limited documentation, little user
interaction
9
RaceID22016More advanced version of RaceID based on UMIsGene filtering, limited documentation, little
interaction
10
RCA version 1.02017Rudimentary filtering then projection onto reference
datasets consisting of profiles of isolated cell types
Gene filtering, reference dataset required, limited
documentation, medium user interaction
11
SC3 version 1.7.72016Rudimentary filtering followed by ensembl clustering
method
Gene filtering, good documentation, medium
user interaction
12
scran version
1.6.9
2016Library scale normalization by cell pools followed
by hierarchical clustering on rank correlation-based
distances of marker genes
Gene filtering, requires marker genes, detailed
documentation, medium user interaction
13
Seurat version
2.3.0
2015Normalization using mitochondrial RNA followed by
PCA of highly variable genes and then graph-based
clustering
Gene filtering, detailed documentation, high
user interaction
14
SIMLR version
1.4.1
2016Multikernel learning finds best fit and forces blocks
in similarity matrix to address dropouts then applies
spectral clustering
Requires number of clusters, good
documentation, little user interaction
15
TSCAN version
1.16.0
2016In-silico pseudo time reconstruction with a cluster-
based minimum spanning tree approach to order
cells
Good documentation, little user interaction16

Data

Gold standard. Three human lung adenocarcinoma cell lines, HCC827, H1975 and H2228, were cultured separately. The cell lines were obtained from ATCC and cultured in Roswell Park Memorial Institute 1640 medium with 10% fetal bovine serum (FBS, catalog number: 11875-176; Thermo Fisher Gibco) and 1% penicillin-streptomycin. The cells were grown independently at 37°C with 5% carbon dioxide until near 100% confluence. Before mixing cell lines, cells were dissociated into single-cell suspensions in FACS buffer (phosphate-buffered saline (PBS); catalog number: 14190-144; Thermo Fisher Gibco) with 5% FBS, Corning, catalog number: 35-076-CV), stained with propidium iodide (catalog number: P21493; Thermo Fisher FluoroPure) and 120,000 live cells were sorted for each cell line by FACS (BD FACSAria III flow cytometer, BD FACSDiva software version 7.0; BD Biology) to acquire an accurate equal mixture of live cells from the three cell lines. The resulting mixture was then processed by the Chromium Controller (10x Genomics) using single Cell 3’ Reagent Kit v2 (Chromium Single Cell 3’ Library & Gel Bead Kit v2, catalog number: 120237; Chromium Single Cell A Chip Kit, 48 runs, catalog number: 120236; 10x Genomics) (see Table 2). Afterwards the library was sequenced using Illumina NextSeq500 and V4 chemistry (NextSeq 500/550 High Output Kit v2.5, 150 Cycles, catalog number; 20024907; Illumina) with 100bp paired end reads. RTA (version 1.18.66.3; Illumina) was used for base calling.

Table 2. Properties of all benchmarking datasets used in the study.

Benchmark standardGoldSilver
DatasetDataset 1Dataset 2Dataset 3
Tissue
Source
Instrument
Number of cells
Total genes detected
Cell lines
GSE111108
Chromium
1,039
29,451
PBMCs
GSE115189
Chromium
3,372
24,654
PBMCs
Website*
GemCode
2,690
20,693
PBMCs
Website+
Chromium
4,337
25,820
After preprocessing
Number of cells
Mean counts per cell
Median genes detected per cell
925
114,426
8,499
3,205
3,818
1,158
2,590
2,605
877
4,292
4,528
1,318

Silver standard. We consider three human peripheral blood mononuclear cells (PBMCs) scRNA-seq datasets to be the silver standard (Table 2). All datasets were generated using the 10X Genomics droplet system combined with Illumina sequencing. The Australian Genome Research Facility in partnership with CSL generated one dataset using the 10x Genomics Chromium system (Dataset 1). Two datasets were generated by 10x Genomics and are publicly available (Datasets 2 and 3). Of these, one dataset was generated with an earlier version of the microfluidics instrument, the 10x Genomics GemCode Controller (Dataset 2). The second dataset was generated with the latest instrument, the 10x Genomics Chromium Controller (Dataset 3).

For the first dataset, PBMCs were isolated from whole blood obtained through the Australian Red Cross Blood Service in the following manner. First, 50ml of blood was diluted using 50ml of PBS (catalog number: D8537-500ml; Sigma-Aldrich). We then added 30ml of Ficoll-Paque medium (catalog number: Catalog: 17-1440-03; GE Healthcare). We then centrifuged at room temperature for 20 minutes at 400 g and carefully removed the interface layer containing PBMCs, located between the top plasma layer and middle layer (Heraeus Multifuge 3 S-R Centrifuge; Thermo Fisher Scientific). To remove the supernatant, we further centrifuged at 400 g for 10 minutes at room temperature. This process was repeated to remove the contaminating Ficoll medium or platelets. Finally, cells were resuspended in 20ml of cell culture media with 5% FBS (RPMI-1640 Medium, catalog number: R0884-500ml, Sigma-Aldrich) and counted (Nikon Eclipse TS100 Microscope; Nikon). The resulting mixture was then processed by the Chromium Controller (10x Genomics) using single Cell 3’ Reagent Kit v2 (Chromium Single Cell 3’ Library & Gel Bead Kit v2, catalog number: 120237; Chromium Single Cell A Chip Kit, 48 runs, catalog number: 120236; 10x Genomics). Afterwards the library was sequenced using HiSeq2500 (Illumina) and V4 chemistry (HiSeq PE Cluster Kit v4 cBot, catalog number: PE-401-4001; HiSeq SBS Kit V4 50 cycles, catalog number: FC-401-4002; Illumina) with 101bp paired end reads. RTA (version 1.18.66.3; Illumina) was used for base calling.

Preprocessing

We used the 10x Genomics software, Cell Ranger (version 2.0.0) to align, de-duplicate, filter barcodes and quantify genes for all datasets. Note that we aligned reads with Cell Ranger to the GRCh38 (version 90) genome annotation. Using the Bioconductor package scater17 (version 1.6.3), we then removed low quality data from cells with low library size or low number of expressed gene transcripts. We also removed cells with a high mitochondrial read proportion as this can indicate apoptosis, also known as programmed cell death. Stressed cells undergoing apoptosis have an aberrant transcriptome profile in comparison to a living cell and have previously been acknowledged to adversely influence transcriptome studies13.

Establishing truth

Gold standard. By exploiting the genetic differences between the three different cell lines we were able to establish near absolute truth in the gold standard dataset. To this end we first called single nucleotide variants (SNVs) in publicly available bulk RNA-seq of the same cell lines (GSE86337)18. Drawing on these SNVs, we then apply demuxlet19 (version 0.0.1), which harnesses the natural genetic variation between the cell lines to determine the most likely identity of each cell. We observe almost complete concordance between the result from demuxlet and clustering of cells seen in dimension reduction visualizations of the data (compare Supplementary Figure 1).

Silver standard. For the silver standard data, we compared clustering solutions to a cell labeling approach by 10x Genomics5. This approach finds the cell type in a reference dataset which most closely resembles the expression in the cell. The reference dataset contains 11 isolated cell types sequenced using the 10x Genomics system. While this labeling does not constitute truth, it has been found to be perform well in comparison with marker-based classification5. Furthermore, the proportions of cells assigned to the 11 cell types by the supervised labeling approach were consistent with the literature (see Supplementary Table 1)20,21.

Criteria for inclusion of clustering tool

We based our selection of method on the online list within www.scrna-tools.org2 in October 2017. We only considered methods with an R package that had sufficient documentation to enable easy installation and execution and had at least one preprint or publication associated with it. We also excluded any methods that required extensive prior information not provided in the package. We also excluded any methods that continually failed to run (e.g. Linnorm22 and Monocle23). This resulted in the evaluation of 12 methods (see Table 1). Note that for some of the R packages the primary focus is not clustering, but the package authors explicitly describe how their packages can be applied to achieve clustering of the scRNA-seq data.

The aim of this study is to provide guidance for the use of clustering methods to non-experts. Hence, we use all clustering methods with their default parameters as this represents the most common use case. In the case of countClust and SIMLR parameters included the number of clusters, which we set to 3 and 8 for the gold standard and silver standard datasets, respectively. Marker genes were required for the analysis with scran, which we obtained by performing differential expression analyses on GSE86337 and an in-house dataset of isolated cell types in PBMCs24 for the gold standard and silver standard datasets, respectively. Furthermore, we also followed upstream data handling, such as filtering of genes and normalization, as described in the documentation of the respective clustering method. We concede that it is possible that more care in the upstream data handling and selection of parameters could result in different results. However, confronted with the extremely large number of parameter choices, we believe that this evaluation suffices to identify strengths and weaknesses of each method.

Methods for the comparison of clustering solutions

To evaluate the performance and similarity of different clustering solutions, we rely on three different metrics. We use the adjusted Rand index (ARI)25 and the normalized mutual information (NMI)26, two metrics routinely applied in the field of clustering, to assess the similarity of clustering solutions or their similarity to a known truth. Both metrics can take values from 0 to 1, with 0 signifying no overlap between two groupings and 1 signifying complete overlap. These metrics are also applicable in the absence of known cluster labels. Finally, we also use a homogeneity score27. This score takes the value 1 when all of its clusters contain only data points that are members of a single known group. Values of this score closer to 0 indicate that clusters contain mixed known groups. Unlike NMI and ARI, this score requires knowledge of an underlying truth.

Let X be a finite set of size n. A clustering solution C is a set C1, . . . , Ck of non-empty disjoint subsets of X such that their union equals X. Let C=C1,...,Cl be a second clustering solution or the supervised labeling solution with the same properties. The contingency table M = (mij) of the pair of sets C, C is a k × l matrix whose i, j-th entry equals the number of elements in the intersection of clusters Ci and Cj:

mij=|CiCj|,1ik,1jl.

ARI

Radj(C,C)=i=1kj=1l(mij2)t3(12(t1+t2)t3),

where t1 = i=1k(|Ci|2),t2=j=1l(|Cj|2) and t3=2t1t2n(n1)· For ease of notation this is referred to as ARI in the text, dropping the reference to specific pairs of sets. Furthermore, we also distinguish between ARI_truth as a comparison of a clustering solution to an underlying known or suspected truth and ARI_comp, which refers to a comparison between two clustering solutions.

NMI

NMI1=I(C,C)(H(C)H(C),

where H(C) = I(C, C) is the entropy of C. Note that

I(C,C)=i=1kj=1lP(i,j)log2P(i,j)P(i)P(j),

where P(i, j) = mijn and P(i) = |Ci|n, is the mutual information of C and C‘.

Homogeneity. Now let us assume C is the known and correct grouping of the cells. Then,

Homogeneity=I(C,C)H(C)·

Stability assessment

To test the robustness of different clustering methods we pursued a sampling strategy in terms of genes and cells. We used Dataset 3 for the cell robustness evaluation with regards to cells, since it had the most number of cells. Similarly, we used Dataset 1 for the robustness evaluation with regards to genes, since it had the most number of non-zero genes after filtering. The impact of different aligners and preprocessing was assessed using all appropriate combinations of programs.

Cells. We randomly sampled 3,000 cells in Dataset 3 (out of the total of 4,292 that were available after filtering), generating five (non-independent) datasets. For every combination of two datasets (10 combinations in total) we then investigated for each clustering method separately how often cells contained in all five sampled datasets were assigned to the same cluster using the ARI_comp.

Genes. We randomly filtered half of all genes in Dataset 1 (out of the total of 58,302 genes), generating 10 datasets. For every combination of two datasets (45 combinations in total) we then investigated for each clustering method separately how often cells were assigned to the same cluster using the ARI_comp.

Aligners and preprocessing pipelines. In order to assess the affect of using different preprocessing pipelines on the data, we applied the Bioconductor package scPipe28 (version 1.0.6) to the raw data. Like Cell Ranger, scPipe can be used to align, deduplicate, filter barcodes and quantify genes. Since scPipe is modular, we tried it with both the STAR29 (version 020201) and Subread30 (version 1.5.2) aligners. In order to ensure comparability we aligned reads to the same GRCh38 genome annotation and repeated quality control with scater. We investigated the similarity of clustering solutions applied to the differently preprocessed and aligned versions of the same dataset by ARI_comp.

Run time

Each execution of a method on a dataset was performed in a separate R session. Each task was allocated as many CPU cores of a 24 core Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz as specified by the default parameters. The base::set.seed was overridden in order to prevent stochasticity and thus give reduced unwanted variation in the results. Timings for each method include any preprocessing steps.

Influence assessment

We also investigated what properties of each cell’s data were driving the clustering solutions produced by the different methods as well as the inferred cell labels. Properties of a cell’s data refer to features such as the number of total reads that included the cell’s barcode, the total number of gene transcripts found for this cell, etc. To this end, we used linear mixed models where cell data properties were predicted using the indicators for cluster membership. We predicted cell data properties and not cluster membership for modeling ease. The adjusted R2 of these models was used to assess which properties influenced the clustering solutions. Properties investigated included: (i) the total number of detected gene transcripts, (ii) the total read count, and (iii) the percentages of reads aligning respectively to ribosomal proteins, mitochondrial genes and ribosomal RNA.

Results

Evaluation of clustering tools

Gold standard data. For the gold standard data set consisting of three cell types, half of the tested clustering methods overestimated the true number of different cell types in the data. Methods with cluster number estimations close to the correct number of different cell types included methods with prior information, such as SIMLR, countClust and scran, as well as ascend, Cell Ranger, RaceID and CIDR (Figure 1). The clustering solutions produced by these methods, with the exception of countClust, largely reflect the cell types. This is indicated by ARI_truth >0.8. The remaining methods overestimated the number of clusters by 2 to 85 clusters, with SC3 and RaceID2 representing the extremes, both estimating more than 20 clusters (see t-SNE plots in Supplementary Figure 1 for the impact). As a consequence of the greater number of estimated clusters, the ARI_truth of the other clustering methods is lower than 0.8. To see whether these methods split cell types into several clusters or instead assign cells types randomly to clusters, we also investigate the homogeneity of the clustering solutions with respect to the known labeling. Apart from countClust and RCA, all methods have extremely high homogeneity, indicating that they split cell types into more subtypes, rather than randomly creating more cell types, which is reassuring.

d60760c5-81bf-4a84-b5bf-be174d9726a9_figure1.gif

Figure 1. Performance on the gold standard dataset.

(a) ARI_truth of each method with regards to the truth versus the number of clusters. The dashed line indicates the true number of clusters.(b) Homogeneity of clusters of each method, given the truth.

Silver standard data. We labeled the cells in each of the three silver standard datasets as one of 11 different PBMC cell populations. When using the ARI_truth to compare the likeness of the clustering solutions and the labels, no method produced solutions that were uniformly the most similar to the inferred labels (Figure 2). Two methods, ascend and countClust, tended to estimate smaller number of clusters and consequently did not agree with the labeling. Only Seurat, SC3 and Cell Ranger achieved an ARI_truth above 0.4 for at least two out of the three silver standard datasets. All methods considerably improved their ARI_truth when we subset to more confidently labeled cells (see Supplementary Figure 2). RCA was particularly affected, showing much greater similarity for more confidently labeled cells. We also calculated the homogeneity of each method in each dataset with respect to the inferred labeling. Generally, most methods exhibited significantly lower performance on Dataset 2, which was generated with an older version of the 10x Genomics technology than that used to generate Datasets 1 and 3. Apart from RCA all methods had much lower accuracy than for the gold standard data, indicating that most clusters represent mixtures of different inferred cell types. The exception is SC3’s clustering solution of Dataset 3, which achieved an homogeneity score above 0.7.

d60760c5-81bf-4a84-b5bf-be174d9726a9_figure2.gif

Figure 2. Performance on the three silver standard datasets.

(a) ARI_truth of each method in each dataset, as indicated by different shapes, with regards to the supervised cell labeling versus the number of clusters. The dashed line indicates the number of cell populations estimated by the supervised cell labeling approach. (b) Homogeneity of clusters with regards to the inferred labeling for each method and each dataset. Different datasets are indicated by transparency.

Interestingly, similar performance when compared to the labeling did not imply that cluster solutions were similar (compare Figure 3). In fact, only a few methods resulted in clustering solutions that were similar. For all three datasets, a group of five methods (RCA, scran, Seurat, SIMLR and TSCAN) produced similar results, while the other seven methods appeared dissimilar between each other and to the set of five methods.

d60760c5-81bf-4a84-b5bf-be174d9726a9_figure3.gif

Figure 3.

Similarity of all combinations of clustering methods as estimated by ARI_comp (lower triangle) and NMI (upper triangle) in (a) Dataset 1, (b) Dataset 2, and (c) Dataset 3. The similarity is indicated by the color; yellow indicating no similarity and purple indicating complete overlap. The diagonals give the number of clusters estimated by each respective method.

Stability. We evaluated the stability of the clustering methods by examining three different features: (i) filtering of cells, (ii) filtering of genes (Figure 4), and (iii) use of different aligners (Figure 5). When assessing the stability with regards to input, countClust, RaceID and RaceID2 did not appear very robust. The robustness of ascend, countClust and RaceID was variable. Due to its reliance on reference profiles RCA is extremely robust, achieving ARI_comp above 0.9 consistently. Seurat, SIMLR and Cell Ranger demonstrated robustness with regards to input, but also exhibited robustness when changing gene filtering procedures (compare Figure 4b). CIDR appeared to be very sensitive to changes in gene filtering, which may be due to its imputation feature.

We also investigated how the stability of the clustering method was affected by the use of different aligners (Figure 5). In particular, we used Cell Ranger and ScPipe28 with Subread30, or STAR29. We found that different aligners largely result in the same gene counts, but with some notable exceptions for processed pseudogenes (see Supplementary Figure 4, Supplementary Figure 5 and Supplementary Figure 6). Not all methods were able to be used in conjunction with scPipe. This included ascend and SIMLR, which failed to run, and Cell Ranger, which requires output from its own preprocessing pipeline. However we were able to evaluate eight methods. Apart from RaceID2 and RCA, all tested methods appeared robust.

d60760c5-81bf-4a84-b5bf-be174d9726a9_figure4.gif

Figure 4.

Tukey boxplots of ARI_comp results from the comparison of clustering solutions of the same method when (a) cell input was varied in Dataset 3 and (b) gene input was varied in Dataset 1. Note that RaceID only estimated 1 cluster when genes were varied.

d60760c5-81bf-4a84-b5bf-be174d9726a9_figure5.gif

Figure 5. ARI_comp results from the comparison of clustering solutions of the same method used on datasets processed with different aligners.

Only the subset of nine methods that worked in conjunction with all three aligners are shown.

Miscellaneous properties. Running time varies substantially between different methods. Methods like RaceID and RaceID2 take prohibitively long and thus do not lend themselves to interactive analysis when applied to 10x Genomics data (Figure 6). The fastest methods were RCA and TSCAN, with both taking less than 25 seconds on average for the entire dataset analysis. Their fast running time is due to both of these methods offering little flexibility or intermediate results during their analysis (compare Table 1). By contrast Seurat’s relatively long running time is partially due to extensive quality control during the analysis. Also note that methods differed in the quality of their documentation. For example, tools like Cell Ranger and Seurat offer detailed documentation, with many different use cases as well as tutorials. Tools, which are not found on Bioconductor, such as RaceID, RaceID2, ascend and RCA have more limited documentation.

d60760c5-81bf-4a84-b5bf-be174d9726a9_figure6.gif

Figure 6. The bars indicate the average log10 run time (in seconds) of all 12 methods on Dataset 1 with 29,151 genes over 10 iterations.

Factors influencing clustering solutions

The variation in the percentage of reads aligning to ribosomal protein genes strongly predicted all clustering solutions as well as the inferred cell labels (see Figure 7). Expression of ribosomal protein genes has been successfully used to discriminate cell types belonging to different hematopoietic lineages31. Hence, it may be the case that overall mRNA amount of ribosomal protein genes can also serve as a discriminator. Furthermore, differences in abundance of ribosomal protein genes are likely to drive variation in PBMCs scRNA-seq datasets, as they typically account for a large proportion of reads (around 40% in all three datasets). In combination with ribosomal protein genes being less affected by dropout due to their relatively high expression, it is perhaps unsurprising that clustering solutions of all methods foremost reflect differences in the amount of ribosomal protein genes between cells.

Most methods’ solutions were much more driven by the total number of features and total number of counts than the inferred solution. TSCAN was particularly affected (R2 = 0.52), but for both ascend and RaceID2 similar effects were observed. It can be speculated that this strong influence of total number of features and total number of count on their clustering solutions points to a failure to appropriately normalize the data.

d60760c5-81bf-4a84-b5bf-be174d9726a9_figure7.gif

Figure 7. Radial plots describing the average effect of 5 cell features on the clustering solutions of different methods across the three silver standard datasets.

For every method and every feature the adjusted R2 of the linear model fitting the feature by the clustering solution is presented.

Discussion

Most biological conclusions obtained from droplet-based scRNA-seq data crucially rely on accurate clustering of cells into homogeneous groups. Indeed, one can argue that it is the very act of clustering that unlocks the technology’s potential for discovery. Therefore it is not surprising that according to several repositories, such as www.omicstools.org and www.scRNA-tools.org2, many of the tools developed for scRNA-seq specifically focus on clustering. With so many choices, it is thus important to evaluate their performance for droplet based protocols, such as 10x Genomics, specifically.

In this study, we presented our evaluation of a dozen clustering method on scRNA-seq 10x Genomics data. The results of our investigations will be useful for method users, as we provide clear and practical guidelines. Nonetheless, our evaluation has several limitations:

  • Inclusion of methods limited to R packages and methods published before October 2017

  • Parameter selection limited to defaults

  • No assessment of robustness to noise and parameter changes

  • No assessment of ability to discover rare cell populations

  • Evaluation of more silver standard datasets from systems other than PBMCs

  • No evaluation of quality of code and documentation

  • No assessment of scalability of methods

Our evaluations suggest that Seurat and Cell Ranger provide the most stable and accurate clustering solutions for 10x Genomics scRNA-seq data. While Seurat performed slightly better, the choice between Seurat and Cell Ranger, in our opinion, should be informed by the user’s familiarity with statistical concepts, which enable them to make the informed parameter choices required in Seurat. More generally we find that different clustering methods resulted in very different solutions. The good performance of Seurat as well as the vast difference between clustering methods have also been observed by Duò et al.32 in a benchmarking study including multiple scRNA-seq protocols. Our investigations suggest that biological differences between cells, such as cell type or state, and technical variation between cells (as well as combinations of biological differences and technical variation) all drive clustering. However, which aspects are captured by which clustering method remain to be confirmed. Our study merely pinpoints some of the drivers of performance, but not their origin, and thus cannot anoint an overall best method.

We recommend that practitioners and consumers of results generated from 10x Genomics scRNA-seq data alike remain vigilant about the outcome of their analysis, and acknowledge the variability and likelihood of undesired influences. The choice of clustering tool for scRNA-seq data generated by the 10x Genomics platform crucially determines interpretation. Hence, at least two clustering tools should be routinely applied to 10x Genomics scRNA-seq data in order to offer more than one subjective interpretation and hence increase robustness and confidence in any results.

Data availability

Repository: Gold Standard Dataset. Single cell profiling of 3 Human Lung Adenocarcinoma cell lines, GSE111108

Repository: Silver Standard Dataset 1. Single cell profiling of peripheral blood mononuclear cells from healthy human donor, GSE115189

Repository: Silver Standard Dataset 2. 3k PBMCs from a Healthy Donor, https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k

Repository: Silver Standard Dataset 3. 4k PBMCs from a Healthy Donor, https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.2.0/pbmc4k

Software availability

All code is available for download at: https://github.com/SaskiaFreytag/cluster_benchmarking_code.

Archived code at time of publication: 10.5281/zenodo.1324576

License: MIT License

Consent

Written informed consent for publication of the participant’s transcriptomic information was obtained (Australian Red Cross Blood Service Supply Agreement 18-03VIC-07).

Comments on this article Comments (0)

Version 2
VERSION 2 PUBLISHED 15 Aug 2018
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Freytag S, Tian L, Lönnstedt I et al. Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data [version 1; peer review: 1 approved, 2 approved with reservations] F1000Research 2018, 7:1297 (https://doi.org/10.12688/f1000research.15809.1)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 1
VERSION 1
PUBLISHED 15 Aug 2018
Views
78
Cite
Reviewer Report 31 Aug 2018
Stephanie Hicks, Johns Hopkins Bloomberg School of Public Health (JHSPH), Baltimore, MD, USA 
Approved with Reservations
VIEWS 78
Freytag et al. have produced a nice research article on assessing methods for clustering scRNA-seq data from the 10x Genomics platform. I was excited to read the article to learn about what they recommend using. I have made some suggestions below ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Hicks S. Reviewer Report For: Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data [version 1; peer review: 1 approved, 2 approved with reservations]. F1000Research 2018, 7:1297 (https://doi.org/10.5256/f1000research.17256.r37228)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 19 Dec 2018
    Saskia Freytag, Department of Medical Biology, University of Melbourne, Parkville, Australia
    19 Dec 2018
    Author Response
    We would like to thank the reviewer for reviewing our manuscript and for their constructive comments. Below are point-by-point responses to the individual comments.

    While the authors have provided ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 19 Dec 2018
    Saskia Freytag, Department of Medical Biology, University of Melbourne, Parkville, Australia
    19 Dec 2018
    Author Response
    We would like to thank the reviewer for reviewing our manuscript and for their constructive comments. Below are point-by-point responses to the individual comments.

    While the authors have provided ... Continue reading
Views
88
Cite
Reviewer Report 29 Aug 2018
Shila Ghazanfar, School of Mathematics and Statistics, University of Sydney, Sydney, NSW, Australia 
Approved with Reservations
VIEWS 88
Freytag and colleagues provide a comprehensive comparison of clustering methods - specifically designed for scRNA-Seq data - on data collected using the popular droplet-based 10x Genomics platform. A total of four datasets, comprising a Gold standard mixture of cell lines ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Ghazanfar S. Reviewer Report For: Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data [version 1; peer review: 1 approved, 2 approved with reservations]. F1000Research 2018, 7:1297 (https://doi.org/10.5256/f1000research.17256.r37231)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
  • Author Response 19 Dec 2018
    Saskia Freytag, Department of Medical Biology, University of Melbourne, Parkville, Australia
    19 Dec 2018
    Author Response
    We would like to thank the reviewer for reviewing our manuscript and for their constructive comments. Below are point-by-point responses to the individual comments.

    Linnorm and Monocle failed - ... Continue reading
COMMENTS ON THIS REPORT
  • Author Response 19 Dec 2018
    Saskia Freytag, Department of Medical Biology, University of Melbourne, Parkville, Australia
    19 Dec 2018
    Author Response
    We would like to thank the reviewer for reviewing our manuscript and for their constructive comments. Below are point-by-point responses to the individual comments.

    Linnorm and Monocle failed - ... Continue reading
Views
62
Cite
Reviewer Report 28 Aug 2018
Joshua W. K. Ho, Victor Chang Cardiac Research Institute (VCCRI), Darlinghurst, NSW, Australia 
Approved
VIEWS 62
This paper presents a well-designed and comprehensive evaluation of widely used clustering algorithms for medium-sized 10x Genomics scRNA-seq data. Clustering is a highly active area of research in scRNA-seq data analysis. With so many published clustering tools available, it is ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Ho JWK. Reviewer Report For: Comparison of clustering tools in R for medium-sized 10x Genomics single-cell RNA-sequencing data [version 1; peer review: 1 approved, 2 approved with reservations]. F1000Research 2018, 7:1297 (https://doi.org/10.5256/f1000research.17256.r37232)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Comments on this article Comments (0)

Version 2
VERSION 2 PUBLISHED 15 Aug 2018
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.