rCASC: reproducible classification analysis of single-cell sequencing data

Abstract Background Single-cell RNA sequencing is essential for investigating cellular heterogeneity and highlighting cell subpopulation-specific signatures. Single-cell sequencing applications have spread from conventional RNA sequencing to epigenomics, e.g., ATAC-seq. Many related algorithms and tools have been developed, but few computational workflows provide analysis flexibility while also achieving functional (i.e., information about the data and the tools used are saved as metadata) and computational reproducibility (i.e., a real image of the computational environment used to generate the data is stored) through a user-friendly environment. Findings rCASC is a modular workflow providing an integrated analysis environment (from count generation to cell subpopulation identification) exploiting Docker containerization to achieve both functional and computational reproducibility in data analysis. Hence, rCASC provides preprocessing tools to remove low-quality cells and/or specific bias, e.g., cell cycle. Subpopulation discovery can instead be achieved using different clustering techniques based on different distance metrics. Cluster quality is then estimated through the new metric "cell stability score" (CSS), which describes the stability of a cell in a cluster as a consequence of a perturbation induced by removing a random set of cells from the cell population. CSS provides better cluster robustness information than the silhouette metric. Moreover, rCASC's tools can identify cluster-specific gene signatures. Conclusions rCASC is a modular workflow with new features that could help researchers define cell subpopulations and detect subpopulation-specific markers. It uses Docker for ease of installation and to achieve a computation-reproducible analysis. A Java GUI is provided to welcome users without computational skills in R.


Contents Section 1 rCASC: a single cell analysis workflow designed to provide data reproducibility
Since the end of the 90's omics high-throughput technologies have generated an enormous amount of data, reaching today an exponential growth phase. Analysis of omics big data is a revolutionary means of understanding the molecular basis of disease regulation and susceptibility, and this resource is accessible to the biological/medical community via bioinformatics frameworks. However, because of the fast evolution of computation tools and omics methods, the reproducibility crisis is becoming a very important issue [Nature, 2018 ] and there is a mandatory need to guarantee robust and reliable results to the research community [Global Engage Blog].
Our group is deeply involved in developing workflows that guarantee both functional (i.e. the information about data and the utilized tools are saved in terms of meta-data) and computation reproducibility (i.e. the real image of the computation environment used to generate the data is stored). For this reason, we are managing a bioinformatics community called reproducible-bioinformatics.org [Kulkarni et al.] designed to provide to the biological community a reproducible bioinformatics ecosystem [Beccuti et al.].
rCASC, reproducible Cluster Analysis of Single Cells, is part of the reproducible-bioinformatics.org project and provides single cell analysis functionalities within the reproducible rules described by Sandve et al. [PLoS Comp Biol. 2013 ]. rCASC is designed to provide a workflow ( Figure 1) for cell-subpopulation discovery.
The workflow allows the direct analysis of fastq files generated with 10X Genomics platform, InDrop technology or a count matrix having as column-names cells identifier and as row names ENSEMBL gene annotation. In the following paragraphs the functionalities of rCASC workflow are described. • 32 Gb RAM • 2.6 GHz Core i7 6700HQ with 8 threads.
• 500 GB SSD The analysis time can be significantly improved increasing the number of cores. Implementation of the workflow for computers farm, using swarm, is under implementation.

Section 1.2 Installation
The minimal requirements for installation are: • A workstation/server running 64 bits Linux.
• Docker daemon installed on the machine, for more info see this document: https://docs.docker.com/engine/installation/ .
• A scratch folder, e.g. /data/scratch, possibly on a fast I/O SSD disk, writable by everybody: The functions in rCASC package require that user belongs to a group with the rights to execute docker. See the following document for more info: https: //docs.docker.com/install/linux/linux-postinstall/ The following commands allow the rCASC installation in R environment:     ensembl.urlgtf : the link to the ENSEMBL GTF of the genome of interest.
bowtie.index.prefix, the prefix name of the bowtie index. If genome was generated with indropIndex function the bowtie index is genome (default).
-M, integer. Ignore reads with more than M alignments, after filtering on distance from transcript end, suggested value 10 -U, integer. Ignore counts from UMI that should be split among more than U genes, suggested value 2 -D, integer. Maximal distance from transcript end, suggested value 400.
The output of the above analysis are two counts matrices results_cellranger.cvs and results_cellranger.txt and a folder called results_cellranger, which contains the full cellranger output, more information on cellranger output can be found at 10XGenomics web site.

Section 2.3 Smart-seq full transcript sequencing.
Smart-seq protocol generates a full transcript library for each cell, i.e. a fastq file for each cell. To convert fastq in counts we suggest to use rnaseqCounts or wrapperSalmon counts from docker4seq package [Kulkarni et al.]. Both above-mentioned functions are compliant with minimal hardware requirements indicated for rCASC and are part, as rCASC, of the Reproducible Bioinformatics Project. rnaseqCounts is a wrapper executing on each fastq: • quality evaluation of fastq with FastQC software, • trimming of adapters with skewer, • mapping reads on genome using STAR and counting isoforms and genes with RSEM .
wrapperSalmon instead implements FastQC and skewer and calculates isoforms and genes counts using Salmon software.

Section 3 Counts matrix manipulation
This paragraph describes a set of functions that can be used to remove low quality cells and non-informative genes from counts tables generated by any single-cell sequencing platform, Fig. 8  The function filterZeros retains all genes having a user defined fraction of zeros (between 0 and 1, where 1 indicate that only genes without any 0s are retained, and 0 indicates that genes with at least a single value different from zero are retained), and plots the frequency distribution of gene counts in the dataset.  (GUI Counts table), a character string indicating the path of the tab delimited file of cells un-normalized expression counts threshold (GUI Threshold), a number from 0 to 1 indicating the fraction of max accepted zeros in each gene. 0 is set as default and it eliminates only genes having no expression in any cell.
sep (GUI Separator), separator used in count file, e.g. '\t', ',' The output is a PDF providing zeros distributions before and after removal of genes with 0s counts. A tab delimited file with the prefix filtered_ in which the filtered data are saved.

IMPORTANT:
In case user would like to apply cell quality filter, e.g. lorenzFilter, it is convenient to remove only genes with 0 counts in all cells, i.e. threshold=0 ( Figure 10).

Section 3.2 Plotting genes numbers versus total UMIs/reads in each cell
To estimate the overall number of genes detectable in each cell, the function genesUmi generates a plot of the number of genes present in a cell with respect to the total number of UMI in the same cell. The number of UMIs required to call a gene present in a cell is a parameter defined by the user, the suggested value is 3 UMIs. The function mitoRiboUmi plots genesUmi output and also generate the percentage of mitochondrial protein genes with respect to percentage of ribosomal protein genes for each cell. The output are two pdfs named respectively genes.umi.pdf ( Figure 12A), where each dot represents a cell.
X axis is the total number of UMIs/reads mapped on each cell in log10 format and Y axis is the number of detected genes and Ribo_mito.pdf ( Figure 12B), Percentage of mitochondrial protein genes plotted with respect to percentage of ribosomal protein genes. Cells are colored on the basis of total number detected genes. The latter plot is useful to identify stressed cells, i.e. those with high percentage of mitochondrial genes ( Figure 12B, black cells), and low informative cells, i.e. those in which genes are few and mainly ribosomal/mitochondrial ( Figure 12B, black and green cells). unzip("testSCumi_mm10.csv.zip") mitoRiboUmi(group="docker", file=paste(getwd(), "testSCumi_mm10.csv", sep="/"), scratch.folder="/data/scratch", separator=",", umiXgene=3, gtf.name="Mus_musculus.GRCm38.94.gtf", bio.type="protein_coding") genesUmi(file=paste(getwd(),"testSCumi_mm10.csv",sep="/"), umiXgene=3, sep=",") In Figure 12 it is shown the distribution of genes in cells for 'filtered_testSCumi_mm10.txt' counts  Figure 13). Ziegenhain observation clearly also apply to sub-populations clustering. Sequencing depth, which affects differential expression analysis and sub-population partitioning, influences the structure of a single-cell dataset at two levels: • number of genes called present in the experiment, • robustness of gene expression, i.e. number of reads associated to a gene.
In particular, the number of genes called present in the experiment is the key element for the discrimination between sub-populations. In Figure 14 it is shown the effect of sequencing depth on the number of detectable genesUmi(file=paste(getwd(),"250K_counts.txt",sep="/"), umiXgene=5, sep="\t") system("mv genes.umi.pdf genes.umi_250K.pdf") genesUmi(file=paste(getwd(),"100K_counts.txt",sep="/"), umiXgene=5, sep="\t") system("mv genes.umi.pdf genes.umi_100K.pdf") genesUmi(file=paste(getwd(),"25K_counts.txt",sep="/"), umiXgene=5, sep="\t") system("mv genes.umi.pdf genes.umi_25K.pdf") Figure 14 clearly shows that the number of genes detectable by 10XGenomics sequencing ( Figure 14A-C) is far less of those detectable using a whole transcript experiment ( Figure 14D-F). In the case of 25K reads/cells in 3' end sequencing ( Figure 14A) the number of called genes goes from a dozen of genes to 350. In a whole transcript sequencing experiment where 25K reads/cells were considered ( Figure 14D), 350 genes is the lowest number of genes detectable in a cell. In 10XGenomics experiment with 250K reads/cell the range of detectable genes goes from few hundred to 2000, which is relatively similar to the number of detectable genes in a whole transcripts experiment where the same number of reads/cells were considered. The larger scattering and the overall lower number of detectable genes in 3' end sequencing experiment, with respect to whole transcript experiments, is a peculiarity of the technology [Ziegenhain et al. 2017 ].
It has also to be highlighted that cells with a very low number of genes called present will have a genes repertoire made mainly of housekeeping genes, ribosomal and mitochondrial genes, see Figure 19 in Section 3.4. Thus, the lack of cell-type specific genes makes these cells useless for the identification of functional cell sub-populations.
However, it has to be underlined that each cell type is characterized by a peculiar transcriptional rate and therefore, the technical assessment of the reads per cell vs. genes detected between different cell types, i.e.
14A-C, might be bias by differences in the transcriptional rate of the different cells used in this specific example. Instead, the above-mentioned bias does not affect 14D-F because they are generated by a down sampling of a set of cells sequenced with the smart-seq protocol at a coverage of 1 million reads/cell. Figure 14: Number of detected genes with respect to mapped reads. A) 25K reads/cell 10XGenomics platform, 3' end sequencing v2 chemistry. B) 83K reads/cell 10XGenomics platform, 3' end sequencing v2 chemistry. C) 250K reads/cell 10XGenomics platform, 3' end sequencing v2 chemistry. D) 25K reads/cell C1 platform, whole transcript sequencing, E) 100K reads/cell C1 platform, whole transcript sequencing, F) 250K reads/cell C1 platform, whole transcript sequencing.
The above observations indicate that 3' end sequencing has a much lower genes called present with respect to whole transcript sequencing.
We have further investigated the following point: • Is the number of total UMIs/cell affecting the separation between sub-populations?
To address the above point, we used two types of cells belonging to the T-cells [Zheng 2016]. The two sets of cells were sequenced with a coverage of approximately 21K reads/cell: • T-cytotoxic (10209 cells, Figure 15A) cells, • T-regulatory (10263, Figure 15B) cells.
We generated three datasets: • d3.4, which is made of 100 cells randomly selected within cells having, in each of the two datasets, a total UMIs/cell value greater than 2511 in each cell.
• d3, which is made of 100 cells randomly selected within cells having, in each of the two datasets, a total UMIs/cell value comprised between 2511 and 1000 in each cell.
• d3m, which is made of 100 cells randomly selected within cells having, in each of the two datasets, a total UMIs/cell smaller than 1000 in each cell.
Thus, since 3' end sequencing platforms (10Xgenomics, inDrop) has the advantage to produce high number of sequenced cells, users might decide to select for clustering only the subset of cells with the highest number of genes called present.  The output is a counts table without low quality cells and with the prefix lorenz_filtered_. Output will be in the same format and with the same separator of input. lorenzFilter(group="docker",scratch.folder="/data/scratch/", file=paste(getwd(),"testSCumi_mm10.csv", sep="/"), p_value=0.05, separator= , ) tmp0 <read.table("testSCumi_mm10.csv", sep=",", header=T, row.names=1)

#782 cells
In the example above 24 cells were removed because of their low quality ( Figure 17).

Section 3.4 Annotation and mitochondrial/ribosomal protein genes removal
The function scannobyGtf allows the annotation of single-cell matrix, if ENSEMBL gene ids are provided.
The function requires the ENSEMBL GTF of the organism under analysis and allows the selection of specific annotation biotypes, e.g. protein_coding.  The output are a file with prefix annotated_, containing all annotated genes and a file with prefix filtered_annotated_ which contain the filtered subset of cells. Furthermore, the filteredStatistics.txt indicates how many cells and genes were removed.
Ribosomal RNA and ribosomal proteins represent a significant part of cell cargo. Large cells and actively proliferating cells will have respectively more ribosomes and more active ribosome synthesis [Montanaro et al. 2008 ]. Thus, ribosomal proteins expression might represent one of the major confounding factor in cluster formation between active and quiescent cells. Furthermore, the main function of mitochondria is to produce energy through aerobic respiration. The number of cell mitochondria depends on its metabolic demands [Nasrallah and Horvath 2014 ]. This might also affect clustering, favoring the separation between metabolic active and resting cells, with respect to functional differences between sub-populations. scannobyGtf allows also the removal of mitochondrial and ribosomal protein genes. library(rCASC) scannobyGtf(group="docker", file=paste(getwd(),"testSCumi_mm10.txt",sep="/"), gtf.name="Mus_musculus. GRCm38.94.gtf",biotype="protein_coding",mt=FALSE,ribo.proteins=FALSE,umiXgene=3,riboStart.percentage=10,riboEnd.percentage=70,mitoStart.percentage=0,mitoEnd.percentage=5,thresholdGenes=100) In figure 19B is shown the effect of the removal of both mitochondrial and ribosomal protein genes and the removal of cells characterized by high percentage of mitochondrial protein genes (>5%) and too litte (<10%) or too much (>70%) ribosomal protein genes .

Section 3.5 Top expressed genes
For clustering purposes user might decide to use the top expressed/variable genes. The function topx provides two options: • the selection of the X top expressed genes given a user defined threshold, parameter type="expression" • the selection of the X top variable genes given a user defined threshold, parameter type="variance" gene variance is calculated using edgeR Tag-wise dispersion. The method estimates the gene-wise dispersion implementing a conditional maximum likelihood procedure. For more information please refer to edgeR Bioconductor package manual.

IMPORTANT:
The core clustering tool in rCASC is SIMLR, Section 5. SIMLR requires that the number of genes must be larger than the number of analyzed cells.

Section 3.6 Data normalization
The best way to normalize single-cell RNA-seq data has not yet been resolved, especially in the case of UMI data. We inserted in our workflow two possible options: • SCnorm, which works best with whole transcript data.
• scone, which provides different global scaling methods that can be applied to UMI single-cell data.

Section 3.6.1 SCnorm
SCnorm performs a quantile-regression based approach for robust normalization of single-cell RNA-seq data.
SCnorm groups genes based on their count-depth relationship then applies a quantile regression to each group in order to estimate scaling factors which will remove the effect of sequencing depth from the counts.
IMPORTANT: SCnorm is not intended for datasets with more than~80% zero counts, because of lack of algorithm convergence in these situations.

Section 3.6.1.1 Check counts-depth relationship
Before normalizing using scnorm, it is advised to check the data count-depth relationship. If all genes have a similar relationship then a global normalization strategy such as median-by-ratio in the DESeq package or TMM in edgeR will also be adequate. However, when the count-depth relationship varies among genes global scaling strategies leads to poor normalization. In these cases, the normalization provided by SCnorm is recommended. If not provided data is assumed to come from same condition/batch.
-ditherCounts (GUI UMI), boolean. Setting to TRUE might improve results with UMI data.
-FilterCellProportion (GUI Min non-zero cells), a value indicating the proportion of non-zero expression estimates required to include the genes into the evaluation. Default is .10, and will not go below a proportion which uses less than 10 total cells/samples -FilterExpression (GUI Med non-zero expr. threshold), a value indicating exclude genes having median of non-zero expression below this threshold from count-depth plots #' @param ditherCounts, Setting to TRUE might improve results with UMI data, default is FALSE #' @param outputName, specify the path and/or name of output file.

#N.B. checkCountDepth function requires as input raw counts
The output is a PDF (Figure 22), providing a view of the counts distribution, and a file selected.genes.txt, which contains the genes selected to run the analysis. Section 3.6.1.

scnorm
The scnorm function executes SCnorm from SCnorm package, which normalizes across cells to remove the effect of sequencing depth on the counts and returns the normalized expression count. -outputName, specify the name of output file.
-nCores, number of cores to use.
-filtercellNum (Min non-zero cells), the number of non-zero expression estimate required to include the genes into the SCnorm fitting (default = 10). The initial grouping fits a quantile regression to each gene, making this value too low gives unstable fits.
-ditherCount (GUI UMI), boolean. Setting to TRUE might improve results with UMI data.

IMPORTANT:
In case sub-population discovery is the analysis task, it is important to check if a specific normalization is compliant with the clustering approach in use. For example, in the case of SIMLR, the rCASC core clustering tool, the normalizations provided in scone are not compliant, because they remove some of the features required to run the SIMLR multi-kernel learning analysis. TMM is instead compliant with the rCASC implementation of tSne. In case Seurat clustering is used the dataset does not required any normalization since a normalization procedure is included in the algorithm.  Counts table), full path to the file MUST be included.

Section 3.8 Detecting and removing cell cycle bias
Single-cell RNA-Sequencing measurement of expression often suffers from large systematic bias. A major source of this bias is cell cycle, which introduces large within-cell-type heterogeneity that can obscure the differences in expression between cell types. Barron and Li developed in 2016 a R package called ccRemover which removes cell cycle effects and preserves other biological signals of interest.
However, before applying ccRemover, it is essential to address if the removal of cell cycle effect is required.
reCAT is a modeling framework for unsynchronized single-cell transcriptome data that can reconstruct cell cycle time-series. Thus, reCAT cell cycle prediction step can be used to check if cell cycle effect can be detected in a dataset and therefore ccRemover normalization approach will be needed.

Section 3.8.1 Evaluating the presence of cell cycle effect in a dataset: reCAT
reCAT prediction step is implemented in rCASC in the function recatPrediction, which requires a data set annotated using scannobyGtf.
• Parameters (only those without default; for the full list of parameters please refer to the function help) To execute the analysis on the same number of cells, 288 cells were randomly selected from quiescent naive T-cells dataset. In Figure 28A the presence of oscillatory behavior is evident in the predicted cells time series and the G1 and G2M trends are indicated respectively in blue and red dashed curves. On the other hand, the oscillatory behavior is totally absent ( Figure 28B) in the naive T-cells, which are expected to be quiescent in G0.
• Parameters (only those without default; for the full list of parameters please refer to the function help) cutoff, p-value to use: 3 is almost equal to 0.05 species, human or mouse -rawCount, 1 for unlogged and not-normalized, 0 otherwise IMPORTANT: The output of ccRemover does not require log transformation before clustering analysis.
ccRemove output is compliant with SIMLR, the rCASC core clustering tool. ccRemove output does NOT require log transformation when applied to SIMLR. In Figure   Section 4 Estimating the number of clusters to be used for cell sub-population discovery.
The rCASC core clustering tool is SIMLR, which requires as input the number of clusters to be used to aggregate cell sub-populations. Unfortunately, there is no definitive answer to the definition of the most probable number of clusters, in which cells will aggregate. Some of the possible ways to identify the most probable number of clusters is summarised in: "Determining the optimal number of clusters: 3 must known methods -Unsupervised Machine Learning". In rCASC, the identification of the optimal number of clusters is addressed, in presence of cells number perturbations, with griph. The clustering performed by griph is graph-based and uses the community detection method Louvain modularity. Griph algorithm is closer to agglomerative clustering methods, since every node is initially assigned to its own community and communities are subsequently built by iterative merging. Griph is embedded clusterNgriph function, which evaluates the number of clusters in which a set of cells will aggregate upon a user defined leave-N%-out cells bootstraps. In the example below the number of clusters are detected for the file 'annotated_buettner_G1G2MS_counts_10000bis.txt', used in Section 3.8.
• Parameters (only those without default; for the full list of parameters please refer to the function help) clusterNgriph(group="docker",scratch.folder="/data/scratch/",file=paste(getwd(), "annotated_buettner_G1G2MS_counts_10000bis.txt", sep="/"), nPerm=160, permAtTime=8, percent=10, separator="\t",logTen=0, seed=111) In Figure 33 it is shown the output generated by clusterNgriph. The output folder is called Results and it is located in the folder from which the analysis started. Within Results is present a folder named as the dataset used for the analysis. In this case 'annotated_buettner_G1G2MS_counts_10000bis'. In the latter folder is present a folder, in this specific example '5', named with the number of clusters that were more represented as result of the bootstrap analysis. The file indicated with the blue arrow contains all the information to generate the griph output plot, used as reference to allocate cells to a specific cluster at each bootstrap step. The file indicated with the green arrow contains the cluster position for each cell over all bootstrap steps. The file indicated with the red arrow contains the cells removed at each bootstrap step. The file called 'hist.pdf', indicated with the black arrow, is the plot of the frequency of different number of clusters generated by griph as consequence of the bootstrap steps. In this specific case, over 160 permutations, 80 produced 5 clusters, 70 produced 4 clusters and 10 produced 6 clusters.
It has to be noted that in principle, since this dataset has a strong cell cycle effect, we would have expected ideally only three clusters: G1, S and G2M. This toy experiment clearly shows that perturbation of the dataset under analysis can affect the number of detectable clusters. Thus, to identify the clustering condition which guarantees the greatest cell stability in a cluster, in our opinion it is mandatory clustering cells taking in account perturbation effects. In Section 4.2 we further investigate this issue.  PC1 shows that, as the differences between cell populations become smaller, moving from setA to setD, the aggregation in homogeneous groups of cells is compromised ( Figure 34A-D).
One of the peculiarities of rCASC is the user tunable bootstrap procedure. rCASC represents bootstrap results via a cell stability score (Figure 37). In brief, a set of cells to be organized in clusters ( Figure 37A) is analyzed with SIMLR, applying a user defined k number of clusters ( Figure 37B). A user defined % of cells is removed from the original data set and these cells are clustered again ( Figure 37C). The clusters obtained in each bootstrap step are compared with the clusters generated on the full dataset using Jaccard index ( Figure 37D-E). If the Jaccard index is greater of a user defined threshold, e.g. 0.8, the cluster is called confirmed in the bootstrap step ( Figure 37F). Then to each cell, belonging to the confirmed cluster, cell stability score value is increased of 1 unit ( Figure 37G). At the end of the bootstrap procedure, cells are labeled with different symbols describing their cell stability score in a specific cluster ( Figure 37H).

Section 5.1 Cell Stability Score: mathematical description.
Stability score Algorithm Let be C the count matrix with N × M dimension where N is the gene number and M is cell number.
Then, we define C p the matrix generated by p th permutation removing q random columns from the original matrix C. Moreover considering the p th permutation we denote L p the set of all the removed cells in p th permutation with |L p | = q and cl p the vector with length M − q encoding the relation between cells and clusters in p. Hence, cl p i identified the cluster in which the i th cell is inserted in permutation p. Finally, we use the notation cl C for indicating the output of the clustering algorithm obtained by all the cells (i.e. matrix C).
The relation symmetric matrix R p with dimension M × M is defined as follows: where r p i,j is: Similarly we defined R C the relation symmetric matrix obtained considering all cells and R C−L P as the relation symmetric matrix R C in which the columns and the rows associated with cells in L p are removed. Observe that the sum R C−L P + R p will always gives as results a matrix with 3 possible values: 0,1 or 2.
if cell i and cell j clustered together in R C−L p or in R P only; 2 if cell i and cell j are always in the same cluster; 0 if cell i and cell j are never in the same cluster.
Let δ(i, k) be the kronecker delta, a function of two variables returning 1 if the variables are equal, and 0 otherwise: then we define the function length that counts the occurrence of a value k fixing the row j in the matrix R C−L P + R p .
We use the function length to count the occurrence of 1 or 2 in in the matrix R C−L P + R p .
Finally, we define the permutation score pscore j,p as: where p is a permutation and j a cell. Then, this returns the percentage of cells initially clustered with cell j that remain clustered with cell j in the permutation p. Then, we define tscore j,s as follow: tscore j ,s = 1 P p∈P 1 pscore j ,p >=s where P is the total number of permutations and s is user-defined threshold. This metric compute the probability that a cell j is always clustered with the same set of cells given that pscore j,p >= s.
In each of these folders there are a set of folders indicated with a number that refers to the analyzed range of number of clusters (Figure 39 Folder 3). In this specific example the range of clusters goes from 4 to 6. In each NameOfCountMatrix folder there is a file called _Stability_Violin_Plot.pdf ( Figure 39A) which represents the distribution of the cells stability scores over the bootstraps in the range of number of clusters investigated. Figure 39A clearly show that the analysis done with k=5 is the one providing the highest stability of cells within each cluster. The other plot that is also available in the folder ( Figure   39 Folders 4) is NameOfCountMatrix_vioplot.pdf, Figure 39B, which contains the distribution of the Silhouette values for each cluster over the bootstraps. Silhouette value is a measure of computation cluster stability, and evaluates how similar an object is to its own cluster (cohesion) compared to other clusters (separation). Clearly the information provided by Silhouette plot is much less informative for the definition of the optimal clustering number with respect to the information provided by the cells stability score ( Figure   39 blue arrow). In each clustering folder there is a pdf named NameOfCountMatrix_Stability_Plot.pdf, which contains two plots ( Figure 39C-D) generated by the clustering program. These plots provide a 2D view of the clustering results from two different perspectives. In Figure 39C plot each cell is colored on the basis of the belonging cluster and it is labeled with a symbol indicating its cell stability score (CSS). Instead, in the plot in Figure 39D each cell is colored on the basis of its CSS: 0-25% black, 25-50% green, 50-75% gold and 75-100% red. Here, we show the plots generated with k=6 to better describe the information described in the above-mentioned clusters plots, as in k=5 clusters all cells have a CSS greater than 75%. In Figures 39C and D it is shown that 4 out of 6 clusters cells remain in a cluster between 75 to 100% of the bootstraps (+ symbol in Figure 39C and red dots in Figure 39D). The plot, Figure 39E, shows the genes detectable in each cell in function of the total number of reads/cell. In this plot cells are colored with the same color of their belonging cluster. This plot is useful to observe if the clustering is biased by the number of genes called in each cluster. In this specific example, only the green cluster is characterized by a number of detected genes, which is larger of those detectable in the other clusters ( Figure 39E). This is expected, since blue cluster is made of Monocytes, which have been sequenced to 100K reads/cells, as all other cells in setA were sequenced between 19 to 29K reads/cell, see Section 4.2.
In Figure 40 are shown the effects, on cell stability, in SIMLR analysis done with 4 and 6 clusters for SetA  Figure 40A, some of the cells in each cluster show a reduced stability (50-75%, triangle) and arrows highlights cells characterized by a cell stability score between 25 to 50%. The circles in Figure 40B show the clusters where the full cluster has a cell stability between 50 to 75%. This observation indicates that the stability score is a useful measure to identify the optimal number of partitions to be used, i.e. the highest cell stability score was observed when five clusters were selected, corresponding to the number of cell types combined in SetA, see Section 4.2.
The effect of the perturbations induced in the clustering upon the removal of 10%, 20%, 30% and 50% of the data set was also investigated in SetA (annotated_bmsnkn_5x100cells.txt), Figure 41.
In Figure 44A is shown a screenshot of the output of bootstrapsVideo. The output of this function provides extra information with respect to the standard output of the simlrBootstrap and tsneBootstrap, since the video provides an overview of the area (colored circles) in which the cells of a specific cluster are localized as consequence of the bootstraps. This visualization provides a better description of the maximum size of the clusters and simplify the identification of clusters that are more near to each other, because of clusters structure, Figure 44A. In this specific example, is notable that, even if the yellow cluster has a high cell stability score, Figure 44B, its size is much greater of that observable for all the other clusters, because few cells with lower cell stability score (50-75%, triangle) are located very near to the blue cluster, arrow in Figure   44B.

Section 5.3 Estimating cluster stability.
Hennig published an elegant paper entitled Cluster-wise assessment of cluster stability. The R package fpc is associated to the paper. In this package, Jaccard index is used to evaluate the similarity between clusters.
The calculated score (cluster stability measurement, CSM) is intended to provide an overall quality score for each of the clusters in toto. In particular, the clusterboot function allows to evaluate the cluster stability using a personalized clustering function. We have implemented this function to estimate CSM of the clusters generated with SIMLR, Seurat and tSne. • Parameters (only those without default; for the full list of parameters please refer to the function help): group, a character string. Two options: sudo or docker, depending to which group the user belongs scratch.folder, a character string indicating the path of the scratch folder file, a character string indicating the path of the file, with file name and extension included -nPerm, number of permutations to perform the pValue to evaluate clustering -range1, first number of cluster for k means algorithm -range2, last number of cluster for k means algorithm separator, separator used in count file, e.g. ' ',',' -logTen, 1 if the count matrix is already in log10, 0 otherwise clustering, clustering method to use : "SIMLR" , "tSne", "griph" perplexity, Number of close neighbors for each point. This parameter is specific for tSne. Default value is 10.Setting this parameter when use a clustering method different by tSne will be ignored.

Section 6.1: Seurat
Recently Freytag et al. and Do et al. described, in their independent comparison papers, that Seurat delivered the overall best performance in cells clustering. Thus, we decided to include in our workflow also this clustering tool. Seurat clusters cells based on their PCA scores, with each PC essentially representing a 'metagene' that combines information across a correlated gene set. The bootstrap approach described in section 5.1 is also applied to Seurat clustering to assign cell stability score to the clustered cells. The first step of the analysis is the identification of the how many PCs have to be to included.
seuratPCAEval function allows the exploration of PCs to determine relevant sources of heterogeneity and to define the range of PCs to be used. #defining the PC threshold for clustering. seuratPCAEval(group="docker",scratch.folder="/data/scratch/", file=paste(getwd(), "bmsnkn_5x100cells.txt", sep="/"), separator="\t", logTen = 0, seed = 111) The function generates a Result folder in which is present a folder with the same name of the dataset under analysis, in this specific case "bmsnkn_5x100cells". In the latter folder is present a PCA plot (PCE_bowPlot.pdf, Figure 48A), which allows the identification of the PCs to use, defining a cutoff where there is a clear elbow in the graph. In this example, it looks like the elbow would fall around PC 6. seuratBootstrap function executes the seurat clustering (data reduction method PCA) and estimates cell  seed, important value to reproduce the same results with same input #N.B. seuratBootstrap requires in input raw counts or log10 transformed raw counts seuratBootstrap(group="docker",scratch.folder="/data/scratch/", file=paste(getwd(), "bmsnkn_5x100cells.txt", sep="/"), nPerm=160, permAtTime=8, percent=10, separator="\t", logTen=0, pcaDimensions=6, seed=111) The function generates a Result folder in which is present a folder with the same name of the dataset under analysis, in this specific case "bmsnkn_5x100cells.txt". In the latter folder is present a folder with the number of clusters detected by Seurat, in this specific case 5. In 5 folder there is the file with extension _Stability_Plot.pdf, which is the clustering picture with cells labeled with their stability score, Figure   48B). The file with the extension _clusterP.txt, containing for each permutation the location of each cell in the clusters. The file with the extension _killedCell.txt, where are saved the cells removed at each permutation. The file with the extension _score.txt, containing the scores assigned at each permutation to each cell. The file with extension _scoreSum.txt containing the Cell stability score for each cell.

# required time 9 mins
We run the analysis on the SetA described in Section 4.1. SIMLR Seurat and griph detect the 5 clusters, i.e. 5 cell types, and all clusters show an high CSC (Figure ??A-C). However, tSne detects the 5 clusters, but they are less stable in terms of CSS (Figure ??E). For scanpy we tested various combination of perplexity and PCs number but we failed to find a condition detecting the 5 clusters and clusters stability is very poor.
From the point of view of the computation time, the above mentioned analysis took 159 mins for tSne, 168 mins for SIMLR, 194 mins for Seurat, 61 mins for griph and only 9 minus for scanpy.

Section 7 Detecting the genes playing the major role in clusters formation
Nowadays there are a lot of methods for the identification of differentially expressed genes between two populations of single-cell experiments [Soneson et al.]. However, the number of tools applicable to single-cells and allowing something similar to an ANOVA are very few. An ANOVA-like approach is described as applicable to single-cells in edgeR Bioconductor package. The edgeR ANOVA-like requires a comparison with respect to a reference set of cells. SIMLR and Seurat also provide ranking score for the genes, which are affecting mostly the clusters organization. SIMLR and Seurat gene ranking do not require a reference cluster for the analysis, thus providing wider applications with respect to ANOVA-like. The three methods are part of the rCASC framework.

#N.B. anovaLike requires in input raw counts
anovaLike(group="docker", data.folder=getwd(), counts.table="annotated_setPace_10000_noC5.txt", file.type="txt", cluster.file="annotated_setPace_10000_noC5_clustering.output.txt ",ref.cluster=3,logFC.threshold=1,FDR.threshold=0.05,logCPM.threshold=4,plot=TRUE) #the output of interest is logFC_filtered_DE_annotated_setPace_10000_noC5_reordered.txt The function clustersFeatures selects the genes, which are more up-regulated in a specific cluster using anovaLike outputs. This function focus only on upregulated genes because the ANOVA-like procedure compares all clusters with respect to a reference cluster. Thus, up-regulated genes are genes specific of the cluster under analysis as those down-regulated are characteristics of the reference cluster. The delta parameter describes the minimal distance existing, for a specific gene average logFC in a specific cluster, with respect to all the other clusters under analysis. The outputs of the function are four tab delimited files: • the file with prefix onlyUP_, followed by the counts table name, i.e. the count table only containing the selected genes, • the file with prefixonlyUP_, followed by logFC_filtered_DE_, the table containing logFC only for the selected genes, • the file onlyUP_clusters_specific_genes.txt, which contains the list of specific genes associated with the corresponding cluster.

Section 7.2 A machine learning approach: SIMLR genes prioritization
SIMLR gene prioritization approach relies on the output similarity and does not depend on the downstream dimension reduction or clustering. The advantage of this step is that one can identify highly fluctuating genes that are consistent with the learned similarity in the multiple-kernel space.
The function genesSelection allows to run a selection of the most interesting genes for clustering on the basis of two parameters: • maxDeltaConfidence: max value for Delta confidence for genes features selection. Default is 0.01 • minLogMean: min value for Log mean for genes feature selection. Default is 0.05 The above values can be set in the genesSelection function on the basis of the output of geneRank.pdf ( Figure 61). separator, separator used in count file, e.g. '\t', ',' seed, important value to reproduce the same results with same input, default 111 sp, minimum number of percentage of cells that has to be in common in a cluster, between two permutations, default 0.8 -clusterPermErr, probability error in depicting the number of clusters in each permutation, default = 0.05 -maxDeltaConfidence, max value for Delta confidence for genes feature selection -minLogMean, min value for Log mean for genes feature selection #dataset derived from the exemplary analysis shown in section 8 genesSelection(group="docker",scratch.folder="/data/scratch/", file=paste(getwd(), "annotated_setPace_10000_noC5.txt", sep="/"), nCluster=5, separator="\t", seed=111, sp=0.8, clusterPermErr=0.05, maxDeltaConfidence=0.01, minLogMean=20) #mv annotated_setPace_10000_noC5_clusters_specific_geneSYMBOLs.txt in the main folder hfc(group="docker",scratch.folder="/data/scratch", file=paste(getwd(),"annotated_setPace_10000_noC5.txt", sep="/"), nCluster=5, separator="\t", lfn="annotated_setPace_10000_noC5_clusters_specific_geneSYMBOLs", geneNameControl=1, status=0, b1="/-1", b2="1") The output of genesSelection is a file with extension **_clusters_specific_geneSYMBOLs.txt, which can be used as input for hfc** function, see Section 6.1.1. The list of prioritized genes cannot not be associated to a specific cluster as in the case of the anovaLike, see Section 6.1. Thus, more information will be provided on the cluster specificity from the heatmap visualization (Figure ??), see hfc function.

Section 8 Supporting tools
The Conversion of a dense matrix, with 0s in sparse format, not including 0s, is performed by csvToSparse function and the conversion from a sparse to a dense matrix is done by h5tocsv unzip("annotated_setPace_10000_noC5.txt.zip") csvToSparse(group="docker", scratch="/data/scratch", file=paste(getwd(), "annotated_setPace_10000_noC5.txt", sep="/"), separator="\t") • h5tocsv parameters (only those without default; for the full list of parameters please refer to the function help) (Figure 64): group, a character string. Two options: sudo or docker, depending to which group the user belongs +file, path of the sparse matrix file. For h5 file the full path MUST be included. For mtx matrix the folder MUST contain tsv and mtx files and the FULL path to mtx matrix MUST be provided type, h5 refers to h5 files and 10xgenomics to the folder containing barcodes.tsv, genes.tsv and matrix.mtx system("wget http://130.192.119.59/public/annotated_setPace_10000_noC5.txt.zip") unzip("annotated_setPace_10000_noC5.txt.zip") csvToSparse(group="docker", scratch="/data/scratch", file=paste(getwd(), "annotated_setPace_10000_noC5.txt", sep="/"), separator="\t") h5tocsv(group="docker", file=paste(getwd(),"matrix.mtx",sep="/"), type="10xgenomics") A random subset from a matrix can be extracted using subSetCell function. Two matrices can be merged using mergeMatrix function. unzip("annotated_setPace_10000_noC5.txt.zip") subSetCell(group="docker", scratch.folder="/data/scratch", file=paste(getwd(), "annotated_setPace_10000_noC5.txt",sep="/"), separator="\t", cells.number=200) mergeMatrix(group="docker", scratch.folder="/data/scratch",file1=paste(getwd(),"annotated_setPace_10000_ The function dimensions generates a file containing the size of the matrix under analysis. On the basis of our observation in Section 3.2.1, we selected, for each of the above mentioned groups, the cells with higher number of total reads/cell. Specifically, we analyzed 600 cells, combining respectively 50 cells from N and Nd and 250 cells for NA and NdA. The rational of the different number of cells for the naive and the activated is due to the expectation that activated cells will be organize in multiple clusters and naive are expected to be homogeneous. Cell labels in the counts table were modified adding the groups N, Nd, NA and NdA.
Using this dataset, we tried to address the following questions: 1. Does Suv39h1 silence gene expression program in naive CD8+ T cells?
2. Does Suv39h1 silence gene expression program in activated CD8+ T cells?
3. Does Suv39h1 silencing affect the differentiation of new subsets of activated CD8+ T lymphocytes?
In Figure 67 are summarized the results of the analysis executed on the Pace's dataset. A limitation of the clustering based on SIMLR is due to the need of providing as input the number of clusters (k) in which the data should be organized. Instead of asking to user to define arbitrarily the k number of clusters, we used griph as tool to identify a range of k clusters to be inspected by SIMLR. Figure 67A shows the frequency of k number of clusters, in which the Pace's dataset can be organized using griph software, upon 160 bootstraps in which 10% of the cells is randomly removed from the initial data set. Griph analysis identify a range of clusters going from k=6 to k=9. K=7 is the most represented data organization detected by griph, followed suggesting a reduction in efficiency in the differentiation of this sub-population upon Suv39h1 silencing.
3. Does Suv39h1 silencing affect the differentiation of new subsets of activated CD8+ T lymphocytes? a. Yes, cluster 4 represents a unique Suv39h1 subset of activated CD8+ T cells.

#execution time 276 mins in SeqBox
The results of the analysis done removing cells associated to cluster 5 are shown in Figure 68. The range of clusters suggested by clusterNgriph are between 5 and 9, Figure 68A. The best cells stability score is observable for 5 clusters, Figure 68B. simlrBootstrap results show a large improvement of the overall stability of the clusters Figure 68C with respect to the previous analysis, Figure 67C. It is also notable that the samples organization with in clusters agrees with preliminary answers in Section 5.1.1 . Figure 68: Analysis of Pace dataset. A) Clusters detected by clusterNgriph. B) Cell stability score detected by simlrBootstrap. C) Clusterization with 5 clusters with simlrBootstrap, in parenthesis is given the overall stability score of each cluster. The components of each cluster, in terms of cells experiment groups, are also indicated for the most stable clusters.
The cluster 2 has some traits of memory cells, on the basis of EnrichR analysis of 100 genes. Enrichment analysis shows enrichment for STAT3 transcription specific paths and GO Biological process enrichment for response to cytokine. Furthermore, a PUBMED search showed the presence in this cluster of memory specific genes such as TCF7 ,  The paper Pace showed that Suv39h1-defective CD8+ T cells show sustained survival and increased long-term memory reprogramming capacity. Our reanalysis with rCASC extends the information provided from the single cell analysis in Pace paper, suggesting the presence of an enriched Suv39h1-defective memory subset, cluster 2 in Figure 68C. Noteworthy, this example suggests that even few hundreds cells are sufficient to discriminate between sub-populations.

Section 10 Tips and tricks
In this section we highlight some points that might help data analysis.

Section 10.1 Data preprocessing
Lorenz statistics (lorenzeFilter function) was designed using C1 -Fluidigm and smart-seq libraries. There are no evidences that it performes optimally on UMI based sequencing. In case of UMI, we suggest the use of topx and filterZeros functions, which respectively allow the selection of a user defined set of genes with high dispersion/expression and the removal of genes with a user defined fraction of 0s.
Unless there is a specific interest on mitochondrial and ribosomal protein genes, we suggest to remove them before performing clustering (scannobyGtf, function). Specifically ribosomal genes represent a high fraction of the UMI counts associated to a cell and might result in driving the cells partioning.
The normalization for sequencing depth (checkCountDepth and scnorm functions) was designed for smart-seq single cell sequencing data. This normalization can be applied to UMI sequencing data but it requires at least 10K UMIs/cell. Cell cycle estimation (recatPrediction function) and removal (*ccRemove** function) were designed for smart-seq single cell sequencing data. In our hands they are also working on UMI seqencing data althought, if the cell number exceed few thousands, cell cycle estimation became very slow. We suggest to perform cell cycle estimation on a random subset of few hundreds cells (subSetCell function) of the data set under analysis.

Section 10.2 Clustering
As part of the output of the clustering procedure there is a plot of the number of detected genes as function of the number of cell UMIs. In this plot, each cell is colored on the basis of the cluster it belongs. This plot helps in understanding if cell clusters simply correlate with number of genes called in each cell. Ideally, there should not be any correlation between clusters and number of genes detected in a cell, since clustering should be driven by cells' biological content and not by the number of detected genes.
On the basis of our tests SIMLR and Seurat generated results comparable in terms of CSS and clusters homogeneity. Since griph and scanpy results correlate less with SIMLR and Seurat, we suggest to use them only if a very large set of cells has to be analysed, e.g. >20K cells.