Digital Cell Sorter (DCS): a cell type identification, anomaly detection, and Hopfield landscapes toolkit for single-cell transcriptomics

Motivation Analysis of singe cell RNA sequencing (scRNA-seq) typically consists of different steps including quality control, batch correction, clustering, cell identification and characterization, and visualization. The amount of scRNA-seq data is growing extremely fast, and novel algorithmic approaches improving these steps are key to extract more biological information. Here, we introduce: (i) two methods for automatic cell type identification (i.e., without expert curator) based on a voting algorithm and a Hopfield classifier, (ii) a method for cell anomaly quantification based on isolation forest, and (iii) a tool for the visualization of cell phenotypic landscapes based on Hopfield energy-like functions. These new approaches are integrated in a software platform that includes many other state-of-the-art methodologies and provides a self-contained toolkit for scRNA-seq analysis. Results We present a suite of software elements for the analysis of scRNA-seq data. This Python-based open source software, Digital Cell Sorter (DCS), consists in an extensive toolkit of methods for scRNA-seq analysis. We illustrate the capability of the software using data from large datasets of peripheral blood mononuclear cells (PBMC), as well as plasma cells of bone marrow samples from healthy donors and multiple myeloma patients. We test the novel algorithms by evaluating their ability to deconvolve cell mixtures and detect small numbers of anomalous cells in PBMC data. Availability The DCS toolkit is available for download and installation through the Python Package Index (PyPI). The software can be deployed using the Python import function following installation. Source code is also available for download on Zenodo: DOI 10.5281/zenodo.2533377. Supplementary information Supplemental Materials are available at PeerJ online.

it is clear how the use of the matrix Q i j improves the separation between the attractor states, as compared to Fig. 1. The landscape is only using the first two principal components. Fig. S3 provides a complete representation of the overlaps between the cell attractors on the top non-trivial (non-zero) principal components obtained after the Q transformation.

Batch correction
Data often contain noise associated to batch effects due to systematic experimental errors, collection procedures, and data handling (Tran et al. (2020); Lähnemann et al. (2020)). It is important to remove batch variations and technical noise in the data, yet preserving biologically relevant information. DCS includes a batch effect correction method, COMBAT, that has been developed specifically for microarray data, (Johnson et al. (2007); Stein et al. (2015)) but is also suitable for single cell transcriptomics data.
Handling of missing values is particularly important when dealing with batch effects correction in scRNA-seq. In our implementation of COMBAT we record the locations of missing values, i.e. zeros in the dataset, and perform the COMBAT transformation. Then, we replace with zeros any values that were missing and became non-zero after the transformation. We also replace with zeros any values that became negative. Fig. S5 demonstrates how scRNA-seq data of plasma cells from the bone marrow aspirates of 12 patients (MM01-MM12) split into clusters corresponding to different batches. Processing the data with COMBAT and applying our procedure for missing values properly aligns the multi-dimensional data.
Clustering DCS contains functions that implement different clustering methods: 1. Hierarchical clustering. This is in general a good choice for clustering single cell datasets (Luecken and Theis (2019)). However, this method becomes unfeasible when the size of the datasets goes beyond several tens of thousands cells. In this case DCS can use approximate methods such as k-means.
2. Network-based clustering methods. First a network of cells is constructed. Typically, it is a knn-graph (k nearest neighbors graph) with a cutoff k on the number of the nearest neighbors of each node. Then clusters are found using network-based algorithms such as modularity-based community detection (Newman (2010)).
3. Spectral co-clustering (Dhillon (2001)). Since this method is computationally demanding, it is recommended only for small subsets of the data. We have found that spectral co-clustering can accurately fineseparate cell sub-types when used with our cell type identification algorithms. In this case, we perform first a coarse clustering using one of the two methods above (e.g. to identify all T cells in the dataset) and then we use co-clustering to obtain cell subtypes (e.g. T CD8+, T CD4+, T memory, T naive, etc.).

Projection of high-dimensional data on 2D layout
Visualization of cell clusters can be done in DCS using different state-of-the-art methods to represent the results in a 2D layout: Fig. S6 (a), a well-established nonlinear method that preserves local data structure (Maaten and Hinton (2008)).
2. PCA, based on the first two principal components (2nd PC vs. 1st PC) of the data, see Fig. S6 (b). This is a simple linear method that preserves the global structure of the data.
3. UMAP. This approach (McInnes et al. (2018)) has recently gained popularity because it preserves intercell distance in the dimensionality reduction procedure. This algorithm maintains the global structure and the continuity of the expression data. UMAP has been found to resolve cell populations and to produce equally meaningful representations compared with t-SNE (Becht et al. (2019)). Layouts produced with UMAP are more reproducible than other methods, notably more so than those from t-SNE. An example of visualization using UMAP on PBMC dataset is shown in Fig. S6 (c). Fig. S6 (d). This recently developed method captures local and global structure using an information-geometric distance between data points (Moon et al. (2019)). PHATE has been found to reveal biological insights into cell developmental branches, including identification of previously undescribed subpopulations (Moon et al. (2019)).

Input gene expression data
The input gene expression data is expected in one of the following formats: • Spreadsheet of comma-separated values (csv) containing a condensed matrix in a form ('cell', 'gene', 'expr').
If there are batches in the data, the matrix has to be of the form ('batch', 'cell', 'gene', 'expr'). Column order can be arbitrary.   During and after processing data storage is implemented using Hierarchical Data Format (HDF).

Miscellaneous analysis tools
The optimized performance of our modular DCS software allows for an efficient processing of large single cell datasets. The documentation of our software is built with Sphinx at ReadTheDocs.org. Any changes to source code and python code docstrings are automatically reflected at ReadTheDocs.org, and a new version of the documentation is built.
In DCS we have implemented numerous querying functions for an efficient extraction of cells based on a specific cluster or cell type. This functionality allows for an easy selection of data subsets for further analysis.
A specialized function in DCS provides across clusters comparison via a two-tailed t-test plot of individual genes. See the example of CD4 expression in PBMC data in the Fig. S7.
Finally, DCS includes a function representing in a pie chart the role of different markers in a direct comparison between two cell types. Fig. S8 shows an example of output for T cells versus NK cells markers. This function can be useful, for instance, to make the final decision on a cluster for which the cell type assignment has provided two possible cell types.  (1) Figure S7. Example of two-tailed t-test analysis for the CD4 gene expression in the 68k PBMC dataset. Black stars denote where this gene is significantly expressed in a two-cluster cross-comparison. 8/9 Figure S8. Summary of T cells and NK cells markers from the "CD Marker Handbook". The dark green sector shows markers expected to be found in both cell types, the light green sector contains markers expressed in one of the cell types. The dark red sector is for markers that are not expected to be significantly expressed in either cell types, and light red is for markers that should not be significantly expressed in one of the two cell types. Grey-shaded sector contain markers that are expected to be expressed in one of the cell type and non-expressed in the other. The number in the center is the total number of known markers for these two cell types.