scIMC: a platform for benchmarking comparison and visualization analysis of scRNA-seq data imputation methods

Abstract With the advent of single-cell RNA sequencing (scRNA-seq), one major challenging is the so-called ‘dropout’ events that distort gene expression and remarkably influence downstream analysis in single-cell transcriptome. To address this issue, much effort has been done and several scRNA-seq imputation methods were developed with two categories: model-based and deep learning-based. However, comprehensively and systematically comparing existing methods are still lacking. In this work, we use six simulated and two real scRNA-seq datasets to comprehensively evaluate and compare a total of 12 available imputation methods from the following four aspects: (i) gene expression recovering, (ii) cell clustering, (iii) gene differential expression, and (iv) cellular trajectory reconstruction. We demonstrate that deep learning-based approaches generally exhibit better overall performance than model-based approaches under major benchmarking comparison, indicating the power of deep learning for imputation. Importantly, we built scIMC (single-cell Imputation Methods Comparison platform), the first online platform that integrates all available state-of-the-art imputation methods for benchmarking comparison and visualization analysis, which is expected to be a convenient and useful tool for researchers of interest. It is now freely accessible via https://server.wei-group.net/scIMC/.


Model-based imputation methods
SAVER is a method that optimizes the whole gene expression counts, using information across genes and cells to impute zeros as well as to improve the expression values for all genes. It uses a multi-gene prediction model to restore gene expressions, which is formulated as follows: where % is a cell-specific size factor, as well as %& is the true expression (without dropouts) of %& . SAVER assumes %&~G amma ( %& , %& ), where 9: and 9: are the estimated shape and rate parameters after maximum-likelihood, respectively. 9: is the imputed value of %& . Finally, the imputed value is as: where %& is a prediction based on the observed expression values of the informative genes in the same cell.
MAGIC applies data diffusion to share information between similar cells to optimize the gene expression matrix, and impute the missing values. It imputes the "dropout" events in the following four steps: (1) converting " * $ to a cell-cell distance matrix ; (2) computing an affinity matrix using through a Gaussian Kernel; (3) constructing Markov affinity matrix , representing the probability distribution of each cell transitioning to each other cell in one step; (4) data diffusion is performed by the exponentiation of M. Finally, after t iterations, the imputed matrix is computed as: where is the preprocessed " * $ .
ScImpute can automatically identify possible "dropout" events ("false" zero values), and perform imputation only on the identified values without introducing new noise to the rest data. It first performs the preprocessing operation of normalization and log-transformation on " * $ . One of advantages of scImpute is that it can detect outliers. Clustering is used to obtain the nearest neighbors of each gene to detect outliers. Instead of treating all zero values as "dropout" events, scImpute builds a statistical model to determine whether zero values are from true "dropout" events.
For gene in cell subpopulation k, its expression value is modeled as a random variable % (L) , which obeys the following distribution: (L) , 9 (L) , 9 (L) , 9 L and 9 (L) , respectively. Therefore, the dropout probability %& of gene in cell is calculated as follows: For every cell , the genes are divided into two groups: k = { |`k ≥ } represents genes in need of imputation; k = { |`k < } represents genes having precise expression, where t is the threshold. As last, it uses k to impute the expression values of genes in k .
DrImpute identifies similar cells using clustering, and imputes the missing values by the mean of the expression values from the cells of the same cluster. Assume that there are h results of clusters: b , s , … , u , where Ck (1<k<h) denotes the k-th result.
The missing value in L is calculated as follows: Next, it computes %& L for each clustering result, and imputes the "dropout" event through the mean of %& L . The imputed values is predicted as: scNPF adopts the network propagation approach based on random walk with restart (RWR) to optimize the count matrix, which considers both local and global topology of the interaction network. A gene-gene interaction network G=<V, E, B>, where V is the set of genes, as well as E is the set of interactions. B represents the set of transition probabilities, where %& is the transition probability from node i to node j. The start point is a vector } of scores on genes representing the gene expression profile of a given cell. The network propagation approach based on RWR can be computed as below: W is a degree-normalized adjacency matrix of the interaction network, which is constructed by the adjacency matrix B and a degree matrix D, defining as = cb .
is the trade-off between prior information and network diffusion, governing the distance that a signal is allowed to diffuse through the network during smoothing. The propagation function runs iteratively until it converges to a steady state P: By repeating the network propagation process for each cell, a higher density matrix with smooth gene expression values is generated.
scTSSR imputes the missing values considering the similarity information between genes, and the similarity information between cells. The imputed values can be predicted as: where 9L and u: represent the estimate of representation coefficients which can capture the similarity between genes and the similarity between cells, respectively. It uses penalized least square method to estimate the two parameters.

Deep learning-based methods
AutoImpute is designed based on deep auto-encoder network and sparse gene expression matrix. It aims to learn the inherent distribution of input data, and estimate the missing values with minimal impact on biologically low-expressed genes. The input in AutoImpute is defined as: where ∘ is the Hadamard product, and R is true counts matrix (a matrix without dropouts), which we would like to recover. M is a binary matrix, in which the element is set to 0 when the corresponding element in X is 0; otherwise, the rest elements of M are all set to 1. In order to obtain the optimal solution to the linear inverse problem, it assumes that is a low-rank matrix. Accordingly, is imputed by auto-encoder: where is the sigmoid activation function used in the encoder layer, D is the decoder of auto-encoder, and E is the encoder of auto-encoder. To this end, AutoImpute restores equation (12) using deep auto-encoder network to calculate . Because is the estimation of R, the loss function of AutoImpute is defined by equation (11) where is the regularization coefficient, |||| † s is the Euclidean cost function, and |||| … s represents loss which is calculated only for the non-zero counts present in expression matrix.
ALRA is a method based on low-rank approximation, which applies non-negativity and correlation structure to selectively impute the missing values, leading to maintain biological zero while imputing "dropout" events. First of all, " * $ is normalized, and log-transformed. Subsequently, there are three steps in ALRA for imputing the missing values. First, it uses SVD (singular vector decomposition) to calculate the approximate optimal rank matrix of the normalized expression matrix. Next, it thresholds each gene of the resulting matrix by the absolute value of the most negative entry of the gene.
Finally, the result value is rescaled so that the mean and standard deviation of the nonzero values of each gene in the result matrix match that of the original matrix, respectively.
DCA uses the negative binomial noise model with or without zero-inflation considering the count distribution, over-dispersion and sparsity of the data to impute the missing values, and captures the nonlinear gene-gene correlation. It assumes that the noise distribution obeys ; , , = } + 1 − ; , , and applies the auto-encoder to train the parameters of the distribution. DCA constructs an auto-encoder network with three output layers. The architecture formulations are as follows: where E, B, and D represent the encoder, bottleneck and decoder layers, respectively.
represents the matrix that is normalized and log-transformed. The three output layers of the model ( , , ) correspond to the three parameters of the ZINB distribution (a, b, c), which are used to learn the initial distribution of the noise. The where % is the value of gene in cell c, and 9 is the estimated value.
scIGANs uses a deep generative network called BEGAN to achieve the goal of imputing the missing values. Firstly, the data of scRNA-seq are normalized by the maximum read count of each cell. After that, scIGANs reshapes the count matrix into images with size of * . The pixels of the images are represented by the normalized gene expression. If the number of genes is less than * , zero values are padded. The network of scIGANs includes two parts: generator and discriminator. The generator is constructed as follows: The input a is defined as: ~ (0, 1) which is obtained from any noise distribution, and label `'~ (1, ), where k is the number of cell types. is the hyper-parameter of the model that needs to be trained. The discriminator of scIGANs is defined as: The input X represents the reshaped images, and is defined as: ~f, where f is the distribution of real data. `• is the cell type.
For a given cell % that belongs to cell type K, scIGANs generated a candidate set -, and obtained the k nearest neighbors of % by Euclidian distance in L , which is shown as %L$$ . Accordingly, the impute values are calculated: ScGNN imputes the missing values via the cell-cell relationship inferred by the iteration process. The imputation model contains three regularizers from cell graph, cell types and L1 regularizer: where A is included in the adjacency matrix (N*N) from the pruned cell graph in the last iteration, and b , s represent the intensities of the regularizers. L1 reduces non-zero terms in | |, in which denotes a hyper-parameter. Finally, the loss function is defined as: TRS is k possible gene regulatory signals, corresponding to a mixture of k Gaussian distributions that & ∈ is followed. & represents the normalized expression values of a gene over N cells, X = { b , s , … , d }.                                         Results from RAW data, and imputed data by scImpute, SAVER, ALRA, MAGIC, DCA, DrImpute, DeepImpute, scTSSR, AutoImpute, scIGANs, and scGNN. The x-axis represents true gene expression value, and the y-axis represents imputed gene expression value. Figure S50. Heat map of top 10 significant genes in bulk RNA sequencing data. We extracted the top 10 genes with highest P-Value in bulk samples as reference.

Supplemental Tables
The cells are divided into two groups: H1 and DEC, which was shown in different colors (blue and pink). Figure S51. Heat map of top 10 significant genes in raw scRNA sequencing data. We extracted the top 10 genes with highest P-Value in bulk samples as reference.
The cells are divided into two groups: H1 and DEC, which was shown in different colors (blue and pink). Figure S52. Heat map of top 10 significant genes data in imputed data from scImpute. We extracted the top 10 genes with highest P-Value in bulk samples as reference. The cells are divided into two groups: H1 and DEC, which was shown in different colors (blue and pink). Figure S53. Heat map of top 10 significant genes data in imputed data from SAVER. We extracted the top 10 genes with highest P-Value in bulk samples as reference. The cells are divided into two groups: H1 and DEC, which was shown in different colors (blue and pink). Figure S54. Heat map of top 10 significant genes data in imputed data from MAGIC. We extracted the top 10 genes with highest P-Value in bulk samples as reference. The cells are divided into two groups: H1 and DEC, which was shown in different colors (blue and pink). Figure S55. Heat map of top 10 significant genes data in imputed data from ALRA. We extracted the top 10 genes with highest P-Value in bulk samples as reference. The cells are divided into two groups: H1 and DEC, which was shown in different colors (blue and pink). Figure S56. Heat map of top 10 significant genes data in imputed data from DrImpute. We extracted the top 10 genes with highest P-Value in bulk samples as reference. The cells are divided into two groups: H1 and DEC, which was shown in different colors (blue and pink). Figure S57. Heat map of top 10 significant genes data in imputed data from DeepImpute. We extracted the top 10 genes with highest P-Value in bulk samples as reference. The cells are divided into two groups: H1 and DEC, which was shown in different colors (blue and pink). Figure S58. Heat map of top 10 significant genes data in imputed data from scTSSR. We extracted the top 10 genes with highest P-Value in bulk samples as reference. The cells are divided into two groups: H1 and DEC, which was shown in different colors (blue and pink). Figure S59. Heat map of top 10 significant genes data in imputed data from scNPF. We extracted the top 10 genes with highest P-Value in bulk samples as reference. The cells are divided into two groups: H1 and DEC, which was shown in different colors (blue and pink). Figure S60. Heat map of top 10 significant genes data in imputed data from scIGANs. We extracted the top 10 genes with highest P-Value in bulk samples as reference. The cells are divided into two groups: H1 and DEC, which was shown in different colors (blue and pink).