Statistical methods for analysis of single-cell RNA-sequencing data

Single-cell RNA-sequencing (scRNA-seq) is a recent high-throughput genomic technology used to study the expression dynamics of genes at single-cell level. Analyzing the scRNA-seq data in presence of biological confounding factors including dropout events is a challenging task. Thus, this article presents a novel statistical approach for various analyses of the scRNA-seq Unique Molecular Identifier (UMI) counts data. The various analyses include modeling and fitting of observed UMI data, cell type detection, estimation of cell capture rates, estimation of gene specific model parameters, estimation of the sample mean and sample variance of the genes, etc. Besides, the developed approach is able to perform differential expression, and other downstream analyses that consider the molecular capture process in scRNA-seq data modeling. Here, the external spike-ins data can also be used in the approach for better results. The unique feature of the method is that it considers the biological process that leads to severe dropout events in modeling the observed UMI counts of genes. • The differential expression analysis of observed scRNA-seq UMI counts data is performed after adjustment for cell capture rates. • The statistical approach performs downstream differential zero inflation analysis, classification of influential genes, and selection of top marker genes. • Cell auxiliaries including cell clusters and other cell variables (e.g., cell cycle, cell phase) are used to remove unwanted variation to perform statistical tests reliably.

: total number of cells present in scRNA-seq data; J : total number of genes in the data; K : total number of cell clusters; L : number of cell types; μ i jkl be the mean of non-zero counts in i th cell for j th gene in k th cell cluster of l th cell type; ϕ i jkl ( = θ i jkl −1 ) and θ i jkl be the dispersion and size parameters, respectively in i th cell for j th gene in k th cell cluster of l th cell type; π i jkl be the zero inflation probability in i th cell for j th gene in k th cell cluster of l th cell type.

Traditional statistical models for fitting observed scRNA-seq data
Negative binomial (NB) model NB models are extensively used in modeling the read counts obtained from RNA-sequencing (RNAseq) studies. The Probability Mass Function (PMF) of the NB distributional model is expressed in Eq. (1 ).
The mean and variance of the NB model is given in Eqs. (2 ) and (3) , respectively. E Y i jkl = μ i jkl (2) V ar Y i j = μ i jkl + μ i jkl 2 θ i jkl = μ i jkl + μ i jkl 2 ϕ i jkl (3) Zero inflated negative binomial (ZINB) model The NB model implemented in bulk RNA-seq differential expression (DE) analytic tools including DESeq2, edgeR, baySeq, SAMSeq, etc ., may not handle the excess overdispersion and zero inflation present in the single-cell UMI counts data [ 2 , 3 ]. Therefore, ZINB model is exclusively used for modeling/fitting of UMI count data obtained from single-cell studies [2][3][4][5] . The ZINB model can be briefly described as follows: The PMF of the ZINB distribution is given in Eq. (4 ).
The PMF of the ZINB distribution, used to model the UMI counts from scRNA-seq studies, is given in Eq. (6 ). Poisson model.

SwarnSeq model
In the existing single-cell data analytic tools including Seurat, DEsingle, Monocle, MAST, etc ., the observed UMI counts are considered the realizations of true UMI counts. This assumption is not true, as different noises including biological sources, e.g ., lower molecular capture, are mostly confounded with the observed UMI counts [ 2 , 4 ]. For instance, the recent single-cell sequencing protocols only capture the 1-10 % of the transcriptomics present in the cell [ 4 , 5 ]. Therefore, this property needs to be incorporated in modeling of the observed UMI count data. Here, we considered a simple Binomial cell capture model to model the observed UMI count data. However, other cellular capture model, e.g ., Beta-Binomial, Poisson-NB models, Hypergeometric models, etc ., can also be considered to represent biological dropout events in single-cell studies.
Theorem : Let, ρ i jkl be the rv represents the transcriptional capture rate of i th cell for j th gene in k th cell cluster at l th cell type/pseudo-time. If the true UMI counts, Z i jkl , follow ZINB ( π i jkl , μ i jkl , θ i jkl ) distribution, and ρ i jkl follows a binomial model with parameter p i jkl ( 0 ≤ p i jkl ≤ 1 ) , then the observed UMI counts, Y i jkl , will also follow ZINB distribution with parameters ( π i jkl , μ i jkl p i jkl , θ i jkl ) .
Proof: Given that, Z i jkl ∼ ZINB ( π i jkl , μ i jkl , θ i jkl ) and ρ i jkl = ( Y i jkl | Z i jkl = z) ∼ B ( z, p i jkl ) Now, the PMF of Z i jkl is given in Eq. (4 ) and the PMF of ρ i jkl can be expressed in Eq. (7 ).
The joint probability distribution of the observed and true UMI counts, Y ijkl and Z ijkl , can be written as: P Y i jkl = y, Z i jkl = z| π i jkl , μ i jkl , θ i jkl , p i jkl = P Y i jkl = y | Z i jkl = z, p i jkl P Z i jkl = z| π i jkl , μ i jkl , θ i jkl (8) Now, the marginal probability distribution of Y i jkl can be obtained as: Now, Eqs. (10 ) and (11) are in the form of Eq. (4 ), which indicates the distribution of the observed UMI counts, Y i jkl , is also from ZINB ( π i jkl , μ i jkl , θ i jkl ) . The detailed proof of this theorem can be found at [2] . Corollary 1 : When p i jkl = 1 ( i.e. , under full capture rates), this means that all the transcriptomic material present in the cell is fully captured during the sequencing process, this is called as perfect deep sequencing. Under such scenarios, the distributions of the observed and true UMI counts remain same, i.e. , a ZINB model. Mathematically, Here, the genes in a cell will have zero counts which are not truly expressed ( i.e., biological zeros) and the single-cell experiment will be free from dropout events. However, such a scenario is a dream in real experimental single-cell studies. In other words, the real limits of p i jkl is 0 < p i jkl < 1 .
Corollary 2 : In case p i jkl < 1 , i.e. , in real experimental case the transcriptomic materials present in cells is not fully captured, but only certain fraction is captured [ 9 ]. Then, zero counts in the single-cell expression data are the mixture of dropout/false zeros and true zeros. Further, mean of the observed non-zero UMI counts depend on the cell capture rate parameter, while the zero inflation and overdispersion parameters are independent of the cell capture rates. Here, it is worthy to note that ˆ π i jkl from observed data can be used to estimate the proportions of true zeros, as π i jkl remains unaffected by the capture rate parameter.
True UMI counts : Z i jkl ∼ ZINB π i jkl , μ i jkl , θ i jkl (13) Observed UMI counts : Y i jkl ∼ ZINB π i jkl , μ i jkl , θ i jkl , μ i jkl = μ i jkl p i jkl (14) In single-cell experiments, the observed UMI counts are noisy reflection of the true expression of genes due to lower cellular transcriptional capturing Eqs. (13 ), ( (14) ). In other words, distributions of the observed UMI counts of genes are the joint distributions of gene's true expression and transcriptional (cell) capture rate. The relation between the true and observed means of non-zero counts of genes is μ i jkl > μ i jkl . This means, the distribution of observed UMI counts will shift more towards zero, if the cellular capture rate is decreased. In other words, weightage of the Dirac's delta function will be more in the mixture distribution ( Eq. (4 )) compared to be NB part.

Expected value and variance of the observed UMI counts in SwarnSeq model
The expected value and variance of the observed UMI counts of genes, Y i jkl , in the SwarnSeq model can be expressed in Eq. (15 ).
In the SwarnSeq method, expected value of the observed UMI counts of genes depends on the zero inflation, mean of non-zero counts, and cell capture rate parameter. While the variance of the observed UMI counts are the functions of the zero inflation, mean of non-zero counts, overdispersion, and cell capture rate parameters. Further, the relation between the variance and expected value of the observed UMI counts of genes can be shown in Eq. (17 ). Alternatively, variance of the observed UMI counts of a gene is the function of its expected values ( Eq. (17 )) ( i.e. , case of overdispersion).

Distributions of sample mean and sample variance of observed counts of genes
Usually, population parameters of the genes including population mean and variance are unknown, and they are estimated from experimentally observed sample UMI count data. Hence, it is important to obtain the sampling distribution of sample means and variances of the genes in a single-cell experimental study. The sample mean and variance of the observed UMI counts for j th gene can be expressed in Eqs. (18 ), and (19) , respectively. Here, for simplicity, we omitted the subscript denoting cell type.
The expected values of the gene sample mean, and sample variance of the observed UMI counts can be derived under certain statistical assumptions. In other words, we assume that the observed count data are drawn from the ZINB population model, as given in Eq. (4 ), and the transcriptional capture efficiencies of the genes remain same. Further, the model parameters for the genes remain same over the cells in different cell clusters, i.e., μ 1 j1 = · · · = μ I 1 j1 = · · · = μ I K jK = μ j ; π 1 j1 = · · · = π I 1 j1 · · · = μ I K jK = π j ; θ 1 j1 = . . . = θ I 1 j1 . . . = θ I K jK = θ j ; Now, the theoretical expression of expected value of the sample mean for j th gene can be derived as: Under the assumption of Eq. (20 ), the expected value of sample mean for j th gene ( Eq. (21 )) can be obtained, as shown in Eq. (22 ).
The variance of the observed UMI data, V ( Y i jk ) , ( Eq. (16 )) under the assumption of Eq. (20 ), becomes: Now, the variance of sample mean ( Eq. (18 )) can be obtained as shown in Eq. (24 ) under the assumption of Eq. (20 ).
Let, s 2 j be the sample variance of j th gene, expressed in Eq. (19 ). Then its expected value can be derived as follows.

Estimation of SwarnSeq model parameters
We have shown that the distribution of sample means and variances of genes in experimental single-cell studies depends on gene specific model parameters, which are unknown. So, it is necessary to estimate them to get the exact distribution of gene specific sample statistic(s) and performing other analyses including DE analysis. Here, the parameters of the SwarnSeq model, given in Eqs.
) ; α j , τ j and ω j : I × 1 vector of parameters for j th gene; X : I × L design matrix providing group information (first column consists of 1's to include intercept term); L : number of cellular groups/types (cell clusters are divided into L cell groups, if cell group is unknown); R : I × K design matrix providing cell cluster information; C : I × C design matrix providing other cell level auxiliary information; γ j and β j : L × 1 vectors of cellular groups effects for j th gene; w j and u j : K × 1 vectors of cell cluster effects for j th gene; s j and v j : C × 1 vectors of effects for other cell level co-variates including cell cycle, cell phase, etc. for the j th gene; C : Levels of cell level auxiliaries.

Expectation maximization (EM) algorithm
The parameters in Eqs. (26 )-(28) for j th gene, i.e., j = { α j , τ j , ω j } can be estimated by using the Maximum Likelihood Estimation (MLE) Method. It is very difficult to obtain closed form solutions for the resulting log-likelihood function, given in Eq. (29 ). So, we developed an EM algorithm to estimate the SwarnSeq model parameters. For simplicity, we omit the subscripts for cellular type/pseudo-time in the notations. For the EM algorithm, we recast our estimation procedure into a missing data problem through introducing a latent rv, V i jk , as defined in Eq. (30 ). Further, the incomplete data likelihood function for j th gene can be expressed as: Now, the joint likelihood function for complete data (in presence of latent variable), i.e., can be expressed in Eq. (31 ), as: Then, the log-likelihood function in Eq. (31 ) becomes: (32 )) can be obtained as: The conditional expectations in Eq. (33 ) can be given as: The posterior probabilities or the conditional weights in Eqn 33 for observations originate from the count component of the model and can be given as: where, f NB (. ) is the PMF of NB distribution given in Eq. (1 ).

E-step
: The E-step in the EM algorithm involves in evaluating the expected value of the loglikelihood of the complete data ( Eq. (33 )), given the observed data with current estimates of the parameters. In this approach, for each gene, given the observed data and the current estimate of the ZINB parameters, the expected value of the log-likelihood is calculated. Let, ˆ c given current estimate of the parameters, then the expected value of log likelihood ( Eq. (33 )) at step (c + 1), i.e., Q c+1 is calculated. The conditional expectation at c th step, i.e., E( V i jk | Y i jk , ˆ c j ) ( Eqn 33 )) can be estimated using Eq. (36 ).
The updated values of the estimates of parameters at (c + 1) th step is obtained by providing the observation wise weights, ˆ w (c) i jk ( Eq. (35 )) and parameters estimates at c th step. For this purpose, the glm.nb function in MASS R package was executed. (ii). The zero-inflation probability, ˆ π i jk , is updated with the logistic regression, can be expressed as: The updated value of ˆ π i jk at step (c + 1) is obtained by incorporating the observation level weights, i jk ( Eq. (35 )) and the parameters estimate at c th step. For this, glm (…, family = 'binomial') function in stat R package was executed.
The above procedure is iterated until the convergence is achieved, the detail procedure can be found at [2] . It is important to note that for some genes, the EM algorithm may fail to converge or may be not successful [ 8 ]; therefore, we used Nelder's optimization algorithm [6] implemented in optim function of stats R package to estimate the MLE of parameters. The developed EM algorithm for estimation of SwarnSeq model parameters was applied to the considered experimental single-cell UMI data. The obtained analytical results are shown in Figs. 1 and 2 . Furthermore, relations between the estimated values of parameters for the genes are also shown ( Figs. 1 , 2 ).

Cell capture rate estimation
The distributions of the observed scRNA-seq UMI counts Eq. (10 )-( (16) and sample statistic(s) including sample mean and variance Eqs. (22 )-( (25) depend on the value of cell specific capture rate parameter, p ijk . However, it is extremely difficult to estimate the cell capture rate parameters inside the estimation procedure based on EM algorithm. Hence, one analytical technique is discussed here to estimate the cell capture rate parameters. For computational simplicity, we assume that the cell specific capture rate parameters remain same across all the genes, i.e., p i 1 k = p i 2 k = . . . = p iJk = p ik .
Case 1: External RNA spike-ins data available Let, n RNA spike-ins are added to each cell's lysate and spike-in transcripts are processed in parallel. This process will result a set of UMI counts for spike-in transcripts. Let, C 1 , C 2 , . . . , C u , . . . , C n be the respective m RNA concentrations of n spike-in transcripts added to i th ( i = 1, 2, …, I k ) cell of k th ( k = 1, 2, …, K ) cell cluster and let R i 1 k , R i 2 k , . . . , R iuk . . . , R ink be the observed UMI counts of the n spike-in transcripts for i th cell, here, C u and R iuk be the molecular concentration and UMI counts of u th spike-in transcript. Now, the transcriptional capture rate for i th cell in k th cell cluster can be estimated through a linear regression equation, given in Eq. (40) . where, u is the random error for u th spike-in transcript and assumed to follow Gaussian distribution with zero mean and unit variance. Further, ˆ p ik , regression co-efficient, is the estimate of the capture rate for i th cell in k th cell cluster.
Case 2: RNA spike-ins data not available In most of cases, the spike-ins data are not readily available with researchers in single-cell experimental studies. In such situation, the observed cell library sizes [7] can be used to empirically compute the cell specific capture rate. The procedure is given as follows. Let, ( ρ 1 , ρ 2 ) be the range of cell capture rates and S ik be the library size of i th cell in k th cell cluster and, where, L min and L max in Eq. (42 ) is given in Eq. (43 ).
The above procedure for the estimation of cell capture rate parameters was illustrated on the example single-cell dataset and the results are shown in Fig. 3 . The estimation of the cell capture rate parameter is shown for the two cases, 1: RNA spike-in data available and 2: RNA spike-in data not available, in Fig. 3 .

Estimated values of parameters from SwarnSeq model
Let, ( ˆ π j , ˆ θ j , ˆ μ j ) be the MLE estimates of the parameters for j th gene estimated through the EM algorithm and ˆ p ik be the estimate of the cell capture rate for i th cell, ˆ p be the average of the cell capture estimates over all the cells. Now, the estimated values of different statistic(s) including expected value of sample mean, sample variance, standard error and co-efficient variation for j th gene can be obtained as in Eqs. (44 )-(48) . Further, these developed formulae was applied to the considered experimental single-cell data, to estimate the distribution of sample means of genes and the results are shown in Fig. 4 .
The expression for the estimated value of sample mean is given in Eq. (44 ).
The expression for estimated value of variance of the sample mean for j th gene can be given in Eq. (45) .
The expression for the estimate of the expected value of sample variance of j th gene is shown in Eq. (46 ).
The estimated value of co-efficient of variation for the sample mean of j th gene is expressed in Eq. (47 ).
The estimated value of standard error (SE) of the sample mean for j th gene can be expressed in Eq. (48 ).

Determination of optimum number of cell clusters
The major downstream analysis for scRNA-seq data is cluster analysis, extensively used for detecting various cell types [ 2 , 3 ]. For this purpose, k -means clustering technique is used and implemented in various single-cell analytic tools. However, not much work has been done to determine the optimum value of number of cell clusters, to which the cells present in the scRNAseq data, is categorized. Besides, the SwarnSeq model requires cell cluster information to model the observed UMI counts of the genes. Therefore, we reported an algorithm to determine the optimum rates (estimated from the UMI data) with cell library sizes. The relationship between the capture rates with cell library sizes is bell-shaped. It means the cells with higher library sizes have better cell capture rates and vice-versa . (E) Relationship between mean of non-zero counts and zero counts % in cells. X-axis represents the zero counts % in cells; Y-axis represents the mean of non-zero UMI counts. The relation is inversely proportional, i.e. , cells with higher zero % have lower mean UMI counts and vice-versa . (F) Relationship between capture rates and zero counts % in cells. X-axis represents the zero counts % in cells; Y-axis represents the cell capture rates.
number of cell clusters that the cells need to be grouped based on the observed UMI count data, which is given as follows.
Let, Y ik : mean expression value of i th cell in k th cell cluster; Y .k : mean expression value of k th cell cluster, and Ȳ ... be the over-all mean. Then, Total Sum of Squares (TSS) can be expressed as: where, r h > 0 is the index value at h number of cell clusters. In our algorithm, the clustering indices ( r h ) were computed for different values of h ( ≥ 2) using the observed scRNA-seq UMI counts data. Then, the h value which provides the maximum value of r h can be chosen as the estimator for optimum number of cell clusters for that scRNA-seq data. Alternatively, the optimum value of h can be obtained through graphically by plotting h vs. r h and choosing the point in x-axis where the curve gets flatten. The algorithm for this reported technique is given in Fig. 5 . The algorithm is also implemented in optimcluster function of SwarnSeq R package. Further, this algorithm was applied to the considered experimental single-cell data to demonstrate its utility and the results are shown in Fig. 5 . For instance, in cluster index vs. cluster number plot, the curve has its inflexion point at k = 8, means that the 576 cells present in the data can be clustered into eight optimal cell clusters ( Fig. 5 B). The cluster wise distribution of cells is also shown ( Fig. 5 C).

Differential expression analysis of genes
In SwarnSeq approach, the mean parameter of each gene depends on the cellular groups ( Eq. (26 )). Further, the factors such as cell clusters and cell co-variates are included in the model ( Eq. (26 )) to remove their unwanted effects on the mean of genes. For DE analysis of genes, two group comparisons are made and the model in Eq. (26 ) can be expanded as: where, x i jk : binary indicator for cellular group membership, γ 0 j : (intercept term) logarithm of mean parameter for j th gene in the reference cellular group, γ 1 j : log Fold Change parameter for j th gene, w jk : regression co-efficient for k th cell cluster for j th gene, r i jk : indicator variable for cell cluster membership of i th cell in k th cluster for j th gene, s jm : regression co-efficient for m th ( m = 1, 2, …, M ) cell co-variates of j th gene, c mi j : indicator variable for m th co-variate of i th cell for j th gene and O μ j : offset term.
To statistically test whether j th gene is expressed differentially or not across the cellular groups, the following hypotheses are tested.
The above test can be performed by using Likelihood Ratio Test (LRT) statistic, and can be expressed in Eq. (52 ).
where, D S j : LRT statistic of j th gene; ˆ j0 : MLE of j for j th gene under the constraint of H 0 ; and ˆ j : unconstrained MLE of j for j th gene. The test statistic, D S j , follows a Chi-square distribution with 1 degree of freedom (for 2 groups) under H 0 . Further, based on the distribution of D S j , the p-value for j th gene was computed and this procedure was repeated for all the genes. Then the adjusted pvalues and FDRs for the genes were computed after adjustment for multiple hypothesis testing. The above statistical methods of DE analysis was illustrated on the considered single-cell dataset [1] and the results are shown in Fig. 6 . The volcano plot of the genes obtained through DE analysis is shown in Fig. 6 A. The DE analysis results indicated that 274 genes were identified as differentially expressed between the NA19101 and NA19239 cell groups ( Fig. 6 A) for the considered data.

Differential zero inflation analysis of genes
In literature, it is well established that the genes in scRNA-seq data are highly zero inflated ( i.e. , biological and dropout zeros) due to the nature of single-cell studies and several technical, and biological factors [2][3][4][5] . Therefore, it is important to identify the genes which have different number of zeros as expression across the two cellular groups. For this purpose, the SwarnSeq method can perform the zero inflation analysis of the genes across the two cell groups and detect those genes for further study. In SwarnSeq model, the zero inflation parameters of genes depend on the cellular groups through the model given in Eq. (27 ). Further, factors such as cell clusters and other cell-level auxiliaries are included in the model to remove the unwanted confounded effects from the zeroinflation probabilities of genes. For Differential Zero Inflation (DZI) analysis of genes, two cell groups' comparisons are made and the model in Eq. (27 ) can be written as: where, x i jk : binary indicator for cellular group membership, β 0 j : intercept term for j th gene (reference cellular group), β 1 j is the log Fold Change (zero inflation) parameter for j th gene, u jk : regression coefficient of k th cell cluster for j th gene, r i jk : indicator variable for cell cluster membership of i th cell in k th cluster for j th gene, v m j : regression co-efficient for m th ( m = 1, 2, …, M ) cell co-variates of j th gene, c mi j : indicator variable for m th co-variate of i th cell for j th gene and O π j : offset term.
Statistically to decide whether j th gene is DZI or not, the following hypotheses are tested.
The above test can be performed by using LRT statistic, and its expression is given in Eq. (54 ). Fig. 6. Key analytical results obtained through SwarnSeq Model. (A) Volcano plot for differential expression analysis results. X-axis represents the log 2 transformation of the fold change values of genes. Y-axis represents the -log 10 transformation of the p-values computed through the SwarnSeq model. red color represent the genes whose both -log 10 p-values > 20 and |log 2 FC| > 3; blue color represent the genes whose -log 10 p-values > 20; green color represent the genes whose |log 2 FC| > 3; black color indicates the non-significant genes. (B) Volcano plot for differential zero-inflation analysis results. X-axis represents the log 2 transformation of the fold change values of genes. Y-axis represents the -log 10 transformation of the p-values computed through the SwarnSeq model. red color represent the genes whose both -log 10 p-values > 7 and |log 2 FC| > 2; blue color represent the genes whose -log 10 p-values > 7; green color represent the genes whose |log 2 FC| > 2; black color indicates the non-significant genes. (C) Schematic representation of the classification of key genes detected through SwarnSeq model. Here D Z j , for all j , has a Chi-square distribution with 1 degree of freedom (for 2 groups comparison) under H 0 . The adjusted p-values and FDR for the DZI analysis were computed for all the genes after adjusting for multiple hypothesis testing through the SwarnSeq method. The above statistical methods of DZI analysis was illustrated on the considered Tung's scRNAseq data [1] . The volcano plot of the genes obtained through the developed DZI analysis is shown in Fig. 6 B. The results indicated that 243 genes were identified as differentially zero-inflated between the NA19101 and NA19239 cell groups ( Fig. 6 B). In other words, 243 genes have significant number of expressions as zero counts across the NA19101 and NA19239 cell groups.

Classification of detected influential genes
DE and DZI analyses are two major downstream analytical procedures usually practiced in singlecell experimental studies. Hence, it is interesting to know the group of genes which are expressed differentially across the cellular groups as well as differentially zero inflated. For this purpose, SwarnSeq method is able to classify the detected influential genes into different classes based on DE and DZI analyses, as shown in Fig. 6 . For instance, H 0 : γ 1 j = 0 detects all the genes, which are expressed differentially, while H 10 : β 1 j = 0 detects the genes differentially zero inflated across the cellular groups. Further, the SwarnSeq detects a class of genes in scRNA-seq data with both H 0 and H 10 rejected. This indicates there is a significant difference in the number of cells with zero values as expression of genes across the cellular groups, but the (non-zero) expressions in the remaining cells show significant differences. This group of influential genes is termed as 'DEZI' genes ( Fig. 6 ). The other class of genes, for which H 0 is rejected, but H 10 is not rejected. This means the class of genes for which there is no significant difference in the number of cells whose expressions are zeros across the cellular groups, but they are expressed differentially. We call this group of genes as only 'DE' class genes ( Fig. 6 ). Further, the third type ( i.e., only DZI) of genes, for which H 10 is rejected, but H 0 is not rejected ( Fig. 6 ). It includes the genes for which, there is a significant difference in the number of cells with zero expression values across the two cellular groups, but the (non-zero) expressions in the remaining cells show no significant difference. The utility of the SwarnSeq method for classification of the detected influential genes in scRNA-seq study was demonstrated on one real single-cell data and the results are shown in Fig. 6 .

Conclusion
Statistical analysis of single-cell data in presence of biological confounding factors (leading to severe dropout events) is a challenging task. Therefore in this paper, statistical techniques, implemented in the SwarnSeq, are presented for various analyses of single-cell experimental datasets. The analytical techniques include model fitting, EM algorithm based model parameters estimation procedure, estimation of cell capture parameters, clustering and determination of optimal cell clusters, distribution of observed UMI counts of genes, distribution of sample mean and variance of genes, differential expression, and differential zero inflation analyses, classification of genes, etc . A practical real data example was given for illustration of all the analytical techniques in the SwarnSeq. The SwarnSeq method will surely help the experimental biologist and genome researchers to perform various analyses on a single platform. In future, improved parameter estimation procedure including Bayesian techniques can be implemented in the SwarnSeq tool to estimate the gene specific dispersion, and that will enhance its performance. The SwarnSeq method assumes the factors, such as cellular groups, cell clusters and other co-variates, have fixed effects on means and zero inflations. This assumption may not hold good for single-cell data, as some biological factors may have random effects. Therefore, random or mixed effect models can be implemented in SwarnSeq method to improve its performance. The proposed approach is shown with one application in single-cell data analytics and it can be applied in other analytical fields where the data is zero-inflated and over dispersed such as pest population, sample surveys, etc . studies.

Declaration of Competing Interest
Authors declare that they have no competing interests.