Combinatorial prediction of marker panels from single‐cell transcriptomic data

Abstract Single‐cell transcriptomic studies are identifying novel cell populations with exciting functional roles in various in vivo contexts, but identification of succinct gene marker panels for such populations remains a challenge. In this work, we introduce COMET, a computational framework for the identification of candidate marker panels consisting of one or more genes for cell populations of interest identified with single‐cell RNA‐seq data. We show that COMET outperforms other methods for the identification of single‐gene panels and enables, for the first time, prediction of multi‐gene marker panels ranked by relevance. Staining by flow cytometry assay confirmed the accuracy of COMET's predictions in identifying marker panels for cellular subtypes, at both the single‐ and multi‐gene levels, validating COMET's applicability and accuracy in predicting favorable marker panels from transcriptomic input. COMET is a general non‐parametric statistical framework and can be used as‐is on various high‐throughput datasets in addition to single‐cell RNA‐sequencing data. COMET is available for use via a web interface (http://www.cometsc.com/) or a stand‐alone software package (https://github.com/MSingerLab/COMETSC).


Count Count Count
Gene expression Gene expression Gene expression Gene expression

Count
Mean difference (constant sample size N=5000) Mean difference (constant sample size N=200)

Gaussian expression
Negative Binomial counts Appendix Figure S4. The XL-mHG test enjoys desirable properties for marker discovery compared to standard differential expression tests on simulated gene expression data. Simulated data is generated for one gene in two cell clusters and , where denotes the cluster of interest (Methods). Comparison of the XL-mHG test to the standard mHG test reveals the important role of parameters X and L.
A. The XL-mHG test outperforms various differential expression tests in identifying favorable marker genes to be used as markers from simulated datasets (Methods) with respect to robustness to small effect-sizes for Gaussian expression data (center) and Negative Binomial count data (bottom). B. The XL-mHG test outperforms various differential expression tests in identifying favorable marker genes to be used as markers from simulated datasets (Methods) with respect to sensitivity to sample size for Gaussian expression data (center) and Negative Binomial count data (bottom).
Error bars indicate one standard deviation across 100 simulation runs (thresholded below at 0 and above at 1).

Good markers Poor markers Non markers
Cell cluster K Cell cluster C A B C Appendix Figure S5. The XL-mHG test outperforms standard classifiers for marker recovery on simulated Gaussian gene expression data. A. Outline of the synthetic gene expression matrix generated with n cells and p genes (Methods). Cells are divided into two clusters and , where denotes the cluster of interest. Three categories of genes are considered: genes which are upregulated in (good markers), genes which are upregulated in only a small subset of cells in (poor markers) and genes which are not differentially expressed across and (non-markers). The Scaled Sum of Ranks (SSR) metric (Methods) is used to assess the performance of four classification methods at recovering good markers from the expression matrix: XL-mHG test, Random Forests (RF), Extra Trees (XT) and Logistic regression. SSR=1 is the optimal value. B. SSR versus proportion of poor markers in the data set (left). The XL-mHG picks up the correct good markers regardless of the proportion of poor markers, while this proportion affects both LR (via an increase in fold change between and ), RF and XT. Out-of-bag error (OOB error) is included for RF and XT (right). C. SSR versus mean of poor markers in the data set (left). The XL-mHG test picks up the correct good markers regardless of the mean of poor markers. Poor markers with very high expression are valuable for RF and XT, and contribute to increase the fold change between and , resulting in suboptimal performances of RF, XT and LR. Out-of-bag error (OOB error) is included for RF and XT (right). Error bars indicate one standard deviation across 20 simulation runs. Error bars indicate one standard deviation across 20 simulation runs.

Figure S6
Appendix Figure S7. Flow cytometry analysis to compare the protein level stainings of CD19, Ly-6D, CD20, and CD79b to the established T cell marker CD3 (Meuer et al, 1983). Limited co-staining was observed, highlighting the markers' specificity as B cell markers. (SP= single positive).

Figure S8
A Appendix Figure S8. The marker combination Ly-6D + CD3predicted by COMET for B cells achieves improved B cell staining. A. COMET output t-SNE visualization of the marker combination Ly-6D + CD3 -. B. Bar graph showing the overall cell counts in non-B cell clusters for Ly-6D + and Ly-6D + CD3 -, as predicted from the binarization procedure of COMET. Addition of the second marker decreases the overall counts in these other clusters. C. Flow cytometry analysis of the expression of the T cell marker CD3, the dendritic cell marker (CD11c), and the neutrophil marker (Gr-1) in the Ly-6D + and Ly-6D + CD3populations confirms that the Ly-6D + CD3marker panel improves clean out of contamination for all populations tested (Hestdal et al, 1991;Merad et al, 2013;Meuer et al, 1983). The Ly-6D + CD3marker panel was selected based on available antibodies. Error bars indicate the mean and SD. * p < 0.05; * * p < 0.01; * * * * p < 0.0001.

A B
Follicular B cell signature Marginal zone B cell signature Gene Score Gene Score C Appendix Figure S9. Gene signature scores identify follicular and marginal zone B cell populations. A. Marginal zone B cell signature scores identify cluster 2 (from Figure 5A) as the marginal B cell cluster (signature: marginal zone B cells vs. follicular B cells (Kleiman et al, 2015), Wilcoxon Rank-Sum test p-value of 7.8e-25 between clusters 2 and 0). Shown are signature scores computed for each cell (divided into clusters) and violin plots generated by the Scanpy function score_genes. Cluster 4 was not considered marginal due to its lack of expression of the marker gene CD21 ( Figure 2C). B. Follicular B cell signature scores identify cluster 0 (from Figure 5A) as the follicular B cell cluster (signature: follicular B cells vs. marginal zone B cells, Wilcoxon Rank-Sum test p-value of 7.5e-31 between clusters 0 and 2) (Kleiman et al, 2015). Shown are signature scores computed for each cell (divided into clusters) and violin plots generated by the Scanpy function score_genes. Scores in t-SNE visualization are capped at zero. Cluster 3 was not considered follicular due to a strong batch effect defining that cluster (C). Cluster 4 was not considered follicular due to its lack of expression of the marker gene CD23 ( Figure 5B). C. Visualization of splenic immune populations generated by Tabula Muris and available on their website (Tabula Muris Consortium, 2018) website, including all CD45+ cells analyzed. Cluster 3 (circled) in the data contains a significant batch effect driven by mouse of origin. We therefore do not relate to cluster 3 in our follicular / marginal zone B cell analysis.  Appendix Figure S10. COMET identifies single-and multi-gene marker panels for splenic follicular B cells.

Figure S10
A. COMET output of the top 10 ranked single genes for the follicular B cell cluster (shown in Figure 5A). B. Flow cytometry gating strategy for follicular B cells (CD23 + ), marginal zone B cells (CD21 + ), and double negative B cells (DN). All viable CD19 + splenocytes were included. C. Comparison of the marker combination CD62L + CD55 + to the single stain CD62L + for the staining of follicular B cells confirms that staining with the 2-gene marker panel improves on staining with CD62L alone. * * * * p < 0.0001.
Appendix Figure S11. COMET is deployed to users over the web through an Amazon Web Services cloud backend coupled with a Flask application. A queue service (AWS SQS) grabs the submitted jobs and feeds them through a computing instance. Upon completion, a unique job ID is sent to the user's email, allowing them to access the finalized results on any computer. All files are stored in an S3 bucket and available for four days. The basic interface is available at www.cometsc.com and is freely available.