Excel Template for Identifying Mouse Myeloid Cell-Types in the Central Nervous System Based on Single-Cell RNA Sequencing Data

Abstract


Excel template design for gene markers and expression extraction
To perform the cell identification of a cluster, we need four Excel tables: cell definition (Fig. 1, Fig. 2E and Table S1), cluster data (Fig. 2A and Table S1), avg_logFC extraction (Fig. 2B, D and Table S1), and gene extraction (Fig. 2C and Table S1). In the cluster data table, column A contains the genes in a cluster and column B contains avg_logFC, the average log2 fold change, which is the log2 ratio of the normalized mean gene counts in each cluster relative to all other clusters. In some studies, the average value of gene expression is also used. In the avg_logFC extraction table, the data in columns A and B come from the corresponding columns of the cluster data table, column C contains the extracted genes from column C of the gene extraction table, and column D contains the values looked up for the genes in column C using the Excel formula VLOOKUP(Cn,A:B,2,0). In the gene extraction table, column A contains the gene markers from column B of the cell definition table, column B contains the genes from column A of the avg_logFC extraction table, and column C contains the markers from column A that are found in column B, extracted using the Excel formula IF(COUNTIF(B:B,An)>0,An,"").
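For readers who want to check the two spreadsheet formulas outside Excel, the following is a minimal Python sketch of their logic. It is an illustration under stated assumptions, not part of the template: the function names and the example values are hypothetical, and a blank lookup is returned as None where Excel would show #N/A.

```python
def extract_markers(markers, cluster_genes):
    """Mimic IF(COUNTIF(B:B,An)>0,An,""): keep a marker only if the cluster lists it.

    Matching ignores case, as COUNTIF does in Excel.
    """
    present = {gene.lower() for gene in cluster_genes}
    return [marker if marker.lower() in present else "" for marker in markers]


def lookup_values(extracted, cluster_genes, avg_logfc):
    """Mimic VLOOKUP(Cn,A:B,2,0): exact match, first occurrence wins."""
    table = {}
    for gene, value in zip(cluster_genes, avg_logfc):
        table.setdefault(gene.lower(), value)  # keep the first occurrence only
    # Blank cells (markers absent from the cluster) yield None instead of #N/A.
    return [table.get(gene.lower()) if gene else None for gene in extracted]


# Hypothetical example data, for illustration only.
markers = ["Ptprc", "Hexb", "P2ry12", "Ly6g"]   # column A of the gene extraction table
cluster_genes = ["Hexb", "Cst3", "P2ry12"]      # column A of the cluster data table
avg_logfc = [2.1, 1.4, 1.8]                     # column B of the cluster data table

extracted = extract_markers(markers, cluster_genes)
values = lookup_values(extracted, cluster_genes, avg_logfc)
print(extracted)
print(values)
```

The two functions are kept separate to mirror the two Excel tables, so that each intermediate column can be inspected in the same way as in the spreadsheet.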

Cell-type identity workflow
The cell-type identity workflow included the following steps ( we will get the extracted genes from the gene markers (column A), 3. Copy column C from the gene extraction table and use Paste Special to paste it into column C of the avg_logFC extraction table, which gives the extracted values in column D, 4. Copy column D from the avg_logFC extraction table and use Paste Special to paste it into any blank column of the cell definition table, 5. In the cell definition table, identify the cell type by comparing the extracted values (upregulated and downregulated genes are shown in red and green, respectively) with the cell-types (column A) and gene markers (column B).
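The final comparison step can be sketched in Python as follows. This is only an illustration: it assumes that a positive avg_logFC corresponds to the "upregulated" (red) cells and a negative value to the "downregulated" (green) cells, and the function names, the marker statuses and the values are hypothetical.

```python
def regulation_label(value):
    """Label an extracted avg_logFC value the way the template colours it."""
    if value is None:
        return ""            # marker not found in the cluster: cell stays blank
    if value > 0:
        return "up"          # assumed to correspond to red in the template
    if value < 0:
        return "down"        # assumed to correspond to green in the template
    return "unchanged"


def annotate_definition_table(definition_rows, values_by_gene):
    """Attach the extracted value and its label to each (cell type, marker, status) row."""
    annotated = []
    for cell_type, marker, expected in definition_rows:
        value = values_by_gene.get(marker)
        annotated.append((cell_type, marker, expected, value, regulation_label(value)))
    return annotated


# Hypothetical rows of the cell definition table and hypothetical extracted values.
definition_rows = [
    ("MG", "Hexb", "P"),
    ("MG", "P2ry12", "P"),
    ("NEUT", "Ly6g", "P"),
    ("MAC", "Mrc1", "P"),
]
values_by_gene = {"Hexb": 2.1, "P2ry12": 1.8, "Mrc1": -0.6}

for row in annotate_definition_table(definition_rows, values_by_gene):
    print(row)
```

In the template itself this comparison is made by eye; the sketch only makes explicit what is being compared in each row.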

Data
The sources of the gene expression data used in this paper are shown in Table 1 [10, 31-33]. The data from each study are presented in Excel format (Fig. 2A).

Consistency test of cell-type identity methods
To test the consistency of our cell identity method with the literature, the identification results were divided into three grades (excellent, satisfactory and poor) based on Table 2. Bowker's test and Kappa symmetric measures were used to test the difference and the consistency of the paired data between the two groups, respectively. For Bowker's test, P < 0.05 was considered a statistically significant difference. For Kappa symmetric measures, kappa ≥ 0.75 indicates good consistency, 0.4 ≤ kappa < 0.75 indicates general consistency, and kappa < 0.4 indicates poor consistency. The data were analyzed using SPSS software v.26 (SPSS Inc., Chicago, IL, USA).
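As an illustration of the two statistics, the sketch below computes an unweighted Cohen's kappa and Bowker's symmetry statistic from a square table of paired grades. It is a sketch only: the analysis in this paper was run in SPSS, the function names are hypothetical, and the 3 x 3 table is invented for the example and is not the data of this study. The P value of Bowker's test would be obtained from a chi-square distribution with the returned degrees of freedom, which is not computed here.

```python
def cohen_kappa(table):
    """Unweighted Cohen's kappa for a square table (rows: rater 1, columns: rater 2)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1 - expected)


def bowker_statistic(table):
    """Bowker's symmetry statistic and its degrees of freedom.

    Pairs of off-diagonal cells that are both zero are skipped.
    """
    k = len(table)
    stat, df = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            total = table[i][j] + table[j][i]
            if total > 0:
                stat += (table[i][j] - table[j][i]) ** 2 / total
                df += 1
    return stat, df


def interpret_kappa(kappa):
    """Apply the thresholds stated in the text."""
    if kappa >= 0.75:
        return "good"
    if kappa >= 0.4:
        return "general"
    return "poor"


# Invented table: grades excellent / satisfactory / poor, literature vs. our method.
example = [
    [20, 1, 1],
    [2, 3, 0],
    [1, 0, 2],
]
kappa = cohen_kappa(example)
stat, df = bowker_statistic(example)
print(round(kappa, 3), interpret_kappa(kappa), round(stat, 3), df)
```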

Results
Descriptive comparison of our method with the literature in CNS myeloid cells
Using our cell-type identification method, we identified CNS myeloid cells in the four datasets reported in the literature (Table 1). macrophages (MAC), microglia (MG), neutrophils (NEUT), dendritic cells (DC), neuronal-restricted precursors (NRP), immature neurons (ImmN), mature neurons (mNEUR), astrocyte-restricted precursors (ARP), astrocytes (AST), oligodendrocyte precursor cells (OPC), oligodendrocytes (OL), ependymocytes (EPC), and hypendymal cells (HypEPC) as "gold standard" to test our method. As shown in Fig. 3, Table 3 and Table S2, among the 14 cell clusters compared, we identified MNC as MNC (mixed with a few NEUT and DC), and NRP as proliferative cells. The other 12 cell clusters were completely consistent.

In the Supplementary
The 15 clusters of adult mouse brain from Table S3 of Han, et al. [10] were also identified, and the results are shown in Table 4 and Table S3. Among the 15 cell clusters compared, pan-GABAergic and Schwann cells were not within the scope of our evaluation, the reported cluster 4 (Macrophage_Klf2 high) was mixed with a few MG, and the other 12 cell clusters were completely consistent.
The clusters reported by Sankowski, et al. [32] were also compared. As shown in Table 5 and Table S4, among the 17 cell clusters compared, 14 were completely consistent. The inconsistent clusters were stromal cells (cluster 15), which were not within the scope of our evaluation; the reported cluster 6 (CNS-associated macrophages, CAMs), which expressed MG-specific markers; and cluster 9 (CAMs), in which the typical genes of MAC were not elevated.
Of course, our cell identification process was not all smooth sailing. When we analyzed another dataset (Table S2 of Mimouna et al. [33]), we encountered problems. In that report, Louvain graph-based community clustering was used to divide the cells into different clusters, and PanglaoDB was used to identify the putative cell and/or activation state of each individual Louvain cluster. We still identified the cell-types using our method based on the authors' data. As shown in Table 6 and Table S5, although the cell-type identification was basically consistent, the cell-types in each of the nine clusters were mixed in both the reported results and our results, which indicates that the cell clustering in this dataset is not ideal.

Descriptive comparison of our method with the literature in peripheral blood and bone marrow myeloid cells
To test whether our method is suitable for the identification of non-CNS myeloid cells, the 21 peripheral blood cell clusters and 17 bone marrow cell clusters of adult mice from Table S3 of Han, et al. [10] were also identified.
The peripheral blood results are shown in Table 7 and Table S6. Among the 21 cell clusters compared, cluster 14 (Erythroblast_Car2 high), cluster 20 (B cell_Igha high), and cluster 21 (Erythroblast_Hba-a2 high) were not within the scope of our evaluation, the reported cluster 18 (Macrophage_Pf4 high) was mixed with a few NEUT, and the other 17 cell clusters were completely consistent. The bone marrow results are shown in Table 8 and Table S7. Among the 17 cell clusters compared, cluster 3 (Neutrophil progenitor), cluster 8 (Hematopoietic stem progenitor cell), cluster 9 (Erythroblast), and cluster 15 (Mast cell) were not within the scope of our evaluation, and the other 13 cell clusters were completely consistent.
Statistical comparison of our method with the literature
According to the grading evaluation method in Table 2, we graded the results of all data analyses (Tables 3-8). Excluding the clusters that were not within the scope of our analysis (N/A), we obtained a total of 83 valid cases. As shown in Fig. 4, the numbers of excellent, satisfactory and poor results were 74, 3 and 6 in the literature, and 77, 1 and 5 in our results, respectively. The overall consistency rate was 93.98% (78/83). Bowker's test showed no statistically significant difference between the two groups (P > 0.05). Kappa symmetric measures gave a kappa value of 0.642 (P < 0.01), indicating general consistency.
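The number of valid cases and the consistency rate can be re-derived from the cluster counts given in the text. The short Python sketch below does this. The per-table totals and out-of-scope (N/A) counts are taken from the descriptions of Tables 3-8 above, with the assumption that Tables 3 and 6 contain no N/A clusters. The variable names are hypothetical.

```python
# (total clusters, clusters outside the scope of the evaluation) per table,
# as described in the text; zero N/A for Tables 3 and 6 is an assumption.
clusters_per_table = {
    "Table 3": (14, 0),
    "Table 4": (15, 2),
    "Table 5": (17, 1),
    "Table 6": (9, 0),
    "Table 7": (21, 3),
    "Table 8": (17, 4),
}

valid_cases = sum(total - not_applicable
                  for total, not_applicable in clusters_per_table.values())

agreeing_cases = 78  # number of consistent cases reported in the text
consistency_rate = 100 * agreeing_cases / valid_cases

print(valid_cases, round(consistency_rate, 2))
```

With these counts the tally gives 83 valid cases and a consistency rate of 93.98%, matching the reported values.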

Discussion
For the last few decades, although advanced techniques such as flow cytometry could be used to identify CNS myeloid cell-subtypes, accurate identification has remained difficult because of the lack of absolutely specific markers and the instability of marker expression under different pathophysiological conditions [16].
Although scRNA-Seq is a promising new technology for solving this problem (Cembrowski, 2019), the various programming-language analysis packages for scRNA-Seq data are not easy for ordinary researchers to use, and bioinformatics experts do not necessarily know the specific markers for identifying CNS myeloid cell-subtypes. Therefore, bridging the knowledge gap between ordinary researchers and bioinformatics experts is the key to solving this problem.
In this report, a simple Excel template was designed that includes a panel of gene markers corresponding to myeloid cells, lymphocytes, common CNS cells, and proliferative cells. Once the gene expression data of the cell clusters are obtained, users can name the clusters directly using this Excel template. It should be emphasized that this template is mainly suitable for determining the major categories of myeloid cells. If researchers need to further distinguish the subtypes of certain cells, the corresponding gene markers must be added. The Excel template is therefore open, and researchers can modify it or add new genes according to their needs. In selecting the gene markers, we considered not only their relative specificity but also the crossover and commonality between different cells. Therefore, in the Excel template, we defined a positive gene marker as "P", a negative marker as "N", and a marker that could be either positive or negative as "P/N" (Fig. 1 and Table S1). For example, Ptprc (the gene of CD45) is a common marker of myeloid cells and lymphocytes [34-36], so we used it to distinguish these cells from CNS non-myeloid cells (such as astrocytes, oligodendrocytes and neurons). In theory, the CD45 protein encoded by the Ptprc gene is positive in many leukocytes, but in the process of collecting gene markers and drawing up the Excel template, we found that the Ptprc gene is not expressed in every cell cluster, so we defined it as P/N. There are many similar examples besides Ptprc, which we do not list one by one; please see Fig. 1 and Table S1 for details. Although some relatively specific gene markers exist for a given cell type, we do not use a single marker or a small number of markers to identify it. Instead, we use a panel of gene markers to evaluate it comprehensively and then define it. This can effectively distinguish cell-types with similar or overlapping gene expression and ensure the accuracy of cell cluster identification.
In this Excel template, the panels contain 73 gene markers (excluding those for non-myeloid CNS cells) that can be used to distinguish myeloid cell-subtypes and lymphocytes (Fig. 1 and Table S1). For example, MNC could express Ptprc (P/N), Cd14 (P/N), Itgam (P/N), Itgax (P/N), Csf3r (P/N), Adgre1 (P/N), Ly6c1 (P/N), S100a4 (P/N), Cd68 (P), Ly86 (P/N), Ctsb (P/N), Ccr2 (P/N), Ly6c2 (P), Plac8 (P), Pf4 (P/N), Lyz1 (P), Hmox1 (P/N), F13a1 (P), Lyst (P/N), Prtn3 (P/N), Elane (P/N), and Pilra (P/N). Although several molecules (Cd68, Ly6c2, Plac8 and Lyz1) are positive (P) in MNC, they are also expressed in other cells. Therefore, there is no absolutely specific marker of MNC in this template. Nevertheless, we can still determine the cell type using comparative analysis. Typical examples can be found in Table S4 (C8 and C11). For cell-types with their own specific gene markers, it is easy to identify cell clusters using comparative analysis. Typical examples are Ms4a7, Lyve1, Cbr2, Mrc1 and Cd163 for MAC; Hexb, Olfml3, Sparc, Tgfbr1, P2ry12 and Tmem119 for MG; and Ltf, Ly6g, Mmp8, Camp, Ngp, Fcnb, Cebpe, Retnlg, S100a8, S100a9, Lcn2, G0s2 and Wfdc21 for NEUT. Of course, owing to the limitations of our knowledge background and research level, this Excel template still has some defects. For example, for DC, the expression of H2-Ab1, H2-Eb1, H2-Aa, Cd74 and Cd209a should be positive, but these markers can also be expressed in MAC and B cells. Because B cells are not myeloid cells, this can easily cause misjudgment. Therefore, we also added B cell markers to the template to help distinguish B cells from DC.
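The panel-based comparison described above is performed by eye in the template, but its logic can be sketched in Python. The scoring rule below is an assumption introduced only for illustration and is not the procedure of this paper: "P" markers count as support when upregulated, "N" markers count as conflicts when upregulated, and "P/N" markers are ignored. The MG and MAC markers are assumed to be "P" because the text lists them as typical markers. The expression values and function names are hypothetical.

```python
def score_cell_type(panel, values_by_gene):
    """Return (fraction of "P" markers upregulated, number of "N" markers upregulated)."""
    positive_total, positive_up, negative_up = 0, 0, 0
    for marker, status in panel.items():
        value = values_by_gene.get(marker)
        is_up = value is not None and value > 0
        if status == "P":
            positive_total += 1
            positive_up += is_up
        elif status == "N":
            negative_up += is_up
        # "P/N" markers carry no evidence either way and are skipped.
    fraction = positive_up / positive_total if positive_total else 0.0
    return fraction, negative_up


# Partial panels built from markers named in the text (statuses for MG and MAC assumed).
panels = {
    "MG": {m: "P" for m in ["Hexb", "Olfml3", "Sparc", "Tgfbr1", "P2ry12", "Tmem119"]},
    "MAC": {m: "P" for m in ["Ms4a7", "Lyve1", "Cbr2", "Mrc1", "Cd163"]},
    "MNC": {"Ptprc": "P/N", "Cd14": "P/N", "Cd68": "P", "Ly6c2": "P",
            "Plac8": "P", "Lyz1": "P", "F13a1": "P"},
}

# Hypothetical extracted avg_logFC values for one cluster.
values_by_gene = {"Hexb": 2.1, "Olfml3": 1.5, "Sparc": 1.2, "P2ry12": 1.8,
                  "Tmem119": 1.6, "Cd68": 0.4, "Mrc1": -0.6}

scores = {cell_type: score_cell_type(panel, values_by_gene)
          for cell_type, panel in panels.items()}
# Prefer the highest support; break ties by the fewest conflicting "N" markers.
best = max(scores, key=lambda cell_type: (scores[cell_type][0], -scores[cell_type][1]))
print(scores)
print(best)
```

A score of this kind is only a rough aid: a shared marker such as Cd68 gives partial support to MNC here even though the cluster is better explained by MG, which is why the whole panel has to be compared rather than a single gene.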
To verify the accuracy of this Excel template, 83 cell clusters from several recently reported single-cell datasets were used (Table 1). Compared with the literature, the overall consistency rate was 93.98%. Bowker's test showed no statistically significant difference between the two groups (P > 0.05), and Kappa symmetric measures gave a kappa value of 0.642 (P < 0.01). These results indicate that our method is in general consistency with the literature. Next, we analyze the possible causes of the inconsistencies.
Compared with the report of Ximerakis, et al. [31], only one cluster is inconsistent (Table 3). Our results showed that a few NEUT and DC were mixed with their MNC. A possible reason is that they took Plac8 as a specific marker of MNC, whereas Plac8 is also expressed in NEUT and DC [10]. Compared with the cell-type identities in the adult brain of Han, et al. [10], cluster 4 is inconsistent (Table 4). The reason may be that the reported cluster 4 was mixed with a few MG, because the typical microglia markers (Hexb, Olfml3, Sparc, Tgfbr1, P2ry12 and Tmem119) can be found in Table S3. Compared with the report of Sankowski, et al. [32], clusters 6 and 9 are inconsistent (Table 5). Both clusters were identified as CAMs; however, the typical genes of MAC (Mrc1, Cd163, Lyve1, Pf4, Ms4a7, Stab1, and Cbr2) were not elevated in either cluster. In contrast, MG-specific markers (Hexb, Olfml3, and Sparc) were significantly elevated in cluster 6, while the other genes in cluster 9 were not within the scope of our evaluation. Compared with the cell-type identities in the peripheral blood and bone marrow of Han, et al. [10], the results were completely consistent except that cluster 18 of peripheral blood was mixed with a few NEUT. These results indicate that our Excel template is also very effective for the analysis of non-CNS myeloid cells.
From the above analysis, we can deduce that appropriate gene markers and ideal clustering of the scRNA-Seq data are key factors for the accuracy of cell definition. The importance of cell clustering can be seen in the following example. When we analyzed another dataset (Table S2 of Mimouna et al.) [33], neither the reported results nor our results were ideal. Looking for the reasons, we found that the data clustering method differs from those of the other studies mentioned above. The cell clustering method in this study is Louvain graph-based community clustering, which may be the reason why the clustering is not ideal. Although our Excel template can still be used to identify the cell-types based on the authors' data, the cell-types in each of the nine clusters were mixed (Table 6). Therefore, the data used in this Excel template should be processed through the standard scRNA-Seq analysis workflow, including quality control, normalization, data correction, feature selection and dimensionality reduction, after which the cells are divided into different clusters according to the similarity of gene expression.

Conclusions
In conclusion, cell identification of scRNA-Seq data can be performed using our simple Excel formulae, but a panel of gene markers must be compared to obtain an accurate analysis of CNS myeloid cell-subtypes. For data with better cell clustering, this template can effectively distinguish myeloid cell-subtypes, various lymphocytes and other CNS cells. For data with poor clustering, this template can also identify various cell-types, but it would need to be further

Availability of supporting data
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Competing interests
The authors declare that they have no competing interests.

Tables
Table 1 The sources of gene expression data used in this paper
Wild-type C57BL/6J mice (SPF, female, 6-10 weeks old). Brain, blood and bone marrow. Brain was dissociated using accutase; bone marrow was treated with red blood cell lysis buffer; blood was treated with red blood cell lysis buffer or Ficoll separation. Microwell-seq; the 3' ends of the transcripts are then enriched during library generation using PCR and sequenced using the Illumina HiSeq platform. Seurat was used for dimension reduction, clustering and differential gene expression analysis. Single cell MCA (scMCA) analysis built by authors (Fig 7a).

Figure 1 Excel template design for cell-type definition
Figure 2 Excel template design for gene markers and expression extraction, and cell-type identity workflow
Figure 3 Representative results of cell type identification
Figure 4 Bowker's test and Kappa symmetric measures of literatures and our results
