A set-theoretic definition of cell types with an algebraic structure on gene regulatory networks and application in annotation of RNA-seq data

Summary The emergence of single-cell RNA sequencing (RNA-seq) has radically changed the observation of cellular diversity. Although annotations of RNA-seq data require preserved properties among cells of an identity, annotations using conventional methods have not been able to capture universal characters of a cell type. Analysis of expression levels cannot be accurately annotated for cells because differences in transcription do not necessarily explain biological characteristics in terms of cellular functions and because the data themselves do not inform about the correct mapping between cell types and genes. Hence, in this study, we developed a new representation of cellular identities that can be compared over different datasets while preserving nontrivial biological semantics. To generalize the notion of cell types, we developed a new framework to manage cellular identities in terms of set theory. We provided further insights into cells by installing mathematical descriptions of cell biology. We also performed experiments that could correspond to practical applications in annotations of RNA-seq data.


INTRODUCTION
In multicellular organisms, cells (including stem cells) diverge into different cell types during development and acquire unique functions (Pasquini et al., 2021). As a result of the process, triggered by intrinsic and extrinsic stimulations by spatially specific molecules, cell fate and function are fixed to show morphological and molecular regularities (Pasquini et al., 2021;Tanay and Sebé-Pedró s, 2021). These preserved characteristics of cells have inspired analogies to Linnaean species classification, a hierarchical classification of species. Therefore, the concept of cell types is understood as hierarchically structured sets with unique morphological and functional properties of cells in a multicellular organism (Tanay and Sebé-Pedró s, 2021).
Although the emergence of single-cell RNA sequencing (scRNA-seq) has radically broadened our understanding of the heterogeneity and dynamics of cells (de Vargas Roditi and Claassen, 2015), the classic taxonomies of cell types are nevertheless essential because cell type annotation is an inevitable step toward understanding biological phenomena (Pasquini et al., 2021).
We observed that cell type annotation in scRNA-seq data can be problematic because the transcriptional differences may not always explain all biological characters in terms of morphologies and functions (Pasquini et al., 2021). Because cells undergo a smooth transition of transcriptional states rather than a clear distinction between cells (Street et al., 2018), discussion of cellular identities that adheres exclusively to transcriptional differences may not reach convincing consequences for identical cells in different phases of pseudo-time (RNA-seq data are not time series data). Cells are annotated with specific cell types based on relative differences within a dataset. Hence, it is unclear whether annotated cells can fulfill globally recognized aspects of the cell types. Although this pitfall is avoidable when we have a dataset that is large enough to represent canonical characteristics of a cell type and cells can be compared in different datasets, direct comparison is not suitable when the experimental protocols (such as sample treatment, sequencing equipment, algorithms, and methods of regularization) are inconsistent. Considering that researchers often integrate datasets obtained by others, we require an acceptable solution to this problem for the furtherance of stem cell biology and other related fields.
In this study, we developed a new evaluation metric for the similarity of cellular identities based on the structures of specific gene regulatory networks (GRNs). This metric allows comparisons of arbitrary clusters of cells in arbitrary datasets even when the population configurations of the samples are drastically different.
To reinforce our idea, we introduced a new logic of treating a group of cells in mathematical terms by defining the term ''comparison of cellular identities.'' We also established a new method of annotation that supports our theory by introducing nontrivial background information on cellular taxonomies.

Definition of the generalized expression network and GRNs in this research
Although we can observe the biological functions of cells in countless aspects, we can only obtain the expression levels of genes in fixed timings from the RNA-seq data, and the entire picture of cellular dynamics is hidden. When a significant number of cells is present, we can assume cellular functions from the data by inference of the GRNs because regulatory interactions of genes control cellular functions, and GRNs are a typical representation of such interactions (Aalto et al., 2020). The structures of GRNs can be compared among different datasets even when they are processed differently. Therefore, in this study, we consider the status of dependencies and independencies of the genes as one dimension of cellular functions.
To study the transcriptome of cells, we first assumed a causal model that could fully describe the functions observable in the RNA-seq data in all cells and defined it as the original form of the GRN. To distinguish this GRN from other GRNs that are inferred from the data, we consider it the generalized expression network (GEN), which consists of the genes in G, a complete set of the genes in the RNA-seq data. Cells are influenced by extrinsic signals (such as morphogens, hormones, mechanical stresses, and cytokines.); hence, we assumed that a latent variable for the extracellular environment (defined as ε) exists as the parent of all genes in the causal graph of the GEN (ε can be interpreted as a Cartesian product of the exogenous variables U g , where U g corresponds to g˛G; Figures 1A and 1B)). In this case, the GEN is a directed multigraph (it leaves room for the feedback loops in the model), where the vertexes are the elements in GWfεg, and the edges reflect causality. Exogenous variables of a causal model are defined as observed or unobserved background factors that remain unexplained; they influence, but are not influenced by, the other endogenous variables in the model (Pearl, 2009).
When quantifying RNAs in samples to obtain the scRNAseq data, it is possible that protocols of sample collection Figure 1. The GEN as a universal model of causality for gene expression and the possessed cascades as observed relationships of the genes (A) Schematic of the GEN. The vertexes denote genes, and the edges denote causality. A latent variable ε for extrinsic factors is present in the center of the illustration. (B) ε as the Cartesian product of the exogenous variables for the corresponding vertex in the causal graphical model. (C) Causality blockade altering the relationship between two variables. When the variable y is unobserved, the path between x and z remains (i.e., x and z are mutually dependent on each other). In contrast, when the observation fixes the y value, the path between x and z is blocked (i.e., the two variables are mutually independent). (D) Graphical explanation of the GEN and a GRN.
have some effects on the transcriptome. In such cases, some parameters in the GEN are fixed in a certain range, and the dependency and independency of the genes are affected. For example, supposing the pathway shown in Figure 1C (g x ; g y ; g z : genes) has no dose-dependent effects, but the existence of mRNAs of those genes only controls the phenotype of cells (if g x is expressed, then g y is expressed, and if so, then g z is expressed), and g y -positive cells are collected via flow cytometry ahead of RNA-seq. In such a case, samples are limited to g x -positive and g y -positive cells, and g z is no longer dependent on g x because g x and g y do not have dose-dependent effects on g z . Because the blockade of pathways in the GEN reflects cellular characteristics, we decided to consider the individual status of the cellular functions (in terms of dependency or independency of the genes) separately from the GEN (latent and universal explanations for gene interactions) and GRN because a network of gene statistical dependencies represents cellular identities ( Figure 1D). Therefore, in this study, the GEN is a universal model that considers the causality of genes, and the GRN is a descriptive model that considers the gene dependencies and independencies that can be observed in the RNA-seq data of individual cells (ε and the edges between ε and genes can be omitted because their existence is not visible in data but is evident from the definition of ε).

Biological and computational requirements for implementation
When annotating the scRNA-seq samples, three options are available. The first approach relies on a public database of cell type-specific markers and their ontologies for a biological explanation. The second method uses labeled scRNA-seq data as a reference for correlation-based scoring of the query data. The final method uses supervised learning models that involve classification tasks of cell types (Pasquini et al., 2021). These strategies have in common the belief that only a limited number of genes are required to identify cell types. Reduction of genes is advantageous for biological and computational reasons because it helps with interpretability of the results and decreases computational costs. The proposed model offers similar benefits. Because GRNs consider pairwise relations of genes, similar advantages are received with a smaller number of genes.
In general, it is untrivial that the reduced number of genes explains all of the cellular functions required in defining the cell type in Linnaean taxonomy. To check whether cellular functions (structures of GRNs) of cell type-specific genes are preserved, we mathematically describe the operation of creating the GRN of a sample (cc˛S, where S is the complete set of the samples) from the genes of interest (cF3 G, where G is the complete set of genes), and check whether the structures of the GRNs on F are preserved between the GRNs on G and F (Figure 2A).
Before proceeding with discussion on GRNs, we define a cascade as a pair of genes denoted as fg x ; g y g gx , where cg x ; g y˛G and g x are dependent on g y (fg x ; g y g gx = fg y ; g x g gx and fg x ; g y g gx sfg x ; g y g gy ). We define a complete set of cascades on G (denoted as CðGÞ : = ffg x ; g y g gx g x ; g y˛G g) and a set of possessed cascades of c as a subset of CðGÞ, which is a set of observed cascades in c (denoted as C c ðGÞ). When we define f c : G 3 G/CðGÞ, which is in one-to-one correspondence with cc˛S, as shown below, C c ðGÞ = f c ðG 3 GÞ: cc˛S; d!f c : G 3 G/CðGÞ s:t: cg x ; g y˛G f c g x ; g y = 8 < : n if g x is dependent of g y in c fg x g gx if g x is independent of g y in cng x = g y To compose a GRN from the possessed cascades of c, we consider G (which is isomorphic to V c (G) as a set) as the vertex set and define rules to draw a directed edge from g y to g x if fg x ; g y g gx˛E c ðGÞ, where the definition of V c (G) and E c (G) are given below. It must be noted that V c ðGÞkE c ðGÞyC c ðGÞ for cl˛C c ðGÞ s:t: 1 % jlj % 2, where jlj is the cardinality of the set l ( Figure 2B If the modeler prefers explicitly showing ε in the GRN, GWfεg is considered for the vertex set, and E c ðGÞW ffg x ; εg gx jg x˛G g is suitable for the edge set.
When we map F 3 F, which is a subset of G 3 G, by f c , we obtain C c ðFÞ = f c ðF 3FÞ3f c ðG 3GÞ = C c ðGÞ , and we can compose the GRN on F under the same rule by defining V c ðFÞ and E c ðFÞ as follows: If V c ðFÞ3V c ðGÞ, E c ðFÞ3E c ðGÞ, and C c ðFÞ = ff c ðg x ; g y Þj cg x ; g y˛F g (by its definition), then the GRN on F is an induced subgraph by F of the GRN on G ( Figure 2B). The same can be applied when ε is included in the GRNs, simply by considering FWfεg as the vertex set and E c ðFÞWffg x ; εg gx jg x˛F g as the edge set, and the GRN on F is the induced subgraph by FWfεg of the GRN on G with ε. Therefore, when the modeler focuses on a certain subset of genes, such as cell type-specific genes, its function (edges in the GRN) is preserved. In other words, the structures of GRNs are not changed by ignoring all of the irrelevant genes. Hence, the modeler can remove thousands of genes from the GRNs without losing any information about the cellular functions of the genes of interest.
Although this explanation reveals the effectiveness of GRNs for selective genes, it is still unclear how the ignored genes ðch˛H : = G À FÞ are to be treated. If the ignored genes can be treated as exogenous factors, then they can be treated as ε, which is a nontrivial conclusion. To elucidate this point, we introduced an algebraic structure called tropical semiring (Grigg and Manwaring, 2007) to each cascade and defined mathematical operations to treat certain genes as equiv-alent to ε by considering ideals on the semiring ( Figure 2C).
Tropical semirings are algebraic structures in tropical geometry that are applied in theoretical computer science among other fields. They have been named in honor of the Brazilian computer scientist Imre Simon (Katz, 2017;Pin, 2010;Simon, 1978). Further information on tropical semirings is provided in the Supplemental experimental procedures (Katz, 2017;Mitchell and Sinutoke, 1982).
In this study, into l g Wfεg (where cl g˛C ðGÞ; d!g˛l g ; cg x˛lg s:t: g = g x n g is dependent on g x ), we introduced two binary operations, addition (denoted as 4 g ) and multiplication (denoted as 1 g ), so that the tropical semiring is defined as follows: cl g˛C ðGÞ, ðl g Wfεg; 4 g ; 1 g Þ is a commutative semiring, and z g : l g Wfεg/f0; 1; Ng, defined below, is a semiring homomorphism when cg x˛lg Wfεg, and ðf0; 1; Ng; 4; 1Þ is a tropical semiring. Additional information on commutative semirings and semiring homomorphism is provided in the Supplemental experimental procedures (Pin, 2010;Zumbrägel and Zelmanov, 2008).
Next, we introduced ideals of the semirings to describe mathematical operations to consider ch˛H and ε as exogenous factors (i.e., ε) without losing information related to cg x˛F by utilizing properties of quotient semirings and the natural homomorphism. Additional information on quotient semirings, natural homomorphism, and ideals of a semiring are provided in the Supplemental experimental procedures (Allen, 1969;Zumbrägel and Zelmanov, 2008).
Prior to the construction of ideals, we define a map that shows one-to-one correspondence to cl˛CðGÞ as follows: cl˛CðGÞ; d!i l : GWfεg/lWfεg s:t: Using this map, we define an ideal I lg on a semiring ðl g ; 4 g ; 1 g Þ as follows: where cl g˛C ðGÞ; d!i lg : GWfεg/l g Wfεg and ch˛HWfεg.
Using the definition of I lg , a natural homomorphism p lg : l g Wfεg/lðgWfεgÞ=I lg can be defined. For a better understanding of p lg , the fiber of ½ε, p À 1 lg ð½εÞ = fg x˛lg W fεg p lg ðg x Þ = ½εg, is defined as follows: where d!g˛l g Wffεg;g sε is the neutral element of multiplication of the tropical semiring ðl g Wfεg; 4 g ; 1 g Þ. A conceptual diagram of the image of p À 1 lg is shown in Figure 2D. Given that, we defined C c ðGÞ, V c ðGÞ, and E c ðGÞ as follows: Drawing a directed graph according to the rules above, the graph coincides with the GRN on F. C c ðGÞ À f Bg = C c fFg, and the definitions of V c ðGÞ and E c ðGÞ correspond to those of V c ðGÞ and E c ðGÞ, respectively. As the series of mathematical operations gives rise to the GRN on F, it is proven that the ignored genes are treated equivalent to exogenous factors.
So far, we discussed the effect of gene elimination from GRNs on the GRN structures. Originally, this topic was required in the interest of biological and computational limitations. Regarding limitations of GRN implementation, directionality of the graph is an issue. Considering a gene g x and its regulatory gene g y (e.g., a transcription factor) in the GEN, a directed edge must exist from g y toward g x because there is a causality between them, and there is no edge in the opposite direction unless the feedback loop exists. In contrast, it is fairly possible that an edge from g y to g x and one in the opposite direction exist in the GRN because an edge in the GRN represents statistical dependency. In such a case, the opposite edge has no biological relevance. Hence, undirected edges can be more interpretable rather than directed edges in some cases. There is a technical benefit in converting directed into undirected edges. To investigate statistical dependencies, the Peter and Clark (PC) algorithm of Bayesian network structural learning (Spirtes et al., 2000) is a suitable option for creating GRNs. However, a Bayesian network is a directed acyclic graph (DAG) with a conditional probability distribution in each edge (Cheng et al., 1997), whereas a GRN is not confirmed to be a DAG; therefore, an ad hoc process is required to dismiss the directionality of edges ( Figure 2E). Because the process is merely equivalent to application of equivalence relations fg x ; g y g gx $ fg x ; g y g gy to C c ðGÞ, the existence or absence of the edges between two arbitrary genes is stably preserved through the process. Undirected GRNs lose resolution to distinguish between the different cells when there is no other way to discern between them other than the directionality of the edges ( Figure 2F).
Consequently, we checked the biologically and computationally preferred operations for creating GRNs, subsequently establishing a basis to compare cellular identities with the GRNs of cells ( Figure 2G).
Cell classes in scRNA-seq data: Data-driven definitions of cells' identities In Figure 3A, an example of a workflow in RNA-seq data analysis is illustrated. First, samples are classified using certain methods, followed by elucidation on the groups. The choice of method matters because we prefer to pay attention to the biological and computational aspects in clustering (Kiselev et al., 2019). In some cases, we stratify them with supervised labels (e.g., information on cell lines, patients, drug treatment, etc.) instead of clustering. The expected result of the clustering or stratification is a state where the groups share some properties within themselves. In other words, we classify samples into equivalent classes under a certain configuration of the equivalent relation. The equivalent classes have an explicit reasoning for classification in data including labels; therefore, we call them cell classes to distinguish them from empirical cell types. A cell class can be named after a cell type (e.g., neurons), but this does not imply that the cell class is a necessary and sufficient set to represent the cell type. Hence, we need a new concept for cells of an identity.
Subsequently, the cell classes are compared in certain metrics. Because it is not self-evident that diversity in morphologies and functions can be seen as distances of the scRNAseq data (Pasquini et al., 2021), in this study we aim to develop a new metric based on the dependencies of genes, the GRN. A good method of evaluating cell similarities has not yet been developed. The optimal metrics for comparing single cells or cell classes may differ ( Figure 3B). Therefore, we established an evaluation metric for cell classes.
First, the notations to be used in the following discussion are set. If we describe the process of creating a GRN (regarding cF3G) of a cell as a morphism, 4 : S/ PðCðFÞÞ is a suitable definition, where PðCðFÞÞ is the power set of CðFÞ, for cc˛S; C c ðFÞ3CðFÞ ( Figure 3C), provided, similarity of cell c and c 0 can be interpreted as similarity of 4ðcÞ and 4ðc 0 Þ. We define a morphism to classify samples to cell classes, where cc; c 0˛S ; c $ c 0 : 5c and c 0 belong to the same cell class, as follows: It must be noted that $ can be configured arbitrarily and defined independent of 4 because the cell classes can be defined arbitrarily and separately from the GRN structures.
Subsequently, we define a metric to compare the identities of the single cells based on the GRN structures and attempt to expand it to cell classes. When comparing the single cell identities by the GRN structures, we match the edges in the GRNs (the operation can be described as dðc; c 0 Þ = j4ðcÞW4ðc 0 Þj À j4ðcÞX4ðc 0 Þj, where d : S3S/R is the pseudo-metric (not a distance function because dðc; c 0 Þ can be 0 even when csc 0 ), and cc; c 0˛S ; R is the set of real numbers. Therefore, it is equivalent to the comparison of character strings where the positions in the strings correspond to the index of edges, and the numbers represent the existence or absence of edges. Hence, we define the evaluation metric of single cells based on the GRN structures as the Hamming distance of the GRN structures ( Figure 3D). Additional information on Hamming distance is provided in the Supplemental experimental procedures (Bookstein et al., 2002).
The where cc˛S; d is the Hamming distance function, and R > 0 is the set of positive real numbers. ðS; O Þ is a pseudometric space. Additional information on the topological space, open ball, and pseudo-metric space is provided in the Supplemental experimental procedures (Herrlich and Keremedis, 2015;Manetti, 2015).
Subsequently, we present a discussion on cell classes by considering images of cell classes under 4 in variable cases. Supposing that, when the definition of cell classes coincides with the GRN (that is, cc˛S; xðcÞ = fc 0˛S j4ðc 0 Þ = 4ðcÞg; Figure 3E), 4ðxðcÞÞ = f4ðcÞg. Hence, when we define O : = fo 3xðSÞ x À 1 ðoÞ˛O g, where x À 1 ðoÞ is the fiber of o, ðxðSÞ; O Þ is defined as a metric space because the metric reflection (the equivalence relation c $ c 0 : 5dðc; c 0 Þ = 0) has been introduced in the pseudo-metric space (Herrlich and Keremedis, 2015). Therefore, in this case, the Hamming distance is suitable for cell classes as well.
We consider the second case ( Figure 3F), where d!½c˛xðSÞ; c½c 0 ˛xðSÞ; ½cs½c 0 s:t: cc 0 1 ; c 0 2˛½ c 0 ; dc 1 ; c 21 c; 4ðc 0 1 Þ = 4ðc 0 2 Þ = 4ðc 1 Þs4ðc 2 Þ (one cell class covers all patterns of the GRNs, and the other cell classes have only one and unique GRN structures). In this case, the quotient space ðxðSÞ; O Þ is an indiscrete space because O : = fo 3xðSÞ x À 1 ðoÞ˛O g = fB; xðSÞg. Hence, the Hamming distance is not suitable for the cell classes in this case. Although the second case may be an impractical proposition, the Hamming distance is not always suitable for evaluation of cell classes because, in certain cases, the GRN structures within a cell class are not uniform, and the situation makes it impossible to define the Hamming distance of the GRNs ( Figure 3G). To resolve this problem, we propose a hypothesis, as described below ( Figure 3H), that is also preferred in terms of implementation because the statements capture the properties of statistical tests of independence in such a way that the GRNs can be surmised by checking the dependencies and independencies of genes from a collective expression matrix of a cell class, not single cells.

If a certain proportion of cells in the cell class do not
have an edge with respect to specific genes, then the edge does not exist in the GRN of the cell class. 2. If an edge exists in the GRN of a cell class, then it also exists in the GRNs of a major proportion of cells belonging to the cell class.
The vague descriptions of ''certain proportion'' and ''a major proportion'' are used to adopt uncertainty of the statistical tests of independence. We adopt Pgmpy, a Python package for Bayesian networks (Ankan and Panda, 2015), where the chi-square test is used for the categorical variables, and hypothesis testing for the Pearson correlation coefficient is used for the continuous variables; both models hypothesize the independence (or no correlation) of variables as the null hypothesis. The first statement considers a situation where roughly evenly sized populations with and without edges are concatenated as one cell class, and the null hypothesis may hold for the cell class, although the alternative hypothesis holds in the subpopulation with the edge. In contrast, the second statement claims that there must be a sufficient number of cells with the edge in the cell class to defy the null hypothesis. Because the concrete threshold to fulfill them differs depending on the dataset and p values, we decided to describe it as above.
Given the statements, we can define a set of cascades for a cell class, called the eigen-cascades of the cell class. We denote the eigen-cascades on F of a cell class ½c as C ½c ðFÞ. If we set a small p value and allow minor exceptions by considering them as noise, then we can regard C ½c ðFÞ as X c˛½c C c ðFÞ ( Figure 3I).
Next, we present a discussion on the suitable metric for similarities of the cell classes. Although we have eigen-cascades, we cannot directly apply the Hamming distance because some eigen-cascades can include other eigen-cascades as a subset ( Figure 3J). When we consider the eigen-cascades of a cell class as a set of required aspects of the cell class, a cell class ½c is a subset of a cell class ½c 0 when the eigen-cascades of ½c are a superset of the eigen-cascades of ½c 0 , in the same way as a ring is a semiring, but the opposite is false. Therefore, we need an asymmetric evaluation function that expresses this asymmetric inclusion relation between ½c and ½c 0 .
Here we propose an evaluation metric for cell classes d Ã : xðSÞ3xðSÞ/R as follows: d Ã ðxðcÞ; xðc 0 ÞÞ : = 1 À j4 Ã ðxðcÞÞX4 Ã ðxðc 0 ÞÞj j4 Ã ðxðcÞÞj = 1 À C ½c ðFÞXC ½c 0 ðFÞ C ½c ðFÞ ; where cc; c 0˛S , and 4 Ã : xðSÞ/PðCðFÞÞ is a morphism that maps cell classes to the corresponding eigen-cascades. It is a quasi-pseudo-metric that is not continuous from the pseudometric space; that is, the topological space of the single cells. Additional information on quasi-pseudo-metrics is provided in the Supplemental experimental procedures (Kũnzi, 1992).
Regarding the biological merits of considering cell classes, arbitrary cell classes can be strictly defined with a set of cascades and can be defined flexibly by the modeler by adding the desired cascades (such as previously known cascades, cascades of interest to the authors, and ones that are found in data analysis) as the eigen-cascades of the cell class to highlight the characteristics of the cells. The concept of the GRN and cell classes can be applied in various ways. We list some examples below: 1. Inference of similarity: comparison of cell classes on eigen-cascades 2. Annotation: comparison between cell classes in referential and query data 3. Multifaceted description: definition of cell classes in multiple criteria

Implementation of cell class annotation
To demonstrate some practical examples of application of GRNs and cell classes, we implemented annotation of RNA-seq data based on the GRN structures. The process consists of three steps: formation of cell classes, feature selection for the GRNs, and comparison of the eigen-cascades ( Figure 4A). To capture the canonical features, we used two datasets for the reference and query. We also performed manual annotation using a conventional approach ( Figure 4B), comprising dimensionality reduction, clustering, searching differentially expressed genes (DEGs), and interpretation of clusters based on the DEGs, aiming to validate the results and compare the performance with our method.
In the annotation with the GRNs, cell classes in the referential and query data are compared with respect to their eigen-cascades; hence, both datasets must be divided into clusters using various methods. Because the model of GRN itself does not require a specific approach for defining the cell classes, any method and tool are suitable. In this study, because we preferred to reflect the Linnaean hierarchical taxonomies in the data-driven methods of annotation, we used factor analysis and k-means clustering, aiming to explain the data using latent variables that can be interpreted in domain-specific aspects. First, a query dataset is processed using factor analysis on cell type-specific markers, as described in previous studies.
Second, we selected genes to compose the GRNs. To characterize cell classes with biologically meaningful models, cell classes should be defined to represent certain cell types, and accordingly, GRNs should include characteristic genes as vertexes. Among the various methods available for feature selection, we selected marker genes that showed high communalities in the referential dataset and genes with high feature importance (FI) values in a gradientboosting decision tree (GBDT) model with L1 and L2 regularizations to predict cell types to salvage the aspects missing from the previously reported markers. We used cell type information attached to the referential data for the ground truth of the machine learning (ML) model. We adjusted the class imbalance by random under-sampling to treat all cell types as equally important ( Figure 4C).
The final step is to compare the GRNs of the referential and query data. To calculate the GRN structure, we used the PC algorithm and converted the GRNs into undirected graphs. For the evaluation metric of the comparison analysis, we applied d Ã . c½c˛xðSÞ, C ½c ðFÞ = V ½c ðFÞ + E ½c ðFÞ , where V ½c ðFÞ and E ½c ðFÞ are defined as follows to indicate the vertexes and edges excluding ε and the corresponding edges: In the application of GRNs annotation, we decided to relate pairs of cell classes in the referential and query data that minimized the d Ã values. Considering d Ã is an asymmetric function, when S r is the referential data, S q is the query data, cc r ; c 0 r˛S r and cc q ; c 0 q˛S q are cells, and ½c r ; ½c 0 r 3S r plus ½c q ; ½c 0 q 3S q are cell classes, comparisons of cell classes in the reference and query data can be performed in multiple ways, which are described in the following section ( Figure 4D).
1. d Ã ð½c r ; ½c 0 r Þ or d Ã ð½c q ; ½c 0 q Þ: similarities of cell classes within the same dataset (the operation is named ''inference'') 2. d Ã ð½c r ; ½c q Þ: similarities from the perspective of cell classes in the referential data (the operation is named ''labeling'') 3. d Ã ð½c q ; ½c r Þ: similarities from the perspective of cell classes in the query data (the operation is named ''estimation'') In this study, we performed labeling and estimation.

Annotations in human brain samples
As the first demonstration of actual implementation, we conducted annotation in the scRNA-seq data of normal cells of the central nervous system (CNS), assuming a situation where we desired to annotate excitatory and inhibitory neurons in the scRNA-seq data from human fetal brain. For the referential data, a dataset named Human M1 10x (denoted m1_10x for short) from the Allen Institute for Brain Science, Seattle, Washington (https://portal. brain-map.org/atlases-and-data/rnaseq/human-m1-10x) was selected because it contained abundant samples. For the query data, GSE165388 collected by Yu et al. (2021) (scRNA-seq data from the subpallial tissues in human fetal brain from gestational week 9 [GW9] to GW12, when interneuron neurogenesis is actively taking place) was selected.
Although the authors of the original study integrated datasets from GW9-GW12 (denoted as gw9, gw10, gw11, and gw12), we processed them separately to avoid overlooking batch effect.

Manual annotation by the conventional method
First we conducted manual annotations using the conventional method. We processed gw9-gw12 to identify 2,000 variable features each. The matrices were decomposed by truncated singular value decomposition (SVD), and the dimensionality of the data was estimated by parallel analysis (Figures S1A-S1D) with randomly permutated data matrices as null models (Ledesma and Valero-Mora, 2007;Ledesma et al., 2015). After clustering was performed using the shared nearest neighbor (SNN) algorithm, and DEGs were identified by Wilcoxon rank-sum test, the clusters were annotated after the DEGs with top 10 log fold change (FC) values ( Figure S1E). To cope with the arbitrariness and subjectivity in interpreting DEGs by predefining clear criteria, we collected 90 marker genes in total for glia, astrocytes, oligodendrocytes, oligodendrocyte precursor cells  (Jain et al., 2022;Lier et al., 2021;Lu et al., 2019;Murthy et al., 2014;Wright-Jin and Gutmann, 2019;Yu et al., 2021;Zhang and Jiao, 2015). We discarded all genes, excluding the genes listed above (when multiple genes that represented different cellular identities were suggested as DEGs for a cluster, the gene with a higher log FC value was adopted). The expression patterns of the representative genes were also checked for confirmation ( Figure S1F). Although some clusters were annotated with the same identity, the expression patterns of the DEGs differed (the top 3 genes of the highest log FC values for each cluster are shown in Figures S2A-S2D). Hence, it could not be ascertained from only these results whether we could consider clusters of a name as homogeneous.

Feature selection using GBDT
We constructed a GBDT model with L1 and L2 regularizations to predict the cellular identity from the expression matrix while the cell type information attached to m1_10x was treated as the ground truth. Prior to constructing the ML model, we verified the accuracy of the classification on m1_10x. The samples were annotated as GABAergic, glutamatergic, or non-neuronal ( Figure 5A), and the classification appeared to reflect canonical taxonomies because the expression patterns of the representative marker genes corresponded to the labels ( Figure S3A). Because the aim of the ML model was to identify a minimal set of genes that could computationally behave as good features of the canonical taxonomies, we rebalanced the sample sizes in each category by random undersampling to avoid underestimating the minor classes ( Figure S2B). Upon resolving class imbalance, we randomly separated the dataset into training and test data. Subsequently, to reduce computational costs, the top 1,000 genes of the highest standard deviation (SD) of the mean within a class were selected as features in the GBDT model and were confirmed to include the canonical marker genes ( Figures 5C and 5D).
Subsequently, a GBDT model for multiclass classification to predict labels of m1_10x with 5-fold cross-validation for hyperparameter tuning was constructed. We used ''Strati-fiedGroupKFold'' of scikit-learn to capture the universal features of the cell types shared among the subclasses. The labels of subclasses were also attached to m1_10x ( Figure 5E). Consequently, GAD1 and GRIP1 were the only genes that did not score 0 in FI in any fold ( Figure 5F). Because the top 15 genes with the largest FI showed high ranks of SD of the group-wise mean values, the result suggested that the feature engineering was appropriate ( Figure S3B).
The performance of the GBDT model was evaluated in the one-versus-rest (OvR) classification using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve and average precision (AP) of the precision-recall (PR) curve. The macro average of the AUC and the micro average of the AP were also calculated ( Figures 5G and 5H). Because the GBDT model showed perfect scores, the classification function in m1_10x was considered to be very simple. When another GBDT model was constructed using only GAD1 and GRIP1 as the features, the new model also scored significantly high in each metric ( Figures 5I and 5J). Thus, it could be assumed that the genes reflected the differences in the phenotypes of the cells to a certain degree, and it was reasonable to include them in the GRNs. GAD1 codes glutamic acid decarboxylase, which catalyzes synthesis of gamma-aminobutyric acid (GABA) (Edelhoff et al., 1993), and GRIP1 has been reported to correspond to learning and memory via regulation of synaptic plasticity (Tan et al., 2020).
The ML model is ineffective for automatic annotation because the category distribution is drastically distorted from that of the canonical states of CNS cells in the process of undersampling ( Figure S3C).

Exploratory factor analysis on m1_10x
We performed a factor analysis of the resampled m1_10x data to identify suitable features for GRNs and verify that factor analysis is appropriate for scRNA-seq data. In exploratory factor analysis, the model should be made up of comprehensible and representative variables. Therefore, we used the 90 marker genes and calculated the Kaiser-Meyer-Olkin (KMO) criterion; the measure of sampling adequacy (MSA) was calculated for each variable (Figure S4A), and genes with an MSA of less than 0.6 were removed. Using parallel analysis, the initial number of factors was determined ( Figure S4B); then, the factors with small loadings (absolute value < 0.5) were eliminated from the model. Latent factors are not guaranteed to be orthogonal; hence, we selected the quartimin rotation. In the final iterative computation results, marker genes of common cellular identities showed similar values of factor loadings ( Figure S4C). Communalities and uniquenesses were calculated, and variables with high communalities were considered to reflect the effects of latent variables underlying the phenomenon ( Figure S4D). The original matrix was linearly converted into a matrix with a lower dimensionality from which we could biologically interpret the bases; hence, we assumed that distances between the samples reflect the similarities of the biological characteristics. Therefore, we applied k-means clustering to the factor score matrix, and the k value that maximized the mean silhouette coefficient was considered as providing the optimal number of clusters ( Figure S4E). Because the mean is a poor statistic when cluster sample sizes are imbalanced, we calculated the SD of the silhouette coefficient ( Figure S4E) and the entropy of the categorical distributions (by taking the sample ratios as parameters of the distribution) to evaluate class imbalance (Figure S4F). The elbow plot and difference of inertia are also shown to provide extra verification of the optimal k values' validity ( Figure S4G). The optimal clustering results are shown as silhouette and scatterplots (Figures S4H and  S4I) and interpreted as properties of clusters (based on coordination with biological relevance for each axis) ( Figure S4J).
Although the clusters in the factor scores seemed to qualitatively correspond to the metadata for m1_10x, we used the conventional method to quantitively compare clustering performance. Dimensionality reduction via truncated SVD was performed, and the dimensionality of the output was determined via parallel analysis ( Figure S5A); after clustering via SNN ( Figure S5B), DEGs were identified by Wilcoxon rank-sum test ( Figure S5C). Some mismatches occurred in the DEG expression patterns within a cell type (as given in the metadata for m1_10x); hence, the consistency of the partition (rather than the biological interpretation of clusters) was compared between the proposed and conventional clustering methods (Figures S5D and  ). The factor analysis (FA) results showed a generally superior adjusted Rand index (ARI) and adjusted mutual information (AMI) scores compared with the conventional method (denoted as DEG); hence, we conclude that our method of clustering is more robust to changes in the sample ratio. Even though the higher AMI values for DEGs and subclasses suggest that the clustering of the conventional method was more similar to the subclass partitioning, we could not determine whether the clusters were similar to the subclasses because of the difference in DEG patterns. DEG patterns differ between clusters in m1_10x and gw9-gw12; hence, other approaches (besides computing DEGs) are required to compare the cellular identities of cell classes in different datasets.
FA and clustering for GSE165388 data FA of the marker genes produced comprehensible and robust results; hence, we adopted the same procedure for GSE165388 before creating GRNs.
Variables were selected by referring to the KMO index and MSA, and the initial numbers of factors were determined via parallel analysis ( Figures 6A-6D); then, final versions of the factor loading matrices for quartimin rotation ( Figures 6E-6H) were obtained by recursively eliminating unnecessary factors. The communalities and uniquenesses of the variables were calculated for model evaluation ( Figures S6A-S6D). After confirming that the marker genes for the same cellular identities showed similar factor loading values for each factor, k-means clustering was performed ( Figures 6I-6L), and k values were selected after multiple verifications using the mean/SD of the silhouette coefficients, the entropy of the categorical distributions, and elbow plots ( Figures S6E-S6P). The results are shown as scatterplots (Figures 6MÀ6P), and we qualitatively checked the biological properties of the clusters by verifying their coordinates ( Figures S6Q-S6T); in general, no contradictions of the DEG results were observed.
GRNs and annotation methods based on their structural similarities We implemented GRNs by applying the PC algorithm and compared their structural similarities using the evaluation function d Ã proposed above. We implemented labeling (annotation-centering cell classes in the referential data) and estimation (annotation-centering cell classes in the query data) using the given GRNs.
For vertexes in GRNs, we selected genes with communalities exceeding 0.5 (AQP4, GFAP, GJA1, PDGFRA, RBFOX3, SATB2, CUX2, VIP, GRIN2B, SOX6, NCAM1, and GAD2) in the FA for m1_10x as well as the important features of the GBDT model (GAD1 and GRIP1), anticipating that (1) genes with high communalities would express essences of the FA model that inherited biological functionalities and (2) the important features in the ML model would play crucial roles in discerning CNS cells. When labeling, we limited the vertexes used for comparison with cell type-specific marker genes as well as the important features of the ML model ( Figure 7A). In contrast, we used all vertexes in estimation because the identities of the centered cell classes were unknown and we needed to match the vertexes to compare GRN structures ( Figure S7A). We used appropriate induced subgraphs for labeling, although the GRNs are depicted with all genes in Figures 7B-7E.
For visualization of the GRN structural similarities, we present circular plots with focused cell classes in the centers of the circles; the d Ã values are shown as the radii of the circles ( Figures 7F-7H and S7B-S7E). We refer to these as ''planet plots'' because they resemble the solar system. We defined the labeling or estimation as a process of choosing cell classes that minimize d Ã . Cell classes on the innermost circle were considered most similar to the cell class in the center of the plots.
The goal of annotation was to find GABAergic and glutamatergic neurons in GSE165388. When we aimed to select cell classes in the query data that were most similar to GABAergic or glutamatergic neurons in the referential data, labeling was suitable. When the aim was to choose a suitable name from GABAergic and glutamatergic neurons for each cluster in the query data, estimation was preferable.
The validity of the labeling results varied according to which cell class was centered. Assuming the characters of cell classes in the query data from the DEG and factor scores, NSCs and NPCs were frequently chosen as the cell classes most similar to the GABAergic neurons in the referential data ( Figure 7A). The results for glutamatergic neurons seemed more promising because NSCs, NPCs, and excitatory neurons were frequently suggested ( Figure 7B). Regarding non-neuronal cells, the results were meaningless because a variety of cell classes were suggested as similar thereto. The imbalanced comparison itself might be problematic in that either class exhibits components of much greater diversity than the other one, such as GABAergic neurons versus non-neuronal CNS cells because, in such a case, similar and incompatible features can exist simultaneously in the larger class ( Figure S7F).
The estimation performance was generally better than the labeling one: for inhibitory neurons (CGE, LGE, and MGE), GABAergic neurons or GABAergic plus nonneuronal linages were suggested; for excitatory neurons, glutamatergic neurons were selected ( Figures S7A-S7D). In some cases that were difficult to determine (e.g., conjugated clusters of excitatory and inhibitory neurons or a cell class of NSCs), GABAergic and glutamatergic neurons were selected. The problem for non-neuronal cells also arose in estimation because these cells were in the innermost circles in many cases; however, the cell class was useful for estimating the cell classes of microglia and nonneuronal linages (i.e., microglia or RBCs).
Consequently, GRN-based annotation allowed us to compare the identities of cell classes by visualizing canonical concepts and properties shared between different datasets.

DISCUSSION
In this research, we developed a cellular identity representation that can be compared beyond the batch effects of different datasets, and we added mathematical descriptions to make the new concept more generally applicable. This can introduce quantitative elements into the empirical perspective of Linnean taxonomies while also adhering to biologists' belief of a certain universality in cell types. Numerical features of gene expressions can be diverse in high-dimensional matrices even when the biological semantics coincide; hence, we focused instead on the interactions of genes because we believe the latent mechanisms of a phenomenon to be reproductive in natural science. We also excluded variables that domain-specific knowledge cannot explain. Therefore, we needed to engage with the model to find the background structures of the data. When the model was established, we tried to remain unbiased during the deterministic procedure of annotation; hence, we selected a data-driven method. DEG-based annotation identifies features in data-driven ways and requires expert supervision during the deterministic process; our proposal does the opposite ( Figure 7I). We emphasize that we can combine DEGs with GRNs and vice versa. DEGbased annotation is superior to our method of identifying unexpected features in data, and GRN comparisons are good at identifying shared information regarding cell classes. Because their domains of responsibility differ, they are not always mutual alternatives.
As noted above, GRN structure comparisons between different scales of categories can fail. We need to specify an evenly scaled control for adequate comparison, even when differences exist between them. This represents a fairly general restriction when comparing multiple things, not only in biological research; for instance, we cannot critically compare the city of Paris and the entire region of Asia without designating a representative city.
Several limitations were noted, as follows: 1. The choice of the algorithm. We used the PC algorithm to reduce computational costs; this suffers from a drawback, and several alternatives are available. The PC algorithm can be unstable when removing edges from the graph; to improve result reliability, the procedure of generating the graph should be supplemented with an algorithm to repair disconnected edges (Spirtes et al., 2000). A promising alternative is max-min hill-climbing (MMHC), which combines constraint-based local learning and scoring (Tsamardinos et al., 2006). MMHC is attractive because it can, in general, outperform other algorithms in practical tasks and is a down-to-earth option already available in specific tools such as Pgmpy. 2. Uncertainty in the hypothesis. We defined a model to handle cellular identities by starting the discussion with hypotheses yet to be proven. Outside of our model, the definition of a cell type has been controversial ever since single-cell technologies began to provide detailed information about cellular status. A paradigm shift has emerged in the classic notions of cell types, which are poorly defined but functional taxonomies (Clevers et al., 2017). To contribute to a solution to the debate, we strictly re-defined cellular identities as cell classes on hypothetical GRN models. Our model assumes that the cellular identity can be discussed using the dependencies and independencies of genes, whereas the innate GEN varies under observation. Hence, further discussions and experimental proof are required. 3. Time resolution. Analysis of non-time-series RNA-seq data suffers from limitations, and the new methods we proposed in this study also exhibit the same drawbacks. It is nearly impossible to perfectly identify the cellular dynamics of every single cell using a snapshot of cells'' multifaceted aspects. At best, we could assume no fluctuation in experimental conditions and assume that the model can track temporal changes by tracing samples using continuous curves. Under definitions of cell class and eigen-cascades, heterogeneous cell populations can be divided. Hence, we require a nontrivial operation to collect multiple cell classes into one unit when treating them as identical a group of cells that dynamically drift around the ''attractor'' of the cellular lineage (e.g., each phase of the cell cycle).
Although several issues remain, GRNs and the concept of cell classes are applicable in certain studies pertaining to the canonical state (e.g., patient-derived samples) and simplified models (e.g., cultured cells, model animals, etc.) because cell class annotation compares the structures of GRNs but not the raw values of the expression. This will help scientists to explicitly describe what aspects of the cells in their samples are universal.

Resource availability Corresponding author
Further information and requests for resources and reagents should be directed to and will be fulfilled by the corresponding author, H.O. (hidokano@keio.jp).

Materials availability
Not applicable.

Data and code availability
All of the code we used has already been deposited in a GitHub repository (https://github.com/yo-aka-gene/algebra_ver822), and the sources are listed in the Experimental procedures section and in the figure legends.

Data preprocessing of adult human brain samples
The scRNA-seq data for an adult human brain were downloaded from the website of the Allen Institute for Brain Science (https:// portal.brain-map.org/atlases-anddata/rnaseq/human-m1-10x), and we normalized the expression matrix into read per million (RPM) and converted it further into log2(RPM+1). The code was implemented in Python packages (Numpy and Pandas).

Data preprocessing of fetal human brain samples
The scRNA-seq datasets for fetal human brains (GSE165388) were downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc. cgi?acc=GSE165388 and processed. Quality control using mitochondrial genes and ln(RPM+1) conversion were performed. The code was implemented in an R package (Seurat).

AUTHOR CONTRIBUTIONS
Y.O. obtained funding, designed the model, performed the experiments, analyzed the results, and wrote the paper. Y.K. obtained funding and edited the paper as an instructor. H.O. obtained funding, edited the paper, and supervised the project. All authors approved the final manuscript.