Mouse-Geneformer: A Deep Learning Model for Mouse Single-Cell Transcriptome and Its Cross-Species Utility

Abstract


Introduction
Single-cell RNA sequencing (scRNA-seq) is a powerful technique that quantifies gene expression profiles at the individual cell level [1]. Recent technological advances in scRNA-seq have facilitated the rapid expansion of transcriptomic data and enabled the simultaneous analysis of thousands of single cells. This capability has significantly enhanced our understanding of developmental processes and disease mechanisms by revealing previously hidden heterogeneous cellular populations and novel cell types. scRNA-seq is being applied to a wide variety of experiments across different organisms, resulting in a rapid growth of scRNA-seq databases. These extensive datasets offer tremendous potential for collective use, providing valuable insights into genetic architecture and furthering our knowledge of cellular and molecular biology.
Deep learning, a recent advancement in artificial intelligence, has been successfully applied to numerous problems involving large datasets, and has emerged as a promising tool for analyzing scRNA-seq data [2][3][4][5][6][7][8][9]. This approach is particularly effective at extracting valuable information from noisy, heterogeneous, and high-dimensional transcriptome data. Among the successful methods in this domain is Geneformer, a pretrained model that employs a Transformer Encoder architecture [10], similar to BERT [11], a widely used attention-based deep learning model in natural language processing. Geneformer uses the attention mechanism to calculate the relationships between genes and the context within "cell sentences", enabling it to capture context-dependent genetic network dynamics. By fine-tuning Geneformer on specific downstream tasks using limited data, researchers can achieve accurate cell type classification. Additionally, Geneformer facilitates in silico simulations of gene manipulation experiments, thereby streamlining the identification of disease-causing genes and advancing our understanding of genetic networks and disease mechanisms.
The mouse, Mus musculus, is the foremost mammalian model for studying human biology and disease [12]. Extensive knowledge has been accumulated about mouse physiology, anatomy, and genetics. Methods for genetic manipulation, such as creating transgenic, knockout, and knockin animals, which are ethically and technically challenging in humans, have significantly advanced mouse research. These tools have led to a dramatic increase in the use of mice as model organisms.
scRNA-seq experiments have also been actively applied to mouse studies, resulting in a rapid accumulation of scRNA-seq data. In this context, a deep learning model of the mouse transcriptome would greatly benefit mouse studies, and Geneformer presents a promising candidate for this purpose. However, the original Geneformer is modeled on the human transcriptome. Predicting disease-causing genes in mice using a human-based Geneformer is not straightforward. Therefore, a mouse version of Geneformer (mouse-Geneformer) is in high demand.
This paper aims to construct a mouse version of Geneformer, a deep learning model trained on mouse scRNA-seq data. We then evaluate the usefulness of mouse-Geneformer for various downstream analyses such as cell type classification and in silico perturbation experiments. We also explore the potential for cross-species application of mouse-Geneformer. If successful, mouse-Geneformer could be used for human studies, where some samples are inaccessible due to ethical and technical constraints, and for non-model organisms, where large-scale scRNA-seq data are not available.

Architecture
We constructed mouse-Geneformer, a deep learning model aimed at predicting the gene network of normal mice. We developed mouse-Geneformer following the original human version of Geneformer, a context-aware, attention-based deep learning model developed by Theodoris et al. [10] and pretrained on large-scale transcriptome data comprising approximately 30 million human single-cell transcriptomes. The architecture and transfer learning strategy employed in our development of mouse-Geneformer are the same as those of the original human Geneformer, with minor modifications. An overview diagram of mouse-Geneformer is shown in Fig 1. Instead of human RNA-seq data, we used mouse single-cell RNA-seq data as the input corpus. The transcriptome of each single cell was presented to the model using the Rank Value Encoding method, which was also developed for the original human Geneformer [10]. Finally, a Geneformer model built from Transformer Encoder blocks was pretrained on the resulting cell sequences with self-supervised learning to build mouse-Geneformer.

D
Starting with 119 million cells from the compiled 1,089 datasets, we applied filtering based on metadata and quality. Since mouse-Genecorpus-20M is intended for learning the gene network of normal mice to build a reference genetic architecture, we excluded single-cell data with high mutation loads that could cause restructuring of gene networks, such as those from cancer cells or immortalized cell lines. Quality filtering also addressed artifacts common in droplet-based single-cell data, such as ambient RNA [19,20], doublets [21,22], and data derived from empty droplets.
The specific filtering conditions were as follows: 1) For each dataset, cells whose total gene expression level was more than three standard deviations from the mean were excluded. 2) Cells whose mitochondrial RNA expression level was more than three standard deviations from the mean were excluded. 3) Cells with fewer than seven detected genes were removed. 4) Cells with a total gene expression level above 20,000 were excluded. After applying these filters, the final corpus comprised 1,070 RNA-seq datasets encompassing 20,630,028 cells. The mouse-Genecorpus-20M dataset is formatted in Apache Arrow and is available for download from the repository (https://huggingface.co/datasets/MPRG/Mouse-Genecorpus-20M). It can be accessed using the Python library "datasets" for downstream applications.
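The four filtering conditions can be sketched as follows. This is a minimal illustration and not the pipeline used to build the corpus: the field names (total_counts, mito_counts, n_genes) are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator was applied.

```python
from statistics import mean, pstdev


def filter_cells(cells, n_sd=3.0, min_genes=7, max_total=20_000):
    """Apply the four quality filters to the cells of ONE dataset.

    `cells` is a list of dicts with hypothetical keys:
    total_counts, mito_counts, n_genes.
    """
    if not cells:
        return []
    totals = [c["total_counts"] for c in cells]
    mitos = [c["mito_counts"] for c in cells]
    # Per-dataset mean and standard deviation (population SD is an assumption).
    t_mu, t_sd = mean(totals), pstdev(totals)
    m_mu, m_sd = mean(mitos), pstdev(mitos)

    kept = []
    for c in cells:
        if abs(c["total_counts"] - t_mu) > n_sd * t_sd:  # condition 1
            continue
        if abs(c["mito_counts"] - m_mu) > n_sd * m_sd:   # condition 2
            continue
        if c["n_genes"] < min_genes:                     # condition 3
            continue
        if c["total_counts"] > max_total:                # condition 4
            continue
        kept.append(c)
    return kept


# Toy dataset: 20 ordinary cells, one count outlier, one nearly empty cell.
cells = [{"total_counts": 1000, "mito_counts": 50, "n_genes": 500} for _ in range(20)]
cells.append({"total_counts": 19000, "mito_counts": 50, "n_genes": 500})
cells.append({"total_counts": 1000, "mito_counts": 50, "n_genes": 5})
kept = filter_cells(cells)
print(len(cells), "->", len(kept))
```

Because the mean and standard deviation are computed per dataset, the function is meant to be called once for each of the compiled datasets rather than on the pooled corpus.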

Pretraining of the mouse-Geneformer
Pretraining was conducted following the procedures described previously for the original human Geneformer [10], with some modifications. Details are described in the Results section. Briefly, the mouse-Genecorpus-20M dataset was processed using Rank Value Encoding to extract genes that capture cell features, forming cell sequences with these genes as tokens. A special token, [CLS], was added for classifying cell sequences. The resulting data, consisting of cell sequences of token IDs, were used as input for pretraining mouse-Geneformer. Pretraining employed a masked token prediction task, in which 15% of the tokens in each input sequence were randomly masked and the model was trained to predict the masked tokens from the remaining ones. This task optimizes the network, allowing mouse-Geneformer to learn the relationships and expression patterns of genes in mice.
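The encoding and masking steps can be illustrated with a small sketch. It rests on several assumptions that the paragraph above does not spell out: that each gene's count is scaled by the cell total and by a corpus-wide median for that gene before ranking (our reading of the original Geneformer), that [CLS] is placed at the front, and that the input length is capped (max_len is a hypothetical parameter). The BERT-style refinement of replacing some selected tokens with random or unchanged tokens is omitted. All gene medians below are toy values.

```python
import random

CLS = "[CLS]"
MASK = "[MASK]"


def rank_value_encode(counts, gene_medians, max_len=2048):
    """Turn one cell's raw counts into a ranked token sequence.

    counts: dict gene -> raw count; gene_medians: dict gene -> assumed
    corpus-wide median used to down-weight ubiquitously high genes.
    """
    total = sum(counts.values())
    if total <= 0:
        return [CLS]
    scaled = {
        g: (c / total) / gene_medians[g]
        for g, c in counts.items()
        if c > 0 and gene_medians.get(g, 0) > 0
    }
    # Highest scaled expression first; ties broken by gene name.
    ranked = sorted(scaled, key=lambda g: (-scaled[g], g))
    return [CLS] + ranked[: max_len - 1]


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Mask a fraction of the gene tokens; return masked sequence and labels."""
    rng = random.Random(seed)
    candidates = [i for i, t in enumerate(tokens) if t != CLS]
    if not candidates:
        return list(tokens), {}
    n_mask = max(1, round(mask_rate * len(candidates)))
    positions = set(rng.sample(candidates, n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    labels = {i: tokens[i] for i in positions}  # targets to predict
    return masked, labels


cell = {"Alb": 10, "Actb": 5, "Gapdh": 0, "Ttr": 5}
medians = {"Alb": 1.0, "Actb": 0.1, "Gapdh": 1.0, "Ttr": 1.0}  # toy values
tokens = rank_value_encode(cell, medians)
print(tokens)

long_cell = [CLS] + ["g%d" % i for i in range(20)]
masked, labels = mask_tokens(long_cell)
print(masked.count(MASK), "tokens masked")
```

In the toy cell, Actb has a lower raw count than Alb but ranks first because its assumed median is small, which is the kind of reweighting the encoding is meant to achieve.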
We pretrained mouse-Geneformer on mouse-Genecorpus-20M under the conditions shown in Table 1. Training used 8 NVIDIA V100 GPUs, each with 32 GB of memory, and took approximately 2 days.

Table 1 | Summary of Experiment conditions
Pretraining: pretraining conditions of mouse-Geneformer. Eval 1: fine-tuning conditions for cell type classification comparing mouse-Geneformer with conventional methods. Eval 2: fine-tuning conditions for cell type classification using mouse-Geneformer with and without pretraining. Eval 3: fine-tuning conditions for in silico perturbation experiments using mouse-Geneformer. Eval 4: fine-tuning conditions for cell type classification with conversion between human and mouse genes. Eval 5: fine-tuning conditions for in silico perturbation experiments with conversion between human and mouse genes.

Fine-tuning of the mouse-Geneformer
Fine-tuning mouse-Geneformer enables accurate classification of cell types and disease types, even with limited mouse single-cell data, such as data containing rare cells, disease-specific data, or organ-specific data. This process leverages the knowledge gained from pretraining on mouse-Genecorpus-20M to predict gene networks accurately. Fine-tuning involves adding a classification layer that classifies the [CLS] token in the final layer of mouse-Geneformer. The model is initialized with the weights of the pretrained mouse-Geneformer and further trained using a small amount of task-specific data. When the fine-tuning task is similar to the pretraining task, some layers of the Transformer Encoder blocks are frozen to enhance generalization performance. To mitigate overfitting, which can occur because of the limited amount of data, the number of training iterations is reduced.

Cell type classification using the mouse-Geneformer and other methods
To evaluate mouse-Geneformer, we compared its performance on cell type classification tasks with that of conventional methods, single-cell VAE (scVAE) [23] and scDeepSort [24]. The mouse data used for comparison comprised single-cell data from nine mouse organs: tongue [25], thymus [25], mammary gland [25], large intestine [25], limb muscle [25], spleen [25], heart [25], brain [25], and kidney [25] (Gene Expression Omnibus (GEO): GSE132042). These datasets were downloaded from CELLxGENE (https://cellxgene.cziscience.com/datasets). Each method was used to classify the data, and the accuracies were compared. For cell type classification with mouse-Geneformer, the model was fine-tuned on each organ dataset. The fine-tuning conditions are detailed in Table 1. scVAE applies the deep generative model Gaussian-mixture VAE (GMVAE) [26] to single-cell RNA-seq analysis [23]; we used a GMVAE with a three-layer neural network for the encoder, a Gaussian mixture distribution for the latent space, a three-layer neural network for the decoder, and a three-layer neural network for the classification network. The categorical distribution in this GMVAE uses the Gumbel-Softmax [27,28]. scDeepSort is a pretrained deep learning model that uses weighted Graph Neural Networks (GNNs) for single-cell RNA-seq analysis [29]; we employed scDeepSort with a three-layer graph neural network [24]. The data were randomly split into two groups, one for training (80%) and the other for testing (20%). The evaluation metric was accuracy, calculated using the 'accuracy_score' function from the scikit-learn library in Python, and the highest accuracy was recorded. Cell distributions were visualized using UMAP [30] as implemented in the scanpy library in Python.
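The evaluation protocol (random 80/20 split, accuracy on the held-out cells) can be sketched in plain Python. The function names and the fixed seed are illustrative assumptions. The accuracy function computes the fraction of exactly matching labels, which is what scikit-learn's 'accuracy_score' returns with its default settings.

```python
import random


def split_indices(n_cells, test_fraction=0.2, seed=0):
    """Randomly split cell indices into training and testing sets."""
    idx = list(range(n_cells))
    random.Random(seed).shuffle(idx)
    n_test = round(n_cells * test_fraction)
    return idx[n_test:], idx[:n_test]  # (train, test)


def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted type equals the annotated type."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))

# Toy labels: three of four test cells are classified correctly.
y_true = ["T cell", "B cell", "basal cell", "T cell"]
y_pred = ["T cell", "B cell", "keratinocyte", "T cell"]
print(accuracy(y_true, y_pred))
```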

Cell type classification using mouse-Geneformer with and without prior learning
To evaluate the effect of pretraining, we compared the performance of the pretrained mouse-Geneformer with that of a non-pretrained model on cell type classification. The data used for evaluation included single-cell RNA-seq data from a mixed mouse urethra and prostate dataset [31] (Gene Expression Omnibus (GEO): GSE145929), embryos [32] (Gene Expression Omnibus (GEO): GSE197353), and kidneys [33] (Gene Expression Omnibus (GEO): GSE190094). These datasets were downloaded from CELLxGENE (https://cellxgene.cziscience.com/datasets). The embryo dataset consists of whole mouse embryos at embryonic age E9.5. These datasets were not included in mouse-Genecorpus-20M. The mouse-Geneformer models with and without pretraining were used to classify the data, and their accuracies were compared. For cell type classification, the models were fine-tuned on each organ dataset. The fine-tuning conditions are presented in Table 1. The data splitting method, evaluation metrics, and cell distribution visualization methods are the same as those used in the cell type classification experiments (see above).

In silico perturbation experiments
For in silico perturbation experiments using mouse-Geneformer, gene expression data were first normalized, and each gene's expression was then ranked using Rank Value Encoding to create the dataset. Mouse-Geneformer was then fine-tuned for disease classification tasks, ensuring that the disease classification accuracy on the test data exceeded 90%. Subsequently, the genes in the dataset were randomly perturbed multiple times, and inference was performed using the fine-tuned model. When deleting a gene, the gene was removed from the dataset, and the rank values of other genes were increased. Conversely, when activating a gene, the rank value of that gene was increased, and the rank values of other genes were decreased. Cosine similarity was used to quantify the distance between the perturbed cell state and a specific cell state.
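The perturbation operations can be illustrated on a ranked token list. The sketch below makes two assumptions: activation is modeled by moving the gene to the top of the ranking (the text only says its rank value is increased), and the cell-state vectors are toy numbers standing in for embeddings that would come from the fine-tuned model. Special tokens such as [CLS] are left out for brevity.

```python
import math


def delete_gene(tokens, gene):
    """Remove a gene; every gene ranked below it moves up one position."""
    return [t for t in tokens if t != gene]


def activate_gene(tokens, gene):
    """Move a gene to the top rank (assumed model of activation)."""
    return [gene] + [t for t in tokens if t != gene]


def cosine_similarity(u, v):
    """Cosine similarity between two cell-state embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


ranked = ["Umod", "Slc12a3", "Actb", "Apoe"]  # toy ranking
print(delete_gene(ranked, "Slc12a3"))
print(activate_gene(ranked, "Apoe"))

# Toy embeddings standing in for model outputs (hypothetical values).
perturbed_state = [1.0, 2.0, 0.5]
target_state = [2.0, 4.0, 1.0]
print(cosine_similarity(perturbed_state, target_state))
```

In an actual experiment, each perturbed token list would be passed through the fine-tuned model, and the similarity between the resulting embedding and the target cell state would be compared across candidate genes.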
To evaluate the in silico perturbation experiments using mouse-Geneformer, we compared the genes predicted by in silico perturbation with those identified in in vivo experiments. The data used for evaluation included single-cell data from diabetic nephropathy, UMOD nephropathy, and normal kidney [33] (Gene Expression Omnibus (GEO): GSE190094), as well as data from COP1 knockout cells and normal cells [34] (Gene Expression Omnibus (GEO): GSE147559). These datasets were not included in mouse-Genecorpus-20M. To perform disease classification with mouse-Geneformer, the model was fine-tuned on each disease dataset. Fine-tuning conditions are shown in Table 1. Only models achieving disease subtype classification accuracies above 90.00% on the test data were used for evaluation.
The data were randomly split into training (80%) and testing (20%) sets. For the in silico perturbation experiments on diabetic nephropathy, we perturbed the cell states of normal kidney cells to resemble those of diabetic nephropathy. For the UMOD nephropathy experiments, we perturbed the cell states of UMOD nephropathy cells to resemble normal kidney cells. In the COP1 KO experiments, we perturbed the cell states of COP1 knockout cells to resemble normal cells.

Gene name conversion between mouse and human for cross-species application
To convert human genes to mouse genes for cross-species application, we used the databases of Mouse Genome Informatics (MGI) [35] and the HUGO Gene Nomenclature Committee (HGNC) [36]. After converting human Ensembl IDs to mouse Ensembl IDs, we obtained 18,269 genes. Of these, 910 human Ensembl IDs have a one-to-many correspondence, and 687 mouse homolog MGI IDs have a one-to-many correspondence.
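A conversion table of this kind can be applied as in the sketch below. The text does not state how one-to-many correspondences were resolved, so dropping every ambiguous pair and keeping only one-to-one orthologs is an assumption made for illustration. The identifiers in the example are made-up placeholders, not real Ensembl IDs.

```python
from collections import defaultdict


def build_one_to_one_map(pairs):
    """Keep only human->mouse pairs that are unique in both directions.

    pairs: iterable of (human_id, mouse_id) rows from an ortholog table.
    Dropping ambiguous pairs is an assumption, not the paper's stated rule.
    """
    human_to_mouse = defaultdict(set)
    mouse_to_human = defaultdict(set)
    for human_id, mouse_id in pairs:
        human_to_mouse[human_id].add(mouse_id)
        mouse_to_human[mouse_id].add(human_id)

    mapping = {}
    for human_id, mouse_ids in human_to_mouse.items():
        if len(mouse_ids) != 1:
            continue  # one human gene maps to several mouse genes
        (mouse_id,) = mouse_ids
        if len(mouse_to_human[mouse_id]) == 1:
            mapping[human_id] = mouse_id
    return mapping


def convert_cell(counts, mapping):
    """Re-key one human cell's counts by mouse gene ID; unmapped genes are dropped."""
    return {mapping[g]: c for g, c in counts.items() if g in mapping}


# Placeholder IDs: H2 maps to two mouse genes, and M4 is shared by H3 and H4.
table = [("H1", "M1"), ("H2", "M2"), ("H2", "M3"), ("H3", "M4"), ("H4", "M4")]
ortholog_map = build_one_to_one_map(table)
print(ortholog_map)
print(convert_cell({"H1": 5, "H2": 3, "H9": 1}, ortholog_map))
```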

Human cell type classification using mouse-Geneformer
To explore whether mouse-Geneformer could be used for the analysis of other organisms, we analyzed three single-cell RNA-seq datasets derived from three human organs: breast [37] (Gene Expression Omnibus (GEO): GSE195665), thymus [38] (Gene Expression Omnibus (GEO): GSE144870), and cerebral cortex [39] (Brain Initiative Cell Census Network (BICCN): SCR_016152). These datasets were downloaded from CELLxGENE (https://cellxgene.cziscience.com/datasets) and were not included in Genecorpus-30M. To perform human cell type classification using mouse-Geneformer, human genes in the original datasets were converted to their mouse homologs as described in the previous section. Mouse-Geneformer and Geneformer were used to classify each dataset, and their accuracies were compared. The Geneformer used here was a human-Geneformer model that we pretrained using Genecorpus-30M (https://huggingface.co/datasets/ctheodoris/Genecorpus-30M). For cell type classification with mouse-Geneformer and Geneformer, the models were fine-tuned on each organ dataset. The fine-tuning conditions are detailed in Table 1. The fine-tuning and classification procedures were the same for the mouse-Geneformer and human-Geneformer models. The data were randomly split into training (80%) and testing (20%) sets. The evaluation metric for this experiment was accuracy. Cell distributions were visualized using UMAP [30] as described above.

In silico perturbation of human data using mouse-Geneformer
To explore whether mouse-Geneformer could be used for the analysis of other organisms, we analyzed human single-cell RNA-seq data from myocardial infarction cells [40] (European Genome-phenome Archive: EGAS00001006330) and COVID-19 human blood cells [41] (Gene Expression Omnibus (GEO), European Genome-phenome Archive: GSE150728, GSE155673, GSE150861, GSE149689, EGAS00001004571). These datasets were downloaded from CELLxGENE (https://cellxgene.cziscience.com/datasets). To perform in silico perturbation experiments on human cells using mouse-Geneformer, human genes in the original datasets were converted to their mouse homologs as described in the previous section. To perform disease classification with mouse-Geneformer, the model was fine-tuned on each disease dataset. Fine-tuning conditions are shown in Table 1. Only models achieving disease subtype classification accuracies above 90.00% on the test data were used for evaluation. Each dataset was randomly split into training (80%) and testing (20%) sets. The in silico perturbation experiments aimed to alter the expression profiles of normal cells to resemble abnormal cells and vice versa.
For comparison, these human datasets were also analyzed using the original human-Geneformer model.
The number of cells, cell types, and types of diseases used in this experiment with single-cell data of mice are shown in S2 Table.

Ethics Statement
This study exclusively utilized publicly available datasets, which were obtained from the specific databases detailed in the "Construction of Mouse-Genecorpus-20M" section. As the research involved only computational analyses of existing public data and did not include any live participants, no ethical approval or consent was required.
S1. It is noteworthy that mouse-Genecorpus-20M includes types of samples that are difficult to obtain from humans due to ethical or technical constraints (e.g. embryos). Since mouse-Genecorpus-20M is intended for learning the gene network of normal mice, we excluded single-cell data with high mutation loads, such as cancer cells or immortalized cell lines, that could restructure gene networks. We also omitted low-quality RNA-seq data derived from doublets and damaged cells. After the filtration, single-cell transcriptome data from 20,630,028 cells remained, constituting mouse-Genecorpus-20M.

Development of mouse-Geneformer
The transcriptomes in mouse-Genecorpus-20M were represented using the rank value encoding method, which yields "cell sequences" with gene names as tokens, and were then processed through six Transformer encoder units, as in the original Geneformer.
Pretraining was conducted using a masked language model under the conditions shown in Table 1.
We made several modifications to the original Geneformer parameters: 1) We changed the activation function from ReLU to SiLU (Sigmoid-weighted Linear Unit, also known as Swish) because SiLU provides smoother gradients than ReLU. It also provides non-zero gradients when the input values are less than or equal to 0, and its gradient slightly exceeds 1 for sufficiently large positive inputs (roughly above 1). Therefore, it is less prone to the vanishing gradient problem and can improve the performance of deep neural network models. 2) We increased the number of epochs from three to ten, which gave the model more passes over the training data, reduced the training loss, and improved performance. 3) In comparing linear and cosine modes for the learning rate scheduler, we found that pretraining with the cosine schedule performed better, reducing the training loss more effectively. Therefore, we chose the cosine schedule. As a result of these procedures, mouse-Geneformer was constructed.
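The two numerical modifications can be made concrete with a short sketch. SiLU is defined as x times the logistic sigmoid of x; the gradient function below is its analytic derivative. The schedule functions are generic textbook forms without warmup, and peak_lr, min_lr and the step counts are hypothetical values, not the settings of Table 1.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


def silu(x):
    return x * sigmoid(x)


def silu_grad(x):
    """d/dx [x * sigmoid(x)] = s * (1 + x * (1 - s)), with s = sigmoid(x)."""
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))


def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Cosine decay from peak_lr to min_lr (no warmup; an assumed form)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def linear_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Linear decay from peak_lr to min_lr, for comparison."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return peak_lr + (min_lr - peak_lr) * progress


for x in (-1.0, 0.0, 2.0):
    print(x, relu_grad(x), round(silu_grad(x), 4))

for step in (0, 25, 50, 100):
    print(step, cosine_lr(step, 100, 1e-3), linear_lr(step, 100, 1e-3))
```

Unlike ReLU, the SiLU gradient stays non-zero for negative inputs and rises slightly above 1 for moderately large positive inputs. The cosine schedule keeps the learning rate above the linear schedule in the first half of training and below it in the second half.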

Cell type classification: mouse-Geneformer is robust and outperforms conventional methods
We investigated the performance of mouse-Geneformer in cell type classification across various organs. Nine experiments were conducted using different organs to compare the accuracy of mouse-Geneformer with that of conventional methods, scDeepSort and scVAE (Table 2). Mouse-Geneformer greatly improved the classification accuracy of cell types in all cases, achieving an average accuracy of 96.73%. In comparison, the GNN-based method scDeepSort and the autoencoder-based method scVAE showed accuracy scores of 66.34% and 72.95%, respectively. Notably, mouse-Geneformer consistently achieved classification accuracies exceeding 93%, regardless of the target organ or the number of cell types. UMAP visualization of the tongue and limb muscle samples further supported the good performance of mouse-Geneformer in cell type annotation (Fig 3A and 3B). Each cell type was clearly separated, although the overlap between the distributions of Langerhans cells and epithelial basal cells posed a challenge in the tongue dataset. These results indicate that mouse-Geneformer is more accurate than existing methods and robust across different organs and varying cellular complexity. This suggests that mouse-Geneformer can be applied effectively to a wide variety of mouse organs.

Cell type classification: pretraining is effective
To evaluate the effect of pretraining, we compared the cell type classification results of mouse-Geneformer with and without pretraining, as shown in Table 3. We conducted three experiments using samples from 1) the prostate gland and urethra, 2) whole mouse embryos at embryonic age E9.5, and 3) kidney. The classification accuracy of mouse-Geneformer with pretraining was higher than that without pretraining in all cases. These results suggest that pretraining is effective in enhancing mouse-Geneformer, leading to improved classification accuracy.
cell. This enables the prediction of how these mutations will impact the gene network and helps elucidate the functions of the mutated genes. Notably, this approach provides a powerful method for candidate gene screening, as it allows for the repeated simulation of experiments for multiple target genes of interest, or even all mouse genes, in silico. By comparing the effects of these perturbations, we can identify the most influential genes. Detailed information about the in silico perturbation experiments is provided in Methods. Here, we analyzed three disease models using mouse-Geneformer.

In silico perturbation experiment in diabetic kidney disease
We analyzed diabetic nephropathy using a set of single-cell transcriptome data collected from  Subsequently, we conducted in silico perturbation experiments by randomly and repeatedly choosing target genes. Detailed information about the in silico perturbation experiments is provided in Methods.
The analysis revealed that deleting the gene Slc12a3 from normal kidney cells brought the cells closest to those of diabetic kidney disease, with a cosine similarity of 0.018. This in silico outcome is consistent with observations from in vivo experiments, in which granular cells of the glomerular apparatus affected by diabetic kidney disease showed diminished expression of Slc12a3 [33].

In silico perturbation experiment in UMOD kidney disease
Following a similar approach, we analyzed UMOD kidney disease. We modeled the cell state of UMOD kidney disease by fine-tuning mouse-Geneformer using the single-cell transcriptome data (Fig 5). In silico perturbation of the disease cells by altering gene expression revealed that deleting the gene Slc35b1, abundantly expressed in cells of UMOD kidney disease, had the greatest impact in moving the disease cell state closer to that of normal kidney cells. In in vivo experiments, it has been reported that the accumulation of mutant UMOD protein in the kidney alters the Unfolded Protein Response (UPR), leading to changes in the expression of 77 genes, including Slc35b1 and Slc3a2, which activate the UPR [33]. These observations indicate that our in silico perturbation experiments for UMOD kidney disease successfully identified one of the two genes validated by the in vivo experiment.

In silico perturbation experiment in COP1 KO microglia
Using a similar approach, we analyzed a model of neuroinflammatory disease. We used single-cell RNA-seq data from normal microglia and from microglia in which COP1, a gene known to suppress neuroinflammation, had been knocked out [34]. Detailed information about the data and fine-tuning is provided in Methods. with a cosine similarity of 0.36. Additionally, deleting the genes Fth1, Itgax, and Cst7 also had significant impacts in moving the cells closer to normal cells. These observations are consistent with in vivo experiments, which reported that neuroinflammation occurs due to the accumulation of c/EBPβ in the brain when COP1 is knocked out, and that knocking out COP1 increases the expression levels of the genes Apoe and Fth1 and affects the expression levels of the neurodegeneration-related genes Itgax and Cst7 [34].

Cross-species application of mouse-Geneformer through orthologous gene name conversion
We explored whether mouse-Geneformer could be used for the analysis of other organisms. If successful, this would allow non-model mammals, for which obtaining sufficient single-cell transcriptome data to construct species-specific Geneformer models is difficult, costly, and technically challenging, to benefit from transfer learning using the mouse gene network model. Given that the core metabolic and physiological features are conserved among closely related species, we hypothesized that the core genetic architectures would also be conserved. Therefore, mouse-Geneformer could potentially elucidate a significant portion of genetic architectures beyond mice.
Even human studies could benefit from mouse-Geneformer, despite the existence of the original Geneformer. This is because mouse-Genecorpus can include types of samples that are difficult to obtain from humans due to ethical or technical constraints.
As a proof of concept for the cross-species application of mouse-Geneformer, we investigated its applicability to human transcriptome analysis. The outline of the procedure was as follows: First, to analyze human genes with mouse-Geneformer, each human gene name was converted to its mouse ortholog based on the ortholog table. We then fine-tuned the mouse-Geneformer model using human transcriptome data converted to mouse orthologs. Using this fine-tuned model, we conducted in silico perturbation experiments.

Human cell type classification using mouse-Geneformer
We performed human cell type classification on human single-cell RNA-seq data using mouse-Geneformer. Three experiments were conducted using different human organs. Detailed information about gene conversion and fine-tuning is provided in Methods. For comparison, we also analyzed the human data using the original human-Geneformer model.
The cell type classification results are shown in Table 4. Mouse-Geneformer accurately classified human cell types based on the human transcriptome data in all cases, achieving accuracy scores of 95.44%, 99.98% and 88.12% for human thymus, cerebral cortex, and breast, respectively. In addition, the classification accuracy of the ortholog-converted mouse-Geneformer on human data was nearly equivalent to that of the original human-Geneformer, with marginal differences ranging from 0.01 to 0.30%. This indicates that mouse-Geneformer is effective for cell type classification of human cells and suggests that the Geneformer model can work effectively across species following orthology-based data conversion.

In silico perturbation of human data using mouse-Geneformer 1: myocardial infarction
Encouraged by the success of applying mouse-Geneformer to human cell type classification, we next investigated its utility for in silico perturbation experiments on human data. We analyzed human single-cell RNA-seq data from myocardial infarction cells to evaluate the model's effectiveness. The original human-Geneformer predicted that activation of NPPB and ANKRD1, as well as deletion of MYH7, could drive normal heart cells towards a myocardial infarction state. Mouse-Geneformer similarly indicated that activation of Ankrd1 and deletion of Myh7 could lead to a comparable transition from normal heart cells towards myocardial infarction cells. These results align with findings from single-cell RNA sequencing analyses of human myocardial infarction and normal heart cells [40]. Conversely, the human-Geneformer model predicted that deleting ANKRD1 and NPPB from myocardial infarction cells would revert them to a more normal cellular state.
Similarly, the mouse-Geneformer model predicted that deletion of the genes Nppb, Ankrd1, and Myh7 from myocardial infarction cells would also lead to a transition towards a normal state. Thus, these results demonstrate that mouse-Geneformer can effectively perform in silico perturbation experiments on human disease models such as myocardial infarction.

In silico perturbation of human data using mouse-Geneformer 2: COVID-19 human blood
Next, we investigated the effectiveness of mouse-Geneformer in analyzing a human-specific disease. We conducted perturbation experiments on COVID-19 human blood cells using both human-Geneformer and mouse-Geneformer for comparison. Notably, SARS-CoV-2, the coronavirus responsible for COVID-19, does not naturally infect wild-type laboratory mice [42,43].
Using the human Geneformer, we predicted that the activation of the genes CCR4, IL6, and CCR20 in normal human blood cells would lead to a transition towards a state resembling SARS-CoV-2 infection. This prediction is consistent with findings from single-cell RNA sequencing analysis of normal blood and COVID-19 blood [41]. The mouse-Geneformer predicted that the activation of the genes Cxcl3 and Ccr4 would induce a similar transition towards COVID-19-infected cells.
Conversely, in silico perturbation of COVID-19 human blood cells using the human-Geneformer model predicted that deletion of the genes CXCL2, IFITM3, and CCL20 would revert the COVID-19 cells towards a normal state. The mouse-Geneformer model predicted that deleting the genes Ccr4 and Il6 would lead to a transition towards normal cells. These results suggest that both human-Geneformer and mouse-Geneformer predict the involvement of similar genes, such as CCR4 and IL6, in the process of SARS-CoV-2 infection in human blood cells. However, the overlap between the genes predicted by the two models was not large. This may indicate limitations in the cross-species application of the Geneformer model for species-specific traits. Thus, while mouse-Geneformer demonstrates potential in cross-species application for understanding human diseases like COVID-19, it also highlights the importance of species-specific models for capturing the full complexity of disease mechanisms.

Discussion
Geneformer is an innovative context-aware, attention-based deep learning model pretrained on large-scale human single-cell RNA-seq data. It enables fine-tuning for a vast array of downstream tasks with limited task-specific data. Building on the successful development of Geneformer using human scRNA-seq data, we developed a mouse version of Geneformer using mouse scRNA-seq data. The architecture and transfer learning strategy of the original human version were followed with minor modifications. The evaluation of our mouse-Geneformer indicated successful development, as fine-tuning for downstream tasks improved the accuracy of cell type classification in mouse data, and in silico simulations of gene manipulation in mouse disease models detected genes identified in in vivo experiments. This suggests that the architecture of the original human Geneformer, including key components such as rank value encoding, is robust and applicable to species beyond human. It is expected that this strategy and architecture can be applied to any species to build species-specific Geneformer models, provided that large-scale transcriptome data are available.
The mouse, Mus musculus, is the foremost mammalian model for studying human biology and disease. Extensive knowledge about mouse physiology, anatomy, and genetics, along with well-developed methods for genetic manipulation (such as creating transgenic, knockout, and knock-in animals), has positioned the mouse as a crucial model in various biological and medical fields. In this context, the mouse-Geneformer developed in this study should greatly benefit mouse studies. We constructed a large-scale dataset of mouse single-cell RNA-seq data, termed mouse-Genecorpus-20M, comprising approximately 21 million single-cell RNA-seq profiles from healthy mice and encompassing a wide variety of organs and developmental stages. Pretrained on this dataset, mouse-Geneformer gained a fundamental understanding of the genetic network dynamics of the mouse transcriptome. By leveraging this prior knowledge, mouse-Geneformer achieved significantly higher cell type classification accuracy than traditional methods.
Furthermore, in silico perturbation experiments using mouse-Geneformer successfully identified disease-causing genes that have been validated in in vivo experiments. Thus, mouse-Geneformer not only enhances our ability to understand the genetic networks of mice but also enables in silico screening of key genetic factors in disease models. Specifically, in silico prediction using mouse-Geneformer can help prioritize the genes to analyze before conducting animal experiments, thereby avoiding ad hoc gene knockouts, saving time, and reducing the need to sacrifice animals.
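The prioritization idea described above can be sketched in a few lines of Python. The sketch removes one gene at a time from a cell's gene list, re-embeds the cell, and ranks genes by how far the deletion moves the embedding towards a target (for example, healthy) state, measured by cosine similarity. The toy embedding function, the gene names `gene_a` to `gene_c`, and their vectors are invented placeholders; in practice the embedding would come from the fine-tuned model, and this is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_deletions(cell_genes, embed, target_embedding):
    """Rank genes by the shift towards the target state caused by deleting them."""
    baseline = cosine(embed(cell_genes), target_embedding)
    shifts = []
    for gene in cell_genes:
        perturbed = [g for g in cell_genes if g != gene]
        if not perturbed:  # a cell cannot be embedded with no genes left
            continue
        shift = cosine(embed(perturbed), target_embedding) - baseline
        shifts.append((gene, shift))
    # Largest positive shift first: deletion moves the cell closest to the target
    return sorted(shifts, key=lambda item: item[1], reverse=True)


# Hypothetical per-gene vectors standing in for a real model
TOY_VECTORS = {"gene_a": (0.0, 1.0), "gene_b": (1.0, 0.0), "gene_c": (1.0, 0.2)}


def toy_embed(genes):
    """Toy cell embedding: the mean of the per-gene vectors."""
    dims = zip(*(TOY_VECTORS[g] for g in genes))
    return [sum(values) / len(genes) for values in dims]


ranking = rank_deletions(["gene_a", "gene_b", "gene_c"], toy_embed, [1.0, 0.0])
for gene, shift in ranking:
    print(f"{gene}: {shift:+.3f}")
```

Genes at the top of such a ranking would be the first candidates to test experimentally; genes with a negative shift move the cell away from the target state when deleted.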
We found that mouse-Geneformer can be used to analyze other animal species in a cross-species manner. In this study, after ortholog-based gene name conversion, analysis of human scRNA-seq data using mouse-Geneformer followed by fine-tuning with human data achieved cell type classification accuracy comparable to that obtained using the original human Geneformer. In addition, mouse-Geneformer effectively performed in silico perturbation experiments on human disease models of myocardial infarction. Given that core metabolic and physiological features are conserved among mammals, the core genetic architectures should also be conserved, which would explain why mouse-Geneformer worked on the human transcriptome. Despite the existence of the original Geneformer tailored for human, human research could benefit from mouse-Geneformer. This is because mouse-Genecorpus can include types of samples that are ethically or technically inaccessible for humans, such as embryonic tissues and certain disease models. As the amount and variety of mouse scRNA-seq data continue to increase, the inclusion of additional datasets in the current mouse-Genecorpus-20M to create an expanded mouse-Genecorpus will enable mouse-Geneformer to learn more accurate gene networks. This enhanced model could serve as a reference not only for mouse studies but also for human studies.
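The ortholog-based gene name conversion can be pictured with a minimal Python sketch. It renames human genes to mouse symbols using a lookup table and drops genes without a unique counterpart. The function name, the tiny mapping table, and the placeholder genes `GENEX1`, `GENEX2`, and `NOORTHOLOG` are hypothetical, and the rule of keeping only one-to-one pairs is an assumption; how many-to-one orthologs were actually handled is not stated here.

```python
from collections import Counter


def convert_to_mouse_symbols(human_expression, human_to_mouse):
    """Rename human genes to mouse orthologs, keeping one-to-one pairs only.

    human_expression : dict mapping human gene symbol -> expression value
    human_to_mouse   : dict mapping human gene symbol -> mouse gene symbol
    """
    # Count how many human genes point at each mouse gene
    targets = Counter(human_to_mouse.values())
    converted = {}
    for gene, value in human_expression.items():
        mouse_gene = human_to_mouse.get(gene)
        # Skip genes with no ortholog or with an ambiguous (many-to-one) target
        if mouse_gene is not None and targets[mouse_gene] == 1:
            converted[mouse_gene] = value
    return converted


# Toy ortholog table; GENEX1/GENEX2 are invented many-to-one placeholders
ortholog_table = {
    "IL6": "Il6",
    "CCR4": "Ccr4",
    "GENEX1": "Genex",
    "GENEX2": "Genex",
}
human_cell = {"IL6": 3.0, "CCR4": 1.0, "GENEX1": 2.0, "GENEX2": 4.0, "NOORTHOLOG": 5.0}

mouse_like_cell = convert_to_mouse_symbols(human_cell, ortholog_table)
print(mouse_like_cell)  # {'Il6': 3.0, 'Ccr4': 1.0}
```

After this step the human cell is expressed in the mouse gene vocabulary and can be encoded and passed to a mouse-trained model like any mouse cell.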
The cross-species application of Geneformer holds great potential for the analysis of non-model organisms, for which it is difficult, costly, and technically challenging to obtain sufficient single-cell transcriptome data to construct species-specific Geneformer models. However, since we investigated only a single cross-species combination, between human and mouse, it remains unclear how closely related two species must be for Geneformer models to be applied in a cross-species manner. The success of such applications may also depend on the traits or genes of interest. We predict that housekeeping metabolic processes involving conserved genes can be analyzed across species, whereas lineage-specific traits may not be analyzed as effectively, as suggested by the COVID-19 experiment in this study. Further research is needed to explore the full potential and limitations of cross-species applications of Geneformer models.

Fig 2 | Building mouse-Genecorpus-20M from single-cell RNA-seq datasets. A) Overview of

Theodoris et al. originally developed Geneformer, a context-aware, attention-based deep learning model, which was pretrained on large-scale transcriptome data comprising approximately 30 million human single-cell RNA-seq profiles. While the original Geneformer was tailored for human transcriptome analysis, our study focuses on creating a mouse version, referred to as mouse-Geneformer, from mouse transcriptome data. By leveraging mouse-Geneformer, we aim to harness the capabilities of large-scale deep learning models to enhance research involving mice, the most extensively studied model organism. The architecture and transfer learning strategy employed in developing mouse-Geneformer are the same as those of the original human Geneformer, with minor modifications. The primary and critical difference lies in the input corpus, which was constructed from mouse single-cell RNA-seq data. We constructed a large-scale dataset of mouse single-cell RNA-seq data, termed mouse-Genecorpus-20M. The mouse-Genecorpus-20M comprises approximately 21 million raw single-cell profiles from healthy mice, encompassing a wide variety of organs and developmental stages. The breakdown of the dataset is shown in Fig 2B, 2C and 2D, and detailed information is provided in Table

Fig 3 | Visualization of tongue and limb muscle cell distribution. A) Visualization of tongue cell

maximum difference in classification accuracy was 8.22% for the embryo data, while the improvements in the other cases were marginal. UMAP visualization of the cell distribution in the embryo data demonstrated that the pretrained mouse-Geneformer produced more discrete cell clusters than the non-pretrained version, as shown in Fig 4A and 4B. For example, primitive erythrocytes and cardiac valve cells are distributed among multiple clusters in the non-pretrained mouse-Geneformer classification, whereas in the pretrained version they form distinct clusters.
disease kidney exhibiting diabetes and from normal kidneys as a control. Detailed information about the data and fine-tuning is provided in Methods. The UMAP visualization of the cell distribution in Fig 5 shows a clear separation between normal kidney cells and disease cells.

Fig 5 | Visualization of disease cell distribution. In three kidney diseases using mouse-

Fig 6 exhibits complete separation between normal cells (Cop1 WT) and COP1 knockout cells (Cop1 KO). Then, we conducted in silico perturbation experiments. Deletion of the Apoe gene from COP1 KO cells demonstrated the closest resemblance to normal cells,

Fig 6 | Visualization of disease cell distribution. In two Cop1 microglia diseases using the mouse-

Table, and an overview is shown in

Table 2 | Cell type classification results using mouse-Geneformer and two conventional methods.
"Cell types" is the number of cell types in each dataset. The mouse-Geneformer, scDeepSort, and Single-cell VAE columns give the classification accuracy for nine mouse tissues. Classification accuracies are shown as percentages (%).

Table 3 | Cell type classification results with and without prior learning.
"Cell types" is the number of cell types in each dataset. "w/ prior" and "w/o prior" denote mouse-Geneformer with and without prior learning (pretraining). The columns give the classification accuracy for three mouse tissues. Classification accuracies are shown as percentages (%).

Table 4 | Cell type classification results after conversion of human and mouse genes.
"Cell types" is the number of cell types in each dataset. The Geneformer and mouse-Geneformer columns give the classification accuracy for three human and three mouse tissues. The prefix h/ denotes human data. Classification accuracies are shown as percentages (%).