Predicting cell population-specific gene expression from genomic sequence

Most regulatory elements, especially enhancer sequences, are cell population-specific. One could even argue that a distinct set of regulatory elements is what defines a cell population. However, discovering which non-coding regions of the DNA are essential in which context, and as a result, which genes are expressed, is a difficult task. Some computational models tackle this problem by predicting gene expression directly from the genomic sequence. These models are currently limited to predicting bulk measurements and mainly make tissue-specific predictions. Here, we present a model that leverages single-cell RNA-sequencing data to predict gene expression. We show that cell population-specific models outperform tissue-specific models, especially when the expression profile of a cell population and the corresponding tissue are dissimilar. Further, we show that our model can prioritize GWAS variants and learn motifs of transcription factor binding sites. We envision that our model can be useful for delineating cell population-specific regulatory elements.


Introduction
In multicellular organisms, every cell has the same DNA apart from somatic mutations.Yet its function and the related proteins and genes expressed vary enormously.This is among others caused by transcriptional and epigenetic regulation.Proteins that bind the DNA sequence around the transcription start site (TSS) control whether a gene is transcribed in a cell (Vaquerizas et al., 2009;Lambert et al., 2018).Which transcription factors, and thus which DNA binding motifs, are essential differ per cell population (Vaquerizas et al., 2009;Lambert et al., 2018;Nott et al., 2019;Janssens et al., 2022).As such, mutations in regulatory regions might affect specific tissues or cell populations differently.Improving our understanding of these regulatory mechanisms will help us relate genomic functions to a phenotype.
For example, while promoter sequences are identical across the four major human brain cell populations (neurons, oligodendrocytes, astrocytes, and microglia), almost all enhancer sequences, the regions in the DNA where a transcription factor binds, are cell populationspecific (Nott et al., 2019).These population-specific regulatory elements are discovered by combining single-cell measurements of different data types, including chromatin accessibility, ChIP-seq, and DNA methylation.Bakken et al. (2021), for instance, identified differentially methylated and differentially accessible regions across neuronal cell populations in the human brain, albeit with little overlap.This emphasizes the complexity of transcriptional regulation and the need for more measurements to fully resolve these mechanisms at the cell population-specific level.
An alternative approach would be to train a computational model that directly predicts gene expression from the genomic sequence around the TSS.This way, we can learn which regulatory elements are important for transcriptional regulation in different contexts.Several computational methods have been developed for this task (Kelley et al., 2016;Kelley et al., 2018;Zhou et al., 2018;Agarwal and Shendure, 2020;Zhang et al., 2020;Avsec et al., 2021a;Avsec et al., 2021b).These methods have in common that they one-hot encode the DNA sequence and input this to either a convolutional neural network (CNN) or transformer.ExPecto, Xpresso, and ExpResNet predict expression measurements from bulk RNA-sequencing, while Basset, Basenji, BPNet, and the Enformer model predict regulatory signals, such as cap analysis gene expression (CAGE) reads or TF binding from CHIP-nexus.
A promising application of these models is to prioritize variants that have been identified using genome-wide association studies (GWAS) (Kelley et al., 2016;Wesolowska-Andersen et al., 2020).Using GWAS many potential disease-associating variants have been identified (Schizophrenia Working Group of the Psychiatric Genomics Consortium, 2014; Wightman et al., 2021;Yao et al., 2021).Within each locus, however, it is often challenging to pinpoint which variant is causal and which gene is affected by the variant.
These current computational gene prediction models, however, are designed for predicting bulk gene expression data.This means that they are either tissue-specific or could be applied to FACSsorted cells (Wesolowska-Andersen et al., 2020).Since transcriptional regulation is even more context-specific, the resolution of current methods is not sufficient for heterogeneous tissues where single-cell RNA-sequencing (scRNA-seq) has revealed hundreds of cell populations (Tasic et al., 2016;Tasic et al., 2018;Bakken et al., 2021).To increase the resolution, the models would ideally be trained on scRNA-seq data.
Here, we present scXpresso, a deep learning model that uses a CNN to learn cell population-specific expression in scRNA-seq data from genomic sequences.Since single-cell and bulk data have different characteristics and distributions, we explored whether this type of model is suitable for single-cell data.We show that 1) cell population-specific models outperform tissue-specific models on several tissues from the Tabula Muris, 2) increasing the resolution improves the predictions for human brain cell populations, and 3) in silico saturation mutagenesis of the input sequence can be used to prioritize GWAS variants.

Materials and methods
2.1 Architecture of scXpresso scXpresso is a one-dimensional convolutional neural network (CNN) adapted from the (bulk gene expression-based) Xpresso model (Agarwal and Shendure, 2020) (Figure 1A; Supplementary Figure S1).The input to the CNN is four channels with the one-hot encoded sequence around the transcription start site (TSS) (7 kb upstream and 3.5 kb downstream).Every channel represents one of the four nucleotides (A, C, T, G).For some positions, the exact nucleotide is not known [e.g., any nucleic acid (N) or a purine nucleotide (R)].The exact coding scheme for such positions is shown in Supplementary Table S1.The CNN consists of two convolutional layers.The output of the convolutional layers is flattened and concatenated with the half-life time features.Together, this is subsequently fed into a fully connected (FC) layer(s).The output of the FC layers is the aggregated expression per tissue or for each cell population.
Comparing scXpresso to Xpresso, there are three main differences: 1) we designed scXpresso as a multitask model so that it predicts the expression of multiple cell populations simultaneously.2) We decreased the number of half-life time features from eight to five; the three features we removed (5′ UTR, ORF, and 3′ UTR GC content) correlated less with half-life time, so we removed them to make the model less complex (Sharova et al., 2009;Spies et al., 2013;Agarwal and Shendure, 2020).Furthermore, removing these three half-life time features from the original Xpresso model did not lower its performance (Supplementary Table S2).3) For the multitask model, there is only one FC layer.For the other models, which we use to make tissue-specific predictions as a comparison, we used two FC layers.

Training scXpresso
We split the genes into a train, validation, and test dataset and evaluated using 20-fold cross-validation.These sets are the same across all experiments (i.e., one train, validation, and test set for mouse genes and one for human genes) such that the results of different models can be compared.We update the weights of scXpresso using the Adam optimizer based on the mean square error loss on the training set.The initial learning rate is set to 0.0005 and if the loss on the validation set is not improved from 5 epochs, the learning rate is reduced by a factor of 10.We train the model for 40 epochs and the model with the lowest loss on the validation set is used for evaluation on the test dataset.Since there is always some stochasticity when training a CNN, we always train 5 models and average the predictions.We used the following software packages for training the model: Pytorch (version 1.9.0) (Paszke et al., 2019), CUDA (version 11.1),cuDNN (version 8.0.5.39), and Python (version 3.6.8).
The four bulk datasets (spleen, lung, limb muscle, and bone marrow) from the Tabula Muris were downloaded from https:// www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE132040 (Schaum et al., 2019).For the bulk data, we used the same reference genome as for the single-cell data.

Human motor cortex data
The human motor cortex data from the Allen Institute (Bakken et al., 2021) was downloaded from the Cytosplore Comparison Viewer.We downloaded the reference genome (version GRCh38.p2) and corresponding GTF file with information about the location of transcription start sites of the genes here: https:// www.gencodegenes.org/human/release_22.html.

Aggregated expression values
First, we normalized the count matrices.For the single-cell datasets, we performed library size normalization in the same way as The Tabula Muris Consortium: i.e., counts per million for the smart-seq2 data and counts per ten thousand for the 10X data (Schaum et al., 2018).For the bulk Tabula Muris data, we performed TPM normalization.For the single-cell datasets, we used the annotations defined by the authors to aggregate the expression values per tissue or per cell population using log 10 (mean(x)) (without pseudocount) into pseudobulk values.The advantage of not adding a pseudocount is that the distribution looks more like a normal distribution, which makes it easier to train the models (Supplementary Figure S2).A limitation, however, is that we could not calculate the exact value for genes that were not expressed in any of the cells.For these genes, we replaced the pseudobulk values with −4 in the Tabula Muris and −5 in the motor cortex dataset, since this extrapolated well (Supplementary Figure S2).For the bulk data, we aggregated over the samples instead of the cells.Here, we set the genes that are not expressed in any of the samples to −4.We standardized the expression values before running the model such that the average expression of all genes in each cell population or tissue is zero and the standard deviation is one.Before analyzing the results and comparing the predictions across cell populations, we undid the z-score normalization but kept the log normalization.

Input features 2.5.1 Sequence around the transcription start site
Before extracting the sequences around the transcription start site, we removed genes that are transgenes, ERCC spikeins, genes without a coding region, and genes on the Y chromosome.This resulted in 20,467 mouse genes and 18,138 human genes.Some genes had multiple transcripts.
We downloaded a list with canonical transcripts for each gene from biomart and we used the transcript and transcription start site belonging to the canonical transcript.If the canonical transcript was not defined, we used the transcript that had the longest coding region.After having defined the transcription start site for each gene, we used seqkit (Shen et al., 2016) to extract sequences from the FASTA file containing the reference genome.

Half-life time features
For every gene, we extracted five half-life time features: 5′ UTR length, 3′ UTR length, ORF length, intron length, and exon junction density ( #exons length ORF *1000).We obtained these features by first filtering the GTF files for the canonical or longest transcript.The 5′ UTR length is the length of the sequence from the start of the first exon to the start codon.The 3′ UTR length is the length of the sequence from the last coding sequence to the end of the last exon.The ORF length is the sum of the length of the coding sequences.The intron length is the length of the transcript minus the length of the ORF, 5′ UTR, and 3′ UTR.All features are log-normalized using log 10 (x + 0.1) and afterwards z-scaled.

Evaluating the predictions
For every gene in the test dataset, we averaged the predictions of the five models we trained.We evaluated the performance for every cell population by calculating the Pearson correlation between the true and predicted expression of the genes in the test set.To evaluate the increase in performance between the tissue-specific (t) pseudobulk (pb) and cell population-specific (cp) pseudobulk (pb) model on the Tabula Muris datasets, we calculate: Δ cp,t median Pearson correlation (scEP cp,pb ) − median Pearson correlation (scEP t,pb ).On the motor cortex dataset, we also evaluated the performance of each gene by calculating the Pearson correlation between the true and predicted expression per cell population.

In silico saturation mutagenesis
For CACNA1I, we mutated all positions in silico, which means we tested all possible substitutions at every position.We undid the z-score normalization and calculated the difference between the original (wild-type) prediction and the mutated prediction.The prediction models used during these experiments were the models where CACNA1I itself was originally in the test set.For every position, we only plotted one predicted difference in expression in Figure 4E.This is the substitution that was predicted to have the largest absolute effect.We downloaded the locations of the candidate cis-regulatory elements that fall within the input region for CACNA1I from screen registry v3 (release date 2021) (ENCODE Project Consortium et al., 2020).When plotting the difference between two cell populations, we ignored the positions where one is positive and the other predicts a negative effect.This rarely happened and if it was the case, the predicted effect was very small.
For the 2,000 highly variable genes, selected using scanpy (Wolf et al., 2018), we applied ISM similar as described for CACNA1I.For  every position we then calculated the average maximum absolute predicted effect: where i indicates the genomic position, HVG is the list of highly variable genes, ref indicates the reference allele, and alt indicates the alternative allele.

Comparison to other models 2.8.1 Enformer
Enformer uses the DNA sequence to predict reads for 5,313 human tracks which include CAGE, DNAse, CHIP, and ATAC-seq (Avsec et al., 2021a).Here, we only looked at the effect of a variant on the CAGE tracks that are related to the brain (77 tracks in total, see Supplementary Table S3).Enformer predicts the effect of variants on 128 bp bins.When predicting the effect of a variant on the CAGE reads, we looked at the effect on the bin containing the TSS.

ExPecto
ExPecto predicts gene expression for 218 tissues and cell lines (Zhou et al., 2018).Here, we only focused on 27 outputs that are related to the brain (Supplementary Table S4).We used the ExPecto web server to predict the effect of the variants (https://hb.flatironinstitute.org/expecto/?tabId=3).ExPecto is trained using Hg19 instead of Hg39.We used the R-package SNPlocs.Hsapiens.dbSNP155.GRCh37 (v 0.99.23) to lift-over the variants.Using ExPecto we could not predict the effect of all variants, since for some variants there was no location in Hg19 found, some were too far away from a TSS, and some were linked to a different gene than we were interested in (see Supplementary Table S5 for an explanation per variant).

Xpresso
We trained the Xpresso model on bulk RNA-seq data from the precentral gyrus (Agarwal and Shendure, 2020).The data from two individuals were downloaded from the Allen Human Brain Atlas: https://human.brain-map.org/static/download(H0351.2001,H0351.2002).We used the normalized matrices.Labels were created as described in the Xpresso paper: we took the median expression across the 6 precentral gyrus samples, log-normalized the output using log 10 (x + 0.1), and z-score normalized the expression.Similar to scXpresso, we trained the model using 20-fold crossvalidation.Per fold, we trained 10 runs and used the model with the lowest MSE on the validation data [as described in Agarwal and Shendure (2020)].Afterwards, we predicted the effect of the variants.We could not predict the effect of all variants, since some genes were not measured in the bulk RNA-seq data and for some genes, there were no Xpresso input features defined (see Supplementary Table S5 for an explanation per variant).

Predicting cell population-specific gene expression using scXpresso
Here, we present scXpresso, a multitask convolutional neural network (CNN) to predict cell population-specific gene expression using genomic sequences only (Figure 1A; Supplementary Figure S1).We developed scXpresso by adapting the Xpresso model (Agarwal and Shendure, 2020), which was originally designed for bulk data, to single-cell data.Similar to Xpresso, we use two types of input to the model: 1) the DNA sequence around the transcription start site (TSS) (7 kb upstream-3.5 kb downstream) to model transcription, and 2) five half-life time features (5′ UTR length, 3′ UTR length, ORF length, intron length, and exon junction density) to model mRNA degradation.We input the one-hot encoded DNA sequence into a CNN.The output of the CNN is concatenated with the half-life time features and fed to a fully connected network (see Methods).Since our model is a multitask CNN, the desired output of the fully connected network is the gene expression for every cell population.We predict expression per cell population instead of per cell to achieve more stable predictions with less noise as single-cell data is known to be quite sparse.To obtain one expression value per cell population, we aggregate the single-cell expression into pseudobulk measurements (see Methods).
Since single-cell and bulk data have different characteristics, we tested whether scXpresso performs equally well on single-cell and bulk data.We used scRNA-seq data from five different tissues (limb muscle, spleen, gland, marrow, and lung) from the Tabula Muris (Schaum et al., 2018) (Supplementary Table S6).Here, we used cells isolated via FACS that were sequenced using the Smart-seq2 protocol.Using the annotations defined by the authors, we aggregate the values per cell population and per tissue into pseudobulk values.For four tissues (limb muscle, spleen, marrow, and lung), there are also bulk RNA-sequencing datasets available (Supplementary Table S7).We compared the pseudobulk to the bulk expression per tissue and noticed that these are indeed correlated (r muscle = 0.69, r spleen = 0.71, r marrow = 0.50, r lung = 0.67) (Supplementary Figure S3).
Next, we trained three different models: 1) a tissue-specific (t) model on the bulk (b) values (scXpresso t,b ), 2) a tissue-specific model on the pseudobulk (pb) values (scXpresso t,pb ), 3) a cell population-specific (cp) model on the pseudobulk values (scXpresso cp,pb ) (Figure 1B).The cell population-specific model is, in contrast to the tissue-specific models, a multitask model that predicts the expression of all cell populations in a tissue simultaneously.We evaluated the performance of the models by calculating the Pearson correlation between the true and predicted expression values.In general, the tissue-specific models trained on pseudobulk reach higher performance than the models trained on bulk (Figures 1C, D).Even though the bulk and pseudobulk values are correlated, the pseudobulk distributions are bimodal compared to the normally distributed bulk data (Supplementary Figures S3,  S4).This turns the problem more into a classification problem (is a gene low or high expressed), which might be easier to learn.On average, predicting cell population-specific expression is more difficult than predicting tissue-specific expression (Figures 1D, E): scXpresso cp,pb performs slightly worse than scXpresso t,pb (median correlation of 0.71 vs. 0.75), but still better than scXpresso t,b (0.58).
One of the adaptations to Xpresso is that scXpresso cp,pb is a multitask model.This slightly increases the performance compared to a single-task model (Supplementary Figure S5) but mainly makes the model computationally more efficient.The marrow-FACS dataset, for instance, contains 22 cell populations.Since the single-task and multitask models need the same training time (approximately 30-60 min), this gives a 22x speed up.
The Tabula Muris scRNA-seq datasets were generated using two different protocols: 10X Genomics, a droplet-based method, and FACS-based Smart-seq2, a plate-based method.When comparing scXpresso t,pb and scXpresso cp,pb trained on the two different protocols, e.g., lung-droplet vs. lung-FACS, we conclude that they perform equally well (Figures 1D, E; Supplementary Figures S6, S7).Depending on the tissue and cell population, one performs slightly higher than the other, but there are no significant differences.This is as expected since the pseudobulk values of both protocols are highly correlated (Pearson correlation > 0.85) (Supplementary Figure S8).Hence, the protocol used to create the single-cell dataset does not influence the results.
For scXpresso cp,pb , we tested how the two types of input features, DNA sequence and half-life time, influence the performance.We tested different lengths of the input sequence and whether one of the two features was enough to predict expression (Supplementary Figure S9).A range of different sequence lengths results in the same performance (3.5-3.5, 7-3.5, and 10-5 kb upstreamdownstream).A longer sequence gives more information but also adds more noise.Since the model also becomes more complex, more parameters have to be learned and it takes more time and memory to train the model.Therefore, we decided to use 7 kb upstream and 3.5 kb downstream for further experiments.We also observed that adding the half-life time features results in higher performance, suggesting that these features are not easily captured from DNA sequences alone.
For the cell population-specific models, the performance varies considerably across different populations (Figure 1E).Comparing the populations in the lung dataset, for instance, the performance of the endothelial cells is very high compared to leukocytes (Figure 1F; Supplementary Figure S10).In general, the performance of scXpresso increases when more genes and cells are measured in a population (Figure 1F; Supplementary Figure S11).The leukocyte population is small (35 cells) and fewer genes are non-zero compared to other cell populations in the lung (8,678 out of 20,467 vs. 12,715 on average).The ciliated cell population, on the other hand, is also small (25 cells), but this model reaches a higher performance.In this cell population, however, more genes were non-zero (11,717) compared to the leukocyte population.Hence, to train the model, we need a good representation of the cell population that includes enough expressed genes.
In all previous experiments, we evaluated scXpresso using 20fold cross-validation with the genes randomly divided over the folds.The results could be positively biased if genes from the same chromosome are in different folds.Therefore, we also evaluated the models using cross-chromosomal cross-validation.This slightly reduces the models' performance, but the difference is not significant (lowest p-value = 0.11 for myeloid cells, two-sample Wilcoxon rank sum test) (Supplementary Figure S12).

Cell population-specific models outperform tissue-specific models
Now that we know that all models are well-trained, we predicted cell population-specific expression using the three different models to see whether increasing the resolution of the models increases the performance (Figure 2A).Since scXpresso t,b and scXpresso t,pb were trained using tissue-specific expression values, these models predict the same value for every cell population.On all datasets, scXpresso cp,pb outperforms the tissue-specific models, which shows the benefit of training the models on a higher resolution (Figure 2B; Supplementary Figure S13A).Especially in more heterogeneous tissues, where the gene expression of cell populations is weakly correlated to the corresponding tissue, we see a large improvement (Figure 2C; Supplementary Figure S13B).For the lung-FACS dataset, for instance, the performance increases the most for immune cell populations (Δ cp,t for B cells: 0.11, NK cells: 0.11, T cells: 0.09; see Methods) and the least for lung-specific populations (Δ cp,t for stromal cells: 0.01, endothelial cells: 0.03, epithelial cells: 0.05).In the B cells in the lung, 4,081 genes are not expressed in any of the cells and thus have a log-normalized expression of −4, but for which the tissue-specific model predicts a positive log-normalized expression value (Figure 2D).In contrast, the model trained on B cells predicts a lower expression for these genes (Figure 2E).Almost all these genes, however, are expressed in the lung (in the non-B cells), the lung-model learned this correctly too (Figure 2F).Some of the Tabula Muris datasets contain similar cell populations.For instance, B cells, macrophages, and T cells are measured in four, three, and three tissues, respectively.We hypothesized that if our models are cell population-specific, they should accurately predict the expression of a cell population in one tissue with a model trained on the same cell population but from another tissue [even though a cell's tissue will slightly change the expression for (some) genes].To test this, we predicted the expression for common cell populations using three different types of models: 1) scXpresso cp,pb trained on the same cell population, but from a different tissue, 2) scXpresso cp,pb trained on a different cell population, but from the same tissue, 3) scXpresso t,pb trained on the same tissue (Figure 3A).For example, we predict the expression of B-cells in the limb muscle, using 1) a model trained on B-cells in the lung, 2) a model trained on endothelial cells in the limb muscle, and 3) a model trained on the limb muscle.Again, the cell population-specific models outperform the tissue-specific models, even though they predict either a different dataset or a different cell population than they were trained on (Figure 3B; Supplementary Figures S14, S15).This indicates that if you want to train a model for a cell population from a specific tissue where no single-cell data is available, you are better off using a model trained on a similar cell population from a different tissue than relying on a tissue-specific model.Whether a model trained on a different cell population and the same tissue performs better than a model trained on the same cell population but a different tissue, differs per tissue and cell population.For example, when predicting the expression of B cells in the limb muscle, the models trained on B cells in the marrow and lung even outperform the model trained on B cells in the limb muscle itself (Figure 3C).But, the models trained on different cell populations within the limb muscle perform variably when predicting B cells (Figure 3D).The models trained on immune populations, e.g., T cells or macrophages, perform similarly, but the muscle-specific populations perform worse.This difference between the B cell and the endothelial, mesenchymal stem cell, and skeletal muscle satellite cell models might seem small but is significant across the 20 folds [p-value = 9.5e-07 for all three populations, one-sided Wilcoxon rank sum test (Mann and Whitney, 1947;Demšar, 2006)].Even though the differences are small, this indicates that our models indeed learn cell population-specific features.

scXpresso learns expression patterns across human brain cell populations
Next, we applied scXpresso to a human brain dataset of the motor cortex (Bakken et al., 2021).This dataset is annotated at different resolutions including a class (GABAergic, glutamatergic, and non-neuronal) and subclass (20 subclasses) level.Again, we trained models of different resolutions: a tissue-(t), class-(c), and subclass-specific (sc) model (scXpresso t , scXpresso c , and scXpresso sc , respectively).We used the trained models to predict the subclass-specific expression values (Figure 4A).Since scXpresso t was trained on the tissue-specific pseudobulk expression, it predicts the same expression for all subclasses.The class-specific model, on the contrary, is a multitask model.Here, we use the predictions of the parent class to predict the expression of each subclass (i.e., subclasses belonging to the same parent class are predicted to have the same expression) (Supplementary Figure S16).Similar to the Tabula Muris, we observed that increasing the resolution increases the performance: scXpresso sc outperforms scXpresso c which outperforms scXpresso t , (Figure 4B).For some subclasses, e.g., L2/3 IT, the performance barely improves when comparing scXpresso sc with scXpresso c , which happens when the true expression values of the subclass and corresponding class are strongly correlated, similar as for the Tabula Muris case (Supplementary Figure S17).
Since genes with variable expression across subclasses are often interesting to study, we tested whether scXpresso sc can learn the correct pattern for a gene across the subclasses.For every gene, we calculate the Pearson correlation between the true and predicted expression across the subclasses.If the expression of a gene shows some variance across the subclasses, scXpresso sc predicts the pattern correctly (Figure 4C).An example is CACNA1I, a gene coding for a subtype of voltage-gated calcium channel that has been associated with schizophrenia (Li et al., 2017;Pardiñas et al., 2018;Ikeda et al., 2019;Lam et al., 2019;Yao et al., 2021).Here scXpresso sc correctly learns that the expression in neuronal populations is higher than in non-neuronal (r = 0.90) (Figure 4D).

In silico saturation mutagenesis reveals the most interesting GWAS variants
Since scXpresso can predict expression from the DNA sequence, we expect that it can also predict how the expression changes when the sequence is mutated.Therefore, we applied in silico saturation mutagenesis (ISM) to the sequence of CACNA1I and evaluated the predicted change in gene expression (Zhou and Troyanskaya, 2015;Kelley et al., 2016;Kelley et al., 2018;Avsec et al., 2021a).When comparing scXpresso sc predictions for the Sst Chodl subclass across all possible mutations, we find mutations in the region around the TSS to affect the expression of the CACNA1l gene the most (Figure 4E).When applying ISM to the 2,000 highly variable genes in the data, the maximum absolute predicted effect is highest around the TSS as well (Supplementary Figure S18).Note, that we did not use the TSS location as input to the model, consequently, the model correctly identified that this is the most important region for transcriptional regulation.No other regions within our input window were found that affect the expression that strongly.
Besides visualizing the mutation pattern for one subclass, we can also visualize how ISM affects two subclasses differently.As an example, we compared the scXpresso sc predictions for the Sst Chodl subclass and the L2/3 IT subclass (Supplementary Figure S19).These predictions show that the Sst Chodl subclass is more sensitive to mutations than the L2/3 IT class for CACNA1I, which might be explained by the fact that CACNA1I is also higher expressed in Sst Chodl cells.
In addition, ISM can be used to prioritize variants of interest for diseases.CACNA1I is linked to 18 Schizophrenia-associated variants according to the NHGRI-EBI Catalog (Buniello et al., 2019).Two of these variants, rs7288455 and rs5757730, fall within our input region (7 kb upstream and 3.5 kb downstream of the CACNA1l TSS).Mutating the reference A allele with the C or G variant at the position of rs7288455 increases the predicted expression for all cell populations (Figure 4F).The disease-associated variant, the A allele, is expected to decrease the expression (Buniello et al., 2019;Yao et al., 2021), which is in line with our predictions, although it is not known whether this is subclass-related.Our model suggests that the expression of CACNA1I increases the most in the Sst Chodl subclass.Interestingly, for the Sst Chodl subclass, this mutation results in one of the largest differences in CACNA1l expression amongst all other induced mutations (top 1% mutations with the strongest effect) (Supplementary Figure S20).For the other variant, rs5757730, which lies in an intronic region, we see no difference in expression (Supplementary Figure S21).Further supporting our predictions, rs7288455, but not rs5757730, overlaps with an ENCODE candidate cis-regulatory element.These results show that scXpresso can be used to prioritize GWAS hits.
We input the DNA sequence and five half-life time features to scXpresso.However, certain transcript features, which are related to the half-life time features, can predict zeros in the scRNA-seq data (Lipnitskaya et al., 2022).Whether the observed zeros in scRNA-seq data are technical artifacts or biologically informative is an ongoing debate.We believe that the zeros are biologically informative since binarized data can be used for downstream analysis, resulting in comparable results to those obtained using scRNA-seq counts (Bouland et al., 2023).Furthermore, we would like to highlight that the performance of the cell population-specific pseudobulk models when trained on sequence-only information is also not much lower as compared to both sequence and half-life time features (Supplementary Figure S9).This observation supports our conclusion that the half-life time features are not biasing the models towards scRNA-seq artifacts.
Two future enhancements that we envision that could improve the performance of our model are related to the half-life time features and the output of the model.Currently, we extract five features from the mRNA sequence to approximate the half-life time.Recently, a new model, Saluki, was developed that could predict mRNA degradation rates directly from the sequence of the gene (Agarwal and Kelley, 2022).Replacing the currently used features with those predicted by the Saluki model, or combining these features, might improve the cell population-specific predictions.A second potential improvement relates to the current output of scXpresso, which is the pseudobulk expression for every cell population, i.e., the average gene expression across all cells from that population.However, this ignores the variance within the population.It might be more beneficial to predict the distribution of gene expression across each population, instead of just one aggregated value.
In summary, we have shown the potential of predicting cell population-specific gene expression from genomic sequences by leveraging the resolution of single-cell data, opening the way for many new developments in this area.

FIGURE 1
FIGURE 1 Schematic overview of scXpresso and performance on Tabula Muris datasets.(A) We one-hot encode the DNA sequence around the transcription start site (TSS) and input this to a one-dimensional convolutional neural network (CNN).The output of the CNN is flattened and concatenated with the five half-life time features.The fully connected layers output the cell population's specific gene expression levels simultaneously (Supplementary Figure S1, see Methods).(B) Schematic overview of the experiment.(C,D) Performance of scXpresso t,b [tissue-specific (t) model on bulk (b) data] and scXpresso t,pb [tissue-specific model on pseudobulk (pb) data], respectively.Every dot is the performance (Pearson correlation) across one fold of the 20fold CV. (E) Performance of scXpresso cp,pb [cell population-specific (cp) model on pseudobulk data] summarized per tissue.Every dot represents the model's performance on a cell population in that tissue (median Pearson correlation across the 20 folds).(F) Performance of scXpresso cp,pb on the different lung cell populations.The grey line indicates the median performance across all cell populations.Every dot is the performance across one fold of the 20-fold CV.

FIGURE 2
FIGURE 2 Comparison of the three scXpresso models for making cell population-specific predictions.(A) Schematic overview of the experiment.(B) Boxplot showing the performances of scXpresso t,b [tissue-specific (t) model on bulk (b) data], scXpresso t,pb [tissue-specific model on pseudobulk (pb) data], and scXpresso cp,pb [cell population-specific (cp) on pseudobulk (pb) data] on the cell population-specific task.Every point in the boxplot is the performance of a model on one cell population in that tissue (median Pearson correlation across the 20 folds).(C) Similarity between a cell population and corresponding tissue (Pearson correlation between the true pseudobulk expression values) vs. the increase in performance (Δ cp,t , median Pearson correlation of scXpresso cp,pb -scXpresso t,pb ).Every dot is a different cell population and the colors represent the different tissues.(D-F) Comparing the predictions made by the lung tissue model (lung-model) and the B cell population model (B cell-model).Genes where the lung-model predicts a toohigh value are plotted in orange.(D,E) True expression of the B cells vs. predicted expression by the (D) lung-model and (E) B cell-model.(F) True expression of the lung cells vs. predicted expression of the lung model.

FIGURE 3
FIGURE 3 Comparing the predictions of scXpresso across cell populations and tissues.(A) Schematic overview of the experiment.(B) Performance (Pearson correlation) of three different types of models on different cell populations (rows) in different tissues (columns).Every dot is the median correlation of one model across the 20 folds.Since there are no T cells and macrophages defined in the Marrow and Lung dataset, these boxes are missing.(C) Pearson correlation of different models when predicting the expression of B cells in different tissues.The rows indicate on which tissue scXpresso cp,pb is trained, and the columns indicate for which tissue the expression of the B cells is predicted.(D) Pearson correlation of different scXpresso cp,pb when predicting the expression of B cells in the limb muscle.Again the rows indicate which model is used.

FIGURE 4
FIGURE 4Performance of scXpresso on the human motor cortex.(A) Schematic overview of the experiment.We train a tissue-(t), class-(c), and subclassspecific (sc) model (scXpresso t , scXpresso c , scXpresso sc , respectively) to predict the subclass-specific expression levels.(B) Boxplots showing the Pearson correlation between the true and predicted values.Every point in the boxplot is the performance on a fold (n = 20).(C) Scatterplot showing the relation between the variance of a gene across the pseudobulk values of the subclasses and the Pearson correlation between the true and predicted values across the subclasses.Every dot is a gene.(D) True and predicted expression for CACNA1I.Every dot is the expression in a subclass.Dots are colored according to their class.(E) Mutation profile for CACNA1I for the Sst Chodl subclass.For every position, we calculated the difference in expression for all three possible substitutions and visualized the substitution with the highest absolute predicted effect.Mutations that are predicted to increase or decrease the expression are plotted in blue and orange, respectively.The grey rectangle highlights the region around the TSS.The grey boxes indicate the positions of candidate cis-Regulatory Elements (cCREs) derived from ENCODE data (ENCODE Project Consortium et al., 2020).(F,G) Predicted effect, the predicted difference between the reference and alternative allele, of the three substitutions for (F) rs7288455 on CACNA1I expression, and (G) (Continued )

FIGURE 4 (
FIGURE 4 (Continued) rs10866912 on MROH6 expression.Every dot is one subclass and the dots are colored according to the class.(H) Sequence logo and the consensus sequence for the INSM1 transcription factor motif together with the sequence of the reference genome (bottom line).