Mathematical model for the relationship between single-cell and bulk gene expression to clarify the interpretation of bulk gene expression data

Graphical abstract


Introduction
Performing differential expression (DE) analysis of different sample groups is a standard approach in molecular biology. In recent years, transcriptome data have been used to comprehensively identify DE genes in different experimental groups [1], and several bioinformatics methods have been developed for this purpose [2][3][4]. In DE analysis, if gene expression levels differ significantly between diseased and non-diseased donors, then the genes are considered to be associated with that disease. Similarly, a comparison of gene expression data from different tissues or anatomical regions can therefore be used to identify tissue/ region-specific genes [5][6][7]. Expression quantitative trait locus analysis (eQTL) can be used to identify genetic variants in genotype groups that are significantly associated with gene expression levels, and in so doing, can facilitate an understanding of the mechanisms underlying gene regulation and interpretations of functional genetic variants [8,9]. Specifically, these DE analyses identify differences in mean expression values for bulk gene expression data between the groups based on disease, tissue or genotype ( Fig.1(a)).
On the other hand, differential variability (DV) analysis is another approach for identifying differences in gene expression [10]. DV analysis captures differences in variance of gene expression values between the groups ( Fig.1(a)). DV can capture biological information about a target disease or trait. To date, studies have employed DV analysis of transcriptome data to provide biological insights about disease and aging [11][12][13][14]. For example, a strong relationship has been reported between variability in gene expression and a chronic lymphocytic leukemia subtype [14]. In the context of eQTL analysis, the genetic loci associated with variance in bulk gene expression value are discussed as expression variability QTL (evQTL) [15]. Although the biological processes underlying DV have been investigated from a biological standpoint, such as gene expression noise or epigenetics [16], the interpretation of DV of gene expression remains unclear and controversial. Due to cellular heterogeneity, bulk gene expression data in DE or DV analyses are not typically sufficient for capturing the changes in the gene expression profiles of a cell population. Each sample in an experiment contains randomly selected cells, and each cell has a different gene expression level. Consequently, the cell population profile can be expressed as a probability distribution of gene expression, and the bulk gene expression data captures information of the average value of this distribution [17]. Recent advances in single-cell analysis have reported the existence of two different types of changes in a single-cell expression profile; shifts in gene expression and changes in the proportion of cellular subsets.
First, group differences shift the level of gene expression in the cell, and the direction and magnitude of this shift depends on the cellular subset. For example, it is known that tumor cell subpopulations show distinct drug responses [18]. The recent studies combining single nucleotide polymorphism (SNP) genotype data with single-cell RNA-seq (scRNA-seq) data or cytometry has shown that the effects of genetic variants on gene expression may differ depending on cellular subsets [19][20][21]. Such changes in the single-cell expression distribution can be expressed as shown in the left panel of Fig.1(b).
Second, group differences change the proportion of cellular subsets, as shown in the right panel of Fig.1(b). Differences among groups can alter proportion of cellular subsets by affecting cell differentiation, maturation and transformation. Studies have been conducted to identify cellular subsets with different proportions between sample groups [22,23]. Previous studies combining SNP genotype and cytometry analyses identified SNPs associated with different lymphocyte subsets [24,25]. In addition, it has been suggested that a large number of SNPs are associated with individual differences in lymphocyte profiles, even though their effects are small [25]. Changes in the proportion of cellular subsets can affect the bulk gene expression value. For example, when the proportion of a cellular subset with a relatively high gene expression level increases, the bulk expression levels also increase.
In the complex physiological and pathological changes that occur within cells, both shifts in gene expression and changes in cellular subset proportions can occur simultaneously, resulting in a combined contribution to the bulk expression value. For example, both of them in intestinal epithelial cell population have been observed in patients with Crohn's disease [26]. Evaluating how these changes in single-cell expression profiles are manifested in DE and DV genes is central to understanding the biological mechanisms underlying both DE and DV analysis. Further, given the increased interest in single-cell expression analysis in recent years, it is important to evaluate the results of single-cell expression analyses and compare them to the results of bulk expression analyses that have been reported to date. In this study, we describe a mathematical model for examining bulk gene expression levels and the single-cell expression distribution behind them. Specifically, single-cell expression profiles are modeled using a mixed probability distribution, and the relationships among their parameters and the mean and variance of the bulk expression values among samples are clarified. The model proposed in this study clarifies the interpretation of DE and DV analysis and provides new insights into the relationship between bulk and single-cell data analyses.

Mathematical model for relationship between a single-cell expression profile and a bulk gene expression value
We developed the following mathematical model to clarify the relationship between one single-cell expression profile and its bulk gene expression value for one gene ( Fig.2(a)). The bulk samples of multicellular organisms consist of many cells, which show cellular heterogeneity and consist of multiple cellular subsets. The distribution of gene expression for these cells can be expressed as a mixture distribution of those of different cellular subsets.
To evaluate the effect of a specific cellular subset, we used cellular subset + and cellular subset-. The single-cell expression value for a cell of cellular subset + is assumed to follow a probability distribution f þ ðyÞ with mean l þ . Similarly, that for a cell in cellular subset-is assumed to follow a probability distribution f À ðyÞ with mean l À . Let the proportion of cellular subset + be r and that of cellular subset-be 1 À r. Due to genetic and environmental effects, the parameters affecting the single-cell expression distributions are assumed to vary among donors. We modeled l þ ; l À ; r by assuming that these are random variables that are independent of each other.
Under this model, the single-cell expression level y follows a mixture probability distribution, as follows: The bulk gene expression value (Y) can be considered as the mean value for a single-cell expression distribution. If the bulk sample contains a sufficient number of cells, then the bulk gene expression level (Y) can be modeled with a proportionality constant k and measurement error (e) as follows: where, e is assumed to be independent of other random variables l þ ; l À ; r.
For simplicity, we set k = 1 in the discussion.
As a result, the group difference in bulk expression data, such as control vs. disease, are interpreted via the single-cell gene expression distribution according to Eq. 3. In DE analysis, the bulk data for the disease group (Group D) and the control group (Group C), E½Y D À E½Y C can be detected statistically as the difference of bulk expression values for each group. This model allows us to mathematically evaluate the relationship between the parameters for the single-cell expression distribution with cellular heterogeneity and the results of the bulk gene expression analysis ( Fig.2(b)).

Relationships among cellular heterogeneity, DE and DV analyses in the bulk experiment
From Eq. 3, the bulk expression value Y can be written as Consider the case of comparing bulk data for disease group (Group D) and the control group (Group C). In Group D, the expected value for the parameters of single-cell expression distribution in the model is shifted from those of group C, as follows: where Dl þ ; Dl À and a r express the difference in l þ ; l À and r between groups. Note that a r E½r C is restricted to the range 0 to 1.
The differential expression between the two groups, i.e., E½Y D -E½Y C , can be expressed as follows, based on Eq. 4 (see Appendix A for details).
. In addition, we let E C þ ; E C À and E C r equivalent to E½l þ ; E½l À and E½r in Group C, respectively.
Depending on the combination of a r ; d þ , and d À ; E½Y D À E½Y C can take a value of zero and never be identified by DE analysis, even though positive activation of gene expression is occurring at the single-cell level. Based on these results, if a group difference affects both the cellular subset proportion and the gene expression level in each cell, then it can be missed in the bulk gene expression analysis. Indeed, if d þ > 0 and d À > 0, then the condition for E½Y D À E½Y C 0 can be expressed as follows (from Eq. 6): where we assumed On the other hand, variance in the bulk gene expression value can be calculated from Eq. 3, as follows: From Eq. 8, the following relationship can be mathematically derived (see Appendix B for details): In addition, we let E þ ; E À ; E r ; V þ ; V À and V r be equivalent to E½l þ ; E½l À ; E½r; V½l þ ; V½l À and V½r, respectively. Eq. 9 suggests that there are three main factors that explain the differential variability between groups. First, V þ ; V À and V r directly affect the variability in bulk gene expression. Second, the change in E r affects the variability in bulk gene expression via the term E 2 r V þ þ ð1 À E r Þ 2 V À . If the group difference increases the more variable subset proportion, then V½Y can be increased. Third, ðE þ À E À Þ 2 can affect V½Y.
If the group difference changes ðE þ À E À Þ 2 , then it affects the bulk expression variance via the term ðE þ À E À Þ 2 V r . Importantly, even if V À ; V þ and V r do not change, a change in E þ ; E À and E r can change the variance in the bulk gene expression. Note that cell-to-cell variability V½y does not appear in this equation.

Visualization of the difference between DE and DV genes
Based on the above theory, we visualized the relationship between the parameter (dþ; dÀ; a r ) and an increase or decrease in the mean and variance of the bulk gene expression. We focused on a situation where expression is increased in both cellular subsets (d þ ! 0; d À ! 0) and the proportion of cellular subset + is decreased (a r 1). For the combinations of equally spaced d + and d-(0, 0.1, 0.2, . . ., 1) and three patterns of a r (0.2, 0.5, 0.8), we calculated the difference in the means and variances, and visualized the parameter space with an increase and a decrease in the mean and the variance of the bulk gene expression. We set other parameters in the model as follows:

Simulation analysis of DV genes
We created a computational simulation scheme for the proposed mathematical model. First, this model contains fourteen parameters: N is the number of samples, and a r are the parameters that define individual differences in the distribution parameters described in the above model, n is the number of cells in the bulk sample, V err is the measurement noise associated with quantifying the bulk expression value, and V yþ and V yÀ are the variances of single-cell expression levels in each subset.
First, for N samples in the Group C, l þ ; l À and r are sampled from the Normal or uniform distributions as shown below, and assigned to each sample. Normal and uniform distributions are uniquely determined by the mean and variance parameters.
Also, for N samples in the Group D, l þ ; l À ; r are sampled according to d þ ; d À and a r , based on Eq. 5.
Next, we generated the single-cell expression value (y) of each sample based on Eq. 1, where f þ ðyÞ and f À ðyÞ are modeled as normal distributions. From the following formula, we sampled n cells independently.
y $ rNormalðl þ ; V yþ Þ þ ð1 À rÞNormalðl À ; V yÀ Þ After sampling n single-cell expression values, the bulk expression value is calculated as the mean of y ¼ ½y 1 ; y 2 . . . y n and a measurement error (e) is added.
As a result, the bulk expression values for N control samples and N disease samples were simulated. Based on this simulation scheme, we simulated two situations in which only the variance of the bulk expression is changed without changing the mean value. Eq. 9 shows that the difference in E r (Example 1) or ðE þ À E À Þ 2 (Example 2) can change the bulk expression variance, even if V þ ; V À and V r are the same in the two groups.
(Example 1) We simulated an example where the variance in bulk gene expression decreases as E r decreases in the disease group when V þ > V À . The model parameters were as follows: N ¼ 1000; E þ ¼ 2; E À ¼ 1; E r ¼ 0:9; V þ ¼ 1; V À ¼ 0:1; V r ¼ 0:001; n ¼ 10000; V err ¼ 0:1; V yþ ¼ 1 and V yÀ ¼ 1. d þ ; d À ¼ 0:5 were used so that the value of E þ À E À was the same in the control and disease groups. a r was set so that E½Y C ¼ E½Y D was satisfied. We checked the simulated bulk expression data for the two groups to confirm whether the changes in the proportions of cellular subsets could cause DV.
(Example 2) We simulated an example where the variance in the bulk gene expression increases as E þ À E À increases. The model parameters were as follows: N ¼ 1000; E þ ¼ 2; E À ¼ 1; E r ¼ 0:9; V þ ¼ 0:01; V À ¼ 0:01; V r ¼ 0:001; n ¼ 10000; V err ¼ 0:1; V yþ ¼ 1 and V yÀ ¼ 1. V þ and V À were the same to repress the effect of differences in the proportions of different cells. Instead, d þ ¼ 10 and d À ¼ 0:1 were used so that E þ À E À increases. a r was determined so that E½Y C ¼ E½Y D was satisfied. We checked the simulated bulk expression data for the two groups to confirm whether the different gene expression shifts for each subset could could cause DV.

Real single-cell RNA-seq data analysis
Here, we described the analysis of real single-cell RNA-seq data by applying our method to elucidate single-cell expression changes underlying bulk DE and DV genes. We used the public scRNA-seq dataset for ulcerative colitis (UC) (NCBI GEO ID:GSE125527) [27]. In this study, we used data for the processed scRNA-seq of human peripheral blood mononuclear cells (PBMCs) from seven patients with ulcerative colitis (UC) and eight control donors (NCBI GEO ID:GSM3576411-GSM3576425). After log(1 + count value) transformation, the sum of the gene expression values for each cell was normalized to be 10 6 and used for downstream analysis.
The DE genes and DV genes were identified at the bulk level using the following procedure. Bulk gene expression levels for each sample were defined as the average of the single-cell expression values among cells. For the genes with an average bulk gene expression level of > 20 in the 15 samples, we performed Welch's t-test and F-test analyses to compare the UC and the control groups and calculated p values. We applied Benjamini-Hochberg (BH) correction [28] to the t-test p values and identified the genes with adjusted p values < 0:05 as DE genes. We also applied BH correction to the F-test p values and identified the genes with adjusted p values < 0:05 as DV genes.
For the identified top DE gene and top DV gene with the smallest p value, we estimated the parameters of single-cell expression distribution (l þ ; l À and r). In many cases, the single-cell expression distribution consists of the cell population with zero or small expression values, and cell populations with higher expression values. For the single-cell expression distribution of each gene in each sample, we applied kmeans clustering and divided the cells into subset + and subset-. By calculating the mean expressions and proportions of subset + and subset-cells, we could then estimate l þ ; l À and r for each sample. We then evaluated the distributions of these estimated parameters for the samples and identified differences in the single-cell expression distributions that underlie the bulk expression of DE and DV genes. Fig. 3(a) shows E½Y D À E½Y C for the combination of d þ ; d À and a r . Even though the expression shift is positive for subsets (d þ ! 0; d À ! 0), the difference in mean expression value can be negative depending on a r . Fig.3(b) shows the difference in variance for the combination of d þ ; d À , and a r . Compared to Fig.3(a), the plot pattern is different. These results indicate that DE and DV analysis capture different features of the single-cell distribution.

Theoretical and simulation analyses
We simulated situations where the variance in bulk expression differed between sample groups without changing mean values. Fig.4(a) shows an example where only changes in the proportions of cellular subsets alters variance of bulk gene expression. This is because the subset + cells with large variance decrease in number and the subset-cells with small variance increase in number. Fig. 4, (b,c) shows the parameter distributions for l þ ; l À and r in the two groups in this simulation. Fig.4(d-f) shows another example where a shift in gene expression changes variance of bulk gene expression. These results show that a changes in the proportions of cellular subsets or a different expression shifts for cellular subsets can cause DV.

Real single-cell RNA-seq data analysis
We used a public human scRNA-seq dataset compiled using data for seven patients with UC and eight control subjects. The results of a bulk level analysis showed that there were 289 DE genes and four DV genes. No matches were observed between these DE and DV genes. We investigated the top DE and DV genes with smallest p value in detail.
The top DE gene in the bulk expression analysis was POM121, which had a mean bulk expression value that decreased significantly in UC group (Fig.5(a); t-test p value is 1:51 Ã 10 À5 ). Fig.5(b) shows a histogram of single-cell expression values for pooled cells of each group, which shows that the expression level for this gene was low in a large number of cells (cellular subset -) and high in a small number of cells (cellular subset+). Fig.5, (c,d,e) shows the estimated l þ and l À ; r values for each sample group. The findings suggest that the gene expression level increased markedly in subset + cells in UC group, while the proportion of these cells decreased. Although these two effects act in opposite directions, the greater influence of the former effect increases bulk gene expression.
The top DV gene is MAP1LC3B2 whose bulk expression variance is significantly decreased in UC group (Fig.6(a); the F-test p value is  8:77 Ã 10 À7 ). Fig.6(b) shows a histogram of single-cell expression values for pooled cells of each group, which also shows that the expression level for this gene was low in a large number of cells (cellular subset -) and high in a small number of cells (cellular sub-set+). Fig.6, (c,d,e) shows the estimated l þ and l À ; r values for each sample group. While the mean and variance of the bulk expression level decreased in UC group, l þ increased. The large decrease in the mean and variance of r in UC group is considered to lead significant difference of variance.

Discussion
The proposed model provides the insights for the interpretations of DE and DV genes identified using bulk data. First, when changes in the proportions and expression shifts in cellular subsets occur, they may cancel each other out and not be detected in bulk data analysis. Second, underlying DV gene, there is a combined contribution of different expression shifts for cellular subsets, changes in the proportion of cellular subsets, and changes in individual differences in the parameters for single-cell expression profiles. These are important considerations when interpreting the results obtained from bulk expression analysis. Third, DV and DE capture different aspects of single-cell expression profile differences. In recent years, methods for directly detecting dissimilarities in single-cell expression distributions using scRNA-seq or cytometry data have been proposed [29][30][31][32][33]. Our findings provide an insight into the theoretical relationships among DE and DV gene identified in bulk experiment, and differential distributed genes identified in single-cell experiment.
Here we examine the difference between variability in the bulk expression data (V½Y) and the cell-to cell variability (V½y). Recently, studies on single-cell analysis have examined cell-tocell variability and the statistical methods for analyzing this expression characteristics [34,35]. Our analysis showed that the formula for V½Y does not include V½y and that there is no direct relationship between them when the sample contains a sufficiently large number of cells. Nevertheless, shifts in gene expression or changes in the proportion of cellular subsets will also induce changes in V½y in the same direction in individual samples. In addition, changes in V½y; V½l þ ; V½l À and V½r due to disease or aging may indicate the existence of a common mechanism for disruption of the control mechanism for biological phenomena.
The scRNA-seq expression data is characterized by the inclusion of many zero values [36]. Then, statistical models that can handle zero-inflation are often used in single cell data analysis. Specifically, it is often modeled by probability distributions such as negative binomial, poisson, zero-inflated negative binomial or zeroinflated poisson [36]. Our model is applicable to any probability distribution including the zero-inflation model. Since the relationship between expectation value of distribution and its parameters is known mathematically for these theoretical distributions, it is possible to calculate the values such as E þ or V þ from the mean and variance of the parameter values among samples, which provide the insights about the mean and variance of the bulk gene expression values. If future large scale single cell genomics studies will provide insight into the distribution of the parameters among samples, it will enhance our understanding of the relationship between single cell and bulk data analysis even more. The analysis of real scRNA-seq data presented in this study is an effective tool for examining the relationship between scRNA-seq data and bulk data, but the study has several limitations. A major limitation is that the estimates obtained forÊ þ ;Ê À ;Ê r ;V þ ;V À and V r are not always accurate. These estimates can be affected by biases associated with clustering and parameter estimation. Also, the number of cellular subsets that are actually present in a tissue is not always known, and the identification and classification of cellular subsets using bioinformatic methods is a major research task in the field of single-cell genomics. Combining the mathematical model proposed in this study with advanced bioinformatics methods may further the study of single-cell genomics.
In our theoretical framework, the assumption that the bulk gene expression value is proportional to the mean value of the single-cell gene expression profile is essential. In real data analysis, the attention should be paid to whether this assumption holds. If very rare subset has non-zero expression value, the analysis will be susceptible to sampling bias. In such cases, it would be necessary to obtain data from a larger number of cells to capture the information of distribution. In addition, this assumption is also an unstable for low expressed gene because gene expression quantification by RNA-seq is unstable technically. Therefore, filtering the low expressed genes are important step as preprocessing under our theoretical framework.
The model described in this study could potentially be used in theoretical fields. Extending the model to poly-genes will allow more bulk expression analysis methods to be applied at the single-cell level. For example, gene co-expression network analysis is performed extensively in transcriptome analysis where it is used to infer biological processes and the roles of important transcription factor genes in complex traits [37,38]. It is necessary to consider at least two genes when using the model to investigate correlations between gene expression. While gene co-expression analysis has been used to clarify relationships in gene regulation, it is unclear what exactly the identified relationships captures. Another extension would be a model that considers spatial information. When acquiring bulk gene expression data, not only information on the shape of the distribution but also spatial information is lost. In recent years, with the development of spatial genomics technology, single cell transcriptome data can be obtained with spatial information [39]. Since our framework works as a model of cellular population heterogeneity in general, it is possible to interpret cellular subset as spatial information. For example, it can also be used as a model for spatial information by setting f þ ðyÞ as the distribution on the specific region and f À ðyÞ as those on another region. As candidate for future improvements, mathematical models that simultaneously consider cellular subsets and regional information can be considered.
On the application side, our mathematical model could be applied, not only to the analysis of gene expression data, but also to the analysis of arbitrary biomolecular expression data. For example, in the epigenome layer, an increase in the variability of DNA methylation intensity at the bulk level has been reported to be associated with aging [40][41][42][43]. In recent years, single-cell expression data have been obtained for various omics layers. It is considered that the concepts presented in this study will be useful for reinterpreting molecular biology knowledge obtained at the bulk level using single-cell data.

Conclusion
In this study, we present a mathematical model to clarify single-cell expression profiles with cellular heterogeneity and bulk gene expression data. The model considered the shift in gene expression and changes in the proportion of cellular subsets. Theoretical and simulation analyses showed that the DE analysis can overlook significant changes in gene expression at the single-cell level. In addition, it is revealed that DV analysis capture the feature affected by different expression shifts for cellular subsets, changes in the proportions of cells, and variations in single-cell distribution parameters among samples. The model presented in this study effectively clarifies the differences in interpretation of DE gene and DV gene identified in bulk experiment and provides new insights into the relationship between bulk data analysis and single-cell data analysis.

Funding
This work was funded by a KAKENHI Grant-in-Aid from the Japan Society for the Promotion of Science (JSPS; Grant No. 21K21316).

Code availability
The code used in this study is available athttps://github.com/ DaigoOkada/ScBulkModel.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. E½Y D À E½Y C ¼ ðE C þ À E C À ÞðE C r ða r À 1Þ þ a r E C r d þ þ ð1 À a r E C r Þd À Þ Appendix B. The detail of Eq. 8 The following properties of the variance and covariance properties of the two random variables A and B are used: The covariance can be expressed using expected values, as follows: Cov½A; B ¼ E½AB À E½AE½B ð B:2Þ The following equation holds for the variance of the product of independent random variables.
Using these formulas, the variance V½Y of the bulk expression value can be expressed mathematically as follows, where we let E þ ; E À ; E r ; V þ ; V À and V r be equivalent to E½l þ ; E½l À ; E½r; V½l þ ; V½l À ; V½r respectively.