Controlled noise: evidence of epigenetic regulation of single-cell expression variability

Abstract Motivation Understanding single-cell expression variability (scEV) or gene expression noise among cells of the same type and state is crucial for delineating population-level cellular function. While epigenetic mechanisms are widely implicated in gene expression regulation, a definitive link between chromatin accessibility and scEV remains elusive. Recent advances in single-cell techniques enable the study of single-cell multiomics data that include the simultaneous measurement of scATAC-seq and scRNA-seq within individual cells, presenting an unprecedented opportunity to address this gap. Results This article introduces an innovative testing pipeline to investigate the association between chromatin accessibility and scEV. With single-cell multiomics data of scATAC-seq and scRNA-seq, the pipeline hinges on comparing the prediction performance of scATAC-seq data on gene expression levels between highly variable genes (HVGs) and non-highly variable genes (non-HVGs). Applying this pipeline to paired scATAC-seq and scRNA-seq data from human hematopoietic stem and progenitor cells, we observed a significantly superior prediction performance of scATAC-seq data for HVGs compared to non-HVGs. Notably, there was a substantial overlap between well-predicted genes and HVGs. The gene pathways enriched from well-predicted genes are highly pertinent to cell type-specific functions. Our findings support the notion that scEV largely stems from cell-to-cell variability in chromatin accessibility, providing compelling evidence for the epigenetic regulation of scEV and offering promising avenues for investigating gene regulation mechanisms at the single-cell level. Availability and implementation The source code and data used in this article can be found at https://github.com/SiweiCui/EpigeneticControlOfSingle-CellExpressionVariability.


Introduction
Single-cell RNA sequencing (scRNA-seq) has become a crucial and powerful tool for characterizing gene expression at the individual cell level, offering unprecedented opportunities to study gene regulation from various aspects (Osorio et al. 2022, Xu et al. 2022, Yang et al. 2023a).While many studies have focused on examining the mean expression levels of genes, researchers have increasingly recognized the importance of cell-to-cell variability in gene expression, also known as single-cell expression variability (scEV) or gene expression noise, among cells of the same type and state (Snijder and Pelkmans 2011, Wada et al. 2020, Zheng et al. 2023).Dueck et al. (2016) proposed the "variation is function" hypothesis, suggesting that scEV plays a significant role in manifesting population-level cellular function.Their hypothesis highlights that scEV pertains to the diversity within a highly homogeneous population of cells rather than to the diversity of a mixed cell population with different, clearly distinct cell types that have already been recognized in previous studies.Osorio et al. (2019) validated the "variation is function" hypothesis by identifying highly variable genes (HVGs) from homogeneous cells of different cell types and demonstrated that, for each cell type, the functions of HVGs are enriched with biological processes and molecular functions precisely relevant to the biology of the corresponding cell type.Other studies also support the functional importance of scEV.For instance, Wiggins et al. (2021) reported an increase in scEV in BRCA1associated breast tumors, providing new insights into the associations between genes and disease phenotypes.
Recently, with advancements in single-cell sequencing technology, researchers have been able to simultaneously assay both chromatin accessibility and transcriptomic profiles within individual cells using scATAC-seq and scRNA-seq, respectively, forming single-cell multiomics data.The scATAC-seq data provide valuable complementary information to the scRNA-seq data and enable a more intricate examination of the relationship between epigenetic modifications and gene expression dynamics at the single-cell level (Stuart et al. 2021, Kartha et al. 2022).Granja et al. (2021) developed a gene score (GS) based on the aggregation of scATAC-seq peaks that overlap with the gene window on open chromatin, which serves as a measure of gene expression control.They discovered a significant overall Pearson correlation between the constructed GSs and gene expression levels.Mitra et al. (2024) proposed a regularized Poisson regression method to predict scRNA-seq data using scATAC-seq data.Their approach showed improved prediction performance on both high-and low-coverage multiomics datasets compared to GS methods.More recently, Saelens et al. (2023) employed neural network architectures to predict scRNA-seq data and identified differentially accessible chromatin regions between different cell types.
The coassay of scATAC-seq and scRNA-seq within the same cells offers an opportunity to study the epigenetic regulation of scEV.However, it is still unclear whether chromatin accessibility, as measured by scATAC-seq, impacts scEV.If so, how?There are several challenges in studying this topic.First, measuring the effects of scATAC-seq on scEV is difficult since the relationship between chromatin accessibility and gene expression is still an active research topic.A new measure needs to be proposed to solve this problem.Second, the source of scEV comes from both genetic and nongenetic factors and from both random and nonrandom effects (Li and You 2013).Eling et al. (2019) noted that accurately defining, measuring, and disentangling the stochastic and deterministic components of cell-to-cell variability is challenging.Third, scATAC-seq data is typically extremely sparse and high-dimensional with large variation (Buenrostro et al. 2015).Thus, further robust analysis is needed to explore the relationship between scATAC-seq and scEV.
In this article, we propose a novel pipeline to investigate the notion of "the epigenetic regulation of scEV."Our key idea is to use chromatin accessibility peaks (scATAC-seq data) to predict gene expression (scRNA-seq data) in single cells and compare the prediction performance between HVGs and non-highly variable genes (non-HVGs).For each gene, we first constructed predictive models of scRNA-seq via scATAC-seq using established methods.For a given gene, a higher prediction performance of the model indicated that the variability in expression of this gene between cells can be largely explained by chromatin accessibility.A significantly better prediction performance for HVGs than for non-HVGs means that the expression of HVGs is more strongly related to chromatin accessibility than that of non-HVGs.In other words, HVGs are subject to stricter epigenetic regulation than non-HVGs are.This provides evidence for the epigenetic regulation of scEV and underscores the role of epigenetic factors in determining scEV.We selected seven predicted models and two HVG detection methods in total.As a result, we obtain reliable measurements of prediction performance and increase the robustness of our results.
We applied our pipeline to the single-cell multiomics data of paired scATAC-seq and scRNA-seq obtained from hematopoietic stem and progenitor cells (HSPCs).We found that the overall prediction performance of HVGs was indeed better than that of non-HVGs, which supports the epigenetic regulation of scEV.The results from further in-depth analysis provide additional evidence for the effects of chromatin accessibility on scEV.First, there was a significant correlation between the prediction performance and the level of scEV for different genes, indicating that chromatin accessibility is not independent of scEV.Second, there was a significant overlap between the well-predicted genes and HVGs, suggesting that chromatin accessibility plays a crucial role in the regulatory mechanism of HVGs.Therefore, we suggest that the variable expression of HVGs among cells is due to the varying levels of chromatin accessibility among these cells, and the regulation of HVGs can be inferred by analyzing the enriched transcription factor-binding sites in the scATAC-seq data.Finally, we validated the ability of scATAC-seq to reveal genes that indicate cell function through enrichment analysis of the well-predicted genes.The proposed pipeline was also applied to two additional datasets, one of HSPCs and the other of neurons.The above findings still hold on these two datasets.In conclusion, our results provide single-cell evidence for the association between chromatin accessibility and scEV, highlighting the deterministic role of epigenetic mechanisms in modulating the stochasticity of gene expression.Our results also suggest a new approach for studying the mechanism of gene regulation by investigating how chromatin accessibility affects scEV.

Materials and methods
There are four main steps in our testing pipeline, as shown in Fig. 1: (i) collecting and preprocessing scATAC-seq and scRNA-seq datasets, (ii) building predictive models on scRNA-seq data by scATAC-seq data and assessing the prediction performance of models for each gene, (iii) calculating the level of scEV for each gene and assigning genes to the HVG or non-HVG groups, and (iv) comparing the prediction performance of each model when applied to HVGs and non-HVGs.

Data collection and preprocessing
We used the multiomics datasets provided by the Kaggle competition of Open Problems in Single-Cell Analysis (Burkhardt et al. 2022).This competition comprises measures of peripheral CD34þ HSPCs from healthy human donors over a 10-day period.To ensure cell homogeneity in our study, we focused on data from a single donor (Donor #31800) and a single time point (Day 2).The paired scATACseq and scRNA-seq datasets are derived from the "10× Chromium Single Cell Multiome ATAC þ Gene Expression" multiomics technology.Starting with the scATAC-seq data, we removed cells that contained fewer than 1000 scATAC-seq peaks.For the scRNA-seq data, we removed cells with fewer than 200 expressed genes and genes that were expressed in fewer than 200 cells or had a mean expression lower than 0.5.We also filtered out peaks and genes from the X, Y, and mitochondrial chromosomes, as well as those located within blacklisted genomic regions (Carroll et al. 2014).Ultimately, we retained 6871 cells, 206 611 peaks, and 9657 genes for our analysis.The widths of most peaks range from 200 to 1000.The scATAC-seq peak counts were subsequently transformed using TF-IDF (Chowdhury 2010).The scRNA-seq gene expression counts were sequentially library-size normalized and subjected to log transformation using log(a þ 1) for each count value a.
Check of cell homogeneity: To select homogeneous cells, we used the Signac R package (v.1.13.0) (Stuart et al. 2021) to make a two-dimensional representation of the cells using the scATAC-seq data.Specifically, the FindTopFeatures() function was used to remove features present in less than 1% cells, the RunPCA() function was used to extracted 50 reduced dimension representation of the scATAC-seq data, and then the RunUMAP() function was used on 50-dimensional features to obtain the two-dimensional non-linear representation for visualization as shown in Fig. 2 (McInnes et al. 2018).Following a similar approach as Osorio et al. (2019), we manually picked one cell from the center as the core cell.
We expanded the core by selecting the cells that were closest to the core cell until we had a total of 3000 homogeneous cells (including the core cell itself).These selected cells were then used for further analysis.

Methods for predicting scRNA-seq gene expression with scATAC-seq peaks
Let N, p, and q be the number of cells, peaks, and genes, respectively, in the study.X 2 R N × p and Y 2 R N × q denote the scATAC-seq data matrix and the scRNA-seq data matrix, respectively.The jth column of Y, y j 2 R N , includes the expression levels of the jth gene in all cells.For each gene, we focused on building a predictive model on y j with X.
Despite the availability of diverse predictive models, the use of scATAC-seq data for predicting scRNA-seq data remains in its early stages.To identify the most effective and reliable models, we explored seven different types of models that cover a wide range of established methods.These models can be classified into two categories: global and local methods.Global methods use all peaks as informative features for predicting gene expression, while local methods concentrate on peaks near the target gene on the same chromosome.

Global methods
Five global methods were included and used in our analysis.We briefly describe models of these methods as follows.
1) K nearest neighbors (KNN): The KNN algorithm is a commonly used benchmarking model for scRNA data prediction (Wu et al. 2021).For each cell, KNN first finds its k most similar cells according to scATAC-seq data; these cells are called its neighbors.Then, KNN predicts the expression level of a gene in this cell by averaging the expression levels of its neighbors.As the scATAC-seq data are extremely sparse, the similarity between the two cells was calculated via their peak coexpression.a 1 ; a 2 ; a 3 are the number of peaks where both cells have a nonzero count, only the first cell has a nonzero count, and only the second cell has a nonzero count, respectively.Then, we used a 1 =ða 1 þ a 2 þ a 3 Þ to measure the similarity between two cells.We selected k ¼ 10 in the analysis for KNN. 2) Lasso (Tibshirani 1996): Lasso is a popular linear regression model used to predict the expression levels of genes.It can identify important predictors that are most helpful for predicting the response variable.Specifically, Lasso minimizes the objective function of a squared loss term and a penalty term as follows: where α and β are the intercept and the coefficient vector of regression, respectively, and λ is a penalty parameter to control the sparsity of β.After β is estimated, the peaks corresponding to the nonzero entries of β are those selected by Lasso that potentially affect the expression A homogeneous population of 3000 cells was selected for analysis.The Signac R package was used to produce a two-dimensional nonlinear representation of the scATAC-seq data.A core cell was chosen, and the 3000 cells nearest to the core cell (including itself) were selected as a group of homogeneous cells.The selected cells are surrounded by the black circle.
Evidence of epigenetic regulation of single-cell expression variability levels of the gene.We selected λ using the Bayesian information criterion with the additional requirement that at least five peaks were selected by Lasso to avoid a trivial model.3) Peak-fusion-based model (PeakFusion) (Cao et al. 2018): PeakFusion is a technique for addressing the high sparsity of scATAC-seq data, as more than 90% of the entries in the scATAC-seq matrix are zero.PeakFusion reduces the percentage of zeros in scATAC-seq data by summing the peaks within every 100 kb in chromosomes and forming an aggregated new predictor.The constructed new predictors were subsequently used to construct Lasso models to predict the expression levels of genes.4) LightGBM (Ke et al. 2017): LightGBM is a nonlinear machine learning method based on a gradient-boosting decision tree.Compared with other boosting methods, LightGBM has a faster training speed and lower memory consumption due to the use of histogram optimization, the depth-first split strategy, the gradient-based one-sided sampling strategy, and the exclusive feature bundling strategy.We used the sklearn Python package (v.1.3.1) to construct LightGBM and set the number of trees to 10 while the other tuning parameters kept default.5) Multilayer neural network (MNN): Neural networks have shown their power in numerical prediction problems in single-cell data analysis (Saelens et al. 2023, Yang et al. 2023b).For each gene, we constructed an MNN comprising two hidden layers with the structure p-128-64-1.For each hidden layer, we added a dropout layer to prevent overfitting, for which the dropout rate ¼ 0.2 (Srivastava et al. 2014).The ReLU function was selected as the activation function, and the loss function was the mean square loss between the predicted expression level and the true expression level in the training dataset.The specific structure of MNN used is shown in Fig. S13 in the Supplementary Material.Given that the number of peaks is very large, we also used the peak fusion technique described above to reduce the dimensionality of the input data and the computing time.All the input predictors were also normalized to [0,1] before training the model.This model is implemented via the Keras Python package (v.2.8.0).

Local methods
Many existing studies (Alanis-Lobato et al. 2024, Mitra et al. 2024) suggest that the expression level of a gene is significantly influenced by nearby peaks on the same chromosome.As a result, we included two local prediction methods in our analysis.
1) GS: GS is an unsupervised method introduced in ArchR ( Granja et al. 2021).In GS, the distance between the start of the target gene and the start of the peak is calculated.This distance is then converted to a distance weight by Distance weight ¼ e À absðdistance=5000Þ þ e À 1 : Next, the distance weight for the target gene is multiplied by the inverse of the gene size to account for differences in length among genes.The resulting value is then scaled from 1 to 5. Finally, the GS for a gene in a cell is calculated by taking the elementwise product of the transformed distance weight and the peak.We calculated GSs for all cells and treated them as predictions of gene expression levels.
2) Location information-based regression (LocReg): LocReg uses the nearby peaks of a gene as predictors to construct a Lasso model for predicting gene expression levels.In practice, we used the biomaRt R package (v.2.54.1) (Durinck et al. 2009) to obtain the location information of each gene.We considered only peaks within 1.2 Mbp of the gene for both local methods.

Performance evaluation and selection of well-predicted genes
To evaluate the performance of the models in predicting gene expression, we randomly divided cells into a 70% training dataset and a 30% test dataset.The predictive models mentioned earlier were subsequently constructed using only the cells in the training dataset for each gene.These models were subsequently employed to predict the gene expression of cells in the test dataset.Pearson correlation between the actual and predicted gene expression levels in the test dataset was computed to evaluate the performance of each model.To ensure a reliable and robust evaluation, we repeated this random training-test sample splitting five times and obtained an average Pearson correlation as the metric for model evaluation.A higher average Pearson correlation coefficient for a given gene indicates that its expression level can be more accurately predicted using the scATAC-seq data.Furthermore, for each prediction model, we ranked the genes based on their average Pearson correlation coefficient.The genes with the highest K Pearson correlation values were identified as well-predicted genes in our analysis, as their expression levels are primarily controlled by single-cell chromatin accessibility.

Measurement of scEV and identification of HVGs
Two established methods were chosen to measure the scEV levels of genes and identify HVGs from the scRNA-seq data.Both methods have been widely used and have shown effectiveness in previous studies.
1) Vst method in Seurat (v5.0.2) (Satija et al. 2015, Butler et al. 2018, Stuart et al. 2019, Hao et al. 2021): Vst uses locally estimated scatterplot smoothing to fit the relationship between the log of the variance and the log of the mean of gene expression levels.With the fitted variance and observed mean, the original data was normalized, and the new variance in each gene in the normalized data was considered a metric of scEV.A higher value indicates greater confidence in a gene being classified as an HVG. 2) Splinefit method in scGEAToolbox (v23.2.1) (Cai 2020, Sheng and Li 2021, Gilis et al. 2023): In addition to the mean and variance in gene expression levels, the splinefit method also considers the dropout rate of each gene and performs a spline fitting of a 3D curve using the three dimensions of dropout rate, log of the mean expression level, and log of the variance in the expression levels of all genes.After fitting the 3D curve, the distance from each gene to the curve was calculated and used as a metric of scEV.Genes that are farthest from the curve were identified as HVGs.
We applied both Seurat and scGEAToolbox to the scRNAseq dataset.A systematic comparison of the results of the two methods is presented in Fig. 3a and b.The Pearson correlation coefficient between the two different methods for determining the scEV levels of all the genes was as high as 0.76 as shown in Fig. 3c.For each method, we ordered genes according to their scEV level in descending order.The top K genes were identified as HVGs, and the other genes were denoted as non-HVGs.When K ¼ 200, the number of overlapping HVGs detected by both methods was as high as 179 (89.5%, Fig. 3d).These results confirm the consistency of the scEV and HVG measurements obtained by the two methods.

Comparison of prediction performance when models are applied to HVGs and non-HVGs
To investigate the effects of chromatin accessibility on scEV, we compared the prediction results and scEV levels using three different tests.First, as discussed in Section 1, we tested whether the overall prediction performance of HVGs was significantly greater than that of non-HVGs.Second, we calculated the Spearman correlation coefficient between the prediction performance and the scEV count and tested the significance of this correlation.Third, we identified K 1 wellpredicted genes and K 2 HVGs and examined their overlap.We used Fisher's exact test to determine the independence between the well-predicted genes and HVGs.Finally, we conducted a gene functional enrichment analysis of the wellpredicted genes using Enrichr (Kuleshov et al. 2016) through its web interface to explore whether they are associated with cell-specific functions.Evidence of epigenetic regulation of single-cell expression variability

Comparing the prediction performance of different predictive models
The predictive capabilities of all the genes across the seven different models are shown in Fig. 4, which includes a boxplot illustrating the Pearson correlation coefficients of the genes (Fig. 4a).Among the seven models, MNN demonstrated the best overall prediction performance across all genes, with a maximum correlation coefficient of approximately 0.6.In contrast, GS performed the worst, possibly because it was the only unsupervised method in the dataset.Importantly, compared to local methods, global methods consistently outperform local methods, indicating a superior overall predictive capacity.These findings suggest that certain peaks, even those located at a distance from the target genes, control the expression levels of these genes.Therefore, these distant peaks should not be disregarded and should be included in predictive models.
The distributions of Pearson correlation coefficients on genes obtained using the seven models exhibit several common characteristics.Notably, the medians of Pearson correlation coefficients consistently hover at approximately 0, indicating a general trend.This phenomenon may come from the stochastic nature of both scRNA-seq and scATAC-seq data.Also, genes may only play roles in certain cell types or under certain conditions, lacking the necessary regulatory signals that can be detected from the association between scRNA-seq and scATAC-seq data points.Many essential genes involved in key cellular mechanisms often have multiple regulatory elements.Consequently, regression models between scRNA-seq and scATAC-seq might not fully capture the complete picture of regulatory complicity, and only a number of genes that are directly and strongly associated with cell activities present strong associations between scRNA-seq and scATAC-seq data.However, some genes had relatively high Pearson correlation coefficients and were identified as outliers in the boxplot.This observation highlights that scATAC-seq data can accurately predict only a limited subset of genes, which we refer to as well-predicted genes in our analysis.These well-predicted genes exhibit strong associations with chromatin accessibility.
To evaluate the reliability of the prediction results, we calculated Pearson correlations between the prediction performances of each pair of models, as illustrated in Fig. 4b.With the exception of pairs involving GS, all combinations of models demonstrated a statistically significant positive Pearson correlation (P-value < 0.001), indicating a consistent predictive relationship between the scRNA-seq data and the scATAC-seq data across these models.Notably, MNN and LightGBM had the highest Pearson correlation (0.637), underscoring their particularly strong and aligned predictive capabilities.

Comparing prediction performance between HVGs and non-HVGs
Next, we compared the prediction performance of each predictive model when applied to HVGs and non-HVGs.We first identified K ¼ 200 HVGs using either Seurat or scGEAToolbox.Boxplots were subsequently generated for each predictive model to show the average Pearson correlation for HVGs and non-HVGs (Fig. 5a).The Wilcoxon test was used to compare the medians of the average Pearson correlations between the HVGs and non-HVGs.The overall prediction performance of HVGs is significantly greater than that of non-HVGs in all the models except for GS.The MNN model achieved the highest median Pearson correlations (0.26 and 0.25 for Seurat and scGEAToolbox, respectively) for HVGs.In contrast, the median Pearson correlation for non-HVGs in MNN was approximately 0. These results showed that within homogeneous cells, the scATAC data could provide much better predictions of the expression levels of HVGs than non-HVGs, suggesting that chromatin accessibility is a determining factor for scEV.
To further determine the relationship between chromatin accessibility and scEV, we systematically investigated the overall correlation between the average Pearson correlation coefficient and the scEV level for each gene according to each combination of predictive models and HVG detection methods.Figure 5b shows the scatterplot of genes in these two dimensions.A significant positive Spearman correlation was observed for all the combinations, except for those involving the GS model.The higher the predicted correlation coefficient is, the more likely a gene is to be considered an HVG.Notably, the Lasso and MNN methods achieved the highest Spearman correlations, surpassing 0.5.
Additionally, we examined the ability of chromatin accessibility to determine HVGs.First, for each predictive model, we identified the top K 1 ¼200 well-predicted genes.We examined how many of these well-predicted genes were classified as HVGs as the number of detected HVGs K 2 increased.Figure 5c shows that around 80% of the top 200 wellpredicted genes are denoted as HVGs when K 2 ¼ 500 by the MNN model.Second, for each predictive model, we selected the top K well-predicted genes with the best prediction performance as straightforward predictions for the top K HVGs. Figure 5d shows the overlap between the well-predicted genes and the detected HVGs for K ¼ 200, 500, 1000, and 2000.Among all predictive models, the well-predicted genes by the MNN model exhibited the highest overlap with HVGs, accounting for 53.5% for Seurat and 51.5% for scGEAToolbox when K ¼ 200.For all selections of K, the overlapping number is significant under Fisher's exact test, showing that wellpredicted genes and HVGs are significant, not independent.
Therefore, utilizing well-predicted genes directly can recover more than 50% of HVGs.The results of Fisher's exact tests indicate that well-predicted genes and HVGs, provided by all combinations of predictive models and HVG detection methods, except those involving the GS model, are not independent, providing evidence for the effect of chromatin accessibility on scEV control.

Well-predicted genes reveal cell type-specific functions
Given that the well-predicted genes significantly overlapped with the detected HVGs, we investigated whether these wellpredicted genes are associated with cell type-specific functions.Among the seven predictive models, we selected Lasso and MNN as representative global methods and LocReg as a representative local method.We selected 200 well-predicted genes predicted by the Lasso, MNN, and LocReg models and performed gene enrichment analysis.The lists of 200 wellpredicted genes from the three models are also provided as Supplementary Materials.Two enrichment databases, Reactome and Gene Ontology (GO), were used in the analysis.
The top 10 pathways associated with genes whose expression was most significantly altered for each database are Evidence of epigenetic regulation of single-cell expression variability depicted in Fig. 6.Since pathways involving broadly defined molecular functions and biological processes tend to encompass a large number of genes, we decided to exclude pathways that contain more than 800 genes in the gene enrichment analysis.Many of these pathways are related to the functions of HSPCs.For example, Hemostasis is ranked first in both MNN and Lasso and third in LocReg from Reactome.Hemostasis is the important function of HSPCs because it plays a critical role in maintaining the balance between bleeding and clotting, which is essential for overall vascular health (Nguyen et al. 2018).HSPCs are responsible for replenishing blood cell populations, including platelets, which aid in hemostasis (Ouseph et al. 2015).Identification of the hemostasis gene set in HSPCs helps to reveal the regulatory mechanisms that govern the delicate balance of coagulation and anticoagulation processes within the hematopoietic system and to understand the complex molecular pathways that govern the fate and function of HSPCs.Neutrophil degranulation, highly ranked by all the methods from Reactome, is also essential due to its role in the immune response and hematopoiesis.HSPCs give rise to various blood cell lineages, including neutrophils (Iwasaki and Akashi 2007).Neutrophils degranulation is a fundamental process in which these immune cells release granules containing antimicrobial proteins, enzymes, and reactive oxygen species (Malech et al. 2014).Identification of genes regulating neutrophil degranulation in HSPCs provides insights into the complex mechanisms governing immune cell development and function in the host defense system.From GO, leukocyte migration ranked the first places in both Lasso and MNN models.This result is consistent with existing studies on the relationship between HSPCs and leukocyte migration.For instance, the leukocyte migration pathway involving endothelial endoglin is crucial for the homing and migration of HSPCs (Rossi et al. 2013).

Sensitivity analysis
To demonstrate the robustness of our analysis results, we assessed whether the prediction performance in this study was influenced by the mean expression level of the genes.For each of the 200 HVGs detected by Seurat, we selected the non-HVG with the most similar mean expression level for comparison.Figure 7 displays the scatterplot of the average Pearson correlation coefficient between each HVG and its corresponding non-HVG.The plot indicates that HVGs generally exhibit significantly better prediction performance than their non-HVG counterparts.Therefore, our findings were not influenced by the mean expression level.

Evaluation on more datasets
Besides the analysis of HSPCs described above, we also applied the proposed pipeline to two additional datasets: 1) The multiomics data of HSPCs from another donor (Donor #32606) at Day 2 from the Kaggle competition of Open Problems in Single-Cell Analysis (Burkhardt et al. 2022).Evaluating the pipeline on this dataset can  ¼ 200, 500, 1000, 2000).Fisher's exact test was used to assess the independence between well-predicted genes and HVGs for each method pair.All pairs demonstrated significant nonindependence, with a P-value < 0.001, except for (GS, Seurat) and (GS, scGEAToolbox).
be considered a replication experiment on the same cell type, which demonstrates that our findings are not coincidental and are not specific to a particular dataset.
After quality control and preprocessing, the data include 7071 cells with 207 844 peaks and 9711 genes.
2) The multiomics data of neurons that comes from the following experiment: Cell nuclei were isolated from the midbrain PAG, flash-frozen, and cut into small pieces.
Live cell nuclei were sorted in the BioRad Cell Sorter S3e.The processed nuclei were used to generate GEMs with the 10× Chromium controller and purified, allowing simultaneous construction of ATAC and GEX libraries.Quality control was performed using the Agilent Bioanalyzer 2100, and the libraries were sequenced on the NovaSeq 6000 platform.We use this dataset to show that our results can be found in other types of Evidence of epigenetic regulation of single-cell expression variability cells.After quality control and preprocessing, the data include 3784 cells with 26 330 peaks and 3003 genes.
Due to the page limit and for the sake of brevity, the results for both datasets are included in Figs S1-S12 in the Supplementary Material.For both datasets, the scATAC data were found to better predict the scRNA-seq data of HVGs than non-HVGs.
There is a significant big overlap between well-predicted genes and HVGs, which is consistent with the findings in the article.Furthermore, the results from the multiomics datasets of HSPCs from Donors #31800 and #32606 are very similar.For example, the top 10 gene pathways detected from the 200 wellpredicted genes using the Lasso and MNN models from the Reactome database are very similar for both donors, indicating good reliability of our conclusions.

Discussion
In this study, we presented a new testing pipeline to study the relationship between scATAC-seq data and scEV.We specifically focused on homogeneous cells and found that genes whose expression levels could be accurately predicted by scATAC-seq were more likely to be HVGs.These findings support the notion of epigenetic regulation of scEV.Additionally, our results suggest that chromatin accessibility may play a role in gene regulation by influencing the cell-to-cell variability in gene expression.This provides valuable insights for studying gene function through analyzing scATAC-seq data.We also applied our testing pipeline to several other single-cell multiomics datasets with paired scATAC-seq and scRNA-seq data.
The additional datasets we analyzed reinforced the notion that scATAC-seq possesses superior predictive power for HVGs compared to non-HVGs across diverse cellular contexts.Moreover, genes demonstrating high prediction accuracy consistently mapped to cell type-specific functions.In summary, the emergence of powerful single-cell techniques has paved the way for a deeper understanding of the mechanisms driving scEV.Our findings offer compelling evidence that cell-to-cell variations in chromatin accessibility are a major driver of gene expression noise.This not only confirms the epigenetic regulation of noise but also opens exciting avenues for exploring gene regulation mechanisms at the single-cell level.For each of K ¼ 200 HVGs by Seurat, the non-HVG with the closest mean expression level is used as a compared gene with mean expression level controlled.The scatterplot of the average Pearson correlation of each HVG and its compared non-HVG shows that HVG has much better prediction performance than its compared non-HVG.

Figure 1 .
Figure 1.Overall pipeline for testing the relationship between chromatin accessibility and scEV.

Figure 3 .
Figure 3. Results of two HVG detection methods.(a) The 2D scatterplot of genes with log Mean expression ð Þ as the x-axis and log CV 2 ð Þ as the y-axis.Top K ¼ 200 genes with relatively high logðCV 2 Þ corresponding to log Mean expression ð Þ are identified as HVGs by Vst method in Seurat and are highlighted.(b) Splinefit method in scGEAToolbox draws the 3D scatterplot of genes in log Mean expression ð Þ; log CV ð Þ; Dropout rate À � and fits a 3D curve.The K ¼ 200 genes farthest from the curve are marked as the HVGs and are highlighted.(c) The scatterplot of the scEV levels of genes for two HVG detection methods.The Pearson correlation is 0.76 with P-value <0.001.179 genes simultaneously denoted as HVGs by both methods are highlighted.(d) The number and ratio of overlapped genes between the two methods when selecting different numbers of HVGs (K ¼ 200, 500, 1000, 2000).

Figure 4 .
Figure 4. Results from seven prediction methods were applied to the paired scATAC-seq and scRNA-seq data.All the results represent the average results of five training-test sample splits.(a) Boxplot depicting Pearson correlations between true and estimated scRNA-seq levels for all genes across seven predictive models.The upper outliers in each boxplot signify genes that were well predicted with high Pearson correlations.In each model, the top 200 genes with the highest Pearson correlation coefficients are marked with a cross.(b) Scatter plots illustrate Pearson correlation for each pair of the seven models, with each point representing a gene.

Figure 5 .
Figure 5. Prediction performance for HVGs across diverse models.(a) Identification of 200 HVGs using each HVG detection method, followed by boxplots illustrating the Pearson correlation for HVGs and non-HVGs within each predictive model.The Wilcoxon test was used to compare the medians of the two gene sets.(b) Scatterplots presenting genes with the x-axis indicating Pearson correlation and the y-axis representing scEV levels.Each column corresponds to a predictive model, and each row corresponds to an HVG detection method.Spearman correlation was calculated and tested for each scatterplot.(c) The percentage of the top 200 well-predicted genes that were denoted as HVGs as the number of detected HVGs increased.(d)Evaluation of the overlap between the K well-predicted genes and the K detected HVGs for each pair of predictive models and HVG detection methods(K ¼ 200, 500, 1000, 2000).Fisher's exact test was used to assess the independence between well-predicted genes and HVGs for each method pair.All pairs demonstrated significant nonindependence, with a P-value < 0.001, except for (GS, Seurat) and (GS, scGEAToolbox).

Figure 6 .
Figure 6.Lists of top 10 significantly enriched pathways in top 200 well-predicted genes according to the Lasso, MNN, and LocReg models.Pathways that contain more than 800 genes associated with general functions are not shown.

Figure 7 .
Figure7.Prediction performance of HVGs and non-HVGs with controlled mean expression level.For each of K ¼ 200 HVGs by Seurat, the non-HVG with the closest mean expression level is used as a compared gene with mean expression level controlled.The scatterplot of the average Pearson correlation of each HVG and its compared non-HVG shows that HVG has much better prediction performance than its compared non-HVG.