Single cell expression variability in highly homogeneous populations of human cells is widespread, heritable, and implies cell function

Recent technological development has made single cell assays, e.g., single-cell RNA sequencing (scRNA-seq), reach unprecedented scale and resolution in revealing cellular heterogeneity. There is an increasing appreciation of the extent of intrinsic cell-to-cell variability in gene expression or single cell expression variability (scEV). However, whether this variability per se is functionally important and its implications to multi-cellular organisms remain unclear. Here, we analyzed multiple scRNA-seq data sets generated from lymphoblastoid cell lines (LCLs), lung airway epithelial cells (LAECs), and dermal fibroblasts (DFs). For each of the three cell types, we estimated scEV in homogeneous populations of cells and identified 465, 466, and 291 highly variable genes (HVGs), respectively. These HVGs are enriched with specific functions precisely relevant to the cell types, from which the scRNA-seq data used to identify HVGs are generated— e.g., HVGs identified with lymphoblastoid cells are enriched with those involved in the cytokine signaling pathways, LAECs collagen formation, and DFs keratinization. A number of HVGs are co-regulated by common transcription factors, and, thus, co-expressed, showing coordinated transcriptional outbursts across cells. We also found that scEV is a heritable trait, partially determined by genetic makeups of the donors of the cells. Furthermore, across genes, especially immune genes, the level of scEV is positively correlated with that of between-individual variability in gene expression, suggesting a potential link between the two variabilities measured at different organizational levels. Taken together, our results support the “variation is function” hypothesis, emphasizing the idea that scEV is required for higher-level system function. Assessing and characterizing scEV in relevant cell types may deepen our understating of pathological cell processes.


Introduction
Cells are fundamental units of cellular function. While every cell is unique, cells in multi-cellular organisms can be organized into groups, or cell types, based on shared features that are quantifiable. A multicellular organism is usually composed of cells of many different types-each is a distinct functional entity differing from the other. Within the same cell type, cells are nearly identical and are considered to carry the same function.
The recent development of single cell RNA sequencing (scRNA-seq) technologies has brought the increasingly high-resolution measurements of gene expression in single cells (Zhang et al. 2019). This power has been widely adopted to refine the categories of known cell types, identify rare or novel cell types, and analyze complex tissues systematically and reproducibly (Buettner et al. 2015). In addition to the identification of novel cell types, the power of scRNA-seq has also been harnessed to identify novel cellular states and explore the relationships between discovered and known cell states among the same type of cells (Trapnell 2015).
Cells of the same type, which are often considered identical, may show marked intrinsic cell-tocell variability in gene expression or single cell expression variability (scEV), even under the same environmental conditions (Ko 1992;Fiering et al. 2000;Raj and van Oudenaarden 2008).
Dueck and colleagues (Dueck et al. 2016) put forward the so-called "variation is function" hypothesis, saying that scEV per se might be crucial for population-level function. They used the term "single cell variation or variability" to refer to diversity within an ensemble that has been previously defined as being generally homogeneous, rather than diversity of cell types that are clearly distinct and already recognized. The main focus of their question is to ask how the individual cells with different gene expression levels may interact to causally generate higherlevel function. If true, it means that the intrinsic cell-to-cell variability is an indicator of a diversity of hidden functional capacities that facilitate collective behavior of cells. This collective behavior is essential for the function and normal development of cells and tissues (Raj et al. 2010;Tay et al. 2010). The loss of this collective cellular behavior may result in disease. Thus, investigation of the intrinsic cell-to-cell variability may contribute to the understanding of pathological processes associated with disease development.
It is worth noting that the intrinsic cell-to-cell variability needs to be measured within a highly homogenous population of cells. To work on the cell-to-cell variability, many microenvironmental perturbations and stochastic factors at the cellular level, which are known to change the scEV, have to be controlled. These factors may include local cell density, cell size, shape and rate of proliferation, cell cycle, and so on (Snijder et al. 2009;McDavid et al. 2014;Kernfeld et al. 2018;Miragaia et al. 2018;Mitchell et al. 2018).
Exponential scaling of scRNA-seq made it feasible to study scEV across thousands of cells , and quantify scEV based on measures of statistical dispersion such as the coefficient of variation (CV) (Geiler-Samerotte et al. 2013;Mar 2019). The sheer number of cells sequenced in a "typical" droplet-based scRNA-seq experiment allows us to filter out for a sizable number of highly homogeneous cells, based on the similarity between their global transcriptional profiles. With these selected cells, we are able to test the "variation is function" hypothesis systematically. Furthermore, using established statistical methods, we are able to control for many sources of technical variation that may confound the measurement of scEV to obtain an unbiased estimate. For instance, single-molecule capture efficiency, 3' end bias due to single-cell RNA library preparation protocol, and low expression of genes are examples of known sources of technical variation (Marinov et al. 2014), which should be controlled for using statistical means.
The characterization of the impact of scEV on cell function requires the understanding of which genes show greater or less cell-to-cell variability in their expression. These feature genes may carry valuable information that can facilitate the elucidation of underlying regulatory networks (Li and You 2013). Once these genes are identified, a follow-up question is whether they are tissue-or cell type-specific-i.e., whether the same genes will be identified for different tissues or cell types. Our working hypothesis is in line with the "variation is function" hypothesis, that is, different tissues or cell types have different sets of highly variable genes (HVGs), and these HVGs should be enriched with functions that reflect the biological functions of respective tissues or the cell types. To test this, we analyzed three scRNA-seq data sets generated for three different cell types. Each data set contains thousands of cells. For each cell type, we selected a highly homogenous population of cells, with the help of a newly developed dimensionality reduction method, called potential of heat-diffusion for affinity-based trajectory embedding (PHATE) . We estimated scEV among selected cells for each of these cell types and further systematically characterized functions of identified HVGs. We show that HVGs are highly specific to cell types, i.e., different cell types have different sets of HVGs; functions of HVGs precisely mirror the functions of the corresponding cell types. We explored both the level of scEV and potential mechanism behind the variability across cells, allowing us to understand a previously unexplored aspect of gene regulation in humans.

Single-cell RNA sequencing and selection of highly homogenous cells
In this study, we experimented with three different human cell types, namely, lymphoblastoid cell line (LCL), lung airway epithelial cell (LAEC), and dermal fibroblast (DF). We estimated single cell expression variability (scEV) individually. To obtain the scRNA-seq data for LCL, we cultured GM12878, an LCL strain widely used in genomic research, prepared cells using a 10X Genomics Chromium Controller, and sequenced a total of 7,045 cells (Osorio et al. 2019). This data has been deposited in the NCBI GEO database (Accession number GSE126321, Data Availability). For the other two cell types, LAEC and DF, we obtained the scRNA-seq data for 3,863 and 2,553 cells from the studies of (Habiel et al. 2018) and (Hagai et al. 2018), respectively (Materials and Methods). All scRNA-seq data sets of the three cell types were produced using 10X Genomics droplet-based solution and made use of unique molecular identifiers (UMIs) (Kivioja et al. 2011).
For each cell type, we employed a data analysis procedure, a filter pipeline on scRNA-seq data, to select a subset of highly homogenous cells (Materials and Methods). These selected cells are a representative population of each the cell type. The main steps of the filter pipeline are depicted in Supplementary Fig. S1. Briefly, we first excluded mitochondrial DNA-encoded genes from the analysis. We then excluded cells in the S-or G2/M phases and only retained G1phase cells. We also excluded cells with library size smaller than 55 percentile or greater than 99 percentile. Finally, we used PHATE to produce the embedding plot of remaining cells to inspect between-cell structure driven by heterogeneity in gene expression. PHATE is a visualization method that captures both local and global nonlinear structure in data by an informationgeometry distance between data points . As seen from the PHATE projection ( Fig. 1A), several arms of cells show the complex structure of the cell-to-cell relationship. Based on the observation, we manually picked one "core" cell in the middle of cell cloud where the majority of cells are clustered. The core cell and 999 nearest cells around the core cell are then selected to form the final cell population, which is used for subsequent data analyses.
To examine the homogeneity of selected cells, we used t-distributed stochastic neighbor embedding (t-SNE)(van der Maaten and Hinton 2008) to position all 1,000 selected cells in the two-dimensional t-SNE space. The t-SNE is another visualization method for revealing the cluster structure of separations in single-cell data, emphasizing local neighborhood structure within data. When running t-SNE, we experimented with a series of perplexity parameter values to the selected cells to produce multiple plots for the same cell population. t-SNE is known to be sensitive to perplexity, i.e., when different perplexity values are given. That is to say, t-SNE tends to produce different cell clustering plots, depending on the parameter value. Nevertheless, no structure is observed in selected cells in any of these t-SNE embedding plots (Fig. 1B). The same results were obtained for the other two cell types. Taken together, these results confirm that cells selected with our filter pipeline are highly homogenous populations of representative cells.

Identification of highly variable genes
Highly variable genes (HVGs) are expressed variably, showing a significantly high level of variability, across homogeneous cells of the same type. For each cell type, we used the method of (Brennecke et al. 2013) to identify HVGs from scRNA-seq data of the homogeneous population of selected cells. In this method, the relationship between the squared coefficient of variation (CV 2 ) of genes and their average expression (μ) is considered. The relationship between logtransformed CV 2 and log-transformed μ is fitted with a Generalized Linear Model (GLM), and the expected CV 2 for a given μ is calculated with the fitted curve. The log-transformed ratio between observed CV 2 and expected CV 2 [=log(observed CV 2 )-log(expected CV 2 ], called "residual variability", is used as the measurement of scEV. Since the expected CV 2 captures the variability originated from technical noise, the residual variability is considered to be an unbiased measure of biological variability. In total, we identified 465, 466, and 291 HVGs for LCL, LAEC, and DF, respectively (Supplementary Tables S1-3), after controlling for false discovery rate (FDR) at 0.01 (Materials and Methods). To visualize expression variability of genes, we plot CV 2 against μ, both on the logarithmic scale, for LCL ( Fig. 2A). Each dot represents a gene; all genes together give a characteristic cloud showing the μ and CV 2 of gene expression. Genes above the GLM fitting curve, e.g., IGKC, CCL3, LTB, and FTL, are more variable than expectation, whereas genes below the curve, e.g., TMEM9B and RPL17, are less variable (Fig. 2B).

Cell-type origin determines the function of highly variable genes
To assess the biological functions of HVGs in different cell types, we performed the enrichment analyses (Materials and Methods). The categories of enriched gene ontology (GO) terms and pathways of the three cell types are largely distinct and reflect the respective cell functions of each cell type ( Table 1). LCL HVGs include, for example, CCL22 and IFI27. Collectively, more than expected number of LCL HVGs (FDR<0.01) are involved in cytokine-or interferonsignaling pathways, or, more generally, innate immune system. LAEC HVGs, including exemplary genes, COL1A1, MMP1, and IL17C, are more likely to be involved in the processes of collagen formation and extracellular matrix organization (FDR<0.01 for both). DF HVGs, including KRT14, ACAN, and FLG, are more likely to be involved in keratinization, regulation of cell proliferation, as well as extracellular structure organization (FDR<0.01 for all). DF HVGs also include SFRP2, DPP4, and LSP1; all are marker genes defining major fibroblast subpopulations in human skin (Tabib et al. 2018). There is overlap between enriched functions between the three cell types. For example, cytokine signaling pathway is enriched for both LCL and LAEC, and extracellular structure organization is enriched for both LAEC and DF. Thus

Some HVGs are co-expressed and regulated by the same transcription factors (FTs)
Next, we set out to test whether HVGs are co-expressed and tend to form co-expression networks (Mantsoki et al. 2016). We first imputed the expression matrix and then constructed the co-expressed network using the top 50 HVGs for each cell type. For LCLs, the network contains two main modules centered on NFKBIA and IGHG1, respectively (Fig. 3A).
NFKBIA encodes NF-κB inhibitor that interacts with REL dimers to inhibit NF-κB/Rel complexes (Courtois et al. 2003;Lopez-Granados et al. 2008). For LAECs, two modules are centered on IL23A/TNFAIP6 and COL1A1 (Supplementary Fig. S2A); for DF, KRTAP2-3 and IGFBP7 (Supplementary Fig. S2B). Thus, functions of "hub" genes in HVG coexpression networks are closely relevant to the function of corresponding cell type. These results are another line of evidence that scEV implies cell function.
The transcription of multiple HVGs may be involved in the same underlying regulatory activities, giving rise to the co-expression network as we observed. Thus, a central question is whether scEV in different HVGs is driven by fluctuating activities of few common TFs. To address this question, we searched for upstream regulators of the HVGs defined by our analysis (Materials and Methods). We identified four significant enriched TF binding motifs upstream of HVGs of LCL and five for LAEC (Supplementary Table S4). The enriched motifs of LCL HVGs include that of NF-κB subunit gene, RELA, and that of BACH2 (Fig. 3B). The enriched motifs of LAEC HVGs include TATA box and that of CEBPB (Fig. 3B). No significant motif was identified for DF.

Single cell expression variability is more similar between genetically related samples than unrelated samples
Given that scEV is important for cell function, we thought the level of scEV may be genetically determined. If true, then we expect that the similarity in scEV between cell lines derived from two genetically related individuals is higher than that between cell lines derived from two unrelated individuals. To test this, we performed the scRNA-seq with another LCL-GM18502, derived from a donor of African ancestry, unrelated to GM12878 that is derived from a donor of European ancestry.
LCL unrelated to GM12878. The LCL is GM18502, which was derived from a donor of African ancestry. We processed GM18502 along with GM12878 in the same batch (Materials and Methods), along with a technical replicate sample made from a 1:1 mixture of the two (Osorio et al. 2019). For comparison, we also obtained scRNA-seq data from the study of (Zhang et al. 2019) for another LCL GM12981. The donor of GM12891 is the father of GM12878. We estimated the scEV for these two additional scRNA-seq data sets: one from GM18502 (unrelated) and the other from GM12891 (father), using the same procedure applied to GM12878. To measure the correlation between scEV of different samples, we used both the Spearman correlation coefficient (SCC) and Pearson correlation coefficient (PCC). Across genes, the correlation between residual variability estimated from GM12878 and that from GM12891 (daughter-father) is ρ=0.92 (SCC) or r=0.94 (PCC) (Supplementary Fig. S3). That is to say, 85% (=r 2 ) of the variance in scEV across genes of a daughter can be explained by that of a father.
In contrast, only 77% of the variance can be explained by that of an unrelated individual-the correlation between GM12878 and GM18502 is ρ=0.87 (SCC) or r=0.89 (PCC) (Supplementary Fig. S3).
Note that, the similarity between these two related samples, GM12878 and GM12891 (daughter and father), might have been underestimated. This is due to the gender difference between the two samples was not taken into account. Furthermore, the two scRNA-seq data sets of them were produced in different batches: GM12878 by us in this study and GM12891 by (Zhang et al. 2019). The batch effect could also influence the daughter-father correlation downward.
Nevertheless, we still observed a stronger correlation between the two related samples compared to that between the two unrelated samples. These results suggest that scEV is likely to be highly heritable.
When the correlation tests were performed across cell types, much weaker correlations were observed: the correlation between LCL and LAEC is 0.60 (SCC) or 0.65 (PCC), and that between LCL and DF is 0.57 (SCC) or 0.70 (PCC).

Single cell expression variability is positively correlated with between-individual expression variability
Next, we examined the relationship between scEV and inter-individual expression variability.
We distinguish between the two different types of variabilities at different organizational levels.
Specifically, the former is cell-to-cell variability in a population of cells, and the latter is interindividual variability at the human population level. We again focused on LCLs, for which population-scale gene expression data are available from the Geuvadis RNA-seq project of 1,000 Genomes samples. The bulk RNA-seq data was downloaded as normalized expression matrix of FPKM values. We retained data for all LCLs of European ancestry (CEU) (Lappalainen et al. 2013). With the residual variability estimated from scRNA-seq of GM12878 and that estimated from the CEU population, we tested the correlation between the two estimates across genes.
When the test was conducted with all genes (n=8,424), we obtained a significant but weak positive correlation (SCC, ρ = 0.19, p = 1.2×10 -9 ). We wondered whether this positive correlation was driven by subsets of genes. To identify these gene sets, we conducted the correlation tests for the GO-defined gene sets one by one. Across all gene sets tested, the average SCC for gene sets defined by GO biological process (BP) and molecular function (MF) terms are on average ρ=0.28 and ρ=0.23, respectively. Yet, strikingly, we found a small number of gene sets that produced SCC much higher than averages. The functions of these gene sets include B-cell activation involved in immune response (GO:0002322) and cytokine receptor activity (GO:0004896) (Fig. 3C), as well as leukocyte chemotaxis (GO: 0030595), cellular response to drug (GO: 0035690), and phospholipase activity (GO: 0004620)(for more examples, see Supplementary Fig. S4). Thus, for these gene sets, scEV may contribute to the establishment of between-individual expression variability.

Discussion
Single cell expression variability (or scEV) is also called gene expression noise, implying the stochastic nature of transcriptional activities in cells (Kaern et al. 2005;Raser and O'Shea 2005). Interrogating scEV data has provided insights into gene regulatory architecture (Mar et al. 2011;Chalancon et al. 2012), and manipulating the magnitude of scEV, through using noise enhancers or scEV-modulating chemicals, has been an approach to achieve drug synergies (Dar et al. 2014). Understanding the origin and functional implications of scEV has long been appreciated (Ko 1992;Fiering et al. 2000;Raj and van Oudenaarden 2008;Ecker et al. 2018).
In this study, we focused on scEV in human cells. More specifically, we wanted to characterize different expression levels of genes within a highly homogeneous population of genetically identical (or nearly isogenic) cells under the same environmental conditions. To this end, we set out to quantify scEV in highly homogeneous populations of a sizable number of viable cells.
Working with cells of the same type, for example, LCL, we start by preprocessing data from thousands of cells. We found that, even though we have firstly preprocessed the data and retained only cells with similar library size and in the same cell cycle phase, it is not enough.
There are still marked substructures, shown as branches of cells, in the embedding cloud of cells ( Fig. 1A), as revealed by the new embedding algorithm ). Retrospectively, we applied trajectory analysis and found out that some of these branches are caused by cells with elevated expression of immunoglobulin genes (Supplementary Fig. S5).  (Neitzel 1986). B cells genetically diversity by rearranging the immunoglobulin locus to produce diverse antibody repertories that allow the immune system to recognize foreign molecules and initiate differential immune responses (Tonegawa 1983;Papavasiliou 1997;Mitchell et al. 2018). LCLs are produced through the rapid proliferation of few EBV-driven B cells from the blood cell population (Ryan et al. 2006). Thus, scRNA-seq data sets of LCLs offer a "snapshot" of highly diverse immunoglobulin rearrangement profiles in a much larger population of polyclonal B cells established in donors of these cell lines. Therefore, it is not unexpected to see quite a few immunoglobulin genes in the top list of HVGs identified in LCLs. In addition to these immunoglobulin genes, a number of other immune genes, especially C-C motif chemokine ligands (CCLs) and C-C motif chemokine receptors (CCRs), are in the list of HVGs of LCL. These genes play important roles in allowing the coordination of the activity of individual cells through intercellular communication, essential for the immune system maintains robustness (Altan-Bonnet and Mukherjee 2019). The HVG coexpression network analysis revealed the key role of NF-κB pathway in facilitating communications between immune cells (Tay et al. 2010;Mitchell et al. 2018).
Second, LAEC is a key cell type playing important roles in lung tissue remodeling, and pulmonary inflammatory and immune responses (Hiemstra et al. 2015). The airway epithelium, playing a critical role in conducting air to and from the alveoli, is a dynamic tissue that normally undergoes slow but constant turnover. In the event of mild to moderate injury, the airway epithelium responds vigorously to re-establish an epithelial sheet with normal structure and function. HVGs identified in LAECs, which are enriched with genes involved in collagen formation, regulation of cell proliferation, and extracellular matrix organization, accurately elucidate this aspect of functions of the airway epithelium. LAECs are also central to the defense of the lung against pathogens and particulates that are inhaled from the environment. This aspect of functions is also reflected in the enriched functionality of LAEC HVGs.
Third, DFs are responsible for generating connective tissue and play a critical role in normal wound healing (Tracy et al. 2016). DFs are also commonly used in immunological studies (Zhao et al. 2012;Ivashkiv and Donlin 2014;Hagai et al. 2018). HVGs identified in DFs again accurately reflect these primary aspects of DF functions, including extracellular matrix organization, keratinization, and regulation of signaling receptor activity. DF HVGs do have several categories of enriched functions overlap with those of LAEC, which is not unexpected, given that DF and LAEC have functional overlaps (Sacco et al. 2004).
Our results support the "variation is function" hypothesis, proposed by (Dueck et al. 2016), suggesting that the aggregate cellular function may depend on scEV. Dueck and colleagues also laid down several scenarios, including bet hedging, response distribution, fate plasticity, and so on, in which the establishment of the relationship between scEV and cell function could be attained. Our analytical framework using scRNA-seq data may be utilized in appropriate systems to test the plausibility of these different scenarios. If scEV is an accountable and credible surrogate of cell function, as we have shown in this study, then quantifying and characterizing scEV may become a first-line approach for understanding the function of cell types and tissues.
Through sequencing LCLs from three donors, we were able to compare the overall scEV across genes between related and unrelated LCL pairs. Our results suggest that scEV is a heritable trait and its relative magnitude across genes is genetically determined. In theory, the heritability of scEV can be estimated with more LCL samples from different donors of different levels of relatedness. A pairwise similarity matrix between LCLs in scEV can be regressed with the genetic relationship matrix between LCL donors, using a Haseman-Elston regression-type analysis (Haseman and Elston 1972), to quantify the heritability. The normalization between data from different LCLs (i.e., batch effect correction) can be achieved using the method of mutual nearest neighbors (Haghverdi et al. 2018), canonical correlation analysis (Butler et al. 2018) or manifold alignment (Amodio and Krishnaswamy 2018). With a population-scale scRNA-seq data set in the future, we will be able to identify mutations associated with increased scEV. At this moment, we still do not have such a resource of scRNA-seq data for samples from a sizable number of human individuals. Nevertheless, we have shown that, across certain sets of genes, scEV is positively correlated with population-level expression variability. This correlation provides a new possibility to design single cell assays with one sample to approximate the population variability of certain genes' expression. This new method may be used to study disease-causing expression dysregulation because it has been a number of cases that increased population-level expression variability has been linked with diseases (Ho et al. 2008;Li et al. 2010;Ecker et al. 2015;Guan et al. 2016;Huang et al. 2018).
In a visionary perspective article, Pelkmans (2012) pointed out that "Embracing this cell-to-cell variability as a fact in our scientific understanding requires a paradigm shift, but it will be necessary." Indeed, scRNA-seq technologies have brought revolution to gene expression analysis. The technical development gives us a new approach beyond the capacity of traditional methods that rely on experimental measurements of population-average behavior of cells to conceive regulatory network models and signal processing pathways. More importantly, for traditional methods, by averaging information across many cells, differences among cells, which may be important in explaining mechanisms, can be lost. Given the large degree of cell-to-cell expression variability even between genetically homogeneous cells, conclusions reached as such with traditional average-based methods may be of low-resolution, incomplete, and sometimes misleading (Tay et al. 2010;Bendall and Nolan 2012;Li and You 2013;Trapnell 2015). We conclude that careful assessment and characterization of cell-to-cell expression variability in relevant cell types will facilitate the study of normal cell functions, as well as pathological cell processes. Single cell variability and the information it contains seem to be the key to a deepened understanding of cells and their functions.

Non-LCL scRNA-seq data sets
The scRNA-seq data for lung airway epithelial cells (LAECs) was downloaded from the GEO database using accession number GSE115982. The original data was generated in the study of (Habiel et al. 2018) for CCR10 -and CCR10 + LAECs. We used the data generated from the CCR10 -cells with the sample identifier GSM3204305. The scRNA-seq data for primary dermal fibroblasts (DFs) was generated in the study of (Hagai et al. 2018). We downloaded the data for unstimulated DFs from the ArrayExpress database using accession number E-MTAB-5988. We also downloaded scRNA-seq data (GEO accession number GSE111912) generated in the study of (Zhang et al. 2019) for LCL sample GM12891. All of these data sets were produced using the 10X Genomics scRNA-seq solutions.

Selection of highly homogeneous populations of cells
We used a supervised data analysis method to select highly homogeneous cells based on the scRNA-seq expression profile of each cell. The procedure is summarized in a flowchart ( Supplementary Fig. S1). The main steps are as follows. We used Seurat (v0.2) (Butler et al. 2018) to assign each cell into a cell cycle phase and excluded cells that are not considered to be in G1-phase. We removed genes encoded in the mitochondrial genome from the analysis. We then selected and retained cells with library size between 50 and 95 percentiles. We used PHATE  to generate embedding plot of all remaining cells and inspected the distributions of cells in the three-dimensional plot, and manually picked one "core" cell. Finally, an additional 999 cells that are closest to the core cell, according to the Euclidean distances between cells, were selected to form the final 1,000-cell population. This selection procedure was applied to each of the three cell types independently.

Identification of HVGs
Highly variable genes (HVG) is based on the assumption that genes with high variance relative to their mean expression are due to biological effects rather than just technical noise. We used the method proposed in (Brennecke et al. 2013), which is implemented in function sc_hvg of the scGEApp package (https://github.com/jamesjcai/scGEApp) (Cai 2019). This method starts by adjusting the library size and assumes that the observed mean expression (̂) and the observed CV 2 (̂) of gene among cells have the following relationship: where is the number of cells. The values of 0 and 1 are estimated by generalized linear regression (GLM). The residual term ̂/( ̂1/̂ +̂0) for each gene is used to test if the observed CV 2 is significantly larger than the expected CV 2 via a chi-squared test. Multiple testing p-value adjustment was performed by controlling the false discovery rate (FDR) (Benjamini and Hochberg 1995).

Function enrichment analyses
To identify overrepresented biological functions of HVGs in different cell types, we performed the GO enrichment analysis using Enrichr (Chen et al. 2013;Kuleshov et al. 2016) and GOrilla (Eden et al. 2009). Enrichr was conducted for HVGs (FDR<0.01) against the rest of the expressed genes with respect to pathways collected in the Reactome pathway knowledgebase (Fabregat et al. 2018). GOrilla was performed with the list of genes sorted in descending order of their residual variability.

Analyses of Co-expression network and regulatory regions of LCL HVGs
MAGIC ) was used to impute the expression matrix. The co-expression networks were constructed using SBEToolbox (Konganti et al. 2013). The motif analysis of the regulatory regions associated with the HVGs was performed using the GREAT (McLean et al. 2010). Genomic coordinates for the HVG genes from the Human Reference Genome (hg19) were downloaded from the Ensembl Biomart (Smedley et al. 2009) and converted into bed format using an in-house script. Identified motifs were searched against the JASPAR database (Khan et al. 2018) to match the binding sites of corresponding TFs.

Data availability
The data sets produced in this study and computer code are available:    More gene sets defined by GO terms for biological process (BP) and molecular function (MF) are given in Supplementary Fig. S4.