Single-Cell Expression Variability Implies Cell Function

As single-cell RNA sequencing (scRNA-seq) data becomes widely available, cell-to-cell variability in gene expression, or single-cell expression variability (scEV), has been increasingly appreciated. However, it remains unclear whether this variability is functionally important and, if so, what are its implications for multi-cellular organisms. Here, we analyzed multiple scRNA-seq data sets from lymphoblastoid cell lines (LCLs), lung airway epithelial cells (LAECs), and dermal fibroblasts (DFs) and, for each cell type, selected a group of homogenous cells with highly similar expression profiles. We estimated the scEV levels for genes after correcting the mean-variance dependency in that data and identified 465, 466, and 364 highly variable genes (HVGs) in LCLs, LAECs, and DFs, respectively. Functions of these HVGs were found to be enriched with those biological processes precisely relevant to the corresponding cell type’s function, from which the scRNA-seq data used to identify HVGs were generated—e.g., cytokine signaling pathways were enriched in HVGs identified in LCLs, collagen formation in LAECs, and keratinization in DFs. We repeated the same analysis with scRNA-seq data from induced pluripotent stem cells (iPSCs) and identified only 79 HVGs with no statistically significant enriched functions; the overall scEV in iPSCs was of negligible magnitude. Our results support the “variation is function” hypothesis, arguing that scEV is required for cell type-specific, higher-level system function. Thus, quantifying and characterizing scEV are of importance for our understating of normal and pathological cellular processes.


Introduction
Cells are fundamental units of cellular function. Cells in multi-cellular organisms can be organized into groups, or cell types, based on shared features that are quantifiable. A multicellular organism is usually composed of cells of many different types-each is a distinct functional entity differing from the other. Within the same cell type, cells are nearly identical and are considered to carry the same cell types and further systematically characterized functions of identified HVGs. We show that HVGs are highly specific to cell types, i.e., different cell types have different sets of HVGs; functions of HVGs precisely mirror the biological processes of the corresponding cell types.

LCL Cell Culture and scRNA-seq Experiment
The lymphoblastoid cell line (LCL) GM12878 was purchased from the Coriell Institute for Medical Research. They were cultured in the RPMI-1640 medium supplied with 2 mM L-glutamine and 20% of non-inactivated fetal bovine serum, incubated at 37 • C under 5% CO 2 atmosphere. For maintenance, cells were subcultured every three days by adding fresh medium. For single-cell sequencing, each cell line was subcultured with 200,000 viable cells/mL. Cells were harvested for single-cell sample preparation and sequencing on day four (stationary phase) following the sample preparation demonstrated protocol and Single Cell 3' Reagent Kits v2 user guide provided by 10× Genomics. Briefly, cells were mixed well in each flask, and 1 mL of cell suspensions from each cell line were taken out. The cells were washed three times by centrifuging, suspending, and resuspending in 1× PBS with 0.04% BSA. Viable cells were then counted using an automated cell counter (Thermo Fisher Scientific, Carlsbad, CA, USA). Cells (~5000 per cell line) were then pelleted and resuspended in the nuclease-free water based on the cell suspension volume calculator table, followed by GEM (gel bead-in-emulsions) generation and barcoding, the post-GEM-RT cleanup, cDNA amplification, and library construction and sequencing. The experiments were conducted at the Texas A&M Institute for Genome Sciences and Society. The sequencing was conducted in the North Texas Genome Center facilities using a Novaseq 6000 sequencer (Illumina, San Diego, CA, USA). Raw reads for each cell were analyzed using Cell Ranger (v2.0.0, 10× Genomics, Pleasanton, CA, USA) and the outputs were aligned to the human reference genome (GRCh38) to obtain the counts [31].

Non-LCL scRNA-seq Data Sets
The scRNA-seq data for lung airway epithelial cells (LAECs) was downloaded from the GEO database using accession number GSE115982. The original data was generated in the study of [32] for CCR10 − and CCR10 + LAECs. We used the data generated from the CCR10 − cells with the sample identifier GSM3204305. The scRNA-seq data for primary dermal fibroblasts (DFs) was generated in the study of [33]. We downloaded the data for unstimulated DFs from the ArrayExpress database using accession number E-MTAB-5988. To support our findings, we processed additional samples and compared the results obtained for the same cell type. We selected the transcriptomic profile of fibroblasts (7052 and 6503 cells) from samples extracted from two different lung regions (GEO accessions: GSM2894834 and GSM2894835); and for the iPSC sample, we used another iPSC culture (9146 cells) generated in the study of [34], and downloaded from the ArrayExpress database using accession number E-MTAB-6268. All of these data sets were produced using 10× Genomics scRNA-seq solutions.

Selection of Highly Homogeneous Populations of Cells
We used a supervised data analysis method to select highly homogeneous cells based on the scRNA-seq expression profile of each cell. The procedure is summarized in a flowchart (Supplementary Figure S1). The main steps are as follows. We used Seurat (v2.2.0) [35] to assign each cell into a cell cycle phase and excluded cells that were not considered to be in G1-phase. We removed genes encoded in the mitochondrial genome from the analysis. We then selected and retained cells with a library size between 50 and 95 percentiles. We used PHATE [30] to generate a embedding plot of all remaining cells and inspected the distributions of cells in the three-dimensional plot and manually picked one "core" cell. Finally, an additional 999 cells that were closest to the core cell, according to the Euclidean distances between cells, were selected to form the final 1000-cell population. This selection procedure Cells 2020, 9, 14 4 of 18 was applied to each of the three cell types independently, as well as for the additional samples used to support our findings.

Identification of HVGs
Identification of highly variable genes (HVGs) was based on the assumption that high expression variability of these genes across cells relative to their mean expression is caused by biological effects rather than merely technical noise. We used the method proposed in [36], which is implemented in function sc_hvg of the scGEAToolbox package (https://github.com/jamesjcai/scGEAToolbox) [37]. This method starts by adjusting the library size and assumes that the observed mean expression (μ i ) and the observed CV 2 (ŵ i ) of gene i among cells have the following relationship: where m is the number of cells. The values of a 0 and a 1 are estimated by generalized linear regression (GLM). The residual termŵ i /(â 1 /μ i +â 0 ) for each gene is used to test if the observed CV 2 is significantly larger than the expected CV 2 via a chi-squared test. Multiple testing p-value adjustments were performed by controlling FDR [38].

Function Enrichment Analyses
To identify the overrepresented biological functions of HVGs in different cell types, we performed the GO enrichment analysis using Enrichr [39,40] and GOrilla [41]. Enrichr was conducted for HVGs (FDR <0.01) against the rest of the expressed genes with respect to pathways collected in the Reactome pathway knowledgebase [42]. GOrilla was performed with the list of genes sorted in descending order of their residual variability.

Analyses of co-Expression Network and Regulatory Regions of HVGs
MAGIC [43] was used to impute the expression matrix. The co-expression networks were constructed using 1-correlation as a distance measure, using SBEToolbox [44]. The motif analysis of the regulatory regions associated with the HVGs was performed using the GREAT [45]. Genomic coordinates for the HVG genes from the Human Reference Genome (hg19) were downloaded from the Ensembl Biomart [46] and converted into bed format using an in-house script. Identified motifs were searched against the JASPAR database [47] to match the binding sites of corresponding TFs.

Data Availability
The data sets used in this study and computer code are available. In this study, we experimented with the transcriptomic profiles of three different human cell types, namely, lymphoblastoid cell line (LCL), lung airway epithelial cell (LAEC), and dermal fibroblast (DF). We estimated single-cell expression variability (scEV) for each of these cell types, individually.
To obtain the scRNA-seq data for LCL, we cultured GM12878, an LCL strain widely used in genomic research, prepared cells using a 10× Genomics Chromium Controller, and sequenced a total of 7045 cells [31]. This data has been deposited in the NCBI Gene Expression Omnibus (GEO) database (accession number GSE126321). For the other two cell types, LAEC and DF, we obtained the scRNA-seq data for 3863 and 2553 cells from the studies of [32] and [33], respectively. In addition, we also processed two more samples for fibroblasts and one for iPSC cells (see Section 2.7. for data availability) to cross-validate our findings. All scRNA-seq data sets of the three cell types and the additional samples used were produced using 10× Genomics droplet-based solution and made use of unique molecular identifiers (UMIs) [48].
For each cell type, we employed a data analysis procedure, a filter pipeline on scRNA-seq data, to select highly similar populations of cells (see Section 2 for materials and methods). These selected cells are a representative population of each the cell type. The main steps of the filter pipeline are depicted in Supplementary Figure S1. Briefly, we first excluded mitochondrial DNA-encoded genes from the analysis. We then excluded cells in the S-or G2/M phases and only retained G1-phase cells.
We also excluded cells with library size <55 percentile or >99 percentile. Finally, we used PHATE to produce the low-dimensionality representation of the remaining cells to inspect between-cell structure driven by heterogeneity in gene expression. PHATE is a visualization method that captures both local and global nonlinear structure in data by an information-geometry distance between data points [30]. As seen from the PHATE projection ( Figure 1A), several "arms" of cells show the structure of the cell-to-cell relationship. Based on the observation, we manually picked one "core" cell at the root of the arms of cells in the middle of the cell cloud (red circle in Figure 1A). The core cell and 999 nearest cells around it were then selected using the k-nearest neighbors algorithm to form the final population of 1000 cells, which was used for subsequent data analyses.
To examine the homogeneity of selected cells, we used t-distributed stochastic neighbor embedding (t-SNE) [49] to position all 1000 selected cells in the two-dimensional t-SNE space. Compared to PHATE, t-SNE is a more commonly used nonlinear visualization algorithm for revealing structures in high-dimensional data, emphasizing local neighborhood structure within the data. When running t-SNE, we experimented with a series of perplexity values to produce multiple plots for the same population of selected cells. t-SNE is known to be sensitive to hyperparameters [50]. In general, when different parameter values are given, t-SNE tends to produce different cell clustering plots. However, for our selected cells, no structure is observed in any of these t-SNE embedding plots ( Figure 1B). The same results were obtained for the other two cell types as well as using the uniform manifold approximation and projection (UMAP) as an alternative embedding algorithm [50] (Supplementary Figure S2). Thus, we confirm that cells selected with our filter pipeline are highly homogenous populations of representative cells for each cell type. both local and global nonlinear structure in data by an information-geometry distance between data points [30]. As seen from the PHATE projection ( Figure 1A), several "arms" of cells show the structure of the cell-to-cell relationship. Based on the observation, we manually picked one "core" cell at the root of the arms of cells in the middle of the cell cloud (red circle in Figure 1A). The core cell and 999 nearest cells around it were then selected using the k-nearest neighbors algorithm to form the final population of 1000 cells, which was used for subsequent data analyses.

Identification of Highly Variable Genes
Highly variable genes (HVGs) are expressed variably across homogeneous cells of the same type. For each cell type, we used the method of [36] to identify HVGs from scRNA-seq data of the homogeneous population of selected cells. In this method, the relationship between the squared coefficient of variation (CV 2 ) of genes and their average expression (µ) is considered. The relationship between log-transformed CV 2 and log-transformed µ is fitted with a generalized linear model (GLM) by using a gamma distribution, and the expected CV 2 for a given µ is calculated with the fitted curve. The log-transformed ratio between observed CV 2 and expected CV 2 [=log(observed CV 2 ) − log(expected CV 2 ], called "residual variability", is used as the measurement of scEV. Since the expected CV 2 captures the variability originated from technical noise, the residual variability is considered to be an unbiased measure of biological variability. Indeed, after the correction, the µ-CV 2 correlation disappeared (Supplementary Figure S3). We repeated the procedure for correcting µ-CV dependency using another method [51] and obtained the qualitatively similar results in terms of identified HVGs and enriched functions (Supplementary Figure S4). Here, we only report the results obtained using the method of [36].
After the µ-CV 2 dependency correction, we identified 465, 466, and 364 HVGs at a false-discovery rate (FDR) of 0.01 for LCL, LAEC, and DF, respectively (Supplementary Tables S1-3). To visualize the expression variability of genes, we plot CV 2 against µ, both on the logarithmic scale, for LCL ( Figure 2A). Each dot represents a gene; all genes together give a characteristic cloud showing the µ and CV 2 of gene expression. Genes above the GLM fitting curve, e.g., IGKC, CCL3, LTB, and FTL, are more variable than expectation, whereas genes below the curve, e.g., TMEM9B and RPL17, are less variable ( Figure 2B).

Cell-Type Origin Determines the Function of Highly Variable Genes
To assess the biological functions of HVGs in different cell types, we performed enrichment analyses. We found that enriched gene ontology (GO) terms are largely distinct and reflect respective cell functions of each of the three cell types (Table 1). For example, LCL HVGs (e.g., CCL22 and IFI27) are more likely to be involved in cytokine-or interferon-signaling pathways, and also, more generally, the innate immune system; LAEC HVGs (e.g., COL1A1, MMP1, and IL17C) collagen formation and extracellular matrix organization; DF HVGs (e.g., KRT14, ACAN, and FLG) keratinization and regulation of cell proliferation. DF HVGs also include SFRP2, DPP4, and LSP1, which are marker genes defining major fibroblast subpopulations in human skin [52]. Taken together, these results show that different cell types have different sets of HVGs with substantial scEV, associated with cell-type-specific functions.
Cells 2020, 9, 14 7 of 18 discovery rate (FDR) of 0.01 for LCL, LAEC, and DF, respectively (Supplementary Tables S1-3). To visualize the expression variability of genes, we plot CV 2 against μ, both on the logarithmic scale, for LCL ( Figure 2A). Each dot represents a gene; all genes together give a characteristic cloud showing the μ and CV 2 of gene expression. Genes above the GLM fitting curve, e.g., IGKC, CCL3, LTB, and FTL, are more variable than expectation, whereas genes below the curve, e.g., TMEM9B and RPL17, are less variable ( Figure 2B).  Response to stimulus (GO:0050896) 3.

Functions Associated to HVGs are Conserved Across Tissues and Subpopulations of Cells
To determine how stable the functions associated with the HVGs identified in a cell type are, we used additional samples and random selection of the core cell. We analyzed two other fibroblast samples obtained from different body tissues (see Section 2.7. for data availability), as well as the initially included dermal sample. For each sample, after quality control, we randomly selected a core cell and their 999 more similar cells; then, we identified the set of HVGs (FDR <0.01 and fold-change >1.5) present in each subpopulation as described before (see Section 2. for materials and methods). Under these thresholds, we identified 221, 226, and 228 HVGs for dermal, lung distal, and lung proximal fibroblasts, respectively. Among the identified HVGs from the three samples, we found 36 genes that statistically enrich (FDR <0.05) for specific biological processes historically associated with fibroblasts ( Table 2). The small overlap found, as well as the functional enrichment for the extracellular matrix, were previously described in [53,54], where it is shown that fibroblasts are a remarkably plastic cell type differing between human tissues where they develop unique morphologies and physiologic functions but still have a commonly associated role, the extracellular matrix organization, and maintenance. Regulation of insulin-like growth factor (IGF) transport and uptake by insulin-like growth factor binding proteins (IGFBPs) (R-HSA-381426)

HVGs as Part of the Regulatory Network with High Cell-Type Specificity
Next, we set out to test whether HVGs are co-expressed and thus tend to form co-expression networks [55]. We first imputed the expression matrix and then constructed the co-expressed network using the top 50 HVGs for each cell type. For LCLs, the network contains two main modules centered on NFKBIA and IGHG1, respectively ( Figure 3A). NFKBIA encodes the NF-κB inhibitor that interacts with REL dimers to inhibit NF-κB/Rel complexes [56,57]. For LAECs, two modules are centered on IL23A/TNFAIP6 and COL1A1 ( Figure 3B); for DF, KRTAP2-3 and IGFBP7 ( Figure 3C). Thus, functions of "hub" genes in HVG co-expression networks are closely relevant to the function of corresponding cell type. These results are another line of evidence that scEV implies cell function. The transcription of multiple HVGs may be involved in the same underlying regulatory activities, giving rise to the co-expression network, as we observed. Thus, we wondered whether scEV in several different HVGs is driven by activities of one or few common TFs. To address this question, we searched for upstream regulators of the HVGs defined by our analysis (see Section 2 for materials and methods). We identified significant enriched TF binding motifs upstream of HVGs, four for LCL, and five for LAEC (Supplementary Table S4). No significantly enriched motif was identified for DF. The known motifs of LCL HVGs include that of the NF-κB subunit gene, RELA, and that of BACH2 ( Figure 3A). The known motifs of LAEC HVGs include the TATA box and that of CEBPB ( Figure 3B).
To further explore the involvement of HVGs in the cell type-specific regulatory network, we focused on LCL HVGs in a well-studied gene regulatory network that orchestrates B cell fate dynamics [58][59][60]. This known regulatory network involves eight genes, including three LCL HVGs-PRDM1 (or Blimp-1), AICDA (or AID), IRF4, two key regulatory genes with binding motifs enriched in targeting LCL HVGs (see above)-RELA and BACH2, and three other key regulators-BCL6, PAX5, and REL (cRel) ( Figure 4A).
We examined the inter-relationship between across-cell expressions of three LCL HVGs ( Figure 4B). The scatter plot shows that the directionality of the correlation between AICDA and IRF4 depends on the expression level of PRDM1. Among cells with relatively low expression of PRDM1, expressions of AICDA and IRF4 are negatively correlated. Whereas, among cells in which PRDM1 is highly expressed, expressions of AICDA and IRF4 are positively correlated. This nonlinear relationship between expressions of HVGs suggests they are embedded in a tightly regulated expression network. Thus, we examined the all-by-all Spearman correlation between expressions of all eight genes in this regulatory network using the imputed data of the homogenous LCLs ( Figure 4C). By comparing the sign of correlation coefficient of each pair of genes with the regulatory effect of the gene pair in the model network, we found that the correlation matrix can be used to correctly recover 15 out of 18 direct regulatory relationships. The result suggests that, even in this highly homogenous population of LCLs, cells retain gene regulatory network activities that orchestrate cell fate dynamics as in their original B cells. we searched for upstream regulators of the HVGs defined by our analysis (see Section 2 for materials and methods). We identified significant enriched TF binding motifs upstream of HVGs, four for LCL, and five for LAEC (Supplementary Table S4). No significantly enriched motif was identified for DF. The known motifs of LCL HVGs include that of the NF-κB subunit gene, RELA, and that of BACH2 ( Figure 3A). The known motifs of LAEC HVGs include the TATA box and that of CEBPB ( Figure 3B). To further explore the involvement of HVGs in the cell type-specific regulatory network, we focused on LCL HVGs in a well-studied gene regulatory network that orchestrates B cell fate dynamics [58][59][60]. This known regulatory network involves eight genes, including three LCL HVGs-PRDM1 (or Blimp-1), AICDA (or AID), IRF4, two key regulatory genes with binding motifs enriched in targeting LCL HVGs (see above)-RELA and BACH2, and three other key regulators-BCL6, PAX5, and REL (cRel) ( Figure 4A).  and methods). We identified significant enriched TF binding motifs upstream of HVGs, four for LCL, and five for LAEC (Supplementary Table S4). No significantly enriched motif was identified for DF. The known motifs of LCL HVGs include that of the NF-κB subunit gene, RELA, and that of BACH2 ( Figure 3A). The known motifs of LAEC HVGs include the TATA box and that of CEBPB ( Figure 3B). To further explore the involvement of HVGs in the cell type-specific regulatory network, we focused on LCL HVGs in a well-studied gene regulatory network that orchestrates B cell fate dynamics [58][59][60]. This known regulatory network involves eight genes, including three LCL HVGs-PRDM1 (or Blimp-1), AICDA (or AID), IRF4, two key regulatory genes with binding motifs enriched in targeting LCL HVGs (see above)-RELA and BACH2, and three other key regulators-BCL6, PAX5, and REL (cRel) ( Figure 4A).

Single-Cell Expression Variability in LCLs is Positively Correlated with between-Individual Expression Variability
Next, we examined the relationship between scEV and inter-individual expression variability. We distinguish between the two different types of variabilities at different organizational levels. Specifically, the former is cell-to-cell variability in a population of cells, and the latter is inter-individual variability at the human population level. We again focused on LCLs, for which population-scale gene expression data are available from the Geuvadis RNA-seq project of 1000 Genomes samples. The bulk RNA-seq data was downloaded as a normalized expression matrix of FPKM values. We retained data for all LCLs of European ancestry (CEU) [61]. With the residual variability estimated from scRNA-seq of GM12878 and that estimated from the CEU population, we tested the correlation between the two estimates across genes. When the test was conducted with all genes (n = 8424), we obtained a significant but weak positive correlation (SCC, r = 0.19, p = 1.2 × 10 −9 ). We wondered whether this positive correlation was driven by subsets of genes. To identify these gene sets, we conducted the correlation tests for the GO-defined gene sets one by one. Across all gene sets tested, the average SCC for gene sets defined by GO biological process (BP) and molecular function (MF) terms are on average r = 0.28 and r = 0.23, respectively. Strikingly, we found a small number of gene sets that produced SCC much higher than averages. The functions of these gene sets include B-cell activation involved in immune response (GO:0002322), cytokine receptor activity (GO:0004896), cellular response to drug (GO:0035690), and regulation of tyrosine phosphorylation of stat protein (GO: 0042509; Figure 5), as well as leukocyte chemotaxis (GO: 0030595) and phospholipase activity (GO:0004620; for more examples, see Supplementary Figure S6). Thus, for these gene sets, scEV may contribute to the establishment of between-individual expression variability. 4C). By comparing the sign of correlation coefficient of each pair of genes with the regulatory effect of the gene pair in the model network, we found that the correlation matrix can be used to correctly recover 15 out of 18 direct regulatory relationships. The result suggests that, even in this highly homogenous population of LCLs, cells retain gene regulatory network activities that orchestrate cell fate dynamics as in their original B cells.

Single-Cell Expression Variability in LCLs is Positively Correlated with between-Individual Expression Variability
Next, we examined the relationship between scEV and inter-individual expression variability. We distinguish between the two different types of variabilities at different organizational levels. Specifically, the former is cell-to-cell variability in a population of cells, and the latter is interindividual variability at the human population level. We again focused on LCLs, for which population-scale gene expression data are available from the Geuvadis RNA-seq project of 1000 Genomes samples. The bulk RNA-seq data was downloaded as a normalized expression matrix of FPKM values. We retained data for all LCLs of European ancestry (CEU) [61]. With the residual variability estimated from scRNA-seq of GM12878 and that estimated from the CEU population, we tested the correlation between the two estimates across genes. When the test was conducted with all genes (n = 8424), we obtained a significant but weak positive correlation (SCC, r = 0.19, p = 1.2 × 10 -9 ). We wondered whether this positive correlation was driven by subsets of genes. To identify these gene sets, we conducted the correlation tests for the GO-defined gene sets one by one. Across all gene sets tested, the average SCC for gene sets defined by GO biological process (BP) and molecular function (MF) terms are on average r = 0.28 and r = 0.23, respectively. Strikingly, we found a small number of gene sets that produced SCC much higher than averages. The functions of these gene sets include B-cell activation involved in immune response (GO:0002322), cytokine receptor activity (GO:0004896), cellular response to drug (GO:0035690), and regulation of tyrosine phosphorylation of stat protein (GO: 0042509; Figure 5), as well as leukocyte chemotaxis (GO: 0030595) and phospholipase activity (GO:0004620; for more examples, see Supplementary Figure S6). Thus, for these gene sets, scEV may contribute to the establishment of between-individual expression variability.

No Enriched Functions Associated with HVGs Identified in Human Induced Pluripotent Stem Cells (iPSCs)
Finally, we argued, if scEV is the indicator of cell type-specific function, then scEV in undifferentiated cells should not be associated with any cellular functions. To test this, we obtained the scRNA-seq data from the study of [62] (see Section 2.7. for data availability). The data was generated from human iPSCs [63]. Same as other cell types examined in this study, these iPSCs were also prepared using the 10× Genomics Chromium controller. The released data contains five samples. We used the first batch (Sample 4) of the data to perform the HVG detection and function enrichment tests, using the same procedure applied to other cell types. When plotting the relationship between log-transformed CV 2 and log-transformed average expression (µ), we found almost no genes showing large CV 2 deviated from the regression curve ( Figure 6A)-a pattern differs substantially from those of the other three cell types ( Figure 6B). This pattern suggests that, for the majority of genes in iPSCs, scEV can be explained by technical noise or sampling stochasticity. In other words, iPSCs lack biological variability in their single-cell expression. Nevertheless, we still identified 79 iPSC HVGs (Supplementary Table S5) but could not associate any significant (FDR <0.05) enriched function with them. To further validate our findings, we performed the same analysis using another iPSC sample (see Section 2.7 for data availability) recovering the same pattern, a low number of HVGs (4) that are not significative enriched for a specific function (ID3, LEFTY1, MALAT1, TAGLN). These negative results are consistent with our prediction given by the "variation is function" hypothesis: undifferentiated iPSCs are not expected to be associated with any cell-type-specific function.
We used the first batch (Sample 4) of the data to perform the HVG detection and function enrichment tests, using the same procedure applied to other cell types. When plotting the relationship between log-transformed CV 2 and log-transformed average expression (μ), we found almost no genes showing large CV 2 deviated from the regression curve ( Figure 6A)-a pattern differs substantially from those of the other three cell types ( Figure 6B). This pattern suggests that, for the majority of genes in iPSCs, scEV can be explained by technical noise or sampling stochasticity. In other words, iPSCs lack biological variability in their single-cell expression. Nevertheless, we still identified 79 iPSC HVGs (Supplementary Table S5) but could not associate any significant (FDR <0.05) enriched function with them. To further validate our findings, we performed the same analysis using another iPSC sample (see Section 2.7 for data availability) recovering the same pattern, a low number of HVGs (4) that are not significative enriched for a specific function (ID3, LEFTY1, MALAT1, TAGLN). These negative results are consistent with our prediction given by the "variation is function" hypothesis: undifferentiated iPSCs are not expected to be associated with any cell-type-specific function. For each cell type, the relationship between coefficient of variation squared (CV 2 ) and mean expression of genes is shown.

Discussion
Single-cell expression variability (or scEV) is sometimes called gene expression noise, emphasizing the stochastic nature of transcriptional activities in cells [64,65]. Interrogating scEV data has provided insights into gene regulatory architecture [66,67]; manipulating the magnitude of scEV, through using noise enhancers or scEV-modulating chemicals, has been an approach to achieve drug synergies [68]. Understanding the origin and functional implications of scEV has long been appreciated [4][5][6]69].
In this study, we focused on scEV in human cells. More specifically, we characterized different genes' expression variability levels within a highly homogeneous population of genetically identical (or nearly isogenic) cells under the same environmental condition. We quantified scEV in highly homogeneous populations of a sizable number of viable cells. Working with cells of the same type, For each cell type, the relationship between coefficient of variation squared (CV 2 ) and mean expression of genes is shown.

Discussion
Single-cell expression variability (or scEV) is sometimes called gene expression noise, emphasizing the stochastic nature of transcriptional activities in cells [64,65]. Interrogating scEV data has provided insights into gene regulatory architecture [66,67]; manipulating the magnitude of scEV, through using noise enhancers or scEV-modulating chemicals, has been an approach to achieve drug synergies [68]. Understanding the origin and functional implications of scEV has long been appreciated [4][5][6]69].
In this study, we focused on scEV in human cells. More specifically, we characterized different genes' expression variability levels within a highly homogeneous population of genetically identical (or nearly isogenic) cells under the same environmental condition. We quantified scEV in highly homogeneous populations of a sizable number of viable cells. Working with cells of the same type, for example, LCL, we started by preprocessing data from thousands of cells. We found that, even though we had firstly preprocessed the data and retained only cells with similar library size and in the same cell cycle phase, it was not enough. There were still marked substructures, shown as branches of cells, in the embedding cloud of cells ( Figure 1A), as revealed by the new embedding algorithm [30]. Retrospectively, we applied the trajectory analysis and found out that one of the longest branches contained cells with elevated expression of immunoglobulin genes (Supplementary Figure S7).
Similarly, marked substructures were observed in the embedding plots of the other two cell types, LAEC and DF. Genes that were differentially expressed and drove the formation of branches of LAECs and DFs were different from those in LCL cells. Thus, there is no single or a small set of marker genes that can be used to capture cellular heterogeneity across different cell types, making the definition of populations of homogenous cells a tedious task. Our work represents the first study focused on comparing scEV in highly homogeneous cell populations across genes in different cell types.
We showed that scEV estimated from homogeneous populations of cells for different cell types carries information on cell type-specific function. Information on molecular functions of cells and biological processes of a given cell type can be extracted from a set of highly variable genes (HVGs), bearing significant biological meaning (see also [70]). HVGs detected in different cell types do not overlap and can reveal the subtle differences in cellar functions between cell types. These conclusions are reached based on our investigation of three cell types and their corresponding HVGs.
First, LCLs are usually established by in vitro infection of human peripheral blood lymphocytes by the Epstein-Barr virus. The viral infection selectively immortalizes resting B cells, giving rise to an actively proliferating B cell population [71]. B cells genetically diversity by rearranging the immunoglobulin locus to produce diverse antibody repertories that allow the immune system to recognize foreign molecules and initiate differential immune responses [20,72,73]. LCLs are produced through the rapid proliferation of few EBV-driven B cells from the blood cell population [74]. Thus, scRNA-seq data sets of LCLs offer a "snapshot" of highly diverse immunoglobulin rearrangement profiles in a much larger population of polyclonal B cells established in donors of these cell lines. Therefore, it is not unexpected to see quite a few immunoglobulin genes in the top list of HVGs identified in LCLs. In addition to these immunoglobulin genes, a number of other immune genes, especially C-C motif chemokine ligands (CCLs) and C-C motif chemokine receptors (CCRs), are in the list of HVGs of LCL. These genes play important roles in allowing the coordination of the activity of individual cells through intercellular communication, essential for the immune system maintains robustness [75]. The HVG co-expression network analysis revealed the key role of the NF-κB pathway in facilitating communications between immune cells [18,20]. More strikingly, we were able to reconstruct nearly the entire NF-κB regulatory network, underlying a differentiation of activated B cells and antibody-secreting cells, by using the correlation and anti-correlation relationships between expressions of HVGs and their regulatory genes.
Second, LAEC is a key cell type playing important roles in lung tissue remodeling, and pulmonary inflammatory and immune responses [76]. The airway epithelium, playing a critical role in conducting air to and from the alveoli, is a dynamic tissue that normally undergoes slow but constant turnover. In the event of mild to moderate injury, the airway epithelium responds vigorously to re-establish an epithelial sheet with normal structure and function. HVGs identified in LAECs, which are enriched with genes involved in collagen formation, regulation of cell proliferation, and extracellular matrix organization, accurately elucidate this aspect of functions of the airway epithelium. LAECs are also central to the defense of the lung against pathogens and particulates that are inhaled from the environment. This aspect of functions is also reflected in the enriched functionality of LAEC HVGs.
Third, DFs are responsible for generating connective tissue and play a critical role in normal wound healing [53]. DFs are also commonly used in immunological studies [33,77,78]. HVGs identified in DFs again accurately reflect these primary aspects of DF functions, including extracellular matrix organization, keratinization, and regulation of signaling receptor activity. DF HVGs do have several categories of enriched functions overlap with those of LAEC, which is not unexpected, given that DF and LAEC have functional overlaps [79].
Our results provide evidence supporting the "variation is function" hypothesis, proposed by [17], suggesting that the aggregate cellular function may depend on scEV. Dueck and colleagues also laid down several scenarios, including bet hedging, response distribution, fate plasticity, and so on, in which the establishment of the relationship between scEV and cell function could be attained. Our analytical framework using scRNA-seq data may be utilized in appropriate systems to test the plausibility of these different scenarios. If scEV is an accountable and credible surrogate of cell function, as we have shown in this study, then quantifying and characterizing scEV may become a first-line approach for understanding the function of cell types and tissues. Indeed, when we applied this framework to scRNA-seq data from human iPSCs, we observed no enriched gene functions and no regulatory pathways/networks associated with HVGs in iPSCs. This anti-example, showing no variation no function, further validates the "variation is function" hypothesis.
Furthermore, we have shown that, across certain sets of genes, scEV is positively correlated with population-level expression variability. This correlation provides a new possibility to design single-cell assays with one sample to approximate the population variability of certain genes' expression. This new method may be used to study disease-causing expression dysregulation because it has been a number of cases that increased population-level expression variability has been linked with diseases [80][81][82][83][84].
Pelkmans [8] pointed out in a visionary perspective article that: "Embracing this cell-to-cell variability as a fact in our scientific understanding requires a paradigm shift, but it will be necessary." Indeed, scRNA-seq technologies have brought revolution to gene expression analysis. The technical development gives us a new approach beyond the capacity of traditional methods that rely on experimental measurements of population-average behavior of cells to conceive regulatory network models and signal processing pathways. More importantly, for traditional methods, by averaging information across many cells, differences among cells, which may be important in explaining mechanisms, can be lost. Given the large degree of cell-to-cell expression variability even between genetically homogeneous cells, conclusions reached as for such with traditional average-based methods may be of low-resolution, incomplete, and sometimes misleading [3,18,29,85].

Conclusions
We have shown that scEV in highly homogeneous populations of human cells is widespread in differentiated cell types and is likely to imply cell type-specific function. We conclude that single-cell variability and the information it contains are the key to a deepened understanding of cells and their functions. Careful assessment and characterization of cell-to-cell expression variability in relevant cell types will facilitate the study of normal cell functions as well as pathological cell processes.