Orthogonal control of expression mean and variance by epigenetic features at different genomic loci

While gene expression noise has been shown to drive dramatic phenotypic variations, the molecular basis for this variability in mammalian systems is not well understood. Gene expression has been shown to be regulated by promoter architecture and the associated chromatin environment. However, the exact contribution of these two factors in regulating expression noise has not been explored. Using a dual-reporter lentiviral model system, we deconvolved the influence of the promoter sequence to systematically study the contribution of the chromatin environment at different genomic locations in regulating expression noise. By integrating a large-scale analysis to quantify mRNA levels by smFISH and protein levels by flow cytometry in single cells, we found that mean expression and noise are uncorrelated across genomic locations. Furthermore, we showed that this independence could be explained by the orthogonal control of mean expression by the transcript burst size and noise by the burst frequency. Finally, we showed that genomic locations displaying higher expression noise are associated with more repressed chromatin, thereby indicating the contribution of the chromatin environment in regulating expression noise.


I. Gating of Flow Cytometry Data
Previous studies in yeast have suggested a strong dependence between cell size (Forward Scatter, FSC) and noise as measured by the Coefficient of Variation (CV) or CV 2 . As previously suggested (Skupsky et al, 2010), we find an extremely narrow gate to be unnecessary to limit the correlation between cell size (FSC) and GFP. However, to minimize any errors due to manual gating we used a data driven gating strategy provided by the norm2h filter of the Bioconductor flowViz package in R (Supplementary Figure S1). To ascertain whether there was any residual correlation between FSC and GFP we examined both linear (Supplementary Figure S2) and nonlinear correlations and found that for all clones there is no significant correlation between FSC and GFP (Supplementary Figure S3).
Furthermore, as previously suggested we do not find that size gating has any quantitative effect on the scaling between GFP mean and GFP variance or between GFP mean and GFP CV . Specifically, we examined the relationship between expression noise (CV) as a function of expression before and after gating and found that both exhibited uncorrelated relationships. (Supplementary Figure   S4A) Additionally, we examined whether the relationship between variance and mean was strongly modulated by gate size (Supplementary Figure S4B). We found that neither the regression slope (Supplementary Figure S4C) nor the extent of correlation (Supplementary Figure S4D) was strongly affected by gate size. In contrast to findings in yeast, these results suggest that cell size does not have a significant impact on either expression mean or noise.
In addition to cell size, we have previously shown that other extrinsic sources of noise, such as cell cycle and cell line aneuploidy do not influence gene expression noise in our system (Weinberger et al, 2005). Further, the lack of correlation between mean expression and CV at the mRNA level also suggests that extrinsic cellular mechanisms involved with translation, such as the number of ribosomes, do not influence gene expression noise in our system ( Figure 4B). Finally, we make a theoretical argument that has previously been discussed in detail, for why other extrinsic factors, such as transcription factors, potentially do not influence the results of our study (Skupsky et al, 2010). First, the range of noise (CV) observed with the HIV promoter for different clones used in this study (~0.25-0.8) is larger than that observed in other large-scale studies involving several cellular promoters (~0.1-0.3) (Sigal et al, 2006;. Next, the noise observed in the mRNA distributions (CV t ) are typically higher than those observed in the protein (GFP) distributions (CV p ) ( Figures 2B and 5C). Further, assuming a bursting regime, the noise in promoter state transitions can be given by (Skupsky et al, 2010): In our dataset, we observe burst sizes (b) greater than 1 ( Figure 5A), and thus the first term in the equation is close to 1. The second term is much greater than 1 for all the clones. This implies that the noise introduced from the promoter state transitions (CV promoter ) is greater than the noise in transcript counts (CV t ), which in turn is greater than noise in protein distributions (CV p ). In general, in a cascade of stochastic processes, for noise at a particular step to make a significant contribution and propagate to the next step, the upstream step needs to be noisier than the downstream step. But since it appears that the observed noise in protein (GFP) distributions from the HIV promoter is possibly higher than endogenous genes, it is unlikely that noise in transcription factors (or other cellular proteins) could influence and propagate through the HIV promoter to significantly affect the observed mRNA and GFP distributions.

Supplementary Figure 1 -Representation of gating strategy for subset of clones.
Raw flow cytometry channel data for all 227 LGM2 clones was gated in Side Scatter (SSC) and Forward Scatter (FSC) space using the norm2h filter provided by the BioCondutor flowViz package. This filter fits a 2D Gaussian around the mode of the data, which results in a small ellipse. The resulting gate for the 25 clones used for FISH analysis is depicted (red ellipse) overlaid on a smoothed FSC v SSC scatterplot.

II. Independent Clone Generation Replicate
Despite the large number of clones generated, we entertained the possibility that the relationships inferred from this large set of clones may be an artifact of the particular clones generated or the particular culture conditions. Therefore, to address this possibility we generated an independent set of 191 clones. These clones were derived under identical conditions as the main set from an independent infection of uninfected Jurkat T cells with LGM2 virus. We found that the relationships between Mean and Variance (regression slope ~2) and Mean and CV (no significant correlation) were not significantly different from the main set of clones used in this study (Figure 2 and Supplementary Figure S5) Therefore, we do not find evidence that suggests our findings are dependent on the particular set of clones generated or culture conditions. Mean and CV. In agreement with the main set of clones, we find there to be no significant correlation between expression noise (CV) as a function of mean (R 2 =0.08).

III. Clone Selection and Stability
Our initial observation of an uncorrelated relationship between expression noise and mean led us to devise a clone selection method that would: a) capture the range of CV and mean observed and b) permit deeper analysis of the regulation of expression noise for a given mean. Toward this aim, we performed hierarchical clustering and analyzed the mean silhouette value as a function of cluster number (Supplemental Figure S7A, left). We found that the mean silhouette is only weakly dependent on the cluster number with four clusters yielding the highest value. Examining the intra-cluster Silhouette value for each clone (Supplemental Figure

IV. Computational Analysis of Microscopy Images
smFISH enables elegant visualization of single mRNA molecules in fixed populations of cells. However, its application in high-throughput has been hampered in part by the lack of automated software tools. Therefore, to enable the application of smFISH across many clonal populations and over fifteen thousand single cells, we implemented primarily automated cell and smFISH segmentation software in MATLAB (R2011a, Mathworks Inc.) using the freely available DIPImage toolbox (v. 2.2, Linux). DIPImage provides a rich set of basic morphological, intensity and feature based image processing and measurement functions. Importantly, all functions in the toolbox scale to stacks of images such as those resulting from wide-field deconvolution microscopy. We broke down high-throughput image processing into three tasks: 1) deconvolution to enhance signal-noise and correct geometry 2) cell segmentation and 3) smFISH signal segmentation.

High-throughput Deconvolution
The optical properties of wide-field epifluorescent microscopes lead to blurring from out of focus light and geometric distortion from the point-spread-function (PSF) of the light path. We found that deconvolution using a PSF measured using 100nm fluorescent beads (Invitrogen Inc.) significantly enhances the automation of downstream processing.
In particular, morphological operations and shape-based classification greatly benefit from the higher signal-to-noise ratio and corrected geometry resulting from deconvolution. Therefore, all 84 fields from each of the 25 LGM2 clones analyzed were first deconvolved using Huygens Core (v3.4, SVI) on a 10-node dual-processor (AMD 2.2Ghz, 2Gb RAM per node) Linux cluster with jobs managed by Sun Grid Engine (Sun Microsystems). Custom scripts in Tcl and Bash issued deconvolution commands and ran jobs. Deconvolution parameters were heurtistically determined to maximize final image quality.
Morphological Image-processing enables the high-throughput analysis of over fifteen thousand single cells across many clones scale by imaging all 25 clones in 3 channels across 84 fields (capturing ~5-10 cells per field) with 91 axial slices at 0.2 µm spacing. This spacing was determined to provide optimal resolution of smFISH signals and record the entire cell volume. Importantly, performing RNA FISH at this scale presented two challenges: (1) accurate segmentation of cells in fields with touching cells, and (2) reliable identification of single molecule FISH signals across many samples. Previously reported methods involve either z-projection of image stacks, manually intensive threshold selection, or training set development (Raj et al, 2006;Rifkin, 2011). Z-projection flattens the recorded image stack and exhibits a propensity to generate overlapping signals that then must be algorithmically separated.
Similar to manual threshold selection, training set development requires significant manual intervention. Furthermore, the potential to over-fit limited training data makes scaling to many samples uncertain.
To overcome these issues and enable smFISH at a larger scale, we developed custom software that segments complex fields in a highly automated fashion and reliably identifies smFISH signals with the minor requirement of manual estimation of a 'ballpark' threshold that weakly affects the resulting counts. A necessity for a perfect threshold that efficiently discriminates true signals from background is overcome through application of a heuristic multi-feature classifier that represents a model of smFISH signals. Such multi-feature classification proved more time-efficient and robust (relative to signals distinguished by eye) to inclusion of background artifacts across many images and clones than thresholding alone.

Cell Segmentation
Deconvolved image stacks undergo a multi-step process (Supplementary Figure S10 Subsequently, to remove interior peaks in intensity the image stack is then morphologically reconstructed (Supplementary Figure S10C) with an H-dome (Vincent, 1993), which results in flatter intensity profiles that are less susceptible to oversegmentation. To generate a binary mask, the reconstructed image stack is then thresholded using an empirically determined multiplier of the otsu threshold. Then to separate touching objects, a Euclidean distance transform is applied (Supplementary Figure S10D) and then thresholded using a different multiplier of the otsu threshold. This results in unique seeds within each cell (Supplementary Figure S10E). Importantly, to prevent analysis of partial cells, seeds intersecting the image boundary are rejected. We found this method of seeding to be more robust than computing seeds from the DAPI stained nuclei. Specifically, nuclei frequently have interior holes and discontinuities that frequently result in over-segmentation. The resulting seeds were then first grown on the Euclidean transformed image, which effectively results in larger seeds that were then grown on the reconstructed image to arrive at final segmentation of the field (Supplementary Figure S10F). The resulting objects are rejected on the basis of perimeter to area (P2A) and Podczeck circularity shape descriptors, which identify aberrant objects resulting from under or over segmentation. Fields with rejected objects are retained for manual inspection. Retained fields typically comprise less than 5% of total fields. Segmentation problems in these fields can be overcome through a combination of threshold adjustment. Segmentation object boundaries accurately determine cell boundaries (Supplementary Figure S10G). Furthermore, the sizes of the cell objects are not significantly different from clone to clone (Supplementary Figure   S11A) and individual cell size distributions are well described by normal distributions (Supplementary Figure S11B). Together, these suggest that this segmentation method is robust across the clones studied and that cell size is not a significant factor underlying our analysis.

smFISH segmentation
Each segmented cell object then undergoes a multistep filtering and classification process to identify both single molecule RNA signals and burst-like features. To remove the majority of slowly varying background remaining in deconvolved images (Supplemental Figure S12A), a morphological TopHat filter with an elliptical structuring element sized (7 pixels) to pass FISH signals but reject background is applied. To further enhance spherical signals a Laplacian of Gaussian transform is applied (Raj et al, 2006), resulting in a stack where almost all the background has been removed (Supplemental Figure S12B). What remains are true FISH signals, which are highly spherical, and irregular blobs. For each clone, a 'ballpark' threshold, which provides a reasonable mask of FISH signals and blobs, is manually determined by assessing performance across ten fields chosen at random. A previously reported method relied on setting a precise threshold for each cell (Raj et al, 2006). This method does not scale efficiently to tens of thousands of cells and may be susceptible to user bias. Rather than relying strictly on a threshold, we found that classifying objects on the basis of size, circularity and intraobject grey level intensity standard deviation provided a robust discrimination of FISH objects.
We empirically determined thresholds on these features that yielded classification in good agreement with manual classification. The combination of these three features efficiently distinguishes aberrant blobs from FISH signals. Specifically, to segment FISH signals connected component analysis is performed on the thresholded image. Very small (<30 total pixels) and very large (>300) objects are excluded in this step, which provides an initial classification filter. Subsequently, objects are only accepted if their perimeter to area (p2a) values are highly consistent with a sphere (p2a between 0.96 and 1.02). Secondly, background blobs have very uniform pixel intensity across the object while FISH signals have radially decreasing intensity from the center of the diffraction limited spot. This is captured most succinctly in the pixel grey level standard deviation within each object. Therefore, objects with a grey level standard deviation below an empirical threshold are also rejected. In practice, this classification provide counts <5% different from manual counts and is highly scalable. The combination of a 'ballpark' threshold with a three-feature classification scheme allowed us to scale semiautomated FISH analysis to tens of thousands of single cells.    RNA concentration was quantified with a NanoDrop 1000 spectrophotometer. GFP mRNA abundance was determined by RT-qPCR in triplicate using a QIAGEN QuantiTect SYBR Green RT-PCR Kit (QIAGEN 204243) with a Bio-Rad iQ5 system. 50ng of total RNA and 0.25µM of GFP forward and reverse primers were used according to the supplied instructions at an annealing temperature of 62°C for 45 cycles. To assess the reliability of our measurement, β-Actin mRNA degradation was also measured as a control for normal mRNA degradation rates. We consistently observed an increase in the apparent LGM2 mRNA level at the 2hr time-point, which is likely due to RNA export phenomena as previously reported (Raj et al, 2006). Furthermore, cells are markedly sick at the 8hr time-point. Therefore, we only fit the 2hr, 4hr and 6hr time- To ascertain whether our system for measuring degradation rate provided a reasonable estimate, we also measured β-Actin levels following treatment identical to (A). Using Log-linear regression (orange line) we found the best-fit rate to be 0.14 hr -1 (R 2 =0.73), representing a half-life of 4.95 hours, which is highly consistent with a previously published rate (López-Orduña et al, 2007). Shading around the best-fit regression represents the pointwise 95% confidence interval.

VI. Model Fitting
Maximum likelihood estimation (MLE) of the promoter On rate (k a ) and average burstsize (k t+ /k r ) for each clone was performed against the full analytical probability density function (pdf) of the standard 2-state model of gene expression (Supplementary Figure   S14) (Peccoud & Ycart, 1995;Raj et al, 2006). Despite it's abstractness, this model has received widespread use due to its comprehensive ability to reveal kinetic mechanisms underlying burst-like transcription. MLE was implemented through specification of the log-likelihood function of the steady-state RNA distribution (Supplementary Figure S14B) in Mathematica 8 (Wolfram Inc.). Parameters were estimated by numerically minimizing the negative log-likelihood of the two-state pdf given the experimentally determined RNA distribution for each clone. As discussed, we experimentally measured the LGM2 degradation rate and therefore k t-was fixed at this value. In addition, as previously reported, RNA distributions are insufficient to separately determine the promoter Off rate and the transcription rate. As we and others have done previously (Skupsky et al, 2010;Raj et al, 2006), we held the transcription rate in the On state constant across clones.
We used our previously reported value of 60 hr -1 (Skupsky et al, 2010). Values ±50% of this fixed value did not significantly change estimates of k a or burst-sizes, suggesting that our results are highly independent of the particular value chosen. Mean: Thus, CV is given by: In the bursting regime (k r >> k t-and k r >> k a ), these equations reduce to: Thus, where ! ! is the normalized burst frequency and ! is the burst size.  Figure 14B)

VII. Additional Fit Parameter Analysis
In this study we reveal both observational and mechanistic othorgonality in the control of expression mean and expression noise (CV) across integration positions. Specifically, we find expression mean and noise to be uncorrelated with burst-size primarily explaining mean and promoter On rate primarily explaining noise. Despite the lack of significant cross-correlations between promoter On rate and mean, and burst-size and noise, some possibility of co-modulation remains. Therefore, we directly examined burst size as a function of promoter On rate (Supplementary Figure S15) and found no evidence of significant correlation ((R 2 =0.08, p>0.1), (r s =0-38, p>0.05)). This suggests that burst-size and k a respectively provide orthogonal explanations for mean and noise.

VIII. Additional Nucleosome Sensitivity Analysis
In this study we established that for similar mean levels of expression, noisier clones have more inaccessible chromatin at the promoter. Closer analysis along different regions of the promoter revealed that the HSS consistently has the highest ratio of chromatin inaccessibility between high-and low-noise clone pairs. This arises from much lower chromatin inaccessibility values for low-noise clones compared to other sites in the promoter (Supplemental Figure S16). We also established that chromatin inaccessibility at Nuc-1, unlike HSS or Nuc-0 (Supplemental Figure S17) is the best predictor of the promoter On rate. This is further reinforced by performing a principal components analysis (Supplemental Figure S18). Nuc-1 is almost negatively correlated to the promoter On rate whereas HSS and Nuc-0 are almost orthogonal to the promoter On rate. Therefore, chromatin inaccessibility at

Supplementary
Nuc-1 appears to be the best predictor of k a . (B) PCA for the burst-size and the chromatin state at the promoter shows that most of the data lies along the 1D axis of the first principal component and that the burst size is independent of the three chromatin inaccessibility measurements along the promoter.