Structure of silent transcription intervals and noise characteristics of mammalian genes

Mammalian transcription occurs stochastically in short bursts interspersed by silent intervals showing a refractory period. However, the underlying processes and consequences on fluctuations in gene products are poorly understood. Here, we use single allele time-lapse recordings in mouse cells to identify minimal models of promoter cycles, which inform on the number and durations of rate-limiting steps responsible for refractory periods. The structure of promoter cycles is gene specific and independent of genomic location. Typically, five rate-limiting steps underlie the silent periods of endogenous promoters, while minimal synthetic promoters exhibit only one. Strikingly, endogenous or synthetic promoters with TATA boxes show simplified two-state promoter cycles. Since transcriptional bursting constrains intrinsic noise depending on the number of promoter steps, this explains why TATA box genes display increased intrinsic noise genome-wide in mammals, as revealed by single-cell RNA-seq. These findings have implications for basic transcription biology and shed light on interpreting single-cell RNA-counting experiments.

: Parameter estimation in heterogenous synthetic cell populations. We generated populations of 48 cells, resulting from the mixing of two data set sharing identical kinetic parameters except the transcription rate km, which was set to k (1) m = 1 for the first set and to k (2) m = 5 for the second one. The control parameter r measures the fraction of cells from the second set in the mixed population. The blue crosses stand for the mean parameters in function of r and the error bars for the 5th and 95th percentiles of the posterior distributions. The red dashed lines corresponds to the predicted behavior of the inferred parameters based on the hypothesis that the mean mRNA level m of the mixed population is correctly inferred and that the estimated transcription rate is actually the maximum among the underlying populations.    cell mRNA distribution obtained by Gibbs sampling for eight representative simulated cells from a population with heterogenous kinetic parameters. Gibbs sampling enables reconstruction of single-cell mRNA distributions P (m|θ, Di) conditioned on the data using the model as a prior. Separation of the noise is performed by estimating the mean and the variance of each single-cell distribution (inset: mean, 5th and 95th percentiles of each single-cell distribution). (B) Comparison between single-cell noise estimated with Gibbs sampling and the true single-cell noise in the data for heterogenous simulated population of 48 cells. Inset, comparison between single-cell mean mRNA number estimated with Gibbs sampling and the true value. Overall Gibbs sampling recovers accurately the true single-cell mRNA mean and variance using the optimal model (with global parameters) as a prior. (C-D) Gibbs sampling estimates (light green crosses) and true values (blue crosses) for the population total (C) and intrinsic (D) transcriptional noise in function of the amount of extrinsic noise in the 24 synthetic populations of 48 cells. Noise values predicted by the optimal model for the different population are shown as dark green crosses. The black dash line stand for the expected noise in absence of extrinsic variability (constant kinetic parameters) and the colored dash lines for the linear regression. Gibbs sampling gives an excellent estimation of the total and intrinsic noise in the populations whereas the optimal model gives a close estimation of the true intrinsic noise independently of the amount of extrinsic noise in the data.   Figure S7: Noise reduction provided by the cycle compared to a simple switch (N=1), in function mean mRNA expression m and the mean number of inactive steps N . The genes benefiting the most from the noise reduction provided by the cycle are the ones expressed highly enough such that the promoter cycle dominates (η 2 p η 2 c ) and exhibiting a large number of steps (N ∼ 6). Even for theses genes, the actual noise reduction is at best 30%, which is approximately half the maximal theoretical reduction of 50% (for N → ∞).

Real-time measurements of transcription in mammals
The single cell time-lapse bioluminescence traces analyzed in this study are taken from (Suter et al. , 2011). In addition, we generated new cell lines carrying a synthetic promoter using the Flp-In system. Here, we briefly review the experimental setup which was established in order to monitor transcription at high temporal resolution and the experimental procedures to generate new data.

Suter et al. data
A short-lived nuclear luciferase reporter (NLS-luc) encoded by a short-lived mRNA was integrated into three types of constructs: i. A gene trap lentivector, allowing transcription to be controlled by the endogenous locus at the insertion site.
ii. Vectors conferring circadian expression by the Bmal1 promoter.
iii. Vectors carrying artificial promoters consisting of one (H1) or two copies (H2) of different CCAAT box variants followed by a TATA box.
These constructs were used to generate monoclonal NIH-3T3 cell lines (mice fibroblasts). Nondividing confluent cells were used during the recording to minimize noise from cell-cycle and avoid cell tracking issues (Zopf et al. , 2013). Images were captured every 5 minutes at single-cell resolution with a highly sensitive EMCCD camera during approximately 2 days. Calibration of the system was performed in order to relate amount of emitted light and the protein copy number. The probability of measuring the signal s given p proteins is then given by a normal distribution: with the following parameters α = 7.63, β = 54 and σ b 100. In this study, we analyzed 16 different cell lines, 7 with different endogenous promoter, 2 with the Bmal1 promoter, 1 with the Dbp promoter and 6 with synthetic promoters (among which H1b and H1c were newly generated, see below). The complete list of analyzed genes is given in Table EV1. Of note, we did not include in our analysis several clones from Suter et al. (2011), namely Plectin1, H1 1M, H1 2M, H1 3M and H2 3M, since the strength of the signal was not sufficient to resolve the structure of the promoter cycle accurately, as determined by simulation (cf. Section 5). Generation of additional Flp-In H1 cell lines Clones H1b and H1c were co-transfected with pOG44 and a pcDNA5/FRT plasmid carrying the H1 synthetic promoter driving the expression of the short-lived NLS-luc (Suter et al. , 2011). Cells were first cultured at 30 • C for 48h, and transferred into 15cm dishes. 24h later, the medium was supplemented with hygromycin B (InvivoGen) at a final concentration of 200 µg/ml. After 20 days, individual clones were pooled and expanded in selective medium.

New data
Time-lapse luminescence microscopy Time-lapse luminescence microscopy was performed as previously described (Suter et al. , 2011). Briefly, Flp-In H1 cell were diluted 1:50 into nonluminescence NIH-3T3 cells, and plated in a 2.2cm glass bottom dish (WillCo) to a final amount of 9 · 10 5 per dish one day prior to recording. The following day, the medium was replaced by highglucose DMEM without phenol red (Gibco) supplemented with 10% FBS (Sigma), 1% L-Glutamine-Penicillin-Streptomycin solution (Sigma) and D-luciferin (Life Technologies) at a final concentration of 100µM. Luminescence monitoring was performed in a LuminoView LV200 microscope (Olympus) with a cooled C9100-13 EM-CCD camera (Hamamatsu). During luminescence recording, cells were maintained at 37 • C in a humid environment with 5% CO 2 . Images were acquired with an exposure time and time resolution of 5 minutes during 48h. Recording of the luminescence signal was optimized by using a binning of 4, an EM gain of 150 and a 5x photon imaging mode. The time-lapse recordings were then analyzed using the ImageJ software by manual tracking individual cells in an area of 20 pixels over the entire movie.

Stochastic modeling of mammalian gene expression
In this section, we first present a class of nested stochastic promoter cycle models, which extend the standard Telegraph model (Peccoud & Ycart, 1995). We then show how we analyze the bioluminescence time traces of the different by computing the likelihood of the traces conditioned on a given promoter cycle model.

Extension of the Telegraph model: the promoter cycle
The simplest extension of the Telegraph model accounting for the observed refractory period in gene reactivation (Suter et al. , 2011;Harper et al. , 2011) is the proposed sequential inactive gene states model displayed in Figure 1A. In this kinetic model, the promoter transitions are modeled as are not required to capture the unimodal (peaked) property of the measured distributions, the key characteristic of the refractory period (Bel et al. , 2010).
The number of inactive states N defines a class of nested promoter models, which include the simple promoter switch of the Telegraph model as the case N = 1. To account for a refractory period, at least N = 2 sequential inactive states are necessary. Provided that the transition rates between the different inactive states have similar magnitudes, an increase in N will result in a more peaked distribution of silent periods. Importantly, the distribution only retain the signature of the slowest transitions (rate-limiting transitions), significantly faster transition rates are overshadowed by the rate-limiting ones.
In addition to the promoter cycle models, mRNA and protein copy numbers are modeled as birth-death processes. The resulting models are parametrized by N +5 constant rates: the promoter transition rates k a , k 1 ,...,k N , the transcription rate k m , the translation rate k p and the mRNA and protein degradation rates γ p and γ m . Among these kinetic rates, three were measured experimentally (Suter et al. , 2011), namely the translation rate k p and the two degradation rates γ p and γ m which are given in Table EV1, while the other are unknown.

Likelihood calculations
To select the optimal promoter model and estimate the kinetic parameters for the different clones, we extended the computational method that we developed previously (Suter et al. , 2011). As a first step, we computed the probability that the time traces of luciferase activity were generated by our gene expression models. Taking into account the Markovian nature of the models, the likelihood of a bioluminescence time trace D = {s i }, given the number of inactive states N and the set of kinetic parameters θ N , can be written as where P e (s i |p i ) is the probability of observing the signal s i given the amount of protein p i at time This factorization assumes that the amount of mRNA m does not change much during ∆t which requires γ m ∆t 1 (this is fulfilled in practice since γ m 0.01 min −1 and ∆t = 5 min). It is worth pointing out that we did not further factorize P t (mg|m g ) as done in our previous work (Suter et al. , 2011), therefore achieving a better approximation for fast gene transitions.
The protein dynamics is then described by a birth-death process whose master equation can be solved analytically leading to the following propagator: with P the Poisson distribution with parameter λ = kpm γp 1 − e −γp∆t . The mean and the variance 13 of P t (p|p m ) are given by For a large amount of proteins, the transition probabilities were computed using the linear noise approximation. Specifically, when 2σ p < µ p , The transition probabilities of the mRNA's and the gene P t (mg|m g ) are obtained from the following master equation describing the transcriptional dynamics of the promoter cycle: with g = 0 corresponding to the active state and the following periodic conditions g = −1 ≡ N and g = N + 1 ≡ 0. After appropriate truncation on the mRNA copy number (Munsky & Khammash, 2006), the master equation can be written in term of a matrix L containing the propensity functions of the different reactions: where L can be decomposed as sum of tensor products of different matrices: with I M and I N standing for the identity matrix of size M + 1 and N + 1 where M is the maximum number of mRNA states after truncation and N the total number of inactive states of the promoter cycle. The matrix K N encodes the rates of the possible transitions of the promoter cycle and is given as follows: whereas G M and K M describe the degradation and production of mRNA, and zero otherwise. The propagator of the resulting finite system (Eq. 6) can be expressed as a matrix exponential of the L operator: In order to ensure proper normalization of the propagator, we closed the system by imposing K M (M + 1, M + 1) = 0 which prevents probability leakage. In practice, such approximation proved to be convenient when setting a cutoff on the transition probabilities based on predefined percentiles.
In addition, we ensure that M was large enough to capture the most probable transitions for each luminescent time trace.

Efficient likelihood computation
An efficient computation of the likelihood (Eq. 2) is critical since model selection and parameters estimation usually involve many evaluation of this function. The standard method to compute this quantity efficiently is the forward algorithm (Rabiner, 1989;Durbin, 1998). It introduces the so-called forward score F i (pmg) which is defined as the probability of the data up to time i and being, at that time, in state pmg. This quantity can be computed recursively using the following expression: and the probability of the data is then, The core of the forward algorithm (Eq. 9) is nothing more than a sparse matrix vector multiplication provided some cutoff have been applied on the transition probabilities. Regarding the implementation, we took advantage of the factorization of the probability transitions (Eq. 3) in order to reduce the number of nested loops (Algorithm 1). We performed standard code optimization (C++ implementation), using contiguous data structure, data structure alignment and forcing vectorization of the inner loops using SIMD instructions.
The transition probabilities of the protein P t (p|p m ) were computed on the fly and stored in order to avoid computation of unnecessary transitions and to minimize memory usage. In addition, to reduce number of indices for clones with large number of proteins, we aggregated the proteins in groups of b 1 , b 2 , ... states using a linear aggregation operator A (Kemeny & Snell, 1976), where the size of the groups were based on the standard deviation of the protein transition probabilities (Eq.

4).
Therefore the protein transition probabilities in the aggregated space {p} are then given bȳ where D is the disaggregation operator defined as the Moore-Penrose pseudoinverse of A.
Since the full likelihood is essentially the product of the likelihood of each cell time trace, i.e.
, the algorithm has a natural parallel structure which can be exploited. Because the transition probabilities for mRNAs and the gene (Eq. 8) were shared among the different cells, each cell (core) only computed a block of the transition matrix P t (mg|m g ) using the Krylov-based algorithm of Expokit (Sidje, 1998), the final matrix is then reassembled from the different blocks thanks to message passing interface (MPI).

4 Model selection and parameters estimation
According to the standard Bayesian paradigm, we based the whole inference, namely the model selection and the parameters estimation, on the joint posterior distribution: where N is the number of inactive steps, θ N the parameters of the model, D the bioluminescence In the following section, we first define a suitable parametrization for the sampling, then we express the prior distribution in that parameter space and finally we detail the RJ-MCMC scheme we implemented to estimate the number and the timescales of the rate-limiting steps underlying the silent period of the different promoters.

Parametrization of the kinetic model
The unknown kinetic parameters of the model were: k m , k a , k 1 , k 2 , k 3 , ..., k N , which are the transcription rate, and the different rate of transition between the promoter states respectively.
The mean active time is given by τ a = 1/k 1 , and the mean time spent in each inactive state by τ i = 1/k i+1 for i < N and τ N = 1/k a .
As we will see later, a parametrization which proved to be more convenient, can be expressed in terms of the following independent parameters: b, τ a , T , p 1 , p 2 , ..., p N −1 where b = k m τ a is the burst size, T = N i=1 τ i the mean total silent period and p i = τ i /T the fraction of time spent in the ith inactive state. Consequently, p N is determined by the constraint N i=1 p i = 1 and it follows that the {p i } lie on the (N − 1)-simplex.

Symmetry of the promoter cycle
There is an inherent symmetry in our gene expression models, namely the likelihood function L(D|N, Θ N ) is invariant under permutations of the different inactive steps. This property is a direct consequence of the considered promoter cycles since mRNA transcription can only occur when the gene is in the active state. Thus, we can only expect to reconstruct the distribution of active and silent periods, and not the exact sequence of the inactive steps. Indeed, the distribution of silent periods is given by the convolution of the distributions of each inactive step, and since the convolution is symmetric, the ordering of the inactive steps cannot be resolved. However, the shape of the distribution does depend on the time scales of the different inactive steps and therefore, in principle, these can be inferred from the data.

Parameterization of the cycle
To take into account the symmetry previously mentioned, we design an adequate parametrization.
In practice, we restrict the space of the simplex to a subspace for which the p i satisfy an ordering condition p 1 ≥ p 2 ≥ ... ≥ p N . This can be achieved using the following parametrization: where u i ∈ [0, 1]. Recalling that p N is given by the constraint N i=1 p i = 1, we have actually the same number (N − 1) of independent parameters u i than p i .
Change of variables Using the constraint N i=1 p i = 1 and the relation (Eq. 11), we can find the explicit change of variables from the u space to the p space. First, we can express p 1 in terms of the u i by solving the following system of equations: We find that and consequently for k > 1 using relation (Eq. 11), we finally obtain Jacobian Having specified the transformation p → u, we computed the Jacobian which is required to express the prior in the u space given a prior in the p space: In term of the u i , the Jacobian J N −1 has the following form:

Prior distributions
As required in Bayesian inference, we have to define a prior distribution which encodes our initial state of knowledge concerning the models and the parameters. A natural choice was to factorize it in the following manner: In the following subsection we will specify the different terms in the right hand side of the expression above (Eq. 13).

Priors on kinetic parameters
Recalling that the active time τ a and the total silent period T are real positive parameters, the corresponding uninformative prior is log-uniform. Although, it is possible to resolve small timescales providing sufficient amount of data (structural identifiability), in practice the likelihood function can be rather insensitive to small timescale variations (when τ ∆t). In order to restrict the parameters space to resolvable scales, we added a penalization to the log-uniform prior based on the sampling time ∆t: This prior was used for τ a and T and its improper nature did not represent a problem for the inference since these parameters are shared among the different models and the resulting posteriors were normalizable.
The prior for the burst size b was set by constraining the transcription rate k m . Assuming that the effective transcription rate is bounded, a suitable uninformative prior is a bounded uniform distribution. In terms of burst size this condition leads to Finally, the prior on the {u i } was defined with respect to the {p i }, we set a uniform prior on the (N-1)-simplex. Thanks to the Jacobian of the coordinate transform (Eq. 12), the resulting normalized prior in u-space is given by: where the factor (N − 1)!N ! stands for the volume normalization of the (N-1)-simplex and the N ! symmetric sectors.

Prior on the number of inactive steps
We designed a prior for the number of inactive steps, in order to penalize models which lead to unrealistically small timescale for the {τ i }, namely timescales which are smaller than the sampling time ∆t of 5 min. In term of the {p i }, since p N is the smallest fraction due to the ordering condition, this can be stated as p N > λ where λ = τ t /T with τ t the time threshold. With help of geometric arguments, we can express the volume reduction of the simplex space implied by that condition.
We found that the new volume of one symmetric sector S N −1 is given by where S N −1 is the volume of the sector without the constraint on p N . The ratio S N −1 /S N −1 can be viewed as a probability based on the available volume when p N > λ. Normalizing this quantity over N ∈ N * and setting τ t = ∆t leads to the following prior for the number of inactive steps: where Z is the normalization factor:

Reversible Jump MCMC
The Reversible Jump MCMC algorithm extends the scope of the well-known Metropolis-Hastings (M-H) algorithm (Green & Hastie, 2009;Andrieu et al. , 2003). In essence RJ-MCMC is very similar except that the sampling space X = N ∈M ({N } × Θ N ) usually involves parameter spaces Θ N of different dimensions. In order to build a Markov chain on this space with the desired limiting distribution, i.e. the joint posterior distribution P (N, θ N |D), we first need to define suitable mappings between the different model parameters spaces Θ N , task which is carried out by introducing auxiliary variables r N to ensure dimension matching between the spaces. Specifically, we have to Using these mappings, across-model proposal distributions Q N →N are built to "jump" from one model to another. By imposing reversibility of the chain (detailed balance) as in standard MCMC, the acceptance probability A N →N of such a move can be specified to ensure convergence toward the desired posterior distribution.
In where the acceptance probability A N →N of the move N → N is defined as follows: This quantity is expressed in terms of the M-H proposal distribution of the parameters Q(θ N |θ N ), the proposal distribution of the model Q(N |N ), the proposal distribution of the auxiliary variables Q N →N (r N |θ N ) and the Jacobian of the mapping:

Move sets & proposal distributions
Since the choice of the move set, the mappings and the proposal distributions are problem-dependent, we detail in the following paragraphs our problem specific implementation. We introduced three where we took into account the fact that there is only one available transition at the boundary i.e. N = 1 or N = N max . Regarding the value of N max , we set it to 10 and checked afterwards that the model space was large enough to encompass the inferred value from real data.
Within-model move: Although the kinetic parameters of the model are slightly correlated and a multivariate distribution would potentially lead to a more efficient sampling in term of mixing time of the chain, we experienced that factorizing the M-H proposal distribution in different components as follows "is good enough": where the proposal distributions Q s (θ|θ ) for the real positive parameters (b, τ a , T ) were set as lognormal q L (θ; log θ , σ θ ) with their own scale parameter σ θ and the proposal distribution for the u's Q u (u|u ) as beta distribution q B (u; α(u ), β(u )) where α(u ) = 1 + λ N u and β(u ) = 1 + λ N (1 − u ) with λ N a scale parameter depending on N . The different scale parameters were tuned to achieve satisfactory acceptance of the moves.
Forward move: We defined the mapping H N →N +1 , which determines the new fractions of the inactive time p i in terms of the previous ones p i and corresponds to adding a short inactive step p N +1 ≤ p N to the cycle such that the total inactive time T is conserved, as follows: where r ∈ (0, 1] is an auxiliary variable and s = 1 + rp N is a normalization factor. Using relation (Eq. 11), the corresponding mapping in the u-space is simply given by: It follows that the Jacobian J N →N +1 of the transformation is then trivial, J N →N +1 = 1. The proposal distribution Q N →N +1 (r|θ N ) associated with this mapping was set as a uniform distribution on the interval (0, 1].
Backward move: The reverse mapping H N +1→N which removes the smallest inactive step while keeping the total inactive time T constant, is defined as the inverse of H N →N +1 . It follows that the resulting mapping in the u-space is then trivially given by: where r is again an auxiliary variable. Here also, J N +1→N = 1, consistent with the forward move.
In this case, the proposal distribution Q N +1→N = 1 since no auxiliary random variable is required.

Sampling procedure & error estimates
For every set of cells, we performed 10'000 iterations with a burn-in period of 500 to 600 iterations on average, which proved to be enough to start sampling from the joint posterior distribution. We reported the mean and the 5 and 95th percentiles of the marginal posterior distributions as the final estimates for the unknown parameters. We also computed the Monte Carlo standard error (MSE) for the different estimates using overlapping batch means (Flegal & Jones, 2010). The latter quantity provides a good assessment of the mixing properties of the chain. Indeed, computing the ratio between the Monte Carlo variance σ 2 MC and the sample variance σ 2 enables estimation of the serial correlations in the generated sample: where τ is the autocorrelation time. A small value for τ reflects good mixing of the chain, i.e. the smaller the autocorrelations time, the closer the sample will be from independently drawn random variables.

Performance
Given limited amount of data, determining the optimal number of inactive steps of the promoter cycle and the associated partition of the silent period is particularly challenging since the signature of multiple sequential inactive steps in the bioluminescence time traces is subtle. The information regarding the structure of the promoter cycle essentially lies in the shape of the distribution of silent periods which relies on accurate reconstruction of the gene activity pattern from the bioluminescence time traces.
Here, we present the performance of our inference procedure based on RJ-MCMC sampling.
First, we show how the partition of the silent period can be reconstructed from the MCMC samples.
We then aim to validate the method, by analyzing multiple synthetic data sets. We discuss the performance of the method and its limit, in conditions close to the experimental ones. Finally, we present the inferred kinetic parameters of the optimal promoter cycle model for the full set of clones.

Inferring the silent partition
In principle, the parameterization introduced earlier (Section 4.1.2) to address the structural identifiability of the cycle enables the inference of asymmetric partitions, however, it can introduce a bias for symmetric partitions. Indeed, since we restricted ourselves to ordered partitions (p 1 ≥ p 2 ≥ ... ≥ p N ), i.e. we only sample one single symmetric sector of the (N − 1)-simplex, the mean of the sample will be biased toward a decreasing sequence. For example, consider the extreme symmetric case of a flat posterior in the 2-simplex, while on the full space the mean is centered and trivially given by p 1 = p 2 = p 3 = 1/3, computing the mean on a single symmetric sector leads to a biased partition: p 1 = 11/18, p 2 = 5/18, p 3 = 1/9. The magnitude of the bias lessens as the symmetric posterior becomes more peaked.
Assuming that the true posterior distribution of the partition conditioned on the model, i.e. P (p 1 , p 2 , ..., p N |N, D), can be approximated by a Dirichlet distribution with parameters α. Due to the symmetry of the cycle, the effective posterior distribution is invariant under permutation of the {p i }, and thus would be given by where the sum runs over all permutations of the {p i } and B(α) is the multinomial Beta function.
Such a distribution (Eq. 15) provides a way to estimate the mean of the true posterior, by maximizing the likelihood of the the MCMC sample according to α. Indeed, after reordering of the α i , the unbiased mean partition is the given by (α 1 , α 2 , ..., α N )/ N i=1 α i .

Validation with synthetic data
In order to assess the performance of the designed RJ-MCMC sampler, we generated various sets of synthetic data with the Gillespie algorithm (SSA) (Gillespie, 1977). We simulated luminescence time traces according to the microscope noise model (Eq. 1) and the stochastic promoter cycle model, using a sampling time ∆t = 5 min for a total duration of 2 days, mimicking the experimental recording conditions (Suter et al. , 2011).
Toy examples: The first collection of simulated data was designed to illustrate the RJ-MCMC procedure. We generated 4 sets of 64 cells, with the following common parameters: k p = 0.2 min −1 , One simulated cell of the first and last set are displayed in Figure 1B and 1C respectively.
The estimated parameters are given in Table EV2. The posterior distributions of the parameters for the different sets are displayed in Figure 2B and 2C. Overall, we recovered the input parameters with good accuracy. The number of inactive states tend to be slightly overestimated. The optimal number of inactive states for the second, third and fourth data set, defined as the mode of the posterior distribution were 3, 5 and 7 respectively, instead of 2, 4 and 6. Regarding the inferred partition of the total silent period, we recovered successfully the asymmetry of the partition for the second data set, whereas for the third and fourth data set the inferred partition was slightly biased toward a decreasing sequence instead of a fully symmetric partition. This effect was expected, as explained in Section 5.1. Correcting the bias introduced by the restriction of sampling within a single symmetric sector of the (N − 1)-simplex, the estimated partition was then more regular for the third and fourth data set. Finally, we estimated the relative Monte Carlo standard error (RMSE) for the main parameters, the RMSE remained smaller than 1% for most parameters.
Controlled simulations: Here, we explored the limits of our inference framework, namely how the level of expression and short active time τ a affects our inference capability. We generated 48 cells with different mean levels of mRNA expression m , by increasing either the transcription rate k m or both k m and τ a . The following parameters were common in every cell: k p = 0.6 min −1 , γ p = 0.0318 min −1 , γ m = 0.0100 min −1 , T = 57 min and N = 5. In both scenarios, the mean level of expression m was used as a control parameter, ranging from 2.5 to 25 transcripts. In the first scenario we set τ a fixed to 3 min, below the sampling time ∆t of 5 min, and increased the transcription rate according to m using the steady-state expression for the mean: While in the second scenario, we increased both k m and τ a with the constraint that τ a = 6k m . Using the steady-state expression for m , the transcription rate is then simply given by: The estimated parameters for the various conditions are given in Table EV2 and shown in Figure   2D. In both scenarios, all the parameters are satisfactorily recovered, even when the cells are lowly expressed or the active period is smaller than the sampling time.
Heterogeneity bias: Finally, we investigated the effect of heterogeneity in the data on the inferred parameters. This aspect is particularly relevant, even in isoclonal populations, as cells are expected to display heterogeneity due to external and internal environmental fluctuations in addition to the intrinsic stochasticity of gene expression. In practice, such heterogeneity could be modeled as cells having their own kinetic parameters, or alternatively the whole population could be described with common effective parameters. The latter approach, used here, is more straightforward and provides more precise estimates since all the cells contribute to the inference. However, the estimated effective parameters could be biased by not reflecting the mean parameters of the population.
In order to characterize the potential biases due to a global fitting on cell populations, we first generated two populations of 48 cells sharing similar kinetic parameter except the transcription rate which was set to k (1) m = 1 in the first population and to k (2) m = 5 in the second one, we then generated several subset of data resulting from the mixing of the two populations in controlled ratios. We defined a control parameter r reflecting the fraction of cells from the second population in the final mixed collection of 48 cells. The list of shared parameters were the following: k p = 0.2 min −1 , γ p = 0.0318 min −1 , γ m = 0.0102, τ a = 8 min and T = 90 min. The estimated parameters for the different values of r are given in Table EV2 and shown in Figure S1. We predicted the behavior of the inferred parameters based on the hypothesis that the mean mRNA level m of the mixed population is correctly inferred and that the estimated transcription rate is actually the maximum among the underlying populations. This hypothesis provides a good explanation for the observed behavior of the estimated parameters, as shown in Figure S1. Indeed, the inferred mean mRNA expression follows closely the mean expression of the mixed population (red dashed line), computed as the weighted average between the first and second 27 initial population: The inferred transcription rate stays stable around the maximum of the mixed population k (2) m = 5 as r decreases and finally collapses to k (1) m = 1 when r = 0. Consequently, the inferred active period τ a should decrease with r to compensate the extra contribution to the mean expression, in fact the estimated τ a closely follows (red dashed line): for r > 0 which is the predicted behavior of τ a if the aforementioned hypothesis holds true. Regarding the burst size and total inactive period T , these are only slightly overestimated for the mixed population whereas the number of inactive states N is unaffected. Thus, it appears that the inferred active period τ a can be biased toward a lower value in heterogenous population, whereas the inferred transcription rate tends to reflect the maximum rate in the population. Not so surprisingly, the model with the maximum transcription rate is more flexible, as it can account for a larger range of transcription rates (slopes in the mRNA time traces) by modulating the active period or by successive short activations.

Parameters of each clone & optimal model
We then applied our inference framework on the time-lapse cell recordings of the different clonal populations. We used the experimentally measured parameters from Suter et al. (Suter et al. , 2011;Molina et al. , 2013) which are given in Table EV1 .
The estimated number of inactive states, the kinetic parameters and the partition of the silent period for the different clones are given in Table EV3 and shown in Figure 3A-D. The optimal model was defined from the mode of the posterior distribution of the number of inactive states. The fraction of time the gene is active is given by f a = τ a /(τ a + T ) which is displayed in percent. In Figure 3A-B, we show the 95% confidence ellipses which were computed from the sample covariance of the joint posterior distribution of the burst size b = k m τ a and the active fraction f a (Figure 3A), and between the total silent period T and the burst size b ( Figure 3B). The corrected partition of the silent period (cf. Section 5.1) was used in Figure 3D.

Silent period, active period & steady-state distribution
In order to assess the consistency of the promoter cycle model with the data, we compared the distribution of silent periods, active periods and the mRNA steady-state distribution predicted by the optimal model (modeled distributions) with the distributions inferred from the bioluminescence time traces. For that purpose, we designed a Gibbs sampler which enables reconstruction of those distributions conditioned on the data, using the optimal model as a prior.
Below, we first review the Gibbs sampling approach, and then derived the theoretical distributions according to the model. Finally, we show the comparison between the inferred and modeled distributions for simulated cell populations and the full set of experimental clones.

Gibbs sampling
Once model selection and parameter estimation was achieved, we reconstructed the mRNA steadystate distribution as well as the distribution of silent periods conditioned on the data. This task requires the probability of a hidden states trajectory Λ = {p i m i g i } given the bioluminescence time traces and the parameters: where P (D, Λ|N, θ N ) is nothing else than the likelihood (Eq. 2) without the marginalization over all trajectories Λ. We sampled the joint distribution P (Λ|D, N, θ N ) using Gibbs sampling (Gelfand & Smith, 1990). This MCMC variant allows sampling from the joint distribution by sampling iteratively each state λ i ≡ p i m i g i from the conditional distribution: This quantity is much easier to compute than the joint distribution P (Λ|D, N, θ N ). In our case it reduced to: In practice, at every iteration, we chose randomly a time point i and we resampled the state of the system λ i from the conditional distribution above. We recorded a hidden state trajectory Λ

Modeled distribution of silent & active periods
The silent period of the promoter cycle, i.e. the time between two successive activations, formally follows a phase-type distribution which describes Poisson processes occurring sequentially or in parallel. Since in practice the temporal observations are discrete in time, with intervals given by the sampling time ∆t, we could potentially miss activation events between two observations, provided the on-time is short enough compared to ∆t. The discrete waiting time, which takes into account the missing events, can be computed as follows. Starting in g = 1 the first inactive state, we defined an initial vector F 0 of size N + 1: F 0 (g) = 1 if g = 1 0 otherwise and the following recursion scheme: where the sum runs over all promoter states except the active one g = 0 and P t (g|g ) the transition probabilities of the gene expressed as the matrix exponential of the transition rate matrix K N of the promoter cycle (Eq. 7): The discrete waiting time distribution P W is then given by: which is the probability of having observed the gene in the active state for the first time after n observations, or in term of time after t = n∆t, knowing that the system was initially in the first inactive state.
The discrete distribution of active periods can computed in a similar manner. Starting in g = 0 the active state, we defined an initial vector F 0 of size N + 1: F 0 (g) = 1 if g = 0 0 otherwise 30 and the following recursion scheme: The discrete waiting time distribution P W is then given by: where the sum runs over all promoter states except the active one g = 0.

Modeled steady-state distribution of mRNA
Although an analytical expression for the mRNA steady-state distribution of the promoter cycle was derived by Zhang et al. (Zhang et al. , 2012), this expression proved to be rather unpractical.
Instead, we solved numerically the master equation (Eq. 6) at steady-state: The steady-state distribution of mRNA is then given by the marginal distribution P s (m) = g P s (mg).

Inferred vs modeled distributions
We first validated our approach, by performing Gibbs sampling on the toy examples presented in section 5.2. We then applied the method on the full set of clones, and compared the distributions predicted by the optimal model (modeled distributions) with the inferred ones.
Toy examples: Having already determined the optimal model for the four sets of simulated cells described in section 5.2 (Toy examples paragraph), we then performed Gibbs sampling using the optimal model as a prior to reconstruct the distributions of silent periods, active periods and mRNAs for each synthetic population. We also computed the modeled distributions from the optimal model according to sections 6.2 and 6.3. The resulting distributions are given in Figure S2A, S3A and S4A respectively. Both approaches lead to very similar results, displaying an excellent agreement, which was in fact expected in this case. Indeed, the analyzed population were homogenous (sharing the same kinetic parameters), therefore the optimal model should fully capture the stationary properties of the data.
Clones: We followed a similar approach than described in the previous paragraph, applying the method on the full set of clones, whose inferred kinetic parameter are listed in Table EV3. The resulting distributions, namely the silent period, the active period and mRNA steady-state distribution, are given in Figure S2B, S3B and S4B respectively. Regarding the silent periods ( Figure S2B), the modeled distribution matched fairly well the inferred one, confirming the previously observed signature of sequential inactive processes in the silent period of the genes (Suter et al. , 2011).
Similarly, the modeled active periods ( Figure S3B) are consistent with the inferred one. Although refractory periods in the active state cannot be ruled out, we did not observe peaked active periods in the analyzed clones. Since the average active periods were rather short compared to the sampling time ∆t, it would have been difficult to detect the signature of sequential active states. Of note, we previously reported that Ctgf after serum stimulation exhibited a first longer lasting burst (τ a ∼ 30-40 min) whose duration was constrained and followed a peaked distribution (Molina et al. , 2013).
On the other hand, the inferred mRNA steady-state distributions ( Figure S4B) can deviate from the modeled one. Inferred distributions tend to be wider, confirmed by the larger noise value σ 2 / m 2 . Moreover, some clones (NcKap1 and Ctgf) possess a peak at lower expression level reflecting substantial heterogeneity in mRNA expression in the cell population. Notably, Dbp shows considerable weight in its distribution at very low transcripts number (around 0 and 1 transcripts), resulting from long silent periods due to the circadian oscillations. We already showed evidence in a previous study (Molina et al. , 2013), that Gibbs sampling enables faithful reconstruction of the true dynamics, provided that data is strong enough compared to the prior (here the optimal model), consequently the observed features are unlikely to be artifacts of the method. Further evidence thereof is given in section 7.3.

Biological noise
The stochastic description of gene expression provides an explicit framework to connect the mechanisms involved in the expression process, although quite coarse-grained, with the observed fluctuations in number of proteins or mRNAs. Typically, those fluctuations can be characterized by defining dispersion measure on the steady-state distribution of molecules. The most common measures are the noise η 2 = σ 2 µ 2 defined as the variance over the mean squared, and the noise strength, also called Fano factor, F = σ 2 µ which can be interpreted as a measure of deviation from the Poisson distribution.
An important step which is required in order to characterize the fluctuations of the system is to disentangle the different component of the measured noise, namely to separate the intrinsic contributions which are explicitly modeled from the extrinsic ones resulting from a fluctuating environment. In the following subsections, we first review analytical results regarding intrinsic transcriptional noise of the promoter cycle, then show how the noise propagates at the protein level and finally we present an approach to estimate the intrinsic and extrinsic component of the measured transcriptional noise.

Transcriptional noise of the promoter cycle
Here, we compute the noise of the promoter cycle by solving the equations for the mean and the covariance, derived from the master equation (Eq. 5), at steady-state (Lestas et al. , 2008). The mean of the species n = ( m , g 0 , ..., g N −1 ) t is governed by the following differential equation: where f ( n ) = i r i w i ( n ) with r i the state change vectors, w i the propensity functions and the sum runs over all reaction channels. Using the fact that g N = 1 − N −1 k=0 g k and introducing the different timescales of the system as defined in Section 4.1, with τ m = γ −1 m , we obtained: since the propensity functions are linear functions of the species, f ( n ) = A n + f 0 where A is the drift matrix. At steady-state, the mean of the species n is obtained by solving the following linear system: A n + f 0 = 0 (16) The mean of the different species is then given by Owing to the linearity of the propensity functions, the covariance of the species Σ ij = ∆n i ∆n j , with ∆n i = n i − n i , is exactly governed by the following time-dependent Lyapunov equation (Lestas et al. , 2008): with D the diffusion matrix defined as D = i r i w i ( n )r t i where, again, the sum runs over all reactions channels. By solving the Lyapunov equation at steady-state for increasing number of inactive states N , i.e.
it can be shown recursively, after proper factorization of the different terms 1 , that the variance on the mRNA copy number σ 2 = Σ 11 is exactly given by the following expression: where η 2 c depends on the number of inactive states N and the timescales: 1 This tedious process was performed using Mathematica R It is worth mentioning that the same expression for the variance was already obtained by Zhang et al. (Zhang et al. , 2012), namely they derived the analytical expression for the steady-state distribution of the promoter cycle and its moment generating functions, which enabled them to compute the aforementioned expression.
The intrinsic transcriptional noise η 2 = σ 2 m 2 is then given by where η 2 p is the Poisson contribution to the noise due to the stochastic production and degradation of mRNAs, and η 2 c is the promoter noise due to random fluctuations in the promoter cycle. It was already clear that the noise contribution of the cycle η 2 c could be lowered independently of the expression level (Pedraza & Paulsson, 2008), by increasing the number of inactive states, provided that the timescales {τ i } are similar, and effectively reducing the variance of the total inactive time T . Moreover, the reduction is optimal if all the timescales of the inactive subprocess are equal (Zhang et al. , 2012), namely if τ i = T /N ∀i.
In order to acquire an intuitive understanding of the noise dependance on the promoter cycle parameters, we then aim to investigate various limiting regime. Without loss of generality, the two components of the transcriptional noise (Eq. 20) can be expressed in term of fundamental dimensionless quantities: where b is the burst size, f the normalized completion time of the cycle f = (τ a + T )/τ m and C a factor which depends on the structure and the timescales of the cycle, i.e. C(τ m , τ a , {τ i }). The factor C is always positive and smaller or equal to 1. This property follows from the fact that, for fixed τ m , τ a and T , the promoter noise η 2 c is always maximal for N = 1, namely η 2 c,(N =1) ≥ η 2 c,(N >1) (Zhang et al. , 2012), and consequently In a constitutive regime (T τ a ), C goes to zero, while in a bursting regime (τ a T ∼ τ m ), C goes to one for a simple switch (N = 1).
The noise reduction provided by an increase in number of inactive states N is encoded in the factor C, namely C tends to decrease while N increases. This effect can be further characterized by assuming that the decay time of the transcripts τ m is much larger than the timescales of the inactive subprocesses, i.e. τ i τ m ∀i or equivalently T τ m . Formally, computing the limit of C when τ m goes to infinity, we obtain: Further assuming that the timescales of the inactive subprocesses are equal, τ i = T /N ∀i, consequently getting a lower bound on C, leads to a simple expression: It follows that the noise reduction provided by N is optimal for τ a T . Indeed, in the bursting limit, C solely depends on N : In that regime, the Fano factor only depends on the burst size and the number of inactive states: Similarly, the promoter noise takes a simple form: Interestingly, at fixed T /τ m , the promoter noise η 2 c decreases approximately as 1/N , and at large N its contribution can be halved compared to a simple switch (N=1).
Although the assumptions used to derive the expression above might sound drastic, the approximation remains valid for the analyzed genes since they are bursting and the decay time τ m ∼ 100 minutes is generally larger than the silent period T . In fact, the assumption that T τ m can be relaxed. Only assuming a bursting regime (τ a T ) and equal partition of the silent period (τ i = T /N ∀i), the limit of C for large N is given by where α = T τm . Of note, if T τ m , the number of inactive steps N does not provide any noise reduction: While for T τ m , the noise reduction is optimal: lim α→0 C(α) = 1 2 In the case where T ∼ τ m , i.e. α ∼ 1, the maximal reduction provided by C is still fairly close to the one obtained in the optimal regime T τ m . Indeed, C(1) 0.58, which only yields an error of 16% on the maximal attainable noise reduction.

Protein noise
So far, we focused on mRNA noise and here we will investigate how transcriptional noise propagates and its mRNA (τ 1/2 ∼ 40-100 min) are artificially shorter than what is measured for endogenous mRNA and proteins (Sharova et al. , 2009;Schwanhäusser et al. , 2011), it is important to examine the effect of longer half-lives on the modeled noise.
Since in our model the proteins are governed by a simple birth-death process coupled through the effective translation rate with the mRNAs dynamics, the protein noise results from this one stage cascade. With good approximation (Pedraza & Paulsson, 2008), the protein noise is given by with τ p = γ −1 p . We recognized in the first term the Poisson contribution and the second one describes the time averaging or filtering of the mRNA noise. This expression would be exact if the transcripts were simply described by a birth-death process, and it remains a good approximation for the promoter cycle, as long as the decay time of the proteins τ p and the transcripts τ m are large compared to the timescales of the cycle τ a and {τ i }.
By comparing the exact protein noise, computed numerically by solving the Lyapunov equation for the full system (Eq. 18), with the mRNA noise predicted by the optimal model for the different clones, we show in Figure S9A that the noise transmission closely follows the expression above (Eq. 22). Indeed, computing the noise using the experimentally measured decay time τ p ∼ 30 min and τ m ∼ 100 min (see Table EV1) and using more realistic values (τ p = 5 h and τ m = 3 h), in both cases the slope of the linear regression matches closely the factor τ m /(τ p + τ m ). Thus, the mRNA noise is transmitted linearly at the protein level, since the protein Poisson noise is negligible in practice, owing to the large protein copy number p > 100 observed in most clones.
Having shown a simple linear relation between transcriptional and protein noise in our model, we can already anticipate that the results related to noise reduction at the transcriptional level should remain valid at the protein level. Indeed, when estimating the noise reduction as a ratio of the noise of the one state model compared to the N states model, the prefactor τ m /(τ p + τ m ) drops. Moreover, the noise reduction offered by the N states model at the protein level is almost unaffected by the corrected values of τ m and τ p ( Figure S9B).

Separation of extrinsic & intrinsic noise
One approach to disentangle the intrinsic noise from the extrinsic noise is based on the dual reporters setup Elowitz et al. , 2002), where two reporters of different color inserted in the same cell are driven by identical promoters. Measuring the correlation between the expression of the two reporters in a isogenic cells population allows estimation of the extrinsic component of the noise, namely fluctuations due to the environment (fluctuations in transcription factor copy number, polymerase, etc) affecting both genes equally. Unfortunately, adapting the dual reporters approach to our short-lived luciferase assay seems unrealistic for endogenous genes, since we would have to target both alleles of the randomly selected gene, which in itself would be extremely challenging, moreover both alleles are not necessarily expressed and identical. Therefore, we decided to split the noise based on the assumption that the environment is static, neglecting the extrinsic dynamics which cannot be deciphered without explicit treatment in the case of a single reporter. Such an approximation is satisfactory as long as the extrinsic processes are slow enough such that they can be considered constant during the recording period, otherwise the intrinsic contribution would be overestimated (Hilfinger & Paulsson, 2011).
Applying the law of total variance which splits the variance into the expectation of a conditional variance and the variance of a conditional expectation, we separated the total transcriptional variance of the cell population σ 2 m in term of a variance σ 2 m|i coming from the environment of the different cells i, and an unexplained variance σ 2 m|i : where σ 2 m|i is the transcriptional variance of i th cell, m i its mean and m the mean of the population. The weights w i account for unequal recording durations, and were chosen such that the length of the recording of i th cell. The unexplained variance can be related to the intrinsic contribution σ 2 and the explained one to the extrinsic contribution σ 2 e (Hilfinger & Paulsson, 2011). Although the intrinsic contribution as defined above does not necessarily correspond to the intrinsic variance in absence of extrinsic noise (the mean variance is usually not equal to the modeled variance obtained with the mean kinetic parameters), the splitting still 38 provides a good approximation of intrinsic variability as demonstrated in simulated data below.
Validation: Having defined a splitting of the total variance in the context of a static environment, we tested whether we could potentially separate both components in data subject to extrinsic noise using Gibbs sampling (see Section 6.1). Moreover, we determined what fraction of intrinsic and extrinsic noise the optimal model captures in that condition.
For that purpose, we generated 24 populations of 48 heterogenous simulated cells (see Section 5.2) whose parameters were drawn from a multivariate distribution. We sampled the parameters of the cells as follows, we first set the mean expression of the population m = 10, the mean active period τ a = 6 min and the mean silent period T = 60 min. We then imposed correlations between the different parameters: m and τ a (r=0.4), τ a and T (r=0.5), m and T (r=-0.4). In addition, we fixed the coefficient of variation (CV) of τ a and T to 0.2, and defined the CV of m as a control parameter, varying from 0.3 to 0.9, in order to span a representative range of total noise. Finally, we sampled the different cells using a lognormal multivariate distribution for the three parameters, and set k m = mγ m (τ a + T )/τ a with γ m = 0.01 min −1 . Special care was taken to avoid lowly expressed cells (m < 1) and very short active period (τ a < 1). The number of inactive states was fixed to N = 2 with a silent partition close to symmetric.
On those heterogenous cell populations, we first applied our inference framework to determine the optimal model and its kinetic parameters, defined as the mode of the posterior distribution of N and the mean of the posterior distribution of k m , τ a and {τ i }. We then estimated the noise η 2 m of the optimal model as the mean of its posterior distribution using equation (Eq. 20). Finally, we performed Gibbs sampling on all the cells individually using the optimal model as prior, to reconstruct the single-cell mRNA distributions and thus estimate the mean expression m i and the variance σ 2 i of each cell ( Figure S5A). We verified that the single-cell Gibbs estimates recovered accurately the true single-cell mean and variance computed from the Gillespie mRNA time traces of the different cells ( Figure S5B). Then, using equation (Eq. 23), we computed the intrinsic and extrinsic contributions of the population total noise, from the single-cell Gibbs sampling estimates and directly from data.
First, we verified that Gibbs sampling enables accurate reconstruction of the population mRNA distribution defined as the mixture of each single-cell distribution, and thus provides an excellent estimation of the total transcriptional noise in population with heterogenous kinetic parameters.
The population distribution predicted by optimal model captures mainly the intrinsic variance (average single-cell variance in the data). Finally, we compared the population noise estimates provided by both Gibbs sampling and the optimal model in all the 24 heterogenous populations with varying amount of extrinsic noise ( Figure S5C-D). Gibbs sampling consistently recovers the total and intrinsic noise in the true populations while the optimal model offers a good estimate of the intrinsic noise though it slightly overestimates the true intrinsic noise by ≈ 13%. Notably, both methods provide estimates that are rather insensitive to the amount of extrinsic noise in the data.
Thus, in the context of static environment, we showed that the optimal model captures mainly intrinsic noise and that Gibbs sampling estimates can be used to separate the total noise in term of intrinsic and extrinsic noise.
Clones: Here, we aimed at estimating the extrinsic component of the transcriptional noise in our set of clones. As we have seen in Figure S4B, the steady-state distribution of mRNA estimated via Gibbs sampling do not necessarily agree with the theoretical distribution predicted by the optimal model. There is often a significant excess of noise in the Gibbs estimates which is not captured by the model. Since we have shown in previous paragraph that the Gibbs estimates reflect total noise with good agreement and that the noise predicted by the optimal model is close to intrinsic noise, those deviations observed in the clones constitute strong evidences for extrinsic noise. In the clones, the optimal model recovers on average 56% of the total noise ( Figure S4B).
Considering the deviations in term of noise estimates between the optimal model and Gibbs sampling, we then verified whether they could be explained in the context of static environment, namely if the model noise indeed reflect the estimated intrinsic component. We first separated the total transcriptional noise according to equation (Eq. 23) from Gibbs sampled single-cell distribution as explained in previous paragraph, the resulting intrinsic and extrinsic contributions to noise are shown in Figure 5B. The error bars correspond to parametric bootstrap confidence intervals (5 and 95 percentiles) obtained by fitting the mean m i and the variance σ 2 i of the different cells with a multivariate lognormal distribution. Since circadian clones (Bmal1a, Bmal1b and Dbp) are subject to clear dynamical extrinsic variability due to the circadian oscillations (Suter et al. , 2011), attempting to split the total noise of those clones based on static environment assumption would have led to underestimation of extrinsic noise. Therefore, we excluded them from this part of the analysis. Finally, we compared in Figure 5C the estimated intrinsic contribution to the optimal model prediction, on average the optimal model recovers 83% of the intrinsic noise.

8 Single-cell RNA-seq
To test the prediction of the promoter cycle model regarding biological noise for TATA and TATAless promoters genome-wide, we aimed at inferring the noise from single-cell RNA-seq in mouse embryonic stem-cells (Grün et al. , 2014). We first modeled the measured noise in term of mRNA counts in function of the biological distribution of mRNA, the sensitivity of the technique and variability of the recovery rate across cells. We then use the derived expression of the measured noise to infer the biological noise over the range of expression covered by TATA and TATA-less genes.

Modeling single-cell RNA-seq
Single-cell RNA-seq combined with unique molecular identifier (UMI) enables quantification of the mean transcript abundance and the transcript variability genome-wide (Grün et al. , 2014). The technique can be modeled as binomial sampling of transcripts for each gene with a probability q corresponding to the sensitivity or recovery rate. Providing some variability on the sensitivity q among the cells, the probability to measure n transcripts for a given gene in a single-cell can be expressed in term of the true distribution of transcripts P (m| m , σ 2 m ) and the distribution of the sensitivity P (q| q , σ 2 q ) as follows: P (n| q , σ 2 q , m , σ 2 m ) = dq m P b (n|m, q)P (m| m , σ 2 m )P (q| q , σ 2 q ) where P b (n|m, q) stands for the binomial distribution with m draws (true number of transcripts) and probability of success q. The first two central moments n and σ 2 n can be directly calculated from expression (24) in term of the distribution parameters ( q , σ 2 q , m , σ 2 m ) i.e. the mean and the variance of the sensitivity and the mRNA copy number. We then obtained: Thus, the measured noise η 2 n can be expressed in term of the biological noise η 2 m and the sensitivity noise η 2 q as follows: The measured noise results from a combination of different contributions, namely the sampling noise that leads to Poisson noise 1 q − 1 + η 2 q 1 m , the sensitivity noise η 2 q that set a lower bound on the measured noise for highly abundant transcripts, and the biological noise η 2 m up to a factor 1 + η 2 q .

Inference of biological noise genome-wide
In order to inferrer the average biological noise η 2 m all along the range of expression covered by the genes, we first partitioned the measured noise η 2 n in K = 8 bins S k with mean number of counts n k , such that the partition covered the measured range of average counts with similar number of data points in each bin. We then fitted the expression for the measured noise η 2 n (25) in function of the average number of counts n k in the different bins. We assumed gaussian measurement noise (characterized by σ 2 ) on the log-value of η 2 n (multiplicative noise) and expressed the likelihood of the measured η 2 n in term of the parameters q , η 2 q , η 2 m , as follows: N y i ∈ S k |µ k = log η 2 n ( n k , η 2 m , q , η 2 q ), σ 2 where y i stands for the log-value of the measured noise η 2 n of gene i belonging in the bin S k with average number of counts n k and N y|µ, σ 2 corresponds to the normal distribution with mean µ and variance σ 2 . The sensitivity parameters q and η 2 q were estimated from spike-in experiments, while the global parameter σ 2 and the biological noise η 2 m for each bin were inferred as the mean of the posterior distribution estimated by MCMC sampling. Having inferred η 2 m in function of n , we finally converted the average number counts to mean mRNA copy number as m = n / q .

Control
The control data were obtained by pooling several thousands cells together and splitting the pool in single-cell equivalent (Grün et al. , 2014). Although such an approach reduce the biological variability to negligible level, the splitting introduces again Poisson noise for lowly to moderately abundant transcripts. Indeed, the splitting process can be modeled as multinomial sampling from the pool with probability f g for each gene corresponding to the fraction of available transcripts. Since a large number of cells are pooled together, the fraction f g is essentially given by f g = mg G g=1 mg with m g the average transcript abundance for gene g. Assuming that each cell contain on average M transcripts in total (∼ 5 · 10 5 transcripts in mESC cells), the joint probability for transcript copy number of each gene n g is then given by the following multinomial distribution: P (n 1 , ..., n G |f 1 , ..., f G , M ) = M ! n 1 ! · ... · n G ! f n 1 1 · ... · f n G G Since M is large, the joint distribution above can be well approximated by a product of independent binomial distributions, that can be further approximated to Poisson distributions due to small f g : P (n 1 , ..., n G |f 1 , ..., f G , M ) ≈ G g=1 P b (n g |f g , M ) ≈ G g=1 P p (n g |λ g = M f g ) Thus, the number of transcripts in the control cells is subject to Poisson noise η 2 m = 1 m . Variability on the total number transcripts per cell M would only add a constant term independent on the average expression m .
We then performed inference of the biological noise η 2 m in the control data (both 2i and serum condition) according to previous paragraph, using initially the value for q and η 2 q estimated from the spike-in (3.4% and 0.05 respectively). We recover the predicted Poisson noise in the control, up to a multiplicative factor in expression, suggesting that the average sensitivity q is likely underestimated by the spike-in experiments. Indeed, it was mentioned that sensitivity estimation performed by single molecule RNA-FISH led to higher q (around 10%), the discrepancy was speculated to result from degradation of spike-in RNA or cellular RNA being more protected during cell lysis (Grün et al. , 2014). Thus, we estimated η 2 m using q = 0.1 that led to better agreement between the theoretical Poisson noise and the inferred curve ( Figure 7C-D), although a systematic excess of noise remains that is not explained by the simple model.
Single cells In real cells, the biological noise results from a combination of intrinsic noise and extrinsic noise η 2 m = η 2 + η 2 e . As we have seen previously, the intrinsic noise η 2 can be expressed in term of Poisson noise η 2 p = 1 m and promoter noise η 2 c (Equation 20). Providing both alleles are transcribed and uncorrelated, the promoter noise would only contribute half as much compared to a single allele. Thus the modeled biological noise should behave as follows: According to our inference of promoter-cycle model from time-lapse measurement of single-allele driven luminescent reporter, we identified two groups of promoters: simple promoter with a single silent interval (N = 1) mainly associated with TATA box and more complex promoter cycle (N > 1) observed only in TATA-less promoters. Providing a significant fraction of genes are bursting, the promoter noise would be approximately given by η 2 c = f 2 1 + 1 N with f = τa+T τm . Further assuming that the average mRNA life-time are roughly similar for TATA and TATA-less genes (Sharova et al. , 2009), TATA promoters owing to N = 1 should exhibit larger promoter noise than TATA-less promoters.
To test this hypothesis, we inferred the biological noise η 2 m in single-cell data (both 2i and serum condition) for TATA and TATA-less promoters according to the EPD database (Dreos et al. , 2013).
We found a significant enrichment for noise in TATA promoters compared to TATA-less promoters all along the range of average mRNA level ( Figure 7C-D with q = 0.1), though the difference lessen with expression. The latter effect could result from an increase of mRNA life-time τ m and a lengthening of the active period τ a for highly expressed transcripts. We verified whether the mRNA half-life correlates with expression, we did not find a significant trend or difference between TATA and TATA-less genes (Sharova et al. , 2009). Thus, the effect likely results from a lengthening of the 43 active period τ a in both TATA and TATA-less genes at high expression. Moreover, the biological noise reaches a plateau at high expression suggesting substantial amount of extrinsic noise in mouse embryonic stem cells. Interestingly, the noise is overall larger in serum than 2i condition, consistent with the fact that serum condition leads to more heterogenous cell population. Finally, we tested wether the number of alleles would affect the inferred biological noise (reduction of promoter noise by a factor 2 for two alleles), by comparing the noise of genes on chromosome X (single allele) with the noise of autosomal genes in both conditions ( Figure S8A-B). Although less significant, we still find a similar enrichment for noise on chromosome X as predicted by the simple model, thus supporting the hypothesis that the observed noise enrichment for TATA promoters is of intrinsic nature (promoter noise). Quantitatively, the noise difference between TATA and TATA-less promoter around 0.1-0.2 ( Figure 7C-D) is in the range of the predicted value of f /4, providing T ∼ 1-2 hours and τ m ∼ 1-20 hours. 44