Precision measurement of cis- 1 regulatory energetics in living cells 2

Gene expression in all organisms is controlled by cooperative interactions between 8 DNA-bound transcription factors (TFs). However, measuring TF-TF interactions that occur at 9 individual cis-regulatory sequences remains diﬃcult. Here we introduce a strategy for precisely 10 measuring the Gibbs free energy of such interactions in living cells. Our strategy uses reporter 11 assays performed on strategically designed cis-regulatory sequences, together with a biophysical 12 modeling approach we call “expression manifolds”. We applied this strategy in Escherichia coli to 13 interactions between two paradigmatic TFs: CRP and RNA polymerase (RNAP). Doing so, we 14 consistently obtain measurements precise to ∼ 0 . 1 kcal/mol. Unexpectedly, CRP-RNAP interactions 15 are seen to deviate in multiple ways from the prior literature. Moreover, the well-known RNAP 16 binding motif is found to be a surprisingly unreliable predictor of RNAP-DNA binding energy. Our 17 strategy is compatible with massively parallel reporter assays in both prokaryotes and eukaryotes, 18 and should thus be highly scalable and broadly applicable. 19


21
Cells regulate the expression of their genes in response to biological and environmental cues. A 22 major mechanism of gene regulation in all organisms is the binding of transcription factor (TF) 23 proteins to cis-regulatory elements encoded within genomic DNA. DNA-bound TFs interact with 24 one another, either directly or indirectly, forming cis-regulatory complexes that modulate the 25 rate at which nearby genes are transcribed (Ptashne and Gann, 2002;Courey, 2008). Different 26 arrangements of TF binding sites within cis-regulatory sequences can lead to different regulatory 27 programs, but the rules that govern which arrangements lead to which regulatory programs remain 28 largely unknown. Understanding these rules, which are collectively called "cis-regulatory grammar" 29 (Weingarten-Gabbay and Segal, 2014), is a major challenge in modern biology. carried out to comprehensively annotate cis-regulatory elements in model organisms (modENCODE 35 bioRχiv preprint able to distinguish between two qualitatively different mechanisms of transcriptional activation: 94 "stabilization" of RNAP-DNA binding (also called "recruitment" (Ptashne, 2003)) versus "acceleration" 95 of the transcript initiation rate by DNA-bound RNAP. Contrary to prior in vitro studies, we find that 96 in vivo class II activation by CRP at -41.5 bp occurs exclusively through stabilization, not acceleration. 97 Second, we were able to measure the strength with which both CRP and RNAP bind their respective 98 sites. This strength is quantified by the grand canonical potential (denoted here by ΔΨ), which 99 accounts for the Δ of binding as well as the in vivo concentration of each protein. Importantly, we 100 find that the actual in vivo ΔΨ of RNAP-DNA binding deviates substantially from the predictions of 101 the established RNAP binding motif. This result highlights the perils of assuming simple models for 102 protein-DNA binding energy when modeling the biophysics of transcriptional regulation. 103 In what follows, we first illustrate this expression manifold strategy in the context of simple 104 repression, which provides a general way to measure the ΔΨ of TF-DNA binding. This strategy 105 is then used to measure the ΔΨ of CRP binding to a near-consensus DNA site that we use in 106 subsequent experiments. Next we show how expression manifolds, inferred from measurements  114 We begin by showing how expression manifolds can be used to measure the in vivo strength of 115 TF binding to a specific DNA binding site. This measurement is accomplished by using the TF of 116 interest as a transcriptional repressor. We place the TF binding site directly downstream of the 117 RNAP binding site so that the TF, when bound to DNA, sterically occludes the binding of RNAP. We  121 for this type of simple repression. In this model, promoter DNA can be in one of three states: 122 unbound, bound by the TF, or bound by RNAP. These three state are assumed to occur with a 123 relative frequency that is consistent with thermal equilibrium, i.e., with a probability proportional to 124 its Boltzmann weight. 125 The energetics of protein-DNA binding determine the Boltzmann weight for each state. By 126 convention we set the weight of the unbound state equal to 1. The weight of the TF-bound state is 127 then given by = [TF] where [TF] is the concentration of the TF and is the affinity constant 128 in inverse molar units. Similarly, the weight of the RNAP-bound state is = [RNAP] . In what 129 follows we refer to and as the "binding factors" for the TF-DNA and RNAP-DNA interactions, 130 respectively. We note that these can also be written as = −ΔΨ ∕ and = −ΔΨ ∕ where is bioRχiv preprint promoter DNA can transition between three possible states: unbound, bound by a TF, or bound by RNAP. Each state has an associated Boltzmann weight and rate of transcript initiation. is the TF binding factor and is the RNAP binding factor; see text for a description of how these dimensionless binding factors relate to binding affinity and binding energy. sat is the rate of specific transcript initiation from a promoter fully occupied by RNAP. (B) Transcription is measured in the presence ( + ) and absence ( − ) of the TF. Measurements are made for promoters containing RNAP binding sites of differing binding strength (blue-yellow gradient). (C) If the model in panel A is correct, plotting + vs. − for the promoters in panel B (colored dots) will trace out a 1D expression manifold. Mathematically, this manifold reflects Equation 1 and Equation 2 computed over all possible values of the RNAP binding factor with the other parameters ( , sat ) held fixed. Note that these equations include a background transcription term bg ; it is assumed throughout that bg ≪ sat and that bg is independent of RNAP binding site sequence. The resulting manifold exhibits five distinct regimes (circled numbers), corresponding to different ranges for the value of that allow the mathematical expressions in Equations 1 and 2 to be approximated by simplified expressions. In regime 3, for instance, + ≈ − ∕(1 + ), and thus the manifold approximately follows a line parallel to the diagonal but offset below it by a factor of 1 + (dashed line). Data points in this regime can therefore be used to determine the value of . (D) The five regimes of the expression manifold, including approximate expressions for + and − in each regime, as well as the range of validity for .

Strategy for measuring TF-DNA interactions in vivo
In the absence of the TF ( = 0), the rate of transcription becomes 141 − = sat 1 + + bg . (2) Our goal is to measure the TF-DNA binding factor . To do this, we create a set of promoter 142 sequences where the RNAP binding site is varied but the TF binding site is kept fixed. We then mea-143 sure transcription from these promoters in both the presence and absence of the TF, respectively 144 denoting the resulting quantities by + and − ( Figure 1B)  on promoters for which CRP represses transcription by occluding RNAP. Each promoter assayed contained a near-consensus CRP binding site centered at +0.5 bp or +4.5 bp, as well as an RNAP binding site with a partially mutagenized -35 region (gradient). + (alternatively, − ) denotes measurements made in JK10 cells grown in the presence (absence) of the small molecule cAMP. (B) Dots indicate measurements for 42 such promoters. A best-fit expression manifold (black) was inferred from = 40 of these data points after the exclusion of 2 outliers (gray 'X's). Gray lines indicate 100 plausible expression manifolds fit to bootstrap-resampled data points. The parameters of these manifolds were used to determine the CRP-DNA binding factor and, equivalently, the grand canonical potential ΔΨ = − log . See Materials and Methods for more information about our curve fitting procedure and the reporting of parameter uncertainties. manifold should follow the specific mathematical form implied by Equations 1 and 2 when is 149 varied and the other parameters ( sat , bg , ) are held fixed. See Figure 1C. 150 The geometry of this expression manifold is nontrivial. In particular, when ≫ 1 and bg ∕ sat ≪ 1, 151 there are five different regimes corresponding to different values of the RNAP binding factor 152 for which the expressions for + and − approximately simplify. These regimes are listed in Figure   153 1D. In regime 1, is so small that both + and − are dominated by background transcription, i.e., 154 + ≈ − ≈ bg . is somewhat larger in regime 2, causing − to be proportional to while + remains 155 dominated by background. In regime 3, both + and − are proportional to in this regime, with 156 + ∕ − ≈ 1∕(1 + ). In regime 4, − saturates at sat while + remains proportional to . Regime 5 occurs 157 when both + and − are saturated, i.e., + ≈ − ≈ sat .

158
Precision measurement of in vivo CRP-DNA binding 159 The placement of CRP downstream of the RNAP binding site is known to repress transcription 160 (Morita et al., 1988). We therefore reasoned that placing a DNA binding site for CRP downstream of 161 RNAP would allow us to measure the binding factor of that site. The copyright holder for this preprint (which was not this version posted July 31, 2018. . https://doi.org/10.1101/380972 doi: bioRxiv preprint bioRχiv preprint and +4.5 bp. 1 To avoid influencing CRP binding strength, the -10 region of the RNAP site was kept 167 fixed in the promoters we assayed while the -35 region of the RNAP binding site was varied ( Figure   168 2A). Promoter DNA sequences are shown in Appendix 1 Figure 1. 169 We obtained − and + measurements for these constructs using a modified version of the 170 β-galactosidase assay of Miller (1972); see Materials and Methods for details. Our measurements 171 are largely consistent with an expression manifold having the expected mathematical form ( Figure   172 2B). Moreover, the measurements for CRP at the two different spacings (+0.5 bp and +4.5 bp) 173 appear consistent with each other, although the measurements at +4.5 bp have consistently lower 174 values for . A small number of data points do deviate substantially from this manifold, but the 175 presence of such outliers is not surprising from a biological perspective: introducing mutations into 176 the RNAP binding site has the potential to create a new binding site, either for RNAP itself or for 177 other TFs. Fortunately, outliers appear at a rate small enough for us to identify and exclude them 178 by inspection. 179 We quantitatively modeled the expression manifold in Figure 2B by fitting +3 parameters to our 180 2 measurements, where = 42 is the number of non-outlier data points, each point corresponding 181 to an assayed promoter. The + 3 parameters were sat , bg , , and 1 , 2 , . . . , , where each 182 is the RNAP binding factor of promoter . Nonlinear least squares optimization was then used to 183 infer values for these parameters. Uncertainties in sat , bg , and were quantified by repeating this 184 procedure on bootstrap-resampled data points. 185 These results yielded highly uncertain values for sat because none of our measurements appear 186 to fall within regime 4 or 5 of the expression manifold. A reasonably precise value for bg was 187 obtained, but substantial scatter about our model predictions in regime 1 and 2 remain. This scatter 188 likely reflects some variation in bg from promoter to promoter, variation that is to be expected promoter DNA can transition between four different states: unbound, bound by the TF, bound by RNAP, or doubly bound. As in Figure 1, is the TF binding factor, is the RNAP binding factor, and sat is the rate of transcript initiation from an RNAP-saturated promoter. The cooperativity factor quantifies the strength of the interaction between DNA-bound TF and RNAP molecules; see text for more information on this quantity. (B) As in Figure 1, expression is measured in the presence ( + ) and absence ( − ) of the TF for promoters that have RNAP binding sites of varying strength (blue-yellow gradient). (C) If the model in panel A is correct, plotting + vs.
− (colored dots) will reveal a 1D expression manifold that corresponds to Equation 4 (for + ) and Equation 2 (for − ) evaluated over the possible values of . Circled numbers indicate the five regimes of this manifold. In regime 3, + ≈ ′ − where ′ is the renormalized cooperativity factor given in Equation 5; data in this regime can thus be used to measure ′ . Separate measurements of , using the strategy in Figure 1, then allow one to compute from knowledge of ′ . (D) The five regimes of the expression manifold in panel C. Note that these regimes differ from those in Figure 1D. This can be rewritten as 212 is a renormalized cooperatively that accounts for the strength of TF-DNA binding. As before, − is 214 given by Equation 2. Note that ′ ≤ and that ′ ≈ when ≫ 1 and ≫ 1.

215
As before, we measure both + and − for RNAP binding sites of varying strength ( Figure 3B). 216 These measurements will, according to our model, lie along an expression manifold resembling the 217 one shown in Figure 3C. This expression manifold exhibits five distinct regimes when sat bg ≫ ′ ≫ 1.

218
These regimes are listed in Figure 3D.  Table 1. . CC-BY-ND 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted July 31, 2018. . https://doi.org/10.1101/380972 doi: bioRxiv preprint bioRχiv preprint stabilizing the RNAP-DNA complex. 225 We measured + and − for 47 variants of the lac* promoter (see Materials and Methods, as well 226 as Appendix 1 Figure 1). These promoters have the same CRP binding site assayed for Figure 2, but 227 positioned at -61.5 bp, upstream of RNAP ( Figure 4A). They differ from one another in the -10 or -35 228 regions of their respective RNAP binding sites. Figure 4B shows the resulting measurements. With 229 the exception of 3 outlier points, these measurements appear consistent with stabilizing activation 230 via a Gibbs free energy of Δ = −3.96 ± 0.09 kcal/mol, corresponding to a cooperativity of ∼ 600. 231 We note that, with ≈ 30 determined in Figure 2, ′ = to 3% accuracy. 232 This observed cooperativity is substantially stronger than suggested by previous work. Early 233 in vivo experiments suggested a much lower cooperativity value, e.g. 50-fold (Beckwith et al., 234 1972), 20-fold (Ushida and Aiba, 1990), or even 10-fold (Gaston et al., 1990). These previous studies, 235 however, only measured the ratio + ∕ − for a specific choice of RNAP binding site. This ratio is (by 236 Equation 4) always less than and the differences between these quantities can be substantial.  To test the generality of this approach, we measured expression manifolds for 11 other potential 245 class I activation positions. At every one of these positions we clearly observed the collapse of 246 data to a 1D expression manifold of the expected shape ( Figure 4C). By quantitatively modeling 247 these manifolds, we determined the cooperativity and the Gibbs free energy Δ at each position. 248 Uncertainties in these quantities were determined by the modeling of bootstrap-resampled data 249 points (Materials and Methods). The resulting values for both and Δ are shown in Figure 4D.  262 We investigated whether expression manifolds might be used to distinguish activation by 263 acceleration from activation by stabilization. First we generalized the thermodynamic model 264 in Figure 3A to accommodate both -fold stabilization and -fold acceleration ( Figure 5A). This 265 is accomplished by using the same set of states and Boltzmann weights as in the model for 266 stabilization, but assigning a transcription rate sat (rather than just sat ) to the TF-RNAP-DNA ternary 267 complex. The resulting activated rate of transcription is given by 268 + = sat 1 + + + + sat 1 + + + + bg . for Figure 4B. 282 The resulting data ( Figure 6B) are seen to largely fall along the previously measured all-stabilization 283 expression manifold in Figure 4B. In particular, many of these data points lie at the intersection of value of sat to the sat obtained for the other manifolds in Figure 4C, we were able to estimate 289 for these other positions. Figure 6C shows the results: we find that ≈ 1 at all of the other class I   Figure 4D. is the number of data points used to infer these values, while "outliers" is the number of data points excluded in this analysis. For comparison we show the fold-activation measurements (i.e., + ∕ − ) reported in Gaston et al. (1990) and Ushida and Aiba (1990). In these columns, n/a indicates that no measurement was reported at that CRP site spacing.   Figure 3C will lie along a 1D expression manifold having the form shown here. This manifold is computed using + values from Equation 7 and − values from Equation 2, evaluated using an RNAP binding factor ranging from 0 to ∞. Note that regime 5 occurs at a point positioned ′ -fold above the diagonal, where ′ is related to through Equation 8. Measurements in or near the strong promoter regime ( ≳ 1) can thus be used to determine the value of ′ and, consequently, the value of . (D) The five regimes of this expression manifold. Note that the ranges of validity for these regimes are the same as in Figure 3D, but that the + values differ. 294 Many E. coli TFs participate in what is referred to as class II activation (Browning and Busby, 2016). 295 This type of activation occurs when the TF binds to a site that overlaps the -35 element (often com-296 pletely replacing it) and interacts directly with the main body of RNAP. CRP is known to participate 297 in class II activation at many promoters (Keseler et al., 2011; Salgado et al., 2013), including the 298 galP1 promoter, where it binds to a site centered at position -41.5 bp (Adhya, 1996). In vitro studies  Figure 4B (gray). The value for sat inferred for Figure  4B is indicated by dashed lines. From these new data points we conclude that ′ ≈ 1, and thus ≈ 1. (C) Values for inferred for other CRP positions using the data in Figure 4B and assuming the value of sat shown in panel B. Thus, we detect no acceleration at any class I promoter architectures. Note that values could not be confidently determined at some CRP positions shown in Figure 4D.

Surprises in class II regulation
have shown CRP to activate transcription at -41.5 bp relative to the TSS through a combination of 300 stabilization and acceleration (Niu et al., 1996; Rhodius et al., 1997). 301 We sought to reproduce this finding in vivo by measuring expression manifolds. We therefore 302 placed a consensus CRP site at -41.5 bp, replacing much of the -35 element in the process, then 303 varied the -10 element of the RNAP binding site ( Figure 7A). Surprisingly, we observed that the 304 resulting expression manifold saturates at the same sat value shared by all class I promoters. Thus, 305 CRP appears to activate transcription in vivo solely through stabilization, and not at all through 306 acceleration, when located at -41.5 bp relative to the TSS ( Figure 7B). 307 The genome-wide distribution of CRP binding sites suggests that CRP also participates in class 308 II activation at position -40.5 bp (Keseler et al., 2011; Salgado et al., 2013). When measuring an 309 expression manifold at this position, however, we obtained a scatter of 2D points that did not 310 collapse to any discernible 1D expression manifold ( Figure 7D). Some of these promoters exhibit 311 activation, some exhibit repression, and some exhibit no regulation by CRP. RNAP has a very well established sequence motif (McClure et al., 1983). Indeed, its DNA binding 331 requirements were among the first characterized for any DNA-binding protein (Pribnow, 1975). 332 More recently, a high-resolution model for RNAP-DNA binding energy was determined using data 333 from a massively parallel reporter assay called Sort-Seq (Kinney et al., 2010). This "energy matrix The copyright holder for this preprint (which was not this version posted July 31, 2018. . https://doi.org/10.1101/380972 doi: bioRxiv preprint bioRχiv preprint manifold corresponding to different values of RNAP-DNA binding affinity. 370 Robust data collapse was observed for CRP binding sites located at all except one of the positions 371 we assayed. In these cases, we were able to infer precise values for the energetic parameters of our 372 models. Inferring a model for simple repression allowed us to determine the strength of CRP-DNA 373 binding (ΔΨ = −2.10 ± 0.10 kcal/mol). Inference of models for simple activation then allowed us to 374 determine values for the CRP-RNAP interaction, as quantified by the Gibbs free energy Δ ; these 375 interaction energies were consistently determined to a precision of ∼ 0.1 kcal/mol.  (Kinney et al., 2010). This is a cautionary 388 tale: even for very well studied TFs, one cannot assume that published motifs accurately predict the 389 affinity of individual DNA binding sites. 390 Unexpectedly, our data did not collapse to an expression manifold when CRP was centered at 391 -40.5 bp. This result allowed us to reject our hypothesized biophysical model. We thus learned that 392 the DNA sequence of the core RNAP binding site somehow controls how RNAP interacts with CRP in 393 this class II configuration. Additional work will be required to understand this sequence-dependence, 394 which to our knowledge has not been previously reported. 395 Our strategy has been designed to be compatible with massively parallel reporter assays 396 (MPRAs), which use ultra-high-throughput DNA sequencing to measure the activities of thousands 397 of transcriptional regulatory sequences simultaneously. We expect that MPRAs, performed on 398 microarray-synthesized promoter libraries, should allow hundreds of expression manifolds to 399 be measured in a single experiment. MPRAs will also facilitate the study of TFs that cannot be 400 controlled by a small molecule: one can measure + and − by assaying promoters that either do or 401 do not have a functional TF binding site but are otherwise identical. The ease with which MPRAs can 402 assay promoters with different combinations of sites turned "on" and "off" should enable the study 403 of more complex regulatory architectures, beyond just simple repression and simple activation. 404 Based on these results, we advocate a very different approach to dissecting transcriptional 405 regulatory grammar than has been pursued by other groups. Instead of assaying and modeling The copyright holder for this preprint (which was not this version posted July 31, 2018. . https://doi.org/10.1101/380972 doi: bioRxiv preprint bioRχiv preprint TFs actually affect expression. Such knowledge would also facilitate MPRA-based efforts to dissect 421 previously unannotated regulatory sequences across the genome (Belliveau et al., 2018). 422 Precise knowledge of transcriptional regulatory grammar in bacteria would also have important 423 implications for synthetic biology. Currently, complex biological computations are performed in 424 synthetic systems by stringing simple promoter "parts" together into complex regulatory networks. 425 By contrast, naturally occurring promoters can often perform quite complex computations them-426 selves via the multi-protein-DNA complexes that they scaffold (Kuhlman et al., 2007; Cui et  The copyright holder for this preprint (which was not this version posted July 31, 2018. . https://doi.org/10.1101/380972 doi: bioRxiv preprint bioRχiv preprint The expression levels + and − were quantified from these absorbance data using the formula where = 50 is the volume of lysate in µl added to the ONPG reaction, Δ is the change in time from 516 the beginning of the measurement, and Δ indicates a change in absorbance at nm over this 517 time interval. Only data from wells with 600 ≲ 0.5 were analyzed. Note that the 550 term in Equation 518 9 is not multiplied by 1.75 as it is in Miller (1972). This is because our 550 measurements are used to 519 compensate for condensation on the microplate film, not for cellular debris as in (Miller, 1972); our 520 lysis procedure produces no detectable cellular debris. In practice, Equation 9 was not evaluated 521 using individual measurements, but was rather computed from the slope of a line fit to non-522 saturated absorbance measurements using custom Python scripts. Raw 420 , 550 , and 600 values, 523 as well as our analysis scripts, are available at https://github.com/jbkinney/18_expressionmanifolds. 524 In all the figures, median values from at least 3 independent Miller measurements were used to 525 define each measured + and − data point.
The solid black lines in Figure 2B and Figures 4B,C show the expression manifolds fit to all data 536 points. The gray lines in Figure 2B and Figure 4B represent parameters fit to bootstrap-resampled 537 data points. 538 The values reported for and , as well as for Δ and Δ , were computed using parameters 539 fit to bootstrap-resampled data. For the occlusion data in Figure where 84 , 50 , and 16 respectively denote the 84th, 50th, and 16th percentiles of values obtained 545 from bootstrap resampling. 546 By visual inspection of Figure 6B, we determined that ≈ 1 at 61.5 bp. In Figure 6C