Predictive features of gene expression variation reveal a mechanistic link between expression variation and differential expression

For most biological processes, organisms must respond to extrinsic cues, while maintaining essential gene expression programs. Although studied extensively in single cells, it is still unclear how variation is controlled in multicellular organisms. Here, we used a machine-learning approach to identify genomic features that are predictive of genes with high versus low variation in their expression across individuals, using bulk data to remove stochastic cell-to-cell variation. Using embryonic gene expression across 75 Drosophila isogenic lines, we identify features predictive of expression variation, while controlling for expression level. Genes with low variation fall into two classes, indicating they employ different mechanisms to maintain a robust expression. In contrast, genes with high variation seem to lack both types of stabilizing mechanisms. Applying the framework to human tissues from GTEx revealed similar predictive features, indicating that promoter architecture is an ancient mechanism to control expression variation. Remarkably, expression variation features could also predict differential expression upon stress in both Drosophila and human. Differential gene expression signatures may therefore be partially explained by genetically encoded gene-specific features, unrelated to the studied treatment.


21
For most biological processes, organisms must respond to extrinsic cues, while maintaining 22 essential gene expression programs. Although studied extensively in single cells, it is still 23 unclear how variation is controlled in multicellular organisms. Here, we used a machine-24 learning approach to identify genomic features that are predictive of genes with high versus 25 low variation in their expression across individuals, using bulk data to remove stochastic cell-26 to-cell variation. Using embryonic gene expression across 75 Drosophila isogenic lines, we 27 identify features predictive of expression variation, while controlling for expression level. 28 Genes with low variation fall into two classes, indicating they employ different mechanisms to 29 maintain a robust expression. In contrast, genes with high variation seem to lack both types of 30 stabilizing mechanisms. Applying the framework to human tissues from GTEx revealed similar 31 predictive features, indicating that promoter architecture is an ancient mechanism to control 32 expression variation. Remarkably, expression variation features could also predict differential 33 expression upon stress in both Drosophila and human. Differential gene expression signatures 34 Introduction 38 Living systems have a remarkable capacity to give rise to robust and highly reproducible 39 phenotypes. Perhaps the most striking example of this is the process of embryogenesis, where 40 fertilized eggs give rise to stereotypic body plans despite segregating genetic variants and 41 moderate differences in environmental conditions (e.g. water temperature for fish, mothers diet 42 for humans). This phenomenon led Waddington to propose that developmental reactions are 43 canalized, which buffers them to withstand such variation without alterations in embryonic 44 development (Waddington 1942). In agreement with this, variation in gene expression is an 45 evolvable trait under selection pressure (Lehner 2008;Fraser et al. 2004;Metzger et al. 2015).  Blake et al. 2006). In 51 other cases, variation in gene expression is detrimental and must be tightly regulated, for 52 example for essential genes (Fraser et al. 2004) and genes that reduce fitness in heterozygous 53 mutants (Batada and Hurst 2007). This suggests that there are inherent mechanisms that 54 modulate variation in gene expression, either attenuating or amplifying it (Fig 1a). To address this, we devised a machine learning approach and performed a systematic analysis 63 of factors underlying variation of gene expression in Drosophila melanogaster to uncover the 64 regulatory mechanisms involved. To measure expression variation, we used gene expression 65 data generated from a pool of embryos (~100) sampled from 75 different isogenic lines during 66 embryogenesis (Cannavò et al. 2016). This experimental design cancels out most stochastic 67 noise (since it's bulk sequencing), tissue-specific expression pattern (since it's whole embryo) 68 and slight differences in developmental progression (since it's 100 embryos per line). To 69 dissect the regulatory mechanisms that modulate expression variation (Fig 1a), we collated 70 over a thousand gene-specific and genomically encoded features and applied a random forest 71 model to identify the properties that best explain expression variation across individuals. As a 72 comparison, we also predict median expression level across lines using the same features. 73 Our results show that, overall, increasing regulatory complexity translates into more robust 74 gene expression. We identified two independent mechanisms associated with low expression 75 variation across individual: Low variable genes either have (i) a broad transcription initiation 76 region (broad promoters) with high transcription factor (TF) occupancy, or (ii) narrow 77 initiation regions (narrow promoters) with Polymerase II (PolII) pausing and high regulatory 78 complexity outside the promoter region. In contrast, genes with high variability generally have 79 narrow promoters, and little other regulatory features, suggesting that it may rather be a lack 80 of 'stabilizing' mechanisms that facilitates their noisy expression. Applying the same 81 framework to human data derived from tissues across individuals (GTEx Consortium 2013) 82 identified similar promoter-associated features to be predictive of expression variation, thus 83 validating our findings in an independent organism. Remarkably, these same features are also 84 predictive of differentially expressed genes when tested on independent datasets from adult 85 Drosophila subjected to different stress conditions, or in a collection of differential expression 86 data for human. These findings suggest that the differential expression response may be 87 partially explained by genetically encoded gene-specific features that are unrelated to the 88 treatment applied. 89 Taken together, our results suggest that gene expression variation across genetically diverse 90 multicellular organisms is strongly linked to how the gene is regulated and likely reflects 91 evolutionary constraints on expression precision. 92

Measuring gene expression variation across individuals 94
To understand the mechanisms by which gene expression variation is controlled during 95 embryonic development, we obtained RNA-seq data from 75 isogenic lines of Drosophila 96 melanogaster embryos at three different developmental stages (2-4, 6-8, and 10-12 hours post 97 fertilization) from (Cannavò et al. 2016). To reduce potential confounding effects of maternally 98 deposited RNA, we focused on the late embryonic time-point (10-12 hours after fertilization), 99 and removed genes whose expression decreased between 2-4 h and 10-12h, resulting in 100 embryonic expression data for 4074 genes (Methods, Supplementary Fig 1). For each gene, we 101 calculated its median expression level and the coefficient of variation (CV) from the 102 normalized read counts across individuals (Methods). As variation is highly correlated with the 103 levels of gene expression ( to obtain a measure of expression variation that is relative to the expected variation at a given 106 expression level (Fig. 1b). 107 We confirmed that this measure of variation is highly correlated with alternative metrics, such 108 as variance stabilized standard deviation or residual median absolute deviation (Supplementary 109 Fig. 2a-b) and robust with respect to the number and identity of samples used (Fig. 1c). 110 Moreover, using the full dataset from Cannavò (Cannavò et al. 2016), expression variation 111 values were highly correlated across time, especially for consecutive time-points, further 112 confirming the approach (Methods, Supplementary Fig.1d). Finally, we observed a strong 113 correlation in expression variation between pairs of genes in close proximity (Supplementary 114 As these 75 samples came from strains with different genotypes, we first calculated the 117 proportion of expression variance that is explained by genetics in cis (taking variants within 50 118 kb of each gene into account) using variance decomposition (Methods). On average, 6% 119 (median across all genes) of the total gene expression variation was explained by cis genetics 120 ( Supplementary Fig.1f), indicating that more complex genetic effects and other properties must 121 account for the majority of expression variation. We reasoned that differences in the extent of 122 expression variation among genes should reflect inherent differences in their regulation, 123 including their regulatory complexity and mechanisms of noise buffering or amplification. 124 Therefore, in the remainder of this study we investigate the regulatory differences between 125 genes with high versus low expression variation. 126 127

Genomic features predict expression variation independent of expression levels 128
To restrict our analysis to the important features, we applied the random forest-based Boruta 140 algorithm, which iteratively selects all features that predict better than their permuted version 141 (Kursa and Rudnicki 2010). This resulted in 93 and 106 predictive features for expression 142 variation and level, respectively (Fig. 1d). Using these feature sets, our models predicted 143 expression variation and level with an R 2 of 0.45 and 0.43 (5-fold cross validation), 144 respectively, while permuting the labels resulted an R 2 of zero (Fig. 1d). 145 To ensure the robustness of our predictions we have performed a number of analyses: first, we 146 verified that the predictions for variation are independent of the level of gene expression by 147 showing that the models performed equally well on genes grouped into quartiles based on their 148 expression levels (Fig. 1e). Second, we ensured that the predictions are robust to the choice of 149 measure used for expression variation ( Supplementary Fig. 2c). Third, we tested whether 150 dynamic gene expression changes during developmental stages can contribute to the variation 151 predictions. We reran the random forest models, predicting expression variation for genes 152 grouped based on their absolute expression change between 6-8 and 10-12 hours after 153 fertilization. For genes with minor expression change between the two time-points (below 154 median of 0.8), the performance was comparable to the full model, while for the genes with a 155 stronger expression change (above 0.8) the R 2 dropped to about 0.3 ( Supplementary Fig. 2d). 156 This indicates that some portion of expression variation comes from dynamic changes in gene 157 expression during embryogenesis, which is not captured by our features (and thus reduces the 158 performance of our model for this set of genes). However, since the performance is the best for 159 genes that vary little between stages, it indicates that variance explained by our model is overall 160 not majorly confounded by expression dynamics. Finally, the model performance does not 161 decrease when training and test sets come from different chromosomes ( Supplementary Fig.  162 2e), demonstrating that the results are not confounded by shared regulatory features between 163 neighboring genes. 164 Taken together, these results establish that gene expression variation -as well as gene 165 expression levels -can be predicted based on genomically encoded features, when measured 166 across a population of genetically diverse individuals during embryogenesis. The predictions 167 are independent of the gene's expression level and are robust to the metric used for measuring 168 variation. These models can therefore be used as the basis for addressing questions about 169 buffering mechanisms that regulate gene expression variation during embryogenesis. 170  x 100 x 100 x 100 x 100 a c e b

Promoter architecture is the most important predictor of expression variation 188
Next, we used this predictive framework to investigate the genomic features that best explain 189 expression variation and expression level. We retrieved the features' 'importance score' from 190 the Boruta algorithm and determined the correlation of each feature with both expression 191 properties (Supplementary Table 4). Although most features are to some extent predictive of 192 both expression level and variation, their relative importance differed substantially (Fig. 2a). 193 Being a housekeeping gene, for example, was strongly predictive of high expression level 194 while being less important for expression variation. Conversely, the presence of a core 195 promoter TATA-box motif is strongly predictive of high expression variation only (Fig. 2a,  196 see Suppl. Table 4 for full list). We note that most features are either associated with higher 197 variation and lower expression or vice versa, suggesting that expression level and variation are 198 not completely independent, as was previously observed (Faure, Schmiedel, and Lehner 2017), 199 even though they are globally uncorrelated (Fig 1b). However, we found that when we split 200 genes into the categories of the top features (e.g. housekeeping vs non-housekeeping) the 201 differences in expression variation are pronounced at all expression levels (Fig 2b-e): For 202 example, housekeeping genes (the strongest predictor for expression level) are less variable 203 than non-housekeeping genes at any level of expression (Fig 2b). The same holds true for the 204 feature 'promoter shape index', which is the strongest predictor for variation (Fig 2c), as well 205 as other features such as '#conditions with DHS' (DNase hypersensitive sites) (Fig 2d) and 206 'presence of a TATA box' (Fig 2e). This demonstrates that the features explain expression 207 variation independent of expression level. 208 Promoter-associated features (TSS-proximal) are among the strongest predictors in terms of 209 explanatory power for expression variation, and include promoter shape, core promoter motifs 210 and GC-content, Pol II pausing, chromatin accessibility, and TF occupancy at TSS (Fig. 2a).

Expression variation in broad versus narrow promoter genes reflects trade-off between 246 expression robustness and plasticity 247
The most prominent predictive feature for expression variation is promoter shape index ( Genes with narrow promoters generally have higher variation compared to genes with broad 251 promoters (Fig. 2c), and, interestingly, also comprise a wider range of variation (Fig 3a). 252 Moreover, expression variation of narrow promoter genes is better explained by genomically 253 encoded features compared to broad promoter genes (R 2 = 0.37 vs 0.14), and this difference in 254 performance becomes more pronounced with more stringently defined narrow and broad 255 promoter genes (Fig. 3b). 256 Interestingly, when we group genes from the two promoter classes into quartiles based on their 257 variation we find very specific functions enriched among them: the broad class is strongly 258 enriched for housekeeping genes (Fishers's test odds ratio, OR=15.0, p-value<1e-16, 259 Supplementary table 5) and GO terms related to basic cellular processes (cellular transport, 260 secretion, and DNA/RNA biogenesis) with the exception of the top 25% most variable genes 261 within the group being also enriched in metabolic processes (Fig. 3c, Supplementary Fig. 3a, 262 Supplementary Table 6). In contrast, narrow promoters genes fall into two functional categories 263 depending on their expression variation: the bottom 50% were enriched in TFs (OR=3.0, p-264 value<1e-16) and GO terms related to development, signaling and regulation of transcription, 265 while the top 50% are enriched for TATA-box genes (OR=7.9, p-value<1e-16) and GO terms 266 related to metabolism, stress response, and cuticle development (Fig. 3c, Supplementary Fig.  267 3a). We therefore grouped genes along the dimensions of promoter shape and expression 268 variation into three classes (Fig. 3a): genes with broad promoters and low levels of variation in 269 expression (broad), genes with narrow promoters and low expression variation (narrow-low) 270 and genes with narrow promoters and high expression variation (narrow-high). 271 Next we looked at regulatory plasticity of these classes of genes defined here as the variation 272 in accessibility of TSS-proximal DHSs across time and tissues, see [Reddington et al,273 submitted]. We observed that narrow promoter genes had high regulatory plasticity regardless 274 of their expression variation ( Supplementary Fig. 3b). In particular, narrow-low genes are 275 robustly expressed across individuals at the given developmental stage, while having 276 condition-specific regulation. In contrast, broad promoter genes are characterized by both low 277 expression variation and low plasticity, which agrees with their housekeeping functions. 278 Enrichment of low-variable genes in either housekeeping (broad) or developmental (narrow-279 low) functions suggests selection pressure may act on those genes to reduce expression noise 280 in genes essential for viability and development. One proxy for evolutionary constraints is 281 sequence conservation across long evolutionary distances. In keeping with this, sequence 282 conservation between Drosophila and human was among the top five most predictive features 283 of low expression variation with conserved genes being significantly less variable (Fig 2a,  284 Wilcoxon test p-value <2e-16). Promoter shape is also correlated with gene conservation: 285 conserved genes are highly enriched for broad promoters (80% in broad vs. 41% in narrow) 286 and more enriched in the narrow-low compared to narrow-high class (54% vs 28%). Within 287 each class, conserved genes are less variable ( Supplementary Fig.3c), hence sequence 288 conservation provides additional information about variation constraints across genes. 289 Overall, these results suggest that expression variation is an orthogonal component to the 290 regulatory plasticity, which has previously been defined along the narrow-broad promoter 291 spectrum (Rach et al. 2009; Lenhard, Sandelin, and Carninci 2012). Promoter shape likely 292 reflects differences in regulatory plasticity (constitutive vs. condition-specific genes), while 293 expression variation may reflect evolutionary constraints on expression robustness with 294 essential and highly conserved genes being less variable. These findings indicate a partial 295 uncoupling between expression variation across multicellular individuals in a controlled 296 environment and variation across tissues/development, analogous to the uncoupling between 297 plasticity and noise observed in yeast (Lehner 2010), and suggest different mechanisms to 298 control expression robustness for genes with ubiquitous versus condition-specific expression. 299 Quantiles by promoter shape index R^2 (5−fold cross−validation) genes separately. Quantile intervals for broad and narrow promoter genes provided in methods. 317 318

Two classes of genes with low variation have distinct regulatory mechanisms 319
The results above indicate that the partial uncoupling of expression variation and expression 320 plasticity could be achieved by distinct mechanisms of ensuring expression robustness between 321 different promoter architectures (broad/narrow). To explore this, we examined the most 322 predictive features in relation to the different promoter types. Among the most significant 323 promoter features is "#conditions with DHS" (Fig 2a), which is derived from a comprehensive The median number of developmental conditions in which a gene had at least one DHS site 327 was 18, 8, and 1 for broad, narrow-low, and narrow-high genes respectively (Fig 4a), thus 328 highlighting again that the narrow-low and broad classes differ in their developmental plasticity 329 ( Fig. 4a). A similar trend was observed for related features, such as using a compendium of TF 330 occupancy data during embryogenesis (Fig 4b), TF peaks with motifs, or motifs alone 331 (Supplementary Fig.4a-b). To understand how these promoter-type specific DHS patterns are 332 set-up we next examined the 24 TFs that were predictive of expression variation in the full 333 model (Supplementary Table 4 The next most predictive feature in our model is "PolII pausing index" (Fig 2a), defined as the 343 density of polymerases in the promoter region divided by the gene body length (Saunders et al. 344 2013) (Fig 2a). Narrow-low genes have the highest pausing index (40) followed by broad and 345 narrow-high genes (10 and 7, respectively; Supplementary Fig.4d). Consequently, Pol II 346 pausing is strongly negatively correlated with expression variation in narrow promoters 347 (Spearman correlation Rho=-0.28, p-value<1e-16), yet showed no significant relationship in 348 broad (Fig. 4d) Among the most significant non-promoter features, our model identified distal regulatory 351 complexity ("#TF motifs (dist)" and "#DHS peaks (dist)" in Fig 2a) and post-transcriptional 352 events ("#miRNA motifs" and "#RBP motifs" in Fig 2a) as predictive of expression variation. 353 As for the distal regulatory complexity, narrow-low had the highest number (median of 6) of 354 distal regulatory elements, defined as DHS within 10kb of the TSS, followed by broad (4) and 355 narrow-high (4) genes ( Supplementary Fig.4g). Consequently, the number of distal DHS is 356 negatively correlated with expression variation in narrow promoters (Rho=-0.22, p-value<1e-357 16) while being uncoupled from variation for broad (Fig. 4f). Similarly, narrow-low genes have 358 a higher number of miRNA motifs in their 3'UTRs (median of 35) compared to broad (20) and 359 narrow-high (14) genes ( Supplementary Fig.4e), which again was negatively correlated with 360 variation in narrow promoter genes only (Rho=-0.31, p-value<1e-16) (Fig. 4e). Similar results 361 were obtained for the number of RNA-binding protein (RBP) motifs, which have an effect for 362 narrow, but not for broad, genes (Supplementary fig. 4f). 363 In summary, these findings provide strong evidence that robustness in gene expression across 364 individuals is conveyed by different mechanisms depending on the gene's promoter type: in 365 broad promoter genes, robust expression is likely a result of a plethora of broadly expressed 366 TFs that bind to the core promoter and keep the chromatin constitutively accessible, compatible 367 with their house-keeping roles. Narrow promoter genes, in contrast, seem to be regulated by a 368 smaller number of (narrow-specific) TFs and their robustness is conveyed through mechanisms 369 that involve Pol II pausing, distal regulatory elements, and posttranscriptional regulation. This 370 suggests that broad and narrow promoter types have distinct mechanisms to regulate expression 371 variation that are not necessarily transferable. This is possibly related to the relatively higher 372 regulatory plasticity required of the narrow-low genes. 373 Partial aspects of these findings have been reported previously. E.g. In a study of 14 374 developmental control genes, Pol II pausing at promoters was linked to more synchronous gene 375 activation, thereby reducing cell-to-cell variability in the activation of gene expression 376 (Boettiger and Levine 2009). Also, miRNAs have been proposed to buffer expression noise 377 (Schmiedel et al. 2018(Schmiedel et al. , 2015. Our data puts these previous findings in a more global context 378 as part of a distinct mechanism for a particular promoter type. 379 We summarized these mechanisms as two indices based on the ranked averages of the 380 corresponding features: broad regulatory index (number of TF peaks, motifs and conditions 381 with DHS, at the TSS) and narrow regulatory index (Pol II pausing index, number of distal 382 DHS and miRNA motifs), respectively (Fig 4g), which nicely separate the three gene groups. 383 Interestingly, we found no evidence for a specific noise-amplifying factor, except for the 384 TATA-box. Yet, even for TATA-box genes, since they are depleted of all the aforementioned 385 robustness features ( Supplementary Fig. 4h), the observed high variation may result from a 386 lack of robustness-conveying factors. 387 stratified into broad, narrow-low and narrow-high (defined in Fig 3A). (orange), narrow-low (blue) and narrow-high (red)) gene groups. P-values < 1e-09 for all 409 pairwise comparisons of the distributions. 410

Expression variation can predict signatures of differential expression upon stress 411
So far, we showed that distinct mechanisms regulating expression variation are directly 412 encoded in the genome. In the following, we want to assess the impact of these findings for 413 interpreting gene expression studies in general. 414 We postulate that the expression variation of a gene across individuals can be interpreted as its 415 ability to be modulated by any random perturbation. If this is true, we expect expression 416 variation to be predictive of a gene's response to changes in the environment. To test this, we 417 used an independent gene expression dataset from adult flies that were subjected to different 418 stress conditions related to temperature, starvation, radiation, and fungi infection (Moskalev et 419 al. 2015). In agreement with our postulation we find that genes differentially expressed upon 420 stress have high expression variation in our embryonic dataset (Fig. 5a Supplementary Fig. 5b), a lower number of TFs (p=1.4e-10) and less 426 motifs (p=3.9e-8) at their TSS, as well as other features important for distinguishing between 427 narrow-high and -low genes ( Supplementary Fig. 5c-e). Overall, differentially expressed genes 428 showed lower regulatory complexity as reflected in our broad and narrow variability indices 429 (Fig 5b-c). 430 To assess this association more systematically, we next tested whether the model for predicting 431 expression variation can also identify differentially expressed genes. We trained a random 432 forest model using our embryonic data to classify the top-30% versus bottom-30% variable 433 genes and used it to predict differential expression in adults subjected to different stresses 434 (Methods). The model predicted differential expression on the non-overlapping test set with an 435 AUC of 0.65 and 0.74 when trained to predict embryonic variation for all genes, or for narrow 436 promoter genes, respectively (Fig. 5d). This demonstrates that differential expression can be 437 predicted based on a model trained for predicting expression variation. Since the model's 438 performance was better when trained only on variation in narrow promoters, it is likely that the 439 narrow-specific regulatory mechanisms, such as micro RNA and enhancers, determine a gene's 440 responsiveness to stress. This is also reflected by the strong differences in narrow index 441 between DE and non-DE genes (Fig 5c). 442 Overall, this suggests that the same buffering mechanisms confer expression robustness to 443 different kinds of perturbations. Since the propensity to be differentially expressed is 444 predictable based on genomically encoded features, this implies that results from differential 445 expression studies should always be interpreted relative to a genes inherent tendency to respond 446 to perturbation.

Human promoter features predict both expression variation and differential expression 460
Given that gene expression variation across individuals can be predicted from genomic features 461 in Drosophila we next asked whether this holds true in humans, and whether the predictive 462 features are conserved. We used high quality RNA-seq datasets from the GTEx project 463 comprising 43 tissues with data for at least 100 individuals (GTEx Consortium 2013). For each 464 tissue, we measured expression variation across individuals using the coefficient of variation 465 corrected for mean-variance dependence, applying a similar approach as for Drosophila 466 (Methods). Since gene expression variation values were highly correlated across all tissues 467 ( Supplementary Fig 6), we also computed the mean of tissue-specific variations (mean 468 variation) as potentially more robust metrics. 469 Since TSS-proximal features were the most predictive of expression variation in fly, we 470 focused on promoter features to train the models (Methods). This included promoter shape, TF 471 binding at the TSS, chromatin states, and several sequence features (TATA-box, GC-content, 472 CpG islands). To predict the mean expression variation, promoter shape and chromatin state 473 features were aggregated across multiple tissues. In addition, we collated three tissue-specific 474 datasets for muscle, lung and ovary by matching RNA-seq, CAGE and chromHMM datasets 475 (Methods). Based only on these features, random forest models were able to predict expression 476 variation and level within each tissue to a similar extent as in Drosophila embryos (Fig. 2f) 477 with R^2 ranging between 0.38-0.46 for expression variation and 0.19-0.24 for expression level 478 (Fig. 6a). Aggregating expression variation across tissues yielded even higher performance 479 with R^2 of 0.56 versus 0.31 for mean level across all expressing tissues. The overall 480 performance was robust to changes in the numbers of samples including subsetting by age or 481 sex (Supplementary fig. 8a). 482 The predictive features of expression variation in humans are highly overlapping with those 483 for Drosophila (Fig 6b,c), and include promoter shape, TATA-box, and the number of TFs 484 binding to the promoter. An additional feature highly predictive of genes with low expression 485 variation was the presence of CpG islands, in line with previous findings in single-cells 486 (Morgan and Marioni 2018), while bivalent TSS state was predictive of high expression 487 variation, in line with previous studies (Faure, Schmiedel, and Lehner 2017) (Fig. 6b, c). We 488 also uncovered a number of transcription factors predictive of low variation, including 489 GABPA, YY1, and E2F1 (84 predictive TFs in total, Supplementary Table 17). Similar to 490 Drosophila, the presence of TSS-proximal peaks of all 84 predictive TFs were associated with 491 reduced mean expression variation, again suggesting that high variation (in bulk RNA-seq) is 492 due to a lack of buffering mechanisms rather than a specific mechanism for noise amplification. 493 Extending the distance around the TSS did not improve the correlation between presence of 494 TF peaks and expression variation, indicating that the key regulatory information is already 495 contained within the core promoter region (Supplementary fig 8b). 496 We next asked whether expression variation across individuals is predictive of differential 497 expression in different conditions, as we observed in Drosophila. For this we used differential 498 expression prior (DE prior), a metric that integrates more than 600 published differential  Fig. 6d, Methods), and predictive features for variation showed similar effects in DE prior 506 (Fig. 6e). This indicates that inherent promoter features can explain expression variation and 507 the probability of differential expression to a similar extent -potentially, due to partially 508 overlapping underlying mechanisms. 509 Importantly, both expression variation and DE prior were significantly lower for essential 510 genes, while being higher for GWAS hits and common drug targets (Fig. 6f, Supplementary  511   Fig. 8c). Higher expression variation of the latter agrees with an interpretation that these genes 512 are less buffered to withstand different sources of variation (Fig 1a) and hence are more likely 513 to change in expression level upon different types of perturbations including genetic or 514 environmental factors. Hence, expression variation across individuals likely captures 515 differences in selection pressure and cost-benefit trade-offs between expression precision and 516 plasticity. 517 In summary, despite significant differences in promoter regions between humans and 518 Drosophila (e.g. the presence of Drosophila-specific core promoter motifs, human-specific 519 CpG islands, predominately unidirectional versus bidirectional transcription), promoter 520 features are highly predictive of expression variation in both species. Genes with high variation 521 tend to also have differential expression across diverse conditions, and are significantly 522 enriched in GWAS hits, and disease associated loci.

545
Our analysis suggests that expression variation across a population of multicellular genetically 546 diverse individuals is gene-specific and can be explained by genetically encoded regulatory 547 features, all highly correlated with core promoter architecture. Overall, we found that 548 regulatory complexity positively correlates with robust gene expression. Yet we identified two 549 independent mechanism that decrease expression variation depending on the core promoter 550 architecture. Genes with broad core promoters in Drosophila were overall less variable and 551 characterized by ubiquitously open chromatin and a high number of transcription factors (TFs) 552 binding to the TSS-proximal region. In contrast, genes with a narrow core promoter had a much 553 higher spread of expression variation, which was, in addition to TFs, modulated by regulatory 554 complexity outside of core promoters (miRNAs, enhancers and Pol II pausing). 555 We found that similar promoter-related features were predictive of expression variation across 556 human individuals by applying the same predictive framework to tissue-specific RNA-seq 557 datasets. This was surprising given the differences in promoter features between Drosophila 558 and mammals, with higher heterogeneity within broad promoters and high regulatory 559  fig 1a) and not sequencing batch (Suppl. fig 1b). for 10-12h in Supplementary Fig. 1c). For almost all of the expression bins, spread of 661 expression values increased for higher residual coefficient of variation, except bin-20 (top-5% 662 genes by expression level) and to less extend bin-1 (bottom-5%). Based on this analysis, top 663 and bottom-5% of expressed genes were excluded from the analysis. 664 We focused our analysis on the latest developmental stage (10-12h) and removed genes that 665 decreased in expression between 2-4h and 10-12h after fertilization. This was done to reduce 666 confounding effects of maternal mRNA degradation and focus on the stage when zygotic 667 genome is fully activated (both processes happening from 2h post fertilization onwards). In 668 total, we excluded 3275 genes, from which 90% were detected as maternally deposited (in 669 house data, genes expressed in unfertilized eggs). In addition, genes with the strongest decrease 670 in expression (3-fold or more) were highly enriched in cell cycle biological processes 671 (Supplementary table 7), and cell cycle is known to slow down at later developmental stages 672 (Edgar and O'Farrel 1989). Hence, we reasoned that variation of these genes might be strongly 673 confounded by extrinsic factors (maternal mRNA degradation and cell cycle) that are not of 674 particular interest for this analysis. 675 Overall, the following filtering steps were applied to the data, and the corresponding genes 676 were excluded from the final dataset: 677 1. Genes with zero median expression level across samples (as non-expressed 678 genes); 679 2.
Genes falling into top and bottom 5% by expression level (as potential source 680 of outliers); 681

3.
Genes that decreased in expression between 10-12 and 2-4 hours after 682 fertilization (as maternal genes with role in early embryonic development and potential 683 targets for maternal mRNA degradation) 684

4.
Genes with missing values in the feature table (see below) unless the feature 685 can be easily imputed, i.e. 0 for the absence of transcription factor motif 686 Hence, our final dataset included 4074 genes at 10-12 hours post fertilisation. Final measure 687 of expression variation was calculated as described above on the final set of genes to avoid 688 residual dependence on the expression level after filtering (Fig. 1b, 'resid_cv'   Expression level and variation of neighboring genes. For this analysis we considered all pairs 702 of genes located on the same chromosomes and with TSS to TSS distance below 100 kB. Genes 703 pairs were binned into 5 quantiles by the distance between their TSSs. Coordinates of the 704 topologically associated domains (TADs) were taken from the high-resolution HiC in Kc cells 705 (Ramírez et al. 2018). Genes were assigned to TADs based on their TSS coordinates, and for 706 all pairs of genes we defined whether they belong to the same TAD or span the TAD border. 707 Within the resulting 10 groups of gene pairs (5 quantiles * same/different TAD), we calculated 708 Spearman correlation coefficients in expression variation and median expression level between 709 genes in the pairs (Supplementary Fig 1d). conservation rank (conserv_rank, factor variable taking the following values: none, low, 751 moderate, high). Genes with 'high' conservation rank were referred to as "conserved with 752 human" (e.g. Supplementary Fig 3c). For each gene, we calculated the mean UTR length at different time points as the weighted 889 mean UTR length between UTR isoforms. We used the polyA site expression as weights in the 890 mean calculation. Since length of 3'UTR was highly correlated with gene length (Spearman 891 correlation, Rho=0.62), utr3_length feature was calculated as actual 3'UTR length divided by 892 gene length. Finally, 3'UTR length changes (log2-fold change) between different time-points 893 (10-12h vs. 6-8h, 6-8h vs. 2-4h, 10-12h vs 2-4h) were calculated for each gene 894 (utr3_l2fc_10vs6, utr3_l2fc_6vs2, and utr3_l2fc_10vs2 features). 895 Genomic context features. Insulation score (ins_score_2_4h and ins_score_6_8h) was 896 calculated based on Hi-C data in-house data (unpublished) for Drosophila melanogaster 897 embryos at 2-4 and 6-8 hours after fertilization (in-nucleus ligation, whole embryo). To assign 898 insulation score to genes, we recorded the nearest value to the annotated TSS of each gene. 899 Coordinates of topologically associated domains (TADs) were taken from the high-resolution 900 Hi-C in Kc cells from (Ramírez et al. 2018) and Hi-C in 2-4h embryos (in-house data, 901 unpublished). Each gene was then assigned to TAD from the two aforementioned annotations 902 based on its TSS coordinate, and distance to TAD border and TAD size were recorded 903 (dist_to_tad_border.ramirez, dist_to_tad_border.2_4h, tad_size.ramirez, and tad_size.2_4h, 904 respectively). To control for this effect, we also calculated 'global mean variation' as the residuals from the 948 local linear regression of the mean CV on the mean expression level (calculated as above). This 949 measure was highly correlated with mean variation (Supplementary fig S6) and showed similar 950 results in the downstream analysis (results not shown, global mean variation is provided in 951 Supplementary Table 9). 952

Feature tables for human dataset 953
Only TSS-proximal features and several gene properties (i.e. gene length and number of 954 transcripts) were used to predict expression level and variation in human. Full list of features 955 used in this analysis is provided in Supplementary table 10. Most of the TSS-proximal features  956 (TF peaks and chromatin states) were scanned in the −500/+500 bp of the main TSS of the 957 genes (referred to as TSS-proximal regions), following the same approach as for Drosophila. 958 Several features more strictly linked to the gene core promoters (promoter shape, TATA-box, 959 CpG islands, and promoter GC-content) were scanned in -300/+200 bp of the main TSS of the 960 genes (referred to as core promoter regions). (less than 5 counts) were removed. Next, we applied a simple clustering method (distclu, 971 maximum distance = 20) form CAGEr package on each dataset separately. Clusters with low 972 normalized CAGE signals (sum of TSSs normalized signals of the cluster below 10-50 973 depending on the tissue) were removed. CAGE clusters were then assigned to genes by 974 overlapping them with core promoter regions (-300/+200 bp around TSSs of all annotated 975 transcripts). Clusters that did not overlap any core promoters were removed. 976 Next we defined promoter shape for all CAGE clusters by using two commonly accepted 977 metrics: promoter width and promoter shape index. Promoter width was calculated by using 978 the inter-percentile width of 0.05 and 0.95 following methodology from (Haberle et al. 2015). 979 Promoter shape index (SI) was calculated by the formula as above (Drosophila section) 980 proposed in (Hoskins et al. 2011). 981 For classifying promoters into broad and narrow based on promoter width, we used the 982 following approach. First, we did a linear transformation of promoter width values (actual 983 value minus 1 divided by 10; for fitting gamma distribution) On the transformed data, we fitted 984 gamma mixture model (2 gamma distribution), and parameters was trained using EM algorithm 985 (Dempster, Laird, and Rubin 1979) using mixtools package in R (Benaglia et al. 2009). The 986 threshold for classifying promoters as broad or narrow was selected by finding the point which 987 best separates the two distributions. Following this approach, promoters with width above 988 about 10-15 bp. were classified as broad, which was consistent across all tissues and agreed 989 with earlier studies (Forrest et al. 2014). To classify promoters into broad/narrow using shape 990 index, we fitted Gaussian mixture model (2 Gaussian distribution) to the data and selected the 991 threshold separating the two distributions using the same approach as above. For the 992 subsequent analysis, we used promoter width feature since it showed more clear bi-modal 993 distribution in all tissues (example in Supplementary fig 7a-c) and is a more common metrics 994 in the analysis of mammalian promoters (Forrest et al. 2014;Carninci et al. 2006). 995 Each gene was then assigned the promoter width of its main transcript. If more than one CAGE 996 cluster was present for a gene's main transcript, the cluster with the highest normalized CAGE 997 signal was selected. Promoter width values for most of the genes were highly correlated across 998 tissues (Supplemetary fig 7d). Based on the tissue-specific shape data, we Fisher's tests. We used Fisher's exact test (R base package) to find enrichments of features in 1198 different gene groups in Drosophila dataset (broad, narrow-low, narrow-high). All tests were 1199 done for 2x2 contingency tables, and odds ratios and p-values provided by the test were 1200 recorded. We used Benjamini-Hochberg correction to adjust p-values for the multiple testing. 1201 We used adjusted p-value threshold of 0.01 and odds ratio above 2 to define significantly 1202 enriched features (p-value adjusted < 0.01; odds ratio < 0.5 for significantly depleted). 1203 First, we tested enrichment of housekeeping genes, transcription factors and TATA-box 1204 promoter motifs in the following pairwise comparisons: (1) broad vs. narrow, (2) narrow-low 1205 vs. two other groups, (3) harrow-high vs. two other groups. P-values were corrected for the 1206 number of tests (9 comparisons). 1207 Next, we tested enrichments of ChIP-seq peaks of 24 transcription factors in the TSS-proximal 1208 regions in the same comparisons as above. 25 TSS-proximal TF features selected by Boruta 1209 algorithm, including two ChIP-seqs for Trl (in embryos at 8-16 and 16-24 hours after 1210 fertilization) from which the one with the overlapping time window was used (8-16h). Since 1211 ChIP-seq peaks were first overlapped with DHS peaks before assigning to genes (see Features 1212 section above), we restricted the analysis of TF enrichments to the genes that have at least one 1213 DHS peak in their TSS-proximal regions. P-values were corrected for 72 comparisons (24*2). 1214 Log2-transformed odds ratios from these tests are shown in Fig. 4c, weak 1215 enrichments/depletions (odds ratios above 0.5 and below 2) are shown in grey, actual values 1216 are provided in Supplementary table 5.  1217 Peaks of Trl and Jarid2 (enriched in narrow-low) also showed weak enrichments in narrow-1218 high, which likely comes from strong depletion of these TFs at the TSSs of broad promoter 1219 genes. To control for that, we also tested enrichments of the same 24 TF peaks in three 1220 comparisons between gene groups: (1) broad vs. narrow-low, (2) broad vs. narrow-high, (3) 1221 narrow-low vs. narrow high (also 72 comparisons for p-values correction). 1222 Results from all Fisher's tests described above are provided in Supplementary table 5