Conserved noncoding transcription and core promoter regulatory code in early Drosophila development

Multicellular development is driven by regulatory programs that orchestrate the transcription of protein-coding and noncoding genes. To decipher this genomic regulatory code, and to investigate the developmental relevance of noncoding transcription, we compared genome-wide promoter activity throughout embryogenesis in 5 Drosophila species. Core promoters, generally not thought to play a significant regulatory role, in fact impart restrictions on the developmental timing of gene expression on a global scale. We propose a hierarchical regulatory model in which core promoters define broad windows of opportunity for expression, by defining a range of transcription factors from which they can receive regulatory inputs. This two-tiered mechanism globally orchestrates developmental gene expression, including extremely widespread noncoding transcription. The sequence and expression specificity of noncoding RNA promoters are evolutionarily conserved, implying biological relevance. Overall, this work introduces a hierarchical model for developmental gene regulation, and reveals a major role for noncoding transcription in animal development.

Multicellular development is largely determined by transcriptional regulatory programs that orchestrate the expression of thousands of protein-coding and noncoding genes. To decipher the genomic regulatory code that specifies these programs, and to investigate globally the developmental relevance of noncoding transcription, we profiled genome-wide promoter activity throughout embryonic development in 5 Drosophila species. We show that core promoters, generally not thought to play a significant regulatory role, in fact impart broad restrictions on the developmental timing of gene expression on a genome-wide scale. We propose a hierarchical model of transcriptional regulation during development in which core promoters define broad windows of opportunity for expression, by defining a limited range of transcription factors from which they are able to receive regulatory inputs. This two-tiered mechanism globally orchestrates developmental gene expression, including noncoding transcription on a scale that defies our current understanding of ontogenesis. Indeed, noncoding transcripts are far more prevalent than ever reported before, with ~4,000 long noncoding RNAs expressed during embryogenesis. Over 1,500 are functionally conserved throughout the melanogaster subgroup, and hundreds are under strong purifying selection. Overall, this work introduces a hierarchical model for the developmental regulation of transcription, and reveals the central role of noncoding transcription in animal development.

INTRODUCTION
with distinct sets of transcription factors. We propose a two-tier model of transcriptional control in which 50 core promoters and enhancers mediate, respectively, coarse-grained and fine-grained developmental 51 regulation. 52 53 We also show that noncoding transcription is far more widespread than anticipated in Drosophila, 54 with 3,973 promoters driving the expression of lncRNAs during embryogenesis. The analysis of these 55 core promoters, most of which are currently unannotated, shows that they largely share the structural and 56 functional properties of their counterparts at protein-coding genes. Through the analysis of their fine 57 structure and sequence conservation, we demonstrate that evolutionarily conserved lncRNAs promoters 58 are under strong purifying selection at the levels of primary sequence and expression specificity. We 59 functionally characterize the schnurri-like RNA (slr) locus, which expresses a lncRNA in a highly 60 conserved spatiotemporal pattern suggestive of a role in early dorsoventral patterning. 61 62 In summary, these results uncover a major active role for core promoters in regulating 63 transcription, by defining windows of opportunity for activation by enhancer sequences. They also reveal 64 a vastly underappreciated aspect of developmental transcriptomes, by showing that noncoding 65 transcription is extremely prevalent, tightly regulated and, crucially, deeply conserved. 66 67 68 69 70 RESULTS 71 72 Global multispecies profiling of developmental promoter activity 73 74 To explore the genome-wide dynamics of transcriptional regulation and their evolution, we 75 generated developmental transcriptome profiles at both very high temporal resolution (1 hour) and high 76 sequence coverage (137-180 million read pairs per species) for 5 Drosophila species spanning 25-50 77 million years of evolution: D. melanogaster, D. simulans, D. erecta, D. ananassae and D. pseudoobscura 78 ( Fig. 1A; total of 120 samples). We focused on embryonic development, a crucial period during which 79 the body plan is established and all larval organs are generated. This data allows direct, genome-wide 80 comparisons of promoter activity in a phylogenetic framework (Fig. 1B). 81 82 In order to map transcription start sites (TSSs) with single-base resolution and accurately measure 83 the activity of individual promoters, we used a high-fidelity method called RAMPAGE 37 based on high-84 throughput sequencing of 5'-complete complementary DNA (cDNA) molecules. It offers unprecedented 85 specificity and detection sensitivity for TSSs, and its multiplexing capabilities allow for the seamless 86 acquisition of high-resolution developmental time series 37 . Since eukaryotic promoters often allow 87 productive transcription initiation from multiple positions 37-40 , we use a dedicated peak-calling algorithm 88 to group neighboring TSSs into TSS clusters (TSCs) corresponding to individual promoters 37,41 . 89 90 For each species, we identified 2.2x10 4 to 2.7x10 4 high-confidence TSCs. The narrow distribution 91 of raw RAMPAGE signal ( Fig. 1C & S1) and of TSCs (Fig. S2) over annotated loci confirms our very 92 high specificity for true TSSs. The quantification of promoter expression levels is highly reproducible 93 across biological replicates ( Fig. 1D-E). Importantly, paired-end sequencing of cDNAs allows for 94 evidence-based assignment of novel TSCs to existing gene annotations, and provides valuable 95 information about overall transcript structure (Fig. 1B). We can thus attribute 82% of D. melanogaster 96 TSCs to annotated genes, the remaining 18% potentially driving the expression of unannotated non-97 protein-coding transcripts. The comparison of biological replicates for the D. melanogaster time series confirms our ability 114 to quantitatively measure promoter expression dynamics throughout development. Indeed, this analysis 115 ( Fig. S3) shows excellent reproducibility (Pearson R 2 = 0.95) for TSCs with maximum expression ≥25 116 reads per million (RPM, Fig. 2A & S3), and very good reproducibility (Pearson R 2 = 0.92) with 117 maximum expression ≥10 RPM (Fig. S3). 118 119 Post-synchronization of developmental series was achieved by global alignments of all time 120 series to the D. melanogaster reference to maximize the overall similarity between genome-wide 121 expression profiles 42,43 . The resulting alignments for well-known developmental genes, used here as 122 diagnostic markers, confirm the very high quality of the global alignments ( Fig. 2B & S4). For all genes 123 with one-to-one orthologs, the expression profiles are overall tightly conserved across species (Fig. 2C &  124 S5A), but with substantial gene-to-gene variability: hunchback, for instance, displays considerable 125 conservation in the expression of both of its promoters (Pearson R 2 of 0.88 and 0.97; Fig. S5B), whereas 126 the RpL19 promoter shows rapid divergence (R 2 =0.08; Fig. S5C). 127 128 We found a strong relationship between selective constraints on expression specificity and gene 129 function: indeed, the degree of expression divergence differs substantially between Gene Ontology (GO) 130 annotation categories (Fig. 2D). Functions related to the regulation of transcription and splicing display 131 the strongest conservation of expression, in accordance with the known molecular function of many 132 master regulators of early development. Categories related to the core translational machinery and 133 cytoskeletal structures display much more plastic expression specificities. 134 135 The high similarity of biological replicates, the accuracy of inter-species alignments for well-136 known developmental genes, and the biological features of evolutionary divergence patterns, together 137 confirm our ability to accurately quantify promoter expression and its variation across species. Our 138 observations also highlight the central importance of systems-level selective constraints, such as those 139 acting on gene function and developmental stages, in shaping the evolution of gene expression .  140  141  142  143  144 Core promoter structure defines broad developmental phases of gene expression 145 146 We leveraged our multispecies expression data to study promoter structure-function relationships, 147 and thus gain insights into the regulatory code that determines developmental gene expression. We 148 focused on a set of 3,462 promoters functionally active in all species that we classified either as 149 housekeeping (<5-fold variation throughout development) or as developmentally regulated (≥60% of 150 total expression within an 8-hour window). The developmentally regulated group was further clustered 151 based on temporal correlation, yielding a total of 9 distinct expression clusters (Fig. 3A, see Methods). 152 These thresholds were chosen to maximize the total number of promoters included and the similarity of 153 profiles within each cluster, while still yielding clusters large enough for statistical analysis. These 9 expression clusters fall into 3 main groups, defined by the core motif composition of the 173 promoters (Fig. 3B). Indeed, we unexpectedly observed robust enrichments for specific sets of motifs in 174 the promoters of all individual expression clusters. Three major classes emerge, within which motif 175 enrichments are relatively homogeneous (Fig. 3B). 176 177 Remarkably, these three classes define distinct temporal phases of embryonic development ( Fig.  178 3A-B Importantly, clustering core motifs by their tendency to co-occur within the same promoters, 186 regardless of expression timing, recapitulates the same three main motif groups (Fig. 3C). It does, 187 however, also uncover a finer data structure, which suggests that there are a variety of promoter subtypes. 188 In addition, we were able to train a classifier with substantial predictive power to distinguish the three 189 promoter classes based solely on core motif scores (Fig. 3D). Taken together, our observations suggest 190 that core promoter elements play a significant role in restricting windows of opportunity for expression 191 during distinct periods of development. 192 193 Sequences proximal to the three core promoter classes are also enriched for different sets of 194 TFBSs (Fig. 3E). The Early class preferentially harbors binding sites for Dref and Dfd, while the 195 Intermediate class favors Trl (GAGA motif) and Adf-1. The Late class is enriched in pan, srp and pnr 196 sites. The presence of most core promoter motifs and TFBSs is conserved between species far beyond 197 random expectations (Fig. 3F), which validates the overall quality of our motif predictions. Interestingly,198 TFBSs are often specific for only a subset of expression clusters within a class (e.g., Dfd or GAGA). 199 This suggests a model in which core promoter structure defines broad developmental periods of 200 expression potential, and precise expression timing is then refined by sequence-specific transcription 201 factors. Combinatorial encoding through stereotypical sets of TFBSs is likely to sharpen expression 202 patterns even further ( Fig. S6A-B). 203 204 The analysis of favored pairings between individual TFBSs and core promoter motifs suggests a 205 possible mechanism to mediate this 2-step specification of expression patterns. Indeed, some TFBSs 206 appear to be strongly associated with specific sets of core promoter motifs (Fig. 3G). Dref sites, for 207 instance, are preferentially found along DRE and Ohler-1/6/7 core motifs. Likewise, twi and srp TFBSs, 208 which often tend to be found together (Fig. S6A), have a robust association with INR and MTE. These 209 strong affinities suggest that core motifs may tune the ability of a promoter to respond to specific sets of 210 transcription factors. They may do so by recruiting different sets of general transcription factors (GTFs) 211 that functionally interact with distinct groups of activators. Such a mechanism may channel various 212 regulatory inputs to limited subsets of promoters and thus limit crosstalk between parallel pathways. 213 214 We found that the three promoter classes display markedly different profiles of sequence 215 conservation ( Fig. 3H & S6C). Importantly, this analysis only includes those promoters for which we 216 have detected transcriptional activity in all 5 species, and we can therefore categorically rule out 217 differences in rates of promoter gain/loss as an explanation. These observations suggest that the 3 218 promoter classes indeed have intrinsic structural differences that make them subject to distinct regimes of 219 natural selection and sequence evolution.

244
Selection on expression specificity shapes promoter sequence evolution 245 246 To further explore the selective pressures acting on regulatory sequences, we investigated 247 quantitative relationships between the evolution of promoter structure (primary sequence) and function 248 (expression specificity). We report a subtle but highly significant correlation between the conservation of 249 promoter sequence and that of expression profiles ( Fig. 4A-B). This shows that the effects of selection on 250 expression specificity are reflected in the evolution of promoter sequences. The main areas of differential 251 sequence conservation overlap regions preferentially occupied by precisely positioned core promoter 252 elements (TATA, INR, DPE) and transcription factor binding sites (Fig. 4A). Importantly, this 253 correlation does not hold for sequences >50bp downstream of the TSS, which are likely subject to 254 additional selective pressures on 5'-UTR and protein-coding sequences. Promoters with highly divergent 255 8 expression profiles are depleted of sites under purifying selection, and enriched for sites under positive 256 selection ( Fig. 4C-D). 257 Ab initio motif discovery within the regions of differential sequence conservation returned the 258 canonical consensus sequences for TATA, INR and DPE (Fig. 4E). We found that promoters with highly 259 conserved expression profiles tend to have core promoter elements whose sequence is closer to the motif 260 consensus (Fig. 4F), and those in turn tend to be more conserved at the sequence level (Fig. 4G). 261 Upstream flanking sequences also tend to show higher conservation of individual TFBSs (Fig. S7A). 262 Importantly, it is possible to detect such a correlation for the binding sites of some individual 263 transcription factors (Fig. S7B). This is a rather striking fact, considering that promoter-proximal 264 enhancers generally bind more than one factor, and that many promoters are additionally regulated by 265 distal enhancers not taken into account here. Finally, the conservation of promoter TFBS composition, as 266 expected, also correlates with interspecific divergence (Fig. S7C).  These observations establish a clear relationship between core promoter sequence and expression 287 specificity. Selective constraints on developmental expression timing act particularly strongly on core 288 promoter elements, most notably Initiator, DPE and TATA elements. This is consistent with the idea that 289 core promoter motifs are indeed key determinants of this specificity. Together, our results further support 290 the hypothesis that core motifs and general transcription factors play a crucial role in determining 291 promoter expression specificity. 292 293 294 295 Promoter birth and death are widespread and dynamic 296 297 In addition to changes in the specificity of individual promoters, gene expression programs 298 evolve through turnover of regulatory elements 36,45,46 . And indeed, we observed widespread birth and 299 death of promoters throughout the clade: only 49% of D. melanogaster TSCs are functionally conserved 300 in all 5 species (Fig 5A). To rule out genome assembly artifacts, we restricted our analysis to those with 301 syntenic alignments to other genomes, and measured a functional conservation rate of 75% ( Fig 5A). As 302 some peaks lack alignments owing to genuine insertions or deletions, we expect the true conservation 303 rate to be within the 49-75% range. Analyzing TSC conservation between species pairs, or from a D. 304 simulans-centric perspective, yields similar conclusions (Fig. S8). 305 306 The analysis of replicates for the D. melanogaster time series shows the false positive rate for 307 gain/loss event detection to be under 0.1% (Fig. 5A). Although TSCs with lower expression levels tend 308 to be less conserved, general trends are shared between TSCs of all expression levels (Fig. S9). Even 309 though a gain/loss of expression during embryogenesis may reflect a shift in specificity rather than a 310 complete gain/loss of function, the vast majority (91.4%) of D. pseudoobscura TSCs that were inferred 311 to be lost in D. melanogaster based on embryo data are never expressed at any other stage of the life 312 cycle (analysis of published data 37 ). Therefore, we conclude that our strategy accurately and robustly 313 detects promoter gain and loss events. Consistent with this, we can reconstruct the known species 314 phylogeny by treating the presence or absence of individual TSCs as binary discrete characters in a 315 standard parsimony framework (Fig. 5B). Functional conservation is reflected in sequence conservation: 316 promoters active in all species display far higher sequence conservation than species-specific ones, which 317 appear no more constrained than surrounding regions (Fig 5C & S10). 318 319 While purifying selection clearly plays a major role, a sizeable proportion of TSCs (25-50%) are 320 not shared between all species, underscoring the inherently fluid nature of the regulatory landscape. 321 Comparison to published data on the evolution of Twist binding sites 36 revealed that overall, promoters 322 evolve no more slowly, and possibly even faster, than TFBSs do (Fig. S11). Although a rigorous 323 comparison between such disparate data types is delicate, this shows that promoters and TFBSs do not 324 turn over on vastly different timescales. 325 326 Our data also reveals the existence of several thousand novel promoters that cannot be assigned to 327 any annotated genes. Many of those may drive the expression of long noncoding transcripts, and it has 328 long been a matter of debate to what extent this transcriptional activity plays meaningful biological roles. 329 Hence we analyzed their rates of gain and loss to explore the selective pressures they may be subject to. 330 We found a stark contrast in the degree of functional conservation of the two classes, with novel non-331 genic promoters evolving at a substantially higher rate than genic ones (Fig. 5D). There is, however, a 332 very substantial proportion that is deeply conserved, suggesting the possibility of widespread 333 functionality. The discrepancy between the classes may be due to a larger proportion of noncoding 334 transcripts being devoid of biological roles and evolving neutrally. Alternatively, it may instead reflect a 335 more pronounced tendency for noncoding transcription to take on lineage-specific roles and thereby be a 336 driver of adaptation, as has been suggested before. To gain a better understanding of these questions, we 337 sought to investigate in more depth the evolution of noncoding transcription. 338 339 340 341 342 The prevalence and biological relevance of noncoding transcription have long been major areas of 359 contention. Attempts at resolving these issues using genomic sequence conservation have been largely 360 inconclusive, probably due to minimal selective constraints on the primary sequence of these 361 transcripts 27,28,47 . Studying promoter activity experimentally in a phylogenetic framework provides a 362 unique opportunity to rigorously address the question of lncRNA conservation and functionality. 363 Furthermore, our ability to pinpoint TSSs with single-nucleotide accuracy gives us unprecedented 364 leverage to elicit elusive sequence conservation patterns. Furthermore, beyond sequence conservation, 365 we are also in a position to assess selective constraints on the expression specificity of these promoters. 366 367 We found 3,682 embryonic TSCs in D. melanogaster that could not be functionally linked to any 368 annotated protein-coding or small RNA gene, and could therefore represent putative lncRNA promoters. 369 We also identified TSCs for 291 annotated lncRNAs, bringing the total up to 3,973. Their developmental 370 expression kinetics appear to be diverse and exquisitely stage-specific (Fig. 6A). 371 372 The analysis of published genome-wide DNaseI hypersensitivity data 48 confirmed that these 373 putative lncRNA TSCs are likely to correspond to genuine promoters ( Fig. 6B & S12). In addition, to 374 verify that the transcripts expressed from these TSCs are indeed independent and devoid of any 375 significant protein-coding potential, we built transcript models from a recently published RNA-seq 376 developmental time course 25 . We successfully generated transcript models for 16,105 TSCs, including 377 1,475 lncRNA TSCs. Most of them appear to correspond to full-length transcripts, and the vast majority 378 of putative lncRNAs do not overlap annotated protein-coding sequences (Fig. S13). Analyses of protein-379 coding potential confirm that the overwhelming majority of transcripts are unlikely to encode proteins, or 380 even peptides as short as 10 amino acids (Fig. 6C & S13). Transcripts from 18 loci are likely to encode 381 short open reading frames (sORFs, <100 residues). We conclude that the vast majority of candidate 382 transcripts are likely to be genuine lncRNAs. 383 384 The expression profiles of lncRNA promoters have unique properties that set them apart. As a 385 class, they tend to be substantially more stage-specific than their protein-coding gene counterparts, as 386 measured by the Shannon entropy of their activity profiles (Fig. 6D). Comparing their developmental 387 expression timing to that of protein-coding gene promoters suggests a broad diversity of potential 388 developmental functions (Fig. S14 & Table S1). 389 390 The detection of lncRNA TSCs is highly reproducible across D. melanogaster biological 391 replicates (Fig. 6E). Of all D. melanogaster lncRNA TSCs, 2,016 can be aligned to the D. pseudoobscura 392 genome assembly and 1,111 are functionally conserved (Fig. 6E), suggesting that they have been 393 maintained since the last common ancestor of these two species. In order to investigate whether lncRNA 394 promoters are under purifying selection, we focused on an extremely stringently selected set of 631 TSCs 395 that are active in all 5 species. This set includes well-known essential noncoding transcription units, such 396 as bithoraxoid 49,50 (Fig. S15) and roX1 51 . Overall, the level of sequence conservation at these functionally 397 preserved lncRNA promoters is similar to that observed at protein-coding gene promoters (Fig. 6F). 398 Their developmental expression specificity is also more constrained than that of many protein-coding 399 gene promoters (Fig. 6G). Both observations taken together argue strongly for sustained selective 400 pressure on these 631 lncRNA promoters for 25-50 million years. Furthermore, many TSCs of interest 401 were excluded from this analysis simply because of the poor quality of genome assemblies, and this is 402 therefore an extremely conservative set. Therefore, to place a more reasonable lower bound on the true 403 number of conserved lncRNA promoters, we focused on those shared between the 3 species of the 404 melanogaster subgroup. These 1,529 promoters similarly display a high degree of sequence conservation 405 within the subgroup, and their expression specificity also appears constrained (Fig. S16). 406 407 Still, it remains that lncRNA promoters as a class evolve much faster than those of protein-coding 408 genes (Fig. 5D), and it has been a matter of debate whether this reflects lineage-specific functions or 409 merely neutral evolution. To address this question, we focused on 195 lncRNA TSCs that are specific to 410 the melanogaster subgroup, despite the orthologous sequences being present in the genomes of the two 411 outgroups (Fig. 6H). Surprisingly, they display the same degree of sequence conservation as the protein-412 coding gene promoters that are shared throughout the subgroup (Fig. 6H). The assessment of 413 conservation at orthologous sequences in outgroup species confirms that the selective constraints are 414 indeed lineage-specific (Fig. S17). This argues that these evolutionarily recent lncRNAs have come 415 under purifying selection after acquiring lineage-specific functions.  To validate our findings in one specific case, we focused on the FBgn0264479 locus, which 451 displays one the most tightly conserved expression patterns among all the lncRNA genes in our dataset. 452 This is an intriguing embryonic transcript that, although it has been annotated based on expressed 453 sequence tag (EST) data, has to our knowledge never been characterized. The 0.5kb FBgn0264479 RNA 454 is extremely unlikely to encode functional peptides, as assessed by phyloCSF analysis (score of -217.3) 455 and manual curation (Fig. S18). It is highly expressed in all 5 species surveyed, in a strikingly conserved 456 temporal pattern restricted to a ~3-hour period encompassing the onset of gastrulation (Fig. 7A-B & 457 S19). Northern-blot analysis confirmed the size and expression dynamics of the transcript (Fig. 7C &  458  S20). 459 460 The body of the transcription unit displays hallmarks of robust purifying selection within the 461 melanogaster subgroup (Fig. 7A). In addition, publicly available chromatin immunoprecipitation data 52,53 462 reveals the binding of several transcription factors to the promoter region (Fig. S19), and their putative 463 binding sites identified by sequence motif search also show evidence of purifying selection (Fig. S21). 464 The expression dynamics of the transcription factors are consistent with their regulating the 465 FBgn0264479 promoter (Fig. S21). 466 467 Fluorescent in situ hybridization (FISH) in early embryos revealed expression along the ventral 468 and dorsal midlines at the late blastoderm stage, with the exclusion of lateral regions, the primordial 469 germ cells and the prospective head (Fig. 7D). Based on this distinctive expression pattern, highly 470 reminiscent of the well-characterized schnurri gene 54 , we renamed this lncRNA gene schnurri-like RNA 471 (slr). This early expression domain subsequently evolves into a complex segmented pattern by the end of 472 germband extension (Fig. 7D). In the late blastoderm, the RNA is found almost exclusively in the 473 cytoplasm at the apical pole of the cells (Fig. 7D). It appears at that stage to be generally concentrated in 474 a single major focus per cell (Fig. 7E). Secondary structure predictions show that the slr transcript is 475 likely to be highly structured (Fig. 7F), suggesting a high potential for RNA-protein interactions. 476 477 Although further characterization will be required to decipher the precise biological roles of this 478 lncRNA, our observations confirm its noncoding nature, definitively establish its developmental 479 expression specificity, and provide clear evidence that it has been under strong purifying selection over at 480 least 25-50 million years. This work provides, to our knowledge, the first genome-wide overview of promoter evolution in 505 Drosophila, and we leveraged this comparative framework to study the sequence determinants of 506 developmental expression specificity. Through nucleotide-resolution mapping of TSSs and quantitative 507 measurements of expression kinetics, our study yields new insights into the transcriptional regulatory 508 code. We find that distinct classes of core promoters drive transcription in three broad phases of 509 embryonic development. Each class is defined by a characteristic set of core motifs, and is associated 510 with regulation by specific groups of transcription factors. Of note, we successfully detected in vivo some 511 functional associations between Dref and housekeeping promoters, and between Trl and some 512 developmentally regulated promoters, that were recently described in vitro 18 (Fig. 3). Our analysis 513 generalizes the concept of specific interactions between core promoters and enhancer sequences, and 514 demonstrates for the first time its global relevance in a developmental context. 515 516 We propose a hierarchical model for transcriptional regulation in which core promoter syntax 517 defines broad temporal windows of opportunity for activation, and precise expression timing is 518 subsequently refined by the binding of sequence-specific activators and repressors at enhancers. Core 519 promoters may restrict regulatory inputs by recruiting different sets of general transcription factors 520 (GTFs) that functionally interact with distinct groups of transcription factors 20 . Some GTFs have been 521 shown to shape the expression specificity of individual promoters, and it is known that different 522 activators and repressors have distinct requirements for cofactors and GTFs 18,20 . Such a mechanism may 523 channel regulatory inputs to limited subsets of promoters, and thus limit crosstalk between promoters and 524 enhancers across the genome. Notably, what applies to time may also apply to space, and it is possible 525 that similar core promoter/enhancer interplay hierarchically specifies gene expression in broad 526 developmental lineages and individual cell types. 527 528 Evolutionary analysis of developmental expression specificity further supports this model. First, 529 the three major classes of core promoters defined here show drastically different patterns of sequence 530 evolution, suggesting substantial differences in their underlying structure and functional interactions with 531 the transcriptional machinery. At a finer scale, the conservation of expression specificity across species 532 correlates with the degree of sequence conservation at canonical core promoter elements, such as TATA 533 boxes and Initiator motifs. This is highly suggestive of an instructive role for these sequence elements, 534 once thought generic, in defining developmental gene expression patterns. 535 536 Approximately 4,000 promoters were found to drive the expression of lncRNAs during 537 embryogenesis, a strikingly high number for this very brief developmental period. Our findings likely 538 apply to other developmental stages as well, as we previously reported the existence of 7,421 putative 539 lncRNA promoters in an analysis of the whole D. melanogaster life cycle 37 . In addition, we detected the 540 expression of only 205 of 1,119 recently identified lncRNAs 27 . This suggests that we are only beginning 541 to scratch the surface of lncRNA biology in Drosophila. Importantly, we show here that vast numbers of 542 these promoters are under strong selective pressure, at the levels of both promoter sequence and 543 expression specificity. A melanogaster subgroup core set of at least 1,529 is under substantial selective 544 constraint, and most of those are therefore highly likely to have biologically relevant activities. 545 546 In agreement with previous reports 28,47,55 , we find that lncRNA genes evolve faster than their 547 protein-coding counterparts. It has been an unresolved debate so far whether this reflects neutral 548 evolution or lineage-specific functions. Our observation that lineage-specific lncRNAs are also under 549 substantial selective pressure reveals that noncoding transcription may be a major driver of phenotypic 550 diversification and organismal adaptation. We recently showed that transposable elements play an 551 important role in the evolutionary gain of promoters, and in particular of lncRNA promoters 37 . We 552 propose that transposon proliferation is a major mechanism favoring the neofunctionalization of 553 intergenic regions as sources of biologically active noncoding transcripts.

555
To open a window into the biology of developmentally regulated lncRNAs, we focused on 556 schnurri-like RNA, a deeply conserved yet never-before characterized gene. We experimentally validate 557 the existence and expression kinetics of the slr transcript, and demonstrate that both its promoter 558 sequence and its developmental expression specificity are deeply conserved across drosophilids. 559 Interestingly, this lncRNA is expressed in a spatial profile highly similar to that of the schnurri gene, 560 which is part of the Dpp (TGF-β/Smad) signaling pathway and plays an essential role in the 561 establishment of dorsoventral embryo polarity 54,56 . The punctate cytoplasmic localization pattern of the 562 slr lncRNA is reminiscent of the targeting of multiple Dpp pathway components to endosomes in larval 563 wing discs 57-59 . Endosomes localize apically in the late embryonic blastoderm 60 , and mutants for Sara, 564 the Smad endosome-targeting factor 57,59 , die early in embryogenesis 57 , suggesting that endosome-based 565 signaling is also essential at that stage. Taken together, our observations are suggestive of a possible role 566 for the slr lncRNA in TGF-β signaling, a crucial pathway in all animals that plays a major role in human 567 disease, including cancer. 568 569 In recent years, it has become clear that noncoding transcription serves a myriad of molecular 570 functions in Eukaryotes, and plays a part in virtually every known biological process 29-32 . LncRNAs have 571 been shown to regulate transcription and chromatin structure, as well as mRNA stability and protein 572 localization. Sometimes it is the transcription of the locus itself that plays a mechanistic role, rather than 573 the resulting transcript -as in the case of the upstream bxd promoter 50 , which has one of the most highly 574 conserved expression profiles that we have observed. Our work unambiguously demonstrates the 575 biological relevance of noncoding transcription to developmental processes, and establishes D. 576 melanogaster as an excellent model for exploring the diverse functions of lncRNA genes. We expect that 577 systematic efforts on a larger scale will illuminate the biology of a long-ignored class of genes that has 578 proven its worth. 579 580 581 582 583 ends of an 800-bp window centered on the middle of the peak had to map to the same strand of the same 650 chromosome or scaffold, 50% of bases had to be aligned (i.e., not in assembly gaps), and 25% of bases 651 had to align to orthologous bases (not alignment gaps). Raw 5' signal for each genome was also 652 translated into multiple alignment coordinates. For each peak from each species, functional conservation 653 was assessed by counting the number of RAMPAGE tags in each species. A peak was considered absent 654 in a target species if it had at least a 100-fold lower signal than in the reference species. Peaks with <100 655 tags in the reference species were considered absent if they had no detectable signal in a target species. 656 657 Phylogeny reconstruction: 658 The peaks from all species were merged and collapsed in multiple alignment space to generate a non-659 redundant set of all peaks in the clade. The conservation of these peaks was assessed as described above. 660 The phylogenetic tree was inferred by treating the presence/absence of each peak as a 2-state discrete 661 character, sequentially using the MIX  release 2) were used to match gene expression profiles between species. We pre-processed pairs of 686 datasets (D. melanogaster and another species) to compensate for differences in annotation quality and 687 peak calling between species. We identified orthologs of TSCs that had detectable expression (≥10 tags) 688 but initially failed to be called in one species. In addition, when a functionally conserved TSC had been 689 attributed to an annotated gene in one species but not the other, we corrected this discrepancy by 690 attributing it to the gene in both species. For the D. ananassae dataset, the 8 th time point failed to yield 691 acceptable data, and was excluded from the analysis. All time series were upsampled 5-fold and 692 smoothed with a 2-hour window size using RZ- Clustering of TSC expression profiles: 707 We classified all D. melanogaster TSCs with maximum expression ≥10RPM (n=11,900) as either 708 Housekeeping (<5-fold variation throughout the time series, n=587) or Developmentally Regulated 709 (≥60% of total expression within an 8-hour window, n=6,015). We further selected TSCs functionally 710 conserved in all 5 species (n=240 and n=3,824, respectively). Developmentally regulated TSC profiles 711 were hierarchically clustered (R hclust, distance metric 1 -cor(t(expr), method="pearson")) and initially 712 grouped into 12 clusters (R cutree, k=12). After filtering out excessively small clusters (<200 TSCs, 713 n=4), further analysis was conducted on the remaining 8 regulated clusters (n=3,222 TSCs Cuffmerge was used to generate a consensus annotation set. Transcript models were attributed to a 719 RAMPAGE TSC if their 5' end lay within 150 of that TSC. Models without a matching TSC were 720 excluded from further analyses. phyloCSF was run on these annotation sets and the 15-way multiZ 721 whole-genome alignments. 722 723 Analysis of sequence motifs: 724 We used the MEME Suite v4.9.0 (primarily FIMO and MAST) to search promoter regions for a 725 previously published compendium of motifs 62 corresponding to core promoter motifs and TFBSs. 726 MEME was used to generate consensus motifs from specific subregions of RAMPAGE-defined 727 promoters (Fig. 4) FBgn0264479 promoter sequence analysis: 735 We identified potential regulators of the FBgn0264479 promoter based on the BDTNP transcription 736 factor ChIP-chip data available from the UCSC genome browser website 52,53 . For 10 factors that bind the 737 promoter in embryos (bcd, cad, D, dl, ftz, gt, h, hb, Kr, twi), we searched the 600bp upstream of the main 738 TSS for Jaspar TFBS motifs (http://jaspar.genereg.net), using FIMO. We identified 7 motifs at a p-value 739 cutoff of 2x10 -4 . 740 741 Northern-blot: 742 We ran 8µg of total RNA per sample on an 8% acrylamide 8M urea gel with a Invitrogen Novex minigel 743 system. Transfer to a nylon membrane was carried out in 0.5X TBE in a Novex XCell II module, 744 followed by UV- crosslinking (1,200J). For detection, we used a combination of 6 oligonucleotide probes 745 targeting FBgn0264479 (5'-gaacatcgcttgcagtgcag, 5'-cgatggatgttgtcggtcgg, 5'-ctctcgttctttgattcttc, 5'-746 caggatgtgtggtgttccac, 5'-agattggatccttatggttg, 5'-atatgctgacactgcatggt). 30 pmol of oligo mix were 747 radioactively labeled with γ 32 -ATP and PNK. Following phenol-chloroform extraction, the labeled 748 probes were hybridized for 2 hours at 42ºC in 40mL of ULTRAHyb buffer (ThermoFischer). After serial 749 washes with decreasing concentrations of SSC buffer (final stringency 0.5X), the membrane was exposed 750 on Kodak BioMax autoradiography film. After stripping and control re-exposure, a similar protocol was 751 used to detect the 5S rRNA on the same membrane, using a single probe (5'-caacacgcggtgttcccaagccg).     Figure 10. Phylogeny of sequenced species The species for which we gathered data appear in red. When assessing sequence conservation for features conserved across all 5 species studied, we included the genome sequences of the 5 species studied and the 3 additional sequenced species from the same monophyletic group (8-way alignment). When assessing conservation throughout the melanogaster subgroup, we used the genomes of our 3 species and the other 2 from the subgroup (5-way alignment).    Figure 13. Independence and protein-coding potential of putative lncRNAs We used published D. melanogaster embryo RNA-seq data and Cufflinks to generate transcript models, and only considered those starting within 150bp of a TSC. (A) Concordance of annotated and Cufflinks-modeled 3énds. For each Cufflinks model starting at a protein-coding gene TSC, we are representing the distance to the closest annotated transcript 3énd. (B) Fraction of TSCs for which at least one transcript model overlaps a downstream annotated CDS. Note that while a minority of lncRNA TSC transcript models overlap a downstream protein-coding genes, we observed a similar propensity of Cufflinks models to fuse together consecutive protein-coding genes. (C) PhyloCSF score distributions of protein-coding and putative lncRNA transcript models, as a measure of protein-coding potential. For each TSC, we only considered the transcript model with the highest score. Transcripts with no ORF were assigned a default score of -2,000. Red lines: scores for known lncRNAs (roX1, roX2, bithoraxoid). (D) PhyloCSF analysis restricted to TSCs that are functionally conserved across all 5 species.

8-way Alignment
Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5 Cluster 6 Cluster 7 Cluster 8 Cluster 9 Cluster 10  Figure 14. Clustering of D. melanogaster developmental expression profiles In order to analyze coexpression between putative lncRNA and protein-coding gene promoters, we first grouped protein-coding gene promoter expression profiles by k-means clustering (10 clusters) to define reference coexpression sets (A). We then clustered both lncRNA and protein-coging gene promoter profiles together, and extracted the lncRNA promoter profiles corresponding to each expression cluster previously defined (see B).

A T G T T C T C C G A C C G A C A A C A T C C A T C G C G G A C C G A C G T G A A G G C A G C T C C T G G C G G A G C T T C A C G T A T C C T A G G C G -----CT G G A T C T C
C FBgn0264479 ORF 2 alignment 10 20 30 40 50 60 dm3/1-54 droSim1/1-64 droSec1/1-64 droEre2/1-54 droYak2/1-50 droAna3/1-37 droPer1/1-33 dp4/1-33