Entire genome transcription across evolutionary time exposes non-coding DNA to de novo gene emergence

Even in the best studied Mammalian genomes, less than 5% of the total genome length is annotated as exonic. However, deep sequencing analysis in humans has shown that around 40% of the genome may be covered by poly-adenylated non-coding transcripts occurring at low levels1. Their functional significance is unclear2,3, and there has been a dispute whether they should be considered as noise of the transcriptional machinery4,5. We propose that if such transcripts show some evolutionary stability they will serve as substrates for de novo gene evolution, i.e. gene emergence out of non-coding DNA6–8. Here, we characterize the phylogenetic turnover of low-level poly-adenylated transcripts in a comprehensive sampling of populations, sub-species and species of the genus Mus, spanning a phylogenetic distance of about 10 Myr. We find evidence for more evolutionary stable gains of transcription than losses among closely related taxa, balanced by a loss of older transcripts across the whole phylogeny. We show that adding taxa increases the genomic transcript coverage and that no major transcript-free islands exist over time. This suggests that the entire genome can be transcribed into polyadenylated RNA when viewed at an evolutionary time scale. Thus, any part of the “non-coding” genome can become subject to evolutionary functionalization via de novo gene evolution.

possibilities have been discussed by which new transcripts can arise, including single mutational events 10 , 23 stabilization of bi-directional transcription 16 and insertion of transposable elements with promotor activity 17 . These 24 events were initially thought to be rare 6 , but an increasing number of studies show that de novo gene emergence 25 is a rather active mechanism 18-22 . Surveys across phylogenetic times have shown that the highest gene emergence 26 rates are found in youngest taxa 7 . This led to the prediction that high emergence rates must be balanced with high 27 loss rates, because gene numbers do not grow much over time 7 . A comparison of open reading frame turnover 28 rates of de novo evolved genes among Drosophila species has shown that this is indeed the case 22 .

29
Here we assessed the numbers of new transcript gains in a comprehensive phylogenetic framework. Given that the 30 emergence of a new stable transcript is a prerequisite for evolving a new functional gene, we expect that the 31 transcript emergence rate is a key parameter in the process of de novo gene emergence. Stable transcripts can be 32 2 created out of combinations of cryptic functional sites, including a minimal promoter, splicing signals and poly-33 adenylation sites. This applies to completely new transcripts and to modification of existing transcripts by adding 34 new exons from non-coding DNA. Given the widespread presence of cryptic functional sites in genome sequences, 35 a single mutational step can convert a non-transcribed or a spuriously transcribed genome region into a stable 36 transcript, as has been shown for Pldi, the first documented de novo gene in the mouse 10 . 37 Importantly, once a genomic region becomes transcribed, most subsequent mutations within the transcribed 38 region will not lead to a loss of the transcript, since only a few sites are responsible for active and stable 39 transcription. Hence, one can predict that the de novo transcript emergence dynamics would show a higher rate of 40 gain than loss at short evolutionary time scales. Hence, transcriptional gain would constitute a powerful 41 mechanism to continuously expose new genome regions to evolutionary testing, providing the fuel for de novo 42 gene emergence.

43
To test this prediction, we selected species, subspecies and populations related to the house mouse (Mus 44 musculus -suppl. Table S1) as a phylogenetic framework for identifying the emergence and loss of new transcripts.

45
The taxa chosen span approximately 10.5 Myr of evolutionary divergence and represent up to ~5.6% overall 46 genomic divergence (Figure 1, suppl. Table S2). Using such closely related taxa ensures that most neutrally evolving 47 sequences can be reliably mapped across all species 23 . We generated genome sequences for species without 48 published genomes, and transcriptome sequences for brain, liver and testis for all taxa (suppl . Tables S3-S5).

49
For comparative transcriptome analysis, we identified all mappable regions of the M. m. domesticus reference 50 genome (C57Bl/6) 24 using genomic reads from all studied species. We call this the "common genome", 51 representing the total genome where transcript mapping across taxa is reliable. We used a mapping algorithm that 52 was specifically designed to deal with the polymorphisms occurring under cross-species mapping conditions 23 .

53
We first focused on genome-wide signals of transcriptional activity to identify the origin of new transcripts within 54 the phylogeny (suppl . Table S8). For this purpose, we determined the base-wise transcriptome coverage from poly-55 adenylated RNA for each species. This measure of coverage includes both annotated genes and previously un-56 annotated transcripts, whereby the latter are the majority. We set single read coverage as the lower cutoff 57 because we were specifically interested in detecting low-level transcription as an early sign of de novo gene 58 emergence. However, we also report results using a stringent cutoff of five reads for comparison (the median 59 coverage across all transcripts is 3.7).

60
When comparing transcriptome coverage among taxa, we find that the overall proportion of shared transcripts is 61 higher for closely related taxa than for distantly related pairs. Consequently, a phylogenetic tree reconstructed 62 based on shared transcript coverage mirrors the species tree ( Figure 1B, C). This detectable phylogenetic signal in 63 3 transcription coverage suggests that transcripts gained at a given point in evolutionary time are sufficiently stable 64 to be retained in sister taxa, implying that they can become exposed to evolutionary testing.

65
The total transcript coverage of the common genome across all species combined for all three tissues is 67% in our 66 data set. Coverage was highest in testis (53.4%), intermediate in brain (41.5%) and lowest in liver (23.5%) ( Figure   67 2A). When comparing all transcriptional gains versus losses across the surveyed phylogeny, we observe that gains 68 are indeed more frequent than losses ( Figure 2B, C), thus confirming our initial hypothesis.

69
Recording base-wise coverage without paying attention to gene models entails the risk that one is also measuring 70 transcriptomic noise, i.e. spurious random transcriptional initiation in a subset of cells of the tissue under 71 investigation. We have specifically explored the noise issue through deeper sequencing of the brain samples of all 72 taxa. The brain is a complex tissue in which some transcripts are expected to occur only in a small subset of cells.

73
These rare transcripts should become detectable by deep sequencing and thus transcript coverage should increase 74 with more reads available. This is indeed the case; however rarefaction curves and their projections reach 75 saturation for each of the taxa (suppl. Table S6 and Figure S1). This observation argues against a significant amount 76 of transcriptional noise in our data, since noise should lead to a continuous increase of coverage with sequencing 77 depth, at least if noise is randomly distributed across the genome and across the cells of the tissue. Further, it rules 78 out a possible problem with DNA contamination, as this would also be expected to rise with increasing sequencing 79 depth.

80
We have further explored the rarefaction principle to assess whether adding more sequences or more taxa to the 81 data set leads to higher transcriptomic coverage of the common genome. Taking all aggregated reads (including 82 the additional deep sequencing data from brain) across all tissues for all taxa, saturation is reached at 78.5% 83 coverage for sequencing depth ( Figure 2D), but no saturation is reached with the number of taxa used here ( Figure   84 2E). Hence, adding more taxa within this phylogenetic framework, for example species and sub-species on the 85 Apodemus branch, would predictably lead to increasingly higher transcriptomic coverage of the common genome, 86 up to the entire genome when ~38 taxa were surveyed within the phylogeny (based on the intercept of the 87 regression curve).

88
This analysis suggests that there may be no regions that are not transcribed at some point in phylogenetic time.

89
However, genome annotations in a given species usually show an uneven distribution of transcripts; some regions 90 harbor many clustered transcripts and other regions are nearly devoid of transcripts ("gene deserts"). We 91 compared regions devoid of any transcriptomic coverage in our taxonomic sample to regions that show 92 transcription in at least one sample. We find that transcribed regions are more abundant and larger on average 93 than non-transcribed regions ( Figure 2E). The maximum length of non-transcribed regions is ~ 20kb (at ≥1 reads) 94 or ~50kb (at ≥5 reads), suggesting that large gene-free "deserts" are in principle accessible to transcription and 4 possible regulatory constraints 25 do not fully prevent their transcription. In fact, the mouse de novo gene Pldi has 96 arisen within a gene desert 10 .

97
Most de novo transcripts are expected to be neutral, but some may turn into more stable proto-genes 6,19 that can 98 eventually become functional, either as regulatory RNA, or by acquiring a functional reading frame. To identify 99 candidate proto-genes we used algorithms that are able to reconstruct transcriptional islands and splice junctions 100 (STAR 26 /cufflinks 27 ) and join them into predicted gene models (suppl . Table S9). We did this for each taxon 101 separately and assessed the gain and loss patterns of these transcripts in a phylogenetic context. Excluding all 102 previously annotated transcripts, we find a total of 17,746 new candidate proto-genes, distributed across all taxa 103 (suppl. Figure S2). When looking only at gains of proto-gene transcripts in the terminal branches, we find that 104 about 1,300 new proto-genes are gained per million years ( Figure 3A). Interestingly, at least 3,000 proto-genes are 105 already present at the youngest divergence level, implying within-species polymorphism that was also described 106 for Drosophila 21 .

107
When counting gains versus losses of proto-genes, we find again higher numbers gained than lost over short 108 phylogenetic times. However, gains and losses balance out over longer evolutionary times when including the 109 whole phylogeny and all annotated genes ( Figure 3B). This pattern confirms the two essential predictions we made

114
Our analysis is conservative in several respects. First, we focused on poly-adenylated transcripts, thereby avoiding 115 inclusion of RNA fragments that have been processed (i.e. excised introns) and randomly transcribed fragments.

116
However, this means we also exclude RNAs which are not transcribed by RNA polymerase II, such as tRNAs, 117 snRNAs and ribosomal RNAs. The human ENCODE data suggest that such non-poly-adenylated RNAs are abundant 1 118 and it is likely that proto-genes can arise from those transcripts as well, i.e. we are likely underestimating the 119 proto-gene emergence rate. Second, we focused on three tissues and one developmental stage only. Although we 120 included testis and brain, which are known to have the highest diversity of transcripts 28 we can also expect that 121 including more tissues and developmental stages would further increase the transcriptomic coverage. Taking these 122 factors into account, as well as the fact that increased taxonomic representation shows no signs of saturation with 123 respect to transcriptomic coverage ( Figure 2E), it seems reasonable to conclude that when measured at a   We selected ten taxa, ranging from population level through sister genera (Figure 1 A).

209
The youngest divergence point sampled, at about 3,000 years, corresponds to the split between two European

259
All raw data files were trimmed for adaptors and quality using Trimmomatic 9 . The quality trimming was performed 260 basewise, removing bases below quality score of 20 (Q20), and keeping reads whose average quality was of at 261 least Q30. Reads whose trimmed length was shorter than 60 bases were excluded from further analyses, and pairs 262 missing one member because of poor quality were also removed from any further analyses.

264
The reconstruction of transcriptomes using high-throughput sequencing data is not trivial when comparing 265 information across different species to a single reference genome. This is due to the fact that most of the tools 266 designed for such tasks do not work in a phylogenetically aware context. For this reason, any approximation which 267 deals with fractional data (i.e. any high-throughput sequencing setup available to this date) is limited by the 268 detection abilities of the software of choice and by the quality of the reference (transcriptome and genome).

269
Given the high quality state of the mouse genome repositories, we decided to take a reference-based approach, in

279
Genomic reads were used to as empiric mapability, i.e. to identify which regions can be reliably detected. We 280 limited our analyses to regions in the reference genome which could be mapped at least 5 times from genomic 281 reads from all other species (5x coverage). This is the portion we call the 'common genome' in downstream 282 analyses. It is important to highlight that this is not the same as synteny, since we did not perform any co-linearity 283 analyses between fragments, but rather represents the mere presence in the species, in any possible order.

9
Furthermore having genomic reads enables the detection of true absences in transcriptional activity from absences 285 of genome regions, which would show similar patterns in transcriptome-only analyses.

287
Due to the fact that NextGenMap is unable to perform split read analyses we opted for more standard tools to 288 reconstruct gene models from the data. For this we used STAR 12 to map reads to the reference, followed by 289 cufflinks 13 to obtain automated gene and transcript annotations for each species. The annotation file contains 290 models for expressed transcripts with splicing information (exon annotation) when available. All annotations were 291 merged using cuffmerge to generate a final annotation that includes gene models present at least once in the total 292 sample. Mono-exonic models shorter than 500 bases or contained within introns of multi-exonic transcripts were 293 excluded from analyses.

294
Parsimony gain and loss mapping 295 We estimated gain and loss events given the phylogenetic distribution of presence and absence of transcription at 296 a given position or for a given gene model using maximum parsimony (based on GLOOME, 14 , the assumption that 297 gains and losses are equally likely, and a fixed tree describing the relationships between taxa. Transcriptome experiments tend to be limited by the depth of sequencing, with highly expressed genes being 342 relatively easy to sample, and rare transcripts becoming increasingly difficult to find. Given the large amount of 343 data generated, we investigated if our data shows signals of coverage saturation from subsets of the data of 344 different sizes. The total experiment, comprising ten taxa, corresponds to 6.4 x 10 9 reads (or 6.4 billion reads). We 345 subsampled (samtools view -s ) portions of mapped reads for each taxon, ranging between 10% to 90%, at 10%