Global donor and acceptor splicing site kinetics in human cells

RNA splicing is an essential part of eukaryotic gene expression. Although the mechanism of splicing has been extensively studied in vitro, in vivo kinetics for the two-step splicing reaction remain poorly understood. Here, we combine transient transcriptome sequencing (TT-seq) and mathematical modeling to quantify RNA metabolic rates at donor and acceptor splice sites across the human genome. Splicing occurs in the range of minutes and is limited by the speed of RNA polymerase elongation. Splicing kinetics strongly depends on the position and nature of nucleotides flanking splice sites, and on structural interactions between unspliced RNA and small nuclear RNAs in spliceosomal intermediates. Finally, we introduce the ‘yield’ of splicing as the efficiency of converting unspliced to spliced RNA and show that it is highest for mRNAs and independent of splicing kinetics. These results lead to quantitative models describing how splicing rates and yield are encoded in the human genome.


Introduction
Transcription of eukaryotic genes produces precursor RNA molecules that are processed by splicing. Splicing is a two-step reaction that results in the removal of introns from precursor RNA and the formation of mature RNA with joined exons. Splicing is catalyzed by the spliceosome, a dynamic ribonucleoprotein machine assembled from small nuclear ribonucleoprotein complexes (snRNPs) and several non-RNP factors (Herzel et al., 2017;Wahl et al., 2009). In metazoa, the majority of the introns are excised by the major spliceosome, whereas a minority is removed by the minor spliceosome (Will and Lührmann, 2011). Spliceosomes are recruited through conserved RNA elements, the 5´-splice site at the exon-intron border (donor site), the 3´-splice site at the intron-exon border (acceptor site), and the branch point, which is followed by a polypyrimidine track (Coolidge et al., 1997;Turunen et al., 2013) and located~18-40 nucleotides (nt) upstream of the acceptor site Taggart et al., 2012). Additional RNA sequences such as splicing enhancers and silencers are found in both introns and exons and can influence the choice between splice sites (Zhu et al., 2001).
In recent years, the intricate mechanisms of splicing were investigated with a combination of structural and functional studies (Mayerle and Guthrie, 2017;Shi, 2017;Wilkinson et al., 2018;Will and Lührmann, 2011). The spliceosome assembles in a stepwise manner, adopting different intermediates with varying composition, conformation, and interactions between RNAs and proteins (Will and Lührmann, 2011). Assembly of the major spliceosome begins with recognition of the donor site by U1 snRNP (Kondo et al., 2015;Lerner et al., 1980;Seraphin and Rosbash, 1989;Zhuang and Weiner, 1986). The U2 auxiliary factor then binds to the poly-pyrimidine track and the acceptor site (Valcárcel et al., 1996;Zamore and Green, 1989;Zamore et al., 1992) generating the E complex. U2 snRNP then binds the branchpoint, resulting in the A complex (Konarska and Sharp, 1986;Perriman and Ares, 2010). In the A complex, U1 snRNA binds the donor site and U2 snRNA binds the branchpoint, rendering the branchpoint adenosine base available for interaction with the acceptor site (Berglund et al., 2001;Query et al., 1994). Binding of the U4/U5/U6 tri-snRNP leads to the B complex (Bertram et al., 2017a), which is first activated (Bact) and then converted to the catalytically active B* complex. In the B* complex, the donor site is positioned close to the branchpoint in a RNA network formed between the precursor RNA and U2, U5 and U6 snRNAs .
The activated spliceosome catalyzes intron removal in two steps, which are both transesterification reactions. In the first step, the 2´-hydroxyl group of the branchpoint adenosine serves as a nucleophile to attack the donor site and to generate a cleaved 5´-exon and the lariat intermediate. This leads to the C complex  that is then rearranged to form the C* complex, which is catalytically active to carry out the second step. In the C* complex, the ends of exons to be joined are juxtaposed (Bertram et al., 2017b;Zhang et al., 2017), and this enables the 3´-hydroxyl group of the last nucleotide of the 5´-exon to attack the acceptor site, leading to exon ligation and excision of the intron lariat. The resulting P complex contains the ligated exons, which are subsequently released, completing the splicing process (Bai et al., 2017;Wilkinson et al., 2017).
Taken together, extensive in vitro studies have strongly advanced our understanding of the splicing process, but the kinetics and mechanisms of splicing in vivo remain far less understood. Although biochemical assays show that splicing can occur in the absence of transcription, in vivo splicing happens mainly co-transcriptionally, when newly transcribed RNA is still attached to RNA polymerase II (Pol II) and chromatin (for review see Alexander and Beggs, 2010;Bentley, 2014;Saldi et al., 2016). Furthermore, compromising Pol II transcription elongation increases alternative splicing (de la Mata et al., 2003;Dujardin et al., 2014;Ip et al., 2011;Pagani et al., 2003), providing evidence that an optimal elongation rate is essential for a co-transcriptional splicing (Davis-Turak et al., 2018;Fong et al., 2014). Native elongating transcript sequencing in human cells (NETseq) indicates that splicing occurs soon after introns are synthesized (Mayer et al., 2015; eLife digest Genes are portions of DNA that carry the instructions to build proteins. In particular, they are formed of segments called exons, which contain the protein-building information, and of non-coding segments known as introns. Exons and introns alternate within a gene.
To create a given protein, the cell first uses an enzyme, Polymerase II, to copy the entire related gene -including introns and exons -into a molecule of ribonucleic acid, or RNA. As the gene is copied, a machine called the spliceosome comes onto the RNA molecule to remove the introns and create the final RNA template used to produce proteins.
The spliceosome works by recognizing specific sequences that signal the border between introns and exons. Once the machine is bound to these 'splice sites' on each side of an intron, it brings the two neighboring exons close together and cuts out the intron. The two ends of the exons are then attached together. Previous studies have measured how fast introns are removed, but it remained unclear how long it takes to cut individual splice sites genome-wide.
To address this question, Wachutka, Caizzi et al. combined a mathematical approach with a biochemical method that purifies newly made RNA in human cells. The experiments showed that it only took a few minutes to cut most splice sites. Cutting splice sites that bordered very long introns was slower, presumably because the Polymerase II took longer to produce these introns. In addition, the genetic sequences of the splice sites affected the time it took to remove the introns: some made it harder for the spliceosome to recognize where to cut, but others made it easier.
Mistakes in removing introns from RNA can lead to producing abnormal proteins, and many diseases such as cystic fibrosis and Duchenne muscular dystrophy can be caused by such errors. In particular, small changes in the sequences at the splice sites or in the surrounding areas can create problems when it comes to eliminating introns. Decrypting the dynamics of intron cutting and removal may give scientists new insight into the molecular causes of cystic fibrosis and many other genetic disorders. Nojima et al., 2015). Further, the combination of single molecule intron tracking (SMIT) and long read sequencing in yeast shows that splicing is 50% complete when Pol II is 45 nt downstream the acceptor spice site (Oesterreich et al., 2016).
Despite these advances, the in vivo kinetics of splicing remain poorly understood. In particular, different estimates for splicing rates have been reported. Splicing rates have been measured for selected endogenous human genes with the use of live cell imaging (Coulon et al., 2014;Martin et al., 2013;Rino et al., 2014;Schmidt et al., 2011) or with a combination of cellular RNA extraction and quantitative PCR (Pandya-Jones and Black, 2009;Singh and Padgett, 2009). Such studies led to very different splicing rate estimates, ranging from 15 to 30 s (Huranová et al., 2010;Martin et al., 2013;Rino et al., 2014) to 4.3-10 min (Coulon et al., 2014;Schmidt et al., 2011;Singh and Padgett, 2009) per splicing event. These discrepancies may stem from the difference in methods used, which can introduce perturbations, from the selection of genes studied, and from the great variance in intron lengths between human genes.
Other studies have estimated in vivo splicing rates globally with the use of RNA sequencing technologies. The short sequence reads collected from steady-state in vivo samples reflect RNA synthesis, splicing and degradation, which are entangled . In particular, the 'splicing efficiency' is typically defined by the ratio of spliced over unspliced RNAs at steady state (Braberg et al., 2013;Wilhelm et al., 2008). However, the same ratio has been successfully employed to study RNA stability, with the assumption that unspliced RNA levels reflect RNA synthesis and spliced RNA levels reflect the ratio of RNA synthesis over degradation (Gaidatzis et al., 2015;Zeisel et al., 2011). This ambiguity in definitions and interpretations questions the use of splicing efficiency and calls for alternative concepts and metrics. To overcome the limitations of steady-state transcriptome sequencing, one approach is to sequence new transcripts from chromatin-associated RNA fractions and compare them to cytoplasmic fractions, which led to splicing rate estimates from 43 s (Davis-Turak et al., 2015) to 15-120 min (Bhatt et al., 2012;Pandya-Jones et al., 2013) per splicing event.
An alternative method to investigate splicing kinetics in vivo is the use of metabolic RNA labeling with 4-thiouracil (4sU) (Dö lken et al., 2008;Rabani et al., 2011;Rabani et al., 2014;Windhager et al., 2012) coupled to sequencing of the labeled RNA (4sU-seq). We previously combined 4sU-seq with kinetic modeling to obtain RNA synthesis, splicing, and degradation rates in the fission yeast S. pombe (Eser et al., 2016). Others have used 4sU-seq and similar approaches to obtain median splicing rate estimates in human cells of 6.7 min (Mukherjee et al., 2017) or 14 min (Rabani et al., 2014) per splicing event. However, 4sU-seq also introduces biases when applied to the human system because the obtained data are artificially biased toward pre-existing 5'-regions of the RNA due to the length of human genes (Schwalb et al., 2016). These RNA 5'-regions predate the labeling time and are generally already observed to be spliced by 4sU-seq, leading to potential errors in the rate estimates. As a result of these difficulties, in vivo splicing kinetics remain unclear, and individual rate estimates for the two steps of splicing are lacking. However, such information is highly desirable because it may be interpreted alongside the mechanistic information obtained in vitro to provide a better understanding of the splicing process.
To study kinetics of splicing in vivo, we performed TT-seq (Schwalb et al., 2016) after different 4sU-labeling time points in human K562 cells. In contrast to 4sU-seq, the TT-seq protocol includes RNA fragmentation before 4sU-labeled RNA is purified and sequenced. This is a crucial step that eliminates 5' regions of nascent RNAs that were already transcribed, and spliced, prior to incorporation of the label, and thus removes the 5'-bias. We developed a computational approach that estimates the metabolic rates of single phosphodiester bonds. This approach enabled uncoupled quantification of donor-and acceptor-specific kinetics and to relate these to the two transesterification reactions and to the contribution of single nucleotides in spliceosome intermediates defined by structural studies. Moreover, to calculate the amount of precursor RNA successfully spliced into mature RNA, we introduced the 'splicing yield' as the conversion efficiency of unspliced to spliced RNA. As a result, our analysis provides genome-wide metabolic rates for donor and acceptor splice sites and identifies RNA-RNA interactions in the spliceosome that could contribute to in vivo splicing kinetics. Furthermore, we provide genome-wide estimates of the splicing yield that is not biased by splicing kinetics. From this work emerges a comprehensive global view of splicing kinetics and yield in human cells.

RNA labeling monitors intron removal
To monitor global RNA metabolism in human cells, we performed TT-seq analysis in K562 cells after different times of RNA labeling with 4-thiouracil (4sU) ( Figure 1A). We previously showed that such a labeling time series can estimate splicing rates in the yeast S. pombe (Eser et al., 2016). We exposed K562 cells to 500 mM of 4sU for a labeling time of 2, 5, 10, 15, 20, 30, or 60 min, isolated RNA, and conducted both TT-seq and total RNA-seq ( Figure 1A). On average, we obtained 250 million and 55 million 150-nucleotide (nt) paired-end reads for each of the TT-seq and RNA-seq  TT-seq and RNA-seq were performed after 2, 5, 10, 15, 20, 30 and 60 min of 4sU-labeling in human K562 cells. The number of reads spanning exon-intron/intron-exon splice-sites (non-split reads) gradually decrease with longer labeling duration, while exonic reads and reads spanning exon-exon junctions (split reads) increase during the time course, converging to levels similar to RNA-seq. (B) Coverage track of MYNN gene for 2, 5, 10, 15, 20, 30 and 60 min of 4sU-labeling followed by TT-seq and 2 min of 4sU-labeling followed by total RNA-seq. (C) Distribution of non-split reads (left) and split-reads (right) in previously published 4sU-seq (orange) and TT-seq (blue) (Schwalb et al., 2016)  samples, respectively. We next mapped TT-seq and RNA-seq data to the human genome (Materials and methods). The experiments were highly reproducible (Figure 1-figure supplement 1A). Visual inspection of the mapped reads revealed strong TT-seq signals in transcribed regions covering both introns and exons ( Figure 1B), whereas RNA-seq data covered mainly exons ( Figure 1B, Figure 1-figure supplement 1B). The relative number of intronic reads in TT-seq data decreased with 4sU-labeling time, whereas the signal for exons increased. These observations were consistent with capture of newly synthesized precursor RNA because spliced introns are more rapidly degraded than exons that are maintained in mature, stable RNA. Thus, our time series data contained information about the kinetics of precursor RNA splicing that we exploited further.

New and alternative splice sites
We first analyzed our TT-seq data for the occurrence of reads that are informative of precursor RNA splicing, that is reads spanning exon-exon boundaries ('split reads') and reads spanning exon-intron ('donor') and intron-exon ('acceptor') boundaries ('non-split reads'). Using a threshold of at least 10 split reads for an exon-exon boundary, we found 341,855 putative introns (Materials and methods). The relative number of these non-split reads compared to split reads was highest after 2 min of 4sUlabeling and progressively decreased with longer labeling times, eventually converging to a similar coverage as in the RNA-seq samples ( Figure 1-figure supplement 1C). Coverage with split reads decreased with the distance to the transcription start site (TSS) in 4sU-seq data but not in TT-seq data ( Figure 1C, data from Schwalb et al., 2016), suggesting that the previously reported estimation of splicing rates with 4sU-seq (Mukherjee et al., 2017;Rabani et al., 2014), overestimated splicing rates for 5' introns compared to 3' introns. With the use of TT-seq we could avoid a 5'-bias in splicing rate estimations. About one half of the putative introns (177,322) mapped to splice sites that had been annotated in the database of transcribed regions GENCODE (Materials and methods), whereas the other half did not (164,533). Of these putative introns that had not been previously annotated as splice sites in GENCODE, more than 99% represented introns ending with the GT|AG canonical dinucleotides. Moreover, 21% represented new combinations of already annotated donors and acceptors ( Figure 1D). Furthermore, 18% map to a non-annotated donor site and to a previously annotated acceptor site, 22% to a non-annotated acceptor site and to a previously annotated donor site. Interestingly, another 38% mapped to both non-annotated donor sites and non-annotated acceptor sites. Of these, about one half was located within an annotated GENCODE gene (intronic), whereas the other half was located in regions of the genome not annotated in GENCODE (intergenic). Overall, this analysis indicates that the number of splice sites has previously been underestimated, in agreement with recent studies that integrated very large datasets of the public RNA-seq repository (Nellore et al., 2016) or studies that used full-length mRNA sequencing (Anvar et al., 2018).

Kinetic modeling
Defining the kinetics of RNA synthesis, splicing, and degradation from short-read-based protocols is inherently ambiguous due to the many RNA species overlapping any genomic position, including precursor RNAs and multiple splice isoforms. In the future, quantitative and high-throughput fulllength transcriptome sequencing may become available to improve the situation; however, co-transcriptional alternative splicing would still cause ambiguities. We have therefore shown it is accurate to analyze the metabolism of phosphodiester bonds rather than RNA species themselves . Following this idea, we modeled the steady-state rates of synthesis and degradation (or equivalently cleavage) of each of three different phosphodiester bonds individually: the exon-intron bond at the donor site, the intron-exon bond at the acceptor site, and the bond between the two joined exons after successful exon ligation to yield product RNA (Figure 2A left, Appendix). We refer to these definitions throughout when we use the terms 'splicing kinetics' or 'splice site kinetics'.
We then considered the metabolism of these three phosphodiester bond types at steady-state. Synthesis balances out degradation at steady-state for any molecular species, independently of the kinetics. The steady-state synthesis rate (amount produced per unit of time) and the steady-state degradation rate (ratio of steady state amount by the steady-state synthesis rate) are defined quantities without any assumption on the kinetics (Appendix). Synthesis of the donor and acceptor bonds reflects precursor RNA synthesis. Cleavage of the donor bond is caused by either splicing or by precursor RNA degradation. The half-life of those donor bonds that get spliced depends on intronic transcription at least up to the branchpoint and on the first transesterification step. Cleavage of the acceptor bond is caused by either splicing or by precursor RNA degradation. The half-life of those acceptor bonds that get spliced is determined by the first and the second transesterification step  but not by intronic transcription. Synthesis of the junction bond is the outcome of completed splicing. Cleavage of the junction bond indicates RNA degradation (Materials and methods, Appendix). Our experimental design includes the injection of labeled and unlabeled spike-ins at constant concentrations in all samples, prior to the purification step. These spike-ins allowed for estimating the variations in sequencing depth as well as the overall newly synthesized RNA fraction of every sample (Materials and methods). The unlabeled spike-ins also allowed estimating the amount of cross-contamination, that is of unlabeled RNAs that are purified and which can represent a large fraction of all RNA-seq reads at short labeling durations (Materials and methods). These technical parameters estimated from the spike-ins read counts were then used as covariates to model expected read counts of all three types of bonds in each sample.
Application and testing of the kinetic model Using these considerations, we fitted the abundance of each of these three types of bonds with a first-order kinetic model for a total of 162,134 donor, 177,543 acceptor and 156,825 junction bonds that showed at least 100 supporting reads across the full dataset (Figure 2A right, Materials and methods, Appendix). Overall, TT-seq read counts agreed with the expected counts of our kinetic model (Figure 2A central column, Figure 2-figure supplement 1A). The synthesis rates for donors and acceptors, and the product half-life inferred from distinct splice junctions (Materials and methods) agreed well, demonstrating the robustness of our approach (Spearman rank correlation >0.33 for synthesis time downstream of the first exon, p<2 Â 10 À16 and Spearman rank correlation >0.76 for half-life, p<2 Â 10 À16 , variation of 180% fold for synthesis rate, and 32% fold for halflife, Materials and methods, Figure 2-figure supplement 1B). Variations were larger for synthesis rates because these estimates are in a first approximation proportional to the coverage in the shortlabeled TT-seq samples and are therefore more sensitive to sequencing biases. In contrast, half-lives, which are in a first approximation proportional to the ratio of coverages in short-labeled TT-seq samples and in RNA-seq, better control for sequencing biases.
We furthermore conducted extensive simulations to assess the performance and limitations of the fitting procedure to estimate the rates when the ground truth is known. We simulated counts based on the estimated distributions of synthesis rates, splicing half-times and half-life times based on the experimental data. Based on simulated data, our method leads to unbiased estimates of ground truth synthesis rates (Appendix 1-figure 12), splicing half-time and half-life time (Appendix 1-figure 13) with high precision compared to the dynamic range. Also, we used simulations to explore how estimation accuracy is affected when using data with much lower read coverage or for extremely slow or fast rates. These simulations showed that lowering the total read coverage cut-off below 100 reads would lead to relative errors typically surpassing 100% (median, Appendix 1-figure 23). These simulations also showed that estimations of half-lives shorter or much longer than our labeling durations (shorter than 1 min or longer than 3 days) would lead to median error surpassing 100% (Appendix 1-figure 24).
First-order kinetic models are simple models that grossly model the underlying biochemical processes. We also investigated two alternative models that potentially capture more complex kinetics. The first one is a delay differential equation model for donor bond kinetics that modeled the time to transcribe the intron up to the branchpoint with a delay, followed by first-order kinetics for the first transesterification step (Appendix). Simulations indicated that identifying the parameters of this delay differential equation model is difficult (Appendix 1-figures 2, 20-22) because the data do not support distinguishing the contribution of transcription from the one of the first transesterification step. However, fitting a first order kinetic model on data simulated according to the delay differential equation model showed that the estimated donor bond half-life approximately equated the sum of the intronic transcription delay and the half-time of the first transesterification step (Appendix 1-figure 5, yet usually underestimating with a median relative level of 0.89). The second alternative model is a coupled differential equation model for the junction bonds that modeled junction formation as the outcome of a first-order kinetics splicing process rather than as a constant. Simulations showed that the data did not allow to easily distinguish this coupled kinetics from first-order kinetics (Appendix 1- figure 3). Moreover, the junction bond half-life estimated by the first order kinetics model approximately equated the sum of the splicing half-time and of the half-life of the processed RNA (Appendix 1-figure 7, yet usually overestimating with a median relative level of 1.2). Unless specifically mentioned, we used the first order kinetics model in the remaining analyses because of its robustness and its approximate equivalence with alternative models (Appendix).

Splicing times are in the range of minutes
Based on GENCODE annotation, the analyzed bonds mapped to 8,770 mRNAs, 162 RNAs antisense to protein-coding genes (asRNA), 204 long intergenic non-coding RNAs (lincRNA), and 290 other non-coding RNAs (Figure 2-figure supplement 1C). When averaged within major isoforms (Materials and methods), synthesis rates and half-lives of donors and acceptors ranged over two orders of magnitude (42-fold and 48-fold change, 90% equi-tailed range), whereas the junction bond half-life spanned only slightly over one order of magnitude (8.1-fold 90% equi-tailed range, Figure 2B-D). These major isoform aggregated rates generally agreed with previously reported splicing rates and   Mukherjee et al., 2017). Moreover, mRNAs were spliced significantly faster (median of 7.2 min) than lincRNAs (median of 11 min, p<2 Â 10 À16 ) and other non-coding RNAs ( Figure 2C). Also, mRNAs had junction bonds with the longest half-lives (median of 316 min) ( Figure 2D), consistent with previous studies (Mukherjee et al., 2017;Schlackow et al., 2017;Schwalb et al., 2016). Similar conclusions can be reached using site-specific rates ( Figure 2-figure supplement 1E-J). The obtained apparent splicing times in the range of minutes agree with many previous estimates that were obtained using different methods, but argue against fast splicing, within less than a minute, that was suggested by some studies (Carmo-Fonseca and Kirchhausen, 2014).

Intron length constrains splicing times
Intron length has been suggested to affect splicing kinetics (Hicks et al., 2010;Khodor et al., 2011;Pai et al., 2017;Proudfoot, 2003;Windhager et al., 2012), and we therefore investigated this further. First, our de novo annotation of introns is in agreement with a minimal intron length of about 80 nt ( Figure 3A), as expected from the spatial needs within the spliceosome Wieringa et al., 1984). Second, we find that among introns shorter than 2,000 nt, acceptor and donor bond half-life showed similar distributions and decreased with increasing intron length ( Figure 3B). The reasons for why short introns are spliced more slowly than long ones remain to be investigated. It is possible that for longer introns the splice site definition by the following exon facilitates splicing and that there are less restraints for splicing for longer introns. This observation also strongly argues for pre-ordering of the spliceosome on the transcribing polymerase.
Our analysis also reveals that donor and acceptor bond half-lives differ for long introns. Among introns longer than 2,000 nt, acceptor bond half-life plateaued at a median value of about 4 min, whereas donor bond half-life increased with intron length up to a median value of about 8 min for introns larger than 7,700 nt (last septile). A possible explanation for this significant difference is that donor sites of long introns are transcribed much earlier than acceptor sites and splicing can only start when the intron is transcribed. Indeed, the donor bond half-life is determined by the elongation time needed to transcribe at least the branchpoint and by the first transesterification step, whereas the acceptor bond half-life is determined by both the time for the first transesterification step and for the second transesterification step to be completed. Assuming a maximum polymerase elongation velocity of 4 kb/min (Fuchs et al., 2014;Gressel et al., 2017;Jonkers and Lis, 2015;Saponaro et al., 2014;Veloso et al., 2014), we observed very few introns violating this predicted limit ( Figure 3C). This limit for donor bond half-life affects only a small proportion of all introns (last septile) so that, overall, there is no positive correlation between donor bond half-life and intron length ( Figure 3C). For shorter introns, the donor bond half-life and the acceptor bond half-life were similar ( Figure 3B), indicating that the second transesterification step is fast compared to the overall splicing kinetics.
Another prediction of this model is that for slowly transcribed introns the donors should take longer to cleave than the acceptors. Consistent with this hypothesis, the median half-life was 2.5-fold (p<2Â10 À16 ) longer for donor bonds than for acceptor bonds of first introns ( Figure 3D), which are known to be more slowly transcribed (Danko et al., 2013;Fuchs et al., 2014;Jonkers and Lis, 2015;Saponaro et al., 2014;Veloso et al., 2014). A small significant difference was also found for the last intron (1.4-fold, p<2Â10 À16 ), which could reflect slower polymerases near the transcript end or different kinetics of splicing of the last intron (Davis-Turak et al., 2015;Rigo and Martinson, 2008). In conclusion, these data show that donor half-life and thus the beginning of splicing is limited by transcription elongation for long introns. Taken together, our results are generally consistent with the co-transcriptional nature of splicing and reveal that the length of the intron influences splicing kinetics in at least two different ways.

Several snRNA interactions are related to donor cleavage kinetics
Whereas overall trends in splicing kinetics can be explained by global features such as intron length and polymerase elongation velocity, the kinetics of splicing also critically depend on the RNA sequence context around the donor, acceptor, and branchpoint. To gain insights into the sequence determinants for splicing, we built a linear model (Materials and methods) that allowed us to estimate changes in donor bond half-life as a function of single nucleotide changes relative to the consensus sequence. The single nucleotide model could explain 19% of the observed variance in log-transformed donor bond half-life and achieved a median relative error for individual sites of 150%, which is small compared to the dynamic range across sites spanning two orders of magnitude ( Figure 4A). This analysis showed that nucleotide deviations from the consensus splice-site increase donor bond half-life. These findings are consistent with evolutionary pressure for donor sequences optimized for fast splicing.   In order to elucidate the contribution a single nucleotide change might have on interactions within the spliceosome, we compared our predicted single nucleotide effects with base interactions observed in three different spliceosome structures ( Figure 4B, bottom). Recognition of the donor site by U1 snRNP plays a crucial role during early spliceosome assembly. RNA-RNA interaction between precursor RNA and U1 snRNA are mainly stabilized through Watson-Crick interactions ( Figure 4C; Kondo et al., 2015). In our model, substitution of the highly frequent G at +1 or À1 from the donor site with a C resulted in an increase in bond half-life ( Figure 4B), likely because C cannot form a stable interaction with a C in the U1 snRNA, in agreement with previous in vitro studies (Kondo et al., 2015). In contrast, at position +3 from the donor site, a change from A to G has   only a minor effect on the bond half-life ( Figure 4B), likely because G can still form a non-canonical base pair with U in the U1 snRNA (Kondo et al., 2015). These results suggest that interactions of the precursor RNA donor region with U1 snRNA contribute to the observed donor bond half-lives. After donor site recognition by U1 snRNP, the Prp28 RNA helicase mediates the exchange of U1 with U6 snRNP and the U4/U5/U6 tri-snRNP can stably bind the precursor RNA. In the resulting B complex, the U5 stem loop interacts with the three terminal nucleotides of the 5´-exon, whereas the U6 ACAGA helix is formed near the donor site ( Figure 4D; Bertram et al., 2017a). Our results suggest that U5 interactions may contribute to the kinetics of donor cleavage. For example, an A in the position À3 relative to the donor site leads to faster donor bond half-life supposedly because this enables base-pairing with U5 snRNA in the extended precursor RNA-U5 snRNA duplex.
Completion of step-one results in the C complex that is then converted to the activated C* complex, which contains the two exon ends in close proximity for the step two reaction. RNA duplexes are formed between the intron region close to the donor site and U6 snRNA and between the branch site region and U2 snRNA ( Figure 5A) (Zhang et al., 2017). We also found that interactions in the C* complex were predictive of donor bond half-life ( Figure 4D). In particular, changes in the branchpoint adenine and at all positions À4 to +3 of the branchpoint show kinetic effects, except for the positions À1 and +2 that are predicted to not contribute to donor bond half-life (grey highlighting in Figure 4D). In agreement with the structural data, these nucleotides are also the only two nucleotides in the vicinity of the branchpoint that do not interact with U2 in the C* complex (Zhang et al., 2017). When compared to each other, the precursor RNA nucleotides interacting with snRNAs during 5' splice site recognition showed the strongest effects on donor bond half-life, followed by nucleotides interacting in the B complex, and to a lesser extent the nucleotides interacting in the C* complex (Figure 4-figure supplement 1A). Positions with no predicted contact in these structures showed least effects (Figure 4-figure supplement 1A). These observations support our kinetic modeling, but also argue that the structurally characterized spliceosomal complexes represent functional states.
Taken together, variation in the in vivo kinetics for the donor cleavage can in part be rationalized with early interactions of precursor RNA with U1 snRNA, and with later U5 snRNA interactions observed in structures of the B and Bact complexes. Moreover, the stability of the C* complex appears to also affect donor bond half-life, possibly because it prevents the reverse reaction of donor site cleavage (Tseng and Cheng, 2008). Since several precursor RNA positions are involved in different types of interactions in different splicing intermediates, the observed overall kinetics of donor cleavage reflect a combination of distinct microscopic rates, which cannot be distinguished by our in vivo approach. Furthermore, not all observed effects of nucleotide changes could be explained with available structures. For example, the first nucleotide of the downstream exon (acceptor +1 position) was important for donor cleavage kinetics. Although it remains unclear why, this could be related with co-transcriptional recruitment and recycling of splicing factors, maybe favored by Pol II 3´splice-site pausing, similar to that suggested in Aitken et al. (2011).

Sequence and structural contributions to acceptor bond half-life
We also built a regression model predicting log-transformed acceptor bond half-life from sequence ( Figure 5B, 20% of variance explained, median relative error of 150%). Single nucleotide changes around the acceptor site generally had larger effects on acceptor bond half-life, reflecting effects on step two kinetics, whereas changes around the donor site had greater effects on donor bond halflife, reflecting effects on step one kinetics ( Figure 5C). The post-catalytic complex (P complex), which is specific to step two splicing reaction, is not yet structurally characterized in human. Nevertheless, nucleotide changes that influence base pair interactions reported for the P-complex in S. cerevisiae (Bai et al., 2017) showed stronger effects on the acceptor bond half-life than for the donor bond half-life (Figure 4-figure supplement 1A). Nucleotide positions in the precursor RNA that are not involved in base pair interaction with snRNAs in B-type and C* complex structures were irrelevant for predicting acceptor bond half-life (grey highlighting in Figures 4B and 5C, feature selection, Materials and methods).
Most nucleotides showed similar effects in the donor and acceptor bond half-life models but some noticeable differences were observed between them ( Figure 5D). Our results indicate that a non-canonical G branchpoint does not affect acceptor bond half-life but increases donor bond halflife. We also observed that the predominant G at the donor À1 nucleotide leads to fast donor cleavage kinetics but to slow acceptor cleavage kinetics, maybe because this interferes with positioning of the neighboring +1 donor nucleotide that serves as a nucleophile during step two. Despite this disadvantage in acceptor cleavage kinetics, the donor À1 position is predominantly G, presumably because this improves donor site recognition by base pairing with a C in U1 snRNA as described above. Taken together, available structural information on the spliceosome help to     Regulatory precursor RNA motifs contribute to splicing kinetics Splicing is modulated by auxiliary factors, including serine/arginine-rich proteins and hnRNPs (heterogeneous nuclear ribonucleoproteins), that bind to regulatory motifs around the splice sites (Fu and Ares, 2014;Matlin et al., 2005;Wang and Burge, 2008). We therefore aimed to identify putative regulatory motifs and to quantify their contribution to splicing kinetics. We derived two extended models for donor and acceptor bond half-life including the single nucleotide effects in the core regions investigated so far and all 65,536 possible RNA octamers in four extended intronic regions, 100 nt downstream of the donor site, 100 nt upstream of the acceptor site, and 100 nt upstream of the branchpoint and the region between branchpoint and acceptor site ( Figure 6A, Materials and methods). We also included intron length and GC content, but did not include exonic regions, because this would require isoform annotations. Because of the very large number of octamers, we used a feature selection method (Lasso regression), which yielded 551 octamers jointly predicting donor bond half-life and 2,319 octamers jointly predicting acceptor bond half-life (Materials and methods). Compared to the single nucleotide models, the extended models substantially increased the proportion of variance explained from 19% to 26% for the log-transformed donor bond half-life and from 20% to 29% for the log-transformed acceptor bond half-life ( Figure 6B and C). These proportions of variance increased when we restricted the analysis to junctions of major isoforms, showing that the results are not over-estimated due to double counting of donors and acceptors belonging to multiple exon junctions (donor bond half-life model by 1.3% and the acceptor bond half-life model by 1.4%). Cumulatively, the proportions of variance explained of these non-overlapping regions largely exceed the proportion of variance explained by the joint model, indicating widespread co-occurrence of splicing-regulatory sequences across introns.
The improved prediction of the extended models over the single nucleotide models is mostly attributable to the octamers of the extended intronic regions. The largest number of predictive octamers identified by the donor bond half-life model was found in the 5' donor site region (Figure 6-figure supplement 1A). This set of octamers was the most predictive feature for donor bond half-life individually (19% of the variance) and the feature with the largest impact on variance when dropped from the joint model (3% of the variance). In the acceptor bond half-life model, the regions flanking the branchpoint and acceptor site contained most predictive octamers ( Figure 6-figure supplement 1A) and associated with largest proportion of variance explained ( Figure 6C). Moreover, the effects of octamers on bond half-life were of similar order of magnitude than the effects of single nucleotides in donor, acceptor and branchpoint sites for both categories (Figure 6-figure supplement 1B, median effect for octamer 1.4%, median effect for nucleotide 5.4%). The drop of proportion of variance explained when a feature was removed from the joint models were small (between 0.0 and 3.4, Figure 6B and C) indicating of substantial correlation between the features. These correlations could be technical in the case of overlapping regions, or the result of co-evolution. Altogether, these results show that the octamers in the extended intronic regions contribute to splicing kinetics.
To identify putative regulatory factors that could bind to the predicted RNA octamers, we scored the octamers binding affinities to the 159 human RNA-binding proteins of the ATtRACT database (Giudice et al., 2016). We found 258 octamers predicted by the donor bond half-life model (47% versus 42% of non-selected octamers, p=0.017, Fisher test) associating with 69 RNA-binding proteins and 1,039 octamers identified by the acceptor bond half-life model (45% versus 42% of nonselected octamers, p=0.007, Fisher test) associating with 99 RNA-binding proteins motifs ( Figure 6D, Figure 6-figure supplement 1C, relative position weight matrix score >0.9 and selecting for the 5% highest absolute scores, Materials and methods). Our results suggest that several serine/arginine-rich and hnRNP proteins (Supplementary file 5) regulate donor and acceptor bond half-life in both positive and negative fashions, depending on the location of their binding site. Octamers associated with the binding site of the polypyrimidine tract-binding protein Ptpb1 are predictive of short donor bond half-life when present between branchpoint and acceptor site but they prolong the donor bond half-life when located near the donor site ( Figure 6D). The remaining octamers may reflect cis-regulatory elements bound by splicing factors that remain to be characterized. To address the evolutionary conservation of the identified octamers, we aligned them to conserved sequences across 99 mammalian and other vertebrate genomes. Except for octamers predicted to affect donor bond half-life in the region 100 nt downstream of the donor site, the remaining ones show significantly higher phylogenetic conservation compared to a negative control of random octamers ( Figure 6E), providing evidence of their biological significance.

Splicing yield differs between RNA classes
Cleavage of the phosphodiester bonds at the donor and acceptor sites can lead to ligation of the two exon ends, thus completing splicing. However, cleavage of these bonds may also be non-productive in the sense that exon ligation can fail and RNA may be degraded after cleavage. To account for this, we defined the 'splicing yield' as the proportion of precursor RNA successfully converted into spliced RNA ( Figure 7A). A splicing yield of 1 means that all precursor RNAs that are synthesized are also successfully spliced, whereas a splicing yield less than one means that only a fraction of the precursor RNA is converted to spliced product. We estimated the splicing yield using the junction bonds modeled with the coupled model and the acceptor bonds modeled with first-order kinetics, because alternative kinetic models or using the donor bonds led to systematic biases (Appendix). Hence, we did not computationally constrain our splicing yield estimates to be bounded by 1. Due to estimation errors of the synthesis rates, yields sometimes turn out to be greater than 1.
We found that the splicing yield across sites was much higher for mRNAs (median = 1.2, Figure 7B) than for antisense RNAs (median = 0.2), lincRNAs (median = 0.3), and other non-coding RNAs (median = 0.5), suggesting that degradation pathways are competing with splicing more intensively for non-coding RNAs than for coding RNAs. Moreover, the higher yield of mRNAs compared to lincRNAs also held when stratifying by cumulative read coverage across all samples and by half-life (Appendix 1-figures 26 and 27), two possible confounders associating with synthesis rate estimation biases in simulations (Appendix). Furthermore, splicing yield was the same for the 139,344 (99.9%) introns harboring the canonical terminal dinucleotides GU and AG recognized by the major spliceosome (U2-type) than for the 182 (0.1%) introns harboring the terminal dinucleotides AU and AC recognized by the minor spliceosome (U12-type) ( Figure 7C). Although introns targeted by the minor spliceosome showed two-fold slower donor and acceptor bond half-lives compared to those targeted by the major spliceosome ( Figure 7D, Figure 7-figure supplement 1B), the minor spliceosome nonetheless reached similar splicing yields.
To analyze how the nature of the splice sites contributes to splicing yield, we built a model that allow us to predict splicing yield based on sequence ( Figure 7E). Similar to the effects on bond halflives ( Figure 4B, Figure 5C), deviations from the consensus sequence led to lower splicing yield. Furthermore, nucleotides near a splice site showed stronger effects than more distant ones, suggesting that the early recognition of donor and acceptor splice sites is a determinant for splicing yield. Taken together, these results indicate that rate and yield are distinct aspects of splicing that may have evolved independently and that the sequence around splice sites determines both the rate and the yield of splicing.

Discussion
RNA splicing is an essential step of eukaryotic gene expression, but the in vivo kinetics of this twostep process and its dependence on transcription remain poorly understood. Here, we have coupled a metabolic RNA labeling time series to TT-seq analysis of new and total RNA to investigate RNA metabolism in human cells. We have then used kinetic modeling based on a definition of RNA metabolic rates at the level of individual phosphodiester bonds to provide rate estimations for cleavage of phosphodiester bonds at donor and acceptor splice sites in human introns. The obtained splice site cleavage rates, expressed as donor and accept bond half-lives, are free of ambiguities introduced by other methods and are related to the independent contributions of the two splicing steps in vivo.
The donor and acceptor bond half-lives were found to be generally in the range of minutes, although we cannot exclude that we are missing a small population of quickly spliced introns. The donor and acceptor bond half-lives were found to depend on intron length, on the nucleotide sequence surrounding splicing sites, including the branchpoint, and on flanking octamer sequences that may bind regulatory factors. This is consistent with a complex relationship between the splicing machinery and its nuclear environment, in which splicing rates can be influenced not only by RNA sequence but also by gene structure and chromatin landscape (Davis-Turak et al., 2018;Davis-Turak et al., 2015). In addition, we define the yield of successful splicing and show that it differs dramatically between different RNA classes.
Our results also provided insights into the nature and evolution of co-transcriptional splicing. Previous studies of splicing kinetics in mouse (Rabani et al., 2014) and fission yeast S. pombe (Eser et al., 2016) all found that splicing is faster for shorter introns. A recently published study in Drosophila (Pai et al., 2017) demonstrated that splicing is faster for intron-defined, short introns (60-70 nt), whereas for exon-defined introns, splicing was faster for introns longer than 2,944 nt, suggesting a complex relationship between splicing kinetics and intron length. We find here that exon-defined splicing in human cells is fastest for introns with a length of around 2,000 nt, whereas short introns (<140 nt) on average take about twice as long to be spliced. High spatial and temporal resolution kinetics data coupled with focused analyses of these very short introns remain to be performed for other species to understand how universal these observations are.
Another observation we made was that longer human introns (>10,000 nt) show increased donor bond half-life, apparently because donor cleavage requires RNA polymerase II to first transcribe the intron. This effect of transcription-limited splicing is observed at~10% of human precursor RNA introns, whereas most human introns are short enough so that their splicing kinetics are not limited by transcription. How the polymerase elongation rate depends on the nature of the intron and how this influences splicing remains to be investigated. The observed relationship between transcription and splicing of coding RNAs extends to non-coding RNAs. We found that spliced coding RNAs and spliced non-coding RNAs showed similar RNA synthesis rates ( Figure 2C), whereas previous studies reported considerably slower synthesis rates for mainly unspliced, non-coding RNAs (Mukherjee et al., 2017).
Our definition of splice site-specific RNA cleavage rates also allowed for a comparison of kinetic information in vivo with detailed structural knowledge of the spliceosome in different functional states obtained in vitro. Structure-based interpretation of our nucleotide-resolution kinetic results highlights the importance of interactions between snRNAs in the spliceosome and the precursor RNA substrate and provides guidance for interpreting interactions in structures of the spliceosome yet to be obtained. Our results suggest that the predicted interactions between the precursor RNA and snRNAs in the P complex, which have presently only been obtained in yeast (Bai et al., 2017;Wilkinson et al., 2017) may be similar in human. We show that different types of RNA-RNA interactions observed at various stages of the process are related to the splice site cleavage time. In particular, RNA-RNA interactions must be of high enough affinity to allow for sufficiently specific recognition of splice sites, yet the affinities must be in a range that also allows for rapid conversion between subsequent states, which can show strongly altered RNA-RNA interactions. However, many processes and factors contribute to the observed apparent splicing rates, and these must be disentangled in the future.
Taken together, we have analyzed the metabolism of individual donor and acceptor splice sites in vivo and provided quantitative models for how RNA splicing kinetics may be encoded in the human genome. As we looked at a single growth condition, our models are derived from comparisons across genes and essentially reflect the affinity of precursor RNAs to the core splicing machinery. However, the experimental and computational methodology presented here could be applied to different cell types or under dynamic responses to reveal and quantify the role of splicing regulatory factors and of their related binding sites. Another interesting future direction is the modeling of alternative splicing, which is understood to be the outcome of competitions of alternative donor or acceptor sites with various strengths. Our distinct models of donor and acceptor site kinetics may help to build up such quantitative competition models. Eventually, quantifying the contribution of individual bases to splicing rates, backed by structural and functional studies, may explain the numerous contributions of splicing to the genetics of rare (Ló pez-Bigas et al., 2005) and common (Li et al., 2016) diseases. Cell culture K562 cells were obtained from DSMZ (DSMZ no.: ACC-10) and grown in RPMI 1640 medium (Thermo Fisher Scientific, 31870-074) supplemented with 10% heat-inactivated fetal bovine serum (Thermo Fisher Scientific, 10500-064) and 2 mM GlutaMAX (Thermo Fisher Scientific, 35050087) at 37˚C and 5% CO2. Cells were routinely verified to be free of mycoplasma contamination using Plasmo Test Mycoplasma Detection Kit (InvivoGen, rep-pt1). K562 cells were authenticated at the DSMZ Identification Service according to standards for STR profiling (ASN-0002).

TT-seq time series
TT-seq was performed as described (Schwalb et al., 2016), with minor modifications. Specifically, 2.5 Â 10 7 cells from two biological replicates were used for each time point. Cells were exposed to 500 mM of 4-thiouracil (4sU, Carbosynth, NT06186) for 2, 5, 10, 15, 20, 30, 60 min at 37˚C and 5% CO 2 . Cells were harvested by centrifugation at 600 g for 2 min at 37˚C. Cell pellets were lysed in 5 mL of QIAzol (Qiagen) and 150 ng of RNA spike-ins mix were added to each sample. RNA spike-ins were produced in house, based on ERCC-RNA sequences (sequences of spike-ins are described in Supplementary file 6). RNA spike-ins were produced as described (Schwalb et al., 2016). RNAs were extracted using QIAzol according to the manufacturer's instructions. RNAs were sonicated to obtain fragments of <6 kbp using AFAmicro tubes in a S220 Focused-ultrasonicator (Covaris Inc, parameters: 10 s, peak power 100, cycles 200, duty cycle 1%). The quality of RNAs and the size of fragmented RNAs were checked using Fragment Analyzer. 1 mg of each of the sonicated RNAs was stored at À80˚C as total RNA (RNA-seq) and later eluted with miRNAeasy Micro Kit (Qiagen, 217084) together with 4sU-labeled purified RNAs. 4sU-labeled RNAs were purified from 300 mg of each of the fragmented RNAs. Biotinylation and purification of 4sU-labeled RNAs was performed as described (Dö lken et al., 2008;Schwalb et al., 2016). Biotinylated 4sU-labeled RNAs were separated from unlabeled RNAs with streptavidin beads (Miltenyi Biotec, Bergisch Gladbach, Germany) and eluted in 100 mM DTT as described in Dö lken et al. (2008) and Schwalb et al. (2016). 0.3M sodium acetate was added to 4sU-labeled purified RNAs and to total RNAs prior RNA extraction. RNAs were extracted and eluted using miR-NAeasy Micro Kit (Qiagen, 217084). The on-column DNAse I treatment (Qiagen, 79254) was performed for 15 min at 25˚C. Prior to library preparation, total RNAs and 4sU-labeled purified RNAs were quantified using Qubit. Enrichment of 4sU-labeled versus unlabeled RNAs was analyzed by RT-qPCR using oligonucleotides amplifying selected regions of 4sU-labeled and unlabeled spike-ins (sequences of oligonucleotides are described in Supplementary file 6). Only 4sU-labeled purified samples showing DDCt changes from 4 to 6 were subjected to library preparation (total RNAs were used as a control for normalization). 100 ng of input RNA was used for strand-specific library preparation according to the Ovation Universal RNA-seq System (NuGEN). Libraries were prepared using random hexamer priming only. The size-selected libraries were analyzed on a Fragment Analyzer before sequencing on the Illumina HiSeq 4000.

Read alignment and counting
Paired-end 150 bp reads with additional 6 bp of barcodes were obtained for each sample. Reads were aligned using STAR version 2.5.0a (Dobin et al., 2013) in single pass mode. The genome Index was built against the full GENCODE version 24 annotation and the hg38 (GRCh38) genome assembly (Human Genome Reference Consortium) using 150 bp overhang size. Additional specified parameters were alignSJDBoverhangMin 2, chimSegmentMin 15, chimScoreMin 15, chimScoreSeparation 10, and chimJunctionOverhangMin 15. The aligned reads were filtered for duplicates using Picard tools version 2.5.0 (https://broadinstitute.github.io/picard/) using the option MarkDuplicates REMOVE_DUPLICATES = true. In average, each TT-seq sample yielded about 250 M reads and each RNA-seq sample about 55 M reads. For each sample,~90% of the reads could be uniquely mapped to the reference genome. The duplication ratio was estimated to 55% by FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
Using the rCube package (https://github.com/gagneurlab/rCube), all split reads (containing N stretches in Cigar string) were extracted to create a database of potential introns (~341 k). The obtained introns were classified relative to annotated introns and genetic elements from the GEN-CODE annotation (version 24 obtained from https://www.gencodegenes.org/releases/24.html). For each intron three characteristic counts were calculated: The numbers of reads starting in the upstream exon and extending into the intron ('donor'), the number of reads starting in the intron and extending into the downstream exon ('acceptor'), and all split reads matching the introns coordinates ('junction'). The reads were filtered using a bam quality score of 255. Reads having secondary alignment flag were discarded.

Estimation of sample normalization factors and cross-contamination
To estimate the sample normalization factors F j that account for variations in sequencing depth as well as the overall newly synthesized RNA fraction and the fraction of cross-contamination j of nonlabeled reads in the TT-seq data, we modeled the expectation of counts E ij of spike-in i in sample j using a statistical model similar to the one of Schwalb et al. (2016).
j is set to 1 for all RNA-seq samples, d i is 0 for labeled spike-ins and 1 for unlabeled spike-ins. The parameter p ij is the condition and spike-in specific extraction probability. The difference with (Schwalb et al., 2016) is to allow the parameter p ij to be condition-specific (TT-seq or RNA-seq), which turned out to model better cross-contamination of unlabeled RNA in the short duration TTseq libraries. We set p ij ¼ p ik for all sets of j and k belonging to either RNA-seq samples, TT-seq samples or if i belongs to a labeled spike-in. We assumed read count data to follow a negative binomial distribution with a common dispersion parameter for all data. The model parameters and the dispersion parameter were fitted as generalized linear model using maximum likelihood.

Kinetic rate modeling and estimation
For each detected intron i we modeled the concentrations c i;l of each of three characteristic bonds (donor, acceptor, junction) independently following a first order kinetic rate equation. Without loss of generality, we consider in the following just one of the three equations -the other two behave the same.
We assume that all newly synthesized RNAs are labeled. The concentration of labeled bonds, assuming an initial concentration of 0, follows: Also, the old, non-labeled RNA decays exponentially as c i t; unlabeled ð Þ ¼ ai b i e Àtb i . Using the normalization factor F j of sample j, the labeling time t j and j the cross-contamination of unlabeled RNAs in the purified fraction, the concentration can be mapped to its expected count E i;j : We modeled read counts using the negative binomial distribution, a count distribution often used for RNA-seq data because it captures sampling noise and further sources of variations. The kinetic parameters a i ; b i are estimated by maximizing the log likelihood , where k i;j are the observed counts, using the BFGS numerical optimization algorithm and using the dispersion parameter obtained from the spike-ins analysis. The optimization was initialized 10 times with independent random parameters; the final solution comprises the median of all a i ; b i over the different runs to compensate for numerical instabilities. We removed all donors, acceptors, junctions with too few counts ( P j k i;j <100) from the modeling. Using the table in Figure 2-figure supplement 1A we map the rates a; b to the characteristic kinetic parameters of donor, acceptor, and junction. The whole modeling approach was implemented in R and is available as apackage called rCube (https://github.com/gagneurlab/rCube). Because the donor and acceptor bond half-life models work on a logarithmic scale, we present model errors as multiplicative errors given by the equation median exp log ŷ y , with y as the observation and y the prediction. More details about kinetic models are provided in the Appendix.

Determination of the major isoforms
We applied the software Salmon (Patro et al., 2017) (index kmer size = 31) to all RNA-seq samples and mapped them against the full transcriptome of the GENCODE (Ver. 24) annotation. For each gene, we selected the isoform with the maximum mean TPM value across all RNA-seq samples as the major isoform. The major isoform was only used in the analyses in Figures 2B-D, 3D and 7B. Elsewhere, analyses were only relying on individual junction annotations.

Estimation of the relative uncertainty for the kinetic parameters
To estimate relative uncertainty of the kinetic parameters in a conservative way we assumed that all donors and acceptors of the major isoform of a given GENCODE gene shared the same synthesis rate equal to the transcription rate of the gene. We further assumed that all products ('junctions') shared the same half-life equal to the mature RNA half-life. Because noise of these rate estimates is typically multiplicative, we computed the standard errors of the logarithm of these rates and reported relative uncertainties as the exponential of these standard errors.

Comparison of 4sU-seq and TT-seq
Alignment, counting and estimation of normalization and cross-contamination factors of the RNAseq data sets of Schwalb et al. (2016) was done as described above for our data. Counts for 4sUseq and TT-seq was normalized usingK i;j ¼ FRNAÀseq , where i denotes the split / unsplit reads as shown if Figure 2A for each intron of the major transcripts and j is the sample (4sU-/ TT-seq and replicate). Both replicates were pooled together.

Branchpoint identification
Due to the limited availability of experimental branchpoint measurements, the prediction algorithm LaBranchoR (Paggi and Bejerano, 2018) was utilized to predict branchpoint positions within introns. We applied the model within the kipoi framework (http://kipoi.org/) to score the last 100 nucleotides of each intron and took the nucleotide with the maximum score for being the utilized branchpoint. The results were validated using experimental data of Mercer et al. (2015) where available.

Estimation of single nucleotide effects
We identified nucleotide positions not predictive of donor or acceptor bond half-life and estimated the effect of the remaining single nucleotides on the donor bond half-life and on the acceptor bond half-life by regression. To this end, we modeled log-transformed half-lives of the donor bonds (and with a separate model of the acceptor bonds) as a weighted sum of each of the 20 nucleotides upstream or downstream of the donor, acceptor site and branchpoint, as well as of the GC frequency of the whole intron, the donor site, and the acceptor site. In this linear model, the reference sequence was chosen to be the consensus sequence so that the coefficients can be interpreted as the effects of substituting a consensus nucleotide to an alternative nucleotide. Lasso regression (Tibshirani, 1996) is a regularized linear regression method that can estimate some of the coefficients to be exactly 0 and that is therefore often used to select explanatory variables. We performed Lasso regression as implemented in glmnet (Friedman et al., 2010), choosing the largest shrinkage parameter at which the mean squared error (MSE) was within one standard error of the minimal MSE using 10-fold cross-validation. For donor bond half-life, as well as for acceptor bond half-life, the Lasso regression fit led to several nucleotide coefficients to be exactly 0. We then removed all the nucleotide positions where all single nucleotide effects had a coefficient equal to 0. Next, we estimated the single nucleotide effects of all remaining positions as well as the effect of GC frequency of the whole intron, the donor site, and the acceptor site on log-transformed donor and acceptor bond half-lives using ordinary least squares regression.

Estimation of octamer effects and multivariate model
The number of occurrences of all 65,536 nucleotide octamers in the regions 15-100 nt downstream of the donor site, 100 nt upstream of the branchpoint, all nucleotides between the branchpoint and the five nt upstream of the acceptor site and 5-100 nt upstream of the acceptor site were counted allowing for two mismatches. The 15 nt immediately downstream of the donor site or 5 nt upstream of the acceptor site were excluded from the octamer search space because they were already incorporated in the single nucleotide model. Regions extending in the upstream or downstream exon were cropped to keep them within the intron. The base 2 logarithm of octamer pseudo-counts log 2 (count +1) were used as covariates together with the GC frequency of the intron and the GC frequency of each region. The log-transformed donor/acceptor bond half-lives were the response variable. Lasso regression was applied to each region independently with 5-fold cross-validation to choose the optimal shrinkage parameter and select potential significant octamers. In a second step all selected octamers of each region were used together with the single nucleotide model as well as the GC frequency of the different regions, intron length and whether an intron is the first within the major transcript in a joint model to refine the selection of octamers (Lasso 10-fold cross-validation).

Octamer match to ATtRACT database
We compared each octamer to all reported PWMs with at least 5 nucleotides of the ATtRACT database and calculated the ratio between the probability of the best matching position (PWM-score) and the highest possible probability for any octamer (RPM-score, Cook et al., 2011). Each octamer was padded with an equal number of 'N's at both sides if the PWM was longer than the octamer. We ranked all matches based on their RPM-score and kept only the best 5% for each PWM and removed afterwards all matches with a RPM-score less than 0.9. The remaining matches were considered as hits.

Phylogenetic conservation of octamers
To calculate the phylogenetic conservation score for each octamer, we retrieved the PhastCons 100way track (http://hgdownload.cse.ucsc.edu/goldenpath/hg38/phastCons100way/), which reports conservation across 99 vertebrates aligned to the human genome, and extracted the mean of all nucleotides for all matching positions. Octamers of the region 100 nt downstream of the donor site found to be predictive for donor site or acceptor bond half-life were also searched in the region 100 nt downstream of the donor site. Octamers of the region 100 nt upstream of the branchpoint or acceptor site or between the branchpoint and the acceptor site found to be predictive for donor or acceptor bond half-life were jointly searched in the region 100 nt upstream of the acceptor site, since these three regions were strongly overlapping. We also included as list of 2000 random octamers to estimate the background distribution in the same regions.

Calculation of splicing yield
We define the splicing yield of donor h donor and acceptor h acceptor as follows:  (5) where a donor and a acceptor denote the synthesis rates of the donor site and of the acceptor site phosphodiester bond, respectively, and a junction donor; acceptor ð Þ denote the synthesis rate of the spliced exon-exon phosphodiester bond utilizing the specified donor and acceptor. Since the first-order kinetic model does systematically underestimate junction synthesis rates and overestimate donor synthesis rates, we switched to the alternative kinetic models to estimate these rates. However, since the first-order kinetic model is more robust and the acceptor kinetics do not include a delay we used the first-order kinetic model for the estimation of the acceptor synthesis rate. We defined the intron splicing yield h as the acceptor site splicing yield because its estimation is more robust compared to the donor site splicing yield.

Code availability
All the code used for counting donor site, acceptor sites, and junction reads as well as estimating the kinetic rates is available in the R package rCube (https://github.com/gagneurlab/ rCube; . The single nucleotide model is shared in the model repository Kipoi (http://kipoi.org/models/CleTimer/; Avsec et al., 2019).

Accession code
The sequencing data and processed files were deposited in NCBI Gene Expression Omnibus (GEO) database under accession code GSE129635. The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.  We assume that the donor bond can only disappear by splicing with rate s D or by precursor RNA degradation with rate l D . Similarly, we assume that the acceptor bond can only disappear by splicing with rate s A or by precursor RNA degradation with rate l A . We assume that the junction bond only disappears by mature RNA degradation with rate l J .

Author contributions
We define the 'splicing yield' of an acceptor bond h A as the fraction of acceptor bonds that get spliced h A :¼ sA sAþlA . For an acceptor that is spliced with a single donor, this is equivalently defined by the relation a J ¼ h A a A because of flux balance at steady-state. Generally, as acceptors can splice with multiple donors, we sum up the synthesis rates of the different resulting junctions J, leading to: The splicing yield for donor bonds h D is analogously defined.

Relation to steady-state quantities used in literature
In this section we are relating our parametrization defined in section 1.1 with other RNA-seq based steady-state quantities typically found in the literature. Many studies used quantities that combine donor and acceptor reads. A reasonable assumption is that RNA polymerase II does not drop off during the transcription of an intron. With this assumption, steady-state synthesis rates of the donor and of the acceptor of one intron are equal to the transcription rate ð :¼ a D ¼ a A Þ. Moreover, the steady-state amount of donor and acceptor unsplit RNA-seq reads of a same intron are often roughly equal, implying that the relative difference of degradation rates of the donor and acceptor bonds are small compared to their average. We define the average degradation rate of the donor and of the acceptor bond 2 Kinetic models 2.1 Models for the donor bonds and for the acceptor bonds

Constant degradation rate
We used a model that assumes first order kinetics, that is with constant parameters a D and b D . Denoting D ½ the concentration of donor bonds, we modelled: This leads to:

Constant splicing rate with a fixed delay
We also used a more complex model that takes into account that splicing can only be completed after the transcription of the corresponding acceptor site, which introduces a delay d. Assuming for the sake of simplicity that degradation exhibits the same time delay we find: The resulting delayed differential equation model solves as: (2.4) We show below that this model is difficult to fit on our data. Although there is no obvious biophysical motivation for the delay to be equal for degradation and splicing, a more complex model would need more parameters and be even more difficult to be fitted on our data.

Models for the acceptor site
Since the acceptor site can be spliced immediately after its transcription, we did not consider delayed models for the acceptor bonds. Instead, we only considered a first order kinetic model with constant synthesis rate as in Equation (2.2).

Constant junction formation rate model
We used a model that assumes first order kinetics, that is with constant parameters a J and b J . Denoting J ½ the concentration of junction bonds, we modelled: which leads to:

Coupled model
We also used a more complex model that describes the formation of junction bonds as the outcome of splicing with a coupled first order kinetic model. Assuming that each acceptor bond that disappears through splicing creates a junction, we have a J ¼ s A A ½ in (2.6). This leads to: Because of possible alternative splicing, this model was fitted to the split reads data only, that is without considering the unsplit reads of the corresponding donor and acceptor sites. Below we show that the coupled model is difficult to fit and therefore we worked with the constant rate model during our whole analysis except when calculating splicing yield.
To calculate the splicing yield h A of an acceptor as in Equation (1.1), we (i) estimated a J by fitting Equation (2.7) to the split reads of all junctions the acceptor is involved in, and (ii) estimated a A using the first order kinetic model on intron-exon reads.
3 Comparison of the models

Parameter estimation
The constant rate models use only two parameters whereas the more complex models (the delay model and the coupled splicing model) used three parameters. To compare how well we can retrieve the model parameters from observed data, we simulated counts based on typical synthesis, splicing, degradation and delay constants using typical negative binomial noise and parameter ranges. We then tried to retrieve the parameters using a maximum likelihood approach with random initialization. We compared the results to the ground truth and found that the two-parameter models (Appendix 1-figure 1) were more accurately fitted than the three-parameter models (Appendix 1-figures 2,3). All models showed an unbiased estimate for the synthesis rate, which is important to calculate the splicing yield.

Relation between the parameters of the different models
Since the two-parameter models are more accurately fitted than the three-parameter models, we investigated the validity of approximations relating the parameter of the twoparameter models to those of the three-parameter models.
For the donor bonds, we asked how well the sum of the delay d and the splicing half-time For the junction bonds, we asked how well the sum of the splicing half-time log 2 ð Þ b A and the mature RNA half-life log 2 ð Þ b J is approximated by the junction bond half-life estimated by the constant rate model. To assess how many introns are affected by potential biases of using the constant rate model instead of the more complex models, we created histograms of the estimated error for each model parameter given the observed distribution of synthesis rates, splicing times and half-lives (Appendix 1-figures 4-7). To get a distribution for the delay, the intron length was divided by an assumed polymerase velocity of 4 kb/min (Jonkers et al., 2014;Gressel et al., 2017). Compared to our estimated measuring precision of 180% fold for synthesis and 32% fold for half-life, the absolute errors are negligible for the synthesis rates and for the splicing half-time.

Quality of fits 4.1 Simulated vs. fitted rates
To assess whether our method is able to estimate synthesis and degradation rates from ground truth, we simulated counts based on the estimated distributions of synthesis rates, splicing half times and half-lives based on the experimental data. Based on simulated data, our method is an unbiased estimator of ground truth synthesis rates (Appendix 1- figure  12), splicing half-time and half-life (Appendix 1- figure 13) with high precision compared to the dynamic range.

Agreement of kinetic parameters donor, acceptor and junction model
Under the assumption that RNA polymerase II does not drop off during the transcription of one intron, the donor and acceptor synthesis rate are equal (Appendix 1-figure 14), whereas the junction synthesis rate reduced by the splicing yield (Appendix 1-figure 15).

Expected vs. observed counts
Comparisons of the predicted expected counts of the constant rate model with our observed experimental counts are shown in Appendix 1-figures 16 (donor), Appendix 1-figures 17 (acceptor) and Appendix 1-figures 18 (junctions).
Normalization was based on spike-ins. Therefore, errors in spike-ins quantification possibly led to off centred the density plots. Indeed, the non-centred scatterplot showed deviation compatible with the bins they were created, {2 min}, {5 min, 10 min}, {15 min, 20 min}, {30 min, 60 min}. lower cut-off of 100 reads (after the 6 th ventile) allows us to estimate the kinetics with a typical error below 100%.
Appendix 1-figure 23. Relative synthesis rate and half-life error vs. the accumulative bond read count accumulated in all samples and binned in its ventiles. The first bin comprises the first two ventiles. To test the shortest and longest bond half-lives our experimental approach is able to capture and model properly we stratified the results also by bond half-live. We proceeded similarly with the bond synthesis rates. We found that most of our modelled data lies within a range where the median relative error of synthesis rates and half-lives is below 100% or 30% respectively (Appendix 1-figure 24, 25). We note that the errors given for the stratifications by bond half-life and synthesis are inflated for real world data, because the simulation is based on half-lives and synthesis rates that were drawn independently and more extreme than our real world data.
genes. Note that this time has to be very short since labeled RNA was detected after 2 min labeling.
Assuming that only a fraction of all transcripts incorporates 4sU, this would lead to an underestimation of the synthesis rates (e.g. a labeling efficiency of 90% would results in a 10% underestimation of the rate), given that the labeling efficiency is constant over all samples. However, it does not affect estimation of the half-lives nor of the splicing yields. Indeed, the kinetic models are then modeling the kinetics of the labelled fraction of the RNA species rather than the overall RNA species. These kinetics differ from the kinetics of the overall RNA species only by different synthesis rates.

Splicing yield
Splicing yield is computed by independently estimating the synthesis rate of the precursor (using unsplit reads of the acceptor bond) and of the mature RNA (using split reads) and taking the ratios (Materials and methods). Due to estimation errors, these ratios may turn out to be greater than 1. Simulations (section 5) showed that cumulative read coverage across all samples as well half-life can lead to bias estimates of synthesis rates. We therefore investigated whether this could confound our observation that non-coding RNAs show lower yield than mRNAs. However, the higher yield of mRNA versus non-coding RNA was recapitulated when stratifying by these possible confounders (Appendix 1-figures 26, 27).
Appendix 1-figure 26. Splicing yield distribution (boxplot) of lincRNA (red) and mRNA (blue) stratified by bins of cumulative number of reads across all samples. Although yield correlates negatively with the cumulative number of reads, indicative of potential estimation bias, the mRNA yield remains higher than the lincRNA yield in every stratum.