Bayesian inference of polymerase dynamics over the exclusion process

Transcription is a complex phenomenon that permits the conversion of genetic information into phenotype by means of an enzyme called RNA polymerase, which erratically moves along and scans the DNA template. We perform Bayesian inference over a paradigmatic mechanistic model of non-equilibrium statistical physics, i.e. the asymmetric exclusion processes in the hydrodynamic limit, assuming a Gaussian process prior for the polymerase progression rate as a latent variable. Our framework allows us to infer the speed of polymerases during transcription given their spatial distribution, while avoiding the explicit inversion of the system’s dynamics. The results, which show processing rates strongly varying with genomic position and minor role of traffic-like congestion, may have strong implications for the understanding of gene expression.


I. INTRODUCTION
DNA is a long polymeric molecule that encodes information as a sequence of nucleotides (Nts).Turning this information into a phenotype is a complex phenomenon hinged upon transcription, the molecular process in which particular segments of DNA (i.e., the genes) are scanned and their information is copied into mRNA by the enzyme RNA polymerase II (PolII).The transcription itself consists of several steps which can be differentially regulated to alter the timing and the output of the mRNA production [1,2].
The transcription can also be seen as a non-equilibrium process, where the PolIIs are being transported as particles on a one dimensional lattice, the lattice being the DNA template which the PolIIs bind to.We can further consider this process having left and right boundaries, representing the transcription start site (TSS) and the transcription end site (TES), respectively (Fig. 1 A).Within the gene body, the PolIIs erratically travel along the template and their abrupt slowing down in certain genomic regions is known as pausing dynamics [3,4].While the pausing is an essential part of the transcriptional machinery and contributes to the regulation of genes' expression levels, a comprehensive quantitative understanding of its dynamics is still missing [5,6].
We present a modelling framework to help understand gene regulation and quantitatively study the pausing dynamics given real-world data.In literature, a number of different mechanistic models have been introduced to elucidate transcription, starting from the simple telegraph model [7] to more complicated multi-state models that account for many interactions [8][9][10][11], with each model reflecting determinate aspects of the whole biological system complexity.Here we are primarily interested in the pausing and employ a generalisation of a paradigmatic model of particle transport, the asymmetric simple exclusion process (ASEP, [12][13][14]) in the hydrodynamic limit [15].The ASEP is a class of models of particles on a one-dimensional lattice, whose behaviour is chiefly determined by the rates at which the particles hop on the lattice.More specifically we require the rate profile function, which we refer to as p, to be spatially varying yet smooth as in reference [16,17], see also [18], thus making it possible to model this function by a Gaussian process (GP) [19].Noticing the analogy between the PolII transport in the gene body and the particle hopping in the exclusion process, learning p allows the study of the pausing dynamics in a gene.Importantly, we provide an inferential scheme to learn this rate function by Bayesian inference given real molecular biology data, assuming a prior on the profile function induced by a GP prior on a latent variable.In other words, integrating the dynamics defined by the rate p generates transient time-course density profiles; we estimating p given observed density profiles without explicitly inverting the system's dynamics.Other models of PolII dynamics also leverage GPs for inference from biological data, with GPs representing transcriptional activity over time [20,21].In contrast, the GP here describes a function of genomic position, with its minima corresponding to pausing regions.Due to its generality, our framework can be deployed to estimate the rate profiles of any one-dimensional transport problem.
The manuscript is organised as follows.Section II A describes the biology of pausing and the next-generation sequencing (NGS) data types which are available to study it.Section II B and II C, respectively, discuss the asymmetric simple exclusion process as a mathematical model for transcription with pausing and a Bayesian inferential framework for model fitting.
We present the results in section III and conclude with a discussion in section IV.

A. Biological processes and data
RNA polymerases have a central role in the biology of transcription.We distinguish different classes of RNA polymerases, each having different structure and control mechanism.
Bacteria and Archaea only have one RNA polymerase type.Eukaryotes have multiple types, of which RNA polymerase II (PolII) is known to catalyse synthesis of protein-encoding RNA (messenger RNA or mRNA).In this paper we describe mRNA transcription by PolII, but the inferential framework we present is general and can be extended to other transport phenomena.PolII binds to DNA upstream the TSS, initiates the mRNA synthesis, and then traverses the DNA downstream (elongation) until it pauses at a certain gene location, ready to respond to a developmental or environmental signal that instructs to resume the elongation.PolIIs are also found proximal to TSS in a so-called "poised" state, which has not initiated synthesis of the mRNA chain.Poised and paused PolII can be differentiated as only paused PoIIs have a tail of nascent mRNA and is bound to transcription factor Spt5 [22].The process terminates when the PolII reaches the TES and the transcribed mRNA is released.As a result of these steps, the output is modulated in both timing and intensity.However, many details, such as the pausing, are not well understood [5].The presence of transcriptional pausing in Eukaryotes is revealed by several assays based on NGS, which is widely used in molecular biology to study molecules involved in genic processes.In the PolII ChIP-seq assay, PolII-bound DNA is isolated by chromatin immunoprecipitation with a PolII antibody and is then subject to high-throughput sequencing.This provides a genome-wide view of the PolII binding sites for all forms of PolII, including both those poised or transcriptionally engaged and those which are bound to DNA and static.In ChIP-seq experiments, DNA fragments extracted from cells and associated with a specific protein (here polymerase) are amplified, sequenced, and mapped to the reference genome, with fragments generally in the 150-300 Nt range [23] (while transcribing PolII covers less than 50 Nts of DNA [24]).This means that the precise locations of the individual proteins are not known and the assay only returns the overlap of reads from many different cells.For each genomic position, PolII ChIP-seq returns a signal as a proxy of polymerase occupancy.
For this study, we binned ChIP-seq reads from genomic ranges of selected genes (from cultured human cell lines, Materials and Methods) into 20-Nt bins, thus yielding coarsegrained read profiles (which we refer to as y) such as those illustrated in Fig. 1 B. The number of these reads at a position x is proportional to the occupation probability (x).
The proportionality factor, which depends on the number of cells used in the experiment and on further signal amplification intrinsic to the sequencing procedure, cannot be directly accessed with precision and is only known with substantial uncertainty [25].
Other methods available to study the pausing include but are not limited to NETseq, where nascent mRNA chunks associated with immunoprecipitated PolII complexes are isolated and sequenced [26], GRO-seq, where RNAs recently transcribed only by transcriptionally-engaged PolIIs are sequenced [27], and PRO-seq, which is similar to GRO-seq but reaches single-nucleotide resolution [28].The evidences of PolII transport are particularly clear in time-course experiments, where sequencing data are collected over time following a perturbation.As an example, time-variant PRO-seq has been suggested to estimate pausing times in key peak regions [29].A classical way to perturb these molecular dynamics is inhibiting the initiation by treating the cells with triptolide (Trp), which is a highly specific drug that blocks initiation [22,30].This permits the PolII already engaged in transcription to progress further downstream the gene while new PolIIs are prevented to attach, thus freeing upstream genomic regions as the run-on time progresses (Fig. 1 C).Our In the presence of triptolide, transport is blocked upstream of TSS, while transcriptional engaged PolII are allowed to complete elongation; this is reflected in the ChIP-seq profiles obtained 2, 5, and 10 minutes after treatment (also in B). approach consists of using the read profiles y as functions of x, collected at fixed run-on times t 1 , t 2 , t 3 , and t 4 after treatment, to infer the dynamics.While Trp inhibits new initiation, poised PolII upstream the TSS can still pass through it, enter the gene template, and perform elongation immediately after Trp treatment [22,30].To account for this, we also perform inference over Spt5 ChIP-seq data, where the poised polymerases are masked while those bound are detected [22].
These types of experiments reveal the presence of a flux of PolIIs, which is the signature of the non-equilibrium physics involved in the elongation process.The profile y * observed prior to the treatment corresponds to a non-equilibrium stationary state (NESS).Disrupting initiation with Trp yields a transient state, which evolves from y * until it settles down to a new NESS.

B. Mathematical model
The transport of particles on a one dimensional lattice is a well-studied problem in mathematics and physics.Its basic features are captured by the asymmetric simple exclusion process (ASEP) [14], which defines the stochastic dynamics of interacting particles on a discrete lattice, which we take here to be a one-dimensional chain with open boundary conditions.Let the total number of lattice sites be N .The state of each site i, 1 ≤ i ≤ N , is characterized by the occupation number n i such that n i = 0 if the site is empty and n i = 1 if it is occupied by a particle.The evolution proceeds in continuous time.A particle on site i < N hops rightward into the site i + 1 with rate p i , the transition being successful only if the site i + 1 is empty.Similarly, a particle on site i > 1 hops leftward into i − 1 with rate q i , if the site i − 1 is empty.Further, particles on the left (right) boundary site i = 1 (i = N ) leave the lattice at rate q 1 (p N ), while particles are injected in the same boundary site at rate p 0 (q N +1 ) if the site is empty.The constraint that a jump can occur only if the target is empty prevents the accumulation of more than one particle on a site and is generically referred to as the exclusion rule.This rule allows particle collision, which causes congestion when the particle density is sufficiently high and permits phase transitions between a low density, high density, and a maximum current phase even if the systems is one dimensional [31].Interestingly, based on theoretical considerations, it has been suggested that traffic-like congestion of PolIIs is important in transcription [32][33][34].
While the ASEP was originally proposed to model biopolymerization on on nucleic acid templates [12,13], this and related models have been more recently applied to diverse problems, including protein translation [35][36][37], but also, e.g., molecular motors [38] and pedestrian and vehicle traffic [39].Applications to transcription incorporating disordered dynamics and obstacles (e.g., [40,41]) were also proposed.ASEP's theoretical appeal is due to its analytical results representative of a large class of models [42,43] and a convenient mean-field treatment that yields the exact stationary solution [44].In the context of transcription, particles entering site 1, moving along the chain, and exiting from site N correspond to initiation, elongation, and termination, respectively.In our setting, the lowest values of p i correspond to genomic locations were elongation slows down.
The dynamics of the expected occupation of a single site i in the bulk are governed by the lattice continuity equation 0 < i < N , where E denotes expectation value and J left (t) and J right (t) are the average flux of particles from site i − 1 to site i and from site i to site i + 1, respectively.These are subject to the exclusion rule and therefore obey ( In order to exactly solve these dynamics, second-order moments such as E(n i (t)n i+1 (t)) need to be known.Under independence assumption, these moments are factorised, which in our case amounts to replacing equations ( 1)-( 2) with where we used φ i (t) := E(n i (t)) to lighten the notation.In other words, equations (3) define the so-called mean-field dynamics of the asymmetric exclusion process, which are known to approximate well the true dynamics in many contexts, predict crucial features such as dynamical phase-transitions, and ease mathematical treatment [31,44,45].With open boundaries, To match the available data that is coarse grained (Fig. 1 B), instead of considering particles individually we rely on their hydrodynamics description, which is obtained as follows.
We assume Euler scaling with constant a and let a → 0, N → ∞, with L := N a held finite.
We define the functions : R 2 → R + 0 , p : R → R + 0 , and q : R → R + 0 such that they are analytic and bounded on ]0, L[×]0, ∞[, ]0, L[, and ]0, L[, respectively, and We further assume that the left and right jump rates satisfy q with 0 ≤ b < 1, where b governs the relative strength of the non-equilibrium driving forces.
The case b = 0 corresponds to a totally asymmetric exclusion process (TASEP), while the limit case b = 1 corresponds to the symmetric exclusion process.Intermediate values 0 < b < 1 correspond to settings where the particles can jump in both directions, but are driven rightwards on average.A continuum-limit counterpart of equations ( 2), as derived in references [16,18], is (7) which, using first-order Taylor expansion, yields To lighten the mathematical notation, we define the two quantities their ratio is constant in x, viz., ν(x)/λ(x) = a/2 (1 + b)/(1 − b), which equals a/2 in the totally asymmetric case.
Substituting ( 8)-( 9) into the continuity equation which is the hydrodynamics limit of equation ( 1), gives the non-linear partial differential which can be linearised to by means of a generalisation of the Cole-Hopf transform (Appendix A and references [18,46,47]).
In transcription, the particle flux is left to right.While PolIIs can backtrack few Nts under certain circumstances [48][49][50], this phenomenon is overall minor and is not observable at our ChIP-seq resolution.Therefore, we assume b = 0 and focus on the inference of the net forward rate profile p(x).For simplicity we we also set a = 1, arguing that our considerations remain valid with such a choice.The required boundary values (0, t), (L, t), and (x, 0), and the numerical scheme used to integrate equation ( 12) are detailed in Appendices A and B.
Integrating equation (11) with boundary conditions analogous to equations ( 4)-( 5) and initial density (x, 0) > 0 yields a NESS for large t, characterised by a non-vanishing average flux and a density profile * (x) which is invariant in time.Setting the latter as initial condition and further integrating with no inward particle flux (p 0 = q N = 0) produces a transient state that mimics the evolution of the PolII profile after Trp treatment until the density profile vanishes.This is illustrated, for a choice of boundary values and jump rate profile, in Fig. 2, which also includes the result of the inference process described in the next subsections.

C. Bayesian framework
We fit the model to real-world data by means of a Bayesian approach, leveraging its ability to explicitly encode prior hypotheses about the quantities we wish to infer [51].We are interested in the forward rate profile p.As this is required to be analytic and nonnegative, it is convenient to assume a Gaussian process (GP) [19] functional prior on a latent variable f and induce a prior on p using a sigmoid link function of f , such that p = pmax /(1 + exp(−f )), which further imposes an upper bound pmax to p.The GP prior here defines a distribution over real valued C 1 functions in R, where any finite set of function evaluations f (x) has multivariate normal distribution with mean m and covariance kernel In practice, the GP is evaluated at the The observations are organised into a collection of values y = {y ij } i=1,2,...,n,j=1,2,...,t , where the subscripts indicate that an observation is taken at position x i and time t j .As the values of x i do not necessarily coincide with the bin centers of ChIP-seq data, we used simple linear interpolation to estimate the data at intermediate coordinates.We assume that the observed values depend on a multiplicative factor and also include an additive error term ∼ N (0, σ ).This can be written in terms of the equation κy ij = (x i , t j ) + , where κ is the inverse of the amplification factor.The likelihood P (y|f , σ , κ) where we made explicit that depends on f .For the hierarchical parameters (m, σ , κ, σ f , l) =: θ we assume a scaled sigmoid Gaussian prior probability P (m, σ , κ, σ f , l) such that where θ min := (m min , σ min , κ min , σ f min , l min ), θ max := (m max , σ max , κ max , σ f max , l max ), and (µ ξ , σ ξ ) are referred to as hyperparameters.Prior distributions are chosen to pull MCMC samples away from inappropriate results that are consistent with the likelihood but would not be consistent with domain knowledge [51].By using scaled sigmoid Gaussian prior probability bounded by θ min and θ max , we only search for solutions constrained in an appropriate interval [52].By virtue of the Bayes theorem the joint posterior probability for θ and f satisfies P (f , m, σ , κ, σ f , l|y) ∝ P (y|f , σ , κ)P (f |m, σ f , l)P (m)P (σ )P (κ)P (σ f )P (l), which we draw random samples from by Markov chain Monte Carlo (MCMC) sampling, more specifically block Gibbs sampling with elliptical slice sampling at each block [52,53] (Appendix C).Equation ( 15) expresses the distribution of parameters given the observed data y and completes the definition of the model.It is worth noting that evaluating the likelihood also requires computing by integrating equation (11) with initial condition

III. RESULTS
We first consider simulated data from a given profile of length L = 100 obtained from GP draw with parameters (l, σ f , m, pmax ) = (7.32,0.67, 0.29, 3).We integrate the dynamics with ∀t ) and no-influx boundary conditions (Fig. 2 A-B).The chosen rate profile shows a local minimum close to the left boundary, which yields a minor local perturbation in the density, and a global minimum around x ≈ 60, whose effect propagates along the lattice and acts as a major bottleneck, which separates a low density phase downstream from a high-density phase upstream.These minima correspond to regions where particles slow down or pause for an exponentially distributed amount of time.As the particles leave the system through the right boundary and are not replenished by the influx through the left boundary, the region upstream the bottleneck is emptied by a reverse wave front.
For the purpose of testing whether we are able to recover the rate profile from timecourse observations, we extract density profiles y at times (t 1 , t 2 , t 3 ) = (30, 60, 300) and set the hyperparameters θ min , θ max , and (µ ξ , σ ξ ) to (0, 0, 0.8, 0, 0), (2, 10, 1.2, 1, 10), and (0, 1), respectively.With these settings and data, we generated 10 4 MCMC samples targeting the posterior (15), discarding the first 2×10 3 as burn-in, demonstrating that the fitting procedure is able to capture the location of both the major and minor minima of the generative model, as well as the overall elongation rate (Fig. 2 B).It is worth noting that the integrated density profile in Fig. 2 C and D displays a very small effect of the first local minimum (minor dip, captured only by time-course profiles at t 1 and t 2 ); this is reflected in relatively wide credible intervals for the inferred rate profile (grey ribbon in Fig. 2 B).On the other hand, the rate at the bottleneck is inferred with very high confidence.The covariance hyperparameters l and σ f control how quickly the rate changes over x; these were slightly misestimated to 6.86 (95%CI 4.63-7.16)and 0.76 (95%CI 0.75-0.82),respectively, thus suggesting that increased wobbling in the rate profile is tolerated; minor patterns in the rate profile are in fact smoothed out and are essentially not identifiable in the density profiles obtained by integration (see Fig. S3).The difficulty of sampling covariance hyperparameters is also addressed, e.g., in [52].The predicted transient density profiles at t = 30, 60, 300 also are in very good agreement with the input data (Fig. 2 D; in fact, all sampled rate profiles yield similar time-course density profiles despite wide CIs in certain regions (see also Fig. S3).
Applying this method to real-world data requires setting the value of pmax to an upper limit of prior expectations on the elongation rate.As this has been estimated at around 2 × 10 3 Nt/min in previous studies [30], we set pmax = 6 × 10 3 Nt/min as an arguably safe upper bound.Literature results can be also used to set bounds on the prior for κ, which regularises the estimation problem [19].From cultured human cell lines, the total number P of bound PolII molecules per cell is estimated to be between P min = 11 × 10 5 and P max = 18 × 10 5 [54].This is related to the total number Y of ChIP-seq counts by P = κ Y .
Based on these heuristic considerations, we set κ min = P min /Y and κ max = P max /Y .All remaining hyperparameters were set identical to the previous simulation experiment.
The results from different genes show a variety of rate profiles which share similar patterns (Fig. 3).The most important observation is that, in all genes considered, the rates vary strongly with the genomic position, with local minima corresponding to regions where PolIIs slow down or pause.In order to look for average patterns, it is desirable to aggregate data from all genes.As genes have different lengths (which in our sample range from 16,680 to 59,880 Nts), we stretch all the rate profiles in the region from TSS+1000 to TES-1000 Nts to the same support length and then average over the genes at each position.This yields the summaries illustrated in Fig. 4 which we refer to as metagene rates and are akin to the so-called metagene profiles [55].Rates are typically lower near the TSS than in the gene body, where elongation approaches its highest rate.The behaviour in proximity of the TES is less definite, with rates varying several fold among the different genes.At the TSS the rate typically dips down consistently with the presence of strong and widespread pausing in this region.Further downstream in the gene body the rate increases to its highest average value.While the dip is evident in both Spt5 and PolII results, it is worth noting that upstream the TSS the average rate inferred from PolII data is higher than that from Spt5.We argue that this difference is due to the fact the former also include poised PolIIs which are not strongly bound to the template and can quickly move towards the TSS before being engaged in transcription.A by-product of the fitting procedure is the estimate of the occupation density (x, t) = κ y(x, t), as illustrated in Fig. 2 D for the simulation experiment and Figs. 5, S4, S5, and S6 for selected genes.The predicted densities are typically very low (total predicted number of PolIIs in a gene is in the order of 10 −1 ), thus suggesting that crowding and congestion of PolIIs into a gene might not be substantial even proximal to rate minima.

IV. DISCUSSION
We developed a general Bayesian framework to study the dynamics of a one-dimensional transport model given time-resolved density profiles.The general problem addressed here is the identification of the PDE parameters that best describe data as a subset of the true PDE solution (see, e.g., [52,[56][57][58] and references therein).We focused on the hydrodynamic TASEP with smoothly-varying jump rates (which are the parameters to be inferred) as a paradigmatic and well-characterised model of transport.By means of its application to ChIP-seq time-course data, we inferred the rate of PolII elongation as a function of the genomic position in selected genes.This rate is not constant but varies within the gene body.It typically dips down nearby the TSS, confirming widespread pausing in this region, while in the bulk the rate also varies between genes.Low predicted densities suggest that the pausing did not cause congestion or crowding.This is an important observation, as factor crowding has been experimentally observed and associated with regulated gene expression in synthetic and mammalian cell systems [59][60][61][62][63].Our analysis supports the view that this phenomenon does not happen between PolIIs bound to the gene but likely occurs in suspension in the nucleoplasm, as described, e.g., in [61,[63][64][65][66][67].
The inference here is complicated by the high dimensionality of the parameter space (which grows as the genes' length increases).We addressed this by assuming a Gaussian process latent prior for the jump-rate profile and using elliptic slice sampling as an appropriate MCMC algorithm.The sampling requires multiple evaluations of the likelihood of equation ( 13).This in turns requires numerically integrating equation (11), which is also slower in longer genes (require larger integration grids, see Appendix B).
This study of molecular dynamics is also subject to limitations.While ChIP-seq is a widely-used assay to quantify the abundances of DNA-bound PolII, studies suggest that it has limited resolution (between 150-300 Nts) and might be subject to technical issues [23].
Most importantly ChIP-seq profiles are obtained from the aggregation of sequencing reads from many cells, which hides variation within the cell population.The transcription of mRNA is a very complex process and it may be interesting to include features not encoded in the model used for this study.Other TASEP variants, such as those incorporating non-Markovian jump dynamics [68,69] or Langmuir kinetics [70], are relevant for the modelling of PolII recycling and its early detachment from DNA [63,71].An assumption of the TASEP is that particles stay in a site for an exponentially distributed waiting time.Variants of TASEP in which defects appear and disappear randomly on any site (and thus slow down the movement of particles or even block it completely) have been introduced in physics literature and can account for occasionally long pausing times [40,41] (see also [68,69]), with defect dynamics representing the effects of pausing and elongation factors.Modelling advancements that combine the site-specific pausing with long pausing times and extended particle size, supplemented by an appropriate inference scheme such as the one presented here, would be an important additional potential area for research and application.Potential extensions of our work also include estimation of the parameters that encode the system's size and asymmetry (a and b, respectively) and the boundary values.Statistical mechanics literature is rich in quantitative studies of TASEPs with particles that occupy more than one lattice site [72][73][74], some generalized to include site-dependent elongation rates or localised defects [37,[75][76][77].These studies have been used to describe protein translation and could be useful to predict PolII-size effects in gene expression, although, with genes much longer than PolIIs, ChIP-seq limited resolution, and very low PolII coverage density, the observable correction would arguably be minor.In fact, including more features plausibly requires sequencing assays of higher resolution than ChIP-seq and comes at the cost of increased computational burden and decreased tractability.Conversely, the chosen TASEP with smoothly-varying jump rates is simple and yet is able to reveal PolII elongation slowing down and speeding up at certain genomic locations.Due to its generality, our approach also serves as a template for future studies seeking to shed light on complex transport phenomena.

MATERIALS AND METHODS
Spt5 and PolII ChIP-seq data mapped to the hg19 University of California at Santa Cruz human genome were downloaded from Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo),accession number GSE117006.We filtered the list of genes from the reference genome to only contain those with unique gene symbols on chromosomes 1-22 and X, thus excluding alternatively spliced genes.Hg19 gene coordinates were flanked 500 Nts upstream the TSS in order to include poised PolII.The 20 non-overlapping genes with the highest coverage of Spt5 ChIP-seq reads were selected.All simulation codes are written in c++ and Python (v3.7.1), with the PDE solver using Numba JIT compiler (v0.41.0) [78] (https://github.com/mcavallaro/dTASEP-fit).
for the left and the right boundaries, respectively.
The recruitment of new PolII at the 5' end of the gene is arrested upon chemical perturbation of the promoters with Trp.This is modelled by assuming that a PolII molecule at the leftmost site can leave its position, whilst no molecules can be injected.Therefore, in mean-field approximation, the average number of particles at the left boundary of the  For a given value of (f , l, σ f ), ν is obtained as ν = L(σ f , l) −1 f .Hence with parameters f , m, κ, and σ held fixed, we update (σ f , l) with a sample from the unnormalised conditional P (y|f (ν, σ f , l), κ, σ )P (σ f )P (l).
In the third step, given the values for f , l, and σ f , the likelihood hyperparameters m, σ , and κ are updated by sampling from the unnormalised conditional P (y|f , m, σ )P (m)P (σ )P (κ).
We refer the reader to [52] for more details.Posterior predictions are in very good agreement with time-course observations.Concentration of posterior-predictive samples around target data points can be appreciated by zooming in (Fig. S3).The model is weakly sensitive to changes in the rate profile except for the bottleneck minimum at x ≈ 60, which is sampled with high confidence.

FIG. 1 .
FIG. 1. Biological processes and data.A) Simplified diagram of mRNA synthesis.PolII molecules bind to the DNA upstream of transcription start site (TSS) and moves downstream towards the transcription end site (TES), where it is released along with the synthesized mRNA.In certain genomic regions (indicated by a dangerous bend sign), PolIIs slow down.B) ChIP-seq experiments yield the relative abundance of PolII at each genomic position, here illustrated for H3-3B (top) and KR19 (bottom) genes; insets show the Spt5-bound PolII abundances for the same genes.C)

FIG. 3 .
FIG. 3. Inferred rate profiles from Spt5 ChIP-seq (A-B) and PolII ChIP-seq (C-D) for genes H3-B3 (A-C) and KRT19 (B-D) (black lines are posterior medians, shaded area are 99% credible intervals), with the latter gene showing a distinctive jagged profiles.Both genes show the lowest rates in proximity of the transcription starting site (TSS).Red lines are unperturbed ChIP-seq signals in arbitrary units (a.u.).

FIG. 4 .
FIG. 4. Metagene rates from PolII ChIP-seq (A) and Spt5 ChIP-seq (B) data.By construction, the metagene analysis conserves the length scale only in proximity of TSS and TES.Upstream of TSS the Spt5 ChIP-seq yields lower average rate than PolII ChIP-seq, as this assay does not detect PolII poised to move downstream.

system obeys dφ 0
(t) dt = −p 0 φ 0 (t)(1 − φ 1 (t)), (B12)whose hydrodynamic limit is approximated by∂ ∂t (0, t) ≈ − a p(0) (0, t)(1 − (0, t) − a ∂ ∂x (x, t)| + x=0 ).(B13) (B14)with discretised iterative solutionρ j+1,0 − ρ j,0 = −∆t a P 0 ρ j,0 1 − ρ j,0 − a ρ j,1 − ρ j,0 ∆x , ρ 0,0 = L (0),ρ 0,1 = (∆x, 0).(B15) Using equations (B15) to iteratively update L j in (B10) yields open boundary conditions, which are used in the simulation experiments.The dynamics of the hydrodynamic limit approximate well the occupation density a TASEP with open boundaries on a large yet finite lattice (N = 600) as illustrated in Fig. S1.Sequencing data shows that, at the beginning of the flanked region upstream of the TSS, expression is very low and thus signal will remain constantly close to zero during the time course.The reads downstream the TES are also assumed to be constant within measurement errors, given that the Trp perturbation at the TSS does not propagate up until the TES in the longest time course.Based on this, we set fixed left and right boundary conditions.The grid constants ∆t and ∆x are chosen such that ∆t/∆x 2 = 1/p max , which guarantees numerical stability at each grid point provided that p(x) < pmax ∀x ∈ [0, L].The initial values over this grid at j = 0 are obtained from the binned read profile y * by means of linear interpolation.

FIG. S3 .
FIG. S3.The rate profile samples sometimes have jagged profiles (left).The resulting density profiles are smoother than the corresponding rates and their dispersion around the input data profile can be appreciated by zooming in the plot of Fig. 2-D (right).