Bayesian inference and comparison of stochastic transcription elongation models

Transcription elongation can be modelled as a three step process, involving polymerase translocation, NTP binding, and nucleotide incorporation into the nascent mRNA. This cycle of events can be simulated at the single-molecule level as a continuous-time Markov process using parameters derived from single-molecule experiments. Previously developed models differ in the way they are parameterised, and in their incorporation of partial equilibrium approximations. We have formulated a hierarchical network comprised of 12 sequence-dependent transcription elongation models. The simplest model has two parameters and assumes that both translocation and NTP binding can be modelled as equilibrium processes. The most complex model has six parameters makes no partial equilibrium assumptions. We systematically compared the ability of these models to explain published force-velocity data, using approximate Bayesian computation. This analysis was performed using data for the RNA polymerase complexes of E. coli, S. cerevisiae and Bacteriophage T7. Our analysis indicates that the polymerases differ significantly in their translocation rates, with the rates in T7 pol being fast compared to E. coli RNAP and S. cerevisiae pol II. Different models are applicable in different cases. We also show that all three RNA polymerases have an energetic preference for the posttranslocated state over the pretranslocated state. A Bayesian inference and model selection framework, like the one presented in this publication, should be routinely applicable to the interrogation of single-molecule datasets.

Introduction Transcription is carried out by RNA polymerases: RNAP in Escherichia coli, pol II in Saccharomyces cerevisiae, and T7 pol in Bacteriophage T7. It involves the copying of template doublestranded DNA (dsDNA) into single-stranded messenger RNA (mRNA). RNAP and pol II are comprised of multiple subunits, and their catalytic subunits are homologous [1,2]. In contrast, T7 pol exists as a monomer with a distinct sequence, and resembles the E. coli DNA polymerase I [3].
Optical trapping experiments have been performed on the transcription elongation complex (TEC) from a variety of organisms [4][5][6][7][8][9][10]. In a typical experimental setup, two polystyrene beads (around 600 nm in diameter) are tethered to the system; one attached to the RNA polymerase and the other to the DNA [4]. As transcription elongation progresses, the distance between the two beads increases and the velocity of a single TEC can be computed. Optical tweezers can be used to apply a force F to the system (Fig 1).
Single-molecule studies of the TEC have revealed that RNA polymerases progress in a discontinuous fashion [4,[11][12][13][14] with step sizes that correspond to the dimensions of a single nucleotide (3.4 Å [15]). Consequently, at the single molecule level, transcription is best modelled as a discrete process rather than a continuous one.
A single cycle in the main transcription elongation pathway (Fig 2) requires (1) Forward translocation of the RNA polymerase, making the active site accessible; (2) Binding of the complementary nucleoside triphosphate (NTP); (3) Addition of the nucleotide onto the 3 0 end of the mRNA. This third step involves NTP hydrolysis. Nucleoside monophosphate is added onto the chain and pyrophosphate is released from the enzyme.
Our study aimed to identify the best model to describe this reaction cycle for RNAP, pol II, and T7 pol, based on analysis of published force-velocity data. As there are three reactions, up to six rate constants may be necessary for a kinetic model of a single nucleotide addition. These describe forward and backwards translocation (k fwd and k bck ), binding and release of NTP (k bind and k rel ), and NTP catalysis and reverse-catalysis (k cat and k rev ), also known as pyrophosphorolysis [18]. However fewer than six parameters may be required in practice.
First, it is reasonable to assume that polymerisation is effectively irreversible [17,[19][20][21], as pyrophosphorolysis is a highly exergonic reaction, reducing the number of rate constants to five. Second, translocation between the pretranslocated and posttranslocated states, and/or NTP binding, may occur on timescales significantly more rapid than the other steps, in which case they may be modelled as equilibrium processes. These assumptions simplify the model, as the respective forward and reverse reaction rate constants are subsumed by a single equilibrium constant. Third, thermodynamic models of nucleic acid structure can be used to estimate sequence-dependent translocation rates k fwd (l) and k bck (l), by invoking transition state theory, and this can sometimes result in parameter reduction [16,17,21].
Irrespective of equilibrium assumptions and parameterisation, transcription elongation under applied force can be modelled in two fundamentally distinct ways. First, there are the deterministic equations which can be used to calculate the mean pause-free elongation velocity v(F, [NTP]) as a function of force F and NTP concentration [NTP]. This kind of model can be derived from the differential equations describing the time evolution of all species, by application of the steady state approximation. Force effects on the translocation step are incorporated using transition state theory [22,23]. An example is the following 3-parameter model [4].
where δ is the distance between adjacent basepairs (3.4 Å, [15]), K D ¼ k rel k bind is the equilibrium constant of NTP binding, K t ¼ k bck k fwd is the equilibrium constant of translocation, k B is the Boltzmann constant, and T is the absolute temperature. Increasingly complex equations may be used as more parameters or states are added to the model [4,6,17]. Such equations describe the velocity averaged across an ensemble of molecules. Parameter inference applied to velocity-force-[NTP] experimental data is straightforward and computationally fast when using these equations. However these equations do not describe the distribution of velocity nor do they account for site heterogeneity across the nucleic acid sequence and therefore cannot predict local sequence effects.
Second, there are the stochastic models, which can be implemented via simulation of single-molecule behaviour using the Gillespie algorithm [24]. The mean velocity can be calculated by averaging velocities over a number of simulations for a given F and [NTP]. This offers not just the mean but a full distribution of velocities and could potentially explain emergent properties unavailable from a deterministic model. Unfortunately, simulating can be very slow and therefore parameter inference can be a problem. The model of the main transcription elongation pathway, which shows the postulated states; the pathways for interconversion; and the rate constants that govern each part of the reaction. The transcription bubble is the set of β 1 + h + β 2 bases (see main text for definitions) in the double-stranded DNA which are unpaired. States are denoted by S(l, t) where l is the length of the mRNA and t is the position of the polymerase active site (small grey rectangle) with respect to the 3 0 end of the mRNA. Polymerase translocation displaces the polymerase by a distance of δ = 1 bp = 3.4 Å. During polymerisation the chain is extended by one nucleotide. (B) Instantiated posttranslocated state of RNA polymerase transcribing the rpoB gene sequence, with β 1 = 2, h = 9, β 2 = 1. Forward translocation requires melting two T/A basepairs (right arrows). Backward translocation requires melting two C/G basepairs (left arrows). The mRNA secondary structure would also require reconfiguration [16,17]. https://doi.org/10.1371/journal.pcbi.1006717.g002 In this study we used a Markov-chain-Monte-Carlo approximate-Bayesian-computation (MCMC-ABC) algorithm [25] to estimate transcription elongation parameters for stochastic models via simulation. The observed pause-free velocities we are fitting to were measured at varying applied force and NTP concentration. For each RNA polymerase under study-E. coli RNAP, S. cerevisiae pol II, and T7 pol-we fit to one respective dataset from the single-molecule literature [4,26,27].

Notation and state space
Suppose the TEC is transcribing a gene of length L. Then let S(l, t) denote a TEC state, where the mRNA is currently of length l � L, and t 2 Z describes the position of the active site with respect to the 3 0 end of the mRNA. When t = 0 the polymerase is pretranslocated and cannot bind NTP, and when t = 1 the polymerase is posttranslocated and can bind NTP (Fig 2). This study is focused on the main elongation pathway and the observed velocities being fitted have pauses filtered out. Therefore, although additional backtracked states (t < 0) [4,28,29] and hypertranslocated states (t > 1) [30,31] exist, these are not incorporated in the model.
Let β 1 and β 2 be the number of unpaired template nucleotides upstream and downstream of RNA polymerase, respectively, and let h be the number of basepairs in the DNA/mRNA hybrid (Fig 2A). Although there are uncertainties in these parameters, they are held constant at h = 9, β 1 = 2, and β 2 = 1 [17,32].

Parameterisation of the NTP binding step
NTP binding has been modelled as both a kinetic and equilibrium process in the literature [4,17,21].
In a kinetic binding model, NTP binding occurs at pseudo-first order rate k bind [NTP], while NTP release occurs at rate k rel . In this case, k bind and k rel k bind must be estimated. Under a partial equilibrium approximation NTP binding and release are assumed to be rapid enough that equilibrium is achieved. In this case, the rate constants k bind and k rel are subsumed by the NTP dissociation constant K D ¼ k rel k bind which becomes the sole binding-related parameter to estimate.

Parameterisation of the translocation step
While inferences about the rate constants associated with NTP binding and catalysis (k bind , k rel k bind , and k cat ) can be made directly from the data, the translocation step is more complex. Transition state theory is invoked in order to estimate k fwd and k bck . Recasting the problem in this way (1) provides a way of accommodating the effects of applied force on the elongation process, and (2) allows the sequence-dependence of translocation to be incorporated by considering the energetics of basepairing. When allowing for sequence dependence, the total number of translocation rates required to model translocation of the full gene is 2(L − l 0 ). Thermodynamic models of base pairing energies. The standard Gibbs free energies Δ r G 0 (= ΔG) involved in duplex formation are calculated using nearest neighbour models. The standard Gibbs energy of state S-arising from nucleotide basepairing and dangling ends-is calculated as where SantaLucia's DNA/DNA basepairing parameters [33] are used to calculate DG ðbpÞ gene and Sugimoto's DNA/RNA parameters [34] are used for DG ðbpÞ hybrid . For the latter, dangling end energies are estimated as described by Bai et al. 2004 [21]. Here, and elsewhere, the (bp) superscript is used to denote a model parameter that can be evaluated from the sequence alone. Gibbs energies are expressed on a per molecule basis, relative to the thermal energy of the system, in multiples of k B T, where k B T = 4.28001 × 10 −21 J at T = 310 K.
In order for RNA polymerase to translocate forward (backward), up to two basepairs must be disrupted: (1) the basepair at the downstream (upstream) edge of the transcription bubble, and (2) the basepair at the upstream (downstream) end of the DNA/mRNA hybrid (Fig 2B). Differences in the basepairing energies in these regions confer sequence-dependence on the rate of translocation.
Calculation of translocation rates or translocation equilibrium constant. The standard Gibbs energies of the pre and posttranslocated states, DG ðbpÞ Sðl;0Þ and DG ðbpÞ Sðl;1Þ , respectively, are used with up to four additional terms-ΔG τ1 , δ 1 , DG z t , and DG ðbpÞ Tðl;tÞ -to calculate the translocation rates. The first three are model parameters which must be estimated while the latter is directly evaluated from the sequence.
Let T(l, t) be the translocation transition state between S(l, t) and S(l, t + 1). Then DG z Tðl;tÞ ¼ DG z t þ DG ðbpÞ Tðl;tÞ is the sequence-dependent standard Gibbs energy of activation which must be overcome in order to translocate (Fig 3).
Given an applied force F, the translocation rates governing transition between the pre and posttranslocated states (k fwd (l) and k bck (l)) are calculated from barrier height DG z Tðl;0Þ using an Arrhenius type relation: In a translocation equilibrium model, the barrier height is assumed to be so small, = translocation is so rapid, that the transition states are disregarded. (B) A model for the sequence-dependent transition state between translocation states S(l, 0) and S(l, 1). This is required for estimating the Gibbs energy of basepairing DG ðbpÞ Tðl;tÞ in the transition state. The basepairing energy, added to a baseline term DG z t , together specify the height of the activation barrier (Eq 10 The derived rates k fwd (l) and k bck (l) are therefore dependent on the local sequence. The preexponential factor A is held constant at 10 6 s −1 . This term has been arbitrarily set to a variety of values in previous studies (10 6 −10 9 s −1 [16,17,21]). This has little consequence for model fitting, however the value of DG z Tðl;tÞ is entangled with the value of the pre-exponential factor A and can only be meaningfully interpreted in light of its value.
If the system has time to reach equilibrium, the probabilities of observing the pretranslocated state S(l, 0) and posttranslocated state S(l, 1) are The physical meanings of the terms ΔG τ1 , δ 1 , DG z t , and DG ðbpÞ Tðl;tÞ , and the way they are used in the model, are detailed below.
Energetic bias for the posttranslocated states. ΔG τ1 (units k B T) is a parameter added to the standard Gibbs energy of the posttranslocated state. If ΔG τ1 = 0, then the sequence alone determines the Gibbs energy difference between pre and posttranslocated states. In this case, pretranslocated states are usually favoured over posttranslocated states due to the loss of a single basepair in the hybrid of the latter.
ΔG τ1 has frequently been estimated for T7 pol [35][36][37] and there has been discussion around whether such a term is necessary for RNAP [6].
Polymerase displacement and formation of the transition state. δ 1 (units Å) is the distance that the polymerase must translocate forward to facilitate the formation of the transition state. The distance between adjacent basepairs is held constant at an experimentally measured value δ = 3.4 Å [15], and 0 < δ 1 < δ. The response of the system to an applied force F depends on this term. In general, the application of force F tilts the Gibbs energy landscape-the Gibbs energy difference between adjacent translocation states being augmented by a factor Fd k B T ( Fig  3A, [38,39]).
It may be necessary to estimate δ 1 to model the data adequately [17], or it may be sufficient to simply set δ 1 = δ/2 [38].
Energy barrier of translocation. DG z t and DG ðbpÞ Tðl;tÞ (units k B T) together determine the activation barrier height in the translocation step. It is assumed that the sequence-dependent standard Gibbs energy of activation DG z Tðl;tÞ can be written as Tðl;tÞ ð10Þ DG z t is therefore a sequence-independent baseline term used to compute the translocation barrier heights. The parameter DG z t must be estimated in order to evaluate translocation rates.

In contrast DG ðbpÞ
Tðl;tÞ is a term that is evaluated directly from the sequence derived from a model of the transition state ( Fig 3B). The term is evaluated as the standard Gibbs energy of a TEC containing all hybrid and gene basepairs found in both S(l, t) and S(l, t + 1), ie. the intersection between the two sets of basepairs.

Model space
The full transcription elongation model makes use of the following 6 parameters: • k cat (units s −1 ).
• DG z t (units k B T). However fewer than 6 parameters may be needed to adequately describe the data. If it is assumed that the energy differences between pre and posttranslocated states are determined by basepairing energies alone, the parameter ΔG τ1 does not need to be estimated. This is equivalent to holding ΔG τ1 constant at 0. If it is assumed that the displacement required for formation of the translocation transition state is half the distance between adjacent basepairs, the parameter δ 1 does not need to be estimated. This is equivalent to holding δ 1 constant at δ/2.
Partial equilibrium approximations may also simplify the model, as detailed above. If binding is approximated as an equilibrium process, k bind does not need to be estimated. If translocation is approximated as an equilibrium process, DG z t and δ 1 do not need to be estimated. One, both, or neither of these two steps (binding and translocation) could be assumed to achieve equilibrium, thus yielding four equilibrium model variants (Fig 4A). The introduction of partial equilibrium approximations for both the NTP binding and translocation steps has implications when specifying the prior distributions for the Bayesian analysis (S4 Appendix) The chemical master equations for single nucleotide addition cycles of these models are presented in S2 Appendix.
Incorporating these simplifications to the model in a combinatorial fashion results in a total of 12 related models, which together constitute the model space. Our objective was to determine which of these 12 models provides the best description of the experimental data. The simplest model (Model 1) contains 2 parameters (k cat and K D ). The most complex model (Model 12) contains all 6 parameters. The full model space is displayed in Fig 4B.

Stochastic modelling
For each model we performed stochastic simulations, appropriate for the modelling of singlemolecule force-velocity data. The simulations, performed using the Gillespie algorithm [24,40], can be used to estimate the mean elongation velocity under a model.
The estimation of mean velocity can be broken down into three steps. First, the system is initialised by placing the RNA polymerase at the 3 0 end of the template-state S(l 0 , 0)-with the transcription bubble open and a DNA/RNA hybrid formed. The force and NTP concentrations are assigned their experimentally set values. Second, a chemical reaction is randomly sampled. The probability that reaction S ! k S 0 is selected is proportional to its rate constant k (Fig 2). The amount of time taken for the reaction to occur is sampled from the exponential distribution. States which are subject to a partial equilibrium approximation are coalesced into a single state, which augments the outbound rate constants. The second step is repeated until the RNA polymerase has copied the entire template. Third, the previous two steps are repeated c times. The mean elongation velocity is evaluated as the mean of each mean elongation velocity across c simulations. For further information, see S1 Appendix.

Relation to previous models and stochastic simulations
There is an extensive literature concerned with the kinetic modelling of transcription elongation. Such models may incorporate backtracking, hypertranslocation, and other reactions.
Here we are concerned only with the central elongation pathway. A stochastic and sequence-dependent model was proposed by Bai et al. 2004 [21] for RNAP, with both NTP binding and translocation treated as equilibrium processes. The translocation equilibrium constant was calculated entirely from basepairing energies. Therefore this model is equivalent to Model 1, and the parameters were estimated as k cat = 24.7 s −1 and K D = 15.6 μM from fit to experimental data. Maoiléidigh et al. 2011 also presented stochastic simulations of RNAP. The elongation component of their model is equivalent to Model 6 [17]. We build on this work by providing a systematic Bayesian framework for model comparison and parameter estimation.
While our analysis employed sequence-dependent stochastic models, comparisons can also be made with some deterministic models.  [26], and Thomen et al. 2008. [27,37] described a deterministic model (for RNAP, pol II, pol II, and T7 pol respectively) which estimated k cat , K D and translocation equilibrium constant K t ¼ k bck k fwd . These are most similar to Model 4.
Maoiléidigh et al. 2011 for RNAP, and Dangkulwanich et al. 2013 for pol II, however found that the translocation and catalysis were occurring on similar timescales, and modelled only NTP binding as an equilibrium process [17,42]. They also estimated the distance of translocation. These deterministic models are most similar to Model 11.
Finally, Mejia et al. 2015 [43] used a model that is quite different to all the above models, as it does not explicitly treat translocation. Instead elongation is modelled with a two step kinetic scheme, the first step involving NTP binding and conformational change, and the second step involving nucleotide incorporation and product release. This model is most similar to a special case of Model 5 where ΔG τ1 becomes extremely negative, driving the polymerase into the posttranslocated position.

Model selection with MCMC-ABC
Our aim was to 1) use Bayesian inference to select the best of 12 transcription elongation models for each RNA polymerase; and 2) estimate the parameters for those models appearing in the 95% credible set of the posterior distribution. Selecting prior distributions behind each parameter is a critical process in Bayesian inference. A prior distribution should reflect what is known about the parameter before observing the new data. We have explicated our prior assumptions, with justifications, in Table 1.
We performed MCMC-ABC experiments which estimated the parameters and model indicator M i for i 2 Z; 1 � i � 12. Models which appear more often in this posterior distribution are better choices, given the data. The model indicator is a discrete variable which can take 12 values, and is treated identically to the 6 continuous parameters in the Bayesian framework.
The datasets we fit our models to are all from the single-molecule literature and are pre-  [26] for S. cerevisiae pol II, and Table 2 of Thomen et al. 2008 [27] for T7 pol. To computationally replicate these experiments as faithfully as we could with the available information and computational limitations, simulations in this study were run on the 4 kb E. coli rpoB gene for RNAP (GenBank: EU274658), the first 4.75 kb of the human rpb1 gene for pol II (NCBI: NG_027747) the first 10 kb of the Enterobacteria phage λ genome for T7 pol (NCBI: NC_001416). The mean velocities from 32 (for RNAP), 10 (for pol II) and 3 (for T7 pol) simulations of the full respective sequences were used to estimate the mean elongation velocity during MCMC-ABC, given F and [NTP].
For further information about the MCMC-ABC algorithm [25,44], or the model indicator M i , see S3 Appendix.

The posterior distributions
The posterior distributions from our MCMC-ABC experiments are presented in Table 2, Figs 5 and 6.
A large effective sample size (> 100 [53]) and a smallR (< 1.1, as defined by Gelman et al. 1992 [54][55][56]) are essential for making reliable parameter estimates. Table 2 suggests that the parameters in the 95% credible set of models are sufficiently estimated by these criteria.
These results indicate that the best models for the datasets examined are Models 11 and 12 for both RNAP and pol II, and Model 5 for T7 pol (Fig 4B).
For pol II, Model 12 has the highest posterior probability P(M 12 |D) = 0.71. This is the most complex model considered, with 6 estimated parameters. In Model 12 translocation, NTP binding and catalysis are all kinetic processes; the displacement required to facilitate formation of the translocation transition state, δ 1 < δ, is estimated (d 1 ¼ 3:1 A � ); and the standard Gibbs energy of the posttranslocated state is influenced by parameter ΔG τ1 6 ¼ 0. The posterior distribution for RNAP consists of the same set models as that of pol II. For RNAP, Model 11 has the highest probability P(M 11 |D) = 0.81. This model is a submodel of Model 12 with one fewer parameter: in Model 11 NTP binding is treated as an equilibrium process while in Model 12 it is not.
The only model in the 95% credible set for T7 pol is Model 5 P(M 5 |D) = 0.96. In Model 5 (4 parameters) translocation, but not binding, is treated as an equilibrium process, and ΔG τ1 is estimated. This positions T7 pol in a quite different area of the model space to the other two polymerases.

Translocation rates differ among RNA polymerases
For RNAP and pol II, we estimate that a partial equilibrium approximation for the translocation step is inadequate. The posterior probability that such models are inadequate is 1.00 (see Table 2). For T7 pol, however, translocation is significantly faster than catalysis and is best modelled with a partial equilibrium approximation. Using estimates for DG z t and ΔG τ1 under the maximum posterior models (Model 11 for RNAP and Model 12 for pol II) we estimate the  [6-8, 21, 43], but as much as 100 bp/s in vivo [45][46][47][48]. Distribution selected such that (10, 100) is central 95% interval. For T7 pol k cat and elongation velocity estimates range from 43-240 bp/s [9,[49][50][51]. Distribution selected such that (40,240) is central 95% interval.
For RNAP and pol II, centered around 0 with a standard deviation comparable to the free energy of a single nucleotide basepair doublet, and such that the 95% central interval is (-4, 4). For T7 pol ΔG τ1 has been estimated as -4.3 [37] and -4.87 k B T [35]. However these estimates are likely resulting partially from dangling ends. Thus, we subtracted the mean dangling end contribution of * -1 k B T [33] and centered the prior around this interval with a standard deviation the same as above.
DG z t (k B T) Normal(μ = 5.5, σ = 0.97) for RNAP/pol II Normal(μ = 2.5, σ = 1.36) for T7 pol Central 95% interval set so that translocation is a slow kinetic step (S4 Appendix). Selected so that 99% central interval is (3,8) for RNAP and pol II, and (-1, 6) for T7 pol.  mean forward � k fwd and backward � k bck translocation rates averaged across the rpoB sequence as: 230 s −1 and 112 s −1 for RNAP, and 350 s −1 and 12.7 s −1 for pol II, respectively (3 sf). These estimates are within one order of magnitude of the respective estimate for the rate of catalysis ( Fig  5) suggesting that translocation and catalysis indeed occur on similar timescales.
For RNAP and pol II, translocation has frequently been modelled as an equilibrium process [4,21,26,41,43], however in some recent analyses this assumption has been rejected [16,17,42,57,58]. Our Bayesian analysis supports this. In contrast, there is general agreement that translocation in T7 pol is adequately modelled as an equilibrium process [27,59,60].

The data does not determine the kinetics of the NTP binding step
It remains unclear how to best model the NTP binding step. Models that describe NTP binding as a kinetic process have posterior probabilities of 0.19 for RNAP, 0.71 for pol II and 0.96 for T7 pol (Table 2). However, in an earlier experiment, where we used different a prior distribution for k rel k bind , the latter probability was 0.21 and P(M 4 |D) was 0.79. The intermediate magnitude of these posterior probabilities, and sensitivity to the choice of prior, imply that the data contains very little information about which binding model is preferred.
Furthermore, k rel k bind and k bind (Models 5 and 12) are unable to be estimated simultaneously. For pol II and for T7 pol, k bind is estimated at around 0.48 and 1.4 μM −1 s −1 respectively with fairly narrow 95% highest posterior density (HPD) intervals (Fig 5). However, the HPD interval of k rel k bind spans three orders of magnitude and the value of this parameter was therefore poorly informed by the data. For RNAP, in contrast, neither k bind nor k rel k bind were well-informed by the  data and both have HPD intervals spanning 1-2 orders of magnitude. This non-identifiabilitywhere two or more parameters are unable to be estimated simultaneously (S4 Appendix)highlights the appeal of an NTP binding equilibrium model where only one parameter k rel k bind needs to be estimated, despite the unrealistic assumptions it may invoke. In the case of each enzyme, the data has taught us nothing about one or two of the binding parameters. The pause-free mean velocities measured during transcription elongation follow Michaelis-Menten kinetics even though the reaction cycle is more complicated than that of a simple enzyme [61]. As such, the inability to resolve the timescale of the substrate binding step is unsurprising [62][63][64]. In the transcription literature, NTP binding is almost always assumed to achieve equilibrium for RNAP, pol II, and T7 pol [4,16,17,21,26,27,37,41,42,60]. However Mejia et al. 2015 [43] have shown that NTP binding is indeed rate-limiting, and that mutations in the RNAP trigger loop impair the binding rate thus suggesting that the trigger loop is coupled with NTP binding.

RNAP has an energetic preference for the posttranslocated state
In previous stochastic sequence-dependent models [16,21] the standard Gibbs energies of the pre and posttranslocated states have been based solely on the nucleic acid basepairing energies. Our models include an additional term, ΔG τ1 , to account for potential interactions between the protein and the nucleic acid. The marginal posterior probability of a model in which an additional term ΔG τ1 is required is 1.00 in all three polymerases. In each case ΔG τ1 was estimated to be less than 0 k B T and 0 k B T is not included in the 95% HPD interval (Fig 5). We find that DĜ t1 is the most significant in pol II and T7 pol: −4.6 k B T and −4.0 k B T respectively, while DĜ t1 ¼ À 2:0 k B T for RNAP (2 sf).
These results suggest that structural elements within RNA polymerases can energetically favour posttranslocated states over pretranslocated states. We note that the sequence-dependent contribution of the dangling end of the DNA/RNA hybrid is included in the thermodynamic model. The energetic bias for the posttranslocated state is separable from this effect.
To facilitate comparison with previous deterministic models, using our estimates of ΔG τ1 we calculated the equilibrium constant between the pre and posttranslocated states. Geometrically averaged across the rpoB gene, these are Thus, for all three polymerases, K τ < 1, indicating that the small energetic preference that the protein has for the posttranslocated state is sufficient to override the loss of basepairing energy, thereby biasing the system towards population of the posttranslocated positions. This is in agreement with estimates made for pol II and T7 pol [26,27,35,36,41] and Kireeva et al. 2018 [58] for RNAP: "forward translocation occurs in milliseconds and is poorly reversible". However these estimates are inconsistent with some RNAP and pol II studies which place this ratio above 1 [4,17,42,52].
Kinetic modelling can itself suggest no physical mechanism for the stabilisation. Yu et al. 2012 [36] have identified a conserved tyrosine residue near the active site of T7 pol that pushes against the 3 0 end of the mRNA, and thus stabilises the posttranslocated state. They propose a similar mechanism for the multi-subunit RNA polymerases.

δ 1 may be an important parameter but its physical meaning is unclear
Our results suggest that δ 1 , the distance that RNA polymerase must translocate forward by to reach the translocation transition state, is a necessary parameter to estimate for RNAP and pol II. Setting δ 1 = δ/2 is not sufficient. The marginal posterior probability of models which estimate this term is 1.00. δ 1 is irrelevant to the modeling of the T7 pol data because the best models invoke a partial equilibrium approximation for the translocation step.
While our prior distribution restricted δ 1 to lie in the range (0, δ), the upper end our 95% HPD intervals of δ 1 for RNAP and pol II are very close to δ = 3.4 Å. If it was not for this prior distribution, δ 1 estimates would have included values higher than δ. Similar results have been observed by Maoiléidigh et al. 2011 [17] for RNAP.
Our interpretation of δ 1 implies it should never be greater than δ nor should δ be more than the width of one basepair. The physical meaning of δ 1 with values greater than δ is thus unclear. It is noted that δ 1 is only used when F 6 ¼ 0.

Comparing the kinetics of RNA polymerases
The in vivo rate of transcription elongation varies considerably across RNAP, pol II and T7 pol. The prokaryotic and eukaryotic RNA polymerases have a mean rate ranging from 20-120 bp/s [45,46,48,49,[65][66][67], which may be slowed down in histone-wrapped regions of eukaryotic genomes [7]. In contrast, Bacteriophage T7 pol operates up to an order of magnitude faster (around 200-240 bp/s [49,68]) and is known to be quite insensitive to transcriptional pause sites [9,27].
In additional to these differences, we have shown that translocation is very rapid in T7 pol, relative to the rate of NTP incorporation, while the disparity is much less significant in RNAP and pol II. Furthermore, the model does not fit the data for T7 pol as closely it does for RNAP and pol II (Fig 6). T7 pol therefore seems to operate under quite a different kinetic scheme than that of the cellular polymerases, which is not unexpected given their distant evolutionary relationship [3].
The velocity perturbations resulting from the optical trapping apparatus will be propagated into the model parameters, especially k cat , and DG z t , and some caution is needed when extrapolating these results to untethered systems.

Bayesian inference of transcription elongation
To our knowledge we are the first to perform Bayesian inference on single-molecule models of transcription elongation. This was achieved by simulation which necessitated the use of approximate Bayesian computation. An alternative would be to build and use a likelihood function (ie. the probability of taking exactly t units of time for RNA polymerase to copy the sequence n times). The latter approach can be achieved using chemical master equations, as opposed to (Gillespie) sampling from the distribution. Finding analytical, stable numerical, or approximate solutions to the chemical master equations could provide a similar insight in less computational time, however is susceptible to a multitude of analytical and numerical issues associated with the exponentiation of an arbitrary transition rate matrix that grows with the length of the sequence (S2 Appendix) [74]. This problem would be amplified by the introduction of backtracking, hypertranslocation, or NTP misincorporation reactions into the model, for instance. The Bayesian framework we have presented, although computationally intensive due to its simulation requirement, is general and will work on any model of transcription without the need to resolve these issues. The path has been paved for modelling transcriptional pausing, for instance [16,21,75]. Nevertheless, likelihood-based Bayesian inference is an approach that should be explored in the future.
We have demonstrated that single-molecule data can be usefully analysed using a Bayesian inference and model selection framework. This analysis would have even greater statistical power if applied to the progression of individual RNA polymerase complexes instead of mean velocities averaged across multiple experiments.

Conclusion
In this article we evaluated some simple Brownian ratchet models of transcription elongation (Fig 2). By varying the parameterisation of the translocation step (Fig 3) and incorporating partial equilibrium approximations commonly invoked in the literature (Fig 4A) we enumerated a total of 12 related models (Fig 4B). Using stochastic simulations and approximate Bayesian computation, we then assessed which of these models were capable of describing the force-velocity data previously measured for several RNA polymerases (Table 2 and Fig 5) using single-molecule optical trapping experiments [4,26,27].
Our analysis suggests that 1) different partial equilibrium approximations of the translocation step are appropriate for the multisubunit RNA polymerases versus the single subunit T7 RNA polymerase. 2) Treatment of the NTP binding step remains a point of ambiguity. The existing data does not place strong constraints on the modelling of this step. 3) There is an energetic bias for posttranslocated state. 4) The model of the force-dependent translocation, which invokes transition state theory, is not physically realistic.
Supporting information S1 Appendix. Stochastic simulation. The Gillespie algorithm is described.