Folding kinetics of proteins with multiple domains: inference of transition rates from extinction times

In this paper we are concerned with the folding kinetics of large proteins that may have multiple binding/unbinding domains. The theory underlying such systems have received much less attention compared to small, ‘two-state’ (folded/unfolded) proteins. To analyze this problem, we model the reaction coordinate using a birth-death chain, a well-known stochastic process. Specifically, we consider a protein with N domains where every domain that unfolds (folds) increases (decreases) the reaction coordinate by an integer so that the state-space of the reaction coordinate is the set of integers {0, 1, …, N}. As input data, our method uses (i) the extinction time of trajectories, starting from when the protein leaves the 0 state for the first time and finishing when the protein re-enters state 0 for the first time and (ii) the maximum number of unfolded domains in the said trajectory. Since the maximum value n reached by each trajectory is known, we use the proportion of trajectories that do not exceed n and corresponding mean extinction times (ET) to recover the birth-death rates sequentially from 1 to N. In general, the initial error will propagate with the site number exponentially. However, with sufficient amount of input data, we can recover the rates with relatively small error. For instance, given 50 million ETs of an 11-site BDP, we can recover the rates with a relative error of about 3%.


Introduction
The prediction of the native conformation of a protein, given its amino acid sequence is one of the great open problems in biophysics [1][2][3][4][5]. Since the 1990s [6,7], techniques such as Atomic Force Microscopy (AFM) [8] and Förster Resonance Energy Transfer (FRET) [9,10] have allowed experimentalists to explore the relationship between macromolecular structure and folding/unfolding rates.
A computational approach to protein folding is to implement large-scale all-atom molecular dynamics (MD) simulations [11]. The folding of a macromolecule is represented by tracking the positions of every single atom in the molecule and the result of the simulation is a trajectory in a high-dimensional state-space. A more conceptual way to understand protein folding is to introduce a reaction coordinate [12] which effectively maps the high-dimensional space onto a single scalar that measures the progress of the folding. The progress along the reaction coordinate is modulated by a free energy 'landscape' [13] that may exhibit multiple minima, corresponding to multiple metastable configurations. Inferring the shape of these landscapes from quantities such as extinction times [14], rupture forces [15] or time-displacement trajectories [16] remains a challenging theoretical problem.
One of the most common ways of probing the energy landscape is through AFM [17]. In AFM experiments, one end of the molecule is tethered to a movable platform and the other is attached to a cantilever tip: see figure 1(a). Small deflections of the cantiliever are detected using a laser-photodiode setup. The AFM can operate in several ways. One protocol is 'force-ramp' mode where the platform lowers at a constant speed. As a protein domain is coercively unfolded, the cantilever deflection increases until a critical platform position is reached. Beyond this point, the cantilever quickly relaxes, corresponding to domain rupture. The resulting force-extension curve allows quantification of the 'entropic elasticity' of that particular domain. The procedure can also be performed sequentially if multiple domains are present in a large protein [18,19]. The discrete event corresponding to rupture is interesting both physically and mathematically. Experiments show that rupture does not always occur at the same force. Furthermore, the rupture force distribution shifts towards higher values when the platform speed is larger. Both of these observations are in stark contrast to mechanical bonds that break at a single yield stress. They point towards a model of bond-breaking that is based on 'thermally activated escape,' i.e. a theory of random walks.
Besides force-ramp mode, another protocol is to keep the AFM platform stationary and operate in 'forceclamp' mode. Under this mode of operation, one focuses on deflections of the cantilever which essentially provide the reaction coordinate as a function of time. Some possible time traces are shown in figures 1(b) and (c). The protein spends most of its time in metastable configurations (when the reaction coordinate is an integer) and very little time in-between these states. Figure 1(b) shows the trace for a simple protein with a single folding domain that can be in one of two states: folded or unfolded. The kinetics in this case are welldescribed by two exponential distributions [20]; one for the 1 0  transition and the other for the 0 1  transition. The half-lives or rate constants associated with each exponential distribution can easily be inferred from the time trace in figure 1(b). For proteins with multiple domains the traces can be more complex and could resemble figure 1(c). If one assumes exponential kinetics as before (i.e. the transition between states always follows an exponential distribution, but the parameters of the distribution could be state-dependent) the resulting stochastic process is a birth-death chain. These chains correspond to energy landscapes with multiple metastable states: see figure 2.
An extinction time is the time taken for the reaction coordinate to start at 1 and reach 0 for the first time, corresponding to the state where all domains are folded. In figure 1(c), the measurement of τ j starts when the reaction coordinate reaches 1 for the first time. For each excursion, one can also define the maximal site n j that is reached before extinction occurs. After suitable processing of the signal, a single time trace generates many pairs (τ j , n j ), j=1, K, M and M?1. In this paper, we show how to recover all the transition rates of the birth-death chain from {τ j } and {n j }. How to detect jumps in the reaction coordinate, that define when transitions between states have occurred, is a separate issue and beyond the scope of this paper.

Energy landscape inference: existing methods
In this section, we briefly describe some of the existing methods used to interpret AFM data. One of the earliest methods is Bell's method [21] which assumes a force-dependent rate ansatz (of Arrhenius form) for the state of a domain or chemical bond. Providing that the force on the protein depends on the instantaneous deflection in the cantilever, Bell's method predicts rupture force distributions and survival probabilities for a chemical bond under the action of a force ramp. The method is used to solve the 'forward' problem in the sense that rupture distributions are predicted from a potential well whose shape is known. In contrast, Dudko and co-authors [22]

Governing equations for the birth death process
We now present a method to compute the transition rates of a birth death process from its extinction times. Our stochastic model takes the form of a random walker on a finite integer lattice: The birth and death rates at site n are λ n and μ n respectively. We assume the particle starts at site n=1 and the process terminates (goes extinct) when it reaches site 0. For each trajectory, we record the extinction time and the maximal site reached (i.e. the largest site number it attains before reaching site 0). From this data, we wish to infer the N 2 1 rates , , N 1 1 l l ¼ -and μ 1 , K, μ N . Consider a birth-death process on a lattice with sites labeled N 0, 1, 2, , where N is finite but unknown: see figure 3. A particle starts at site 1 and executes a random walk which we write as X(t): X(t) is a random walk on the non-negative integers. At site i, the rightward (leftward) hopping rate is λ i (μ i ). When the particle reaches site 0, we record the time of exit. When this experiment is repeated many times, we may use the resulting data to compute the extinction time distribution W(t). The matrix that determines this process is defined as ] satisfy the forward master equation and it is a simple matter to find W(t) from P 1 (t) (see section 2.1). If N is known, finding all the transition rates , amounts to an exponential fitting problem, which is very ill-conditioned [34]. To overcome this difficulty, we assume that for every trajectory X j (t) we also record the maximal site of the particle, n j . The maximal site for the random walker is simply the largest site number that it attains before exiting. By grouping trajectories according to their maximal site and computing the statistics of the extinction times of each group, we are able to accurately infer transition rates for birth death chains of length 8-11 from about 5×10 7 extinction times with relative error of a few percent.
To be more precise, if X j (t) is the jth trajectory, then τ j is the jth extinction time defined as X inf : 0 is the maximal site of the jth trajectory and S n n : }is the set of extinction times corresponding to trajectories whose maximal site does not exceed n. Then for a finite sample of trajectories, we have S S S N 1 2 Í Í ¼ Í . The algorithm that we propose takes as input S n | | and S n (cardinality and mean of S n ) to infer λ n and μ n for each n.
Random walkers that exit from the dashed box in figure 3 can exit at site 0 or n+1. We informally call the times that correspond to exit at site 0 (n+1) 'left' ('right') extinction times. Then, the distribution of extinction times for the birth-death process, conditioned on trajectories not exceeding site n is identical to the left extinction time distribution out of the sublattice {1, K, n}. By finding finding analytical expressions for the moments of this distribution and matching them to the observed moments, we may infer the transition rates on the lattice. This forms the basis of our method.

Extinction times and probability fluxes
We assume that sites 0 and n+1 are absorbing in the sense that if the particle reaches site 0 or n+1 ('exits'), it stays at these sites for all time. We use the superscript n to distinguish the subproblem from the entire chain. Define the probability that the random walker is at site k at time t as . Figure 3. Birth death chain with N+1 sites, with 0 N l = . Our algorithm uses the extinction times of the process given that the positions of particles never exceed site n (always remain in the dashed box).
• T ( n) , the extinction time of the random walker, defined to be the time at which the walker arrives either at site 0 or n+1 for the first time. Conditioning on E (n) , let the density of extinction times T ( n) be w t L n ( ) ( ) and w t R n ( ) and we let W t L n ( ) ( ) and W t R n ( ) ( ) be the corresponding CDFs.
By equation (2.5), n P ( ) is strictly increasing with n. Now we show how extinction times are related to probability fluxes. The probability of the particle being at site 0 at time t+dt is given by P t dt P t dt P t 1.

2.9
n n n 0 1 1 0 Note that the probability of hopping left in time dt from site 1 is μ 1 dt and the probability of staying at site 0 in time dt is 1 since site 0 is absorbing. Equation (2.9) implies that dP t dt P t .
where μ 1 is the death rate from site 1 and Π (n) is defined in equation (2.5).
Proof. If the random walker is at site 0 at time t, it must have arrived there either at t or before. Then

Algorithm for reconstructing transition rates
Our algorithm for transition rate reconstruction requires the following as input: for each n, the fraction of random walks that exit and whose maximal site does not exceed n; and the mean extinction time for these conditional random walks. For each n1, Π (n) and T E M 0

Inference of μ 1 and λ 1
In the first step, we recover the birth and death rates at site 1. Note that t P 1 ( ) ( ) only contains a single element, and the forward master equation can be written as This simple ODE has solution Suppose we only consider left extinction times, with n=1, generated by all trajectories that directly arrive at site 0 from site 1. These extinction times are exponentially distributed with parameter 1 It follows from the property of exponential distributions that In the next step, we use(2.11) to get P t w t 3.6 We now have obtained the approximations for μ 1 and λ 1 .

Inference of μ 2 and λ 2
The forward master equations are for t and when s=0, Equations (3.17) and (3.18) can be rewritten as Assuming λ 1 and μ 1 are known from the n=1 case, solving equations (3.19) and (3.20) allows us to compute λ 2 and μ 2 from the conditional moments Π (2) and M (2) .
3.3. Inference of μ n and λ n for n3 Now we consider the n-th site after computing the birth and death rates for the first n−1 sites. The Laplace transformed ODEs of s P s P s P , , can be represented in the following matrix form: has n elements and I n is the identity matrix of size n×n. Whenever s is not an l m are the coefficients of the characteristic polynomial of A n ( ) , and they can be written as functions of birth and death rates: Finally, we note that the coefficient of of s n in (3.23) is 1 and for notational convenience, define for some coefficients c c c , , , from the definitions in equations (3.25) and the fact that the (1, 1) and (2, 1) entries of A ( n) are zero when Since the expression . Integrating both sides of (2.11), we have that By equating coefficients at O(1) and O(s), we can establish a recurrence relation between the coefficients of the characteristic polynomial for the n-site subproblem and the n−1 and n−2 site subproblems: The way of computing (λ n , μ n ) for the n-site subproblem is to rewrite equations and σ n =λ n +μ n . Eliminating the vector n n n n T 3 . 5 2 n n n n n n n 1 2 2 ( ) ( ) is invertible, σ n and μ n are uniquely determined, and so λ n and μ n are uniquely determined. Proof. Consider the case n=2. Then, we see that . We split the proof into two parts. First, we find a simple expression for the determinant. Second, we show using induction that this expression is always non-zero.
Expression for Determinant. The determinant depends on r ( n) and M ( n) , which are determined by the (perfect) data. Using the recurrence relations Therefore, after some algebra

The denominator is always nonzero because
. It is well-known that the eigenvalues of the infinitesimal generator matrix (and its submatrices) of a birth-death process are all negative [35], and the matrix A n ( ) is the transpose of a submatrix of such an infinitesimal generator. It remains to show that this expression is non-zero for n 3  .
( ) ( ) is invertible for n3. From (3.52), σ n and μ n are uniquely determined and so (λ n , μ n ) are also uniquely determined. , In summary, at step n3, we must solve the linear system V V r 0 , 3 . 6 7 n n n n n n n 1 2 2 which we may write as The F ( n) and G ( n) depend on the previous transition rates ν 1 , K, ν n−1 and from theorem 1, F ( n) is invertible. For reference, the entries of F ( n) are: n n n n n n n n n n 22 2 where f : Hence the errors of data introduced at each site will propagate exponentially in its subsequent sites. In other words, at each site n, the error in the birth-death rates (λ n , μ n ) is the result of accumulating the exponential growth of errors from sites 1 to n−1. In this special case where all errors in data are bounded by D, we also expect to observe the error in birth-death rates grow exponentially.

Algorithm details
The implementation of the algorithm starts with the inference at site 1 and site 2. In order to keep track of the recurrence relation in lemma 4, we only need to focus on the 'feature vector' defined as u , , ,  x where the j-th column is the feature vector at site j. At each step j3, we need to solve linear system defined by (3.68) and (3.69) to obtain [λ j , μ j ], and update the j-th column of Z by lemma 4. Details are listed in algorithm 1.

Algorithm 1. Inference of birth and death rates up to site N in a birth death chain
Input: An array of extinction times j t { } along with maximal sites {n j } from repeated simulation of a birth death process ( j=1, K, L). Initialize: For k N 1 to = , find all τ j such that n j =k. Compute number of extinctions as a fraction of L (Π (k) ) and the mean of the extinction times (M ( k) ). Initialize Z as a 4×N zero matrix.

Numerical results
Below we reconstruct the transition rates for proteins with N folding/unfolding domains. When all domains are folded, the reaction coordinate is 0 and if all domains are unfolded, the reaction coordinate is N. Every domain that unfolds (folds) increases (decreases) the reaction coordinate by an integer.
Example 1. Protein withN=4 domains. First we present a simple result of a birth-death chain with only 5 states and some pre-determined rates. All the rates are about the same order of magnitude, see figure 5.
The extinction time data is generated by simulating the birth-death process with these given rates, and we evaluate the reconstruction in comparison to them. With about 5×10 6 extinction times, we can infer the rates to very good accuracy-about 0.11% and 0.37% relative errors in l and m, respectively. First, the sum of birth and death rates λ 1 +μ 1 follows from the mean extinction time conditioned on immediate exit through equation (3.5). The fraction of times corresponding to immediate exit then yields μ 1 and λ 1 separately through (3.8) and (3.9). Next, λ 2 and μ 2 are computed by (3.19) (3.20). Then for n=3,4, we compute the n-th columns of matrix Z as in (3.93) and obtain {λ n , μ n } simultaneously. Notice that the last birth rate λ N is always assumed to be zero and no inference is necessary at that site.
Example 2. Protein withN=10 domains. We now test our reconstruction algorithm on a longer, 11-site chain. In this result, 5×10 7 extinction times are simulated from an 11-site birth-death chain. Following the same steps, we have the inference results with a relative error in l and m to be 3.29% and 3.71%, respectively: see figure 6.
Example 3. Protein withN=10 domains and a single high activation energy barrier. This birth-death chain has a 'bottleneck' between states 3 and 4 so that the transition rates between these two configurations are much smaller than the others: see figure 7. The accuracy of the transition rates of sites after the bottleneck are more affected than the accuracy before the bottleneck, due to the fact that all inferences of sites after the bottleneck depend on  the extinction times generated by particles that have gone through the bottleneck. With about 5×10 7 extinction times, there are 1.17% and 1.24% relative errors in l and m, respectively. The effect of the bottleneck is evident from the data since we see an abrupt increase in M n ( ) as n goes from 3 to 4.
Example 4. Protein withN=8 domains and a single low activation energy barrier. This protein has a very low barrier between states 5 and 6; the transition rates between these sites are very large. From an AFM trace, it may be difficult to distinguish or separate these two configurations. However, given 5×10 7 simulated data, the birth and death rates can be inferred within a reasonable amount of accuracy, and the overall relative error is 2.21% in λ k and 2.24% in μ k . See figure 8.
Example 5. 11-site birth death chain error propagation. In this example, we show how error propagates with site number. The birth and death rates are all equal to 1, and 1×10 8 extinction times are generated from the Monte Carlo simulation. Bootstrap is done by taking random samples of size 5×10 6 from these extinction times and errors at each site are computed. This resampling procedure is repeated 50 times, and figures 9 and 10 display the mean error and 95% confidence intervals from the above 50 samples, where it is clear that errors for both λ k and μ k increase exponentially with site number, as a result of theorem 2. Although this example uses noisy data, this exponential increase with site number is not surprising in light of theorem 2.  Example 6. Minimum number of extinction times required. Finally, we consider the minimum number of extinction times required for a birth death chain of length N+1 such that relative error for both λ k and μ k is below 15%. For each N, 50 bootstrap samples of a fixed size are selected and the average relative errors are computed from these samples. This procedure is then repeated for different sample sizes, from which we can estimate the minimum number of extinction times required for the chain of length N+1 to have error below 15%. Note that this estimation applies to the chain where all birth-death rates are about the same order of magnitude. Detail is in table 1, and figure 11. It is obvious from the plot that minimum number of extinction times required is increasing exponentially with the length of chain.

Conclusions
In summary, we have presented a method for extracting the kinetic rates of large proteins, with multiple folding domains, from extinction times (when all domains have folded) and 'maximal sites' (the maximum number of unfolded domains before extinction). Both of these quantities can, in principle, be computed from AFM time traces. The inference relies on the recurrence relation specified in lemma 4 and starts with base cases when n=1, 2, and inference of each subsequent site depends on its previous sites by solving a linear system. If the data n P ( ) and  M n ( ) are exactly given, meaning that they correspond to the statistics of an underlying birth-death process, then the birth-death rates are uniquely determined, and we proved that a small perturbation in site 1 will propagate exponentially throughout the chain, given that the first derivatives of f n 1 ( ) and f n 2 ( ) in theorem 2 are bounded near the exact solution. With sufficient data (about 50 million for a chain of length [8][9][10][11][12], the method can compute these rates to very good accuracy and is capable of detecting bottlenecks or extreme values in the chain. In general, the number of extinction times one needs to obtain reasonable results grows exponentially with the total length of the chain.
There still remain many theoretical challenges to interpreting single-molecule AFM data. For example, inference from extinction times in the 'transmission' problem where absorption/extinction is at site N rather than site 0 is severely ill-posed. Can some aspect of time-trace data be used to better-condition this problem? Also, bifurcations or loops in the birth-death chain representing multiple pathways to a final unfolded state have not yet been explored. Finally, it remains to be seen if our method can be adapted to force-ramp data, or how to proceed if transition times between metastable configurations are not exponentially distributed.

Appendix
There is an alternative way to infer the birth death rates, instead of the recurrence relations. This method is simple, yet more computationally intensive if number of sites is large.