Stochastic Analysis of Chemical Reaction Networks Using Linear Noise Approximation

Stochastic evolution of Chemical Reactions Networks (CRNs) over time is usually analysed through solving the Chemical Master Equation (CME) or performing extensive simulations. Analysing stochasticity is often needed, particularly when some molecules occur in low numbers. Unfortunately, both approaches become infeasible if the system is complex and/or it cannot be ensured that initial populations are small. We develop a probabilistic logic for CRNs that enables stochastic analysis of the evolution of populations of molecular species. We present an approximate model checking algorithm based on the Linear Noise Approximation (LNA) of the CME, whose computational complexity is independent of the population size of each species and polynomial in the number of different species. The algorithm requires the solution of first order polynomial differential equations. We prove that our approach is valid for any CRN close enough to the thermodynamical limit. However, we show on four case studies that it can still provide good approximation even for low molecule counts. Our approach enables rigorous analysis of CRNs that are not analyzable by solving the CME, but are far from the deterministic limit. Moreover, it can be used for a fast approximate stochastic characterization of a CRN.


Introduction
Chemical reaction networks (CRNs) and mass action kinetics are well studied formalisms for modelling biochemical systems [9].In recent years, CRNs have also been successfully used as a formal programming language for biochemical systems [33,7,10].There are two well established approaches for analyzing chemical networks: deterministic and stochastic [20].The deterministic approach models the kinetics of a CRN as a system of ordinary differential equations (ODEs) and represents average behaviour, valid in the thermodynamic limit [19].The stochastic approach, on the other hand, is based on the Chemical Master Equation (CME) and models the CRN as a continuous-time Markov chain (CTMC) [6].The stochastic behavior can be analyzed by stochastic simulation [20] or by exhaustive probabilistic model checking of the CTMC, which can be performed, for example, by using PRISM [25].
Exhaustive analysis of the CTMC is able to find the best-and worst-case scenarios and is correct for any population size, but suffers from the state-space explosion problem [26] and can only be used for relatively small systems.In contrast, deterministic methods are much more robust with respect to statespace explosion, but unable to represent stochastic fluctuations, which play a fundamental role when the system is not in thermodynamic equilibrium.
Contributions.In this paper we develop a novel approach for analysing the stochastic evolution of a CRN based on the Linear Noise Approximation (LNA) of the CME.We formulate SEL (Stochastic Evolution Logic), a probabilistic logic for CRNs that enables reasoning about probability, expectation and variance of linear combinations of populations of the species.Examples of properties that can be specified in our logic are shown in Example 1.We propose an approximate model checking algorithm for the logic based on the LNA and implement it in Matlab and Java.We demonstrate that the complexity of model checking is polynomial in the initial number of species and independent of the initial molecule counts, thus ameliorating state-space explosion.Further, we show that model checking is exact when approaching the thermodynamic limit.Though the algorithm may not be accurate for systems far from the deterministic limit, this generally happens when the populations are small, in which case the analysis can be performed by transient analysis of the induced CTMC [24].Our approach is essential for CRNs that cannot be analyzed by (partial) state space exploration, because of large or infinite state spaces.Moreover, it is useful for a fast (approximate) stochastic characterization of CRNs, since solving the LNA is much faster than solving the CME [16].We prove asymptotic correctness of LNA-based model checking and show on four examples that it is still possible to obtain very good approximations even for small population systems, comparing with standard uniformisation [24] and statistical model checking implemented in PRISM [25].
Related work.The closest work to ours is by Bortolussi et al. [4], which uses the Central Limit Approximation (CLA) (essentially the same as the LNA) for checking restricted timed automata specifications, assuming a fixed population size.Wolf et al. [37] develop a sliding window method to approximately verify infinite-state CTMCs, which applies to cases where most of the probability mass is concentrated in a confined region of the state space.Recently, Finite State Projection algorithms (FSP algorithms) for the solution or approximation of the CME have been introduced [28].Both methods apply to the induced CTMC, but require at least partial exploration of the state space, and are thus not immune to state-space explosion.
Structure of the paper.In Section 2 we summarise the deterministic and stochastic modelling approaches for CRNs, and in Section 3 we describe the Linear Noise Approximation method.Section 4 introduces the logic SEL and the corresponding model checking algorithm based on the LNA.In Section 5 we demonstrate our approach on four networks taken from the literature.Section 6 concludes the paper.
A chemical reaction network (CRN) C = (Λ, R) is a pair of finite sets, where Λ is the set of chemical species and R the set of reactions.|Λ| denotes the size of the set of species.A reaction τ ∈ R is a triple τ = (r τ , p τ , k τ ), where r τ , p τ ∈ N |Λ| and k τ ∈ R >0 .r τ and p τ represent the stoichiometry of reactants and products and k τ is the coefficient associated to the rate of the reaction; its dimension is s −1 .We often write reactions as , where • T indicates the transpose of a vector.We define the net change associated to a reaction τ by υ τ = p τ − r τ .For example, for τ 1 as above, we have We make the assumption that the system is well stirred, that is, the probability of the next reaction occurring between two molecules is independent of the location of those molecules.We consider fixed volume V and temperature; under these assumptions a configuration or state x ∈ N |Λ| of the system is given by the number of molecules of each species.We define [x] = x N , the vector of the species concentration in x for a given N , where N = V • N A is the volumetric factor, V is the volume of the solution and N A is Avogadro's number.The physical dimension of N is M ol −1 • L, where M ol indicates mole and L is litre.Given λ i ∈ Λ then #λ i x ∈ N represents the number of molecules of λ i in x and [λ i ] x ∈ R the concentration of λ i in the same configuration.In some cases we elide x, and we simply write #λ i and [λ i ] instead of #λ i x and [λ i ] x.They are related by The propensity α n,τ of a reaction τ in terms of the number of molecules is a function of the current configuration of the system x such that α n,τ (x)dt is the probability that a reaction event occurs in the next infinitesimal interval dt.In this paper we assume as valid the stochastic form of the law of mass action, so the propensity rates are proportional to the number of molecules that participate in the reaction [6].Stochastic models consider the system in terms of numbers of molecules, while deterministic ones, generally, in terms of concentrations, and the relationship is as follows.For a reaction τ = (r τ , p τ , k τ ), given the configuration x and r τ,i , the i-th component of r τ , then is the propensity function expressed in terms of concentrations as given by the deterministic law of mass action.It is possible to show that, for any order of reaction, α n,τ (x) ≈ N α c,τ (x) if N is sufficiently large [1].Note that α c,τ is independent of N .In this paper we are interested only in finite time horizon, because of the problematic character of studying solutions of ODEs for infinite time horizon [3].
, for a system with N = 1000.Figure 1 plots the expectation and standard deviation of population sizes.We may wish to check if the maximum expected value of #λ 2 remains smaller than 75 molecules during the first 2sec.However, the system is stochastic, so we also need to analyse whether the variance is limited enough when #λ 2 reaches the maximum.Sometimes, analysis of first and second moments does not suffice, so it could be of interest to check the probability of some events, for instance, is the probability that, between t 1 = 0.5sec and t 2 = 1.0sec, #λ 2 − (#λ 1 + #λ 3 ) > 0 greater than 0.6?
describes the behaviour of the system as a set of deterministic equations assuming a continuous state-space semantics, therefore Φ(t) ∈ R |Λ| is the vector of the species concentrations at time t.Assuming t 0 = 0, the initial condition is Φ(0) = [x 0 ], expressed as a concentration.Note that F (Φ(t)) is Lipschitz continuous, so Φ exists and is unique [17].
Stochastic semantics.CRNs are well represented by CTMCs, whose transient analysis can be performed via the Chemical Master Equation (CME) [35].
Definition 1 Given a CRN C = (Λ, R) and the volumetric factor N , we define a time-homogeneous CTMC [11,32] (X N (t), t ∈ R ≥0 ) with state space S = N |Λ| .Given x 0 ∈ S, the initial configuration of the system, then P (X N (0) = x 0 ) = 1.The transition rate from state x i to state x j is defined as r(x i , x j ) = {τ ∈R|xj=xi+vτ } N α c,τ (x i ).
X N (t) describes the stochastic evolution of molecular populations of each species at time t.For x ∈ S, we define P (t) (x) = P (X N (t) = x|X(0) = x 0 ), where x 0 is the initial configuration.The CME describes the time evolution of X N as: The CME can be equivalently defined in terms of the infinitesimal generator matrix [37], which admits computing an approximation of the CME using, for example, fast adaptive uniformisation [14,13] or the sliding window method [37].
We also define the CTMC ( is the random vector describing the system at time t in terms of concentrations.In [1,17] it is proved that lim for every time t.This explains the relationship between the two different semantics, where the deterministic solution can be viewed as a limit of the stochastic solution, valid when close enough to the thermodynamic limit.

Linear Noise Approximation
The solution of the CME can be computationally expensive, or even infeasible, because the set of reachable states can be huge or infinite.The Linear Noise Approximation (LNA) has been introduced by Van Kampen as a second order approximation of the system size expansion of the CME [35].Since stochastic fluctuations depend on N , and specifically, for average concentrations, are of the order of N 1 2 [16,32], to derive the expansion Van Kampen assumes that: where Z(t) = (Z 1 (t), Z 2 (t), ..., Z |Λ| ) is the random vector, independent of N , representing the stochastic fluctuations, Φ(t) is given by the solution of Eqn (1) and X N (t) is the random vector of Definition 1.Using this substitution in the system size expansion and then truncating at the second order, the probability distribution of Z(t) is found to be given by the following linear Fokker-Plank equation [16]: ) and F j (Φ(t)) is the j−th component of F (Φ(t)).The solution of Eqn (4) gives a Gaussian process.For every time t, Z(t) has a multivariate normal distribution, whose expected value and covariance matrix are the solution of the following equations [16,21]: where J F (Φ(t)) is the Jacobian of F (Φ(t)).We consider as initial conditions E[Z(0)] = 0 and C[Z(0)] = 0.This means that E[Z(t)] = 0 for every t.
It is possible to justify the hypothesis (3) noting that in the lowest order the CME expansion reduces to Eqn (1), and with the following theorem by Kurtz: Theorem 1. [17] Consider the subset E ⊂ R |Λ| on which are defined the propensity functions α c,τ .Let Z N (t) be the random vector given by Z The LNA thus permits approximation of the probability distribution of X N (t) with the probability distribution of It is easy to show that Y N (t) has a Gaussian distribution; indeed, Z(t) is Gaussian distributed, and N and Φ(t) are deterministic.
To compute the LNA it is necessary to solve O(|Λ| 2 ) first order differential equations, but the complexity is independent of the initial number of molecules of each species.Therefore, one can avoid the exploration of the state space that methods based on uniformisation rely upon.
Theorem 1 alone only guarantees convergence in distribution.However, in [36], LNA is derived as an approximation of the Chemical Langevin Equation (CLE) [18], rather than system size expansion.This shows that LNA is valid for every real chemical system close enough to the thermodynamical limit, at least for a limited time.Thus, LNA is exact in the limit of high populations, but can also be used for small populations if the behaviour is not too far from the deterministic limit, taking into account the continuous nature of the approximation and Gaussian assumptions on the noise [21,36].

Probabilistic analysis of CRNs
We have shown that X N can be approximated by where Y N (t) has a multivariate Gaussian distribution, so it is completely characterized by its expected value and covariance matrix, whose values are respectively Since Y N has a multivariate normal distribution then every linear combination of its components is normally distributed.Therefore, given where b 1 , b 2 , ..., b |Λ| ∈ Z, we can consider the random variable BY N (t), which defines a linear combination of the species at time t.For every t, BY N (t) is a normal random variable, whose expected value and variance are For a specific time t k , it is possible to calculate the probability that BY N (t k ) is within a set I of closed, disjoint real intervals [l i , u i ], where l i , u i ∈ R ∪ {+∞, −∞}.This probability Ω Y N ,B,I (t k ) is given by where g(x|EV, σ 2 ) is the Gaussian distribution with expected value EV and covariance σ 2 .We recall that it is possible to find numerical solution of Eqn (9) in constant time using the Z table [30].The following theorems are consequences of results in [36], which can be generalized for reactions with a finite number of reagents and products.They show asymptotic pointwise convergence of expected value, variance and probability.
Theorem 2. Let C = (Λ, R) be a CRN.Suppose the solution of Eqn (6) is bounded, then, approaching the thermodynamic limit, for any finite instant of time t i lim where Ω X N ,B,I (t i ) is the probability that B(X N ) is within I at time t i .
Theorem 3. Suppose the solution of Eqn (6) is bounded, then, approaching the thermodynamic limit, for any finite instant of time To solve the differential equations ( 5) and ( 6), it is necessary to use a numerical method such as adaptive Runge-Kutta algorithm [5].This yields the solution for a finite set of sampling times Σ = [t 1 , ..., t |Σ| ] ∈ R |Σ| , where t 1 ≤ ... ≤ t k ≤ ... ≤ t |Σ| and |Σ| is the sample size.Assuming Y N is separable, that is, it is possible to completely define the behavior of Y N by only considering a countable number of points, we can calculate Ω Y N ,B,I for any point in Σ and if points are dense enough then this set exhaustively describes the probability that BX N is within I over time.This restriction is not a limitation since for any stochastic process there exists a separable modification of it [23].

Stochastic Evolution Logic (SEL)
Let C = (Λ, R) be a CRN with initial state x 0 , in a system of size N .We now define the logic SEL (Stochastic Evolution Logic) which enables evaluation of the probability, variance and expectation of linear combinations of populations of the species of C.
The syntax of SEL is given by with probability greater than 0.95: Equivalently, instead of writing B, we write directly the linear combination it defines.For example, in the latter case we have Semantics Given a CRN C = (Λ, R) with initial configuration x 0 in a system of fixed volumetric factor N , its stochastic behaviour is described by the CTMC X N of Definition 1.We define a path of CTMC X N as a sequence ω = x 0 t 1 x 1 t 1 x 2 ... where x i is a state and t i ∈ R >0 is the time spent in the state x i .A path is finite if there is a state x k that is absorbing.ω ⊗ t is the state of the path at time t.P ath(X N , x 0 ) is the set of all (finite and infinite) paths of the CTMC starting in x 0 .We work with the standard probability measure P rob over paths P ath(X N , x 0 ) defined using cylinder sets [24].
We first define when a path ω satisfies (B, I) Note that B(ω ⊗ t) is well defined because ω ⊗ t ∈ N |Λ| .For η formulas we have ) respectively denote the infimum and supremum within ) is the probability that the linear combination defined by B falls within I at a time instant t between t 1 and t 2 , and is well defined since the probability measure P rob on P ath(X N , x 0 ) corresponds to transient probability calculated using the CME.

LNA-based Approximate Model Checking for CRNs
Stochastic model checking of CRNs is usually achieved by transient analysis of the CTMC X N [24], which involves solving the CME and thus suffers from the state-space explosion problem.We propose an approximate model checking algorithm based on LNA.The inputs are a SEL formula η, the stochastic process X N induced by the CRN and initial state x 0 .The output is true in case the formula is verified, and otherwise f alse.
The algorithm proceeds by induction on the structure of formula η, successively computing whether each subformula is satisfied or not.We assume that Eqn ( 5) and ( 6) are solved numerically where Σ is the finite set of sample points on which their solution is defined and that t 0 , initial time, and t max , final time, are always sampling points.
Probabilistic operator To evaluate P ∼p [(B, I)] [t1,t2] we construct the function Theorem 2 guarantees the pointwise correctness of P rob (B,I) and its integrability allows us to compute the following approximation, then compare to threshold p to decide the truth value.If

Expectation and variance operators
) we use the LNA, namely, compute the expected value and variance of Eqn ( 8) and (7).Theorem 3 guarantees the quality of the approximation.We can now compute the following approximations, then compare to the threshold v: and similarly for the expected value.
ensures that for any time interval there is at least one sampling point, even if the interval is a singleton.Note that, for each subformula, the algorithm involves the calculation of some quantity, so one can define a quantitative semantics for SEL as in [15].
LNA-based model checking can also be used for systems far from the thermodynamic limit, at a cost of some loss of precision.LNA assumes continuous state space, and it is not possible to justify this assumption for very small populations.However, if the distributions of interest are not multi-modal and the noise term is finite and approximated by a Gaussian distribution, then LNA gives very good approximation even for quite small systems.It is clear that model checking accuracy increases as N grows.We emphasise that the model checking algorithm we have presented is also able to handle CRNs whose stochastic semantics is an infinite CTMC, which occur frequently in biological models.
Complexity of LNA-based approximate model checking The time complexity for model checking formula η against a CRN C = (Λ, R) is linear in |η|.In the worst case, analysis of a single operator requires the solution of O(|Λ| 2 ) polynomial differential equations for a bounded time.However, an efficient implementation can solve the O(|Λ| 2 ) ODEs only once for the interval [0, t max ], and then reuse this result for every operator, where t max is the greatest (finite) time of interest.Note that ODEs are solved in terms of concentrations (a value between 0 and 1 by convention), ensuring independence of the number of molecules of each species, although stiffness can slow down the solution of the LNA.

Experimental Results
We implemented the methods in a framework based on Matlab and Java.The experiments were run on an Intel Dual Core i7 machine with 8 GB of RAM.To solve the differential equations, we use M atlab ode45, a variable step Runge-Kutta algorithm.We employ LNA-based model checking for the analysis of four biological reaction networks: a Phosphorelay Network [12], a Gene Expression Model [34,27], the FGF pathway [22] and the GW network [8].For every network, the CRN and parameters have been taken from the referenced papers.We coded the same CRNs in PRISM in order to compare accuracy and time of execution with standard uniformisation of the CME [24] and statistical model checking (SMC) techniques (confidence interval method) as implemented in PRISM.For the FGF and GW case studies, we cannot use global analysis nor SMC, because the state space is too large for direct analysis, and SMC requires many timeconsuming simulations to obtain good accuracy.
Phosphorelay Network.We consider a three-layer phosphorelay network whose structure is derived from [12].Each layer (L1, L2, L3) can be found in phosphorylate form (L1p, L2p, L3p).We consider the initial condition #L1p = #L2p = #L3p = 0, #L1 = #L2p = #L3p = Init, where Init ∈ N. Then we analyse the ligand B, whose initial condition is #B = 3 * Init.We are interested in checking the following SEL property: which is verified if, in the first interval, the probability that #L1p is greater than #L3p is > 0.7 and if, between 300 and 600, with probability > 0.98, #L3p is greater than #L1p.We evaluate this formula in three different initial conditions, firstly Init = 32 and N = 5000, then Init = 64 and N = 10000, and finally Init = 100 and N = 15625, so the same concentration but different numbers of molecules.In all cases, the LNA-based model checking evaluates the formula as true.To understand the quality of the approximation, we check the following quantitative formula P =? [(#L3p − #L1p, [0, +∞])] [T,T ] for T ∈ [0, 600] (in our implementation =? gives the quantitity calculated by model checking the operator).We compare the results with the evaluation of the corresponding CSL formula using standard uniformisation (U nif ) with error 10 −7 [24].The following table shows the results.M axErr is the maximum error computed by LNA-based approach compared to standard uniformisation and AvgErr is the average error; T ime(•) stands for execution time.Note that as Init increases the error of our method decreases, while the execution time is practically independent of the molecular count.LNA-based algorithms are faster in all cases.Thus our approach can be used even for quite small population systems, giving a fast approximate stochastic characterization.Gene Expression.We consider a simple CRN that models the transcription of a gene into an mRNA molecule, and the translation of the latter into a protein.The CRN, rates and initial conditions are the same as in [27].The stochastic semantics of the reaction network is an infinite CTMC, and we use this model to show that our method can handle infinite state-space processes.We consider the quantitative property supE =? [#mRN A] [T,T ] , which gives the number of molecules of mRN A in the system at time T .We compare our method with SMC estimation of the same property by using 50000 simulations, for T = {300, 600, 900, 1200}, and in the following tables we compare the results in terms of execution time (T ime( FGF.We consider the model of Fibroblast Growth Factor (FGF) signalling pathway developed in [22] composed of more than 50 reactions and species.We consider the system with initially 105 molecules for species with non-zero initial concentration.Analysis of the model reveals that the phosphorylated form of F RS2 can bind the protein Src, and then this new complex, Src:F RS2, can relocate out.We want to check if the expected value of #Src:F RS2 during the first 3000 seconds reaches a maximum value greater than 40.We do that by checking the property supE >40 [#Src:F RS2] [0,3000] .The formula evaluates to true, and in Figure 2 we analyze the expected value and standard deviation of #Src:F RS2.
We obtain these values directly from the logic considering the quantitative interpretation of supE =? [#Src:F RS2] [T,T ] and supV =? [#Src:F RS2] [T,T ] for T ∈ [0, 3000].It is possible to see that, after an initial peak, relocation causes exponential decay.Fig. 2: Expected number and standard deviation of species of #Src:F RS2 in the FGF pathway during the first 8000 seconds estimated by our method is compared with a stochastic simulation of the same species.
In the same figure we show a single stochastic simulation of the system for the same initial conditions, confirming our evaluation.Moreover, the approximation can be justified theoretically.#Src:F RS2 converges to zero necessarily and this demonstrates the unimodality of the distribution of the species; we note that the variance is finite, so Eqn (3) holds.
DNA strand displacement of GW network GW is a network related to the G2-M cell cycle switch [29].Under particular initial conditions, it has been shown that GW can emulate the Approximate Majority algorithm [8].Here, we consider the two-domain DNA strand-displacement implementation of GW [7].The corresponding CRN is composed of 340 species and 240 reactions.For our analysis the species of interest are R and P , whose initial conditions are #R = 90 and #P = 10; initial conditions of other species are taken from the referenced papers.We check the property P >0.9 [#R − #P, [50, +∞]] [6000,35000] for a system of size N = 45000, which is verified as true in 28 minutes.

Fig. 1 :
Fig. 1: Expected number and standard deviation of species of the CRN of Example 1 for the given initial conditions, calculated by simulating the CME.
and [t 1 , t 2 ] is a closed interval, with the constraint that t 1 ≤ t 2 and t 1 , t 2 ∈ R. If t 1 = t 2 the interval reduces to a singleton.Formulae η describe global properties of the stochastic evolution of the system.(B,I) specifies a linear combination of the species of C and a set of intervals, where B ∈ Z |Λ| is the vector defining the linear combination and I represents a set of disjoint closed real intervals.P ∼p [B, I] [t1,t2] is the probabilistic operator, which specifies the probability that the linear combination defined by B falls within the range I over the time interval [t 1 , t 2 ]. supE, inf E, inf V, supV respectively yield the supremum and infimum of expected value and variance of the random variables associated to B within the specified time interval.Example 3. Consider the CRN of Example 1. Checking if the variance of #λ 1 remains smaller than K 1 within [t j , t k ] can be expressed as supV •)) and expected value of #mRN A estimated (ExpV al(•)).LNA-based model checking is several orders of magnitude faster without loss of accuracy.