Model reduction for stochastic CaMKII reaction kinetics in synapses by graph-constrained correlation dynamics

A stochastic reaction network model of Ca2+ dynamics in synapses (Pepke et al PLoS Comput. Biol. 6 e1000675) is expressed and simulated using rule-based reaction modeling notation in dynamical grammars and in MCell. The model tracks the response of calmodulin and CaMKII to calcium influx in synapses. Data from numerically intensive simulations is used to train a reduced model that, out of sample, correctly predicts the evolution of interaction parameters characterizing the instantaneous probability distribution over molecular states in the much larger fine-scale models. The novel model reduction method, ‘graph-constrained correlation dynamics’, requires a graph of plausible state variables and interactions as input. It parametrically optimizes a set of constant coefficients appearing in differential equations governing the time-varying interaction parameters that determine all correlations between variables in the reduced model at any time slice.


Introduction
Given a stochastic reaction network, even one specified by high-level 'parameterized reactions' or 'rulebased' notation [2][3][4][5][6], there is a corresponding chemical master equation (CME) for the evolution of probability distributions over all possible molecular states of the system. These states are ultimately described in terms of discrete-valued random variables. Unfortunately as the number of such random variables grows, the number of molecular states appearing directly in the CME grows exponentially. On the other hand even for a dynamical system that is nonlinear in its observable variables, the CME is a (very large) system of linear differential equations for time-evolving probabilities. The exponential explosion of state space size with number of random variables can often be bypassed in sampling-style simulation (such as the Gillespie stochastic simulation algorithm (SSA) [7] and its many variants), and also to a lesser extent for reaction rate inference, provided that enough trajectories are sampled to evaluate a required expected value. But the sampling approach requires a lot of computing power to sample enough trajectories, and also poses substantial obstacles for analysis.
The problem of state space growth is compounded in the case of rule-based stochastic models [2,[4][5][6] since in that case even the number of molecular species suffers an exponential growth in terms of natural problem size parameters such as the number of binding sites in a molecular complex. Then the state space described in the master equation grows doubly exponentially in such problem size parameters, and it can be hard to really understand the resulting stochastic dynamical system. And yet, molecular complexes with combinatorially many states (such as transcription complexes, signal transduction complexes, and allosteric enzyme complexes) are ubiquitous in biology, so the problem cannot simply be avoided. For example, the signal transduction cascade in response to calcium influx through n-methyl D-aspartate receptors (NMDARs) in synapses, which plays a key role in synaptic plasticity in spatial learning and memory in mice, has such combinatorially generated number of local states of molecular complexes involving calcium, calmodulin, CaMKII and phosphorylation sites, as will be described in section 3.1 below ( [1,8], and references therein). Each of the many possible states of each complex is itself a 'molecular species' with associated reaction channels.
To address these problems of model size and state space size, reduced models may have substantial advantages. In general, model reduction replaces a large model with a smaller and more tractable model that approximates the large model in some relevant aspect. The ideas of model 'size', 'tractability', 'approximation', and 'relevant aspect' can be defined in various ways. We will suggest one framework for these definitions in section 2.1 below. In sections 2.2 and 2.3 below we use this framework to introduce a new model reduction method for stochastic biochemical networks which in our target application are also rulebased, though they need not be.
Our method can be viewed as a form of 'moment closure' method as will be explained in section 2.4, which also contains further comparisons to related work. Compared to other methods of moment closure we seek a much more aggressive reduction in the model size, as counted by the number of degrees of freedom (chemical or biological variables with dynamically changing values) required even in a sampling approach, such as SSA, to the original unreduced biochemical model. This claim is substantiated in sections 2.4 and 3.4 below. Such a strategy may be appropriate to the eventual goal of finding usable 'phenomenological' but mechanistically well-founded approximate models of fine-scale subsystems to place within yet larger super-system models, for example placing calcium signal pathway reduced models within neuronal-level synaptic plasticity simulations, although we have not yet attempted such an application of this method. Unlike most but not all other moment closure approaches, we retain (approximations to) correlations of arbitrarily high order rather than truncating them to zero. Our approach applies naturally to the case of rule-based models. And perhaps most importantly from the biological point of view, it is based on a problem-specific graph of possible interactions between key system variables. Such a graph is a natural place to impose human biological expertise on the approximate model reduction method. Figure 1 illustrates our general setting. The basic idea is that the results of following the red arrows around from earlier to later observations, by way of a fine-scale predictive dynamical model as one would do in an ordinary simulation, should approximately agree with the results of following the green arrows around through a coarse-scale model instead. Since the coarse-scale model is smaller, following the green arrows could be cheaper computationally and also more amenable to human understanding of the dynamics. To define this possibility technically, figure 1 also includes mappings M and P directly between the fine-scale and coarse-scale model state spaces.

Model reduction criteria
All vectors and maps defined below are assumed to be defined in the sense of probability distributions, so that for example the fine-scale system state vector S t ( ) is a distribution over all possible individual microscopic states s. In the deterministic limit that distribution could be a delta function that picks out one winning microstate. (upper horizontal arrow). There is a model reduction or restriction map M (vertical arrows), and a prolongation map P right-inverse to M. Comparison for approximation (≈) may be made in some relevant third space O, which could specialize (as we assume below) to be the same as S or R. The black arrow quadrilateral comprising two short paths from S t )corresponds to approximation equations (1) and (2). Approximation equations (3) and (4) correspond in their left-hand sides to the green highlighted path and in their right-hand sides to the red highlighted path. In principle multiple copies of this diagram can be composed, either horizontally or vertically.
linear on probability distributions, and thus preserve mixtures, but in this paper we do not assume linearity of M or the other maps introduced below.
Fine-scale and (reduced) course-scale dynamics are illustrated in figure 1 c f for all S t ( ) or more strongly for almost all possible S, which in operator space could be stated as (2) c f We will omit operator-space variants of the approximation statements below but they should be clear if the same vector appears at the rightmost end of each side of the approximation. Here and in this section the sense of approximation ≈ has yet to be defined but it requires aggregating some measure of error over microstates s, r, or o. For deterministic systems, sumof-squares error over dynamical variables is plausible; for stochastic systems, an asymmetric Kullback-Leibler (K-L) divergence or relative entropy between two distributions over system states is plausible. The K-L divergence is useful when approximating probability distributions because (a) it measures the extra information contained in one distribution beyond what is in a second distribution, and (b) it takes its minimal value, zero, when the two distributions are equal almost everywhere.
Equations (1) and (2) are not entirely satisfactory since they provide no control over the space in which approximation comparisons are to be made. Alternatively to equation (1), and adopting terminology used in multigrid/multiscale algorithms [9,10], one could introduce a 'prolongation' map P that is exactly or approximately right-inverse to M (so that • = M P Î or • ≈ M P Î ), and make the commutativity comparison in the fine-scale system space, S, rather than course-scale, R. But a more general scheme that encompasses both alternatives is to compare time-evolved states or probability distributions on states in a third space of significant if space R is not smaller than O so that R R can be surjective) that predict the targeted observables based on the current state of each system. Then we seek R c R S f S as illustrated by the red and green three-arrow paths in figure 1. This, or the corresponding operator statement in the O space: , and define the projection operator And in the special case O = R, which we will also use, c f which is closely related to (1). In all cases some measure of difference or distance is required to define approximation ≈; such a measure may operate directly on microstates s r o , , , or on probability distribution state vectors S R O , , over these microstates as we assume below. Particular definitions of ≈ will be made in section 3.3 (for O = R) and appendix (for (O = S). The foregoing considerations apply to any dynamical system including stochastic, deterministic, and mixed stochastic/deterministic ones.

Fine-and coarse-scale dynamics
To apply the foregoing framework we need to define fine scale dynamics, coarse scale dynamics, observables, mappings between them, and a sense of approximation ('≈' in figure 1). As in equation (7) above, we will report on the results of taking O = R. The approximation metrics will be defined in section 3.3. We now define fine and coarse scale models.
For a master equation derived from a large finescale reaction network (whether rule-based or not) we seek reduced coarse-scale models in the form of a Boltzmann distribution over states at each time point, with successive time points linked by an artificial and approximating dynamics on the 'coupling constant' or interaction parameters (now time-varying rather than constant) appearing in the Boltzmann energy function formula. In machine learning terms, our prolongation map P is given at each instant in time by a probability distribution on fine-scale variables specified by a Markov random field (MRF) [11]. The MRF comprises a set of 'clique potentials' or 'interaction potentials'. Each interaction potential is a function, usually a monomial or polynomial, of just a few random variables. Each such potential function is multiplied by a scalar interaction potential strength, which we will call an 'interaction parameter'. The sum of all the potentials (including their interaction parameter multiplicative factors) yields the energy function in the Boltzmann distribution. Unlike the potentials, the energy function depends on all the random variables. The way we apply this standard apparatus is as follows: the MRF random variables are interpreted as the finescale variables, and the MRF interaction parameters are taken to be the coarse-scale model dynamical variables. Only these interaction parameters can vary with time, and in our model they will vary continuously in time. Otherwise, the structure of each interaction potential is fixed and independent of time. Thus, the energy function and the MRF model depend on time only through the interaction parameters which are the coarse-scale dynamical variables.
Without the constraint between successive time points enforced by coarse-scale dynamics on the interaction parameters, the classic Boltzmann machine learning algorithm (BMLA) [12] can be used separately at each time point to optimize these unknown interaction parameters to fit samples drawn from many simulations of the full model. This learning algorithm optimizes the K-L divergence or relative entropy between sampled and modeled distributions, thereby defining a sense for the approximation relationship in section 2.1, but only for one instant in time. We slightly generalize the BMLA learning algorithm so that it allows for weight-sharing (components of the μ interaction parameter vector that are constrained eg. to be equal) and for polynomial interaction potentials of degree higher than two.
Coupling many such interaction-parameter inference problems together by insisting that the inferred interaction parameters μ all evolve according to imposed ordinary differential equation (ODE) dynamics, with a further set of learnable model-specifying meta-parameters θ defined in section 2.3 below, results in the graph constrained correlation dynamics (GCCD) method presented here and in [13].
A large space of stochastic dynamical systems f which can specialize to deterministic ones is specified by the master equation governing (possibly singular) probability distributions p s t ( , ): where W is some linear operator acting on the state space of all distributions p over S and obeying conservation of probability, = W 1 · 0. Even though the master equation is linear, its effect on moments such as 〈 〉 s t ( ) i p may be highly nonlinear. For stochastic chemical kinetics the master equation specializes to the CME which can be written: ( ) in the number of molecules of type j; also the notation n ( ) m means the falling factorial − n n m ! ( )!, and k r is the reaction rate for reaction number r. This fully defines the fine-scale stochastic system for a mass-action chemical reaction network. Using the same notation, the chemical reaction network itself may be expressed as: where A i represents reacting chemical species number i and the sums are to be interpreted chemically rather than mathematically. The Plenum implementation of dynamical grammars [2,4] and the MCell Monte Carlo simulation software [3] can be used to express rule-based models and thereby to concisely define combinatorially many elementary reactants and reactions for molecular complexes such as CamKII (which will be introduced in section 3.1 below) and to efficiently simulate them. In addition, spatial diffusion processes can be added. Plenum uses computer algebra to express high-level biological models including those in which the number of compartments varies dynamically and hybrid stochastic/ODE systems. MCell has strong stochastic spatial capabilities and has been used extensively for synapse and neuronal simulations.
For coarse-scale approximate models, if we assume that the state space of the reduced model can be described as the product of state spaces for a fixed set of variables μ = α r { }, then we may consider coarsescale dynamical systems that through prolongation P induce an instantaneous fine-scale probability distribution p s t( , ) defined by some Boltzmann distribution where μ Z ( ) normalizes p. This formula separates the time-evolution (which can only affect interaction parameters μ α ) from the correlation-controlling structure of interactions α V . If there are as many values of α as elements in the full state space of s, then any distribution can be described, but generally we choose a far sparser set of interaction terms. In general equation (11) has nonzero moments of all orders, though only a few moments can be controlled independently by varying the μ α interaction parameters. This control would be exercised e.g. when one derives equation (11) by maximizing entropy subject to equality constraints on these moments 〈 〉 α V s ( ) and on total probability. All other moments (which effectively have μ = α′ 0) would fall where they may, following the principle of constrained maximum entropy obeyed by the Boltzmann distribution.
The essential information about the coarse-scale dynamics is contained in equation (11) above and equation (12) in section 2.3 below. In this setting, prolongation P from coarse to fine is obtained by sampling from the Boltzmann distribution μ | p r t( , ). The model reduction map M will be defined by statistical inference, from a sample of S to μ. Unlike the time- , neither M nor P must necessarily be linear on distributions, and in the special case of optimization algorithms such as maximum likelihood inference, maximum a posteriori inference, and BMLA, M would be nonlinear due to its optimization of an objective function that is not quadratic in both p and p. In sections 2.3 and 3 we will specialize and then apply the theory to stochastic biochemical networks.

Basis functions, smoothing, test cases
To define the coarse-scale stochastic model in terms of the time-evolving Boltzmann distribution of equation (11), we hypothesize that even though any particular sample of the stochastic nonlinear system will in general undergo discontinous evolution, the probability distribution governing the ensemble of such samples are likely to evolve continuously in time (as does the master equation itself) even when projected down to distributions described by the statistical interaction parameters μ. We therefore further hypothesize continuous and deterministic ODE dynamics for the μ t ( )interaction parameters: we use bases that arose in elementary solvable examples (two-and four-state binding models mentioned at the end of this subsection and described in [13]): . Machine learning model selection algorithms (such as the 'lasso' algorithm of section 3.3 below) can be used to encourage sparsity in the use of bases i.e. in the matrix θ, which in turn favors good generalization performance. Many other forms for trainable ODE dynamics could be used in this 'system identification' subproblem, such as (nonlinear in θ) neural networks.
Like the Markovian equation (8), the deterministic equation (12) is differential in time so that the evolution of the coarse scale dynamical interaction parameters μ depends only on their state at the current time and not directly on their state at earlier times. However as remarked in [14], many consistent non-Markovian stochastic processes can be obtain from Markovian ones by integrating out extra degrees of freedom not otherwise needed. Similar phenomena obtain for differential equations. In GCCD it may be possible to do this by adding extra 'hidden' interaction parameters and/or extra random variables to the GCCD graph.
As a postprocessing step, the BMLA-learned trajectories of μ α t ( ) interaction parameters could be smoothed in time t by convolution with a Gaussian in t and then differentiated with respect to time to get μ t d d ; what we actually do is the mathematically equivalent operation of convolving with the analytic derivative of a Gaussian.
The solvable examples used to derive the basis functions in equation (13) were: (1) a two-state binding site model in which ligand binds to and unbinds from a site, and (2) a four-state, two-site cooperative binding model, both obeying detailed balance. The K-L divergence minimization algorithm derived in the appendix solved both of these problems correctly to high accuracy. Unfortunately, with these basis functions, the GCCD K-L divergence minimization algorithm exhibited numerical instability and failure to converge on the realistic CaMKII problem outlined below. It is possible that this problem could be solved by variations such as more extensive stacking (defined in the appendix), which would allow the use of more training data, or a different form for the ODEs such as different basis functions and/or ODEs nonlinear in θ.
In particular the ODE right-hand sides could take the mathematical form of trainable nonlinear neural networks. Multilayer neural networks with a final linear layer would generalize equation (12) to include trainable basis functions. In section 3 below we report on the results of a different strategy, which is to optimize approximation in the O = R or μ space (as in equation (1) or (7)) rather than the O = S space (as in equation (6)).

Previous work
If we multiply the appropriate master equation (equation (9) above) by monomials in key observables such as numbers of molecules of selected species in a chemical reaction network, and then sum over all states, we find ODEs for the time-evolution of various moments of the distribution over states. Unfortunately these equations do not close: the time derivative of lower-degree moments depends on the present value of higher-degree moments recursively, generating a countable infinity of coupled differential equations.
The goal of 'moment closure' methods [15][16][17][18][19][20][21][22][23][24][25][26][27] is to obtain explicit though approximate dynamics for some finite subset of the first-order moments of a collection of random variables s i under some dynamics of their joint probability distribution ⃗ p s t ( , ). Many approximate moment closure schemes have been developed starting from k = 1 mean field theory (systematicaly replacing ( ) of the firstorder moments as would be correct for independent distributions) from which one recovers ordinary deterministic chemical kinetics, and escalating to secondorder (k = 2) moment closures that consistently keep track only of means and variances (as would be correct for a Gaussian joint distribution) in chemical reaction networks [15,16], or in population biology reactiondiffusion models [17] explicitly, or by means of the Fokker-Planck equation [18] or stochastic differential equations such as the chemical Langevin equation [19] which may sometimes be further reduced [20].
A slightly higher order scheme is the Kirkwood superposition approximation that retains 〈 〉 s s i j p t ( ) (k = 2) but approximates triple correlations (k = 3) by a function of second-order correlations, is derivable from [22,22] a constrained maximum-entropy sense of approximation ≈, and has been used in multicellular stochastic modeling [23]. Fully dynamic higher order cutoffs to the moment hierarchy for > k 2 include setting higher cumulants to zero [24], dropping higher order terms from the Kramers-Moyal expansion [25] using moment closure functions that would be correct for log-normal rather than Gaussian distributions [26], and 'equation-free' moment closure [27] by sparingly invoking fine-scale simulations.
Each of these moment closure methods has the character, when compared to a sampling algorithm, of first exponentially expanding the space in which dynamics are formulated using the master equation, and then trying to cut the exponentially large space back down to size. Typical results start from a small reaction network with < n 10 molecular species (and thus chemical/biological degrees of freedom if there is a single well-stirred compartment), and produce a more efficient algorithm for determining low-order moment trajectories for a possibly reduced model of between about n 2 and n molecular species, yielding model size reductions on the order of a factor of 1 to 2. The initial combinatorial explosion is not fully mitigated. This is an unpromising route if the goal is to find a large model reduction beyond what one already had at the fine scale (though not an impossible one, due to the need to run sampling simulations many times). We are proposing a different strategy for moment closure which more naturally results in model reduction with fewer chemical/biological degrees of freedom. From the moment closure point of view, what we are proposing is an arbitrary-order method particularly suited to approximating the CME and possibly related master equations, by a timedependent variant of a Boltzmann distribution.
Additional model reductions for stochastic chemical kinetics, other than moment closure methods, include the classic strategy of using separation of time scales to eliminate fast degrees of freedom as carried out in e.g. the quasi-steady state approximation [28], in adiabatic course-graining [29], and with power law scaling of time with respect to an overall problem size parameter, differentially for different subsets of molecular species [30]. Another strategy for molecular species with small expected population sizes is the Finite State Projection method which proposes an adaptive truncation of the state space followed by analytic solution or bounding of the (exponentially big, were it not truncated) master equation [31]. Other reaction network model reduction methods are restricted to deterministic models [32,33] including a reduction from 42 to 29 molecular species [34]. The most comparable method in terms of problem size may be [35] which like GCCD applies to rule-based reaction networks. We will quantitatively compare degrees of model reduction of these methods to GCCD in section 3.4.
As in the case of moment closure methods, all of these methods have advantages and interesting ideas but none that we know of are as yet in the same class of radical model size reduction as GCCD, in terms of fraction of molecular species retained after reduction, in a stochastic biochemical network model.

Computational experiments
Molecular complexes in general, and signal transduction complexes in particular often have state space explosions that pose problems for simulation and modeling. One such macromolecular complex is Ca 2+ /calmodulin-dependent protein kinase II (CaM-KII) in synapses, which is activated when it binds to multiple calmodulin (CaM) molecules that in turn bind to multiple calcium ions. These processes trigger a cascade of reactions in a signal transduction pathway following Ca 2+ influx in response to activation of ligand-gated NMDARs, supporting memory formation and learning.
We applied GCCD to this system as modeled by [1] and as simulated by the Plenum implementation of dynamical grammars [2,4] and also in a much larger spatial simulation using MCell [3].

CaMKII system
The synaptic signal transduction pathway that starts with calcium ion (Ca 2+ ) influx through voltagedependent calcium channels and NMDARs leads ultimately to functional consequences including longterm potentiation (LTP), long-term depression, and spike-timing dependent plasticity underlying learning and memory formation in the hippocampus. The pathway as studied and modeled in [1] is structurally a bit involved. It begins with an externally imposed influx of calcium ion Ca 2+ . Calcium ions bind to the calmodulin protein (CaM), which has four calciumbinding sites, two at the N-terminal end and two at the C-terminal end. CaM in any state of calcium-loading can then bind to unphosphorylated CaMKII monomer (which has one phosphorylation site relevant for activation). However, the binding and unbinding rates for calcium ion to CaM depends on the state of the other binding sites of the CaM protein, and also on whether or not that CaM is bound to CaMKII. Likewise the binding and unbinding rates for CaM to unphosphorylated CaMKII monomer depend on the state of CaM. Two CaMKII monomers, each loaded with CaM in any calcium-binding state, but at most one of which is phosphorylated, may then dimerize (again with state-dependent rates). Dimers may phosphorylate one of their constituent subunits and promptly dissociate; autophosphorylated monomer CaMKII is taken to be the pathway output. Our goal is to produce a reduced stochastic model simplifying the structure of this fine-scale model as formulated in the MCell and Plenum models (model files in supplementary information) that aim to implement stochastic versions of the reaction network of [1].
Many subsequent stages of the biological pathway are thereby omitted, notably the formation of a CaM-KII holoenzyme comprising a ring of six dimers in dodecameric complex. This holoenzyme structure implies an even greater combinatorial explosion of states that poses a future modeling challenge, that may best be met by aggressive model reduction methods such as the GCCD method proposed here. Further downstream components of the pathway beyond CaM and the CaMKII holoenzyme are outlined in [8]. However we leave such explorations, which could aim to extract novel biological consequences from shows the graph-structure of the interaction terms μ α αγ V (hexagons) of degrees 1, 2, and 3 assumed between all the binary-valued random variables (colored labelled circles) described in the text. The left and right sides of the graph correspond to two CaMKII subunits that may dimerize (bottom variable), each of which may be phosphorylated (third row of variables) and/or bind a calmodulin (second row of variables) which in turn may bind calcium at its N-terminus or C-terminus and at site 1 or 2 at each end (first row).
Omitted from the figure are extra factors that ensure each binding site is occupied by either zero or one molecules and not more. This figure comprises a template for the labelled graph of all random variables and their interactions in this application of 'graphconstrained correlation dynamics'. combinatorially large stochastic reaction network models of the NMDA receptor pathway by applying the GCCD model reduction technique, to future research. Figure 2 shows part of the assumed Boltzmann distribution model interaction graph, or MRF, of binary-valued random variables (circles) and monomial interaction potentials of degree 1, 2, and 3 (hexagons). The first row of variables represent the binding of calcium to calmodulin (CaM). With subscripts these ±1-valued random variables are labelled 'CaM a i c n, , '. They are indexed by the C-terminus versus the N-terminus of the calmodulin protein (c or n), the numerical index of the binding site on that end (a = 1 or 2), and the numerical index i (i = 0 or 1 illustrated; ∈ i {0, 1, 2} used in Plenum simulations below) of a calmodulin molecule which may bind to a CaMKII subunit (indexed by j = 0 or 1 illustrated; ∈ … j {0, 8} used in Plenum simulations below). The second row of variables 'bound i j , ' records the state of binding of CaM to CaMKII subunits. The third row of variables 'phos j ', also written (in the notation of the MCell model) as 'Kkp j ', records the binary phosphorylation state of each CaMKII subunit, and the fourth row of variables ' ′ dimer jj ' records whether or not two such subunits dimerize. Not shown are additional cardinality-fixing or winner-take-all interactions enforcing the constraints that (if it occurs) binding is an exclusive relationship between CaM molecules and CaMKII subunits, and likewise for dimerization between two CaMKII subunits.

Boltzmann machine preprocessing step
Weight-sharing is an important strategy in machine learning, widely used to reduce the number of trainable parameters and thereby increase generalization power for a given amount of data. Our weightsharing scheme shares interaction parameters μ α within categories of monomial interactions that seem likely, according to testable human intuition, to have similar interaction strengths if trained separately on far more data. We now introduce some notation for the weight-sharing.
The resulting Boltzmann distribution can be sampled by standard Monte Carlo methods such as Metropolis-Hastings or Gibbs sampling. We use the 'Dependency Diagrams' software [13] (availability described in supplementary information) to do this. A crucial wrinkle on our use of such sampling algorithms is that we have two different biological conditions under which they get used: external calcium influx can be 'on' or 'off'. We train four different sets of coarse-scale dynamical model parameters θ as described in section 3.3 below, for four phases (early and late phases for calcium influx on and for calcium influx off), and for pulsatile or periodic calcium influx we cycle through the four trained (or partly trained) models, while preserving the interaction parameters μ through each instantaneous phase-switching event. Once trained, these four models predict dynamical responses to any other temporal pattern of calcium influx switching including the use of periodic switching with frequencies not in the training set.
A logical alternative to this procedure would be to use just two phases for calcium on and off, and to add the binary calcium-influx variable into the GCCD graph of figure 2 with suitable connections to allow the switch variable to join in modulating some or all of the potentials α V . However, we were not able to get this somewhat cleaner approach to work numerically. Just switching between two phases ( + Ca 2 influx on versus off) rather than four phases without modifying the GCCD graph also produced less robust behavior.
For the MCell simulations below the waveform for calcium influx was a pulse train or rectangular wave, with the 'on' duration being 16 ms. For the Plenum simulations we used the 'alpha model' theoretical calcium influx profile [37]. In

GCCD
The resulting time series (such as figures 3 and 4) are convolved with the analytic temporal derivative of a Gaussian filter in order to smoothly estimate the rates of change μ α t d d of the interaction parameters μ α t ( ).
Such temporal smoothing is useful since taking derivatives of a noisy time series tends to enhance the noise, and in the absence of smoothing that occurs for our time series [13]. The resulting smoothed time derivatives are fit to equation (12) using either (1) online minimization of the K-L divergence as outlined in appendix for which O = S, or (2) lasso-regularized linear regression [36] for which O = R, and which performs model selection on the bases of equation (13) as discussed there. Here we report numerical results for the second method, which also defines a meaning for the ≈ symbol in section 2.1 as the lasso-regularized  (L 1 -regularized) sum of squared differences for the time derivatives of the μ α t ( ) statistical interaction parameters (summed over α, discretized t, and over any set of initial conditions and/or other input conditions c such as calcium influx frequency). Thus, we optimize the model parameters θ αA by minimizing the score (Defining a sense of approximation ≈ as needed in section 2.1 ) which by equation (12) is equivalent to a form that is explicitly lasso-regularized least squares optimization of the trainable interaction parameters θ. The single scalar hyperparameter λ was set using leaveone-out cross validation, with each calcium influx spike held out in turn. 150 out of 253 model bases parameters were always rejected by the lasso algorithm, and the remaining 103 bases were used sparsely depending on the μ α derivative being fit, which of the four phases was being fit, and which BMLA experiment data set was being used. The resulting constrained time-evolution of ODEconstrained interaction parameters μ α t ( ) evaluated out-of-sample (i.e. using different data than was used for training the model parameters) was almost indistinguishable from the unconstrained values of optimal interaction parameters obtained by BMLA, as a function of time over several calcium influx cycles, as shown in figure 5.  [37] for calcium influx. Panels A, B and C visually separate pairs of (model, data) curves that would otherwise overlap, for ease of visual comparison. A high degree of overlap between (model, data) curve pairs is evident. Quantitative comparison of each (model, data) pair is given in table 1. Simulation is out-of-sample in that the calcium influx bursts occur with frequency 8 Hz, though GCCD was trained at frequencies 2, 4, and 10 Hz. Numerical errors in figure 5 are shown in table 1, computed as root mean squared (rms) error and also as normalized rms error, in which the normalization is done by finding the average of the absolute value of each target BMLA time course, and dividing both the target and corresponding prediction time course by this average absolute value before computing the rms error as usual.
The results show good out-of-sample quantitative agreement for all eight time series.

Discussion of results
In quantitative terms, the degree of model reduction obtained by GCCD in the CaMKII example is large. Eight dynamical variables (the interaction parameters μ) suffice to predict key outputs such as the global degree of CaMKII phosphorylation, each of which can be computed from expectations of random variables s in the Boltzmann distribution of equation (11). But the fine-scale set of dynamical variables is much larger.
In the MCell simulations we may conservatively count it as the integer-valued populations of the following classes of molecular species: free Ca 2+ , free CaM ( × + = 3 3 1 10 species), monomeric CaMKII subunit which can bind CaM in any of its states and can also be phosphorylated ( × × = 3 3 2 18species), and dimerize if at most one subunit is phosphorylated ( × + × = 9 9 9 10 2 126 species; phosphorylated dimers dissociate before they can doubly phosphorylate) for a total of 155 species each of which has a dynamical random variable, and therefore a total reduction from 155 to just eight dynamical variables, which is very large.
In one sense this reduction in the number of dynamical variables strongly understates the situation, because in the actual MCell simulation (though not in the Plenum simulations), every individual chemical species has its own independent three-dimensional position which is also a dynamical random variable. Given that a typical molecular population excluding Ca 2+ is 145 CaM + 385 CaMK = 530, and including each 24-unit Ca 2+ pulse is still greater, and that the number of position degrees of freedom is three times greater (1590 or more), the reduction to just eight dynamical variables μ 1 through μ 8 as listed in figures 3-5 is even more remarkable.
Comparable figures for the (deliberately non-spatial and well-mixed) Plenum simulations are again 155 molecular species reduced down to eight. An intermediate step is the formulation of the graph in figure 2 which has just eight interaction parameters (associated with the hexagonal interaction nodes) due to weight sharing, but it has many more molecular binding variables (circular nodes in figure 2). For the Plenum model we can count these variables as follows: due to the smaller simulated volume than for the MCell model, the index i for CaM and the index j for CaMKII run from 0 to 2 and 0 to 8 respectively. If we replicate the circular nodes in the graph of figure 2 accordingly, there are × 4 3, × 3 9, 9, and × 9 10 2 variables respectively in rows 1-4 of the full version of the graph illustrated in figure 5, for a total of 93 finescale binary-valued variables in a highly structured pattern.
As stated in section 2.4, most current results for stochastic model reduction start from a small reaction network with handful of chemical species, and produce a more efficient (sometimes much more efficient) modeling algorithm for a possibly reduced model with model size reductions on the order of a factor of 1 to 2. The authors of [35], a stochastic and rule-based model reduction method as is GCCD, achieve a larger model reduction in an EGF signal transduction model from 2768 to 609 molecular species, for a 4.5 × reduction factor or a 0.81-power relation of reduced to full model ( ≃ 609 2768 0.809 ). We have demonstrated for GCCD at least 155 to 8 for a 19.5 × reduction factor, or a 0.41-power relation ( ≃ 8 155 0.412 ). This factor of almost 20 breaks decisively out of the pattern of model size reductions on the order of a factor of just 1 to 2 in number of chemical or biological degrees of freedom. Exploring tradeoffs between accuracy of approximation and amount of model size reduction (whether measured in direct ratios, or powers, of number of degrees of freedom before and after reduction) for increasingly large models would therefore seem to be an attractive topic for future work.
To what features of GCCD or the present problem do we owe these very substantial reductions in model size? Future work could disentangle the following seemingly relevant factors: (1) GCCD can be applied to rule-based models, which are highly structured in that (as shown in the Plenum model code in the supplementary information text) a small number of rules can combinatorially code for a much larger molecular species-level reaction network. Thus an underlying simplicity is available. (2) The use of weight-sharing may make it possible to exploit such underlying simplicity in the problem domain. (3) The machine learning components of GCCD (BMLA and the new procedures for determining GCCD model parameters θ) generalize directly from specific simulation data rather than algebraically and in full generality from the mathematical form of the finescale model. So currently available computer power is used effectively. (4) The Boltzmann distribution is a natural form for prolongation from coarse to fine models since by constrained entropy maximization it doesn't add any more information than is given by constraints on whatever moments were chosen for use in the GCCD graph structure. (5) The graph structure of GCCD is a good mechanism for importing biological knowledge and expertise into the model reduction process.

Conclusion
We propose a nonlinear model reduction method particularly suited to approximating the CME for stochastic chemical reaction networks (including highly structured ones resulting from 'parameterized' or 'rule-based' reactions), by a time-dependent variant of a Boltzmann distribution. The resulting GCCD method can be an accurate nonlinear model reduction method for stochastic molecular reaction networks involving a combinatorial explosion of states, such as the CaMKII signal transduction complex that is essential to synaptic function, particularly in neuronal learning processes such as LTP. The GCCD method could be further developed in many directions, including application to model-reduction for the master equation semantics of more challenging molecular complexes such as the CaMKII dodecamer, use of the resulting reduced models to extract novel biologically relevant predictions, and generalizing the method to yet more general reaction-like biological modeling formalisms capable of expressing multiscale models in developmental biology [4].
Unless otherwise noted, all expectations 〈…〉 are with respect to the distribution p.
Plugging this result back into equation (19), and using μ ∂ ∂ = α p x t log ( ( ; )) 0, along with the evaluation of If we separate the second integral, we see that the two halves cancel: We can now absorb the leading minus sign, leaving Now we wish to find the derivative of equation (22) with respect to time. In order to accomplish this, we first name pieces of equation (22). Let Then we can write the desired derivative as Concentrating on the first two of these terms, ) and, using the chain rule, ∂ ∂ Shifting the integral inside the sum, we recognize the expression for the covariance between functions α V and , so   Note that the covariance between α V and β V has appeared again, and that it is being subtracted from terms which have the same form as the terms inside a covariance. That is, integrating over the first part of equation (25) would again produce the covariance between α V and β V , times the derivative of μ. Terms such as recur regularly throughout this derivation. Therefore, we define a new notation. Let Δ ≡ − 〈 〉 X X X . Then we can rewrite the covariance in equation (24)  Additionally, equation (25) fits the same pattern, with So, we may rewrite this as .
Plugging this in and copying the definitions for B, C, and D, we have In the second of these terms, the factor μ Z t log ( ( )) is not a function of x. As a result, when the integral is evaluated, the resulting expected values all cancel. So this term is zero, leaving Once again, we break apart equation (26) and consider the parts individually. Beginning with the first integral, we regroup the terms of the multiplication so that there is just one double-sum over cliques, then move terms which do not depend on x outside of the integral, and integrate. This leaves another expected value: This expression belies some of the symmetry of this term. Note that, if = γ X V and Δ Δ = α β ( ) Y V V , the inner most term is Δ 〈 〉 X Y . From the definitions of Δ and expectation, this is equivalent to 〈 〉 − 〈 〉〈 〉 XY X Y , which once again is X Y Cov( , ). Of course, covariance relationships are symmetric, so this is also Y X Cov( , ), which from the preceding argument is Δ 〈 〉 Y X . Thus . (28) Applying this transformation to equation (27) The final piece of equation (23) is Now that we have analyzed each of the parts of equation (23), we can set it equal to zero and move the terms with a sum over cliques across the equals sign. Then we have as a solution