Backward simulation for inferring hidden biomolecular kinetic profiles

Summary Our backward simulation (BS) is an approach to infer the dynamics of individual components in ordinary differential equation (ODE) models, given the information on relatively downstream components or their sums. Here, we demonstrate the use of BS to infer protein synthesis rates with a given profile of protein concentrations over time in a circadian system. This protocol can also be applied to a wide range of problems with undetermined dynamics at the upstream levels. For complete details on the use and execution of this protocol, please refer to Lim et al. (2021).


SUMMARY
Our backward simulation (BS) is an approach to infer the dynamics of individual components in ordinary differential equation (ODE) models, given the information on relatively downstream components or their sums. Here, we demonstrate the use of BS to infer protein synthesis rates with a given profile of protein concentrations over time in a circadian system. This protocol can also be applied to a wide range of problems with undetermined dynamics at the upstream levels. For complete details on the use and execution of this protocol, please refer to Lim et al. (2021).

BEFORE YOU BEGIN Check whether the backward simulation (BS) is desirable
Timing: 10 min We have developed the backward simulation (BS) method to discover the ''internal'' dynamics of the system described by ordinary differential equations (ODEs) with pre-selected model structure and parameter values. When the temporal profiles of relatively downstream components or their sums (but not the profiles of upstream components) are known, BS retrieves the upstream profiles that exactly reproduce the known downstream profiles, through the straightforward ODE calculation with the relevant downstream variables. On the other hand, an existing practice is to assume the plausible forms of these unknown upstream profiles with additional free parameters, and then estimate these parameters to best fit the observed downstream profiles. However, in contrast to the BS, the latter method incurs the computational costs for that parameter estimation and may not even necessarily reproduce the correct downstream profiles.
As a prototype application of this method, we here elaborate the case of the circadian protein degradation model in Lim et al. (2021). Specifically, in contrast to a common practice of simulating the time-course profile of a circadian protein concentration by its upstream elements such as the rhythmic synthesis rate of the protein over time, we intended to maintain the total protein concentration profile at the downstream side as it was and simulate the corresponding rate of the protein synthesis and other upstream kinetic processes in given parameter conditions (Lim et al., 2021). There were three main reasons for this simulation: (i) unlike the mRNA profile, the potentially time-of-day-specific translation rate per mRNA is not commonly known and hence it is difficult to determine the profile of the protein synthesis rate directly from the existing experimental data. In contrast, the experimental profile of the total protein levels is readily available for the incorporation to model simulations. In Lim et al., 2021, given the experimental protein profiles, we performed the BS over randomly-sampled parameter values, and identified the parameter sets for simulation results in quantitative agreement with the empirically observed, rhythmic degradation rates of the proteins. (ii) Another reason was that we wanted to dissect the underpinning mechanism of the rhythmic degradation rate of a circadian protein by rigorously controlling for the effect of the protein profile over a range of parameter values, while avoiding the confounding effect from the changes in the protein profile itself caused by the conventional simulation with given profiles of upstream elements (''forward simulation''). (iii) Lastly, we considered an evolutionary viewpoint that the protein profile can be of a more fundamental position than an mRNA or translation-rate profile so that the protein synthesis rate may have adapted to the protein profile-the protein profile is more likely to influence a biological phenotype than other elements in the system such as mRNA profile. Overall, we foresee a variety of applications of the BS, including the cases with data unavailability at upstream sides, the mechanistic studies with strictly-controlled downstream profiles, and the evolutionary modeling with fixed downstream profiles.

Install python and python packages
Timing: 10 min 1. Download Python 3.7.4 or a higher version from https://www.python.org. The Python version can be checked by the following command: 2. To solve ODE models, the scipy python module is needed. Install scipy package:

STEP-BY-STEP METHOD DETAILS BS in general cases
Timing: 1 h for step 1 and 1 h for step 2 In this section, we will describe how to apply the BS to a general system of ODEs where the downstream profile is given. Application of BS to a circadian protein degradation model Timing: 1 h for step 3, 1 h for step 4, 1 h for step 5, and 2 h for step 6 In this section, we will describe how to apply the BS to a circadian protein degradation model in Lim et al. (2021).
Step 3 and 4 show how to modify the set of coupled ODEs of a circadian protein degradation model for the BS. How to write the codes for the BS and how to examine the ODE system are included in step 5 and 6, respectively.
3. Formulate the dynamics as a set of coupled ODEs. a. In our circadian protein degradation model in Lim et al. (2021), the time derivatives of the concentrations of several forms of a circadian protein are described as follows ( Figure Table 2. Substrate proteins (rounded rectangles, sky blue) are synthesized from mRNA molecules (blue line, top left) in the ribosome (brown, top left) and ubiquitinated by E3 ubiquitin ligases (orange ovals) with ubiquitins (yellow circles). The ubiquitinated proteins are degraded (gray ovals) or deubiquitinated by deubiquitinating enzymes (light green hexagon). The total protein concentration is represented by x(t) with symbol S at the center.

(Equation 12)
This result makes sense because g(t) represents the ultimate source of protein production and the variables with a common coefficient r 0 are the concentrations of ubiquitinated proteins destined for degradation. c. Let x(t) denote the total protein concentration: Additionally, because the protein synthesis rate cannot be negative, the following inequality should be satisfied: gðtÞR0: (Equation 17) 4. Transform the ODEs to utilize the experimentally available information on relatively downstream variables in the ODE model. a. In the example of our model, the experimentally available data were the total concentration x(t) of a protein rather than those of the individual sub-forms of the protein or the protein synthesis rate at the upstream level. Although the experimental data usually represent the relative, but not absolute, molecular concentrations, the use of the relative concentration for x(t) in our model only changes the ''unit'' of concentrations without loss of generality. On the other hand, if multiple datasets of their own relative levels are incorporated into the model variables of the same dimensional quantities, the appropriate proportionality coefficients or conversion factors should be introduced for the sake of a unified scale. b. Plug the following relation (from Equation 13) in Equations 5 and 6: x 0 ðtÞ = xðtÞ À ½x E;0 ðtÞ + x E;ub ðtÞ + x 0;ub ðtÞ + x H;ub ðtÞ: ( variables are simulated using the profile of g(t); in our BS, g(t) and other variables are reversely simulated using the profile of x(t), through Equations 6,7,8,9,10,11,18,and 19. 5. Write the codes for simulation. a. Import Python modules (scipy). b. Read the time-course data of the total protein abundance. If the data do not span more than a single circadian period, expand them by the repetition of the data points to multiple circadian periods. This step is necessary if one wants to simulate the long-term asymptotic behavior of our circadian model, as will be explained later. c. Interpolate the data points of the total protein abundance. This step is necessary for the construction of a continuous trajectory of x(t) for our model simulation, given the discrete nature of time points with available data. i. Import interp1d module from scipy.interpolate.
ii. Implement a cubic interpolation by setting the option ''kind = 'cubic'''. The result is an interpolated curve of x(t), as exemplified by Figure 2A. iii. Build a Python function that uses time, an array of concentrations, an array of kinetic parameters, an interpolated curve of x(t) as input, and returns an array of the model variables and their time derivatives in Equations 6,7,8,9,10,11,18, and 19 as output. d. Solve the ODEs with scipy.solve_ivp module. (B and C) Simulated g(t) (B) and r(t) (C) with different initial conditions. Given the protein profile in (A), the simulated g(t) or r(t) rapidly converges at the same trajectory, regardless of its initial conditions. (D) The example output of the code ''check_physical_constraints.py'' to check the feasibility of the BS solution. If the solution does not satisfy its fundamental conditions, the code returns the message in (D). (E and F) Simulated profiles of protein sub-states: x 0 (t) (purple), x E,0 (t) (yellow), x E,ub (t) (gray), x 0,ub (t) (blue), and x H,ub (t) (red). When v 0 is set to zero, x H,ub (t) becomes zero (E). When a 0 , b 0 , u 0 , and v 0 are set as relatively high and a 2 and b 1 are set as relatively low, x 0,ub (t) becomes almost zero (F).

OPEN ACCESS
i. Import solve_ivp module from scipy.integrate. ii. Set the parameters ''fun'', ''t_span'', ''t_eval'', and ''y0''. Here, ''fun'' is a Python function of which input is 't' (time) and 'y' (an array of the values of variables) and output is an array of the time derivatives of the variables. ''t_span'' is a tuple containing two time points that indicate the beginning and end of the simulation. ''t_eval'' is an array of time points at which to store the computed solutions. ''y0'' is an array of the initial states of the variables. iii. When solving the system of ODEs numerically, we must take the following into consideration:(1) Selection of time step: When selecting the maximum time step for the simulation, it should be much smaller than the time scale of dynamics. For the circadian proteins, the time scale is $24 h. Therefore, it is safe to select the maximum time step much smaller than an hour. Moreover, checking whether the solution does not noticeably change with smaller time steps will assure the selection of an adequate time step. In the scipy.solve_ivp module, changing the ''max_step'' option would modify the maximum time step to solve the ODEs.
(2) Stiffness of the problem: If the ODEs contain stiff terms, the Runge-Kutta method is not recommended. In this case, ''LSODA'', ''BDF'', or ''Radau'' methods are recommended. The method to solve the ODEs can be changed by setting the ''method'' option in the scipy.solve_ivp module.
(3) Numerical error tolerance: When solving the system of ODEs, estimated errors can be controlled. ''atol'' and ''rtol'' options in the scipy.solve_ivp module can be used to control absolute and relative errors respectively.
6. Test the behaviors of the ODE model. q Ubiquitination rate of a protein binding to a ubiquitin ligase.
s Deubiquitination rate of a protein binding to a deubiquitinating enzyme, lumped with its subsequent dissociation from the deubiquitinating enzyme.
r 0 Degradation rate of a ubiquitinated protein.
u Total ubiquitin ligase concentration.
v Total deubiquitinating enzyme concentration. Time.
x 0 (t) Concentration of a free protein without ubiquitination.
x E,0 (t) Concentration of a not-ubiquitinated protein that is binding to a ubiquitin ligase.
x E,ub (t) Concentration of a ubiquitinated protein that is binding to a ubiquitin ligase.
x 0,ub (t) Concentration of a ubiquitinated protein that is not binding to a ubiquitin ligase.
x H,ub (t) Concentration of a ubiquitinated protein that is binding to a deubiquitinating enzyme.

g(t)
Protein synthesis rate.

u(t)
Concentration of a free ubiquitin ligase.

v(t)
Concentration of a free deubiquitinating enzyme. a. Check whether the model output is not too sensitive to the initial conditions, given the profile of the observable (e.g., x(t) in our case) and the parameter values. This step is to determine whether the simulation outcome is essentially unique or not, regardless of particular initial conditions. One method is to randomly sample the initial conditions of each variable within a physiologically-relevant range and then check whether the simulation outcomes converge at similar trajectories. In the case of our model, the code in the following command allows the test of this initial condition dependency: The execution of the code gives the graphs of each variable as the functions of time with different initial conditions. Figures 2B and 2C demonstrate that g(t) and r(t) in our model converge well respectively, regardless of their initial conditions when x(t) (Figure 2A) and parameter values are assigned for the simulation. b. Determine the minimum simulation length to approach the asymptotic solutions of the ODE model. By running the simulation for a long enough time, i.e., setting the large values for 't_span' and 't_eval' options, check when the ODE solutions exhibit saturated and sustained oscillations in our case. If the simulation does not result in the stable oscillations, try longer simulation lengths. Identify the minimum simulation length to ensure the stable oscillations. The code in the following command allows this test: Executing the code gives the graphs of each variable over a long time. Through these graphs, one can determine the minimum simulation length. c. Check whether the ODE solutions satisfy all the physical and biological constraints and thus can be considered as feasible solutions. Some ODE solutions may not satisfy these constraints, particularly in the case of BS. The reason is that BS does not run in a natural causal direction from upstream to downstream levels, but traces back the upstream states without the prospect of their compatibility with the parameter values. In other words, only the parameter values with the feasible solutions of BS are compatible with the downstream observable, and hence BS can identify those sensible parameter values. In the example of our model, the solutions should satisfy all the constraints in Equations 16 and 17. The code in the following command allows this feasibility test of the model solutions: The code verifies whether the constraints in Equations 16 and 17 are satisfied or not (Figure 2D). d. Debug the code of the ODE model. By simulating different parameter values, one can check the potential bugs in the code. In the example of our model, if v is set to 0, x H,ub (t) should be zero. In Equation 8, if a 0 u(t) and b 0 v(t) are much higher than a 2 and b 1 by setting a 0 , b 0 , u, and v as relatively high and setting a 2 and b 1 as relatively low, then x 0,ub (t) should become very small ( Figures 2E and 2F). The code in the following command allows these bug tests: Executing this code gives the parameter conditions and the graphs of the relevant simulation results. i. How to use ''multiprocessing'' module: construct the wrapping module that only takes ''pa-rameter_sets''. The ''parameter_sets'' is a list of N parameter sets, and the wrapper is a wrapping module that takes a single parameter set as input. Parallel computing with a number of CPU cores can be implemented using these modules.

Sampling of parameter values
ii. How to use ''mpi4py'' module: ''mpi4py'' module allows the model simulation with each node. An example is shown below.
Here, N is the target number of the simulations in one node.
Note: In both the cases i and ii above, be careful when writing a file. If two different nodes access the same file, this file may not be readable at the end.

EXPECTED OUTCOMES
The BS method allows us to identify the valid parameter sets for experimentally available downstream data and to inspect the internal dynamics of the system with the fixed profiles of particular components. The latter would be useful for rigorous mechanistic inspection of a dynamical system. For example, we controlled for the protein profile x(t) in Lim et al. (2021) and found that the degradation rate r(t) tends to be more rhythmic with a lower level of a ubiquitin ligase u. In addition, we identified a definite lower bound of u for the establishment of a given profile of x(t) itself, along with other interesting phenomena. Without the help of the BS, these clear conclusions may not be drawn, because in the conventional simulation with a fixed profile of the protein synthesis rate g(t), the change of u modifies the oscillatory form of x(t) itself and thus does not clearly separate the effect of u from that of the x(t) profile on the generation of rhythmic r(t) (Lim et al., 2021).

TROUBLESHOOTING Problem 1
When solving an ODE model with the scipy.solve_ivp module, the computation time may sometimes be very long (step 6 of section ''application of BS to a circadian protein degradation model'').

Potential solution
When the interpolated curve of the experimental profile is smooth enough, the option ''method = ''RK45'''' and ''method = ''LSODA'''' will not take much different computational times. However, if the interpolated curve is not smooth enough, ''method = ''RK45'''' will take longer computational time. Therefore, in this case, we recommend the use of ''method = ''LSODA'''' for shorter computational time.

Problem 2
If the profile of x(t) in our model is too noisy, most BS results will not give the feasible solutions of the upstream states. These noisy patterns are likely to come from very high temporal resolution of the experimental data. For example, the PER2 profile for our BS in Lim et al. (2021) was obtained from the data of Zhou et al. (2015) with 6-min resolution. (step 2.a of section ''BS in general cases'' and step 4.a of section ''application of BS to a circadian protein degradation model'').

Potential solution
Time window averaging or other denoising techniques can be applied to the noisy profile. However, be cautious of the possibility that such smoothening may distort the original patterns in the profile.

Problem 3
The ODE solutions are not sometimes accurate enough (step 5.d.iii and step 6 of section ''application of BS to a circadian protein degradation model'').

Potential solution
There is an option in scipy.solve_ivp to control the error-tolerance level in numerical integration of ODEs. By adjusting the options 'atol' and 'rtol', one can manage the absolute and relative levels of the tolerance to the numerical errors, respectively. However, too small 'atol' and 'rtol' values can considerably slow down the computation.

Problem 4
When simulating excessively many parameter sets, it is difficult to manually determine whether ODE solutions have reached their attractors or not within the simulation time (step 7 of section ''sampling of parameter values'').

Potential solution
Some parameter sets may take long computation time towards the asymptotic states of the model outcome. In the case of our model, if the peak values of the oscillating variables decrease or increase ll OPEN ACCESS with more than some fold change in the past two circadian periods, these variables may not be considered to reach the stable oscillatory states at that time.
Problem 5 When solving the system of ODEs with the scipy.solve_ivp module, the solution includes time points assigned in the ''t_eval'' argument. However, a continuous time series of the solution might be needed in some cases (step 5.d of section ''application of BS to a circadian protein degradation model'').

Potential solution
If an option in the solve_ivp module, ''dense_output'' is set to ''True'', it will return a class instance for the ODE solution at a given time point. However, this option might increase the computation time for the solution especially when the solution involves a long time series. Alternatively, the use of small time steps for the ''t_eval'' option and the interpolation of the solution over the last one or two time periods only would save the computation time.

RESOURCE AVAILABILITY
Lead contact Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Pan-Jun Kim (panjunkim@hkbu.edu.hk).

Materials availability
This study did not generate any unique reagents.

Data and code availability
This study did not generate new experimental data.
Source codes for our model simulation have been deposited to public repository GitHub, and the link is provided in the key resources table.