Identification of Gene Regulation Models from Single-Cell Data

In quantitative analyses of biological processes, one may use many different scales of models (e.g., spatial or non-spatial, deterministic or stochastic, time-varying or at steady-state) or many different approaches to match models to experimental data (e.g., model fitting or parameter uncertainty/sloppiness quantification with different experiment designs). These different analyses can lead to surprisingly different results, even when applied to the same data and the same model. We use a simplified gene regulation model to illustrate many of these concerns, especially for ODE analyses of deterministic processes, chemical master equation and finite state projection analyses of heterogeneous processes, and stochastic simulations. For each analysis, we employ Matlab and Python software to consider a time-dependent input signal (e.g., a kinase nuclear translocation) and several model hypotheses, along with simulated single-cell data. We illustrate different approaches (e.g., deterministic and stochastic) to identify the mechanisms and parameters of the same model from the same simulated data. For each approach, we explore how uncertainty in parameter space varies with respect to the chosen analysis approach or specific experiment design. We conclude with a discussion of how our simulated results relate to the integration of experimental and computational investigations to explore signal-activated gene expression models in yeast [1] and human cells [2]‡. PACS numbers: 87.10.+e, 87.15.Aa, 05.10.Gg, 05.40.Ca,02.50.-r Submitted to: Phys. Biol.


Introduction
Recent years have led to rapid advances in the capabilities to measure gene regulatory phenomena at the level of single-cells and in fluctuating conditions. These techniques include image-based strategies, where fluorescence markers are used to highlight specific DNA, RNA, or protein molecules of interest [4,5]; cytometry-based approaches, where cellular properties, especially size or fluorescence, are rapidly measured one cell at a time [6]; mass-spectrometry based approaches, where cellular proteins are digested into peptide fragments whose distinct mass or charge spectra can be analyzed quantitatively [7]; and sequencing approaches, where the abundances of specific RNA or DNA molecules are quantified on a single-cell basis [8]. In this article, we focus on the quantitative analysis of one specific technique, known as single-molecule Fluorescence in situ Hybridization (smFISH) [5,9], which provides precise and reproducible measurements of the number and locations of individual RNA molecules within single cells. In the technique of smFISH, researchers tile individual molecules of endogenous RNA with many complementary DNA oligomers. Each RNA-binding oligomer contains one or more fluorescent dyes, such that the entire multi-labeled RNA molecule appears as a bright diffraction limited spot under the fluorescence microscope. Counting the integer numbers of these spots for many single cells in large populations at many times and in the same conditions, one can capture the temporal distributions of gene regulation activity for those conditions. By using microfluidics or by adding or removing stimuli, one can observe how single-cell statistics are affected by transitions from one environmental condition to another [1,10,11].
Utilizing experiments such as smFISH, many researchers have quantified timevarying and bursty gene expression at the single-cell level, and many models have been developed to capture these dynamics (e.g., see recent reviews at [12,13,14]). In this article, we define a small class of single-cell models which have previously been shown to capture and predict smFISH data [1,10,11], and we employ several computational analyses in attempts to identify these models and their parameters from various sets of simulated data.

A Standard Computational Toolbox for Quantitative Biology
In the following paragraphs, we introduce a handful of computational tools that are frequently used in the analysis and fitting of quantitative gene regulation models.
Ordinary Differential Equations. One of the most common approaches to describe the dynamic behavior of gene regulation is to use coupled sets of Ordinary Differential Equations (ODEs). For example, the ODE related to mRNA transcription is a mathematical equation which describes the rate of change in the number of mRNA over time, taking into consideration both production and degradation of mRNA molecules. For most biological processes of interest, analytical solutions to the corresponding ODEs may not be possible and numerical techniques may be necessary. In this article, we will deal with reaction rates that change over time, for which numerical integration is required.
Parameter Fitting and the Metropolis-Hastings Algorithm. A common task in quantitative biology is to fit parameters of a model to match experimental data. Due to the stochastic nature of biological systems, in addition to the fact that experimental data is noisy and limited in quantity, exact estimation of parameters is impossible. Having a quantitative understanding of parameter uncertainties in light of existing data is crucial for predictive modeling. To accomplish this goal, Markov Chain Monte Carlo (MCMC) analyses, such as the Metropolis-Hastings (MH) algorithm [15], are frequently used to provide unbiased samples of parameter space and to estimate parameter uncertainties. In particular, the MH algorithm samples parameter space by adding random perturbations to parameters and then calculating the likelihood of the data, given the new parameter set. The new parameter set is then discarded unless the likelihood exceeds an appropriately chosen (albeit) random acceptance threshold. Like any parameter fitting routine, the MH algorithm has several tunable parameters. The MH chain is the sequence of parameter sets that have been accepted during the random walk through parameter space. The proposal distribution is a probabilistic rule by which a new parameter set is generated from the old. The burn-in period is the number of MH samples that are ignored to allow the MH chain to relax away from the (potentially poor) initial parameter guess. The chain length is the total number of MH samples that are recorded following the burn-in period. The thin rate is the number of accepted samples that are ignored between those that are recorded (i.e., every 10 th sample). See [3] for a more in depth discussion of Monte Carlo analyses (Chapters 10 and 16), parameter estimation (Chapter 13), and sensitivity analyses (Chapter 14) as applied to models in quantitative biology.
The Stochastic Simulation Algorithm. Unlike most dynamical problems in science and engineering, gene regulation is inherently subject to discrete stochastic behavior arising from the single-molecule nature of genes within individual cells. The standard approach to analyze such discrete stochastic processes is to apply the kinetic Monte Carlo algorithm known as (Gillespie's) Stochastic Simulation Algorithm (SSA) [16,17], which uses a combination of stochastic reaction rates (i.e., propensity functions) and reaction rules (i.e., stoichiometry vectors) to simulate the stochastic trajectories. This algorithm updates the state of the system after each random reaction and assumes exponentially-distributed waiting times between every event. In this article, we apply the SSA to simulate many possible stochastic trajectories for the switching of gene activity states coupled with mRNA production and degradation.
The Finite State Projection Method. In addition to using the SSA to generate stochastic trajectories for the gene regulation processes, we also explore direct computations of the underlying probability distributions for these processes. These distributions evolve according to the infinite dimensional linear equation known as the Chemical Master Equation (CME) [18,19]. Because the CME has no exact solution for most models, we employ the finite state projection (FSP) method to approximate the CME solution and to estimate the full joint probability distributions for gene switching behavior and transcript accumulation [20]. Several recent reviews of the FSP approach and its applications can be found at [14,21,22,23]. A key advantage of the CME and FSP formulation is that they provide systematic approaches to compute and compare the likelihood of data arising from multiple models [1,10,21]. With this in mind, we use the Metropolis-Hastings algorithm in conjunction with the FSP analysis to attain more precise constraints on the parameters when fit to simulated data [11].
Experiment Design. The successful identification of a model from data requires not just good data and appropriate modeling tools, but the experiment itself also needs to be conducted using a well-designed and informative condition [23,24,25,26,27,28,29]. With this in mind, we will explore the importance of experiment design, and we will show that different experiments (e.g., different input signals and different measurement times) and different analyses (e.g., ODEs or the FSP) yield different results for the constraint of system parameters.
In the following section, we introduce a specific gene regulation model and its underlying assumptions. Then, in Section 3, we discuss the simulated data from which we seek to identify our model parameters. In Section 4, we demonstrate the implementation of the above tools to this model and different sets of simulated data. Finally, in Section 5, we provide a brief summary of our results and conclusions, and we introduce a user-friendly Python-based toolbox with a graphical-user-interface to allow readers to reproduce all of the described analyses. All analyses discussed in this article have been implemented in Matlab and Python codes, and the codes are available for download at: https://github.com/MunskyGroup/WeberPB

Gene Regulation Model Description
We begin by specifying a simple semi-mechanistic model for gene regulation. Figure 1 illustrates this model, which is a generalization of the standard bursting gene expression model discussed in [12,13,30]. The model consists of a single gene with three possible states {S 1 , S 2 , and S 3 }, each of which represents a different configuration of transcription factor binding or chromatin conformation. Similar three-state models have been used previously in the context of time-varying inputs to capture and predict signal-activated single-cell gene expression in yeast [1,11] and human [10] cells. The model assumptions are as described with additional detail in Subsection 2.1.

Assumptions of the Three-State Bursting Gene Expression Model
First, we define the conformational states {S 1 , S 2 , and S 3 } of the gene and the production and degradation of mRNA. In this model, S 1 is a transcriptionally inactive ('OFF') state, whereas S 2 and S 3 are transcriptionally active ('ON') states. The transcription rates for S1 S2 S3 k 12 k 23 k 32 k 21 k r2 k r3 ; Figure 1. Schematic of a three-state gene regulation model. State 1 is transcriptionally inactive, whereas states 2 and 3 are transcriptionally active with transcription rates k r2 and k r3 , respectively. Parameters k ij denote the (potentially time-varying) transition rates between states S i and S j . S 2 and S 3 are k r2 and k r3 , respectively. Degradation of mRNA species, R, is assumed to be a first order process with rate γ. Next, we define the transition dynamics that allow the gene to switch among these states. In the three-state model, there are four state transition events: S i → S j , each with a corresponding transition rate of k ij (t). Three of the four transition rates are assumed to be constant with respect to time and the activating kinase: k ij (t) = k ij (0). The gene is activated by a time-varying kinase, Y i (t), which acts as an input to the system. In practice, such signals can be controlled experimentally through modulation of external stimuli [1,10,11,31,32,33]. For illustrative purposes, we initially assume a sinusoidal input signal given by: where the input signal (in arbitrary units of concentration) begins at time t = 5 min and ends at time t = 70 min. We assume exactly one of the four transition rates is linearly dependent upon the activating kinase as follows [1]: For the combination i < j (i.e., when the reaction leads to a state with a higher index), the value of k ij (0) is negative and β is positive. Conversely, when i > j (i.e., when the reaction leads to a state with a lower index), the value of k ij (0) is positive and β is negative. This choice gives rise to a thresholded activation response in which k ij (t) is zero until the kinase exceeds or drops below a specific level. This definition leads to activation, either through direct activation when i < j or indirect loss of repression when i > j.
One of our goals in this study is to determine if it is possible to identify the mechanism of action, i.e., to determine which k ij depends upon the time-varying kinase signal. We define the model number, N m , as the integer that specifies which mechanism depends upon the activating kinase. The model numbers and the corresponding inputdependent transition rates are as follows: Now that we have specified the assumptions needed to define the gene states, mRNA levels, and the associated dynamics, we can write a set of four ODEs to describe this model: These equations are easily integrated in Matlab using the stiff ODE solver, ode15s.
Finally, for parameter estimation, we assume that all rates have an independent prior given by a lognormal distribution with a logarithmic mean of E{log 10 (λ)} = 0 and a wide logarithmic variance of E{(log 10 (λ)) 2 } = 4. Having defined the reactions and priors on all parameters, in the next section we will generate some simulated data and attempt to fit our model, after which we will explore uncertainty in parameter space, and make model predictions in light of these datasets.

Single-cell Data Collection
The technique of smFISH [5,9] makes it possible to measure the number of mRNA molecules within hundreds or thousands of individual cells at specific instances in time during a dynamic response from one condition to another. For example, in [1,11], the authors measured and found a quantitative FSP-based model to predict the statistics of STL1, CTT1, and HSP12 mRNA expression in yeast cells every two to five minutes following osmotic shock and temporal activation of the HOG1p kinase. Similarly, in [10], the authors measured time-varying p-ERK kinase and its effect on c-Fos expression in human-derived U20S cells.
To generate simulated data of this form, we assume a "true" model, which corresponds to Model 2, and for which the parameters are presented in vector form as: Throughout this article, all rate parameters will be assumed to have units of (minutes) −1 , and β has units of (minutes) −1 (concentration) −1 . At the initial time (t = 0), we assume that all cells are in state S 1 and that there are no mRNA (R = 0), meaning that all cells are in a transcriptionally inactive state. This initial condition was chosen arbitrarily, but matches to observed dynamics for many genes (e.g., STL1, CTT1, HSP12, c-Fos), which are transcriptionally silent in the absence of their activating kinase signal [1,10]. We use the FSP approach [14,21,22,23,34] with the true model and known initial condition to compute the timevarying mRNA probability distribution at each combination of time and input condition. We then use the FSP solution to generate histograms of 100 exact independent random samples (i.e., one data point for each simulated cell) at ten equally spaced times between zero and 100 minutes. In the following sections, we will use various computational approaches to infer model parameters from these time-varying histograms.

Results
The goal of our study is to determine if simulated data is sufficient to identify the model and its parameters, and if the identified parameters can be used to predict the system's response under new experimental conditions.

Simulation of ODE Models
We first use a deterministic approach and solve the simple system of ODEs listed in Eqs. 3 -6 to capture the bulk gene regulation dynamics. For a given model and parameter set, these ODEs generate the mean gene expression as a function of time. Figure 2 shows the predicted mean gene expression as a function of time for Model 2 (i.e., where k 23 is time-dependent), assuming the input signal in Eqn. 1, and two different arbitrary parameter sets (blue and magenta) as given in the caption.
Examining how different parameters affect system behavior reveals qualitative insight about how rate parameters control the underlying process dynamics. For example, Fig. 2 illustrates that faster rates for transcription and degradation (k r2 and γ) enable gene expression to track the input signal more closely (Fig. 2, magenta line), whereas slower rates result in greater averaging and lag behind the input signal (Fig. 2, blue line). Biologically, evolution could tune the rates of these reactions to allow cells to respond more quickly to important fluctuations (e.g., sustained stresses) while rejecting faster environmental disturbances (e.g., short-lived transient fluctuations). In addition, such dependence of response dynamics on parameters makes it feasible to perform an inverse analysis (i.e., to estimate parameters from quantitative observations of response dynamics), as discussed in the following subsection.

Maximum Likelihood Estimation for ODE Models
Next, we use the ODE model defined in Eqns. 3-6 and search parameter space to find a parameter set that maximizes the likelihood of the simulated dataset. As described in Section 3, the data were generated at ten equally-spaced time points between zero and 100 minutes, assuming the known input signal, Y 1 (t), as given in Eqn. 1. Prior to fitting with the ODE model, we computed the mean, standard deviation, and standard error of the mean (see Fig. 3).
Before we can fit the model to the data, we must first compute the likelihood of the data, given the model's parameters. Consider a set of experimental data, including N i cells at each time t i . LetR ij denote the number of mRNA in the j th cell at the i th time point. The measured sample mean at each time is given by: To fit the ODE model to the measured sample mean, one must quantify how well one would expect the sample mean to match to the predictions made by the ODE analysis. This question is easily answered by invoking the central limit theorem (CLT) [35], which applies provided that the true distribution has finite variance. According to the CLT, when many independent and identically distributed measurements are taken from the same distribution, the average of those measurements will tend to a Gaussian random variable with mean equal to the true mean and variance equal to the true variance divided by the number of samples (or squared standard error). In turn, the squared standard error can be estimated from the measured histograms according to: (9) Provided that the underlying distribution has a finite true variance and if enough sample measurements are collected, the likelihood to observe the measured sample mean can be estimated using the normal distribution: Our assumed lognormal prior distribution (E{log 10 (λ)} = 0, E{(log 10 (λ)) 2 } = 4) for each parameter can be written as: where A is a normalization constant. The probability of a parameter set Λ, given the data, can be written according to Bayes' rule, (i.e., P (Λ|D) = P (D|Λ)P (Λ)/P (D), where Λ denotes the model and D denotes the data) as: Here, N t is the number of time points, and P (D) is a normalization constant. Taking the logarithm of P rob(Λ|D) and substituting in the expressions for P rob(µ S (t i ) | Λ) and P rob(Λ|D) above yields: where the first summation is the chi-squared function, which quantifies the likelihood of the data given the model, and the second summation relates to the prior probability to select those parameters in the absence of any data. The variable C is a normalization constant that does not depend upon any of the model parameters, and it can be ignored for the purpose of model comparison or likelihood maximization. For small data sets, the second term will dominate, but as more data is collected, the first summation will take on greater importance.
We fit all four possible models (each with a different input-dependent transition rate) by maximizing the likelihood of the model given the data, as specified in Eqn. 13. For this, we use standard optimization routines, including Matlab's fminsearch, which attempts to find the unconstrained minimum of the objective function using a Nelder-Mead simplex algorithm [36,37]. We have also utilized a customized simulated annealing algorithm [38]. Knowing the signs of all parameters, we converted them to positive values, and we conducted all searches in logarithmic space. This choice is advantageous because (i) it allows us to improve efficiency by removing need to check and satisfy parameter positivity constraints, and (ii) it allows for parameter step magnitudes to be relative to their current values, even when relative magnitudes of parameters are unknown at the start of the fitting process. To increase our chances of finding a global maximum of the likelihood function (and to avoid getting stuck in local maxima associated with particular parameter guesses), we restarted all searches using thirty different random initial guesses per model. We selected the local maximum that provide the greatest likelihood of the model, given the data. Figure 3 shows the result for such a fit using the "correct" model (i.e., Model 2) in which the rate k 23 was assumed to be time-dependent. The corresponding best-fit parameter set for Model 2 using the ODE fit was: These parameters are very different from the "true" parameters given in Eqn. 7, suggesting that although the ODE analysis provides excellent fits, it is insufficient for identification of the parameters.

Parameter Uncertainty Estimation for ODE Models
When maximizing the ODE-based likelihood of the model, given the data, we saw that it was possible to find many combinations of models and parameters that match exceptionally well to the simulated data. For example, Models 2 and 4 fit equally well to the data generated from Model 2, meaning that the ODE analysis was insufficient for model discrimination. We next explored the uncertainty in parameter space in order to understand how well the averaged experimental data was able to constrain the parameters of the chosen model. To do this, we implemented the Metropolis-Hastings (MH) algorithm (see Section 1.1) to explore posterior parameter uncertainty in the model, given the simulated experimental data and the likelihood function in Eqn. 13. For our implementation of the MH search, we chose a proposal function in which we select each parameter with probability of 0.5, and, for those selected, we change their logarithmic value by adding a Gaussian-distributed perturbation with a mean of zero and a standard deviation of 0.2. In Matlab, this can be scripted as: x = x + 0.2*randn(size(x)).*(rand(size(x))>0.5), where 'x' is the vector of parameters in logarithmic space. For the MH search of parameter space, we used a total chain length of 750,000 parameter sets with a burn-in of 250,000 ignored initial parameter steps and a thin rate of ten ignored for every one recorded parameter set (see Section 1.1). Figure 4 shows a representative MH trajectory on the plane of parameters k r2 and γ for Model 2.
Since the MH algorithm is random, each finite length run of this algorithm produces different results. In practice, the MH algorithm should be run until convergence (e.g., until multiple independent chains with different initial starting parameter guesses converge, within some specified metric and tolerance). Our analysis to compare the ODE model to the simulated data yielded the following average parameter values:  In Fig. 4, the ellipse depicts the 90% confidence interval in the parameters, with the (dubious) assumption that the posterior parameter distribution is a multivariate lognormal distribution with the mean equal to Λ MH and a covariance matrix equal to Σ Λ . Clearly, with the data available, the model is not well identified despite excellent fits to the average data. This is an illustration of the effect of model sloppiness, where parameters are highly uncertain in some directions, yet potentially well-constrained in other directions within parameter space [39,40]. In Section 4.5, we explore how stochastic analyses can dramatically improve model identification, even when using the same model and the same data.

Simulation of Stochastic Gene Expression Models
Next, we simulate many stochastic trajectories for the gene regulation model, thus generating potential single-cell mRNA distributions from the model. To do this, we define the stoichiometry matrix and the propensity functions for each of the four possible models. Unlike the standard SSA, one of the propensity functions in these models is a time-varying function. In this case, it is necessary to adapt the SSA in a manner similar to that discussed in [41,42]. This modification corresponds to adding a fast reaction to the system, which has a large propensity function (w 1 = 100/min), but which does not change the state of the system (S(:, 1) = 0). We ran 1000 SSA trajectories for each of the four possible models, and we made histograms of the number of mRNA in different cells at each point in time (using the resulting best fit parameters from Section 4.2).
A single SSA trajectory is shown Fig. 5A. In Fig. 5B, we plot the average of multiple sets of 1000 SSA simulations versus time along with the ODE solution. In addition, Fig.  5C shows different FSP distributions for the two parameter sets (given in the caption for Fig. 4) and the simulated data at the same, specific time point. Despite producing excellent fits to the averaged data (Fig. 5B), neither parameter set captures the bimodal behavior of the data, as shown in Fig. 5C. This again shows that fitting the set of ODEs was not a sufficient strategy to extract all available information from our simulated single-cell data set, and the result suggests that fitting full distributions could lead to improvements in parameter identification [11]. We examine this possibility further in the next subsection.

Parameter Uncertainty Estimation for FSP Models
To improve upon the parameter uncertainty estimation results, we next utilized the Finite State Projection (FSP) approach to re-explore the uncertainty in parameter space, given the single-cell data. To accomplish this, we write the CME in the matrix ODE form: where P = [p 1 , p 2 , . . .] is the vector of probabilities of each state of the stochastic process, and A(t) is known as the infinitesimal generator matrix. See [14,21,22,23] for examples on the construction and evaluation of this FSP formulation (Eqn. 16) and the GitHub folder at "https://github.com/MunskyGroup/WeberPB/Matlab Codes" for specific FSP codes for this section. Equation 16 can then be integrated in time to compute full probability distributions for the gene regulation dynamics. We verified that the FSP analysis produced distributions that matched to the SSA results in the previous section (Fig. 5C, compare bars and dashed line). With the FSP-computed probability distributions, the likelihood of the data, given the model, is easily computed using the expression: Nr j=0 n ij log(p(j|t i , Λ)), (17) where N r is the maximum number of mRNA observed in a single cell, n ij is the number of cells with exactly j mRNA at the i th time point, and p(j|t i , Λ) is the probability to have j mRNA at that time, given the model parameters in Λ, as discussed in [21,23]. The corresponding log-likelihood of the model, given the data, can be written (using our previous definition of the prior) as: where C is a constant that is independent of the parameters. With this new likelihood function, we again implemented the MH algorithm to explore parameter uncertainty in the model, given the full information within the simulated single-cell data. In comparison to the parameter uncertainty estimation using the ODE analysis, our analysis using the FSP approach yielded substantially lower covariances for the parameters, as seen in the logarithmic covariance matrix:  The FSP distribution results are presented in Figs. 6 and 7. Figure 6 shows that the FSP distributions resulting from the identified parameters match closely to the data  distributions at the time points presented and capture the bimodal behavior of the data. Unlike the ODE analysis, the FSP was able to recover the correct model (i.e., Model 2) from its fit to the simulated data. Moreover, Fig. 7 shows that there is substantially less parameter uncertainty when the Metropolis-Hastings search is performed with the FSP instead of the ODE analysis. Figure 8 shows the true parameter values compared to the average estimated parameter values for Model 2 from the ODE and FSP searches of the parameter space. For nearly all of the parameters, the FSP search yields substantially closer estimates of the true parameter values and has much lower standard deviations for the parameter estimates as compared to ODE-based search. The only exception to this observation, for parameter k 23 (0), is easily explained by the fact that this reaction (Eqn. 2) is saturated and, therefore, unobservable for much of the system trajectory. It is also worth noting that, in addition to providing better parameter estimates and lower uncertainty, the FSP-based analysis required far shorter MH chains to explore parameter space before converging around the global minimum.

Directions of Parameter Uncertainty
To quantify and compare the model sloppiness after parameter estimation using the ODE and FSP analyses, we examined the eigenvectors and eigenvalues of the MCMC results (in logarithmic space). The square-root of the largest eigenvalue and its corresponding eigenvector quantify the magnitude and direction of the greatest component of parameter uncertainty (also known as the first principal component or PC). The second eigenvalue/eigenvector pair characterizes the next most variable direction, and so on. Figure 9A shows the magnitudes of the eight ranked PCs for the ODE (red) and FSP (black) analyses. The solid lines in Fig. 9A correspond to identification using the initial condition (S 1 and R = 0) as presented above, and the dashed lines correspond to using a different initial condition defined by the steady-state distribution in the absence of external input. For this particular combination of input signal and sampling frequency, the two different initial conditions yield similar results for the contrast between the ODE and FSP analyses. However, we note that both the ODE and the FSP analyses are slightly more variable when the initial condition is defined by the steady-state behavior in the absence of the activating signal. Figures 9B-E show the directions of all eight PCs on the log 10 (k r2 ) × log 10 (γ) plane for the FSP and ODE analyses with the two different initial conditions. For the FSP analysis with the zero-activation initial condition ( Fig. 9 B), the first PC explains 86% of the total uncertainty, and only three PCs are needed to explain more than 95% of the total uncertainty. Moreover, parameters k r2 and γ are linearly related to one another (i.e., their logarithms co-vary with a ratio of one) in each of the first three PCs, and Fig. 9B shows almost no uncertainty except on a single line with unit slope (Fig. 9C shows the same trend for the other initial condition). In contrast, six of the eight of the possible PCs are required in order to explain the total uncertainty in the ODE analysis, and Figs. 9C,E show that all of these PCs have significant variation in the log 10 (k r2 ) × log 10 (γ) plane, for either initial condition.

Importance of Experiment Design
Up to this point, we have demonstrated that the type of analysis chosen (i.e., an ODEor FSP-based likelihood function) to examine experimental data is crucial in model and parameter identification. We have also observed that the type of initial condition can have a smaller, but still significant quantitative effect on parameter identification. We now briefly explore how different experimental designs can affect parameter identification results. To demonstrate this concern, we fit the data and performed MH searches of parameter space for multiple different experiments (i.e., different input signals). Specifically, we compared the original input signal, given in Eqn. 1, with two additional input signals, which are as follows: where both additional input signals begin at time t = 5 min and end at time t = 70 min, similar to Y 1 (t). Figures 10 and 11 illustrate how the three different experimental designs (i.e., three different input signals) lead to dramatically different parameter uncertainties, using the ODE and the FSP likelihood functions, respectively. In all figures, we assume the exact same amount of data, but the effects on which parameters are tightly identified and which remain uncertain is non-trivially dependent on the input signal. Moreover, input signals that are best for one type of analysis may not be the best choice for a different analysis (for example, compare Figs. 10G to 11G).  To further demonstrate how important experimental design considerations are to obtain accurate prediction of gene regulation dynamics, we make model predictions of FSP distributions at different times and for multiple different experiments (i.e., different input signals). In these predictions, we use the best-fit parameter sets from the FSPbased searches for all four models (each with a different time-dependent transition rate and obtained using the original input signal given in Eqn. 1). The potential input signals are Y 2 (t) (Eqn. 20), Y 3 (t) (Eqn. 21), and Y 4 (t), given by:  20 -22). Each model uses the best parameter set found during the fit to the original input signal in Eqn. 1. Figure 12 shows predictions for these 3 potential input signals (left, middle, right columns) for two potential time points (middle, bottom rows), assuming the 4 different models that were identified using the original experiment (different colors). By examining these predictions, we can gain insight into which experiments might be more informative. For example, at t = 30 min (middle row), an experiment with input signal Y 2 (t) (left column) or Y 3 (t) (middle column) would enable us to differentiate easily between the models, whereas input signal Y 4 (t) (right column) is unlikely to differentiate between the models. Furthermore, measuring the output at t = 30 min is clearly more informative than at t = 100 min (compare Figs. 12D-F with 12G-I ).

Summary and Conclusions
In this article, we explored the application of various computational techniques (such as ODE analyses [43,44,45,46,47,48], stochastic simulations [17,49,50,51,52,53,54,55,56,57,58,59,60], and finite state projections [12,14,20,21,23,61,62,18,63,64]) to understand the dynamics of single-cell gene regulation. In particular, we focused on a class of simplified multi-state gene regulation models that has previously been fit to experimental single-cell data [1,10,11], and we used a combination of simulated singlecell data and different modeling procedures to explore how well one could identify models and parameter uncertainties from discrete stochastic data. Additionally, we discussed the importance of experiment design and illustrated how different initial conditions or experiments can be more informative than others for model identification. In addition to changing the initial conditions or input conditions, additional information can be gained from system response by adjusting the times at which measurements are taken. Although there have been many recent developments toward model-driven experiment design for ODE analyses [25,65], the development of experiment design methods for stochastic gene regulatory systems remains an active field of investigation [25,66]. Moreover, our analyses show that fitting ODEs to dynamical data may not be adequate for parameter identification and discerning parameter uncertainty (even when models are relatively simple), and we discussed the fact that fitting full distributions (e.g., using an FSP likelihood function) may lead to substantial improvements, even when considering the same model and same data. These results argue that the integration of discrete stochastic models and single-cell data could open new possibilities for improved model identification, even in cases where traditional ODE analyses have failed due to excessive parameter sloppiness.
In our current study, we focused on the analysis of mRNA levels as can be measured precisely with single-molecule resolution and fast temporal resolution using single-molecule Fluorescence in situ Hybridization [1,5,9,11]. Similar tools and stochastic model can also be applied to the analysis of protein variations, although such analyses must contend with additional complications, such as slow protein translation or activation dynamics that can filter out the fast transcriptional responses [13] or added background noise (e.g., cellular autofluorescence) that must be convolved with the model-generated protein signal [67]. Despite these issues, single-cell flow cytometry data have been successfully combined with stochastic models to generate accurate predictive models for single-cell protein distributions [67,68]. We assumed the possibility of rich and well measured time-varying input signals, which can be temporally controlled through alteration of environmental conditions and quantified with time-lapse fluorescence microscopy [1,10,31,32,33].
To make these approaches more accessible, we have developed a GUI, which is freely available on GitHub in the folder "https://github.com/MunskyGroup/WeberPB/GUI". The GUI, shown in Fig. 13, provides an opportunity for users to specify any input and model number and to perform the analyses outlined throughout, evaluate results, and generate figures. In addition, the GUI enables the automatic generation of simulated data using SSA for user-specified parameter sets and models, or to import a previously generated data file to perform the analyses outlined in this exercise. The GUI requires Python and the packages: NumPy, SciPy, and Matplotlib. Figure 13. Screenshot of the provided GUI, which can be utilized to perform all analyses outlined throughout, including generating simulated data sets and producing desired plots.