A method for estimating Hill function-based dynamic models of gene regulatory networks

Gene regulatory networks (GRNs) are quite large and complex. To better understand and analyse GRNs, mathematical models are being employed. Different types of models, such as logical, continuous and stochastic models, can be used to describe GRNs. In this paper, we present a new approach to identify continuous models, because they are more suitable for large number of genes and quantitative analysis. One of the most promising techniques for identifying continuous models of GRNs is based on Hill functions and the generalized profiling method (GPM). The advantage of this approach is low computational cost and insensitivity to initial conditions. In the GPM, a constrained nonlinear optimization problem has to be solved that is usually underdetermined. In this paper, we propose a new optimization approach in which we reformulate the optimization problem such that constraints are embedded implicitly in the cost function. Moreover, we propose to split the unknown parameter in two sets based on the structure of Hill functions. These two sets are estimated separately to resolve the issue of the underdetermined problem. As a case study, we apply the proposed technique on the SOS response in Escherichia coli and compare the results with the existing literature.


Introduction
Gene expression is a process that transcribes information of genes and translates it into functional gene products. These products are proteins that play a central role in performing numerous cellular functions. Spatio-temporal expression of these products is controlled by close interaction of different genes with each other, which is commonly known as gene regulatory network (GRN). The large and complex nature of GRNs limits our ability to understand them intuitively [1,2]. Therefore, mathematical models are constructed to get better insight into and understanding of these complex phenomena.
2018 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.

Background
The complete process of parameter estimation of a Hill function-based ODE model from experimental genome data is shown as a block diagram in figure 1. Firstly, gene network structure is obtained either from well-established literature on the subject, experimental observations or by applying reverse engineering techniques (e.g. BANJO [22], ARACNE [23] and TSNI [9]). Network structure is used to construct a Hill function-based ODE model of the GRN with unknown parameters. The parameters of the ODE model are estimated by using the experimental data. Our work is focused on parameter estimation based on the generalized profiling method (GPM) [11]. In the GPM, an optimization problem has to be solved to calculate the most optimal parameters. Lastly, the identified mathematical model can be used for dynamic simulations of the GRN. In the following subsections, we provide details of the Hill function-based modelling and the GPM.

Hill function-based ordinary differential equation model of gene regulatory networks
ODEs belong to the category of continuous mathematical models. Concentrations of gene products are considered as the state variables. The rate of change of these concentrations is expressed as a function of state variables. The standard form is as follows: x i = f i (x 1 , x 2 , . . . , x N ), i = 1, . . . , N, (2.1) where x i denotes the concentration of the product of gene i and N is the total number of genes in a GRN. The rate of change of a stateẋ i is described as some mathematical function f i (.) of all the states. The concentration of any gene product x i is non-negative. Hence the model should be such that x i ≥ 0 for i = 1, . . . , N for all practically possible initial conditions [24]. For modelling of GRNs with ODEs, Hill or sigmoidal functions are employed in the literature [25,26]. Hill curves are preferred because they can adopt sigmoidal shape [2,27] with suitable parameters. Mendes [27] has suggested the following function: where and , (2.4) where x i denotes concentration of the ith gene product, I i is a subset of {1, . . . , N} that denotes the set of all inhibiting gene products for the ith gene, A i denotes the set of all activating gene products for the ith gene, P i is the synthesis rate constant, S i is the degradation rate constant, h − is the inhibiting Hill function, h + is the activating Hill function, Q i,j and Q i,k denote the threshold parameters of Hill functions, and the exponents R i,j and R i,k denote the cooperative parameters. Threshold parameter is the value of the concentration after which significant effect of inhibitor or activator is observed. Cooperative parameter controls how sharply the level transition occurs across the threshold. Activator and inhibitor Hill functions are depicted graphically for different values of the cooperative parameter in figure 2.
If the GRN composed of N genes has M interconnections, then the complete Mendes    . Activator and inhibitor functions are shown for fixed-threshold parameter Q i,j = Q i,k = 2, which is marked by a thick vertical line, while three different curves correspond to three different cooperative parameters R i,j = R i,k = 1, 2, 3. Increase in the cooperative parameter causes more rapid change across the threshold. constants, N degradation rate constants, M threshold parameters and M cooperative parameters . We can write the model (2.2) in compact form as shown below:

Generalized profiling method for estimation of parameters
Estimation is a challenging task because of non-availability of an analytical solution of (2.2) and numerical methods are expensive. Collocation methods avoid these difficulties by using polynomial regression. One of the efficient collocation methods is the generalized profiling method (GPM), which provides accurate estimation with low computational load [11]. In the GPM, β-splines are used as polynomial regression of states and their derivatives. The β-splines are preferred for interpolation as they are differentiable. Further details on β-splines can be found in [28]. The states and their derivatives are defined in terms of β-splines as follows: where c is column vector of coefficients, while φ(t) denotes β-spline basis systems. Estimation is achieved in two cascaded optimization steps. The inner optimization determines the coefficient vector T for a given fixed value of parameters θ such that the following objective function is minimized:ĉ where y i (t) is the experimentally observed concentration of the product of the ith gene, t 0 is the starting time of the experiment, t f is the ending time of the experiment, T E is the set of all times at which experimental measurements are available and λ i is the weighting parameter. The first term of (2.6) minimizes the sum of squared residuals between the data y i (t) and the state of the model , whereas the second term penalizes the residuals between derivatives from nonlinear functions f i (.) in (2.5) and their estimates from β-splines. Weighting factor λ i is used as a smoothening parameter.
The outer optimization determines the estimate of ODEs parametersΘ by minimizing the following objective function:Θ whereĉ is the optimal value of coefficient vector c such that the objective function in (2.6) is minimized. The above-stated objective function is an implicit function of ODE parameters Θ. Further details on the GPM can be found in [29]. As mentioned in §2.1, the parameters Θ should be positive. Therefore, the above optimization problems are to be solved with the following constraint: (2.8)

Proposed method
The nonlinear optimizations in the generalized profiling method in §2.2 have to be solved by numerical optimization algorithms. Having the constraint (2.8) in the optimization problem makes it more complex to solve. Some solvers, such as Lavenberg-Marquardt [21], also do not handle constraints. We propose a reformulation of the mathematical model that allows us to avoid the constraints. Let P i = e p i , Q i,j = e q i,j , R i,j = e r i,j and S i = e s i . The model (2.2) can be rewritten as follows:   optimizations in the generalized profiling method in §2.2 are underdetermined. This could result in poor estimates of the parameters. Some of the optimization techniques, such as the trust-region-reflective method [20], are more prone to inaccuracy in case of underdeterminedness.
To solve the issue of underdeterminedness, we propose a new approach. The idea is based on the structure of Hill functions. The cooperative parameters R i,j = e r i,j in the Hill functions control the sharpness of the switching action around the threshold parameters. Therefore, cooperative parameters only have prominent impact near the switching of Hill functions. We propose to split the parameters θ to be estimated into two sets θ r = [r 1,2 . . . r N,N−1 ] T , which includes only the cooperative parameters r i,j , and θ p = [p 1 . . . p N q 1,2 . . . q N.N−1 s 1 . . . s N ] T , which includes all the rest of the unknown parameters. We propose to initially assume a value of θ r , e.g. 2 for all r i,j , and estimate only the parameters θ p . As the number of unknown parameters are reduced, it helps to improve the issue of underdeterminedness. However, as θ r was initially guessed, we then fix the value of θ p and solve the optimization problem to estimate θ r . We repeat this process until the estimates of all the parameters converge to a value.
The complete proposed scheme is illustrated with a flow chart in figure 3. Initially the estimation problem is reformulated with the change of variables to avoid constraints. An initial value of θ r is assumed. The GPM method is applied to only estimate the parameters θ p , which has M + 2N elements. Using the recent optimal values of θ p , the GPM method is applied to only estimate the parameters  θ r , which has M elements. If the change in all the parameters θ is less than some threshold, then the estimation process is complete; otherwise repeat the process.

Case study: SOS response in Escherichia coli
SOS response in Escherichia coli is an inducible DNA repair system that enables bacteria to survive under severe DNA damage. In the SOS response, nearly 40 genes are directly regulated by recA and lexA, while tens of other genes are indirectly controlled [30]. Nine genes named lexA, recA, recF, rpoD, rpoS, dinI, umuDC, rpoH and ssB are considered to be directly participating in the SOS response. Established network connections among these nine genes are known in the literature [9,30,31]. Network structure of these nine genes is shown as in figure 4.

Hill function-based model of the SOS response
The ODE model of the SOS response in Escherichia coli is constructed based on the network structure given in figure 4 [9,30]. The network has nine genes and 43 interconnections, i.e. N = 9 and M = 43. The Hill function-based ODEs for the nine gene product concentrations are given below: A total of 2 * (43 + 9) = 104 parameters are needed to fully describe the above model. We can write the above model in compact form as shown below:  figure 3 with a total execution time of approximately 82 minutes on a PC with a Core i5-650 processor. The Matlab code is available at [33]. Estimated parameters are given in table 1.
The estimated model is simulated to generate time course evolution of gene product concentrations. The results are compared with experimental data and the model reported in [31]. Baralla et al. [31] also investigated the same subpart of the SOS response using the same dataset of M3D. They have applied the particle swarm algorithm for optimization and a direct numerical solution of the ODE model for data fitting. It can be seen in figure 5 that dynamic simulation with the model estimated by the proposed method is much more accurate than that of [31], especially for recF, rpoD, ssB and rpoH.
For the purpose of validation, the model estimated by the proposed method has also been used to calculate the steady-state values of concentrations. The steady-state values are given in table 2 along with the experimentally observed values and the values obtained by the estimated model in [31]. The table also gives the sum of squared error between the experimental data and values obtained from estimated models. It can be seen that the total error in the values obtained by the model estimated by the proposed method is much smaller than that of [31].

Conclusion
Obtaining a mathematical model of GRNs requires estimation of model parameters from the experimental data. To estimate the optimal values of parameters an optimization problem has to be solved. Usually the experimental data have much fewer measurements than the number of parameters which could result in an underdetermined optimization problem. Moreover, depending on how the optimization problem is posed, constraints have to be incorporated to find practically feasible values of parameters, which could result in numerical issues. In this paper, we have proposed a new approach to reformulate the estimation problem such that constraints are not required. Based on the structure of Hill functions, we also suggest to split the number of unknowns into two sets, which are estimated one at a time. This helps to alleviate the issue of underdeterminedness. The proposed technique is applied to estimate the model of the SOS response in Escherichia coli. The results are compared with the existing literature and experimental data. Both dynamic and steadystate results show that the proposed technique provides more accurate estimates of the unknown model parameters.
Data accessibility. The Matlab code of the proposed algorithm is available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.ht047 [33].