Discovery of Nonlinear Dynamical Systems using a Runge-Kutta Inspired Dictionary-based Sparse Regression Approach

Discovering dynamical models to describe underlying dynamical behavior is essential to draw decisive conclusions and engineering studies, e.g., optimizing a process. Experimental data availability notwithstanding has increased significantly, but interpretable and explainable models in science and engineering yet remain incomprehensible. In this work, we blend machine learning and dictionary-based learning with numerical analysis tools to discover governing differential equations from noisy and sparsely-sampled measurement data. We utilize the fact that given a dictionary containing huge candidate nonlinear functions, dynamical models can often be described by a few appropriately chosen candidates. As a result, we obtain interpretable and parsimonious models which are prone to generalize better beyond the sampling regime. Additionally, we integrate a numerical integration framework with dictionary learning that yields differential equations without requiring or approximating derivative information at any stage. Hence, it is utterly effective in corrupted and sparsely-sampled data. We discuss its extension to governing equations, containing rational nonlinearities that typically appear in biological networks. Moreover, we generalized the method to governing equations that are subject to parameter variations and externally controlled inputs. We demonstrate the efficiency of the method to discover a number of diverse differential equations using noisy measurements, including a model describing neural dynamics, chaotic Lorenz model, Michaelis-Menten Kinetics, and a parameterized Hopf normal form.


Introduction
Data-driven discovery of dynamic models has recently picked up much attention as there are revolutionary breakthroughs in data science and machine learning [22,29]. With the increasing ease of data availability and advances in machine learning, we can delve into analyzing data and identifying patterns to uncover dynamic models that faithfully describe the underlying dynamical behavior. Though inferring dynamic models have been intensively studied in the literature, drawing conclusions and interpretations from them still remains strenuous. Moreover, extrapolation and generalization of models are limited beyond the training regime.
The sphere of identifying models using data is often referred to as system identification. For linear systems, there is an extensive collection of approaches [26,42]. However, despite several decades of research on learning nonlinear systems [23,25,39], it is still far away from being as mature as linear systems. Inferring nonlinear systems often require a prior model hypothesis by practitioners. A compelling breakthrough towards discovering nonlinear governing equations appeared in [3,37], where an approach based on genetic programming or symbolic regression is developed to identify nonlinear models using measurement data. It provides interpretable analytic models that accomplish a long-standing desire to the engineering community. A parsimonious model is determined by examining the Pareto font that discloses a tread-off between the identified model's complexity and accuracy. In a similar spirit, there have been efforts to develop sparsity promoting approaches to discover nonlinear dynamical systems [5,6,31,32,43]. It is often observed that the dynamics of physical processes can be given by collecting a few nonlinear feature candidates from a highdimensional nonlinear function space, referred to as a feature dictionary. These sparsity-promoting methods are prone to discover models that are interpretable and parsimonious. Significant progress in solving sparse regression [16,20,40] and compressed sensing [7,8,14,41] support developments of these approaches. Although all these methods have gained much popularity, the success of these methods largely depends on the feature candidates included in the dictionary and the ability to accurately approximating the derivative information using measurement data. A derivative approximation using sparsely sampled and noisy measurements impose a tough challenge though there are approaches to deal with noise, see, e.g., [10] We also highlight additional directions explored in the literature to discover nonlinear governing equations, which include discovery of models using time-series data [11], automated inference of dynamics [3,12,38], and equation-free modeling [24,32,46].
In this work, we re-conceptualize the problem of discovering nonlinear differential equations by blending sparse identification with a classical numerical integration tool. We here focus on a widely known integration scheme, namely Runge-Kutta 4 th -order [1]. In contrast to previously studied sparse identification approaches, e.g., [3,5,43], our approach would not require direct access or approximation of temporal gradient information. Therefore, we do not commit errors due to a gradient approximation. The approach becomes an attractive choice when the collected measurement data are sparsely sampled and corrupted with noise. We mention that numerical integration-inspired (e.g., Runge-Kutta) neural network architecture designs have also studied in the literature and have observed their supreme performances in deep learning, see, e.g., [18,19], and from the perspective of dynamical modeling, see, e.g., [33][34][35]. These methods yield black-box models, thus interpretable and generalization of these models are ambiguous.
What is more, we discuss an essential class of dynamic models that typically explains the dynamics of biological networks. It is also witnessed that regulatory and metabolic networks are sparse in nature, i.e., not all components influence each other. Furthermore, such dynamic models are often given by rational nonlinear functions. Consequently, the classical dictionary-based sparse identification ideology is not applicable as building all possible rational feature candidates is infeasible. To deal with this, the authors in [28] have recast the problem as finding the sparsest vector in a given null-space. However, computing a null space using corrupted measurement data is a non-trivial task though there is some work in the direction [17]. In this work, we instead characterize identifying rational functions as a ratio of two functions, where each function is identified using dictionary learning. Hence, we inherently retain the primary principle of sparse identification in the course of discovering models. In addition to these, we discuss the case where a dictionary contains parameterized candidates, e.g., e αx , where x is dependent variables, and α is an unknown parameter. We extend our discussion to parametric and controlled dynamic processes.
The organization of the paper is as follows. In Section 2, we briefly recap the Runge-Kutta 4 th -order scheme that is typically used to integrate differential equations. After that, we propose a methodology to discover differential equations by synthesizing the integration scheme with sparse identification. Furthermore, since the method involves solving nonlinear and non-convex optimization problems that promote sparse solutions, Section 3 discusses algorithms inspired by a sparse-regression approach in [5,40]. In Section 4, we examine a number of extensions to other classes of models, e.g., when governing equations are given by a ratio of two functions and involve model parameters and external control inputs. In the subsequent section, we illustrate the efficiency of the proposed methods by discovering a broad variety of benchmark examples, namely the chaotic Lorenz model, Fitz-Hugh Nagumo models, Michaelis-Menten Kinetics, and parameterized Hopf normal norm. We extensively study the performance of the proposed approach even under noisy measurements and compare it to the approach proposed in [5]. We conclude the paper with a summary and high-priority research directions.

Discovering Nonlinear Governing Equations using a Runge-Kutta Inspired Sparse Identification
In this section, we are determined to discover nonlinear governing equations using sparsely sampled measurement data. These may be corrupted using experimental and/or sensor noise. We establish approaches by combining a numerical integration method and dictionary-based learning of the gradient field. As a result, we develop methodologies that allow us to discover nonlinear differential equations without the explicit need for derivative information, unlike the approach proposed in [5,12,43]. In this work, we utilize the widely employed approach to integrate differential equations, namely Runge-Kutta 4 th -order (RK4) scheme, which is briefly outlined in the following.

Runge-Kutta 4 th order scheme
The RK4 scheme is a widely-used method to solve an initial value problem. Let us consider an initial value problem as follows: being the jth element of the vector x(t). Assume that we aim at predicting x(t k+1 ) for a given x(t k ), where k ∈ {0, 1, . . . , N }. Then, using the RK4 scheme, x(t k+1 ) can be given as a weighted sum of four increments, which are the product of the time-step and gradient field information f (·) at the specific locations. Precisely, it is given as where The RK4 scheme as a network is illustrated in Figure 2.1(a). The local integration error due to the RK4 scheme is of O(h 5 k ); hence, the approach is very accurate for smaller time-steps. Furthermore, if we integrate the equation (2.1) from the time t 0 to t f , we can take N steps with time-steps h k , k ∈ {1, . . . , N } so that t f = t 0 + N i=0 h k . In the rest of the paper, we use a short-hand notation for the step in (2.2) by Lastly, we stress a point that the RK4 scheme readily handles integration backward in time, meaning that h k in (2.2) can also be negative. Hence, we can predict both y(t k+1 ) and y(t k−1 ) using y(t k ) very accurately using RK4 scheme.

Discovering nonlinear dynamical systems
Next, we develop a RK4-inspired sparse identification approach to discover governing equations. Precisely, we aim at disclosing the most parsimonious representation of the gradient field f (x(t)) in (2.1) using only a time-history of x(t). Assume that the data is sampled at the time instances {t 0 , . . . , t N } and let us define time-steps h k := t k+1 − t k . Furthermore, for simplicity of notation, we assume that the data follows RK4 exactly, but the method is not limited to it. Consequently, we form two data matrices: . . .
The next important ingredients to sparse identification is the construction of a huge symbolic dictionary Φ, containing potential nonlinear features. So, the function f (·) can be given by a linear combination of few terms from the dictionary. For example, one can consider a dictionary containing, polynomial, exponential, and trigonometric functions, which, for any given vector v := v 1 , . . . , v n can be given as: . . , e −v , e −2v , . . . , sin(v), cos(v), . . . (2.5) in which v Pi , i ∈ {2, 3} denote high-order polynomials, e.g., v P2 contains all possible degree-2 polynomials of elements of v as: Each element in the dictionary Φ is a potential candidate to describe the function f . Moreover, depending on applications, one may take the help of experts and include empirical knowledge to construct a meaningful feature dictionary. Having paradise of an extensive dictionary, one has many choices to choose candidates from the dictionary. However, our goal is to choose as few candidates as possible, describing the nonlinear function f in (2.1). Hence, we set up a sparsity-promoting optimization problem to pick few candidate functions from the dictionary, e.g., where f k : R n → R is the kth element of f , and θ k a sparse vector; hence, selecting appropriate candidates from the dictionary determines governing equations. As a result, we can write the function f (·) in (2.1) as follows: where Θ = θ 1 , . . . , θ n . This allows to articulate our optimization problem that aims at discovering governing equations -that is to find the sparsest Θ, satisfying Once we identify Θ or {θ 1 , . . . , θ n }, the dynamic model can be given as We referred to the proposed approach as Runge-Kutta inspired sparse identification (RK4-SINDy). We depict all the essential steps for RK4-SINDy to discover governing equations in Figure 2.1 through the Fitz-Hugh Nagumo model (details of the model are provided later). We take the opportunity to stress imperative advantages of RK4-SINDy. That is -to discover nonlinear differential equations, we do not require derivative information of x(t) at any step. We only hypothesize that the gradient field can be given by selecting appropriate features from a dictionary containing a vast number of possible nonlinear features. Consequently, we expect to discover good quality models when data are sparsely collected and/or are corrupted, and this is what we manifest in our results in Section 5. Interestingly, the approach readily handles irregular time-steps.
When the data are corrupted with noise or does not follow RK4 exactly, then we may need to regularize the above optimization problem. Since the l 1 -regularization promotes sparsity in the solution, one can solve an l 1 -regularized optimization problem: (2.10) As discussed in Subsection 2.1, the RK4 scheme can accurately predict both x(t i+1 ) and x(t i−1 ) using x(t i ). Therefore, the following also holds: . . .
x 2 (t 1 ) · · · x n (t 2 ) . . . . . . . . . . . .  It resembles a residual-type network with skip connections. In (b), we present a systematic illustration of RK4-SINDy approach to discover governing equations using the Fitz-Hugh Nagumo model. In the first step, we collect a time history of variables v(t) and w(t). Next, we build a symbolic feature dictionary Φ, containing potential features. It is followed by solving a nonlinear sparse regression problem to pick the right features from the dictionary (encoded in sparse vectors ξ v and ξ w ). Here, we presume that variables at the next time steps are given by following the RK4 scheme. The non-zero entries in vectors ξ v and ξ w determine the right-hand side of the differential equations. As shown, we pick the right features from the dictionary upon solving the optimization problem, and corresponding coefficients are 0.1% accurate.
Therefore, we can have a more involved optimization by including both forward and backward predictions in time. This helps particularly in noisy measurement data. In the next subsection, we discuss an efficient procedure to solve the optimization problem (2.9).

Algorithms to Solve Nonlinear Sparse Regression Problems
Several methodologies exist to solve linear optimization problems that yield a sparse solution, see ,e.g., LASSO [16,40]. However, the optimization problem (2.9) is nonlinear and likely non-convex. There are some developments in solving sparsity-constrained nonlinear optimization problems; see, e.g., [2,45]. Though these methods enjoy many nice theoretical properties, they typically require a priory the maximum number of non-zero elements in the solutions, which is often unknown to us. Also, they are computationally demanding.
Here, we propose two simple gradient-based sequential thresholding schemes, similar to the one discussed in [5] for linear problems. In these schemes, we first solve the nonlinear optimization problem (2.9) using a (stochastic-) gradient descent method to obtain Θ 1 , followed by applying a thresholding to Θ 1 .

Fix cutoff thresholding
In the first approach, we define a cutoff value λ and set all the coefficients smaller than λ to zero. We then update the remaining non-zero coefficients by solving the optimization problem (2.9) again, followed by employing the thresholding. We repeat the procedure until all the non-zero coefficients are equal to or larger than λ. This procedure is efficient as the current value of non-zero coefficients can be used as an initial guess for the next iteration, and the optimal Θ can be found with a little computational effort. Note the Algorithm 1 Fix Cutoff Thresholding Procedure Input: Measurement data {x(t 0 ), x(t 1 ), . . . , x(t N )} and the cutoff parameter λ.

5:
Update Θ by solving the optimization problem (2.9) with the constraint Θ (small idx) = 0 6: Determine indices at which coefficients are less λ 7: Output: The sparse Θ that picks right features from the dictionary.
Determine the smallest non-zero coefficient of abs(Θ), denoted by λ small .
cutoff parameter λ is important to obtain a suited sparse solution, but it can be found using the concept of cross-validation. We sketch the discussed procedure in Algorithm 1.

Iterative cutoff thresholding
In the fix cutoff thresholding approach, we need to pre-define the cutoff value for thresholding. A suitable value of it needs to be found by an iterative procedure. In our empirical observations, applying fix thresholding at each iteration does not yield the most sparse solution in many instances. To circumvent this, we propose an iterative way of thresholding -that is as follows. In the first step, we solve the optimization problem (2.9) for Θ. Then, we determine the smallest non-zero coefficients of |Θ| followed by setting all the coefficients smaller than this to zero. Like the previous approach, we update the remaining non-zero coefficients by solving the optimization problem (2.9). We repeat the step of finding the smallest non-zero coefficient of the updated |Θ| and setting it to zero. We iterate the procedure until the loss of data fidelity is less than a given tolerance. Visually, it can be anticipated using the curve between the data-fitting and number of non-zero elements in Θ, which typically exhibit an elbow -type curve. We shall see in our result section (Section 5).
We sketch the step of the procedure in Algorithm 2. We note that the successive iterations converge faster to the optimal value after the first thresholding as we choose the coefficients after applying thresholding as the initial guess. Moreover, in our experiments, we observe that this thresholding approach yields better results, particularly when data are corrupted with noise. However, it may be computationally more expensive than the fixed cutoff thresholding approach as it may need more iterations to converge. Therefore, an efficient approach combining fixed and iterative thresholding approaches is a worthy future research direction.

A Number of Possible Extensions
In this section, we discuss several extensions to the methodology proposed in Section 2, generalizing to a large class of problems. First, we discuss the discovery of governing differential equations given by a ratio of two functions. Next, we investigate the case in which a symbolic dictionary is parameterized. This is of particular interest when governing equations are expected to have candidate features, e.g., e αx(t) , where α is unknown. We further extend our discussion to parameterized and externally controlled governing equations.

Governing equations as a ratio of two functions
There are many instances, where the governing equations are given as a ratio of two nonlinear functions. Such equations frequently appear in the modeling of biological networks. For simplicity, we here examine a scalar problem; however, the extension to multi-dimensional cases readily follows. Consider governing equations of the form:ẋ where g(x) : R → R and h(x) : R → R are continuous nonlinear functions. Here again, the observation is that the functions g(·) and h(·) can be given as linear combinations of a few terms from corresponding dictionaries. Hence, we can cast the problem of identifying the model (4.1) as a dictionary-based discovery of governing equations. Let us consider two symbolic dictionaries: Consequently, the functions g(·) and h(·) can be given by where θ g and θ h are sparse vectors. Then, we can readily apply the framework discussed in the previous in (2.1). We can determine sparse coefficients θ g and θ h by employing the thresholding concepts presented in Algorithms 1 and 2. These are possible because the algorithms are gradient-based and we only need to compute gradients with respect to θ g and θ h . Furthermore, we notice that it is worthwhile to consider governing equations of the form: .
Indeed, the model (4.6) can be rewritten in the form considered in (4.1). But it is rather efficient to consider the form (4.6). We illustrate it with the following example: which fits to the form considered in (4.6). In this case, all nonlinear functions k(·), g(·) and h(·) are of degree-1 polynomials. On the other hand, if the model (4.7) is written in the form (4.1), then we havė Thus, the nonlinear functions g(·) and h(·) in (4.1) are of degrees 2 and 1, respectively. This gives a hint that if we aim at learning governing equations using sparse identification, it might be efficient to consider the form (4.6) from the complexity of the dictionary. It becomes even more adequate in multi-dimensional differential equations. To discover dynamic model of the form (4.6), we extend the idea of learning nonlinear functions using dictionaries. Let us construct three dictionaries as follows: Then, we believe that the nonlinear functions in (4.6) can be given as a sparse linear combination of the dictionaries, i.e., To determine the sparse coefficients {θ k , θ g , θ h }, we can employ the RK4-SINDy framework, and Algorithms 1 and 2. We will illustrate this approach to discover an enzyme kinetics in Section 5.4 that is given as a rational function.

Discovering of parametric and externally controlled equations
The RK4-SINDy immediately embraces the discovering of governing equations that are parametric and externally controlled. Let us begin with an externally controlled dynamic models of the form: where x(t) ∈ R n and u(t) ∈ R m are state and controlled input vectors. The goal here is to discover f (x(t), u(t)) using the state trajectory x(t) generated using a controlled input u(t). We aim at discovering governing equations using dictionary-based identification. Like discussed in Section 2, we construct a symbolic dictionary Φ of possible candidate features using x and u, i.e., where u i is the i-th element of u. Using measurements of x and u, we can cast the problem exactly as done in Section 2 by assuming that f (x(t), u(t)) can be determined by selecting appropriate functions from the dictionary Φ(x, u). Similarly, system parameters can also be incorporated to discover parametric differential equations of the form: where µ ∈ R p is the system parameters. It can be considered as a special case of (4.13) since a constant input can be thought of as a parameter in the course of discovering governing equations. We illustrate RK4-SINDy for discovering parametrized Hopf normal form using measurement data (see Subsection 5.5).

Parameterized dictionary
The success of the sparse identification highly depends on the quality of a constructed feature dictionary. In other words, the dictionary should contain right features in which governing differential equations can be given as a linear combination of few terms from the dictionary. However, it becomes a challenging task when one aims at including, for instance, trigonometry or exponential functions (e.g., sin (ax), e (bx) ), where {a, b} are unknown. In an extreme case, one might think of including sin(·) and e(·) for each possible value of a and b. This would lead to the dictionary of infinite dimension, hence becomes intractable. To illustrate it, we consider the governing equation as follows: x(t) = −x(t) + exp(−1.75x(t)). (4.17) Let us assume that we concern about discovering the model (4.17) using a time history of x(t) without any prior knowledge except that we expect exponential nonlinearities. It may be gathered with the help of experts or from empirical knowledge. For instance, an electrical circuit modeling containing diode components typically involves exponential nonlinearities, but the corresponding coefficient is unknown. We conventionally build a dictionary containing exponential functions using several possible coefficients as follows: Φ(x) = 1, x, x 2 , x 3 , . . . , e x , e −x , e 2x , e −2x . . . , sin(x), cos(x), . . . .

(4.18)
However, it is impossible to add all infinitely exponential terms with different coefficients in the dictionary. As a remedy, we discuss the idea of a parameterized dictionary that was also discussed in [9]: where Ξ = {ξ 1 , ξ 2 , . . .}. In this case, we do not need to include all frequencies for trigonometric functions and coefficients for exponential functions. However, it comes at the cost of finding suitable coefficients {η i }'s, along with a vector, selecting right features from the dictionary. Since we solve optimization problems, e.g., (2.9) using a gradient descent, we can easily incorporate the parameters η i 's along with θ i 's as learning parameters and can readily employ Algorithms 1 and 2 with a little alteration.

Results
Here, we demonstrate the success of RK4-SINDy to discover governing equations using measurement data through a number of examples of different complexity 1 . In the first example, we consider simple illustrative examples, namely, linear and nonlinear damped oscillators. Using the linear damped oscillator, we perform a comprehensive study under various conditions, i.e., the robustness of the approach to sparsely sampled and highly corrupted data. We compare the performance of our approach to discover governing equations with [5]; we refer to it as Std-SINDy 2 . In the second example, we study the chaotic Lorenz example and show that RK4-SINDy determines the governing equations, exhibiting the chaotic behavior accurately. In the third example, we discover neural dynamics from measurement data using RK4-SINDy. As the fourth example, we illustrate the discovery of a model that describes the dynamics of enzyme activity and contains rational nonlinearities. In the last example, we showcase that RK4-SINDy also successfully discovers the parametric Hopf normal form from collected noisy measurement data for various parameters.

Two-dimensional Damped Oscillators
As simple illustrative examples, we consider two-dimensional damped harmonic oscillators. These can be given by linear and nonlinear models. We begin by considering the linear one.

Linear damped oscillator
Consider a 2D linear damped oscillator whose dynamics is given by: To infer governing equations from measurement data, we first assume to have clean data at a regular timestep dt. We then build a symbolic dictionary containing polynomial nonlinearities up to the degree of 5. Next, we learn governing equations using RK4-SINDy (Algorithm 1 with λ = 5 · 10 −2 ) and observe the quality of inferred equations for different dt. We also present a comparison with Std-SINDy.
The results are shown in Figure 5.1 and Table 5.1. We notice that RK4-SINDy is impressively robust with the variation in time-step as compared to Std-SINDy, and discovers the governing equations accurately. We also emphasis that for large time-steps, Std-SINDy fails to capture dynamics; in fact, for a time-step dt = 5 · 10 −1 , Std-SINDy even yields unstable models, see Figure 5.1d.      Next, we study the performance of both methodologies under corrupted data. We corrupt the measurement data by adding zero-mean Gaussian white noise of different variances. We present the results in Figure 5.2 and Table 5.2 and notice that RK4-SINDy can discover better quality sparse parsimonious models as compared to Std-SINDy even under significantly corrupted data. It is predominately visible in Figure 5.2d.   Noise level

RK4-SINDy
Std-SINDy Table 5.2: Linear 2D model: The discovered governing equations, by employing RK4-SINDy and Std-SINDy, are reported. In this scenario, the measurement data are corrupted using zero-mean Gaussian white noise of different variances.

Cubic damped oscillator
Next, we consider a cubic damped oscillator, governed bẏ Like the linear case, we aim at discovering the governing equation using measurement data. We repeat the study done in the previous example using different regular time-steps. We report the quality of discovered models using RK4-SINDy and Std-SINDy in Figure 5.3 and Table 5.3. We observe that RK4-SINDy successfully discovers the governing equations quite accurately, whereas Std-SINDy struggles to identify the governing equations when measurements data are collected at a larger time-step. It simply fails to obtain a stable model for a time-step dt = 0.1. It showcases the robustness of RK4-SINDy to discover interpretable models even when data are collected sparsely.

Fitz-Hugh Nagumo model
Here, we explore discovery of the nonlinear Fitz-Hugh Nagumo (FHN) model that describes the activation and deactivation of neurons in a simplistic way [15]. The governing equations are: We collect the time-history data of v(t) and w(t) using the zero initial condition. We construct a dictionary containing polynomial terms up to the third degree. We employ RK4-SINDy (Algorithm 1 with λ = 10 −2 ) and Std-SINDy. We discover governing equations by using the data collected between the time interval [0, 600]s. We identify models under different conditions, namely, different time-steps at which data are collected. We report the results in Figure 5.4 and Table 5.4. It can be observed that RK4-SINDy faithfully discovers the underlying governing equations by picking the correct features from the dictionary and estimates the corresponding coefficients up to 1% accurately. On the other hand, Std-SINDy breaks down when data are taken at a large time-step.

Chaotic Lorenz system
As the next example, we consider the problem of discovering the nonlinear Lorenz model [27]. The dynamics of the chaotic system involves on an attractor and is governed bẏ z(t) = x(t)y(t) − 8 3 z(t).

(5.4)
We collect the data by simulating the model from time t = 0 to t = 20 with a time-step of dt = 10 −2 . To discover the governing equations using the measurement data, we employ RK4-SINDy and Std-SINDy with the fixed cutoff parameter λ = 0.5. However, before employing the methodologies, we perform a normalization dt RK4-SINDy Std-SINDy Table 5.4: FHN model: Discovered models using data at various time-step using RK4-SINDy and Std-SINDy.
step. A reason behind is that the mean value of the variable z is large, and the standard deviations of all the three variables is much larger than 1. Consequently, a dictionary containing polynomial terms would be highly ill-conditioned. To circumvent this, we perform a normalization of data. Ideally, one performs normalization such that the mean and variance of the transformed data are 0 and 1. But for this particular example, we normalize such that the interactions between the transformed variables are similar to (5.4). Hence, we propose a transformation as Consequently, we obtain a model:ẋ (t) = −10x(t) + 10ỹ(t), Notwithstanding, the models (5.4) and (5.6) look alike, and the basis features in which dynamics of both models lie are the same except a constant. However, the beauty of the model (5.6) or the transformed data is that the data becomes well-conditioned, hence the dictionary containing polynomial features. Next, we discover models by employing RK4-SINDy and Std-SINDy using the transformed data. For this, we construct a dictionary with polynomial nonlinearities up to degrees 3. We observe the result in Figure 5.5 and Table 5.5. We note that both methods identify correct features from the dictionary with coefficients that are close to the ground truth, but RK4-SINDy model coefficients are relatively closer to the ground-truth ones. It is also worthwhile to note that the coefficients of the obtained RK4-SINDy model are only 0.01% off to the ground-truth, but the dynamics still seem quite different, see Figure 5.5. A reason behind this is the highly chaotic behavior of the dynamics. As a result, a tiny deviation in the coefficients can significantly impact the transient behavior in an absolute sense; however, the dynamics on an attractor are perfectly captured.
Next, we study the performance of the approaches under noisy measurements. For this, we add mean zero Gaussian noise of variance one. To employ RK4-SINDy, we first apply a Savitzky-Golay filter [36] to denoise the time-history data, see Figure 5.6. For Std-SINDy as well, we use the same filter to denoise the signal and approximate the derivative information. We plot the trajectories of the discovered models and ground-truth in Figure 5.7 and observe that dynamics on an attractor is still intact; however, we note that the discovered equations are very different from the ground truth, see Table 5.6. The learning can be improved by employing Algorithm 2, where we iteratively remove the smallest coefficient and determine the sparsest solutions by looking at the Pareto-front. However, it comes at a slightly higher computational cost. We discuss this approach more in detail in our following examples.

Michaelis-Menten kinetics
To illustrate RK4-SINDy to discover governing equations that are given by a ratio of two nonlinear functions, we consider arguably the most well-known model for an enzyme kinetics, namely Michaelis-Menten model [21,30]. The model explains the dynamics of binding and unbinding of enzyme with an substrate s. In a simplistic way, the dynamics are governed by [4]: .
As a first step, we generate data using four initial conditions {0.5, 1.0, 1.5, 2.0}. We collect data at a time-step dt = 5 · 10 −2 , see Figure 5.8(a). Typically, governing equations, explaining biological process have rational functions. Therefore, we aim at discovering the enzyme kinetics model by assuming a rational form as shown in (4.1), i.e., the gradient field of s(t) takes the form g(s(t)) 1+h(s(t)) . Next, in order to identify g(s) and h(s), we construct the polynomial dictionaries, containing terms up to degrees 4. After that, we employ RK4-SINDy to identify the precise features from the dictionaries to characterize g and h. Moreover, we apply the iterative thresholding approach discussed in Algorithm 2, in contrast to previously considered examples where a fixed thresholding is applied. Note that the success of RK4-SINDy approach not only depends on a dictionary containing candidate features but the quality of data. We have marked that the dictionary data matrix's conditioning improves when data are normalized to mean-zero and variance-one. It is crucial for polynomial basis in the dictionary. For this example, we normalize the data before employing RK4-SINDy. It means that the transformation is done as follows: where µ s and σ s are the mean and standard deviation of the collected data. Next, using the normalized data, we learn the governing equation, describing the dynamics ofs(t). Since we consider dictionaries for g and h, containing polynomials of degree 4, there are total 9 coefficients. To identify the correct model while employing Algorithm 2, we keep track of the loss (data-fidelity) and the number of non-zero coefficients, which is shown in Figure 5. 8(c). This allows us to built a Pareto front for the optimization problem and choose the most parsimonious model that describes the dynamics present in collected data. One of the most attractive features of learning parsimonious models is to avoid over-fitting and generalizing better in regions in which data are not collected. It is precisely what we observed as well. As shown in Figure 5.8(e), the learned model predicts dynamics very accurately in the region far away from the training one. In the first step, we have collected data for 4 initial conditions at a time-stepping dt = 5 · 10 −2 . In the second step, we performed data-processing to normalize the data using the mean and standard deviation. In the third step, we employed RK4-SINDy (Algorithm 2) to discover the most parsimonious model. For this, we observe the Pareto front and pick the model that has the best fit to the data yet having the maximum number of zero coefficients. We then compare the discovered model with the ground truth and find that the proposed methodology could find precise candidates from the polynomial dictionary. The corresponding coefficients have less than 1% errors.
Next, we study the performance of the method under noisy measurements. For this, we corrupt the collected data using zero-mean Gaussian noise of variance σ = 2 · 10 −2 . Then, we process the data by first employing a noise-reduction filter, namely Savitzky-Golay, followed by normalizing the data. In the third step, we focus on learning the most parsimonious model by picking appropriate candidates from the polynomial dictionary. Remarkably, the method allows us to find a model with correct features from the dictionary and coefficient accuracy up to 5%. Furthermore, the model faithfully generalizes to regimes outside the training, even using noisy measurements.

Hopf normal form
In our last example, we study discovering parameterized differential equations from noisy measurements. Many real-world dynamical processes have system parameters, and depending on them, the system may The figure demonstrates the necessary steps to uncover the most parsimonious model using noisy measurement data. It also testifies to the impressive capability of discovering the most parsimonious model -that is, they even generalize beyond the training regime.
exhibit very distinctive dynamical behaviors. To illustrate the efficiency of RK4-SINDy to discover parametric equations, we consider the Hopf systeṁ that exhibits bifurcation with respect to the parameter µ. For this example, we collect measurements for eight different parameter values µ at a time-step 0.2 by fixing ω = 1 and A = 1. Then, we corrupt the measurement data by adding a Gaussian sensor noise that is shown in Figure 5.10 (left top). Next, we aim at constructing a symbolic polynomial dictionary Φ by including the parameter µ as the dependent variables. While building a polynomial dictionary, it is important to choose the degree of the polynomial as well. Moreover, it is known that the polynomial basis becomes numerically unstable as the degree increases. Hence, solving optimization problem that discovers governing equations becomes challenging. With mean of this example, we discuss an assessment test to choose the appropriate degree of the polynomial in the dictionary. Essentially, we inspect data fidelity with respect to the degree of the polynomial in the dictionary. When the dictionary contains all essential polynomial features, then a sharp drop in the error is expected. We observe in Figure 5.10 (right-top) a sharp drop in the error at the degree 3, and the error remains almost the same even when higher polynomial features are added. It indicates that polynomial degree 3 is sufficient to describe the dynamics. Thereafter, using the dictionary containing degree 3 polynomial features, we seek to identify the minimum number of features from the dictionary that explains the underlying dynamics. We achieve this by employing RK4-SINDy, and compare the performance with Std-SINDy. We note down the discovered governing equations in Table 5.7, where we notice an impressive performance of RK4-SINDy to discover the exact form of the underlying parametric equations, and the coefficients are up to 1% accurate. On the other hand, Std-SINDy is not able to identify the correct form of the model. Furthermore, we compare the discovered model simulations using RK4-SINDy with ground truth beyond the training regime of the parameter µ in Figure 5.10 (bottom). It exposes the strength of the parsimonious and interpretable discovered models.

Discussion
This work has manifested a compelling approach (RK4-SINDy) to discover nonlinear differential equations without imposing any prior structure on models. For this, we have blended sparsity-promoting identification with a numerical integration scheme, namely, Runge-Kutta 4 th -order scheme. The beauty of the proposed methodology is that we do not require derivative information at any stage, notwithstanding we still discover differential equations. Hence, the proposed algorithm differs from previously suggested sparsity-promoting identification methods in the literature in this aspect. Consequently, we expect RK4-SINDy to perform better The top-left figure shows the noisy measurements that are obtained using various parameter µ. To identify correct degree polynomial basis in the dictionary, we do a assessment test, indicating degree-3 polynomial are sufficient to describe dynamics (topright). The bottom figures shows a comparison of simulations of the ground truth model and identified models for the parameter µ, illustrating the capability of generalizing beyond the training parameter regime.
under sparsely sampled and corrupted data. We have demonstrated the efficiency of the approach on a variety of examples, namely linear and nonlinear damped oscillators, a model describing neural dynamics, chaotic behavior, and parametric differential equations. We have accurately discovered the Fitz-Hugh Nagumo model that describes the activation and de-activation of neurons. We have also illustrated the identification of the Lorenz model and have shown that dynamics of identified models are intact on an attractor as it is more important for chaotic dynamics. The example of Michaelis-Menten Kinetics highlights that the proposed algorithm can discover models that are given by a ratio of two functions. The example also shows the power of determining parsimonious models -that is, their generalization beyond the region in which data are selected. Furthermore, we have demonstrated the remarkable robustness of the proposed RK4-SINDy algorithm to sparsely-sampled data and to corrupted measurement data. In the case of large noise, a noise-reduction filter such as Savitzky-Golay helps to improve the quality of discovered governing equations. We have also reported a comparison with the sparse identification approach [6] and have observed the out-performance of RK4-SINDy over the latter approach. This work opens many exciting doors for further research from both theory and practical perspectives. Since the approach aims at selecting the correct features from a dictionary containing a high-dimensional nonlinear feature basis, the construction of these feature bases in a dictionary plays a significant role in determining the success of the approach. There is no straightforward answer to this obstacle; however, there is some expectation that meaningful features may be constructed with the help of experts and empirical knowledge, or at least they may be realized in raw forms by them. Furthermore, we have solved the optimization problem (2.9) using a gradient-based method. We have observed that if feature functions in the dictionary are similar for given data, then the convergence is slow, and sometimes it even fails and is stuck in a local minimum. In other words, the incoherency between the feature functions is low. Hence, there is a need for the normalization step. In Subsections 5.3 and 5.4, we have employed a normalization step to improve incoherency. However, it is worth investigating better-suited strategies to normalize either data or feature spaces as a pre-processing step so that sparsity in the feature space remains intact. In addition to these, a thorough study on the performance of various noise-reduction methods would provide deep insights into their appropriateness to RK4-SINDy, despite we noticed a good performance of the Sabitzky-Golay filter to reduced noise in our results.
Methods discovering interpretable models that generalize well beyond the training regime are limited, and the proposed method RK4-SINDy is among these. Additionally, approaches discovering governing equations that also obey physical laws are even rarer. A very recent paper [44] has stressed that discovering/learning models can be made even more efficient by incorporating the laws of nature in the course of discovering equations. A solid example comes from the discovering biological networks that often follow the massconversation law. Therefore, integrating physical laws in discovering models and sparse identification will hopefully shape the future of discovering explainable and generalizable differential equations.