A Tracking Approach to Parameter Estimation in Linear Ordinary Differential Equations

Ordinary Differential Equations are widespread tools to model chemical, physical, biological process but they usually rely on parameters which are of critical importance in terms of dynamic and need to be estimated directly from the data. Classical statistical approaches (nonlinear least squares, maximum likelihood estimator) can give unsatisfactory results because of computational difficulties and ill-posedness of the statistical problem. New estimation methods that use some nonparametric devices have been proposed to circumvent these issues. We present a new estimator that shares properties with Two-Step estimator and Generalized Smoothing (introduced by Ramsay et al, 2007). We introduce a perturbed model and we use optimal control theory for constructing a criterion that aims at minimizing the discrepancy with data and the model. Here, we focus on the case of linear Ordinary Differential Equations as our criterion has a closed-form expression that permits a detailed analysis. Our approach avoids the use of a nonparametric estimator of the derivative, which is one of the main cause of inaccuracy in Two-Step estimators. Moreover, we take into account model discrepancy and our estimator is more robust to model misspecification than classical methods. The discrepancy with the parametric ODE model correspond to the minimum perturbation (or control) to apply to the initial model. Its qualitative analysis can be informative for misspecification diagnosis. In the case of well-specified model, we show the consistency of our estimator and that we reach the parametric root-n rate when regression splines are used in the first step.


Introduction
We consider a dynamical process defined by an Ordinary Differential Equation (ODE) with a known and fixed initial value Such a model is called an Initial Value Problem (IVP). The state x is in R d and θ is an unknown parameter, that belongs to a subset Θ of R p . f is a timedependent vector field from [0, T ] × R d × Θ to R d . This class of dynamical models are commonly used in physics, engineering, ecology,. . . [13,30,12,17]. Let t → X θ * (t) = X * (t) be the solution to the IVP (1) on [0, T ], for the true parameter set θ * . We want to estimate θ * from noisy observations Y i , i = 1, . . . , n of the trajectory X * , made at time t i . Estimation can be done by classical estimators such as Nonlinear Least Squares (NLS), Maximum Likelihood Estimator (MLE) [27] or Bayesian approaches ( [21], [14], [6] and [15] for example). Nevertheless, the statistical estimation of an ODE model by NLS leads to a difficult nonlinear estimation problem. These difficulties were pointed out by Ramsay et al. [34]: computational complexity comes from repeated ODE integrations required by the optimization algorithm; moreover, the usual criterion exhibits multiple local minima . Even though meta-heuristic methods can be proposed to circumvent the last issue, parameter estimation can be considered as an ill-posedness inverse problem [11], that needs alternatives to the classic statistical approaches.
As other two-step estimators, our method produces a minimum distance estimator [25] but it shares strong links with Generalized Smoothing approaches. Two-Step estimators were initiated by [39] and aims at minimizing a discrepancy measure between a nonparametric estimatorX and quantities characterizing the differential models. Usually, a Two-Step method is defined by the following procedure: 1. Construct a nonparametric curve estimator X from the data (t i , Y i ) 1≤i≤n , 2. Compute a model discrepancy measure R(X, θ), such as the weighted L 2 distance 3. Define the parameter estimator These estimators have a good computational efficiency vs NLS and they avoid repeated ODE integration. In practice, the used criteria are also smoother and easier to optimize than the NLS criterion. Two-step estimators are consistent in general, but there is a trade-off with the statistical precision, and some care in the use of nonparametric estimate˙ X has to be taken in order to keep the parametric rate [4,18]. The variance of TS estimator can be higher, but the use of other criterion, for instance based on a weak formulation of the differential equation can give competitive alternatives, in particular in high dimensional parameter space or small sample size, see [5].
In the case of Generalized Smoothing [34], the solution X * is approximated by a basis expansion that solves approximately the ODE model; hence, the parameter inference is performed by dealing with an imperfect model, as the collocation approximation of the ODE solution can be seen as a relaxation of the ODE model constraint, needed for taking into account some uncertainty about the model. Based on the Generalized Profiling approach, Hooker proposed a criteria that estimates the lack-of-fit through the estimation of a "forcing function" t → u(t) in the ODEẋ − f (t, x,θ) = u(t), whereθ is a previous estimate obtained by Generalized Profiling [5]. The objective of this paper is to provide a parameter estimate and an approximate solution to the ODE that • avoids the use a nonparametric estimate of the derivativeẊ as in two step estimators, • incorporates robustness in model specification and controls the quality of approximation, • introduces the use of infinite dimensional optimization tools, exploiting the differential structure of the model.
One interest of the latter point is to avoid the use of series expansion for function estimation, and avoid some arbitrary practical choices about the basis. Moreover, infinite dimensional optimization tools give powerful characterization of the solutions that gives an additional insight in Generalized Smoothing.
Our method provides a consistent parametric estimator when the model is correct. We show that it is root-n consistent and asymptotically normal. At the same time, we get a discrepancy measure between the model and the data under the form of an optimal control u analogous to the forcing function in [19].
In the next section, we introduce the notations and we motivate our approach by discussing the Generalized Smoothing approach, and the link with Optimal Control Theory. In section 3, we show that the estimator is consistent under some regularity assumption about the model. Then in section 4, we show that we reach the root−n rate using regression splines for X. Finally, we show the interest of our method on a toy model (and we make comparison with Nonlinear Least Squares and Generalized Smoothing) and we consider also a real data case, where a linear ODE is used for describing the isomerization reaction of α-Pinene.

The statistical model and a Generalized Smoothing wrap-up
We observe a "true" trajectory X * at n random times 0 = t 1 < t 2 · · · < t n = T , such that we have n observations (Y 1 , . . . , Y n ) defined as where ǫ i is the (random) observation error. We assume that there is a true parameter θ * belonging to a subset Θ of R p , such that X * is the unique solution of the linear ODEẋ with initial condition X * (0) = X * 0 ; where t → A θ (t) ∈ R d×d and t → r θ (t) ∈ R d . More generally, we denote X θ the solution of (5) for a given θ, and initial condition X * 0 . We assume that the initial condition X * 0 is exactly known, and we want to infer θ * from (Y 1 , . . . , Y n ). In the linear case, Duhamel's formula gives a closed form expression for X θ for t ∈ [0, T ] where the matrix-valued function Φ θ : (t, s) → Φ θ (t, s) is the so-called resolvant of the ODE. By definition, the resolvant is the solution of the homogenous ODĖ The estimation of θ * can be done straightforwardly with the Nonlinear Least Squares (NLS) estimator that minimizes In Generalized Smoothing (GS), parameter estimation is regularized by using an approximate solutions of the ODE (5), as GS takes advantage of the double interpretation of splines for smoothing data, and for numerical solving of ODE by collocation.
This first step is considered as profiling along the nuisance parameter β, whereas the estimation of the parameter of interest is obtained by minimizing the sum of squared errors of the proxyX(t, θ) bŷ Obviously, the estimator depends on the hyperparameter λ, that needs to be selected from the data in practice (some adaptive procedures have been proposed, see [22]). The essential difference with NLS is to replace the exact solution X θ (·) by an approximationX(·, θ) (that depends also on the data). This means that GS deals with 2 sources of errors: in addition to the classical statistical error (variance due to noisy data), there is an approximation error asX(·, θ) is a spline that does not solve exactly the ODE model (5). Indeed, collocation algorithms compute the coefficients of a B-spline expansion based on the relationships be-tweenX and its derivativeẊ evaluated on an appropriate grid of time points 0 = s 1 < s 2 < · · · < s p , [2]. This gives a nonlinear system that is usually solved with a Newton algorithm, whose roots are the unknown coefficients of the basis expansion. The collocation schemes are essentially useful for solving Boundary Value Problems (instead of the classical Initial Value Problem).
For parameter estimation, the basis expansion is defined in a somehow arbitrary manner and the ODE constraint is not used as an equality constraint as it should be the case in a "normal" collocation scheme. Instead, the ODE equation is transformed into an inequality constraint defined on the interval [0, T ] and the model constraint is never set to 0 because of the trade-off with the data-fitting term . For this reason, the ODE model (5) is not solved and it is useful to introduce the discrepancy term u θ (t) = β Tṗ (t) − A θ (t) β T p(t) + r θ (t) that corresponds to a model error. In fact, the proxyX(·, θ) satisfies a pertubed ODEẋ = A θ x + r θ +û θ . This forcing functionû θ is an outcome of the optimization process and can be relatively hard to analyze or understand, as it depends on the basis expansion used and it depends also on the data via the minimization of J n (β|θ, λ). Nevertheless, Hooker et al. have proposed goodness-of-fit tests based on this so-called "empirical forcing function"û θ , asû θ are the residuals but at the derivative scale and not at the state scale [19,20].
Based on these remarks, we introduce the pertubed linear ODĖ where the function t → u(t) can be any function in L 2 . The solution of the corresponding Initial Value Probleṁ is denoted X θ,u . Instead of using the spline proxyX(·, θ) for approximating X * , we use the trajectories X θ,u of the ODE (8) controlled by the function u.

The Tracking Estimator
Following the Generalized Smoothing approach, we look for a candidate X θ,u that can minimize at the same time the discrepancy with the data and the norm u L 2 . Moreover, we replace the classical Sum of Squared Errors by a smoothed dt based on a nonparametric proxyX. Hence, we consider the subsequent cost function for a given λ > 0. Moreover, for each θ in Θ, we introduce the infimum function obtained by "profiling" on the function u. The definition of S mimicks the minimization of J n (β|θ, λ) but it is more involved as it is defined on infinite functional space, instead of a finite dimensional vector space. Finally, our estimator is defined by minimizing the same function S i.e whereas the GS estimator minimizes a different criterion . This means that in our methodology, we try to find a parameter θ that maintain a reasonable trade-off between model and data, whereas the Generalized Smoothing Estimatorθ GS is dedicated to fit the data with the proxyX(·, θ), without considering the size of model error represented byû θ .
Before going deeper into the interpretation and analysis of our estimator, we need to show that the criterion S X ; θ, λ is properly defined and that we can obtain a tractable expression for computations and for the theoretical analysis of (11). The existence of S is a direct consequence of the so-called Linear-Quadratic Theory (LQ Theory), which belongs to the broader field of Optimal Control Theory [26,37,29,9]. In our case, we consider the control of linear ODE with a quadratic cost function that enables to have quite general and simple results. This is possible because we have replaced the discrete sum of squared errors by an integral criterion where the original data have been replaced by a nonparametric proxyX. Thanks to that, we can use directly calculus of variations and optimal control [24,9]. For completeness, we recall briefly in the appendix the main results of LQ Theory.
Theorem and Definition of S (ζ; θ, λ). Let t → ζ(t) be a function belonging to H 1 ([0, T ] , R d ) and X θ,u be the solution to the controlled ODE (8). For any θ,λ, there exists a unique optimal controlū θ,λ that minimizes the cost function The controlū θ,λ can be computed in a "closed-loop" form as where E and h are solutions of the Final Value Problems and The cost (12) is usually used for solving the so-called "Tracking Problem" that consists in finding the optimal control u to apply to the ODE (8) in order to reach a target trajectory t → ζ(t), see [37] for an excellent introduction. The estimation problem is then to determine the parameter θ so that the corresponding ODE need a small control u (in L 2 norm) in order to be close to the noisy trajectory t →X(t).
classical Sobolev space of L 2 (weakly) differentiable function, see [3]. The derivative is defined in the weak sense, so it allows us to consider non-parametric estimator with some (controlled) discontinuities. Of course, every differentiable functions belong to H 1 and the weak derivative coincides with the classic one.
Remark 2.2. We insist on the fact that t → E(t), h(t) depends also on θ, λ and ζ because of their definition via equation (14). Nevertheless, we do not write it systematically for notational brievety. As mentioned in the theorem, it is possible to compute X θ,u θ,λ in a "closed-loop" form as we can solve in a preliminary stage the 2 equations (14) that gives the function E and h for all t ∈ [0, T ]. Then, we just need to solve the ODĖ (15), we see that S depends smoothly in θ and λ, as in ζ. This was not easy to see from the infimum definition (10), but as the minimum is reached, and attained for a known function, we can have even more information than in the Generalized Smoothing approach based on splines.
Remark 2.4. Our pertubated ODE framework permits to consider naturally the problem of model misspecification, when the true model iṡ is an unknown function. We do not provide any theoretical analysis for this kind of model misspecification, but we consider it in the Experiments section, in order to gain some insight. We will see in a simple example that our estimator allows us to have more accurate estimation than classical NLS estimator in that case. Moreover we can propose a proper correction term to add to the initial model to counteract the misspecification in order to lower the prediction error.
The next section is dedicated to the derivation of the regularity properties of S. Thanks to the use of a functional formulation and the associated Linear-Quadratic theory, we show the smoothness in ζ and θ, and compute directly the needed derivatives.

Consistency of the Tracking estimator
Under reasonable and practical assumptions, we can assert that the tracking estimator (11) is a consistent estimator of θ * when the ODE model (5) is wellspecified, and when we use a consistent nonparametric estimatorX. In practice, it is quite common to use a smoothing spline or a kernel smoother in order to smooth the data and estimates roughly the trajectory X * . As the tracking estimator is an M-estimator, we can employ the classical approaches for consistency that relies on the regularity and convergence of the stochastic criterion S X ; θ, λ to the asymptotic criterion S (X * ; θ, λ). Hence, we need to show some regularity in ζ, uniformly in θ. Similarly, in order to compute the rate of convergence and the variance of the estimator, we will need to check the smoothness w.r.t θ.

Regularity properties of S(ζ; θ, λ)
We introduce some necessary assumptions about the ODE model in order to derive the needed regularity as well as the identifiability property. The conditions are According to the context, · 2 denotes the Euclidean norm in R d ( 2 dt. Continuity and differentiability have to be understood w.r.t these previous norms. For the computation of S X ; θ, λ (and S (X * ; θ, λ)), we need some additional notations. In particular, we recall that the Riccati equationĖ = λ depends on the model (5) , but it does not depend on the dataX, whereas it is the case for h, as we . For this reason, we introduce the functions α and β defined by We denote then h θ the solution to the Final Value Problem and h * the solution corresponding to case ζ = X * . More generally, we will denote t → h θ (t, ζ) for any target trajectory ζ.
We introduce also the matrix-valued function (t, s) → R θ (t, s) defined for all t, s in [0, T ], as the solution of the Initial Value Problem and where the time has been reversed in the function α θ . We show in the next proposition that ∀ζ ∈ H 1 ([0, T ]), θ → S(ζ; θ, λ) is well defined, i.e finite on Θ.
We complete our analysis by showing that S is continuously differentiable in θ.

Proposition 3.2. Under conditions C1-C3
Condition 3, jointly with proposition 1 and 4 in the supplementary materials give the continuity of This is enough to show the continuity of θ −→ S(X; θ, λ) on Θ. Moreover, the gradient w.r.t θ of S(X; θ, λ) is equal to: In addition to the previous proposition, condition 4 and proposition 7 in supplementary material gives the continuity of (θ, X) −→ t −→ ∂(h θ (t,X)) ∂θ on Θ. This is enough to show the continuous differentiability of S(X; θ, λ) on Θ.
The last regularity properties justifies the use of classical optimization method to retrieve the minimum of S.
In the next proposition, we show that the criteria S(X; θ, λ) can be expressed without using the derivativeẊ (thanks to the knowledge of the initial condition). As a consequence, our estimator is less sensible to the nonparametric noise than classical Two-Step estimators.
Proof. We will show S(X; θ, λ) can be written using only X and notẊ. First of all we will use Lemma (B.3) to get rid ofẊ in´T 0Ẋ (t) T h θ (t, X)dt, it gives: And so we can write S(X; θ, λ) under the form we see S(X; θ, λ) does not depend onẊ.

Consistency
As we have seen previously, conditions 1 and 3 ensure the existence of S(X; θ, λ) and S(X * ; θ, λ) for all θ ∈ Θ. We derive the consistency ofθ T by showing the uniform convergence of the criterion S X ; θ, λ , and by insuring that θ * is a unique and isolated global minima of S (X * ; θ, λ). Condition 2 is then sufficient to show that S (X * ; θ, λ) characterizes well θ * , as global unique minimum. Hence, identifiability and convergence in supremum norm are sufficient to imply consistency (theorem 5.7 in [38]).
by using the same notation as in proposition 3.1.
. Application of the proposition 3.4 gives us the identifiability criteria. Hence we conclude by using the theorem 5.7 in [38].

Asymptotics of θ T
Our objective in this part is to derive the proper rate of convergence of the Tracking Estimator, as well as its asymptotic distribution. The properties of the estimator depends on the behavior of the nonparametric estimateX used for the approximation of X * . In order to fix ideas, we consider a regression spline, with a B-Spline decomposition of dimension K (increasing with n). That is we consider that X is defined as where β K is computed by least-squares. It is likely that we could derive the same kind of results for different estimates, such as Local Polynomial or Smoothing Splines, as they behave similarly asymptotically, and that we show that the Tracking Estimate can be approximated by a plug-in estimate of a specific linear functional ofX. We introduce additional regularity conditions needed for the asymptotics: Under these additional conditions, we show thatθ T reaches the parametric convergence rate, and that it is asymptotically normal. Our strategy consists in two stages: Stage 2 (Proposition 4.4) We prove that Γ X − X * is asymptotically normal for regression splines, based on the properties of plug-in estimators computed with series estimators and derived in [31].
Remark 4.1. Condition C5 is a classic feature for M −estimator to ensure local identifiability, here: Condition C8 is a classic feature for non-parametric estimator to ensure optimal convergence rate of X using bias-variance tradeoff.
where Γ : To obtain the final result, we only have to combine the two previous propositions: Theorem 4.5. If X is a regression spline and conditions C1-C8 are satisfied, then θ T − θ * is asymptotically normal and Remark 4.6. The asymptotic linear representation given by proposition 4.3 allows us to obtain an expression for the asymptotic variance. This latter is given by the formula (33) in appendix D as well as a plug-in consistent estimator.

Experiments
We use several simple test beds for evaluating the practical efficiency of the Tracking Estimatorθ T , and we compare it with the performance of NLSθ N LS and of Generalized Smoothingθ GS . The different models are linear in the states, and they can be linear and nonlinear w.r.t parameters. We use several sample size and several variance error for comparing robustness and efficiency for varying sample size and noise level.

Experimental design
For a given sample size n and noise level σ, we estimate the Mean Square Error and the mean Absolute Relative Error (ARE) by Monte Carlo, based on N MC = 100 runs. For each run, we simulate an ODE solution with a Runge-Kutta algorithm (ode45 in Matlab), and a centered Gaussian noise (with variance σ) is added, in order to obtain the Y i 's. We compare also the quality of estimation ofθ T , θ GS and θ N LS based on their prediction quality. We compute the Prediction Error where • Y * is a new observation drawn from the true model (4), • X θ is the solution to the linear ODE (5) with parameter θ.
It should be emphasized that parameter estimation and prediction error minimization are two different problems, although they are related. Parameter estimation is required when parameters are directly of interest, for instance for a deep understanding of the inner dynamics of the system. One put forward prediction error when the aim is only to quantitatively predict the system state. Our primary interest is parameter estimation but we also discuss prediction performance for the three methods.
The nonparametric estimateX required in the first step is a spline defined with a uniform knots sequence ξ k , k = 1, . . . , K. For each run and each state variables, the number of knots is selected by minimizing the GCV criterion, [36]. We discuss in the next section an automated method for selecting adaptively the hyperparameter λ.

Selection of λ
The criterion S X ; θ, λ is based on a balance between data fidelity and model fidelity. When λ → 0, we can select any u in order to interpolateX. In that case, θ has almost no influence on S X ; θ, λ value. When λ → ∞, the optimal perturbation u −→ 0, and we get a NLS-like criterion where the observations Y i 's are replaced by the proxy X. Because of this dramatic influence, we propose to select λ by minimizing the Sum of Squared Errors

Gradient computation
For optimization purpose, we need to compute the gradient ∇ θ S(X; θ, λ) which involves both ∂ h θ ∂θ and ∂E θ ∂θ . These partial derivatives can be obtained by solving the sensitivity equations, that gives the function values at each time t ∈ [0, T ]. Nevertheless, the size of the ODE to solve grows quickly, as the sensitivity systems is of size (d 2 + d) × p, and it becomes a computational burden for the optimization process. For this reason, we use the adjoint method to compute gradient expression [8]. This method exploits the fact that we do not need a point-wise computation of ∂ h θ ∂θ and ∂E θ ∂θ but only some integral of the derivatives. An explanation of this method for gradient and Hessian computation can be found in [1]. The Riccati ODE can be written in vector form (of size D := d 2 +d) where F is the row formulation of the Riccati ODE vector field and the solution is the vector Hence, we can compute ∇ θ S(X; θ, λ) thanks to the formula and P is the so-called adjoint vector of size D = d 2 + d, solution of the adjoint modelṖ (t) = ∂g(Q θ (t),θ,t) ∂Q − P (t). ∂F ∂Q (Q θ , θ, t) P (0) = 0 The computational details for ∂g ∂θ , ∂g ∂Q , ∂F ∂θ , ∂F ∂Q are left in appendix B. The adjoint method is more efficient than the direct sensitivity approach, as we need to solve an ODE of size D, instead of a system of size D × p, which is valuable when the number of parameters increases.

Linear w.r.t parameter
We consider here the basic model linear in parameteṙ with initial condition equal to X * 0 = 1. It is the simplest model we can consider, here the solution of the Cauchy problem is x(t) = X * 0 e at and we will use this closed form for the NLS estimator. Here we have tested two different sample size n = 50 and n = 20 (observations were uniformly distributed between 0 and 5) and two different noise level σ = 2 and σ = 4. The lambda values tested are the 40 values uniformly distributed between 10 and 400. The obtained results are presented in table 1. In every cases, the Tracking estimator gives more precise estimation than NLS and GS estimators (both in term of MSE and ARE), but this improvement is not impressive as MSE is expressed at scale 10 −5 and ARE at scale 10 −3 . The differences are small among the different estimation methods and estimations are reliable in all cases (even for the GS approach). This example mainly illustrates tracking approach is a relevant estimation method and can compete with the most used methods for simple model.

Nonlinear w.r.t parameters
Well-specified model We consider a following time dependent model that is non-linear in parameters. We test two sample sizes n = 50 and n = 20 (observations are uniformly distributed between 0 and 15) and two noise levels σ = 2 and σ = 4. The true parameter value is θ * = (θ * 1 , θ * 2 ) = (1.4, 1) and the initial condition is equal to X * 0 = 1. The sequence of lambda used is λ v = 10 k −1≤k≤7 . The results are presented in table 2. The Tracking and NLS estimators have equivalent performance in the well specified case. The GS approach gives a less precise estimation but it is still reliable, but the prediction error for GS is much more important than for the Tracking and NLS estimators. This illustrates the high sensitivity of ODE model w.r.t parameter and the need of accurate estimates for prediction, even for simple models.
Misspecified model The data is generated by a pertubed model with θ * 1 = 1.4, θ * 2 = 1 and X * 0 = 1. Nevertheless, we still use the model (21) for the parameter estimation. We use the two sample size n = 100 and n = 50 (observations were uniformly distributed between 0 and 15) and the two noise levels σ = 2 and σ = 4. We use a sequence of hyperparameter λ v = 10 k , 5 × 10 k −4≤k≤0 . Moreover, we are interested in the use of the residual control u obtained along the parametric estimation in order to propose a "corrected" model: for minimizing the "corrected" prediction error When model misspecification is suspected, we can consider two trajectory predictors X θ,0 and X θ,u for a given λ. From the definition of the criterion S, the corrected trajectory X θ,u is prone to give smaller prediction errors than the misspecified trajectory X θ,0 . This indicates that we should select the hyperparameter λ in a different way in the case of misspecification, if we want to use the correction u that depends also on λ. For this reason, we propose to select λ by minimizing the Corrected Sum of Squared Errors as a proxy for the prediction errror (24). Finally, we have two tracking estimators θ T and θ T c based on two choices of hyperparameter (λ that minimizes SSE and λ c that minimizes CSSE). The estimation of the perturbation (or control u) is done in two ways: • for the Tracking estimator θ T , it is provided directly by the estimation process; • for the NLS estimator θ N LS , we propose to estimate the perturbation based on the nonparametric proxy X For the Generalized Smoothing, we do not have to estimate the perturbation u, as the penalized spline X(·,θ GS ) already contains the model misspecification.
Results are presented in table 3. The two first columns gives the parametric estimation performance in terms of MSE and ARE. The third column gives an estimation of (19) and the fourth one an estimation of (24).
We can see θ T gives more accurate parametric estimation than the NLS estimator. The use of residual control in X θ,u improves the prediction error in any case. At the contrary, the correction of the NLS estimate with the estimated control makes things worst, which can be explained by the use of the non-parametric estimator of the derivative. The Generalized Smoothing estimator θ GS competes well with others approaches, that can be explained by the relaxation introduced by the collocation which makes the method robust in misspecification presence. However, we observe a fast drop in estimation precision w.r.t noise augmentation and poor performance for prediction purpose.
The corrected estimator θ c T gives higher precision than θ T for small measurement error σ. We also notice the dramatic drop in prediction error by using the corrected model instead of the initial one. It is due to the fact we effectively take into account an exogeneous perturbation using u for parametric estimation. We simultaneously estimate the parametric part and the nonparametric Finally, we are also interested in the control u itself, even though we do not expect it gives us an estimation of the true control u * (t) = sin(t) (because of identifiability issues). Its features can give hints about potential misspecification presence and qualitative informations about its nature. We plot in figure 1 the mean control obtained in the case (n, σ) = (100, 2) in blue and the true control u * (t) = sin(t) in green for the sake of comparison. Although, the scale is not the same, we can see that the estimated control exhibits some features of the u * , such as oscillations, with the same approximate period. The pertubation could be used as an exploratory tool for analyzing and inferring the missing part in the dynamics.

α−Pinene model
The "α−Pinene model" is a model introduced in [35] for modeling the isomerization of α−Pinene. It is an autonomous linear ODE in R 5 with a sparse structurė It is an Initial Value Problem, with known initial condition X * 0 = (100, 0, 0, 0, 0) . This estimation problem is still considered as cumbersome and many estimation methods fails to converge or converge to bad local solutions because of high correlation between θ 4 and θ 5 . Before analyzing the real dataset, we perform a simulation study for evaluating the difficulty of the estimation problem, and benchmarking the several estimators.

Simulated data
The observation interval is [0, 100] and the true parameter is θ * = (5.93, 2.96, 2.05, 27.5, 4) × 10 −4 . Because of dramatic differences in the order of magnitude of the state variables, the noise standard deviation has to be rescaled componentwise. Here for a given σ value the standard deviation applied to the state variable X i is equal to σ 100 × 1 T´T 0 X i (t)dt, 1 T´T 0 X i (t)dt being the X i mean value on the observation interval. For the Tracking estimator, we use a sequence of hyperparameter λ v = 10 k , 5 × 10 k 0≤k≤5 . The results are presented in table 4. The Tracking and Generalized Smoothing estimators give more accurate parameter estimation than the Nonlinear Least Squares. In the last case, GS gives the best performance in terms of ARE. In this model, the Relative Error is especially relevant to quantify parameter precision because of important differences between the scale of θ 4 and the other parameters. As expected the difference in performance mainly comes from the estimation of the couple (θ 4 , θ 5 ). This model shows that the NLS is appropriate for parameter estimation whereas the GS seems to favor the quality of prediction, showing that estimation and prediction are somehow two competing objectives. The Tracking estimator realizes a trade-off between these two objectives.

Real data analysis
We use the data coming from [13], and presented in table 5. They consist in simultaneous measures of the 5 components relative concentration at eight time steps. We compare our results with the previous estimation θ b = 10 −4 × (0.593, 0.296, 0.205, 2.75, 0.4) obtained in [35], which provides a good data fitting. We use the Tracking method for the estimation of the parameter and of the residual control u. In order to avoid numerical problems, we have divided the observation by 1000 and renormalized the parameters (as the system is autonomous).
For the non-parametric estimator X, we use only two nodes: one at the end and one at the beginning of the observation interval. We have also impose to the non-parametric estimator to start from the known initial condition X * 0 = (100, 0, 0, 0, 0). For the Tracking estimator, we use λ = 10 k , k = 1, 2 · · · , 11 and λ = 5 × 10 k , k = 1, 2, 3, 4 and we select the hyper-parameter that minimizes the SSE (named λ SSE . We call the corresponding estimator θ SSE T Tab. 6: Parameter estimates We have obtained λ SSE = 100. For λ ≥ 5000, the estimated values are almost constant and equal to θ = (0.583, 0.295, 0.207, 2.259, 0.238). The first three estimated parameters are close to the estimates given in [35], but θ 4 and θ 5 are different; however, we obtain good predicted curves, similar to Rodriguez et al., see figure 2.
We can compute the minimal control u SSE corresponding to ( θ SSE T , λ SSE ), and we can also compute the minimal control corresponding to θ = θ b for a given value of λ (that is u θ b ,λ ) and compare its norm with u SSE when λ = λ SSE . The controls are represented in figure 3 where the curves plotted with × represent the minimal control obtained for θ b and the curves plotted with • are the control obtained with θ SSE T . Here, the Tracking control is a five-dimensional vector, where each entry u i corresponds to one state variable X i . The control plotted in yellow corresponds to X 1 , the one in black correspond to X 2 , the one in green correspond to X 3 , the one in blue correspond to X 4 and the one in red to X 5 . As expected, the estimated control u SSE is smaller in L 2 − norm for θ SSE T than for θ b . Nonetheless according to 3, there is no clear difference between u θ b ,λ and u SSE except for their component related to X 5 (in red on the figure  3) and which is the state variable exclusively related to the parameters θ 4 and θ 5 (the most difficult parameters to estimate according to [35]. Even though the two resulting solutions X θ b and X θSSE T are close, the insight given by u θ b ,λ and u SSE shows stronger differences at the dynamic scale.

Conclusion
We have introduced a new estimation method for parameter estimation in linear ODE based on relaxation of the initial model. Similarly to the Generalized Smoothing estimator, we end up with a function X θ,u that is an approximate solution of the ODE model of interest. The added perturbation u enables to take into account the noisy observations but also some uncertainty in the model. Quite remarkably, the trade-off between model and data discrepancy is formulated and solved by using Optimal Control Theory and this work is one of the first use of this theory for dealing with statistical estimation and optimization in infinite dimensional spaces. Moreover, this functional framework and the Riccati theory gives practical algorithms and theoretical insight in the properties of the estimator that permits a detailed analysis of the statistical properties. In addition to the parameter estimate, we obtain also directly an estimate of the  [20].
In the experiments, we show that the Tracking estimator have similar or better performances than nonlinear least squares or generalized smoothing, even in the case of very simple models with closed-form expression. Hence, paradoxically, the use of perturbed model and of nonparametric estimators can ameliorate the statistical efficiency of standard estimates, even in well-specified cases.
In the case of model misspecification, the differences are bigger, as the relaxation brought by λ gives us a more robust estimation method (comparing to NLS) which can deal with small model definition imperfection. Moreover, the optimal control obtained for a given parameter estimate allows us to minimize the prediction error by introducing a proper correction term to the initial model. This control can also be used as a qualitative tool to diagnose model misspecification.
However, we are aware of some limitations of our method: first, we assume that the initial condition is known. We can consider X * 0 as an additional parameter to estimate, and reformulate our approach for doing simultaneous observations. The second but the main limitation is the linear assumption about the ODE. Although linear ODEs are common in applications, numerous useful models are nonlinear and thus our methodology cannot be applied directly. Nevetheless, our work can be extended by using more general results of optimal control, such as Pontryagin Maximum Principle that can offer efficient characterization in the general nonlinear case.
It exists a unique optimal trajectory zū associated to the unique optimal control u(t) = U −1 (t)E(t)B(t)z u (t) where E is the matrix solution of the Riccati ODE: Proof. (This proof is presented in Sontag's book "Mathematical Control Theory" [37] chapter 7 theorem 30) By using theorem A.1 and if we define the quadratic cost: Let us reason by contradiction: if E(t)is not bounded then ∃t e ∈ [0 , T ] s.t lim t→te+ E(t) 2 = +∞. It implies: We also know it exists a unique optimal trajectory for the LQ problem on [t 0 , T ] with x(t 0 ) = x 0 and the associated optimal cost is −x T 0 E θ (t 0 )x 0 . But by minimality of this cost it has to be majored by the cost C(t 0 , 0, λ) i.e the cost associated to the control u = 0. We can see it exists a constant D > 0 such C(t 0 , 0, λ) is majored by D x 0 2 2 and so: (27) and finish the proof.
) is an affine function of X and can be written under the form: with R θ defined by (16).
Proof. Considering the backward ODE: We know thanks to Duhamel formula: Taking the value of β and using integration by part we have: And using resolvant property we finally obtain: Proof. Integration by part give us: Moreover using affine nature of h w.r.t X and using the same notation as in proposition lemma B.2:

B.2 Consistency Proof
In the following proposition B.4 we show S( X; θ, λ) − S(X * ; θ, λ) is controlled by the distance between X and X * and between h and h * . In proposition B. S( X; θ, λ) − S(X * ; θ, λ) With : Proof. For the sake of notation we will consider the homogenous case i.e r θ (t) = 0 By triangular inequality we have: dt Now we will separatly bound each of the three previous terms: The first one: The last inequality has been obtained thanks to Cauchy-Schwarz inequality. The second one inequality is a bit cumbersome in terms of computation. For the sake of clarity we left some computational details in lemma B.3 and we obtain with the same notation: T 0˙ Hence we can formulate S( X; θ, λ) without the derivative form expression and the last decomposition allows us to bound ´T By use of norm inequalities we obtain the following bounds: And we obtain for the second part: For the third one we have: Hence by summing we finish the proof.
Proposition B.5. Under conditions 1 and 3 ∀θ ∈ Θ we have: Proof. Thanks lemma B.2 we have the following affine dependance of h w.r.t X: Taking the norm gives us: Using condition C1 and C3 and the upper bound R θ (T − t, T − s) 2 ≤ de √ d A+ E λ T thanks to proposition 3 in supplementary material. Finally we obtain:

B.3 Asymptotic normality proof
The proof of continuity of some functionals useful for proposition 4.3 are left in the supplementary materials, as they require cumbersome computations and they does not provide particular insights in the mechanics of the proofs.
Proof. For the sake of notational simplicity here θ T will be simply denoted θ.
The first order optimal condition is ∇ θ S(X; θ, λ) = 0 Equivalently, we havê We use the following decomposition for A θ (t). X −˙ X and h θ (t, X): with θ being a random point between θ * and θ and N defined as in lemma B.2. By replacing in (30), we obtain: Thanks to propositions in supplementary material, the following functionals Proof. This proposition is a direct consequence of Theorem 9 in [31]. The conditions to be satisfied are is bounded, and V ar(Y | t) is bounded away from 0.
3. The support of t is a compact interval on which t has a probability density function bounded away from 0.
is finite and non-singular such that: is derivable of order s on the support of t. For the fourth requirement we will consider the monodimensional case d = 1. We know that Γ is linear and continuous on L 2 [0, T ] , R d thanks to conditions C1 and C3-4 and hence differentiable with: D(Γ)(X * )(X) = Γ(X) . By the Riesz-Frechet representation theorem we have: v ∈ L 2 ([0 , T ] , R) s.t Γ(X) =´T 0 v(t)X(t)dt which verify the three conditions of the forth requirement. Starting from the mono-dimensional case, multi-dimensional case can be made componentwise.
Requirement 5 is a simple consequence of the condition C8.

C Gradient Computation : Adjoint Method & Sensitivity equation C.1 Notation and partial derivative computation
For optimization purpose we need to compute the gradient of S( X; θ, λ). For this we will present two methods: a direct approach using sensitivity equation and a second one using adjoint method.
C.1.1 Row vector notation for the vector field of the general Riccati equation We will define the solution of the general Riccati equation in row formulation, we introduce It is a D := d 2 + d sized function respecting the ODE : by introducing the general vector field F : with G and H defined by: and A θ,i beeing the i − th column of A θ . We also introduce: In order to write our system under the row form: S( X; θ, λ) :=´T 0 g(Q θ (t), θ, t)dt Q θ = F (Q θ , θ, t) Q θ (T ) = 0 For the next subsections we will drop dependence in θ for A θ , r θ , E θ , h θ
We have: Because: where A t j is in i−th position and A t i is in j − th position.
is in i − th position and E t i is in j − th position. And: ∂θ where ∂Ai ∂θ = ∂Ai ∂θ1 · · · ∂Ai ∂θp a d × p matrix

D Asymptotic variance expression
We know asymptotically θ T − θ * behaves as: ∂θ the hessian of the asymptotic criteria at θ = θ * and: