On penalized estimation for dynamical systems with small noise

We consider a dynamical system with small noise whose drift is parametrized by a finite dimensional parameter. For this model we consider minimum distance estimation from continuous time observations under an $l^p$-penalty imposed on the parameters, in the spirit of the Lasso approach, with the aim of simultaneous estimation and model selection. We study the consistency and the asymptotic distribution of these Lasso-type estimators for different values of $p$. For $p=1$ we also consider the adaptive version of the Lasso estimator and establish its oracle properties.


Introduction
Ordinary differential equation models are usually the result of averaging or of neglecting some details of an original system, thus avoiding the explicit modeling of a complex system with a huge number of degrees of freedom or tuning parameters. Introducing noise is therefore a way to bring the model closer to the reality of observable complex systems. It is then natural to think of the noise as small, for example when one considers the dynamics of macroscopic quantities, i.e. averages of quantities of interest over a whole population, or in the case of a signal that travels through a perturbed medium, etc.
Model selection is an important, although sometimes neglected, aspect of the applied fields mentioned above. What occurs for dynamical systems with small noise is not so different from what happens in ordinary least squares (OLS) model estimation. Indeed, linear regression models are used extensively by many practitioners but, once estimated, these models are useful only as long as the set of parameters (or covariates) is correctly specified. Therefore, the model selection step is an important part of the analysis.
To introduce the idea of Lasso-type estimation we begin with linear models and OLS. In this framework, model selection occurs when some of the regression parameters are estimated as zero. Different models are usually compared in terms of information criteria like AIC/BIC or through hypothesis testing. The advantage of the Lasso-type approach over AIC/BIC is that the statistical models do not need to be nested: one can rather construct a single large parametric model merging two orthogonal models and let the selection method choose one of the two models [3].
Variable selection becomes particularly important when the true underlying model has a sparse representation. Correctly identifying significant predictors will improve the prediction performance of the fitted model (for an overview of feature selection see [7]).
Consider the linear regression model $Y_i = x_i^T \beta + \varepsilon_i$, where $x_i$ is a vector of covariates, $\beta$ a vector of $q > 0$ parameters and the $\varepsilon_i$ are i.i.d. Gaussian random variables. [23] proposed the following $l^p$-penalized estimator for $\beta$:
$$\hat\beta_n = \arg\min_{\beta} \Big\{ \sum_{i=1}^n (Y_i - x_i^T \beta)^2 + \lambda_n \sum_{j=1}^q |\beta_j|^p \Big\}, \qquad (1.1)$$
for some $p > 0$ and $\lambda_n \to 0$ as $n \to \infty$. The family of estimators $\hat\beta_n$ solution to (1.1) is a generalization of the Ridge estimators, which correspond to the case $p = 2$ (see [5]). The original Lasso estimators proposed in [35] are obtained by setting $p = 1$, while OLS is the case $\lambda_n = 0$, not considered here. The link between Lasso-type estimation and model selection is also due to the fact that, in the limit as $p \to 0$, this procedure approximates the AIC or BIC selection methods (which correspond to $p = 0$ with $\lambda_n > 0$); i.e.
$$\lim_{p \to 0} \sum_{j=1}^q |\beta_j|^p = \sum_{j=1}^q 1_{\{\beta_j \neq 0\}},$$
which amounts to the number of non-null parameters in the model. Here $1_A$ denotes the indicator function of the set $A$.
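For the linear model, the estimator in (1.1) can be computed numerically in a few lines. The following Python sketch is purely illustrative: the simulated design, the sparse true vector $\beta = (2, 0, -1, 0)$ and the value of $\lambda_n$ are assumptions made here for demonstration, not choices prescribed by the theory.

```python
import numpy as np
from scipy.optimize import minimize

def lp_penalized_ls(Y, X, lam, p):
    """l^p-penalized least squares as in (1.1):
    minimize ||Y - X b||^2 + lam * sum_j |b_j|^p.
    Powell's derivative-free method is used since the penalty
    is non-smooth at zero when p <= 1."""
    obj = lambda b: np.sum((Y - X @ b) ** 2) + lam * np.sum(np.abs(b) ** p)
    return minimize(obj, np.zeros(X.shape[1]), method="Powell").x

# Illustrative usage with a sparse truth beta = (2, 0, -1, 0).
rng = np.random.default_rng(0)
n = 200
beta_true = np.array([2.0, 0.0, -1.0, 0.0])
X = rng.normal(size=(n, len(beta_true)))
Y = X @ beta_true + rng.normal(size=n)
print(lp_penalized_ls(Y, X, lam=2 * np.sqrt(n), p=1))  # null entries shrunk toward 0
```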
As said, the estimators solving (1.1) are attractive because they make it possible to perform estimation and model selection in a single step: the procedure does not need to estimate different models and compare them afterwards through information criteria, since the dimension of the parameter space does not change; simply, some of the components $\beta_j$ of the vector $\beta$ are estimated as exactly zero. In non-linear models, a preliminary simple reparametrization (e.g. $\beta \to \beta - \beta_0$) is needed to interpret this approach in terms of model selection.
In this work, we extend the problem in (1.1) to the class of diffusion-type processes with small noise, solutions to the stochastic differential equation
$$dX_t = S_t(\theta, X)\, dt + \varepsilon\, dW_t, \qquad t \in [0, T],$$
by replacing least squares estimation with minimum distance estimation. The asymptotics are considered as $\varepsilon \to 0$ for fixed $0 < T < \infty$, with $\theta \in \Theta \subset \mathbb{R}^q$ a $q$-dimensional parameter.
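Although the theory below concerns continuous-time observations, trajectories of this model are easily visualized via the Euler-Maruyama scheme on a fine grid. The drift used in the following sketch is a hypothetical choice for illustration only.

```python
import numpy as np

def simulate_small_noise(theta, S, eps, x0=1.0, T=1.0, n=1000, seed=1):
    """Euler-Maruyama scheme for dX_t = S_t(theta, X) dt + eps dW_t."""
    rng = np.random.default_rng(seed)
    dt = T / n
    t = np.linspace(0.0, T, n + 1)
    X = np.empty(n + 1)
    X[0] = x0
    for i in range(n):
        drift = S(theta, t[i], X[: i + 1])  # non-anticipative: past of X only
        X[i + 1] = X[i] + drift * dt + eps * np.sqrt(dt) * rng.normal()
    return t, X

# Hypothetical drift, linear in theta: S_t(theta, X) = theta_1 X_t + theta_2 sin(t).
S = lambda theta, t, path: theta[0] * path[-1] + theta[1] * np.sin(t)
t, X = simulate_small_noise(theta=np.array([-0.5, 1.0]), S=S, eps=0.05)
```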
Since the seminal works of [25,26,27] and [42], statistical inference for continuously observed small diffusion processes is by now well developed (see, e.g., [28,18,19,44,40]), but the Lasso problem has not been considered so far. Although here we consider only continuous time observations, it is worth mentioning that there is also a growing literature on parametric inference for discretely observed small diffusion processes (see, e.g., [9,14,31,32,33,36,37,38,39,11,12]) to which this Lasso problem can be extended. Adaptive Lasso-type estimation for ergodic diffusion processes sampled at discrete times has been studied in [4], while for continuous time ergodic diffusion processes shrinkage estimation has been considered in [29]. This paper is organized as follows. In Section 2, we introduce the model, the assumptions and the statement of the problem. In Section 3, we study the consistency of the estimators and derive their asymptotic distribution for different values of $p$. For $p = 1$, we also consider the case of adaptive Lasso estimation, which is meant to control the asymptotic bias; we prove that the adaptive estimator is an oracle procedure.

The Lasso-type problem for dynamical systems with small noise
Let us assume that, on the probability space $(\Omega, \mathcal{F}, P)$ with filtration $\{\mathcal{F}_t\}_{0 \le t \le T}$, the observed process $X = \{X_t,\ 0 \le t \le T\}$ is a real valued diffusion-type process solution to the following stochastic differential equation
$$dX_t = S_t(\theta, X)\, dt + \varepsilon\, dW_t, \qquad t \in [0, T], \qquad (2.1)$$
with non-random initial condition $X_0 = x_0$, where $S_t(\cdot, X)$ is a known measurable non-anticipative functional (see, e.g., [13]). The parameter $\theta \in \Theta \subset \mathbb{R}^q$, where $\Theta$ is a bounded, open and convex set, is supposed to be unknown.
We suppose that the trend coefficient in (2.1) has the form
$$S_t(\theta, X) = V(\theta, t, X_t) + \int_0^t K(\theta, t, s, X_s)\, ds,$$
where $V(\theta, t, x)$ and $K(\theta, t, s, x)$ are known measurable, non-anticipative functionals such that (2.1) has a unique strong solution. For example, the usual conditions (1.34) and (1.35) in [27] and Theorem 4.6 in [13] on Lipschitz behavior and linear growth are sufficient.
Assumption 1. The functionals $V(\theta, t, x)$ and $K(\theta, t, s, x)$ satisfy a Lipschitz-type condition in the path variable of the form (1.34)-(1.35) in [27], where $L_1$ and $L_2$ are positive constants and $H_s$ is a nondecreasing right-continuous function.

Assumption 1 implies that all the probability measures $P^{(\varepsilon)}_\theta$, $\theta \in \Theta$, are equivalent (see Theorem 7.7 in [13]). The asymptotics in this model are considered as $\varepsilon \to 0$, with $0 < T < \infty$ fixed.
We will also write $x(\theta) = \{x_t(\theta),\ 0 \le t \le T\}$ for the solution of the limiting deterministic system obtained from (2.1) with $\varepsilon = 0$. We assume that, for all $0 \le t \le T$ and for each $\theta \in \Theta$, the random element $X_t$ and the deterministic path $x_t(\theta)$ belong to $L^2(\mu)$, where $\mu$ is a finite measure on $[0, T]$. Furthermore, we suppose that the functionals $V(\theta, t, x)$ and $K(\theta, t, s, x)$ have bounded first derivatives with respect to $\theta$.
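Numerically, $x_t(\theta)$ is obtained by integrating the noise-free system. The sketch below assumes, for simplicity, a Markovian drift $S_t(\theta, X) = V(\theta, t, X_t)$, so that the $\varepsilon = 0$ limit reduces to an ordinary differential equation.

```python
import numpy as np
from scipy.integrate import solve_ivp

def x_of_theta(theta, V, x0=1.0, T=1.0, n=1000):
    """Solve the eps = 0 limit dx_t = V(theta, t, x_t) dt, x_0 = x0."""
    t_grid = np.linspace(0.0, T, n + 1)
    sol = solve_ivp(lambda t, x: V(theta, t, x), (0.0, T), [x0],
                    t_eval=t_grid, rtol=1e-8, atol=1e-10)
    return t_grid, sol.y[0]

# Same hypothetical drift as before: V(theta, t, x) = theta_1 x + theta_2 sin(t).
V = lambda theta, t, x: theta[0] * x + theta[1] * np.sin(t)
t_grid, x_det = x_of_theta(np.array([-0.5, 1.0]), V)
```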

Let $x^{(1)}(\theta^*) = \{x^{(1)}_t(\theta^*),\ 0 \le t \le T\}$ be the Gaussian process appearing in the first-order expansion $X_t = x_t(\theta^*) + \varepsilon\, x^{(1)}_t(\theta^*) + o(\varepsilon)$ of the solution of (2.1); it solves a linear stochastic differential equation driven by $W$ whose coefficients involve the derivatives of $V$ and $K$ evaluated along $x(\theta^*)$ (see [27]). The process $x^{(1)}(\theta^*)$ plays a central role in the definition of the asymptotic distribution of the estimators in the theory of dynamical systems with small noise. We need in addition the following assumptions.

We further denote by $\dot x_t(\theta)$ the $q$-dimensional vector of partial derivatives of $x_t(\theta)$ with respect to $\theta$, where the dot corresponds to differentiation with respect to $\theta$.

Assumption 4. The matrix
$$I(\theta^*) = \int_0^T \dot x_t(\theta^*)\, \dot x_t(\theta^*)^T\, \mu(dt)$$
is positive definite and nonsingular.

The Lasso-type estimator
We introduce a constrained minimum distance estimator for $\theta$ in the model (2.1). The asymptotic properties of the unconstrained minimum distance estimators in the i.i.d. framework have been established in [15,16]. Later, [26,27] and [28] studied in detail the properties of such estimators for diffusion processes with small noise. Information criteria for this model have been studied in [40]; here we study the Lasso-type approach.
To define the Lasso-type estimator, the following penalized contrast function has to be considered:
$$Z_\varepsilon(u) = \|X - x(u)\|^2_{L^2(\mu)} + \lambda_\varepsilon \sum_{j=1}^q |u_j|^p, \qquad (2.3)$$
where $p > 0$, $u \in \Theta$ and $\lambda_\varepsilon > 0$ is a real sequence. In analogy to (1.1), we introduce the Lasso-type estimator $\hat\theta_\varepsilon : C[0, T] \to \bar\Theta$ for $\theta$, defined as
$$\hat\theta_\varepsilon = \arg\min_{u \in \bar\Theta} Z_\varepsilon(u), \qquad (2.4)$$
where $\bar\Theta$ is the closure of $\Theta$.
The following example explains well the spirit of the Lasso procedure. Consider a linear small diffusion-type process $X$, i.e. one whose drift $S_t(\theta, X)$ is linear in the components of $\theta$. By applying the estimator (2.4), some parameters $\theta_j$ will be set exactly equal to $0$, which yields simultaneous estimation and selection of the model. Therefore, the Lasso methodology is particularly useful in the random dynamical systems framework whenever a sparse representation of the drift term emerges (i.e. some components of $\theta$ are exactly zero) and we are interested in identifying the true model.
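Continuing the hypothetical two-parameter example used in the sketches above, the estimator (2.4) can be approximated on a grid: with $\mu$ taken to be the Lebesgue measure on $[0, T]$, the $L^2(\mu)$-distance is discretized by the trapezoidal rule and the non-smooth objective is minimized by a derivative-free method.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.integrate import trapezoid

def lasso_mde(X_obs, t_grid, x_model, lam_eps, p=1.0, q=2):
    """Lasso-type minimum distance estimator (2.4) for the contrast (2.3):
    Z_eps(u) = ||X - x(u)||_{L2(mu)}^2 + lam_eps * sum_j |u_j|^p,
    with mu the Lebesgue measure, discretized on t_grid."""
    def Z(u):
        dist2 = trapezoid((X_obs - x_model(u, t_grid)) ** 2, t_grid)
        return dist2 + lam_eps * np.sum(np.abs(u) ** p)
    return minimize(Z, np.zeros(q), method="Powell").x

# x_model(u, t_grid) must return the deterministic path x_t(u) on the grid,
# e.g. via x_of_theta from the previous sketch:
x_model = lambda u, tg: x_of_theta(u, V, T=tg[-1], n=len(tg) - 1)[1]
```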

Asymptotic properties of the Lasso-type estimator
The additional $l^p$-penalization term in the contrast function (2.3) modifies the traditional properties of the minimum distance estimator. The analysis has to be carried out separately for the different values of $p$.

Theorem 1. Suppose Assumption 1 and Assumption 5 are fulfilled and assume that $\lambda_\varepsilon \to 0$ as $\varepsilon \to 0$. Then the Lasso-type estimator $\hat\theta_\varepsilon$ defined in (2.4) is consistent: $\hat\theta_\varepsilon \to \theta^*$ in $P^{(\varepsilon)}_{\theta^*}$-probability.
Proof. By definition of $\hat\theta_\varepsilon$, for any $\nu > 0$, the event $\{|\hat\theta_\varepsilon - \theta^*| > \nu\}$ implies that the infimum of $Z_\varepsilon$ outside the ball of radius $\nu$ centered at $\theta^*$ does not exceed $Z_\varepsilon(\theta^*)$. Then, from this inequality, we obtain a bound for $P^{(\varepsilon)}_{\theta^*}(|\hat\theta_\varepsilon - \theta^*| > \nu)$ in terms of $\sup_{0 \le t \le T} |X_t - x_t(\theta^*)|$. Since (see Lemma 1.13 in [27])
$$\sup_{0 \le t \le T} |X_t - x_t(\theta^*)| \le C\, \varepsilon \sup_{0 \le t \le T} |W_t|,$$
where $C = C(L_1, L_2, K_0, T)$ is a positive constant, under Assumption 5 we get the claimed consistency. In the above, we made use of the following estimate: for $N > 0$,
$$P\Big( \sup_{0 \le t \le T} |W_t| > N \Big) \le 4 \exp\Big( -\frac{N^2}{2T} \Big),$$
see, e.g., [27].

From the proof of the consistency of (2.4) it is clear that the speed of convergence of $\hat\theta_\varepsilon$ depends on the asymptotic rate of $\lambda_\varepsilon$. The rate of convergence of $\lambda_\varepsilon$ also affects the asymptotic distribution of the estimator.
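Theorem 1 can be illustrated numerically by letting $\varepsilon$ decrease along the illustrative calibration $\lambda_\varepsilon = \varepsilon$ (so that $\varepsilon^{-1}\lambda_\varepsilon \to \lambda_0 = 1$, anticipating the rate used in Theorem 2), reusing the hypothetical model and helpers from the sketches above.

```python
# Illustrative consistency check with sparse truth theta* = (-0.5, 0).
theta_star = np.array([-0.5, 0.0])
for eps in (0.5, 0.1, 0.02):
    t_grid, X_obs = simulate_small_noise(theta_star, S, eps)
    theta_hat = lasso_mde(X_obs, t_grid, x_model, lam_eps=eps, p=1.0)
    print(eps, theta_hat)  # estimates approach theta* as eps -> 0
```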

Remark 1. It is possible to define other Lasso-type estimators by modifying the metric in (2.3), i.e. by considering, for instance, the sup-norm or the $L^1(\mu)$-norm in place of the $L^2(\mu)$-distance, which yields the estimators $\tilde\theta_\varepsilon$ and $\check\theta_\varepsilon$ respectively. The estimators $\tilde\theta_\varepsilon$ and $\check\theta_\varepsilon$ are uniformly consistent, and the proof follows the same steps adopted to prove Theorem 1. Clearly, it is necessary to redefine the functions $g^\varepsilon_{\theta^*}$ and $h^\varepsilon_{\theta^*}$ appearing in Assumption 5 by replacing the $L^2(\mu)$-norm with the sup-norm (for $\tilde\theta_\varepsilon$) or the $L^1(\mu)$-norm (for $\check\theta_\varepsilon$).
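In code, the variants of Remark 1 only change one line of the discretized contrast; a sketch under the same assumptions as above.

```python
import numpy as np
from scipy.integrate import trapezoid

def penalized_contrast(X_obs, t_grid, x_model, u, lam_eps, p=1.0, norm="L2"):
    """Penalized contrast with L2(mu)-, sup- or L1(mu)-distance."""
    r = X_obs - x_model(u, t_grid)
    if norm == "L2":
        dist = trapezoid(r ** 2, t_grid)       # squared L2(mu)-norm, as in (2.3)
    elif norm == "sup":
        dist = np.max(np.abs(r))               # sup-norm variant
    else:
        dist = trapezoid(np.abs(r), t_grid)    # L1(mu)-norm variant
    return dist + lam_eps * np.sum(np.abs(u) ** p)
```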

Asymptotic distribution
In order to study the asymptotic distribution of the Lasso-type estimator we need to distinguish the different cases for $p$. We start with the case $p \ge 1$. We denote by "$\to_d$" the convergence in distribution and by $\zeta$ the Gaussian random vector
$$\zeta = \int_0^T \dot x_t(\theta^*)\, x^{(1)}_t(\theta^*)\, \mu(dt) \qquad (3.1)$$
(see also Lemma 2.13 in [27]). The next two theorems are inspired by Theorem 2 and Theorem 3 in [23]. Nevertheless, in our case the convexity argument adopted in [23] does not work.

Theorem 2. Suppose that Assumption 1-Assumption 5 are fulfilled, $\zeta$ is defined as in (3.1), $p \ge 1$ and $\varepsilon^{-1} \lambda_\varepsilon \to \lambda_0 \ge 0$. Then
$$\varepsilon^{-1}(\hat\theta_\varepsilon - \theta^*) \to_d \arg\min_u V(u),$$
where, for $p > 1$,
$$V(u) = -2 u^T \zeta + u^T I(\theta^*) u + \lambda_0\, p \sum_{j=1}^q u_j\, \mathrm{sgn}(\theta^*_j) |\theta^*_j|^{p-1},$$
and, for $p = 1$,
$$V(u) = -2 u^T \zeta + u^T I(\theta^*) u + \lambda_0 \sum_{j=1}^q \big[ u_j\, \mathrm{sgn}(\theta^*_j) 1_{\{\theta^*_j \ne 0\}} + |u_j|\, 1_{\{\theta^*_j = 0\}} \big].$$

Proof. Let $u \in \mathbb{R}^q$ and introduce the random function
$$V_\varepsilon(u) = \varepsilon^{-2} \big[ Z_\varepsilon(\theta^* + \varepsilon u) - Z_\varepsilon(\theta^*) \big], \qquad (3.2)$$
which is minimized at the point $u = \varepsilon^{-1}(\hat\theta_\varepsilon - \theta^*)$ by definition of $\hat\theta_\varepsilon$. By exploiting Assumption 2-Assumption 4, we get
$$\varepsilon^{-2} \big[ \|X - x(\theta^* + \varepsilon u)\|^2_{L^2(\mu)} - \|X - x(\theta^*)\|^2_{L^2(\mu)} \big] \xrightarrow{P^{(\varepsilon)}_{\theta^*}} -2 u^T \zeta + u^T I(\theta^*) u, \qquad (3.3)$$
where $\xrightarrow{P^{(\varepsilon)}_{\theta^*}}$ stands for the convergence in probability and $\zeta$ is from (3.1). For the penalty term in (3.2) we have to distinguish the cases $p = 1$ and $p > 1$. Let $p > 1$; then
$$\frac{\lambda_\varepsilon}{\varepsilon^2} \sum_{j=1}^q \big( |\theta^*_j + \varepsilon u_j|^p - |\theta^*_j|^p \big) \to \lambda_0\, p \sum_{j=1}^q u_j\, \mathrm{sgn}(\theta^*_j) |\theta^*_j|^{p-1}. \qquad (3.4)$$
If $p = 1$, then, by similar arguments, we have
$$\frac{\lambda_\varepsilon}{\varepsilon^2} \sum_{j=1}^q \big( |\theta^*_j + \varepsilon u_j| - |\theta^*_j| \big) \to \lambda_0 \sum_{j=1}^q \big[ u_j\, \mathrm{sgn}(\theta^*_j) 1_{\{\theta^*_j \ne 0\}} + |u_j|\, 1_{\{\theta^*_j = 0\}} \big]. \qquad (3.5)$$
Notice that $V_\varepsilon(u)$ is not convex in $u$, and therefore we have to consider the convergence in distribution in the topology induced by the uniform metric on compact sets; i.e. we deal with the convergence in distribution of $V_\varepsilon(u)$ on the space of continuous functions topologized by the distance $\rho(y_1, y_2)$ of uniform convergence on compact sets. From (3.4) and (3.5) follows the convergence of the finite-dimensional distributions, and tightness is controlled through the modulus of continuity $w(y, h) = \sup\{\rho(y(u), y(v)) : |u - v| \le h\}$, with $y$ a continuous function on compact sets and $h > 0$. Therefore, by Theorem 16.5 in [20], $V_\varepsilon \to_d V$, and we can use Theorem 2.7 in [21]. Hence, it is sufficient to show that $V(u)$ is a convex function with a unique minimum, which holds since $I(\theta^*)$ is positive definite and the limiting penalty is convex for $p \ge 1$; moreover, for each $a \in \mathbb{R}$ and $\delta > 0$, there exists a compact set $K_{a,\delta}$ containing the minimizers with probability at least $1 - \delta$ (see [22]).

In the case $0 < p < 1$, a different rate of convergence must be imposed on the sequence $\lambda_\varepsilon$.

Theorem 3. Suppose that Assumption 1-Assumption 4 hold, $\zeta$ is defined as in (3.1), $0 < p < 1$ and $\lambda_\varepsilon / \varepsilon^{2-p} \to \lambda_0 \ge 0$. Then
$$\varepsilon^{-1}(\hat\theta_\varepsilon - \theta^*) \to_d \arg\min_u V(u), \qquad \text{where} \quad V(u) = -2 u^T \zeta + u^T I(\theta^*) u + \lambda_0 \sum_{j=1}^q |u_j|^p\, 1_{\{\theta^*_j = 0\}}.$$
Proof. We proceed analogously to the proof of Theorem 2. As before, we start with $V_\varepsilon(u)$ from (3.2). The first part of the expression in $V_\varepsilon(u)$ converges in distribution to $-2 u^T \zeta + u^T I(\theta^*) u$ as in Theorem 2. For the penalty term, we need to distinguish the two cases $\theta^*_k \ne 0$ and $\theta^*_k = 0$. By assumption we have that $\lambda_\varepsilon / \varepsilon^{2-p} \to \lambda_0$ and hence necessarily $\lambda_\varepsilon / \varepsilon \to 0$.
Consider first the case $\theta^*_k \ne 0$. We have that
$$\frac{\lambda_\varepsilon}{\varepsilon^2} \big( |\theta^*_k + \varepsilon u_k|^p - |\theta^*_k|^p \big) = \frac{\lambda_\varepsilon}{\varepsilon}\, p\, u_k\, \mathrm{sgn}(\theta^*_k) |\theta^*_k|^{p-1} + o(1) \to 0.$$
Conversely, if $\theta^*_k = 0$ we have that
$$\frac{\lambda_\varepsilon}{\varepsilon^2} |\varepsilon u_k|^p = \frac{\lambda_\varepsilon}{\varepsilon^{2-p}} |u_k|^p \to \lambda_0 |u_k|^p.$$
So, by means of the same arguments adopted in the proof of Theorem 2, we can prove that $V_\varepsilon(u) \to_d V(u)$ uniformly on compact sets of $u$. Following [21], the final step consists in showing that $\arg\min_u V_\varepsilon(u) = O_{P^{(\varepsilon)}_{\theta^*}}(1)$, so that $\arg\min V_\varepsilon \to_d \arg\min V$. Indeed, for all $u$, for $\varepsilon$ sufficiently small and $\delta > 0$, $V_\varepsilon(u)$ can be bounded from below by a truncated contrast $V^\delta_\varepsilon(u)$. The term $|u_j|^p$ grows more slowly than the first, normed terms in $V^\delta_\varepsilon(u)$, so $\arg\min_u V^\delta_\varepsilon(u) = O_{P^{(\varepsilon)}_{\theta^*}}(1)$ and, in turn, $\arg\min_u V_\varepsilon(u)$ is also $O_{P^{(\varepsilon)}_{\theta^*}}(1)$. The uniqueness of $\arg\min_u V(u)$ completes the proof.

Adaptive version of the penalized estimator
Theorem 3 shows that, if $p < 1$, one can estimate the nonzero parameters ($\theta^*_j \ne 0$) at the usual rate, without introducing asymptotic bias due to the penalization, and, at the same time, shrink the estimates of the null parameters ($\theta^*_j = 0$) toward zero with positive probability.
On the contrary, if $p \ge 1$, nonzero parameters are estimated with some asymptotic bias whenever $\lambda_0 > 0$. This is a well-known result in the literature [23,45] and has indeed been considered in [4] for ergodic diffusion models with discrete observations.
In this section we consider only the case $p = 1$, i.e. the original Lasso estimator, and we deal with an adaptive version of the Lasso estimator for the diffusion-type process (2.1).
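For concreteness, here is a minimal sketch of the adaptive construction discussed below, under the standard scheme in which the $l^1$-penalty is reweighted by a preliminary consistent estimate (here the unpenalized minimum distance estimator); the exponent $\delta > 0$ and the small safeguard constant are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.integrate import trapezoid

def adaptive_lasso_mde(X_obs, t_grid, x_model, lam_eps, delta=1.0, q=2):
    """Adaptive Lasso-type estimator: l^1-penalty with data-driven weights
    w_j = 1/|theta_init_j|^delta from a preliminary unpenalized fit."""
    dist2 = lambda u: trapezoid((X_obs - x_model(u, t_grid)) ** 2, t_grid)
    theta_init = minimize(dist2, np.zeros(q), method="Powell").x
    w = 1.0 / (np.abs(theta_init) ** delta + 1e-12)  # guard against /0
    Z = lambda u: dist2(u) + lam_eps * np.sum(w * np.abs(u))
    return minimize(Z, theta_init, method="Powell").x
```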