Parameter Inference for Degenerate Diffusion Processes

We study parametric inference for ergodic diffusion processes with a degenerate diffusion matrix. Existing research focuses on a particular class of hypo-elliptic SDEs, with components split into `rough'/`smooth' and noise from rough components propagating directly onto smooth ones, but some critical model classes arising in applications have yet to be explored. We aim to cover this gap, thus analyse the highly degenerate class of SDEs, where components split into further sub-groups. Such models include e.g. the notable case of generalised Langevin equations. We propose a tailored time-discretisation scheme and provide asymptotic results supporting our scheme in the context of high-frequency, full observations. The proposed discretisation scheme is applicable in much more general data regimes and is shown to overcome biases via simulation studies also in the practical case when only a smooth component is observed. Joint consideration of our study for highly degenerate SDEs and existing research provides a general `recipe' for the development of time-discretisation schemes to be used within statistical methods for general classes of hypo-elliptic SDEs.


Introduction
This work addresses the statistical calibration of a wide class of hypo-elliptic diffusions. Stochastic Differential Equations (SDEs) are widely used as an effective tool to describe dynamics of the time evolution of phenomena of interest across a multitude of disciplines. Consider SDE models of the following general form: with V j (·, θ) : R N → R N , 0 ≤ j ≤ d, for parameter θ, driven by the d-dimensional standard Brownian motion B = (B1,t, . . . , B d,t ), t ≥ 0, defined upon the filtered probability space (Ω, F, {Ft} t≥0 , P), with d, N ≥ 1. Several theoretical results about parameter inference for SDEs have been established under positive definiteness conditions on the diffusion matrix a = V V ⊤ ∈ R N ×N , with V = [V1, . . . , V d ]. In such a case, the solution of (1) is referred to as an elliptic diffusion. However, many important applications give rise to diffusion processes that allow matrix a to be degenerate. We give below examples for such classes of SDEs. Under the Hörmander's condition, discussed later in this work, the process defined via the SDE (1) with degenerate diffusion matrix a permits the existense of a density with respect to (w.r.t.) the Lebesgue measure for its transition dynamics, and is referred to as a hypo-elliptic diffusion.

Classes of Diffusion Models
We can summarise the SDE models we consider in this work via two classes of hypo-elliptic diffusions. The first hypo-elliptic class is determined via the following degenerate SDE: (Hypo-I) Here, the involved SDE functionals are specified as follows: for positive integers N S , N R such that N = N S + N R , and unknown parameter vector for positive integers N β S , N β R , Nσ, and a compact set Θ. For class (Hypo-I), we will later on introduce a condition upon the vector-valued functionals {V S,0 , V R,0 , . . . , V R,d } that is sufficient for the law of Xt to admit a Lebesgue density, and which is related to Hörmander's condition. In brief, the condition stipulates that X R,t is indeed a rough component, and that all coordinates of the drift function V S,0 (Xt, β S ) properly relate with the rough component X R,t , so that randomness from X R,t is propagated onto all co-ordinates of vector X S,t . The development of a theoretical and algorithmic framework for parametric inference over class (Hypo-I) has been the topic of several recent works, see e.g. Pokern et al. (2009); Ditlevsen and Samson (2019); Gloter and Yoshida (2021); Iguchi et al. (2022). However, we stress that several important hypo-elliptic SDEs used in practice do not belong in class (Hypo-I), and are not covered by recent investigations. We specify a key class of practically useful but under-explored hypo-elliptic SDEs via the following equation: (Hypo-II) Thus, the drift functions of the smooth components are now specified as: and the parameter vector writes as: for positive integers N S 1 , N S 2 , N β S 1 , N β S 2 , and N = N S 1 + N S 2 + N R , where Θ is again a compact set. Notice that the upper-most smooth component X S 1 ,t in (Hypo-II) does not depend on the rough component X R,t , thus noise from the latter is not directly propagated onto X S 1 ,t . We refer to (Hypo-II) as the highly degenerate class of SDEs. For class (Hypo-II) we will set-up a restriction on its constituent vector-valued functionals {V S 1,0 , V S 2,0 , V R,0 , . . . , V R,d }, related to Hörmander's condition. Roughly, such a requirement will guarantee that noise from the rough component X R,t indeed propagates onto all co-ordinates of X S 2 ,t , first, before then moving onto X S 1 ,t . Thus, X S 1 ,t is 'smoother' than X S 2 ,t . Importantly, a consequence of such a behaviour is that class (Hypo-II) is not included within (Hypo-I), instead the two classes, (Hypo-I) and (Hypo-II), are intrinsically distinct and must be treated separately in terms of theoretical and algorithmic considerations.

A Motivating Class of Models
The non-Markovian Langevin equation (or generalised Langevin equation (GLE)) is used in a wide range of applications due to its effectiveness in describing complex stochastic systems with memory effects (thus, of non-Markovian structure). Examples include dynamics observed in protein folding (Ayaz et al., 2021), cancer cells (Mitterwallner et al., 2020), flocks of birds (Ferretti et al., 2020), molecules (Ness et al., 2015) and coarse-grained systems (Kalliadasis et al., 2015;Li et al., 2017). For simplicity, we consider here a one-dimensional particle with unit mass, and denote its position and momentum, respectively, by (q, p). Then, a GLE describes the particle dynamics as follows: where U : R → R is an appropriate potential function, K : [0, ∞) → R is the memory kernel, and ηt is a zero-mean stationary Gaussian noise with auto-correlation specified via a fluctuation-dissipation relation in equilibrium, i.e., E[ηtηs] = K(t−s), s, t > 0, given a unit temperature. Due to the presence of K(·), particle dynamics will depend on the full state history, with such a property being quite desirable in applications, see e.g. the references given above. However, the cost of generating the dynamics of model (GLE) can be overly expensive. Thus, a standard approach followed in practice is to introduce a parametrisation for the memory kernel K(·) and represent the non-Markovian system (GLE) as a Markovian one on an extended space, the latter system being referred to as Quasi-Markovian Generalised Langevin Equation (QGLE). Such parametrisation is extremely rich, thus being able to accurately capture the behaviour of systems with general true kernel K(·). In particular, a common parametrisation of the memory kernel is the following: with δ = δ(·) the Dirac function. In this case, the original system in (GLE) can be equivalently re-written as the following Markovian one: s0 ∼ N (0m, Im), with st ∈ R m an auxiliary component and σ j ∈ R m+1 + , 1 ≤ j ≤ m. Another typical choice for the memory kernel is the following: in which case the equivalent QGLE writes as: s0 ∼ N (0m, Im), (QGLE-II) with σ j ∈ R m + . Class (QGLE-I) is investigated, e.g., in Ceriotti et al. (2010). Then, class (QGLE-II) is popular, e.g., in thermodynamics modelling, see Pavliotis (2014); Leimkuhler and Matthews (2015). Class (QGLE-I) belongs in (Hypo-I), with the rough component comprised of pt, st. For class (QGLE-II), the rough component consists only of st, with qt depending on the smooth component pt and not on st. Thus, (QGLE-II) lies within class (Hypo-II). Recently, parametric inference for GLEs within the QGLE setting, under discrete-time observations of the smooth component qt, has been of interest for applications, see e.g. Ferretti et al. (2020); Vroylandt et al. (2022).

Related Works and Objectives
In this paper we investigate parameter estimation for the two classes of degenerate diffusion processes, (Hypo-I) and (Hypo-II), given discrete-time observations obtained at instances 0 ≤ t0 < t1 < · · · < tn, n ∈ N, with equidistant observation intervals ∆n := t i − t i−1 , 1 ≤ i ≤ n. In particular, we consider the following scenarios for the observations: 1. Complete observation regime, i.e., with all N co-ordinates of Xt being observed. 2. Partial observation regime, i.e., with a strict subset of co-ordinates being observed. In agreement with applications, in this setting only the upper-most smooth component is assumed to be observed.
Within class (Hypo-I), and for the complete observation regime considered in the produced asymptotics, Ditlevsen and Samson (2019) and Gloter and Yoshida (2020) develop judicious discrete-time (conditionally) Gaussian approximations for the transition distribution of an SDE. Such a proxy provides contrast estimators proven to be asymptotically normal in a high-frequency observation setting, i.e., n → ∞, ∆n → 0 and n∆n → ∞, with a requirement that the step-size scales as ∆n = o(n −1/2 ). In Iguchi et al. (2022), such a condition on the step-size to obtain asymptotic normality is weakened to ∆n = o(n −1/3 ).
In the partial observation regime, with the upper-most smooth component being observed, the missing components must be carefully imputed given the available observations. For SDEs in class (Hypo-I), it is often the case that the dynamics of the smooth component is determined as dX S,t = X R,t dt. Such a remark also applies for class (Hypo-II), with the role of X S,t taken up by the upper-most smooth component, and the one of X R,t by the second smooth component. We keep the discussion within class (Hypo-I), as this is the context typically looked at in earlier literature. For the described setting, it is tempting and, indeed, widely used in practice, to recover the hidden rough component via finite-differences, using the observations {X S,t i+1 } i , i.e. via X R,t i = (X S,t i+1 − X S,t i )/∆n, if the step-size ∆n is small enough. However, Pokern et al. (2009) and Samson and Thieullen (2012) show that, in the context of bivariate models within class (Hypo-I), such an approach delivers (asymptotically) biased estimates of diffusion parameters. Pokern et al. (2009);Ditlevsen and Samson (2019) argue against applying finite-differences and, instead, consider appropriate Itô-Taylor schemes leading to non-degenerate conditionally Gaussian approximations for the SDE transition density. Such proxies are then embedded within MCMC Gibbs samplers or Monte-Carlo Expectation-Maximisation (MC-EM) methods to impute the missing components conditionally on observations. Nevertheless, Pokern et al. (2009) illustrate empirically that the scheme they develop then leads to a biased estimation for the drift parameter, β R , of the rough component. Subsequent analytical works (Ditlevsen and Samson, 2019;Gloter and Yoshida, 2020;Iguchi et al., 2022) illustrated that the bias occurs due to omission of drift terms of size O(∆ 2 n ) from the Itô-Taylor expansion of the smooth component within class (Hypo-I), as such terms are needed to counterbalance the noise terms of size O(∆ 3/2 n ) arising in such an expansion. A main conclusion arising from the above discussion is that a recipe for accurate estimation of both hidden components and parameters is the development of a conditionally Gaussian approximation for the full co-ordinates (as such Gaussianity allows for access to computationally effective inference methodologies) obtained via careful inclusion of higher-order terms from the relevant Itô-Taylor expansion. Such an insight for the design of a 'correct' discretisation scheme for the purposes of statistical inference has, arguably, not been clearly spelled out in the literature.
The objective of this work is to provide a comprehensive study of statistical calibration for a wide class of degenerate diffusion models. For this purpose, we review previous works for class (Hypo-I), and, then, we establish new analytical results for class (Hypo-II). The main contributions of our work can be summarised as follows: (i) For the highly degenerate class (Hypo-II), we construct a conditionally Gaussian time-discretisation scheme. The corresponding transition density is well-defined (i.e. non-degenerate) under a suitable assumption on functionals {V S 1,0 , V S 2,0 , V R,0 , . . . , V R,d } motivated both by modelling considerations and by adherence to Hörmander's condition. We refer to the new proxy as the 'locally Gaussian scheme' in agreement with the name assigned by Gloter and Yoshida (2021) to a conditionally Gaussian scheme developed for class (Hypo-I). (ii) For class (Hypo-II), we define a contrast estimator based on the transition density of the locally Gaussian scheme. Then, we show that the estimator is asymptotically normal in the complete observation regime, under a high-frequency observation setting, i.e., for n → ∞, ∆n → 0, n∆n → ∞, with the additional condition that the step-size must scale as ∆n = o(n −1/2 ). (iii) Under the partial observation regime often encountered in practical applications, we show via analytical consideration of some case studies that use of a finite-difference method for estimation of hidden components leads to asymptotically biased estimation of the diffusion parameter σ for class (Hypo-II). Thus, we put forward the developed locally Gaussian scheme for (Hypo-II) as an effective tool to impute hidden components and estimate parameters. (iv) By reviewing the methodology already produced in the literature for class (Hypo-I) and examining the new one produced in this work for (Hypo-II), we can provide a complete guideline for the development of a discretisation scheme for general degenerate diffusion processes so that the corresponding contrast function does not introduce bias in parameter estimation procedures.
The rest of the paper is organised as follows. Section 2 specifies the class of hypo-elliptic SDEs of relevance for this work, with reference to Hörmander's condition. Section 3 revisits the correct (in terms of its statistical properties) discretisation scheme for class (Hypo-I) and introduces the one for (Hypo-II). Section 4 provides our core analytical results of asymptotic consistency and normality for the statistical estimates obtained via the new scheme, in a complete observation setting. All proofs are collected in an Appendix. We present case studies showcasing emergence of bias when standard alternative schemes are called upon or when finite-differences are used to impute unobserved components (a common practice in applications). Under the correct schemes shown here for classes (Hypo-I) and (Hypo-II), we set up a simple Kalman filter for fitting a non-linear sub-class of models commonly arising in applications (we term these conditional Gaussian non-linear systems) in the practical partial observation setting. Section 5 presents numerical studies, for the partial observation regime, both for simple models and ones relevant to real applications, within class (Hypo-II). The code used in the numerical studies is available at https://github.com/YugaIgu/calibration-hypoSDEs. We finish with some conclusions in Section 6.
Notation. For the highly degenerate class (Hypo-II), to establish a common notation with (Hypo-I), we use the argument x S = (x S 1 , x S 2 ) and also set: For φ(·, θ) : R N → R, θ ∈ Θ, bounded up to 2nd order derivatives, we define the differential operators L and L j , 1 ≤ j ≤ d, as: for (x, θ) ∈ R N ×Θ. Application of the above differential operators is extended to vector-valued functions in the apparent way, via separate consideration of each scalar component. Denote by C ∞ b (R n 1 , R n 2 ), n1, n2 ∈ N, the space of smooth functions f : R n 1 → R n 2 with bounded partial derivatives of every order. We denote the probability law of the process {Xt} t≥0 under a parameter θ ∈ Θ as P θ , and we write for convergence in probability and distribution, respectively, under the true parameter θ † . We write the expectation under the probability law P θ as E θ to emphasise the dependence on θ ∈ Θ. For u ∈ R n , n ∈ N and the multi-index α ∈ {1, . . . , n} l , l ∈ N, we define i.e. an operator acting on maps R n → R, and then extended, by separate application on each co-ordinate, on maps R n → R m , m ∈ N.

Hypo-Elliptic SDEs
We fully specify the classes of SDEs of interest in (Hypo-I) and (Hypo-II), by providing, in each case, appropriate conditions on the collection of functionals {V0, V1, . . . , V d }, motivated by modelling considerations and the existence of a Lebesgue density for the SDE transition dynamics. We illustrate later on that the imposed conditions suffice so that the locally Gaussian scheme for Xt in (Hypo-II) we put forward in this paper is non-degenerate.

Hörmander's Condition
We quickly review the definition of Hörmander's condition. Consider the class of SDEs with the general form in (1). We defineṼ From standard properties of Itô's processes,Ṽ0 is the drift function of (1) when written as a Stratonovichtype SDE. The functionals {Ṽ0, . . . , V d } of the Stratonovich SDE can be corresponded to differential operators, the latter applying on mappings on R N → R N and giving as outcome mappings, again, on the same spaces. In particular, we have: Without confusion, we use the same notation both for the SDE functionals and the corresponding differential operators. Parameter θ is removed from the expressions for simplicity. For two functionals (equivalently, differential operators) as above, We introduce the collections of functionals Then, Hörmander's condition is stated as follows: Definition 1 (Hörmander's Condition) There exists M ≥ 1 such that Hörmander's condition implies that for any t > 0 and any initial condition X0 = x ∈ R N , the law of Xt is absolutely continuous w.r.t. the Lebesgue measure. Also, if the coefficients of the SDE are infinitely-times differentiable, with partial derivatives of all orders being bounded, then the Lebesgue density is smooth, see, e.g., Nualart (2006); Pavliotis (2014).

Diffusion Classes (Hypo-I) and (Hypo-II)
We now set up separate conditions for the SDEs in classes (Hypo-I) and (Hypo-II). These will make use of the drift function,Ṽ0 =Ṽ0(x, θ), of the Stratonovich version of the SDEs. For 1 ≤ j ≤ k ≤ N , we define the projection operator proj j,k : R N → R k−j+1 as For (Hypo-I) and (Hypo-II), we assign the following conditions to fully specify the structure of the corresponding degenerate system of SDEs.

Remark 1 Conditions (H)-I and (H)-II separate classes (Hypo-I) and
(Hypo-II). In particular, the top equation in both (H)-I, (H)-II implies that the diffusion matrix of the rough component is of full rank, thus X R,t acquires the roughness of an elliptic SDE. The second equation in (H)-I, (H)-II ensures that all co-ordinates of component X S,t and X S 2 ,t , respectively, possess the same smoothness as integrals of elliptic SDEs. Finally, the third equation in (H)-II implies that component X S 1 ,t has the smoothness of second integrals of elliptic SDEs. Note, e.g., that (4) will not hold for (Hypo-II), since for the highly degenerate case, due to the drift function of the upper-most component X S 1 ,t not involving X R,t , we have To check this latter equation, notice that for both (Hypo-I) and (Hypo-II) we have: i) V0 andṼ0 coincide on the smooth co-ordinates; ii)Ṽ0V k is zero on the smooth co-ordinates. Then, for (Hypo-II) we additionally have that V kṼ0 is zero on the upper-most N S 1 co-ordinates due to the particular choice of V0 = V0(x S 1 , x S 2 ).
Remark 2 We introduced conditions (H)-I & II upon functionals {V0, V1, . . . , V d }, so that the two classes of SDEs, (Hypo-I) and (Hypo-II), possess sufficient structure to allow for their intended use for the modelling objectives in mind. It turns out that the exact same conditions (H)-I & II play a key role so that the locally Gaussian schemes written down later in Section 3 are well-defined with a positive definite covariance matrix.

Time-Discretisation of Hypo-Elliptic SDEs
We discuss time-discretisation schemes for hypo-elliptic diffusions within classes (Hypo-I) and (Hypo-II), with the focus being on the performance of the schemes for the purposes of parametric inference. We set up the context by reviewing schemes proposed in literature for class (Hypo-I). We then propose a new scheme for the highly degenerate diffusion class (Hypo-II) that will later be proven to possess desirable statistical properties. Hereafter, to distinguish among the two classes of SDEs, we use the notation X X (I) and such an imputation is a main cause for the presence of bias in the estimation of parameter σ. To sidestep the above issue, Pokern et al. (2009) proposed the following conditionally Gaussian scheme, for 0 ≤ i ≤ n: Note that now the smooth component X  (7) to estimate both the hidden paths of the rough components and the parameters via a Bayesian approach, namely Gibbs sampling. Under a high-frequency observation setting, they empirically showed that the estimate of parameter σ is asymptotically unbiased, but the estimator of the drift parameter β R based on scheme (7) suffers from bias even in the complete observation regime. Gloter and Yoshida (2020) introduced the 'local Gaussian' scheme, where, for 0 ≤ i ≤ n, Compared to (7), scheme (LG-I) includes term ∆ 2 n (LV S,0 )(X (I) i , θ)/2 in the smooth component. Gloter and Yoshida (2020) illustrate the significance of this term for the purposes of parameter inference, by proving asymptotic consistency and normality for the contrast estimator derived from the likelihood of the discretisation scheme (LG-I), in the high-frequency, complete observation regime, namely, n → ∞, ∆n → 0, n∆n → ∞, under the step-size condition ∆n = o(n −1/2 ).
Remark 3 Ditlevsen and Samson (2019) applied a strong 1.5 order scheme (Kloeden and Platen, 1992) to construct contrast estimators for class (Hypo-I) with N S = 1, under strong conditions for the diffusion matrix so that the scheme becomes conditionally Gaussian. Then, they provided two separate contrast functions for estimating β S and (β R , σ) from the approximate Gaussian density for X S and X R , respectively, rather than the joint density. As noted in Remark 4.6 in Gloter and Yoshida (2020), the separate contrast functions results in a larger asymptotic variance for the estimation for β S compared with the single contrast estimator defined via the joint density of rough and smooth components.

Time-Discretisation of (Hypo-II)
We propose a time-discretisation scheme for the second hypo-elliptic class (Hypo-II), with desirable properties for the purposes of parameter inference. The brief review of schemes for (Hypo-I) in the previous section suggests that the discretisation scheme for (Hypo-II) should satisfy the following two key criteria: I. The scheme should be conditionally non-degenerate, i.e., the law of Xt i+1 given Xt i should admit a Lebesgue transition density for the full co-ordinates. This will allow to impute unobserved paths conditionally on observations without making use of bias-inducing finite-difference approximations. II. The scheme should involve deterministic terms obtained from careful truncation of the stochastic Taylor expansion for the drift of the smooth component, V S,0 (X (II) t , β S ), so that the contrast estimator corresponding to the scheme is asymptotically unbiased under the high-frequency, complete observation regime.
As for Criterion I, we will explain later in Section 4.2.1 that, indeed, use of a degenerate discretisation scheme or of finite-differences to estimate hidden components induces a bias in the estimation of parameters. Based upon the above key criteria, we propose the following discretisation scheme for (Hypo-II): Notice that the scheme involves 3d Gaussian random variables: The latter of the above integrals appears due to the application of a third-order stochastic Taylor expansion on V S 1 ,0 (x S , β S 1 ), in the smoothest componentX (II) S 1 ,i+1 . As we will show in Section 4, the log-likelihood based on the local Gaussian scheme (LG-II) produces a contrast estimator that is asymptotically unbiased in the high-frequency, complete observation regime. In order for the deduced contrast function to provide desirable asymptotic properties, it is required to include terms up to O(∆ 3 n ) in the definition of µ S 1 , otherwise estimation of the parameter β R in the model (Hypo-II) can be asymptotically biased as Pokern et al. (2009) observed for some bivariate hypo-elliptic diffusions in the framework of (Hypo-I).
We denote by Σ(∆, x, θ) the covariance matrix for one-step implementation of schemeX (II) , with step-size ∆ > 0, current state x ∈ R N and parameter θ ∈ Θ. The covariance matrix is given as: where each block matrix is specified as: for In the above, we have set The proof is given in Appendix B. Due to Proposition 1, the covariance Σ(∆, x, θ) is invertible, thus the approximate log-likelihood based on the local Gaussian discretisation scheme (LG-II) is well-defined for the highly degenerate class (Hypo-II). We note that, in brief, the above result follows from the positive definiteness of a R (x, σ), a S 2 (x, θ) and a S 1 (x, θ) under condition (H)-II.

Parameter Inference for Class (Hypo-II)
We explore analytically parameter inference procedures for hypo-elliptic diffusions in class (Hypo-II). We prove in Section 4.1 that a contrast estimator constructed from the conditionally Gaussian discretisation scheme (LG-II) is asymptotically unbiased under the high-frequency, complete observation regime. We illustrate the precise impact of the drift terms involved in scheme (LG-II) on the asymptotic results. In Section 4.2, we consider the partial observation regime. As observed for the case of class (Hypo-I) in the literature, we show via analytical case studies that use of finite-differences for the estimation of hidden paths leads to biased parameter estimates within (Hypo-II). Also, we explain that the local Gaussian scheme can be put into effective use within computational approaches for filtering hidden components and estimating parameters.

Contrast Estimator
Based on the proposed scheme (LG-II) and the corresponding tractable transition density, we construct a contrast estimator for the hypo-elliptic class (Hypo-II). We write the transition density of the local Gaussian scheme (LG-II), for given ∆ > 0, current position x ∈ R N and parameter θ ∈ Θ as: ; θ), we define the following contrast function: Thus, the contrast estimator for the hypo-elliptic class (Hypo-II) is defined as: θn = β S 1 ,n ,β S 2 ,n ,β R,n ,σn = argmin θ∈Θ ℓn θ . (10)

Asymptotic Results
Before we state our main results, we introduce some conditions for class (Hypo-II).
is three times differentiable for all x ∈ R N . Furthermore, derivatives of the above map up to the third order are of polynomial growth in Note that under condition (C1) and (H)-II, the law of the solution to the degenerate SDE (Hypo-II) admits a smooth Lebesgue density as we explained in Section 2.1. We write the true value of the parameter for a model in (Hypo-II) as (3). Then, the contrast estimator defined in (10) has the following asymptotic properties in the high-frequency observation setting.
where the asymptotic precision matrix Γ(θ † ) is given as: with the involved block matrices The proofs of Theorems 1 & 2 are given in Appendix C.
Remark 4 Gloter and Yoshida (2020) prove consistency and asymptotic normality of a contrast estimator constructed via a locally Gaussian scheme for class (Hypo-I). Our proofs follow a different approach from the one in the above work. Indicatively, a condition on the step-size of ∆n = o(n −1/2 ) is not required for our proof of consistency, whereas it is needed in Gloter and Yoshida (2020). Our proofs avoid such a condition by making use of preliminary convergence rates for the estimatorsβ S 1 ,n ,β S 2 ,n and some key identities in the involved matrix calculations to control terms arising in the expansion of the logarithm of the contrast function -this method then provides the consistency forβ R,n . We note that we believe that a consequence of the proofs we derive in this work is that they provide a direction for further results to be obtained on the analysis of contrast functions and the estimates they deliver for general classes of hypo-elliptic SDEs, see the discussion in Section 6.

Case Study -Bias due to Incorrect Drift Expansion
We have proven that the contrast estimator based on the proposed scheme (LG-II) is asymptotically unbiased under the high-frequency, complete observation regime. The inclusion of an appropriate number terms from the stochastic Taylor expansion of V S 1 ,0 (X S 1 ,t , X S 2 ,t , β S 1 ) and V S 2 ,0 (Xt, β S 2 ) in scheme (LG-II) is critical for obtaining desirable asymptotic properties. Omission of such terms will typically give rise to an asymptotic bias. In this subsection, we briefly highlight the effect of the 'drift correction' via a simple three-dimensional hypo-elliptic model from (Hypo-II). We consider the following SDE: where θ = (β, σ) is the parameter vector. We assume that all components of the system are observed, and consider the following discretisation scheme for SDE (12): Thus, terms of size O(∆ 3 n ) are not included in the deterministic part of the approximationq i+1 of the smoothest component. Based on the conditionally Gaussian scheme (13), we define a contrast function as (−2)× (complete log-likelihood), that is, after some constants are removed: Solving ∂ β ℓn(θ; x0:n) = 0, we obtain the contrast estimator for β as βn = g n f n , where we have defined: From the ergodicity of the process {st} and Lemma 2 in the Appendix, we have that as n → ∞, ∆n → 0 and n∆n → ∞, For the numerator gn, we apply Lemmas 2 & 3 in Appendix to obtain that Thus, the drift estimation based on the discretisation scheme (13) with inappropriate drift expansion is, in general, asymptotically biased. One can check that the above bias is removed upon use of our locally Gaussian scheme (LG-II) instead of (13).

Case Study -Bias due to Finite-Differences
To motivate the 'appropriateness' of the proposed locally Gaussian scheme (LG-II) in the context of parameter inference in a partial observation setting, we illustrate that naive use of finite-differences to impute hidden components (a practice quite common in applications) induces a bias in the estimation of the SDE parameters. To observe this, consider the model (12) again but with the drift parameter β fixed to 1: for σ > 0. We then apply the Euler-Maruyama scheme for the first equation of (14) and the locally Gaussian scheme (LG-I) for the remaining dynamics, i.e., Scheme (15) is degenerate since the upper-most equation does not involve noise. We now consider the estimator based on the likelihood provided by (15), given the discrete-time observations {q0:n, s0:n}, and with the hidden paths p0:n imputed via the first equation of (15) using the observations q0:n.
Remark 5 In practice, the rough component st is often not observed, so one must impute the missing components s0:n conditionally on the observations q0:n by making use of the transition density (or some approximation of it) for both co-ordinates (p, s) of (14). One would reasonably expect that presence of bias will be typical in such a practical scenario, if it found to be present in the simpler case when st is directly observed.

Filtering and Parameter Inference via the Proposed Scheme (LG-II)
We put forward the locally Gaussian scheme (LG-II) for imputing hidden components and performing parameter inference under a partial observation regime. The scheme and its transition density on the full set of coordinates can be combined with various computational methods, e.g., Monte-Carlo Expectation Maximisation (MCEM) and Markov Chain Monte Carlo (MCMC), similarly to earlier works (Ditlevsen and Samson, 2019;Pokern et al., 2009) that applied some conditionally Gaussian schemes for inference of specific hypo-elliptic models within the class (Hypo-I).
We now highlight the use of a relatively straightforward Kalman filter recursion for carrying out statistical inference once the locally Gaussian scheme is adopted, for a rich sub-class of hypo-elliptic models, referred to here as conditionally Gaussian non-linear systems. That is, the system is originally specified as a non-linear SDE but can be treated as a linear system given components that correspond to observations. For elliptic diffusions with such a structure, continuous-time filtering and smoothing have been investigated in engineering, see e.g. Chapter 8 of Chen (2023). Several important hypo-elliptic models used in applications fall within this sub-class, e.g., standard Langevin equations, Quasi-Markovian generalised Langevin Equations (QGLE-I, QGLE-II). Here, our interest lies in the sub-class derived via the general model (Hypo-II) once the constituent coefficients are specified as: are independent of the state x. Critically, given the observable component x S 1 , the drift functions are linear functions of the hidden component x H . For the model with the above choice of coefficients, the locally Gaussian scheme (LG-II) writes as: and an Ndimensional Gaussian variate w i (∆n, θ). Since the right-hand-side of scheme (18) is linear w.r.t. the hidden componentsX S 2 ,i ,X R,i given the observed componentX S 1 ,i , one can obtain Kalman filtering and smoothing recursions, and calculate the marginal likelihood for the observationsX S 1 ,0:n . We provide the closed form filtering and marginal likelihood calculations in Appendix F. We use these tools in the numerical experiments of parameter inference under the partial observation regime in Section 5 that follows.
Remark 6 Vroylandt et al. (2022) studied parameter inference for the QGLE of first-type in (QGLE-I), where they applied an Euler-Maruyama scheme to construct Kalman filtering and smoothing for the rough components (pt, st) given the velocity pt, with values of the latter obtained (via finite-differences) from discrete observations of the position qt. Then, they used Kalman filtering and smoothing within an Expectation-Maximisation (EM) algorithm to estimate the parameters. However, as we have seen, such a finite-differences approach induces bias in the estimation of the diffusion parameters.

Linear SDE in a Partial Observation Regime
We illustrate empirically, for an example SDE model, that parameter estimation via the proposed locally Gaussian scheme (LG-II) leads to asymptotically unbiased estimation under the partial observation regime.  We also highlight the effect of the drift correction in the properties of the estimators. We again consider the model studied in Section 4.1.3, that is, where θ = (β, σ) ∈ Θ = (0, ∞)×(0, ∞) is the parameter vector. In agreement with practice, we assume that only discrete observations of the smoothest component, q0:n, are available, with an equidistant step-size ∆n. We compute the following two estimators based on two different discretisation schemes: θ n,j = (β n,j ,σ n,j ) = argmax θ∈Θ log p j (θ; q0:n), j = 1, 2, where p1(θ; q0:n) is the approximate likelihood of the observations as obtained by use of Kalman filter in the setting of our locally Gaussian scheme (LG-II), and p2(θ; q0:n) is a different approximate likelihood obtained in the setting of the following conditionally Gaussian scheme that omits correction terms of order O(∆ 2 n ) (resp. O(∆ 3 n )) from the stochastic Taylor expansion of the drift function of component p (resp. q): We generate 100 independent realisations of the dataset q0:n by sub-sampling trajectories obtained from scheme (LG-II) with a small step-size 10 −4 . We have chosen the scheme because it is expected to have a better accuracy than other classical schemes (such as Euler-Maruyama scheme) due to the higher order stochastic Taylor expansion of drift functions. We consider the following three high-frequency scenarios for the data: Set I. n = 5 · 10 5 , ∆n = 10 −3 , T (= n∆n) = 500. Set II. n = 2 · 10 6 , ∆n = 5 · 10 −4 , T = 10 3 . Set III. n = 10 7 , ∆n = 10 −3 , T = 10 4 .  The true parameter value is set to θ † = (β † , σ † ) = (2.0, 4.0), and the Nelder-Mead method is applied to optimise the marginal likelihoods. In Figure 1, we plot the 100 realisations of the two different estimators. Table 1 summarizes the mean and standard error ofθn − θ † , i.e., (estimate) -(true value), from the 100 repetitions. First, we observe that the estimates ofθn,1 (using our scheme (LG-II)) is centred at the true value in all scenarios, thus in this case we have an empirical illustration of an asymptotically unbiased estimation in the partial observation setting. Secondly, it is clear from the figures and the table that the mean of estimates ofθn,2 (estimator based on the conditionally Gaussian scheme without appropriate drift correction) is shifted from the true value, and seems to be centred at (β, σ) = (2.10, 3.960). Thusθn,2 induces an asymptotic bias in the partial observation regime, in agreement with the case study in Section 4.1.3 in the complete observation case. Notably, there is a clear separation between the two estimators of σ. We stress here that the bias inθn,2 is not removed with increasing n or decreasing ∆n. Also, one can still observe the bias even if the datasets are obtained with other numerical schemes, e.g., Euler-Maruyama scheme, rather than scheme (LG-II).

Scalar Extended State
We consider the QGLE describing one-dimensional positional domain: where α > 0, σ > 0, λ ∈ R \ {0} and U : R → R. In this experiment, we consider the following two choices of potential U : where D > 0 is a parameter. The function U DW (used in experiments in the work Leimkuhler and Sachs (2022)) represents an uneven double well potential, under which model (19) is non-linear. We generate 30 independent datasets by sub-sampling trajectories produced by the discretisation scheme (LG-II) with a small step-size 10 −4 so that obtained observations correspond to n = 2×10 5 , ∆n = 10 −3 , T = n∆n = 200.  harmonic potential U HO and the double well potential U DW , respectively. Also, the initial guesses for the parameter are set to: (D0, λ0, α0, σ0) = (2.0, 2.0, 2.0, 2.0) and (D0, λ0, α0, σ0) = (3.0, 3.0, 3.0, 3.0) for the case of U HO and U DW , respectively. We summarise the mean and standard error of the estimates from the 30 independent trajectories in Table 2. We notice that the results for the complete observation regime are in agreement with the analytical results in Theorem 2. For instance, convergence to the true values appears to be faster for parameters (D, λ) in the smooth component pt (recall the convergence rate ∆n/n for such parameters in the CLT of Theorem 2). Besides, under the partial observation regime, the estimates seem to be centred around the true parameter as well, with standard errors that are larger than the ones in the case of complete observations (as expected). Thus, parameter inference carried out via the proposed locally Gaussian scheme (LG-II) appears in this case to provide unbiased estimates in the partial observation regime.

Multivariate Extended State
We consider a QGLE with one-dimensional coordinates and multivariate extended variable, motivated from the work of (Ayaz et al., 2021) that studies protein-folding kinetics via a Quasi-Markovian GLE (QGLE-II) and showcases that a QGLE accurately reproduces simulations of molecular dynamics (MD) that involve memory effects in the friction. In their investigation, an one-dimensional reaction coordinate, qt, given as the sum of the separations between native contacts, is modelled via the following QGLE: where M, β > 0 denote the mass and the inverse thermal energy respectively, {c l , τ l } 1≤l≤L are the unknown parameters taking positive values, for L ≥ 1, and U : R → R, the folding free energy landscape for proteins, is specified as q → U (q) = −β −1 log ν(q) with ν(·) being the equilibrium probability density function. QGLE (20) corresponds to the non-Markovian GLE (GLE) with the memory kernel given as a so-called Prony series: In our experiment, we estimate the unknown parameters by maximising the marginal likelihood given the partial observations q0:n. For simplicity, we set m = 1, L = 2, and specify the free energy function as q → U (q) = a(q − q min ) 2 (q − qmax) 2 + bq 3 with constants (q min , qmax, a, b) = (0.30, 0.90, 1200, 0.001). We set β −1 = 2.949 and select the true parameters as θ † = (c † 1 , τ † 1 , c † 2 , τ † 2 ) = (0.22, 0.007, 1.2, 4.6), as such a choice closely reproduces the shape of the memory kernel estimated in (Ayaz et al., 2021). We generate 50 independent trajectories of q on the time inverval [0, 1500] by applying scheme (LG-II) to QGLE (20) with step-size 10 −4 . We discard the observations up to time 500 and sub-sample the datasets q0:n in equilibrium, so that n = 10 6 , ∆n = 10 −3 , T = 1000. In Figure 2, we plot the shape of the free energy U and one trajectory of the component q from the QGLE (20) in the chosen setting. Notice that QGLE (20) is a conditionally Gaussian non-linear system (given the component q), thus upon adoption of the locally Gaussian discretisation (LG-II), the marginal likelihood can be calculated via the Kalman filter shown in Section 4.2.2. We use the Nelder-Mead method to optimise the marginal likelihoods with the initial value θ0 = (0.1, 0.01, 1.0, 10.0). Figure 3, summarises the results from estimation of θ = (c1, τ1, c2, τ2) given the partial observations. The boxplots of (approximate) maximum likelihood estimates in Figure (3a) indicate that parameter estimation carried out via the locally Gaussian scheme (LG-II) delivers consistent estimates of the parameters. We typically observe that the standard error of estimates for (c1, τ1) is much smaller than that for (c2, τ2). In Figure (3b), we plot the memory kernel (21) with the parameters set equal to the mean value of the 50 MLEs to observe the level of agreement between the estimated memory kernel and the reference kernel, the latter computed with the true parameter values, where the relative absolute errors between the true and estimated memory kernels are within 0.1 across periods t ∈ [0.001, 10].

Conclusions and Future Directions
We have studied parameter inference procedures for the highly degenerate class of SDEs that includes a wide range of practical models, e.g., quasi-Markovian generalised equations (QGLEs), epidemiological models with time-varying parameters (Spannaus et al., 2022;Dureau et al., 2013), non-linear continuoustime autoregressive models (Tsai and Chan, 2000) and the classical Lorentz system upon consideration of noise effects (Coti Zelati and Hairer, 2021). We have introduced the locally Gaussian time-discretisation scheme (LG-II) and provided analytical/numerical results showcasing that parameter estimation based upon such scheme sidesteps biases that would arise under alternative schemes. The approach followed in this  work for establishing our results for class (Hypo-II) are expected to also guide extensions to more general classes of degenerate diffusions, for which iterated Lie brackets of any order, i.e., [Ṽ0, [Ṽ0, . . . , [Ṽ0, V k ]]], 1 ≤ k ≤ d, are required for Hormander's condition to hold. Here, we draw upon the understanding obtained via the study of classes (Hypo-I) and (Hypo-II) to summarise key arguments for carrying out unbiased parameter estimation for general hypo-elliptic systems. First, in a partial observation regime, use of a degenerate discretisation (e.g. Euler-Maruyama) or equivalently of finite-differences to impute latent components will induce bias at estimates of diffusion coefficient parameters (recall the case study in Section 4.2.1). To avoid use of finite-differences, the development of a non-degenerate conditionally Gaussian scheme for the full coordinates is essential, with the Gaussian noise obtained via high-order stochastic Taylor expansion of the drift functions. A lot of care should be given at the deterministic terms of the expansion to be included into the scheme, to avoid emergence of biases in estimates of drift parameters (recall the analytical study in Section 4.1.3 and the numerical results in Section 5.1). We summarise below the above-designated roadmap for the construction of 'correct' time-discretisation schemes for general classes of hypo-elliptic diffusions.
Step 1. For the rough component, X R , the Euler-Maruyama scheme is applied.
Step 2. For the smooth coordinates in the model, one recursively applies stochastic Taylor expansion to drift functions so that Gaussian variates, in the form of iterated integrals involving Brownian motions, e.g. of the form Table 3. Size (in ∆ n ) of the terms appearing in the locally Gaussian scheme (LG-II).

Component
Gaussian part Deterministic part Our work in this paper leads to further research in several directions. In the CLT of the main analytical result for the parameter estimator (Theorem 2), the step-size ∆n is required to satisfy ∆n = o(n −1/2 ). An open problem for hypo-elliptic diffusions is the construction of estimators giving a CLT under a weaker condition ∆n = o(n −1/p ), p ≥ 3. We expect that such a general estimator for degenerate diffusion models can be produced, with accompanying theory then following the strategy used in our proofs in this work, as we discussed in Remark 4. In a different direction, the effectiveness of the developed locally Gaussian scheme is yet to be studied under a low-frequency observation setting, i.e. with the step-size ∆ assumed fixed and not small enough, in which case a number, say M , of inner sub-steps are introduced by the user. Under such a setting, the discretisation error of the true (intractable) density over the period of size ∆ typically diminishes as M increases. In the case of elliptic diffusions, explicit rates of convergence to zero are provided in Gobet and Labart (2008); Iguchi and Yamada (2021). Finally, in this work, in the practical scenario of partial observations, we have investigated the behaviour of discretisation schemes via case studies and numerical experiments. Analytical theory would be quite instructive in this setting. Techniques used in the context of hidden Markov models (see e.g. Douc et al. (2014)) are expected to be valuable in such a pursuit.
Acknowledgements YI acknowledges support from the Additional Funding Programme for Mathematical Sciences, delivered by EPSRC (EP/V521917/1) and the Heilbronn Institute for Mathematical Research.

A. Preliminaries
In Section A.1 we present some notation used in the Appendix. In Section A.2 we introduce three auxiliary results needed in the proof of our main theorems (Theorems 1, 2) in Section 4.

A.1. Notation
For 0 = t0 < · · · < tn, with equi-distant step-size ∆n, we write X i for the observation at time t i of the solution of the hypo-elliptic SDE (Hypo-II) under the true parameter value θ † , defined upon the filtered probability space (Ω, F, {Ft} t≥0 , P). We denote by ν θ † the invariant distribution of process (Hypo-II) under θ † . In agreement with the structure of class (Hypo-II), we often represent x ∈ R N and θ ∈ Θ ⊂ R N θ as For φ(·, θ) : R N → R, θ ∈ Θ, bounded up to second derivatives, we define the differential operators L and L j , 1 ≤ j ≤ d: Application of the above differential operators is extended to vector-valued functions in the apparent way, via separate consideration of each scalar component. We recall some notation used in the definition of the contrast function ℓn(θ) in (9). We have that When ∆ = 1, we simply write We write, for 1 ≤ i ≤ n, We use Σ(∆, x, θ) to represent the covariance of one step of the local Gaussian scheme (LG-II) for the hypo-elliptic SDE (Hypo-II), given step-size ∆ > 0, initial point x ∈ R N and parameter θ. We often write Σ(x, θ) ≡ Σ(1, x, θ).
We express the inverse of Σ(x, θ) as: where each block matrix is specified as Each block element of the matrix Σ(x, θ) is given later in Section B, and we emphasise here that Σ(x, θ) and its inverse Λ(x, θ) depend on x and (β S , σ) but not on the drift parameter β R in the rough component, and this is critical in the proof of consistency ofβ R,n . Thus, we sometimes write Σ(x, (β S , σ)) and Λ(x, (β S , σ)) to highlight the parameter dependency. We define the mappings We write, with a slight abuse of notation, for 0 ≤ i ≤ n, We denote by S the space of functions f : We recall the probability law of the process {Xt} t≥0 under a parameter θ ∈ Θ is written as P θ , and indicate convergence in probability and distribution, respectively, under the true parameter θ † . An expectation under the probability law P θ is written as E θ . We write for the standard differential operators acting upon maps R n → R, n ≥ 1. We also write ∂ u α = ∂ l ∂u α 1 ···∂u α l for a multi-index α ∈ {1, . . . , n} l , l ∈ N. For a function g : R n → R m , n, m ∈ N, we write:

A.2. Auxiliary Results
We prepare some auxiliary results to be used in the proof of Theorems 1 and 2.

B. Proof of Proposition 1
From the expression of the covariance Σ(∆, x, θ) in (8), its determinant is given as: The assertion holds if we show that under condition (H)-II, the three matrices a R (x, σ), a S 1 (x, θ) and a S 2 (x, θ) are positive definite for any (x, θ) ∈ R N × Θ. Notice that the first equation within the condition leads to the matrix a R (x, σ) being positive definite for all (x, σ) ∈ R N × Θσ. Furthermore, the first and second equations of condition (H)-II yield: for each (x, θ) ∈ R N × Θ, thus the matrix a S (x, θ) is positive definite. Similarly, due to condition (H)-II, for each (x, θ) ∈ R N × Θ. This implies the positive definiteness of a S 1 (x, θ). The proof is complete.

C. Proof of Main Results
In this section we prove the main results, i.e. Theorem 1, 2 in Section 4 of the main text. The proofs make use of some technical results from Appendix D.

C.1. Proof of Theorem 1 -Consistency
To show consistency, we study the limit of the contrast function ℓn(θ), defined in (9), that involves terms such as ..,n are discrete-time observations under the true model (Hypo-II) with parameter θ † . Then, the stochastic Taylor expansion for X S,i+1 yields where R S 1 , R S 2 ∈ S. Careful steps are needed to control the first terms in the right-hand-sides of (29) as ∆n → 0, within the proof of consistency. Our proof proceeds with the following strategy which extends arguments used in Iguchi et al. (2022): Step 1. We prove consistency, along with a convergence rate, for the estimatorβ S 1 ,n . That is, if n → ∞, ∆n → 0 and n∆n → ∞, thenβ In particular, we obtain the rate: Step 2. Making use of the convergence rate in (30), we prove consistency, along with a convergence rate, for the estimatorβ S 2 ,n . That is, if n → ∞, ∆n → 0 and n∆n → ∞, then In particular, we obtain the rate: Step 3. Making use of the rates in (30) and (31), we prove consistency for the estimators (β R,n ,σn). That is, if n → ∞, ∆n → 0 and n∆n → ∞, then (β R,n ,σn) We emphasise that in our proof of consistency, the condition ∆n = o(n −1/2 ) is not required while Gloter and Yoshida (2020) assumed the condition throughout the proof of consistency in the case of the degenerate diffusion class (Hypo-I). Typically, in order to show the consistency ofβ R,n , Gloter and Yoshida (2020) exploited the rates of convergence that are derived under the condition ∆n = o(n −1/2 ). In contrast, in our strategy, the rates of convergence (30) and (31) are obtained without requiring ∆n = o(n −1/2 ), and are put into effective use to avoid explosion of terms such as as ∆n → 0 only, with the help of some results derived from straightforward matrix calculations (Lemma 6, 13 and 14 shown later). C.1.1.
Step 1 Consistency of the estimatorβ S 1 ,n is deduced from the following result.

C.1.2.
Step 2 Making use of convergence (30), we obtain the following result whose proof is postponed to Appendix D.4.
To prove convergence (31), we apply a Taylor expansion on the contrast function to get: where we have set, for θ = (β S , β R , σ) ∈ Θ with β S ≡ (β S 1 , β S 2 ), and M β S ,n ∈ R N β S ×N β S is defined as: Convergence (31) is immediately deduced from the following result.
The consistency of estimatorβ R,n is obtained via the following result whose proof is given in Appendix D.7.
Lemma 10 Assume that conditions (H)-II and (C1)-(C4) hold. If n → ∞, ∆n → 0 and n∆n → ∞, then The proof of consistency for the contrast estimatorθn is now complete.

C.2. Proof of Theorem 2 -Asymptotic Normality
We consider the Taylor expansion of the contrast function ℓn(θ): where we have set, for θ ∈ Θ, with the N θ -dimensional vector vn defined as: The asymptotic normality immediately holds from the following two results -their proofs are shown in Appendices D.9 and D.10.
Recall the notation for the inverse of Σ(x, θ) in (24). Using the inverse formula for a block matrix, we obtain: where we have set From the block matrix representation of Σ(∆, x, θ) in (8), we obtain We then have In the above calculation we have used: where matrix Σ S 2 S 2 (x, θ) is invertible under condition (H)-II as we have seen in the proof of Proposition 1 in Appendix B. Thus, from (38) and (40), we obtain and the proof is now complete.
for some functions R k 1 k 2 j 1 j 2 , R k 1 k 2 j 1 j 2 ,R k 1 k 2 j 1 j 2 ,R k j 1 j 2 ,R ∈ S, and the function b : Notice that for any θ ∈ Θ, because it follows from Lemma 6 that for any x ∈ R N , 1 ≤ j2 ≤ N β S 2 , where Φ(x, θ) is defined as in (33). From Lemmas 2, 3 and the consistency ofβ S,n with convergence (30), we obtain: Finally, we consider the term Q S 2 S 2 (θ). It holds that Q S 2 S 2 (θ) = 1≤k≤4 Q S 2 S 2 ,k (θ), where we set, for 1 ≤ j1, j2 ≤ N β S 2 , for some functions R k j 1 j 2 , R k 1 k 2 j 1 j 2 , R k 1 k 2 j 1 j 2 , R k 1 k 2 j 1 j 2 , , R k j 1 j 2 ,R k 1 k 2 j 1 j 2 , , R k j 1 j 2 , R j 1 j 2 ∈ S. Note that due to Lemma 6, We obtain from Lemma 2, (42) and consistency ofβ S,n that if n → ∞, ∆n → 0 and n∆n → ∞, then The proof is complete by applying the following result to (44) and (45).
Lemma 13 Assume that condition (H)-II holds. We have that, for any (x, θ) ∈ R N × Θ: D.5.1. Proof of Lemma 13 (Proof of (46)). First, we note that the matrices Σ S 1 S 1 (x, θ), Σ S 2 S 2 (x, θ), Σ RR (x, θ) are invertible for any (x, θ) ∈ R N × Θ under condition (H)-II as we have seen in proof of Proposition 1. Due to the block expression of matrix Σ(x, θ) in (36), we have: whereΛ(x, θ), the inverse of matrixΣ(x, θ) given in (37), has the following block expression: where we have set: We then have:Λ where we used (40) for the upper block matrix, while the lower one is obtained via: Thus, we obtain: and now the proof of (46) is complete.
We will derive the limit of the terms U k (θ), 1 ≤ k ≤ 3 evaluated at θ = (β S,n , β R ,σn) by utilising the following result whose proof is given in Section D.8: Lemma 14 Assume that condition (H)-II holds. For any θ = (β S , β R , σ) ∈ Θ and 1 ≤ i ≤ N , we have that We first consider the term U1(θ). Making use of Lemma 14, we have where R j S 1 , R j S 2 , R ∈ S. From Lemma 2 and the limits (30), (31), we obtain that if n → ∞, ∆n → 0 and n∆n → ∞, where U2(θ) is given in the form of the right-hand-side of formula (50). We then obtain U1 β S,n , β R ,σn as n → ∞, ∆n → 0 and n∆n → ∞, uniformly in β R ∈ Θ β R . For the third term U3(θ), it follows from Lemma 14 that where R j , R j ∈ S. From Lemma 3, we have that, if n → ∞, ∆n → 0 and n∆n → ∞, then U3(β S,n , β R ,σn) The proof is now complete.

D.8. Proof of Lemma 14
We have for R j ∈ S, 1 ≤ j ≤ N , where we have set: since it holds Λ(x, θ)Σ(x, θ) = I N ×N for each (x, θ) ∈ R N × Θ. The proof is now complete.
Furthermore, it follows that where we made use of similar arguments in the proof of Lemma 14 (Section D.8) for the term Λ(X k−1 , θ)∂ β (i,j) v(X k−1 , θ). Hence, exploiting Lemmas 2, 3 and the consistency of estimatorθn, we obtain that: as n → ∞, ∆n → 0 and n∆n → ∞, for N β + 1 ≤ i, j ≤ N θ . The proof of (51) is now complete.