Nadaraya-Watson estimator for stochastic processes driven by stable Lévy motions

We discuss the nonparametric Nadaraya-Watson (N-W) estimator of the drift function for ergodic stochastic processes driven by α-stable noises and observed at discrete instants. Under geometrical mixing condition, we derive consistency and rate of convergence of the N-W estimator of the drift function. Furthermore, we obtain a central limit theorem for stable stochastic integrals. The central limit theorem has its own interest and is the crucial tool for the proofs. A simulation study illustrates the finite sample properties of the N-W estimator. AMS 2000 subject classification. 60G52, 62G20, 62M05, 65C30


Introduction
We consider the following nonlinear stochastic differential equation (SDE) driven by an α-stable Lévy motion (0 < α < 2): where b : R → R is an unknown measurable function, σ : R → R + is an unknown positive function (which is considered as a nuisance parameter), and {Z t , t ≥ 0} is a standard α-stable Lévy motion defined on a probability space (Ω, F , P ) equipped with a right continuous and increasing family of σ-algebras {F t , t ≥ 0}, and η is a random variable independent of {Z t }. In this case, Z 1 has a α-stable distribution S α (1, β, 0), where β ∈ [−1, 1] is the skewness parameter of the distribution. When β = 0, the underlying stable distribution is symmetric. For more detailed discussion on stable distributions, we refer to Samorodnitsky and Taqqu [46], Janicki and Weron [25], and Sato [47]. We assume that the stochastic process {X t } is observed at discrete time points {t i = i∆, i = 0, 1, . . . , n}, where ∆ is the time frequency for observation and n is the sample size. The purpose of this paper is to study the nonparametric estimation of the unknown drift function b based on the sampling data {X ti } n i=0 . The nonparametric estimation of the dispersion function σ is much harder, which will be addressed separately.
The SDEs driven by Lévy noises have attracted a lot of attention recently, especially in view of applications to finance (see Schoutens [49], Kyprianou, Schoutens and Wilmott [31]), network traffic (see Mikosch et al. [37]), physics (see, e.g. Schertzer et al. [48]), and climate dynamics (see Ditlevsen [10,11]). The existence and uniqueness of solutions to (1.1) under Lipschitz conditions are standard results in stochastic calculus (see e.g., Applebaum [1]). For simplicity, we assume that the solution X t is stationary and geometrically strong mixing (in this case, the initial distribution is taken from the invariant measure). Recently, Masuda [35] provided sets of ergodic conditions for a multidimensional diffusion process with jumps for any initial distribution to be exponential β-mixing. These conditions build up the bridge between mixing sequences and diffusion processes with jumps.
When the drift function in (1.1) is known to be linear, i.e. b(x) = −θx with unknown parameter θ, the estimation of θ based on discrete or continuous observations of X t was studied in the parametric framework by Hu and Long [21,22]. If the dispersion coefficient is small and the driving noise is a Lévy process, the least squares estimation of θ was also discussed by Long [32] and Ma [34]. Masuda [36] proposed a self-weighted least absolute deviation estimator for discretely observed ergodic Ornstein-Uhlenbeck processes driven by symmetric Lévy processes. But in reality, the drift function is seldom known. Hence in this paper, we focus on estimation of the drift function b(·) in model (1.1) using nonparametric smoothing approach. Nonparametric smooth approach provides a versatile method of exploring the relationship between variables with no prior specified models. The classical Nadaraya-Watson (N-W) estimator of the regression function was proposed independently by Nadaraya [38] and Waston [52].
The major statistical properties (e.g. consistency and rate of convergence) of nonparametric methods for the N-W estimators under independent and identical distribution observations are developed between 1980 and 1990 (see Hardle [19]). These properties have been extended to dependent situations in the 1990s (see Bosq [6]), typically for (α, β and φ)-mixing. These results have been further generalized to stationary processes with so-called uniform predictive dependent structure, which can be regarded as a natural alternative to strong mixing conditions, by Wu [53] and [54].
Many authors have investigated nonparametric estimation for the drift function b in the setting of diffusions driven by Brownian motions. Pham [40] and Prakasa Rao [41] gave a non-parametric estimator for b by mimicking the construction of the well-known Nadaraya-Watson estimator and the asymptotic behavior of the N-W estimator was established. Arfi [2] discussed the uniformly strong consistency of the N-W estimator for the drift function b under ergodic conditions. Recently, Bandi and Phillips [3] extended the N-W estimators to non-stationary recurrent processes.
Other related methods of estimating the drift function have also been proposed. Banon [4] constructed a drift estimator purely based on the kernel estimator and a relation between the drift and the density function along with its derivative, which was further extended by van Zanten [51], Dalalyan and Kutoyants [8,9], Dalalyan [7] (see also the monograph by Kutoyants [30] and references therein). Locally linear (or polynomial) estimators have been proposed by Fan [12] and further discussed in Fan and Gijbels [14], Spokoiny [50] and Fan and Zhang [15]. For a complete review of non-parametric methods for diffusion processes with applications in financial econometrics, see the excellent survey paper by Fan [13]. Hoffmann [20] and Gobet, Hoffmann and Reiss [17] applied wavelet approach.
In the stable setting, Hall, Peng and Yao [18] and Peng and Yao [39] discussed the non-parametric regression estimation for time series with heavy tails. In this paper, we shall consider the regression type of estimation for stochastic processes driven by Lévy motions, which is a natural extension of the discrete time series with heavy tails. We shall discuss the Nadaraya-Waston estimator of the drift function b in this paper. The basic idea of N-W estimator is to minimize an object function given below with certain weights: over the parameter space of a and any given The weight function is given by where K h (·) = K(·/h)/h, K is a kernel density function with mean zero and finite variance, and h is the bandwidth for the kernel. Then, the N-W estimator is given by the following expression . (1. 2) It turns out that N-W estimator is a simple class of a large family called "local polynomial estimator" (see Fan and Gijbels [14]). Hence N-W estimator is also called local constant estimator. As pointed out in Gobet, Hoffmann and Reiss [17], the N-W estimator of the drift or diffusion coefficient in the classical diffusion cases is not consistent for low frequency data (i.e. ∆ is fixed). So, we shall focus on the consistency and asymptotic distribution of the N-W estimator of the drift function for high frequency data (i.e. ∆ → 0) in this paper. The paper is organized as follows. In Section 2, we state our main results on the asymptotic properties of the N-W estimatorb n (x). In Section 3, we conduct a simulation study to confirm the finite sample property. Finally, all the proofs are given in Section 4. We conjecture that the results of this paper can be extended to local polynomial estimator of order p. Throughout the paper, we shall use notation "→ P " to denote "convergence in probability" and notation "⇒" to denote "convergence in distribution".

Asymptotic properties of the Nadaraya-Watson estimator
In this section, we consider asymptotic behavior of the Nadaraya-Watson estimator of the drift coefficient in our stable setting.
Note that α-stable Lévy motion has the scaling (or self-similarity) property Z at d = a 1/α Z t with a > 0. In terms of Theorem 30 in Chapter I of Protter [43], we know that every α-stable Lévy process has a càdlàg (right continuous with left limits) modification. We will henceforth assume that we are using the càdlàg version of any given α-stable Lévy process. When 0 < α < 1, Z t is a pure jump process with locally finite variation; when 1 < α < 2, it is a purely discontinuous martingale; when α = 1, it is a semi-martingale. For more path properties of α-stable Lévy processes, we refer to Sato [47].
Since the strong mixing property of the solution X t to SDE (1.1) is essential in our framework, we give its definition here. The strong mixing coefficient of X is defined by α X (t) = sup where the second supremum is taken over measurable sets A and B in the σalgebras generated by {X u , u ≤ s} and {X u , u ≥ s + t}, respectively. We say that {X t , t ≥ 0} is strongly mixing if α X (t) → 0 as t → ∞.
We will make use of the following assumptions: (A.1). The drift function b(·) and dispersion function σ(·) satisfy a global Lipschitz condition, i.e., there exists a positive constant L > 0 such that (A.2). The dispersion function σ(·) satisfies the following bounded condition: there exist positive constants σ 0 and σ 1 such that 0 < σ 0 ≤ σ(x) ≤ σ 1 for each The solution X t admits a unique invariant distribution µ and is geometrically strong mixing (GSM), i.e. there exists c 0 > 0 and ρ ∈ (0, 1) such that α X (t) ≤ c 0 ρ t , t ≥ 0. Consequently, X t is ergodic. We also assume that X t is stationary.
To study the asymptotic distribution and rate of convergence of the N-W estimator for the drift function, we impose some new conditions as follows: (A.7). The drift function b(·) is twice continuously differentiable with bounded first and second order derivatives.
The density function f (x) of the stationary distribution µ is continuously differentiable (f ′ (x) is continuous).
Note that (A.7) is stronger than the Lipschitz condition on b(·) in (A.1), and (A.8) is stronger than (A.4). We consider the rate of convergence of the N-W estimator when 1 < α < 2.
Let We also assume that f (x) > 0. (2) It is easy to see thatb n (x) is not consistent when α = 1, which is compatible with Theorem 2.4.
(3) Theorem 2.5 (a)-(ii) specifies that the bias of the N-W estimatorb n (x) is in the order of (n∆h , then the estimatorb n (x) has asymptotically zero bias as stated in Theorem 2.5 (a)-(i).
Remark 2.7. We would like to give some comments on the estimation of α which is involved in the rate of convergence of the N-W estimatorb n (x) of the drift function b(x) in Theorem 2.5. Note that we are dealing with non-parametric models with unknown functions b(x) and σ(x), which makes the estimation of α much harder. In order to estimate α, we need also consider the non-parametric estimation of the dispersion function σ(x), which is our ongoing research project and will be addressed in a separate article. After getting estimation for both the drift function b(x) and dispersion function σ(x), we can use the Euler scheme to approximate the SDE (1.1): as an estimate of ∆Z t k for k = 0, 1, . . . , n − 1, which follows the stable distribution S α (∆ 1/α , β, 0). Motivated by the parameter estmation for stable distributions discussed in Press [42], we can define the approximate sample characteristic function bŷ which shall be asymptotically equal to the characteristic function ϕ Z∆ (u) of Then, by replacing ϕ Z∆ (u) withφ n (u), we can get a point estimator of α aŝ Of course, this estimator depends on the choice of the value of u. The consistency and asymptotic behavior ofα will be addressed in our ongoing research project. One can also use the regression method to estimate α as well as other parameters as discussed in Koutrouvelis [28] and [29].
Example 2.8. Here we provide one example of the SDE (1.1) which satisfies all the required conditions in our main results. We consider the following SDE where c ∈ R, d > 0 and ρ > 0 are constants. Namely, the drift function b(x) and dispersion function σ(x) are given by It is clear that b(x) and σ(x) satisfy the Lipschitz condition (A.1), and σ(x) satisfies (A.2). It is easy to verify that b(x) is twice continuously differentiable with bounded first and second derivatives, namely b(x) satisfies (A.7). In order to ensure that the SDE (2.4) satisfies (A.3), we shall employ Theorem 2.2 and Lemma 2.4 of Masuda [35] and verify that Assumption 2 and Assumption 3 * in Masuda [35] are fulfilled. Assumption 2(a) is satisfied since the coefficient functions b(x) and σ(x) are smooth (see the discussion in Section 4.2 of Masuda [35] Therefore the SDE (2.4) satisfies all the required conditions.

A simulation study
In this section, we conduct a simulation study to confirm the finite sample properties of the asymptotic results developed in Section 2. Let T be the length of observation time interval, n be the sample size, and ∆ = T /n be the observation time frequency. For simplicity, let the dispersion function σ(·) be constant. The stable index α considered is 1.5 and the skewness parameter β is zero (symmetric case). We simulate and approximate X t by using the Euler scheme (see e.g., Jacod [23], Jacod et al. [24]): We consider various length of observation time interval T and sample size n. The length of observation interval of the process considered is 10, 50, 100, while the sample size n considered is 1000, 5000, 10000, respectively. The drift function considered in the simulation is one of the following: • Case 1: The purpose of the six cases are two folds. For cases 1-3, we are testing the sensitivity of the N-W drift function estimator away from linearity. By changing the value of σ from 0.05 to 0.1 in cases 1, 2 and 3, we obtain cases 4, 5 and 6, respectively. The purpose of these changes is to test the sensitivity of the N-W estimators with respect to different sizes of the scale parameter σ in the Lévy driven error term.
We adapt the direct method from Janicki and Weron [25] (1994, pp. 48) to generate the α-stable process. That is: we first generate a random sample V ti uniformly distributed on (−π/2, π/2) and an exponential random sample W ti with mean 1. Then we compute the symmetric α-stable random sample Last, we generate the Lévy sample path by putting ∆Z ti = ∆ 1/α S ti . The initial value of the process X t is set to be one. Figure 1 shows the ten simulated sample paths of the process X t . Notice that the jump error term could affect the process a lot at some typical time points.  In computing the N-W estimate, the normal kernel is used and the bandwidth is selected according to the sample size n. In the simulation, we use h = n −1/5 . Figure 3 represents the estimatedb(·) from a random sample with n = 1000 and h = n −1/5 and other information given in the figure. It shows that the N-W estimator performs reasonably well.
The estimatorb(x) is assessed via the square-Root of Average Square Errors (RASE) where {x k } n 1 are chosen uniformly to cover the range of sample path of X t . Table 1 below reports the results on RASE of the N-W estimator of the drift function b(·) based on 1000 replicates. In each cell, the first, second and third numbers represent the sample mean, sample standard deviation and sample median of RASE for three sample sizes n = 1000, n = 5000 and n = 10000, respectively.
Notice that as the time interval expands longer, the estimation is better as expected for whatever sample size even though the time frequency becomes larger. This means that T tending to ∞ is more important than ∆ tending to 0 in the asymptotic behavior of the N-W estimator. The estimates are not so   sensitive to the linearity assumption on the drift function. As the σ increases, the summary statistics of RASE confirm the results in Theorem 2.5. Namely, the increase of σ slows down the convergence of the N-W estimator.
As both T and n get larger, the summary statistics of RASE are getting better seen through the diagonal numerals in Table 1. The bias is getting smaller, the standard deviation is getting smaller most of the time. So, in general, when T h becomes larger, the N-W estimator becomes better with smaller RASE.
One notices that for a fixed length of observation interval T , as the sample size gets larger, the N-W estimator does not behave better, which is consistent with the asymptotic theory of the N-W estimators for stochastic processes driven by Lévy motions. This confirms that the drift function can not be identified in a fixed time interval in the framework of N-W estimator, no matter how frequently the observations are sampled. This phenomenon is also consistent with the sample paths shown in Figure 1. That is, the more often observed diffusion process, it is more affected by the Lévy driven error term which may produce many huge jumps.

Proofs
In this section, we provide proofs for all the results stated in Section 2. The N-W estimator is closely related to the kernel estimatorf n (x) of the density function f (x) of the stationary distribution, which is defined bŷ

Preliminary Lemmas
Before proving our main theorems in Section 2, we prepare some preliminary lemmas. The following result (consistency of the kernel estimator) is an analogue to Lemma 2.1 in Bosq [6].  Proof. Note that By the stationarity of the process X t , we have which converges to f (x) for each x as n → ∞ by Lebesgue dominated convergence theorem. Thus, it suffices to prove thatf n (x) − E[f n (x)] → P 0. We havef . . , n. Note that sup 1≤i≤n |η n,i (x)| ≤ C 0 h −1 a.s. for some positive constant C 0 < ∞. Applying Theorem 1.3 of Bosq [6], we have for each integer q ∈ [1, n 2 ] and each ε > 0 where v 2 (q) = 2 p 2 s(q) + Here we set η n,n+1 (x) = 0 for the well-definedness of s(q). By using Cauchy-Schwarz inequality and stationarity of η n,i (x), it is easy to find that s We claim that By using the GSM property of X t , we find that there exists some a > 0 such that Hence it suffices to show that for some a ′ > 0. Some basic calculation yields that (4.8) is equivalent to The existence of such a a ′ (e.g. a ′ = a 2 ) is justified since the right hand side of the above inequality tends to zero under (A.6). Finally, combining (4.5), (4.6) and (4.7), we have Therefore, the desired convergence result (4.4) follows from given conditions. Next we establish a central limit theorem (CLT) for stable stochastic integrals, which has some independent interest. The CLT will be crucial in establishing the consistency (or inconsistency) and asymptotic distribution of the N-W estimator. Let φ(t) be a predictable process satisfying T 0 |φ(t)| α dt < ∞ almost surely for T < ∞. Then the stochastic integral t 0 φ(s)dZ s is well-defined (see e.g., Rosinski and Woyczynski [45], Kallenberg [27]). We assume that either φ(t) is nonnegative or Z is symmetric. Then, we have the following version of Lenglart's inequality in the stable setting.
Lemma 4.2. For any given ε > 0 and δ > 0, there is some constant c > 0 such that Proof. Let S t = t 0 |φ(s)| α ds. By Theorem 4.1 and Theorem 4.2 of Kallenberg [27] (see also Theorem 3.1 in Rosinski and Woyczynski [45] for the symmetric case), there exists a strictly α-stable process Z ′ with the same finite-dimensional distributions as Z such that t 0 φ(s)dZ s = Z ′ (S t ) almost surely. By the classical maximal inequality (see e.g., Proposition 10.2 of Fristedt [16]), we find that for This completes the proof.
The following result is a version of the CLT for stable stochastic integrals.

Lemma 4.3. Suppose that there is a deterministic and nonnegative function
Then, we have For a fixed T , we redefine the function φ on the interval (T, T + 1] as φ(t) = Φ −1 (T ) and define the stopping time τ T = inf{t ≥ 0 : R t > 1}. Then, τ T ∈ [0, T + 1] almost surely. Note that there is a strictly α-stable process Z ′ with the same finite-dimensional distributions as By using Lemma 4.2 and following exactly the same arguments as in the proof of Theorem 1.19 in Kutoyants [30], we can show that the characteristic function of Φ(T ) T 0 φ(t)dZ t converges to the characteristic function of Φ(T ) τT 0 φ(t)dZ t as T → ∞. Therefore, by the continuity theorem (see Theorem 26.3 of Billingsley [5]), it immediately follows that (4.11) holds.
We say that a continuous function F : [0, ∞) → [0, ∞) grows more slowly than u α (α > 0) if there exist positive constants c, λ 0 and α 0 < α such that F (λu) ≤ cλ α0 F (u) for all u > 0 and all λ ≥ λ 0 . The following lemma (Lemma 2.4 of Long [33]) provides moment inequalities for stable stochastic integrals, which will be a crucial tool in the proofs of our main results.

Lemma 4.4. Let φ(t) be a predictable process satisfying
T 0 |φ(t)| α dt < ∞ almost surely for T < ∞. We assume that either φ is nonnegative or Z is symmetric. If F (u) grows more slowly than u α , then there exist positive constants c 1 and c 2 depending only on α, α 0 , β, c and λ 0 such that for each T > 0
Proof of Claim (i). Note that By Lemma 4.1, it is clear that B n,1 (x) → P b(x)f (x) when n → ∞. For B n,2 (x), by the Lipschitz property of b(·) and stationarity of X t , we have ] is uniformly bounded for each i under condition (A.5). By slightly modifying the proof of Lemma 4.1, we can show that when n → ∞. By using the continuity of f (x) and Lebesgue dominated convergence theorem, we find Combining (4.16), (4.17) and (4.18), it follows that B n,2 (x) → P 0 when n → ∞.
Hence the claim (i) holds.
Proof of Claim (ii). By using the Lipschitz condition of b(·), we have Let us consider the estimate of sup ti≤t≤ti+1 |X t −X ti |. Note that for t i ≤ t ≤ t i+1 By using Lipschitz condition on b(·) again, we find By Gronwall's inequality, we have By (4.19) and (4.20), we find By the claim (i) with b being replaced by |b|, we know that when n → ∞. This implies that B n,3 (x) → P 0 when n → ∞. By Markov inequality and Lemma 4.4, we have Hence, B n,4 (x) → P 0 as n → ∞. This completes the proof of claim (ii).
Proof of Claim (iii). We define so that φ n (t, x) is a predictable process. Then, we have By using Markov inequality, Lemma 4.4, boundedness of σ(X t− ), and stationarity of X t , we find for some constant C 2 > 0 which goes to zero under condition (A.6). This shows that claim (iii) holds.

Proof of Theorem 2.4
The claims (i) in the proof of Theorem 2.3 is still true when 0 < α ≤ 1. For claim (ii), we need to make some minor modifications on its proof. It is clear that we still have B n,3 (x) → P 0 under (A.1)-(A.6). For B n,4 (x), by Markov inequality, Lemma 4.4, condition (A.2), and stationarity of X t , we have for α 1+α < r < α ≤ 1 which tends to zero when nh∆ r α(1−r) → 0. So, under this extra condition, claim (ii) holds when 0 < α ≤ 1. However, we shall prove that claim (iii) is not true, i.e., g n,3 (x) is no longer converging to zero in probability as n → ∞ and consequentlyb n (x) is not consistent. Recall that Then, we have g n,3 (x) = 1 and consequently Similar to Lemma 4.1, it is easy to prove that 1 n Therefore, we have Next we deal with the second term J. By Lipschitz condition (A.1) on σ(·), (4.20) and basic inequality ||u + v| q − |v| q | ≤ |u| q for u, v ∈ R and q ∈ (0, 1], we have It is clear that J 1 → P 0 under given conditions since (similar to Lemma 4.1) To prove that J 2 → P 0, it is sufficient to show that By using Markov inequality, Lemma 4.4, and condition (A.2), we have for q < 1 If g n,3 (x) → P 0 as n → ∞, then the right hand side of (4.26) converges to zero in probability as n → ∞ since (n∆h) 1− 1 α → 0 when 0 < α < 1 and (n∆h) 1− 1 α = 1 when α = 1 under condition (A.6). This contradicts (4.34). Therefore, we conclude that g n,3 (x) does not converge to zero in probability. This completes the proof.

Proof of Theorem 2.5
Proof of Theorem 2.5 (a)-(i). Note that Sincef n (x) → f (x) in probability as n → ∞, it is enough to study the asymptotic behavior of V n (x). By (4.14), we find Under the given conditions, we have the following claims: Claim 1.
Here the notation o P (1) (or O P (1)) means a sequence of random variables converging to zero (or a finite constant) in probability.
Applying Theorem 1.3 of Bosq [6], we have for each integer q ∈ [1, n 2 ] and each ε > 0 By using Billingsley's inequality (see Corollary 1.1 of Bosq [6])and stationarity of ξ n,i (x), it is easy to find that here we have used the fact that k=0 α X (k∆) = O(∆ −1 ) under the GSM condition on X t . Then, we have , which goes to ∞ by choosing . It is also easy to see that (by GSM property of X t again) Therefore, we conclude that 1 n n i=1 ξ n,i (x) = o P (1).
Proof of Claim 2. By (4.21), we have As in the proof of Claim (ii) of Theorem 2.3 via Lemma 4.4, we can show that which goes to zero in probability since (n∆h which goes to zero under the condition (n∆h) 1− 1 α ∆ 1/α = o(1). Thus, it follows that claim 2 holds.