Approximate self-weighted LAD estimation of discretely observed ergodic Ornstein-Uhlenbeck processes

We consider drift estimation of a discretely observed Ornstein-Uhlenbeck process driven by a possibly heavy-tailed symmetric Lévy process with positive activity index β. Under an infill and large-time sampling design, we first establish asymptotic normality of a self-weighted least absolute deviation estimator with rate of convergence $\sqrt{n}h_n^{1-1/\beta}$, where n denotes the sample size and $h_n > 0$ the sampling mesh satisfying $h_n \to 0$ and $nh_n \to \infty$. This implies that the rate of convergence is determined by the most active part of the driving Lévy process; the presence of a driving Wiener part leads to $\sqrt{nh_n}$, which is familiar in the context of asymptotically efficient estimation of diffusions with compound Poisson jumps, while a pure-jump driving Lévy process leads to a faster rate. Also discussed is how to construct corresponding asymptotic confidence regions without full specification of the driving Lévy process. Second, by means of a polynomial type large deviation inequality we derive convergence of moments of our estimator under additional conditions.


Introduction
Estimation of discretely observed stochastic processes with jumps has received growing interest from both theoreticians and practitioners. Among others, the Markovian Ornstein-Uhlenbeck (OU for short) process has several attractive features, mainly stemming from its continuous-time first-order autoregressive structure. Let $X = (X_t)_{t\in\mathbb{R}_+}$ be the univariate OU process given by the stochastic differential equation
$$dX_t = (\gamma - \lambda X_t)\,dt + dZ_t, \qquad (1.1)$$
where Z is a nontrivial symmetric Lévy process independent of $X_0$. In this paper, we are concerned with estimation of the true value $\theta_0 := (\lambda_0, \gamma_0)$ of the unknown parameter $\theta := (\lambda, \gamma) \in \Theta \subset (0, \infty) \times \mathbb{R}$ based on discrete-time data $(X_{t_i})_{i=0}^n$, without full specification of Z's Lévy measure. Here $t_i = t_i^n = ih_n$ with $h_n > 0$ such that $h_n \to 0$ and $nh_n \to \infty$ as $n \to \infty$; that is, we consider infill and large-time asymptotics for the sampling design. In the sequel, we often use the abbreviation $h = h_n$.
Analysis of non-Gaussian OU processes was initiated by Doob [10] for symmetric stable Z. The general Lévy driven case has been highlighted in application fields, especially in finance and turbulence, by, among others, Barndorff-Nielsen [2] and Barndorff-Nielsen and Shephard [3]. Also, stochastic modelling of several physical phenomena has been supported by non-Gaussianity through realistic experiments; see, e.g., Garbaczewski and Olkiewicz [12] and the references therein. In the stochastic differential equation (1.1), λ and γ stand for the "intensity of the mean reversion" and the "instantaneous constant drift", respectively. Here the mean-reversion level is γ/λ, which also corresponds to the "long-run (invariant) mean" of the process X. We know the general relation $m\lambda\kappa(m) = \kappa_Z(m)$, $m \in \mathbb{N}$, between the mth cumulants $\kappa(m)$ and $\kappa_Z(m)$ of X's invariant distribution and the distribution of $Z_1$, respectively; cf. Barndorff-Nielsen and Shephard [3, Section 2.1], and also (2.5) below. In particular, a smaller λ > 0 (a weaker mean reversion) leads to a larger long-run mean in magnitude in the presence of a nonnull instantaneous mean, and also to a larger long-run variance. Owing to their mathematical tractability in the face of the diversity of Lévy processes, OU processes remain a subject of active research in statistical inference for stochastic processes. We refer to Masuda [22] for more detailed information and history concerning general Lévy driven OU processes.
There is a substantial literature on estimation of OU processes; however, most of it concerns cases where the sampling mesh h is fixed, and where γ = 0 and Z has no negative jumps, so that X takes values in (0, ∞). In this case, among others: Jongbloed et al. [16] considered nonparametric estimation of the Lévy density of Z together with a simple consistent estimator of λ; Jongbloed and van der Meulen [15] studied parametric estimation based on the "cumulant M-estimator", a kind of weighted minimum-$L^2$-distance contrast function; Brockwell et al. [6] derived the limit distribution of the Davis-McCormick estimator, which cannot be of direct use when jumps of Z are bilateral, and also discussed the possibility of consistent estimation of λ as soon as h → 0 even when nh is fixed. Also, the recent work of Creal [7] investigated the performance of filtering and smoothing algorithms for estimating integrated squared positive OU processes, a key quantity in the stochastic volatility model of Barndorff-Nielsen and Shephard [3].
As we target drift estimation, the most naive but practical approach would be the approximate least-squares estimator (LSE), which minimizes the sum of squared Euler-scheme residuals. Indeed, the LSE fulfils asymptotic normality when Z is centered with finite moments, $nh^2 \to 0$, and $nh \to \infty$, the resulting rate of convergence being necessarily $\sqrt{nh}$; see Masuda [23] as well as Section 2.2.2 below. Although the rate $\sqrt{nh}$ is well known to be optimal in the context of drift estimation of diffusions with compound Poisson jumps, our main result says that this is no longer the case as soon as Z is of pure-jump type. Hu and Long [14] recently studied the LSE of λ > 0 when Z is symmetric β-stable, supposing that $\gamma_0 = 0$ from the beginning. Our results are completely different from theirs in terms of the rate of convergence and the limit distribution; see Section 2.2.2 for some theoretical comparisons between their result and ours.
Instead, motivated by Ling [20], in this paper we introduce an approximate least absolute deviation (LAD) type estimator and study its asymptotic behavior. LAD estimation has a long history and is one of the popular estimation procedures robust to outlying observations. The LAD estimator is based on the "Laplacian" $L^1$-loss, while the LSE is based on the "Gaussian" $L^2$-loss. We refer to, among others, Knight [17], Koenker [18], and Portnoy and Koenker [25], as well as the references therein, for a detailed account and historical background of LAD estimation. LAD type estimation has also been deeply investigated in the time-series literature, e.g., by Davis and Dunsmuir [8] and Davis et al. [9]. Just for illustrative purposes, suppose that the observed time-series data stem from the ergodic first-order autoregressive model $X_k = \theta_0 X_{k-1} + \epsilon_k$, $k \le n$, where $|\theta_0| < 1$ and $(\epsilon_k)$ is an i.i.d. noise sequence with common median 0. Then the unweighted LAD estimator $\hat\theta_n$ of $\theta_0$ is defined to be a minimizer of the contrast function $\theta \mapsto \sum_{k=1}^n |X_k - \theta X_{k-1}|$. If $\epsilon_1$ admits finite absolute moments of sufficiently high order, $\hat\theta_n$ is known to be asymptotically normally distributed at rate $\sqrt{n}$. On the other hand, when $\epsilon_k$ has infinite variance, it is known that the maximum likelihood and LAD estimators have a faster rate of convergence than $\sqrt{n}$, while both of them lead to intractable limit distributions; see Andrews et al. [1] and Davis et al. [9] for details in this direction. Ling [20] instead introduced a self-weighted LAD (SLAD) contrast function for infinite-variance autoregressive models, which yields asymptotically normally distributed estimators at rate $\sqrt{n}$. That is to say, Ling's result means that we may recover a conventional asymptotic normality result at the cost of a slower rate of convergence than the maximum likelihood and unweighted LAD estimators.
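The AR(1) illustration above can be sketched numerically. The following is a minimal sketch, not the estimator studied in this paper: for this scalar autoregression through the origin, minimizing the self-weighted $L^1$ contrast reduces to a weighted median of the ratios $X_k/X_{k-1}$; the weight $w(x) = 1/(1+x^2)$ is our illustrative choice in the spirit of Ling [20], not one prescribed by the text.

```python
import numpy as np

def weighted_median(values, weights):
    # Minimizer m of sum_k weights[k] * |values[k] - m|.
    order = np.argsort(values)
    v, wt = values[order], weights[order]
    cum = np.cumsum(wt)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

def slad_ar1(x, w=lambda u: 1.0 / (1.0 + u ** 2)):
    # Self-weighted LAD for X_k = theta * X_{k-1} + eps_k: minimizing
    # sum_k w(X_{k-1}) |X_k - theta * X_{k-1}| over theta is, for this scalar
    # regression through the origin, a weighted median of the ratios
    # X_k / X_{k-1} with weights w(X_{k-1}) * |X_{k-1}|.
    xp, xc = x[:-1], x[1:]
    keep = xp != 0.0
    xp, xc = xp[keep], xc[keep]
    return weighted_median(xc / xp, w(xp) * np.abs(xp))

rng = np.random.default_rng(1)
theta0, n = 0.5, 20000
eps = rng.standard_t(df=1.2, size=n)   # infinite-variance innovations
x = np.empty(n + 1)
x[0] = 0.0
for k in range(n):
    x[k + 1] = theta0 * x[k] + eps[k]

theta_hat = slad_ar1(x)                # close to theta0 = 0.5
```

Despite the infinite variance of the noise, the self-weighted estimate remains stable, in line with Ling's asymptotic normality result.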
It can be expected that SLAD estimation serves as a robust drift estimation method for discretely observed continuous-time stochastic processes as well, where large jumps may deteriorate the finite-sample performance of the LSE or, more generally, of quasi-likelihood estimators.
Our SLAD estimator is defined as a minimizer $\hat\theta_n$ of the contrast function (2.6) below, for an appropriate weight function w; unweighted LAD estimation corresponds to the case w ≡ 1. Under regularity conditions, we first derive asymptotic normality of $\hat\theta_n$ at rate $\sqrt{n}h^{1-1/\beta}$ (see Theorem 2.1), where β stands for the activity index of the driving Lévy process defined by (2.4) below. As a result, when Z is of pure-jump type, we obtain a faster rate of convergence than the familiar $\sqrt{nh}$. It is interesting that we can obtain a faster rate of convergence merely by changing the loss from $L^2$ to $L^1$. Although the asymptotic covariance matrix as well as the rate of convergence inevitably depends on the unknown index β, we clarify that a feasible construction of asymptotic confidence regions is possible; specifically, we can construct explicit statistics $\hat T_n$ such that $\hat T_n(\hat\theta_n - \theta_0)$ tends to the standard normal distribution (see Theorem 2.3). Owing to the robustness of LAD type estimates against "outlying" data, our SLAD estimator should be robust against "big" jumps caused by the driving process Z without individual detection of them, making the estimation procedure more practical. Also obtained, under additional conditions, is the convergence of moments of the normalized quantities $\sqrt{n}h^{1-1/\beta}(\hat\theta_n - \theta_0)$ (see Theorem 2.2). This much stronger mode of convergence is obtained as a byproduct of the polynomial type large deviation inequality (4.16), which we prove by applying a general result due to Yoshida [34]. Finally, we remark that convergence of moments, as well as a large deviation inequality, is a crucial tool for investigating the asymptotic behavior of expected values of statistics depending on estimators, and also error estimates in higher-order theoretical statistics.
For smooth statistical random fields associated with stochastic processes, large deviation inequalities have been investigated and applied, e.g., to information criteria in model selection, the validity of higher-order asymptotic statistical theory, and moment convergence of quasi-likelihood and Bayes estimators of multidimensional ergodic diffusion processes; among others, see Uchida and Yoshida [32, 33], Sakamoto and Yoshida [28], and Yoshida [34] for details in these directions. To the best of the author's knowledge, our Theorem 2.2 is the first result providing a large deviation inequality and convergence of moments associated with a non-differentiable LAD type statistical random field for dependent data.
The rest of this paper is organized as follows. Our main results are given in Section 2, along with several preliminary facts. We present some numerical experiments in Section 3. Finally, Section 4 is devoted to the proofs.

Preliminaries and statement of main results
Let X be given by (1.1), and denote by η the initial distribution of X. Throughout this paper we assume that:

Θ is a bounded convex domain whose closure satisfies $\bar\Theta \subset (0, \infty) \times \mathbb{R}$; (2.1)

$h_n \to 0$ and $nh_n \to \infty$; (2.2)

there exists a constant q > 0 such that $\int |x|^q\,\eta(dx) < \infty$.

Here and in what follows, asymptotic symbols refer to $n \to \infty$ unless otherwise mentioned. We denote by ν and $\sigma^2$ the Lévy measure and the Gaussian variance of Z, respectively; we implicitly presuppose that either $\sigma^2 > 0$ or $\nu(\mathbb{R}) > 0$, excluding the trivial case. We refer to Sato [29] for a systematic account of Lévy processes. For the purpose of investigating sample-path properties of processes with independent increments, Blumenthal and Getoor [5] introduced the notion of the "activity index" $\inf\{r > 0 : \int_{|z|\le1} |z|^r\,\nu(dz) < \infty\}$, which measures the degree of small-jump fluctuations. This index plays a crucial role in our main results. One of our regularity conditions (Assumption 1 below) takes essentially different forms according as $\sigma^2 > 0$ or $\sigma^2 = 0$. It turns out to be convenient in this paper to introduce the modified activity index β of Z, which equals 2 whenever $\sigma^2 > 0$ and the Blumenthal-Getoor index above otherwise (see (2.4)). Since β ≤ 2, the rate $\sqrt{n}h^{1-1/\beta}$ appearing in our results is always at least as fast as $\sqrt{nh}$.

Asymptotic normality
We impose some structural assumptions on Z.
We may ignore Assumption 1.2 if $\sigma^2 > 0$. Roughly speaking, Assumption 1.2 entails that the small fluctuations of Z should be like those of a β-stable Lévy process; note that $\tilde g$ may take negative values near the origin (e.g., $\tilde g(z) = e^{-|z|} - 1$), and in particular, that the measure $\nu''$ and the function $\tilde g$ are identically null for symmetric β-stable Z. In particular, if $g(z) = |z|^{-1-\beta} v(z)$ with v positive, bounded, and smooth on $U \setminus \{0\}$, then we have δ = 1 in Assumption 1.2(a).
Typical examples are the generalized hyperbolic case (except for the variance gamma) and the exponentially tempered stable case, the corresponding Lévy densities (on the whole of $\mathbb{R}\setminus\{0\}$) of which are given, respectively, with some positive constants $a_k$ and $b_k$; see Raible [26, pp. 39-40] for the former, and Rosiński [27] as well as the references therein for the latter.
Before proceeding, we point out some facts concerning the OU processes. Denote by P 0 the distribution of X associated with θ 0 , and by E 0 the corresponding expectation operator. Then we know the following, both of which are essential in our forthcoming results.
See Masuda [22, 24] for more details. We also note that $\pi_0$ is necessarily self-decomposable, and hence admits a density with respect to the Lebesgue measure. The characteristic function of $\pi_0$ is given by (2.5). Trivially, the density of $\pi_0$ is symmetric around $\gamma_0/\lambda_0$.

Now we introduce our contrast function

$$M_n(\theta) := \sum_{i=1}^n w(X_{t_{i-1}})\,\bigl|\Delta_i X - h(\gamma - \lambda X_{t_{i-1}})\bigr|, \qquad (2.6)$$

where $\Delta_i X := X_{t_i} - X_{t_{i-1}}$, $i \le n$, and $w : \mathbb{R} \to \mathbb{R}_+$ is supposed to be free of θ. The SLAD estimator is then defined to be any measurable mapping $\hat\theta_n = (\hat\lambda_n, \hat\gamma_n)$ such that $M_n(\hat\theta_n) = \inf_{\theta\in\bar\Theta} M_n(\theta)$. We impose the following technical conditions on the "weight" function w.
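For concreteness, a minimization of this kind can be sketched as follows. This is an illustrative sketch only: the contrast is taken as the weighted absolute Euler residuals described above, the data are a simulated Gaussian OU path (whose exact transition is available), and the derivative-free Nelder-Mead optimizer is our choice, not one prescribed by the text.

```python
import numpy as np
from scipy.optimize import minimize

def slad_contrast(theta, x, h, w):
    # sum_i w(X_{t_{i-1}}) | Delta_i X - h * (gamma - lam * X_{t_{i-1}}) |
    lam, gam = theta
    resid = np.diff(x) - h * (gam - lam * x[:-1])
    return np.sum(w(x[:-1]) * np.abs(resid))

# Simulated data: Gaussian OU dX = (gam0 - lam0 * X) dt + dW, sampled exactly.
rng = np.random.default_rng(7)
lam0, gam0, n, h = 1.0, 0.5, 5000, 0.05
a = np.exp(-lam0 * h)                       # one-step autoregression coefficient
m = (gam0 / lam0) * (1.0 - a)               # one-step mean shift
s = np.sqrt((1.0 - a ** 2) / (2.0 * lam0))  # one-step conditional std. dev.
x = np.empty(n + 1)
x[0] = gam0 / lam0
for i in range(n):
    x[i + 1] = a * x[i] + m + s * rng.standard_normal()

w = lambda u: np.ones_like(u)               # unweighted LAD, w = 1
res = minimize(slad_contrast, x0=np.array([0.5, 0.0]), args=(x, h, w),
               method="Nelder-Mead")        # derivative-free: contrast is nonsmooth
lam_hat, gam_hat = res.x
```

Since the contrast is convex but nonsmooth in θ, a derivative-free method (or a convex-optimization solver) is a natural choice for the numerical minimization.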
In analogy with Ling [20], Assumption 2.2 is indispensable for deducing an asymptotic normality result when ν has heavy tails. For example, even if $\sup_{t\in\mathbb{R}_+} E_0[|X_t|^q] < \infty$ only for some q < 4, the first part of Assumption 2.2 implies the existence of a universal constant C such that $w(x)x^4 = w(x)|x|^{4-q}|x|^q \le C|x|^q$ for every x, so that the required fourth-moment bound under the weight remains finite. Note that Assumption 2.2 is redundant if we can pick q ≥ 4, in which case w need not taper at infinity (in particular, we may take w ≡ 1, arriving back at unweighted LAD estimation). See also Section 2.2.1 for some related remarks.
We need the following condition on the decreasing rate of $h = h_n$ in connection with the value of β.
Recall that X is given by (1.1) and that θ = (λ, γ). Now we are ready to state our asymptotic normality result; see Section 4.1 for the proof. Thus, unlike the case of diffusions with compound Poisson jumps, for pure-jump Z we no longer have the typical rate $\sqrt{nh}$ corresponding to $\sqrt{T}$ in the setting where a continuous-time record $(X_t)_{t\in[0,T]}$ with $T \to \infty$ is available; see, e.g., Luschgy [21, Section 5] and Sørensen [31]. The rate $\sqrt{n}h^{1-1/\beta}$ reflects the degree of small-jump fluctuation of Z in conjunction with the sampling frequency 1/h. It is worth mentioning that the rate of convergence becomes free of the sampling frequency for β = 1; as mentioned before, this is the case for, e.g., any symmetric generalized hyperbolic Z with positive scale parameter.
For construction of asymptotic confidence intervals of θ, we have to derive a consistent estimator of V 0 . We return to this issue in Section 2.1.3 shortly after stating the moment convergence result.

Convergence of moments
In the sequel, for nonnegative sequences $a'_n$ and $a''_n$ we write $a'_n \lesssim a''_n$ if there exists a positive generic constant C such that $a'_n \le C a''_n$ a.s. for every n large enough. Here we introduce:

1. There exists a constant $\epsilon_0 \in (0, 1)$ such that $nh_n \gtrsim n^{\epsilon_0}$.

2. $\limsup_{|x|\to\infty} \{w(x)|x|^4\}^k / |x|^q < \infty$ for any k > 0.
For Assumption 4.1, we may set $h = n^{-\tau}$ with an appropriate τ (see Section 2.2.3), while, for example, the choice $h = n^{-1}\log n$ is not enough. Trivially, as with Assumption 2.2, we can remove Assumption 4.2 if we can make q in Assumption 1 arbitrarily large; otherwise, it suffices to take, e.g., any uniformly continuous w with compact support, or any w decreasing subgeometrically as $|x| \to \infty$. We need the "provisory" Assumption 4.3 to handle asymptotically negligible martingale terms uniformly in the parameter when proving the polynomial type large deviation inequality (see (4.16) below), which is of substantial importance in the proof of our moment convergence result. The proviso says that the dimension of the unknown θ comes into play in the uniform estimates of the martingale terms; such a phenomenon does not arise when the contrast function is smooth in θ and sufficiently integrable.
We obtain the convergence of moments at the cost of these additional assumptions. See Section 4.2 for the proof.

Interval estimation
Now we look at how to implement interval estimation based on Theorem 2.1. The asymptotic covariance matrix $V_0$ depends on the quantities $\{M(k, l)\}$ and on (β, c) through $\varphi_\beta(0)$. In order to make Theorem 2.1 usable in practice, we in principle have to estimate these quantities.
Since $U_0$ is expressed only through $\{M(k, l)\}$, we can readily obtain a consistent estimator $U_n$ of $U_0$ by means of Lemma 4.6. On the other hand, as specified by (2.7), the remaining quantity $\varphi_\beta(0)$ depends only on the two parameters β and c (this point is completely different from Ling [20]). Nevertheless, direct consistent estimation of (β, c) seems rather difficult in general, since the full form of ν is not specified here. In addition, even if we could obtain consistent estimators $(\hat\beta_n, \hat c_n)$, we would actually need to specify the rate of convergence of $\hat\beta_n$ in order to successfully replace $\sqrt{n}h^{1-1/\beta}$ with $\sqrt{n}h^{1-1/\hat\beta_n}$ and obtain the desired asymptotically standard normal version (with limit covariance $I_2$, the two-dimensional identity matrix); namely, we need $h^{1/\beta - 1/\hat\beta_n} \to^p 1$. In Theorem 2.3 below, we show that an appropriate use of a kernel estimator, as also employed by Ling [20], enables us to overcome this annoying aspect. Specifically, we show how to construct a consistent estimator of $\varphi_\beta(0)$ and then formulate a converted distributional result with asymptotic standard normal distribution, which can be used without direct estimation of (β, c).
and define the kernel-type estimator $\hat\varphi_\beta(0)_n$. Then $\hat\varphi_\beta(0)_n \to^p \varphi_\beta(0)$. In particular, additionally supposing $\liminf_{n\to\infty} nh^{2-1/\beta} > 0$, we have the (β, c)-free version of the asymptotic normality: (2.12). See Section 4.3 for the proof of Theorem 2.3. For the kernel function K, we can adopt the standard Gaussian kernel $K(z) = (2\pi)^{-1/2} e^{-z^2/2}$ or, as used by Ling [20], the logistic kernel $K(z) = (e^{z/2} + e^{-z/2})^{-2}$. See Section 2.2.3 for some discussion of the conditions imposed so far on the sampling design (the asymptotic behavior of h).
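The kernel device can be sketched in isolation. The following is a minimal illustration, not the paper's exact construction: it estimates a density at zero with the logistic kernel, checked on i.i.d. standard normal data where the true value is $(2\pi)^{-1/2} \approx 0.399$; the bandwidth is an arbitrary illustrative choice, and the paper's normalization of the residuals is not reproduced.

```python
import numpy as np

def logistic_kernel(z):
    # K(z) = (e^{z/2} + e^{-z/2})^{-2}; integrates to 1 over the real line.
    return 1.0 / (np.exp(z / 2.0) + np.exp(-z / 2.0)) ** 2

def density_at_zero(sample, bandwidth):
    # Kernel estimate f_hat(0) = (n * l)^{-1} * sum_i K(sample_i / l).
    return np.mean(logistic_kernel(sample / bandwidth)) / bandwidth

# Sanity check on i.i.d. N(0,1) data: the true density at zero is (2*pi)^{-1/2}.
rng = np.random.default_rng(3)
z = rng.standard_normal(200000)
f0 = density_at_zero(z, bandwidth=0.1)    # about 0.399
```

In the setting of Theorem 2.3, the same device would be applied to suitably normalized residuals rather than to raw i.i.d. data.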

Case of nonnull Gaussian part
Just for reference, we single out the nonnull Gaussian-part case as a corollary of the previous results.
2. Suppose that $\limsup_{n\to\infty} nh^2 < \infty$ if both $\lambda_0$ and $\gamma_0$ are unknown; no additional condition is needed if either $\lambda_0$ or $\gamma_0$ is known from the beginning. Then the stated convergence holds for any continuous function f. Corollary 2.4 follows directly from the previous claims.

Some future issues
In the literature, we could find no previous work concerning LAD type estimation for discretely observed processes under infill asymptotics. Here we mention some future issues worth considering, together with some conjectures.
• "What is a proper definition of asymptotic efficiency in the present framework?" Concerning the statistical model in question, we would in principle want to derive local asymptotic normality (LAN). Nevertheless, the LAN property cannot hold true at least for non-Gaussian stable Z, as in the case of infinite-variance autoregressive time series models (see Davis et al. [9] for details); for the stable case we conjecture that the best attainable rate for estimating λ is $n^{1/\beta}h^{1-1/\beta}$, much faster than the $\sqrt{n}h^{1-1/\beta}$ of our SLAD estimator. On the other hand, we conjecture that $\sqrt{n}h^{1-1/\beta}$ is the best attainable rate of convergence for estimating γ.
• "Is it possible to relax the ergodicity and the long-term asymptotics?" We have set nh → ∞ and focused on the ergodic case. Nevertheless, our SLAD estimator does seem to work even when X is non-recurrent, with or without the condition nh → ∞ (e.g., nh a fixed positive constant, though in this case we need β < 2); needless to say, the limit distribution may then no longer be normal. For a specific derivation of the limit distribution when nh is fixed, we would need a more sophisticated weak limit theorem than the martingale central limit theorem used in the present proof; it would be nice if we could derive a tailor-made stable convergence in law, leading to a mixed normal limit distribution with a specified limit random covariance matrix.
• "What is occurring in the higher-order part?" For example, Knight [17, Section 4.2] (see also the references therein) discussed this issue for linear regression models. It would be interesting to investigate this point in the framework of discretely observed OU processes.
• "What will occur for small β?" Overall, our assumptions require the index β to be large. Concerning Theorem 2.1, it is expected from the proof (see Section 4.1) that we may relax the sampling-design condition Assumption 3 by targeting the "genuine" SLAD estimator, defined as the minimizer of a contrast function that is a little more involved than ours but should be appropriate in view of the expression (4.1). Furthermore, even if this estimating function works properly, it still excludes the case β = 0, e.g., variance gamma, bilateral gamma, and purely compound Poisson Z: in such cases the local-limit result given by Lemma 4.4 below, which is essential in our proofs, breaks down. Therefore we would have to resort to an entirely different estimation procedure. See also Section 2.2.4.
We leave answering such questions to future works.

Remarks on the results
Here we gather some technical remarks concerning the results given in Section 2.1.

On the asymptotic covariance matrix
In general, we do not know which w optimizes the asymptotic covariance matrix $V_0$. Nevertheless, $V_0$ can be simplified, and actually optimized, in some instances. Let $m_k := \int x^k \pi_0(dx)$ and denote by $v^2$ the variance of $\pi_0$, whenever they exist.

1. Suppose q ≥ 4; then $V_0$ admits a simplified explicit form. If additionally $\gamma_0 = 0$, then $V_0$ becomes diagonal, hence the SLAD estimators of $\lambda_0$ and $\gamma_0$ are asymptotically independent, as in the case of OU diffusions.
2. On the other hand, suppose that $\gamma_0 = 0$ and that w is symmetric around zero, while we now do not assume q ≥ 4. Then Assumption 2 entails a simplified expression for the asymptotic variance; if further $m_2 < \infty$, the Cauchy-Schwarz inequality readily gives a lower bound, attained for (any positive) constant w.

Comparisons with respect to the LSE
When $nh^2 \to 0$ and q in Assumption 1 can be taken large enough, we can deduce the asymptotic normality of the LSE $\tilde\theta_n$ (see Masuda [23] for details). Now suppose that w ≡ 1 and β = 2, so that, as for the LSE, the SLAD estimator is asymptotically normal at rate $\sqrt{nh}$. Then the asymptotic covariance matrix $\tilde V_0$ of the LSE, compared with (2.13), implies that the asymptotic relative efficiency of the SLAD estimator with respect to the LSE can be measured by the ratio of the two. From this viewpoint, the SLAD estimator turns out to be asymptotically superior to the LSE if the Gaussian variance is not too large compared with the jump-part variance; in other words, the LSE is asymptotically superior to the SLAD estimator if $\sigma^2$ is dominant in the sense that the corresponding inequality is reversed.
The SLAD estimation is formally new even in the Gaussian case. Suppose X is the OU diffusion $dX_t = -\lambda X_t\,dt + dw_t$, where w is a standard Wiener process. As is well known in the literature, or as can be seen by direct computation, the exact maximum likelihood estimator of $\lambda_0$ is asymptotically normal and efficient with asymptotic variance $2\lambda_0$. On the other hand, building on Section 2.2.1 we see that the unweighted SLAD estimator $\hat\lambda_n$ leads to the asymptotic variance $\pi\lambda_0$, hence the asymptotic efficiency of $\hat\lambda_n$ relative to the maximum likelihood estimator is 2/π; this asymptotic relative efficiency is the same as that of the sample median relative to the sample mean in estimating the mean of i.i.d. normal samples. Moreover, for the asymptotic normality of $\hat\lambda_n$ we do not need the rapidly increasing experimental design $nh^2 \to 0$, which is quite often inevitable when adopting a contrast function based on the naive "Euler-type approximation"; for the SLAD estimator, the weaker sampling design $nh^3 \to 0$ is sufficient in view of Corollary 2.4.
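The 2/π figure can be reproduced by a quick Monte Carlo comparing the sample median and the sample mean on i.i.d. normal data; the sample size and replication count below are arbitrary illustrative choices.

```python
import numpy as np

# Monte Carlo: variance of the sample mean vs. the sample median for i.i.d.
# N(0,1) samples; their ratio approximates the ARE 2/pi of the median.
rng = np.random.default_rng(11)
n, reps = 400, 4000
samples = rng.standard_normal((reps, n))
var_mean = samples.mean(axis=1).var()
var_median = np.median(samples, axis=1).var()
are = var_mean / var_median               # close to 2/pi, about 0.637
```

The same efficiency loss carries over to the unweighted SLAD estimator relative to the maximum likelihood estimator in the Gaussian OU model above.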
Recently, Hu and Long [14] derived an asymptotic distributional result concerning the (approximate) LSE of $\lambda_0 > 0$, presupposing that $\gamma_0 = 0$ and letting Z be a symmetric β-stable Lévy process with β ∈ (1, 2); in this setting, we can ignore Assumption 4.3 for our SLAD estimator $\hat\lambda_n$ from the very beginning. The LSE is given by

$$\tilde\lambda_n := -\frac{\sum_{i=1}^n X_{t_{i-1}} \Delta_i X}{h \sum_{i=1}^n X_{t_{i-1}}^2}. \qquad (2.14)$$

Let us make some comparisons between $\hat\lambda_n$ and $\tilde\lambda_n$. The primary point is the difference in the rates of convergence and the limit distributions: their $\tilde\lambda_n$ fulfils $(nh/\log n)^{1/\beta}(\tilde\lambda_n - \lambda_0) \to^d S/S_+$, where nh → ∞, and $S \in \mathbb{R}$ and $S_+ > 0$ are two independent strictly stable random variables with specified parameters. In order to obtain the explicit asymptotic distribution, Hu and Long [14] also imposed some technical conditions on the decreasing rate of h → 0 in connection with the value of β, although these are not necessary for (strong) consistency; see their (A1). We see that the SLAD estimator $\hat\lambda_n$ with appropriate w (tapering at infinity) converges more rapidly than $\tilde\lambda_n$ as soon as $\sqrt{n}h^{1-1/\beta}$ dominates $(nh/\log n)^{1/\beta}$; in particular, this is the case if $nh^2 \lesssim 1$. Moreover, when $h = n^{-a}$ for some constant a > 0, Hu and Long [14, Remark 3.3] mentioned that the choice $a = (1+\beta)^{-1}$ is optimal. This choice implies that $nh^2 \to \infty$, and then $\tilde\lambda_n$ converges more rapidly than $\hat\lambda_n$. Nevertheless, our $\hat\lambda_n$ would be more convenient to use because of its asymptotic normality. Moreover, in contrast to our $\hat\lambda_n$, convergence of moments seems impossible for $\tilde\lambda_n$.

On the sampling-design conditions in Theorem 2.3
In the statement of Theorem 2.3 we have several conditions on the decreasing rate of h. The conditions become more specific when $h = n^{-\tau}$ with the choice $l_n = \sqrt{nh^2}$, where τ ∈ (0, 1) so that (2.2) is fulfilled. In this case Assumption 3 is equivalent to $\beta/\{2(2\beta - 1)\} < \tau$, while the condition $\liminf_{n\to\infty} nh^{2-1/\beta} > 0$ is equivalent to $\tau \le \beta/(2\beta - 1)$. Thus, the admissible region of τ for the "interval estimation (2.12)" turns out to be $\beta/\{2(2\beta - 1)\} < \tau \le \beta/(2\beta - 1)$. For the convergence of moments, we additionally need β > 1 and also the condition $nh^{4(1-1/\beta)} \lesssim 1$ if both λ and γ are unknown (see Assumption 4); the latter condition, equivalent to $\tau \ge \beta/\{4(\beta - 1)\}$, forces β ≥ 3/2 in order to make the admissible region non-empty. Thus, the admissible region of τ for the "convergence of moments" is $\beta/\{4(\beta - 1)\} \le \tau \le \beta/(2\beta - 1)$ if both λ and γ are unknown and β ≥ 3/2.

Model extension
We have used some inherent properties of the OU process in our proofs; hence, unfortunately, it is not clear whether a similar type of contrast function actually works for more general nonlinear-drift stochastic differential equation models with jumps. Nevertheless, a more general statement is formally possible, so that we can provide a set of conditions for the convergence of moments, and so on, to hold true for a broader class of statistical experiments with dependent as well as independent data (not necessarily stochastic differential equations). This may be done by setting up the contrast function in terms of quantities $\bar w_{n,i-1}$ and $\bar Y_{n,i-1}$ that are $\mathcal{G}_{n,i-1}$-measurable, and $\bar X_{ni}$ that is $\mathcal{G}_{ni}$-measurable, with respect to some underlying filtration $(\mathcal{G}_{ni})_{i\le n}$; all of $\bar w_{n,i-1}$, $\bar Y_{n,i-1}$, and $\bar X_{ni}$ should be observable in order to follow a line of proof similar to that given in this paper. This setting might allow us to deal with, for instance, discretely observed Lévy processes, general i.i.d. regression models, and autoregressive time series in a unified way. Of course, the rate of convergence as well as the limit distribution will depend on the specific structure of the underlying statistical model.

Numerical experiments
In this section, we report some numerical results concerning the finite-sample performance of our SLAD estimator $\hat\lambda_n$ in the model $dX_t = -\lambda X_t\,dt + dZ_t$. We treat the following cases for the Lévy process Z and the weight function w. Here NIG(1, 0, 1, 0) stands for the normal inverse Gaussian (NIG) distribution having the density $x \mapsto (e/\pi) K_1(\sqrt{1+x^2})/\sqrt{1+x^2}$; see Barndorff-Nielsen [2] for details on the general NIG distribution. Also, $S_\beta(a)$ for a > 0 denotes the symmetric β-stable distribution having the characteristic function $u \mapsto \exp(-|au|^\beta)$. In case (B), we can simulate $(X_{t_i})_{i=1}^n$ exactly, since, conditionally on $X_{t_{i-1}}$, $X_{t_i}$ is distributed as $e^{-\lambda h}X_{t_{i-1}}$ plus an $S_\beta\bigl(a\{(1 - e^{-\beta\lambda h})/(\beta\lambda)\}^{1/\beta}\bigr)$-distributed innovation, which immediately follows from the scaling property of stable integrals.
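The exact simulation for case (B) can be sketched as follows, assuming the conditional law stated above; the Chambers-Mallows-Stuck sampler for the symmetric stable law and all function names are our illustrative choices.

```python
import numpy as np

def sym_stable(beta, scale, size, rng):
    # Chambers-Mallows-Stuck sampler for the symmetric stable law S_beta(scale),
    # i.e. characteristic function u -> exp(-|scale * u|^beta), beta in (0,2).
    u = rng.uniform(-np.pi / 2.0, np.pi / 2.0, size)
    w = rng.exponential(1.0, size)
    x = (np.sin(beta * u) / np.cos(u) ** (1.0 / beta)
         * (np.cos((1.0 - beta) * u) / w) ** ((1.0 - beta) / beta))
    return scale * x

def stable_ou_skeleton(lam, beta, a, h, n, x0, rng):
    # Exact sampling of (X_{t_i}) for dX = -lam * X dt + dZ, Z symmetric
    # beta-stable with scale a: conditionally on X_{t_{i-1}},
    #   X_{t_i} = e^{-lam h} X_{t_{i-1}} + eps_i,
    #   eps_i ~ S_beta( a * ((1 - e^{-beta lam h}) / (beta lam))^{1/beta} ),
    # by the scaling property of stable integrals.
    scale = a * ((1.0 - np.exp(-beta * lam * h)) / (beta * lam)) ** (1.0 / beta)
    eps = sym_stable(beta, scale, n, rng)
    decay = np.exp(-lam * h)
    x = np.empty(n + 1)
    x[0] = x0
    for i in range(n):
        x[i + 1] = decay * x[i] + eps[i]
    return x

rng = np.random.default_rng(5)
path = stable_ou_skeleton(lam=1.0, beta=1.5, a=1.0, h=0.1, n=10000, x0=0.0, rng=rng)
```

Because the skeleton is sampled from the exact transition, no discretization bias enters the simulated data in case (B).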
In both cases, we also examine the performance of the LSE $\tilde\lambda_n$ defined by (2.14). Recall that, in case (A), the distribution of $\sqrt{nh}(\tilde\lambda_n - \lambda_0)$ tends to a centered normal, while in case (B), the distribution of $(nh/\log n)^{1/\beta}(\tilde\lambda_n - \lambda_0)$ tends to a nondegenerate limit; see the references cited in Section 2.2.2 for details.
Throughout this section, we set $h = n^{-3/5}$, the true value $\lambda_0 = 1$, and $X_0 = 0$. Also, we use the logistic kernel $K(x) = (e^{x/2} + e^{-x/2})^{-2}$ for both (A) and (B). In each trial, we use the LSE $\tilde\lambda_n$ as the initial value in the numerical minimization of the SLAD contrast function.

Case (A)
To generate the sample $(X_{t_i})_{i=1}^n$, we apply the Euler scheme with generation mesh h/50. It is easy to see that all the conditions imposed in Theorems 2.1 and 2.3 are fulfilled. With the choice w ≡ 1, the weak convergence (2.12) specializes accordingly. For n = 500, 1000, and 2000, we simulate 1000 independent estimates $\hat\lambda_n$ and $\tilde\lambda_n$, and compute their sample means, standard deviations (S.D.), maxima, and minima. The results are reported in Table 1. We can observe from Table 1 that our (unweighted) LAD estimate $\hat\lambda_n$ is much more reliable than the LSE $\tilde\lambda_n$. We note that the S.D. of the LAD estimates for n = 500 is smaller even than that of the LSE for n = 2000. Next, we report in Figure 1 the normal probability plots for the Studentized versions of the $\hat\lambda_n$ obtained above. In each panel, we can see that normality is well achieved: the 45-degree lines correspond to the target standard normal distribution.
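The Euler scheme of case (A) can be sketched as follows. The increments of the standard NIG Lévy process are generated through the normal inverse-Gaussian variance mixture, under the assumption that the unit-time law is NIG(1, 0, 1, 0) so that an increment over a mesh dt follows NIG(1, 0, dt, 0); for speed we use a refinement factor of 10 rather than the 50 used in the text.

```python
import numpy as np

def nig_increments(dt, size, rng):
    # Symmetric NIG(1, 0, dt, 0) increments of a standard NIG Levy process,
    # via the normal variance mixture Z = sqrt(V) * N, with V inverse Gaussian
    # of mean dt and shape dt^2; numpy's "wald" is the inverse Gaussian law.
    v = rng.wald(dt, dt * dt, size)
    return np.sqrt(v) * rng.standard_normal(size)

def euler_nig_ou(lam, h, n, refine, x0, rng):
    # Euler scheme for dX = -lam * X dt + dZ on the fine mesh h / refine,
    # then subsampling to the skeleton (X_{t_i}), t_i = i * h.
    dt = h / refine
    dz = nig_increments(dt, n * refine, rng)
    xf = np.empty(n * refine + 1)
    xf[0] = x0
    for k in range(n * refine):
        xf[k + 1] = xf[k] - lam * xf[k] * dt + dz[k]
    return xf[::refine]

rng = np.random.default_rng(8)
x = euler_nig_ou(lam=1.0, h=0.1, n=2000, refine=10, x0=0.0, rng=rng)
```

Since NIG(1, 0, 1, 0) has unit variance, the stationary variance of X should be near $\mathrm{Var}(Z_1)/(2\lambda) = 0.5$, which provides a quick sanity check on the simulated path.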

Case (B)
We take β = 1.5 in the simulations. Then, with the exponential weight $w(x) = \exp(-|x|)$, all the conditions imposed in Theorems 2.1, 2.2, and 2.3 are fulfilled. It follows directly from (2.12) that the corresponding Studentized statistic asymptotically obeys the standard normal distribution. As before, we report the numerical results in Table 2. Again, the SLAD estimator exhibits better performance than the LSE, although the superiority seems less drastic than in case (A) (see Table 1). Figure 2 reports the corresponding normal probability plots for the Studentized SLAD estimators. Once again, the standard normal approximation works quite well. This reveals the usefulness of the SLAD estimator for estimation of infinite-variance OU processes.

Proofs
We denote by (Ω, F, P) the underlying probability space on which $(X_0, Z)$ is defined, and by E the corresponding expectation operator. Recall that $P_0$ stands for the true distribution of X associated with $\theta_0$. Throughout this section, we use the following basic notation: C denotes a positive generic constant, possibly varying from line to line; $A^\top$ denotes the transpose of a matrix A; $A^{\otimes 2} := AA^\top$ for any matrix A; $w_{i-1} := w(X_{t_{i-1}})$; and $P_{i-1}$ refers to conditioning on $\mathcal{F}_{t_{i-1}}$.

Proof of Theorem 2.1
Under $P_0$ we have the autoregressive representation

$$X_{t_i} = e^{-\lambda_0 h} X_{t_{i-1}} + \frac{\gamma_0}{\lambda_0}\bigl(1 - e^{-\lambda_0 h}\bigr) + \epsilon_{ni}, \qquad \epsilon_{ni} := \int_{t_{i-1}}^{t_i} e^{-\lambda_0(t_i - s)}\,dZ_s,$$

for $n \in \mathbb{N}$ and i ≤ n. For convenience, it follows from the definition (2.6) that, for each θ ∈ Θ, the contrast can be expressed through $\epsilon_{ni}$ and an $\mathcal{F}_{t_{i-1}}$-measurable quantity $\epsilon'_{n,i-1}$. In our proof, it is crucial that $\epsilon'_{n,i-1}$ is $\mathcal{F}_{t_{i-1}}$-measurable.
Let $U_n(\theta_0) := \{u \in \mathbb{R}^2 : \theta_0 + a_n u \in \Theta\}$ with $a_n = a_n(\beta) := (\sqrt{n}h^{1-1/\beta})^{-1}$. Then we define the random fields $Z_n(\cdot\,; \theta_0)$ on $U_n(\theta_0)$, whose maximizer equals $\hat u_n := a_n^{-1}(\hat\theta_n - \theta_0)$. To achieve the proof, we are going to derive the following asymptotically locally quadratic structure of $\log Z_n(u; \theta_0)$ for each $u \in U_n(\theta_0)$:

$$\log Z_n(u; \theta_0) = u^\top \Delta_n - \tfrac{1}{2} u^\top \Gamma_n u + o_p(1), \qquad (4.5)$$

where $\Delta_n \to^d N(0, \Sigma_0)$ and $\Gamma_n \to^p \Gamma_0$ for positive definite nonrandom matrices $\Sigma_0$ and $\Gamma_0$ given by (2.8) and (2.9), respectively. Then, in view of the convexity of $u \mapsto -\log Z_n(u; \theta_0)$, Theorem 2.1 follows on applying the following optimization result for convex random functions.

Proposition 4.1. Let $A_n$ be real-valued convex random functions defined on a convex domain $S \subset \mathbb{R}^p$, and suppose that $A_n$ can be represented as $A_n(s) = s^\top U_n + s^\top V_n s/2 + r_n(s)$, where $U_n$ weakly tends to a random variable $U \in \mathbb{R}^p$, $V_n$ tends in probability to a positive definite matrix $V \in \mathbb{R}^p \otimes \mathbb{R}^p$, and $r_n(s)$ tends in probability to 0 for each $s \in S$. Then the minimizer $\alpha_n$ of $s \mapsto A_n(s)$ weakly tends to $-V^{-1}U$.

Proposition 4.1 is a direct corollary of Hjort and Pollard [13, Basic Corollary].
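The content of Proposition 4.1 can be checked numerically on a toy example; the ingredients U, V, and the vanishing remainder below are fabricated purely for illustration:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy ingredients: U_n -> U, V_n -> V (positive definite), r_n(s) -> 0
U = rng.standard_normal(2)
V = np.array([[2.0, 0.3], [0.3, 1.0]])

def A_n(s, n):
    # remainder |s|^2 / n keeps A_n convex and vanishes pointwise as n grows
    r_n = np.sum(s ** 2) / n
    return s @ U + 0.5 * s @ V @ s + r_n

for n in (10, 100, 10000):
    argmin = minimize(lambda s: A_n(s, n), x0=np.zeros(2)).x
    print(n, argmin)

print("-V^{-1} U =", -np.linalg.solve(V, U))
```

As n grows, the minimizer of A_n(·, n) approaches −V⁻¹U, in line with the proposition.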
In order to deduce (4.5), first we rewrite log Z n (u; θ 0 ). Following Knight [17], for any function K of the form K(x) = ∫_0^x k(y) dy, taking k(y) = I(y ≥ 0) − I(y ≤ 0) so that K(x) = |x|, we make use of the following identity, valid for any x ≠ 0 and y ∈ R: From Lemma 4.4 below, we have P [ǫ ni ≠ 0] = 1 for each n ∈ N and i ≤ n. Combining (4.3), (4.4), and (4.6) yields log Z n (u; θ 0 ) = L n (u) + Q n (u), P 0 -a.s., where Write L n (u) = u ⊤ ∑ n i=1 l ni and Q n (u) = ∑ n i=1 q ni (u). Each of L n (u) and Q n (u) consists of a leading term plus remainder terms. We look at them separately.
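Knight's identity underlying this rewriting can be verified directly. The sketch below checks the version |x − y| − |x| = −y sgn(x) + 2 ∫_0^y {I(x ≤ s) − I(x ≤ 0)} ds for x ≠ 0 (stated here as in Knight [17]; the exact form used in the text depends on the chosen k), with the integral approximated by a midpoint rule:

```python
import numpy as np

def sgn(x):
    return 1.0 if x > 0 else -1.0        # the identity assumes x != 0

def knight_rhs(x, y, m=200000):
    # -y*sgn(x) + 2 * int_0^y {I(x <= s) - I(x <= 0)} ds, via midpoint rule
    ds = y / m                           # signed step, so y < 0 is handled too
    s = (np.arange(m) + 0.5) * ds
    integral = np.sum((s >= x).astype(float) - float(x <= 0.0)) * ds
    return -y * sgn(x) + 2.0 * integral

rng = np.random.default_rng(0)
max_err = 0.0
for _ in range(100):
    x, y = rng.uniform(-3, 3, size=2)    # x = 0 has probability zero
    lhs = abs(x - y) - abs(x)
    max_err = max(max_err, abs(lhs - knight_rhs(x, y)))
print("max error:", max_err)
```

The discretization error of the indicator integral is of order |y|/m, so the identity holds up to a tolerance far below the 1e-3 used in the check.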
Asymptotic behavior of L n (u). We decompose L n (u) as Denote by p h the density of ǫ ni , which is independent of i (see Lemma 4.4). Since p h is symmetric around 0 and bounded, we have Now we note that the mixing property of X under P 0 leads to the ergodic theorem, namely, (nh)^{-1} ∫_0^{nh} F(X_s) ds → p ∫ F(x) π 0 (dx) for every π 0 -integrable function F ; see Bhattacharya [4] for details. This combined with Lemma 4.6 yields that Also, under the assumptions it is easy to see that, for a ∈ (0, 2], From (4.9) and (4.10) we can apply the martingale central limit theorem (cf. Dvoretzky [11]) to obtain ∆ n → d N (0, Σ 0 ). As for R 1 n (u), where we used 1 − 2F h (0) = 0, which follows from the symmetry of p h , for the second equality. Thus L n (u) = u ⊤ ∆ n + o p (1) with ∆ n → d N (0, Σ 0 ) for each u, as desired.
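The ergodic theorem invoked above can be illustrated in the simplest Gaussian case, where the stationary distribution is explicit; the parameter values, the exact AR(1) transition, and the choice F(x) = x are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy Gaussian OU dX_t = (gamma - lambda*X_t) dt + dW_t; stationary mean gamma/lambda
lam, gam = 1.0, 2.0
n, h = 200000, 0.01                      # so that h is small and nh = 2000 is large
a = np.exp(-lam * h)                     # exact one-step autoregression coefficient
m = (gam / lam) * (1.0 - a)              # exact one-step conditional mean shift
s = np.sqrt((1.0 - a ** 2) / (2.0 * lam))  # exact one-step conditional std
eps = rng.standard_normal(n)
X = np.empty(n + 1)
X[0] = 0.0
for i in range(n):
    X[i + 1] = a * X[i] + m + s * eps[i]

time_avg = X[:-1].mean()   # Riemann-sum proxy for (nh)^{-1} int_0^{nh} X_s ds
print(time_avg, gam / lam)
```

The time average settles near the stationary mean γ/λ, as the ergodic theorem predicts.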
Asymptotic behavior of Q n (u). Again we separate the martingale term: For the first term of the right-hand side, Taylor's formula gives We have Γ n → p Γ 0 by means of Lemmas 4.4 and 4.6. To deal with R 2 n (u), we note the following: first, |∂p h (z)| = |∂p h (z) − ∂p h (0)| ≲ |z|, which follows from the first half of the proof of Lemma 4.4; second, |∫_0^x g(y) dy| ≤ ∫_0^{|x|} {|g(y)| ∨ |g(−y)|} dy for any x ∈ R and g : R → R. Using these facts, we derive the following estimates: It remains to show that the martingale part is o p (1) for each u. This readily follows on applying Burkholder's and Schwarz's inequalities: Summarizing the above yields that Q n (u) = −u ⊤ Γ n u/2 + o p (1) for each u, with Γ n → p Γ 0 .
Combining the two steps leads to (4.5), hence the claim of Theorem 2.1.

Proof of Theorem 2.2
We keep using the notation introduced in the proof of Theorem 2.1. As we have already derived the asymptotic normality û n → d N 2 (0, V 0 ), it suffices to ensure the L p (P 0 )-boundedness of (û n ) n∈N for every p > 0. Suppose that, given any L > 0, there exists a constant C L > 0 such that for every r > 0 large enough Then the desired L p (P 0 )-boundedness follows: for every r > 0 large enough, We are going to derive the polynomial type large deviation inequality (4.16) by applying Yoshida [34].
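The passage from the tail bound to moment boundedness is the standard chaining; writing (4.16) in the form sup n P 0 [|û n | ≥ r] ≤ C L r^{−L} for r ≥ r 0 , a sketch of the computation reads, for any p > 0 and any L > p,

```latex
\mathrm{E}_0\big[|\hat{u}_n|^p\big]
 = p\int_0^\infty r^{p-1}\,P_0\big[|\hat{u}_n|\ge r\big]\,dr
 \le p\int_0^{r_0} r^{p-1}\,dr + p\,C_L\int_{r_0}^\infty r^{p-1-L}\,dr
 = r_0^{\,p} + \frac{p\,C_L\,r_0^{\,p-L}}{L-p} < \infty,
```

the resulting bound being uniform in n since C L does not depend on n.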
From the proof of Theorem 2.1, we know that log Z n (u; θ 0 ) = u ⊤ ∆ n − u ⊤ Γ n u/2 + ∑ 3 j=1 R j n (u); see (4.7), (4.12) and (4.14). Rearrange this as where The martingale part ∆ n is L M (P 0 )-bounded for any M > 0: Since Γ 0 is positive definite, we can find a constant χ > 0 such that for each where α ∈ (0, 1) is a constant. Having (4.18) and (4.20) in hand, in order to ensure (4.16) it remains to prove the following lemmas (see Yoshida [34] for details).

Lemma 4.2.
For any L > 0, we can find (sufficiently small) constants α ∈ (0, 1) and ρ 1 ∈ (0, 1) in such a way that there exists a constant C L > 0 such that for every r > 0 For convenience we write in what follows. Then, sup n∈N E 0 [|I n (k)| M ] < ∞ for every k ∈ N and M > 0.

Proof of Lemma 4.2
Fix any L > 0 and let We are going to show H j (r) ≤ C L /r L individually. In the sequel, K j ≥ 2 denote arbitrarily large reals, and the constant C L may vary from step to step.
We begin with (4.35). Put δ ni = −ǫ ′ n,i−1 + û ⊤ n x i−1 / √ n. Using the expression (4.37) we see the following: first, for some d > 0 by virtue of Lemma 4.4; second, again by using Lemma 4.4, where we used (4.8) in proceeding from the second line to the third one. Thus we get (4.35).
Next we turn to (4.36). Set then (4.36) follows on proving |V n (û n )| → p 0, where V n (u) := ∑ n i=1 χ ni (u). For each u, the sequence {χ ni (u)} i≤n forms an (F ti )-martingale difference array with the associated quadratic characteristic tending to 0 in probability: through the change of variable as in (4.37), we see that This readily implies that V n (u) → p 0 for each u. Now fix any ǫ > 0 and ǫ ′ > 0. Then we can find an A > 0 for which sup n∈N P 0 [|û n | > A] < ǫ. Since it remains to show that sup |u|≤A |V n (u)| → p 0. To this end we apply Lemma 4.8. By means of Burkholder's inequality, It follows from the Lipschitz continuity of K that W ′ ni (u, u ′ ) ≲ n −p/2 B n −2p |u − u ′ | p . Also, Lemma 4.4 yields that W ′′ ni (u, u ′ ) ≲ n −p/2 |u − u ′ | p . Substituting these estimates into (4.38) yields that In a similar way, we can deduce Lemma 4.8 now yields that sup |u|≤A |V n (u)| → p 0. We thus get the desired convergence φ̂ β (0) n → p φ β (0).
From Theorem 2.1 and Slutsky's lemma, Now we turn to (b). We have to deal with the cases σ 2 > 0 and σ 2 = 0 separately. In both cases we utilize the basic estimate which is valid for any (0, ∞)-valued characteristic functions H 0 , H 10 , and H 11 . First we consider the case where σ 2 > 0. It follows from the Lévy-Itô decomposition that we may write Z t = σw t + J t , where w is a standard Wiener process and J is a pure-jump Lévy process with Blumenthal-Getoor index β ′ ∈ [0, 2). For convenience, we write Then, recalling the expression (4.40), we apply (4.41) with to obtain the following estimates through the Fourier inversion formula: The estimate of |A h (u)| may change according to the structure of J.
Step 1. First we prove (4.47) for l = 1. It suffices to show that δ ′ h (k) ∨ δ ′′ h (k) → 0, where δ Note that the growth rate of g k,1 is connected with how large q can be taken: g 1,1 and g 2,1 can be unbounded if, e.g., q is set to be large and w is a positive constant.
It remains to consider the case where k ∈ {1, 2} and q ∈ (0, 2]. In this case, however, Assumption 2.2 implies the boundedness and uniform continuity of g 1,1 and g 2,1 , so that we can follow the same line of argument as in the case k = 0. This completes the proof of (4.47) for l = 1.
Remark 4.7. It is a simple matter to show sup t,s:|t−s|≤h E 0 [|k(X t ) − k(X s )| r ] ≲ √ h for each r ≥ 2 if q can be taken large enough and k is a C 1 -function with derivative of at most polynomial growth. This trivially leads to the ergodic theorem for the discrete-time samples: n −1 ∑ n i=1 k * (X ti−1 ) → p π 0 (k * ) for smooth k * . (Of course, this remains valid for a much more general class of diffusions with jumps.) On the other hand, Lemma 4.6 enables us to deal with small q (i.e., heavy-tailed cases) too, without imposing global differentiability of w. A practical candidate for w when suspecting a heavy-tailed nature in the data would be the standard Gaussian density; however, it is also possible to take, e.g., the Laplace-type weight w(x) = exp(−|x|).
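The practical point of the remark, that the Laplace-type weight caps the influence of extreme states, can be seen in a short computation; the stable index and sample size are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(3)

# Heavy-tailed sample: symmetric 1.2-stable draws (Chambers-Mallows-Stuck method)
alpha, N = 1.2, 10000
U = rng.uniform(-np.pi / 2, np.pi / 2, N)
W = rng.exponential(1.0, N)
x = (np.sin(alpha * U) / np.cos(U) ** (1 / alpha)
     * (np.cos((1 - alpha) * U) / W) ** ((1 - alpha) / alpha))

w = np.exp(-np.abs(x))                          # the Laplace-type weight
print("max |x|     :", np.max(np.abs(x)))       # unweighted scores can explode
print("max w(x)|x| :", np.max(w * np.abs(x)))   # capped by sup_t t*exp(-t) = 1/e
```

However heavy the tails of the state, the weighted score w(x)|x| is uniformly bounded by 1/e, which is exactly the self-weighting mechanism at work.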
The following lemma is used to deduce uniform estimates of some martingale terms. See Kunita [19, Theorem 1.4.7] for details. Then the family {Ξ n (·)} n is tight with respect to the supremum norm over H, and moreover, for any compact convex set K ⊂ H − we have