Estimation in a class of nonlinear heteroscedastic time series models

Parameter estimation in a class of heteroscedastic time series models is investigated. The existence of conditional least-squares and conditional likelihood estimators is proved. Their consistency and asymptotic normality are established. Kernel estimators of the noise's density and its derivatives are defined and shown to be uniformly consistent. A simulation experiment shows that the estimators perform well for large sample sizes.


Introduction
Although parametric models are prone to misspecification, they remain attractive because they describe concisely the link between past observations and the predicted variable. Various parametric models have been proposed over recent decades; for a review, see Brockwell and Davis (1991), Brockwell and Davis (1996), Shumway and Stoffer (2001), and Tong (1990). Parameter estimation for linear models has been widely studied, while for nonlinear models, because of their complexity, the study is in general restricted to tractable cases. There is an increasing interest in estimating the parameters of ARCH and GARCH models, introduced respectively by Engle (1982) and Bollerslev (1986). Most of the existing literature assumes a Gaussian error distribution and studies the consistency and asymptotic normality of the conditional Gaussian likelihood estimators. Some relevant papers are Engle (1982) and Weiss (1986) for ARCH models, and, for ARCH(∞) and/or GARCH models, Bollerslev (1986), Lumsdaine (1996), Hyndman and Yao (2002), Francq and Zakoïan (2004), Robinson and Zaffaroni (2006), Straumann and Mikosch (2006), and Francq and Zakoïan (2007). Other papers dealing with parameter estimation in heteroscedastic models include the works of Giraitis and Robinson (2001), who propose a Whittle estimation for a class of parametric ARCH(∞) models; Chatterjee and Das (2003), who study estimators obtained by minimizing certain functionals for ARCH models; Peng and Yao (2003), who propose least absolute deviations estimators for ARCH and GARCH models; and Berkes and Horváth (2004), who study likelihood estimators for GARCH models.
In the present paper, we study parameter estimation for more general heteroscedastic models. Precisely, we consider the class of identifiable parametric stochastic models (1.1), where (X_i)_{i∈Z} is stationary and ergodic; (Z_i = (X_i, . . . , X_{i−q+1}))_{i∈Z} is a sequence of q-dimensional vectors, with q a nonnegative, possibly infinite, integer; (ε_i)_{i∈Z} is a sequence of iid centered random variables with unit variance such that ε_i is independent of σ(Z_j, j < i); the parameter column vector ψ = (ρ′, θ′)′ belongs to Ψ = Θ × Θ̃ ⊂ R^I × R^J, for some positive integers I and J; and the functions m(ρ; z) and σ(θ; z) have known forms. We aim to prove the existence of asymptotically normal estimators for the true parameter vector ψ_0 = (ρ′_0, θ′_0)′, and of uniformly consistent estimators for the noise's density and its derivatives, when this function exists.
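As an illustration of the class (1.1), the following minimal sketch (hypothetical Python, not part of the original text; the parameter values are arbitrary) simulates the AR(1)-ARCH(1) instance m(ρ; z) = ρz, σ(θ; z) = (θ_0 + θ_1 z²)^{1/2}:

```python
import numpy as np

def simulate(n, rho=0.5, theta0=0.4, theta1=0.1, burn=200, seed=0):
    """Simulate X_i = m(rho; X_{i-1}) + sigma(theta; X_{i-1}) * eps_i with
    m(rho; z) = rho*z and sigma(theta; z) = sqrt(theta0 + theta1*z^2),
    an AR(1)-ARCH(1) instance of the class (1.1)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n + burn)   # iid, centered, unit variance
    x = np.zeros(n + burn)
    for i in range(1, n + burn):
        x[i] = rho * x[i - 1] + np.sqrt(theta0 + theta1 * x[i - 1] ** 2) * eps[i]
    return x[burn:]                       # drop the burn-in to approach stationarity
```

The burn-in is a standard device: the chain is started at an arbitrary point and the first iterations are discarded so that the retained path is close to the stationary regime.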
The class of models (1.1) contains ARMA, EXPAR, ARCH, GARCH, SETAR-ARCH, β-ARCH and many other models. As far as the probabilistic properties of these models are concerned, their invertibility is readily obtained, for example, for |σ(θ; z)| > 0. For some of them (see, e.g., Ngatchou-Wandji (2005)), a sufficient condition for strict stationarity can be obtained, e.g., by checking conditions (S1)-(S4) of p. 86 in Taniguchi and Kakizawa (2000). The case of GARCH models, which generalize ARCH models, has been studied by Chen and An (1998). Next, a sufficient condition for geometric ergodicity can be obtained by applying a result of Tjøstheim (1990), while for a particular class of ARCH models nested in (1.1), this property has been investigated by An, Chen and Huang (1997). Finally, from the theory of Markov chains, other interesting conditions for stationarity and ergodicity may be obtained for many models within (1.1).
Under mild conditions, a conditional least-squares estimator of ρ_0 is defined in McKeague and Zhang (1994), and its consistency and asymptotic normality are established. The same is done for θ_0 in Ngatchou-Wandji (2002). Such results have also been established for multivariate nonlinear AR models by Tjøstheim (1986). Our main contribution is the study of the estimation of the pair of parameters ψ_0 = (ρ′_0, θ′_0)′ in model (1.1) by conditional least-squares and conditional maximum likelihood methods, when the conditional distribution is not necessarily normal and q is possibly infinite. Our results generalize most of those based on least-squares and pseudo- or quasi-likelihood estimation.
After the assumptions given in Section 2, we prove in Section 3 the existence of a sequence of asymptotically normal conditional least-squares estimators for ψ_0. Section 4 deals with the existence of conditional likelihood estimators for this parameter. In Section 5, we give some common examples contained in (1.1). In Section 6, the estimation of the noise's density and its derivatives is investigated. A simulation study presented in Section 7 concludes the paper.

General assumptions
In the whole text, the transpose of a vector or a matrix function H(x) is denoted by H′(x). Let r be either I or J, and consider given real functions F(α; z) defined on a non-empty subset of R^r × R^q and K(ψ; z) defined on a non-empty subset of R^{I+J} × R^q. For a vector or matrix function H(x), we denote by ∂′H(x) the transpose of ∂H(x). With this, we define ∂K(ψ; z) = (∂′_ρ K(ψ; z); ∂′_θ K(ψ; z))′.
For a real-valued function h, h^{(p)} denotes its pth order derivative, with h^{(0)} = h. We denote by ||V||_E the Euclidean norm of the vector V and by ||M||_M = max_{i,j} |M_{ij}| the norm of the square matrix M = (M_{ij}). We next assume that the true parameter vector ψ_0 = (ρ′_0, θ′_0)′ of (1.1) is such that ρ_0 ∈ int(Θ) and θ_0 ∈ int(Θ̃), where int(Θ) and int(Θ̃) denote respectively the nonempty interiors of Θ and Θ̃. We also suppose that all the random variables in this paper are defined on the same probability space (Ω, W, P), where Ω is a set, W a σ-field on Ω and P a probability measure on W. The following assumptions are needed:
(A_1) The common fourth order moment of the ε_i's is finite.
(A_2) The functions m(ρ; z) and σ(θ; z) are each twice continuously differentiable with respect to ρ ∈ int(Θ) and θ ∈ int(Θ̃) respectively, and there exists a positive function α(z) such that E[α^4(Z_0)] < ∞ and
(A_3) There exists a positive function β(z) such that E[β^4(Z_0)] < ∞ and, for all ρ_1, ρ_2 ∈ Θ and θ_1, θ_2 ∈ Θ̃,
Assumption (A_1) is at least satisfied by Gaussian and Student ε_i's. One can find in the literature a number of models with the functions m(ρ; z) and σ(θ; z) satisfying (A_2) and (A_3) (see, e.g., Ngatchou-Wandji (2005)).
Proof. It suffices to check the hypotheses of Theorem 3.2.23 of Taniguchi and Kakizawa (2000), established by Klimko and Nelson (1978) using Egorov's Theorem (see, e.g., Taniguchi and Kakizawa (2000), p. 97). From simple computations one obtains: By ergodicity, it is immediate that, as n tends to infinity, For any vector ρ^* ∈ int(Θ), define the sequence of random matrix functions and denote by V_n(ρ^*; X_n)_{ℓk} its (ℓ, k)th entry. Then, in view of (A_1)-(A_3), it is easy to see that there exists a positive real-valued function ν_{ℓk}(z) such that, for δ > 0 with ||ρ − ρ_0||_E < δ, and for ρ^* lying between ρ and ρ_0, we have by the above inequality that: Next, by ergodicity, the right-hand side of the last inequality converges a.s. to E[ν_{ℓk}(Z_0)] < ∞ as n tends to infinity. It is then clear that for any (ℓ, k), (3.3) holds with V_n(ρ^*; X_n)_{ℓk}. From Theorem 3.2.23 of Taniguchi and Kakizawa (2000), it follows that there exists a sequence of estimators ρ̂_n such that ρ̂_n −→ ρ_0 almost surely as n → ∞, and for ǫ > 0, one can find an event E_1 with P(E_1) > 1 − ǫ and a nonnegative integer ñ such that on E_1, for n > ñ, ∂U_n(ρ̂_n; X_n) = 0 and U_n(ρ; X_n) attains a relative minimum at ρ = ρ̂_n. The first part of (i) is then handled. For the second part, for fixed ρ̂_n, we have from simple computations: In view of (A_1)-(A_3), applying the mean value theorem to (3.4) and (3.5), it is clear by ergodicity that, as n tends to infinity, For any vector θ^* ∈ int(Θ̃), define the sequence of random functions and denote by T_n(θ^*; X_n)_{ℓk} its (ℓ, k)th entry.

In view of (A_1)-(A_3), it is easy to see that there exists a positive real-valued function ̺_{ℓk}(z) such that, again for δ > 0 such that ||θ − θ_0||_E < δ, and for θ^* lying between θ and θ_0, we have from the above that: It is easy to see that, by the ergodic theorem, the right-hand side of the last inequality converges almost surely to E[̺_{ℓk}(Z_0)] < ∞ as n tends to infinity. It is then clear that for any (ℓ, k), (3.3) holds with T_n(θ^*; X_n)_{ℓk}. Whence, applying Theorem 3.2.23 of Taniguchi and Kakizawa (2000), one can find a sequence of estimators θ̂_n such that θ̂_n −→ θ_0 almost surely as n → ∞, and for ǫ > 0, one can find an event E_2 with P(E_2) > 1 − ǫ and a nonnegative integer n̄ such that on E_1 ∩ E_2, for n > n̄, ∂_θ S_n(ψ̂_n; X_n) = 0 and S_n((ρ̂_n, θ); X_n) attains a relative minimum at θ = θ̂_n. It is an easy matter to see that for all ǫ > 0, Thus, taking S_1 = E_1 ∩ E_2 and n_1 = max(ñ, n̄) yields the first part of Theorem 3.1.

To handle the second point, we observe that, and by a Taylor expansion of order one of the function ∂U_n(ρ; X_n) around ρ_0, for larger values of n one can write One can also observe that and write, for larger values of n, Then, putting in Theorem 1 of Ngatchou-Wandji (2005):

Corollary 3.1. Assume that the assumptions of Theorem 3.1 hold and that E[ε_0(ε_0² − 1)] = 0. Then ρ̂_n and θ̂_n are asymptotically uncorrelated.

Remark 3.1. The condition E[ε_0(ε_0² − 1)] = E(ε_0³) = 0 in the above corollary holds for symmetric densities. If it does not hold, ρ̂_n and θ̂_n will not be independent in general. This fact is ignored when the estimation is done, for example, with Gaussian or Student ε_i's.
Remark 3.2. One could also prove the existence of consistent conditional-type estimators for ψ_0 by using directly the random function S_n(ψ; X_n). It is clear that this would provide a one-step estimator. However, one may not retrieve the classical least-squares estimators. For example, in the very simple case of m(ρ; z) = ρz and σ(θ; z) = 1, it is very difficult to obtain a simple expression for ρ̂_n by minimizing S_n((ρ, 1); X_n), whereas the preceding two-step method yields the traditional least-squares estimator of ρ_0. Another approach (see, e.g., Heyde (1997)) consists in minimizing the sum of squares

Remark 3.3. The choice of the functions γ(z) and λ(z) is an open problem and we will not try to tackle it here. In the simulations, they are taken constant. Although this choice may be sub-optimal, it matches what is done in the literature.
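In the simple case m(ρ; z) = ρz, σ(θ; z) = 1 of Remark 3.2, the two-step method reduces to the classical least-squares estimator, which admits a closed form. A hypothetical Python sketch (not code from the paper):

```python
import numpy as np

def cls_ar1(x):
    """Conditional least-squares estimator for m(rho; z) = rho*z, sigma = 1.
    Minimizing U_n(rho) = sum_i (X_i - rho*X_{i-1})^2 gives the closed form
    rho_hat = sum_i X_i X_{i-1} / sum_i X_{i-1}^2."""
    x = np.asarray(x, dtype=float)
    return np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
```

By contrast, minimizing S_n((ρ, 1); X_n) directly would require a numerical optimiser, which is the point of the remark.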

Conditional likelihood estimation
For models such as (1.1), the density function of the noise can be useful for writing the likelihood and/or conditional likelihood functions. In practice, to choose this density function, (1.1) can first be fitted by least-squares methods. Next, various tests can be applied to the residuals from the fitted model to help postulate an adequate density function f (not necessarily Gaussian) for the noise. However, because it facilitates parameter estimation, the pseudo-likelihood estimation method is very popular in practice. This probably explains the huge literature on the subject (see the references given in Section 1).
In this section, we study the conditional likelihood estimation of the parameters when the noise has a not necessarily Gaussian density function f. This work has been done by Berkes and Horváth (2004) in the case of GARCH models. For simplicity, we restrict our study to models (1.1) for which the function σ(θ; z) satisfies: Under (B_0), the log-likelihood of a given sample X_n = (X_n, . . . , X_1, X_0) is L_n(ψ; X_n), where we recall that ψ = (ρ′, θ′)′ ∈ Ψ and, for all i ∈ Z, ε_i(ψ) = [X_i − m(ρ; Z_{i−1})]/σ(θ; Z_{i−1}). For the derivation of the results of this section, we make the following assumptions on the density function f: Next, for all i ∈ Z, we define on R^I × R^J the following random functions: We also need the following additional requirements:
(B_3) There exists a positive function υ(z) such that E[υ^4(Z_0)] < ∞ and, for all i ∈ Z and ψ_1, ψ_2 ∈ Ψ, a.s.,
Such assumptions have been made in Ngatchou-Wandji (2005). They are at least satisfied by linear autoregressive models, EXPAR and TAR models, ARCH and, more generally, β-ARCH models, with Gaussian f.
Proof. The tools for the proof are exactly the same as those of the proof of Theorem 3.1. Define Q_n(ψ; X_n) = −L_n(ψ; X_n). Then it is easy to see that, as n tends to infinity, It is clear that the matrix Σ̃ is positive definite. For any vector ψ^* ∈ int(Ψ), define the sequence of random functions T_n(ψ^*; X_n) = ∂²Q_n(ψ^*; X_n) − ∂²Q_n(ψ_0; X_n), and denote by T_n(ψ^*; X_n)_{ℓk} its (ℓ, k)th entry. Any entry of ∂²Q_n(ψ_0; X_n) is either a constant times the sum over i = 1, . . . , n of the product of the components or entries of ∂m(ρ_0; Z_{i−1}), ∂²m(ρ_0; Z_{i−1}), ∂σ(θ_0; Z_{i−1}), ∂²σ(θ_0; Z_{i−1}) and the random functions σ(θ_0; Z_{i−1}), ε_i(ψ_0), ξ_i(ψ_0), ξ̃_i(ψ_0), ζ_i(ψ_0), ζ̃_i(ψ_0) and ζ̄_i(ψ_0), or sums or differences of such terms. We have for example: In view of assumptions (A_1)-(A_3) and (B_0)-(B_4), we can deduce from the above example that, for each (ℓ, k), there exists a positive real-valued function μ_{ℓk}(z) such that, for δ > 0 with ||ψ − ψ_0||_E < δ, (nδ)^{−1}|T_n(ψ^*; X_n)_{ℓk}| is bounded from above by n^{−1} Σ_{i=1}^n μ_{ℓk}(Z_i), which, by the ergodic theorem, converges almost surely to E[μ_{ℓk}(Z_0)] < ∞ as n tends to infinity. One can thus conclude that for all (ℓ, k), (3.3) holds with T_n(ψ^*; X_n)_{ℓk}. Here also, as in the proof of Theorem 3.1, there exists a sequence of estimators ψ̂_n = (ρ̂′_n, θ̂′_n)′ such that, a.s., ψ̂_n −→ ψ_0, and for any ǫ > 0, there exists an event S_2 with P(S_2) > 1 − ǫ and an integer n_2 such that on S_2, for n > n_2, ∂Q_n(ψ̂_n; X_n) = 0 and Q_n(ψ; X_n) attains a relative minimum at ψ = ψ̂_n. Since a relative minimum for Q_n(ψ; X_n) is a relative maximum for L_n(ψ; X_n), the first part of our result is established. For the second part, it remains to prove that n^{−1/2}∂Q_n(ψ_0; X_n) converges in distribution to a Gaussian random vector with mean 0 and covariance matrix Λ.
This result is handled if one puts in Theorem 1 of Ngatchou-Wandji (2005): Finally, applying again the second part of Theorem 3.2.23 of Taniguchi and Kakizawa (2000), one has that

Corollary 4.1. Assume that the assumptions of Theorem 4.1 hold, and that the equalities φ^{(1)}

The conditions on the integrals in the above Theorem 4.1 and Corollary 4.1 are verified at least by Gaussian density functions f. When those in Corollary 4.1 are satisfied, the Fisher information matrix converges to Σ. Hence, the Cramér-Rao bound is asymptotically achieved and ψ̂_n is asymptotically efficient.

Some examples
Here we list some common examples contained in (1.1). It is not difficult to see that AR(q) models of finite order q, either linear or nonlinear, are within (1.1) for σ(θ; z) constant. The usual ones are, for example, AR, SETAR, TARCH and EXPAR models (see Tong (1990)). Taking m(ρ; z) = 0 in (1.1) yields ARCH(q) models. For finite q, the most popular one is the ARCH(q) model obtained with For q = ∞, many other common models are within (1.1). It is the case for invertible ARMA models. In the particular case of the MA(1) model defined by As can be seen, for instance, in Peng and Yao (2003), GARCH(p, q) models are also within (1.1). In the particular case of the GARCH(1,1) model defined by a^{j−1} X²_{i−j}, and consequently, The class of models (1.1) for q = ∞ also contains invertible bilinear models, such as the subdiagonal bilinear model defined by For this model, it follows from page 103 of Taniguchi and Kakizawa (2000) that if b² < 1, then

which in turn yields
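The ARCH(∞) representation of a GARCH(1,1) model, alluded to above, follows from a routine inversion. With the (hypothetical) notation σ_i² = a_0 + a_1 X²_{i−1} + b σ²_{i−1} for the conditional variance and assuming 0 ≤ b < 1, iterating the recursion gives

```latex
\sigma_i^2
  = a_0 + a_1 X_{i-1}^2 + b\,\sigma_{i-1}^2
  = a_0 \sum_{k \ge 0} b^k + a_1 \sum_{j \ge 1} b^{\,j-1} X_{i-j}^2
  = \frac{a_0}{1-b} + a_1 \sum_{j \ge 1} b^{\,j-1} X_{i-j}^2 ,
```

so the conditional variance depends on the whole infinite past, i.e. q = ∞ in (1.1).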
Remark 5.1. Although conditional least-squares and conditional maximum likelihood estimators for ψ_0 exist, their computation may require numerical methods, even for Gaussian density functions f. For example, for pure ARCH(1) models defined by (5.1) with q = 1 and Gaussian f, the conditional maximum likelihood estimators are obtained by solving in (θ_0, θ_1) the likelihood equations

Σ_{i=1}^n (X_i² − θ_0 − θ_1 X²_{i−1})/(θ_0 + θ_1 X²_{i−1})² = 0,  Σ_{i=1}^n X²_{i−1}(X_i² − θ_0 − θ_1 X²_{i−1})/(θ_0 + θ_1 X²_{i−1})² = 0,

with the restrictions θ_0 > 0 and 0 ≤ θ_1 < 1. This will generally require a numerical method. A similar remark applies to the GARCH(1,1) and bilinear models defined above when estimating by either the least-squares or the likelihood methods.
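As a concrete illustration of the numerical step, the following sketch (hypothetical Python, assuming Gaussian f; a crude grid search stands in for the Newton-type or simplex optimiser one would use in practice) minimises the conditional negative log-likelihood of a pure ARCH(1) model:

```python
import numpy as np

def neg_loglik(theta0, theta1, x):
    """Gaussian conditional negative log-likelihood (up to an additive constant)
    for the pure ARCH(1) model X_i = sqrt(theta0 + theta1 * X_{i-1}^2) * eps_i."""
    h = theta0 + theta1 * x[:-1] ** 2          # conditional variances h_i
    return 0.5 * np.sum(np.log(h) + x[1:] ** 2 / h)

def fit_arch1(x, grid0, grid1):
    """Crude grid search over (theta0 > 0, 0 <= theta1 < 1)."""
    best, arg = np.inf, None
    for t0 in grid0:
        for t1 in grid1:
            v = neg_loglik(t0, t1, x)
            if v < best:
                best, arg = v, (t0, t1)
    return arg
```

Setting the gradient of this objective to zero recovers exactly the two likelihood equations displayed in Remark 5.1.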

Kernel estimator for the noise's density and its derivatives
In time series analysis, the conditional distribution can be very useful for the study of nonlinear phenomena such as the symmetry and the multimodality structure of a time series. In the setting of models (1.1), the conditional distribution is the distribution of the noise. Nonparametric estimation of the conditional distribution has been studied, among others, by Hyndman and Yao (2002), who use a kernel method and derive a test for conditional symmetry from their estimator; Fan, Yao and Tong (1996), who use the local polynomials approach; and Fan and Yim (2004), who use a cross-validation method. Bai and Ng (2001) also derive a test for conditional symmetry, which rests on kernel estimators of the conditional density and its derivatives.
In this section, we assume that the ε_i's have an unknown uniformly continuous density function f, and we define its kernel estimator and those of its derivatives. We show the uniform consistency of these estimators. The results of this section can lead to the derivation of adaptive estimators for ψ_0, or to the construction of goodness-of-fit tests for the function f. However, we will not study these problems here.
For all i ∈ Z and ψ = (ρ′, θ′)′ ∈ Ψ, we define the random function Let ψ̂_n = (ρ̂′_n, θ̂′_n)′ be any consistent estimator of ψ_0 such that n^{1/2}(ψ̂_n − ψ_0) converges in distribution to a Gaussian distribution with mean 0 and variance matrix Γ. Take, for example, the least-squares estimator of Section 3, or the pseudo-maximum likelihood estimator which can be obtained from Section 4 with Gaussian f. Let p be a nonnegative integer and K be a kernel function differentiable up to order p + 1, with modulus of continuity ω_K. Let (h_n) be a sequence of positive real numbers such that h_n −→ 0 as n tends to infinity. For n = 1, 2, . . . and for all x ∈ R, define the random functions For observable ε_i(ψ)'s, the convergence of the above Bhattacharya estimators of f^{(p)}(x) is studied in Silverman (1978). Here, the ε_i(ψ)'s are not observable and it is natural to estimate f^{(p)}(x) by f̂_n^{(p)}(ψ̂_n; x). Following Singh (1979), or the more recent paper of Horová, Vieu and Zelinka (2002), other estimators of f^{(p)}(x) could be defined. The results of this section are established under the following assumptions of Silverman (1978):
(H_6) For j = 0, . . . , p + 1, K^{(j)}(x) −→ 0 as |x| → ∞ and ∫ |K^{(j)}(x)| dx < ∞.
(H_7) The Fourier transform of K is not identically one in any neighborhood of 0.
We have the following theorem.

Theorem 6.1. Assume (B_0) and (H_6) hold, and that the function K^{(p+1)} is continuous. Let r be any integer such that 0 ≤ r ≤ p, and assume n^{1/2} h_n^{r+2} −→ ∞ as n → ∞. Then sup

Proof. Let r be an integer with 0 ≤ r ≤ p. By a Taylor expansion of order one, we have, for some vector ψ^* lying between ψ_0 and ψ̂_n: By the triangle inequality, it then follows that Since the function K^{(r+1)} is continuous, by (H_6) it is bounded, and there exists a constant C > 0 such that, almost surely, Also, under (B_0) and (A_1), one can find a positive function χ(z) with E[χ^4(Z_0)] < ∞ such that, for all 1 ≤ i ≤ n, From these two inequalities, we have By our assumptions, ||√n(ψ_0 − ψ̂_n)||_E converges in distribution to ||N(0, Γ)||_E. By the ergodic theorem, almost surely, The result then follows from the fact that n^{1/2} h_n^{r+2} −→ ∞ as n → ∞.
An immediate consequence of both Theorem 4.1 and Theorems A and C of Silverman (1978) is the following corollary.
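The residual-based estimator f̂_n can be sketched as follows (hypothetical Python with a Gaussian kernel K; the residuals ε̂_i = ε_i(ψ̂_n) are assumed already computed from a fitted model; derivatives f̂_n^{(p)} would follow by differentiating the kernel p times):

```python
import numpy as np

def kde(resid, grid, h):
    """Kernel estimator f_n(x) = (n*h)^{-1} * sum_i K((x - eps_i)/h),
    with Gaussian kernel K, evaluated at each point of `grid`."""
    resid = np.asarray(resid, dtype=float)
    u = (np.asarray(grid, dtype=float)[:, None] - resid[None, :]) / h
    return np.exp(-0.5 * u ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))
```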

Simulation study
To illustrate some of our results, we conducted a simulation experiment that we present and comment on in this last section. We restricted attention to models for which we could obtain explicit and simple expressions for the estimators, which avoided the use of numerical methods. We consider the following models: where the parameters ρ_0, ρ_1, κ > 0, θ_0 > 0 and θ_1 ≥ 0 possibly satisfy some conditions ensuring the existence, the invertibility, the stationarity and the ergodicity of (X_i)_{i∈Z}. For example, for model (ii) below, (7.1) admits a strictly stationary and geometrically ergodic solution (X_i)_{i∈Z} as soon as 0 ≤ θ_1 < 1. The noise densities f that we used were either Gaussian or Laplace. More precisely, we studied the cases (i) ρ_0 = 0, 0 < ρ_1 < 1, κ = 0.1 and θ_1 = 0, with f either Gaussian or Laplace.
Except for model (i) with Gaussian f and model (ii), there is no guarantee that (X_i)_{i∈Z} is stationary and/or ergodic for the other models.
For the computation of the least-squares estimators, the weight functions were λ(z) = γ(z) ≡ 1, which yields the classical least-squares estimators. In each case, our estimates were computed on the basis of 1,000 samples of length n. For model (i), from simple computations, it is easy to see that the least-squares estimator coincides with the maximum likelihood estimator for Gaussian f. The results concerning this model are listed in Table 1. They show that, for samples of size n = 50, and for either density considered, the least-squares estimators are close to the true values of the parameters. Quick calculations show that these estimators are unbiased. The trials for this model were also done for n ≥ 100; from the results, which we do not present here, the estimates obtained were more accurate.
Concerning the models (ii), the results were in general better for the maximum likelihood estimators than for the least-squares ones, for all the sample sizes n = 100, 200, 400 (see Tables 2 and 3). Both estimators moved toward the true parameter values as n grew. For the models (iii), only least-squares estimators were computed. This was done for n = 100, 200, 400. We observed in these cases that the estimates of ρ_0 were good, while θ_0 was always overestimated and θ_1 was underestimated (see Table 4). The least-squares estimates for the models (ii) also behaved this way. This likely comes from the fact that the least-squares estimators for these models are highly biased. It seems from our simulation experiment that their bias converges slowly to 0 as n grows. For the kernel density estimation, we restricted our trials to models (i)-(iii) with Gaussian density. The residuals were computed from the least-squares fit with γ(z) = λ(z) ≡ 1. We took ψ_n = ψ̂_n (see Sections 3 and 5) and a Gaussian kernel with h_n = c_n n^{−1/9}, where, denoting by σ_n the empirical standard deviation and by X_{n,1/4} and X_{n,3/4} the first and third empirical quartiles of (X_1, . . . , X_n), c_n = 0.9 min{σ_n, (X_{n,3/4} − X_{n,1/4})/1.34}.
This sequence (c_n), given in the software R, seemed to give better results than c_n = σ_n. It is easy to check that the Gaussian kernel and the sequence (h_n) satisfy the assumptions of Theorem 6.1. We took ρ_1 = −0.5, θ_0 = 1 for (i), θ_0 = 0.4, θ_1 = 0.1 for (ii) and ρ_0 = 0.6, θ_0 = 0.4, θ_1 = 0.05 for (iii). The different plots of f̂_n and f are gathered on the same graph; the same is done for f̂_n^{(1)} and f^{(1)}. The trials were done for n = 100, 200, 400, 600. The estimates obtained for the density were good (see Figure 1). Those of the derivative of the density were not good for n = 100, 200, especially in the vicinity of the maxima; they were better for n = 400, 600 (see Figure 2). For the density and its first derivative, one can see that the estimates from the models (ii) and (iii) were not very close to the true functions. This is probably due to sampling fluctuations or to the bias of the least-squares estimators of the parameters of these models. The good behavior of the estimates obtained from (i) may come from the fact that the conditional likelihood and the conditional least-squares estimators of the parameter ψ_0 coincide in this case, as pointed out earlier.
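The bandwidth rule above can be written out explicitly (a hypothetical Python transcription; the constant c_n mirrors R's bw.nrd0 rule-of-thumb, but with exponent −1/9 instead of the usual −1/5):

```python
import numpy as np

def bandwidth(x):
    """h_n = c_n * n^(-1/9), with c_n = 0.9 * min(sigma_n, IQR_n / 1.34),
    where sigma_n is the empirical standard deviation and IQR_n the
    interquartile range X_{n,3/4} - X_{n,1/4} of the sample."""
    x = np.asarray(x, dtype=float)
    sigma = np.std(x, ddof=1)
    q1, q3 = np.quantile(x, [0.25, 0.75])
    c = 0.9 * min(sigma, (q3 - q1) / 1.34)
    return c * len(x) ** (-1.0 / 9.0)
```

The min with the rescaled interquartile range guards against heavy tails or outliers inflating the standard deviation, which would oversmooth the estimate.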