Least tail-trimmed absolute deviation estimation for autoregressions with infinite/finite variance

Abstract: We propose least tail-trimmed absolute deviation estimation for autoregressive processes with infinite/finite variance. We explore the large-sample properties of the resulting estimator and establish its asymptotic normality. Moreover, we study convergence rates of the estimator under different moment settings and show that it attains a super-√n convergence rate when the innovation variance is infinite. Simulation studies are carried out to examine the finite-sample performance of the proposed method and that of relevant statistical inferences. A real example is also presented.


Introduction
Heavy-tailed time series are often encountered in the real world, particularly in economics and finance; see [5,6] for some examples. Autoregressive (AR) processes with infinite variance have been widely used to model such series. Let {X_t} be an AR process of order p (AR(p)); that is, {X_t} satisfies the recursions

X_t = φ_0 + φ_1 X_{t−1} + · · · + φ_p X_{t−p} + Z_t,  (1.1)
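The recursion (1.1) can be sketched in a few lines of Python. The `simulate_ar` helper below is a hypothetical illustration, not code from the paper; in the heavy-tailed settings studied here, the innovation sequence would be drawn from a stable or Pareto-type law.

```python
def simulate_ar(phi, innovations, x_init=None):
    """AR(p) recursion X_t = phi_0 + phi_1*X_{t-1} + ... + phi_p*X_{t-p} + Z_t.

    phi = (phi_0, phi_1, ..., phi_p); innovations is the Z_t sequence.
    """
    p = len(phi) - 1
    # Start from supplied initial values, or from p zeros.
    x = list(x_init) if x_init is not None else [0.0] * p
    for z in innovations:
        # x[-j] is X_{t-j} just before appending X_t.
        x.append(phi[0] + sum(phi[j] * x[-j] for j in range(1, p + 1)) + z)
    return x

# With zero innovations, an AR(1) path contracts to the fixed point phi_0/(1 - phi_1):
path = simulate_ar((0.2, 0.8), [0.0] * 200)
```

With φ_0 = 0.2 and φ_1 = 0.8, `path[-1]` approaches 0.2/(1 − 0.8) = 1.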

R. Wu and Y. Cui
where {Z_t} is a sequence of independent and identically distributed (i.i.d.) random variables. A heavy-tailed time series {X_t} is induced when the innovation Z_t has infinite variance. A number of papers have been devoted to estimation of φ = (φ_0, φ_1, . . . , φ_p)^T, the vector of unknown AR coefficients; see [10,11] for a review. Specifically, [9] established the limiting distributions of the least squares (LS) and Yule-Walker estimators. [8] made available the respective asymptotic theory of the M-estimator and the least absolute deviation (LAD) estimator when the innovation Z_t is in the domain of attraction of a stable law. However, these estimators are neither asymptotically normal nor do their limiting distributions have closed forms. As a result, it is difficult to apply the asymptotic theory to statistical inference due to the lack of effective inference techniques; typically one needs to resort to nonparametric methods to approximate the limiting distribution. Along another strand of research, [11] proposed self-weighted LAD (SLAD) estimation and showed asymptotic normality of the resulting estimator with a standard √n convergence rate when the innovation density and its derivative are uniformly bounded. Ling's result admits statistical inferences in a conventional fashion. More recently, aiming at robust parameter estimation, [10] developed a least tail-trimmed squares (LTTS) procedure. It is shown that the LTTS estimator is asymptotically normal and can achieve a super-√n convergence rate when the innovation variance is infinite. Also, Hill's simulation study indicated that the LTTS procedure is robust to outlying observations in both small and large samples.
In the present paper we explore least tail-trimmed absolute deviation (LTTAD) estimation, the LAD version of Hill's LTTS estimation, for AR processes. Our study is motivated by the well-known fact that the LAD method is in general superior to the LS method when modeling heavy-tailed data, due to its robustness to outliers. For infinite variance AR processes, it has been shown that the LAD estimator has a higher convergence rate than the LS estimator. Our study suggests that the merits of the LAD method carry over to LTTAD estimation when compared to LTTS estimation; for example, the LTTAD estimator converges faster than the LTTS estimator. Meanwhile, asymptotic normality of the LTTAD estimator is likewise ensured. Moreover, without exerting tail-trimming on the extreme residual values, LTTAD estimation on the one hand is much simpler to implement numerically and on the other hand requires weaker model assumptions.
Our study is also related to [11], but a major difference exists. Ling employed a constant trimming threshold as an example of weight selection. However, when the constant is replaced by a quantile order statistic of the observations (as in Ling's simulation studies), the corresponding weight no longer meets the requirement (i.e. Ling's Assumption 2 on page 383). Our study develops theoretical treatments under such a situation. It turns out that, while the resulting estimator retains asymptotic normality, it actually achieves a super-√n (instead of standard √n) convergence rate in the case of infinite variance. The remainder of the paper is organized as follows. Section 2 sets up the study and defines the LTTAD estimator. We develop the asymptotic properties of the estimator; statistical inferences are also explored. In Section 3 we discuss the convergence rate of the LTTAD estimator under different cases, showing its super-√n convergence when the innovation variance is infinite. Simulation studies are conducted in Section 4 to assess the finite-sample performance of the proposed procedures, and a real example of financial time series is presented as well. Technical proofs are relegated to the Appendix.

Least tail-trimmed absolute deviation estimation
In this paper we assume that (A1) the innovation Z_t has zero median and a probability density function f(·) that is continuous in a neighborhood of zero, with f(0) > 0 and sup_{z∈R} f(z) < B < ∞; and (A2) the distribution of Z_t has Pareto-like tails:

P(|Z_t| > z) = C z^{−α} (1 + o(1)) as z → ∞,  (2.1)

where C > 0 is a constant and α > 0 is the tail index.
Assumption (A2) is satisfied by the Pareto distribution, the stable distribution with index α ∈ (0, 2), Student's t distribution, the Fréchet distribution, etc. Under this assumption, E|Z_t|^ϑ < ∞ for ϑ ∈ (0, α), and therefore there exists a unique strictly stationary and ergodic process {X_t}. When α ≤ 2, Z_t has no finite variance, whence {X_t} is a heavy-tailed series, which is the case of particular interest. For the model (1.1), φ(z) = 1 − Σ_{j=1}^p φ_j z^j is called the AR polynomial. We assume that (A3) the polynomial φ(z) has all its roots outside the unit circle in the complex plane. Under (A3), {X_t} admits the causal representation X_t = φ_0/φ(1) + Σ_{j=0}^∞ ψ_j Z_{t−j}, where ψ_j, j = 0, 1, . . ., is the coefficient of z^j in the power series expansion of 1/φ(z). According to the results of [4], it follows that the distribution of X_t also has Pareto-like tails with the same decay rate as in (2.1). Let φ_0 = (φ_00, φ_01, . . . , φ_0p)^T denote the true value of φ. Suppose X_1, . . . , X_n are the observations from the true model. At t = p + 1, . . . , n, the residual corresponding to φ is given by

z_t(φ) = X_t − φ_0 − Σ_{j=1}^p φ_j X_{t−j}.

To introduce the LTTAD estimator, we trim the extreme values of X_{t−i}, t = p + 1, . . . , n, i = 1, . . . , p. Define X^a_t = |X_t|, t = 1, . . . , n, and let X^a_(1) ≥ · · · ≥ X^a_(n) denote the order statistics of X^a_1, . . . , X^a_n. Also, let k_n be an integer

number such that 1 ≤ k_n < n and 1/k_n + k_n/n → 0 as n → ∞. The LTTAD estimator φ̂_n of φ_0 is defined as the minimizer of the objective function

l_n(φ) = Σ_{t=p+1}^n ŵ_{n,t} |z_t(φ)|,  (2.2)

where ŵ_{n,t} = Π_{i=1}^p I(X^a_{t−i} ≤ X^a_(k_n)) with I(·) being the indicator function. Unlike Hill's LTTS procedure, there is no trimming in (2.2) based on the magnitude of |z_t(φ)|. There are several advantages of doing so. First, l_n(φ) is a convex function of φ, which on the one hand guarantees that the established theoretical results hold for the global minimizer and on the other hand considerably reduces the burden of numerical computation. Second, the "fractiles" assumption of [10] can be dropped. Third, symmetry is no longer required for the distribution of Z_t; that is, skewed distributions are allowed for Z_t.
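A minimal pure-Python sketch of evaluating the tail-trimmed objective may help fix ideas. It assumes the trimming indicator keeps a term only when every lagged regressor lies at or below the k_n-th largest |X_t|; the exact tie and boundary handling in the paper's indicator may differ, and the function name is ours.

```python
def lttad_objective(phi, x, k_n):
    """l_n(phi) for phi = (phi_0, ..., phi_p) evaluated on the series x."""
    p = len(phi) - 1
    # k_n-th largest absolute observation: the trimming threshold.
    threshold = sorted(abs(v) for v in x)[-k_n]
    total = 0.0
    for t in range(p, len(x)):               # t = p+1, ..., n in the paper's indexing
        lags = [x[t - j] for j in range(1, p + 1)]
        if any(abs(v) > threshold for v in lags):
            continue                          # trimmed: indicator weight is zero
        z_t = x[t] - phi[0] - sum(phi[j] * lags[j - 1] for j in range(1, p + 1))
        total += abs(z_t)
    return total
```

Note that only the lagged regressors are trimmed: in the toy series below the outlying observation 100 still enters as a residual at its own time index, but is excluded wherever it appears as a regressor, mirroring the "no trimming on |z_t(φ)|" design.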
For t > p, note that z_t(φ_0) = Z_t. Define the scaling matrix

V_n = E[Y_t Y_t^T Π_{i=1}^p I(|X_{t−i}| ≤ c_n)], with Y_t = (1, X_{t−1}, . . . , X_{t−p})^T,

where {c_n} is a nonrandom sequence such that P(|X_t| ≥ c_n) = k_n/n. By arguments similar to those of [10], it can be shown that V_n is positive definite for sufficiently large n.
The asymptotic distribution of the LTTAD estimator is given in the following theorem.
Theorem 2.1 admits statistical inferences with respect to φ_0 in a conventional fashion. For example, consider a linear null hypothesis of the form H_0: Γφ_0 = γ, where Γ and γ are an s × (p + 1) constant matrix and an s × 1 constant vector, respectively. A Wald test statistic can be constructed once V_n and f(0) are consistently estimated. An estimator V̂_n of V_n is given by its sample analogue. On the other hand, f(0) is estimated by a kernel density estimator based on the residuals, where K_h(·) = K(·/h_n)/h_n with K(·) a symmetric kernel function and h_n > 0 a bandwidth. It can be shown that V̂_n = V_n(1 + o_p(1)), and the kernel estimator is consistent for f(0) under some regularity conditions on the kernel function and bandwidth. Then, the resulting Wald test statistic has a null limiting chi-squared distribution with s degrees of freedom.
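One standard form of the Gaussian-kernel estimate of f(0), evaluated at the fitted residuals, can be sketched as follows. This is an illustrative sketch: the paper's exact estimator (for instance, any trimming of the residuals entering the sum) may differ, and the function name is ours.

```python
import math

def f_hat_at_zero(residuals, bandwidth):
    """Gaussian-kernel density estimate at zero: f_hat(0) = (m*h)^(-1) * sum_t K(z_t/h)."""
    m = len(residuals)
    # Sum of standard-normal kernel evaluations at z_t / h.
    s = sum(math.exp(-0.5 * (z / bandwidth) ** 2) for z in residuals)
    return s / (m * bandwidth * math.sqrt(2.0 * math.pi))
```

In the simulations later in the paper, a normal-scale bandwidth selector is used for h_n; any reasonable h_n → 0 with m·h_n → ∞ fits the stated regularity conditions.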

Convergence rate
It follows from Theorem 2.1 that, for i = 0, 1, . . . , p, the convergence rate of φ̂_{i,n}, the LTTAD estimator of φ_{0i}, is governed by V_{i+1,i+1,n}, the (i + 1)th diagonal element of the matrix V_n. As we specify the same distributional tails of Z_t as in [10], the convergence rate of the LTTAD estimator can be straightforwardly derived. First, the results of [10] provide the order of V_{i+1,i+1,n} in terms of c_n. On the other hand, [10] showed that c_n = d(n/k_n)^{1/α}(1 + o(1)), where d is some scale parameter. So, the convergence rate of our LTTAD estimator is readily obtained.

Remark 1.
When the innovation variance is infinite (i.e. α ∈ (0, 2]), the LTTAD estimator has a higher convergence rate than Hill's LTTS estimator, where k^Z_n is the number of trimmed |z_t(φ)|'s as in [10], which satisfies 1 ≤ k^Z_n < n − p and 1/k^Z_n + k^Z_n/n → 0 as n → ∞. On the other hand, when α ∈ (0, 2) the LAD estimator of φ_{0i}, i = 1, . . . , p, has an n^{1/α} convergence rate, so it converges faster than the LTTAD estimator.

Remark 2.
When the innovation variance is finite (i.e. α > 2), the trimming does not affect the efficiency and the asymptotic result coincides with that of the LAD estimator established in [7]. When 0 < α ≤ 2, the super-√n convergence is ensured.

Simulation studies
We conducted simulation studies to evaluate the finite-sample performance of LTTAD estimation. We considered the AR(2) model studied in [10]. Two types of distributions were specified for the innovation {Z_t}: the two-sided compound Pareto distribution and the stable distribution. The Pareto (type II) distribution has a probability density function (pdf) of the form

g(z) = (α/σ)(1 + (z − μ)/σ)^{−(α+1)}, z ≥ μ,

where α > 0 is the tail index, μ ∈ R is the location parameter, and σ > 0 is the scale parameter. In the studies we set μ = 0 and σ = 1. The two-sided compound Pareto distribution was defined such that it had a pdf of 0.5g(−z)I(z < 0) + 0.5g(z)I(z ≥ 0) for z ∈ R; so, the distribution is symmetric about zero. Three values of α were taken: 0.75 (infinite mean), 1.5 (finite mean but infinite variance), and 2.5 (finite variance). On the other hand, the stable distribution can be described by four parameters: the index parameter α ∈ (0, 2], the skewness parameter β ∈ [−1, 1], the scale parameter γ > 0, and the shift parameter δ. We denote the distribution as S(α, β, γ, δ). We considered two index values, 0.75 (infinite mean) and 1.5 (finite mean but infinite variance); more specifically, we used the skewed distributions S(0.75, 0.5, 1, −0.2565) and S(1.5, 0.5, 1, −0.1339), where for each distribution the value of the shift parameter ensures a median of zero (i.e. its magnitude equals the 50th percentile of the corresponding distribution with δ = 0), as well as the symmetric distributions S(0.75, 0, 1, 0) and S(1.5, 0, 1, 0). Three sample sizes (100, 400, and 800) were used. When the sample size is small, the objective function l_n(φ) in (2.2) might be sensitive to extreme values of z_t(φ). As a remedial measure, additional trimming on the residuals may be incorporated into l_n(φ) as in [10]. In this regard, we define likewise z^a_t(φ), t = p + 1, . . . , n, and the order statistics z^a_(1)(φ) ≥ · · · ≥ z^a_(n−p)(φ).
Then, we can formulate an objective function that additionally trims the largest |z_t(φ)|'s; its minimizer is referred to as the LTTADR estimator of φ_0. However, our simulation studies indicated that it does not necessarily have a better finite-sample performance than the LTTAD estimator.
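The two-sided compound Pareto innovations above (with μ = 0, σ = 1) can be sampled by inverse transform. The quantile function below is derived from the stated density 0.5g(−z)I(z < 0) + 0.5g(z)I(z ≥ 0) with Pareto(II) pieces g(z) = α(1 + z)^{−(α+1)}; the helper name is ours.

```python
def two_sided_pareto_ppf(u, alpha):
    """Quantile function of the two-sided compound Pareto law (mu = 0, sigma = 1).

    For z >= 0 the CDF is 1 - 0.5*(1 + z)^(-alpha); inverting each half gives:
    """
    if u >= 0.5:
        return (2.0 * (1.0 - u)) ** (-1.0 / alpha) - 1.0
    return -((2.0 * u) ** (-1.0 / alpha) - 1.0)

# i.i.d. innovations would then be generated as, e.g.,
# z = [two_sided_pareto_ppf(random.random(), 1.5) for _ in range(n)]
```

By construction the median is zero and the law is symmetric, matching the zero-median assumption (A1).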
We simulated 5000 replicates of time series from the AR(2) model for each setting. To compare the performance of the LTTAD estimator with the LAD, Hill's LTTS, Ling's SLAD, and the LTTADR estimators, we computed the empirical bias and standard deviation (SD) for each type of estimate. The trimming parameter k_n was set to max{1, [0.2n/(ln n)^2]} in the tail-trimmed procedures (i.e. LTTS, LTTAD, and LTTADR); that is, k_n = 1, 2, and 4 for n = 100, 400, and 800, respectively. On the other hand, the specification k^Z_n = max{1, [0.05n/ln n]} of [10] was used in the computation of the LTTS and LTTADR estimates, which equals 1, 3, and 6 for the three sample sizes, respectively. Table 1 reports the results when α = 0.75 and Z_t follows a compound Pareto distribution; a similar pattern was displayed when Z_t follows a stable distribution. The table reveals several distinct features. First, while all five procedures performed very well in the estimation of φ_01 and φ_02, the estimates of the intercept φ_00 were quite off except for the LTTAD, SLAD, and LAD (when n = 400 and 800) procedures. The irregular bias resulted from a large number of extremely outlying estimates. It seemed that the undefined mean of the time series somehow caused difficulty in the numerical minimization to obtain φ_00. Note that this abnormality vanished when the mean was finite; see Tables 2 and 3. Second, LTTAD estimation significantly outperformed LTTS estimation. Third, additional trimming on the residuals did not necessarily improve the performance of estimation; this may be partially due to the non-convexity of the LTTADR criterion function. Note that LAD estimates are expected to have a better performance than LTTAD estimates; see Remark 1. The motivation for studying LTTAD estimation is to trade a small loss of precision for asymptotic normality so that statistical inferences can be conducted in the classical fashion.
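The trimming fractiles quoted above can be reproduced directly. The bracket [·] is read here as rounding to the nearest integer, an assumption on our part, but it is the reading that matches the stated values 1, 2, 4 and 1, 3, 6 for n = 100, 400, 800.

```python
import math

def k_n(n):
    """Regressor-trimming fractile used by LTTS/LTTAD/LTTADR: max{1, [0.2n/(ln n)^2]}."""
    return max(1, round(0.2 * n / math.log(n) ** 2))

def k_z_n(n):
    """Residual-trimming fractile of [10] used by LTTS/LTTADR: max{1, [0.05n/ln n]}."""
    return max(1, round(0.05 * n / math.log(n)))
```

Both sequences satisfy the required conditions k_n → ∞ and k_n/n → 0 (the max{1, ·} guard only binds at small n).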
The magnitudes of the reported standard deviations in Table 1 support the relative convergence rates of the LAD, LTTS, and LTTAD estimators. Table 2 contains the results when the distribution of Z_t has a tail index α = 1.5 (finite mean but infinite variance). The LTTAD estimator again outperformed the LTTS estimator. We also examined the impact of skewness of the innovation distribution on estimation. It is seen from the results that skewness made little difference on the performance of LTTAD estimation but resulted in a biased estimate of the intercept φ_00 for the LTTS method. Table 3 reports the results when the time series has a finite variance. On the one hand, the LTTAD estimator outperformed the LTTS estimator; on the other hand, there was not much difference in performance between the LTTAD and LTTADR estimators. In all the cases the LTTAD and SLAD methods had comparable performances. This is not surprising because the SLAD estimation implemented in Ling's simulation studies is essentially a variant of the LTTAD method. To see this, note that the constant trimming threshold in Ling's equation (2.3) was replaced by the 95% quantile of X^a_1, . . . , X^a_n. As a consequence, Ling substantially down-weighted the relevant residual terms in the objective function while we simply assigned a zero weight to them.
To get a sense of the normality of the LTTAD and LTTS estimators, in Figure 1 we constructed normal probability plots of the LTTAD and LTTS estimates of φ_00, φ_01, and φ_02, respectively, where Z_t is stable with tail index α = 1.5 and n = 400. The Gaussian approximation was fairly reasonable for both estimators once the few outliers due to computational glitches were ignored.
To explore the performance of the Wald test, we considered testing the AR(1) model against the AR(2) model, where H_0: (0, 0, 1)(φ_0, φ_1, φ_2)^T = 0. Data were simulated from an AR(1)/AR(2) model with φ_0 = 0.2, φ_1 = 0.8, and φ_2 ranging from −0.10 to 0.10 with an increment of 0.05, while the other settings remained the same as before. We computed the empirical power/size of the Wald test based on the LTTAD and LTTS methods, respectively, at nominal significance levels of 0.01, 0.05, and 0.1. The results when Z_t follows a two-sided compound Pareto distribution with tail index α = 1.5 are displayed in Figure 2, where the dotted horizontal line represents the nominal size. In the estimation of f(0) a Gaussian kernel was used with the default normal-scale bandwidth selector. For both methods, the empirical size was close to the nominal size, but the Wald test based on the LTTAD method had much larger empirical power, especially when the true φ_2 value is close to the hypothesized value of zero.

Empirical example
As an empirical example, we considered the log-returns of the Hang Seng Index (HSI) between June 3, 1996 and May 31, 1998. The same data set has been investigated in [10,11]. Figure 3(a) displays the time series. The normal probability plot in Figure 3(b) shows strong evidence of heavy-tailedness, which was further confirmed by the Jarque-Bera and Shapiro-Wilk tests. Figure 3(c) depicts the kernel density estimate for the log-returns, superimposed with normal and stable density functions. For the normal density we took the corresponding empirical mean (−4.356 × 10^{−4}) and standard deviation (2.250 × 10^{−2}) as the distributional mean and standard deviation. For the stable density we estimated the distribution parameters by fitting a stable distribution to the log-returns using the maximum likelihood method via the R package "fBasics"; the fitted stable distribution was S(1.404, −0.204, 0.009, 0.001). It is seen that the estimated density was well approximated by the fitted stable density. Thus, it is plausible to model the log-returns as a non-Gaussian stable time series. Large-scale simulations in [1] showed that the sample autocorrelation function (ACF) and sample partial autocorrelation function (PACF) are still excellent tools for tentative identification of the orders of a stable ARMA model. The sample PACF plot of the log-returns is given in Figure 3(d). The 95% bounds in the figure are modified bounds for stable ARMA models computed based on Table 3 of [1]. To be specific, with a fitted index value of 1.404, the bounds were approximately ±2.814(log n/n)^{1/1.4} = ±0.124, where n = 490 is the length of the time series. For comparison, the standard bounds under the classical Gaussian framework are ±1.96/√n = ±0.089. From the plot, the sample partial autocorrelation is significantly outside the bounds at lag 3, slightly outside the bounds at lag 4, and insignificant at other lags.
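The two sets of 95% bounds quoted above can be checked numerically; the constant 2.814 is taken from the text's citation of Table 3 of [1], and the index is rounded to 1.4 as in the text.

```python
import math

n, alpha = 490, 1.4  # series length and (rounded) fitted stable index

# Modified heavy-tailed PACF bound of [1] versus the classical Gaussian bound.
stable_bound = 2.814 * (math.log(n) / n) ** (1.0 / alpha)   # about 0.124
gaussian_bound = 1.96 / math.sqrt(n)                        # about 0.089
```

The modified bound is roughly 40% wider, which is why a PACF spike that looks significant under the Gaussian bound may be insignificant under the stable one.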
So it is reasonable to consider fitting three AR models, (M1)-(M3). The intercept term was dropped (i.e. zero-mean AR models for the log-returns) due to the insignificance of the empirical mean. This is in agreement with the conclusion of [10,11] that φ_0 is insignificant. Also, the model (M3) is the final model chosen by [11].
We first fitted the model (M1) using both the LTTAD and LTTS procedures, with the trimming parameters as specified in the simulation studies and inference based on the asymptotic results of [10]. The two fitted models were comparable other than the opposite signs of φ̂_4, which, however, do not truly imply a contradiction: indeed, φ̂_4 was insignificantly different from zero, which was not surprising in view of the sample PACF plot of the log-returns. Thus we removed the term X_{t−4} and re-fitted the model (M2) to the data, leading to an LTTAD model with AIC = −7.619. So, both fitted AR(3) models were adequate but the LTTAD model-fit was slightly better, which is also seen from the sample ACF plot of the residuals in Figure 4; the bounds in the sample ACF plot are again the modified bounds. Finally we considered fitting the log-returns using the model (M3). By the LTTAD method we obtained a model whose residuals had a fitted stable distribution of S(1.392, −0.185, 0.009, 0.001), with AIC = −7.621. The diagnostic plots are presented in Figure 5. Again, the LTTAD model-fit was better than the LTTS model-fit although both were adequate. Also, based on the AIC values, the subset AR(3) model was a bit more plausible than the usual AR(3) model. So we concluded, in agreement with [11], that a subset AR(3) model is appropriate for the log-returns of the Hang Seng Index.

Appendix A: Technical proofs
We first establish a lemma, from which Theorem 2.1 follows. To be specific, we reparameterize by setting ν = n^{1/2} V_n^{1/2} (φ − φ_0) and hence can rewrite l_n(φ) in terms of ν. Then, minimizing l_n(φ) with respect to φ is equivalent to minimizing S_n(ν) with respect to ν. Stated below is the functional limit theorem for S_n(ν).

Lemma A.1. Under the Assumptions (A1)-(A3),
on C(R^{p+1}), the space of real-valued continuous functions on R^{p+1}, where convergence means uniform convergence on every compact set in R^{p+1}.