Semi-parametric regression estimation of the tail index

Consider a distribution F with regularly varying tails of index −α. An estimation strategy for α, exploiting the relation between the behavior of the tail at infinity and of the characteristic function at the origin, is proposed. A semi-parametric regression model does the job: a nonparametric component controls the bias and a parametric one produces the actual estimate. Implementation of the estimation strategy is quite simple as it can rely on standard software packages for generalized additive models. A generalized cross validation procedure is suggested in order to handle the bias-variance trade-off. Theoretical properties of the proposed method are derived and simulations show the performance of this estimator in a wide range of cases. An application to data sets on city sizes, facing the debated issue of distinguishing Pareto-type tails from Log-normal tails, illustrates how the proposed method works in practice.

MSC 2010 subject classifications: Primary 62G32; secondary 62J05.


Introduction
Consider a random sample X_1, X_2, . . . , X_n with a distribution function F satisfying

F̄(x) = x^{−α} L(x), as x → ∞,   (1)

where F̄ = 1 − F and L(x) is a slowly varying function, satisfying L(tx)/L(x) → 1 as x → ∞, for any t > 0. We will also say that F̄ = 1 − F is regularly varying (RV) at infinity with index −α, denoted as F̄ ∈ RV_{−α}. The parameter α > 0 is usually referred to as the tail index; alternatively, in the extreme value (EV) literature it is typical to refer to the EV index γ > 0, with α = 1/γ. The paper discusses an estimator of the tail index α which relies on the relationship between the tail of a distribution and the slope of the real part of the characteristic function (RCF) at the origin. This idea places itself outside the mainstream approaches for tail estimation, which are usually based on functions of upper order statistics; moreover, the proposed method exploits all sample information in analyzing the slope of the empirical RCF (ERCF) at the origin, which typically has lower variability than the largest order statistics. The estimation procedure proposed can be implemented exploiting standard regression analysis packages, such as lm or gam in the R software, and their tools for automatically selecting estimation parameters. Also, a full regression analysis of the ERCF can be exploited to better understand the phenomenon under scrutiny. All these issues will be discussed in turn; before doing so, a connection with the relevant previous literature is given.
Probably the most well-known estimator of the tail index is the Hill [26] estimator, which exploits the k upper order statistics through the formula

H(k) = (1/k) Σ_{i=1}^{k} [log X_{n−i+1:n} − log X_{n−k:n}],   (2)

where X_{i:n} is the i-th order statistic from a sample of size n and k = k(n) → ∞ in an appropriate way. The Hill estimator may suffer from high bias and is heavily dependent on the choice of k. It has been thoroughly studied [e.g., 32, 23, 22, 27, 40, 10]. Several generalizations of the Hill estimator and new approaches have appeared in the literature: for example, Csörgő et al [8] present a general class of kernel estimators in which the Hill estimator is obtained by taking a uniform kernel; other proposals include the family of j-moment ratio estimators [9], the smoothing Hill estimator [41], and the location invariant estimator [14]. Beirlant et al [2] propose the generalized Hill estimator; Brilhante et al [5] define a moment of order p estimator which reduces to the Hill estimator for p = 0; Beran et al [4] define a harmonic moment tail index estimator. Recently, Paulauskas and Vaičiulis [35] and Paulauskas and Vaičiulis [36] have connected some of the above approaches in an interesting way by defining parametric families of functions of the order statistics.
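For concreteness, the Hill estimator (2) can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper; the seeded Pareto sample and the choice k = 500 are arbitrary demonstration values. Recall that (2) estimates γ = 1/α, so the tail index is recovered by inversion.

```python
import numpy as np

def hill_estimator(x, k):
    """Hill estimator of the tail index alpha from the k upper order statistics.

    x: 1-d array of positive observations; k: number of upper order
    statistics used (1 <= k < len(x)). Returns alpha_hat = 1 / gamma_hat.
    """
    x = np.sort(np.asarray(x, dtype=float))  # ascending order
    n = len(x)
    # gamma_hat: mean log-excess over the (k+1)-th largest observation
    gamma_hat = np.mean(np.log(x[n - k:]) - np.log(x[n - k - 1]))
    return 1.0 / gamma_hat

# Exact Pareto(alpha = 1.5) sample, for which the Hill estimator is unbiased:
# numpy's pareto() draws Lomax variates, so adding 1 gives P(X > x) = x^(-1.5).
rng = np.random.default_rng(0)
sample = rng.pareto(1.5, 20000) + 1.0
print(hill_estimator(sample, k=500))
```

For heavier-tailed or non-Pareto data the estimate varies considerably with k, which is the sensitivity discussed above.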
Other notable contributions are those of Pickands III [37], who proposed a simple estimator based on a linear combination of log-spacings of order statistics, and the moment estimator of Dekkers et al [11]; estimators based on the growth rate of diverging statistics have been proposed by Meerschaert and Scheffler [34], Politis [39] and McElroy and Politis [33]. Kratz and Resnick [29] exploit properties of the QQ-plot, while a recent approach based on the asymptotic properties of the partition function, a moment statistic generally employed in the analysis of multifractality, has been introduced by Grahovac et al [21].
Reduced bias (RB) and optimized (for the choice of k) approaches are discussed in Caeiro et al [6], Gomes et al [18] and Gomes et al [20]; these approaches are particularly relevant for this paper where a RB and optimized estimator will be discussed. For further discussion and references one can consult Beirlant et al [3] and Gomes and Guillou [19].
Another reference relevant to our work is Welsh [43], which defines an estimator based on the ERCF evaluated at two points. This paper improves that approach in several ways: by defining a full regression analysis of the ERCF, by providing a bias reduction component, and by optimizing the choice of the points at which the ERCF is evaluated near the origin.
The paper is organized as follows: Section 2 presents the basic estimation strategy and analyzes its properties. Section 3 discusses an optimized and RB estimator and in Section 4 numerical comparisons with alternative RB and optimized estimators are presented. Section 5 gives some examples of applications to simulated and real data; an appendix contains the proofs of the results.

Estimation strategy
The analysis of the tails of F will be based on the tail sum

H(x) = F̄(x) + F(−x), x > 0.   (3)

The same behavior of H(x) will hold if F has a lighter left or right tail, e.g. F(−x) ∈ RV_{−β} with β > α, or exponentially decreasing or truncated. In all these cases, the estimator proposed will target α; the approach discussed will also allow each tail to be treated separately. Let φ denote the CF of X, i.e., for all real t,

φ(t) = E[e^{itX}] = U(t) + iV(t),   (4)

where U(t) = E[cos(tX)] and V(t) = E[sin(tX)] are the real and imaginary parts of the CF, respectively. Integrating (4) by parts, one has

1 − U(t) = t ∫_0^∞ sin(tx) H(x) dx,   (5)

that is, the RCF U(t) depends only on the tail sum H(x). Moreover, if H(x) ∈ RV_{−α}, α > 0, from Pitman [38] it holds that, as t → 0,

1 − U(t) ∼ C(α) t^α L(1/t),   (6)

where C(α) = Γ(1 − α) cos(πα/2) (interpreted by continuity at odd integers) is finite for any α which is not an even positive integer [38]. Remark 2.1. We exploit relationship (6) to develop an estimator for 0 < α ≤ 2. Note however that this restriction can be removed: if X has a distribution satisfying (1), then X² satisfies (1) with index α/2, and (6) can still be used to estimate the tail index for 0 < α ≤ 4. In principle, repeated transformations allow (6) to be used to estimate any α > 0, although in practice the case 0 < α ≤ 4 covers most situations of interest. Examples for these cases will be discussed below.
Using an empirical version of U(t), i.e. U_n(t) = n^{−1} Σ_{j=1}^{n} cos(tX_j), and evaluating U_n(t) at points t_1, t_2, . . . , t_m (to be defined later), we use the regression equation

y_j = log(1 − U_n(t_j)) = g(α, t_j) + α log t_j + ε_j, j = 1, . . . , m,   (7)

where ε_j is an error term and, from (6),

g(α, t) = log(C(α) L(1/t)).   (8)

By ordinary least squares (OLS), treating g(α, t) as a constant intercept, a simple estimator for α is

α̂ = S_{zy}/S_{zz},   (9)

where z_j = log t_j, S_{zz} = Σ_j (z_j − z̄)² and S_{zy} = Σ_j (z_j − z̄)(y_j − ȳ). Note that, given (6), one should discard values of α̂ greater than 2. However, we will leave the estimator unrestricted in order to discuss its properties. Application caveats will be discussed in Section 3 and Section 5. Also, α̂ treats g(α, t) as a constant even though it depends on t; this may actually be a source of non-negligible bias. A bias-reduced version of α̂ will be developed in Section 3.
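A minimal numerical sketch of the simple OLS estimator (9) follows (an illustration, not the paper's code); the grid t_j = j/√n anticipates the choice discussed in Section 2.1, and the seeded Cauchy sample (tail index α = 1) is an arbitrary demonstration case.

```python
import numpy as np

def ercf_slope_estimator(x, m):
    """OLS slope of log(1 - U_n(t_j)) on log(t_j): the simple estimator (9)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = np.arange(1, m + 1) / np.sqrt(n)       # one sensible grid near the origin
    U_n = np.array([np.cos(tj * x).mean() for tj in t])  # empirical RCF
    y = np.log(1.0 - U_n)
    z = np.log(t)
    zc = z - z.mean()
    # slope of y on z, with g(alpha, t) treated as a constant intercept
    return float(np.sum(zc * y) / np.sum(zc * zc))

rng = np.random.default_rng(1)
cauchy = rng.standard_cauchy(50000)            # tail index alpha = 1
print(ercf_slope_estimator(cauchy, m=50))
```

For the Cauchy law 1 − U(t) = 1 − e^{−t} ≈ t near the origin, so the fitted slope should sit close to 1, slightly below because the intercept g(α, t) is not truly constant.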
The estimator (9) for m = 2 corresponds to the ratio estimator discussed by Welsh [43]; as noted in the introduction, the approach discussed in this paper can be seen as an extension of that work. It will be seen that adding more points, and taking care of the bias introduced by the fictitious intercept g(α, t), leads to several advantages.

Remark 2.2.
Our method, in the context of i.i.d. data, is analogous to that used by Geweke and Porter-Hudak [16] in the estimation of long-memory time series models. In that paper, the relation between the spectral density function at the origin and the covariance function at infinity is exploited to estimate the long-memory parameter of the series. See also Robinson [42] and Hassler et al [25].

Properties of α̂
In order to understand the behavior of α̂ we need to analyze in more detail the error ε_j = log[1 + ε̃_j], where ε̃_j = (U(t_j) − U_n(t_j))/(1 − U(t_j)). Consider first the term Var(ε̃_j): since Var(U_n(t)) = (1 + U(2t) − 2U(t)²)/(2n), using (6), as t_j ↓ 0 one finds (see also Welsh [43]) that Var(ε̃_j) is of order (n t_j^α L(1/t_j))^{−1}. To make sure Var(ε_j) < ∞ for all j and 0 < α ≤ 2, n t_j^α needs to diverge as n → ∞ for all t_j's. A way to evaluate the ERCF at the origin is to choose t_j = j/n^η with j = 1, 2, ..., m, m = n^δ, for some 0 < δ < η < 1. To make n t_j^α → ∞ hold as n → ∞ for all 0 < α < 2, the condition 0 < δ < η ≤ 1/2 is necessary. Moreover, to have as many points as possible in the regression, t_j = j/√n with j = 1, 2, ..., m, m = n^δ and 0 < δ < 1/2 seems a sensible choice. Thus, a Taylor expansion ε_j = log[1 + ε̃_j] = ε̃_j − ε̃_j²/2 + . . . , for 0 < α < 2 and n t_j^α → ∞, leads to explicit expressions for the moments of ε_j (see also Welsh [43]); similar expansions hold for the covariances Cov(ε_j, ε_k). The above results can be exploited in order to calculate the bias and variance of α̂, which are summarized in the following theorem.
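The necessity of η ≤ 1/2 can be checked directly. With t_j = j/n^η, the smallest design point (j = 1) is the binding one:

```latex
n t_j^{\alpha} = n\left(\frac{j}{n^{\eta}}\right)^{\alpha}
             = j^{\alpha}\, n^{1-\eta\alpha} \;\ge\; n^{1-\eta\alpha},
```

and n^{1−ηα} → ∞ for every 0 < α < 2 exactly when ηα < 1 for all such α, i.e. when η ≤ 1/2; the choice η = 1/2 then packs as many points as possible near the origin.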
Theorem 2.3 gives the bias and variance of α̂; the statement involves finite constants A_α and C_α (details in the appendix). Note that from Lemma 7.3 in the Appendix, S_{zz} = m + O(log² m); since L(x) ∈ RV_0, then for some 0 < c < δ, lim_{n→∞} L(√n/j)/n^c = 0 for all j ∈ {1, 2, . . . , n^δ}. Hence, a rough evaluation of the bias and variance from Theorem 2.3 shows that α̂ provides a consistent estimator of α ∈ (0, 2] for any choice of m = n^δ and any slowly varying function L. A more precise statement can be made by placing further conditions on L(x). A standard assumption in the context of tail index estimation is (see Hall [23], Hall and Welsh [24]):

(A1) L(x) = C(1 + D x^{−β} + o(x^{−β})), as x → ∞,

where C > 0, β > 0 and D is a non-zero real number. With this, and restricting to the case 0 < α < 2, it is possible to obtain exact rates of convergence of α̂ and its asymptotic normality. Define θ = min(α + β, 2); Theorems 2.4 to 2.6 give results on α̂ under (A1).
Since θ > α and 0 < α < 2, the bias goes to zero as n → ∞ regardless of the value of δ. The small o term may dominate the bias for small δ values.
The exponent of n in the variance, α/2 − 1 − δα, is always negative; therefore the variance of α̂ converges to zero regardless of the value of δ in (0, 1/2). Combining Theorems 2.4 and 2.5 gives the mean squared error (MSE) as n → ∞, under (A1) with 0 < α < 2. Similarly, the MSE goes to zero as n → ∞ for all 0 < δ < 1/2. Note that in this case the small o term of the bias is always dominated by the variance term. With regard to the explicit size of the MSE, it is determined by the term that dominates on the right-hand side of (17). The difference between the two exponents is 2δθ − θ − δα + α/2 + 1. If this value is less than zero, i.e., δ < 1/2 − 1/(2θ − α), the second term dominates the MSE. More precisely, if θ = 2, then 2θ − α is always between 2 and 4 for all 0 < α < 2, and a solution of δ < 1/2 − 1/(2θ − α) exists within the constraint 0 < δ < 1/2. If θ = α + β, a combination of small α and β makes 1/2 − 1/(2θ − α) negative, which conflicts with δ > 0; in that case the first term dominates.
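The threshold for δ can be made explicit with a short derivation, under the reading (consistent with Theorems 2.4 and 2.5) that the squared bias is of order n^{(2δ−1)(θ−α)} and the variance of order n^{α/2−1−δα}:

```latex
(2\delta-1)(\theta-\alpha) - \left(\frac{\alpha}{2} - 1 - \delta\alpha\right)
 = 2\delta\theta - \theta - \delta\alpha + \frac{\alpha}{2} + 1 < 0
 \iff \delta(2\theta-\alpha) < \theta - \frac{\alpha}{2} - 1
 \iff \delta < \frac{1}{2} - \frac{1}{2\theta-\alpha}.
```

The last equivalence uses θ − α/2 − 1 = (2θ − α − 2)/2 and divides by 2θ − α > 0.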
Theorem 2.6 establishes the asymptotic normality of α̂: suitably normalized, α̂ − E(α̂) converges in distribution to a N(0, σ²) law, where σ² < ∞ is a real constant (given in the proof).
From this theorem and Theorem 2.4 we deduce the limiting distribution of α̂ centered at α as n → ∞; using the notation of (A1), the centering involves a bias term which is not specified unless the form of the remainder in (A1) is known. As we can see, the bias term can be quite relevant in the case of α̂. Typically, bias reduction tools are based on external estimation of the parameters of L(x) in assumption (A1); see, e.g., Gomes et al [18]. Given the regression context of α̂, we will instead resort to a bias reduction approach based on a semi-parametric component in the regression equation. This will allow consideration of a general L(x), going beyond the case considered by (A1).

Bias reduction and choice of m
To develop a bias-reducing approach, consider again equation (7): a non-parametric component is introduced in order to approximate the unknown slowly varying function in g(α, t) (recall the definition of g(α, t) in Section 2), leading to the semi-parametric regression equation

y_j = α log t_j + f(t_j) + ε_j, j = 1, . . . , m,   (19)

where f(t) = log L(1/t) (or log L_1(1/t)). The semi-parametric regression equation (19) can be solved by approximating f(t) with a smooth spline h(t), which minimizes, for a given λ ≥ 0, the penalized least squares criterion

Σ_{j=1}^{m} [y_j − α log t_j − h(t_j)]² + λ ∫ h''(t)² dt,   (20)

see, e.g., Wood [45]. Practical estimation of (19) can be carried out using standard software for generalized additive models. Below, an estimation procedure which uses R is described. The gam() function of the mgcv package, which can handle smooth terms and parametric ones as required in (19), is used. Estimation of (20) will be handled using the default approach defined for gam(), which is based on a thin plate regression spline, as discussed in Wood [44]. Practical selection of λ is typically carried out using generalized cross-validation (GCV) or restricted maximum likelihood (REML); see, e.g., Marra and Wood [31]. Although the errors of model (19) are not i.i.d., we found both criteria to work very well, in particular REML.
To complete the estimation procedure we need to set the value of m, i.e. the number of points at which the ERCF is evaluated near the origin. With this, we define a procedure (Algorithm 1) to obtain a RB estimator of α optimized, according to GCV or REML, with respect to the choice of m (i.e. δ):
1. Fix a grid of values of m (equivalently, of δ in (0, 1/2)).
2. For each value of m in the grid:
   a. Estimate model (19) with h(t) defined as in (20). In our simulations gam() was used.
   b. Select the λ that minimizes the GCV or REML criterion.
3. Among the models selected in Step 2, select the one with the lowest GCV or REML score.
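The paper's implementation relies on mgcv::gam() in R. As a rough, self-contained sketch of Algorithm 1 (an assumption-laden stand-in: a penalized truncated-power cubic spline replaces gam()'s thin plate spline, GCV is computed by hand, and the λ grid, knot placement and δ grid are illustrative choices, not the paper's):

```python
import numpy as np

def ercf_points(x, m):
    """Regression data: t_j = j/sqrt(n) and y_j = log(1 - U_n(t_j))."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = np.arange(1, m + 1) / np.sqrt(n)
    U_n = np.array([np.cos(tj * x).mean() for tj in t])
    return t, np.log(1.0 - U_n)

def fit_rb(x, m, lambdas=(1e-2, 1e-1, 1.0, 10.0, 1e2, 1e4), n_knots=4):
    """Fit y_j = alpha*log(t_j) + f(t_j) + error, approximating f by a
    penalized truncated-power cubic spline; lambda is chosen by GCV.
    Returns (alpha_hat, gcv) at the GCV-optimal penalty."""
    t, y = ercf_points(x, m)
    z = np.log(t)
    knots = np.quantile(t, np.linspace(0.2, 0.8, n_knots))
    S = np.clip(t[:, None] - knots[None, :], 0.0, None) ** 3
    S /= S.max(axis=0)                         # rescale spline columns
    X = np.column_stack([z, np.ones(m), t, S])
    P = np.zeros((X.shape[1], X.shape[1]))
    P[3:, 3:] = np.eye(n_knots)                # penalize spline coefficients only
    best = (np.nan, np.inf)
    for lam in lambdas:
        XtX_pen = X.T @ X + lam * P
        beta = np.linalg.solve(XtX_pen, X.T @ y)
        hat = X @ np.linalg.solve(XtX_pen, X.T)  # smoother ("hat") matrix
        rss = np.sum((y - X @ beta) ** 2)
        df = np.trace(hat)                       # effective degrees of freedom
        gcv = m * rss / (m - df) ** 2
        if gcv < best[1]:
            best = (beta[0], gcv)
    return best

def estimate_alpha(x, deltas=(0.30, 0.35, 0.40, 0.45)):
    """Algorithm 1 sketch: fit for each m = n**delta, keep the lowest-GCV fit."""
    n = len(x)
    fits = [fit_rb(x, max(int(n ** d), 10)) for d in deltas]
    return min(fits, key=lambda f: f[1])[0]

rng = np.random.default_rng(2)
cauchy = rng.standard_cauchy(50000)              # tail index alpha = 1
print(estimate_alpha(cauchy))
```

REML selection would require the mixed-model representation of the spline and is omitted here; in R the whole loop reduces to repeated calls to gam() with different m.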
As noted, the range of α for which the procedure can be applied can be extended by transformation of the data. If α̂ returns a value equal to 2, this may indicate that the true value of α is at or above that level. It is then possible to run the estimation procedure again on the squared data in order to see whether a higher value of α is appropriate. Examples of this situation will be presented in Section 4 and Section 5.
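A quick numerical illustration of the squaring device (not from the paper; the simple unrestricted OLS estimator (9) is restated here for self-containedness, and the sample and grid sizes are arbitrary): for an exact Pareto sample with α = 3, the raw-data estimate sits near the boundary value 2, while the estimate on the squared data targets α/2 and is doubled.

```python
import numpy as np

def ercf_alpha(x, m):
    """Unrestricted OLS estimator (9) on the grid t_j = j/sqrt(n)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = np.arange(1, m + 1) / np.sqrt(n)
    y = np.log(1.0 - np.array([np.cos(tj * x).mean() for tj in t]))
    z = np.log(t)
    zc = z - z.mean()
    return float(np.sum(zc * y) / np.sum(zc * zc))

rng = np.random.default_rng(3)
x = rng.pareto(3.0, 100000) + 1.0    # exact Pareto tail, alpha = 3
print(ercf_alpha(x, m=60))           # raw data: estimate near the boundary 2
print(2 * ercf_alpha(x ** 2, m=60))  # squared data target alpha/2, then doubled
```

Note that X² of a Pareto(3) variable is exactly Pareto with index 1.5, inside the (0, 2] range where (6) identifies the slope.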

Numerical comparisons
The estimation strategy proposed here can be cast in the class of optimized RB estimators; in our case, optimization refers to the choice of m, i.e. an appropriate value for 0 < δ < 1/2.
Numerical comparisons will then be carried out with respect to some RB competitors (Caeiro et al [6], Gomes et al [17]) based on the Hill [26], generalized Hill [2], moment [11] and moment of order p [20] estimators, all optimized with respect to the choice of k as discussed in Gomes et al [20].
For the above-mentioned estimators, second order conditions are typically imposed in order to have a non-degenerate behavior under a semi-parametric model. Furthermore, when dealing with bias reduction, a more restrictive class (the so-called third order condition) is imposed on the tail quantile function. This implies that, for example, Log-gamma, Log-Pareto and Pareto distributions are excluded, while EV and t-distributions are included. RB estimation of α is based on external estimation of (ρ, β); for further details see Gomes et al [18] and Gomes et al [20]. In our comparisons the following RB versions are used:
1) RB-Hill estimator [6], denoted by H in the tables,

H̄(k) = H(k) (1 − β̂ (n/k)^{ρ̂} / (1 − ρ̂)),   (23)

where H(k) is the Hill estimator given in (2).
2) RB-Moment estimator [11], denoted by MM in the tables, with the bias correction applied to the moment estimator M(k).
3) RB-Generalized Hill estimator, ḠH(k), denoted by GH in the tables, with the same bias correction as in (23) applied to the generalized Hill statistic based on UH_{j,n} = X_{n−j,n} H_{j,n}, 1 ≤ j ≤ k.
4) RB-MOP (moment of order p) estimator, for 0 < p < α (the case p = 0 reduces to the Hill estimator), based on the statistics U_{ik} = X_{n−i+1:n} / X_{n−k:n}, 1 ≤ i ≤ k < n; denoted by MOP(p) in the tables. In this case p is a tuning parameter which, in our simulations, will be set equal to 0.5 and 1. For an estimated optimal value of p based on a preliminary estimator of α, see Gomes et al [20].
Computations of the above estimators have been performed using the package evt0 in R. More precisely, ḠH(k) and M(k) are obtained using the function other.EVI with the options GH and MO, respectively. Estimates of the parameters (ρ, β) for the bias correction terms can be obtained from the function mop. RB-Hill and RB-MOP estimates are directly obtained by the function mop by appropriately specifying a value of p and the option RB-MOP. In order to optimize the choice of k we used the formula of [20], in which ⌊x⌋ denotes the integer part of x and ϕ(ρ) = 1 − (ρ + √(ρ² − 4ρ + 2))/2. For the comparisons, pseudo-random samples from a range of distributions with known tail index are used; the random number generator, for a chosen distribution and a chosen n, has been initialized using set.seed(1).
The acronyms for the alternative estimators have been defined at the beginning of the section; GCV and REML indicate the estimators defined in Algorithm 1. The tables report: a) values not in parentheses, all rows: the empirical relative bias, calculated as (Ê(α̂) − α)/α, with Ê(α̂) indicating the mean of the 1000 Monte Carlo estimates; b) values in parentheses: the ratio MSE_H/MSE_E, where the suffix E stands for the estimator considered in turn and H stands for RB-Hill. A value of the ratio greater than one indicates smaller efficiency of the RB-Hill with respect to estimator E.
In the simulations, for a better comparison of the results with the other estimators, we have used the value of α̂ provided by the estimation procedure. Overall, considering a larger pool of simulation results, the GCV and REML estimators compare well in general with the other estimators. They are clearly well suited for cases of α quite close to 2, where they can be much more efficient than other estimation strategies. The extremely low bias of REML and GCV suggests that an overall gain could probably be achieved by acting on the bias-variance trade-off. This is difficult to tackle in automatic simulations; however, as will be shown in the next section, a careful regression analysis of the ERCF at the origin can greatly help in determining the optimal value of m to use.

Simulated Pareto, Fréchet and Normal distributions
This example considers simulated data from the Pareto(1, 0.1), Fréchet(1) and N(0, 1) distributions. In all cases 1000 pseudo-random values were generated. To estimate α we choose δ = 0.45, which gives m = 22, and apply the estimation procedure described in Algorithm 1 using either the REML or the GCV criterion in order to select the optimal m and the amount of smoothing. Considering the Pareto and Fréchet cases first, a plot of log(t_j) against log(1 − U_n(t_j)) for selected values of t_j is shown in Figure 1. The circle points (red in color) are those used in estimation, i.e. t_j = j/√n, j = 1, . . . , m, while the crosses (blue in color) are additional points used to depict the behavior of log(1 − U_n(t)) over a fuller range. Consider now the case of the Normal distribution. Figure 2 reports the values of log(t) and log(1 − U_n(t)), where the latter has been calculated, first, using the generated data (left graph) and, second, using the squared normal data (right graph). In both cases one expects the estimates to be close to 2, as the distributions have finite variance.
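The Fréchet part of this exercise is easy to reproduce. The following sketch (illustrative, with an arbitrary seed) generates n = 1000 Fréchet(1) values by inversion, builds the m = 22 regression points, and checks that the log-log relation straightens with slope near α = 1.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
u = rng.random(n)
frechet = -1.0 / np.log(u)          # inversion of F(x) = exp(-1/x), alpha = 1

m = int(n ** 0.45)                   # delta = 0.45 gives m = 22
t = np.arange(1, m + 1) / np.sqrt(n)
U_n = np.array([np.cos(tj * frechet).mean() for tj in t])

log_t, log_y = np.log(t), np.log(1.0 - U_n)
slope = np.polyfit(log_t, log_y, 1)[0]   # raw OLS slope, should be near alpha
print(m, slope)
```

With only 1000 observations the points scatter visibly around the line, which is exactly the situation where the penalized fit and the GCV/REML choice of m earn their keep.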

City sizes
There is an ongoing debate on the specific regularity in the distribution of city sizes: Zipf's law (a heavy-tailed distribution with α = 1) or a log-normal distribution. Early studies show that Zipf's law generally holds well for data on the largest cities in a country (see Gabaix and Ioannides [15] and references therein). However, recent work by Eeckhout [12] extends the analysis to the entire range of city sizes and finds that the size of all populated places in the US follows a log-normal distribution. Levy [30], using graphical analysis, argues that significant deviations from a log-normal distribution exist in the top range of the largest cities. Several potential problems with Levy's analysis are pointed out by Eeckhout [13].
The data set used by Eeckhout [12] and Levy [30], containing the entire distribution with populations ranging from 1 to several millions, is analyzed here. The "US Census 2000 places" dataset is based on 25,358 units of "city, town, borough, or village". To estimate α we choose δ = 0.4, which gives m = 57. From the left graph of Figure 3 it appears that log(1 − U_n(t)), when plotted against log(t_j) for t_j = j/√n, j = 1, . . . , 57, straightens only in very close proximity to 0. In this situation it seems sensible to analyze the problem further by getting closer to 0. We select some points manually by setting t_j = 7.97e−05 + j · 0.0003, j = 0, . . . , 20; the plot of log(1 − U_n(t)) for these points, plus some additional ones (the blue crosses), is in the right graph of Figure 3; a more regular behavior of the function can now be observed.
Applying the estimation procedure with the points t_j = 7.97e−05 + j · 0.0003, j = 0, . . . , 20 (so that the full grid here has m = 21 points), we obtain α̂ = 0.818 for m = 6 if GCV is used, and α̂ = 0.752 for m = 21 if REML is used. Estimates using the GCV criterion range from a minimum of 0.686 when m = 21 to a maximum of 0.839 when m = 7, while in the case of REML estimates range from a minimum of 0.752 when m = 21 to a maximum of 0.836 when m = 7.
As this example shows, in practical applications a full regression analysis of the ERCF close to the origin may reveal more details and give more insight into the phenomenon. In this case it appears more sensible to select a value of m that is not too large, given the red part of the graph on the right of Figure 3; GCV, which suggests m = 6, seems to be the method to use here. Also, the result supports the presence of a Pareto tail: in the case of a log-normal distribution one would expect an estimate of α close to the value of 2.

Conclusions
For any distribution the rate of decay in the tails is related to the RCF at the origin. This fact is exploited to introduce a novel methodology for estimating the tail index of a distribution based on semi-parametric regression on the ERCF at the origin. The approach is conceptually similar to regression of the periodogram on frequency ordinates, which is used in time series to estimate the long memory parameter.
Consistency of a simple regression estimator is discussed under minimal assumptions, while rates of convergence and asymptotic normality are studied under the assumption that the slowly varying function L(x) in (1) has a finite limit as x → ∞. It turns out that a bias term may be severe for certain parameter values.
A semi-parametric regression equation with a smooth spline component approximating L is implemented in order to account for bias; this keeps the assumptions on the unknown distribution underlying the data to a minimum. Alternatively, under assumption (A1), one could consider external estimation of the parameters of L (e.g. Gomes et al [18]); this approach is not pursued here.
Simulation results indicate that α̂ has lower MSE than other estimators in a large variety of cases, and its bias is generally quite small. Practical implementation is supported by GCV or REML, and in the examples the estimated values do not show high variability with respect to the choice of m. In real applications, a full regression analysis of the ERCF at the origin can help in determining the optimal estimation strategy. An application to the debated city-sizes distribution gives a clear answer in favor of a Pareto-type tail behavior.
The restriction 0 < α ≤ 2 might not be too severe since, in practical applications, one is often interested in detecting low values of the tail index. Moreover, by transformations, the range of values that can be estimated can always be extended.
Since the proposed method works extremely well for values of α close to 2, it could be interesting to use it as a starting point for developing a testing procedure for the hypothesis of finite variance in the data.

Proofs
For the proofs, some preliminary lemmas are needed. In the following, for lighter notation, set a_j = z_j − z̄.
For the proof of Theorem 2.3, recall that λ_{ij} = t_j/t_i = j/i, with t_i > t_j. Since |λ_{ij}| < 1, one has absolute convergence of the binomial expansion for 0 < α < 2 and α ≠ 1; the case α = 1 is treated separately. Proof of Theorem 2.3. Consider first the bias term, given by (32). As n → ∞, the first term on the right-hand side of (32) is evaluated by Lemma 7.1 and, for the second term on the right-hand side of (32), by Lemma 7.1 again, as m → ∞ and n → ∞, (33) holds. With (33), and setting m = n^δ, the stated expression for the bias is obtained. Consider now the variance of α̂, given by (35). Ignoring higher order terms, we obtain (37) and analogous expressions for the remaining covariance terms. For α not a non-negative integer, as r → ∞, the terms of the expansion can be bounded, and it can be proven that

A_α = 2/(2 − α)³ + 1/(2 − α) − 2/(2 − α)²  and  Σ_{r=1}^{∞} binom(α, 2r)/(2r − α + 1) → C_α.

Therefore, the order of the variance of α̂ is n^{α/2−1} m^{−α} / L(√n/m) for all 0 < α < 2; in the case α ≥ 2, a direct evaluation of (35) applies. The proofs of Theorems 2.4 and 2.5 are similar to that of Theorem 2.3, use Assumption (A1), and can be found in Jia [28].
For the second term on the right-hand side of (42), see (34). We change the order of the summations in the first term on the right-hand side of (42); the first term can then be rewritten as a normalized sum of variables W_{i,m}. Note that the W_{i,m}'s, i = 1, 2, . . . , n, for a given m, are i.i.d. random variables with null mean.
As far as Var(W_{i,m}) is concerned, it can be derived by noting that, for t_k > t_j and m = n^δ, the relevant limit as n → ∞ equals lim_{n→∞} n Cov(ε_k, ε_j).