Tail index estimation, concentration and adaptivity

This paper presents an adaptive version of the Hill estimator based on Lepski's model selection method. This simple data-driven index selection method is shown to satisfy an oracle inequality and to achieve the lower bound recently derived by Carpentier and Kim. In order to establish the oracle inequality, we derive non-asymptotic variance bounds and concentration inequalities for Hill estimators. These concentration inequalities are derived from Talagrand's concentration inequality for smooth functions of independent exponentially distributed random variables, combined with three tools of Extreme Value Theory: the quantile transform, Karamata's representation of slowly varying functions, and Rényi's characterisation of the order statistics of exponential samples. The performance of this computationally and conceptually simple method is illustrated using Monte-Carlo simulations.


Introduction
The basic questions faced by Extreme Value Analysis consist in estimating the probability of exceeding a threshold that is larger than the sample maximum and estimating a quantile of an order that is larger than 1 minus the reciprocal of the sample size, that is, making inferences on regions that lie outside the support of the empirical distribution. In order to address these challenges in a sensible framework, Extreme Value Theory (EVT) assumes that the sampling distribution $F$ satisfies a regularity condition. Indeed, in heavy-tail analysis, the tail function $\overline{F} = 1 - F$ is supposed to be regularly varying, that is, $\lim_{\tau \to \infty} \overline{F}(\tau x)/\overline{F}(\tau)$ exists for all $x > 0$. This amounts to assuming the existence of some $\gamma > 0$ such that the limit is $x^{-1/\gamma}$ for all $x$. In other words, if we define the excess distribution above the threshold $\tau$ by its survival function $x \mapsto \overline{F}_\tau(x) = \overline{F}(x)/\overline{F}(\tau)$ for $x \geq \tau$, then $\overline{F}$ is regularly varying if and only if $F_\tau$ converges weakly towards a Pareto distribution. The sampling distribution $F$ is then said to belong to the max-domain of attraction of a Fréchet distribution with index $\gamma > 0$ (abbreviated as $F \in \mathrm{MDA}(\gamma)$) and $\gamma$ is called the extreme value index.
The main impediment to the large exceedance and large quantile estimation problems alluded to above turns out to be the estimation of the extreme value index.
Since the inception of Extreme Value Analysis, many estimators have been defined, analysed and implemented into software. Hill [1975] introduced a simple, yet remarkable, collection of estimators: for $k < n$,
$$\widehat{\gamma}(k) = \frac{1}{k} \sum_{i=1}^{k} \ln \frac{X_{(i)}}{X_{(k+1)}}\,,$$
where $X_{(1)} \geq \ldots \geq X_{(n)}$ are the order statistics of the sample $X_1, \ldots, X_n$ (the non-increasing rearrangement of the sample).
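For concreteness, here is a minimal R sketch of this family of estimators (R is the language used for the simulations of Section 5; the function name `hill` is ours), computing the whole collection of Hill estimates from a sample with positive values:

```r
# All Hill estimates gamma_hat(k), k = 1, ..., n-1, from a sample x > 0.
hill <- function(x) {
  xs <- sort(x, decreasing = TRUE)   # order statistics X_(1) >= ... >= X_(n)
  n  <- length(xs)
  k  <- seq_len(n - 1)
  # gamma_hat(k) = (1/k) * sum_{i=1}^{k} log(X_(i) / X_(k+1))
  cumsum(log(xs))[k] / k - log(xs[k + 1])
}
```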
An integer sequence $(k_n)_n$ is said to be intermediate if $\lim_{n\to\infty} k_n = \infty$ while $\lim_{n\to\infty} k_n/n = 0$. It is well known that $F$ belongs to $\mathrm{MDA}(\gamma)$ for some $\gamma > 0$ if and only if, for all intermediate sequences $(k_n)_n$, $\widehat{\gamma}(k_n)$ converges in probability towards $\gamma$ [de Haan and Ferreira, 2006, Mason, 1982]. Under mildly stronger conditions, it can be shown that $\sqrt{k_n}\,(\widehat{\gamma}(k_n) - \gamma)$ is asymptotically Gaussian with variance $\gamma^2$. This suggests that, in order to minimise the quadratic risk $\mathbb{E}[(\widehat{\gamma}(k_n) - \gamma)^2]$ or the absolute risk $\mathbb{E}\,|\widehat{\gamma}(k_n) - \gamma|$, an appropriate choice of $k_n$ has to be made. If $k_n$ is too large, the Hill estimator $\widehat{\gamma}(k_n)$ suffers a large bias and, if $k_n$ is too small, $\widehat{\gamma}(k_n)$ suffers erratic fluctuations. As all estimators of the extreme value index face this dilemma [See Beirlant et al., 2004, de Haan and Ferreira, 2006, Resnick, 2007, and references therein], a variety of data-driven selection methods for $k_n$ has been proposed in the literature during the last three decades (See Hall and Weissman [1997], Hall and Welsh [1985], Danielsson et al. [2001], Draisma et al. [1999], Drees and Kaufmann [1998], Drees et al. [2000], Grama and Spokoiny [2008], Carpentier and Kim [2014a] to name a few). A related but distinct problem is considered by Carpentier and Kim [2014b]: constructing uniform and adaptive confidence intervals for the extreme value index.
The rationale for investigating adaptive Hill estimation stems from the computational simplicity and the variance optimality of properly chosen Hill estimators [Beirlant et al., 2006].
In this paper, we combine Talagrand's concentration inequality for smooth functions of independent exponentially distributed random variables (Theorem 2.15) with three traditional tools of EVT: the quantile transform, Karamata's representation for slowly varying functions, and Rényi's characterisation of the joint distribution of order statistics of exponential samples. This allows us to establish concentration inequalities for the Hill process $(\sqrt{k}\,(\widehat{\gamma}(k) - \mathbb{E}\,\widehat{\gamma}(k)))_k$ (Theorem 3.3, Propositions 3.9, 3.10 and Corollary 3.13) in Section 3.2. Then, in Section 3.3, we build on these concentration inequalities to analyse the performance of a variant of Lepski's rule defined in Sections 2.3 and 3.3: Theorem 3.14 describes an oracle inequality and Corollary 3.18 assesses the performance of this simple selection rule under a power-law bound on the bias (for some $\rho < 0$ and constants $C, C' > 0$). It reveals that the performance of Hill estimators selected by Lepski's method matches known lower bounds. Proofs are given in Section 4. Finally, in Section 5, we examine the performance of the resulting adaptive Hill estimator for finite sample sizes using Monte-Carlo simulations.
Background, notation and tools

The Hill estimator as a smooth tail statistic

Even though it is possible and natural to characterise the fact that a distribution function $F$ belongs to the max-domain of attraction of a Fréchet distribution with index $\gamma > 0$ by the regular variation property of $\overline{F}$ (namely $\lim_{\tau\to\infty} \overline{F}(\tau x)/\overline{F}(\tau) = x^{-1/\gamma}$), we will repeatedly use an equivalent characterisation based on the regular variation property of the associated quantile function. We first recall some classical definitions.
If $f$ is a non-decreasing function from $(a, b)$ (where $a$ and $b$ may be infinite) to $(c, d)$, its generalised inverse $f^{\leftarrow} \colon (c,d) \to (a,b)$ is defined by $f^{\leftarrow}(y) = \inf\{x : a < x < b,\, f(x) \geq y\}$. The quantile function $F^{\leftarrow}$ is the generalised inverse of the distribution function $F$. The tail quantile function of $F$ is the non-decreasing function defined on $(1, \infty)$ by $U = (1/(1-F))^{\leftarrow}$, or equivalently $U(t) = F^{\leftarrow}(1 - 1/t)$. The quantile function plays a prominent role in stochastic analysis thanks to the fact that if $Z$ is uniformly distributed over $[0,1]$, $F^{\leftarrow}(Z)$ is distributed according to $F$. In this text, we use a variation of the quantile transform that fits EVT: if $E$ is exponentially distributed, then $U(\exp(E))$ is distributed according to $F$. Moreover, by the same argument, the order statistics $X_{(1)} \geq \ldots \geq X_{(n)}$ are distributed as a monotone transformation of the order statistics $Y_{(1)} \geq \ldots \geq Y_{(n)}$ of a sample of $n$ independent standard exponential random variables:
$$(X_{(1)}, \ldots, X_{(n)}) \overset{d}{=} \bigl(U(\mathrm{e}^{Y_{(1)}}), \ldots, U(\mathrm{e}^{Y_{(n)}})\bigr)\,.$$
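As an illustration of the quantile transform, the following sketch (our own; the Fréchet tail quantile function $U(t) = (-\ln(1 - 1/t))^{-\gamma}$ is obtained by direct inversion) checks that $U(\exp(E))$ and direct quantile inversion produce the same distribution:

```r
# Sampling a Frechet(gamma) distribution in two equivalent ways.
set.seed(42)
gam <- 0.5; n <- 1e5
U  <- function(t) (-log(1 - 1/t))^(-gam)  # tail quantile function of Frechet(gamma)
x1 <- U(exp(rexp(n)))                     # quantile transform of exponentials
x2 <- (-log(runif(n)))^(-gam)             # direct inversion: F^{-1}(p) = (-log p)^(-gam)
ks.test(x1, x2)$p.value                   # two-sample test: should not reject
```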
The quantile transform and Rényi's representation are complemented by Karamata's representation for slowly varying functions. Recall that a function $L$ is slowly varying at infinity if, for all $x > 0$, $\lim_{t\to\infty} L(tx)/L(t) = x^0 = 1$. The von Mises condition specifies the form of Karamata's representation [See Resnick, 2007, Corollary 2.1] of the slowly varying component of $U$, that is, of $t \mapsto t^{-\gamma} U(t)$.

Definition 2.1 (von Mises condition). A distribution function $F$ belonging to $\mathrm{MDA}(\gamma)$, $\gamma > 0$, satisfies the von Mises condition if there exist a constant $t_0 \geq 1$, a constant $c > 0$ and a measurable function $\eta$ such that, for $t \geq t_0$,
$$U(t) = c\, t^{\gamma} \exp\left(\int_{t_0}^{t} \frac{\eta(s)}{s}\, \mathrm{d}s\right) \qquad (2.1)$$
with $\lim_{s\to\infty} \eta(s) = 0$. The function $\eta$ is called the von Mises function.

In the sequel, we assume that the sampling distribution $F \in \mathrm{MDA}(\gamma)$, $\gamma > 0$, satisfies the von Mises condition with $t_0 = 1$ and von Mises function $\eta$, and we define the non-increasing function $\overline{\eta}(t) = \sup_{s \geq t} |\eta(s)|$. Combining the quantile transform, Rényi's and Karamata's representations, it is straightforward that, under the von Mises condition, the sequence of Hill estimators satisfies a distributional identity: it is distributed as a function of the largest order statistics of a standard exponential sample. The next proposition follows easily from the definition of the Hill estimator as a weighted sum of log-spacings, as advocated in [Beirlant et al., 2004].
Proposition 2.2. The vector of Hill estimators $(\widehat{\gamma}(k))_{k<n}$ is distributed as the random vector
$$\left( \frac{1}{k} \sum_{i=1}^{k} \ln \frac{U\bigl(\mathrm{e}^{Y_{(i)}}\bigr)}{U\bigl(\mathrm{e}^{Y_{(k+1)}}\bigr)} \right)_{k<n} \qquad (2.3)$$
where $(E_1, \ldots, E_n)$ are independent standard exponential random variables while, for $i \leq n$, $Y_{(i)} = \sum_{j=i}^{n} E_j/j$ is distributed like the $i$th order statistic of an $n$-sample of the standard exponential distribution.
For a fixed $k < n$, a second distributional representation is available:
$$\widehat{\gamma}(k) \overset{d}{=} \frac{1}{k} \sum_{i=1}^{k} \Bigl( \ln U\bigl(\mathrm{e}^{Y_{(k+1)} + E_i}\bigr) - \ln U\bigl(\mathrm{e}^{Y_{(k+1)}}\bigr) \Bigr) \qquad (2.4)$$
where $(E_1, \ldots, E_k)$ and $Y_{(k+1)}$ are defined as in Proposition 2.2. This second, simpler, distributional representation stresses the fact that, conditionally on $Y_{(k+1)}$, $\widehat{\gamma}(k)$ is distributed as a mixture of sums of independent identically distributed random variables. Moreover, these independent random variables are close to exponential random variables with scale $\gamma$. This distributional identity suggests that the variance of $\widehat{\gamma}(k)$ scales like $\gamma^2/k$, an intuition that is corroborated by analysis (see Section 3.1).
The bias of $\widehat{\gamma}(k)$ is connected with the von Mises function $\eta$ by the next formula. Henceforth, let $b$ be defined for $t > 1$ by
$$b(t) = \int_0^{\infty} \mathrm{e}^{-v}\, \eta\bigl(t \mathrm{e}^{v}\bigr)\, \mathrm{d}v = t \int_t^{\infty} \frac{\eta(s)}{s^2}\, \mathrm{d}s\,.$$
The quantity $b(t)$ is the bias of the Hill estimator $\widehat{\gamma}(k)$ given $\overline{F}(X_{(k+1)}) = 1/t$. The second expression for $b$ shows that $b$ is differentiable with respect to $t$ (even though $\eta$ might be nowhere differentiable), and that $t\, b'(t) = b(t) - \eta(t)$. The von Mises function governs both the rate of convergence of $U(tx)/U(t)$ towards $x^{\gamma}$, or equivalently of $\overline{F}(tx)/\overline{F}(t)$ towards $x^{-1/\gamma}$, and the rate of convergence of $|\mathbb{E}\,\widehat{\gamma}(k) - \gamma|$ towards 0.
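A short derivation of the first expression for $b$ (a sketch, using Representation (2.4), Karamata's representation and Fubini's theorem): conditionally on $Y_{(k+1)} = y$, each log-spacing is distributed as $\ln U(\mathrm{e}^{y+E}) - \ln U(\mathrm{e}^{y}) = \gamma E + \int_0^{E} \eta(\mathrm{e}^{y+v})\,\mathrm{d}v$ with $E$ standard exponential, so that
$$\mathbb{E}\bigl[\widehat{\gamma}(k) \mid Y_{(k+1)} = y\bigr] - \gamma = \int_0^\infty \mathrm{e}^{-u} \int_0^{u} \eta\bigl(\mathrm{e}^{y+v}\bigr)\,\mathrm{d}v\,\mathrm{d}u = \int_0^\infty \mathrm{e}^{-v}\,\eta\bigl(\mathrm{e}^{y+v}\bigr)\,\mathrm{d}v = b\bigl(\mathrm{e}^{y}\bigr)\,,$$
where the middle equality is Fubini's theorem ($\int_v^\infty \mathrm{e}^{-u}\,\mathrm{d}u = \mathrm{e}^{-v}$).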
In the sequel, we work on the probability space where the independent standard exponential random variables E i , 1 ≤ i ≤ n are all defined, and therefore consider the Hill estimators defined by Representation (2.3).

Frameworks
The difficulty in extreme value index estimation stems from the fact that, for any collection of estimators, for any intermediate sequence $(k_n)_n$ and for any $\gamma > 0$, there is a distribution function $F \in \mathrm{MDA}(\gamma)$ such that the bias $|\mathbb{E}\,\widehat{\gamma}(k_n) - \gamma|$ decays at an arbitrarily slow rate. This has led authors to put conditions on the rate of convergence of $U(tx)/U(t)$ towards $x^{\gamma}$ as $t$ tends to infinity for $x > 0$, or equivalently on the rate of convergence of $\overline{F}(tx)/\overline{F}(t)$ towards $x^{-1/\gamma}$. These conditions then have to be translated into conditions on the rate of decay of the bias of estimators. As we focus on Hill estimators, the connection between the rate of convergence of $U(tx)/U(t)$ towards $x^{\gamma}$ and the rate of decay of the bias is transparent and well understood [Segers, 2002]: the theory of O-regular variation provides an adequate setting for describing both rates of convergence [Bingham et al., 1987]. In words, if a positive function $g$ defined over $[1,\infty)$ is such that, for all $\Lambda > 1$, $\limsup_{t} \sup_{x\in[1,\Lambda]} g(tx)/g(t) < \infty$, then $g$ is said to have bounded increase. If $g$ has bounded increase, the class $O\Pi_g$ is the class of measurable functions $f$ on some interval $[a, \infty)$, $a > 0$, such that $f(tx) - f(t) = O(g(t))$ as $t \to \infty$, for all $x \geq 1$.
For example, the analysis carried out by Carpentier and Kim [2014a] rests on a second-order-type condition (2.6) on $\overline{F}$, involving constants $C > 0$, $D \neq 0$ and $\rho < 0$. This condition implies that $\ln(t^{-\gamma} U(t)) \in O\Pi_g$ with $g(t) = t^{\rho}$ [Segers, 2002, p. 473]. Thus, under the von Mises condition, Condition (2.6) implies that the function $\int_t^{\infty} (\eta(s)/s)\, \mathrm{d}s$ belongs to $O\Pi_g$ with $g(t) = t^{\rho}$, and the Abelian and Tauberian Theorems from [Segers, 2002] relate this condition to the rate of decay of the bias. In this text, we are ready to assume that if $F \in \mathrm{MDA}(\gamma)$, then for some $C > 0$ and $\rho < 0$, $|b(t)| \leq C\, t^{\rho}$. However, we do not want to assume that $U$ (or equivalently $F$) satisfies a so-called second-order regular variation property ($t \mapsto |x^{-\gamma} U(tx)/U(t) - 1|$ is asymptotically equivalent to a $\rho$-regularly varying function with $\rho < 0$). By [Segers, 2002], this would be equivalent to assuming that $t \mapsto |b(t)|$ is $\rho$-regularly varying. Indeed, assuming as in [Hall and Welsh, 1985] and several subsequent papers that $F$ satisfies
$$\overline{F}(x) = C x^{-1/\gamma}\bigl(1 + D x^{\rho/\gamma} + o(x^{\rho/\gamma})\bigr) \quad \text{as } x \to \infty \qquad (2.7)$$
where $C > 0$, $D \neq 0$ are constants and $\rho < 0$, or equivalently [Csörgő, Deheuvels and Mason, 1985, Drees and Kaufmann, 1998] that $U$ satisfies a corresponding expansion, makes the problem of extreme value index estimation easier (but not easy). These conditions entail that, for any intermediate sequence $(k_n)$, the bias of $\widehat{\gamma}(k_n)$ is asymptotically equivalent to a power of $k_n/n$ [Beirlant et al., 2004, de Haan and Ferreira, 2006, Segers, 2002]; this makes the estimation of the second-order parameter a very natural intermediate objective [See for example Drees and Kaufmann, 1998].

Lepski's method and adaptive tail index estimation
The necessity of developing data-driven index selection methods is illustrated in Figure 1, which displays the estimated standardised root mean squared error (rmse) of Hill estimators as a function of $k$ for four related sampling distributions which all satisfy the second-order condition (2.7). Under this second-order condition (2.7), Hall and Welsh proved that the asymptotic mean squared error of the Hill estimator is minimal for intermediate sequences $(k_n^*)_n$ satisfying $k_n^* \sim c(\rho, C, D)\, n^{2|\rho|/(1+2|\rho|)}$.
Since $C > 0$, $D \neq 0$ and the second-order parameter $\rho < 0$ are usually unknown, many authors have been interested in the construction of data-driven selection procedures for $k_n$ under conditions such as (2.7), and a great deal of ingenuity has been dedicated to the estimation of the second-order parameters and to the use of such estimates when estimating first-order parameters. As we do not want to assume a second-order condition such as (2.7), we resort to Lepski's method, which is a general attempt to balance bias and variance.
Since its introduction [Lepski, 1991], this general method for model selection has been proved to achieve adaptivity and to provide oracle inequalities in a variety of inferential contexts ranging from density estimation to inverse problems and classification [Lepski, 1990, 1991, 1992, Lepski and Tsybakov, 2000]. Very readable introductions to Lepski's method and its connections with penalised contrast methods can be found in [Birgé, 2001, Mathé, 2006]. In Extreme Value Theory, we are aware of three papers that explicitly rely on this methodology: [Drees and Kaufmann, 1998], [Grama and Spokoiny, 2008] and [Carpentier and Kim, 2014a].
The selection rule analysed in the present paper (see Section 3.3 for a precise definition) is a variant of the preliminary selection rule introduced in [Drees and Kaufmann, 1998],
$$\widehat{\kappa}_n(r_n) = \min\Bigl\{ k \in \{2, \ldots, n\} \,:\, \max_{2 \leq i \leq k} \sqrt{i}\, \bigl|\widehat{\gamma}(i) - \widehat{\gamma}(k)\bigr| > r_n \Bigr\} \qquad (2.8)$$
where $(r_n)_n$ is a sequence of thresholds such that $\sqrt{\ln\ln n} = o(r_n)$ and $r_n = o(\sqrt{n})$, and $\widehat{\gamma}(i)$ is the Hill estimator computed from the $(i+1)$ largest order statistics. The definition of this "stopping time" is motivated by Lemma 1 from [Drees and Kaufmann, 1998], which asserts that, under the von Mises condition, $\max_{2 \leq i \leq k_n} \sqrt{i}\,|\widehat{\gamma}(i) - \mathbb{E}\,\widehat{\gamma}(i)|$ is almost surely of order $\gamma \sqrt{2 \ln\ln n}$. In words, this selection rule almost picks out the largest index $k$ such that, for all $i$ smaller than $k$, $\widehat{\gamma}(k)$ differs from $\widehat{\gamma}(i)$ by a quantity that is not much larger than the typical fluctuations of $\widehat{\gamma}(i)$. This index selection rule can be performed graphically by interpreting an alternative Hill plot, as shown in Figure 2 [See Drees et al., 2000, Resnick, 2007, for a discussion of the merits of alt-Hill plots].
Under mild conditions on the sampling distribution, $\widehat{\kappa}_n(r_n)$ should be asymptotically equivalent to a deterministic sequence $(\overline{\kappa}_n(r_n))_n$.

Fig 2. Lepski's method illustrated on an alt-Hill plot. The plain line describes the sequence of Hill estimates as a function of the index $k$ (on a logarithmic scale), computed on a pseudo-random sample of size $n = 10000$ from the Student distribution with 1 degree of freedom (Cauchy distribution). Hill estimators are computed from the positive order statistics. The grey ribbon around the plain line provides a graphic illustration of Lepski's method. For a given value of $i$, the width of the ribbon is $2 r_n \widehat{\gamma}(i)/\sqrt{i}$. A point $(k, \widehat{\gamma}(k))$ on the plain line corresponds to an eligible index if the horizontal segment between this point and the vertical axis lies inside the ribbon, that is, if $|\widehat{\gamma}(k) - \widehat{\gamma}(i)| \leq r_n \widehat{\gamma}(i)/\sqrt{i}$ for all $i \leq k$. If $r_n$ were replaced by an appropriate quantile of the Gaussian distribution, the grey ribbon would just represent the confidence tube that is usually added on Hill plots. The triangle represents the selected index with $r_n = \sqrt{2.1 \ln\ln n}$. The cross represents the oracle index estimated from Monte-Carlo simulations, see Table 2.
The intuition behind the definition of $\widehat{\kappa}_n(r_n)$ is the following: if the bias is increasing with the index $i$, and if the bias suffered by $\widehat{\gamma}(k)$ is smaller than the typical fluctuations of $\widehat{\gamma}(k)$, then the index $k$ should be eligible, that is, it should pass all the pairwise tests with high probability.
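The following R sketch implements the graphical eligibility test of Figure 2 (our own illustration, reusing the `hill` function sketched in the Introduction; the rule of Section 3.3 additionally restricts the index range to $\{\lceil c_3 \ln n\rceil, \ldots\}$ and uses the theoretical threshold $r_n(\delta)$):

```r
# Lepski-type selection: keep the largest k such that, for all i <= k,
# |gamma_hat(k) - gamma_hat(i)| <= r_n * gamma_hat(i) / sqrt(i).
select_index <- function(x, c = 2.1) {
  g   <- hill(x)
  r_n <- sqrt(c * log(log(length(x))))
  ks  <- 2:length(g)
  ok  <- vapply(ks, function(k) {
    i <- 2:k
    all(abs(g[k] - g[i]) <= r_n * g[i] / sqrt(i))
  }, logical(1))
  k_hat <- max(ks[ok])                 # largest eligible index
  list(k = k_hat, gamma = g[k_hat])
}
# Example: a Cauchy sample has tail index gamma = 1.
select_index(abs(rcauchy(10000)))
```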
The goal of Drees and Kaufmann [1998] was not to investigate the performance of the preliminary selection rule defined in Display (2.8), but to design a selection rule $\widetilde{\kappa}_n(r_n)$, based on $\widehat{\kappa}_n(r_n)$, that would, under second-order conditions, asymptotically mimic the optimal selection rule $k_n^*$. Some of their intermediate results shed light on the behaviour of $\widehat{\kappa}_n(r_n)$ for a wide variety of choices of $r_n$; as they are relevant to our work, we briefly review them. Drees and Kaufmann [1998] characterise the asymptotic behaviours of $\widehat{\kappa}_n(r_n)$ and $\overline{\kappa}_n(r_n)$ when $(r_n)$ grows sufficiently fast, that is, when $\sqrt{\ln\ln n} = o(r_n)$: both are then of order $r_n^{2/(1+2|\rho|)}\, n^{2|\rho|/(1+2|\rho|)}$, up to a constant $c_\rho$ depending on $\rho$. This suggests that using $\widehat{\kappa}_n(r_n)$ instead of $k_n^*$ has a price of the order $r_n^{2/(1+2|\rho|)}$. Not too surprisingly, Corollary 1 from [Drees and Kaufmann, 1998] implies that the preliminary selection rule tends to favour smaller variance over reduced bias.
Our goal, as in [Carpentier and Kim, 2014a, Grama and Spokoiny, 2008], is to derive non-asymptotic risk bounds; we briefly review their approaches. Both papers consider sequences of estimators $\widehat{\gamma}(1), \ldots, \widehat{\gamma}(k), \ldots$ defined by thresholds $\tau_1 \leq \ldots \leq \tau_k \leq \ldots$. For each $i$, the estimator $\widehat{\gamma}(i)$ is computed from the sample points that exceed $\tau_i$, if there are any. In [Carpentier and Kim, 2014a], for example, an index $i$ is deemed eligible if, for all $j \leq i$, $|\widehat{\gamma}(i) - \widehat{\gamma}(j)|$ is smaller than a random quantity that is supposed to bound the typical fluctuations of $\widehat{\gamma}(j)$; the selected index $\widehat{k}$ is the largest eligible index. In both papers, the rationale for working with some special collection of estimators seems to be the ability to derive non-asymptotic deviation inequalities for $\widehat{\gamma}(k)$, either from exponential inequalities for log-likelihood ratio statistics or from simple binomial tail inequalities such as Bernstein's inequality [See Boucheron et al., 2013, Section 2.8].
We aim at achieving optimal risk bounds under Condition (2.6), using a simple estimation method that requires almost no calibration effort and is based on mainstream extreme value index estimators. Before describing the keystone of our approach in Section 2.5, we recall the recent lower risk bound for adaptive extreme value index estimation.

Lower bound
One of the key results in [Carpentier and Kim, 2014a] is a lower bound on the accuracy of adaptive tail index estimation. This lower bound reveals that, just as for estimating a density at a point [Lepski, 1991, 1992], as far as tail index estimation is concerned, adaptivity has a price. Using Fano's Lemma and a Bayesian game that extends cleanly to the frameworks of [Grama and Spokoiny, 2008] and [Novak, 2014], Carpentier and Kim were able to prove the next minimax lower bound.
Theorem 2.9. Let $\rho_0 < -1$ and $0 \leq v \leq e/(1+2e)$. Then, for any tail index estimator $\widehat{\gamma}$ and any sample size $n$ such that $M = \ln n > e/v$, there exists a probability distribution $P$ such that i) $P \in \mathrm{MDA}(\gamma)$ with $\gamma > 0$; ii) $P$ meets the von Mises condition with von Mises function $\eta$ satisfying $\eta(t) \leq \gamma t^{\rho}$ for some $\rho_0 \leq \rho < 0$; iii) the risk $\mathbb{E}_P\,|\widehat{\gamma} - \gamma|$ is at least of order $\gamma \bigl((v \ln\ln n)/n\bigr)^{|\rho|/(1+2|\rho|)}$.

Using Birgé's Lemma instead of Fano's Lemma, we provide a simpler, shorter proof of this theorem (Appendix D).
The lower rate of convergence provided by Theorem 2.9 is another incentive to revisit the preliminary tail index estimator from [Drees and Kaufmann, 1998]. But, instead of using a sequence $(r_n)_n$ of order larger than $\sqrt{\ln\ln n}$ in order to calibrate pairwise tests and ultimately to design estimators of the second-order parameter (if there are any), it is worth investigating a minimal sequence, where $r_n$ is of order $\sqrt{\ln\ln n}$, and checking whether the corresponding adaptive estimator achieves the Carpentier-Kim lower bound (Theorem 2.9).
In this paper, we focus on $r_n$ of the order $\sqrt{\ln\ln n}$. The rationale for this choice can be understood as follows: if $\limsup r_n/(\gamma\sqrt{2\ln\ln n}) < 1$, then even if the sampling distribution is a pure Pareto distribution with shape parameter $\gamma$ ($\overline{F}(x) = (x/\tau)^{-1/\gamma}$ for $x \geq \tau > 0$), the preliminary selection rule will, with high probability, select a small value of $k$ and thus pick out a suboptimal estimator. This can be justified using results from [Darling and Erdős, 1956] (see Appendix A for details).
Such an endeavour requires sharp probabilistic tools.They are the topic of the next section.

Talagrand's concentration phenomenon for products of exponential distributions
Before introducing Talagrand's Theorem, which will be the key tool of our investigation, we comment on and motivate the use of concentration arguments in Extreme Value Theory. Talagrand's concentration phenomenon for products of exponential distributions is one instance of a general phenomenon: concentration of measure in product spaces [Ledoux, 2001, Ledoux and Talagrand, 1991].
The phenomenon may be summarised in a simple quote: functions of independent random variables that do not depend too much on any of them are almost constant [Talagrand, 1996a]. This quote raises a first question: in which way are tail functionals (as used in Extreme Value Theory) smooth functions of independent random variables? We do not attempt here to revisit the asymptotic approach described by [Drees, 1998b], which equates smoothness with Hadamard differentiability. Our approach is non-asymptotic and our conception of smoothness somewhat circular: smooth functionals are those functionals for which we can obtain good concentration inequalities.
The concentration approach helps to split the investigation into two steps: the first step consists in bounding the fluctuations of the random variable under concern around its median or its expectation, while the second step focuses on the expectation. This approach has greatly simplified the investigation of suprema of empirical processes and made the life of many statisticians easier [Koltchinskii, 2008, Massart, 2007, Talagrand, 1996b, 2005]. To the best of our knowledge, the impact of the concentration of measure phenomenon on Extreme Value Theory has received little attention. Pointing out the potential uses of concentration inequalities in the field of Extreme Value Theory is one purpose of this paper. In statistics, concentration inequalities have proved very useful when dealing with adaptivity issues: sharp, non-asymptotic tail bounds can be combined with simple union bounds in order to obtain uniform guarantees on the risk of a collection of estimators. Using concentration inequalities to investigate the adaptive choice of the number of order statistics used in tail index estimation is thus a natural endeavour.
Deriving authentic concentration inequalities for Hill estimators is not straightforward. Fortunately, the construction of such inequalities turns out to be possible thanks to general functional inequalities that hold for functions of independent exponentially distributed random variables. We recall these inequalities (Proposition 2.10 and Theorem 2.15), which have been largely overlooked in statistics; a thorough and readable presentation can be found in [Ledoux, 2001]. We start with the easiest result, a variance bound that pertains to the family of Poincaré inequalities.
Proposition 2.10 (Poincaré inequality for exponentials, [Bobkov and Ledoux, 1997]). If $f$ is a differentiable function over $\mathbb{R}^n$ and $Z = f(E_1, \ldots, E_n)$, where $E_1, \ldots, E_n$ are independent standard exponential random variables, then
$$\operatorname{Var}(Z) \leq 4\, \mathbb{E}\bigl[\|\nabla f(E_1, \ldots, E_n)\|^2\bigr]\,.$$

Remark 2.11. The constant 4 cannot be improved.

The next corollary is stated in order to point out the relevance of this Poincaré inequality to the analysis of general order statistics and their functionals. Recall that the hazard rate of an absolutely continuous probability distribution with distribution function $F$ is $h = f/\overline{F}$, where $f$ and $\overline{F} = 1 - F$ are the density and the survival function associated with $F$, respectively.
Corollary 2.12. Assume the distribution of $X$ has a positive density; then the $k$th order statistic $X_{(k)}$ satisfies
$$\operatorname{Var}\bigl(X_{(k)}\bigr) \leq \frac{C}{k}\, \mathbb{E}\left[\frac{1}{h\bigl(X_{(k)}\bigr)^2}\right],$$
where $C$ can be chosen as 4.
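For the standard exponential distribution the hazard rate is identically 1, so the corollary asserts $\operatorname{Var}(X_{(k)}) \leq 4/k$; Rényi's representation makes this easy to verify exactly (a small sanity check of our own):

```r
# For n standard exponentials, Var(X_(k)) = sum_{j=k}^{n} 1/j^2 (Renyi).
n <- 100
k <- 1:n
v <- rev(cumsum(rev(1 / k^2)))   # v[k] = sum_{j >= k} 1/j^2
all(v <= 4 / k)                  # TRUE: the bound of Corollary 2.12 holds
```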
Remark 2.13. By Smirnov's Lemma [de Haan and Ferreira, 2006], $C$ cannot be smaller than 1. If the distribution of $X$ has a non-decreasing hazard rate, the factor 4 can be improved into a factor 2 [Boucheron and Thomas, 2012].

Bobkov and Ledoux [1997], Maurey [1991] and Talagrand [1991] show that smooth functions of independent exponential random variables satisfy Bernstein-type concentration inequalities. The next result is extracted from the derivation of Talagrand's concentration phenomenon for products of exponential random variables in [Bobkov and Ledoux, 1997].
The definition of sub-gamma random variables will be used in the formulation of the theorem and in many arguments.

Definition 2.14. A real-valued centred random variable $X$ is said to be sub-gamma on the right tail with variance factor $v$ and scale parameter $c$ if
$$\ln \mathbb{E}\, \mathrm{e}^{\lambda X} \leq \frac{\lambda^2 v}{2(1 - c\lambda)} \quad \text{for every } \lambda \text{ such that } 0 < \lambda < 1/c\,.$$
We denote the collection of such random variables by $\Gamma_+(v, c)$. Similarly, $X$ is said to be sub-gamma on the left tail with variance factor $v$ and scale parameter $c$ if $-X$ is sub-gamma on the right tail with variance factor $v$ and scale parameter $c$. We denote the collection of such random variables by $\Gamma_-(v, c)$.
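By Chernoff bounding, membership in $\Gamma_+(v, c)$ translates into the familiar Bernstein-type tail bound [Boucheron et al., 2013]: for every $t > 0$,
$$\mathbb{P}\Bigl\{ X > \sqrt{2 v t} + c\, t \Bigr\} \leq \mathrm{e}^{-t}\,,$$
and symmetrically on the left tail for $X \in \Gamma_-(v, c)$.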
Theorem 2.15 ([Bobkov and Ledoux, 1997]). Let $Z = f(E_1, \ldots, E_n)$, where $f$ is differentiable and $E_1, \ldots, E_n$ are independent standard exponential random variables. Let $v$ be the essential supremum of $\|\nabla f\|^2$; then $Z$ is sub-gamma on both tails with variance factor $4v$ and scale factor $\max_i \|\partial_i f\|_\infty$.

Again, we illustrate the relevance of these versatile tools by applying them to the analysis of general order statistics. This general theorem implies that, if the sampling distribution has a non-decreasing hazard rate, then the order statistics $X_{(k)}$ satisfy Bernstein-type inequalities [see Boucheron et al., 2013, Section 2.8] with variance factor $\frac{4}{k}\,\mathbb{E}\bigl[1/h(X_{(k)})^2\bigr]$ (the Poincaré estimate of the variance) and scale parameter $(\sup_x 1/h(x))/k$. Starting back from the Efron-Stein-Steele inequality, the authors derived a somewhat sharper inequality [Boucheron and Thomas, 2012].
Corollary 2.16. Assume the distribution function $F$ has a non-decreasing hazard rate $h$, and let $Z = X_{(k)}$ be distributed as the $k$th order statistic of a sample distributed according to $F$.
Then $Z$ is sub-gamma on both tails with variance factor $\frac{2}{k}\,\mathbb{E}\bigl[1/h(X_{(k)})^2\bigr]$ and scale factor $(\sup_x 1/h(x))/k$. This corollary describes the way in which central, intermediate and extreme order statistics can be portrayed as smooth functions of independent exponential random variables. This possibility should not be taken for granted, as it is non-trivial to capture in a non-asymptotic way the tail behaviour of maxima of independent Gaussians [Boucheron and Thomas, 2012, Chatterjee, 2014, Ledoux, 2001]. In the next section, we show in which way the Hill estimator can fit into this picture.

Main results
In this section, the sampling distribution $F$ is assumed to belong to $\mathrm{MDA}(\gamma)$ with $\gamma > 0$ and to satisfy the von Mises condition (Definition 2.1) with von Mises function $\eta$ [See Beirlant et al., 2004, de Haan and Ferreira, 2006, Geluk et al., 1997, Resnick, 2007].

Variance bounds

It is well known that, under the von Mises condition, the variance of the Hill estimator scales like $\gamma^2/k$. In this subsection, we will use Representation (2.4): Proposition 3.1 provides us with a handy upper bound on $\operatorname{Var}[\widehat{\gamma}(k)] - \gamma^2/k$ in terms of the von Mises function.
Proposition 3.1. Let $\widehat{\gamma}(k)$ be the Hill estimator computed from the $(k+1)$ largest order statistics of an $n$-sample from $F$. Then $\operatorname{Var}[\widehat{\gamma}(k)] - \gamma^2/k$ is upper bounded in terms of the von Mises function $\eta$ and $k$.

The next Abelian result might help in appreciating these variance bounds.
Proposition 3.2. Assume that $\eta$ is $\rho$-regularly varying with $\rho < 0$. Then, for any intermediate sequence $(k_n)_n$, the bounds of Proposition 3.1 capture the first-order behaviour of $k_n \operatorname{Var}[\widehat{\gamma}(k_n)]$.

We may now move to genuine concentration inequalities for the Hill estimator.

Concentration inequalities for the Hill estimators
The exponential representation (2.3) suggests that the Hill estimator $\widehat{\gamma}(k)$ should be approximately distributed according to a gamma$(k, \gamma)$ distribution, where $k$ is the shape parameter and $\gamma$ the scale parameter. We expect the Hill estimators to satisfy Bernstein-type concentration inequalities, that is, to be sub-gamma on both tails with variance factors connected to the tail index $\gamma$ and to the von Mises function. Representation (2.3) actually suggests more. Following [Drees and Kaufmann, 1998], we expect the sequence $\sqrt{k}\,(\widehat{\gamma}(k) - \mathbb{E}\,\widehat{\gamma}(k))$ to behave like normalised partial sums of independent square-integrable random variables, that is, we expect $\max_{2 \leq k \leq n} \sqrt{k}\,(\widehat{\gamma}(k) - \mathbb{E}\,\widehat{\gamma}(k))$ to scale like $\sqrt{\ln\ln n}$ and to be sub-gamma on both tails. The purpose of this section is to meet these expectations in a non-asymptotic way.
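For a pure Pareto distribution the gamma approximation is exact: $k\,\widehat{\gamma}(k)/\gamma$ then follows a gamma distribution with shape parameter $k$ and scale parameter 1 (see Remark 3.5 below). A quick Monte-Carlo check (our own sketch):

```r
# For pure Pareto samples, k * gamma_hat(k) / gamma is gamma(k, 1)-distributed.
set.seed(1)
gam <- 0.5; n <- 1000; k <- 50
stat <- replicate(5000, {
  xs <- sort(runif(n)^(-gam), decreasing = TRUE)  # Pareto sample by inversion
  k * mean(log(xs[1:k] / xs[k + 1])) / gam
})
round(quantile(stat, c(.1, .5, .9)), 1)     # empirical quantiles
round(qgamma(c(.1, .5, .9), shape = k), 1)  # gamma(k, 1) quantiles: close match
```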
Proofs use the Markov property of order statistics: conditionally on the $(J+1)$th order statistic, the largest $J$ order statistics are distributed as the order statistics of a sample of size $J$ from the excess distribution. They consist of appropriate invocations of Talagrand's concentration inequality (Theorem 2.15). However, this theorem generally requires a uniform bound on the gradient of the relevant function. When Hill estimators are analysed as functions of independent exponential random variables, the partial derivatives depend on the points at which the von Mises function is evaluated. In order to get interesting bounds, it is worth conditioning on an intermediate order statistic.
Throughout this subsection, let $\ell$ be an integer larger than $\ln \log_2 n$ and $J$ an integer not larger than $n$. As we use the exponential representation of order statistics, besides Hill estimators, the random variables that appear in the main statements are order statistics of exponential samples; $Y_{(k)}$ will denote the $k$th order statistic of a standard exponential sample of size $n$ (we agree on $Y_{(n+1)} = 0$). Theorem 3.3, Propositions 3.9 and 3.10 complement each other in the following way. Theorem 3.3 is concerned with the supremum of the (centred) Hill process; note the use of random centering. The components of this process are shown to be sub-gamma using Talagrand's inequality, and chaining is then used to control the maximum of the process. Propositions 3.9 and 3.10 are concerned with conditional bias fluctuations: they state that the fluctuations of the conditional expectations $\mathbb{E}[\widehat{\gamma}(i) \mid Y_{(k+1)}]$ are small and even negligible with respect to the fluctuations of $\widehat{\gamma}(i)$.
The first theorem provides an exponential refinement of the variance bound stated in Proposition 3.1. However, as announced, there is a price to pay: the statements hold conditionally on some order statistic.
In the sequel, let $T = \exp(Y_{(J+1)})$. Theorem 3.3 consists of three statements; its first statement, i), controls the conditional fluctuations of $\widehat{\gamma}(k)$ with constants $c_1$ and $c_1'$, where $c_1$ may be chosen not larger than 4 and $c_1'$ not larger than 16.
ii) The random variable $Z$ is sub-gamma on both tails with variance factor $4(\gamma + 2\overline{\eta}(T))^2$ and scale factor $(\gamma + 2\overline{\eta}(T))/\ell$, and its conditional expectation $\mathbb{E}[Z \mid T]$ is bounded by a multiple of $(\gamma + 2\overline{\eta}(T))\sqrt{\ln\ln n}$.

Remark 3.5. If $F$ is a pure Pareto distribution with shape parameter $\gamma > 0$, then $k\,\widehat{\gamma}(k)/\gamma$ is distributed according to a gamma distribution with shape parameter $k$ and scale parameter 1, for which tight and well-known tail bounds are available.

Remark 3.6. If we choose $J = n$, all three statements hold unconditionally, but the variance factor may substantially exceed the upper bounds described in Proposition 3.1. The second and third statements in Theorem 3.3 provide a non-asymptotic counterpart to Lemma 1 from [Drees and Kaufmann, 1998]: the expectation is of the announced order, while the random variable inside the expectation is sub-gamma.

Remark 3.7. Thanks to the Markov property, Statement i) can be read unconditionally for $0 < \delta < 1/2$; combining Statements ii) and iii) yields a corresponding bound for the maximum of the Hill process.

Remark 3.8. The reader may wonder whether resorting to the exponential representation and the usual Chernoff bounding would not provide a simpler argument.
The straightforward approach leads to a conditional bound on the logarithmic moment generating function given $Y_{(k+1)}$.
A similar statement holds for the lower tail. This leads to exponential bounds for the deviation of the Hill estimator above its conditional expectation, that is, to a control of the deviations of the Hill estimator above its expectation plus a term that may be of the order of magnitude of the bias.
Attempts to directly control the moment generating functions for $1 \leq i \leq k$ and to exhibit an exponential supermartingale met the same impediments.
At the expense of inflating the variance factor, Theorem 2.15 provides a genuine (conditional) concentration inequality for Hill estimators. As we will deal with values of $k$ for which the bias exceeds the typical order of magnitude of the fluctuations, this is relevant to our purpose.
The next propositions are concerned with the fluctuations of the conditional bias of Hill estimators. In both propositions, $J$ satisfies $\ell \leq k \leq J \leq n$ and, again, $T = \exp(Y_{(J+1)})$.
Proposition 3.9. For all $1 \leq i \leq k$, conditionally on $T$, the centred version of $\mathbb{E}[\widehat{\gamma}(i) \mid Y_{(k+1)}]$ is sub-gamma on both tails with variance factor at most $16\,\overline{\eta}(T)^2/k$ and scale factor $2\,\overline{\eta}(T)/k$.
The last proposition deals with the maximum of the centred conditional biases. The collection of centred conditional biases does not behave like partial sums.
Proposition 3.10. Let the random variable $Z$ be defined as the maximum of the centred conditional biases. Then: i) conditionally on $T$, $Z$ is sub-gamma with variance factor $16\,\overline{\eta}(T)^2/k$ and scale factor $2\,\overline{\eta}(T)/k$; ii) the expectation of $Z$ is controlled in terms of $\overline{\eta}(T)$ and $k$.

Remark 3.11. Statements i) and ii) in Proposition 3.10 can be summarised by a single tail inequality (3.12), valid for any $0 < \delta < 1/2$. Combining Theorem 3.3 with Propositions 3.9 and 3.10 leads to another non-asymptotic perspective on Lemma 1 from [Drees and Kaufmann, 1998].

Adaptive Hill estimation
We are now able to investigate the variant of the selection rule (2.8) from [Drees and Kaufmann, 1998], with $r_n = c_2 \sqrt{\ln\ln n}$, where $c_2$ is a constant not smaller than $\sqrt{2}$. The deterministic sequence of indices $(\overline{k}_n(r_n))_n$ is a deterministic counterpart of the data-driven rule below, and the sequence $(\overline{k}_n(1))_n$ is defined in the same way with $r_n$ replaced by 1. Let $0 < \delta < 1/2$. The index $\widehat{k}_n$ is selected according to the following rule:
$$\widehat{k}_n = \max\Bigl\{ k \,:\, \text{for all } i \in \{\lceil c_3 \ln n\rceil, \ldots, k\},\ |\widehat{\gamma}(k) - \widehat{\gamma}(i)| \leq r_n(\delta)\, \widehat{\gamma}(i)/\sqrt{i} \Bigr\}\,,$$
where $c_3$ is a constant larger than 60 and $r_n(\delta) = 8\sqrt{2 \ln((2/\delta)\log_2 n)}$. The tail index estimator is $\widehat{\gamma}(\widehat{k}_n)$. As tail adaptivity has a price (see Theorem 2.9), the ratio between the risk of the would-be adaptive estimator $\widehat{\gamma}(\widehat{k}_n)$ and the risk of $\widehat{\gamma}(\overline{k}_n(1))$ cannot be upper bounded by a constant factor, let alone by a factor close to 1. This is why, in the next theorem, we compare the risk of $\widehat{\gamma}(\widehat{k}_n)$ with the risk of $\widehat{\gamma}(\overline{k}_n)$.
In the sequel, $\overline{k}_n$ stands for $\overline{k}_n(r_n)$; if the context is not clear, we specify $\overline{k}_n(1)$ or $\overline{k}_n(r_n)$. The next theorem describes a non-asymptotic risk bound for $\widehat{\gamma}(\widehat{k}_n)$.
Theorem 3.14. Assume that $n$ is large enough for the conditions of Section 3.3 to hold. Then, with probability larger than $1 - 4\delta$, the selected index $\widehat{k}_n$ is not smaller than $\overline{k}_n$ and the loss of $\widehat{\gamma}(\widehat{k}_n)$ exceeds that of $\widehat{\gamma}(\overline{k}_n)$ by at most a factor of order $r_n(\delta)$.

Remark 3.16. For $0 < \delta < 1/2$, a corresponding statement holds in expectation.

Remark 3.17. If we assume that the bias $b$ is $\rho$-regularly varying then, elaborating on Proposition 1 from [Drees and Kaufmann, 1998], the oracle index sequence $(k_n^*)_n$ and the sequence $(\overline{k}_n(1))_n$ are connected, and so are the corresponding quadratic risks. Thus, if the bias is $\rho$-regularly varying, Theorem 3.14 provides us with a connection between the performance of the simple selection rule and the performance of the (asymptotically) optimal choice. The next corollary upper bounds the risk of the preliminary estimator when we only have an upper bound on the bias.
Corollary 3.18. Assume that, for some $C > 0$ and $\rho < 0$, the bias satisfies $|b(n/k)| \leq C (k/n)^{|\rho|}$ for all $n$ and $k$. Then there exists a constant $\kappa_{\delta,\rho}$, depending on $\delta$ and $\rho$, such that, with probability larger than $1 - 4\delta$, $|\widehat{\gamma}(\widehat{k}_n) - \gamma|$ is at most $\kappa_{\delta,\rho}\, \bigl((\ln\ln n)/n\bigr)^{|\rho|/(1+2|\rho|)}$. Under the assumption that the bias of the Hill estimators is upper bounded by a power function, the performance of the data-driven estimator $\widehat{\gamma}(\widehat{k}_n)$ thus meets the information-theoretic lower bound of Theorem 2.9.
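To make the matching with Theorem 2.9 explicit (a back-of-the-envelope comparison under the power-law bias bound): the oracle choice balances a bias term $C(k/n)^{|\rho|}$ against fluctuations of order $1/\sqrt{k}$, whereas the adaptive estimator additionally pays the threshold $r_n(\delta) \asymp \sqrt{\ln\ln n}$, so that
$$\text{oracle rate} \asymp n^{-|\rho|/(1+2|\rho|)}\,, \qquad \text{adaptive rate} \asymp \Bigl(\tfrac{\ln\ln n}{n}\Bigr)^{|\rho|/(1+2|\rho|)}\,.$$
The ratio $(\ln\ln n)^{|\rho|/(1+2|\rho|)}$ is exactly the price of adaptivity; for $|\rho| = 1$ this is the $(\ln\ln n)^{1/3}$ factor observed in the simulations of Section 5.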

Proof of Proposition 2.2
This proposition is a straightforward consequence of Rényi's representation of order statistics of standard exponential samples.
As $F$ belongs to $\mathrm{MDA}(\gamma)$ and meets the von Mises condition, there exists a function $\eta$ on $(1, \infty)$ with $\lim_{x\to\infty} \eta(x) = 0$ such that Karamata's representation of Definition 2.1 holds; the announced identity then follows by combining this representation with the quantile transform and Rényi's representation.

Proof of Proposition 3.1
Let $Z = k\,\widehat{\gamma}(k)$. By the Pythagorean relation, $\operatorname{Var}(Z)$ decomposes into the expectation of the conditional variance of $Z$ given $Y_{(k+1)}$ and the variance of the conditional expectation $\mathbb{E}[Z \mid Y_{(k+1)}]$. Representation (2.4) asserts that, conditionally on $Y_{(k+1)}$, $Z$ is distributed as a sum of independent, exponentially distributed random variables. Let $E$ be an exponentially distributed random variable; the conditional variance is bounded using the Cauchy-Schwarz inequality and $\operatorname{Var}\bigl(\int_0^{E} \eta(\mathrm{e}^{y+u})\,\mathrm{d}u\bigr) \leq \overline{\eta}(\mathrm{e}^{y})^2$. Taking expectation with respect to $Y_{(k+1)}$ leads to the first bound. The last term in the Pythagorean decomposition is also handled using elementary arguments.
As $Y_{(k+1)}$ is a function of independent exponential random variables, $\operatorname{Var}\bigl[\mathbb{E}[Z \mid Y_{(k+1)}]\bigr]$ may be upper bounded using the Poincaré inequality (Proposition 2.10). In order to derive the lower bound, we start from the same decomposition.

Proof of Theorem 3.3
In the proofs of Theorem 3.3 and Proposition 3.10, we will use the next maximal inequality.

Proposition 4.1 (maximal inequality for sub-gamma random variables, [Boucheron et al., 2013]). If $X_1, \ldots, X_N$ all belong to $\Gamma_+(v, c)$, then
$$\mathbb{E}\Bigl[\max_{i \leq N} X_i\Bigr] \leq \sqrt{2 v \ln N} + c \ln N\,.$$

The proofs follow a common pattern. In order to check that some random variable is sub-gamma, we rely on its representation as a function of independent exponential variables: we compute partial derivatives, derive convenient upper bounds on the squared Euclidean norm and the supremum norm of the gradient, and then invoke Theorem 2.15.
At some point, we will use the next corollary of Theorem 2.15.
Corollary 4.2. If $f$ is an almost everywhere differentiable function on $\mathbb{R}$ with uniformly bounded derivative $f'$, then $f(Y_{(k+1)})$ is sub-gamma with variance factor $4\|f'\|_\infty^2/k$ and scale factor $\|f'\|_\infty/k$.

Proof of Theorem 3.3. We start from the exponential representation of Hill estimators (Proposition 2.2) and represent all $\widehat{\gamma}(i)$ as functions of the independent random variables $E_1, \ldots, E_k, Y_{(k+1)}$, where the $E_j$, $1 \leq j \leq k$, are standard exponentially distributed and $Y_{(k+1)}$ is distributed like the $(k+1)$th largest order statistic of an $n$-sample of the standard exponential distribution. Let $i'$ be such that $0 \leq i' < i$, and agree on $\widehat{\gamma}(0) = 0$. A few lines of computation then bound the partial derivatives of $i\,\widehat{\gamma}(i) - i'\,\widehat{\gamma}(i')$ with respect to $E_j$ for $i' < j \leq k$. Recalling that $T = \exp(Y_{(J+1)})$, Theorem 2.15 now allows us to establish that, conditionally on $T$, the centred version of $i\,\widehat{\gamma}(i) - i'\,\widehat{\gamma}(i')$ is sub-gamma on both tails with variance factor $4|i - i'|(\gamma + 2\overline{\eta}(T))^2$ and scale factor $\gamma + \overline{\eta}(T)$. Using Theorem 2.15 conditionally on $T$, choosing $i' = 0$ and taking expectation on both sides implies the first statement. The proof of the upper bound on $\mathbb{E}[Z \mid T]$ in Statement ii) of Theorem 3.3 relies on standard chaining techniques from the theory of empirical processes and uses repeatedly the concentration inequality of Theorem 2.15 for smooth functions of independent exponential random variables, together with the maximal inequality for sub-gamma random variables (Proposition 4.1).
Recall the definition of $Z$. As is commonplace in the analysis of normalised empirical processes [See Giné and Koltchinskii, 2006, Massart, 2007, van de Geer, 2000, and references therein], we peel the index set over which the maximum is computed.
Let $L_n = \{\lceil \log_2 \ell \rceil, \ldots, \lfloor \log_2 k \rfloor\}$. For all $j \in L_n$, let $S_j = \{\ell \vee 2^j, \ldots, k \wedge (2^{j+1} - 1)\}$ and define $Z_j$ as the maximum of the process over $S_j$, so that $Z = \max_{j \in L_n} Z_j$. We now derive upper bounds on both summands by resorting to the maximal inequality for sub-gamma random variables (Proposition 4.1). In order to alleviate notation, we work with the binary expansion of $i$: for $h \in \{0, \ldots, j\}$, let $\pi_h(i)$ denote the truncation of the binary expansion of $i$ to its $h$ leading bits. Using that $W(\pi_0(i))$ does not depend on $i$, we decompose $W(i) - W(\pi_0(i))$ along the chain $\pi_0(i), \pi_1(i), \ldots, \pi_j(i) = i$. For each $h \in \{0, \ldots, j-1\}$, the maximum is taken over $2^h$ random variables which are sub-gamma with variance factor $4 \times 2^{j-h-1}(\gamma + 2\overline{\eta}(T))^2$ and scale factor $\gamma + \overline{\eta}(T)$; Proposition 4.1 then provides bounds valid for all $j \in L_n$. In order to prove Statement iii), we check that, for each $j \in L_n$, $Z_j$ is sub-gamma on the right tail with variance factor at most $4(\gamma + 2\overline{\eta}(T))^2$ and scale factor not larger than $(\gamma + 2\overline{\eta}(T))/\ell$. Under the von Mises Condition (2.1), the sampling distribution is absolutely continuous with respect to the Lebesgue measure; for almost every sample, the maximum defining $Z_j$ is attained at a single index $i \in S_j$. Starting again from the exponential representation and repeating the computation of partial derivatives, we obtain the desired bounds.
By Proposition 4.1, each $Z_j$ is controlled; combining the different bounds leads to Inequality (3.4).
Proof of Proposition 3.9

Adopting again the exponential representation, write $\mathbb{E}[\widehat{\gamma}(i) \mid Y_{(k+1)}] = f(Y_{(k+1)})$. The derivative of $f$ with respect to $y$ is readily computed and, after an integration by parts and the handling of a telescoping sum, a conservative upper bound on $|f'(y)|$ is $2\,\overline{\eta}(\mathrm{e}^{y})$, which is in turn upper bounded by $2\,\overline{\eta}(T)$. The statement of the proposition then follows from Corollary 4.2. A byproduct of the proof is a variance bound for the conditional bias, valid for $i \leq k$.

Proof of Proposition 3.10
In this proof, $\Delta_i$ denotes the spacing $Y_{(i)} - Y_{(k+1)}$, $\mathbb{E}_{\Delta_i}$ the expectation with respect to $\Delta_i$, $Y'_{(k+1)}$ an independent copy of $Y_{(k+1)}$, and $\mathbb{E}'$ the expectation with respect to $Y'_{(k+1)}$. We will also use the next lemma.
Lemma 4.3. Let $X$ be a non-negative random variable. Recalling the definition of $b$, the expectation of $Z$ is thus upper bounded by a sum that we control term by term.
Finally, thanks to Inequality (4.4), we obtain a bound in which the constant $C$ can be chosen not larger than 3.
We check that, under $E_1 \cap E_2 \cap E_3$, the selected index is not smaller than $\overline{k}_n$. This amounts to checking that, for all $k \leq \overline{k}_n - 1$ and for all $i \in \{\lceil c_3 \ln n\rceil, \ldots, k\}$, the pairwise test $|\widehat{\gamma}(k) - \widehat{\gamma}(i)| \leq r_n(\delta)\,\widehat{\gamma}(i)/\sqrt{i}$ is passed. Meanwhile, for all $k \leq \overline{k}_n - 1$ and for all $i \in \{\lceil c_3 \ln n\rceil, \ldots, k\}$, the deviation $|\widehat{\gamma}(k) - \widehat{\gamma}(i)|$ can be decomposed into bias and fluctuation terms.
Using again (4.5), under $E_1$, the fluctuation terms are controlled. Now, under $E_2$, thanks to Assumption i) in the theorem statement, the bias terms are controlled as well. Plugging the upper bounds on (i), (ii) and (iii), it follows that, under $E_\delta$, for all $k \leq \overline{k}_n - 1$ and for all $i \in \{\lceil c_3 \ln n\rceil, \ldots, k\}$, the pairwise tests are passed; the required inequality holds by the definition of $r_n(\delta)$. We now check that the risk of $\widehat{\gamma}(\widehat{k}_n)$ is not much larger than the risk of $\widehat{\gamma}(\overline{k}_n)$.

Simulations
Risk bounds like Theorem 3.14 and Corollary 3.18 are conservative; for all practical purposes, they are just meant to be reassuring guidelines. In this numerical section, we intend to shed some light on the following issues:

1. Is there a reasonable way to calibrate the threshold $r_n(\delta)$ used in the definition of $\widehat{k}_n$? How does the method perform if we choose $r_n(\delta)$ close to $\sqrt{2\ln\ln n}$?
2. How large is the ratio between the risk of $\widehat{\gamma}(\widehat{k}_n)$ and the risk of $\widehat{\gamma}(k_n^*)$ for moderate sample sizes?
The finite-sample performance of the data-driven index selection method described and analysed in Section 3.3 has been assessed by Monte-Carlo simulations. Computations have been carried out in R using the packages ggplot2, knitr, foreach, iterators, xtable and dplyr [See Wickham, 2014, for a modern account of the R environment]. In detail, we investigated the performance of index selection methods on samples of sizes 10000, 20000 and 100000 from the collection of distributions listed in Table 1, which comprises: i) Fréchet distributions, $F_\gamma(x) = \exp(-x^{-1/\gamma})$ for $x > 0$ and $\gamma \in \{1, 0.5, 0.2\}$; ii) Student distributions $t_\nu$ with $\nu \in \{1, 2, 4, 10\}$ degrees of freedom; iii) the log-gamma distribution with density proportional to $(\ln x)^{2-1} x^{-3-1}$, which means $\gamma = 1/3$ and $\rho = 0$.
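For reproducibility, the following sketch (ours; parameter choices for illustration only) generates samples from the Fréchet and Student families; for the symmetric Student distributions, tail index estimation is performed on absolute values:

```r
# Samples from two of the families in Table 1 (tail index gamma = 1/nu for t_nu).
set.seed(2015)
n <- 10000
rfrechet  <- function(n, gam) (-log(runif(n)))^(-gam)  # inversion of exp(-x^(-1/gam))
x_frechet <- rfrechet(n, gam = 0.5)  # gamma = 0.5
x_student <- abs(rt(n, df = 4))      # gamma = 0.25
```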
Table 1, which is complemented by Figure 3, describes the difficulty of tail index estimation for samples from the different distributions. Monte-Carlo estimates of the standardised root mean square error (rmse) of Hill estimators are represented as functions of the number of order statistics $k$ for samples of size 10000 from the sampling distributions. All curves exhibit a common pattern: for small values of $k$, the rmse is dominated by the variance term and scales like $1/\sqrt{k}$. Above a threshold that depends on the sampling distribution, but that is not completely characterised by the second-order regular variation index, the rmse grows at a rate that may reflect the second-order regular variation property of the distribution. Not too surprisingly, the three Fréchet distributions exhibit the same risk profile; the three curves are almost indistinguishable. The Student distributions illustrate the impact of the second-order parameter on the difficulty of the index selection problem. For sample size $n = 10000$, the optimal index for $t_{10}$ is smaller than 30; it is smaller than the usual recommendations, and for such moderate sample sizes $t_{10}$ seems as hard to handle as the log-gamma distribution, which usually fits in the Horror Hill Plot gallery. The 1/2-stable Lévy distribution and the H-distribution behave very differently: even though they both have second-order parameter $\rho$ equal to $-1$, the H-distribution seems almost as challenging as the $t_4$ distribution, while the Lévy distribution looks much easier than the Fréchet distributions. The Pareto change point distributions exhibit an abrupt transition. Throughout, the selection rule is used with $r_n = \sqrt{c \ln\ln n}$, where $c = 2.1$ unless otherwise specified. The Fréchet, Student, H and stable distributions all fit into the framework considered by [Drees and Kaufmann, 1998]; they provide a favourable ground for comparing the performance of the optimal index selection method described by Drees and Kaufmann [1998], which attempts to take advantage of the second-order regular variation property, with the performance of the simple selection rule described in this paper.
The index $\widehat{k}^{\mathrm{dk}}_n$ was computed following the recommendations from Theorem 1 and the discussion in [Drees and Kaufmann, 1998], in which $\widehat{\rho}$ should belong to a consistent family of estimators of $\rho$ (under a second-order regular variation assumption), $\widehat{\gamma}$ should be a preliminary estimator of $\gamma$ such as $\widehat{\gamma}(\sqrt{n})$, $\zeta = 0.7$ and $\widetilde{r}_n = 2 n^{1/4}$. Following the advice from [Drees and Kaufmann, 1998], we replaced $|\widehat{\rho}|$ by 1. Note that the method for computing $\widehat{k}^{\mathrm{dk}}_n$ depends on a variety of tunable parameters.
Comparisons between the performance of $\widehat{\gamma}(\widehat{k}_n(r_n))$ and that of $\widehat{\gamma}(\widehat{k}^{\mathrm{dk}}_n)$ are reported in Tables 2 and 3. For each distribution from Table 1 and for sample sizes $n = 10000$, 20000 and 100000, 5000 experiments were replicated. As pointed out in [Drees and Kaufmann, 1998], on the sampling distributions that satisfy a second-order regular variation property, a carefully tuned $\widehat{k}^{\mathrm{dk}}_n$ is able to take advantage of it. Despite its computational and conceptual simplicity, and despite the fact that it is almost parameter-free, the estimator $\widehat{\gamma}(\widehat{k}_n(r_n))$ only suffers a moderate loss with respect to the oracle: when $|\rho| = 1$, the observed ratios are of the same order as $(2\ln\ln n)^{1/3} \approx 1.65$. Moreover, whereas $\widehat{\gamma}(\widehat{k}^{\mathrm{dk}}_n)$ behaves erratically when facing Pareto change point distributions, $\widehat{\gamma}(\widehat{k}_n(r_n))$ behaves consistently.
Figure 4 concisely describes the behaviour of the two index selection methods on samples from the Pareto change point distribution with parameters $\gamma = 1.5$, $\gamma' = 1$ and threshold $\tau$ corresponding to the $1 - 1/15$ quantile. The plain line represents the standardised rmse of Hill estimators as a function of the selected index. This figure contains the superposition of two density plots corresponding to $\widehat{k}^{\mathrm{dk}}_n$ and $\widehat{k}(r_n)$. The density plots were generated from 5000 points with coordinates $(\widehat{k}(r_n), |\widehat{\gamma}(\widehat{k}(r_n))/\gamma - 1|)$ and 5000 points with coordinates $(\widehat{k}^{\mathrm{dk}}_n, |\widehat{\gamma}(\widehat{k}^{\mathrm{dk}}_n)/\gamma - 1|)$. The contoured and well-concentrated density plot corresponds to the performance of $\widehat{\gamma}(\widehat{k}_n)$; the diffuse, tiled density plot corresponds to the performance of $\widehat{k}^{\mathrm{dk}}_n$. Facing Pareto change point samples, the two selection methods behave differently. Lepski's rule correctly detects an abrupt change at some point and selects an index slightly above that point. As the conditional bias varies sharply around the change point, this slight overestimation of the correct index still results in a significant loss as far as rmse is concerned. The Drees-Kaufmann rule, fed with an a priori estimate of the second-order parameter, picks out a much smaller index and suffers a larger excess risk.
where the last inequality follows from Chebyshev's negative association inequality. Hence, we obtain a differential inequality that is readily solved and leads to the corollary. The second summand can be further decomposed using (2.4).
We check that (i) and (ii) tend to 0 and then that (iii) converges towards a finite limit. Fix $\epsilon, \delta > 0$ and define $M = \sup\{\eta(t) : t \leq t_0\}$. Let $A_n$ denote the event $\{Y_{(k_n+1)} > \ln t_0(\epsilon, \delta)\}$. For $n$ such that $\ln(n/k_n) \leq 2\ln t_0$, the probability of $A_n^c$ is controlled since $Y_{(k_n+1)}$ is sub-gamma with variance factor $1/k_n$. We first check that (ii) tends to 0. Let $n$ be such that $n/k_n \geq t_0$ and let $W_n$ denote the random variable $Y_{(k_n+1)} - \ln(n/k_n)$; note that, for $0 \leq \lambda \leq k_n/2$, $\mathbb{E}\,\mathrm{e}^{\lambda|W_n|} \leq 2\,\mathrm{e}^{\lambda^2/k_n}$. The first summand has a finite limit thanks to Lemma C.1. The second summand converges to 0, as $\mathbb{E}\,\mathbf{1}_{A_n^c}$ tends to 0 exponentially fast while $1/\eta(n/k_n)^2$ tends to infinity algebraically fast.
Bounds on (i) are easily obtained using Jensen's inequality and the Poincaré inequality; using the same line of argument as for the limit of (ii), we establish that (i) converges to 0. We now check that (iii) converges towards a finite limit. The expected value of the last random variable is $1/(1-\rho)^2$, and we check that, for sufficiently large $n$, the remaining terms are negligible. In order to take advantage of Lemma D.1, we use the Bayesian game designed in [Carpentier and Kim, 2014a].
Theorem D.2. Let $\gamma > 0$, $\rho < -1$ and $0 \leq v \leq e/(1+2e)$. Then, for any tail index estimator $\widehat{\gamma}$ and any sample size $n$ such that $M = \ln n > e/v$, there exists a collection $(P_i)_{i \leq M}$ of probability distributions such that i) $P_i \in \mathrm{MDA}(\gamma_i)$ with $\gamma_i > \gamma$; ii) $P_i$ meets the von Mises condition with von Mises function $\eta_i$ satisfying $\eta_i(t) \leq \gamma t^{\rho_i}$, where $\rho_i = \rho + i/M < 0$; iii) the risks under the $P_i$ are lower bounded as in Theorem 2.9.

Proof of Theorem D.2. Choose $v$ so that $0 \leq v \leq e/(1+2e)$. The number of alternative hypotheses $M$ is chosen in such a way that $\ln(n/(v \ln M)) \leq M$; if $\ln n \geq e/v$, $M = \ln n$ will do.

Fig 3. Monte-Carlo estimates of the standardised root mean square error (rmse) of Hill estimators as a function of the number of order statistics $k$ for samples of size 10000 from the sampling distributions.

Table 3
Ratios between the median rmse of the selected estimators and the median optimal rmse.