Adaptive confidence intervals for the tail coefficient in a wide second order class of Pareto models

We study the problem of constructing honest and adaptive confidence intervals for the tail coefficient in the second order Pareto model, when the second order coefficient is unknown. This problem is translated into a testing problem on the second order parameter. By constructing an appropriate model and an associated test statistic, we provide a uniform and adaptive confidence interval for the first order parameter. We also provide an almost matching lower bound, which proves that the result is minimax optimal up to a logarithmic factor.


Introduction
The Pareto model for the tail of a distribution is a useful tool to understand extremal phenomena. Indeed, the Fisher-Tippett-Gnedenko theorem states that a necessary and sufficient condition for the convergence in law of the (rescaled) maximum of i.i.d. samples to a Fréchet distribution is that the tail of the distribution of the sample is regularly varying. We say that the distribution F is regularly varying with parameter τ in the tail if the following holds:

$$1 - F(x) = x^{-\tau} l(x), \qquad (1)$$

where l(x) is slowly varying at infinity, that is, such that for any x > 0, $\lim_{t \to \infty} l(tx)/l(t) = 1$. There has been considerable work on estimating τ, and many estimators (for instance Hill's estimator of Hill (1975) and Pickands' estimator of Pickands (1975)) have been suggested. Such estimators are consistent for the true parameter τ for distributions in the model (1). In order to obtain a rate of convergence for these estimators, some additional assumptions have to be made. A typical one is the so-called (exact) Hall condition

$$1 - F(x) - Cx^{-\tau} = C' x^{-\tau(\beta+1)} + o(x^{-\tau(\beta+1)}),$$

where τ > 0 is the first order parameter, and β > 0 is the second order parameter. The second order parameter characterizes the degree of proximity of the set of distributions to the exact Pareto distribution with parameters τ, C. Here we assume a more general condition that we will call the second order Pareto assumption. That is, one considers the set of distributions F that satisfy, for some C, C′,

$$|1 - F(x) - Cx^{-\tau}| \le C' x^{-\tau(\beta+1)}. \qquad (2)$$

This condition is a relaxation of the exact Hall condition in that the tail of the distribution does not have to satisfy the limiting form in the exact Hall model. We write S(τ, β, C, C′) =: S(τ, β) for the set of distributions satisfying (2).

Carpentier and Kim/Uniform and adaptive confidence interval in the Pareto model
Assuming that the parameter β is known, a classical result (for instance, see Drees, 2001) under the exact Hall model states that Hill's or Pickands' estimators $\hat\tau$ using the information of β satisfy (for ε > 0)

$$\sup_{F \in \bigcup_{\tau > \epsilon} S(\tau,\beta)} |\hat\tau - \tau| = O_P(\gamma_n) = O_P(n^{-\beta/(2\beta+1)}). \qquad (3)$$

Limiting distributions under the second order Pareto condition (2) are obtained in Hall (1982) and Hall and Welsh (1984). Also, a matching lower bound for this rate of estimation is proved in Hall and Welsh (1984) and Drees (2001). Moreover, this type of bound (3) can be used in order to create a confidence interval for τ of width of order $\gamma_n$ with known β (for the asymptotic confidence interval under the exact Hall model, see Cheng and Peng, 2001; Haeusler and Segers, 2007). However, β is unknown in general. There has been some work on estimating τ under various other assumptions in this case (e.g. see Hall and Welsh, 1985; Drees and Kaufmann, 1998; Danielsson et al., 2001; Carpentier and Kim, 2013). In particular, under the second order Pareto model (2), Carpentier and Kim (2013) prove that it is possible to construct an adaptive estimator of τ whose risk is the same (up to a $(\log\log(n))^{\beta/(2\beta+1)}$ factor) as the oracle rate $\gamma_n = n^{-\beta/(2\beta+1)}$, which is the optimal rate when β is known.
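To make the role of the sample fraction concrete, the following sketch computes the inverse of Hill's estimator on a simulated exact Pareto sample, using the $\lfloor n^{2\beta/(2\beta+1)} \rfloor$ largest observations. The function names, the seed and the simulated distribution are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def hill_tail_index(sample, k):
    """Estimate tau as the inverse of Hill's estimator built on the k largest points."""
    xs = sorted(sample, reverse=True)
    if not 1 <= k < len(xs):
        raise ValueError("k must satisfy 1 <= k < len(sample)")
    threshold = xs[k]  # the (k+1)-th largest observation
    mean_log_excess = sum(math.log(x / threshold) for x in xs[:k]) / k
    return 1.0 / mean_log_excess


def oracle_sample_fraction(n, beta):
    """Number of upper order statistics, floor(n^(2 beta / (2 beta + 1)))."""
    return max(1, int(n ** (2.0 * beta / (2.0 * beta + 1.0))))


def oracle_rate(n, beta):
    """Oracle rate gamma_n = n^(-beta / (2 beta + 1))."""
    return n ** (-beta / (2.0 * beta + 1.0))


rng = random.Random(0)  # fixed seed, illustration only
tau, n = 2.0, 20000
# Exact Pareto sample: 1 - F(x) = x^(-tau) for x >= 1 (inverse transform).
sample = [(1.0 - rng.random()) ** (-1.0 / tau) for _ in range(n)]

k = oracle_sample_fraction(n, beta=1.0)
tau_hat = hill_tail_index(sample, k)
print(k, tau_hat, oracle_rate(n, beta=1.0))
```

For an exact Pareto sample any β is admissible, so the choice β = 1 above only fixes the sample fraction; with a smaller β the estimator uses fewer points and is noisier.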
The goal of this paper is to construct uniform and adaptive confidence intervals for τ when β is unknown. That is, we want to build confidence intervals that have controlled coverage and optimal width $\gamma_n := \gamma_n(\beta)$ (adaptively to the "true" β of the function) uniformly over the set of second order Pareto distributions. This question is closely related to the problem of estimating β. One possible approach for estimating β is to restrict the set of distributions to those verifying a third order condition, which is more restrictive than both condition (2) and the exact Hall condition. Under the third order condition, it is possible to estimate β with a rate depending on the third order parameter (for more details, see Beirlant et al., 2008; Gomes et al., 2008), which might be used to construct the confidence interval. However, it is not clear which types of conditions on the parameter space are necessary for our goal. In fact, it has been an open question whether it is possible to construct uniform and adaptive confidence intervals for τ with unknown β under the relaxed condition (2). Using information theoretic bounds, this paper reveals the minimal conditions under which uniform and adaptive confidence intervals can be constructed. We discover that neither the exact Hall condition nor the third order condition is required for this task, by showing that the model we consider is strictly larger than those two cases (see Subsection 3.3 for more details). Although we do not provide tight constants, our result is minimax optimal (up to a logarithmic term).
Related issues were considered in the domain of non-parametric functional estimation (e.g. Low, 1997; Juditsky and Lambert-Lacroix, 2003; Robins and van der Vaart, 2006; Giné and Nickl, 2010; Hoffmann and Nickl, 2011; Bull and Nickl, 2013; Carpentier, 2013). In this area, one wants to construct a confidence set around a smooth function with smoothness parameter s. Just as the estimation of τ depends on the unknown second order parameter β, the minimax rate of estimation over the set of s-smooth functions depends on the unknown smoothness s, and so the oracle width of a confidence set should depend on s. These papers investigate the case where s is not available. In a first instance, they consider a simpler but related problem where one wants to decide between only two possible smoothness levels $s_0 > s_1 > 0$. They state that it is neither possible to test between $s_0$ and $s_1$ uniformly over the set of smooth functions, nor to construct uniform and adaptive confidence sets on the whole model. However, it is uniformly possible on a restricted model where one removes some functions (a ring around the set of $s_0$-smooth functions, that is, functions that are $s_1$-smooth but close to $s_0$-smooth functions). These papers prove the minimax-optimal size of the set of functions that one has to remove.
In this paper, we construct an adaptive confidence interval for τ based on a testing procedure on the second order parameter, and show that this is minimax optimal up to a logarithmic factor. We first consider the testing problem between $H_0 : F \in S(\tau, \beta_0)$ and $H_1 : F \in S(\tau, \beta_1)$ for $\beta_0 > \beta_1 > 0$, and propose a test statistic for solving this problem (see Equation (33)). As we will prove in Section 3, it is impossible to test between $H_0$ and $H_1$ uniformly over the whole set of $\beta_1$ second order Pareto distributions, and also impossible to construct an adaptive and uniform confidence interval for τ. However, by removing a specific region of the set of second order Pareto distributions, we characterize a model that is maximal in a minimax sense, and for which the constructed confidence interval is uniform and adaptive over the class (8) of distributions. We then use the testing idea developed in the two-points case to treat the case of a continuum of β. We provide, also in this case, a construction of an adaptive and uniform confidence interval for τ. The model on which we prove that a uniform and adaptive confidence interval for τ exists is larger than the models considered in previous works such as Beirlant et al. (2008) and Gomes et al. (2008, 2012). Moreover, we explain how to modify our method to consider a wider class of distributions such that the second order term is regularly varying in the tail. Finally, we illustrate how to construct our adaptive intervals and compare, by simulations, several confidence intervals based on Hill's estimator with various sample fractions.

Setting
Let D be the set of distributions on $R_+$ that are càdlàg (continuous on the right, with limits on the left). We define the following subset of heavy tailed distributions (often referred to as second order Pareto distributions in the literature), given four parameters τ, β, C, C′ > 0:

$$S(\tau, \beta, C, C') := \{F \in D : F \text{ satisfies condition (2)}\}.$$
From the definition, F ∈ S(τ, β, C, C′) satisfies a Pareto-like tail condition when x → ∞, that is, $1 - F(x) \sim x^{-\tau}$ with the first order parameter τ. In addition, we note that β characterizes the proximity of the distributions to the exact τ-Pareto distribution. Indeed, if β is large, then the distribution is close to the τ-Pareto distribution, while if β is small, it is further away from the exact τ-Pareto distribution. Note that in the particular case β = ∞, S(τ, β, C, C′) boils down to containing only one function, $F_0(x) = 1 - Cx^{-\tau}$ for $x \ge C^{1/\tau}$.

Let us first consider the simple case of distinguishing between two given second order parameters $\beta_0$ and $\beta_1$. We let $\beta_0 > \beta_1 > 0$, and consider two sets of second order Pareto distributions $S(\tau, \beta_0, C, C')$ and $S(\tau, \beta_1, C, C')$. Since $S(\tau, \beta_0, C, C') \subset S(\tau, \beta_1, C, C')$, there does not exist a uniformly consistent test (see Definition 2.1) for the following hypotheses:

$$H_0 : F \in S(\tau, \beta_0, C, C') \quad \text{vs.} \quad H_1 : F \in S(\tau, \beta_1, C, C').$$

In order to get around this problem, we restrict the set $S(\tau, \beta_1, C, C')$ to distributions which are not too close to $S(\tau, \beta_0, C, C')$. Closeness is measured by a distance between F and the set $S(\tau, \beta_0, C, C')$; the modified set $\tilde S(\tau, \beta_1, \beta_0, C, C', \rho_n)$, defined in (4), consists of the distributions in $S(\tau, \beta_1, C, C')$ that are at least $\rho_n$ away from $S(\tau, \beta_0, C, C')$. It is straightforward to check that $\tilde S(\tau, \beta_1, \beta_0, C, C', 0) = S(\tau, \beta_1, C, C')$, but for $\rho_n > 0$, these sets are proper subsets of $S(\tau, \beta_1, C, C') \setminus S(\tau, \beta_0, C, C')$.
Consider now the modified testing problem

$$H_0 : F \in S(\tau, \beta_0, C, C') \quad \text{vs.} \quad H_1 : F \in \tilde S(\tau, \beta_1, \beta_0, C, C', \rho_n).$$

We recall that a statistical test $\Psi_n$ is a measurable function of n i.i.d. observations $X_1, \dots, X_n$ from some distribution F that takes values in {0, 1}. We say that a test is uniformly consistent if the rejection probability $\sup_{F \in H_0} E_F \Psi_n$ under the null hypothesis and the non-rejection probability $\sup_{F \in H_1} E_F(1 - \Psi_n)$ under the alternative hypothesis become small when n increases. In other words, for a uniformly consistent test, both the type I error and the type II error are uniformly controlled by some predetermined level, which is denoted by α.
More formally, we define an α-uniformly consistent test in Definition 2.1.

In order for a test to verify this condition, we emphasize that both errors should be bounded in the worst case, that is, in a minimax sense, not only in a pointwise sense. This problem is related to uniformly consistent estimation of the parameter β of a distribution F . Indeed, if one can construct many such tests on refined grids of an interval [b, B], one can deduce from the outcome of these tests an estimate of β that will be uniformly consistent for a model that will depend on ρ n (see Subsection 3.3).
An α-uniformly consistent test can be useful for constructing an α-adaptive and uniform confidence interval for τ over a model $P_n$, and for a set $I_b$ of parameters β. We let $I_1$ be the possible range of τ and $I_2$ be the possible range of C. A confidence interval $C_n$ is a subset of [0, ∞), and the diameter $|C_n|$ of $C_n$ is the length of the interval $C_n$. We now provide the following definition of an α-adaptive and uniform confidence interval for τ.
Definition 2.2 (α-adaptive and uniform confidence interval for τ ). A confidence interval C n for τ is α-adaptive and uniform for a model P n , two sets I 1 , I 2 and a constant C ′ over a set I b , if there exists a constant M such that for n large enough, the following two conditions are satisfied simultaneously.
and

$$\inf_{F \in P_n} P_F(\tau \in C_n) \ge 1 - \alpha. \qquad (6)$$

Inequality (5) implies that the diameter of the confidence interval is not larger than the oracle diameter, where the oracle diameter is the same as the minimax-optimal rate of estimation of the parameter τ under the correct model for β. In other words, it means that the confidence interval is adaptive to β. When β can take only two values, for instance $I_b = \{\beta_0, \beta_1\}$, such a confidence interval is adaptive over the two points $\beta_0$ and $\beta_1$. One could also consider a continuous range of β, for instance $I_b = [b, B]$ where 0 < b < B, such that a confidence interval would be adaptive over a continuum of parameters. In both cases, the model $P_n$ has to be restricted in order to ensure the existence of such a test, and is typically smaller than $\bigcup_{\tau \in I_1, C \in I_2} S(\tau, \beta_1, C, C')$ and $\bigcup_{\tau \in I_1, C \in I_2} S(\tau, b, C, C')$, respectively. Inequality (6) implies that the confidence interval contains the true parameter with high probability. Again, this definition demands uniformity over the whole model $P_n$. In the two points case $I_b = \{\beta_0, \beta_1\}$, we will consider $P_n = \bigcup_{\tau \in I_1, C \in I_2} \bigl( S(\tau, \beta_0, C, C') \cup \tilde S(\tau, \beta_1, \beta_0, C, C', \rho_n) \bigr)$, where $\rho_n$ is specified later. In the more general case $I_b = [b, B]$, the definition of $P_n$ is more involved and is described later.

Main results
We consider two settings. In a first instance, we assume preliminary knowledge of τ and C. This is rather a toy setting, but it is useful for us to understand precisely the mechanisms of the problem. In a second instance, we extend the ideas used for this simple setting to the case where τ and C are unknown.
3.1. α-uniformly consistent test when τ, C are known
We consider in this subsection only two possible values for the parameter β, that we write $\beta_0, \beta_1$ with $0 < \beta_1 < \beta_0$.
We first assume that (τ, C) are available, as well as C′ (which should be upper bounded). In this setting, the problem of building uniform and adaptive confidence intervals for τ is meaningless. But we can still cast in a meaningful way the problem of building a uniformly consistent test between

$$S_0(\tau, C) := S(\tau, \beta_0, C, C') \quad \text{and} \quad \tilde S_1(\tau, C, \rho_n) := \tilde S(\tau, \beta_1, \beta_0, C, C', \rho_n). \qquad (7)$$

We consider the following testing problem

$$H_0 : F \in S_0(\tau, C) \quad \text{vs.} \quad H_1 : F \in \tilde S_1(\tau, C, \rho_n),$$

and we are interested in the minimal order of $\rho_n$ such that there exists a uniformly consistent test for this problem.
Theorem 3.1 states that $n^{-\beta_1/(2\beta_1+1)}$ is the minimax-optimal order which ensures the existence of an α-uniformly consistent test between $H_0$ and $H_1$. Note that this $\rho_n$ is the same as the rate of estimation under the alternative hypothesis. The test statistic we propose in this setting is based on a simple idea: estimating 1 − F(x) by $\hat p(x)$, and then testing whether there exists a point x at which $\hat p(x)$ deviates, by more than a threshold involving a small constant c, from what is allowed under $S_0(\tau, C)$ (the precise statistic is given in Equation (33)). In other words, we test if the empirical distribution belongs to a small enlargement of $S_0(\tau, C)$, where this enlargement does not intersect with $\tilde S_1(\tau, C, \rho_n)$.
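The idea of testing whether the empirical tail stays inside an enlargement of $S_0(\tau, C)$ can be sketched as follows. This is a deliberately crude version and not the statistic (33) of the paper: the enlargement is taken uniform in x and equal to the Dvoretzky-Kiefer-Wolfowitz bound, which controls the type I error at level α but is not rate-optimal, and the supremum is only evaluated at the sample points. All names and numerical values are assumptions made for this example.

```python
import math
import random


def dkw_slack(n, alpha):
    """Uniform deviation bound for the empirical CDF (Dvoretzky-Kiefer-Wolfowitz)."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))


def band_test(sample, tau, C, C_prime, beta0, alpha):
    """Return 1 (reject H0) if the empirical tail leaves the enlarged band, else 0."""
    n = len(sample)
    xs = sorted(sample)
    slack = dkw_slack(n, alpha)
    for j, x in enumerate(xs):
        p_hat = (n - j - 1) / n  # fraction of observations strictly above x
        allowed = C_prime * x ** (-tau * (1.0 + beta0)) + slack
        if abs(p_hat - C * x ** (-tau)) > allowed:
            return 1
    return 0


rng = random.Random(1)  # fixed seed, illustration only
n, tau, C, C_prime, beta0, beta1, alpha = 5000, 1.0, 0.5, 0.5, 2.0, 0.5, 0.01

# Null sample: exact Pareto tail 1 - F(x) = C x^(-tau) for x >= C^(1/tau).
null_sample = [((1.0 - rng.random()) / C) ** (-1.0 / tau) for _ in range(n)]


def draw_mixture():
    """Tail 0.5 x^(-tau) + 0.5 x^(-tau (1 + beta1)) for x >= 1."""
    index = tau if rng.random() < 0.5 else tau * (1.0 + beta1)
    return (1.0 - rng.random()) ** (-1.0 / index)


alt_sample = [draw_mixture() for _ in range(n)]

reject_null = band_test(null_sample, tau, C, C_prime, beta0, alpha)
reject_alt = band_test(alt_sample, tau, C, C_prime, beta0, alpha)
print(reject_null, reject_alt)
```

The mixture used as alternative has a second order term of exact order $x^{-\tau(1+\beta_1)}$ with a constant-size coefficient, so it is far from the null; distinguishing alternatives at distance of order $\rho_n$ requires the sharper, x-dependent threshold of the paper.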
3.2. Uniform and adaptive confidence interval for τ over two points β_0, β_1
In this subsection, we still consider only two possible values $\beta_0, \beta_1$ for the parameter β, and the associated sets $S_0(\tau, C) := S(\tau, \beta_0, C, C')$ and $\tilde S_1(\tau, C, \rho_n) := \tilde S(\tau, \beta_1, \beta_0, C, C', \rho_n)$. But we extend the previous toy example results to the case where τ, C are not available (an upper bound on the parameter C′ is available). Then we are interested in testing

$$H_0 : F \in \bigcup_{\tau \in I_1, C \in I_2} S_0(\tau, C) \quad \text{vs.} \quad H_1 : F \in \bigcup_{\tau \in I_1, C \in I_2} \tilde S_1(\tau, C, \rho_n),$$

where $I_1, I_2$ are two closed intervals of (0, ∞).
A natural idea in this case is to plug estimators of τ and C in the test statistics which are used in the case where τ and C are known. Doing this leads to the following theorem.
Theorem 3.2. Let α > 0 and β 0 > β 1 > 0 (and it may be that β 0 = ∞) be given. Let τ ∈ I 1 , C ∈ I 2 be two unknown parameters. An α-uniformly consistent test exists for some sequence (ρ n ) n such that lim sup We lose a log(n) factor with respect to the previous result. This comes from the fact that we have to estimate τ and C. Even though we do not know whether this factor is necessary or not, we know from Theorem 3.1 that it is not possible to deviate more than this log(n) factor (the lower bound of Theorem 3.1 applies a fortiori to this enlarged model).
The previous result is immediately translated into the existence of adaptive and uniform confidence intervals for τ.
Theorem 3.3. Let α > 0 and $\beta_0 > \beta_1 > 0$ (and it may be that $\beta_0 = \infty$) be given. Let $\tau \in I_1$, $C \in I_2$ be two unknown parameters. Also with the notation (4) and (7), we set the model as follows.

Then,
A. An α-adaptive and uniform confidence interval for τ in the model $P_n$ over
B. If there exists an α-adaptive and uniform confidence interval for τ in the model $P_n$ over $I_b = \{\beta_0, \beta_1\}$, then necessarily $\liminf_{n \to \infty} \rho_n n^{\beta_1/(2\beta_1+1)} > 0$.

In Theorem 3.3, the first claim A. follows from Theorem 3.2. Indeed, provided an α-uniformly consistent test between $H_0$ and $H_1$, one is able to choose adaptively the sample fraction for estimating τ (see Section 6.3). Then the risk of this estimate depends on β, that is, the risk is of order $n^{-\beta/(2\beta+1)}$ where $\beta \in \{\beta_0, \beta_1\}$ is the "true" parameter. On the other hand, B. follows easily from the previously established lower bound.
Remark: The upper bounds in Theorems 3.2 and 3.3 do not match the lower bounds in Theorems 3.1 and 3.3: they differ by a log(n) factor. It is unclear to us whether this log(n) is necessary or not, but we conjecture that at least some power of log(n) is necessary, because of the uncertainty on τ, C. Indeed, it is the use of estimators for τ and C that causes this log(n) factor. We however believe that proving a matching lower bound with this additional log(n) factor is very involved, since one would need to consider a composite alternative and a composite null hypothesis in the construction of the lower bound (if one of these two is not composite, it is actually possible to construct a test and a confidence interval without the additional log(n) factor in $\rho_n$). We leave this as an open problem for future research.

3.3. Uniform and adaptive confidence interval for τ over a continuum of parameters β ∈ [b, B]
In this subsection, we extend the results for the two classes to the case of a continuous set of parameters such that β ∈ [b, B]. The key idea is to discretize the set [b, B] into about log(n) disjoint intervals and to perform successive tests. Let the number of grid points be $M_n := \lfloor \log(n)/\xi \rfloor$ with a positive constant ξ. We first discretize the interval [b, B] by a grid of points that are $1/M_n$-apart from each other. That is, we consider grid points $(\beta_i)_{0 \le i \le M_n}$ such that $B = \beta_0 > \beta_1 > \beta_2 > \dots > \beta_{M_n} = b$. The separation rate $\rho_n(\beta)$ is defined as a function of β,

$$\rho_n(\beta) = \max\bigl\{2\bigl(E(\alpha/(9M_n)) \log(n) + D \log((9M_n)/\alpha)\bigr),\ 2C'\bigr\}\, n^{-\beta/(2\beta+1)},$$

similarly to (34) in the two points test. We also extend the definition of the modified set $\tilde S$ in (4) by introducing $\tilde S(\tau, \beta_{i+2}, \beta_i, C, C', \rho_n(\beta_{i+2}))$, the set of $\beta_{i+2}$-second order Pareto distributions that are $\rho_n(\beta_{i+2})$ away from $S(\tau, \beta_i, C, C')$. Consider the model $P_n$ defined in (10), where $I_1, I_2$ are two closed sets of (0, ∞).
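The discretization step can be sketched as follows. The sketch assumes an equally spaced grid with step $(B - b)/M_n$, so that the end points are exactly B and b; the function names and the numerical values are hypothetical. It also evaluates the rate $n^{-\beta/(2\beta+1)}$ at each grid point, without the constants entering $\rho_n(\beta)$.

```python
import math


def beta_grid(n, b, B, xi):
    """Decreasing grid B = beta_0 > beta_1 > ... > beta_{M_n} = b, M_n = floor(log(n)/xi)."""
    m_n = int(math.floor(math.log(n) / xi))
    if m_n < 1:
        raise ValueError("need log(n) >= xi to obtain at least one grid interval")
    step = (B - b) / m_n
    # The last point is set to b explicitly to avoid floating point drift.
    return [B - i * step for i in range(m_n)] + [b]


def rate(n, beta):
    """Rate n^(-beta / (2 beta + 1)) attached to the grid point beta."""
    return n ** (-beta / (2.0 * beta + 1.0))


n = 10 ** 6
grid = beta_grid(n, b=0.5, B=2.0, xi=1.0)
rates = [rate(n, beta) for beta in grid]

for beta, r in zip(grid, rates):
    print(round(beta, 4), r)
```

Since β/(2β+1) is increasing in β, the rates grow as one moves down the grid from B to b: a smaller β corresponds to a slower rate and hence to a wider oracle interval.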
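For F in this model, we define $\beta^*(F)$ as the supremum of the set $\{\beta \in [b, B] : F \in S(\tau, \beta, C, C')\}$. This supremum is well defined since it is upper bounded by B, and the set $\{\beta : F \in S(\tau, \beta, C, C')\}$ is non-empty since it contains b. This $\beta^*(F)$ can be thought of as the "intrinsic" β of F, i.e. the one that characterizes the complexity of the model to which F belongs.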
Theorem 3.4. Let α > 0, and let $P_n$ be the model defined in (10). For a positive constant ξ, there exists a sequence $(\rho_n(\beta_i))_{n,i}$ such that there exists an estimate $\hat\beta$ of $\beta^*(F)$ which satisfies

Theorem 3.4 states that on the model $P_n$ defined in (10), it is possible to estimate $\beta^*$ with the rate 1/log(n). On the one hand, this rate seems very slow; but it is the price to pay for considering a rather wide model. On the other hand, as we will see in the next theorem, it is actually sufficient in order to obtain an adaptive and uniform confidence set. The idea in the proof of Theorem 3.4 is somewhat similar to the successive testing procedure considered in Hoffmann and Nickl (2011, Section 2.5).
Theorem 3.5. Let α > 0. There exists a sequence (ρ n (β i )) n,i satisfying (12) and such that on the model P n defined in (10) over I b = [b, B], there exists an α-adaptive and uniform confidence interval C n for τ .
The model (10) is not very easy to interpret as such, but it is actually a rather general model. In particular, consider a class G of distributions defined for 0 < C 1 < C ′ and D > 0 as F ∈ S(τ, β, C, C ′ ) : Then it is possible to prove that G is included in the set (10) when we pick ξ large enough for a sufficiently large n as shown below.
Lemma 3.1. Let $(\rho_n(\beta_i))_{n,i}$ be a sequence satisfying (12). If ξ is chosen such that $C_1 - 2C' e^{-\xi(B-b)/(2(2B+1))} > 0$, then for n large enough, G is included in the model (10).

Regarding the constant in the grid points, a large ξ means that we use coarse grids so that we remove smaller regions of the parameter space. That is, a large ξ corresponds to a large model, so that it yields a wide confidence interval for τ. Nevertheless, this large choice of ξ will soften the requirements on the choice of $C_1$ and C′ such that G is contained in the model (10). This set G is actually larger than the set of distributions verifying the third order condition in the literature (e.g. Gomes et al., 2012), i.e. the set of distributions that verify (14), since convergence in the tail to a distribution of a specific form is not imposed. The set G is also larger, when n grows to ∞, than the set of distributions that satisfy the exact Hall condition (15). This implies in particular that our model $P_n$ is much larger than usual models where adaptive estimators or adaptive and uniform confidence intervals are derived. In fact, G can be understood as an analogue to the set of self-similar functions (see Condition 3 of Giné and Nickl, 2010).
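The condition on ξ in Lemma 3.1 can be solved explicitly: $C_1 - 2C' e^{-\xi(B-b)/(2(2B+1))} > 0$ is equivalent to $\xi > \frac{2(2B+1)}{B-b} \log(2C'/C_1)$. The short sketch below evaluates this threshold; the function names and the numerical constants are hypothetical and only serve as an illustration.

```python
import math


def lemma_margin(xi, b, B, C1, C_prime):
    """Left-hand side C1 - 2 C' exp(-xi (B - b) / (2 (2B + 1))) of the condition."""
    return C1 - 2.0 * C_prime * math.exp(-xi * (B - b) / (2.0 * (2.0 * B + 1.0)))


def minimal_xi(b, B, C1, C_prime):
    """Value of xi at which the margin vanishes; the lemma needs xi strictly above it."""
    return 2.0 * (2.0 * B + 1.0) / (B - b) * math.log(2.0 * C_prime / C1)


b, B, C1, C_prime = 0.5, 2.0, 0.5, 1.0  # illustrative values with 0 < C1 < C'
xi_min = minimal_xi(b, B, C1, C_prime)
print(xi_min, lemma_margin(xi_min + 1.0, b, B, C1, C_prime))
```

With these illustrative constants the threshold is about 9.2, so the grid size $M_n = \lfloor \log(n)/\xi \rfloor$ is at least 1 only when log(n) exceeds this value, which shows how the choice of ξ interacts with the sample size.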

Construction of the test statistic and the confidence interval
The test statistic (33) involves the empirical distribution of the data and estimators of τ and C. For instance, in the two classes testing, we can use Hill's estimate and the associated estimate of C with the sample fraction corresponding to the null (if $\Psi_n = 0$) or alternative hypothesis (if $\Psi_n = 1$). Another option is to consider an adaptive estimate of τ and construct an adaptive confidence interval centred on this estimate. We illustrate the practical construction of the estimate we propose in Algorithm 1, in the case of a continuum of parameters β (Theorem 3.5). The choice of B and b depends on the belief of the user about the magnitude of the parameter β, i.e. of the models one wishes to consider. The parameter ξ corresponds to the desired precision on β. The choice of the constant $c_1(\alpha, i)$ is in practice made following the methods of Cheng and Peng (2001) and Haeusler and Segers (2007) combined with a delta method, and is thus chosen lower than the bound derived in the proofs of this paper, which is conservative. Finally, α corresponds to the desired coverage of the confidence set. These parameters can be fixed arbitrarily according to the user's preference.

Algorithm 1 Practical construction of the confidence interval for τ
Estimates: Estimate τ, C by $\hat\tau_i, \hat C_i$ with the sample fraction associated to $\beta_i$ (for instance, for $\hat\tau_i$, we can use the inverse of Hill's estimator computed with the $\lfloor n^{2\beta_i/(2\beta_i+1)} \rfloor$ largest samples, or the adaptive estimate (Carpentier and Kim, 2013); for $\hat C_i$, we can use where $q_{1-\alpha/2}$ is the (1 − α/2) quantile of a N(0, 1) (following Cheng and Peng, 2001; Haeusler and Segers, 2007, combined with a delta method)

However, there is no simple answer for how to choose the constant C′. We recall that the larger C′ is, the larger the class of β-second order Pareto distributions becomes. This implies that the theory would be more complete if we could handle the class of β-second order Pareto distributions with the second order condition for any C′ < ∞. But then, a larger C′ would yield a larger separation zone $\rho_n$ as well as the necessity of considering wider confidence intervals for τ. We believe that the choice of the upper bound for C′ should depend on the specific problem considered, and also on whether one is interested in asymptotic rates (in which case C′ has to slowly go to infinity with n), or in the final width of the confidence set for not-too-small tail probabilities (in which case C′ has to be of reasonably small magnitude). A reasonable heuristic for fixing C′ is as follows. For each candidate $\beta_i$, we fix where $c_n$ is a confidence bound as e.g. $c_n = \log(1/\alpha) n^{-\beta_i/(2\beta_i+1)}$ and 0.2 makes C′ not-too-small. This heuristic is efficient, in particular if the model error is maximized for small x, which is often the case in practice, and in usual parametric heavy tailed models (Fréchet, Student, etc.).
The choices of $\beta_i, \hat C_i$ from the threshold $\lfloor n^{2\beta_i/(2\beta_i+1)} \rfloor$ follow from the theory, and with these choices, the results are minimax-optimal up to a $\log(n) \log\log(n)^{3/2}$ factor (see Theorems 3.3 and 3.4). Ideally, to obtain better constants in the bounds, one would like to tune the constants in all these quantities. In order to do this, however, it is necessary to estimate the model bias (i.e. the deviation from the Pareto distribution) precisely. This is possible in more restrictive models than ours (e.g. Gomes et al., 2012), since they assume the third order condition for the model bias (definite shape order plus a negligible term). It is thus possible to estimate this model bias with few parameters from a finite sample, using the information regarding the bias whose shape is guaranteed to hold on the entire tail.
In contrast, we consider a setting which is much broader than this one, and where the model bias is only upper bounded, which implies that the bias in our setting does not have any definite shape. It is thus not possible to infer from a finite sample the shape of the (far right) tail of the model bias, which makes it difficult to tune the constants without having an oracle knowledge of the distribution. Although we agree on the practical impact for more refined tuning, we believe that it is not possible in this broader model to tune exactly without a priori knowledge of the problem; this is the price to pay for a broader model, which is particularly relevant for discrete distributions.
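The estimation step of Algorithm 1 can be sketched for a few candidate values $\beta_i$ as follows. This is only an illustration under stated assumptions: the interval is the plain asymptotic normal interval $\hat\tau_i (1 \pm q_{1-\alpha/2}/\sqrt{k_i})$ obtained from Hill's estimator by a delta method, which ignores the model bias and the constant C′; the estimate of C is the usual plug-in $(k/n) X_{(n-k)}^{\hat\tau}$, which is an assumption since the formula of the paper is not reproduced here; and the function name, the candidate values and the simulated data are hypothetical.

```python
import math
import random
from statistics import NormalDist


def hill_interval(sample, beta, alpha):
    """Hill-based estimates of (tau, C) and a naive asymptotic interval for tau."""
    n = len(sample)
    k = max(2, int(n ** (2.0 * beta / (2.0 * beta + 1.0))))  # sample fraction for beta
    xs = sorted(sample, reverse=True)
    threshold = xs[k]  # the (k+1)-th largest observation
    tau_hat = k / sum(math.log(x / threshold) for x in xs[:k])
    c_hat = (k / n) * threshold ** tau_hat  # plug-in estimate of C (assumption)
    q = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # (1 - alpha/2) quantile of N(0, 1)
    half_width = q * tau_hat / math.sqrt(k)
    return tau_hat, c_hat, (tau_hat - half_width, tau_hat + half_width)


rng = random.Random(2)  # fixed seed, illustration only
tau, n = 2.0, 20000
sample = [(1.0 - rng.random()) ** (-1.0 / tau) for _ in range(n)]

candidates = (0.5, 1.0, 2.0)
results = [hill_interval(sample, beta, alpha=0.05) for beta in candidates]
widths = [hi - lo for _, _, (lo, hi) in results]

for beta, (tau_hat, c_hat, interval) in zip(candidates, results):
    print(beta, tau_hat, c_hat, interval)
```

The widths shrink as the candidate β grows because more order statistics are used; on a distribution that is not exactly Pareto, the larger sample fractions would also carry more bias, which is why the procedure of the paper first selects β by testing.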
4.2. Distributions that are in P_n but not in the models (14) and (15)
We have shown in Section 3.3 that our model $P_n$ (cf. Equation (10)) is strictly larger than the usual models such as (14) and (15) as well as the set of self-similar distributions (see Equation (13)). Then it is of interest to see if there are useful models in applications that belong to our model, but that are out of the more restrictive classes previously considered. To that end, we consider the class of heavy tailed distributions that take only a countable set of values, i.e. discrete distributions. There are many examples of such discrete heavy tailed distributions; either for "natural" reasons (e.g. the distributions of wages in a population) or for rounding issues (e.g. hydrology measures). Thus for practical applications, it is very important to consider these distributions.
We claim here that many discrete distributions that are in the domain of attraction of a Fréchet distribution are not in the models (14) and (15), while some of them are in our model $P_n$. Let us write H for the set of distributions that satisfy Equation (15) (and which thus also satisfy (14)). The following lemma states that distributions which are discretized (with equally spaced grids) versions of distributions in H are not in H.
Lemma 4.1. Let F ∈ H with parameters τ, β > 0. Let $a_0 \ge 1$, t > 0, and set $a_i = a_0 + it$ for any i ∈ N. Let $\tilde F$ be the discretized version of F according to $(a_j)_j$, i.e. for all x ≥ 1, $\tilde F(x) = F(a_i)$ where $a_i$ is the largest element of $\{a_j, j \in N\}$ smaller than or equal to x. Then $\tilde F \notin H$.
On the other hand, if min(β, 1/τ) ∈ (b, B), then for n large enough, $\tilde F \in P_n$, where $P_n$ is defined in (10).
The proof of this lemma is in Subsection 6.8. In order to illustrate this lemma, we consider the distribution

$$\tilde F(x) = 1 - \lfloor x \rfloor^{-\tau}, \quad x \ge 1,$$

which puts mass only on integers, and is equal to a τ-Pareto distribution on these integers. Clearly, this distribution is such that $1 - \tilde F(x) \sim x^{-\tau}$ as x → ∞, so it is in the domain of attraction of a Fréchet distribution. Lemma 4.1 implies that this distribution does not belong to H, i.e. to the models (14) and (15), for any constants $\beta, C, C_1, C', \tilde C$. This reveals that even simple heavy tailed discrete distributions are not in the usually considered models (14) and (15). On the other hand, this distribution is in the model $P_n$ defined in Equation (10) by Lemma 4.1. Lemma 4.1 applies to many other discrete distributions besides this simple one.
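A quick numerical check, given as an illustrative sketch and not as part of the proof of Lemma 4.1, shows why an integer-valued Pareto distribution behaves this way. Assuming $\tilde F(x) = 1 - \lfloor x \rfloor^{-\tau}$ with C = 1, the rescaled gap $x^{\tau+1}(\lfloor x \rfloor^{-\tau} - x^{-\tau})$ stays bounded, which is an upper bound of the type (2) with β = 1/τ, but it keeps oscillating between 0 (at the integers) and roughly τ (just below the integers), so it has no limit. The function name `scaled_gap` and the range of integers are choices made for this example.

```python
import math


def scaled_gap(x, tau):
    """x^(tau + 1) * |(1 - F~(x)) - x^(-tau)| for F~(x) = 1 - floor(x)^(-tau)."""
    return x ** (tau + 1.0) * (math.floor(x) ** (-tau) - x ** (-tau))


tau = 2.0
# At the integers the discretized tail coincides with the exact Pareto tail.
at_integers = [scaled_gap(float(k), tau) for k in range(10, 1000)]
# Just below the next integer the gap is largest on each interval [k, k + 1).
just_below = [scaled_gap(k + 1 - 1e-9, tau) for k in range(10, 1000)]

print(max(at_integers), min(just_below), max(just_below))
```

The gap is therefore bounded by a constant times $x^{-\tau(1 + 1/\tau)}$, while the absence of a limit for the rescaled gap is what rules out a condition with a prescribed limiting form such as the exact Hall condition.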

Extension to the case of a regularly varying second-order term
Condition (1) is a necessary and sufficient condition for a distribution F to be in the domain of attraction of a Fréchet distribution. However, as mentioned in the introduction, this model is too broad to allow for uniform rates of convergence for an estimate of τ . Setting (2) was introduced to characterize a set of parameters in which it is possible to estimate τ at a uniform rate. This rate depends on the second order parameter β that characterizes the maximal distance between a distribution included in (2) and the exact Pareto distribution.
As an alternative to the setting (2), we might consider the set of distributions F which satisfy

$$1 - F(x) = Cx^{-\tau}(1 + R(x)), \qquad (17)$$

where |R| is upper bounded by a τβ-regularly varying function at +∞ (see Lemma 2.3.2 of Dekkers and De Haan, 1993). From the form of (17), we know that this alternative setting is an extension of the exact Hall condition (15), which corresponds to $R(x) = \tilde C x^{-\tau\beta} + o(x^{-\tau\beta})$. In addition, by definition of regularly varying functions, if a function verifies Equation (17) (where R is τβ-regularly varying), then it verifies condition (18), where D is a constant, and φ is a function such that $\lim_{x \to \infty} \phi(x) = 0$. This condition is slightly weaker than Equation (17), but these two conditions are very much related. The condition (2) we consider in this paper for the two points test and confidence interval is weaker than the exact Hall condition (15) (although it is stronger than (18) since it requires φ(x) = 0). Indeed, condition (15) implies that $x^{\tau(\beta+1)}(1 - F(x) - Cx^{-\tau})$ converges to a finite limit as x → ∞, while our condition (2) does not impose the existence of this limit; it gives just an upper bound on the distance to the exact Pareto distribution. Moreover, we have proved that up to a log(n) factor, the model $P_n$ that we derive from the condition (2) is the largest possible in a minimax sense for studying the question of uniform and adaptive confidence intervals. In the case of a continuum of parameters β, the model (10) that we define for constructing adaptive and uniform confidence intervals contains the set of distributions that verify the exact Hall condition. It is larger (see Lemma 3.1) than the set (13), which clearly includes the set of distributions in the exact Hall model. We emphasize that most papers considering estimation of β actually consider a more restrictive setting than the exact Hall condition (15), such as the third order condition (see Gomes et al., 2012, for instance). Before considering our method in the more general setting (18), we want to make an important remark.
In the setting (18), β does not characterize the uniform rates of convergence for the estimator of τ anymore (see Drees, 1998, 2001). Indeed, if φ converges to 0 too slowly, the distance between F and the exact Pareto distribution can become too large, and the rate $n^{-\beta/(2\beta+1)}$ for estimating τ is then out of reach. The main reason why we are interested in testing β in the model (2) is that in this model, β characterizes the complexity of the model to which F belongs. More specifically, β controls the bias with respect to an exact Pareto distribution, and yields an estimation error of order $n^{-\beta/(2\beta+1)}$. Also, through this complexity, knowing β should allow us to (a) compute the optimal sample fraction for obtaining optimal estimators of τ and (b) have an idea of the precision of these estimators so that we can build a confidence interval for τ. However, under condition (18), the role of β is less clear, in particular if φ decays very slowly to 0, namely as soon as $\lim_{x \to \infty} |\log(x)\phi(x)| = \infty$. In this case, the minimax-optimal rate of estimation for τ and the optimal width of the confidence interval are not $n^{-\beta/(2\beta+1)}$, but larger (see Drees, 1998, 2001). Thus, testing whether $\beta = \beta_0$ or $\beta = \beta_1$, or estimating β under the general model (18), does not provide an answer for confidence statements.
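The way β determines both the sample fraction and the rate in model (2) can be traced through a short heuristic computation. It is only a sketch: constants are ignored, and the orders used for the bias and the standard deviation of a Hill-type estimate built from the observations above a level x are the standard ones, stated here as assumptions rather than taken from the proofs.

```latex
\begin{align*}
\mathrm{bias}(x) &\asymp x^{-\tau\beta}, \qquad
\mathrm{sd}(x) \asymp \bigl(n x^{-\tau}\bigr)^{-1/2}, \\
x_n^{-\tau\beta} = n^{-1/2} x_n^{\tau/2}
  &\iff x_n^{\tau(2\beta+1)/2} = n^{1/2}
  \iff x_n = n^{\frac{1}{\tau(2\beta+1)}}, \\
x_n^{-\tau\beta} &= n^{-\frac{\beta}{2\beta+1}}, \qquad
n\, x_n^{-\tau} = n^{\frac{2\beta}{2\beta+1}}.
\end{align*}
```

The last quantity is the order of the expected number of observations above $x_n$, which matches the $\lfloor n^{2\beta_i/(2\beta_i+1)} \rfloor$ largest samples used in Algorithm 1.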
To take a closer look at the construction of a uniform confidence interval in this alternative setting (18), let us consider the wider model In the extreme case $|R(x)| = x^{-\tau\beta+\varphi(x)}$ and if $\lim_{x\to\infty} |\log(x)\varphi(x)| = \infty$, as mentioned above, β drives neither the rate of convergence of an optimal estimate of τ nor the length of the associated confidence interval. The quantity that drives the optimal rate can depend on n and is defined as follows. Let n > 0 be the number of samples. Let $x_n$ be the point that equalizes the bias and the standard deviation of an estimate (such as the one in Carpentier and Kim, 2013) computed only with the samples larger than this point, i.e. $x_n$ be such that Let $\beta_n$ be such that $n^{-\beta_n/(2\beta_n+1)}$ is equal to the risk of this estimate (or equivalently, let $\beta_n$ be such that $x_n = n^{1/(\tau(2\beta_n+1))}$). Then we define $\beta_n^* = \inf_{N \ge n} \beta_N$. This $\beta_n^*$ is actually the quantity of importance (and not β): it characterizes the performance of the estimate for a fixed n. Although $\beta_n^* \to \beta$ as n → ∞, the rate of convergence of this quantity can be arbitrarily slow depending on how slowly φ(x) goes to 0; when this convergence is too slow and $|\log(x)\varphi(x)| \to \infty$, the rate is not $n^{-\beta/(2\beta+1)}$. Hence, in this setting and for a fixed n, one does not want to test β (or build confidence intervals according to β), but rather to test the quantity $\beta_n^*$ (and build confidence intervals according to $\beta_n^*$). Actually, the estimate defined in Carpentier and Kim (2013), computed with all the points larger than $n^{1/(\tau(2\beta_n^*+1))}$, attains the optimal risk $n^{-\beta_n^*/(2\beta_n^*+1)}$ (as does the one defined in Drees, 1998). And if one does not know β (nor $\beta_n^*$), by slightly modifying the proof of Theorem 3.6 in Carpentier and Kim (2013), one can prove that the adaptive estimate $\hat\tau$ from that paper satisfies, with high probability, There is then an easy extension of our procedure to test whether $\beta_n^*$ equals $\beta_0$ or $\beta_1$.
The only change with respect to the procedure we proposed is in the definition of the test statistic $T_n$ introduced in Section 5.1, which should be redefined so that the distance to the model is considered only for large enough x. This test will be consistent if the alternative hypothesis is restricted to distributions which are at least $n^{-\beta_1/(2\beta_1+1)}$ away from distributions such that $\beta_n^* = \beta_0$. The idea behind this change is to test deviations from the exact Pareto distribution only for large x, so that R(x) is small enough under the null hypothesis (under the condition (2), R(x) is always properly bounded for any x). To extend this to the case where τ is unknown, the test statistic should be replaced by an estimate similar to (33), as in the proof of Theorem 3.2.

Experiments on continuous distributions
We consider usual parametric extreme value distributions: This distribution is in S(τ, 1/(2τ), C, C′) for some constants C, C′.
• Cauchy distribution on R. The standard Cauchy distribution $f(x) = \frac{1}{\pi}\frac{1}{1+x^2}$ is included in S(1, 2, C, C′) for some constants C, C′.
The last two distributions are symmetric about 0 and defined on R, so we consider the absolute value of the samples so that Hill's estimate with a large sample fraction still exists (τ and β do not change).
We compute confidence sets around 1/τ (not around τ), since most empirical studies in the literature focus on 1/τ, which enables us to compare more easily with these results. We follow Algorithm 1 for computing $1/\hat\tau_i$ for each $\beta_i$ (using Hill's estimate with the $\lfloor n^{2\beta_i/(2\beta_i+1)} \rfloor$ largest samples). For the estimation of 1/τ, the corresponding $c_1(\alpha, i)$ is $q_{1-\alpha/2}/\hat\tau_i$, and not $q_{1-\alpha/2}\hat\tau_i$ (see Cheng and Peng, 2001; Haeusler and Segers, 2007). We fix C′ according to the heuristic discussed in Section 4, and consider [b, B] = [0.5, 10], ξ = log(n)/95 and α = 0.05. We denote our confidence interval as AdapCI.
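The Hill step used above can be sketched as follows. This is a minimal illustration and not Algorithm 1 itself: the function names `sample_number`, `hill_inverse_tau` and `exact_pareto_sample` are hypothetical, and the toy data are drawn from an exact Pareto distribution with τ = 2, which is an assumption made only to have something to run.

```python
import math
import random


def sample_number(n, beta):
    """Number of upper order statistics, floor(n^(2*beta/(2*beta+1)))."""
    return math.floor(n ** (2.0 * beta / (2.0 * beta + 1.0)))


def hill_inverse_tau(samples, k):
    """Hill's estimate of 1/tau from the k largest of the positive samples."""
    n = len(samples)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    ordered = sorted(samples, reverse=True)
    threshold = ordered[k]  # the (k+1)-th largest observation
    return sum(math.log(x / threshold) for x in ordered[:k]) / k


def exact_pareto_sample(n, tau, rng):
    """Toy data (an assumption): P(X > x) = x^(-tau) for x >= 1."""
    return [(1.0 - rng.random()) ** (-1.0 / tau) for _ in range(n)]


rng = random.Random(0)
data = exact_pareto_sample(5000, tau=2.0, rng=rng)
k = sample_number(len(data), beta=1.0)
estimate = hill_inverse_tau(data, k)  # estimates 1/tau = 0.5 on this toy data
```

In the adaptive procedure one such estimate is computed for each candidate $\beta_i$; the sketch only shows a single value of β.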
The other methods we compare against are derived from Haeusler and Segers (2007) and Cheng and Peng (2001). We first estimate β by $\hat\beta$ as discussed in Cheng and Peng (2001, §3). We then use it to compute the number $\hat k$ of largest samples that we will consider. We consider three different values $\hat k^*$, $\tilde k$, $k_{CP}$ of $\hat k$. We first consider the optimal estimated sample number $\hat k^* := \lfloor n^{2\hat\beta/(2\hat\beta+1)} \rfloor$. This sample fraction should provide the best results in terms of estimating 1/τ. Second, we use a sample number that is $o(\hat k^*)$, namely $\tilde k := \lfloor \hat k^* / \sqrt{\log n} \rfloor$. The rationale behind this heuristic is that the asymptotic normality of Hill's estimate with known constants (i.e. with negligible bias) holds if and only if the sample number is $o(n^{2\beta/(2\beta+1)})$, and in this case exact asymptotic coverage of the confidence interval can be achieved. Third, we use $k_{CP}$ suggested in Cheng and Peng (2001, §3). The idea behind this heuristic is to provide a coverage that is as close to the theoretical coverage 1 − α as possible.
The confidence interval will then be computed using the $\hat k$ largest samples, according to two methods discussed in Haeusler and Segers (2007): the Wald and score type confidence intervals. A first step is to estimate 1/τ by Hill's estimate $1/\hat\tau(\hat k)$ with the sample number $\hat k$. The confidence intervals will then be centred on this estimate. The Wald type confidence interval is given by (Haeusler and Segers, 2007, p. 177) $\left( (1 - \hat k^{-1/2} q_{1-\alpha/2})\, 1/\hat\tau(\hat k),\ (1 + \hat k^{-1/2} q_{1-\alpha/2})\, 1/\hat\tau(\hat k) \right)$. The score type confidence interval is obtained by We denote these confidence intervals $W_{\hat k^*}$, $W_{\tilde k}$, $W_{CP}$ for the Wald method, and similarly $S_{\hat k^*}$, $S_{\tilde k}$, $S_{CP}$ for the score method.
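The Wald type interval above is simple enough to sketch in a few lines. The function name `wald_interval` is hypothetical, and the quantile $q_{1-\alpha/2}$ is taken from the standard normal distribution, as in the definition of q used in the proofs; the score type interval is not sketched because its formula is not reproduced here.

```python
import math
from statistics import NormalDist


def wald_interval(hill_estimate, k, alpha=0.05):
    """Wald type interval for 1/tau centred on Hill's estimate.

    hill_estimate: Hill's estimate of 1/tau computed from the k largest samples.
    Returns ((1 - q/sqrt(k)) * estimate, (1 + q/sqrt(k)) * estimate).
    """
    q = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # standard normal quantile
    half_width = q / math.sqrt(k)
    return (1.0 - half_width) * hill_estimate, (1.0 + half_width) * hill_estimate


low, high = wald_interval(0.5, 400)
```

The interval is multiplicative in the estimate, so its length scales with the estimated value of 1/τ as well as with $k^{-1/2}$.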
We iterate these procedures 100 times, and compute the number of times that the obtained confidence intervals contain the true 1/τ (coverage) and the average length of the intervals (size). In order to compare these 7 methods, both coverage and size have to be taken into account, and the ranking of the methods we analyse is not straightforward. A good confidence interval should have high coverage and small size, but there is a trade-off between these two quantities. Although our focus is on confidence intervals, for the reader's interest we also provide the mean values and MSEs of the estimates over the 100 iterations (see Tables 5, 6, 7 and 8).
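The coverage and size computation described above can be sketched as a small Monte Carlo loop. This is only an illustration under stated assumptions: the names `hill_wald_interval` and `coverage_and_size` are hypothetical, the data are exact Pareto with τ = 2, and a fixed sample number k = 100 is used instead of an estimated one.

```python
import math
import random
from statistics import NormalDist


def hill_wald_interval(samples, k, alpha=0.05):
    """Wald type interval for 1/tau built on Hill's estimate (k largest samples)."""
    ordered = sorted(samples, reverse=True)
    hill = sum(math.log(x / ordered[k]) for x in ordered[:k]) / k
    half = NormalDist().inv_cdf(1.0 - alpha / 2.0) / math.sqrt(k)
    return (1.0 - half) * hill, (1.0 + half) * hill


def coverage_and_size(interval_fn, draw_sample, true_value, repetitions=100):
    """Fraction of intervals containing true_value, and their average length."""
    hits, total_length = 0, 0.0
    for _ in range(repetitions):
        low, high = interval_fn(draw_sample())
        hits += low <= true_value <= high
        total_length += high - low
    return hits / repetitions, total_length / repetitions


rng = random.Random(1)
tau, n, k = 2.0, 1000, 100  # toy setting, an assumption


def draw():
    # exact Pareto sample: P(X > x) = x^(-tau) for x >= 1
    return [(1.0 - rng.random()) ** (-1.0 / tau) for _ in range(n)]


coverage, size = coverage_and_size(lambda xs: hill_wald_interval(xs, k), draw, 1.0 / tau)
```

Any of the seven methods can be plugged in as `interval_fn`, which keeps the comparison of coverage and size on the same simulated samples straightforward.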
A natural competitor would also be the method presented in Gomes et al. (2012, Table 5), since it is the only method (to the best of our knowledge) that is proven to be adaptive to the second order parameter. The results of Haeusler and Segers (2007) and Cheng and Peng (2001) are proven for a fixed sample fraction, and some fixed oracle sample fractions are discussed. They are however not proven for an adaptive sample fraction, and only heuristics with adaptive sample fractions are proposed in these papers. However, Gomes et al. (2012) describe their method as a "terribly time-consuming algorithm", and they display computational results with size and coverage of the confidence set only for a Student distribution of parameter 2, for n ∈ {100, 200, 1000} (see Table 5 in Gomes et al., 2012). Moreover, we found that our method, as well as the score and Wald methods, gives much better results in this case, simultaneously in terms of size and coverage (see Table 2). Thus we do not implement the method of Gomes et al. (2012) as a competitor in our experiments, but we can still compare the results for the Student distribution of parameter 2.
We provide the simulation results on the coverage and size of the confidence intervals in Tables 1 (Pareto), 2 (Student), 3 (Fréchet) and 4 (Cauchy). We can see that our adaptive method AdapCI gives fairly stable and small confidence intervals in terms of both coverage and size, and is particularly efficient for small sample sizes. The Wald method also provides good results, both in terms of size and coverage, in particular with $\hat k^*$ samples (which provides the smallest confidence interval, with almost always good coverage). Our method is in most cases comparable to the associated method $W_{\hat k^*}$. In contrast, the score method often gives too wide a confidence interval for small sample sizes n = 100, 200. For the case τ = 2 in Table 2, we can compare our result with Gomes et al. (2012, Table 5). The results in Gomes et al. (2012, Table 5) are almost always worse, both in terms of coverage and size, than the results of Table 2.

Experiments on discretized distributions
As we claim in Subsection 4.2, our model contains discretized Pareto distributions which are not contained in the usual models previously considered. In this Subsection, we thus consider the model (16) $F : x \in [1, \infty) \to 1 - \lfloor x \rfloor^{-\tau}$. As discussed in Subsection 4.2, F ∈ S(τ, 1/τ, 1, C′) for C′ large enough depending on τ. We perform the experiments for the seven methods discussed in the last subsection, for sample sizes $n \in \{100, 200, 1000, 10^4, 10^5\}$, and for τ ∈ {1, 2}. Table 9 shows the results for this class of discretized distributions. All methods perform correctly in the case τ = 1. However, for τ = 2, the coverage probability of all the Wald and score methods is very small for large sample sizes $n \in \{10^4, 10^5\}$. This comes from the fact that these methods over-estimate β (which is 1/τ = 1/2 in this case), so that the confidence interval is too small to guarantee a good coverage. This problem is more acute for τ = 2 than for τ = 1 since β = 1/τ is smaller for τ = 2. Our method AdapCI, on the other hand, detects that the complexity of this model is higher than in the case of the exact Pareto distribution, and increases the size of the confidence intervals, which guarantees a good coverage for the resulting confidence interval.
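Samples from the discretized model (16) can be generated by rounding up a continuous Pareto draw. The sketch below is one possible way to do this, not necessarily the one used for Table 9; the function names are hypothetical. It relies on the fact that if $P(Y > y) = y^{-\tau}$ for y ≥ 1, then $P(\lceil Y \rceil > x) = P(Y > \lfloor x \rfloor) = \lfloor x \rfloor^{-\tau}$.

```python
import math
import random


def discretized_pareto_sample(n, tau, rng):
    """Draw n points from F(x) = 1 - floor(x)^(-tau), x >= 1."""
    draws = []
    for _ in range(n):
        u = 1.0 - rng.random()   # uniform on (0, 1]
        y = u ** (-1.0 / tau)    # continuous Pareto: P(Y > y) = y^(-tau)
        # ceil(Y) has distribution F; the max guards the null event Y == 1
        draws.append(max(2, math.ceil(y)))
    return draws


def empirical_tail(samples, x):
    """Fraction of samples strictly larger than x."""
    return sum(1 for s in samples if s > x) / len(samples)


rng = random.Random(2)
xs = discretized_pareto_sample(20000, tau=1.0, rng=rng)
```

The resulting variable takes values in {2, 3, ...}, so the tail is a step function that touches the exact Pareto tail only at the integers, which is the source of the second order term with β = 1/τ discussed above.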

Proof of the upper bound in Theorem 3.1 (Proof of [A.])
In this Subsection, we write for simplicity $S_0 = S_0(\tau, C)$, $S_1 = S_1(\tau, C)$ and $\tilde S_1 = \tilde S_1(\tau, C, \rho_n)$. Let $X_1, \dots, X_n$ be an i.i.d. random sample from a distribution $F \in S_1$. We write, for any $x \in \mathbb{R}_+$, $p_x := P(X > x) = 1 - F(x)$, and its empirical estimate that we define for all rational x (we write Q for the rationals) larger than 0 $\hat p_x$ For non-rational x, we set $\hat p_x = \lim_{y \in Q,\, y > x,\, y \to x} \hat p_y$. We propose the following test statistic The test is of the form where $\rho_n \ge \max(4D\log(1/\alpha), 2C')\, n^{-\beta_1/(2\beta_1+1)}$ with the constant D of Lemma 6.2. We reject the null if $\Psi_n = 1$, and do not reject it otherwise.
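As an illustration of the empirical tail used in this proof, the sketch below assumes the natural definition $\hat p_x = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i > x\}$; this formula is an assumption consistent with $p_x = P(X > x)$, and the class name `EmpiricalTail` is hypothetical. The test statistic itself is not sketched.

```python
from bisect import bisect_right


class EmpiricalTail:
    """Empirical estimate of p_x = P(X > x) from a fixed sample."""

    def __init__(self, samples):
        self._sorted = sorted(samples)
        self._n = len(self._sorted)

    def __call__(self, x):
        # bisect_right counts the observations <= x, so the remainder
        # is the number of observations strictly larger than x
        return (self._n - bisect_right(self._sorted, x)) / self._n


p_hat = EmpiricalTail([1, 2, 3, 4])
```

With this definition the map x ↦ $\hat p_x$ is a right-continuous step function, so its values at non-rational x coincide with the limit from the right along the rationals used above.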
The results in Lemma 6.1 show that the test statistic $T_n$ is a reasonable criterion for this testing problem. Lemma 6.2 proves that the difference between the empirical estimate and the true probability is uniformly well controlled.
Lemma 6.2. Suppose we have an iid sample from F ∈ S 1 . With probability larger than 1 − α, we have where D is some constant that depends only on β 1 , on a lower bound on τ and on an upper bound on C, C ′ . Now, we combine the results obtained in Lemma 6.1 and 6.2 by considering two hypotheses separately. Let α > 0.
Under H 0 : We obtain that with probability larger than 1 − α Under H 1 : We obtain that with probability larger than 1 − α This concludes the proof of the upper bound in Theorem 3.1.
Proof of Lemma 6.1. (i) is clear by definition of H 0 , but proof of (ii) is more involved. First, we define three regions Then for any F ∈ H 1 , we defineF as follows, We haveF ∈ S 0 by definition.
By definition ofF , we have But then, by the fact that F ∈ S 1 , by using the upper bound for x τ p x ≤ C + C ′ x −τ β1 and the lower bound for (19) can be upper bounded as follows, Recall that by definition ofS(β 1 , ρ n ), for any F 0 ∈ S 0 , there exists x 0 such that Thus, we can restrict the set of x 0 such that SinceF ∈ S 0 , this implies that there exists x 0 ≤ n 1/(τ (2β1+1) such that (19), it is clear that the maximum point should be either in R + or R − . Either way, we have found that there exists x 0 such that

Combining this with Equation
This concludes the proof of Lemma 6.1.
Let σ ≤ 1/2, and V be any two numbers satisfying Then for all t ≥ 0, Here we set f ( e −kτ , 1/4). The following result is obtained by applying the last theorem to the class F , and by rescaling.
Since the function F is cadlag, and since the rationals are dense in the real line, by construction ofp x , the last inequality implies the following corollary.
Proof. For any x ∈ Q (and in I k ), we have by definition ofp x and since F (and thus p) is cadlag This above equality implies both , nt .
This concludes the proof.

In order to use Corollary 6.1, we want to bound µ k , and the following result proves the upper bound for µ k . Lemma 6.3. There exists a constant D 1 that depends only on β 1 , on a lower bound on τ and on an upper bound on C, C ′ , and that is such that Let k be such that δ k > 0 and δ k ≤ exp(−D 2 1 /12). By plugging the result of Lemma 6.3 into Corollary 6.1, with t = 12 where D 2 = max(12, D 1 ). By multiplying by x τ (since on I k , x τ ≤ e (k+1)τ ) in the probability in (20), we have , where E = K k=1 exp(− exp((K − k)τ )) < ∞ and depends on τ only (since it is a hypergeometric sum). Plugging this δ k in the last inequality yields which implies by definition of K and by denoting ζ := n − β 1 By combining the last equation for all k = 1, . . . , K, we obtain This concludes the proof of Lemma 6.2.
Proof of Lemma 6.3. Here we use the same notation for B, K, I k used in the proof of Lemma 6.2. Let k ≤ K. Consider the grid of points of I k , that we write χ k := (x 1 , . . . , x i , . . . , x Υ k −1 ), that are rationals, and that are such that p 1 = e k (or arbitrarily close rational to e k ) and otherwise p xi − p xi+1 = c/n, (or arbitrarily close rationals that verify this) until we reach e k+1 , and where c is the smallest constant larger than 1 such that log(Υ k ) is an integer.
Step 1. We claim that for any $(x, y) \in \chi_k^2$ such that x ≤ y, the tail probability of $\hat p_x - \hat p_y$ is upper bounded by a sub-exponential bound plus a sub-Gaussian bound with a distance function $d(x, y) = (p_x - p_y)/n$. That is, by denoting $U_{x,y} = \hat p_x - \hat p_y - (p_x - p_y)$, Note that $\hat p_x - \hat p_y = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{x \le X_i \le y\}$ is 1/n times a Binomial random variable with parameters $(n, p_x - p_y)$, where $p_x - p_y \ge 1/n$ since x, y are points of $\chi_k$. Then, by Bernstein's inequality, This implies (since $p_x - p_y \ge 1/n$) Thus, the claim is proved.
Step 2. In this step, we bound the expectation of the supremum when x and y range over m possible values $x_j$ and $y_j$ in $\chi_k$, that is, $E \sup_{j \le m} |\hat p_{x_j} - \hat p_{y_j} - (p_{x_j} - p_{y_j})|$ such that $p_x - p_y \le d^2$.
Equation (21) implies in particular that for (x, y) ∈ χ 2 k , we can express U x,y as a sum of a sub-Gaussian random variable U x,y plus a sub-exponential random variable V x,y , i.e. U x,y = U x,y + V x,y and that are such that where · Ψ1 and · Ψ2 are the Orlicz norms 1 and 2. Consider m pairs (x j , y j ) ∈ χ 2 k such that p x − p y ≤ d 2 . By definition of the Orlicz norms, This implies that These two equations give the following results, using EX = 1 D 4 (log(m) + 1) du = D 4 (log(m) + 1) 2d 2 n < 1.5D 4 (log(m) + 1) d 2 n .
Combining the above ideas, we have Step 3. Now, we bound E sup x∈χ k |p x − p x | using the results in Step 2 by a chaining argument. For any 1 ≤ i ≤ log 2 (Υ k ), we define chaining set A i by a sequence of finite subsets A 1 contains only one element (e.g. if Υ k /2 is an integer, A 1 = {x Υ k /2 }), and the cardinality of A i is 2 i . Note that A i ⊆ χ k , and the last set A log 2 (Υ k ) becomes {x 1 , . . . , x Υ k −1 } =: χ k . Also by definition of these sets, for any point x ∈ χ k , there exists a chain (y 1 , . . . , y Υ k ) such that y i ∈ A i and and such that x = y Υ k . Note that given y i+1 , there is only two choices of y i which are possible (because of the previous equation). Let us write B i+1 for such possible pairs (i.e. that are at a distance less than |p e k − p e k+1 |/2 i+1 ), and by definition there are less than 2 × 2 i+1 such pairs. By using the triangle inequality, Using Equation (22) on B i+1 which contains at most 2 × 2 i+1 pairs satisfying Also since there is only one element in A 1 , we have By plugging the above both equations in the chaining equation, we obtain Step 4. In this final step, we extend the above inequality to any x ∈ I k not necessarily on the grid point.
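Note that for any $x \in I_k$, there exist $(\underline{x}, \bar{x}) \in \chi_k^2$ such that $\underline{x} \le x \le \bar{x}$, $p_{\bar{x}} \le p_x \le p_{\underline{x}}$ and $\hat p_{\bar{x}} \le \hat p_x \le \hat p_{\underline{x}}$, where $p_{\underline{x}} - p_{\bar{x}} = c/n$. Then for any $x \in I_k$, where the last inequality follows since for k ≤ K, we know that $\exp(-k\tau) \ge \exp(-K\tau) = n^{-1/(2\beta_1+1)} \ge 1/n$. This concludes the proof.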

Proof of Theorem 3.2
Let $X_1, \dots, X_n$ be an i.i.d. random sample from a distribution $F \in S_1$. Let $\hat\tau$ be an estimator of τ such that for any $\tau \in I_1$, we have with probability where $c_1$ is a function defined on (0, 1). For instance, Theorem 1 in Cheng and Peng (2001) implies that with the Hill estimator $\hat\tau_H$, we can choose (asymptotically) $c_1(\eta)$ as $q_{1-\eta/2}\hat\tau_H$, where $q_{1-\alpha/2}$ is such that $P(|N(0, 1)| \ge q_{1-\alpha/2}) = \alpha$ (where N(0, 1) is the standard Gaussian distribution). See also Theorem 3.6 and Remark 3.7 of Carpentier and Kim (2013) for another estimator for which $c_1(\eta) \sim \log(1/\eta)$ is well defined for finite n. We also define $\hat C$ as an estimator of C such that for any $C \in I_2$, we have with probability at least 1 − η where $c_2$ is a function defined on (0, 1). For instance, we can define $\hat C$ as follows, From (23), for a sufficiently large n (such that $2\log(n)\, n^{-\beta_1/(2\beta_1+1)} c_1(\eta)/\tau \le 1/2$ for any $\tau \in I_1$), we know with probability 1 − η,

Proof of Theorem 3.3
Proof of [A.] We use the same η, test statistic $T_n$ and test $\Psi_n$ as in the previous section.
The third condition in Definition 2.2 is shown using the last calculation in Thus, we have proved the existence of an adaptive and uniform confidence interval for $P_n$ (by checking the two conditions (5) and (6)).

Proof of [B.]
The proof relies on a lower bound construction similar to those previously considered in Drees (2001), Novak (2014) and Carpentier and Kim (2013).
Let n ≥ 2. Let τ > 0, υ > 0, β 0 > β 1 > 0, and we define B = n 2β 1 +1 . Then we consider the Pareto with τ parameter as F 0 such as 1 − F 0 (x) = x −τ , and for F 1 we perturb the tail larger than B so that it has heavier tail. That is, we let Then it is known (as proved in Drees, 2001;Novak, 2014) that there exists no δ−uniformly consistent test for distinguishing between F 0 and F 1 whenever n is large enough, for small enough υ. Now, we use a contradiction to prove our claim. Suppose 0 < α < 1/3 and 3α = δ. Assume that there exists an α-uniform and adaptive confidence interval C n for the first order parameter when P n = {F 0 , F 1 }. Then we consider the test Ψ n such that Then since C n is uniform and adaptive, we have This implies that Ψ n is 3α uniformly consistent for P n = {F 0 , F 1 }. This contradicts the fact that no δ−uniformly consistent test exists. This concludes the proof.

Proof of the lower bound in Theorem 3.1 (Proof of [B.])
Here, we prove the lower bound by constructing two distributions $F_0$ and $F_1$ in the model with the specific $\rho_n \sim n^{-\beta_1/(2\beta_1+1)}$, and by proving that the distance between $F_0^n$ and $F_1^n$ is small enough that the two are not distinguishable as n → ∞ (so that no α-uniformly consistent test exists).
Let τ, β 1 > 0. Let F 0 be the distribution such that for any x ≥ 1, we have Note that F 0 ∈ S(τ, ∞, 1, 0). Let υ > 0 be a small constant. Now, we construct another continuous distribution F 1 . Let B = n 1/(τ (2β1+1)) , t = υB −τ β1 = υn −β1/(2β1+1) , and let C ′2 ≤ 2β1+1 (β1+1) 2 1 3τ 2 . Also we suppose that n is large enough such that t ≤ min( √ 3υτ 2 √ 2β1+1 , τ 4 ). Then, consider B 1 such that B < B 1 = (1 +C)B wherẽ C > 0 (later it will be chosen as the smallestC such that F 1 is continuous). More precisely, As we can see in the definition (35), F 1 is defined to be slightly perturbed distribution from F 0 such that it is exactly Pareto with parameter τ on the region x ≤ B, and it attains the upper bound for the second order Pareto tails after B 1 , but in the middle region B ≤ x ≤ B 1 it only satisfies exactly Pareto with parameter τ − t.
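The two expressions given for t above agree because $B^{-\tau\beta_1} = n^{-\tau\beta_1/(\tau(2\beta_1+1))} = n^{-\beta_1/(2\beta_1+1)}$. A quick numerical check of this identity is sketched below; the function name `lower_bound_scales` and the numerical values of n, τ, β_1 and υ are arbitrary choices made for illustration only.

```python
import math


def lower_bound_scales(n, tau, beta1, upsilon):
    """Threshold B = n^(1/(tau*(2*beta1+1))) and perturbation t = upsilon * B^(-tau*beta1)."""
    B = n ** (1.0 / (tau * (2.0 * beta1 + 1.0)))
    t = upsilon * B ** (-tau * beta1)
    return B, t


n, tau, beta1, upsilon = 10**6, 2.0, 0.5, 0.1  # illustrative values, an assumption
B, t = lower_bound_scales(n, tau, beta1, upsilon)

# Direct form of the same perturbation size, upsilon * n^(-beta1/(2*beta1+1))
t_direct = upsilon * n ** (-beta1 / (2.0 * beta1 + 1.0))
```

The perturbation of the tail index is therefore of the same order as the separation rate $\rho_n$, which is what makes the two distributions hard to distinguish from n samples.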
Equivalently, we are testing the following two hypotheses, and show that there does not exist a uniformly consistent test. Let $\beta_0 > \beta_1$. By definition, $F_0 \in S(\tau, \beta_0, 1, C')$ (since it is exactly Pareto).
Step 1. Checking that F 1 ∈ S(τ, β 1 , 1, C ′ ). Clearly, we only need to check the second order Pareto condition for the region {x : B < x < B 1 }; we need to show that Trivially the inequality is true when x = B. Also, since the LHS is an increasing function of x while the RHS is a decreasing function of x, we verify the claim F 1 ∈ S(τ, β 1 , 1, C ′ ) by choosing B 1 = B + u such that Step 2. Range of B 1 . For convenience, we let u =:CB (withC > 0). Then from (36), which gives the upper bound Then, B 1 = (1 +C)B ≤ exp(C ′ /υ)B.
Step 3. Checking that Separation condition is verified. First, we claim that Indeed, since the constructed F 1 is a continuous function, we just need to check the kink point B 1 . Note that where the last inequality is followed by definition of B and the upper bound υ . This implies that there exists a point x 0 such that Consider a function F ∈ S(τ, β 0 , 1, C ′ ). We know that this function is such that This implies in particular that for n large enough. This implies that F 1 − S(τ, β 1 , 1, C ′ ) ∞,τ ≥ (M/4)n − β 1 2β 1 +1 , so F 1 belongs to the separated setS 1 (τ, β 1 , C, C ′ , ρ n ) with ρ n = M/4n − β 1 2β 1 +1 .
Moreover, for any $i \ge i^*$, we know that This implies by Theorem 3.2 that for any $M_n - 2 \ge i \ge i^*$, the i-th test is $\frac{4\alpha}{9M_n}$-consistent. More precisely, for any $M_n - 2 \ge i \ge i^*$, with probability larger than $1 - \frac{4\alpha}{9M_n}$, $\Psi_n(i) = 0$ (see the proof of Theorem 3.2). By a union bound and by the definition of $\hat i$, with probability larger than 1 − α, either $\hat i = i^*$ or $\hat i = i^* + 1$.
This implies by Theorem 3.2 that for any $M_n - 2 \ge i \ge 0$, test i is $\frac{4\alpha}{9M_n}$-consistent. More precisely, for any $M_n - 2 \ge i \ge i^*$, with probability larger than $1 - \frac{4\alpha}{9M_n}$, we have $\Psi_n(i) = 0$. By a union bound and the definition of $\hat i$, with probability larger than 1 − α, $\hat i = i^* = 0$.

$\frac{B-b}{M_n}$ by definition of $i^*$, together with the previous equation, we have that with probability larger than 1 − α, for n large enough so that log(n) ≥ 2ξ. The constants in the bound are independent of F, so the result holds uniformly over $P_n$.

Proof of Lemma 4.1
Since F ∈ H, we have Let 0 ≤ u < t and k ∈ {a j , j ∈ N} be large compared to t. We have by definition ofF followed by Taylor expansion, This implies in particular that β * (F ) = min(β, 1/τ ), and that F ∈ S(τ, min(β, 1/τ ), C, C ′ ), where C ′ is a large enough constant.